问题：

位扭曲帮助：扩展位以遵循给定的位掩码

陆雨华

2023-03-14

我对一种快速的“扩展位”方法感兴趣，它可以定义为：

< li >设b为n位二进制数，即{0,1}^n中的b \ < li >设P为B中所有1/真位的位置，即< code>1

举几个例子:

< li >给定B: 0010 1110，A: 0110，则Ap应为0000 1100 < li >给定B: 1001 1001，A: 1101，则Ap应为1001 0001

下面是一个简单的算法，但是我不禁觉得有一种更快/更简单的方法可以做到这一点。

unsigned int expand_bits(unsigned int A, unsigned int B, int n) {
  int k = popcount(B); // cuda function, but there are good methods for this
  unsigned int Ap = 0;
  int j = k-1;
  // Starting at the most significant bit,
  for (int i = n - 1; i >= 0; --i) {
    Ap <<= 1;
    // if B is 1, add the value at A[j] to Ap, decrement j. 
    if (B & (1 << i)) {
      Ap += (A >> j--) & 1;
    }
  }
  return Ap;
}

陆斌

2023-03-14

问题似乎是要求对BMI2指令PDEP进行CUDA仿真，该指令采用源操作数a，并根据掩码b的1位位置存储其位。目前运输的GPU上没有相同或类似操作的硬件支持；也就是说，包括麦克斯韦架构。

基于给出的两个例子，我假设掩码b通常是稀疏的，我们可以通过迭代b的1位来最小化工作。这可能会在GPU上导致不同的分支，但是如果不知道特定的用例，性能的确切权衡是未知的。目前，我假设在掩码b中利用稀疏性对性能的积极影响比发散的消极影响更大。

在下面的仿真代码中，我减少了潜在的“昂贵的”移位操作的使用，而是主要依靠简单的ALU指令。在各种GPU上，执行移位指令的吞吐量低于简单整数运算。我保留了一个移位，离开代码中的关键路径，以避免被算术单元限制执行。如果需要，表达式<代码> 1U

基本思想是依次隔离掩码< code>b的每个1位(从最低有效位开始)，并将其与< code>a的第I位的值合并，并将结果合并到扩展的目的地。使用了来自< code>b的1位后，我们将其从掩码中移除，并进行迭代，直到掩码变为零。

为了避免将a的第i位移动到位，我们只需将其隔离，然后通过简单的否定将其值复制到所有更重要的位，利用两者对整数的补码表示。

/* Emulate PDEP: deposit the bits of 'a' (starting with the least significant 
   bit) at the positions indicated by the set bits of the mask stored in 'b'.
*/
__device__ unsigned int my_pdep (unsigned int a, unsigned int b)
{
    unsigned int l, s, r = 0;
    int i;
    for (i = 0; b; i++) { // iterate over 1-bits in mask, until mask becomes 0
        l = b & (0 - b); // extract mask's least significant 1-bit
        b = b ^ l; // clear mask's least significant 1-bit
        s = 0 - (a & (1U << i)); // spread i-th bit of 'a' to more signif. bits
        r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
    }
    return r;
}

上面提到的不带任何换档操作的变型如下：

/* Emulate PDEP: deposit the bits of 'a' (starting with the least significant 
   bit) at the positions indicated by the set bits of the mask stored in 'b'.
*/
__device__ unsigned int my_pdep (unsigned int a, unsigned int b)
{
    unsigned int l, s, r = 0, m = 1;
    while (b) { // iterate over 1-bits in mask, until mask becomes 0
        l = b & (0 - b); // extract mask's least significant 1-bit
        b = b ^ l; // clear mask's least significant 1-bit
        s = 0 - (a & m); // spread i-th bit of 'a' to more significant bits
        r = r | (l & s); // deposit i-th bit of 'a' at position of mask's 1-bit
        m = m + m; // mask for next bit of 'a'
    }
    return r;
}

在下面的评论中，@Evgeny Kluev指出了chessprogramming网站上的一个无移位< code>PDEP仿真，它看起来可能比我上面的两个实现都要快；似乎值得一试。

位扭曲帮助：扩展位以遵循给定的位掩码

共有1个答案

相关问答

相关文章

相关阅读

相关工具

相关文档