Question:

How to use CUDA to parallelize nested for loops that perform a computation on a 2D array

景修杰

2023-03-14

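I am doing some research and am a beginner with CUDA. I work in C and C++, the base languages that NVIDIA's CUDA is compatible with. For the past week I have been trying to get any speedup at all by integrating CUDA into my C++ code.

In addition, the CUDA implementation is slower than the plain non-CUDA version.

Below is the function from which I call the kernel. Essentially, I moved the computation that used to live in this function into the kernel so that it could be parallelized.

//compute the distances between the inputs
void computeInput(int vectorNumber, double *dist, double **weight){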

double *d_dist, **d_weight;


//cout << "Dist[0] Before: " << dist[0] << endl;

cudaMalloc(&d_dist, maxClusters * sizeof(double));
cudaMalloc(&d_weight, maxClusters * vector_length * sizeof(double));

//  cout << "Memory Allocated" << endl;

//copy variables from host machine running on CPU to Kernel running on GPU
cudaMemcpy(d_dist, dist, maxClusters * sizeof(double), cudaMemcpyHostToDevice);
cudaMemcpy(d_weight, weight, maxClusters * vector_length * sizeof(double), cudaMemcpyHostToDevice);

//  cout << "Variables copied to GPU Device." << endl;

//kernel currently being run with 1 blocks with 4 threads for each block.
//right now only a single loop is parallelized, I need to parallelize each loop individually or 2d arrays individually.
dim3 blocks(8,8);
dim3 grid(1, 1);
threadedInput<<<grid,blocks>>>(vectorNumber, d_dist, d_weight);

//  cout << "Kernel Run." << endl;  

//Waits for the GPU to finish computations
cudaDeviceSynchronize();

//cout << "Weight[0][0] : " << weight[0][0];

//copy back variable from kernel space on GPU to host on CPU into variable weight
cudaMemcpy(weight, d_weight, maxClusters * vector_length * sizeof(double), cudaMemcpyDeviceToHost);
cudaMemcpy(dist, d_dist, maxClusters * sizeof(double), cudaMemcpyDeviceToHost);
//  cout << "GPU Memory Copied back to Host" << endl;

cout << "Dist[0] After: " << dist[0] << endl;

cudaFree(d_dist);
cudaFree(d_weight);

//cout << " Cuda Memory Freed" << endl;
}
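A side note on the copies above, offered as a sketch rather than a diagnosis of the asker's actual setup: if weight really is an array of row pointers, as the double** signature suggests, then cudaMemcpy(d_weight, weight, ...) copies bytes starting at the host pointer table rather than the numbers the rows hold, and weight[i][j] in the kernel would dereference host addresses on the device. One common workaround is to pack the rows into a single contiguous row-major buffer first. The helper name flattenRows and its parameters are hypothetical and do not appear in the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the original post): copies a matrix stored
// as an array of row pointers into one contiguous row-major buffer, so that
// element (i, j) ends up at flat[i * cols + j].
std::vector<double> flattenRows(double* const* rows,
                                std::size_t numRows,
                                std::size_t cols)
{
    assert(rows != nullptr || numRows == 0);
    std::vector<double> flat(numRows * cols);
    for (std::size_t i = 0; i < numRows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            flat[i * cols + j] = rows[i][j];
        }
    }
    return flat;
}
```

Under these assumptions, flat.data() could be handed to a single cudaMemcpy of maxClusters * vector_length * sizeof(double) bytes, and the kernel would take a plain double* and read weight[i * vector_length + j].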

Below is the kernel function. It uses the weights on the nodes to compute the distances.

__global__ void threadedInput(int vecNum, double *dist, double **weight)
{
int tests[vectors][vector_length] = {{0, 1, 1, 0},
                                     {1, 0, 0, 1},
                                     {0, 1, 0, 1},
                                     {1, 0, 1, 0}};
dist[0] = 0.0;
dist[1] = 0.0;
int indexX,indexY, incrX, incrY;
indexX = blockIdx.x * blockDim.x + threadIdx.x;
indexY = blockIdx.y * blockDim.y + threadIdx.y;
incrX = blockDim.x * gridDim.x; 
incrY = blockDim.y * gridDim.y; 

for(int i = indexY; i <= (maxClusters - 1); i+=incrY)
{
    for(int j = indexX; j <= (vectors - 1); j+= incrX)
    {       
        dist[i] += pow((weight[i][j] - tests[vecNum][j]), 2);
    }// end inner for
}// end outer for

}// end CUDA-kernel
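For comparison, here is a sequential C++ sketch of what the kernel above appears to be computing: for each cluster i, the sum of squared differences between its weight row and the selected test vector. It assumes the weights are stored row-major in one flat array and that the inner loop is meant to run over the vector components; the function name squaredDistances is hypothetical. It also points at a likely source of wrong sums in the kernel as posted: several threads add into the same dist[i] without synchronization, and every thread resets dist[0] and dist[1]. Accumulating into a private variable and writing dist[i] once, as below, is what a one-thread-per-cluster kernel could do.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential reference (assumed intent of the kernel): for every cluster i,
// dist[i] = sum over j of (weight[i][j] - test[j])^2.
// `weight` is row-major: element (i, j) is weight[i * len + j].
std::vector<double> squaredDistances(const std::vector<double>& weight,
                                     const std::vector<int>& test,
                                     std::size_t numClusters)
{
    const std::size_t len = test.size();
    assert(weight.size() == numClusters * len);

    std::vector<double> dist(numClusters, 0.0);
    for (std::size_t i = 0; i < numClusters; ++i) {
        double sum = 0.0;  // private accumulator, no shared writes in the loop
        for (std::size_t j = 0; j < len; ++j) {
            const double d = weight[i * len + j] - test[j];
            sum += d * d;
        }
        dist[i] = sum;     // exactly one write per cluster
    }
    return dist;
}
```

In a CUDA version along these lines, the outer loop index i would come from the thread's global index and the body of the outer loop would become the kernel body.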

Actual output:

Clusters for training input:

Vector (1, 0, 1, 0, ) Place in Bin 0

Vector (1, 1, 1, 0, ) Place in Bin 0

Vector (0, 1, 1, 1, ) Place in Bin 0

Vector (1, 1, 0, 0, ) Place in Bin 0

Weights for Node 0 connections:
0.74753098, 0.75753881, 0.74233157, 0.25246902, 

Weights for Node 1 connections:
0.00000000, 0.00000000, 0.00000000, 0.00000000, 

Categorized test input:

Vector (0, 1, 1, 0, ) Place in Bin 0

Vector (1, 0, 0, 1, ) Place in Bin 0

Vector (0, 1, 0, 1, ) Place in Bin 0

Vector (1, 0, 1, 0, ) Place in Bin 0
Time Ran: 0.96623900

Expected output (except that the expected time should be at least 50% faster):

Clusters for training input:

Vector (1, 0, 1, 0, ) Place in Bin 0

Vector (1, 1, 1, 0, ) Place in Bin 1

Vector (0, 1, 1, 1, ) Place in Bin 0

Vector (1, 1, 0, 0, ) Place in Bin 1

Weights for Node 0 connections:
0.74620975, 0.75889148, 0.74351981, 0.25379025, 

Weights for Node 1 connections:
0.75368531, 0.75637331, 0.74105526, 0.24631469, 

Categorized test input:

Vector (0, 1, 1, 0, ) Place in Bin 0

Vector (1, 0, 0, 1, ) Place in Bin 1

Vector (0, 1, 0, 1, ) Place in Bin 0

Vector (1, 0, 1, 0, ) Place in Bin 1
Time Ran: 0.00033100

1 answer

皇甫乐

2023-03-14

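You should read some tutorials, starting with: https://devblogs.nvidia.com/easy-introduction-cuda-c-and-c/

Basically, every thread executes the kernel code, so in the simplest pattern the kernel contains no loop over the elements: each thread handles the one element selected by its own index.

Quoting from that tutorial: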

__global__
void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

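int i = blockDim.x * blockIdx.x + threadIdx.x

generates a global index that is used to access elements of the arrays. We did not use it in this example, but there is also gridDim, which contains the dimensions of the grid as specified in the first execution configuration parameter of the launch.

Before this index is used to access array elements, its value is checked against the number of elements, n, to ensure there are no out-of-bounds memory accesses. This check is required when the number of elements in an array is not evenly divisible by the thread block size, so that the kernel launches more threads than there are array elements. The second line of the kernel performs the element-wise work of SAXPY, and apart from the bounds check it is identical to the inner loop of a host implementation of SAXPY.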

if (i < n) y[i] = a*x[i] + y[i];
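To make the bounds check concrete, here is a small host-side C++ sketch (not CUDA code, and not from the tutorial) that imitates the index arithmetic of a 1D launch. The names blocksFor and countActiveThreads are hypothetical. It rounds the block count up so that every element gets a thread, then counts how many of the launched threads pass the i < n test.

```cpp
#include <cassert>

// Number of blocks needed so that blocks * threadsPerBlock >= n
// (integer ceiling division).
int blocksFor(int n, int threadsPerBlock)
{
    assert(n >= 0 && threadsPerBlock > 0);
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Host-side imitation of a guarded kernel body: visits every (block, thread)
// pair that a 1D launch would create and counts the ones that pass `i < n`.
int countActiveThreads(int n, int threadsPerBlock)
{
    const int blocks = blocksFor(n, threadsPerBlock);
    int active = 0;
    for (int block = 0; block < blocks; ++block) {
        for (int thread = 0; thread < threadsPerBlock; ++thread) {
            const int i = block * threadsPerBlock + thread;  // global index
            if (i < n) {
                ++active;  // this thread would do real work
            }
        }
    }
    return active;
}
```

For example, with n = 1000 and 256 threads per block, blocksFor returns 4, so 1024 threads would be launched; 1000 of them pass the check and the remaining 24 do nothing.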
