第3章介绍OpenCL - 3.6 OpenCL运行时(例子)

优质

小牛编辑

167浏览

2023-12-01

OpenCL的四种模型在之前的章节中已经全部讨论了，OpenCL通过运行时API让应用开发者了解这些模型。平台模型用来使能一个主机，以及一个或多个设备，让其参与到OpenCL应用的执行中。应用开发者使用编程模型来让OpenCL内核实现其核心计算部分。内核执行时如何获取需要的数据，则有内存模型定义。开发者通过执行模型提交相应的命令到设备端(执行内存搬运或执行内核任务)。本节会将这些内容融合到一个完整的OpenCL应用中。

创建并执行一个简单的OpenCL应用大致需要以下几步：

查询平台和设备信息
创建一个上下文
为每个设备创建一个命令队列
创建一个内存对象(数组)用于存储数据
拷贝输入数据到设备端
使用OpenCL C代码创建并编译出一个程序
从编译好的OpenCL程序中提取内核
执行内核
拷贝输出数据到主机端
释放资源

下面的代码将具体实现以上总结的每一步。OpenCL应用中，大部分的通用代码对OpenCL的执行进行设置，这样就允许跨硬件平台(不同的供应商)架构来执行OpenCL内核。因此，其中的大部分代码可以直接在其他的应用中直接使用，并且可以抽象成用户定义函数。之后的章节，将展示OpenCL C++ API，其冗余要小于C API。

现在我们来讨论逐个步骤。本节之后，将会提供完整的程序代码。

1. 查询平台和设备

OpenCL内核需要执行在设备端，那么就需要至少一个平台和一个设备可以查询。

{%ace edit=false, lang=’c_cpp’%}
cl_int status; // 用于错误检查

// 检索平台的数量
cl_uint numPlatforms = 0;
status = clGetPlatformIDs(0, NULL, &numPlatforms);

// 为每个平台对象分配足够的空间
cl_platform_id platforms = NULL;
platforms = (cl_platform_id )malloc(numPlatforms * sizeof(cl_platform_id));

// 将具体的平台对象填充其中
status = clGetPlatformIDs(numPlatforms, platforms, NULL);

// 检索设备的数量
cl_uint numDevices = 0;
status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, 0, NULL, &numDevices);

// 为每个设备对象分配足够的空间
cl_device_id devices;
devices = (cl_device_id )malloc(numDevices * sizeof(cl_device_id));

// 将具体的设备对象填充其中
status = clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_ALL, numDevices, devices, NULL);
{%endace%}

之后的完整代码中，我们将默认使用首先找到的平台和设备。这样的源码看起来更加简单明了。

2. 创建一个上下文

找到平台和设备之后，就可以在主机端对上下文进行配置。

{%ace edit=false, lang=’c_cpp’%}
// 创建的上下文包含所有找到的设备
cl_context context = clCreateContext(NULL, numDevices, devices, NULL, NULL, &status);
{%endace%}

3. 为每个设备创建一个命令队列

创建完上下文，就要为每个设备创建一个命令队列(每个命令队列只关联其对应的设备)。主机端需要设备端执行的命令将提交到命令队列中，由命令队列管理执行。

{%ace edit=false, lang=’c_cpp’%}
// 为第一个发现的设备创建命令队列
cl_command_queue cmdQueue = clCreateCommandQUeueWithProperties(context, devices[0], 0, &status);
{%endace%}

4. 创建设备数组用于存储数据

创建一个数组需要提供其长度，以及与该数组相关的上下文；该数组能对该上下文所有设备可见。通常，调用者需要提供一些标识，来表明数据是可只读、只写或读写。如果第四个参数传NULL，OpenCL将不会在这步对数组进行初始化。

{%ace edit=false, lang=’c_cpp’%}
// 向量加法的三个向量，2个输入数组和1个输出数组
cl_mem bufA = clCreateBuffer(context, CL_MEM_READ_ONLY, datasize, NULL, &status);
cl_mem bufB = clCreateBuffer(context, CL_MEM_READ_ONLY, datasize, NULL, &status);
cl_mem bufC = clCreateBuffer(context, CL_MEM_WRITE_ONLY, datasize, NULL, &status);
{%endace%}

5. 拷贝输入数据到设备端

下一步就是将主机端指针指向的数组拷贝到设备端。该API需要一个命令队列对象作为参数，所以数据通常都是直接拷贝到设备端。将第三个参数设置为CL_TRUE，我们将等待该API将数据全部拷贝到设备端之后才返回。

{%ace edit=false, lang=’c_cpp’%}
// 将输入数据填充到数组中
status = clEnqueueWriteBuffer(cmdQueue, bufA, CL_TRUE, 0, datasize, A, 0, NULL, NULL);
status = clEnqueueWriteBuffer(cmdQueue, bufB, CL_TRUE, 0, datasize, B, 0, NULL, NULL);
{%endace%}

6. 使用OpenCL C代码创建并编译出一个程序

代码列表3.3中的向量相加内核存储在一个字符数组中，programSource，当可以用来创建程序对象(之后需要编译)。当我们编译一个程序时，我们需要提供目标设备的信息。

{%ace edit=false, lang=’c_cpp’%}
// 使用源码创建程序
cl_program program = clCreateProgramWithSource(context, 1, (const char **)&programSource, NULL, &status);

// 为设备构建(编译)程序
status = clBuildProgram(program, numDevices, devices, NULL, NULL, NULL);
{%endace%}

7. 从编译好的OpenCL程序中提取内核

内核通过提供内核函数名，在程序上进行创建。

{%ace edit=false, lang=’c_cpp’%}
// 创建向量相加内核
cl_kernel kernel = clCreateKernel(program, “vecadd”, &status);
{%endace%}

8. 执行内核

内核创建完毕，数据都已经传输到设备端，数组需要设置到内核的参数上。之后，一条需要执行内核的命令就进入了命令队列。内核的执行方式需要指定的NDRange进行配置。

{%ace edit=false, lang=’c_cpp’%}
// 设置内核参数
status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufB);
status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufC);

// 定义工作项的空间维度和空间大小
// 虽然工作组的设置不是必须的，不过可以设置一下
size_t indexSpaceSize[1],workGroupSize[1];

indexSpaceSize[0] = datasize / sizeof(int);
workGroupSize[0] = 256;

// 通过执行API执行内核
status = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, indexSpaceSize, workGroupSize, 0, NULL, NULL);
{%endace%}

9. 拷贝输出数据到主机端

{%ace edit=false, lang=’c_cpp’%}
// 将输出数组拷贝到主机端内存中
status = clEnqueueReadBuffer(cmdQueue, bufC, CL_TRUE, 0, datasize, C, 0, NULL, NULL);
{%endace%}

10. 释放资源

内核执行完成后，并且输出已经传出到主机端，OpenCL分配的资源需要进行释放。这点和C/C++程序中的内存操作、文件处理以及其他资源的处理，都需要开发者显式释放。OpenCL为不同的对象提供了不同的释放API。OpenCL上下文需要最后释放，因为数组和命令队列都绑定在上下文上。这点与C++删除对象有些相似，成员数组需要在成员释放前释放。

{%ace edit=false, lang=’c_cpp’%}
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(cmdQueue);
clReleaseMemObject(bufA);
clReleaseMemObject(bufB);
clReleaseMemObject(bufC);
clReleaseContext(context);
{%endace%}

3.6.1 向量相加的完整代码

下面将完整的展示向量相加这个例子。其具有上一节的所有步骤，不过这个例子中使用了第一个平台对象和设备对象。

{%ace edit=false, lang=’c_cpp’%}
// This program implements a vector addition using OpenCL

// System includes

include

// OpenCL includes

include

// OpenCL kernel to perform an element-wise addition
const char programSouce =
“kernel n”
“void vecadd(global int A, n”
“ global int *B, n”
“ global int *C) n”
“{ n”
“ // Get the work-item’s unique ID n”
“ int idx = get_global_id(0); n”
“ n”
“ // Add the corresponding locations of n”
“ // ‘A’ and ‘B’, and store the reasult in ‘C’ n”
“ C[idx] = A[idx] + B[idx]; n”
“} n”
;

int main(){
// This code executes on the OpenCL host

// Elements in each array
const int elements = 2048;

// Compute the size of the data
size_t datasize = sizeof(int) * elements;

// Allocate space for input/output host data
int A = (int )malloc(datasize); // Input array
int B = (int )malloc(datasize); // Input array
int C = (int )malloc(datasize); // Output array

// Initialize the input data
int i;
for (i = 0; i < elements; i++){
A[i] = i;
B[i] = i;
}

// Use this to check the output of each API call
cl_int status;

// Get the first platforms
cl_platform_id platform;
status = clGetPlatformIDs(1, &perform, NULL);

// Get the first devices
cl_device_id device;
status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

// Create a context and associate it with the device
cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &status);

// Create a command-queue and associate it with device
cl_command_queue cmdQueue = clCreateCommandQUeueWithProperties(context, device, 0, &status);

// Allocate two input buffers and one output buffer for the three vectors in the vector addition
cl_mem bufA = clCreateBuffer(context, CL_MEM_READ_ONLY, datasize, NULL, &status);
cl_mem bufB = clCreateBuffer(context, CL_MEM_READ_ONLY, datasize, NULL, &status);
cl_mem bufC = clCreateBuffer(context, CL_MEM_WRITE_ONLY, datasize, NULL, &status);

// Write data from the input arrays to the buffers
status = clEnqueueWriteBuffer(cmdQueue, bufA, CL_FALSE, 0, datasize, A, 0, NULL, NULL);
status = clEnqueueWriteBuffer(cmdQueue, bufB, CL_FALSE, 0, datasize, B, 0, NULL, NULL);

// Create a program with source code
cl_program program = clCreateProgramWithSource(context, 1, (const char**)&programSource, NULL, &status);

// Build(compile) the program for the device
status = clBuildProgram(program, 1, &device, NULL, NULL, NULL);

// Create the vector addition kernel
cl_kernel kernel = clCreateKernel(program, “vecadd”, &status);

// Set the kernel arguments
status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufB);
status = clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufC);

// Define an incde space of work-items for execution
// A work-group size is not required, but can be used.
size_t indexSpaceSize[1], workGroupSize[1];

// There are ‘elements’ work-items
indexSpaceSize[0] = elements;
workGroupSize[0] = 256;

// Execute the kernel
status = clEnqueueNDRangeKernel(cmdQueue, kernel, 1, NULL, indexSpaceSize, workGroupSize, 0, NULL, NULL);

// Read the device output buffer to the host output array
status = clEnqueueReadBuffer(cmdQueue, bufC, CL_TRUE, 0, datasize, C, 0, NULL, NULL);

// Free OpenCL resouces
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(cmdQueue);
clReleaseMemObject(bufA);
clReleaseMemObject(bufB);
clReleaseMemObject(bufC);
clReleaseContext(context);

// free host resouces
free(A);
free(B);
free(C);

return 0;
}
{%endace%}

代码清单3.4 使用C API实现的OpenCL向量相加

第3章 介绍OpenCL - 3.6 OpenCL运行时(例子)

3.6.1 向量相加的完整代码

include

include

include

第3章介绍OpenCL - 3.6 OpenCL运行时(例子)