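Last time we built Halide from source and worked through its sample code. In the same series we also covered installing the Intel CPU OpenCL runtime, again with accompanying examples and project source.
Links to those earlier articles are listed below:
The main goal this time is to find out whether Halide can be used to implement OpenCL kernels, and how a Halide schedule affects the kernels that get generated. I also want to see how much this could help with kernel generation for an inference framework later on, and whether building my own AutoKernel-style tool is feasible.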
# 1. Download the Halide source
curl -o Halide.zip https://codeload.github.com/halide/Halide/zip/refs/heads/main
# 2. Unpack it (the archive of the main branch extracts to Halide-main)
unzip Halide.zip
cd Halide-main
# 3. Build
if [ ! -d "./build" ]; then
    mkdir build
fi
cd build
# Turn the Python bindings off (WITH_PYTHON_BINDINGS=OFF)
# and turn the OpenCL target on (TARGET_OPENCL=ON) when configuring with cmake
cmake -DCMAKE_BUILD_TYPE=Release -DLLVM_DIR=$LLVM_ROOT/lib/cmake/llvm -DWITH_PYTHON_BINDINGS=OFF -DCMAKE_INSTALL_PREFIX=$(pwd)/opencl-build -DTARGET_OPENCL=ON -S ..
cmake --build . -j32
# 4. Install into build/opencl-build; remember this directory as halide-install-path
make install -j32
#include <stdio.h>
#include "Halide.h"
// Include a clock to do performance testing.
#include "clock.h"
// Include some support code for loading pngs.
#include "halide_image_io.h"
using namespace Halide;
using namespace Halide::Tools;

int main() {
    // Start from the host target and enable the OpenCL feature.
    Target host_target = get_host_target();
    Target new_target = host_target.with_feature(Target::OpenCL);
    if (!host_supports_target_device(new_target) || !new_target.has_gpu_feature()) {
        printf("No usable OpenCL device for target %s\n", new_target.to_string().c_str());
        return -1;
    }

    Var x, y, c, i, ii, xo, yo, xi, yi;
    Buffer<uint8_t> input = load_image("../../images/rgb.png");

    // A Func is declared first and defined afterwards;
    // "Func f(x) = ..." is not valid C++.
    Func lut;
    lut(i) = cast<uint8_t>(clamp(pow(i / 255.0f, 1.2f) * 255.0f, 0, 255));

    // Augment the input with a boundary condition.
    Func padded;
    padded(x, y, c) = input(clamp(x, 0, input.width() - 1),
                            clamp(y, 0, input.height() - 1), c);

    // Cast it to 16-bit to do the math.
    Func padded16;
    padded16(x, y, c) = cast<uint16_t>(padded(x, y, c));

    // Next we sharpen it with a five-tap filter.
    Func sharpen;
    sharpen(x, y, c) = (padded16(x, y, c) * 2 -
                        (padded16(x - 1, y, c) +
                         padded16(x, y - 1, c) +
                         padded16(x + 1, y, c) +
                         padded16(x, y + 1, c)) /
                            4);

    // Then apply the LUT.
    Func curved;
    curved(x, y, c) = lut(sharpen(x, y, c));

    // Compute the LUT once, up front, in GPU blocks of 16 threads.
    lut.compute_root();
    Var block, thread;
    lut.split(i, block, thread, 16);
    lut.gpu_blocks(block)
        .gpu_threads(thread);

    // Put the colour channels innermost and unroll over them.
    curved.reorder(c, x, y)
        .bound(c, 0, 3)
        .unroll(c);

    // Compute curved in 2D 8x8 tiles using the GPU.
    curved.gpu_tile(x, y, xo, yo, xi, yi, 8, 8);

    // Compute the padded input per tile, one GPU thread per pixel.
    padded.compute_at(curved, xo);
    padded.gpu_threads(x, y);

    // JIT-compile the pipeline for the GPU. CUDA, OpenCL, or
    // Metal are not enabled by default. We have to construct a
    // Target object, enable one of them, and then pass that
    // target object to compile_jit. Otherwise your CPU will very
    // slowly pretend it's a GPU, and use one thread per output
    // pixel.
    printf("Target: %s\n", new_target.to_string().c_str());
    curved.compile_jit(new_target);

    Buffer<uint8_t> output(input.width(), input.height(), input.channels());
    // Run the filter once to initialize any GPU runtime state.
    curved.realize(output);
    return 0;
}
# Clone the project
git clone https://github.com/pengzhikang/Halide-Learning.git
cd Halide-Learning/learn-halide/halide-opencl
# Copy the Halide build you just installed into 3rdparty/halide
cp -rf halide-install-path/* 3rdparty/halide
# Build the project
chmod +x build.sh
./build.sh
The timing results are shown below. Both runs execute on the CPU, yet the Intel CPU OpenCL path is still somewhat faster (3.7742 ms versus 4.6495 ms).
Running pipeline on CPU:
Running pipeline on GPU:
Target: x86-64-linux-avx-avx2-avx512-avx512_skylake-f16c-fma-opencl-sse41
Testing GPU correctness:
Testing performance on CPU:
4.6495 milliseconds
Testing performance on GPU:
3.7742 milliseconds
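With Target::OpenCL enabled, and with the debug feature on as the `debug` entry in the target string below shows, we can see which OpenCL API calls are made while the pipeline runs.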
Entering Pipeline f4
Target: x86-64-linux-avx-avx2-avx512-avx512_skylake-debug-f16c-fma-jit-opencl-sse41-user_context
Input Buffer b0: buffer(0, 0x0, 0x7facf4120080, 1, uint8, {0, 768, 1}, {0, 1280, 768}, {0, 3, 983040})
Input (void *) __user_context: 0x7ffc331b64e0
Output Buffer f4: buffer(0, 0x0, 0x7facbb8a0080, 0, uint8, {0, 768, 1}, {0, 1280, 768}, {0, 3, 983040})
CL: halide_opencl_initialize_kernels (user_context: 0x7ffc331b64e0, state_ptr: 0x7fad16bff000, program: 0x7fad16c000c0, size: 7288
load_libopencl (user_context: 0x7ffc331b64e0)
Loaded OpenCL runtime library: libOpenCL.so
create_opencl_context (user_context: 0x7ffc331b64e0)
CL: platform 0 Intel(R) CPU Runtime for OpenCL(TM) Applications
Got platform 'Intel(R) CPU Runtime for OpenCL(TM) Applications', about to create context (t=13546)
device name: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
device vendor: Intel(R) Corporation
device profile: FULL_PROFILE
global mem size: 47926 MB
max mem alloc size: 11981 MB
local mem size: 32768
max compute units: 32
max workgroup size: 8192
max work item dimensions: 3
max work item sizes: 8192x8192x8192x0
clCreateContext -> 0x55bb00fc0128
clCreateCommandQueue 0x55bb00f912e8
halide_cuda_initialize_kernels got compilation_cache mutex.
clCreateProgramWithSource -> 0x55bb011493a8
clBuildProgram 0x55bb011493a8 -D MAX_CONSTANT_BUFFER_SIZE=131072 -D MAX_CONSTANT_ARGS=480
Caching compiled kernel: 0x55bb011493a8 id 2 context 0x55bb00fc0128
Time: 5.710537e+02 ms
CL: halide_opencl_device_malloc (user_context: 0x7ffc331b64e0, buf: 0x7ffc331b5b40)
allocating buffer(0, 0x0, 0x0, 0, uint8, {0, 65536, 1})
clCreateBuffer -> 65536 0x55bb0131c998 device_handle: 0x55bb00f3bf10
Allocated device buffer 0x55bb00f3bf10 for buffer 0x7ffc331b5b40
CL: validate 0x55bb0131c998 offset: 0: asked for 65536, actual allocated 65536
Time: 4.213800e-02 ms
CL: halide_opencl_run (user_context: 0x7ffc331b64e0, entry: _kernel_f0_s0_v12_v18___block_id_x, blocks: 4096x1x1, threads: 16x1x1, shmem: 0
clCreateKernel _kernel_f0_s0_v12_v18___block_id_x -> Time: 1.361441e+00 ms
clSetKernelArg 0 8 [0x55bb00f3bf10 ...] 1
Mapped dev handle is: 0x55bb0131c998
clSetKernelArg 1 0 [nullptr]
clEnqueueNDRangeKernel 4096x1x1, 16x1x1 -> CL_SUCCESS
Releasing kernel 0x55bb01011748
clReleaseKernel finished0x55bb01011748
Time: 1.849298e+00 ms
CL: halide_opencl_device_malloc (user_context: 0x7ffc331b64e0, buf: 0x55baffd54c68)
allocating buffer(0, 0x0, 0x7facf4120080, 1, uint8, {0, 768, 1}, {0, 1280, 768}, {0, 3, 983040})
clCreateBuffer -> 2949120 0x55baff28cf78 device_handle: 0x55bb00978520
Allocated device buffer 0x55bb00978520 for buffer 0x55baffd54c68
CL: validate 0x55baff28cf78 offset: 0: asked for 2949120, actual allocated 2949120
Time: 6.049300e-02 ms
CL: halide_opencl_buffer_copy (user_context: 0x7ffc331b64e0, src: 0x55baffd54c68, dst: 0x55baffd54c68)
CL: validate 0x55baff28cf78 offset: 0: asked for 0, actual allocated 2949120
from host to device, 0x7facf4120080 + 0 -> 0x55bb00978520 + 0, 2949120 bytes
Time: 2.930249e+00 ms
CL: halide_opencl_device_malloc (user_context: 0x7ffc331b64e0, buf: 0x55bb0102b678)
allocating buffer(0, 0x0, 0x7facbb8a0080, 0, uint8, {0, 768, 1}, {0, 1280, 768}, {0, 3, 983040})
clCreateBuffer -> 2949120 0x55baff54eca8 device_handle: 0x55bb008eeab0
Allocated device buffer 0x55bb008eeab0 for buffer 0x55bb0102b678
CL: validate 0x55baff54eca8 offset: 0: asked for 2949120, actual allocated 2949120
Time: 2.716900e-02 ms
CL: halide_opencl_run (user_context: 0x7ffc331b64e0, entry: _kernel_f4_s0_v10_v15___block_id_y, blocks: 96x160x1, threads: 10x10x1, shmem: 300
clCreateKernel _kernel_f4_s0_v10_v15___block_id_y -> Time: 1.656029e+00 ms
clSetKernelArg 0 8 [0x55bb00978520 ...] 1
Mapped dev handle is: 0x55baff28cf78
clSetKernelArg 1 8 [0x55bb00f3bf10 ...] 1
Mapped dev handle is: 0x55bb0131c998
clSetKernelArg 2 8 [0x55bb008eeab0 ...] 1
Mapped dev handle is: 0x55baff54eca8
clSetKernelArg 3 4 [0x50000000300 ...] 0
clSetKernelArg 4 4 [0x500 ...] 0
clSetKernelArg 5 4 [0x0 ...] 0
clSetKernelArg 6 4 [0x30000000000 ...] 0
clSetKernelArg 7 4 [0x300 ...] 0
clSetKernelArg 8 4 [0xf000000000000 ...] 0
clSetKernelArg 9 4 [0x1e0000000f0000 ...] 0
clSetKernelArg 10 4 [0x10101001e0000 ...] 0
clSetKernelArg 11 300 [nullptr]
clEnqueueNDRangeKernel 96x160x1, 10x10x1 -> CL_SUCCESS
Releasing kernel 0x55bb016fb848
clReleaseKernel finished0x55bb016fb848
Time: 5.408939e+00 ms
CL: halide_opencl_device_free (user_context: 0x7ffc331b64e0, buf: 0x7ffc331b5b40) cl_mem: 0x55bb0131c998
CL: validate 0x55bb0131c998 offset: 0: asked for 0, actual allocated 65536
clReleaseMemObject 0x55bb0131c998
Time: 8.309700e-02 ms
Exiting Pipeline f4
CL: halide_opencl_finalize_kernels (user_context: 0x7ffc331b64e0, state_ptr: 0x2
CL: halide_opencl_device_free (user_context: 0x0, buf: 0x55bb0102b678) cl_mem: 0x55baff54eca8
CL: validate 0x55baff54eca8 offset: 0: asked for 0, actual allocated 2949120
clReleaseMemObject 0x55baff54eca8
Time: 2.133770e-01 ms
CL: halide_opencl_device_free (user_context: 0x0, buf: 0x55baffd54c68) cl_mem: 0x55baff28cf78
CL: validate 0x55baff28cf78 offset: 0: asked for 0, actual allocated 2949120
clReleaseMemObject 0x55baff28cf78
Time: 1.516690e-01 ms
Target: x86-64-linux-avx-avx2-avx512-avx512_skylake-debug-f16c-fma-opencl-sse41
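So how does Halide realize a computation graph with OpenCL? From the log above we can see that Halide uses two kernels to compute this pipeline, and that the way it drives OpenCL is just the ordinary OpenCL API call sequence: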