Installing OpenCL Support for Halide

黎玺

2023-12-01

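1. Background

Previously we built Halide from source and worked through its code examples. In the same series we also covered installing the Intel CPU OpenCL runtime, again with examples and project source code.

2. Why add OpenCL support to Halide

The main goal is to see whether Halide can be used to implement OpenCL kernels, and how a Halide schedule affects the generated kernels. We also want to find out how much this helps with generating kernels for an inference framework later on, and whether building our own AutoKernel-style tool is feasible.

3. Installing OpenCL support

First install the Intel CPU OpenCL Runtime SDK; for the detailed steps, see the link to the earlier article.
Then build Halide from source with OpenCL support enabled; see the link to the earlier article.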


# 1. Download the Halide source
curl -o Halide.zip https://codeload.github.com/halide/Halide/zip/refs/heads/main
# 2. Unpack it and enter the source tree (the archive unpacks to Halide-main)
unzip Halide.zip
cd Halide-main
# 3. Build
mkdir -p build
cd build
# Turn the Python bindings off and turn the TARGET_OPENCL switch on when configuring
cmake -DCMAKE_BUILD_TYPE=Release -DLLVM_DIR=$LLVM_ROOT/lib/cmake/llvm -DWITH_PYTHON_BINDINGS=OFF -DCMAKE_INSTALL_PREFIX=`pwd`/opencl-build -DTARGET_OPENCL=ON -S ..
cmake --build . -j32
# 4. Install into build/opencl-build; remember this path as halide-install-path
make install -j32

4. Calling OpenCL from Halide

4.1 Call flow

4.2 A practical example

#include <stdio.h>

#include "Halide.h"

// Include a clock to do performance testing.
#include "clock.h"

// Include some support code for loading pngs.
#include "halide_image_io.h"

using namespace Halide;
using namespace Halide::Tools;

int main(){
    Target host_target = get_host_target();
    Target new_target = host_target.with_feature(Target::OpenCL);
    if (!host_supports_target_device(new_target) || !new_target.has_gpu_feature()) {
        return -1;
    }
    Var x, y, c, i, ii, xo, yo, xi, yi;
    Buffer<uint8_t> input = load_image("../../images/rgb.png");

    // Declare the Funcs first; a Func cannot be declared and defined
    // in a single statement.
    Func lut, padded, padded16, sharpen, curved;

    // Gamma-curve lookup table.
    lut(i) = cast<uint8_t>(clamp(pow(i / 255.0f, 1.2f) * 255.0f, 0, 255));

    // Augment the input with a boundary condition.
    padded(x, y, c) = input(clamp(x, 0, input.width() - 1),
                            clamp(y, 0, input.height() - 1), c);

    // Cast it to 16-bit to do the math.
    padded16(x, y, c) = cast<uint16_t>(padded(x, y, c));

    // Next we sharpen it with a five-tap filter.
    sharpen(x, y, c) = (padded16(x, y, c) * 2 -
                        (padded16(x - 1, y, c) +
                         padded16(x, y - 1, c) +
                         padded16(x + 1, y, c) +
                         padded16(x, y + 1, c)) /
                            4);

    // Then apply the LUT.
    curved(x, y, c) = lut(sharpen(x, y, c));

    lut.compute_root();

    Var block, thread;
    lut.split(i, block, thread, 16);
    lut.gpu_blocks(block)
        .gpu_threads(thread);

    curved.reorder(c, x, y)
        .bound(c, 0, 3)
        .unroll(c);

    // Compute curved in 2D 8x8 tiles using the GPU.
    curved.gpu_tile(x, y, xo, yo, xi, yi, 8, 8);

    padded.compute_at(curved, xo);

    padded.gpu_threads(x, y);

    // JIT-compile the pipeline for the GPU. CUDA, OpenCL, or
    // Metal are not enabled by default. We have to construct a
    // Target object, enable one of them, and then pass that
    // target object to compile_jit. Otherwise your CPU will very
    // slowly pretend it's a GPU, and use one thread per output
    // pixel.
    printf("Target: %s\n", new_target.to_string().c_str());
    curved.compile_jit(new_target);
    Buffer<uint8_t> output(input.width(), input.height(), input.channels());

    // Run the filter once to initialize any GPU runtime state.
    curved.realize(output);
    return 0;

}
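
To check what the pipeline above is supposed to compute, it can help to have a plain C++ version of the same arithmetic to compare against. The following is a minimal sketch, not part of the original project: PlanarImage, build_lut and reference_curved are hypothetical names, and the planar layout (x fastest, then y, then c) is assumed from the buffer strides printed in the debug trace. Note that the sharpen step is done in unsigned 16-bit arithmetic, so a negative intermediate wraps to a large index and the LUT saturates it to 255.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical planar image: value(x, y, c) = data[(c * height + y) * width + x].
struct PlanarImage {
    int width, height, channels;
    std::vector<uint8_t> data;

    // Read a pixel with coordinates clamped to the image, like the `padded` Func.
    uint8_t at(int x, int y, int c) const {
        x = std::max(0, std::min(x, width - 1));
        y = std::max(0, std::min(y, height - 1));
        return data[(static_cast<std::size_t>(c) * height + y) * width + x];
    }
};

// Gamma LUT over the whole uint16 index range, like the `lut` Func.
std::vector<uint8_t> build_lut() {
    std::vector<uint8_t> lut(65536);
    for (int i = 0; i < 65536; ++i) {
        float v = std::pow(i / 255.0f, 1.2f) * 255.0f;
        v = std::max(0.0f, std::min(v, 255.0f));
        lut[i] = static_cast<uint8_t>(v);
    }
    return lut;
}

// Five-tap sharpen in 16-bit arithmetic followed by the LUT, like `curved`.
PlanarImage reference_curved(const PlanarImage &in) {
    const std::vector<uint8_t> lut = build_lut();
    PlanarImage out{in.width, in.height, in.channels,
                    std::vector<uint8_t>(in.data.size())};
    for (int c = 0; c < in.channels; ++c) {
        for (int y = 0; y < in.height; ++y) {
            for (int x = 0; x < in.width; ++x) {
                int center = in.at(x, y, c);
                int neighbours = in.at(x - 1, y, c) + in.at(x, y - 1, c) +
                                 in.at(x + 1, y, c) + in.at(x, y + 1, c);
                // The cast reproduces uint16 wraparound for negative values.
                uint16_t sharpened =
                    static_cast<uint16_t>(center * 2 - neighbours / 4);
                out.data[(static_cast<std::size_t>(c) * in.height + y) * in.width + x] =
                    lut[sharpened];
            }
        }
    }
    return out;
}
```

Comparing the output of reference_curved against the buffer produced by curved.realize (after copying it back to the host) is one way to test correctness; small differences are possible if the device evaluates pow with different rounding.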

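4.3 Hands-on steps

# Clone the project
git clone https://github.com/pengzhikang/Halide-Learning.git
cd Halide-Learning/learn-halide/halide-opencl
# Copy your own Halide build into 3rdparty/halide
cp -rf halide-install-path/* 3rdparty/halide
# Build the project
chmod +x build.sh
./build.sh

The timing results are shown below. The lines labelled "GPU" refer to the OpenCL target, which here is the Intel CPU OpenCL runtime, so both runs execute on the CPU. Even so, the OpenCL path is somewhat faster (3.7742 ms versus 4.6495 ms).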

Running pipeline on CPU:
Running pipeline on GPU:
Target: x86-64-linux-avx-avx2-avx512-avx512_skylake-f16c-fma-opencl-sse41
Testing GPU correctness:
Testing performance on CPU:
4.6495 milliseconds
Testing performance on GPU:
3.7742 milliseconds
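
Timings like the ones above are usually taken as the best of several runs, so that one-off costs such as kernel compilation do not dominate. The following is a minimal sketch of such a measurement using only the C++ standard library; best_time_ms is a hypothetical helper, not the clock.h used by the project.

```cpp
#include <chrono>
#include <functional>
#include <limits>

// Hypothetical helper: call `fn` `runs` times and return the fastest
// wall-clock time of a single call, in milliseconds.
double best_time_ms(const std::function<void()> &fn, int runs) {
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < runs; ++r) {
        auto start = std::chrono::steady_clock::now();
        fn();
        auto stop = std::chrono::steady_clock::now();
        std::chrono::duration<double, std::milli> elapsed = stop - start;
        if (elapsed.count() < best) {
            best = elapsed.count();
        }
    }
    return best;
}
```

When the timed function launches device work, it should also wait for that work to finish before returning (for example by copying the output buffer back to the host), otherwise only the launch overhead is measured.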

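With the debug feature enabled on the target in addition to Target::OpenCL (note the "debug" entry in the target string below), the runtime prints the OpenCL API calls it makes while running the pipeline.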


Entering Pipeline f4
Target: x86-64-linux-avx-avx2-avx512-avx512_skylake-debug-f16c-fma-jit-opencl-sse41-user_context
 Input Buffer b0: buffer(0, 0x0, 0x7facf4120080, 1, uint8, {0, 768, 1}, {0, 1280, 768}, {0, 3, 983040})
 Input (void *) __user_context: 0x7ffc331b64e0
 Output Buffer f4: buffer(0, 0x0, 0x7facbb8a0080, 0, uint8, {0, 768, 1}, {0, 1280, 768}, {0, 3, 983040})
CL: halide_opencl_initialize_kernels (user_context: 0x7ffc331b64e0, state_ptr: 0x7fad16bff000, program: 0x7fad16c000c0, size: 7288
    load_libopencl (user_context: 0x7ffc331b64e0)
    Loaded OpenCL runtime library: libOpenCL.so
    create_opencl_context (user_context: 0x7ffc331b64e0)
CL: platform 0 Intel(R) CPU Runtime for OpenCL(TM) Applications
    Got platform 'Intel(R) CPU Runtime for OpenCL(TM) Applications', about to create context (t=13546)
      device name: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
      device vendor: Intel(R) Corporation
      device profile: FULL_PROFILE
      global mem size: 47926 MB
      max mem alloc size: 11981 MB
      local mem size: 32768
      max compute units: 32
      max workgroup size: 8192
      max work item dimensions: 3
      max work item sizes: 8192x8192x8192x0
    clCreateContext -> 0x55bb00fc0128
    clCreateCommandQueue 0x55bb00f912e8
halide_cuda_initialize_kernels got compilation_cache mutex.
    clCreateProgramWithSource -> 0x55bb011493a8
    clBuildProgram 0x55bb011493a8 -D MAX_CONSTANT_BUFFER_SIZE=131072 -D MAX_CONSTANT_ARGS=480 
Caching compiled kernel: 0x55bb011493a8 id 2 context 0x55bb00fc0128
    Time: 5.710537e+02 ms
CL: halide_opencl_device_malloc (user_context: 0x7ffc331b64e0, buf: 0x7ffc331b5b40)
    allocating buffer(0, 0x0, 0x0, 0, uint8, {0, 65536, 1})
    clCreateBuffer -> 65536 0x55bb0131c998 device_handle: 0x55bb00f3bf10
    Allocated device buffer 0x55bb00f3bf10 for buffer 0x7ffc331b5b40
CL: validate 0x55bb0131c998 offset: 0: asked for 65536, actual allocated 65536
    Time: 4.213800e-02 ms
CL: halide_opencl_run (user_context: 0x7ffc331b64e0, entry: _kernel_f0_s0_v12_v18___block_id_x, blocks: 4096x1x1, threads: 16x1x1, shmem: 0
    clCreateKernel _kernel_f0_s0_v12_v18___block_id_x ->     Time: 1.361441e+00 ms
    clSetKernelArg 0 8 [0x55bb00f3bf10 ...] 1
Mapped dev handle is: 0x55bb0131c998
    clSetKernelArg 1 0 [nullptr]
    clEnqueueNDRangeKernel 4096x1x1, 16x1x1 -> CL_SUCCESS
    Releasing kernel 0x55bb01011748
    clReleaseKernel finished0x55bb01011748
    Time: 1.849298e+00 ms
CL: halide_opencl_device_malloc (user_context: 0x7ffc331b64e0, buf: 0x55baffd54c68)
    allocating buffer(0, 0x0, 0x7facf4120080, 1, uint8, {0, 768, 1}, {0, 1280, 768}, {0, 3, 983040})
    clCreateBuffer -> 2949120 0x55baff28cf78 device_handle: 0x55bb00978520
    Allocated device buffer 0x55bb00978520 for buffer 0x55baffd54c68
CL: validate 0x55baff28cf78 offset: 0: asked for 2949120, actual allocated 2949120
    Time: 6.049300e-02 ms
CL: halide_opencl_buffer_copy (user_context: 0x7ffc331b64e0, src: 0x55baffd54c68, dst: 0x55baffd54c68)
CL: validate 0x55baff28cf78 offset: 0: asked for 0, actual allocated 2949120
    from host to device, 0x7facf4120080 + 0 -> 0x55bb00978520 + 0, 2949120 bytes
    Time: 2.930249e+00 ms
CL: halide_opencl_device_malloc (user_context: 0x7ffc331b64e0, buf: 0x55bb0102b678)
    allocating buffer(0, 0x0, 0x7facbb8a0080, 0, uint8, {0, 768, 1}, {0, 1280, 768}, {0, 3, 983040})
    clCreateBuffer -> 2949120 0x55baff54eca8 device_handle: 0x55bb008eeab0
    Allocated device buffer 0x55bb008eeab0 for buffer 0x55bb0102b678
CL: validate 0x55baff54eca8 offset: 0: asked for 2949120, actual allocated 2949120
    Time: 2.716900e-02 ms
CL: halide_opencl_run (user_context: 0x7ffc331b64e0, entry: _kernel_f4_s0_v10_v15___block_id_y, blocks: 96x160x1, threads: 10x10x1, shmem: 300
    clCreateKernel _kernel_f4_s0_v10_v15___block_id_y ->     Time: 1.656029e+00 ms
    clSetKernelArg 0 8 [0x55bb00978520 ...] 1
Mapped dev handle is: 0x55baff28cf78
    clSetKernelArg 1 8 [0x55bb00f3bf10 ...] 1
Mapped dev handle is: 0x55bb0131c998
    clSetKernelArg 2 8 [0x55bb008eeab0 ...] 1
Mapped dev handle is: 0x55baff54eca8
    clSetKernelArg 3 4 [0x50000000300 ...] 0
    clSetKernelArg 4 4 [0x500 ...] 0
    clSetKernelArg 5 4 [0x0 ...] 0
    clSetKernelArg 6 4 [0x30000000000 ...] 0
    clSetKernelArg 7 4 [0x300 ...] 0
    clSetKernelArg 8 4 [0xf000000000000 ...] 0
    clSetKernelArg 9 4 [0x1e0000000f0000 ...] 0
    clSetKernelArg 10 4 [0x10101001e0000 ...] 0
    clSetKernelArg 11 300 [nullptr]
    clEnqueueNDRangeKernel 96x160x1, 10x10x1 -> CL_SUCCESS
    Releasing kernel 0x55bb016fb848
    clReleaseKernel finished0x55bb016fb848
    Time: 5.408939e+00 ms
CL: halide_opencl_device_free (user_context: 0x7ffc331b64e0, buf: 0x7ffc331b5b40) cl_mem: 0x55bb0131c998
CL: validate 0x55bb0131c998 offset: 0: asked for 0, actual allocated 65536
    clReleaseMemObject 0x55bb0131c998
    Time: 8.309700e-02 ms
Exiting Pipeline f4
CL: halide_opencl_finalize_kernels (user_context: 0x7ffc331b64e0, state_ptr: 0x2
CL: halide_opencl_device_free (user_context: 0x0, buf: 0x55bb0102b678) cl_mem: 0x55baff54eca8
CL: validate 0x55baff54eca8 offset: 0: asked for 0, actual allocated 2949120
    clReleaseMemObject 0x55baff54eca8
    Time: 2.133770e-01 ms
CL: halide_opencl_device_free (user_context: 0x0, buf: 0x55baffd54c68) cl_mem: 0x55baff28cf78
CL: validate 0x55baff28cf78 offset: 0: asked for 0, actual allocated 2949120
    clReleaseMemObject 0x55baff28cf78
    Time: 1.516690e-01 ms
Target: x86-64-linux-avx-avx2-avx512-avx512_skylake-debug-f16c-fma-opencl-sse41
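
The target strings printed in these logs are lists of features separated by hyphens, which makes it easy to confirm from a log which features were active for a given run. The following is a small sketch that works on the printed string only; target_string_has is a hypothetical helper and does not use the Halide API.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: true if `feature` appears as one of the
// '-'-separated tokens of a printed Halide target string.
bool target_string_has(const std::string &target, const std::string &feature) {
    std::istringstream tokens(target);
    std::string token;
    while (std::getline(tokens, token, '-')) {
        if (token == feature) {
            return true;
        }
    }
    return false;
}
```

Applied to the two target strings in this article, the check finds "opencl" in both and "debug" only in the one from the traced run.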

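4.4 How Halide calls OpenCL

The trace above shows how Halide realizes a pipeline through OpenCL. For this pipeline Halide uses two kernels: _kernel_f0_s0_v12_v18___block_id_x, launched with 4096 blocks of 16 threads, and _kernel_f4_s0_v10_v15___block_id_y, launched with 96x160 blocks of 10x10 threads. The call sequence is the ordinary OpenCL host API flow: clCreateContext, clCreateCommandQueue, clCreateProgramWithSource, clBuildProgram, clCreateBuffer, clCreateKernel, clSetKernelArg, clEnqueueNDRangeKernel, clReleaseKernel, and clReleaseMemObject.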
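
The launch sizes in the trace can be reproduced with a little arithmetic from the schedule. The following sketch uses two hypothetical helpers, blocks_for and halo_tile_bytes. The reading of the 10x10 thread block and the 300 bytes of shared memory as an 8x8 tile plus a one-pixel halo, times three one-byte channels, is an interpretation of the log, not something the log states.

```cpp
// Number of blocks needed to cover `extent` elements with tiles of
// `tile` elements each, rounding up.
int blocks_for(int extent, int tile) {
    return (extent + tile - 1) / tile;
}

// Bytes needed to stage one tile of the input plus a halo of `halo`
// pixels on every side, for `channels` channels of `bytes_per_elem` each.
int halo_tile_bytes(int tile_w, int tile_h, int halo, int channels,
                    int bytes_per_elem) {
    return (tile_w + 2 * halo) * (tile_h + 2 * halo) * channels * bytes_per_elem;
}
```

For the 768x1280 image and the 8x8 tiles in the schedule, blocks_for gives 96 and 160, matching the 96x160x1 launch; the 65536-entry LUT split by 16 gives 4096 blocks; and halo_tile_bytes(8, 8, 1, 3, 1) gives 300, matching the shmem value in the trace.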
