Environment: Windows 10, Visual Studio 2019, PyCUDA 2019.02.
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
a = np.random.randn(4,4)
a = a.astype(np.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
Next, write code that multiplies every value of the array stored in the device memory a_gpu by 2. To do this, we write a piece of CUDA C code and pass it to a constructor, here pycuda.compiler.SourceModule:
mod = SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 2;
}
""")
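To see what the index expression does, here is a minimal CPU-only sketch in NumPy. It assumes the 4x4 block launched later in this post and a row-major buffer. doublify_cpu is a hypothetical helper, not part of PyCUDA, and the loops stand in for the 16 GPU threads that would run in parallel.

```python
import numpy as np

BLOCK_X, BLOCK_Y = 4, 4  # assumed block shape, matching block=(4,4,1)

def doublify_cpu(a):
    """Emulate the doublify kernel on the CPU (illustrative only)."""
    # The device buffer is a flat, row-major copy of the host array.
    flat = a.ravel().copy()
    for ty in range(BLOCK_Y):          # plays the role of threadIdx.y
        for tx in range(BLOCK_X):      # plays the role of threadIdx.x
            idx = tx + ty * 4          # same formula as in the kernel
            flat[idx] *= 2
    return flat.reshape(a.shape)

a = np.random.randn(4, 4).astype(np.float32)
print(np.array_equal(doublify_cpu(a), 2 * a))
```

Each (tx, ty) pair touches exactly one of the 16 elements, which is why a single 4x4 block is enough for a 4x4 array.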
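If the SourceModule call fails with an error such as CompileError: nvcc compilation of C:\Users\KANGNI~1\AppData\Local\Temp\tmpbregy28e\kernel.cu failed
the cause may be the environment configuration, typically that nvcc cannot find the MSVC compiler cl.exe. Try adding the following snippet to the code before the SourceModule call: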
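import os
# Directory that contains cl.exe; adjust it to your own Visual Studio installation.
_path = r"D:\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.25.28610\bin\Hostx64\x64"
# os.system returns a non-zero exit code when cl.exe cannot be run.
if os.system("cl.exe"):
    os.environ['PATH'] += ';' + _path
# Try again after extending PATH.
if os.system("cl.exe"):
    raise RuntimeError("cl.exe still not found, path probably incorrect")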
mod = SourceModule("""
__global__ void doublify(float *a)
{
    int idx = threadIdx.x + threadIdx.y*4;
    a[idx] *= 2;
}
""")
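The PATH check above launches cl.exe just to see whether it exists. A possible alternative, sketched here under the assumption that only the lookup matters, uses shutil.which from the standard library to search PATH without running anything. ensure_tool_on_path is a hypothetical helper name, and the directory is the one from this post, so it will differ on other machines.

```python
import os
import shutil

def ensure_tool_on_path(tool, candidate_dir, env=None):
    """Return the resolved path of `tool`, or None if it cannot be found.

    If `tool` is not on PATH yet, `candidate_dir` is appended to PATH
    in `env` (os.environ by default) and the lookup is retried.
    """
    env = os.environ if env is None else env
    current = env.get("PATH", "")
    found = shutil.which(tool, path=current)
    if found:
        return found
    # Append the candidate directory and look again.
    env["PATH"] = current + os.pathsep + candidate_dir if current else candidate_dir
    return shutil.which(tool, path=env["PATH"])

if __name__ == "__main__":
    # Directory taken from this post; adjust it to your installation.
    msvc_dir = r"D:\Microsoft Visual Studio\2019\Community\VC\Tools\MSVC\14.25.28610\bin\Hostx64\x64"
    # A copy of the environment keeps this demo free of side effects.
    print(ensure_tool_on_path("cl.exe", msvc_dir, env=dict(os.environ)))
```

Calling the helper without the env argument modifies os.environ of the running process, which is what the SourceModule call would need in order to pick up the new PATH.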
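If SourceModule raises no error, the code has compiled successfully and been loaded onto the GPU. We can then look up a reference to the kernel, a pycuda.driver.Function, and call it, passing the device array a_gpu as the argument and setting the block size to 4x4: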
func = mod.get_function("doublify")
func(a_gpu, block=(4,4,1))
a_doubled = np.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print(a_doubled)
print(a)
result:
[[ 1.4142118 1.48865 3.0958736 -0.64879215]
[ 0.5829473 1.684329 0.21461935 -2.3383026 ]
[ 1.2626396 -0.7566854 1.8427325 -0.52471924]
[-0.80164 1.3886924 -3.5787368 0.72956717]]
[[ 1.4142118 1.48865 3.0958736 -0.64879215]
[ 0.5829473 1.684329 0.21461935 -2.3383026 ]
[ 1.2626396 -0.7566854 1.8427325 -0.52471924]
[-0.80164 1.3886924 -3.5787368 0.72956717]]
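Because the kernel doubles every element, a_doubled should equal 2 * a element-wise. A small check along these lines can confirm that, instead of comparing the printed arrays by eye. is_doubled is a hypothetical helper, and the stand-in arrays below replace the real a and a_doubled so that the sketch runs without a GPU.

```python
import numpy as np

def is_doubled(original, result, rtol=1e-6):
    """Return True if `result` is element-wise twice `original`."""
    return bool(np.allclose(result, 2 * original, rtol=rtol))

# Stand-ins for the host array and the array copied back from the GPU.
a = np.random.randn(4, 4).astype(np.float32)
a_doubled = 2 * a  # in the real script this comes from cuda.memcpy_dtoh

print(is_doubled(a, a_doubled))
```

In the real script, the same call can be placed right after cuda.memcpy_dtoh(a_doubled, a_gpu).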
ref: http://www.voidcn.com/article/p-zoelqetb-bvv.html