I skimmed the Python API before and wrote up notes, but in actual use those notes turned out to be completely useless...
So this post is organized around the API's main features, introducing the relevant APIs together with examples, in the hope that the next time I need TensorRT I can simply copy-paste from here.
References:

Current progress
- Logger: `Builder`/`ICudaEngine`/`Runtime` objects provide a logger.
  - `min_severity` controls verbosity; its values are `trt.Logger.INTERNAL_ERROR/WARNING/ERROR/VERBOSE`, etc.
  - Messages are emitted with `log(severity, msg)`.
- `Builder`: creates an `ICudaEngine` object from an `INetworkDefinition` object.
  - Attributes include `max_batch_size/max_workspace_size/int8_mode/fp16_mode`, etc.
  - Creates the `INetworkDefinition` object, e.g. via `create_network(flags)`.
  - Builds the `ICudaEngine` from the `INetworkDefinition`, e.g. via `build_cuda_engine(network)` / `build_engine(network, config)`.
  - There are also `builder_config` and `optimization_profile`. I don't need them for now, so I only skimmed them.
- `ICudaEngine`:
  - Created by the `Builder`'s `build_cuda_engine`/`build_engine`, or by the `Runtime`'s `deserialize_cuda_engine(serialized_engine, plugin_factory)`, where `serialized_engine` is simply the result of `open(filename, "rb").read()`.
  - Attributes include `num_bindings/max_batch_size/num_layers/max_workspace_size`, etc.
  - Creates `IExecutionContext` objects, e.g. via `create_execution_context()` or `create_execution_context_without_device_memory()`.
  - `serialize`: roughly `open(filename, "wb").write(engine.serialize())`.
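The serialize/deserialize pair forms a simple round-trip: whatever bytes `engine.serialize()` produces are exactly what `deserialize_cuda_engine` expects back. A minimal sketch of the file round-trip, using a placeholder byte string in place of a real serialized engine:

```python
import os
import tempfile

# Placeholder standing in for the buffer returned by engine.serialize()
serialized = b"\x7fTRT-engine-bytes"

path = os.path.join(tempfile.mkdtemp(), "model.trt")

# Save: open(path, "wb").write(engine.serialize())
with open(path, "wb") as f:
    f.write(serialized)

# Load: engine = runtime.deserialize_cuda_engine(open(path, "rb").read())
with open(path, "rb") as f:
    restored = f.read()

print(restored == serialized)  # True
```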
- Bindings: device buffers are allocated with `pycuda.driver.cuda.mem_alloc`; the `bindings` passed to inference hold each buffer's memory address, i.e. `int(buffer)`.
  - `ICudaEngine` binding-related functions include:
    - `binding_is_input(idx/name)`
    - `get_binding_shape(idx/name)`
    - `get_binding_dtype(idx/name)`
    - `get_binding_name(idx)`
    - `get_binding_bytes_per_component(idx)`
    - `get_binding_components_per_element(idx)`
    - `get_binding_format(idx)`
    - `get_binding_format_desc(idx)`
    - `get_binding_vectorized_dim(idx)`
    - `is_execution_binding(idx)`
    - `is_shape_binding(idx)`
  - `for binding_name in engine:` iterates over all binding names.
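Since each `bindings` entry is just an integer memory address, the idea can be illustrated without a GPU. The sketch below uses host-side `ctypes` buffers purely as an analogy; with pycuda the entries would instead be `int(mem_alloc(nbytes))` device addresses, and the buffer sizes shown are hypothetical:

```python
import ctypes

# Host-side analogy for a TensorRT bindings list: each entry is just an
# integer memory address. With pycuda the entries would be
# int(mem_alloc(nbytes)); here ctypes buffers stand in for device memory.
buf_in = ctypes.create_string_buffer(4 * 3 * 224 * 224)  # hypothetical input size in bytes
buf_out = ctypes.create_string_buffer(4 * 1000)          # hypothetical output size in bytes

bindings = [ctypes.addressof(buf_in), ctypes.addressof(buf_out)]
print(all(isinstance(b, int) for b in bindings))  # True
```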
- `IExecutionContext`:
  - Obtained via `ICudaEngine.create_execution_context()`.
  - Attributes include `profiler/engine/name`, etc.
  - Inference methods: `execute/execute_v2/execute_async/execute_async_v2`. The v1 docstring reads "This function is generalized for multiple inputs/outputs.", while the v2 docstring reads "This function is generalized for multiple inputs/outputs for full dimension networks.", though the distinction is not entirely clear to me.
  - There are also methods such as `get_shape/get_binding_shape/set_shape_input/set_binding_shape`, plus `get_strides/set_optimization_profile_async`.
The basic inference workflow:

1. Build a `tensorrt.ICudaEngine` object.
2. Use the `tensorrt.ICudaEngine` object to build the `tensorrt.IExecutionContext` object required for inference.
3. Run inference through the `tensorrt.IExecutionContext`.

Creating the `tensorrt.ICudaEngine` object:

- Via `tensorrt.Builder`, essentially through an `INetworkDefinition` object.
- Via `tensorrt.Runtime`, which takes a buffer as input (e.g. the contents of a local file).

Creating the `tensorrt.IExecutionContext` object:

- Via the `tensorrt.ICudaEngine` object's `create_execution_context` method.
- Host buffers are allocated with `pycuda.driver.cuda.pagelocked_empty` (see the PyCUDA documentation): `pagelocked_empty(shape, dtype)`, whose main parameters are `shape` and `dtype`.
  - `shape` can be obtained via `trt.volume(engine.get_binding_shape(idx))`; note that this is the number of elements, not the size in bytes.
  - `dtype` can be given as either `np.float32` or `trt.float32`.
- Device buffers are allocated with `pycuda.driver.cuda.mem_alloc` (see the PyCUDA documentation): `mem_alloc(buffer.nbytes)`, where `buffer` can be an ndarray or the result of `pagelocked_empty()`.
- Host-to-device copies use `pycuda.driver.cuda.memcpy_htod(dest, src)` (see the PyCUDA documentation): `dest` is the result of `mem_alloc`, `src` is a numpy/`pagelocked_empty` buffer.
- Device-to-host copies use `pycuda.driver.cuda.memcpy_dtoh(dest, src)` (see the PyCUDA documentation): `dest` is a numpy/`pagelocked_empty` buffer, `src` is the result of `mem_alloc`.
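The element-count vs. byte-size distinction above is worth pinning down with numbers. A small sketch, with a hypothetical binding shape, using `np.prod` in place of `trt.volume` and a plain `np.empty` standing in for `pagelocked_empty`:

```python
import numpy as np

# pagelocked_empty(shape, dtype) takes an element COUNT, while
# mem_alloc(buffer.nbytes) takes a BYTE size -- easy to mix up.
# trt.volume(engine.get_binding_shape(i)) is just the product of the
# dimensions, equivalent to np.prod(shape) here.
shape = (1, 3, 224, 224)                           # hypothetical binding shape
n_elements = int(np.prod(shape))                   # what trt.volume would return
host_buf = np.empty(n_elements, dtype=np.float32)  # stand-in for pagelocked_empty
print(n_elements, host_buf.nbytes)  # 150528 602112
```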
- The `IExecutionContext` object's `execute` family of methods:
  - `execute/execute_async` take the `batch_size` and `bindings` parameters; the v2 variants take only `bindings`. The async variants additionally take `stream_handle` and `input_consumed`.
  - `bindings` is a list containing the addresses of all input/output (i.e. device) buffers; each address is obtained simply via `int(buffer)`, where `buffer` is the result of `mem_alloc`.
  - `stream_handle` is the `handle` of a `cuda.Stream()` object.
```python
import numpy as np
import pycuda.autoinit  # initializes a CUDA context
import pycuda.driver as cuda
import tensorrt as trt

# Build an ICudaEngine from a local serialized engine file
ENGINE_PATH = '/path/to/model.trt'
trt_logger = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(trt_logger)
with open(ENGINE_PATH, "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

# (Building an ICudaEngine from ONNX/UFF/Caffe input is omitted here)

# Build the context
context = engine.create_execution_context()

# Buffer setup, option 1: when there is exactly one input and one output
# Reference: https://github.com/dkorobchenko-nv/tensorrt-demo/blob/master/trt_infer.py
INPUT_DATA_TYPE = np.float32
stream = cuda.Stream()
host_in = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(0)), dtype=INPUT_DATA_TYPE)
host_out = cuda.pagelocked_empty(trt.volume(engine.get_binding_shape(1)), dtype=INPUT_DATA_TYPE)
device_in = cuda.mem_alloc(host_in.nbytes)
device_out = cuda.mem_alloc(host_out.nbytes)
bindings = [int(device_in), int(device_out)]

# Buffer setup, option 2: when the number of inputs/outputs is unknown
# Reference: https://github.com/NVIDIA/TensorRT/blob/master/samples/python/common.py
class HostDeviceMem(object):
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem

    def __str__(self):
        return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

    def __repr__(self):
        return self.__str__()

inputs = []
outputs = []
bindings = []
stream = cuda.Stream()
for binding in engine:
    # Note: iterating over the engine yields binding *names*
    size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
    dtype = trt.nptype(engine.get_binding_dtype(binding))
    # Allocate host and device buffers
    host_mem = cuda.pagelocked_empty(size, dtype)
    device_mem = cuda.mem_alloc(host_mem.nbytes)
    # Append the device buffer address to the bindings list
    bindings.append(int(device_mem))
    # Append to the appropriate list
    if engine.binding_is_input(binding):
        inputs.append(HostDeviceMem(host_mem, device_mem))
    else:
        outputs.append(HostDeviceMem(host_mem, device_mem))

# Inference, option 1: when the input/output are fixed
# (img is the preprocessed input array)
np.copyto(host_in, img.ravel())
cuda.memcpy_htod_async(device_in, host_in, stream)
context.execute_async(bindings=bindings, stream_handle=stream.handle)
cuda.memcpy_dtoh_async(host_out, device_out, stream)
stream.synchronize()

# Inference, option 2: when the number of inputs/outputs is not fixed
# Reference: https://github.com/NVIDIA/TensorRT/blob/master/samples/python/common.py
# Transfer input data to the GPU.
[cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
# Run inference.
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
# Transfer predictions back from the GPU.
[cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
# Synchronize the stream
stream.synchronize()
# The host buffers now hold the results
results = [out.host for out in outputs]
```
TODO