The Getting Started Tutorial for Model Deployment continues! After the previous articles, I believe everyone has a fairly complete picture of the ONNX intermediate representation. In an actual production environment, however, the ONNX model usually has to be converted into a model format that the target inference backend can use. In this tutorial, we introduce the well-known inference backend TensorRT.
Introduction to TensorRT
TensorRT is a deep learning inference SDK released by NVIDIA for running deep learning inference on NVIDIA hardware. TensorRT provides quantization-aware training and offline (post-training) quantization, and users can choose between the INT8 and FP16 optimization modes to deploy deep learning models in production for tasks such as video streaming, speech recognition, recommendation, fraud detection, text generation, and natural language processing. TensorRT is highly optimized for NVIDIA GPUs and is probably the fastest inference engine currently available for running models on them. More information about TensorRT can be found on the official TensorRT website.
Install TensorRT
Windows
We assume the machine has an NVIDIA graphics card and that CUDA and cuDNN are installed in advance. Log in to the NVIDIA website and download the TensorRT archive that matches the CUDA version on the host.
Taking CUDA 10.2 as an example, select the zip package built for CUDA 10.2. After the download completes, users with a conda virtual environment can switch to that environment first and then run commands similar to the following in PowerShell to install and test:
cd \the\path\of\tensorrt\zip\file
Expand-Archive TensorRT-8.2.5.1.Windows10.x86_64.cuda-10.2.cudnn8.2.zip .
$env:TENSORRT_DIR = "$pwd\TensorRT-8.2.5.1"
$env:path = "$env:TENSORRT_DIR\lib;" + $env:path
pip install $env:TENSORRT_DIR\python\tensorrt-8.2.5.1-cp36-none-win_amd64.whl
python -c "import tensorrt;print(tensorrt.__version__)"
The last command checks the TensorRT version after installation. If it prints 8.2.5.1, the Python package was installed successfully.
Linux
Similar to the Windows installation, we assume the machine has an NVIDIA graphics card and that CUDA and cuDNN are installed in advance. Log in to the NVIDIA website and download the TensorRT archive that matches the CUDA version on the host.
Taking CUDA 10.2 as an example, select the tar package built for CUDA 10.2 and then run commands similar to the following to install and test:
cd /the/path/of/tensorrt/tar/gz/file
tar -zxvf TensorRT-8.2.5.1.linux.x86_64-gnu.cuda-10.2.cudnn8.2.tar.gz
export TENSORRT_DIR=$(pwd)/TensorRT-8.2.5.1
export LD_LIBRARY_PATH=$TENSORRT_DIR/lib:$LD_LIBRARY_PATH
pip install TensorRT-8.2.5.1/python/tensorrt-8.2.5.1-cp37-none-linux_x86_64.whl
python -c "import tensorrt;print(tensorrt.__version__)"
If the printed result is 8.2.5.1, it means that the Python package is installed successfully.
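Beyond printing the version number, a quick way to confirm that the TensorRT Python package can actually load its CUDA and cuDNN dependencies is to create a Builder. The following is a minimal sanity-check sketch (an addition to the original steps):
import tensorrt as trt

# Creating a Builder forces TensorRT to load its CUDA/cuDNN libraries,
# so environment problems (missing DLLs / shared objects) surface right here.
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
print('TensorRT', trt.__version__, 'builder created:', builder is not None)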
Model building
There are two main ways to generate a TensorRT model:
- Build the network layer by layer directly through the API of TensorRT;
- Convert the intermediate representation model to a TensorRT model, such as converting an ONNX model to a TensorRT model.
Next, we will use both methods to build TensorRT models in Python and C++, and then run inference with the generated models.
Build directly
Building a network layer by layer with TensorRT's API is similar to building a network in a general training framework such as PyTorch or TensorFlow. Note that for layers with weights, such as convolution or normalization layers, the weight values also need to be assigned to the TensorRT network. This article does not cover that in detail; we only build a simple network that applies max pooling to its input.
Build with Python API
The first approach is to build the TensorRT network directly with the Python API. This mainly relies on two methods of tensorrt.Builder: create_builder_config and create_network, which create the config and the network respectively. The former sets parameters such as the maximum workspace size, while the latter is the body of the network, to which layers are added one by one. In addition, we need to define the input and output names, serialize the constructed network, and save it as a local engine file. Note that if you want the network to accept inputs and outputs of different resolutions, you need to create an optimization profile with tensorrt.Builder's create_optimization_profile method and set the minimum and maximum sizes.
The implementation code is as follows:
import tensorrt as trt
verbose = True
IN_NAME = 'input'
OUT_NAME = 'output'
IN_H = 224
IN_W = 224
BATCH_SIZE = 1
EXPLICIT_BATCH = 1 << (int)(
trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
TRT_LOGGER = trt.Logger(trt.Logger.VERBOSE) if verbose else trt.Logger()
with trt.Builder(TRT_LOGGER) as builder, builder.create_builder_config(
) as config, builder.create_network(EXPLICIT_BATCH) as network:
# define network
input_tensor = network.add_input(
name=IN_NAME, dtype=trt.float32, shape=(BATCH_SIZE, 3, IN_H, IN_W))
pool = network.add_pooling(
input=input_tensor, type=trt.PoolingType.MAX, window_size=(2, 2))
pool.stride = (2, 2)
pool.get_output(0).name = OUT_NAME
network.mark_output(pool.get_output(0))
# serialize the model to engine file
profile = builder.create_optimization_profile()
profile.set_shape_input('input', *[[BATCH_SIZE, 3, IN_H, IN_W]]*3)
builder.max_batch_size = 1
config.max_workspace_size = 1 << 30
engine = builder.build_engine(network, config)
with open('model_python_trt.engine', mode='wb') as f:
f.write(bytearray(engine.serialize()))
print("generating file done!")
Build with C++ API
For readers who want to build the network directly in C++, the whole process is very similar to the Python workflow above. The main points to note are:
- nvinfer1::createInferBuilder corresponds to tensorrt.Builder in Python. It must be passed an instance of the ILogger class, but ILogger is an abstract class, so the user is expected to subclass it and implement its virtual functions. Here we simply use the Logger subclass implemented in ../samples/common/logger.h of the samples folder shipped with the extracted TensorRT package.
- Setting the input size of the TensorRT model requires several calls to setDimensions on an IOptimizationProfile, which is a bit more cumbersome than in Python. The IOptimizationProfile is obtained from the createOptimizationProfile function, corresponding to Python's create_optimization_profile method.
The implementation code is as follows:
#include <fstream>
#include <iostream>
#include <NvInfer.h>
#include <../samples/common/logger.h>
using namespace nvinfer1;
using namespace sample;
const char* IN_NAME = "input";
const char* OUT_NAME = "output";
static const int IN_H = 224;
static const int IN_W = 224;
static const int BATCH_SIZE = 1;
static const int EXPLICIT_BATCH = 1 << (int)(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
int main(int argc, char** argv)
{
// Create builder
Logger m_logger;
IBuilder* builder = createInferBuilder(m_logger);
IBuilderConfig* config = builder->createBuilderConfig();
// Create model to populate the network
INetworkDefinition* network = builder->createNetworkV2(EXPLICIT_BATCH);
ITensor* input_tensor = network->addInput(IN_NAME, DataType::kFLOAT, Dims4{ BATCH_SIZE, 3, IN_H, IN_W });
IPoolingLayer* pool = network->addPoolingNd(*input_tensor, PoolingType::kMAX, DimsHW{ 2, 2 });
pool->setStrideNd(DimsHW{ 2, 2 });
pool->getOutput(0)->setName(OUT_NAME);
network->markOutput(*pool->getOutput(0));
// Build engine
IOptimizationProfile* profile = builder->createOptimizationProfile();
profile->setDimensions(IN_NAME, OptProfileSelector::kMIN, Dims4(BATCH_SIZE, 3, IN_H, IN_W));
profile->setDimensions(IN_NAME, OptProfileSelector::kOPT, Dims4(BATCH_SIZE, 3, IN_H, IN_W));
profile->setDimensions(IN_NAME, OptProfileSelector::kMAX, Dims4(BATCH_SIZE, 3, IN_H, IN_W));
config->setMaxWorkspaceSize(1 << 20);
ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
// Serialize the model to engine file
IHostMemory* modelStream{ nullptr };
assert(engine != nullptr);
modelStream = engine->serialize();
std::ofstream p("model.engine", std::ios::binary);
if (!p) {
std::cerr << "could not open output file to save model" << std::endl;
return -1;
}
p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());
std::cout << "generating file done!" << std::endl;
// Release resources
modelStream->destroy();
network->destroy();
engine->destroy();
builder->destroy();
config->destroy();
return 0;
}
Convert an IR model
In addition to building the network layer by layer and serializing the model directly through TensorRT's API, TensorRT also supports converting intermediate representation models (such as ONNX) into TensorRT models.
Convert using the Python API
We first use PyTorch to implement a model identical to the one above, i.e. one that only applies a single pooling operation to its input; then we convert the PyTorch model to an ONNX model; finally we convert the ONNX model to a TensorRT model.
The main TensorRT class used here is OnnxParser, which parses an ONNX model into a TensorRT network. In the end we obtain a TensorRT model whose behavior is identical to that of the model built with the previous method.
The implementation code is as follows:
import torch
import onnx
import tensorrt as trt
onnx_model = 'model.onnx'
class NaiveModel(torch.nn.Module):
def __init__(self):
super().__init__()
self.pool = torch.nn.MaxPool2d(2, 2)
def forward(self, x):
return self.pool(x)
device = torch.device('cuda:0')
# generate ONNX model
torch.onnx.export(NaiveModel(), torch.randn(1, 3, 224, 224), onnx_model, input_names=['input'], output_names=['output'], opset_version=11)
onnx_model = onnx.load(onnx_model)
# create builder and network
logger = trt.Logger(trt.Logger.ERROR)
builder = trt.Builder(logger)
EXPLICIT_BATCH = 1 << (int)(
trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
network = builder.create_network(EXPLICIT_BATCH)
# parse onnx
parser = trt.OnnxParser(network, logger)
if not parser.parse(onnx_model.SerializeToString()):
error_msgs = ''
for error in range(parser.num_errors):
error_msgs += f'{parser.get_error(error)}\n'
raise RuntimeError(f'Failed to parse onnx, {error_msgs}')
config = builder.create_builder_config()
config.max_workspace_size = 1<<20
profile = builder.create_optimization_profile()
profile.set_shape('input', [1, 3, 224, 224], [1, 3, 224, 224], [1, 3, 224, 224])
config.add_optimization_profile(profile)
# create engine
with torch.cuda.device(device):
engine = builder.build_engine(network, config)
with open('model.engine', mode='wb') as f:
f.write(bytearray(engine.serialize()))
print("generating file done!")
During IR conversion, if multiple batch sizes, multiple inputs, or dynamic shapes are required, they can be configured by calling the set_shape function multiple times. The parameters accepted by set_shape are: the input node name, the minimum acceptable input size, the optimal input size, and the maximum acceptable input size. The three sizes are generally required to be non-decreasing in every dimension (min ≤ opt ≤ max).
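As an illustration of dynamic shapes, the sketch below (an addition; it assumes the builder and config objects from the script above and an ONNX model exported with a dynamic batch axis for 'input') registers a profile covering batch sizes 1 to 4:
profile = builder.create_optimization_profile()
profile.set_shape('input',
                  (1, 3, 224, 224),   # minimum shape
                  (2, 3, 224, 224),   # optimal shape, tuned for the common case
                  (4, 3, 224, 224))   # maximum shape
config.add_optimization_profile(profile)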
Convert using the C++ API
Having shown how to convert an ONNX model to a TensorRT model in Python, we now do the same in C++. With NvOnnxParser, we can directly parse the ONNX file obtained in the previous section into a network.
The implementation code is as follows:
#include <fstream>
#include <iostream>
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <../samples/common/logger.h>
using namespace nvinfer1;
using namespace nvonnxparser;
using namespace sample;
int main(int argc, char** argv)
{
// Create builder
Logger m_logger;
IBuilder* builder = createInferBuilder(m_logger);
const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
IBuilderConfig* config = builder->createBuilderConfig();
// Create model to populate the network
INetworkDefinition* network = builder->createNetworkV2(explicitBatch);
// Parse ONNX file
IParser* parser = nvonnxparser::createParser(*network, m_logger);
bool parser_status = parser->parseFromFile("model.onnx", static_cast<int>(ILogger::Severity::kWARNING));
// Get the name of network input
Dims dim = network->getInput(0)->getDimensions();
if (dim.d[0] == -1) // -1 means it is a dynamic model
{
const char* name = network->getInput(0)->getName();
IOptimizationProfile* profile = builder->createOptimizationProfile();
profile->setDimensions(name, OptProfileSelector::kMIN, Dims4(1, dim.d[1], dim.d[2], dim.d[3]));
profile->setDimensions(name, OptProfileSelector::kOPT, Dims4(1, dim.d[1], dim.d[2], dim.d[3]));
profile->setDimensions(name, OptProfileSelector::kMAX, Dims4(1, dim.d[1], dim.d[2], dim.d[3]));
config->addOptimizationProfile(profile);
}
// Build engine
config->setMaxWorkspaceSize(1 << 20);
ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);
// Serialize the model to engine file
IHostMemory* modelStream{ nullptr };
assert(engine != nullptr);
modelStream = engine->serialize();
std::ofstream p("model.engine", std::ios::binary);
if (!p) {
std::cerr << "could not open output file to save model" << std::endl;
return -1;
}
p.write(reinterpret_cast<const char*>(modelStream->data()), modelStream->size());
std::cout << "generate file success!" << std::endl;
// Release resources
modelStream->destroy();
network->destroy();
engine->destroy();
builder->destroy();
config->destroy();
return 0;
}
Model inference
So far we have built TensorRT models in two ways and generated four TensorRT models across Python and C++; in theory, these four models behave identically.
Next, we run inference on the generated TensorRT models from both Python and C++.
Inference using the Python API
First, we use the Python API to run inference on the TensorRT model. Part of the code here is borrowed from MMDeploy. Running the code below, you can see that a 1x3x224x224 tensor goes in and a 1x3x112x112 tensor comes out, which is exactly what we expect after pooling the input.
from typing import Union, Optional, Sequence,Dict,Any
import torch
import tensorrt as trt
class TRTWrapper(torch.nn.Module):
def __init__(self,engine: Union[str, trt.ICudaEngine],
output_names: Optional[Sequence[str]] = None) -> None:
super().__init__()
self.engine = engine
if isinstance(self.engine, str):
with trt.Logger() as logger, trt.Runtime(logger) as runtime:
with open(self.engine, mode='rb') as f:
engine_bytes = f.read()
self.engine = runtime.deserialize_cuda_engine(engine_bytes)
self.context = self.engine.create_execution_context()
names = [_ for _ in self.engine]
input_names = list(filter(self.engine.binding_is_input, names))
self._input_names = input_names
self._output_names = output_names
if self._output_names is None:
output_names = list(set(names) - set(input_names))
self._output_names = output_names
def forward(self, inputs: Dict[str, torch.Tensor]):
assert self._input_names is not None
assert self._output_names is not None
bindings = [None] * (len(self._input_names) + len(self._output_names))
profile_id = 0
for input_name, input_tensor in inputs.items():
# check if input shape is valid
profile = self.engine.get_profile_shape(profile_id, input_name)
assert input_tensor.dim() == len(
profile[0]), 'Input dim is different from engine profile.'
for s_min, s_input, s_max in zip(profile[0], input_tensor.shape,
profile[2]):
assert s_min <= s_input <= s_max, \
'Input shape should be between ' \
+ f'{profile[0]} and {profile[2]}' \
+ f' but get {tuple(input_tensor.shape)}.'
idx = self.engine.get_binding_index(input_name)
# All input tensors must be gpu variables
assert 'cuda' in input_tensor.device.type
input_tensor = input_tensor.contiguous()
if input_tensor.dtype == torch.long:
input_tensor = input_tensor.int()
self.context.set_binding_shape(idx, tuple(input_tensor.shape))
bindings[idx] = input_tensor.contiguous().data_ptr()
# create output tensors
outputs = {}
for output_name in self._output_names:
idx = self.engine.get_binding_index(output_name)
dtype = torch.float32
shape = tuple(self.context.get_binding_shape(idx))
device = torch.device('cuda')
output = torch.empty(size=shape, dtype=dtype, device=device)
outputs[output_name] = output
bindings[idx] = output.data_ptr()
self.context.execute_async_v2(bindings,
torch.cuda.current_stream().cuda_stream)
return outputs
model = TRTWrapper('model.engine', ['output'])
output = model(dict(input = torch.randn(1, 3, 224, 224).cuda()))
print(output)
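As a small follow-up check (an addition for illustration, run right after the code above), the output shape can be verified against the expectation stated earlier:
# The max-pooling layer halves the spatial resolution, so a 1x3x224x224
# input should produce a 1x3x112x112 output.
assert output['output'].shape == (1, 3, 112, 112)
print('output shape:', tuple(output['output'].shape))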
Inference using the C++ API
Finally, many real production environments use C++ to complete the task, since it generally runs more efficiently. TensorRT users also tend to care most about its C++ interface, so we implement model inference once more in C++, which can also be compared against the Python API version above.
The implementation code is as follows:
#include <fstream>
#include <iostream>
#include <NvInfer.h>
#include <../samples/common/logger.h>
#define CHECK(status) \
do\
{\
auto ret = (status);\
if (ret != 0)\
{\
std::cerr << "Cuda failure: " << ret << std::endl;\
abort();\
}\
} while (0)
using namespace nvinfer1;
using namespace sample;
const char* IN_NAME = "input";
const char* OUT_NAME = "output";
static const int IN_H = 224;
static const int IN_W = 224;
static const int BATCH_SIZE = 1;
static const int EXPLICIT_BATCH = 1 << (int)(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
void doInference(IExecutionContext& context, float* input, float* output, int batchSize)
{
const ICudaEngine& engine = context.getEngine();
// Pointers to input and output device buffers to pass to engine.
// Engine requires exactly IEngine::getNbBindings() number of buffers.
assert(engine.getNbBindings() == 2);
void* buffers[2];
// In order to bind the buffers, we need to know the names of the input and output tensors.
// Note that indices are guaranteed to be less than IEngine::getNbBindings()
const int inputIndex = engine.getBindingIndex(IN_NAME);
const int outputIndex = engine.getBindingIndex(OUT_NAME);
// Create GPU buffers on device
CHECK(cudaMalloc(&buffers[inputIndex], batchSize * 3 * IN_H * IN_W * sizeof(float)));
CHECK(cudaMalloc(&buffers[outputIndex], batchSize * 3 * IN_H * IN_W /4 * sizeof(float)));
// Create stream
cudaStream_t stream;
CHECK(cudaStreamCreate(&stream));
// DMA input batch data to device, infer on the batch asynchronously, and DMA output back to host
CHECK(cudaMemcpyAsync(buffers[inputIndex], input, batchSize * 3 * IN_H * IN_W * sizeof(float), cudaMemcpyHostToDevice, stream));
context.enqueueV2(buffers, stream, nullptr);  // enqueueV2 is required for explicit-batch engines
CHECK(cudaMemcpyAsync(output, buffers[outputIndex], batchSize * 3 * IN_H * IN_W / 4 * sizeof(float), cudaMemcpyDeviceToHost, stream));
cudaStreamSynchronize(stream);
// Release stream and buffers
cudaStreamDestroy(stream);
CHECK(cudaFree(buffers[inputIndex]));
CHECK(cudaFree(buffers[outputIndex]));
}
int main(int argc, char** argv)
{
// create a model using the API directly and serialize it to a stream
char *trtModelStream{ nullptr };
size_t size{ 0 };
std::ifstream file("model.engine", std::ios::binary);
if (file.good()) {
file.seekg(0, file.end);
size = file.tellg();
file.seekg(0, file.beg);
trtModelStream = new char[size];
assert(trtModelStream);
file.read(trtModelStream, size);
file.close();
}
Logger m_logger;
IRuntime* runtime = createInferRuntime(m_logger);
assert(runtime != nullptr);
ICudaEngine* engine = runtime->deserializeCudaEngine(trtModelStream, size);
assert(engine != nullptr);
delete[] trtModelStream;  // the engine keeps its own copy of the serialized data
IExecutionContext* context = engine->createExecutionContext();
assert(context != nullptr);
// generate input data
float data[BATCH_SIZE * 3 * IN_H * IN_W];
for (int i = 0; i < BATCH_SIZE * 3 * IN_H * IN_W; i++)
data[i] = 1;
// Run inference
float prob[BATCH_SIZE * 3 * IN_H * IN_W /4];
doInference(*context, data, prob, BATCH_SIZE);
// Destroy the engine
context->destroy();
engine->destroy();
runtime->destroy();
return 0;
}
Summary
Through this article we have learned two ways to build a TensorRT model: constructing the network layer by layer through the TensorRT API, and converting an intermediate representation model into a TensorRT model. We have also built and run inference on TensorRT models in both C++ and Python. I believe everyone has gained something! In the next article, we will learn how to add custom TensorRT operators, so stay tuned~
FAQ
- Q: Running the code reports an error: Could not find: cudnn64_8.dll. Is it on your PATH?
- A: First check whether your PATH environment variable contains the directory of cudnn64_8.dll. If the cuDNN path it contains is C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.2\bin but that directory only holds cudnn64_7.dll, the solution is to download the cuDNN zip package from the NVIDIA website, unzip it, and copy cudnn64_8.dll into the bin directory of the CUDA Toolkit. Alternatively, you can make a copy of cudnn64_7.dll and rename the copy to cudnn64_8.dll, which also works around the problem. A small Python helper for checking your PATH is sketched after this list.
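The following sketch (an addition, not from the original FAQ) checks from Python whether cudnn64_8.dll is visible on PATH:
import os

dll = 'cudnn64_8.dll'
dirs = [d for d in os.environ.get('PATH', '').split(os.pathsep)
        if os.path.exists(os.path.join(d, dll))]
print(dirs if dirs else f'{dll} not found on PATH')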
References
GitHub - wang-xinyu/tensorrtx: Implementation of popular deep learning networks with TensorRT network definition API
GitHub - NVIDIA/TensorRT: TensorRT is a C++ library for high performance inference on NVIDIA GPUs and deep learning accelerators
Series Portal
OpenMMLab: Introduction to Model Deployment Tutorial (4): Supporting More ONNX Operators in PyTorch
OpenMMLab: Interpretation of TorchScript (2): Torch jit tracer implementation analysis
OpenMMLab: Interpretation of TorchScript (3): subgraph rewriter in jit
OpenMMLab: Interpretation of TorchScript (4): Alias Analysis in Torch jit