This article was first published on the WeChat public account [DeepDriving]. You are welcome to follow it.
Introduction to CV-CUDA
With the development of deep learning in computer vision, more and more AI models are used for tasks such as object detection, image segmentation, and image generation. Deploying these models efficiently on cloud or edge devices is a problem engineers urgently need to solve. A complete AI model deployment pipeline is generally divided into three stages: pre-processing, model inference, and post-processing. Model inference usually runs on a GPU or dedicated hardware, while pre-processing and post-processing run on the CPU. For a computer vision task, pre-processing and post-processing tend to consume considerable CPU resources and are time-consuming, which is especially noticeable on embedded platforms. Moving these operations to the GPU can greatly improve the efficiency of the whole pipeline.
CV-CUDA is an open-source library jointly developed by NVIDIA and ByteDance. It provides a set of specialized GPU operators for accelerating image processing and computer vision algorithms, making pre-processing and post-processing efficient and thereby significantly improving the overall throughput of vision AI tasks. Key features of the library include:
- A unified, specialized set of high-performance computer vision and image processing operators
- APIs for three programming languages: C, C++, and Python
- Support for batch processing
- Zero-copy interfaces for PyTorch and TensorFlow
- End-to-end computer vision application examples
Code repository: https://github.com/CVCUDA/CV-CUDA
Online documentation: https://cvcuda.github.io/
This article uses the deployment of the YOLOv6 object detection model as an example to introduce the application of CV-CUDA in computer vision tasks. See the end of the article for how to obtain the code.
Specific applications of CV-CUDA
OpenCV image preprocessing
I previously wrote an article on deploying YOLOv6 with TensorRT: How to deploy YOLOv6 with TensorRT. In that article, image preprocessing is implemented by calling OpenCV functions on the CPU. Before introducing image preprocessing with CV-CUDA, let us first review the operations that image preprocessing involves.
As shown in the figure above, image preprocessing in general computer vision tasks includes the following operations:
- Color space conversion: After an image is read, a color space conversion is generally required. For example, OpenCV reads images in BGR format, but the model requires RGB. The OpenCV function for color space conversion is cvtColor.
- Resizing: The size of the original image generally does not match the input size required by the model, so the image must be resized. The OpenCV function for this is resize.
- Normalization: Models are trained on floating-point data, so the pixel values must be divided by 255 for normalization. In OpenCV this is done by calling convertTo.
- Channel order conversion: The data layout of the original image is HWC, but models generally expect CHW, so the data must be rearranged.
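To make the data movement concrete, here is a minimal CPU sketch in plain C++ of three of the four steps (color conversion, normalization, and HWC to CHW rearrangement). It uses no OpenCV or CV-CUDA calls; the function name BgrHwcToRgbChw is hypothetical, and resizing is left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not an OpenCV or CV-CUDA function.
// Converts one packed 8-bit BGR image in HWC order into float RGB in CHW
// order, scaling every value by 1/255.
std::vector<float> BgrHwcToRgbChw(const std::vector<std::uint8_t>& bgr,
                                  int width, int height) {
  const std::size_t plane = static_cast<std::size_t>(width) * height;
  assert(bgr.size() == plane * 3);
  std::vector<float> chw(plane * 3);
  for (std::size_t i = 0; i < plane; ++i) {
    for (std::size_t c = 0; c < 3; ++c) {
      // Output channel c (R, G, B) reads input channel 2 - c (B, G, R).
      chw[c * plane + i] = bgr[i * 3 + (2 - c)] / 255.0f;
    }
  }
  return chw;
}
```

In the output, all red values come first, then all green values, then all blue values, which is the planar layout most inference engines expect.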
How to use CV-CUDA
The latest version of CV-CUDA is currently v0.3.0, which officially requires the following software environment:
Ubuntu >= 20.04
CUDA driver >= 11.7
(In my testing, CUDA 11.6 also works.)
First download the following two packages from the CV-CUDA GitHub repository:
nvcv-dev-0.3.0_beta-cuda11-x86_64-linux.tar.xz
nvcv-lib-0.3.0_beta-cuda11-x86_64-linux.tar.xz
Then use the following command to decompress:
tar -xvf nvcv-dev-0.3.0_beta-cuda11-x86_64-linux.tar.xz
tar -xvf nvcv-lib-0.3.0_beta-cuda11-x86_64-linux.tar.xz
After extraction, the CV-CUDA header files and library files are located in the opt/nvidia/cvcuda0 directory.
For usage, refer to the samples/classification example in the samples directory of the GitHub repository.
In CV-CUDA, data on the GPU is represented by nvcv::Tensor. Image preprocessing needs two tensors: one for the original input image and one for the model input data. Both can be constructed in advance from the dimensions of the original input image and the model input:
// Allocating memory for input image batch
nvcv::TensorDataStridedCuda::Buffer inBuf;
const int input_channels = input_image.channels();
const int input_width = input_image.cols;
const int input_height = input_image.rows;
// Byte strides of a packed NHWC uint8 image: element, pixel, row, image
inBuf.strides[3] = sizeof(uint8_t);
inBuf.strides[2] = input_channels * inBuf.strides[3];
inBuf.strides[1] = input_width * inBuf.strides[2];
inBuf.strides[0] = input_height * inBuf.strides[1];
// Batch size is 1, so one image worth of bytes is allocated
cudaMalloc(&inBuf.basePtr, 1 * inBuf.strides[0]);
nvcv::Tensor::Requirements inReqs = nvcv::Tensor::CalcRequirements(
    1, {input_width, input_height}, nvcv::FMT_BGR8);
nvcv::TensorDataStridedCuda inData(
    nvcv::TensorShape{inReqs.shape, inReqs.rank, inReqs.layout},
    nvcv::DataType{inReqs.dtype}, inBuf);
nvcv::TensorWrapData input_image_tensor(inData);
// Allocate input layer buffer based on input layer dimensions and batch size
// Calculates the resource requirements needed to create a tensor with given
// shape
nvcv::Tensor::Requirements reqsInputLayer = nvcv::Tensor::CalcRequirements(
    1, {model_width_, model_height_}, nvcv::FMT_RGBf32p);
// Calculates the total buffer size needed based on the requirements
int64_t inputLayerSize = nvcv::CalcTotalSizeBytes(
    nvcv::Requirements{reqsInputLayer.mem}.cudaMem());
nvcv::TensorDataStridedCuda::Buffer bufInputLayer;
std::copy(reqsInputLayer.strides,
          reqsInputLayer.strides + NVCV_TENSOR_MAX_RANK,
          bufInputLayer.strides);
// Allocate buffer size needed for the tensor
cudaMalloc(&bufInputLayer.basePtr, inputLayerSize);
// Wrap the tensor as a CVCUDA tensor
nvcv::TensorDataStridedCuda inputLayerTensorData(
    nvcv::TensorShape{reqsInputLayer.shape, reqsInputLayer.rank,
                      reqsInputLayer.layout},
    nvcv::DataType{reqsInputLayer.dtype}, bufInputLayer);
nvcv::TensorWrapData model_input_tensor(inputLayerTensorData);
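The stride arithmetic used for the input image buffer can be checked on its own. The following is a small host-only sketch, assuming a densely packed NHWC layout with no row padding, as in the code above. The helper name PackedNhwcStrides is hypothetical and is not part of CV-CUDA.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper, not part of CV-CUDA.
// Byte strides of a densely packed NHWC tensor:
// index 0 = one whole image, 1 = one row, 2 = one pixel, 3 = one element.
std::array<std::int64_t, 4> PackedNhwcStrides(int height, int width,
                                              int channels,
                                              std::size_t elem_size) {
  assert(height > 0 && width > 0 && channels > 0 && elem_size > 0);
  std::array<std::int64_t, 4> s{};
  s[3] = static_cast<std::int64_t>(elem_size);
  s[2] = channels * s[3];
  s[1] = width * s[2];
  s[0] = height * s[1];
  return s;
}
```

For a 640x480 BGR8 image this gives a row stride of 1920 bytes and an image stride of 921600 bytes, which is also the number of bytes to allocate per image in the batch.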
After constructing the tensor for the original input image, first copy the image data into it:
// copy image data to tensor
auto input_image_data =
    input_image_tensor.exportData<nvcv::TensorDataStridedCuda>();
cudaMemcpy(input_image_data->basePtr(), input_image.data,
           input_image_data->stride(0), cudaMemcpyHostToDevice);
Then you can call CV-CUDA operators to process the data.
The following uses resizing as an example to show how CV-CUDA operators are used. The operator class for resizing in CV-CUDA is cvcuda::Resize. Before calling the operator, construct a tensor to hold its output:
nvcv::Tensor resizedTensor(batch_size, {width, height}, nvcv::FMT_BGR8);
The method of operator invocation is very simple, requiring only two lines of code:
cvcuda::Resize resizeOp;
resizeOp(stream_, input_image_tensor, resizedTensor,NVCV_INTERP_LINEAR);
As you can see, the two lines above do only two things: create a cvcuda::Resize object named resizeOp, and invoke its () operator. How does this work? Interested readers can look at the source code; I will not post it here. The main idea is that the upper-level class cvcuda::Resize creates the underlying CUDA operator object in its constructor, and the overloaded () operator then calls that operator's execution function to carry out the actual work. Other CV-CUDA operators are designed the same way, so using them for image preprocessing is very simple. The operators needed are as follows:
- Color space conversion: cvcuda::CvtColor
- Resizing: cvcuda::Resize
- Normalization: cvcuda::ConvertTo
- Channel order conversion: cvcuda::Reformat
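The wrapper design described above, where the constructor creates the underlying operator and the overloaded () operator forwards to its execution function, can be illustrated with a host-only sketch. The classes ScaleKernel and ScaleOp are hypothetical stand-ins and do not reproduce the CV-CUDA source; the operation itself is a scale-and-offset in the style of OpenCV's convertTo (out = in * alpha + beta).

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical low-level operator; stands in for the underlying object.
class ScaleKernel {
 public:
  void Execute(const std::vector<float>& in, std::vector<float>& out,
               float alpha, float beta) const {
    out.resize(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
      out[i] = in[i] * alpha + beta;
    }
  }
};

// Hypothetical upper-level wrapper: the constructor creates the low-level
// operator, and operator() forwards the call to its Execute function.
class ScaleOp {
 public:
  ScaleOp() : impl_(new ScaleKernel()) {}

  void operator()(const std::vector<float>& in, std::vector<float>& out,
                  float alpha, float beta) const {
    assert(impl_ != nullptr);
    impl_->Execute(in, out, alpha, beta);
  }

 private:
  std::unique_ptr<ScaleKernel> impl_;
};
```

With this shape, a caller writes `ScaleOp op; op(in, out, 1.0f / 255.0f, 0.0f);`, which mirrors the two-line usage of resizeOp shown earlier.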
The code for the entire preprocessing process is as follows:
const int batch_size = 1;
// Resize to the dimensions of input layer of network
nvcv::Tensor resizedTensor(batch_size, {width, height}, nvcv::FMT_BGR8);
cvcuda::Resize resizeOp;
resizeOp(stream, input_image_tensor, resizedTensor, NVCV_INTERP_LINEAR);
// convert BGR to RGB
nvcv::Tensor rgbTensor(batch_size, {width, height}, nvcv::FMT_RGB8);
cvcuda::CvtColor cvtColorOp;
cvtColorOp(stream, resizedTensor, rgbTensor, NVCV_COLOR_BGR2RGB);
// Convert to data format expected by network (F32). Apply scale 1/255.
nvcv::Tensor floatTensor(batch_size, {width, height}, nvcv::FMT_RGBf32);
cvcuda::ConvertTo convertOp;
convertOp(stream, rgbTensor, floatTensor, 1.0 / 255.0, 0.0);
// Convert the data layout from HWC to CHW
cvcuda::Reformat reformatOp;
reformatOp(stream, floatTensor, model_input_tensor);
That is all the CV-CUDA code needed for image preprocessing. Simple, isn't it?
Summary
This article uses image preprocessing for YOLOv6 object detection as an example to introduce the application of CV-CUDA in computer vision tasks. Many operators are not covered here; interested readers can consult the CV-CUDA documentation and code directly. At present, CV-CUDA only provides prebuilt libraries for x86. It would be even better if ARM versions were provided, since embedded platforms are where this is most needed (I have not tried building from source on an embedded platform; interested readers can give it a try).
Follow the WeChat public account [DeepDriving] and reply with the keyword [YOLOv6] to get the code for this article, which supports deploying YOLOv5, YOLOv6, and YOLOv7.