AI model deployment in practice: Using CV-CUDA to accelerate the deployment of vision models

This article was first published on the WeChat public account [DeepDriving]; you are welcome to follow it.

Introduction to CV-CUDA

With the development of deep learning in computer vision, more and more AI models are being used for tasks such as object detection, image segmentation, and image generation. Deploying these models efficiently on the cloud or on edge devices is a problem that engineers urgently need to solve. A complete AI model deployment pipeline is generally divided into three stages: pre-processing, model inference, and post-processing. Model inference is usually run on a GPU or dedicated hardware, while pre-processing and post-processing are run on the CPU. For a computer vision task, pre-processing and post-processing tend to consume considerable CPU resources and can be very time-consuming, which is especially noticeable on embedded platforms. If these pre-processing and post-processing operations can be moved onto the GPU, the execution efficiency of the entire pipeline improves greatly.

CV-CUDA is an open-source library jointly developed by NVIDIA and ByteDance. It provides a set of specialized GPU operators for accelerating image processing and computer vision algorithms, enabling efficient pre-processing and post-processing pipelines and thereby significantly improving the overall throughput of vision AI tasks. Key features of the CV-CUDA library include:

  • A unified, professional set of high-performance computer vision and image processing operators
  • APIs for the C, C++, and Python programming languages
  • Support for batch processing
  • Zero-copy interfaces for PyTorch and TensorFlow
  • End-to-end computer vision application examples

Code repository: https://github.com/CVCUDA/CV-CUDA

Online documentation: https://cvcuda.github.io/

This article takes the deployment of the YOLOv6 object detection model as an example to introduce the application of CV-CUDA in computer vision tasks. See the end of the article for how to obtain the code.

Specific applications of CV-CUDA

OpenCV image preprocessing

I previously wrote an article about deploying YOLOv6 with TensorRT: How to deploy YOLOv6 with TensorRT. In that article, image preprocessing was implemented on the CPU by calling OpenCV functions. Before introducing image preprocessing with CV-CUDA, let us first review the operations that image preprocessing needs to perform.

Image preprocessing in general computer vision tasks includes the following operations (a CPU sketch with OpenCV follows the list):

  • Color space conversion: after an image is read, a color space conversion is generally needed. For example, OpenCV reads images in BGR format, while the model expects RGB; the OpenCV function for color space conversion is cvtColor.
  • Resizing: the size of the original image generally does not match the input size required by the model, so a resize is needed; the OpenCV function for this is resize.
  • Normalization: the model is trained on floating-point data, so pixel values need to be divided by 255 for normalization; in OpenCV this can be done with convertTo.
  • Data layout conversion: the original image layout is HWC, but most models expect CHW, so the channel order must be rearranged.
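
For reference, here is a minimal CPU sketch of these four steps with OpenCV. It is only an illustration: the 640x640 size (YOLOv6's common input resolution) and the function name are my assumptions, not code from the original article:

#include <opencv2/opencv.hpp>
#include <cstring>
#include <vector>

// Minimal CPU preprocessing sketch (sizes and names are illustrative)
std::vector<float> Preprocess(const cv::Mat &bgr, int w = 640, int h = 640)
{
    cv::Mat rgb, resized, floatImg;
    cv::cvtColor(bgr, rgb, cv::COLOR_BGR2RGB);          // color space conversion
    cv::resize(rgb, resized, cv::Size(w, h));           // resize to model input
    resized.convertTo(floatImg, CV_32FC3, 1.0 / 255.0); // normalize to [0, 1]

    // HWC -> CHW: split the channels and copy each plane contiguously
    std::vector<cv::Mat> planes(3);
    cv::split(floatImg, planes);
    std::vector<float> chw(3 * w * h);
    for (int c = 0; c < 3; ++c)
        std::memcpy(chw.data() + c * w * h, planes[c].data, w * h * sizeof(float));
    return chw;
}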

How to use CV-CUDA

The latest version of CV-CUDA at the time of writing is v0.3.0, which officially requires the following software environment:

  • Ubuntu >= 20.04
  • CUDA driver >= 11.7 (in my testing, CUDA 11.6 also works; a quick way to check your versions programmatically is shown below)
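
If you want to confirm the driver and runtime versions from code rather than from nvidia-smi, the CUDA runtime API provides version queries; a minimal sketch:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);   // e.g. 11070 means CUDA 11.7
    cudaRuntimeGetVersion(&runtimeVersion);
    std::printf("driver: %d, runtime: %d\n", driverVersion, runtimeVersion);
    return 0;
}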

First, download the following two packages from the CV-CUDA GitHub repository:

  • nvcv-dev-0.3.0_beta-cuda11-x86_64-linux.tar.xz
  • nvcv-lib-0.3.0_beta-cuda11-x86_64-linux.tar.xz

Then extract them with the following commands:

tar -xvf nvcv-dev-0.3.0_beta-cuda11-x86_64-linux.tar.xz
tar -xvf nvcv-lib-0.3.0_beta-cuda11-x86_64-linux.tar.xz

After extraction, the CV-CUDA header files and library files are placed under the opt/nvidia/cvcuda0 directory.

For usage, refer to the samples in the samples/classification directory of the GitHub repository. In CV-CUDA, data on the GPU is represented by nvcv::Tensor. The image preprocessing pipeline needs two tensors: one for the original input image and one for the model input data. Both can be built in advance from the dimensions of the original input image and the model input dimensions:

// Allocate memory for the input image batch (NHWC layout, BGR8):
// strides[3] = bytes per element, strides[2] = bytes per pixel,
// strides[1] = bytes per row, strides[0] = bytes per image
nvcv::TensorDataStridedCuda::Buffer inBuf;
const int input_channels = input_image.channels();
const int input_width = input_image.cols;
const int input_height = input_image.rows;
inBuf.strides[3] = sizeof(uint8_t);
inBuf.strides[2] = input_channels * inBuf.strides[3];
inBuf.strides[1] = input_width * inBuf.strides[2];
inBuf.strides[0] = input_height * inBuf.strides[1];
cudaMalloc(&inBuf.basePtr, 1 * inBuf.strides[0]);

// Calculate the resource requirements for a tensor of the given shape
nvcv::Tensor::Requirements inReqs = nvcv::Tensor::CalcRequirements(
    1, {input_width, input_height}, nvcv::FMT_BGR8);

// Wrap the pre-allocated buffer as a CV-CUDA tensor
nvcv::TensorDataStridedCuda inData(
    nvcv::TensorShape{inReqs.shape, inReqs.rank, inReqs.layout},
    nvcv::DataType{inReqs.dtype}, inBuf);
nvcv::TensorWrapData input_image_tensor(inData);

// Allocate input layer buffer based on input layer dimensions and batch size
// Calculates the resource requirements needed to create a tensor with given
// shape
nvcv::Tensor::Requirements reqsInputLayer = nvcv::Tensor::CalcRequirements(
    1, {model_width_, model_height_}, nvcv::FMT_RGBf32p);
// Calculates the total buffer size needed based on the requirements
int64_t inputLayerSize = nvcv::CalcTotalSizeBytes(
    nvcv::Requirements{reqsInputLayer.mem}.cudaMem());
nvcv::TensorDataStridedCuda::Buffer bufInputLayer;
std::copy(reqsInputLayer.strides,
        reqsInputLayer.strides + NVCV_TENSOR_MAX_RANK,
        bufInputLayer.strides);
// Allocate buffer size needed for the tensor
cudaMalloc(&bufInputLayer.basePtr, inputLayerSize);
// Wrap the tensor as a CVCUDA tensor
nvcv::TensorDataStridedCuda inputLayerTensorData(
    nvcv::TensorShape{reqsInputLayer.shape, reqsInputLayer.rank,
                    reqsInputLayer.layout},
    nvcv::DataType{reqsInputLayer.dtype}, bufInputLayer);
nvcv::TensorWrapData model_input_tensor(inputLayerTensorData);

After constructing the tensor for the original input image, first copy the image data into it:

// Copy the image data from host memory into the tensor's device buffer
auto input_image_data =
    input_image_tensor.exportData<nvcv::TensorDataStridedCuda>();
cudaMemcpy(input_image_data->basePtr(), input_image.data,
           input_image_data->stride(0), cudaMemcpyHostToDevice);
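
In production code, the return values of CUDA calls such as cudaMalloc and cudaMemcpy should be checked. Here is a minimal helper sketch (the official CV-CUDA samples define a similar CHECK_CUDA_ERROR macro; this version is my own simplification):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Abort with a readable message if a CUDA call fails
#define CHECK_CUDA(call)                                                \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            std::fprintf(stderr, "CUDA error %s at %s:%d\n",            \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

With this in place, the copy above becomes CHECK_CUDA(cudaMemcpy(...)).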

Then the CV-CUDA operators can be called to process the data.

The following uses resizing as an example to show how a CV-CUDA operator is used. The operator class for resizing in CV-CUDA is cvcuda::Resize. Before calling the operator, a tensor must be constructed to hold the operator's output:

nvcv::Tensor resizedTensor(batch_size, {width, height}, nvcv::FMT_BGR8);

The method of operator invocation is very simple, requiring only two lines of code:

cvcuda::Resize resizeOp;
resizeOp(stream, input_image_tensor, resizedTensor, NVCV_INTERP_LINEAR);

As you can see, these two lines do just two things: create a cvcuda::Resize object, resizeOp, and invoke its operator(). How is this implemented under the hood? If you are interested, take a look at the source code; I will not post it here. The main idea is that the high-level class creates the underlying CUDA operator object in its constructor, and the operator() overload calls that operator's execution function, which launches the actual CUDA work. All the other operators follow the same design (a simplified sketch of the pattern follows the list below), so using CV-CUDA for image preprocessing is actually very simple. The operators needed are as follows:

  • Color space conversion: cvcuda::CvtColor
  • Resizing: cvcuda::Resize
  • Normalization: cvcuda::ConvertTo
  • Data layout conversion: cvcuda::Reformat
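
To make that design concrete, here is a simplified, self-contained sketch of the wrapper pattern described above. This is not the actual CV-CUDA source; all names prefixed with My are hypothetical stand-ins for the library's internals:

#include <cuda_runtime.h>

// Hypothetical low-level operator API (stand-in for the library internals)
struct MyResizeImpl { /* kernel parameters, workspace, ... */ };
void myResizeCreate(MyResizeImpl **impl) { *impl = new MyResizeImpl{}; }
void myResizeDestroy(MyResizeImpl *impl) { delete impl; }
void myResizeSubmit(MyResizeImpl *, cudaStream_t /*stream*/,
                    const void * /*in*/, void * /*out*/, int /*interp*/)
{
    // In the real library this would enqueue a CUDA kernel on the given stream
}

// High-level wrapper in the style described above: the constructor owns the
// underlying operator object, and operator() submits work to a CUDA stream
class MyResize
{
public:
    MyResize() { myResizeCreate(&impl_); }
    ~MyResize() { myResizeDestroy(impl_); }
    void operator()(cudaStream_t stream, const void *in, void *out, int interp)
    {
        myResizeSubmit(impl_, stream, in, out, interp);
    }

private:
    MyResizeImpl *impl_ = nullptr;
};

A nice consequence of this design is that an operator object can be constructed once and reused for every frame, avoiding repeated creation of the underlying CUDA resources.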

The code for the entire preprocessing process is as follows:

const int batch_size = 1;

// Resize to the dimensions of input layer of network
nvcv::Tensor resizedTensor(batch_size, {width, height}, nvcv::FMT_BGR8);
cvcuda::Resize resizeOp;
resizeOp(stream, input_image_tensor, resizedTensor, NVCV_INTERP_LINEAR);

// convert BGR to RGB
nvcv::Tensor rgbTensor(batch_size, {width, height}, nvcv::FMT_RGB8);
cvcuda::CvtColor cvtColorOp;
cvtColorOp(stream, resizedTensor, rgbTensor, NVCV_COLOR_BGR2RGB);

// Convert to data format expected by network (F32). Apply scale 1/255.
nvcv::Tensor floatTensor(batch_size, {width, height}, nvcv::FMT_RGBf32);
cvcuda::ConvertTo convertOp;
convertOp(stream, rgbTensor, floatTensor, 1.0 / 255.0, 0.0);

// Convert the data layout from HWC to CHW
cvcuda::Reformat reformatOp;
reformatOp(stream, floatTensor, model_input_tensor);

That is all the code needed for image preprocessing with CV-CUDA. Quite simple, isn't it?
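
One practical note: CV-CUDA operators are enqueued asynchronously on the given CUDA stream. If model inference is not enqueued on that same stream, synchronize before consuming the preprocessed tensor, for example:

// Wait for all preprocessing work on `stream` to finish before handing
// model_input_tensor to an inference engine running on another stream
cudaStreamSynchronize(stream);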

Summary

This article took the image preprocessing of YOLOv6 object detection as an example to introduce the application of CV-CUDA in computer vision tasks. Many operators are not covered here; interested readers can consult the CV-CUDA documentation and code directly. At present, CV-CUDA only provides x86 builds of the library; ARM builds would be even better, since they are sorely needed on embedded platforms (I have not tried building from source on an embedded platform; interested readers can give it a try).

Follow the WeChat public account [DeepDriving] and reply with the keyword [YOLOv6] in the background to get the code for this article, which can deploy YOLOv5/YOLOv6/YOLOv7.
