Xilinx Vitis AI: Quantizing and Deploying YOLOv5 to the DPU (PYNQ)

This article and subsequent updates are also posted on my personal homepage:
https://lgyserver.top/index.php/2023/05/08/xilinx-vitis-ai%e9%87%8f%e5%8c%96%e9%83%a8%e7%bd%b2yolov5%e8%87%b3dpu-pynq/

Overview

This article walks through the whole process of quantizing a model from the YOLOv5 source code with Xilinx Vitis AI and deploying it on the DPU. The result was tested and verified in an open PYNQ environment.

Environment

Host: Ubuntu 22.04 + Vivado 2022.2 + Vitis AI 2.5.0 (installed using Docker) + CUDA 11.3

Development board: Xilinx Kria KV260 + PYNQ 3.0 + DPU-PYNQ 2.5.1

Versions are important!
This project uses PYNQ as the programming interface, so DPU-PYNQ's version support determines most of the version requirements. Generally, the Ubuntu and Vivado versions do not matter much, but check that your Vivado release supports the Kria device. As of this writing, DPU-PYNQ is at version 2.5.1, which officially supports only Vitis AI 2.5.0 and PYNQ 3.0.
In testing, the author found that an xmodel quantized and compiled with the newer Vitis AI 3.0.0 cannot be loaded by DPU-PYNQ: the Python kernel simply hangs without any error message. Vitis AI 2.5.0 does not have this problem.
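
On the board, you can quickly confirm which versions are actually installed. A tiny sketch using the standard importlib.metadata API (the package names are the pip names used later in this article):

from importlib.metadata import version
print(version("pynq"), version("pynq-dpu"))  # expect 3.0.x and 2.5.x for the setup described here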

Quantizing the model

The official Vitis AI GitHub repository below ships many pre-quantized models for testing in the model zoo folder, together with official benchmarks for common models. See the documentation for usage details.

Before starting, clone the Vitis AI repository. The following operations are all performed in the root directory of that repository:

git clone https://github.com/Xilinx/Vitis-AI.git

Install the Vitis AI environment

The official documentation is the first reference; the recommended installation method is Docker.

If you only need a CPU build of Vitis AI for compilation, things are simple: Xilinx provides pre-built Docker images for the CPU platform. After installing Docker on Ubuntu (refer to other tutorials for the Docker installation itself), just run:

docker pull xilinx/vitis-ai-<Framework>-<Arch>:<Version>

The Framework and Arch values supported by the pre-built images are:

Desired Docker                      Framework         Arch
PyTorch (CPU only)                  pytorch           cpu
TensorFlow 2 (CPU only)             tensorflow2       cpu
TensorFlow 1.15 (CPU only)          tensorflow        cpu
PyTorch (ROCm)                      pytorch           rocm
TensorFlow 2 (ROCm)                 tensorflow2       rocm
PyTorch with AI Optimizer (ROCm)    opt-pytorch       rocm
TF2 with AI Optimizer (ROCm)        opt-tensorflow2   rocm

Note that the Version must be compatible with your other dependencies; do not blindly pull latest. In this article the PyTorch CPU image (2.5.0) is pulled for compiling and running:

docker pull xilinx/vitis-ai-pytorch-cpu:2.5.0

However, if you want to use the CUDA cores of an NVIDIA GPU on your machine, things are more involved: you need to build your own image from Xilinx's Dockerfile. See the official documentation for this. (You may need to modify the Dockerfile to work around network restrictions in mainland China.)

Enter the Vitis AI root directory and modify docker_run.sh.

Find docker_run_params and comment out the mount parameters whose paths do not exist on your host:

    # -v /opt/xilinx/dsa:/opt/xilinx/dsa \
    # -v /opt/xilinx/overlaybins:/opt/xilinx/overlaybins \

Execute the following command to enter the Vitis AI environment:

./docker_run.sh xilinx/vitis-ai-pytorch-cpu:2.5.0

If you have not modified other parameters, the /workspace directory inside Docker maps to the root of the host's Vitis-AI repository.

Quantize and compile YOLOv5

For this part, please refer to the UG1414 document; the figures there illustrate the overall quantization and compilation workflow.

First, clone the original YOLOv5 repository; ultralytics/yolov5 is used here. ultralytics/ultralytics also contains a YOLOv5, but it adds many training tricks that make the source harder to modify, so the former is used.

After cloning, install the required dependencies:

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt

Modify the model

There are already many write-ups on the YOLOv5 model structure, so I will not go through it here. One detail worth noting is that YOLOv5 changed its activation function from ReLU to SiLU, and SiLU is not supported by the DPU. Therefore, before training you need to switch the activation back to ReLU or LeakyReLU: open the models folder in the yolov5 repository, edit the yaml file of the target network, and add the following line:

act: nn.ReLU()

Training & fine-tuning

Refer to the documentation in the yolov5 repository: train or fine-tune the model on another machine and export it as a PyTorch .pt file.
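
Before moving on to quantization, it is worth double-checking that the exported checkpoint really contains no SiLU layers any more. A minimal sketch, run from the yolov5 repository root (the checkpoint name matches the one used in the quantization script below):

import torch.nn as nn
from models.common import DetectMultiBackend

model = DetectMultiBackend("v5n_ReLU_best.pt")
silu_layers = [name for name, m in model.model.named_modules() if isinstance(m, nn.SiLU)]
print("SiLU layers remaining:", silu_layers)  # should be empty if the yaml change took effect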

Quantization

Model quantization can follow the demo provided in Vitis-AI. Quantization is split into two steps, calib and test: calib generates the model calibration information, and test exports the quantized xmodel.

Before quantization, the yolov5 code needs a small modification. The official documentation states that the model to be quantized should contain only the forward computation, but the yolov5 source file (models/yolo.py) contains the following code:

    def forward(self, x):
        z = []  # inference output
        for i in range(self.nl):
            x[i] = self.m[i](x[i])  # conv
            bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
            x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

            if not self.training:  # inference
                if self.dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:
                    self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)

                if isinstance(self, Segment):  # (boxes + masks)
                    xy, wh, conf, mask = x[i].split((2, 2, self.nc + 1, self.no - self.nc - 5), 4)
                    xy = (xy.sigmoid() * 2 + self.grid[i]) * self.stride[i]  # xy
                    wh = (wh.sigmoid() * 2) ** 2 * self.anchor_grid[i]  # wh
                    y = torch.cat((xy, wh, conf.sigmoid(), mask), 4)
                else:  # Detect (boxes only)
                    xy, wh, conf = x[i].sigmoid().split((2, 2, self.nc + 1), 4)
                    xy = (xy * 2 + self.grid[i]) * self.stride[i]  # xy
                    wh = (wh * 2) ** 2 * self.anchor_grid[i]  # wh
                    y = torch.cat((xy, wh, conf), 4)
                z.append(y.view(bs, self.na * nx * ny, self.no))

        return x if self.training else (torch.cat(z, 1),) if self.export else (torch.cat(z, 1), x)

During inference, this code adds the predicted xy offsets to the grid coordinates and multiplies by the stride to map the predictions back onto the original image, then returns the concatenated outputs of the detection heads. These steps belong to post-processing and must be removed before quantization, keeping only the raw network output x:

    def forward(self, x):
        z = []  # inference output
        for i in range(self.nl):
            x[i] = self.m[i](x[i])  # conv
            bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
            if self.grid[i].shape[2:4] != x[i].shape[2:4]:
                self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)
        return x

After this modification, you can write the quantization script, which is straightforward using the vai_q_pytorch (pytorch_nndct) package:

import os
import sys
import argparse
import random
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
from pytorch_nndct.apis import torch_quantizer, dump_xmodel
from common import *

from models.common import DetectMultiBackend
from models.yolo import Model

DIVIDER = '-----------------------------------------'

def quantize(build_dir,quant_mode,batchsize):

  dset_dir = build_dir + '/dataset'
  float_model = build_dir + '/float_model'
  quant_model = build_dir + '/quant_model'

  # use GPU if available   
  if (torch.cuda.device_count() > 0):
    print('You have',torch.cuda.device_count(),'CUDA devices available')
    for i in range(torch.cuda.device_count()):
      print(' Device',str(i),': ',torch.cuda.get_device_name(i))
    print('Selecting device 0..')
    device = torch.device('cuda:0')
  else:
    print('No CUDA devices available..selecting CPU')
    device = torch.device('cpu')

  # load trained model
  model = DetectMultiBackend("./v5n_ReLU_best.pt", device=device)

  # force to merge BN with CONV for better quantization accuracy
  optimize = 1

  # override batchsize if in test mode
  if (quant_mode=='test'):
    batchsize = 1
  
  rand_in = torch.randn([batchsize, 3, 960, 960])
  quantizer = torch_quantizer(quant_mode, model, (rand_in), output_dir=quant_model) 
  quantized_model = quantizer.quant_model

  # create a Data Loader
  test_dataset = CustomDataset('../../train/JPEGImages',transform=test_transform)

  test_loader = torch.utils.data.DataLoader(test_dataset,
                                            batch_size=batchsize, 
                                            shuffle=False)

  t_loader = torch.utils.data.DataLoader(test_dataset,
                                            batch_size=1 if quant_mode == 'test' else 10, 
                                            shuffle=False)

  # evaluate 
  test(quantized_model, device, t_loader)

  # export config
  if quant_mode == 'calib':
    quantizer.export_quant_config()
  if quant_mode == 'test':
    quantizer.export_xmodel(deploy_check=False, output_dir=quant_model)
  
  return

def run_main():

  # construct the argument parser and parse the arguments
  ap = argparse.ArgumentParser()
  ap.add_argument('-d',  '--build_dir',  type=str, default='build',    help='Path to build folder. Default is build')
  ap.add_argument('-q',  '--quant_mode', type=str, default='calib',    choices=['calib','test'], help='Quantization mode (calib or test). Default is calib')
  ap.add_argument('-b',  '--batchsize',  type=int, default=50,        help='Testing batchsize - must be an integer. Default is 50')
  args = ap.parse_args()

  print('\n'+DIVIDER)
  print('PyTorch version : ',torch.__version__)
  print(sys.version)
  print(DIVIDER)
  print(' Command line options:')
  print ('--build_dir    : ',args.build_dir)
  print ('--quant_mode   : ',args.quant_mode)
  print ('--batchsize    : ',args.batchsize)
  print(DIVIDER)

  quantize(args.build_dir,args.quant_mode,args.batchsize)

  return

if __name__ == '__main__':
    run_main()

The quantization script calls torch_quantizer to perform quantization. The quantized model must then be run once (evaluated) over a dataset; unlabeled images are enough, because they are only used to calibrate the quantization parameters and no back-propagation is performed.
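
The script imports CustomDataset, test_transform and test from a local common.py that is not shown in this article. Purely as an illustration of what those helpers might look like (the file layout, resize policy and transform handling are assumptions, not the author's actual code), a minimal version could be:

import glob
import cv2
import numpy as np
import torch
from torch.utils.data import Dataset

test_transform = None  # optional extra transform applied to the image tensor

class CustomDataset(Dataset):
    """Unlabeled image folder used only for quantization calibration."""
    def __init__(self, img_dir, transform=None, img_size=960):
        self.files = sorted(glob.glob(img_dir + "/*.jpg"))
        self.transform = transform
        self.img_size = img_size

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        im = cv2.imread(self.files[idx])
        im = cv2.resize(im, (self.img_size, self.img_size))
        im = im[:, :, ::-1].transpose(2, 0, 1) / 255.0  # BGR->RGB, HWC->CHW, normalize
        im = torch.from_numpy(np.ascontiguousarray(im)).float()
        return self.transform(im) if self.transform else im

def test(model, device, loader):
    """Run the model once over the calibration images so the quantizer can collect statistics."""
    model = model.to(device).eval()
    with torch.no_grad():
        for im in loader:
            model(im.to(device))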

Execute the Python script to generate the quantization configuration:

python quantize.py -q calib

(Screenshot of the calib run: https://lgyserver.top/wp-content/uploads/2023/05/image-4-1024x906.png)
Pay attention to any warnings produced during this step, such as unrecognized OPs; these are what later cause the model to be split into multiple DPU subgraphs. At this point build/quant_model already contains the generated Python model file, and you then need to run the test step to generate the xmodel:

python quantize.py -q test -b 1

(Screenshot of the test run: https://lgyserver.top/wp-content/uploads/2023/05/image-5-1024x66.png)
With this xmodel in hand, we use the compiler provided by Xilinx to compile it into an XIR-based xmodel that the DPU supports:

vai_c_xir -x ./build/quant_model/DetectMultiBackend_int.xmodel -a /opt/vitis_ai/compiler/arch/DPUCZDX8G/KV260/arch.json -o ./ -n my_model

(Screenshot of the vai_c_xir output: https://lgyserver.top/wp-content/uploads/2023/05/image-6-1024x261.png)
I am not sure whether it was an installation problem on my side, but the GPU Vitis AI image I built could not find the vai_c_xir command, so I used the GPU image to quantize and generate the xmodel and then used the pre-built CPU Docker image to compile the final xmodel.

Check whether the final number of DPU subgraphs is 1. If it is not, check whether your model contains OPs that the DPU does not support. When an unsupported OP is encountered, the model is split into multiple subgraphs: the unsupported parts run on the PS and their results are handed back to the DPU, which hurts efficiency. The generated xmodel can be opened in netron to inspect the network's input and output structure.
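
Besides reading the compiler output, you can count the DPU subgraphs programmatically with the xir Python module (available inside the Vitis AI container and, with VART, on the board); a small sketch following the pattern used in the official Vitis AI examples:

import xir

graph = xir.Graph.deserialize("my_model.xmodel")
subgraphs = graph.get_root_subgraph().toposort_child_subgraph()
dpu_subgraphs = [s for s in subgraphs
                 if s.has_attr("device") and s.get_attr("device").upper() == "DPU"]
print("total subgraphs:", len(subgraphs), "| DPU subgraphs:", len(dpu_subgraphs))  # want exactly 1 DPU subgraph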

Take a close look at your model structure!
Be sure to use netron to check the network's input and output structure. This is very important: the compiled xmodel is a quantized model and behaves differently from the original floating-point model run directly in Python. During deployment on the board, the input image must be quantized, and the quantized output must be converted back to floating point before post-processing such as NMS.

I used a 12-class yolov5n model trained by myself; the input of the upload node looks like this:

(Netron view of the upload node: https://lgyserver.top/wp-content/uploads/2023/05/image-7.png)
That is, the input image is an xint8 fixed-point tensor with the binary point at bit 6, and its shape is 1×960×960×3.

With the compiled xmodel generated, the model side of the work is done; the next step is deployment.

Deployment

Prepare for deployment

First of all, we need a DPU hardware design. It can be built by hand in Vivado as a block design, but that involves a lot of fiddly address assignments, so I will cover it in a separate article. Here we simply use the standard DPU hardware design provided by Xilinx, available in the boards folder of the DPU-PYNQ repository. To build the design following the README you need xrt and Vitis installed. The official script is fairly rigid and only recognizes version 2022.1, but you can edit check_env.sh to bypass the check:

cd DPU-PYNQ/boards
source <vitis-install-path>/Vitis/2022.2/settings64.sh
source <xrt-install-path>/xilinx/xrt/setup.sh
make BOARD=kv260_som

During synthesis I hit a timing-closure error that caused the build to fail. My workaround was to change the implementation strategy: edit prj_config and add prop=run.impl_1.strategy=Performance_Explore at the bottom of the [vivado] section, after which the build succeeded. I am still not sure what the root cause was.

After the script finishes, three files are generated:

  • dpu.bit
  • dpu.hwh
  • dpu.xclbin

plus the previously generated

  • my_model.xmodel

With these files ready, we can move on to deployment on PYNQ.

Install DPU-PYNQ

After setting up the PYNQ environment, you need to install DPU-PYNQ separately. It is the package that provides the Python interface for controlling the DPU, hosted in the repository mentioned above, and it can be installed directly via pip:

pip install pynq-dpu --no-build-isolation
cd $PYNQ_JUPYTER_NOTEBOOKS
pynq get-notebooks pynq-dpu -p .

Once this completes, the pynq_dpu package is available and example notebooks using it will appear in the Jupyter directory.

Deploy YOLOv5

Finally, the exciting deployment part! For the model to run, the PS side needs to:

  • Load the DPU overlay and the compiled model
  • Pre-process and quantize the input
  • Run DPU inference
  • Dequantize the output and run post-processing

We will address these steps one by one.

First, import the pynq_dpu package, which is built on top of PYNQ; its DpuOverlay class inherits from the PYNQ Overlay class:

from pynq_dpu import DpuOverlay
overlay = DpuOverlay("yolo5.bit")
overlay.load_model("yolo5.xmodel")

A few points to note:

  1. The argument to DpuOverlay must be a .bit file, and .xclbin and .hwh files with the same base name must exist in the same directory.
  2. The xmodel loaded here must be the one produced by vai_c_xir above; the xmodel exported in the quantizer's test phase cannot be used. Also, pynq-dpu currently only supports xmodels produced by Vitis AI 2.5.0; an xmodel compiled with a newer version will hang the notebook kernel.

Then define the input and output buffers:

import numpy as np

dpu = overlay.runner
inputTensors = dpu.get_input_tensors()
outputTensors = dpu.get_output_tensors()

shapeIn = tuple(inputTensors[0].dims)
shapeOut0 = tuple(outputTensors[0].dims)
shapeOut1 = tuple(outputTensors[1].dims)
shapeOut2 = tuple(outputTensors[2].dims)

outputSize0 = int(outputTensors[0].get_data_size() / shapeIn[0])
outputSize1 = int(outputTensors[1].get_data_size() / shapeIn[0])
outputSize2 = int(outputTensors[2].get_data_size() / shapeIn[0])

input_data = [np.empty(shapeIn, dtype=np.int8, order="C")]
output_data = [np.empty(shapeOut0, dtype=np.int8, order="C"),
               np.empty(shapeOut1, dtype=np.int8, order="C"),
               np.empty(shapeOut2, dtype=np.int8, order="C")]
image = input_data[0]

In the code above, the shapes of outputTensors should match what netron shows. In this article they are 1×120×120×36, 1×60×60×36 and 1×30×30×36, corresponding to the three detection heads of yolov5-nano.

In netron, the outputTensors returned by the DPU carry the data type of the download node, not the type produced by the final fix2float node; that conversion has to be done on the CPU, as shown in the figure below.
(Netron view of the download and fix2float nodes: https://lgyserver.top/wp-content/uploads/2023/06/image.png)
The DPU inference code can then be written by following the inference code of the original full-precision model. First the input image must be pre-processed: YOLOv5 expects normalized pixel values at a fixed size, so we use the letterbox function from the original code to resize and pad, normalize, and then quantize to int8.

import cv2
from utils.augmentations import letterbox  # from the yolov5 repo (utils/datasets.py in older releases)

im0 = cv2.imread('a.jpg')
im = letterbox(im0, new_shape=(960, 960), stride=32)[0]  # padded resize
im = im.transpose((2, 0, 1))  # HWC to CHW
im = np.ascontiguousarray(im)  # contiguous
im = np.transpose(im, (1, 2, 0)).astype(np.float32) / 255 * (2**6)  # normalize & quantize (fix_point = 6)
if len(im.shape) == 3:
    im = im[None]  # expand for batch dim

In this code, the image is transposed after the letterbox step (OpenCV's channel ordering differs from Torch's) and transposed back, and after the /255 normalization the final step multiplies by 2^6. Why the 6th power? This is where the figure above comes in: the upload node's data has its binary point at bit 6, so the scale factor is 2^6. Adjust this according to your own model.
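
Instead of hard-coding the scale read from netron, the fix_point can also be queried at run time from the tensors obtained earlier; this is how the official Vitis AI application examples read the scale (a sketch, reusing inputTensors and outputTensors from above):

in_fixpos = inputTensors[0].get_attr("fix_point")            # 6 for this model
input_scale = 2 ** in_fixpos                                  # multiply the normalized float image by this
output_scales = [2 ** -t.get_attr("fix_point") for t in outputTensors]
print(input_scale, output_scales)                             # expect 64 and [0.25, 0.125, 0.25] here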

Next, reshape the processed image into the DPU input shape and submit it to the DPU for execution:

image[0,...] = im.reshape(shapeIn[1:])
job_id = dpu.execute_async(input_data, output_data)  # image (defined above) is input_data[0]
dpu.wait(job_id)

After execution completes, the DPU result needs to be dequantized and reshaped. Recall the original forward code that we removed during quantization:

    def forward(self, x):
        z = []  # inference output
        for i in range(self.nl):
            x[i] = self.m[i](x[i])  # conv
            bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
            x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()

            if not self.training:  # inference
                if self.dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:
                    self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)

                if isinstance(self, Segment):  # (boxes + masks)
                    xy, wh, conf, mask = x[i].split((2, 2, self.nc + 1, self.no - self.nc - 5), 4)
                    xy = (xy.sigmoid() * 2 + self.grid[i]) * self.stride[i]  # xy
                    wh = (wh.sigmoid() * 2) ** 2 * self.anchor_grid[i]  # wh
                    y = torch.cat((xy, wh, conf.sigmoid(), mask), 4)
                else:  # Detect (boxes only)
                    xy, wh, conf = x[i].sigmoid().split((2, 2, self.nc + 1), 4)
                    xy = (xy * 2 + self.grid[i]) * self.stride[i]  # xy
                    wh = (wh * 2) ** 2 * self.anchor_grid[i]  # wh
                    y = torch.cat((xy, wh, conf), 4)
                z.append(y.view(bs, self.na * nx * ny, self.no))

        return x if self.training else (torch.cat(z, 1),) if self.export else (torch.cat(z, 1), x)

The DPU output is the raw model output, i.e. the result of x[i] = self.m[i](x[i]). The reshape that the original code performed, x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2), now has to be reproduced on the CPU, as below.

For the 1×120×120×36 detection head of this model, the post-processing part of yolov5 expects data shaped 1×3×120×120×12, so the 36 channels are first split into 3 anchors × 12 values and the 12-value dimension is then moved back to the end.

conv_out0 = np.transpose(output_data[0].astype(np.float32) / 4, (0, 3, 1, 2)).reshape(1, 3, 12, 120, 120).transpose(0, 1, 3, 4, 2)
conv_out1 = np.transpose(output_data[1].astype(np.float32) / 8, (0, 3, 1, 2)).reshape(1, 3, 12, 60, 60).transpose(0, 1, 3, 4, 2)
conv_out2 = np.transpose(output_data[2].astype(np.float32) / 4, (0, 3, 1, 2)).reshape(1, 3, 12, 30, 30).transpose(0, 1, 3, 4, 2)
pred = [conv_out0, conv_out1, conv_out2]

In the code above, the data taken out of output_data is dequantized first. Why divide by 4?
(Netron view of the download node for the 120×120 head: https://lgyserver.top/wp-content/uploads/2023/06/image-1.png)
The download node shows that the output of the 120×120 detection head is quantized with the binary point at bit 2, so the scale is 2^2 = 4 (likewise 2^3 = 8 for the 60×60 head).

Next, apply the original post-processing and NMS. The NMS step needs the anchor information of the original model, which you can dump by accessing the model parameters in the original yolov5 code:

model = DetectMultiBackend('yolov5.pt', device=device) 
print("nc: ",model.model.model[-1].nc)
print("anchors: ",model.model.model[-1].anchors)
print("nl: ",model.model.model[-1].nl)
print("na: ",model.model.model[-1].na)
print("stride: ",model.model.model[-1].stride)

From here on, the decoding and NMS are the same as in the original yolov5 code, so I will not go through them in detail; a rough sketch follows for reference.
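
Purely for reference, here is a rough sketch of that decode + NMS step, reproducing on the CPU the arithmetic removed from Detect.forward. The anchors and strides below are the yolov5 defaults and serve only as placeholders (substitute the values dumped above); non_max_suppression comes from the yolov5 utils, and a reasonably recent PyTorch is assumed for torch.meshgrid's indexing argument:

import numpy as np
import torch
from utils.general import non_max_suppression  # from the yolov5 repo

ANCHORS_PX = torch.tensor([[10, 13, 16, 30, 33, 23],
                           [30, 61, 62, 45, 59, 119],
                           [116, 90, 156, 198, 373, 326]], dtype=torch.float32).view(3, 3, 2)
STRIDE = torch.tensor([8., 16., 32.])
anchors = ANCHORS_PX / STRIDE.view(-1, 1, 1)  # yolov5 stores anchors divided by stride
no = 12                                       # outputs per anchor for this model (nc + 5)

def decode(pred):
    """Reproduce on the CPU the decode step that was removed from Detect.forward."""
    z = []
    for i, p in enumerate(pred):
        p = torch.from_numpy(np.ascontiguousarray(p))  # (1, na, ny, nx, no)
        bs, na, ny, nx, _ = p.shape
        yv, xv = torch.meshgrid(torch.arange(ny), torch.arange(nx), indexing="ij")
        grid = torch.stack((xv, yv), 2).view(1, 1, ny, nx, 2).float() - 0.5
        anchor_grid = (anchors[i] * STRIDE[i]).view(1, na, 1, 1, 2)
        y = p.sigmoid()
        y[..., 0:2] = (y[..., 0:2] * 2 + grid) * STRIDE[i]  # xy
        y[..., 2:4] = (y[..., 2:4] * 2) ** 2 * anchor_grid  # wh
        z.append(y.view(bs, -1, no))
    return torch.cat(z, 1)

det = non_max_suppression(decode(pred), conf_thres=0.25, iou_thres=0.45)[0]
print(det)  # (n, 6): xyxy, confidence, class -- still in 960x960 letterboxed coordinates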

Epilogue

The final model runs at roughly 50 fps on the DPU, which is already quite fast, but the pipeline is mainly bottlenecked by pre-processing: with large inputs the resize takes time, and the normalization step is also costly. I do not know whether the DPU will ever support normalization natively; I would like to build a design that pairs a resize IP with the DPU, which should speed things up considerably.
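
For reference, a minimal way to time just the DPU portion (reusing dpu, input_data and output_data from the code above); pre- and post-processing are excluded, which is exactly where most of the remaining time goes in this setup:

import time

runs = 100
start = time.time()
for _ in range(runs):
    job_id = dpu.execute_async(input_data, output_data)
    dpu.wait(job_id)
print(f"DPU-only throughput: {runs / (time.time() - start):.1f} fps")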

Thanks for reading.


Original article: blog.csdn.net/weixin_43192572/article/details/131306368