TensorRt (4): Loading and testing YOLOv3

This article describes taking the original pretrained yolov3.weights model from the darknet project, converting it to an ONNX model with the TensorRT sample script, then building and optimizing it into an engine, and finally running inference with the TensorRT runtime. The post-processing of the inference results is implemented in C++, and the issues encountered along the way are explained.

1. Darknet model conversion

1.1 Converting the yolov3-608.weights model to an .onnx model

The directory samples\python\yolov3_onnx provides the conversion script, along with a download.yml file listing the download URLs; note that the script currently targets the yolov3-608 model. First, manually download the required yolov3-608 weights and configuration files.
Before running yolov3_to_onnx.py, install the dependencies listed in requirements.txt. After installation, modify the file paths in the script to point at the downloaded files (instead of having the script download them).
Running the script may then produce the following error:

RESTART: D:\Librarys\TensorRT-8.4.3.1\samples\python\yolov3_onnx\yolov3_to_onnx.py 
Traceback (most recent call last):
  File "D:\Librarys\TensorRT-8.4.3.1\samples\python\yolov3_onnx\yolov3_to_onnx.py", line 713, in <module>
    main()
  File "D:\Librarys\TensorRT-8.4.3.1\samples\python\yolov3_onnx\yolov3_to_onnx.py", line 679, in main
    layer_configs = parser.parse_cfg_file(cfg_file_path)
  File "D:\Librarys\TensorRT-8.4.3.1\samples\python\yolov3_onnx\yolov3_to_onnx.py", line 60, in parse_cfg_file
    layer_dict, layer_name, remainder = self._next_layer(remainder)
  File "D:\Librarys\TensorRT-8.4.3.1\samples\python\yolov3_onnx\yolov3_to_onnx.py", line 101, in _next_layer
    layer_param_block, remainder = remainder.split("\n\n", 1)
ValueError: not enough values to unpack (expected 2, got 1)

The error is caused by the way the script parses the yolov3-608.cfg file: two extra newlines need to be appended at the end of the file.
Run the script again; it now completes successfully and generates the yolov3-608.onnx file.

1.2 Building and testing an engine from the ONNX model

First, test with the samples\python\yolov3_onnx\onnx_to_tensorrt.py script; the console output is:

>>> 
 RESTART: D:\Librarys\TensorRT-8.4.3.1\samples\python\yolov3_onnx\onnx_to_tensorrt.py 
Loading ONNX file from path yolov3-608.onnx...
Beginning ONNX file parsing
Completed parsing of ONNX file
Building an engine from file yolov3-608.onnx; this may take a while...
Completed creating Engine
Running inference on image dog.jpg...
[[134.94207705 219.31212725 184.33163918 324.49731879]
 [ 98.63979515 136.01691416 499.64743187 298.43303029]
 [477.80247374  81.31213914 210.93741343  86.85370009]] [0.99852537 0.99885143 0.93933186] [16  1  7]
Saved image with bounding boxes of detected objects to dog_bboxes.png.

According to the output text, three objects are detected, with the corresponding bounding boxes, confidences, and class IDs; the resulting image dog_bboxes.png shows the detected boxes.
In addition, the following options are configured during the build:

config.max_workspace_size = 1 << 28  # 256MiB
builder.max_batch_size = 1

network.get_input(0).shape = [1, 3, 608, 608]

A serialized engine file is then generated, which can later be loaded directly by the runtime for inference.

1.3 Converting with the C++ API

int build()
{
    samplesCommon::OnnxSampleParams params;
    params.dataDirs.push_back(R"(data/yolo/)");
    params.dataDirs.push_back(R"(samples\python\yolov3_onnx\)");
    params.onnxFileName = "yolov3-608.onnx";
    params.inputTensorNames.push_back("000_net");               // (1, 3, 608, 608)
    params.outputTensorNames.push_back("082_convolutional");    // (1, 255, 19, 19)
    params.outputTensorNames.push_back("094_convolutional");    // (1, 255, 38, 38)
    params.outputTensorNames.push_back("106_convolutional");    // (1, 255, 76, 76)
    params.dlaCore = -1;
    //params.int8 = true;
    //params.fp16 = true;
    params.batchSize = 1;

    // create builder, network, config and ONNX parser
    auto builder = SampleUniquePtr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(sample::gLogger.getTRTLogger()));

    const auto explicitBatch = 1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    auto network = SampleUniquePtr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(explicitBatch));

    auto config = SampleUniquePtr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());

    auto parser = SampleUniquePtr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, sample::gLogger.getTRTLogger()));
    auto parsed = parser->parseFromFile(locateFile(params.onnxFileName, params.dataDirs).c_str(),
                                        static_cast<int>(sample::gLogger.getReportableSeverity()));

    if(params.fp16) {
        config->setFlag(BuilderFlag::kFP16);
    }
    if(params.int8) {
        config->setFlag(BuilderFlag::kINT8);
        samplesCommon::setAllDynamicRanges(network.get(), 127.0f, 127.0f);
    }
    samplesCommon::enableDLA(builder.get(), config.get(), params.dlaCore);

    auto profileStream = samplesCommon::makeCudaStream();
    config->setProfileStream(*profileStream);

    // matches the settings used in the Python script
    builder->setMaxBatchSize(params.batchSize);
    config->setMaxWorkspaceSize(1 << 28);
    network->getInput(0)->setDimensions(nvinfer1::Dims{4, { 1, 3, 608, 608 }});

    // build the serialized engine and write it to disk
    SampleUniquePtr<IHostMemory> plan{builder->buildSerializedNetwork(*network, *config)};

    std::ofstream ofs("yolov3-608.trt", std::ostream::binary);
    ofs.write(static_cast<const char*>(plan->data()), plan->size());
    ofs.close();

    return 0;
}

Different build and optimization options produce engine files of different sizes.

2. Inference with the C++ runtime API

2.1 Network forward-inference code

int inference()
{
    // read the engine file into memory
    std::string trtFile = locateFile("yolov3-608.trt", {"."});

    std::ifstream ifs(trtFile, std::ifstream::binary);
    if(!ifs) {
        return false;
    }

    ifs.seekg(0, std::ios_base::end);
    int size = ifs.tellg();
    ifs.seekg(0, std::ios_base::beg);

    std::unique_ptr<char[]> pData(new char[size]);
    ifs.read(pData.get(), size);
    ifs.close();

    // deserialize the engine
    SampleUniquePtr<IRuntime> runtime{createInferRuntime(sample::gLogger.getTRTLogger())};
    auto mEngine = std::shared_ptr<nvinfer1::ICudaEngine>(
        runtime->deserializeCudaEngine(pData.get(), size), samplesCommon::InferDeleter());

    auto mInputDims = mEngine->getBindingDimensions(0);   // [1, 3, 608, 608]
    std::vector<nvinfer1::Dims> mOutputDims{
        mEngine->getBindingDimensions(1),   // [1, 255, 19, 19]
        mEngine->getBindingDimensions(2),   // [1, 255, 38, 38]
        mEngine->getBindingDimensions(3)    // [1, 255, 76, 76]
    };

    samplesCommon::BufferManager buffers(mEngine);
    auto context = SampleUniquePtr<nvinfer1::IExecutionContext>(mEngine->createExecutionContext());

    // input
    cv::Mat img = cv::imread("dog.jpg");
    cv::Mat blob = cv::dnn::blobFromImage(img, 1 / 255., {608, 608}, {0, 0, 0}, true, false);

    float* hostDataBuffer = static_cast<float*>(buffers.getHostBuffer(mEngine->getBindingName(0))); // input
    memcpy(hostDataBuffer, blob.data, blob.total() * blob.elemSize());  // 1*3*608*608

    // inference
    buffers.copyInputToDevice();
    context->executeV2(buffers.getDeviceBindings().data());
    buffers.copyOutputToHost();

    // outputs
    std::vector<cv::Mat> outs{
        cv::Mat({1, 255, 19, 19}, CV_32F, static_cast<float*>(buffers.getHostBuffer(mEngine->getBindingName(1)))),
        cv::Mat({1, 255, 38, 38}, CV_32F, static_cast<float*>(buffers.getHostBuffer(mEngine->getBindingName(2)))),
        cv::Mat({1, 255, 76, 76}, CV_32F, static_cast<float*>(buffers.getHostBuffer(mEngine->getBindingName(3)))),
    };

    // post-processing and visualization
    std::vector<int> inds;
    std::vector<float> confs;
    std::vector<cv::Rect> rects;
    YoloDetector::postprocess(outs, img.rows, img.cols, inds, confs, rects);

    std::vector<std::string> classes = loadClasses(R"(D:\Librarys\TensorRT-8.4.3.1\samples\python\yolov3_onnx\coco_labels.txt)");
    for(int i = 0; i < rects.size(); i++) {
        cv::rectangle(img, rects[i], {0, 0, 255}, 2);
        cv::putText(img, cv::format("%s: %.2f", classes[inds[i]].c_str(), confs[i]),
                    rects[i].tl() - cv::Point{16, 16}, cv::FONT_HERSHEY_SIMPLEX, 0.5, {0, 0, 255}, 2);
    }
    cv::imshow("img", img);
    cv::imwrite("result.jpg", img);
    cv::waitKey(0);

    return 0;
}

This code uses a few helper functions from the samples' common code, as well as the BufferManager object that wraps the CUDA buffers. All of it can be replaced with std::unique_ptr and raw CUDA calls at the cost of a few extra lines; see the earlier post "TensorRt (3): the C++ API in the mnist sample". A rough sketch of such a replacement is given below.
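As an illustration only, the following sketch (a hypothetical helper, not code from the TensorRT samples) allocates one device buffer per binding with cudaMalloc, copies the preprocessed input blob in, runs executeV2, and copies the outputs back to host vectors; it assumes every binding is FP32 and omits error handling.

#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <opencv2/core.hpp>
#include <vector>

// Hypothetical replacement for samplesCommon::BufferManager: manually managed
// CUDA buffers for one synchronous inference. Returns the outputs on the host.
static std::vector<std::vector<float>> runWithManualBuffers(nvinfer1::ICudaEngine& engine,
                                                            nvinfer1::IExecutionContext& context,
                                                            const cv::Mat& blob)
{
    const int nb = engine.getNbBindings();
    std::vector<void*> bindings(nb, nullptr);
    std::vector<size_t> bytes(nb, 0);
    for (int i = 0; i < nb; ++i) {
        nvinfer1::Dims d = engine.getBindingDimensions(i);
        size_t count = 1;
        for (int k = 0; k < d.nbDims; ++k) count *= static_cast<size_t>(d.d[k]);
        bytes[i] = count * sizeof(float);          // assumption: every binding is FP32
        cudaMalloc(&bindings[i], bytes[i]);
    }

    // host -> device for the input (binding 0), execute, then device -> host per output
    cudaMemcpy(bindings[0], blob.data, bytes[0], cudaMemcpyHostToDevice);
    context.executeV2(bindings.data());

    std::vector<std::vector<float>> outputs;
    for (int i = 1; i < nb; ++i) {
        outputs.emplace_back(bytes[i] / sizeof(float));
        cudaMemcpy(outputs.back().data(), bindings[i], bytes[i], cudaMemcpyDeviceToHost);
    }

    for (void* p : bindings) cudaFree(p);
    return outputs;
}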

The code that reads the class-name file is:

std::vector<std::string> loadClasses(const std::string& classesFile)
{
    std::vector<std::string> classes;
    if(!classesFile.empty()) {
        const std::string& file = classesFile;
        std::ifstream ifs(file.c_str());
        if(!ifs.is_open())
            CV_Error(cv::Error::StsError, "File " + file + " not found");
        std::string line;
        while(std::getline(ifs, line)) {
            classes.push_back(line);
        }
    }
    return classes;
}

The result, saved to result.jpg, matches the output of the Python script.

2.2 YOLO detector post-processing code

yolov3-608 has three outputs. If the exported ONNX graph shows batchSize=64 (inherited from the cfg file), simply change batch=64 to batch=1 in the cfg and re-run the conversion script to regenerate the ONNX model.


The three outputs correspond to three different scales: 19*19, 38*38 and 76*76. Each cell at each scale produces 3 predictions (one per anchor), and each prediction contains 4 box coordinates, an objectness score, and one confidence per class; for the current COCO setup that is 3*(4 + 1 + 80) = 255 channels. A small index-layout sketch follows.
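As a quick sanity check of this layout, a hypothetical helper (not from the original code) computing the flat NCHW index of anchor a, channel k, at grid cell (r, c) in one [1, 255, H, W] FP32 output could be:

// Flat NCHW index of channel (a*85 + k) at grid cell (r, c) in a [1, 255, H, W] blob,
// where k = 0..3 are the box coordinates, k = 4 is objectness, k = 5..84 are classes.
inline size_t yoloFlatIndex(int a, int k, int r, int c, int H, int W, int cell = 85)
{
    return (static_cast<size_t>(a) * cell + k) * H * W
         + static_cast<size_t>(r) * W + c;
}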

The main post-processing function is given directly below.

void YoloDetector::postprocess(const std::vector<cv::Mat> convBlobs,
    int frameHeight, int frameWidth,
    std::vector<int>& inds,
    std::vector<float>& confs,
    std::vector<cv::Rect>& rects)
{
    std::vector<cv::Mat> yoloBlobs(convBlobs.size());

    for (size_t ii = 0; ii < convBlobs.size(); ii++)
    {
        const cv::Mat& inpBlob = convBlobs[ii];
        cv::Mat& outBlob = yoloBlobs[ii];

        int batch_size = inpBlob.size[0];
        int channels = inpBlob.size[1];
        int rows = inpBlob.size[2];
        int cols = inpBlob.size[3];
        assert(channels == mYolov3Args.cell_size * mYolov3Args.anchor);

        //int hNorm = convBlobs.size() > 1 ? convBlobs[1].size[2] : rows;
        //int wNorm = convBlobs.size() > 1 ? convBlobs[1].size[3] : cols;
        int hNorm = mYolov3Args.input_resolution[0];
        int wNorm = mYolov3Args.input_resolution[1];

        // n*c*h*w => (n*h*w*3) * (pos + 1 + cls)
        //int sample_size = mYolov3Args.cell_size * (rows * cols * mYolov3Args.anchor);
        outBlob.create(batch_size * rows * cols * mYolov3Args.anchor, mYolov3Args.cell_size, CV_32F);

        for (int b = 0; b < batch_size; b++) {
            for (int r = 0; r < rows; r++) {
                for (int c = 0; c < cols; c++) {
                    for (int anc = 0; anc < mYolov3Args.anchor; anc++) {

                        int index = b * (mYolov3Args.anchor * rows * cols) + r * (mYolov3Args.anchor * cols) + c * mYolov3Args.anchor + anc;

                        // objectness
                        float x = inpBlob.at<float>(cv::Vec4i(b, anc * mYolov3Args.cell_size + 4, r, c));
                        outBlob.at<float>(index, 4) = logistic_activate(x);	// logistic activation

                        // class scores (scaled by objectness)
                        for (int cls = 0; cls < mYolov3Args.classes; cls++) {
                            float x = inpBlob.at<float>(cv::Vec4i(b, anc * mYolov3Args.cell_size + 4 + 1 + cls, r, c));
                            outBlob.at<float>(index, 4 + 1 + cls) = logistic_activate(x) * outBlob.at<float>(index, 4);
                        }

                        // box
                        const float scale_x_y = 1.0f;
                        float x_tmp = (logistic_activate(inpBlob.at<float>(cv::Vec4i(b, anc * mYolov3Args.cell_size + 0, r, c))) - 0.5f) * scale_x_y + 0.5f;
                        float y_tmp = (logistic_activate(inpBlob.at<float>(cv::Vec4i(b, anc * mYolov3Args.cell_size + 1, r, c))) - 0.5f) * scale_x_y + 0.5f;

                        const std::array<int,2>& anchor_ch = mYolov3Args.anchors[mYolov3Args.masks[ii][anc]];

                        outBlob.at<float>(index, 0) = (c + x_tmp) / cols;
                        outBlob.at<float>(index, 1) = (r + y_tmp) / rows;
                        outBlob.at<float>(index, 2) = std::exp(inpBlob.at<float>(cv::Vec4i(b, anc * mYolov3Args.cell_size + 2, r, c))) * anchor_ch[0] / wNorm;
                        outBlob.at<float>(index, 3) = std::exp(inpBlob.at<float>(cv::Vec4i(b, anc * mYolov3Args.cell_size + 3, r, c))) * anchor_ch[1] / hNorm;
                    }
                }
            }
        }
    }

    // all detections
    inds.clear();
    confs.clear();
    rects.clear();

    for (int i = 0; i < yoloBlobs.size(); ++i)
    {
        // Each yolo blob has shape NxC, where N is the number of candidate
        // detections and C is the number of classes + 5: the first 4 values are
        // [center_x, center_y, width, height] and the 5th is objectness.
        float* data = (float*)yoloBlobs[i].data;
        for (int j = 0; j < yoloBlobs[i].rows; ++j, data += yoloBlobs[i].cols)
        {
            cv::Mat scores = yoloBlobs[i].row(j).colRange(5, yoloBlobs[i].cols);
            cv::Point classIdPoint;
            double conf;
            minMaxLoc(scores, nullptr, &conf, nullptr, &classIdPoint);

            if (static_cast<float>(conf) < mYolov3Args.threshold)
                continue;

            int centerX = data[0] * frameWidth;
            int centerY = data[1] * frameHeight;
            int width = data[2] * frameWidth;
            int height = data[3] * frameHeight;

            int left = std::max(0, std::min(centerX - width / 2, frameWidth - 1));
            int top = std::max(0, std::min(centerY - height / 2, frameHeight - 1));
            width = std::max(1, std::min(width, frameWidth - left));
            height = std::max(1, std::min(height, frameHeight - top));

            inds.push_back(classIdPoint.x);
            confs.push_back(static_cast<float>(conf));
            rects.emplace_back(left, top, width, height);
        }
    }

    // NMS over all detections, not per class (a per-class variant is sketched after the listings below)
    if (mYolov3Args.nms_threshold > 0) {

        std::vector<int> classIds;
        std::vector<float> confidences;
        std::vector<cv::Rect> boxes;

        std::vector<int> indices;
        cv::dnn::NMSBoxes(rects, confs, mYolov3Args.threshold, mYolov3Args.nms_threshold, indices);
        for (int idx : indices)
        {
            boxes.push_back(rects[idx]);
            confidences.push_back(confs[idx]);
            classIds.push_back(inds[idx]);
        }

        rects = std::move(boxes);
        inds = std::move(classIds);
        confs = std::move(confidences);
    }
}



struct Yolov3Args {
    // A list of 3 three-dimensional tuples for the YOLO masks
    std::vector<std::array<int,3>> masks{
        {6, 7, 8},
        {3, 4, 5},
        {0, 1, 2}
    };
    // A list of 9 two-dimensional tuples for the YOLO anchors
    std::vector<std::array<int,2>> anchors{
        {10, 13},
        {16, 30},
        {33, 23},
        {30, 61},
        {62, 45},
        {59, 119},
        {116, 90},
        {156, 198},
        {373, 326},
    };
    // Threshold for object coverage, float value between 0 and 1
    float threshold{ 0.6f };
    // Threshold for the non-max suppression algorithm, float value between 0 and 1
    float nms_threshold{ 0.5f };
    std::array<int,2> input_resolution{ 608, 608 };

    int anchor = 3;

    int classes = 80;
    int coords = 4;
    int cell_size = classes + coords + 1;
};


static float logistic_activate(float x) { return 1.F / (1.F + exp(-x)); }

static void softmax_activate(const float* input, const int n, const float temp, float* output)
{
    int i;
    float sum = 0;
    float largest = -FLT_MAX;
    for (i = 0; i < n; ++i) {
        if (input[i] > largest) largest = input[i];
    }
    for (i = 0; i < n; ++i) {
        float e = exp((input[i] - largest) / temp);
        sum += e;
        output[i] = e;
    }
    for (i = 0; i < n; ++i) {
        output[i] /= sum;
    }
}
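The NMS in postprocess() above runs over all detections jointly. If per-class NMS is preferred, a minimal sketch under the same data layout (an assumed variant using cv::dnn::NMSBoxes, not part of the original code) could look like this:

#include <map>
#include <vector>
#include <opencv2/core.hpp>
#include <opencv2/dnn.hpp>

// Sketch: run NMS separately for each class id instead of over all boxes at once.
// rects/confs/inds are the pre-NMS detections produced by postprocess().
static void nmsPerClass(std::vector<cv::Rect>& rects, std::vector<float>& confs,
                        std::vector<int>& inds, float scoreThr, float nmsThr)
{
    // group detection indices by class id
    std::map<int, std::vector<size_t>> byClass;
    for (size_t i = 0; i < inds.size(); ++i)
        byClass[inds[i]].push_back(i);

    std::vector<cv::Rect> keptRects;
    std::vector<float> keptConfs;
    std::vector<int> keptInds;
    for (const auto& kv : byClass) {
        std::vector<cv::Rect> clsRects;
        std::vector<float> clsConfs;
        for (size_t i : kv.second) {
            clsRects.push_back(rects[i]);
            clsConfs.push_back(confs[i]);
        }
        std::vector<int> keep;
        cv::dnn::NMSBoxes(clsRects, clsConfs, scoreThr, nmsThr, keep);
        for (int k : keep) {
            keptRects.push_back(clsRects[k]);
            keptConfs.push_back(clsConfs[k]);
            keptInds.push_back(kv.first);
        }
    }
    rects = std::move(keptRects);
    confs = std::move(keptConfs);
    inds = std::move(keptInds);
}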

3. Other issues

3.1 Converting the yolov3-416 model

In yolov3_to_onnx.py the conversion is driven by parsing the cfg file, and the output layer names (with their shapes) must be specified manually so that the weights can be exported to ONNX; therefore only the following part needs to be changed:

# shape of (in CHW format):         yolov3-608
output_tensor_dims = OrderedDict()
output_tensor_dims["082_convolutional"] = [255, 19, 19]
output_tensor_dims["094_convolutional"] = [255, 38, 38]
output_tensor_dims["106_convolutional"] = [255, 76, 76]

# shape of (in CHW format):         yolov3-416
output_tensor_dims = OrderedDict()
output_tensor_dims["082_convolutional"] = [255, 13, 13]
output_tensor_dims["094_convolutional"] = [255, 26, 26]
output_tensor_dims["106_convolutional"] = [255, 52, 52]

Inspecting the converted model confirms that its three outputs now have the expected 13*13, 26*26 and 52*52 shapes.
For testing, change the ONNX and TRT file names that the script loads; the modified lines are:

onnx_file_path = "yolov3-416.onnx"
engine_file_path = "yolov3-416.trt"

#input_resolution_yolov3_HW = (608, 608)
input_resolution_yolov3_HW = (416, 416)

#output_shapes = [(1, 255, 19, 19), (1, 255, 38, 38), (1, 255, 76, 76)]
output_shapes = [(1, 255, 13, 13), (1, 255, 26, 26), (1, 255, 52, 52)]

#network.get_input(0).shape = [1, 3, 608, 608]
network.get_input(0).shape = [1, 3, 416, 416]

The output of the run is:

>>> 
 RESTART: D:\Librarys\TensorRT-8.4.3.1\samples\python\yolov3_onnx\onnx_to_tensorrt.py 
Loading ONNX file from path yolov3-416.onnx...
Beginning ONNX file parsing
Completed parsing of ONNX file
Building an engine from file yolov3-416.onnx; this may take a while...
Completed creating Engine
Running inference on image dog.jpg...
[[122.5109715  225.01062311 199.10577509 317.32366383]
 [114.91165433 128.17760862 456.74105451 302.43399097]
 [472.04688328  86.89298721 219.78270777  79.65799847]] [0.99860307 0.98919281 0.95482076] [16  1  7]
Saved image with bounding boxes of detected objects to dog_bboxes.png.

Comparing the visualized results, the boxes produced by the 608 and 416 models differ slightly.

3.2 Efficiency of the hand-written post-processing

The test machine is a desktop with an i7-7700K, a GTX 1080 Ti, and CUDA 10.2, tested with the yolov3-416 model. First, a small timing helper class:

class TicketMeter {
public:
    TicketMeter(const std::string& info, int count = 1) : infoPrefix(info), count(count) {
        start = std::chrono::steady_clock::now();
    }

    ~TicketMeter() {
        end = std::chrono::steady_clock::now();
        std::cout << infoPrefix << ": "
                  << std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count() * 1.f / count
                  << " mills" << std::endl;
    }

private:
    std::chrono::steady_clock::time_point start, end;
    int count;
    std::string infoPrefix;
};

Then wrap the forward pass and the post-processing in the C++ test code as follows, running each 100 times:

    // inference
    buffers.copyInputToDevice();
    context->executeV2(buffers.getDeviceBindings().data());
    buffers.copyOutputToHost();
    {
        TicketMeter tm("========= inference avg time", 100);
        for(int i = 0; i < 100; i++)  context->executeV2(buffers.getDeviceBindings().data());
    }

    YoloDetector::postprocess(outs, img.rows, img.cols, inds, confs, rects);
    {
        TicketMeter tm("========= postprocess avg time", 100);
        for(int i = 0; i < 100; i++)  YoloDetector::postprocess(outs, img.rows, img.cols, inds, confs, rects);
    }

The run output is:

[12/27/2022-14:53:49] [I] [TRT] [MemUsageChange] Init CUDA: CPU +278, GPU +0, now: CPU 22733, GPU 1120 (MiB)
[12/27/2022-14:53:49] [I] [TRT] Loaded engine size: 383 MiB
[12/27/2022-14:53:50] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
[12/27/2022-14:53:50] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
========= inference avg time: 8.61 mills
========= postprocess avg time: 12.86 mills

Inference takes 8.61 ms while the network post-processing takes 12.86 ms, so roughly 60% of the total per-image processing time is spent on post-processing.


Timing results on another machine: a laptop with an i7-10750H, an RTX 2060, and CUDA 11.8. On paper this configuration is slightly weaker than the desktop above, but the hardware generation is newer and the CUDA version may also help, so it actually runs a bit faster.


The network outputs are in [N, C, H, W] layout; they can be rearranged into [N, H, W, C] so that, at the memory level, the 255 values belonging to each candidate box are contiguous, i.e. the output is reshaped into an [N*H*W, C=255] matrix. A rough sketch of this rearrangement is given below.
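The following hypothetical helper (not part of the original code) illustrates the idea: it reorders a single [1, 255, H, W] FP32 output into the [H*W*3, 85] row layout that postprocess() builds, so each candidate's 85 values become contiguous; the anchor count of 3 and cell size of 85 are the COCO assumptions from above.

#include <opencv2/core.hpp>

// Sketch: reorder one [1, C=255, H, W] FP32 output blob into a [H*W*3, 85] matrix
// so that each candidate's 85 values are contiguous (same row layout that
// postprocess() builds). Anchor count 3 and cell size 85 are COCO assumptions.
static cv::Mat nchwToCandidateRows(const cv::Mat& blob, int anchors = 3, int cell = 85)
{
    const int C = blob.size[1], H = blob.size[2], W = blob.size[3];
    CV_Assert(blob.size[0] == 1 && C == anchors * cell && blob.type() == CV_32F);

    cv::Mat out(H * W * anchors, cell, CV_32F);
    const float* src = blob.ptr<float>();
    for (int a = 0; a < anchors; ++a)
        for (int k = 0; k < cell; ++k)              // channel index within one anchor
            for (int r = 0; r < H; ++r)
                for (int c = 0; c < W; ++c)
                    out.at<float>((r * W + c) * anchors + a, k) =
                        src[((static_cast<size_t>(a) * cell + k) * H + r) * W + c];
    return out;
}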


TensorRT also ships NMS plugins that could replace the hand-written CPU code above and might further improve the overall pipeline efficiency.
TODO


Reposted from blog.csdn.net/wanggao_1990/article/details/128452292