Evolution of the YOLO family of models: from v1 to v8 (Part 2)

In yesterday's article, we reviewed the first nine architectures of the YOLO family. This article continues with the remaining three frameworks, as well as YOLOv8, released this month.

YOLOR

Chien-Yao Wang, I-Hau Yeh, Hong-Yuan Mark Liao

“You Only Learn One Representation: Unified Network for Multiple Tasks”, 2021/05, https://arxiv.org/pdf/2105.04206.pdf

The name stands for "You Only Learn One Representation". The authors state that it has nothing to do with the previous YOLO versions and that the concept is different from YOLO.

Humans have both implicit knowledge (generalized from previous experience) and explicit knowledge (gained through the senses). A human who understands what a picture shows can therefore process it better than an ordinary neural network, which lacks this combination.

Convolutional neural networks usually perform a single specific task; the goal of YOLOR is to train them to solve multiple tasks at once. While they learn to parse inputs into outputs, YOLOR tries to force the convolutional network to do two things:

  • Learn how to get the output
  • Try to determine what all the different outputs could be

So the model has multiple outputs, not one output.

YOLOR attempts to combine explicit and implicit knowledge. In neural networks, explicit knowledge is stored in the layers close to the input, while implicit knowledge is stored in the layers farther away. YOLOR combines the two into a unified neural network.

The paper addresses the key issues in integrating implicit and explicit knowledge in a neural network:

Methods such as kernel space alignment, prediction refinement, and multi-task learning are introduced into the learning process for implicit knowledge. Vectors, neural networks, and matrix factorization are used to model implicit knowledge, and their effectiveness is analyzed.
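To make the vector-based option concrete, here is a minimal PyTorch sketch of implicit knowledge as a learnable per-channel vector that is combined with explicit features by addition or multiplication. The class names follow the ImplicitA/ImplicitM naming in the official YOLOR code, but this is a simplified illustration, not the full implementation:

 import torch
 import torch.nn as nn
 
 class ImplicitA(nn.Module):
     """Implicit knowledge as a learnable vector added to a feature map."""
     def __init__(self, channels):
         super().__init__()
         self.implicit = nn.Parameter(torch.zeros(1, channels, 1, 1))
 
     def forward(self, x):  # x: (batch, channels, H, W)
         return x + self.implicit
 
 class ImplicitM(nn.Module):
     """Implicit knowledge combined multiplicatively instead."""
     def __init__(self, channels):
         super().__init__()
         self.implicit = nn.Parameter(torch.ones(1, channels, 1, 1))
 
     def forward(self, x):
         return x * self.implicit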

Advantages

Detection accuracy and inference speed were higher than those of competitors at launch

YOLOv6 / MT-YOLOv6

Meituan, China.

“YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications”, 2022/09, https://arxiv.org/pdf/2209.02976.pdf

Meituan's blog address: https://tech.meituan.com/2022/06/23/yolov6-a-fast-and-accurate-target-detection-framework-is-opening-source.html

The improvements in v6 focus on three aspects:

  • A hardware-friendly design of the backbone and neck
  • A more accurate decoupled head
  • More effective training strategies

The backbone and neck are designed around hardware characteristics, such as the computing characteristics of the processor cores and the memory bandwidth, for efficient inference.

Backbone and neck

The authors redesigned these parts of the architecture using EfficientRep and Rep-PAN blocks, respectively.

Experiments conducted by the Meituan team showed a significant reduction in latency together with an increase in detection accuracy. In particular, YOLOv6-nano is 21% faster and 3.6% more accurate than YOLOv5-nano.

Decoupled head

Decoupled heads first appeared in YOLOX; they compute the classification and regression parts of the network separately. In v6, this method has been improved.
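To illustrate the idea (this is a generic sketch, not the exact YOLOv6 layer layout), a decoupled head computes classification and box regression in separate branches:

 import torch
 import torch.nn as nn
 
 class DecoupledHead(nn.Module):
     """Separate branches for classification and box regression."""
     def __init__(self, c_in, num_classes):
         super().__init__()
         self.stem = nn.Conv2d(c_in, c_in, 1)
         self.cls_branch = nn.Sequential(
             nn.Conv2d(c_in, c_in, 3, padding=1), nn.SiLU(),
             nn.Conv2d(c_in, num_classes, 1))  # per-cell class scores
         self.reg_branch = nn.Sequential(
             nn.Conv2d(c_in, c_in, 3, padding=1), nn.SiLU(),
             nn.Conv2d(c_in, 4, 1))            # per-cell box values
 
     def forward(self, x):
         x = self.stem(x)
         return self.cls_branch(x), self.reg_branch(x)
 
 cls, box = DecoupledHead(256, 80)(torch.randn(1, 256, 20, 20))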

The training strategies include:

  • Anchor-free training
  • The SimOTA label assignment strategy
  • SIoU loss for box regression
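For orientation, the sketch below implements a plain IoU box-regression loss; SIoU builds on the same idea but adds angle, distance, and shape cost terms, which are omitted here for brevity:

 import torch
 
 def iou_loss(pred, target, eps=1e-7):
     """Plain IoU loss for boxes in (x1, y1, x2, y2) format."""
     # intersection rectangle
     x1 = torch.max(pred[:, 0], target[:, 0])
     y1 = torch.max(pred[:, 1], target[:, 1])
     x2 = torch.min(pred[:, 2], target[:, 2])
     y2 = torch.min(pred[:, 3], target[:, 3])
     inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
     area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
     area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
     iou = inter / (area_p + area_t - inter + eps)
     return (1.0 - iou).mean()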

Advantages

Detection accuracy and inference speed are higher than those of competitors

Built on the standard PyTorch framework, so it is easy to fine-tune

YOLOv7

Chien-Yao Wang, Alexey Bochkovskiy, Hong-Yuan Mark Liao.

The authors are the same team that created YOLOv4, so this can be considered an official YOLO release.

“YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors”, 2022/07, https://arxiv.org/pdf/2207.02696.pdf

The proposed method achieves state-of-the-art performance compared to other real-time models.

The main computational unit is E-ELAN (Extended Efficient Layer Aggregation Network).

Its design takes into account the following factors that affect the accuracy and speed of calculations:

  • Memory access cost
  • Input/output channel ratio
  • Element-wise operations
  • Activations
  • Gradient path

Different applications require different models. In some cases detection accuracy is more important - then the model should have more trainable parameters. In other cases, speed is more important and the model should be smaller in order to infer faster.

When scaling v7, the following hyperparameters need to be considered:

  • Input resolution
  • Width (number of channels)
  • Depth (number of layers)
  • Stage (number of feature pyramids)
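In practice, width and depth scaling are often expressed as multipliers, as in the YOLOv5-style YAML configs. A sketch with illustrative (made-up) values:

 # compound-scaling multipliers in the YOLOv5 YAML style (illustrative values)
 depth_multiple = 0.33  # scales how many times each block is repeated
 width_multiple = 0.25  # scales the number of channels in each layer
 
 def scaled_repeats(n: int) -> int:
     """Number of block repetitions after depth scaling."""
     return max(round(n * depth_multiple), 1)
 
 def scaled_channels(c: int) -> int:
     """Number of channels after width scaling."""
     return int(c * width_multiple)
 
 print(scaled_repeats(9), scaled_channels(1024))  # 3 and 256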

The figure below shows an example of compound model scaling.

The paper also discusses a set of methods that can improve the performance of the model without increasing the training cost.

Reparameterization is a technique applied after training to improve the model. It increases training time but improves inference performance. There are two types of reparameterization: model level and module level.

Model reparameterization can be done in two ways:

  • Train multiple models with the same settings using different training data. Then average their weights to get the final model.
  • Average model weights across training epochs.
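A minimal sketch of the second variant: averaging the weights of several checkpoints saved at different epochs (the checkpoint paths are hypothetical, and the files are assumed to be plain state dicts):

 import torch
 
 def average_checkpoints(paths):
     """Average model weights across several saved checkpoints."""
     avg = None
     for p in paths:
         state = torch.load(p, map_location="cpu")  # assumed: a plain state dict
         if avg is None:
             avg = {k: v.clone().float() for k, v in state.items()}
         else:
             for k in avg:
                 avg[k] += state[k].float()
     return {k: v / len(paths) for k, v in avg.items()}
 
 # model.load_state_dict(average_checkpoints(["epoch90.pt", "epoch95.pt", "epoch100.pt"]))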

Module-level reparameterization is more common in research. This approach splits the model training process into multiple modules, whose outputs are then ensembled to obtain the final model.
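A classic example of this kind of structural merging is folding a BatchNorm layer into the preceding convolution so that, at inference time, the pair becomes a single conv. This is a standard technique sketched for illustration, not a description of YOLOv7's specific blocks:

 import torch
 import torch.nn as nn
 
 def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
     """Fold eval-mode BatchNorm statistics into the conv's weight and bias."""
     fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                       conv.kernel_size, conv.stride,
                       conv.padding, bias=True)
     std = torch.sqrt(bn.running_var + bn.eps)
     scale = bn.weight / std                        # per-output-channel scale
     fused.weight.data = conv.weight * scale.reshape(-1, 1, 1, 1)
     bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
     fused.bias.data = (bias - bn.running_mean) * scale + bn.bias
     return fused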

In the v7 architecture there can be multiple heads performing different tasks, each with its own loss. The label assigner is a mechanism that considers the network predictions together with the ground truth and assigns soft labels rather than hard ones.
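A toy illustration of soft versus hard labels: the classification target of each assigned prediction is scaled by how well its box matches the ground truth (a heavy simplification of what the actual assigner does):

 import torch
 import torch.nn.functional as F
 
 # hard labels: every assigned prediction gets a target of 1.0
 hard_targets = torch.ones(4)
 
 # soft labels: scale each target by the prediction's IoU with its
 # matched ground-truth box, so poorly localized boxes count for less
 ious = torch.tensor([0.92, 0.71, 0.55, 0.83])
 soft_targets = ious.clamp(0.0, 1.0)
 
 logits = torch.randn(4)  # raw scores for the matched class
 loss = F.binary_cross_entropy_with_logits(logits, soft_targets)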

Advantages

Detection accuracy and inference speed at the time of release were higher than those of competitors

Built on the standard PyTorch framework, so it is easy to fine-tune

Summary of previous models

Before introducing v8, let's summarize the previous models.

Although the table above does not mention every improvement and discovery that boosted performance, we can see some patterns in the development of YOLO.

Backbone: initially a single branch (GoogLeNet, VGG, Darknet), later transitioning to architectures with skip connections (Cross-Stage Partial connections: CSPDarknet, CSPRepResNet, Extended-ELAN).

Neck: also initially a single branch, then gradually evolving through various modifications of the feature pyramid network, which maintains detection accuracy for objects at different scales.

Head: in earlier versions there was only one head, containing all output parameters (classification, bbox coordinates, etc.). Later research found it more efficient to separate them into different heads. There has also been a shift from anchor-based to anchor-free detection (except for v7, which still uses anchors for some reason).

Data augmentation: early augmentations such as affine transformations, HSV jitter, and exposure changes are simple and do not change the object's background or environment. More recent ones (MixUp, Mosaic, CutOut, etc.) change the content of the image. Balancing the two kinds of augmentation is important for training a neural network efficiently.

YOLOv8

All YOLO object detection models up to and including the original YOLOv3 were written in C and used the Darknet framework. Ultralytics released the first YOLO implemented in PyTorch, a port of YOLOv3. Shortly after the release of YOLOv3, Joseph Redmon left the computer vision research community.

After YOLOv3, Ultralytics released YOLOv5, and in January 2023, Ultralytics released YOLOv8.

YOLOv8 comes in five sizes for detection, segmentation, and classification. YOLOv8 Nano (YOLOv8n) is the fastest and smallest, while YOLOv8 Extra Large (YOLOv8x) is the most accurate but slowest. See the following figure for the specific models.

YOLOv8 comes with the following pretrained models:

  • The object detection models are trained on the COCO detection dataset at an image resolution of 640.
  • The instance segmentation models are trained on the COCO segmentation dataset at an image resolution of 640.
  • The image classification models are pretrained on the ImageNet dataset at an image resolution of 224.

The YOLOv8 models appear to perform better than previous YOLO models: YOLOv8 is ahead not only of YOLOv5 but also of YOLOv7, YOLOv6, and others.

Compared to other YOLO models trained at 640 image resolution, all YOLOv8 models have better throughput with a similar number of parameters.

Let's look at what has been updated in the model.

No YOLOv8 paper has been published yet, so we cannot get detailed information about the research methods and ablation studies behind its construction. But we can see the improvements in the code; the image below, made by GitHub user RangeKing, shows a detailed visualization of the network architecture.

YOLOv8 is an anchor-free model, which means it directly predicts the center of an object rather than an offset from a known anchor box. Anchors were a notoriously troublesome part of earlier YOLO models, since they may represent the box distribution of the benchmark dataset rather than that of a custom dataset.

Going anchor-free reduces the number of predicted boxes, which speeds up non-maximum suppression (NMS). The following figure visualizes the detection head of v8.
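Since NMS runtime grows with the number of candidate boxes, fewer predictions translate directly into faster post-processing. A quick sketch using torchvision's NMS:

 import torch
 from torchvision.ops import nms
 
 boxes = torch.rand(1000, 4) * 640  # candidate boxes
 boxes[:, 2:] += boxes[:, :2]       # convert to valid (x1, y1, x2, y2)
 scores = torch.rand(1000)
 
 keep = nms(boxes, scores, iou_threshold=0.45)  # indices of surviving boxes
 print(f"{len(keep)} boxes survive NMS")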

New convolutions

The first 6x6 conv in the stem was replaced with a 3x3 conv, and the main building block changed from C3 to C2f. The module is summarized in the figure below, where "f" is the number of features, "e" is the expansion rate, and CBS is a block composed of a Conv, a BatchNorm, and a SiLU activation.

In C2f, all outputs of the Bottleneck (two 3x3 convolutions with a residual connection) are concatenated, whereas in C3 only the output of the last Bottleneck is used.

The Bottleneck is the same as in YOLOv5, except that the first conv's kernel size was changed from 1x1 to 3x3. This suggests YOLOv8 is starting to revert to the ResNet block defined in 2015.
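Putting these pieces together, here is a minimal PyTorch sketch of CBS, the Bottleneck, and the C2f block. It is simplified from the Ultralytics code, and the argument names are illustrative:

 import torch
 import torch.nn as nn
 
 class CBS(nn.Module):
     """Conv + BatchNorm + SiLU, the basic building block."""
     def __init__(self, c_in, c_out, k=1, s=1):
         super().__init__()
         self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
         self.bn = nn.BatchNorm2d(c_out)
         self.act = nn.SiLU()
 
     def forward(self, x):
         return self.act(self.bn(self.conv(x)))
 
 class Bottleneck(nn.Module):
     """Two 3x3 convolutions with a residual connection."""
     def __init__(self, c):
         super().__init__()
         self.cv1 = CBS(c, c, 3)
         self.cv2 = CBS(c, c, 3)
 
     def forward(self, x):
         return x + self.cv2(self.cv1(x))
 
 class C2f(nn.Module):
     """C2f: the outputs of ALL Bottlenecks are concatenated."""
     def __init__(self, c_in, c_out, n=1, e=0.5):
         super().__init__()
         self.c = int(c_out * e)                 # hidden channels
         self.cv1 = CBS(c_in, 2 * self.c, 1)
         self.m = nn.ModuleList(Bottleneck(self.c) for _ in range(n))
         self.cv2 = CBS((2 + n) * self.c, c_out, 1)
 
     def forward(self, x):
         y = list(self.cv1(x).chunk(2, 1))       # split into two halves
         y.extend(m(y[-1]) for m in self.m)      # keep every output
         return self.cv2(torch.cat(y, 1))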

In the neck, features are concatenated directly without forcing them to the same channel size, which reduces the parameter count and the overall size of the tensors.

Mosaic augmentation

Deep learning research tends to focus on model architectures, but the training process in YOLOv5 and YOLOv8 is an important part of their success.

YOLOv8 augments images online during training: at each epoch, the model sees a slightly different variation of each image.

One of these is mosaic augmentation, which stitches four images together, forcing the model to learn objects in new locations, under partial occlusion, and against different surrounding pixels.

Experience shows, however, that this augmentation degrades performance if applied throughout the entire training run; turning it off for the last 10 training epochs improves results.
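A bare-bones sketch of the stitching step; real implementations also randomize the mosaic center, scale each tile, and remap the box labels accordingly:

 import torch
 
 def mosaic(imgs):
     """Stitch four same-sized CHW image tensors into a 2x2 mosaic."""
     top = torch.cat([imgs[0], imgs[1]], dim=2)      # left | right
     bottom = torch.cat([imgs[2], imgs[3]], dim=2)
     return torch.cat([top, bottom], dim=1)          # top over bottom
 
 tiles = [torch.rand(3, 320, 320) for _ in range(4)]
 print(mosaic(tiles).shape)  # torch.Size([3, 640, 640])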

The following metrics are from the Ultralytics GitHub repository.

As can be seen, YOLOv8's accuracy and inference latency are currently state of the art.

YOLOv8 code structure

YOLOv8 reuses much of YOLOv5's code, but with a new structure in which the same code supports different task types such as classification, instance segmentation, and object detection. Models are still initialized with the same YOLOv5 YAML format, and the dataset format also remains the same.

Ultralytics also provides a CLI, which should be familiar to many YOLOv5 users; training, detection, and export can all be done from the command line:

 yolo task=detect mode=val model={HOME}/runs/detect/train/weights/best.pt data={dataset.location}/data.yaml

The pip package can also easily allow us to perform custom development and fine-tuning training:

 from ultralytics import YOLO
 
 # Load a model
 model = YOLO("yolov8n.yaml")  # build a new model from scratch
 model = YOLO("yolov8n.pt")  # load a pretrained model (recommended for training)
 
 # Use the model
 results = model.train(data="coco128.yaml", epochs=3)  # train the model
 results = model.val()  # evaluate model performance on the validation set
 results = model("https://ultralytics.com/images/bus.jpg")  # predict on an image
 success = YOLO("yolov8n.pt").export(format="onnx")  # export a model to ONNX format

This makes it easy to train on our own dataset. There are many articles on specific training methods, so we won't go into them here.

If you are interested, you can take a look at the official description (in Chinese): https://avoid.overfit.cn/post/7596c1ba2d9544189f46e1abf6445c60
