In yesterday's article, we reviewed the first nine architectures of the YOLO family. This article continues with the remaining three frameworks and with YOLOv8, which was released this month.
YOLOR
Chien-Yao Wang, I-Hau Yeh, Hong-Yuan Mark Liao
“You Only Learn One Representation: Unified Network for Multiple Tasks”, 2021/05, https://arxiv.org/pdf/2105.04206.pdf
The name can be rendered as "you learn only one representation". The authors state that YOLOR is not a continuation of the previous YOLO versions and that its concept differs from theirs.
Humans draw on both implicit (tacit) knowledge, which is a generalization of previous experience, and explicit knowledge, which comes from perception through the senses. A person who understands what is shown in a picture can therefore process it better than an ordinary neural network that lacks this understanding.
Convolutional neural networks usually perform a single specific task, whereas the goal of YOLOR is a network that can be trained to solve multiple tasks at the same time. While the network learns to map the input to an output, YOLOR forces it to do two things:
- Learn how to get the output
- Learn what all the different outputs might be
The model therefore has multiple outputs rather than a single one.
YOLOR attempts to combine explicit and implicit knowledge. In a neural network, explicit knowledge is stored in the layers close to the input, while implicit knowledge is stored in the deeper layers. Combining both makes YOLOR a unified neural network.
The paper addresses the key issues in integrating implicit and explicit knowledge in a neural network:
Kernel space alignment, prediction refinement, and multi-task learning are introduced into the learning process of implicit knowledge. Vectors, neural networks, and matrix factorization are used to model implicit knowledge, and their effectiveness is analyzed.
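As a rough illustration of modeling implicit knowledge as a vector, the sketch below combines a per-channel vector with a feature vector by addition or by multiplication. The function names and all numbers are hypothetical; in a real network the vector would be learned by backpropagation, and this is not the authors' implementation.

```python
# Minimal sketch: implicit knowledge modeled as a per-channel vector z
# that is combined with explicit features. Names and values are illustrative.

def combine_add(features, z):
    """Shift each channel by the implicit vector (additive combination)."""
    return [f + zi for f, zi in zip(features, z)]


def combine_mul(features, z):
    """Rescale each channel by the implicit vector (multiplicative combination)."""
    return [f * zi for f, zi in zip(features, z)]


explicit = [1.0, 2.0, 3.0]   # features computed from the input
z_shift = [0.5, 0.0, -1.0]   # would be learned during training
z_scale = [2.0, 1.0, 0.5]    # would be learned during training

shifted = combine_add(explicit, z_shift)
scaled = combine_mul(explicit, z_scale)
```

Because the vector does not depend on the input, it costs almost nothing at inference time, which is what makes this form attractive.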
Advantages
Higher detection accuracy and detection speed than competitors at the time of release
YOLOv6 / MT-YOLOv6
Meituan, China.
“YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications”, 2022/09, https://arxiv.org/pdf/2209.02976.pdf
Meituan's blog address: https://tech.meituan.com/2022/06/23/yolov6-a-fast-and-accurate-target-detection-framework-is-opening-source.html
The improvements in v6 focus on three aspects:
- A hardware-friendly design of the backbone and neck
- A more accurate decoupled head
- More effective training strategies
The backbone and neck are designed to take advantage of hardware aspects, such as the computing characteristics of processor cores, memory bandwidth, etc., for efficient inference.
The authors redesigned the backbone with EfficientRep blocks and the neck with Rep-PAN.
Experiments conducted by the Meituan team showed a significant reduction in computation latency together with an improvement in detection accuracy. In particular, the YOLOv6-nano model is 21% faster and 3.6% more accurate than the design used in YOLOv5-nano.
Decoupled head
A decoupled (forked) head computes the classification part and the regression part of the network separately. In the YOLO family this idea first appeared in YOLOX, while YOLOv5 still used a coupled head. In v6, this method has been improved.
The training strategies include:
- Anchor-free detection
- SimOTA label assignment
- SIoU loss for bounding-box regression
Advantages
Higher detection accuracy and detection speed than competitors
Built on the standard PyTorch framework, so it is easy to fine-tune
YOLOv7
Chien-Yao Wang, Alexey Bochkovskiy, Hong-Yuan Mark Liao.
The authors come from the same team as YOLOv4, so this release can be considered the official YOLO.
“YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors”, 2022/07, https://arxiv.org/pdf/2207.02696.pdf
The proposed method achieves state-of-the-art performance compared to other real-time models.
The main computational block is E-ELAN (Extended Efficient Layer Aggregation Network).
Its design takes into account the following factors that affect accuracy and computation speed:
- memory access cost
- input/output channel ratio
- element-wise operation
- activation
- gradient path
Different applications require different models. In some cases detection accuracy is more important - then the model should have more trainable parameters. In other cases, speed is more important and the model should be smaller in order to infer faster.
When scaling v7, the following hyperparameters need to be considered:
- input resolution
- Width (number of channels)
- Depth (number of layers)
- Stage (number of feature pyramids)
The figure below shows an example of compound model scaling.
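To make the idea of scaling depth and width together more concrete, here is a minimal sketch. The helper names, the base configuration, and the scaling factors are all hypothetical, and the sketch does not model the concatenation-specific details of the method in the paper.

```python
import math


def make_divisible(value, divisor=8):
    """Round up to the nearest multiple of `divisor` (hardware-friendly channel counts)."""
    return int(math.ceil(value / divisor) * divisor)


def scale_model(base_depths, base_widths, depth_factor, width_factor):
    """Scale per-stage layer counts and channel counts together."""
    depths = [max(1, round(d * depth_factor)) for d in base_depths]
    widths = [make_divisible(w * width_factor) for w in base_widths]
    return depths, widths


# Hypothetical base model: three stages with 3/6/9 layers and 64/128/256 channels.
depths, widths = scale_model([3, 6, 9], [64, 128, 256],
                             depth_factor=0.33, width_factor=0.5)
```

Smaller factors give a faster model with fewer trainable parameters, larger factors give a more accurate but slower one.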
The paper also discusses a set of methods (the "bag of freebies" in the title) that improve the accuracy of the model without increasing the inference cost.
Reparameterization is a technique applied to improve a model after training. It increases training time but improves inference performance. There are two types of reparameterization, model level and module level.
Model reparameterization can be done in two ways:
- Train multiple models with the same settings using different training data. Then average their weights to get the final model.
- Average model weights across training epochs.
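Both variants come down to an element-wise average of weights. The sketch below shows this for two checkpoints stored as plain dictionaries; the checkpoint names and values are made up for illustration, and real checkpoints would hold tensors rather than lists.

```python
def average_weights(checkpoints):
    """Element-wise mean of several weight dictionaries with identical keys and shapes."""
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        averaged[name] = [sum(col) / n for col in columns]
    return averaged


# Hypothetical weights saved at two different training epochs.
epoch_8 = {"conv.weight": [1.0, 2.0], "conv.bias": [0.0]}
epoch_9 = {"conv.weight": [3.0, 4.0], "conv.bias": [1.0]}

final = average_weights([epoch_8, epoch_9])
```

The averaged model has the same size and inference cost as any single checkpoint.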
Module-level reparameterization is the variant more commonly studied in recent research. During training it splits a module into several parallel branches, and for inference the branches are merged into a single equivalent module.
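One well-known form of this idea is RepVGG-style branch merging, used here as an assumption about what the text describes. The sketch uses a single-channel 1D convolution to keep it short: a 3-tap branch, a 1-tap branch, and an identity branch are trained in parallel and then folded into one 3-tap kernel that gives identical outputs.

```python
def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding so the output length equals the input length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


def merge_branches(kernel_3, kernel_1):
    """Fold a 3-tap branch, a 1-tap branch, and an identity branch into one 3-tap kernel."""
    identity = [0.0, 1.0, 0.0]
    padded_1 = [0.0, kernel_1[0], 0.0]
    return [a + b + c for a, b, c in zip(kernel_3, padded_1, identity)]


x = [1.0, 2.0, 3.0, 4.0]
k3, k1 = [1.0, 0.0, -1.0], [2.0]

# Training-time form: three parallel branches whose outputs are summed.
multi_branch = [a + b + c
                for a, b, c in zip(conv1d_same(x, k3), conv1d_same(x, k1), x)]

# Inference-time form: one equivalent kernel.
single_branch = conv1d_same(x, merge_branches(k3, k1))
```

The merge works because convolution is linear, so summing the branch outputs is the same as convolving once with the summed kernels.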
In the v7 architecture, there can be multiple heads performing different tasks, each with its own loss. A label assigner is a mechanism that considers the network predictions together with the ground truth and then assigns labels. It generates soft labels instead of hard labels.
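A simple way to picture the difference is to use the IoU between a predicted box and the ground-truth box as the target value instead of a fixed 1. The sketch below shows only this idea; the boxes are made up, and the actual assigner in YOLOv7 is more involved.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


ground_truth = (0.0, 0.0, 10.0, 10.0)
prediction = (5.0, 0.0, 15.0, 10.0)   # hypothetical network output

hard_label = 1.0                                 # the object is simply "present"
soft_label = box_iou(prediction, ground_truth)   # reflects how good the prediction is
```

A soft label lets the loss reward a well-localized prediction more than a poorly localized one for the same object.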
Advantages
Higher detection accuracy and detection speed than competitors at the time of release
Built on the standard PyTorch framework, so it is easy to fine-tune
Summary of previous models
Before introducing v8, let's summarize the previous models.
The table above does not list every improvement and discovery that boosted performance, but we can still see some patterns in the development of YOLO.
Backbone: initially a single branch (GoogLeNet, VGG, Darknet), later an architecture with skip connections (Cross-Stage Partial connections: CSPDarknet, CSPRepResNet, Extended-ELAN).
Neck: also a single branch at first, then gradually developed through various modifications of the feature pyramid network, which helps maintain detection accuracy for objects at different scales.
Head: earlier versions had only one head, which contained all output parameters (classification, bbox coordinates, and so on). Later research found it more effective to separate them into different heads. There has also been a shift from anchor-based to anchor-free detection (except for v7, which still has anchors for some reason).
Data augmentation: early augmentations such as affine transformations, HSV jitter, and exposure changes are simple and do not change the background or surroundings of the object. More recent ones, such as MixUp, Mosaic, and CutOut, change the content of the image. Balancing the two kinds of augmentation is important for efficient training of neural networks.
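As an example of an augmentation that changes image content, the following sketch blends two images in the manner of MixUp. It uses small grayscale images stored as nested lists, and the blending weight is fixed for clarity. The labels of the two images would have to be combined as well, which is omitted here.

```python
def mixup(image_a, image_b, lam):
    """Blend two equally sized grayscale images pixel by pixel with weight `lam`."""
    return [
        [lam * pa + (1.0 - lam) * pb for pa, pb in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]


# Two hypothetical 2x2 images.
img_a = [[0.0, 100.0], [200.0, 50.0]]
img_b = [[100.0, 100.0], [0.0, 150.0]]

mixed = mixup(img_a, img_b, lam=0.5)
```

In practice the weight is usually drawn at random for each pair of images rather than fixed.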
YOLO v8
All YOLO object detection models up to and including the original YOLOv3 were written in C and used the Darknet framework. Ultralytics released the first YOLO implemented in the PyTorch framework, a port of YOLOv3. Shortly after the release of YOLOv3, Joseph Redmon left the computer vision research community.
After YOLOv3, Ultralytics released YOLOv5, and in January 2023, Ultralytics released YOLOv8.
YOLOv8 comes in five model sizes for each of detection, segmentation, and classification. YOLOv8 Nano (YOLOv8n) is the fastest and smallest among them, and YOLOv8 Extra Large (YOLOv8x) is the most accurate but slowest. See the following figure for the specific models.
YOLOv8 comes with the following pretrained models:
- Object detection is trained on the COCO detection dataset with an image resolution of 640.
- Instance segmentation is trained on the COCO segmentation dataset with an image resolution of 640.
- The image classification model is pre-trained on the ImageNet dataset with an image resolution of 224.
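The combination of sizes and tasks can be enumerated with a small helper. The file-name pattern below is an assumption based on the `yolov8n.pt` name used later in this article and on the naming in the Ultralytics release; check the official model list before relying on it.

```python
SIZES = ["n", "s", "m", "l", "x"]   # nano, small, medium, large, extra large
TASK_SUFFIX = {"detect": "", "segment": "-seg", "classify": "-cls"}


def weight_name(size, task="detect"):
    """Build a pretrained-weights file name such as 'yolov8n-seg.pt' (assumed pattern)."""
    return f"yolov8{size}{TASK_SUFFIX[task]}.pt"


all_weights = [weight_name(size, task) for task in TASK_SUFFIX for size in SIZES]
```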
The YOLOv8 models appear to outperform the previous YOLO models: they are ahead not only of YOLOv5 but also of YOLOv7 and YOLOv6.
Compared to other YOLO models trained at 640 image resolution, all YOLOv8 models have better throughput with a similar number of parameters.
Let's see what has changed in the model
No paper has been published for YOLOv8 yet, so detailed information on the research methodology and ablation studies is not available. The improvements can still be read from the code. The image below, made by GitHub user RangeKing, shows a detailed visualization of the network architecture.
YOLOv8 is an anchor-free model, which means that it directly predicts the center of an object rather than an offset from a known anchor box. Anchor boxes were a notoriously troublesome part of earlier YOLO models, because they may reflect the box distribution of the benchmark dataset they were tuned on rather than the distribution of a custom dataset.
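The difference can be sketched by comparing how raw network outputs are decoded into a box. The anchor-based function follows the YOLOv2/v3 formulation. The anchor-free function uses one common parameterization (distances from the cell center to the four box sides) and is an assumption for illustration, not necessarily the exact YOLOv8 decoding. All input values are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def decode_anchor_based(t, cell, anchor, stride):
    """YOLOv3-style decoding: offsets relative to a grid cell and a preset anchor box."""
    tx, ty, tw, th = t
    cx = (cell[0] + sigmoid(tx)) * stride
    cy = (cell[1] + sigmoid(ty)) * stride
    w = anchor[0] * math.exp(tw)   # the anchor size sets the scale of the box
    h = anchor[1] * math.exp(th)
    return cx, cy, w, h


def decode_anchor_free(distances, cell, stride):
    """Anchor-free decoding: distances (in cells) from the cell center to the box sides."""
    left, top, right, bottom = distances
    px, py = (cell[0] + 0.5) * stride, (cell[1] + 0.5) * stride
    x1, y1 = px - left * stride, py - top * stride
    x2, y2 = px + right * stride, py + bottom * stride
    return (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1


with_anchor = decode_anchor_based((0.0, 0.0, 0.0, 0.0), cell=(2, 3),
                                  anchor=(30.0, 60.0), stride=8)
without_anchor = decode_anchor_free((1.0, 1.0, 1.0, 1.0), cell=(2, 3), stride=8)
```

In the anchor-based version the box size depends on an anchor chosen before training, which is exactly the dataset-dependent prior that the anchor-free version avoids.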
Anchor-free detection reduces the number of predicted boxes, which speeds up non-maximum suppression (NMS). The following figure visualizes the detection head of v8.
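For reference, a minimal greedy NMS looks roughly like the sketch below. It is a plain-Python illustration with made-up boxes, not the implementation used by Ultralytics. Every kept box is compared against all remaining candidates, so fewer candidate boxes directly means less work.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop overlapping lower-scored boxes, repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)   # indices of the boxes that survive
```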
New convolutions
The first 6x6 conv of the stem is replaced by a 3x3 conv, and the main building block is now C2f instead of C3. The module is summarized in the following figure, where "f" is the number of features, "e" is the expansion rate, and CBS is a block composed of a Conv, a BatchNorm, and a SiLU.
In C2f, all outputs of the bottleneck (two 3x3 convolutions with residual connections) are concatenated. In C3, only the output of the last Bottleneck is used.
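The difference in data flow can be shown with a toy model in which a feature map is just a list of channel values. The `bottleneck` function is a stand-in for the real block, and the surrounding convolutions (as well as the bypass branch of C3) are left out, so this only illustrates which outputs are passed on.

```python
def bottleneck(x):
    """Stand-in for a real bottleneck: any function that maps c channels to c channels."""
    return [v + 1.0 for v in x]


def c3_style(x, n):
    """Chain n bottlenecks and pass on only the last output."""
    for _ in range(n):
        x = bottleneck(x)
    return x


def c2f_style(x, n):
    """Split the input in two, chain n bottlenecks, and concatenate every output."""
    half = len(x) // 2
    outputs = [x[:half], x[half:]]
    for _ in range(n):
        outputs.append(bottleneck(outputs[-1]))
    return [v for chunk in outputs for v in chunk]


features = [0.0, 0.0, 10.0, 10.0]   # hypothetical 4-channel input
last_only = c3_style(features, n=2)
all_outputs = c2f_style(features, n=2)
```

With `n` bottlenecks the concatenated result has `(2 + n)` chunks, so the convolution that follows sees the intermediate features as well as the final ones.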
Bottleneck is the same as in YOLOv5, but the first conv kernel size is changed from 1x1 to 3x3. We can see that YOLOv8 is starting to revert to the ResNet blocks defined in 2015.
In the neck section, the features are connected directly without forcing the same channel size. This reduces the parameter count and the overall size of the tensor.
Mosaic augmentation
Deep learning research tends to focus on model architectures, but the training process in YOLOv5 and YOLOv8 is an important part of their success.
YOLOv8 augments images online during training. At each epoch, the model sees a slightly different variation of the images.
Mosaic augmentation stitches four images together, which forces the model to learn objects in new positions, under partial occlusion, and against different surrounding pixels.
Experience shows that such augmentation degrades performance if applied throughout the training procedure. Turning it off for the last 10 training epochs improves performance.
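The two ideas, stitching four images and switching the augmentation off near the end of training, can be sketched as follows. The tiling is a simplification (real implementations typically pick a random center and crop or rescale the images, and they also shift the box labels), and the parameter name `close_last` is hypothetical.

```python
def mosaic(top_left, top_right, bottom_left, bottom_right):
    """Tile four equally sized images (lists of pixel rows) into one 2x2 grid."""
    top = [a + b for a, b in zip(top_left, top_right)]
    bottom = [a + b for a, b in zip(bottom_left, bottom_right)]
    return top + bottom


def use_mosaic(epoch, total_epochs, close_last=10):
    """True while mosaic should still be applied; epochs are counted from 0."""
    return epoch < total_epochs - close_last


def tile(value):
    """Hypothetical 2x2 image filled with a single pixel value."""
    return [[value, value], [value, value]]


stitched = mosaic(tile(1), tile(2), tile(3), tile(4))
schedule = [use_mosaic(epoch, total_epochs=100) for epoch in range(100)]
```

With 100 epochs and `close_last=10`, the model trains on mosaics for the first 90 epochs and on unstitched images for the final 10.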
The following performance figures are from Ultralytics' GitHub repository.
They show that YOLOv8 is currently state of the art in both accuracy and inference latency.
YOLOv8 code structure
The YOLOv8 code is similar to that of YOLOv5, but it has a new structure in which the same code supports several task types: classification, instance segmentation, and object detection. Models are still initialized with the same YAML format as in YOLOv5, and the dataset format remains the same.
Ultralytics also provides a command-line interface (CLI), which should be familiar to many YOLOv5 users, for training, detection, and export:
yolo task=detect mode=val model={HOME}/runs/detect/train/weights/best.pt data={dataset.location}/data.yaml
The pip package also makes custom development and fine-tuning straightforward:
from ultralytics import YOLO
# Load a model
model = YOLO("yolov8n.yaml")  # build a new model from scratch
model = YOLO("yolov8n.pt")  # load a pretrained model (recommended for training)
# Use the model
results = model.train(data="coco128.yaml", epochs=3)  # train the model
results = model.val()  # evaluate model performance on the validation set
results = model("https://ultralytics.com/images/bus.jpg")  # predict on an image
success = YOLO("yolov8n.pt").export(format="onnx")  # export a model to ONNX format
This makes it easy to train on our own dataset. There are many articles on specific training methods, so we won't explain them here.
If you are interested, you can take a look at the official description (in Chinese): https://avoid.overfit.cn/post/7596c1ba2d9544189f46e1abf6445c60