YOLOv5 + Hybrid Attention Mechanism Gains Another 4.3% mAP: Transformer Hybrid Designs Still Have Room to Compete

In industrial production, traditional manual inspection of welding defects has been phased out due to its low efficiency, inconsistent evaluation, high cost, and lack of real-time data.

To address the low accuracy, high false detection rate, and high computational cost of welding defect detection in surface mount technology, a new method is proposed: a hybrid attention mechanism designed specifically for welding defect detection algorithms, which improves quality control in the manufacturing process by increasing accuracy while reducing computational cost. The hybrid attention mechanism combines a proposed enhanced multi-head self-attention mechanism with a coordinate attention mechanism to strengthen the attention network's perception of contextual information and improve feature utilization. The coordinate attention mechanism enhances the connections between different channels and reduces the loss of position information, while the hybrid mechanism as a whole improves the network's ability to perceive long-range location information and learn local features.

The improved model demonstrates strong welding defect detection capability, with mAP reaching 91.5%, which is 4.3% higher than YOLOv5 and better than the other comparison algorithms. Average precision, precision, recall, and frames-per-second metrics also improve over the baseline YOLOv5. The method achieves higher detection accuracy while still meeting real-time detection requirements.

1 Introduction

Surface mount device (SMD) pins are prone to soldering defects during automated production, such as pin short circuits and pin offsets, as shown in Figure 1. For welding defect detection, traditional manual inspection methods are no longer suitable for modern industrial production: manual inspection is inefficient, inconsistent in its assessments, costly, and lacks real-time data.

Computer vision combines computer hardware and software with industrial cameras and light sources to capture images. It is used in various industrial scenarios to automate manufacturing and improve product quality. A welding defect detection system based on computer vision is real-time, continuous, and non-contact; it can replace manual inspection and improve the accuracy of results. Computer vision is already widely used in defect detection, so using it to detect welding defects has become a mainstream trend. In recent years, deep learning, a branch of computer vision, has developed rapidly, yet automated welding defect detection methods remain underdeveloped. Welding defect detection methods can be divided into three main groups: feature-based methods, statistical methods, and deep learning methods.

Thanks to its convolutional neural network (CNN) structure, deep learning can learn effective information and rules from welded joint images, overcoming the difficulty of extracting effective features with manually designed rules. Deep detection networks can be divided into single-stage and two-stage architectures. Two-stage networks are more accurate than single-stage ones, but shallow features must be used with caution to avoid losing information during the feature extraction stage, which lowers detection rates. Single-stage methods perform well in real time but detect small defect regions and low-resolution images poorly: the feature extraction module loses too much target feature information, yielding unsatisfactory small-defect detection rates and serious missed detections. More generally, as the number of network layers increases, shallow information is easily lost, causing small-sized targets to be missed. To solve this problem, multi-scale feature fusion combines deep and shallow features during feature extraction to enhance information transmission between network layers. Optimizing the feature fusion method can therefore improve the detection accuracy of small-sized targets.

The Feature Pyramid Network (FPN) obtains feature maps of different scales by repeatedly upsampling high-level feature maps. During feature extraction it combines high-level abstract semantic information with low-level details, such as contour and texture information, in a top-down pathway to enhance feature extraction. However, although FPN systematically extracts low-level and high-level features, its feature fusion capability still falls short, making it difficult to retain shallow feature information.
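To make the top-down pathway concrete, here is a minimal PyTorch sketch of FPN-style fusion with lateral connections. The layer names, channel counts, and resolutions are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal FPN sketch: 1x1 lateral convs unify channels, the coarser map is
# upsampled and added to the finer one (top-down), then 3x3 convs smooth the result.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(
            nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels
        )
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
            for _ in in_channels
        )

    def forward(self, feats):  # feats: [C3, C4, C5], fine to coarse resolution
        laterals = [lat(f) for lat, f in zip(self.laterals, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # top-down accumulation
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest"
            )
        return [s(p) for s, p in zip(self.smooth, laterals)]

feats = [torch.randn(1, 256, 80, 80), torch.randn(1, 512, 40, 40),
         torch.randn(1, 1024, 20, 20)]
p3, p4, p5 = SimpleFPN()(feats)  # three 256-channel maps at 80/40/20 resolution
```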

To address the information loss between high-level and low-level features, Liu et al. designed the Path Aggregation Network (PANet), which adds a bottom-up enhancement path at the bottom of the feature pyramid. This shortens the information fusion path and increases the detection capability of the feature pyramid architecture. The Bidirectional Feature Pyramid Network (BiFPN) builds on PANet: nodes with only one input are removed to reduce parameter computation, and additional skip connections directly link input and output layer features to enhance the fusion of shallow features. BiFPN assigns adaptive learnable weights to each layer, letting the network perceive the importance of different layers through the weight distribution.

Multi-scale feature fusion is widely used in small object detection, significantly improving performance by combining high-level semantic information with low-level detail. However, FPN construction relies mainly on cross-layer connections and parallel branches; although this mechanism improves performance, it adds extra parameter computation and storage. It is therefore necessary to design a pyramid feature network architecture that strengthens the feature fusion ability of defect detectors. The authors propose a hybrid attention mechanism to improve the feature fusion ability of the feature pyramid network and apply the enhanced FPN to the YOLOv5 detection model. Comparative and ablation experiments verify the effectiveness of the proposed method on welding defect datasets. The overall flow chart of this article is shown in Figure 2.

The main work and innovation points of this paper are as follows.

  1. A novel enhanced multi-head self-attention mechanism (EMSA) is proposed to enhance the network's ability to perceive contextual information, expand the range of network feature utilization, and enable the network to have stronger nonlinear expression capabilities.

  2. The authors combine the coordinate attention mechanism (CA) with EMSA to design a hybrid attention mechanism (HAM) network, which addresses the loss of shallow features in the feature pyramid network and increases the network's ability to perceive long-range location information and learn local features.

  3. The hybrid attention mechanism enhances FPN, strengthening its feature fusion ability and the information transfer between network channels.

  4. The improved FPN is applied to the YOLOv5 detection model, improving YOLOv5's welding defect detection capability, significantly alleviating the low detection rate for small defects, and enhancing the general applicability of the defect detection model.

2 Related Work

 2.1 Feature Pyramid Network

The Feature Pyramid Network (FPN) is a commonly used feature fusion method in target detection: a network model that extracts a pyramid feature representation, typically applied in the feature fusion stage. After the Backbone network performs bottom-up feature extraction, FPN connects adjacent feature maps of the corresponding layers and combines adjacent layers of the Backbone feature hierarchy through top-down and lateral connections to build a feature pyramid. Although FPN is simple and effective, it has shortcomings. Before each layer's features are fused, a semantic gap exists between the different layers, and direct fusion harms the multi-scale feature representation. During fusion, high-level feature information in the pyramid may also be lost during scaling.

The Path Aggregation Network (PANet), based on the FPN structure, is widely used in the YOLO target detection framework and its variants. It has two feature fusion paths, top-down and bottom-up, which reduce the fusion distance between deep and shallow features, optimize FPN's feature fusion, and improve detection. However, with the added bottom-up path, low-level feature information may still be lost as the network deepens, and the extra paths increase computational complexity and parameters, reducing the model's detection speed. The Bidirectional Feature Pyramid Network (BiFPN) introduces skip connections to transfer information between feature input and output layers; because these operations stay within the same layer, BiFPN can combine more features with fewer parameters. To achieve richer fusion, BiFPN processes the same layer multiple times, treating each bidirectional path as a feature network layer.
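BiFPN's learnable layer weighting can be sketched in a few lines. The module below is a hedged illustration of the fast normalized fusion idea from the EfficientDet line of work; the ReLU constraint and epsilon are conventional choices, not values taken from this paper.

```python
# Fast normalized fusion: each input map gets a learnable non-negative weight,
# normalized by the weight sum, so the network learns each layer's importance.
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):  # inputs: tensors with identical shapes
        w = torch.relu(self.weights)      # keep weights non-negative
        w = w / (w.sum() + self.eps)      # normalize so they sum to ~1
        return sum(wi * x for wi, x in zip(w, inputs))

fused = WeightedFusion(2)([torch.randn(1, 256, 40, 40),
                           torch.randn(1, 256, 40, 40)])
```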

Adaptive spatial feature fusion (ASFF) is a feature fusion algorithm with adaptive capabilities proposed in 2019. It adaptively selects important information through learned weights, improving the effectiveness of feature fusion. By learning the connections between different feature maps, ASFF resolves the inconsistency between features of different sizes in the feature pyramid, and it is easy to implement, cheap, and widely applicable. Qian et al. [1] proposed the centralized feature pyramid (CFP), based on globally explicit centralized feature regulation, for use in target detection models. This scheme proposes a general inner-layer feature adjustment method that uses a lightweight multi-layer perceptron (MLP) to capture long-range dependencies, and it emphasizes that inner-layer feature regulation can effectively obtain a comprehensive yet differentiated feature representation. The CFP network effectively improves the target detection capabilities of YOLOv5 and YOLOX, improving mAP by 1.4% on the public MS-COCO dataset, but its computational complexity is relatively high.
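ASFF's per-pixel weighting can be illustrated similarly. The sketch below assumes the pyramid levels have already been resampled to a common resolution; the real ASFF also implements that up/downsampling.

```python
# ASFF core idea: per-pixel softmax weights decide how much each rescaled
# pyramid level contributes at every spatial location.
import torch
import torch.nn as nn

class ASFFFusion(nn.Module):
    def __init__(self, channels: int = 256, levels: int = 3):
        super().__init__()
        # one spatial weight map per level, predicted from the features themselves
        self.weight_convs = nn.ModuleList(
            nn.Conv2d(channels, 1, kernel_size=1) for _ in range(levels)
        )

    def forward(self, feats):  # feats: list of [N, C, H, W] at the same scale
        logits = torch.cat(
            [conv(f) for conv, f in zip(self.weight_convs, feats)], dim=1
        )
        alpha = torch.softmax(logits, dim=1)  # [N, levels, H, W], sums to 1
        return sum(alpha[:, i:i + 1] * f for i, f in enumerate(feats))

out = ASFFFusion()([torch.randn(1, 256, 40, 40) for _ in range(3)])
```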

FPN has been used in several defect detection applications. Chen et al. [14] used YOLOv3 for SMD LED chip defect detection with basic FPN as the feature fusion module. It achieves reasonable detection rates for missing components, missing lines, and reverse polarity defects, but lower rates for surface defects, which are difficult to detect because of their relatively small size and uncertain distribution. Yang et al. [17] used YOLOv5 with the Path Aggregation Feature Pyramid Network (PAFPN) as the feature fusion module to detect six kinds of steel surface defects, achieving good real-time results but low detection rates for small defect targets. Du et al. [15] used an enhanced YOLOv5 with BiFPN as the feature fusion module for PCB surface defect detection; its mAP50 reaches 95.3%, but the mAP for small defects such as missing holes and open circuits is lower. A missing hole defect is a hole formed by a lack of solder in a pad socket on the PCB; an open defect is a break in a circuit on the PCB.

Han et al. [10] designed a YOLO improvement scheme that replaces the original PAFPN with BiFPN and embeds self-attention into BiFPN's upsampling and downsampling modules to improve detection in surface defect tasks, but its ability to detect smaller defects remains weak. Therefore, to improve the performance of defect detection networks, an enhanced attention mechanism is needed to strengthen FPN's feature fusion and thus reduce the missed detection rate of small-sized defects. In recent years, many studies have used attention mechanisms to enhance the detection capabilities of defect detection frameworks; an attention mechanism enables a neural network to focus on a specific goal.

2.2 Attention Mechanism

Many inputs contain both critical and irrelevant information for the task at hand. An attention mechanism focuses on the key information while filtering out the irrelevant. It is inspired by the human visual system, which quickly scans an image, locates the target area of interest, and concentrates attention there, extracting the important information in that area while suppressing interference from irrelevant regions. Hu et al. proposed the Squeeze-and-Excitation (SE) attention module, which adaptively recalibrates the weight of each channel by mining the interdependence between feature channels, allowing the network to focus on more important feature information. Woo et al. [17] extended this to the spatial dimension and designed the Convolutional Block Attention Module (CBAM).
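A minimal sketch of the SE block described above, with the common reduction ratio r=16 (an assumption; the ratio is tuned per network in practice):

```python
# Squeeze-and-Excitation: global average pooling "squeezes" each channel to one
# value; a two-layer bottleneck "excites" it into per-channel weights.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):  # x: [N, C, H, W]
        n, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))            # squeeze: [N, C]
        w = self.fc(s).view(n, c, 1, 1)   # excitation: per-channel weights
        return x * w                      # recalibrate the channels

y = SEBlock(64)(torch.randn(2, 64, 32, 32))
```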

By sequentially applying a Channel Attention Module (CAM) and a Spatial Attention Module (SAM), CBAM enhances the network's ability to separate and strengthen feature information. The Efficient Channel Attention (ECA) module [18] uses a one-dimensional convolution to extract dependencies between channels and achieve cross-channel interaction, solving SE's inability to extract channel dependencies effectively due to its dimensionality-reducing bottleneck. ECA has lower computational complexity and little impact on network speed.
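For contrast with SE, here is a hedged sketch of the ECA idea: a single 1D convolution across the channel descriptors, with no dimension-reducing bottleneck. The kernel size k=3 is a typical choice; the ECA paper actually derives k adaptively from the channel count.

```python
# ECA: local cross-channel interaction via one cheap 1D convolution.
import torch
import torch.nn as nn

class ECABlock(nn.Module):
    def __init__(self, k: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):  # x: [N, C, H, W]
        s = x.mean(dim=(2, 3))                     # [N, C] channel descriptors
        w = self.conv(s.unsqueeze(1)).squeeze(1)   # 1D conv across channels
        w = self.sigmoid(w).view(*w.shape, 1, 1)   # [N, C, 1, 1] weights
        return x * w

y = ECABlock()(torch.randn(2, 64, 32, 32))
```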

Zhang et al. [19] embedded ECA into YOLOv5's feature fusion network for solar cell surface defect detection, enhancing PAFPN's ability to fuse defect features and further improving the defect detection rate, reaching an mAP50 of 84.23% on their dataset. However, ECA incurs a larger computational overhead on smaller feature maps.

To better detect surface defects on steel, Qian et al. [18] introduced the CA mechanism into the detection network, achieving an mAP of 79.23% and a recall of 62.4%. Because the CA mechanism computes attention weights over the entire feature map, it struggles to capture long-range dependencies, yet collecting such semantic information is crucial for resolving long-range dependencies in small-area detection. The Vision Transformer (ViT), by contrast, relies entirely on self-attention to capture long-range global relationships, with accuracy surpassing convolutional neural networks (CNNs). ViT was introduced to computer vision in 2020 and has performed well across vision tasks.

2.3 Vision Transformer

The Vision Transformer achieves good performance in computer vision because it uses the multi-head self-attention (MSA) mechanism. MSA is a feature extraction method different from CNNs: it establishes global dependencies and expands the perceptual field over the image. Compared with a CNN, ViT has a larger sensing area and can gather more contextual information.

However, due to filter inefficiency, some information critical to detection is discarded. ViT cannot exploit prior knowledge inherent to images, such as feature locality, translation invariance, and image scale, and its ability to capture sufficient information is weaker than a CNN's. The ViT model is built on the scaled dot-product attention mechanism: it first divides the image into non-overlapping, fixed-size patches, then flattens each patch into a one-dimensional vector for linear projection to extract features.
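A minimal illustration of the scaled dot-product multi-head self-attention at ViT's core, using PyTorch's built-in module; the patch count and embedding dimension below are arbitrary examples, not ViT's actual configuration.

```python
# Self-attention over a sequence of flattened patch embeddings: query, key, and
# value all come from the same tokens, so every patch can attend to every other
# patch, giving a global receptive field.
import torch
import torch.nn as nn

embed_dim, num_heads = 192, 3
patches = torch.randn(1, 196, embed_dim)  # e.g. a 14x14 grid of patch tokens

msa = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
out, attn = msa(patches, patches, patches)
print(out.shape, attn.shape)  # [1, 196, 192] and [1, 196, 196]
```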

The Swin Transformer is another Transformer variant. It uses local attention and the shifted-window multi-head self-attention mechanism (SW-MSA) to realize interaction between local and global features, achieves good results in various visual tasks, and solves ViT's problem of local information being easily damaged.

The difference between the general attention mechanism and the self-attention mechanism lies in the source of the queries and keys: in general attention they come from different sources, while in self-attention they come from the same set of elements. Zhu et al. designed the Transformer Prediction Head YOLOv5 (TPH-YOLOv5) model for small target detection in UAV images. It applies Transformers to low-resolution feature maps, enhancing the network's ability to extract varied local information and achieving better performance on high-density targets. However, distributing Transformer modules across multiple parts of the model incurs significant computation.

3 Proposed Enhanced Feature Pyramid Network

3.1 Hybrid Attention Feature Pyramid Network Architecture

In the task of detecting defects in welded joints, some small defects are difficult to detect, and enhancing the feature fusion capability of FPN helps improve their detection. To this end, this study proposes a hybrid attention feature pyramid network (HAFPN), as shown in Figure 3(a).

Adding the hybrid attention mechanism (HAM) to the basic FPN enhances FPN's ability to perceive contextual information, expands the use of feature information, and alleviates the severe loss of location information. The HAM network structure is shown in Figure 3(b).

3.2 Hybrid Attention Mechanism

The Hybrid Attention Mechanism (HAM) module is based on the Transformer structure. First, the input features pass through a depthwise convolution (DWConv) residual block to achieve parameter sharing and enhance the learning of local features.

Next, the features are normalized with Layer Normalization (LN) and processed by two attention modules, enhanced multi-head self-attention (EMSA) and coordinate attention (CA). Finally, the result is normalized by another LN layer and output through an MLP layer.

The entire processing procedure is given in formula (1).

In formula (1), X denotes the input features, Y the output features, and X1, X2, and X3 intermediate features. DWConv denotes depthwise convolution, LN layer normalization, CA coordinate attention, EMSA enhanced multi-head self-attention, and MLP a multi-layer perceptron.
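Formula (1) itself is not reproduced in this excerpt. Based on the pipeline just described (DWConv residual block, LN, parallel EMSA and CA branches, LN, MLP), a plausible reconstruction with conventional Transformer-style residual connections is the following; the exact placement of the residuals in the original paper may differ:

```latex
% Assumed reconstruction of formula (1); residual placement is a guess.
\begin{aligned}
X_1 &= \mathrm{DWConv}(X) + X,\\
X_2 &= \mathrm{LN}(X_1),\\
X_3 &= \mathrm{EMSA}(X_2) + \mathrm{CA}(X_2) + X_1,\\
Y   &= \mathrm{MLP}\big(\mathrm{LN}(X_3)\big) + X_3.
\end{aligned}
```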

(1) Enhanced Multi-head Self Attention

A novel EMSA module is proposed, as shown in Figure 3(b), to capture contextual information and global features simultaneously, while the CA mechanism captures accurate location features and effectively captures inter-channel information. The informative features captured by EMSA and CA are then fused to strengthen the feature fusion capability of the feature pyramid network.

The design concept is based on the Transformer's MSA mechanism, shown in Figure 4(a); the structure and full processing procedure of EMSA are shown in Figure 4(b).

(2) Coordinate attention

This study introduces the coordinate attention (CA) mechanism into HAM to enhance the position information fusion ability of FPN. The CA mechanism effectively enhances the correlation between different channels and improves the network's ability to perceive long-range location information. The operation process of the CA mechanism is shown in Figure 5.

For an input of size H×W×C (height × width × channels of the feature map), global average pooling is first performed along the height and width dimensions, yielding feature maps of size H×1×C and 1×W×C. These two maps are then concatenated and reduced along the channel dimension by a shared convolution, producing a feature map of size 1×(W+H)×C/r. A nonlinear layer then improves the nonlinear expression ability.

Next, to restore the dimensionality, 1×1 convolutions recover the feature maps along the width and height dimensions to the H and W scales, and weights are assigned through HardSigmoid. To speed up the CA mechanism, HardSigmoid replaces the original Sigmoid activation function for weight assignment; because HardSigmoid requires no exponential operations, it computes faster than Sigmoid. Finally, the output feature map returns to size H×W×C.
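Putting these steps together, here is a hedged PyTorch sketch of coordinate attention as described above; the reduction ratio r and the BatchNorm placement are common choices from the CA literature, not necessarily this paper's exact settings.

```python
# Coordinate attention: pool along H and W separately, jointly reduce channels,
# split back, and gate the input with direction-aware HardSigmoid weights.
import torch
import torch.nn as nn

class CoordAttention(nn.Module):
    def __init__(self, channels: int, r: int = 32):
        super().__init__()
        mid = max(8, channels // r)
        self.reduce = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),   # the nonlinear layer after reduction
        )
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)
        self.gate = nn.Hardsigmoid()  # cheaper than Sigmoid: no exponentials

    def forward(self, x):  # x: [N, C, H, W]
        n, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                  # [N, C, H, 1]
        x_w = x.mean(dim=2, keepdim=True).transpose(2, 3)  # [N, C, W, 1]
        y = self.reduce(torch.cat([x_h, x_w], dim=2))      # [N, C/r, H+W, 1]
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = self.gate(self.conv_h(y_h))                  # [N, C, H, 1]
        a_w = self.gate(self.conv_w(y_w.transpose(2, 3)))  # [N, C, 1, W]
        return x * a_h * a_w                               # back to H*W*C

y = CoordAttention(64)(torch.randn(2, 64, 32, 32))
```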

3.3 Improved Feature Fusion Network in YOLOv5

The authors use HAFPN as the feature fusion module in YOLOv5, replacing the original PAFPN structure.

The original feature fusion network architecture is shown in Figure 6(a). It comprises CBS blocks (Convolution, Batch Normalization, and the SiLU activation function), the cross-stage partial (CSP) bottleneck structure containing three convolutions (C3), and the spatial pyramid pooling-fast (SPPF) module.
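The CBS block named above is simple enough to state directly; a minimal sketch with YOLOv5-style defaults (the kernel size and stride here are illustrative):

```python
# CBS = Convolution + Batch Normalization + SiLU, the basic building unit of
# YOLOv5's feature fusion network.
import torch
import torch.nn as nn

class CBS(nn.Module):
    def __init__(self, c_in: int, c_out: int, k: int = 1, s: int = 1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

y = CBS(256, 128, k=3)(torch.randn(1, 256, 40, 40))
```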

Compared with FPN, PAFPN achieves better accuracy, but it detects some small welded joint defects poorly, and the network is larger with more parameters. The enhancement proposed by the authors strengthens the feature fusion capability of the FPN network, improving recognition accuracy while preserving detection speed. The improved feature fusion network architecture is shown in Figure 6(b).

4 Experiment

To verify the effect of the hybrid attention mechanism proposed in this study, heatmap visualization was used to compare how well different attention mechanisms focus on the defect area, as shown in Figure 7.

Without an attention mechanism, YOLOv5 attends only weakly to welding joint defects. Adding various attention mechanisms brings some improvement: SE and ECA improve defect attention only slightly, and in some cases even degrade it, while CBAM and CA clearly enhance it. The Transformer and Swin Transformer pay less attention to small-sized shift defects.

The hybrid attention mechanism proposed in this study markedly increases the heatmap coverage of defect locations. It focuses on small defects more sharply and localizes them more accurately, showing that hybrid attention can attend to more pixels in combination with contextual content and demonstrating its effectiveness.

To verify the superiority of the HAFPN algorithm, the authors compared the defect detection performance of different FPN algorithms on the same dataset, with CSPDarknet53 used throughout as the feature extraction Backbone. The compared feature fusion algorithms are FPN, PAFPN, BiFPN, CFPNet, and ASFF.

Table 1 lists the experimental environment, and Table 2 records the detection results. HAFPN's detection indicators for all defect types are higher than those of FPN, PAFPN, BiFPN, and CFPNet; only its accuracy on the insufficient defect class is slightly lower than ASFF's. The overall precision, recall, and mAP of HAFPN are better than the other networks': its precision is 3.8%, 9.4%, 1.3%, 9.7%, and 6.9% higher; its recall 0.5%, 4.8%, 0.7%, 1.5%, and 1.2% higher; and its mAP 3%, 4.3%, 0.9%, 3.2%, and 3.4% higher, respectively.

This study uses HAFPN to improve the YOLOv5 defect detection model and compares it with other detection models on the welding joint defect dataset. The comparison models include one-stage detectors such as YOLOv4 [DCL21], YOLOv5 [G22], YOLOv7 [WBL23], and YOLOv8 [G23]; improved YOLOv5 detectors such as STC-YOLOv5 and TPH-YOLOv5; and the two-stage detector Faster R-CNN [RHGS15].

Table 3 records the experimental results. Compared with the YOLO series of algorithms, the authors' model achieves the best overall precision, recall, and mAP. In terms of detection speed, its FPS is lower than the original YOLOv5 but higher than the other models. Its precision, recall, and mAP are 9.4%, 4.8%, and 4.3% higher than YOLOv5, respectively.

Compared with Faster R-CNN, its recall is lower but it runs three times faster, so the proposed algorithm offers effective real-time performance. Compared with STC-YOLOv5 and TPH-YOLOv5, the authors' model increases precision by 6.4% and 2.4%, recall by 3.1% and 2.2%, mAP by 2.8% and 0.6%, and FPS by 22.5% and 31.6%, respectively.

The authors visually compared the detection performance of the original YOLOv5 network with that of the improved YOLOv5 network, as shown in Figure 8.

Of the 12 pins, the first 9 are defective. The original YOLOv5 network misses the shift defects on the first two pins as well as small-sized insufficient defect targets, while the improved network avoids such omissions and false detections. In Figure 8(b), all defects are correctly detected, achieving better detection results.

References

YOLO Algorithm with Hybrid Attention Feature Pyramid Network for Solder Joint Defect Detection
