EfficientDet study notes

EfficientDet: Scalable and Efficient Object Detection

Abstract

Model efficiency has become increasingly important in computer vision. In this paper, we systematically study neural network architecture design choices for object detection and propose several key optimizations to improve efficiency. First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows simple and fast multi-scale feature fusion; second, we propose a compound scaling method that uniformly scales the resolution, depth, and width of all backbone, feature, and box/class prediction networks at the same time. Based on these optimizations and better backbones, we develop a new family of object detectors, called EfficientDet, which consistently achieves better efficiency than prior art across a wide spectrum of resource constraints. In particular, with a single model and single-scale testing, our EfficientDet-D7 achieves state-of-the-art 55.1 AP on COCO test-dev with 77M parameters and 410B FLOPs, being 4x-9x smaller and using 13x-42x fewer FLOPs than previous detectors. Code is available at https://github.com/google/automl/tree/master/efficientdet.

1. Introduction

In recent years, tremendous progress has been made toward more accurate object detection; at the same time, state-of-the-art object detectors have become increasingly expensive. For example, the latest AmoebaNet-based NAS-FPN detector [45] requires 167M parameters and 3045B FLOPs (30x more than RetinaNet [24]) to achieve state-of-the-art accuracy. The large model sizes and expensive computation costs hinder deployment in many real-world applications, such as robotics and autonomous vehicles, where both model size and latency are highly constrained. Given these real-world resource constraints, model efficiency becomes increasingly important for object detection.

There are many previous works aimed at developing more efficient detector architectures, such as one-stage [27, 33, 34, 24] and anchor-free detectors [21, 44, 40], or compressing existing models [28, 29]. Although these methods tend to achieve better efficiency, they usually sacrifice accuracy. Moreover, most previous works only focus on a specific or small range of resource requirements, while the variety of real-world applications, from mobile devices to data centers, often demands different resource constraints.

A natural question is: is it possible to build a scalable detection architecture with both higher accuracy and better efficiency across a wide range of resource constraints (e.g., from 3B to 300B FLOPs)? This paper aims to tackle this question by systematically studying design choices for detector architectures. Based on the one-stage detector paradigm, we examine the design choices for backbone, feature fusion, and class/box networks, and identify two main challenges:

Challenge 1:

Efficient multi-scale feature fusion - since its introduction in [23], FPN has been widely used for multi-scale feature fusion. Recently, PANet [26], NAS-FPN [10], and other studies [20, 18, 42] have developed more network structures for cross-scale feature fusion. While fusing different input features, most previous works simply sum them up without distinction; however, since these input features have different resolutions, we observe that they usually contribute to the fused output feature unequally. To address this issue, we propose a simple yet highly effective weighted bidirectional feature pyramid network (BiFPN), which introduces learnable weights to learn the importance of different input features, while repeatedly applying top-down and bottom-up multi-scale feature fusion.

Challenge 2:

Model scaling - while previous works mainly rely on bigger backbone networks [24, 35, 34, 10] or larger input image sizes [13, 45] to achieve higher accuracy, we observe that scaling up the feature network and the box/class prediction network is also critical when taking both accuracy and efficiency into account. Inspired by recent work [39], we propose a compound scaling method for object detectors, which jointly scales up the resolution/depth/width of all backbone, feature, and box/class prediction networks.

Finally, we also observe that the recently introduced EfficientNets [39] achieve better efficiency than previously commonly used backbones. Combining the EfficientNet backbone with our proposed BiFPN and compound scaling, we develop a new family of object detectors, named EfficientDet, which consistently achieves better accuracy with far fewer parameters and FLOPs than previous object detectors. Figures 1 and 4 show the performance comparison on the COCO dataset [25]. Under similar accuracy constraints, our EfficientDet uses 28x fewer FLOPs than YOLOv3 [34], 30x fewer than RetinaNet [24], and 19x fewer than the recent ResNet-based NAS-FPN [10]. In particular, with a single model and single test-time scale, our EfficientDet-D7 achieves state-of-the-art 55.1 AP with 77M parameters and 410B FLOPs, outperforming the previous best detector [45] by 4 AP while being 2.7x smaller and using 7.4x fewer FLOPs. Our EfficientDet is also 4x-11x faster on GPU/CPU than previous detectors.

With simple modifications, we also demonstrate that our single-model single-scale EfficientDet achieves 81.74% mIOU accuracy with 18B FLOPs on Pascal VOC 2012 semantic segmentation, which is 1.7% more accurate than DeepLabV3+ [6] while using 8.8x fewer FLOPs.

2. Related Work

One-Stage Detectors:

Existing object detectors are mostly classified by whether they have a region-of-interest proposal step (two-stage [11, 35, 13]) or not (one-stage [36, 27, 33, 24]). While two-stage detectors tend to be more flexible and more accurate, one-stage detectors are often considered simpler and more efficient by leveraging predefined anchors [17]. Recently, one-stage detectors have attracted substantial attention due to their efficiency and simplicity [21, 42, 44]. In this paper, we mainly follow the one-stage detector design, and we show that both better efficiency and higher accuracy can be achieved by optimizing the network architecture.

Multi-Scale Feature Representations

One of the main difficulties in object detection is how to effectively represent and process multi-scale features. Early detectors often directly perform predictions based on the pyramidal feature hierarchy extracted from the backbone [4, 27, 36]. Feature Pyramid Network (FPN) [23] is a pioneering work that proposes a top-down pathway to combine multi-scale features. Following this idea, PANet [26] adds an extra bottom-up path aggregation network on top of FPN; STDL [43] proposes a scale-transfer module to exploit cross-scale features; M2Det [42] adopts a U-shaped module to fuse multi-scale features; and G-FRNet [2] introduces gate units to control the information flow across features. Recently, NAS-FPN [10] leverages neural architecture search to automatically design the feature network topology. Although NAS-FPN achieves better performance, it requires thousands of GPU hours during search, and the resulting feature network is irregular and thus difficult to interpret. The goal of this article is to optimize multi-scale feature fusion with a more intuitive and principled approach.

Model Scaling:

Better accuracy is usually obtained by scaling up a baseline detector with a bigger backbone network (e.g., from mobile-size models [38, 16] and ResNet [14] to ResNeXt [41] and AmoebaNet [32]), or by increasing the input image size (e.g., from 512x512 [24] to 1536x1536 [45]). Some recent works [10, 45] show that increasing the channel size and repeating feature networks can also lead to higher accuracy. Most of these scaling methods focus on a single or limited scaling dimension. Recently, [39] demonstrated remarkable model efficiency for image classification by jointly scaling up network width, depth, and resolution. Our proposed compound scaling method for object detection is mainly inspired by [39].

3. BiFPN

In this section, we first formulate the multi-scale feature fusion problem, and then introduce the two main ideas of our proposed BiFPN: efficient bidirectional cross-scale connections and weighted feature fusion.

3.1. Problem Formulation

[Figure 2: comparison of feature network designs - (a) FPN, (b) PANet, (c) NAS-FPN, (d) BiFPN]

The purpose of multi-scale feature fusion is to aggregate features of different resolutions. Formally, given a list of multi-scale features $\vec{P}^{in} = (P^{in}_{l_1}, P^{in}_{l_2}, \ldots)$, where $P^{in}_{l_i}$ denotes the feature at level $l_i$, our goal is to find a transformation $f$ that effectively aggregates the different features and outputs a new list of features: $\vec{P}^{out} = f(\vec{P}^{in})$. As a concrete example, Figure 2(a) shows the conventional top-down FPN [23]. It takes level 3-7 input features $\vec{P}^{in} = (P^{in}_3, \ldots, P^{in}_7)$, where $P^{in}_i$ represents a feature level with resolution $1/2^i$ of the input image. For example, if the input resolution is 640x640, then $P^{in}_3$ represents feature level 3 ($640/2^3 = 80$) with resolution 80x80, while $P^{in}_7$ represents feature level 7 with resolution 5x5. The conventional FPN aggregates multi-scale features in a top-down manner:

$$
\begin{aligned}
P_7^{out} &= \mathrm{Conv}(P_7^{in}) \\
P_6^{out} &= \mathrm{Conv}(P_6^{in} + \mathrm{Resize}(P_7^{out})) \\
&\;\;\vdots \\
P_3^{out} &= \mathrm{Conv}(P_3^{in} + \mathrm{Resize}(P_4^{out}))
\end{aligned}
$$

where Resize is usually an upsampling or downsampling operation for resolution matching, and Conv is usually a convolutional operation for feature processing.
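As a rough sketch of the top-down aggregation described above (the Conv step is omitted and nearest-neighbor resampling stands in for Resize; the function names here are my own, not from the paper):

```python
import numpy as np

def resize(x, target_hw):
    """Nearest-neighbor resampling, standing in for the paper's Resize op."""
    h, w = x.shape
    th, tw = target_hw
    return x[np.ix_(np.arange(th) * h // th, np.arange(tw) * w // tw)]

def fpn_top_down(feats):
    """feats: {level: 2D array} for P3..P7. Returns top-down fused outputs.
    The Conv applied at each node in the paper is omitted for clarity."""
    levels = sorted(feats)                   # e.g. [3, 4, 5, 6, 7]
    out = {levels[-1]: feats[levels[-1]]}    # P7_out = Conv(P7_in)
    for lvl in reversed(levels[:-1]):        # 6, 5, 4, 3
        out[lvl] = feats[lvl] + resize(out[lvl + 1], feats[lvl].shape)
    return out

# 640x640 input -> P3 is 80x80 (640 / 2^3), ..., P7 is 5x5 (640 / 2^7)
feats = {lvl: np.ones((640 // 2**lvl,) * 2) for lvl in range(3, 8)}
fused = fpn_top_down(feats)
```

With all-ones features, each level simply accumulates one more contribution than the level above it, which makes the one-way information flow easy to see.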

3.2. Cross-Scale Connections

Conventional top-down FPN is inherently limited by its one-way information flow. To address this issue, PANet [26] adds an extra bottom-up path aggregation network, as shown in Figure 2(b). Cross-scale connections are further studied in [20, 18, 42]. Recently, NAS-FPN [10] employs neural architecture search to find a better cross-scale feature network topology, but it requires thousands of GPU hours during search, and the found network is irregular and difficult to interpret or modify, as shown in Figure 2(c).

By studying the performance and efficiency of these three networks (Table 5), we observe that PANet achieves better accuracy than FPN and NAS-FPN, but at the cost of more parameters and computation. To improve model efficiency, this paper proposes several optimizations for cross-scale connections. First, we remove those nodes that have only one input edge. Our intuition is simple: if a node has only one input edge and performs no feature fusion, it will contribute less to a feature network that aims at fusing different features. This leads to a simplified bidirectional network. Second, we add an extra edge from the original input to the output node if they are at the same level, in order to fuse more features without adding much cost. Third, unlike PANet [26], which has only one top-down and one bottom-up path, we treat each bidirectional (top-down and bottom-up) path as one feature network layer, and repeat the same layer multiple times to enable more high-level feature fusion. Section 4.2 will discuss how to determine the number of layers for different resource constraints using the compound scaling method. With these optimizations, we name the new feature network the Bidirectional Feature Pyramid Network (BiFPN), as shown in Figures 2 and 3.

3.3. Weighted Feature Fusion

When fusing features with different resolutions, a common approach is to first resize them to the same resolution and then add them together. The pyramid attention network [22] introduces global self-attention upsampling to recover pixel localization, which is further studied in [10]. All previous methods treat all input features equally. However, we observe that since different input features have different resolutions, their contributions to the output features are usually unequal. To solve this problem, we propose to add an extra weight to each input and let the network learn the importance of each input feature. Based on this idea, we considered three weighted fusion methods:

Unbounded fusion:

$$
O = \sum_i w_i \cdot I_i
$$

where $w_i$ is a learnable weight that can be a scalar (per feature), a vector (per channel), or a multi-dimensional tensor (per pixel). We find that a scalar can achieve comparable accuracy to the other options at minimal computational cost. However, since the scalar weight is unbounded, it can potentially cause training instability. Therefore, we resort to weight normalization to bound the value range of each weight.

Softmax-based fusion:

$$
O = \sum_i \frac{e^{w_i}}{\sum_j e^{w_j}} \cdot I_i
$$

An intuitive idea is to apply softmax to each weight, so that all weights are normalized into probabilities with values ranging from 0 to 1, representing the importance of each input. However, as shown in our ablation study in Section 6.3, the extra softmax leads to significant slowdown on GPU hardware. To minimize the extra latency cost, we further propose a fast fusion approach.

Fast normalized fusion:

$$
O = \sum_i \frac{w_i}{\epsilon + \sum_j w_j} \cdot I_i
$$

where $w_i \ge 0$ is ensured by applying a ReLU after each $w_i$, and $\epsilon = 0.0001$ is a small value to avoid numerical instability. Similarly, each normalized weight also falls between 0 and 1, but since there is no softmax operation here, it is much more efficient. Our ablation study shows that this fast fusion approach has very similar learning behavior and accuracy to softmax-based fusion, but runs up to 30% faster on GPUs (Table 6).
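A minimal numerical sketch of the two normalization schemes (the function names are mine; in the real model the weights are learned parameters and the fused result passes through a convolution):

```python
import numpy as np

def softmax_fusion(inputs, weights):
    """Softmax-based fusion: weights normalized via exp."""
    w = np.exp(weights - np.max(weights))   # numerically stable softmax
    w = w / w.sum()
    return sum(wi * x for wi, x in zip(w, inputs))

def fast_normalized_fusion(inputs, weights, eps=1e-4):
    """Fast normalized fusion: ReLU + divide by sum, no exp."""
    w = np.maximum(weights, 0.0)            # ReLU keeps each weight >= 0
    w = w / (eps + w.sum())                 # each normalized weight in [0, 1]
    return sum(wi * x for wi, x in zip(w, inputs))

a, b = np.full((4, 4), 1.0), np.full((4, 4), 3.0)
w = np.array([0.5, 0.5])
soft = softmax_fusion([a, b], w)            # equal weights -> plain average
fast = fast_normalized_fusion([a, b], w)    # nearly identical, up to eps
```

For equal weights both reduce to a simple average; the fast variant only differs by the small $\epsilon$ term in the denominator, which is why its learning behavior closely tracks the softmax version.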

Our final BiFPN integrates both the bidirectional cross-scale connections and the fast normalized fusion. As a concrete example, here we describe the two fused features at level 6 of the BiFPN shown in Figure 2(d):

$$
P_6^{td} = \mathrm{Conv}\!\left(\frac{w_1 \cdot P_6^{in} + w_2 \cdot \mathrm{Resize}(P_7^{in})}{w_1 + w_2 + \epsilon}\right)
$$

$$
P_6^{out} = \mathrm{Conv}\!\left(\frac{w_1' \cdot P_6^{in} + w_2' \cdot P_6^{td} + w_3' \cdot \mathrm{Resize}(P_5^{out})}{w_1' + w_2' + w_3' + \epsilon}\right)
$$

where $P_6^{td}$ is the intermediate feature at level 6 on the top-down path, and $P_6^{out}$ is the output feature at level 6 on the bottom-up path. All other features are constructed in a similar manner. Notably, to further improve efficiency, we use depthwise separable convolutions [7, 37] for feature fusion, and add batch normalization and activation after each convolution.
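The two level-6 nodes can be sketched as follows (a sketch under stated assumptions: Conv and Resize are omitted, so all inputs are assumed to be already at level-6 resolution, and the weight values are illustrative rather than learned):

```python
import numpy as np

def fuse(inputs, weights, eps=1e-4):
    """Fast normalized fusion over already-resized inputs (Conv omitted)."""
    w = np.maximum(np.asarray(weights, dtype=float), 0.0)
    w = w / (eps + w.sum())
    return sum(wi * x for wi, x in zip(w, inputs))

def bifpn_level6(p6_in, p7_resized, p5_out_resized, w_td, w_out):
    p6_td = fuse([p6_in, p7_resized], w_td)               # intermediate, top-down
    p6_out = fuse([p6_in, p6_td, p5_out_resized], w_out)  # output, bottom-up
    return p6_td, p6_out

x = np.ones((10, 10))
p6_td, p6_out = bifpn_level6(x, x, x, w_td=[1.0, 1.0], w_out=[1.0, 1.0, 1.0])
```

Note how $P_6^{in}$ feeds both nodes: that is the extra same-level input-to-output edge added by the second cross-scale optimization above.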

4. EfficientDet

Based on our BiFPN, we develop a new family of detection models named EfficientDet. In this section, we discuss the network architecture of EfficientDet and a new compound scaling method.

4.1. EfficientDet Architecture

Figure 3 shows the overall architecture of EfficientDet, which largely follows the one-stage detector paradigm [27, 33, 23, 24]. We employ ImageNet-pretrained EfficientNets as the backbone network. Our proposed BiFPN serves as the feature network: it takes level 3-7 features {P3, P4, P5, P6, P7} from the backbone and repeatedly applies top-down and bottom-up bidirectional feature fusion. These fused features are fed to a class and box network to produce object class and bounding box predictions, respectively. Similar to [24], the class and box network weights are shared across all levels of features.
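The cross-level weight sharing can be illustrated with a toy linear "box head" (the shapes and anchor count here are illustrative assumptions, not the paper's actual head, which is a stack of convolutions):

```python
import numpy as np

rng = np.random.default_rng(0)
num_channels, num_anchors = 64, 9   # illustrative values, not from the paper

# One weight matrix reused at every pyramid level, mimicking a shared box head.
shared_box_w = rng.standard_normal((num_channels, num_anchors * 4))

# Per-level feature vectors standing in for (pooled) BiFPN outputs P3..P7.
feats = {lvl: rng.standard_normal(num_channels) for lvl in range(3, 8)}
box_preds = {lvl: f @ shared_box_w for lvl, f in feats.items()}
```

Because the same `shared_box_w` is applied at every level, the head's parameter count does not grow with the number of pyramid levels.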

4.2. Compound Scaling

To optimize both accuracy and efficiency, we would like to develop a family of models that can meet a wide spectrum of resource constraints. A key challenge here is how to scale up a baseline EfficientDet model.

Most previous works scale up a baseline detector by employing a bigger backbone network (e.g., ResNeXt [41] or AmoebaNet [32]), using larger input images, or stacking more FPN layers [10]. These methods are usually ineffective since they focus on only a single or limited scaling dimension. Recent work [39] showed remarkable performance on image classification by jointly scaling up all dimensions of network width, depth, and input resolution. Inspired by these works [10, 39], we propose a new compound scaling method for object detection, which uses a simple compound coefficient φ to jointly scale up all dimensions of the backbone, BiFPN, class/box network, and resolution. Unlike [39], object detectors have many more scaling dimensions than image classification models, so grid search over all dimensions would be prohibitively expensive. Therefore, we use a heuristic-based scaling approach, but still follow the main idea of jointly scaling up all dimensions.

Backbone network

We reuse the same width/depth scaling factors of EfficientNet-B0 to B6 [39] so that we can easily reuse their ImageNet pre-trained checkpoints.

BiFPN network

We linearly increase the BiFPN depth $D_{bifpn}$ (#layers), since depth needs to be rounded to small integers. For the BiFPN width $W_{bifpn}$ (#channels), we grow it exponentially, as in [39]. Specifically, we perform a grid search on the list of values {1.2, 1.25, 1.3, 1.35, 1.4, 1.45} and pick 1.35 as the width scaling factor. Formally, BiFPN width and depth are scaled as:

$$
W_{bifpn} = 64 \cdot (1.35^{\phi}), \qquad D_{bifpn} = 3 + \phi \qquad (1)
$$

Box/class prediction network

We fix their width to always be the same as the BiFPN (i.e., $W_{pred} = W_{bifpn}$), but linearly increase the depth (#layers) using:

$$
D_{box} = D_{class} = 3 + \lfloor \phi / 3 \rfloor \qquad (2)
$$

Input image resolution

Since feature levels 3-7 are used in the BiFPN, the input resolution must be divisible by $2^7 = 128$, so we linearly increase the resolution using:

$$
R_{input} = 512 + \phi \cdot 128 \qquad (3)
$$

Using formulas 1, 2, and 3 with different φ, we develop EfficientDet-D0 (φ = 0) to D7 (φ = 7), as shown in Table 1, where D7 and D7x have the same BiFPN and head, but D7 uses a higher resolution while D7x uses a bigger backbone and one more feature level (from P3 to P8). Notably, our compound scaling is heuristic-based and may not be optimal, but we will show in Figure 6 that this simple scaling method significantly improves efficiency over other single-dimension scaling methods.
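Formulas 1-3 are simple enough to transcribe directly (a sketch of the stated formulas only; the released configurations in Table 1 round channel counts and adjust the largest models, so exact per-model numbers may differ):

```python
def efficientdet_scaling(phi):
    """Compound scaling per formulas 1-3 for compound coefficient phi."""
    w_bifpn = 64 * (1.35 ** phi)     # BiFPN width (#channels), before rounding
    d_bifpn = 3 + phi                # BiFPN depth (#layers)
    d_head = 3 + phi // 3            # box/class prediction net depth (#layers)
    r_input = 512 + phi * 128        # input resolution, divisible by 128
    return w_bifpn, d_bifpn, d_head, r_input
```

For example, φ = 0 gives the D0 baseline (64 channels, 3 BiFPN layers, 3 head layers, 512x512 input), and every resolution produced by formula 3 stays divisible by 128 as required by the level 3-7 pyramid.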

[Table 1: scaling configurations for EfficientDet D0-D7x - backbone network, input resolution, BiFPN #channels and #layers, and box/class net #layers for each φ]

5. Experiments

[Result tables and figures from the experiments section were images and are not preserved in these notes.]

6. Ablation Study

[Ablation tables and figures were images and are not preserved in these notes.]

7. Conclusion

This article systematically studies network architecture design choices for efficient object detection, and proposes a weighted bidirectional feature network and a customized compound scaling method to improve accuracy and efficiency. Based on these optimizations, we develop a new family of detectors, named EfficientDet, which consistently achieves better accuracy and efficiency than prior art across a wide spectrum of resource constraints. In particular, our scaled EfficientDet achieves state-of-the-art accuracy with far fewer parameters and FLOPs than previous object detection and semantic segmentation models.

Origin blog.csdn.net/charles_zhang_/article/details/127700285