【DDRNets】Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes

Deep Dual-resolution Networks for Real-time and Accurate Semantic Segmentation of Road Scenes

https://arxiv.org/pdf/2101.06085.pdf
https://github.com/ydhongHIT/DDRNet
Yuanduo Hong, Huihui Pan, Weichao Sun, Senior Member, IEEE, Yisong Jia
2021

Summary

Semantic segmentation is a key technology for autonomous vehicles to understand the surrounding scene. The attractive performance of contemporary models often comes at the cost of heavy computation and long inference times, which is intolerable for autonomous driving. Using lightweight architectures (encoder-decoder or two-path) or inferring on low-resolution images, recent methods achieve very fast scene parsing, even running at over 100 FPS on a single 1080Ti GPU. However, the performance gap between these real-time methods and models based on dilated backbones is still large. To address this problem, we propose a family of efficient backbones designed for real-time semantic segmentation. The proposed deep dual-resolution networks (DDRNets) consist of two deep branches between which multiple bilateral fusions are performed. Furthermore, we design a new contextual information extractor named the Deep Aggregation Pyramid Pooling Module (DAPPM), which enlarges the effective receptive field and fuses multi-scale context based on low-resolution feature maps. Our method achieves a new state-of-the-art trade-off between accuracy and speed on the Cityscapes and CamVid datasets. In particular, on a single 2080Ti GPU, DDRNet-23-slim achieves 77.4% mIoU at 102 FPS on the Cityscapes test set and 74.7% mIoU at 230 FPS on the CamVid test set. With widely used test augmentation, our method outperforms most state-of-the-art models while requiring much less computation. The code and trained models are available online.

Keywords —semantic segmentation, real-time, deep convolutional neural network, autonomous driving

1. Introduction

Figure 1: Comparison of speed and accuracy on the Cityscapes test set. Red triangles represent our method, blue triangles represent other methods, and green circles represent architecture search methods.

Semantic segmentation is a fundamental task whose goal is to assign each pixel of an input image to a corresponding label [1]–[3]. It plays an important role in many practical applications, such as medical image segmentation, autonomous driving navigation, and robotics [4], [5]. With the rise of deep learning technology, convolutional neural networks have been applied to image segmentation and have achieved significant advantages over traditional methods based on manual features. Since fully convolutional networks (FCN) [6] were proposed to handle semantic segmentation problems, a series of new networks have been proposed. DeepLab [7] eliminates some downsampling operations in ResNet to maintain high resolution and utilizes convolution operations [8] with large dilation rates to enlarge the receptive field. Since then, the atrous convolution-based backbone network and context extraction module have become widely used standard structures, including DeepLabV2 [9], DeepLabV3 [10], PSPNet [11] and DenseASPP [12].

Since semantic segmentation is a dense prediction task, neural networks need to output high-resolution feature maps with large receptive fields to produce satisfactory results, which is computationally expensive. This issue is particularly critical for scene parsing in autonomous driving, as it requires operating on very large images to cover a wide field of view. Therefore, the methods above are very time-consuming in the inference phase and cannot be directly deployed on actual autonomous vehicles. Some of them cannot even process one image per second because multi-scale testing is used to improve accuracy.

With the growing demand for mobile deployment, real-time segmentation algorithms [13]–[17] have received increasing attention. DFANet [18] adopts deep multi-scale feature aggregation and lightweight depthwise separable convolutions, achieving a test mIoU of 71.3% at 100 FPS. Departing from the encoder-decoder paradigm, the authors of [19] proposed a new bilateral network composed of a spatial path and a context path. In particular, the spatial path uses three relatively wide 3×3 convolutional layers to capture spatial details, while the context path is a compact pre-trained backbone for extracting contextual information. These bilateral methods, including [20], achieved higher inference speeds than encoder-decoder architectures at the time.

Recently, some competitive real-time methods for semantic segmentation of road scenes have been proposed. These methods can be divided into two categories: one relies on GPU-efficient backbones, especially ResNet-18 [21]–[23]; the other develops complex lightweight encoders trained from scratch, among which BiSeNetV2 [24] reached a new peak in real-time performance, achieving 72.6% test mIoU at 156 FPS on the Cityscapes dataset. However, these recent works have not demonstrated the potential for higher-quality results, except for [23], which used additional training data. Some of them lack scalability because of carefully designed architectures and tuned hyperparameters. Furthermore, ResNet-18 offers few advantages given the development of more powerful backbones.

This paper proposes a dual-resolution network for real-time semantic segmentation of high-resolution images, specifically for road driving images. Our DDRNets start from a backbone network and then split into two parallel deep branches with different resolutions. One deep branch generates relatively high-resolution feature maps, and the other extracts rich semantic information through multiple down-sampling operations. Efficient information fusion is performed between the two branches through multiple bilateral connections. Furthermore, we propose a novel module named DAPPM, which inputs low-resolution feature maps, extracts multi-scale contextual information, and merges them in a cascade manner. Before training on the semantic segmentation dataset, the dual-resolution network is trained on ImageNet following a common paradigm.

Based on extensive experimental results on three popular benchmarks, namely Cityscapes, CamVid, and COCOStuff, DDRNets achieve an excellent balance between segmentation accuracy and inference speed. Compared with other real-time algorithms, our method achieves new state-of-the-art accuracy on the Cityscapes and CamVid datasets without using attention mechanisms or additional tricks. Under standard test augmentation, DDRNet is comparable to state-of-the-art models while requiring fewer computational resources. We also report statistics over multiple runs and conduct ablation experiments to analyze the impact of architectural improvements and standard training techniques.

The main contributions are summarized as follows:

  • A series of novel bilateral networks with deep dual-resolution branches and multiple bilateral fusions are proposed as efficient real-time semantic segmentation backbone networks.
  • A novel module is designed to harvest rich contextual information by combining feature aggregation with pyramid pooling. Performed on low-resolution feature maps, it adds little to the inference time.
  • Our method achieves a new state-of-the-art trade-off between accuracy and speed on a 2080Ti, achieving 77.4% mIoU at 102 FPS on the Cityscapes test set and 74.7% mIoU at 230 FPS on the CamVid test set. To the best of our knowledge, ours is the first method to achieve 80.4% mIoU on Cityscapes in near real-time (22 FPS) using only fine annotations.

2. Related work

In recent years, dilated-convolution-based methods have improved semantic segmentation performance in many challenging scenarios. Pioneering works explore further possibilities with lightweight architectures such as encoder-decoder and two-path designs. Furthermore, contextual information has proved to be very important in the scene parsing task. In this section, we group related work into three categories: high-performance semantic segmentation, real-time semantic segmentation, and context extraction modules.

A. High performance semantic segmentation

Figure 2. Comparison of the dilation method, the encoder-decoder method, the two-path method, and our deep dual-resolution network.

Due to the lack of spatial details, the output of the last layer of a common encoder cannot be directly used to predict segmentation masks. If only the downsampling of the classification backbone is removed, the effective receptive field will be too small to learn high-level semantic information. An acceptable strategy is to utilize atrous convolution to establish long-range connections between pixels while removing the last two downsampling layers, as shown in Figure 2(a) [10], [11]. However, this also brings new challenges for real-time inference because of the increased dimensions of high-resolution feature maps and the insufficiently optimized implementations of atrous convolution. In fact, most state-of-the-art models are built on dilated backbones and are therefore not suitable for scene parsing in autonomous driving.

Several works have attempted to explore alternatives to the standard dilated backbone. The authors of DeepLabV3+ [25] proposed a simple decoder that fuses upsampled feature maps with low-level feature maps. It alleviates the requirement to generate high-resolution feature maps directly from atrous convolutions. Although the output stride of the encoder is set to 16, DeepLabV3+ can still achieve competitive results. HRNet [26] emphasizes deep high-resolution representations and is more efficient than dilated backbones. We note that HRNet is more computationally efficient and faster at inference because its fine-resolution streams have much smaller widths. Taking HRNetV2-W48 as an example, the widths of the 1/4-resolution and 1/8-resolution features are 48 and 96 respectively, which are much smaller than those of pre-trained ResNets with atrous convolutions [27]. Although the high-resolution branches of HRNet are much thinner, they can be greatly enhanced by the parallel low-resolution branches and repeated multi-scale fusion.

Our work starts from deep, detailed high-resolution representations and proposes a more compact architecture that maintains a high-resolution representation and extracts high-level contextual information through two concise deep branches.

B. Real-time semantic segmentation

Almost all real-time semantic segmentation models adopt one of two basic designs: the encoder-decoder architecture or the two-path architecture. Both approaches highlight the importance of lightweight encoders.

1) Encoder-decoder architecture: Compared with dilated-convolution-based models, the encoder-decoder architecture intuitively requires less computation and inference time. The encoder is usually a deep network with repeated spatial downsampling to extract contextual information, while the decoder restores resolution through interpolation or transposed convolution [28] to complete dense prediction, as shown in Figure 2(b). In particular, the encoder can be a lightweight backbone pre-trained on ImageNet, or an efficient variant trained from scratch such as ERFNet [5] and ESPNet [16]. SwiftNet [21] takes full advantage of an ImageNet pre-trained encoder and utilizes lightweight lateral connections to assist upsampling. The authors of [29] proposed a strategy of multiple spatial fusion and class boundary supervision. FANet [22] achieves a good balance between speed and accuracy through fast attention modules and extra downsampling throughout the network. SFNet [23] introduces the Flow Alignment Module (FAM) to align feature maps of adjacent levels for better fusion.

2) Two-path architecture: The encoder-decoder architecture reduces computation, but some information lost during repeated downsampling cannot be fully recovered through upsampling, which hurts segmentation accuracy. To alleviate this problem, the two-path architecture [19] was proposed, as shown in Figure 2(c). In addition to a path for extracting semantic information, another shallower path with higher resolution provides rich spatial details as a complement. To further improve accuracy, BiSeNetV2 [24] uses global average pooling for context embedding and proposes an attention-based feature fusion method. The two paths in BiSeNetV1 and V2 are independent from the start, while the two branches in Fast-SCNN [20] share the learning-to-downsample module. CABiNet [30] adopts the overall architecture of Fast-SCNN but uses MobileNetV3 [31] as the context branch.

Compared with existing two-path methods, the deep and fine-resolution branches of DDRNets allow multiple feature fusions and sufficient ImageNet pre-training while ensuring inference efficiency. Our method can also be easily scaled up to achieve higher accuracy (over 80% mIoU on the Cityscapes dataset).

3) Lightweight encoders: There are many computationally efficient backbones that can serve as encoders, such as MobileNet [32], ShuffleNet [33], and small versions of Xception [34]. MobileNet uses depthwise separable convolutions instead of standard convolutions to reduce parameters and computation. The inverted residual block in MobileNetV2 [35] mitigates the strong regularization effect of depthwise separable convolutions. ShuffleNet exploits the compactness of grouped convolutions and proposes a channel shuffle operation to facilitate information exchange between groups. However, these networks contain a large number of depthwise separable convolutions, which cannot be implemented efficiently on existing GPU architectures. Therefore, although ResNet-18 [27] has about six times the FLOPs of MobileNetV2 1.0×, the former has higher inference speed than the latter on a single 1080Ti GPU [21]. Moreover, existing lightweight backbones may not be optimal for semantic segmentation, as they are often over-tuned for image classification.
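As a rough illustration (not taken from the paper) of why depthwise separable convolutions reduce parameters, the sketch below compares the parameter counts of a standard 3×3 convolution and its depthwise separable counterpart; the channel widths are arbitrary. The FLOP reduction is similar, although, as noted above, GPU throughput does not necessarily follow.

```python
# Parameter counts (ignoring biases) for a k x k convolution layer.
# Standard conv: k*k*C_in*C_out.
# Depthwise separable: k*k*C_in (depthwise) + C_in*C_out (1x1 pointwise).
def conv_params(c_in, c_out, k=3):
    standard = k * k * c_in * c_out
    separable = k * k * c_in + c_in * c_out
    return standard, separable

std, sep = conv_params(c_in=128, c_out=128)
print(std, sep, round(std / sep, 1))  # 147456 17536 8.4 (about 8x fewer parameters)
```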

C. Context extraction module

In semantic segmentation, another key issue is how to capture richer contextual information. Atrous Spatial Pyramid Pooling (ASPP) [9] consists of parallel atrous convolutional layers with different dilation rates, which attend to multi-scale contextual information. The Pyramid Pooling Module (PPM) [11] performs pyramid pooling before the convolutional layers and is more computationally efficient than ASPP. Unlike convolution kernels, which are local, the self-attention mechanism is good at capturing global dependencies. Hence, the Dual Attention Network (DANet) [36] exploits both position attention and channel attention to further improve feature representations. The Object Context Network (OCNet) [37] utilizes a self-attention mechanism to explore object context, i.e., the set of pixels belonging to the same object category. The authors of CCNet [38] proposed a criss-cross attention mechanism to improve memory usage and computational efficiency. However, these context extraction modules are designed for and applied to high-resolution feature maps, which is too time-consuming for lightweight models. We instead take low-resolution feature maps as input and enhance the PPM with more scales and deep feature aggregation. When appended to the end of the low-resolution branch, our proposed module performs better than the PPM and the Base-OC module of OCNet.

3. Method

This section describes the entire process, which consists of two main components: a deep dual-resolution network and a deep aggregation pyramid pooling module.

A. Deep dual-resolution network

For convenience, we can add an additional high-resolution branch on a widely used classification backbone network such as ResNet. To strike a balance between resolution and inference speed, we let the high-resolution branch create feature maps with a resolution of 1/8 of the input image resolution. Therefore, the high-resolution branch is appended to the end of the conv3 stage. It should be noted that the high-resolution branch does not contain any downsampling operations and corresponds one-to-one with the low-resolution branch to form a deep high-resolution representation. Then, multiple bilateral feature fusions can be performed at different stages to fully integrate spatial information and semantic information.

Table I Architecture of DDRNet-23-SLIM and DDRNet-39 on Imagenet. 'CONV4×r' means CONV4 is repeated r times. For DDRNet-23-SLIM, r = 1; for DDRNet-39, r = 2.

The detailed architectures of DDRNet-23-slim and DDRNet-39 are shown in Table I. We modify the input stem of the original ResNet by replacing one 7×7 convolutional layer with two consecutive 3×3 convolutional layers. Residual basic blocks are used to build the trunk and the two subsequent branches. To expand the output dimension, a bottleneck block is added at the end of each branch.

Figure 3. Details of bilateral fusion in DDRNet. Point-wise summation is performed before the ReLU.

Bilateral fusion includes fusing the high-resolution branch into the low-resolution branch (high-to-low fusion) and fusing the low-resolution branch into the high-resolution branch (low-to-high fusion). For high-to-low fusion, high-resolution feature maps are downsampled by a sequence of 3×3 convolutions with stride 2 before point-wise summation. For low-to-high fusion, low-resolution feature maps are first compressed by a 1×1 convolution and then upsampled with bilinear interpolation. Figure 3 illustrates how the bilateral fusion is implemented. The $i$-th high-resolution feature map $X_{H_i}$ and low-resolution feature map $X_{L_i}$ can be expressed as:

$$\begin{aligned}
X_{H_i} &= R\big(F_H(X_{H_{i-1}}) + T_{L\rightarrow H}(F_L(X_{L_{i-1}}))\big),\\
X_{L_i} &= R\big(F_L(X_{L_{i-1}}) + T_{H\rightarrow L}(F_H(X_{H_{i-1}}))\big),
\end{aligned} \tag{1}$$

where $F_H$ and $F_L$ denote the sequences of residual basic blocks in the high-resolution and low-resolution branches respectively, $T_{L\rightarrow H}$ and $T_{H\rightarrow L}$ denote the low-to-high and high-to-low transformers, and $R$ denotes the ReLU function.
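The PyTorch sketch below illustrates one bilateral fusion step following Eq. (1) and Figure 3. It is a minimal illustration under assumed channel widths and a single stride-2 downsampling convolution (a larger resolution gap would require chaining several strided convolutions); it is not the authors' exact implementation.

```python
# Minimal sketch of one bilateral fusion step (point-wise summation before ReLU).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BilateralFusion(nn.Module):
    def __init__(self, high_ch=64, low_ch=128):
        super().__init__()
        # high-to-low: 3x3 convolution with stride 2 (one downsampling step)
        self.down = nn.Sequential(
            nn.Conv2d(high_ch, low_ch, 3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(low_ch),
        )
        # low-to-high: 1x1 convolution to compress channels, then bilinear upsampling
        self.compress = nn.Sequential(
            nn.Conv2d(low_ch, high_ch, 1, bias=False),
            nn.BatchNorm2d(high_ch),
        )

    def forward(self, x_high, x_low):
        # x_high = F_H(X_H), x_low = F_L(X_L): outputs of the two branches
        up = F.interpolate(self.compress(x_low), size=x_high.shape[2:],
                           mode="bilinear", align_corners=False)
        new_high = F.relu(x_high + up)               # low-to-high fusion
        new_low = F.relu(x_low + self.down(x_high))  # high-to-low fusion
        return new_high, new_low

# toy usage: 1/8- and 1/16-resolution features of a 1024x2048 input
fuse = BilateralFusion()
h, l = fuse(torch.randn(1, 64, 128, 256), torch.randn(1, 128, 64, 128))
```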
In total, we constructed four dual-resolution networks with different depths and widths. DDRNet-23 is twice as wide as DDRNet-23-slim, and DDRNet-39 1.5× is a wider version of DDRNet-39.

B. Deep aggregation pyramid pooling module (DAPPM)

Figure 5. Detailed architecture of the deep aggregation pyramid pooling module. The number of multi-scale branches can be adjusted according to the input resolution.

Here, we propose a new module to further extract contextual information from low-resolution feature maps. Figure 5 shows the internal structure of DAPPM. Taking feature maps at 1/64 of the image resolution as input, large pooling kernels with exponential strides are used to generate feature maps at 1/128, 1/256, and 1/512 of the image resolution. The input feature maps and the image-level information generated by global average pooling are also utilized. We argue that fusing all multi-scale contextual information with a single 3×3 or 1×1 convolution is not sufficient. Inspired by Res2Net [39], we first upsample the feature maps and then use additional 3×3 convolutions to fuse contextual information of different scales in a hierarchical-residual manner. For an input $x$, each scale $y_i$ can be expressed as:

$$y_i = \begin{cases}
C_{1\times1}(x), & i = 1;\\
C_{3\times3}\big(U\big(C_{1\times1}(P_{j,k}(x))\big) + y_{i-1}\big), & 1 < i < n;\\
C_{3\times3}\big(U\big(C_{1\times1}(P_{\mathrm{global}}(x))\big) + y_{i-1}\big), & i = n,
\end{cases} \tag{2}$$

where $C_{1\times1}$ denotes a 1×1 convolution, $C_{3\times3}$ denotes a 3×3 convolution, $U$ denotes the upsampling operation, $P_{j,k}$ denotes a pooling layer with kernel size $j$ and stride $k$, and $P_{\mathrm{global}}$ denotes global average pooling. Finally, all feature maps are concatenated and compressed by a 1×1 convolution. In addition, a 1×1 projection shortcut is added to ease optimization. Similar to the SPP in SwiftNet, DAPPM uses the BN-ReLU-Conv ordering.
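Below is a minimal PyTorch sketch of a DAPPM-style module following Eq. (2) and Figure 5. The pooling kernel sizes and strides (5/2, 9/4, 17/8, plus global pooling) are chosen so that a 1/64-resolution input yields the 1/128, 1/256, 1/512 and image-level scales described above, and the channel widths are assumptions; consult the official repository for the exact configuration.

```python
# Minimal DAPPM-style module: pyramid pooling + hierarchical-residual fusion.
import torch
import torch.nn as nn
import torch.nn.functional as F

def bn_relu_conv(c_in, c_out, k):
    # BN-ReLU-Conv ordering, as in the SPP of SwiftNet
    return nn.Sequential(nn.BatchNorm2d(c_in), nn.ReLU(inplace=True),
                         nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False))

class DAPPM(nn.Module):
    def __init__(self, c_in=512, c_branch=128, c_out=128):
        super().__init__()
        self.scale0 = bn_relu_conv(c_in, c_branch, 1)            # y_1: input scale
        self.pools = nn.ModuleList([
            nn.AvgPool2d(kernel_size=5, stride=2, padding=2),    # ~1/128 resolution
            nn.AvgPool2d(kernel_size=9, stride=4, padding=4),    # ~1/256 resolution
            nn.AvgPool2d(kernel_size=17, stride=8, padding=8),   # ~1/512 resolution
            nn.AdaptiveAvgPool2d(1),                             # global (image-level)
        ])
        self.compress = nn.ModuleList([bn_relu_conv(c_in, c_branch, 1) for _ in self.pools])
        self.fuse = nn.ModuleList([bn_relu_conv(c_branch, c_branch, 3) for _ in self.pools])
        self.final = bn_relu_conv(c_branch * (len(self.pools) + 1), c_out, 1)
        self.shortcut = bn_relu_conv(c_in, c_out, 1)             # 1x1 projection shortcut

    def forward(self, x):
        size = x.shape[2:]
        ys = [self.scale0(x)]
        for pool, comp, fuse in zip(self.pools, self.compress, self.fuse):
            up = F.interpolate(comp(pool(x)), size=size,
                               mode="bilinear", align_corners=False)
            ys.append(fuse(up + ys[-1]))          # hierarchical-residual fusion
        return self.final(torch.cat(ys, dim=1)) + self.shortcut(x)

out = DAPPM()(torch.randn(1, 512, 16, 16))        # 1024x1024 input -> 1/64 = 16x16
```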

Table II Considering an image with an input size of 1024×1024, the context dimensions generated by PPM and DAPPM are as follows:

Within DAPPM, the context extracted by larger pooling kernels is integrated into deeper information flows, forming multi-scale features that combine different depths and pooling sizes. Table II shows that DAPPM provides richer contextual information than PPM. Although DAPPM contains more convolutional layers and a more complex fusion strategy, it hardly affects inference speed since its input resolution is only 1/64 of the image resolution; for a 1024×1024 image, the largest feature map involved is 16×16.

C. Overall architecture of semantic segmentation

Figure 4. Overview of DDRNet for semantic segmentation. "RB" stands for sequential residual basic blocks. "RBB" stands for a single residual bottleneck block. "DAPPM" stands for the Deep Aggregation Pyramid Pooling Module. "Seg. Head" stands for the segmentation head. Solid black lines represent information paths with data processing (including upsampling and downsampling), and black dashed lines represent information paths without data processing. "sum" means point-wise summation. Dashed boxes represent components that are discarded during inference.

Our approach is summarized in Figure 4. Some modifications are made to the dual-resolution network for the semantic segmentation task. First, the stride of the 3×3 convolution in the RBB of the low-resolution branch is set to 2 for further downsampling. Then, DAPPM is added to the output of the low-resolution branch to extract rich contextual information from the high-level feature maps at 1/64 of the image resolution. Furthermore, the final high-to-low fusion is replaced by a low-to-high fusion implemented with bilinear interpolation and summation. Finally, we design a simple segmentation head consisting of a 3×3 convolutional layer followed by a 1×1 convolutional layer. The computational load of the segmentation head can be adjusted by changing the output dimension of the 3×3 convolutional layer: we set it to 64 for DDRNet-23-slim, 128 for DDRNet-23, and 256 for DDRNet-39. Note that all modules except the segmentation head and DAPPM are pre-trained on ImageNet.
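A minimal sketch of the simple segmentation head described above (a 3×3 convolutional layer followed by a 1×1 convolutional layer) is shown below; the BN/ReLU placement, the input width, and the final 8× bilinear upsampling back to input resolution are illustrative assumptions.

```python
# Minimal segmentation head: 3x3 conv -> 1x1 conv -> upsample to input resolution.
import torch.nn as nn
import torch.nn.functional as F

class SegHead(nn.Module):
    def __init__(self, c_in=128, c_mid=64, num_classes=19, scale=8):
        super().__init__()
        self.scale = scale  # features arrive at 1/8 of the input resolution
        self.conv3x3 = nn.Sequential(
            nn.Conv2d(c_in, c_mid, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True))
        self.conv1x1 = nn.Conv2d(c_mid, num_classes, 1)  # per-pixel class logits

    def forward(self, x):
        logits = self.conv1x1(self.conv3x3(x))
        return F.interpolate(logits, scale_factor=self.scale,
                             mode="bilinear", align_corners=False)
```

Here c_mid corresponds to the adjustable output dimension of the 3×3 convolution (64 for DDRNet-23-slim).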

D. Deep supervision

Adding extra supervision during training can ease the optimization of deep convolutional neural networks (DCNNs). In PSPNet, the output of the res4_22 block of ResNet-101 is supervised by an auxiliary loss whose weight is set to 0.4 according to experimental results [11]. BiSeNetV2 proposes a booster training strategy that adds extra segmentation heads at the end of each stage of the semantic branch. However, extensive experimentation is required to find the weights that balance each loss, and the strategy significantly increases training memory. To obtain better results, SFNet uses a similar strategy called cascaded deeply supervised learning [23]. In this paper, we only employ simple extra supervision for a fair comparison with most methods. As shown in Figure 4, we add an auxiliary loss with weight 0.4, the same as in PSPNet. During testing, the auxiliary segmentation head is discarded. The final loss is a weighted sum of cross-entropy losses, which can be expressed as:
$$L_f = L_n + \alpha L_a \tag{3}$$
where $L_f$, $L_n$, and $L_a$ denote the final loss, the normal loss, and the auxiliary loss respectively, and $\alpha$ is the weight of the auxiliary loss, set to 0.4 in this paper.
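The weighted loss of Eq. (3) can be computed as in the sketch below, using plain cross-entropy with α = 0.4; the ignore index (255, the usual Cityscapes convention) and the omission of the OHEM variant mentioned later are assumptions of this illustration.

```python
# Minimal deep-supervision loss: L_f = L_n + alpha * L_a  (Eq. 3)
import torch.nn.functional as F

def segmentation_loss(main_logits, aux_logits, target, alpha=0.4):
    l_n = F.cross_entropy(main_logits, target, ignore_index=255)  # normal loss
    l_a = F.cross_entropy(aux_logits, target, ignore_index=255)   # auxiliary loss
    return l_n + alpha * l_a
```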

4. Experiment

A. Dataset

Cityscapes [40] is one of the well-known datasets focusing on urban street scene parsing. It contains 2975 finely annotated training images, 500 validation images, and 1525 test images. We do not use the additional 20,000 roughly annotated images during training. This dataset has a total of 19 categories that can be used for semantic segmentation tasks. The resolution of the image is 2048×1024, which is challenging for real-time semantic segmentation.

CamVid [41] contains 701 densely annotated frames, each with a resolution of 960×720. It includes 367 training images, 101 validation images and 233 testing images. We combine the training and validation sets for training and evaluate our model on the test set using 11 categories following previous studies [18], [19], [21].

COCOStuff [42] provides 10K complex images densely annotated with 182 categories, including 91 object categories and 91 scene categories. It should be noted that 11 object categories do not have any segmentation annotations. We follow the split in [42] (9K for training, 1K for testing) for fair comparison.

B. Training settings

Table III Top-1 error rate, parameter size and GFLOPS of four scaled DDRNets:

Before fine-tuning on the semantic segmentation datasets, the dual-resolution networks were trained on the ImageNet dataset, following the same data augmentation strategy as previous works [27], [44]. All models were trained on four 2080Ti GPUs for 100 epochs with an input resolution of 224×224 and a batch size of 256. The initial learning rate is set to 0.1 and is reduced by a factor of 10 at the 30th, 60th, and 90th epochs. We train all networks using SGD with a weight decay of 0.0001 and Nesterov momentum of 0.9. Table III shows the Top-1 error rates on the ImageNet validation set. Although DDRNet is not as efficient as many well-designed lightweight backbones on ImageNet, it still achieves good results on semantic segmentation benchmarks given the speed trade-off. The training settings for Cityscapes, CamVid, and COCOStuff are as follows:
1) Cityscapes: We use the SGD optimizer with an initial learning rate of 0.01, momentum of 0.9, and weight decay of 0.0005, following the "poly" learning rate policy with a power of 0.9 to decay the learning rate (a sketch of this schedule is given after this list). Data augmentation includes random cropping, random scaling in the range of 0.5 to 2.0, and random horizontal flipping. Following [18], [29], [23], images are randomly cropped to 1024×1024 for training. All models are trained with a batch size of 12 for 484 epochs (~120K iterations) on four 2080Ti GPUs with synchronized BN. For models evaluated on the test server, both the train and val sets are used during training. For a fair comparison with [24] and [23], we also use Online Hard Example Mining (OHEM) [50].
2) CamVid : We set the initial learning rate to 0.001 and train all models for 968 epochs. Following [18], images are randomly cropped to 960×720 for training. All models are trained on a single GPU, and other training details are the same as Cityscapes. When pre-training with Cityscapes, we fine-tune the model for 200 epochs.
3) COCOStuff : The initial learning rate is 0.001, and the total number of training epochs is 110. We resize the short side of the image to 640 before data augmentation. Same as BiSeNetV2 [24], the crop size is 640×640. Other training details are the same as Cityscapes, but with weight decay of 0.0001. During the inference phase, we fixed the image resolution to 640×640.
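As referenced above, here is a minimal sketch of the "poly" learning rate policy used for all three datasets; the function and variable names are illustrative.

```python
# 'Poly' learning rate policy: lr = base_lr * (1 - iter / max_iter) ** power
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    return base_lr * (1 - cur_iter / max_iter) ** power

# e.g. Cityscapes: base_lr = 0.01, ~120K iterations
lr_at_60k = poly_lr(0.01, 60_000, 120_000)   # ~0.0054
```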

C. Measurement of inference speed and accuracy

Inference speed was measured using a single GTX 2080Ti GPU with batch size set to 1, using CUDA 10.0, CUDNN 7.6 and PyTorch 1.3. Similar to MSFNet and SwiftNet, we exclude the batch normalization layer after the convolutional layer because it can be integrated into the convolution during inference. We use the protocol established by [51] for a fair comparison (image sizes: 2048×1024 for Cityscapes, 960×720 for CamVid, 640×640 for COCOStuff).
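For reference, the sketch below shows a generic way to fold a BatchNorm layer into the preceding convolution for inference in PyTorch; it illustrates the idea mentioned above and is not the benchmarking code used by the authors.

```python
# Fold BatchNorm statistics into the preceding convolution for inference.
import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding,
                      dilation=conv.dilation, groups=conv.groups, bias=True)
    # BN(conv(x)) = scale * (W x + b - mean) + beta, with scale = gamma / sqrt(var + eps)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
    conv_bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused
```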

Similar to ResNet [27], we report the best results, average results and standard deviation of four experiments, except that the accuracy on the Cityscapes test set is provided by the official server.

D. Speed and accuracy comparison

Table IV Accuracy and speed comparison on Cityscapes dataset. We report results on both the validation and test sets. Since the inference speed of different models is measured under different conditions, the corresponding GPU models and input resolutions are reported. Our GFLOPS calculation takes a 2048×1024 pixel image as input. If marked with †, the corresponding speed was measured using TensorRT acceleration.

1) Cityscapes: From Table IV and Figure 1, it can be seen that our method achieves a new state-of-the-art balance between real-time performance and accuracy. In particular, DDRNet-23-slim (our smallest model) achieves 77.4% mIoU at 102 FPS on the test set. It achieves 6.1% higher mIoU than DFANet A and MSFNet* at similar inference speeds, and is about 2.5× faster than MSFNet. Furthermore, it is 40% faster than the smallest SFNet while achieving a 2.9% mIoU improvement on the test set. Notably, our method also outperforms architecture-search-based real-time semantic segmentation methods such as CAS and GAS at similar inference speeds. Among the wider models, DDRNet-23 obtains the overall best accuracy among the real-time methods in Table IV, achieving 79.4% mIoU at 37 FPS. DDRNet-23 improves accuracy by 0.5% compared with SFNet (ResNet-18) and runs twice as fast.

Scaling DDRNet up further, we achieve 80.4% mIoU at 22 FPS on the Cityscapes test server using only finely annotated data. Benefiting from the Mapillary [52] dataset and TensorRT acceleration as in [23], our method could push real-time semantic segmentation of road scenes even further. On the Cityscapes validation set, DDRNet-23-slim outperforms all published results in Table IV with only 36.3 GFLOPs and 5.7M parameters, and DDRNet-23 achieves a new overall best result of 79.5% mIoU. Figure 6 shows visualization results of DDRNet-23-slim and DDRNet-23 in different scenarios.

Figure 6. Visual segmentation results on the Cityscapes validation set. The four columns from left to right represent the input image, ground truth annotation, the output of DDRNet-23-slim and the output of DDRNet-23 respectively. The first four rows show the performance of the two models, while the last two rows represent some cases of segmentation failure.
Table V Accuracy and speed comparison on CAMVID dataset. MSFNet runs at 1024×768, MSFNet* runs at 768×512, while other methods run at 960×720. If marked †, measurements were made using TensorRT acceleration.

2) CamVid: As shown in Table V, DDRNet-23-slim achieves 74.7% mIoU at 230 FPS on the CamVid test set without pre-training on Cityscapes. It achieves the second highest accuracy and runs faster than all other methods. In particular, DDRNet-23 outperforms the previous state-of-the-art method MSFNet. DDRNet-23 also achieves a larger performance improvement over BiSeNetV2-L and SFNet (ResNet-18) while running about twice as fast. Given that CamVid has far fewer training pixels than Cityscapes, we believe DDRNet's superior performance is partly due to sufficient ImageNet pre-training. Additionally, our models pre-trained on Cityscapes achieve new state-of-the-art accuracy at real-time speed. In particular, DDRNet-23 pre-trained on Cityscapes achieves 80.6% mIoU at 94 FPS, which is both more accurate and faster than BiSeNetV2-L. The corresponding visualization results are shown in Figure 7.

Figure 7. Visual segmentation results on the CamVid test set. Labels ignored during testing are colored black. The three columns from left to right show the input image, the ground truth annotations, and the output of DDRNet-23. The first four rows show successful samples, while the last two rows show cases where segmentation failed.
Table VI Accuracy and speed comparison on COCO-Stuff dataset. The input resolution is 640×640, and the results of PSPNet50 are from [24]. If marked †, measurements were made using TensorRT acceleration.

3) COCOStuff: We also validate our method on COCOStuff, a more challenging dataset with rich categories for real-time semantic segmentation. Since the image resolution is smaller than in the other two datasets, the stride of the RBB in the low-resolution branch is set to 1. The time to resize images and predict masks is not included in the statistics. Table VI shows that our method has a clear advantage over BiSeNetV2 in this very challenging scenario. Our DDRNet-23 achieves accuracy similar to PSPNet50 while running about 20 times faster.

E. Comparison with the latest existing results

Table VII State-of-the-art models on the Cityscapes test set. OS represents the final output stride. All methods train models on both training and validation sets, except PSPNet marked with † which only uses the training set. The GFLOPS calculation takes a 1024 × 1024 pixel image as input, and most results on GFLOPS and parameters can be found in [23].

In this section, we further demonstrate the capability of DDRNet for semantic segmentation and compare it with state-of-the-art models on the Cityscapes test set. These methods often employ multi-scale and horizontal-flip inference to achieve better results regardless of time cost. For a fair comparison, we also adopt multi-scale inference (0.50×, 0.75×, 1×, 1.25×, 1.5×, 1.75×, 2×) together with horizontal flipping. Table VII shows that standard test augmentation improves the accuracy of DDRNet-39 from 80.4% to 81.9%. Our DDRNet-39 outperforms many powerful models that integrate self-attention modules, such as CCNet, DANet, and OCNet. Notably, our method requires only 11% of the computation of DANet. DDRNet-39 also surpasses SFNet based on the ResNet-101 backbone, the state-of-the-art method for real-time semantic segmentation, while requiring only 34% of its computation. DDRNet-39 1.5×, which is close in size to the other models in Table VII, achieves a very competitive 82.4% mIoU.

F. Comparison with HRNet

Table VIII Comparative experiments between DDRNet and HRNet, using MIOU, FPS and training memory as indicators:

The main difference between DDRNet and HRNet is the number of branches. In addition, we append a multi-scale context extraction module at the end of the low-resolution branch. The experimental results in Table VIII show that DDRNet outperforms HRNet in terms of inference time and training memory usage. We obtained the validation results of the two smaller HRNets from the official implementation. Training memory is measured on a single 2080Ti with a batch size of 2 and a crop size of 1024×512, excluding the auxiliary segmentation head.

G. Ablation experiments on Cityscapes

Table IX The impact of standard training tricks on the results, including deep supervision (DS), OHEM, and training with 1024×1024 crops (default 1024×512):

1) Standard techniques: We analyze the impact of some standard training techniques on performance, which are also adopted by the recent state-of-the-art method SFNet [23]. As shown in Table IX, training with deep supervision, OHEM, and a larger crop size improves accuracy from 76.1% to 77.8% mIoU.

Table X: Comparison of DAPPM and other context extraction modules. RES2 stands for RES2NET module and BASE-OC is the object context module proposed in [37]:

2) DAPPM: We compare DAPPM with pyramid pooling (PPM), a self-attention module (Base-OC), and the Res2Net module. The results in Table X show that the proposed module improves scene parsing performance from 74.1% mIoU to 77.8% mIoU while leaving inference speed almost unaffected. DAPPM also achieves about a 1% mIoU gain over PPM and Res2, whereas Base-OC performs relatively poorly on low-resolution feature maps.

Table XI: Ablation study of dual-resolution networks. The baseline is adapted from BiSeNetV2 by replacing the complex semantic branch with a low-resolution branch. '+THINNER DETAIL BRANCH' means halving the width of the detail branch. '+CONV3' means appending the detail branch to the end of the CONV3 stage. '+RESIDUAL' means replacing the 3×3 convolutions with residual basic blocks. '+BOTTLENECK' means adding a bottleneck block at the end of each branch. '+LOW-TO-HIGH FUSION' or '+BILATERAL FUSION' means performing multiple low-to-high fusions or bilateral fusions:

3) Dual-resolution network: To speed up the experiments, we train all bilateral networks from scratch with an initial learning rate of 0.05, a crop size of 1024×512, 600 epochs in total, and without OHEM. As shown in Table XI, using a thinner detail branch results in a 1.3% accuracy drop but runs much faster than the baseline. Attaching the detail branch to an intermediate layer of the network helps generate deep high-resolution representations and improves inference speed because it avoids computation at higher resolutions. The bottleneck block expands the feature dimension, producing richer features for DAPPM and the final segmentation head. Bilateral fusion further improves segmentation accuracy at a small time cost. Overall, our dual-resolution network achieves better performance than the baseline while requiring fewer resources and less time.

5. Conclusion

This paper focuses on real-time and accurate semantic segmentation of road scenes and proposes a simple solution without using additional fancy designs. In particular, a novel deep dual-resolution network is proposed as an efficient backbone structure for real-time semantic segmentation. A new module is also designed to extract multi-scale contextual information from low-resolution feature maps. To the best of our knowledge, we are the first to introduce deep high-resolution representations to real-time semantic segmentation, and our simple strategy outperforms previous real-time models on three popular benchmarks. DDRNet mainly consists of residual basic blocks and bottleneck blocks, providing a wide range of speed and accuracy trade-offs by adjusting model width and depth. Due to the simplicity and efficiency of our method, it can be considered as a powerful baseline for achieving real-time and high-precision semantic segmentation. Further research will focus on improving the baseline and transferring the backbone network to other downstream tasks.

REFERENCES

[1] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang, “Deep learning markov
random field for semantic segmentation,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 40, no. 8, pp. 1814–1828, 2018.
[2] L. Jing, Y. Chen, and Y. Tian, “Coarse-to-fine semantic segmentation
from image-level labels,” IEEE Transactions on Image Processing,
vol. 29, pp. 225–236, 2020.
[3] X. Ren, S. Ahmad, L. Zhang, L. Xiang, D. Nie, F. Yang, Q. Wang,
and D. Shen, “Task decomposition and synchronization for semantic
biomedical image segmentation,” IEEE Transactions on Image Processing, vol. 29, pp. 7497–7510, 2020.
[4] M. Saha and C. Chakraborty, “Her2net: A deep framework for semantic
segmentation and classification of cell membranes and nuclei in breast
cancer evaluation,” IEEE Transactions on Image Processing, vol. 27,
no. 5, pp. 2189–2200, 2018
[5] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, “Erfnet: Efficient residual factorized convnet for real-time semantic segmentation,”
IEEE Transactions on Intelligent Transportation Systems, vol. 19, no. 1,
pp. 263–272, 2017.
[6] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks
for semantic segmentation,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.
[7] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille,
“Semantic image segmentation with deep convolutional nets and fully
connected crfs,” arXiv preprint arXiv:1412.7062, 2014.
[8] S. Mallat, A wavelet tour of signal processing. Elsevier, 1999.
[9] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille,
“Deeplab: Semantic image segmentation with deep convolutional nets,
atrous convolution, and fully connected crfs,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848,
2017.
[10] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking
atrous convolution for semantic image segmentation,” arXiv preprint
arXiv:1706.05587, 2017.
[11] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing
network,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 2881–2890, 2017.
[12] M. Yang, K. Yu, C. Zhang, Z. Li, and K. Yang, “Denseaspp for semantic
segmentation in street scenes,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 3684–3692, 2018.
[13] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello, “Enet: A deep
neural network architecture for real-time semantic segmentation,” arXiv
preprint arXiv:1606.02147, 2016.
[14] Z. Yang, H. Yu, M. Feng, W. Sun, X. Lin, M. Sun, Z. H. Mao,
and A. Mian, “Small object augmentation of urban scenes for realtime semantic segmentation,” IEEE Transactions on Image Processing,
vol. 29, pp. 5175–5190, 2020.
[15] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia, “Icnet for real-time
semantic segmentation on high-resolution images,” in Proceedings of
the European Conference on Computer Vision, pp. 405–420, 2018.
[16] S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi,
“Espnet: Efficient spatial pyramid of dilated convolutions for semantic
segmentation,” in Proceedings of the European Conference on Computer
Vision, pp. 552–568, 2018.
[17] B. Jiang, W. Tu, C. Yang, and J. Yuan, “Context-integrated and featurerefined network for lightweight object parsing,” IEEE Transactions on
Image Processing, vol. 29, pp. 5079–5093, 2020.
[18] H. Li, P. Xiong, H. Fan, and J. Sun, “Dfanet: Deep feature aggregation
for real-time semantic segmentation,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 9522–
9531, 2019.
[19] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Bisenet:
Bilateral segmentation network for real-time semantic segmentation,” in
Proceedings of the European Conference on Computer Vision, pp. 325–
341, 2018.
[20] R. P. Poudel, S. Liwicki, and R. Cipolla, “Fast-scnn: Fast semantic
segmentation network,” arXiv preprint arXiv:1902.04502, 2019.
[21] M. Orsic, I. Kreso, P. Bevandic, and S. Segvic, “In defense of pre-trained
imagenet architectures for real-time semantic segmentation of roaddriving images,” in Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pp. 12607–12616, 2019.
[22] P. Hu, F. Perazzi, F. C. Heilbron, O. Wang, Z. Lin, K. Saenko, and
S. Sclaroff, “Real-time semantic segmentation with fast attention,” arXiv
preprint arXiv:2007.03815, 2020.
[23] X. Li, A. You, Z. Zhu, H. Zhao, M. Yang, K. Yang, and Y. Tong,
“Semantic flow for fast and accurate scene parsing,” arXiv preprint
arXiv:2002.10120, 2020.
[24] C. Yu, C. Gao, J. Wang, G. Yu, C. Shen, and N. Sang, “Bisenet
v2: Bilateral network with guided aggregation for real-time semantic
segmentation,” arXiv preprint arXiv:2004.02147, 2020.
[25] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, “Encoderdecoder with atrous separable convolution for semantic image segmentation,” in Proceedings of the European Conference on Computer Vision,
pp. 801–818, 2018.
[26] K. Sun, Y. Zhao, B. Jiang, T. Cheng, B. Xiao, D. Liu, Y. Mu, X. Wang,
W. Liu, and J. Wang, “High-resolution representations for labeling pixels
and regions,” arXiv preprint arXiv:1904.04514, 2019.
[27] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pp. 770–778, 2016.
[28] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus, “Deconvolutional networks,” in 2010 IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, pp. 2528–2535.
[29] H. Si, Z. Zhang, F. Lv, G. Yu, and F. Lu, “Real-time semantic
segmentation via multiply spatial fusion network,” arXiv preprint
arXiv:1911.07217, 2019.
[30] S. Kumaar, Y. Lyu, F. Nex, and M. Y. Yang, “Cabinet: Efficient
context aggregation network for low-latency semantic segmentation,”
arXiv preprint arXiv:2011.00993, 2020.
[31] A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang,
Y. Zhu, R. Pang, V. Vasudevan, et al., “Searching for mobilenetv3,” in
Proceedings of the IEEE International Conference on Computer Vision,
pp. 1314–1324, 2019.
[32] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang,
T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint
arXiv:1704.04861, 2017.
[33] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet: An extremely efficient convolutional neural network for mobile devices,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 6848–6856, 2018.
[34] F. Chollet, “Xception: Deep learning with depthwise separable convolutions,” in Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pp. 1251–1258, 2017.
[35] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen,
“Mobilenetv2: Inverted residuals and linear bottlenecks,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 4510–4520, 2018.
[36] J. Fu, J. Liu, H. Tian, Y. Li, Y. Bao, Z. Fang, and H. Lu, “Dual attention
network for scene segmentation,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pp. 3146–3154, 2019.
[37] Y. Yuan and J. Wang, “Ocnet: Object context network for scene parsing,”
arXiv preprint arXiv:1809.00916, 2018.
[38] Z. Huang, X. Wang, L. Huang, C. Huang, Y. Wei, and W. Liu, “Ccnet:
Criss-cross attention for semantic segmentation,” in Proceedings of the
IEEE International Conference on Computer Vision, pp. 603–612, 2019.
[39] S. Gao, M.-M. Cheng, K. Zhao, X.-Y. Zhang, M.-H. Yang, and P. H. Torr,
“Res2net: A new multi-scale backbone architecture,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, 2019.
[40] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset
for semantic urban scene understanding,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 3213–
3223, 2016.
[41] G. J. Brostow, J. Fauqueur, and R. Cipolla, “Semantic object classes
in video: A high-definition ground truth database,” Pattern Recognition
Letters, vol. 30, no. 2, pp. 88–97, 2009.
[42] H. Caesar, J. Uijlings, and V. Ferrari, “Coco-stuff: Thing and stuff classes
in context,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, pp. 1209–1218, 2018.
[43] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma,
Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al., “Imagenet large
scale visual recognition challenge,” International Journal of Computer
Vision, vol. 115, no. 3, pp. 211–252, 2015.
[44] S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual
transformations for deep neural networks,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 1492–
1500, 2017.
[45] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 39,
no. 12, pp. 2481–2495, 2017.
[46] M. Treml, J. Arjona-Medina, T. Unterthiner, R. Durgesh, F. Friedmann,
P. Schuberth, A. Mayr, M. Heusel, M. Hofmarcher, M. Widrich, et al.,
“Speeding up semantic segmentation for autonomous driving,” in MLITS, NIPS Workshop, vol. 2, 2016.
[47] Y. Zhang, Z. Qiu, J. Liu, T. Yao, D. Liu, and T. Mei, “Customizable architecture search for semantic segmentation,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 11641–
11650, 2019.
[48] P. Lin, P. Sun, G. Cheng, S. Xie, X. Li, and J. Shi, “Graph-guided architecture search for real-time semantic segmentation,” in Proceedings of
the IEEE/CVF Conference on Computer Vision and Pattern Recognition,
pp. 4203–4212, 2020.
[49] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu,
Y. Mu, M. Tan, X. Wang, W. Liu, and B. Xiao, “Deep high-resolution
representation learning for visual recognition,” IEEE Transactions on
Pattern Analysis and Machine Intelligence, pp. 1–1, 2020.
[50] A. Shrivastava, A. Gupta, and R. Girshick, “Training region-based object
detectors with online hard example mining,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 761–769,
2016.
[51] W. Chen, X. Gong, X. Liu, Q. Zhang, Y. Li, and Z. Wang, “Fasterseg:
Searching for faster real-time semantic segmentation,” arXiv preprint
arXiv:1912.10917, 2019.
[52] G. Neuhold, T. Ollmann, S. Rota Bulo, and P. Kontschieder, “The
mapillary vistas dataset for semantic understanding of street scenes,” in
Proceedings of the IEEE International Conference on Computer Vision,
pp. 4990–4999, 2017.
[53] S. Chandra, C. Couprie, and I. Kokkinos, “Deep spatio-temporal random
fields for efficient video segmentation,” in Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pp. 8915–
8924, 2018.
[54] Z. Huang, X. Wang, Y. Wei, L. Huang, H. Shi, W. Liu, and T. S.
Huang, “Ccnet: Criss-cross attention for semantic segmentation,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1,
2020.
[55] R. Zhang, S. Tang, Y. Zhang, J. Li, and S. Yan, “Scale-adaptive convolutions for scene parsing,” in Proceedings of the IEEE International
Conference on Computer Vision, pp. 2031–2039, 2017.
[56] S. Kong and C. C. Fowlkes, “Recurrent scene parsing with perspective
understanding in the loop,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pp. 956–965, 2018.
[57] Z. Wu, C. Shen, and A. Van Den Hengel, “Wider or deeper: Revisiting
the resnet model for visual recognition,” Pattern Recognition, vol. 90,
pp. 119–133, 2019.
[58] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang, “Learning a discriminative feature network for semantic segmentation,” in Proceedings
of the IEEE Conference on Computer Vision and Pattern Recognition,
pp. 1857–1866, 2018.
[59] H. Zhao, Y. Zhang, S. Liu, J. Shi, C. Change Loy, D. Lin, and J. Jia,
“Psanet: Point-wise spatial attention network for scene parsing,” in
Proceedings of the European Conference on Computer Vision (ECCV),
pp. 267–283, 2018.
[60] Y. Yuan, X. Chen, and J. Wang, “Object-contextual representations for
semantic segmentation,” arXiv preprint arXiv:1909.11065, 2019

Origin blog.csdn.net/wagnbo/article/details/131095555