Image Segmentation - Fast-SCNN: Fast Semantic Segmentation Network (arXiv 2019)

Disclaimer: This translation is only a personal study record

Summary

  The encoder-decoder framework is the state of the art for offline semantic image segmentation. With the rise of autonomous systems, real-time computation is becoming increasingly desirable. In this paper, we introduce Fast Segmentation Convolutional Neural Network (Fast-SCNN), a real-time semantic segmentation model for high-resolution image data (1024×2048px) suited to efficient computation on embedded devices with low memory. Building on existing two-branch methods for fast segmentation, we introduce our "learning to downsample" module, which computes low-level features for multiple resolution branches simultaneously. Our network combines spatial detail at high resolution with deep features extracted at low resolution, yielding 68.0% mean intersection over union (mIoU) on Cityscapes at 123.5 frames per second. We also show that large-scale pre-training is unnecessary. We thoroughly validate this finding in experiments with ImageNet pre-training and with the coarsely labeled data of Cityscapes. Finally, we show that Fast-SCNN runs even faster, with competitive results, on subsampled inputs without any network modification.

1 Introduction

  Fast semantic segmentation is particularly important in real-time applications, where the input must be parsed quickly to facilitate responsive interaction with the environment. Due to the growing interest in autonomous systems and robotics, research on real-time semantic segmentation has recently gained enormous popularity [21, 34, 17, 25, 36, 20]. We emphasize that faster-than-real-time performance is in fact often necessary, since semantic labeling is usually only a preprocessing step for other time-critical tasks. Furthermore, real-time semantic segmentation on embedded devices (without access to powerful GPUs) enables many additional applications, such as augmented reality for wearables. We observe that in the literature, semantic segmentation is usually addressed by deep convolutional neural networks (DCNNs) with an encoder-decoder framework [29, 2], while many runtime-efficient implementations adopt two-branch or multi-branch architectures [21, 34, 17]. It is usually the case that:

  • A larger receptive field is important for learning complex correlations (i.e., global context) between target classes,

  • Spatial detail in the image is necessary to preserve object boundaries, and

  • A specific design is needed to balance speed and accuracy (instead of retargeting classification DCNNs).

  Specifically, in a two-branch network, the deeper branch is used to capture global context at low resolution, while the shallower branch is set to learn spatial details at full input resolution. The final semantic segmentation result is then provided by merging the two. Importantly, since the computational cost of deeper networks can be overcome with smaller input sizes, and full-resolution execution is used for only a few layers, real-time performance is achievable on modern GPUs. Compared with the encoder-decoder framework, in the two-branch approach, the initial convolutions of different resolutions are not shared. It is worth noting here that Guided Upsampling Network (GUN) [17] and Image Cascade Network (ICNet) [36] only share weights between the first few layers, but not computation.

  In this work, we propose Fast Segmentation Convolutional Neural Network (Fast-SCNN), a real-time semantic segmentation algorithm that merges the two-branch setup of prior art [21, 34, 17, 36] with the classical encoder-decoder framework [29, 2] (Fig. 1). Building on the observation that the initial layers of a DCNN extract low-level features [35, 19], we share the computation of the initial layers between the two branches. We call this technique learning to downsample. The effect is similar to a skip connection in encoder-decoder models, but the skip is used only once to maintain runtime efficiency, and the module is kept shallow to ensure efficient feature sharing. Finally, our Fast-SCNN employs efficient depthwise separable convolutions [30, 10] and inverted residual blocks [28].


Figure 1. Fast-SCNN shares the computation between the two branches (encoder) to build an above-real-time semantic segmentation network.

  Applied to Cityscapes [6], Fast-SCNN yields a mean intersection over union (mIoU) of 68.0% and runs twice as fast as the state-of-the-art BiSeNet (71.4% mIoU) [34].

  While we use only 1.11 million parameters, most offline segmentation methods (such as DeepLab [4] and PSPNet [37]) and even some real-time algorithms (such as GUN [17] and ICNet [36]) require far more. The model capacity of Fast-SCNN is kept deliberately low. The reasons are twofold: (i) lower memory requirements enable execution on embedded devices, and (ii) better generalization is expected. In particular, pre-training on ImageNet [27] is often suggested to improve accuracy and generality [37]. In our work, we investigate the effect of pre-training on the low-capacity Fast-SCNN. Contrary to the trend observed for high-capacity networks, we find that pre-training or additional coarsely labeled training data only yields insignificant improvements (+0.5% mIoU on Cityscapes [6]). In summary, our contributions are:

  1. We propose Fast-SCNN, a competitive (68.0% mIoU) above-real-time (123.5 fps) semantic segmentation algorithm for high-resolution images (1024×2048px).

  2. We adapt the skip connection, popular in offline DCNNs, and propose a shallow learning to downsample module for fast and efficient multi-branch low-level feature extraction.

  3. We specifically design Fast-SCNN to have low capacity, and we empirically verify that running training for more epochs is as successful as pre-training on ImageNet or training with additional coarsely labeled data on our small-capacity network.

  Furthermore, we apply Fast-SCNN to subsampled input data, achieving state-of-the-art performance without any network redesign.

2. Related work

  We discuss and compare frameworks for semantic image segmentation, with a special focus on real-time execution with low energy and memory requirements [2, 20, 21, 36, 34, 17, 25, 18].

2.1 Basis of Semantic Segmentation

  State-of-the-art DCNNs for semantic segmentation combine two separate modules: an encoder and a decoder. The encoder module uses a combination of convolution and pooling operations to extract DCNN features. The decoder module recovers spatial detail from the sub-resolution features and predicts the object labels (i.e., the semantic segmentation) [29, 2]. Most commonly, the encoder is adapted from a simple classification DCNN such as VGG [31] or ResNet [9], with the fully connected layers removed for semantic segmentation.

  The seminal fully convolutional network (FCN) [29] lays the foundation for most modern segmentation architectures. Specifically, FCN employs VGG [31] as an encoder combined with skip connections from lower layers for bilinear upsampling to recover spatial details. U-Net [26] further exploits spatial details using dense skip connections.

  Later, inspired by image-level context priors that predate DCNNs [13, 16], the pyramid pooling module of PSPNet [37] and the atrous spatial pyramid pooling (ASPP) of DeepLab [4] were employed to encode and exploit global context.

  Other competitive base segmentation architectures use conditional random fields (CRF) [38, 3] or recurrent neural networks [32, 38]. However, none of them run in real time.

  Similar to object detection [23, 24, 15], speed becomes an important factor in the design of image segmentation systems [21, 34, 17, 25, 36, 20]. SegNet [2] introduces an encoder-decoder joint model on the basis of FCN, becoming one of the earliest efficient segmentation models. Following SegNet, ENet [20] also designs an encoder-decoder with fewer layers to reduce computational cost.

  More recently, two-branch and multiple-branch systems have been introduced. ICNet [36], ContextNet [21], BiSeNet [34] and GUN [17] learn global context with reduced-resolution input in the deep branch, while learning boundaries at full-resolution in the shallow branch.

  However, state-of-the-art real-time semantic segmentation remains challenging and usually requires high-end GPUs. Inspired by two-branch methods, Fast-SCNN incorporates a shared shallow network path to encode detail while learning context efficiently at low resolution (Fig. 2).

2.2 Efficiency of DCNN

  Common techniques for efficient DCNNs can be grouped into four categories:

  Depthwise separable convolutions : MobileNet [10] decomposes a standard convolution into a depthwise convolution and a 1×1 pointwise convolution, together referred to as a depthwise separable convolution. This factorization reduces the number of floating-point operations and convolution parameters, thus reducing the computational cost and memory requirements of the model (see the sketch after this list).

  Efficient redesign of DCNNs : Chollet [5] designed the Xception network using efficient depthwise separable convolutions. MobileNet-V2 proposes the inverted bottleneck residual block [28] to build an efficient DCNN for classification tasks. ContextNet [21] uses inverted bottleneck residual blocks to design a two-branch network for efficient real-time semantic segmentation. Similarly, [34, 17, 36] propose multi-branch segmentation networks to achieve real-time performance.

  Network quantization : Since floating-point multiplication is expensive compared to integer or binary operations, quantization techniques for DCNN filters and activations can be used to further reduce running time [11, 22, 33].

  Network Compression : Pruning is used to reduce the size of pre-trained networks, resulting in faster runtime, smaller parameter sets, and smaller memory footprint [21, 8, 14].
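
  The following minimal TensorFlow/Keras sketch (not the authors' code; channel sizes are illustrative) shows the factorization referred to under "depthwise separable convolutions" above: a 3×3 depthwise convolution followed by a 1×1 pointwise convolution, with a rough parameter-count comparison against a standard convolution.

```python
import tensorflow as tf

def depthwise_separable_conv(x, out_channels, stride=1):
    # MobileNet-style factorization: depthwise 3x3, then pointwise 1x1.
    x = tf.keras.layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.ReLU()(x)
    x = tf.keras.layers.Conv2D(out_channels, 1, use_bias=False)(x)   # pointwise projection
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.ReLU()(x)

# Rough weight count for a 3x3 convolution mapping 64 -> 128 channels:
#   standard:   3*3*64*128      = 73,728 weights
#   separable:  3*3*64 + 64*128 =  8,768 weights
```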

  Fast-SCNN relies heavily on depthwise separable convolutions and residual bottleneck blocks [28]. Furthermore, we introduce a two-branch model that incorporates our learning to downsample module, allowing shared feature extraction at multiple resolution levels (Fig. 2). Note that although the initial layers of the multiple branches extract similar features [35, 19], common two-branch approaches do not exploit this. Network quantization and network compression can be applied orthogonally and are left for future work.


Figure 2. Schematic comparison of Fast-SCNN with encoder-decoder and two-branch architectures. Encoder-decoder models employ multiple skip connections at many resolutions, usually produced by deep convolution blocks. Two-branch methods exploit global features from low resolution and shallow spatial detail. Fast-SCNN encodes spatial detail and the initial layers of the global context simultaneously in our learning to downsample module.

2.3 Auxiliary task pre-training

  It is generally accepted that pre-training on auxiliary tasks improves system accuracy. Earlier work on object detection [7] and semantic segmentation [4, 37] has shown this for pre-training on ImageNet [27]. Following this trend, other real-time efficient semantic segmentation methods are also pre-trained on ImageNet [36, 34, 17]. However, it is unclear whether pre-training is necessary for low-capacity networks. Fast-SCNN is specifically designed with low capacity. In our experiments, we show that small networks do not benefit significantly from pre-training; instead, aggressive data augmentation and more epochs provide similar results.

3. Proposed Fast-SCNN

  Fast-SCNN is inspired by two-branch architectures [21, 34, 17] and encoder-decoder networks with skip connections [29, 26]. Noting that early layers usually extract low-level features, we reinterpret the skip connection as a learning to downsample module, which allows us to merge the key ideas of both frameworks and build a fast semantic segmentation model. Figure 1 and Table 1 show the layout of Fast-SCNN. In the following, we discuss our motivation and describe our building blocks in more detail.


Table 1. Fast-SCNN uses standard convolution (Conv2D), depthwise separable convolution (DSConv), inverted residual bottleneck blocks (bottleneck), a pyramid pooling module (PPM) and a feature fusion module (FFM). The parameters t, c, n and s denote the expansion factor of the bottleneck block, the number of output channels, the number of times the block is repeated, and the stride applied to the first of the repeated blocks. Horizontal lines separate the modules: learning to downsample, global feature extractor, feature fusion, and classifier (from top to bottom).

3.1 Motivation

  Current state-of-the-art semantic segmentation methods that operate in real-time are based on networks with two branches, each operating at a different resolution level [21, 34, 17]. They learn global information from a low-resolution version of the input image, and employ a shallow network at full input resolution to improve the accuracy of segmentation results. Since input resolution and network depth are the main factors of runtime, these two branching methods allow real-time computation.

  It is well known that the first few layers of a DCNN extract low-level features such as edges and corners [35, 19]. Therefore, instead of employing a two-branch approach with separate computations, we introduce learning to downsample, which shares feature computation between the low-level and high-level branches in a shallow network block.

3.2 Network Architecture

  Our Fast-SCNN uses a learning to downsample module, a coarse global feature extractor, a feature fusion module and a standard classifier. All modules are built using depthwise separable convolutions, which have become a key building block of many efficient DCNN architectures [5, 10, 21].


Table 2. The bottleneck residual block transforms the input from c to c′ channels with expansion factor t. Note that the last pointwise convolution does not use the nonlinearity f. The input has height h and width w, and x/s denotes the kernel size and stride of the layer.

3.2.1 Learning to downsample

  Our learning to downsample module uses only three layers, to ensure that low-level feature sharing remains valid and is implemented efficiently. The first layer is a standard convolutional layer (Conv2D), and the remaining two layers are depthwise separable convolutional layers (DSConv). Here, we emphasize that although DSConv is more computationally efficient, we use Conv2D for the first layer because the input image has only three channels, which makes the computational advantage of DSConv negligible at this stage.

  All three layers in our learning to downsample module use stride 2, followed by batch normalization [12] and ReLU. The spatial kernel size of the convolutional and depthwise layers is 3×3. Following [5, 28, 21], we omit the nonlinearity between the depthwise and pointwise convolutions.
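
  A minimal TensorFlow/Keras sketch of this module follows (not the authors' implementation). The channel widths 32, 48 and 64 are taken from Table 1 of the paper; the nonlinearity between depthwise and pointwise convolutions is omitted as described above.

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters, stride):
    x = layers.Conv2D(filters, 3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def ds_conv(x, filters, stride=1):
    # depthwise 3x3 -> pointwise 1x1; no nonlinearity between them, following [5, 28, 21]
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.Conv2D(filters, 1, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def learning_to_downsample(x):
    # three stride-2 layers -> features at 1/8 of the input resolution
    x = conv_bn_relu(x, 32, stride=2)   # standard Conv2D on the 3-channel input
    x = ds_conv(x, 48, stride=2)
    return ds_conv(x, 64, stride=2)
```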

3.2.2 Global Feature Extractor

  The global feature extractor module aims to capture the global context for image segmentation. Unlike common two-branch methods, which operate on a low-resolution version of the input image, our module directly takes the output of the learning to downsample module (at 1/8 of the original input resolution) as its input. The detailed structure of this module is shown in Table 1. We use the efficient bottleneck residual block introduced by MobileNet-V2 [28] (Table 2). In particular, we employ a residual connection in the bottleneck residual block when the input and output sizes are the same. Our bottleneck block uses efficient depthwise separable convolutions, reducing the number of parameters and floating-point operations. Furthermore, a pyramid pooling module (PPM) [37] is added at the end to aggregate contextual information over different regions.
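
  Below is a hedged sketch of such an inverted bottleneck residual block (MobileNet-V2 style, as summarized in Table 2). It reuses the imports from the sketch above; the exact channel counts and repetition pattern of the global feature extractor follow Table 1 and are not reproduced here.

```python
def bottleneck_block(x, out_channels, t, stride):
    """Inverted residual bottleneck: 1x1 expansion, 3x3 depthwise, 1x1 linear projection."""
    in_channels = x.shape[-1]
    h = layers.Conv2D(in_channels * t, 1, use_bias=False)(x)           # expand by factor t
    h = layers.BatchNormalization()(h)
    h = layers.ReLU()(h)
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU()(h)
    h = layers.Conv2D(out_channels, 1, use_bias=False)(h)              # linear projection, no ReLU
    h = layers.BatchNormalization()(h)
    if stride == 1 and in_channels == out_channels:                    # residual only when shapes match
        h = layers.Add()([h, x])
    return h
```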

3.2.3 Feature fusion module

  Similar to ICNet [36] and ContextNet [21], we prefer to simply add features to ensure efficiency. Alternatively, more sophisticated feature fusion modules (such as [34]) can be employed to achieve better accuracy at the cost of runtime performance. The details of the feature fusion module are shown in Table 3.


Table 3. Feature fusion module (FFM) of Fast-SCNN. Note that the pointwise convolution outputs the desired number of channels and does not use the nonlinearity f; the nonlinearity f is applied after the features are added.
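
  A possible sketch of this fusion by addition is shown below, reusing the helpers defined above. The 128-channel output width, the ×4 upsampling factor and the placement of the depthwise convolution are assumptions based on Table 3 (which may additionally use dilation) and are not guaranteed to match the authors' exact configuration.

```python
def feature_fusion(high_res, low_res, channels=128):
    # low-resolution branch: upsample, depthwise conv, then pointwise conv without nonlinearity
    low = layers.UpSampling2D(size=4, interpolation="bilinear")(low_res)
    low = layers.DepthwiseConv2D(3, padding="same", use_bias=False)(low)
    low = layers.BatchNormalization()(low)
    low = layers.ReLU()(low)
    low = layers.Conv2D(channels, 1, use_bias=False)(low)      # no nonlinearity before the addition
    low = layers.BatchNormalization()(low)

    # high-resolution branch: pointwise conv without nonlinearity
    high = layers.Conv2D(channels, 1, use_bias=False)(high_res)
    high = layers.BatchNormalization()(high)

    return layers.ReLU()(layers.Add()([high, low]))             # nonlinearity f after adding
```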

3.2.4 Classifiers

  In the classifier, we use two depthwise separable convolutions (DSConv) and one pointwise convolution (Conv2D). We found that adding several layers after the feature fusion module improves accuracy. The details of the classifier module are shown in Table 1.

  Softmax is used during training, since gradient descent is employed. During inference, we can replace the costly softmax computation with argmax, since both functions are monotonically increasing. We denote this option Fast-SCNN cls (classification). On the other hand, if a standard DCNN-based probabilistic model is required, softmax is used, denoted Fast-SCNN prob (probability).
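
  A sketch of the classifier head and of the cls/prob options follows; the 128-channel width of the DSConv layers and the variable names are illustrative, and ds_conv refers to the helper defined in the learning to downsample sketch above.

```python
def classifier(x, num_classes=19):
    # two depthwise separable convolutions followed by a pointwise convolution to per-pixel logits
    x = ds_conv(x, 128)
    x = ds_conv(x, 128)
    return layers.Conv2D(num_classes, 1)(x)

# logits = classifier(fused_features)        # fused_features: hypothetical FFM output
# labels = tf.argmax(logits, axis=-1)        # Fast-SCNN cls: argmax(softmax(z)) == argmax(z)
# probs  = tf.nn.softmax(logits, axis=-1)    # Fast-SCNN prob: when probabilities are needed
```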

3.3 Comparison with existing technologies

  Our model is inspired by the two-branch framework and incorporates ideas from encoder-decoder approaches (Fig. 2).

3.3.1 Relationship to the two-branch model

  State-of-the-art real-time models (ContextNet [21], BiSeNet [34] and GUN [17]) use two-branch networks. Our learning to downsample module is equivalent to their spatial path, as it is shallow, learns from full resolution, and feeds the feature fusion module (Fig. 1).

  Our global feature extractor module is equivalent to the deeper low-resolution branch of these methods. In contrast, our global feature extractor shares the computation of its first few layers with the learning to downsample module. By sharing the layers we not only reduce the computational complexity of feature extraction, but also reduce the required input size, since Fast-SCNN uses 1/8 resolution instead of 1/4 resolution for global feature extraction.

3.3.2 Relationship to encoder-decoder model

  The proposed Fast-SCNN can be viewed as a special case of encoder-decoder frameworks like FCN [29] or U-Net [26]. However, unlike multiple skip connections in FCN and dense skip connections in U-Net, Fast-SCNN uses only a single skip connection to reduce computation and memory.

  Consistent with [35], which advocates sharing features only at the early layers of a DCNN, we locate skip connections early in the network. In contrast, prior art usually employs deeper modules at each resolution before applying skip connections.

4. Experiment

  We evaluate our proposed Fast Segmentation Convolutional Neural Network (Fast-SCNN) on the validation set of the Cityscapes dataset [6] and report its performance on the Cityscapes test set, i.e., the Cityscapes benchmark server.

4.1 Implementation Details

  When it comes to efficient DCNNs, implementation details are as important as theory. Therefore, we carefully describe our setup here. We conduct our experiments on the TensorFlow machine learning platform using Python. Our experiments are executed on workstations with Nvidia Titan X (Maxwell) or Nvidia Titan Xp (Pascal) GPUs, with CUDA 9.0 and cuDNN v7. Runtime evaluation is performed in a single CPU thread and one GPU to measure the forward inference time. We use 100 frames for burn-in and report the average of 100 frames for the frames per second (fps) measurement.
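
  As an illustration only, the measurement protocol above can be sketched as follows (model and frame are placeholders for a compiled network and an input tensor):

```python
import time

for _ in range(100):                 # burn-in: 100 frames, not timed
    _ = model(frame)

start = time.time()
for _ in range(100):                 # timed: average over 100 frames
    _ = model(frame)
fps = 100.0 / (time.time() - start)
```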

  We use stochastic gradient descent (SGD) with momentum 0.9 and a batch size of 12. Inspired by [4, 37, 10], we use a "poly" learning rate schedule with base learning rate 0.045 and power 0.9. Similar to MobileNet-V2, we find that ℓ2 regularization is unnecessary on the depthwise convolutions; for the other layers the ℓ2 weight decay is 0.00004. Since the training data for semantic segmentation is limited, we apply various data augmentation techniques: random scaling between 0.5 and 2, translation/cropping, horizontal flipping, color channel noise and brightness changes. Our model is trained with cross-entropy loss. We find auxiliary losses at the ends of the learning to downsample and global feature extractor modules, with weight 0.4, to be beneficial.
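
  A sketch of the "poly" schedule with these values is given below; total_steps is a placeholder that depends on the dataset size, batch size and number of epochs.

```python
import tensorflow as tf

class PolyDecay(tf.keras.optimizers.schedules.LearningRateSchedule):
    def __init__(self, base_lr=0.045, power=0.9, total_steps=100_000):
        self.base_lr, self.power, self.total_steps = base_lr, power, total_steps

    def __call__(self, step):
        # lr = base_lr * (1 - step / total_steps) ** power
        progress = tf.cast(step, tf.float32) / float(self.total_steps)
        return self.base_lr * (1.0 - progress) ** self.power

optimizer = tf.keras.optimizers.SGD(learning_rate=PolyDecay(), momentum=0.9)
```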

  Batch normalization [12] is used before every nonlinearity. Dropout is used only on the last layer, before the softmax. Contrary to MobileNet [10] and ContextNet [21], we find that Fast-SCNN trains faster with ReLU and achieves slightly higher accuracy than with ReLU6, even with the depthwise separable convolutions used throughout our model.

  We find that the performance of DCNNs can be improved by training for a larger number of iterations; therefore, unless otherwise stated, we train our models for 1000 epochs on the Cityscapes dataset [6]. It is worth noting here that the capacity of Fast-SCNN is intentionally very low, as we use only 1.11 million parameters. Later we show that aggressive data augmentation makes overfitting unlikely.


Table 4. Class and category mIoU of the proposed Fast-SCNN compared to other state-of-the-art semantic segmentation methods on the Cityscapes test set. The number of parameters is in millions.


Table 5. TensorFlow runtime (fps) on Nvidia Titan X (Maxwell, 3072 CUDA cores) [1]. Methods with "*" indicate results on Nvidia Titan Xp (Pascal, 3840 CUDA cores). Two versions of Fast-SCNN are shown: softmax output (our prob) and target label output (our cls).

4.2 Evaluation of Cityscapes

  We evaluate our proposed Fast-SCNN on Cityscapes, the largest publicly available dataset of urban road scenes [6]. The dataset contains a diverse set of high-resolution images (1024×2048px) captured in 50 different European cities. It has 5000 images of high label quality: 2975 for training, 500 for validation and 1525 for testing. Labels for the training and validation sets are available, and test results can be evaluated on the evaluation server. In addition, 20000 weakly annotated images (coarse labels) are available for training. We report results both with fine labels only and with fine plus coarse labeled data. Cityscapes provides 30 class labels, of which only 19 classes are used for evaluation. The mean intersection over union (mIoU) and the network inference time are reported in the following.

  We evaluate the overall performance on the withheld test set of Cityscapes [6]. Table 4 compares the proposed Fast-SCNN with other state-of-the-art real-time semantic segmentation methods (ContextNet [21], BiSeNet [34], GUN [17], ENet [20] and ICNet [36]) and with offline methods (PSPNet [37] and DeepLab-V2 [4]). Fast-SCNN achieves 68.0% mIoU, slightly below BiSeNet (71.5%) and GUN (70.4%), while ContextNet achieves only 66.1%.

  Table 5 compares the runtime at different resolutions. Here, BiSeNet (57.3 fps) and GUN (33.3 fps) are significantly slower than Fast-SCNN (123.5 fps). Compared to ContextNet (41.9 fps), Fast-SCNN is also significantly faster on the Nvidia Titan X (Maxwell). We therefore conclude that Fast-SCNN significantly improves upon the state-of-the-art runtime at a small cost in accuracy. At this point we emphasize that our model is designed for low-memory embedded devices: Fast-SCNN uses 1.11 million parameters, five times fewer than the competing BiSeNet with 5.8 million parameters.

  Finally, we zero out the contribution of the skip connection and measure the performance of Fast-SCNN. The mIoU on the validation set drops from 69.22% to 64.30%. Qualitative results are shown in Figure 3. As expected, Fast-SCNN benefits from the skip connection, especially around boundaries and small objects.


Table 6. Category mIoU for different Fast-SCNN settings on the Cityscapes validation set.

4.3 Pre-training and Weakly Labeled Data

  High-capacity DCNNs, such as R-CNN [7] and PSPNet [37], have shown that pre-training with different auxiliary tasks can improve performance. Since we specifically designed Fast-SCNN to have low capacity, we now wish to test performance with and without pre-training, and with and without additional weakly labeled data. To the best of our knowledge, the importance of pre-training and additional weakly labeled data on low-capacity DCNNs has not been studied before. The results are shown in Table 6.

  We pre-train Fast-SCNN on ImageNet [27] by replacing the feature fusion module with average pooling; the classification module then consists only of a softmax layer. Fast-SCNN achieves 60.71% top-1 and 83.0% top-5 accuracy on the ImageNet validation set. This result indicates that Fast-SCNN has insufficient capacity to reach performance comparable to most standard DCNNs on ImageNet (>70% top-1) [10, 28]. On the Cityscapes validation set, Fast-SCNN with ImageNet pre-training yields 69.15% mIoU, only 0.53% higher than Fast-SCNN without pre-training. Therefore, we conclude that ImageNet pre-training does not bring a significant improvement to Fast-SCNN.
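
  As a loose illustration of this modification (function and variable names are hypothetical; the text describes the change only at this level of detail), the pre-training head could look as follows, reusing the Keras layers import from the earlier sketches:

```python
def imagenet_pretraining_head(features, num_classes=1000):
    # the feature fusion module is replaced by average pooling,
    # and the classification module is reduced to a single softmax layer
    x = layers.GlobalAveragePooling2D()(features)
    return layers.Dense(num_classes, activation="softmax")(x)
```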


Figure 3. Visualization of Fast-SCNN segmentation results. First column: input RGB image; second column: output of Fast-SCNN; last column: output of Fast-SCNN after zeroing out the contributions of skip connections. In all results, Fast-SCNN benefits from skip connections, especially at boundaries and small-sized objects.

  Since the overlap between Cityscapes' urban roads and ImageNet's classification task is limited, it is reasonable to assume that Fast-SCNN may not benefit because of its limited capacity for both domains. Therefore, we now incorporate the 20,000 coarsely labeled additional images provided by Cityscapes, as these are from a similar domain. However, Fast-SCNN trained with the coarse training data (with or without ImageNet pre-training) performs similarly to the original Fast-SCNN without pre-training, improving on it only slightly. Note that the small variations are insignificant and due to the random initialization of the DCNNs.

  It is worth noting here that handling auxiliary tasks is non-trivial, as it requires architectural modifications to the network; additionally, license constraints and a lack of resources further limit such setups. These costs can be saved, since we show that neither ImageNet pre-training nor weakly labeled data significantly benefit our low-capacity DCNN. Figure 4 shows the training curves. Fast-SCNN trained with coarse data trains slower in terms of iterations due to the weak label quality. Both ImageNet pre-trained versions perform better only in the earlier epochs (up to 400 epochs when training on the fine set alone, and up to 100 epochs when training with additional coarsely labeled data). This means that when we train our model from scratch, we only need to train it longer to reach similar accuracy.


Figure 4. Training curves on Cityscapes. Iteration accuracy (top) and epoch accuracy (bottom) are shown. Dashed lines represent ImageNet pre-training of Fast-SCNN.

4.4 Lower input resolution

  Since we are interested in embedded devices that may not have full-resolution input or access to a powerful GPU, we conclude our evaluation by investigating performance at half and quarter input resolutions (Table 7).

  At quarter resolution, Fast-SCNN achieves 51.9% mIoU at 485.4 fps, which significantly improves on (anonymous) MiniNet (40.7% mIoU at 250 fps) [6]. At half resolution, 62.8% mIoU at 285.8 fps is achieved. We emphasize that, without modification, Fast-SCNN is directly applicable to lower input resolutions, making it well suited for embedded devices.


Figure 5. Qualitative results of Fast-SCNN on the Cityscapes [6] validation set. First column: input RGB image; second column: ground truth label; last column: Fast-SCNN output. Fast-SCNN achieves 68.0% class-level mIoU and 84.7% category-level mIoU.


Table 7. Running time and accuracy of Fast-SCNN at different input resolutions on the Cityscapes test set [6].

5 Conclusion

  We propose Fast-SCNN, a fast segmentation network for above-real-time scene understanding. Sharing the computational cost of the multi-branch network yields runtime efficiency. In our experiments, the skip connection is shown to be beneficial for recovering spatial detail. We also show that, for low-capacity networks, large-scale pre-training on additional auxiliary tasks is unnecessary if training runs long enough.

References

[1] M. Abadi and et. al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. 6
[2] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. TPAMI, 2017. 1, 2, 6
[3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs, 2014. 2
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. arXiv:1606.00915 [cs], 2016. 2, 3, 5, 6
[5] F. Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. arXiv:1610.02357 [cs], 2016. 3, 4
[6] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016. 2, 5, 6, 8
[7] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation, 2013. 3, 6
[8] S. Han, H. Mao, and W. J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. In ICLR, 2016. 3
[9] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. arXiv:1512.03385 [cs], 2015. 2
[10] A. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv:1704.04861 [cs], 2017. 2, 3, 4, 5, 6
[11] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and Y. Bengio. Binarized Neural Networks. In NIPS. 2016. 3
[12] S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. arXiv:1502.03167 [cs], 2015. 4, 5
[13] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR, volume 2, pages 2169–2178, 2006. 2
[14] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf. Pruning Filters for Efficient ConvNets. In ICLR, 2017. 3
[15] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. 2015. 2
[16] A. Lucchi, Y. Li, X. B. Bosch, K. Smith, and P. Fua. Are spatial and global constraints really necessary for segmentation? In ICCV, 2011. 2
[17] D. Mazzini. Guided Upsampling Network for Real-Time Semantic Segmentation. In BMVC, 2018. 1, 2, 3, 4, 5, 6
[18] S. Mehta, M. Rastegari, A. Caspi, L. Shapiro, and H. Hajishirzi. ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation. arXiv:1803.06815 [cs], 2018. 2
[19] C. Olah, A. Mordvintsev, and L. Schubert. Feature visualization. Distill, 2017. 1, 3, 4
[20] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. arXiv:1606.02147 [cs], 2016. 1, 2, 3, 6
[21] R. Poudel, U. Bonde, S. Liwicki, and S. Zach. Contextnet: Exploring context and detail for semantic segmentation in real-time. In BMVC, 2018. 1, 2, 3, 4, 5, 6
[22] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi. XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks. In ECCV, 2016. 3
[23] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016. 2
[24] J. Redmon and A. Farhadi. Yolo9000: Better, faster, stronger, 2016. 2
[25] E. Romera, J. M. Álvarez, L. M. Bergasa, and R. Arroyo. ERFNet: Efficient Residual Factorized ConvNet for Real-Time Semantic Segmentation. IEEE Transactions on Intelligent Transportation Systems, 2018. 1, 2, 6
[26] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In MICCAI, 2015. 2, 3, 5
[27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015. 2, 3, 6
[28] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation. arXiv:1801.04381 [cs], 2018. 2, 3, 4, 6
[29] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. PAMI, 2016. 1, 2, 3, 5
[30] L. Sifre. Rigid-motion scattering for image classification. PhD thesis, 2014. 2
[31] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. 2
[32] F. Visin, M. Ciccone, A. Romero, K. Kastner, K. Cho, Y. Bengio, M. Matteucci, and A. Courville. Reseg: A recurrent neural network-based model for semantic segmentation, 2015. 2
[33] S. Wu, G. Li, F. Chen, and L. Shi. Training and Inference with Integers in Deep Neural Networks. In ICLR, 2018. 3
[34] C. Yu, J. Wang, C. Peng, C. Gao, G. Yu, and N. Sang. Bisenet: Bilateral segmentation network for real-time semantic segmentation. In ECCV, 2018. 1, 2, 3, 4, 5, 6
[35] M. D. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks. In ECCV, 2014. 1, 3, 4, 5
[36] H. Zhao, X. Qi, X. Shen, J. Shi, and J. Jia. ICNet for Real-Time Semantic Segmentation on High-Resolution Images. In ECCV, 2018. 1, 2, 3, 4, 6
[37] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid Scene Parsing Network. In CVPR, 2017. 2, 3, 4, 5, 6
[38] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr. Conditional random fields as recurrent neural networks. In ICCV, December 2015. 2
