Representation Compensation Networks for Continual Semantic Segmentation

RCIL: Representation Compensation Networks for Continual Semantic Segmentation (CVPR, 2022)


Abstract

In this work, we study the problem of continual semantic segmentation, where deep neural networks are required to incorporate new classes continually without catastrophic forgetting. We propose to use a structural re-parameterization mechanism, named the Representation Compensation (RC) module, to decouple the representation learning of old and new knowledge. The RC module consists of two dynamically evolving branches, one of which is frozen and the other trainable.
In addition, we design a pooled cube knowledge distillation strategy on both the spatial and channel dimensions, which further enhances the plasticity and stability of the model. We conduct experiments on two challenging continual semantic segmentation scenarios: continual class segmentation and continual domain segmentation. Without any extra computational overhead or parameters during inference, our method outperforms the state of the art. Code: https://github.com/zhangchbin/RCIL

1. Introduction

Data-driven deep neural networks [65, 73, 98, 109] have achieved many milestones in semantic segmentation. However, these fully supervised models [17, 24, 95] can only handle a fixed number of classes. In practical applications, it would be desirable if the model could be dynamically extended to recognize new classes. A simple solution is to rebuild the training set and retrain the model using all available data, called joint training. However, considering the cost of retraining the model, the sustainability of the algorithm, and privacy concerns, it is particularly important to update the model with only the current data while still recognizing both new and old classes. Naively fine-tuning a trained model on new data leads to catastrophic forgetting [49]. Therefore, in this paper, we pursue continual learning, which allows the model to recognize new categories without catastrophic forgetting.
In the context of continual semantic segmentation [9, 28, 63, 64], given a previously trained model and training data for the new classes, the model should distinguish all seen classes, including the previous (old) classes and the new classes. However, to save labeling cost, the new training data is usually labeled only for the new classes, leaving the old classes as background. Directly learning from the new data without any additional design is very challenging and easily leads to catastrophic forgetting [49].

Figure 1. Illustration of our proposed continual semantic segmentation training framework for avoiding catastrophic forgetting. Two mechanisms are designed in our method: the representation compensation (RC) module and pooled cube knowledge distillation (PCD).

As pointed out in [29, 49, 52], fine-tuning the model on new data may lead to catastrophic forgetting, i.e., the model quickly fits the data distribution of the new classes and loses its discriminative ability for the old classes. Some methods [44, 49, 57, 67, 68, 81, 97] regularize the model parameters to improve their stability, yet all parameters are still updated on the training data of the new classes. This is challenging because new and old knowledge are entangled in the model parameters, making it difficult to maintain the fragile balance between learning new knowledge and preserving old knowledge. Other methods [46, 58, 76, 77, 83, 93] increase the capacity of the model to better balance stability and plasticity, but at the cost of increasing the memory footprint of the network.

In this study, we propose an easy-to-use representation compensation module that aims to memorize old knowledge while allowing additional capacity for new knowledge. Inspired by structural re-parameterization [25, 26], during training we replace each convolutional layer in the network with two parallel branches, which form the representation compensation module. As shown in Figure 1, during training, the outputs of the two parallel convolutions are fused before the non-linear activation layer. At the beginning of each continual learning step, we equivalently merge the parameters of the two parallel convolutions into a single convolution, which is then frozen to preserve old knowledge. The other branch is trainable and inherits its parameters from the corresponding branch of the previous step. The representation compensation strategy thus uses the frozen branch to memorize old knowledge, while the trainable branch provides additional capacity for new knowledge. Importantly, this module brings no extra parameters or computational cost during inference.
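To make the training-time structure concrete, the following is a minimal PyTorch sketch of such a block. It is only an illustration of the idea described above; the class and method names are ours and do not come from the official repository.

```python
import torch.nn as nn

class RCBlock(nn.Module):
    """Sketch of a representation-compensation block: two parallel 3x3
    conv + norm branches whose outputs are summed before the activation."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Branch holding the merged (old-knowledge) convolution; frozen after step 0.
        self.branch_old = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                        nn.BatchNorm2d(out_ch))
        # Branch that stays trainable and adapts to the new classes.
        self.branch_new = nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, padding=1),
                                        nn.BatchNorm2d(out_ch))
        self.act = nn.ReLU(inplace=True)

    def freeze_old_branch(self):
        # Called at the beginning of a new continual step, after merging.
        for p in self.branch_old.parameters():
            p.requires_grad = False

    def forward(self, x):
        # The two branch outputs are fused *before* the non-linearity.
        return self.act(self.branch_old(x) + self.branch_new(x))
```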

To further mitigate catastrophic forgetting, we introduce a knowledge distillation mechanism [71] between intermediate layers, named Pooled Cube Knowledge Distillation. It suppresses the negative effects of errors and noise in local feature maps. The main contributions of this paper are:

• We propose a representation compensation module with two branches during training, one for retaining old knowledge and one for adapting to new data. During inference, the computational and memory overhead stays constant as the number of tasks increases.

• We conduct experiments on continual class segmentation and continual domain segmentation. The experimental results show that our method outperforms the state of the art on three different datasets.

2. Related Work

Semantic Segmentation.
Early approaches focused on modeling contextual relations [3, 50, 104]. Current methods focus more on multi-scale feature aggregation [4, 35, 53, 54, 60, 66, 69, 82]. Some methods [15, 23, 33, 38, 39, 51, 56], inspired by non-locality [86], use attention mechanisms to establish connections between image contexts. Another study [16, 62, 96] aims at fusing features from different receptive fields. Recently, Transformer architectures [8, 27, 87, 99, 105, 110] have performed prominently in semantic segmentation, focusing on multi-scale feature fusion [13, 85, 91, 102] and contextual feature aggregation [59, 80].

Continual Learning.
Continual learning focuses on mitigating catastrophic forgetting while still discriminating newly learned classes. To solve this problem, many studies [5, 6, 12, 48, 78] propose to review knowledge through a rehearsal mechanism. The knowledge can be stored in various forms, such as exemplars [5, 7, 10, 12, 74, 84], prototypes [36, 107, 108], generative networks [61], etc. While these rehearsal-based approaches generally achieve high performance, they require extra storage and raise data-privacy concerns. In more challenging scenarios without any replay, many methods explore regularization to preserve old knowledge, including knowledge distillation [11, 19, 22, 29, 52, 70, 75], adversarial training [30, 90], and vanilla parameter regularization [44, 49, 57, 67, 68, 81, 97, 100]. Others focus on the capacity of the neural network. One line of research [46, 58, 76, 77, 83, 93] extends the network architecture while learning new knowledge. Another line [1, 45] explores sparsity regularization of network parameters, aiming to activate as few neurons as possible for each task; this reduces redundancy in the network but limits the learning ability of each task. Some works propose to learn better representations by combining self-supervised learning of feature extractors [10, 88] and by addressing class imbalance [40, 47, 55, 101, 103].

Continual Semantic Segmentation.
Continual semantic segmentation remains an open problem, mainly concerning catastrophic forgetting in semantic segmentation [49]. Continual class segmentation is the classic setting in this domain, and several previous works have made great progress: [42, 94] explore rehearsal-based methods to review old knowledge; other works re-model the background class to resolve its ambiguity; PLOP [28] applies a knowledge distillation strategy to intermediate layers; SDR [64] utilizes prototype matching to impose consistency constraints on latent space representations. Others [32, 79, 97] exploit high-dimensional information, self-training, and model adaptation to overcome this problem.

Furthermore, continual domain segmentation is a new setting proposed by PLOP [28], which aims to integrate new domains instead of new classes. Different from previous methods, we focus on dynamically expanding the network and decoupling the representation learning of old and new classes.

3. Method

3.1. Preliminaries

Let D = {x_i, y_i} denote a training set, where x_i is an input image and y_i is the corresponding segmentation ground truth. In the continual learning scenario, we refer to each round of training on a newly added dataset D_t as a step. At step t, we are given a model f_{t-1} with parameters θ_{t-1} that has been trained on the datasets {D_0, D_1, ..., D_{t-1}} covering the classes {C_0, C_1, ..., C_{t-1}}. When the model encounters the newly added data D_t with the additional new classes C_t, it is required to discriminate all Σ_{n=0}^{t} C_n classes. While training on D_t, the training data of the old classes is not accessible. In addition, to save annotation cost, only the new classes C_t are labeled in the ground truth of D_t, while the old classes are marked as background. Hence, catastrophic forgetting is a pressing problem. Verifying the effectiveness of different methods usually requires multiple continual learning steps, e.g., N steps.
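As a concrete illustration of this protocol, the toy loop below tracks how the label space grows over the steps. The split sizes are only an example (resembling the common 15-1 style setting) and are not taken from the paper's experiments.

```python
# Hypothetical bookkeeping for N continual steps. At step t the model is
# trained only on D_t, whose ground truth contains the new classes C_t,
# while every old class is collapsed into the background label.
new_classes_per_step = [15, 1, 1, 1, 1, 1]   # illustrative 15-1 style split

seen = 0
for t, num_new in enumerate(new_classes_per_step):
    seen += num_new
    # f_t must output logits for all `seen` classes (plus background),
    # even though it never revisits D_0, ..., D_{t-1}.
    print(f"step {t}: +{num_new} new classes, {seen} classes seen so far")
```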

Figure 2. Illustration of our representation compensation mechanism. We modify each 3 × 3 convolution into two parallel convolutions, and the features from both branches are aggregated before the activation layer. Therefore, at the beginning of step t, the two parallel branches trained at step t−1 can be merged into an equivalent convolutional layer, which is frozen and used as one branch at step t. The other branch at step t is initialized from the corresponding branch of step t−1. The merge operation is shown on the right side of the figure.

3.2. Representation Compensation Networks

As shown in Figure 2, in order to decouple the retention of old knowledge from the learning of new knowledge, we introduce a representation compensation mechanism. A 3 × 3 convolution followed by a normalization layer and a non-linear activation layer is a common building block in most deep neural networks. We modify this architecture by adding, for each such block, a parallel 3 × 3 convolution followed by its own normalization layer. The outputs of the two parallel convolution-normalization branches are fused and then passed through the non-linear activation layer. Formally, the architecture consists of two parallel convolutional layers with weights {W_0, W_1} and biases {b_0, b_1}, each followed by an independent normalization layer. Let Norm_0 = {µ_0, σ_0, γ_0, β_0} and Norm_1 = {µ_1, σ_1, γ_1, β_1} denote the mean, standard deviation, weight, and bias of the two normalization layers. The computation on an input x before the non-linear activation is then

$$
\gamma_0 \frac{W_0 x + b_0 - \mu_0}{\sigma_0} + \beta_0 + \gamma_1 \frac{W_1 x + b_1 - \mu_1}{\sigma_1} + \beta_1 = \hat{W} x + \hat{b},
$$

where

$$
\hat{W} = \frac{\gamma_0}{\sigma_0} W_0 + \frac{\gamma_1}{\sigma_1} W_1, \qquad
\hat{b} = \frac{\gamma_0 (b_0 - \mu_0)}{\sigma_0} + \beta_0 + \frac{\gamma_1 (b_1 - \mu_1)}{\sigma_1} + \beta_1.
$$

This shows that the two parallel branches can be equivalently expressed by a single weight Ŵ and bias b̂; the transformation is also shown on the right side of Figure 2. Thus, for this modified structure, we can equivalently merge the parameters of the two branches into one convolution.
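Under the usual convention that σ denotes the running standard deviation of the normalization layer, this merge is the standard conv-BN folding used in structural re-parameterization. A minimal PyTorch sketch (our own helper, not the authors' code) might look like:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def merge_parallel_branches(conv0, bn0, conv1, bn1):
    """Fold two parallel conv+BN branches (summed before the activation)
    into one equivalent convolution with weight W_hat and bias b_hat."""
    def fuse(conv, bn):
        # Standard conv-BN folding: scale = gamma / sqrt(running_var + eps).
        scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
        w = conv.weight * scale.reshape(-1, 1, 1, 1)
        bias = conv.bias if conv.bias is not None else torch.zeros_like(bn.running_mean)
        b = (bias - bn.running_mean) * scale + bn.bias
        return w, b

    w0, b0 = fuse(conv0, bn0)
    w1, b1 = fuse(conv1, bn1)

    merged = nn.Conv2d(conv0.in_channels, conv0.out_channels, conv0.kernel_size,
                       stride=conv0.stride, padding=conv0.padding, bias=True)
    merged.weight.copy_(w0 + w1)   # W_hat = W0' + W1'
    merged.bias.copy_(b0 + b1)     # b_hat = b0' + b1'
    return merged
```

At the start of step t, such a merged convolution replaces the frozen branch, while the trainable branch is copied from its counterpart at step t−1.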

More precisely, at step 0, all parameters are trainable so as to train a model that can distinguish the C_0 classes. In subsequent learning steps, the model has to segment the newly added classes. During these continual learning steps, the network is initialized with the parameters trained in the previous step, which facilitates the transfer of knowledge [9].

At the beginning of step t, in order to prevent the model from forgetting old knowledge, we merge the parallel branches trained at step t−1 into one convolutional layer. The parameters of this merged branch are frozen to memorize old knowledge, as shown in Figure 2. The other branch is trainable to learn new knowledge and is initialized from the corresponding branch of the previous step. In addition, we design a drop-path strategy that is applied when aggregating the outputs x_1 and x_2 of the two branches. During training, the output before the non-linear activation is

$$
\tilde{x} = \eta \odot x_1 + (1 - \eta) \odot x_2,
$$

where η is a random channel-wise weight vector, uniformly sampled from the set {0, 0.5, 1}. During inference, every element of η is set to 0.5. Experimental results show that this strategy yields a further improvement.
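A possible implementation of this channel-wise drop-path is sketched below; the function name is hypothetical, and the exact scaling of the fused output in the official implementation may differ from this simplified version.

```python
import torch

def drop_path_fuse(x1, x2, training=True):
    """Channel-wise weighted fusion of the two branch outputs (N, C, H, W).
    During training, eta is sampled per channel from {0, 0.5, 1};
    at inference every element of eta is fixed to 0.5."""
    if training:
        # One weight per channel, shared across all spatial positions.
        eta = torch.randint(0, 3, (1, x1.shape[1], 1, 1),
                            device=x1.device).to(x1.dtype) * 0.5
    else:
        eta = torch.full((1, x1.shape[1], 1, 1), 0.5,
                         device=x1.device, dtype=x1.dtype)
    return eta * x1 + (1.0 - eta) * x2
```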

Analysis on RC-Module's Effectiveness.
As shown in Figure 3, the parallel convolutional structure can be viewed as an implicit ensemble of multiple sub-networks [37, 41].

The parameters of some layers in these sub-networks are inherited from the merged teacher model (trained in the previous step) and are frozen. During training, similar to [34, 92], these frozen teacher layers regularize the trainable parameters, encouraging the trainable layers to behave like the teacher model. As shown in Fig. 3(a), in the special case where only one layer in the sub-network is trainable, this layer must, during training, both adapt to the representations of the frozen layers and learn new knowledge. Therefore, this mechanism can alleviate the catastrophic forgetting of the trainable layer. We further generalize this effect to general sub-networks as in Fig. 3(b), which likewise encourages the trainable layers to adapt to the representations from the frozen layers. Furthermore, all sub-networks are combined, integrating the knowledge of the different sub-networks into one network, as shown in Fig. 3(c).
Figure 3. Illustration of our proposed representation compensation network. Our architecture (c) can be viewed as an implicit ensemble of numerous sub-networks such as (a) and (b). Blue indicates frozen layers inherited from the merged teacher model, green indicates trainable layers, and gray indicates layers ignored in the sub-network.

3.3. Pooled Cube Knowledge Distillation

To further alleviate the forgetting of old knowledge, following PLOP [28], we also explore knowledge distillation between intermediate layers. As shown in Fig. 4(a), PLOP [28] introduces strip pooling [39] to aggregate the features of the teacher model and the current model separately. The pooling operation plays a key role in maintaining the discrimination of old classes while allowing new classes to be learned. In our approach, we design an average-pooling-based knowledge distillation along the spatial dimension. In addition, we also apply average pooling along the channel dimension at each spatial location to preserve the respective activation strengths. Overall, as shown in Figure 4(b), we use average pooling in both the spatial and channel dimensions.

Formally, we select the feature maps {X^1, X^2, ..., X^L} before the last non-linear activation layer of all L stages, including all stages of the backbone and the decoder. For the features from the teacher model and the student model, we first square the value of each pixel to preserve negative information. Then, multi-scale average pooling is performed along the spatial and channel dimensions, respectively. The pooled features of the teacher model, X̂^l_T, and of the student model, X̂^l_S, are computed by an average pooling operation Δ:

$$
\hat{X}^{l,m}_{T} = \Delta_m\big((X^{l}_{T})^2\big), \qquad
\hat{X}^{l,m}_{S} = \Delta_m\big((X^{l}_{S})^2\big),
$$

where m denotes the m-th average pooling kernel and l denotes the l-th stage. For average pooling along the spatial dimensions, we use multi-scale windows to model the relationship between pixels in local regions; the kernel sizes are M = {4, 8, 12, 16, 20, 24} and the stride is set to 1. For average pooling along the channel dimension, we simply set the window size to 3. The spatial knowledge distillation loss Lskd between the intermediate layers is then

$$
\mathcal{L}_{skd} = \frac{1}{L} \sum_{l=1}^{L} \frac{1}{|M|} \sum_{m \in M} \frac{1}{H W D} \big\| \hat{X}^{l,m}_{T} - \hat{X}^{l,m}_{S} \big\|_2^2,
$$

where H, W, and D denote the height, width, and number of channels, respectively. The same formulation applied along the channel dimension with M = {3} gives Lckd. Overall, the distillation objective can be expressed as

$$
\mathcal{L}_{kd} = \mathcal{L}_{skd} + \mathcal{L}_{ckd}.
$$
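The sketch below shows one way to implement this pooled distillation in PyTorch, following the description above (squaring, stride-1 multi-scale spatial average pooling, and a window of 3 over the channel dimension). The function name and the exact loss normalization are our assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def pooled_cube_kd_loss(feats_t, feats_s,
                        spatial_kernels=(4, 8, 12, 16, 20, 24),
                        channel_kernel=3):
    """feats_t / feats_s: lists of teacher / student feature maps (N, C, H, W),
    one per selected stage of the backbone and decoder."""
    loss = 0.0
    for xt, xs in zip(feats_t, feats_s):
        xt, xs = xt.detach() ** 2, xs ** 2          # square to keep negative information
        # Spatial term: multi-scale average pooling with stride 1.
        for k in spatial_kernels:
            pt = F.avg_pool2d(xt, kernel_size=k, stride=1, padding=k // 2)
            ps = F.avg_pool2d(xs, kernel_size=k, stride=1, padding=k // 2)
            loss = loss + F.mse_loss(ps, pt)
        # Channel term: average pooling over the channel dimension at each location.
        ct = F.avg_pool3d(xt.unsqueeze(1), kernel_size=(channel_kernel, 1, 1),
                          stride=1, padding=(channel_kernel // 2, 0, 0)).squeeze(1)
        cs = F.avg_pool3d(xs.unsqueeze(1), kernel_size=(channel_kernel, 1, 1),
                          stride=1, padding=(channel_kernel // 2, 0, 0)).squeeze(1)
        loss = loss + F.mse_loss(cs, ct)
    return loss / max(len(feats_t), 1)
```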
Average pooling vs. strip pooling.
Thanks to its powerful feature aggregation and ability to model long-range dependencies, strip pooling plays a significant role in many fully supervised semantic segmentation models [39, 43]. However, the performance of continual segmentation still lags far behind fully supervised segmentation: in the continual setting, the predictions tend to contain more noise and errors. Therefore, during distillation, when strip pooling is used to aggregate features, its long-range dependency can introduce irrelevant noise at the intersections of the strips, leading to noise diffusion and further degrading the predictions of the student model. In our method, we instead use average pooling over local regions to suppress the negative effects of noise. Specifically, since the semantics of a local region are often similar, the current key point can find more neighbors to support its decision by aggregating features from the local region, and is therefore less affected by local noise.

As shown in the top row of Fig. 5, strip pooling introduces noise or errors from the teacher model at the intersection points. During distillation, this noise is further propagated into the student model, causing it to diffuse. With the average pooling in the bottom row of Fig. 5, a key point considers many nearby neighbors, resulting in aggregated features that are more robust to noise.
Figure 5. The effect of the strip pooling (top row) used in PLOP [28] versus the average pooling (bottom row) used in our method.

4. Experiments
