(2023, Network Pruning) Exploring Incompatible Knowledge Transfer in Few-shot Image Generation

Exploring Incompatible Knowledge Transfer in Few-shot Image Generation

Official account: EDPJ

Table of contents

0. Summary

1. Introduction

2. Related work

3. Basics

4. Incompatible knowledge transfer in FSIG

4.1 Investigating incompatible knowledge

4.2 Experimental setup

4.3 Results and analysis

5. Proposed Approach

5.1 Knowledge Truncation via Network Pruning 

5.2 Design Choices

6. Experiment

6.1 Performance Evaluation and Comparison

6.2 Discussion

7. Conclusion

Appendix

F. Ablation Study: Effect of High Importance Filters

H. Ablation Studies: Additional Measures of Importance

Reference

S. Summary

S.1 Main idea

S.2 Network pruning


0. Summary

Few-shot image generation (FSIG) learns to generate diverse and high-fidelity images of a target domain from a small number (e.g., 10) of reference samples. Existing FSIG methods select, preserve, and transfer prior knowledge from a source generator (pretrained on a related domain) to learn the target generator. In this work, we investigate an underexplored issue in FSIG, dubbed incompatible knowledge transfer, which can significantly degrade the realism of synthesized samples. Our empirical observations show that the issue stems from the least important filters of the source generator. To this end, we propose knowledge truncation to mitigate this issue in FSIG, a complementary operation to knowledge preservation that is implemented by a lightweight pruning-based method. Extensive experiments show that knowledge truncation is simple and effective, consistently achieving state-of-the-art performance, including in challenging settings where the source and target domains are more distant.

1. Introduction

Incompatible knowledge transfer. Despite the impressive improvements achieved by different knowledge preservation methods, in this work we argue that preventing incompatible knowledge transfer is equally important. Our carefully designed investigations reveal this incompatible knowledge transfer through the presence of unexpected semantic features in generated images. These features are inconsistent with the target domain and therefore reduce the realism of the synthesized samples. As shown in Figure 1, trees and buildings are not compatible with the Sailboat domain (as can be observed by examining the 10 reference samples). However, when applying existing SOTA methods [41, 75] with source generators trained on Church, they appear in the synthesized images. This shows that existing methods cannot effectively prevent the transfer of incompatible knowledge.

Knowledge truncation. Based on our observations, we propose Removing In-Compatible Knowledge (RICK), a lightweight filter-pruning-based method that removes filters encoding incompatible knowledge during FSIG adaptation (i.e., the filters estimated to be least important for adaptation). While filter pruning has been widely used to obtain compact deep networks with reduced computation, its application to preventing incompatible knowledge transfer has not been fully explored. We note that our proposed knowledge truncation and pruning of incompatible filters are orthogonal and complementary to existing knowledge preservation methods in FSIG. In this way, our method effectively removes incompatible knowledge compared to prior work and significantly improves the quality of generated images (e.g., as measured by FID).

2. Related work

Recent state-of-the-art methods propose to retain some knowledge for adaptation.

  • FreezeD freezes some lower layers of the discriminator during adaptation
  • EWC identifies important parameters of the source task and penalizes weight changes
  • CDC aims to keep the distance between generated images consistent before and after adaptation
  • DCL maximizes the mutual information between source and target generated images from the same input latent code to preserve knowledge.
  • Recently, AdAM proposes a modulation-based approach to identify source knowledge important to the target domain and retain the knowledge for adaptation.

3. Basics

Existing FSIG methods employ transfer learning (TL) and utilize source GANs pre-trained on large source datasets. We denote the source generator as Gs (and the source discriminator as Ds). During adaptation, the target generator Gt (and target discriminator Dt) is obtained by fine-tuning the source GAN on a small number of target images via the adversarial loss:

$$\mathcal{L}_{adv} = \min_{G_t} \max_{D_t} \; \mathbb{E}_{x \sim p_{data}(x)}\big[\log D_t(x)\big] + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D_t(G_t(z))\big)\big], \quad (1)$$

where z is a latent code sampled from a noise distribution p_z(z) (e.g., a Gaussian), and p_data(x) denotes the few-shot target data distribution. Note that the source data is not accessible. In fine-tuning, the weights of Gs (and Ds) are used to initialize Gt (and Dt); see Fig. 1(a). The main goal of FSIG is to learn Gt to capture p_data(x).
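To make the adaptation procedure concrete, below is a minimal sketch of one fine-tuning step under Equation 1 in PyTorch. The model classes, optimizer setup, and the non-saturating logistic loss (as used by StyleGAN2) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def finetune_step(G_t, D_t, real_batch, opt_G, opt_D, z_dim=512):
    """One adversarial fine-tuning step (Eq. 1) on few-shot target data.
    G_t and D_t are assumed to be initialized from the source GAN's weights."""
    z = torch.randn(real_batch.size(0), z_dim, device=real_batch.device)  # z ~ p_z(z)
    fake = G_t(z)

    # Discriminator update: few-shot real target samples vs. generated samples.
    loss_D = F.softplus(-D_t(real_batch)).mean() + F.softplus(D_t(fake.detach())).mean()
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Generator update: fool the discriminator on generated samples.
    loss_G = F.softplus(-D_t(fake)).mean()
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()
    return loss_G.item(), loss_D.item()
```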

To mitigate mode collapse caused by the very limited target samples, recent approaches augment fine-tuning with knowledge preservation, carefully selecting and preserving a subset of source knowledge during adaptation, e.g., via freezing-, regularization-, and modulation-based methods. These methods aim to preserve knowledge considered useful for the target generator, e.g., to increase the diversity of generated target samples. For knowledge considered less useful, fine-tuning with Equation 1 is common practice, updating such knowledge during adaptation.

4. Incompatible knowledge transfer in FSIG

In this section, as our first contribution, we observe and identify a previously unnoticed problem of incompatible knowledge transfer in existing FSIG methods, and reveal that fine-tuning-based knowledge updating is not sufficient to remove incompatible knowledge after adaptation.

To support our claim and find the root cause of incompatible knowledge transfer, we apply GAN dissection, a framework that identifies the correspondence between individual filters and the semantic segmentation of specific object classes (e.g., trees) in generated images, to reveal the filters that retain incompatible knowledge after fine-tuning.

4.1 Investigating incompatible knowledge

Previous SOTA FSIG methods propose different knowledge preservation criteria to select pretrained source knowledge for few-shot adaptation. Adaptation is usually done by fine-tuning the source generator (via Equation 1) with a small number of target samples. An implicit assumption in these methods is that fine-tuning can adapt the source generator into the target generator, such that irrelevant and incompatible source knowledge is removed or updated.

In this work, we show that this assumption becomes invalid when the source and target domains are semantically distant (e.g., human and cat faces in Figure 1): incompatible knowledge transfer is then severe and compromises the realism of the generated images. We note that this has not been well studied in previous SOTA FSIG works, since they mainly focus on preserving knowledge from the source (see Section 2) and pay little attention to incompatible knowledge transfer under fine-tuning-based knowledge updating.

In a convolutional neural network, each filter can be seen as an encoding of a specific part of knowledge. Intuitively, in generative models, such knowledge could be low-level textures (such as fur) or high-level human-interpretable concepts (such as eyes). Therefore, we hypothesize that clues to incompatible knowledge transfer can be found by focusing on the filters of the generator. Recently, AdAM proposed an importance probing (IP) method to determine whether a source GAN filter is important for adaptation, and achieved impressive performance. In our analysis, we employ IP to assess the importance of source generator filters for target domain adaptation (we briefly introduce IP in the Supplement). We propose two experiments with different granularities:

  • Exp-1: Generating images with fixed generator inputs. We visualize the images generated by different methods. To understand knowledge transfer before and after adaptation, we feed the same noise as input to the source and target generators. Conceptually, this provides an intuitive and direct comparison of knowledge transfer.
  • Exp-2: Dissecting the pretrained and adapted generators. To find the filters most relevant to a specific type of knowledge across different images (e.g., source features that are incompatible with the target) and track their transfer before and after adaptation, we label the filters of Gs with their estimated importance (via IP) and apply GAN dissection to visualize the semantic features corresponding to the same filters in Gs and Gt.

These experiments help us understand knowledge transfer before and after adaptation at a coarse granularity (visualizing generated images in pixel space) and a fine granularity (dissecting Gs and Gt in filter space). Next, we discuss the setup and results.

4.2 Experimental setup

4.3 Results and analysis

We reveal that existing SOTA FSIG approaches, which focus on source knowledge preservation, lead to the transfer of incompatible knowledge. More importantly, the root cause of this transfer is the filters in Gs deemed least important, hence irrelevant, for target-domain adaptation, and fine-tuning is not sufficient to remove the incompatible knowledge they encode after adaptation. Specifically, we summarize our observations from Figures 1 and 2:

  • Observation 1: In Fig. 1(c), we visualize the images generated by different methods using a fixed noise input. Interestingly, features incompatible with the target domain are indeed transferred after adaptation under different knowledge preservation criteria, e.g., a "tree on the sea", where the "tree" comes from the Church domain, and a "cat with glasses", where the "glasses" come from the FFHQ domain. All these incompatible source features severely weaken the realism of the generated target images. Similar observations can be made for TGAN, a simple fine-tuning-based approach without explicit knowledge preservation. In contrast, our method (discussed in Section 5) resolves this problem.
  • Observation 2: In Fig. 2, we dissect and visualize the incompatible features observed in Fig. 1 and find their most relevant filters in Gs and Gt. Surprisingly, we find that the filters in Gs identified as least important for the target domain are the most correlated with the incompatible features transferred from the source, which is the root cause of the degraded realism of generated images. After adaptation, the same filters still produce the same type of incompatible features, and fine-tuning-based knowledge updating cannot effectively resolve this. The observation becomes more pronounced as the target domain becomes more distant from the source.

5. Proposed Approach

5.1 Knowledge Truncation via Network Pruning 

Pruning is a useful tool for obtaining compact neural networks whose performance is comparable to that of larger, full models. Early work on network compression focused on model acceleration, inference efficiency, and deployment, targeting discriminative tasks such as image classification and machine translation, typically by removing the least important neurons (the definition of importance varies and is discussed in Section 5.2). In contrast to prior pruning work that pursues model sparsity, we aim to improve the quality of generated images in the FSIG task by removing the least important filters, which are associated with knowledge incompatible with the target domain.

Our proposed method consists of two main steps: 1) lightweight, on-the-fly filter importance estimation during adaptation; and 2) filter actions determined by the estimated importance. In step 1), we use the gradient information produced during adaptation to evaluate each filter's importance for target adaptation once every fixed number of iterations. In step 2), based on the estimated importance, we prune the least important filters, which are considered irrelevant to the target domain, to remove incompatible knowledge. Meanwhile, we keep (freeze) filters with high importance to achieve knowledge preservation in FSIG, and fine-tune the remaining filters to adapt the source generator to the target domain; see the sketch below.
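Putting the two steps together, the overall adaptation loop might look like the following sketch. The helper functions (finetune_step, estimate_filter_importance, FilterPruner, assign_actions) are illustrative placeholders sketched elsewhere in this section, and the interval and quantile values are assumed hyperparameters, not the paper's exact settings.

```python
EST_INTERVAL = 50                 # assumed: re-estimate importance every N iterations
pruner = FilterPruner(G_t)        # persistent pruning masks (sketched further below)

for it in range(num_iters):
    # Ordinary adversarial fine-tuning step on the few-shot target data (Eq. 1).
    loss_G, loss_D = finetune_step(G_t, D_t, next(target_iter), opt_G, opt_D)
    pruner.enforce(G_t)           # pruned filters stay zero after each optimizer step

    if it % EST_INTERVAL == 0:
        # Step 1: lightweight on-the-fly importance estimation from gradients.
        scores = estimate_filter_importance(G_t)    # D_t is handled analogously
        # Step 2: prune the least important filters (irreversible), freeze the
        # most important ones, and fine-tune the rest until the next estimation.
        pruner.prune(G_t, scores, q=0.03)
        M = assign_actions(scores, q=0.03, t_h=0.70)
```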

Proposed filter importance estimation. We estimate the importance of each filter by exploiting instantaneous gradient information during FSIG adaptation. We denote a filter as $W \in \mathbb{R}^{k \times k \times c^{in}}$, where k is the spatial size of the filter and $c^{in}$ is the number of input feature maps. We use Fisher information (FI) as the importance estimator $F(W)$ of each filter (discussed further in Section 5.2), which quantifies the compatibility between the filter weights and the FSIG task:

$$F(W) = \mathbb{E}_{x}\left[\left(\frac{\partial \mathcal{L}_G(x)}{\partial W}\right)^{2}\right], \quad (2)$$

where $\mathcal{L}_G$ is the binary cross-entropy loss computed from the output of the discriminator and x denotes a batch of generated images. In practice, we use a first-order approximation of FI (averaging the squared gradients over samples) to reduce computational cost.
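As a sketch of how this estimate can be computed from gradients already produced by the adversarial loss, the snippet below assumes each conv weight has shape (c_out, c_in, k, k), so each output channel corresponds to one filter, and that a backward pass on L_G has just populated the .grad fields:

```python
import torch

@torch.no_grad()
def fisher_per_filter(conv_weight):
    """First-order Fisher estimate per output filter.
    conv_weight: (c_out, c_in, k, k) parameter whose .grad holds dL_G/dW
    accumulated on the current batch of generated images."""
    g = conv_weight.grad
    return (g ** 2).sum(dim=(1, 2, 3))          # one score per filter: (c_out,)

def estimate_filter_importance(model):
    """Collect one importance score per filter across all conv layers.
    Call right after a backward pass of L_G so gradients are fresh."""
    return {name: fisher_per_filter(p)
            for name, p in model.named_parameters()
            if p.dim() == 4 and p.grad is not None}
```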

Our filter importance estimation for knowledge selection is lightweight and efficient: compared with previous SOTA methods that propose different knowledge selection criteria (though focused only on knowledge preservation), our method requires no external model to provide additional information during adaptation, introduces no additional learnable parameters or pre-adaptation iterations for importance estimation, and directly benefits from the outputs of Gt and Dt during training.

Proposed knowledge truncation by filter pruning. In Section 4, we presented abundant evidence that the least important filters are associated with semantic features incompatible with the target domain (e.g., "trees on the sea" or "building structures on the sea"). Importantly, under different knowledge preservation criteria, fine-tuning-based knowledge updating cannot properly remove this incompatible knowledge after adaptation. Therefore, we propose a simple and novel approach to knowledge truncation: pruning (zeroing out) the filters that are least important for adaptation.

Specifically, after estimating filter importance in step 1), for the i-th filter W^i in the network we apply a threshold q% (i.e., a quantile of its importance among all filters) to determine whether W^i should be pruned:

$$W^i \leftarrow \begin{cases} \mathbf{0}, & \text{if } F(W^i) < Q_{q\%}\big(\{F(W^j)\}_j\big), \\ W^i, & \text{otherwise}, \end{cases} \quad (3)$$

where $Q_{q\%}(\cdot)$ denotes the q% quantile over the importance scores of all filters.

We note that once a filter is determined to be pruned, it no longer participates in training/inference and is not restored for the rest of the training iterations. Knowledge truncation is applied to both the generator and the discriminator, with separate thresholds for Gt and Dt. Since we periodically estimate filter importance during adaptation and pruned filters are non-recoverable, the number of filters zeroed out via Equation 3 accumulates to a final fraction p% by the end of adaptation.
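One way to realize this non-recoverable pruning is a per-layer boolean mask that, once a filter falls below the q% importance quantile, zeroes it and keeps it zeroed. The sketch below assumes the same (c_out, c_in, k, k) weight layout as above; it is an illustrative implementation, not the authors' code.

```python
import torch

class FilterPruner:
    """Permanently zero out low-importance filters (Eq. 3)."""
    def __init__(self, model):
        # One boolean entry per output filter; all filters start alive.
        self.alive = {n: torch.ones(p.shape[0], dtype=torch.bool, device=p.device)
                      for n, p in model.named_parameters() if p.dim() == 4}

    @torch.no_grad()
    def prune(self, model, scores, q=0.03):
        thresh = torch.quantile(torch.cat(list(scores.values())), q)  # global q% quantile
        for name, p in model.named_parameters():
            if name in scores:
                self.alive[name] &= ~(scores[name] < thresh)  # once dead, stays dead
                p[~self.alive[name]] = 0.0                    # zero the pruned filters

    @torch.no_grad()
    def enforce(self, model):
        """Call after every optimizer step so pruned filters remain zero."""
        for name, p in model.named_parameters():
            if name in self.alive:
                p[~self.alive[name]] = 0.0
```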

Similar to previous work that focuses on knowledge preservation and proposes different knowledge selection criteria, we preserve filters with high importance for adaptation by freezing them during training. The remaining filters are simply fine-tuned with Equation 1. Whether a filter is fine-tuned or preserved depends on its importance to the target; we discuss the effect of selecting high-importance filters in the supplementary material. Since we estimate filter importance multiple times during adaptation, the operation applied to a particular filter may change between estimations, unless the filter has been pruned (in which case it is not restored). The resulting decision rule is sketched below.
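A sketch of the resulting three-way decision rule, mapping each filter's importance to an operation. The prune quantile q and keep quantile t_h are illustrative values in the spirit of Section 5.1 and Appendix F; "keep" means the filter is frozen until the next estimation.

```python
import torch

def assign_actions(scores, q=0.03, t_h=0.70):
    """Map each filter's importance to 'P' (prune), 'K' (keep/freeze),
    or 'F' (fine-tune). scores: dict of per-layer 1-D importance tensors."""
    flat = torch.cat(list(scores.values()))
    lo = torch.quantile(flat, q)          # below lo: prune (Eq. 3)
    hi = torch.quantile(flat, t_h)        # above hi: freeze to preserve knowledge
    actions = {}
    for name, s in scores.items():
        a = torch.full(s.shape, ord('F'), dtype=torch.uint8, device=s.device)
        a[s >= hi] = ord('K')             # keep (freeze)
        a[s < lo] = ord('P')              # prune
        actions[name] = a
    return actions
```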

5.2 Design Choices

Here, we discuss the design choices of our proposed method and the adopted importance measure. Since we dynamically evaluate filter importance at regular intervals, we need to keep the operation for each filter (which can be "keep", "fine-tune", or "prune") until the next estimation. To reduce computational cost, we store each filter's operational decision (obtained from the importance estimation) in a lightweight memory bank M: for each high-dimensional filter W, a single character in M suffices to record the corresponding operation. For example, for the StyleGAN2 used in the main experiments, the generator contains about 30M parameters, while M is a one-dimensional array of size about 5,000.
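Since M stores one character per filter rather than per parameter, it is tiny relative to the model. A quick illustrative check of the claimed sizes (G_t is assumed to be a StyleGAN2-like generator with standard 4-D conv weights):

```python
# One character per filter: 'K' (keep), 'F' (fine-tune), 'P' (prune).
num_filters = sum(p.shape[0] for p in G_t.parameters() if p.dim() == 4)
num_params = sum(p.numel() for p in G_t.parameters())
M = bytearray(b'F' * num_filters)   # initialize: fine-tune everything
print(f"parameters: {num_params:,}  |  |M| = {num_filters}")  # ~30M vs. ~5,000
```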

Similar to previous work, we use FI as the importance measure to estimate how well-suited the network parameters (filters, in our work) are to the adaptation task. We note that there are other measures for estimating filter importance for adaptation, such as class saliency or reconstruction loss. In the supplementary material, we conduct a study and empirically find that they achieve performance similar to FI. Furthermore, in Section 6.2, we are surprised to find that even without pruning (i.e., filters can only be kept or fine-tuned), our proposed method still achieves competitive performance compared with SOTA methods, implying the effectiveness of the proposed dynamic importance estimator.

6. Experiment

6.1 Performance Evaluation and Comparison

Qualitative results. In the figure above, we visualize the images generated by different methods before and after adaptation for comparison. In each column, the images are generated from the same noise input. We use FFHQ as the source domain; Babies and AFHQ-Cat are target domains with different semantic proximity to the source. We show that our proposed method improves the quality of generated images by reliably removing target-incompatible knowledge while preserving useful source knowledge.

Quantitative results. Considering that a whole target dataset usually contains around 5,000 images (e.g., AFHQ-Cat), following previous work we randomly generate 5,000 images with the adapted generator and compare them against the whole target dataset to compute FID. In Table 1, we show the complete FID results on six benchmark datasets. In the figure above, we also compute intra-LPIPS over the 10 target samples as a measure of diversity, and we report FID using the same checkpoint. All these results demonstrate the effectiveness of our proposed method.
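For reference, this evaluation protocol (5,000 generated samples vs. the full target set) can be reproduced with an off-the-shelf FID implementation. The sketch below uses torchmetrics' FrechetInceptionDistance and assumes images are uint8 tensors in (N, 3, H, W) format and that G_t outputs images in [-1, 1]; it is not tied to the authors' evaluation code.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)

# Real side: the whole target dataset (e.g., ~5,000 AFHQ-Cat images).
for real_imgs in target_loader:                  # uint8, (N, 3, H, W)
    fid.update(real_imgs, real=True)

# Fake side: 5,000 images sampled from the adapted generator.
with torch.no_grad():
    for _ in range(5000 // 50):
        fake = G_t(torch.randn(50, 512))         # assumed output range: [-1, 1]
        fake_u8 = ((fake.clamp(-1, 1) + 1) * 127.5).to(torch.uint8)
        fid.update(fake_u8, real=False)

print("FID:", fid.compute().item())
```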

6.2 Discussion

Knowledge truncation for different methods. Ideally, our proposed concept of knowledge truncation for FSIG can be applied to different methods, as long as we can estimate parameter importance (e.g., filter importance in our method). In the literature, EWC and AdAM propose different approaches to assessing parameter importance: EWC directly estimates the parameter importance of Gs on the source dataset, while AdAM uses a modulation-based approach to estimate the parameter importance of Gs on the target dataset. Therefore, in Table 1, we also show the results of applying our proposed knowledge truncation to EWC and AdAM. Since our method effectively removes incompatible knowledge by pruning the least important filters, it achieves consistently improved performance on different datasets.

Pruning different percentages of filters. We empirically study the impact of pruning different percentages of filters. As shown in Fig. 5, we prune different numbers of filters under three different methods. Intuitively, if too many filters are pruned, some important knowledge is removed and performance degrades accordingly. We therefore prune 3% of the filters (i.e., p = 3 in Section 5.1) across different settings, which achieves a considerable and stable improvement.

Can we train longer to remove incompatible knowledge? Intuitively, a potentially useful way to remove incompatible knowledge is simply to train for more iterations. However, in the supplement we conduct a study showing that, for existing FSIG methods, since the target set contains only 10 training images, training for longer overfits the generator: it tends to replicate the few target samples in order to fool the discriminator, and the diversity of generated images drops significantly. It is therefore important to remove incompatible knowledge before overfitting becomes severe.

7. Conclusion

In this work, we address the few-shot image generation (FSIG) problem. As a first contribution, we uncover a previously unnoticed problem of incompatible knowledge transfer in existing SOTA methods, which leads to a significant loss of realism in generated images. Surprisingly, we find that the root cause of this transfer is the filters considered least important for target adaptation, which fine-tuning-based SOTA methods cannot properly address. We therefore propose a new concept, knowledge truncation for FSIG, which aims to eliminate incompatible knowledge by pruning the filters least important for adaptation. Our proposed filter importance estimation exploits gradient information from the dynamic training process and is computationally light. Through extensive experiments, we show that our method applies to various adaptation settings with different GAN architectures. We achieve new state-of-the-art performance, including visually pleasing generated images with little incompatible knowledge transferred, and improved quantitative results.

Limitations and ethical issues. The scale of our experiments is comparable to previous work. Nonetheless, extending our knowledge truncation approach to additional datasets and to generative models beyond GANs (e.g., variational autoencoders or diffusion models) is left as future work. Malicious use of our proposed FSIG method could have negative societal impacts. However, our work contributes to improving the understanding of image generation with limited data.

Appendix

F. Ablation Study: Effect of High Importance Filters

In the main paper, we highlight our contributions: investigating incompatible knowledge transfer, relating it to the least important filters, and proposing a method for FSIG that addresses this unnoticed problem. In addition to knowledge truncation, following previous work, we also preserve useful source knowledge for adaptation. Specifically, we preserve filters considered important for target adaptation by freezing them, selecting high-importance filters using a quantile (t_h, e.g., 75%) as the threshold. In this section, we conduct a study on the effectiveness and impact of retaining different numbers of the filters considered most relevant to target adaptation; the results are presented in Table S1. Note that we do not prune any filters in this experiment.

As shown in Table S1, preserving different numbers of filters improves performance to different degrees. In practice, we choose t_h = 50% for FFHQ → Babies and t_h = 70% for FFHQ → AFHQ-Cat. This choice is intuitive: for target domains semantically closer to the source, retaining more source knowledge tends to improve performance.

H. Ablation Studies: Additional Measures of Importance

Evaluating the importance of weights in generative tasks remains underexplored. In the main paper, we follow previous work and use Fisher information (FI) as the measure for importance estimation, obtaining superior performance across different datasets (see Table 1 in the main paper). However, there may be other ways to evaluate how well-suited the learned weights are to the adaptation task. In the literature, class saliency (CS) is used to estimate which regions/pixels of a given input image are salient for a particular classification decision; similar to FI, it utilizes gradient information. We therefore note that CS may be related to FI, since both exploit the knowledge encoded in gradients for importance estimation.

Reference

Zhao Y., Du C., Abdollahzadeh M., et al. Exploring Incompatible Knowledge Transfer in Few-shot Image Generation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023: 7380-7391.

S. Summary

S.1 Main idea

This paper studies the incompatible knowledge transfer problem in few-shot domain adaptation: entities from the source domain that do not match the target domain may appear in generated images after adaptation, degrading adaptation quality. The authors address this issue with network pruning based on filter importance.

S.2 Network pruning

The cause of incompatible knowledge transfer is the filters that are unimportant for adaptation (they extract irrelevant features), and the problem can be mitigated by removing these filters. The procedure has two steps: 1) estimate each filter's importance for adaptation; 2) based on the importance, perform one of the following operations:

  • Zero out (prune) filters with low importance: they are irrelevant to the target domain and are removed to avoid incompatible knowledge transfer
  • Freeze filters with high importance: for knowledge preservation in few-shot domain adaptation
  • Fine-tune the remaining filters: for adapting to the target domain

Origin blog.csdn.net/qq_44681809/article/details/131219242