Bringing In-Context Learning to visual tasks: DeepMind proposes "Hummingbird", a model that quickly adapts to new tasks


Paper link: https://arxiv.org/abs/2306.01667

Recently, with the popularity of large models such as ChatGPT and GPT-4, the academic community has paid increasing attention to the key techniques behind them, such as In-Context Learning, Chain-of-Thought prompting, and Reinforcement Learning from Human Feedback (RLHF), the new learning paradigms closely tied to ChatGPT. In natural language understanding and generation, In-Context Learning greatly alleviates the need to fine-tune a model for each specific task: researchers need only design suitable prompts to equip the model to solve a variety of downstream tasks.

In contrast, large models in the computer vision community have not yet achieved this effect. Current vision models typically require a dedicated decoder and a task-specific fine-tuning strategy to adapt to each new downstream task. This article introduces the latest work from the Google DeepMind research team, which explores how to build a similar in-context learning mechanism for dense visual tasks such as semantic segmentation and depth estimation. They propose a large-scale vision model called Hummingbird. Hummingbird realizes in-context learning for visual tasks through a retrieval-based memory mechanism, together with a new pre-training method that produces visual representations suitable for multiple downstream tasks. Extensive experimental evaluations show that, simply by adjusting the input prompts and without any fine-tuning of the model, Hummingbird can perform a variety of scene understanding tasks with performance comparable to standard fine-tuning methods.

01. Introduction

The vision tasks targeted in this paper are dense scene understanding tasks, such as semantic segmentation and depth estimation. The authors first study the visual components required to accomplish these tasks and design them with three goals in mind: (1) generality, (2) parameter efficiency, and (3) fast adaptation. To achieve an In-Context Learning effect similar to that in the natural language field, the team first extends the classic non-parametric nearest neighbor (NN) retrieval method [1] to dense scene prediction tasks. The advantage of this retrieval-based decoding mechanism is that it requires no task-specific parameter fine-tuning; the authors therefore regard it as currently the most promising route to visual In-Context Learning. It can directly load standard visual encoders (such as ResNet or ViT) and easily adapt them to other downstream tasks while maintaining solid prediction performance. The figure below shows the semantic segmentation performance of this method against standard fine-tuning approaches on the PASCAL VOC and ADE20K datasets: the nearest neighbor retrieval method reaches comparable or better results with far fewer labeled samples.

[Figure: semantic segmentation performance vs. number of labeled samples for the NN retrieval method and standard fine-tuning on PASCAL VOC and ADE20K]
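To make the idea concrete, here is a minimal numpy sketch of such a retrieval-based decoder. The cosine similarity, softmax temperature, and soft label voting are our own illustrative assumptions, not details confirmed by this post:

```python
import numpy as np

def nn_retrieval_decode(query_feats, memory_feats, memory_labels,
                        k=30, temperature=0.1):
    """Label each query patch by soft-voting over its k nearest
    memory patches (cosine similarity), without any fine-tuning.

    query_feats:   (Q, D) patch features of the test image
    memory_feats:  (M, D) patch features of the labeled "prompt" images
    memory_labels: (M, C) one-hot (segmentation) or scalar (depth) targets
    """
    # L2-normalize so the dot product is cosine similarity
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    m = memory_feats / np.linalg.norm(memory_feats, axis=1, keepdims=True)
    sim = q @ m.T                                     # (Q, M)

    # keep only the k most similar memory patches per query
    topk = np.argpartition(-sim, k, axis=1)[:, :k]    # (Q, k)
    topk_sim = np.take_along_axis(sim, topk, axis=1)  # (Q, k)

    # attention-style weights over the retrieved neighbors
    w = np.exp(topk_sim / temperature)
    w /= w.sum(axis=1, keepdims=True)                 # (Q, k)

    # weighted average of the neighbors' labels
    return np.einsum('qk,qkc->qc', w, memory_labels[topk])  # (Q, C)
```

Switching tasks then amounts to swapping what is stored in memory: no weights are updated, which is what makes the adaptation fast.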

In addition, the research team found that although existing vision Transformers (such as MAE and DINO) all follow the same general self-supervised pre-training paradigm, their scene understanding abilities differ greatly. The authors therefore propose a new pre-training method to close this gap and produce a relatively general visual representation. Specifically, it consists of two steps:

  1. A simple modification of the standard self-supervised pre-training paradigm, called contextual pre-training: the spatial representation of each patch is updated using features retrieved from a memory pool, followed by attention computed across patches.

  2. A spatial attention pooling mechanism (attention pooling), in contrast to conventional average pooling: attention weights computed between patches of the image aggregate the features of the whole spatial grid into a single image-level feature in a "context aggregation" manner, which is then fed into the self-supervised loss for optimization. Both steps are sketched in code after this list.
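Below is a minimal numpy sketch of these two ideas. The residual update, the learned pooling query, and all shapes are our assumptions for illustration rather than the paper's exact design:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def contextualize_with_memory(patches, memory):
    """Contextual pre-training step: refine each patch feature by
    cross-attending to features retrieved from a memory pool.

    patches: (N, D) patch features of the current image
    memory:  (M, D) features retrieved from the memory pool
    """
    d_k = patches.shape[1]
    attn = softmax(patches @ memory.T / np.sqrt(d_k), axis=1)  # (N, M)
    retrieved = attn @ memory                                  # (N, D)
    return patches + retrieved  # residual update of each patch

def attention_pool(patches, query):
    """Attention pooling: collapse the patch grid into one image-level
    feature using attention weights instead of a plain average.

    patches: (N, D) contextualized patch features
    query:   (D,)  a pooling query (assumed learnable; random here)
    """
    w = softmax(patches @ query / np.sqrt(patches.shape[1]))   # (N,)
    return w @ patches                                         # (D,)

# toy usage: the pooled feature would be fed to the self-supervised loss
rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 768))   # 14x14 ViT-B patch grid
memory  = rng.normal(size=(1024, 768))  # features retrieved from the pool
pooled  = attention_pool(contextualize_with_memory(patches, memory),
                         query=rng.normal(size=768))
```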

The authors found that the self-supervised features obtained in this way transfer well across tasks, with downstream performance very close to that of standard fine-tuning. They therefore named the method Hummingbird, to highlight its rapid adaptability across task scenarios.

02. Method

2.1 Scene Understanding Framework Based on a Retrieval Mechanism

 

2.2 Context Pre-training

2.3 Self-supervised training objective function
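The post leaves this subsection empty. Purely as an illustrative stand-in, assuming the image-level features from attention pooling are trained with a contrastive objective, a generic InfoNCE-style loss would look like this (an assumption on our part; the paper's actual objective may differ):

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """Generic InfoNCE contrastive loss between the image-level features
    of two augmented views. NOTE: an assumed stand-in for the (unspecified)
    self-supervised objective that the pooled features feed into.

    z1, z2: (B, D) attention-pooled features of two views of B images
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                   # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(z1))
    return float(-log_prob[idx, idx].mean())           # positives on diagonal
```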

 

03. Experimental results

The experiments are carried out on two dense scene understanding tasks. For semantic segmentation, the authors use the PASCAL VOC and ADE20K datasets with mIoU as the evaluation metric. For monocular depth estimation, they use the NYUv2 dataset with root mean square error (RMSE) as the metric. A variety of self-supervised methods, including MAE and DINO, serve as baselines, all using the ViT-B backbone. The following table compares scene understanding performance under the retrieval-memory mechanism, where Hummingbird++ denotes the variant additionally trained with supervised learning.
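Both metrics are standard; as a quick reference before the results table, here is a minimal sketch of how they are computed (our own illustration, not the paper's evaluation code):

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """mIoU for semantic segmentation: the average, over classes, of
    |pred ∩ target| / |pred ∪ target| on the predicted label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:                    # skip classes absent from both
            ious.append(inter / union)
    return float(np.mean(ious))

def rmse(pred_depth, gt_depth):
    """Root mean square error for monocular depth estimation."""
    return float(np.sqrt(np.mean((pred_depth - gt_depth) ** 2)))
```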

[Table: scene understanding performance with the retrieval-based decoder, comparing Hummingbird against other self-supervised methods]

As the table shows, the method substantially outperforms other approaches using the ViT-B encoder. Moreover, as the pre-training dataset grows from ImageNet-1k to ImageNet-22k, the method scales well, while others (such as MAE) benefit noticeably less. The authors also study cross-architecture behavior: as shown at the bottom of the table, performance improves markedly as the encoder parameter count grows, significantly surpassing other methods, including some fine-tuned with supervised learning.

The authors also evaluate how quickly the method adapts to downstream tasks, comparing against two common fast-adaptation baselines (Linear + frozen and Linear + E2E FT, where E2E FT denotes standard end-to-end fine-tuning). The following table compares their fine-tuning performance on PASCAL VOC and ADE20K; the proposed method is clearly better than both alternatives.

[Table: fast-adaptation comparison against Linear + frozen and Linear + E2E FT on PASCAL VOC and ADE20K]

The authors also measure the wall-clock time of the fine-tuning process. As shown in the figure below, the proposed method needs only about 5 minutes (one epoch over the downstream training set) to build a high-performance NN decoder (70% mIoU on PASCAL VOC, 28% on ADE20K). The Linear + frozen baseline converges second fastest, but its peak performance is clearly below that of Hummingbird's NN decoder.

[Figure: fine-tuning time versus performance for Hummingbird's NN decoder and the baselines]

04. Summary

Inspired by In-Context Learning in large language models, this paper examines the building blocks needed to bring the in-context learning paradigm to dense prediction tasks in computer vision. To this end, the research team proposes a very simple non-parametric nearest neighbor retrieval mechanism that is both task-agnostic and free of specialized decoder fine-tuning. They further propose Hummingbird, a new self-supervised pre-training method that attends to context across image patches (including patches retrieved from memory) during pre-training, giving it the ability to adapt quickly to downstream tasks. By combining the Hummingbird pre-trained model as a general-purpose encoder with a retrieval-memory-based decoder, this work takes the vision community an important step toward in-context learning.

References

[1] O. Boiman, E. Shechtman, and M. Irani. In defense of nearest-neighbor based image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2008.


