ICLR 2023 | Few-shot learning and zero-shot inference on medical images with vision-language pre-trained models

Over the past two years, vision-language models (VLMs) have risen to prominence and achieved remarkable results in few-shot learning and zero-shot inference. Can these large-scale pre-trained vision-language models, so successful on natural images, also be applied to the medical domain? With this question in mind, a joint study by the Artificial Intelligence and Medical Robotics Laboratory of the West China Biomedical Big Data Center at Sichuan University, the West China Hospital-SenseTime Joint Laboratory, the Shanghai Artificial Intelligence Laboratory, and Beijing University of Posts and Telecommunications thoroughly and comprehensively investigated whether, with the help of appropriate prompts, a vision-language model pre-trained on natural images can be transferred to medical imaging under few-shot or even zero-shot conditions. The paper has been accepted by ICLR 2023 (International Conference on Learning Representations), a top artificial intelligence conference.

Paper title:

Medical Image Understanding with Pretrained Vision Language Models: A Comprehensive Study

Paper link:

https://arxiv.org/abs/2209.15517v1

1. Scarcity of large medical models

The medical imaging field has long suffered from a lack of data. Compared with natural images, annotating medical images requires trained specialists; data for rare conditions is hard to scale up; and ethical and privacy constraints prevent data from being aggregated and released publicly. As a result, medical imaging has been unable to develop its own large pre-trained models (PLMs, Pretrained Large Models). Transfer learning from large models pre-trained on natural images therefore becomes a natural option. However, because of the large domain gap between medical and natural images, the generalization of such transferred models is often limited.

2. Multimodal pre-trained models and language invariance

Through vision-language cross-modal alignment training, vision-language models (VLMs) acquire stronger generalization ability, and several VLMs perform well on few-shot and zero-shot tasks. However, existing studies have not investigated whether these VLMs can understand less common medical concepts. Prior work has shown that, with well-designed prompts, a VLM can recognize different visual styles of the same concept (for example, a color photo, a sketch, or a cartoon-style picture of an object) and even concepts it has never seen (unseen concepts). We believe this generalization ability stems mainly from the invariance of the language modality across image domains: because a VLM tightly binds linguistic and visual representations, the corresponding visual representation can be activated through a language prompt.

In short, if the prompt we design describes expressive attributes of the target, such as its shape, color, texture, or position, the vision-language model can recognize the corresponding object even when facing a brand-new medical concept. To this end, we first manually designed a set of prompt templates and, on this basis, proposed several automatic prompt generation methods that produce prompts for different medical concepts. We validate this idea on 13 public medical image datasets: a prompt rich in expressive attributes can substantially improve VLM performance on zero-shot and few-shot tasks.

3. Designing medical prompts: from manual to automatic

As mentioned above, a prompt containing expressive attribute descriptions is crucial for activating the generalization ability of a pre-trained VLM, but how to obtain such a prompt is a key question. Although many prompt generation methods for extracting knowledge from pre-trained language models have been proposed in natural language processing (NLP), none of them are designed for visual tasks. Our prompt design focuses on obtaining attribute descriptions of a thing or concept, and we designed a format template to make the generation process more systematic.

For example, our prompt for the medical concept "polyp" is: 'In rectum, polyp is an oval bump, often in pink color.', into which we insert descriptions of position, shape, and color. However, manually designing such prompts is time-consuming, labor-intensive, and requires a certain amount of domain knowledge, so we further propose a pipeline for generating prompts automatically. The first approach performs masked prediction with a pre-trained language model (LM) that has domain expertise. For example, we mask the attribute word: 'Polyp is in [MASK] color.' Studies have shown that this extracts the corresponding knowledge more effectively than a question-answering formulation. We call this method MLM (masked language modeling).
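A minimal sketch of this masked-prediction idea, using the Hugging Face fill-mask pipeline; the checkpoint name (a clinical BERT variant) is an illustrative assumption and may differ from the domain LM used in the paper:

```python
from transformers import pipeline

# Illustrative checkpoint; the paper's domain-specific LM may differ.
fill = pipeline("fill-mask", model="emilyalsentzer/Bio_ClinicalBERT")

# Mask the attribute word to query the LM for a color attribute of "polyp".
predictions = fill("Polyp is in [MASK] color.")

# Keep the top candidates as attribute words for the prompt template.
for p in predictions[:3]:
    print(p["token_str"], round(p["score"], 3))
```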

The method above extracts general attributes of a concept, but some attributes vary from image to image, so we also need to screen attributes for each image individually. For this we use a generative pre-trained VLM to answer attribute-specific questions. For example, asking 'What color is this polyp?' yields the attribute word for that particular image. We call this method VQA (visual question answering).
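As a rough illustration of this per-image attribute query, the sketch below uses a BLIP-style VQA model from Hugging Face Transformers; the specific checkpoint and image path are assumptions, not the paper's exact setup:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Illustrative generative VQA checkpoint; the paper's VLM may differ.
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.open("endoscopy_frame.png").convert("RGB")  # hypothetical image

# Ask an attribute-specific question about this particular image.
inputs = processor(image, "What color is this polyp?", return_tensors="pt")
answer_ids = model.generate(**inputs)
print(processor.decode(answer_ids[0], skip_special_tokens=True))
```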

Finally, we also tried combining the two approaches: using MLM to extract relatively stable attributes (such as position and texture) and using VQA to extract image-dependent attributes (such as color and shape). We call this the hybrid method.
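Putting the pieces together, a hypothetical helper could fill an expressive-attribute template with MLM-derived stable attributes and VQA-derived per-image attributes; the template wording follows the polyp example above and is only illustrative:

```python
def build_prompt(concept: str, position: str, texture: str, color: str, shape: str) -> str:
    """Assemble an expressive-attribute prompt.

    position/texture are assumed to come from the MLM step (stable attributes),
    color/shape from the VQA step (image-dependent attributes).
    """
    return (f"In {position}, {concept} is a {shape} bump "
            f"with {texture} texture, often in {color} color.")

# Example usage with attributes similar to the polyp example.
print(build_prompt("polyp", "rectum", "smooth", "pink", "oval"))
```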

Experiments show that all of the methods above achieve much higher performance on few-shot and zero-shot detection tasks than the baseline that uses only the name of the target object or concept as the prompt. The automatically generated prompts perform comparably to the manually designed ones, while costing far less time to produce.

4. Overall superiority on few-shot and zero-shot tasks

To comprehensively validate the proposed prompt methods, we collected 13 publicly available medical datasets spanning different imaging modalities (CT, MRI, ultrasound, endoscopy, pathology, etc.).

Although the radiological datasets have a large domain gap from natural images, we found that fine-tuning on only a small number of samples already achieves good results, and our method significantly outperforms traditional detection models.

On other data, such as endoscopy and pathology images, the VLM achieves impressive performance under both zero-shot transfer and few-shot fine-tuning. In the zero-shot setting, we obtained prompts rich in expressive attributes through manual templates and automatic generation, and compared them against zero-shot detection that uses only the name of the target object or concept as the prompt. The results show that our method outperforms the concept-name baselines by a large margin. In the few-shot and full-data settings, our method also holds a clear advantage over traditional detection models.
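The paper evaluates zero-shot detection with a grounded vision-language detector; as a simplified stand-in, the sketch below contrasts a name-only prompt with an expressive prompt in CLIP-style zero-shot classification, just to show how the two prompt styles are plugged in (checkpoint, image path, and prompt wording are illustrative assumptions):

```python
import torch
import clip  # OpenAI CLIP package; a simplified stand-in for the detector used in the paper
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("endoscopy_frame.png")).unsqueeze(0).to(device)  # hypothetical image

prompt_sets = {
    "name only": ["polyp", "normal mucosa"],
    "expressive": [
        "In rectum, polyp is an oval bump, often in pink color.",
        "Normal mucosa is smooth and flat, with a uniform pink color.",
    ],
}

with torch.no_grad():
    for name, prompts in prompt_sets.items():
        text = clip.tokenize(prompts).to(device)
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1).cpu().numpy()
        print(name, probs)
```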

Partial visualization results (figure omitted; see the paper).

5. Summary

We believe that the generalization ability of vision-language pre-trained models can effectively alleviate the data scarcity and large domain gap in medical imaging. The key to making good use of these models is the cross-domain invariance of attribute words in language descriptions. We validate this conjecture with extensive experiments on multiple datasets.

· We propose a prompt design template containing expressive attribute words and, building on this template, automate the design process with three automatic prompt generation methods for different requirements.

· Compared with traditional detection models, our method shows overall superiority on few-shot detection tasks; compared with prompts that use only the name of the target object, our designed prompts bring a huge improvement on zero-shot detection tasks.

Author: TechBeat hardcore broadcast


-The End-
