[Notes] Prompting Large Language Models with Answer Heuristics for Knowledge-based VQA

Background

In knowledge-based VQA, GPT-3's capacity has not been fully exploited. There are two main limitations:

1. GPT-3 should discard useless information and focus on what the question is actually asking about. A caption like "A group of people walking in a city square" does nothing to answer the question "What fruit do these trees bear?"; in such cases, GPT-3 has to make a rambling, biased guess.

2. GPT-3 follows a few-shot learning paradigm and needs in-context examples to adapt to a new task. The selection of these examples is therefore critical to model performance.

{Few-shot: at inference time, GPT-3 only needs a few task examples concatenated with the input as a prompt; no parameter updates are required.}

The idea of this paper

1. First, feed each sample into a VQA model to obtain a set of candidate answers.

For example: the picture shows a birthday cake in front of an old man, and the question is "What is the old man blowing?" The candidate answers generated by the VQA model are "candles", "birthday", and "fire".

2. The question, the image semantics (caption), and the candidate answers are all fed into GPT-3 as a prompt, which finally produces the desired answer.

According to the figure above, the final result is expressed by the formula:
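As a hedged reconstruction of the formula from the surrounding description (symbols are assumed: C is the caption, q the question, A the candidate answers with confidences, E the in-context examples), the prediction can be written as:

```latex
\hat{y} \;=\; \operatorname*{arg\,max}_{y}\; p_{\text{GPT-3}}\!\left(y \mid C,\, q,\, A,\, E\right)
```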

Since GPT-3 does not inherently understand images, an off-the-shelf captioning model is used to convert images into text prompts (as in PICa):

The full PICa prompt consists of a fixed prompt header, several in-context examples, and the test input. This prompt is fed into GPT-3 for answer prediction.
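The header/example/test structure described above can be sketched as follows. This is a minimal illustration; the function name and the exact prompt wording are assumptions, not the paper's verbatim template.

```python
# Minimal sketch of a PICa-style prompt: a fixed header, several
# in-context examples, then the test input (wording is assumed).
def build_pica_prompt(header, examples, test_caption, test_question):
    """examples: list of (caption, question, answer) triples."""
    parts = [header]
    for caption, question, answer in examples:
        parts.append(f"Context: {caption}\nQuestion: {question}\nAnswer: {answer}")
    # The test entry ends with "Answer:" so GPT-3 completes it.
    parts.append(f"Context: {test_caption}\nQuestion: {test_question}\nAnswer:")
    return "\n\n".join(parts)

prompt = build_pica_prompt(
    "Please answer the question according to the context.",
    [("A man holding a birthday cake.", "What is the old man blowing?", "candles")],
    "Trees line a city square.",
    "What fruit do these trees bear?",
)
```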

Phase 1: Answer heuristic generation

First, note that a VQA model generally consists of two sub-models: an encoder that produces the fused feature z, and a classification head that produces the answer scores y over the vocabulary.

The first sub-model generates the fused feature:

The second sub-model generates the candidate answer scores over the vocabulary:
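The two sub-models can be sketched as below. This is a toy stand-in, not the paper's architecture: the fusion is an arbitrary tanh projection and the head is a random linear layer with softmax; all dimensions and weights are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(image_feat, question_feat, W):
    # Sub-model 1: fuse image and question features into z (toy projection).
    return np.tanh(W @ np.concatenate([image_feat, question_feat]))

def classify(z, W_head):
    # Sub-model 2: map z to a softmax distribution y over the answer vocabulary.
    logits = W_head @ z
    e = np.exp(logits - logits.max())
    return e / e.sum()

d_img, d_q, d_z, vocab_size = 8, 8, 16, 5
W = rng.standard_normal((d_z, d_img + d_q))
W_head = rng.standard_normal((vocab_size, d_z))

z = fuse(rng.standard_normal(d_img), rng.standard_normal(d_q), W)
y = classify(z, W_head)  # y sums to 1 over the answer vocabulary
```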

The author uses this model as a comparison baseline, then adds GPT-3 on top of it to verify the effectiveness of the approach.

The following are the author's preliminary steps:

Building the example set E is, in effect, constructing the support set of few-shot learning.

Generating an example first requires candidate answer words; the author chooses the top-K entries of y by score:

 

This is the generated prompt example (w is a word, y is its score). One question remains: which images are selected as in-context examples?
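Selecting the top-K (word, score) pairs from y can be sketched as below; the vocabulary and scores are made-up illustration values.

```python
import numpy as np

def topk_candidates(y, vocab, k=3):
    # Pick the k highest-scoring answer words as answer heuristics.
    idx = np.argsort(y)[::-1][:k]
    return [(vocab[i], float(y[i])) for i in idx]

vocab = ["candles", "birthday", "fire", "cake", "wind"]
y = np.array([0.62, 0.21, 0.09, 0.05, 0.03])
print(topk_candidates(y, vocab, k=3))
# → [('candles', 0.62), ('birthday', 0.21), ('fire', 0.09)]
```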

The author says:

"We speculate that these fused features lie in a latent answer space that contains the rich semantics of the answer for a given image-question pair."

"If z and zi are close in the latent space, they are more likely to share similar answers and image question inputs."

So the author computes the cosine similarity between the fused feature of the test sample (the query, in few-shot terms) and the fused features of the other image-question pairs, and selects the top-N closest ones.

 

Of course, the author notes that these z features can be computed in advance.
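The nearest-neighbor selection over precomputed fused features can be sketched as follows; the function name and the tiny 2-D features are illustrative only.

```python
import numpy as np

def topn_examples(z_test, z_bank, n=8):
    # Cosine similarity between the test fused feature and the
    # precomputed fused features of all candidate examples.
    z_bank_n = z_bank / np.linalg.norm(z_bank, axis=1, keepdims=True)
    z_test_n = z_test / np.linalg.norm(z_test)
    sims = z_bank_n @ z_test_n
    return np.argsort(sims)[::-1][:n]  # indices of the N most similar examples

nearest = topn_examples(np.array([1.0, 0.0]),
                        np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]),
                        n=2)
# → indices [0, 2]: the identical vector first, then the nearly aligned one
```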

Phase 2: Heuristics-enhanced prompting

As shown in Figure 2 above, the next step is to generate prompts that enhance GPT-3's prediction.

That is, this part:

Note that although candidate answers are provided to GPT-3, it can also choose to generate a new answer outside the candidate list.

 

Finally, whether for the example set E or the test input, the format fed into GPT-3 is as follows:

The confidence scores help GPT-3 focus on the more promising candidate answers. The author runs the prompting multiple times and takes a majority vote over the results:
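The confidence-annotated entry format and the majority vote can be sketched as below; the exact prompt wording and function names are assumptions, not the paper's verbatim template.

```python
from collections import Counter

def format_entry(caption, question, candidates, answer=None):
    # One prompt entry: caption, question, and candidates with confidences.
    # An example entry includes the answer; the test entry ends at "Answer:".
    cand = ", ".join(f"{w} ({conf:.2f})" for w, conf in candidates)
    entry = (f"Context: {caption}\nQuestion: {question}\n"
             f"Candidates: {cand}\nAnswer:")
    return entry + (f" {answer}" if answer is not None else "")

def vote(predictions):
    # Majority vote over answers from multiple GPT-3 runs.
    return Counter(predictions).most_common(1)[0][0]

print(vote(["candles", "candles", "fire"]))  # → candles
```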

 

Origin blog.csdn.net/qq_42533666/article/details/129907345