Large models + retrieval enhancements (RAG, Atlas and REPLUG)

https://zhuanlan.zhihu.com/p/651380539

https://github.com/ninehills/blog/issues/97

1. Retrieval-augmented generation (RAG)

In question-answering and dialogue scenarios, a reply can usually be obtained either by retrieval or by generation. Retrieval-based replies are drawn from an external knowledge base, which makes them more reliable and controllable but less diverse; generative replies rely on the internal knowledge stored in a powerful language model, which makes them harder to control and interpret, but capable of producing richer responses. To combine retrieval and generation, Facebook AI Research, together with UCL and New York University, proposed Retrieval-Augmented Generation (RAG) in 2020: a generation model supported by external knowledge retrieval.

  • Retrieval: This refers to the process of systematically searching a large database or repository to find relevant information.
  • Generation: After retrieval, the system generates human-like text, integrating the acquired data.

Retrieval augmentation is used to overcome limitations of large language models (LLMs), such as hallucination (fabricated output) and limited knowledge (it is often used to supply up-to-date knowledge or a company's internal knowledge). The idea is to maintain an external knowledge base, retrieve relevant external data when a question is asked, and provide it to the LLM so that it can generate more accurate and relevant answers.

Principle

RAG consists of two parts:

  • The first part is responsible for retrieving, from the knowledge base, the top-k documents z_i that best match the query x
  • The second part concatenates the query and the k documents into a QA prompt, sends it to a seq2seq model, and generates the reply y

Part One: Retriever

In the first part, the Retriever, RAG embeds the external knowledge and the query into dense vectors using two different BERT encoders, and takes the inner product to obtain the k documents with the largest scores. The query is concatenated with each of the k documents to form k inputs, which are used as the input to the second part.
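
As a rough illustration of this retrieval step, here is a minimal Python sketch (not the paper's code). It assumes the query and documents have already been embedded into dense vectors by the two BERT encoders, uses a brute-force maximum inner product, and the prompt format is made up for illustration.

```python
# Minimal sketch of RAG's retrieval step (illustrative, not the paper's code).
# Assumes the query and documents have already been embedded into dense vectors
# by the two BERT encoders; retrieval is brute-force maximum inner product.
import numpy as np

def retrieve_top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5):
    """Return indices and scores of the k documents with the largest inner product."""
    scores = doc_vecs @ query_vec          # inner product with every document vector
    top_idx = np.argsort(-scores)[:k]      # indices of the k highest-scoring documents
    return top_idx, scores[top_idx]

def build_generator_inputs(query: str, docs: list[str], top_idx) -> list[str]:
    # Concatenate the query with each retrieved document, producing the k inputs
    # passed to the seq2seq generator; the exact prompt format is an assumption.
    return [f"question: {query} context: {docs[i]}" for i in top_idx]
```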

Part 2: Generator

The Generator can use the retrieved documents in two ways.

  • The first is the RAG-Sequence model: the same document is used to generate the whole output. A document z_i is first fixed and p(y | x, z_i) is computed; the final probability is obtained by marginalising over the retrieved documents, p(y | x) = Σ_i p(z_i | x) p(y | x, z_i).

  • The second is the RAG-Token model: different documents may be used to generate each word. For the i-th position, the probability of a candidate word is the sum of its conditional probabilities over all documents, i.e. the candidate word's probability is marginalised over documents at every position (a toy numerical sketch of both variants follows below).
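
To make the difference concrete, here is a toy numerical sketch (purely illustrative; real implementations work in log-space over a full vocabulary) of how the two variants combine per-document probabilities:

```python
# Toy sketch of how RAG-Sequence and RAG-Token combine document scores
# (illustrative only; real implementations work in log-space over a vocabulary).
import numpy as np

p_doc = np.array([0.5, 0.3, 0.2])            # p(z_i | x): retriever scores for k=3 documents

# RAG-Sequence: each document generates the WHOLE sequence, then we marginalise.
# p_seq[i] = p(y | x, z_i) for the full candidate answer y.
p_seq = np.array([0.10, 0.02, 0.01])
p_y_sequence = np.sum(p_doc * p_seq)         # p(y | x) = sum_i p(z_i | x) p(y | x, z_i)

# RAG-Token: marginalise over documents at EVERY token position.
# p_tok[i, t] = p(y_t | x, z_i, y_<t) for each of T=2 token positions.
p_tok = np.array([[0.6, 0.4],
                  [0.5, 0.2],
                  [0.3, 0.1]])
p_y_token = np.prod(np.sum(p_doc[:, None] * p_tok, axis=0))   # product over positions of the marginal
```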

BART is a pre-trained model based on a full Transformer encoder-decoder, trained with a denoising objective. The authors chose BART-large as RAG's generator.

During training, only the BERT encoder responsible for embedding the query and the BART model responsible for generation are fine-tuned; the BERT encoder responsible for embedding the external knowledge does not update its parameters. At test time, when the RAG-Token model computes the probability of the current word, the probabilities of the previous candidate words have already been computed, so it can use beam search decoding just like an ordinary generative model. The RAG-Sequence model, however, needs to traverse all documents to obtain the candidate-word probabilities at each position, so beam search is run per document and the results are then combined.

8 major challenges

Data source loading and processing

During data loading and parsing, an issue that deserves attention is how to preserve the logical and semantic relationships of the original data as much as possible. You can try different loading and parsing methods and compare different Python libraries, and you can also craft a good prompt to make the content easier for the LLM to understand.


A good prompt can be achieved by supplementing relevant materials: background knowledge, Internet search results, RAG search results, in-context QA examples.


Data segmentation is difficult
  • chunk_size: The maximum length for splitting the input text sequence. Large language models generally limit the maximum input sequence length. For example, the maximum input length of GPT-3 is 2048 tokens. In order to process longer text, it needs to be divided into multiple chunks, and chunk_size controls the maximum length of each chunk.
  • chunk_overlap: The number of overlapping tokens between two adjacent chunks. In order to ensure the semantic coherence of the text, adjacent chunks will have a certain overlap. chunk_overlap controls the size of this overlap area.

For example, if chunk_size is set to 1024 and chunk_overlap is set to 128, a text sequence with a length of 2560 will be divided into 3 chunks:
Chunk 1: tokens 1-1024
Chunk 2: tokens 897-1920 (128 tokens overlapping with chunk 1)
Chunk 3: tokens 1793-2560 (128 tokens overlapping with chunk 2)
This segmentation method not only meets the maximum length limit, but also ensures the semantic connection between adjacent chunks. Appropriate chunk size and overlap can improve the fluency and coherence of large language models in processing long texts.
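
A minimal sketch of this splitting logic (token-level, assuming the text is already tokenised; real pipelines would typically use a text splitter from a library such as LangChain):

```python
# Simple sketch of token-level chunking with overlap (assumes a `tokens` list;
# real pipelines would use a tokenizer plus a library text splitter).
def split_into_chunks(tokens, chunk_size=1024, chunk_overlap=128):
    chunks, start = [], 0
    step = chunk_size - chunk_overlap              # advance by 896 tokens per chunk
    while start < len(tokens):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break
        start += step
    return chunks

# A 2560-token sequence with chunk_size=1024 and chunk_overlap=128 yields 3 chunks,
# matching the example above (1-1024, 897-1920, 1793-2560 in 1-based numbering).
```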


How to segment the original text (how to choose the chunk size) has a great impact and needs to be judged based on the specific business needs:

  1. Experiment with changing hyperparameters and keep testing
  2. Decouple the chunk size used for indexing from the one used for generation; there are two ways to do this:


  • During retrieval, if the document can be organised as a document tree (e.g. paragraph 1, paragraph 2, ...), you can first have the LLM summarise each paragraph. At query time, first locate the paragraph by matching against the summaries (again using ANN similarity search), and then run ANN retrieval inside that paragraph.
  • Divide the document into very small pieces so that the internal semantics of each chunk are very clear. Then after retrieving chunk k, we take chunk k-1, k, k+1, that is, adjacent chunks, as the context.
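
A minimal sketch of the second idea, assuming the chunks are stored in document order:

```python
# Sketch of the neighbour-expansion idea: retrieve against small, semantically
# tight chunks, then hand the generator chunk k together with its neighbours.
def expand_with_neighbours(chunks: list[str], hit_index: int, window: int = 1) -> str:
    lo = max(0, hit_index - window)
    hi = min(len(chunks), hit_index + window + 1)
    return "\n".join(chunks[lo:hi])   # chunks k-1, k, k+1 as the generation context
```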

Poor retrieval results

Typical causes: the relevant chunk cannot be retrieved at all (an embedding-model or segmentation problem), or the retrieved chunk contains irrelevant information (apply summarisation and relevance filtering first).

How to improve:

  • Hybrid Search can be used, i.e. mixing several similarity measures such as BM25, keyword matching and vector similarity (a small score-fusion sketch follows below)
  • Mix in other searchable fields, such as metadata; you can also have the LLM extract the gist and keywords of each chunk, and combine this with a contextual summary
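
As a small illustration of hybrid search, here is one simple way to fuse a BM25 score with a vector-similarity score (the linear weighting is an assumption; rank-based fusion such as RRF is also common):

```python
# Sketch of a simple hybrid-search score fusion (weights are illustrative;
# production systems often use rank-based fusion such as RRF instead).
def hybrid_score(bm25_score: float, vector_score: float, alpha: float = 0.5) -> float:
    # Assumes both scores have already been normalised to [0, 1].
    return alpha * bm25_score + (1 - alpha) * vector_score
```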

Retrieval results are too many or too long

Filtering of retrieval results: post-processing is essentially a re-ranking step.


Give the model the original query together with the set of metadata types, and let the model work out the metadata sub-set to use as a filter.

Re-ranking: fine-tuning a re-ranker requires real business-domain data; you can also try using an LLM as the re-ranker (by writing a suitable prompt for it).
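
A hedged sketch of LLM-based re-ranking; `llm` stands in for whatever completion client is available and is not a specific API, and the prompt wording is made up:

```python
# Hypothetical prompt for using an LLM as a re-ranker; `llm` is any callable that
# takes a prompt string and returns the model's reply as a string.
RERANK_PROMPT = """You are a relevance judge.
Question: {question}
Passage: {passage}
Rate the relevance of the passage to the question on a scale of 0-10.
Answer with a single number."""

def llm_rerank(llm, question: str, passages: list[str], top_k: int = 3) -> list[str]:
    scored = []
    for p in passages:
        reply = llm(RERANK_PROMPT.format(question=question, passage=p))
        try:
            score = float(reply.strip())
        except ValueError:
            score = 0.0                      # treat unparsable replies as irrelevant
        scored.append((score, p))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [p for _, p in scored[:top_k]]
```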


Answer synthesis strategy:

Default version:

Iterative refine version: feed the chunks one by one, repeatedly asking the model to revise and update its previous answer
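
A minimal sketch of this refine loop (the `llm` callable and the prompt wording are assumptions, not a specific framework's API):

```python
# Sketch of refine-style answer synthesis: the first chunk produces a draft
# answer, later chunks are used to revise it. `llm` is a hypothetical client.
def refine_answer(llm, question: str, chunks: list[str]) -> str:
    answer = llm(f"Answer the question using the context.\n"
                 f"Context: {chunks[0]}\nQuestion: {question}")
    for chunk in chunks[1:]:
        answer = llm(f"Existing answer: {answer}\n"
                     f"Refine the answer with this additional context: {chunk}\n"
                     f"Question: {question}")
    return answer
```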

Interpretability and Robustness


Processing of complex queries

Decompose the prompt into sub-queries, and keep splitting into sub-questions until each one can be answered directly
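
One possible sketch of this decomposition loop (the `llm` and `rag_answer` helpers and the prompts are hypothetical):

```python
# Sketch of query decomposition: ask the LLM to split a complex question into
# answerable sub-questions, answer each with RAG, then compose a final reply.
def answer_complex_query(llm, rag_answer, question: str) -> str:
    subs = llm(f"Split this question into simple, independently answerable "
               f"sub-questions, one per line:\n{question}").splitlines()
    sub_answers = [f"{q.strip()}: {rag_answer(q)}" for q in subs if q.strip()]
    return llm("Combine the sub-answers into a final answer.\n"
               + "\n".join(sub_answers) + f"\nOriginal question: {question}")
```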

2. Atlas

Atlas: Few-shot Learning with Retrieval Augmented Language Models

Atlas has two sub-models: a retriever and a language model. When given a task, Atlas uses the retriever to fetch the top-k most relevant documents from a large corpus based on the input question, and then feeds these documents, together with the question, into the language model to generate the required output.

2.1 Model Architecture

The basic training strategy for the Atlas model is to co-train the retriever and language model using the same loss function. Both the retriever and the language model are based on the pre-trained Transformer network, where:

  • The retriever is based on Contriever. Contriever is pre-trained on unsupervised data and uses a dual-encoder design: the query and the document are encoded independently, and their similarity is obtained from the dot product of the corresponding outputs. This design allows Atlas to pre-train the retriever with contrastive learning, without document relevance annotations.
  • The language model is based on T5 (an encoder-decoder architecture). Each retrieved passage is concatenated with the question as <question, passage>, encoded separately by the encoder, and the encoder outputs are then concatenated and fed into the decoder, which performs cross-attention over them to generate the final reply. This Fusion-in-Decoder approach helps Atlas scale effectively as the number of documents grows (see the sketch below).
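
A conceptual sketch of the Fusion-in-Decoder step (not Atlas code; `encoder` is assumed to return hidden states of shape (1, seq_len, hidden) and `decoder_generate` is a hypothetical helper that decodes while cross-attending over the fused states):

```python
# Conceptual sketch of Fusion-in-Decoder (FiD): each (question, passage) pair is
# encoded independently, encoder outputs are concatenated along the sequence
# axis, and the decoder cross-attends over the combined sequence.
import numpy as np

def fid_generate(encoder, decoder_generate, question: str, passages: list[str]) -> str:
    # Encode each <question, passage> pair independently (linear cost in #passages).
    encoded = [encoder(f"question: {question} passage: {p}") for p in passages]
    # Concatenate hidden states along the sequence dimension before decoding.
    fused = np.concatenate(encoded, axis=1)
    # The decoder cross-attends over all passages jointly to produce the answer.
    return decoder_generate(fused)
```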

2.2 Training objectives

The supervision signal provided by the language model is used to train the retriever, so the retriever and the language model (LM) are trained jointly: if the language model finds a document useful when generating the output, the objective should encourage the retriever to rank that document higher. Based on this idea, the paper designs the following four loss functions (reconstructed as formulas in the sketch after this list):

  1. Attention Distillation (ADist): the cross-attention scores between the documents and the output, computed in the decoder, can be used as a measure of each document's importance. For each document, the attention scores of all its tokens are averaged over the attention heads of each decoder layer to give an importance score, yielding an attention-score distribution P_ATTN(d_k) over the documents. The retriever is then optimised by minimising the KL divergence between its distribution P_RETR(d_k) and the language model's attention distribution P_ATTN(d_k), i.e. the retrieval distribution P_RETR(d_k) is pushed as close as possible to P_ATTN(d_k). This loss is only used to optimise the retriever's parameters, not the language model.

  2. End-to-end training of Multi-Document Reader and Retriever (EMDR²): this loss treats the retrieved documents as latent variables. With q the given query and a the final output, the retriever's loss is the (negative) log of the sum, over the retrieved documents, of the product of the language-model score and the retrieval score; the language-model parameters are kept fixed and only the retriever's parameters are optimised. The joint training used by earlier retrieval-augmented models such as FiD and RAG is essentially this type of loss.

  3. Perplexity Distillation (PDist): an improved version of 1). The target distribution is changed from the attention distribution P_ATTN to the softmax over the per-document language-model scores, and the training goal is to minimise the KL divergence between the retriever's distribution and this target distribution, again optimising only the retriever's parameters.

  4. Leave-one-out Perplexity Distillation (LOOP): an improved version of 3). The per-document language-model score is replaced by the negative of the language-model score computed after removing that specific document from the retrieved set; the training goal is again to minimise the KL divergence between the retriever's distribution and this new language-model distribution. The computational cost of this loss is significantly higher than the previous ones.
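
As a math sketch, the four objectives can be written roughly as follows (reconstructed from the descriptions above; notation and KL direction may differ slightly from the paper):

```latex
% P_RETR: retriever distribution over the K retrieved documents,
% obtained by a softmax over the query-document similarities s(d_k, q).
P_{RETR}(d_k \mid q) = \frac{\exp\big(s(d_k, q)\big)}{\sum_{i=1}^{K} \exp\big(s(d_i, q)\big)}

% ADist: match the retriever to the decoder's cross-attention distribution.
\mathcal{L}_{ADist} = \mathrm{KL}\big(P_{ATTN}(d_k) \,\|\, P_{RETR}(d_k \mid q)\big)

% EMDR^2: documents as latent variables; LM parameters are frozen.
\mathcal{L}_{EMDR^2} = -\log \sum_{k=1}^{K} P_{LM}(a \mid q, d_k)\, P_{RETR}(d_k \mid q)

% PDist: match the retriever to the softmax of per-document LM scores.
\mathcal{L}_{PDist} = \mathrm{KL}\big(\mathrm{softmax}_k\big(\log P_{LM}(a \mid q, d_k)\big) \,\|\, P_{RETR}(d_k \mid q)\big)

% LOOP: same idea, but score each document by how much the LM degrades
% when that document is removed from the retrieved set.
\mathcal{L}_{LOOP} = \mathrm{KL}\big(\mathrm{softmax}_k\big(-\log P_{LM}(a \mid q, D_K \setminus \{d_k\})\big) \,\|\, P_{RETR}(d_k \mid q)\big)
```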

2.3 Pretext tasks

Pretext tasks are used to jointly pre-train the retriever and the language model in an unsupervised way. The paper experiments with several different pre-training tasks.

a) Prefix language modeling

Split the text into chunks of N words, and cut each chunk into two sub-sequences of length N/2. The first sub-sequence is used as the query, relevant documents are recalled by the retrieval module, and the model then generates the output; the generation target is the corresponding second sub-sequence.

b) Masked language modeling

Split the text into chunks of N words. For each chunk, randomly sample sub-spans with an average length of 3 and mask them until the masked portion accounts for 15% of the chunk. The masked chunk is then used as the query, relevant documents are recalled by the retrieval module, and the language model generates the masked spans.
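
A rough sketch of this span-masking step (illustrative only; the span-length distribution and the exact masking rules are assumptions):

```python
# Sketch of the span-masking step described above: mask sub-spans with an
# average length of about 3 tokens until ~15% of the chunk is masked.
import random

def mask_spans(tokens: list[str], mask_ratio: float = 0.15, mean_span: int = 3):
    tokens = list(tokens)
    target = int(len(tokens) * mask_ratio)
    masked, spans = 0, []
    while masked < target:
        span_len = max(1, round(random.expovariate(1 / mean_span)))  # mean length ~ 3
        start = random.randrange(0, max(1, len(tokens) - span_len))
        spans.append(tokens[start:start + span_len])
        tokens[start:start + span_len] = ["<mask>"] * span_len
        masked += span_len
    return tokens, spans   # masked chunk is the query; spans are the generation target
```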

c) Title to section generation

Using article information from Wikipedia, the title of the article and of the section is used as the query, relevant documents are recalled by the retrieval module, and the language model then generates the detailed content of the corresponding section.

2.4 Efficient retriever fine-tuning

The documents in the retriever's corpus are encoded into vectors by the document encoder and stored in an index. When the retriever and the language model are trained jointly, the index has to be refreshed whenever the retriever's document encoder is updated, and fully rebuilding the index consumes a lot of computing resources and time. This is especially true in the fine-tuning stage, where the number of training samples is far smaller than the number of indexed documents, so index updates add substantially to the overall training time.

a) Full index update

Update the entire index after every fixed number of training steps. The advantage of this approach is that fully refreshing the index keeps the retriever's document encoder and the index consistent, and the refresh frequency can be set according to actual needs. In the paper, the index contains 37 million documents, the training batch size is 64, 20 documents are recalled each step, and the full index is refreshed every 1,000 steps; updating the index accounts for roughly 30% of the training compute.

b) Re-ranking

At each training step, the retrieval module recalls the top-L documents, returns the top-K of them to the language model, and refreshes the index entries for these L documents, where L is greater than K; that is, the number of index entries refreshed each step is larger than the number of documents the language model consumes. In the paper, the number of index entries refreshed each step is 10 times the number of documents fed to the language model, and updating the index accounts for about 10% of the training compute.

c) Query-side fine-tuning

During training, the retriever only updates the query encoder and keeps the document encoder fixed, so the index never needs to be rebuilt and index updates account for 0% of the training compute. The impact of freezing the document encoder varies across tasks; in most few-shot scenarios it does not hurt performance much, and can sometimes even improve it.
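
A minimal PyTorch-style sketch of this idea (the `query_encoder` / `doc_encoder` modules and the learning rate are assumptions):

```python
# Sketch of query-side fine-tuning: freeze the document encoder (so the document
# index never needs rebuilding) and update only the query encoder.
import torch

def freeze_document_encoder(query_encoder: torch.nn.Module,
                            doc_encoder: torch.nn.Module):
    for p in doc_encoder.parameters():
        p.requires_grad = False                    # document embeddings / index stay fixed
    trainable = [p for p in query_encoder.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=1e-5)   # optimiser over the query encoder only
```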

Summary

Benefits of retrieval augmentation

  • Interpretability: the black-box nature of large models makes it difficult for researchers to analyse how a model reaches its output. A retrieval-augmented model, however, can directly expose the documents it retrieved, so analysing the articles returned by the retriever gives a better understanding of how Atlas works.
  • Controllability: large models are often suspected of "leaking" training data, i.e. sometimes a large model answers test questions from memorised corpus content rather than genuine capability, because the answers to the test questions appear in its training corpus. In the paper, after the authors manually removed corpus content that might have been leaked, the model's accuracy dropped only from 56.4% to 55.8%, a decrease of 0.6%, suggesting that retrieval augmentation can effectively mitigate this kind of leakage.
  • Updatability: a retrieval-augmented model can be kept up to date without retraining, simply by updating or replacing the corpus it relies on.

https://www.51cto.com/article/717069.html

https://zhuanlan.zhihu.com/p/595258642

https://zhuanlan.zhihu.com/p/563129454

3. REPLUG

REPLUG: Retrieval-Augmented Black-Box Language Models

The REPLUG model proposed in this work is representative of "black-box" retrieval augmentation. In this paradigm, the language model is a black box (its parameters are frozen and never updated), and the retrieval component is the part that gets fine-tuned.

REPLUG (Retrieve and Plug) simply adds a retrieval component in front of the language model: the retrieved information is combined with the original input and fed to the language model as a new input. Neither the retrieval component nor the language model needs training. (Personally, I think the retrieved text added to the original user query acts somewhat like a prompt.)

REPLUG LSR (REPLUG with LM-Supervised Retrieval) can be regarded as an upgraded version of REPLUG. It uses the language model to produce a supervision signal for the retrieval component, so that the retriever learns to select texts that reduce the perplexity of the language model's output.
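
As a math sketch, the two pieces can be written roughly as follows (reconstructed from the description above; notation may differ from the paper):

```latex
% Ensemble used by REPLUG: each retrieved document d is prepended to the input x,
% and the frozen LM's output distributions are averaged with weights lambda(d, x)
% given by a softmax over the retrieval similarities s(d, x).
p(y \mid x) = \sum_{d \in \mathcal{D}'} \lambda(d, x)\; P_{LM}(y \mid d \circ x),
\qquad
\lambda(d, x) = \frac{\exp\big(s(d, x)\big)}{\sum_{d' \in \mathcal{D}'} \exp\big(s(d', x)\big)}

% REPLUG LSR: only the retriever is trained, by pulling its distribution over
% documents towards the distribution implied by how much each document helps
% the frozen LM score the ground-truth continuation y.
\mathcal{L}_{LSR} = \mathrm{KL}\big(P_{R}(d \mid x) \,\|\, Q_{LM}(d \mid x, y)\big)
```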

https://blog.csdn.net/qq_27590277/article/details/129414851
