LongLoRA: Enhancing the contextual capabilities of pre-trained language models without requiring extensive computing resources

MIT and the Chinese University of Hong Kong introduce LongLoRA, a new fine-tuning method that extends the contextual capabilities of pre-trained large language models without requiring extensive computing resources.

LongLoRA is a new approach that makes it easier and cheaper to extend the context of large language models (LLMs). Training an LLM on long contexts normally requires a great deal of data, time, and compute: because the cost of self-attention grows quadratically with sequence length, training with a context length of 8192 requires roughly 16 times the attention compute of training with a context length of 2048.
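
To see where that 16× figure comes from: every token attends to every other token, so attention cost scales with the square of the sequence length. A quick back-of-the-envelope check (my own illustration, not code from the paper):

```python
# Self-attention compares every token with every other token, so its cost
# scales roughly with the square of the sequence length.
short_ctx = 2048
long_ctx = 8192

attention_cost_ratio = (long_ctx / short_ctx) ** 2
print(attention_cost_ratio)  # 16.0 -> roughly 16x more attention compute
```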

In the LongLoRA research paper, the authors share two ideas for making this process faster and cheaper.

First, during fine-tuning they use a sparser form of attention, which they call shifted sparse attention (S2-Attn). This new attention method saves a great deal of compute during training while remaining almost as effective as standard full attention, which the model can still use at inference time.
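
The rough idea behind S2-Attn is to split the sequence into groups, compute attention only within each group, and shift the tokens in half of the attention heads by half a group so that information still flows across group boundaries. Below is a minimal PyTorch sketch of that idea; it loosely follows the pseudocode in the paper, but the tensor layout, the roll-based shift, and the omission of the special mask at the wrapped-around boundary are my own simplifications, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def shifted_sparse_attention(q, k, v, group_size):
    """Sketch of shifted sparse attention (S2-Attn).

    q, k, v have shape (batch, seq_len, num_heads, head_dim), and seq_len
    must be divisible by group_size. Attention is computed within local
    groups; in half of the heads the tokens are shifted by half a group
    so that information can cross group boundaries.
    """
    bsz, seq_len, num_heads, head_dim = q.shape
    shift = group_size // 2
    half = num_heads // 2

    def shift_half_heads(x, offset):
        x = x.clone()
        # Roll the second half of the heads along the sequence dimension.
        x[:, :, half:] = x[:, :, half:].roll(offset, dims=1)
        return x

    q, k, v = (shift_half_heads(t, -shift) for t in (q, k, v))

    # Reshape so that each group of tokens is attended to independently:
    # (batch * num_groups, num_heads, group_size, head_dim).
    def to_groups(x):
        return x.reshape(bsz * seq_len // group_size, group_size,
                         num_heads, head_dim).transpose(1, 2)

    out = F.scaled_dot_product_attention(*(to_groups(t) for t in (q, k, v)),
                                         is_causal=True)

    # Undo the grouping and shift the second half of the heads back.
    out = out.transpose(1, 2).reshape(bsz, seq_len, num_heads, head_dim)
    return shift_half_heads(out, shift)
```

Because standard full attention is still used at inference time, this grouping only reduces the cost of fine-tuning, not the capability of the deployed model.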

Second, they revisit LoRA as a way to expand the context (the amount of information the model can attend to) effectively, and find that it works well for this purpose as long as the embedding and normalization layers are also made trainable; these layers hold only a small fraction of the parameters but matter for long-context adaptation.
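
In practice this amounts to a standard LoRA configuration with the embedding and normalization layers additionally unfrozen. The sketch below shows what such a setup might look like with the Hugging Face peft library; the model name and the module names (q_proj, v_proj, embed_tokens, norm) are typical for LLaMA-style checkpoints and are assumptions here, not taken from the paper's code:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Hypothetical base model; LongLoRA targets LLaMA2-style architectures.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    # Low-rank adapters on the attention projections, as in standard LoRA.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # The key extra ingredient for long-context fine-tuning: also train the
    # (small) embedding and normalization layers in full, not just adapters.
    modules_to_save=["embed_tokens", "norm"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()  # adapters + embeddings + norm layers
```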

LongLoRA shows good results on various tasks and works with LLMs of different sizes. It can extend the context length of LLaMA2 7B from 4k tokens to 100k, or LLaMA2 70B to 32k, all on a single 8× A100 machine.

The authors also compiled a dataset called LongQA, which contains more than 3,000 long-context question-and-answer pairs for supervised fine-tuning. This makes LongLoRA a very practical tool for efficiently extending large language models.

LongLoRA

The long-sequence language-modeling study evaluates the models on the Proof-pile and PG19 datasets. The models perform better as the training context size increases, demonstrating the effectiveness of LongLoRA's fine-tuning method. Simply put, training with longer contexts leads to better results: for example, when the context window grows from 8192 to 32768 tokens, the perplexity of one model improves from 2.72 to 2.50.
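
For readers less familiar with the metric: perplexity is the exponential of the average per-token cross-entropy loss, so lower values mean the model assigns higher probability to the held-out text, and a drop from 2.72 to 2.50 is an improvement. A minimal illustration with toy tensors (not the paper's evaluation code):

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    """Perplexity = exp(mean per-token cross-entropy); lower is better."""
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    return torch.exp(nll).item()

# Toy usage with random logits over a 32,000-token vocabulary.
logits = torch.randn(1, 16, 32_000)
targets = torch.randint(0, 32_000, (1, 16))
print(perplexity(logits, targets))
```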

The maximum context length study explores how long a context a model can handle on a single machine. The authors extend the models to very long contexts and find that they still perform well, although performance degrades somewhat at smaller context sizes.

In addition to language modeling, the study also tests the models on a retrieval-based task: finding specific topics within a very long conversation. The model performs similarly to state-of-the-art long-context models on this task, and even better in some cases, while being adapted with open-source data.

LongLoRA shows that the more context a large model can process, the better it understands language. It is not only good at handling long texts; it is also very good at finding specific topics in long conversations, which suggests it can handle complex, messy real-world tasks.

Because the context window is enlarged, LongLoRA shows some degradation when processing shorter text; the authors have not identified the cause of this behavior.

Summary

Recent discussions around language models such as LLaMA and Falcon have shifted the focus from simply increasing model parameters to the number of context tokens, or context length. The emergence of LongLoRA emphasizes the key role that context length plays in the development of language models and provides a cost-effective way to extend it.

Let’s summarize the key points of LongLoRA:

LongLoRA is a new fine-tuning method that improves the contextual capacity of large language models (LLMs) without excessive computation.

It adopts shifted sparse attention (S2-Attn) during fine-tuning for context expansion, which reduces the computational cost while maintaining performance.

LongLoRA combines LoRA with trainable embeddings and normalization to achieve significant contextual scaling.

On a single machine, LongLoRA can extend the context from 4k to 100k tokens for LLaMA2 7B, or to 32k for LLaMA2 70B.

The LongQA dataset enhances the practicality of supervised fine-tuning.

Longer context sizes during training can significantly improve model performance.

The model performs well even in expanded contexts, although there is a slight degradation in smaller context sizes.

In retrieval-based tasks, LongLoRA-equipped models outperform competitors, especially when using open-source data.

Paper address: LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models

https://avoid.overfit.cn/post/7b79c4325ff24114ad634a52d286f4f2

Origin: blog.csdn.net/m0_46510245/article/details/133427537