[Paper Deep Dive] QLoRA: Efficient Finetuning of Quantized LLMs

Foreword

QLoRA targets low-resource fine-tuning of large language models. Built on LoRA, it can fine-tune a 65B model on a single 48GB professional GPU with performance comparable to full fine-tuning, which caused quite a stir in the industry. This post walks through the paper in detail to see which techniques reduce the GPU memory requirement so dramatically.


Paper: https://arxiv.org/pdf/2305.14314.pdf
code: https://github.com/artidoro/qlora

Abstract

This paper proposes QLoRA, an efficient fine-tuning method that greatly reduces GPU memory overhead: it can fine-tune a 65B-parameter model on a single 48GB GPU while preserving full 16-bit fine-tuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit-quantized pretrained language model into low-rank adapters (LoRA). The best resulting model family, Guanaco, reaches 99.3% of ChatGPT's performance level on the Vicuna benchmark after only 24 hours of fine-tuning on a single GPU. QLoRA introduces several innovations to reduce GPU memory usage:

  1. 4-bit NormalFloat (NF4), a data type that is information-theoretically optimal for normally distributed weights.
  2. Double quantization, which reduces the average memory footprint by quantizing the quantization constants.
  3. Paged optimizers, which manage memory spikes.

The authors use 8 instruction-tuning datasets to run experiments on more than 1,000 models of different sizes and analyze model performance in detail, finding that QLoRA fine-tuning reaches state-of-the-art results even with smaller models.

Introduction

Fine-tuning large models is costly: regular 16-bit fine-tuning of a LLaMA 65B model requires more than 780GB of GPU memory. Recent quantization methods can reduce the memory footprint of LLMs, but they only apply to inference.
This paper demonstrates for the first time that a 4-bit quantized model can be fine-tuned without performance degradation: the pretrained model is quantized to 4 bits, and a small set of low-rank adapters (LoRA) is added and fine-tuned. This reduces the memory requirement for fine-tuning a 65B model from over 780GB to under 48GB without hurting performance. The smallest model in the resulting Guanaco family (7B) needs only 5GB of GPU memory and outperforms a 26GB Alpaca model on the Vicuna benchmark by more than 20 percentage points.
QLoRA introduces several innovations designed to reduce GPU memory usage without sacrificing performance:

  1. 4-bit NormalFloat (NF4), information-theoretically optimal for normally distributed weights.
  2. Double Quantization, a method that quantizes the quantization constants, saving about 0.37 bits per parameter (roughly 3GB for a 65B model).
  3. Paged Optimizers, which use NVIDIA unified memory to avoid memory spikes during gradient checkpointing.

Integrating these methods into LoRA avoids the performance loss seen in previous LoRA-related work.
The authors fine-tuned more than 1,000 models spanning 80M to 65B parameters and found the following:

  • Data quality matters far more than dataset size.
  • For a given task, dataset suitability matters more than size.
  • Chatbot evaluation is itself uncertain: model-based evaluation is a cheap but imperfect alternative to human evaluation.

Background

Block-wise k-bit Quantization

Quantization is the process of discretizing a representation carrying more information into one carrying less; it usually means converting a data type with more bits into one with fewer bits. To make full use of the value range of the low-bit data type, the input data is rescaled and normalized into the target type's range, typically by dividing the input tensor by its absolute maximum value. For example, quantizing a 32-bit floating-point tensor into the Int8 range [-127, 127]:

$$\mathbf{X}^{\mathrm{Int8}}=\operatorname{round}\left(\frac{127}{\operatorname{absmax}\left(\mathbf{X}^{\mathrm{FP32}}\right)}\,\mathbf{X}^{\mathrm{FP32}}\right)=\operatorname{round}\left(c^{\mathrm{FP32}}\cdot\mathbf{X}^{\mathrm{FP32}}\right)$$

where c is the quantization constant (the scale factor). Dequantization is the inverse process:

$$\operatorname{dequant}\left(c^{\mathrm{FP32}},\mathbf{X}^{\mathrm{Int8}}\right)=\frac{\mathbf{X}^{\mathrm{Int8}}}{c^{\mathrm{FP32}}}=\mathbf{X}^{\mathrm{FP32}}$$

The problem with this approach is that a single outlier shrinks the scale for the whole tensor, leaving many quantization bins under-used. A common fix is to split the input tensor into independent quantization blocks, each with its own quantization constant c. Specifically, the input tensor $\mathbf{X} \in \mathbb{R}^{b \times h}$ is flattened and cut into n contiguous blocks of size B, and each block is quantized independently, yielding a quantized tensor together with n quantization constants $c_i$. An example sketch follows.
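As a concrete illustration, here is a minimal NumPy sketch of block-wise absmax Int8 quantization and dequantization (the function names and the epsilon guard against all-zero blocks are my own, not from the paper):

```python
import numpy as np

def blockwise_quantize_int8(x: np.ndarray, block_size: int = 64):
    """Block-wise absmax quantization of a flat FP32 tensor to int8.

    Each block gets its own constant c_i = 127 / absmax(block), so an
    outlier only wastes quantization range inside its own block.
    """
    blocks = x.reshape(-1, block_size)                 # n blocks of size B
    absmax = np.abs(blocks).max(axis=1, keepdims=True)
    c = 127.0 / np.maximum(absmax, 1e-12)              # n quantization constants
    q = np.round(c * blocks).astype(np.int8)           # round(c * X_fp32)
    return q, c

def blockwise_dequantize(q: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Inverse map: X_fp32 ≈ X_int8 / c, block by block."""
    return (q.astype(np.float32) / c).reshape(-1)

w = np.random.randn(256).astype(np.float32)
q, c = blockwise_quantize_int8(w)
print("max round-trip error:", np.abs(w - blockwise_dequantize(q, c)).max())
```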

Low-rank Adapters

A low-rank adapter (LoRA) keeps the original model parameters fixed and trains only a small number of additional parameters, reducing memory cost. LoRA augments a linear projection with an extra low-rank factorized projection. Given a projection $\mathbf{XW}=\mathbf{Y}$ with $\mathbf{X} \in \mathbb{R}^{b \times h}$ and $\mathbf{W} \in \mathbb{R}^{h \times o}$, LoRA computes:
$$\mathbf{Y}=\mathbf{XW}+s\,\mathbf{X}\mathbf{L}_1\mathbf{L}_2$$
where $\mathbf{L}_1 \in \mathbb{R}^{h \times r}$, $\mathbf{L}_2 \in \mathbb{R}^{r \times o}$, and s is a scalar.
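This forward pass maps directly onto a few lines of PyTorch; a minimal sketch (dimensions and initialization are illustrative, following the common practice of zero-initializing the second factor so training starts from the base model's behavior):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Y = X W + s * X L1 L2, with W frozen and only L1, L2 trainable."""
    def __init__(self, h: int, o: int, r: int = 8, s: float = 1.0):
        super().__init__()
        self.W = nn.Parameter(torch.randn(h, o), requires_grad=False)  # frozen base weight
        self.L1 = nn.Parameter(torch.randn(h, r) * 0.01)               # h x r, trainable
        self.L2 = nn.Parameter(torch.zeros(r, o))                      # r x o, zero-init
        self.s = s

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.W + self.s * (x @ self.L1 @ self.L2)

layer = LoRALinear(h=1024, o=1024, r=8)
print(layer(torch.randn(4, 1024)).shape)   # torch.Size([4, 1024])
```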

Memory Requirement of Parameter-Efficient Finetuning

LoRA is a parameter-efficient fine-tuning (PEFT) method, so the memory used during LLM fine-tuning is dominated by activation gradients rather than by the LoRA parameters themselves. For 7B LLaMA, the LoRA weights amount to only 0.2% of the base model: the input gradients to the LoRA layers sum to 567MB, while the LoRA parameters occupy just 26MB. With gradient checkpointing, the input gradients drop to about 18MB per sequence on average, which is still more than all LoRA weights combined. Shrinking the number of LoRA parameters therefore yields only a marginal memory benefit; it is better to use adapters freely and save memory elsewhere, e.g., via gradient checkpointing. A quick sanity check of these numbers follows.
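A back-of-the-envelope check of the 26MB figure (assuming the LoRA weights are held in 16-bit precision; the 0.2% ratio is from the text above):

```python
total_params = 7e9                         # 7B base model
lora_params = total_params * 0.002         # LoRA weights are ~0.2% of the base
lora_mb = lora_params * 2 / 2**20          # 2 bytes per 16-bit weight -> MiB
print(f"LoRA weights: ~{lora_mb:.0f} MB")  # ~27 MB, matching the quoted 26MB
```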

QLORA Fine Tuning

QLoRA achieves high-fidelity 4-bit fine-tuning through two techniques, NF4 quantization and double quantization. It also introduces paged optimizers to prevent out-of-memory failures caused by memory spikes during gradient checkpointing.
QLoRA uses a low-precision storage data type (4-bit) and a separate computation data type (16-bit): whenever a QLoRA weight tensor is used, it is dequantized to BFloat16 and the matrix multiplication is performed in 16 bits.

4-bit NormalFloat Quantization

The NF data type builds on quantile quantization, an information-theoretically optimal data type that assigns an equal number of input values to each quantization bin. The main limitation of quantile quantization is that quantile estimation is expensive, so fast approximation algorithms (such as SRAM quantiles) are used instead, which incur large errors for outliers.
When input tensors come from a distribution that is fixed up to a quantization constant, this costly estimation and its approximation error can be avoided: the inputs then share the same quantiles, making exact quantile estimation computationally feasible.
Since pretrained neural network weights typically follow a zero-centered normal distribution $\mathcal{N}(0, \sigma)$, all weights can be transformed to a single fixed distribution by scaling by σ. The data type's range is set to [-1, 1], so both the quantiles of the data type and the network weights are normalized into this interval.
The distribution calculation process conforming to the above description is as follows:

  1. Estimate the 2^k + 1 quantiles of a theoretical N(0, 1) distribution to obtain a k-bit quantile-quantization data type suited to normal distributions.
  2. Normalize the values of this data type into the [-1, 1] range.
  3. Normalize the input weight tensor into [-1, 1] via absolute-maximum rescaling, then quantize it.

Once the ranges of the weights and the data type match, quantization proceeds as usual. Step 3 is equivalent to rescaling the standard deviation of the input tensor to match that of the k-bit data type. More formally, the $2^k$ values $q_i$ of the data type are estimated as:
$$q_i=\frac{1}{2}\left(Q_X\left(\frac{i}{2^k+1}\right)+Q_X\left(\frac{i+1}{2^k+1}\right)\right)$$
where $Q_X(\cdot)$ is the quantile function of the standard normal distribution N(0, 1). A problem with symmetric k-bit quantization is that it cannot represent zero exactly, yet an exact zero is important: it avoids errors when quantizing padding and other zero-valued elements. To guarantee a discrete zero point, the authors build an asymmetric data type by estimating $2^{k-1}$ quantiles $q_i$ for the negative part and $2^{k-1}+1$ quantiles for the positive part, then merging the two sets and removing one of the duplicated zeros. A sketch of this construction follows.
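Here is a sketch of the construction using SciPy's normal quantile function. The tail offset that keeps $Q_X$ finite is an assumption borrowed in spirit from the bitsandbytes implementation (which handles the two halves slightly differently); this is not the exact NF4 codebook:

```python
import numpy as np
from scipy.stats import norm

def normalfloat_levels(k: int = 4, offset: float = 0.9677) -> np.ndarray:
    """Asymmetric k-bit NormalFloat levels with an exact zero.

    Negative half: 2**(k-1) quantiles of N(0,1); positive half: 2**(k-1)+1
    quantiles; the duplicated zero is dropped and the result is rescaled
    into [-1, 1]. `offset` trims the tails so ppf() stays finite.
    """
    neg = norm.ppf(np.linspace(1 - offset, 0.5, 2 ** (k - 1)))   # ends at 0
    pos = norm.ppf(np.linspace(0.5, offset, 2 ** (k - 1) + 1))   # starts at 0
    levels = np.unique(np.concatenate([neg, pos]))               # 2**k values
    return levels / np.abs(levels).max()                         # normalize

levels = normalfloat_levels()
print(len(levels), levels.min(), levels.max(), 0.0 in levels)    # 16 -1.0 1.0 True
```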

Double Quantization

Double quantization quantizes the quantization constants themselves to save further memory, since the constants occupy non-trivial extra space.
It treats the first-level quantization constants $c_2^{\mathrm{FP32}}$ as inputs to a second quantization, which produces quantized constants $c_2^{\mathrm{FP8}}$ and a second-level quantization constant $c_1^{\mathrm{FP32}}$. The authors use 8-bit floats with a block size of 256 for this second quantization and observe no performance degradation. Because the $c_2^{\mathrm{FP32}}$ values are positive, their mean is subtracted before quantization so the values center around zero and symmetric quantization can be used. With a first-level block size of 64, the quantization constants originally cost 32/64 = 0.5 bits per parameter; after double quantization they cost 8/64 + 32/(64·256) = 0.127 bits per parameter, a saving of 0.373 bits per parameter.
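The memory arithmetic spelled out:

```python
# Before: one FP32 constant per 64-weight block
bits_before = 32 / 64                    # 0.5 bits per parameter

# After: FP8 constants per 64-weight block, plus one FP32 second-level
# constant per 256 first-level constants
bits_after = 8 / 64 + 32 / (64 * 256)    # 0.125 + 0.002 ≈ 0.127 bits
print(f"saved: {bits_before - bits_after:.3f} bits/parameter")  # 0.373
```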

Paged Optimizers

Paged optimizers use the NVIDIA unified memory feature, which enables automatic page-to-page transfers between CPU and GPU, so training can proceed even when GPU memory momentarily runs out. Specifically, optimizer states are automatically evicted to CPU RAM when the GPU runs low on memory and paged back into GPU memory when the optimizer update step needs them.
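In practice this is exposed through bitsandbytes; a hedged usage sketch (assumes a recent bitsandbytes release with paged optimizers and a CUDA device; the toy model and learning rate are placeholders):

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()   # stand-in for the real LLM

# Paged optimizer state lives in NVIDIA unified memory: it is evicted to
# CPU RAM under GPU memory pressure and paged back for the update step.
optimizer = bnb.optim.PagedAdamW(model.parameters(), lr=2e-4)

loss = model(torch.randn(16, 1024, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```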

QLORA

Combining the above components, QLoRA for a single linear layer of the quantized base model is defined as:
$$\mathbf{Y}^{\mathrm{BF16}}=\mathbf{X}^{\mathrm{BF16}}\,\text{doubleDequant}\left(c_1^{\mathrm{FP32}},c_2^{\mathrm{k\text{-}bit}},\mathbf{W}^{\mathrm{NF4}}\right)+\mathbf{X}^{\mathrm{BF16}}\mathbf{L}_1^{\mathrm{BF16}}\mathbf{L}_2^{\mathrm{BF16}}$$
where doubleDequant(·) is defined as:
$$\text{doubleDequant}\left(c_1^{\mathrm{FP32}},c_2^{\mathrm{k\text{-}bit}},\mathbf{W}^{\mathrm{k\text{-}bit}}\right)=\operatorname{dequant}\left(\operatorname{dequant}\left(c_1^{\mathrm{FP32}},c_2^{\mathrm{k\text{-}bit}}\right),\mathbf{W}^{\mathrm{4bit}}\right)=\mathbf{W}^{\mathrm{BF16}}$$
The authors apply NF4 to $\mathbf{W}$ and FP8 to $c_2$, using a block size of 64 for $\mathbf{W}$ (higher quantization precision) and a block size of 256 for $c_2$ (to save memory).
Parameter updates only involve the adapter gradients $\frac{\partial E}{\partial \mathbf{L}_i}$, never gradients with respect to the 4-bit weights $\frac{\partial E}{\partial \mathbf{W}}$. Computing the former does, however, require evaluating $\frac{\partial \mathbf{X}}{\partial \mathbf{W}}$, which proceeds through the equation above: the stored $\mathbf{W}^{\mathrm{NF4}}$ is dequantized to the computation data type $\mathbf{W}^{\mathrm{BF16}}$ so the derivative is calculated in BFloat16.
In summary, QLoRA has one storage data type (4-bit NormalFloat) and one computation data type (16-bit BrainFloat). The storage data type is dequantized to the computation data type for the forward and backward passes, but weight gradients are computed only for the LoRA parameters, which are kept in 16-bit BrainFloat.
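A NumPy round-trip sketch of doubleDequant under the absmax scheme from the Background section (integer levels stand in for the NF4 codebook, and the mean-centering of the positive constants is omitted):

```python
import numpy as np

# Toy weight matrix, quantized block-wise: one constant per 64-value block
W = np.random.randn(8, 64).astype(np.float32)
c2 = 127.0 / np.abs(W).max(axis=1, keepdims=True)   # first-level constants
W_q = np.round(c2 * W)                              # quantized weights

# Second level: quantize the constants themselves
c1 = 127.0 / np.abs(c2).max()                       # second-level constant (FP32)
c2_q = np.round(c1 * c2)                            # quantized constants ("FP8" stand-in)

def double_dequant(c1, c2_q, W_q):
    c2_hat = c2_q / c1        # inner dequant: recover the block constants
    return W_q / c2_hat       # outer dequant: recover W in the compute dtype

print("round-trip error:", np.abs(W - double_dequant(c1, c2_q, W_q)).max())
```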

QLoRA vs. Standard fine tuning

QLoRA offers large GPU memory savings, but can it match the performance of full fine-tuning?

Experimental setup

The authors run experiments on three model architectures, comparing QLoRA against 16-bit adapter fine-tuning and, for models up to 3B, against full fine-tuning. See Appendix A of the paper for full details.
Since paged optimizers only kick in when processing long sequences in mini-batches, the authors report only a simple measurement: with a batch size of 16, paged optimizers deliver the same training speed as regular optimizers. Future work could characterize the circumstances under which paged optimizers become slower.

Default LoRA hyperparameters do not match 16bit performance

Standard LoRA with default hyperparameters fails to match full fine-tuning performance, as shown in the figure below:
[Figure from the paper: LoRA hyperparameter sweep vs. full fine-tuning]
The experiments show that the most critical LoRA hyperparameter is the total number of LoRA adapters used: only applying LoRA to every linear transformer-block layer matches full fine-tuning performance.

4-bit NormalFloat yields better performance than 4-bit Floating Point

The authors conduct experimental evaluations using different data types on LLMs of different architectures and sizes, as shown in the figure and table below.
[Figure and table from the paper: data-type comparison across architectures and sizes]
It can be seen that compared with FP4 and Int4, NF4 significantly improves performance, and double quantization reduces memory usage without reducing performance.

k-bit QLORA matches 16-bit full finetuning and 16-bit LoRA performance

Recent work shows that 4-bit quantization works for inference but degrades performance relative to 16 bits. Can the lost performance be recovered by fine-tuning 4-bit adapters?
First, QLoRA is compared against 16-bit full fine-tuning of RoBERTa and T5 models (125M to 3B parameters), with the following results:
[Table from the paper: RoBERTa and T5 adapter results vs. 16-bit full fine-tuning]
The 16-bit, 8-bit, and 4-bit adapter methods all replicate the performance of the fully fine-tuned 16-bit baseline, showing that the precision lost to quantization can be fully recovered through adapter fine-tuning.
Next, the authors test whether 4-bit QLoRA can match 16-bit LoRA on models from 7B to 65B parameters, with the results below:
[Table from the paper: 7B-65B QLoRA vs. 16-bit LoRA results]
NF4 fully recovers the performance of 16-bit LoRA, while FP4 QLoRA lags 16-bit LoRA by about one percentage point. This supports two conclusions:

  1. QLoRA with NF4 replicates the performance of both 16-bit full fine-tuning and 16-bit LoRA fine-tuning.
  2. NF4 outperforms FP4 in quantization precision.

Summary

In summary, these experiments show that 4-bit QLoRA with the NF4 data type matches the performance of both 16-bit full fine-tuning and 16-bit LoRA fine-tuning, and that NF4 is more effective than FP4. Furthermore, the MMLU and Elo results indicate that, for a given resource budget, it pays to increase the number of parameters while decreasing their precision.

Pushing the Chatbot State-of-the-art with QLoRA

This section studies instruction fine-tuning in depth. To assess instruction-tuned models, the authors evaluate on MMLU and also develop new methods for evaluating real-world chatbot performance.

Experimental setup

Data

The authors select eight recent datasets, including OASST1, HH-RLHF, Alpaca, and others, covering different languages, data sizes, and licenses.

Training Setup

To avoid the confounding effects of differing training objectives across datasets, the authors fine-tune all QLoRA models with a plain cross-entropy loss and no reinforcement learning, even on datasets that include human preference judgments. This keeps the fine-tuning procedure consistent and removes noise from mismatched objectives. All experiments use NF4 QLoRA with double quantization and paged optimizers. A hedged configuration sketch follows.
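With the Hugging Face stack, this setup corresponds roughly to the following sketch (model name, rank, and other hyperparameters are placeholders; the transformers/peft/bitsandbytes APIs are assumed from recent library versions, not taken from the paper):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 storage data type
    bnb_4bit_use_double_quant=True,         # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # BF16 computation data type
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b", quantization_config=bnb_config, device_map="auto"
)
model = get_peft_model(model, LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.1, task_type="CAUSAL_LM",
    target_modules="all-linear",            # LoRA on every linear layer
))
# Train with plain cross-entropy; pair with a paged optimizer
# (e.g. bitsandbytes PagedAdamW) to absorb memory spikes.
```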

Baselines

The authors compare their models against research systems (Vicuna, Open Assistant) and commercial chatbot systems (GPT-4, GPT-3.5-turbo, and Bard); the research systems are fine-tuned LLaMA models.

Evaluation

The authors use the MMLU benchmark to measure performance on a range of language understanding tasks. In addition, generative language ability is tested through automatic assessment and human assessment.

Benchmark Data

The author evaluates on the Vicuna and OASST1 datasets as benchmarks.

Automated Evaluation

The authors use GPT-4 to score different models' outputs on the Vicuna benchmark. GPT-4 exhibits a significant order effect: it inflates the score of whichever response appears earlier in the prompt. To cancel this bias, the authors report the mean score over both orderings.
They also reduce the comparison to a three-class labeling problem, asking GPT-4 to pick the best output (or declare a tie) and provide an explanation.
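A sketch of the order-debiasing step described above (`judge` is a placeholder for the GPT-4 API call, not a real function):

```python
def judge(first: str, second: str) -> tuple[float, float]:
    """Placeholder: ask GPT-4 to score two responses shown in this order."""
    raise NotImplementedError

def debiased_scores(resp_a: str, resp_b: str) -> tuple[float, float]:
    # Query both orderings and average, cancelling the position bonus
    a1, b1 = judge(resp_a, resp_b)   # A shown first
    b2, a2 = judge(resp_b, resp_a)   # B shown first
    return (a1 + a2) / 2, (b1 + b2) / 2
```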

Human Evaluation

Since GPT-4 judgments are hard to validate, the authors also run two parallel human evaluations on the Vicuna benchmark: two annotators compare model outputs against ChatGPT, and three annotators perform pairwise comparisons, yielding human-annotated data on relative model quality.

Elo Rating

The authors use an Elo rating scheme, treating pairwise model comparisons as matches. After each match, the Elo change is proportional to the gap between the actual and expected outcome: an unexpected upset causes a large rating swing, while an expected result causes a small one. Over time, Elo ratings converge to reflect each model's ability.
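For reference, one Elo update (the K-factor and starting ratings here are illustrative):

```python
def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """One Elo update after a pairwise match.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    The rating change is proportional to (actual - expected), so an
    upset win moves ratings much more than an expected win.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# An underdog (1000) upsetting a favorite (1200) causes a large swing
print(elo_update(1000.0, 1200.0, score_a=1.0))   # ≈ (1024.3, 1175.7)
```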

Guanaco: QLORA trained on OASST1 is a State-of-the-art Chatbot

Based on both automated and human evaluation, the QLoRA-tuned Guanaco 65B is the best-performing open-source chatbot, with performance comparable to ChatGPT. The table below shows Vicuna benchmark results relative to ChatGPT.
[Table from the paper: Vicuna benchmark results relative to ChatGPT]
Guanaco 65B is the best-performing model after GPT-4. Guanaco 33B has more parameters than Vicuna 13B but uses 4-bit precision for its weights, so it is more memory efficient (21GB vs. 26GB) while improving performance by three percentage points. Furthermore, the Guanaco 7B, at a phone-friendly 5GB footprint, still scores nearly 20 percentage points higher than Alpaca 13B.
Furthermore, there is moderate agreement between GPT-4 and human annotators at the system level, suggesting that model-based evaluation is a somewhat reliable alternative to human evaluation.
[Table from the paper: Elo rankings on the Vicuna and OA benchmarks]
The Elo rankings in the table show that Guanaco 33B and 65B outperform all models except GPT-4 on the Vicuna and OA benchmarks and are comparable to ChatGPT. Model performance also varies considerably across benchmarks, indicating that strong MMLU performance does not imply strong chatbot performance, and vice versa.
Guanaco is the only top model in the evaluation not trained on proprietary data; the next-best such model, HH-RLHF, scores 30 percentage points lower on the Vicuna benchmark, underscoring the effectiveness of 4-bit QLoRA.

Qualitative Analysis

Quantitative analysis has limits: machine learning models sometimes exploit shortcuts in benchmarks. To mitigate this, the authors present a qualitative analysis in two parts:

  1. Representative examples of the generation patterns of the 65B Guanaco model.
  2. A detailed discussion and interpretation of these results (Section 6.2 of the paper).

Qualitative Analysis of Example Generations

First, consider what generations look like on the Vicuna and OpenAssistant benchmarks. The authors probe the model for "lemons" (failure cases, to find and characterize its weaknesses) and "cherries" (success cases, to confirm and highlight its strengths), aiming for a more complete picture of the model's behavior.
Such qualitative study is necessarily incomplete; the authors hope the responses generated for each prompt are representative and leave more systematic study of these issues to future work.

Factual Recall

As factual questions get more obscure, for example asking what year the artist of a particular song was born, Guanaco gives a wrong artist and a wrong birth date.

Suggestibility

Guanaco pushes back strongly against attempted misinformation. For example, when a question asserts that the earth is flat, Guanaco corrects the claim and explains why it is wrong.

Refusal

Guanaco sometimes refuses to follow instructions for seemingly random reasons, even when the task is simple.

Secret Keeping

When asked to keep a secret word, Guanaco holds firm, but small tricks can still break the secret, such as telling it that this is a game in which it should say the secret word.

Math

Many large language models are weak in mathematics, and Guanaco is no exception. It will fail on slightly more complex mathematical tasks.

Theory of Mind

Guanaco shows strong theory-of-mind capability, but it sometimes fabricates details of a situation that were never stated.

Considerations

Evaluation

The authors find that agreement between human annotators drops when comparing two strong models, pointing to limitations of current benchmarks and human evaluation protocols for chatbot performance; future work should reduce annotator subjectivity to measure performance more reliably.
They also find that GPT-4 has a significant order effect, scoring whichever system appears first in the prompt higher, and that it rates its own responses above others'. Eliminating these biases is left to future work.

Data & Training

The OASST1 dataset is multilingual and the OA benchmark contains multilingual prompts. Whether, and by how much, multilingual training improves performance remains an open question.
On the training side, the models are trained only with a cross-entropy loss, without reinforcement learning from human feedback (RLHF); the tradeoffs between the two approaches deserve further analysis.

Related Work

Quantization of Large Language Models

Work on LLM quantization has focused mostly on inference time. Most methods manage outlier features to preserve 16-bit LLM quality, while others use more sophisticated grouping schemes. Lossy quantization work studies rounding tradeoffs and how to optimize rounding decisions to improve quantization accuracy. Few works have applied quantization during backpropagation.

Finetuning with Adapters

LoRA has been shown to reach full 16-bit fine-tuning performance. Other parameter-efficient methods exist as well, such as prompt tuning, tuning the embedding layer inputs, tuning hidden states, tuning biases, and more; exploring them under quantization is left for future work.

Instruction fine tuning

Instruction fine-tuning uses input-output pairs from various data sources to fine-tune a pretrained LLM to generate the output given the input as a prompt. Methods and datasets include MetaICL, MetaTuning, InstructGPT, FLAN, and others.

Chatbots

Existing chatbots are typically trained with reinforcement learning from human feedback (RLHF), or with AI feedback (RLAIF), where data generated by an existing model is used for training.

Limitations and Discussion

The authors do not establish that QLoRA matches 16-bit full fine-tuning at the 33B and 65B scales. Another limitation concerns the evaluation of instruction-tuned models: the authors evaluate on MMLU, the Vicuna benchmark, and the OA benchmark, but not on other suites such as BigBench or RAFT, so the breadth of evaluation is limited.
They also find that benchmark performance may depend on how similar the fine-tuning data is to the benchmark data, so the goal of an evaluation must be considered carefully before running it.
Finally, the evaluation of Guanaco itself is not comprehensive; on the precision side, 3-bit and other bit-widths remain to be evaluated, and many other parameter-efficient fine-tuning methods could be combined with quantization.

Broader Impacts

QLoRA is the first method to enable fine-tuning a 33B-parameter model on a single consumer-grade GPU and a 65B-parameter model on a single professional-grade GPU without degrading model performance, helping to close the compute gap between large companies and small teams.
Another impact is on mobile devices: QLoRA could become a key milestone toward fine-tuning LLMs on phones, making 7B-model fine-tuning feasible on a handset (the paper gives the iPhone 12 Plus as an example). While phone-tuned models will not reach ChatGPT quality, this matters greatly for privacy-preserving use and deployment of large models.

Reading Summary

QLoRA is not a single novel trick but an integration of several techniques, each attacking a pain point of earlier methods: LoRA provides full-fine-tuning-level quality, NF4 quantization plus double quantization provide high-fidelity 4-bit storage, and paged optimizers absorb GPU memory spikes. Combined, they achieve a 1+1>2 effect, enabling fine-tuning of a 33B model on a single consumer-grade GPU and a 65B model on a single professional-grade GPU. The first half of the paper establishes that QLoRA enables fine-tuning under low compute budgets; the second half introduces the Guanaco models to validate QLoRA's quality. Guanaco is fine-tuned with QLoRA, and through qualitative and quantitative comparison against other large models the authors show its performance is comparable to ChatGPT, which indirectly shows that QLoRA matches full fine-tuning in quality. Some further properties of the model are also explored.
The heart of the paper is therefore the first half. I do not yet fully understand the quantization procedure and need to read the code to deepen my understanding. The biggest lesson for me from this work is not to grind away in isolation, but to start from industry needs or a problem found in applications, then solve it with existing knowledge. QLoRA builds on LoRA: GPU memory is insufficient, so compress the data; memory spikes occur, so page part of the state out to CPU RAM. These ideas are easy to think of, yet nobody had assembled them before, perhaps because the demand only materialized in the past two years. Only by solving real problems do we discover more problems, and more inspiration for papers.
