Customize your own ChatGPT: a lightweight LLM-IFT platform with multiple interfaces (Alpaca-CoT)
This is the repository of the Alpaca-CoT project, which aims to build a lightweight, multi-interface instruction fine-tuning (IFT) platform. It brings together a wide range of instruction collections (especially CoT datasets) and a unified interface for various large language models and parameter-efficient methods (e.g. LoRA, P-Tuning). We are continuously expanding our instruction-tuning data collection and integrating more LLMs. In addition, we created a new branch, tabular_llm, to build large language models that can handle tabular intelligence tasks.
You are welcome to contribute any instruction-tuning datasets (or their sources) that we have not yet collected. We will unify their format, train Alpaca models (and other LLMs in the near future) with these datasets, open source the model checkpoints, and conduct extensive empirical studies. We hope that our project can make a modest contribution to the open source process of large language models and lower the entry barrier for NLP researchers.
You can also join our group chat (WeChat) to communicate with fellow researchers. The group is currently too large to join directly, so you need to be invited by a member. Please scan the code to add me as a friend, and I will invite you to the group.
News
- 6.25: Added model evaluation code, including belle and MMCU.
- 5.5: Created a new branch tabular_llm to build a large language model that can handle a variety of tabular intelligence tasks.
- 5.4: All parameter-efficient methods in PEFT (such as P-tuning) are integrated and can be easily set through hyperparameters.
- 5.4: The LLM MOSS has been integrated.
- 4.21: Collected and uniformly formatted the datasets GAOKAO, camel, FLAN-Muffin and COIG.
- 4.15: Collected and uniformly formatted the datasets webGPT, dolly, baize, hh-rlhf and OIG(part).
- 4.12: You can now try Alpaca-CoT in Google Colab.
- 4.11: Added the multi-turn conversation (多轮对话) feature, thanks to @paulcx.
- 4.9: Collected and uniformly formatted the datasets firefly, instruct and Code Alpaca here.
- 4.7: Added the parameter merging (参数合并), local use (本地使用), batch prediction (批量预测) and web service (web服务) features, thanks to @weberrr.
- 4.4: Collected and uniformly formatted the datasets FastChat, GPTeacher, Guanaco, HC3, prosocial-dialog, belle-chat&belle-math, xP3 and natural-instructions.
- 4.3: The Chinese CoT dataset CoT_CN_data.json has been uploaded here.
- 4.1: The checkpoint of Bloom7b fine-tuned on instinwild-CN(47k) + belle(1.5M) has been uploaded here.
- 4.1: instinwild (collected from Twitter; mainly generation, open QA and mind-storm types) has been collected and uniformly formatted.
0. The technology behind ChatGPT
LLM (Large Language Model): a language model that has undergone large-scale pre-training and has a large number of parameters, generally a transformer-based model.
IFT (Instruction Fine-Tuning): an instruction is input text with a clear purpose passed in by the user; instruction fine-tuning teaches the model to follow such user instructions.
CoT (Chain-of-Thought): a special form of instruction that includes a step-by-step reasoning process, as shown in the blue part of the figure below.
1. Positioning
The emergence of ChatGPT has demonstrated the potential of large language models (LLMs) for artificial general intelligence (AGI). Instruction-tuning research (e.g., Alpaca [2]) based on LLMs such as LLaMA [1] has greatly accelerated the process of reproducing ChatGPT. Alpaca-CoT hopes to make a modest contribution in this research direction, to promote the open source process of LLMs and reduce the cost of LLM research and use.
Specifically, the Alpaca-CoT project aims to explore how to better induce LLMs to have ChatGPT-like interaction and instruction-following capabilities through instruction-tuning. To this end, we extensively collected different types of instructions (especially Chain-of-Thought datasets) and carried out an in-depth empirical study based on LLaMA for reference in future work. To the best of our knowledge, we are the first to extend CoT into Alpaca, hence the abbreviation "Alpaca-CoT".
You are warmly welcome to provide us with any instruction-tuning or task datasets (or their sources) not yet collected by this project. We will:
- Collect and format these data in a unified manner;
- Use these datasets to instruction fine-tune LLaMA models (more LLMs will be integrated in the future), and open source their checkpoints;
- Conduct extensive empirical research to explore the effects of the newly included datasets.
We hope that our project can make a modest contribution to the open source process of large language models, and lower the threshold for NLP researchers to get started with LLM related research.
2. Overview
Recently, LLaMA [1] has shown impressive zero-shot and few-shot capabilities, reaching performance comparable to GPT-3.5 with far fewer parameters (LLaMA-13B is significantly better than GPT-3 (175B), and LLaMA-65B is comparable to PaLM-540B), which significantly reduces the cost of training, fine-tuning and using competitive large language models. To improve the instruction-following ability of LLaMA, Stanford Alpaca [2] then fine-tuned LLaMA with 52K English instruction-finetuning samples generated by self-instruct [3]. However, current research in this direction still faces the following three challenges:
- LLaMA-7b still has high requirements for computing resources;
- There are fewer open source datasets for instruction finetuning;
- There is a lack of empirical research on the impact of each instruction type, such as the ability to respond to Chinese and CoT ability.
To this end, we propose the Alpaca-CoT project, which combines recent cutting-edge techniques and offers the following advantages:
- The fine-tuning of LLaMA can be completed efficiently with only modest computing resources. The 7b, 13b and 30b versions of LLaMA can be easily trained on a single 80G A100 card. This advantage mainly comes from technologies such as low-rank adaptation (LoRA) [4], PEFT and bitsandbytes. Our code is mostly modified from here.
- The models we release significantly improve CoT (reasoning) ability.
- The models we release significantly improve the ability to respond to Chinese instructions.
- We maintain a collection of instruction-finetuning datasets that is still growing. This collection contains instruction data in Chinese, English and CoT. We also maintain a collection of model checkpoints trained on various instruction datasets.
- A variety of LLMs are integrated behind a unified calling interface and can be switched easily through hyperparameters. LLaMA, ChatGLM [5] and Bloom [6] are currently included, and more will be added in the future, so that researchers can easily call and compare different LLMs.
- We provide a detailed and thorough empirical study and qualitative analysis; the findings may serve as a useful reference for future LLM exploration.
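To make the first advantage more concrete, the short sketch below counts trainable parameters for full fine-tuning of one weight matrix versus a rank-r LoRA update of the same matrix. It is only an illustration of the general LoRA idea [4], not code from this repository; the 4096 x 4096 matrix size and the function names are assumptions chosen for the example.

```python
def full_finetune_params(d_in: int, d_out: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_in * d_out


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters of a rank-r LoRA update B @ A.

    A has shape (r, d_in) and B has shape (d_out, r); the original
    weight matrix stays frozen.
    """
    return r * (d_in + d_out)


# Assumption for illustration: a square 4096 x 4096 projection matrix.
d = 4096
for r in (8, 32):
    full = full_finetune_params(d, d)
    lora = lora_params(d, d, r)
    print(f"r={r}: {lora:,} LoRA parameters vs {full:,} in the full matrix "
          f"({full // lora}x fewer)")
```

Because only the two small factors are trained, optimizer state and gradients shrink accordingly, which is one reason a single card can be enough.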
3. Data Collection
The relative sizes of the collected datasets are shown in the figure below:
Following here (@yaodongC), we tag each collected dataset according to the following rules:
(Lang)Lingual-Tags:
- EN: Instruction datasets in English
- CN: Instruction datasets in Chinese
- ML: [Multi-lingual] Instruction datasets in multiple languages
(Task)Task-Tags:
- MT: [Multi-task] Datasets containing multiple tasks
- TS: [Task-specific] Datasets tailored for specific tasks
(Gen)Generation-method:
- HG: [Human Generated Dataset] Datasets created by humans
- SI: [Self-Instruct] Datasets generated using self-instruct methods
- MIX: [Mixed Dataset] Dataset contains both human and machine generated data
- COL: [Collection of Dataset] Dataset made from a collection of other datasets
Statistics
Dataset | Number | Lang | Task | Gen | Type | Source | Link |
---|---|---|---|---|---|---|---|
Chain of Thought | 74771 | EN/CN | MT | HG | CoT-related tasks | Humans annotate CoT on existing datasets | download |
GPT4all | 806199 | EN | MT | COL | code, story, dialogue | GPT-3.5-turbo distillation | download |
GPTeacher | 29013 | EN | MT | SI | general, role playing, tool instructions | GPT-4 & toolformer | download |
guanaco | 534610 | ML | MT | SI | various NLP tasks | text-davinci-003 | download |
HC3 | 37175 | EN/CN | TS | MIX | dialogue evaluation | gpt-3.5 or manual | download |
alpaca | 52002 | EN | MT | SI | general instructions | text-davinci-003 | download |
Natural Instructions | 5040134 | ML | MT | COL | various NLP tasks | Collection of human-annotated datasets | download |
belle_cn | 1079517 | CN | TS/MT | SI | general instructions, mathematical reasoning, conversation | text-davinci-003 | download |
instinwild | 52191 | EN/CN | MT | SI | generation, open domain question answering, brainstorming | text-davinci-003 | download |
prosocial dialog | 165681 | EN | TS | MIX | dialogue | GPT-3 rewrites questions, human answers | download |
finance_en | 68912 | EN | TS | COL | questions and answers in the financial field | GPT3.5 | download |
xP3 | 78883588 | ML | MT | COL | various NLP tasks | Collection of human-annotated datasets | download |
firefly | 1649398 | CN | MT | COL | 23 NLP tasks | Collected Chinese datasets with manually written instruction templates | download |
instruct | 888969 | EN | MT | COL | augmentation of GPT4All, Alpaca and open source datasets | Augmented with the NLP tools provided by AllenAI | download |
Code Alpaca | 20022 | EN | SI | SI | code generation, editing, optimization | text-davinci-003 | download |
Alpaca_GPT4 | 52002 | EN/CN | MT | SI | general instructions | Alpaca data generated by GPT-4 | download |
webGPT | 18994 | EN | TS | MIX | information retrieval questions and answers | fine-tuned GPT-3 + human evaluation | download |
dolly 2.0 | 15015 | EN | TS | HG | open and closed question answering, information extraction, summarization, brainstorming, classification and creative writing tasks | manual annotation | download |
baize | 653699 | EN | MT | COL | Alpaca and various question answering tasks | Collection of human-annotated datasets | download |
hh-rlhf | 284517 | EN | TS | MIX | dialogue | RLHF models | download |
OIG(part) | 49237 | EN | MT | COL | various NLP tasks | Collection and data augmentation of human-annotated datasets | download |
GAOKAO | 2785 | CN | MT | COL | multiple choice, fill in the blank and other questions from the college entrance examination | Collection of human-annotated datasets | download |
camel | 760620 | EN | MT | SI | role-playing dialogues in physics, biochemistry, programming, mathematics, society, etc. | gpt-3.5-turbo | download |
FLAN-Muffin | 1764800 | EN | MT | COL | 60 NLP tasks | Collection of human-annotated datasets | download |
COIG | 298428 | CN | MT | COL | examination, translation, human value alignment instructions, counterfactual dialogue based on knowledge graphs | Automated tools + manual verification | download |
GPT4Tools | 71446 | EN | MT | SI | a collection of tool-related instructions | gpt-3.5-turbo | download |
ShareChat | 1663241 | EN | MT | MIX | general instructions | Collected from ShareGPT | download |
Auto CoT | | EN | | | | | download |
MOSS | 1583595 | EN/CN | | SI | | | download |
ultrachat | 28247446 | EN | | | | | download |
StackLLaMA | todo | EN | | | | | |
The collection is still being updated and expanded. More data details can be downloaded and viewed at the following link: https://github.com/PhoebusSi/alpaca-CoT/tree/main/data
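As a rough illustration of how the tags can be used, the sketch below stores a few rows of the statistics table as records and filters them by tag. The `DatasetInfo` class and the `select` helper are hypothetical names introduced here for the example; they are not part of the repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    name: str
    num_samples: int
    lang: str  # EN, CN, ML, or a combination such as "EN/CN"
    task: str  # MT or TS (or a combination such as "TS/MT")
    gen: str   # HG, SI, MIX or COL


# A few rows copied from the statistics table.
CATALOG = [
    DatasetInfo("Chain of Thought", 74771, "EN/CN", "MT", "HG"),
    DatasetInfo("HC3", 37175, "EN/CN", "TS", "MIX"),
    DatasetInfo("Natural Instructions", 5040134, "ML", "MT", "COL"),
    DatasetInfo("firefly", 1649398, "CN", "MT", "COL"),
]


def select(catalog, lang=None, task=None, gen=None):
    """Return the entries matching every tag that is given (None means any)."""
    def matches(info):
        return (
            (lang is None or lang in info.lang.split("/"))
            and (task is None or task in info.task.split("/"))
            and (gen is None or info.gen == gen)
        )
    return [info for info in catalog if matches(info)]


chinese_multitask = select(CATALOG, lang="CN", task="MT")
print([info.name for info in chinese_multitask])
```

Splitting combined tags on "/" lets a dataset tagged EN/CN match a query for either language.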
Download
You can download all the data that we have formatted here. Then, put all the downloaded files into the data folder.
You can download all checkpoints trained on various types of instruction data here. Then, after setting LoRA_Weights (in generate.py) to the local download path, you can directly run model inference to see how the model performs.
Data Format
All the data in our collection has been transformed into the same format, and each sample has the following format:
[
{"instruction": instruction string,
"input": input string, # (may be empty)
"output": output string}
]
Note that for the CoT data set, we first use the template provided by FLAN to convert it from the original data to the Chain-of-Thought form, and then unify it into the above format. The format unification script can be found here .
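For readers who want to sanity-check their own files against this format, here is a minimal sketch that loads one or more JSON files and verifies that every sample carries the three string fields. The helper names are hypothetical and the demo samples are made up; this is not the repository's format unification script.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("instruction", "input", "output")


def load_instruction_file(path):
    """Load one JSON file and verify each sample has the three string fields."""
    samples = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(samples, list):
        raise ValueError(f"{path}: expected a JSON list of samples")
    for idx, sample in enumerate(samples):
        if not isinstance(sample, dict):
            raise ValueError(f"{path}: sample {idx} is not a JSON object")
        for key in REQUIRED_KEYS:
            if not isinstance(sample.get(key), str):
                raise ValueError(f"{path}: sample {idx} lacks string field {key!r}")
    return samples


def load_instruction_files(paths):
    """Concatenate the samples of several files, preserving order."""
    merged = []
    for path in paths:
        merged.extend(load_instruction_file(path))
    return merged


# Demo with two throwaway files (the sample contents are made up).
with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "a.json"
    second = Path(tmp) / "b.json"
    first.write_text(json.dumps([
        {"instruction": "Translate to French.", "input": "Good morning", "output": "Bonjour"}
    ]), encoding="utf-8")
    second.write_text(json.dumps([
        {"instruction": "Name a prime number.", "input": "", "output": "7"}
    ]), encoding="utf-8")
    merged = load_instruction_files([first, second])

print(len(merged), "samples loaded")
```

An empty "input" is accepted, matching the "(may be empty)" remark in the format description.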
4. Unified open source platform with multiple interfaces
Environment configuration
pip install -r requirements.txt
Model fine-tuning
To help researchers conduct systematic IFT research on LLMs, we collected different types of instruction data, integrated a variety of LLMs, and unified the interface so that the desired combination can be customized easily:
- --model_type: Set the LLM you want to study. Currently supports [llama, chatglm and bloom]; the latter two have strong Chinese capabilities, and more LLMs will be integrated in the future.
- --data: Set the data type used for IFT to flexibly tailor the desired instruction-following ability. For example, set alpaca-cot for stronger reasoning ability, belle1.5m for stronger Chinese ability, gpt4all for stronger coding and story-writing ability, and finance for finance-related responses.
- --model_name_or_path: Set according to --model_type to load the different versions of the model weights for the target LLM. For example, set decapoda-research/llama-13b-hf to load the 13b weights of llama.
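The three options above can be pictured with a small argparse sketch. This is an illustrative parser written for this README, not the actual argument handling in uniform_finetune.py; the default values and the `build_parser` name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser mirroring the three documented options."""
    parser = argparse.ArgumentParser(description="Illustrative IFT launcher options")
    # The three model families named in this section.
    parser.add_argument("--model_type", choices=["llama", "chatglm", "bloom"],
                        default="llama")
    # One or more dataset names or file paths (defaults are assumed).
    parser.add_argument("--data", nargs="+", default=["alpaca-cot"])
    # Hugging Face model id or local path to the weights.
    parser.add_argument("--model_name_or_path",
                        default="decapoda-research/llama-7b-hf")
    return parser


# Parse an explicit argument list so the sketch does not depend on sys.argv.
args = build_parser().parse_args([
    "--model_type", "llama",
    "--model_name_or_path", "decapoda-research/llama-13b-hf",
    "--data", "alpaca-cot",
])
print(args.model_type, args.model_name_or_path, args.data)
```

Using `nargs="+"` for `--data` is one simple way to accept several datasets in a single flag.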
Single card
- for LLaMA
python3 uniform_finetune.py --model_type llama --model_name_or_path decapoda-research/llama-7b-hf \
--data alpaca-belle-cot --lora_target_modules q_proj v_proj \
--per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1
Note: to train on multiple datasets, pass several paths to --data, e.g. --data ./data/alpaca.json ./data/finance.json <path2yourdata_1>
- for ChatGLM
python3 uniform_finetune.py --model_type chatglm --model_name_or_path THUDM/chatglm-6b \
--data alpaca-belle-cot --lora_target_modules query_key_value \
--lora_r 32 --lora_alpha 32 --lora_dropout 0.1 --per_gpu_train_batch_size 2 \
--learning_rate 2e-5 --epochs 1
Note that load_in_8bit is not yet supported for ChatGLM, so the batch size must be much smaller than for the other models.
- for BLOOM
python3 uniform_finetune.py --model_type bloom --model_name_or_path bigscience/bloomz-7b1-mt \
--data alpaca-belle-cot --lora_target_modules query_key_value \
--per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1
Note that you can also pass a local path (where the LLM weights are saved) to --model_name_or_path, and the data type --data can be freely set according to your interests.
Multiple cards
- for LLaMA
python3 -m torch.distributed.launch --nproc_per_node 4 \
--nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy uniform_finetune.py \
--model_type llama --model_name_or_path decapoda-research/llama-7b-hf \
--data alpaca-belle-cot --lora_target_modules q_proj v_proj \
--per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1
- for ChatGLM
python3 -m torch.distributed.launch --nproc_per_node 4 \
--nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy \
uniform_finetune.py --model_type chatglm --model_name_or_path THUDM/chatglm-6b \
--data alpaca-belle-cot --lora_target_modules query_key_value \
--lora_r 32 --lora_alpha 32 --lora_dropout 0.1 --per_gpu_train_batch_size 2 \
--learning_rate 2e-5 --epochs 1
Note that load_in_8bit is not yet supported for ChatGLM, so the batch size must be much smaller than for the other models.
- for BLOOM
python3 -m torch.distributed.launch --nproc_per_node 4 \
--nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy \
uniform_finetune.py --model_type bloom --model_name_or_path bigscience/bloomz-7b1-mt \
--data alpaca-belle-cot --lora_target_modules query_key_value \
--per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1
Inference
python3 generate.py --data alpaca-belle-cot --model_type llama
python3 generate.py --data alpaca-belle-cot --model_type chatglm
python3 generate.py --data alpaca-belle-cot --model_type bloom
Note that the saved-xxx7b folder is where the LoRA weights are saved, and the LLaMA weights will be automatically downloaded from Hugging Face during script execution.
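Generation-related hyperparameter settings
top_p=0.9,
# Moderately raise the nucleus-sampling probability threshold to enlarge the candidate set and increase generation diversity.
temperature=1.0,
# A temperature that is too low severely polarizes the token probability distribution, so the generation strategy degenerates into greedy decoding.
do_sample=True,
# do_sample is off by default. When off, generation keeps the beam-search decoding strategy; when on, it uses beam-search multinomial sampling.
no_repeat_ngram_size=6,
# Sets to 0 the probability of any next token that would repeat an n-gram, so that no n-gram appears twice. Too small a value suppresses legitimate repetition and hurts fluency; too large a value has no effect.
repetition_penalty=1.8,
# For tokens that have already appeared, the penalty factor repetition_penalty lowers their probability in subsequent predictions.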
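To give a feel for what temperature, top_p and repetition_penalty do to a next-token distribution, here is a small pure-Python sketch over a toy four-token vocabulary. It is a simplified illustration under assumed logits, not the decoding code used by generate.py; the penalty rule shown (divide positive logits, multiply negative ones) is one common formulation.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def apply_repetition_penalty(logits, seen_ids, penalty):
    """Lower the logits of tokens that were already generated."""
    out = list(logits)
    for i in set(seen_ids):
        out[i] = out[i] / penalty if out[i] > 0 else out[i] * penalty
    return out


def top_p_candidates(probs, top_p):
    """Return the smallest set of token ids whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    return kept


# Toy logits for a four-token vocabulary (made-up values).
logits = [2.0, 1.0, 0.0, -1.0]
penalized = apply_repetition_penalty(logits, seen_ids=[0, 3], penalty=1.8)
probs = softmax(penalized, temperature=1.0)
print("candidates kept by top_p=0.9:", top_p_candidates(probs, top_p=0.9))
print("top-1 probability at temperature 0.1:", softmax(logits, temperature=0.1)[0])
```

With a very low temperature almost all probability mass lands on the top token, which is the degeneration toward greedy decoding that the temperature comment describes.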
Parameter Merging
python3 merge.py --model_type llama --size 7b --lora_dir xxx --merged_dir yyy
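Conceptually, merging folds the low-rank LoRA update into the frozen base weights so that inference no longer needs a separate adapter. The sketch below shows the standard LoRA merge rule W' = W + (alpha / r) * B A on tiny hand-made matrices; it illustrates the idea only and is not the implementation in merge.py. All matrices and helper names are assumptions made for the example.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def matvec(m, v):
    """Multiply a matrix by a vector."""
    return [sum(row[k] * v[k] for k in range(len(v))) for row in m]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    """Return W + (alpha / r) * B @ A as a new matrix."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Tiny made-up example: 2 x 2 base weight and a rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]       # shape (r, d_in)
B = [[0.5], [-1.0]]    # shape (d_out, r)
alpha, r = 2, 1

merged = merge_lora(W, A, B, alpha, r)
x = [1.0, 1.0]

# The merged matrix gives the same output as base path plus adapter path.
adapter_out = [(alpha / r) * v for v in matvec(B, matvec(A, x))]
unmerged_out = [b + a for b, a in zip(matvec(W, x), adapter_out)]
print(merged, matvec(merged, x), unmerged_out)
```

After merging, the adapter matrices can be discarded, since the merged weight alone reproduces the adapted model's outputs.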
Local use
python3 server.py --model_type chatglm --lora_dir xxx
Batch prediction
python3 predict.py --model_type chatglm --data for_dict_data --lora_dir xxx --result_dir yyy
Web service
python3 web.py --model_type chatglm --lora_dir xxx
5. Quantitative Analysis
Note: The figure below shows the statistics of the datasets collected as of 3.26 and is displayed only as motivation. More datasets have been collected since, such as finance-related instruction datasets. The instruction-finetuning data collection at that point mainly included the following three parts:
- alpaca_data_cleaned.json: about 52K English instruction-following training samples.
- CoT_data.json: 9 CoT datasets involving about 75k samples. (Related CoT datasets released by FLAN [7])
- belle_data_cn.json: about 0.5M Chinese instruction-following training samples. (Related Chinese instruction data released by BELLE [8])
About the ablation of CoT and Chinese Instructions
"w/o CoT" and "w/o CN" indicate that CoT data and Chinese instructions are not used during instruction-finetuning, respectively.
Performance on problems requiring reasoning skills
Performance on questions that require following instructions in Chinese
Performance on more complex problems
In summary, models fine-tuned on our complete dataset (English, Chinese, and CoT instruction data) show significantly improved reasoning and Chinese instruction-following abilities.
Show more abilities
References
[1]: LLaMA: Open and Efficient Foundation Language Models
[2]: Stanford Alpaca: An Instruction-following LLaMA model
[3]: Self-Instruct: Aligning Language Model with Self Generated Instructions
[4]: LoRA: Low-Rank Adaptation of Large Language Models
[5]: ChatGLM: An Open Bilingual Dialogue Language Model
[6]: BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
[7]: FLAN: Scaling Instruction-Finetuned Language Models
[8]: BELLE: Bloom-Enhanced Large Language model Engine
[9]: GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo
Citation
Please cite the repo if you use the data collection, code, and experimental findings in this repo.
@misc{alpaca-cot,
author = {Qingyi Si, Tong Wang, Naibin Gu, Rui Liu, Zheng Lin },
school = {Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China},
title = {Alpaca-CoT: An Instruction Fine-Tuning Platform with Instruction Data Collection and Unified Large Language Models Interface},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/PhoebusSi/alpaca-CoT}},
}