Customize your own ChatGPT: a lightweight LLM-IFT platform with multiple interfaces (Alpaca-CoT)


This is the repository of the Alpaca-CoT project, which aims to build a multi-interface unified lightweight instruction fine-tuning (IFT) platform with a wide range of instruction collections (especially CoT datasets), a unified interface for various large language models, and various parameter-efficient methods (e.g., LoRA, P-Tuning). We are continuously expanding our instruction-tuning data collection and integrating more LLMs. In addition, we created a new branch tabular_llm to build large language models that can handle tabular intelligence tasks.

You are welcome to contribute any uncollected instruction-tuning datasets (or their sources) to us. We will unify their format, train Alpaca models (and other LLMs in the near future) on these datasets, open-source the model checkpoints, and conduct extensive empirical research. We hope that our project can make a small contribution to the open-source process of large language models and lower the entry barrier for NLP researchers.

You can also join our WeChat group chat to communicate with fellow researchers. The group has grown too large to join directly, so new members must be invited: please scan the QR code to add me as a friend and I will pull you into the group.

News

  • 6.25: Added model evaluation code, including BELLE and MMCU.
  • 5.5: Created a new branch tabular_llm to build large language models that can handle a variety of tabular intelligence tasks.
  • 5.4: Integrated all parameter-efficient methods in PEFT (such as P-tuning), which can be easily set through hyperparameters.
  • 5.4: The LLM MOSS has been integrated.
  • 4.21: Collected and uniformly formatted the datasets GAOKAO, camel, FLAN-Muffin and COIG.
  • 4.15: Collected and uniformly formatted the datasets webGPT, dolly, baize, hh-rlhf and OIG (part).
  • 4.12: Now you can experience Alpaca-CoT in Google Colab.
  • 4.11: Added the multi-turn dialogue feature, thanks to @paulcx.
  • 4.9: Collected and uniformly formatted the datasets firefly, instruct and Code Alpaca here.
  • 4.7: Added the parameter merging, local usage, batch prediction and web service features, thanks to @weberrr.
  • 4.4: Collected and uniformly formatted the datasets FastChat, GPTeacher, Guanaco, HC3, prosocial-dialog, belle-chat&belle-math, xP3 and natural-instructions.
  • 4.3: The Chinese CoT dataset CoT_CN_data.json has been uploaded here.
  • 4.1: The checkpoint of Bloom7b fine-tuned on instinwild-CN(47k) + belle(1.5M) has been uploaded here.
  • 4.1: instinwild (collected from Twitter; mainly generation, open QA and brainstorming types) has been uniformly formatted and collected.

0. The technology behind ChatGPT

LLM (Large Language Model): a language model with a very large number of parameters that has undergone large-scale pre-training, generally a Transformer-based model.

IFT (Instruction Fine-Tuning): fine-tuning on instructions. An instruction is an input text with a clear purpose provided by the user, and instruction fine-tuning teaches the model to follow the user's instructions.

CoT (Chain-of-Thought): a special form of instruction that includes a step-by-step reasoning process, as shown in the blue part of the figure below.

[Figure: a Chain-of-Thought example]

1. Positioning

The emergence of ChatGPT has verified the potential of large language models (LLMs) for artificial general intelligence (AGI). Instruction-tuning research (e.g., Alpaca [2]) based on LLMs such as LLaMA [1] has greatly accelerated the process of reproducing ChatGPT. Alpaca-CoT hopes to make a moderate contribution in this research direction, to promote the open-source process of LLMs and to reduce the cost of LLM research and use.

Specifically, the Alpaca-CoT project aims to explore how to better induce ChatGPT-like interaction and instruction-following capabilities in LLMs through instruction-tuning. To this end, we extensively collected different types of instructions (especially Chain-of-Thought datasets) and conducted an in-depth, detailed empirical study based on LLaMA for future work to reference. To the best of our knowledge, we are the first work to extend CoT to Alpaca, hence the name "Alpaca-CoT".

You are warmly welcome to provide us with any instruction-tuning datasets for various tasks (or their sources) not yet collected by this project. We will:

  • Collect and format these data in a unified manner;
  • Use these datasets to instruct fine-tune LLaMA models (more LLMs will be integrated in the future), and open source their checkpoints;
  • Conduct extensive empirical research to explore the effect of the newly included datasets.

We hope that our project can make a modest contribution to the open source process of large language models, and lower the threshold for NLP researchers to get started with LLM related research.

2. Overview

Recently, LLaMA [1] has shown amazing zero-shot and few-shot capabilities with far fewer parameters than comparable models (LLaMA-13B significantly outperforms GPT-3 (175B), and LLaMA-65B is comparable to PaLM-540B), which significantly reduces the cost of training, fine-tuning and using competitive large language models. To improve LLaMA's instruction-following ability, Stanford Alpaca [2] fine-tuned LLaMA on 52K English instruction-finetuning samples generated by self-instruct [3]. However, current research in this direction still faces the following three challenges:

  • Even LLaMA-7b still has high computing-resource requirements;
  • There are few open-source datasets for instruction fine-tuning;
  • There is a lack of empirical research on the impact of each instruction type, such as the ability to respond in Chinese and the CoT ability.

To this end, we propose the Alpaca-CoT project, which combines relevant recent cutting-edge technologies and has the following advantages:

    1. The fine-tuning of LLaMA can be completed efficiently with only low computing resources. The 7b, 13b and 30b versions of the LLaMA models can be easily trained on a single 80G A100 card. This advantage mainly comes from technologies such as low-rank adaptation (LoRA) [4], PEFT and bitsandbytes. Our code is largely adapted from here.
    2. The models we released significantly improve CoT (reasoning) ability.
    3. The models we released significantly improve the ability to follow Chinese instructions.
    4. We maintain a collection of instruction-finetuning datasets that is still expanding in size. The collection contains instruction data in Chinese, English and CoT form. We also maintain a collection of model checkpoints trained on the various instruction datasets.
    5. Multiple LLMs are integrated behind a unified calling interface that can be easily switched through hyperparameters. It currently includes LLaMA, ChatGLM [5] and Bloom [6], with more to be added, so that researchers can easily call and compare different LLMs.
    6. We provide a detailed and thorough empirical study and qualitative analysis, whose findings may have reference value for future LLM exploration.

3. Data Collection

The relative sizes of the collected datasets are shown in the figure below:

[Figure: relative sizes of the collected datasets]

Following here (@yaodongC), we mark the collected datasets with tags according to the following rules:

(Lang)Lingual-Tags:

  • EN: Instruction datasets in English
  • CN: Instruction datasets in Chinese
  • ML: [Multi-lingual] Instruction datasets in multiple languages

(Task)Task-Tags:

  • MT: [Multi-task] Datasets containing multiple tasks
  • TS: [Task-specific] Datasets tailored for specific tasks

(Gen)Generation-method:

  • HG: [Human Generated Dataset] Datasets created by humans
  • SI: [Self-Instruct] Datasets generated using self-instruct methods
  • MIX: [Mixed Dataset] Dataset contains both human and machine generated data
  • COL: [Collection of Dataset] Dataset made from a collection of other datasets
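
To make the taxonomy concrete, the snippet below sketches one hypothetical way to record and query these tags in Python. The dictionary and the filter_datasets helper are purely illustrative and are not part of the repository.

# Illustrative only: a hypothetical way to record the (Lang, Task, Gen) tags for
# a few of the collected datasets; the repository may store this metadata differently.
DATASET_TAGS = {
    "Chain-of-Thought": {"lang": "EN/CN", "task": "MT",    "gen": "HG"},
    "GPT4all":          {"lang": "EN",    "task": "MT",    "gen": "COL"},
    "alpaca":           {"lang": "EN",    "task": "MT",    "gen": "SI"},
    "belle_cn":         {"lang": "CN",    "task": "TS/MT", "gen": "SI"},
}

def filter_datasets(lang=None, gen=None):
    """Return dataset names whose tags match the given filters."""
    return [
        name for name, tags in DATASET_TAGS.items()
        if (lang is None or lang in tags["lang"].split("/"))
        and (gen is None or gen == tags["gen"])
    ]

print(filter_datasets(lang="CN"))   # ['Chain-of-Thought', 'belle_cn']
print(filter_datasets(gen="SI"))    # ['alpaca', 'belle_cn']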

Statistics

| Dataset | Number | Lang | Task | Gen | Type | Source | Link |
|---|---|---|---|---|---|---|---|
| Chain of Thought | 74771 | EN/CN | MT | HG | CoT-related tasks | humans annotate CoT on existing datasets | download |
| GPT4all | 806199 | EN | MT | COL | code, story, dialogue | distillation from GPT-3.5-turbo | download |
| GPTeacher | 29013 | EN | MT | SI | general, role-playing, tool instructions | GPT-4 & toolformer | download |
| Guanaco | 534610 | ML | MT | SI | various NLP tasks | text-davinci-003 | download |
| HC3 | 37175 | EN/CN | TS | MIX | dialogue evaluation | gpt-3.5 or human | download |
| alpaca | 52002 | EN | MT | SI | general instructions | text-davinci-003 | download |
| Natural Instructions | 5040134 | ML | MT | COL | various NLP tasks | collection of human-annotated datasets | download |
| belle_cn | 1079517 | CN | TS/MT | SI | general instructions, mathematical reasoning, dialogue | text-davinci-003 | download |
| instinwild | 52191 | EN/CN | MT | SI | generation, open-domain QA, brainstorming | text-davinci-003 | download |
| prosocial dialog | 165681 | EN | TS | MIX | dialogue | GPT-3 rewrites questions, humans answer | download |
| finance_en | 68912 | EN | TS | COL | financial-domain QA | GPT-3.5 | download |
| xP3 | 78883588 | ML | MT | COL | various NLP tasks | collection of human-annotated datasets | download |
| firefly | 1649398 | CN | MT | COL | 23 NLP tasks | collected Chinese datasets with manually written instruction templates | download |
| instruct | 888969 | EN | MT | COL | augmentation of GPT4All, Alpaca and open-source datasets | augmented with the NLP tools provided by AllenAI | download |
| Code Alpaca | 20022 | EN | TS | SI | code generation, editing, optimization | text-davinci-003 | download |
| Alpaca_GPT4 | 52002 | EN/CN | MT | SI | general instructions | Alpaca data generated by GPT-4 | download |
| webGPT | 18994 | EN | TS | MIX | information-retrieval QA | fine-tuned GPT-3 + human evaluation | download |
| dolly 2.0 | 15015 | EN | TS | HG | open and closed QA, information extraction, summarization, open-ended ideation, classification, creative writing | human annotation | download |
| baize | 653699 | EN | MT | COL | Alpaca and various QA tasks | collection of human-annotated datasets | download |
| hh-rlhf | 284517 | EN | TS | MIX | dialogue | RLHF models | download |
| OIG (part) | 49237 | EN | MT | COL | various NLP tasks | collection and data augmentation of human-annotated datasets | download |
| GAOKAO | 2785 | CN | MT | COL | multiple-choice, fill-in-the-blank and other questions from the Chinese college entrance exam | collection of human-annotated datasets | download |
| camel | 760620 | EN | MT | SI | role-playing dialogues in physics, biology, chemistry, programming, mathematics, society, etc. | generated by gpt-3.5-turbo | download |
| FLAN-Muffin | 1764800 | EN | MT | COL | 60 NLP tasks | collection of human-annotated datasets | download |
| COIG | 298428 | CN | MT | COL | exams, translation, value-alignment instruction data, knowledge-graph-based counterfactual dialogue | automated tools + manual verification | download |
| GPT4Tools | 71446 | EN | MT | SI | a collection of tool-related instructions | gpt-3.5-turbo | download |
| ShareChat | 1663241 | EN | MT | MIX | general instructions | collected from ShareGPT | download |
| Auto CoT | | EN | | | | | download |
| MOSS | 1583595 | EN/CN | | SI | | | download |
| ultrachat | 28247446 | EN | | | | | download |
| StackLLaMA | todo | EN | | | | | |

The collection is still being updated and expanded. More data details can be downloaded and viewed at the following link: https://github.com/PhoebusSi/alpaca-CoT/tree/main/data

Download

You can download all the formatted data that we have standardized here . Then, put all the downloaded files into the data  folder.

You can download all checkpoints trained on the various types of instruction data here. Then set LoRA_Weights in generate.py to the download path, and you can directly run the model's inference to view its output.

Data Format

All the data in our collection has been transformed into the same format, and each sample has the following format:

[
{"instruction": instruction string,
"input": input string, # (may be empty)
"output": output string}
]

Note that for the CoT datasets, we first use the template provided by FLAN to convert the original data into Chain-of-Thought form, and then unify it into the above format. The format-unification script can be found here.
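
For illustration, the sketch below shows how such a conversion could look. It is not the repository's actual script: the raw field names and the FLAN-style template are assumptions made for this example only.

import json

# A minimal sketch of converting a raw CoT sample into the unified
# {"instruction", "input", "output"} format (field names and template assumed).
COT_TEMPLATE = "{question}\nLet's think step by step."

def to_unified_format(raw_sample):
    return {
        "instruction": COT_TEMPLATE.format(question=raw_sample["question"]),
        "input": "",  # the whole prompt is carried in the instruction field
        "output": raw_sample["chain_of_thought"] + " So the answer is " + raw_sample["answer"] + ".",
    }

raw = {
    "question": "There are 3 cars and each car has 4 wheels. How many wheels are there?",
    "chain_of_thought": "Each car has 4 wheels and there are 3 cars, so 3 * 4 = 12.",
    "answer": "12",
}

with open("CoT_data_example.json", "w", encoding="utf-8") as f:
    json.dump([to_unified_format(raw)], f, ensure_ascii=False, indent=2)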

4. Unified open source platform with multiple interfaces

Environment configuration

pip install -r requirements.txt

Model fine-tuning

To make it easy for researchers to conduct systematic IFT research on LLMs, we collected different types of instruction data, integrated multiple LLMs, and unified the interface so that the desired combination can be easily customized through the following hyperparameters (a minimal sketch of this dispatch is shown right after the list):

  • --model_type: the LLM you want to study; currently supports [llama, chatglm, bloom] (the latter two have strong Chinese ability), and more LLMs will be integrated in the future.
  • --data: the instruction data used for IFT, so that the desired instruction-following ability can be flexibly tailored; for example, set alpaca-cot for stronger reasoning ability, belle1.5m for stronger Chinese ability, gpt4all for stronger coding and story-writing ability, and finance for financial-domain responsiveness.
  • --model_name_or_path: the model weights to load for the chosen --model_type; for example, set decapoda-research/llama-13b-hf to load the 13b weights of llama.
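
The sketch below illustrates how such a unified interface can dispatch on --model_type, mapping it to the LoRA target modules used in the example commands that follow. The helper name and the default r/alpha/dropout values are illustrative assumptions, not the exact code of uniform_finetune.py.

# A minimal sketch of dispatching on --model_type; the real logic lives in
# uniform_finetune.py and may differ in detail.
from peft import LoraConfig, TaskType

DEFAULT_LORA_TARGETS = {
    "llama": ["q_proj", "v_proj"],
    "chatglm": ["query_key_value"],
    "bloom": ["query_key_value"],
}

def build_lora_config(model_type, r=8, alpha=16, dropout=0.05):
    # Build a LoRA config whose target modules match the chosen --model_type.
    return LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=r,
        lora_alpha=alpha,
        lora_dropout=dropout,
        target_modules=DEFAULT_LORA_TARGETS[model_type],
    )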

Single GPU

  • for LLaMA
python3 uniform_finetune.py --model_type llama --model_name_or_path decapoda-research/llama-7b-hf \
    --data alpaca-belle-cot --lora_target_modules q_proj v_proj \
    --per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1 
    
  • for ChatGLM (note: for multiple datasets, you can pass them to --data like --data ./data/alpaca.json ./data/finance.json <path2yourdata_1>)
python3 uniform_finetune.py   --model_type chatglm --model_name_or_path THUDM/chatglm-6b \
    --data alpaca-belle-cot --lora_target_modules query_key_value \
    --lora_r 32 --lora_alpha 32 --lora_dropout 0.1 --per_gpu_train_batch_size 2 \
    --learning_rate 2e-5 --epochs 1

Note that load_in_8bit is not yet suitable for ChatGLM, so its batch_size must be much smaller than for the other models.

  • for BLOOM
python3 uniform_finetune.py   --model_type bloom --model_name_or_path bigscience/bloomz-7b1-mt \
    --data alpaca-belle-cot --lora_target_modules query_key_value \
    --per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1 

Note that you can also pass a local path (where the LLM weights are saved) to --model_name_or_path, and the data type --data can be freely set according to your interests.

Multiple GPUs

  • for LLaMA
python3 -m torch.distributed.launch --nproc_per_node 4  \
    --nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy uniform_finetune.py \
    --model_type llama --model_name_or_path decapoda-research/llama-7b-hf \
    --data alpaca-belle-cot --lora_target_modules q_proj v_proj \
    --per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1 
  • for ChatGLM
python3 -m torch.distributed.launch --nproc_per_node 4  \
    --nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy \
    uniform_finetune.py   --model_type chatglm --model_name_or_path THUDM/chatglm-6b \
    --data alpaca-belle-cot --lora_target_modules query_key_value \
    --lora_r 32 --lora_alpha 32 --lora_dropout 0.1 --per_gpu_train_batch_size 2 \
    --learning_rate 2e-5 --epochs 1

Note that load_in_8bit is not yet suitable for ChatGLM, so its batch_size must be much smaller than for the other models.

  • for BLOOM
python3 -m torch.distributed.launch --nproc_per_node 4  \
    --nnodes=1 --node_rank=0 --master_addr=xxx --master_port=yyy \
    uniform_finetune.py   --model_type bloom --model_name_or_path bigscience/bloomz-7b1-mt \
    --data alpaca-belle-cot --lora_target_modules query_key_value \
    --per_gpu_train_batch_size 4 --learning_rate 3e-4 --epochs 1  

Inference

python3 generate.py  --data alpaca-belle-cot --model_type llama

python3 generate.py  --data alpaca-belle-cot --model_type chatglm

python3 generate.py  --data alpaca-belle-cot --model_type bloom

Note that the saved-xxx7b folder is the path where the LoRA weights are saved; the LLaMA weights are automatically downloaded from Hugging Face during script execution.
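
For reference, the following is a minimal sketch of what this inference step amounts to: load the base LLaMA weights, attach the trained LoRA weights, and generate a reply. The checkpoint path, prompt template and dtype choices here are illustrative and may differ from generate.py.

import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

BASE_MODEL = "decapoda-research/llama-7b-hf"
LORA_WEIGHTS = "./saved-llama7b"  # hypothetical LoRA checkpoint folder

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
model = LlamaForCausalLM.from_pretrained(BASE_MODEL, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, LORA_WEIGHTS, torch_dtype=torch.float16)
model.eval()

prompt = ("Below is an instruction that describes a task. "
          "Write a response that appropriately completes the request.\n\n"
          "### Instruction:\nWhat is instruction fine-tuning?\n\n### Response:\n")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))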

Generation hyperparameter settings

top_p=0.9, 
        # Moderately raise the nucleus-sampling probability threshold to enlarge the candidate set and increase generation diversity.
        
temperature=1.0, 
        # An overly low temperature makes the token probability distribution too peaked, degrading the sampling strategy into greedy decoding.
        
do_sample=True, 
        # do_sample is off by default; when off, generation keeps the beam-search decoding strategy, and when on it becomes beam-search multinomial sampling.
        
no_repeat_ngram_size=6, 
        # Setting the probability of any repeated n-gram to 0 guarantees that no 6-gram appears twice. Too small a value suppresses reasonable repetition and hurts fluency; too large a value has no effect.
        
repetition_penalty=1.8, 
        # For tokens that have already appeared, the penalty factor repetition_penalty lowers their probability in subsequent predictions.
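
These settings map directly onto a Hugging Face generate() call. Assuming model, tokenizer and inputs are prepared as in the inference sketch above, the call looks roughly like this:

output = model.generate(
    **inputs,
    do_sample=True,            # switch from beam search to sampling
    top_p=0.9,                 # nucleus-sampling probability threshold
    temperature=1.0,           # keep the distribution from becoming too peaked
    no_repeat_ngram_size=6,    # forbid any 6-gram from appearing twice
    repetition_penalty=1.8,    # down-weight tokens that already appeared
    max_new_tokens=256,
)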

Parameter Merging

python3 merge.py --model_type llama --size 7b --lora_dir xxx --merged_dir yyy
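
A minimal sketch of what merging amounts to with PEFT (merge.py may differ in detail); xxx and yyy stand for the --lora_dir and --merged_dir arguments above, and merge_and_unload requires a reasonably recent peft version.

import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

base = LlamaForCausalLM.from_pretrained("decapoda-research/llama-7b-hf", torch_dtype=torch.float16)
merged = PeftModel.from_pretrained(base, "xxx").merge_and_unload()  # fold LoRA deltas into the base weights
merged.save_pretrained("yyy")                                       # standalone weights, usable without PEFT
LlamaTokenizer.from_pretrained("decapoda-research/llama-7b-hf").save_pretrained("yyy")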

Local usage

python3 server.py --model_type chatglm --lora_dir xxx

Batch prediction

python3 predict.py --model_type chatglm --data for_dict_data --lora_dir xxx --result_dir yyy
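
A rough sketch of what batch prediction does (the exact flags and output format of predict.py may differ): read the unified-format JSON, generate a response for each sample, and write the results out.

import json

def batch_predict(model, tokenizer, data_path, result_path, max_new_tokens=256):
    # Assumes `model` and `tokenizer` are already loaded (see the inference sketch above).
    with open(data_path, encoding="utf-8") as f:
        samples = json.load(f)
    results = []
    for sample in samples:
        prompt = sample["instruction"] + ("\n" + sample["input"] if sample["input"] else "")
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=max_new_tokens)
        results.append({**sample, "prediction": tokenizer.decode(output[0], skip_special_tokens=True)})
    with open(result_path, "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)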

Web service

python3 web.py --model_type chatglm --lora_dir xxx
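
As a sketch, such a web demo can be wired up with Gradio in a few lines (web.py may use a different UI and options); model and tokenizer are assumed to be loaded as in the inference sketch above.

import gradio as gr

def chat(instruction):
    inputs = tokenizer(instruction, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(output[0], skip_special_tokens=True)

gr.Interface(fn=chat, inputs="text", outputs="text", title="Alpaca-CoT demo").launch()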

5. Quantitative Analysis

Note: the figure below shows the statistics of the datasets collected as of 3.26 and is displayed only as motivation; more datasets (e.g., financial instruction datasets) have since been collected.

[Figure: data collection statistics]

The current instruction-finetuning data collection mainly consists of the following three parts:

  • alpaca_data_cleaned.json: about 52K English instruction-following training samples.
  • CoT_data.json: 9 CoT datasets involving about 75k samples. (Related CoT datasets released by FLAN[7])
  • belle_data_cn.json: about 0.5M Chinese instruction-following training samples. (Related Chinese instruction data released by BELLE [8])

About the ablation of CoT and Chinese Instructions

"w/o CoT" and "w/o CN" indicate that CoT data and Chinese instructions are not used during instruction-finetuning, respectively.

Performance on problems requiring reasoning skills: [figure]

Performance on questions requiring Chinese instruction-following: [figure]

Performance on more complex problems: [figure]

In summary, the models finetuned from our complete dataset (English, Chinese, and CoT instruction data) can significantly improve model reasoning and Chinese instruction following abilities.

Show more abilities

[figures]

References

[1]: LLaMA: Open and Efficient Foundation Language Models

[2]: Stanford Alpaca: An Instruction-following LLaMA model

[3]: Self-Instruct: Aligning Language Model with Self Generated Instructions

[4]: LoRA: Low-Rank Adaptation of Large Language Models

[5]: ChatGLM: An Open Bilingual Dialogue Language Model

[6]: BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

[7]: FLAN: Scaling Instruction-Finetuned Language Models

[8]: BELLE: Bloom-Enhanced Large Language model Engine

[9]: GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo

Citation

Please cite the repo if you use the data collection, code, and experimental findings in this repo.

@misc{alpaca-cot,
  author = {Qingyi Si, Tong Wang, Naibin Gu, Rui Liu, Zheng Lin },
  school = {Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China},
  title = {Alpaca-CoT: An Instruction Fine-Tuning Platform with Instruction Data Collection and Unified Large Language Models Interface},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/PhoebusSi/alpaca-CoT}},
}
