Bloom & LLaMA Large Models: Pre-Training (Secondary Pre-Training)

0. Introduction

With the explosion of ChatGPT, many large models have appeared recently, such as the BLOOM series and LLaMA-based models like Ziya and Baichuan. These models are arguably more promising than ChatGLM because they are fully available for commercial use and can be iterated on. The author has recently been studying hiyouga's LLaMA-Efficient-Tuning; compared with other projects, it is very well suited for learning and getting started.

1. The purpose of the second pre-training

In recent years, a great deal of research has shown that pre-trained models (PTMs) trained on large corpora learn general language representations, which are very helpful for downstream NLP tasks and avoid training new models from scratch. With the growth of computing power, the emergence of deep architectures such as the Transformer, and continuous improvements in training techniques, PTM architectures have evolved from shallow to deep.

For large models, the basic training process can generally be divided into two stages: pre-training and fine-tuning. In the pre-training stage the model learns language knowledge: from a large amount of text data it learns how to understand and generate text. The models produced at this stage are general-purpose and can handle many kinds of text tasks. The fine-tuning stage then trains the pre-trained model on a specific task so that it can perform that task better.

However, since the pre-trained model is trained on a large-scale, general-purpose corpus, the knowledge it learns may not cover everything a specific task requires. For example, a pre-trained model may not understand medical or legal texts well enough, because such domain expertise may be under-represented in the pre-training corpus.

This is where secondary pre-training comes in. In secondary pre-training, we continue training the model on a corpus from a specific domain so that it acquires more knowledge of that domain and can complete domain-specific tasks better. Secondary pre-training can be seen as a transitional stage between pre-training and fine-tuning: it preserves the breadth of pre-training while adding domain-specific specialization.

In addition, secondary pre-training has another important role: it makes effective use of small-scale, high-quality domain data (usually still at the GB level). In many cases domain data is expensive and hard to obtain, so we need to exploit it as fully as possible. By performing secondary pre-training on a large model, we can convert the knowledge in the domain data into a performance improvement for the model.
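
To make the idea concrete, the sketch below continues pre-training a causal language model on a plain-text domain corpus with the Hugging Face Trainer. It is a minimal, generic example rather than the code of the project discussed below; the model name, file path, and hyperparameters are placeholders.

    # Minimal sketch of secondary (continued) pre-training with Hugging Face
    # Transformers; model name, data path and hyperparameters are placeholders.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    model_name = "bigscience/bloom-560m"          # placeholder base model
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    # Load a domain corpus stored as plain text, one document per line.
    raw = load_dataset("text", data_files={"train": "domain_corpus.txt"})

    block_size = 512

    def tokenize_and_group(examples):
        # Tokenize, concatenate everything, then split into fixed-length blocks;
        # the data collator below will create the causal-LM labels.
        ids = tokenizer(examples["text"])["input_ids"]
        concat = [tok for seq in ids for tok in seq]
        total = (len(concat) // block_size) * block_size
        return {"input_ids": [concat[i:i + block_size]
                              for i in range(0, total, block_size)]}

    train_ds = raw["train"].map(tokenize_and_group, batched=True,
                                remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="out_pt", num_train_epochs=1,
                               per_device_train_batch_size=2),
        train_dataset=train_ds,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )
    trainer.train()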

2. Code reading – train_pt.py

The following is the pre-training script; it mainly covers preparing the model and data, splitting the dataset, training, and evaluation.

First, the code imports the necessary modules and functions, including utility functions for preparing data, training, loading the pre-trained model, and plotting loss curves.
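
The import block itself is not included in the excerpt; judging from the names used below, it looks roughly like this (the exact layout of the repository's utility modules is an assumption):

    import math

    # Helper functions and classes from the repository's utilities
    # (the module path is assumed here, not copied from the source).
    from utils import (
        prepare_args,
        prepare_data,
        load_pretrained,
        preprocess_data,
        DynamicDataCollatorWithPadding,
        PeftTrainer,
        LogCallback,
        plot_loss,
    )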

    # Prepare pretrained model and dataset
    model_args, data_args, training_args, finetuning_args = prepare_args(stage="pt")  # Prepare the various arguments: model, data, training and fine-tuning arguments.
    dataset = prepare_data(model_args, data_args)  # Prepare the dataset.
    model, tokenizer = load_pretrained(model_args, finetuning_args, training_args.do_train, stage="pt")  # Load the pretrained model and tokenizer.
    dataset = preprocess_data(dataset, tokenizer, data_args, training_args, stage="pt")  # Preprocess the data, e.g. convert text into a format the model can understand.
    data_collator = DynamicDataCollatorWithPadding(tokenizer, data_args.ignore_pad_token_for_loss)  # Dynamically pad the data so that every sample in a batch has the same length.
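
The DynamicDataCollatorWithPadding class is not reproduced in this excerpt. Conceptually, dynamic padding pads each batch only up to the length of its longest sample, and ignore_pad_token_for_loss replaces padded label positions with -100 so they are excluded from the loss. A minimal, generic sketch of that idea (not the repository's class) is shown below.

    from typing import Any, Dict, List

    import torch

    def pad_batch(features: List[Dict[str, List[int]]], pad_token_id: int,
                  label_pad_token_id: int = -100) -> Dict[str, Any]:
        # Pad every sample only to the longest sequence in *this* batch,
        # so short batches waste no compute on unnecessary padding.
        max_len = max(len(f["input_ids"]) for f in features)
        input_ids, attention_mask, labels = [], [], []
        for f in features:
            pad = max_len - len(f["input_ids"])
            input_ids.append(f["input_ids"] + [pad_token_id] * pad)
            attention_mask.append([1] * len(f["input_ids"]) + [0] * pad)
            # Assumes labels (if present) have the same length as input_ids;
            # padded label positions are masked out of the loss with -100.
            labels.append(f.get("labels", f["input_ids"]) + [label_pad_token_id] * pad)
        return {
            "input_ids": torch.tensor(input_ids),
            "attention_mask": torch.tensor(attention_mask),
            "labels": torch.tensor(labels),
        }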

Then the dataset is split depending on whether training is performed. If training is enabled and the development-set ratio is greater than 0, the dataset is split into a training set and a development set; otherwise all data is used for training. If no training is performed, all data is used for evaluation or prediction.

    if training_args.do_train:
        if data_args.dev_ratio > 1e-6:
            dataset = dataset.train_test_split(test_size=data_args.dev_ratio)
            trainer_kwargs = {"train_dataset": dataset["train"], "eval_dataset": dataset["test"]}
        else:
            trainer_kwargs = {"train_dataset": dataset}
    else: # do_eval or do_predict
        trainer_kwargs = {"eval_dataset": dataset}

Next, a PeftTrainer object is initialized, passing in the fine-tuning arguments, the model, the training arguments, the tokenizer, the data collator, a logging callback, and the previously split dataset. We will read through this class carefully in the next section.

    trainer = PeftTrainer(
        finetuning_args=finetuning_args,
        model=model,
        args=training_args,
        tokenizer=tokenizer,
        data_collator=data_collator,
        callbacks=[LogCallback()],
        **trainer_kwargs
    )

After training, the code logs the training metrics and saves the model, the metrics, and the trainer state. If the current process is the main process (rank 0 across all processes) and plot_loss is set, the training loss and evaluation loss are plotted.

    if training_args.do_train:
        train_result = trainer.train()
        trainer.log_metrics("train", train_result.metrics)
        trainer.save_metrics("train", train_result.metrics)
        trainer.save_state()
        trainer.save_model()
        if trainer.is_world_process_zero() and model_args.plot_loss:
            plot_loss(training_args.output_dir, keys=["loss", "eval_loss"])
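
The plot_loss helper lives in the repository's utilities and its body is not shown here. Purely as an illustration, a similar helper could read the log_history that trainer.save_state() writes to trainer_state.json and plot the requested keys; the sketch below is a generic stand-in, not the repository's implementation.

    import json
    import os

    import matplotlib.pyplot as plt

    def plot_loss_sketch(output_dir: str, keys=("loss", "eval_loss")) -> None:
        # trainer.save_state() writes the log history into trainer_state.json.
        with open(os.path.join(output_dir, "trainer_state.json")) as f:
            log_history = json.load(f)["log_history"]
        for key in keys:
            steps = [e["step"] for e in log_history if key in e]
            values = [e[key] for e in log_history if key in e]
            if not steps:
                continue
            plt.figure()
            plt.plot(steps, values)
            plt.xlabel("step")
            plt.ylabel(key)
            plt.title(f"training {key}")
            plt.savefig(os.path.join(output_dir, f"{key}.png"))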

After evaluation, the code calculates the model's perplexity (a commonly used evaluation metric for language models) and records the evaluation results.

    if training_args.do_eval:
        metrics = trainer.evaluate(metric_key_prefix="eval")

        try:
            perplexity = math.exp(metrics["eval_loss"])
        except OverflowError:
            perplexity = float("inf")
        metrics["perplexity"] = perplexity

        trainer.log_metrics("eval", metrics)
        trainer.save_metrics("eval", metrics)

3. Pre-training processing – peft_trainer.py

This code defines two classes, LogCallback and PeftTrainer. The LogCallback class logs information during training, and the PeftTrainer class is a custom trainer that supports saving checkpoints for parameter-efficient fine-tuning.

3.1 LogCallback class

First, let's take a look at the LogCallback class, which inherits from TrainerCallback and is mainly used to record information during training, such as the loss, learning rate, training epoch, current progress percentage, and estimated remaining time. This information is written to the file "trainer_log.jsonl".

First comes the __init__ function, which is executed when an instance of the class is created. It records the creation timestamp as the start time of training.

    def __init__(self):
        self.start_time = time.time()

Next is the on_log method, which is called whenever the training process needs to write a log entry. It receives the parameters args (the training arguments), state (the current training state), control (an object used to control the training process), and any additional keyword arguments.

    def on_log(self, args: TrainingArguments, state: TrainerState, control: TrainerControl, **kwargs) -> None:

In the method body, we first check whether the last log record contains the key "loss". If it does not, we return immediately; the point is to only act on log events that actually contain a training loss.

if "loss" not in state.log_history[-1]:
            return

Next, the total time elapsed since the start of training (in seconds) and the average time per training step so far are computed; cur_steps here is the current step count, taken from the trainer state.

    cur_time = time.time()
    elapsed_time = cur_time - self.start_time
    avg_time_per_step = elapsed_time / cur_steps if cur_steps != 0 else 0

The remaining training time is then estimated from the average step time and the number of steps left.

    remaining_steps = state.max_steps - cur_steps
    remaining_time = remaining_steps * avg_time_per_step

Then this information (current step count, total step count, loss, reward, learning rate, training epoch, completion percentage, elapsed time, and estimated remaining time) is collected into a dictionary.

    log_dict = {
        "current_steps": cur_steps,
        "total_steps": state.max_steps,
        "loss": state.log_history[-1].get("loss", None),
        "reward": state.log_history[-1].get("reward", None),
        "learning_rate": state.log_history[-1].get("learning_rate", None),
        "epoch": state.log_history[-1].get("epoch", None),
        "percentage": round(cur_steps / state.max_steps * 100, 2) if state.max_steps != 0 else 100,
        "elapsed_time": str(timedelta(seconds=int(elapsed_time))),
        "remaining_time": str(timedelta(seconds=int(remaining_time)))
    }

If the output directory does not exist, it is created, and then the dictionary above is appended in JSON format to a file named "trainer_log.jsonl". Each line of this file is a JSON object recording one log event.

    os.makedirs(args.output_dir, exist_ok=True)
    with open(os.path.join(args.output_dir, "trainer_log.jsonl"), "a") as f:
        f.write(json.dumps(log_dict) + "\n")
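
Because each line of trainer_log.jsonl is a self-contained JSON object, training progress is easy to inspect from outside the training process. As a small illustration (not part of the repository), the snippet below prints a progress summary; the path is a placeholder and the field names follow the log_dict shown above.

    import json

    # Print a compact progress summary from trainer_log.jsonl
    # ("output_dir" is a placeholder path; field names match log_dict above).
    with open("output_dir/trainer_log.jsonl") as f:
        for line in f:
            record = json.loads(line)
            print(f"step {record['current_steps']}/{record['total_steps']} "
                  f"({record['percentage']}%) loss={record['loss']} "
                  f"eta={record['remaining_time']}")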

3.2 PeftTrainer class (same approach as for ChatGLM)

The PeftTrainer class inherits from Seq2SeqTrainer and is designed to handle sequence-to-sequence models. Its constructor receives a FinetuningArguments object containing the parameters for the fine-tuning process.

First comes the __init__ function, which is executed when an instance of the class is created. It first calls the parent class constructor and stores the fine-tuning arguments. Then, if the current process is the main process (rank 0) and a log file already exists in the output directory, that file is deleted. (This does not conflict with the logging described above, because LogCallback is passed to the trainer as a callback in the main function.)
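
The constructor itself is cut off in this excerpt; based on the description above, it looks roughly like the sketch below (the FinetuningArguments type and the deletion of an old log file come from the surrounding text, while the exact log file name and attribute names are assumptions).

    import logging
    import os

    from transformers import Seq2SeqTrainer

    logger = logging.getLogger(__name__)

    class PeftTrainer(Seq2SeqTrainer):
        r"""Sketch of a trainer for parameter-efficient fine-tuning, based on the description above."""

        def __init__(self, finetuning_args, **kwargs):
            super().__init__(**kwargs)              # initialize the underlying Seq2SeqTrainer
            self.finetuning_args = finetuning_args  # keep the FinetuningArguments object
            # On the main process, remove a stale training log left over from a
            # previous run ("trainer_log.jsonl" is an assumed file name).
            log_path = os.path.join(self.args.output_dir, "trainer_log.jsonl")
            if self.is_world_process_zero() and os.path.exists(log_path):
                logger.warning("Previous log file in this folder will be deleted.")
                os.remove(log_path)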

…For details, please refer to Gu Yueju


Origin blog.csdn.net/lovely_yoshino/article/details/131303899