A comprehensive replacement for Llama 2! Baichuan 2 reveals the most complete training details to date

New Zhiyuan | Published in Beijing on 2023-09-15 13:12

New Zhiyuan Report

Editor: Editorial Department

[New Zhiyuan Introduction] Baichuan Intelligent has officially released the open-source Baichuan 2 series of large models. As the best-performing Chinese model in the open-source field, Baichuan 2 is well positioned to replace Llama 2 in China.

In China, the Llama era has passed.

On September 6, Baichuan Intelligent announced the official open-sourcing of the Baichuan 2 series of large models, including 7B and 13B Base and Chat versions, along with a 4-bit quantized version of the Chat models, all free for commercial use.

Download link: https://github.com/baichuan-inc/Baichuan2
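
For readers who want to try the models right away, the weights are also published on Hugging Face. Below is a minimal loading sketch following the usual transformers pattern and the chat() helper documented in the project's README; treat the exact model IDs and arguments as assumptions to verify against the repository.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation.utils import GenerationConfig

# Model ID assumed from the Baichuan2 repo; a 4-bit variant ("...-Chat-4bits") is also published.
model_id = "baichuan-inc/Baichuan2-13B-Chat"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True)
model.generation_config = GenerationConfig.from_pretrained(model_id)

# The repo ships a chat() convenience method with the custom model code.
messages = [{"role": "user", "content": "解释一下“温故而知新”"}]
print(model.chat(tokenizer, messages))
```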

On all mainstream Chinese and English general leaderboards, Baichuan 2 is ahead of Llama 2, and Baichuan2-13B beats all open-source models of the same size. It is no exaggeration to say that Baichuan2-13B is currently the best-performing Chinese open-source model at its scale.

Over the past month, the Baichuan series has been downloaded more than 3.47 million times from open-source communities such as Hugging Face, making it the most-downloaded open-source large model of the month, with total downloads exceeding 5 million.


Llama 2, no longer needed

By comparison, the once-hot foreign favorite Llama 2 can now be waved goodbye.

After the thousand-model war, large models have entered their "Android moment". Right now, the domestic large model best positioned to replace Llama 2 is Baichuan 2.

The reason is simple. On the one hand, the Baichuan 2 series not only leads Llama 2 by a clear margin in performance, but is also significantly better than competing models of the same size.


On the other hand, Meta's commercial agreement does not actually permit the Llama models to be used commercially by the Chinese-language community, whereas the Baichuan series of large models is currently fully open source and free for commercial use.

[Screenshot: the Llama 2 commercial agreement, which states that commercial use outside of English is not allowed]

No. 1 in Chinese open source

As the top-ranked Chinese open-source large model, Baichuan's performance on the classic problems that challenge LLMs is also impressive.

For the profound subtleties of the Chinese language, Baichuan 2, with its precise semantic understanding, can grasp them fully.


Llama 2-13B, which is not good at Chinese, simply produced a pile of useless nonsense.


In code generation, which tests reasoning ability, Baichuan 2 produces sufficiently polished results, and its usability rate has reached an industry-leading level.


Llama 2 can also handle this kind of question, but by default it only responds in English.


Even more difficult are complex instruction-following tasks.

In this regard, the Baichuan model is far ahead and can easily handle all kinds of complex instructions.

Even reasoning questions that stump GPT-4 are easily solved by the Baichuan model.


Model evaluation

Beyond these real-world evaluations, Baichuan 2 achieved the best results at its scale on multiple authoritative Chinese, English, and multilingual benchmarks, both general and domain-specific, while Llama 2 was outperformed across the board.

For general domains, the benchmarks used are: C-Eval, a Chinese foundation-model evaluation dataset; MMLU, the mainstream English evaluation dataset; CMMLU, a Chinese benchmark for knowledge and reasoning; Gaokao, which evaluates language and logical-reasoning ability; AGIEval, which evaluates general abilities such as knowledge and problem solving; and BBH, a subset of the challenging Big-Bench tasks.


In the legal field, the JEC-QA dataset, built from the Chinese National Judicial Examination, is used. In the medical field, in addition to the medical questions in the general-domain datasets, MedQA and MedMCQA are used.


In mathematics, the GSM8K and MATH datasets are used; in code, HumanEval and MBPP.


Finally, multilingual ability is evaluated on Flores-101, a dataset drawn from many domains such as news, travel guides, and books, covering 101 languages including English.


In summary, the Baichuan 2 series not only inherits the strong generation and creative abilities, smooth multi-turn dialogue, and low deployment threshold of the previous generation, but also shows significant improvements in mathematics, code, safety, logical reasoning, and semantic understanding.

Compared with the previous-generation 13B model, Baichuan2-13B-Base improves mathematical ability by 49%, coding ability by 46%, safety by 37%, logical reasoning by 25%, and semantic understanding by 15%.


Data

One reason the Baichuan 2 series of large models achieves such impressive results is that its training corpus is large in scale, comprehensive in coverage, and of high quality.

In terms of data acquisition, the Baichuan team mainly collects information from rich data sources such as web pages, books, research papers, and code libraries, covering various fields such as technology, business, and entertainment.

In total, the dataset contains 2.6 trillion tokens.

At the same time, multilingual support has been added to the dataset, covering dozens of languages such as Chinese, English, Spanish, and French.

[Figure: distribution of Baichuan 2 training data by type]

So, how is excellent data quality obtained?

As a company with search in its DNA, Baichuan Intelligence draws on its experience in the search field and focuses on data frequency and quality.

On the one hand, by building a large-scale deduplication and clustering system, hundreds of billions of data entries can be cleaned and deduplicated within a few hours.

On the other hand, multi-granularity content-quality scoring was applied during data cleaning, drawing not only on chapter-level, paragraph-level, and sentence-level evaluation, but also on the content-quality signals used in search.

Through fine-grained sampling, the quality of the training data was significantly improved, especially for Chinese.
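
Baichuan has not published the internals of this deduplication and clustering system, so the following is only a toy illustration of the general idea behind large-scale near-duplicate detection (a MinHash-style signature comparison); the function names, shingle size, and documents are made up for the example.

```python
import hashlib
import re

def shingles(text, n=5):
    """Break a document into overlapping word n-grams (shingles)."""
    words = re.findall(r"\w+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}

def minhash_signature(text, num_hashes=64):
    """Toy MinHash: for each seeded hash function, keep the minimum hash over the shingles.
    Documents with similar shingle sets end up with similar signatures."""
    grams = shingles(text)
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{g}".encode(), digest_size=8).digest(), "big")
            for g in grams
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots approximates the Jaccard similarity of the shingle sets."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

docs = [
    "the quick brown fox jumps over the lazy dog near the river bank",
    "the quick brown fox jumps over the lazy dog near the old river bank",
    "large language models are trained on trillions of tokens of text",
]
sigs = [minhash_signature(d) for d in docs]
print(estimated_jaccard(sigs[0], sigs[1]))  # high: near-duplicates, one would be dropped
print(estimated_jaccard(sigs[0], sigs[2]))  # low: unrelated documents are both kept
```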

[Figure: training data size at different data-processing stages]

Training

Once data preparation is complete, the next step is the most important stage for a large model: training.

The Baichuan team used the AdamW optimizer and BFloat16 mixed precision to train the model.

To stabilize training and improve model performance, NormHead was also used to normalize the output embeddings.

In addition, the Baichuan team found during training that the logits of an LLM can become very large, so a max-z loss was introduced to stabilize training and make model inference more robust to hyperparameters.
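
To make these two ideas concrete, here is a rough PyTorch sketch: NormHead L2-normalizes the output embeddings before the logit projection, and the max-z term penalizes the largest logit. The toy dimensions and the 2e-4 coefficient are illustrative assumptions based on the technical report, not the actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormHead(nn.Module):
    """Output head whose weight rows (the output embeddings) are L2-normalized
    before the logit projection."""
    def __init__(self, hidden_size, vocab_size):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(vocab_size, hidden_size) * 0.02)

    def forward(self, hidden_states):
        norm_weight = F.normalize(self.weight, dim=-1)  # unit-norm output embeddings
        return F.linear(hidden_states, norm_weight)

def max_z_loss(logits, coeff=2e-4):
    """Auxiliary penalty on the largest logit so logits cannot grow unbounded;
    the 2e-4 coefficient follows the technical report (treat as an assumption)."""
    z_max = logits.max(dim=-1).values
    return coeff * (z_max ** 2).mean()

# Toy dimensions for illustration (Baichuan 2's reported vocabulary is 125,696).
batch, seq, hidden_size, vocab_size = 2, 16, 512, 32_000
hidden = torch.randn(batch, seq, hidden_size)
labels = torch.randint(0, vocab_size, (batch, seq))

head = NormHead(hidden_size, vocab_size)
logits = head(hidden)
loss = F.cross_entropy(logits.reshape(-1, vocab_size), labels.reshape(-1)) + max_z_loss(logits)
print(loss.item())
```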

As shown in the figure below, the loss curves of Baichuan2-7B and Baichuan2-13B continue to decrease throughout training.

[Figure: pre-training loss curves of Baichuan2-7B and Baichuan2-13B]

Previous research has shown that model performance becomes predictable to a certain degree as parameter scale increases, which is commonly referred to as the scaling law.

Before training large language models with billions of parameters, Baichuan Intelligence pre-trained a series of smaller models, with parameter counts ranging from 10M to 3B, on a total of 1 trillion training tokens.

By fitting a power-law term to the loss as a function of training FLOPs, the loss curves for training Baichuan2-7B and Baichuan2-13B on 2.6 trillion tokens could be predicted.

As shown in the figure below, the loss curves of models at different parameter scales, such as 30M, 50M, and 100M, all decline, and their final values can be fitted by regression to a single curve.

This allows the performance of larger models to be estimated more accurately in advance.

[Figure: scaling-law fit of loss curves for models of different parameter scales]

It is worth mentioning that this is similar to what OpenAI described when releasing GPT-4: only about one ten-thousandth of the final training compute is needed to predict the performance of the full model.

The whole fitting process can thus predict the model's loss quite accurately.
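
As a concrete sketch of how such an extrapolation can be done, the snippet below fits a power law with an irreducible term to pilot-run losses and extrapolates to a larger compute budget. The FLOPs and loss values are invented for illustration and are not Baichuan's measurements, and the exact functional form used in the report may differ.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(log_flops, a, b, l_inf):
    """Predicted loss = a * C**b + l_inf, written in terms of log C for numerical stability."""
    return a * np.exp(b * log_flops) + l_inf

# Hypothetical (FLOPs, final loss) pairs from small pilot runs (illustrative values only).
flops = np.array([1e19, 3e19, 1e20, 3e20, 1e21, 3e21])
loss  = np.array([3.20, 3.02, 2.85, 2.71, 2.58, 2.47])

params, _ = curve_fit(
    scaling_law, np.log(flops), loss,
    p0=(10.0, -0.05, 1.5),
    bounds=([0.0, -1.0, 0.0], [np.inf, 0.0, np.inf]))
a, b, l_inf = params

# Extrapolate to the compute budget of the full training run (value is illustrative only).
full_run_flops = 1e24
print("predicted loss at full scale:", scaling_law(np.log(full_run_flops), a, b, l_inf))
```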

At the same time, the Baichuan infrastructure team has done a great deal of work to optimize cluster performance, enabling its current thousand-card A800 cluster to reach a training speed of 180 TFLOPS, with machine utilization exceeding 50%, an industry-leading level.
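
As a rough sanity check on that utilization figure, assuming the 180 TFLOPS is achieved throughput per GPU and that an A800's BF16 tensor-core peak matches the A100's roughly 312 TFLOPS (the A800 differs mainly in interconnect bandwidth), the numbers line up:

```python
# Rough utilization estimate under the assumptions stated above.
achieved_tflops = 180
peak_bf16_tflops = 312  # approximate A100/A800 dense BF16 tensor-core peak
print(f"approx. utilization: {achieved_tflops / peak_bf16_tflops:.0%}")  # ~58%, consistent with ">50%"
```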

As described above, the training of the Baichuan models proved efficient, stable, and predictable.

Safety

So how is the safety of the trained model ensured? Baichuan Intelligence has also done a great deal of safety-alignment work here.

Before model training, the team strictly filtered the entire dataset and also curated a bilingual Chinese-English dataset incorporating a variety of positive-value data.

In addition, Baichuan Intelligence strengthened the model with safety fine-tuning and safety reinforcement learning, defining six attack types and conducting extensive red-blue adversarial training to improve the model's robustness.


In the reinforcement-learning optimization stage, the DPO method effectively uses a small amount of annotated data to improve the model's performance on specific vulnerability issues.
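
For reference, DPO works on pairs of preferred and rejected responses and needs no separate reward model. Below is a minimal sketch of the standard DPO objective; the log-probability values are placeholders, and this is the generic formulation rather than Baichuan's exact training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: push the policy to prefer the chosen (e.g. safer) response
    over the rejected one, relative to a frozen reference model.
    Inputs are summed log-probabilities of each response under each model."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of log-probabilities (illustrative values only).
loss = dpo_loss(torch.tensor([-12.0, -15.0]), torch.tensor([-14.0, -16.5]),
                torch.tensor([-13.0, -15.5]), torch.tensor([-13.5, -16.0]))
print(loss.item())
```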

In addition, a reward model combining helpfulness and harmlessness objectives was used for PPO safety-enhancement training, which significantly improved safety without reducing the model's usefulness.

Clearly, Baichuan Intelligence has put substantial effort into model safety alignment, including pre-training data curation, safety fine-tuning, safety reinforcement learning, and red-blue adversarial testing.

Baichuan 2: truly open source

For academia, what stands in the way of in-depth research on large-model training?

The cost of training a model completely from scratch is extremely high, and every step requires a large investment of manpower and compute.

Training a large model involves acquiring massive amounts of high-quality data, running stable training on large clusters, and tuning the model and algorithms, and the slightest deviation can make a huge difference.

However, most current open-source models disclose only the model weights and rarely describe the training details. Moreover, these are final versions, often already chat-aligned, which is not friendly to academia.

As a result, companies, research institutions, and developers can only do limited fine-tuning on top of such models, making in-depth research difficult.

In response, Baichuan Intelligence has published the Baichuan 2 technical report, describing in detail the entire training process, including data processing, model-architecture optimization, scaling laws, and training-process metrics.

More importantly, Baichuan Intelligence has also open-sourced checkpoints from the entire training run, from 220 billion to 2,640 billion tokens.

This is a first in the domestic open-source ecosystem!

These checkpoints are extremely valuable for research on the training process, continued training, and value alignment of models.

[Figure: performance of 11 intermediate Baichuan 2 checkpoints on the C-Eval, MMLU, and CMMLU benchmarks]
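
In principle, these intermediate checkpoints can be loaded like any other Hugging Face revision. The repository name and revision tags in the sketch below are assumptions based on the project's release materials; verify them on the Baichuan2 GitHub and Hugging Face pages before relying on them.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository and revision names are assumptions (intermediate 7B base-model checkpoints
# tagged by training tokens seen); verify against https://github.com/baichuan-inc/Baichuan2.
repo = "baichuan-inc/Baichuan2-7B-Intermediate-Checkpoints"
revision = "train_00220B"  # e.g. the checkpoint after ~220 billion training tokens

tokenizer = AutoTokenizer.from_pretrained(repo, revision=revision, use_fast=False, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    repo, revision=revision, torch_dtype=torch.bfloat16,
    device_map="auto", trust_remote_code=True)

# Repeating this for successive revisions lets you track how a capability emerges over training.
```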

Regarding the release of these intermediate checkpoints, Zhang Qi, a professor at the School of Computer Science and Technology at Fudan University, commented:

The intermediate checkpoints released for the Baichuan series are of great benefit for studying the nature of large models. We can not only see each step of the model's iteration, but also do a great deal of work with the intermediate checkpoints themselves.

Moreover, compared with models that open-source only the final version, or even only the Chat version, Baichuan's release is very clean: a clean base language model.

In addition, many evaluations are conducted from a single perspective; on some leaderboards GPT-4 is even ranked tenth, which makes no sense. Baichuan's evaluation results are very solid.

From a business perspective, the Baichuan 2 model is also a very good choice for enterprises.

When Llama 2, free and commercially usable, was released, many believed it would be a blow to many startups because it could meet low-cost, personalized needs.

But on closer consideration, it becomes clear that Llama 2 has not changed the market structure.

If an enterprise wants to use a model, even fine-tuning it will require some cost, effort and time.

And if it chooses a model with weak performance (especially one trained mainly on English corpora), retraining is difficult and costs almost as much as building a large model from scratch.

Since Llama 2 is not good at Chinese, and its agreement prohibits commercialization in non-English scenarios, it is clear that for commercial use Baichuan 2, an open-source model with stronger overall capabilities, is almost the best choice.

Based on the Baichuan 2 series large models, domestic researchers can carry out secondary development and quickly integrate the technology into real scenarios.

In short, Baichuan 2 is like a steady stream of fresh water: by open-sourcing as comprehensively as possible, it greatly advances domestic large-model research, and by lowering the threshold for domestic commercial deployment, it allows application innovation to keep emerging.

References:

https://github.com/baichuan-inc/Baichuan2
