Can foundation models label data like humans?

Since the emergence of ChatGPT, we have witnessed unprecedented growth in the field of large language models (LLMs), especially dialogue models that are fine-tuned to follow the instructions and requests given in a prompt.

However, we still cannot rigorously compare the performance of these large models, because there is no unified benchmark for doing so. Evaluating the instructions we send them, and the dialogue models themselves, is inherently difficult: users' judgments rest on subjective impressions of answer quality, while most existing evaluation criteria for natural language processing tasks are limited to specific metrics and narrow quantitative measures.

In this field, a newly released large language model is often advertised with a claim like: our model is better than ChatGPT in some percentage of cases. What this usually means is that, under some GPT-4-based evaluation scheme, the new model's outputs were judged better than ChatGPT's in that fraction of cases. What such scores are really meant to stand in for is a different measurement: ratings provided by human labelers. Reinforcement Learning from Human Feedback (RLHF) popularized interfaces and data for comparing the outputs of two models against each other. This RLHF data is used to train a reward model that judges which answer is better, but the idea of scoring and ranking model outputs has since grown into a more general evaluation tool.
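
As a rough illustration of the reward-model idea described above, the sketch below scores two candidate answers with a publicly available preference-trained reward model. The checkpoint name is only an example, and any sequence-classification reward model trained on human preference data follows the same pattern.

```python
# A minimal sketch of scoring two candidate answers with a reward model trained
# on human preference data. The checkpoint name is an example, not a recommendation.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "OpenAssistant/reward-model-deberta-v3-large-v2"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
reward_model = AutoModelForSequenceClassification.from_pretrained(model_name)
reward_model.eval()

def reward_score(prompt: str, answer: str) -> float:
    """Return a scalar reward for a (prompt, answer) pair."""
    inputs = tokenizer(prompt, answer, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = reward_model(**inputs).logits
    return logits[0].item()

prompt = "Explain what a reward model is in one sentence."
answer_a = "A reward model scores how well an answer matches human preferences."
answer_b = "I don't know."

# The answer with the higher score is the one the reward model "prefers".
print(reward_score(prompt, answer_a), reward_score(prompt, answer_b))
```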

Here we show some examples from the instruct and code-instruct subsets of our blind test data.

In terms of iteration speed, using a language model to evaluate model outputs is already very efficient, but a big question is left open: whether this downstream shortcut is calibrated against the original form of measurement it stands in for. In this article, we take a closer look at when you can and cannot trust the data labels you get from your chosen large language model, by extending the Open LLM Leaderboard evaluation suite.
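
To make the "language model as evaluator" setup concrete, here is a hedged sketch of pairwise judging with GPT-4. The prompt wording is illustrative rather than the exact template used for the leaderboard, and it assumes the pre-1.0 openai Python client with a valid API key.

```python
# Illustrative LLM-as-a-judge sketch: ask GPT-4 to compare two completions for the
# same instruction on a 1-8 scale. Uses the pre-1.0 openai ChatCompletion interface.
import openai

JUDGE_TEMPLATE = """You are comparing two responses to the same instruction.

Instruction: {instruction}

Response A: {response_a}

Response B: {response_b}

Rate the pair on a scale of 1-8, where 1 means a strong preference for
Response A, 4-5 means they are roughly equal, and 8 means a strong
preference for Response B. Reply with the number only."""

def judge(instruction: str, response_a: str, response_b: str) -> int:
    """Return GPT-4's 1-8 preference rating for the pair of responses."""
    completion = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            instruction=instruction, response_a=response_a, response_b=response_b)}],
        temperature=0,
    )
    return int(completion["choices"][0]["message"]["content"].strip())
```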

Nowadays, various leaderboards have begun to emerge, such as LMSYS and nomic / GPT4All, to compare models from different angles. But we still need a more complete resource for comparing model performance. Some leaderboards use existing NLP benchmarks to measure question-answering ability; others rely on crowdsourced rankings of open-ended chat. To provide a more comprehensive and general evaluation, we have extended the Hugging Face Open LLM Leaderboard to include automated academic benchmarks, professional human labels, and GPT-4-based evaluations.

Evaluating Preferences for Open Source Models
Human-curated data is inherently costly at any stage of training. So far, only a few human-annotated preference datasets are available for training large models, such as Anthropic's HHH data, OpenAssistant's dialogue rankings, and OpenAI's Learning to Summarize / WebGPT datasets. The same kind of preference labels can also be collected on model outputs to build an Elo ranking between two models (Elo is the pairwise-comparison ranking method commonly used in chess and other games to build a global leaderboard; the higher the rating, the better). The data becomes interesting when the text shown to annotators is generated by the models we want to study.
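
The parenthetical above summarizes how Elo works; as a concrete illustration, the sketch below updates Elo ratings from a list of pairwise preference outcomes. The starting rating of 1000 and K-factor of 32 are conventional defaults, not values taken from any particular leaderboard.

```python
# Minimal Elo update from pairwise preference labels. Starting rating (1000) and
# K-factor (32) are conventional defaults.
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_rankings(matches, k: float = 32.0, start: float = 1000.0):
    """matches: list of (model_a, model_b, score) with score 1 if A wins,
    0 if B wins, and 0.5 for a tie."""
    ratings = defaultdict(lambda: start)
    for model_a, model_b, score in matches:
        e_a = expected_score(ratings[model_a], ratings[model_b])
        ratings[model_a] += k * (score - e_a)
        ratings[model_b] += k * ((1 - score) - (1 - e_a))
    # Sort from highest to lowest rating to form the leaderboard.
    return dict(sorted(ratings.items(), key=lambda kv: kv[1], reverse=True))

# Hypothetical pairwise outcomes between open-source models.
matches = [("vicuna-13b", "koala-13b", 1), ("koala-13b", "dolly-12b", 1),
           ("vicuna-13b", "oasst-12b", 0.5)]
print(elo_rankings(matches))
```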

Many unexpected and interesting things happen while training these models, so we need more rigorous, controlled experiments across various open-source models to see how the process of collecting human preferences maps onto today's prevailing GPT-4 / ChatGPT preference evaluations, and where the two diverge.

For this purpose, we assembled a collection of instruction prompts and a corresponding set of completions produced by open-source models (Koala 13b, Vicuna 13b, OpenAssistant 12b, Dolly 12b).

We collected a set of high-quality, human-written prompts from the Self-Instruct evaluation set, along with some early conversation data from data vendors, covering task categories such as generation, brainstorming, question answering, summarization, common sense, and programming. There are 327 prompts in total across these task types, 25 of which are programming related.

Here are some statistics on these prompts, including their sentence lengths.
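
As a sketch of the kind of statistics meant here, the snippet below counts prompts per task category and measures prompt length in whitespace-separated words; the field names and example records are hypothetical, not the actual dataset layout.

```python
# Sketch of prompt statistics: counts per task category and prompt length in words.
# The "category" / "prompt" field names and the sample records are hypothetical.
from collections import Counter
from statistics import mean, median

prompts = [
    {"category": "brainstorming", "prompt": "List five ways to reuse glass jars."},
    {"category": "code", "prompt": "Write a Python function that reverses a string."},
    # ... the full set contains 327 prompts, 25 of them programming related
]

category_counts = Counter(p["category"] for p in prompts)
lengths = [len(p["prompt"].split()) for p in prompts]

print("prompts per category:", dict(category_counts))
print("length (words): mean=%.1f median=%.1f" % (mean(lengths), median(lengths)))
```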

With this data in hand, we began evaluating model quality with Scale AI and with GPT-4. Following Anthropic's approach to preference modeling, we asked raters to score each pair of outputs on a Likert scale from 1 to 8: a score of 1 indicates a strong preference for the first model, a score around 4 means the two outputs are roughly equivalent, and a score of 8 indicates a strong preference for the second model.
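
To show how such Likert ratings can be aggregated, here is a small sketch that converts a list of 1-8 scores for one model pair into a win rate for the second model; treating scores of 4 and below as wins for the first model is an assumption about how near-ties are counted.

```python
# Convert 1-8 Likert ratings (1 = strong preference for the first model,
# 8 = strong preference for the second) into a pairwise win rate. Counting
# scores of 4 and below as wins for the first model is an assumption.
def win_rate(scores) -> float:
    """Fraction of ratings in which the second model is preferred (score >= 5)."""
    if not scores:
        return 0.0
    return sum(score >= 5 for score in scores) / len(scores)

# e.g. hypothetical ratings collected for one (first model, second model) pair
ratings = [2, 3, 6, 7, 4, 5, 8, 1]
print(f"second model preferred in {win_rate(ratings):.0%} of comparisons")
```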
