LLMs: Reinforcement learning from human feedback (RLHF)

Let's consider the task of text summarization: using a model to generate a short piece of text that captures the most important points of a longer article. Your goal is to use fine-tuning to improve your model's summarization capability by showing it human-generated example summaries. In 2020, researchers at OpenAI published a paper that explored using fine-tuning with human feedback to train a model to write short summaries of text articles. Here you can see that a model fine-tuned on human feedback produced better responses than a pre-trained model, an instruction fine-tuned model, and even a reference human baseline.
A popular technique for using human feedback to fine-tune large language models is called reinforcement learning from human feedback (RLHF).

As the name suggests, RLHF uses reinforcement learning (RL for short) to fine-tune the LLM with human feedback data, producing a model that is better aligned with human preferences. You can use RLHF to make sure that your model produces output that maximizes usefulness and relevance to the input prompt. Perhaps most importantly, RLHF can help minimize the potential for harm. You can train your model to give responses that acknowledge its limitations and to avoid toxic language and topics.

A potentially exciting application of RLHF is the personalization of LLMs, where a model learns each user's preferences through a continuous feedback process. This could lead to exciting new technologies, such as individualized learning plans or personalized AI assistants.

But to understand how these future applications are possible, let's first take a closer look at how RLHF works. If you are new to reinforcement learning, here is a high-level overview of some of the most important concepts.

Reinforcement learning is a type of machine learning in which an agent learns to make decisions related to a specific goal by taking actions in the environment, with the goal of maximizing some notion of cumulative reward.

In this framework, the agent continuously learns from its experience by taking actions, observing the resulting changes in the environment, and receiving rewards or penalties based on the outcomes of its actions. By iterating through this process, the agent gradually refines its strategy, or policy, to make better decisions and increase its chances of success.
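
To make this loop concrete, here is a minimal sketch in Python of the act-observe-reward cycle described above. The `ToyEnvironment` and the purely random policy are placeholders invented for illustration, not part of any particular RL library.

```python
import random

class ToyEnvironment:
    """Placeholder environment: the agent tries to reach position +3 on a number line."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # Apply the action (-1 or +1); return the new state, a reward, and a done flag.
        self.state += action
        done = (self.state == 3)
        reward = 1.0 if done else 0.0
        return self.state, reward, done

env = ToyEnvironment()
total_reward = 0.0
for t in range(100):                          # cap the episode length
    action = random.choice([-1, +1])          # policy: purely random exploration here
    state, reward, done = env.step(action)    # environment returns the new state and reward
    total_reward += reward                    # agent accumulates reward over the episode
    if done:
        break

print(f"Finished after {t + 1} steps in state {state}, total reward {total_reward}")
```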

A useful example to illustrate these ideas is training a model to play tic-tac-toe. In this example, the agent is the model, or policy, acting as the tic-tac-toe player. Its objective is to win the game. The environment is the three-by-three game board, and the state at any moment is the current configuration of the board. The action space comprises all the possible positions a player can choose based on the current board state. The agent makes decisions by following a strategy known as the RL policy. Now, as the agent takes actions, it collects rewards based on how effective those actions are at leading to a win. The goal of reinforcement learning is for the agent to learn the optimal policy for a given environment that maximizes its rewards. This learning process is iterative and involves trial and error.

Initially, the agent takes a random action that results in a new state. From this state, the agent proceeds to explore subsequent states through further actions. The series of actions and corresponding states forms a playout, often called a rollout. As the agent accumulates experience, it gradually uncovers the actions that yield the highest long-term rewards, ultimately leading to success in the game.
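
As a rough sketch of what collecting such a rollout might look like for tic-tac-toe, the snippet below plays one game with a purely random policy and records the sequence of state-action pairs. The board encoding and helper function are illustrative choices, not a standard API.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    """Return 'X' or 'O' if that player has three in a row, otherwise None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

board = [" "] * 9          # state: the current 3x3 board, flattened
rollout = []               # the sequence of (state, action) pairs
player = "X"
while winner(board) is None and " " in board:
    actions = [i for i, cell in enumerate(board) if cell == " "]  # action space: empty cells
    action = random.choice(actions)            # random policy: trial and error
    rollout.append((tuple(board), action))     # record the state and chosen action
    board[action] = player
    player = "O" if player == "X" else "X"

# Reward for the X player, assigned only at the end of the game.
reward = 1.0 if winner(board) == "X" else (-1.0 if winner(board) == "O" else 0.0)
print(f"Rollout length: {len(rollout)}, reward for X: {reward}")
```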

Let's now see how the tic-tac-toe example can be extended to fine-tuning large language models with RLHF. In this case, the agent's policy that guides the actions is the Instruct LLM, and its objective is to generate text that is perceived as being aligned with human preferences. This could mean, for example, that the text is helpful, accurate, and non-toxic. The environment is the context window of the model, the space in which text can be entered via a prompt. The state that the model considers before taking an action is the current context, meaning any text currently contained in the context window. The action here is the act of generating text. This could be a single word, a sentence, or a longer form of text, depending on the task specified by the user. The action space is the token vocabulary, meaning all the possible tokens that the model can choose from to generate the completion.

How the Instruct LLM decides to generate the next token in the sequence depends on the statistical representation of language that it learned during its training. At any given moment, the action that the model will take, meaning which token it will choose next, depends on the prompt text in the context and the probability distribution over the vocabulary space. The reward is assigned based on how closely the completion aligns with human preferences.
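
A toy sketch of that token-by-token decision is shown below; the vocabulary and the scoring function are made-up stand-ins for the learned distribution a real Instruct LLM would compute from its context window.

```python
import random

# Toy vocabulary standing in for the model's full token vocabulary (the action space).
vocab = ["the", "cat", "sat", "on", "mat", "."]

def next_token_probs(context):
    """Stand-in for the model's learned distribution over the vocabulary, given the context."""
    # A real Instruct LLM would compute these probabilities from the text in its context window.
    scores = [len(token) * 0.1 + (0.5 if token not in context else 0.0) for token in vocab]
    total = sum(scores)
    return [s / total for s in scores]

context = ["the", "cat"]                               # state: text currently in the context window
probs = next_token_probs(context)                      # distribution over the action space
action = random.choices(vocab, weights=probs, k=1)[0]  # action: the sampled next token
context.append(action)
print("Generated so far:", " ".join(context))
```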

Given the variation in how humans respond to language, determining the reward is more complicated than in the tic-tac-toe example. One way you could do this is to have a human evaluate all of the model's completions against some alignment metric, such as determining whether the generated text is toxic or non-toxic. This feedback can be represented as a scalar value, either a zero or a one.

The LLM weights are then updated iteratively to maximize the reward obtained from the human classifier, enabling the model to generate non-toxic completions.

However, obtaining human feedback can be time-consuming and expensive. As a practical and scalable alternative, you can use an additional model, known as the reward model, to classify the outputs of the Instruct LLM and evaluate their degree of alignment with human preferences. You will start with a smaller number of human examples to train the secondary model by traditional supervised learning methods. Once trained, you will use the reward model to assess the output of the LLM and assign a reward value, which in turn gets used to update the weights of the LLM and train a new human-aligned version.
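
As a loose illustration of that supervised step, the sketch below trains a tiny toxicity classifier with scikit-learn and then uses it to assign a scalar reward to a new completion. The miniature dataset and the choice of logistic regression are assumptions made purely for illustration; real reward models are typically themselves transformer-based.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny stand-in for human-labelled examples: 1 = acceptable, 0 = toxic.
completions = ["thank you for the helpful summary",
               "this is a clear and polite answer",
               "you are an idiot and your question is stupid",
               "that take is garbage and so are you"]
labels = [1, 1, 0, 0]

# Supervised learning step: fit the reward model on the human feedback.
vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(completions)
reward_model = LogisticRegression().fit(features, labels)

# At RLHF time, the reward model scores new LLM completions in place of a human.
new_completion = ["here is a short, respectful summary of the article"]
reward = reward_model.predict_proba(vectorizer.transform(new_completion))[0, 1]
print(f"Reward assigned to completion: {reward:.2f}")
```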

Exactly how the weights get updated as the model completions are assessed depends on the algorithm used to optimize the policy. You'll explore these issues in more depth shortly. Lastly, note that in the context of language modeling, the sequence of actions and states is called a rollout, instead of the term playout used in classic reinforcement learning.
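
Putting the pieces together, one RLHF iteration looks roughly like the sketch below. Here `generate`, `score`, and `policy_update` are hypothetical placeholders, not a real API; the actual weight update depends on the policy-optimization algorithm, which is covered next.

```python
def rlhf_step(llm, reward_model, prompt, policy_update):
    """One illustrative RLHF iteration; every argument here is a hypothetical stand-in."""
    completion = llm.generate(prompt)                  # rollout: prompt (state) -> completion (actions)
    reward = reward_model.score(prompt, completion)    # scalar alignment score from the reward model
    policy_update(llm, prompt, completion, reward)     # algorithm-specific weight update
    return reward

# Repeated over many prompts, the LLM's weights drift toward completions
# that the reward model prefers, e.g.:
# for prompt in prompts:
#     rlhf_step(llm, reward_model, prompt, policy_update)
```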


The reward model is a core component of the reinforcement learning process. It encodes all of the preferences learned from human feedback and plays a central role in how the model updates its weights over many iterations. In the next video you will see how this model is trained and how it is used during reinforcement learning to classify the model's outputs. Let's go ahead and take a look.

Reference

https://www.coursera.org/learn/generative-ai-with-llms/lecture/NY6K0/reinforcement-learning-from-human-feedback-rlhf
