Deep Reinforcement Learning from Human Preferences: A Comprehensive Guide

Artificial intelligence models have undergone a remarkable transformation in recent years, redefining how we interact with technology. Among these advancements, Reinforcement Learning from Human Feedback (RLHF) stands out as a particularly ingenious technique: it combines deep reinforcement learning (RL) with human input, most prominently in Natural Language Processing (NLP). This article provides a comprehensive tutorial on RLHF, exploring its mechanics, applications, and the challenges it presents.

Introduction to RLHF

Reinforcement Learning from Human Feedback (RLHF) is a technique used to align AI systems, particularly large language models (LLMs), with human preferences. Instead of relying solely on predefined reward functions, RLHF trains models using feedback from people, enabling them to learn from examples of what humans consider good or bad outputs. This approach is particularly important for tasks where success is subjective or hard to quantify, such as generating helpful, harmless, and safe text responses.

Traditionally, reinforcement learning has been challenging to apply and primarily confined to gaming and simulated environments like Atari or MuJoCo. Just a few years ago, RL and NLP advanced independently, employing different tools, techniques, and experimental setups. The effectiveness of RLHF in such a vast and evolving field is truly remarkable.

The Evolution of Language Models: From Raw Data to Customer-Ready AI

To understand where RLHF fits in, it's essential to examine the development process of LLMs like ChatGPT. Think of the model as evolving from a raw, unrefined state, much like the meme of a Shoggoth with a smiley face.

The initial pre-trained model resembles an unrefined entity, having been trained on a vast array of internet data that includes clickbait, misinformation, propaganda, conspiracy theories, and inherent biases against certain groups. This "creature" then undergoes fine-tuning using higher-quality data sources like StackOverflow, Quora, and human annotations, making it more socially acceptable and improving its overall reliability. Subsequently, the fine-tuned model is refined further through RLHF, transforming it into a version suitable for customer interactions - in the meme, this is the step that finally puts the smiley face on the Shoggoth.


How RLHF Operates: A Three-Phase Training Process

The RLHF training process unfolds in three stages:

Phase 1: Pretraining a Language Model (LM)

The pretraining phase establishes the groundwork for RLHF. In this stage, a Language Model (LM) undergoes training using a substantial dataset of text material sourced from the internet. This data enables the LM to comprehend diverse aspects of human language, encompassing syntax, semantics, and even context-specific intricacies. The result of the pretraining phase is a large language model (LLM), often known as the pretrained model.

  • Language Models and Statistical Information: A language model encodes statistical information about language. Put simply, statistical information tells us how likely something (e.g. a word, a character) is to appear in a given context. The term token can refer to a word, a character, or a part of a word (like -tion), depending on the language model. Fluent speakers of a language subconsciously have this statistical knowledge: given the sentence "My favorite color is __", a fluent English speaker knows the blank is far more likely to be "blue" than "car". A language model should be able to fill in that blank the same way.

  • The Completion Machine: You can think of a language model as a "completion machine": given a text (prompt), it can generate a response to complete that text. As simple as it sounds, completion turned out to be incredibly powerful, as many tasks can be framed as completion tasks: translation, summarization, writing code, doing math, etc.

  • Training for Completion: To train a language model for completion, you feed it a lot of text so that it can distill statistical information from it. The text given to the model to learn from is called training data. As a toy example, consider a language that contains only two tokens, 0 and 1: if the training data mostly shows 1 following 0, the model will learn to predict 1 after 0. Since language models mimic their training data, they are only as good as that data, hence the phrase "Garbage in, garbage out". For a sense of scale, GPT-3's training dataset (OpenAI) contained roughly 0.5 trillion tokens.


  • The Scarcity of Data: Today, a language model like GPT-4 uses so much data that there's a realistic concern we'll run out of Internet data in the next few years. It sounds crazy, but it's happening. To get a sense of how big a trillion tokens is: a book contains around 50,000 words, or 67,000 tokens. The rate of training dataset size growth is much faster than the rate at which new data is generated (Villalobos et al, 2022).

    If you’ve ever put anything on the Internet, you should assume that it is already or will be included in the training data for some language models, whether you consent or not. On top of that, the Internet is being rapidly populated with data generated by large language models like ChatGPT. Once the publicly available data is exhausted, the most feasible path for more training data is with proprietary data.
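The two-token example above can be made concrete. Here is a minimal sketch (plain Python, purely illustrative, not a real model) of how "distilling statistical information" from training data works, using bigram counts:

```python
from collections import defaultdict

# Toy training data in a language with only two tokens: 0 and 1.
training_data = "0101101101"

# Count how often each token follows each context token (a bigram model).
counts = defaultdict(lambda: defaultdict(int))
for prev, nxt in zip(training_data, training_data[1:]):
    counts[prev][nxt] += 1

def next_token_probs(context):
    """Statistical information: how likely each token is in a given context."""
    total = sum(counts[context].values())
    return {tok: c / total for tok, c in counts[context].items()}

# In this training data, "1" always follows "0", so the model predicts it.
print(next_token_probs("0"))
print(next_token_probs("1"))
```

If the training data were garbage (say, random flips), the learned probabilities would be garbage too, which is exactly the "Garbage in, garbage out" point made above.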

Step-by-Step Breakdown of Pretraining:

  1. Choosing a Base Language Model: The initial phase involves the critical task of selecting a foundational language model. The choice of model is not universally standardized; instead, it hinges on the specific task, available resources, and unique complexities of the problem at hand. Industry approaches differ significantly, with OpenAI using a smaller version of GPT-3 as the basis for InstructGPT, while Anthropic and DeepMind have explored models with parameter counts ranging from 10 million to 280 billion.

  2. Acquiring and Preprocessing Data: In the context of RLHF, the chosen language model undergoes preliminary training on an extensive dataset, typically comprising substantial volumes of text sourced from the internet. This raw data requires cleaning and preprocessing to render it suitable for training. This preparation often entails eliminating undesired characters, rectifying errors, and normalizing text anomalies.

  3. Language Model Training: Following this, the LM undergoes training using the curated dataset, acquiring the ability to predict subsequent words in sentences based on preceding words. This phase involves refining model parameters through techniques like Stochastic Gradient Descent. The overarching objective is to minimize disparities between the model's predictions and actual data, typically quantified using a loss function such as cross-entropy.


  4. Model Assessment: Once training is complete, the model's performance is assessed using a held-out dataset that was not employed during training. This step is crucial to verify the model's capacity for generalization and to confirm that it has not merely memorized the training data. If the assessment metrics meet the required criteria, the model is deemed prepared for the subsequent phase of RLHF.

  5. Preparing for RLHF: Although the LM has amassed substantial knowledge about human language, it needs an understanding of human inclinations. To address this, supplementary data is necessary. Often, organizations compensate individuals to produce responses to prompts, which subsequently contribute to training a reward model. While this stage can incur expenses and consume time, it's pivotal for orienting the model towards human-like preferences.
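The training objective described in step 3 can be sketched numerically. This is a minimal, self-contained illustration of the cross-entropy loss (toy probabilities over a three-token vocabulary, not a real model):

```python
import math

def cross_entropy(predicted_probs, actual_token):
    """Cross-entropy loss for one prediction: -log of the probability
    the model assigned to the token that actually appeared next."""
    return -math.log(predicted_probs[actual_token])

# The model's predicted distribution over a toy vocabulary after some
# context, versus the token that actually came next in the data.
predicted = {"cat": 0.7, "dog": 0.2, "car": 0.1}

loss_good = cross_entropy(predicted, "cat")  # model was confident and right
loss_bad = cross_entropy(predicted, "car")   # model assigned low probability

# Minimizing this loss (e.g. with Stochastic Gradient Descent) pushes
# probability mass toward the tokens observed in the training data.
print(loss_good, loss_bad)
```

The loss is small when the model assigned high probability to the observed token and large otherwise, which is exactly the "disparity between predictions and actual data" the text refers to.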

Notably, the pretraining phase doesn't yield a perfect model; errors and erroneous outputs are expected. Nevertheless, it furnishes a significant foundation for RLHF to build upon, enhancing the model's accuracy, safety, and utility. Pretraining is the most resource-intensive phase. For the InstructGPT model, pretraining takes up 98% of the overall compute and data resources.

Phase 2: Supervised Fine-Tuning (SFT)

During SFT, we show our language model examples of how to appropriately respond to prompts of different use cases (e.g. question answering, summarization, translation). The examples follow the format (prompt, response) and are called demonstration data. To train a model to mimic the demonstration data, you can either start with the pretrained model and finetune it, or train from scratch. In fact, OpenAI showed that the outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3.

  • Demonstration Data: Demonstration data can be generated by humans, like what OpenAI did with InstructGPT and ChatGPT. Unlike traditional data labeling, demonstration data is generated by highly educated labelers who pass a screen test. OpenAI’s 40 labelers created around 13,000 (prompt, response) pairs for InstructGPT. OpenAI’s approach yields high-quality demonstration data but is expensive and time-consuming.

*Side note: finetuning for instructions vs. finetuning for dialogues. OpenAI's InstructGPT is finetuned for following instructions: each example of demonstration data is a (prompt, response) pair. DeepMind's Gopher is finetuned for conducting dialogues: each example of demonstration data consists of multiple turns of back-and-forth dialogue. Dialogue-finetuned Gopher used ~5 billion tokens, which I estimate to be on the order of 10M messages.*
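A minimal sketch of how demonstration data might be packed into training strings for SFT. The `<|prompt|>`/`<|response|>` markers here are hypothetical, invented for illustration; real models each use their own special-token conventions:

```python
# Demonstration data: (prompt, response) pairs, as described above.
demonstrations = [
    ("Summarize: The cat sat on the mat.", "A cat sat on a mat."),
    ("Translate to French: Hello.", "Bonjour."),
]

def format_example(prompt, response):
    """Pack one (prompt, response) pair into a single training string.
    The marker tokens below are made up for illustration only."""
    return f"<|prompt|>{prompt}<|response|>{response}"

# During SFT, the loss is typically computed only on the response tokens,
# so the model learns to produce responses rather than to echo prompts.
batch = [format_example(p, r) for p, r in demonstrations]
print(batch[0])
```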

Phase 3: Reinforcement Learning with Human Feedback (RLHF)

Empirically, RLHF improves performance significantly compared to SFT alone, though I haven't seen an argument for why that I find foolproof. Dialogues are flexible: given a prompt, there are many plausible responses, some better than others. The idea behind RLHF is this: what if we had a scoring function that, given a prompt and a response, outputs a score for how good that response is? We could then use this scoring function to further train our LLMs towards giving responses with high scores. That's exactly what RLHF does.

One hypothesis for why RLHF helps - the negative feedback hypothesis: demonstration data only gives the model positive signals (e.g. only showing the model good responses), not negative signals (e.g. showing the model what bad responses look like), whereas a reward model can supply both.

3.1. Developing a Reward Model through Training

At the core of the RLHF procedure lies the training of a reward model (RM). The RM's job is to output a score for a (prompt, response) pair. Training a model to output a score on a given input is a pretty common task in ML: you can simply frame it as a classification or a regression task. The real challenge with training a reward model is obtaining trustworthy data: getting different labelers to give consistent absolute scores for the same response turns out to be quite difficult. It is much easier for labelers to compare two responses and pick the better one, so the labeling process produces data that looks like this: (prompt, winning_response, losing_response).

People have experimented with different ways to initialize an RM: e.g. training an RM from scratch or starting with the SFT model as the seed. Starting from the SFT model seems to give the best performance.
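Because labelers compare responses rather than score them directly, the RM is commonly trained with a pairwise (Bradley-Terry-style) loss: the winning response should score higher than the losing one. A minimal sketch with toy scores (illustrative values, not real RM outputs):

```python
import math

def pairwise_loss(score_winning, score_losing):
    """Pairwise reward-model loss: -log sigmoid(s_w - s_l).
    It is small when the winning response already scores higher
    than the losing one, and large when the ranking is inverted."""
    margin = score_winning - score_losing
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy RM scores for one (prompt, winning_response, losing_response) triple.
loss_correct = pairwise_loss(2.0, -1.0)  # RM already ranks them correctly
loss_wrong = pairwise_loss(-1.0, 2.0)    # RM ranks them the wrong way round

print(loss_correct, loss_wrong)
```

Minimizing this loss over many labeled triples pushes the RM to assign higher scores to responses humans prefer, without ever needing labelers to agree on an absolute score.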

Step-by-Step Guide to Reward Model Training:

  1. Creating the Reward Model: The reward model can be either another language model or a separate modular system. Its fundamental role is to map input text sequences to a numerical reward value, which reinforcement learning algorithms can then optimize against.

  2. Data Compilation: Initiating the training of the reward model involves assembling a distinct dataset separate from the one employed in the language model's initial training. This dataset is specialized, concentrating on particular use cases, and composed of pairs consisting of prompts and corresponding rewards. Each prompt is linked to an anticipated output, accompanied by rewards that signify desirability for that output. While this dataset is generally smaller than the initial training dataset, it plays a crucial role in steering the model toward generating content that resonates with users.

  3. Model Learning: Using the prompt and reward pairs, the model learns to associate specific outputs with their corresponding reward values. This process often harnesses expansive "teacher" models or combinations to enhance diversity and counteract potential biases. The primary objective here is to construct a reward model capable of effectively gauging the appeal of potential outputs.

  4. Incorporating Human Feedback: Integrating human feedback is an integral facet of refining the reward model. A prime illustration of this can be observed in ChatGPT, where users can rate the AI's outputs using a thumbs-up or thumbs-down mechanism. This collective feedback holds immense value in enhancing the reward model, providing direct insights into human preferences.

3.2. Fine-Tuning with Reinforcement Learning

In this phase, we further train the SFT model to generate responses that maximize the scores given by the RM. During this process, prompts are randomly selected from a distribution - e.g. we might sample randomly among customer prompts. OpenAI also found it necessary to add a constraint: the model resulting from this phase should not stray too far from the model resulting from the SFT phase (mathematically, a KL divergence term in the objective function) or from the original pretrained model. The intuition is that for any given prompt there are many possible responses, the vast majority of which the RM has never seen before. For many of those unknown (prompt, response) pairs, the RM might give an extremely high or low score by mistake.

Action space: the vocabulary of tokens the LLM uses. Policy: the probability distribution over all actions to take (aka all tokens to generate) given an observation (aka a prompt).
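The KL constraint described above is often implemented as a penalty subtracted from the RM score before it is handed to the RL algorithm. A minimal numeric sketch (toy log-probabilities, illustrative values only; `beta` is a hypothetical penalty coefficient):

```python
def shaped_reward(rm_score, logprob_rl, logprob_sft, beta=0.1):
    """Reward used for RL training: the RM score minus a penalty for
    drifting away from the SFT model. The per-sample difference of
    log-probabilities is a standard estimator of the KL divergence."""
    kl_estimate = logprob_rl - logprob_sft  # log pi_RL(y|x) - log pi_SFT(y|x)
    return rm_score - beta * kl_estimate

# A response the RM loves but that the SFT model finds very unlikely
# (a candidate reward hack) loses much of its reward:
print(shaped_reward(rm_score=5.0, logprob_rl=-2.0, logprob_sft=-30.0))

# A similarly scored response that stays close to the SFT model keeps
# almost all of its reward:
print(shaped_reward(rm_score=5.0, logprob_rl=-2.0, logprob_sft=-2.5))
```

This is why the policy cannot simply chase whatever the RM scores highly: straying far from the SFT model is taxed in proportion to how far it strays.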

Techniques to Fine-Tune Model with Reinforcement Learning with Human Feedback

Fine-tuning plays a vital role in Reinforcement Learning with the Human Feedback approach. It enables the language model to refine its responses according to user inputs.

  1. Applying the Reward Model: At the outset, a user's input, referred to as a prompt, is directed to the RL policy, which is essentially a refined version of the language model (LM). The RL policy generates a response, and the reward model assesses both the RL policy's output and the initial LM's output. The reward model assigns a numeric reward value to gauge the quality of these responses.

  2. Establishing the Feedback Loop: This process is iterated within a feedback loop, allowing the reward model to assign rewards to as many responses as resources permit. Responses receiving higher rewards gradually influence the RL policy, guiding it to generate responses that better align with human preferences.

  3. Quantifying Differences Using KL Divergence: A pivotal role is played by Kullback-Leibler (KL) Divergence, a statistical technique that measures the difference between two probability distributions. In RLHF, it measures how far the updated RL policy has drifted from the original SFT model, and this distance is penalized so the policy does not stray too far.

  4. Fine-tuning Through Proximal Policy Optimization: Integral to the fine-tuning process is Proximal Policy Optimization (PPO), a widely recognized reinforcement learning algorithm known for its effectiveness in optimizing policies within intricate environments featuring complex state and action spaces. PPO's strength in maintaining a balance between exploration and exploitation during training is particularly advantageous for the RLHF fine-tuning phase. This equilibrium is vital for RLHF agents, enabling them to learn from both human feedback and trial-and-error exploration. The integration of PPO accelerates learning and enhances robustness.

  5. Discouraging Inappropriate Outputs: Fine-tuning serves the purpose of discouraging the language model from generating improper or nonsensical responses. Responses that receive low rewards are less likely to be repeated, incentivizing the language model to produce outputs that more closely align with human expectations.
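The PPO mechanism from step 4 can be sketched numerically. The key idea is the clipped surrogate objective: the probability ratio between the new and old policy is clipped to a small interval, which removes the incentive to take overly large update steps (toy numbers, single action, illustrative only):

```python
def ppo_objective(ratio, advantage, clip_eps=0.2):
    """PPO clipped surrogate objective for a single action.
    ratio = pi_new(a|s) / pi_old(a|s); a positive advantage means the
    action (here: the generated response) was better than expected."""
    clipped_ratio = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# Within the trust region, the objective scales with the ratio:
print(ppo_objective(ratio=1.1, advantage=2.0))

# Beyond 1 + clip_eps, clipping caps the objective, so there is no
# incentive to push the policy even further in one update:
print(ppo_objective(ratio=1.5, advantage=2.0))
```

This clipping is what gives PPO its stability: each update stays close to the current policy, which complements the KL constraint described earlier.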

A Practical Example of RLHF Implementation with REBEL Algorithm

This section demonstrates a single iteration of REBEL (\(t=0\)) using the base model \(\pi_{\theta_0}\). The complete code for this part is available [here](link to code).

Step 1: Generating Samples from the Policy

The first step in the RLHF pipeline is generating samples from the policy to receive feedback on. Concretely, in this section, we will load the base model using vllm for fast inference, prepare the dataset, and generate multiple responses for each prompt in the dataset. You can select a subset of the dataset using dataset.select. The Llama model uses special tokens to distinguish prompts and responses. Finally, we can generate the responses using vllm with the prompts we just formatted.

Step 2: Querying the Reward Model

The second step in the RLHF pipeline is querying the reward model to tell us how good a generated sample was. Concretely, in this part, we will calculate reward scores for the responses generated in Step 1, which are later used for training. To begin, we'll initialize the Armo reward model pipeline.

Step 3: Filtering and Preparing the Dataset

While the preceding two parts are all we need in theory to do RLHF, in practice it is often advisable to perform a filtering pass to ensure training runs smoothly. Concretely, in this part, we'll walk through the process of preparing a dataset for training by filtering excessively long prompts and responses to prevent out-of-memory (OOM) issues, selecting the best and worst responses for training, and removing duplicate responses. In practice this uses two tokenizer configurations: one pads the prompt from the left and the other pads the response from the right, so that prompt and response meet in the middle of a fixed-length sequence.
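The filtering logic above can be sketched in a few lines. The length limit and field names here are made up for illustration; a real pipeline would filter on token counts rather than characters:

```python
# Toy records: each prompt with several (response, score) pairs from Step 2.
records = [
    {"prompt": "p1",
     "responses": [("a", 0.9), ("b", 0.1), ("a", 0.9), ("c", 0.5)]},
    {"prompt": "p" * 5000,  # excessively long prompt, would risk OOM
     "responses": [("x", 0.3), ("y", 0.7)]},
]

MAX_PROMPT_LEN = 1024  # hypothetical limit, for illustration only

def prepare(record):
    """Drop over-long prompts, dedupe responses, keep best and worst."""
    if len(record["prompt"]) > MAX_PROMPT_LEN:
        return None
    unique = list(dict.fromkeys(record["responses"]))  # remove duplicates
    if len(unique) < 2:
        return None  # need at least two distinct responses to form a pair
    unique.sort(key=lambda pair: pair[1])
    return {"prompt": record["prompt"],
            "chosen": unique[-1][0],    # best-scored response
            "rejected": unique[0][0]}   # worst-scored response

dataset = [r for r in (prepare(rec) for rec in records) if r is not None]
print(dataset)
```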

Step 4: Updating Model Parameters with REBEL

Finally, we’re now ready to update the parameters of our model using an RLHF algorithm! We will now use our curated dataset and the REBEL algorithm to fine-tune our base model.

Looking again at the REBEL objective, the only quantities we still need to compute for training are \(\pi_\theta(y|x)\) and \(\pi_{\theta_0}(y|x)\). `output.logits` contains the logits over the full vocabulary for the sequence of `input_ids`; `output.logits[:, args.task.maxlen_prompt - 1 : -1]` is the logits over the vocabulary for the response tokens only.
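The REBEL update itself can be sketched with toy numbers. The objective is a least-squares regression: for a pair of responses, the scaled difference of policy log-ratios \(\log \pi_\theta/\pi_{\theta_0}\) should match the difference of their reward scores (illustrative values, not real model outputs):

```python
def rebel_loss(logp_new_y1, logp_base_y1, logp_new_y2, logp_base_y2,
               reward_y1, reward_y2, eta=1.0):
    """REBEL least-squares objective for one pair of responses (y1, y2):
    regress the scaled difference of log-ratios log pi_theta / pi_theta0
    onto the difference of the reward scores for the two responses."""
    log_ratio_diff = ((logp_new_y1 - logp_base_y1)
                      - (logp_new_y2 - logp_base_y2))
    return ((1.0 / eta) * log_ratio_diff - (reward_y1 - reward_y2)) ** 2

# If the policy has already shifted log-probability mass toward the
# better-rewarded response by exactly the reward gap, the loss is zero:
print(rebel_loss(-1.0, -2.0, -3.0, -2.0, reward_y1=1.5, reward_y2=-0.5))
```

Minimizing this loss over many pairs moves the policy toward responses the reward model prefers while keeping updates anchored to the base model, without PPO's separate value network and clipping machinery.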

Applications of RLHF

RLHF is not yet widely used in industry beyond a few key players - OpenAI, DeepMind, and Anthropic. Even so, RLHF's use of human feedback for structured learning aligns models with user values, which is especially valuable in NLP and computer vision, where matching human preferences matters more than raw capability.

Large Language Models (Chatbots and NLP tasks)

RLHF enhances the performance of LLMs in various applications, including chatbots, text summarization, and content generation. By aligning the model's outputs with human preferences, RLHF ensures that the generated responses are not just accurate but also helpful, safe, and aligned with human intent.

  • Improving Multilingual Customer Engagement: Global customer support systems are turning to multilingual language models. These models connect people worldwide and translate words accurately, but they often miss the tone and cultural nuances that affect how messages are received. RLHF fine-tuning with native-speaker feedback corrects tone and preserves cultural appropriateness: annotators select translations aligned with both meaning and local norms, and a reward model is trained on those choices. This alignment enhances user satisfaction in multilingual customer support and highlights the importance of RLHF in developing culturally aware NLP systems.

  • Advancing Learning and Reasoning: Tutoring systems in math, science, and writing need to answer questions, provide reasoning, and sometimes avoid giving direct answers to encourage learning. RLHF-trained models offer guiding questions and hints. Training data comes from interactions selected by educators, so the best responses guide the student's problem-solving instead of taking it over.

Computer Vision

  • Vision-Based Preference Learning: Defining a numerical reward function is often impractical in vision tasks like image generation, captioning, or object arrangement. Questions such as “Is this image beautiful?” or “Does this movement look natural?” are better answered by people. Human preference data bridges this gap. Annotators rate images for qualities like realism, composition, and prompt adherence. These rankings train a reward model that scores outputs based on learned human criteria. The generator improves through reinforcement learning using PPO. The reward signal guides it to match user preferences. Projects like Pick-a-Pic show how human feedback can notably improve visual quality without relying on hand-crafted loss functions.

  • Multi-Modal Alignment: Text and Vision in Harmony: Base models can often make mistakes in multi-modal tasks. This includes visual question answering, image explanation, or instruction following. Supervised fine-tuning helps, but cannot fully eliminate these errors. RLHF-V addresses this with fine-grained human feedback, where annotators tag specific answer segments as correct, misleading, or irrelevant. These segment-level preferences help the model train with Dense Direct Preference Optimization (DDPO). This method uses alignment signals for short spans and full outputs.

RLHF in Robotics: Matching Physical Behaviors to Human Preferences

Robots deployed in homes, hospitals, and warehouses encounter messy, dynamic, and deeply human-driven tasks. Many control systems and RL agents fail, not due to technical limitations, but due to misalignment with human intent. New methods address this by embedding human feedback directly into training. The SEED framework (Skill-based Enhancements with Evaluative Feedback for Robotics) integrates human preferences into skills, removing hand-crafted rewards. SEED decomposes tasks into primitive skills like grasping, rotating, or pushing, arranged in a skill graph. Annotators give feedback on individual skills, capturing precise intent with minimal oversight.

Challenges and Limitations of RLHF

While RLHF presents tremendous potential, it's essential to remain vigilant about the challenges it faces.