Preference Learning Algorithms: A Comprehensive Guide
Introduction
Preference learning is a subfield of machine learning that focuses on learning predictive models from preference data. This data typically consists of pairwise comparisons, rankings, or ratings that express a user's or a system's preferences among a set of items or options. Preference learning has broad applications, including recommender systems, information retrieval, and decision-making. This article provides a comprehensive overview of preference learning algorithms, covering various aspects such as object preferences, label preferences, utility-based models, and recent advancements in aligning Large Language Models (LLMs) with human preferences.
Object Preferences
Object preferences involve learning user preferences based on the characteristics or attributes associated with the objects themselves. For example, in food preference learning, the goal is to understand user preferences among different foods by analyzing data such as "I prefer food A over food B." Here, the user implicitly expresses preferences for specific attributes of the food, such as ingredients and cooking methods.
Gaussian Processes for Object Preference
Gaussian Processes (GPs) offer a powerful and flexible framework for modeling object preferences. Unlike traditional statistical methods, which typically assume a utility function that is linear in the covariates, and unlike neural-network approaches, which require an explicit parameterization of the utility function, GPs place a nonparametric prior directly on the utility function itself. This simplifies both the modeling treatment and the model inference process.
Label Preferences
Label preferences involve learning user preferences based on the labels associated with the objects. For instance, in travel preference learning, the objective is to predict user preferences for different modes of transportation (labels) such as cars, trains, buses, and bicycles, based on data collected in the form "I prefer using a train to go from A to B." The attributes considered in this context pertain to the user, such as age, education, and occupation.
Utility-Based Models
A fundamental result in economics states that a rational preference relation can be represented by a utility function. This means that when presented with various alternatives, a rational subject tends to prefer the option with the highest utility. In preference learning, the utility function is latent (not directly observable), and the objective is to learn this latent function based on the observed subject's preferences.
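The idea above can be sketched in a few lines: if preferences are rational, choosing among alternatives reduces to picking the item with the highest utility. The items, attributes, and weights below are purely illustrative assumptions, not data from any real study.

```python
# Hypothetical linear utility over item attributes; a rational subject's
# choice is simply the alternative with the highest utility.
def utility(item, weights):
    return sum(w * x for w, x in zip(weights, item))

items = {"A": (1.0, 0.2), "B": (0.5, 0.9), "C": (0.3, 0.4)}
weights = (0.6, 0.4)  # assumed attribute weights (illustrative only)

# The subject prefers the option with the highest utility.
choice = max(items, key=lambda k: utility(items[k], weights))
```

In preference learning the `weights` (or the whole utility function) are unknown, and the task is to recover them from observed comparisons.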
Random Utility Models
To account for the inherent noise and uncertainty in human preferences, random utility models (RUMs) are often employed. These models assume that a subject's preference is determined by a noisy utility function. The diversity of random utility models arises from different assumptions about the distribution of this noise. Examples of proposed distributions for the noise include Gaussian and Gumbel distributions.
Learning from Choice Data
In many applications, we do not directly observe preferences but only the choices a subject makes among a set of alternatives. Under certain rationality assumptions, it is still possible to learn the underlying utility function(s) that determine the subject's choices.
Non-Utility-Based Approaches
An alternative approach to preference learning aims to directly learn a two-argument function that represents the preference relation. In machine learning, this is a predominant method for preference learning because it allows the practitioner to frame the problem as a classification task.
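The classification framing can be sketched as follows: each comparison "a preferred over b" becomes a binary example on the feature difference, and a logistic model on that difference is a Bradley-Terry-style preference predictor. The data here is synthetic and the hidden weights are an assumption for illustration.

```python
import numpy as np

# Synthetic pairwise data: label 1 when item a has higher (hidden) utility.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])                # hidden attribute weights
X_a, X_b = rng.normal(size=(200, 2)), rng.normal(size=(200, 2))
y = ((X_a - X_b) @ true_w > 0).astype(float)  # observed preferences

# Logistic regression on the difference vector via plain gradient descent.
w = np.zeros(2)
d = X_a - X_b
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(d @ w)))        # P(a preferred over b)
    w -= 0.1 * d.T @ (p - y) / len(y)         # logistic-loss gradient step
```

The learned `w` points in the direction of the hidden weights, so the two-argument preference function `f(a, b) = sigmoid((a - b) @ w)` is recovered as an ordinary classifier.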
Preference Learning and Large Language Models (LLMs)
Recent advancements in natural language processing have led to the development of Large Language Models (LLMs) with remarkable capabilities. However, aligning these models with human preferences remains a significant challenge. Several promising methods have emerged to address this challenge, including Direct Preference Optimization (DPO), Identity Preference Optimisation (IPO), and Kahneman-Tversky Optimisation (KTO).
Direct Preference Optimization (DPO)
Direct Preference Optimization (DPO) has emerged as a promising alternative for aligning Large Language Models (LLMs) to human or AI preferences. Unlike traditional alignment methods, which are based on reinforcement learning, DPO recasts the alignment formulation as a simple loss function that can be optimised directly on a dataset of preferences {(x, y_w, y_l)}, where x is a prompt and y_w, y_l are the preferred and dispreferred responses.
DPO avoids the iterative RL loop of PPO. Instead of training a separate reward model, it updates the LLM's policy with a loss that compares the log-probabilities each model assigns to the two responses: the current model being trained and a frozen reference model (often an older version of the LLM). The loss is designed to increase the relative log-probability of the preferred response and decrease that of the dispreferred one, while implicitly encouraging the current model to stay close to the behavior of the reference model.
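A toy scalar version of the DPO loss makes the structure explicit; real implementations operate on batched per-token log-probabilities summed over each response, but the shape of the objective is the same.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss:
        -log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])
    where logp_* are the response log-probabilities under the policy being
    trained and ref_logp_* the same quantities under the frozen reference."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy agrees exactly with the reference, the margin is zero and the loss is log 2; raising the preferred response's probability relative to the reference pushes the loss below that baseline, which is the gradient signal DPO trains on.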
Kahneman-Tversky Optimisation (KTO)
ContextualAI recently proposed an interesting alternative called Kahneman-Tversky Optimisation (KTO), which defines the loss function entirely in terms of individual examples that have been labelled as "good" or "bad" (for example, the 👍 or 👎 icons one sees in chat UIs).
Empirical Evaluation of DPO, IPO, and KTO
Empirical analysis has demonstrated that DPO and IPO can achieve comparable results, outperforming KTO in a paired preference setting. However, DPO tends to quickly overfit on the preference dataset.
When performing preference alignment, it is crucial to choose the right set of hyperparameters, especially the β parameter, which controls the strength of the regularisation toward the reference model.
Preference Learning and Reinforcement Learning (RL)
Preference learning techniques can be integrated with reinforcement learning (RL) to enable learning from non-numerical feedback in sequential domains. This approach, known as preference-based reinforcement learning, offers the advantage of defining feedback that is independent of arbitrary reward choices/shaping/engineering.
Proximal Policy Optimization (PPO)
Proximal Policy Optimization (PPO) is a powerful and reliable RL algorithm, often the starting point for RLHF. PPO is like teaching your LLM to walk step-by-step, making sure it doesn’t stumble and fall with each update. It makes gentle changes to the LLM’s “walking style” (policy).
PPO’s Key Players:
- Policy (LLM): The LLM we’re training to generate better text.
- Reward Model: The AI judge that scores text based on human preferences.
- Value Function (Critic - The Assistant Coach): Another AI model that acts like an “assistant coach.” It estimates how “good” each state is (how promising is the current text generation). This helps PPO make smarter updates.
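The "gentle changes" intuition above corresponds to PPO's clipped surrogate objective: the new-to-old probability ratio is clipped so that no single update can move the policy too far. A minimal numpy sketch of the loss (to be minimised):

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO clipped surrogate loss. The probability ratio new/old is clipped
    to [1 - eps, 1 + eps], so each gradient step is a 'gentle' policy change."""
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))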
Group Relative Policy Optimization (GRPO)
GRPO, from DeepSeek AI, is a smart twist on PPO, designed to be even more efficient, especially for complex reasoning tasks. GRPO is like PPO’s streamlined cousin. It keeps the core PPO idea but removes the need for the separate value function (critic), making it lighter and faster.
GRPO’s trick is group-based advantage estimation: instead of relying on a critic, it samples a group of LLM-generated responses for the same prompt and estimates “how good” each response is relative to the others in the group.
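In its simplest form, the group-relative advantage is just each response's reward standardised against the group's mean and standard deviation, which is why no learned value function is needed. A minimal sketch:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage estimate: for a group of responses to the same
    prompt, each response's advantage is its reward standardised against the
    group mean and standard deviation - no critic model required."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Responses that beat the group average get positive advantages and are reinforced; below-average responses get negative advantages and are suppressed.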
Preference Learning with Chain-of-Thought (CoT) Reasoning
For complex tasks like math problems, coding, or logical reasoning, it is crucial to reward LLMs for generating correct and helpful chain-of-thought reasoning. Chain-of-Thought (CoT) reasoning involves generating a sequence of intermediate "thoughts" or reasoning steps, leading up to the final answer. This makes the LLM's reasoning process more transparent, improves accuracy, and facilitates debugging.
Active Preference Learning
Active preference learning focuses on efficiently collecting preference data by strategically selecting the most informative queries to present to the user. This approach aims to minimize the amount of data required to learn an accurate preference model.
Thompson Sampling
Thompson Sampling (TS) is among the most conceptually transparent and empirically successful methods for decision-making under uncertainty. It converts Bayesian inference into a simple randomized decision rule: sample a plausible world, then act optimally within it.
Multi-Armed Bandit (MAB) Problems
The multi-armed bandit (MAB) problem involves a gambler deciding which slot-machine lever to pull to maximize the winning rate, without knowing which machine is the most rewarding. This scenario highlights the need to balance exploration (trying new machines to discover potentially higher rewards) and exploitation (using current knowledge to maximize gains).
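Thompson Sampling handles this exploration-exploitation trade-off directly. On a Bernoulli bandit with Beta(1, 1) priors, the "sample a plausible world, then act optimally within it" recipe becomes a few lines of numpy (win rates and round counts below are illustrative):

```python
import numpy as np

def thompson_bernoulli(true_probs, n_rounds, seed=0):
    """Thompson Sampling on a Bernoulli bandit with Beta(1, 1) priors:
    sample a win rate for each arm from its posterior, pull the argmax,
    then update that arm's Beta posterior with the observed reward."""
    rng = np.random.default_rng(seed)
    k = len(true_probs)
    wins, losses = np.ones(k), np.ones(k)  # Beta posterior parameters
    pulls = np.zeros(k, dtype=int)
    for _ in range(n_rounds):
        theta = rng.beta(wins, losses)     # sample a plausible world
        arm = int(np.argmax(theta))        # act optimally within it
        reward = rng.random() < true_probs[arm]
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return pulls
```

As the posteriors sharpen, the sampled win rates concentrate around the true values and the algorithm pulls the best arm almost exclusively.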
Contextual Bandits
Contextual bandits extend multi-armed bandits by making decisions conditional on the state of the environment and previous observations. The benefit of such a model is that observing the environment can provide additional information, potentially leading to better rewards and outcomes.
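One standard concrete instance is LinUCB, which fits a linear reward model per arm and adds an exploration bonus that shrinks as similar contexts are observed. The sketch below keeps only the sufficient statistics (A, b) per arm; function names are mine, not from a specific library.

```python
import numpy as np

def linucb_choose(context, A_list, b_list, alpha=1.0):
    """One LinUCB decision: per arm, estimate theta = A^{-1} b and add an
    exploration bonus proportional to the model's uncertainty in this
    context. Returns the arm with the highest upper confidence bound."""
    x = np.asarray(context, dtype=float)
    scores = []
    for A, b in zip(A_list, b_list):
        A_inv = np.linalg.inv(A)
        theta = A_inv @ b                       # ridge-regression estimate
        bonus = alpha * np.sqrt(x @ A_inv @ x)  # uncertainty bonus
        scores.append(theta @ x + bonus)
    return int(np.argmax(scores))

def linucb_update(arm, context, reward, A_list, b_list):
    """Update the chosen arm's sufficient statistics with the new observation."""
    x = np.asarray(context, dtype=float)
    A_list[arm] += np.outer(x, x)
    b_list[arm] += reward * x
```

With `alpha = 0` the rule is pure exploitation of the per-arm linear estimates; increasing `alpha` trades that off against exploring arms whose reward in the current context is still uncertain.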

