Generative Adversarial Imitation Learning: A Comprehensive Tutorial
Introduction
Imitation learning (IL) offers a unifying lens on the supervised learning paradigm: a foundation model is trained by imitating a large corpus of domain-specific demonstrations. This paradigm underlies many of the most impressive advances in generative AI, including large language model pre-training, robotic behavior foundation models, and foundation models for chemistry and the life sciences. In essence, imitation learning gives AI agents the ability to learn by observing expert behavior, mirroring how humans acquire new skills.
The Essence of Imitation Learning
At its core, imitation learning involves providing an agent with samples of expert behavior. These samples can take various forms, such as videos of humans playing online games or recordings of cars driving on the road. The agent's objective is to learn a policy that closely mirrors this expert behavior. This contrasts with reinforcement learning (RL), where the goal is to learn a policy that maximizes a predefined reward function.
A significant advantage of imitation learning is that it circumvents the need for meticulously hand-designed reward functions. Since it relies solely on expert behavior data, it scales more easily to real-world tasks where gathering such data is feasible. Imitation learning has played a crucial role in AI for decades, with early approaches emerging in the 1990s and early 2000s. These early methods used supervised learning to train a model that maps environment observations to the actions taken by the expert.
Imitation Learning as a Unifying Framework
This tutorial treats imitation learning (IL) as a unifying framework through which to study the supervised learning paradigm: training a foundation model by imitating a large corpus of domain-specific demonstrations, an approach at the heart of many of the most impressive advances in generative AI, from large language model pre-training to robotic behavior foundation models and foundation models for chemistry and the life sciences.
The aim of this tutorial is to:
- Give an overview of recent theoretical advances that aim to understand when and why imitation learning can succeed with powerful generative models.
- Explain why the field has converged to certain interventions and best practices that are now ubiquitous.
- Highlight new opportunities for transfer between theory and practice.
A running theme is understanding domain-specific challenges and solutions. The tutorial will examine how discrete settings (language modeling) and continuous settings (robotics) require different algorithmic interventions, including action chunking, score-matching, and interactive data collection. In parallel, the tutorial will unify seemingly disparate techniques: next-token prediction in language models becomes behavior cloning with log-loss, while exposure bias in autoregressive generation mirrors the compounding error phenomenon in control.
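To make the unification concrete, here is a minimal sketch (with made-up logits and a toy four-token vocabulary) of the observation that next-token prediction is behavior cloning with log-loss: each prefix plays the role of a "state," the next token is the "action," and the training loss is the negative log-likelihood of the expert's actions.

```python
import numpy as np

def bc_log_loss(logits, tokens):
    """Average negative log-likelihood of the expert's next tokens
    under the policy -- the BC log-loss / next-token cross-entropy."""
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(tokens)), tokens].mean()

# Three prefixes ("states") over a vocabulary of 4 tokens ("actions").
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 4))        # policy's per-prefix token scores
expert_tokens = np.array([1, 3, 0])     # the expert's (demonstrated) tokens

loss = bc_log_loss(logits, expert_tokens)
```

A uniform policy incurs loss log(vocabulary size), the same sanity check used for both language models and BC policies over discrete actions.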
Classical Results in Imitation Learning
Several classical results have laid the foundation for modern imitation learning techniques. These include:
- Efficient reductions for imitation learning: This work focuses on developing efficient algorithms for imitation learning by reducing it to online learning problems. (Stéphane Ross and J. Andrew Bagnell. AISTATS, 2010.)
- A reduction of imitation learning and structured prediction to no-regret online learning: This research provides a framework for imitation learning and structured prediction by reducing them to no-regret online learning. (Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. ICML, 2011.)
- Reinforcement and imitation learning via interactive no-regret learning: This paper explores how interactive learning can be used to improve both reinforcement and imitation learning algorithms. (Stéphane Ross and J. Andrew Bagnell. ICML, 2014.)
- Generative adversarial imitation learning: This seminal work introduces the concept of using generative adversarial networks (GANs) for imitation learning. (Jonathan Ho and Stefano Ermon. NeurIPS, 2016.)
- Toward the Fundamental Limits of Imitation Learning: This study investigates the theoretical limits of imitation learning, providing insights into the factors that affect its performance. (Nived Rajaraman, Lin F. Yang, Jiantao Jiao, Kannan Ramachandran. NeurIPS, 2020.)
- Of moments and matching: a game-theoretic framework for closing the imitation gap: This paper presents a game-theoretic framework for understanding and addressing the imitation gap in imitation learning. (Gokul Swamy, Sanjiban Choudhury, J. Andrew Bagnell, and Zhiwei Steven Wu. ICML, 2021.)
The Challenge of Behavioral Cloning
One of the earliest and most straightforward approaches to imitation learning is Behavioral Cloning (BC). BC involves training a machine learning model to map environment observations to the actions taken by an expert. This is essentially a supervised learning problem where the model learns to mimic the expert's behavior.
However, BC suffers from a critical drawback: it lacks guarantees that the model will generalize to unseen environmental observations. When the agent encounters a situation not present in the expert trajectories, BC is prone to failures. For instance, a self-driving car trained with BC might not know how to react if it deviates from the expert's trajectory, potentially leading to accidents.
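The supervised-learning view of BC can be sketched in a few lines. This is a toy illustration with synthetic data, not a full implementation: a linear policy is fit to expert (state, action) pairs by gradient descent on a squared loss.

```python
import numpy as np

# Toy behavioral cloning: recover a (hypothetical) linear expert policy
# from its state-action pairs via plain gradient descent on MSE.
rng = np.random.default_rng(0)
W_expert = np.array([[0.5], [-0.3]])     # unknown expert mapping
states = rng.normal(size=(200, 2))       # states visited by the expert
actions = states @ W_expert              # expert's (continuous) actions

W = np.zeros((2, 1))                     # policy parameters to learn
lr = 0.1
for _ in range(500):
    pred = states @ W
    grad = states.T @ (pred - actions) / len(states)
    W -= lr * grad                       # supervised regression step
```

Note that the policy is only ever trained on states the expert visited, which is exactly why errors compound once the learned policy drifts off that distribution.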
Addressing the Limitations of Behavioral Cloning
A simple fix, Dataset Aggregation (DAgger), was proposed to interactively collect additional expert data so the agent can recover from its mistakes, and it was used to create the first autonomous drone that could navigate forests. However, DAgger requires a human in the loop, and such interactive access to an expert is usually infeasible. Instead, we want to emulate the trial-and-error process that humans use to fix mistakes.
Generative Adversarial Imitation Learning (GAIL)
In 2016, Ho and Ermon introduced Generative Adversarial Imitation Learning (GAIL), framing inverse reinforcement learning as a minimax game between two AI models, drawing parallels to Generative Adversarial Networks (GANs). In this formulation, the agent's policy model (the "generator") interacts with an environment and uses RL to maximize the rewards it receives from a reward model, while the reward model (the "discriminator") attempts to distinguish the policy's behavior from expert behavior. Thus, if the policy does something that is not expert-like, it receives a low reward from the discriminator and learns to correct this behavior. This minimax game has a unique equilibrium called the saddle-point solution (named for the geometric saddle shape of the optimization landscape). At the equilibrium, the discriminator learns a reward under which the policy's behavior is indistinguishable from the expert's. With this adversarial learning of a policy and a discriminator, it is possible to reach expert performance using only a few demonstrations. Techniques inspired by this formulation are referred to as adversarial imitation.
In essence, GAIL trains two networks simultaneously:
- Generator (Policy): A policy network that learns to generate actions in the environment.
- Discriminator (Reward): A reward function that tries to distinguish between the actions of the generator and the actions of the expert.
The generator aims to fool the discriminator by producing actions that are indistinguishable from the expert's actions, while the discriminator aims to correctly identify the source of the actions. This adversarial process drives the generator to learn a policy that closely mimics the expert's behavior.
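The discriminator half of this loop can be sketched with a toy logistic-regression classifier over synthetic state-action features (everything below — the data, features, and learning rate — is made up for illustration; real GAIL uses neural networks and an RL inner loop for the policy).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
expert_sa = rng.normal(loc=+1.0, size=(500, 2))   # expert (s, a) features
policy_sa = rng.normal(loc=-1.0, size=(500, 2))   # current policy's features

# Train the discriminator D by gradient ascent on
#   E_expert[log D] + E_policy[log(1 - D)].
w, b = np.zeros(2), 0.0
for _ in range(200):
    d_exp = sigmoid(expert_sa @ w + b)
    d_pol = sigmoid(policy_sa @ w + b)
    grad_w = expert_sa.T @ (1 - d_exp) / 500 - policy_sa.T @ d_pol / 500
    grad_b = (1 - d_exp).mean() - d_pol.mean()
    w += 0.5 * grad_w
    b += 0.5 * grad_b

# The policy is then rewarded for looking expert-like, e.g. via
# r(s, a) = -log(1 - D(s, a)), which an RL algorithm would maximize.
reward = -np.log(1.0 - sigmoid(policy_sa @ w + b) + 1e-8)
```

Because the reward comes from a classifier rather than a hand-designed function, any behavior the discriminator can tell apart from the expert's is automatically penalized.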
Challenges with GAIL
Unfortunately, as adversarial imitation is based on GANs, it suffers from the same limitations, such as mode collapse and training instability, and so training requires careful hyperparameter tuning and tricks like gradient penalization. Furthermore, the process of reinforcement learning complicates training because it is not possible to train the generator here through simple gradient descent. This amalgamation of GANs and RL makes for a very brittle combination, which does not work well in complex image-based environments like Atari.
Inverse Q-Learning (IQ-Learn)
To address the limitations of existing approaches, a novel algorithm called Inverse Q-Learning (IQ-Learn) has been proposed. IQ-Learn offers a non-adversarial approach to imitation learning, potentially resolving many challenges in the field.
In RL, Q-functions measure the expected sum of future rewards an agent can obtain starting from the current state and choosing a particular action. By learning Q-functions using a neural network that takes in the current state and a potential action of the agent as input, one can predict the overall expected future reward obtained by the agent. Because the prediction is of the overall reward, as opposed to only the reward for taking that one step, determining the optimal policy is as simple as sequentially taking actions with the highest predicted Q-function values in the current state. This optimal policy can be represented as the argmax over all possible actions for a Q-function in a given state. In IL, a simple, stable, and data-efficient approach has always been out of reach because of the above-mentioned issues with previous approaches. Additionally, the instability of adversarial methods makes the Inverse RL formulation hard to solve. A non-adversarial approach to IL could likely resolve many of the challenges the field faces.
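The "policy is just the argmax of Q" idea can be shown in a tabular toy example (the Q-values below are made up):

```python
import numpy as np

# Two states, three actions; entries are (hypothetical) Q(s, a) values,
# i.e. expected sums of future rewards.
Q = np.array([
    [1.0, 3.0, 2.0],   # state 0: action 1 has the highest Q-value
    [0.5, 0.2, 0.9],   # state 1: action 2 has the highest Q-value
])

# Greedy policy extraction: pick the highest-Q action in each state.
greedy_policy = Q.argmax(axis=1)
# greedy_policy == array([1, 2])
```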
The key insight behind IQ-Learn is that the Q-function can represent both the optimal behavior policy and the reward function, because for a given policy the mapping from single-step rewards to Q-functions is bijective. This lets us avoid the difficult minimax game over policies and reward functions seen in the adversarial imitation formulation by expressing both with a single variable: the Q-function. Plugging this change of variables into the original inverse RL objective yields a much simpler minimization problem over just the Q-function, which we refer to as the inverse Q-learning problem.
How IQ-Learn Works
Our Inverse Q-learning problem shares a one-to-one correspondence with the minimax game of adversarial IL in that each potential Q-function can be mapped to a pair of discriminator and generator networks.
Crucially, if we know the Q-function, then we explicitly know the optimal policy for it: in the soft (maximum-entropy) setting, the optimal policy is simply the softmax over the Q-values in the given state. So instead of optimizing over the space of all possible rewards and policies, we only need to optimize along the manifold in this space corresponding to a choice of Q-function and its optimal policy.
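A tabular sketch of this change of variables (toy Q-values, deterministic transitions assumed for simplicity): a single Q-function yields the soft-optimal policy via a softmax, and the corresponding reward via the inverse soft Bellman operator r(s, a) = Q(s, a) - γ·V(s').

```python
import numpy as np

gamma = 0.9
# Hypothetical Q-values for 2 states x 2 actions.
Q = np.array([
    [1.0, 2.0],    # Q(s0, .)
    [0.5, 0.1],    # Q(s1, .)
])
# Deterministic next state s' for each (s, a) pair (toy dynamics).
next_state = np.array([[1, 1],
                       [0, 0]])

# Soft value: V(s) = log sum_a exp Q(s, a).
V = np.log(np.exp(Q).sum(axis=1))

# Soft-optimal policy induced by Q: pi(a|s) = exp(Q(s, a) - V(s)).
policy = np.exp(Q - V[:, None])

# Inverse soft Bellman operator: recover the unique reward for this Q.
r = Q - gamma * V[next_state]
```

Because the reward is recovered in closed form from Q, no separate discriminator network is needed — this is the non-adversarial core of IQ-Learn.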
During learning with discrete action spaces, IQ-Learn optimizes the objective $\mathcal{J}^*$ directly, taking gradient steps with respect to the Q-function along this manifold and converging to the globally optimal saddle point. For continuous action spaces, computing the exact gradients is often intractable, so IQ-Learn additionally learns a policy network, updating the Q-function and the policy separately so as to stay close to the manifold.
The IQ-Learn update can be seen as a form of contrastive learning: expert behavior is assigned a high reward and the policy's behavior a low reward, with rewards parametrized by the Q-function.
Advantages of IQ-Learn
IQ-Learn offers several advantages over existing imitation learning algorithms:
- Simplicity: The algorithm is relatively simple to implement, requiring only a modified update rule to train a Q-network.
- Stability: IQ-Learn avoids the instability issues associated with adversarial training.
- Data Efficiency: IQ-Learn can reach expert performance with a small number of expert demonstrations.
- Versatility: IQ-Learn can be used for imitation without expert actions, relying solely on expert observations.
Experimental Results
Despite the simplicity of the approach, we were surprised to find that it substantially outperformed a number of existing methods, including much more complex or domain-specific ones, on popular imitation learning benchmarks such as OpenAI Gym, MuJoCo, and Atari. Across these benchmarks, IQ-Learn was the only method to reach expert performance from just a few expert demonstrations (fewer than 10). Beyond simple imitation, we also tested settings closer to the real world, where only partial expert data is available or where the expert's environment or goals differ from the agent's. We showed that IQ-Learn can be used for imitation without expert actions, relying solely on expert observations and thereby enabling learning from videos.
Imitation Learning in Discrete Settings
Contemporary research in imitation learning has yielded significant results in discrete settings, such as language modeling. These results shed light on the theoretical underpinnings of imitation learning and provide insights into its limitations.
- Is behavior cloning all you need? Understanding horizon in imitation learning: This paper explores the conditions under which behavior cloning can be sufficient for imitation learning, focusing on the role of the horizon (the number of decision steps in the task) in determining performance. (Dylan J. Foster, Adam Block, and Dipendra Misra. NeurIPS, 2024.)
- Computational-statistical tradeoffs at the next-token prediction barrier: Autoregressive and imitation learning under misspecification: This study investigates the computational and statistical tradeoffs involved in next-token prediction, a common task in language modeling, and how these tradeoffs relate to imitation learning under model misspecification. (Dhruv Rohatgi, Adam Block, Audrey Huang, Akshay Krishnamurthy, and Dylan J. Foster. COLT, 2025.)
- A theory of learning with autoregressive chain of thought: This research develops a theoretical framework for understanding learning with autoregressive chain of thought, a technique used in large language models to improve reasoning and problem-solving abilities. (Nirmit Joshi, Gal Vardi, Adam Block, Surbhi Goel, Zhiyuan Li, Theodor Misiakiewicz, and Nathan Srebro. COLT, 2025.)
- The Coverage Principle: How Pre-Training Enables Post-Training: This paper introduces the coverage principle, which explains how pre-training on large datasets enables effective post-training for various downstream tasks. (Fan Chen, Audrey Huang, Noah Golowich, Sadhika Malladi, Adam Block, Jordan T. Ash, Akshay Krishnamurthy, and Dylan J. Foster. Preprint, 2025.)
Imitation Learning in Continuous Settings
Imitation learning has also made significant strides in continuous settings, such as robotics. These advancements have led to the development of robust and efficient algorithms for learning complex motor skills.
- DART: Noise injection for robust imitation learning: This paper introduces DART, a technique that injects noise into the expert's actions to improve the robustness of imitation learning algorithms. (Michael Laskey, Jonathan Lee, Roy Fox, Anca Dragan, and Ken Goldberg. CoRL, 2017.)
- TaSIL: Taylor series imitation learning: This research proposes TaSIL, an imitation learning algorithm based on Taylor series expansion, which can effectively learn from limited data. (Daniel Pfrommer, Thomas Zhang, Stephen Tu, and Nikolai Matni. NeurIPS, 2022.)
- Provable guarantees for generative behavior cloning: Bridging low-level stability and high-level behavior: This paper provides theoretical guarantees for generative behavior cloning, demonstrating its ability to bridge low-level stability and high-level behavior. (Adam Block, Ali Jadbabaie, Daniel Pfrommer, Max Simchowitz, and Russ Tedrake. NeurIPS, 2023.)
- The pitfalls of imitation learning when actions are continuous: This study explores the challenges and pitfalls of imitation learning in continuous action spaces, providing insights into the factors that can affect performance. (Max Simchowitz, Daniel Pfrommer, and Ali Jadbabaie. COLT 2025.)
Optimization Perspectives in Imitation Learning
Optimization techniques play a critical role in imitation learning, influencing the convergence and generalization performance of algorithms.
- Acceleration of stochastic approximation by averaging: This work explores how averaging can be used to accelerate stochastic approximation algorithms, which are commonly used in imitation learning. (Boris T. Polyak and Anatoli B. Juditsky. SIAM Journal on Control and Optimization, 30(4):838-855, 1992.)
- Averaging weights leads to wider optima and better generalization: This paper demonstrates that averaging weights during training can lead to wider optima and better generalization performance. (Pavel Izmailov, Dmitrii Podoprikhin, Timur Garipov, Dmitry Vetrov, and Andrew Gordon Wilson. UAI, 2018.)
- Butterfly effects of SGD noise: Error amplification in behavior cloning and autoregression: This research investigates how the noise in stochastic gradient descent (SGD) can lead to error amplification in behavior cloning and autoregression. (Adam Block, Dylan J. Foster, Akshay Krishnamurthy, Max Simchowitz, and Cyril Zhang. ICLR, 2024.)
- EMA without the lag: Bias-corrected iterate averaging schemes: This paper introduces bias-corrected iterate averaging schemes that can improve the performance of exponential moving average (EMA) methods. (Adam Block and Cyril Zhang.)
Benchmarking Imitation Learning Algorithms
To provide a comprehensive evaluation of imitation learning algorithms, several benchmark environments have been used, including:
- OpenAI Gym: A toolkit for developing and comparing reinforcement learning algorithms.
- MuJoCo: A physics engine for simulating robotic systems.
- Atari: A suite of classic video games used for evaluating reinforcement learning agents.
Imitation Learning Beyond Basic Imitation
Beyond simple imitation, imitation learning can be extended to more complex scenarios, such as:
- Imitation with partial expert data: Learning from experts when only partial data is available.
- Imitation with changing environments or goals: Learning from experts when the environment or goals change.
- Imitation from observations only: Learning from expert demonstrations without access to actions.
Other Imitation Learning Algorithms
Behavior Cloning (BC)
Behavior Cloning learns a policy as a supervised learning problem over state-action pairs from expert trajectories. It is the most common and simplest approach to imitation learning, as it treats imitation as a purely supervised problem. However, it tends to succeed only with large amounts of training data and is inefficient due to compounding errors caused by covariate shift. BC also fails to capture the intuition that the expert's actions are purposeful, directed toward a goal the expert is trying to achieve. A common example of BC is mapping a car's front stereo camera to the steering angle using data captured by a human driver.
MaxEnt
MaxEnt is another method that trains a reward model separately (rather than iteratively). Its main idea is to maximize the probability of the expert trajectories under the current reward function. From there, the method derives its objective from the maximum entropy principle, which states that the most representative policy satisfying a given set of constraints is the one with the highest entropy H. MaxEnt therefore adds an objective that maximizes the entropy of the policy.
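The entropy term MaxEnt adds can be illustrated with two made-up action distributions: among policies consistent with the demonstrations, the principle prefers the least committed one.

```python
import numpy as np

def entropy(pi):
    """Shannon entropy H(pi) = -sum_a pi(a) log pi(a) of an action
    distribution (small epsilon guards against log(0))."""
    pi = np.asarray(pi, dtype=float)
    return -np.sum(pi * np.log(pi + 1e-12))

peaked = [0.97, 0.01, 0.01, 0.01]    # nearly deterministic policy
uniform = [0.25, 0.25, 0.25, 0.25]   # maximally uncommitted policy

# The uniform policy has strictly higher entropy; MaxEnt's extra
# objective pushes the learned policy toward such high-entropy solutions
# whenever the demonstrations do not pin the behavior down.
```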
Generative Adversarial Imitation Learning (GAIL)
The original work on GAIL was inspired by Generative Adversarial Networks (GANs), which apply adversarial training to enhance the generative abilities of a main model. The GAIL objective can be derived as minimizing a divergence between the expert's and the policy's state-action distributions (the Jensen-Shannon divergence, in the original formulation). The main benefit of GAIL over previous methods (and the reason it performs better) lies in its interactive training process.
Adversarial Inverse Reinforcement Learning (AIRL)
One remaining problem with GAIL is that the trained reward model (the discriminator) does not actually represent the ground-truth reward. Instead, the discriminator is trained as a binary classifier between expert and generator state-action pairs, so at convergence its output averages around 0.5 and carries little usable reward signal.
InfoGAIL
Despite the advancements made by previous methods, an important problem still persists in imitation learning: multi-modal learning. To apply IL to practical problems, it is necessary to learn from multiple possible expert policies. InfoGAIL was developed to address this issue. Inspired by InfoGAN, which conditions the style of a GAN's outputs using an additional style vector, InfoGAIL builds on the GAIL objective and adds another criterion: maximizing the mutual information between state-action pairs and a new controlling input vector z.
Challenges and Future Directions
Imitation learning is a challenging and fascinating field with numerous open research questions. Some key challenges and future directions include:
- Multi-agent Imitation Learning: Developing imitation learning algorithms for multi-agent systems.
- Scalability to real-world applications: Applying imitation learning to complex real-world problems.
- Robustness to noisy or imperfect data: Developing algorithms that are robust to noise and imperfections in expert data.
- Theoretical understanding of generalization: Developing a deeper theoretical understanding of the factors that affect generalization in imitation learning.
tags: #generative #adversarial #imitation #learning #tutorial

