Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

Large Language Models (LLMs) have demonstrated promising reasoning capabilities that approach average human performance in a number of domains. These breakthroughs have extended the utility of LLMs from traditional chat and text-based applications to more dynamic, agentic roles. However, LLMs still struggle to generalize effectively in interactive, multi-step environments, since they are not natively trained for such applications. To address these challenges, a novel approach called Agent Q has been introduced, combining several key concepts in reasoning, search, self-critique, and reinforcement learning to improve the autonomy of AI agents in dynamic environments.

The Challenge of Multi-Step Reasoning in LLMs

While Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning, their application in agentic, multi-step reasoning within interactive environments remains a difficult challenge. Reasoning and planning have indeed been highlighted as core challenges for current LLMs. Traditional supervised pre-training on static datasets falls short in enabling autonomous agent capabilities needed to perform complex decision-making in dynamic settings like web navigation. Recent works have tried to overcome this challenge by supervised fine-tuning on curated expert demonstrations in such environments; however, such behavior cloning objectives suffer from compounding errors and yield sub-optimal policies due to limited exploration data.

Introducing Agent Q: A Novel Approach

Agent Q is an advanced AI framework designed to improve the autonomy of AI agents in dynamic environments, such as web interfaces. To overcome these challenges, the framework combines guided Monte Carlo Tree Search (MCTS) with a self-critique mechanism and iterative fine-tuning on agent interactions, using an off-policy variant of the Direct Preference Optimization (DPO) algorithm. The method allows LLM agents to learn effectively from both successful and unsuccessful trajectories, thereby improving their generalization in complex, multi-step reasoning tasks.

The goal is to design an approach that allows the agent to improve from autonomous experience with limited supervision. The challenges here are even greater than in plain text generation, as the model must also understand how its actions affect its environment. In the experiments, Agent Q uses LLaMA 3 70B as the base model.

Key Components of the Agent Q Framework

The Agent Q framework incorporates several key components: Monte Carlo Tree Search (MCTS), a self-critique mechanism, and Direct Preference Optimization (DPO).


Monte Carlo Tree Search (MCTS)

Inspired by the success of search-based methods in prior game-playing settings and mathematical reasoning, Agent Q deploys a Monte Carlo Tree Search (MCTS) based search routine over web pages to guide agent exploration. MCTS helps the model explore its environment in a way that balances trying new actions (exploration) and relying on known successful actions (exploitation). Given the complexity of the environment, a base LLM is used for sampling possible rationales and web actions to explore.

In this context, MCTS allows the model to plan several steps ahead rather than reacting blindly to each new situation. By sampling different trajectories of actions and evaluating their outcomes, MCTS improves the agent’s performance over time.
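To illustrate the exploration/exploitation balance, here is a minimal sketch of the UCB1 selection rule commonly used in MCTS. The dictionary-based node representation is a simplification for illustration, not Agent Q's actual data structure; in the paper's setting, node values would also incorporate self-critique feedback.

```python
import math

def ucb1(value_sum, visits, parent_visits, c=1.0):
    """UCB1 score: mean value (exploitation) plus an exploration bonus."""
    if visits == 0:
        return float("inf")  # unvisited actions are always tried first
    mean_value = value_sum / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return mean_value + exploration

def select_action(children, parent_visits):
    """Pick the child action with the highest UCB1 score."""
    return max(children, key=lambda ch: ucb1(ch["value"], ch["visits"], parent_visits))
```

An action visited rarely but with a decent average outcome can outrank a frequently visited one, which is exactly how the search keeps probing alternative branches of a web-navigation task.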

AI Self-Critique Mechanism

To guide the search, the same model is utilized as a zero-shot critic evaluator at each node branch in a form of AI self-critique. The model learns not only from successful actions but also from its mistakes. Every time the model takes an action that doesn’t lead to the desired outcome, it evaluates what went wrong. This allows the model to avoid repeating the same mistakes and helps it refine its decision-making strategy. This is particularly useful in complex environments where a single wrong action can result in an overall failure of the task.
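The critique step can be thought of as ranking candidate (rationale, action) pairs by a score. In Agent Q the critic is the LLM itself, prompted zero-shot; in this hypothetical sketch, any scoring callable stands in for it.

```python
def rank_actions(candidates, critic):
    """Rank candidate actions by a critic's score, best first.

    `critic` stands in for the zero-shot LLM evaluator; here it is any
    callable that returns a numeric score for a candidate.
    """
    scored = [(critic(c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]
```

The ranked list then steers the search toward the branches the critic judges most promising.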

Direct Preference Optimization (DPO)

To reduce the need for extensive online interaction, and because actions on the live web cannot be rolled back, the traces generated by the search process are used to improve the model offline: it learns from both the successful and unsuccessful trajectories via offline reinforcement learning, using the Direct Preference Optimization (DPO) algorithm.

Rather than simply training the model to succeed or fail, DPO allows the model to learn from preferences between different actions. For example, even if multiple actions lead to success, DPO helps the model identify which action was more efficient or effective. This helps fine-tune the model to choose optimal actions, making it more reliable and efficient in its tasks.


How DPO Works

DPO is a method for aligning language models with human preferences without using reinforcement learning. The key steps are:

  1. Start with a pre-trained language model.
  2. Collect human preference data by having humans compare pairs of model outputs and choose which they prefer.
  3. Use this preference data to directly optimize the model’s policy using a specially derived loss function.

The core of DPO is its loss function, derived from the Bradley-Terry model of preferences together with a KL-divergence term that keeps the optimized policy close to the original model. By leveraging pairwise comparisons between outputs directly, DPO optimizes the model's behavior with a simpler and more stable objective than reward-model-based training.

The core idea behind DPO is to frame model alignment as a preference-learning problem. Instead of assigning rewards to individual outputs as in RLHF, DPO works by comparing pairs of model outputs and optimizing the model to prefer the better option. For two outputs, y_w (winner) and y_l (loser), the loss function is derived from the Bradley-Terry probability of preferring one output over the other:

  P(y_w ≻ y_l | x) = σ(r(x, y_w) − r(x, y_l))

Here, r(x, y) represents the reward function for an output y given an input x, and σ (the sigmoid function) translates the difference in rewards into a probability.

Rather than being learned from an external feedback signal, the reward function in DPO is expressed implicitly through the policy itself, as a scaled log-ratio between the optimized policy and the reference model. The optimization objective is to maximize the log-likelihood of the human-preferred response under this implicit reward, effectively teaching the model to align more closely with human preferences without the need for complex reinforcement learning strategies.
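Under this formulation, the loss for a single preference pair can be sketched in a few lines of plain Python. The beta coefficient scales the implicit log-ratio reward; the default of 0.1 below is an illustrative choice, not a value from the paper.

```python
import math

def sigmoid(x):
    """Logistic function, mapping a reward difference to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (winner, loser) preference pair.

    logp_w / logp_l         : log-probs of winner/loser under the policy being trained
    ref_logp_w / ref_logp_l : log-probs under the frozen reference model
    The implicit reward is beta * (policy log-prob - reference log-prob).
    """
    reward_w = beta * (logp_w - ref_logp_w)
    reward_l = beta * (logp_l - ref_logp_l)
    return -math.log(sigmoid(reward_w - reward_l))
```

When the policy has not moved from the reference model, both implicit rewards are zero and the loss is log 2; raising the winner's likelihood relative to the reference drives the loss toward zero.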


In summary, this loss encourages the model to:

  • Increase the probability of generating preferred outputs
  • Decrease the probability of generating non-preferred outputs
  • Stay close to the original model’s distribution

Comparison to RLHF

RLHF and DPO both aim to align language models with human preferences, but they differ in their approach:

RLHF:

  • Trains a separate reward model on human preference data
  • Uses reinforcement learning (typically PPO) to optimize the language model using the reward model
  • Requires careful tuning and can be unstable

DPO:

  • Directly optimizes the language model using preference data
  • Doesn’t require a separate reward model or reinforcement learning
  • Is simpler to implement and more stable

DPO achieves similar or better results compared to RLHF while being computationally more efficient and easier to implement. This approach significantly simplifies the learning process, replacing RLHF’s policy gradients and reward models with a straightforward log-likelihood-based loss. This not only reduces the risk of instability seen in RLHF but also ensures more efficient training by directly optimizing for human-aligned preferences.

In the Agent Q framework, DPO plays a critical role in fine-tuning the agent’s decision-making process. This makes the agent much more effective compared to models that only learn from success or failure feedback.

Validation and Results

The framework was tested in the WebShop environment, a simulated e-commerce platform, and in real-world booking tasks such as restaurant reservations. In WebShop, Agent Q consistently outperforms behavior cloning and reinforced fine-tuning baselines, and beats average human performance when equipped with the capability to search online. In the real-world booking experiments, the Agent Q framework improves the model's zero-shot absolute success rate from 18.6% to 81.7% (a roughly 340% relative increase) after a single day of autonomous data collection, outperforming GPT-4. When further equipped with online search capability, the absolute success rate improves to 95.4%.

The use of MCTS stands out as a significant advancement. It proposes several actions, tests them, and selects the most promising one, enabling the agent to complete multi-step tasks more reliably. The self-critique mechanism is another important innovation. Most models only learn from explicit successes, but in the real world, failure often provides even more valuable information. By analyzing unsuccessful trajectories, Agent Q enhances its reasoning, making it better prepared to handle similar tasks in the future.

POMDP Setup

The framework considers a general POMDP setup (𝒪, 𝒮, 𝒜, T, R, μ₀, γ), where 𝒪 denotes the observation space, 𝒮 the unobserved state space, 𝒜 the action space, T(s_{t+1} | s_t, a_t) the transition distribution (here, the dynamics of a web browser), R(s, a) the reward function (this work uses sparse rewards of 1/0 representing success/failure), μ₀(s₀) the initial state distribution, and γ the discount factor, which is set to 1.

A POMDP is the most suitable framework to model web interactions for several reasons:

  • Novel environments, which the agent is unfamiliar with, require exploration in order to locate the task objective.
  • The real web is dynamic, which creates partial observability of the current state each time the agent is deployed.

The agent observations oₜ ∈ 𝒪 are commands/information given by the user and the web browser: the initial observation contains the user's task, and subsequent observations consist of web pages from the browser, represented in HTML DOM format. The agent actions aₜ ∈ 𝒜 are the web commands the agent can issue to the browser, such as clicking elements or typing text.
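With sparse 1/0 rewards and γ set to 1, the return of an episode reduces to plain success or failure, as this small sketch shows:

```python
def episode_return(rewards, gamma=1.0):
    """Discounted return of one episode: sum of gamma^t * r_t.

    With gamma = 1 and a sparse terminal reward of 1/0, this is simply
    1 for a successful trajectory and 0 for a failed one.
    """
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```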

Related Work

Agent Q touches on a large number of research directions around agent design, self-improvement, reasoning, and reinforcement learning.

The latest generation of Large Language Models (LLMs) has demonstrated promising emerging properties around reasoning and planning, which have also become an integral part of agentic design. Another emerging research direction is built around step-by-step verifiers, or "Process Reward Models," specifically for mathematical reasoning. A number of concurrent works have further explored tree-based search approaches in combination with DPO training for math-based reasoning. These algorithms optimize actions at the node level, using different branches produced by the search algorithm to create preference pairs. The approach shares similarities with this self-supervised search combined with AI-based feedback to guide intermediate search steps, but is the first to scale it to a realistic agent setting. Similar approaches were proposed in other works; however, those works only use the base model's zero-shot capability without further training, and are evaluated only on simulated environments.

The strength and capabilities of recent pretrained Large Language (Vision) Models, LL(V)Ms, have significantly boosted progress in developing autonomous web agents. Improved code understanding and longer context windows have allowed agents to represent environment state and action space with the document object model (DOM), enabling deployment in complex and realistic domains. Moreover, strong reasoning and planning capabilities have led to the development of a number of promising agents. Beyond using LL(V)Ms as plug-and-play planners/policies, recent works have sought to improve agent-specific performance, for example through online exploration, planning, error correction, and self- or AI-critique. However, with few exceptions, these agents mostly provide a framework around a strong pre-existing model such as GPT-4V, or deploy only limited fine-tuning and adaptation. Model training is crucial for continuous improvement.

Reinforcement learning has become a significant component of training modern generative AI systems. Classical approaches deploy the PPO algorithm or similar policy-gradient methods, and have even been scaled to autonomous web-search agents as well as embodied applications with vision-language models (in simulation). However, these algorithms are challenging to apply due to their complexity and the large number of online samples they require from the model. Implicit Language Q-learning and the Q-transformer are offline RL algorithms designed for auto-regressive transformer models, and hence can be safely trained on pre-collected datasets; however, they have not been successfully scaled to modern LLMs. While these methods formulate a token-level MDP, others have shown success formulating the RL problem at the step level, and these ideas have recently been scaled to a general device-control agent.

Limitations and Future Directions

While Agent Q performs exceptionally well with guided exploration, it still faces challenges in “zero-shot” situations, where the model must handle completely new tasks without prior experience. This gap between training in known environments and generalizing to new ones remains a key area for improvement in future research.
