Meta Agents Research Environments and the Gaia2 Benchmark: Advancing Agent Capabilities
The pursuit of artificial general intelligence (AGI) necessitates continuous improvements in large language models (LLMs) through reinforcement learning (RL). This progress hinges on the availability of diverse, controllable, and realistic environments for model training and evaluation. Current environments often fall short, being tightly coupled to specific tasks and idealized agent interactions, leading to rapid saturation and the need for frequent boilerplate code rewriting. To address these limitations, we introduce Meta Agents Research Environments (ARE), a research platform designed for scalable environment creation, integration of synthetic or real-world applications, and execution of agentic orchestrations.
Introducing Meta Agents Research Environments (ARE)
Meta Agents Research Environments (ARE) is a research platform designed to address the limitations of existing environments for agent development and evaluation. ARE supports running orchestrations, creating environments, and connecting synthetic or real-world apps for agent development and evaluation.
ARE proposes abstractions for simulation and verification that facilitate both the creation of diverse environments and tasks and the integration of existing ones, such as τ-bench. ARE also supports a shift from sequential to asynchronous interaction between an agent and its environment, unlocking new tasks and capabilities in the process, such as handling time. Though simulated, the platform is not divorced from reality: ARE supports connecting real apps, e.g., through Model Context Protocol (MCP) integration (Anthropic, 2024), so that model development, evaluation, and production deployment can stay consistent.
ARE's Key Features
- Scalable Environment Creation: ARE allows for the creation of diverse environments and tasks, as well as the integration of existing environments.
- Integration of Synthetic and Real-World Applications: ARE supports the connection of real applications through Model Context Protocol (MCP) integration.
- Agentic Orchestrations: ARE is designed for the execution of agentic orchestrations.
- Asynchronous Interaction: ARE supports asynchronous interaction between an agent and its environment, unlocking new tasks and capabilities, such as handling time.
- Event-Based, Time-Driven Simulations: ARE environments are event-based, time-driven simulations that run asynchronously from the agent and the user.
- Flexible Data Storage: The framework offers flexible options for data storage.
- Reproducible Evaluations: The environment runs deterministically given a fixed starting state and seed, ensuring reproducible evaluations.
- Support for Multi-Agent Setups: It can host one or multiple agents simultaneously, supporting both single-agent and multi-agent setups.
Core Principles of ARE
- Everything is an Event: ARE is time-driven and built on the principle that "everything is an event". Events are anything that happens in the Environment.
- Asynchronous Interactions: Notifications are queued by timestamp and exposed to agents through a notification queue, enabling asynchronous interactions.
- Dynamic Scenarios: ARE shifts from static, single-turn tasks to dynamic scenarios, capturing real-world complexity through temporal dynamics, events, and multi-turn interactions.
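The "everything is an event" and notification-queue principles above can be sketched as a timestamp-ordered priority queue that the agent drains asynchronously. This is an illustrative sketch, not the actual ARE API; the class names are hypothetical.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Event:
    """Anything that happens in the environment, ordered by timestamp."""
    timestamp: float
    description: str = field(compare=False)

class NotificationQueue:
    """Queues events by timestamp; the agent processes them asynchronously."""
    def __init__(self):
        self._heap = []

    def push(self, event):
        heapq.heappush(self._heap, event)

    def pop_due(self, now):
        """Return every event whose timestamp has passed the simulation clock."""
        due = []
        while self._heap and self._heap[0].timestamp <= now:
            due.append(heapq.heappop(self._heap))
        return due

# Usage: events arrive out of order but are exposed by timestamp.
queue = NotificationQueue()
queue.push(Event(5.0, "friend replies to message"))
queue.push(Event(1.0, "user sends task"))
notifications = queue.pop_due(now=2.0)  # only the first event is due
```

Because the queue is keyed on simulation time rather than arrival order, the environment can keep evolving while the agent is busy, which is what makes the interaction asynchronous.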
Components of ARE
- Apps: Collections of tools that interact with a data source. Each app starts in the simulation with an initial state and keeps track of changes as agents use its tools or as events occur in the environment.
- Tools: Functions scoped to one of three roles: agent, user, or env.
- AgentUserInterface: The communication channel between users and agents: messages are tool calls, and user messages generate notifications that agents can process asynchronously.
- System: Provides core simulation controls like get_current_time (query time), wait (pause for a duration), and wait_for_next_notification (pause until an event).
- Environment: A Markov Decision Process with states, observations, actions, and transition rules.
- Events: Any agent action or app-state change, timestamped and logged.
- Notifications: Messages from the Environment that inform the agent about Events.
How ARE Works
ARE environments allow scenarios to be played; a scenario typically contains tasks for the agent and verification logic. Whether initiated by agent or user, interactions happen through the same interfaces and are either tool calls or observations of tool outputs and notifications. ARE environments evolve continuously and are strictly decoupled from the agent: time advances in the simulation, and the environment continuously introduces events.
Scenarios in ARE
ARE shifts from static, single-turn tasks to dynamic scenarios. Scenarios attempt to capture real-world complexity through temporal dynamics, events, and multi-turn interactions. This enables evaluation of agent capabilities that cannot be assessed through traditional request-response paradigms. In practice, scenarios are implemented in a scenario.py containing the apps, scheduled events, and arbitrary verification logic. Scenarios typically start with an environment instance and a send_message_to_agent tool call, waking the agent up.
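A scenario.py of the kind described above might look like the following. This is a hypothetical sketch: the class names, event fields, and verification shape are illustrative assumptions, not the actual ARE schema.

```python
from dataclasses import dataclass

@dataclass
class ScheduledEvent:
    at: float        # simulation time at which the event fires
    app: str
    action: str
    payload: dict

def build_scenario():
    """Assemble apps, scheduled events, and verification logic."""
    apps = {"messaging": {"inbox": []}}
    events = [
        # The scenario starts by waking the agent up with the user task.
        ScheduledEvent(0.0, "agent_user_interface", "send_message_to_agent",
                       {"text": "Reply to Alice when she messages you."}),
        # A scheduled environment event the agent must react to later.
        ScheduledEvent(30.0, "messaging", "receive_message",
                       {"from": "Alice", "text": "Are we still on for lunch?"}),
    ]

    def verify(agent_write_actions):
        # Arbitrary verification logic: here, one reply to Alice suffices.
        return any(a.get("action") == "send_message" and a.get("to") == "Alice"
                   for a in agent_write_actions)

    return apps, events, verify

apps, events, verify = build_scenario()
```

The key structural point is that the scenario bundles initial app state, a timeline of events, and a verification function in one place.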
Gaia2: A Benchmark for General Agent Capabilities
Building on ARE, we introduce Gaia2, a new evaluation for agents, released together with its environment and associated content, similar to AppWorld (Trivedi et al., 2024) or ToolSandbox (Lu et al., 2024). Gaia2 retains the core principles of Gaia: verifiable tasks that are simple for humans but challenging for today’s models, and that align with actual model use. The benchmark also integrates our learnings from using Gaia: we propose more diverse tasks in a simulated but dense and realistic environment with built-in tools, improving signal-to-noise ratio and reproducibility.
Gaia2 departs from most agent benchmarks in two ways: (i) deeper and more realistic interactions between the agent and the environment, since both run asynchronously; and (ii) a shift from tasks to scenarios spanning an arbitrary period of time. Environment time passes independently of whether the agent acts, and the environment state is continuously updated with random or scheduled events, such as a friend replying to a message sent by a user or an agent.
Key Features of Gaia2
- Verifiable Tasks: Gaia2 consists of verifiable tasks that are simple for humans but challenging for today’s models.
- Alignment with Actual Model Use: The tasks in Gaia2 align with actual model use.
- Diverse Tasks: Gaia2 proposes more diverse tasks in a simulated but dense and realistic environment with built-in tools so that signal-to-noise ratio and reproducibility are better.
- Adaptability: Scenarios require agents to exhibit adaptability.
- Handling Ambiguity and Noise: Scenarios require agents to effectively handle ambiguity, noise, time, and collaboration with other agents.
- Asynchronous Interactions: Deeper and more realistic interactions between the agent and the environment, since both run asynchronously.
- Temporal Dynamics: A shift from tasks to scenarios spanning an arbitrary period of time.
Capabilities Targeted by Gaia2
- Adaptability: Agents must adapt to dynamic environments.
- Handling Ambiguity: Agents must handle ambiguities and noise.
- Collaboration: Agents must collaborate with other agents.
- Temporal Reasoning: Agents must operate under temporal constraints.
The Mobile Environment in ARE
We release ARE with an environment analogous to a mobile device, Mobile, in which Gaia2 lives. Mobile device environments offer a broad range of tasks with complex interactions between agents and their environment that are aligned with actual model use. Mobile uses turn-based interaction. During a turn, the environment operates asynchronously: the simulation clock advances while the agent processes information and selects actions. The agent’s computational time directly consumes simulated time, so slow responses have a quantifiable impact on the simulation. Between turns, the simulation pauses while awaiting user input.
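The turn mechanics above, where agent compute time consumes simulated time, can be sketched as follows. This is an illustrative sketch under that one assumption; the class is hypothetical, not Mobile's actual clock.

```python
import time

class SimulationClock:
    """During a turn, wall-clock time spent by the agent advances
    simulated time; between turns the clock is paused."""
    def __init__(self):
        self.sim_time = 0.0
        self._turn_start = None

    def begin_turn(self):
        self._turn_start = time.monotonic()

    def end_turn(self):
        # The agent's computation time directly consumes simulated time,
        # so a slow agent literally lets the world move on without it.
        self.sim_time += time.monotonic() - self._turn_start
        self._turn_start = None

clock = SimulationClock()
clock.begin_turn()
time.sleep(0.05)   # stand-in for the agent "thinking"
clock.end_turn()   # sim_time has advanced by at least 0.05s
```

This coupling is what makes latency measurable inside the benchmark rather than an out-of-band metric.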
Content Generation for Mobile
We create content for Mobile apps with synthetic data generated using Llama 3.3 70B Instruct. The primary challenge lies in generating coherent data across all applications: contacts in the Contacts app must match those in messaging apps, calendar events should align with user descriptions, etc. To address this, we define an app dependency graph to guide the generation process, though more complex inter-app dependencies remain unhandled at this stage.
To maximize diversity, we initiate populations by generating unstructured content, subsequently processed through structured decoding guided by individual app schemas. We repeat this process to create diverse Mobile instances, termed “universes”, sharing identical rules and applications but containing distinct content. For example, one universe centers on a retired French physics professor, while another focuses on a Chinese professional athlete. Each universe contains approximately 400K tokens of raw unstructured content on average. When accounting for the complete structured representation, universes reach approximately 800K tokens. Both estimates constitute lower bounds, as they do not include the contents of the filesystem.
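The app dependency graph that guides generation can be sketched with the standard-library topological sorter. The specific apps and edges below are hypothetical examples, not the actual Mobile dependency graph.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each app maps to the set of apps whose
# content must be generated before it (so references stay coherent).
deps = {
    "contacts": set(),                  # generated first, from the persona
    "messaging": {"contacts"},          # conversations reference contacts
    "calendar": {"contacts"},           # events reference contacts
    "email": {"contacts", "calendar"},  # invites reference calendar events
}

# static_order() yields every app after all of its dependencies.
generation_order = list(TopologicalSorter(deps).static_order())
```

Generating apps in this order guarantees, for example, that a messaging conversation never mentions a contact that does not yet exist.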
Scenario Creation and Verification in Mobile
ARE enables Mobile scenario creation through an annotation interface. The novelty of Mobile scenario creation is that it focuses on collecting the DAG of write events as ground truth, including user, oracle, and environment events. The ARE Verifier validates that agent write actions match annotated oracle actions. Mobile scenarios are designed such that there is a unique set of write actions that solves the scenario, while read actions are not explicitly verified as they do not change the environment state.
ARE as a General Platform
Mobile is an example of the environments that can be built in ARE: beyond Mobile, ARE abstractions encompass many existing agent benchmarks. For example, we internally replicated τ-bench (Yao et al., 2024) and BFCLv3 (Patil et al., 2025) in ARE without major modifications. In particular for τ-bench, a domain such as Airline is implemented as a single-app environment, leverages the ARE LLM user abstraction, and its Oracle Events are parsed from τ-bench ground truth actions. The simulation is stopped by a Conditional Event that monitors whether the agent defers to a human assistant or the user stops the interaction, and the trajectory is verified by implementing τ-bench verification logic as validation events.
Verification in ARE
Verifiable rewards have proven crucial for improving reasoning (DeepSeek-AI et al., 2025; Lambert et al., 2024), code generation (Gehring et al., 2024), agentic web browsing (OpenAI, 2025a; Wei et al., 2025b), and software engineering (Yang et al., 2025b; MoonshotAI et al., 2025). Similarly, recent reasoning and agent benchmarks adopted short-form answers that can be easily matched (Hendrycks et al., 2021; Mialon et al., 2023), or binary feedback from an execution environment (Yao et al., 2024; Jimenez et al., 2024). We verify successful scenario completion by comparing agent actions with a ground truth, defined as the minimal sequence of write actions needed to solve a task.
Verification Procedure
In a preliminary phase, the verifier checks that the counts of tool names used are identical in the oracle actions and the agent’s write actions. If this test passes, the verifier sorts the oracle actions in topological order based on the oracle graph, which reflects their dependencies, and then compares actions along several dimensions:
- Consistency: the verifier tests whether the oracle action and the candidate agent action are equivalent, via two checks:
  - Hard check: compares action parameters that require exactness.
  - Soft check: an LLM judge is prompted with the user task as context and the arguments from both the agent action and the oracle action as inputs; the LLM then determines whether the actions are equivalent according to tool-specific guidelines.
- Causality: crucially, oracle actions are organized within an oracle graph, whereas agent actions are collected from a trajectory and simply ordered by execution time; the verifier checks that the agent’s ordering respects the dependencies encoded in the oracle graph.
- Timing: scenarios can include a time delay for certain actions relative to their parent actions, which the agent must respect.
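The two-phase procedure above can be sketched as follows. This is a deliberately simplified sketch: the action shapes are hypothetical, real matching follows the oracle graph's topological structure rather than a simple pairwise zip, and the soft check stands in for the LLM judge.

```python
from collections import Counter

def verify_trajectory(oracle_actions, agent_actions, args_equal):
    """Simplified two-phase check.

    oracle_actions: oracle actions in topological order, as
                    {"tool": ..., "args": ...} dicts (hypothetical shape)
    agent_actions:  agent write actions ordered by execution time
    args_equal:     comparison callback; a stand-in for the hard check
                    (exact match) or soft check (LLM judge)
    """
    # Preliminary phase: the tool-name counters must be identical.
    oracle_tools = Counter(a["tool"] for a in oracle_actions)
    agent_tools = Counter(a["tool"] for a in agent_actions)
    if oracle_tools != agent_tools:
        return False
    # Consistency phase: each oracle action must be matched by an
    # equivalent agent action (here compared pairwise for simplicity).
    for oracle, agent in zip(oracle_actions, agent_actions):
        if oracle["tool"] != agent["tool"]:
            return False
        if not args_equal(oracle["args"], agent["args"]):
            return False
    return True

# Usage with an exact (hard-check-style) comparator.
exact = lambda a, b: a == b
oracle = [{"tool": "send_message", "args": {"to": "Alice", "text": "hi"}}]
ok = verify_trajectory(oracle, oracle, exact)
```

Plugging in an LLM-judge callback instead of `exact` would recover the soft check, while keeping the preliminary counter test unchanged.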
Benchmarking Results and Analysis
Equipped with a simple ReAct-like scaffold, no model evaluated here dominates across the intelligence spectrum: each trades off capability, efficiency, and budget. At equal cost, some models fare better, yet all curves plateau, suggesting that standard scaffolds and/or models miss key ingredients for sustained progress.
Workshop Organization Considerations
The following points are for organizing a workshop.
Pre-Workshop Planning
- Co-organizers: Meet with all workshop co-organizers and establish regular meetings to keep them involved.
- Milestones: Be aware that once accepted, the workshop organizers will verify that your workshop is making progress towards basic milestones.
- Deadlines: Schedule all deadlines at 23:59 (11:59PM) “Anywhere on Earth” time zone (AoE) to reduce confusion.
- Submission Date: The suggested submission date for workshop contributions is September 29th for NeurIPS 2023, but this is really up to you. Consider setting your submission deadline after the conference’s author notification date, September 22nd, to also attract papers rejected from the main conference.
- Notification Date: Make sure to set your notification date so authors have enough time to get visas. The latest possible date to notify authors of your accept-or-reject decision of their paper is October 27th for NeurIPS 2023. Consider notifying authors of your decision before the Early Registration Deadline October 21st to reduce your authors’ registration costs.
- Announcements and Reminders: Schedule a set of announcements and reminders about your workshop using social media to remind people about your workshop.
- Speakers: It’s important to keep speakers updated with relevant developments (like when your workshop is selected, and when the conference selects which date your workshop will be). Speaker cancellations are normal.
- Interactive Elements: Opting for fewer speakers means you can expand the interactive elements of your schedule, like posters (ideally 2+ hours), panel discussions (45+ mins), or each speaker’s question time (10 mins).
Submission and Review Process
- Paper Format: Consider allowing 4-page submissions (“extended abstracts”).
- Proceedings: In some situations, you may invite submission for both proceedings and non-proceedings. For example, full-length papers go to proceedings, extended abstracts go to non-proceedings.
- Reviewing: Reviewing submitted papers is optional.
- Lightweight Reviewing: Some workshops prefer lightweight reviewing: largely checking for fit and correctness and a selection of stronger papers for longer presentations.
- Detailed Reviews: Detailed reviews are more valuable to authors but place more burden on reviewers, and larger workshops can find it impractical to provide reviews of consistent quality.
- Program Committee: If you wish to provide reviews, consider recruiting a program committee (aka reviewers) if you expect 15+ papers, otherwise the organizers can review the papers themselves.
- Reviewer Workload: To keep the workload low, not more than 3 reviews per reviewer are expected. Aim for 3 reviews per paper so that the majority of papers receive 2-3 reviews.
- Emergency Reviewers: In anticipation of some papers with 0-1 reviews, you can recruit some “emergency reviewers” in advance to be “on call” for the couple of days prior to paper decisions, so that every paper has at least 2 reviews by the time you make your acceptance/rejection decisions.
- Reviewer Matching: To match reviewers to papers, several tools are available. OpenReview has its own matching mechanism. You can also manually match reviewers.
- Acceptance Rate: Generally workshop decision thresholds are more lenient than conferences, resulting in the acceptance of 50-90% of submissions.
Workshop Day Logistics
- Organizer Presence: Schedule which organizers will attend the workshop when.
- Organizer Communication: Set up a private messaging system for communication between organizers, e.g., with Slack, so that on the day of the workshop organizers can coordinate quickly to resolve issues as they arise.
- Calendar Invite for Speakers: Make it impossible for your speakers to be confused with locations and time zones by sending them each a calendar event for their talk.
- Speaker Introductions: Prepare introductions for each speaker (their bio) for a warm introduction and smooth transitions between speakers.
- Time Management: Help speakers avoid going overtime by bringing boldly printed numbers that you can hold up to inform them how many minutes they have left.
- Spare Equipment: Bring spare AV adaptors including Apple adaptors, your laptop, USB drives; and spare tape and push pins for posters.
- Backup Questions: Prepare “backup questions” for each keynote or spotlight speaker.
- Remote Cameras: If remote cameras are used, an organizer may need to select the stream using the SlidesLive stream box provided at the workshop.
- Room Setup: Identify any special needs for room setup several weeks before the conference.
- Redundancy: Have multiple organizers attending the workshop to deal with issues together. Carry a spare laptop, AV adapters, and USB drive in case speakers have equipment issues.
- AV Staff: Greet audio-visual (AV) staff, if any, to understand how everything will work.
- Poster Sessions: Go chat with authors during the poster sessions and learn about their work. As an organizer, you are recommended to visit posters that are not receiving as much attention.
- Live Backup Questions: For non-recorded talks, you’ll want at least one organizer (the MC) to be paying attention to each speaker during the event to think of backup questions live.
- Dinner: You can organize a dinner for the speakers and (if you have them) sponsors.
Post-Workshop
- Public Links to Talks: It’s nice for speakers to have public links to their talks and for others who couldn’t make your event to see.
- Website: Create a website. Google Sites is very easy to use, but not accessible in China.
Sponsorship
- Sponsorship is Optional: You do not need sponsorship.
- Use of Funds: Some workshops use sponsorship to fund best paper awards (e.g., cash, cloud credits, or hardware).
- Student Travel Awards: It’s better to fund student travel awards based on financial need, to enable more people to attend your workshop or at least reduce the financial stress of doing so.
- Fund Management: If you have multiple sponsors, then holding funds together can be difficult. You generally cannot hold funds in your own personal bank account.
- Payment Processing: When companies want to sponsor using their corporate credit card, use PayPal.
- Export Bans: Certain countries unfortunately have export bans against other countries, especially for computing hardware.

