Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

Zhaoyang Wang, Canwen Xu, Boyi Liu, Yite Wang, Siwei Han +3 more

2/10/2026

cs.AI, cs.CL, cs.LG

Abstract

Recent advances in large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. However, scaling such agent training is limited by the lack of diverse and reliable environments. In this paper, we propose Agent World Model (AWM), a fully synthetic environment generation pipeline. Using this pipeline, we scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets (35 tools per environment on average) and obtain high-quality observations. Notably, these environments are code-driven and backed by databases, providing more reliable and consistent state transitions than environments simulated by LLMs. Moreover, they enable more efficient agent interaction compared with collecting trajectories from realistic environments. To demonstrate the effectiveness of this resource, we perform large-scale reinforcement learning for multi-turn tool-use agents. Thanks to the fully executable environments and accessible database states, we can also design reliable reward functions. Experiments on three benchmarks show that training exclusively in synthetic environments, rather than benchmark-specific ones, yields strong out-of-distribution generalization. The code is available at https://github.com/Snowflake-Labs/agent-world-model.
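To make the idea of a code-driven, database-backed environment with a state-based reward concrete, here is a minimal sketch. It is not taken from the AWM codebase: the class `ToyShopEnv`, the tools `list_orders` and `cancel_order`, the `orders` table, and the `reward` function are all hypothetical names chosen for illustration, and the in-memory SQLite database is an assumption about how such a backing store might look.

```python
import sqlite3


class ToyShopEnv:
    """Hypothetical toy environment; illustrative only, not AWM's API."""

    def __init__(self):
        # State lives in a database, so transitions are deterministic code,
        # not text generated by an LLM.
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT, status TEXT)"
        )
        self.db.executemany(
            "INSERT INTO orders (id, item, status) VALUES (?, ?, ?)",
            [(1, "keyboard", "open"), (2, "monitor", "open")],
        )
        self.db.commit()

    def list_orders(self):
        """Tool: return all orders as a list of dicts (the observation)."""
        rows = self.db.execute(
            "SELECT id, item, status FROM orders ORDER BY id"
        ).fetchall()
        return [{"id": r[0], "item": r[1], "status": r[2]} for r in rows]

    def cancel_order(self, order_id):
        """Tool: cancel an open order; reports whether a row was changed."""
        cur = self.db.execute(
            "UPDATE orders SET status = 'cancelled' WHERE id = ? AND status = 'open'",
            (order_id,),
        )
        self.db.commit()
        return {"ok": cur.rowcount == 1}


def reward(env):
    """Return 1.0 only if order 2 is cancelled and order 1 is still open."""
    status = dict(env.db.execute("SELECT id, status FROM orders").fetchall())
    return 1.0 if status == {1: "open", 2: "cancelled"} else 0.0


env = ToyShopEnv()
before = reward(env)          # task not yet done
result = env.cancel_order(2)  # the agent's tool call
after = reward(env)           # reward is read from the database state
print(before, result, after)
```

Because the reward inspects the final database state rather than the agent's text output, it cannot be satisfied by a response that merely claims the order was cancelled.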


Code Implementations (10)

Snowflake-Labs/agent-world-model · Official · 100%

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

24131 · Python · Feb 8, 2026 · 2 months ago

lisalims/agent-world-model · 70%

Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning

0 · 0 · Feb 17, 2026 · 1 month ago

OpenPipe/ART · 68%

Agent Reinforcement Trainer: train multi-step agents for real-world tasks using GRPO. Give your agents on-the-job training. Reinforcement learning for Qwen2.5, Qwen3, Llama, and more!

8,120 · 645 · Mar 10, 2025 · 3 months ago · Apache-2.0

agent, agentic-ai, grpo, llms, lora, +4 more

microsoft/TextWorld · 67%

TextWorld is a sandbox learning environment for the training and evaluation of reinforcement learning (RL) agents on text-based games.

1,394 · 194 · Jun 26, 2018 · 2 months ago · NOASSERTION

reinforcement-learning, text-based-adventure, text-based-game

eloialonso/diamond · 65%

DIAMOND (DIffusion As a Model Of eNvironment Dreams) is a reinforcement learning agent trained in a diffusion world model. NeurIPS 2024 Spotlight.

1,932 · 139 · Shell, Python · May 19, 2024 · 1 year ago · MIT

artificial-intelligence, atari, deep-learning, diffusion-models, machine-learning, +3 more

microsoft/maro · 65%

Multi-Agent Resource Optimization (MARO) platform is an instance of Reinforcement Learning as a Service (RaaS) for real-world resource optimization problems.

909160 · Dec 27, 2019 · 11 months ago · MIT

agent, citi-bike, docker, finance, inventory-management, +11 more

capuzz/Intelligent-Roundabout-Insertion-using-Deep-Reinforcement-Learning · 58%

An important topic in autonomous driving research is the development of maneuver planning systems. Vehicles have to interact and negotiate with each other so that optimal choices, in terms of time and safety, are made. For this purpose, we present a maneuver planning module able to negotiate entry into busy roundabouts. The proposed module is based on a neural network trained to predict when and how to enter the roundabout throughout the whole duration of the maneuver. Our model is trained with a novel implementation of A3C, which we call Delayed A3C (D-A3C), in a synthetic environment where vehicles move in a realistic manner with interaction capabilities. In addition, the system is trained such that agents feature a unique tunable behavior, emulating real-world scenarios where drivers have their own driving styles. Similarly, the maneuver can be performed using different aggressiveness levels, which is particularly useful for managing busy scenarios where conservative rule-based policies would result in indefinite waits.

3 · 1 · Jan 7, 2020 · 4 years ago

artificial-intelligence, autonomous-driving, deep-learning, multi-agent-systems, reinforcement-learning

KingOfRlyeh/AI-Ouroboros · 58%

A model of an environment of recursive training of multiple AI models on increasingly synthetic datasets in a multi-agentic system using real-world values

0 · 0 · Nov 17, 2025 · 3 months ago · MIT

IncludeBrake/synthetic_market_machine · 55%

Level 1 Simulated Market Environment that closely mirrors real-world dynamics for validating a vague product idea, we need a robust, data-driven, agent-based system that maximizes fidelity while maintaining zero reputational risk.

0 · 0 · Sep 9, 2025 · 7 months ago

cair/DeepAxie · 54%

Implementation of a simplified Axie Infinity Environment in C++ that is used to train an agent with the reinforcement learning algorithm DQN to play the game.

2 · 0 · May 29, 2022 · 3 years ago
