Early Experience: Training Language Agents from Their Own Outcomes to Beat Imitation Learning

Early Experience is a reward-free training recipe that lets language agents learn from the consequences of their own actions instead of relying on large sets of human demonstrations or a reinforcement-learning reward loop. Researchers at Meta Superintelligence Labs present two concrete strategies that turn agent-generated future states into supervision and report consistent improvements across eight language-agent benchmarks.

How the method works

Early Experience begins with a small seed of expert rollouts, which are used to collect representative states. At selected states the agent proposes alternative actions, executes them, and records the resulting next observations. Those recorded outcomes become the supervision signal instead of a scalar reward or additional expert trajectories.
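
As a rough illustration of this collection loop (not the paper's code), the sketch below assumes hypothetical `env`, `agent`, and `expert_rollouts` interfaces: for each state visited by a seed rollout, the agent branches off a few of its own actions and stores the observations they produce.

```python
def collect_early_experience(env, agent, expert_rollouts, num_alternatives=4):
    """Collect agent-generated branches from states visited by expert rollouts."""
    branches = []
    for rollout in expert_rollouts:                  # small seed of expert trajectories
        for state, expert_action in rollout:         # each step: (state, expert action)
            # The agent proposes its own alternative actions at this state.
            proposals = agent.propose_actions(state, k=num_alternatives)
            for action in proposals:
                env.reset_to(state)                  # assumes the env can restore a visited state
                next_obs = env.step(action)          # execute the agent's own action
                branches.append({
                    "state": state,
                    "action": action,
                    "next_observation": next_obs,    # the outcome that becomes supervision
                    "expert_action": expert_action,
                })
    return branches
```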

Two practical instantiations are described. Implicit world modeling trains the agent to predict the observation that follows each of its alternative actions, grounding the policy in the environment's dynamics. Self-reflection has the agent contrast its own alternatives and their outcomes with the expert action at the same state and generate a natural-language explanation of why the expert choice is preferable; those explanations, paired with the expert actions, serve as additional training targets.

Both strategies use the same optimization budgets and decoding settings as imitation learning; the only difference is that the supervision data come from agent-generated branches rather than from additional expert trajectories. A sketch of how those branches become training examples follows below.
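
The following is a minimal sketch of that conversion, assuming the hypothetical `branches` records from the collection loop above; the prompt templates and the `agent.reflect` helper are illustrative stand-ins, not the paper's actual prompts or loss.

```python
def build_world_modeling_examples(branches):
    """Implicit world modeling: predict the observation that follows an action."""
    examples = []
    for b in branches:
        prompt = (f"State: {b['state']}\n"
                  f"Action: {b['action']}\n"
                  "Predict the next observation:")
        examples.append({"input": prompt, "target": str(b["next_observation"])})
    return examples


def build_self_reflection_examples(branches, agent):
    """Self-reflection: explain why the expert action is preferable, then imitate it."""
    examples = []
    for b in branches:
        # The agent drafts a reflection contrasting its own action's outcome with
        # the expert action at the same state (agent.reflect is a hypothetical helper).
        reflection = agent.reflect(
            state=b["state"],
            tried_action=b["action"],
            outcome=b["next_observation"],
            expert_action=b["expert_action"],
        )
        prompt = f"State: {b['state']}\nChoose the best action and explain why:"
        target = f"{reflection}\nAction: {b['expert_action']}"
        examples.append({"input": prompt, "target": target})
    return examples
```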

Benchmarks and empirical gains

The paper evaluates Early Experience on eight diverse language-agent environments spanning web navigation, constrained planning, scientific and embodied tasks, and multi-domain API workflows. Examples include WebShop, TravelPlanner, ScienceWorld, ALFWorld, and Tau-Bench.

Across the full matrix of tasks and base models, Early Experience achieves average absolute gains of +9.6 points in task success and +9.4 points in out-of-domain generalization compared to imitation learning. Per-task reported gains include +18.4 on WebShop, +15.0 on TravelPlanner, and +13.3 on ScienceWorld under matched budgets and settings.

These improvements hold across the backbone sizes tested (3B to 8B parameters) and persist on out-of-distribution tasks.

Data efficiency and practical benefits

A major practical win is demonstration efficiency. With a fixed optimization budget, Early Experience matches or exceeds imitation learning while using far fewer expert demonstrations. For instance, on WebShop, Early Experience trained with only 1/8 of the demonstrations already outperforms imitation learning trained on the full demonstration set; on ALFWorld, it reaches parity with half the demonstrations. The advantage grows as more demonstrations are added, suggesting that agent-generated future states provide supervision signals that demonstrations alone miss.

Relationship with reinforcement learning

Early Experience is not a replacement for RL when verifiable rewards exist. Rather, it is a reward-free pre-training stage that yields a better initialization. When standard RL (for example, GRPO) is applied after Early Experience, training often reaches higher final performance and does so faster. The paper reports up to +6.4 points of absolute improvement in the post-RL performance ceiling versus RL started from an imitation-learning initialization.
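
Below is a minimal sketch of that staged schedule, assuming generic `supervised_finetune` and `grpo_finetune` helpers (hypothetical names, not the paper's code) together with the collection and example-building functions sketched above.

```python
def train_agent(base_model, expert_rollouts, env, reward_fn=None):
    """Stage 1: reward-free Early Experience; Stage 2 (optional): RL such as GRPO."""
    # Stage 1: supervised fine-tuning on expert actions plus agent-generated branches.
    branches = collect_early_experience(env, base_model, expert_rollouts)
    extra = (build_world_modeling_examples(branches)
             + build_self_reflection_examples(branches, base_model))
    model = supervised_finetune(base_model, expert_rollouts, extra_data=extra)

    # Stage 2: standard RL on top, only where a verifiable reward exists.
    if reward_fn is not None:
        model = grpo_finetune(model, env, reward_fn)
    return model
```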

Why this matters

Early Experience occupies a practical middle ground between imitation learning and reinforcement learning. It keeps the optimization stability and simplicity of supervised learning while grounding supervision in outcomes the agent itself experiences. That combination addresses both the brittle generalization of pure imitation and the infrastructure and reward-specification costs of RL, making it immediately actionable for web and tool-use agent stacks where verifiable rewards are scarce.