Early Experience: Training Language Agents from Their Own Outcomes to Beat Imitation Learning
Early Experience is a reward-free training recipe that lets language agents learn from the consequences of their own actions instead of relying on large sets of human demonstrations or a reinforcement learning reward loop. Meta Superintelligence Labs presents two concrete strategies that turn agent-generated future states into supervision, and reports consistent improvements across eight language-agent benchmarks.
How the method works
Early Experience starts from a small seed set of expert rollouts, which supplies representative states. At selected states the agent proposes alternative actions, executes them, and records the resulting next observations. Those recorded outcomes, not a scalar reward or additional expert trajectories, become the supervision signal.
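The collection loop can be sketched roughly as follows. The environment interface (reset_to, step), the agent.propose helper, and the data layout are illustrative assumptions, not the paper's actual code.

```python
# Hedged sketch of the branch-collection phase.
# Assumed interface (not from the paper): env.reset_to(state) restores a saved state,
# env.step(action) returns (next_obs, done, info) with text observations, and
# agent.propose(state, k) samples k alternative actions. expert_rollouts is a list
# of trajectories, each a list of (state, expert_action) pairs.

def collect_branches(env, agent, expert_rollouts, k_alternatives=3):
    """Replay expert states, branch with agent-proposed actions,
    and record the resulting next observations as supervision."""
    branch_data = []
    for rollout in expert_rollouts:
        for state, expert_action in rollout:
            # Sample alternative actions the agent itself would consider here.
            alternatives = agent.propose(state, k=k_alternatives)
            outcomes = []
            for action in alternatives:
                env.reset_to(state)                 # assumed: environment can be reset to a saved state
                next_obs, done, info = env.step(action)  # no reward is used, only the next observation
                outcomes.append({"action": action, "next_obs": next_obs})
            branch_data.append({
                "state": state,
                "expert_action": expert_action,
                "branches": outcomes,
            })
    return branch_data
```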
Two practical instantiations are described:
Implicit World Modeling (IWM): the model is trained to predict the next observation given the current state and a chosen action. This tightens the agent’s implicit model of environment dynamics and reduces off-policy drift over long horizons.
Self-Reflection (SR): the agent is shown the expert action and alternative actions at the same state, together with their observed outcomes. The model generates grounded explanations of why the expert action is preferable given those outcomes, and this contrastive, outcome-verified supervision is used to fine-tune the policy (a sketch of both data formats follows this list).
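As a rough illustration of how the two strategies can be turned into ordinary supervised examples, the sketch below serializes IWM and SR data into (prompt, target) pairs. The prompt templates and field names are assumptions for illustration, not the paper's exact format.

```python
# Hedged sketch: building IWM and SR training examples from the collected branches.

def make_iwm_example(state, action, next_obs):
    # Implicit World Modeling: predict the next observation from (state, action).
    prompt = f"State:\n{state}\n\nAction:\n{action}\n\nPredict the next observation:"
    return {"prompt": prompt, "target": next_obs}

def make_sr_example(state, expert_action, branches):
    # Self-Reflection: show the expert action and the observed outcomes of the
    # alternative actions, then supervise a grounded rationale plus the expert action.
    alt_text = "\n".join(
        f"- Action: {b['action']}\n  Outcome: {b['next_obs']}" for b in branches
    )
    prompt = (
        f"State:\n{state}\n\nAlternative actions and their observed outcomes:\n{alt_text}\n\n"
        f"Explain why the following action is preferable, then state it:\n{expert_action}"
    )
    # In practice the rationale is generated by the model itself and kept only when
    # it is consistent with the recorded outcomes (outcome-verified).
    target = f"Reasoning: <model-generated rationale grounded in the outcomes>\nAction: {expert_action}"
    return {"prompt": prompt, "target": target}
```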
Both strategies use the same optimization budgets and decoding settings as imitation learning; the only difference is that the supervision data come from agent-generated branches rather than from additional expert trajectories.
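To make that point concrete, here is a minimal fine-tuning sketch assuming a Hugging Face causal LM (the backbone name is a placeholder): the objective is plain next-token cross-entropy over the target tokens, exactly as in imitation learning, with only the data differing.

```python
# Hedged sketch of the supervised fine-tuning step on (prompt, target) pairs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder backbone, not the paper's checkpoint
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def sft_loss(example):
    prompt_ids = tok(example["prompt"], return_tensors="pt").input_ids
    target_ids = tok(example["target"], add_special_tokens=False, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # supervise only the target tokens
    return model(input_ids=input_ids, labels=labels).loss
```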
Benchmarks and empirical gains
The research evaluates Early Experience on eight diverse language-agent environments spanning web navigation, constrained planning, scientific and embodied tasks, and multi-domain API workflows. Examples include WebShop, TravelPlanner, ScienceWorld, ALFWorld and Tau-Bench.
Across the full matrix of tasks and base models, Early Experience achieves average absolute gains of +9.6 in success rate and +9.4 in out-of-domain generalization compared to imitation learning. Reported per-task gains include +18.4 on WebShop, +15.0 on TravelPlanner, and +13.3 on ScienceWorld under matched budgets and settings.
These improvements hold across the backbone sizes tested (3B to 8B parameters) and persist out of distribution.
Data efficiency and practical benefits
A major practical win is demonstration efficiency. With a fixed optimization budget, Early Experience matches or exceeds imitation learning while using far fewer expert demonstrations. On WebShop, for instance, Early Experience trained with only 1/8 of the demonstrations already outperforms IL trained on the full demonstration set; on ALFWorld, parity is reached with half the demos. The advantage grows as more demonstrations are added, suggesting that agent-generated future states provide supervision signals that demonstrations alone miss.
Relationship with reinforcement learning
Early Experience is not a replacement for RL when verifiable rewards exist. Rather, it is a reward-free stage that yields a better initialization for subsequent RL. When standard RL (for example, GRPO) is applied after Early Experience, training often reaches higher final performance and gets there faster; the paper reports up to +6.4 absolute improvement in post-RL ceilings versus RL started from an imitation-learning initialization.
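For intuition about the follow-on RL stage, the snippet below shows the group-relative advantage computation at the heart of GRPO-style updates. It is a generic sketch, not the paper's RL implementation; the shapes and the epsilon are assumptions.

```python
# Hedged sketch of GRPO-style group-relative advantages: for each prompt, a group
# of rollouts is sampled and each rollout's scalar reward is normalized against
# the group mean and standard deviation.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (num_groups, group_size) of scalar task rewards."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts with 4 rollouts each; higher-reward rollouts get positive advantages.
adv = group_relative_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0],
                                              [0.0, 0.0, 0.0, 1.0]]))
```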
Why this matters
Early Experience occupies a practical middle ground between imitation learning and reinforcement learning. It keeps the optimization stability and simplicity of supervised learning while grounding supervision in outcomes the agent itself experiences. This addresses both the brittle generalization of pure imitation and the infrastructure and reward-specification costs of RL, making it immediately actionable for web and tool-use agent stacks where verifiable rewards are scarce.