Weak-for-Strong: Training a 7B Meta-Agent to Orchestrate Powerful LLMs

What W4S is and why it matters

Researchers from Stanford, EPFL, and UNC present Weak-for-Strong Harnessing (W4S), a reinforcement learning framework that trains a small meta-agent to design and refine executable code workflows which call a stronger executor model. Instead of fine-tuning the powerful model, W4S teaches a lightweight planner to orchestrate it. The approach frames workflow design as a multi-turn Markov decision process and trains the planner with a method called Reinforcement Learning for Agentic Workflow Optimization (RLAO). The full paper is available at https://arxiv.org/pdf/2504.04785.
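To make the multi-turn MDP framing concrete, the sketch below shows one plausible way the state and action could be represented. The class and field names are illustrative assumptions for this article, not the paper's actual data structures.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Turn:
    """One planner turn: the workflow the meta-agent wrote and the feedback it received."""
    workflow_code: str          # executable Python workflow that calls the strong executor
    validation_accuracy: float  # accuracy of the workflow on held-out validation samples
    error_report: str           # execution errors / failure cases returned to the planner

@dataclass
class State:
    """MDP state: the task description plus the history of earlier turns."""
    task_description: str
    history: List[Turn] = field(default_factory=list)

# The action at each step is simply the next workflow program (a string of
# Python code) emitted by the 7B meta-agent; executing it with the strong
# model yields validation accuracy and errors, which form the next observation.
```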

Iterative loop: how the system operates

W4S runs as an iterative generate-execute-refine loop. Each turn contains:

- Generate: the weak meta-agent writes a new workflow as executable Python code, conditioned on the task and on feedback from earlier turns.
- Execute: the strong model runs the workflow on validation samples, returning accuracy and failure cases.
- Refine: that feedback is handed back to the meta-agent, which uses it to improve the workflow in the next turn.

The meta-agent can perform a quick self-check on one sample and attempt up to three automatic repairs if errors are detected; if repairs fail, the action is skipped. This design provides a learning signal without modifying the weights of the strong executor.
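A hedged sketch of one such turn, building on the State and Turn classes sketched earlier. Here meta_agent, strong_executor, and their methods (propose_workflow, repair_workflow, run) are placeholder names, not the paper's actual API.

```python
MAX_REPAIRS = 3  # the meta-agent may attempt up to three automatic repairs per turn

def run_turn(meta_agent, strong_executor, state, validation_set):
    """One W4S turn: generate a workflow, self-check and repair it, execute it, record feedback."""
    # 1. Generate: the weak meta-agent writes the next workflow as Python code,
    #    conditioned on the task and on earlier workflows and their feedback.
    workflow_code = meta_agent.propose_workflow(state)

    # 2. Self-check on a single validation sample, repairing errors up to MAX_REPAIRS times.
    for attempt in range(MAX_REPAIRS + 1):
        try:
            strong_executor.run(workflow_code, validation_set[:1])
            break  # the workflow executed cleanly on the probe sample
        except Exception as err:
            if attempt == MAX_REPAIRS:
                return state  # repairs exhausted: skip this action, keep the previous state
            workflow_code = meta_agent.repair_workflow(workflow_code, str(err))

    # 3. Execute on the full validation split; the strong model's weights are never touched.
    accuracy, failures = strong_executor.run(workflow_code, validation_set)

    # 4. Refine: accuracy and failure cases become feedback for the next turn.
    state.history.append(Turn(workflow_code, accuracy, failures))
    return state
```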

RLAO: training the planner with offline RL

Reinforcement Learning for Agentic Workflow Optimization (RLAO) operates offline over multi-turn trajectories. At every iteration the system samples multiple candidate actions, selects the best-performing candidate to advance the state, and stores the remaining candidates for training. The policy is optimized via reward-weighted regression. Rewards are sparse and compare current validation accuracy to historical records: a higher weight is given when the new result beats the previous best, and a smaller weight when it merely improves over the last iteration. This objective encourages steady progress while keeping exploration costs controlled. See https://arxiv.org/pdf/2504.04785 for details.
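A minimal sketch of the sparse reward scheme and the reward-weighted regression objective described above. The concrete weight values (1.0 and 0.5) and the linear weighting form are illustrative assumptions, not the paper's exact choices.

```python
import torch

def rlao_reward(current_acc, best_acc, last_acc,
                beat_best_weight=1.0, beat_last_weight=0.5):
    """Sparse reward comparing the new validation accuracy with historical records.
    The two weight values are illustrative, not the paper's hyperparameters."""
    if current_acc > best_acc:
        return beat_best_weight   # new historical best: largest reward
    if current_acc > last_acc:
        return beat_last_weight   # better than the last iteration only: smaller reward
    return 0.0                    # no improvement: no learning signal

def rwr_loss(action_log_probs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Offline reward-weighted regression: maximize the reward-weighted
    log-likelihood of the stored workflow-writing actions."""
    return -(rewards * action_log_probs).mean()
```

In this sketch only candidates that improve on the historical record contribute gradient signal, which matches the paper's description of sparse, comparison-based rewards that keep exploration costs controlled.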

Results and efficiency

W4S reports consistent improvements across 11 benchmarks, with the detailed numbers given in the paper's main tables.

Across seen tasks with GPT-4o-mini as the executor, W4S outperforms search-based automated workflow methods that do not learn a planner. Ablations show that RLAO-trained agents outperform supervised fine-tuning under the same compute budget, and on GSM-Hard W4S also beats a GRPO baseline applied to the same 7B weak model under limited compute.

Iteration budgets and sample efficiency

The team observed that iteration budgets affect outcomes: W4S typically uses about 10 optimization turns in its main results, while competing methods such as AFlow and ADAS run 20 and 30 turns, respectively. Despite the smaller budget, W4S reaches higher accuracy, suggesting that learning to plan over code with validation feedback makes the search more sample-efficient.

Key takeaways

W4S frames workflow design as a multi-turn MDP and trains a 7B weak meta-agent with RLAO to write Python workflows that harness stronger executors. The method achieves strong empirical gains across benchmarks while avoiding any fine-tuning of the strong model, offering a cost-effective and sample-efficient strategy for designing agentic workflows. For full technical detail, consult the paper at https://arxiv.org/pdf/2504.04785.