PokeeResearch-7B: Open 7B Research Agent Trained with RLAIF and a Multi-Thread Reasoning Scaffold

How the agent operates

PokeeResearch-7B is an open 7B-parameter deep research agent designed to run full research loops. For each query it decomposes the task, issues web-search and page-read calls, proposes interim answers, verifies candidates against retrieved evidence, and synthesizes multiple research threads into a final response. This end-to-end loop aims to reduce brittle reasoning trajectories and catch obvious errors before finalization.

Research and verification loop

The agent alternates between research and verification stages. During research it calls external tools for web search and page reading or proposes an interim answer. During verification it inspects the candidate answer against retrieved evidence and either accepts it or restarts research. This structure helps detect malformed tool calls and early mistakes, improving final answer quality.
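
To make the control flow concrete, here is a minimal Python sketch of one research thread under this loop. It is an illustration only: `call_model`, `run_tool`, and `verify_answer` are hypothetical stubs standing in for the real model and tool interface, and only the alternation between research and verification follows the description above.

```python
from dataclasses import dataclass

MAX_TURNS = 100  # the evaluation protocol caps interaction turns at 100


@dataclass
class Action:
    kind: str     # "search", "read", or "answer"
    payload: str  # search query, URL, or candidate answer


def call_model(context: list[str]) -> Action:
    """Stub: the policy model chooses the next action from the context."""
    return Action(kind="answer", payload="stub answer")


def run_tool(action: Action) -> str:
    """Stub: execute a web-search or page-read tool call."""
    return f"observation for {action.kind}: {action.payload}"


def verify_answer(answer: str, context: list[str]) -> bool:
    """Stub: the verification stage checks the candidate against evidence."""
    return True


def research_thread(question: str) -> str | None:
    """Alternate research and verification until an answer is accepted."""
    context = [question]
    candidate = None
    for _ in range(MAX_TURNS):
        action = call_model(context)
        if action.kind in ("search", "read"):
            try:
                context.append(run_tool(action))  # research stage
            except ValueError:
                # Self-correction: a malformed tool call is surfaced to the
                # model as an error message so it can retry.
                context.append(f"tool error: {action.payload}; retrying")
        else:
            candidate = action.payload  # interim answer -> verification stage
            if verify_answer(candidate, context):
                return candidate  # accepted
            context.append("verifier rejected the candidate; keep researching")
    return candidate  # best effort if the turn budget runs out
```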

Training recipe: RLAIF with RLOO

PokeeResearch-7B is fine-tuned from Qwen2.5-7B-Instruct using annotation-free Reinforcement Learning from AI Feedback (RLAIF) with the REINFORCE Leave-One-Out (RLOO) gradient estimator. Training optimizes rewards for semantic correctness, citation faithfulness, and instruction adherence rather than token overlap. The model card reports these training settings: batch size 64, 8 research threads per prompt during RL, learning rate 3e-6, 140 steps, 32,768-token context, bf16 precision, and a checkpoint of roughly 13 GB. The team notes that RLOO provides an unbiased on-policy gradient estimate, in contrast to PPO-family algorithms, which are approximately on-policy and biased.
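
For intuition about the estimator, here is a short PyTorch sketch of the RLOO advantage computation, assuming k sampled research threads per prompt, each scored with one scalar reward; the reward function itself is abstracted into the `rewards` tensor, and this is a sketch of the leave-one-out baseline, not the project's actual training code.

```python
import torch


def rloo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Compute RLOO advantages from per-thread rewards.

    rewards: tensor of shape (batch, k), one scalar reward per sampled
    thread. Each sample's baseline is the mean reward of the other k-1
    samples for the same prompt, which keeps the on-policy gradient
    estimate unbiased.
    """
    _, k = rewards.shape
    total = rewards.sum(dim=1, keepdim=True)          # (batch, 1)
    leave_one_out_mean = (total - rewards) / (k - 1)  # (batch, k)
    return rewards - leave_one_out_mean


# Example with k=8 threads per prompt, as in the reported training setup:
rewards = torch.tensor([[1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0]])
advantages = rloo_advantages(rewards)
# The policy gradient then weights each thread's log-probability, e.g.
# loss = -(advantages.detach() * thread_logprobs).mean()
```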

Reasoning scaffold and Research Threads Synthesis

The scaffold combines three mechanisms: self-correction, self-verification, and Research Threads Synthesis (RTS). Self-correction detects malformed tool calls and retries them. Self-verification inspects the agent’s own answer against the retrieved evidence. RTS runs several independent research threads per question, summarizes each thread, and synthesizes a final answer from those summaries. The team reports that synthesis improves accuracy on difficult benchmarks.
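
A minimal sketch of RTS, with hypothetical `run_thread`, `summarize`, and `synthesize` stubs in place of the real model calls; only the run-summarize-synthesize structure follows the description above.

```python
def run_thread(question: str) -> str:
    """Stub: one independent research thread (see the loop sketch above)."""
    return f"thread answer for {question!r}"


def summarize(question: str, answer: str) -> str:
    """Stub: compress one thread's findings into a short report."""
    return f"summary of {answer}"


def synthesize(question: str, summaries: list[str]) -> str:
    """Stub: a final model call merges the independent summaries."""
    return f"final answer from {len(summaries)} summaries"


def answer_with_rts(question: str, n_threads: int = 4) -> str:
    # 1. Run several independent research threads for the same question.
    answers = [run_thread(question) for _ in range(n_threads)]
    # 2. Summarize each thread separately.
    summaries = [summarize(question, a) for a in answers]
    # 3. Synthesize the final answer from the independent summaries.
    return synthesize(question, summaries)
```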

Evaluation protocol

The team evaluated text-only questions drawn from 10 benchmarks: NQ, TriviaQA, PopQA, HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle, GAIA, BrowseComp, and Humanity’s Last Exam (HLE). They sampled 125 questions per dataset, except GAIA with 103, for a total of 1,228 questions. For each question they ran 4 research threads and computed mean accuracy over the threads (mean@4), with correctness judged by Gemini-2.5-Flash-lite. Interaction turns were capped at 100.
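
As a concrete reading of the metric, a small sketch of mean@k with hypothetical `run_thread` and `judge` stubs (in the paper, the judge is Gemini-2.5-Flash-lite and k is 4):

```python
def run_thread(question: str) -> str:
    return "candidate answer"  # stub: one independent research thread


def judge(question: str, answer: str) -> bool:
    return True  # stub: the LLM judge's semantic-correctness verdict


def mean_at_k(questions: list[str], k: int = 4) -> float:
    """Average per-question accuracy over k independent threads."""
    per_question = [
        sum(judge(q, run_thread(q)) for _ in range(k)) / k for q in questions
    ]
    return sum(per_question) / len(per_question)
```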

Results at 7B scale

PokeeResearch-7B reports the best mean@4 accuracy among 7B deep research agents across the 10 datasets. Notable results include HLE 15.2 without RTS and 17.6 with RTS; GAIA 36.9 without RTS and 41.3 with RTS; BrowseComp 5.4 without RTS and 8.4 with RTS. On the seven QA benchmarks (Bamboogle, 2WikiMultiHopQA, TriviaQA, NQ, PopQA, Musique, HotpotQA) the model also improves over recent 7B baselines. Gains from RTS are largest on HLE, GAIA, and BrowseComp, and smaller on the QA sets.

Release and practical notes

The project is released under the Apache-2.0 license, with code and weights public on Hugging Face and GitHub. The research stack uses Serper for web search and Jina for page reading, and the authors report that the setup runs on a single A100 80 GB and scales from there. The repo and paper are available for replication and further experiments.
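
For a quick start, a minimal loading sketch with Hugging Face transformers; the repository id below is an assumption based on the public release, so check the project's Hugging Face page for the exact name.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "PokeeAI/pokee_research_7b"  # assumed repo id; verify on the Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="bfloat16",  # training used bf16; checkpoint is ~13 GB
    device_map="auto",       # reported to fit on a single A100 80 GB
)
```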