AutoCode: Teaching LLMs to Author and Judge Competition-Grade Programming Problems

AutoCode is a new framework that trains large language models (LLMs) not only to solve programming tasks but also to create and verify contest-quality problems and judging logic. By reframing evaluation as problem setting rather than just problem solving, the system emulates the workflow of human problem setters and produces test suites and verdicts that match official online judges.

Why problem setting matters

Public code benchmarks often rely on under-specified unit tests. Those weak tests can let incorrect or shortcut solutions pass (false positives) and can also reject correct solutions when inputs are malformed (false negatives). AutoCode addresses these failure modes by centering validation and adversarial test generation, improving the fidelity of evaluation and the downstream reward signals used for reinforcement learning.

The core loop: Validator → Generator → Checker

AutoCode runs a closed loop that mirrors human contest workflows. Each stage uses multiple LLM-generated candidates and selects the best-performing one against targeted in-framework tests.

Validator (reduce false negatives)

The Validator step enforces input legality to avoid rejecting correct solutions due to malformed cases. The system asks an LLM to synthesize 40 evaluation inputs — typically 10 valid and 30 near-valid illegal examples (such as off-by-one boundary violations). It then produces three candidate validator programs and selects the one that best classifies those cases, preventing crashes and unintended rejections.
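A minimal sketch of this best-of-three selection is below. It assumes each candidate validator is an executable that reads a test input on stdin and signals legality via its exit code; the helper names and the exit-code convention are illustrative assumptions, not the paper's actual interface.

```python
import subprocess
from typing import List, Tuple

def validator_accepts(validator_path: str, test_input: str) -> bool:
    """Run one candidate validator on one input; exit code 0 means 'legal input' (assumed convention)."""
    result = subprocess.run(
        [validator_path], input=test_input, capture_output=True, text=True, timeout=5
    )
    return result.returncode == 0

def select_best_validator(
    candidates: List[str],                   # paths to the three LLM-written validator programs
    labeled_inputs: List[Tuple[str, bool]],  # ~10 (input, True) legal and ~30 (input, False) near-legal cases
) -> str:
    """Pick the candidate that classifies the most labeled inputs correctly."""
    def score(path: str) -> int:
        correct = 0
        for text, is_legal in labeled_inputs:
            try:
                correct += int(validator_accepts(path, text) == is_legal)
            except (subprocess.TimeoutExpired, OSError):
                pass  # a crashing or hanging validator earns no credit on this case
        return correct
    return max(candidates, key=score)
```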

Generator (reduce false positives)

The Generator produces adversarial test cases using three complementary strategies. Cases that fail the Validator are filtered out; the survivors are deduplicated and bucket-balanced across strategies before sampling, yielding a robust test suite.
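The filter, deduplicate, and balance step could look roughly like the sketch below, assuming each generated case is tagged with the strategy that produced it; the function and parameter names are hypothetical.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

def build_test_suite(
    raw_cases: List[Tuple[str, str]],   # (strategy_name, input_text) pairs from the generator
    is_valid: Callable[[str], bool],    # the validator selected in the previous stage
    per_bucket: int,
    seed: int = 0,
) -> List[str]:
    """Filter invalid cases, deduplicate, then balance across strategy buckets before sampling."""
    buckets: Dict[str, List[str]] = defaultdict(list)
    seen = set()
    for strategy, text in raw_cases:
        if text in seen or not is_valid(text):
            continue                    # drop duplicates and validator-rejected inputs
        seen.add(text)
        buckets[strategy].append(text)

    rng = random.Random(seed)
    suite: List[str] = []
    for strategy, cases in buckets.items():
        k = min(per_bucket, len(cases))  # take an equal share from each strategy bucket
        suite.extend(rng.sample(cases, k))
    return suite
```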

Checker (verdict logic)

The Checker compares contestant outputs with the reference solution under complex, protocol-aware rules. AutoCode generates 40 checker scenarios and three candidate checker programs, keeps only scenarios with validator-approved inputs, and selects the checker with the highest accuracy on these labeled scenarios.
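The selection logic can be sketched as follows, assuming each candidate checker is callable as checker(input, reference_output, contestant_output) and each scenario carries an expected accept/reject label; this signature is an assumption for illustration, not the paper's interface.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CheckerScenario:
    test_input: str        # problem input for this scenario
    reference_output: str  # output of the trusted reference solution
    contestant_output: str
    expected_ok: bool      # label: should this contestant output be accepted?

def select_best_checker(
    candidates: List[Callable[[str, str, str], bool]],  # checker(input, ref_out, contestant_out) -> accept?
    scenarios: List[CheckerScenario],
    is_valid: Callable[[str], bool],                    # validator from the earlier stage
):
    """Keep only scenarios whose inputs pass the validator, then pick the most accurate checker."""
    usable = [s for s in scenarios if is_valid(s.test_input)]

    def accuracy(checker) -> float:
        hits = sum(
            checker(s.test_input, s.reference_output, s.contestant_output) == s.expected_ok
            for s in usable
        )
        return hits / max(len(usable), 1)

    return max(candidates, key=accuracy)
```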

Interactor (handling interactive tasks)

For interactive problems, AutoCode uses a mutant-based interactor: it creates small logical edits (“mutants”) to the reference solution and selects interactors that accept the true solution but reject the mutants. This maximizes discrimination and fills a gap left by many public datasets, which omit interactive problems entirely.
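A rough sketch of mutant-based selection, under the assumption that an interactor can be modeled as a callable that runs an interactive session with a given solution and reports whether the session ended in acceptance (the Interactor signature here is hypothetical):

```python
from typing import Callable, List

# Assumed model: an interactor drives a dialogue with a solution program on a test
# and returns True if it judges that solution correct on that test.
Interactor = Callable[[str, str], bool]  # (solution_source, test_input) -> accepted?

def select_interactor(
    candidates: List[Interactor],
    reference_solution: str,
    mutants: List[str],        # copies of the reference with small logical edits injected
    test_inputs: List[str],
) -> Interactor:
    """Prefer interactors that accept the reference everywhere and reject as many mutants as possible."""
    def score(interactor: Interactor) -> int:
        # An interactor that ever rejects the true solution is disqualified.
        if not all(interactor(reference_solution, t) for t in test_inputs):
            return -1
        # Otherwise, count mutant runs it correctly rejects (higher = more discriminating).
        return sum(not interactor(m, t) for m in mutants for t in test_inputs)
    return max(candidates, key=score)
```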

Dual verification and generating new problems

AutoCode can also create novel problem variants from a seed Codeforces problem (<2200 Elo). The LLM drafts a new statement and two solutions: an efficient reference solution and a simpler brute-force baseline. A problem is accepted only if the reference output matches the brute-force output across the generated test suite (the brute-force may TLE on large tests but serves as ground truth on small/exhaustive cases). This dual-verification protocol filters out roughly 27% of error-prone items and raises reference-solution correctness from 86% to 94% before human review.
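A simplified sketch of the dual-verification gate follows. It assumes both solutions are runnable commands and that exact string comparison suffices to compare outputs (in general the selected checker would do the comparison); the names and time limits are illustrative.

```python
import subprocess
from typing import List, Optional

def run_solution(cmd: List[str], test_input: str, time_limit: float) -> Optional[str]:
    """Run a solution command on one input; return its stdout, or None on timeout/crash."""
    try:
        result = subprocess.run(
            cmd, input=test_input, capture_output=True, text=True, timeout=time_limit
        )
        return result.stdout.strip() if result.returncode == 0 else None
    except subprocess.TimeoutExpired:
        return None

def dual_verify(
    reference_cmd: List[str],
    brute_force_cmd: List[str],
    test_suite: List[str],
    time_limit: float = 2.0,
) -> bool:
    """Accept a generated problem only if reference and brute force agree wherever both finish."""
    for test_input in test_suite:
        ref_out = run_solution(reference_cmd, test_input, time_limit)
        if ref_out is None:
            return False          # the reference itself must pass every test
        bf_out = run_solution(brute_force_cmd, test_input, time_limit)
        if bf_out is None:
            continue              # the brute force may TLE on large tests; skip those
        if ref_out != bf_out:
            return False          # disagreement on a small case: reject the problem
    return True
```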

Human experts then grade surviving items for solvability, correctness, quality, novelty, and difficulty. After filtering, 61.6% of generated items are usable for model training, 76.3% are useful for human training, and 3.2% reach ICPC/IOI-level difficulty. Difficulty tends to increase relative to the seed, and perceived quality correlates with difficulty gains.

Results and performance

On a benchmark of 7,538 existing problems (with 195,988 human submissions), AutoCode achieves 91.1% consistency with official judge decisions, with a 3.7% false-positive rate (FPR) and 14.1% false-negative rate (FNR). This outperforms prior generators like CodeContests, CodeContests+, TACO, and HardTests, which reached 72.9–81.0% consistency on the same set.
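For reference on how such numbers can be computed, here is one way to derive consistency, FPR, and FNR from per-submission verdict pairs. The denominators are an assumption about the paper's convention (FPR measured over officially rejected submissions, FNR over officially accepted ones).

```python
from typing import List, Tuple

def judge_metrics(verdicts: List[Tuple[bool, bool]]) -> dict:
    """verdicts: (framework_accepts, official_accepts) pairs, one per human submission."""
    consistency = sum(f == o for f, o in verdicts) / len(verdicts)
    official_rejected = [f for f, o in verdicts if not o]  # official verdict was a rejection
    official_accepted = [f for f, o in verdicts if o]      # official verdict was Accepted
    fpr = sum(official_rejected) / max(len(official_rejected), 1)                  # framework wrongly accepts
    fnr = sum(not f for f in official_accepted) / max(len(official_accepted), 1)   # framework wrongly rejects
    return {"consistency": consistency, "fpr": fpr, "fnr": fnr}
```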

On a separate set of 720 recent Codeforces problems (including interactives), the full framework reports 98.7% consistency with official judgments, 1.3% FPR, and 1.2% FNR. Ablation studies show that all three generator strategies and prompt optimization contribute: removing prompt optimization reduces consistency to 98.0% and more than doubles FNR to 2.9%.

Implications for evaluation and benchmarking

AutoCode couples a Validator–Generator–Checker (plus Interactor) loop with dual verification (reference vs. brute-force) to build contest-grade test suites and new problems. By standardizing constraint legality, adversarial coverage, and protocol-aware judging, it reduces both false positives and false negatives, reaching 91.1% consistency with official judges on the large benchmark of existing problems and 98.7% on recent Codeforces problems, including interactives. This makes evaluation signals cleaner and more robust for training and benchmarking code-reasoning models.

The mutant-based interactor and dual-verification pipeline are practical enhancements that address known shortcomings of public benchmarks and enable automated generation of high-quality training and contest items. The AutoCode approach aligns with efforts like LiveCodeBench Pro to create more contamination-resistant, expert-checked evaluation suites.

References and resources

See the AutoCode paper and project resources for full technical details and reproducible code: https://arxiv.org/pdf/2510.12803