Meta's REFRAG Unlocks 16× Longer Contexts and Up to 31× Faster RAG Decoding
Meta Superintelligence Labs released REFRAG, a decoding framework that compresses retrieved passages to enable 16× longer contexts and up to 30.85× faster time-to-first-token while preserving accuracy.
Meta Superintelligence Labs and collaborators have introduced REFRAG, a decoding framework that dramatically improves retrieval-augmented generation by compressing retrieved passages into compact embeddings. The approach lets decoders see far longer contexts while cutting inference latency by an order of magnitude.
Why long context is costly for LLMs
The attention mechanism used by large language models scales roughly quadratically with input length. Doubling the token count can increase compute and memory by about four times. That makes long-context applications expensive and often impractical in production. In RAG setups, many retrieved passages are only marginally useful, yet the model still pays the full attention cost to process them.
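To make the quadratic growth concrete, here is a rough back-of-the-envelope illustration (not REFRAG code; the byte size and per-head assumptions are hypothetical) of how the attention score matrix alone grows as the context doubles:

```python
# Rough illustration: the attention score matrix is (tokens x tokens),
# so doubling the input roughly quadruples that matrix's memory and compute.
def attention_matrix_bytes(num_tokens: int, bytes_per_entry: int = 2) -> int:
    """Bytes for one attention score matrix (single head, single layer)."""
    return num_tokens * num_tokens * bytes_per_entry

for n in (2_048, 4_096, 8_192):
    print(f"{n:>6} tokens -> {attention_matrix_bytes(n) / 1e6:8.1f} MB per head/layer")
# 2048 tokens ->  ~8.4 MB, 4096 -> ~33.6 MB, 8192 -> ~134.2 MB (about 4x per doubling)
```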
How REFRAG compresses retrieved passages
REFRAG adds a lightweight encoder that splits retrieved documents into fixed-size chunks (for example, 16 tokens) and compresses each chunk into a dense embedding. Instead of feeding thousands of raw tokens to the decoder, it supplies a much shorter sequence of embeddings. The method achieves about a 16× reduction in sequence length without changing the LLM architecture itself. The authors provide further details and experiments in the arXiv paper: https://arxiv.org/pdf/2509.01092
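The sketch below shows the general idea of chunk-level compression. It is a simplified, hypothetical encoder (mean-pooled token embeddings plus a projection), not the paper's actual architecture; only the 16-token chunk size comes from the source.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

CHUNK_SIZE = 16  # tokens per chunk, as in the paper's example

class ChunkCompressor(nn.Module):
    """Hypothetical lightweight encoder: maps each 16-token chunk to one embedding."""
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, passage_ids: torch.Tensor) -> torch.Tensor:
        # passage_ids: (num_tokens,) -> pad to a multiple of CHUNK_SIZE
        pad = (-passage_ids.numel()) % CHUNK_SIZE
        ids = F.pad(passage_ids, (0, pad))
        chunks = ids.view(-1, CHUNK_SIZE)          # (num_chunks, 16)
        pooled = self.tok_emb(chunks).mean(dim=1)  # one pooled vector per chunk
        return self.proj(pooled)                   # (num_chunks, dim)

# A 4,096-token retrieved context becomes 256 chunk embeddings:
# the decoder sees a ~16x shorter input sequence.
```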
How the acceleration is achieved
Shortening the decoder input reduces quadratic attention computation and shrinks the key-value cache needed during decoding. Empirical results show speedups in time-to-first-token (TTFT) on the order of 16.53× at k=16 and up to 30.85× at k=32, far exceeding prior methods such as CEPE. End-to-end throughput also improves, with up to 6.78× gains relative to LLaMA baselines.
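A quick arithmetic sketch shows why the key-value cache shrinks in step with the sequence length. The model dimensions below are hypothetical (LLaMA-7B-like), chosen only to make the ratio visible:

```python
def kv_cache_bytes(seq_len: int, layers: int = 32, heads: int = 32,
                   head_dim: int = 128, bytes_per: int = 2) -> int:
    # Keys + values stored for every token, across all layers and heads.
    return 2 * seq_len * layers * heads * head_dim * bytes_per

full = kv_cache_bytes(4_096)            # raw retrieved tokens
compressed = kv_cache_bytes(4_096 // 16)  # one embedding per 16-token chunk
print(f"full: {full / 1e9:.2f} GB, compressed: {compressed / 1e9:.3f} GB "
      f"({full // compressed}x smaller)")
# full: 2.15 GB, compressed: 0.134 GB (16x smaller)
```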
Preserving accuracy with selective expansion
A reinforcement learning policy guides which compressed chunks can bypass compression and be fed as raw tokens into the decoder. This selective expansion preserves critical details like exact numbers or rare entities that could be lost if everything were compressed. Across multiple benchmarks, REFRAG maintains or improves perplexity compared to CEPE while operating at much lower latency.
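As a minimal sketch of the selective-expansion idea (the scoring policy here is a placeholder; in the paper it is learned with reinforcement learning), a fixed budget of chunks is kept as raw tokens while the rest stay compressed:

```python
import torch

def select_for_expansion(chunk_scores: torch.Tensor, budget: int) -> torch.Tensor:
    """Keep the `budget` highest-scoring chunks as raw tokens.

    `chunk_scores` would come from the learned policy; here it is just a
    hand-written tensor of per-chunk importance scores.
    """
    return torch.topk(chunk_scores, k=min(budget, chunk_scores.numel())).indices

scores = torch.tensor([0.1, 0.9, 0.3, 0.7])   # e.g., chunk 1 contains an exact figure
raw_chunks = select_for_expansion(scores, budget=2)
print(raw_chunks.tolist())  # chunks 1 and 3 bypass compression; the rest stay compressed
```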
Experimental evidence
REFRAG was pretrained on 20 billion tokens from the SlimPajama corpus and evaluated on long-document datasets including Book, ArXiv, PG19, and ProofPile. On RAG benchmarks, multi-turn conversations, and long-document summarization tasks, REFRAG delivered:
- A 16× extension of context beyond standard LLaMA-2 4k windows.
- About 9.3% perplexity improvement over CEPE across four datasets.
- Better robustness when retrievers return many irrelevant passages, since REFRAG can process more passages under the same latency budget.
The authors plan to release code at facebookresearch/refrag on GitHub.
Practical implications
By compressing retrieved passages and selectively expanding the most informative parts, REFRAG makes large-context applications more feasible and efficient. Use cases include analyzing full reports, handling lengthy multi-turn dialogues, and scaling enterprise RAG systems where latency and memory are critical constraints.