Cache-to-Cache (C2C): LLMs Communicate Directly Through KV-Cache Fusion

Cache-to-Cache (C2C) lets LLMs exchange semantic information via KV-Cache fusion, improving average accuracy by about 8.5–10.5 points over individual models and 3–5 points over text-based pipelines, while roughly halving latency.

Why text-based LLM collaboration is limited

Current multi-LLM systems mostly communicate by generating and reading natural language. One model writes an explanation or a hint, the other reads it as context. That design creates three practical costs: internal activations are compressed into short text, natural language adds ambiguity and loses structural signals, and token-by-token decoding inflates latency for long analytical exchanges.

Oracle experiments that validate KV-Cache as a channel

The authors ran two oracle-style studies to test whether layer activations in the KV-Cache can carry useful semantic signals between models.

Cache enrichment oracle

They compared three prefill strategies on multiple-choice benchmarks: Direct (prefill on the question only), Few-shot (prefill on exemplars plus question), and Oracle (prefill on exemplars plus question, then discard the exemplar segments and keep only the question-aligned slice of the cache, so the cache length matches Direct). Oracle improved accuracy from 58.42% to 62.34% at the same cache length, while Few-shot reached 63.39%. This shows that enriching the question-aligned slice of the KV-Cache alone, without adding tokens, improves performance. Layer-wise analysis found that enriching selected layers performs better than enriching all layers, motivating selective injection.
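A minimal sketch of this slicing, using mock tensors rather than any specific model API; the cache layout, lengths, and helper names below are illustrative assumptions:

```python
import torch

num_layers, batch, heads, head_dim = 4, 1, 8, 64
exemplar_len, question_len = 96, 32                      # few-shot prefix + question

def prefill(seq_len):
    """Stand-in for a model prefill: one (K, V) pair per layer."""
    return [
        (torch.randn(batch, heads, seq_len, head_dim),
         torch.randn(batch, heads, seq_len, head_dim))
        for _ in range(num_layers)
    ]

# Few-shot prefill: the cache covers exemplars + question.
fewshot_cache = prefill(exemplar_len + question_len)

# Oracle: discard the exemplar segment and keep only the question-aligned slice,
# so the cache length matches a Direct (question-only) prefill.
oracle_cache = [
    (k[:, :, -question_len:, :], v[:, :, -question_len:, :])
    for k, v in fewshot_cache
]

assert oracle_cache[0][0].shape[2] == question_len       # same length as Direct
# Decoding then conditions on oracle_cache; the accuracy gain over Direct shows
# that the question-aligned slice itself carries richer semantics.
```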

Cache transformation oracle

They trained a three-layer MLP to map KV-Cache vectors from Qwen3 4B into the space of Qwen3 0.6B. t-SNE visualizations indicate the transformed cache lies inside the target cache manifold, though in a subregion. This confirms that cache representations can be projected between models, making KV-Cache a viable medium for cross-model communication.
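A toy version of that projector, with illustrative dimensions and hidden width rather than the actual Qwen3 head sizes:

```python
import torch
import torch.nn as nn

sharer_dim, receiver_dim, hidden = 128, 64, 256          # assumed per-head dims

# Three-layer MLP that maps per-token KV vectors from the Sharer's space
# into the Receiver's space.
projector = nn.Sequential(
    nn.Linear(sharer_dim, hidden),
    nn.SiLU(),
    nn.Linear(hidden, hidden),
    nn.SiLU(),
    nn.Linear(hidden, receiver_dim),
)

sharer_kv = torch.randn(1, 8, 32, sharer_dim)            # (batch, heads, seq, dim)
mapped_kv = projector(sharer_kv)                          # now lives in receiver space
print(mapped_kv.shape)                                    # torch.Size([1, 8, 32, 64])
```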

Cache-to-Cache (C2C) paradigm

C2C frames communication as direct semantic transfer between a Sharer model and a Receiver model via KV-Cache fusion. During prefill, both models read the same input and each produces its layer-wise KV-Cache. For each Receiver layer, C2C selects a mapped Sharer layer and applies a C2C Fuser to produce a fused cache. During decoding, the Receiver conditions token prediction on this fused cache instead of its original one.
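A high-level sketch of that flow under assumed cache shapes; `layer_map` and `fuse` are placeholders for the learned components described in the next section:

```python
import torch

def c2c_prefill(sharer_cache, receiver_cache, layer_map, fuse):
    """sharer_cache / receiver_cache: list of (K, V) tensors, one pair per layer."""
    fused_cache = []
    for r_idx, (rk, rv) in enumerate(receiver_cache):
        s_idx = layer_map[r_idx]                 # mapped Sharer layer for this Receiver layer
        sk, sv = sharer_cache[s_idx]
        fused_cache.append((fuse(sk, rk), fuse(sv, rv)))
    return fused_cache                           # the Receiver decodes against this cache

# Toy usage with a trivial averaging "fuser" as a stand-in for the learned module.
L, B, H, T, D = 4, 1, 8, 16, 64
mk = lambda: [(torch.randn(B, H, T, D), torch.randn(B, H, T, D)) for _ in range(L)]
fused = c2c_prefill(mk(), mk(), layer_map=list(range(L)), fuse=lambda s, r: 0.5 * (s + r))
```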

C2C Fuser architecture

The fuser integrates Sharer and Receiver caches using a residual integration principle and consists of three modules (sketched in code after the list):

  • Projection module: concatenates Sharer and Receiver KV vectors, applies a projection layer, then a feature fusion layer.
  • Dynamic weighting module: modulates attention heads based on input so some heads rely more on Sharer information.
  • Learnable gate: a per-layer gate decides whether to inject Sharer context into that layer. It uses a Gumbel sigmoid during training and becomes binary at inference.
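A hedged sketch of one such fuser layer, with illustrative module shapes and a simplified Gumbel-sigmoid gate; the exact forms in the paper may differ:

```python
import torch
import torch.nn as nn

class C2CFuserLayer(nn.Module):
    def __init__(self, head_dim: int, tau: float = 1.0):
        super().__init__()
        self.projection = nn.Linear(2 * head_dim, head_dim)    # concat Sharer+Receiver, project
        self.feature_fusion = nn.Linear(head_dim, head_dim)    # feature fusion layer
        self.head_weight = nn.Linear(head_dim, 1)               # input-dependent per-head weighting
        self.gate_logit = nn.Parameter(torch.zeros(1))          # per-layer learnable gate
        self.tau = tau

    def gate(self) -> torch.Tensor:
        if self.training:                                        # Gumbel-sigmoid relaxation
            u = torch.rand_like(self.gate_logit).clamp(1e-6, 1 - 1e-6)
            noise = torch.log(u) - torch.log1p(-u)               # logistic noise
            return torch.sigmoid((self.gate_logit + noise) / self.tau)
        return (self.gate_logit > 0).float()                     # hard 0/1 at inference

    def forward(self, sharer_kv: torch.Tensor, receiver_kv: torch.Tensor) -> torch.Tensor:
        # sharer_kv / receiver_kv: (batch, heads, seq_len, head_dim), already length-aligned
        delta = self.feature_fusion(self.projection(torch.cat([sharer_kv, receiver_kv], dim=-1)))
        head_scale = torch.sigmoid(self.head_weight(receiver_kv))   # which heads lean on the Sharer
        return receiver_kv + self.gate() * head_scale * delta       # residual integration

fuser = C2CFuserLayer(head_dim=64)
out = fuser(torch.randn(1, 8, 16, 64), torch.randn(1, 8, 16, 64))
print(out.shape)  # torch.Size([1, 8, 16, 64])
```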

C2C also handles cross-family and cross-size setups via token alignment (decode Receiver tokens to strings, re-encode with Sharer tokenizer, and choose Sharer tokens with maximal string coverage) and a terminal layer alignment strategy that pairs top layers first and walks backwards until the shallower model is covered.
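The terminal alignment itself reduces to a small index-mapping routine; the sketch below is one assumed reading of "pair top layers first and walk backwards":

```python
def terminal_layer_map(num_receiver_layers: int, num_sharer_layers: int) -> dict[int, int]:
    """Map each covered Receiver layer index to its paired Sharer layer index."""
    depth = min(num_receiver_layers, num_sharer_layers)
    return {
        num_receiver_layers - 1 - i: num_sharer_layers - 1 - i
        for i in range(depth)
    }

# e.g. a 28-layer Receiver with a 24-layer Sharer: the top 24 Receiver layers are
# paired top-down; the 4 shallowest Receiver layers keep their original cache.
print(terminal_layer_map(28, 24))
```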

Training setup and evaluation

Both LLMs remain frozen during training; only the C2C module is learned. Training minimizes next-token prediction loss on Receiver outputs. Main C2C fusers were trained on the first 500k samples of the OpenHermes2.5 dataset and evaluated on OpenBookQA, ARC Challenge, MMLU Redux and C-Eval.
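A rough outline of such a training step; `receiver`, `sharer`, `fuser`, and their `prefill`/`decode` methods are placeholder objects for illustration, not the paper's actual code:

```python
import torch

def train_step(receiver, sharer, fuser, optimizer, input_ids, labels):
    with torch.no_grad():                                    # both LLMs stay frozen
        sharer_cache = sharer.prefill(input_ids)
        receiver_cache = receiver.prefill(input_ids)
    # Only the fuser has trainable parameters; gradients reach it through the fused cache.
    fused_cache = fuser(sharer_cache, receiver_cache)
    logits = receiver.decode(input_ids, past_key_values=fused_cache)
    loss = torch.nn.functional.cross_entropy(                # next-token prediction loss
        logits[:, :-1].reshape(-1, logits.size(-1)),          # predict token t+1 from t
        labels[:, 1:].reshape(-1),
    )
    optimizer.zero_grad()
    loss.backward()                                           # updates fuser parameters only
    optimizer.step()
    return loss.item()
```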

Results: accuracy and latency gains

Across many Sharer/Receiver pairs built from Qwen2.5, Qwen3, Llama 3.2 and Gemma3, C2C consistently improves Receiver accuracy and reduces latency.

  • C2C yields about 8.5 to 10.5 percentage points higher average accuracy than individual models.
  • C2C outperforms text-to-text communication by about 3.0 to 5.0 percentage points on average.
  • C2C delivers roughly a 2x reduction in latency compared to text-based collaboration, and more in some setups.

A concrete example: Qwen3 0.6B as Receiver and Qwen2.5 0.5B as Sharer. On MMLU Redux, Receiver alone: 35.53%, text-to-text: 41.03%, C2C: 42.92%. Average time per query for text-to-text is 1.52 units, while C2C is close to the single model at 0.40. Similar improvements appear on OpenBookQA, ARC Challenge and C-Eval. On LongBenchV1, C2C outperforms text communication across sequence length buckets, preserving gains even for long contexts.

What this means for multi-LLM systems

C2C reframes multi-LLM communication as a direct semantic transfer problem rather than a prompt engineering problem. By projecting and fusing KV-Cache between models with a neural fuser and learnable gating, C2C preserves deep, specialized semantics while avoiding the information loss and latency of intermediate text. The approach is a practical systems-level step toward KV-native collaboration between models with measurable accuracy and speed benefits.

Where to look next

The paper includes implementation details, experimental results and links to code. C2C opens avenues for more advanced cache alignment, adaptive gating strategies and hybrid text-plus-cache communication for tasks that still need human-readable traces.
