Context-Folding LLM Agent: Scaling Long-Horizon Reasoning with Memory Compression and Tools
Overview
This tutorial shows how to build a Context-Folding LLM Agent that manages limited context efficiently to solve long, complex tasks. The agent decomposes a large task into smaller subtasks, uses tools (for example, a calculator) when needed, and folds completed sub-trajectories into concise summaries. The folding mechanism preserves essential knowledge while keeping the active memory small.
Local LLM setup
The agent runs a lightweight Hugging Face model locally, making it usable in environments such as Google Colab without external API calls. The code below loads the model and wraps it in a simple generation function.
import os, re, sys, math, random, json, textwrap, subprocess, shutil, time
from typing import List, Dict, Tuple

try:
    import transformers
except ImportError:
    subprocess.run([sys.executable, "-m", "pip", "install", "-q", "transformers", "accelerate", "sentencepiece"], check=True)

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline

MODEL_NAME = os.environ.get("CF_MODEL", "google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
llm = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device_map="auto")

def llm_gen(prompt: str, max_new_tokens=160, temperature=0.0) -> str:
    # Greedy decoding when temperature is 0; sampling otherwise.
    out = llm(prompt, max_new_tokens=max_new_tokens, do_sample=temperature > 0.0, temperature=temperature)[0]["generated_text"]
    return out.strip()
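As an optional smoke test (not part of the original code), you can call llm_gen on a toy prompt once the pipeline loads; flan-t5-small is a very small model, so expect terse and imperfect output.
# Optional smoke test: assumes the model above downloaded successfully.
print(llm_gen("List two benefits of summarizing completed subtasks:", max_new_tokens=48))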
Calculator tool and folding memory
A small, safe calculator handles arithmetic operations requested by the agent; it evaluates only whitelisted AST operators, so arbitrary code cannot be executed. The FoldingMemory class collects active context entries and, when the active context exceeds a length threshold, moves the oldest entries into folded summaries. Folded summaries are concise lines that record past sub-trajectory outcomes.
import ast, operator as op

# Whitelisted arithmetic operators for the safe calculator
OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv, ast.Pow: op.pow, ast.USub: op.neg, ast.FloorDiv: op.floordiv, ast.Mod: op.mod}

def _eval_node(n):
    # Numeric literal (ast.Constant replaces the deprecated ast.Num)
    if isinstance(n, ast.Constant) and isinstance(n.value, (int, float)):
        return n.value
    if isinstance(n, ast.UnaryOp) and type(n.op) in OPS:
        return OPS[type(n.op)](_eval_node(n.operand))
    if isinstance(n, ast.BinOp) and type(n.op) in OPS:
        return OPS[type(n.op)](_eval_node(n.left), _eval_node(n.right))
    raise ValueError("Unsafe expression")

def calc(expr: str):
    node = ast.parse(expr, mode='eval').body
    return _eval_node(node)
class FoldingMemory:
    def __init__(self, max_chars: int = 800):
        self.active = []; self.folds = []; self.max_chars = max_chars

    def add(self, text: str):
        self.active.append(text.strip())
        # Fold the oldest entries once the active context exceeds the threshold.
        while len(self.active_text()) > self.max_chars and len(self.active) > 1:
            popped = self.active.pop(0)
            self.folds.append(f"- Folded: {popped[:120]}...")

    def fold_in(self, summary: str): self.folds.append(summary.strip())
    def active_text(self) -> str: return "\n".join(self.active)
    def folded_text(self) -> str: return "\n".join(self.folds)
    def snapshot(self) -> Dict: return {"active_chars": len(self.active_text()), "n_folds": len(self.folds)}
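To see the folding behavior in isolation, the short snippet below (not in the original tutorial) exercises calc and FoldingMemory with an artificially low max_chars so a fold triggers immediately.
mem = FoldingMemory(max_chars=60)  # tiny threshold to force folding quickly
mem.add("TASK: compute the project total")
mem.add("SUBTASK: sum the line items")
mem.add(f"RESULT: {calc('799.99 + 149.5 + 23.75')}")
print(mem.snapshot())     # expect n_folds == 1 once the threshold is crossed
print(mem.folded_text())  # the oldest entry appears here as a '- Folded: ...' line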
Prompt templates and planning strategy
The system defines several prompt templates: one to decompose a task into subtasks, a solver prompt for handling a subtask (producing either a one-line calculation marker or an ANSWER), a summarizer prompt that condenses a subtask outcome into a few bullets, and a final synthesis prompt that uses folded summaries to produce the final output.
SUBTASK_DECOMP_PROMPT="""You are an expert planner. Decompose the task below into 2-4 crisp subtasks.
Return each subtask as a bullet starting with '- ' in priority order.
Task: "{task}" """
SUBTASK_SOLVER_PROMPT="""You are a precise problem solver with minimal steps.
If a calculation is needed, write one line 'CALC(expr)'.
Otherwise write 'ANSWER: <final>'.
Think briefly; avoid chit-chat.
Task: {task}
Subtask: {subtask}
Notes (folded context):
{notes}
Now respond with either CALC(...) or ANSWER: ..."""
SUBTASK_SUMMARY_PROMPT="""Summarize the subtask outcome in <=3 bullets, total <=50 tokens.
Subtask: {name}
Steps:
{trace}
Final: {final}
Return only bullets starting with '- '."""
FINAL_SYNTH_PROMPT="""You are a senior agent. Synthesize a final, coherent solution using ONLY:
- The original task
- Folded summaries (below)
Avoid repeating steps. Be concise and actionable.
Task: {task}
Folded summaries:
{folds}
Final answer:"""
def parse_bullets(text: str) -> List[str]:
    # Extract '- ' bullet lines, tolerating leading whitespace.
    return [ln.strip()[2:].strip() for ln in text.splitlines() if ln.strip().startswith("- ")]
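For reference, parse_bullets is what turns the planner's output into a subtask list. The decomposition below is a hypothetical model output used purely for illustration; real output depends on the model.
toy_plan = "- Draft the schedule outline\n- Add workout blocks\n- Add meal slots"
print(parse_bullets(toy_plan))  # ['Draft the schedule outline', 'Add workout blocks', 'Add meal slots']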
Running subtasks with tools and folding
The run_subtask function orchestrates iterative solver calls: it detects when the model requests a calculation via the ‘CALC(expr)’ pattern, evaluates the expression with the local calc tool, and feeds the result back to the model to obtain a final ‘ANSWER:’. That final answer is then summarized with SUBTASK_SUMMARY_PROMPT and folded into memory.
def run_subtask(task: str, subtask: str, memory: FoldingMemory, max_tool_iters: int = 3) -> Tuple[str, str, List[str]]:
    notes = memory.folded_text() or "(none)"
    trace = []; final = ""
    for _ in range(max_tool_iters):
        prompt = SUBTASK_SOLVER_PROMPT.format(task=task, subtask=subtask, notes=notes)
        out = llm_gen(prompt, max_new_tokens=96); trace.append(out)
        m = re.search(r"CALC\((.+?)\)", out)
        if m:
            # The model asked for a calculation: run the safe calc tool and feed the result back.
            try:
                val = calc(m.group(1))
                trace.append(f"TOOL:CALC -> {val}")
                out2 = llm_gen(prompt + f"\nTool result: {val}\nNow produce 'ANSWER: ...' only.", max_new_tokens=64)
                trace.append(out2)
                if out2.strip().startswith("ANSWER:"):
                    final = out2.split("ANSWER:", 1)[1].strip(); break
            except Exception as e:
                trace.append(f"TOOL:CALC ERROR -> {e}")
        if out.strip().startswith("ANSWER:"):
            final = out.split("ANSWER:", 1)[1].strip(); break
    if not final:
        final = "No definitive answer; partial reasoning:\n" + "\n".join(trace[-2:])
    # Compress the subtask trace into <=3 bullets that will be folded into memory.
    summ = llm_gen(SUBTASK_SUMMARY_PROMPT.format(name=subtask, trace="\n".join(trace), final=final), max_new_tokens=80)
    summary_bullets = "\n".join(parse_bullets(summ)[:3]) or f"- {subtask}: {final[:60]}..."
    return final, summary_bullets, trace
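A single subtask can also be exercised on its own, which is useful when tuning the solver prompt. The quick check below is not part of the original demo, and its output quality depends on the model.
mem = FoldingMemory(max_chars=700)
final, fold, trace = run_subtask(
    task="Compute a small project budget",
    subtask="Sum the three line items: 799.99, 149.5, 23.75",
    memory=mem,
)
print("FINAL:", final)
print("FOLD:\n" + fold)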
class ContextFoldingAgent:
    def __init__(self, max_active_chars: int = 800):
        self.memory = FoldingMemory(max_chars=max_active_chars)
        # Only "subtasks" is updated below; the other counters are placeholders for extension.
        self.metrics = {"subtasks": 0, "tool_calls": 0, "chars_saved_est": 0}

    def decompose(self, task: str) -> List[str]:
        plan = llm_gen(SUBTASK_DECOMP_PROMPT.format(task=task), max_new_tokens=96)
        subs = parse_bullets(plan)
        return subs[:4] if subs else ["Main solution"]

    def run(self, task: str) -> Dict:
        t0 = time.time()
        self.memory.add(f"TASK: {task}")
        subtasks = self.decompose(task)
        self.metrics["subtasks"] = len(subtasks)
        folded = []
        for st in subtasks:
            self.memory.add(f"SUBTASK: {st}")
            final, fold_summary, trace = run_subtask(task, st, self.memory)
            self.memory.fold_in(fold_summary)  # keep only the compact summary
            folded.append(f"- {st}: {final}")
            self.memory.add(f"SUBTASK_DONE: {st}")
        # Final synthesis sees only the folded summaries, never the full traces.
        final = llm_gen(FINAL_SYNTH_PROMPT.format(task=task, folds=self.memory.folded_text()), max_new_tokens=200)
        t1 = time.time()
        return {"task": task, "final": final.strip(), "folded_summaries": self.memory.folded_text(),
                "active_context_chars": len(self.memory.active_text()),
                "subtask_finals": folded, "runtime_sec": round(t1 - t0, 2)}
Demo tasks and usage
The example demo runs two tasks: planning a 3-day study schedule and computing a small project budget. The main script creates the agent, runs the tasks, and prints folded summaries, final answers, and diagnostics so you can observe the memory-folding process and runtime metrics.
DEMO_TASKS=[
"Plan a 3-day study schedule for ML with daily workouts and simple meals; include time blocks.",
"Compute a small project budget with 3 items (laptop 799.99, course 149.5, snacks 23.75), add 8% tax and 5% buffer, and present a one-paragraph recommendation."
]
def pretty(d): return json.dumps(d, indent=2, ensure_ascii=False)
if __name__=="__main__":
agent=ContextFoldingAgent(max_active_chars=700)
for i,task in enumerate(DEMO_TASKS,1):
print("="*70)
print(f"DEMO #{i}: {task}")
res=agent.run(task)
print("\n--- Folded Summaries ---\n"+(res["folded_summaries"] or "(none)"))
print("\n--- Final Answer ---\n"+res["final"])
print("\n--- Diagnostics ---")
diag={k:res[k] for k in ["active_context_chars","runtime_sec"]}
diag["n_subtasks"]=len(agent.decompose(task))
print(pretty(diag))
Observations
Context folding enables the agent to iterate across multiple reasoning steps without blowing up the active prompt size. By summarizing completed subtasks into compact bullets and appending them to a folded memory, later subtasks can consult a distilled history instead of reprocessing long traces. This pattern combines task decomposition, limited tool integration, and memory compression to scale long-horizon reasoning in a lightweight, reproducible way.
Resources and next steps
The repository referenced by the original code includes full examples, notebooks, and a paper. You can adapt the FoldingMemory thresholds, prompt templates, or tool set (adding web queries, file access, or domain calculators) to fit other workflows or larger models.
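As one illustration of extending the tool set, the sketch below shows a minimal tool registry: run_subtask could regex for NAME(...) markers and dispatch here instead of hard-coding CALC. The CONVERT_KM_MI tool and the dispatch_tool helper are hypothetical, added here only for illustration; they are not part of the original code.
def convert_km_to_miles(arg: str) -> float:
    # Hypothetical domain tool: kilometers -> miles
    return float(arg) * 0.621371

TOOLS = {
    "CALC": calc,                          # the existing safe calculator
    "CONVERT_KM_MI": convert_km_to_miles,  # hypothetical additional tool
}

def dispatch_tool(marker: str, arg: str):
    # Look up a tool by marker name and apply it to the raw argument string.
    if marker not in TOOLS:
        raise ValueError(f"Unknown tool: {marker}")
    return TOOLS[marker](arg)

# Example: dispatch_tool("CALC", "2*(3+4)") -> 14; dispatch_tool("CONVERT_KM_MI", "5") -> ~3.11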