Context-Folding LLM Agent: Scaling Long-Horizon Reasoning with Memory Compression and Tools

Overview

This tutorial shows how to build a Context-Folding LLM Agent that manages limited context efficiently to solve long, complex tasks. The agent decomposes a large task into smaller subtasks, uses tools (for example, a calculator) when needed, and folds completed sub-trajectories into concise summaries. The folding mechanism preserves essential knowledge while keeping the active memory small.

Local LLM setup

The agent runs with a lightweight Hugging Face model locally, making it usable in environments like Google Colab without external API calls. The code below demonstrates loading a model and wrapping it in a simple generation function.

import os, re, sys, json, subprocess, time
from typing import List, Dict, Tuple
try:
   import transformers  # only checking that the dependency is available
except ImportError:
   # Install dependencies on first use (e.g., in a fresh Colab runtime).
   subprocess.run([sys.executable, "-m", "pip", "install", "-q", "transformers", "accelerate", "sentencepiece"], check=True)
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
MODEL_NAME = os.environ.get("CF_MODEL", "google/flan-t5-small")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
llm = pipeline("text2text-generation", model=model, tokenizer=tokenizer, device_map="auto")
def llm_gen(prompt: str, max_new_tokens=160, temperature=0.0) -> str:
   # Decode greedily at temperature 0; only pass temperature when sampling.
   sample_kwargs = {"temperature": temperature} if temperature > 0.0 else {}
   out = llm(prompt, max_new_tokens=max_new_tokens, do_sample=temperature > 0.0, **sample_kwargs)[0]["generated_text"]
   return out.strip()
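
As a quick smoke test, you can call the wrapper directly; the exact wording of the output depends on the checkpoint, and flan-t5-small in particular tends to answer tersely.

# Quick smoke test for the local generation wrapper (output varies by checkpoint).
print(llm_gen("Answer in one short sentence: why summarize finished subtasks?"))
print(llm_gen("List two study tips as bullets starting with '- '.", max_new_tokens=64))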

Calculator tool and folding memory

A small, restricted calculator handles the arithmetic the agent requests: expressions are parsed with ast and only basic numeric operations are evaluated. The FoldingMemory class collects active context entries and, once the active context grows past a character threshold, folds older entries into short summary lines that stand in for past sub-trajectory outcomes.

import ast, operator as op
OPS = {ast.Add: op.add, ast.Sub: op.sub, ast.Mult: op.mul, ast.Div: op.truediv, ast.Pow: op.pow, ast.USub: op.neg, ast.FloorDiv: op.floordiv, ast.Mod: op.mod}
def _eval_node(n):
   if isinstance(n, ast.Constant) and isinstance(n.value, (int, float)): return n.value  # numbers only (ast.Num is deprecated)
   if isinstance(n, ast.UnaryOp) and type(n.op) in OPS: return OPS[type(n.op)](_eval_node(n.operand))
   if isinstance(n, ast.BinOp) and type(n.op) in OPS: return OPS[type(n.op)](_eval_node(n.left), _eval_node(n.right))
   raise ValueError("Unsafe expression")
def calc(expr: str):
   node = ast.parse(expr, mode='eval').body
   return _eval_node(node)
class FoldingMemory:
   def __init__(self, max_chars:int=800):
       self.active=[]; self.folds=[]; self.max_chars=max_chars
   def add(self,text:str):
       self.active.append(text.strip())
       while len(self.active_text())>self.max_chars and len(self.active)>1:
           popped=self.active.pop(0)
           fold=f"- Folded: {popped[:120]}..."
           self.folds.append(fold)
   def fold_in(self,summary:str): self.folds.append(summary.strip())
   def active_text(self)->str: return "\n".join(self.active)
   def folded_text(self)->str: return "\n".join(self.folds)
   def snapshot(self)->Dict: return {"active_chars":len(self.active_text()),"n_folds":len(self.folds)}
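
Before wiring these into the agent, it helps to exercise them in isolation. The snippet below is a minimal sketch: the numbers and the tiny max_chars value are arbitrary, chosen only to force a fold.

# Exercise the calculator and the folding threshold in isolation.
print(calc("799.99 + 149.5 + 23.75"))    # subtotal of the three demo items (about 973.24)
mem = FoldingMemory(max_chars=60)         # deliberately tiny threshold
mem.add("STEP 1: gathered the three line items from the task description")
mem.add("STEP 2: computed the pre-tax subtotal with the calculator tool")
print(mem.snapshot())                     # the oldest entry has been folded
print(mem.folded_text())                  # "- Folded: STEP 1: ..."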

Prompt templates and planning strategy

The system defines several prompt templates: one to decompose a task into subtasks, a solver prompt for handling a subtask (producing either a one-line calculation marker or an ANSWER), a summarizer prompt that condenses a subtask outcome into a few bullets, and a final synthesis prompt that uses folded summaries to produce the final output.

SUBTASK_DECOMP_PROMPT="""You are an expert planner. Decompose the task below into 2-4 crisp subtasks.
Return each subtask as a bullet starting with '- ' in priority order.
Task: "{task}" """
SUBTASK_SOLVER_PROMPT="""You are a precise problem solver with minimal steps.
If a calculation is needed, write one line 'CALC(expr)'.
Otherwise write 'ANSWER: <final>'.
Think briefly; avoid chit-chat.


Task: {task}
Subtask: {subtask}
Notes (folded context):
{notes}


Now respond with either CALC(...) or ANSWER: ..."""
SUBTASK_SUMMARY_PROMPT="""Summarize the subtask outcome in <=3 bullets, total <=50 tokens.
Subtask: {name}
Steps:
{trace}
Final: {final}
Return only bullets starting with '- '."""
FINAL_SYNTH_PROMPT="""You are a senior agent. Synthesize a final, coherent solution using ONLY:
- The original task
- Folded summaries (below)
Avoid repeating steps. Be concise and actionable.


Task: {task}
Folded summaries:
{folds}


Final answer:"""
def parse_bullets(text:str)->List[str]:
   return [ln[2:].strip() for ln in text.splitlines() if ln.strip().startswith("- ")]
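
You can run the decomposition prompt by hand to see what the planner produces. With a small checkpoint the bullets can be rough or missing entirely, which is why the agent later falls back to a single default subtask when parsing returns nothing.

# parse_bullets is deterministic, so it is easy to test directly.
print(parse_bullets("- research options\n- draft the plan\nnot a bullet"))   # ['research options', 'draft the plan']

# The planner output itself varies from model to model.
plan = llm_gen(SUBTASK_DECOMP_PROMPT.format(task="Plan a weekend coding sprint"), max_new_tokens=96)
print(plan)
print(parse_bullets(plan))   # may be [] if the model skipped the '- ' format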

Running subtasks with tools and folding

The run_subtask function orchestrates iterative solver calls, detects when the model requests a calculation via the 'CALC(expr)' pattern, executes the calculation with the local calc tool, and then feeds the tool result back to the model to obtain a final 'ANSWER:'. The resulting final answer is summarized with the SUBTASK_SUMMARY_PROMPT and folded into memory.

def run_subtask(task:str, subtask:str, memory:FoldingMemory, max_tool_iters:int=3)->Tuple[str,str,List[str]]:
   notes=(memory.folded_text() or "(none)")
   trace=[]; final=""
   for _ in range(max_tool_iters):
       prompt=SUBTASK_SOLVER_PROMPT.format(task=task,subtask=subtask,notes=notes)
       out=llm_gen(prompt,max_new_tokens=96); trace.append(out)
       m=re.search(r"CALC\((.+?)\)",out)
       if m:
           try:
               val=calc(m.group(1))
               trace.append(f"TOOL:CALC -> {val}")
               out2=llm_gen(prompt+f"\nTool result: {val}\nNow produce 'ANSWER: ...' only.",max_new_tokens=64)
               trace.append(out2)
               if out2.strip().startswith("ANSWER:"):
                   final=out2.split("ANSWER:",1)[1].strip(); break
           except Exception as e:
               trace.append(f"TOOL:CALC ERROR -> {e}")
       if out.strip().startswith("ANSWER:"):
           final=out.split("ANSWER:",1)[1].strip(); break
   if not final:
       final="No definitive answer; partial reasoning:\n"+"\n".join(trace[-2:])
   summ=llm_gen(SUBTASK_SUMMARY_PROMPT.format(name=subtask,trace="\n".join(trace),final=final),max_new_tokens=80)
   summary_bullets="\n".join(parse_bullets(summ)[:3]) or f"- {subtask}: {final[:60]}..."
   return final, summary_bullets, trace
class ContextFoldingAgent:
   def __init__(self,max_active_chars:int=800):
       self.memory=FoldingMemory(max_chars=max_active_chars)
       self.metrics={"subtasks":0,"tool_calls":0,"chars_saved_est":0}
   def decompose(self,task:str)->List[str]:
       plan=llm_gen(SUBTASK_DECOMP_PROMPT.format(task=task),max_new_tokens=96)
       subs=parse_bullets(plan)
       return subs[:4] if subs else ["Main solution"]
   def run(self,task:str)->Dict:
       t0=time.time()
       self.memory.add(f"TASK: {task}")
       subtasks=self.decompose(task)
       self.metrics["subtasks"]=len(subtasks)
       folded=[]
       for st in subtasks:
           self.memory.add(f"SUBTASK: {st}")
           final,fold_summary,trace=run_subtask(task,st,self.memory)
           self.memory.fold_in(fold_summary)
           folded.append(f"- {st}: {final}")
           self.memory.add(f"SUBTASK_DONE: {st}")
       final=llm_gen(FINAL_SYNTH_PROMPT.format(task=task,folds=self.memory.folded_text()),max_new_tokens=200)
       t1=time.time()
       return {"task":task,"final":final.strip(),"folded_summaries":self.memory.folded_text(),
               "active_context_chars":len(self.memory.active_text()),
               "subtask_finals":folded,"runtime_sec":round(t1-t0,2)}

Demo tasks and usage

The demo runs two tasks: planning a 3-day study schedule and computing a small project budget. The main script creates a fresh agent for each task, runs it, and prints the folded summaries, the final answer, and diagnostics so you can observe the memory-folding process and the runtime metrics.

DEMO_TASKS=[
   "Plan a 3-day study schedule for ML with daily workouts and simple meals; include time blocks.",
   "Compute a small project budget with 3 items (laptop 799.99, course 149.5, snacks 23.75), add 8% tax and 5% buffer, and present a one-paragraph recommendation."
]
def pretty(d): return json.dumps(d, indent=2, ensure_ascii=False)
if __name__=="__main__":
   for i,task in enumerate(DEMO_TASKS,1):
       agent=ContextFoldingAgent(max_active_chars=700)  # fresh agent (and memory) for each task
       print("="*70)
       print(f"DEMO #{i}: {task}")
       res=agent.run(task)
       print("\n--- Folded Summaries ---\n"+(res["folded_summaries"] or "(none)"))
       print("\n--- Final Answer ---\n"+res["final"])
       print("\n--- Diagnostics ---")
       diag={k:res[k] for k in ["active_context_chars","runtime_sec"]}
       diag["n_subtasks"]=agent.metrics["subtasks"]  # count recorded during the run
       print(pretty(diag))

Observations

Context folding enables the agent to iterate across multiple reasoning steps without blowing up the active prompt size. By summarizing completed subtasks into compact bullets and appending them to a folded memory, later subtasks can consult a distilled history instead of reprocessing long traces. This pattern combines task decomposition, limited tool integration, and memory compression to scale long-horizon reasoning in a lightweight, reproducible way.
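
A rough way to see the compression concretely is to compare how much text stays in the active window against what was folded; a quick sketch using the agent defined above (the numbers will vary by model and run):

# Compare what stays "hot" in the active window with what was folded away.
agent = ContextFoldingAgent(max_active_chars=700)
res = agent.run(DEMO_TASKS[0])
print("active context chars:", res["active_context_chars"])
print("folded summary chars:", len(res["folded_summaries"]))
print(agent.memory.snapshot())   # e.g. {'active_chars': ..., 'n_folds': ...}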

Resources and next steps

The repository referenced by the original code includes full examples, notebooks, and a paper. You can adapt the FoldingMemory thresholds, the prompt templates, or the tool set (adding web queries, file access, or domain-specific calculators) to fit other workflows or larger models; one possible shape for such a tool extension is sketched below.
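
As a starting point, the sketch below generalizes the single CALC marker into a small tool registry. The names TOOLS and run_tool are illustrative and not part of the code above, and any new tool would also need to be advertised in SUBTASK_SOLVER_PROMPT so the model knows it may call it.

# Hypothetical extension: a registry so additional tools sit next to CALC.
TOOLS = {
   "CALC": lambda arg: calc(arg),          # arithmetic, as before
   "WORDS": lambda arg: len(arg.split()),  # toy domain-specific tool: count words
}

def run_tool(model_output: str):
   # Dispatch on the first TOOLNAME(...) marker found in the model output.
   m = re.search(r"([A-Z]+)\((.+?)\)", model_output)
   if m and m.group(1) in TOOLS:
       return TOOLS[m.group(1)](m.group(2))
   return None

print(run_tool("CALC(12 * 9 + 5)"))         # 113
print(run_tool("WORDS(fold the context)"))  # 3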