DeepSeek's 3B OCR: Compress Pages into Vision Tokens for Near-Lossless Document Parsing
What DeepSeek-OCR Does
DeepSeek-AI released DeepSeek-OCR-3B, a vision-language model (VLM) built for end-to-end OCR and structured document parsing. The system compresses page images into a compact set of vision tokens and then decodes those tokens with a 3B parameter Mixture-of-Experts (MoE) language decoder. The main idea is simple: represent text optically as compact vision tokens, dramatically reducing decoder sequence length while retaining most of the textual information.
Architecture and What’s New
DeepSeek-OCR-3B has two primary components: DeepEncoder and DeepSeek3B-MoE-A570M. DeepEncoder is a vision encoder optimized for high-resolution inputs with low activation cost and a small number of output tokens. Its pipeline includes a windowed attention stage for local perception inspired by SAM, a two-layer convolutional compressor that achieves 16× token downsampling, and a dense global attention stage based on CLIP to aggregate visual knowledge. This combination keeps activation memory controlled at high resolutions while minimizing vision token count.
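To make the 16× compression concrete, here is a minimal, illustrative sketch of the token arithmetic, not the released implementation: a ViT-style patchifier produces patch tokens, a two-layer stride-2 convolutional compressor cuts the token count 16×, and the two attention stages are indicated only as placeholders. The layer sizes are assumptions; note that 1024×1024 input with 16×16 patches yields 4096 patch tokens, which the compressor reduces to the 256 tokens of Base mode.

```python
# Illustrative sketch of DeepEncoder's token compression, not the released code.
# A ViT-style patchifier yields patch tokens; two stride-2 convolutions give a
# 4x4 = 16x reduction in token count. Attention stages are placeholders.
import torch
import torch.nn as nn

class ConvCompressor(nn.Module):
    """Two stride-2 convolutions: each halves the grid, so tokens drop by 16x."""
    def __init__(self, dim: int):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1)

    def forward(self, tokens: torch.Tensor, h: int, w: int) -> torch.Tensor:
        b, n, c = tokens.shape                        # (B, h*w, C) patch tokens
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = self.conv2(self.conv1(x))                 # (B, C, h/4, w/4)
        return x.flatten(2).transpose(1, 2)           # (B, h*w/16, C)

dim, patch = 768, 16                                  # assumed sizes
image = torch.randn(1, 3, 1024, 1024)                 # Base mode resolution
patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
tokens = patchify(image).flatten(2).transpose(1, 2)   # 64*64 = 4096 patch tokens
# ... windowed (SAM-style) local attention would run here on the 4096 tokens ...
vision_tokens = ConvCompressor(dim)(tokens, 64, 64)   # 256 tokens, as in Base mode
# ... dense (CLIP-style) global attention would run here on the 256 tokens ...
print(tokens.shape, "->", vision_tokens.shape)
```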
The decoder, DeepSeek3B-MoE-A570M, is a 3B parameter Mixture-of-Experts model with roughly 570M active parameters per token. The MoE design helps scale representational power while controlling compute per token through expert selection.
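For intuition on how expert selection caps compute per token, the sketch below shows generic top-k MoE routing; the expert count, hidden size, and top-k value are placeholders, not DeepSeek3B-MoE-A570M's actual configuration.

```python
# Generic top-k MoE feed-forward sketch (placeholder sizes, not DeepSeek3B-MoE's
# configuration). Each token is routed to a few experts, so active parameters per
# token stay small even though total parameters grow with the number of experts.
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    def __init__(self, dim=1024, hidden=2816, n_experts=16, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (tokens, dim)
        gates = self.router(x).softmax(dim=-1)
        weights, chosen = gates.topk(self.top_k, dim=-1)    # per-token expert picks
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in chosen[:, slot].unique().tolist():
                mask = chosen[:, slot] == e                 # tokens routed to expert e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out

print(MoEFeedForward()(torch.randn(8, 1024)).shape)         # torch.Size([8, 1024])
```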
Multi-Resolution Modes and Token Budgeting
DeepEncoder ships with native token modes and dynamic modes to match different document and budget needs. Native modes include:
- Tiny: 64 tokens at 512×512
- Small: 100 tokens at 640×640
- Base: 256 tokens at 1024×1024
- Large: 400 tokens at 1280×1280
Dynamic modes named Gundam and Gundam-Master combine tiled local views with a global view: Gundam yields n×100 plus 256 tokens, while Gundam-Master yields n×256 plus 400 tokens, with n ranging from 2 to 9. For padded modes, the team provides a formula for the number of valid tokens that depends on the aspect ratio and yields an effective count lower than the raw figures above. These modes let practitioners align token budgets with page complexity.
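As a quick budgeting aid, the sketch below encodes the mode arithmetic just described; the aspect-ratio correction for padded modes is not reproduced, so treat these counts as raw upper bounds.

```python
# Rough token-budget helper based on the mode arithmetic above (illustrative).
# The padded-mode correction that depends on aspect ratio is not reproduced here.
NATIVE_MODES = {"tiny": 64, "small": 100, "base": 256, "large": 400}

def gundam_tokens(n_tiles: int, master: bool = False) -> int:
    """n local tiles plus one global view; the paper reports n between 2 and 9."""
    if not 2 <= n_tiles <= 9:
        raise ValueError("n should be between 2 and 9")
    return n_tiles * 256 + 400 if master else n_tiles * 100 + 256

print(NATIVE_MODES["small"])          # 100 tokens per page
print(gundam_tokens(4))               # 656 tokens (Gundam, 4 tiles)
print(gundam_tokens(4, master=True))  # 1424 tokens (Gundam-Master, 4 tiles)
```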
Compression and Benchmark Results
On the Fox benchmark, DeepSeek reports high decoding precision when text tokens are compressed into relatively few vision tokens. With 100 vision tokens:
- Pages containing 600–700 text tokens reach 98.5% precision at about 6.7× compression.
- Pages containing 900–1000 text tokens reach 96.8% precision at about 9.7× compression.
With only 64 vision tokens, precision drops as compression increases; for example, pages with 1200–1300 text tokens decode at about 59.1% precision at roughly 19.7× compression. These numbers come directly from the tables in the technical report.
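For reference, the compression ratio in these results is simply text tokens divided by vision tokens; the token counts below are chosen to reproduce the reported ratios and are not exact figures from the tables.

```python
# Compression ratio = text tokens / vision tokens (illustrative token counts
# chosen to reproduce the reported ratios, not exact figures from the tables).
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    return text_tokens / vision_tokens

print(round(compression_ratio(670, 100), 1))    # ~6.7x  -> ~98.5% precision
print(round(compression_ratio(970, 100), 1))    # ~9.7x  -> ~96.8% precision
print(round(compression_ratio(1260, 64), 1))    # ~19.7x -> ~59.1% precision
```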
On OmniDocBench, the team reports that DeepSeek-OCR outperforms GOT-OCR 2.0 while using only 100 vision tokens per page, and outperforms MinerU 2.0 with fewer than 800 vision tokens, even though MinerU uses over 6,000 tokens per page on average. The benchmark section reports overall performance primarily as edit distance.
Training and Throughput
Training followed a two-phase pipeline: first, DeepEncoder was trained with next-token prediction on OCR 1.0 and OCR 2.0 data plus 100M LAION samples; then the full encoder-decoder system was trained with pipeline parallelism across 4 partitions. The reported setup used 20 nodes with eight A100 40GB GPUs each and the AdamW optimizer. The team reports training speeds of about 90B text-only tokens per day and 70B multimodal tokens per day. In production, they report generating more than 200k pages per day on a single A100-40G GPU.
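As a rough sanity check on these figures, the arithmetic below converts the reported cluster-level throughput into per-GPU rates; it is back-of-the-envelope only.

```python
# Back-of-the-envelope per-GPU rates implied by the reported throughput figures.
gpus = 20 * 8                      # 20 nodes x 8 A100-40G GPUs
seconds_per_day = 24 * 60 * 60

text_rate = 90e9 / gpus / seconds_per_day       # ~6.5k text tokens/s per GPU
mm_rate = 70e9 / gpus / seconds_per_day         # ~5.1k multimodal tokens/s per GPU
pages_per_second = 200_000 / seconds_per_day    # ~2.3 pages/s per A100-40G in production

print(f"{text_rate:.0f} text tok/s, {mm_rate:.0f} mm tok/s, {pages_per_second:.1f} pages/s")
```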
How to Evaluate in Your Stack
For typical reports and books, start with Small mode (100 tokens) and raise the token budget only if the edit distance is unacceptable; a minimal evaluation sketch follows below. For dense, small fonts or pages with extremely high token counts, use a Gundam mode to mix tiled local views with a global view and control the token budget explicitly. For documents containing charts, tables, or chemical structures, consult the “Deep parsing” qualitative section of the paper, which shows conversions to HTML tables, SMILES, and structured geometry, and design outputs that are easy to validate.
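Here is a minimal sketch of that edit-distance gate; the file paths and the 0.05 threshold are placeholders, not values from the release.

```python
# Minimal evaluation sketch: normalized edit distance between model output and a
# reference transcription, used as a gate for raising the token budget.
# File paths and the threshold are placeholders.
def edit_distance(a: str, b: str) -> int:
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution / match
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, ref: str) -> float:
    return edit_distance(pred, ref) / max(len(ref), 1)

pred = open("pred_page.md").read()   # model output for one page
ref = open("ref_page.md").read()     # trusted reference transcription
ned = normalized_edit_distance(pred, ref)
if ned > 0.05:                       # placeholder threshold; tune per document class
    print(f"NED={ned:.3f}: consider a larger token mode (Base/Large/Gundam)")
```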
Practical Takeaways
DeepSeek-OCR operationalizes optical context compression: pages become compact optical carriers that cut decoder sequence length while retaining most of the textual information. The model reports near-lossless decoding at around 10× compression (about 97% precision on Fox) and degrades to roughly 60% precision at ~20× compression. The release includes explicit token-budget modes, a 3B MoE decoder with a DeepEncoder front end, and a tested Hugging Face setup with Python 3.12.9, CUDA 11.8, PyTorch 2.6.0, Transformers 4.46.3, Tokenizers 0.20.3, and FlashAttention 2.7.3, which simplifies setup and evaluation for engineers.
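For a starting point under that pinned environment, a minimal loading sketch is shown below; the model id and the repo-specific inference entry point are assumptions, so follow the Hugging Face model card for exact usage.

```python
# Minimal loading sketch under the pinned environment above (transformers 4.46.3,
# torch 2.6.0, flash-attn 2.7.3). The model id is an assumption; check the model card.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,                   # model code ships with the repository
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires flash-attn 2.7.3
).eval().cuda()
# Inference goes through the repository's custom entry point (see the model card),
# which takes a prompt, an image path, and a resolution/mode setting.
```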
For details and reproducibility, consult the technical paper and the Hugging Face and GitHub repositories linked from the project page.