HunyuanOCR: Tencent's 1B-Parameter End-to-End OCR Vision-Language Model
Tencent Hunyuan released HunyuanOCR, a 1B-parameter end-to-end OCR vision-language model that runs spotting, parsing, information extraction, VQA, and translation in one prompt-driven pipeline and matches larger models on core OCR benchmarks.
HunyuanOCR: a compact end-to-end OCR VLM
HunyuanOCR is a 1-billion-parameter vision-language model from Tencent Hunyuan designed specifically for optical character recognition and document understanding. Built on Hunyuan's native multimodal architecture, it performs spotting, parsing, information extraction, visual question answering, and text-image translation in a single end-to-end pipeline.
Native-resolution encoder and lightweight language model
HunyuanOCR links a native-resolution visual encoder called Hunyuan ViT, an Adaptive MLP Connector, and a lightweight language model. The encoder extends SigLIP-v2-400M to accept arbitrary input resolutions, using adaptive patching that preserves the aspect ratio. Together with global attention, this patching improves recognition on long text lines, lengthy documents, and low-quality scans.
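The exact patching scheme is not spelled out here, but a minimal sketch of aspect-ratio-preserving adaptive patching, assuming a hypothetical patch size and visual-token budget, could look like this:

```python
import math

def adaptive_patch_grid(width: int, height: int,
                        patch_size: int = 16,
                        max_patches: int = 4096) -> tuple[int, int]:
    """Illustrative sketch: choose a patch grid that preserves the image's
    aspect ratio while capping the number of visual tokens.
    patch_size and max_patches are assumed values, not published ones."""
    # Patch counts at native resolution.
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    # If the native grid exceeds the token budget, scale both axes down
    # by the same factor so the aspect ratio is preserved.
    if cols * rows > max_patches:
        scale = math.sqrt(max_patches / (cols * rows))
        cols = max(1, math.floor(cols * scale))
        rows = max(1, math.floor(rows * scale))
    return rows, cols

# A wide receipt scan keeps its elongated shape instead of being squashed
# into a fixed square grid.
print(adaptive_patch_grid(3000, 600))   # (28, 142)
```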
The Adaptive MLP Connector applies learnable pooling across spatial tokens, compressing dense visual tokens while keeping details from text-rich regions. That lowers sequence length and compute for the language model while preserving OCR-relevant information.
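A minimal PyTorch sketch of a learnable-pooling connector of this kind; the pooling window, hidden sizes, and weighting scheme are illustrative assumptions, not the released architecture:

```python
import torch
import torch.nn as nn

class AdaptiveMLPConnector(nn.Module):
    """Sketch of a pooling + MLP connector: compress spatial tokens with
    learnable window weights, then project into the language model's space."""

    def __init__(self, vision_dim: int = 1152, llm_dim: int = 1024, pool: int = 2):
        super().__init__()
        self.pool = pool
        # Learnable weights over each pool x pool window instead of plain averaging.
        self.pool_weights = nn.Parameter(torch.ones(pool * pool) / (pool * pool))
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (batch, h * w, vision_dim) visual tokens from the encoder.
        b, _, d = x.shape
        x = x.view(b, h // self.pool, self.pool, w // self.pool, self.pool, d)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, -1, self.pool * self.pool, d)
        # Weighted pooling: each window is compressed to a single token.
        w_soft = torch.softmax(self.pool_weights, dim=0)
        x = (x * w_soft.view(1, 1, -1, 1)).sum(dim=2)
        return self.mlp(x)   # (batch, h*w / pool^2, llm_dim)

tokens = torch.randn(1, 32 * 32, 1152)        # 1024 visual tokens
out = AdaptiveMLPConnector()(tokens, 32, 32)  # -> (1, 256, 1024)
```

A 2x2 pooling window cuts the visual sequence length by a factor of four, which is where the compute saving for the language model comes from in this sketch.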
The language model is derived from Hunyuan 0.5B and uses XD RoPE, which splits the rotary position embedding into four subspaces for token order, height, width, and time. This positional factorization lets the same architecture handle multi-column pages, cross-page flows, and sequences of video frames.
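A hedged sketch of the idea behind a multi-axis rotary embedding: the head dimension is split into four subspaces and each is rotated by one position coordinate. The equal split, subspace sizes, and frequency base below are assumptions for illustration, not the published layout:

```python
import torch

def xd_rope(q: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply a rotary embedding per axis: token order, height, width, time."""
    # q: (seq, head_dim); positions: (seq, 4) integer coordinates per token.
    seq, head_dim = q.shape
    sub = head_dim // 4                       # one subspace per axis (assumed equal split)
    out = []
    for axis in range(4):
        qa = q[:, axis * sub:(axis + 1) * sub]
        half = sub // 2
        freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
        angles = positions[:, axis:axis + 1].float() * freqs   # (seq, half)
        cos, sin = angles.cos(), angles.sin()
        q1, q2 = qa[:, :half], qa[:, half:]
        out.append(torch.cat([q1 * cos - q2 * sin, q1 * sin + q2 * cos], dim=-1))
    return torch.cat(out, dim=-1)

# Token at sequence index 5, grid row 3, column 7, video frame 0.
q = torch.randn(1, 64)
pos = torch.tensor([[5, 3, 7, 0]])
print(xd_rope(q, pos).shape)   # torch.Size([1, 64])
```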
End-to-end training and prompt-driven tasks
Training and inference are fully end-to-end: there is no external layout analysis or post-processing model. All tasks are expressed as natural language prompts and handled in a single forward pass, removing error propagation between pipeline stages and simplifying deployment.
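A minimal usage sketch of the prompt-driven interface, assuming a Hugging Face-style API; the model identifier, prompt wording, and processor behaviour below are assumptions rather than the official recipe:

```python
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image

model_id = "tencent/HunyuanOCR"   # assumed identifier
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("invoice.png")
# Each task is just a different natural-language prompt into the same model.
prompts = {
    "spotting": "Detect all text in the image and return boxes with transcriptions.",
    "parsing": "Convert this document image to Markdown.",
    "extraction": "Extract the invoice number, date, and total amount as JSON.",
    "vqa": "What is the name of the vendor on this invoice?",
    "translation": "Translate all text in this image into English.",
}

for task, prompt in prompts.items():
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=512)
    print(task, processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```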
Large multilingual dataset and staged pre-training
The data pipeline builds over 200 million image-text pairs across nine real-world scenarios including documents, street views, ads, handwriting, screenshots, cards and invoices, game interfaces, video frames, and artistic typography, covering more than 130 languages. Synthetic data augmentation simulates right-to-left scripts, paragraph rendering, font and color variation, warping, blur, and lighting changes to mimic mobile captures.
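An illustrative sketch of the kind of photometric and geometric augmentation described above, using Pillow; the parameter ranges are arbitrary choices, not the values used to build the corpus:

```python
import random
from PIL import Image, ImageEnhance, ImageFilter

def augment(img: Image.Image) -> Image.Image:
    # Lighting change: random brightness shift to mimic uneven mobile captures.
    img = ImageEnhance.Brightness(img).enhance(random.uniform(0.6, 1.4))
    # Blur: simulate mild defocus or motion.
    img = img.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.0, 1.5)))
    # Mild perspective warp via a quad transform (corner jitter of a few percent).
    w, h = img.size
    def jitter() -> float:
        return random.uniform(-0.03, 0.03)
    quad = (w * jitter(), h * jitter(),                # upper-left
            w * jitter(), h * (1 + jitter()),          # lower-left
            w * (1 + jitter()), h * (1 + jitter()),    # lower-right
            w * (1 + jitter()), h * jitter())          # upper-right
    return img.transform((w, h), Image.Transform.QUAD, quad,
                         resample=Image.Resampling.BILINEAR)
```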
Pre-training uses a four-stage recipe: vision-language alignment, multimodal pre-training, long-context pre-training with up to 32k context, and application-oriented supervised fine-tuning. The team then applies reinforcement learning with structured rewards.
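As a hedged sketch, the staged recipe can be pictured as a schedule over training phases; the stage names follow the text, but the trainable-module splits and context lengths other than the 32k figure are hypothetical:

```python
# Hypothetical staged-training schedule; only the stage names and the 32k
# long-context figure come from the description above.
STAGES = [
    {"name": "vision-language alignment",     "context": 4096},
    {"name": "multimodal pre-training",       "context": 4096},
    {"name": "long-context pre-training",     "context": 32768},
    {"name": "application-oriented SFT",      "context": 32768},
]

for i, stage in enumerate(STAGES, 1):
    print(f"stage {i}: {stage['name']} (context {stage['context']})")
```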
Reinforcement learning with verifiable rewards
After supervised fine-tuning, HunyuanOCR is optimized with Group Relative Policy Optimization and a Reinforcement Learning with Verifiable Rewards setup. For spotting, rewards combine intersection-over-union of boxes with normalized edit distance on text. Document parsing uses normalized edit distance between generated structures and references. VQA employs a binary semantic match reward from an LLM judge, while translation uses COMET-style scores normalized to a 0 to 1 range. The framework enforces strict output formats and penalizes schema violations, encouraging valid structured outputs and JSON where required.
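A minimal sketch of a verifiable spotting reward that combines box IoU with normalized edit distance on the transcription; the equal weighting and one-to-one matching here are assumptions, not the published reward:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def edit_distance(s, t):
    """Plain Levenshtein distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def spotting_reward(pred_box, pred_text, gt_box, gt_text):
    ned = edit_distance(pred_text, gt_text) / max(len(pred_text), len(gt_text), 1)
    # Both terms lie in [0, 1]; average them so geometry and text count equally.
    return 0.5 * iou(pred_box, gt_box) + 0.5 * (1.0 - ned)

print(spotting_reward((10, 10, 110, 40), "Invoice No. 42",
                      (12, 11, 108, 42), "Invoice No. 42"))   # ~0.94
```

Because each of these rewards is computed programmatically against a reference (boxes, structures, or a judged answer), the policy can be optimized without a learned reward model, which is the point of the verifiable-rewards setup.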
Benchmark performance
Despite its 1B parameter size, HunyuanOCR matches or surpasses much larger VLMs on OCR-centric tasks. On an internal 900-image spotting benchmark it scored 70.92, outperforming pipeline methods and general VLMs including Gemini 2.5 Pro and Qwen3 VL variants. It achieves 94.10 on OmniDocBench and 860 on OCRBench, setting state-of-the-art results among models under 3B parameters. Other reported results include strong scores on DocML, card and receipt extraction, video subtitle extraction, and document translation benchmarks.
Why this matters
HunyuanOCR demonstrates that compact, task-specialized vision-language models can be practical for production. By combining native-resolution vision encoding, an adapter connector, long-context capability, and reinforcement learning with verifiable rewards, Tencent produced a single instruction-driven model that handles spotting, parsing, information extraction, VQA, and translation for over 100 languages while remaining efficient enough for real-world deployment.