Baidu Launches PaddleOCR-VL (0.9B): NaViT-Style Vision-Language Model for Fast Multilingual Document Parsing
What PaddleOCR-VL is aiming to solve
Converting dense, multilingual documents with complex layouts, small scripts, formulas, charts and handwriting into faithful structured outputs has been a persistent challenge. PaddleOCR-VL is a 0.9B-parameter vision-language model from Baidu's PaddlePaddle team, built to parse full documents into structured Markdown and JSON while keeping inference latency and memory use practical for deployment.
Two-stage pipeline for stability and speed
PaddleOCR-VL is deployed as a two-stage pipeline. The first stage, PP-DocLayoutV2, performs page-level layout analysis: an RT-DETR detector localizes and classifies regions, and a pointer network predicts their reading order. The second stage, PaddleOCR-VL-0.9B, performs element-level recognition conditioned on the detected layout. Outputs from both stages are then aggregated into Markdown and JSON for downstream consumption, as sketched below.
This decoupled design reduces the long-sequence decoding latency and instability that end-to-end vision-language models often face on dense, multi-column and mixed text–graphic pages.
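To make the division of labor concrete, here is a minimal Python sketch of how such a two-stage flow could be orchestrated. The class and function names (Region, detect, order, recognize) are hypothetical placeholders for illustration, not the actual PaddleOCR API.

```python
# Illustrative sketch of the two-stage flow; the names used here are
# hypothetical, not the released PaddleOCR interface.
from dataclasses import dataclass

@dataclass
class Region:
    bbox: tuple          # (x0, y0, x1, y1) in page coordinates
    label: str           # e.g. "text", "table", "formula", "chart"
    order: int           # reading-order index from the pointer network

def parse_page(page_image, layout_model, recognizer):
    # Stage 1: PP-DocLayoutV2 - detect and classify regions (RT-DETR),
    # then predict reading order (pointer network).
    regions = layout_model.detect(page_image)
    regions = layout_model.order(regions)

    # Stage 2: PaddleOCR-VL-0.9B - recognize each element independently,
    # which keeps decoded sequences short and stable.
    elements = []
    for region in sorted(regions, key=lambda r: r.order):
        crop = page_image.crop(region.bbox)
        elements.append({
            "label": region.label,
            "bbox": region.bbox,
            "content": recognizer.recognize(crop, task=region.label),
        })

    # Aggregate element-level outputs into Markdown / JSON.
    markdown = "\n\n".join(e["content"] for e in elements)
    return {"markdown": markdown, "elements": elements}
```

Because each element is decoded on its own, no single generation has to cover an entire dense page, which is the latency and stability trade the decoupled design is making.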
Model architecture and design choices
At its core, PaddleOCR-VL-0.9B pairs a NaViT-style dynamic high-resolution encoder, which relies on native-resolution sequence packing rather than destructive resizing or heavy tiling, with a 2-layer MLP projector and the ERNIE-4.5-0.3B language model as the decoder. 3D-RoPE provides the positional representation.
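The projector is the thin bridge between vision and language: it maps packed visual tokens into the decoder's embedding space. Below is a minimal PyTorch-style sketch of that wiring; the class name and all dimensions are illustrative assumptions, not the released configuration.

```python
# Minimal sketch of the encoder-to-decoder bridge described above.
# Dimensions are assumed for illustration only.
import torch
import torch.nn as nn

class VisionToLanguageBridge(nn.Module):
    def __init__(self, vision_dim=1152, hidden_dim=2048, lm_dim=1024):
        super().__init__()
        # 2-layer MLP projector mapping packed visual tokens into the
        # language decoder's embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, lm_dim),
        )

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_patches, vision_dim) from the
        # NaViT-style encoder; the output feeds the decoder.
        return self.projector(visual_tokens)

bridge = VisionToLanguageBridge()
packed = torch.randn(1, 3600, 1152)   # packed native-resolution patch tokens
lm_inputs = bridge(packed)            # (1, 3600, 1024)
```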
The NaViT-style encoder processes variable-resolution inputs in a patch-and-pack fashion, preserving the typography cues and visual detail that matter for small scripts, formulas and handwriting. According to the technical report, native-resolution processing helps reduce hallucinations and improves performance on text-dense content compared with fixed-resize approaches.
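The patch-and-pack idea can be shown with a short conceptual sketch: each image keeps its native resolution, is split into fixed-size patches, and patches from all images are concatenated into one packed sequence with per-image ids. This mirrors the general NaViT recipe only; it is not the model's actual preprocessing code.

```python
# Conceptual patch-and-pack sketch: no resizing to a common resolution.
import torch

def patchify(image: torch.Tensor, patch: int = 14) -> torch.Tensor:
    # image: (C, H, W) with H and W already multiples of `patch`
    c, h, w = image.shape
    patches = image.unfold(1, patch, patch).unfold(2, patch, patch)
    return (patches.reshape(c, -1, patch * patch)
                   .permute(1, 0, 2)
                   .reshape(-1, c * patch * patch))

def pack(images):
    tokens, image_ids = [], []
    for idx, img in enumerate(images):
        p = patchify(img)
        tokens.append(p)
        # Per-image ids let attention be restricted to each image later.
        image_ids.append(torch.full((p.shape[0],), idx))
    return torch.cat(tokens), torch.cat(image_ids)

# Two inputs with different native resolutions and aspect ratios.
imgs = [torch.randn(3, 28, 42), torch.randn(3, 56, 28)]
seq, ids = pack(imgs)   # seq: (total_patches, 3*14*14), ids: (total_patches,)
```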
Benchmarks and evaluations
PaddleOCR-VL reports state-of-the-art results on OmniDocBench v1.5 and competitive or leading scores on v1.0. It shows strength across overall quality metrics and sub-tasks such as text edit distance, Formula-CDM, Table-TEDS/TEDS-S and reading-order edit distance. Complementary strengths are reported on olmOCR-Bench and on internal evaluations of handwriting, tables, formulas and charts.
Practical implications for production
The combination of a NaViT-style encoder and a lightweight ERNIE-4.5-0.3B decoder aims to deliver high accuracy at practical inference cost. The two-stage PP-DocLayoutV2 → PaddleOCR-VL-0.9B approach stabilizes reading order and preserves native typography cues, which matters for small scripts and complex page elements across the 109 supported languages. Structured Markdown/JSON outputs and optional acceleration with vLLM or SGLang make the system operationally suitable for production document intelligence.
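For teams planning a deployment, one common pattern is to serve a vision-language model behind vLLM's OpenAI-compatible endpoint and send page crops as chat messages. The snippet below is a hedged illustration of that pattern: the model identifier, port and prompt are assumptions, and the official PaddleOCR documentation remains the source of truth for the supported serving recipe.

```python
# Hedged example: querying a locally served instance through vLLM's
# OpenAI-compatible API. Model id, port and prompt are assumptions.
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("page_crop.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="PaddlePaddle/PaddleOCR-VL",   # assumed model id
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "OCR this region and return Markdown."},
        ],
    }],
    temperature=0.0,
)
print(response.choices[0].message.content)
```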
Resources and further reading
The technical report and model details are available from Baidu's publication page: https://ernie.baidu.com/blog/publication/PaddleOCR-VL_Technical_Report.pdf. The release notes also point to a Hugging Face model, GitHub tutorials and community channels for additional examples and deployment guidance.