Baidu Launches PaddleOCR-VL (0.9B): NaViT-Style Vision-Language Model for Fast Multilingual Document Parsing

What PaddleOCR-VL is aiming to solve

Converting dense, multilingual documents with complex layouts, small scripts, formulas, charts and handwriting into faithful structured outputs has been a persistent challenge. PaddleOCR-VL is a 0.9B-parameter vision-language model from Baidu's PaddlePaddle team built to perform end-to-end document parsing that outputs structured Markdown and JSON while keeping inference latency and memory practical for deployments.

Two-stage pipeline for stability and speed

PaddleOCR-VL is deployed as a two-stage pipeline. The first stage, PP-DocLayoutV2, performs page-level layout analysis: an RT-DETR detector localizes and classifies regions and a pointer network predicts reading order. The second stage, PaddleOCR-VL-0.9B, conducts element-level recognition conditioned on the detected layout. Final outputs from both stages are aggregated into Markdown and JSON for downstream consumption.
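The control flow of the two stages can be sketched as follows. This is a minimal illustration, not PaddleOCR's API: `Region`, `detect_layout`, `recognize` and `parse_page` are hypothetical names, and both stages are replaced by stand-in functions that return fixed values so that the ordering and aggregation logic can be run on its own.

```python
import json
from dataclasses import dataclass


@dataclass
class Region:
    label: str         # element class, e.g. "title", "text", "table"
    bbox: tuple        # (x0, y0, x1, y1) in page pixels
    order: int         # reading-order index predicted in stage 1
    content: str = ""  # filled in by stage 2


def detect_layout(page):
    """Stand-in for stage 1 (layout detection plus reading order).

    Hypothetical fixed output; a real detector would inspect `page`.
    Regions are deliberately returned out of reading order.
    """
    return [
        Region("text", (40, 120, 560, 300), order=1),
        Region("title", (40, 40, 560, 90), order=0),
    ]


def recognize(page, region):
    """Stand-in for stage 2 (element-level recognition of one region)."""
    return f"<{region.label} content>"


def parse_page(page, detect=detect_layout, recognize_fn=recognize):
    # Stage 1: find regions, then put them in predicted reading order.
    regions = sorted(detect(page), key=lambda r: r.order)
    # Stage 2: recognize each region independently of the others.
    for region in regions:
        region.content = recognize_fn(page, region)
    # Aggregate the per-region results into Markdown and JSON.
    markdown = "\n\n".join(
        "# " + r.content if r.label == "title" else r.content
        for r in regions
    )
    as_json = json.dumps(
        [
            {"label": r.label, "bbox": list(r.bbox),
             "order": r.order, "content": r.content}
            for r in regions
        ]
    )
    return markdown, as_json


markdown, as_json = parse_page(page=None)
```

Because each region is recognized on its own, the decoder in this sketch only ever produces one element's worth of output at a time, which is the property the decoupled design relies on.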

This decoupled design reduces the long-sequence decoding latency and instability that vision-language models often face when they decode an entire page in a single pass, particularly on dense, multi-column and mixed text–graphic pages.

Model architecture and design choices

At its core, PaddleOCR-VL-0.9B combines three components: a NaViT-style dynamic high-resolution encoder, a 2-layer MLP projector, and the ERNIE-4.5-0.3B language model as the decoder. The encoder uses native-resolution sequence packing rather than destructive resizing or heavy tiling, and 3D-RoPE provides the positional representation.

The NaViT-style encoder processes variable-resolution inputs in patch-and-pack fashion, preserving typography cues and visual detail that matter for small scripts, formulas and handwriting. According to the technical report, native-resolution processing helps reduce hallucinations and improve performance on text-dense content compared with fixed-resize approaches.
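The patch-and-pack idea can be illustrated with a small sketch. It assumes a patch size of 14 pixels and a first-fit-decreasing packing heuristic, neither of which is taken from the technical report; `patch_count` and `pack_sequences` are hypothetical helper names. Each image keeps its own resolution, is cut into a variable number of patches, and several images share one fixed-length token sequence.

```python
import math


def patch_count(width, height, patch=14):
    """Number of patch tokens for an image kept at its native resolution.

    The patch size of 14 is an assumption made for this example.
    """
    return math.ceil(width / patch) * math.ceil(height / patch)


def pack_sequences(image_sizes, max_tokens, patch=14):
    """Group images into packs whose total token count fits max_tokens.

    Uses first-fit-decreasing, a simple bin-packing heuristic chosen
    for illustration only. Returns a list of dicts holding the image
    indices in each pack and the pack's token total.
    """
    counts = [
        (patch_count(w, h, patch), i)
        for i, (w, h) in enumerate(image_sizes)
    ]
    packs = []
    for n_tokens, index in sorted(counts, reverse=True):
        if n_tokens > max_tokens:
            raise ValueError(f"image {index} needs {n_tokens} tokens")
        for pack in packs:
            if pack["tokens"] + n_tokens <= max_tokens:
                pack["images"].append(index)
                pack["tokens"] += n_tokens
                break
        else:
            # No existing pack has room, so open a new one.
            packs.append({"images": [index], "tokens": n_tokens})
    return packs


# Four crops with different sizes and aspect ratios (width, height).
sizes = [(224, 224), (112, 56), (448, 28), (140, 140)]
packs = pack_sequences(sizes, max_tokens=300)
```

In a full encoder, images that share a pack would also need an attention mask so that patches from one image do not attend to patches from another; the sketch covers only the bookkeeping of token counts.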

Benchmarks and evaluations

The team reports state-of-the-art results for PaddleOCR-VL on OmniDocBench v1.5 and competitive or leading scores on v1.0. The model performs well on the overall quality metrics and on sub-task metrics such as text edit distance, Formula-CDM, Table-TEDS/TEDS-S and reading-order edit distance. Strong results are also reported on olmOCR-Bench and on internal evaluations covering handwriting, tables, formulas and charts.
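For readers unfamiliar with edit-distance metrics, the sketch below computes a Levenshtein distance and normalizes it by the longer sequence length. The normalization shown is one common convention and is an assumption here; the exact definition used by OmniDocBench may differ. The function names are illustrative.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, 1):
        cur = [i]
        for j, item_b in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,                       # deletion
                cur[j - 1] + 1,                    # insertion
                prev[j - 1] + (item_a != item_b),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """Edit distance scaled to [0, 1]; lower is better.

    Dividing by the longer length is an assumed convention.
    """
    if not pred and not ref:
        return 0.0
    return levenshtein(pred, ref) / max(len(pred), len(ref))


# Text: compare a predicted string with the reference transcription.
text_score = normalized_edit_distance("kitten", "sitting")

# Reading order: compare predicted and reference region index sequences.
order_score = normalized_edit_distance([0, 2, 1, 3], [0, 1, 2, 3])
```

The same function serves both cases because it only requires sequences whose items can be compared for equality: characters for text, region indices for reading order.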

Practical implications for production

The combination of a NaViT-style encoder and a lightweight ERNIE-4.5-0.3B decoder aims to deliver high accuracy at practical inference cost. The PP-DocLayoutV2 -> PaddleOCR-VL-0.9B two-stage approach stabilizes reading order and preserves native typography cues, which is important for small scripts and complex page elements across 109 supported languages. Structured Markdown/JSON outputs and optional acceleration with vLLM/SGLang make the system operationally suitable for production document intelligence.

Resources and further reading

The technical report and model details are available from Baidu's publication page: https://ernie.baidu.com/blog/publication/PaddleOCR-VL_Technical_Report.pdf

The release notes also point to a Hugging Face model, GitHub tutorials and community channels for additional examples and deployment guidance.