Jina-VLM: A Breakthrough in Multilingual Visual QA
Jina AI unveils a 2.4B parameter multilingual vision language model for efficient visual question answering.
Overview
Jina AI has released Jina-VLM, a 2.4B parameter vision language model that targets multilingual visual question answering and document understanding on constrained hardware. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone and uses an attention pooling connector to reduce visual tokens while preserving spatial structure. Among open 2B scale VLMs, it reaches state-of-the-art results on multilingual benchmarks such as MMMB and Multilingual MMBench.
Architecture Insights
Token Efficiency with Attention Pooling
Jina-VLM optimizes the vision side for arbitrary resolution and low token count. The vision encoder is SigLIP2 So400M/14 384, featuring a 27-layer Vision Transformer with approximately 400M parameters. It processes 378×378 pixel crops into a 27×27 grid of 14×14 patches, yielding 729 patch tokens per tile.
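The patch-grid arithmetic follows directly from those numbers; a quick sanity check in Python:

```python
# Patch-token arithmetic for a single 378x378 tile (patch size 14)
tile_size, patch_size = 378, 14
grid_side = tile_size // patch_size   # 378 / 14 = 27 patches per side
tokens_per_tile = grid_side ** 2      # 27 * 27 = 729 patch tokens
print(grid_side, tokens_per_tile)     # 27 729
```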
To handle high-resolution images, the model splits the input into a grid of up to 12 overlapping tiles plus a global thumbnail. Adjacent tiles overlap by 112 pixels, giving a stride of 266 pixels, so the full 12-tile budget covers an effective resolution of 1176×910 pixels.
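That coverage figure is consistent with a 4×3 arrangement of the 12-tile budget; the small sketch below reproduces the geometry (the 4×3 layout itself is inferred from the stated numbers rather than spelled out):

```python
# Overlapping-tile geometry: 378 px tiles with 112 px overlap -> 266 px stride
tile, overlap = 378, 112
stride = tile - overlap               # 266

def span(n_tiles: int) -> int:
    """Pixels covered by n overlapping tiles along one axis."""
    return (n_tiles - 1) * stride + tile

# A 4x3 grid uses the full 12-tile budget (plus one global thumbnail)
print(span(4), span(3))               # 1176 910
```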
The vision-language connector combines features from two intermediate ViT layers rather than the final layer, and reduces the visual tokens per tile from 729 to 182 via attention pooling, a roughly 4x compression.
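A minimal sketch of such a connector is shown below: it fuses two intermediate ViT layers and compresses 729 patch tokens to 182 with learned-query cross-attention. The module structure, the 1152/2048 hidden sizes, and the exact pooling mechanism are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class AttentionPoolingConnector(nn.Module):
    """Hypothetical connector: fuse two intermediate ViT layers, then pool
    729 patch tokens per tile down to 182 visual tokens via cross-attention."""

    def __init__(self, vit_dim=1152, llm_dim=2048, num_queries=182, num_heads=8):
        super().__init__()
        # Learned queries act as the pooled output slots (assumed mechanism).
        self.queries = nn.Parameter(torch.randn(num_queries, vit_dim) * 0.02)
        # Fuse the two intermediate-layer features by concatenation + projection.
        self.fuse = nn.Linear(2 * vit_dim, vit_dim)
        self.attn = nn.MultiheadAttention(vit_dim, num_heads, batch_first=True)
        # Map pooled tokens into the language model's embedding space.
        self.out_proj = nn.Linear(vit_dim, llm_dim)

    def forward(self, feats_a, feats_b):
        # feats_a, feats_b: (batch, 729, vit_dim) tokens from two ViT layers
        x = self.fuse(torch.cat([feats_a, feats_b], dim=-1))
        q = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)
        pooled, _ = self.attn(query=q, key=x, value=x)  # (batch, 182, vit_dim)
        return self.out_proj(pooled)                    # (batch, 182, llm_dim)

# One tile: 729 patch tokens in, 182 visual tokens out (~4x compression)
connector = AttentionPoolingConnector()
feats = torch.randn(1, 729, 1152)
print(connector(feats, feats).shape)  # torch.Size([1, 182, 2048])
```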
Training Pipeline
Multilingual Data Mix
Training proceeds in two stages with no components frozen. The full corpus comprises roughly 5M multimodal samples and 12B text tokens spanning more than 30 languages. About half of the text is English; the remainder covers Chinese, Arabic, German, and other languages.
- Stage 1: Focuses on alignment training using caption-heavy datasets such as PixmoCap and PangeaIns.
- Stage 2: Emphasizes instruction fine-tuning for visual question answering across a range of multilingual datasets (a schematic of the two-stage recipe is sketched after this list).
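Expressed as a plain config, the two-stage recipe might look like the following; the exact mixtures, ratios, and hyperparameters are not specified here, and all names other than PixmoCap and PangeaIns are placeholders.

```python
# Hypothetical two-stage training recipe; mixture details are assumptions.
stages = [
    {
        "name": "stage1_alignment",
        "data": ["PixmoCap", "PangeaIns"],  # caption-heavy alignment data
        "trainable": ["vision_encoder", "connector", "language_model"],  # nothing frozen
    },
    {
        "name": "stage2_instruction_tuning",
        "data": ["multilingual_vqa_mix"],   # placeholder for the VQA mixture
        "trainable": ["vision_encoder", "connector", "language_model"],
    },
]

for stage in stages:
    print(f"{stage['name']}: {', '.join(stage['data'])}")
```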
Benchmark Performance
Highlights and Key Metrics
On standard English VQA tasks, Jina-VLM averages 72.3 across 8 benchmarks, outperforming the other 2B-scale models compared here. On general multimodal comprehension benchmarks it averages 67.4.
Multi-image reasoning is weaker, achieving an average of 47.3 on related benchmarks. However, it excels in hallucination control, scoring 90.3 on the POPE benchmark.
Overall, its multilingual performance is impressive, averaging 78.8 on MMMB and 74.3 on Multilingual MMBench.
Comparison Table
| Model | Params | VQA Avg | MMMB | Multi. MMB | DocVQA | OCRBench |
|----------------|--------|---------|------|------------|--------|----------|
| Jina-VLM | 2.4B | 72.3 | 78.8 | 74.3 | 90.6 | 778 |
| Qwen2-VL-2B | 2.1B | 66.4 | 71.3 | 69.4 | 89.2 | 809 |
| Qwen3-VL-2B | 2.8B | 71.6 | 75.0 | 72.3 | 92.3 | 858 |
| InternVL3-2B | 2.2B | 69.2 | 73.6 | 71.9 | 87.4 | 835 |
| InternVL3.5-2B | 2.2B | 71.6 | 74.6 | 70.9 | 88.5 | 836 |
Key Takeaways
- Jina-VLM is a 2.4B parameter model that efficiently reduces visual tokens with attention pooling, maintaining spatial integrity.
- It handles high-resolution images with a grid of up to 12 overlapping 378×378 tiles plus a global thumbnail, keeping visual token counts and compute low.
- The two-stage training pipeline draws on roughly 5M multimodal samples and 12B text tokens spanning more than 30 languages.
- It leads open 2B-scale peers on multilingual benchmarks, averaging 78.8 on MMMB and 74.3 on Multilingual MMBench.