Alibaba’s Qwen Packs Full Multimodal Power into FP8 4B/8B Qwen3‑VL Models

Dense 4B and 8B SKUs

Alibaba’s Qwen team has released compact, dense versions of Qwen3-VL at the 4B and 8B parameter scales, each offered in two task profiles: Instruct and Thinking. These smaller models are positioned as deployment-friendly complements to the previously published 30B-A3B and 235B-A22B MoE tiers, aiming to preserve the same capability surface while cutting VRAM requirements.

Context length and capabilities

The model cards report a native context length of 256K tokens with expandability to 1M tokens. Despite the reduced parameter counts, the 4B and 8B SKUs retain the multimodal features of larger Qwen3-VL models: long-document and video comprehension, 32-language OCR, 2D/3D spatial grounding, visual coding, and agentic GUI control across desktop and mobile environments.
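For teams that want to sanity-check these capabilities locally, the sketch below runs a single image-plus-text prompt through the BF16 Instruct checkpoint with Hugging Face Transformers (the FP8 weights target vLLM/SGLang, as discussed below). The repo ID, the generic AutoModelForImageTextToText loader, and the example image URL are illustrative assumptions rather than details taken from the model card.

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-8B-Instruct"  # assumed BF16 repo ID

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Any document image works here; this URL is a placeholder.
image = Image.open(requests.get("https://example.com/invoice.png", stream=True).raw)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Read all text in this document and list the line items."},
    ],
}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```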

Architecture highlights

Qwen3-VL carries over the three core architectural updates that underpin its multimodal performance at every scale: Interleaved-MRoPE, which allocates full-frequency positional encoding across the time, width, and height axes for stronger long-horizon video reasoning; DeepStack, which fuses multi-level vision-transformer features to capture fine-grained detail and sharpen image-text alignment; and Text-Timestamp Alignment, which grounds video events to explicit timestamps for more precise temporal localization.

These design choices appear in the new 4B and 8B model cards, indicating architectural continuity across sizes.

FP8 checkpoints and deployment

A notable part of this release is the availability of FP8-quantized checkpoints for the 4B and 8B Instruct and Thinking variants. The repositories specify fine-grained FP8 quantization with a block size of 128 and report performance metrics nearly identical to the original BF16 checkpoints. That parity claim reduces the re-quantization and re-validation burden for teams integrating these models into multimodal stacks.
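To make "fine-grained FP8 quantization with a block size of 128" concrete, here is a small PyTorch sketch that stores one scale per 128×128 weight tile and casts each tile to the e4m3 FP8 format. It is a conceptual illustration only; the exact scaling granularity, rounding, and kernel-side handling used for the released checkpoints may differ.

```python
import torch

def quantize_fp8_blockwise(weight: torch.Tensor, block: int = 128):
    """Quantize a 2-D weight to FP8 (e4m3) with one scale per block x block tile.

    Conceptual sketch of fine-grained, block-size-128 FP8 quantization; not the
    exact recipe behind the released Qwen3-VL FP8 checkpoints.
    """
    rows, cols = weight.shape
    assert rows % block == 0 and cols % block == 0, "pad weights to a multiple of the block size"
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # ~448 for e4m3

    # View the matrix as a grid of (block x block) tiles.
    tiles = weight.reshape(rows // block, block, cols // block, block)
    # Per-tile scale chosen so each tile's max magnitude lands at the FP8 range limit.
    amax = tiles.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-12)
    scale = amax / fp8_max
    q_tiles = (tiles / scale).to(torch.float8_e4m3fn)

    return q_tiles.reshape(rows, cols), scale[:, 0, :, 0]  # FP8 weights + per-tile scales

w = torch.randn(1024, 4096)
q, scales = quantize_fp8_blockwise(w)
print(q.dtype, scales.shape)  # torch.float8_e4m3fn torch.Size([8, 32])
```

At inference time, a serving kernel multiplies each tile by its stored scale (or runs the matmul in FP8 with per-tile rescaling), which is how block-wise schemes keep accuracy close to BF16 at roughly half the weight footprint.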

Tooling guidance

The model cards note that common Transformers loaders do not yet support these FP8 weights directly and recommend serving them with vLLM or SGLang instead. The cards ship working launch snippets, and the accompanying vLLM recipes highlight the FP8 checkpoints for memory-efficient serving on GPUs such as the H100, giving teams an immediate, supported path to low-VRAM inference.
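As a rough illustration of that serving path, the snippet below runs offline inference on the FP8 Instruct checkpoint with vLLM's Python API; the same model can also be exposed as an OpenAI-compatible endpoint via vllm serve. The repo ID, image URL, and engine arguments are illustrative assumptions, not values copied from the model card's recipes.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-VL-8B-Instruct-FP8",  # assumed FP8 repo ID
    max_model_len=32768,            # cap the context to keep the KV cache small on one GPU
    gpu_memory_utilization=0.90,
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        {"type": "text", "text": "Describe the main trend in this chart."},
    ],
}]

outputs = llm.chat(messages, SamplingParams(temperature=0.2, max_tokens=256))
print(outputs[0].outputs[0].text)
```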

Practical implications

The combination of dense 4B/8B SKUs, Instruct and Thinking profiles, and vendor-produced FP8 weights makes Qwen3-VL more accessible for single-GPU and edge deployments. Teams that need the full multimodal capability surface but are constrained by VRAM can now experiment with smaller models without losing long-context handling, OCR, spatial grounding, video reasoning, or GUI/agent control.
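A quick back-of-the-envelope calculation shows why the FP8 weights matter for single-GPU use: for the 8.77B-parameter model, FP8 roughly halves the weight footprint relative to BF16 (KV cache, activations, and vision-tower overhead come on top of this).

```python
# Rough weight-memory estimate for the 8B model; excludes KV cache and activations.
params = 8.77e9  # parameter count reported on the model card
for fmt, bytes_per_param in {"bf16": 2, "fp8": 1}.items():
    print(f"{fmt}: ~{params * bytes_per_param / 1e9:.1f} GB of weights")
# bf16: ~17.5 GB of weights
# fp8: ~8.8 GB of weights
```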

Model sizes and availability

The model cards report roughly 4.83B parameters for Qwen3-VL-4B and roughly 8.77B parameters for Qwen3-VL-8B-Instruct. The release was recorded on Oct 15, 2025, and the repositories and model assets are available on GitHub and Hugging Face.

Key takeaways

Qwen3-VL is now available as dense 4B and 8B models in both Instruct and Thinking variants, with a native 256K-token context (expandable to 1M) and the full multimodal feature set of the larger tiers, including multilingual OCR, spatial grounding, video understanding, and GUI agent control. The vendor-produced FP8 checkpoints use fine-grained, block-size-128 quantization, report near-BF16 quality, and are intended to be served with vLLM or SGLang rather than standard Transformers loaders, making single-GPU and edge deployment the main practical win of this release.