X-Fusion: Enhancing Frozen Language Models with Vision Without Sacrificing Language Skills
X-Fusion introduces a dual-tower architecture that adds vision capabilities to frozen large language models, preserving their language skills while improving multimodal performance in image understanding and generation.
Advancing Multimodal AI with X-Fusion
Large Language Models (LLMs) have revolutionized tasks such as conversational AI, reasoning, and code generation. Yet, human communication often involves visual information, which LLMs alone cannot process. To achieve truly versatile AI, models must handle both text and visual inputs seamlessly.
Challenges in Multimodal Model Training
Building unified vision-language models from scratch, using methods like autoregressive token prediction or combining diffusion and language losses, demands massive computational resources and retraining when new modalities are added. Adapting pretrained LLMs to include vision capabilities is a more efficient alternative, but it often compromises the language model’s original performance.
Existing Approaches and Their Limitations
Current strategies include merging LLMs with standalone image generation models, training large multimodal models end-to-end, or combining diffusion and autoregressive losses. These achieve impressive results but either require retraining large models or degrade core language capabilities. Although pretrained LLMs augmented with vision components show promise, challenges in efficiency and flexibility remain.
Introducing X-Fusion: Dual-Tower Architecture
Researchers from UCLA, the University of Wisconsin-Madison, and Adobe Research propose X-Fusion, a novel approach that adapts pretrained LLMs for multimodal tasks without losing language proficiency. X-Fusion employs a dual-tower design, freezing the language model’s weights while adding a dedicated vision tower to process visual data.
Text and vision features are aligned at multiple levels, enhancing performance on image-to-text and text-to-image tasks. Images are tokenized using a pretrained encoder, and the model jointly optimizes image and text tokens. An optional X-Fuse operation merges features from both towers, further boosting performance.
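The dual-tower layer can be pictured with a short PyTorch sketch. This is a minimal illustration of the pattern described above, not the authors' code: the class name, the token-routing logic, and the gated weighted sum standing in for the X-Fuse operation are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class DualTowerBlock(nn.Module):
    """Illustrative dual-tower layer: a frozen text block plus a trainable
    vision block that both see the same multimodal token sequence."""

    def __init__(self, frozen_text_block: nn.Module, d_model: int, n_heads: int):
        super().__init__()
        self.text_block = frozen_text_block
        for p in self.text_block.parameters():   # keep the LLM weights frozen
            p.requires_grad = False
        # Trainable vision block with the same hidden width as the text tower.
        self.vision_block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        # Learnable gate for the optional X-Fuse merge (assumption: a simple
        # weighted sum; the paper's exact fusion operation may differ).
        self.fuse_gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden, image_mask, use_x_fuse=False):
        # Both towers process the full sequence, so text and vision features
        # stay aligned layer by layer.
        h_text = self.text_block(hidden)
        h_vision = self.vision_block(hidden)
        if use_x_fuse:
            # Optional X-Fuse: blend the two towers' features.
            g = torch.sigmoid(self.fuse_gate)
            h_text = h_vision = g * h_text + (1.0 - g) * h_vision
        # Route each token through the tower matching its modality.
        return torch.where(image_mask.unsqueeze(-1), h_vision, h_text)
```

Because the text block's parameters are frozen, only the vision block and the fusion gate receive gradients during training, which is how the language tower's original behavior is preserved.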
Training and Evaluation
X-Fusion is trained with a combination of autoregressive and image denoising losses. Evaluations show superior performance in both image generation and understanding compared to alternative transformer designs such as Single Tower, Gated Tower, and Dual Projection; the Dual Tower architecture achieves roughly 23% better FID than these alternatives without increasing the number of trainable parameters.
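A minimal sketch of how the two objectives might be combined, assuming a standard cross-entropy next-token loss on text and an epsilon-prediction MSE for the denoising objective; the weighting `lam` and the parameterization are illustrative, not the paper's settings.

```python
import torch.nn.functional as F

def joint_loss(text_logits, text_targets, noise_pred, noise_target, lam=1.0):
    """Illustrative combination of the two training objectives:
    next-token prediction on text plus a denoising loss on image latents."""
    # Autoregressive loss: cross-entropy over the text vocabulary,
    # ignoring positions that belong to image tokens.
    ar_loss = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
        ignore_index=-100)
    # Denoising loss: predict the noise added to the image latents.
    denoise_loss = F.mse_loss(noise_pred, noise_target)
    return ar_loss + lam * denoise_loss
```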
Key Findings from Ablation Studies
The research highlights the importance of clean image data for training, showing that reducing noise improves both image understanding and generation. Aligning vision features with pretrained encoders such as CLIP accelerates convergence and enhances performance, particularly for smaller models.
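The alignment idea can be sketched as an auxiliary loss that pulls the vision tower's intermediate features toward frozen features from a pretrained encoder such as CLIP. The projection layer and cosine objective below are assumptions used for illustration; the paper may formulate the alignment differently.

```python
import torch.nn.functional as F

def feature_alignment_loss(vision_hidden, clip_features, proj):
    """Illustrative alignment loss: project the vision tower's hidden states
    into the pretrained encoder's space and maximize cosine similarity."""
    pred = proj(vision_hidden)                     # e.g. nn.Linear to CLIP width
    pred = F.normalize(pred, dim=-1)
    target = F.normalize(clip_features.detach(), dim=-1)  # frozen CLIP targets
    return 1.0 - (pred * target).sum(dim=-1).mean()
```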
X-Fusion represents a significant step forward in creating efficient, flexible multimodal AI models by preserving language capabilities while adding robust vision processing.