ViSMaP: Revolutionizing Hour-Long Video Summarization with Unsupervised Meta-Prompting

ViSMaP introduces an unsupervised approach to summarize hour-long videos effectively using meta-prompting and short-form video datasets, matching supervised model performance without the need for extensive annotations.

Challenges in Long-Form Video Summarization

Video captioning models have traditionally been trained on short videos, typically under three minutes, paired with human-written captions. While effective for describing simple actions, these models fall short when applied to hour-long videos such as vlogs, sports broadcasts, and movies: their descriptions tend to be fragmented and fail to capture the overarching narrative. Existing attempts like MA-LMM and LaViLa extend captioning to clips of up to 10 minutes but struggle with longer content, largely because large annotated datasets for long videos do not exist. Although Ego4D provides hour-long video data, its first-person perspective limits general use. Video ReCap addresses this by training on hour-long videos with detailed annotations, but collecting such annotations is expensive and prone to inconsistency.

Advances in Visual-Language Models and Dataset Limitations

Visual-language models like CLIP, ALIGN, LLaVA, and MiniGPT-4 have pushed the boundaries in image and video understanding by combining vision and language tasks. However, the scarcity of large annotated datasets for long videos remains a major bottleneck. Tasks like video question answering and captioning on short videos require spatial or temporal understanding, but summarizing hour-long videos demands filtering key frames from extensive redundant content. Models such as LongVA and LLaVA-Video can handle long videos for VQA but struggle with summarization due to limited data.

Introducing ViSMaP: An Unsupervised Solution

Researchers from Queen Mary University of London and Spotify propose ViSMaP, an unsupervised method designed to summarize hour-long videos without costly annotations. Unlike traditional models that excel only on short clips, ViSMaP handles longer videos by combining large language models (LLMs) with a novel meta-prompting strategy that iteratively generates and refines pseudo-summaries from short-form clip descriptions. The system comprises three LLMs operating sequentially: one generates a candidate summary, one evaluates it, and one optimizes the generation prompt.
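To make the loop concrete, here is a minimal Python sketch of such a generate-evaluate-optimize cycle. The function name, prompts, 0-10 scoring scheme, and iteration budget are illustrative assumptions rather than the authors' implementation; `llm` stands for whatever chat-model client the reader supplies.

```python
# Illustrative sketch of a three-LLM meta-prompting loop (generator,
# evaluator, prompt optimizer). Prompts and scoring are assumptions for
# illustration only; they are not taken from the ViSMaP paper or code.
from typing import Callable

def meta_prompt_summarize(
    clip_captions: list[str],
    llm: Callable[[str, str], str],  # (system_prompt, user_prompt) -> reply text
    max_iters: int = 5,
) -> str:
    captions = "\n".join(clip_captions)
    gen_prompt = "Summarize these clip-level captions into one coherent video summary."
    best_summary, best_score = "", float("-inf")

    for _ in range(max_iters):
        # 1) Generator LLM: produce a pseudo-summary from the clip captions.
        summary = llm(gen_prompt, captions)

        # 2) Evaluator LLM: score the summary for coverage and coherence.
        score_text = llm(
            "Rate this summary of the captions from 0 to 10. Reply with a number only.",
            f"Captions:\n{captions}\n\nSummary:\n{summary}",
        )
        score = float(score_text.strip())
        if score > best_score:
            best_summary, best_score = summary, score

        # 3) Optimizer LLM: rewrite the generator's prompt using the feedback.
        gen_prompt = llm(
            "Improve this summarization prompt so that the next summary scores higher.",
            f"Current prompt:\n{gen_prompt}\n\nScore: {score}\nSummary:\n{summary}",
        )

    return best_summary
```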

Methodology

ViSMaP starts by training a clip-level model on 3-minute videos using TimeSFormer features, a visual-language alignment module, and a text decoder optimized with cross-entropy and contrastive losses. Longer videos are then segmented into 3-minute clips, and pseudo-captions are generated for each clip. An iterative meta-prompting process involving generator, evaluator, and optimizer LLMs condenses and refines these captions into pseudo-summaries. The model is finally fine-tuned on the pseudo-summaries with a symmetric cross-entropy loss to tolerate label noise and ease domain adaptation.
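Symmetric cross-entropy combines the usual cross-entropy term with a reverse term that limits the penalty from confidently wrong pseudo-labels. Below is a minimal PyTorch sketch of one common formulation of this loss; the weights, log-clipping value, and per-token framing are illustrative defaults, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def symmetric_cross_entropy(logits, targets, alpha=0.1, beta=1.0, log_clip=-4.0):
    """Symmetric cross-entropy (CE + reverse CE) for noisy labels.

    logits:  (batch, vocab) unnormalized token scores
    targets: (batch,) integer token indices
    alpha, beta, log_clip are illustrative defaults, not ViSMaP's settings.
    """
    # Standard cross-entropy term.
    ce = F.cross_entropy(logits, targets)

    # Reverse cross-entropy: swap the roles of predictions and labels,
    # clamping log(0) of the one-hot labels to a finite value.
    pred = F.softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, num_classes=logits.size(-1)).float()
    rce = -(pred * torch.clamp(torch.log(one_hot + 1e-12), min=log_clip)).sum(dim=-1).mean()

    return alpha * ce + beta * rce
```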

Evaluation and Performance

ViSMaP was tested in three scenarios: summarization of long videos on Ego4D-HCap, cross-domain generalization on MSRVTT, MSVD, and YouCook2, and adaptation to short videos on EgoSchema. Compared with supervised and zero-shot methods such as Video ReCap and LaViLa+GPT-3.5, ViSMaP achieved competitive or better performance without requiring supervision. Metrics included CIDEr, ROUGE-L, METEOR, and QA accuracy. Ablation studies confirmed the contributions of meta-prompting, contrastive learning, and the symmetric cross-entropy loss. The implementation used TimeSFormer, DistilBERT, and GPT-2, with training on an NVIDIA A100 GPU.
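As a toy illustration of how one of these caption metrics can be computed, here is a small ROUGE-L check using the open-source rouge-score package. This is only a metric illustration with made-up strings, not the authors' evaluation pipeline; CIDEr and METEOR are typically computed with toolkits such as pycocoevalcap.

```python
# Toy ROUGE-L computation (pip install rouge-score).
from rouge_score import rouge_scorer

reference = "a person assembles furniture, takes breaks, and finishes a bookshelf"
generated = "someone builds a bookshelf over the course of the afternoon"

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
score = scorer.score(reference, generated)["rougeL"]
print(f"ROUGE-L F1: {score.fmeasure:.3f}")
```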

Future Directions

While ViSMaP shows promising results, it relies on pseudo-labels from a source domain model, which may limit performance under significant domain shifts. Currently, it uses only visual data; future work could incorporate multimodal information, hierarchical summarization techniques, and more generalized meta-prompting methods to improve robustness and applicability.
