LongViTU: Instruction Tuning for
Long-Form Video Understanding

1Peking University  2BIGAI  3National University of Singapore

Abstract

This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h of videos), automatically generated dataset for long-form video understanding. We propose a systematic approach that organizes videos into a hierarchical tree structure for QA generation and incorporates self-revision mechanisms to ensure high-quality QA pairs. Each QA pair in LongViTU features: 1) long-term context (average certificate length of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense, causality, planning, etc.). We also provide explicit timestamp annotations of the relevant events for each QA pair. We have conducted extensive human studies on LongViTU, and the results confirm the quality of our dataset. To better evaluate the challenges posed by LongViTU's emphasis on long-term context and condensed reasoning, we manually curate a subset of LongViTU into a benchmark. Evaluations using a state-of-the-art open-source model (LongVU), a proprietary model (Gemini-1.5-Pro), and human annotators yield GPT-4 scores of 49.9, 52.3, and 81.0, respectively, underscoring the substantial difficulty posed by LongViTU questions. Performing supervised fine-tuning (SFT) of LongVU and LLaVA-Video on LongViTU data results in average performance gains of 2.5% and 3.7%, respectively, across a suite of long video understanding benchmarks (EgoSchema, VideoMME-Long, MLVU, LVBench).

Pipeline

We design an automatic pipeline to generate QA pairs from Ego4D videos while addressing the challenges of long-form video comprehension. To mitigate the context length limitation of LLMs and capture spatiotemporal information in long video contexts, we employ a hierarchical tree structure. This structure first condenses dense captions into refined event descriptions and then segments the content into sequential events with finer temporal granularity. Such representations facilitate the generation of QA pairs with explicit timestamps and varying contextual lengths, ensuring adaptability across different temporal levels (frame level, event level, segment level). Additionally, we design prompting strategies to generate high-quality questions that require condensed reasoning. Finally, a self-revision step refines the results by removing redundancy and irrelevant prior information. For more details, please refer to our Paper.
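To make the hierarchy concrete, here is a minimal Python sketch of how dense captions might be condensed into events and grouped into segments, yielding the frame/event/segment levels described above. This is an illustrative assumption, not the paper's actual implementation: the node structure, the `condense` heuristic (merging consecutive identical captions), and the fixed 60-second segment window are all hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class EventNode:
    """A node in the hierarchical tree, carrying explicit timestamps."""
    start: float                     # start timestamp (seconds)
    end: float                       # end timestamp (seconds)
    description: str
    children: list = field(default_factory=list)

def condense(captions):
    """Condense dense per-frame captions into event descriptions.
    Placeholder heuristic: consecutive identical captions form one event."""
    events = []
    for t, desc in captions:
        if events and events[-1].description == desc:
            events[-1].end = t       # extend the current event
        else:
            events.append(EventNode(start=t, end=t, description=desc))
    return events

def build_tree(captions, segment_len=60.0):
    """Group condensed events (event level) into fixed-length segments
    (segment level) over the raw captions (frame level)."""
    segments = []
    for ev in condense(captions):
        if not segments or ev.start >= segments[0].start + segment_len * len(segments):
            segments.append(EventNode(start=ev.start, end=ev.end,
                                      description="segment"))
        segments[-1].children.append(ev)
        segments[-1].end = ev.end
    return segments
```

A QA generator could then walk this tree and emit questions whose certificate length matches the level it samples from: a single event for short-context questions, a whole segment for long-context ones.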

Qualitative

We present visualizations of various question-answering types to facilitate a more thorough qualitative analysis. The yellow box indicates the key frame that contains the answer, while the red box highlights the relevant objects. For clarity, only concise key information is shown in the predictions; for the full examples, please refer to our Paper.

Statistics

Below are the distributions and a sunburst chart of LongViTU. To browse the complete dataset, please refer to our Dataset.

BibTeX

@misc{wu2025longvituinstructiontuninglongform,
      title={LongViTU: Instruction Tuning for Long-Form Video Understanding}, 
      author={Rujie Wu and Xiaojian Ma and Hai Ci and Yue Fan and Yuxuan Wang and Haozhe Zhao and Qing Li and Yizhou Wang},
      year={2025},
      eprint={2501.05037},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.05037}, 
}