LongViTU: Instruction Tuning for
Long-Form Video Understanding

¹Peking University  ²BIGAI  ³National University of Singapore  (*Equal contribution, †Research lead)

Abstract

This paper introduces LongViTU, a large-scale (~121k QA pairs, ~900h of video), automatically generated dataset for long-form video understanding. We developed a systematic approach that organizes videos into a hierarchical tree structure and incorporates self-revision mechanisms to ensure high-quality QA pairs. Each QA pair in LongViTU features: 1) long-term context (an average certificate length of 4.6 minutes); 2) rich knowledge and condensed reasoning (commonsense, causality, planning, etc.); and 3) explicit timestamp labels for relevant events. LongViTU also serves as a benchmark for instruction following in long-form and streaming video understanding. We evaluate the open-source state-of-the-art long video understanding model LongVU and the commercial model Gemini-1.5-Pro on our benchmark; they achieve GPT-4 scores of 49.9 and 52.3, respectively, underscoring the substantial challenge posed by our benchmark. Further supervised fine-tuning (SFT) of LongVU on LongViTU yielded performance improvements of 12.0% on our benchmark, 2.2% on the in-distribution (ID) benchmark EgoSchema, and 1.0%, 2.2%, and 1.2% on the out-of-distribution (OOD) benchmarks VideoMME (Long), WorldQA, and OpenEQA, respectively. These results demonstrate LongViTU's high data quality and robust OOD generalizability.
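To make the data format concrete, here is a minimal sketch of what a single LongViTU QA record might look like. The field names (`video_id`, `question`, `answer`, `timestamps`, `category`) and all values are illustrative assumptions, not the released schema; please refer to our Dataset for the exact format.

```python
# Hypothetical sketch of one LongViTU QA record.
# Field names and values are illustrative, not the official schema.
example_qa = {
    "video_id": "clip_0001",          # source long-form video (assumed ID format)
    "question": "Why did the person return the kettle to the stove?",
    "answer": "The water had not finished boiling, so they put the kettle "
              "back to keep heating it before making tea.",
    "timestamps": [312.0, 588.0],     # start/end (seconds) of the relevant events
    "category": "causality",          # e.g., commonsense, causality, planning
}

# The certificate length is the video span needed to answer the question;
# LongViTU averages ~4.6 minutes (276 s) per QA pair.
certificate_length_s = example_qa["timestamps"][1] - example_qa["timestamps"][0]
```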

Pipeline

We adopt a hierarchical pipeline that organizes video content into a tree structure, with subtrees encapsulating information at different temporal scales. This framework facilitates the generation of QA pairs with explicit timestamps and ensures adaptability to varying context lengths. By summarizing content at multiple temporal levels (frame, event, and segment), our approach overcomes the excessively long input lengths that long-form videos would otherwise impose on LLMs. This enables LLMs to generate distinct types of questions, yielding a fine-grained categorization aligned with the video content. To further enhance quality, a self-revision step refines the results by removing redundancy and irrelevant prior information. For more details, please refer to our Paper.
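For intuition, the snippet below sketches such a pipeline under stated assumptions: `Node`, `summarize_level`, `build_tree`, `qa_for_subtree`, the chunk sizes, and the prompt strings are all hypothetical stand-ins for the captioner/LLM calls described in the paper, not our actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    """One subtree of the temporal tree (frame, event, or segment level)."""
    start: float                          # seconds
    end: float
    summary: str = ""
    children: List["Node"] = field(default_factory=list)

def chunk(nodes: List[Node], size: int) -> List[List[Node]]:
    """Group contiguous nodes so each LLM call sees a bounded input."""
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]

def summarize_level(nodes: List[Node], llm: Callable[[str], str],
                    size: int) -> List[Node]:
    """Merge contiguous children into parent nodes summarized by the LLM."""
    parents = []
    for group in chunk(nodes, size):
        text = " ".join(n.summary for n in group)
        parents.append(Node(group[0].start, group[-1].end, llm(text), group))
    return parents

def build_tree(frames: List[Node], llm: Callable[[str], str]) -> Node:
    """Frame level -> event level -> segment level -> root (sizes are assumed)."""
    events = summarize_level(frames, llm, size=30)
    segments = summarize_level(events, llm, size=10)
    root_summary = llm(" ".join(s.summary for s in segments))
    return Node(frames[0].start, frames[-1].end, root_summary, segments)

def qa_for_subtree(node: Node, llm: Callable[[str], str]) -> str:
    """Draft a timestamped QA pair for one subtree, then self-revise it
    to remove redundancy and irrelevant prior information."""
    draft = llm(f"Write a QA pair with timestamps for: {node.summary}")
    return llm(f"Revise this QA pair, removing redundancy and leaked prior context: {draft}")
```

Because every subtree carries its own timestamped summary, QA pairs can be generated at whichever temporal scale suits the question type, while each individual LLM call stays within a bounded input length.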

Qualitative

We present visualizations of the various question-answering types to support a more thorough qualitative analysis. The yellow box indicates the key frame that contains the answer, while the red boxes highlight the relevant objects. For brevity, only concise key information is shown in the predictions. For a clearer view, please refer to our Paper.

Statistics

Below are the category distributions and the sunburst chart of LongViTU. To browse the complete dataset, please refer to our Dataset.

BibTeX

@misc{wu2025longvituinstructiontuninglongform,
      title={LongViTU: Instruction Tuning for Long-Form Video Understanding}, 
      author={Rujie Wu and Xiaojian Ma and Hai Ci and Yue Fan and Yuxuan Wang and Haozhe Zhao and Qing Li and Yizhou Wang},
      year={2025},
      eprint={2501.05037},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.05037}, 
}