Recent Advances in Video Content Understanding and Generation

Over the last decade, this field has attracted tremendous interest, and great success has been achieved on various video-centric tasks (e.g., action recognition, motion capture and understanding, video understanding, temporal localization, and video generation) based on conventional short videos. In recent years, with the explosion of video data and diverse application demands (e.g., video editing, AR/VR, and human-robot interaction), significantly more effort is required to enable intelligent systems to understand and generate video content across different scenarios with multimodal, long-term, and fine-grained inputs. Moreover, with the development of recent large language models (LLMs) and large multimodal models (LMMs), new trends and challenges continue to emerge that need to be discussed and addressed. The goal of this tutorial is to foster interdisciplinary communication among researchers and to draw attention from the broader community to this field. The tutorial will discuss current progress and future directions, and we expect new ideas and discoveries to emerge in related fields. Tentative topics include but are not limited to:

  • Video Content Understanding: human pose/mesh recovery from video; action recognition, localization, and segmentation; video captioning and video question answering with LLMs or LMMs;
  • Video Content Generation: basic models for video generation (diffusion models, GANs, etc.); controllable video generation with multimodal inputs; 4D video creation;
  • Foundations and Beyond: large language models/large multimodal models for video representation learning; unified modeling of video understanding and generation; long-term video modeling; datasets and evaluation.


* Equal Contribution