
Recent Advances in Video Content Understanding and Generation

Over the last decade, this field has attracted tremendous interest, and great success has been achieved on various video-centric tasks (e.g., action recognition, motion capture and understanding, video understanding, temporal localization, and video generation) based on conventional short videos. In recent years, with the explosion of video data and diverse application demands (e.g., video editing, AR/VR, and human-robot interaction), significantly more effort is required to enable intelligent systems to understand and generate video content across different scenarios with multimodal, long-term, and fine-grained inputs. Moreover, with the development of recent large language models (LLMs) and large multimodal models (LMMs), new trends and challenges are emerging that need to be discussed and addressed. The goal of this tutorial is to foster interdisciplinary communication among researchers and to draw more attention from the broader community to this field. The tutorial will cover current progress and future directions, and we expect it to spark new ideas and discoveries in related fields. The tentative topics include, but are not limited to:

  • Video Content Understanding: human pose/mesh recovery from video; action recognition, localization, and segmentation; video captioning and video question answering with LLMs or LMMs;
  • Video Content Generation: basic models for video generation (diffusion model, GAN model, etc.); controllable video generation with multimodal inputs; 4D video creation;
  • Foundations and Beyond: large language models/large multimodal models for video representation learning, unified modeling of video understanding and generation, long-term video modeling, datasets, and evaluation.

Schedule

Time                  Speaker           Content
09:00 am - 09:45 am   Mike Z. Shou      Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
09:45 am - 10:30 am   Angjoo Kanazawa   Perceiving the World in 4D and thereafter
11:00 am - 11:45 am   Chuang Gan        Videos as World Models: Blending Visual and Physical Intelligence
11:45 am - 12:30 pm   Hao Zhao          Video Simulation and Holistic Understanding for Autonomous Driving: Systems and Backbones
02:00 pm - 02:45 pm   Cordelia Schmid   Invited Talk 5
02:45 pm - 03:30 pm   Yingqing He       LLMs Meet Image and Video Generation
04:00 pm - 04:45 pm   Kashyap Chitta    Specializing Video Diffusion Models
04:45 pm - 05:30 pm   Sherry Yang       Video Generation as Real-World Simulators
05:30 pm - 06:00 pm   Ziwei Wang        Invited Talk 9

Talk Details

Mike Z. Shou

Talk Title: Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Abstract: Exciting models have been developed in multimodal video understanding and generation, such as video LLMs and video diffusion models. One emerging pathway toward the ultimate intelligence is to create a single foundation model that can do both understanding and generation; after all, humans use only one brain for both tasks. Toward such unification, recent attempts employ a base language model for multimodal understanding but require an additional pre-trained diffusion model for visual generation, so the two capabilities still live in separate components. In this work, we present Show-o, a single transformer that handles both multimodal understanding and generation. Unlike fully autoregressive models, Show-o is the first to unify autoregressive and discrete diffusion modeling, flexibly supporting a wide range of vision-language tasks including visual question answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation of any input/output format, all within a single 1.3B transformer. Across various benchmarks, Show-o demonstrates comparable or superior performance, shedding light on building the next-generation video foundation model.
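
To make the unification idea concrete, below is a minimal PyTorch sketch of one transformer backbone optimized with both an autoregressive next-token loss on text tokens and a discrete-diffusion-style masked-token loss on image tokens. All sizes, module names, and the random masking are toy assumptions for illustration, not the actual Show-o architecture or training recipe.

```python
# Minimal sketch of one transformer trained with two objectives:
#   (1) autoregressive next-token prediction on text tokens
#   (2) discrete-diffusion-style masked-token prediction on image tokens
# All sizes, names, and the random masking below are toy assumptions,
# not the actual Show-o architecture or training recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, D_MODEL, MASK_ID = 1024, 256, 1023  # toy shared vocabulary; last id acts as [MASK]

class UnifiedTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, VOCAB)

    def forward(self, tokens, causal):
        x = self.embed(tokens)
        if causal:  # autoregressive path: each position attends only to the past
            L = tokens.size(1)
            attn_mask = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
            x = self.backbone(x, mask=attn_mask)
        else:       # diffusion path: full attention over the partially masked sequence
            x = self.backbone(x)
        return self.head(x)

model = UnifiedTransformer()
text = torch.randint(0, VOCAB - 1, (2, 16))    # toy text token ids
image = torch.randint(0, VOCAB - 1, (2, 64))   # toy discrete image token ids

# (1) next-token prediction on text
logits = model(text[:, :-1], causal=True)
loss_ar = F.cross_entropy(logits.reshape(-1, VOCAB), text[:, 1:].reshape(-1))

# (2) masked-token prediction on image tokens (one discrete diffusion step)
keep = torch.rand(image.shape) > 0.5
corrupted = torch.where(keep, image, torch.full_like(image, MASK_ID))
logits = model(corrupted, causal=False)
loss_mask = F.cross_entropy(logits[~keep], image[~keep])

(loss_ar + loss_mask).backward()  # one backbone, both objectives
```

A real unified model would interleave text and discrete image tokens in a single sequence with a mixed attention mask rather than running two separate passes; the split above only keeps the sketch short.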

Bio: Mike Shou is a tenure-track Assistant Professor at the National University of Singapore. He was previously a Research Scientist at Facebook AI in the Bay Area. He obtained his Ph.D. at Columbia University, working with Prof. Shih-Fu Chang. He was a Best Paper Finalist at CVPR 2022 and received a Best Student Paper Nomination at CVPR 2017, the PREMIA Best Paper Award 2023, and the EgoVis Distinguished Paper Award 2022/23. His team won 1st place in international challenges including ActivityNet 2017, EPIC-Kitchens 2022, and Ego4D 2022 & 2023. He is a Singapore Technologies Engineering Distinguished Professor and a Fellow of the National Research Foundation Singapore. He is on the Forbes 30 Under 30 Asia list.

Hao Zhao

Talk Title: Video Simulation and Holistic Understanding for Autonomous Driving: Systems and Backbones

Bio: Hao Zhao is an Assistant Professor at the Institute for AI Industry Research (AIR), Tsinghua University. He received his Bachelor's and Ph.D. degrees from the Department of Electronic Engineering at Tsinghua University. He worked as a research scientist at Intel Labs China and conducted postdoctoral research at Peking University. He has published over 50 research papers at academic conferences such as CVPR, NeurIPS, and SIGGRAPH and in journals such as T-PAMI and IJCV. He has won multiple championships in 3D scene understanding challenges and led the development of the world's first open-source modular realistic autonomous driving simulator, MARS, which won the Best Paper Runner-up award at CICAI 2023. His neural rendering method SlimmeRF, which allows precision and speed to be adjusted at rendering time, won the Best Paper award at 3DV 2024.

Sherry Yang

Talk Title: Video Generation as Real-World Simulators

Abstract: Generative models have transformed content creation, and the next frontier may be simulating realistic experiences in response to actions by humans and agents. In this talk, I will present a line of work on learning a real-world simulator that emulates interactions through generative modeling of video content. I will then discuss applications of such a simulator, including training vision-language planners and reinforcement learning policies, which have demonstrated zero-shot real-world transfer. Lastly, I will describe how to improve generative simulators from real-world feedback.
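
For a concrete picture of "video generation as a simulator", here is a heavily simplified sketch of rolling out a policy inside a learned action-conditioned video model. `VideoModel`, `Policy`, and `reward_fn` are hypothetical stand-ins and do not correspond to UniSim or any released code.

```python
# Hedged sketch of "video generation as a simulator": roll out a policy inside a
# learned action-conditioned video model and update the policy on imagined returns.
# VideoModel, Policy, and reward_fn are hypothetical stand-ins, not UniSim code.
import torch
import torch.nn as nn

class VideoModel(nn.Module):
    """Stand-in for an action-conditioned video predictor: (frame, action) -> next frame."""
    def __init__(self, frame_dim=64, action_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(frame_dim + action_dim, 128),
                                 nn.ReLU(), nn.Linear(128, frame_dim))
    def forward(self, frame, action):
        return self.net(torch.cat([frame, action], dim=-1))

class Policy(nn.Module):
    """Stand-in for a planner or RL policy mapping observations to actions."""
    def __init__(self, frame_dim=64, action_dim=8):
        super().__init__()
        self.net = nn.Linear(frame_dim, action_dim)
    def forward(self, frame):
        return torch.tanh(self.net(frame))

def reward_fn(frame):
    # Placeholder reward; a real setup would score task progress from the frame.
    return -frame.pow(2).mean(dim=-1)

simulator, policy = VideoModel(), Policy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

frame = torch.randn(4, 64)      # batch of initial observations (flattened "frames")
total_reward = 0.0
for t in range(10):             # imagined rollout inside the generative simulator
    action = policy(frame)
    frame = simulator(frame, action)
    total_reward = total_reward + reward_fn(frame).mean()

(-total_reward).backward()      # improve the policy against imagined returns
optimizer.step()
```

In practice the simulator would be a pretrained video generation model operating on pixels or latents, and the policy would typically be trained with reinforcement learning or imitation on the generated rollouts rather than by backpropagating through the simulator as done here for brevity.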

Bio: Sherry is an incoming Assistant Professor of Computer Science at NYU Courant, a postdoctoral researcher at Stanford University, and a research scientist at Google DeepMind. Her research aims to develop machine learning models with internet-scale knowledge that make better-than-human decisions. To this end, her work has pioneered representation learning and generative modeling from large vision and language data, coupled with algorithms for sequential decision making such as imitation learning, planning, and reinforcement learning. Her work UniSim: Learning Interactive Real-World Simulators received an Outstanding Paper Award at ICLR. Prior to her current roles, Sherry received her Ph.D. in Computer Science from UC Berkeley and her Bachelor's and Master's degrees in Electrical Engineering and Computer Science from MIT.

Kashyap Chitta

Talk Title: Specializing Video Diffusion Models

Abstract: Latent diffusion models (LDMs) have emerged as a powerful class of generative models and have demonstrated exceptional results, particularly in image and video synthesis. In this tutorial, we provide an introduction to LDMs and the image-to-video model Stable Video Diffusion (SVD). Following this, we summarize the lessons we learned while fine-tuning SVD for applications in autonomous driving as part of our recent work, Vista. In particular, we present several practical tips for enhancing the capabilities of large video LDMs or specializing them to new application domains.
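
As background, the sketch below shows the generic denoising objective that such fine-tuning builds on: noise the clip latents at a random timestep and train the network to predict that noise. The tiny `denoiser`, the DDPM-style schedule, and all sizes are placeholder assumptions; SVD's actual spatio-temporal U-Net, noise parameterization, and any Vista-specific techniques are not shown.

```python
# Hedged sketch of the generic denoising objective behind fine-tuning a latent video
# diffusion model on a new domain: noise the clip latents at a random timestep and
# train the network to predict that noise. The Conv3d "denoiser", the DDPM-style
# schedule, and all sizes are placeholders, not SVD or Vista internals.
import torch
import torch.nn.functional as F
from torch import nn

B, C, T, H, W = 2, 4, 8, 32, 32                       # batch, latent channels, frames, latent size
denoiser = nn.Conv3d(C, C, kernel_size=3, padding=1)  # placeholder for the real U-Net
optimizer = torch.optim.AdamW(denoiser.parameters(), lr=1e-5)

betas = torch.linspace(1e-4, 2e-2, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

latents = torch.randn(B, C, T, H, W)                  # VAE-encoded clips from the target domain
noise = torch.randn_like(latents)
t = torch.randint(0, 1000, (B,))

a = alphas_cumprod[t].view(B, 1, 1, 1, 1)
noisy = a.sqrt() * latents + (1 - a).sqrt() * noise   # forward diffusion q(z_t | z_0)

pred = denoiser(noisy)                                # real models also take t and conditioning
loss = F.mse_loss(pred, noise)                        # epsilon-prediction objective

loss.backward()
optimizer.step()
```

Domain specialization then amounts to running this training step on clips from the new domain, typically with a conservative learning rate so the pretrained capabilities are preserved.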

Bio: Kashyap Chitta is a doctoral researcher at the University of Tübingen and the lead of the Tübingen AI autonomous driving team, which won multiple autonomous driving challenge awards between 2020 and 2024. He was selected for the doctoral consortium at ICCV 2023, named a 2023 RSS Pioneer, and recognized as a top reviewer for CVPR, ICCV, ECCV, and NeurIPS.

Yingqing He

Talk Title: LLMs Meet Image and Video Generation

Abstract: In light of the recent progress of large language models (LLMs), there is growing interest in equipping LLMs with the ability to process multiple modalities, especially images and videos. This tutorial delves into breakthroughs in AI-generated content, such as images and videos, examining the current status, the milestones achieved, and the challenges that remain. Given the formidable capabilities of LLMs, we pose the question: can LLMs enhance the generation of images and videos, and if so, in what ways? To answer this question, we conduct a comprehensive review of related work and summarize the various roles that LLMs can take in image and video generation, including serving as a unified backbone, planner, captioner, conditioner, evaluator, and agent. We hope this tutorial provides the audience with a clear understanding of the current status of image and video generation and of how LLMs function within it, in order to promote better integration of LLMs with image and video generation paradigms in the future.
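
To illustrate one of these roles, the sketch below uses an LLM as a planner that expands a single user request into shot-level prompts for a text-to-video model. `call_llm` and `generate_video_clip` are hypothetical placeholders, not APIs from any particular system.

```python
# Hedged sketch of an LLM acting as a planner for video generation: the LLM expands
# one user request into an ordered list of shot-level prompts, and each prompt is
# handed to a text-to-video model. call_llm and generate_video_clip are hypothetical
# stand-ins for whatever LLM endpoint and video generator are available.
import json
from typing import List

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with any chat-completion client."""
    raise NotImplementedError

def generate_video_clip(prompt: str) -> str:
    """Hypothetical text-to-video call; returns a path to the rendered clip."""
    raise NotImplementedError

def plan_shots(user_request: str, num_shots: int = 4) -> List[str]:
    """Ask the LLM for a JSON list of shot descriptions (the 'planner' role)."""
    instruction = (
        f"Break the following video idea into {num_shots} consecutive shots. "
        f'Return only a JSON list of strings.\nIdea: "{user_request}"'
    )
    return json.loads(call_llm(instruction))

def make_video(user_request: str) -> List[str]:
    shots = plan_shots(user_request)
    return [generate_video_clip(shot) for shot in shots]  # one clip per planned shot

# Example usage (requires real call_llm / generate_video_clip implementations):
# clips = make_video("A day in the life of a lighthouse keeper, from dawn to night")
```

The other roles mentioned in the abstract would slot into a similar pattern, with the LLM producing or scoring text at different points in the generation pipeline.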

Bio: Yingqing He is a final-year Ph.D. student at HKUST, supervised by Prof. Qifeng Chen. Her research interests include text-to-video generation, controllable generation, and multimodal generation. Her featured works include LVDM, VideoCrafter1, Follow-your-pose, Animate-A-Story, and ScaleCrafter.

Speakers

Organizers
