ExtDM: Distribution Extrapolation Diffusion Model for Video Prediction



Date

2024

Publication Type

Conference Paper

ETH Bibliography

yes

Abstract

Video prediction is challenging because of its inherent uncertainty, especially over long forecasting horizons. To model temporal dynamics, recent methods build on the success of diffusion models and repeatedly refine the predicted future frames with a 3D spatiotemporal U-Net. However, a gap remains between present and future frames, and the repeated application of the U-Net imposes a heavy computational burden. To address this, we propose ExtDM, a diffusion-based video prediction method that forecasts future frames by extrapolating the distribution of present features. Specifically, our method consists of three components: (i) a motion autoencoder that performs a bijective transformation between video frames and motion cues; (ii) a layered distribution adaptor module that extrapolates the present features under the guidance of a Gaussian distribution; (iii) a 3D U-Net architecture specialized for jointly fusing guidance and features along the temporal dimension via spatiotemporal-window attention. Extensive experiments on five popular benchmarks covering short- and long-term video prediction verify the effectiveness of ExtDM.
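The abstract's component (iii) fuses guidance and features along the temporal dimension with spatiotemporal-window attention. Since the paper's implementation is not part of this record, the following is a minimal PyTorch sketch of one plausible form of such a layer: multi-head self-attention restricted to non-overlapping (T, H, W) windows of a video feature volume. The class name, window sizes, and head count are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn


class SpatioTemporalWindowAttention(nn.Module):
    """Illustrative sketch: multi-head self-attention restricted to
    non-overlapping (T, H, W) windows of a video feature volume.
    Hypothetical re-creation; the paper's actual layer may differ
    (e.g. window sizes, relative position bias, shifted windows)."""

    def __init__(self, dim: int, num_heads: int = 4, window: tuple = (2, 4, 4)):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.window = window
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, H, W, C); T, H, W assumed divisible by the window size.
        B, T, H, W, C = x.shape
        wt, wh, ww = self.window
        # Partition the volume into non-overlapping 3D windows.
        x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
        n, win = x.shape[0], x.shape[1]
        # Standard scaled dot-product attention inside each window.
        qkv = self.qkv(x).reshape(n, win, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)          # each: (n, heads, win, d)
        attn = (q @ k.transpose(-2, -1)) * (q.shape[-1] ** -0.5)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(n, win, C)
        out = self.proj(out)
        # Undo the window partition back to (B, T, H, W, C).
        out = out.view(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
        out = out.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)
        return out


if __name__ == "__main__":
    layer = SpatioTemporalWindowAttention(dim=32)
    video = torch.randn(1, 4, 8, 8, 32)  # (B, T, H, W, C)
    assert layer(video).shape == video.shape
```

Restricting attention to local 3D windows keeps the cost linear in the number of windows rather than quadratic in the full video volume, which matches the abstract's concern about the heavy computational burden of repeated U-Net passes; the exact windowing scheme used by ExtDM should be taken from the paper itself.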

Publication status

published

Book title

2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Pages / Article No.

19310–19320

Publisher

IEEE

Event

2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2024)

Subject

Video Generation; Diffusion Model
