Scaling video generation to longer contexts with full-video attention (see Full Denoising below) captures global context, but incurs quadratic computational and memory costs and often leads to motion stagnation. Conversely, autoregressively generating video in shorter chunks (see Autoregressive Denoising below) reduces the computational burden but sacrifices temporal consistency across segments and necessitates repeated sampling. We combine the strengths of both approaches via a parallel generation paradigm: we leverage the global context of full-video attention while mitigating its computational cost and motion issues by first abstracting the video into a compact set of global tokens. These global tokens then guide the generation of consistent, short video segments, achieving both efficiency and coherence.
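The idea above can be sketched in a few lines. This is a minimal illustrative toy, not the paper's actual architecture: the token counts, segment sizes, and single-head attention are all hypothetical stand-ins. The point is the cost structure: global tokens attend once over the full video, and each short segment then attends only to the handful of global tokens, so per-segment cost no longer grows quadratically with video length.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv, d):
    # Scaled dot-product attention: rows of `q` attend over rows of `kv`.
    scores = q @ kv.T / np.sqrt(d)
    return softmax(scores) @ kv

# Hypothetical sizes: 256 frames, 64 tokens per frame, 16 global tokens,
# 8 segments of 32 frames each, model dimension 32.
d, n_frames, tok_per_frame = 32, 256, 64
n_global, n_segments = 16, 8
seg_tokens = (n_frames // n_segments) * tok_per_frame

video_tokens = rng.standard_normal((n_frames * tok_per_frame, d))
global_tokens = rng.standard_normal((n_global, d))

# Step 1: abstract the full video into compact global tokens
# (one pass of global tokens attending over all frame tokens).
global_ctx = cross_attention(global_tokens, video_tokens, d)

# Step 2: process each short segment independently (parallelizable),
# conditioning only on the 16 global tokens instead of all 16384.
segments = []
for s in range(n_segments):
    seg = video_tokens[s * seg_tokens:(s + 1) * seg_tokens]
    segments.append(seg + cross_attention(seg, global_ctx, d))

out = np.concatenate(segments, axis=0)
print(out.shape)  # → (16384, 32)
```

In this toy, each segment's attention is against 16 global tokens rather than all 16,384 video tokens, so total attention cost scales linearly with the number of segments while every segment still sees a shared summary of the whole video.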
Prompt: A fat rabbit wearing a purple robe walking through a fantasy landscape (256 frames)
Prompt: A confused grizzly bear trying to learn Calculus (256 frames)
Prompt: Back view of a young woman dressed in a yellow jacket, walking in the forest (256 frames)
Prompt: Happy Corgi playing in the park, golden hour, 4K (256 frames)
More comparisons can be found here.
We observed that emergent global tokens bind to semantically meaningful parts of the video frames.
@misc{dedhia2025generatingfastslowscalable,
title={Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks},
author={Bhishma Dedhia and David Bourgin and Krishna Kumar Singh and Yuheng Li and Yan Kang and Zhan Xu and Niraj K. Jha and Yuchen Liu},
year={2025},
eprint={2503.17539},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.17539},
}
This work was partially done during an internship at Adobe Research and was supported by NSF grant CCF2203399.