Scaling video generation to longer contexts with full-video attention (see Full Denoising below) captures global context, but incurs quadratic computational and memory costs and often leads to motion stagnation. Conversely, autoregressively generating video in shorter chunks (see Autoregressive Denoising below) reduces the computational burden but sacrifices temporal consistency across segments and necessitates repeated sampling. We combine the strengths of both approaches via a parallel generation paradigm: we leverage the global context of full-video attention while mitigating its computational cost and motion issues by first abstracting the video into a compact set of global tokens. These global tokens then guide the generation of consistent, short video segments, achieving both efficiency and coherence.
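The idea above can be sketched in a few lines. This is a minimal illustrative toy, not the paper's actual architecture: the token counts, segment sizes, and single-head attention are all hypothetical stand-ins. The point is the cost structure: global tokens attend once over the full video, and each short segment then attends only to the handful of global tokens, so per-segment cost no longer grows quadratically with video length.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv, d):
    # Scaled dot-product attention: rows of `q` attend over rows of `kv`.
    scores = q @ kv.T / np.sqrt(d)
    return softmax(scores) @ kv

# Hypothetical sizes: 256 frames, 64 tokens per frame, 16 global tokens,
# 8 segments of 32 frames each, model dimension 32.
d, n_frames, tok_per_frame = 32, 256, 64
n_global, n_segments = 16, 8
seg_tokens = (n_frames // n_segments) * tok_per_frame

video_tokens = rng.standard_normal((n_frames * tok_per_frame, d))
global_tokens = rng.standard_normal((n_global, d))

# Step 1: abstract the full video into compact global tokens
# (one pass of global tokens attending over all frame tokens).
global_ctx = cross_attention(global_tokens, video_tokens, d)

# Step 2: process each short segment independently (parallelizable),
# conditioning only on the 16 global tokens instead of all 16384.
segments = []
for s in range(n_segments):
    seg = video_tokens[s * seg_tokens:(s + 1) * seg_tokens]
    segments.append(seg + cross_attention(seg, global_ctx, d))

out = np.concatenate(segments, axis=0)
print(out.shape)  # → (16384, 32)
```

In this toy, each segment's attention is against 16 global tokens rather than all 16,384 video tokens, so total attention cost scales linearly with the number of segments while every segment still sees a shared summary of the whole video.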
Prompt: A fat rabbit wearing a purple robe walking through a fantasy landscape (256 frames)
Prompt: A confused grizzly bear trying to learn Calculus (256 frames)
Prompt: Back view of a young woman dressed in a yellow jacket, walking in the forest (256 frames)
Prompt: Happy Corgi playing in the park, golden hour, 4K (256 frames)
More comparisons can be found here.
We observed that emergent global tokens bind to semantically meaningful parts of the video frames.
@misc{dedhia2025generatingfastslowscalable,
title={Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks},
author={Bhishma Dedhia and David Bourgin and Krishna Kumar Singh and Yuheng Li and Yan Kang and Zhan Xu and Niraj K. Jha and Yuchen Liu},
year={2025},
eprint={2503.17539},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.17539},
}
This work was partially done during an internship at Adobe Research and was supported by NSF grant CCF2203399.