Overview

This paper proposes a two-stage framework in which a Vision-Language Model (VLM) first performs coarse motion planning, and a diffusion model then generates the final video conditioned on that plan. This design aims to ensure motion plausibility and temporal consistency across frames.
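A minimal sketch of such a two-stage pipeline, assuming the VLM emits coarse per-frame trajectories that condition the generator. All class and function names (`VLMPlanner`, `DiffusionGenerator`, `generate_video`) are illustrative stubs, not APIs from the paper:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MotionPlan:
    # Coarse per-frame object positions produced by the planner (hypothetical format).
    trajectories: List[Tuple[float, float]]

class VLMPlanner:
    """Stage 1: a VLM reasons about the scene and emits a coarse motion plan."""
    def plan(self, prompt: str, num_frames: int) -> MotionPlan:
        # Stub: a real system would query a VLM; here we interpolate linearly.
        return MotionPlan(trajectories=[(t / max(num_frames - 1, 1), 0.0)
                                        for t in range(num_frames)])

class DiffusionGenerator:
    """Stage 2: a video diffusion model conditioned on the motion plan."""
    def generate(self, prompt: str, plan: MotionPlan) -> List[str]:
        # Stub: emit one frame descriptor per planned step instead of pixels.
        return [f"frame@{x:.2f}" for x, _ in plan.trajectories]

def generate_video(prompt: str, num_frames: int = 4) -> List[str]:
    plan = VLMPlanner().plan(prompt, num_frames)        # coarse planning
    return DiffusionGenerator().generate(prompt, plan)  # conditioned synthesis

frames = generate_video("a ball rolling off a table", num_frames=4)
print(frames)
```

The key design point is the interface between the stages: the plan is an explicit, inspectable artifact, so motion errors can be caught or corrected before the expensive diffusion pass runs.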

Why it matters

Demonstrates a promising way to integrate explicit reasoning into generation, producing videos that are both visually realistic and physically plausible. Useful for domains that require reliable dynamics.

Key trade-offs / limitations

  • Output quality depends on the accuracy of the VLM’s motion plan.
  • The two-stage process may increase inference latency.

arXiv:2503.23368