Overview
This paper proposes a two-stage framework in which a Vision Language Model (VLM) performs coarse motion planning before a diffusion model generates the final video. The planning stage promotes motion plausibility and consistency across frames.

Why it matters
Demonstrates a promising way to integrate reasoning into generation, producing videos that are both realistic and physically plausible. Useful for domains requiring reliable dynamics.

Key trade-offs / limitations
- Overall quality depends on the accuracy of the VLM's motion plan; planning errors propagate to the generated video.
- The two-stage process may increase inference latency relative to single-pass generation.
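The two-stage pipeline described above can be sketched as follows. This is a minimal illustrative outline, not the paper's actual implementation: the class and function names (`MotionPlan`, `plan_motion`, `generate_video`) are hypothetical, and both stages are stubbed with placeholder logic where real VLM and diffusion-model calls would go.

```python
# Hypothetical sketch of the two-stage framework: a VLM drafts a coarse
# motion plan, then a diffusion model renders frames conditioned on it.
# All names and logic here are illustrative assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class MotionPlan:
    """Coarse per-frame motion descriptions produced by the planning stage."""
    steps: List[str]


def plan_motion(prompt: str, num_frames: int) -> MotionPlan:
    """Stage 1: a VLM produces a coarse motion plan (stubbed here)."""
    return MotionPlan(steps=[f"frame {i}: {prompt}" for i in range(num_frames)])


def generate_video(plan: MotionPlan) -> List[str]:
    """Stage 2: a diffusion model renders each frame conditioned on the
    plan; rendering is stubbed as a string transformation here."""
    return [f"rendered({step})" for step in plan.steps]


plan = plan_motion("a ball rolls off a table", num_frames=4)
frames = generate_video(plan)
print(len(frames))  # 4
```

The key design point is that the plan acts as a shared conditioning signal across all frames, which is what enforces temporal consistency in the second stage.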