Overview

This paper proposes a two-stage framework in which a Vision-Language Model (VLM) first performs coarse motion planning, and a diffusion model then generates the final video conditioned on that plan. This design aims to ensure motion plausibility and temporal consistency across frames.
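A minimal sketch of such a two-stage pipeline, assuming the VLM emits coarse per-frame trajectories that condition the generator. All class and function names (`VLMPlanner`, `DiffusionGenerator`, `generate_video`) are illustrative stubs, not APIs from the paper:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MotionPlan:
    # Coarse per-frame object positions produced by the planner (hypothetical format).
    trajectories: List[Tuple[float, float]]

class VLMPlanner:
    """Stage 1: a VLM reasons about the scene and emits a coarse motion plan."""
    def plan(self, prompt: str, num_frames: int) -> MotionPlan:
        # Stub: a real system would query a VLM; here we interpolate linearly.
        return MotionPlan(trajectories=[(t / max(num_frames - 1, 1), 0.0)
                                        for t in range(num_frames)])

class DiffusionGenerator:
    """Stage 2: a video diffusion model conditioned on the motion plan."""
    def generate(self, prompt: str, plan: MotionPlan) -> List[str]:
        # Stub: emit one frame descriptor per planned step instead of pixels.
        return [f"frame@{x:.2f}" for x, _ in plan.trajectories]

def generate_video(prompt: str, num_frames: int = 4) -> List[str]:
    plan = VLMPlanner().plan(prompt, num_frames)        # coarse planning
    return DiffusionGenerator().generate(prompt, plan)  # conditioned synthesis

frames = generate_video("a ball rolling off a table", num_frames=4)
print(frames)
```

The key design point is the interface between the stages: the plan is an explicit, inspectable artifact, so motion errors can be caught or corrected before the expensive diffusion pass runs.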

Why it matters

Demonstrates a promising way to integrate explicit reasoning into generation, producing videos that are both visually realistic and physically plausible. Useful for domains that require reliable dynamics.

Key trade-offs / limitations

  • Output quality depends on the accuracy of the VLM’s motion plan.
  • The two-stage process may increase inference latency.

arXiv:2503.23368