Vid2World: Crafting Video Diffusion Models to Interactive World Models (2025)
Overview
Huang et al. present Vid2World, a method that converts pre-trained video diffusion models into interactive world models through architectural adjustments and a training objective that together enable autoregressive generation and action controllability. They evaluate the approach on robotic manipulation and game simulation benchmarks.

Why it matters
Pretrained video diffusion models generate realistic dynamics, but typically offer little control over them. This work provides a pathway to reuse those models in interactive settings, broadening their applicability to robotics, simulation, and related domains.

Key trade-offs / limitations
- Action controllability may still be coarse depending on the domain.
- Diffusion models are heavy; adapting them can increase inference cost.
- Quality may degrade in domains with complex geometry or heavy occlusion.
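The core idea described in the overview, generating each frame by iterative denoising conditioned on previously generated frames (causal context) and the current action, can be sketched as a toy rollout loop. This is a minimal illustration, not the paper's architecture: `denoise_step` is a hypothetical stand-in for the learned diffusion network, and all dimensions and step counts are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME_DIM = 16   # toy flattened frame size (assumption)
ACTION_DIM = 4   # toy action dimensionality (assumption)
STEPS = 8        # denoising steps per frame (assumption)

def denoise_step(x, context, action, t):
    """One toy denoising update: nudge the noisy frame toward a
    deterministic function of the frame history and the current action.
    Stands in for the learned, action-conditioned denoiser (hypothetical)."""
    target = np.tanh(context.mean(axis=0) + action.sum())
    return x + (target - x) / (t + 1)

def rollout(actions, horizon):
    """Autoregressive rollout: each new frame is produced by iterative
    denoising, conditioned only on past frames and the step's action."""
    frames = [np.zeros(FRAME_DIM)]  # initial observation
    for k in range(horizon):
        x = rng.standard_normal(FRAME_DIM)  # start each frame from noise
        context = np.stack(frames)          # causal context: past frames only
        for t in reversed(range(STEPS)):
            x = denoise_step(x, context, actions[k], t)
        frames.append(x)
    return np.stack(frames)

actions = rng.standard_normal((5, ACTION_DIM))
video = rollout(actions, horizon=5)
print(video.shape)  # (6, 16): initial frame plus 5 generated frames
```

Because conditioning is restricted to already-generated frames, the loop can be driven interactively: an agent supplies each action just before the corresponding frame is denoised, rather than committing to the whole sequence up front.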