Vid2World: Crafting Video Diffusion Models to Interactive World Models (2025)

Overview

Huang et al. present Vid2World, a method that converts pre-trained video diffusion models into interactive world models through architectural adjustments and a modified training objective that together enable autoregressive generation and action controllability. They evaluate on robotic manipulation and game-simulation benchmarks.
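To make the interaction pattern concrete, here is a minimal sketch of an action-conditioned autoregressive rollout, the loop a world model of this kind runs at inference time: each new frame is denoised from noise conditioned on past frames and the current action, then appended to the history. All function and variable names here are illustrative assumptions, not taken from the paper's implementation.

```python
import random

def denoise_step(frame_history, action, noise):
    # Stand-in for one diffusion denoising pass. In a real model this
    # would run a video diffusion network conditioned causally on the
    # past frames and the current action; here it is a toy update.
    return [0.5 * v + 0.1 * action for v in noise]

def rollout(init_frames, actions, frame_dim=4, denoise_steps=8):
    """Autoregressively generate one frame per action."""
    history = list(init_frames)
    for a in actions:
        x = [random.gauss(0, 1) for _ in range(frame_dim)]  # start from noise
        for _ in range(denoise_steps):                      # iterative denoising
            x = denoise_step(history, a, x)
        history.append(x)  # feed the generated frame back in: autoregression
    return history

frames = rollout(init_frames=[[0.0] * 4], actions=[1.0, -1.0, 0.5])
print(len(frames))  # 1 initial frame + 3 generated frames
```

The key structural point is the feedback loop: unlike a standard video diffusion model that denoises a whole clip jointly, the interactive setting requires each frame to depend only on past frames and the injected action.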

Why it matters

Pretrained video diffusion models excel at generating realistic dynamics but typically lack action conditioning. This work provides a pathway to reuse those models in interactive settings, expanding their applicability to robotics, simulation, and related areas.

Key trade-offs / limitations

  • Action controllability may remain coarse, depending on the domain.
  • Diffusion models are computationally heavy; adapting them for autoregressive rollout can further increase inference cost.
  • Quality may degrade in domains with complex geometry or heavy occlusion.
arXiv:2505.14357