Vid2World: Crafting Video Diffusion Models to Interactive World Models (2025)

Overview

Huang et al. present Vid2World, a method that converts pre-trained video diffusion models into interactive world models through architectural adjustments and a modified training objective that together enable autoregressive generation and action controllability. They evaluate on robotic manipulation and game-simulation benchmarks.
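To make the interaction pattern concrete, here is a minimal sketch of an action-conditioned autoregressive rollout, the loop a world model of this kind runs at inference time: each new frame is denoised from noise conditioned on past frames and the current action, then appended to the history. All function and variable names here are illustrative assumptions, not taken from the paper's implementation.

```python
import random

def denoise_step(frame_history, action, noise):
    # Stand-in for one diffusion denoising pass. In a real model this
    # would run a video diffusion network conditioned causally on the
    # past frames and the current action; here it is a toy update.
    return [0.5 * v + 0.1 * action for v in noise]

def rollout(init_frames, actions, frame_dim=4, denoise_steps=8):
    """Autoregressively generate one frame per action."""
    history = list(init_frames)
    for a in actions:
        x = [random.gauss(0, 1) for _ in range(frame_dim)]  # start from noise
        for _ in range(denoise_steps):                      # iterative denoising
            x = denoise_step(history, a, x)
        history.append(x)  # feed the generated frame back in: autoregression
    return history

frames = rollout(init_frames=[[0.0] * 4], actions=[1.0, -1.0, 0.5])
print(len(frames))  # 1 initial frame + 3 generated frames
```

The key structural point is the feedback loop: unlike a standard video diffusion model that denoises a whole clip jointly, the interactive setting requires each frame to depend only on past frames and the injected action.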

Why it matters

Pretrained video diffusion models excel at generating realistic dynamics but typically lack action conditioning. This work provides a pathway to reuse those models in interactive settings, expanding their applicability to robotics, simulation, and related areas.

Key trade-offs / limitations

  • Action controllability may remain coarse, depending on the domain.
  • Diffusion models are computationally heavy; adapting them for autoregressive rollout can further increase inference cost.
  • Quality may degrade in domains with complex geometry or heavy occlusion.
arXiv:2505.14357