Introduction

Video generation and world models are rapidly converging. What started as separate pursuits — one focused on producing realistic videos and the other on predicting and simulating environments — is moving toward a common goal: real-time, interactive, persistent simulations of the world. This convergence is reshaping how we think about AI for gaming, robotics, AR/VR, sports, and interactive media.

The Shared Foundation

Both video and world models build on the same hierarchy of capabilities:
  • Causal → each prediction depends only on past frames and actions, so generation flows forward in time.
  • Interactive → the model accepts user or agent actions and responds to them.
  • Persistent → the model keeps objects and scenes consistent across long horizons.
On top of this shared base, the goals diverge:
  • Physically Accurate (for robotic learning): fidelity to real-world physics and generalization across conditions.
  • Real-Time (for human entertainment): ultra-low latency and high frame-rate responsiveness.

What Are World Models?

A world model is any system that predicts how the world evolves over time, often under the influence of actions. Two traditions exist:
  • Abstract / semantic world models
    Capture internal representations useful for reasoning, prediction, and planning.
    Example: An agent predicting “if I push the cup, it will fall,” without needing to render the cup in pixels.
  • Full-fidelity simulators
    Generate high-detail, physically accurate environments, often pixel-by-pixel.
    Example: A physics-based engine rendering a falling cup in real time.
The ultimate goal is convergence: models that understand the world semantically and can simulate it visually and physically.
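
To make the contrast concrete, here is a minimal sketch of the cup example in both styles. The rule table, the function names, the 0.75 m table height, and the 60 fps sampling rate are all assumptions made for illustration. The first function stands in for an abstract predictor and the second for a (very simplified) simulator; neither represents a real learned model or physics engine.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2

# Abstract / semantic style: a hypothetical lookup from (state, action) to outcome.
# No geometry and no pixels, only a symbolic prediction usable for planning.
ABSTRACT_RULES = {("cup_on_table_edge", "push"): "cup_falls"}


def predict_abstract(state: str, action: str) -> str:
    """Return a symbolic next state; unknown pairs leave the state unchanged."""
    return ABSTRACT_RULES.get((state, action), state)


def simulate_fall(height_m: float, fps: int = 60) -> list[float]:
    """Simulator style (toy): cup height at each frame until it reaches the floor.

    Samples the free-fall height h(t) = h0 - g*t^2/2 once per frame.
    Air resistance and rotation are ignored.
    """
    heights = []
    frame = 0
    while True:
        t = frame / fps
        h = height_m - 0.5 * G * t * t
        if h <= 0.0:
            heights.append(0.0)  # clamp the final frame to the floor
            return heights
        heights.append(h)
        frame += 1


trajectory = simulate_fall(0.75)
print(predict_abstract("cup_on_table_edge", "push"))
print("frames until impact:", len(trajectory) - 1)
print("closed-form fall time:", round(math.sqrt(2 * 0.75 / G), 3), "s")
```

The abstract predictor answers "what happens?" in one lookup, while the simulator has to produce every intermediate frame. A converged model would need to do both.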

Video Models: Stepping Toward World Models

Video models — especially those trained on large video datasets — generate temporally coherent frames. While impressive in producing realistic clips, they’ve historically lacked:
  • Causality
  • Interactivity
  • Persistence
  • Real-time responsiveness
  • Physical accuracy
These limitations highlight why today’s video models aren’t yet true world models.

Why Real-Time Matters

For many applications — gaming, AR/VR, telepresence, live sports analysis, robotics — latency and responsiveness are non-negotiable:
  • Throughput: generate frames fast enough for live playback.
  • Latency: keep the delay between an action and the first frame that reflects it low enough to feel immediate.
  • Consistency: objects and environments should remain stable across thousands of frames.
Without these properties, video models remain beautiful demos rather than interactive worlds.
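
The throughput and latency requirements above reduce to simple arithmetic: at a target frame rate, each frame has a budget of 1000 / fps milliseconds. The sketch below works through that calculation. The stage names and timings are hypothetical numbers chosen only to illustrate it, not measurements of any real system.

```python
def frame_budget_ms(target_fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / target_fps


def sustains_playback(gen_ms_per_frame: float, target_fps: float) -> bool:
    """Throughput check: can frames be generated as fast as they are displayed?"""
    return gen_ms_per_frame <= frame_budget_ms(target_fps)


def action_to_frame_latency_ms(stage_ms: dict[str, float]) -> float:
    """Latency check: delay from an input action to the frame that reflects it,
    assuming the stages run one after another for that frame."""
    return sum(stage_ms.values())


# Hypothetical per-stage timings in milliseconds.
stages = {"read_input": 2.0, "generate_frame": 25.0, "decode_and_display": 6.0}

for fps in (30, 60):
    print(
        fps, "fps budget:", round(frame_budget_ms(fps), 2), "ms,",
        "sustained:", sustains_playback(stages["generate_frame"], fps),
    )
print("action-to-frame latency:", action_to_frame_latency_ms(stages), "ms")
```

The two checks are independent. A pipelined system can hit its throughput target while still adding several frames of latency, which is why both need to be measured.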

Pathways Toward Convergence

Researchers are pushing several directions to unify video and world models:
Approach | Improves | Real-Time Implications
Autoregressive & causal modeling | Causality, interactivity | Enables frame-by-frame responsiveness.
Action-conditioned video generation | Interactivity | Bridges agent control with video outputs.
Memory & state-space models | Persistence | Maintain object stability over long sequences.
Optimized architectures (e.g., few-step diffusion) | Real-time performance | Push toward VR/AR frame-rate targets.
Physics-aware training & loss functions | Physical accuracy | Ensure believable motion & generalization.
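
As a rough illustration of how the first three approaches fit together, the toy sketch below runs an autoregressive, action-conditioned rollout that carries explicit state between frames. Everything in it is an assumption made for illustration: the 1D world, the `State`, `step`, `render`, and `rollout` names, and the hand-written transition rule, which stands in for what would be a learned network in a real system.

```python
from dataclasses import dataclass

WIDTH = 8  # width of the toy 1D world, in cells


@dataclass(frozen=True)
class State:
    """Persistent memory carried from frame to frame."""
    player: int
    landmark: int


def step(state: State, action: str) -> State:
    """Stand-in for a learned transition model.

    Causal by construction: it sees only the current state and current action.
    """
    delta = {"left": -1, "right": 1, "stay": 0}[action]
    new_player = min(WIDTH - 1, max(0, state.player + delta))
    return State(player=new_player, landmark=state.landmark)


def render(state: State) -> str:
    """Stand-in for a frame decoder: draw the state as one row of characters."""
    cells = ["."] * WIDTH
    cells[state.landmark] = "#"
    cells[state.player] = "@"  # the player is drawn on top of the landmark
    return "".join(cells)


def rollout(state: State, actions: list[str]) -> list[str]:
    """Generate one frame per action, feeding each new state back in."""
    frames = []
    for action in actions:
        state = step(state, action)
        frames.append(render(state))
    return frames


frames = rollout(State(player=1, landmark=5), ["right", "right", "stay", "left", "left"])
for frame in frames:
    print(frame)
```

Because the landmark lives in the carried state and is not re-inferred from recent pixels, it stays in the same cell however long the rollout runs. That is the property memory and state-space approaches aim to provide at scale.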

The Convergence Point

We are seeing video models evolving into world models, and world models adopting video-first realism:
  • From Video → World Models:
    Video generation learns to accept actions, maintain causality, and simulate physics.
  • From World Models → Video:
    Abstract predictors are extended to produce visually rich renderings that humans and machines can both use.
The result? Interactive, real-time environments that serve as both simulations for agents and immersive experiences for humans.

What’s Next

Key challenges on the path to convergence:
  1. Datasets pairing video with actions for training interactivity.
  2. Benchmarks that measure not just pixel quality, but latency, persistence, and physical realism.
  3. Hybrid systems combining video generation, 3D/4D representations, and explicit physics.
  4. Scaling to real-time while maintaining fidelity.
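
The first two challenges can be made slightly more concrete with a sketch of what an action-paired record and a beyond-pixels metric might look like. The `Step` record, its fields, and the `persistence_rate` metric are hypothetical constructions for illustration, not an established dataset format or benchmark, and the object IDs are assumed to come from some external tracker.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Step:
    """One hypothetical record: the action taken and what the resulting frame shows."""
    action: str
    visible_objects: frozenset[str]  # IDs from an assumed object tracker
    latency_ms: float                # measured action-to-frame delay


def persistence_rate(steps: list[Step]) -> float:
    """Fraction of consecutive frame pairs in which no tracked object vanishes.

    A toy proxy: it does not account for objects that legitimately leave the view.
    """
    if len(steps) < 2:
        return 1.0
    pairs = list(zip(steps, steps[1:]))
    stable = sum(1 for prev, cur in pairs if prev.visible_objects <= cur.visible_objects)
    return stable / len(pairs)


def report(steps: list[Step]) -> dict[str, float]:
    """Summarize an episode on latency and persistence, not pixel quality."""
    latencies = [s.latency_ms for s in steps]
    return {
        "mean_latency_ms": mean(latencies),
        "max_latency_ms": max(latencies),
        "persistence_rate": persistence_rate(steps),
    }


# Hypothetical episode in which the cup briefly disappears in the third frame.
episode = [
    Step("push", frozenset({"cup", "table"}), 30.0),
    Step("stay", frozenset({"cup", "table"}), 34.0),
    Step("stay", frozenset({"table"}), 31.0),
    Step("stay", frozenset({"cup", "table"}), 33.0),
]
print(report(episode))
```

A real benchmark would need a more careful definition of persistence, for example one that distinguishes occlusion from disappearance, but the shape of the measurement is the same: per-step records of actions, observations, and timing.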

Summary

  • Video models generate frames, but often lack causality, interactivity, persistence, real-time speed, and physical accuracy.
  • World models aim to predict the unfolding of reality, either abstractly or with simulation fidelity.
  • Convergence is underway: video models are gaining world-model properties, while world models are adopting visual realism.
  • The outcome: real-time, interactive video world models that can power games, robotics, AR/VR, and beyond.