Video World Models with Long-term Spatial Memory (2025)

Overview

Wu et al. propose a framework that augments video world models with a geometry-grounded long-term spatial memory able to store and retrieve information across long horizons. This mitigates forgetting and inconsistencies when a generated scene revisits previously seen parts of an environment. They also curate datasets specifically designed for evaluating these long-term memory mechanisms.
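To make the store/retrieve idea concrete, here is a minimal illustrative sketch of a geometry-grounded memory: entries are keyed by 3D world coordinates, and retrieval returns entries near a query position. This is an assumption-laden toy (the class name, `radius` parameter, and nearest-first retrieval are all hypothetical); the paper's actual memory is learned and operates on video features, not hand-stored strings.

```python
import math


class SpatialMemory:
    """Toy geometry-grounded memory (illustrative only, not the paper's
    implementation): stores items keyed by 3D positions and retrieves
    those within a radius of a query position, nearest first."""

    def __init__(self, radius=1.0):
        self.radius = radius   # hypothetical retrieval neighborhood
        self.entries = []      # list of (position, item) pairs

    def store(self, position, item):
        # Record an item at a 3D world coordinate.
        self.entries.append((tuple(position), item))

    def retrieve(self, query):
        # Collect stored items within `radius` of the query, sorted by
        # distance so the closest match comes first.
        hits = []
        for pos, item in self.entries:
            d = math.dist(pos, query)
            if d <= self.radius:
                hits.append((d, item))
        return [item for _, item in sorted(hits, key=lambda t: t[0])]
```

Revisiting a location then amounts to querying near a previously stored coordinate: `mem.retrieve((0.1, 0, 0))` returns what was stored at the origin, while a far-away query returns nothing.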

Why it matters

Maintaining spatial consistency over time is crucial for interactive agents, simulation, AR/VR, and robotics. It helps avoid drift and the visual or geometric mismatches that degrade user experience and the reliability of world models. Geometry-grounded memory is a promising route to scaling world models to larger scenes.

Key trade-offs / limitations

  • Adding memory structures increases model complexity and computation/memory cost.
  • Dataset curated for revisits may not cover all types of environments (diversity of geometry, lighting, etc.).
  • Retrieval and memory mechanisms can add latency, and noisy memory contents can propagate errors into generation.

arXiv:2506.05284