Video World Models with Long-term Spatial Memory (2025)
Overview
Wu et al. propose a framework that augments video world models with a geometry-grounded long-term spatial memory, able to store and retrieve scene information across long horizons. This mitigates forgetting and inconsistencies when a generated scene revisits previously seen parts of an environment. They also curate datasets specifically designed to evaluate these long-term memory mechanisms.
Why it matters
Maintaining spatial consistency over time is crucial for interactive agents, simulation, AR/VR, and robotics. It prevents the drift and visual or geometric mismatches that degrade the user experience and the reliability of world models. Geometry-grounded memory is a promising route to scaling world models to larger scenes.
Key trade-offs / limitations
- Adding memory structures increases model complexity and computation/memory cost.
- A dataset curated around revisit scenarios may not cover the full diversity of environments (geometry, lighting, etc.).
- Retrieval and memory mechanisms can introduce latency, or propagate errors when the stored memory is noisy.
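To make the core idea concrete, here is a minimal, hypothetical sketch of a geometry-grounded memory: observations are stored keyed by 3D world coordinates, and retrieval returns the entries nearest a queried camera position, so generation at a revisited location can be conditioned on what was seen there before. The class and its interface are illustrative assumptions, not the paper's actual architecture (which operates on learned features, not strings).

```python
import math

class SpatialMemory:
    """Hypothetical sketch: features keyed by 3D world position,
    retrieved by nearest-neighbor lookup on a query position."""

    def __init__(self):
        # Flat list of (position, feature) pairs; a real system would
        # use a spatial index (k-d tree, voxel hash) for scale.
        self.entries = []

    def store(self, position, feature):
        # position: (x, y, z) in world coordinates; feature: any payload.
        self.entries.append((position, feature))

    def retrieve(self, query, k=2):
        # Return the k stored features closest to the query position.
        ranked = sorted(self.entries, key=lambda e: math.dist(e[0], query))
        return [feature for _, feature in ranked[:k]]

# Usage: store observations along a trajectory, then query on revisit.
mem = SpatialMemory()
mem.store((0.0, 0.0, 0.0), "red door")
mem.store((5.0, 0.0, 0.0), "blue wall")
mem.store((0.5, 0.0, 0.2), "doormat")
print(mem.retrieve((0.0, 0.0, 0.0), k=2))  # two nearest features
```

The linear scan keeps the sketch short; the latency concern noted above is exactly why a real implementation would need an efficient spatial index and a policy for pruning or compressing stale entries.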