Video World Models with Long-term Spatial Memory (2025)

Overview

Wu et al. propose a framework that augments video world models with a geometry-grounded long-term spatial memory able to store and retrieve information across long horizons. This mitigates forgetting and inconsistencies when a generated scene revisits previously seen parts of an environment. They also curate datasets specifically designed for evaluating these long-term memory mechanisms.
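To make the store/retrieve idea concrete, here is a minimal illustrative sketch of a geometry-grounded memory: entries are keyed by 3D world coordinates, and retrieval returns entries near a query position. This is an assumption-laden toy (the class name, `radius` parameter, and nearest-first retrieval are all hypothetical); the paper's actual memory is learned and operates on video features, not hand-stored strings.

```python
import math


class SpatialMemory:
    """Toy geometry-grounded memory (illustrative only, not the paper's
    implementation): stores items keyed by 3D positions and retrieves
    those within a radius of a query position, nearest first."""

    def __init__(self, radius=1.0):
        self.radius = radius   # hypothetical retrieval neighborhood
        self.entries = []      # list of (position, item) pairs

    def store(self, position, item):
        # Record an item at a 3D world coordinate.
        self.entries.append((tuple(position), item))

    def retrieve(self, query):
        # Collect stored items within `radius` of the query, sorted by
        # distance so the closest match comes first.
        hits = []
        for pos, item in self.entries:
            d = math.dist(pos, query)
            if d <= self.radius:
                hits.append((d, item))
        return [item for _, item in sorted(hits, key=lambda t: t[0])]
```

Revisiting a location then amounts to querying near a previously stored coordinate: `mem.retrieve((0.1, 0, 0))` returns what was stored at the origin, while a far-away query returns nothing.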

Why it matters

Maintaining spatial consistency over time is crucial for interactive agents, simulation, AR/VR, and robotics. It helps avoid drift and the visual or geometric mismatches that degrade user experience and the reliability of world models. Geometry-grounded memory is a promising route to scaling world models to larger scenes.

Key trade-offs / limitations

  • Adding memory structures increases model complexity and computation/memory cost.
  • Dataset curated for revisits may not cover all types of environments (diversity of geometry, lighting, etc.).
  • Retrieval and memory mechanisms can add latency, and noisy memory contents can propagate errors into generation.

arXiv:2506.05284