Overview

This paper proposes combining state-space models (SSMs) with dense local attention to maintain spatial consistency while extending the temporal context length of video world models. It targets the difficulty of scaling to long horizons, which stems from the quadratic cost of full attention over long frame sequences.
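To make the idea concrete, here is a minimal PyTorch sketch of a block that pairs a sequential state-space scan over frames with dense attention restricted to the tokens of each frame. This is not the paper's implementation: the class names, the simple diagonal recurrence, and the within-frame attention window are illustrative assumptions, and the paper's actual SSM parameterization and local-attention windowing may differ.

```python
import torch
import torch.nn as nn

class TemporalSSM(nn.Module):
    """Toy diagonal state-space scan over the time axis (hypothetical stand-in
    for the paper's SSM layer): per-channel recurrence h_t = a * h_{t-1} + b * x_t."""
    def __init__(self, dim):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(dim))     # per-channel decay (pre-sigmoid)
        self.b = nn.Parameter(torch.ones(dim))          # per-channel input gain
        self.out = nn.Linear(dim, dim)

    def forward(self, x):                               # x: (B, T, N, D)
        a = torch.sigmoid(self.log_a)                   # keep recurrence stable in (0, 1)
        h = torch.zeros_like(x[:, 0])                   # (B, N, D) hidden state
        ys = []
        for t in range(x.shape[1]):                     # sequential scan over frames
            h = a * h + self.b * x[:, t]
            ys.append(h)
        return self.out(torch.stack(ys, dim=1))

class LocalSpatialAttention(nn.Module):
    """Dense self-attention restricted to the tokens of each individual frame."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                               # x: (B, T, N, D)
        B, T, N, D = x.shape
        flat = x.reshape(B * T, N, D)                   # attend within each frame only
        out, _ = self.attn(flat, flat, flat)
        return out.reshape(B, T, N, D)

class SSMLocalAttnBlock(nn.Module):
    """One block: long-range temporal SSM followed by dense local attention."""
    def __init__(self, dim):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.temporal = TemporalSSM(dim)
        self.spatial = LocalSpatialAttention(dim)

    def forward(self, x):
        x = x + self.temporal(self.norm1(x))            # temporal memory across frames
        x = x + self.spatial(self.norm2(x))             # spatial consistency within frames
        return x

if __name__ == "__main__":
    tokens = torch.randn(2, 16, 64, 128)                # (batch, frames, tokens/frame, dim)
    print(SSMLocalAttnBlock(128)(tokens).shape)         # torch.Size([2, 16, 64, 128])
```

The intent of this arrangement is that the recurrent scan carries temporal state across arbitrarily many frames at linear cost in sequence length, while full pairwise attention is paid only inside small local windows where it is cheap.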

Why it matters

Long-horizon consistency is a central challenge in world modeling: errors compound over time, memory is finite, and many methods fail when asked to maintain scenes over extended durations. Approaches that extend temporal memory without prohibitive compute cost are therefore key.

Key trade-offs / limitations

  • May sacrifice some fine-grained spatial detail in exchange for longer temporal coverage.
  • SSMs introduce their own complexity and can be harder to optimize.
  • Benchmarks may not fully reflect real-world variation (lighting, viewpoint changes) that stresses the model.
arXiv:2505.20171