Overview

DIAMOND is a diffusion-based world model trained on environment frames. It generates future frames with higher visual fidelity than prior world models, and RL agents trained inside it perform better downstream (e.g., on Atari). The paper argues that preserving visual detail through diffusion helps pixel-based RL agents, compared with world models that compress observations into coarse latent transitions.
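
To make the idea of a diffusion-based world model concrete, here is a minimal sketch of generating imagined frames by iterative denoising, conditioned on past frames and an action. It is not the paper's implementation: `toy_denoiser` is a hypothetical stand-in for a trained network, and the noise schedule, step count, Euler update, and all names are assumptions chosen for illustration.

```python
import numpy as np

def toy_denoiser(noisy_frame, sigma, past_frames, action, data_std=0.05):
    """Hypothetical stand-in for a learned denoiser (assumption, not the paper's network).

    Returns the posterior mean of a clean frame under a toy Gaussian prior
    centred on the most recent frame, shifted slightly by the action.
    """
    target = np.clip(past_frames[-1] + 0.01 * action, 0.0, 1.0)
    weight = data_std**2 / (data_std**2 + sigma**2)
    return weight * noisy_frame + (1.0 - weight) * target

def sample_next_frame(past_frames, action, rng, num_steps=3,
                      sigma_max=5.0, sigma_min=0.01):
    """Draw one next frame by Euler-integrating from pure noise down to sigma = 0."""
    sigmas = np.append(np.geomspace(sigma_max, sigma_min, num_steps), 0.0)
    x = rng.normal(size=past_frames[-1].shape) * sigmas[0]
    calls = 0
    for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
        denoised = toy_denoiser(x, sigma, past_frames, action)
        calls += 1
        direction = (x - denoised) / sigma        # derivative of x with respect to sigma
        x = x + (sigma_next - sigma) * direction  # Euler step to the next noise level
    return np.clip(x, 0.0, 1.0), calls

def imagine(initial_frames, actions, num_steps=3, seed=0):
    """Roll out imagined frames autoregressively, feeding each sample back as context."""
    rng = np.random.default_rng(seed)
    frames = list(initial_frames)
    context = len(frames)
    total_calls = 0
    for action in actions:
        frame, calls = sample_next_frame(frames[-context:], action, rng, num_steps)
        frames.append(frame)
        total_calls += calls
    return np.stack(frames[context:]), total_calls
```

In this sketch every imagined frame costs `num_steps` denoiser evaluations, so a rollout of T frames costs T * `num_steps` network calls. This is the source of the inference overhead relative to a model that predicts each transition in a single forward pass.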

Why it matters

DIAMOND shows that diffusion models can serve as effective world models in settings where detailed pixel fidelity matters for agent decision-making.

Key trade-offs / limitations

  • Diffusion models are typically more compute-intensive at both training and inference, since sampling each frame requires multiple denoising steps.
  • Results are demonstrated in a constrained domain (Atari); transfer to large real-world scenes requires further study.

Source: NeurIPS 2024 poster / paper page