Overview

Aether introduces a unified world modeling framework that combines 4D dynamic reconstruction, action-conditioned video prediction, and goal-conditioned planning in a single model. It emphasizes geometry-aware features and, although trained without real-world data, generalizes to real scenes, showing zero-shot performance in both action following and reconstruction.

Why it matters

Merging reconstruction, prediction, and planning in one model simplifies pipelines for embodied AI tasks, which otherwise need a separate component for each. Geometry awareness helps bridge simulation and reality and improves perceptual alignment and control. Zero-shot generalization is especially valuable for real-world deployment, since the model does not have to be fine-tuned on each new environment.

Key trade-offs / limitations

  • Training multiple objectives jointly can complicate tuning, since the objectives must be balanced against one another.
  • Synthetic-to-real generalization, though promising, may still falter in more complex real scenes.
  • The approach may not yet scale to very high resolution or large outdoor scenes.
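
To make the first limitation concrete, here is a minimal sketch, assuming the joint objective is a plain weighted sum of per-task losses. That form, the names LossWeights, joint_loss, and contribution_shares, and every numeric value are illustrative assumptions; none of them is taken from the Aether paper.

```python
from dataclasses import dataclass


@dataclass
class LossWeights:
    # Hypothetical weights for illustration only; not from the Aether paper.
    reconstruction: float = 1.0
    prediction: float = 1.0
    planning: float = 1.0


def joint_loss(losses: dict, weights: LossWeights) -> float:
    """Weighted sum of the per-objective losses (assumed form)."""
    return sum(getattr(weights, name) * value for name, value in losses.items())


def contribution_shares(losses: dict, weights: LossWeights) -> dict:
    """Fraction of the joint loss contributed by each objective."""
    total = joint_loss(losses, weights)
    if total == 0:
        raise ValueError("joint loss is zero; shares are undefined")
    return {
        name: getattr(weights, name) * value / total
        for name, value in losses.items()
    }


if __name__ == "__main__":
    # Made-up loss values on different scales.
    losses = {"reconstruction": 4.0, "prediction": 0.5, "planning": 0.5}

    # With equal weights, the largest loss dominates the joint objective.
    print(contribution_shares(losses, LossWeights()))

    # Down-weighting reconstruction rebalances the objectives.
    print(contribution_shares(losses, LossWeights(reconstruction=0.125)))
```

With equal weights, the objective with the largest raw loss accounts for most of the total, so each weight becomes a hyperparameter to tune. That is one way joint training adds tuning effort.
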
arXiv:2503.18945