TesserAct: Learning 4D Embodied World Models (2025)
Overview
Zhen et al. propose TesserAct, which learns a world model from RGB-D-N (RGB, Depth, Normals) video data to produce temporally and spatially coherent predictions of 3D scenes as an agent acts. They also show improvements in novel view synthesis and policy learning over prior video-only world models.
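To make the RGB-D-N input concrete, here is a minimal sketch (an assumption for illustration, not the authors' code) of one plausible way to pack per-frame color, depth, and surface normals into a single multi-channel clip tensor that a video world model could condition on. The function name `pack_rgbdn_clip` and the normalization choices are hypothetical.

```python
import numpy as np

def pack_rgbdn_clip(rgb, depth, normals):
    """Stack RGB (T,H,W,3), depth (T,H,W,1), and normals (T,H,W,3)
    into one (T,H,W,7) array, normalizing each modality to [0, 1].
    (Illustrative sketch; not the paper's actual preprocessing.)"""
    rgb = rgb.astype(np.float32) / 255.0                       # 8-bit color -> [0, 1]
    depth = depth.astype(np.float32)
    depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)  # per-clip depth scaling
    normals = (normals.astype(np.float32) + 1.0) / 2.0          # unit normals [-1, 1] -> [0, 1]
    return np.concatenate([rgb, depth, normals], axis=-1)       # (T, H, W, 7)

# Example with random data standing in for a 16-frame clip.
T, H, W = 16, 64, 64
clip = pack_rgbdn_clip(
    rgb=np.random.randint(0, 256, (T, H, W, 3), dtype=np.uint8),
    depth=np.random.rand(T, H, W, 1),
    normals=np.random.uniform(-1.0, 1.0, (T, H, W, 3)),
)
print(clip.shape)  # (16, 64, 64, 7)
```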
Why it matters
Embedding 3D geometry and surface normals adds richer structural information, improving fidelity in scene reconstruction and enabling more robust agents that can plan in the 3D world. This bridges the gap between purely 2D video generation and embodied 3D world modeling.
Key trade-offs / limitations
- Requires datasets with depth and normal annotations, which are larger and harder to obtain.
- Higher computational cost due to the richer input modalities.
- View synthesis and geometry fidelity may still degrade at object edges and for out-of-distribution views.