TesserAct: Learning 4D Embodied World Models (2025)
Overview
Zhen et al. propose TesserAct, which learns a world model from RGB-D-N (RGB, Depth, Normals) video data to produce temporally and spatially coherent predictions of 3D scenes as an agent acts. They also show improvements in novel view synthesis and policy learning over prior video-only world models.
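To make the RGB-D-N input concrete, here is a minimal sketch (an assumption for illustration, not the authors' code) of one plausible way to pack per-frame color, depth, and surface normals into a single multi-channel clip tensor that a video world model could condition on. The function name `pack_rgbdn_clip` and the normalization choices are hypothetical.

```python
import numpy as np

def pack_rgbdn_clip(rgb, depth, normals):
    """Stack RGB (T,H,W,3), depth (T,H,W,1), and normals (T,H,W,3)
    into one (T,H,W,7) array, normalizing each modality to [0, 1].
    (Illustrative sketch; not the paper's actual preprocessing.)"""
    rgb = rgb.astype(np.float32) / 255.0                       # 8-bit color -> [0, 1]
    depth = depth.astype(np.float32)
    depth = (depth - depth.min()) / (depth.max() - depth.min() + 1e-8)  # per-clip depth scaling
    normals = (normals.astype(np.float32) + 1.0) / 2.0          # unit normals [-1, 1] -> [0, 1]
    return np.concatenate([rgb, depth, normals], axis=-1)       # (T, H, W, 7)

# Example with random data standing in for a 16-frame clip.
T, H, W = 16, 64, 64
clip = pack_rgbdn_clip(
    rgb=np.random.randint(0, 256, (T, H, W, 3), dtype=np.uint8),
    depth=np.random.rand(T, H, W, 1),
    normals=np.random.uniform(-1.0, 1.0, (T, H, W, 3)),
)
print(clip.shape)  # (16, 64, 64, 7)
```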
Why it matters
Embedding 3D geometry and surface normals adds richer structural information, improving fidelity in scene reconstruction and enabling more robust agents that can plan in the 3D world. This bridges the gap between purely 2D video generation and embodied 3D world modeling.
Key trade-offs / limitations
- Requires datasets with depth and normal annotations, which are larger and harder to obtain.
- Higher computational cost due to the richer input modalities.
- View synthesis and geometry fidelity may still degrade at object edges and for out-of-distribution views.