Overview

Vid2Sim bridges the sim-to-real gap by converting monocular videos into photorealistic, physically interactive 3D simulation environments. Using neural 3D scene reconstruction and simulation, it enables reinforcement learning (RL) of visual navigation agents in complex urban scenes.
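
To make the workflow concrete, the sketch below shows the observe-act-reward loop an agent would run inside a scene reconstructed from video. This is a hypothetical illustration only: the Vid2SimEnv class, its method names, the scenes/street_corner path, and the two-dimensional velocity action are stand-ins, not the actual Vid2Sim API.

    # Hypothetical sketch of an RL rollout loop in a video-reconstructed
    # simulation environment. Vid2SimEnv is a stub, not the real Vid2Sim API;
    # it only illustrates the observe-act-reward cycle of a navigation agent.
    import numpy as np


    class Vid2SimEnv:
        """Stub environment with a gym-style interface (hypothetical names)."""

        def __init__(self, scene_path: str, image_size=(128, 128)):
            self.scene_path = scene_path      # reconstructed 3D scene assets (placeholder)
            self.image_size = image_size
            self.rng = np.random.default_rng(0)
            self.step_count = 0

        def reset(self):
            self.step_count = 0
            return self._render_observation()

        def step(self, action: np.ndarray):
            # action: e.g. [linear_velocity, angular_velocity] for a wheeled robot
            self.step_count += 1
            obs = self._render_observation()
            reward = -0.01                    # placeholder step penalty
            done = self.step_count >= 200     # placeholder episode horizon
            return obs, reward, done, {}

        def _render_observation(self):
            # A real environment would render the photorealistic reconstruction;
            # random pixels are returned here just so the loop runs.
            h, w = self.image_size
            return self.rng.integers(0, 256, size=(h, w, 3), dtype=np.uint8)


    def collect_episode(env, policy):
        """Roll out one episode and return its total reward."""
        obs, total_reward, done = env.reset(), 0.0, False
        while not done:
            action = policy(obs)
            obs, reward, done, _ = env.step(action)
            total_reward += reward
        return total_reward


    if __name__ == "__main__":
        env = Vid2SimEnv(scene_path="scenes/street_corner")   # hypothetical path
        random_policy = lambda obs: np.random.uniform(-1.0, 1.0, size=2)
        print("episode return:", collect_episode(env, random_policy))

In practice the random policy would be replaced by a learned visual policy, and the reward would encode goal-reaching and collision avoidance for the urban navigation task.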

Why it matters

Addresses the major challenge of sim-to-real transfer in robot learning by creating realistic digital twins from minimal video input, enabling scalable, cost-efficient training for urban navigation applications such as food-delivery robots and assistive vehicles.

Key trade-offs / limitations

  • Time-consuming scene-building process that requires GLOMAP initialization
  • Current dataset is limited to 30 environments and needs expansion for better performance
  • Scene reconstruction requires extensive geometric processing
  • Weather simulation capabilities are still under development
arXiv:2501.06693