WorldSimBench proposes a dual evaluation framework: (1) Explicit Perceptual Evaluation (visual quality, alignment with prompts) and (2) Implicit Manipulative Evaluation (whether generated videos can guide downstream embodied tasks). It introduces the HF-Embodied dataset and tests models on embodied control scenarios.
This moves evaluation beyond aesthetics to practical usefulness. By bridging video generation with embodied AI and simulation, the benchmark makes it possible to assess whether generative models can serve as true world simulators.
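
As a rough illustration of keeping the two evaluation tracks separate, the Python sketch below averages a perceptual score and a task-success score over a batch of generated videos and reports them side by side. It is not WorldSimBench's actual code: the names `DualEvalReport`, `evaluate_dual`, `perceptual_scorer`, and `task_runner` are hypothetical, and the two callables stand in for whatever perceptual evaluator and embodied task rollout a real setup would plug in.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Any, Callable, Sequence


@dataclass
class DualEvalReport:
    """Hypothetical container that keeps the two tracks as separate numbers."""
    perceptual: float    # explicit track: visual quality / prompt alignment
    manipulative: float  # implicit track: success when the video guides a task


def evaluate_dual(
    videos: Sequence[Any],
    prompts: Sequence[str],
    perceptual_scorer: Callable[[Any, str], float],
    task_runner: Callable[[Any], float],
) -> DualEvalReport:
    """Average both scores over a batch of generated videos.

    Both callables are assumptions made for this sketch:
    - perceptual_scorer(video, prompt) returns a quality/alignment score.
    - task_runner(video) returns a success score for a downstream
      embodied task guided by that video.
    """
    if len(videos) != len(prompts):
        raise ValueError("videos and prompts must have the same length")
    if not videos:
        raise ValueError("at least one video is required")

    perceptual = mean(perceptual_scorer(v, p) for v, p in zip(videos, prompts))
    manipulative = mean(task_runner(v) for v in videos)
    return DualEvalReport(perceptual=perceptual, manipulative=manipulative)


# Usage with stub scorers; the values are placeholders, not benchmark results.
report = evaluate_dual(
    videos=["clip_a", "clip_b"],
    prompts=["pick up the block", "open the drawer"],
    perceptual_scorer=lambda video, prompt: 0.5,
    task_runner=lambda video: 1.0 if video == "clip_a" else 0.0,
)
print(report)
```

Reporting the two numbers separately, rather than folding them into one score, mirrors the point of the dual framework: a model can produce visually convincing videos and still fail to guide a downstream task.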