Overview
MotionBench introduces a benchmark focused on fine-grained motion comprehension for vision-language models (VLMs). It includes ~8,000 video/question pairs spanning tasks such as motion recognition, motion location, action order, and repetition counting, drawn from both real and synthetic video sources. The authors also propose “Through-Encoder Fusion” to better preserve motion information. Results show that current VLMs struggle, often scoring below 60% accuracy.
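For context, a benchmark of this shape is typically consumed as multiple-choice video QA grouped by task. The sketch below is a minimal, hypothetical scoring loop: the annotation fields and the `pred_fn` hook are assumptions for illustration, not MotionBench's actual file format or API.

```python
import json
from collections import defaultdict

def evaluate(pred_fn, annotation_path):
    """Score per-task accuracy on a MotionBench-style set of
    (video, question, options, answer, task) records.

    `pred_fn(video_path, question, options)` is a hypothetical hook
    wrapping whatever VLM is under test; it returns an option index.
    """
    correct = defaultdict(int)
    total = defaultdict(int)

    with open(annotation_path) as f:
        samples = json.load(f)  # assumed: a list of dicts

    for s in samples:
        pred = pred_fn(s["video"], s["question"], s["options"])
        total[s["task"]] += 1
        if pred == s["answer"]:
            correct[s["task"]] += 1

    # Per-task accuracy, e.g. {"repetition_counting": 0.41, ...}
    return {task: correct[task] / total[task] for task in total}
```

Reporting accuracy per task rather than a single aggregate is what surfaces the repetition-counting and subtle-motion weaknesses noted below.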
Why it matters
This fills a gap in evaluating low-level temporal and motion perception, which is critical for robotics, surveillance, and medical video analysis. It also highlights that architectural and input strategies are key levers for improving motion understanding.
Key trade-offs / limitations
- Current models perform poorly on repeated or subtle motions.
- The dataset focuses on specific motion types; broader dynamics remain untested.