Overview

MotionBench introduces a benchmark for fine-grained motion comprehension in vision-language models (VLMs). It contains roughly 8,000 question–answer pairs over videos drawn from both real-world and synthetic sources, covering tasks such as motion recognition, location-related motion, action order, and repetition counting. The authors also propose Through-Encoder (TE) Fusion, which fuses video frames inside the visual encoder to better preserve motion information under a fixed token budget. In their evaluation, current VLMs struggle, typically scoring below 60% accuracy.
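
A minimal PyTorch sketch of the through-encoder fusion idea: instead of encoding each frame independently and concatenating tokens afterwards, temporal attention is interleaved inside the encoder and each small group of frames is pooled into a single token set, so more frames fit in the same LLM context budget. All class names, the group size, and the dimensions below are illustrative assumptions, not the authors' implementation.

```python
# Conceptual sketch of deep (through-encoder) frame fusion.
# Class names, group_size, and dimensions are illustrative assumptions,
# not the MotionBench authors' code.
import torch
import torch.nn as nn


class ThroughEncoderBlock(nn.Module):
    """Encoder block that mixes information across neighboring frames
    inside the visual backbone, rather than after it."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (groups, frames_per_group, patches, dim)
        g, f, p, d = x.shape
        # Spatial self-attention within each frame.
        s = x.reshape(g * f, p, d)
        q = self.norm1(s)
        s = s + self.spatial_attn(q, q, q)[0]
        # Temporal self-attention across a group's frames at each patch
        # position -- this is where inter-frame motion cues are preserved.
        t = s.reshape(g, f, p, d).permute(0, 2, 1, 3).reshape(g * p, f, d)
        q = self.norm2(t)
        t = t + self.temporal_attn(q, q, q)[0]
        return t.reshape(g, p, f, d).permute(0, 2, 1, 3)


class ToyVideoEncoder(nn.Module):
    """Encodes a clip in small frame groups and pools each group to a single
    token set, so more frames fit in a fixed LLM context budget."""

    def __init__(self, dim: int = 256, depth: int = 2, group_size: int = 4):
        super().__init__()
        self.group_size = group_size
        self.blocks = nn.ModuleList(ThroughEncoderBlock(dim) for _ in range(depth))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (num_frames, patches, dim) pre-patchified frame features
        n, p, d = frames.shape
        x = frames.reshape(n // self.group_size, self.group_size, p, d)
        for blk in self.blocks:
            x = blk(x)
        # Temporal pooling: each group of `group_size` frames -> one token set.
        return x.mean(dim=1)  # (num_groups, patches, dim)


if __name__ == "__main__":
    clip = torch.randn(16, 49, 256)  # 16 frames, 7x7 patches, toy feature dim
    tokens = ToyVideoEncoder()(clip)
    print(tokens.shape)              # torch.Size([4, 49, 256])
```

The design point the sketch illustrates is compression ratio: pooling after deep cross-frame attention keeps motion cues that per-frame encoding followed by naive token concatenation or pooling would discard.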

Why it matters

MotionBench fills a gap in evaluating fine-grained temporal and motion perception, which is critical for applications such as robotics, surveillance, and medical video analysis. Its results highlight that architecture (how frames are fused) and input strategy (how many frames at what resolution fit the context budget) are the key levers for improving motion understanding.

Key trade-offs / limitations

  • Current models perform poorly on repeated or subtle motions.
  • Dataset focuses on specific motion types; broader dynamics remain untested.
arXiv:2501.02955