📜 Paper:
arxiv.org/abs/2505.006...
📊 Dataset:
github.com/google-deepm...
This is work with my amazing colleagues and collaborators Arsha Nagrani, Sachit Menon, Ahmet Iscen, Shyamal Buch, Ramin Mehran, Nilpa Jha, Anja Hauth, Yukun Zhu, Carl Vondrick, Mikhail Sirotenko, and Cordelia Schmid

MINERVA: Evaluating Complex Video Reasoning
Multimodal LLMs are turning their focus to video benchmarks; however, most video benchmarks provide only outcome supervision, with no intermediate or interpretable reasoning steps. This makes it challe...