ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark

Alibaba DAMO Academy, Zhejiang University, Tongji University
Introduction

ECBench comprises 386 RGB-D videos, 4,324 QA pairs, and 30 distinct embodied cognitive abilities, spanning aspects such as perception, reasoning, self-awareness, dynamic capturing, and hallucination. ECEval applies distinct evaluation methods to different types of answers.
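
To make the data format concrete, here is a minimal sketch of what one QA record might look like. The field names and example values (video_id, subset, dimension, answer_type, and so on) are illustrative assumptions, not the released ECBench schema.

```python
# Minimal sketch of a hypothetical ECBench-style QA record.
# Field names are illustrative assumptions, not the released schema.
from dataclasses import dataclass


@dataclass
class ECBenchItem:
    video_id: str      # one of the 386 RGB-D egocentric videos
    question: str      # open-form question text
    answer: str        # reference answer
    subset: str        # "static", "dynamic", or "hallucination"
    dimension: str     # one of the 30 embodied cognitive abilities
    answer_type: str   # selects which ECEval scoring method is applied


# Made-up example values for illustration only:
item = ECBenchItem(
    video_id="scene_0001",
    question="How far is the whiteboard in front of you?",
    answer="About two meters away.",
    subset="static",
    dimension="robot-centric spatial perception",
    answer_type="open-ended",
)
```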

Abstract

Large vision-language models (LVLMs) are increasingly effective at improving the generalization of robots, which makes the embodied cognitive abilities of LVLMs on egocentric videos a topic of great interest. However, current datasets for embodied video question answering lack comprehensive and systematic evaluation frameworks, and critical embodied cognitive issues, such as robotic self-cognition, dynamic scene perception, and hallucination, are rarely addressed.

To tackle these challenges, we propose ECBench, a high-quality benchmark designed to systematically evaluate the embodied cognitive abilities of LVLMs. ECBench features diverse scene video sources, open and varied question formats, and 30 dimensions of embodied cognition. To ensure quality, balance, and high visual dependence, ECBench relies on meticulous, class-independent human annotation and multi-round question screening. In addition, we introduce ECEval, a comprehensive evaluation system that ensures fair and reasonable metrics.

Utilizing ECBench, we conduct extensive evaluations of proprietary, open-source, and task-specific LVLMs. ECBench is pivotal in advancing the embodied cognitive capabilities of LVLMs, laying a solid foundation for developing reliable core models for embodied agents.
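
Because ECEval applies distinct evaluation methods to different answer types, one way to picture it is as a scorer dispatched on the answer type. The sketch below is an assumption-laden illustration: the scorer functions, the closed-form/open-ended split, and the 0-100 scale are placeholders, not the published ECEval procedure.

```python
# Hedged sketch of type-dependent scoring in the spirit of ECEval.
# The scorers, answer-type names, and 0-100 scale are assumptions.
from typing import Callable, Dict


def exact_match_score(pred: str, ref: str) -> float:
    """Binary credit for short, closed-form answers (e.g. counts, yes/no)."""
    return 100.0 if pred.strip().lower() == ref.strip().lower() else 0.0


def soft_match_score(pred: str, ref: str) -> float:
    """Crude token-overlap stand-in for an LLM-as-judge rating of open answers."""
    p, r = set(pred.lower().split()), set(ref.lower().split())
    return 100.0 * len(p & r) / max(len(p | r), 1)


SCORERS: Dict[str, Callable[[str, str], float]] = {
    "closed-form": exact_match_score,
    "open-ended": soft_match_score,
}


def score_answer(pred: str, ref: str, answer_type: str) -> float:
    """Dispatch to the scorer registered for this answer type."""
    return SCORERS[answer_type](pred, ref)


print(score_answer("3", "3", "closed-form"))  # -> 100.0
```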

Comparison of ECBench and Other Embodied / General VideoQA Benchmarks

The current evaluation of LVLMs in embodied scenarios has the following limitations:

1. Not Systematic:

Current benchmarks focus on isolated embodied abilities, such as object recognition and counting. They lack a comprehensive top-down analysis of embodied cognition requirements, which limits both the hierarchy and the dimensions of evaluation.

2. Lack of a Robot-Centric Perspective:

Robots often need to address questions related to their own embodiment, such as the distance to a target or their historical trajectory. However, benchmarks like OpenEQA focus solely on third-person scene questions, largely overlooking the evaluation of a robot's self-awareness.

3. Lack of Dynamics:

In the real physical world, scene dynamics are perpetually ongoing. For complex tasks like "Revert the screen content to before you faced the whiteboard," a robot must recognize these dynamics and accurately recall their timing and process. However, current embodied question answering benchmarks typically overlook these dynamic aspects, defaulting to static-scene assumptions.

4. Hallucination Issues:

Although hallucination has been extensively analyzed in LVLMs, embodied question answering presents unique hallucination challenges. For instance, LVLMs such as GPT-4o often rely too heavily on common sense when answering questions about counterintuitive scenes, resulting in incorrect answers. These embodied hallucination issues remain unexplored in the academic literature.

Comparison of ECBench with widely adopted Embodied / General VideoQA benchmarks. ECBench has significant advantages in terms of quality, diversity, and evaluation dimensions.

Data analysis of ECBench reflects a rich diversity of scenario categories, video sources, and evaluation dimensions.


Capability Taxonomy of ECBench

Overview of embodied cognition dimensions in ECBench. ECBench includes three subsets: static scenes, dynamic scenes, and hallucination, evaluating a total of 30 embodied cognitive abilities.
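
For readers who want to group results programmatically, the subset and dimension-group structure used in the leaderboard legend below can be written as a small mapping. This is a convenience sketch only; the 30 fine-grained abilities nest under these groups and are listed in the paper.

```python
# Subset -> dimension-group names, following the leaderboard legend below.
# The 30 fine-grained abilities nest under these groups (see the paper).
ECBENCH_TAXONOMY = {
    "static_scene": {
        "SB": "Scene-Based",
        "RC": "Robot-Centric",
    },
    "dynamic_scene": {
        "ID": "Information Dynamics",
        "QD": "Quantity Dynamics",
        "SPD": "Spatial Dynamics",
        "STD": "State Dynamics",
    },
    "hallucination": {
        "UI": "Over-Confidence in User Input",
        "CS": "Over-Confidence in Common Sense",
    },
}
```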


ECBench Leaderboard

Static Scene: SB (Scene-Based), RC (Robot-Centric)

Dynamic Scene: ID (Information Dynamics), QD (Quantity Dynamics), SPD (Spatial Dynamics), STD (State Dynamics)

Hallucination: UI (Over-Confidence in User Input), CS (Over-Confidence in Common Sense)


| Model | SB | RC | Static Mean | ID | QD | SPD | STD | Dynamic Mean | UI | CS | Hallucination Mean | Overall Mean | Model Type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-4o-[32f] 🥇 | 59.74 | 49.04 | 55.06 | 25.71 | 14.29 | 22.73 | 24.04 | 22.74 | 7.69 | 57.01 | 31.80 | 50.35 | Proprietary Image-LVLMs |
| Qwen2VL-72B-[20f] 🥈 | 52.40 | 43.95 | 49.19 | 37.14 | 11.43 | 18.18 | 24.05 | 24.03 | 5.86 | 42.38 | 23.71 | 44.62 | Open-Source Image-LVLMs |
| GPT-4o-mini-[32f] 🥉 | 51.20 | 43.12 | 48.13 | 26.35 | 20.00 | 15.15 | 18.10 | 19.68 | 9.89 | 41.38 | 25.28 | 43.69 | Proprietary Image-LVLMs |
| LongVA-7B-[384f] 🥉 | 49.03 | 38.56 | 45.05 | 28.57 | 8.57 | 13.94 | 16.67 | 17.82 | 8.06 | 33.49 | 20.49 | 40.47 | Native Video-LVLMs |
| Qwen2VL-7B-[20f] | 46.21 | 39.33 | 43.60 | 32.70 | 8.57 | 19.70 | 18.33 | 20.97 | 4.40 | 39.08 | 21.35 | 39.57 | Open-Source Image-LVLMs |
| GPT-4v-[8f] | 45.09 | 40.16 | 43.22 | 29.52 | 8.57 | 22.12 | 20.24 | 21.45 | 9.16 | 32.11 | 20.37 | 39.16 | Proprietary Image-LVLMs |
| Kangaroo-8B-[64f] | 40.79 | 40.01 | 40.49 | 19.20 | 17.14 | 16.97 | 15.95 | 17.42 | 0.37 | 33.49 | 16.55 | 36.23 | Native Video-LVLMs |
| InternVL2-40B-[20f] | 41.27 | 38.09 | 40.06 | 33.33 | 0.00 | 12.12 | 21.66 | 19.03 | 3.66 | 29.58 | 16.33 | 35.94 | Open-Source Image-LVLMs |
| Video-LLaVA-7B-[8f] | 41.21 | 37.25 | 39.71 | 18.41 | 14.29 | 19.20 | 16.67 | 17.66 | 1.10 | 26.97 | 13.75 | 35.25 | Native Video-LVLMs |
| AlanaVLM-7B-[64f] | 40.38 | 36.61 | 38.95 | 19.05 | 5.71 | 19.70 | 14.76 | 15.89 | 1.83 | 29.96 | 15.58 | 34.75 | Embodied / Egocentric LVLMs |
| Video-LLaMA2-7B-[16f] | 41.23 | 33.52 | 38.30 | 16.51 | 5.71 | 16.97 | 11.19 | 13.31 | 4.03 | 24.29 | 13.93 | 33.87 | Native Video-LVLMs |
| Idefics3-8B-[20f] | 36.56 | 38.09 | 37.14 | 18.10 | 8.57 | 10.61 | 19.29 | 15.16 | 4.76 | 28.89 | 16.55 | 33.35 | Open-Source Image-LVLMs |
| GeLM-7B-[180f] | 25.72 | 23.70 | 24.95 | 5.08 | 5.71 | 8.18 | 3.51 | 5.48 | 0.37 | 12.49 | 6.29 | 21.54 | Embodied / Egocentric LVLMs |
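
The subset means and the overall mean in the table are presumably weighted by the number of questions in each dimension; a plain average of the three subset means would not reproduce the reported overall scores. Below is a sketch of such a question-count-weighted roll-up. The counts are fabricated placeholders, so the printed value will not match any number in the leaderboard.

```python
# Question-count-weighted roll-up of per-dimension scores into an overall
# mean. The counts below are fabricated placeholders, so the printed value
# will not reproduce any number in the leaderboard above.
from typing import Dict


def weighted_mean(scores: Dict[str, float], counts: Dict[str, int]) -> float:
    total = sum(counts[d] for d in scores)
    return sum(scores[d] * counts[d] for d in scores) / total


# GPT-4o per-dimension scores from the table, paired with made-up counts.
gpt4o_scores = {"SB": 59.74, "RC": 49.04, "ID": 25.71, "QD": 14.29,
                "SPD": 22.73, "STD": 24.04, "UI": 7.69, "CS": 57.01}
fake_counts = {"SB": 1500, "RC": 900, "ID": 100, "QD": 35,
               "SPD": 165, "STD": 420, "UI": 270, "CS": 435}

print(round(weighted_mean(gpt4o_scores, fake_counts), 2))
```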

BibTeX

@article{ECBench,
  title={ECBench: Can Multi-modal Foundation Models Understand the Egocentric World? A Holistic Embodied Cognition Benchmark},
  author={Dang, Ronghao and Yuan, Yuqian and Zhang, Wenqi and Xin, Yifei and Zhang, Boqiang and Li, Long and Wang, Liuyi and Zeng, Qinyang and Li, Xin and Bing, Lidong},
  journal={arXiv preprint arXiv:2501.05031},
  year={2025}
}