Static Scene: SB (Scene-Based), RC (Robot-Centric)
Dynamic Scene: ID (Information Dynamics), QD (Quantity Dynamics), SPD (Spatial Dynamics), STD (State Dynamics)
Hallucination: UI (Over-Confidence in User Input), CS (Over-Confidence in Common Sense)
The best results are highlighted in bold and underlined.
Model | SB | RC | Static Mean | ID | QD | SPD | STD | Dynamic Mean | UI | CS | Hallucination Mean | Overall Mean | Model Type |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4o-[32f] 🥇 | 59.74 | 49.04 | 55.06 | 25.71 | 14.29 | 22.73 | 24.04 | 22.74 | 7.69 | 57.01 | 31.80 | 50.35 | Proprietary Image-LVLMs |
Qwen2VL-72B-[20f] 🥈 | 52.40 | 43.95 | 49.19 | 37.14 | 11.43 | 18.18 | 24.05 | 24.03 | 5.86 | 42.38 | 23.71 | 44.62 | Open-Source Image-LVLMs |
GPT-4o-mini-[32f] 🥉 | 51.20 | 43.12 | 48.13 | 26.35 | 20.00 | 15.15 | 18.10 | 19.68 | 9.89 | 41.38 | 25.28 | 43.69 | Proprietary Image-LVLMs |
LongVA-7B-[384f] 🥉 | 49.03 | 38.56 | 45.05 | 28.57 | 8.57 | 13.94 | 16.67 | 17.82 | 8.06 | 33.49 | 20.49 | 40.47 | Native Video-LVLMs |
Qwen2VL-7B-[20f] | 46.21 | 39.33 | 43.60 | 32.70 | 8.57 | 19.70 | 18.33 | 20.97 | 4.40 | 39.08 | 21.35 | 39.57 | Open-Source Image-LVLMs |
GPT-4v-[8f] | 45.09 | 40.16 | 43.22 | 29.52 | 8.57 | 22.12 | 20.24 | 21.45 | 9.16 | 32.11 | 20.37 | 39.16 | Proprietary Image-LVLMs |
Kangaroo-8B-[64f] | 40.79 | 40.01 | 40.49 | 19.20 | 17.14 | 16.97 | 15.95 | 17.42 | 0.37 | 33.49 | 16.55 | 36.23 | Native Video-LVLMs |
InternVL2-40B-[20f] | 41.27 | 38.09 | 40.06 | 33.33 | 0.00 | 12.12 | 21.66 | 19.03 | 3.66 | 29.58 | 16.33 | 35.94 | Open-Source Image-LVLMs |
Video-LLaVA-7B-[8f] | 41.21 | 37.25 | 39.71 | 18.41 | 14.29 | 19.20 | 16.67 | 17.66 | 1.10 | 26.97 | 13.75 | 35.25 | Native Video-LVLMs |
AlanaVLM-7B-[64f] | 40.38 | 36.61 | 38.95 | 19.05 | 5.71 | 19.70 | 14.76 | 15.89 | 1.83 | 29.96 | 15.58 | 34.75 | Embodied / Egocentric LVLMs |
Video-LLaMA2-7B-[16f] | 41.23 | 33.52 | 38.30 | 16.51 | 5.71 | 16.97 | 11.19 | 13.31 | 4.03 | 24.29 | 13.93 | 33.87 | Native Video-LVLMs |
Idefics3-8B-[20f] | 36.56 | 38.09 | 37.14 | 18.10 | 8.57 | 10.61 | 19.29 | 15.16 | 4.76 | 28.89 | 16.55 | 33.35 | Open-Source Image-LVLMs |
GeLM-7B-[180f] | 25.72 | 23.70 | 24.95 | 5.08 | 5.71 | 8.18 | 3.51 | 5.48 | 0.37 | 12.49 | 6.29 | 21.54 | Embodied / Egocentric LVLMs |
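
For readers who want to work with these numbers programmatically, the following is a minimal sketch of parsing the pipe-delimited rows and picking the top model per model type by Overall Mean. The column keys (`static_mean`, `overall_mean`, and so on) and the helper names `parse_row`, `parse_table`, and `best_per_type` are illustrative assumptions, not part of the benchmark's tooling. Only five rows are copied from the table to keep the example short.

```python
# Hypothetical column keys for the 14 cells of each table row (assumed names).
COLUMNS = [
    "model", "SB", "RC", "static_mean",
    "ID", "QD", "SPD", "STD", "dynamic_mean",
    "UI", "CS", "hallucination_mean",
    "overall_mean", "model_type",
]

# Five rows copied verbatim from the table above.
ROWS = """
GPT-4o-[32f] 🥇 | 59.74 | 49.04 | 55.06 | 25.71 | 14.29 | 22.73 | 24.04 | 22.74 | 7.69 | 57.01 | 31.80 | 50.35 | Proprietary Image-LVLMs |
Qwen2VL-72B-[20f] 🥈 | 52.40 | 43.95 | 49.19 | 37.14 | 11.43 | 18.18 | 24.05 | 24.03 | 5.86 | 42.38 | 23.71 | 44.62 | Open-Source Image-LVLMs |
LongVA-7B-[384f] 🥉 | 49.03 | 38.56 | 45.05 | 28.57 | 8.57 | 13.94 | 16.67 | 17.82 | 8.06 | 33.49 | 20.49 | 40.47 | Native Video-LVLMs |
AlanaVLM-7B-[64f] | 40.38 | 36.61 | 38.95 | 19.05 | 5.71 | 19.70 | 14.76 | 15.89 | 1.83 | 29.96 | 15.58 | 34.75 | Embodied / Egocentric LVLMs |
GeLM-7B-[180f] | 25.72 | 23.70 | 24.95 | 5.08 | 5.71 | 8.18 | 3.51 | 5.48 | 0.37 | 12.49 | 6.29 | 21.54 | Embodied / Egocentric LVLMs |
"""


def parse_row(line):
    """Split one pipe-delimited row into a dict keyed by COLUMNS."""
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    if len(cells) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} cells, got {len(cells)}")
    row = dict(zip(COLUMNS, cells))
    # Keep only the model identifier; this drops a trailing medal emoji if present.
    row["model"] = row["model"].split()[0]
    # Every column between the model name and the model type is a score.
    for key in COLUMNS[1:-1]:
        row[key] = float(row[key])
    return row


def parse_table(text):
    """Parse all non-empty lines of a pipe-delimited table body."""
    return [parse_row(line) for line in text.strip().splitlines() if line.strip()]


def best_per_type(rows):
    """Return, for each model type, the row with the highest overall mean."""
    best = {}
    for row in rows:
        kind = row["model_type"]
        if kind not in best or row["overall_mean"] > best[kind]["overall_mean"]:
            best[kind] = row
    return best


if __name__ == "__main__":
    rows = parse_table(ROWS)
    for kind, row in best_per_type(rows).items():
        print(f"{kind}: {row['model']} ({row['overall_mean']:.2f})")
```

Parsing by position rather than by header text is a deliberate choice here: the rows carry exactly 14 cells each, so the sketch does not depend on how the multi-level header is rendered.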