MMIE

Massive Multimodal Interleaved Comprehension Benchmark For Large Vision-Language Models

ICLR 2025 Oral

Peng Xia^*, Siwei Han^*, Shi Qiu^*, Yiyang Zhou, Zhaoyang Wang, Wenhao Zheng, Zhaorun Chen, Chenhang Cui,
Mingyu Ding, Linjie Li, Lijuan Wang, Huaxiu Yao

▶ UNC-Chapel Hill ▶ University of Chicago ▶ Microsoft Research ▶ NUS

^*Equal Contribution


Abstract

We present MMIE, a Massive Multimodal Interleaved understanding Evaluation benchmark, designed for Large Vision-Language Models (LVLMs). MMIE offers a robust framework for evaluating the interleaved comprehension and generation capabilities of LVLMs across diverse fields, supported by reliable automated metrics.

🌟 Key Features

🗂 Dataset

Comprehensive: 20K+ examples in interleaved multimodal format, consolidated into one JSON file for easy access.
Diverse: Spanning 12 fields and 102 subfields, offering broad and deep evaluation across domains.
Ground Truth Reference: Each question comes paired with a reference, ensuring accurate evaluations of model performance.
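Because the examples sit in a single JSON file, a first sanity check can be a simple tally per category. The sketch below is illustrative only: the inline records and the keys `id`, `category`, `field` and `question` are hypothetical stand-ins, not the documented MMIE schema, so adjust them to whatever the released file actually contains.

```python
import json
from collections import Counter

# Hypothetical records standing in for the real file. The key names
# ("id", "category", "field", "question") are assumptions for illustration.
SAMPLE_JSON = """
[
  {"id": 1, "category": "Situational analysis", "field": "art", "question": "placeholder"},
  {"id": 2, "category": "Project-based learning", "field": "coding", "question": "placeholder"},
  {"id": 3, "category": "Project-based learning", "field": "health", "question": "placeholder"},
  {"id": 4, "category": "Multi-step reasoning", "field": "mathematics", "question": "placeholder"}
]
"""


def count_by(records, key):
    """Count how many records share each value of `key`."""
    return Counter(record[key] for record in records)


records = json.loads(SAMPLE_JSON)
category_counts = count_by(records, "category")
field_counts = count_by(records, "field")

for name, count in category_counts.most_common():
    print(f"{name}: {count}")
```

With the real file, `json.loads(SAMPLE_JSON)` would be replaced by `json.load` on an open file handle; the tally itself stays the same.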

⚙️ Metric

Automated Scoring: Evaluate your model’s results with our scoring model, MMIE-Score, powered by InternVL-2-4B.
Bias Mitigation: Fine-tuned to reduce bias and ensure objective evaluations.
Multimodal Capability: Tailored for interleaved inputs and outputs, evaluating both text and image comprehension.
High Correlation with Human Scores: Agrees with human scores more closely than alternative metrics such as GPT-4o in multimodal tasks, supporting reliable benchmarking results.

MMIE is curated from four multimodal datasets, encompassing:

3 categories: Situational analysis, project-based learning, and multi-step reasoning.
12 fields: Mathematics, physics, coding, statistics, literature, philosophy, education, finance, health, sports, art, and Electrical Engineering and Computer Science (EECS).
102 subfields: Offering in-depth coverage across multiple domains.

The dataset contains 20,103 multimodal questions that support both interleaved inputs and outputs. It includes a mix of multiple-choice and open-ended questions, evaluating a wide range of competencies and reasoning skills. Each query is paired with a ground truth reference, enabling effective evaluation.

In addition, we propose an automated evaluation metric powered by a scoring model, which is available for use at MMIE-Score. This evaluation tool provides a streamlined way to assess your model's performance using the benchmark dataset.

Statistic	Number	Percentage
Questions	20103	-
- Situational analysis	5005	24.89%
- Project-based learning	11482	57.12%
- Multi-step reasoning	3616	17.99%
Total Categories/Fields/Subfields	3/12/102	-
Formats
- Multiple-Choice Questions	663	3.40%
- Open-Ended Questions	19340	96.60%
Questions with Images	20103	100%
Questions with answer label	20103	100%
Average question length	76.0	-
Average images per question	1.32	-

🔧 Benchmark Details

Distribution of categories and fields in MMIE.

🗂 Dataset

MMIE evaluates LVLMs across interleaved multimodal comprehension and generation tasks. The dataset is carefully curated to ensure a wide range of examples across various fields, providing balanced coverage for comprehensive evaluations. These examples test reasoning, cognitive tasks, and multimodal alignment, ensuring detailed insights into model performance.

⚙️ Metric

The MMIE evaluation metric is built on InternVL-2-4B, a high-performing vision-language model fine-tuned for multimodal reasoning. The pipeline scores model outputs on criteria including:

Text Quality: Clarity, coherence, and grammar.
Image Quality: Vividness and accuracy of image descriptions.
Text-Image Coherence: How well visual descriptions support the narrative.
Stylistic Consistency: Consistent style and structure throughout text and images.

For detailed evaluation criteria, please refer to Appendix A.9 in our paper.

Note: Higher values indicate better performance for Pearson and Cosine Similarity, while lower values are better for MSE and MAE.

The MMIE evaluation metric achieves high correlations with human annotations in all aspects of multimodal comprehension and generation. It consistently outperforms other metrics, like GPT-4o, making it ideal for large-scale model benchmarking and comparison.
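The four agreement measures named in the note (Pearson, cosine similarity, MSE, MAE) can be computed between a metric's scores and human scores as in the following minimal sketch. The two score lists are made-up illustrative values, not MMIE results, and the function names are our own.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient (higher is better)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def cosine_similarity(x, y):
    """Cosine of the angle between the two score vectors (higher is better)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def mse(x, y):
    """Mean squared error (lower is better)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def mae(x, y):
    """Mean absolute error (lower is better)."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Illustrative values only: human ratings and an automated metric's scores
# for the same five responses.
human_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
metric_scores = [1.5, 2.0, 2.5, 4.5, 5.0]

print(f"Pearson: {pearson(human_scores, metric_scores):.4f}")
print(f"Cosine:  {cosine_similarity(human_scores, metric_scores):.4f}")
print(f"MSE:     {mse(human_scores, metric_scores):.4f}")
print(f"MAE:     {mae(human_scores, metric_scores):.4f}")
```

Pearson and cosine similarity reward scores that move together with the human ratings, while MSE and MAE penalize the size of the disagreement, which is why the two pairs point in opposite directions.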

🏆 Leaderboard

MMIE provides a systematic evaluation of two kinds of systems: existing open-source LVLMs that support interleaved multimodal input and output (interleaved LVLMs), and state-of-the-art LVLMs paired with text-to-image generative models (integrated LVLMs). For detailed results, please see the paper. The leaderboard is also available on Hugging Face.

Scores on MMIE benchmark.

Model	Model Type	Situational analysis	Project-based learning	Multi-step reasoning	AVG
MiniGPT-5	Interleaved LVLM	47.63	55.12	42.17	50.92
EMU-2	Interleaved LVLM	39.65	46.12	50.75	45.33
GILL	Interleaved LVLM	46.72	57.57	39.33	51.58
Anole	Interleaved LVLM	48.95	59.05	51.72	55.22
GPT-4o | Openjourney	Integrated LVLM	53.05	71.40	53.67	63.65
GPT-4o | SD-3	Integrated LVLM	53.00	71.20	53.67	63.52
GPT-4o | SD-XL	Integrated LVLM	56.12	73.25	53.67	65.47
GPT-4o | Flux	Integrated LVLM	54.97	68.80	53.67	62.63
Gemini-1.5 | Openjourney	Integrated LVLM	48.08	67.93	60.05	61.57
Gemini-1.5 | SD-3	Integrated LVLM	47.48	68.70	60.05	61.87
Gemini-1.5 | SD-XL	Integrated LVLM	49.43	71.85	60.05	64.15
Gemini-1.5 | Flux	Integrated LVLM	47.07	68.33	60.05	61.55
LLAVA-34b | Openjourney	Integrated LVLM	54.12	73.47	47.28	63.93
LLAVA-34b | SD-3	Integrated LVLM	54.72	72.55	47.28	63.57
LLAVA-34b | SD-XL	Integrated LVLM	55.97	74.60	47.28	65.05
LLAVA-34b | Flux	Integrated LVLM	54.23	71.32	47.28	62.73
Qwen-VL-70b | Openjourney	Integrated LVLM	52.73	71.63	55.63	64.05
Qwen-VL-70b | SD-3	Integrated LVLM	54.98	71.87	55.63	64.75
Qwen-VL-70b | SD-XL	Integrated LVLM	52.58	73.57	55.63	65.12
Qwen-VL-70b | Flux	Integrated LVLM	54.23	69.47	55.63	63.18
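
The AVG column is not a plain mean of the three category scores. It appears consistent, to within about 0.02, with a mean weighted by the number of questions in each category (5005, 11482 and 3616 in the statistics table). This is an inference from the numbers on this page rather than a formula stated here, and the small residual is presumably due to the category scores being rounded. A minimal sketch of that reading:

```python
# Question counts per category, taken from the dataset statistics table.
CATEGORY_COUNTS = {
    "Situational analysis": 5005,
    "Project-based learning": 11482,
    "Multi-step reasoning": 3616,
}


def weighted_average(scores, counts=CATEGORY_COUNTS):
    """Average the category scores, weighting each by its question count."""
    total = sum(counts.values())
    return sum(scores[name] * counts[name] for name in counts) / total


# Category scores copied from the leaderboard rows for two models.
minigpt5 = {
    "Situational analysis": 47.63,
    "Project-based learning": 55.12,
    "Multi-step reasoning": 42.17,
}
anole = {
    "Situational analysis": 48.95,
    "Project-based learning": 59.05,
    "Multi-step reasoning": 51.72,
}

# Compare against the reported AVG values (50.92 and 55.22).
for name, scores, reported in [("MiniGPT-5", minigpt5, 50.92), ("Anole", anole, 55.22)]:
    computed = weighted_average(scores)
    print(f"{name}: computed {computed:.3f}, reported {reported:.2f}")
```

Under this reading, project-based learning carries more than half of the weight, so a model's AVG tracks that column most closely.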

BibTeX


@article{xia2024mmie,
  title={MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models},
  author={Xia, Peng and Han, Siwei and Qiu, Shi and Zhou, Yiyang and Wang, Zhaoyang and Zheng, Wenhao and Chen, Zhaorun and Cui, Chenhang and Ding, Mingyu and Li, Linjie and Wang, Lijuan and Yao, Huaxiu},
  journal={arXiv preprint arXiv:2410.10139},
  year={2024}
}

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

We would like to express our sincere gratitude to the teams behind InternVL, MiniGPT, EMU, GILL, Anole, LLaVA, Qwen2-VL, Openjourney, Stable Diffusion and Flux for providing open-source models.
