# MME Benchmark
[MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks, including existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.
Qwen-VL-Chat achieves state-of-the-art results on both the perception and cognition evaluations.
### Perception Evaluation
| Rank | Model | Version | Score |
|:----:|:---------------:|:------------------------:|:-------:|
| 1 | **[Qwen-VL-Chat](https://github.com/QwenLM/Qwen-VL/)**| **[Qwen-7B](https://github.com/QwenLM/Qwen-7B)** | **1487.57** |
| 2 | Skywork-MM | Skywork-MM-13B | 1419.08 |
| 3 | MMICL | FlanT5xxl | 1376.00 |
| 4 | Lynx | vicuna-7b | 1373.23 |
| 5 | BLIVA | FlanT5xxl | 1337.73 |
### Cognition Evaluation
| Rank | Model | Version | Score |
|:----:|:----------------:|:--------------:|:----------:|
| 1 | **[Qwen-VL-Chat](https://github.com/QwenLM/Qwen-VL/)** | **[Qwen-7B](https://github.com/QwenLM/Qwen-7B)** | **360.71** |
| 2 | MMICL | FlanT5xxl | 360.36 |
| 3 | Skywork-MM | Skywork-MM-13B | 356.43 |
| 4 | BLIVA | FlanT5xxl | 331.43 |
| 5 | LRV-Instruction | LRV-7B | 328.21 |
### Full Metrics
```
=========== Perception ===========
total score: 1487.576330532213
existence score: 158.33333333333331
count score: 150.0
position score: 128.33333333333334
color score: 170.0
posters score: 178.57142857142856
celebrity score: 120.58823529411764
scene score: 152.25
landmark score: 164.0
artwork score: 125.5
OCR score: 140.0
=========== Cognition ===========
total score: 360.71428571428567
commonsense_reasoning score: 130.7142857142857
numerical_calculation score: 40.0
text_translation score: 147.5
code_reasoning score: 42.5
```
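The perception and cognition totals are simply the sums of the per-subtask scores (each subtask is worth up to 200 points), which can be checked directly against the numbers above:

```python
# Sanity check: the "total score" lines above are the sums of the subtask scores.
perception_scores = {
    "existence": 158.33, "count": 150.00, "position": 128.33, "color": 170.00,
    "posters": 178.57, "celebrity": 120.59, "scene": 152.25, "landmark": 164.00,
    "artwork": 125.50, "OCR": 140.00,
}
cognition_scores = {
    "commonsense_reasoning": 130.71, "numerical_calculation": 40.00,
    "text_translation": 147.50, "code_reasoning": 42.50,
}
print(f"Perception total: {sum(perception_scores.values()):.2f}")  # ~1487.57
print(f"Cognition total:  {sum(cognition_scores.values()):.2f}")   # ~360.71
```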
## How To Reproduce Results of MME Benchmark
1. Download MME images and eval_tool from the [MME repo](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/blob/Evaluation/README.md)
2. Rearrange images by executing `python get_images.py`
3. Evaluate Qwen-VL-Chat results by executing `python eval.py` (a minimal sketch of this step is shown after this list)
4. Calculate MME results by executing `python calculation.py --results_dir Qwen-VL-Chat`; the calculation script comes from the MME eval_tool.
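
For step 3, the sketch below illustrates what a per-subtask evaluation loop can look like. It is not the actual `eval.py` from this repo: the directory names (`eval_tool/Your_Results`, `images`, `Qwen-VL-Chat`) and the tab-separated line format are assumptions based on the MME eval_tool conventions, and the model is driven through Qwen-VL-Chat's public `transformers` chat interface.

```python
# Hypothetical sketch of an MME evaluation loop (assumed layout, not the official eval.py).
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "Qwen/Qwen-VL-Chat"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="cuda", trust_remote_code=True
).eval()

eval_tool_dir = "eval_tool/Your_Results"  # per-subtask .txt files from the MME repo (assumed path)
image_dir = "images"                      # images rearranged by get_images.py (assumed layout)
output_dir = "Qwen-VL-Chat"               # later passed to calculation.py --results_dir
os.makedirs(output_dir, exist_ok=True)

for filename in sorted(os.listdir(eval_tool_dir)):
    subtask = filename[: -len(".txt")]
    with open(os.path.join(eval_tool_dir, filename)) as fin, \
         open(os.path.join(output_dir, filename), "w") as fout:
        for line in fin:
            # Each line is assumed to hold "image_name<TAB>question<TAB>ground_truth".
            image_name, question, gt = line.rstrip("\n").split("\t")[:3]
            query = tokenizer.from_list_format([
                {"image": os.path.join(image_dir, subtask, image_name)},
                {"text": question},
            ])
            answer, _ = model.chat(tokenizer, query=query, history=None)
            answer = answer.replace("\n", " ").replace("\t", " ")
            # calculation.py expects the model answer appended as a fourth tab-separated field.
            fout.write(f"{image_name}\t{question}\t{gt}\t{answer}\n")
```

Once the answer files are written under `Qwen-VL-Chat/`, step 4 aggregates them into the perception and cognition scores reported above.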