# MME Benchmark
[MME](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/tree/Evaluation) is a comprehensive evaluation benchmark for multimodal large language models. It measures both perception and cognition abilities on a total of 14 subtasks, including existence, count, position, color, poster, celebrity, scene, landmark, artwork, OCR, commonsense reasoning, numerical calculation, text translation, and code reasoning.
Qwen-VL-Chat achieves state-of-the-art results on both the perception and cognition evaluations.
### Perception Evaluation
| Rank | Model | Version | Score |
|:----:|:---------------:|:------------------------:|:-------:|
| 1 | **[Qwen-VL-Chat](https://github.com/QwenLM/Qwen-VL/)**| **[Qwen-7B](https://github.com/QwenLM/Qwen-7B)** | **1487.57** |
| 2 | Skywork-MM | Skywork-MM-13B | 1419.08 |
| 3 | MMICL | FlanT5xxl | 1376.00 |
| 4 | Lynx | vicuna-7b | 1373.23 |
| 5 | BLIVA | FlanT5xxl | 1337.73 |
### Cognition Evaluation
| Rank | Model | Version | Score |
|:----:|:----------------:|:--------------:|:----------:|
| 1 | **[Qwen-VL-Chat](https://github.com/QwenLM/Qwen-VL/)** | **[Qwen-7B](https://github.com/QwenLM/Qwen-7B)** | **360.71** |
| 2 | MMICL | FlanT5xxl | 360.36 |
| 3 | Skywork-MM | Skywork-MM-13B | 356.43 |
| 4 | BLIVA | FlanT5xxl | 331.43 |
| 5 | LRV-Instruction | LRV-7B | 328.21 |
### Full Metrics
```
=========== Perception ===========
total score: 1487.576330532213
existence score: 158.33333333333331
count score: 150.0
position score: 128.33333333333334
color score: 170.0
posters score: 178.57142857142856
celebrity score: 120.58823529411764
scene score: 152.25
landmark score: 164.0
artwork score: 125.5
OCR score: 140.0
=========== Cognition ===========
total score: 360.71428571428567
commonsense_reasoning score: 130.7142857142857
numerical_calculation score: 40.0
text_translation score: 147.5
code_reasoning score: 42.5
```
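The perception and cognition totals are simply the sums of the per-subtask scores (each subtask is worth up to 200 points), which can be checked directly against the numbers above:

```python
# Sanity check: the "total score" lines above are the sums of the subtask scores.
perception_scores = {
    "existence": 158.33, "count": 150.00, "position": 128.33, "color": 170.00,
    "posters": 178.57, "celebrity": 120.59, "scene": 152.25, "landmark": 164.00,
    "artwork": 125.50, "OCR": 140.00,
}
cognition_scores = {
    "commonsense_reasoning": 130.71, "numerical_calculation": 40.00,
    "text_translation": 147.50, "code_reasoning": 42.50,
}
print(f"Perception total: {sum(perception_scores.values()):.2f}")  # ~1487.57
print(f"Cognition total:  {sum(cognition_scores.values()):.2f}")   # ~360.71
```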
## How To Reproduce Results of MME Benchmark
1. Download MME images and eval_tool from the [MME repo](https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models/blob/Evaluation/README.md)
2. Rearrange images by executing `python get_images.py`
3. Evaluate Qwen-VL-Chat results by executing `python eval.py` (a minimal sketch of this step is shown after this list)
4. Calculate MME results by executing `python calculation.py --results_dir Qwen-VL-Chat`; the calculation script comes from the MME eval_tool.
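
For step 3, the sketch below illustrates what a per-subtask evaluation loop can look like. It is not the actual `eval.py` from this repo: the directory names (`eval_tool/Your_Results`, `images`, `Qwen-VL-Chat`) and the tab-separated line format are assumptions based on the MME eval_tool conventions, and the model is driven through Qwen-VL-Chat's public `transformers` chat interface.

```python
# Hypothetical sketch of an MME evaluation loop (assumed layout, not the official eval.py).
import os
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "Qwen/Qwen-VL-Chat"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    checkpoint, device_map="cuda", trust_remote_code=True
).eval()

eval_tool_dir = "eval_tool/Your_Results"  # per-subtask .txt files from the MME repo (assumed path)
image_dir = "images"                      # images rearranged by get_images.py (assumed layout)
output_dir = "Qwen-VL-Chat"               # later passed to calculation.py --results_dir
os.makedirs(output_dir, exist_ok=True)

for filename in sorted(os.listdir(eval_tool_dir)):
    subtask = filename[: -len(".txt")]
    with open(os.path.join(eval_tool_dir, filename)) as fin, \
         open(os.path.join(output_dir, filename), "w") as fout:
        for line in fin:
            # Each line is assumed to hold "image_name<TAB>question<TAB>ground_truth".
            image_name, question, gt = line.rstrip("\n").split("\t")[:3]
            query = tokenizer.from_list_format([
                {"image": os.path.join(image_dir, subtask, image_name)},
                {"text": question},
            ])
            answer, _ = model.chat(tokenizer, query=query, history=None)
            answer = answer.replace("\n", " ").replace("\t", " ")
            # calculation.py expects the model answer appended as a fourth tab-separated field.
            fout.write(f"{image_name}\t{question}\t{gt}\t{answer}\n")
```

Once the answer files are written under `Qwen-VL-Chat/`, step 4 aggregates them into the perception and cognition scores reported above.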