[feature]add Chinese README (#8)

* [feature]add Chinese README
* Update Chinese README and add guide for custom evaluation tasks
  - Revised descriptions for `--exec` and `--model` parameters in README_ZH.md for clarity.
  - Improved formatting of command line arguments in README_ZH.md.
  - Introduced a new document (ADD_TASK_ZH.md) detailing the steps to add custom evaluation tasks, including configuration, evaluation logic, and data processing.
flageval-baai · Jan 23, 2025 · 563c356
1 parent fb554d4
commit 563c356

Showing 4 changed files with 310 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -127,3 +127,8 @@ ChangeLog
 
 # Internal
 internal
+
+# results
+results/
+
+.json
diff --git a/README.md b/README.md
@@ -6,6 +6,8 @@
 
 FlagEvalMM is an open-source evaluation framework designed to comprehensively assess multimodal models. It provides a standardized way to evaluate models that work with multiple modalities (text, images, video) across various tasks and metrics.
 
+For Chinese documentation, please refer to [README_ZH.md](README_ZH.md).
+
 ## Key Features
 
 - **Flexible Architecture**: Support for multiple multimodal models and evaluation tasks, including: VQA, image retrieval, text-to-image, etc.

diff --git a/README_ZH.md b/README_ZH.md
@@ -0,0 +1,197 @@
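+# FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation
+
+![FlagEvalMM Logo](assets/logo.png)
+
+## Overview
+
+FlagEvalMM is an open-source evaluation framework designed to comprehensively assess multimodal models. It provides a standardized way to evaluate models that work with multiple modalities (text, images, video) across various tasks and metrics.
+
+## Key Features
+
+- **Flexible Architecture**: Support for multiple multimodal models and evaluation tasks, including visual question answering (VQA), image retrieval, and text-to-image generation.
+- **Comprehensive Benchmarks and Metrics**: Support for both new and commonly used benchmarks and evaluation metrics.
+- **Extensive Model Support**: The model_zoo provides inference support for many popular multimodal models, including QwenVL and LLaVA. It also offers seamless integration with API-based models such as GPT, Claude, and Hunyuan.
+- **Extensible Design**: Easy to extend with new models, benchmarks, and evaluation metrics.
+
+## Installation
+
+### Basic Installation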
+
+```bash
+git clone https://github.com/flageval-baai/FlagEvalMM.git
+cd FlagEvalMM
+pip install -e .
+```
+
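+### Optional Dependencies
+
+FlagEvalMM supports multiple inference backends. Install the ones you need:
+
+#### VLLM Backend
+
+As of November 30, 2024, we recommend vllm==0.6.3.post1 with torch==2.4.0 for the best inference performance and stability.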
+
+```bash
+pip install vllm==0.6.3.post1
+```
+
+#### SGLang Backend
+
+```bash
+pip install --upgrade pip
+pip install "sglang[all]"
+pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
+```
+
+For detailed installation instructions, see the [official SGLang documentation](https://sgl-project.github.io/start/install.html).
+
+#### Transformers
+
+For the best performance, we recommend installing flash-attention:
+
+```bash
+pip install flash-attn --no-build-isolation
+```
+
+### About API Keys
+
+If you want to use GPT to evaluate certain tasks (such as charxiv and math_verse), set the following environment variables:
+
+```bash
+export FLAGEVAL_API_KEY=$YOUR_OPENAI_API_KEY
+export FLAGEVAL_BASE_URL="https://api.openai.com/v1"
+```
+
+## Usage
+
+FlagEvalMM supports one-command evaluation.
+
+Example for llava with vllm as the backend:
+
+```bash
+flagevalmm --tasks tasks/mmmu/mmmu_val.py \
+        --exec model_zoo/vlm/api_model/model_adapter.py \
+        --model llava-hf/llava-onevision-qwen2-7b-ov-chat-hf \
+        --num-workers 8 \
+        --output-dir ./results/llava-onevision-qwen2-7b-ov-chat-hf \
+        --backend vllm \
+        --extra-args "--limit-mm-per-prompt image=10 --max-model-len 32768"
+```
+
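+`--tasks` is the path to the task(s) to evaluate; multiple tasks are supported.
+
+`--exec` is the script that runs model inference.
+
+`--model` is the model path, either a model name on huggingface or a local path. Downloading the model from huggingface in advance is recommended.
+
+`--extra-args` holds the arguments passed to the vllm server.
+
+For large models served with vllm, such as Qwen2-VL-72B, multi-GPU inference can be enabled with the `--tensor-parallel-size` argument: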
+
+```bash
+flagevalmm --tasks tasks/mmmu_pro/mmmu_pro_standard_test.py tasks/ocrbench/ocrbench_test.py \
+        --exec model_zoo/vlm/api_model/model_adapter.py \
+        --model Qwen/Qwen2-VL-72B-Instruct \
+        --num-workers 8 \
+        --output-dir ./results/Qwen2-VL-72B-Instruct \
+        --backend vllm \
+        --extra-args "--limit-mm-per-prompt image=18 --tensor-parallel-size 4 --max-model-len 32768 --trust-remote-code \
+        --mm-processor-kwargs '{\"max_dynamic_patch\":4}'"
+```
+
+Because the arguments can become quite complex, we recommend using a JSON configuration file. For example:
+
+Create a configuration file named `qwen2_vl_72b_instruct.json`:
+
+```json
+{
+    "model_name": "Qwen/Qwen2-VL-72B-Instruct",
+    "api_key": "EMPTY",
+    "output_dir": "./results/Qwen2-VL-72B-Instruct",
+    "min_short_side": 28,
+    "num_workers": 8,
+    "backend": "vllm",
+    "extra_args": "--limit-mm-per-prompt image=18 --tensor-parallel-size 4 --max-model-len 32768 --trust-remote-code --mm-processor-kwargs '{\"max_dynamic_patch\":4}'"
+}
+```
+
+This simplifies the evaluation command to:
+
+```bash
+flagevalmm --tasks tasks/mmmu_pro/mmmu_pro_standard_test.py tasks/ocrbench/ocrbench_test.py \
+        --exec model_zoo/vlm/api_model/model_adapter.py \
+        --cfg qwen2_vl_72b_instruct.json
+```
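The `extra_args` value in the JSON configuration above nests a JSON object inside a shell-style argument string, which is itself a JSON string, so the quotes have to be escaped by hand. A minimal Python sketch that produces the same configuration and lets `json` and `shlex` do the escaping is shown below. The keys and values are copied from the example; the helper name `build_extra_args` is hypothetical and not part of FlagEvalMM.

```python
import json
import shlex


def build_extra_args(mm_processor_kwargs: dict) -> str:
    """Assemble the vllm server argument string from the example config."""
    parts = [
        "--limit-mm-per-prompt", "image=18",
        "--tensor-parallel-size", "4",
        "--max-model-len", "32768",
        "--trust-remote-code",
        "--mm-processor-kwargs",
        # shlex.quote wraps the compact JSON in single quotes so that it
        # stays a single token when the string is split like a shell would.
        shlex.quote(json.dumps(mm_processor_kwargs, separators=(",", ":"))),
    ]
    return " ".join(parts)


config = {
    "model_name": "Qwen/Qwen2-VL-72B-Instruct",
    "api_key": "EMPTY",
    "output_dir": "./results/Qwen2-VL-72B-Instruct",
    "min_short_side": 28,
    "num_workers": 8,
    "backend": "vllm",
    "extra_args": build_extra_args({"max_dynamic_patch": 4}),
}

# json.dumps escapes the inner double quotes; save this text as
# qwen2_vl_72b_instruct.json to use it with --cfg.
config_text = json.dumps(config, indent=4)
print(config_text)
```

Generating the file this way avoids hand-written `\"` sequences, which are easy to get wrong when the nested arguments change.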
+
+Example of evaluating a model without vllm (using transformers):
+
+```bash
+flagevalmm --tasks tasks/mmmu/mmmu_val.py \
+        --exec model_zoo/vlm/llama-vision/model_adapter.py \
+        --model meta-llama/Llama-3.2-11B-Vision-Instruct \
+        --output-dir ./results/Meta-Llama-3.2-11B-Vision-Instruct
+```
+
+Models that run directly on transformers do not need the `--backend` and `--extra-args` arguments. More model examples can be found in the `model_zoo/vlm/` directory.
+
+Example of evaluating a GPT-style model:
+
+```bash
+flagevalmm --tasks tasks/mmmu/mmmu_val.py \
+        --exec model_zoo/vlm/api_model/model_adapter.py \
+        --model gpt-4o-mini \
+        --num-workers 4 \
+        --url https://api.openai.com/v1/chat/completions \
+        --api-key $OPENAI_API_KEY \
+        --output-dir ./results/gpt-4o-mini \
+        --use-cache
+```
+
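+`--use-cache` is optional. It caches model outputs so that the same question under the same settings is answered from the cache.
+
+## Start the Data Server and Evaluate Separately
+
+The commands above run a one-command evaluation. You can also start the data server and run the evaluation separately. Taking the Qwen2-VL model as an example: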
+
+```bash
+# Start the data server
+python flagevalmm/server/run_server.py --tasks tasks/charxiv/charxiv_val.py --output-dir ./results/qwenvl2-7b --port 11823 
+```
+
+This starts the server on port 11823. The data server keeps running until you stop it.
+
+### Evaluate Separately
+
+```bash
+python flagevalmm/eval.py --output-dir ./results/qwenvl2-7b --tasks tasks/charxiv/charxiv_val.py --model your_model_path/Qwen2-VL-7B-Instruct/ --exec model_zoo/vlm/qwen_vl/model_adapter.py --server-port 11823
+```
+
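+This evaluates the model against the data server.
+If you have already generated results from the data server, you can evaluate the results directly: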
+
+```bash
+python flagevalmm/eval.py --output-dir ./results/qwenvl2-7b --exec model_zoo/vlm/qwen_vl/model_adapter.py --tasks tasks/charxiv/charxiv_val.py --without-infer
+```
+
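+## Add Your Own Task
+
+To add your own dataset, see [Add Your Own Task](tasks/ADD_TASK_ZH.md).
+
+## About Data
+
+By default, the task configuration files download datasets from HuggingFace. To use your own dataset, set `dataset_path` in the configuration file to your dataset path.
+
+FlagEvalMM preprocesses data from various sources and stores the processed data in `~/.cache/flagevalmm` by default. You can change the storage path by setting the `FLAGEVALMM_CACHE` environment variable.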
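The cache location described above follows the common "environment variable overrides a default" pattern. The sketch below illustrates that behavior only; how FlagEvalMM resolves the path internally is not shown in this document, and `resolve_cache_dir` is a hypothetical helper.

```python
import os
from pathlib import Path
from typing import Mapping, Optional

# Default location stated in the README.
DEFAULT_CACHE = Path.home() / ".cache" / "flagevalmm"


def resolve_cache_dir(env: Optional[Mapping[str, str]] = None) -> Path:
    """Return FLAGEVALMM_CACHE if it is set and non-empty, else the default."""
    env = os.environ if env is None else env
    override = env.get("FLAGEVALMM_CACHE")
    return Path(override).expanduser() if override else DEFAULT_CACHE


print(resolve_cache_dir())
```

Passing the environment as a parameter keeps the function easy to check without touching the real process environment.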
+
+## Citation
+
+```bibtex
+@misc{flagevalmm,
+    author = {Zheqi He and Yesheng Liu and Jingshu Zheng and Bowen Qin and Jinge Yao and Richen Xuan and Xi Yang},
+    title = {FlagEvalMM: A Flexible Framework for Comprehensive Multimodal Model Evaluation},
+    year = {2024},
+    publisher = {Zenodo},
+    version = {0.3.4},
+    url = {https://github.com/flageval-baai/FlagEvalMM}
+}
+```
diff --git a/tasks/ADD_TASK_ZH.md b/tasks/ADD_TASK_ZH.md
@@ -0,0 +1,106 @@
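+# Adding a Custom Evaluation Task
+
+This guide explains how to add a new evaluation task to the framework. There are three main steps:
+
+## 1. Create the Task Configuration
+
+Create a new folder under the `tasks` directory with a configuration file inside it. For simple evaluation tasks (such as VQA), you can use the `VqaBaseDataset` class directly.
+
+A basic configuration template: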
+
+```python
+# tasks/custom_task/custom_task.py
+
+config = dict(
+    dataset_path="<dataset path>",  # path to the raw dataset
+    split="image",               # dataset split
+    processed_dataset_path="CustomBench",  # where the processed dataset is saved
+    processor="process.py",      # data processing script
+)
+
+# Option 1: use the default prompt
+dataset = dict(
+    type="VqaBaseDataset",
+    config=config,
+    prompt_template=dict(type="PromptTemplate"),  # default prompt: 'Answer the question using a single word or phrase.'
+    name="custom_bench",
+)
+
+# Option 2: custom prompt (replaces Option 1; keep only one `dataset` definition)
+dataset = dict(
+    type="VqaBaseDataset",
+    config=config,
+    prompt_template=dict(
+        type="PromptTemplate",
+        pre_prompt="",   # prompt prefix
+        post_prompt=""   # prompt suffix
+    ),
+    name="custom_bench",
+)
+
+evaluator = dict(
+    type="BaseEvaluator", 
+    eval_func="evaluate.py"
+)
+```
+
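+## 2. Implement the Evaluation Logic
+
+Create an `evaluate.py` file and implement the evaluation function. The function must return a dictionary of evaluation metrics.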
+
+```python
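+# tasks/custom_task/evaluate.py
+
+from typing import Dict, List
+
+
+def get_result(annotations: Dict, predictions: List[Dict]) -> Dict:
+    """Evaluate the model predictions.
+
+    Args:
+        annotations: ground-truth annotations
+        predictions: model predictions
+
+    Returns:
+        Dict: dictionary of evaluation metrics
+    """
+    results = {}
+    # cal_accuracy is a placeholder for your own metric function
+    results["question_acc"] = cal_accuracy(annotations, predictions)
+    return results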
+```
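`cal_accuracy` is called in the template above but is not defined in this guide. One possible shape for it is sketched below as exact-match accuracy after light normalization. The data layout is an assumption made for illustration: `annotations` is taken to be a dict keyed by `question_id` whose values carry an `answer` field, and each prediction is taken to carry `question_id` and `answer`. Adapt the field names to your own dataset.

```python
from typing import Dict, List


def _normalize(text: str) -> str:
    """Lowercase, trim whitespace, and drop a trailing period."""
    return text.strip().lower().rstrip(".").strip()


def cal_accuracy(annotations: Dict, predictions: List[Dict]) -> float:
    """Exact-match accuracy in percent (hypothetical metric helper).

    Assumes annotations[question_id]["answer"] holds the ground truth and
    each prediction has "question_id" and "answer" keys.
    """
    if not predictions:
        return 0.0
    correct = 0
    for pred in predictions:
        gt = annotations[str(pred["question_id"])]
        if _normalize(str(pred["answer"])) == _normalize(str(gt["answer"])):
            correct += 1
    return round(100.0 * correct / len(predictions), 2)


annotations = {"chart_0": {"answer": "Cat"}, "chart_1": {"answer": "dog"}}
predictions = [
    {"question_id": "chart_0", "answer": " cat. "},
    {"question_id": "chart_1", "answer": "bird"},
]
print(cal_accuracy(annotations, predictions))
```

Exact match is only a reasonable default for short-answer VQA; open-ended tasks usually need a more tolerant metric.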
+
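+## 3. Implement the Data Processing Logic
+
+Create a `process.py` file and implement the data processing function. The function must convert the raw data into the standard format.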
+
+```python
+# tasks/custom_task/process.py
+
+import json
+import os
+import os.path as osp
+
+import tqdm
+
+
+def process(cfg):
+    """Process the raw dataset.
+
+    Each processed record must contain the following fields:
+    - question_id: unique question identifier
+    - img_path: image path
+    - question: question text
+    - question_type: question type
+    """
+    # Load the raw data (load_dataset is a placeholder for your own loader)
+    dataset = load_dataset(cfg)
+
+    # Set the output path
+    output_dir = osp.join(cfg.processed_dataset_path, cfg.split, cfg.get("dataset_name", ""))
+    os.makedirs(output_dir, exist_ok=True)
+
+    # Process the data
+    processed_data = []
+    for i, item in enumerate(tqdm.tqdm(dataset)):
+        processed_item = {
+            "question_id": f'{item["data_type"]}_{i}',
+            "img_path": item["image_path"],
+            "question": item["question"],
+            "question_type": item["data_type"]
+        }
+        processed_data.append(processed_item)
+
+    # Save the processed data
+    with open(osp.join(output_dir, "data.json"), "w") as f:
+        json.dump(processed_data, f)
+```
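The `process` docstring above lists four fields that every processed record must contain. A small check like the one below can catch missing fields or duplicate IDs in `data.json` before an evaluation run. `validate_records` is a hypothetical helper for illustration, not part of FlagEvalMM.

```python
from typing import Dict, List

# Required fields as listed in the process() docstring.
REQUIRED_FIELDS = ("question_id", "img_path", "question", "question_type")


def validate_records(records: List[Dict]) -> List[str]:
    """Return a list of human-readable problems; empty means the data looks fine."""
    errors = []
    seen_ids = set()
    for idx, rec in enumerate(records):
        missing = [key for key in REQUIRED_FIELDS if key not in rec]
        if missing:
            errors.append(f"record {idx}: missing {', '.join(missing)}")
        qid = rec.get("question_id")
        if qid is not None:
            if qid in seen_ids:
                errors.append(f"record {idx}: duplicate question_id {qid!r}")
            seen_ids.add(qid)
    return errors


sample = [
    {"question_id": "chart_0", "img_path": "img/0.png",
     "question": "What is shown?", "question_type": "chart"},
    {"question_id": "chart_0", "img_path": "img/1.png", "question": "How many bars?"},
]
for problem in validate_records(sample):
    print(problem)
```

To check a real file, load it with `json.load` and pass the resulting list to `validate_records`.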