[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72192":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":38,"readmeContent":39,"aiSummary":40,"trendingCount":16,"starSnapshotCount":16,"syncStatus":41,"lastSyncTime":42,"discoverSource":43},72192,"lmms-eval","EvolvingLMMs-Lab\u002Flmms-eval","EvolvingLMMs-Lab","One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks","https:\u002F\u002Fwww.lmms-lab.com",null,"Python",4218,601,8,25,0,19,40,104,57,30.34,"Other",false,"main",true,[27,28,29,30,31,32,33,34,35,36,37],"agi","audio-evaluation","benchmark","evaluation","large-language-models","llm-evaluation","multimodal","multimodal-evaluation","video-understanding","vision-language-model","vlm","2026-06-12 02:02:59","\u003Cp align=\"center\" width=\"70%\">\n\u003Cimg src=\"https:\u002F\u002Fi.postimg.cc\u002FKvkLzbF9\u002FWX20241212-014400-2x.png\">\n\u003C\u002Fp>\n\n# LMMs-Eval: Probing Intelligence in the Real World\n\n[![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Flmms-eval)](https:\u002F\u002Fpypi.org\u002Fproject\u002Flmms-eval)\n![PyPI - Downloads](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fdm\u002Flmms-eval)\n[![GitHub contributors](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors\u002FEvolvingLMMs-Lab\u002Flmms-eval)](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Flmms-eval\u002Fgraphs\u002Fcontributors)\n[![issue 
resolution](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-closed-raw\u002FEvolvingLMMs-Lab\u002Flmms-eval)](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Flmms-eval\u002Fissues)\n[![open issues](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-raw\u002FEvolvingLMMs-Lab\u002Flmms-eval)](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Flmms-eval\u002Fissues)\n\n> We are building the unified evaluation toolkit for frontier models and probing the abilities in real world, shape what we build next.\n\n\u003Cdetails>\n\u003Csummary>🌐 Available in 17 languages\u003C\u002Fsummary>\n\n[简体中文](docs\u002Fi18n\u002FREADME_zh-CN.md) | [繁體中文](docs\u002Fi18n\u002FREADME_zh-TW.md) | [日本語](docs\u002Fi18n\u002FREADME_ja.md) | [한국어](docs\u002Fi18n\u002FREADME_ko.md) | [Español](docs\u002Fi18n\u002FREADME_es.md) | [Français](docs\u002Fi18n\u002FREADME_fr.md) | [Deutsch](docs\u002Fi18n\u002FREADME_de.md) | [Português](docs\u002Fi18n\u002FREADME_pt-BR.md) | [Русский](docs\u002Fi18n\u002FREADME_ru.md) | [Italiano](docs\u002Fi18n\u002FREADME_it.md) | [Nederlands](docs\u002Fi18n\u002FREADME_nl.md) | [Polski](docs\u002Fi18n\u002FREADME_pl.md) | [Türkçe](docs\u002Fi18n\u002FREADME_tr.md) | [العربية](docs\u002Fi18n\u002FREADME_ar.md) | [हिन्दी](docs\u002Fi18n\u002FREADME_hi.md) | [Tiếng Việt](docs\u002Fi18n\u002FREADME_vi.md) | [Indonesia](docs\u002Fi18n\u002FREADME_id.md)\n\n\u003C\u002Fdetails>\n\n📚 [Documentation](docs\u002FREADME.md) | 📖 [100+ Tasks](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Flmms-eval\u002Fblob\u002Fmain\u002Fdocs\u002Fadvanced\u002Fcurrent_tasks.md) | 🌟 [30+ Models](https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Flmms-eval\u002Ftree\u002Fmain\u002Flmms_eval\u002Fmodels) | ⚡ [Quickstart](docs\u002Fgetting-started\u002Fquickstart.md)\n\n🏠 [Homepage](https:\u002F\u002Fwww.lmms-lab.com\u002F) | 💬 [Discord](https:\u002F\u002Fdiscord.gg\u002F8xTM6jWnXa) | 🤝 [Contributing](CONTRIBUTING.md)\n\n---\n\n## Why 
`lmms-eval`?\n\nBenchmarks decide what gets built next. A model team that trusts its eval numbers can focus on real improvements instead of chasing noise. But the multimodal evaluation ecosystem is fragmented - scattered datasets, inconsistent post-processing, and single-number accuracy scores that hide whether a gain is real or random. Two teams evaluating the same model on the same benchmark routinely report different results.\n\nWe believe [better evals lead to better models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.09110). Good evaluation maps the border of what models can do and shapes what we build next.\n\nWe are building `lmms-eval` and focusing on three core principles:\n\n- **Reproducible** - One pipeline, deterministic results. Same model, same benchmark, same numbers, every time.\n- **Efficient** - Evaluation should not be the bottleneck, even at large scale. Async serving, adaptive batching, and video I\u002FO optimizations keep your GPUs saturated end to end.\n- **Trustworthy** - Not just accuracy. Confidence intervals, clustered standard errors, paired comparisons, and ongoing research into evaluation methodology. Results you can trust enough to act on.\n\nFor how the pipeline works and the concrete mechanisms behind these principles, see [How the Evaluation Pipeline Works](docs\u002FREADME.md#how-the-evaluation-pipeline-works) and [Why it's Efficient and Trustworthy](docs\u002FREADME.md#why-its-efficient-and-trustworthy).\n\n## What's New\n\n**v0.7** (Feb 2026) - Operational simplicity and pipeline maturity. 25+ new tasks across 8 domains, 2 new model backends, agentic task evaluation (`generate_until_agentic`), video I\u002FO overhaul with TorchCodec (up to 3.58x faster), Lance-backed video distribution on Hugging Face, safety\u002Fred-teaming baselines, efficiency metrics (per-sample token counts, run-level throughput), and streamlined flattened JSONL log output for cleaner post-analysis. 
[Release notes](docs\u002Freleases\u002Flmms-eval-0.7.md) | [Changelog](docs\u002Freleases\u002FCHANGELOG.md).\n\n**v0.6** (Feb 2026) - Evaluation as a service. Standalone HTTP eval server, ~7.5x throughput over v0.5, statistically grounded results (CI, paired t-test), 50+ new tasks. [Release notes](docs\u002Freleases\u002Flmms-eval-0.6.md) | [Changelog](docs\u002Freleases\u002FCHANGELOG.md).\n\n**v0.5** (Oct 2025) - Audio expansion. Comprehensive audio evaluation, response caching, 50+ benchmark variants across audio, vision, and reasoning. [Release notes](docs\u002Freleases\u002Flmms-eval-0.5.md).\n\n\u003Cdetails>\n\u003Csummary>Older updates\u003C\u002Fsummary>\n\n- [2025-01] [Video-MMMU](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.13826) - Knowledge acquisition from multi-discipline professional videos.\n- [2024-12] [MME-Survey](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.15296) - Comprehensive survey on evaluation of multimodal LLMs.\n- [2024-11] **v0.3** - Audio evaluation support (Qwen2-Audio, Gemini-Audio). [Release notes](docs\u002Freleases\u002Flmms-eval-0.3.md).\n- [2024-06] **v0.2** - Video evaluation (LLaVA-NeXT Video, Gemini 1.5 Pro, VideoMME, EgoSchema). [Blog](https:\u002F\u002Flmms-lab.github.io\u002Fposts\u002Flmms-eval-0.2\u002F).\n- [2024-03] **v0.1** - First release. [Blog](https:\u002F\u002Flmms-lab.github.io\u002Fposts\u002Flmms-eval-0.1\u002F).\n\n\u003C\u002Fdetails>\n\n## Quickstart\n\nInstall and run your first evaluation in under 5 minutes:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Flmms-eval.git\ncd lmms-eval && uv pip install -e \".[all]\"\n\n# Run a quick evaluation (Qwen2.5-VL on MME, 8 samples)\npython -m lmms_eval \\\n  --model qwen2_5_vl \\\n  --model_args pretrained=Qwen\u002FQwen2.5-VL-3B-Instruct \\\n  --tasks mme \\\n  --batch_size 1 \\\n  --limit 8\n```\n\nIf it prints metrics, your environment is ready. 
For the full guide, see [`docs\u002Fgetting-started\u002Fquickstart.md`](docs\u002Fgetting-started\u002Fquickstart.md).\n\n## Installation\n\n### Using `uv` (Recommended for consistent environments)\n\nWe use `uv` for package management to ensure all developers use exactly the same package versions. First, install uv:\n```bash\ncurl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\n```\n\nFor development with a consistent environment:\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Flmms-eval\ncd lmms-eval\n# Recommended\nuv pip install -e \".[all]\"\n# If you want to use uv sync\n# uv sync  # This creates\u002Fupdates your environment from uv.lock\n```\n\nTo run commands:\n```bash\nuv run python -m lmms_eval --help  # Run any command with uv run\n```\n\nTo add new dependencies:\n```bash\nuv add \u003Cpackage>  # Updates both pyproject.toml and uv.lock\n```\n\n### Alternative Installation\n\nFor direct usage from Git:\n```bash\n# Create a Python 3.12 virtual environment named \"eval\" and activate it\nuv venv eval --python 3.12\nsource eval\u002Fbin\u002Factivate\n# With this installation, you might need to add your own task YAML files\nuv pip install git+https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Flmms-eval.git\n```\n\n\u003Cdetails>\n\u003Csummary>Reproduction of LLaVA-1.5's paper results\u003C\u002Fsummary>\n\nYou can check the [torch environment info](miscs\u002Frepr_torch_envs.txt) and [results check](miscs\u002Fllava_result_check.md) to **reproduce LLaVA-1.5's paper results**. We found that differences in torch\u002FCUDA versions cause small variations in the results.\n\n\u003C\u002Fdetails>\n\nIf you want to test on caption datasets such as `coco`, `refcoco`, and `nocaps`, you need `java==1.8.0` for the pycocoeval API to work. 
If you don't have it, you can install it with conda:\n```\nconda install openjdk=8\n```\nYou can then check your Java version with `java -version`.\n\n\n\u003Cdetails>\n\u003Csummary>Comprehensive Evaluation Results of LLaVA Family Models\u003C\u002Fsummary>\n\u003Cbr>\n\nThe extensive table below is meant to help readers understand the datasets included in lmms-eval, along with some specific details about them (we are grateful for any corrections readers may have).\n\nWe provide a Google Sheet for the detailed results of the LLaVA series models on different datasets. You can access the sheet [here](https:\u002F\u002Fdocs.google.com\u002Fspreadsheets\u002Fd\u002F1a5ImfdKATDI8T7Cwh6eH-bEsnQFzanFraFUgcS9KHWc\u002Fedit?usp=sharing). It's a live sheet, and we are updating it with new results.\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Cimg src=\"https:\u002F\u002Fi.postimg.cc\u002Fjdw497NS\u002FWX20240307-162526-2x.png\"  width=\"100%\" height=\"80%\">\n\u003C\u002Fp>\n\nWe also provide the raw data exported from Weights & Biases for the detailed results of the LLaVA series models on different datasets. 
You can access the raw data [here](https:\u002F\u002Fdocs.google.com\u002Fspreadsheets\u002Fd\u002F1AvaEmuG4csSmXaHjgu4ei1KBMmNNW8wflOD_kkTDdv8\u002Fedit?usp=sharing).\n\n\u003C\u002Fdetails>\n\u003Cbr>\n\nIf you want to test [VILA](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FVILA), install the following dependency:\n\n```bash\npip install s2wrapper@git+https:\u002F\u002Fgithub.com\u002Fbfshi\u002Fscaling_on_scales\n```\n\nDevelopment continues on the main branch. We encourage you to share feedback on desired features and ways to improve the library, or to ask questions, in issues or PRs on GitHub.\n\n## Usage Examples\n\n> More examples can be found in [examples\u002Fmodels](examples\u002Fmodels)\n\n### Evaluation with vLLM\n\n**Qwen2.5-VL:**\n```bash\nbash examples\u002Fmodels\u002Fvllm_qwen2vl.sh\n```\n\n**Qwen3-VL:**\n```bash\nbash examples\u002Fmodels\u002Fvllm_qwen3vl.sh\n```\n\n**Qwen3.5:**\n```bash\nbash examples\u002Fmodels\u002Fvllm_qwen35.sh\n```\n\n### Evaluation with SGLang\n\n```bash\nbash examples\u002Fmodels\u002Fsglang.sh\n```\n\n**Qwen3.5:**\n```bash\nbash examples\u002Fmodels\u002Fsglang_qwen35.sh\n```\n\n### Evaluation of OpenAI-Compatible Model\n\n```bash\nbash examples\u002Fmodels\u002Fopenai_compatible.sh\n```\n\n### Evaluation of Qwen2.5-VL\n\n```bash\nbash examples\u002Fmodels\u002Fqwen25vl.sh\n```\n\n### Evaluation of Qwen3-VL\n\n```bash\nbash examples\u002Fmodels\u002Fqwen3vl.sh\n```\n\n**More Parameters**\n\n```bash\npython3 -m lmms_eval --help\n```\n\n**Environment Variables**\n\nBefore running experiments and evaluations, we recommend exporting the following environment variables. 
Some are necessary for certain tasks to run.\n\n```bash\nexport OPENAI_API_KEY=\"\u003CYOUR_API_KEY>\"\nexport HF_HOME=\"\u003CPath to HF cache>\"\nexport HF_TOKEN=\"\u003CYOUR_API_KEY>\"\nexport HF_HUB_ENABLE_HF_TRANSFER=\"1\"\nexport REKA_API_KEY=\"\u003CYOUR_API_KEY>\"\n# Other possible environment variables include\n# ANTHROPIC_API_KEY, DASHSCOPE_API_KEY, etc.\n```\n\n**Common Environment Issues**\n\nYou may run into common issues such as errors related to httpx or protobuf. To resolve them, first try:\n\n```bash\npython3 -m pip install httpx==0.23.3;\npython3 -m pip install protobuf==3.20;\n# numpy==2.x can sometimes cause errors; pin numpy to 1.26 if it does\npython3 -m pip install numpy==1.26;\n# sentencepiece is sometimes required for the tokenizer to work\npython3 -m pip install sentencepiece;\n```\n\n## Custom Model Integration\n\n`lmms-eval` supports two types of models: **Chat (recommended)** and **Simple (legacy)**.\n\n### Chat Models (Recommended) 🌟\n\n- Location: `lmms_eval\u002Fmodels\u002Fchat\u002F`\n- Use: `doc_to_messages` function from task\n- Input: Structured `ChatMessages` with roles (`user`, `system`, `assistant`) and content types (`text`, `image`, `video`, `audio`)\n- Supports: Interleaved multimodal content\n- Uses: Model's `apply_chat_template()` method\n- Reference: `lmms_eval\u002Fmodels\u002Fchat\u002Fqwen2_5_vl.py` or `lmms_eval\u002Fmodels\u002Fchat\u002Fqwen3_vl.py`\n\n**Example input format:**\n```python\n[\n    {\"role\": \"user\", \"content\": [\n        {\"type\": \"image\", \"url\": \u003Cimage>},\n        {\"type\": \"text\", \"text\": \"What's in this image?\"}\n    ]}\n]\n```\n\n### Simple Models (Legacy)\n\n- Location: `lmms_eval\u002Fmodels\u002Fsimple\u002F`\n- Use: `doc_to_visual` + `doc_to_text` functions from task\n- Input: Plain text with `\u003Cimage>` placeholders + separate visual list\n- Supports: Limited (mainly images)\n- Manual processing: No chat template support\n- Reference: 
`lmms_eval\u002Fmodels\u002Fsimple\u002Finstructblip.py`\n\n**Example input format:**\n```python\n# Separate visual and text\ndoc_to_visual -> [PIL.Image]\ndoc_to_text -> \"What's in this image?\"\n```\n\n### Key Differences\n\n| Aspect | Chat Models | Simple Models |\n|--------|-------------|---------------|\n| File location | `models\u002Fchat\u002F` | `models\u002Fsimple\u002F` |\n| Input method | `doc_to_messages` | `doc_to_visual` + `doc_to_text` |\n| Message format | Structured (roles + content types) | Plain text with placeholders |\n| Interleaved support | ✅ Yes | ❌ Limited |\n| Chat template | ✅ Built-in | ❌ Manual\u002FNone |\n| Recommendation | **Use this** | Legacy only |\n\n### Why Use Chat Models?\n\n- ✅ Built-in chat template support\n- ✅ Interleaved multimodal content\n- ✅ Structured message protocol\n- ✅ Better video\u002Faudio support\n- ✅ Consistent with modern LLM APIs\n\n### Chat Model Implementation Example\n\n```python\nfrom lmms_eval.api.registry import register_model\nfrom lmms_eval.api.model import lmms\nfrom lmms_eval.protocol import ChatMessages\n\n@register_model(\"my_chat_model\")\nclass MyChatModel(lmms):\n    is_simple = False  # Use chat interface\n\n    def generate_until(self, requests):\n        for request in requests:\n            # 5 elements for chat models\n            doc_to_messages, gen_kwargs, doc_id, task, split = request.args\n\n            # Get structured messages\n            raw_messages = doc_to_messages(self.task_dict[task][split][doc_id])\n            messages = ChatMessages(messages=raw_messages)\n\n            # Extract media and apply chat template\n            images, videos, audios = messages.extract_media()\n            hf_messages = messages.to_hf_messages()\n            text = self.processor.apply_chat_template(hf_messages)\n\n            # Generate...\n```\n\nFor more details, see the [Model Guide](docs\u002Fguides\u002Fmodel_guide.md).\n\n## Custom Dataset Integration\n\n### Task Configuration with 
`doc_to_messages`\n\nImplement `doc_to_messages` to transform dataset documents into structured chat messages:\n\n```python\ndef my_doc_to_messages(doc, lmms_eval_specific_kwargs=None):\n    # Extract visuals and text from doc\n    visuals = my_doc_to_visual(doc)\n    text = my_doc_to_text(doc, lmms_eval_specific_kwargs)\n\n    # Build structured messages\n    messages = [{\"role\": \"user\", \"content\": []}]\n\n    # Add visuals first\n    for visual in visuals:\n        messages[0][\"content\"].append({\"type\": \"image\", \"url\": visual})\n\n    # Add text\n    messages[0][\"content\"].append({\"type\": \"text\", \"text\": text})\n\n    return messages\n```\n\n### YAML Configuration\n\n```yaml\ntask: \"my_benchmark\"\ndataset_path: \"my-org\u002Fmy-dataset\"\ntest_split: test\noutput_type: generate_until\n\n# For chat models (recommended)\ndoc_to_messages: !function utils.my_doc_to_messages\n\n# OR legacy approach:\ndoc_to_visual: !function utils.my_doc_to_visual\ndoc_to_text: !function utils.my_doc_to_text\n\nprocess_results: !function utils.my_process_results\nmetric_list:\n  - metric: acc\n```\n\n### Key Features\n\n#### `doc_to_messages`\n\n- Transforms dataset document into structured chat messages\n- Returns: List of message dicts with `role` and `content`\n- Content supports: `text`, `image`, `video`, `audio` types\n- Protocol: Defined in `lmms_eval\u002Fprotocol.py` (`ChatMessages` class)\n- Auto-fallback: If not provided, uses `doc_to_visual` + `doc_to_text`\n\n\nFor more details, see the [Task Guide](docs\u002Fguides\u002Ftask_guide.md).\n\n## Web UI\n\nLMMS-Eval includes an optional Web UI for interactive evaluation configuration.\n\n### Requirements\n\n- Node.js 18+ (for building the frontend, auto-built on first run)\n\n### Usage\n\n```bash\n# Start the Web UI (opens browser automatically)\nuv run lmms-eval-ui\n\n# Custom port\nLMMS_SERVER_PORT=3000 uv run lmms-eval-ui\n```\n\nThe web UI provides:\n- Model selection from all available models\n- 
Task selection with search\u002Ffilter\n- Real-time command preview\n- Live evaluation output streaming\n- Start\u002FStop evaluation controls\n- Log Viewer for browsing saved evaluation results and samples\n\nFor more details, see [Web UI README](lmms_eval\u002Ftui\u002FREADME.md).\n\n## HTTP Evaluation Server\n\nLMMS-Eval includes a production-ready HTTP server for remote evaluation workflows.\n\n### Why Use Eval Server?\n\n- **Decoupled evaluation**: Run evaluations on dedicated GPU nodes while training continues\n- **Async workflow**: Submit jobs without blocking training loops\n- **Queue management**: Sequential job processing with automatic resource management\n- **Remote access**: Evaluate models from any machine\n\n### Start Server\n\n```python\nfrom lmms_eval.entrypoints import ServerArgs, launch_server\n\n# Configure server\nargs = ServerArgs(\n    host=\"0.0.0.0\",\n    port=8000,\n    max_completed_jobs=200,\n    temp_dir_prefix=\"lmms_eval_\"\n)\n\n# Launch server\nlaunch_server(args)\n```\n\nServer runs at `http:\u002F\u002Fhost:port` with auto-generated API docs at `\u002Fdocs`\n\n### Client Usage\n\n**Sync Client:**\n```python\nfrom lmms_eval.entrypoints import EvalClient\n\nclient = EvalClient(\"http:\u002F\u002Feval-server:8000\")\n\n# Submit evaluation (non-blocking)\njob = client.evaluate(\n    model=\"qwen2_5_vl\",\n    tasks=[\"mmmu_val\", \"mme\"],\n    model_args={\"pretrained\": \"Qwen\u002FQwen2.5-VL-7B-Instruct\"},\n    num_fewshot=0,\n    batch_size=1,\n    device=\"cuda:0\",\n)\n\n# Continue training...\n# Later, retrieve results\nresult = client.wait_for_job(job[\"job_id\"])\nprint(result[\"result\"])\n```\n\n**Async Client:**\n```python\nfrom lmms_eval.entrypoints import AsyncEvalClient\n\nasync with AsyncEvalClient(\"http:\u002F\u002Feval-server:8000\") as client:\n    job = await client.evaluate(\n        model=\"qwen3_vl\",\n        tasks=[\"mmmu_val\"],\n        model_args={\"pretrained\": \"Qwen\u002FQwen3-VL-4B-Instruct\"},\n    
)\n    result = await client.wait_for_job(job[\"job_id\"])\n```\n\n### Server API Endpoints\n\n| Endpoint | Method | Description |\n|----------|--------|-------------|\n| `\u002Fhealth` | GET | Server health check |\n| `\u002Fevaluate` | POST | Submit evaluation job |\n| `\u002Fjobs\u002F{job_id}` | GET | Get job status and results |\n| `\u002Fqueue` | GET | View queue status |\n| `\u002Ftasks` | GET | List available tasks |\n| `\u002Fmodels` | GET | List available models |\n| `\u002Fjobs\u002F{job_id}` | DELETE | Cancel queued job |\n| `\u002Fmerge` | POST | Merge FSDP2 sharded checkpoints |\n\n### Example Workflow\n\n```python\n# Training loop pseudocode\neval_jobs = []  # IDs of evaluation jobs submitted during training\nfor epoch in range(num_epochs):\n    train_one_epoch()\n\n    # Every 5 epochs, evaluate the checkpoint\n    if epoch % 5 == 0:\n        checkpoint_path = f\"checkpoints\u002Fepoch_{epoch}\"\n\n        # Submit async evaluation (non-blocking)\n        eval_job = client.evaluate(\n            model=\"vllm\",\n            model_args={\"model\": checkpoint_path},\n            tasks=[\"mmmu_val\", \"mathvista\"],\n        )\n        # Remember the job ID so results can be collected later\n        eval_jobs.append(eval_job[\"job_id\"])\n\n        # Training continues immediately\n        print(f\"Evaluation job submitted: {eval_job['job_id']}\")\n\n# After training completes, retrieve all results\nresults = []\nfor job_id in eval_jobs:\n    result = client.wait_for_job(job_id)\n    results.append(result)\n```\n\n### Security Note\n\n⚠️ **This server is intended for trusted environments only**. Do NOT expose it to untrusted networks without additional security layers (authentication, rate limiting, network isolation).\n\nFor more details, see the [v0.6 release notes](docs\u002Freleases\u002Flmms-eval-0.6.md).\n\n## Frequently Asked Questions\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>What models does lmms-eval support?\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nWe support 30+ model families out of the box, including Qwen2.5-VL, Qwen3-VL, LLaVA-OneVision, InternVL-2, VILA, and more. 
Any OpenAI-compatible API endpoint is also supported. See the full list in [`lmms_eval\u002Fmodels\u002F`](lmms_eval\u002Fmodels\u002F).\n\nQwen3.5 is supported through existing runtime backends (`--model vllm` and `--model sglang`) by setting `model=Qwen\u002FQwen3.5-397B-A17B` in `--model_args`.\n\nThe Qwen3.5 example scripts align with official runtime references (for example, `max_model_len\u002Fcontext_length=262144` and `reasoning_parser=qwen3`).\n\nIf a new model family is already fully supported by vLLM or SGLang at runtime, we generally only need documentation and examples instead of adding a dedicated model wrapper.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>What benchmarks and tasks are available?\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nOver 100 evaluation tasks across image, video, and audio modalities, including MMMU, MME, MMBench, MathVista, VideoMME, EgoSchema, and many more. Check [`docs\u002Fadvanced\u002Fcurrent_tasks.md`](docs\u002Fadvanced\u002Fcurrent_tasks.md) for the full list.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>How do I add my own benchmark?\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nCreate a YAML config under `lmms_eval\u002Ftasks\u002F` with dataset path, splits, and a `doc_to_messages` function. See [`docs\u002Fguides\u002Ftask_guide.md`](docs\u002Fguides\u002Ftask_guide.md) for a step-by-step guide.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>Can I evaluate a model behind an API (e.g., GPT-4o, Claude)?\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nYes. Use `--model openai` with `--model_args model=gpt-4o` and set `OPENAI_API_KEY`. 
Any OpenAI-compatible endpoint works, including local vLLM\u002FSGLang servers.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>How do I run evaluations on multiple GPUs?\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nUse `accelerate launch` or pass `--device cuda` with tensor parallelism via vLLM\u002FSGLang backends. See [`docs\u002Fgetting-started\u002Fcommands.md`](docs\u002Fgetting-started\u002Fcommands.md) for multi-GPU flags.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>How do I cite lmms-eval?\u003C\u002Fstrong>\u003C\u002Fsummary>\n\nUse the BibTeX entries below, or click the \"Cite this repository\" button in the GitHub sidebar (powered by our [`CITATION.cff`](CITATION.cff)).\n\n\u003C\u002Fdetails>\n\n## Acknowledgement\n\nlmms_eval is a fork of [lm-eval-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness). We recommend reading through the [docs of lm-eval-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Ftree\u002Fmain\u002Fdocs) for relevant information.\n\n---\n\nBelow are the changes we made to the original API:\n- Build context now passes in only the idx; images and docs are processed during the model response phase. Datasets now contain many images, and storing them in the doc as the original lm-eval-harness does would exhaust CPU memory.\n- Instance.args (lmms_eval\u002Fapi\u002Finstance.py) now contains a list of images to be passed to lmms.\n- lm-eval-harness supports all HF language models with a single model class. This is currently not possible for lmms because the input\u002Foutput formats of lmms in HF are not yet unified. Therefore, we have to create a new class for each lmms model. 
This is not ideal and we will try to unify them in the future.\n\n---\n\n## Citations\n\n```bibtex\n@misc{zhang2024lmmsevalrealitycheckevaluation,\n      title={LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models},\n      author={Kaichen Zhang and Bo Li and Peiyuan Zhang and Fanyi Pu and Joshua Adrian Cahyono and Kairui Hu and Shuai Liu and Yuanhan Zhang and Jingkang Yang and Chunyuan Li and Ziwei Liu},\n      year={2024},\n      eprint={2407.12772},\n      archivePrefix={arXiv},\n      primaryClass={cs.CL},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.12772},\n}\n\n@misc{lmms_eval2024,\n    title={LMMs-Eval: Accelerating the Development of Large Multimodal Models},\n    url={https:\u002F\u002Fgithub.com\u002FEvolvingLMMs-Lab\u002Flmms-eval},\n    author={Bo Li*, Peiyuan Zhang*, Kaichen Zhang*, Fanyi Pu*, Xinrun Du, Yuhao Dong, Haotian Liu, Yuanhan Zhang, Ge Zhang, Chunyuan Li and Ziwei Liu},\n    publisher    = {Zenodo},\n    version      = {v0.1.0},\n    month={March},\n    year={2024}\n}\n```\n","LMMs-Eval is a unified multimodal evaluation toolkit for text, image, video, and audio tasks. Its core functionality includes support for more than 100 tasks and more than 30 models, with the aim of evaluating frontier models in a reproducible, efficient, and trustworthy way. It uses techniques such as async serving, adaptive batching, and video I\u002FO optimizations to maintain performance during large-scale evaluation. It is suited to researchers and developers who need to comprehensively test and compare multimodal models, helping them better understand the boundaries of model capability and guide subsequent development.",2,"2026-06-11 03:40:46","high_star"]