[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72483":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":29,"readmeContent":30,"aiSummary":31,"trendingCount":16,"starSnapshotCount":16,"syncStatus":32,"lastSyncTime":33,"discoverSource":34},72483,"lighteval","huggingface\u002Flighteval","huggingface","Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends","https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Flighteval\u002Fen\u002Findex",null,"Python",2440,482,25,212,0,4,11,31,12,76.65,"MIT License",false,"main",[26,27,28,7],"evaluation","evaluation-framework","evaluation-metrics","2026-06-12 04:01:06","\u003Cp align=\"center\">\n  \u003Cbr\u002F>\n    \u003Cimg alt=\"lighteval library logo\" src=\".\u002Fassets\u002Flighteval-doc.svg\" width=\"376\" height=\"59\" style=\"max-width: 100%;\">\n  \u003Cbr\u002F>\n\u003C\u002Fp>\n\n\n\u003Cp align=\"center\">\n    \u003Ci>Your go-to toolkit for lightning-fast, flexible LLM evaluation, from Hugging Face's Leaderboard and Evals Team.\u003C\u002Fi>\n\u003C\u002Fp>\n\n\u003Cdiv 
align=\"center\">\n\n[![Tests](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Flighteval\u002Factions\u002Fworkflows\u002Ftests.yaml\u002Fbadge.svg?branch=main)](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Flighteval\u002Factions\u002Fworkflows\u002Ftests.yaml?query=branch%3Amain)\n[![Quality](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Flighteval\u002Factions\u002Fworkflows\u002Fquality.yaml\u002Fbadge.svg?branch=main)](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Flighteval\u002Factions\u002Fworkflows\u002Fquality.yaml?query=branch%3Amain)\n[![Python versions](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Flighteval)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-green.svg)](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Flighteval\u002Fblob\u002Fmain\u002FLICENSE)\n[![Version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Flighteval)](https:\u002F\u002Fpypi.org\u002Fproject\u002Flighteval\u002F)\n\n\u003C\u002Fdiv>\n\n---\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Flighteval\u002Fmain\u002Fen\u002Findex\" target=\"_blank\">\n    \u003Cimg alt=\"Documentation\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDocumentation-4F4F4F?style=for-the-badge&logo=readthedocs&logoColor=white\" \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenEvals\u002Fopen_benchmark_index\" target=\"_blank\">\n    \u003Cimg alt=\"Open Benchmark Index\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FOpen%20Benchmark%20Index-4F4F4F?style=for-the-badge&logo=huggingface&logoColor=white\" \u002F>\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n---\n\n**Lighteval** is your *all-in-one toolkit* for evaluating LLMs across multiple\nbackends—whether your model is being **served somewhere** or **already loaded in memory**.\nDive deep into your model's performance by 
saving and exploring *detailed,\nsample-by-sample results* to debug and see how your models stack up.\n\n*Customization at your fingertips*: browse all our existing tasks and [metrics](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Flighteval\u002Fmetric-list), or create your own [custom task](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Flighteval\u002Fadding-a-custom-task) and [custom metric](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Flighteval\u002Fadding-a-new-metric), tailored to your needs.\n\n\n## Available Tasks\n\nLighteval supports **1000+ evaluation tasks** across multiple domains and\nlanguages. Use [this\nspace](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenEvals\u002Fopen_benchmark_index) to find what\nyou need, or start with this overview of some *popular benchmarks*:\n\n\n### 📚 **Knowledge**\n- **General Knowledge**: MMLU, MMLU-Pro, MMMU, BIG-Bench\n- **Question Answering**: TriviaQA, Natural Questions, SimpleQA, Humanity's Last Exam (HLE)\n- **Specialized**: GPQA, AGIEval\n\n### 🧮 **Math and Code**\n- **Math Problems**: GSM8K, GSM-Plus, MATH, MATH500\n- **Competition Math**: AIME24, AIME25\n- **Multilingual Math**: MGSM (Grade School Math in 10+ languages)\n- **Coding Benchmarks**: LCB (LiveCodeBench)\n\n### 🎯 **Chat Model Evaluation**\n- **Instruction Following**: IFEval, IFEval-fr\n- **Reasoning**: MUSR, DROP (discrete reasoning)\n- **Long Context**: RULER\n- **Dialogue**: MT-Bench\n- **Holistic Evaluation**: HELM, BIG-Bench\n\n### 🌍 **Multilingual Evaluation**\n- **Cross-lingual**: XTREME, Flores200 (200 languages), XCOPA, XQuAD\n- **Language-specific**:\n  - **Arabic**: ArabicMMLU\n  - **Filipino**: FilBench\n  - **French**: IFEval-fr, GPQA-fr, BAC-fr\n  - **German**: German RAG Eval\n  - **Serbian**: Serbian LLM Benchmark, OZ Eval\n  - **Turkic**: TUMLU (9 Turkic languages)\n  - **Chinese**: CMMLU, CEval, AGIEval\n  - **Russian**: RUMMLU, Russian SQuAD\n  - **Kyrgyz**: Kyrgyz LLM Benchmark\n  - **And 
many more...**\n\n### 🧠 **Core Language Understanding**\n- **NLU**: GLUE, SuperGLUE, TriviaQA, Natural Questions\n- **Commonsense**: HellaSwag, WinoGrande, ProtoQA\n- **Natural Language Inference**: XNLI\n- **Reading Comprehension**: SQuAD, XQuAD, MLQA, Belebele\n\n\n## ⚡️ Installation\n\n> **Note**: lighteval is currently *completely untested on Windows*, and we don't support it yet. (*Should be fully functional on Mac\u002FLinux*)\n\n```bash\npip install lighteval\n```\n\nLighteval allows for *many extras* when installing; see [here](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Flighteval\u002Finstallation) for a **complete list**.\n\nIf you want to push results to the **Hugging Face Hub**, log in with your\naccess token:\n\n```shell\nhf auth login\n```\n\n## 🚀 Quickstart\n\nLighteval offers the following entry points for model evaluation:\n\n- `lighteval eval`: Evaluate models using [inspect-ai](https:\u002F\u002Finspect.aisi.org.uk\u002F) as a backend (preferred).\n- `lighteval accelerate`: Evaluate models on CPU or one or more GPUs using [🤗\n  Accelerate](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate)\n- `lighteval nanotron`: Evaluate models in distributed settings using [⚡️\n  Nanotron](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fnanotron)\n- `lighteval vllm`: Evaluate models on one or more GPUs using [🚀\n  VLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm)\n- `lighteval sglang`: Evaluate models using [SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang) as backend\n- `lighteval endpoint`: Evaluate models using various endpoints as backend\n  - `lighteval endpoint inference-endpoint`: Evaluate models using Hugging Face's [Inference Endpoints API](https:\u002F\u002Fhuggingface.co\u002Finference-endpoints\u002Fdedicated)\n  - `lighteval endpoint tgi`: Evaluate models using [🔗 Text Generation 
Inference](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftext-generation-inference\u002Fen\u002Findex) running locally\n  - `lighteval endpoint litellm`: Evaluate models on any compatible API using [LiteLLM](https:\u002F\u002Fwww.litellm.ai\u002F)\n  - `lighteval endpoint inference-providers`: Evaluate models using [HuggingFace's inference providers](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Finference-providers\u002Fen\u002Findex) as backend\n\nDidn't find what you need? You can always build your own custom model API by following [this guide](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Flighteval\u002Fmain\u002Fen\u002Fevaluating-a-custom-model)\n- `lighteval custom`: Evaluate custom models (can be anything)\n\nHere's a **quick command** to evaluate using a remote inference service:\n\n```shell\nlighteval eval \"hf-inference-providers\u002Fopenai\u002Fgpt-oss-20b\" gpqa:diamond\n```\n\nOr use the **Python API** to run a model *already loaded in memory*!\n\n```python\nfrom transformers import AutoModelForCausalLM\n\nfrom lighteval.logging.evaluation_tracker import EvaluationTracker\nfrom lighteval.models.transformers.transformers_model import TransformersModel, TransformersModelConfig\nfrom lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters\n\n\nMODEL_NAME = \"meta-llama\u002FMeta-Llama-3-8B-Instruct\"\nBENCHMARKS = \"gsm8k\"\n\nevaluation_tracker = EvaluationTracker(output_dir=\".\u002Fresults\")\npipeline_params = PipelineParameters(\n    launcher_type=ParallelismManager.NONE,\n    max_samples=2\n)\n\nmodel = AutoModelForCausalLM.from_pretrained(\n  MODEL_NAME, device_map=\"auto\"\n)\nconfig = TransformersModelConfig(model_name=MODEL_NAME, batch_size=1)\nmodel = TransformersModel.from_model(model, config)\n\npipeline = Pipeline(\n    model=model,\n    pipeline_parameters=pipeline_params,\n    evaluation_tracker=evaluation_tracker,\n    tasks=BENCHMARKS,\n)\n\nresults = pipeline.evaluate()\npipeline.show_results()\nresults = 
pipeline.get_results()\n```\n\n## 🙏 Acknowledgements\n\nLighteval took inspiration from the following *amazing* frameworks: Eleuther's [AI Harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness) and Stanford's\n[HELM](https:\u002F\u002Fcrfm.stanford.edu\u002Fhelm\u002Flatest\u002F). We are grateful to their teams for their **pioneering work** on LLM evaluations.\n\nWe'd also like to offer our thanks to all the community members who have contributed to the library, adding new features and reporting or fixing bugs.\n\n## 🌟 Contributions Welcome 💙💚💛💜🧡\n\n**Got ideas?** Found a bug? Want to add a\n[task](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Flighteval\u002Fadding-a-custom-task) or\n[metric](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Flighteval\u002Fadding-a-new-metric)?\nContributions are *warmly welcomed*!\n\nIf you're adding a **new feature**, please *open an issue first*.\n\nIf you open a PR, don't forget to **run the styling**!\n\n```bash\n# For basic development (code quality, tests)\npip install -e \".[dev]\"\n\n# Or for GPU\u002Fvllm development and slow tests\npip install -e \".[dev-gpu]\"\n\npre-commit install\npre-commit run --all-files\n```\n\n## 📜 Citation\n\n```bibtex\n@misc{lighteval,\n  author = {Habib, Nathan and Fourrier, Clémentine and Kydlíček, Hynek and Wolf, Thomas and Tunstall, Lewis},\n  title = {LightEval: A lightweight framework for LLM evaluation},\n  year = {2023},\n  version = {0.11.0},\n  url = {https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Flighteval}\n}\n```\n","Lighteval is a comprehensive toolkit for evaluating large language models (LLMs) across multiple backends. It supports more than 1,000 evaluation tasks across domains and lets users define custom tasks and metrics to meet specific needs. Core features include flexible task selection, saving and analysis of detailed sample-level results, and support for models that are either already loaded in memory or served remotely. It suits scenarios that call for an in-depth understanding of model performance, debugging, or comparing models, particularly knowledge question answering, math and coding tests, and chat model evaluation. It is written in Python and open-sourced under the MIT License.",2,"2026-06-11 03:42:13","high_star"]