[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-70925":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":33,"readmeContent":34,"aiSummary":35,"trendingCount":16,"starSnapshotCount":16,"syncStatus":36,"lastSyncTime":37,"discoverSource":38},70925,"deepeval","confident-ai\u002Fdeepeval","confident-ai","The LLM Evaluation Framework","https:\u002F\u002Fdeepeval.com",null,"Python",16084,1520,62,209,0,130,263,762,390,44.55,"Apache License 2.0",false,"main",true,[27,28,29,30,31,32],"evaluation-framework","evaluation-metrics","llm-evaluation","llm-evaluation-framework","llm-evaluation-metrics","python","2026-06-12 02:02:45","\u003Cp align=\"center\">\n    \u003Cpicture>\n        \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"assets\u002Fhero\u002Fwordmark-dark.svg\">\n        \u003Cimg alt=\"DeepEval.\" src=\"assets\u002Fhero\u002Fwordmark-light.svg\" width=\"520\">\n    \u003C\u002Fpicture>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n    \u003Ch1 align=\"center\">The LLM Evaluation Framework\u003C\u002Fh1>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n\u003Ca href=\"https:\u002F\u002Ftrendshift.io\u002Frepositories\u002F5917\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Ftrendshift.io\u002Fapi\u002Fbadge\u002Frepositories\u002F5917\" alt=\"confident-ai%2Fdeepeval | Trendshift\" style=\"width: 250px; height: 55px;\" width=\"250\" height=\"55\"\u002F>\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n    \u003Ca 
href=\"https:\u002F\u002Fdiscord.gg\u002F3SEyvpgu2f\">\n        \u003Cimg alt=\"discord-invite\" src=\"https:\u002F\u002Fdcbadge.vercel.app\u002Fapi\u002Fserver\u002F3SEyvpgu2f?style=flat\">\n    \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Ch4 align=\"center\">\n    \u003Cp>\n        \u003Ca href=\"https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fgetting-started?utm_source=GitHub\">Documentation\u003C\u002Fa> |\n        \u003Ca href=\"#-metrics-and-features\">Metrics and Features\u003C\u002Fa> |\n        \u003Ca href=\"#-quickstart\">Getting Started\u003C\u002Fa> |\n        \u003Ca href=\"#-integrations\">Integrations\u003C\u002Fa> |\n        \u003Ca href=\"https:\u002F\u002Fwww.confident-ai.com?utm_source=deepeval&utm_medium=github&utm_content=header_nav\">Confident AI\u003C\u002Fa>\n    \u003Cp>\n\u003C\u002Fh4>\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fconfident-ai\u002Fdeepeval\u002Freleases\">\n        \u003Cimg alt=\"GitHub release\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frelease\u002Fconfident-ai\u002Fdeepeval.svg?color=violet\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1PPxYEBa6eu__LquGoFFJZkhYgWVYE6kh?usp=sharing\">\n        \u003Cimg alt=\"Try Quickstart in Colab\" src=\"https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fconfident-ai\u002Fdeepeval\u002Fblob\u002Fmaster\u002FLICENSE.md\">\n        \u003Cimg alt=\"License\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fconfident-ai\u002Fdeepeval.svg?color=yellow\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fx.com\u002Fdeepeval\">\n        \u003Cimg alt=\"Twitter Follow\" src=\"https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Ffollow\u002Fdeepeval?style=social&logo=x\">\n    \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n    \u003C!-- Keep these 
links. Translations will automatically update with the README. -->\n    \u003Ca href=\"https:\u002F\u002Fwww.readme-i18n.com\u002Fconfident-ai\u002Fdeepeval?lang=de\">Deutsch\u003C\u002Fa> | \n    \u003Ca href=\"https:\u002F\u002Fwww.readme-i18n.com\u002Fconfident-ai\u002Fdeepeval?lang=es\">Español\u003C\u002Fa> | \n    \u003Ca href=\"https:\u002F\u002Fwww.readme-i18n.com\u002Fconfident-ai\u002Fdeepeval?lang=fr\">français\u003C\u002Fa> | \n    \u003Ca href=\"https:\u002F\u002Fwww.readme-i18n.com\u002Fconfident-ai\u002Fdeepeval?lang=ja\">日本語\u003C\u002Fa> | \n    \u003Ca href=\"https:\u002F\u002Fwww.readme-i18n.com\u002Fconfident-ai\u002Fdeepeval?lang=ko\">한국어\u003C\u002Fa> | \n    \u003Ca href=\"https:\u002F\u002Fwww.readme-i18n.com\u002Fconfident-ai\u002Fdeepeval?lang=pt\">Português\u003C\u002Fa> | \n    \u003Ca href=\"https:\u002F\u002Fwww.readme-i18n.com\u002Fconfident-ai\u002Fdeepeval?lang=ru\">Русский\u003C\u002Fa> | \n    \u003Ca href=\"https:\u002F\u002Fwww.readme-i18n.com\u002Fconfident-ai\u002Fdeepeval?lang=zh\">中文\u003C\u002Fa>\n\u003C\u002Fp>\n\n**DeepEval** is a simple-to-use, open-source framework for evaluating large language model (LLM) systems. It is similar to Pytest but specialized for unit testing LLM apps. DeepEval incorporates the latest research to run evals via metrics such as G-Eval, task completion, answer relevancy, and hallucination, which use LLM-as-a-judge and other NLP models that run **locally on your machine**.\n\nWhether you're building AI agents, RAG pipelines, or chatbots, implemented via LangChain or OpenAI, DeepEval has you covered. With it, you can easily determine the optimal models, prompts, and architecture to improve your AI quality, prevent prompt drift, or even transition from OpenAI to Claude with confidence.\n\n> [!IMPORTANT]\n> Need a place for your DeepEval testing data to live 🏡❤️? 
[Sign up to the DeepEval platform](https:\u002F\u002Fwww.confident-ai.com?utm_source=deepeval&utm_medium=github&utm_content=signup_callout) to compare iterations of your LLM app, generate & share testing reports, and more.\n>\n> ![Demo GIF](assets\u002Fdemo.gif)\n\n> Want to talk LLM evaluation, need help picking metrics, or just to say hi? [Come join our discord.](https:\u002F\u002Fdiscord.com\u002Finvite\u002F3SEyvpgu2f)\n\n\u003Cbr \u002F>\n\n# 🔥 Metrics and Features\n\n- 📐 Large variety of ready-to-use LLM eval metrics (all with explanations) powered by **ANY** LLM of your choice, statistical methods, or NLP models that run **locally on your machine** covering all use cases:\n\n  - **Custom, All-Purpose Metrics:**\n\n    - [G-Eval](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-llm-evals) — a research-backed LLM-as-a-judge metric for evaluating on any custom criteria with human-like accuracy\n    - [DAG](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-dag) — DeepEval's graph-based deterministic LLM-as-a-judge metric builder\n\n  - \u003Cdetails>\n    \u003Csummary>\u003Cb>Agentic Metrics\u003C\u002Fb>\u003C\u002Fsummary>\n\n    - [Task Completion](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-task-completion) — evaluate whether an agent accomplished its goal\n    - [Tool Correctness](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-tool-correctness) — check if the right tools were called with the right arguments\n    - [Goal Accuracy](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-goal-accuracy) — measure how accurately the agent achieved the intended goal\n    - [Step Efficiency](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-step-efficiency) — evaluate whether the agent took unnecessary steps\n    - [Plan Adherence](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-plan-adherence) — check if the agent followed the expected plan\n    - [Plan Quality](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-plan-quality) 
— evaluate the quality of the agent's plan\n    - [Tool Use](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-tool-use) — measure quality of tool usage\n    - [Argument Correctness](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-argument-correctness) — validate tool call arguments\n\n    \u003C\u002Fdetails>\n\n  - \u003Cdetails>\n    \u003Csummary>\u003Cb>RAG Metrics\u003C\u002Fb>\u003C\u002Fsummary>\n\n    - [Answer Relevancy](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-answer-relevancy) — measure how relevant the RAG pipeline's output is to the input\n    - [Faithfulness](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-faithfulness) — evaluate whether the RAG pipeline's output factually aligns with the retrieval context\n    - [Contextual Recall](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-contextual-recall) — measure how well the RAG pipeline's retrieval context aligns with the expected output\n    - [Contextual Precision](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-contextual-precision) — evaluate whether relevant nodes in the RAG pipeline's retrieval context are ranked higher\n    - [Contextual Relevancy](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-contextual-relevancy) — measure the overall relevance of the RAG pipeline's retrieval context to the input\n    - [RAGAS](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-ragas) — average of answer relevancy, faithfulness, contextual precision, and contextual recall\n\n    \u003C\u002Fdetails>\n\n  - \u003Cdetails>\n    \u003Csummary>\u003Cb>Multi-Turn Metrics\u003C\u002Fb>\u003C\u002Fsummary>\n\n    - [Knowledge Retention](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-knowledge-retention) — evaluate whether the chatbot retains factual information throughout a conversation\n    - [Conversation Completeness](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-conversation-completeness) — measure whether the chatbot satisfies user needs 
throughout a conversation\n    - [Turn Relevancy](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-turn-relevancy) — evaluate whether the chatbot generates consistently relevant responses throughout a conversation\n    - [Turn Faithfulness](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-turn-faithfulness) — check if the chatbot's responses are factually grounded in retrieval context across turns\n    - [Role Adherence](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-role-adherence) — evaluate whether the chatbot adheres to its assigned role throughout a conversation\n\n    \u003C\u002Fdetails>\n\n  - \u003Cdetails>\n    \u003Csummary>\u003Cb>MCP Metrics\u003C\u002Fb>\u003C\u002Fsummary>\n\n    - [MCP Task Completion](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-mcp-task-completion) — evaluate how effectively an MCP-based agent accomplishes a task\n    - [MCP Use](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-mcp-use) — measure how effectively an agent uses its available MCP servers\n    - [Multi-Turn MCP Use](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-multi-turn-mcp-use) — evaluate MCP server usage across conversation turns\n\n    \u003C\u002Fdetails>\n\n  - \u003Cdetails>\n    \u003Csummary>\u003Cb>Multimodal Metrics\u003C\u002Fb>\u003C\u002Fsummary>\n\n    - [Text to Image](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmultimodal-metrics-text-to-image) — evaluate image generation quality based on semantic consistency and perceptual quality\n    - [Image Editing](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmultimodal-metrics-image-editing) — evaluate image editing quality based on semantic consistency and perceptual quality\n    - [Image Coherence](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmultimodal-metrics-image-coherence) — measure how well images align with their accompanying text\n    - [Image Helpfulness](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmultimodal-metrics-image-helpfulness) — evaluate how 
effectively images contribute to user comprehension of the text\n    - [Image Reference](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmultimodal-metrics-image-reference) — evaluate how accurately images are referred to or explained by accompanying text\n\n    \u003C\u002Fdetails>\n\n  - \u003Cdetails>\n    \u003Csummary>\u003Cb>Other Metrics\u003C\u002Fb>\u003C\u002Fsummary>\n\n    - [Hallucination](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-hallucination) — check whether the LLM generates factually correct information against provided context\n    - [Summarization](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-summarization) — evaluate whether summaries are factually correct and include necessary details\n    - [Bias](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-bias) — detect gender, racial, or political bias in LLM outputs\n    - [Toxicity](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-toxicity) — evaluate toxicity in LLM outputs\n    - [JSON Correctness](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-json-correctness) — check whether the output matches an expected JSON schema\n    - [Prompt Alignment](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-prompt-alignment) — measure whether the output aligns with instructions in the prompt template\n\n    \u003C\u002Fdetails>\n\n- 🎯 Supports both end-to-end and component-level LLM evaluation.\n- 🧩 Build your own custom metrics that are automatically integrated with DeepEval's ecosystem.\n- 🔮 Generate both single and multi-turn synthetic datasets for evaluation.\n- 🔗 Integrates seamlessly with **ANY** CI\u002FCD environment.\n- 🧬 Optimize prompts automatically based on evaluation results.\n- 🏆 Easily benchmark **ANY** LLM on popular LLM benchmarks in [under 10 lines of code.](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fbenchmarks-introduction?utm_source=GitHub), including MMLU, HellaSwag, DROP, BIG-Bench Hard, TruthfulQA, HumanEval, GSM8K.\n\n\u003Cbr \u002F>\n\n# 🔌 
Integrations\n\nDeepEval plugs into any LLM framework — OpenAI Agents, LangChain, CrewAI, and more. To scale evals across your team — or let anyone run them without writing code — **Confident AI** gives you a native platform integration.\n\n## Frameworks\n\n- [OpenAI](https:\u002F\u002Fwww.deepeval.com\u002Fintegrations\u002Fframeworks\u002Fopenai?utm_source=GitHub) — evaluate and trace OpenAI applications via a client wrapper\n- [OpenAI Agents](https:\u002F\u002Fwww.deepeval.com\u002Fintegrations\u002Fframeworks\u002Fopenai-agents?utm_source=GitHub) — evaluate OpenAI Agents end-to-end in under a minute\n- [LangChain](https:\u002F\u002Fwww.deepeval.com\u002Fintegrations\u002Fframeworks\u002Flangchain?utm_source=GitHub) — evaluate LangChain applications with a callback handler\n- [LangGraph](https:\u002F\u002Fwww.deepeval.com\u002Fintegrations\u002Fframeworks\u002Flanggraph?utm_source=GitHub) — evaluate LangGraph agents with a callback handler\n- [Pydantic AI](https:\u002F\u002Fwww.deepeval.com\u002Fintegrations\u002Fframeworks\u002Fpydanticai?utm_source=GitHub) — evaluate Pydantic AI agents with type-safe validation\n- [CrewAI](https:\u002F\u002Fwww.deepeval.com\u002Fintegrations\u002Fframeworks\u002Fcrewai?utm_source=GitHub) — evaluate CrewAI multi-agent systems\n- [Anthropic](https:\u002F\u002Fwww.deepeval.com\u002Fintegrations\u002Fframeworks\u002Fanthropic?utm_source=GitHub) — evaluate and trace Claude applications via a client wrapper\n- [AWS AgentCore](https:\u002F\u002Fwww.deepeval.com\u002Fintegrations\u002Fframeworks\u002Fagentcore?utm_source=GitHub) — evaluate agents deployed on Amazon AgentCore\n- [LlamaIndex](https:\u002F\u002Fwww.deepeval.com\u002Fintegrations\u002Fframeworks\u002Fllamaindex?utm_source=GitHub) — evaluate RAG applications built with LlamaIndex\n\n## ☁️ Platform + Ecosystem\n\n[Confident AI](https:\u002F\u002Fwww.confident-ai.com?utm_source=deepeval&utm_medium=github&utm_content=platform_section) is an all-in-one platform that integrates 
natively with DeepEval.\n\n- Manage datasets, trace LLM applications, run evaluations, and monitor responses in production — all from one platform.\n- Don't need a UI? Confident AI can also be your data persistence layer: run evals, pull datasets, and inspect traces straight from Claude Code or Cursor via Confident AI's [MCP server](https:\u002F\u002Fgithub.com\u002Fconfident-ai\u002Fconfident-mcp-server).\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fconfident-mcp-architecture.png\" alt=\"Confident AI MCP Architecture\" width=\"500\">\n\u003C\u002Fp>\n\n\u003Cbr \u002F>\n\n# 🤖 Vibe-Coder QuickStart\n\nWant your coding agent to add evals and fix failures for you? Install the DeepEval skill, point it at your agent, RAG pipeline, or chatbot, and ask it to generate a dataset, write the eval suite, run `deepeval test run`, and iterate on the failing metrics.\n\n[Start with the 5-minute vibe-coder guide](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fvibe-coder-quickstart?utm_source=GitHub).\n\n\u003Cbr \u002F>\n\n# 🚀 Human QuickStart\n\nLet's pretend your LLM application is a RAG-based customer support chatbot; here's how DeepEval can help test what you've built.\n\n## Installation\n\nDeepEval works with **Python>=3.9**.\n\n```\npip install -U deepeval\n```\n\n## Create an account (highly recommended)\n\nUsing the `deepeval` platform will allow you to generate shareable testing reports on the cloud. It is free, takes no additional code to set up, and we highly recommend giving it a try.\n\nTo log in, run:\n\n```\ndeepeval login\n```\n\nFollow the instructions in the CLI to create an account, copy your API key, and paste it into the CLI. 
All test cases will automatically be logged (find more information on data privacy [here](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fdata-privacy?utm_source=GitHub)).\n\n## Write your first test case\n\nCreate a test file:\n\n```bash\ntouch test_chatbot.py\n```\n\nOpen `test_chatbot.py` and write your first test case to run an **end-to-end** evaluation using DeepEval, which treats your LLM app as a black-box:\n\n```python\nimport pytest\nfrom deepeval import assert_test\nfrom deepeval.metrics import GEval\nfrom deepeval.test_case import LLMTestCase, SingleTurnParams\n\ndef test_case():\n    correctness_metric = GEval(\n        name=\"Correctness\",\n        criteria=\"Determine if the 'actual output' is correct based on the 'expected output'.\",\n        evaluation_params=[SingleTurnParams.ACTUAL_OUTPUT, SingleTurnParams.EXPECTED_OUTPUT],\n        threshold=0.5\n    )\n    test_case = LLMTestCase(\n        input=\"What if these shoes don't fit?\",\n        # Replace this with the actual output from your LLM application\n        actual_output=\"You have 30 days to get a full refund at no extra cost.\",\n        expected_output=\"We offer a 30-day full refund at no extra costs.\",\n        retrieval_context=[\"All customers are eligible for a 30 day full refund at no extra costs.\"]\n    )\n    assert_test(test_case, [correctness_metric])\n```\n\nSet your `OPENAI_API_KEY` as an environment variable (you can also evaluate using your own custom model, for more details visit [this part of our docs](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-introduction#using-a-custom-llm?utm_source=GitHub)):\n\n```\nexport OPENAI_API_KEY=\"...\"\n```\n\nAnd finally, run `test_chatbot.py` in the CLI:\n\n```\ndeepeval test run test_chatbot.py\n```\n\n**Congratulations! 
Your test case should have passed ✅** Let's break down what happened.\n\n- The variable `input` mimics a user input, and `actual_output` is a placeholder for what your application's supposed to output based on this input.\n- The variable `expected_output` represents the ideal answer for a given `input`, and [`GEval`](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fmetrics-llm-evals) is a research-backed metric provided by `deepeval` for you to evaluate your LLM's outputs on any custom criteria with human-like accuracy.\n- In this example, the metric `criteria` is correctness of the `actual_output` based on the provided `expected_output`.\n- All metric scores range from 0 to 1, and the `threshold=0.5` setting ultimately determines whether your test has passed.\n\n[Read our documentation](https:\u002F\u002Fdeepeval.com\u002Fdocs\u002Fgetting-started?utm_source=GitHub) for more information!\n\n\u003Cbr \u002F>\n\n## Evals With Full Traceability\n\nUse `evals_iterator()` to run the same dataset through your app, whether you instrument it manually or through one of DeepEval's framework integrations.\n\nHere's an example of manual instrumentation:\n\n```python\nfrom deepeval.tracing import observe, update_current_span\nfrom deepeval.test_case import LLMTestCase\nfrom deepeval.metrics import TaskCompletionMetric\n\n@observe()\ndef inner_component(input: str):\n    output = \"result\"\n    update_current_span(test_case=LLMTestCase(input=input, actual_output=output))\n    return output\n\n@observe()\ndef app(input: str):\n    return inner_component(input)\n\n# This metric will be run on your trace end to end.\nfor golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):\n    app(golden.input)\n```\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>OpenAI\u003C\u002Fb>\u003C\u002Fsummary>\n\n```python\nfrom deepeval.openai import OpenAI\nfrom deepeval.tracing import trace\nfrom deepeval.metrics import TaskCompletionMetric\n\nclient = OpenAI()\n\n# This metric will be run on your 
trace end to end.\nfor golden in dataset.evals_iterator():\n    with trace(metrics=[TaskCompletionMetric()]):\n        client.chat.completions.create(\n            model=\"gpt-4o\",\n            messages=[{\"role\": \"user\", \"content\": golden.input}],\n        )\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>OpenAI Agents\u003C\u002Fb>\u003C\u002Fsummary>\n\n```python\nfrom agents import Runner\nfrom deepeval.metrics import TaskCompletionMetric\n\n# This metric will be run on your trace end to end.\nfor golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):\n    Runner.run_sync(agent, golden.input)\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Anthropic\u003C\u002Fb>\u003C\u002Fsummary>\n\n```python\nfrom deepeval.anthropic import Anthropic\nfrom deepeval.tracing import trace\nfrom deepeval.metrics import TaskCompletionMetric\n\nclient = Anthropic()\n\n# This metric will be run on your trace end to end.\nfor golden in dataset.evals_iterator():\n    with trace(metrics=[TaskCompletionMetric()]):\n        client.messages.create(\n            model=\"claude-sonnet-4-5\",\n            max_tokens=1024,\n            messages=[{\"role\": \"user\", \"content\": golden.input}],\n        )\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>LangChain\u003C\u002Fb>\u003C\u002Fsummary>\n\n```python\nfrom deepeval.integrations.langchain import CallbackHandler\nfrom deepeval.metrics import TaskCompletionMetric\n\n# This metric will be run on your trace end to end.\nfor golden in dataset.evals_iterator():\n    llm.invoke(\n        golden.input,\n        config={\"callbacks\": [CallbackHandler(metrics=[TaskCompletionMetric()])]},\n    )\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>LangGraph\u003C\u002Fb>\u003C\u002Fsummary>\n\n```python\nfrom deepeval.integrations.langchain import CallbackHandler\nfrom deepeval.metrics import TaskCompletionMetric\n\n# This metric will be run on 
your trace end to end.\nfor golden in dataset.evals_iterator():\n    agent.invoke(\n        {\"messages\": [{\"role\": \"user\", \"content\": golden.input}]},\n        config={\"callbacks\": [CallbackHandler(metrics=[TaskCompletionMetric()])]},\n    )\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Pydantic AI\u003C\u002Fb>\u003C\u002Fsummary>\n\n```python\nfrom deepeval.metrics import TaskCompletionMetric\n\n# This metric will be run on your trace end to end.\nfor golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):\n    agent.run_sync(golden.input)\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>CrewAI\u003C\u002Fb>\u003C\u002Fsummary>\n\n```python\nfrom deepeval.integrations.crewai import instrument_crewai\nfrom deepeval.metrics import TaskCompletionMetric\n\ninstrument_crewai()\n\n# This metric will be run on your trace end to end.\nfor golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):\n    crew.kickoff({\"input\": golden.input})\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>AWS AgentCore\u003C\u002Fb>\u003C\u002Fsummary>\n\n```python\nfrom deepeval.integrations.agentcore import instrument_agentcore\nfrom deepeval.metrics import TaskCompletionMetric\n\ninstrument_agentcore()\n\n# This metric will be run on your trace end to end.\nfor golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):\n    invoke({\"prompt\": golden.input})\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>LlamaIndex\u003C\u002Fb>\u003C\u002Fsummary>\n\n```python\nimport asyncio\nfrom deepeval.evaluate.configs import AsyncConfig\nfrom deepeval.metrics import TaskCompletionMetric\n\n# This metric will be run on your trace end to end.\nfor golden in dataset.evals_iterator(\n    async_config=AsyncConfig(run_async=True),\n    metrics=[TaskCompletionMetric()],\n):\n    task = asyncio.create_task(agent.run(golden.input))\n    
dataset.evaluate(task)\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Google ADK\u003C\u002Fb>\u003C\u002Fsummary>\n\n```python\nimport asyncio\nfrom deepeval.evaluate.configs import AsyncConfig\nfrom deepeval.integrations.google_adk import instrument_google_adk\nfrom deepeval.metrics import TaskCompletionMetric\n\ninstrument_google_adk()\n\n# This metric will be run on your trace end to end.\nfor golden in dataset.evals_iterator(\n    async_config=AsyncConfig(run_async=True),\n    metrics=[TaskCompletionMetric()],\n):\n    task = asyncio.create_task(run_agent(golden.input))\n    dataset.evaluate(task)\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Strands\u003C\u002Fb>\u003C\u002Fsummary>\n\n```python\nfrom deepeval.integrations.strands import instrument_strands\nfrom deepeval.metrics import TaskCompletionMetric\n\ninstrument_strands()\n\n# This metric will be run on your trace end to end.\nfor golden in dataset.evals_iterator(metrics=[TaskCompletionMetric()]):\n    agent(golden.input)\n```\n\n\u003C\u002Fdetails>\n\nLearn more about component-level evaluations [here.](https:\u002F\u002Fwww.deepeval.com\u002Fdocs\u002Fevaluation-component-level-llm-evals)\n\n\u003Cbr \u002F>\n\n## Evaluate Without Pytest Integration\n\nAlternatively, you can evaluate without Pytest, which is more suited for a notebook environment.\n\n```python\nfrom deepeval import evaluate\nfrom deepeval.metrics import AnswerRelevancyMetric\nfrom deepeval.test_case import LLMTestCase\n\nanswer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)\ntest_case = LLMTestCase(\n    input=\"What if these shoes don't fit?\",\n    # Replace this with the actual output from your LLM application\n    actual_output=\"We offer a 30-day full refund at no extra costs.\",\n    retrieval_context=[\"All customers are eligible for a 30 day full refund at no extra costs.\"]\n)\nevaluate([test_case], [answer_relevancy_metric])\n```\n\n## Using Standalone 
Metrics\n\nDeepEval is extremely modular, making it easy for anyone to use any of our metrics. Continuing from the previous example:\n\n```python\nfrom deepeval.metrics import AnswerRelevancyMetric\nfrom deepeval.test_case import LLMTestCase\n\nanswer_relevancy_metric = AnswerRelevancyMetric(threshold=0.7)\ntest_case = LLMTestCase(\n    input=\"What if these shoes don't fit?\",\n    # Replace this with the actual output from your LLM application\n    actual_output=\"We offer a 30-day full refund at no extra costs.\",\n    retrieval_context=[\"All customers are eligible for a 30 day full refund at no extra costs.\"]\n)\n\nanswer_relevancy_metric.measure(test_case)\nprint(answer_relevancy_metric.score)\n# All metrics also offer an explanation\nprint(answer_relevancy_metric.reason)\n```\n\nNote that some metrics are for RAG pipelines, while others are for fine-tuning. Make sure to use our docs to pick the right one for your use case.\n\n## A Note on Env Variables (.env \u002F .env.local)\n\nDeepEval auto-loads `.env.local` then `.env` from the current working directory **at import time**.\n**Precedence:** process env -> `.env.local` -> `.env`.\nOpt out with `DEEPEVAL_DISABLE_DOTENV=1`.\n\n```bash\ncp .env.example .env.local\n# then edit .env.local (ignored by git)\n```\n\n# DeepEval With Confident AI\n\n[Confident AI](https:\u002F\u002Fwww.confident-ai.com?utm_source=deepeval&utm_medium=github&utm_content=cli_login_section) is an all-in-one platform to manage datasets, trace LLM applications, and run evaluations in production. Log in from the CLI to get started:\n\n```bash\ndeepeval login\n```\n\nThen run your tests as usual — results are automatically synced to the platform:\n\n```bash\ndeepeval test run test_chatbot.py\n```\n\n![Demo GIF](assets\u002Fdemo.gif)\n\nPrefer to stay in your IDE? 
Use DeepEval via [Confident AI's MCP server](https:\u002F\u002Fgithub.com\u002Fconfident-ai\u002Fconfident-mcp-server) as the persistent layer to run evals, pull datasets, and inspect traces without leaving your editor.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fconfident-mcp-architecture.png\" alt=\"Confident AI MCP Architecture\" width=\"500\">\n\u003C\u002Fp>\n\nEverything on Confident AI is available [here](https:\u002F\u002Fwww.confident-ai.com\u002Fdocs?utm_source=deepeval&utm_medium=github&utm_content=cloud_docs).\n\n\u003Cbr \u002F>\n\n# Contributing\n\nPlease read [CONTRIBUTING.md](https:\u002F\u002Fgithub.com\u002Fconfident-ai\u002Fdeepeval\u002Fblob\u002Fmain\u002FCONTRIBUTING.md) for details on our code of conduct, and the process for submitting pull requests to us.\n\n\u003Cbr \u002F>\n\n# Roadmap\n\nFeatures:\n\n- [x] Integration with Confident AI\n- [x] Implement G-Eval\n- [x] Implement RAG metrics\n- [x] Implement Conversational metrics\n- [x] Evaluation Dataset Creation\n- [x] Red-Teaming\n- [ ] DAG custom metrics\n- [ ] Guardrails\n\n\u003Cbr \u002F>\n\n# Authors\n\nBuilt by the founders of Confident AI. Contact jeffreyip@confident-ai.com for all enquiries.\n\n\u003Cbr \u002F>\n\n# License\n\nDeepEval is licensed under Apache 2.0 - see the [LICENSE.md](https:\u002F\u002Fgithub.com\u002Fconfident-ai\u002Fdeepeval\u002Fblob\u002Fmain\u002FLICENSE.md) file for details.\n","DeepEval is an easy-to-use, open-source LLM evaluation framework designed for evaluating large language model systems. It offers Pytest-like unit testing, optimized specifically for LLM applications. The framework incorporates the latest research and supports a range of evaluation metrics and features, such as accuracy, consistency, and bias detection. DeepEval suits scenarios that require quality control and performance testing of AI-generated content, including but not limited to chatbots, text generation services, and automatic summarization tools. Written in Python, DeepEval offers good portability and ease of use, and its active community support and extensive documentation let developers get started quickly and integrate it into existing projects.",2,"2026-06-11 03:34:58","high_star"]