[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-108":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":14,"stars30d":14,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":15,"rankGlobal":9,"rankLanguage":9,"license":16,"archived":17,"fork":17,"defaultBranch":18,"hasWiki":17,"hasPages":17,"topics":19,"createdAt":9,"pushedAt":9,"updatedAt":20,"readmeContent":21,"aiSummary":22,"trendingCount":14,"starSnapshotCount":14,"syncStatus":23,"lastSyncTime":24,"discoverSource":25},108,"chainreason","joshawome\u002Fchainreason","joshawome","A benchmark for evaluating LLM reasoning on Ethereum and DeFi tasks",null,"Python",499,343,8,0,7.61,"MIT License",false,"main",[],"2026-06-12 02:00:08","# ChainReason\n\n> A small benchmark for evaluating LLM reasoning on Ethereum and DeFi tasks.\n\n[![Python 3.9+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.9+-blue.svg)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F)\n[![License: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-yellow.svg)](LICENSE)\n\nChainReason is a lightweight evaluation suite that asks language models to do five\nthings a smart-contract engineer or DeFi analyst would consider routine:\n\n1. **`protocol_qa`** — multiple-choice questions about specific DeFi protocol mechanics.\n2. **`vuln_detect`** — classify a Solidity snippet by vulnerability category.\n3. **`contract_class`** — classify a contract from its ABI summary + an optional hint.\n4. **`tx_intent`** — given a sequence of decoded actions, infer the transaction's intent.\n5. 
**`slippage_pred`** — given an AMM pool state and a swap, compute the output amount.\n\nThe point of having five tasks instead of one is that each one stresses a different\ncapability — symbolic reasoning, code understanding, structural pattern recognition,\nnumeric reasoning. A model that's strong on `vuln_detect` but weak on `slippage_pred`\ntells you something different than a model that's strong on both.\n\n## Why another benchmark\n\nExisting benchmarks for Solidity \u002F blockchain LLMs largely focus on either\n(a) code generation, or (b) vulnerability detection. ChainReason adds three\nother axes that I haven't seen consolidated elsewhere:\n\n- **Protocol-level reasoning.** Knowing what `getReserves()` returns is one thing;\n  knowing what happens when you yank 30% of the reserves out of a Uniswap v2 pair\n  is another.\n- **Transaction-graph understanding.** Telling a sandwich apart from a swap from\n  an arbitrage requires looking at the *structure* of an execution trace, not\n  just opcodes.\n- **Numeric grounding.** AMMs have closed-form pricing. If a model gets the\n  CPMM math wrong, it'll be wrong about every downstream task.\n\nThe dataset is small and hand-curated — this is *not* a leaderboard scraper or a\nten-thousand-row crawl of Etherscan. 
The included seed examples are meant to be\nillustrative; you can extend them with your own data via `--data-path`.\n\n## Installation\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fjoshawome\u002Fchainreason\ncd chainreason\npip install -e .\n```\n\nFor local model inference (HuggingFace), also install:\n\n```bash\npip install torch transformers accelerate\n```\n\n## Quick start\n\n```bash\nexport OPENAI_API_KEY=...\npython scripts\u002Frun_eval.py --task protocol_qa --client openai --model gpt-4o-mini --limit 5\n```\n\nOr run a full sweep from a YAML config:\n\n```bash\npython scripts\u002Frun_eval.py --config configs\u002Ffull_run.yaml\npython scripts\u002Faggregate_results.py results\u002Ffull -o results\u002Ffull\u002FSUMMARY.md\n```\n\n## Programmatic use\n\n```python\nfrom chainreason.tasks import get_task\nfrom chainreason.models.openai_client import OpenAIClient\nfrom chainreason.runner import run_eval\n\ntask = get_task(\"vuln_detect\")\nmodel = OpenAIClient(model=\"gpt-4o-mini\")\nsummary = run_eval(task, model, limit=10, output_dir=\"results\u002F\")\nprint(summary[\"metrics\"])\n```\n\n## Tasks\n\n| Task | n (seed) | Output type | Metric |\n|------|----------|-------------|--------|\n| `protocol_qa` | 14 | A\u002FB\u002FC\u002FD | accuracy |\n| `vuln_detect` | 12 | label (1 of 6) | accuracy + macro-F1 |\n| `contract_class` | 14 | label (1 of 11) | accuracy + macro-F1 |\n| `tx_intent` | 14 | label (1 of 14) | accuracy + macro-F1 |\n| `slippage_pred` | 10 | numeric | tiered relative error |\n\nFor baseline numbers, see [results\u002FBASELINES.md](results\u002FBASELINES.md).\n\nThe seed sets are intentionally small. They exist to make sure the benchmark\n*runs* and so you can sanity-check a new model in under a minute. 
Real\nevaluation should use a larger held-out set.\n\n## Adding your own task\n\nSubclass `Task`, implement four methods, register it:\n\n```python\nfrom chainreason.tasks.base import Task, Example\n\nclass MyTask(Task):\n    name = \"my_task\"\n    def load(self): ...\n    def build_prompt(self, ex): ...\n    def parse_response(self, text): ...\n    def score(self, prediction, target): ...\n```\n\nAdd it to `TASK_REGISTRY` in `chainreason\u002Ftasks\u002F__init__.py` and you're set.\n\n## Citation\n\nIf this is useful in your work:\n\n```bibtex\n@misc{yamamoto2025chainreason,\n  author       = {Yamamoto, Joshua},\n  title        = {{ChainReason}: A Benchmark for LLM Reasoning over On-Chain Tasks},\n  year         = {2025},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fjoshawome\u002Fchainreason}}\n}\n```\n\n## License\n\nMIT — see [LICENSE](LICENSE).\n","ChainReason is a benchmark tool for evaluating the reasoning ability of language models on Ethereum and DeFi tasks. It provides five core features: protocol question answering, vulnerability detection, contract classification, transaction intent inference, and slippage prediction, designed to assess model capabilities in symbolic reasoning, code understanding, structural pattern recognition, and numeric reasoning. The project is written in Python and suits scenarios that call for automated handling of smart-contract or DeFi analysis tasks, helping developers quickly evaluate how different models actually perform. With ChainReason, users get a more comprehensive view of model performance than existing benchmarks focused on code generation or vulnerability detection provide.",2,"2026-05-06 17:18:47","CREATED_QUERY"]