[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-76189":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":15,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":19,"hasPages":19,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":15,"starSnapshotCount":15,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},76189,"IndustryBench","alibaba-multimodal-industrial-ai\u002FIndustryBench","alibaba-multimodal-industrial-ai","A multi-lingual benchmark for evaluating industrial domain knowledge of LLMs.",null,"Python",152,10,8,1,0,78,50.92,"MIT License",false,"main",[],"2026-06-12 04:01:20","# IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs\n\n[📝Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.10267) | [🤗HuggingFace Data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Falibaba-multimodal-industrial-ai\u002FIndustryBench)  \n\n**Source-grounded industrial procurement QA** for LLMs: each item is tied to a **Chinese national standard (GB\u002FT)** or a **structured industrial product record**, with **human-reviewed** English, Russian, and Vietnamese renderings **aligned to the same item ids** as the Chinese source.\n\n| | |\n|:---|:---|\n| **Items** | 2,049 |\n| **Languages** | Chinese (source) + EN \u002F RU \u002F VI (aligned) |\n| **Labels** | 7 capability dimensions · 10 industry categories · panel-derived difficulty (easy \u002F medium \u002F hard) |\n| **Sources** | GB\u002FT excerpts + industrial product records (see paper §3) |\n| **Paper** | [arXiv:2605.10267](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.10267) |\n| **Dataset** 
| [`alibaba-multimodal-industrial-ai\u002FIndustryBench`](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Falibaba-multimodal-industrial-ai\u002FIndustryBench) on Hugging Face |\n\n**Evaluation idea (paper §4):** models answer **closed-book** from the question only; a **calibrated LLM judge** scores **raw** correctness on **0–3**; a **separate safety-violation (SV)** check uses the **source excerpt** (`knowledge_text`). SV hits can zero the effective score—see paper for the full protocol and human calibration (**κ_w ≈ 0.798** on the judge sample).\n\n---\n\n## Who this repo is for\n\n| You want… | Do this |\n|:---|:---|\n| **Only the data** | Use Hugging Face below—no clone required. |\n| **The same scoring pipeline as the paper** | Clone this repo, export a CSV, run `evaluate.py` (below). |\n\n---\n\n## 1. Load the dataset (most users)\n\n```bash\npip install datasets\n```\n\n```python\nfrom datasets import load_dataset\n\nds = load_dataset(\"alibaba-multimodal-industrial-ai\u002FIndustryBench\", split=\"train\")\n# e.g. inspect\nprint(ds[0].keys())\n```\n\nThe Hugging Face UI may show a small **metadata** table (language, license, task tags, etc.) if the dataset `README.md` on the Hub starts with a **`---` YAML block**. That block is optional; a YAML-free template lives in this repo at [`huggingface\u002FREADME.md`](huggingface\u002FREADME.md) for you to paste on the Hub if you want that table gone.\n\nTypical columns include `question` \u002F `answer` (Chinese), `question_en` \u002F `answer_en`, `question_ru` \u002F `answer_ru`, `question_vi` \u002F `answer_vi`, `knowledge_text`, `capability`, `difficulty`, `domain`, `industry_primary`, etc. Full schema is documented in the **paper appendix** and on the **HF dataset card** body (markdown below any YAML).\n\n---\n\n## 2. 
Reproduce the released evaluation script\n\n**Prereqs:** Python 3.10+, `pip install -r requirements.txt`, and an **OpenAI-compatible** HTTP API (any host that exposes `POST …\u002Fv1\u002Fchat\u002Fcompletions`).\n\n**Steps**\n\n1. Export the HF split to CSV (path can be anything; used as `--data-path`):\n\n   ```python\n   from datasets import load_dataset\n   load_dataset(\"alibaba-multimodal-industrial-ai\u002FIndustryBench\", split=\"train\").to_csv(\"industrybench.csv\")\n   ```\n\n2. Set an API key (`--api-key` **or** env `OPENAI_API_KEY` **or** `DASHSCOPE_API_KEY`).\n\n3. Run (example: DashScope-compatible base + Qwen):\n\n   ```bash\n   python evaluate.py \\\n     --data-path industrybench.csv \\\n     --language zh \\\n     --api-base https:\u002F\u002Fdashscope.aliyuncs.com\u002Fcompatible-mode\u002Fv1 \\\n     --model qwen3-max\n   ```\n\n   - **`--api-base`** — Root URL that ends with **`\u002Fv1`**. The script appends **`\u002Fchat\u002Fcompletions`** itself. It is **not** the model name.\n   - **`--model`** — Model that **answers** the questions.\n   - **`--judge-model`** — Optional; defaults to `--model`. Set to your judge (e.g. `qwen3-max`) if the answer model differs.\n\n4. Results and checkpoints go under `results\u002F` by default. See `python evaluate.py --help`.\n\n---\n\n## 3. What’s in this repository\n\n| Path | Role |\n|:---|:---|\n| `evaluate.py` | End-to-end multilingual runner: generation → LLM judge (0–3) → optional safety review → CSV. |\n| `requirements.txt` | Minimal Python deps for `evaluate.py`. |\n| `huggingface\u002FREADME.md` | Suggested **Hub dataset card** (no YAML frontmatter); paste on HF to drop the auto metadata table. 
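|\n| `LICENSE` | MIT |\n\nLarge raw CSVs are **not** stored in git; the canonical release is **Hugging Face**.\n\n---\n\n## Citation\n\n```bibtex\n@misc{bai2026industrybenchprobingindustrialknowledge,\n  title={IndustryBench: Probing the Industrial Knowledge Boundaries of LLMs},\n  author={Songlin Bai and Xintong Wang and Linlin Yu and Bin Chen and Zhiang Xu and Yuyang Sheng and Changtong Zan and Xiaofeng Zhu and Yizhe Zhang and Jiru Li and Mingze Guo and Ling Zou and Yalong Li and Chengfu Huo and Liang Ding},\n  year={2026},\n  eprint={2605.10267},\n  archivePrefix={arXiv},\n  primaryClass={cs.AI},\n  url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.10267},\n}\n```\n","IndustryBench is a multilingual benchmark for evaluating how well large language models command industrial domain knowledge. Developed in Python, the project contains 2,049 items, each tied to a Chinese national standard (GB\u002FT) or a structured industrial product record, with human-reviewed English, Russian, and Vietnamese versions. Each item is annotated across 7 capability dimensions and 10 industry categories and carries a difficulty level (easy\u002Fmedium\u002Fhard). It suits researchers and developers who need to verify how their AI systems handle industrial procurement QA tasks without reference materials. The project also provides a scoring pipeline, including a calibrated LLM judge that assesses answer accuracy and a safety-violation check.",2,"2026-06-11 03:54:46","CREATED_QUERY"]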