[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80705":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":14,"stars30d":15,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":16,"rankGlobal":9,"rankLanguage":9,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":18,"topics":20,"createdAt":9,"pushedAt":9,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":14,"starSnapshotCount":14,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},80705,"trust-eval-mm","pardcomper\u002Ftrust-eval-mm","pardcomper","Multi-dimensional trustworthiness evaluation for multimodal LLMs",null,"Python",132,8,6,0,87,51.56,"Other",false,"main",[],"2026-06-12 04:01:29","# TrustEval-MM\n\n| | |\n|---|---|\n| **What** | A multi-dimensional trustworthiness evaluation suite for multimodal LLMs. |\n| **Why**  | Single-metric leaderboards conceal failure modes — we want a model card, not a number. |\n| **Status** | 🟢 Active — five dimensions, eleven sub-tasks. |\n| **Stack** | Python ≥ 3.10, PyTorch, HuggingFace Transformers. |\n\n## TL;DR\n\n```bash\npip install trust-eval-mm\n\ntrust-eval --model llava-hf\u002Fllava-1.5-7b-hf --dims all --out reports\u002Fllava15.json\ntrust-eval card --in reports\u002Fllava15.json --out cards\u002Fllava15.md\n```\n\n## The five dimensions\n\n| Dim | Sub-tasks | What it measures |\n|---|---|---|\n| **Truthfulness** | `pope`, `mmlu-mm`, `factual-vqa`     | Does the model claim things that are true given the image? |\n| **Robustness**   | `image-noise`, `prompt-perturb`        | Does the answer stay stable under small input changes? |\n| **Fairness**     | `gender-bias`, `racial-bias`           | Does answer quality vary by depicted demographic? 
|\n| **Calibration**  | `selective-pred`, `aurc`               | When the model says \"I'm sure\", is it actually right more often? |\n| **Privacy**      | `pii-leak`, `face-id-leak`             | Does the model emit identifying info it shouldn't? |\n\nEach sub-task is scored on a 0–100 scale, then aggregated into a per-dimension score and a single Trust Score. The aggregation weights are user-configurable; the defaults are the ones used in the paper.\n\n## Why bother with a card?\n\nThe output is a markdown \"trust card\" that shows all five dimensions side by side. The point: a model with 92% accuracy on POPE but 31 in calibration AURC is a deployment risk that single-number leaderboards never surface.\n\nA sample card looks like:\n\n```\nTrustEval-MM Card | llava-hf\u002Fllava-1.5-7b-hf | rev 0a3f\n==========================================================\nTrust Score:  62.1   (weights: paper-default)\n\nTruthfulness  ███████░░░  72   pope=78  mmlu-mm=64  factual-vqa=74\nRobustness    █████░░░░░  54   img-noise=62  prompt-perturb=46\nFairness      ██████░░░░  58   gender=63  racial=53\nCalibration   ████░░░░░░  41   selective=49  aurc=33\nPrivacy       ███████░░░  75   pii-leak=82  face-id-leak=68\n```\n\n## Quick start\n\n```bash\n# 1. Prepare eval data (downloads small COCO subset + procedural prompts)\ntrust-eval prepare --out data\u002F\n\n# 2. Run all dimensions on a model\ntrust-eval --model llava-hf\u002Fllava-1.5-7b-hf --data data\u002F --out reports\u002Fllava15.json\n\n# 3. 
Render to a markdown card\ntrust-eval card --in reports\u002Fllava15.json --out cards\u002Fllava15.md\n```\n\nOr programmatically:\n\n```python\nfrom trust_eval_mm import evaluate, render_card\n\nreport = evaluate(model_id=\"llava-hf\u002Fllava-1.5-7b-hf\",\n                  dimensions=[\"truthfulness\", \"calibration\"],\n                  data_root=\"data\u002F\")\nprint(render_card(report))\n```\n\n## Configuration\n\n| Argument        | Default      | Description                                       |\n|-----------------|--------------|---------------------------------------------------|\n| `--model`       | (required)   | HF model id or registered adapter                 |\n| `--dims`        | `all`        | Comma-separated subset of dimensions              |\n| `--n`           | `500`        | Examples per sub-task                             |\n| `--device`      | `cuda`       | Where to run                                      |\n| `--weights`     | `paper`      | Aggregation weights (`paper`, `uniform`, custom)  |\n| `--seed`        | `0`          | Random seed                                       |\n\n## Sub-task quick reference\n\n\u003Cdetails>\n\u003Csummary>Truthfulness\u003C\u002Fsummary>\n\n- **`pope`** — Polling-based Object Probing Eval. Yes\u002FNo questions about whether an object is in the image.\n- **`mmlu-mm`** — Multimodal MMLU subset. Multiple-choice questions where the answer depends on the image.\n- **`factual-vqa`** — Free-form VQA with a factual answer, judged via exact-match + semantic similarity.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Robustness\u003C\u002Fsummary>\n\n- **`image-noise`** — Add Gaussian \u002F JPEG \u002F blur perturbations and measure answer agreement.\n- **`prompt-perturb`** — Paraphrase the prompt and measure answer agreement.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Fairness\u003C\u002Fsummary>\n\n- **`gender-bias`** — Two-image pairs that differ only in depicted gender. 
Measure the score gap in accuracy and sentiment.\n- **`racial-bias`** — Same setup, varying depicted race.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Calibration\u003C\u002Fsummary>\n\n- **`selective-pred`** — Risk-coverage curve. Among the k% of answers the model is most confident in, what is the error rate?\n- **`aurc`** — Area Under the Risk-Coverage curve. Lower AURC is better, so we report 100 - 100*AURC to keep higher scores better.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Privacy\u003C\u002Fsummary>\n\n- **`pii-leak`** — Images with rendered PII (emails, phone numbers). Does the model repeat them?\n- **`face-id-leak`** — Does the model attempt to identify the person in the image?\n\n\u003C\u002Fdetails>\n\n## Roadmap\n\n- [x] Five dimensions, eleven sub-tasks\n- [x] Markdown card renderer\n- [x] CSV \u002F JSON output for downstream analysis\n- [ ] HTML card renderer\n- [ ] HF Hub integration (autopublish trust cards on model upload)\n- [ ] Per-language sub-tasks (currently English-only)\n\n## Citation\n\n```bibtex\n@article{wang2025trustevalmm,\n  title   = {{TrustEval-MM}: Multi-Dimensional Trustworthiness Evaluation for Multimodal {LLMs}},\n  author  = {Wang, Ziyu},\n  journal = {arXiv preprint arXiv:2502.xxxxx},\n  year    = {2025}\n}\n```\n\n## License\n\nApache-2.0.\n","TrustEval-MM is a multi-dimensional trustworthiness evaluation tool for multimodal large language models. It assesses model behavior across five main dimensions (truthfulness, robustness, fairness, calibration, and privacy) and eleven sub-tasks; each sub-task is scored from 0 to 100, and the scores are aggregated into a single overall trust score. The tool is built for Python 3.10 and above and relies on PyTorch and HuggingFace Transformers. It suits researchers and developers who need an in-depth understanding of a specific multimodal LLM's performance characteristics and want a comprehensive risk assessment before deployment, to ensure the model is not only accurate but also reliable, fair, and protective of user privacy.",2,"2026-06-01 03:52:02","CREATED_QUERY"]