[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80021":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":13,"stars30d":14,"stars90d":13,"forks30d":13,"starsTrendScore":13,"compositeScore":15,"rankGlobal":8,"rankLanguage":8,"license":16,"archived":17,"fork":17,"defaultBranch":18,"hasWiki":17,"hasPages":17,"topics":19,"createdAt":8,"pushedAt":8,"updatedAt":20,"readmeContent":21,"aiSummary":22,"trendingCount":13,"starSnapshotCount":13,"syncStatus":12,"lastSyncTime":23,"discoverSource":24},80021,"CiteVQA","opendatalab\u002FCiteVQA","opendatalab",null,"Python",68,5,2,0,1,39.43,"MIT License",false,"main",[],"2026-06-12 04:01:26","# CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F2605.12882\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.12882-b31b1b?style=flat-square&logo=arxiv\" alt=\"arXiv\" \u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopendatalab\u002FCiteVQA\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97_Dataset-HuggingFace-yellow?style=flat-square\" alt=\"Hugging Face dataset\" \u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fwww.modelscope.cn\u002Fdatasets\u002FOpenDataLab\u002FCiteVQA\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDataset_on_ModelScope-purple?logo=data:image\u002Fsvg+xml;base64,PHN2ZyB3aWR0aD0iMjIzIiBoZWlnaHQ9IjIwMCIgeG1sbnM9Imh0dHA6Ly93d3cudzMub3JnLzIwMDAvc3ZnIj4KCiA8Zz4KICA8dGl0bGU+TGF5ZXIgMTwvdGl0bGU+CiAgPHBhdGggaWQ9InN2Z18xNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTAsODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTUiIGZpbGw9IiM2MjRhZmYiIGQ9Im05OS4xNCwxMTUuNDlsMjUuNjUsMGwwLDI1LjY1bC0yNS42NSwwbDAsLTI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTYiIGZpbGw9IiM2MjRhZmYiIGQ9Im0xNzYuMDksMTQxLjE0bC0yNS42NDk5OSwwbDAsMjIuMTlsNDcuODQsMGwwLC00Ny44NGwtMjIuMTksMGwwLDI1LjY1eiIvPgogIDxwYXRoIGlkPSJzdmdfMTciIGZpbGw9IiMzNmNmZDEiIGQ9Im0xMjQuNzksODkuODRsMjUuNjUsMGwwLDI1LjY0OTk5bC0yNS42NSwwbDAsLTI1LjY0OTk5eiIvPgogIDxwYXRoIGlkPSJzdmdfMTgiIGZpbGw9IiMzNmNmZDEiIGQ9Im0wLDY0LjE5bDI1LjY1LDBsMCwyNS42NWwtMjUuNjUsMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzE5IiBmaWxsPSIjNjI0YWZmIiBkPSJtMTk4LjI4LDg5Ljg0bDI1LjY0OTk5LDBsMCwyNS42NDk5OWwtMjUuNjQ5OTksMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIwIiBmaWxsPSIjMzZjZmQxIiBkPSJtMTk4LjI4LDY0LjE5bDI1LjY0OTk5LDBsMCwyNS42NWwtMjUuNjQ5OTksMGwwLC0yNS42NXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIxIiBmaWxsPSIjNjI0YWZmIiBkPSJtMTUwLjQ0LDQybDAsMjIuMTlsMjUuNjQ5OTksMGwwLDI1LjY1bDIyLjE5LDBsMCwtNDcuODRsLTQ3Ljg0LDB6Ii8+CiAgPHBhdGggaWQ9InN2Z18yMiIgZmlsbD0iIzM2Y2ZkMSIgZD0ibTczLjQ5LDg5Ljg0bDI1LjY1LDBsMCwyNS42NDk5OWwtMjUuNjUsMGwwLC0yNS42NDk5OXoiLz4KICA8cGF0aCBpZD0ic3ZnXzIzIiBmaWxsPSIjNjI0YWZmIiBkPSJtNDcuODQsNjQuMTlsMjUuNjUsMGwwLC0yMi4xOWwtNDcuODQsMGwwLDQ3Ljg0bDIyLjE5LDBsMCwtMjUuNjV6Ii8+CiAgPHBhdGggaWQ9InN2Z18yNCIgZmlsbD0iIzYyNGFmZiIgZD0ibTQ3Ljg0LDExNS40OWwtMjIuMTksMGwwLDQ3Ljg0bDQ3Ljg0LDBsMCwtMjIuMTlsLTI1LjY1LDBsMCwtMjUuNjV6Ii8+CiA8L2c+Cjwvc3ZnPg==&labelColor=white&style=flat-square\" alt=\"Dataset on ModelScope\" \u002F>\u003C\u002Fa>\n  \u003Ca href=\".\u002FLICENSE.txt\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-green?style=flat-square\" alt=\"License MIT\" 
\u002F>\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cdiv align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2605.12882\">\n    \u003Cimg src=\"img\u002Fhuggingface_paper_gold_day.svg\"\u002F>\n  \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n  📖 \u003Ca href=\".\u002FREADME.md\">\u003Cb>English\u003C\u002Fb>\u003C\u002Fa> &nbsp;|&nbsp; \u003Ca href=\".\u002FREADME_zh.md\">\u003Cb>简体中文\u003C\u002Fb>\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cb>If you like our project, please give us a star ⭐ on GitHub to follow the latest updates.\u003C\u002Fb>\n\u003C\u002Fp>\n\n---\n## 🔎 Overview\n\n**CiteVQA** is a document visual question answering benchmark for **faithful evidence attribution**. Unlike conventional DocVQA datasets that only score the final answer, CiteVQA requires a model to answer a question with evidence grounded in the source document at the **element level**. The benchmark is designed to evaluate whether a system can not only answer correctly, but also cite the right supporting region in long, real-world PDFs.\n\nThe dataset contains **1,897 questions** built from **711 PDFs** across **7 macro-domains** and **30 sub-domains**, with an average of **40.6 pages per document**. It covers both **English** and **Chinese** documents, and includes **single-document** as well as **multi-document** settings.\n\nThe evaluation covers three dataset types:\n\n- **Single-Doc**: Single-document question answering.\n- **Multi (1-Gold)**: Multi-document QA with exactly one gold document.\n- **Multi (N-Gold)**: Multi-document QA with multiple gold documents.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fimg\u002Fcitevqa_example.png\" width=\"92%\" alt=\"CiteVQA overview\">\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n  \u003Cem>\n    Overview of CiteVQA. 
Left: a prediction is counted as correct only when the answer is correct and the cited evidence region is both relevant and spatially aligned with the gold evidence under Strictly Attributed Accuracy (SAA). Right top: dataset statistics show that CiteVQA emphasizes long, realistic PDFs. Right bottom: existing MLLMs exhibit a substantial gap between answer accuracy and evidence-grounded accuracy.\n  \u003C\u002Fem>\n\u003C\u002Fp>\n\n\n## ✨ Highlights\n\n- **Joint answer-and-evidence evaluation**: Evaluates both answer correctness and citation faithfulness.\n- **Element-level evidence**: Structured gold evidence features bounding boxes, page, and document indices.\n- **Long-document setting**: Focuses on multi-page PDFs with realistic lengths and complex layouts.\n- **Cross-domain and bilingual**: Spans **7 domains**, **30 sub-domains**, and two languages (`en`, `zh`).\n- **Multi-document reasoning**: Features cross-document questions that require evidence aggregation.\n- **Three evaluation settings**: Supports `Single-Doc`, `Multi (1-Gold)`, and `Multi (N-Gold)`.\n\n## ⚙️ Setup\n\nInstall dependencies:\n\n```bash\npip install -r requirements.txt\n```\n\nOptional CJK font configuration for PDF rendering:\n\n\u003Cdetails>\n\u003Csummary>Expand font setup for Chinese PDFs\u003C\u002Fsummary>\n\n```bash\napt install fonts-noto-cjk poppler-data\n\ncat > \u002Fetc\u002Ffonts\u002Fconf.d\u002F99-pdf-cjk.conf \u003C\u003C 'EOF'\n\u003C?xml version=\"1.0\"?>\n\u003C!DOCTYPE fontconfig SYSTEM \"fonts.dtd\">\n\u003Cfontconfig>\n  \u003Calias>\u003Cfamily>STSong-Light\u003C\u002Ffamily>\u003Cprefer>\u003Cfamily>Noto Serif CJK SC\u003C\u002Ffamily>\u003C\u002Fprefer>\u003C\u002Falias>\n  \u003Calias>\u003Cfamily>STSong\u003C\u002Ffamily>\u003Cprefer>\u003Cfamily>Noto Serif CJK SC\u003C\u002Ffamily>\u003C\u002Fprefer>\u003C\u002Falias>\n  \u003Calias>\u003Cfamily>SimSun\u003C\u002Ffamily>\u003Cprefer>\u003Cfamily>Noto Serif CJK 
SC\u003C\u002Ffamily>\u003C\u002Fprefer>\u003C\u002Falias>\n  \u003Calias>\u003Cfamily>FangSong\u003C\u002Ffamily>\u003Cprefer>\u003Cfamily>Noto Serif CJK SC\u003C\u002Ffamily>\u003C\u002Fprefer>\u003C\u002Falias>\n  \u003Calias>\u003Cfamily>KaiTi\u003C\u002Ffamily>\u003Cprefer>\u003Cfamily>Noto Serif CJK SC\u003C\u002Ffamily>\u003C\u002Fprefer>\u003C\u002Falias>\n  \u003Calias>\u003Cfamily>SimHei\u003C\u002Ffamily>\u003Cprefer>\u003Cfamily>Noto Sans CJK SC\u003C\u002Ffamily>\u003C\u002Fprefer>\u003C\u002Falias>\n  \u003Calias>\u003Cfamily>Microsoft YaHei\u003C\u002Ffamily>\u003Cprefer>\u003Cfamily>Noto Sans CJK SC\u003C\u002Ffamily>\u003C\u002Fprefer>\u003C\u002Falias>\n\u003C\u002Ffontconfig>\nEOF\n\nfc-cache -f\n```\n\n\u003C\u002Fdetails>\n\n## 📦 Data\n\nFrom the repository root, you can fetch the benchmark files from Hugging Face into `data\u002F`, then download the source PDFs:\n\n```bash\npip install -U \"huggingface_hub[cli]\"\nhf download opendatalab\u002FCiteVQA --repo-type dataset --local-dir .\npython data\u002Fdownload\u002Fdownload_pdfs.py --workers 16 --out data\u002Fpdf --csv data\u002Fdownload\u002Fpdf_source.csv\n```\n\nFrom the repository root, you can also fetch the benchmark files from ModelScope into `data\u002F`, then download the source PDFs:\n\n```bash\npip install -U modelscope\nmodelscope download --dataset OpenDataLab\u002FCiteVQA --local_dir .\npython data\u002Fdownload\u002Fdownload_pdfs.py --workers 16 --out data\u002Fpdf --csv data\u002Fdownload\u002Fpdf_source.csv\n```\n\nThe PDF downloader reads `data\u002Fdownload\u002Fpdf_source.csv` and saves all files to `data\u002Fpdf\u002F`.\n\nIf you run into dataset or download issues, jump to the [Contact](#contact) section.\n\n\u003Cdetails>\n\u003Csummary>Download Arguments\u003C\u002Fsummary>\n\n| Option | Default | Description |\n| --- | --- | --- |\n| `--csv` | `pdf_source.csv` | CSV file containing PDF URLs |\n| `--out` | `pdf` | Output directory |\n| `--workers` | `16` | Concurrent 
download workers |\n| `--timeout` | `120` | Timeout per file in seconds |\n| `--retries` | `3` | Retry count |\n| `--no-skip` | - | Re-download existing files |\n\n\u003C\u002Fdetails>\n\n## 🚀 Inference and Evaluation\n\n`bash run.sh` provides a demo for evaluating `GPT-5.4`. Edit the API settings in `run.sh`, then run:\n\n```bash\nbash run.sh\n```\n\nReference workflow:\n\n```bash\n# API config\nAPI_TYPE=openai\nAPI_KEY=YOUR_API_KEY\nBASE_URL=YOUR_BASE_URL\n\n# Inference\npython infer\u002Frun.py \\\n  --api ${API_TYPE} \\\n  --model MODEL_NAME \\\n  --base_url ${BASE_URL} \\\n  --api_key ${API_KEY} \\\n  --workers 4 \\\n  --out outputs\u002Finfer\u002FMODEL_NAME.json\n\n# Evaluation\npython eval\u002Frun.py \\\n  --judge_api ${API_TYPE} \\\n  --judge_model JUDGE_MODEL_NAME \\\n  --judge_api_key ${API_KEY} \\\n  --base_url ${BASE_URL} \\\n  --input outputs\u002Finfer\u002FMODEL_NAME.json \\\n  --out outputs\u002Feval\u002FMODEL_NAME.json \\\n  --workers 24\n\n# Summary\npython eval\u002Fsummarize.py \\\n  --input outputs\u002Feval\u002FMODEL_NAME.json \\\n  --out_dir outputs\u002Feval\u002FMODEL_NAME\n```\n\n### 🧭 Inference Arguments\n\n\u003Cdetails>\n\u003Csummary>Inference Arguments\u003C\u002Fsummary>\n\n| Option | Required | Description |\n| --- | --- | --- |\n| `--api` | Yes | `openai`, `genai`, or `anthropic` |\n| `--model` | Yes | Model name |\n| `--api_key` | Yes | API key |\n| `--base_url` | No | API base URL |\n| `--workers` | No | Number of workers, default `4` |\n| `--out` | No | Output JSON path |\n| `--benchmark` | No | Benchmark path, default `data\u002Fdata_items.json` |\n| `--limit` | No | Sample limit, `0` means all |\n| `--max_pdf_mb` | No | Compress PDFs larger than this size in MB |\n\n\u003C\u002Fdetails>\n\n### 📏 Evaluation Arguments\n\n\u003Cdetails>\n\u003Csummary>Evaluation Arguments\u003C\u002Fsummary>\n\n| Option | Required | Description |\n| --- | --- | --- |\n| `--input` | Yes | Inference output JSON |\n| `--judge_api` | No | Judge 
API type, default `openai` |\n| `--judge_model` | No | Judge model name, default `gpt-4o` |\n| `--judge_api_key` | Yes | Judge API key |\n| `--base_url` | No | API base URL |\n| `--metrics` | No | Metrics list, default `recall,rel` |\n| `--workers` | No | Number of workers |\n| `--out` | No | Output JSON path |\n| `--limit` | No | Sample limit |\n\n\u003C\u002Fdetails>\n\n## 🗂️ Repository Structure\n\n```text\nCiteVQA\u002F\n├── data\u002F\n│   ├── validation\u002F\n│   │   └── CiteVQA.json         # Benchmark QA pairs\n│   ├── pdf\u002F                     # Downloaded PDFs\n│   └── download\u002F\n│       ├── pdf_source.csv       # PDF metadata & URLs\n│       └── download_pdfs.py     # PDF download script\n├── infer\u002F\n│   └── run.py                   # Inference script\n├── eval\u002F\n│   ├── run.py                   # Evaluation script\n│   └── summarize.py             # Summary table generator\n├── prompts\u002F                     # System & user prompts\n├── outputs\u002F                     # Inference & evaluation outputs\n├── requirements.txt\n└── run.sh                       # Demo script\n```\n\n## 📊 Evaluation Metrics\n\n| Metric | Meaning |\n| --- | --- |\n| `Recall` | Whether predicted evidence overlaps with crucial ground-truth evidence |\n| `Relevance (Rel.)` | Whether the cited evidence semantically supports the answer |\n| `Answer Correctness (Ans.)` | Whether the answer is correct |\n| `SAA` | Strict Attributed Accuracy: answer and evidence must both be valid |\n| `Page Recall` | Whether the correct page is identified |\n| `Precision \u002F F1` | Precision and overlap quality of predicted evidence |\n\n`SAA` is the core metric of CiteVQA.\n\n## 🏆 Evaluation Result\n\nWe evaluated 20 state-of-the-art MLLMs on CiteVQA using a unified prompt template. 
The results show that faithful evidence attribution remains substantially harder than answer-only scoring.\n\n- **Best overall SAA**: `Gemini-3.1-Pro-Preview` reaches **76.0** SAA with **86.1** answer score.\n- **Best answer accuracy**: `GPT-5.4` reaches **87.1** answer score, but its SAA drops to **59.0**.\n- **Best open-source model**: `Qwen3-VL-235B-A22B` reaches **22.5** SAA with **72.3** answer score.\n- **Key finding**: a large gap between `Ans.` and `SAA` appears across models, highlighting the benchmark's `Attribution Hallucination` challenge.\n\nFull overall results:\n\n| Model | Category | Rec. | Rel. | Ans. | SAA |\n| --- | --- | ---: | ---: | ---: | ---: |\n| Gemini-3.1-Pro-Preview | Closed-source MLLMs | 66.0 | 83.6 | 86.1 | 76.0 |\n| Gemini-3-Flash-Preview | Closed-source MLLMs | 45.4 | 75.7 | 84.5 | 65.4 |\n| GPT-5.4 | Closed-source MLLMs | 31.0 | 67.5 | 87.1 | 59.0 |\n| Gemini-2.5-Pro | Closed-source MLLMs | 27.4 | 59.8 | 82.2 | 47.0 |\n| Seed2.0-Pro | Closed-source MLLMs | 28.5 | 54.9 | 81.3 | 44.1 |\n| GPT-5.2 | Closed-source MLLMs | 18.2 | 56.6 | 71.5 | 33.7 |\n| Qwen3.6-Plus | Closed-source MLLMs | 7.7 | 25.0 | 85.9 | 17.5 |\n| GLM-5V-Turbo | Closed-source MLLMs | 14.9 | 29.2 | 49.6 | 12.8 |\n| Qwen3-VL-235B-A22B | Open-source Large MLLMs | 11.3 | 35.3 | 72.3 | 22.5 |\n| Gemma-4-31B | Open-source Large MLLMs | 11.6 | 35.0 | 69.8 | 20.2 |\n| Kimi-K2.5 | Open-source Large MLLMs | 6.2 | 26.8 | 74.3 | 19.1 |\n| Qwen3.5-397B-A17B | Open-source Large MLLMs | 5.4 | 24.6 | 76.5 | 18.3 |\n| Qwen3.5-27B | Open-source Large MLLMs | 5.3 | 25.3 | 75.6 | 17.3 |\n| Qwen3-VL-32B | Open-source Large MLLMs | 6.6 | 30.5 | 72.3 | 17.3 |\n| Qwen3.5-122B-A10B | Open-source Large MLLMs | 3.9 | 19.0 | 73.6 | 14.8 |\n| Qwen3.5-9B | Open-source Small MLLMs | 1.6 | 14.7 | 65.0 | 11.1 |\n| Qwen3.5-35B-A3B | Open-source Small MLLMs | 1.7 | 13.7 | 76.4 | 10.7 |\n| Qwen3-VL-30B-A3B | Open-source Small MLLMs | 3.5 | 14.6 | 62.2 | 8.2 |\n| Qwen3-VL-8B | Open-source Small MLLMs 
| 1.0 | 14.7 | 61.2 | 7.5 |\n| Gemma-4-26B-A4B | Open-source Small MLLMs | 3.0 | 17.9 | 48.4 | 6.2 |\n\n\u003Ca id=\"contact\">\u003C\u002Fa>\n## 📬 Contact\n\nSince the PDF sources are downloaded from external links, issues such as broken links or data accessibility problems may occur during download. If you encounter any download-related problems, please email [wzr@stu.pku.edu.cn](mailto:wzr@stu.pku.edu.cn).\n\n## 📚 Citation\n\n```bibtex\n@article{ma2026citevqa,\n  title={CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence},\n  author={Ma, Dongsheng and Li, Jiayu and Wang, Zhengren and Wang, Yijie and Kong, Jiahao and Zeng, Weijun and Xiao, Jutao and Yang, Jie and Zhang, Wentao and Wang, Bin and He, Conghui},\n  journal={arXiv preprint arXiv:2605.12882},\n  year={2026}\n}\n```\n\n## 🙏 Acknowledgements\n\n- [MinerU](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FMinerU) for document parsing.\n- [ViDoRe V3](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fvidore\u002Fvidore-benchmark-v3) and other open-source datasets (SPIQA, MedQA, PubMedQA, MaintNorm, PolicyBench) for inspiring our benchmark construction.\n\n## 📄 License\n\nThis project is licensed under the MIT License. See the [LICENSE](.\u002FLICENSE) file for details.\n\n## ©️ Copyright Notice\n\nCiteVQA is provided for academic research and non-commercial use only. We fully respect the rights of original copyright holders. If any rights holder believes that the inclusion, indexing, or use of any relevant content in this benchmark is inappropriate, please contact `OpenDataLab@pjlab.org.cn`. We will verify the request and remove or update the relevant content when appropriate.\n","CiteVQA is a document visual question answering benchmark focused on trustworthy evidence attribution. Its core function is to evaluate how accurately and reliably a model cites evidence from the document when answering a question, rather than only whether the final answer is correct. It is developed in Python and uses datasets hosted on Hugging Face and ModelScope. CiteVQA is particularly suited to applications that need more interpretable and transparent models for document understanding tasks, such as legal document parsing and medical report analysis.","2026-06-11 03:58:55","CREATED_QUERY"]