[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80061":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":11,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":11,"stars30d":14,"stars90d":13,"forks30d":13,"starsTrendScore":13,"compositeScore":15,"rankGlobal":8,"rankLanguage":8,"license":16,"archived":17,"fork":17,"defaultBranch":18,"hasWiki":17,"hasPages":17,"topics":19,"createdAt":8,"pushedAt":8,"updatedAt":20,"readmeContent":21,"aiSummary":22,"trendingCount":13,"starSnapshotCount":13,"syncStatus":14,"lastSyncTime":23,"discoverSource":24},80061,"OmniDoc-TokenBench","alibaba\u002FOmniDoc-TokenBench","alibaba",null,"Python",64,1,62,0,2,0.9,"Apache License 2.0",false,"main",[],"2026-06-12 02:03:57","\u003Cdiv align=\"center\">\n\n\u003Ch2>OmniDoc-TokenBench\u003C\u002Fh2>\n\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-blue.svg)](LICENSE)\n[![HF Dataset](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHF%20Dataset-OmniDoc--TokenBench-yellow?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Falibabagroup\u002FOmniDoc-TokenBench)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-QwenImageVAE2.0-B31B1B?logo=arxiv&logoColor=white)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.13565)\n\n\u003C\u002Fdiv>\n\n---\n\n## 📄 Overview\n\n### Introduction\n\n> 🤗 **Dataset Download**: [https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Falibabagroup\u002FOmniDoc-TokenBench](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Falibabagroup\u002FOmniDoc-TokenBench)\n\nWe propose **OmniDoc-TokenBench** in [Qwen-Image-VAE-2.0](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.13565), a curated benchmark specifically designed to evaluate VAE reconstruction on text-rich document images. 
It contains ~3K samples spanning nine categories (*book*, *slides*, *color textbook*, *exam paper*, *academic paper*, *magazine*, *financial report*, *newspaper*, *note*) in both English and Chinese, alongside an evaluation toolkit supporting PSNR, SSIM, LPIPS, FID, and OCR-based NED metrics.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fbench.png\" alt=\"OmniDoc-TokenBench\" width=\"80%\" \u002F>\n\u003C\u002Fp>\n\nWe develop OmniDoc-TokenBench based on [OmniDocBench](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FOmniDocBench). First, we crop each sample from a text block and resize it to 256×256. We then keep only crops whose character count falls within a fixed range ([200, 600] for Chinese, [300, 600] for English), which ensures a reference font size of approximately 16px and 10px, respectively. Finally, we deduplicate via n-gram overlap and manually inspect the samples for quality.\n\n\n### Evaluation Metric\n\nBeyond traditional metrics (PSNR, SSIM, LPIPS, FID), we use **NED** (Normalized Edit Distance) as the primary text-fidelity metric. NED directly measures text preservation by comparing the character sequences recognized by OCR in the original and reconstructed images using Levenshtein distance:\n\n$$\n\\mathrm{NED} = \\frac{1}{N}\\sum_{i=1}^{N}\\left(1 - \\frac{d_{\\mathrm{edit}}(s_{\\mathrm{gt}}^{(i)}, s_{\\mathrm{recon}}^{(i)})}{\\max(|s_{\\mathrm{gt}}^{(i)}|, |s_{\\mathrm{recon}}^{(i)}|)}\\right)\n$$\n\nBecause the normalized distance is subtracted from 1, NED lies in [0, 1] and higher values indicate better text preservation. NED is sensitive to semantic corruption such as character substitutions, making it a necessary complement when traditional metrics alone are insufficient.\n\n---\n\n## 📊 Performance\n\nWe conduct a comprehensive evaluation on OmniDoc-TokenBench (~3K text-rich images, 256×256 resolution). 
Models are grouped by spatial compression factor and sorted by NED within each group.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fresults.png\" alt=\"Eval-Results\" width=\"65%\" \u002F>\n\u003C\u002Fp>\n\nOur Qwen-Image-VAE-2.0 achieves state-of-the-art reconstruction across all compression ratios. The f16c128 variant attains SSIM **0.9706** and PSNR **30.45 dB**, surpassing the best f8 baseline (FLUX.1-dev at 0.9364 \u002F 26.24 dB) despite 2× higher spatial compression. In terms of text fidelity (NED), f16c128 reaches **0.9617**, exceeding all evaluated VAEs. Even under extreme f32 compression, our f32c192 achieves NED **0.8555**, surpassing multiple f16 baselines.\n\n---\n\n## ⚡ Evaluation\n\n### Installation\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Falibaba\u002FOmniDoc-TokenBench.git\ncd OmniDoc-TokenBench\n\npip install torch torchvision piq lpips pytorch-fid pillow numpy tqdm\npip install paddleocr python-Levenshtein  # required for NED\n```\n\n### Download the Dataset\n\n```bash\n# Make sure the hf CLI is installed\ncurl -LsSf https:\u002F\u002Fhf.co\u002Fcli\u002Finstall.sh | bash\n# Download the dataset\nhf download alibabagroup\u002FOmniDoc-TokenBench --repo-type=dataset --local-dir .\u002Fdataset_download\n# Move benchmark images to project root as gt_dir\nmv .\u002Fdataset_download\u002Fdata .\u002Fgt_dir && rm -rf .\u002Fdataset_download\n```\n\n### Reconstruct Images with Your VAE\n\nWe provide an example reconstruction script at the project root. 
Edit `example_recon.py` to set your VAE model path, then run:\n\n```bash\npython example_recon.py  # uses FLUX.1-dev as an example\n```\n\n### Compute Metrics\n\nPlace your ground-truth images in `gt_dir\u002F` and reconstructed images in `recon_dir\u002F` (filenames must match one-to-one).\n\n```bash\n# Compute NED only (default)\npython eval_metrics.py --gt_dir .\u002Fgt_dir --recon_dir .\u002Frecon_dir\n\n# Compute traditional metrics (PSNR \u002F SSIM \u002F LPIPS \u002F FID)\npython eval_metrics.py --gt_dir .\u002Fgt_dir --recon_dir .\u002Frecon_dir --mode pixel\n\n# Compute all metrics\npython eval_metrics.py --gt_dir .\u002Fgt_dir --recon_dir .\u002Frecon_dir --mode all\n\n# Specify output directory and device\npython eval_metrics.py --gt_dir .\u002Fgt_dir --recon_dir .\u002Frecon_dir --save_path .\u002Fresults --device cuda\n```\n\n### Output\n\nThe script writes results (shown here for FLUX.1-dev) to the `--save_path` directory (default: `.\u002Feval_results`):\n\n- `results.json` --- Aggregated metrics:\n  ```json\n  {\n    \"num_samples\": 3042,\n    \"PSNR\": 26.2377,\n    \"SSIM\": 0.9364,\n    \"LPIPS\": 0.0247,\n    \"FID\": 0.5543,\n    \"NED\": 0.9546\n  }\n  ```\n\n- `ned_details.json` --- Per-image OCR results and NED scores (generated when `--mode` is `ned` or `all`):\n  ```json\n  {\n    \"avg_ned\": 0.9546,\n    \"total_samples\": 3042,\n    \"valid_samples\": 3042,\n    \"details\": [\n      {\n        \"file\": \"0001.png\",\n        \"gt_ocr\": \"ocr output from gt image...\",\n        \"recon_ocr\": \"ocr output from recon image...\",\n        \"ned\": 0.9764\n      }\n    ]\n  }\n  ```\n\n### Notes\n\n- `FID` and `LPIPS` require downloading model checkpoints on the first run (InceptionV3 ~90MB for FID, VGG16 ~530MB for LPIPS). Ensure network access or pre-download the weight files.\n- PaddleOCR defaults to CPU inference. 
For large-scale evaluation, consider switching to GPU by setting `device=\"gpu\"` in `compute_ned()`.\n- The progress bar for PSNR\u002FSSIM\u002FLPIPS displays running means in real time.\n\n---\n\n## 📝 Citation\n\nIf you use OmniDoc-TokenBench or this evaluation toolkit in your research, please cite:\n\n```bibtex\n@misc{zhang2026qwenimagevae20technicalreport,\n      title={Qwen-Image-VAE-2.0 Technical Report}, \n      author={Zekai Zhang and Deqing Li and Kuan Cao and Yujia Wu and Chenfei Wu and Yu Wu and Liang Peng and Hao Meng and Jiahao Li and Jie Zhang and Kaiyuan Gao and Kun Yan and Lihan Jiang and Ningyuan Tang and Shengming Yin and Tianhe Wu and Xiao Xu and Xiaoyue Chen and Yan Shu and Yanran Zhang and Yilei Chen and Yixian Xu and Yuxiang Chen and Zhendong Wang and Zihao Liu and Zikai Zhou and Yiliang Gu and Yi Wang and Xiaoxiao Xu and Lin Qu},\n      year={2026},\n      eprint={2605.13565},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.13565}, \n}\n```\n\n---\n\n## Acknowledgements\n\nOmniDoc-TokenBench is a derivative dataset based on [OmniDocBench](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FOmniDocBench). Thanks to its authors for their great work.\n\n## License\n\nThis dataset is developed by the Qwen Team at Alibaba Group and is licensed under the [Apache License 2.0](LICENSE).\n","OmniDoc-TokenBench is a benchmark dataset designed specifically to evaluate VAE reconstruction on text-rich document images. The project contains about 3,000 samples covering nine document types (such as books, slides, and color textbooks) in English and Chinese, and provides an evaluation toolkit supporting PSNR, SSIM, LPIPS, FID, and OCR-based NED metrics. In particular, the project introduces NED (Normalized Edit Distance) as the primary text-fidelity metric, directly measuring the difference between the character sequences of the original and reconstructed images. The dataset suits scenarios that require in-depth study and evaluation of document image processing techniques, especially applications concerned with how well text information is preserved.","2026-06-11 03:59:04","CREATED_QUERY"]