[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80871":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":12,"subscribersCount":12,"size":12,"stars1d":14,"stars7d":15,"stars30d":15,"stars90d":12,"forks30d":12,"starsTrendScore":16,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":19,"hasPages":19,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":12,"starSnapshotCount":12,"syncStatus":15,"lastSyncTime":25,"discoverSource":26},80871,"InsightTok","LeapLabTHU\u002FInsightTok","LeapLabTHU","InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation",null,"Python",37,0,35,1,2,3,43.2,"MIT License",false,"main",[],"2026-06-12 04:01:30","\n\u003Cdiv align=\"center\">\n\n# InsightTok\n\n**InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation**\n\n\u003Cp align=\"center\">\n \u003Ca href=\"https:\u002F\u002Fscholar.google.com\u002Fcitations?user=Q9cLkdcAAAAJ\">Yang Yue\u003Csup>1\u003C\u002Fsup>\u003C\u002Fa> &emsp;\n \u003Ca href=\"https:\u002F\u002Fscholar.google.com\u002Fcitations?user=-ncz2s8AAAAJ\">Fangyun Wei\u003Csup>2\u003C\u002Fsup>\u003C\u002Fa> &emsp;\n \u003Ca href=\"https:\u002F\u002Fscholar.google.com\u002Fcitations?user=P08KU1YAAAAJ\">Tianyu He\u003Csup>2\u003C\u002Fsup>\u003C\u002Fa> &emsp;\n\u003Ca href=\"https:\u002F\u002Fopenreview.net\u002Fprofile?id=~Jinjing_Zhao1\">Jinjing Zhao\u003Csup>2\u003C\u002Fsup>\u003C\u002Fa> &emsp;\n\u003Ca href=\"https:\u002F\u002Fnzl-thu.github.io\u002F\">Zanlin Ni\u003Csup>1\u003C\u002Fsup>\u003C\u002Fa> &emsp;\n\u003Ca href=\"https:\u002F\u002Fscholar.google.com\u002Fcitations?user=55tpKaoAAAAJ\">Zeyu 
Liu\u003Csup>1\u003C\u002Fsup>\u003C\u002Fa> &emsp;\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n\u003Ca href=\"https:\u002F\u002Fwww.jiayiguo.net\u002F\">Jiayi Guo\u003Csup>1\u003C\u002Fsup>\u003C\u002Fa> &emsp;\n\u003Ca href=\"https:\u002F\u002Fscholar.google.com\u002Fcitations?user=mbHPse8AAAAJ\">Lei Shi\u003Csup>2\u003C\u002Fsup>\u003C\u002Fa> &emsp;\n\u003Ca href=\"https:\u002F\u002Fyuedong.shading.me\u002F\">Yue Dong\u003Csup>2\u003C\u002Fsup>\u003C\u002Fa> &emsp;\n\u003Ca href=\"https:\u002F\u002Fscholar.google.com\u002Fcitations?view_op=list_works&hl=en&user=ksl2q9kAAAAJ\">Li Chen\u003Csup>2\u003C\u002Fsup>\u003C\u002Fa> &emsp;\n\u003Ca href=\"https:\u002F\u002Fsites.google.com\u002Fview\u002Fji-li-homepage\">Ji Li\u003Csup>2\u003C\u002Fsup>\u003C\u002Fa> &emsp;\n\u003Ca href=\"https:\u002F\u002Fgaohuang-net.github.io\u002F\">Gao Huang\u003Csup>1,✉\u003C\u002Fsup>\u003C\u002Fa> &emsp;\n\u003Ca href=\"https:\u002F\u002Fwww.dongchen.pro\u002F\">Dong Chen\u003Csup>2,✉\u003C\u002Fsup>\u003C\u002Fa> &emsp;\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n\u003Csup>1\u003C\u002Fsup>Tsinghua University&emsp;\n\u003Csup>2\u003C\u002Fsup>Microsoft Research\n\u003C\u002Fp>\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.14333-b31b1b.svg)](http:\u002F\u002Farxiv.org\u002Fabs\u002F2605.14333) [![Model](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel-InsightTok-orange.svg)](https:\u002F\u002Fhuggingface.co\u002Fyueyang2000\u002FInsightTok)\n\n\n\u003C\u002Fdiv>\n\n## Overview\nInsightTok is a discrete visual tokenizer designed to improve the fidelity of **text** and **faces**, two of the most challenging yet perceptually important structures in autoregressive image generation.\n\nExisting visual tokenizers are typically trained with generic reconstruction objectives, which do not explicitly prioritize these fidelity-critical regions. 
InsightTok addresses this limitation through **localized, content-aware perceptual supervision**, enabling substantially better preservation of textual content and facial details under a compact discrete bottleneck.\n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002FMethod.png\" width=\"95%\">\n\u003C\u002Fp>\n\n\n## Highlights\n\n- **State-of-the-art text and face reconstruction** among discrete visual tokenizers at the same compression rate, using **16× downsampling** and a compact **16,384-entry codebook**\n- **Minimal additional training overhead** over a vanilla VQGAN-style tokenizer\n- **No changes required to downstream generative modeling**. Readily compatible with standard autoregressive image generation pipelines\n- **Tokenizer improvements transfer effectively** to downstream text-to-image generation, yielding clearer text and more faithful facial details\n\n\n## Main Results\n\n### Tokenizer Reconstruction\n\nInsightTok delivers substantial improvements in both text and face reconstruction quality while maintaining strong general reconstruction performance.\n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002FCompare_Recon.png\" width=\"95%\">\n\u003C\u002Fp>\n\n\n### Autoregressive Image Generation\n\nThe benefits of InsightTok also transfer to downstream autoregressive image generation.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002FCompare_Gen_Tok.png\" width=\"95%\">\n\u003C\u002Fp>\n\nBelow is a gallery of images generated by InsightAR.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002FGen_Vis.png\" width=\"95%\">\n\u003C\u002Fp>\n\n\n## Usage\n\nModel checkpoints are available at [https:\u002F\u002Fhuggingface.co\u002Fyueyang2000\u002FInsightTok](https:\u002F\u002Fhuggingface.co\u002Fyueyang2000\u002FInsightTok).\n\nInsightTok follows the standard VQGAN-style autoencoding interface:\n\n```python\n# image encoding\nlatents, _, [_, _, indices] = vq_model.encode(input_image_tensor)\n# image 
decoding\nrecon_image_tensor = vq_model.decode(latents)\n```\n\nWe also provide a simple image reconstruction demo in `recon_demo.py`:\n\n```bash\n# --ckpt_path is optional; if omitted, the checkpoint is downloaded from Hugging Face\npython recon_demo.py \\\n  --ckpt_path \u003Cmodel-checkpoint-path> \\\n  --input assets\u002Fvalset \\\n  --output outputs\u002Frecon\n```\n\n## Acknowledgments\n\nThis project builds upon the excellent open-source efforts of [LlamaGen](https:\u002F\u002Fgithub.com\u002FFoundationVision\u002FLlamaGen), [Seed-Voken](https:\u002F\u002Fgithub.com\u002Ftencentarc\u002Fseed-voken), [Janus-Pro](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FJanus), [TokBench](https:\u002F\u002Fgithub.com\u002Fwjf5203\u002FTokBench), [DocTR](https:\u002F\u002Fgithub.com\u002Fmindee\u002Fdoctr), and [InsightFace](https:\u002F\u002Fgithub.com\u002Fdeepinsight\u002Finsightface).\n\nWe sincerely thank the authors and contributors of these projects and benchmarks for making this research possible.\n\n\n## Citation\nIf you find this work useful, please consider citing our paper.\n\n```bibtex\n@article{yue2026insighttok,\n  title={InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation},\n  author={Yue, Yang and Wei, Fangyun and He, Tianyu and Zhao, Jinjing and Ni, Zanlin and Liu, Zeyu and Guo, Jiayi and Shi, Lei and Dong, Yue and Chen, Li and Li, Ji and Huang, Gao and Chen, Dong},\n  journal={arXiv preprint arXiv:2605.14333},\n  year={2026}\n}\n```\n\n## Contact\nIf you have any questions, please feel free to contact the authors.\n\nYang Yue: [yueyang22@mails.tsinghua.edu.cn](mailto:yueyang22@mails.tsinghua.edu.cn)","InsightTok is a discrete visual tokenizer designed to improve text and face fidelity in autoregressive image generation. Through localized, content-aware perceptual supervision, the project significantly improves how well textual content and facial details are preserved in the compressed representation; using 16× downsampling and a compact 16,384-entry codebook, it achieves text and face reconstruction ahead of other discrete visual tokenizers at the same compression rate. In addition, InsightTok requires only minimal additional training overhead compared with a conventional VQGAN-style tokenizer and integrates into standard autoregressive image generation pipelines without modifying the downstream generative model. The project suits applications that need high-fidelity text or face image generation, such as digital art creation and virtual character generation.","2026-06-11 04:02:37","CREATED_QUERY"]