[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-898":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":14,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":16,"starSnapshotCount":16,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},898,"tuna-2","facebookresearch\u002Ftuna-2","facebookresearch","Official implementation of Tuna-2: Pixel Embeddings Beat Vision Encoders for Unified Understanding and Generation","",null,"Python",708,28,13,8,0,6,81,18,74.99,"Apache License 2.0",false,"main",[],"2026-06-12 04:00:06","\u003Cdiv align=\"center\">\n\n# TUNA-2: Pixel Embeddings Beat Vision Encoders for Unified Understanding and Generation\n\n[Zhiheng Liu](https:\u002F\u002Fjohanan528.github.io\u002F)\\*\u003Csup>1,2\u003C\u002Fsup>,\n[Weiming Ren](https:\u002F\u002Fcs.uwaterloo.ca\u002F~w2ren\u002F)\\*\u003Csup>1,3\u003C\u002Fsup>,\n[Xiaoke Huang](https:\u002F\u002Fxk-huang.github.io\u002F)\u003Csup>1\u003C\u002Fsup>,\n[Shoufa Chen](https:\u002F\u002Fwww.shoufachen.com\u002F)\u003Csup>1\u003C\u002Fsup>,\n[Tianhong Li](https:\u002F\u002Fwww.tianhongli.me\u002F)\u003Csup>1\u003C\u002Fsup>,\n[Mengzhao Chen](https:\u002F\u002Fchenmnz.github.io\u002F)\u003Csup>2\u003C\u002Fsup>,\n[Yatai Ji](https:\u002F\u002Fyataiji.github.io\u002F)\u003Csup>2\u003C\u002Fsup>,\n[Sen He](https:\u002F\u002Fsenhe.github.io\u002F)\u003Csup>1\u003C\u002Fsup>,\n[Jonas Schult](https:\u002F\u002Fjonasschult.github.io\u002F)\u003Csup>1\u003C\u002Fsup>,\n[Belinda 
Zeng](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fbelindazeng\u002F)\u003Csup>1\u003C\u002Fsup>,\n[Tao Xiang](https:\u002F\u002Fwww.surrey.ac.uk\u002Fpeople\u002Ftao-xiang)\u003Csup>1\u003C\u002Fsup>,\n[Wenhu Chen](https:\u002F\u002Fwenhuchen.github.io\u002F)\u003Csup>3\u003C\u002Fsup>,\n[Ping Luo](http:\u002F\u002Fluoping.me\u002F)\u003Csup>2\u003C\u002Fsup>,\n[Luke Zettlemoyer](https:\u002F\u002Fwww.cs.washington.edu\u002Fpeople\u002Ffaculty\u002Fluke-zettlemoyer\u002F)\u003Csup>1\u003C\u002Fsup>,\n[Yuren Cong](https:\u002F\u002Fyrcong.github.io\u002F)\u003Csup>1\u003C\u002Fsup>\n\n\u003Csup>1\u003C\u002Fsup>Meta &nbsp; \u003Csup>2\u003C\u002Fsup>The University of Hong Kong &nbsp; \u003Csup>3\u003C\u002Fsup>University of Waterloo\n\n\\* Equal contribution\n\n**[[Project Page]](https:\u002F\u002Ftuna-ai.org\u002Ftuna-2)** &nbsp; **[[arXiv]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.24763)**\n\n\u003C\u002Fdiv>\n\n## Overview\n\nWe simplify [Tuna](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.10441) by progressively stripping away its visual encoding components. By removing the VAE, we first derive **Tuna-R**, a pixel-space unified multimodal model (UMM) that relies solely on a representation encoder. **Tuna-2** further streamlines the design by bypassing the representation encoder entirely, utilizing direct patch embedding layers for raw image inputs. 
Tuna-2 using pixel embeddings outperforms both Tuna-R and Tuna across a diverse suite of multimodal benchmarks.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fteaser.png\" width=\"90%\" alt=\"Evolution of Tuna-2 architecture and multimodal performance comparison\"\u002F>\n\u003C\u002Fp>\n\n## Generation Results\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fvisual.png\" width=\"85%\" alt=\"Tuna-2 generation samples\"\u002F>\n\u003C\u002Fp>\n\n## Installation\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ftuna-2.git\ncd tuna-2\nbash scripts\u002Fsetup_uv.sh   # creates .venv with all dependencies\nsource .venv\u002Fbin\u002Factivate\n```\n\n\u003Cdetails>\n\u003Csummary>Manual setup (if you prefer to drive uv yourself)\u003C\u002Fsummary>\n\n```bash\ncurl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\nuv sync\nuv pip install torch torchvision --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu121\nuv pip install -e .\nsource .venv\u002Fbin\u002Factivate\n```\n\n\u003C\u002Fdetails>\n\n## Inference\n\nAll inference is done through a single unified script:\n\n```bash\nbash scripts\u002Flaunch\u002Fpredict.sh --ckpt \u003CPATH> --prompt \u003CTEXT> [OPTIONS]\n```\n\n### Options\n\n| Flag | Values | Default | Description |\n|---|---|---|---|\n| `--ckpt` | path | *(required)* | Path to the model checkpoint |\n| `--prompt` | text | *(required)* | Text prompt (t2i) or editing instruction (edit) |\n| `--task` | `t2i`, `edit` | `t2i` | Inference task |\n| `--variant` | `none_encoder`, `siglip_pixel`, `vae` | `none_encoder` | Model variant: **Tuna-2**, **Tuna-R**, or **Tuna** |\n| `--size` | `7b`, `2b` | `7b` | Model size (2b only available for `--variant vae`) |\n| `--resolution` | See table below | `512x512` | Output resolution (HxW) |\n| `--gpu` | int | `0` | GPU device index |\n| `--image` | path | — | Source image (required for `--task edit`) |\n| `--steps` | int | `50` | Number of 
diffusion steps |\n| `--guidance` | float | *(from config)* | Classifier-free guidance scale |\n| `--seed` | int | `42` | Random seed |\n| `--negative` | text | *(from config)* | Negative prompt |\n\n### Supported Resolutions\n\n| 512-class | 1024-class |\n|---|---|\n| `512x512` | `1024x1024` |\n| `448x576` | `896x1152` |\n| `576x448` | `1152x896` |\n| `384x672` | `768x1344` |\n| `672x384` | `1344x768` |\n\n### Examples\n\nSee `assets\u002Fprompts.txt` for sample prompts.\n\n```bash\n# Tuna-2 (7B, no encoder, 512px)\nbash scripts\u002Flaunch\u002Fpredict.sh \\\n    --ckpt \u002Fpath\u002Fto\u002Ftuna_2_pixel_7b.pt \\\n    --prompt \"A highly realistic beauty portrait in extreme close-up, showing the face of a young woman from just above the eyebrows down to the lips. Her skin is natural, luminous, and textured, with visible pores, fine facial hairs, subtle unevenness, and a slightly dewy finish, without heavy retouching or artificial smoothing.\"\n\n# Tuna (2B, VAE latent, 512px)\nbash scripts\u002Flaunch\u002Fpredict.sh \\\n    --variant vae --size 2b \\\n    --ckpt \u002Fpath\u002Fto\u002Ftuna_2b.pt \\\n    --prompt \"A brutally realistic cinematic close-up inside a real space station cupola, side profile of a blonde female astronaut floating in zero gravity beside the window, her loose braid drifting naturally, looking out at Earth in silence.\"\n```\n\n### Video\n\nDue to policy constraints, we are unable to release the video generation model at this time. However, we provide the complete video training and inference codebase. 
If you are interested in training your own video model, this is a ready-to-use starting point — see `configs\u002Ftrain\u002Fvideo_t2v.yaml` for training configuration and `configs\u002Fpredict\u002Ft2v_2b.yaml` for inference.\n\n## TODO\n\n- [ ] Release some of the Tuna-2 model weights.\n- [ ] Release some of the Tuna model weights.\n- [ ] Release the fully restored model weights (fine-tuned on external data to recover the missing layers).\n\n### A Note on Model Release\n\nDue to organizational policy constraints, we are unable to release the full production-trained model weights. To support the research community, we plan to release a **foundation checkpoint** with a small number of layers removed from both the LLM backbone and the diffusion head (flow head). The remaining layers and all other components (vision encoder, projections, embeddings, etc.) are fully preserved. With a short fine-tuning pass on your own data, the removed layers can be quickly re-learned and the model restored to full quality.\n\nFor detailed fine-tuning instructions, please refer to the [training guide](tuna\u002FREADME.md).\n\nMeanwhile, we are also actively working on fine-tuning the removed layers using external data, and plan to release the complete weights as soon as possible.\n\n## Citation\n\n```bibtex\n@article{tuna2,\n  title={TUNA-2: Pixel Embeddings Beat Vision Encoders\n         for Unified Understanding and Generation},\n  author={Liu, Zhiheng and Ren, Weiming and Huang, Xiaoke\n          and Chen, Shoufa and Li, Tianhong and Chen, Mengzhao\n          and Ji, Yatai and He, Sen and Schult, Jonas\n          and Xiang, Tao and Chen, Wenhu and Luo, Ping\n          and Zettlemoyer, Luke and Cong, Yuren},\n  journal={arXiv preprint arXiv:2604.24763},\n  year={2026}\n}\n```\n\n```bibtex\n@article{liu2025tuna,\n  title={Tuna: Taming unified visual representations for native unified multimodal models},\n  author={Liu, Zhiheng and Ren, Weiming and Liu, Haozhe and Zhou, Zijian and 
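Chen, Shoufa and Qiu, Haonan and Huang, Xiaoke and An, Zhaochong and Yang, Fanny and Patel, Aditya and others},\n  journal={CVPR2026},\n  year={2026}\n}\n```\n\n## License\n\nThis project is licensed under the Apache License 2.0. See [LICENSE](LICENSE) for details.\n","TUNA-2 is a multimodal understanding and generation model based on pixel embeddings. The project simplifies the original Tuna design by removing its visual encoding components and processing image inputs directly with pixel embedding layers, which yields strong results across a range of multimodal benchmarks. Its key technical features include direct patch embedding layers in place of a complex representation encoder, and support for multiple tasks such as text-to-image generation and image editing. TUNA-2 is suited to applications that need efficient, high-quality multimodal content generation, such as image synthesis and creative design tools. The project is written in Python and released under the Apache License 2.0.",2,"2026-06-11 02:40:06","CREATED_QUERY"]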