[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-4385":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":15,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":15,"starSnapshotCount":15,"syncStatus":26,"lastSyncTime":27,"discoverSource":28},4385,"MUSE","PanqiYang1\u002FMUSE","PanqiYang1","ICML2026: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality",null,"Python",399,10,5,1,0,238,3.12,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:01:02","\u003Cdiv align=\"center\">\n\n# MUSE: Resolving Manifold Misalignment in Visual Tokenization via Topological Orthogonality\n\n\u003Cp>\n  \u003Ca href=\"#\">\u003Cimg alt=\"ICML 2026\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FICML-2026-blue.svg\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"#\">\u003Cimg alt=\"arXiv\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2506.xxxxx-b31b1b.svg\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"LICENSE\">\u003Cimg alt=\"License: Apache-2.0\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-green.svg\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"#\">\u003Cimg alt=\"Python 3.10+\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.10%2B-3776AB.svg\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"#\">\u003Cimg alt=\"PyTorch 2.0+\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPyTorch-2.0%2B-EE4C2C.svg\"\u002F>\u003C\u002Fa>\n\u003C\u002Fp>\n\n**Breaking the Visual Tokenization Trade-off — Generation ∧ Understanding, Not 
Generation ∨ Understanding.**\n\n[Paper](#) · [Project Page](#) · [Model Zoo](#model-zoo) · [Quick Start](#quick-start)\n\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"asst\u002FFigure1.jpg\" width=\"95%\"\u002F>\n\u003C\u002Fdiv>\n\n> **TL;DR**: Unified visual tokenizers suffer from *Manifold Misalignment* — pixel gradients and semantic gradients destructively interfere. MUSE resolves this via **Topological Orthogonality**, physically decoupling structure into attention topology and semantics into feature values. Result: **gFID 3.08** (matching generation specialists) + **Linear Probe 85.2%** (surpassing its own teacher InternViT-300M at 82.5%).\n\n---\n\n## Highlights\n\n\u003Ctable>\n\u003Ctr>\n\u003Ctd width=\"50%\">\n\n### 🎯 Mutual Reinforcement, Not Trade-off\n\nUnlike prior unified tokenizers trapped in a zero-sum game, MUSE achieves **genuine synergy** — structurally aligned reconstruction *actively refines* semantic perception.\n\n| Metric | MUSE | Best Prior |\n|:--|:--:|:--:|\n| gFID ↓ | **3.08** | 3.08 (VTP) |\n| Zero-Shot Acc ↑ | **77.1%** | 75.7% (UniLIP) |\n| Linear Probe ↑ | **85.2%** | 82.5% (Teacher) |\n| Seg. mIoU ↑ | **46.5** | 36.8 (UniLIP) |\n| MMVP ↑ | **74.8** | 72.7 (UniLIP) |\n\n\u003C\u002Ftd>\n\u003Ctd width=\"50%\">\n\n### 🧠 Key Insight: Gradient Orthogonality\n\n\u003Cimg src=\"asst\u002FFigure3_v3.jpg\" width=\"100%\"\u002F>\n\nSemantic gradients naturally occupy **W_V** while structural gradients cluster in **W_Q, W_K**. MUSE respects this inductive bias, eliminating destructive interference.\n\n\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n---\n\n## Method\n\n### Manifold Misalignment & Topological Orthogonality\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"asst\u002FFigure2.jpg\" width=\"95%\"\u002F>\n\u003C\u002Fdiv>\n\n\u003Cbr>\n\nThe core challenge: pixel reconstruction wants to *unfold* the latent manifold for detail, while semantic alignment wants to *collapse* it for invariance. 
Naively combining them causes **destructive gradient interference**.\n\n**MUSE** resolves this via the **Synergistic Block**, which physically decouples the two objectives:\n\n- **Topology Stream** (`W_Q`, `W_K`) → structural gradients refine the attention routing graph\n- **Semantic Stream** (`W_V`) → semantic gradients update feature content values\n- **Stop-Gradient (`\u002F\u002F`)** isolates the semantic branch from reconstruction gradients\n\nThis transforms interference into **mutual reinforcement** — a single architecture, two orthogonal optimization subspaces.\n\n### Three-Stage Progressive Training\n\nMUSE follows an information-theoretic curriculum: structure first, then semantics, then synergy.\n\n| Stage | Name | What Learns | What's Frozen | Key Objective |\n|:-----:|:-----|:------------|:--------------|:--------------|\n| **1** | Topology Warmup | Connector (`W_Q`, `W_K`) | Encoder + Semantic Proj. | `L_topo`: align attention topology with DINO teacher |\n| **2** | Semantic Injection | Connector (`W_V`) + Proj. | Encoder | `L_ITC`: anchor feature values to CLIP manifold |\n| **3** | Synergistic Tuning | Full model | DINO teacher only | All losses: `L_rec` + `L_topo` + `L_ITC` + `L_GAN` |\n\n---\n\n## Results\n\n### Tokenizer Comparison\n\nMUSE breaks the generative-semantic trade-off, establishing a new **Pareto frontier**:\n\n\u003Cdiv align=\"center\">\n\n| Method | Type | rFID ↓ | gFID ↓ | ZS Acc ↑ | LP Acc ↑ | mIoU ↑ |\n|:-------|:----:|:------:|:------:|:--------:|:--------:|:------:|\n| VQGAN | Gen. | 1.28 | 5.20 | — | — | 15.4 |\n| VA-VAE | Gen. 
| **0.46** | 3.92 | — | — | 18.5 |\n| UniLIP | Unified | 0.74 | 3.62 | 75.7 | 83.6 | 36.8 |\n| VTP | Unified | 0.73 | **3.08** | 71.2 | 81.4 | 32.1 |\n| **MUSE** | **Unified** | 0.62 | **3.08** | **77.1** | **85.2** | **46.5** |\n\n\u003C\u002Fdiv>\n\n### Unified Multimodal Model (UMM)\n\nWhen integrated into a full UMM pipeline, MUSE enables high-quality generation and editing **without** compromising perception:\n\n\u003Cdiv align=\"center\">\n\n| Model | MMB ↑ | MMVP ↑ | GenEval ↑ | WISE ↑ | Edit Bkg. ↑ |\n|:------|:-----:|:------:|:---------:|:------:|:-----------:|\n| InternVL3 (specialist) | 78.2 | 72.7 | — | — | — |\n| FLUX.1-dev (specialist) | — | — | 0.76 | 0.50 | — |\n| UniLIP | 72.6 | 72.7 | 0.78 | 0.62 | 0.79 |\n| **MUSE** | **73.4** | **74.8** | **0.82** | **0.65** | **0.87** |\n\n\u003C\u002Fdiv>\n\n### Qualitative Results\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Attention Maps — MUSE vs. Baselines\u003C\u002Fb>\u003C\u002Fsummary>\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"asst\u002FAppendix_Figure2.jpg\" width=\"80%\"\u002F>\n\u003C\u002Fdiv>\n\nMUSE faithfully mirrors the precise, ground-truth-like attention patterns of the DINO teacher, while VQGAN scatters across textures and UniLIP produces overly diffuse maps.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Text-to-Image Generation\u003C\u002Fb>\u003C\u002Fsummary>\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"asst\u002FAppendix_T2I.jpg\" width=\"90%\"\u002F>\n\u003C\u002Fdiv>\n\nComplex attribute binding, accurate spatial reasoning, and realistic textures across diverse prompts.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Image Editing\u003C\u002Fb>\u003C\u002Fsummary>\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"asst\u002FAppendix_Figure3.jpg\" width=\"90%\"\u002F>\n\u003C\u002Fdiv>\n\nLocalized semantic modifications while strictly maintaining global layout and background consistency.\n\n\u003C\u002Fdetails>\n\n---\n\n## Model Zoo\n\n| 
Model | Backbone | Params | gFID ↓ | LP Acc ↑ | Checkpoint |\n|:------|:---------|:------:|:------:|:--------:|:----------:|\n| MUSE-1B | InternVL3-1B + SANA-0.6B | 496M | 3.08 | 85.2 | [Huggingface](https:\u002F\u002Fhuggingface.co\u002Fyangpanqi\u002FMUSE-1B\u002Ftree\u002Fmain) |\n| MUSE-3B | InternVL3-2B + SANA-1.6B | — | — | — | Coming Soon |\n\n---\n\n## Installation\n\n```bash\n# Clone the repository\ngit clone https:\u002F\u002Fgithub.com\u002Fyour-org\u002FMUSE.git\ncd MUSE\n\n# Create conda environment (recommended)\nconda create -n muse python=3.10 -y\nconda activate muse\n\n# Install dependencies\npip install -e .\n# Or:\npip install -r requirements.txt\n```\n\n### Prerequisites\n\nDownload the following pretrained models:\n\n| Model | Role | Source |\n|:------|:-----|:-------|\n| InternVL3-1B \u002F 2B | Vision backbone encoder | [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FOpenGVLab\u002FInternVL3-1B) |\n| DC-AE (SANA) | Pixel decoder | [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FEfficient-Large-Model\u002FSana_1600M_1024px_diffusers) |\n| DINOv3-ViT-H+ | Topology teacher (Stage 1) | Custom checkpoint |\n| CLIP-ViT-L-14 | Text encoder for ITC (Stage 2+) | [OpenCLIP](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fopen_clip) |\n\n---\n\n## Quick Start\n\n### Training the Tokenizer\n\n```bash\nexport CHECKPOINT_DIR=\u002Fpath\u002Fto\u002Fpretrained\u002Fmodels\nexport DATA_DIR=\u002Fpath\u002Fto\u002Fdatasets\n\n# Stage 1: Topology Warmup — align attention with DINO teacher\nbash tools\u002Ftrain_stage1.sh muse_1b\n\n# Stage 2: Semantic Injection — anchor features to CLIP manifold\nbash tools\u002Ftrain_stage2.sh muse_1b\n\n# Stage 3: Synergistic Tuning — full co-optimization\nbash tools\u002Ftrain_stage3.sh muse_1b\n```\n\n### Evaluation\n\n```bash\n# Reconstruction metrics (rFID, PSNR, SSIM, LPIPS)\nbash tools\u002Fevaluate.sh \\\n    configs\u002Fmuse_1b\u002Fstage3.yaml \\\n    \u002Fpath\u002Fto\u002Fcheckpoint.bin\n\n# Linear 
probe on ImageNet-1K\npython scripts\u002Flinear_probe.py \\\n    --config configs\u002Fmuse_1b\u002Fstage3.yaml \\\n    --checkpoint \u002Fpath\u002Fto\u002Fcheckpoint.bin\n\n# Zero-shot ImageNet classification\npython scripts\u002Fzero_shot.py \\\n    --config configs\u002Fmuse_1b\u002Fstage3.yaml \\\n    --checkpoint \u002Fpath\u002Fto\u002Fcheckpoint.bin\n\n# ADE20K segmentation probe (mIoU)\nbash tools\u002Fsegment_probe.sh muse \\\n    --config configs\u002Fmuse_1b\u002Fstage3.yaml \\\n    --checkpoint \u002Fpath\u002Fto\u002Fcheckpoint.bin \\\n    --train-url \"\u002Fpath\u002Fto\u002Fade20k-train-{000000..000020}.tar\" \\\n    --val-url \"\u002Fpath\u002Fto\u002Fade20k-validation-{000000..000002}.tar\"\n```\n\n### Inference\n\n```bash\n# Single image reconstruction\npython scripts\u002Finference.py \\\n    --config configs\u002Fmuse_1b\u002Fstage3.yaml \\\n    --checkpoint \u002Fpath\u002Fto\u002Fcheckpoint.bin \\\n    --image_path \u002Fpath\u002Fto\u002Fimage.jpg \\\n    --output_dir outputs\u002Finference\n\n# Attention map visualization\nbash tools\u002Fvisualize_attention.sh single \\\n    --config configs\u002Fmuse_1b\u002Fstage3.yaml \\\n    --checkpoint \u002Fpath\u002Fto\u002Fcheckpoint.bin \\\n    --image \u002Fpath\u002Fto\u002Fimage.jpg \\\n    --output outputs\u002Fattention_viz\n```\n\n---\n\n## Data Format\n\nMUSE uses [WebDataset](https:\u002F\u002Fgithub.com\u002Fwebdataset\u002Fwebdataset) (`.tar`) format for scalable data loading:\n\n```\nshard-000000.tar\n├── 00000.jpg          # Image\n├── 00000.txt          # Caption (Stages 2–3)\n├── 00001.jpg\n├── 00001.txt\n└── ...\n```\n\nFor segmentation probing, each shard additionally contains `*.seg.png` (ADE20K labels).\n\n---\n\n## Project Structure\n\n```\nMUSE\u002F\n├── muse\u002F                              # Core library\n│   ├── models\u002F\n│   │   ├── muse_vit.py                #   MUSE_ViT + SynergisticBlock\n│   │   ├── base_model.py              #   Save\u002Fload utilities\n│   
│   ├── ema_model.py               #   EMA model wrapper\n│   │   ├── discriminator.py           #   PatchGAN discriminator\n│   │   ├── lpips.py                   #   LPIPS perceptual metric\n│   │   └── perceptual_loss.py         #   LPIPS + ConvNeXt-S perceptual\n│   ├── losses\u002F\n│   │   └── muse_loss.py               #   Pixel + Perceptual + GAN + Topo + ITC\n│   ├── data\u002F\n│   │   └── dataloader.py              #   WebDataset loader\n│   ├── evaluation\u002F\n│   │   ├── evaluator.py               #   rFID \u002F PSNR \u002F SSIM \u002F LPIPS\n│   │   └── inception.py               #   InceptionV3 for FID\n│   └── utils\u002F\n│       ├── viz_utils.py               #   Attention visualization pipeline\n│       ├── train_utils.py             #   Training helpers\n│       ├── lr_schedulers.py           #   LR schedule (cosine \u002F constant)\n│       └── logger.py                  #   Logging setup\n├── scripts\u002F                           # Entry-point scripts\n│   ├── train_stage{1,2,3}.py          #   Three-stage training\n│   ├── evaluate.py                    #   Batch reconstruction eval\n│   ├── inference.py                   #   Single-image reconstruction\n│   ├── linear_probe.py               #   ImageNet linear probe\n│   ├── zero_shot.py                   #   Zero-shot classification\n│   ├── zero_shot_meta.py              #   ImageNet class names + templates\n│   ├── segment_probe.py              #   ADE20K segmentation probe\n│   └── visualize_attention.py         #   Attention map visualization\n├── configs\u002F\n│   ├── muse_1b\u002F                       #   MUSE-1B configs (stage1–3)\n│   └── muse_3b\u002F                       #   MUSE-3B configs (stage1–3)\n├── tools\u002F                             # Shell launch scripts\n│   ├── train_stage{1,2,3}.sh\n│   ├── evaluate.sh\n│   ├── visualize_attention.sh\n│   └── segment_probe.sh\n├── asst\u002F                              # Paper figures & assets\n├── requirements.txt\n├── 
setup.py\n└── LICENSE                            # Apache 2.0\n```\n\n---\n\n## Citation\n\nIf you find MUSE useful in your research, please consider citing:\n\n```bibtex\n@inproceedings{muse2026,\n  title     = {MUSE: Resolving Manifold Misalignment in Visual Tokenization \n               via Topological Orthogonality},\n  author    = {Panqi Yang and Haodong Jing and Jiahao Chao and Tingyan Xiang and Li Lin and Yao Hu and Yang Luo and Yongqiang Ma},\n  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},\n  year      = {2026}\n}\n```\n\n## Acknowledgements\n\nMUSE builds upon several excellent open-source projects:\n\n- [InternVL3](https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FInternVL) — Vision backbone\n- [SANA](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FSana) \u002F [DC-AE](https:\u002F\u002Fgithub.com\u002Fmit-han-lab\u002Fefficientvit) — Pixel decoder\n- [DINOv3](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fdinov3) — Structural topology teacher\n- [OpenCLIP](https:\u002F\u002Fgithub.com\u002Fmlfoundations\u002Fopen_clip) — Text encoder for semantic anchoring\n\n## License\n\nThis project is licensed under the [Apache License 2.0](LICENSE).\n","The MUSE project aims to resolve manifold misalignment in visual tokenization through topological orthogonality. Its core mechanism is the Synergistic Block, which physically decouples structural gradients from semantic gradients so that pixel reconstruction and semantic alignment reinforce each other instead of interfering, yielding strong results in both generation quality and understanding. It is implemented in Python 3.10+ with PyTorch 2.0+ and is open-sourced under the Apache License 2.0. The project suits scenarios that need an efficient unified visual encoder, such as image generation, zero-shot learning, and semantic segmentation, where it can significantly improve model performance.",2,"2026-06-11 02:59:52","CREATED_QUERY"]