[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72067":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},72067,"Bagel","ByteDance-Seed\u002FBagel","ByteDance-Seed","Open-source unified multimodal model","",null,"Python",6000,533,50,142,0,9,26,94,27,39.18,"Apache License 2.0",false,"main",[],"2026-06-12 02:02:58","\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Flf3-static.bytednsdoc.com\u002Fobj\u002Feden-cn\u002Fnuhojubrps\u002Fbanner.png\" alt=\"BAGEL\" width=\"480\"\u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fbagel-ai.org\u002F\">\n    \u003Cimg\n      src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBAGEL-Website-0A66C2?logo=safari&logoColor=white\"\n      alt=\"BAGEL Website\"\n    \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14683\">\n    \u003Cimg\n      src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBAGEL-Paper-red?logo=arxiv&logoColor=red\"\n      alt=\"BAGEL Paper on arXiv\"\n    \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FByteDance-Seed\u002FBAGEL-7B-MoT\">\n    \u003Cimg \n        src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBAGEL-Model-yellow?logo=huggingface&logoColor=yellow\" \n        alt=\"BAGEL Model\"\n    \u002F>\n  \u003C\u002Fa>\n  \u003Ca 
href=\"https:\u002F\u002Fdemo.bagel-ai.org\u002F\">\n    \u003Cimg\n      src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBAGEL-Demo-blue?logo=googleplay&logoColor=blue\"\n      alt=\"BAGEL Demo\"\n    \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FByteDance-Seed\u002FBAGEL\">\n    \u003Cimg \n        src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBAGEL-Space-orange?logo=huggingface&logoColor=yellow\" \n        alt=\"BAGEL Model\"\n    \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FeXQNFhWe\">\n    \u003Cimg\n      src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBAGEL-Discord-5865F2?logo=discord&logoColor=purple\"\n      alt=\"BAGEL Discord\"\n    \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"mailto:bagel@bytedance.com\">\n    \u003Cimg\n      src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBAGEL-Email-D14836?logo=gmail&logoColor=red\"\n      alt=\"BAGEL Email\"\n    \u002F>\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n# Unified Model for Multimodal Understanding and Generation\n> [Chaorui Deng*](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=k0TWfBoAAAAJ), [Deyao Zhu*](https:\u002F\u002Ftsutikgiau.github.io\u002F), [Kunchang Li*](https:\u002F\u002Fandy1621.github.io\u002F), [Chenhui Gou*](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fchenhui-gou-9201081a1\u002F?originalSubdomain=au), [Feng Li*](https:\u002F\u002Ffengli-ust.github.io\u002F), [Zeyu Wang](https:\u002F\u002Fzw615.github.io\u002F), Shu Zhong, [Weihao Yu](https:\u002F\u002Fwhyu.me\u002F), [Xiaonan Nie](https:\u002F\u002Fcodecaution.github.io\u002F), [Ziang Song](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fziang-song-43b0ab8a\u002F), Guang Shi :email: , [Haoqi Fan* :tophat: ](https:\u002F\u002Fhaoqifan.github.io\u002F)\n>\n> contact: shiguang.sg@bytedance.com\n> \n> We present **BAGEL**, an open‑source multimodal foundation model with 7B active parameters (14B total) trained on 
large‑scale interleaved multimodal data. BAGEL outperforms current top‑tier open‑source VLMs such as Qwen2.5-VL and InternVL-2.5 on standard multimodal understanding leaderboards, and delivers text‑to‑image quality that is competitive with strong specialist generators such as SD3.\nMoreover, BAGEL delivers better qualitative results in classical image‑editing scenarios than the leading open-source models. More importantly, it extends to free-form visual manipulation, multiview synthesis, and world navigation, capabilities that constitute \"world-modeling\" tasks beyond the scope of previous image-editing models.\nThe figure below showcases BAGEL's qualitative performance.\n\n\u003Cp align=\"center\">\u003Cimg src=\"assets\u002Fteaser.webp\" width=\"95%\">\u003C\u002Fp>\n\n\n\u003C!-- ## 🧠 Method\nBAGEL adopts a Mixture-of-Transformer-Experts (MoT) architecture to maximize the model’s capacity to learn from richly diverse multimodal information. Following the same principle of capacity maximization, it utilizes two separate encoders to capture pixel-level and semantic-level features of an image. The overall framework follows a Next Group of Token Prediction paradigm, where the model is trained to predict the next group of language or visual tokens as a compression target.\n\nBAGEL scales MoT’s capacity through Pre-training, Continued Training, and Supervised Finetuning on trillions of interleaved multimodal tokens spanning language, image, video, and web data. 
It surpasses open models on standard understanding and generation benchmarks and demonstrates advanced in-context multimodal abilities like free-form image editing, future frame prediction, 3D manipulation, world navigation, and sequential reasoning.\n\n\u003Cp align=\"center\">\u003Cimg src=\"assets\u002Farch.png\" width=\"95%\">\u003C\u002Fp>\n\n\n## 🌱 Emerging Properties\n\u003Cp align=\"center\">\u003Cimg src=\"assets\u002Femerging_curves.png\" width=\"95%\">\u003C\u002Fp>\n\nAs we scale up BAGEL’s pretraining with more multimodal tokens, we observe consistent performance gains across understanding, generation, and editing tasks. Different capabilities emerge at distinct training stages—multimodal understanding and generation appear early, followed by basic editing, while complex, intelligent editing emerges later. This staged progression suggests an emergent pattern, where advanced multimodal reasoning builds on well-formed foundational skills. Ablation studies further show that combining VAE and ViT features significantly improves intelligent editing, underscoring the importance of visual-semantic context in enabling complex multimodal reasoning and further supporting its role in the emergence of advanced capabilities. -->\n\n## 📢 News\n\nWe sincerely thank all contributors from the open community for their valuable support.\n\n- **June 15, 2025:** We have updated and fixed the evaluation results for [KRIS-Bench](https:\u002F\u002Fgithub.com\u002Fmercurystraw\u002FKris_Bench) and [RISEBench](https:\u002F\u002Fgithub.com\u002FPhoenixZ810\u002FRISEBench). **Our model, BAGEL, demonstrates performance comparable to Gemini 2.0 on these reasoning benchmarks.** We have also released the evaluation code for both KRIS-Bench and RISEBench, along with [ImgEdit-Bench](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FImgEdit). 
For further details, please refer to [EVAL](.\u002FEVAL.md).\n- **Jun 5, 2025:** Thanks to [@davideuler](https:\u002F\u002Fgithub.com\u002Fdavideuler) for contributing the [Dockerfile with prebuilt flash_attn](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FBagel\u002Fissues\u002F125).\n- **May 30, 2025:** Many thanks to [@prartio](https:\u002F\u002Fgithub.com\u002Fprartio) for contributing the [Windows 11 installation guideline](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FBagel\u002Fissues\u002F92), and to [@gluttony-10](https:\u002F\u002Fgithub.com\u002Fgluttony-10) for his work on the [inference of quantization](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FBagel\u002Fpull\u002F88).\n- **May 29, 2025:** Special thanks to [@jnc-nj](https:\u002F\u002Fgithub.com\u002Fjnc-nj) for contributing the [Dockerfile](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FBagel\u002Fissues\u002F75).\n- **May 26, 2025:** Thanks to [@neverbiasu](https:\u002F\u002Fgithub.com\u002Fneverbiasu) for contributing [ComfyUI](https:\u002F\u002Fgithub.com\u002Fneverbiasu\u002FComfyUI-BAGEL).\n- **May 25, 2025:** Special thanks to [@LeanModels](https:\u002F\u002Fgithub.com\u002FLeanModels) for providing the [DF11-compressed version](https:\u002F\u002Fhuggingface.co\u002FDFloat11\u002FBAGEL-7B-MoT-DF11), and to [@Gapeleon](https:\u002F\u002Fhuggingface.co\u002FGapeleon) for the [INT8-compressed version](https:\u002F\u002Fhuggingface.co\u002FGapeleon\u002Fbytedance_BAGEL-7B-MoT-INT8). 
We also appreciate [@gluttony-10](https:\u002F\u002Fgithub.com\u002Fgluttony-10) for contributions to the [Windows package](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FBagel\u002Fissues\u002F51).\n- **May 24, 2025:** Together with [@wangwei1237](https:\u002F\u002Fgithub.com\u002Fwangwei1237), [@gluttony-10](https:\u002F\u002Fgithub.com\u002Fgluttony-10), and [@KingNish24](https:\u002F\u002Fgithub.com\u002FKingNish24), we built a Gradio [app](app.py) and launched a [Hugging Face Space](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FByteDance-Seed\u002FBAGEL).\n- **May 23, 2025:** We have provided a training guideline in [TRAIN](.\u002FTRAIN.md).\n- **May 20, 2025:** We released the official [website](https:\u002F\u002Fbagel-ai.org\u002F), [demo](https:\u002F\u002Fdemo.bagel-ai.org\u002F), [model](https:\u002F\u002Fhuggingface.co\u002FByteDance-Seed\u002FBAGEL-7B-MoT), and [report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14683) for BAGEL.\n\n\n## 📮 Notice\n**Call for Bad Cases:** If you have encountered any cases where the model performs poorly, we would greatly appreciate it if you could share them in the [issue#11](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FBagel\u002Fissues\u002F11) or [Discord](https:\u002F\u002Fdiscord.gg\u002FZ836xxzy).\n\n**About Inference Hyperparameters:**\n- **`cfg_text_scale`:** Controls how strongly the model follows the text prompt. `1.0` disables text guidance. Typical range: `4.0–8.0`.\n- **`cfg_image_scale`:** Controls how much the model preserves input image details. `1.0` disables image guidance. Typical range: `1.0–2.0`.\n- **`cfg_interval`:** Fraction of denoising steps where CFG is applied. Later steps can skip CFG to reduce computation. Typical: `[0.4, 1.0]`.\n- **`timestep_shift`:** Shifts the distribution of denoising steps. Higher values allocate more steps at the start (affects layout); lower values allocate more at the end (improves details).\n- **`num_timesteps`:** Total denoising steps. 
Typical: `50`.\n- **`cfg_renorm_min`:** Minimum value for CFG-Renorm. `1.0` disables renorm. Typical: `0`.\n- **`cfg_renorm_type`:** CFG-Renorm method:  \n  - `global`: Normalize over all tokens and channels (default for T2I).\n  - `channel`: Normalize across channels for each token.\n  - `text_channel`: Like `channel`, but only applies to text condition (good for editing, may cause blur).\n- **If edited images appear blurry, try `global` CFG-Renorm, decrease `cfg_renorm_min` or decrease `cfg_scale`.**\n\n\n## 🔥 Quick Start\n\n1️⃣  Set up environment\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fbytedance-seed\u002FBAGEL.git\ncd BAGEL\nconda create -n bagel python=3.10 -y\nconda activate bagel\npip install -r requirements.txt\npip install flash_attn==2.5.8 --no-build-isolation\n```\n\n2️⃣  Download pretrained checkpoint\n```python\nfrom huggingface_hub import snapshot_download\n\nsave_dir = \"models\u002FBAGEL-7B-MoT\"\nrepo_id = \"ByteDance-Seed\u002FBAGEL-7B-MoT\"\ncache_dir = save_dir + \"\u002Fcache\"\n\nsnapshot_download(cache_dir=cache_dir,\n  local_dir=save_dir,\n  repo_id=repo_id,\n  local_dir_use_symlinks=False,\n  resume_download=True,\n  allow_patterns=[\"*.json\", \"*.safetensors\", \"*.bin\", \"*.py\", \"*.md\", \"*.txt\"],\n)\n\n```\n\n3️⃣ Use Gradio WebUI to start playing with BAGEL!\n```bash\n# For GPUs with 32GB+ VRAM, or multiple GPUs.\npython app.py\n```\n\n```bash\n# For GPUs with 12~32GB VRAM, NF4 quantization is recommended. --zh enables the Chinese interface.\npython app.py --mode 2 --zh\n```\n\n```bash\n# For GPUs with 22~32GB VRAM, INT8 quantization is available but not recommended.\npython app.py  --mode 3\n```\n\n## 🔥 Train & Eval\n\n### Train\n\n```bash\nbash scripts\u002Ftrain.sh\n```\n\nYou can replace the variables in the script with your own before running. \nSee [TRAIN](TRAIN.md) for more details.\n\n### Eval\nWe provide scripts for evaluating VLM, T2I, and editing benchmarks. \nPlease see [EVAL](EVAL.md) for more details.\n\n\n## 📊 Benchmarks\n\n### 1. 
Visual Understanding\n\n| Model | MME | MMBench |   MMMU | MM-Vet | MathVista |\n| ------------------- | ----------: | ----------: | -------: | -------: | ----------: |\n| Janus-Pro-7B        | –  |     79.2 |     41.0 |     50.0 |           – |\n| Qwen2.5-VL-7B      | 2347    |   83.5 | **58.6** |     67.1 |           68.2 |\n| **BAGEL**    | **2388**  |  **85.0** |     55.3 | **67.2** |    **73.1** |\n\n### 2. Text-to-Image Generation\n\n| Model        | GenEval | WISE |\n| ------------ | --------- | --------- |\n| Janus-Pro-7B | 0.80      | 0.35 | \n| SD3-Medium   | 0.74      | – |\n| FLUX-1-dev   | 0.82      | 0.50 |\n| **BAGEL**    | 0.82  | 0.52  |\n| **BAGEL + Rewriter\u002FCoT**    | **0.88**  | **0.70** |\n\n### 3. Image Editing\n\n| Model         | GEdit-Bench-EN (SC) | GEdit-Bench-EN (PQ) | GEdit-Bench-EN (O) | IntelligentBench | KRIS-Bench | RISEBench |\n| ------------- | ---------------------: | ---------------------: | -------------------: | ------------------: | ------------: | ------------: | \n| Step1X-Edit   | 🥉7.09                | 🥉6.76                | 🥈6.70            | 14.9               |  43.29   |  1.9  |\n| Gemini 2.0    | 6.73                  | 6.61                  | 6.32                | 🥈57.6             | 🥈62.41   |  🥈13.3  |\n| GPT-4o        | 🥇7.85              | 🥇7.62              | 🥇7.53            | 🥇78.9           | 🥇80.09   |  🥇28.9  |\n| **BAGEL**     | 🥈7.36                | 🥈6.83                | 🥉6.52                | 44.0               |  56.21   |  6.1 |\n| **BAGEL+CoT** | –                     | –                     | –                   | 🥉55.3             |  🥉60.18   |  🥉11.9 |\n\n\n\n\n## ✍️ Citation\n\n```bibtex\n@article{deng2025bagel,\n  title   = {Emerging Properties in Unified Multimodal Pretraining},\n  author  = {Deng, Chaorui and Zhu, Deyao and Li, Kunchang and Gou, Chenhui and Li, Feng and Wang, Zeyu and Zhong, Shu and Yu, Weihao and Nie, Xiaonan and Song, Ziang and Shi, Guang and Fan, Haoqi},\n  journal 
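= {arXiv preprint arXiv:2505.14683},\n  year    = {2025}\n}\n```\n\n\n## 📜 License\nBAGEL is licensed under the Apache License 2.0.\n","BAGEL is an open-source unified multimodal model with 7B active parameters (14B total), trained on large-scale interleaved multimodal data. Its core capabilities are multimodal understanding and generation: it outperforms current top-tier open-source vision-language models on standard multimodal understanding leaderboards and is competitive with specialist generators such as SD3 on text-to-image generation. BAGEL also performs well in classical image-editing scenarios and extends to free-form visual manipulation, multiview synthesis, and world navigation, tasks beyond the scope of previous image-editing models. The project suits applications that need advanced multimodal processing, such as cross-modal content generation, complex image editing, and navigation in virtual environments.",2,"2026-06-11 03:40:13","high_star"]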