[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72536":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":16,"stars30d":16,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":17,"rankGlobal":10,"rankLanguage":10,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":19,"hasPages":19,"topics":21,"createdAt":10,"pushedAt":10,"updatedAt":30,"readmeContent":31,"aiSummary":32,"trendingCount":16,"starSnapshotCount":16,"syncStatus":33,"lastSyncTime":34,"discoverSource":35},72536,"Lumina-T2X","Alpha-VLLM\u002FLumina-T2X","Alpha-VLLM","Lumina-T2X is a unified framework for Text to Any Modality Generation","",null,"Python",2251,95,32,54,0,57.95,"MIT License",false,"main",[22,23,24,25,26,27,28,29],"aigc","diffusion","diffusion-model","diffusion-models","diffusion-transformer","generation-models","transformer","transformers","2026-06-12 04:01:06","\u003C!-- \u003Cp align=\"center\">\n \u003Cimg src=\".\u002Fassets\u002Flumina-logo.png\" width=\"40%\"\u002F>\n \u003Cbr>\n\u003C\u002Fp> -->\n\n# $\\textbf{Lumina-T2X}$: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers \n\n### \u003Cdiv align=\"center\"> ICLR 2025 Spotlight & NeurIPS 2024 \u003Cdiv>\n\n\u003Cdiv align=\"center\">\n\n\u003C!--[![GitHub repo contributors](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors-anon\u002FAlpha-VLLM\u002FLumina-T2X?style=flat&label=Contributors)](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fgraphs\u002Fcontributors)-->\n\n\u003C!--[![GitHub 
Commit](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcommit-activity\u002Fm\u002FAlpha-VLLM\u002FLumina-T2X?label=Commit)](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fcommits\u002Fmain\u002F)-->\n\n\u003C!--[![Pr](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-pr-closed-raw\u002FAlpha-VLLM\u002FLumina-T2X.svg?label=Merged+PRs&color=green)](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fpulls) \u003Cbr>-->\n\n\u003C!--[![GitHub repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAlpha-VLLM\u002FLumina-T2X?style=flat&logo=github&logoColor=whitesmoke&label=Stars)](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fstargazers) -->\n\n\u003C!--[![GitHub repo watchers](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fwatchers\u002FAlpha-VLLM\u002FLumina-T2X?style=flat&logo=github&logoColor=whitesmoke&label=Watchers)](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fwatchers) -->\n\n\u003C!--[![GitHub repo size](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frepo-size\u002FAlpha-VLLM\u002FLumina-T2X?style=flat&logo=github&logoColor=whitesmoke&label=Repo%20Size)](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Farchive\u002Frefs\u002Fheads\u002Fmain.zip) 
-->\n\n[![Lumina-Next](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-Lumina--Next-2b9348.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.18583)&#160;\n[![Lumina-T2X](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-Lumina--T2X-2b9348.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.05945)&#160;\n[![Lumina-mGPT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-Lumina--mGPT-2b9348.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.02657)&#160;\n\n[![Badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-WeChat@Group-000000?logo=wechat&logoColor=07C160)](http:\u002F\u002Fimagebind-llm.opengvlab.com\u002Fqrcode\u002F)&#160;\n[![weixin](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-WeChat@机器之心-000000?logo=wechat&logoColor=07C160)](https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002FNwwbaeRujh-02V6LRs5zMg)&#160;\n[![zhihu](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-知乎-000000?logo=zhihu&logoColor=0084FF)](https:\u002F\u002Fwww.zhihu.com\u002Forg\u002Fopengvlab)&#160;\n[![zhihu](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-Twitter@OpenGVLab-black?logo=twitter&logoColor=1D9BF0)](https:\u002F\u002Ftwitter.com\u002Fopengvlab\u002Fstatus\u002F1788949243383910804)&#160;\n![Static Badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F-MIT-MIT?logoColor=%231082c3&label=Code%20License&link=https%3A%2F%2Fgithub.com%2FAlpha-VLLM%2FLumina-T2X%2Fblob%2Fmain%2FLICENSE)\n\n[![Static Badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVideo%20Introduction%20of%20Lumina--Next-red?logo=youtube)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=K0-AJa33Rw4)\n[![Static Badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FVideo%20Introduction%20of%20Lumina--T2X-pink?logo=youtube)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=KFtHmS5eUCM)\n\n[![Static 
Badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FOfficial(node1)-6B88E3?logo=youtubegaming&label=Demo%20Lumina-Next-SFT)](http:\u002F\u002F106.14.2.150:10020\u002F)&#160;\n[![Static Badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FOfficial(node2)-6B88E3?logo=youtubegaming&label=Demo%20Lumina-Next-SFT)](http:\u002F\u002F106.14.2.150:10021\u002F)&#160;\n[![Static Badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FOfficial(node3)-6B88E3?logo=youtubegaming&label=Demo%20Lumina-Next-SFT)](http:\u002F\u002F106.14.2.150:10022\u002F)&#160;\n[![Static Badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FOfficial(compositional)-6B88E3?logo=youtubegaming&label=Demo%20Lumina-Next-T2I)](http:\u002F\u002F106.14.2.150:10023\u002F)&#160;\n[![Static Badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FOfficial(node1)-violet?logo=youtubegaming&label=Demo%20Lumina-Text2Music)](http:\u002F\u002F139.196.83.164:8000\u002F)&#160;\n[![Static Badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLumina--Next--SFT-HF_Space-yellow?logoColor=violet&label=%F0%9F%A4%97%20Demo%20Lumina-Next-SFT)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FAlpha-VLLM\u002FLumina-Next-T2I)\n\n[![Static Badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLumina--Next--SFT%20checkpoints-Model(2B)-purple?logoColor=#571482&label=%F0%9F%A4%97%20Lumina-Next-SFT%20checkpoints)](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FAlpha-VLLM\u002FLumina-Next-SFT)\n[![Static Badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLumina--Next--T2I%20checkpoints-Model(2B)-purple?logoColor=#571482&label=%F0%9F%A4%97%20Lumina-Next-SFT%20checkpoints)](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FAlpha-VLLM\u002FLumina-Next-T2I)\n\n[![Static 
Badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLumina--Next--SFT%20checkpoints-Model(2B)-yellow?logoColor=violet&label=%F0%9F%A4%97%20Lumina-Next-Diffusers%20checkpoints)](https:\u002F\u002Fhuggingface.co\u002FAlpha-VLLM\u002FLumina-Next-SFT-diffusers)\n[![Static Badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLumina--Next--SFT%20checkpoints-Model(2B)-yellow?logoColor=violet&label=%F0%9F%A4%97%20Lumina-Next-SFT%20checkpoints)](https:\u002F\u002Fhuggingface.co\u002FAlpha-VLLM\u002FLumina-Next-SFT)\n[![Static Badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLumina--Next--T2I%20checkpoints-Model(2B)-yellow?logoColor=violet&label=%F0%9F%A4%97%20Lumina-Next-T2I%20checkpoints)](https:\u002F\u002Fhuggingface.co\u002FAlpha-VLLM\u002FLumina-Next-T2I)\n[![Static Badge](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLumina--T2I%20checkpoints-Model(5B)-yellow?logoColor=violet&label=%F0%9F%A4%97%20Lumina-T2I%20checkpoints)](https:\u002F\u002Fhuggingface.co\u002FAlpha-VLLM\u002FLumina-T2I)\n\n\u003C!-- [![GitHub issues](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002FAlpha-VLLM\u002FLumina-T2X?color=critical&label=Issues)]() -->\n\n\u003C!-- [![GitHub closed issues](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-closed\u002FAlpha-VLLM\u002FLumina-T2X?color=success&label=Issues)]() \u003Cbr> -->\n\n\u003C!-- [![GitHub repo forks](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002FAlpha-VLLM\u002FLumina-T2X?style=flat&logo=github&logoColor=whitesmoke&label=Forks)](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fnetwork)  -->\n\n\u003C!--\n[[📄 Lumina-T2X arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.05945)]\n[[📽️ Video Introduction of Lumina-T2X](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=KFtHmS5eUCM)]\n[👋 join our \u003Ca href=\"http:\u002F\u002Fimagebind-llm.opengvlab.com\u002Fqrcode\u002F\" target=\"_blank\">WeChat\u003C\u002Fa>]\n\n-->\n\n\u003C!-- [[📺 
Website](https:\u002F\u002Flumina-t2-x-web.vercel.app\u002F)] -->\n\n\u003C\u002Fdiv>\n\n![intro_large](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F9f52eabb-07dc-4881-8257-6d8a5f2a0a5a)\n\n\u003C!-- [[中文版本]](.\u002FREADME_cn.md) -->\n\n## 📰 News\n\n- **[2024-08-06] 🎉🎉🎉 We have released [Lumina-mGPT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.02657), the next generation of generative models in our Lumina family! Lumina-mGPT is an autoregressive transformer capable of photorealistic image generation and other vision-language tasks, e.g., controllable generation, multi-turn dialog, and depth\u002Fnormal\u002Fsegmentation map estimation.**\n- **[2024-07-08] 🎉🎉🎉 Lumina-Next is now supported in [diffusers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers)! Thanks to [@yiyixuxu](https:\u002F\u002Fgithub.com\u002Fyiyixuxu) and [@sayakpaul](https:\u002F\u002Fgithub.com\u002Fsayakpaul)! [HF Model Repo](https:\u002F\u002Fhuggingface.co\u002FAlpha-VLLM\u002FLumina-Next-SFT-diffusers).**\n- [2024-06-26] We have released the inference code for img2img translation using `Lumina-Next-T2I`. [CODE](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Ftree\u002Fmain\u002Flumina_next_t2i_mini\u002Fscripts\u002Fsample_img2img.sh) [ComfyUI](https:\u002F\u002Fgithub.com\u002Fkijai\u002FComfyUI-LuminaWrapper)\n- [2024-06-21] 🥰🥰🥰 Lumina-Next has a Jupyter notebook for inference, thanks to [camenduru](https:\u002F\u002Fgithub.com\u002Fcamenduru)! [LINK](https:\u002F\u002Fgithub.com\u002Fcamenduru\u002FLumina-Next-jupyter)\n- [2024-06-21] We have uploaded `Lumina-Next-SFT` and `Lumina-Next-T2I` to [wisemodel.cn](https:\u002F\u002Fwisemodel.cn\u002Fmodels). [wisemodel repo](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FAlpha-VLLM\u002FLumina-Next-SFT)\n- [2024-06-19] We have released the `Lumina-T2Audio` (Text-to-Audio) code and model for audio generation. 
[MODEL](https:\u002F\u002Fhuggingface.co\u002FAlpha-VLLM\u002FLumina-T2Audio)\n- [2024-06-17] 🚀🚀🚀 We now support both inference and training (including Dreambooth) of SD3, implemented in our Lumina framework! [CODE](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Ftree\u002Fmain\u002Flumina_next_t2i_mini)\n- **[2024-06-17] 🥰🥰🥰 Lumina-Next supports ComfyUI now, thanks to [Kijai](https:\u002F\u002Fgithub.com\u002Fkijai)! [LINK](https:\u002F\u002Fgithub.com\u002Fkijai\u002FComfyUI-LuminaWrapper)**\n- **[2024-06-08] 🚀🚀🚀 We have released the `Lumina-Next-SFT` model, demonstrating better visual quality! [MODEL](https:\u002F\u002Fhuggingface.co\u002FAlpha-VLLM\u002FLumina-Next-SFT)**\n- [2024-06-07] We have released the `Lumina-T2Music` (Text-to-Music) code and model for music generation. [MODEL](https:\u002F\u002Fhuggingface.co\u002FAlpha-VLLM\u002FLumina-T2Music) [DEMO](http:\u002F\u002F139.196.83.164:8000\u002F)\n- [2024-06-03] We have released the `Compositional Generation` version of `Lumina-Next-T2I`, which enables compositional generation with multiple captions for different regions. [model](https:\u002F\u002Fhuggingface.co\u002FAlpha-VLLM\u002FLumina-Next-T2I). [DEMO](http:\u002F\u002F106.14.2.150:10023\u002F)\n- [2024-05-29] We updated the `Lumina-Next-T2I` [Code](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Ftree\u002Fmain\u002Flumina_next_t2i) and [HF Model](https:\u002F\u002Fhuggingface.co\u002FAlpha-VLLM\u002FLumina-Next-T2I), which now support 2K-resolution image generation and Time-aware Scaled RoPE.\n- [2024-05-25] We released training scripts for Flag-DiT and Next-DiT, and we have reported the comparison results between Next-DiT and Flag-DiT. [Comparison Results](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fblob\u002Fmain\u002FNext-DiT-ImageNet\u002FREADME.md#results)\n- [2024-05-21] Lumina-Next-T2I supports a higher-order solver. It can generate images in just 10 steps without any distillation. 
Try our demos [DEMO](http:\u002F\u002F106.14.2.150:10021\u002F).\n- [2024-05-18] We released training scripts for Lumina-T2I 5B. [README](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Ftree\u002Fmain\u002Flumina_t2i#training)\n- [2024-05-16] ❗❗❗ We have converted the `.pth` weights to `.safetensors` weights. Please pull the latest code and use `demo.py` for inference.\n- [2024-05-14] Lumina-Next now supports simple **text-to-music** generation ([examples](#text-to-music-generation)), **high-resolution (1024*4096) Panorama** generation conditioned on text ([examples](#panorama-generation)), and **3D point cloud** generation conditioned on labels ([examples](#point-cloud-generation)).\n- [2024-05-13] We give [examples](#multilingual-generation) demonstrating Lumina-T2X's capability to support **multilingual prompts**, and even support prompts containing **emojis**.\n- **[2024-05-12] We excitedly released our `Lumina-Next-T2I` model ([checkpoint](https:\u002F\u002Fhuggingface.co\u002FAlpha-VLLM\u002FLumina-Next-T2I)) which uses a 2B Next-DiT model as the backbone and Gemma-2B as the text encoder. Try it out at [demo1](http:\u002F\u002F106.14.2.150:10020\u002F) & [demo2](http:\u002F\u002F106.14.2.150:10021\u002F) & [demo3](http:\u002F\u002F106.14.2.150:10022\u002F). Please refer to the paper [Lumina-Next](assets\u002Flumina-next.pdf) for more details.**\n- [2024-05-10] We released the technical report on [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.05945).\n- [2024-05-09] We released `Lumina-T2A` (Text-to-Audio) Demos. [Examples](#text-to-audio-generation)\n- [2024-04-29] We released the 5B model [checkpoint](https:\u002F\u002Fhuggingface.co\u002FAlpha-VLLM\u002FLumina-T2I) and demo built upon it for text-to-image generation.\n- [2024-04-25] Support 720P video generation with arbitrary aspect ratio. 
[Examples](#text-to-video-generation)\n- [2024-04-19] Demo examples released.\n- [2024-04-05] Code released for `Lumina-T2I`.\n- [2024-04-01] We released the initial version of `Lumina-T2I` for text-to-image generation.\n\n## 🚀 Quick Start\n\n> [!Warning]\n> **Since we are updating the code frequently, please pull the latest code:**\n>\n> ```bash\n> git pull origin main\n> ```\n\n### Fast Demo\n\nLumina-Next is supported in [diffusers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers).\n\n> [!Note]\n> Until diffusers ships a release that includes Lumina-Next, install the development version from the `main` branch:\n> ```bash\n> pip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers\n> ```\n\nThen try the code below:\n\n```python\nimport torch\nfrom diffusers import LuminaText2ImgPipeline\n\n# Hugging Face model repo ID; a local checkpoint directory also works.\npipeline = LuminaText2ImgPipeline.from_pretrained(\n    \"Alpha-VLLM\u002FLumina-Next-SFT-diffusers\", torch_dtype=torch.bfloat16\n).to(\"cuda\")\n\nimage = pipeline(prompt=\"Upper body of a young woman in a Victorian-era outfit with brass goggles and leather straps. Background shows an industrial revolution cityscape with smoky skies and tall, metal structures\", height=1024, width=768).images[0]\n```\n\nFor more details about training and inference with the Lumina framework, please refer to [Lumina-T2I](.\u002Flumina_t2i\u002FREADME.md#Installation), [Lumina-Next-T2I](.\u002Flumina_next_t2i\u002FREADME.md#Installation), and [Lumina-Next-T2I-Mini](.\u002Flumina_next_t2i_mini\u002FREADME.md#Installation). 
We highly recommend using **[Lumina-Next-T2I-Mini](.\u002Flumina_next_t2i_mini\u002FREADME.md#Installation)** for training and inference; it is a greatly simplified version of Lumina-Next-T2I with full functionality.\n\n### GUI Demo\n\nTo help you try our models quickly, we host several GUI demo sites.\n\n#### Lumina-Next-T2I model demo:\n\nImage Generation: [[node1](http:\u002F\u002F106.14.2.150:10020\u002F)] [[node2](http:\u002F\u002F106.14.2.150:10021\u002F)] [[node3](http:\u002F\u002F106.14.2.150:10022\u002F)]\n\nImage Compositional Generation: [[node1](http:\u002F\u002F106.14.2.150:10023\u002F)]\n\nMusic Generation: [[node1](http:\u002F\u002F139.196.83.164:8000)]\n\n\u003C!-- > [!Warning] -->\n\u003C!-- > **Lumina-T2X employs FSDP for training large diffusion models. FSDP shards parameters, optimizer states, and gradients across GPUs. Thus, at least 8 GPUs are required for full fine-tuning of the Lumina-T2X 5B model. Parameter-efficient Finetuning of Lumina-T2X shall be released soon.** -->\n\n### Installation\nTo use `Lumina-T2X` as a library, install it into your environment:\n\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\n```\n\n### Development\nIf you want to contribute to the code, run the commands below to install the package in editable mode and set up `pre-commit`:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\n\ncd Lumina-T2X\npip install -e \".[dev]\"\npre-commit install\npre-commit\n```\n\n## 📑 Open-source Plan\n\n- [X] Lumina-Text2Image (Demos✅, Training✅, Inference✅, Checkpoints✅, Diffusers✅)\n- [ ] Lumina-Text2Video (Demos✅)\n- [X] Lumina-Text2Music (Demos✅, Inference✅, Checkpoints✅)\n- [X] Lumina-Text2Audio (Demos✅, Inference✅, Checkpoints✅)\n\n## 📜 Index of Content\n\n- [$\\\\textbf{Lumina-T2X}$: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion 
Transformers](#textbflumina-t2x-transforming-text-into-any-modality-resolution-and-duration-via-flow-based-large-diffusion-transformers)\n  - [📰 News](#-news)\n  - [🚀 Quick Start](#-quick-start)\n    - [GUI Demo](#gui-demo)\n      - [Lumina-Next-T2I model demo:](#lumina-next-t2i-model-demo)\n    - [Installation](#installation)\n    - [Development](#development)\n  - [📑 Open-source Plan](#-open-source-plan)\n  - [📜 Index of Content](#-index-of-content)\n  - [Introduction](#introduction)\n  - [📽️ Demo Examples](#️-demo-examples)\n    - [Demos of Lumina-Next-SFT](#demos-of-lumina-next-sft)\n    - [Demos of Lumina-T2I](#demos-of-lumina-t2i)\n      - [Panorama Generation](#panorama-generation)\n    - [Text-to-Video Generation](#text-to-video-generation)\n    - [Text-to-3D Generation](#text-to-3d-generation)\n      - [Point Cloud Generation](#point-cloud-generation)\n    - [Text-to-Audio Generation](#text-to-audio-generation)\n    - [Text-to-music Generation](#text-to-music-generation)\n    - [Multilingual Generation](#multilingual-generation)\n  - [⚙️ Diverse Configurations](#️-diverse-configurations)\n  - [Contributors](#contributors)\n  - [📄 Citation](#-citation)\n\n## Introduction\n\nWe introduce the $\\textbf{Lumina-T2X}$ family, a series of text-conditioned Diffusion Transformers (DiT) capable of transforming textual descriptions into vivid images, dynamic videos, detailed multi-view 3D images, and synthesized speech. At the core of Lumina-T2X lies the **Flow-based Large Diffusion Transformer (Flag-DiT)**—a robust engine that supports up to **7 billion parameters** and extends sequence lengths to **128,000** tokens. 
Drawing inspiration from Sora, Lumina-T2X integrates images, videos, multi-views of 3D objects, and speech spectrograms within a spatial-temporal latent token space, and can generate outputs at **any resolution, aspect ratio, and duration**.\n\n🌟 **Features**:\n\n- **Flow-based Large Diffusion Transformer (Flag-DiT)**: Lumina-T2X adopts the **flow matching** formulation and is equipped with many advanced techniques, such as RoPE, RMSNorm, and KQ-norm, **demonstrating faster training convergence, stable training dynamics, and a simplified pipeline**.\n- **Any Modality, Resolution, and Duration within One Framework**:\n  1. $\\textbf{Lumina-T2X}$ can **encode any modality, including images, videos, multi-views of 3D objects, and spectrograms, into a unified 1-D token sequence at any resolution, aspect ratio, and temporal duration.**\n  2. By introducing the `[nextline]` and `[nextframe]` tokens, our model can **support resolution extrapolation**, i.e., generating images\u002Fvideos with out-of-domain resolutions **not encountered during training**, such as images from 768x768 to 1792x1792 pixels.\n- **Low Training Resources**: Our empirical observations indicate that employing larger models,\n  high-resolution images, and longer-duration video clips can **significantly accelerate the convergence**\n  **speed** of diffusion transformers. Moreover, by employing meticulously curated text-image and text-video pairs featuring high aesthetic quality frames and detailed captions, our $\\textbf{Lumina-T2X}$ model learns to generate high-resolution images and coherent videos with minimal computational demands. 
Remarkably, the default Lumina-T2I configuration, equipped with a 5B Flag-DiT and a 7B LLaMA as the text encoder, **requires only 35% of the computational resources of PixArt-**$\\alpha$.\n\n![framework](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F60d2f248-67b1-43ef-a530-c75530cf26c5)\n\n## 📽️ Demo Examples\n\n### Demos of Lumina-Next-SFT\n\n![github_banner](https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F926adf8c-3f34-44ed-8ff6-5eb650b9712c)\n\n### Demos of Visual Anagrams\n\n![](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F7a200023-6e85-4209-96f1-49e0ddadf021)\n\n![](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F8006da1f-18be-45a0-b292-e1f2ef1e029a)\n\n### Demos of Lumina-T2I\n\n\u003Cp align=\"center\">\n \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F27bd36a8-8411-47dd-a3a7-3607c1d5d644\" width=\"90%\"\u002F>\n \u003Cbr>\n\u003C\u002Fp>\n\n#### Panorama Generation\n\n\u003Cp align=\"center\">\n \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F86041420\u002F88b75b4e-5e16-4ea3-aba8-134904dd3381\" width=\"90%\"\u002F>\n \u003Cbr>\n\u003C\u002Fp>\n\n### Text-to-Video Generation\n\n**720P Videos:**\n\n**Prompt:** The majestic beauty of a waterfall cascading down a cliff into a serene lake.\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F17187de8-7a07-49a8-92f9-fdb8e2f5e64c\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F0a20bb39-f6f7-430f-aaa0-7193a71b256a\n\n**Prompt:** A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. 
She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F7bf9ce7e-f454-4430-babe-b14264e0f194\n\n**360P Videos:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002Fd7fec32c-3655-4fd1-aa14-c0cb3ace3845\n\n### Text-to-3D Generation\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002Fcd061b8d-c47b-4c0c-b775-2cbaf8014be9\n\n#### Point Cloud Generation\n\n\u003Cp align=\"center\">\n \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F86041420\u002F742237ad-be47-4a7d-aa11-b3aaba07a75a\" width=\"90%\"\u002F>\n \u003Cbr>\n\u003C\u002Fp>\n\n### Text-to-Audio Generation\n\n> [!Note]\n> **Attention: Mouse over the playbar and click the audio button on the playbar to unmute it.**\n\n\u003C!-- > 🌟🌟🌟 **We recommend visiting the Lumina website to try it out! 
[🌟 visit](https:\u002F\u002Flumina-t2-x-web.vercel.app\u002Fdocs\u002Fdemos\u002Fdemo-of-audio)** -->\n\n**Prompt:** Semiautomatic gunfire occurs with slight echo\n\n**Generated Audio:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F25f2a6a8-0386-41e8-ab10-d1303554b944\n\n**Groundtruth:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F6722a68a-1a5a-4a44-ba9c-405372dc27ef\n\n**Prompt:** A telephone bell rings\n\n**Generated Audio:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F7467dd6d-b163-4436-ac5b-36662d1f9ddf\n\n**Groundtruth:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F703ea405-6eb4-4161-b5ff-51a93f81d013\n\n**Prompt:** An engine running followed by the engine revving and tires screeching\n\n**Generated Audio:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F5d9dd431-b8b4-41a0-9e78-bb0a234a30b9\n\n**Groundtruth:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F9ca4af9e-cee3-4596-b826-d6c25761c3c1\n\n**Prompt:** Birds chirping with insects buzzing and outdoor ambiance\n\n**Generated Audio:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002Fb776aacb-783b-4f47-bf74-89671a17d38d\n\n**Groundtruth:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002Fa11333e4-695e-4a8c-8ea1-ee5b83e34682\n\n### Text-to-music Generation\n\n> [!Note]\n> **Attention: Mouse over the playbar and click the audio button on the playbar to unmute it.**\n> For more details check out [this](.\u002Flumina_music\u002FREADME.md)\n\n**Prompt:** An electrifying ska tune with prominent saxophone riffs, energetic e-guitar and acoustic drums, lively percussion, soulful keys, groovy e-bass, and a fast tempo that exudes 
uplifting energy.\n\n**Generated Music:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F86041420\u002Ffef8f6b9-1e77-457e-bf4b-fb0cccefa0ec\n\n**Prompt:** A high-energy synth rock\u002Fpop song with fast-paced acoustic drums, a triumphant brass\u002Fstring section, and a thrilling synth lead sound that creates an adventurous atmosphere.\n\n**Generated Music:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F86041420\u002F1f796046-64ab-44ed-a4d8-0ebc0cfc484f\n\n**Prompt:** An uptempo electronic pop song that incorporates digital drums, digital bass and synthpad sounds.\n\n**Generated Music:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F86041420\u002F4768415e-436a-4d0e-af53-bf7882cb94cd\n\n**Prompt:** A medium-tempo digital keyboard song with a jazzy backing track featuring digital drums, piano, e-bass, trumpet, and acoustic guitar.\n\n**Generated Music:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F86041420\u002F8994a573-e776-488b-a86c-4398a4362398\n\n**Prompt:** This low-quality folk song features groovy wooden percussion, bass, piano, and flute melodies, as well as sustained strings and shimmering shakers that create a passionate, happy, and joyful atmosphere.\n\n**Generated Music:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F86041420\u002Fe0b5d197-589c-47d6-954b-b9c1d54feebb\n\n### Multilingual Generation\n\nWe present three multilingual capabilities of Lumina-Next-2B.\n\n**Generating Images conditioned on Chinese poems:**\n\n\u003Cp align=\"center\">\n \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F86041420\u002F9aa79d67-e304-4867-81f3-cfc934c625d9\" width=\"90%\"\u002F>\n \u003Cbr>\n\u003C\u002Fp>\n\n**Generating Images with multilingual prompts:**\n\n\u003Cp align=\"center\">\n \u003Cimg 
src=\"https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F86041420\u002F7c62bb94-42e4-4525-a298-9e25475b511d\" width=\"90%\"\u002F>\n \u003Cbr>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F86041420\u002F07fc8138-e67c-4c9f-bc01-e749a6507ada\" width=\"90%\"\u002F>\n \u003Cbr>\n\u003C\u002Fp>\n\n**Generating Images with emojis:**\n\n\u003Cp align=\"center\">\n \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F86041420\u002F980b4999-9d1c-4fbd-a695-88b6b675f34b\" width=\"90%\"\u002F>\n \u003Cbr>\n\u003C\u002Fp>\n\n\u003C!--\n**Prompt:** Water trickling rapidly and draining\n\n**Generated Audio:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F88fcf0e1-b71a-4e94-b9a6-138db6a670f0\n\n**Groundtruth:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F6fb9963f-46a5-4020-b160-f9a004528d7e\n\n**Prompt:** Thunderstorm sounds while raining\n\n**Generated Audio:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002Ffad8baf3-d80b-4915-ba31-aab13db5ce06\n\n**Groundtruth:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002Fc01a7e6e-3421-4a28-93c5-831523ec061d\n\n**Prompt:** Birds chirping repeatedly\n\n**Generated Audio:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F0fa673a3-f9de-487b-8812-1f96a335e913\n\n**Groundtruth:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F718289f9-a93e-4ea9-b7db-a14c2b209b28\n\n**Prompt:** Several large bells ring\n\n**Generated 
Audio:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F362fde84-e4ae-4152-aeb5-4355155c8719\n\n**Groundtruth:**\n\nhttps:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002Fda93e13d-6462-48d2-b6dc-af6ff0c4d07d\n\n-->\n\n\u003C!-- For more audio demos visit [lumina website - audio demos](https:\u002F\u002Flumina-t2-x-web.vercel.app\u002Fdocs\u002Fdemos\u002Fdemo-of-audio) -->\n\n\u003C!-- ### More examples -->\n\n\u003C!-- For more demos visit [this website](https:\u002F\u002Flumina-t2-x-web.vercel.app\u002Fdocs\u002Fdemos) -->\n\n\u003C!-- ### High-res. Image Editing\n\n\u003Cp align=\"center\">\n \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F55981976-c989-4f07-982a-1e567c7078ef\" width=\"90%\"\u002F>\n \u003Cbr>\n \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002Fa1ac7190-c49c-4d8b-965c-9ccf83a4f6a7\" width=\"90%\"\u002F>\n\u003C\u002Fp>\n\n### Compositional Generation\n\n\u003Cp align=\"center\">\n \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F8c8eb921-134c-4f55-918a-0ad07f9a47f4\" width=\"90%\"\u002F>\n \u003Cbr>\n\u003C\u002Fp>\n\n### Resolution Extrapolation\n\n\u003Cp align=\"center\">\n \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002Fe37e2db7-3ead-451e-ba18-b375eb773578\" width=\"90%\"\u002F>\n \u003Cbr>\n \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F9da47c34-5e09-48d3-9c48-78663fd01cc8\" width=\"100%\"\u002F>\n\u003C\u002Fp>\n\n### Consistent-Style Generation\n\n\u003Cp align=\"center\">\n \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F6403417a-42c6-4048-9419-375d211e14bb\" width=\"90%\"\u002F>\n 
\u003Cbr>\n\u003C\u002Fp> -->\n\n## ⚙️ Diverse Configurations\n\nWe support diverse configurations, including text encoders, DiTs of different parameter sizes, inference methods, and VAE encoders. Additionally, we offer features such as 1D-RoPE, image enhancement, and more.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fassets\u002F54879512\u002F221de325-d9fb-4b7e-a97c-4b24cd2df0fc\" width=\"100%\"\u002F>\n \u003Cbr>\n\u003C\u002Fp>\n\n## Contributors\n\nCore members for code development and maintenance:\n\nDongyang Liu, Le Zhuo, Junlin Xie, Ruoyi Du, Peng Gao\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FAlpha-VLLM\u002FLumina-T2X\u002Fgraphs\u002Fcontributors\">\n  \u003Cimg src=\"https:\u002F\u002Fcontrib.rocks\u002Fimage?repo=Alpha-VLLM\u002FLumina-T2X\" \u002F>\n\u003C\u002Fa>\n\n## 📄 Citation\n\n```\n@article{gao2024lumina-next,\n  title={Lumina-Next: Making Lumina-T2X Stronger and Faster with Next-DiT},\n  author={Zhuo, Le and Du, Ruoyi and Han, Xiao and Li, Yangguang and Liu, Dongyang and Huang, Rongjie and Liu, Wenze and others},\n  journal={arXiv preprint arXiv:2406.18583},\n  year={2024}\n}\n```\n\n```\n@article{gao2024lumin-t2x,\n  title={Lumina-T2X: Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers},\n  author={Gao, Peng and Zhuo, Le and Liu, Chris and Du, Ruoyi and Luo, Xu and Qiu, Longtian and Zhang, Yuhang and others},\n  journal={arXiv preprint arXiv:2405.05945},\n  year={2024}\n}\n\n```\n\n\u003C!--\n## Star History\n\n [![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=Alpha-VLLM\u002FLumina-T2X&type=Date)](https:\u002F\u002Fstar-history.com\u002F#Alpha-VLLM\u002FLumina-T2X&Date) -->\n","Lumina-T2X is a unified framework for text-to-any-modality generation. The project uses flow-based large diffusion transformers to convert text into content of various modalities, such as images and videos, at different resolutions and durations. Its core capabilities include multimodal generation, high-resolution output, and flexible duration control. It is implemented in Python and has been presented at ICLR 2025 and NeurIPS 
2024. It suits applications that need to generate high-quality visual content automatically from text descriptions, such as creative design, advertising production, and multimedia content generation.",2,"2026-06-11 03:42:29","high_star"]