[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72433":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},72433,"xDiT","xdit-project\u002FxDiT","xdit-project","xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism","",null,"Python",2633,321,79,86,0,5,7,22,15,29.52,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:03:03","\u003Cdiv align=\"center\">\n  \u003C!-- \u003Ch1>KTransformers\u003C\u002Fh1> -->\n  \u003Cp align=\"center\">\n\n  \u003Cpicture>\n    \u003Cimg alt=\"xDiT\" src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fxdit-project\u002Fxdit_assets\u002Fmain\u002FXDiTlogo.png\" width=\"50%\">\n\n  \u003C\u002Fp>\n  \u003Ch3>A Scalable Inference Engine for Diffusion Transformers (DiTs) on Multiple Computing Devices\u003C\u002Fh3>\n  \u003Ca href=\"#cite-us\">📝 Papers\u003C\u002Fa> | \u003Ca href=\"#QuickStart\">🚀 Quick Start\u003C\u002Fa> | \u003Ca href=\"#support-dits\">🎯 Supported DiTs\u003C\u002Fa> | \u003Ca href=\"#dev-guide\">📚 Dev Guide \u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fxdit-project\u002FxDiT\u002Fdiscussions\">📈  Discussion \u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fmedium.com\u002F@xditproject\">📝 Blogs\u003C\u002Fa>\u003C\u002Fstrong>\n  
\u003Cp>\u003C\u002Fp>\n\n[![](https:\u002F\u002Fdcbadge.limes.pink\u002Fapi\u002Fserver\u002Fhttps:\u002F\u002Fdiscord.gg\u002FYEWzWfCF9S)](https:\u002F\u002Fdiscord.gg\u002FYEWzWfCF9S)\n\n\u003C\u002Fdiv>\n\n\u003Ch2 id=\"agenda\">Table of Contents\u003C\u002Fh2>\n\n- [🔥 Meet xDiT](#meet-xdit)\n- [📢 Open-source Community](#updates)\n- [🎯 Supported DiTs](#support-dits)\n- [📈 Performance](#perf)\n- [🚀 QuickStart](#QuickStart)\n- [🖼️ ComfyUI with xDiT](#comfyui)\n- [✨ xDiT's Arsenal](#secrets)\n  - [Parallel Methods](#parallel)\n    - [1. PipeFusion](#PipeFusion)\n    - [2. Unified Sequence Parallel](#USP)\n    - [3. Hybrid Parallel](#hybrid_parallel)\n    - [4. CFG Parallel](#cfg_parallel)\n    - [5. Parallel VAE](#parallel_vae)\n  - [Single GPU Acceleration](#1gpuacc)\n    - [Compilation Acceleration](#compilation)\n    - [Cache Acceleration](#cache_acceleration)\n- [📚  Develop Guide](#dev-guide)\n- [🚧  History and Looking for Contributions](#history)\n- [📝 Cite Us](#cite-us)\n\n\n\u003Ch2 id=\"meet-xdit\">🔥 Meet xDiT\u003C\u002Fh2>\n\nDiffusion Transformers (DiTs) are driving advancements in high-quality image and video generation.\nWith the escalating input context length in DiTs, the computational demand of the Attention mechanism grows **quadratically**!\nConsequently, multi-GPU and multi-machine deployments are essential to meet the **real-time** requirements in online services.\n\n\n\u003Ch3 id=\"meet-xdit-parallel\">Parallel Inference\u003C\u002Fh3>\n\nTo meet real-time demand for DiTs applications, parallel inference is a must.\nxDiT is an inference engine designed for the parallel deployment of DiTs on a large scale.\nxDiT provides a suite of efficient parallel approaches for Diffusion Models, as well as computation accelerations.\n\nThe overview of xDiT is shown as follows.\n\n\u003Cpicture>\n  \u003Cimg alt=\"xDiT\" 
src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fxdit-project\u002Fxdit_assets\u002Fmain\u002Fmethods\u002Fxdit_overview.png\">\n\u003C\u002Fpicture>\n\n\n1. Sequence Parallelism, [USP](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.07719) is a unified sequence parallel approach proposed by us combining DeepSpeed-Ulysses, Ring-Attention.\n\n2. [PipeFusion](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.14430), a sequence-level pipeline parallelism, similar to [TeraPipe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.07988) but takes advantage of the input temporal redundancy characteristics of diffusion models.\n\n3. Data Parallel: Processes multiple prompts or generates multiple images from a single prompt in parallel across images.\n\n4. CFG Parallel, also known as Split Batch: Activates when using classifier-free guidance (CFG) with a constant parallelism of 2.\n\nThe four parallel methods in xDiT can be configured in a hybrid manner, optimizing communication patterns to best suit the underlying network hardware.\n\nAs shown in the following picture, xDiT offers a set of APIs to adapt DiT models in [huggingface\u002Fdiffusers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdiffusers) to hybrid parallel implementation through simple wrappers.\nIf the model you require is not available in the model zoo, developing it by yourself is not so difficult; please refer to our [Dev Guide](#dev-guide).\n\nWe also have implemented the following parallel strategies for reference:\n\n1. Tensor Parallelism\n2. 
[DistriFusion](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.19481)\n\n\u003Ch3 id=\"meet-xdit-cache\">Cache Acceleration\u003C\u002Fh3>\n\nCache methods, including [TeaCache](https:\u002F\u002Fgithub.com\u002Fali-vilab\u002FTeaCache.git), [First-Block-Cache](https:\u002F\u002Fgithub.com\u002Fchengzeyi\u002FParaAttention.git) and [DiTFastAttn](https:\u002F\u002Fgithub.com\u002Fthu-nics\u002FDiTFastAttn), exploit computational redundancies between different steps of the Diffusion Model to accelerate inference on a single GPU.\n\n\u003Ch3 id=\"meet-xdit-perf\">Computing Acceleration\u003C\u002Fh3>\n\nThese optimizations are orthogonal to parallelism and focus on accelerating performance on a single GPU.\n\nxDiT employs a series of kernel acceleration methods. In addition to utilizing well-known Attention optimization libraries, we leverage compilation acceleration technologies such as `torch.compile` and `onediff`.\n\n\n\u003Ch2 id=\"updates\">📢 Open-source Community \u003C\u002Fh2>\n\nThe following open-source DiT models were released with xDiT support on day 1.\n\n[HunyuanVideo](https:\u002F\u002Fgithub.com\u002FTencent\u002FHunyuanVideo) ![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FTencent\u002FHunyuanVideo?style=social)\n\n[StepVideo](https:\u002F\u002Fgithub.com\u002Fstepfun-ai\u002FStep-Video-T2V) ![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fstepfun-ai\u002FStep-Video-T2V?style=social)\n\n[SkyReels-V1](https:\u002F\u002Fgithub.com\u002FSkyworkAI\u002FSkyReels-V1) ![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FSkyworkAI\u002FSkyReels-V1?style=social)\n\n[Wan2.1](https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.1) ![GitHub Repo stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FWan-Video\u002FWan2.1?style=social)\n\n\n\n\u003Ch2 id=\"support-dits\">🎯 Supported DiTs\u003C\u002Fh2>\n\n\u003Cdiv align=\"center\">\n\n| Model Name | CFG | SP 
| PipeFusion | TP | MR* | Performance Report Link |\n| --- | --- | --- | --- | --- | --- | --- |\n| [🎬 StepVideo](https:\u002F\u002Fhuggingface.co\u002Fstepfun-ai\u002Fstepvideo-t2v) | NA | ✔️ | ❎ | ✔️ | ❎ | [Report](.\u002Fdocs\u002Fperformance\u002Fstepvideo.md) |\n| [🎬 HunyuanVideo](https:\u002F\u002Fgithub.com\u002FTencent\u002FHunyuanVideo) | NA | ✔️ | ❎ | ❎ | ✔️ | [Report](.\u002Fdocs\u002Fperformance\u002Fhunyuanvideo.md) |\n| [🎬 HunyuanVideo-1.5](https:\u002F\u002Fgithub.com\u002FTencent-Hunyuan\u002FHunyuanVideo-1.5) | ❎ | ✔️ | ❎ | ❎ | ✔️ | NA |\n| [🎬 ConsisID-Preview](https:\u002F\u002Fgithub.com\u002FPKU-YuanGroup\u002FConsisID) | ✔️ | ✔️ | ❎ | ❎ | ❎ | [Report](.\u002Fdocs\u002Fperformance\u002Fconsisid.md) |\n| [🎬 CogVideoX1.5](https:\u002F\u002Fhuggingface.co\u002FTHUDM\u002FCogVideoX1.5-5B) | ✔️ | ✔️ | ❎ | ❎ | ❎ | [Report](.\u002Fdocs\u002Fperformance\u002Fcogvideo.md) |\n| [🎬 Mochi-1](https:\u002F\u002Fgithub.com\u002Fxdit-project\u002Fmochi-xdit) | ✔️ | ✔️ | ❎ | ❎ | ❎ | [Report](https:\u002F\u002Fgithub.com\u002Fxdit-project\u002Fmochi-xdit) |\n| [🎬 CogVideoX](https:\u002F\u002Fhuggingface.co\u002FTHUDM\u002FCogVideoX-2b) | ✔️ | ✔️ | ❎ | ❎ | ❎ | [Report](.\u002Fdocs\u002Fperformance\u002Fcogvideo.md) |\n| [🎬 Latte](https:\u002F\u002Fhuggingface.co\u002Fmaxin-cn\u002FLatte-1) | ❎ | ✔️ | ❎ | ❎ | ❎ | [Report](.\u002Fdocs\u002Fperformance\u002Flatte.md) |\n| [🎬 Wan2.1](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.1-T2V-14B-Diffusers) | ❎ | ✔️ | ❎ | ❎ | ✔️ | NA |\n| [🎬 Wan2.2](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-I2V-A14B-Diffusers) | ❎ | ✔️ | ❎ | ❎ | ✔️ | NA |\n| [🎬 CausalWan2.2](https:\u002F\u002Fhuggingface.co\u002FFastVideo\u002FCausalWan2.2-I2V-A14B-Preview-Diffusers) | ❎ | ❎ | ❎ | ❎ | ✔️ | NA |\n| [🎬 LTX-2](https:\u002F\u002Fhuggingface.co\u002FLightricks\u002FLTX-2) | ❎ | ✔️ | ❎ | ❎ | ✔️ | NA |\n| [🔵 HunyuanDiT-v1.2-Diffusers](https:\u002F\u002Fhuggingface.co\u002FTencent-Hunyuan\u002FHunyuanDiT-v1.2-Diffusers) | ✔️ | 
✔️ | ✔️ | ❎ | ❎ | [Report](.\u002Fdocs\u002Fperformance\u002Fhunyuandit.md) |\n| [🔴 Z-Image Turbo](https:\u002F\u002Fhuggingface.co\u002FTongyi-MAI\u002FZ-Image-Turbo) | ❎ | ✔️ | ❎ | ❎ | ✔️ | NA |\n| [🟠 Flux 2 klein](https:\u002F\u002Fhuggingface.co\u002Fblack-forest-labs\u002FFLUX.2-klein-9B) | ❎ | ✔️ | ❎ | ❎ | ✔️ | NA |\n| [🟠 Flux 2](https:\u002F\u002Fhuggingface.co\u002Fblack-forest-labs\u002FFLUX.2-dev) | ❎ | ✔️ | ❎ | ❎ | ✔️ | NA |\n| [🟠 Flux](https:\u002F\u002Fhuggingface.co\u002Fblack-forest-labs\u002FFLUX.1-schnell) | NA | ✔️ | ✔️ | ❎ | ✔️ | [Report](.\u002Fdocs\u002Fperformance\u002Fflux.md) |\n| [🟠 Flux Kontext](https:\u002F\u002Fhuggingface.co\u002Fblack-forest-labs\u002FFLUX.1-Kontext-dev) | ❎ | ✔️ |  ❎ | ❎ | ✔️ | NA |\n| [🟢 Qwen Image](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen-Image-2512) | ❎ | ✔️ | ❎ | ❎ | ✔️ | NA |\n| [🟢 Qwen Image-Edit](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen-Image-Edit-2511) | ❎ | ✔️ | ❎ | ❎ | ✔️ | NA |\n| [🔴 PixArt-Sigma](https:\u002F\u002Fhuggingface.co\u002FPixArt-alpha\u002FPixArt-Sigma-XL-2-1024-MS) | ✔️ | ✔️ | ✔️ | ❎ | ❎ | [Report](.\u002Fdocs\u002Fperformance\u002Fpixart_alpha_legacy.md) |\n| [🟢 PixArt-alpha](https:\u002F\u002Fhuggingface.co\u002FPixArt-alpha\u002FPixArt-alpha) | ✔️ | ✔️ | ✔️ | ❎ | ❎ | [Report](.\u002Fdocs\u002Fperformance\u002Fpixart_alpha_legacy.md) |\n| [🟠 Stable Diffusion 3](https:\u002F\u002Fhuggingface.co\u002Fstabilityai\u002Fstable-diffusion-3-medium-diffusers) | ✔️ | ✔️ | ✔️ | ❎ | ✔️ | [Report](.\u002Fdocs\u002Fperformance\u002Fsd3.md) |\n| [🟤 SANA](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FSana\u002Fblob\u002Fmain\u002Fasset\u002Fdocs\u002Fmodel_zoo.md) | ✔️ | ✔️ | ✔️ | ❎ | ❎ | [Report](.\u002Fdocs\u002Fperformance\u002Fsana.md) |\n| [⚫ SANA Sprint](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FSana\u002Fblob\u002Fmain\u002Fasset\u002Fdocs\u002Fmodel_zoo.md#sana-sprint) | NA | ✔️ | ❎ | ❎ | ❎ | NA |\n| [🟣 
SDXL](https:\u002F\u002Fhuggingface.co\u002Fstabilityai\u002Fstable-diffusion-xl-base-1.0) | ✔️ | ❎ | ❎ | ❎ | ❎ | NA |\n\nMR* = Model is runnable via the model runner. If not, it's runnable via the provided example scripts.\n\n\u003C\u002Fdiv>\n\n\n\n\n\n\n\u003Ch2 id=\"comfyui\">🖼️ TACO-DiT: ComfyUI with xDiT\u003C\u002Fh2>\n\nComfyUI is the most popular web-based Diffusion Model interface optimized for workflows.\nIt provides users with a UI platform for image generation and supports plugins such as LoRA, ControlNet, and IP-Adapter. However, it was designed for single-GPU use and struggles with the demands of today's large DiTs, resulting in unacceptably high latency for models like Flux.1.\n\nUsing our commercial project **TACO-DiT**, a closed-source ComfyUI variant built with xDiT, we have implemented a multi-GPU parallel processing workflow within ComfyUI that addresses Flux.1's performance challenges. Below is an example of using TACO-DiT to accelerate a Flux workflow with LoRA:\n\n![ComfyUI xDiT Demo](https:\u002F\u002Fraw.githubusercontent.com\u002Fxdit-project\u002Fxdit_assets\u002Fmain\u002Fcomfyui\u002Fflux-demo.gif)\n\nWith TACO-DiT, you can significantly reduce the inference latency of your ComfyUI workflow and boost throughput with multiple GPUs. 
Now it is compatible with multiple Plug-ins, including ControlNet and LoRAs.\n\nMore features and details can be found in our Intro Video:\n+ [[YouTube] TACO-DiT: Accelerating Your ComfyUI Generation Experience](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=7DXnGrARqys)\n+ [[Bilibili] TACO-DiT: 加速你的ComfyUI生成体验](https:\u002F\u002Fwww.bilibili.com\u002Fvideo\u002FBV18tU7YbEra\u002F?vd_source=59c1f990379162c8f596974f34224e4f)\n\nThe blog article is also available: [Supercharge Your AIGC Experience: Leverage xDiT for Multiple GPU Parallel in ComfyUI Flux.1 Workflow](https:\u002F\u002Fmedium.com\u002F@xditproject\u002Fsupercharge-your-aigc-experience-leverage-xdit-for-multiple-gpu-parallel-in-comfyui-flux-1-54b34e4bca05).\n\nComfyUI plugin for xDiT is now available: [xdit-comfyui-private](https:\u002F\u002Fgithub.com\u002Fxdit-project\u002Fxdit-comfyui-private)\n\n\u003Ch2 id=\"QuickStart\">🚀 QuickStart\u003C\u002Fh2>\n\n### 1. Install from pip\n\nAbout `diffusers` version:\n- Different models may require different diffusers versions. Model implementations can vary between diffusers versions, especially for latest models, which affects parallel processing. When encountering model execution errors, you may need to try several recent diffusers versions.\n- While we specify a diffusers version in `setup.py`, newer models may require later versions or even need to be installed from main branch.\n- Limited list of validated diffusers versions can be seen [here](#7-limitations).\n\n`flash_attn` is an optional library that can be installed with xDiT. More supported attention backends can be seen [here](#6-supported-attention-backends).\n\n```\npip install xfuser  # Basic installation\npip install \"xfuser[flash-attn]\"  # With flash attention\n```\n\n### 2. Install from source\n\n```\npip install -e .\n# Or optionally, with flash attention\npip install -e \".[flash-attn]\"\n```\n\nNote that we use two self-maintained packages:\n\n1. 
[yunchang](https:\u002F\u002Fgithub.com\u002Ffeifeibear\u002Flong-context-attention)\n2. [DistVAE](https:\u002F\u002Fgithub.com\u002Fxdit-project\u002FDistVAE)\n\nThe [flash_attn](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention) version used with yunchang should be >= 2.6.0.\n\n### 3. Docker\n\nWe provide a docker image for developers to develop with xDiT. The docker image is [thufeifeibear\u002Fxdit-dev](https:\u002F\u002Fhub.docker.com\u002Fr\u002Fthufeifeibear\u002Fxdit-dev). For running with AMD GPUs (MI300X or newer), a monthly image with validated support for select models is available as well: [rocm\u002Fpytorch-xdit](https:\u002F\u002Fhub.docker.com\u002Fr\u002Frocm\u002Fpytorch-xdit)\n\n### 4. Usage\n\n#### Using model runner\n\nThe xDiT Model Runner provides a single entry point for running most supported diffusion models with proper benchmarking and profiling support. To use it, simply run:\n\n```bash\nxdit --model FLUX.1-dev \\\n    --prompt \"A cat running in a garden\" \\\n    --ulysses_degree 8\n```\n\nThe runner does not support all older models. For those, use the example scripts below. More information on how to run the model runner is available [here](docs\u002Frunner\u002Frunner.md).\n\n#### Using example scripts\n\nWe provide examples demonstrating how to run models with xDiT in the [.\u002Fexamples\u002F](.\u002Fexamples\u002F) directory.\nYou can modify the model type, model directory, and parallel options in [examples\u002Frun.sh](examples\u002Frun.sh) to run the DiT models that are already supported.\n\n```bash\nbash examples\u002Frun.sh\n```\n\nCombining multiple parallelism techniques is essential for efficient scaling.\n**The product of all parallel degrees must match the number of devices**.\nNote that `use_cfg_parallel` implies a CFG parallel degree of 2. 
For instance, you can combine CFG, PipeFusion, and sequence parallelism with the command below to generate an image of a cute dog through hybrid parallelism.\nHere ulysses_degree * pipefusion_parallel_degree * cfg_degree(use_cfg_parallel) == number of devices == 8.\n\n\n```bash\ntorchrun --nproc_per_node=8 \\\nexamples\u002Fpixartalpha_example.py \\\n--model models\u002FPixArt-XL-2-1024-MS \\\n--pipefusion_parallel_degree 2 \\\n--ulysses_degree 2 \\\n--num_inference_steps 20 \\\n--warmup_steps 0 \\\n--prompt \"A cute dog\" \\\n--use_cfg_parallel\n```\n\n⚠️ Applying PipeFusion requires setting `warmup_steps`, also required in DistriFusion, typically set to a small number compared with `num_inference_steps`.\nThe warmup step impacts the efficiency of PipeFusion as it cannot be executed in parallel, thus degrading to a serial execution.\nWe observed that a warmup of 0 had no effect on the PixArt model.\nUsers can tune this value according to their specific tasks.\n\n### 5. Launch an HTTP Service\n\nYou can also launch an HTTP service to generate images with xDiT.\n\n[Launching a Text-to-Image Http Service](.\u002Fdocs\u002Fdeveloper\u002FHttp_Service.md)\n\n### 6. 
Supported attention backends\n\nWhen initializing the runtime, xDiT checks which attention backends are installed and available and chooses the fastest one automatically.\nThis behaviour can be overridden via the command-line argument `--attention-backend \u003Cbackend cli name>`.\n\nSeveral different attention backends are supported:\n\n| Backend name | CLI name |\n| --- | --- |\n| [SDPA](https:\u002F\u002Fdocs.pytorch.org\u002Fdocs\u002Fstable\u002Fgenerated\u002Ftorch.nn.functional.scaled_dot_product_attention.html) | sdpa |\n| [SDPA - Math](https:\u002F\u002Fdocs.pytorch.org\u002Fdocs\u002Fstable\u002Fgenerated\u002Ftorch.nn.functional.scaled_dot_product_attention.html) | sdpa_math |\n| [SDPA - Memory Efficient](https:\u002F\u002Fdocs.pytorch.org\u002Fdocs\u002Fstable\u002Fgenerated\u002Ftorch.nn.functional.scaled_dot_product_attention.html) | sdpa_efficient |\n| [SDPA - Flash](https:\u002F\u002Fdocs.pytorch.org\u002Fdocs\u002Fstable\u002Fgenerated\u002Ftorch.nn.functional.scaled_dot_product_attention.html) | sdpa_flash |\n| [cuDNN](https:\u002F\u002Fdocs.nvidia.com\u002Fdeeplearning\u002Fcudnn\u002Ffrontend\u002Flatest\u002Foperations\u002FAttention.html) | cudnn |\n| [FAv2](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention) | flash |\n| [FAv3](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention\u002Ftree\u002Fmain\u002Fhopper) | flash_3 |\n| [FAv3 FP8](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention\u002Ftree\u002Fmain\u002Fhopper) | flash_3_fp8 |\n| [Transformer Engine FP8](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTransformerEngine) | nvte_fp8 |\n| [FAv4](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention\u002Ftree\u002Fmain\u002Fflash_attn\u002Fcute) | flash_4 |\n| [FAv4 FP4](https:\u002F\u002Fgithub.com\u002Fhao-ai-lab\u002Fflash-attention-fp4) | flash_4_fp4 |\n| [SAGE](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002FSageAttention) | sage |\n| 
[AITER](https:\u002F\u002Fgithub.com\u002Frocm\u002Faiter) | aiter |\n| [AITER FP8](https:\u002F\u002Fgithub.com\u002Frocm\u002Faiter) | aiter_fp8 |\n| [AITER Sage](https:\u002F\u002Fgithub.com\u002Frocm\u002Faiter) | aiter_sage |\n| [AITER Sage V2](https:\u002F\u002Fgithub.com\u002Frocm\u002Faiter) | aiter_sage_v2 |\n| [AITER Sparse Sage](https:\u002F\u002Fgithub.com\u002Frocm\u002Faiter) | aiter_sparse_sage |\n| [AITER Sparse Sage V2](https:\u002F\u002Fgithub.com\u002Frocm\u002Faiter) | aiter_sparse_sage_v2 |\n| [AITER MLA](https:\u002F\u002Fgithub.com\u002Frocm\u002Faiter) | aiter_mla |\n\nxDiT comes with `flash_attn` as an optional install requirement, as it currently supports the largest variety of different GPU architectures.\nHowever, newer implementations generally offer better performance. If available for you, we highly recommend using `cuDNN`, `FAv3`, `FAv3 FP8` (on _Hopper_ GPUs) or `FAv4`, `Transformer Engine FP8` (on _Blackwell_ GPUs).\nOn recent AMD GPUs (MI300X or newer) it is generally recommended to use `AITER` in all cases to get the best possible performance. Note that when using `AITER FP8` as the attention backend with `torch.compile`, it is important to use a version of `AITER` from Jan 16, 2026 or later. Older versions may trigger a bug related to fake tensors, resulting in a runtime error.\n\nPure FP8 attention can introduce visual artifacts in the output video. To mitigate this, xDiT supports hybrid attention, which runs the first and last N diffusion steps with a high-precision backend and the remaining steps with a low-precision one. Enable it with the following flags:\n\n| Flag | Description |\n| --- | --- |\n| `--use_hybrid_attn_schedule` | Enable the hybrid attention scheduler |\n| `--hybrid_attn_high_precision_backend` | Backend for the first\u002Flast N steps (e.g. aiter) |\n| `--hybrid_attn_low_precision_backend` | Backend for the remaining steps (e.g. aiter_fp8) |\n| `--num_hybrid_attn_high_precision_steps` | Number of steps N at the start and end that use high precision |\n\nAITER now supports Sage (FP8) and Sage v2 (MXFP4, GFX950 only) for improved quality when doing quantized attention. We still recommend combining Sage v2 with hybrid attention, using either AITER Sage or plain AITER as the high-precision backend.\n\nThere is also experimental support for Sparse Sage attention backends, which currently work only for Hunyuan1.5-Sparse (Distilled model). Running Sparse Sage on multiple GPUs can suffer from load imbalance, which costs performance. To mitigate this, `--use_ssta_sparse_text_to_image` makes the text-to-image attention path sparse (using MOBA top-k sampling) instead of dense, reducing the computational cost of text tokens attending to image blocks. NOTE: This changes the output video, and further testing is needed to determine whether it regresses quality.\n\n\n\n### 7. Limitations\n\n#### Diffusers version\n\nBelow is a list of validated diffusers version requirements. 
If the model is not in the list, you may need to try several diffusers versions to find a working configuration.\n\n| Model Name | Diffusers version |\n| --- | --- |\n| [HunyuanVideo-1.5](https:\u002F\u002Fgithub.com\u002FTencent-Hunyuan\u002FHunyuanVideo-1.5) | >= 0.36.0 |\n| [Z-Image Turbo](https:\u002F\u002Fhuggingface.co\u002FTongyi-MAI\u002FZ-Image-Turbo) | >= 0.36.0 |\n| [Flux 2](https:\u002F\u002Fhuggingface.co\u002Fblack-forest-labs\u002FFLUX.2-dev) | >= 0.36.0 |\n| [Flux](https:\u002F\u002Fhuggingface.co\u002Fblack-forest-labs\u002FFLUX.1-dev) | >= 0.35.2 |\n| [Flux Kontext](https:\u002F\u002Fhuggingface.co\u002Fblack-forest-labs\u002FFLUX.1-Kontext-dev) | >= 0.35.2 |\n| [HunyuanVideo](https:\u002F\u002Fgithub.com\u002FTencent\u002FHunyuanVideo) | >= 0.35.2 |\n| [Wan2.1](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.1-T2V-14B-Diffusers) | >= 0.35.2 |\n| [Wan2.2](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-I2V-A14B-Diffusers) | >= 0.35.2 |\n\n\u003Ch2 id=\"dev-guide\">📚  Develop Guide\u003C\u002Fh2>\n\nWe provide a step-by-step guide for adding new models, please refer to the following tutorial.\n\n[Apply xDiT to new models](.\u002Fdocs\u002Fdeveloper\u002Fadding_models\u002Freadme.md)\n\nA high-level design of xDiT framework is provided below, which may help you understand the xDiT framework.\n\n[The implement and design of xdit framework](.\u002Fdocs\u002Fdeveloper\u002FThe_implement_design_of_xdit_framework.md)\n\n\u003Ch2 id=\"secrets\">✨ The xDiT's Arsenal\u003C\u002Fh2>\n\nThe remarkable performance of xDiT is attributed to two key facets.\nFirstly, it leverages parallelization techniques, pioneering innovations such as USP, PipeFusion, and hybrid parallelism, to scale DiTs inference to unprecedented scales.\n\nSecondly, we employ compilation technologies to enhance execution on GPUs, integrating established solutions like `torch.compile` and `onediff` to optimize xDiT's performance.\n\n\u003Ch3 id=\"parallel\">1. 
Parallel Methods\u003C\u002Fh3>\n\nAs illustrated in the accompanying images, xDiT offers a comprehensive set of parallelization techniques. For the DiT backbone, the foundational methods (Data, USP, PipeFusion, and CFG parallel) operate in a hybrid fashion. Additionally, the distinct methods, Tensor and DistriFusion parallel, function independently.\nFor the VAE module, xDiT offers a parallel implementation, [DistVAE](https:\u002F\u002Fgithub.com\u002Fxdit-project\u002FDistVAE), designed to prevent out-of-memory (OOM) issues.\nThe (\u003Cspan style=\"color: red;\">xDiT\u003C\u002Fspan>) label marks the methods first proposed by us.\n\n\u003Cdiv align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fxdit-project\u002Fxdit_assets\u002Fmain\u002Fmethods\u002Fxdit_method.png\" alt=\"xdit methods\">\n\u003C\u002Fdiv>\n\nThe table below details the communication and memory costs of the intra-image parallel methods in DiTs; CFG and DP are excluded because they are inter-image parallel. 
(* denotes that communication can be overlapped with computation.)\n\nAs we can see, PipeFusion and Sequence Parallel achieve the lowest communication cost on different scales and hardware configurations, making them suitable foundational components for a hybrid approach.\n\n𝒑: Number of pixels;\\\n𝒉𝒔: Model hidden size;\\\n𝑳: Number of model layers;\\\n𝑷: Total model parameters;\\\n𝑵: Number of parallel devices;\\\n𝑴: Number of patch splits;\\\n𝑸𝑶: Query and Output parameter count;\\\n𝑲𝑽: KV Activation parameter count;\\\n𝑨 = 𝑸 = 𝑶 = 𝑲 = 𝑽: Equal parameters for Attention, Query, Output, Key, and Value;\n\n\n|                           | attn-KV | communication cost           | param memory   | activations memory             | extra buff memory                  |\n|:-------------------------:|:-------:|:----------------------------:|:--------------:|:------------------------------:|:----------------------------------:|\n| Tensor Parallel           | fresh   | $4O(p \\times hs)L$           | $\\frac{1}{N}P$ | $\\frac{2}{N}A = \\frac{1}{N}QO$ | $\\frac{2}{N}A = \\frac{1}{N}KV$     |\n| DistriFusion*             | stale   | $2O(p \\times hs)L$           | $P$            | $\\frac{2}{N}A = \\frac{1}{N}QO$ | $2AL = (KV)L$                      |\n| Ring Sequence Parallel*   | fresh   | $2O(p \\times hs)L$           | $P$            | $\\frac{2}{N}A = \\frac{1}{N}QO$ | $\\frac{2}{N}A = \\frac{1}{N}KV$     |\n| Ulysses Sequence Parallel | fresh   | $\\frac{4}{N}O(p \\times hs)L$ | $P$            | $\\frac{2}{N}A = \\frac{1}{N}QO$ | $\\frac{2}{N}A = \\frac{1}{N}KV$     |\n| PipeFusion*               | stale-  | $2O(p \\times hs)$            | $\\frac{1}{N}P$ | $\\frac{2}{M}A = \\frac{1}{M}QO$ | $\\frac{2L}{N}A = \\frac{1}{N}(KV)L$ |\n\n\n\u003Ch4 id=\"PipeFusion\">1.1. 
PipeFusion\u003C\u002Fh4>\n\n[PipeFusion: Displaced Patch Pipeline Parallelism for Diffusion Models](.\u002Fdocs\u002Fmethods\u002Fpipefusion.md) **(Accepted by NeurIPS 2025)** \u003Ca href=\"https:\u002F\u002Fneurips.cc\u002Fvirtual\u002F2025\u002Floc\u002Fsan-diego\u002Fposter\u002F119821\">Link\u003C\u002Fa>\n\n\u003Ch4 id=\"USP\">1.2. USP: Unified Sequence Parallelism\u003C\u002Fh4>\n\n[USP: A Unified Sequence Parallelism Approach for Long Context Generative AI](.\u002Fdocs\u002Fmethods\u002Fusp.md)\n\n\u003Ch4 id=\"hybrid_parallel\">1.3. Hybrid Parallel\u003C\u002Fh4>\n\n[Hybrid Parallelism](.\u002Fdocs\u002Fmethods\u002Fhybrid.md)\n\n\u003Ch4 id=\"cfg_parallel\">1.4. CFG Parallel\u003C\u002Fh4>\n\n[CFG Parallel](.\u002Fdocs\u002Fmethods\u002Fcfg_parallel.md)\n\n\u003Ch4 id=\"parallel_vae\">1.5. Parallel VAE\u003C\u002Fh4>\n\n[Patch Parallel VAE](.\u002Fdocs\u002Fmethods\u002Fparallel_vae.md)\n\n\u003Ch3 id=\"1gpuacc\">Single GPU Acceleration\u003C\u002Fh3>\n\n\n\u003Ch4 id=\"compilation\">Compilation Acceleration\u003C\u002Fh4>\n\nWe utilize two compilation acceleration techniques, [torch.compile](https:\u002F\u002Fpytorch.org\u002Ftutorials\u002Fintermediate\u002Ftorch_compile_tutorial.html) and [onediff](https:\u002F\u002Fgithub.com\u002Fsiliconflow\u002Fonediff), to enhance runtime speed on GPUs. These compilation accelerations are used in conjunction with parallelization methods.\n\nWe employ the nexfort backend of onediff. Please install it before use:\n\n```\npip install onediff\npip install -U nexfort\n```\n\nFor usage instructions, refer to the [example\u002Frun.sh](.\u002Fexamples\u002Frun.sh). Simply append `--use_torch_compile` or `--use_onediff` to your command. 
Note that these options are mutually exclusive, and their performance varies across different scenarios.\n\n\u003Ch4 id=\"cache_acceleration\">Cache Acceleration\u003C\u002Fh4>\n\nYou can use `--use_teacache` or `--use_fbcache` in examples\u002Frun.sh, which apply TeaCache and First-Block-Cache, respectively.\nNote that the cache methods are only supported for the FLUX model with USP and are currently not applicable to PipeFusion.\n\nxDiT also provides DiTFastAttn for single GPU acceleration. It can reduce the computation cost of attention layers by leveraging redundancies between different steps of the Diffusion Model.\n\n[DiTFastAttn: Attention Compression for Diffusion Transformer Models](.\u002Fdocs\u002Fmethods\u002Fditfastattn.md)\n\n\u003Ch2 id=\"history\">🚧  History and Looking for Contributions\u003C\u002Fh2>\n\nWe conducted a major upgrade of this project in August 2024, introducing a new set of APIs that are now the preferred choice for all users.\n\nThe legacy APIs were used in the early stage of xDiT to explore and compare different parallelization methods.\nThey are located in the [legacy](https:\u002F\u002Fgithub.com\u002Fxdit-project\u002FxDiT\u002Ftree\u002Flegacy) branch, are now considered outdated, and do not support hybrid parallelism. Despite this limitation, they offer a broader range of individual parallelization methods, including PipeFusion, Sequence Parallel, DistriFusion, and Tensor Parallel.\n\nIf you work with PixArt models, you can still run the examples in the [scripts\u002F](https:\u002F\u002Fgithub.com\u002Fxdit-project\u002FxDiT\u002Ftree\u002Flegacy\u002Fscripts) directory under the `legacy` branch. However, for all other models, we strongly recommend adopting the formal APIs to ensure optimal performance and compatibility.\n\nWe also warmly welcome developers to join us in enhancing the project. 
If you have ideas for new features or models, please share them in our [issues](https:\u002F\u002Fgithub.com\u002Fxdit-project\u002FxDiT\u002Fissues). Your contributions are invaluable in driving the project forward and ensuring it meets the needs of the community.\n\n\u003Ch2 id=\"cite-us\">📝 Cite Us\u003C\u002Fh2>\n\n\n[xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.01738)\n\n```\n@article{fang2024xdit,\n  title={xDiT: an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism},\n  author={Fang, Jiarui and Pan, Jinzhe and Sun, Xibo and Li, Aoyu and Wang, Jiannan},\n  journal={arXiv preprint arXiv:2411.01738},\n  year={2024}\n}\n\n```\n\n[PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference](https:\u002F\u002Fopenreview.net\u002Fforum?id=5xwyxupsLL)\n\n```\n@inproceedings{\n    fang2025pipefusion,\n    title={PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference},\n    author={Jiarui Fang and Jinzhe Pan and Aoyu Li and Xibo Sun and WANG Jiannan},\n    booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},\n    year={2025},\n    url={https:\u002F\u002Fopenreview.net\u002Fforum?id=5xwyxupsLL}\n}\n\n```\n\n[USP: A Unified Sequence Parallelism Approach for Long Context Generative AI](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.07719)\n\n\n```\n@article{fang2024unified,\n  title={A Unified Sequence Parallelism Approach for Long Context Generative AI},\n  author={Fang, Jiarui and Zhao, Shangchun},\n  journal={arXiv preprint arXiv:2405.07719},\n  year={2024}\n}\n\n```\n\n[Unveiling Redundancy in Diffusion Transformers (DiTs): A Systematic Study](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.13588)\n\n```\n@article{sun2024unveiling,\n  title={Unveiling Redundancy in Diffusion Transformers (DiTs): A Systematic Study},\n  author={Sun, Xibo and Fang, Jiarui and Li, Aoyu and 
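Pan, Jinzhe},\n  journal={arXiv preprint arXiv:2411.13588},\n  year={2024}\n}\n\n```\n","xDiT is a massively parallel inference engine designed for Diffusion Transformers (DiTs), aimed at efficient deployment across multiple computing devices. The project uses several advanced parallel techniques, such as Unified Sequence Parallelism (USP), PipeFusion, and data parallelism, to accelerate DiT inference, and it also provides compilation and cache acceleration for single-GPU settings. xDiT is well suited to applications that need real-time, high-quality image or video generation, especially where growing input context length makes the computational demand of attention grow quadratically, and it helps online services meet their performance requirements.",2,"2026-06-11 03:42:01","high_star"]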