[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79353":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},79353,"mKernel","uccl-project\u002FmKernel","uccl-project","mKernel: fast multi-node, multi-GPU fused kernels","",null,"Cuda",231,22,1,0,15,20,97,45,83.79,"MIT License",false,"main",[],"2026-06-12 04:01:24","\u003Cdiv align=\"center\" >\n  \u003Cp align=\"center\"> \n    \u003Cimg src=\"figs\u002FmKernel.png\" height=350 alt=\"mKernel\" style=\"margin-bottom:12px\"\u002F>\u003Cbr\u002F>\n    \u003Cem>mKernel: multi-GPU, multi-node fused kernels\u003C\u002Fem>\u003Cbr\u002F>\u003Cbr\u002F>\n  \u003C\u002Fp>\n\n  \u003Cp align=\"center\">\n        \u003Ca href=\"https:\u002F\u002Fuccl-project.github.io\u002F\">\u003Cb>Blog\u003C\u002Fb>\u003C\u002Fa> | \n        \u003Ca href=\"https:\u002F\u002Fjoin.slack.com\u002Ft\u002Fuccl-dev\u002Fshared_invite\u002Fzt-3xbjdb0d0-tvDeUhGxtYxvGqsGKQ31Uw\">\u003Cb>Join Slack\u003C\u002Fb>\u003C\u002Fa> | \n        \u003Ca href=\"https:\u002F\u002Fx.com\u002Fuccl_proj\">\u003Cb>Twitter\u002FX\u003C\u002Fb>\u003C\u002Fa> | \n        \u003Ca href=\"#roadmap\">\u003Cb>Roadmap\u003C\u002Fb>\u003C\u002Fa> | \n        \u003Ca href=\"#quick-start\">\u003Cb>Quick Start\u003C\u002Fb>\u003C\u002Fa> |\n        \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fuccl-project\u002Fuccl\u002Fissues\u002F944\">\u003Cb>Open 
Letter\u003C\u002Fb>\u003C\u002Fa>\n  \u003C\u002Fp>\n\u003C\u002Fdiv>\n\n## Highlights\n\n- **Multi-GPU + multi-node, in one kernel.** Handles both intra-node and inter-node GPU-driven communication inside the same kernel.\n- **Fine-grained intra-kernel overlapping.** Compute and communication overlap at tile\u002Fchunk granularity.\n- **Persistent kernel with SM specialization.** CTAs are assigned roles, such as compute \u002F intra-comm \u002F inter-send \u002F inter-reduce.\n- **GPU-driven networking, built from scratch.** Implements communication directly over libibverbs (without NCCL\u002FNVSHMEM) for maximal performance.\n\n_mKernel is under active development, including optimization for larger scales, different GPUs, and network topologies. The goal is to have a library for commonly used multi-node\u002FGPU distributed kernels._\n\n## Roadmap\n- ✅ Fused, GPU-driven multi-node kernels\n- ✅ CX7 and EFA backends\n- 🚧 Full support for heterogeneous accelerators and NICs\n  - 🚧 Topology-aware accelerator and NIC discovery, placement, and routing\n- 🚧 Internode megakernels\n- 🚧 Support for Blackwell GPUs\n\n## Kernels\n\n| Kernel | What it fuses | Description |\n|---|---|---|\n| **AllGather + GEMM** | AllGather → GEMM | Each rank holds a shard of the activation `A`. While ranks gather peers' shards over NVLink\u002FRDMA, the local GEMM consumes tiles as soon as they arrive — overlapping the gather with `(A_full @ B)` so the matmul starts before the collective finishes. |\n| **GEMM + AllReduce** | GEMM → AllReduce | Computes `C = A @ B` and reduces partial outputs across all 16 ranks in one launch. Output tiles are pushed into the reduction tree the instant they're produced, hiding the AllReduce inside the GEMM tail. |\n| **MoE Dispatch + GEMM** | All-to-All dispatch → grouped GEMM | Routes MoE tokens to their expert ranks (intra-node NVLink + inter-node all-to-all) and runs the per-expert grouped GEMM in the same kernel. 
Tokens are matmul'd as soon as they land, no staging buffer round-trip. |\n| **Ring Attention** | Ring KV exchange → FlashAttention | Sequence-parallel attention across 16 ranks: each step rotates a KV chunk around the ring while the local FlashAttention consumes the previously-received chunk. Compute and the ring send\u002Frecv run concurrently inside a single persistent kernel. |\n| **GEMM + ReduceScatter** | GEMM → ReduceScatter | Computes `C = A @ B` and reduce-scatters the output across ranks. Each output tile is reduced and forwarded to its owning rank as soon as it's produced, so the scatter overlaps the GEMM rather than following it. |\n\n## Quick start\n\n```sh\n# Pick BACKEND=efa for AWS EFA, or BACKEND=cx7 for ConnectX-7 \u002F InfiniBand.\nmake BACKEND=cx7 PYTHON=python3 all\n\n# Two-node benchmark example. Run from node 0; node 1 is launched over SSH.\nNODE0_IP=\u003Cnode0-data-ip> \\\nNODE1_IP=\u003Cnode1-data-ip> \\\nNODE1_SSH=\u003Cnode1-ssh-target> \\\nbash bench\u002Frun.sh all bench 2\n\nmake plots\n```\n\n## Requirements\n\n- NVIDIA Hopper GPUs; the default build targets `sm_90a`.\n- CUDA 12.9 by default (`CUDA_HOME=\u002Fusr\u002Flocal\u002Fcuda-12.9`), override with `CUDA_HOME=...`.\n- Python with PyTorch installed; pass it to the build with `PYTHON=\u002Fpath\u002Fto\u002Fpython`.\n- CX7 backend: libibverbs development headers and libraries.\n- EFA backend: AWS EFA installation with libfabric, libibverbs, efadv, and EFA headers\u002Flibraries under `EFA_HOME=\u002Fopt\u002Famazon\u002Fefa` by default.\n- Benchmarks assume homogeneous multi-GPU nodes, `torchrun`, passwordless SSH from node 0 to peer nodes, and routable data-plane IPs in `NODE*_IP`.\n\n## Backends\n\n| Backend | Macro | Transport | Where it runs |\n|---|---|---|---|\n| **CX7** | `-DINTERNODE_BACKEND_IBVERBS` | libibverbs RC | ConnectX-7 \u002F InfiniBand \u002F RoCE |\n| **EFA** | `-DINTERNODE_BACKEND_EFA` | libibverbs + efadv (SRD) | AWS p5\u002Fp5e (H200, EFA) |\n\nBoth 
backends share the same host-side API and the same on-GPU kernel; only the proxy \u002F session implementation differs (`include\u002Fcomm\u002Finternode\u002Fsession.h` for CX7, `session_efa.h` for EFA).\n\n\n## Comparison results — AWS EFA\n\n| Kernel | Plot |\n|---|---|\n| AllGather + GEMM | ![ag_gemm](plots\u002Fag_gemm_efa.png) |\n| GEMM + AllReduce | ![gemm_ar](plots\u002Fgemm_ar_efa.png) |\n| MoE Dispatch + GEMM | ![dispatch_gemm](plots\u002Fdispatch_gemm_efa.png) |\n| Ring Attention | ![ring_attention](plots\u002Fring_attention_efa.png) |\n| GEMM + ReduceScatter | ![gemm_rs](plots\u002Fgemm_rs_efa.png) |\n\n## Comparison results — ConnectX-7\n\n| Kernel | Plot |\n|---|---|\n| AllGather + GEMM | ![ag_gemm_cx7](plots\u002Fag_gemm_cx7.png) |\n| GEMM + AllReduce | ![gemm_ar_cx7](plots\u002Fgemm_ar_cx7.png) |\n| Ring Attention | ![ring_attention_cx7](plots\u002Fring_attn_cx7.png) |\n| GEMM + ReduceScatter | ![gemm_rs_cx7](plots\u002Fgemm_rs_cx7.png) |\n\n## Acknowledgements\n\nThe MMA \u002F compute code is adapted from [ThunderKittens](https:\u002F\u002Fgithub.com\u002FHazyResearch\u002FThunderKittens) (HazyResearch). Many thanks to the TK authors.\n\n## License\n\nMIT — see [LICENSE](LICENSE).\n\n","mKernel is a fused-kernel library designed for multi-node, multi-GPU environments and aimed at accelerating distributed computing workloads. It handles both intra-node and inter-node GPU-driven communication inside a single kernel, and uses fine-grained intra-kernel overlapping so that compute and communication run in parallel at tile granularity. mKernel also uses persistent kernels with streaming multiprocessor (SM) specialization to assign cooperative thread arrays (CTAs) roles such as compute, intra-node communication, inter-node send, or reduce, and it builds its GPU-driven networking layer from scratch on libibverbs for maximum performance. The project suits applications that need high-performance distributed computing, especially workloads involving large-scale matrix operations and data exchange, such as deep learning model training.",2,"2026-06-11 03:57:44","CREATED_QUERY"]