[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74037":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},74037,"DeepEP","deepseek-ai\u002FDeepEP","deepseek-ai","DeepEP: an efficient expert-parallel communication library","",null,"Cuda",9712,1282,104,191,0,15,24,98,45,40.32,"MIT License",false,"main",[],"2026-06-12 02:03:21","# DeepEP\n\nDeepEP (DeepEveryParallel) is a high-performance communication library for modern machine learning training and inference. The library currently focuses on expert parallelism (EP) — providing high-throughput and low-latency all-to-all GPU kernels (MoE dispatch and combine) with low-precision support including FP8 — while also offering experimental primitives for pipeline parallelism (PP), context parallelism (CP), and remote memory access (Engram), all designed for zero or minimal SM occupation. All kernels are compiled at runtime via a lightweight Just-In-Time (JIT) module, requiring no CUDA compilation during installation.\n\nDespite its lightweight design, DeepEP's performance matches or exceeds hardware bandwidth limits across various configurations.\n\n## News\n\n- **V2 release**: A complete refactoring of Expert Parallelism — achieving extreme performance with several times fewer SM resources compared to V1, while supporting significantly larger scale-up and scale-out domains. 
V2 has also switched from the NVSHMEM backend to the more lightweight **NCCL Gin backend**.\n\n### New features\n\n- **Fully JIT** (Just-In-Time compilation)\n- **NCCL Gin backend**\n  - Header-only & lightweight\n  - Able to reuse existing NCCL communicators\n- **EPv2**\n  - High-throughput and low-latency APIs unified into a single `ElasticBuffer` interface, with a new GEMM layout\n  - Larger scale-up & scale-out domain support (up to EP2048)\n  - Analytical SM & QP count calculation — no more auto-tuning needed\n  - Both hybrid & direct modes remain supported\n  - For V3-like legacy training, SM usage reduced from 24 to 4 - 6 while maintaining equivalent or better performance\n- **0 SM Engram** (with RDMA)\n- **0 SM PP** (with RDMA)\n- **0 SM CP** (with Copy Engine)\n\n### Notes\n\n- Buffer size consumption is larger than V1\n- 0 SM RDMA low-latency EP is no longer supported\n- Engram, PP, and CP are experimental features\n\n### Still on-going features\n\n- **Elastic GPU & CPU buffers**: A contiguous virtual address space that maps to a hybrid of GPU and CPU physical memory under the hood, enabling fully automatic and transparent Engram or imbalanced EP\n- Reducing intermediate buffer sizes by leveraging EP replay to handle load imbalance\n- All-gather updates and reduce-scatter implementations for DP & TP\n\nFor the legacy V1 documentation (NVSHMEM-based), see [docs\u002Flegacy.md](docs\u002Flegacy.md).\n\n## Performance\n\nFollowing V3's configuration, we tested with 8K tokens per batch, 7168 hidden dimensions, top 8 experts, FP8 dispatching, and BF16 combining, and obtained the following results:\n\n| Arch | NIC type | Topo | Dispatch Bottleneck Bandwidth | Combine Bottleneck Bandwidth | #SMs |\n|--|--|--|--|--|--|\n| SM90 | CX7 | EP 8 x 2 | 90 GB\u002Fs (RDMA) | 81 GB\u002Fs (RDMA) | 12 |\n| SM90 | CX7 | EP 8 x 4 | 61 GB\u002Fs (RDMA) | 61 GB\u002Fs (RDMA) | 6 |\n| SM100 | CX7 | EP 8 x 2 | 90 GB\u002Fs (RDMA) | 91 GB\u002Fs (RDMA) | 12 |\n| SM100 | N\u002FA 
| EP 8 | 726 GB\u002Fs (NVLink) | 740 GB\u002Fs (NVLink) | 64 (Max perf) |\n| SM100 | N\u002FA | EP 8 | 643 GB\u002Fs (NVLink) | 675 GB\u002Fs (NVLink) | 24 (Min #SM) |\n\nNotes: the results are logical bandwidth. For example, under the `EP 8 x 2` case, 90 GB\u002Fs actually contains local rank traffic.\n\nComparing with V1, **V2 achieves up to 1.3x peak performance, while saving up to 4x SM count**.\n\nWe omit results for larger EP configurations for the time being, but encourage interested users to benchmark them directly. Based on our internal experience, we expect the kernel to continue saturating hardware bandwidth at scale.\n\nFor V1 performance data, see [docs\u002Flegacy.md](docs\u002Flegacy.md#performance).\n\n## Quick start\n\n### Requirements\n\n- Hopper (SM90) GPUs, or other architectures with SM90 PTX ISA support\n- Python 3.8 and above\n- CUDA version\n  - CUDA 12.3 and above for SM90 GPUs\n- PyTorch 2.10 and above\n- NCCL 2.30.4 and above\n- NVLink for intranode communication\n- RDMA network for internode communication\n\n### Install NCCL dependency\n\nWe recommend using pip to install NCCL so that DeepEP can automatically locate it within the Python environment. You can install it using the following command:\n\n```bash\npip install \"nvidia-nccl-cu13>=2.30.4\" --no-deps\n```\n\n### Install NVSHMEM dependency\n\nDeepEP also depends on NVSHMEM to provide support for legacy methods. 
Please refer to our [NVSHMEM Installation Guide](docs\u002Fnvshmem.md) for instructions.\n\n### Development\n\n```bash\n# Build and make symbolic links for SO files\npython setup.py build\n# You may modify the specific SO names according to your own platform\nln -s build\u002Flib.linux-x86_64-cpython-38\u002Fdeep_ep_cpp.cpython-38-x86_64-linux-gnu.so\n\n# Run test cases\n# NOTES: you may modify the `init_dist` function in `tests\u002Futils\u002Fenvs.py`\n# according to your own cluster settings, and launch into multiple nodes\npython tests\u002Felastic\u002Ftest_ep.py\npython tests\u002Felastic\u002Ftest_agrs.py\npython tests\u002Felastic\u002Ftest_engram.py\npython tests\u002Felastic\u002Ftest_pp.py\n```\n\n### Installation\n\n```bash\npython setup.py install\n```\n\nThen, import `deep_ep` in your Python project, and enjoy!\n\n## Interfaces and examples\n\n### Buffer initialization\n\nIn V2, all EP operations — high-throughput and low-latency — are unified under a single `ElasticBuffer` interface. 
The buffer can be initialized by specifying MoE settings directly, and the optimal SM and QP counts are calculated analytically.\n\n```python\nimport torch\nimport torch.distributed as dist\nfrom typing import Optional\n\nfrom deep_ep import ElasticBuffer\n\n# Communication buffer (will allocate at runtime)\n_buffer: Optional[ElasticBuffer] = None\n\n# Number of SMs to use for communication kernels (will be set at buffer creation)\n_num_comm_sms: int = 0\n\n\ndef get_buffer(group: dist.ProcessGroup,\n               num_max_tokens_per_rank: int,\n               hidden: int,\n               num_topk: int,\n               num_experts: int,\n               use_fp8_dispatch: bool = False) -> ElasticBuffer:\n    \"\"\"Initialize or retrieve the ElasticBuffer for EP communication.\"\"\"\n    global _buffer, _num_comm_sms\n\n    # Check if we can reuse the existing buffer\n    required_bytes = ElasticBuffer.get_buffer_size_hint(\n        group, num_max_tokens_per_rank, hidden,\n        num_topk=num_topk, use_fp8_dispatch=use_fp8_dispatch,\n    )\n    if _buffer is not None and _buffer.group == group and _buffer.num_bytes >= required_bytes:\n        return _buffer\n\n    # Allocate a new buffer with MoE settings\n    # NOTES: V2 buffer size consumption is larger than V1\n    _buffer = ElasticBuffer(\n        group,\n        num_max_tokens_per_rank=num_max_tokens_per_rank,\n        hidden=hidden,\n        num_topk=num_topk,\n        use_fp8_dispatch=use_fp8_dispatch,\n    )\n\n    # V2 analytically calculates the optimal SM count — no more auto-tuning needed\n    # You may also specify `num_sms` manually in dispatch\u002Fcombine calls to override\n    _num_comm_sms = _buffer.get_theoretical_num_sms(num_experts, num_topk)\n\n    return _buffer\n```\n\n### Example use in model training or inference prefilling\n\nV2 unifies the dispatch and combine APIs into a single `ElasticBuffer` interface. 
The example below shows how to use them for training (with backward passes) or inference prefilling.\n\n```python\nimport torch\nimport torch.distributed as dist\nfrom typing import Tuple, Union\n\nfrom deep_ep import ElasticBuffer, EPHandle, EventOverlap\n\n\ndef dispatch_forward(x: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],\n                     topk_idx: torch.Tensor, topk_weights: torch.Tensor,\n                     num_experts: int,\n                     num_max_tokens_per_rank: int,\n                     expert_alignment: int = 1) -> \\\n        Tuple[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],\n              torch.Tensor, torch.Tensor, EPHandle, EventOverlap]:\n    \"\"\"\n    MoE dispatch: route tokens to the corresponding experts across all ranks.\n    Supports both BF16 and FP8 (x as a tuple of [data, scale_factors]) inputs.\n    \"\"\"\n    global _buffer, _num_comm_sms\n\n    recv_x, recv_topk_idx, recv_topk_weights, handle, event = _buffer.dispatch(\n        x,\n        topk_idx=topk_idx,\n        topk_weights=topk_weights,\n        num_experts=num_experts,\n        num_max_tokens_per_rank=num_max_tokens_per_rank,\n        expert_alignment=expert_alignment,\n        num_sms=_num_comm_sms,\n        async_with_compute_stream=True,\n    )\n\n    # `handle` contains routing metadata for the subsequent combine call\n    # `handle.num_recv_tokens_per_expert_list` provides per-expert token counts for GEMM\n    # Use `event.current_stream_wait()` to synchronize the compute stream before using results\n    return recv_x, recv_topk_idx, recv_topk_weights, handle, event\n\n\ndef dispatch_backward(grad_recv_x: torch.Tensor,\n                      grad_recv_topk_weights: torch.Tensor,\n                      handle: EPHandle) -> Tuple[torch.Tensor, torch.Tensor, EventOverlap]:\n    \"\"\"The backward pass of MoE dispatch is actually a combine.\"\"\"\n    global _buffer, _num_comm_sms\n\n    combined_grad_x, combined_grad_topk_weights, event 
= _buffer.combine(\n        grad_recv_x,\n        handle=handle,\n        topk_weights=grad_recv_topk_weights,\n        num_sms=_num_comm_sms,\n        async_with_compute_stream=True,\n    )\n\n    return combined_grad_x, combined_grad_topk_weights, event\n\n\ndef combine_forward(x: torch.Tensor,\n                    handle: EPHandle) -> Tuple[torch.Tensor, EventOverlap]:\n    \"\"\"MoE combine: reduce expert outputs back to their original ranks.\"\"\"\n    global _buffer, _num_comm_sms\n\n    combined_x, _, event = _buffer.combine(\n        x,\n        handle=handle,\n        num_sms=_num_comm_sms,\n        async_with_compute_stream=True,\n    )\n\n    return combined_x, event\n\n\ndef combine_backward(grad_combined_x: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],\n                     handle: EPHandle) -> \\\n        Tuple[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]], EventOverlap]:\n    \"\"\"The backward pass of MoE combine is actually a dispatch.\"\"\"\n    global _buffer, _num_comm_sms\n\n    grad_x, _, _, _, event = _buffer.dispatch(\n        grad_combined_x,\n        handle=handle,\n        num_sms=_num_comm_sms,\n        async_with_compute_stream=True,\n    )\n\n    return grad_x, event\n```\n\nFor communication-computation overlap, use the `EventOverlap` interface to manage dependencies between the communication stream and the compute stream:\n\n```python\n# After dispatch, overlap computation while communication is in-flight\nrecv_x, recv_topk_idx, recv_topk_weights, handle, event = dispatch_forward(...)\n\n# ... do some independent computation here ...\n\n# Wait for communication to finish before using results\nevent.current_stream_wait()\n\n# Now safe to use recv_x, recv_topk_idx, recv_topk_weights\n```\n\n### Example use in inference decoding\n\nFor inference decoding, the same `ElasticBuffer` is used. 
The handle-caching pattern allows reusing routing metadata across iterations when the gating decisions remain unchanged, avoiding redundant CPU synchronization.\n\n```python\nimport torch\nfrom typing import Tuple, Optional, Union\n\nfrom deep_ep import ElasticBuffer, EPHandle, EventOverlap\n\n\ndef decode_dispatch(x: Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],\n                    topk_idx: torch.Tensor, topk_weights: torch.Tensor,\n                    num_experts: int,\n                    num_max_tokens_per_rank: int,\n                    cached_handle: Optional[EPHandle] = None) -> \\\n        Tuple[Union[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]],\n              torch.Tensor, torch.Tensor, EPHandle, EventOverlap]:\n    \"\"\"\n    MoE dispatch for inference decoding.\n    If `cached_handle` is provided, the layout is reused without CPU synchronization.\n    \"\"\"\n    global _buffer, _num_comm_sms\n\n    if cached_handle is not None:\n        # Reuse cached handle: skip layout recomputation and CPU sync\n        recv_x, _, _, handle, event = _buffer.dispatch(\n            x,\n            handle=cached_handle,\n            num_sms=_num_comm_sms,\n            async_with_compute_stream=True,\n        )\n        return recv_x, cached_handle.topk_idx, None, handle, event\n\n    recv_x, recv_topk_idx, recv_topk_weights, handle, event = _buffer.dispatch(\n        x,\n        topk_idx=topk_idx,\n        topk_weights=topk_weights,\n        num_experts=num_experts,\n        num_max_tokens_per_rank=num_max_tokens_per_rank,\n        num_sms=_num_comm_sms,\n        async_with_compute_stream=True,\n    )\n\n    return recv_x, recv_topk_idx, recv_topk_weights, handle, event\n\n\ndef decode_combine(x: torch.Tensor,\n                   handle: EPHandle) -> Tuple[torch.Tensor, EventOverlap]:\n    \"\"\"MoE combine for inference decoding.\"\"\"\n    global _buffer, _num_comm_sms\n\n    combined_x, _, event = _buffer.combine(\n        x,\n        
handle=handle,\n        num_sms=_num_comm_sms,\n        async_with_compute_stream=True,\n    )\n\n    return combined_x, event\n```\n\n### Environment variables\n\nThe library provides some environment variables, which may be useful:\n\n- General\n    - `EP_BUFFER_DEBUG`: `0` or `1`, print buffer initialization, SM approximation, and backend debugging information, `0` by default\n    - `EP_SUPPRESS_NCCL_CHECK`: `0` or `1`, suppress NCCL version mismatch checking, `0` by default\n    - `EP_AVOID_RECORD_STREAM`: `0` or `1`, avoid `record_stream` on output tensors, `0` by default\n    - `EP_NUM_TOPK_IDX_BITS`: integer, override the number of bits for top-k index encoding, `0` (auto) by default\n- Networking\n    - `EP_NIC_NAME`: string, the default NIC name used to query NIC properties, `mlx5_0` by default\n    - `EP_OVERRIDE_RDMA_SL`: integer, override the RDMA service level index for traffic isolation\n    - `EP_DISABLE_GIN`: `0` or `1`, disable the NCCL Gin backend (fall back to non-Gin path), `0` by default\n- JIT\n    - `EP_JIT_DEBUG`: `0` or `1`, print JIT debugging information, `0` by default\n    - `EP_JIT_CACHE_DIR`: string, cache directory for compiled kernels, `$HOME\u002F.deep_ep` by default\n    - `EP_JIT_NVCC_COMPILER`: string, NVCC compiler path; defaults to `torch.utils.cpp_extension.CUDA_HOME`\n    - `EP_JIT_CPP_STANDARD`: integer, C++ standard version, `20` by default\n    - `EP_JIT_PRINT_COMPILER_COMMAND`: `0` or `1`, print compilation commands, `0` by default\n    - `EP_JIT_PTXAS_VERBOSE`: `0` or `1`, show detailed PTXAS output, `0` by default\n    - `EP_JIT_PTXAS_CHECK`: `0` or `1`, assert no local memory usage in compiled kernels, `0` by default\n    - `EP_JIT_WITH_LINEINFO`: `0` or `1`, embed source line info for profiling tools, `0` by default\n    - `EP_JIT_DUMP_ASM`: `0` or `1`, dump both PTX and SASS, `0` by default\n    - `EP_JIT_DUMP_PTX`: `0` or `1`, dump PTX output, `0` by default\n    - `EP_JIT_DUMP_SASS`: `0` or `1`, dump SASS output, 
`0` by default\n- Debug and profiling\n    - `EP_GIN_GDAKI_DEBUG`: `0` or `1`, enable NCCL Gin GDAKI debugging output, `0` by default\n    - `EP_USE_NVIDIA_TOOLS`: `0` or `1`, skip internal profiling when running under external NVIDIA tools, `0` by default\n    - `EP_DISABLE_BARRIER_PROFILING`: `0` or `1`, disable barrier-based communication profiling in benchmarks, `0` by default\n- Build\n    - `EP_NCCL_ROOT_DIR`: string, path to the NCCL installation directory; auto-detected from the Python environment if not set\n    - `EP_NVSHMEM_ROOT_DIR`: string, path to the NVSHMEM installation directory; auto-detected from the Python environment if not set\n    - `TORCH_CUDA_ARCH_LIST`: string, list of target CUDA architectures, e.g. `\"9.0\"`\n    - `DISABLE_SM90_FEATURES`: `0` or `1`, disable SM90 features for legacy methods, `0` by default\n    - `DISABLE_AGGRESSIVE_PTX_INSTRS`: `0` or `1`, disable aggressive load\u002Fstore instructions in legacy methods, `0` by default\n\nSome environment variables are **persistent**: they are captured at build time and baked into the installed package as default values. At import time, these defaults are applied automatically unless overridden by current environment variables. The persistent variables are: `EP_JIT_CACHE_DIR`, `EP_JIT_PRINT_COMPILER_COMMAND`, `EP_NUM_TOPK_IDX_BITS`, `EP_NCCL_ROOT_DIR`.\n\nFor additional details, please refer to [the test code](tests\u002Felastic\u002Ftest_ep.py) or review the corresponding Python documentation.\n\n## Network configurations\n\nDeepEP is fully tested with InfiniBand networks. 
However, it is theoretically compatible with RDMA over Converged Ethernet (RoCE) as well.\n\n### Traffic isolation\n\nTraffic isolation is supported by InfiniBand through Virtual Lanes (VL).\n\nTo prevent interference between different types of traffic, we recommend segregating workloads across different virtual lanes as follows:\n\n- expert-parallel workloads\n- other workloads\n\nFor DeepEP V2, you can control the virtual lane assignment by setting the `sl_idx` argument or the `EP_OVERRIDE_RDMA_SL` environment variable.\n\n### Adaptive routing\n\nAdaptive routing is an advanced routing feature provided by InfiniBand switches that can evenly distribute traffic across multiple paths. Even though adaptive routing introduces additional latency, we still recommend enabling it under all network load conditions.\n\n### Congestion control\n\nCongestion control is disabled because it hurts maximum bandwidth. If congestion is unavoidable in some scenarios, we recommend assigning those workloads to low-priority virtual lanes.\n\n### PCI atomic mode\n\nIf the hardware supports it, we recommend using the following command to set the NIC's `PCI_ATOMIC_MODE` to improve RDMA atomic operation performance:\n\n```bash\nsudo mlxconfig -y -d mlx5_$i set PCI_ATOMIC_MODE=4\n```\n\n## Experimental branches\n\n- [Zero-copy](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepEP\u002Fpull\u002F453)\n    - Removing the copy between PyTorch tensors and communication buffers, which reduces the SM usages significantly for normal kernels\n    - This PR is authored by **Tencent Network Platform Department**\n- [Eager](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepEP\u002Fpull\u002F437)\n    - Using a low-latency protocol removes the extra RTT latency introduced by RDMA atomic OPs\n- [Hybrid-EP](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepEP\u002Ftree\u002Fhybrid-ep)\n    - A new backend implementation using TMA instructions for minimal SM usage and larger NVLink domain 
support\n    - Fine-grained communication-computation overlap for single-batch scenarios\n    - PCIe kernel support for non-NVLink environments\n    - NVFP4 data type support\n- [AntGroup-Opt](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepEP\u002Ftree\u002Fantgroup-opt)\n    - This optimization series is authored by **AntGroup Network Platform Department**\n    - [Normal-SMFree](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepEP\u002Fpull\u002F347) Eliminating SM from RDMA path by decoupling comm-kernel execution from NIC token transfer, freeing SMs for compute\n    - [LL-SBO](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepEP\u002Fpull\u002F483) Overlapping Down GEMM computation with Combine Send communication via signaling mechanism to reduce end-to-end latency\n    - [LL-Layered](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepEP\u002Fpull\u002F500) Optimizing cross-node LL operator communication using rail-optimized forwarding and data merging to reduce latency\n- [Mori-EP](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepEP\u002Ftree\u002Fmori-ep)\n    - ROCm\u002FAMD GPU support powered by [MORI](https:\u002F\u002Fgithub.com\u002FROCm\u002Fmori) backend (low-latency mode)\n\n## Community forks\n\n- [uccl\u002Fuccl-ep](https:\u002F\u002Fgithub.com\u002Fuccl-project\u002Fuccl\u002Ftree\u002Fmain\u002Fep) - Enables running DeepEP on heterogeneous GPUs (e.g., Nvidia, AMD) and NICs (e.g., EFA, Broadcom, CX7)\n- [Infrawaves\u002FDeepEP_ibrc_dual-ports_multiQP](https:\u002F\u002Fgithub.com\u002FInfrawaves\u002FDeepEP_ibrc_dual-ports_multiQP) - Adds multi-QP solution and dual-port NIC support in IBRC transport\n- [antgroup\u002FDeepXTrace](https:\u002F\u002Fgithub.com\u002Fantgroup\u002FDeepXTrace) - A diagnostic analyzer for efficient and precise localization of slow ranks\n- [ROCm\u002Fmori](https:\u002F\u002Fgithub.com\u002FROCm\u002Fmori) - AMD's next-generation communication library for performance-critical AI workloads (e.g., 
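Wide EP, KVCache transfer, Collectives)\n\n## Acknowledgement\n\nDeepEP V2 is built on top of the [NCCL](https:\u002F\u002Fgithub.com\u002Fnvidia\u002Fnccl) Gin backend. Thanks to @sjeaugey, @pakmarkthub, @sb17v, @xiaofanl-nvidia, and the NCCL team for their support!\n\n## License\n\nThis code repository is released under [the MIT License](LICENSE).\n\n## Citation\n\n```bibtex\n@misc{deepep2025,\n      title={DeepEP: an efficient expert-parallel communication library},\n      author={Chenggang Zhao and Shangyan Zhou and Liyue Zhang and Chengqi Deng and Zhean Xu and Yuxuan Liu and Kuai Yu and Jiashi Li and Liang Zhao},\n      year={2025},\n      publisher = {GitHub},\n      howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FDeepEP}},\n}\n```\n","DeepEP is a high-performance communication library for modern machine learning training and inference, focused on expert parallelism (EP). It provides high-throughput, low-latency all-to-all GPU kernels (such as MoE dispatch and combine), supports low-precision computation including FP8, and offers experimental support for pipeline parallelism, context parallelism, and remote memory access, all designed to occupy as few SM resources as possible, or none at all. All kernels are compiled at runtime through a lightweight just-in-time compilation module, so no CUDA compilation is required during installation. DeepEP suits large-scale distributed training that must handle large datasets and models efficiently, and performs especially well when extreme performance is needed with reduced hardware resource consumption.",2,"2026-06-11 03:48:30","high_star"]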