[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1760":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":30,"discoverSource":31},1760,"nanoPD","HJCheng0602\u002FnanoPD","HJCheng0602","A from-scratch Prefill\u002FDecode disaggregation inference engine for LLMs","",null,"Python",156,27,1,0,2,3,45.64,"MIT License",false,"main",true,[24,25,26],"decode","inference","prefill","2026-06-11 04:00:59","\u003Cp align=\"center\">\n  \u003Cimg width=\"260\" src=\"assets\u002Flogo.png\">\n\u003C\u002Fp>\n\n\u003Ch1 align=\"center\">nanoPD\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n  A from-scratch \u003Cstrong>Prefill\u002FDecode disaggregation inference engine\u003C\u002Fstrong> for LLMs\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.10%2B-3776ab?logo=python&logoColor=white\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPyTorch-2.1%2B-ee4c2c?logo=pytorch&logoColor=white\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCUDA-11.8%2B-76b900?logo=nvidia&logoColor=white\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHardware-H20-blueviolet\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel-Qwen3--8B-orange\">\n  \u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-green\">\n\u003C\u002Fp>\n\n---\n\nDisaggregated inference separates the two phases of LLM generation — the compute-intensive **prefill** (processing the prompt) and the memory-bandwidth-bound **decode** (generating tokens one at a time) — onto dedicated GPUs. This avoids the mutual interference between the two phases that limits throughput on collocated deployments.\n\nnanoPD implements the full stack: a custom paged KV cache, chunked prefill, a custom CUDA paged attention kernel, multi-GPU KV transfer, an adaptive router driven by an analytical cost model, and a Poisson-arrival benchmark suite. All three serving strategies — **Collocated**, **Disaggregated**, and **Adaptive** — are implemented and benchmarked.\n\n---\n\n## Architecture\n\n```\n┌─────────────────────────────────────────────────────────────┐\n│                      CentralScheduler                       │\n│   (dispatch, KV transfer coordination, path accounting)     │\n└────────────┬─────────────────┬───────────────┬─────────────┘\n             │                 │               │\n     ┌───────▼──────┐  ┌───────▼──────┐  ┌────▼──────────┐\n     │  Collocated  │  │   Prefill    │  │    Decode     │\n     │   Worker     │  │   Worker     │  │    Worker     │\n     │  (GPU 0)     │  │ (GPU 1 \u002F 3)  │  │   (GPU 2)    │\n     └──────────────┘  └──────┬───────┘  └────▲──────────┘\n                              │    KV Transfer │\n                              └────────────────┘\n             ↑\n      ┌──────┴──────┐\n      │   Router    │  ← analytical cost model decides path per request\n      └─────────────┘\n```\n\nEach incoming request is routed by an **analytical cost model** that estimates the end-to-end latency of both strategies on the current hardware and picks the cheaper one. 
All parameters (prefill speed, decode speed, interference coefficient, inter-GPU bandwidth) are measured live on the actual device at startup.\n\n---\n\n## Key Features\n\n- **Paged KV Cache** — block-granular memory management with copy-on-write for beam search \u002F speculative decoding forks\n- **Chunked Prefill** — long prompts are split into configurable chunks and interleaved with decode steps, keeping GPU utilisation high\n- **Custom CUDA Paged Attention Kernel** — hand-written CUDA kernel for gather-scatter attention over non-contiguous KV blocks\n- **Async KV Transfer** — prefill→decode KV cache migration over a dedicated CUDA stream via pinned memory relay or P2P direct (NVLink), with overlap against the compute stream\n- **Adaptive Router** — per-request routing decision from a hardware-fitted analytical cost model; no oracle, no offline training\n- **Output Length Predictor** — online Bayesian predictor for output length, used by the router to estimate decode cost before generation starts\n- **Multi-Worker CentralScheduler** — concurrent Collocated and Disaggregated pipelines on separate threads, with dynamic batch management\n- **Poisson Arrival Benchmark** — realistic open-loop load test with configurable arrival rate, workload distribution, warmup, and drain phases\n\n---\n\n## Modules\n\n| Module | Description |\n|---|---|\n| `block_manager\u002F` | Sequence + SequenceGroup data structures, `BlockSpaceManager` (paged KV allocation, CoW fork) |\n| `engine\u002F` | `ModelRunner` (custom `paged_forward` hook), `Engine` (scheduler loop, chunked prefill), `Scheduler` |\n| `paged_attention\u002F` | CUDA C++ extension: paged KV store ops + paged multi-head attention kernel |\n| `workers\u002F` | `CollocatedWorker`, `PrefillWorker`, `DecodeWorker`, `kv_transfer` (pinned relay + P2P) |\n| `router\u002F` | `Router` (wraps cost model), `OutputLengthPredictor` (online Bayesian), `CentralScheduler` |\n| `cost_model\u002F` | `profiler.py` (device 
micro-benchmarks), `analytical.py` (curve fitting + latency formulas) |\n| `benchmark\u002F` | Static batch benchmark, Poisson arrival benchmark, automated sweep, plotting |\n| `examples\u002F` | `demo_collocated.py` (single GPU), `demo_multiGPU.py` (full pipeline on 8× GPU) |\n| `docs\u002F` | Per-module deep-dive documentation in English and Chinese |\n\n---\n\n## Installation\n\n**1. Clone the repo**\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fyour-username\u002FnanoPD.git\ncd nanoPD\n```\n\n**2. Build the CUDA extension** (compiles for the GPU on the current machine)\n\n```bash\ncd nanoPD\u002Fpaged_attention\npip install -e . --no-build-isolation\ncd ..\u002F..\n```\n\n> Requires: Python ≥ 3.10, PyTorch ≥ 2.1, CUDA ≥ 11.8, and NVCC in `PATH`.  \n> The extension uses `-arch=native` and auto-detects the installed GPU's compute capability.\n\n**3. Install Python dependencies**\n\n```bash\npip install transformers scipy numpy matplotlib\n```\n\n---\n\n## Quick Start\n\n### Single-GPU collocated inference\n\nRuns 5 prompts through `Engine.generate()` on a single GPU. Suitable for RTX 4060\u002F4070\u002F4080 with Qwen2-1.5B.\n\n```bash\npython examples\u002Fdemo_collocated.py\n# or specify a local path:\npython examples\u002Fdemo_collocated.py --model \u002Fpath\u002Fto\u002FQwen2-1.5B --gpu 0 --max-new-tokens 300\n```\n\n```\nLoading Qwen\u002FQwen2-1.5B on cuda:0 ...\nModel loaded in 7.1s\n\n[1\u002F5] Prompt: What is the capital of France?\n  Response (7 tokens, 1.04s, 6.7 tok\u002Fs):\n    The capital of France is Paris.\n...\n```\n\n### Multi-GPU adaptive inference (full pipeline)\n\nRuns the three-step demo on 8× GPUs — profile → fit cost model → 60 s Poisson adaptive inference. 
Results are written to `output\u002Foutput.txt`.\n\n```bash\npython examples\u002Fdemo_multiGPU.py --model \u002Fpath\u002Fto\u002FQwen3-8B\n\n# Skip re-profiling if output\u002Fdata\u002Fprofile_data.pt already exists\npython examples\u002Fdemo_multiGPU.py --model \u002Fpath\u002Fto\u002FQwen3-8B --skip-profile\n\n# Tune load\npython examples\u002Fdemo_multiGPU.py --model \u002Fpath\u002Fto\u002FQwen3-8B \\\n    --skip-profile --arrival-rate 0.3 --workload long\n```\n\nDefault GPU assignment:\n\n| Role | GPU | Flag |\n|---|---|---|\n| Collocated worker | 0 | `--collocated-gpu` |\n| Prefill workers | 1, 3 | `--prefill-gpus 1 3` |\n| Decode worker | 2 | `--decode-gpu` |\n\nOutput files:\n\n| File | Content |\n|---|---|\n| `output\u002Fdata\u002Fprofile_data.pt` | Raw micro-benchmark measurements |\n| `output\u002Fdata\u002Fparams.json` | Fitted cost model parameters |\n| `output\u002Fdata\u002Fresults.json` | Full per-request benchmark results |\n| `output\u002Foutput.txt` | Human-readable summary |\n\n---\n\n## Cost Model & Routing\n\nThe router estimates end-to-end latency for both strategies using four hardware-measured parameters:\n\n| Parameter | Meaning | RTX 4090 × 8 | H20 |\n|---|---|---|---|\n| α | Prefill latency (ms\u002Ftoken) | 0.1247 | 0.1452 |\n| β | Decode step latency at batch=1 (ms) | 51.56 | 33.10 |\n| batch_thresh | Memory→compute crossover batch size | 16 | 16 |\n| γ | Prefill interference on decode (ms\u002Ftoken) | 0.0869 | 0.1302 |\n| bandwidth | Inter-GPU transfer bandwidth (GB\u002Fs) | 12.9 (pinned relay) | 392 (P2P) |\n\n**Key insight** — The routing decision reduces to comparing two costs that are both linear in prompt length:\n\n```\nExtra cost of disaggregated : transfer_rate × L          (pay to move KV cache)\nExtra cost of collocated     : γ × L × (load\u002Fbatch_thresh)  (pay for prefill interference)\n\nDisaggregated wins when:  γ \u002F transfer_rate > batch_thresh \u002F system_load\n```\n\nOn RTX 4090: `γ \u002F transfer_rate 
≈ 7.6` → Disaggregated wins from **system_load ≥ 3**  \nOn H20: `γ \u002F transfer_rate ≈ 346` → Disaggregated wins at **virtually any non-zero load**\n\nThe full formula and per-hardware analysis are in [`docs\u002Fen\u002F04-cost_model_en.md`](nanoPD\u002Fdocs\u002Fen\u002F04-cost_model_en.md).\n\n---\n\n## Benchmark Results\n\nTested on **Qwen3-8B** with two hardware configurations.\n\n### Static Serial Benchmark\n\n| Workload | Strategy | 4090 p50 | 4090 p99 | H20 p50 | H20 p99 |\n|---|---|---|---|---|---|\n| short | Collocated | 6.4 s | 6.4 s | 4.9 s | 7.2 s |\n| short | Disaggregated | 9.2 s | 9.2 s | 4.9 s | 3.4 s |\n| long | Collocated | 7.2 s | 7.3 s | 6.1 s | 10.2 s |\n| long | Disaggregated | 7.3 s | ~7 s | 8.4 s | 10.4 s |\n\nOn H20, Disaggregated matches Collocated on short prompts (both 4.9 s) because 392 GB\u002Fs P2P bandwidth makes KV transfer nearly free. On the 4090, the 12.9 GB\u002Fs pinned-relay bandwidth adds a visible overhead, exactly as the cost model predicts.\n\n### Poisson Arrival Benchmark (60 s window, mixed workload)\n\n**RTX 4090 × 8:**\n\n![Throughput 4090](nanoPD\u002Fbenchmark\u002Ffigures\u002Ffig_throughput.png)\n\n**H20:**\n\n![Throughput H20](nanoPD\u002Fbenchmark\u002Ffigures_h20\u002Ffig_throughput.png)\n\n- **Adaptive** saturates at ~240 tok\u002Fs on the 4090 and ~175 tok\u002Fs on H20 at moderate arrival rates\n- **Collocated** is competitive at low load, but its p99 tail latency degrades quickly as concurrency grows\n- **Disaggregated** (serial implementation) plateaus at ~25–30 tok\u002Fs regardless of device; the bottleneck is the lack of concurrent decode batching in the serial benchmark path\n\nMore plots and analysis: [`docs\u002Fen\u002F07-benchmark_en.md`](nanoPD\u002Fdocs\u002Fen\u002F07-benchmark_en.md)\n\n---\n\n## Project Structure\n\n```\nnanoPD\u002F                            ← repo root\n├── .gitignore\n├── README.md\n├── disaggregated_inference_engine.md   ← high-level design notes\n├── examples\u002F\n│   
├── demo_collocated.py         ← single-GPU demo (Qwen2-1.5B)\n│   └── demo_multiGPU.py           ← 8× GPU full pipeline demo\n└── nanoPD\u002F                        ← source package\n    ├── block_manager\u002F\n    │   ├── block_manager.py       ← BlockSpaceManager, PhysicalBlock\n    │   └── sequence.py            ← Sequence, SequenceGroup, SequenceStatus\n    ├── engine\u002F\n    │   ├── engine.py              ← Engine (scheduler loop + chunked prefill)\n    │   ├── model_runner.py        ← ModelRunner with paged_forward hook\n    │   └── scheduler.py           ← Scheduler (prefill\u002Fdecode batching)\n    ├── paged_attention\u002F\n    │   └── csrc\u002F                  ← CUDA kernels (paged attention + KV store)\n    ├── workers\u002F\n    │   ├── collocated_worker.py\n    │   ├── prefill_worker.py\n    │   ├── decode_worker.py\n    │   └── kv_transfer.py         ← async KV migration (pinned relay + P2P)\n    ├── router\u002F\n    │   ├── central_scheduler.py   ← CentralScheduler (multi-worker dispatch)\n    │   ├── router.py              ← Router (wraps cost model + predictor)\n    │   └── output_lenth_predictor.py\n    ├── cost_model\u002F\n    │   ├── profiler.py            ← device micro-benchmarks\n    │   ├── analytical.py          ← curve fitting + routing decision\n    │   ├── params.json            ← RTX 4090 fitted parameters\n    │   └── params_h20.json        ← H20 fitted parameters\n    ├── benchmark\u002F\n    │   ├── benchmark.py           ← static serial benchmark\n    │   ├── benchmark_poisson.py   ← Poisson arrival benchmark\n    │   ├── sweep.py               ← automated sweep across arrival rates\n    │   └── plot_benchmark.py      ← result visualisation\n    └── docs\u002F\n        ├── en\u002F                    ← English documentation (7 modules)\n        └── zh\u002F                    ← Chinese documentation (7 modules)\n```\n\n---\n\n## Documentation\n\nEach module has a dedicated deep-dive doc covering design rationale, data 
structures, algorithms, and worked examples.\n\n| # | English | Chinese |\n|---|---|---|\n| 1 | [Block Manager](nanoPD\u002Fdocs\u002Fen\u002F01-block_manager_en.md) | [块管理器](nanoPD\u002Fdocs\u002Fzh\u002F01-block_manager_cn.md) |\n| 2 | [Engine](nanoPD\u002Fdocs\u002Fen\u002F02-engine_en.md) | [推理引擎](nanoPD\u002Fdocs\u002Fzh\u002F02-engine_cn.md) |\n| 3 | [CUDA Kernels](nanoPD\u002Fdocs\u002Fen\u002F03-cuda_kernels_en.md) | [CUDA 内核](nanoPD\u002Fdocs\u002Fzh\u002F03-cuda_kernels_cn.md) |\n| 4 | [Cost Model](nanoPD\u002Fdocs\u002Fen\u002F04-cost_model_en.md) | [代价模型](nanoPD\u002Fdocs\u002Fzh\u002F04-cost_model_cn.md) |\n| 5 | [Workers](nanoPD\u002Fdocs\u002Fen\u002F05-workers_en.md) | [Worker 层](nanoPD\u002Fdocs\u002Fzh\u002F05-workers_cn.md) |\n| 6 | [Router](nanoPD\u002Fdocs\u002Fen\u002F06-router_en.md) | [路由器](nanoPD\u002Fdocs\u002Fzh\u002F06-router_cn.md) |\n| 7 | [Benchmark](nanoPD\u002Fdocs\u002Fen\u002F07-benchmark_en.md) | [基准测试](nanoPD\u002Fdocs\u002Fzh\u002F07-benchmark_cn.md) |\n","nanoPD is a from-scratch prefill\u002Fdecode disaggregated inference engine for large language models (LLMs). It places the compute-intensive prefill phase and the memory-bandwidth-bound decode phase on dedicated GPUs, which avoids the interference the two phases cause each other when they share a device and raises overall throughput. It implements the full stack, including a custom paged KV cache, chunked prefill, and a hand-written CUDA attention kernel, and provides a central scheduler that coordinates KV transfer between GPUs and path selection. It suits scenarios that need efficient use of hardware to speed up LLM inference, such as online serving and large-scale text generation.","2026-06-11 02:45:51","CREATED_QUERY"]