[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72186":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},72186,"mini-sglang","sgl-project\u002Fmini-sglang","sgl-project","A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.","",null,"Python",4373,693,16,11,0,20,52,211,60,30.52,"MIT License",false,"main",true,[],"2026-06-12 02:02:59","\u003Cp align=\"center\">\n\u003Cimg width=\"400\" src=\"\u002Fassets\u002Flogo.png\">\n\u003C\u002Fp>\n\n# Mini-SGLang\n\nA **lightweight yet high-performance** inference framework for Large Language Models.\n\n---\n\nMini-SGLang is a compact implementation of [SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang), designed to demystify the complexities of modern LLM serving systems. 
With a compact codebase of **~5,000 lines of Python**, it serves as both a capable inference engine and a transparent reference for researchers and developers.\n\n## ✨ Key Features\n\n- **High Performance**: Achieves state-of-the-art throughput and latency with advanced optimizations.\n- **Lightweight & Readable**: A clean, modular, and fully type-annotated codebase that is easy to understand and modify.\n- **Advanced Optimizations**:\n  - **Radix Cache**: Reuses KV cache for shared prefixes across requests.\n  - **Chunked Prefill**: Reduces peak memory usage for long-context serving.\n  - **Overlap Scheduling**: Hides CPU scheduling overhead with GPU computation.\n  - **Tensor Parallelism**: Scales inference across multiple GPUs.\n  - **Optimized Kernels**: Integrates **FlashAttention** and **FlashInfer** for maximum efficiency.\n  - ...\n\n## 🚀 Quick Start\n\n> **⚠️ Platform Support**: Mini-SGLang currently supports **Linux only** (x86_64 and aarch64). Windows and macOS are not supported due to dependencies on Linux-specific CUDA kernels (`sgl-kernel`, `flashinfer`). We recommend using [WSL2](https:\u002F\u002Flearn.microsoft.com\u002Fen-us\u002Fwindows\u002Fwsl\u002Finstall) on Windows or Docker for cross-platform compatibility.\n\n### 1. Environment Setup\n\nWe recommend using `uv` for a fast and reliable installation (note that `uv` does not conflict with `conda`).\n\n```bash\n# Create a virtual environment (Python 3.10+ recommended)\nuv venv --python=3.12\nsource .venv\u002Fbin\u002Factivate\n```\n\n**Prerequisites**: Mini-SGLang relies on CUDA kernels that are JIT-compiled. Ensure you have the **NVIDIA CUDA Toolkit** installed and that its version matches your driver's version. You can check your driver's CUDA capability with `nvidia-smi`.\n\n### 2. 
Installation\n\nInstall Mini-SGLang directly from the source:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fmini-sglang.git\ncd mini-sglang && uv venv --python=3.12 && source .venv\u002Fbin\u002Factivate\nuv pip install -e .\n```\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>💡 Installing on Windows (WSL2)\u003C\u002Fb>\u003C\u002Fsummary>\n\nSince Mini-SGLang requires Linux-specific dependencies, Windows users should use WSL2:\n\n1. **Install WSL2** (if not already installed):\n   ```powershell\n   # In PowerShell (as Administrator)\n   wsl --install\n   ```\n\n2. **Install CUDA on WSL2**:\n   - Follow [NVIDIA's WSL2 CUDA guide](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fwsl-user-guide\u002Findex.html)\n   - Ensure your Windows GPU drivers support WSL2\n\n3. **Install Mini-SGLang in WSL2**:\n   ```bash\n   # Inside WSL2 terminal\n   git clone https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fmini-sglang.git\n   cd mini-sglang && uv venv --python=3.12 && source .venv\u002Fbin\u002Factivate\n   uv pip install -e .\n   ```\n\n4. **Access from Windows**: The server will be accessible at `http:\u002F\u002Flocalhost:8000` from Windows browsers and applications.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>🐳 Running with Docker\u003C\u002Fb>\u003C\u002Fsummary>\n\n**Prerequisites**:\n- [Docker](https:\u002F\u002Fdocs.docker.com\u002Fget-docker\u002F)\n- [NVIDIA Container Toolkit](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html)\n\n1. **Build the Docker image**:\n   ```bash\n   docker build -t minisgl .\n   ```\n\n2. **Run the server**:\n   ```bash\n   docker run --gpus all -p 1919:1919 \\\n       minisgl --model Qwen\u002FQwen3-0.6B --host 0.0.0.0\n   ```\n\n3. **Run in interactive shell mode**:\n   ```bash\n   docker run -it --gpus all \\\n       minisgl --model Qwen\u002FQwen3-0.6B --shell\n   ```\n\n4. 
**Using Docker Volumes for persistent caches** (recommended for faster subsequent startups):\n   ```bash\n   docker run --gpus all -p 1919:1919 \\\n       -v huggingface_cache:\u002Fapp\u002F.cache\u002Fhuggingface \\\n       -v tvm_cache:\u002Fapp\u002F.cache\u002Ftvm-ffi \\\n       -v flashinfer_cache:\u002Fapp\u002F.cache\u002Fflashinfer \\\n       minisgl --model Qwen\u002FQwen3-0.6B --host 0.0.0.0\n   ```\n\n\u003C\u002Fdetails>\n\n### 3. Online Serving\n\nLaunch an OpenAI-compatible API server with a single command.\n\n```bash\n# Deploy Qwen\u002FQwen3-0.6B on a single GPU\npython -m minisgl --model \"Qwen\u002FQwen3-0.6B\"\n\n# Deploy meta-llama\u002FLlama-3.1-70B-Instruct on 4 GPUs with Tensor Parallelism, on port 30000\npython -m minisgl --model \"meta-llama\u002FLlama-3.1-70B-Instruct\" --tp 4 --port 30000\n```\n\nOnce the server is running, you can send requests using standard tools like `curl` or any OpenAI-compatible client.\n\n### 4. Interactive Shell\n\nChat with your model directly in the terminal by adding the `--shell` flag.\n\n```bash\npython -m minisgl --model \"Qwen\u002FQwen3-0.6B\" --shell\n```\n\n![shell-example](https:\u002F\u002Flmsys.org\u002Fimages\u002Fblog\u002Fminisgl\u002Fshell.png)\n\nYou can also use `\u002Freset` to clear the chat history.\n\n## Benchmark\n\n### Offline inference\n\nSee [bench.py](.\u002Fbenchmark\u002Foffline\u002Fbench.py) for more details. 
Set `MINISGL_DISABLE_OVERLAP_SCHEDULING=1` for an ablation study on overlap scheduling.\n\nTest Configuration:\n\n- Hardware: 1xH200 GPU.\n- Model: Qwen3-0.6B, Qwen3-14B\n- Total Requests: 256 sequences\n- Input Length: Randomly sampled between 100 and 1024 tokens\n- Output Length: Randomly sampled between 100 and 1024 tokens\n\n![offline](https:\u002F\u002Flmsys.org\u002Fimages\u002Fblog\u002Fminisgl\u002Foffline.png)\n\n### Online inference\n\nSee [bench_qwen.py](.\u002Fbenchmark\u002Fonline\u002Fbench_qwen.py) for more details.\n\nTest Configuration:\n\n- Hardware: 4xH200 GPUs, connected by NVLink.\n- Model: Qwen3-32B\n- Dataset: [Qwen trace](https:\u002F\u002Fgithub.com\u002Falibaba-edu\u002Fqwen-bailian-usagetraces-anon\u002Fblob\u002Fmain\u002Fqwen_traceA_blksz_16.jsonl), replaying the first 1000 requests.\n\nLaunch command:\n\n```bash\n# Mini-SGLang\npython -m minisgl --model \"Qwen\u002FQwen3-32B\" --tp 4 --cache naive\n\n# SGLang\npython3 -m sglang.launch_server --model \"Qwen\u002FQwen3-32B\" --tp 4 \\\n    --disable-radix --port 1919 --decode-attention flashinfer\n```\n\n> **Note**: If you encounter network issues when downloading models from Hugging Face, try using `--model-source modelscope` to download from ModelScope instead:\n> ```bash\n> python -m minisgl --model \"Qwen\u002FQwen3-32B\" --tp 4 --model-source modelscope\n> ```\n\n![online](https:\u002F\u002Flmsys.org\u002Fimages\u002Fblog\u002Fminisgl\u002Fonline.png)\n\n## 📚 Learn More\n\n- **[Detailed Features](.\u002Fdocs\u002Ffeatures.md)**: Explore all available features and command-line arguments.\n- **[System Architecture](.\u002Fdocs\u002Fstructures.md)**: Dive deep into the design and data flow of Mini-SGLang.\n","Mini-SGLang 
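is a lightweight, high-performance inference framework for large language models. In roughly 5,000 lines of Python, the project implements advanced optimizations, including radix cache, chunked prefill, overlap scheduling, tensor parallelism, and optimized kernels (such as FlashAttention and FlashInfer), achieving high throughput and low latency while remaining concise and readable. These traits make it well suited to scenarios that require efficiently handling complex LLM serving systems, such as research experiments, development and testing, and performance tuning of existing systems. Mini-SGLang currently supports only Linux (x86_64 and aarch64); Windows and macOS users are advised to use WSL2 or Docker for compatibility.",2,"2026-06-11 03:40:46","high_star"]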