[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-76197":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":12,"contributorsCount":12,"subscribersCount":12,"size":12,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":12,"forks30d":12,"starsTrendScore":17,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":12,"starSnapshotCount":12,"syncStatus":13,"lastSyncTime":27,"discoverSource":28},76197,"QuantumFlow","zimingttkx\u002FQuantumFlow","zimingttkx","QuantumFlow - Distributed LLM inference scheduling framework with multi-backend support (vLLM, TGI, SGLang), adaptive scheduling strategies, and cluster management.",null,"Python",175,0,2,1,14,99,6,58.9,"MIT License",false,"main",true,[],"2026-06-12 04:01:21","# QuantumFlow\n\n\u003Cdiv align=\"center\">\n\n![QuantumFlow Logo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FQuantumFlow-AI%20Inference-6366F1?style=for-the-badge&logo=rocket)\n[![Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.10+-00D9FF?style=flat-square&logo=python&logoColor=white)](https:\u002F\u002Fwww.python.org\u002F)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-FF6B6B?style=flat-square)](LICENSE)\n[![Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fquantumflow\u002Fquantumflow?style=flat-square&color=F59E0B)](https:\u002F\u002Fgithub.com\u002Fquantumflow\u002Fquantumflow\u002Fstargazers)\n[![Forks](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002Fquantumflow\u002Fquantumflow?style=flat-square&color=10B981)](https:\u002F\u002Fgithub.com\u002Fquantumflow\u002Fquantumflow\u002Fnetwork\u002Fmembers)\n\n**🚀 A next-generation distributed LLM inference platform: 
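run 100B-parameter models on every machine**\n\n*\"Schedule AI inference tasks the way you schedule Kubernetes Pods\"*\n\n[English](README.md) | [中文](README_zh.md)\n\n\u003C\u002Fdiv>\n\n---\n\n## ✨ Features\n\n\u003Cdiv align=\"center\">\n\n| 🎯 Core Capability | 🌟 Differentiator | 🔧 Technical Advantage | Status |\n|:---:|:---:|:---:|:---:|\n| **Intelligent scheduling** | Gang \u002F Pack \u002F adaptive multi-strategy | Automatically selects the best execution path | ✅ Code complete |\n| **Distributed deployment** | Redis queue + Worker nodes | Controller and Workers fully decoupled | ✅ Code complete |\n| **Multi-backend support** | vLLM \u002F HF \u002F TGI \u002F SGLang | Unified interface, easy switching | ✅ vLLM + HF available |\n| **GPU optimization** | BatchAccumulator \u002F Chunked Prefill \u002F Block VRAM | 99% single-GPU utilization, fine-grained VRAM management | ✅ Code complete |\n| **Domestic (Chinese) hardware** | Deep Ascend NPU adaptation | Breaks the NVIDIA monopoly | 📋 Planned |\n| **Enterprise-grade** | Multi-tenancy \u002F rate limiting \u002F disaster recovery | Production features out of the box | 📋 Planned |\n\n\u003C\u002Fdiv>\n\n> ✅ Done &nbsp;&nbsp; 🔄 In progress &nbsp;&nbsp; 📋 Planned\n\n### 🔥 Why QuantumFlow?\n\n```\n┌─────────────────────────────────────────────────────────────────┐\n│                                                                 │\n│   Traditional:                                                  │\n│   ┌───────────────┐    ┌───────────────┐    ┌───────────────┐   │\n│   │     Model     │───▶│ Manual alloc. │───▶│  Inefficient  │   │\n│   └───────────────┘    └───────────────┘    └───────────────┘   │\n│                                                                 │\n│   QuantumFlow:                                                  │\n│   ┌───────────────┐    ┌───────────────┐    ┌───────────────┐   │\n│   │     Model     │───▶│ Smart sched.  │───▶│   Efficient   │   │\n│   └───────────────┘    └───────────────┘    └───────────────┘   │\n│                                │                                │\n│                   ┌────────────┴────────────┐                   │\n│                   │ Adaptive strategy       │                   │\n│                   │ • Gang (large models)   │                   │\n│                   │ • Pack (small models)   │                   │\n│                   │ • Adaptive (AI)         │                   │\n│                   └─────────────────────────┘                   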
│\n└─────────────────────────────────────────────────────────────────┘\n```\n\n---\n\n## 🚀 Quick Start\n\n### 📦 Installation\n\n```bash\ngit clone \u003Crepo-url>\ncd QuantumFlow\npip install -e .\n```\n\n### 💻 Launch\n\n```bash\n# One-command launch (recommended)\n.\u002Fscripts\u002Fqf\n\n# Or start manually\npython -m quantumflow.cli serve\n```\n\nOpen `http:\u002F\u002Flocalhost:8000` in a browser to reach the web frontend.\n\n### 🛠️ CLI\n\n```bash\n# Interactive terminal\npython -m quantumflow.cli interactive\n\n# Command line\npython -m quantumflow.cli status              # cluster status\npython -m quantumflow.cli models              # list models\npython -m quantumflow.cli load Qwen2.5-1.5B  # load a model\npython -m quantumflow.cli chat Qwen2.5-1.5B -p \"Hello\"  # chat\npython -m quantumflow.cli generate Qwen2.5-1.5B -p \"Hello\"  # generate\n```\n\n---\n\n## 🏗️ System Architecture\n\n```\n┌────────────────────────────────────────────────────────────────────────┐\n│                         QuantumFlow Platform                            │\n├────────────────────────────────────────────────────────────────────────┤\n│                                                                        │\n│  ┌──────────────────────────────────────────────────────────────────┐  │\n│  │                 Access Layer (Gateway) ✅ Done                   │  │\n│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐         │  │\n│  │  │ REST API│  │ gRPC API│  │ Python  │  │   CLI   │         │  │\n│  │  │ ✅FastAPI│  │ 📋 SDK  │  │ 📋 SDK  │  │ ✅ CLI  │         │  │\n│  │  └────┬────┘  └────┬────┘  └────┬────┘  └────┬────┘         │  │\n│  └───────┼─────────────┼─────────────┼─────────────┼───────────────┘  │\n│          └─────────────┴─────────────┴─────────────┘                     │\n│                               │                                         │\n│  ┌───────────────────────────┼───────────────────────────────────────┐ │\n│  │             Scheduling Layer (Scheduler) ✅ Code complete            │ │\n│  │  ┌────────────────────────────────────────────────────────────┐  │ │\n│  │  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │  │ 
│\n│  │  │  │  Gang    │  │  Pack    │  │Adaptive  │  │ Priority │   │  │ │\n│  │  │  │ Scheduler│  │ Scheduler│  │ Strategy │  │ Queue   │   │  │ │\n│  │  │  └──────────┘  └──────────┘  └──────────┘  └──────────┘   │  │ │\n│  │  └────────────────────────────────────────────────────────────┘  │ │\n│  │                              │                                     │ │\n│  │  ┌───────────────────────────┼───────────────────────────────┐  │ │\n│  │  │   ✅ DistributedScheduler (Redis queue + HTTP Worker comms)  │  │ │\n│  │  └───────────────────────────────────────────────────────────┘  │ │\n│  └──────────────────────────────────────────────────────────────────┘ │\n│                               │                                         │\n│  ┌───────────────────────────┼───────────────────────────────────────┐ │\n│  │              Storage Layer (Storage) ✅ Redis queue deployed        │ │\n│  │  ┌─────────────────┐  ┌──────────────────────────────────────┐  │ │\n│  │  │  Redis Queue    │  │  RedisConnectionManager (singleton)     │  │ │\n│  │  │ ✅ ZSET priority│  │  ✅ Health check \u002F auto-reconnect      │  │ │\n│  │  └─────────────────┘  └──────────────────────────────────────┘  │ │\n│  └──────────────────────────────────────────────────────────────────┘ │\n│                               │                                         │\n│  ┌───────────────────────────┼───────────────────────────────────────┐ │\n│  │                Cluster Layer (Cluster) ✅ Distributed mode           │ │\n│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐              │ │\n│  │  │   Node      │  │  Service    │  │  Health    │              │ │\n│  │  │  Registry   │  │  Discovery  │  │  Monitor   │              │ │\n│  │  └─────────────┘  └─────────────┘  └─────────────┘              │ │\n│  └──────────────────────────────────────────────────────────────────┘ │\n│                               │                                         │\n│  ┌───────────────────────────┼───────────────────────────────────────┐ 
│\n│  │            Execution Layer (Worker Pool) ✅ Distributed Workers     │ │\n│  │  ┌─────────────────────────────────────────────────────────────┐  │ │\n│  │  │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐   │  │ │\n│  │  │  │ ✅ HF    │  │ ✅ vLLM  │  │ 📋 TGI   │  │📋 SGLang │   │  │ │\n│  │  │  └──────────┘  └──────────┘  └──────────┘  └──────────┘   │  │ │\n│  │  │           ▲            ▲            ▲            ▲        │  │ │\n│  │  │            └────────────┴────────────┴────────────┘         │  │ │\n│  │  │                    Unified Inference API                   │  │ │\n│  │  │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────┐ │  │ │\n│  │  │  │ TaskFetcher    │  │ WorkerRegistry  │  │ WorkerNode │ │  │ │\n│  │  │  │ ✅ Redis pull   │  │ ✅ (De)register │  │ ✅ HTTP API │ │  │ │\n│  │  │  └─────────────────┘  └─────────────────┘  └─────────────┘ │  │ │\n│  │  └─────────────────────────────────────────────────────────────┘  │ │\n│  │                                                                  │ │\n│  │  ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐        │ │\n│  │  │Node 1  │ │Node 2  │ │Node 3  │ │ Ascend │ │Cambricon│       │ │\n│  │  │ A100×8 │ │4090×4  │ │H100×4  │ │ NPU×4  │ │  NPU×4  │       │ │\n│  │  └────────┘ └────────┘ └────────┘ └────────┘ └────────┘        │ │\n│  └──────────────────────────────────────────────────────────────────┘ │\n│                                                                        │\n└────────────────────────────────────────────────────────────────────────┘\n```\n\n---\n\n## 🎯 Scheduling Strategies\n\n### Gang Scheduling: The Dedicated Tool for Large Models\n\n```\n┌─────────────────────────────────────────┐\n│      Gang Scheduling (large models)     │\n│                                          │\n│   Request: 72B Model, TP=8              │\n│                                          │\n│   ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐     │\n│   │ GPU0│ │ GPU1│ │ GPU2│ │ GPU3│ ... 
│\n│   │  ✗  │ │  ✗  │ │  ✗  │ │  ✗  │     │\n│   └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘     │\n│      └────────┼────────┼────────┘       │\n│               ▼                           │\n│        All GPUs or Nothing                │\n│                                         │\n│   ✅ Best fit for 100B+ models          │\n│   ✅ Minimizes communication overhead   │\n│   ✅ Ensures model consistency          │\n└─────────────────────────────────────────┘\n```\n\n### Pack Scheduling: The Efficiency Champion for Small Models\n\n```\n┌─────────────────────────────────────────┐\n│      Pack Scheduling (small models)     │\n│                                          │\n│   Request: 7B Model × N                  │\n│                                          │\n│   ┌─────┐ ┌─────┐ ┌─────┐ ┌─────┐     │\n│   │Req 1│ │Req 2│ │Req 3│ │Req N│     │\n│   │  ✗  │ │  ✗  │ │  ✗  │ │  ✗  │     │\n│   └──┬──┘ └──┬──┘ └──┬──┘ └──┬──┘     │\n│      └────────┼────────┼────────┘       │\n│               ▼                           │\n│      Shared GPU, Batched                 │\n│                                         │\n│   ✅ Maximizes GPU utilization          │\n│   ✅ High-concurrency handling          │\n│   ✅ Lower cost per request             │\n└─────────────────────────────────────────┘\n```\n\n---\n\n## 📊 GPU Performance Benchmarks\n\n### Measured Data: RTX 4080 Laptop GPU (12GB)\n\nThe chart below is generated from real run data and shows GPU performance under different concurrency loads:\n\n![QuantumFlow GPU Benchmark](docs\u002Fbenchmarks.png)\n\n**Test configuration**\n- Hardware: NVIDIA GeForce RTX 4080 Laptop GPU (12GB)\n- Model: Qwen2.5-1.5B-Instruct (FP16, HuggingFace Transformers)\n- Optimizations: BatchAccumulator (max_batch_size=8, max_delay=50ms) + torch.compile\n- **Test file**: [tests\u002Fquick_benchmark.py](tests\u002Fquick_benchmark.py) (10 scenarios covering every API path)\n- **Chart generation**: [tests\u002Fregenerate_chart.py](tests\u002Fregenerate_chart.py)\n\n### Measured Results: HuggingFace + BatchAccumulator (6 representative scenarios)\n\n> The 6 scenarios below are selected from the full set of 10 tests and cover the core API paths and typical workloads. Full data is in [docs\u002Fbenchmark_data.json](docs\u002Fbenchmark_data.json).\n\n| Scenario | API | GPU Utilization | P50 Latency | Throughput | Success Rate 
|\n|------|:---:|:---------:|:--------:|:------:|:------:|\n| **A: Single (greedy, short)** | \u002Fgenerate | 59% | 509 ms | 76.6 tok\u002Fs | 100% |\n| **B: Chat (8 concurrent)** | \u002Fgenerate | 51% | 1058 ms | 65.9 tok\u002Fs | 100% |\n| **C: Code Generation** | \u002Fgenerate | 39% | 5332 ms | 52.6 tok\u002Fs | 100% |\n| **D: Long Prompt + Generation** | \u002Fgenerate | 69% | 2016 ms | 83.4 tok\u002Fs | 100% |\n| **E: VRAM Usage** | \u002Fgenerate | 69% | 2016 ms | 83.4 tok\u002Fs | 100% |\n| **F: Success Rate** | \u002Fgenerate | 54% | 462 ms | 85.5 tok\u002Fs | 100% |\n\n> **Coverage notes**:\n> - `\u002Fgenerate`: BatchAccumulator dynamic batching with a 50 ms window (scenarios A-D, H)\n> - `\u002Fgenerate\u002Fstream`: streaming generation bridged through Thread + Queue (scenario E)\n> - `\u002Fchat`: ChatML-format chat endpoint (the full test is included in tests\u002Fquick_benchmark.py)\n> - `\u002Fbatch`: direct engine batch processing that bypasses BatchAccumulator (the full test is included in tests\u002Fquick_benchmark.py)\n>\n> **Legend (X-axis labels)**:\n> - **A**: single-request baseline (greedy, short prompt)\n> - **B**: short chat, 8 concurrent requests\n> - **C**: code generation (medium prompt, long output)\n> - **D**: long prompt + generation (Chinese technical content)\n> - **E**: streaming generation (`\u002Fgenerate\u002Fstream` path)\n> - **H**: high-concurrency stress test (32 concurrent requests)\n>\n> **Full test suite**: [tests\u002Fquick_benchmark.py](tests\u002Fquick_benchmark.py) has 10 scenarios covering all API paths and sampling parameters\n>\n> **Implemented optimizations**: ① BatchAccumulator dynamic batching (merges requests within a 50 ms window) ② torch.compile acceleration\n\n### GPU Utilization Metrics\n\nQuantumFlow collects two separate GPU metrics through the NVIDIA NVML API:\n\n| Metric | API Source | Meaning |\n|------|---------|------|\n| **GPU Compute Utilization (%)** | `nvmlDeviceGetUtilizationRates().gpu` | GPU compute activity: share of the sample period during which at least one kernel was executing |\n| **GPU Memory Bandwidth (%)** | `nvmlDeviceGetUtilizationRates().memory` | Device memory activity: share of the sample period during which device memory was being read or written |\n\nIndustry reference points:\n- **MLPerf Inference**: the industry-standard benchmark suite for inference throughput and latency\n- **vLLM Continuous Batching**: production-grade batching, typically reaching **60-85% GPU utilization**\n- **Target range**: GPU compute utilization of 80%+ and memory bandwidth utilization of 80%+ are considered efficient\n\n---\n\n## 🛠️ Supported Models\n\n| Model | Parameters | VRAM Required | Status |\n|------|--------|----------|------|\n| Qwen2.5-1.5B | 1.5B | ~3GB | ✅ Verified |\n| Qwen2.5-3B | 3B | ~6GB | ✅ Loadable |\n| Qwen2.5-7B | 7B | ~14GB | 📋 To be tested |\n| 
LLaMA-3-8B | 8B | ~16GB | 📋 Planned |\n| Qwen2.5-72B | 72B | 4×24GB | 📋 Distributed |\n\n### Inference Engines\n\n- ✅ **HuggingFace Transformers**: verified working (with dynamic batching and torch.compile)\n- ✅ **vLLM**: adapted for v0.21.0; PagedAttention + Continuous Batching available\n- ⚠️ **Chunked Prefill**: disabled (the implementation has a bug; it will be re-enabled after a fix modeled on vLLM's chunking logic)\n- 📋 **TGI**: planned\n- 📋 **SGLang**: planned\n- 📋 **TensorRT-LLM**: planned\n\n---\n\n## 📁 Project Structure\n\n```\nQuantumFlow\u002F\n├── quantumflow\u002F              # 🎯 Core package\n│   ├── api\u002F                  # ✅ REST API (FastAPI)\n│   │   ├── routes\u002F          # API routes (incl. the scheduler.py scheduling visualization endpoint)\n│   │   ├── models\u002F          # Request\u002Fresponse models\n│   │   └── server.py        # FastAPI app\n│   │\n│   ├── scheduler\u002F           # ✅ Scheduler (incl. distributed scheduling)\n│   │   ├── scheduler.py     # Main scheduler logic\n│   │   ├── strategy\u002F        # Scheduling strategies\n│   │   ├── distributed.py    # ✅ Distributed scheduler (Redis queue + Worker HTTP communication)\n│   │   └── worker_client.py # ✅ Worker HTTP client\n│   │\n│   ├── cluster\u002F             # ✅ Cluster management (single-node mode)\n│   │\n│   ├── inference\u002F           # ✅ Inference engines\n│   │   ├── engine.py        # Engine abstraction\n│   │   ├── manager.py       # Engine manager (VRAM-aware + model eviction)\n│   │   ├── vram_manager.py  # VRAM management + fine-grained BlockPool memory\n│   │   ├── batch_accumulator.py  # Dynamic batching (50 ms merge window)\n│   │   ├── gpu_monitor.py   # GPU monitoring (NVML sampling)\n│   │   └── backends\u002F        # Engine implementations\n│   │       ├── huggingface.py # ✅ HF (dynamic batching + torch.compile + Chunked Prefill)\n│   │       └── vllm.py       # ✅ vLLM (PagedAttention + Continuous Batching)\n│   │\n│   ├── worker\u002F              # ✅ Worker node (distributed)\n│   │   └── task_fetcher.py  # ✅ Worker task fetcher (pulls from the Redis queue)\n│   │\n│   ├── storage\u002F             # ✅ Redis queue (distributed)\n│   │   ├── redis_queue.py  # ✅ Redis priority queue (ZSET-based)\n│   │   └── connection.py   # ✅ Redis connection manager (singleton)\n│   │\n│   └── cli.py               # ✅ CLI tool\n│\n├── scripts\u002F\n│   └── qf                   # ✅ One-command launch script\n│\n├── tests\u002F                    # ✅ 325 tests (incl. 59 comprehensive distributed tests)\n│\n├── configs\u002F                  # ⚙️ Configuration files\n├── 
pyproject.toml\n└── README.md\n```\n\n---\n\n## 🔧 Configuration Example\n\n```yaml\n# configs\u002Fproduction.yaml\napp:\n  name: \"QuantumFlow\"\n  environment: \"production\"\n  log_level: \"INFO\"\n\nscheduler:\n  default_strategy: \"adaptive\"\n  max_concurrent_requests: 5000\n  queue_max_size: 50000\n  strategies:\n    gang:\n      enabled: true\n      timeout_seconds: 600\n    pack:\n      enabled: true\n      max_batch_size: 64\n\ninference:\n  default_backend: \"huggingface\"\n  backends:\n    huggingface:\n      torch_compile: true        # enable torch.compile acceleration\n      prefill_chunk_size: 512    # Chunked Prefill chunk size\n      enable_chunked_prefill: true  # enable chunked prefill (the feature is currently marked disabled pending a bug fix)\n    vllm:\n      tensor_parallel_size: 1\n      gpu_memory_utilization: 0.80\n      max_model_len: 2048\n      enforce_eager: false\n      enable_chunked_prefill: true\n\ncluster:\n  heartbeat_interval_seconds: 5\n  heartbeat_timeout_seconds: 60\n```\n\n---\n\n## 🤝 Contributing\n\nContributions of all kinds are welcome!\n\n```bash\n# 1. Fork the project\n# 2. Create a feature branch\ngit checkout -b feature\u002Famazing-feature\n\n# 3. Commit your changes\ngit commit -m \"feat: add amazing feature\"\n\n# 4. Push the branch\ngit push origin feature\u002Famazing-feature\n\n# 5. 
Open a Pull Request\n```\n\n### Development Environment\n\n```bash\n# Clone and install\ngit clone https:\u002F\u002Fgithub.com\u002Fquantumflow\u002Fquantumflow.git\ncd quantumflow\npip install -e \".[dev]\"\n\n# Run the tests\npytest tests\u002F -v\n\n# Format and lint the code\nblack quantumflow\u002F\nisort quantumflow\u002F\nruff check quantumflow\u002F\n\n# Type checking\nmypy quantumflow\u002F\n```\n\n---\n\n## 🏃 Deployment\n\n### Single Node\n\n```bash\n.\u002Fscripts\u002Fqf           # one-command launch\n# or\npython -m quantumflow.cli serve\n```\n\n### Distributed (planned)\n\n```bash\n# Controller\nquantumflow serve --host 0.0.0.0 --port 8000\n\n# Worker node (not yet implemented)\nquantumflow worker --controller-url http:\u002F\u002Flocalhost:8000 --backend vllm\n```\n\n---\n\n## 📖 API Usage\n\n### REST API\n\n```bash\n# Cluster status\ncurl http:\u002F\u002Flocalhost:8000\u002Fapi\u002Fv1\u002Fcluster\u002Fstatus\n\n# List models\ncurl http:\u002F\u002Flocalhost:8000\u002Fapi\u002Fv1\u002Fmodels\u002Flist\n\n# Loaded models\ncurl http:\u002F\u002Flocalhost:8000\u002Fapi\u002Fv1\u002Fmodels\u002Fstatus\n\n# Load a model\ncurl -X POST http:\u002F\u002Flocalhost:8000\u002Fapi\u002Fv1\u002Fmodels\u002Fload \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\"model\": \"Qwen2.5-1.5B\"}'\n\n# Inference\ncurl -X POST http:\u002F\u002Flocalhost:8000\u002Fapi\u002Fv1\u002Finference\u002Fgenerate \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"Qwen2.5-1.5B\",\n    \"prompt\": \"Hello\",\n    \"sampling_params\": {\"temperature\": 0.7, \"max_tokens\": 100}\n  }'\n\n# Chat\ncurl -X POST http:\u002F\u002Flocalhost:8000\u002Fapi\u002Fv1\u002Finference\u002Fchat \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"Qwen2.5-1.5B\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}]\n  }'\n\n# Streaming generation\ncurl -X POST http:\u002F\u002Flocalhost:8000\u002Fapi\u002Fv1\u002Finference\u002Fgenerate\u002Fstream \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\"model\": \"Qwen2.5-1.5B\", \"prompt\": \"Hello\", \"stream\": true}'\n\n# Scheduling visualization (includes VRAM, Block, Batch, and GPU status)\ncurl 
http:\u002F\u002Flocalhost:8000\u002Fapi\u002Fv1\u002Fscheduler\u002Fstatus\n```\n\n### Python SDK\n\n```python\nimport asyncio\n\nimport httpx\n\n\nasync def main() -> None:\n    async with httpx.AsyncClient() as client:\n        # Inference\n        resp = await client.post(\"http:\u002F\u002Flocalhost:8000\u002Fapi\u002Fv1\u002Finference\u002Fgenerate\",\n            json={\"model\": \"Qwen2.5-1.5B\", \"prompt\": \"Hello\",\n                  \"sampling_params\": {\"max_tokens\": 100}})\n        print(resp.json()[\"generated_text\"])\n\n        # Chat\n        resp = await client.post(\"http:\u002F\u002Flocalhost:8000\u002Fapi\u002Fv1\u002Finference\u002Fchat\",\n            json={\"model\": \"Qwen2.5-1.5B\",\n                  \"messages\": [{\"role\": \"user\", \"content\": \"Hello\"}]})\n        print(resp.json()[\"generated_text\"])\n\n\n# `async with` only works inside a coroutine, so drive it with asyncio.run\nasyncio.run(main())\n```\n\n---\n\n## 🧪 Tests\n\n```bash\npytest tests\u002F -v    # 266 tests, all passing\n```\n\n---\n\n## ⭐ Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=quantumflow\u002Fquantumflow&type=Date)](https:\u002F\u002Fstar-history.com\u002F#quantumflow\u002Fquantumflow&Date)\n\n---\n\n## 📜 License\n\nThis project is open source under the Apache License 2.0. See the [LICENSE](LICENSE) file for details.\n\n---\n\n## 🙏 Acknowledgments\n\nThis project stands on the shoulders of giants:\n\n- [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm): reference for the PagedAttention + Continuous Batching implementation\n- [FlashAttention](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention): reference for efficient attention kernels\n- [HuggingFace Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers): foundation of the inference engine\n- [Ray](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fray): distributed computing framework\n- [K8s](https:\u002F\u002Fkubernetes.io\u002F): container orchestration reference\n- All open-source contributors!\n\n---\n\n\u003Cdiv align=\"center\">\n\n**If this project helps you, please give us a ⭐**\n\n*Built with ❤️ by the QuantumFlow Team*\n\n\u003C\u002Fdiv>\n","QuantumFlow is a distributed LLM inference scheduling framework with multi-backend support that intelligently manages and schedules inference tasks for large language models. The project is written in Python; its core features include intelligent scheduling, distributed deployment, and support for multiple backends such as vLLM, TGI, and SGLang, along with adaptive scheduling strategies and cluster management, with the aim of improving single-GPU utilization and VRAM management efficiency. QuantumFlow 
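is suited to enterprises or research institutions that need to handle large language model inference efficiently; multi-tenancy, rate limiting, and disaster recovery are also planned to meet the practical needs of production environments.","2026-06-11 03:54:46","CREATED_QUERY"]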