[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-81740":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":12,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":13,"stars30d":13,"stars90d":13,"forks30d":13,"starsTrendScore":13,"compositeScore":13,"rankGlobal":10,"rankLanguage":10,"license":14,"archived":15,"fork":15,"defaultBranch":16,"hasWiki":17,"hasPages":15,"topics":18,"createdAt":10,"pushedAt":10,"updatedAt":19,"readmeContent":20,"aiSummary":21,"trendingCount":13,"starSnapshotCount":13,"syncStatus":22,"lastSyncTime":23,"discoverSource":24},81740,"ashforge","MMMchou\u002Fashforge","MMMchou","One-command local LLM deployment with automatic hardware probing, GGUF model matching, KV cache tuning, warmup benchmarking, and OpenAI-compatible API gateway. 一条命令启动本地 LLM：自动硬件检测、模型匹配、KV cache 选型、上下文长度探测、warmup 调参，并暴露 OpenAI-compatible API。","",null,"Go",29,0,"MIT License",false,"main",true,[],"2026-06-12 02:04:19","\u003Cdiv align=\"center\">\n\n\u003Cbr>\n\n```\n █████╗ ███████╗██╗  ██╗███████╗ ██████╗ ██████╗  ██████╗ ███████╗\n██╔══██╗██╔════╝██║  ██║██╔════╝██╔═══██╗██╔══██╗██╔════╝ ██╔════╝\n███████║███████╗███████║█████╗  ██║   ██║██████╔╝██║  ███╗█████╗  \n██╔══██║╚════██║██╔══██║██╔══╝  ██║   ██║██╔══██╗██║   ██║██╔══╝  \n██║  ██║███████║██║  ██║██║     ╚██████╔╝██║  ██║╚██████╔╝███████╗\n╚═╝  ╚═╝╚══════╝╚═╝  ╚═╝╚═╝      ╚═════╝ ╚═╝  ╚═╝ ╚═════╝ ╚══════╝\n```\n\n**One command. Auto-tuned. 
Maximum performance.**\n\n*The only local LLM tool that benchmarks your actual hardware — not just guesses.*\n\n[![License: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-blue.svg)](LICENSE)\n[![Go](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGo-1.22+-00ADD8.svg)](https:\u002F\u002Fgo.dev)\n[![Platform](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fplatform-Linux%20%7C%20macOS%20%7C%20Windows-lightgrey.svg)]()\n\n[Quick Start](#-quick-start) · [Why Ashforge](#-why-not-just-use-ollama) · [How It Works](#-how-it-works) · [中文](#中文)\n\n\u003C\u002Fdiv>\n\n---\n\n## The Problem\n\nYou download a 30B model. You run it in Ollama. It works... at 8 tok\u002Fs with 4K context, on a GPU that could do 35 tok\u002Fs at 64K.\n\n**Why?** Because Ollama doesn't know your GPU's bandwidth is 504 GB\u002Fs, your VRAM can fit f16 KV cache, and ubatch=512 is 40% faster than the default on your hardware.\n\nAshforge does.\n\n```bash\nashforge run Qwen3-30B-A3B\n```\n\n```\n[5\u002F6] Warmup benchmark...\n      Probe 1: ctx=128K ... OOM\n      Probe 2: ctx=64K  ... 26.3 tok\u002Fs\n      Probe 3: ctx=128K ... OOM\n      Fine:    ctx=96K  ... 22.1 tok\u002Fs\n      Tune ubatch: ub=128 → 24.8 tok\u002Fs; ub=512 → 26.3 tok\u002Fs\n      ✓ 26.3 tok\u002Fs @ 64K ctx\n\n  ┌─────────────────────────────────────────────────┐\n  │  Ready — Qwen3-30B-A3B @ 26.3 tok\u002Fs            │\n  │  API: http:\u002F\u002F127.0.0.1:21435\u002Fv1\u002Fchat\u002Fcompletions│\n  └─────────────────────────────────────────────────┘\n```\n\nSecond launch? **2 seconds**. 
Configuration cached.\n\n---\n\n## Why Not Just Use Ollama?\n\n| | Ollama \u002F LM Studio | Ashforge |\n|---|---|---|\n| **Parameter tuning** | Static defaults | Benchmarks your actual hardware |\n| **Context size** | Fixed (usually 4K-8K) | Binary search finds your real max |\n| **KV cache** | Always q8_0 | Picks f16 → q8_0 → q4_0 by VRAM fit |\n| **VRAM prediction** | None (OOM = restart) | Formula from 19,517 measurements, 365 MiB median error |\n| **MoE offload** | Manual `--n-gpu-layers` | Auto-splits attention (GPU) vs experts (CPU) |\n| **Ubatch size** | Default 512 | Benchmarks 128 vs 512, picks faster |\n| **Multi-GPU** | Basic tensor split | VRAM×bandwidth weighted split + NVLink graph mode |\n| **Repetition loops** | You wait and Ctrl+C | Auto-detected and stopped (n-gram + pattern) |\n| **Context overflow** | Silently truncates | Zero-latency extractive compression |\n| **Speculative decode** | Manual flag | Auto-enables MTP or n-gram lookup |\n| **Live monitoring** | None | Real-time TUI (VRAM, speed, temp, context) |\n| **IDE setup** | Manual config | `ashforge inject` → Cursor, Codex, Claude Code |\n\n### The Numbers That Matter\n\nAshforge uses the [oobabooga VRAM formula](https:\u002F\u002Foobabooga.github.io\u002Fblog\u002Fposts\u002Fgguf-vram-formula\u002F), derived from **19,517 real measurements across 60 models** with a **median error of just 365 MiB**, to estimate how much context your GPU can handle before launching.\n\nOther tools guess. Ashforge calculates.\n\n---\n\n## Quick Start\n\n**Install:**\n\n```bash\n# Linux \u002F macOS\ncurl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002FMMMchou\u002Fashforge\u002Fmain\u002Finstall.sh | sh\n\n# Windows (PowerShell)\nirm https:\u002F\u002Fraw.githubusercontent.com\u002FMMMchou\u002Fashforge\u002Fmain\u002Finstall.ps1 | iex\n```\n\n**Run:**\n\n```bash\nashforge run Qwen3-30B-A3B\n```\n\nThat's it. Ashforge will:\n1. Detect your GPU, VRAM, bandwidth, CPU, RAM\n2. 
Download the optimal quantization for your hardware\n3. Benchmark to find the max context at target speed\n4. Start an OpenAI-compatible API at `http:\u002F\u002Flocalhost:21435\u002Fv1`\n\n**Connect your tools:**\n\n```bash\nashforge inject    # auto-configures Cursor, Codex CLI, Claude Code\n```\n\nOr point any OpenAI-compatible client to `http:\u002F\u002Flocalhost:21435\u002Fv1`.\n\n---\n\n## How It Works\n\n```\nashforge run Qwen3-30B-A3B\n\n  [1\u002F6] Probe Hardware\n        RTX 4090 (SM89, 24576 MB, 1008 GB\u002Fs)\n        RAM: 64 GB DDR5\n\n  [2\u002F6] Select Configuration\n        Model:  Qwen3-30B-A3B (MoE, 30B total \u002F 3B active)\n        Quant:  Q4_K_M (17.2 GB)\n        Mode:   moe_offload (experts on CPU)\n\n  [3\u002F6] Check Files\n        Binary: llama-server-cuda [cached]\n        Model:  Qwen3-30B-A3B-Q4_K_M.gguf [cached]\n\n  [4\u002F6] Preflight Check\n        ✓ VRAM sufficient\n\n  [5\u002F6] Warmup Benchmark           ← the magic happens here\n        Phase 1:   coarse search (power-of-2 steps)\n        Phase 1.5: fine search (4K precision binary search)\n        Phase 2:   ubatch tuning (128 vs 512)\n        ✓ 26.3 tok\u002Fs @ 64K ctx\n\n  [6\u002F6] Start Server\n        llama-server + Ashforge proxy + TUI monitor\n```\n\n### What Gets Auto-Tuned\n\nEvery parameter is **measured, not guessed**:\n\n| Parameter | How Ashforge Decides |\n|-----------|---------------------|\n| **Context size** | Binary search from model max downward; 3 modes: speed \u002F balanced \u002F context |\n| **KV cache type** | Calculates f16 VRAM footprint → cascades f16 → q8_0+q4_0 → iso3 → q4_0 |\n| **MoE placement** | Measures VRAM for attention layers (25%), routes experts to CPU |\n| **Ubatch size** | Benchmarks candidates; skips 512 on low-bandwidth GPUs (\u003C200 GB\u002Fs) |\n| **Thread count** | 2 for full-GPU; physical_cores\u002F2 for CPU offload |\n| **Tensor split** | Weighted by VRAM × memory bandwidth per GPU |\n| **Speculative decode** | MTP 
(3 tokens) for Qwen3.6; n-gram lookup (8) for all others |\n| **Flash attention** | Enabled on SM75+ (Turing and newer) |\n| **mlock \u002F mmap** | Based on RAM headroom vs model size |\n| **KV defrag** | Auto-compact at 10% fragmentation |\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Full list of 25+ auto-configured llama-server flags\u003C\u002Fb>\u003C\u002Fsummary>\n\n```\n--ctx-size         dynamic        from warmup binary search\n--batch-size       512 \u002F 4096     mode-dependent\n--ubatch-size      128 \u002F 512      from Phase 2 benchmark\n-ctk \u002F -ctv        f16\u002Fq8_0\u002Fq4_0  from KV cache selection\n--cache-reuse      256-1024       from context size\n--threads          2 \u002F cores\u002F2    from mode + CPU count\n--n-gpu-layers     999            all layers to GPU\n--parallel         1              single-user optimized\n--kv-unified       conditional    skip on Blackwell \u002F multi-GPU\n--cpu-moe          conditional    MoE full offload\n--n-cpu-moe N      conditional    MoE partial offload\n--fit on           conditional    dense models only\n--flash-attn       conditional    SM75+ (Turing+)\n-sm graph          conditional    NVLink + turboquant\n--tensor-split     conditional    multi-GPU weighted\n--swa-full         conditional    hybrid architectures\n--mlock            conditional    RAM headroom check\n--mmap             conditional    when mlock inactive\n--num-spec-tokens  3              native MTP models\n--lookup           8              n-gram speculative\n--defrag-thold     0.1            always enabled\n--cont-batching    on             always enabled\n--metrics          on             always enabled\n--no-webui         on             always disabled\n```\n\n\u003C\u002Fdetails>\n\n---\n\n## Smart Proxy\n\nAshforge doesn't just forward requests — it makes them better:\n\n### Repetition Detection\nDual-detector system catches infinite loops in real-time:\n- **N-gram detector**: tracks 3-gram frequencies in a 200-token 
sliding window\n- **Pattern detector**: identifies repeating sequences of 2-10 tokens\n\nWhen triggered, it stops generation automatically and warns the user. No more waiting 30 seconds for a model stuck in a loop.\n\n### Context Compression\nWhen conversation history reaches 75% of the context window:\n- Keeps system prompt + recent 8K tokens untouched\n- Compresses middle messages with zero-latency extractive summary\n- Preserves code blocks, file paths, function definitions, TODOs\n- No model call needed: purely algorithmic, zero added latency\n\n### Live Context Warnings\n- Header `X-Ashforge-Context-Warning` at 80% usage\n- Inline hint at 90%: \"Context is filling up, important info may be truncated\"\n\n---\n\n## Live TUI Monitor\n\nReal-time terminal dashboard, refreshes every 2s:\n\n```\n─ Live Monitor · Generating ──────────────── refresh 2s ─\n  64K ctx · q8_0+q4_0 KV · ub512 · mlock\n\n  Speed         VRAM           RAM        GPU      Temp\n  26.3 tok\u002Fs   18.2\u002F24.0 GB   12.1\u002F64 GB   95%     67°C\n  [========..]  [========..]  [==........]  [=====.] [======....]\n\n─────────────────────────────────────────────────────────\n  Context  [================....] 52.1K \u002F 64.0K  11.9K left  compressed 2 times · saved 8.3K\n```\n\nColor-coded alerts:\n- VRAM > 90%: warns inference may crash\n- Context > 80%: suggests starting a new conversation\n- GPU temp > 85°C: warns about thermal throttling\n\n---\n\n## Supported Hardware\n\n| Platform | GPU | Status |\n|----------|-----|--------|\n| Linux x86_64 | NVIDIA CUDA (SM75+) | Stable |\n| Windows x86_64 | NVIDIA CUDA | Stable |\n| macOS arm64 | Apple Silicon Metal | Beta |\n| macOS x86_64 | Intel (CPU only) | Supported |\n| Linux x86_64 | AMD Vulkan | Experimental |\n\n**Multi-GPU**: auto-detected. NVLink → graph split mode. 
No NVLink → VRAM×bandwidth weighted tensor split.\n\n**RTX 50 series (Blackwell)**: first-run JIT compilation handled automatically (~60s one-time, then instant).\n\n---\n\n## Supported Models\n\n30+ models with pre-configured profiles. Any GGUF file also works via auto-detection.\n\n| Family | Models |\n|--------|--------|\n| **Qwen** | Qwen3-235B, Qwen3.6-35B, Qwen3-32B, Qwen3-30B-A3B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Qwen3-1.7B, Qwen3-0.6B |\n| **Llama** | Llama 4 Scout, Llama 4 Maverick |\n| **DeepSeek** | DeepSeek R1, DeepSeek R1 0528 |\n| **Gemma** | Gemma 3 27B, 12B, 4B |\n| **Phi** | Phi-4 14B, Phi-4 Mini |\n| **Mistral** | Mistral Small 24B, Codestral 25.01, Mixtral 8x7B |\n| **Others** | Yi-1.5 34B\u002F9B, InternLM3 8B, GLM-4 9B, Command R 7B |\n\n---\n\n## All Commands\n\n```bash\nashforge run \u003Cmodel>              # deploy with auto-tuning\nashforge run \u003Cmodel> --fast       # skip warmup, use cache\nashforge run \u003Cmodel> --reset      # re-benchmark from scratch\nashforge run \u003Cmodel> --mode speed # or balanced \u002F context\nashforge run \u003Cmodel> --ctx-size N # override context size\nashforge run \u003Cmodel> --host 0.0.0.0  # LAN access\n\nashforge stop                     # stop running model\nashforge status                   # show status + speed\nashforge list                     # list all models\nashforge list --installed         # only downloaded models\nashforge probe                    # hardware fingerprint\n\nashforge inject                   # auto-configure IDEs\nashforge inject --undo            # restore IDE configs\n\nashforge config show              # view settings\nashforge config set key=value     # change settings\nashforge cache clear              # clear warmup cache\n```\n\n## Configuration\n\nConfig file: `~\u002F.ashforge\u002Fconfig.yaml`\n\n| Env Variable | Config Key | Default | Description |\n|-------------|------------|---------|-------------|\n| `ASHFORGE_HF_MIRROR` | `hf_mirror` | 
`https:\u002F\u002Fhf-mirror.com` | HuggingFace mirror URL |\n| `ASHFORGE_LLAMA_PORT` | `llama_port` | `21434` | llama-server port |\n| `ASHFORGE_PROXY_PORT` | `proxy_port` | `21435` | API proxy port |\n| `ASHFORGE_MODEL_DIR` | `model_dir` | `~\u002F.ashforge\u002Fmodels` | Model storage path |\n| `ASHFORGE_LOG_LEVEL` | `log_level` | `info` | Log verbosity |\n| `ASHFORGE_API_KEY` | `api_key` | — | API key for IDE injection |\n| `ASHFORGE_LLAMA_TAG` | — | Built-in | llama.cpp release tag |\n\n---\n\n## Build from Source\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FMMMchou\u002Fashforge.git\ncd ashforge\nmake build-linux    # or build-darwin, build-windows\n```\n\nRequires Go 1.22+.\n\n---\n\n## Requirements\n\n- **GPU**: NVIDIA CUDA 12.4+ or Apple Silicon Metal\n- **OS**: Linux, Windows 10\u002F11, macOS 12+\n- **RAM**: 8 GB+ (16 GB+ for 30B+ models)\n- **Format**: GGUF\n\nCPU-only mode is supported but not the focus.\n\n---\n\n\u003Ca name=\"中文\">\u003C\u002Fa>\n\n## 中文\n\n### 为什么选 Ashforge?\n\n**Ollama 和 LM Studio 让模型能跑。Ashforge 让模型跑得快。**\n\n同样的模型、同样的显卡，Ashforge 通常能多给你 **2-4 倍的上下文**和 **20-40% 的速度提升**。\n\n原因很简单：其他工具用默认参数。Ashforge **实测你的硬件**，二分搜索最优配置，结果缓存后第二次启动只要 2 秒。\n\n```bash\nashforge run Qwen3-30B-A3B\n# 自动完成：硬件探测 → 模型分析 → VRAM 预测 → 参数调优 → 启动服务\n```\n\n### 核心优势\n\n**自动调参** — 不是套模板，是真跑 benchmark\n- 二分搜索最大可用上下文（4K 精度）\n- ubatch 对比测试（128 vs 512）\n- 三档模式可选：速度优先 \u002F 均衡 \u002F 上下文优先\n- 结果缓存 30 天，下次启动 2 秒\n\n**VRAM 预测** — 基于 oobabooga 公式（19,517 次实测，中位误差 365 MiB）\n- 启动前就知道能开多大上下文\n- 不会 OOM 崩溃再重试\n\n**MoE 智能卸载** — attention 留 GPU，expert 走 CPU\n- 自动计算最优切分比例\n- 30B-A3B 这类 MoE 模型速度翻倍\n\n**KV Cache 自动选型** — f16 → q8_0+q4_0 → iso3 → q4_0\n- 根据剩余 VRAM 自动选最快的类型\n- Ollama 只会用默认 q8_0\n\n**智能代理层**\n- 重复检测：n-gram + 模式识别双引擎，自动停止死循环\n- 上下文压缩：75% 满时自动压缩历史，零延迟（纯算法，不调模型）\n- 投机解码：MTP 模型自动启用 3-token 预测，其他模型用 n-gram lookup\n\n**一键接入 IDE**\n```bash\nashforge inject   # 自动配置 Cursor \u002F Codex CLI \u002F Claude Code\n```\n\n### 实时监控\n\n```\n速度         显存           内存        
GPU      温度\n26.3 tok\u002Fs  18.2\u002F24.0 GB  12.1\u002F64 GB   95%     67°C\n[========..] [========..] [==........] [=====.] [======....]\n\n上下文  [================....] 52.1K \u002F 64.0K  余 11.9K  压缩 2次 · 省 8.3K\n```\n\n### 安装\n\n```bash\n# Linux \u002F macOS\ncurl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002FMMMchou\u002Fashforge\u002Fmain\u002Finstall.sh | sh\n\n# Windows\nirm https:\u002F\u002Fraw.githubusercontent.com\u002FMMMchou\u002Fashforge\u002Fmain\u002Finstall.ps1 | iex\n```\n\n### 快速开始\n\n```bash\nashforge run Qwen3-30B-A3B    # 运行模型（自动下载）\nashforge run .\u002Fmodel.gguf     # 运行本地 GGUF 文件\nashforge list                  # 查看支持的模型\nashforge list --installed      # 只看已下载的\nashforge probe                 # 查看硬件信息\nashforge inject                # 配置 IDE\nashforge stop                  # 停止服务\n```\n\nAPI 地址：`http:\u002F\u002Flocalhost:21435\u002Fv1`，兼容所有 OpenAI 格式的工具。\n\n---\n\n\u003Cdiv align=\"center\">\n\nBuilt on [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp) · by [Ashan](https:\u002F\u002Fgithub.com\u002FMMMchou)\n\nIf Ashforge saves you time, consider giving it a star.\n\n\u003C\u002Fdiv>\n","Ashforge 是一个通过一条命令即可启动本地大语言模型（LLM）的工具，它能够自动检测硬件、匹配GGUF模型、调整KV缓存、探测上下文长度以及warmup调参，并提供与OpenAI兼容的API接口。其核心技术特点包括基于实际硬件性能进行自动优化配置，如选择最佳KV缓存类型、上下文大小和ubatch大小等，从而实现最大化的运行效率。适用于需要在本地快速部署并高效运行大型语言模型的场景，特别是对于那些希望充分利用自身硬件资源以获得更佳性能的研究人员或开发者而言非常有用。",2,"2026-06-11 04:06:12","CREATED_QUERY"]