[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-77160":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":9,"totalLinesOfCode":9,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":9,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":39,"readmeContent":40,"aiSummary":41,"trendingCount":16,"starSnapshotCount":16,"syncStatus":42,"lastSyncTime":43,"discoverSource":44},77160,"whichllm","Andyyyy64\u002Fwhichllm","Andyyyy64","Find the local LLM that actually runs and performs best on your hardware. Ranked by real, recency-aware benchmarks, not parameter count. One command, run it instantly.",null,"https:\u002F\u002Fgithub.com\u002FAndyyyy64\u002Fwhichllm","Python",4416,238,16,17,0,1510,2158,3280,4530,106.14,false,"main",[25,26,27,28,29,30,31,32,33,34,35,36,37,38],"ai","cli","llm","local-llm","command-line-tool","gguf","gpu","huggingface","inference","ollama","python","vram","apple-silicon","benchmarks","2026-06-12 04:01:21","# whichllm\n\n[![PyPI version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fwhichllm)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fwhichllm\u002F)\n[![Python 3.11+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.11+-blue.svg)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F)\n[![License: 
MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-yellow.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FMIT)\n[![Tests](https:\u002F\u002Fgithub.com\u002FAndyyyy64\u002Fwhichllm\u002Factions\u002Fworkflows\u002Ftest.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002FAndyyyy64\u002Fwhichllm\u002Factions\u002Fworkflows\u002Ftest.yml)\n[![Sponsor](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSponsor-GitHub%20Sponsors-EA4AAA?logo=githubsponsors)](https:\u002F\u002Fgithub.com\u002Fsponsors\u002FAndyyyy64)\n\n**Find the best local LLM that actually runs on your hardware.**\n\nAuto-detects your GPU\u002FCPU\u002FRAM and ranks the top models from HuggingFace that fit your system.\n\n[日本語版はこちら](docs\u002FREADME.ja.md)\n\n![demo](assets\u002Fdemo.gif)\n\n## See it\n\n```text\n$ whichllm --gpu \"RTX 4090\"\n\n#1  Qwen\u002FQwen3.6-27B     27.8B  Q5_K_M   score 92.8    27 t\u002Fs\n#2  Qwen\u002FQwen3-32B       32.0B  Q4_K_M   score 83.0    31 t\u002Fs\n#3  Qwen\u002FQwen3-30B-A3B   30.0B  Q5_K_M   score 82.7   102 t\u002Fs\n```\n\nThe 32B model **fits your card fine** — whichllm still ranks the 27B #1,\nbecause it scores higher on real benchmarks and is a newer generation.\nA size-only \"what fits?\" tool would hand you the bigger one. That gap is\nthe whole point of whichllm. 
(Note #3: a MoE model at 102 t\u002Fs — speed is\nranked on *active* params, quality on *total*.)\n\n### What can I run?\n\nReal top picks (snapshot 2026-05 — your results track **live** HuggingFace\ndata, this is not a static list):\n\n| Hardware | VRAM | Top pick | Speed |\n|---|---|---|---|\n| RTX 5090 | 32 GB | `Qwen3.6-27B` · Q6_K · score 94.7 | ~40 t\u002Fs |\n| RTX 4090 \u002F 3090 | 24 GB | `Qwen3.6-27B` · Q5_K_M · score 92.8 | ~27 t\u002Fs |\n| RTX 4060 | 8 GB | `Qwen3-14B` · Q3_K_M · score 71.0 | ~22 t\u002Fs |\n| Apple M3 Max | 36 GB | `Qwen3.6-27B` · Q5_K_M · score 89.4 | ~9 t\u002Fs |\n| CPU only | — | `gpt-oss-20b` (MoE) · Q4_K_M · score 45.2 | ~6 t\u002Fs |\n\n`whichllm --gpu \"\u003Cyour card>\"` to simulate any of these before you buy.\n\n> Useful? A GitHub star helps other people find it — and I'd genuinely like\n> to know what it picked for your rig: drop it in [Issues](https:\u002F\u002Fgithub.com\u002FAndyyyy64\u002Fwhichllm\u002Fissues).\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=Andyyyy64\u002Fwhichllm&type=Date)](https:\u002F\u002Fwww.star-history.com\u002F#Andyyyy64\u002Fwhichllm&Date)\n\n## Why whichllm?\n\nFitting a model into your VRAM is the easy part. The hard part is knowing\n**which of the models that fit is actually the best** — and that is what\nwhichllm is built to get right.\n\n- **Evidence-based ranking, not a size heuristic** — The top pick is\n  chosen from merged real benchmarks (LiveBench, Artificial Analysis,\n  Aider, multimodal\u002Fvision, Chatbot Arena ELO, Open LLM Leaderboard) —\n  never \"the biggest model that happens to fit.\"\n- **Recency-aware** — Stale leaderboards are demoted along each model's\n  lineage, so a 2024 model can't outrank a current-generation one on an\n  outdated score. 
The benchmark snapshot date is printed under every\n  ranking, so a stale recommendation is self-evident instead of silently\n  trusted.\n- **Evidence-graded and guarded** — Every score is tagged\n  `direct` \u002F `variant` \u002F `base` \u002F `interpolated` \u002F `self-reported` and\n  discounted by confidence. Fabricated uploader claims and cross-family\n  inheritance (a small fork borrowing its much larger base's score) are\n  actively rejected.\n- **Architecture-aware estimates** — VRAM = weights + GQA KV cache +\n  activation + overhead; speed is bandwidth-bound with per-quant\n  efficiency, per-backend factors, MoE active-vs-total split, and\n  unified-memory vs discrete-PCIe partial-offload modeling.\n- **One command, scriptable** — `whichllm` prints the answer; add\n  `--json | jq` for pipelines. No TUI, no keybindings to memorize.\n- **Live data** — Models fetched directly from the HuggingFace API, with\n  curated frozen fallbacks for offline or rate-limited use.\n\n## Features\n\n- **Auto-detect hardware** — NVIDIA, AMD, Apple Silicon, CPU-only\n- **Smart ranking** — Scores models by VRAM fit, speed, and benchmark quality\n- **One-command chat** — `whichllm run` downloads and starts a chat session instantly\n- **Code snippets** — `whichllm snippet` prints ready-to-run Python for any model\n- **Live data** — Fetches models directly from HuggingFace (cached for performance)\n- **Benchmark-aware** — Integrates real eval scores with confidence-based dampening\n- **Task profiles** — Filter by general, coding, vision, or math use cases\n- **GPU simulation** — Test with any GPU: `whichllm --gpu \"RTX 4090\"`\n- **Hardware planning** — Reverse lookup: `whichllm plan \"llama 3 70b\"`\n- **Upgrade planning** — Compare your current machine with candidate GPUs\n- **JSON output** — Pipe-friendly: `whichllm --json`\n\n## Run & Snippet\n\n**Try any model with a single command.** No manual installs needed — whichllm creates an isolated environment via `uv`, installs 
dependencies, downloads the model, and starts an interactive chat.\n\n![run demo](assets\u002Fdemo-run.gif)\n\n```bash\n# Chat with a model (auto-picks the best GGUF variant)\nwhichllm run \"qwen 2.5 1.5b gguf\"\n\n# Auto-pick the best model for your hardware and chat\nwhichllm run\n\n# CPU-only mode\nwhichllm run \"phi 3 mini gguf\" --cpu-only\n```\n\nWorks with **all model formats**:\n- **GGUF** — via `llama-cpp-python` (lightweight, fast)\n- **AWQ \u002F GPTQ** — via `transformers` + `autoawq` \u002F `auto-gptq`\n- **FP16 \u002F BF16** — via `transformers`\n\nGet a **copy-paste Python snippet** instead:\n\n```bash\nwhichllm snippet \"qwen 7b\"\n```\n\n```python\nfrom llama_cpp import Llama\n\nllm = Llama.from_pretrained(\n    repo_id=\"Qwen\u002FQwen2.5-7B-Instruct-GGUF\",\n    filename=\"qwen2.5-7b-instruct-q4_k_m.gguf\",\n    n_ctx=4096,\n    n_gpu_layers=-1,\n    verbose=False,\n)\n\noutput = llm.create_chat_completion(\n    messages=[{\"role\": \"user\", \"content\": \"Hello!\"}],\n)\nprint(output[\"choices\"][0][\"message\"][\"content\"])\n```\n\n## Install\n\n### uv (recommended)\n\n```bash\nuvx whichllm\n```\n\nTo install permanently:\n\n```bash\nuv tool install whichllm\n```\n\n### Homebrew\n\n```bash\nbrew install andyyyy64\u002Fwhichllm\u002Fwhichllm\n```\n\n### pip\n\n```bash\npip install whichllm\n```\n\n### Development\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FAndyyyy64\u002Fwhichllm.git\ncd whichllm\nuv sync --dev\nuv run whichllm\nuv run pytest\n```\n\n## Usage\n\n```bash\n# Auto-detect hardware and show best models\nwhichllm\n\n# Simulate a GPU (e.g. 
planning a purchase)\nwhichllm --gpu \"RTX 4090\"\nwhichllm --gpu \"RTX 5090\"\n# Specify variant\nwhichllm --gpu \"RTX 5060 16\"\n\n\n# CPU-only mode\nwhichllm --cpu-only\n\n# More results \u002F filters\nwhichllm --top 20\nwhichllm --quant Q4_K_M\nwhichllm --min-speed 30\nwhichllm --evidence base   # allow id\u002Fbase-model matches\nwhichllm --evidence strict # id-exact only (same as --direct)\nwhichllm --direct\n\n# JSON output\nwhichllm --json\n\n# Force refresh (ignore cache)\nwhichllm --refresh\n\n# Show hardware info only\nwhichllm hardware\n\n# Plan: what GPU do I need for a specific model?\nwhichllm plan \"llama 3 70b\"\nwhichllm plan \"Qwen2.5-72B\" --quant Q8_0\nwhichllm plan \"mistral 7b\" --context-length 32768\n\n# Upgrade: compare your current machine against candidate GPUs\nwhichllm upgrade \"RTX 4090\" \"RTX 5090\" \"H100\"\nwhichllm upgrade \"Apple M4 Max\" --top 5\n\n# Run: download and chat with a model instantly\nwhichllm run \"qwen 2.5 1.5b gguf\"\nwhichllm run                       # auto-pick best for your hardware\n\n# Snippet: print ready-to-run Python code\nwhichllm snippet \"qwen 7b\"\nwhichllm snippet \"llama 3 8b gguf\" --quant Q5_K_M\n```\n\nJSON model rows include `estimated_tok_per_sec`, `speed_confidence`,\n`speed_range_tok_per_sec`, and `speed_notes`. 
The speed range is a planning\nrange, not a live benchmark.\n\n## Integrations\n\n### Ollama\n\nUse JSON output to feed scripts that map HuggingFace IDs to your local Ollama\nmodel names:\n\n```bash\n# Pick the top HuggingFace model ID\nwhichllm --top 1 --json | jq -r '.models[0].model_id'\n\n# Find the best coding model ID\nwhichllm --profile coding --top 1 --json | jq -r '.models[0].model_id'\n```\n\nOllama model names do not always match HuggingFace repo IDs, so a small mapping\nstep is usually needed before `ollama run`.\n\n### Shell alias\n\nAdd to your `.bashrc` \u002F `.zshrc`:\n\n```bash\nalias bestllm='whichllm --top 1 --json | jq -r \".models[0].model_id\"'\n# Usage: ollama run $(bestllm)\n```\n\n## Scoring\n\nEach model gets a 0-100 score. Benchmark quality and size form the core;\nevidence confidence and runtime fit then scale it, with speed, source\ntrust, and popularity as adjustments.\n\n| Factor | Effect | Description |\n|--------|--------|-------------|\n| Benchmark quality | core | Merged LiveBench \u002F Artificial Analysis \u002F Aider \u002F Vision \u002F Arena ELO \u002F Open LLM Leaderboard, weighted by source confidence |\n| Model size | up to 35 | `log2`-scaled world-knowledge proxy (MoE uses total params) |\n| Quantization | × penalty | Lower-bit quants discounted multiplicatively |\n| Evidence confidence | ×0.55–1.0 | none \u002F self-reported ×0.55, inherited ×0.78, direct full |\n| Runtime fit | ×0.50–1.0 | partial-offload ×0.72, CPU-only ×0.50 |\n| Speed | -8 to +8 | Usability gate vs a fit-dependent tok\u002Fs floor; reported with confidence and range metadata |\n| Source trust | -5 to +5 | Official-org bonus, known-repackager penalty |\n| Popularity | tie-breaker | Downloads\u002Flikes; weight shrinks as evidence strengthens |\n\nScore markers:\n- **`~`** (yellow) — No direct benchmark; score inherited\u002Finterpolated from the model family\n- **`!sr`** (bright yellow) — Uploader-reported benchmark only, not independently 
verified\n- **`?`** (red) — No benchmark data available\n\nSpeed markers in `--status`:\n- **`~`** (yellow) — Estimated tok\u002Fs range is available\n- **`?`** (red) — Low-confidence speed estimate; backend\u002Fruntime sensitivity is high\n\n## Documentation\n\n- [CLI reference](docs\u002Fcli.md)\n- [How it works](docs\u002Fhow-it-works.md)\n- [Scoring](docs\u002Fscoring.md)\n- [Hardware detection and simulation](docs\u002Fhardware.md)\n- [Run and snippet](docs\u002Frun-snippet.md)\n- [Troubleshooting](docs\u002Ftroubleshooting.md)\n\n## How it works\n\n### Data pipeline\n\n1. **Model fetching** — Fetches popular models from HuggingFace API:\n   - Text-generation (downloads + recently updated)\n   - GGUF-filtered (separate query for coverage)\n   - Vision models (`image-text-to-text`) when `--profile vision` or `any`\n2. **Benchmark sources** — *Current tier* (LiveBench, Artificial Analysis\n   Index, Aider) merged live when reachable, plus a curated multimodal \u002F\n   vision index; *frozen tier* (Open LLM Leaderboard v2, Chatbot Arena\n   ELO). Tiers have separate caps and lineage-aware recency demotion so\n   stale leaderboards stop over-rewarding older generations.\n3. **Benchmark evidence** — Five resolution levels, increasingly discounted:\n   - `direct` — Exact model ID match\n   - `variant` — Suffix-stripped or -Instruct variant\n   - `base_model` — Base model from cardData\n   - `line_interp` — Size-aware interpolation within model family\n   - `self_reported` — Uploader-claimed eval (heavily discounted)\n\n   Inheritance is rejected when a model's params diverge more than 2× from\n   its family's dominant member, catching draft \u002F MTP \u002F abliterated forks\n   that share a `family_id` with a much larger base.\n4. **Cache** — `~\u002F.cache\u002Fwhichllm\u002F`:\n   - `models.json` — 6h TTL\n   - `benchmark.json` — 24h TTL\n\n### Ranking engine\n\n1. 
**Hardware detection** — NVIDIA (nvidia-ml-py), AMD (dbgpu\u002FROCm), Apple Silicon (Metal), CPU cores, RAM, disk\n2. **VRAM estimation** — Weights + KV cache + activation + framework overhead (~500MB)\n3. **Compatibility** — Full GPU \u002F Partial Offload \u002F CPU-only; compute capability and OS checks\n4. **Speed** — tok\u002Fs from GPU memory bandwidth, quantization, backend, fit type, and MoE active parameters\n5. **Scoring** — Benchmark (with confidence dampening), size, quantization penalty, fit type, speed, popularity, source trust (official vs repackager)\n6. **Backend filter** — Apple Silicon and CPU-only restrict to GGUF for stability; Linux+NVIDIA allows AWQ\u002FGPTQ\n\n### Project structure\n\n```\nsrc\u002Fwhichllm\u002F\n├── cli.py              # Typer CLI: main, plan, run, snippet, hardware\n├── constants.py        # GPU bandwidth, quantization bytes, compute capability\n├── hardware\u002F\n│   ├── detector.py     # Orchestrates GPU\u002FCPU\u002FRAM detection\n│   ├── nvidia.py       # NVIDIA GPU via nvidia-ml-py\n│   ├── amd.py          # AMD GPU (Linux)\n│   ├── apple.py        # Apple Silicon (Metal)\n│   ├── cpu.py          # CPU name, cores, AVX support\n│   ├── memory.py       # RAM and disk free\n│   ├── gpu_simulator.py # --gpu flag: synthetic GPU from name\n│   └── types.py        # GPUInfo, HardwareInfo\n├── models\u002F\n│   ├── fetcher.py      # HuggingFace API, model parsing, evalResults\n│   ├── benchmark.py    # Arena ELO, Leaderboard (parquet\u002Frows API)\n│   ├── grouper.py      # Family grouping by base_model and name\n│   ├── cache.py        # JSON cache with TTL\n│   └── types.py        # ModelInfo, GGUFVariant, ModelFamily\n├── engine\u002F\n│   ├── vram.py         # VRAM = weights + KV cache + activation + overhead\n│   ├── compatibility.py# Fit type, disk check, compute\u002FOS warnings\n│   ├── performance.py  # tok\u002Fs from bandwidth\n│   ├── quantization.py # Bytes per weight, quality penalty, non-GGUF 
inference\n│   ├── ranker.py       # Scoring, evidence filter, profile\u002Fmatch\n│   └── types.py        # CompatibilityResult\n└── output\u002F\n    └── display.py      # Rich table, JSON output, hardware\u002Fplan displays\n```\n\n## Contributing\n\nContributions are welcome! See [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.\n\n## Support\n\nIf whichllm helped you find a model or avoid a bad hardware guess,\nsponsoring is appreciated. It helps keep the project maintained: hardware\nreports, packaging, test fixtures, benchmark updates, and support for more\nmachines.\n\nwhichllm will stay open-source either way. Issues and PRs are always welcome.\n\n## Requirements\n\n- Python 3.11+\n- NVIDIA GPU detection via `nvidia-ml-py` (included by default)\n- AMD \u002F Apple Silicon detected automatically\n\n## License\n\nMIT\n","whichllm is a tool for finding the local large language model (LLM) best suited to your hardware. It auto-detects your GPU\u002FCPU\u002FRAM and ranks the top models from HuggingFace by real, up-to-date benchmark results rather than parameter count. The tool supports Python 3.11+ and makes use of GGUF and HuggingFace inference. It suits scenarios that require running an LLM locally, such as a personal developer workstation or a small server, and helps users quickly find the best-performing LLM that runs on their existing hardware.",2,"2026-06-11 03:55:05","trending"]