[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-77213":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":38,"readmeContent":39,"aiSummary":40,"trendingCount":15,"starSnapshotCount":15,"syncStatus":41,"lastSyncTime":42,"discoverSource":43},77213,"can-i-finetune-this","DaoyuanLi2816\u002Fcan-i-finetune-this","DaoyuanLi2816","Estimate whether a Hugging Face model fits and fine-tunes on your local GPU.","https:\u002F\u002Fpypi.org\u002Fproject\u002Fcanifinetune\u002F",null,"Python",429,63,51,0,20,116,399,96,5.42,"MIT License",false,"main",true,[26,27,28,29,30,31,32,33,34,35,36,37],"bitsandbytes","fine-tuning","gpu","hugging-face","llm","lora","memory-estimation","peft","pytorch","qlora","transformers","vram","2026-06-12 02:03:42","# can-i-finetune-this\n\n[![CI](https:\u002F\u002Fgithub.com\u002FDaoyuanLi2816\u002Fcan-i-finetune-this\u002Factions\u002Fworkflows\u002Fci.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002FDaoyuanLi2816\u002Fcan-i-finetune-this\u002Factions\u002Fworkflows\u002Fci.yml)\n[![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fcanifinetune.svg)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fcanifinetune\u002F)\n[![Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10%2B-blue)](https:\u002F\u002Fwww.python.org)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-blue)](LICENSE)\n\n**Estimate, benchmark, and generate fine-tuning recipes for LLMs on consumer GPUs.**\n\n![can-i-finetune-this 
architecture](docs\u002Farchitecture.png)\n\nYou have one consumer-grade NVIDIA GPU. You want to fine-tune an open-weight LLM\nwith LoRA or QLoRA, but you do not want to download 14 GB of weights just to\ndiscover that your 12 GB \u002F 16 GB \u002F 24 GB card OOMs on step 1.\n\n`canifinetune` answers, before you spend the disk and the time:\n\n1. Can I fine-tune this model?\n2. About how much VRAM will it use?\n3. What batch size \u002F sequence length \u002F LoRA rank \u002F quantization should I use?\n4. If I can't, how should I downsize?\n5. Is there local benchmark evidence for that answer?\n6. Can I get a ready-to-run Hugging Face + PEFT + TRL training script for that config?\n\nIt is a single Python package with a CLI:\n\n```bash\ncanifinetune doctor\ncanifinetune estimate --model Qwen\u002FQwen2.5-1.5B-Instruct --method qlora --gpu-vram-gb 16 --seq-len 2048 --micro-batch-size 1 --lora-rank 16\ncanifinetune recommend --model Qwen\u002FQwen2.5-1.5B-Instruct --gpu-vram-gb 16\ncanifinetune bench    --model sshleifer\u002Ftiny-gpt2 --method lora --steps 3\ncanifinetune calibrate --benchmarks benchmarks\u002Fresults\ncanifinetune recipe   --model Qwen\u002FQwen2.5-1.5B-Instruct --method qlora --output recipes\u002Fqwen2.5-1.5b-qlora-4080\ncanifinetune report   --benchmarks benchmarks\u002Fresults --out report.md\ncanifinetune compare  --benchmarks benchmarks\u002Fresults --out compare.md\n```\n\nWhat `canifinetune estimate` actually prints:\n\n```text\n+-------- Qwen\u002FQwen2.5-1.5B-Instruct  (qlora) --------+\n| feasible: YES    ratio = 0.20    confidence = medium |\n+------------------------------------------------------+\n       Memory breakdown (GB)\n+---------------------------------+\n| Component             |   Value |\n|-----------------------+---------|\n| static model          |   0.737 |\n| quantization overhead |   0.018 |\n| trainable params      |  4.4 MB |\n| gradients             |   0.008 |\n| optimizer states      |   0.010 |\n| activations    
       |   0.328 |\n| CUDA \u002F fragmentation  |   1.280 |\n| safety margin         |   0.800 |\n| total                 |   3.163 |\n+---------------------------------+\n```\n\nStatic estimate says 3.16 GB; on a real RTX 4080 the same config measures\n7.10 GB (heavy bitsandbytes unpacking buffers at seq_len=2048). `canifinetune\nbench` and `canifinetune calibrate` close that gap on your machine —\nthat is the *point* of the project.\n\n---\n\n## Install\n\n`canifinetune` runs in two layers:\n\n| Layer | Install | What you get |\n| --- | --- | --- |\n| Core (estimate \u002F recommend \u002F recipe \u002F report) | `pip install canifinetune` | All CLI commands. No PyTorch required. |\n| Training (bench \u002F real fine-tuning) | `pip install canifinetune[train]` | Adds `torch`, `transformers`, `peft`, `bitsandbytes`, `trl`, `datasets`. |\n| Reporting extras | `pip install canifinetune[report]` | Pandas\u002Ftabulate for prettier tables. |\n| Development | `pip install canifinetune[dev]` | pytest, ruff, mypy. |\n\nIf you use `uv`:\n\n```bash\nuv venv\nuv pip install -e \".[dev,report]\"\n# Add training deps when you want to run benchmarks:\nuv pip install -e \".[dev,train,report]\"\n```\n\nPyTorch should generally be installed with the CUDA wheel that matches your driver,\ne.g.\n\n```bash\nuv pip install torch --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu121\n```\n\nSee `docs\u002Ftroubleshooting.md` for Windows \u002F WSL \u002F bitsandbytes specifics.\n\n---\n\n## Quickstart\n\n```bash\n# 1. See what your machine looks like\ncanifinetune doctor\n\n# 2. Ask if a model fits on your card\ncanifinetune estimate \\\n  --model Qwen\u002FQwen2.5-1.5B-Instruct \\\n  --method qlora \\\n  --gpu-vram-gb 16 \\\n  --seq-len 2048 \\\n  --micro-batch-size 1 \\\n  --lora-rank 16\n\n# 3. Have it search for a feasible config\ncanifinetune recommend --model Qwen\u002FQwen2.5-1.5B-Instruct --gpu-vram-gb 16\n\n# 4. 
Run a tiny real benchmark (downloads sshleifer\u002Ftiny-gpt2, ~5 MB)\ncanifinetune bench --model sshleifer\u002Ftiny-gpt2 --method lora --steps 3\n\n# 5. Generate a ready-to-run training recipe\ncanifinetune recipe \\\n  --model Qwen\u002FQwen2.5-1.5B-Instruct \\\n  --method qlora \\\n  --seq-len 2048 \\\n  --output recipes\u002Fqwen2.5-1.5b-qlora-4080\n```\n\n---\n\n## What's different from `accelerate estimate-memory`?\n\n`accelerate estimate-memory` tells you how much memory **loading** a model takes.\nThat is not enough to know whether you can **train** it.\n\nThis project tries to answer the harder question. It models:\n\n- Model weights, in fp32 \u002F fp16 \u002F bf16 \u002F int8 \u002F NF4 + double-quant\n- LoRA \u002F QLoRA trainable parameter count for typical `target_modules`\n- Gradients only for trainable parameters\n- AdamW vs 8-bit \u002F paged AdamW optimizer states\n- Activations as a function of `seq_len`, `batch_size`, `hidden_size`, `num_layers`,\n  with and without gradient checkpointing\n- A fragmentation \u002F CUDA \u002F buffer safety margin\n- A feasibility decision against your actual GPU\n- Concrete degradation suggestions when not feasible\n\nEstimates are **always** marked with an `assumptions` block and a `confidence`\nlevel, because activation memory in particular is hard to predict statically.\nRun `canifinetune bench` and `canifinetune calibrate` to ground them in real\nmeasurements on your machine.\n\n---\n\n## RTX 4080 baselines\n\n`docs\u002Frtx4080_baselines.md` contains real measurements collected on a single\nRTX 4080 (16 GB). These are not synthetic. 
If a configuration was not run, the\ntable says \"not run\", not a guessed number.\n\nHighlights (more in the doc):\n\n| model | method | seq_len | measured peak | tok\u002Fsec |\n| --- | --- | --- | --- | --- |\n| `Qwen\u002FQwen2.5-0.5B-Instruct` | qlora | 1024 | 3.30 GB | 1995 |\n| `Qwen\u002FQwen2.5-1.5B-Instruct` | qlora | 1024 | 4.36 GB | 1352 |\n| `Qwen\u002FQwen2.5-1.5B-Instruct` | qlora | 2048 | 7.10 GB | 1470 |\n| `Qwen\u002FQwen2.5-3B-Instruct` | qlora | 1024 | 5.54 GB | 1158 |\n| `sshleifer\u002Ftiny-gpt2` (smoke) | lora | 128 | 0.12 GB | 1735 |\n\n---\n\n## Repository layout\n\n```\nsrc\u002Fcanifinetune\u002F        # package code (estimator, bench, recipes, reports, cli)\nbenchmarks\u002F              # configs\u002F, results\u002F (JSON), calibration\u002F\ndocs\u002F                    # design, memory model, troubleshooting\nexamples\u002F                # end-to-end recipe folders\ntests\u002F                   # pytest tests (CPU-only, no large downloads)\nscripts\u002F                 # helper scripts for collecting baselines\n.github\u002Fworkflows\u002F       # CI (ruff + pytest on CPU)\n```\n\n---\n\n## Roadmap\n\nThe current scope is \"single consumer GPU, single node, LoRA \u002F QLoRA, causal LM,\nHugging Face stack\". Possible directions, none committed:\n\n- DeepSpeed ZeRO and FSDP estimation for multi-GPU setups\n- Heuristics for sequence-classification \u002F encoder-decoder training\n- Throughput modeling (tokens \u002F sec), not just feasibility\n- Auto-tuning of `gradient_accumulation_steps` for a target effective batch size\n- A web UI on top of the CLI\n\nContributions welcome.\n\n---\n\n## License\n\nMIT. See `LICENSE`.\n","`can-i-finetune-this` is a tool for estimating whether a Hugging Face model can be fine-tuned on a local GPU. Its core features include estimating the VRAM a model needs, recommending suitable settings such as batch size\u002Fsequence length\u002FLoRA rank\u002Fquantization method, and generating ready-to-run training scripts. The tool is written in Python and supports fine-tuning techniques such as LoRA and QLoRA, along with the PEFT library. It is aimed at researchers and developers with consumer-grade NVIDIA GPUs: running it before customizing a large language model helps avoid jobs that fail for lack of memory, saving time and resources.",2,"2026-06-11 03:55:10","CREATED_QUERY"]