[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83015":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":12,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":12,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":14,"starSnapshotCount":14,"syncStatus":12,"lastSyncTime":26,"discoverSource":27},83015,"edge-lm","TheStageAI\u002Fedge-lm","TheStageAI","Tiny llms optimised for edge deployment ",null,"Python",89,2,64,0,10,18,48.23,"MIT License",false,"main",true,[],"2026-06-12 04:01:39","# edge-lm\n\n![Gemma E2B compression flow: 9.26 GB BF16 compressed to 1.44 GB — 6.4× smaller](https:\u002F\u002Fcdn.thestage.ai\u002Fproduction\u002Fcms_file_upload\u002F1780406294-645b80f9-cebe-4ef2-bc04-f524afb4f244\u002FTokens%20per%20Second%20CuDNN%20%282%29.png)\n\n**Tiny LLMs optimized for edge deployment.**\n\n`edge-lm` runs compressed large language models on-device — Apple Silicon Macs and iPhones — through [MLX](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx). 
The first release ships the **smallest publicly available Gemma 4 checkpoints optimized for edge deployment** — roughly **7× smaller** than the original while preserving the capabilities that matter most for on-device assistants: general world knowledge, instruction following, and tool use.\n\n\n> 📝 Read the full write-up: [*7× size reduction for Gemma 4 Edge models — Compressing PLE architectures*](https:\u002F\u002Fapp.thestage.ai\u002Fblog\u002F7x-size-reduction-for-Gemma4-Edge-models?id=14).\n\n## Models\n\n| Model | M size (default) | L size | Compression |\n|---|---|---|---|\n| [`TheStageAI\u002Fgemma-4-E2B-it`](https:\u002F\u002Fhuggingface.co\u002FTheStageAI\u002Fgemma-4-E2B-it) | **1.44 GB** | 1.72 GB | up to 6.4× |\n| [`TheStageAI\u002Fgemma-4-E4B-it`](https:\u002F\u002Fhuggingface.co\u002FTheStageAI\u002Fgemma-4-E4B-it) | **2.72 GB** | 3.28 GB | up to 5.6× |\n\nWeights download automatically from HuggingFace on first run. Each model ships two operating points — `l` (more quality, larger artifact) and `m` (the smaller headline compression target, default).\n\n## Key features\n\n- **~7× smaller checkpoints.** The default Gemma 4 E2B checkpoint fits in 1.44 GB, and E4B fits in 2.72 GB — small enough to download quickly and stay within mobile per-app memory budgets.\n- **Accuracy preserved where it counts.** Quality is held on the three things that matter most for edge assistants — instruction following (IFEval), tool calls (τ²-Bench), and general world knowledge (MMLU-Pro).\n- **MLX-ready artifacts.** Decoder weights use a flat, MLX-compatible per-group quantization format; PLE tables use a compact AQLM-style vector-quantization codec (4.7 GB → ~0.26 GB), decompressed on the fly with a single batched gather.\n\n## Quick start\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FTheStageAI\u002Fedge-lm.git\ncd edge-lm\n\npython -m venv .venv && source .venv\u002Fbin\u002Factivate\npip install -r requirements.txt        # or: pip install -e .\n```\n\nRun 
text generation (downloads `TheStageAI\u002Fgemma-4-E2B-it` on first run):\n\n```bash\npython examples\u002Fgeneration_test.py --prompts \"What is 2+2?\" \"Explain gravity in one sentence\"\n```\n\nUse it from Python:\n\n```python\nfrom edge_lm import load\nfrom mlx_vlm import stream_generate\n\nmodel, tokenizer = load()  # TheStageAI\u002Fgemma-4-E2B-it, size \"m\" by default\n# model, tokenizer = load(\"TheStageAI\u002Fgemma-4-E4B-it\", size=\"l\")  # larger, higher quality\n\nprompt = tokenizer.apply_chat_template(\n    [{\"role\": \"user\", \"content\": \"Write a haiku about the moon.\"}],\n    tokenize=False, add_generation_prompt=True,\n)\nfor chunk in stream_generate(model, tokenizer, prompt, max_tokens=128):\n    print(chunk.text, end=\"\", flush=True)\n```\n\nMore examples:\n\n```bash\npython examples\u002Ftest_vision.py --image photo.jpg --prompt \"Describe this image\"\npython examples\u002Ftest_audio.py  --audio recording.wav --prompt \"Transcribe this speech\"\npython examples\u002Fchat.py --tools                      # interactive chat with tool use\n```\n\n## Benchmarks\n\n### Quality\n\nEvery model — ours and the GGUF baselines alike — is dequantized to a standard BF16 checkpoint and served through vLLM, so the backend is equalized across the table. We report **MMLU-Pro** (general knowledge), **IFEval** (instruction following), and **τ²-Bench \u002F Tau2** (multi-step tool use). 
For Tau2 the Gemma checkpoint under test acts as the agent while a fixed `Qwen3-235B-A22B-2507` simulates the user.\n\n`Ours L` keeps more quality at a larger artifact size; `Ours M` is the smaller headline compression target.\n\n**Gemma 4 E2B**\n\n| Model | Compression | MMLU-Pro | IFEval | Tau2 (avg of 3) |\n|---|---|---|---|---|\n| BF16 | 1.00× | 61.85 | 74.68 | 30.67 |\n| **Ours L** | 5.62× | **54.48** | **74.86** | 22.20 |\n| **Ours M** | **6.40×** | 49.85 | 71.53 | **23.45** |\n| Unsloth Q3-K-S | 3.81× | 48.20 | 64.51 | 18.69 |\n| Unsloth UD-Q2-K-XL | 3.87× | 43.17 | 66.54 | 20.23 |\n\n**Gemma 4 E4B**\n\n| Model | Compression | MMLU-Pro | IFEval | Tau2 |\n|---|---|---|---|---|\n| BF16 | 1.00× | 70.49 | 81.33 | 37.19 |\n| **Ours L** | 4.64× | **67.41** | **81.52** | **33.25** |\n| **Ours M** | **5.60×** | 63.54 | 80.78 | 29.04 |\n| Unsloth Q3-K-S | 3.90× | 63.66 | 77.08 | 30.47 |\n| Unsloth UD-Q2-K-XL | 4.01× | 58.69 | 79.67 | 22.91 |\n\nBold metric values mark the best result among the compressed checkpoints in each column. Tau2 computed with `Qwen3-235B-A22B-2507` as the user simulator.\n\nReproduce the quality benchmarks:\n\n```bash\npip install \"edge-lm[quality]\"   # CUDA\u002FvLLM quality benchmark dependencies\n\npython benchmarks\u002Fquality\u002Fverify_release.py \\\n    --work-dir runs\u002Frelease_verify \\\n    run \\\n    --models e2b_ours_m,e2b_unsloth_q3_k_s \\\n    --benchmarks mmlu_pro,ifeval\n```\n\nThe frozen production protocols live in [`benchmarks\u002Fquality`](benchmarks\u002Fquality\u002F).\n\n### Performance\n\nMeasured on an **Apple M3 Max (69 GB)**, size `m` checkpoint, 1024 input \u002F 1024 output tokens,\nchunked prefill (256-token chunks), best of 5 runs. 
`TTFT` = prefill + first token;\n`TPS` = steady-state decode throughput; `MLX peak memory` = `mx.get_peak_memory()` (MLX Metal allocator).\nReferences are the matching original `google\u002Fgemma-4-*-it` checkpoints served via mlx-vlm, in bf16\nand in 4-bit quantized form (affine, group size 32).\n\n**Gemma 4 E2B**\n\n| Model | TTFT | Decode (TPS) | MLX peak memory |\n|---|---|---|---|\n| **TheStage (ours)** | **434 ms** | **115.0** | **2.1 GB** |\n| Reference bf16 | 531 ms | 57.2 | 10.7 GB |\n| Reference 4-bit (gs32) | 595 ms | 83.3 | 4.6 GB |\n\n**Gemma 4 E4B**\n\n| Model | TTFT | Decode (TPS) | MLX peak memory |\n|---|---|---|---|\n| **TheStage (ours)** | **832 ms** | **73.7** | **3.5 GB** |\n| Reference bf16 | 1110 ms | 30.5 | 16.4 GB |\n| Reference 4-bit (gs32) | 970 ms | 53.5 | 7.1 GB |\n\nReproduce:\n\n```bash\npython benchmarks\u002Fperformance.py --model TheStageAI\u002Fgemma-4-E2B-it \\\n    --hf-model google\u002Fgemma-4-E2B-it \\\n    --input-tokens 1024 --output-tokens 1024 --prefill-step-size 256 \\\n    --compare-ref --compare-ref-4bit --ref-4bit-group-size 32\n```\n\n## License\n\nReleased under the [MIT License](LICENSE), © 2026 thestage.ai labs.\n\nThe compressed model weights are derivatives of Google's Gemma 4 and are additionally subject to the [Gemma Terms of Use](https:\u002F\u002Fai.google.dev\u002Fgemma\u002Fterms).\n","edge-lm is a project providing tiny large language models optimized for edge devices. Its core function is running compressed large language models through MLX on devices such as Apple Silicon Macs and iPhones. The first release provides the smallest publicly available Gemma 4 checkpoints, roughly 7× smaller than the originals, while preserving key capabilities such as general world knowledge, instruction following, and tool use. It suits applications that need to deploy language models efficiently on mobile devices or in resource-constrained environments.","2026-06-11 04:09:53","CREATED_QUERY"]