[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80797":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":14,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":15,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":23,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":29,"readmeContent":30,"aiSummary":31,"trendingCount":16,"starSnapshotCount":16,"syncStatus":32,"lastSyncTime":33,"discoverSource":34},80797,"granite-switch","generative-computing\u002Fgranite-switch","generative-computing","Granite Switch — Build AI models like you build software","https:\u002F\u002Fhuggingface.co\u002Fibm-granite\u002Fgranite-switch-4.1-3b-preview",null,"Python",76,5,4,12,0,35,38,71.63,"Apache License 2.0",false,"main",true,[7,25,26,27,28],"llm-inference","llms","transformers","vllm","2026-06-12 04:01:30","# Granite Switch — Build AI models like you build software\n\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache_2.0-blue.svg)](LICENSE)\n[![Python 
3.9+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.9+-blue.svg)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F)\n[![corelib](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdynamic\u002Fjson?url=https%3A%2F%2Fhuggingface.co%2Fapi%2Fmodels%2Fibm-granite%2Fgranitelib-core-r1.0&query=%24.downloads&label=corelib&logo=huggingface&color=yellow)](https:\u002F\u002Fhuggingface.co\u002Fibm-granite\u002Fgranitelib-core-r1.0)\n[![raglib](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdynamic\u002Fjson?url=https%3A%2F%2Fhuggingface.co%2Fapi%2Fmodels%2Fibm-granite%2Fgranitelib-rag-r1.0&query=%24.downloads&label=raglib&logo=huggingface&color=yellow)](https:\u002F\u002Fhuggingface.co\u002Fibm-granite\u002Fgranitelib-rag-r1.0)\n[![guardianlib](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdynamic\u002Fjson?url=https%3A%2F%2Fhuggingface.co%2Fapi%2Fmodels%2Fibm-granite%2Fgranitelib-guardian-r1.0&query=%24.downloads&label=guardianlib&logo=huggingface&color=yellow)](https:\u002F\u002Fhuggingface.co\u002Fibm-granite\u002Fgranitelib-guardian-r1.0)\n\n| [**Browse adapter functions**](https:\u002F\u002Fgenerative-computing.github.io\u002Fgranite-switch\u002Fadapter_catalog.html) | [Pre-composed Models on HF](https:\u002F\u002Fhuggingface.co\u002Fibm-granite\u002Fgranite-switch-4.1-8b-preview) | [Tutorials](tutorials\u002FREADME.md) |\n\nSoftware is built from libraries — you pick the ones you need, compose them, and ship. Granite Switch brings this to AI models: choose **adapter functions** for RAG, safety, factuality, and more, compose them into a single model, and deploy with one command. Swap or upgrade any component independently, just like updating a dependency.\n\nAn adapter function is a LoRA adapter trained to a specific input\u002Foutput contract — a score, a decision, a rewritten query — with the output schema [enforced at the token level by Mellea](https:\u002F\u002Fmellea.ai). 
This is what makes them composable as software: each function has a known signature, not just a general-purpose text output.\n\nSmall models with the right adapter functions consistently outperform much larger generalist models on targeted tasks. **Activated LoRA (aLoRA)** makes this practical at scale: all adapter functions share one KV cache, activating on demand — so one deployment serves many capabilities with no memory or latency overhead.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fmedia\u002Fbenchmark_animation.svg\" alt=\"Granite Switch: adapters stack, accuracy improves\" width=\"820\">\n\u003C\u002Fp>\n\n## Key Features\n\n- **Composable** — Combine independently developed adapter functions into one checkpoint, whether IBM's or yours. Swap, upgrade, or customize without retraining.\n- **Fast** — Built on IBM's Activated LoRA technology for efficient KV cache reuse, low latency, and [high inference throughput](https:\u002F\u002Fgenerative-computing.github.io\u002Fgranite-switch\u002Frace_live.html).\n- **Accurate** — Task-specific adapter functions can match and even surpass the accuracy of significantly larger generalist models, while requiring only a fraction of the serving cost. See the [adapter function catalog](https:\u002F\u002Fgenerative-computing.github.io\u002Fgranite-switch\u002Fadapter_catalog.html#hallucination-detection) for benchmark comparisons across all 12 adapter functions.\n- **Inference-ready** — Deploy with vLLM for production or HuggingFace for prototyping. 
Same checkpoint, no conversion step.\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgenerative-computing.github.io\u002Fgranite-switch\u002Frace_live.html\">\n    \u003Cimg src=\"docs\u002Fmedia\u002Falora_vs_lora_race.png\" alt=\"aLoRA vs LoRA live race telemetry — aLoRA at 5\u002F16 queries done with 74% KV hit rate while LoRA is at 1\u002F16 with 29%\" width=\"820\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cem>Live race telemetry: aLoRA (74% KV cache hit rate, 5\u002F16 finished) vs LoRA (29% KV hit rate, 1\u002F16 finished) — same model, same hardware, different adapter technology.\u003C\u002Fem>\u003Cbr>\n\u003Ca href=\"https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fgenerative-computing\u002Fgranite-switch\u002Fblob\u002Fmain\u002Ftutorials\u002Fnotebooks\u002Falora_vs_lora_race.ipynb\">Reproduce it yourself on Colab →\u003C\u002Fa>\u003C\u002Fp>\n\n## Quick Start\n\n### Install\n\n```bash\npip install \"granite-switch[vllm]\"\n```\n\nOther install options depending on your use case:\n\n```bash\npip install \"granite-switch[compose]\"   # Compose modular models\npip install \"granite-switch[hf]\"        # HuggingFace inference\npip install \"granite-switch[vllm20]\"    # vLLM 0.20+ (requires CUDA 13+)\npip install \"granite-switch[dev]\"       # Everything\n```\n\nRequires Python 3.9+ and PyTorch 2.0+. 
Two vLLM backends are available: `granite-switch[vllm]` (vLLM 0.19.x) for broad CUDA 12.x compatibility, and `granite-switch[vllm20]` (vLLM 0.20+) for the latest performance improvements (requires CUDA 13+).\n\n### Compose a Model\n\nCompose a base Granite model with adapter libraries into a single deployable checkpoint:\n\n```bash\npython -m granite_switch.composer.compose_granite_switch \\\n  --base-model ibm-granite\u002Fgranite-4.1-3b \\\n  --adapters ibm-granite\u002Fgranitelib-core-r1.0 ibm-granite\u002Fgranitelib-rag-r1.0 ibm-granite\u002Fgranitelib-guardian-r1.0 \\\n  --output .\u002Fmy-model\n```\n\nUse the [adapter function composer](https:\u002F\u002Fgenerative-computing.github.io\u002Fgranite-switch\u002Fadapter_catalog.html) to browse available adapter functions, compare benchmarks, and generate a ready-to-run compose command.\n\nThis downloads the base model, embeds compatible LoRA adapters (with a preference towards activated LoRA), adds control tokens and a chat template, and produces a model directory that works with both HuggingFace and vLLM.\n\n**Or skip composition and use a pre-composed model:**\n\n- [ibm-granite\u002Fgranite-switch-4.1-3b-preview](https:\u002F\u002Fhuggingface.co\u002Fibm-granite\u002Fgranite-switch-4.1-3b-preview)\n- [ibm-granite\u002Fgranite-switch-4.1-8b-preview](https:\u002F\u002Fhuggingface.co\u002Fibm-granite\u002Fgranite-switch-4.1-8b-preview)\n- [ibm-granite\u002Fgranite-switch-4.1-30b-preview](https:\u002F\u002Fhuggingface.co\u002Fibm-granite\u002Fgranite-switch-4.1-30b-preview)\n\n### Run Inference\n\n```bash\npip install mellea\npython -m vllm.entrypoints.openai.api_server --model ibm-granite\u002Fgranite-switch-4.1-3b-preview --port 8000\n```\n\n```python\nfrom mellea.backends.openai import OpenAIBackend\nfrom mellea.stdlib.components.chat import Message\nfrom mellea.stdlib.components.intrinsic.guardian import guardian_check\nfrom mellea.stdlib.context import ChatContext\n\nbackend = OpenAIBackend(\n    model_id=\"ibm-granite\u002Fgranite-switch-4.1-3b-preview\",\n  
  base_url=\"http:\u002F\u002Flocalhost:8000\u002Fv1\",\n    api_key=\"unused\",\n)\nbackend.register_embedded_adapter_model(\"ibm-granite\u002Fgranite-switch-4.1-3b-preview\")\n\nctx = ChatContext().add(Message(\"user\", \"Group X people are all lazy.\"))\nscore = guardian_check(ctx, backend, \"social_bias\", scoring_schema=\"user_prompt\")\nprint(f\"social_bias score: {score:.3f}\")\n# => social_bias score: 0.964\n```\n\n## How It Works\n\nWith standard LoRA, each adapter is trained against its own KV distribution — so switching adapter functions across complex flow control means discarding and recomputing the KV cache at every step. aLoRA adapter functions are instead trained against a common normalized KV cache, so they can all coexist in a single checkpoint and activate on demand without cross-contamination:\n\n1. **Control tokens** — Each adapter function has a dedicated control token (e.g., `\u003Cguardian>`, `\u003Cquery_rewrite>`). Placing the token in the input sequence is what triggers activation — the adapter function's LoRA weights apply from that position forward.\n2. **KV cache normalization** — Because all adapter functions are trained against the same normalized KV cache, they never interfere with each other's internal state. Each activates on top of the shared base KV cache, which is what makes independent development, benchmarking, and composition possible without joint training.\n3. **Prefill reuse** — LoRA weights are selected per token position, not per request. Because all adapter functions share the same normalized KV cache, the prefill from earlier steps is reused rather than recomputed — eliminating the main latency cost of multi-adapter complex flow control.\n\nLike functions in a software library, adapter functions can be developed and benchmarked independently or jointly. They compose into one deployable model that contains all capabilities, in analogy to statically linked object code.\n\n## Tutorials\n\nNew here? 
Start with a 5-minute notebook and work your way up:\n\n| Notebook | What you'll build | Time | |\n|---|---|---|---|\n| [Hello Mellea](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fgenerative-computing\u002Fgranite-switch\u002Fblob\u002Fmain\u002Ftutorials\u002Fnotebooks\u002Fhello_mellea.ipynb) | Call adapters through a clean Python API | 5 min | [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fgenerative-computing\u002Fgranite-switch\u002Fblob\u002Fmain\u002Ftutorials\u002Fnotebooks\u002Fhello_mellea.ipynb) |\n| [RAG Flow](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fgenerative-computing\u002Fgranite-switch\u002Fblob\u002Fmain\u002Ftutorials\u002Fnotebooks\u002Frag_flow.ipynb) | Query rewrite + answerability + citations in one model | 30 min | [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fgenerative-computing\u002Fgranite-switch\u002Fblob\u002Fmain\u002Ftutorials\u002Fnotebooks\u002Frag_flow.ipynb) |\n| [Compose Your Own](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fgenerative-computing\u002Fgranite-switch\u002Fblob\u002Fmain\u002Ftutorials\u002Fnotebooks\u002Fcompose_granite_switch.ipynb) | Build a custom checkpoint from adapter function libraries | 15 min | [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fgenerative-computing\u002Fgranite-switch\u002Fblob\u002Fmain\u002Ftutorials\u002Fnotebooks\u002Fcompose_granite_switch.ipynb) |\n\nAll notebooks run on Colab. 
See [tutorials\u002FREADME.md](tutorials\u002FREADME.md) for the full list and guided learning paths.\n\n## Ecosystem\n\nGranite Switch is part of a coordinated stack:\n\n- **[Granite Models](https:\u002F\u002Fhuggingface.co\u002Fibm-granite)** — The base models that Granite Switch builds on. Granite 4.1 is available in 3B, 8B, and 30B parameter sizes on Hugging Face.\n- **[Granite Libraries](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fibm-granite\u002Fgranite-libraries)** — Pre-trained adapter functions for RAG, safety, and core capabilities, published on Hugging Face. These are the components you compose into a Switch model.\n- **[Mellea](https:\u002F\u002Fmellea.ai)** — Reliable, testable LLM output for Python. Type hints become schemas, docstrings become prompts, and valid output is enforced at the token level — not retried into existence. Mellea orchestrates Granite Switch adapter functions through an API built for complex flow control, handling control tokens and constrained decoding so you work with typed function calls, not raw tokens.\n- **Granite Switch** (this repo) — The model architecture and composer toolchain for embedding adapter functions into a base model and producing a deployable checkpoint.\n\n## Contributing\n\nGranite Switch was started by IBM Research and is developed in the open. We welcome bug reports, feature requests, and pull requests — see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines or open an [issue](https:\u002F\u002Fgithub.com\u002Fgenerative-computing\u002Fgranite-switch\u002Fissues).\n\n## License\n\nApache-2.0 — see [LICENSE](LICENSE).\n","Granite Switch is a tool for building and deploying AI models, designed to let developers compose and use AI models as flexibly as they build software. Its core capability is model componentization through adapter functions, which are optimized for specific tasks and can be updated or replaced independently. Technically, Granite Switch is built on Activated LoRA, which lets multiple adapters share a single KV cache and so provide many capabilities without additional memory consumption or latency. It is particularly suited to applications that need to perform specific tasks efficiently and accurately while keeping the system flexible, such as developing and maintaining enterprise AI services.",2,"2026-06-11 04:02:23","CREATED_QUERY"]