[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-11102":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":38,"readmeContent":39,"aiSummary":40,"trendingCount":16,"starSnapshotCount":16,"syncStatus":41,"lastSyncTime":42,"discoverSource":43},11102,"atlas","Avarok-Cybersecurity\u002Fatlas","Avarok-Cybersecurity","Pure Rust Inference Engine","https:\u002F\u002Fatlasinference.io",null,"Rust",490,69,7,8,0,15,26,230,45,5.54,"GNU Affero General Public License v3.0",false,"main",[26,27,28,29,30,31,32,33,34,35,36,37],"cuda","dgx","dgx-spark","gb10","llm-inference","mamba","nvfp4","openai-api","rust","speculative-decoding","ssm","transformers","2026-06-12 02:02:29","\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Flogo.svg\" alt=\"Atlas Inference Engine\" width=\"640\" \u002F>\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n  \u003Ch1 align=\"center\">Atlas Inference Engine\u003C\u002Fh1>\n  \u003Cp align=\"center\">\n    \u003Cstrong>Pure Rust LLM Inference\u003C\u002Fstrong>\u003Cbr>\n    \u003Cem>Universal Inference At Unimaginable Speeds\u003C\u002Fem>\n  \u003C\u002Fp>\n  \u003Cp align=\"center\">\n    \u003Cimg alt=\"NVIDIA\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNVIDIA-76B900?style=flat-square&logo=nvidia&logoColor=white\">\n    \u003Cimg alt=\"AMD\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAMD-ED1C24?style=flat-square&logo=amd&logoColor=white\">\n    \u003Cimg alt=\"Intel\" 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FIntel-0071C5?style=flat-square&logo=intel&logoColor=white\">\n  \u003C\u002Fp>\n  \u003Cp align=\"center\">\n    \u003Ca href=\"LICENSE\">\u003Cimg alt=\"License: AGPLv3\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-AGPLv3-yellow?style=flat-square\">\u003C\u002Fa>\n    \u003Ca href=\"#build\">\u003Cimg alt=\"Pure Rust\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fruntime-pure%20Rust-orange?style=flat-square\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhub.docker.com\u002Fr\u002Favarok\u002Fatlas-gb10\">\u003Cimg alt=\"Docker Hub\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDocker%20Hub-avarok%2Fatlas--gb10-2496ED?style=flat-square&logo=docker&logoColor=white\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FDwF3brBMpw\">\u003Cimg alt=\"Discord\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdynamic\u002Fjson?url=https%3A%2F%2Fdiscord.com%2Fapi%2Fv10%2Finvites%2FDwF3brBMpw%3Fwith_counts%3Dtrue&query=%24.approximate_member_count&label=discord&suffix=%20members&style=flat-square&logo=discord&logoColor=white&color=5865F2\">\u003C\u002Fa>\n  \u003C\u002Fp>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"assets\u002Fatlas-demo.mp4\">\u003Cimg alt=\"Atlas demo — click for full-quality MP4\" src=\"assets\u002Fatlas-demo.gif\" width=\"820\" \u002F>\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"#run-atlas\">\u003Cimg alt=\"Quick Start — under 2 minutes\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%E2%9A%A1%20Quick%20Start%20%E2%80%94%20%3C%202%20min-2EA44F?style=for-the-badge&logo=docker&logoColor=white\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fatlasinference.io\">\u003Cimg alt=\"atlasinference.io\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%8C%90%20atlasinference.io-F48C06?style=for-the-badge\">\u003C\u002Fa>\n  \u003Ca 
href=\"https:\u002F\u002Fx.com\u002FAIshaqui81766\u002Fstatus\u002F2052121270506930276\">\u003Cimg alt=\"Launch announcement on X\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9D%95%8F%20Launch%20Announcement-000000?style=for-the-badge&logo=x&logoColor=white\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n## Philosophy\n\nThe foundation of any given field of science is philosophy. It is that which inspires direction, structure, and mission.\n\nAtlas began as a solution to a widely known problem in using other (Python) inference engines built by data scientists: the code was steeped in a poly codebase with an ever-shifting ecosystem of dependencies, patches, and cross-dependencies. One day your workaround for running a model works, the next day you have to update to a nightly branch of several dependencies and inject a new workaround. This is not how you build a software ecosystem; that's how you build a proof of concept. We thank data scientists for the great and hard work they did in proving LLMs can revolutionize our world, its economy, and how it challenges us to higher epochs. Now, the software engineers take the torch to turn a proof of concept into something that is designed to withstand the test of time.\n\n### Main Objective\n\nSimilar to how llama.cpp was built with the intent to prove you don't need $10000-$100000 GPUs to run LLMs, Atlas is built with the intent to consistently force the narrative that as hardware continues to advance, we should not have to pay premium Cloud API prices for inference. Atlas, by virtue of its philosophy, maximizes speed for each hardware\u002Fmodel combination, thus paving the way for meaningfully powerful and intelligent LLMs to be run locally in such a way that the model is truly useful.\n\n### Design Choices\n\n#### Free and Open Source, Always\n\nWe have promised this since the beginning. We believe great software comes from opening the source, not from just keeping it closed. The more eyes, the better. 
And that brings us to the next point.\n\n#### Community-First\n\nFor those who've followed us this far since the inception of our Discord, you know that our commitment to the community is, as one user humorously put it, \"cracked\". We want to build something incredible, and that means we not only build for you, but you, now having access to the source code, can now build for others in ways that triumph over existing solutions. This is the only way we all win. We are the Pirates of the inference space.\n\n#### Monorepo\n\nWe chose a monorepo design to ensure that, as we head further into the agentic age of coding, the average data scientist or engineer can contribute meaningful PRs to any part of the system. Eventually, since this is a monorepo, there will be a day when the repo is autonomously self-improving and self-patching. This is most efficient and most effective when all the code is in one place, not many.\n\n#### Hardware+Model Specific Kernels\n\nWe make no compromises or generalizations. Each hardware and model combination has its own unique properties that require fine-tuning custom kernels that leverage the model for that specific hardware configuration. The end result? 2-3x faster kernels all around.\n\n#### AI-Friendly Codebase\n\nIt took a significant amount of time to build this codebase. We also know people will want to submit AI-generated PRs. We can't stop you, and in fact, given SOTA, you might just have to! The good news is that this codebase was built with enough guardrails, structure, and abstraction to guide your AI to absorb the entire monorepo and contribute meaningfully. There's enough context to keep this from going off the rails like a crazy train. Ultimately, this means that instead of waiting for days to weeks before getting model support, you can just fork this repo and ask your AI to integrate it; within hours you'll more likely than not have a working model running. 
We will not be condescending, [unlike some other inference engines out there when good-faith PRs that simply work are posted](https:\u002F\u002Fgithub.com\u002Fggml-org\u002Fllama.cpp\u002Fpull\u002F18680#issuecomment-3723954542). We are not stymied by bureaucracy, and want to enable the community to rapidly expand this monorepo ecosystem safely and effectively.\n\n#### Theory-Friendly Codebase\n\narXiv is getting countless papers published every day on AI. Nobody can keep up. Some papers may be relevant to this project; others may not. Research endeavors to improve quality, alignment, and speed ought to be considered by our community as something we can integrate cleanly. Feel free to open a PoC PR here and just explain what you did and why, and how it works.\n\n#### Plug and Play Design\n\nOur system is modular, with tight abstraction boundaries and trait requirements that force the architecture to take on a certain form. This form is designed to prevent pigeonholing the project in the wrong direction. The business logic is the same across all hardware\u002Fmodel combinations; only the concrete implementations differ.\n\n## Architecture\n\nThe diagram below shows how a single HTTP request flows from the API surface down to hardware-specific CUDA kernel execution. 
**Dashed borders** mark the **plug-and-play** abstraction boundaries — the traits and registries where a new hardware target, model family, communication backend, or storage backend plugs in without touching the layers above or below it.\n\n```mermaid\nflowchart TB\n    %% ── Colours & styles ──────────────────────────────────────────────\n    classDef server fill:#2d6a4f,stroke:#1b4332,color:#d8f3dc\n    classDef sched fill:#1d3557,stroke:#0d1b2a,color:#a8dadc\n    classDef model fill:#6a040f,stroke:#370617,color:#fae0e4\n    classDef runtime fill:#7b2cbf,stroke:#3c096c,color:#e0aaff\n    classDef kernel fill:#e85d04,stroke:#9d0208,color:#fff\n    classDef plug fill:none,stroke:#f48c06,stroke-width:3,stroke-dasharray:6 4,color:#f48c06\n    classDef prim fill:#3a0ca3,stroke:#240046,color:#c8b6ff\n    classDef hw fill:#264653,stroke:#2a9d8f,color:#e9f5db\n\n    %% ════════════════════════════════════════════════════════════════\n    %% 1. HTTP API  (spark-server)\n    %% ════════════════════════════════════════════════════════════════\n    CLIENT[\"Client\u003Cbr\u002F>(OpenAI \u002F Anthropic API)\"]\n    API[\"spark-server — api.rs \u002F anthropic.rs\u003Cbr\u002F>OpenAI + Anthropic HTTP surface\u003Cbr\u002F>Tool-call parsing · Streaming SSE\"]:::server\n    CLIENT --> API\n\n    %% ════════════════════════════════════════════════════════════════\n    %% 2. Scheduler  (spark-server\u002Fscheduler.rs)\n    %% ════════════════════════════════════════════════════════════════\n    SCHED[\"Scheduler\u003Cbr\u002F>Batched decode · Chunked prefill · MTP spec-decode\u003Cbr\u002F>Grammar FSM · Thinking-loop watchdog\u003Cbr\u002F>Sampling (argmax \u002F top-k \u002F top-p \u002F DRY \u002F LZ)\"]:::sched\n    API -->|\"InferenceRequest\u003Cbr\u002F>via mpsc channel\"| SCHED\n\n    %% ════════════════════════════════════════════════════════════════\n    %% 3. 
Model trait  (spark-model\u002Ftraits.rs) — PLUG & PLAY\n    %% ════════════════════════════════════════════════════════════════\n    MODEL_TRAIT{{\"🔌 trait Model\u003Cbr\u002F>(spark-model\u002Ftraits.rs)\u003Cbr\u002F>prefill · decode · decode_batch\u003Cbr\u002F>mixed_forward · speculative verify\"}}:::plug\n    SCHED -->|\"Box(dyn Model)\"| MODEL_TRAIT\n\n    %% ════════════════════════════════════════════════════════════════\n    %% 4. Model factory + weight loaders  (spark-model)\n    %% ════════════════════════════════════════════════════════════════\n    FACTORY[\"factory.rs — loader_for_config()\u003Cbr\u002F>SSOT model_type → loader dispatch\"]:::model\n    MODEL_TRAIT --> FACTORY\n\n    subgraph LOADERS [\"🔌 Weight Loaders — trait ModelWeightLoader\"]\n        direction LR\n        QWEN3[\"Qwen3\u003Cbr\u002F>qwen3.rs\"]:::model\n        QWEN35[\"Qwen3.5 \u002F 3.6\u003Cbr\u002F>qwen35.rs\"]:::model\n        GEMMA4[\"Gemma-4\u003Cbr\u002F>gemma4.rs\"]:::model\n        NEMOTRON[\"Nemotron-H\u003Cbr\u002F>nemotron.rs\"]:::model\n        MINIMAX[\"MiniMax M2\u003Cbr\u002F>minimax.rs\"]:::model\n        MISTRAL[\"Mistral\u003Cbr\u002F>mistral_loader.rs\"]:::model\n        NEW_MODEL[\"🆕 Your Model\u003Cbr\u002F>impl ModelWeightLoader\"]:::plug\n    end\n    FACTORY --> LOADERS\n\n    %% ════════════════════════════════════════════════════════════════\n    %% 5. 
Layer trait  (spark-model\u002Flayer.rs) — PLUG & PLAY\n    %% ════════════════════════════════════════════════════════════════\n    LAYER_TRAIT{{\"🔌 trait TransformerLayer\u003Cbr\u002F>(spark-model\u002Flayer.rs)\u003Cbr\u002F>decode · prefill · decode_multi_seq\u003Cbr\u002F>SSM two-phase · MoE transpose\"}}:::plug\n    LOADERS --> LAYER_TRAIT\n\n    subgraph LAYERS [\"Layer Implementations\"]\n        direction LR\n        ATTN[\"Qwen3 Attention\u003Cbr\u002F>(GQA · MLA · MRoPE)\"]:::model\n        SSM[\"Qwen3 SSM\u003Cbr\u002F>(GDN · Mamba-2)\"]:::model\n        MOE[\"MoE FFN\u003Cbr\u002F>(top-k · sigmoid-route)\"]:::model\n        DENSE[\"Dense FFN\u003Cbr\u002F>(GeGLU · SiLU)\"]:::model\n        VIS[\"Vision Encoder\u003Cbr\u002F>(ViT blocks)\"]:::model\n    end\n    LAYER_TRAIT --> LAYERS\n\n    %% ════════════════════════════════════════════════════════════════\n    %% 6. GPU backend trait  (spark-runtime\u002Fgpu.rs) — PLUG & PLAY\n    %% ════════════════════════════════════════════════════════════════\n    GPU_TRAIT{{\"🔌 trait GpuBackend\u003Cbr\u002F>(spark-runtime\u002Fgpu.rs)\u003Cbr\u002F>alloc · free · launch · copy_h2d\u002Fd2h\u003Cbr\u002F>CUDA graphs · pinned memory\"}}:::plug\n    LAYERS -->|\"&dyn GpuBackend\"| GPU_TRAIT\n\n    subgraph GPU_IMPLS [\"GPU Backend Implementations\"]\n        direction LR\n        CUDA_BE[\"AtlasCudaBackend\u003Cbr\u002F>(production — CUDA driver)\"]:::runtime\n        MOCK_BE[\"MockGpuBackend\u003Cbr\u002F>(unit tests)\"]:::runtime\n        NEW_HW[\"🆕 Your Hardware\u003Cbr\u002F>impl GpuBackend\"]:::plug\n    end\n    GPU_TRAIT --> GPU_IMPLS\n\n    %% ════════════════════════════════════════════════════════════════\n    %% 7. 
Runtime services  (spark-runtime)\n    %% ════════════════════════════════════════════════════════════════\n    subgraph RUNTIME [\"spark-runtime\"]\n        direction TB\n        KV[\"PagedKvCache\u003Cbr\u002F>BF16 · FP8 · NVFP4 · Turbo3\u002F4\u002F8\"]:::runtime\n        BUFS[\"BufferArena\u003Cbr\u002F>Pre-allocated scratch\"]:::runtime\n        WEIGHTS[\"WeightStore\u003Cbr\u002F>mmap \u002F O_DIRECT loader\"]:::runtime\n        SAMPLER[\"Sampler\u003Cbr\u002F>argmax · top-k\u002Fp · temperature\"]:::runtime\n    end\n    LAYERS --> RUNTIME\n\n    %% ════════════════════════════════════════════════════════════════\n    %% 8. Kernel registry  (atlas-core\u002Fregistry.rs)\n    %% ════════════════════════════════════════════════════════════════\n    REGISTRY[\"AtlasRegistry\u003Cbr\u002F>Global PTX cache · cuLaunchKernel\"]:::kernel\n    CUDA_BE --> REGISTRY\n\n    %% ════════════════════════════════════════════════════════════════\n    %% 9. Kernel target system  (atlas-kernels) — PLUG & PLAY\n    %% ════════════════════════════════════════════════════════════════\n    KERNELS_CRATE{{\"🔌 atlas-kernels\u003Cbr\u002F>KernelTarget = (HW × Model × Quant)\u003Cbr\u002F>build.rs → per-target PTX embedding\u003Cbr\u002F>MODEL.toml → SamplingPresets + ModelBehavior\"}}:::plug\n    REGISTRY --> KERNELS_CRATE\n\n    subgraph TARGETS [\"kernels\u002F — CUDA Kernel Targets\"]\n        direction TB\n        subgraph GB10 [\"kernels\u002Fgb10\u002F — HARDWARE.toml\"]\n            direction LR\n            T1[\"qwen3-next-80b-a3b\"]:::kernel\n            T2[\"qwen3.5-35b-a3b\"]:::kernel\n            T3[\"gemma-4-26b-a4b\"]:::kernel\n            T4[\"nemotron-3-nano-30b\"]:::kernel\n            T5[\"minimax-m2-229b\"]:::kernel\n            T6[\"mistral-small-4\"]:::kernel\n            T7[\"+ 7 more targets\"]:::kernel\n        end\n        NEW_TARGET[\"🆕 kernels\u002F\u003Chw>\u002F\u003Cmodel>\u002F\u003Cquant>\u002F\u003Cbr\u002F>HARDWARE.toml + MODEL.toml + 
*.cu\"]:::plug\n    end\n    KERNELS_CRATE --> TARGETS\n\n    %% ════════════════════════════════════════════════════════════════\n    %% 10. Communication  (spark-comm) — PLUG & PLAY\n    %% ════════════════════════════════════════════════════════════════\n    COMM_TRAIT{{\"🔌 trait CommBackend\u003Cbr\u002F>(spark-comm\u002Flib.rs)\u003Cbr\u002F>all_reduce · broadcast · send_to\u002Frecv_from\u003Cbr\u002F>Expert Parallelism (EP)\"}}:::plug\n    LAYERS -->|\"&dyn CommBackend\u003Cbr\u002F>(EP all-reduce)\"| COMM_TRAIT\n\n    subgraph COMM_IMPLS [\"Comm Implementations\"]\n        direction LR\n        SINGLE[\"SingleGpuBackend\u003Cbr\u002F>(no-op)\"]:::prim\n        NCCL[\"NcclBackend\u003Cbr\u002F>(multi-GPU NCCL)\"]:::prim\n        NEW_COMM[\"🆕 Your Transport\u003Cbr\u002F>impl CommBackend\"]:::plug\n    end\n    COMM_TRAIT --> COMM_IMPLS\n\n    %% ════════════════════════════════════════════════════════════════\n    %% 11. Storage  (spark-storage) — PLUG & PLAY\n    %% ════════════════════════════════════════════════════════════════\n    STORAGE_TRAIT{{\"🔌 trait StorageBackend\u003Cbr\u002F>(spark-storage\u002Fbackend)\u003Cbr\u002F>High-Speed Swap — NVMe KV offload\u003Cbr\u002F>Predictive eviction · Tiled attention\"}}:::plug\n    KV -->|\"KV spill \u002F restore\"| STORAGE_TRAIT\n\n    subgraph STORAGE_IMPLS [\"Storage Implementations\"]\n        direction LR\n        IOURING[\"IoUringBackend\"]:::prim\n        POSIX[\"PosixBackend\"]:::prim\n        NEW_STORAGE[\"🆕 Your Backend\u003Cbr\u002F>impl StorageBackend\"]:::plug\n    end\n    STORAGE_TRAIT --> STORAGE_IMPLS\n\n    %% ════════════════════════════════════════════════════════════════\n    %% 12. 
Shared primitives  (atlas-*)\n    %% ════════════════════════════════════════════════════════════════\n    subgraph PRIMITIVES [\"Shared Primitives (atlas-*)\"]\n        direction LR\n        P1[\"atlas-core\u003Cbr\u002F>Config · KernelTarget\u003Cbr\u002F>Registry · Compute\"]:::prim\n        P2[\"atlas-quant\u003Cbr\u002F>NVFP4 · FP8\"]:::prim\n        P3[\"atlas-norm\u003Cbr\u002F>RMS norm\"]:::prim\n        P4[\"atlas-embed\u003Cbr\u002F>Embedding\"]:::prim\n        P5[\"atlas-reduce\u003Cbr\u002F>Reductions\"]:::prim\n        P6[\"atlas-activation\u003Cbr\u002F>SiLU · GeGLU\"]:::prim\n    end\n    LAYERS -.->|\"kernel wrappers\"| PRIMITIVES\n```\n\n### Reading the Diagram\n\n**Solid boxes** are concrete implementations. **Dashed borders with 🔌** are the trait-based abstraction boundaries — each is a Rust trait (or a filesystem convention for kernels) where a new integration plugs in:\n\n| Plug Point | What It Abstracts | To Add New Support |\n|---|---|---|\n| `trait Model` | Full model forward pass | Rarely needed — the existing `TransformerModel` handles all architectures via composable layers |\n| `trait ModelWeightLoader` | HuggingFace → layer translation | **Implement one struct** with weight-name patterns for your model family ([`factory.rs`](crates\u002Fspark-model\u002Fsrc\u002Ffactory.rs) adds one match arm) |\n| `trait TransformerLayer` | Per-layer compute (attn, SSM, MoE, FFN) | Compose existing layer types or implement a new one for novel architectures |\n| `trait GpuBackend` | All GPU memory and kernel ops | Swap the CUDA driver for another accelerator backend |\n| `kernels\u002F\u003Chw>\u002F\u003Cmodel>\u002F\u003Cquant>\u002F` | Hardware-tuned CUDA kernels | Drop a new directory with `MODEL.toml` + `.cu` files; `build.rs` auto-discovers it |\n| `trait CommBackend` | Multi-GPU collective communication | Implement for MPI, GDR, or custom interconnects |\n| `trait StorageBackend` | NVMe KV-cache offload I\u002FO | Implement for CXL, RDMA, or 
other storage tiers |\n\n### Data Flow Summary\n\n1. **HTTP** → `spark-server` receives OpenAI\u002FAnthropic requests, tokenizes, and enqueues\n2. **Scheduler** → batches sequences, orchestrates prefill\u002Fdecode\u002Fspeculative-verify steps\n3. **Model** → generic loop: `embed → [layer₀ … layerₙ] → norm → lm_head`\n4. **Layers** → each layer dispatches through `GpuBackend` to launch kernels from `AtlasRegistry`\n5. **Kernels** → pre-compiled PTX selected by `(hardware × model × quant)` target at build time\n6. **EP** → `CommBackend` handles cross-GPU all-reduce after MoE expert computation\n7. **Storage** → `StorageBackend` spills\u002Frestores KV blocks to NVMe for long-context sequences\n\n## What We Ship Today\n\nWe have to walk before we can run. Today's Atlas is targeted at a single hardware platform — NVIDIA's GB10 (DGX Spark, SM121) — and twelve hand-tuned (Hardware × Model × Quantization) targets. Every supported model below runs off one multi-model binary; the right kernel set is selected at startup from the model's `config.json`. 
No swapping images, no rebuilding, no per-model magic — just point Atlas at a HuggingFace ID.\n\n| Family | Model | HuggingFace ID | Params \u002F active | Architecture |\n|---|---|---|---:|---|\n| Qwen3.5 | Qwen3.5-27B | `Kbenkhaled\u002FQwen3.5-27B-NVFP4` | 27B dense | Hybrid SSM + attention, dense FFN, MRoPE |\n| Qwen3.5 | Qwen3.5-35B-A3B | `Sehyo\u002FQwen3.5-35B-A3B-NVFP4` | 35B \u002F 3B | GDN + attention + MoE, MTP |\n| Qwen3.5 | Qwen3.5-122B-A10B | `Sehyo\u002FQwen3.5-122B-A10B-NVFP4` | 122B \u002F 10B | GDN + attention + MoE, MTP |\n| Qwen3.6 | Qwen3.6-35B-A3B | `Qwen\u002FQwen3.6-35B-A3B-FP8` | 35B \u002F 3B | GDN + attention + MoE, MRoPE, vision tower |\n| Qwen3-Next | Qwen3-Next-80B-A3B | `nvidia\u002FQwen3-Next-80B-A3B-Instruct-NVFP4` | 80B \u002F 3B | SSM + attention + MoE |\n| Qwen3-VL | Qwen3-VL-30B-A3B | `ig1\u002FQwen3-VL-30B-A3B-Instruct-NVFP4` | 30B \u002F 3B | Vision + attention + MoE |\n| Gemma-4 | Gemma-4-26B-A4B | `bg-digitalservices\u002FGemma-4-26B-A4B-it-NVFP4A16` | 26B \u002F 4B | Attention + MoE, GeGLU |\n| Gemma-4 | Gemma-4-31B | `nvidia\u002FGemma-4-31B-IT-NVFP4` | 31B dense | Attention (sliding + full), GeGLU |\n| Mistral | Mistral-Small-4-119B | `mistralai\u002FMistral-Small-4-119B-2603-NVFP4` | 119B \u002F 6.5B | Attention + MoE |\n| MiniMax | MiniMax-M2.7 | `lukealonso\u002FMiniMax-M2.7-NVFP4` | 229B \u002F ~10B | Attention + 256-expert MoE + MTP |\n| Nemotron-H | Nemotron-3-Nano-30B-A3B | `nvidia\u002FNVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4` | 30B \u002F 3B | Mamba-2 + attention + MoE |\n| Nemotron-H | Nemotron-3-Super-120B-A12B | `nvidia\u002FNVIDIA-Nemotron-3-Super-120B-A12B-NVFP4` | 120B \u002F 12B | Mamba-2 + attention + MoE |\n\nThis is a starting point, not a destination. 
The plug-and-play design above exists precisely so that AMD, Apple Silicon, Intel, and the next round of Blackwell parts can land here as community contributions, and so that the Llama 4s and DeepSeek V4s of next quarter slot in the same way the Qwens did this quarter. We did the hard part — bolting in the abstractions while bringing up the first twelve targets — so that adding the thirteenth is a weekend, not a quarter.\n\n## Performance\n\nWe're not going to spend much real estate on benchmark theatre. The numbers below are what the binary in this repository does on a single NVIDIA GB10, on a short prompt (`\"What is the capital of France?\"`, `max_tokens ≤ 30`, `temperature = 0.1`), measured end-to-end through the HTTP API. They are reproducible: `scripts\u002Fsweep_all_models.sh` is the harness, and the source for every kernel that produced them is in this repository.\n\n| Model | Mode | tok\u002Fs |\n|---|---|---:|\n| Qwen3.5-35B-A3B | MTP speculative (K=2) | **131** |\n| Qwen3.5-35B-A3B | turbo4 KV | 77 |\n| Qwen3.5-35B-A3B | No speculative | 70 |\n| Qwen3-Next-80B-A3B | FP8 KV | 74 |\n| Qwen3.5-122B-A10B | EP=2, MTP K=2 (600-tok sustained) | 46 |\n| Qwen3.5-122B-A10B | FP8 KV, single-GPU tuned | 32 |\n| Qwen3-VL-30B-A3B | NVFP4 KV | 97 |\n| Nemotron-3-Nano-30B-A3B | FP8 KV | 88 |\n| Nemotron-3-Super-120B | FP8 KV | 24 |\n| Gemma-4-26B-A4B | default | 67 |\n| Gemma-4-31B | `--max-batch-size 2` | 9 |\n| Mistral-Small-4-119B | NVFP4 | 33 |\n| Qwen3.5-27B (dense hybrid) | FP8 KV | 13 |\n\nWe compete with vLLM and TensorRT-LLM on the same GB10. On Qwen3.5-35B-A3B with MTP speculative decoding, Atlas decodes faster than the same model under NVIDIA's own vLLM build on the same hardware — meaningfully faster, on numbers we can hand you the script for. We will not put a bigger figure in this paragraph than the one that comes off our own benchmark scripts, and we publish the vLLM baseline command alongside ours so you can verify both. 
If you reproduce a faster vLLM number, file an issue. We would rather be measured than congratulated.\n\nThe kernel-by-kernel comparison against PyTorch eager (35 hyperoptimized CUDA kernels, all wins on production-relevant shapes) lives in the [benchmarks chapter](book\u002Fsrc\u002Foperations\u002Fbenchmarks.md) along with the methodology footnotes — read them; they matter.\n\n## KV Cache Quantization\n\nAtlas stores attention key\u002Fvalue state in one of six quantized formats, selected via `--kv-cache-dtype`. Lower bit-widths fit more tokens in GPU memory at the cost of precision; the Turbo family adds Walsh-Hadamard rotation and Lloyd-Max optimal codebooks to recover accuracy at the same bit rate. Mix dtypes per layer with `--kv-high-precision-layers` to keep boundary layers at BF16 while compressing the middle.\n\n| CLI flag | Bits\u002Felement | Scale overhead | Technique | When to use |\n|---|---:|---|---|---|\n| `bf16` | 16 | — | Raw BF16 storage | Maximum precision; short-context or quality-critical workloads |\n| `fp8` | 8 | Per-tensor FP32 scale (from checkpoint or online calibration via `--fp8-kv-calibration-tokens`) | FP8 E4M3 with static or calibrated per-tensor scale | **Default.** Safe baseline — half the memory of BF16, minimal quality loss for most models |\n| `turbo8` | 8 | Per-group BF16 scale (2 bytes \u002F 16 elements) | Walsh-Hadamard rotation → FP8 E4M3 + BF16 per-group scales | FP8-level memory with outlier suppression; recommended for many-layer models (e.g. 
MiniMax M2.7, 58 layers) where per-group FP8 scales compound |\n| `nvfp4` | 4 | Per-group FP8 scale (1 byte \u002F 16 elements) | E2M1 packed nibbles (NVIDIA NVFP4 format) | 4× compression vs BF16; good for long-context with `--kv-high-precision-layers auto` |\n| `turbo4` | 4 | Per-group FP8 scale (1 byte \u002F 16 elements) | Walsh-Hadamard rotation → Lloyd-Max optimal 4-bit codebook | ~2× lower MSE than NVFP4 at the same bit rate; same memory footprint |\n| `turbo3` | 3 | Per-group FP8 scale (1 byte \u002F 16 elements) | Walsh-Hadamard rotation → Lloyd-Max 3-bit codebook (8 levels, packed 8 values → 3 bytes) | Maximum compression (22% smaller than turbo4); experimental |\n\n## Quick Start\n\nThe whole supported model matrix lives in one Docker image. Pull it, mount your HuggingFace cache, point Atlas at any model ID from the table above:\n\n\u003Ca id=\"run-atlas\">\u003C\u002Fa>\n\n**Qwen3.6-35B (FP8) — 130 tok\u002Fs on a single Spark:**\n\n```bash\ndocker pull avarok\u002Fatlas-gb10:latest\n\nsudo docker run -d --name atlas \\\n  --network host --gpus all --ipc=host \\\n  -v ~\u002F.cache\u002Fhuggingface:\u002Froot\u002F.cache\u002Fhuggingface \\\n  avarok\u002Fatlas-gb10:latest \\\n  serve Qwen\u002FQwen3.6-35B-A3B-FP8 \\\n    --port 8888 \\\n    --max-seq-len 65536 \\\n    --kv-cache-dtype fp8 \\\n    --kv-high-precision-layers auto \\\n    --gpu-memory-utilization 0.90 \\\n    --scheduling-policy slai \\\n    --tool-call-parser qwen3_coder \\\n    --enable-prefix-caching \\\n    --speculative\n```\n\n**Qwen3.5-122B (NVFP4) — single Spark, ~33 tok\u002Fs decode at batch=1:**\n\nThe 122B NVFP4 weights + Atlas runtime overhead leave only ~2 GB for KV cache on a 119.7 GB GB10, so keep `--max-num-seqs` low and use a tighter `--max-seq-len`. 
This recipe is verified end-to-end (model loads, `\u002Fv1\u002Fchat\u002Fcompletions` answers correctly, 4-way concurrent serves cleanly):\n\n```bash\nsudo docker run -d --name atlas \\\n  --network host --gpus all --ipc=host \\\n  -v ~\u002F.cache\u002Fhuggingface:\u002Froot\u002F.cache\u002Fhuggingface \\\n  avarok\u002Fatlas-gb10:latest \\\n  serve Sehyo\u002FQwen3.5-122B-A10B-NVFP4 \\\n    --port 8888 \\\n    --max-seq-len 16384 \\\n    --kv-cache-dtype fp8 \\\n    --kv-high-precision-layers auto \\\n    --gpu-memory-utilization 0.92 \\\n    --scheduling-policy slai \\\n    --max-batch-size 1 \\\n    --max-num-seqs 4 \\\n    --oom-guard-mb 1024 \\\n    --ssm-cache-slots 0 \\\n    --tool-call-parser qwen3_coder\n```\n\n**Note:** `--speculative` (MTP) on single-Spark 122B costs ~1.5 GB for the draft head + draft KV and forces `--max-seq-len` down to ~4 K. Either run without `--speculative` at 16 K (the recipe above), or move to EP=2 across two Sparks ([`QUICKSTART.md`](QUICKSTART.md) §5) for both speculative decoding *and* a 16 K+ window.\n\nFor longer contexts on a single Spark, add `--high-speed-swap --high-speed-swap-dir \u002Fpath\u002Fon\u002Fnvme --high-speed-swap-cache-blocks-per-seq 64`. HSS keeps a rolling 1024-token KV window in HBM and streams older blocks to NVMe through an io_uring orchestrator — works with any `--max-seq-len` you can fit on disk. The Docker container needs `--security-opt seccomp=unconfined --ulimit memlock=-1` for io_uring access.\n\nThat's it. Anything OpenAI-compatible — `curl`, the OpenAI SDK, Open WebUI, opencode — points at port 8888:\n\n```bash\ncurl http:\u002F\u002Flocalhost:8888\u002Fv1\u002Fchat\u002Fcompletions \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\"model\":\"atlas\",\"messages\":[{\"role\":\"user\",\"content\":\"Hello!\"}],\"max_tokens\":256}'\n```\n\nPer-model recipes (vision, MoE, multi-node EP=2, single-GPU 122B with the tighter budget) live in [`QUICKSTART.md`](QUICKSTART.md). 
Build-from-source instructions are in [`CONTRIBUTING.md`](CONTRIBUTING.md), and the kernel build pipeline is documented in [`docs\u002FARCHITECTURE.md`](docs\u002FARCHITECTURE.md#build-pipeline).\n\n## Adding a New Hardware Target\n\nThe full recipe is in [`docs\u002FHARDWARE.md`](docs\u002FHARDWARE.md#adding-a-new-hardware-target). The short version: implement two traits (`ComputeTarget` for the build-time compiler, `GpuBackend` for the runtime), drop kernel sources into `kernels\u002F\u003Cyour-hw>\u002F`, add one match arm in the registry. There is a `MockGpuBackend` in `spark-runtime` that lets you write and test the entire scaffold without owning the hardware — every layer above the GPU trait is hardware-agnostic, so unit tests can run on a laptop. We bolted the project from \"single CUDA target\" to \"trait-pluggable across vendors\" specifically so that the AMD, Apple, and Intel ports stop being our problem and start being yours.\n\n## Adding a New Model\n\nSame story, smaller surface. Implement `ModelWeightLoader` (one struct, the existing `Qwen3AttentionLayer`\u002F`MoeLayer`\u002F`Qwen3SsmLayer`\u002F`NemotronMamba2Layer` primitives cover most architectures), add one line to the factory dispatch, optionally drop a `MODEL.toml` for sampling defaults and behavior knobs. Kernels are reused; the scheduler is untouched; the server is oblivious. The step-by-step cookbook is in [`docs\u002FHARDWARE.md`](docs\u002FHARDWARE.md#adding-a-new-model-family). Once your loader produces coherent output on the integration coherence prompt, you are done — file the PR.\n\n## Citations\n\nWe did not invent the kernels we ship. We picked the right ideas from the right papers, fused them together, and tuned them for one chip until they pinned the bandwidth ceiling. Atlas owes a direct intellectual debt to:\n\n- **FlashAttention-2** — Tri Dao. *FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning.* ICLR 2024. 
[arXiv:2307.08691](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.08691) — tiled online softmax, Q\u002FK\u002FV SMEM staging, causal masking. Foundation of our prefill kernel.\n- **FlashAttention-4** — Shah, Bikshandi, Zhang, Thakkar, Ramani, Dao. *FlashAttention-4: Taming the Hardware.* 2025. [arXiv:2603.05451](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.05451) — conditional softmax rescaling and software polynomial `sw_exp` (3 FMA + `ldexpf` instead of going through the SFU). Both shipped in our GQA-fused paged Flash Attention.\n- **FlashInfer** — Ye, Chen, Lai, Zhao, Zheng, Shao, Hou, Jin, Zuo, Yin, Chen, Ceze. *FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving.* MLSys 2025 (Best Paper). [arXiv:2501.01005](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.01005) — block-sparse paged KV cache, page index prefetch to SMEM, the gather-SMEM-MMA pattern for scattered pages. Informed our paged attention design.\n- **SageAttention 3** — Zhang, Huang, Zhang, Wei, Zhu, Chen. *SageAttention3: Microscaling FP4 Attention on Blackwell GPUs.* NeurIPS 2025 Spotlight. [arXiv:2505.11594](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.11594) — FP4 attention with FP8 per-block microscales. On the SM121 roadmap once silicon-level FP4 MMA arrives upstream.\n- **LeanAttention** — Roy, Vassilieva, Willke, Mendis. *LeanAttention: Hardware-Aware Scalable Attention for LLM Inference.* 2024. [arXiv:2405.10480](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.10480) — stream-K tile scheduling for near-100% SM occupancy in split-K decode attention. Planned next.\n\nIf you wrote one of these papers and you spot a misattribution or a wrong technique credit on our side, open an issue. We would rather be corrected than wrong.\n\n## License and Enterprise Edition\n\nAtlas operates under a **dual-license** model. Both are real, both are intentional, and neither is a teaser for the other.\n\n1. **[Community Edition](LICENSE) — AGPLv3.** Free, open, copyleft. 
Use it to run inference on your own hardware for research, hobby projects, side projects, hosted demos, and the like. If you want to make money from Atlas, purchase a commercial license.\n2. **Enterprise Edition — commercial license.** If you need to ship Atlas inside a closed-source product, run it as a SaaS backend without inheriting the AGPLv3 source-disclosure obligation, or simply want a support relationship with the people who wrote the kernels, contact sales. Enterprise customers also receive prioritized model and hardware ports.\n\nThis split exists for a single reason: the commercial license keeps us building Atlas full-time, and the AGPL community license keeps the project honest. What is in this repository is what we run.","Atlas Inference Engine is a large language model inference engine written in pure Rust. It supports CUDA, DGX, and a range of hardware acceleration technologies across NVIDIA, AMD, and Intel platforms, and provides an OpenAI API-compatible interface. By using advanced decoding techniques and optimized memory management, Atlas achieves very high inference speed and efficiency. The project also provides a Docker image to simplify deployment. Atlas suits applications that need high-performance language model inference, such as natural language processing, chatbot development, and large-scale text generation.",2,"2026-06-11 03:31:10","CREATED_QUERY"]