[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-76139":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":42,"readmeContent":43,"aiSummary":44,"trendingCount":16,"starSnapshotCount":16,"syncStatus":45,"lastSyncTime":46,"discoverSource":47},76139,"FlashRT","LiangSu8899\u002FFlashRT","LiangSu8899","FlashRT is a high-performance realtime inference engine for small-batch, latency-sensitive AI workloads. The flagship integration is production VLA control for Pi0, Pi0.5, GROOT N1.6, and Pi0-FAST. Also support llm e.g, qwen3.6-27B","",null,"C++",321,37,7,6,0,31,68,175,93,94.74,"Apache License 2.0",false,"main",true,[27,28,29,30,31,32,33,34,35,36,37,38,39,40,41],"cuda","cuda-kernels","gr00t","gr00t-n1-6-3b","jetson","jetson-thor","pi","pi05","qwen","qwen3-6","qwen3-6-27b","realtime-inference","realtime-vla","thor","vla","2026-06-12 04:01:20","# FlashRT\n\n**FlashRT is a high-performance realtime inference engine for small-batch, latency-sensitive AI workloads.**\n\nA general kernel library composed into static graphs — no ONNX export, no engine compilation, no per-driver rebuild. Hand-written kernels (norm \u002F activation \u002F fusion \u002F RoPE \u002F FP8 \u002F NVFP4 GEMM \u002F attention) cover standard transformer, DiT, and SigLIP primitives. 
The composition pattern itself is hardware-agnostic; today the codebase ships with NVIDIA implementations spanning edge to server (Jetson AGX Thor through A100 \u002F RTX 4090 \u002F 5090).\n\nThe flagship integration today is **VLA control** — production frontends for Pi0, Pi0.5, GROOT N1.6, and Pi0-FAST, validated on LIBERO. The same kernel set also powers the BAGEL world-model image-generation pipeline (research preview) and audio \u002F video generation (4× over PyTorch). FlashRT now also serves **single-stream LLM inference** — the v1 release ships **Qwen3.6-27B (NVFP4)** with **256 K context on a single RTX 5090**, an OpenAI-compatible HTTP server, and decode throughput of **~100 tok\u002Fs typical \u002F 129 tok\u002Fs peak** (real warm-state range across mixed chat \u002F reasoning \u002F code prompts; see [Performance](#performance) for the breakdown). The pattern is workload-shaped (small-batch realtime), not model-class-shaped.\n\nExisting inference tooling is shaped for different workloads — TensorRT for tactic-search compile to frozen engines, vLLM \u002F SGLang for high-batch LLM serving. 
FlashRT targets the small-batch realtime cell with hand-tuned kernels and no compile step.\n\n## FlashRT is fast with:\n\n- **hand-written CUDA kernels**: norm, activation, residual+norm+quant fusion, RoPE \u002F qkv-split, FP8 \u002F NVFP4 GEMM, cuBLASLt FP8, CUTLASS SM100 FP8, vendored Flash-Attention 2, Thor CUTLASS FMHA\n- **Static CUDA Graph capture** of the entire forward — zero Python overhead at replay\n- **Production FP8 (E4M3) and NVFP4** with automatic per-tensor calibration, JSON-cached to disk\n- **No compile, no export**: direct safetensors \u002F Orbax loading, first call ~3 s, every call after is graph replay\n- Survives CUDA driver upgrades, GPU swaps, and prompt changes without rebuild\n\n## FlashRT is easy to use with:\n\n- **3-line API**: `flash_rt.load_model(...).predict(images, prompt)`\n- **Auto-dispatched hardware**: same code path on Jetson Thor \u002F RTX 5090 \u002F RTX 4090\n- **PyTorch and JAX frontends** share one kernel binary, equivalent results (cosine ≥ 0.999)\n- **Plugin model registration** — add a new VLA via one frontend file + a declarative `WEIGHT_SPEC`, no fork required\n- **LIBERO benchmark integration** out of the box; ~6 minutes from `git clone` to first inference\n\n## FlashRT supports:\n\n- **VLA models**: Pi0, Pi0.5, GROOT N1.6, Pi0-FAST — all production-validated on LIBERO. BAGEL world-model (research preview) — image-gen pipeline at ~4× vs PyTorch.\n- **LLM**: **Qwen3.6-27B NVFP4 — ~100 tok\u002Fs typical \u002F 129 tok\u002Fs peak decode, 256 K context, single RTX 5090** — speculative decoding via the FP8 ckpt's MTP head, OpenAI-compatible HTTP server.\n- **Hardware (today)**: NVIDIA Jetson AGX Thor (SM110), RTX 5090 (SM120), RTX 4090 (SM89), and SM80 \u002F SM86 \u002F SM89 cards (A100, RTX 3090, 4060 Ti, etc.). 
The kernel composition pattern is portable to other accelerators.\n- **Frameworks**: PyTorch (safetensors) + JAX (Orbax) — same compiled kernels\n\nPi0.5: 44 ms \u002F 23 Hz on Jetson AGX Thor (2v, FP8) · 39.78 ms \u002F 25 Hz (2v, NVFP4) · 17.58 ms \u002F 57 Hz on RTX 5090. Cosine ≥ 0.9996 vs the production reference. See [Performance](#performance) for the full sweep.\n\n## Getting Started\n\n- [Install FlashRT](#build--install)\n- [Quick Start](#quick-start)\n- [API snippets — Pi0 \u002F Pi0.5 \u002F GROOT \u002F Pi0-FAST \u002F Qwen3.6](#api-snippets)\n- [Qwen3.6-27B NVFP4 LLM path — quickstart, K selection, measured throughput](docs\u002Fqwen36_nvfp4.md) · [parameter reference](docs\u002Fqwen36_usage.md) · [OpenAI-compatible server example](examples\u002Fqwen36_openai_server.py)\n- [Adding a new model](docs\u002Fadding_new_model.md)\n- [Contributing](CONTRIBUTING.md)\n- [Architecture](docs\u002Farchitecture.md)\n\n## Quick Start\n\n> Already built? Run the snippet below. **Not yet built? See [Build & install](#build--install) first** — `cmake .. && make -j` produces the kernel `.so` files this snippet imports. About 6 minutes from `git clone` to first inference.\n\n```python\nimport flash_rt   # Python module name; project is FlashRT (see About)\n\nmodel = flash_rt.load_model(\n    checkpoint=\"\u002Fpath\u002Fto\u002Fpi05_checkpoint\",\n    config=\"pi05\",          # or \"pi0\", \"groot\", \"pi0fast\"\n    framework=\"torch\",      # or \"jax\"\n)\n\nactions = model.predict(\n    images=[base_img, wrist_img],\n    prompt=\"pick up the red block\",\n)\n# Pi0.5: actions shape (10, 7) — 10 future steps, 7 DOF\n```\n\nFirst call: ~3 s (calibration + CUDA Graph capture). Every subsequent call: 44 ms graph replay on Thor. No `.engine` file, no rebuild after restart. 
Full snippets for Pi0 \u002F GROOT \u002F Pi0-FAST in [API snippets](#api-snippets).\n\n## Start here\n\n| If you want to … | Read |\n|---|---|\n| **Run your first inference** | [Build & install](#build--install) — Docker and native Linux paths |\n| **See API examples for all 4 VLA models + the Qwen3.6 LLM** | [API snippets](#api-snippets) |\n| **Run Qwen3.6-27B NVFP4 (LLM, ~100 tok\u002Fs typical \u002F 129 tok\u002Fs peak on RTX 5090)** | [`docs\u002Fqwen36_nvfp4.md`](docs\u002Fqwen36_nvfp4.md) — quickstart, K selection, measured throughput · [`docs\u002Fqwen36_usage.md`](docs\u002Fqwen36_usage.md) — full parameter reference · [`examples\u002Fqwen36_openai_server.py`](examples\u002Fqwen36_openai_server.py) — OpenAI-compatible HTTP server |\n| **Look up the stable Python API surface** | [`docs\u002Fstable_api.md`](docs\u002Fstable_api.md) |\n| **Integrate a new model into FlashRT** | [`docs\u002Fadding_new_model.md`](docs\u002Fadding_new_model.md) — end-to-end walkthrough; external plugin pattern in [`docs\u002Fplugin_model_template.md`](docs\u002Fplugin_model_template.md) |\n| **Contribute a bug fix, benchmark, or model path** | [`CONTRIBUTING.md`](CONTRIBUTING.md) — development rules, validation expectations, and PR checklist |\n| **Understand the architecture** | [`docs\u002Farchitecture.md`](docs\u002Farchitecture.md) — the 8 infrastructure components and how they compose |\n| **Use a load-bearing API** (weight loading, attention, calibration) | [`docs\u002Fextension\u002Fweight_spec.md`](docs\u002Fextension\u002Fweight_spec.md) · [`docs\u002Fextension\u002Fattention_backend.md`](docs\u002Fextension\u002Fattention_backend.md) · [`docs\u002Fextension\u002Fcalibration.md`](docs\u002Fextension\u002Fcalibration.md) |\n| **See the supported models + measured performance** | [Performance](#performance) below |\n| **Know which GPUs have been tested (and how to contribute a run)** | [Tested hardware + Help needed](#tested-hardware--whats-theoretically-supported) |\n| 
**Know what kernels ship and whether they fit your model** | [`docs\u002Fkernel_catalog.md`](docs\u002Fkernel_catalog.md) — the \"parts list\" with a re-use decision tree |\n| **See which fusion patterns exist and why some were rejected** | [`docs\u002Fkernel_fusion.md`](docs\u002Fkernel_fusion.md) |\n| **Understand FP8 calibration mechanics** | [`docs\u002Fcalibration.md`](docs\u002Fcalibration.md) |\n| **Train a Pi0.5 LoRA fine-tune (FP8 + LoRA, plain or RECAP\u002FACP-conditioned, PyTorch *or* JAX)** | [`training\u002FREADME.md`](training\u002FREADME.md). JAX companion at [`training\u002Fjax\u002FREADME.md`](training\u002Fjax\u002FREADME.md) |\n| **Run advantage-conditioned (RECAP \u002F π\\*0.6) policies with classifier-free guidance** | [`docs\u002Frl_inference.md`](docs\u002Frl_inference.md) — PyTorch + JAX frontends both supported |\n| **See how FlashRT differs from TensorRT \u002F vLLM \u002F SGLang** | [`docs\u002Finference_engine_differences.md`](docs\u002Finference_engine_differences.md) |\n\n---\n\n\u003Ca name=\"performance\">\u003C\u002Fa>\n\n## Performance\n\n| Model | Hardware | Latency | Throughput |\n|-------|----------|---------|------------|\n| **Pi0.5** | **Jetson AGX Thor** (SM110) | **44 ms** | **23 Hz** |\n| **Pi0** | **Jetson AGX Thor** (SM110) | **46 ms** | **22 Hz** |\n| **Pi0.5** | **RTX 5090** (SM120) | **17.58 ms** (2v) | **57 Hz** |\n| **Pi0** | **RTX 5090** (SM120) | **18.43 ms** (1v) \u002F **21.16 ms** (2v) \u002F **24.48 ms** (3v) | **54 \u002F 47 \u002F 41 Hz** |\n| **GROOT N1.6** | **Jetson AGX Thor** (SM110) | **45 ms** (T=50) \u002F **41 ms** (T=16) | **22 \u002F 24 Hz** |\n| **GROOT N1.6** | **RTX 5090** (SM120) | **13.08 ms** (T=50, 2v) \u002F **12.53 ms** (T=16, 2v) | **76 \u002F 80 Hz** |\n| **Pi0-FAST** | **Jetson AGX Thor** (SM110) | **8.1 ms\u002Ftoken** (28 ms prefill + 8.1 × N decode) | **123 tok\u002Fs** |\n| **Pi0-FAST** | **RTX 5090** (SM120) | **2.39 ms\u002Ftoken** (11 ms prefill + 2.39 × N decode) | **418 
tok\u002Fs** |\n\n### LLM — Qwen3.6-27B NVFP4 (RTX 5090)\n\nSingle-stream chat-completion latency. NVFP4 W4A16 main weights +\nFP8→NVFP4-converted MTP head for K-step speculative decoding. All\nnumbers are decode-only tok\u002Fs (excluding prefill); same metric vLLM\nand TensorRT-LLM report.\n\n**Peak (single prompt, NTOK=128, no chat template)** — `\"Explain\nquantum entanglement in one short paragraph.\"`, 11 prompt tokens:\n\n| Configuration | Decode latency | Throughput |\n|---|---|---|\n| **Qwen3.6-27B NVFP4** + spec K=3 | **8.49 ms\u002Ftoken** | **117.8 tok\u002Fs** |\n| **Qwen3.6-27B NVFP4** + spec K=6 | **7.74 ms\u002Ftoken** | **128.9 tok\u002Fs** |\n\n**Real-world warm-state (chat-template + NTOK=256, mixed-task workload)**\n— measured across 6 prompts (EN chat \u002F EN reasoning \u002F CN chat \u002F CN\nfactual \u002F CN poetry \u002F Code) at K=6:\n\n| Stat | tok\u002Fs |\n|---|---:|\n| **mean** | **~93** |\n| min (Code prompt) | 75 |\n| max (CN factual prompt) | 101 |\n\nSo users can expect roughly **~100 tok\u002Fs typical** in production with\npeak around **129 tok\u002Fs** on the easiest prompt class. The cliff is\ncontent-dependent (drafter alignment with the input distribution):\n\n* **Instruction-following \u002F factual \u002F chat** prompts hit the headline\n  rate — drafter aligns well with the prompt distribution.\n* **Code generation** drops to ~75 tok\u002Fs — drafter has lower acceptance\n  on punctuation \u002F indentation \u002F bracket tokens. Same trade-off seen\n  in vLLM and SGLang spec decode.\n* **Long generations** (NTOK ≥ 256) shave ~5-10 tok\u002Fs vs short outputs\n  — drafter quality decays past the prompt's local distribution.\n* **First call** at a new (prompt_len, max_tokens) shape pays a\n  ~5-25 s CUDA-Graph capture cost. 
The bundled OpenAI server\n  ([`examples\u002Fqwen36_openai_server.py`](examples\u002Fqwen36_openai_server.py))\n  pre-captures common shapes at startup via `--warmup`.\n\nLong-context decode at fixed context length (TurboQuant packed KV\ncache, single-token forward, AL=3.17 amortization):\n\n| ctx | forward latency | est. tok\u002Fs with spec |\n|---|---|---|\n| 8 K | 26.6 ms | 119 |\n| 32 K | 38.7 ms | 81 |\n| 128 K | 87.7 ms | 36 |\n| **256 K** | **153 ms** | **21** ← single-card |\n\nCUDA Graph capture+replay at 32 K \u002F 64 K \u002F 128 K \u002F 256 K passes the\ncosine = 1.000000 gate (bit-identical token output across replays).\nTTFT scales linearly at ~22 ms \u002F prompt-token.\n\nThe full per-prompt variance breakdown (5 prompts × NTOK 128\u002F256 ×\nK=3\u002F6) is in [`docs\u002Fqwen36_nvfp4.md`](docs\u002Fqwen36_nvfp4.md) §3.\n\n### VRAM footprint (inference only, 2 views on RTX 5090)\n\nMeasured peak allocation during `model.predict()`:\n\n| Model | dtype | Checkpoint on disk | Peak VRAM |\n|---|---|---:|---:|\n| **Pi0** | fp16 | 6.5 GB | **~10 GB** |\n| **Pi0.5** | bf16 + FP8 | 13.5 GB | **~10 GB** |\n| **GROOT N1.6** | fp16 | 6.1 GB | **~9 GB** |\n| **Pi0-FAST** (jax) | bf16 | 11 GB | **~7 GB** |\n\nIncludes the CUDA context, cuBLASLt workspaces, FA2 scratch, and\nthe captured CUDA Graph. Thor (unified LPDDR 122 GiB) effectively\nhas no memory pressure. 
Practical card sizing:\n\n| Card | Works for |\n|---|---|\n| **8 GB** (RTX 3060 Ti \u002F 4060) | Pi0-FAST only; others will OOM at graph capture time |\n| **12 GB** (RTX 3080 \u002F 4070) | All four models with ~2 GB headroom |\n| **16 GB+** (RTX 4080 \u002F 4080 Super) | All four, comfortable |\n| **24 GB+** (RTX 4090 \u002F 3090 \u002F 5090) | All four, plus room for larger views \u002F longer prompts |\n\nMeasure locally by calling `torch.cuda.reset_peak_memory_stats()` after a\nwarmup call, running `model.predict(...)`, and then reading\n`torch.cuda.max_memory_allocated()`.\n\n> Pi0-FAST is autoregressive — total latency = `prefill + N × per-token decode`,\n> where `N` is the number of action tokens generated per inference (variable,\n> depends on the action sequence; typically 30–80 tokens). Throughput is reported\n> per token, not per inference, since \"1 inference\" is not a fixed unit.\n\n> Pi0 RTX 5090 latencies are steady-state p50 over 200 timed `infer` calls\n> (CUDA-graph replay, 100 warmup) on real LIBERO frames. JAX (Orbax) and\n> PyTorch (safetensors) frontends drive the same compiled pipeline, so\n> their measured latencies are within 0.1 ms of each other at every view\n> count.\n\n### Comparison\n\n| Solution | Hardware | Pi0 | Pi0.5 | GROOT N1.6 | Source |\n|---|---|---|---|---|---|\n| Original openpi (JAX, unoptimized) | Jetson Thor | — | **714 ms (1.4 Hz)** | — | [openpi](https:\u002F\u002Fgithub.com\u002FPhysical-Intelligence\u002Fopenpi) |\n| PyTorch naive | RTX 4090 | — | ~200 ms | — | HuggingFace LeRobot |\n| torch.compile | RTX 4090 | — | ~40 ms | — | HuggingFace LeRobot |\n| Triton-based VLA | RTX 5090 | — | 26.6 ms (2v) | — | arXiv 2510.26742 |\n| NVIDIA VLA-Perf | RTX 4090 | 31.06 ms (Pi0 3B) | — | — | arXiv 2602.18397 |\n| NVIDIA Isaac GR00T (TensorRT) | Jetson Thor | — | 91–95 ms (3v) | ~95 ms | [Isaac GR00T](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FIsaac-GR00T) |\n| **FlashRT** | **RTX 5090** | **21.16 ms** (2v) | **17.58 ms** (2v) | **13.08 ms** (T=50, 2v) | this work |\n| **FlashRT** | 
**Jetson Thor** | **46 ms** (2v) | **39.78 ms** (2v) \u002F **51.51 ms** (3v) (NVFP4) | **45 ms** (T=50, 2v) | this work |\n\nOn the same Jetson AGX Thor hardware, FlashRT goes from the original openpi JAX baseline (1.4 Hz) to 23 Hz (FP8) \u002F 25 Hz (NVFP4) — a **~16-18× speedup at zero accuracy loss** (cosine ≥ 0.9996 vs the production reference).\n\nFlashRT Pi0.5 Thor numbers above are the NVFP4 production preset (`use_fp4=True`); the FP8 baseline is 44.0 ms 2v \u002F 54.8 ms 3v at the same task success (491\u002F500). See [Latency (Thor)](#latency-thor) for the full sweep.\n\n### Tested hardware + what's theoretically supported\n\n**Verified working on: RTX 5090, RTX 4090, RTX 4060 Ti, Jetson AGX Thor.**\n\nCMake's `ENABLE_FA2` gate accepts **any card in SM80 \u002F 86 \u002F 89 \u002F 120**\n(Ampere through Blackwell consumer). That means A100, A10, RTX 3090,\n3080, A5000\u002FA6000, 4090, 4080, 4070, 4060 Ti, 5090 — all *should*\nbuild and run out of the box. \"Theoretical\" here just means the\nother cards haven't gone through the regression suite yet; the\nkernel set and dispatch paths are the same.\n\n### Help needed — hardware, robots, models\n\nThis is a solo project. If you have access to any of the following\nand are willing to kick the tires, please open an issue or PR with\nyour numbers \u002F logs:\n\n- **Other GPUs** (A100 \u002F 3090 \u002F 4080 \u002F 4070 \u002F 4060 Ti \u002F AGX Orin \u002F etc.) —\n  run `python examples\u002Fquickstart.py --checkpoint \u003C...> --benchmark 20`\n  and paste the P50 number plus `nvidia-smi` output.\n- **Real robot deployments** (LeRobot, custom arms, humanoid\n  platforms) — smoothness, crash-safety, end-to-end latency\n  including robot-side overhead.\n- **New VLA \u002F generative models** — Pi0.6, GR00T later versions,\n  custom DiT \u002F audio-gen \u002F video-gen backbones. 
See\n  [`docs\u002Fadding_new_model.md`](docs\u002Fadding_new_model.md) for the\n  integration walkthrough; [`docs\u002Fkernel_catalog.md`](docs\u002Fkernel_catalog.md)\n  has the parts list and a re-use decision tree for judging\n  whether FlashRT fits before you start wiring anything up.\n\nDrive-by benchmarks, bug reports, and \"this crashed on my X\" traces\nare all welcome. The footprint is small — one author, one laptop,\ntwo reference GPUs — so every independent data point genuinely\nmoves the project forward.\n\n---\n\n## Key techniques\n\nThe short version: kernel fusion + static FP8 + captured CUDA Graph\n+ vendored in-SO Flash-Attention 2. Hand-written CUDA kernels cover\nonly the memory-bound ops (norm, activation, fusion, quant);\ncompute-bound GEMM \u002F attention are delegated to cuBLASLt, CUTLASS,\nand the vendored FA2.\n\nFull details by topic:\n\n- [`docs\u002Fkernel_catalog.md`](docs\u002Fkernel_catalog.md) — every kernel\n  shipped, grouped by function, with a re-use decision tree for\n  non-VLA models.\n- [`docs\u002Fkernel_fusion.md`](docs\u002Fkernel_fusion.md) — production\n  fusion patterns, the four historical dead-end optimizations, and\n  why the current fusion set converged where it did.\n- [`docs\u002Fcalibration.md`](docs\u002Fcalibration.md) — FP8 static\n  calibration mechanics.\n- [`docs\u002Foptimization-details.md`](docs\u002Foptimization-details.md) —\n  line-by-line Pi0.5 latency breakdown (44 ms vs 70 ms baseline).\n\n---\n\n## API snippets\n\nAlready built? Jump to API examples below. Not yet built? 
See\n[Build & install](#build--install) for the full Docker \u002F native\nLinux flow, then come back.\n\n### 3 Lines of Code\n\n```python\nimport flash_rt\n\nmodel = flash_rt.load_model(\n    checkpoint=\"\u002Fpath\u002Fto\u002Fcheckpoint\",\n    framework=\"torch\",    # or \"jax\"\n    autotune=3,           # 0=off, 3=default, 5=thorough\n)\n\nactions = model.predict(\n    images=[base_img, wrist_img],\n    prompt=\"pick up the red block\",\n)\n# Pi0.5: actions shape (10, 7) — 10 future steps, 7 DOF\n\n# Pi0 (continuous state input):\nmodel = flash_rt.load_model(\n    checkpoint=\"\u002Fpath\u002Fto\u002Fpi0_checkpoint\",\n    config=\"pi0\",\n)\nactions = model.predict(\n    images=[base_img, wrist_img],\n    prompt=\"pick up the red block\",\n)\n# Pi0: actions shape (10, 7)\n# Note: Pi0 also accepts 'state' in observation dict for continuous state input\n\n# GROOT N1.6:\nmodel = flash_rt.load_model(\n    checkpoint=\"\u002Fpath\u002Fto\u002Fgroot_checkpoint\",\n    config=\"groot\",\n)\nactions = model.predict(images=[base_img, wrist_img], prompt=\"pick up the red block\")\n# GROOT: actions shape (50, 128) — 50 steps, 128-dim padded\n\n# Pi0-FAST (autoregressive — discrete token generation, not diffusion):\nmodel = flash_rt.load_model(\n    checkpoint=\"\u002Fpath\u002Fto\u002Fpi0_fast_base\",  # Orbax (jax) or safetensors-converted (torch)\n    config=\"pi0fast\",\n    framework=\"torch\",  # or \"jax\"\n)\nactions = model.predict(images=[base_img, wrist_img], prompt=\"pick up the red block\")\n# Pi0-FAST: action sequence is generated as discrete FAST tokens then decoded\n# to continuous actions via the FAST tokenizer (DCT inverse).\n\n# Pi0-FAST max-performance mode (for fixed-prompt 24h deployment):\nmodel = flash_rt.load_model(\n    checkpoint=\"\u002Fpath\u002Fto\u002Fpi0_fast_base\",\n    config=\"pi0fast\",\n    decode_cuda_graph=True,       # capture decode loop as CUDA Graph\n    decode_graph_steps=46,        # action tokens per inference (50 total 
with text prefix)\n)\n```\n\n#### Qwen3.6-27B NVFP4 (LLM, RTX 5090)\n\nThe LLM path uses a dedicated frontend — same kernel binary, separate\ngeneration API since chat completion has a different surface from VLA\ncontrol. See [`docs\u002Fqwen36_usage.md`](docs\u002Fqwen36_usage.md) for the\nfull parameter reference and [`docs\u002Fqwen36_nvfp4.md`](docs\u002Fqwen36_nvfp4.md)\nfor the K-curve \u002F measured throughput \u002F model-dependency notes.\n\n```python\nimport os\nimport torch\nfrom flash_rt.frontends.torch.qwen36_rtx import Qwen36TorchFrontendRtx\n\n# The NVFP4 ckpt has no MTP head; point this env var at a paired\n# FP8 ckpt directory that contains mtp.safetensors. Without it,\n# speculative decode is disabled (pure-decode still works at ~36 tok\u002Fs).\nos.environ[\"FLASHRT_QWEN36_MTP_CKPT_DIR\"] = \"\u002Fpath\u002Fto\u002Fqwen36_fp8_ckpt\"\n\nfe = Qwen36TorchFrontendRtx(\n    \"\u002Fpath\u002Fto\u002Fqwen36_nvfp4\",   # prithivMLmods\u002FQwen3.6-27B-NVFP4\n    quant=\"nvfp4\",\n)\n\nprompt = \"Explain quantum entanglement in one short paragraph.\"\ninput_ids = fe._tokenizer(prompt, return_tensors=\"pt\").input_ids.cuda()\n\nout = fe.generate_own_speculative_KN_nvfp4(\n    input_ids, max_new_tokens=256, K=6,   # K=6 peaks at NTOK\u003C=128\n)\ntext = fe._tokenizer.decode(out[0, input_ids.shape[1]:].tolist())\nprint(text)\n```\n\nFor an OpenAI-API-compatible HTTP server (chat completions, drop-in\nreplacement for `OpenAI(base_url=...)`), see\n[`examples\u002Fqwen36_openai_server.py`](examples\u002Fqwen36_openai_server.py):\n\n```bash\nexport FLASHRT_QWEN36_MTP_CKPT_DIR=\u002Fpath\u002Fto\u002Fqwen36_fp8_ckpt\npython examples\u002Fqwen36_openai_server.py \\\n    --checkpoint \u002Fpath\u002Fto\u002Fqwen36_nvfp4 \\\n    --port 8000 --K 6\n# Then: curl http:\u002F\u002Flocalhost:8000\u002Fv1\u002Fchat\u002Fcompletions ...\n```\n\n### Framework Choice\n\n| Checkpoint Format | `framework=` | Source |\n|-------------------|:---:|--------|\n| **safetensors** 
(HuggingFace\u002FPyTorch) | `\"torch\"` | `model.safetensors` |\n| **Orbax** (JAX\u002FPhysical Intelligence) | `\"jax\"` | `checkpoint\u002F` dir |\n\nBoth frontends produce equivalent results (cosine > 0.999) and share the same `flash_rt_kernels.so`.\n\n### Hardware Auto-Dispatch\n\nUser code does **not** need to know which GPU it's running on.\n`load_model()` inspects `torch.cuda.get_device_capability()` at call\ntime and routes to the best-matching backend automatically:\n\n| Compute capability | GPU | Backend |\n|---|---|---|\n| SM110 (11.0) | Jetson AGX Thor | `flash_rt.hardware.thor.*` |\n| SM120 (12.0) | RTX 5090 Blackwell | `flash_rt.hardware.rtx.*`, falling back to Thor for models without a 5090-native class (Pi0-FAST uses Thor's in-file SM120 runtime fork) |\n| SM89  (8.9)  | RTX 4090 Ada | `flash_rt.hardware.rtx.*` |\n\nOverride with `hardware=\"thor\"` \u002F `\"rtx_sm120\"` \u002F `\"rtx_sm89\"` for\ncross-hardware debugging — `\"auto\"` (default) is what you almost\nalways want. Unsupported SM levels raise a clear `RuntimeError` at\n`load_model` time rather than falling back silently, because a wrong\nbackend at runtime is more expensive to debug than a clean crash.\n\n```python\n# Same code path on every supported GPU. On an RTX 5090 this resolves\n# to RtxTorchGroot; on Jetson Thor it resolves to ThorPipelineTorchGroot.\nmodel = flash_rt.load_model(\n    \"\u002Fpath\u002Fto\u002Fgroot_checkpoint\",\n    config=\"groot\",\n    embodiment_tag=\"gr1\",     # see GROOT embodiment slots below\n)\n```\n\n### GROOT N1.6 embodiment slots\n\nGROOT's per-embodiment MLPs (state encoder \u002F action encoder \u002F action\ndecoder) live in 32 parallel slots inside a single checkpoint. In the\n`GR00T-N1.6-3B` base checkpoint only a subset of those slots are\nactually trained — the rest are at initialization std ~0.02 and emit\nnoise-like actions regardless of input. 
**Pick a trained slot for any\ndemo or deployment**:\n\n| `embodiment_tag=` | Slot | Description |\n|---|---|---|\n| `gr1` | 20 | GR1 humanoid, 1 camera view. Good default for single-cam demos. |\n| `robocasa_panda_omron` | 13 | Tabletop arm + mobile base, 3 camera views |\n| `behavior_r1_pro` | 24 | BEHAVIOR humanoid, 3 camera views |\n| `new_embodiment` | 10 | Placeholder for fine-tuning (UNTRAINED in base) |\n\nAny other tag in the map (`libero_panda`, `oxe_google`, `oxe_widowx`,\n`unitree_g1`, `oxe_droid`) is **untrained** in the base 3B checkpoint\nand logs a warning at load time. Fine-tune one of those slots\nyourself or pick a trained tag for immediate use.\n\n### Autotune\n\nCUDA Graph instantiation is non-deterministic on Thor — the same kernels can produce different schedules with ~2ms variance. `autotune` recaptures until a fast schedule is found:\n\n| `autotune=` | Behavior | Extra Startup |\n|-------------|----------|---------------|\n| `0` or `False` | Off — single capture, may be 2ms slower | 0 |\n| `3` (default) | Retry up to 3× — usually finds fast graph on trial 0 | ~1s |\n| `5` | Retry up to 5× — better chance for JAX | ~2.5s |\n| `True` | Same as `3` | ~1s |\n\n### Pi0-FAST Performance Modes\n\nPi0-FAST supports two decode modes, controlled by `decode_cuda_graph`:\n\n| Parameter | `set_prompt` (cold) | `set_prompt` (cached) | 50-token E2E | Best for |\n|-----------|--------------------:|----------------------:|-------------:|----------|\n| `decode_cuda_graph=False` (default) | ~2.5 s | **~0.1 s** | **~464 ms** | Frequent prompt changes |\n| `decode_cuda_graph=True` | ~4.0 s | **~1.5 s** | **~431 ms** | Fixed prompt, 24h deployment |\n\n**How it works:**\n\n- **Default mode** (`decode_cuda_graph=False`): Each decode token runs through a\n  Python loop with per-step kernel launches. Lowest startup cost. 
FP8 calibration\n  scales are cached to `~\u002F.flash_rt\u002Fcalibration\u002F` after the first run — subsequent\n  `set_prompt` calls with the same checkpoint skip the 2.4s calibration entirely.\n\n- **Max-performance mode** (`decode_cuda_graph=True`): The action-phase decode loop\n  is captured as a single CUDA Graph (same technique as Pi0's diffusion loop).\n  Eliminates all Python dispatch overhead during decode. Adds ~1.5s to `set_prompt`\n  for graph capture, but saves ~33 ms per 50-token inference.\n  Break-even at ~45 inferences.\n\n```python\n# Default: good for interactive \u002F multi-prompt scenarios\nmodel = flash_rt.load_model(checkpoint, config=\"pi0fast\")\nmodel.set_prompt(\"pick up the red block\", state=state)\n# set_prompt: 0.1s (cached) \u002F 2.5s (first run)\n# infer: ~464 ms per 50-token sequence\n\n# Max-performance: best for fixed-prompt continuous control\nmodel = flash_rt.load_model(\n    checkpoint, config=\"pi0fast\",\n    decode_cuda_graph=True,\n    decode_graph_steps=46,    # covers sequences up to 46 action tokens (50 total)\n)\nmodel.set_prompt(\"pick up the red block\", state=state)\n# set_prompt: 1.5s (cached) \u002F 4.0s (first run)\n# infer: ~431 ms per 50-token sequence\n```\n\n**Calibration caching**: FP8 activation scales are automatically cached per\ncheckpoint and sequence length. Delete `~\u002F.flash_rt\u002Fcalibration\u002F` to force\nrecalibration. The first `infer()` call always recalibrates with real image\ndata regardless of cache.\n\n### NVFP4 encoder FFN (Pi0.5 only)\n\nOptional NVFP4 (Blackwell block-scaled FP4) quantization on the Pi0.5 encoder\nFFN stack. 
Currently implemented for **Pi0.5 torch only** — passing\n`use_fp4=True` with any other config (pi0 \u002F groot \u002F pi0fast) emits a warning\nand falls back to FP8.\n\n```python\nmodel = flash_rt.load_model(\n    checkpoint,\n    config=\"pi05\",\n    use_fp4=True,    # single flag → enables the production-validated preset\n)\n```\n\n`use_fp4=True` resolves to the best-known production preset automatically:\n- `fp4_layers` = full 18 encoder FFN layers\n- `use_awq` = `True` — activation-aware weight quantization (AWQ)\n- `use_p1_split_gu` = `True` — P1 split-GU 2-GEMM path\n\nAdvanced users can override any sub-flag explicitly at `load_model()` call\ntime (e.g. `fp4_layers=(7, 8, 9), use_awq=False` reverts to the conservative\nL7-9 subset).\n\n**What it does**:\n- Gate+Up and Down GEMMs across all 18 encoder FFN layers run in NVFP4\n  (block-size 16, UE4M3 block scales) instead of FP8.\n- **AWQ** applies activation-aware per-input-channel pre-scaling to the\n  quantized weights, with the inverse scale fused into pre-GEMM kernels\n  (`residual_add_rms_norm_mul_fp4_sfa`, `geglu_two_mul_fp4_to_fp4`). This\n  preserves precision under 18-layer FP4 (without AWQ, full-scope FP4 cos\n  drops from ~0.998 to ~0.33 due to cumulative multi-layer drift).\n- **P1 split-GU** splits the merged Gate+Up GEMM into separate gate_proj \u002F\n  up_proj NVFP4 GEMMs that emit packed FP4 + SFA directly (via\n  `LinCombBlockScaleFactor` epilogue), combined by a dedicated\n  `geglu_two_mul_fp4_to_fp4` kernel. Eliminates ~31 MB\u002Flayer of DRAM\n  round-trips vs the merged-GU path.\n- Residual stream stays fp16 through the FP4 region (NVIDIA\n  `enable_llm_nvfp4` style — `output_quantizer` disabled).\n\n**Requirements**:\n- SM100+ GPU (validated on Thor SM110). 
Non-SM100 hardware silently falls\n  back to FP8.\n- `flash_rt_fp4.so` extension (built alongside `flash_rt_kernels.so`).\n\n**Measured on Thor SM110, Pi0.5 \u002F LIBERO Spatial 10 × 50 = 500 episodes**:\n\n| Config | Task success | E2E P50 (normal) |\n|---|---|---|\n| FP8 baseline | 491 \u002F 500 (98.2%) | ~43.5 ms |\n| **NVFP4 full-18 + AWQ + P1 (`--use_fp4`)** | **491 \u002F 500 (98.2%)** | **~43.5 ms** |\n\nTask-level parity with the FP8 baseline (491\u002F500 for both — P1 + AWQ\npreserves FP4 precision across all 18 FFN layers).\n\n**Replay-latency benchmark (1-view \u002F 2-view \u002F 3-view, N=8 LIBERO\nstratified calibration, 50 graph replays, Thor SM110)**:\n\n| Config | 1-view | 2-view | 3-view | cos vs PyTorch FP32 ref (3v) |\n|---|---|---|---|---|\n| FP8 baseline (torch) | 34.06 ms | 41.79 ms | 55.46 ms | 0.999236 |\n| **NVFP4 encoder (torch)** | **31.91 ms** | **39.78 ms** | **51.51 ms** | **0.998932** |\n| **NVFP4 encoder (jax, Orbax)** | **34.39 ms** | **43.65 ms** | **56.90 ms** | **0.999030** |\n\nEncoder FP4 preserves cosine **≥ 0.9989** vs the PyTorch FP32 reference\nacross view counts, with no latency regression relative to the FP8\ntorch baseline. The JAX FP4 path derives NVFP4 weights directly from the\nOrbax checkpoint (no torch dependency at runtime) and uses the same\ntwo-phase multi-sample calibration flow as the torch FP4 path, producing\na slightly higher cos (0.99903 vs 0.99893 at 3v, same AWQ refit tuning).\nReproduce with\n[`tests\u002Fbench_pi05_thor_views.py`](tests\u002Fbench_pi05_thor_views.py)\n(defaults now include `jax_fp4`).\n\n**What's next**:\n- Decoder FP4 (S2 precision-validated set — 72 weight tensors, ~-6 ms estimated)\n- `geglu_two_mul` SFA-prefetch optimization (O1, ~-0.5-1.1 ms)\n- SigLIP FFN FP4 \u002F AWQ auto-tune \u002F Pi0.6 port\n\n---\n\n## Build & install\n\nThis is the hands-on \"go from a fresh machine to a green benchmark\"\nsection. 
For a single-page install reference (prerequisites,\ntroubleshooting table, JAX\u002Ftransformers pin rationale) see\n[`docs\u002FINSTALL.md`](docs\u002FINSTALL.md).\n\nDocker and native Linux paths both produce the same two\nextension modules:\n\n| Artifact | Size | What it contains |\n|---|---|---|\n| `flash_rt\u002Fflash_rt_kernels.so` | ~3 MB | Hand-written memory-bound kernels (norm, activation, fusion, FP8 quant, cuBLASLt wrappers, Thor FMHA). **Always built.** |\n| `flash_rt\u002Fflash_rt_fa2.so` | ~135 MB | Vendored Flash-Attention 2 v2.7.4.post1 fwd (fp16 + bf16, SM80\u002F86\u002F89\u002F120). **Built only on RTX targets** — Thor skips it and uses `fvk.attention_qkv_fp16` (cuBLAS-decomposed) for attention instead. |\n\n**Crucially — no `pip install flash-attn` required.** The FA2 kernel\nis vendored at source level and built into `flash_rt_fa2.so` during\n`cmake`\u002F`make`; at runtime `import flash_rt` loads both .so files\ndirectly, so you never hit the `flash-attn` wheel's\n`torch × CUDA × driver × glibc` compatibility matrix. 
Setting\n`FVK_RTX_FA2=0` is still supported as a fall-back to `pip flash-attn`\nfor debugging, but the default path has zero pip-wheel dependency.\n\n### Option A — Prebuilt Docker image (fastest, recommended)\n\nThe published image already has CUDA 13.0, PyTorch 2.9, the\nFlashRT kernels prebuilt, and CUTLASS vendored — pull and run, no\nlocal compile, no `flash-attn` wheel hunting:\n\n```bash\ndocker pull ghcr.io\u002Fliangsu8899\u002Fflashrt:latest\ndocker run --rm --gpus all -it ghcr.io\u002Fliangsu8899\u002Fflashrt:latest\n# Drops you in a Python REPL with `flash_rt` already imported.\n```\n\nFor Modal \u002F RunPod \u002F Vast and other cloud runners, point the image\nconfig at the same registry — Modal cold-start drops from a 10-minute\nkernel compile to a ~30-second pull:\n\n```python\nimage = modal.Image.from_registry(\"ghcr.io\u002Fliangsu8899\u002Fflashrt:0.2.0\")\n```\n\nTags + advanced usage (build args, slim variants, mounting checkpoints):\nsee [`docker\u002FREADME.md`](docker\u002FREADME.md).\n\n> **Thor (SM110)** is not covered by this image — Jetson is ARM64 and\n> uses a different NVIDIA base. Thor users follow Option C below.\n\n### Option B — Build the Docker image yourself\n\nIf you need a different GPU arch, want to pin a specific commit, or\nprefer to vet the image source:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FLiangSu8899\u002FFlashRT.git\ncd FlashRT\ndocker build -t flashrt:dev -f docker\u002FDockerfile .\ndocker run --rm --gpus all -it flashrt:dev\n```\n\nBuild args (`GPU_ARCH`, `FA2_HDIMS`, `BASE_IMAGE`, `CUTLASS_REF`)\ndocumented in [`docker\u002FREADME.md`](docker\u002FREADME.md). 
Cold build on a\nfresh host is ~25 min (NGC pull + FA2 codegen); warm rebuild ~12 min.\n\n### Option C — Native Linux (no Docker)\n\nSystem requirements:\n\n| Component | Minimum | Notes |\n|---|---|---|\n| GPU | SM80+ (A100, 30xx+, Thor, 4090, 5090) | |\n| NVIDIA driver | 545+ for CUDA 13, 525+ for CUDA 12.4 | 5090 needs 550+ |\n| CUDA Toolkit | 12.4+ (Thor\u002FHopper) or 12.8+ (Blackwell) | CUDA 13 recommended on 5090 |\n| Python | 3.10 \u002F 3.11 \u002F 3.12 | 3.12 on the default NGC image |\n| GCC\u002FG++ | 11+ with C++17 | |\n| CMake | 3.24+ | |\n\n**Create an isolated Python environment first.** The build step calls\n`python3 -m pybind11 --cmakedir` to locate pybind11 headers, so the\nPython that runs `cmake ..` MUST be the same interpreter the `.so`\nfiles will be imported from. System-Python + conda-Python mix-ups are\nthe #1 native-install failure mode.\n\n```bash\npython3.12 -m venv .venv         # 3.10 \u002F 3.11 \u002F 3.12 all supported\nsource .venv\u002Fbin\u002Factivate\n```\n\nMinimum pip list (for the `torch` frontend; everything **must** be\ninstalled *before* `cmake ..`):\n\n```bash\n# 1. PyTorch matching your CUDA:\npip install torch --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu128   # 5090 \u002F CUDA 12.8+\n# or\npip install torch --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu124   # 4090 \u002F A100 \u002F Thor\n\n# 2. Build helpers\npip install pybind11 cmake \"numpy>=1.24\" safetensors\n\n# 3. Runtime \u002F benchmarking\n#    transformers is pinned \u003C4.56 because the Pi0.5 PaliGemma tokenizer\n#    path broke in 4.56+; drop the upper bound once we verify the new\n#    tokenizer API.\npip install \"transformers\u003C4.56\" pandas pillow pyarrow\n\n# 4. 
JAX-side (optional — only if you will load Orbax checkpoints).\n#    Versions are pinned because the Orbax\u002Fjaxlib\u002FPJRT plugin ABI is\n#    not stable across minor releases; upgrading any of the four\n#    without matching the others is a reliable way to get cryptic\n#    \"PJRT device not registered\" errors at import time. Pin bump is\n#    tracked upstream — see docs\u002FINSTALL.md §JAX for rationale.\npip install jax==0.5.3 jax-cuda12-pjrt==0.5.3 jax-cuda12-plugin==0.5.3 ml_dtypes==0.5.3\n```\n\nThen build:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FLiangSu8899\u002FFlashRT.git\ncd FlashRT\ngit clone --depth 1 --branch v4.4.2 \\\n    https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass.git third_party\u002Fcutlass\n\npip install -e \".[torch]\"          # or \"[jax]\" \u002F \"[all]\"\n# NOTE: editable mode (-e) is required. The cmake build below drops\n# compiled .so files into flash_rt\u002F in the source tree; editable\n# install makes that directory importable directly. A non-editable\n# `pip install .` would install a copy BEFORE the .so files exist and\n# `import flash_rt` would fail at runtime with a missing-module error.\n\ncmake -B build -S .                 # auto-detects GPU arch\ncmake --build build -j$(nproc)\n# CMake writes .so files directly into flash_rt\u002F — no `cp` \u002F\n# `make install` \u002F `ninja install` step needed.\n```\n\n### GPU arch override\n\nCMake reads `nvidia-smi --query-gpu=compute_cap` to pick the target\narch. Override for cross-compilation or when auto-detect fails:\n\n```bash\ncmake -B build -S . -DGPU_ARCH=110   # Jetson AGX Thor   (FA2 skipped, CUTLASS SM100 path ON)\ncmake -B build -S . -DGPU_ARCH=120   # RTX 5090           (FA2 sm_80+sm_120 AOT, NVFP4 ON)\ncmake -B build -S . -DGPU_ARCH=89    # RTX 4090           (FA2 sm_80 AOT natively runs on Ada)\ncmake -B build -S . -DGPU_ARCH=86    # RTX 3090 \u002F A10     (FA2 sm_80 AOT)\ncmake -B build -S . 
-DGPU_ARCH=80    # A100               (FA2 sm_80 AOT)\n```\n\nFA2 is enabled by CMake when `GPU_ARCH ∈ {80, 86, 89, 120}`. Other\narches (notably Thor SM110 and SM90 Hopper) route attention through\nthe cuBLAS-decomposed `fvk.attention_qkv_fp16` path instead of FA2 —\n`flash_rt_fa2.so` simply isn't built, and no runtime error results.\n\n### Build timing (one-time)\n\nOn a 5090 with CUDA 13 in a warm container, `make -j$(nproc)`:\n\n| Target | Time |\n|---|---|\n| `flash_rt_kernels` (main kernels) | ~2 min |\n| `flash_rt_fa2` (FA2 vendor, default — 12 kernel .cu files × 3 arches) | **~4.5 min** (267 s) |\n| Full `make -j$(nproc)` | ~6.5 min |\n\nSubsequent rebuilds of only the hand-written kernels take ~2 min —\nFA2 is a separate CMake target and is only re-linked, not recompiled,\nunless the vendored source itself changes.\n\n### Slim-build flags (developer iteration speed)\n\nFA2's CUTLASS 3.x templates dominate cold-build cost. The default\nmatrix covers every RTX family card × fp16+bf16 × all 3 hdim\nbuckets, which is right for distribution but overkill when you're\niterating on a single 5090\u002F4090 and a single model family. 
Three\nopt-in CMake flags trade binary coverage for iteration speed:\n\n| Flag | Default | What it does | `fa2` cold build on 5090 |\n|---|---|---|---|\n| — | (none) | 12 .cu × sm_80 + sm_120 + PTX fallback | **267 s (4.5 min)** |\n| `-DFA2_ARCH_NATIVE_ONLY=ON` | OFF | Only emit SASS for the detected GPU; skip sm_80 + PTX passes | **110 s** (−59%) |\n| `-DFA2_HDIMS=\"96;256\"` | `\"96;128;256\"` | Drop `head_dim=128` (shipped models don't use it; reserved for future DiT variants) | **210 s** (−21%) |\n| `-DFA2_DTYPES=\"fp16\"` | `\"fp16;bf16\"` | Drop bf16 (Pi0 is fp16-only; Pi0.5 \u002F GROOT need bf16) | **179 s** (−33%) |\n| `-DFA2_ARCH_NATIVE_ONLY=ON -DFA2_HDIMS=\"96;256\" -DFA2_DTYPES=\"fp16\"` | — | All three combined (single-card + pi0-only) | **87 s** (−67%) |\n\nShipped `flash_rt_fa2.so` size also shrinks — the all-three-slim\nbuild produces **17.8 MB** (vs 135 MB default), an **87% reduction**\nin binary size on the FA2 module.\n\nDropped entries still resolve at the Python layer — calling a\nstubbed entry (e.g. `fa2.fwd_bf16` on a build with\n`FA2_DTYPES=\"fp16\"`) aborts the process with a clear\n\"rebuild with -DFA2_DTYPES=…\" message instead of linker errors or\nsilent wrong output.\n\n### ccache (iterative C++ rebuild speedup)\n\nIf `ccache` is on PATH at CMake-config time, it is enabled\nautomatically for both C++ and CUDA compiles. First build is\nunchanged. Hit rate on the `.cpp` side (pybind bindings) is high,\nso repeat edits to `csrc\u002Fbindings.cpp` \u002F `csrc\u002Ffa2_bindings.cpp` get\nfast rebuilds. For CUDA `.cu` files, nvcc's invocation style makes the\n`ccache` hit rate unreliable, so treat CUDA speedup as a bonus\nrather than a guarantee. 
Tip: set `CCACHE_DIR` to a host-mounted\npath so the cache survives container rebuilds.\n\nInstall via `apt-get install ccache` (Ubuntu) or equivalent.\n\n### Verify\n\n```bash\npython examples\u002Fquickstart.py \\\n    --checkpoint \u002Fpath\u002Fto\u002Fpi05_checkpoint \\\n    --benchmark 20\n```\n\nExpected (default `--num_views 2`): `P50: ~44 ms (23 Hz)` on Thor.\nOn RTX 5090 pure replay is ~17.6 ms (57 Hz); `quickstart.py` reports\nthe end-to-end wall clock (~19.5 ms \u002F 51 Hz) because it wraps\n`model.predict(...)` with `time.perf_counter` and therefore also\ncounts the graph-external image normalization, upload, download, and\nun-normalization. For the pure-replay number, time\n`model._pipe._enc_ae_graph.replay()` between `cuda.Event` markers —\nsee [Measurement protocol](#measurement-protocol).\n\n**GROOT N1.6:**\n```bash\npython examples\u002Fquickstart.py \\\n    --checkpoint \u002Fpath\u002Fto\u002Fgroot_checkpoint \\\n    --config groot \\\n    --benchmark 20\n```\n\nExpected: `P50: ~44 ms (23 Hz)` on Thor.\n\n---\n\n## Architecture\n\nFlashRT is layered so that **framework-specific IO** (safetensors \u002F Orbax),\n**declarative weight loading**, **framework-agnostic compute** (pointer-only\npipelines), and **hardware-dispatched attention kernels** each live in their\nown module. 
Adding a new model touches at most one file per layer; adding a\nnew GPU target touches only `hardware\u002F`.\n\n```\nflash_rt\u002F\n├── api.py                     ← Public API: load_model() + VLAModel.predict()\n│\n├── hardware\u002F                  ← Hardware-dispatch + attention protocol\n│   ├── __init__.py            ←   detect_arch() + _PIPELINE_MAP\n│   ├── backend.py             ←   AttentionBackend protocol + SiteSpec\n│   ├── thor\u002F                  ←   Thor SM110 (Jetson AGX Thor)\n│   │   ├── attn_backend.py        ← ThorFlashAttnBackend (Pi0.5\u002FPi0)\n│   │   ├── attn_backend_groot.py  ← ThorGrootAttnBackend (GROOT Qwen3+DiT)\n│   │   └── shared_primitives.py   ← SigLIP\u002FEncoder\u002FDecoder primitives + calibrate\n│   └── rtx\u002F                   ←   RTX SM120\u002FSM89 (RTX 5090 \u002F 4090)\n│\n├── executors\u002F                 ← Declarative WEIGHT_SPEC framework (stage 7)\n│   ├── weight_loader.py       ←   Item \u002F LayerBlock \u002F ModelWeightSpec + runner\n│   ├── torch_weights.py       ←   SafetensorsSource + FusedQKV\u002FFusedGateUp\n│   └── jax_weights.py         ←   OrbaxDictSource + CudaBufferFlat\n│\n├── models\u002F                    ← Framework-agnostic pipeline forwards\n│   ├── pi05\u002Fpipeline.py       ←   Pi0.5 RTX pipeline class\n│   ├── pi0\u002Fpipeline.py        ←   Pi0 decoder_forward (Thor+RTX)\n│   ├── pi0fast\u002Fpipeline.py    ←   Pi0-FAST prefill + AR decode (runtime fork)\n│   └── groot\u002F                 ←   GROOT DiT + embodiments\n│       ├── pipeline.py            ← RTX GROOT\n│       ├── pipeline_thor.py       ← Thor GROOT (CKernelQwen3, CKernelDiTHead)\n│       └── embodiments.py         ← per-embodiment state\u002Faction heads\n│\n├── frontends\u002F                 ← Per-framework weight loading + CUDA Graph + infer\n│   ├── torch\u002F\n│   │   ├── pi05_thor.py       ←   Pi0.5 Thor (PyTorch + safetensors)\n│   │   ├── pi0_thor.py        ←   Pi0 Thor\n│   │   ├── groot_thor.py      ←   
GROOT Thor\n│   │   ├── pi0fast.py         ←   Pi0-FAST (Thor+RTX runtime fork)\n│   │   ├── pi05.py, groot.py  ←   RTX variants\n│   │   └── _*_thor_spec.py    ←   Declarative WEIGHT_SPEC per model\n│   └── jax\u002F\n│       ├── pi05_thor.py       ←   Pi0.5 Thor (JAX + Orbax)\n│       ├── pi0_thor.py        ←   Pi0 Thor\n│       ├── pi0fast.py         ←   Pi0-FAST\n│       └── _*_thor_spec.py    ←   Declarative WEIGHT_SPEC per model\n│\n├── core\u002F                      ← Shared infrastructure\n│   ├── cuda_buffer.py         ←   CudaBuffer (cudaMalloc wrapper, JAX bridge)\n│   ├── cuda_graph.py          ←   CUDA Graph capture helpers\n│   ├── thor_frontend_utils.py ←   quant_fp8, interleave_qk, embed_prompt\n│   ├── quant\u002Fcalibrator.py    ←   FP8 calibration cache (save\u002Fload)\n│   └── weights\u002F               ←   loader.py, weight_cache, transformer\n│\n├── flash_rt\u002Fconfigs\u002F         ← Per-model YAML configs (pi05.yaml, etc.)\n└── flash_rt_kernels.*.so     ← 93 CUDA kernels (pybind11 — built from csrc\u002F)\n\ncsrc\u002F                       ← C++\u002FCUDA source (compiled once, .so kept in repo)\n├── kernels\u002F                ← norm, activation, rope, quantize, fusion\n├── gemm\u002F                   ← cuBLASLt FP8 + CUTLASS FP8 helpers\n├── attention\u002F              ← CUTLASS FMHA (strided, per-view)\n└── bindings.cpp            ← pybind11 → flash_rt_kernels.so\n\ndocs\u002F                       ← Documentation\n├── stable_api.md           ← Public API + naming convention\n├── adding_new_model.md     ← End-to-end guide for adapting a new VLA model\n├── calibration.md          ← FP8 weight\u002Factivation scale mechanics\n├── kernel_fusion.md        ← 93 kernel reference + fusion patterns\n├── optimization-details.md ← Pi0.5 44ms vs Myelin 70ms breakdown\n└── plugin_model_template.md ← External-plugin model registration\n\ntests\u002F                      ← Precision + unit tests\n├── test_all_models_precision.py   ← End-to-end 
cos + P50 sweep (4 models)\n├── test_weight_loader.py           ← WEIGHT_SPEC protocols + composites\n├── test_thor_attn_backend.py       ← Pi0.5\u002FPi0 AttentionBackend contract\n├── test_thor_groot_attn_backend.py ← GROOT AttentionBackend contract\n└── test_pi0fast_precision.py       ← Pi0-FAST AR decode precision\n\nexamples\u002F\n├── quickstart.py           ← 3-line usage demo\n└── thor\u002Feval_libero.py     ← LIBERO benchmark\n```\n\n### Key Design Principles\n\n1. **Pipeline forward receives only int pointers** — no torch, no jax, no\n   framework imports. Safe for CUDA Graph capture.\n2. **Weight loading is declarative** — each model exports a\n   `ModelWeightSpec` (composition of `LayerBlock`s + `Item`s). The\n   `WeightLoader` runner executes it over a framework-specific source\n   (safetensors for torch, Orbax `engine_w` dict for jax). Adding a new\n   PaliGemma-family model is a ~60-line spec file plus optional composites.\n3. **Attention is protocolized** — `AttentionBackend.run(site=..., layer_idx=..., ...)`\n   dispatches across `fmha_strided_full` (SigLIP),\n   `attention_qkv_fp16` (GQA), `attention_qkv_fp16_state_masked`\n   (Pi0-style), and `attention_mha_fp16` (GROOT) without model code\n   knowing which kernel fires.\n4. **Hardware-dispatched via `_PIPELINE_MAP`** — `(config, framework, arch)\n   → (module, class)` is the single source of truth for which frontend\n   loads on Thor SM110 vs RTX SM120 vs RTX SM89. External plugins can\n   mutate the map at import time (see\n   [`docs\u002Fplugin_model_template.md`](docs\u002Fplugin_model_template.md)).\n5. **Calibration is framework-agnostic and cached** — FP8 activation scales\n   are computed once per `(checkpoint, seq_len)` pair, cached to\n   `~\u002F.flash_rt\u002Fcalibration\u002F`, then baked as host-scalar alphas\n   (`act_scale × weight_scale`) into every CUDA Graph capture. See\n   [`docs\u002Fcalibration.md`](docs\u002Fcalibration.md).\n6. 
**CUDA Graph captures the entire forward** — Python loop unrolled at\n   capture time, zero overhead at replay. All intermediate buffers must\n   be pre-allocated in `_load_weights`; no dynamic allocation inside\n   forward (see [`docs\u002Fkernel_fusion.md`](docs\u002Fkernel_fusion.md) §6).\n\n---\n\n## Supported Models\n\nLatency columns below are **2-view**, pure CUDA Graph replay (p50, see\n[Measurement protocol](#measurement-protocol)). All per-view\nbreakdowns live in the Latency sections further down.\n\n| Model | Architecture | Latency (Thor, 2v) | Latency (RTX 5090, 2v) | Source |\n|-------|-------------|:-:|:-:|--------|\n| [**Pi0.5**](https:\u002F\u002Fgithub.com\u002FPhysical-Intelligence\u002Fopenpi) | PaliGemma 2B encoder + 300M decoder, 10-step diffusion | **44 ms** | **17.58 ms** | Physical Intelligence |\n| [**Pi0**](https:\u002F\u002Fgithub.com\u002FPhysical-Intelligence\u002Fopenpi) | Same as Pi0.5, with continuous state input | **46 ms** | (Thor class w\u002F SM120 fork) | Physical Intelligence |\n| [**GROOT N1.6**](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FIsaac-GR00T) | Eagle3-VL + Qwen3 1.7B + AlternateVLDiT 32L, 4-step flow matching | **45 ms** (T=50) \u002F **41 ms** (T=16) | **13.08 ms** (T=50) \u002F **12.53 ms** (T=16) | NVIDIA |\n| [**Pi0-FAST**](https:\u002F\u002Fgithub.com\u002FPhysical-Intelligence\u002Fopenpi) | Gemma 2B autoregressive, FAST tokenizer | **8.1 ms\u002Ftoken**, ~431 ms (50 tok) | **2.39 ms\u002Ftoken**, ~140 ms (50 tok, max-perf) | Physical Intelligence |\n\n---\n\n## Hardware Support\n\n| Feature | Thor (SM110) | RTX 5090 (SM120) |\n|---------|:----------:|:----------:|\n| FP8 GEMM | CUTLASS | cuBLASLt |\n| NVFP4 GEMM | — | CUTLASS |\n| Attention | CUTLASS FMHA | FlashAttention-2 |\n| CUDA Graph | Full E2E | Full E2E |\n| Status | **Production** | **Production** |\n\n---\n\n## Precision (Thor, 2-view LIBERO)\n\nCosine similarity measured with matched noise injection.\n\n| Comparison | Cosine 
|\n|-----------|--------|\n| FlashRT Torch vs Production | **0.9996** |\n| FlashRT JAX vs Production | **0.9999** |\n| FlashRT Torch vs JAX | **0.9998** |\n\n**Module-level byte-exact verification** (same input → same output):\n- SigLIP (27 layers): byte-exact\n- Encoder (18 layers): byte-exact\n- Decoder (18 layers × 10 steps): byte-exact\n\n## Latency (Thor)\n\n### Pi0.5\n\n| Frontend | 1-view | 2-view | 3-view |\n|----------|--------|--------|--------|\n| **FlashRT Torch** | **36.5 ms** (27 Hz) | **44.0 ms** (23 Hz) | **54.8 ms** (18 Hz) |\n| **FlashRT JAX** (autotune=5) | **37.3 ms** (27 Hz) | **44.9 ms** (22 Hz) | **54.4 ms** (18 Hz) |\n| NVIDIA TensorRT baseline | — | 91–95 ms | — |\n\n### Pi0\n\n| Frontend | 1-view | 2-view | 3-view |\n|----------|--------|--------|--------|\n| **FlashRT Torch** (autotune=5) | **37.6 ms** (27 Hz) | **45.8 ms** (22 Hz) | **56.7 ms** (18 Hz) |\n| **FlashRT JAX** (autotune=5) | **37.8 ms** (26 Hz) | **45.8 ms** (22 Hz) | **55.9 ms** (18 Hz) |\n\nEach additional camera view adds roughly 8–11 ms in the tables above (256 extra SigLIP tokens → more encoder DRAM traffic + SigLIP forward).\n\nE2E precision: cosine **0.998** vs FP16 PyTorch reference (Torch and JAX both).\n\n### GROOT N1.6\n\n| Stage | T=16 (LIBERO) | T=50 (padded max) | Method |\n|-------|---------------|-------------------|--------|\n| SigLIP (2 views, CUDA Graph) | 6.0 ms | 6.0 ms | Batched 2-view + Graph |\n| Qwen3 16L (CUDA Graph) | 8.8 ms | 8.8 ms | FP8 GEMM (calibrated act scales) + C kernel attention |\n| DiT 32L x 4 steps (CUDA Graph) | 26 ms | 30 ms | FP8 + cuBLASLt epilogue fusion + cross-KV precompute |\n| **Full E2E (image to action)** | **41 ms** (24 Hz) | **45 ms** (22 Hz) | All CUDA Graph |\n\nT = action_horizon. T=50 is the padded max across all embodiments (used in production). T=16 is LIBERO-specific.\n\nE2E precision: cosine **0.999** vs FP32 PyTorch reference. 
NVIDIA PyTorch baseline: ~95 ms.\nFP8 activation scales are calibrated per-layer for both Qwen3 and DiT and cached to `~\u002F.flash_rt\u002Fcalibration\u002F`.\n\n### Pi0-FAST\n\nPi0-FAST is a fundamentally different architecture from Pi0\u002FPi0.5 — actions are\ngenerated as **discrete FAST tokens via autoregressive decoding** through a\nsingle Gemma 2B model, not via diffusion. The FP8 inference path uses **BF16\nresidual stream** for both prefill and decode (Pi0-FAST hidden states reach\n~569K, exceeding FP16's 65504 limit) with **FP8 GEMM** on weights.\n\n**Jetson AGX Thor (SM110)**\n\n| Mode | Per-token | 50-token E2E | Method |\n|------|-----------|-------------|--------|\n| **Default** (`decode_cuda_graph=False`) | **8.7 ms** | **~464 ms** | CUTLASS FP8 wide GEMM, vocab pruning, prefill CUDA Graph, text-phase logit skip |\n| **Max-perf** (`decode_cuda_graph=True`) | **8.1 ms** | **~431 ms** | + decode loop captured as CUDA Graph |\n\n**RTX 5090 (SM120)** — measured on Blackwell consumer silicon\n\n| Mode | Prefill | Per-token | 50-token E2E | Throughput |\n|------|---------|-----------|-------------|------------|\n| **Default** (`decode_cuda_graph=False`) | **12.08 ms** | **2.87 ms** | **155.5 ms** | **348 tok\u002Fs** |\n| **Max-perf** (`decode_cuda_graph=True`)  | **10.99 ms** | **2.39 ms** | **140.3 ms** | **418 tok\u002Fs** |\n\nOn RTX 5090 the entire SM100 CUTLASS FP8 kernel family is replaced by cuBLASLt\n`fp8_gemm_descale_{fp16,bf16out}` (one runtime-gated fork in `pipeline_pi0fast.py`,\nThor path byte-for-byte unchanged). SigLIP attention uses\n`torch.nn.functional.scaled_dot_product_attention` in place of the\nSM100-only strided FMHA.\nThe decode Down GEMM (`[M=1, N=2048, K=16384]`) hits a cuBLASLt heuristic gap\nfor K≥8192 at small M on SM120, so it's dispatched through a 4-way split-K\nworkaround, which accounts for ~0.5 ms\u002Ftoken of the gap between the current\n2.39 ms and the 1.5–2.0 ms DRAM roofline. 
Numbers are 30-iteration medians on a\nfixed-image benchmark (min\u002Fmax within 0.25 ms of median in graph mode).\n\n**Speedup vs Thor SM110**: 3.07× in graph mode, 2.98× no-graph (ratios of the 50-token E2E figures above: 431 \u002F 140.3 and 464 \u002F 155.5).\n\nTotal inference latency = `prefill + N × per_token_decode` where `N` is the\nnumber of action tokens generated (variable per inference, typically 30–80).\nVocab pruning is automatic: once the model enters the action-token range, the\nlogit projection drops from 257K → 2K vocab (saves ~5 ms\u002Ftoken).\n\n**Backend equivalence vs JAX bf16 reference** (per-segment cosine on identical prefix):\n\n| Backend | Prefill xn | First logit | Decode xn | Decode logit | First token |\n|---------|-----------|-------------|-----------|--------------|-------------|\n| **FlashRT Torch** | 0.998 | 0.999 | 0.995 | 0.998 | MATCH (4022) |\n| **FlashRT JAX**   | 0.995 | 0.997 | 0.987 | 0.993 | MATCH (4022) |\n\nBoth backends match JAX's first decoded token exactly, with all internal hidden\nstates ≥ 0.987 cosine vs the JAX bf16 reference (gemma_fast.Module.apply).\nRun `python tests\u002Ftest_pi0fast_precision.py` to verify on your hardware.\n\n## Latency (RTX 5090)\n\n### Measurement protocol\n\nAll RTX 5090 numbers in this section are **pure CUDA Graph replay p50**\n(`cuda.Event` around `graph.replay()`), not the end-to-end\n`quickstart.py` wall clock. 
The two differ by roughly 1–3 ms because\nreplay excludes graph-external work — image normalization, H2D upload,\nD2H actions download, post-process un-normalization, Python wrapper —\nwhich any real production caller has to pay, but which isn't part of\nthe engine itself.\n\n| Metric | What it counts | Use it for |\n|--------|----------------|-----------|\n| **replay** (`cuda.Event` around `graph.replay()`) | GPU kernels only, captured graph(s) | Engine latency, comparisons between backends, apples-to-apples vs other kernel-level reports |\n| **wall** (`time.perf_counter()` around `rtx.infer()`) | Everything inside `rtx.infer`: copies, graph, sync, decode, un-normalize | What a Python caller feels |\n\nReplay is the canonical FlashRT benchmark column because:\n- it's compiler\u002FCPU independent (same kernels → same replay regardless of\n  whether the Python wrapper is on Thor's Arm CPU or a 5090 host x86),\n- it's what other framework benchmarks (NVIDIA Isaac, Triton-based VLA work)\n  typically report,\n- wall-clock picks up noise from image preprocessing and Python GC that\n  shifts with host CPU, not GPU.\n\nAll replay runs below use `--warmup 50 --iters 500`. 
Running 500 warmup iterations\non RTX 5090 actually lands in a slightly slower DVFS state than 50 for\nsmall workloads (1v\u002F2v Pi0.5 replay drifts by ~1 ms), so 50 is a better\n\"hot and honest\" warmup than blindly cranking it higher.\n\n### Reproducing\n\nAfter the model is loaded and `set_prompt(...)` has been called once\n(so the CUDA Graph is captured and the first inference is warm), time\ngraph replays directly:\n\n```python\nimport statistics\n\nimport torch\nimport flash_rt\n\n# Checkpoint path first, config by keyword, as in the `use_fp4` example.\nmodel = flash_rt.load_model(\"\u002Fpath\u002Fto\u002Fckpt\", config=\"pi05\", framework=\"torch\")\n# `base` \u002F `wrist` are the caller's two camera images (2-view).\nmodel.predict(images=[base, wrist], prompt=\"task\")  # warm\n\ngraph = model._pipe._enc_ae_graph\nstart = torch.cuda.Event(enable_timing=True)\nend   = torch.cuda.Event(enable_timing=True)\n\n# 50 warmup, 500 measured\nfor _ in range(50):\n    graph.replay()\ntorch.cuda.synchronize()\n\ntimes_ms = []\nfor _ in range(500):\n    start.record()\n    graph.replay()\n    end.record()\n    torch.cuda.synchronize()\n    times_ms.append(start.elapsed_time(end))\n\nprint(f\"P50 replay: {statistics.median(times_ms):.2f} ms\")\n```\n\nSwap `config=\"groot\"` and `action_horizon=50\u002F16` for the GROOT rows.\n\n### Pi0.5\n\n**RTX 5090 (SM120) — FP8 baseline, torch**:\n\n| Frontend | 1-view | 2-view | 3-view |\n|----------|--------|--------|--------|\n| **FlashRT Torch (replay p50)** | **14.48 ms** (69 Hz) | **17.58 ms** (57 Hz) | **20.00 ms** (50 Hz) |\n| (Wall p50 for reference) | 15.92 ms | 19.58 ms | 23.24 ms |\n\nReplay std across 500 timed iterations is ~0.2 ms (1v) \u002F 0.56 ms (2v) \u002F\n0.54 ms (3v). 
Per-view delta is ~3 ms — the cost is dominated by SigLIP\nforward + patch embed, both linear in `num_views`.\n\nE2E precision: cosine **0.998** vs FP16 PyTorch reference.\n\n**Jetson AGX Thor (SM110) — torch, N=8 LIBERO stratified calibration**:\n\n| Config | 1-view | 2-view | 3-view |\n|--------|--------|--------|--------|\n| **FP8 baseline (replay p50)** | **34.06 ms** (29 Hz) | **41.79 ms** (24 Hz) | **55.46 ms** (18 Hz) |\n| **NVFP4 encoder (`use_fp4=True`)** | **31.91 ms** (31 Hz) | **39.78 ms** (25 Hz) | **51.51 ms** (19 Hz) |\n| FP4 speedup vs FP8 | −2.15 ms (−6.3%) | −2.01 ms (−4.8%) | −3.95 ms (−7.1%) |\n\nMeasured via\n[`tests\u002Fbench_pi05_thor_views.py`](tests\u002Fbench_pi05_thor_views.py);\n50-iteration graph-replay P50. The NVFP4 row is the production preset\n(full 18 encoder FFN layers + AWQ + P1 split-GU — see\n[§NVFP4 encoder FFN](#nvfp4-encoder-ffn-pi05-only) above).\n\n**Encoder FP4 holds cosine ≥ 0.9989 vs PyTorch FP32 reference at every\nview count, with no latency regression (FP4 is actually faster than\nFP8 on every row).** At 3v against the PyTorch FP32 reference:\n`cos = 0.998932`, `maxdiff = 0.0372` — slightly tighter maxdiff than\nthe FP8 baseline (0.0414) thanks to the multi-sample AWQ refit.\n\n### GROOT N1.6\n\n`gr1` is the representative trained embodiment (see\n[GROOT N1.6 embodiment slots](#groot-n16-embodiment-slots) above —\ndon't run the default `new_embodiment` against the base checkpoint\nunless you've fine-tuned it; otherwise you'll get noise-like actions).\n\nGROOT's rtx pipeline captures **three separate CUDA graphs** (SigLIP,\nQwen3, DiT) with non-graph torch work between them (pixel unshuffle +\nmlp1, kv_text\u002Fkv_img split, state encode, cross-KV precompute). The\n`replay` figure below sums the three captured-graph sections — each\none replayed on its own captured stream with a sync between stages —\nto mirror the production ordering while excluding the interleaved\nnon-graph torch work. 
This is why it's noticeably below the wall\nnumber: GROOT has more graph-external work than Pi0.5.\n\n**T = 50** (padded max across all embodiments — production default)\n\n| Frontend | 1-view | 2-view | 3-view |\n|----------|--------|--------|--------|\n| **FlashRT Torch (replay p50)** | **11.90 ms** (84 Hz) | **13.08 ms** (76 Hz) | **13.92 ms** (72 Hz) |\n| (Wall p50 for reference) | 12.77 ms | 15.60 ms | 15.23 ms |\n\n**T = 16** (LIBERO-style short horizon — skips ~34 rows in every DiT block)\n\n| Frontend | 1-view | 2-view | 3-view |\n|----------|--------|--------|--------|\n| **FlashRT Torch (replay p50)** | **11.31 ms** (88 Hz) | **12.53 ms** (80 Hz) | **13.36 ms** (75 Hz) |\n| (Wall p50 for reference) | 12.18 ms | 15.06 ms | 14.66 ms |\n\nReplay std \u003C 0.02 ms across all 6 cells — the graphs are deterministic\nonce captured. Per-view delta is only ~1 ms because GROOT's SigLIP\nfeeds through a 2×2 `pixel_unshuffle` that packs 256 patches per view\ninto 64 tokens, so the extra camera adds much less compute than\nPi0.5's 256-token-per-view path.\n\nT=16 → T=50 costs only ~0.5 ms — most DiT-step cost is norm + the\nshared state-row self-attn + cross-attn with the Qwen3 backbone\nfeatures, none of which grow with T. The linear-in-T pieces (action\nencoder\u002Fdecoder MLPs, a slice of DiT self-attn QKV) only account for\n~1 ms of the total.\n\nE2E precision: cosine **0.9992** vs Isaac-GR00T `Gr00tN1d6` reference\non `gr1`, matched noise + matched post-vlln backbone features. 
The\nreference run requires the Isaac-GR00T stack (torch 2.7.1 \u002F\ntransformers 4.51.3 \u002F cp310 wheels), which cannot coexist with the\nrtx kernel build environment in the same venv — drive both via\nseparate venvs and compare the saved tensors offline.\n\n### Pi0-FAST\n\n50-token end-to-end, Orbax\u002FJAX frontend, RTX 5090:\n\n| Mode | Quickstart P50 (50-token E2E) | Throughput |\n|------|-------------------------------|------------|\n| **Default** (`decode_cuda_graph=False`) | **147.4 ms** | **~340 tok\u002Fs** |\n| **Max-perf** (`decode_cuda_graph=True`)  | **122.9 ms** | **~410 tok\u002Fs** |\n\n```bash\n# Default\npython examples\u002Fquickstart.py \\\n    --checkpoint \u002Fpath\u002Fto\u002Fpi0_fast_base \\\n    --config pi0fast --framework jax \\\n    --max_steps 50 --benchmark 20 --warmup 5\n\n# Max-perf (decode loop captured as CUDA Graph; 50-token fixed horizon)\npython examples\u002Fquickstart.py \\\n    --checkpoint \u002Fpath\u002Fto\u002Fpi0_fast_base \\\n    --config pi0fast --framework jax \\\n    --decode_cuda_graph --decode_graph_steps 46 \\\n    --max_steps 50 --benchmark 20 --warmup 5\n```\n\nThese are wall-clock end-to-end numbers (prefill + all decode tokens).\nThe per-token breakdown — 12 ms prefill \u002F 2.87 ms per decode token\ndefault, 11 ms \u002F 2.39 ms max-perf — is measured with a 30-iteration\nmedian benchmark in the Pi0-FAST detailed table above.\n\n## LIBERO Benchmark (Thor, Pi0.5)\n\n| Suite | Torch | JAX |\n|-------|-------|-----|\n| **LIBERO Spatial** (10 tasks × 50 ep) | **492\u002F500 = 98.4%** | **490\u002F500 = 98.0%** |\n| **LIBERO 10** (10 tasks × 50 ep) | **465\u002F500 = 93.0%** | **463\u002F500 = 92.6%** |\n\n---\n\n## Acknowledgments\n\n- [CUTLASS](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass) — GEMM templates and FMHA kernels\n- [FlashAttention](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention) — Attention backend for SM89\u002FSM120\n- [Physical 
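Intelligence](https:\u002F\u002Fwww.physicalintelligence.company\u002F) — Pi0\u002FPi0.5 model architecture\n- [OpenPI](https:\u002F\u002Fgithub.com\u002FPhysical-Intelligence\u002Fopenpi) — Reference PyTorch implementation\n- [NVIDIA Isaac GR00T](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FIsaac-GR00T) — GROOT N1.6 model\n","FlashRT is a high-performance realtime inference engine for small-batch, low-latency AI workloads. Its core features include hand-written CUDA kernels, static CUDA Graph capture, and loading models directly without compilation or export, which make FlashRT well suited to tasks such as VLA control and single-stream LLM inference. In particular, it supports the Qwen3.6-27B NVFP4 model, reaching a decode speed of about 100 to 129 tokens\u002Fs on a single RTX 5090. The project targets scenarios that need fast responses under constrained compute, from edge devices to servers, such as visual understanding in autonomous vehicles or natural-language-processing services. FlashRT simplifies deployment by providing a simple, easy-to-use API and cross-platform compatibility.",2,"2026-06-11 03:54:37","CREATED_QUERY"]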