[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-11332":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":13,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":24,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":35,"readmeContent":36,"aiSummary":37,"trendingCount":16,"starSnapshotCount":16,"syncStatus":38,"lastSyncTime":39,"discoverSource":40},11332,"cuda-oxide","NVlabs\u002Fcuda-oxide","NVlabs","cuda-oxide is an experimental Rust-to-CUDA compiler that lets you write (SIMT) GPU kernels in safe(ish), idiomatic Rust. It compiles standard Rust code directly to PTX — no DSLs, no foreign language bindings, just Rust.","https:\u002F\u002Fnvlabs.github.io\u002Fcuda-oxide\u002F",null,"Rust",2716,178,16,26,0,62,1171,186,28.76,"Apache License 2.0",false,"main",true,[26,27,28,29,30,31,32,33,34],"async","compiler-backend","cuda","gpu","heterogeneous-computing","high-performance-computing","nvidia","programming-languages","rust","2026-06-12 02:02:31","\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fcuda-oxide\u002Factions\u002Fworkflows\u002Fclippy.yml\">\u003Cimg alt=\"clippy\" src=\"https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fcuda-oxide\u002Factions\u002Fworkflows\u002Fclippy.yml\u002Fbadge.svg?branch=main\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fcuda-oxide\u002Factions\u002Fworkflows\u002Funit-tests.yml\">\u003Cimg alt=\"unit-tests\" src=\"https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fcuda-oxide\u002Factions\u002Fworkflows\u002Funit-tests.yml\u002Fbadge.svg?branch=main\">\u003C\u002Fa>\n  \u003Ca 
href=\"https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fcuda-oxide\u002Factions\u002Fworkflows\u002Fcargo-deny.yml\">\u003Cimg alt=\"cargo-deny\" src=\"https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fcuda-oxide\u002Factions\u002Fworkflows\u002Fcargo-deny.yml\u002Fbadge.svg?branch=main\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fcuda-oxide\u002Factions\u002Fworkflows\u002Fcodeql.yml\">\u003Cimg alt=\"CodeQL\" src=\"https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fcuda-oxide\u002Factions\u002Fworkflows\u002Fcodeql.yml\u002Fbadge.svg?branch=main\">\u003C\u002Fa>\n  \u003Cbr>\n  \u003Cimg src=\"assets\u002Flogo.png\" alt=\"cuda-oxide logo\" width=\"100%\">\n\u003C\u002Fp>\n\n# cuda-oxide\n\ncuda-oxide is a custom rustc backend for compiling GPU kernels in pure Rust.\nThe workspace combines:\n\n- single-source compilation -- host and device code live in the same file, built with one `cargo oxide build`\n- a rustc codegen backend that compiles `#[kernel]` functions to CUDA PTX\n- device-side abstractions (type-safe indexing, shared memory, scoped atomics, barriers, TMA, warp\u002Fcluster ops)\n- a host-side runtime for memory management and kernel launching (`cuda-core`, `cuda-async`)\n- a rust-native compilation pipeline using [Pliron](https:\u002F\u002Fgithub.com\u002Fvaivaswatha\u002Fpliron), an MLIR-like IR framework in Rust (Rust → Rust MIR → Pliron IR → LLVM IR → PTX)\n\n## Project Status\n\ncuda-oxide is an experimental compiler that demonstrates how CUDA SIMT kernels can be written natively in pure Rust -- no DSLs, no foreign language bindings -- and made available to the broader Rust community. The project is in an early stage (alpha) and under active development: you should expect bugs, incomplete features, and API breakage as we work to improve it. 
That said, we hope you'll try it in your own work and help shape its direction by sharing feedback on your experience.\n\nPlease see [CONTRIBUTING.md](CONTRIBUTING.md) if you're interested in contributing to the project.\n\n## Quick Start\n\n```rust\nuse cuda_device::{cuda_module, kernel, thread, DisjointSlice};\nuse cuda_core::{CudaContext, DeviceBuffer, LaunchConfig};\n\n\u002F\u002F Device: generic kernel that applies any function to each element.\n\u002F\u002F F can be a closure with captures — rustc monomorphizes it to a concrete type.\n#[cuda_module]\nmod kernels {\n    use super::*;\n\n    #[kernel]\n    pub fn map\u003CT: Copy, F: Fn(T) -> T + Copy>(f: F, input: &[T], mut out: DisjointSlice\u003CT>) {\n        let idx = thread::index_1d();\n        let i = idx.get();\n        if let Some(out_elem) = out.get_mut(idx) {\n            *out_elem = f(input[i]);\n        }\n    }\n}\n\nfn main() {\n    let ctx = CudaContext::new(0).unwrap();\n    let stream = ctx.default_stream();\n\n    let data: Vec\u003Cf32> = (0..1024).map(|i| i as f32).collect();\n    let input = DeviceBuffer::from_host(&stream, &data).unwrap();\n    let mut output = DeviceBuffer::\u003Cf32>::zeroed(&stream, 1024).unwrap();\n\n    let module = kernels::load(&ctx).unwrap();\n\n    \u002F\u002F Launch with a closure — factor is captured and passed to the GPU automatically\n    let factor = 2.5f32;\n    module\n        .map::\u003Cf32, _>(\n            &stream,\n            LaunchConfig::for_num_elems(1024),\n            move |x: f32| x * factor,\n            &input,\n            &mut output,\n        )\n        .unwrap();\n\n    let result = output.to_host_vec(&stream).unwrap();\n    assert!((result[1] - 2.5).abs() \u003C 1e-5);\n}\n```\n\nThe above example defines a generic `#[kernel]` function `map` that accepts any\n`Fn(T) -> T` closure. 
`#[cuda_module]` embeds the generated device artifact into\nthe host binary and generates a typed `module.map::\u003Cf32, _>(...)` launch method.\nThe closure `move |x| x * factor` is captured, scalarized, and passed as kernel\nparameters automatically.\n\nFor composable async GPU work, `stream:` disappears, `{kernel}_async` returns a\nlazy `DeviceOperation`, and execution happens when you call `.sync()` or\n`.await`.\n\n```rust\nuse cuda_async::device_operation::DeviceOperation;\n\n\u002F\u002F Assuming `module`, `input`, and `output` come from the cuda-async setup:\nlet factor = 2.5f32;\nmodule\n    .map_async::\u003Cf32, _>(\n        LaunchConfig::for_num_elems(1024),\n        move |x: f32| x * factor,\n        &input,\n        &mut output,\n    )?\n    .sync()?;\n\u002F\u002F or: .await?;\n```\n\nSee the `async_mlp` example and `crates\u002Fcuda-async\u002FREADME.md` for the full async setup.\n\n```bash\n# Build and run an example\ncargo oxide run host_closure\n\n# Show full compilation pipeline (Rust MIR → dialect-mir → mem2reg → dialect-llvm → LLVM IR → PTX)\ncargo oxide pipeline vecadd\n\n# Debug with cuda-gdb\ncargo oxide debug vecadd --tui\n```\n\n## Setup\n\n### Requirements\n\n- **cargo-oxide** — cargo subcommand that drives the build pipeline (`cargo oxide run`, `build`, `debug`, etc.)\n- **Rust nightly** with `rust-src` and `rustc-dev` components (pinned in `rust-toolchain.toml`)\n- **CUDA Toolkit** (12.x+)\n- **LLVM 21+** with NVPTX backend (`llc` must be in PATH)\n- **Clang + libclang dev headers** (`clang-21` \u002F `libclang-common-21-dev`) — needed by `bindgen` when building the host `cuda-bindings` crate\n- **Linux** (tested on Ubuntu 24.04)\n\n> **Why LLVM 21?** We emit TMA \u002F tcgen05 \u002F WGMMA intrinsics that `llc` from LLVM 20 and earlier can't\n> handle. 
Simple kernels might still work with an older `llc`, but anything Hopper \u002F Blackwell needs 21+.\n\n### Install\n\n#### cargo-oxide\n\nInside the cuda-oxide repo, `cargo oxide` works out of the box via a workspace alias.\n\nFor use outside the repo (your own projects):\n\n```bash\ncargo install --git https:\u002F\u002Fgithub.com\u002FNVlabs\u002Fcuda-oxide.git cargo-oxide\n```\n\nOn first run, `cargo-oxide` will automatically fetch and build the codegen backend.\n\n#### Rust\n\n```bash\n# Toolchain installed automatically via rust-toolchain.toml\n# Manual install if needed:\nrustup toolchain install nightly-2026-04-03\nrustup component add rust-src rustc-dev --toolchain nightly-2026-04-03\n```\n\n#### CUDA\n\n```bash\nexport PATH=\"\u002Fusr\u002Flocal\u002Fcuda\u002Fbin:$PATH\"\nnvcc --version\n```\n\n#### LLVM\n\n```bash\n# Ubuntu\u002FDebian\nsudo apt install llvm-21\n```\n\nIf your distro packages do not provide `llvm-21`, use LLVM's apt helper:\n\n```bash\nsudo apt-get install -y lsb-release wget software-properties-common gnupg\nwget https:\u002F\u002Fapt.llvm.org\u002Fllvm.sh && chmod +x llvm.sh\nsudo .\u002Fllvm.sh 21\n```\n\n```bash\n# Verify NVPTX support\nllc-21 --version | grep nvptx\n```\n\nThe pipeline auto-discovers `llc-22` and `llc-21` on `PATH` (in that order).\nTo pin a specific binary, set `CUDA_OXIDE_LLC=\u002Fusr\u002Fbin\u002Fllc-21`.\n\n#### Clang (host `cuda-bindings`)\n\nThe host `cuda-bindings` crate runs `bindgen`, which loads libclang and needs\nclang's own resource-dir `stddef.h` — a bare `libclang1-*` runtime is not\nenough.\n\n```bash\nsudo apt install clang-21   # or libclang-common-21-dev\n```\n\n`cargo oxide doctor` catches this up front; the symptom otherwise is a cryptic\n`'stddef.h' file not found` during the host build.\n\n#### Dev Container\n\nThe repository includes a standard devcontainer setup in `.devcontainer\u002F` for a\nreproducible CUDA, LLVM, Clang, and Rust environment. 
See the\n[installation chapter](cuda-oxide-book\u002Fgetting-started\u002Finstallation.md#dev-container)\nfor editor and CLI usage.\n\n### Verifying Installation\n\n```bash\n# Check that all prerequisites are in place\ncargo oxide doctor\n\n# Build and run an example end-to-end\ncargo oxide run vecadd\n```\n\n`cargo oxide doctor` validates your Rust toolchain, CUDA toolkit, LLVM, and\ncodegen backend. If everything is configured correctly, `cargo oxide run vecadd`\ncompiles a Rust kernel to PTX, launches it on the GPU, and prints\n`✓ SUCCESS: All 1024 elements correct!`.\n\n## Examples\n\n**46 examples** in `crates\u002Frustc-codegen-cuda\u002Fexamples\u002F`. Highlights:\n\n| Example              | Description                                                              |\n|----------------------|--------------------------------------------------------------------------|\n| `vecadd`             | Vector addition -- canonical first example                               |\n| `host_closure`       | Generic kernels with closures passed from host                           |\n| `generic`            | Generic kernels with monomorphization (`scale\u003CT>`)                       |\n| `gemm_sol`           | GEMM SoL: 868 TFLOPS (58% cuBLAS on B200), 8 kernels across 4 phases     |\n| `tcgen05`            | Blackwell tensor cores (sm_100a): TMEM, MMA, cta_group::2                |\n| `atomics`            | GPU atomics: 6 types x 3 scopes x 5 orderings (20 tests)                 |\n| `cluster`            | Thread Block Clusters + DSMEM ring exchange (Hopper+)                    |\n| `async_mlp`          | Async MLP pipeline: GEMM → MatVec → ReLU across concurrent streams       |\n| `mathdx_ffi_test`    | cuFFTDx thread-level FFT + cuBLASDx block-level GEMM                     |\n| `async_vecadd`       | Async GPU execution with `cuda-async` and `DeviceOperation`              |\n| `cross_crate_kernel` | Library crates defining kernels, bundled into binaries                   
|\n\n```bash\ncargo oxide run vecadd\ncargo oxide run gemm_sol\n```\n\n## Crate Overview\n\n### User-Facing Crates\n\n| Crate               | Description                                                               |\n|---------------------|---------------------------------------------------------------------------|\n| `cuda-device`       | Device intrinsics (`thread::*`, `warp::*`, barriers)                      |\n| `cuda-host`         | Typed module loading, launch helpers, LTOIR loader                        |\n| `cuda-macros`       | Proc macros (`#[cuda_module]`, `#[kernel]`, `gpu_printf!`)                |\n| `cuda-bindings`     | Raw `bindgen` FFI bindings to `cuda.h`                                    |\n| `cuda-core`         | Safe RAII wrappers (`CudaContext`, `CudaStream`, `DeviceBuffer\u003CT>`)       |\n| `cuda-async`        | Async execution layer (`DeviceOperation`, `DeviceFuture`, `DeviceBox\u003CT>`) |\n| `libnvvm-sys`       | `dlopen` bindings to libNVVM (used by `cuda-host::ltoir`)                 |\n| `nvjitlink-sys`     | `dlopen` bindings to nvJitLink (used by `cuda-host::ltoir`)               |\n\n### Compiler Crates\n\n| Crate                | Description                                           |\n|----------------------|-------------------------------------------------------|\n| `rustc-codegen-cuda` | Custom rustc backend                                  |\n| `mir-importer`       | Rust MIR -> `dialect-mir` translation + pipeline      |\n| `mir-lower`          | `dialect-mir` -> `dialect-llvm` lowering              |\n| `dialect-mir`        | pliron dialect modelling Rust MIR                     |\n| `dialect-llvm`       | pliron dialect modelling LLVM IR (+ export to `.ll`)  |\n| `dialect-nvvm`       | pliron dialect modelling NVVM intrinsics              |\n\n### Build Tooling\n\n| Crate          | Description                                          |\n|----------------|------------------------------------------------------|\n| 
`cargo-oxide`  | Cargo subcommand (`cargo oxide run`, etc.)           |\n\n### Documentation\n\n| Directory           | Description                                                        |\n|---------------------|--------------------------------------------------------------------|\n| `cuda-oxide-book`   | Project book (Sphinx + MyST) — guides, compiler internals, API ref |\n\n## Status\n\n### Highlights:\n\n- End-to-end Rust -> PTX compilation\n- Unified single-source compilation (host + device in one file)\n- Generic functions with monomorphization\n- Closures with captures (move and non-move via HMM)\n- User-defined structs, enums, pattern matching\n- Full GPU intrinsic support (thread, warp, shared memory, barriers, TMA, clusters, atomics)\n- Cross-crate kernels\n- LTOIR generation for Blackwell+ (device-side LTO)\n- Device FFI: Rust \u003C-> C++\u002FCCCL interop via LTOIR\n- MathDx integration: cuFFTDx thread-level FFT, cuBLASDx block-level GEMM\n- Host runtime: `cuda-core` (explicit control) and `cuda-async` (composable async operations)\n- GEMM SoL: 868 TFLOPS (58% cuBLAS SoL) on B200 with cta_group::2, CLC, 4-stage pipeline\n\n## Documentation\n\n**WIP:** 🚧 The **[cuda-oxide book](https:\u002F\u002Fnvlabs.github.io\u002Fcuda-oxide\u002F)** is the primary reference for the project. It covers SIMT kernel authoring in Rust, synchronous and asynchronous GPU programming, the compiler architecture, and more.\n\nTo build and serve the book locally, see [cuda-oxide-book\u002FREADME.md](.\u002Fcuda-oxide-book\u002FREADME.md).\n\n## Ecosystem\n\ncuda-oxide is one of several Rust + GPU efforts under active development. Projects in this space address different parts of the problem — Vulkan\u002FSPIR-V for graphics, implicit offload via LLVM, third-party CUDA backends, safe driver bindings — and we've been working with maintainers across the broader Rust GPU community on how to move GPU computing in Rust forward together. 
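For where cuda-oxide fits relative to other projects, see the [Ecosystem appendix](https:\u002F\u002Fnvlabs.github.io\u002Fcuda-oxide\u002Fappendix\u002Fecosystem.html) of the book.\n\n## License\n\nThe `cuda-bindings` crate is licensed under the NVIDIA Software License: [LICENSE-NVIDIA](LICENSE-NVIDIA). All other crates are licensed under the Apache License, Version 2.0: [LICENSE-APACHE](LICENSE-APACHE).\n","cuda-oxide is an experimental Rust-to-CUDA compiler that lets developers write GPU kernels in relatively safe, idiomatic Rust. Its core features include compiling standard Rust code directly to PTX, with no DSL or foreign-language bindings, along with single-source compilation, device-side abstractions (such as type-safe indexing and shared memory), and host-side runtime support. It is particularly suited to workloads that need high-performance computing while keeping code concise and safe, such as scientific computing and machine learning model training. Although it is still at an early stage of development and has rough edges, it has already opened up new heterogeneous computing possibilities for the Rust community.",2,"2026-06-11 03:31:42","CREATED_QUERY"]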