[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-3511":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":30,"readmeContent":31,"aiSummary":32,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":33,"discoverSource":34},3511,"oxicuda","cool-japan\u002Foxicuda","cool-japan","OxiCUDA replaces the entire NVIDIA CUDA Toolkit software stack with type-safe, memory-safe Rust code. ","https:\u002F\u002Fgithub.com\u002Fcool-japan\u002Foxicuda",null,"Rust",124,9,3,0,2,17,6,49.7,"Apache License 2.0",false,"master",true,[25,26,27,28,29],"cuda","cuda-programming","pure-rust","rust","rust-lang","2026-06-12 04:00:18","# OxiCUDA\n\n[![Crates.io](https:\u002F\u002Fimg.shields.io\u002Fcrates\u002Fv\u002Foxicuda.svg)](https:\u002F\u002Fcrates.io\u002Fcrates\u002Foxicuda)\n[![Documentation](https:\u002F\u002Fdocs.rs\u002Foxicuda\u002Fbadge.svg)](https:\u002F\u002Fdocs.rs\u002Foxicuda)\n[![CI](https:\u002F\u002Fgithub.com\u002Fcool-japan\u002Foxicuda\u002Fworkflows\u002FCI\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fcool-japan\u002Foxicuda\u002Factions)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fcrates\u002Fl\u002Foxicuda.svg)](LICENSE)\n\n**Pure Rust CUDA replacement -- cuBLAS, cuDNN, cuFFT, cuSPARSE, cuSOLVER, cuRAND and beyond in ~783K lines of safe Rust across 73 crates.**\n\nOxiCUDA replaces the entire NVIDIA CUDA Toolkit software stack with type-safe,\nmemory-safe Rust code. 
The only runtime dependency is the NVIDIA driver\n(`libcuda.so` \u002F `nvcuda.dll`); no CUDA SDK, no `nvcc`, no C\u002FC++ toolchain is\nneeded at build time. Optimized PTX assembly is generated directly from Rust\ndata structures, and a built-in autotuner benchmarks kernel variants per GPU\narchitecture to achieve near-peak throughput from Turing through Blackwell.\n\n## Architecture\n\n```\n+---------------------------------------------------------------+\n|   SciRS2  |  OxiONNX  |  TrustformeRS  |  ToRSh              |\n|   (Scientific Computing \u002F ML \u002F Inference Ecosystem)           |\n+-------------------------------+-------------------------------+\n                                |\n+-------------------------------v-------------------------------+\n|                         OxiCUDA                               |\n|                     (Pure Rust GPU)                            |\n|                                                               |\n|  Vol.1 Foundation (4 crates)                                  |\n|  +----------+ +--------+ +---------+ +---------+             |\n|  | Driver   | | Memory | | Launch  | | Runtime |             |\n|  +----------+ +--------+ +---------+ +---------+             |\n|                                                               |\n|  Vol.2 Codegen (2 crates)                                     |\n|  +-----------+ +------------+                                 |\n|  | PTX Gen   | | Autotune   |                                 |\n|  +-----------+ +------------+                                 |\n|                                                               |\n|  Vol.3 Linear Algebra    Vol.4 Deep Learning                  |\n|  +-------------+         +-------------+                      |\n|  | BLAS        |         | DNN         |                      |\n|  +-------------+         +-------------+                      |\n|                                                               |\n|  Vol.5 Scientific 
Computing (4 crates)                        |\n|  +------+ +--------+ +--------+ +------+                     |\n|  | FFT  | | Sparse | | Solver | | Rand |                     |\n|  +------+ +--------+ +--------+ +------+                     |\n|                                                               |\n|  Vol.6 Signal    Vol.7 Comp.Graph  Vol.8 Training (2)         |\n|  +---------+     +----------+      +-------+ +-------+        |\n|  | Signal  |     | Graph    |      | Train | | Quant |        |\n|  +---------+     +----------+      +-------+ +-------+        |\n|                                                               |\n|  Vol.9 Inference (3 crates)        Vol.10 RL                  |\n|  +-------+ +------------+ +----+   +------+                   |\n|  | Infer | | Dist-Infer | | LM |   |  RL  |                   |\n|  +-------+ +------------+ +----+   +------+                   |\n|                                                               |\n|  Backends (7 crates)                                          |\n|  +----------+ +--------+ +-------+ +--------+                 |\n|  | backend  | | prims  | | Metal | | Vulkan |                 |\n|  +----------+ +--------+ +-------+ +--------+                 |\n|  +--------+ +-------+ +-----------+                           |\n|  | WebGPU | | ROCm  | | LevelZero |                           |\n|  +--------+ +-------+ +-----------+                           |\n+-------------------------------+-------------------------------+\n                                |\n+-------------------------------v-------------------------------+\n|              libcuda.so  (NVIDIA Driver, runtime only)        |\n|              No SDK  \u002F  No nvcc  \u002F  No C Toolchain            |\n+---------------------------------------------------------------+\n```\n\n## Feature Highlights\n\n**Vol.1 -- Foundation** (4 crates, 26,438 SLoC)\n- Dynamic driver loading via `libloading` -- zero build-time SDK dependency\n- 
`DeviceBuffer\u003CT>` with Rust ownership semantics -- `Send + Sync`, RAII\n- Type-safe `launch!` macro with compile-time grid\u002Fblock validation\n- CUDA Runtime API layer for high-level device management\n\n**Vol.2 -- PTX Codegen & Autotuner** (2 crates, 46,081 SLoC)\n- Rust DSL that generates PTX IR covering SM 7.5 through SM 10.0\n- Tensor Core support: WMMA, MMA, WGMMA instruction generation\n- Built-in autotuner with 3-tier dispatch (cached \u002F tuned \u002F default)\n- Disk-based PTX cache keyed by kernel hash + GPU architecture\n\n**Vol.3 -- BLAS** (1 crate, 27,226 SLoC)\n- Full BLAS Level 1\u002F2\u002F3 (axpy, gemv, gemm, trsm, syrk, ...)\n- GEMM dispatch: SIMT, Tensor Core, Split-K paths\n- Batched GEMM: standard, strided, grouped\n- Precision coverage: F16, BF16, TF32, F32, F64, FP8\n- Elementwise ops (relu, gelu, sigmoid, silu) and reductions (softmax, variance)\n\n**Vol.4 -- DNN** (1 crate, 37,428 SLoC)\n- Convolution: implicit GEMM, im2col, Winograd 3x3, direct, fused Conv+BN+Act\n- FlashAttention forward\u002Fbackward, PagedAttention, decode attention\n- MoE: top-k routing, token permutation, fused MoE kernel\n- Normalization: BatchNorm, LayerNorm, RMSNorm, GroupNorm\n- Pooling: max, average, adaptive, global\n- Resize: nearest, bilinear, bicubic\n- Quantization: FP8, INT8, block-scaled FP4\n\n**Vol.5 -- Scientific Computing** (4 crates, 55,718 SLoC)\n- FFT: Stockham, radix-2\u002F4\u002F8, mixed-radix, Bluestein, C2C\u002FR2C\u002FC2R, 2D\u002F3D\n- Sparse: CSR\u002FCSC\u002FCOO\u002FBSR\u002FELL, SpMV, SpMM, SpGEMM, SDDMM, ILU(0)\u002FIC(0)\n- Solver: LU, QR, SVD, Cholesky, eigendecomp, CG, BiCGSTAB, GMRES\n- Rand: Philox, MRG32k3a, XORWOW, Sobol, uniform\u002Fnormal\u002FPoisson\n\n**Vol.6 -- Signal Processing** (1 crate, 6,595 SLoC)\n- Audio: MFCC, STFT, Mel filterbank, spectral features\n- Image: Gaussian blur, Sobel edge detection, morphological ops\n- DCT: Types I-IV with fast algorithms\n- DWT: Haar, Daubechies wavelets\n- Filtering: 
IIR\u002FFIR filters, Butterworth, Chebyshev\n- Correlation: cross-correlation, autocorrelation\n\n**Vol.7 -- Computation Graph** (1 crate, 4,949 SLoC)\n- CUDA Graph capture API (StreamCapture, GraphCapture)\n- Execution plan with dependency-sorted node scheduling\n- Event-based inter-node synchronization\n- Sequential + parallel graph executors\n\n**Vol.8 -- GPU Training** (2 crates, 10,532 SLoC)\n- Mixed precision training (AMP): FP16\u002FBF16 + loss scaling\n- Gradient accumulation and clipping; EMA (exponential moving average)\n- LR schedulers: cosine, warmup, cyclic, polynomial\n- GPU-fused optimizers: Adam, AdamW, SGD, RMSProp, LAMB\n- Checkpointing (model save\u002Fload)\n- Quantization: INT8\u002FINT4\u002FFP8 weight quantization, block-scaled\n\n**Vol.9 -- Inference Engine** (3 crates, 14,692 SLoC)\n- KV-cache with paged attention (PagedKvCache) and prefix caching\n- Speculative decoding\n- Distributed inference pipeline (tensor\u002Fpipeline parallelism)\n- LM inference: BPE tokenizer, vocabulary management, sampling strategies\n\n**Vol.10 -- Reinforcement Learning** (1 crate, 5,536 SLoC)\n- Replay buffers: Uniform, Prioritized (PER), N-step\n- Policy distributions: Categorical, Gaussian (SAC reparameterization), Deterministic\n- Advantage estimators: GAE, TD(λ), V-trace, Retrace(λ)\n- Loss functions: PPO, DQN, Double-DQN, SAC, TD3\n- Observation\u002Freward normalization with Welford running stats\n- Environment abstractions: Env, VecEnv (auto-reset)\n\n**Backends** (7 crates, 28,400 SLoC)\n- Backend trait abstraction for multi-GPU-runtime portability\n- CUB-equivalent GPU primitives (scan, reduce, sort, histogram)\n- Metal (macOS), Vulkan Compute, WebGPU, AMD ROCm, Intel oneAPI (LevelZero)\n\n## Pure Rust, Minimal Dependencies\n\nOxiCUDA is built on a strict **Pure Rust** policy with minimal external\ndependencies. 
The entire codebase compiles with `cargo build` alone -- no\nC compiler, no Fortran runtime, no CUDA SDK, no `nvcc`, no `pkg-config`.\n\n| Dependency | Purpose | Type |\n|------------|---------|------|\n| `libloading` | Dynamic `.so`\u002F`.dll` loading at runtime | Pure Rust |\n| `thiserror` | Ergonomic error type derivation | Pure Rust |\n| `num-complex` | Complex number types (FFT) | Pure Rust |\n| `half` | FP16\u002FBF16 types (optional) | Pure Rust |\n| `serde` \u002F `serde_json` | Autotune result DB (optional) | Pure Rust |\n\nThe only runtime requirement is the NVIDIA GPU driver (`libcuda.so` on Linux,\n`nvcuda.dll` on Windows). On macOS the crate compiles but returns\n`UnsupportedPlatform` at runtime.\n\n## Quick Start\n\n```rust\nuse oxicuda::prelude::*;\n\nfn main() -> Result\u003C(), oxicuda::Error> {\n    \u002F\u002F Initialize driver and select GPU device\n    let device = Device::get(0)?;\n    let ctx = Context::new(device)?;\n    let stream = Stream::new(&ctx)?;\n\n    \u002F\u002F Example problem size (illustrative values): square 32x32 matrices,\n    \u002F\u002F so each buffer holds 1024 elements and every leading dimension is 32.\n    let (m, n, k) = (32usize, 32usize, 32usize);\n    let (lda, ldb, ldc) = (32, 32, 32);\n    let host_a = vec![1.0f32; m * k];\n    let host_b = vec![1.0f32; k * n];\n\n    \u002F\u002F Allocate device memory\n    let mut d_a = DeviceBuffer::\u003Cf32>::zeroed(1024)?;\n    let mut d_b = DeviceBuffer::\u003Cf32>::zeroed(1024)?;\n    let mut d_c = DeviceBuffer::\u003Cf32>::zeroed(1024)?;\n\n    \u002F\u002F Copy host data to device\n    d_a.copy_from_host(&host_a)?;\n    d_b.copy_from_host(&host_b)?;\n\n    \u002F\u002F Launch a GEMM: C = alpha * A @ B + beta * C\n    let handle = BlasHandle::new(&stream)?;\n    handle.gemm(\n        Transpose::None, Transpose::None,\n        m, n, k,\n        1.0f32,            \u002F\u002F alpha\n        &d_a, lda,\n        &d_b, ldb,\n        0.0f32,            \u002F\u002F beta\n        &mut d_c, ldc,\n    )?;\n\n    stream.synchronize()?;\n\n    \u002F\u002F Copy result back to host\n    let mut result = vec![0.0f32; m * n];\n    d_c.copy_to_host(&mut result)?;\n    Ok(())\n}\n```\n\n## Crate Overview\n\n| Crate | CUDA Equivalent | Description | SLoC | Tests 
|\n|-------|-----------------|-------------|------|-------|\n| **Vol.1 -- Foundation** | | | | |\n| `oxicuda-driver` | Driver API | FFI, device\u002Fcontext\u002Fstream\u002Fevent\u002Fmodule | 13,508 | 383 |\n| `oxicuda-memory` | cuMemAlloc | DeviceBuffer, PinnedBuffer, unified, pool | 5,297 | 211 |\n| `oxicuda-launch` | cuLaunchKernel | Dim3, LaunchParams, `launch!` macro | 5,112 | 214 |\n| `oxicuda-runtime` | CUDA Runtime | High-level cudaRT API layer | 2,521 | 46 |\n| **Vol.2 -- PTX Codegen & Autotuner** | | | | |\n| `oxicuda-ptx` | nvcc \u002F CUTLASS | PTX IR, codegen DSL, Tensor Core gen | 31,764 | 934 |\n| `oxicuda-autotune` | -- | Search space, benchmark, tuning DB | 14,317 | 421 |\n| **Vol.3 -- Linear Algebra** | | | | |\n| `oxicuda-blas` | cuBLAS | BLAS L1\u002FL2\u002FL3, GEMM, batched, elementwise | 27,226 | 722 |\n| **Vol.4 -- Deep Learning** | | | | |\n| `oxicuda-dnn` | cuDNN | Conv, attention, MoE, norm, pool, quantize | 37,428 | 1,006 |\n| **Vol.5 -- Scientific Computing** | | | | |\n| `oxicuda-fft` | cuFFT | Stockham, radix-2\u002F4\u002F8, Bluestein, 1D\u002F2D\u002F3D | 13,039 | 350 |\n| `oxicuda-sparse` | cuSPARSE | CSR\u002FCSC\u002FCOO\u002FBSR\u002FELL, SpMV, SpMM, SpGEMM | 12,943 | 331 |\n| `oxicuda-solver` | cuSOLVER | LU, QR, SVD, Cholesky, eig, CG, GMRES | 17,724 | 396 |\n| `oxicuda-rand` | cuRAND | Philox, MRG32k3a, Sobol, distributions | 12,012 | 341 |\n| **Vol.6 -- Signal Processing** | | | | |\n| `oxicuda-signal` | -- | Audio\u002Fimage DSP, DCT, DWT, IIR\u002FFIR filters | 6,595 | 240 |\n| **Vol.7 -- Computation Graph** | | | | |\n| `oxicuda-graph` | CUDA Graphs | Graph capture, dep-sorted exec, events | 4,949 | 175 |\n| **Vol.8 -- GPU Training** | | | | |\n| `oxicuda-train` | -- | AMP, grad accum\u002Fclip, LR schedulers, optimizers | 6,214 | 167 |\n| `oxicuda-quant` | -- | INT8\u002FINT4\u002FFP8 quantization, block-scaled | 4,318 | 150 |\n| **Vol.9 -- Inference Engine** | | | | |\n| `oxicuda-infer` | -- | KV-cache, paged 
attention, speculative decode | 5,632 | 186 |\n| `oxicuda-dist-infer` | -- | Tensor\u002Fpipeline parallelism, distributed infer | 3,279 | 80 |\n| `oxicuda-lm` | -- | BPE tokenizer, vocab, sampling strategies | 5,781 | 226 |\n| **Vol.10 -- Reinforcement Learning** | | | | |\n| `oxicuda-rl` | -- | Replay buffers, policy dists, PPO\u002FDQN\u002FSAC\u002FTD3 | 5,536 | 200 |\n| **Backends** | | | | |\n| `oxicuda-backend` | -- | Backend trait abstraction | 484 | 10 |\n| `oxicuda-primitives` | CUB | GPU scan, reduce, sort, histogram | 4,502 | 142 |\n| `oxicuda-metal` | -- | Metal compute backend (macOS) | 4,395 | 152 |\n| `oxicuda-vulkan` | -- | Vulkan Compute backend | 5,116 | 86 |\n| `oxicuda-webgpu` | -- | WebGPU backend | 3,948 | 129 |\n| `oxicuda-rocm` | -- | AMD ROCm backend | 3,739 | 104 |\n| `oxicuda-levelzero` | -- | Intel oneAPI \u002F LevelZero backend | 6,216 | 103 |\n| **Vol.17 -- Generative AI** | | | | |\n| `oxicuda-gen` | -- | Diffusion (DDPM\u002FDDIM\u002FDPM-Solver++\u002FFlow Matching), CFG, VAE, LoRA | 8,470 | 365 |\n| **Vol.18 -- Graph Neural Networks** | | | | |\n| `oxicuda-gnn` | -- | CSR\u002FCOO\u002FHetero graphs, GCN\u002FGAT\u002FGraphSAGE\u002FGIN, pooling | 10,698 | 401 |\n| **Vol.19 -- State Space Models** | | | | |\n| `oxicuda-mamba` | -- | HiPPO-NPLR, S4D\u002FS5 selective scan, Mamba SSM, RWKV | 11,535 | 514 |\n| **Vol.20 -- Vision Transformers** | | | | |\n| `oxicuda-vision` | -- | ViT, patch embedding, CLIP towers | 10,829 | 496 |\n| **Vol.21 -- Audio\u002FSpeech ML** | | | | |\n| `oxicuda-audio` | -- | Conformer, Wav2Vec2, CTC\u002FRNN-T, WaveNet, SpecAugment, x-vector | 11,215 | 458 |\n| **Vol.22 -- Time-Series Forecasting** | | | | |\n| `oxicuda-timeseries` | -- | TCN, NHiTS, PatchTST, TimesNet, iTransformer, RevIN | 10,493 | 333 |\n| **Vol.23 -- Bayesian Deep Learning** | | | | |\n| `oxicuda-bayes` | -- | Variational inference, MC Dropout, Deep Ensembles, SWAG, Laplace | 10,203 | 385 |\n| **Vol.24 -- Federated Learning** | | | | 
|\n| `oxicuda-federated` | -- | FedAvg\u002FFedProx\u002FSCAFFOLD\u002FFedAdam, DP, secure aggregation | 7,969 | 351 |\n| **Vol.25 -- Neural Architecture Search** | | | | |\n| `oxicuda-nas` | -- | DARTS, supernet, NSGA-II, hardware-aware FLOPs predictor | 6,577 | 224 |\n| **Vol.26 -- Self-Supervised Learning** | | | | |\n| `oxicuda-ssl` | -- | SimCLR\u002FMoCo\u002FBYOL\u002FBarlow Twins\u002FMAE\u002FDINO | 11,706 | 373 |\n| **Vol.27 -- Adversarial Robustness** | | | | |\n| `oxicuda-adversarial` | -- | FGSM\u002FPGD\u002FCW\u002FTRADES\u002FMART | 9,006 | 387 |\n| **Vol.28 -- Multi-Modal Learning** | | | | |\n| `oxicuda-multimodal` | -- | Cross-modal attention, CLIP\u002FImageBind | 7,788 | 275 |\n| **Vol.29 -- Continual Learning** | | | | |\n| `oxicuda-continual` | -- | EWC\u002FSI\u002FPackNet\u002FGEM\u002FDER++ | 12,642 | 427 |\n| **Vol.30 -- 3D Geometry & Point Clouds** | | | | |\n| `oxicuda-geometry3d` | -- | FPS\u002FkNN\u002FPointNet\u002FDGCNN\u002FICP | 9,552 | 315 |\n| **Vol.31 -- Physics-Informed Neural Networks** | | | | |\n| `oxicuda-pinn` | -- | PINN\u002FNeuralODE\u002FFNO\u002FDeepONet | 12,599 | 493 |\n| **Vol.32 -- RLHF & Alignment** | | | | |\n| `oxicuda-rlhf` | -- | DPO\u002FIPO\u002FKTO\u002FORPO\u002FPPO-RLHF\u002Freward-model | 5,767 | 217 |\n| **Vol.33 -- Meta-Learning** | | | | |\n| `oxicuda-meta` | -- | MAML\u002FFOMAML\u002FANIL\u002FReptile\u002FProtoNet | 8,249 | 225 |\n| **Vol.34 -- Neural Radiance Fields** | | | | |\n| `oxicuda-nerf` | -- | NeRF\u002FInstant-NGP\u002FMip-NeRF\u002FTensoRF | 6,878 | 227 |\n| **Vol.35 -- Mixture of Experts** | | | | |\n| `oxicuda-moe` | -- | Switch\u002FTop-K\u002FExpert-Choice\u002FSoft-MoE | 4,906 | 153 |\n| **Vol.36 -- Tabular Deep Learning** | | | | |\n| `oxicuda-tabular` | -- | TabNet\u002FSAINT\u002FFT-Transformer\u002FNODE | 7,811 | 214 |\n| **Vol.37 -- Anomaly Detection** | | | | |\n| `oxicuda-anomaly` | -- | DeepSVDD\u002FLOF\u002FCOPOD\u002FMahalanobis\u002FIsoForest | 15,255 | 362 |\n| 
**Vol.38 -- Quantum Simulation** | | | | |\n| `oxicuda-quantum` | -- | State-vector\u002FVQE\u002FQAOA\u002FQML-kernels | 7,156 | 221 |\n| **Vol.39 -- Approximate Nearest Neighbor** | | | | |\n| `oxicuda-ann` | -- | HNSW\u002FIVF\u002FPQ\u002FIVFPQ\u002FLSH | 7,509 | 202 |\n| **Vol.40 -- Recommender Systems** | | | | |\n| `oxicuda-recsys` | -- | ALS\u002FBPR\u002FNCF\u002FDeepFM\u002FSASRec\u002FLightGCN | 10,169 | 253 |\n| **Vol.41 -- Causal Inference** | | | | |\n| `oxicuda-causal` | -- | NOTEARS\u002FIPW\u002FS-T-X-learners\u002FDML\u002FCausalForest | 21,669 | 594 |\n| **Vol.42 -- Parameter-Efficient Fine-Tuning** | | | | |\n| `oxicuda-peft` | -- | LoRA\u002FQLoRA\u002FAdaLoRA\u002FPrefix-Tuning | 14,694 | 479 |\n| **Vol.43 -- Knowledge Distillation** | | | | |\n| `oxicuda-distill` | -- | Hinton\u002FFitNets\u002FAT\u002FCRD\u002FDML\u002FZSKD | 7,029 | 246 |\n| **Vol.44 -- Optimal Transport** | | | | |\n| `oxicuda-ot` | -- | Sinkhorn\u002FEMD\u002FGromov-Wasserstein\u002FWasserstein-kmeans | 19,461 | 480 |\n| **Vol.45 -- Spiking Neural Networks** | | | | |\n| `oxicuda-snn` | -- | LIF\u002FIF\u002FBPTT\u002FSTBP\u002FSLAYER\u002FSTDP\u002FANN→SNN | 10,683 | 329 |\n| **Vol.46 -- Differential Privacy** | | | | |\n| `oxicuda-privacy` | -- | DP-FTRL\u002FDP-Adam\u002FRDP\u002FzCDP\u002FPRV\u002FOUE\u002FRAPPOR | 13,029 | 530 |\n| **Vol.47 -- Hyperdimensional Computing** | | | | |\n| `oxicuda-hdc` | -- | Binary\u002Finteger\u002Fcomplex HVs, AM\u002Fclassifier | 5,725 | 214 |\n| **Vol.48 -- Evolutionary Algorithms** | | | | |\n| `oxicuda-evol` | -- | CMA-ES\u002FNSGA-II\u002FMOEA-D\u002FNEAT\u002FDE\u002FPSO\u002FACO | 15,366 | 424 |\n| **Vol.49 -- Topological Data Analysis** | | | | |\n| `oxicuda-tda` | -- | Vietoris-Rips\u002Fpersistent-homology\u002FMapper | 6,480 | 209 |\n| **Vol.50 -- Tensor Networks** | | | | |\n| `oxicuda-tn` | -- | MPS\u002FMPO\u002FDMRG\u002FTEBD\u002FPEPS\u002FTT-cross\u002FCP-ALS\u002Feinsum | 23,576 | 427 |\n| **Vol.51 -- Sequence 
Models** | | | | |\n| `oxicuda-seq` | -- | HMM\u002FCRF\u002FKalman\u002FEKF\u002FViterbi\u002FBaum-Welch | 13,336 | 384 |\n| **Vol.52 -- Numerical PDE Solvers** | | | | |\n| `oxicuda-pde` | -- | FDM\u002FFEM\u002Fspectral\u002Fmultigrid\u002FCG | 11,332 | 384 |\n| **Vol.53 -- Manifold Learning** | | | | |\n| `oxicuda-manifold` | -- | t-SNE\u002FUMAP\u002FLLE\u002FIsomap\u002FDiffusion-Maps\u002FSMACOF | 19,877 | 388 |\n| **Vol.54 -- Statistical Inference** | | | | |\n| `oxicuda-stats` | -- | t-test\u002FANOVA\u002FKS\u002Fbootstrap\u002Fregression\u002Fpower | 17,685 | 542 |\n| **Vol.55 -- Streaming Sketches** | | | | |\n| `oxicuda-sketch` | -- | HyperLogLog\u002FCount-Min\u002FBloom\u002Ft-Digest\u002FMinHash | 8,533 | 332 |\n| **Vol.56 -- Survival Analysis** | | | | |\n| `oxicuda-survival` | -- | Kaplan-Meier\u002FCox-PH\u002FAFT\u002FFine-Gray\u002FBrier | 25,296 | 628 |\n| **Vol.57 -- Convex Optimization** | | | | |\n| `oxicuda-cvx` | -- | LP\u002FQP\u002FSOCP\u002FSDP\u002FADMM\u002FFISTA\u002Fproximal-gradient | 12,790 | 387 |\n| **Vol.58 -- Compressed Sensing** | | | | |\n| `oxicuda-cs` | -- | OMP\u002FCoSaMP\u002FIHT\u002FAMP\u002FK-SVD\u002FLASSO\u002Fnuclear-norm | 6,127 | 108 |\n| **Vol.59 -- Graph Algorithms** | | | | |\n| `oxicuda-graphalg` | -- | BFS\u002FDFS\u002FDijkstra\u002FMST\u002Fflow\u002Fmatching\u002FSCC\u002FTSP | 6,392 | 139 |\n| **Vol.60 -- Numerical Analysis** | | | | |\n| `oxicuda-numeric` | -- | Root-finding\u002Fquadrature\u002Fspecial-functions\u002FODE\u002Finterpolation | 6,061 | 212 |\n| **Vol.61 -- 2D Computational Geometry** | | | | |\n| `oxicuda-geom2d` | -- | Delaunay\u002FVoronoi\u002Fconvex-hull\u002Fsweep-line | 6,754 | 204 |\n| **Umbrella** | | | | |\n| `oxicuda` | -- | Umbrella re-export crate | 21,994 | 521 |\n| | | **Total** | **~782,571** | **23,535** |\n\n## Feature Flags\n\n| Flag | Default | Description |\n|------|---------|-------------|\n| `driver` | on | CUDA driver API layer |\n| `memory` | on | 
Device\u002Fpinned\u002Funified memory |\n| `launch` | on | Kernel launch primitives |\n| `ptx` | off | PTX IR codegen DSL |\n| `autotune` | off | Runtime autotuner with disk cache |\n| `blas` | off | BLAS L1\u002FL2\u002FL3 and GEMM |\n| `dnn` | off | Deep learning ops (conv, attention, MoE, norm) |\n| `fft` | off | FFT transforms |\n| `sparse` | off | Sparse matrix operations |\n| `solver` | off | Linear solvers (LU, QR, SVD, Cholesky, CG) |\n| `rand` | off | GPU random number generation |\n| `primitives` | off | CUB-equivalent GPU primitives |\n| `pool` | off | Async memory pool (CUDA 11.2+) |\n| `vulkan` | off | Vulkan Compute backend |\n| `metal` | off | Metal backend (macOS) |\n| `webgpu` | off | WebGPU backend |\n| `rocm` | off | AMD ROCm backend |\n| `level-zero` | off | Intel oneAPI \u002F LevelZero backend |\n| `wasm-backend` | off | WebAssembly + WebGPU browser target |\n| `gpu-tests` | off | Enable GPU hardware tests |\n| `full` | off | Enable all features |\n\n## Performance Targets\n\n| Operation | Target vs CUDA | Notes |\n|-----------|----------------|-------|\n| SGEMM (FP32) | >= 95% cuBLAS | Autotuned tile sizes |\n| HGEMM (FP16) | >= 95% cuBLAS | Tensor Core WMMA\u002FMMA |\n| Batch GEMM | >= 95% cuBLAS | Stream-K scheduling |\n| Convolution (FP16) | >= 90% cuDNN | Implicit GEMM + Winograd |\n| FlashAttention | >= 90% FA2 | Tiled, causal mask |\n| FFT (power-of-2) | >= 90% cuFFT | Stockham radix-2\u002F4\u002F8 |\n| SpMV (CSR) | >= 85% cuSPARSE | Architecture-tuned |\n| LU \u002F QR \u002F SVD | >= 85% cuSOLVER | Blocked panel factorization |\n\n## Supported GPU Architectures\n\n| Architecture | SM | Codename | Key Features |\n|--------------|----|----------|--------------|\n| Turing | 7.5 | TU10x | INT8 Tensor Cores, RT Cores |\n| Ampere | 8.0 | GA100 | TF32, FP64 Tensor Cores, Async Copy |\n| Ampere | 8.6 | GA10x | Third-gen Tensor Cores |\n| Ada Lovelace | 8.9 | AD10x | FP8 Tensor Cores |\n| Hopper | 9.0 | GH100 | WGMMA, TMA, FP8, DPX |\n| 
Blackwell | 10.0 | GB10x | FP4, Fifth-gen Tensor Cores |\n\n## Platform Support\n\n| Platform | Status | Notes |\n|----------|--------|-------|\n| Linux x86_64 | Full support | Primary development target |\n| Windows x86_64 | Full support | nvcuda.dll loaded at runtime |\n| macOS (ARM\u002Fx86) | Compile-only | Returns `UnsupportedPlatform` at runtime |\n\n## Building\n\n```bash\n# Default build (no GPU features)\ncargo build\n\n# With all GPU features\ncargo build --features \"ptx,autotune,blas,dnn,fft,sparse,solver,rand\"\n\n# Full build (all features including backends)\ncargo build --features full\n\n# Check without GPU\ncargo check --all-targets\n```\n\n## Testing\n\n```bash\n# Unit tests (no GPU required)\ncargo test\n\n# Full test suite with GPU hardware\ncargo test --features gpu-tests\n\n# Run with nextest\ncargo nextest run --all-features\n```\n\n## Roadmap\n\n**Released (v0.1.8) -- 2026-05-21** *(23,535 tests passing, 783K SLoC, 73 crates)*\n- Vol.1: Driver, Memory, Launch, Runtime -- foundation layer (4 crates)\n- Vol.2: PTX codegen DSL, autotuner engine (2 crates)\n- Vol.3: Full BLAS L1\u002FL2\u002FL3 with Tensor Core GEMM, SYR2K two-operand cross-product variant\n- Vol.4: Convolution, FlashAttention, MoE, normalization, pooling, quantization\n- Vol.5: FFT, sparse, solver, RNG (4 crates)\n- Vol.6: Signal processing -- audio\u002Fimage DSP, DCT, DWT, IIR\u002FFIR filters\n- Vol.7: Computation graph -- capture API, dep-sorted scheduling, parallel executor\n- Vol.8: GPU training -- AMP, optimizers, LR schedulers, checkpointing, quantization (2 crates)\n- Vol.9: Inference engine -- KV-cache, speculative decode, distributed infer, LM (3 crates)\n- Vol.10: Reinforcement learning -- replay buffers, policy dists, PPO\u002FDQN\u002FSAC\u002FTD3\n- Backends: Metal, Vulkan, WebGPU, ROCm, LevelZero (7 crates)\n- Vol.17: Generative AI -- diffusion schedulers, CFG, VAE, LoRA\n- Vol.18: Graph Neural Networks -- GCN\u002FGAT\u002FGraphSAGE\u002FGIN, pooling\n- 
Vol.19: State Space Models -- HiPPO-NPLR, S4D\u002FS5, Mamba SSM, RWKV\n- Vol.20: Vision Transformers & CLIP -- ViT, patch embedding, dual-tower CLIP\n- Vol.21: Audio\u002FSpeech ML -- Conformer, Wav2Vec2, CTC\u002FRNN-T, WaveNet, SpecAugment\n- Vol.22: Time-Series Forecasting -- TCN, NHiTS, PatchTST, TimesNet, iTransformer, RevIN\n- Vol.23: Bayesian Deep Learning -- variational inference, MC Dropout, Ensembles, Laplace\n- Vol.24: Federated Learning -- FedAvg\u002FFedProx\u002FSCAFFOLD\u002FFedAdam, DP, secure aggregation\n- Vol.25: Neural Architecture Search -- DARTS, supernet, NSGA-II, hardware-aware predictor\n- Vol.26--61: SSL, Adversarial, Multimodal, Continual, 3D Geometry, PINN, RLHF, Meta-Learning, NeRF, MoE, Tabular, Anomaly, Quantum, ANN, RecSys, Causal, PEFT, Distillation, OT, SNN, DP, HDC, Evolutionary, TDA, Tensor Networks, Sequence Models, PDE, Manifold, Statistics, Sketches, Survival, CVX, Compressed Sensing, Graph Algorithms, Numerical Analysis, 2D Geometry\n\n**Next**\n- Published documentation on docs.rs\n- GPU hardware benchmark validation (CI regression tracking)\n- v1.0 completion criteria verification (see TODO.md)\n\n## Quick Links\n\n- [API Documentation](https:\u002F\u002Fdocs.rs\u002Foxicuda)\n- [GitHub Repository](https:\u002F\u002Fgithub.com\u002Fcool-japan\u002Foxicuda)\n- [COOLJAPAN Ecosystem](https:\u002F\u002Fgithub.com\u002Fcool-japan)\n- [Changelog](CHANGELOG.md)\n\n## Related COOLJAPAN Projects\n\n| Project | Description |\n|---------|-------------|\n| [SciRS2](https:\u002F\u002Fgithub.com\u002Fcool-japan\u002Fscirs2) | Scientific computing (NumPy\u002FSciPy equivalent) |\n| [ToRSh](https:\u002F\u002Fgithub.com\u002Fcool-japan\u002Ftorsh) | Tensor operations (PyTorch equivalent) |\n| [TrustformeRS](https:\u002F\u002Fgithub.com\u002Fcool-japan\u002Ftrustformers) | Transformer models |\n| [OxiONNX](https:\u002F\u002Fgithub.com\u002Fcool-japan\u002Foxionnx) | ONNX neural network inference |\n| 
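[OxiBLAS](https:\u002F\u002Fgithub.com\u002Fcool-japan\u002Foxiblas) | Pure Rust BLAS |\n| [OxiFFT](https:\u002F\u002Fgithub.com\u002Fcool-japan\u002Foxifft) | Pure Rust FFT |\n\n## License\n\nLicensed under the Apache License, Version 2.0. See [LICENSE](LICENSE) for details.\n\n## Copyright\n\n(C) 2026 COOLJAPAN OU (Team KitaSan)\n","The OxiCUDA project aims to replace the entire NVIDIA CUDA Toolkit software stack with type-safe, memory-safe Rust code. It implements the functionality of core libraries such as cuBLAS, cuDNN, and cuFFT in roughly 783K lines of Rust spread across 73 crates. The project depends only on the NVIDIA driver (`libcuda.so` or `nvcuda.dll`) and builds without the CUDA SDK, the `nvcc` compiler, or a C\u002FC++ toolchain. OxiCUDA generates optimized PTX assembly directly from Rust data structures and includes a built-in autotuner that tunes kernel variants for different GPU architectures to achieve near-peak throughput. It suits scientific computing, machine learning, signal processing, and similar fields that need high-performance computing without the complexity of CUDA dependencies.","2026-06-11 02:54:40","CREATED_QUERY"]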