[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74706":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":32,"readmeContent":33,"aiSummary":34,"trendingCount":16,"starSnapshotCount":16,"syncStatus":35,"lastSyncTime":36,"discoverSource":37},74706,"BarraCUDA","Zaneham\u002FBarraCUDA","Zaneham","Open-source CUDA, Triton and HIP compiler targeting multiple GPU and CPU architectures. ","",null,"C",1696,87,19,37,0,4,8,26,12,18.83,"Apache License 2.0",false,"master",true,[27,28,29,30,31],"c99","compiler","cuda","gpu","ml","2026-06-12 02:03:27","# BarraCUDA\n\nAn open-source CUDA C++ compiler written from scratch in C99 that takes `.cu` files and compiles them to AMD GPU machine code, NVIDIA PTX, and Tenstorrent Tensix C++, with more architectures planned. 
No LLVM, no dependencies, and no permission asked.\n\nThis is what happens when you look at NVIDIA's walled garden and think \"how hard can it be?\" The answer is: quite hard, actually, but I did it anyway.\n\nSee [Changelog](#changelog) for recent updates.\n\n## What It Does\n\nTakes CUDA C source code, the same `.cu` files you'd feed to `nvcc`, and compiles them to AMD RDNA 2\u002F3\u002F4 binaries, NVIDIA PTX, or Tenstorrent Tensix Metalium C++.\n```\n┌───────────────────────────────────────────────────────────────────────────┐\n│                          BarraCUDA Pipeline                              │\n├───────────────────────────────────────────────────────────────────────────┤\n│  Source (.cu)                                                            │\n│       ↓                                                                  │\n│  Preprocessor → #include, #define, macros, conditionals                  │\n│       ↓                                                                  │\n│  Lexer → Tokens                                                          │\n│       ↓                                                                  │\n│  Parser (Recursive Descent) → AST                                        │\n│       ↓                                                                  │\n│  Semantic Analysis → Type checking, scope resolution                     │\n│       ↓                                                                  │\n│  BIR (BarraCUDA IR) → SSA form, typed instructions                       │\n│       ↓                                                                  │\n│  mem2reg → Promotes allocas to SSA registers                             │\n│       ↓                                                                  │\n│  Instruction Selection                                                   │\n│       ├──────────────────┬──────────────────┬────────────────────┤       │\n│       ↓ AMD              ↓ NVIDIA            ↓ 
Tenstorrent       │       │\n│  VGPR\u002FSGPR regalloc  PTX isel + emit    Tensix SFPU isel        │       │\n│       ↓                  ↓                   ↓                   │       │\n│  GFX9\u002F10\u002F11\u002F12       .ptx text          Metalium C++             │       │\n│  binary encoding     (driver JIT)       compute\u002Freader\u002Fwriter    │       │\n│       ↓                  ↓                   ↓                   │       │\n│  .hsaco ELF          Runs on NVIDIA     Runs on Tenstorrent      │       │\n│       ↓              hardware            hardware                │       │\n│  Runs on AMD                                                     │       │\n│  hardware                                                        │       │\n└───────────────────────────────────────────────────────────────────────────┘\n```\n\n\n## Building\n\n```bash\n# It's C99. It builds with gcc. There are no dependencies.\nmake\n\n# That's it. No cmake. No autoconf. No 47-step build process.\n# If this doesn't work, your gcc is broken, not the Makefile.\n```\n\n### Requirements\n\n- A C99 compiler (gcc, clang, whatever you've got)\n- A will to live (optional but recommended)\n- LLVM is NOT required. 
BarraCUDA does its own instruction encoding like an adult.\n\n## Usage\n\n```bash\n# Compile to AMD GPU binary (RDNA 3, default)\n.\u002Fbarracuda --amdgpu-bin kernel.cu -o kernel.hsaco\n\n# Compile for RDNA 2\n.\u002Fbarracuda --amdgpu-bin --gfx1030 kernel.cu -o kernel.hsaco\n\n# Compile for RDNA 4\n.\u002Fbarracuda --amdgpu-bin --gfx1200 kernel.cu -o kernel.hsaco\n\n# Compile to NVIDIA PTX\n.\u002Fbarracuda --nvidia-ptx kernel.cu -o kernel.ptx\n\n# Compile to Tenstorrent Metalium C++\n.\u002Fbarracuda --tensix kernel.cu -o kernel_compute.cpp\n\n# Dump the IR (for debugging or curiosity)\n.\u002Fbarracuda --ir kernel.cu\n\n# Just parse and dump the AST\n.\u002Fbarracuda --ast kernel.cu\n\n# Run semantic analysis\n.\u002Fbarracuda --sema kernel.cu\n\n# Error messages in te reo Maori (or any language with a translation file)\n.\u002Fbarracuda --lang lang\u002Fmi.txt --amdgpu-bin kernel.cu -o kernel.hsaco\n```\n\n## Runtime Launcher\n\nBarraCUDA includes a minimal HSA runtime (`src\u002Fruntime\u002F`) for dispatching compiled kernels on real AMD hardware. Zero compile-time dependency on ROCm — loads `libhsa-runtime64.so` at runtime via `dlopen`.\n\n```bash\n# Compile the runtime and example together\ngcc -std=c99 -O2 -I src\u002Fruntime \\\n    examples\u002Flaunch_saxpy.c src\u002Fruntime\u002Fbc_runtime.c \\\n    -ldl -lm -o launch_saxpy\n\n# Compile a kernel and run it\n.\u002Fbarracuda --amdgpu-bin -o test.hsaco tests\u002Fcanonical.cu\n.\u002Flaunch_saxpy test.hsaco\n```\n\nRequires Linux with ROCm installed. 
See `examples\u002Flaunch_saxpy.c` for a complete example.\n\n## What Works\n\n The following CUDA features compile to working GFX9\u002FGFX10\u002FGFX11\u002FGFX12 machine code, NVIDIA PTX, and Tensix Metalium C++:\n\n### Core Language\n- `__global__`, `__device__`, `__host__` function qualifiers\n- `threadIdx`, `blockIdx`, `blockDim`, `gridDim` builtins\n- Structs (named + anonymous inline), enums, typedefs, namespaces\n- Pointers, arrays, pointer arithmetic\n- All C control flow: `if`\u002F`else`, `for`, `while`, `do-while`, `switch`\u002F`case`, `goto`\u002F`label`\n- Short-circuit `&&` and `||`\n- Ternary operator\n- Templates (basic instantiation)\n- Multiple return paths, `continue`, `break`\n\n### CUDA Features\n- `__shared__` memory (allocated from LDS, properly tracked)\n- `__syncthreads()` → `s_barrier`\n- Atomic operations: `atomicAdd`, `atomicSub`, `atomicMin`, `atomicMax`, `atomicExch`, `atomicCAS`, `atomicAnd`, `atomicOr`, `atomicXor`\n- Warp intrinsics: `__shfl_sync`, `__shfl_up_sync`, `__shfl_down_sync`, `__shfl_xor_sync`\n- Warp votes: `__ballot_sync`, `__any_sync`, `__all_sync`\n- Vector types: `float2`, `float3`, `float4`, `int2`, `int3`, `int4` with `.x`\u002F`.y`\u002F`.z`\u002F`.w` access\n- Half precision: `__half`, `__float2half()`, `__half2float()`, `__nv_bfloat16`\n- `__launch_bounds__` (parsed, propagated, enforces VGPR caps)\n- Cooperative groups: `cooperative_groups::this_thread_block()` with `.sync()`, `.thread_rank()`, `.size()`\n- Operator overloading\n- Math builtins: `sqrtf`, `rsqrtf`, `expf`, `exp2f`, `logf`, `log2f`, `log10f`, `sinf`, `cosf`, `tanf`, `tanhf`, `powf`, `fabsf`, `floorf`, `ceilf`, `truncf`, `roundf`, `rintf`, `fmaxf`, `fminf`, `fmodf`, `copysignf`\n- `__constant__` memory, `__device__` globals\n\n### Compiler Features\n- Full C preprocessor: `#include`, `#define`\u002F`#undef`, function-like macros, `#ifdef`\u002F`#ifndef`\u002F`#if`\u002F`#elif`\u002F`#else`\u002F`#endif`, `#pragma`, `#error`, `-I`\u002F`-D` 
flags\n- Error recovery (reports multiple errors without hanging)\n- Multilingual error messages (`--lang \u003Cfile>`) with language-neutral E-codes\n- Source location tracking in IR dumps\n- Struct pass-by-value\n\n## Example\n\n```cuda\n__global__ void vector_add(float *c, float *a, float *b, int n)\n{\n    int idx = threadIdx.x + blockIdx.x * blockDim.x;\n    if (idx \u003C n)\n        c[idx] = a[idx] + b[idx];\n}\n```\n\n```\n$ .\u002Fbarracuda --amdgpu-bin vector_add.cu -o vector_add.hsaco\nwrote vector_add.hsaco (528 bytes code, 1 kernels)\n```\n\nNo LLVM required :-) \n\n\n## Validated on Hardware\n\nBarraCUDA-compiled kernels have been tested and produce correct results on real silicon:\n\n- **AMD MI300X (CDNA3, GFX942)** — 8\u002F8 test kernels passing. Monte Carlo neutron transport producing correct physics (k_eff = 0.995, matching reference).\n- **AMD RDNA3 (GFX1100)** — Full test suite passing via RDNA3 emulator CI.\n- **NVIDIA RTX 4060 Ti** — PTX backend, loaded via CUDA Driver API, JIT-compiled by NVIDIA driver. Monte Carlo neutron transport benchmark produces correct results with 3.8x speedup over single-thread CPU. No NVCC involved anywhere in the pipeline.\n- **Tenstorrent Blackhole** — Compiles to valid Metalium C++. Hardware validation pending dev kit access.\n\n## What Doesn't Work (Yet)\n\nBeing honest about limitations is important. Here's what's missing:\n\n- Parameter reassignment in `__device__` functions (use local variables)\n- Textures and surfaces\n- Dynamic parallelism (device-side kernel launch)\n- Multiple translation units\n- Host code generation (only device code is compiled)\n\nNone of these are architectural blockers. 
They're all \"haven't got round to it yet\" items.\n\n## Test Suite\n\n14 test files, 35+ kernels, ~1,700 BIR instructions, ~27,000 bytes of machine code:\n\n- `vector_add.cu` - The \"hello world\" of GPU computing\n- `cuda_features.cu` - Atomics, warp ops, barriers, gotos, switch, short-circuit\n- `test_tier12.cu` - Vectors, shared memory, operator overloading\n- `notgpt.cu` - AI-generated CUDA with extremely sarcastic comments (tiled SGEMM, reductions, histograms, prefix scan, stencils, half precision, cooperative groups, and the \"kitchen sink\" kernel)\n- `stress.cu` - N-body simulation, nested control flow, bit manipulation, struct pass-by-value, chained function calls\n- `canonical.cu` - Canonical patterns from NVIDIA samples adapted for the parser\n- `test_errors.cu` - Deliberate syntax errors to verify error recovery\n- `test_launch_bounds.cu` - `__launch_bounds__` parsing and VGPR cap enforcement\n- `test_coop_groups.cu` - Cooperative groups lowering\n- `mymathhomework.cu` - Trig identities, exponential growth, Newton-Raphson, log laws, hyperbolic functions, floor\u002Fceil\u002Fround, power rule, clamping\n- Plus preprocessor tests, template tests, unsigned integer tests\n\n## Roadmap\n\n### Near Term: Hardening\n\nFix the known gaps: integer literal suffixes, `const`, parameter reassignment. These are all small parser\u002Flowerer changes. The goal is to compile real-world `.cu` files without modifications.\n\n### Medium Term: Optimisation\n\nThe generated code works but isn't winning any benchmarks. Done so far: instruction scheduling, constant folding, dead code elimination, divergence-aware SSA register allocation. Priorities:\n\n- Loop-invariant code motion\n- Occupancy tuning based on register pressure\n\n### Long Term: More Architectures\n\nThe IR (BIR) is target-independent. The backend is cleanly separated. Adding a new target means writing a new `isel` + `emit` pair.\n\n- **NVIDIA PTX** - Done. Compiles CUDA to PTX, validated on RTX 4060 Ti. 
`--nvidia-ptx`\n- **Tenstorrent Tensix** - Done. Compiles CUDA to TT-Metalium C++ for Blackhole. `--tensix`\n- **Intel Arc** - Xe architecture. Would give BarraCUDA coverage across all four major GPU vendors.\n- **RISC-V Vector Extension** - For when GPUs are too mainstream and you want to run CUDA on a softcore.\n\n\n## Contributing\n\n**Issues and PRs in any language are welcome** — just include an English translation alongside. See [CONTRIBUTING.md](CONTRIBUTING.md) for the full guide on style, naming, and where to help.\n\nThe HLASM-style short identifiers (`ra_gc`, `mk_hash`, `enc_vop3`) are culturally neutral by accident, there's nothing English about a 5-character label. If you've found a bug or have an idea, write it up in whatever language you think in.\n\n## Changelog\n\n**2026-03-18** — NVIDIA PTX backend (`--nvidia-ptx`). Compiles CUDA to PTX text, loaded via CUDA Driver API and JIT-compiled by the NVIDIA driver. Validated on RTX 4060 Ti running a Monte Carlo neutron transport benchmark with correct physics results. No NVCC required. Also: anonymous struct\u002Funion support in parser, sema, and lowerer (`struct { float f; int i; } cvt;` pattern).\n\n**2026-03-14** — Divergence-aware SSA register allocator (`--ssa-ra`). Eliminates all 186 VGPR spills on a 654-line Monte Carlo transport kernel — scratch traffic drops 78%, total instructions drop 28%. Exploits the 64:1 cost asymmetry between divergent and uniform VGPR spills on Wave64 hardware: uniform values spill via `v_readfirstlane` at 4 bytes each, divergent values stay in registers where they belong. Based on the divergence analysis of Sampaio et al. (2013). ~1,300 lines of C99, all static memory, no malloc.\n\n**2026-03-09** — Post-isel verification pass (`bc_vfy`). The encoder used to trust isel to produce valid machine instructions. It shouldn't have. `bc_vfy` runs twice (post-isel, post-RA) and catches 5 classes of encoding violation before the binary leaves the compiler. 
Its first run immediately found 7 isel bugs across GFX10 and GFX942 — every one a silent miscompile that would fault on hardware with \"Reason: Unknown.\" Fixed them all. Also: `bc_abend` runtime crash diagnostics, because if IBM could do post-mortem dumps in 1964, we can do it for GPUs in 2026.\n\n**2026-03-08** — Error localisation infrastructure. Every diagnostic now has a language-neutral ID (`E001`–`E111`). External translation files via `--lang \u003Cfile>`. English reference at `lang\u002Fen.txt`, te reo Maori at `lang\u002Fmi.txt`. Unified error structs. Lowering errors now displayed.\n\n**2026-03-05** — CDNA 3 additions: GFX942 backend hardening, MFMA, Wave64 divergence, tinygrad compat. 8\u002F8 tests passing on MI300X ([PR#56](https:\u002F\u002Fgithub.com\u002FZaneham\u002FBarraCUDA\u002Fpull\u002F56)).\n\n**2026-03-05** — Instruction scheduling ([PR#52](https:\u002F\u002Fgithub.com\u002FZaneham\u002FBarraCUDA\u002Fpull\u002F52)).\n\n**2026-03-03** — CDNA 2 support (`--gfx90a`, MI250). Tinygrad compatibility.\n\n**2026-02-28** — Tenstorrent Tensix backend (`--tensix`). Compiles CUDA to TT-Metalium C++ for Blackhole. Constant folding ([PR#51](https:\u002F\u002Fgithub.com\u002FZaneham\u002FBarraCUDA\u002Fpull\u002F51)). Dead code elimination ([PR#48](https:\u002F\u002Fgithub.com\u002FZaneham\u002FBarraCUDA\u002Fpull\u002F48)).\n\n**2026-02-25** — HSA runtime launcher ([PR#40](https:\u002F\u002Fgithub.com\u002FZaneham\u002FBarraCUDA\u002Fpull\u002F40)). RDNA 2 support (`--gfx1030`, [PR#38](https:\u002F\u002Fgithub.com\u002FZaneham\u002FBarraCUDA\u002Fpull\u002F38)). Test suite ([PR#41](https:\u002F\u002Fgithub.com\u002FZaneham\u002FBarraCUDA\u002Fpull\u002F41)).\n\n**2026-02-20** — RDNA 4 support (`--gfx1200`, [PR#32](https:\u002F\u002Fgithub.com\u002FZaneham\u002FBarraCUDA\u002Fpull\u002F32)).\n\n**2026-02-16** — Initial release. CUDA compiler targeting AMD RDNA 3 (gfx1100).\n\n## Contact\n\nFound a bug? 
Want to discuss the finer points of AMDGPU instruction encoding? Need someone to commiserate with about the state of GPU computing?\n\n**zanehambly@gmail.com**\n\nOpen an issue if there's anything you want to discuss. Or don't. I'm not your mum.\n\nBased in New Zealand, where it's already tomorrow and the GPUs are just as confused as everywhere else.\n\n## License\n\nApache 2.0. Do whatever you want. If this compiler somehow ends up in production, I'd love to hear about it, mostly so I can update my LinkedIn with something more interesting than wrote a CUDA compiler for fun.\n\n## Acknowledgements\n\n- **Fernando Magno Quintão Pereira** and the **Compilers Lab at UFMG** (Universidade Federal de Minas Gerais). Fernando reached out after seeing the project, pointed me to the divergence analysis papers, and offered guidance. The SSA register allocator exists because of that conversation.\n- **The academic community** — Cooper, Harvey & Kennedy for dominators; Braun & Hack for SSA spilling; Sampaio, Souza, Collange & Pereira for divergence analysis. I'm just a hobbyist who reads papers and writes C. The actual hard work was done by the researchers.\n- **Steven Muchnick** for *Advanced Compiler Design and Implementation*. If this compiler does anything right, that book is why.\n- **Low Level** for the Zero to Hero C course and the YouTube channel. That's where I learnt C.\n- **Abe Kornelis** for being an amazing teacher. His work on the [z390 Portable Mainframe Assembler](https:\u002F\u002Fgithub.com\u002Fz390development\u002Fz390) project is well worth your time.\n- To the people who've sent messages of kindness and critique, thank you from a forever student and a happy hobbyist.\n- My Granny, Grandad, Nana and Baka. Love you x\n\n*He aha te mea nui o te ao. He tāngata, he tāngata, he tāngata.*\n\nWhat is the most important thing in the world? 
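It is people, it is people, it is people.\n\n---\n\n","BarraCUDA is an open-source CUDA C++ compiler that compiles `.cu` files to AMD GPU machine code, NVIDIA PTX, and Tenstorrent Tensix C++ code. Its core functionality covers the complete compilation pipeline from source code to target architecture, including preprocessing, lexical analysis, parsing, semantic analysis, and intermediate representation generation, and it supports multiple GPU architectures. The compiler is written entirely in C99 and does not depend on LLVM or other external libraries, making it highly self-contained. BarraCUDA suits scenarios that need cross-platform CUDA compilation, and is particularly well suited to R&D teams or individual developers who want to stop depending on a single vendor's toolchain.",2,"2026-06-11 03:50:30","high_star"]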