[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-73353":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":33,"readmeContent":34,"aiSummary":35,"trendingCount":16,"starSnapshotCount":16,"syncStatus":36,"lastSyncTime":37,"discoverSource":38},73353,"cubecl","tracel-ai\u002Fcubecl","tracel-ai","Multi-platform high-performance compute language extension for Rust.","https:\u002F\u002Fburn.dev",null,"Rust",2192,188,17,123,0,11,28,55,33,28.83,"Apache License 2.0",false,"main",[26,27,28,29,30,31,32],"cuda","gpgpu","gpu","jit","linalg","rust","webgpu","2026-06-12 02:03:12","\u003Cdiv align=\"center\">\n\u003Cimg src=\".\u002Fassets\u002Flogo.drawio.svg\" width=\"400px\"\u002F>\n\n\u003Cbr \u002F>\n\u003Cbr \u002F>\n\n[![Discord](https:\u002F\u002Fimg.shields.io\u002Fdiscord\u002F1038839012602941528.svg?color=7289da&&logo=discord)](https:\u002F\u002Fdiscord.gg\u002FKSBSPhAUCc)\n[![Current Crates.io Version](https:\u002F\u002Fimg.shields.io\u002Fcrates\u002Fv\u002Fcubecl.svg)](https:\u002F\u002Fcrates.io\u002Fcrates\u002Fcubecl)\n[![Minimum Supported Rust Version](https:\u002F\u002Fimg.shields.io\u002Fcrates\u002Fmsrv\u002Fcubecl)](https:\u002F\u002Fcrates.io\u002Fcrates\u002Fburn)\n[![Test 
Status](https:\u002F\u002Fgithub.com\u002Ftracel-ai\u002Fcubecl\u002Factions\u002Fworkflows\u002Fci.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Ftracel-ai\u002Fcubecl\u002Factions\u002Fworkflows\u002Ftest.yml)\n![license](https:\u002F\u002Fshields.io\u002Fbadge\u002Flicense-MIT%2FApache--2.0-blue)\n\u003Cbr \u002F>\n[![NVIDIA](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fnvidia-cuda-82b432)](https:\u002F\u002Fgithub.com\u002Ftracel-ai\u002Fcubecl\u002Ftree\u002Fmain\u002Fcrates\u002Fcubecl-cuda)\n[![AMD](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Famd-rocm-c22b23)](https:\u002F\u002Fgithub.com\u002Ftracel-ai\u002Fcubecl\u002Ftree\u002Fmain\u002Fcrates\u002Fcubecl-wgpu)\n[![WGPU](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcross_platform-wgpu-008855)](https:\u002F\u002Fgithub.com\u002Ftracel-ai\u002Fcubecl\u002Ftree\u002Fmain\u002Fcrates\u002Fcubecl-wgpu)\n\n---\n\n**Multi-platform high-performance compute language extension for Rust.**\n\u003Cbr\u002F>\n\n\u003C\u002Fdiv>\n\n## TL;DR\n\nWith CubeCL, you can program your GPU using Rust, taking advantage of zero-cost abstractions to develop maintainable, flexible, and efficient compute kernels.\nCubeCL also comes with optimized runtimes that handle memory management and lazy execution for any platform.\n\n### Supported Platforms\n\n| Platform | Runtime | Compiler    | Hardware                      |\n| -------- | ------- | ----------- | ----------------------------- |\n| WebGPU   | wgpu    | WGSL        | Most GPUs                     |\n| CUDA     | CUDA    | C++ (CUDA)  | NVIDIA GPUs                   |\n| ROCm     | HIP     | C++ (HIP)   | AMD GPUs                      |\n| Metal    | wgpu    | C++ (Metal) | Apple GPUs                    |\n| Vulkan   | wgpu    | SPIR-V      | Most GPUs on Linux & Windows  |\n| CPU      | cpu     | Rust        | All CPUs, SIMD with most CPUs |\n\nNot all platforms support the same features.\nFor instance, Tensor Core acceleration isn't supported on WebGPU 
yet.\nUsing an instruction that isn't available on a platform will result in a compilation error at runtime.\nThe launch function is normally responsible for dispatching the right kernel based on device properties.\n\n### Example\n\nSimply annotate functions with the `cube` attribute to indicate that they should run on the GPU.\n\n```rust\nuse cubecl::prelude::*;\n\n#[cube(launch_unchecked)]\n\u002F\u002F\u002F A [Vector] represents a contiguous series of elements where SIMD operations may be available.\n\u002F\u002F\u002F The runtime will automatically use SIMD instructions when possible for improved performance.\nfn gelu_array\u003CF: Float, N: Size>(input: &[Vector\u003CF, N>], output: &mut [Vector\u003CF, N>]) {\n    if ABSOLUTE_POS \u003C input.len() {\n        output[ABSOLUTE_POS] = gelu_scalar(input[ABSOLUTE_POS]);\n    }\n}\n\n#[cube]\nfn gelu_scalar\u003CF: Float, N: Size>(x: Vector\u003CF, N>) -> Vector\u003CF, N> {\n    \u002F\u002F Execute the sqrt function at comptime.\n    let sqrt2 = F::new(comptime!(2.0f32.sqrt()));\n    let tmp = x \u002F Vector::new(sqrt2);\n\n    x * (Vector::erf(tmp) + 1.0) \u002F 2.0\n}\n```\n\nYou can then launch the kernel using the autogenerated `gelu_array::launch_unchecked` function.\n\n```rust\npub fn launch\u003CR: Runtime>(device: &R::Device) {\n    let client = R::client(device);\n    let input = &[-1., 0., 1., 5.];\n    let vectorization = 4;\n    let output_handle = client.empty(input.len() * core::mem::size_of::\u003Cf32>());\n    let input_handle = client.create(f32::as_bytes(input));\n\n    unsafe {\n        gelu_array::launch_unchecked::\u003Cf32, R>(\n            &client,\n            CubeCount::Static(1, 1, 1),\n            CubeDim::new_1d(input.len() as u32 \u002F vectorization),\n            vectorization,\n            BufferArg::from_raw_parts(&input_handle, input.len()),\n            BufferArg::from_raw_parts(&output_handle, input.len()),\n        )\n    };\n\n    let bytes = client.read_one(output_handle);\n 
   let output = f32::from_bytes(&bytes);\n\n    \u002F\u002F Should be [-0.1587,  0.0000,  0.8413,  5.0000]\n    println!(\"Executed gelu with runtime {:?} => {output:?}\", R::name(&client));\n}\n```\n\nTo see it in action, run the working GELU example with the following command:\n\n```bash\ncargo run --example gelu --features cpu  # cpu\u002Fsimd runtime\ncargo run --example gelu --features cuda # cuda runtime\ncargo run --example gelu --features wgpu # wgpu runtime\n```\n\n## Motivation\n\nThe goal of CubeCL is to ease the pain of writing highly optimized compute kernels that are portable across hardware.\nThere is currently no adequate solution when you want optimal performance while still being multi-platform.\nYou typically have to write custom kernels for each hardware target, often in different languages such as CUDA, Metal, or ROCm.\nTo fix this, we created a Just-in-Time compiler with three core features: **automatic vectorization**, **comptime**, and **autotune**!\n\nThese features are extremely useful for anyone writing high-performance kernels, even when portability is not a concern.\nThey improve code composability, reusability, testability, and maintainability, all while staying optimal.\nCubeCL also ships with a memory management strategy optimized for throughput with heavy buffer reuse to avoid allocations.\n\nOur goal extends beyond providing an optimized compute language; we aim to develop an ecosystem of high-performance and scientific computing in Rust.\nTo achieve this, we're developing linear algebra components that you can integrate into your own kernels.\nWe currently have a highly optimized matrix multiplication module, leveraging Tensor Cores on NVIDIA hardware where available, while gracefully falling back to basic instructions on other platforms.\nWhile there's room for improvement, particularly in using custom instructions from newer NVIDIA GPUs, our implementation already delivers impressive performance.\n\nWe are a small team also 
building [Burn](https:\u002F\u002Fburn.dev), so don't hesitate to contribute and port algorithms; it can help more than you would imagine!\n\n## How it works\n\nCubeCL leverages Rust's proc macro system in a unique two-step process:\n\n1. Parsing: The proc macro parses the GPU kernel code using the syn crate.\n2. Expansion: Instead of immediately generating an Intermediate Representation (IR), the macro generates a new Rust function.\n\nThe generated function, semantically similar to the original, is responsible for creating the IR when called.\nThis approach differs from traditional compilers, which typically generate IR directly after parsing.\nOur method enables several key features:\n\n- **Comptime**: By not transforming the original code, it becomes remarkably easy to integrate compile-time optimizations.\n- **Automatic Vectorization**: By simply vectorizing the inputs of a CubeCL function, we can determine the vectorization factor of each intermediate variable during the expansion.\n- **Rust Integration**: The generated code remains valid Rust code, allowing it to be bundled without any dependency on the specific runtime.\n\n## Design\n\nCubeCL is designed around - you guessed it - Cubes! 
More specifically, it's based on cuboids, because not all axes are the same size.\nSince all compute APIs need to map to the hardware, which are tiles that can be accessed using a 3D representation, our topology can easily be mapped to concepts from other APIs.\n\n\u003Cdiv align=\"center\">\n\n### CubeCL - Topology\n\n\u003Cimg src=\".\u002Fassets\u002Fcubecl.drawio.svg\" width=\"100%\"\u002F>\n\u003Cbr \u002F>\n\u003C\u002Fdiv>\n\u003Cbr \u002F>\n\n_A cube is composed of units, so a 3x3x3 cube has 27 units that can be accessed by their positions along the x, y, and z axes.\nSimilarly, a hyper-cube is composed of cubes, just as a cube is composed of units.\nEach cube in the hyper-cube can be accessed by its position relative to the hyper-cube along the x, y, and z axes.\nHence, a hyper-cube of 3x3x3 will have 27 cubes.\nIn this example, the total number of working units would be 27 x 27 = 729._\n\n\u003Cdetails>\n\u003Csummary>Topology Equivalence 👇\u003C\u002Fsummary>\n\u003Cbr \u002F>\n\nSince all topology variables are constant within the kernel entry point, we chose to use the Rust constant syntax with capital letters.\nOften when creating kernels, we don't always care about the relative position of a unit within a cube along each axis, but often we only care about its position in general.\nTherefore, each kind of variable also has its own axis-independent variable, which is often not present in other languages.\n\n\u003Cbr \u002F>\n\n| CubeCL         | CUDA        | WebGPU                 | Metal                            |\n| -------------- | ----------- | ---------------------- | -------------------------------- |\n| CUBE_COUNT     | N\u002FA         | N\u002FA                    | N\u002FA                              |\n| CUBE_COUNT_X   | gridDim.x   | num_workgroups.x       | threadgroups_per_grid.x          |\n| CUBE_COUNT_Y   | gridDim.y   | num_workgroups.y       | threadgroups_per_grid.y          |\n| CUBE_COUNT_Z   | gridDim.z   | num_workgroups.z  
     | threadgroups_per_grid.z          |\n| CUBE_POS       | N\u002FA         | N\u002FA                    | N\u002FA                              |\n| CUBE_POS_X     | blockIdx.x  | workgroup_id.x         | threadgroup_position_in_grid.x   |\n| CUBE_POS_Y     | blockIdx.y  | workgroup_id.y         | threadgroup_position_in_grid.y   |\n| CUBE_POS_Z     | blockIdx.z  | workgroup_id.z         | threadgroup_position_in_grid.z   |\n| CUBE_DIM       | N\u002FA         | N\u002FA                    | N\u002FA                              |\n| CUBE_DIM_X     | blockDim.x  | workgroup_size.x       | threads_per_threadgroup.x        |\n| CUBE_DIM_Y     | blockDim.y  | workgroup_size.y       | threads_per_threadgroup.y        |\n| CUBE_DIM_Z     | blockDim.z  | workgroup_size.z       | threads_per_threadgroup.z        |\n| UNIT_POS       | N\u002FA         | local_invocation_index | thread_index_in_threadgroup      |\n| UNIT_POS_X     | threadIdx.x | local_invocation_id.x  | thread_position_in_threadgroup.x |\n| UNIT_POS_Y     | threadIdx.y | local_invocation_id.y  | thread_position_in_threadgroup.y |\n| UNIT_POS_Z     | threadIdx.z | local_invocation_id.z  | thread_position_in_threadgroup.z |\n| PLANE_POS      | N\u002FA         | subgroup_id            | simdgroup_index_in_threadgroup   |\n| PLANE_DIM      | warpSize    | subgroup_size          | threads_per_simdgroup            |\n| UNIT_POS_PLANE | N\u002FA         | subgroup_invocation_id | thread_index_in_simdgroup        |\n| ABSOLUTE_POS   | N\u002FA         | N\u002FA                    | N\u002FA                              |\n| ABSOLUTE_POS_X | N\u002FA         | global_id.x            | thread_position_in_grid.x        |\n| ABSOLUTE_POS_Y | N\u002FA         | global_id.y            | thread_position_in_grid.y        |\n| ABSOLUTE_POS_Z | N\u002FA         | global_id.z            | thread_position_in_grid.z        |\n\n\u003C\u002Fdetails>\n\n## Special Features\n\n### Automatic 
Vectorization\n\nHigh-performance kernels should rely on SIMD instructions whenever possible, but doing so can quickly get pretty complicated!\nWith CubeCL, you can specify the vectorization factor of each input variable when launching a kernel.\nInside the kernel code, you still use only one type, which is dynamically vectorized and supports automatic broadcasting.\nThe runtimes are able to compile kernels and have all the necessary information to use the best instruction!\nHowever, since the algorithmic behavior may depend on the vectorization factor, CubeCL allows you to access it directly in the kernel when needed, without any performance loss, using the comptime system!\n\n### Comptime\n\nCubeCL isn't just a new compute language: though it feels like you are writing GPU kernels, you are, in fact, writing compiler plugins that you can fully customize!\nComptime is a way to modify the compiler IR at runtime when compiling a kernel for the first time.\n\nThis enables lots of optimizations and flexibility without having to write many separate variants of the same kernels to ensure maximal performance.\n\n| Feature                        | Description                                                                                                                                                                 |\n| ------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| **Instruction Specialization** | Not all instructions are available on all hardware, but when a specialized one exists, it should be enabled with a simple if statement.                                     |\n| **Automatic Vectorization**    | When you can use SIMD instructions, you should! But since not all hardware supports the same vectorization factors, it can be injected at runtime!                          
|\n| **Loop Unrolling**             | You may want multiple flavors of the same kernel, with loop unrolling for only a certain range of values. This can be configured easily with Comptime.                      |\n| **Shape Specialization**       | For deep learning kernels, it's often crucial to rely on different kernels for different input sizes; you can do it by passing the shape information as Comptime values.    |\n| **Compile Time Calculation**   | In general, you can calculate a constant using Rust runtime properties and inject it into a kernel during its compilation, to avoid recalculating it during each execution. |\n\n### Autotuning\n\nAutotuning drastically simplifies kernel selection by running small benchmarks at runtime to figure out the best kernels with the best configurations to run on the current hardware; an essential feature for portability.\nThis feature combines gracefully with comptime to test the effect of different comptime values on performance; sometimes it can be surprising!\n\nEven if the benchmarks may add some overhead when running the application for the first time, the information gets cached on the device and will be reused.\nIt is usually a no-brainer trade-off for throughput-oriented programs such as deep learning models.\nYou can even ship the autotune cache with your program, reducing cold start time when you have more control over the deployment target.\n\n## Resource\n\nFor now we don't have a lot of resources to learn, but you can look at the [matrix multiplication library](https:\u002F\u002Fgithub.com\u002Ftracel-ai\u002Fcubek\u002Ftree\u002Fmain\u002Fcrates\u002Fcubek-matmul) to see how CubeCL can be used.\nIf you have any questions or want to contribute, don't hesitate to join the [Discord](https:\u002F\u002Fdiscord.gg\u002FKSBSPhAUCc).\n\n## Disclaimer & History\n\nCubeCL is currently in **alpha**.\n\nWhile CubeCL is used in [Burn](https:\u002F\u002Fburn.dev), there are still a lot of rough edges; it isn't refined 
yet.\nThe project started as a WebGPU-only backend for Burn.\nAs we optimized it, we realized that we needed an intermediate representation (IR) that could be optimized then compiled to WGSL.\nHaving an IR made it easy to support another compilation target, so we made a CUDA runtime.\nHowever, writing kernels directly in that IR wasn't easy, so we created a Rust frontend using the [syn](https:\u002F\u002Fgithub.com\u002Fdtolnay\u002Fsyn) crate.\nNavigating the differences between CUDA and WebGPU, while leveraging both platforms, forced us to come up with general concepts that worked everywhere.\nHence, CubeCL was born!\n","CubeCL is a multi-platform, high-performance compute extension library designed for Rust. It supports multiple GPU and CPU platforms, including CUDA, ROCm, and WebGPU, and uses zero-cost abstractions to enable the development of maintainable, flexible, and efficient compute kernels. CubeCL ships with optimized runtimes that manage memory allocation and implement lazy execution, further improving the performance of cross-platform applications. The project suits applications that need efficient parallel computation across different hardware platforms, such as scientific computing and machine learning model training.",2,"2026-06-11 03:45:09","high_star"]