[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1611":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":14,"forks30d":14,"starsTrendScore":18,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":14,"starSnapshotCount":14,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},1611,"CuRast","m-schuetz\u002FCuRast","m-schuetz","Cuda-Based Software Rasterization for Billions of Triangles",null,"C++",223,12,4,0,7,9,26,21,3.34,"Other",false,"main",true,[],"2026-06-12 02:00:30","\n# CuRast: Cuda-Based Software Rasterization for Billions of Triangles\n\n\u003Ca href=\"http:\u002F\u002Farxiv.org\u002Fabs\u002F2604.21749\" target=\"_blank\" rel=\"noopener noreferrer\">[Paper]\u003C\u002Fa>\n\n__About__: [Nanite](https:\u002F\u002Fadvances.realtimerendering.com\u002Fs2021\u002FKaris_Nanite_SIGGRAPH_Advances_2021_final.pdf) has demonstrated that small triangles can be rasterized more efficiently with custom compute shaders than with the fixed-function hardware pipeline. Building on this insight, we explore how far this advantage can be pushed for real-time rendering of massive triangle datasets without relying on precomputed LODs or acceleration structures. \n\n__Method__: A 3-stage rasterization pipeline first rasterizes small triangles efficiently in stage 1, and falls back to other stages for increasingly larger triangles. Stage 1 assumes triangles are small and uses 1 thread to render them directly. If they are not, they are instead queued for stage 2 which uses 1 warp to render larger triangles with more compute power. 
If they are still too large, they are split up and queued for stage 3. \n\n__Results__: With CUDA, we can render large models with hundreds of millions of unique triangles 2-5x faster than Vulkan, or up to 12x faster when it comes to instanced triangles. For smaller models producing large triangles, or models with numerous meshes with few triangles, Vulkan remains 10x faster.\n\n__Limitations__: We currently focus on dense, opaque meshes like those you would typically obtain from photogrammetry\u002F3D reconstruction. Blending\u002FTransparency is not yet supported, and scenes with thousands of low-poly meshes are not implemented efficiently. \n\n__Future Work__: To make it suitable for games, we intend to (1) optimize handling of scenes with tens of thousands of nodes\u002Fmeshes, (2) add support for hierarchical clustered LODs such as those produced by [Meshoptimizer](https:\u002F\u002Fgithub.com\u002Fzeux\u002Fmeshoptimizer), (3) add support for transparency, likely in its own stage so as to keep opaque rasterization untouched and fast. \n\n\u003Ctable>\n\u003Ctr>\n\t\u003Ctd>\n\t\t\u003Cimg src=\"docs\u002Fcover.jpg\"\u002F>\n\t\u003C\u002Ftd>\n\t\u003Ctd>\n\t\t\u003Cimg src=\"docs\u002Fscreenshot_venice_closeup.jpg\" \u002F>\n\t\u003C\u002Ftd>\n\t\u003Ctd>\n\t\t\u003Cimg src=\"docs\u002Fscreenshot_lantern_instanced_overview.jpg\" \u002F>\n\t\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003Ctr>\n\t\u003Ctd>\n\t\t\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fnvpro-samples\u002Fvk_lod_clusters\u002Fblob\u002Fmain\u002FREADME.md#zorah-demo-scene\">Zorah\u003C\u002Fa> rendered in 67.3ms into a 3840x2160 framebuffer (RTX 5090). 
13.5 billion triangles in view frustum.\n\t\u003C\u002Ftd>\n\t\u003Ctd>\n\t\tVenice (400M triangles) rendered in 7.98ms (1920x1080p, RTX 5090).\n\t\u003C\u002Ftd>\n\t\u003Ctd>\n\t\t3000 instances with 1M triangles each, rendered in 9.8ms (1920x1080p, RTX 5090).\n\t\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n## Installing\n\n\n### Windows\n\nDependencies: \n* CUDA 13.1\n* Visual Studio 2026\n* An RTX 4090\n\nCreate Visual Studio solution files in a build folder via CMake:\n\n```\nmkdir build\ncd build\ncmake ..\u002F\n```\n\nCompile and run with Visual Studio 2026. Drag and drop glb or gltf files to load them.\n\n### Linux\n\nTODO. \n\nMain challenge: We're using the Windows API for [memory mapping](.\u002Fsrc\u002FMappedFile.h) (easily read from files) and [unbuffered IO](.\u002Fsrc\u002Funsuck_platform_specific.cpp#L242) (efficiently read from files). mmap on Linux should be straightforward, but what about fast sequential SSD reads without buffering overhead? io_uring?\n\n## Getting Started\n\nYou can either drag&drop glb or gltf files into the application, or modify [initScene() in main.cpp](.\u002Fsrc\u002Fmain.cpp) to load at startup and get some control over the settings. Note that glb support is limited; some or many glb files may not work. For data sets like Zorah, drag&drop won't work, as Zorah is too large to fit in VRAM and requires loading with ```.compress = true```. For Venice, we also have ```.useJpegTextures``` enabled, which keeps textures jpeg-compressed on the GPU to save some VRAM. \n\n\n### Data Sets\n\nSome test data sets we've been using, with download links where available. 
\n\n\u003Ctable>\n\t\u003Ctr>\n\t\t\u003Cth>Data Set\u003C\u002Fth>\n\t\t\u003Cth>Triangles\u003C\u002Fth>\n\t\t\u003Cth>Description\u003C\u002Fth>\n\t\u003C\u002Ftr>\n\t\u003Ctr>\n\t\t\u003Ctd>\n\t\t\t\u003Ca href=\"https:\u002F\u002Fusers.cg.tuwien.ac.at\u002F~mschuetz\u002Fpermanent\u002Fcurast\u002Fkomainu_kobe_60m.glb\">Komainu Kobe\u003C\u002Fa>\n\t\t\u003C\u002Ftd>\n\t\t\u003Ctd>60M\u003C\u002Ftd>\n\t\t\u003Ctd>\n\t\t\tOriginal images courtesy of \u003Ca href=\"https:\u002F\u002Fopenheritage3d.org\u002Fproject.php?id=1wv3-9775\">Gildas Sidobre, NRHK, distributed by Open Heritage 3D.\u003C\u002Fa>\n\t\t\u003C\u002Ftd>\n\t\u003C\u002Ftr>\n\t\u003Ctr>\n\t\t\u003Ctd>\n\t\t\t\u003Ca href=\"https:\u002F\u002Fusers.cg.tuwien.ac.at\u002F~mschuetz\u002Fpermanent\u002Fcurast\u002Fhakone_1M.glb\">Hakone Lantern\u003C\u002Fa>\n\t\t\u003C\u002Ftd>\n\t\t\u003Ctd>1M\u003C\u002Ftd>\n\t\t\u003Ctd>Created with Reality Scan, simplified with Meshoptimizer.\u003C\u002Ftd>\n\t\u003C\u002Ftr>\n\t\u003Ctr>\n\t\t\u003Ctd>\n\t\t\t\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fludicon\u002Fsponza-gltf\">Sponza\u003C\u002Fa>\n\t\t\u003C\u002Ftd>\n\t\t\u003Ctd>262k\u003C\u002Ftd>\n\t\t\u003Ctd>\n\t\t\tWe use the sponza-png.glb modified by Ludicon. Original authors and modifications over the years by Marko Dabrovic, Frank Meinl, Crytek, Hans-Kristian Arntzen, Morgan McGuire.\n\t\t\u003C\u002Ftd>\n\t\u003C\u002Ftr>\n\t\u003Ctr>\n\t\t\u003Ctd>\n\t\t\t\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fnvpro-samples\u002Fvk_lod_clusters\u002Fblob\u002Fmain\u002FREADME.md#zorah-demo-scene\">Zorah\u003C\u002Fa>\n\t\t\u003C\u002Ftd>\n\t\t\u003Ctd>18.9B\u003C\u002Ftd>\n\t\t\u003Ctd>\n\t\t\tWe use the original zorah_main_public.gltf data set, which has since been replaced by v2. The newer version is compressed; perhaps \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fzeux\u002Fmeshoptimizer\">Meshoptimizer\u003C\u002Fa> can decompress it? 
\n\t\t\u003C\u002Ftd>\n\t\u003C\u002Ftr>\n\t\u003Ctr>\n\t\t\u003Ctd>\n\t\t\tVenice\n\t\t\u003C\u002Ftd>\n\t\t\u003Ctd>400M\u003C\u002Ftd>\n\t\t\u003Ctd>\n\t\t\tCourtesy of \u003Ca href=\"https:\u002F\u002Ficonem.com\u002F\">Iconem\u003C\u002Fa> and the \u003Ca href=\"https:\u002F\u002Fwww.visitmuve.it\u002Fen\u002F\">Fondazione Musei Civici di Venezia\u003C\u002Fa>.\n\t\t\u003C\u002Ftd>\n\t\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\n### Program\n\n| File | Role |\n|------|------|\n| [src\u002Fmain.cpp](src\u002Fmain.cpp) | Entry point and the place to define hardcoded startup scenes. |\n| [src\u002FCuRast.h](src\u002FCuRast.h) |  |\n| [src\u002FCuRastSettings.h](src\u002FCuRastSettings.h) | Some runtime settings, but also the place where we put the USE_VULKAN_SHARED_MEMORY macro if we want to enable Vulkan.  |\n| [src\u002Fkernels\u002Ftriangles_visbuffer.cu](src\u002Fkernels\u002Ftriangles_visbuffer.cu) | CUDA kernels for triangle rasterization |\n| [src\u002Fkernels\u002Fresolve.cu](src\u002Fkernels\u002Fresolve.cu) | Transforms visibility buffer to color texture for display |\n| [src\u002FCuRast_render.h](src\u002FCuRast_render.h) | Host-side draw code that launches the kernels.  |\n\n#### Known Issues\n\n- Our glb loader is targeted towards loading Zorah fast and compressing it on the fly. This led to design decisions like having 16 threads, each of which allocates as much host memory as the size of the largest index buffer. This can cause issues on systems without enough RAM, or with data sets that have enormous index buffers. \n- If compiled with Vulkan support (see CuRastSettings.h), you can only switch the rasterizer from CUDA to Vulkan, but not back. That is because we implemented converting from CUDA textures to Vulkan, but not the other way around.\n- Can only drag&drop one glb per session. Needs restart to load a new glb.\n- We don't handle \"frames in flight\" yet. While draw data is assembled on the CPU, the GPU may be idle and wait. 
In the future, while the GPU finishes drawing the current frame, the CPU should already be preparing the next frame. \n\n## References and Further Reads\n\n- [Nanite](https:\u002F\u002Fadvances.realtimerendering.com\u002Fs2021\u002FKaris_Nanite_SIGGRAPH_Advances_2021_final.pdf): Clustered LODs and software rasterization.\n- [FreePipe](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F1730804.1730817): The first to propose using atomicMin for direct rasterization without the need to sort.\n- [CUDARaster](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F2018323.2018337): An efficient, hierarchical software rasterization pipeline for CUDA. \n- [cuRE](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3197517.3201374): A CUDA rendering engine (cuRE) based on a streaming pipeline that processes multiple rasterization stages simultaneously, rather than one after the other.\n- [Meshoptimizer](https:\u002F\u002Fgithub.com\u002Fzeux\u002Fmeshoptimizer): Optimizes the arrangement of vertices and triangles to improve locality and\u002For vertex reuse, and also features hierarchical clustered LOD construction. \n- [\"Billions of triangles in minutes\"](https:\u002F\u002Fzeux.io\u002F2025\u002F09\u002F30\u002Fbillions-of-triangles-in-minutes\u002F): A blog post describing the clustered LOD construction algorithm in meshoptimizer, and the road to reducing the preprocessing time for the entire Zorah data set down to just about two and a half minutes. \n- [\"Learning from failure\"](https:\u002F\u002Fadvances.realtimerendering.com\u002Fs2015\u002FAlexEvans_SIGGRAPH-2015-sml.pdf): A talk about the architecture and software rasterization process of the PS4 game _Dreams_. 
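[\\[video\\]](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=u9KNtnCZDMI)","CuRast is a CUDA-based software rasterization project designed to handle billions of triangles. Its core feature is a three-stage rasterization pipeline that efficiently renders triangles ranging from small to large. Stage 1 renders small triangles directly with a single thread each; stage 2 uses one warp per triangle for larger triangles; triangles that are still too large are split up and queued for stage 3. The project performs well on large, dense, opaque meshes, rendering them 2 to 12 times faster than Vulkan. It is particularly suited to applications that need real-time rendering of massive triangle data sets, such as high-density models produced by photogrammetry or 3D reconstruction. However, CuRast does not yet support transparency or efficient rendering of scenes with many low-poly meshes.",2,"2026-06-11 02:44:58","CREATED_QUERY"]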