[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-9650":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":18,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":29,"readmeContent":30,"aiSummary":31,"trendingCount":16,"starSnapshotCount":16,"syncStatus":32,"lastSyncTime":33,"discoverSource":34},9650,"GPU-Puzzles","srush\u002FGPU-Puzzles","srush","Solve puzzles. Learn CUDA.","",null,"Jupyter Notebook",12226,939,134,15,0,4,21,92,43.92,"MIT License",false,"main",true,[26,27,28],"cuda","machine-learning","puzzles","2026-06-12 02:02:10","# GPU Puzzles\n- by [Sasha Rush](http:\u002F\u002Frush-nlp.com) - [srush_nlp](https:\u002F\u002Ftwitter.com\u002Fsrush_nlp)\n\n![](https:\u002F\u002Fgithub.com\u002Fsrush\u002FGPU-Puzzles\u002Fraw\u002Fmain\u002Fcuda.png)\n\nGPU architectures are critical to machine learning, and seem to be\nbecoming even more important every day. However, you can be an expert\nin machine learning without ever touching GPU code. It is hard to gain\nintuition working through abstractions. \n\nThis notebook is an attempt to teach beginner GPU programming in a\ncompletely interactive fashion. Instead of providing text with\nconcepts, it throws you right into coding and building GPU\nkernels. The exercises use NUMBA which directly maps Python\ncode to CUDA kernels. It looks like Python but is basically\nidentical to writing low-level CUDA code. \nIn a few hours, I think you can go from basics to\nunderstanding the real algorithms that power 99% of deep learning\ntoday. 
If you do want to read the manual, it is here:\n\n[NUMBA CUDA Guide](https:\u002F\u002Fnumba.readthedocs.io\u002Fen\u002Fstable\u002Fcuda\u002Findex.html)\n\nI recommend doing these in Colab, as it is easy to get started.  Be\nsure to make your own copy, turn on GPU mode in the settings (`Runtime \u002F Change runtime type`, then set `Hardware accelerator` to `GPU`), and\nthen get to coding.\n\n[![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fsrush\u002FGPU-Puzzles\u002Fblob\u002Fmain\u002FGPU_puzzlers.ipynb)\n\n(If you are into this style of puzzle, also check out my [Tensor\nPuzzles](https:\u002F\u002Fgithub.com\u002Fsrush\u002FTensor-Puzzles) for PyTorch.)\n\n[Walkthrough Guide](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=K4T-YwsOxrM)\n\n\n```python\n!pip install -qqq git+https:\u002F\u002Fgithub.com\u002Fdanoneata\u002Fchalk@srush-patch-1\n!wget -q https:\u002F\u002Fgithub.com\u002Fsrush\u002FGPU-Puzzles\u002Fraw\u002Fmain\u002Frobot.png https:\u002F\u002Fgithub.com\u002Fsrush\u002FGPU-Puzzles\u002Fraw\u002Fmain\u002Flib.py\n```\n\n\n```python\nimport numba\nimport numpy as np\nimport warnings\nfrom lib import CudaProblem, Coord\n```\n\n\n```python\nwarnings.filterwarnings(\n    action=\"ignore\", category=numba.NumbaPerformanceWarning, module=\"numba\"\n)\n```\n\n## Puzzle 1: Map\n\nImplement a \"kernel\" (GPU function) that adds 10 to each position of vector `a`\nand stores it in vector `out`.  You have 1 thread per position.\n\n**Warning** This code looks like Python but it is really CUDA! You cannot use\nstandard python tools like list comprehensions or ask for Numpy properties\nlike shape or size (if you need the size, it is given as an argument).\nThe puzzles only require doing simple operations, basically\n+, *, simple array indexing, for loops, and if statements.\nYou are allowed to use local variables. 
\nIf you get an\nerror it is probably because you did something fancy :). \n\n*Tip: Think of the function `call` as being run 1 time for each thread.\nThe only difference is that `cuda.threadIdx.x` changes each time.*\n\n\n```python\ndef map_spec(a):\n    return a + 10\n\n\ndef map_test(cuda):\n    def call(out, a) -> None:\n        local_i = cuda.threadIdx.x\n        # FILL ME IN (roughly 1 lines)\n\n    return call\n\n\nSIZE = 4\nout = np.zeros((SIZE,))\na = np.arange(SIZE)\nproblem = CudaProblem(\n    \"Map\", map_test, [a], out, threadsperblock=Coord(SIZE, 1), spec=map_spec\n)\nproblem.show()\n```\n\n    # Map\n     \n       Score (Max Per Thread):\n       |  Global Reads | Global Writes |  Shared Reads | Shared Writes |\n       |             0 |             0 |             0 |             0 | \n    \n\n\n\n\n\n    \n![svg](GPU_puzzlers_files\u002FGPU_puzzlers_14_1.svg)\n    \n\n\n\n\n```python\nproblem.check()\n```\n\n    Failed Tests.\n    Yours: [0. 0. 0. 0.]\n    Spec : [10 11 12 13]\n\n\n## Puzzle 2 - Zip\n\nImplement a kernel that adds together each position of `a` and `b` and stores it in `out`.\nYou have 1 thread per position.\n\n\n```python\ndef zip_spec(a, b):\n    return a + b\n\n\ndef zip_test(cuda):\n    def call(out, a, b) -> None:\n        local_i = cuda.threadIdx.x\n        # FILL ME IN (roughly 1 lines)\n\n    return call\n\n\nSIZE = 4\nout = np.zeros((SIZE,))\na = np.arange(SIZE)\nb = np.arange(SIZE)\nproblem = CudaProblem(\n    \"Zip\", zip_test, [a, b], out, threadsperblock=Coord(SIZE, 1), spec=zip_spec\n)\nproblem.show()\n```\n\n    # Zip\n     \n       Score (Max Per Thread):\n       |  Global Reads | Global Writes |  Shared Reads | Shared Writes |\n       |             0 |             0 |             0 |             0 | \n    \n\n\n\n\n\n    \n![svg](GPU_puzzlers_files\u002FGPU_puzzlers_17_1.svg)\n    \n\n\n\n\n```python\n\n```\n\n\n```python\nproblem.check()\n```\n\n    Failed Tests.\n    Yours: [0. 0. 0. 
0.]\n    Spec : [0 2 4 6]\n\n\n## Puzzle 3 - Guards\n\nImplement a kernel that adds 10 to each position of `a` and stores it in `out`.\nYou have more threads than positions.\n\n\n```python\ndef map_guard_test(cuda):\n    def call(out, a, size) -> None:\n        local_i = cuda.threadIdx.x\n        # FILL ME IN (roughly 2 lines)\n\n    return call\n\n\nSIZE = 4\nout = np.zeros((SIZE,))\na = np.arange(SIZE)\nproblem = CudaProblem(\n    \"Guard\",\n    map_guard_test,\n    [a],\n    out,\n    [SIZE],\n    threadsperblock=Coord(8, 1),\n    spec=map_spec,\n)\nproblem.show()\n```\n\n    # Guard\n     \n       Score (Max Per Thread):\n       |  Global Reads | Global Writes |  Shared Reads | Shared Writes |\n       |             0 |             0 |             0 |             0 | \n    \n\n\n\n\n\n    \n![svg](GPU_puzzlers_files\u002FGPU_puzzlers_21_1.svg)\n    \n\n\n\n\n```python\nproblem.check()\n```\n\n    Failed Tests.\n    Yours: [0. 0. 0. 0.]\n    Spec : [10 11 12 13]\n\n\n## Puzzle 4 - Map 2D\n\nImplement a kernel that adds 10 to each position of `a` and stores it in `out`.\nInput `a` is 2D and square. You have more threads than positions.\n\n\n```python\ndef map_2D_test(cuda):\n    def call(out, a, size) -> None:\n        local_i = cuda.threadIdx.x\n        local_j = cuda.threadIdx.y\n        # FILL ME IN (roughly 2 lines)\n\n    return call\n\n\nSIZE = 2\nout = np.zeros((SIZE, SIZE))\na = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))\nproblem = CudaProblem(\n    \"Map 2D\", map_2D_test, [a], out, [SIZE], threadsperblock=Coord(3, 3), spec=map_spec\n)\nproblem.show()\n```\n\n    # Map 2D\n     \n       Score (Max Per Thread):\n       |  Global Reads | Global Writes |  Shared Reads | Shared Writes |\n       |             0 |             0 |             0 |             0 | \n    \n\n\n\n\n\n    \n![svg](GPU_puzzlers_files\u002FGPU_puzzlers_24_1.svg)\n    \n\n\n\n\n```python\nproblem.check()\n```\n\n    Failed Tests.\n    Yours: [[0. 0.]\n     [0. 
0.]]\n    Spec : [[10 11]\n     [12 13]]\n\n\n## Puzzle 5 - Broadcast\n\nImplement a kernel that adds `a` and `b` and stores it in `out`.\nInputs `a` and `b` are vectors. You have more threads than positions.\n\n\n```python\ndef broadcast_test(cuda):\n    def call(out, a, b, size) -> None:\n        local_i = cuda.threadIdx.x\n        local_j = cuda.threadIdx.y\n        # FILL ME IN (roughly 2 lines)\n\n    return call\n\n\nSIZE = 2\nout = np.zeros((SIZE, SIZE))\na = np.arange(SIZE).reshape(SIZE, 1)\nb = np.arange(SIZE).reshape(1, SIZE)\nproblem = CudaProblem(\n    \"Broadcast\",\n    broadcast_test,\n    [a, b],\n    out,\n    [SIZE],\n    threadsperblock=Coord(3, 3),\n    spec=zip_spec,\n)\nproblem.show()\n```\n\n    # Broadcast\n     \n       Score (Max Per Thread):\n       |  Global Reads | Global Writes |  Shared Reads | Shared Writes |\n       |             0 |             0 |             0 |             0 | \n    \n\n\n\n\n\n    \n![svg](GPU_puzzlers_files\u002FGPU_puzzlers_27_1.svg)\n    \n\n\n\n\n```python\nproblem.check()\n```\n\n    Failed Tests.\n    Yours: [[0. 0.]\n     [0. 0.]]\n    Spec : [[0 1]\n     [1 2]]\n\n\n## Puzzle 6 - Blocks\n\nImplement a kernel that adds 10 to each position of `a` and stores it in `out`.\nYou have fewer threads per block than the size of `a`.\n\n*Tip: A block is a group of threads. The number of threads per block is limited, but we can\nhave many different blocks. 
Variable `cuda.blockIdx` tells us what block we are in.*\n\n\n```python\ndef map_block_test(cuda):\n    def call(out, a, size) -> None:\n        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n        # FILL ME IN (roughly 2 lines)\n\n    return call\n\n\nSIZE = 9\nout = np.zeros((SIZE,))\na = np.arange(SIZE)\nproblem = CudaProblem(\n    \"Blocks\",\n    map_block_test,\n    [a],\n    out,\n    [SIZE],\n    threadsperblock=Coord(4, 1),\n    blockspergrid=Coord(3, 1),\n    spec=map_spec,\n)\nproblem.show()\n```\n\n    # Blocks\n     \n       Score (Max Per Thread):\n       |  Global Reads | Global Writes |  Shared Reads | Shared Writes |\n       |             0 |             0 |             0 |             0 | \n    \n\n\n\n\n\n    \n![svg](GPU_puzzlers_files\u002FGPU_puzzlers_31_1.svg)\n    \n\n\n\n\n```python\nproblem.check()\n```\n\n    Failed Tests.\n    Yours: [0. 0. 0. 0. 0. 0. 0. 0. 0.]\n    Spec : [10 11 12 13 14 15 16 17 18]\n\n\n## Puzzle 7 - Blocks 2D\n\nImplement the same kernel in 2D.  You have fewer threads per block\nthan the size of `a` in both directions.\n\n\n```python\ndef map_block2D_test(cuda):\n    def call(out, a, size) -> None:\n        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n        # FILL ME IN (roughly 4 lines)\n\n    return call\n\n\nSIZE = 5\nout = np.zeros((SIZE, SIZE))\na = np.ones((SIZE, SIZE))\n\nproblem = CudaProblem(\n    \"Blocks 2D\",\n    map_block2D_test,\n    [a],\n    out,\n    [SIZE],\n    threadsperblock=Coord(3, 3),\n    blockspergrid=Coord(2, 2),\n    spec=map_spec,\n)\nproblem.show()\n```\n\n    # Blocks 2D\n     \n       Score (Max Per Thread):\n       |  Global Reads | Global Writes |  Shared Reads | Shared Writes |\n       |             0 |             0 |             0 |             0 | \n    \n\n\n\n\n\n    \n![svg](GPU_puzzlers_files\u002FGPU_puzzlers_34_1.svg)\n    \n\n\n\n\n```python\nproblem.check()\n```\n\n    Failed Tests.\n    Yours: [[0. 0. 0. 0. 0.]\n     [0. 0. 0. 0. 
0.]\n     [0. 0. 0. 0. 0.]\n     [0. 0. 0. 0. 0.]\n     [0. 0. 0. 0. 0.]]\n    Spec : [[11. 11. 11. 11. 11.]\n     [11. 11. 11. 11. 11.]\n     [11. 11. 11. 11. 11.]\n     [11. 11. 11. 11. 11.]\n     [11. 11. 11. 11. 11.]]\n\n\n## Puzzle 8 - Shared\n\nImplement a kernel that adds 10 to each position of `a` and stores it in `out`.\nYou have fewer threads per block than the size of `a`.\n\n**Warning**: Each block can only have a *constant* amount of shared\n memory that threads in that block can read and write to. This needs\n to be a literal python constant not a variable. After writing to\n shared memory you need to call `cuda.syncthreads` to ensure that\n threads do not cross.\n\n(This example does not really need shared memory or syncthreads, but it is a demo.)\n\n\n```python\nTPB = 4\ndef shared_test(cuda):\n    def call(out, a, size) -> None:\n        shared = cuda.shared.array(TPB, numba.float32)\n        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n        local_i = cuda.threadIdx.x\n\n        if i \u003C size:\n            shared[local_i] = a[i]\n            cuda.syncthreads()\n\n        # FILL ME IN (roughly 2 lines)\n\n    return call\n\n\nSIZE = 8\nout = np.zeros(SIZE)\na = np.ones(SIZE)\nproblem = CudaProblem(\n    \"Shared\",\n    shared_test,\n    [a],\n    out,\n    [SIZE],\n    threadsperblock=Coord(TPB, 1),\n    blockspergrid=Coord(2, 1),\n    spec=map_spec,\n)\nproblem.show()\n```\n\n    # Shared\n     \n       Score (Max Per Thread):\n       |  Global Reads | Global Writes |  Shared Reads | Shared Writes |\n       |             1 |             0 |             0 |             1 | \n    \n\n\n\n\n\n    \n![svg](GPU_puzzlers_files\u002FGPU_puzzlers_39_1.svg)\n    \n\n\n\n\n```python\nproblem.check()\n```\n\n    Failed Tests.\n    Yours: [0. 0. 0. 0. 0. 0. 0. 0.]\n    Spec : [11. 11. 11. 11. 11. 11. 11. 
11.]\n\n\n## Puzzle 9 - Pooling\n\nImplement a kernel that sums together the last 3 positions of `a` and stores it in `out`.\nYou have 1 thread per position. You only need 1 global read and 1 global write per thread.\n\n*Tip: Remember to be careful about syncing.*\n\n\n```python\ndef pool_spec(a):\n    out = np.zeros(*a.shape)\n    for i in range(a.shape[0]):\n        out[i] = a[max(i - 2, 0) : i + 1].sum()\n    return out\n\n\nTPB = 8\ndef pool_test(cuda):\n    def call(out, a, size) -> None:\n        shared = cuda.shared.array(TPB, numba.float32)\n        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n        local_i = cuda.threadIdx.x\n        # FILL ME IN (roughly 8 lines)\n\n    return call\n\n\nSIZE = 8\nout = np.zeros(SIZE)\na = np.arange(SIZE)\nproblem = CudaProblem(\n    \"Pooling\",\n    pool_test,\n    [a],\n    out,\n    [SIZE],\n    threadsperblock=Coord(TPB, 1),\n    blockspergrid=Coord(1, 1),\n    spec=pool_spec,\n)\nproblem.show()\n```\n\n    # Pooling\n     \n       Score (Max Per Thread):\n       |  Global Reads | Global Writes |  Shared Reads | Shared Writes |\n       |             0 |             0 |             0 |             0 | \n    \n\n\n\n\n\n    \n![svg](GPU_puzzlers_files\u002FGPU_puzzlers_43_1.svg)\n    \n\n\n\n\n```python\nproblem.check()\n```\n\n    Failed Tests.\n    Yours: [0. 0. 0. 0. 0. 0. 0. 0.]\n    Spec : [ 0.  1.  3.  6.  9. 12. 15. 18.]\n\n\n## Puzzle 10 - Dot Product\n\nImplement a kernel that computes the dot-product of `a` and `b` and stores it in `out`.\nYou have 1 thread per position. You only need 2 global reads and 1 global write per thread.\n\n*Note: For this problem you don't need to worry about the number of shared reads. 
We will\n handle that challenge later.*\n\n\n```python\ndef dot_spec(a, b):\n    return a @ b\n\nTPB = 8\ndef dot_test(cuda):\n    def call(out, a, b, size) -> None:\n        shared = cuda.shared.array(TPB, numba.float32)\n\n        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n        local_i = cuda.threadIdx.x\n        # FILL ME IN (roughly 9 lines)\n    return call\n\n\nSIZE = 8\nout = np.zeros(1)\na = np.arange(SIZE)\nb = np.arange(SIZE)\nproblem = CudaProblem(\n    \"Dot\",\n    dot_test,\n    [a, b],\n    out,\n    [SIZE],\n    threadsperblock=Coord(SIZE, 1),\n    blockspergrid=Coord(1, 1),\n    spec=dot_spec,\n)\nproblem.show()\n```\n\n    # Dot\n     \n       Score (Max Per Thread):\n       |  Global Reads | Global Writes |  Shared Reads | Shared Writes |\n       |             0 |             0 |             0 |             0 | \n    \n\n\n\n\n\n    \n![svg](GPU_puzzlers_files\u002FGPU_puzzlers_47_1.svg)\n    \n\n\n\n\n```python\nproblem.check()\n```\n\n    Failed Tests.\n    Yours: [0.]\n    Spec : 140\n\n\n## Puzzle 11 - 1D Convolution\n\nImplement a kernel that computes a 1D convolution between `a` and `b` and stores it in `out`.\nYou need to handle the general case. 
You only need 2 global reads and 1 global write per thread.\n\n\n```python\ndef conv_spec(a, b):\n    out = np.zeros(*a.shape)\n    len = b.shape[0]\n    for i in range(a.shape[0]):\n        out[i] = sum([a[i + j] * b[j] for j in range(len) if i + j \u003C a.shape[0]])\n    return out\n\n\nMAX_CONV = 4\nTPB = 8\nTPB_MAX_CONV = TPB + MAX_CONV\ndef conv_test(cuda):\n    def call(out, a, b, a_size, b_size) -> None:\n        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n        local_i = cuda.threadIdx.x\n\n        # FILL ME IN (roughly 17 lines)\n\n    return call\n\n\n# Test 1\n\nSIZE = 6\nCONV = 3\nout = np.zeros(SIZE)\na = np.arange(SIZE)\nb = np.arange(CONV)\nproblem = CudaProblem(\n    \"1D Conv (Simple)\",\n    conv_test,\n    [a, b],\n    out,\n    [SIZE, CONV],\n    Coord(1, 1),\n    Coord(TPB, 1),\n    spec=conv_spec,\n)\nproblem.show()\n```\n\n    # 1D Conv (Simple)\n     \n       Score (Max Per Thread):\n       |  Global Reads | Global Writes |  Shared Reads | Shared Writes |\n       |             0 |             0 |             0 |             0 | \n    \n\n\n\n\n\n    \n![svg](GPU_puzzlers_files\u002FGPU_puzzlers_50_1.svg)\n    \n\n\n\n\n```python\nproblem.check()\n```\n\n    Failed Tests.\n    Yours: [0. 0. 0. 0. 0. 0.]\n    Spec : [ 5.  8. 11. 14.  5.  0.]\n\n\nTest 2\n\n\n```python\nout = np.zeros(15)\na = np.arange(15)\nb = np.arange(4)\nproblem = CudaProblem(\n    \"1D Conv (Full)\",\n    conv_test,\n    [a, b],\n    out,\n    [15, 4],\n    Coord(2, 1),\n    Coord(TPB, 1),\n    spec=conv_spec,\n)\nproblem.show()\n```\n\n    # 1D Conv (Full)\n     \n       Score (Max Per Thread):\n       |  Global Reads | Global Writes |  Shared Reads | Shared Writes |\n       |             0 |             0 |             0 |             0 | \n    \n\n\n\n\n\n    \n![svg](GPU_puzzlers_files\u002FGPU_puzzlers_53_1.svg)\n    \n\n\n\n\n```python\nproblem.check()\n```\n\n    Failed Tests.\n    Yours: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 
0.]\n    Spec : [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]\n\n\n## Puzzle 12 - Prefix Sum\n\nImplement a kernel that computes a sum over `a` and stores it in `out`.\nIf the size of `a` is greater than the block size, only store the sum of\neach block.\n\nWe will do this using the [parallel prefix sum](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FPrefix_sum) algorithm in shared memory.\nThat is, each step of the algorithm should sum together half the remaining numbers.\nFollow this diagram:\n\n![](https:\u002F\u002Fuser-images.githubusercontent.com\u002F35882\u002F178757889-1c269623-93af-4a2e-a7e9-22cd55a42e38.png)\n\n\n```python\nTPB = 8\ndef sum_spec(a):\n    out = np.zeros((a.shape[0] + TPB - 1) \u002F\u002F TPB)\n    for j, i in enumerate(range(0, a.shape[-1], TPB)):\n        out[j] = a[i : i + TPB].sum()\n    return out\n\n\ndef sum_test(cuda):\n    def call(out, a, size: int) -> None:\n        cache = cuda.shared.array(TPB, numba.float32)\n        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n        local_i = cuda.threadIdx.x\n        # FILL ME IN (roughly 12 lines)\n\n    return call\n\n\n# Test 1\n\nSIZE = 8\nout = np.zeros(1)\ninp = np.arange(SIZE)\nproblem = CudaProblem(\n    \"Sum (Simple)\",\n    sum_test,\n    [inp],\n    out,\n    [SIZE],\n    Coord(1, 1),\n    Coord(TPB, 1),\n    spec=sum_spec,\n)\nproblem.show()\n```\n\n    # Sum (Simple)\n     \n       Score (Max Per Thread):\n       |  Global Reads | Global Writes |  Shared Reads | Shared Writes |\n       |             0 |             0 |             0 |             0 | \n    \n\n\n\n\n\n    \n![svg](GPU_puzzlers_files\u002FGPU_puzzlers_58_1.svg)\n    \n\n\n\n\n```python\nproblem.check()\n```\n\n    Failed Tests.\n    Yours: [0.]\n    Spec : [28.]\n\n\nTest 2\n\n\n```python\nSIZE = 15\nout = np.zeros(2)\ninp = np.arange(SIZE)\nproblem = CudaProblem(\n    \"Sum (Full)\",\n    sum_test,\n    [inp],\n    out,\n    [SIZE],\n    Coord(2, 1),\n    Coord(TPB, 1),\n    
spec=sum_spec,\n)\nproblem.show()\n```\n\n    # Sum (Full)\n     \n       Score (Max Per Thread):\n       |  Global Reads | Global Writes |  Shared Reads | Shared Writes |\n       |             0 |             0 |             0 |             0 | \n    \n\n\n\n\n\n    \n![svg](GPU_puzzlers_files\u002FGPU_puzzlers_61_1.svg)\n    \n\n\n\n\n```python\nproblem.check()\n```\n\n    Failed Tests.\n    Yours: [0. 0.]\n    Spec : [28. 77.]\n\n\n## Puzzle 13 - Axis Sum\n\nImplement a kernel that computes a sum over each column of `a` and stores it in `out`.\n\n\n```python\nTPB = 8\ndef sum_spec(a):\n    out = np.zeros((a.shape[0], (a.shape[1] + TPB - 1) \u002F\u002F TPB))\n    for j, i in enumerate(range(0, a.shape[-1], TPB)):\n        out[..., j] = a[..., i : i + TPB].sum(-1)\n    return out\n\n\ndef axis_sum_test(cuda):\n    def call(out, a, size: int) -> None:\n        cache = cuda.shared.array(TPB, numba.float32)\n        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n        local_i = cuda.threadIdx.x\n        batch = cuda.blockIdx.y\n        # FILL ME IN (roughly 12 lines)\n\n    return call\n\n\nBATCH = 4\nSIZE = 6\nout = np.zeros((BATCH, 1))\ninp = np.arange(BATCH * SIZE).reshape((BATCH, SIZE))\nproblem = CudaProblem(\n    \"Axis Sum\",\n    axis_sum_test,\n    [inp],\n    out,\n    [SIZE],\n    Coord(1, BATCH),\n    Coord(TPB, 1),\n    spec=sum_spec,\n)\nproblem.show()\n```\n\n    # Axis Sum\n     \n       Score (Max Per Thread):\n       |  Global Reads | Global Writes |  Shared Reads | Shared Writes |\n       |             0 |             0 |             0 |             0 | \n    \n\n\n\n\n\n    \n![svg](GPU_puzzlers_files\u002FGPU_puzzlers_64_1.svg)\n    \n\n\n\n\n```python\nproblem.check()\n```\n\n    Failed Tests.\n    Yours: [[0.]\n     [0.]\n     [0.]\n     [0.]]\n    Spec : [[ 15.]\n     [ 51.]\n     [ 87.]\n     [123.]]\n\n\n## Puzzle 14 - Matrix Multiply!\n\nImplement a kernel that multiplies square matrices `a` and `b` and\nstores the result in 
`out`.\n\n*Tip: The most efficient algorithm here will copy a block into\n shared memory before computing each of the individual row-column\n dot products. This is easy to do if the matrix fits in shared\n memory.  Do that case first. Then update your code to compute\n a partial dot-product and iteratively move the part you\n copied into shared memory.* You should be able to do the hard case\n in 6 global reads.\n\n\n```python\ndef matmul_spec(a, b):\n    return a @ b\n\n\nTPB = 3\ndef mm_oneblock_test(cuda):\n    def call(out, a, b, size: int) -> None:\n        a_shared = cuda.shared.array((TPB, TPB), numba.float32)\n        b_shared = cuda.shared.array((TPB, TPB), numba.float32)\n\n        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x\n        j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y\n        local_i = cuda.threadIdx.x\n        local_j = cuda.threadIdx.y\n        # FILL ME IN (roughly 14 lines)\n\n    return call\n\n# Test 1\n\nSIZE = 2\nout = np.zeros((SIZE, SIZE))\ninp1 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))\ninp2 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE)).T\n\nproblem = CudaProblem(\n    \"Matmul (Simple)\",\n    mm_oneblock_test,\n    [inp1, inp2],\n    out,\n    [SIZE],\n    Coord(1, 1),\n    Coord(TPB, TPB),\n    spec=matmul_spec,\n)\nproblem.show(sparse=True)\n```\n\n    # Matmul (Simple)\n     \n       Score (Max Per Thread):\n       |  Global Reads | Global Writes |  Shared Reads | Shared Writes |\n       |             0 |             0 |             0 |             0 | \n    \n\n\n\n\n\n    \n![svg](GPU_puzzlers_files\u002FGPU_puzzlers_67_1.svg)\n    \n\n\n\n\n```python\nproblem.check()\n```\n\n    Failed Tests.\n    Yours: [[0. 0.]\n     [0. 
0.]]\n    Spec : [[ 1  3]\n     [ 3 13]]\n\n\nTest 2\n\n\n```python\nSIZE = 8\nout = np.zeros((SIZE, SIZE))\ninp1 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))\ninp2 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE)).T\n\nproblem = CudaProblem(\n    \"Matmul (Full)\",\n    mm_oneblock_test,\n    [inp1, inp2],\n    out,\n    [SIZE],\n    Coord(3, 3),\n    Coord(TPB, TPB),\n    spec=matmul_spec,\n)\nproblem.show(sparse=True)\n```\n\n    # Matmul (Full)\n     \n       Score (Max Per Thread):\n       |  Global Reads | Global Writes |  Shared Reads | Shared Writes |\n       |             0 |             0 |             0 |             0 | \n    \n\n\n\n\n\n    \n![svg](GPU_puzzlers_files\u002FGPU_puzzlers_70_1.svg)\n    \n\n\n\n\n```python\nproblem.check()\n```\n\n    Failed Tests.\n    Yours: [[0. 0. 0. 0. 0. 0. 0. 0.]\n     [0. 0. 0. 0. 0. 0. 0. 0.]\n     [0. 0. 0. 0. 0. 0. 0. 0.]\n     [0. 0. 0. 0. 0. 0. 0. 0.]\n     [0. 0. 0. 0. 0. 0. 0. 0.]\n     [0. 0. 0. 0. 0. 0. 0. 0.]\n     [0. 0. 0. 0. 0. 0. 0. 0.]\n     [0. 0. 0. 0. 0. 0. 0. 0.]]\n    Spec : [[  140   364   588   812  1036  1260  1484  1708]\n     [  364  1100  1836  2572  3308  4044  4780  5516]\n     [  588  1836  3084  4332  5580  6828  8076  9324]\n     [  812  2572  4332  6092  7852  9612 11372 13132]\n     [ 1036  3308  5580  7852 10124 12396 14668 16940]\n     [ 1260  4044  6828  9612 12396 15180 17964 20748]\n     [ 1484  4780  8076 11372 14668 17964 21260 24556]\n     [ 1708  5516  9324 13132 16940 20748 24556 28364]]\n\n","GPU Puzzles is an interactive project for learning CUDA programming by solving puzzles. It uses a Jupyter Notebook and the Numba library to map Python code directly to CUDA kernels, so users can practice programming in a way that is close to native CUDA and quickly pick up the fundamentals and practical applications of GPU programming. It is especially suited to learners who know some machine learning and want a deeper understanding of the underlying GPU computation, and to developers who want to get better at using GPUs for deep learning. The whole tutorial is designed as a fully interactive experience that encourages learning through practice, and it recommends running the code in Google Colab with GPU mode enabled.",2,"2026-06-11 03:24:00","top_topic"]