[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83447":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":12,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":14,"stars7d":15,"stars30d":15,"stars90d":13,"forks30d":13,"starsTrendScore":16,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":39,"readmeContent":40,"aiSummary":9,"trendingCount":13,"starSnapshotCount":13,"syncStatus":41,"lastSyncTime":42,"discoverSource":43},83447,"tessera","zengxiao-he\u002Ftessera","zengxiao-he","From teacher to tiles — a from-scratch LLM distillation & serving engine: custom Triton\u002FCUDA kernels, FSDP distillation, paged-KV continuous batching, speculative decoding, a Rust gateway, a JAX oracle, and interpretability tooling.",null,"Python",112,1,0,23,58,94,0.9,"Other",false,"main",true,[23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38],"cuda","flash-attention","fsdp","inference-engine","jax","knowledge-distillation","kv-cache","llm","mechanistic-interpretability","ml-systems","paged-attention","pytorch","quantization","rust","speculative-decoding","triton","2026-06-12 02:04:34","# Tessera\n\nA small, from-scratch LLM stack built around one goal: distill a large teacher into a small\nstudent, then serve that student efficiently. Keeping that goal end-to-end means touching most\nof the pieces that matter in practice — custom GPU kernels, sharded training, an inference\nengine, quantization, and a serving front end — without any of it being a toy.\n\n[![CI](https:\u002F\u002Fgithub.com\u002Fzengxiao-he\u002Ftessera\u002Factions\u002Fworkflows\u002Fci.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fzengxiao-he\u002Ftessera\u002Factions\u002Fworkflows\u002Fci.yml)\n![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache%202.0-blue)\n![Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10%2B-blue)\n![Rust](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Frust-1.75%2B-orange)\n\nIt runs and is unit-tested on a laptop (CPU or Apple MPS). The Triton\u002FCUDA kernels are written\nfor NVIDIA GPUs; on anything else the model transparently falls back to a torch reference, and\nthe kernels are checked against that reference whenever a GPU is available.\n\n```mermaid\nflowchart LR\n    T[Teacher 40M] -->|distill, FSDP\u002FZeRO-3| S[Student 6M]\n    S -->|int8 \u002F AWQ \u002F FP8| ENG[Inference engine]\n    subgraph ENG[Inference engine]\n      direction TB\n      PK[paged KV cache] --- SCH[continuous batching] --- SPEC[speculative decode]\n    end\n    K[(Triton + CUDA kernels)] -.-> S & ENG\n    GW[Rust tokio\u002Faxum gateway] -->|PyO3| ENG\n    client([client]) --> GW\n```\n\n## What's in it\n\nTraining side:\n\n- Decoder transformer with RMSNorm, RoPE, grouped-query attention and SwiGLU ([`tessera\u002Fmodel`](tessera\u002Fmodel)).\n- Knowledge-distillation losses: temperature-scaled KL, optional hard CE, hidden-state matching ([`distill\u002Flosses.py`](tessera\u002Fdistill\u002Flosses.py)).\n- FSDP\u002FZeRO-3 written from scratch — flat-parameter sharding with a sharded Adam optimizer. It's checked to be numerically identical to single-process training, in one process and across two gloo ranks ([`distill\u002Ffsdp.py`](tessera\u002Fdistill\u002Ffsdp.py)).\n- Atomic, sharded checkpoints with resume-from-latest ([`distill\u002Fcheckpoint.py`](tessera\u002Fdistill\u002Fcheckpoint.py)).\n\nKernels:\n\n- A FlashAttention forward kernel in Triton: online softmax, causal masking, GQA, autotuned tile sizes ([`kernels\u002Ftriton\u002Fflash_attention.py`](tessera\u002Fkernels\u002Ftriton\u002Fflash_attention.py)).\n- Fused RMSNorm, a fused SwiGLU GEMM, and an int8 weight-only matmul that dequantizes in the K-loop ([`kernels\u002Ftriton`](tessera\u002Fkernels\u002Ftriton)).\n- Raw CUDA C++ versions of RMSNorm and attention for the low-level memory work, plus nvtx ranges and Nsight notes ([`kernels\u002Fcuda`](tessera\u002Fkernels\u002Fcuda)).\n\nServing:\n\n- Block-paged KV cache with a ref-counted allocator for prefix sharing ([`serve\u002Fpaged_kv.py`](tessera\u002Fserve\u002Fpaged_kv.py)).\n- A continuous-batching scheduler that recomposes the batch every step, with admission control and preemption under memory pressure ([`serve\u002Fscheduler.py`](tessera\u002Fserve\u002Fscheduler.py)).\n- Speculative decoding with the standard accept\u002Freject sampling ([`serve\u002Fspeculative.py`](tessera\u002Fserve\u002Fspeculative.py)).\n- Post-training quantization: int8 weight-only, AWQ, and an FP8 (E4M3) path ([`quant`](tessera\u002Fquant)).\n- A Rust gateway (tokio + axum) that handles HTTP and admission back-pressure and calls into the Python engine over PyO3 ([`tessera-rs`](tessera-rs)).\n\nExtras:\n\n- A JAX\u002FXLA reimplementation of the forward pass, used as an independent parity check against PyTorch ([`jax_ref`](jax_ref)).\n- Interpretability helpers: activation hooks, a logit lens, and induction-head detection ([`interp`](tessera\u002Finterp)).\n- A byte-level BPE tokenizer plus image-patch and log-mel audio front ends for multimodal data ([`data`](tessera\u002Fdata)).\n\n## Quickstart\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fzengxiao-he\u002Ftessera && cd tessera\npython -m venv .venv && source .venv\u002Fbin\u002Factivate\npip install torch --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcpu   # or a CUDA build\npip install -e \".[dev]\"\n\npytest -m \"not gpu\"          # CPU tests; the kernel tests skip without a GPU\ntessera info                 # list presets and parameter counts\npython examples\u002Fserve.py     # continuous batching + speculative decoding\npython examples\u002Ftrain_distill.py --steps 30\npython examples\u002Finterp_demo.py\n```\n\nOn a Linux box with an NVIDIA GPU:\n\n```bash\npip install -e \".[dev,gpu]\"  # adds Triton\npytest -m gpu                # Triton kernels vs the torch reference\n```\n\nRust gateway:\n\n```bash\ncd tessera-rs && cargo test && cargo run --release\ncurl -s localhost:8080\u002Fgenerate -H 'content-type: application\u002Fjson' \\\n  -d '{\"prompt\":\"hello\",\"params\":{\"max_new_tokens\":16}}'\n```\n\n## Benchmarks\n\nThese are from an Apple M2 Pro running the torch reference path, so treat them as a floor\nrather than what the fused kernels do on a real GPU. Run `pytest -m gpu` and the scripts in\n`benchmarks\u002F` on NVIDIA hardware for the numbers that actually matter.\n\n| Workload | Config | M2 Pro |\n|---|---|---|\n| Forward pass | tessera-tiny (6M), B=2, T=128, MPS | ~100k tok\u002Fs, 2.6 ms |\n| Attention (reference) | B=2, H=8, T=256, D=64, MPS | 1.85 TFLOP\u002Fs, 0.15 ms |\n| Engine decode | tessera-tiny, 6 reqs x 48 tok | ~200 tok\u002Fs |\n| Speculative decode | self-draft, greedy | 98-100% acceptance |\n\n```bash\npython benchmarks\u002Fbench_attention.py --seq-len 1024\npython benchmarks\u002Fbench_throughput.py --requests 8\n```\n\n## Tests\n\nThe repo is small enough to test thoroughly, and most of the tests check a property rather\nthan a fixed output:\n\n- Incremental decode with the KV cache matches a full forward pass.\n- Each Triton kernel matches its torch reference within fp tolerance (run with `-m gpu`).\n- The JAX forward matches PyTorch to about 2e-4.\n- Sharded Adam matches `torch.optim.Adam` step for step, in one process and over two gloo ranks.\n- Self-speculation reproduces greedy decoding exactly.\n- The engine drains every request under tight memory and preemption without leaking KV blocks.\n\n## Layout\n\n```\ntessera\u002F            model, kernels, quant, serve, distill, interp, data\n  kernels\u002Ftriton\u002F   Triton kernels      kernels\u002Fcuda\u002F  raw CUDA C++\n  serve\u002F            paged KV, scheduler, speculative, engine\n  distill\u002F          KD losses, FSDP, checkpoint, trainer\ntessera-rs\u002F         Rust tokio\u002Faxum gateway + PyO3 bindings\njax_ref\u002F            JAX\u002FXLA reference\ntests\u002F  examples\u002F  benchmarks\u002F  docs\u002F\n```\n\nMore detail in [docs\u002Farchitecture.md](docs\u002Farchitecture.md), with notes on the\n[kernels](docs\u002Fkernels.md), [serving](docs\u002Fserving.md), and [distillation](docs\u002Fdistillation.md).\n\n## Status\n\nEverything above works and is tested on CPU. Things I haven't done yet, marked in the code:\na fused attention backward kernel, a fused paged-attention decode kernel, an FP8 tensor-core\nGEMM for Hopper, and `pjit`\u002F`shard_map` training on the JAX side.\n\n## License\n\nApache-2.0, © Zengxiao He.\n",2,"2026-06-11 04:11:11","CREATED_QUERY"]