[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-81865":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":13,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":15,"starSnapshotCount":15,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},81865,"matching-engine","AsthaMishra\u002Fmatching-engine","AsthaMishra","Sub-100ns limit order book + matching engine in Rust — validated on 108M real NASDAQ ITCH 5.0 ops","",null,"Rust",42,4,23,0,1,19,3,2.1,false,"main",true,[],"2026-06-12 02:04:20","# Rust Matching Engine\n\n> Sub-100 ns order book operations · 108M ops validated on real NASDAQ ITCH 5.0 data · lock-free per-symbol thread model · arena allocation\n\nHigh-performance Limit Order Book + Matching Engine written in Rust, targeting the latency and throughput profile of real exchange infrastructure.\n\n**Why build this?** The matching engine is the highest-stakes component in a trading system — every microsecond of latency is a competitive edge in HFT. This project implements the core algorithms from first principles: price-time priority matching, flat-array price levels, bitmap-indexed BBO, arena allocation, and a lock-free per-symbol thread model. 
It was validated against real NASDAQ ITCH 5.0 market data (108M operations across 100 symbols) and benchmarked with statistical rigor using Criterion.\n\n**Key numbers (NASDAQ ITCH 5.0, Jan 30 2020 — 100 symbols, 108M ops):**\n- p50 = **98 ns** · p99 = **1.9 µs** · p99.9 = **4.3 µs** per order book operation\n- **~4.1M ops\u002Fsec** sequential throughput across 100 live symbols\n- Validated correctness with fuzz testing (`cargo-fuzz`) — found and fixed 4 real bugs\n\n---\n\n## Architecture\n\n```mermaid\ngraph TD\n    subgraph Client\n        REQ[HTTP Request]\n    end\n\n    subgraph HTTP[\"HTTP Layer — Axum async thread pool (tokio)\"]\n        R1[\"POST \u002Fadd_order\"]\n        R2[\"POST \u002Fcancel_order\"]\n        R3[\"POST \u002Fupdate_order\"]\n        R4[\"GET \u002Fbbo · \u002Fdepth · \u002Fmicroprice · \u002Fimbalance\"]\n    end\n\n    subgraph State[\"AppState (immutable after startup)\"]\n        S[\"Arc&lt;HashMap&lt;symbol_id → Sender&gt;&gt;\\n— lock-free lookup on hot path —\"]\n    end\n\n    subgraph Matchers[\"Matcher Threads — one OS thread per symbol\"]\n        M1[\"matcher-AAPL\\nbumpalo::Bump arena\\nOrderBook&lt;'_&gt;\"]\n        M2[\"matcher-TSLA\\nbumpalo::Bump arena\\nOrderBook&lt;'_&gt;\"]\n        M3[\"matcher-NVDA\\nbumpalo::Bump arena\\nOrderBook&lt;'_&gt;\"]\n    end\n\n    subgraph Book[\"OrderBook internals\"]\n        B1[\"bid \u002F ask: Vec&lt;Option&lt;PriceLevel&gt;&gt;\\n100 000 pre-allocated slots\\nO(1) index = price \u002F tick_size\"]\n        B2[\"bid_bitmap \u002F ask_bitmap: Vec&lt;u64&gt;\\n1 bit per price slot · 25 KB total\\ntrailing\u002Fleading_zeros for next level\"]\n        B3[\"order_index: Vec&lt;Option&lt;(side, price, qty, slot)&gt;&gt;\\nO(1) cancel via direct slot index\"]\n        B4[\"best_bid_idx \u002F best_ask_idx: Option&lt;usize&gt;\\ncached — BBO query = 1 field read\"]\n    end\n\n    REQ --> R1 & R2 & R3 & R4\n    R1 & R2 & R3 & R4 -->|symbol lookup| S\n    S -->|crossbeam_channel send| 
M1 & M2 & M3\n    M1 & M2 & M3 -->|oneshot response| R1 & R2 & R3 & R4\n    M1 --- Book\n```\n\n**Request flow:** Each HTTP handler looks up the symbol's sender in an `Arc\u003CHashMap>` (no lock), sends a `BookRequest` with an embedded `oneshot::Sender`, then `await`s the response. The matcher thread owns the `OrderBook` and arena exclusively — no locks needed on the matching path.\n\n---\n\n## Benchmarks\n\nAll benchmarks run on x86-64 WSL2 via [Criterion](https:\u002F\u002Fgithub.com\u002Fbheisler\u002Fcriterion.rs). Run with `cargo bench`.\n\n### Latency (single operation, warm cache)\n\n| Operation | p50 | p99 | p99.9 |\n|---|---|---|---|\n| BBO query | \u003C 1 ns | \u003C 1 ns | \u003C 1 ns |\n| Place order (warm level) | \u003C 100 ns | \u003C 100 ns | \u003C 100 ns |\n| Cancel order (mid-book) | \u003C 100 ns | 100 ns | 100 ns |\n| Top-of-book match | 100 ns | 100 ns | 100 ns |\n| Top-N depth (20 levels) | ~950 ns | — | — |\n| Market sweep (20 levels) | ~5.3 µs | — | — |\n\n### Throughput\n\n| Scenario | Throughput |\n|---|---|\n| Insert into existing level (warm) | **~28M orders\u002Fs** |\n| Add maker + taker (immediate match) | **~4.2M orders\u002Fs** |\n\n### Real-World Validation — NASDAQ ITCH 5.0\n\nReplayed a full trading day of AAPL order flow from NASDAQ TotalView-ITCH 5.0 (Jan 30, 2020):\n\n| Metric | Value |\n|---|---|\n| Total operations | 1,937,879 |\n| Add orders | 907,118 |\n| Deletes | 869,275 |\n| Replaces | 151,325 |\n| Partial cancels | 10,161 |\n| Orders deleted without executing | ~96% |\n\n| Latency | ns |\n|---|---|\n| p50 | **100 ns** |\n| p90 | 199 ns |\n| p99 | **1,799 ns** |\n| p99.9 | **14,878 ns** |\n| max | 22,063,506 ns *(OS scheduling jitter on WSL2)* |\n| Hot-path throughput | **~9M ops\u002Fsec** |\n\nThe 96% delete rate confirms real HFT quote-flickering behavior — market makers continuously update their quotes, rarely holding positions. 
The 22 ms max is entirely OS preemption (WSL2 on Windows); on bare Linux with CPU pinning (`taskset -c 0`) and `SCHED_FIFO`, the tail collapses significantly.\n\nSee [`src\u002Fbin\u002Fitch_replay.rs`](src\u002Fbin\u002Fitch_replay.rs) for the single-symbol replay tool. Run with:\n```bash\ncargo run --release --bin itch_replay -- \u003Citch_file> AAPL\n```\n\n#### Multi-Symbol Replay — Top 100 Symbols\n\nReplayed the same file across the top 100 symbols by order volume, sequentially through independent order books:\n\n| Metric | Value |\n|---|---|\n| Symbols replayed | 100 |\n| Total operations | 108,081,475 |\n| Total time | 25.6 s |\n| Throughput | **~4.1M ops\u002Fsec** (single-threaded, sequential) |\n\n| Latency (aggregate) | ns |\n|---|---|\n| p50 | **98 ns** |\n| p99 | **1,914 ns** |\n| p99.9 | **4,284 ns** |\n| mean | 128 ns |\n\nSelected per-symbol results:\n\n| Symbol | Ops | p50 | p99 | p99.9 | mean |\n|---|---|---|---|---|---|\n| QQQ | 4,704,626 | 100 ns | 1,914 ns | 18,637 ns | 171 ns |\n| SPY | 4,375,605 | 101 ns | 2,014 ns | 5,642 ns | 150 ns |\n| AAPL | 1,937,879 | 101 ns | 2,116 ns | 4,634 ns | 182 ns |\n| MSFT | 1,746,995 | 100 ns | 2,006 ns | 4,312 ns | 143 ns |\n| AMD | 2,315,082 | 100 ns | 907 ns | 6,044 ns | 132 ns |\n\np99.9 spikes (e.g. QQQ at 18 µs) are OS scheduling jitter from WSL2 — longer-running symbols accumulate more preemption events. On bare Linux with `SCHED_FIFO`, tail latency collapses to the low-µs range.\n\nSee [`src\u002Fbin\u002Fitch_replay_all.rs`](src\u002Fbin\u002Fitch_replay_all.rs) for the multi-symbol tool. 
Run with:\n```bash\ncargo run --release --bin itch_replay_all -- \u003Citch_file> [top_n]\n```\n\n### Optimization Progression\n\nStarting from a `BTreeMap`-based implementation, each optimization is measured and documented in [optimization_notes.md](optimization_notes.md).\n\n| Stage | Cancel (depth 1000) | BBO | Top-N\u002F20 | Warm insert |\n|---|---|---|---|---|\n| BTreeMap baseline | 78 µs | 5 ns | 49 ns | ~7M\u002Fs |\n| + Flat array (`Vec\u003COption\u003CPriceLevel>>`) | 246 ns | 354 ns | ~53 µs | ~8M\u002Fs |\n| + Cached BBO (`best_bid_idx`) | 246 ns | **0.88 ns** | ~53 µs | ~10M\u002Fs |\n| + `Vec` order_index (no HashMap) | 246 ns | 0.88 ns | ~53 µs | **~10M\u002Fs** |\n| + Bitmap `top_n_levels` | 246 ns | 0.88 ns | **~800 ns** | ~10M\u002Fs |\n| + Active flag + slot index + Arena | **~100 ns p99** | ~1 ns | ~950 ns | **~28M\u002Fs** |\n\n---\n\n## Key Optimizations\n\n### 1. Flat Array Price Levels — eliminates pointer chasing\n\n`BTreeMap\u003Ci64, PriceLevel>` allocates each tree node separately. Traversing to a price level follows 3–4 heap pointers — each a potential 50–100 ns cache miss. Cancel scaled linearly with book depth: 700 ns at depth 10, 78 µs at depth 1000, despite being O(1) algorithmically.\n\nReplaced with `Vec\u003COption\u003CPriceLevel>>` pre-allocated to `MAX_PRICE \u002F TICK_SIZE` slots. Array index is `price \u002F tick_size` — O(1), direct, no traversal. Cancel dropped from 78 µs → 246 ns at depth 1000 (**318×**).\n\n```rust\n\u002F\u002F Before: BTreeMap node traversal (cache miss per node)\npub bid: BTreeMap\u003Ci64, PriceLevel>,\n\n\u002F\u002F After: direct index, sequential memory\npub bid: Vec\u003COption\u003CPriceLevel>>,   \u002F\u002F 100 000 pre-allocated slots\npub bid_bitmap: Vec\u003Cu64>,           \u002F\u002F 1 bit per slot, 25 KB total\n```\n\n### 2. 
Cached BBO + Bitmap Scan\n\nIterating the flat array to find the best bid\u002Fask was O(MAX_PRICE) — scanning up to 100 000 slots on every BBO query.\n\nAdded `best_bid_idx: Option\u003Cusize>` and `best_ask_idx: Option\u003Cusize>`, maintained on every insert and cancel. BBO query is now a single field read: **0.88 ns**. Bitmap scan only runs when the best level fully drains (rare).\n\nWhen a scan is needed, `trailing_zeros()` \u002F `leading_zeros()` find the next occupied slot in one CPU instruction per 64 price slots. The scan starts from the word containing the current best — typically O(1) for adjacent levels.\n\n### 3. Bitmap `top_n_levels` — 98.5% improvement\n\nThe original `top_n_levels` iterated all 100 000 array slots: ~53 µs for any N, a 2400× regression vs BTreeMap.\n\nNow iterates bitmap words (1563 × u64) instead, extracting only occupied slots with Brian Kernighan's trick (`w &= w - 1` clears the lowest set bit in one instruction). Result: ~800 ns regardless of book depth.\n\n### 4. `Vec\u003COption\u003C...>>` order_index — replaces HashMap\n\n`HashMap` uses SipHash (a cryptographic hasher) on every `place_order` and `cancel_order`. Replaced with `Vec\u003COption\u003C(Side, price, qty, slot_idx)>>` indexed directly by order ID — zero hash overhead, cache-friendly sequential access.\n\nThe fourth field `slot_idx` is the position within the price level's `orders` Vec. Cancel is a direct `orders[slot_idx].active = false` — no scan.\n\nPer-book sequential IDs (each `OrderBook` owns a `next_id: usize` counter starting at 0) keep the Vec dense. This matches real exchange behaviour — order IDs are session-scoped and reset at end-of-day.\n\n### 5. Active Flag + `head_idx` — O(1) cancel and matching\n\nPreviously, cancel called `retain()` (O(n) scan + memory shift) and matching called `pop_front()` (VecDeque bookkeeping). 
Replaced with:\n\n- **`active: bool` on each Order** — cancel flips one flag, no memory movement\n- **`active_count: usize` on each PriceLevel** — level clears only when all orders are gone\n- **`head_idx: usize`** — matching advances an integer past fully-filled front orders instead of popping from a ring buffer\n\nSelf-trade prevention uses *Cancel Resting* (industry standard: CME, NYSE, Nasdaq). The resting order from the same trader is cancelled inline; the incoming order continues matching against other traders.\n\n### 6. Bumpalo Arena Allocator\n\nEach new `PriceLevel` previously called `Vec::new()` (a `malloc`). Replaced with `bumpalo::Bump` — a monotonic bump allocator. Each `Vec\u003COrder>` inside a price level is allocated from the arena via a pointer increment. No `free` per order.\n\nEnd-of-day reset: one `arena.reset()` call reclaims all memory, O(1) regardless of daily order volume.\n\n`bumpalo::Bump` is `!Send` — it cannot cross thread boundaries. This constraint enforces the correct architecture: each matcher thread owns its arena and book exclusively. No synchronisation needed on the matching path.\n\n### 7. Per-symbol Matcher Threads — no lock on hot path\n\nPreviously `Arc\u003CRwLock\u003CExchange>>` serialised all symbols — an AAPL order would block a TSLA query.\n\nNow each symbol runs on its own OS thread owning a `Bump` arena and `OrderBook`. HTTP handlers look up the symbol's `crossbeam_channel::Sender` in an `Arc\u003CHashMap>` (immutable after startup — no lock) and send a `BookRequest` with an embedded `oneshot::Sender` for the response.\n\n```\nHTTP thread (async) ──crossbeam send──> matcher-AAPL (OS thread, no locks)\n                    \u003C──oneshot resp────\n```\n\n---\n\n## Order Types\n\n| Type | Behaviour |\n|---|---|\n| **Limit** | Rest at specified price or better. Stays in book until filled or cancelled. |\n| **Market** | Fill immediately at best available price. Cancelled if book is empty. 
|\n| **IOC** (Immediate-Or-Cancel) | Fill what is available now, cancel the remainder. |\n| **FOK** (Fill-Or-Kill) | Fill the entire quantity immediately or cancel. |\n\n---\n\n## REST API\n\n### Orders\n\n```\nPOST \u002Fadd_order\nBody: { \"trader_id\": 1, \"symbol\": \"AAPL\", \"order_type\": \"Limit\",\n        \"side\": \"Bid\", \"price\": 19000, \"qty\": 10 }\nReturns: { \"success\": true, \"data\": [ \u003Ctrades> ] }\n\nPOST \u002Fcancel_order\nBody: { \"symbol\": \"AAPL\", \"order_id\": 42 }\nReturns: { \"success\": true }\n\nPOST \u002Fupdate_order\nBody: { \"symbol\": \"AAPL\", \"order_id\": 42, \"new_price\": 19100, \"new_qty\": 5 }\nReturns: { \"success\": true, \"data\": [ \u003Ctrades> ] }\n```\n\nPrices are in **cents** (integer). $190.00 = `19000`. Valid range: $0.01–$999.99.\n\n### Book Queries\n\n```\nGET \u002Fbbo?symbol=AAPL\nReturns: { \"bb\": 18900, \"ba\": 19100 }\n\nGET \u002Fdepth?symbol=AAPL&n=5&side=Bid\nReturns: { \"data\": [[19000, 100], [18900, 50], ...] }\n\nGET \u002Fmicroprice?symbol=AAPL\nReturns: { \"data\": 19001.23 }\n\nGET \u002Fimbalance?symbol=AAPL\nReturns: { \"data\": 0.65 }\n\nGET \u002Fvol_at_price?symbol=AAPL&side=Bid&price=19000\nReturns: { \"data\": 100 }\n```\n\n---\n\n## Design Constraints\n\n| Constraint | Reason |\n|---|---|\n| Price range: $0.01–$999.99 | Flat array is pre-allocated at `MAX_PRICE \u002F TICK_SIZE = 100 000` slots (~9.6 MB). Raising `MAX_PRICE` is one constant change. |\n| Tick size: $0.01 (1 cent) | Sub-cent prices rejected at the API boundary. Standard exchange behaviour. |\n| Per-book sequential order IDs | IDs are session-scoped (reset at EOD like real exchanges). Cancel\u002Fmodify always include the symbol, so per-book IDs are unambiguous. |\n| One thread per symbol | Thread count scales linearly with symbols — correct for tens to low hundreds of instruments. A work-stealing thread pool would be needed for thousands. 
|\n\n---\n\n## Testing & Validation\n\nThree layers of correctness assurance:\n\n### Unit Tests\nCovers all order types, matching scenarios, partial fills, cancellations, price modifications, and edge cases (empty book, zero quantity, self-trade, crossing orders).\n\n### Property-Based Testing (`proptest`)\n10 invariant properties verified across thousands of random inputs:\n- Filled qty always equals `min(bid_qty, ask_qty)` for crossing pairs\n- Partial fill always leaves the correct remainder in the book\n- Book is never left in a crossed state after any limit order\n- FOK fills entirely or not at all — never partial\n- IOC and Market orders never rest in the book\n- Cancel always removes the order and updates volume correctly\n- Price-time priority: earlier order at the same price always fills first\n- No self-trade ever appears in trade output\n\n### Fuzz Testing (`cargo-fuzz` + LibFuzzer)\nRan structured fuzzing with arbitrary sequences of Add\u002FCancel\u002FModify operations. Found and fixed **4 real bugs**:\n\n| Bug | Symptom | Root Cause |\n|---|---|---|\n| Infinite loop | Test hung >60s | Self-trade prevention broke the outer matching loop |\n| Crossed book | `best_bid >= best_ask` | Self-trade skip at final level left ask resting below bid |\n| Subtract overflow | `attempt to subtract with overflow` | `order_index` stored original qty, stale after partial fill |\n| `active_count` drift | `assert: active_count=2 but 3 active orders` | `cancel + re-add` left stale entries; `find` matched deactivated slot |\n\n```bash\ncargo test                          # unit + property tests\ncargo +nightly fuzz run engine_fuzz # continuous fuzz (requires nightly)\n```\n\n---\n\n## Getting Started\n\n```bash\n# Build\ncargo build --release\n\n# Run server (registers symbols at startup)\ncargo run --release\n\n# Test\ncargo test\n\n# Benchmark\ncargo bench\n\n# Run specific benchmark\ncargo bench -- throughput\n```\n\n---\n\n## Related Projects\n\n- **Rust Market Data 
Feed Handler** — real-time Binance WebSocket feed handler in Rust (the natural upstream for this matching engine)\n- **ITCH tools** — [`src\u002Fbin\u002Fitch_reader.rs`](src\u002Fbin\u002Fitch_reader.rs) scans and summarises any ITCH 5.0 file; [`src\u002Fbin\u002Fitch_replay.rs`](src\u002Fbin\u002Fitch_replay.rs) replays a single symbol through the order book with latency percentiles\n\n---\n\n## Stack\n\n- **Rust** (2024 edition)\n- **Axum** — async HTTP framework\n- **Tokio** — async runtime\n- **crossbeam-channel** — MPMC channel (used here with a single consumer per symbol) for order routing\n- **bumpalo** — arena allocator for price level order storage\n- **Criterion** — statistical benchmarking\n\n---\n\n## Note on AI Usage\n\nI wrote the core architecture, all major optimizations, and the majority of the code myself. I used AI assistants for refactoring suggestions and documentation, similar to how many developers use tools like GitHub Copilot. I believe in being transparent about this.\n","This project is a high-performance limit order book and matching engine written in Rust, aiming for latency and throughput characteristics close to those of real exchange infrastructure. Its core features include sub-100-nanosecond order book operations, a price-time priority matching algorithm, flat-array price levels, a bitmap-indexed best bid and offer (BBO), and a lock-free per-symbol thread model. It has been validated against 108 million operations of real NASDAQ ITCH 5.0 market data and rigorously benchmarked with the Criterion library. The project suits high-frequency trading systems with extreme performance requirements, particularly scenarios that need very low latency and high-throughput processing, such as stock exchanges or cryptocurrency trading platforms.",2,"2026-06-11 04:07:01","CREATED_QUERY"]