[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1529":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":16,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":42,"readmeContent":43,"aiSummary":44,"trendingCount":16,"starSnapshotCount":16,"syncStatus":45,"lastSyncTime":46,"discoverSource":47},1529,"Project_Chronos","FonaTech\u002FProject_Chronos","FonaTech","⚡ Zero-Stall MoE Inference via Lookahead Prediction & Async DMA Prefetching. Optimized for SSD I\u002FO with Hybrid MLA+Sliding Window Attention.","",null,"Python",202,20,21,1,0,22,46.17,"Apache License 2.0",false,"main",true,[24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41],"artificial-intelligence","async-dma","dual-layer-moe","generative-ai","high-throughput","io-latency-hiding","large-language-model","llm","lookahead-routing","lora","mixture-of-experts","mla-attention","open-models","open-source","predictive-inference","sliding-window-attention","ssd-offloading","streaming-llm","2026-06-12 04:00:10","# Project Chronos (Experimental)\n\n**A storage-aware MoE stack built for SSD+DRAM hybrid inference, with a full six-stage training pipeline.**\n\n[![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002FProject_Chronos)](https:\u002F\u002Fpypi.org\u002Fproject\u002FProject_Chronos\u002F)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache%202.0-blue)](LICENSE)\n[![Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10%2B-blue)](https:\u002F\u002Fpython.org)\n\n[中文文档](README_zh.md)\n\n---\n\n## Why reactive MoE offload breaks 
down\n\nMainstream MoE models such as Mixtral, DeepSeek-MoE, and Qwen-MoE make routing decisions **at decode time, token by token**. The model discovers which experts it needs only when the next token is already on the critical path. If those experts are not resident in VRAM, generation stalls while weights are moved from SSD to RAM to VRAM.\n\nThat is not a tuning issue. It is an architectural mismatch.\n\nMost offload runtimes assume the model was originally designed for full VRAM residency, then try to patch storage pressure afterward. On consumer hardware, that usually means the decode loop ends up paying the IO bill over and over again.\n\n---\n\n## Chronos in one line\n\n**Move IO into prefill. Move synchronization down to the expert-event level.**\n\n```mermaid\nsequenceDiagram\n    autonumber\n    participant Prompt as Prompt tokens\n    participant Intent as IntentClassifier\n    participant SSD as Clustered SSD cache\n    participant RAM as Pinned RAM\n    participant H2D as H2D stream\n    participant GPU as Decode compute stream\n    participant LR as LookaheadRouter\n\n    rect rgb(236, 248, 255)\n        Prompt->>Intent: prefill reads the whole prompt\n        Intent->>SSD: predict generation-level expert set\n        SSD->>RAM: mmap cluster-packed .ctsr files\n        RAM->>H2D: enqueue hot experts before first decode token\n        H2D-->>GPU: per-expert ready events\n    end\n\n    loop every decode token\n        GPU->>LR: block-0 hidden state\n        LR->>H2D: prefetch experts for t+1..t+K\n        GPU->>GPU: forward token t with resident experts\n        GPU->>H2D: wait only on the expert events it needs\n    end\n```\n\nTwo things matter here:\n\n1. **Prefill-time loading**: `PrefillScheduler` and `IntentClassifier` bulk-load predicted experts before the first decode token.\n2. 
**Per-expert event sync (M3)**: `promote_to_vram(blocking=False)` records a `torch.cuda.Event` on the `_h2d_stream`, and the compute stream waits only on the experts it actually needs. No more `stream.synchronize()` for the whole system.\n\nUnder a simulated 30 ms SSD delay, the new path keeps **35 ms+ per token** of pipeline slack compared with the older blocking path.\n\n---\n\n## Three-tier storage (M1: cluster-aware safetensors)\n\n```mermaid\nflowchart TB\n    subgraph GPU[\"VRAM \u002F Metal unified memory\"]\n        Dense[\"Dense layers\"]\n        Shared[\"Shared experts\u003Cbr\u002F>always resident\"]\n        Hot[\"Predicted hot experts\u003Cbr\u002F>promoted before use\"]\n        Events[\"Per-expert CUDA events\u003Cbr\u002F>no global stream sync\"]\n    end\n\n    subgraph RAM[\"Pinned RAM staging tier\"]\n        Buffer[\"Prefetched expert tensors\"]\n        LRU[\"RAM LRU cache\"]\n    end\n\n    subgraph Disk[\"NVMe SSD\"]\n        C1[\"cluster_000.ctsr\"]\n        C2[\"cluster_001.ctsr\"]\n        Manifest[\"cluster_manifest.json\u003Cbr\u002F>expert -> cluster map\"]\n    end\n\n    Manifest --> C1\n    Manifest --> C2\n    C1 -->|\"sequential mmap read\"| Buffer\n    C2 -->|\"sequential mmap read\"| Buffer\n    Buffer --> LRU\n    LRU -->|\"non-blocking H2D\"| Hot\n    Hot --> Events\n    Shared --> Events\n    Dense --> Events\n```\n\n`cluster_manifest.json` and `.ctsr` files are produced offline by Louvain clustering. 
At runtime, Chronos uses `safetensors.safe_open(...).mmap` to bring an entire expert cluster into RAM in one shot, turning random reads into mostly sequential reads.\n\n### Cache misses degrade gracefully instead of stalling\n\nEven the worst case does not hard-stop generation:\n\n```python\n# Pure tensor math: no Python branch, no graph break under torch.compile\noutput = avail[i] * expert_output + (1.0 - avail[i]) * shared_expert_output\n```\n\nThe shared expert is always resident, so generation continues while the missing expert finishes loading in the background. For exact lazy\u002Foffload comparison modes, Chronos synchronously materializes only the selected missing expert and evicts low-LRU experts to stay inside the resident budget; it does not silently full-load all experts. Quality degrades smoothly only when fallback mode is explicitly enabled.\n\n---\n\n## Dual-layer routing with supervised lookahead (M2)\n\n```mermaid\nflowchart LR\n    Prompt[\"Prompt \u002F prefill context\"] --> IC[\"IntentClassifier\u003Cbr\u002F>generation-level expert prior\"]\n    IC --> Budget[\"ExpertPredictor\u003Cbr\u002F>budgeted hot set\"]\n    Budget --> Prefill[\"PrefillScheduler\u003Cbr\u002F>bulk preload before decode\"]\n\n    Token[\"Decode token t\"] --> Block0[\"Transformer block 0\"]\n    Block0 --> LR[\"LookaheadRouter\u003Cbr\u002F>Q_t^(1..K)\"]\n    LR --> Future[\"Future expert predictions\u003Cbr\u002F>t+1 ... 
t+K\"]\n    Future --> Async[\"AsyncPrefetcher\u003Cbr\u002F>prefetch queue\"]\n\n    Prefill --> Cache[\"CacheManager \u002F ExpertStore\"]\n    Async --> Cache\n    Cache --> MoE[\"ChronosMOEFeedForward\u003Cbr\u002F>resident expert or shared fallback\"]\n```\n\n| | IntentClassifier (Layer 1) | LookaheadRouter (Layer 2) |\n|---|---|---|\n| **When it runs** | Once during prefill | Every token during decode |\n| **Input** | Full prompt (up to 512 tokens) | Hidden state after Block 0 |\n| **Output** | Expert set for the full generation | Expert IDs for t+1, t+2 |\n| **Training target** | Supervised from real activation logs | **`L_lookahead` supervised by real router decisions at t+k** |\n| **Parameter count** | ~10-15M | ~2M |\n\nBefore M2, the lookahead head was just a head with no real supervision. M2 adds a proper soft-target objective:\n\n```math\nL_{\\mathrm{lookahead}}\n= \\frac{1}{|K_{\\mathrm{valid}}|}\n\\sum_{k \\in K_{\\mathrm{valid}}}\n\\mathbb{E}_{b,t}\n\\left[\n  - \\sum_{e=1}^{E}\n  \\mathrm{sg}(P_{b,t+k,e}) \\log Q_{b,t,e}^{(k)}\n\\right]\n```\n\nThat turns the lookahead router into an actual predictor of future routing, instead of a best-effort heuristic.\n\n---\n\n## Full training stack: Stage 1 -> Stage 6\n\nEach stage has its own entry script, and every one of them inherits the shared Chronos loss mixer: balance loss, temporal locality loss, lookahead loss, and, for alignment stages, a router KL anchor that keeps RL\u002FDPO updates from destroying cache locality.\n\n```mermaid\nflowchart LR\n    P1[\"Stage 1\u003Cbr\u002F>Pretrain\u003Cbr\u002F>CE + Chronos mix\"] --> P2[\"Stage 2\u003Cbr\u002F>SFT\u003Cbr\u002F>assistant-token CE + mix\"]\n    P2 --> P3[\"Stage 3\u003Cbr\u002F>DPO\u003Cbr\u002F>preference loss + anchor\"]\n    P2 --> P4[\"Stage 4\u003Cbr\u002F>ORPO\u003Cbr\u002F>NLL + odds-ratio + anchor\"]\n    P4 --> P5[\"Stage 5\u003Cbr\u002F>GRPO\u003Cbr\u002F>rollout reward + KL + anchor\"]\n    P5 --> P6[\"Stage 
6\u003Cbr\u002F>Distill\u003Cbr\u002F>KD + CE + anchor\"]\n\n    subgraph Shared[\"Shared Chronos training terms\"]\n        B[\"raw load-balance aux\"]\n        T[\"temporal locality\"]\n        LA[\"supervised lookahead\"]\n        A[\"router KL anchor\u003Cbr\u002F>alignment stages\"]\n    end\n\n    Shared -. applied to .-> P1\n    Shared -. applied to .-> P2\n    Shared -. applied to .-> P3\n    Shared -. applied to .-> P4\n    Shared -. applied to .-> P5\n    Shared -. applied to .-> P6\n```\n\n| Stage | Script | Core objective | Router KL anchor (default lambda) |\n|---|---|---|---|\n| 1 Pretrain | `train_chronos.py` | CE + balance + temporal + lookahead | 0.0 (off) |\n| 2 SFT | `train_chronos_sft.py` | SFT loss + shared Chronos mix | 0.01 (weak) |\n| 3 DPO | `train_chronos_dpo.py` | DPO `log-sigma(beta * logits)` + mix | 0.10 (strong) |\n| 4 ORPO | `train_chronos_orpo.py` | NLL + lambda * ORPO term | 0.10 |\n| 5 GRPO | `train_chronos_grpo.py` | `PG * A - beta * KL` with `ToyReward` or pluggable `LMRewardModel` | 0.10 |\n| 6 Distill | `train_chronos_distill.py` | `alpha * T^2 * KL(student || teacher) + (1 - alpha) * CE` | 0.05 |\n\nTraining dtype and resource policy:\n\n- `--dtype auto` is the default. MPS\u002FMLX resolve to BF16-first for training stability, CUDA\u002FXPU resolve to FP16, and CPU resolves to FP32 unless `--dtype float16` or `--dtype bfloat16` is set explicitly.\n- CPU training configures PyTorch to use physical cores by default. Override with `--cpu_threads` or `--cpu_budget_percent`.\n- On macOS, MPS\u002FMLX training forces DataLoader workers to `0` by default to avoid Metal command-buffer crashes from multiprocessing. 
CPU\u002FCUDA still use worker processes; advanced users can override the guard with `CHRONOS_ALLOW_METAL_DATALOADER_WORKERS=1`.\n- Native MLX training pushes UI logs, scalar readouts, and chart points every `log_interval` steps, and Web UI Stop is checked at each batch boundary.\n- The Web UI writes a warning-only `\u003Ccheckpoint>.verify.json` after each stage. It checks no-mask vs all-available MoE parity and, on Apple Silicon, MLX prefill logits against the PyTorch CPU baseline.\n\nNative MLX training is a separate Apple Silicon backend, not `torch.to(\"mlx\")`.\nThe six-stage MLX trainer mirrors the PyTorch loss stack in `chronos.mlx.*`:\nmasked CE, DPO\u002FORPO\u002FGRPO\u002Fdistillation losses, load balance, temporal locality,\nlookahead soft-target supervision, lookahead top-k hit loss, and router KL\nanchor. Numerically sensitive pieces run in FP32 even when model weights use\nBF16\u002FFP16, which is why `auto` prefers BF16 on MLX\u002FMPS: FP16 has too little\nexponent range for router softmax, CE, and Adam moments on small unstable\ntraining runs.\n\nThe MLX Web UI trainer reports the same fields as CPU\u002FMPS training:\nstep\u002Floss\u002Fsteps-per-second\u002FETA, checkpoint-save events, stop events, and\nverify results. 
Stop is cooperative at batch boundaries and save intervals\nwrite a valid `.pth` plus sibling `.config.json`, so MLX stages can resume or\nfeed the PyTorch\u002Fexport pipeline.\n\nThe full six-stage comparison harness lives in `tools\u002Fcompare_minimind_chronos_v3.py`.\n\n---\n\n## Backend dispatch (M5)\n\n```mermaid\nflowchart TD\n    Request[\"User request\u003Cbr\u002F>train \u002F inference \u002F WebUI \u002F CLI\"] --> Dispatcher[\"BackendDispatcher\"]\n    Dispatcher --> Probe[\"Capability probes\"]\n    Probe --> CUDA[\"CUDA\u003Cbr\u002F>training + inference\"]\n    Probe --> MPS[\"MPS\u003Cbr\u002F>training + inference\"]\n    Probe --> MLX[\"MLX\u003Cbr\u002F>Apple Silicon native path\"]\n    Probe --> CPU[\"CPU\u003Cbr\u002F>portable fallback\"]\n    Probe --> EXT[\"Extension hooks\u003Cbr\u002F>Vulkan \u002F OpenCL\"]\n    Dispatcher --> Choice[\"Priority + availability decision\"]\n    Choice --> Runtime[\"Chronos runtime \u002F trainer\"]\n```\n\n```python\nfrom chronos.backend import BackendDispatcher\n\nd = BackendDispatcher()\nd.available()   # ['mlx', 'mps', 'cpu'] on Apple Silicon\n                # ['cuda', 'cpu']        on NVIDIA hosts\nd.select()      # choose the best available backend automatically\nd.describe()    # human-readable capability summary\n```\n\n- **First-class backends for training and inference**: `cpu`, `mps`, `cuda`, `mlx`\n- **Inference-only \u002F experimental**: `vulkan` when PyTorch was custom-built with `USE_VULKAN=ON`\n- **Third-party extension hook**: `opencl`, via `chronos\u002Fbackend\u002Fext\u002Fopencl.py:PROBE()`\n- **Apple Silicon policy**: inference auto still prefers MLX; training keeps MLX on the native `chronos.mlx.*` path instead of calling `torch.model.to(\"mlx\")`.\n\nHonest note: upstream PyTorch does not ship a real OpenCL backend, and Vulkan support is still niche. 
Chronos provides a dispatcher seam so external integrations can plug in cleanly without touching core code.\n\n### MLX lazy\u002Foffload runtime\n\nMLX uses Apple unified memory, so Chronos treats \"VRAM\" and \"RAM\" as logical\ntiers rather than physical buses. The native lazy runtime still enforces the\nsame offload contract as CUDA\u002FMPS\u002FCPU:\n\n- **Hot slots** hold only the execution-budget experts materialized as live\n  MLX modules.\n- **Warm cache** holds a bounded prediction buffer loaded from `.ctsr`\n  safetensors, never a hidden copy of every expert.\n- **Cold storage** is the checkpoint\u002Fexport reader or per-expert cluster cache.\n  In lazy mode, live experts are replaced by placeholders after cache creation;\n  Chronos does not keep a `_saved_live` full-expert repair cache.\n- Lookahead predictions are queued into the warm cache and only promoted when\n  ready. A true miss synchronously materializes the selected expert only; it\n  does not full-load the model.\n- MLX attention dynamically grows RoPE lookup tables past\n  `max_position_embeddings`, so long prompts plus decode do not fail at token\n  257\u002F513\u002Fetc.\n\nInference compare\u002Fsweep reports real hot\u002Fwarm counts, resident hit rate,\nprediction hit rate, sync SSD loads, MLX active\u002Fcache\u002Fpeak memory, process RSS,\nprefill\u002Fdecode time, and tokens\u002Fsec. 
A lazy run is not considered\noffload-ready unless deterministic exact-lazy output matches full-DRAM output\nand fallback weight stays zero.\n\n---\n\n## Hugging Face and vLLM compatibility (M5)\n\n- `ChronosForCausalLM` subclasses `PreTrainedModel` and registers `AutoConfig` and `AutoModelForCausalLM`, so loading does **not** require `trust_remote_code`:\n\n  ```python\n  from transformers import AutoModelForCausalLM\n\n  model = AutoModelForCausalLM.from_pretrained(\".\u002Fout_dir\")\n  ```\n\n- `chronos.model.hf_io.save_chronos_pretrained` and `load_chronos_pretrained` emit standard `model.safetensors` + `config.json`, while also carrying `cluster_manifest.json` and `.ctsr` files for expert-cache layout. Roundtrip logit drift is `0.00e+00`.\n\n- `chronos.serving.register_chronos_with_vllm()` registers Chronos with the vLLM `ModelRegistry` when vLLM is installed. If vLLM is absent, it prints an install hint and exits cleanly. Worker-side mask injection is documented in [docs\u002Fvllm_integration.md](docs\u002Fvllm_integration.md).\n\n---\n\n## Compared with existing offload stacks\n\n| Feature | llama.cpp offload | vLLM offload | **Project Chronos** |\n|---|---|---|---|\n| Expert prediction | None | None | **Predictive (`IntentCLF` + `LookaheadRouter`)** |\n| Lookahead training | n\u002Fa | n\u002Fa | **Supervised `L_lookahead` (M2)** |\n| IO timing | During decode, blocking | During decode, blocking | **During prefill, async** |\n| Decode pipeline | Synchronous | Synchronous | **Dual-stream + per-expert events (M3)** |\n| Cache miss behavior | Hard stall | Hard stall | **Soft gating, zero hard stall** |\n| Disk format | GGUF | safetensors | **Cluster-packed safetensors (`.ctsr`)** |\n| Training integration | Post-hoc patch | Post-hoc patch | **Native six-stage stack + router KL anchor** |\n| Backend dispatch | Compile-time fixed | CUDA only | **`cpu` \u002F `mps` \u002F `cuda` \u002F `mlx` + extension hooks** |\n| Apple Silicon support | Partial | No | **Full 
MLX backend** |\n| Hugging Face compatibility | GGUF only | Yes | **Yes, with expert-cache metadata** |\n| vLLM compatibility | n\u002Fa | Native | **Optional adapter** |\n\n---\n\n## Objective\n\n```math\nL_{\\mathrm{total}} =\nL_{\\mathrm{base}}\n+ \\lambda_{\\mathrm{bal}} L_{\\mathrm{aux}}\n+ \\lambda_{\\mathrm{tmp}} L_{\\mathrm{temporal}}\n+ \\lambda_{\\mathrm{LA}} L_{\\mathrm{lookahead}}\n+ \\lambda_{\\mathrm{anc}} L_{\\mathrm{routerKL}}\n```\n\n```math\nL_{\\mathrm{aux}} = E \\sum_{e=1}^{E} load_e \\cdot \\overline{p}_e\n```\n\n```math\nL_{\\mathrm{temporal}} =\n\\mathbb{E}_{b,t}\n\\left[\n  \\left\\| P_{b,t,:} - P_{b,t-1,:} \\right\\|_2^2\n\\right]\n```\n\n```math\nL_{\\mathrm{routerKL}} =\nD_{\\mathrm{KL}}\n\\left(\n  \\pi_{\\theta}^{\\mathrm{router}}\n  \\|\n  \\pi_{\\mathrm{ref}}^{\\mathrm{router}}\n\\right)\n```\n\n- `L_base`: stage-specific objective (`CE`, `DPO`, `ORPO`, `GRPO`, or distillation).\n- `L_aux`: the unscaled MoE load-balance auxiliary term; Chronos applies `lambda_bal` once in `chronos_loss_term`.\n- `L_temporal`: encourages adjacent tokens to reuse similar expert distributions.\n- `L_lookahead`: soft-target cross entropy from the real future router distribution to the lookahead prediction. 
`sg(...)` means stop-gradient.\n- `L_routerKL`: keeps alignment-stage updates from destroying the routing layout captured at stage start.\n\nAll lambda terms are searchable with Optuna TPE, together with structural hyperparameters such as `hidden_size`, `num_experts`, and `kv_latent_dim`.\n\n---\n\n## Installation\n\n```bash\npip install Project_Chronos\n```\n\nOr from source:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FFonaTech\u002FProject_Chronos\ncd Project_Chronos\npip install -e \".[dev]\"\n```\n\n**MLX (Apple Silicon):**\n\n```bash\npip install \"Project_Chronos[mlx]\"\n```\n\n**vLLM serving (optional, Linux + CUDA only):**\n\n```bash\npip install vllm\n```\n\n> **minimind dependency**: Project Chronos uses [minimind](https:\u002F\u002Fgithub.com\u002Fjingyaogong\u002Fminimind) as its MoE kernel.\n> If it is not found locally, Chronos clones it automatically into `~\u002F.cache\u002Fchronos\u002Fminimind-master\u002F` on first import.\n> minimind is licensed under **Apache-2.0**. 
See [THIRD_PARTY_NOTICES.md](THIRD_PARTY_NOTICES.md) for attribution details.\n\n**Requirements**: Python 3.10+, PyTorch 2.4+\n\n---\n\n## Quick start\n\n### Web UI (M6: 8 tabs, 4 languages)\n\n```bash\nchronos-ui\n# or\npython chronos_app.py\n```\n\nTabs included:\n\n- `Config` with a live parameter \u002F memory estimator merged in from the old Designer\n- `Train` with its own `data_path`\n- `6-Stage Pipeline` with per-stage dataset paths\n- `Inference`\n- `Export` for FP16\u002FQ8_0 safetensors and GGUF deployment artifacts\n- `Benchmark` with Markdown table + comparison plots\n- `Auto-Tune` with persistent logs and one-click `Apply Best -> Config`\n- `IO Monitor`\n\nBuilt-in i18n: `zh-Hans`, `zh-Hant`, `en`, `ja`\n\n### Deployment export\n\n```bash\nchronos export \\\n    --model_path .\u002Fout\u002Fsft_384_moe.pth \\\n    --output_dir .\u002Fexports\u002Fsft_384 \\\n    --formats fp16-safetensors q8_0-safetensors fp16-gguf q8_0-gguf\n```\n\nExports include `config.json`, `chronos_export_manifest.json`, and Chronos\nmetadata for MoE top-k, shared fallback experts, lookahead router, hybrid\nattention, and optional expert-cache layout. Chronos can load exported\n`safetensors`\u002F`GGUF` artifacts through its native lazy expert loader.\n\nCompatibility note: the GGUF files use `general.architecture=chronos`. 
Stock\nOllama\u002Fllama.cpp builds need a Chronos architecture adapter to execute them\ncorrectly; Chronos is not a LLaMA tensor-layout clone.\n\n### Stage 1: pretrain\n\n```bash\npython train_chronos.py \\\n    --data_path .\u002Ftests\u002Ffixtures\u002Ftiny_pretrain.jsonl \\\n    --hidden_size 256 --num_hidden_layers 4 --num_experts 4 \\\n    --epochs 1 --device cpu --save_dir .\u002Fout\n```\n\n### Stage 2-5: alignment chain\n\n```bash\npython train_chronos_sft.py  --data_path .\u002Ftests\u002Ffixtures\u002Ftiny_sft.jsonl  --from_weight chronos --save_dir .\u002Fout --device cpu\npython train_chronos_dpo.py  --data_path .\u002Ftests\u002Ffixtures\u002Ftiny_dpo.jsonl  --from_weight sft     --save_dir .\u002Fout --device cpu\npython train_chronos_orpo.py --data_path .\u002Ftests\u002Ffixtures\u002Ftiny_dpo.jsonl  --from_weight sft     --save_dir .\u002Fout --device cpu\npython train_chronos_grpo.py --data_path .\u002Ftests\u002Ffixtures\u002Ftiny_grpo.jsonl --from_weight orpo    --save_dir .\u002Fout --device cpu \\\n    --reward toy   # or lm:\u002Fpath\u002Fto\u002Freward-model\n```\n\n### Stage 6: distillation\n\n```bash\npython train_chronos_distill.py \\\n    --data_path .\u002Ftests\u002Ffixtures\u002Ftiny_sft.jsonl \\\n    --teacher_path .\u002Fout\u002Fsft_192_moe.pth \\\n    --from_weight grpo --save_dir .\u002Fout --device cpu \\\n    --alpha 0.7 --temperature 4.0\n```\n\n### Checkpoint and offload diagnostics\n\nEvery new `.pth` checkpoint writes a sibling `*.config.json` with the MoE\ntopology that cannot be recovered from tensor shapes, including\n`num_experts_per_tok`. 
Use the diagnostic command to verify chat-template\ngeneration, no-mask vs all-available masked drift, cold shared fallback,\nLookaheadRouter prediction quality, and SSD\u002FRAM\u002FVRAM offload stats.\n\n```bash\npython diagnose_checkpoint.py \\\n    --model_path .\u002Fout\u002Fsft_384_moe.pth \\\n    --config_path .\u002Fchronos_config.json \\\n    --sft_data .\u002FDataset\u002Fsft_t2t.jsonl \\\n    --mlx_parity \\\n    --device cpu\n\n# or through the unified CLI:\nchronos diagnose --model_path .\u002Fout\u002Fsft_384_moe.pth --config_path .\u002Fchronos_config.json\n```\n\nFor backend speed and dtype sanity checks:\n\n```bash\npython benchmark_training_backends.py --backends cpu mps mlx --dtypes auto bfloat16 float16 --steps 2\n```\n\n### End-to-end comparison (minimind vs Chronos)\n\n```bash\npython tools\u002Fcompare_minimind_chronos_v3.py \\\n    --pretrain_steps 150 --align_steps 30 --distill_steps 30 \\\n    --simulated_ssd_ms 30 --device cpu \\\n    --output results\u002Fcompare_results_v3.json\n```\n\nThis emits per-stage loss, HF roundtrip logit delta, tokens\u002Fsec, active-expert ratio, resident-expert bytes, M3 pipeline slack, and the backend inventory on the current host.\n\n### Cluster-pack expert storage for sequential SSD reads\n\n```python\nfrom chronos.io.cluster_layout import (\n    collect_activation_log,\n    build_cooccurrence_matrix,\n    try_louvain_clustering,\n    repack_expert_weights_safetensors,\n)\n\nlog = collect_activation_log(model, calib_loader, \"cpu\", max_batches=50)\nclusters = try_louvain_clustering(build_cooccurrence_matrix(log, num_experts))\nrepack_expert_weights_safetensors(model, clusters, \".\u002Fexpert_cache_clustered\")\n```\n\n### Search lambdas and structure hyperparameters automatically\n\n```python\nfrom chronos.tuning.chronos_auto_tuner import ChronosAutoTuner, ChronosSearchSpaceConfig\n\ntuner = ChronosAutoTuner()\ntuner.start(\n    model_id=\".\u002Fout\u002Fchronos_256_moe.pth\",\n    
dataset_path=\".\u002Fdataset\u002Ftrain.jsonl\",\n    search_space=ChronosSearchSpaceConfig(\n        tune_lambda_balance=True,\n        tune_lambda_temporal=True,\n        tune_lambda_lookahead=True,\n        tune_lookahead_steps=True,\n        tune_hidden_size=True,\n        tune_num_experts=True,\n        tune_num_shared_experts=True,\n        tune_kv_latent_dim=True,\n    ),\n    n_trials=20,\n)\n```\n\n---\n\n## Project layout\n\n```text\nProject_Chronos\u002F\n├── chronos\u002F\n│   ├── deps.py                    # Auto-download minimind if missing\n│   ├── __init__.py                # AutoConfig \u002F AutoModelForCausalLM registration\n│   ├── model\u002F\n│   │   ├── config.py              # ChronosConfig\n│   │   ├── hybrid_attention.py    # MLAAttention + SlidingWindowAttention\n│   │   ├── lookahead_router.py    # Per-token lookahead predictor\n│   │   ├── moe_chronos.py         # ChronosMOEFeedForward + shared experts + soft gating\n│   │   ├── model_chronos.py       # ChronosForCausalLM\n│   │   ├── temporal_loss.py       # Temporal locality + lookahead losses\n│   │   └── hf_io.py               # save\u002Fload_chronos_pretrained + HF registration\n│   ├── io\u002F\n│   │   ├── expert_store.py        # Three-tier storage + per-expert events\n│   │   ├── async_prefetcher.py    # Async prefetch engine\n│   │   ├── storage.py             # ClusterStorage: .ctsr safetensors + manifest\n│   │   ├── cluster_layout.py      # Co-occurrence clustering + repacking\n│   │   └── io_simulator.py        # CHRONOS_SIM_SSD_MS test hook\n│   ├── router\u002F\n│   │   ├── intent_classifier.py   # Prompt-level expert predictor\n│   │   ├── expert_predictor.py    # IntentVector -> ExpertSet\n│   │   └── prefill_scheduler.py   # Prefill-time expert preloader\n│   ├── mlx\u002F\n│   │   ├── attention.py \u002F moe.py \u002F model.py \u002F expert_store.py \u002F inference.py\n│   ├── runtime\u002F\n│   │   ├── cache_manager.py       # prefetch_for_next_step \u002F 
ensure_resident\n│   │   ├── inference_engine.py    # End-to-end inference engine\n│   │   └── metrics.py             # MetricsBus for the IO Monitor\n│   ├── trainer\u002F\n│   │   ├── loss_mixin.py          # chronos_loss_term + router_kl_anchor\n│   │   ├── chronos_trainer.py     # Pretrain\n│   │   ├── sft_trainer.py         # Stage 2\n│   │   ├── dpo_trainer.py         # Stage 3\n│   │   ├── orpo_trainer.py        # Stage 4\n│   │   ├── grpo_trainer.py        # Stage 5\n│   │   ├── distill_trainer.py     # Stage 6\n│   │   └── reward.py              # ToyReward \u002F LMRewardModel \u002F build_reward_fn\n│   ├── tuning\u002F\n│   │   └── chronos_auto_tuner.py  # Optuna lambda + architecture search\n│   ├── eval\u002F\n│   │   ├── io_profiler.py         # Lookahead validation\n│   │   └── benchmark.py           # End-to-end benchmarking\n│   ├── data\u002F\n│   │   └── flexible_dataset.py    # Flexible JSONL dataset loader\n│   ├── backend\u002F\n│   │   ├── __init__.py            # BackendDispatcher (cpu\u002Fmps\u002Fcuda\u002Fmlx)\n│   │   ├── dispatcher.py          # Capability probing + priority logic\n│   │   └── ext\u002Fopencl.py          # Third-party OpenCL extension hook\n│   ├── _backend_legacy.py         # Backward-compatible older APIs\n│   ├── serving\u002F\n│   │   ├── __init__.py\n│   │   └── vllm_adapter.py        # Optional vLLM registration\n│   └── cli.py                     # Unified CLI\n├── ui\u002F                            # Gradio Web UI (zh-Hans \u002F zh-Hant \u002F en \u002F ja)\n│   ├── i18n.py\n│   ├── estimator.py               # Live parameter \u002F memory estimator\n│   └── tabs\u002F\n│       ├── config_tab.py          # Config + Designer merged together\n│       ├── train_tab.py           # Owns data_path\n│       ├── pipeline_tab.py        # Per-stage datasets across all 6 stages\n│       ├── inference_tab.py\n│       ├── benchmark_tab.py       # Markdown table + gr.BarPlot\n│       ├── autotune_tab.py        # 
Persistent logs + Apply Best -> Config\n│       └── iomon_tab.py           # MetricsBus dashboard\n├── chronos_app.py                 # Web UI entry\n├── train_chronos.py               # Stage 1 entry\n├── train_chronos_sft.py           # Stage 2 entry\n├── train_chronos_dpo.py           # Stage 3 entry\n├── train_chronos_orpo.py          # Stage 4 entry\n├── train_chronos_grpo.py          # Stage 5 entry\n├── train_chronos_distill.py       # Stage 6 entry\n├── tools\u002F\n│   ├── compare_minimind_chronos.py\n│   ├── compare_minimind_chronos_v2.py\n│   └── compare_minimind_chronos_v3.py\n├── tests\u002F\n│   ├── test_smoke.py\n│   ├── test_smoke_cuda.py\n│   └── fixtures\u002F\n├── docs\u002F\n│   └── vllm_integration.md\n├── pyproject.toml\n└── README.md \u002F README_zh.md \u002F THIRD_PARTY_NOTICES.md\n```\n\n---\n\n## Roadmap\n\n```mermaid\ntimeline\n    title Project Chronos delivery map\n    Phase 1\n        : LookaheadRouter\n        : Temporal locality regularization\n        : Router-probability collection path\n    Phase 2\n        : Async IO engine\n        : Three-tier SSD\u002FRAM\u002FVRAM storage\n        : Co-activation clustering\n    Phase 3\n        : Hybrid MLA + SlidingWindow attention\n        : PrefillScheduler\n        : Dual-layer routing\n    Phase 4\n        : Native MLX backend\n        : Web UI and CLI\n        : Optuna search\n        : Open-source release\n    M1-M3\n        : Cluster-aware safetensors storage\n        : Supervised lookahead loss\n        : Dual-stream decode with per-expert events\n    M4-M6\n        : SFT \u002F DPO \u002F ORPO \u002F GRPO trainers\n        : Router KL anchor\n        : HF IO, vLLM adapter, multi-backend dispatch\n        : Stage 6 distillation and pluggable rewards\n        : Web UI v2, benchmark plots, IO Monitor\n    Next\n        : Train IntentClassifier on large activation corpora\n        : Benchmark 7B+ checkpoints\n        : Inject masks on the vLLM worker path\n        : Ship real Vulkan 
\u002F OpenCL kernels\n```\n\n```mermaid\nmindmap\n  root((Chronos innovation surface))\n    Predictive routing\n      IntentClassifier\n        Prompt-level hot expert prior\n        Budgeted expert-set prediction\n      LookaheadRouter\n        Per-token future routing\n        Soft-target CE from real future routers\n    Storage-aware MoE\n      Clustered safetensors\n        .ctsr packed expert clusters\n        manifest-driven mmap\n      Three-tier cache\n        NVMe SSD\n        Pinned RAM\n        VRAM \u002F unified memory\n      Soft fallback\n        Shared experts always resident\n        No hard stall on cache miss\n    Decode pipeline\n      Prefill-time expert loading\n      AsyncPrefetcher queue\n      H2D stream\n      Per-expert CUDA events\n    Training stack\n      Pretrain\n      SFT\n      DPO\n      ORPO\n      GRPO\n      Distill\n      Router KL anchor\n    Deployment\n      HF safetensors IO\n      AutoModel registration\n      vLLM adapter\n      CPU \u002F CUDA \u002F MPS \u002F MLX dispatch\n      Web UI and CLI\n```\n\n---\n\n## Citation\n\n```bibtex\n@misc{chronos2026,\n  title  = {Project Chronos: Prefill-Time Expert Loading and Dual-Layer Routing\n             for Zero-Stall On-Device MoE Inference},\n  author = {Fona and Project Chronos Contributors},\n  year   = {2026},\n  url    = {https:\u002F\u002Fgithub.com\u002FFonaTech\u002FProject_Chronos}\n}\n```\n\n---\n\n## Third-party attribution\n\nProject Chronos builds on **jingyaogong**'s [minimind](https:\u002F\u002Fgithub.com\u002Fjingyaogong\u002Fminimind), licensed under **Apache-2.0**. 
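Full attribution lives in [THIRD_PARTY_NOTICES.md](THIRD_PARTY_NOTICES.md).\n\n---\n\n## License\n\nApache 2.0 - see [LICENSE](LICENSE)\n","Project Chronos is a storage-aware MoE (Mixture of Experts) stack optimized for hybrid SSD+DRAM inference, designed to reduce large language model inference latency through prefetching and asynchronous DMA. Its core features include zero-stall inference via lookahead prediction and async DMA prefetching, and a hybrid MLA (Multi-head Latent Attention) plus sliding-window attention mechanism to optimize SSD I\u002FO performance. It is particularly suited to large language model workloads that need high throughput and low latency, such as generative AI services, where it can significantly improve efficiency and reduce cost when processing large volumes of data.",2,"2026-06-11 02:44:29","CREATED_QUERY"]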