[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2235":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":15,"stars30d":15,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":16,"rankGlobal":10,"rankLanguage":10,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":10,"pushedAt":10,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":15,"starSnapshotCount":15,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},2235,"AirTrain","alexandercodes4\u002FAirTrain","alexandercodes4","Distributed ML training across Apple Silicon Macs","",null,"Python",104,9,1,0,3,"MIT License",false,"master",true,[],"2026-06-12 02:00:39","# AirTrain\n\n**Distributed ML training across Apple Silicon Macs.**\n\nAirTrain dramatically reduces machine learning model training costs by splitting computation across multiple Mac devices. Using the DiLoCo algorithm, it achieves near-linear scaling with **500x less network communication** than traditional distributed training — making Wi-Fi-based training practical.\n\nTraining a 124M parameter GPT-2 model? 
Instead of renting cloud GPUs at $3\u002Fhr, pool three MacBooks in a coffee shop and train for free.\n\n## Table of Contents\n\n- [Features](#features)\n- [Quick Start](#quick-start)\n- [How It Works](#how-it-works)\n- [The DiLoCo Algorithm](#the-diloco-algorithm)\n- [Architecture](#architecture)\n- [Peer Discovery](#peer-discovery)\n- [Network Protocol](#network-protocol)\n- [Checkpoint System](#checkpoint-system)\n- [Training Relay](#training-relay)\n- [Sleep Swarms](#sleep-swarms)\n- [Dream Training](#dream-training)\n- [Model Autopsy](#model-autopsy)\n- [Gradient Marketplace](#gradient-marketplace)\n- [Local Dashboard](#local-dashboard)\n- [AirTrain Website (airtrain.dev)](#airtrain-website)\n- [Apple Silicon Performance](#apple-silicon-performance)\n- [CLI Reference](#cli-reference)\n- [Configuration](#configuration)\n- [Project Structure](#project-structure)\n- [Comparison to Existing Tools](#comparison-to-existing-tools)\n- [Roadmap](#roadmap)\n- [Contributing](#contributing)\n- [License](#license)\n\n## Features\n\n- **Zero-config discovery** — Devices find each other automatically on local networks via mDNS\u002FBonjour\n- **DiLoCo training** — 500x less network traffic than traditional distributed training (DDP)\n- **Fault tolerant** — Nodes can join and leave mid-training without killing the run\n- **Checkpoint relay** — Pause training, export a checkpoint, hand it off to someone else to continue\n- **Built for Apple Silicon** — Native MLX framework, optimized for M1\u002FM2\u002FM3\u002FM4\u002FM5 unified memory architecture\n- **Local dashboard** — Real-time training metrics, peer monitoring, and checkpoint timeline in your browser\n- **Community platform** — [airtrain.dev](https:\u002F\u002Fairtrain.dev) lets you find training partners, share checkpoints, and track your contributions on a global leaderboard\n\n## Quick Start\n\n```bash\npip install airtrain\n\n# Mac 1 — Start training as coordinator\nairtrain start --model gpt2-small --dataset .\u002Fdata\u002Fwikitext.txt --dashboard\n\n# Mac 2 — Join automatically via mDNS\nairtrain join auto\n```\n\nBoth Macs now train collaboratively. 
Loss should decrease in both terminals. Open `http:\u002F\u002Flocalhost:8471` on Mac 1 to see the live dashboard.\n\n## How It Works\n\nTraditional distributed training (DDP) synchronizes gradients after **every single step**. For a 124M parameter model in FP32, that's ~500MB of data exchanged per step. At 100 steps\u002Fsecond, you need 50 GB\u002Fs of sustained bandwidth — impossible over Wi-Fi.\n\nAirTrain uses the **DiLoCo** (Distributed Low-Communication) algorithm to reduce this by 500x:\n\n```\nTraditional DDP:      1 sync per step     = 50 GB\u002Fs required\nAirTrain (DiLoCo):    1 sync per 500 steps = 0.1 GB\u002Fs required ✓ Wi-Fi works\n```\n\nEach Mac trains independently for 500 steps, then syncs only the *difference* between where it started and where it ended (pseudo-gradients). A coordinator averages these diffs, applies them to the global weights, and broadcasts the updated weights. After compression, each sync moves on the order of 100MB per worker, so it takes seconds to tens of seconds over Wi-Fi depending on link speed.\n\n## The DiLoCo Algorithm\n\nAirTrain implements the DiLoCo algorithm from [Douillard et al. 
(2023)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.08105), validated at scale by [PrimeIntellect's OpenDiLoCo](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.07852).\n\n### Inner Loop (local training)\n\nEach worker independently runs `H` steps (default 500) of AdamW:\n\n```\nθ_local = θ_global                          # snapshot global params\nfor step in range(H):\n    loss = model(batch, θ_local)\n    θ_local = θ_local - α · AdamW(∇loss)    # α = 3e-4 (inner lr)\n```\n\n### Outer Loop (synchronization)\n\nAfter `H` inner steps, workers compute pseudo-gradients and the coordinator applies an outer SGD step with Nesterov momentum:\n\n```\nΔθ_i = θ_global - θ_local_i                 # pseudo-gradient from worker i\nΔθ_avg = mean(Δθ_1, Δθ_2, ..., Δθ_n)       # average across all workers\n\n# Outer SGD + Nesterov momentum\nv = β · v + Δθ_avg                           # β = 0.9\nθ_global = θ_global - η · (Δθ_avg + β · v)  # η = 0.7 (outer lr)\n```\n\n### Why It Works\n\nDiLoCo works because neural network loss landscapes are smooth enough that independent workers explore different regions and converge to compatible solutions. 
The pseudo-gradient averaging acts as implicit regularization — similar to how federated learning aggregates updates.\n\n### Configuration\n\n| Parameter | Default | Description |\n|---|---|---|\n| `inner_steps` | 500 | Local training steps before sync |\n| `inner_lr` | 3e-4 | AdamW learning rate for local training |\n| `inner_weight_decay` | 0.1 | AdamW weight decay |\n| `outer_lr` | 0.7 | SGD learning rate for global update |\n| `outer_momentum` | 0.9 | Nesterov momentum for outer optimizer |\n| `gradient_compression` | true | Compress gradients to FP16 + gzip |\n\n## Architecture\n\n### System Overview\n\n```\n┌──────────────────────────────────────────────────────────────┐\n│                      AirTrain Network                        │\n│                                                              │\n│   ┌──────────────┐    ┌──────────────┐   ┌──────────────┐   │\n│   │  Mac #1       │    │  Mac #2       │   │  Mac #3       │  │\n│   │  (Coordinator)│    │  (Worker)     │   │  (Worker)     │  │\n│   │               │    │               │   │               │  │\n│   │ ┌──────────┐ │    │ ┌──────────┐ │   │ ┌──────────┐ │  │\n│   │ │ MLX      │ │    │ │ MLX      │ │   │ │ MLX      │ │  │\n│   │ │ Trainer  │ │    │ │ Trainer  │ │   │ │ Trainer  │ │  │\n│   │ └────┬─────┘ │    │ └────┬─────┘ │   │ └────┬─────┘ │  │\n│   │      │       │    │      │       │   │      │       │  │\n│   │ ┌────▼─────┐ │    │ ┌────▼─────┐ │   │ ┌────▼─────┐ │  │\n│   │ │ DiLoCo   │ │    │ │ DiLoCo   │ │   │ │ DiLoCo   │ │  │\n│   │ │ Engine   │ │    │ │ Engine   │ │   │ │ Engine   │ │  │\n│   │ └────┬─────┘ │    │ └────┬─────┘ │   │ └────┬─────┘ │  │\n│   │      │       │    │      │       │   │      │       │  │\n│   │ ┌────▼─────┐ │    │ ┌────▼─────┐ │   │ ┌────▼─────┐ │  │\n│   │ │ TCP      │◄├────┤►│ TCP      │◄├───┤►│ TCP      │ │  │\n│   │ │Transport │ │    │ │Transport │ │   │ │Transport │ │  │\n│   │ └──────────┘ │    │ └──────────┘ │   │ └──────────┘ │  │\n│   │       ▲      
│    │              │   │              │  │\n│   │  Dashboard   │    │              │   │              │  │\n│   │  :8471       │    │              │   │              │  │\n│   └──────────────┘    └──────────────┘   └──────────────┘  │\n│          ▲                                                   │\n│     mDNS\u002FBonjour                                            │\n│   (auto-discovery)                                           │\n└──────────────────────────────────────────────────────────────┘\n```\n\n### Component Stack\n\n```\n┌─────────────────────────────────────────┐\n│              CLI (click)                │  airtrain start \u002F join \u002F relay\n├─────────────────────────────────────────┤\n│         Coordinator \u002F Worker            │  Orchestration layer\n├──────────────┬──────────────────────────┤\n│ DiLoCo Engine│   Checkpoint Manager     │  Training logic\n├──────────────┴──────────────────────────┤\n│         Base Trainer (MLX)              │  Model + optimizer wrapper\n├─────────────────────────────────────────┤\n│    Transport (asyncio TCP)              │  Message passing\n├──────────┬──────────────────────────────┤\n│  Protocol│  Compression (FP16+gzip)    │  Wire format\n├──────────┴──────────────────────────────┤\n│    Discovery (mDNS \u002F HTTP Relay)        │  Peer finding\n└─────────────────────────────────────────┘\n```\n\n## Peer Discovery\n\nAirTrain supports two discovery mechanisms:\n\n### LAN Discovery (mDNS\u002FBonjour)\n\nOn local networks, peers find each other automatically using multicast DNS — the same zero-configuration protocol that Apple uses for AirDrop, AirPlay, and printer discovery.\n\nWhen you run `airtrain start`, the coordinator registers a `_airtrain._tcp.local.` service on the network, advertising its IP, port, model name, and hardware capabilities. 
When a worker runs `airtrain join auto`, it browses for this service and connects automatically.\n\n```python\n# Under the hood (using python-zeroconf):\nServiceInfo(\n    \"_airtrain._tcp.local.\",\n    \"coordinator._airtrain._tcp.local.\",\n    addresses=[socket.inet_aton(\"192.168.1.10\")],\n    port=7471,\n    properties={\n        \"model\": \"gpt2-small\",\n        \"chip\": \"Apple M4 Pro\",\n        \"memory_gb\": \"48\",\n        \"status\": \"training\",\n    },\n)\n```\n\n**Limitation:** mDNS only works within a single LAN subnet. It won't work across the internet or on networks that block multicast (some university\u002Fenterprise Wi-Fi).\n\n### Internet Discovery (HTTP Relay)\n\nFor peers across the internet, AirTrain provides a lightweight HTTP signaling server. Peers POST their info to the relay, and other peers GET the peer list to find sessions to join.\n\n```bash\n# Self-host a relay server\nuvicorn airtrain.discovery.relay:app --host 0.0.0.0 --port 9000\n\n# Or use the public relay at airtrain.dev\nairtrain start --relay https:\u002F\u002Fairtrain.dev\u002Fapi\u002Frelay\nairtrain join --relay https:\u002F\u002Fairtrain.dev\u002Fapi\u002Frelay\n```\n\nThe relay only handles discovery — all training data flows directly peer-to-peer via TCP.\n\n## Network Protocol\n\nAirTrain uses a custom binary protocol over TCP:\n\n```\n┌────────────┬──────────────┬─────────────────┐\n│ Header Len │ JSON Header  │ Binary Payload  │\n│  (4 bytes) │ (variable)   │ (variable)      │\n└────────────┴──────────────┴─────────────────┘\n```\n\n### Message Types\n\n| Type | Direction | Description |\n|---|---|---|\n| `HANDSHAKE` | Worker → Coordinator | Initial connection with peer capabilities |\n| `SYNC_REQUEST` | Coordinator → Workers | \"Send me your pseudo-gradients\" |\n| `SYNC_GRADIENTS` | Worker → Coordinator | Compressed pseudo-gradient payload |\n| `MODEL_WEIGHTS` | Coordinator → Workers | Updated model weights after outer step |\n| `HEARTBEAT` | Bidirectional 
| Keep-alive ping every 5 seconds |\n| `PEER_JOIN` | Coordinator → Workers | Notification of new peer |\n| `PEER_LEAVE` | Coordinator → Workers | Notification of disconnected peer |\n\n### Gradient Compression\n\nPseudo-gradients are compressed before transmission:\n\n1. **FP16 casting** — 32-bit floats → 16-bit (2x reduction, negligible quality loss for gradient averaging)\n2. **gzip compression** — Typically 2-3x additional reduction on gradient data\n3. **Net result:** ~4-6x compression. A 500MB gradient payload becomes ~80-125MB.\n\nFor a 124M parameter model (~500MB of FP32 pseudo-gradients), that is ~250MB after FP16 casting and ~80-125MB after gzip, which takes roughly 6-33 seconds per sync over typical Wi-Fi (30-100 Mbps).\n\n## Checkpoint System\n\nAirTrain saves complete training state as a portable directory:\n\n```\ncheckpoints\u002Fstep-5000\u002F\n├── model.safetensors       # Model weights (HuggingFace safetensors format)\n├── optimizer.npz           # Optimizer state (momentum buffers, etc.)\n└── meta.json               # Training metadata\n```\n\n### Metadata (`meta.json`)\n\n```json\n{\n  \"version\": \"0.1.0\",\n  \"model_name\": \"gpt2-small\",\n  \"global_step\": 5000,\n  \"loss\": 3.42,\n  \"total_compute_hours\": 2.5,\n  \"contributors\": [\"Alicans-MacBook.local\", \"Joes-Mac-Mini.local\"],\n  \"created_at\": \"2026-04-14T15:30:00Z\",\n  \"description\": \"GPT-2 trained on wikitext-103\"\n}\n```\n\nCheckpoints are **automatically saved** every 1000 steps (configurable) and on `Ctrl+C` interruption. The `safetensors` format is compatible with HuggingFace, so trained models can be uploaded directly to the Hub.\n\n## Training Relay\n\nThe relay system enables **asynchronous distributed training** — no need for multiple Macs to be online simultaneously.\n\n### How It Works\n\n1. **You** train a model for a while on your Mac\n2. **You** export a portable relay checkpoint\n3. **You** share it (via the AirTrain website, AirDrop, email, Google Drive — any file transfer)\n4. **Someone else** imports it and continues training\n5. 
The checkpoint tracks all contributors and cumulative compute hours\n\n```bash\n# Export a relay checkpoint\nairtrain relay export --checkpoint .\u002Fcheckpoints\u002Fstep-5000 \\\n  --output .\u002Frelay-gpt2-step5000 \\\n  --description \"GPT-2 on wikitext-103, loss=3.42, need more compute\"\n\n# Import and continue\nairtrain relay import .\u002Frelay-gpt2-step5000\nairtrain start --model gpt2-small --dataset .\u002Fdata --resume .\u002Frelay-gpt2-step5000\n```\n\nThis is like a relay race — each runner (Mac) carries the baton (checkpoint) for its leg, then hands it off.\n\n## Sleep Swarms\n\nAirTrain's most distinctive feature: **your Mac trains while you sleep**, then hands off to someone in another timezone when you wake up. The model trains 24\u002F7 by chasing nighttime around the globe.\n\n```bash\nairtrain sleep --window \"23:00-07:00\" --prefer \"gpt2*\"\n```\n\n### How It Works\n\n1. You set a **training window** — the hours your Mac is available (default: 11pm–7am)\n2. During that window, AirTrain automatically:\n   - Queries the relay server for active sleep swarm sessions\n   - Downloads the latest checkpoint for the best matching session\n   - Joins as a worker and starts training\n3. 
When your window closes (or battery drops below 20%, or you close the lid):\n   - Saves a checkpoint\n   - Disconnects gracefully\n   - Uploads the updated checkpoint for the next timezone to pick up\n\n### Timezone Coverage\n\nA model in a sleep swarm passes through contributors around the world:\n\n```\nUTC  00  02  04  06  08  10  12  14  16  18  20  22\n     ████████████                                ████  New York (23:00-07:00)\n                 ████████████                          London (00:00-08:00)\n                             ████████████              Mumbai (05:30-13:30)\n                                         ████████████  Tokyo (09:00-17:00)\n     ─────────────────────────────────────────────────\n     ████████████████████████████████████████████████  = 24\u002F7 coverage\n```\n\n### Configuration\n\n| Flag | Default | Description |\n|---|---|---|\n| `--window` | `23:00-07:00` | Training window in local time |\n| `--prefer` | `any` | Model filter (e.g., `gpt2*`, `llama*`) |\n| `--max-hours` | 8 | Max compute hours per night |\n| `--min-battery` | 20 | Stop if battery drops below this % |\n| `--relay` | `airtrain.dev\u002Fapi\u002Frelay` | Relay server URL |\n\n### Safety\n\nSleep Swarms are safe by default:\n- **Battery protection** — stops training if battery drops below 20%\n- **Lid detection** — pauses if you close your MacBook\n- **Window enforcement** — always stops when your window ends\n- **Auto-checkpoint** — saves progress before every disconnect\n- **Retry logic** — reconnects automatically if Wi-Fi drops\n\n## Dream Training\n\nYour Mac \"dreams\" about the model during idle time. Between training sessions, AirTrain runs low-priority inference to generate synthetic training data from the current checkpoint — scoring each sample for quality and caching the best ones. 
When training resumes, dream data is mixed into real batches to accelerate convergence.\n\nInspired by how the brain consolidates learning during sleep through replay.\n\n```bash\n# Generate dreams manually from a checkpoint\nairtrain dream run --samples 1000 --temperature 0.9\n\n# Check dream cache stats\nairtrain dream status\n```\n\n### How It Works\n\n1. **Generate** — The model runs inference with temperature sampling to produce diverse text\n2. **Score** — Each sample is evaluated on a quality heuristic (perplexity sweet spot, repetition, diversity)\n3. **Cache** — High-quality samples are saved to `dreams\u002F` as JSONL files\n4. **Mix** — During training, dream data is mixed into real batches (default 15% dream, 85% real)\n5. **Share** — Dream caches are shared across the swarm, so every worker benefits from every other worker's dreams\n\n### Quality Scoring\n\nNot all dreams are useful. AirTrain filters aggressively:\n- **Too low perplexity** (memorized\u002Frepetitive) — rejected\n- **Too high perplexity** (gibberish\u002Fincoherent) — rejected\n- **Sweet spot** (novel but coherent) — kept\n- **N-gram repetition** check catches degenerate loops\n- **Character diversity** check catches punctuation spam\n\n### Integration with Sleep Swarms\n\nWhen the sleep scheduler can't find a training session to join, it dreams instead: *\"If you can't train, dream.\"* These dreams are cached locally and shared when the next session starts, so no idle time is wasted.\n\n### Configuration\n\n| Parameter | Default | Description |\n|---|---|---|\n| `samples_per_session` | 1000 | Samples generated per dream session |\n| `temperature` | 0.9 | Sampling temperature (higher = more diverse) |\n| `top_p` | 0.95 | Nucleus sampling threshold |\n| `quality_threshold` | 0.7 | Min quality score to keep (0-1) |\n| `mix_ratio` | 0.15 | Fraction of dream data in training batches |\n| `max_cache_mb` | 500 | Max dream cache size before auto-pruning |\n| `dream_interval` | 60 | Seconds 
between idle dream sessions |\n\n## Model Autopsy\n\nAfter training completes, AirTrain generates an interactive **autopsy report** — a detailed analysis of the model's entire training life story.\n\n```bash\nairtrain autopsy --events .\u002Fautopsy\u002Fevents.jsonl\n```\n\nThis opens a self-contained HTML report in your browser with:\n\n- **Training Summary** — total steps, compute hours, contributors, initial\u002Ffinal loss\n- **Loss Curve** — interactive Chart.js visualization of loss over every sync round\n- **Contributor Rankings** — who contributed the most compute, participated in the most syncs, generated the best dreams\n- **Breakthrough Rounds** — the top 5 sync rounds with the biggest loss drops, and which peers were responsible\n- **Dream Impact** — how many dream samples were generated, kept, and their average quality\n- **Peer Timeline** — when each peer joined, contributed, and left\n\n### How Events Are Recorded\n\nThe `AutopsyRecorder` automatically logs events during training:\n- Every sync round (step, loss, participating peers)\n- Peer joins and leaves (hardware info, compute hours contributed)\n- Checkpoints saved\n- Dream sessions (samples generated, quality scores)\n\nEvents are stored as JSONL in `autopsy\u002Fevents.jsonl` — human-readable and portable.\n\n### Sharing Reports\n\nUpload autopsy reports to airtrain.dev to share your model's training story:\n\n```bash\n# Generate JSON format for uploading\nairtrain autopsy --events .\u002Fautopsy\u002Fevents.jsonl --format json --output report.json\n```\n\nReports are viewable on the website, showing every contributor who helped train the model.\n\n## Gradient Marketplace\n\nNot all gradients are created equal. The Gradient Marketplace **scores each worker's contribution** and gives higher-quality gradients more influence in the aggregation step. 
Workers with better data, more consistent training, or stronger hardware naturally rise to the top.\n\n### How Scoring Works\n\nAfter each sync round, the coordinator evaluates every worker on 4 metrics:\n\n| Metric | Weight | What It Measures |\n|---|---|---|\n| **Alignment** | 35% | Cosine similarity with the consensus gradient. High alignment = agrees with the group = likely good data. |\n| **Magnitude** | 25% | Is the gradient a healthy size? Too small = stale. Too large = diverging. Peaks near the median. |\n| **History** | 25% | Rolling average of past scores. Consistent contributors build trust over time. |\n| **Improvement** | 15% | Did loss decrease when this worker's gradients were used? Retroactive credit for results. |\n\nScores are normalized to weights that sum to 1.0. A minimum weight floor (default 10%) ensures no worker is ever completely silenced — even low-scoring workers contribute something.\n\n### Warmup Period\n\nDuring the first 3 sync rounds, all workers receive equal weights. The marketplace needs a few rounds of history before it can meaningfully differentiate contributors.\n\n### Example Output\n\n```\nMarketplace Rankings (Round 12):\n  #1  MacBook-Pro-Alex   w=0.312  mag=0.95  align=0.87  hist=0.81  imp=0.72\n  #2  Mac-Mini-Server    w=0.289  mag=0.91  align=0.82  hist=0.78  imp=0.68\n  #3  MacBook-Air-Joe    w=0.245  mag=0.88  align=0.71  hist=0.65  imp=0.61\n  #4  iMac-Reception     w=0.154  mag=0.42  align=0.53  hist=0.50  imp=0.50\n```\n\n### Why This Matters\n\nIn traditional distributed training, a single bad worker (training on corrupted data, running on failing hardware) can poison the entire model by contributing garbage gradients that get averaged equally with good ones. 
The Gradient Marketplace automatically detects and downweights these workers without kicking them out — their contribution is reduced, not eliminated.\n\nThis also creates a natural quality incentive for the community: workers who contribute better data and more reliable compute earn higher marketplace scores, which feed into the website leaderboard.\n\n## Local Dashboard\n\nWhen you run training with `--dashboard`, AirTrain starts a web UI at `http:\u002F\u002Flocalhost:8471`:\n\n```bash\nairtrain start --model gpt2-small --dataset .\u002Fdata --dashboard\n```\n\nThe dashboard shows:\n\n- **Loss curve** — Real-time Chart.js plot of training loss over steps\n- **Peer table** — Connected devices with chip type, memory, contribution percentage, and status\n- **Throughput** — Tokens\u002Fsecond across the swarm\n- **Checkpoint timeline** — History of saved checkpoints with loss at each point\n- **Cluster status** — Total compute hours, global step, peer count\n\nData streams via **Server-Sent Events (SSE)** for real-time updates without polling.\n\n## AirTrain Website\n\n**[airtrain.dev](https:\u002F\u002Fairtrain.dev)** is the community platform that connects AirTrain users worldwide. It serves three purposes: helping people find live training sessions to join, enabling asynchronous checkpoint handoffs between strangers, and gamifying contributions to build a community of distributed ML trainers.\n\n### Swarm Browser\n\nThe Swarm Browser shows **live training sessions** happening right now. 
When a coordinator starts training with `--relay https:\u002F\u002Fairtrain.dev\u002Fapi\u002Frelay`, their session appears on the website in real-time.\n\nEach listing shows:\n- **Model** being trained (e.g., GPT-2 124M, LLaMA 7B)\n- **Progress** — current step, loss, and estimated completion\n- **Peers** — how many Macs are currently contributing and how many more are wanted\n- **Hardware** — aggregate compute (e.g., \"3x M4 Pro, 1x M2 Air = 11.1 TFLOPS\")\n- **Connection info** — one-click join button that copies the `airtrain join \u003Caddress>` command\n\nAnyone can browse sessions without an account. Joining requires the AirTrain CLI installed locally.\n\n```\n┌──────────────────────────────────────────────────────────┐\n│  Live Training Sessions                          3 active │\n├──────────────────────────────────────────────────────────┤\n│  GPT-2 124M on WikiText-103                              │\n│  Step: 15,000 \u002F 100,000  ▓▓▓░░░░░░░  15%               │\n│  Loss: 3.12  |  Peers: 4\u002F8  |  12.3 TFLOPS combined    │\n│  [Join Session]                                          │\n├──────────────────────────────────────────────────────────┤\n│  TinyLLaMA 1.1B on RedPajama                            │\n│  Step: 2,400 \u002F 50,000   ▓░░░░░░░░░   5%                │\n│  Loss: 5.67  |  Peers: 2\u002F4  |  6.8 TFLOPS combined     │\n│  [Join Session]                                          │\n└──────────────────────────────────────────────────────────┘\n```\n\n### Relay Board\n\nThe Relay Board is a **marketplace for training checkpoints**. Users post checkpoints they've trained and want others to continue. Think of it as a baton-passing board for asynchronous collaborative training.\n\nHow it works:\n\n1. **Post a checkpoint** — Upload metadata (model name, step, loss, compute hours) and a download link (HuggingFace Hub, S3, Google Drive). Weights are never uploaded to airtrain.dev — only metadata and a link.\n2. 
**Browse available relays** — See what models need more training, sorted by recency or popularity.\n3. **Claim a relay** — Mark a checkpoint as \"claimed\" so others don't duplicate work. Download the checkpoint, train for a while, then post your updated checkpoint back.\n4. **Track lineage** — Each relay checkpoint records its full history: who trained it, for how many steps, and how many total compute hours have been contributed. A model might pass through 10 different people's Macs before reaching convergence.\n\n```\n┌──────────────────────────────────────────────────────────┐\n│  Relay Board                                    12 open   │\n├──────────────────────────────────────────────────────────┤\n│  GPT-2 124M — step 50,000 — loss 2.89                   │\n│  \"Trained on wikitext-103 for 8 hours. Getting close     │\n│   to convergence, needs ~20k more steps.\"                │\n│  Contributors: 3  |  Compute: 14.2 hrs  |  Posted 2h ago│\n│  [Claim & Continue]                [View History]        │\n├──────────────────────────────────────────────────────────┤\n│  TinyStories 33M — step 5,000 — loss 4.21               │\n│  \"Just started this one. Great for beginners to try      │\n│   AirTrain relay — small model, quick progress.\"         │\n│  Contributors: 1  |  Compute: 0.5 hrs  |  Posted 1d ago │\n│  [Claim & Continue]                [View History]        │\n└──────────────────────────────────────────────────────────┘\n```\n\n### Leaderboard & Gamification\n\nThe leaderboard ranks contributors by total **compute hours donated** to collaborative training. 
It creates a positive feedback loop — the more you train, the higher you rank, and the more visible your contributions become.\n\n**Leaderboard columns:**\n- **Rank** — Position by total compute hours\n- **Username** — GitHub-linked profile\n- **Compute Hours** — Total hours of training contributed across all sessions\n- **Sessions** — Number of training sessions participated in\n- **Relays** — Number of checkpoint handoffs completed\n- **Badges** — Achievement icons earned\n\n**Badges:**\n\n| Badge | Criteria |\n|---|---|\n| First Train | Completed your first training session |\n| 10 Hours | Contributed 10 compute hours |\n| 100 Hours | Contributed 100 compute hours |\n| Swarm Leader | Coordinated a session with 5+ peers |\n| Relay Champion | Completed 5 relay handoffs |\n| Early Adopter | Joined during the first month |\n\n### Website Tech Stack\n\n| Component | Technology | Purpose |\n|---|---|---|\n| Backend | FastAPI (Python) | REST API, SSE for real-time updates |\n| Database | SQLite + aiosqlite | Zero-ops, migrates to PostgreSQL at scale |\n| Auth | GitHub OAuth | One-click login for developers |\n| Frontend | Vanilla HTML\u002FCSS\u002FJS | Landing page, swarm browser, relay board, leaderboard |\n| Hosting | Any VPS (Fly.io, Railway, etc.) 
| Single Python process, no complex infra |\n\n### Website API\n\nAll website features are accessible via REST API:\n\n| Endpoint | Method | Description |\n|---|---|---|\n| `\u002Fapi\u002Fswarms` | GET | List active training sessions |\n| `\u002Fapi\u002Fswarms` | POST | Register a new training session |\n| `\u002Fapi\u002Fswarms\u002F{id}` | PUT | Update session status\u002Fprogress |\n| `\u002Fapi\u002Frelay` | GET | List available relay checkpoints |\n| `\u002Fapi\u002Frelay` | POST | Post a new relay checkpoint |\n| `\u002Fapi\u002Frelay\u002F{id}\u002Fclaim` | POST | Claim a relay checkpoint |\n| `\u002Fapi\u002Fleaderboard` | GET | Get ranked contributor list |\n| `\u002Fapi\u002Fleaderboard\u002Fbadges` | GET | Get badge definitions |\n| `\u002Fauth\u002Flogin` | GET | Initiate GitHub OAuth flow |\n| `\u002Fauth\u002Fcallback` | GET | Handle OAuth callback |\n| `\u002Fhealth` | GET | Health check |\n\nFull interactive API documentation is available at `\u002Fdocs` (auto-generated by FastAPI).\n\n### Database Schema\n\n```sql\nusers           (id, github_id, username, avatar_url, compute_hours, created_at)\ntraining_sessions (id, creator_id, model_name, status, global_step, loss,\n                   peer_count, description, connect_address, created_at)\ncheckpoints     (id, session_id, uploader_id, model_name, global_step, loss,\n                 compute_hours, description, download_url, status, claimed_by)\ncontributions   (id, user_id, session_id, compute_hours, steps_trained)\nbadges          (id, user_id, badge_type, earned_at)\n```\n\n## Apple Silicon Performance\n\nAirTrain is built on [MLX](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx), Apple's native ML framework that takes full advantage of Apple Silicon's **unified memory architecture** — CPU and GPU share the same memory pool, eliminating the host-to-device copy overhead that plagues NVIDIA GPU training.\n\n### Chip Benchmarks\n\n| Chip | GPU TFLOPS (FP32) | Memory BW | Unified Memory | 
Power |\n|---|---|---|---|---|\n| M1 | 1.36 | 60 GB\u002Fs | 8-16 GB | 20W |\n| M2 | 2.24 | 91 GB\u002Fs | 8-24 GB | 22W |\n| M3 | 2.47 | 92 GB\u002Fs | 8-24 GB | 22W |\n| M4 | 2.90 | 100 GB\u002Fs | 16-32 GB | 22W |\n| M4 Pro | 5.30 | 273 GB\u002Fs | 24-48 GB | 30W |\n| M4 Max | 18.43 | 546 GB\u002Fs | 36-128 GB | 40W |\n\n*Source: [arXiv:2502.05317](https:\u002F\u002Farxiv.org\u002Fhtml\u002F2502.05317v1)*\n\n### Why Apple Silicon for Training?\n\n1. **Unified memory** — An M4 Max with 128GB of unified memory can fit far larger models without offloading than an NVIDIA RTX 4090, which has only 24GB of VRAM.\n2. **Power efficiency** — Apple Silicon achieves ~245-460 GFLOPS\u002FW at FP32, versus roughly 50-80 GFLOPS\u002FW for an NVIDIA A100 (19.5 TFLOPS FP32 at a 250-400W TDP). Training on MacBooks costs very little in electricity compared to renting a cloud GPU.\n3. **Ubiquity** — There are hundreds of millions of Apple Silicon Macs in the world. Even if each one contributes just a few hours, the aggregate compute is enormous.\n4. **MLX** — Apple's framework is purpose-built for this hardware, with lazy evaluation, unified memory, and native Metal GPU support.\n\n### Scaling Math\n\nA single M4 MacBook Pro: **2.9 TFLOPS**. An NVIDIA A100: **19.5 TFLOPS**.\n\nBut 7 friends with M4 MacBooks = **20.3 TFLOPS** combined — matching an A100's peak FP32 throughput for $0 in compute cost.\n\nWith DiLoCo's 500x communication reduction, the Wi-Fi overhead is negligible. 
You get near-linear scaling up to dozens of Macs.\n\n## CLI Reference\n\n| Command | Description |\n|---|---|\n| `airtrain init` | Initialize a new training project (creates `airtrain.yaml`) |\n| `airtrain start --model \u003Cname> --dataset \u003Cpath>` | Start training as coordinator |\n| `airtrain start --dashboard` | Start with local web dashboard on `:8471` |\n| `airtrain start --resume \u003Ccheckpoint>` | Resume training from a checkpoint |\n| `airtrain join auto` | Join a session via mDNS auto-discovery |\n| `airtrain join \u003Cip:port>` | Join a session at a specific address |\n| `airtrain status` | Show cluster status (peers, step, loss) |\n| `airtrain pause` | Checkpoint and pause training |\n| `airtrain resume --from \u003Ccheckpoint>` | Resume from a saved checkpoint |\n| `airtrain relay export --checkpoint \u003Cpath>` | Export portable relay checkpoint |\n| `airtrain relay import \u003Cpath>` | Import a relay checkpoint |\n| `airtrain sleep --window \"23:00-07:00\"` | Auto-join sessions while you sleep |\n\n### Key Flags\n\n| Flag | Default | Description |\n|---|---|---|\n| `--model` | `gpt2-small` | Model architecture to train |\n| `--dataset` | (required) | Path to training data |\n| `--batch-size` | 8 | Per-worker batch size |\n| `--inner-steps` | 500 | DiLoCo inner steps before sync |\n| `--port` | 7471 | TCP port for peer communication |\n| `--checkpoint-dir` | `.\u002Fcheckpoints` | Where to save checkpoints |\n| `--dashboard` | off | Enable local web dashboard |\n\n## Configuration\n\nAirTrain can be configured via `airtrain.yaml` (created by `airtrain init`) or CLI flags:\n\n```yaml\nmodel_name: gpt2-small\ndataset_path: .\u002Fdata\u002Fwikitext.txt\nbatch_size: 8\nmax_steps: 100000\nseq_length: 512\ncheckpoint_dir: .\u002Fcheckpoints\ncheckpoint_every: 1000\nlog_every: 10\nseed: 42\n\ndiloco:\n  inner_steps: 500\n  inner_lr: 0.0003\n  inner_optimizer: adamw\n  inner_weight_decay: 0.1\n  outer_lr: 0.7\n  outer_momentum: 0.9\n  use_nesterov: 
true\n  gradient_compression: true\n  compress_to_fp16: true\n```\n\n## Project Structure\n\n```\nAirTrain\u002F\n├── airtrain\u002F                        # Core Python package\n│   ├── cli.py                       # Click CLI (init, start, join, relay, etc.)\n│   ├── config.py                    # Pydantic config models\n│   ├── compat.py                    # Cross-platform MLX compatibility layer\n│   ├── discovery\u002F\n│   │   ├── mdns.py                  # LAN auto-discovery via Zeroconf\u002FBonjour\n│   │   ├── relay.py                 # HTTP signaling server for internet discovery\n│   │   └── peer.py                  # Peer manager + Apple Silicon hardware detection\n│   ├── engine\u002F\n│   │   ├── diloco.py                # DiLoCo algorithm implementation\n│   │   ├── trainer.py               # Base MLX training loop\n│   │   ├── coordinator.py           # Coordinator node orchestration\n│   │   ├── worker.py                # Worker node logic\n│   │   ├── checkpoint.py            # Save\u002Fload\u002Fexport\u002Fimport checkpoints\n│   │   ├── pipeline.py              # Pipeline parallelism interface (v2)\n│   │   └── status.py                # Cluster status queries\n│   ├── network\u002F\n│   │   ├── transport.py             # Async TCP server\u002Fclient with heartbeat\n│   │   ├── protocol.py              # Binary message protocol\n│   │   └── compression.py           # FP16 + gzip gradient compression\n│   ├── models\u002F\n│   │   ├── transformer.py           # GPT-2 implementation in MLX\n│   │   └── registry.py              # Model name → factory mapping\n│   └── dashboard\u002F\n│       ├── app.py                   # FastAPI local dashboard + SSE\n│       └── static\u002Findex.html        # Dashboard UI (Chart.js)\n├── website\u002F                         # Public website (airtrain.dev)\n│   ├── backend\u002F\n│   │   ├── app.py                   # FastAPI app with CORS\n│   │   ├── models.py                # SQLAlchemy table 
definitions\n│   │   ├── auth.py                  # GitHub OAuth flow\n│   │   └── routes\u002F\n│   │       ├── swarms.py            # Live session browser API\n│   │       ├── relay.py             # Relay checkpoint board API\n│   │       └── leaderboard.py       # Leaderboard + badges API\n│   └── frontend\u002F\n│       └── index.html               # Landing page with swarm\u002Frelay\u002Fleaderboard\n├── examples\u002F\n│   ├── train_gpt2.py                # GPT-2 distributed training example\n│   ├── train_mnist.py               # Simple MNIST example for testing\n│   └── relay_demo.py                # Relay checkpoint handoff demo\n├── tests\u002F\n│   ├── test_config.py               # Config model tests\n│   └── test_protocol.py             # Protocol encode\u002Fdecode tests\n├── pyproject.toml                   # Package config + dependencies\n├── README.md\n└── LICENSE                          # MIT\n```\n\n## Comparison to Existing Tools\n\n| Feature | AirTrain | PyTorch DDP | Petals | Hivemind | Flower |\n|---|---|---|---|---|---|\n| Apple Silicon native | Yes (MLX) | No (MPS single-device) | Partial | Partial | Via PyTorch |\n| Communication reduction | 500x (DiLoCo) | 1x (every step) | N\u002FA (inference) | ~10x (Moshpit) | Varies |\n| Zero-config discovery | mDNS | Manual | DHT | DHT | Manual |\n| Wi-Fi friendly | Yes | No | Yes | Yes | Yes |\n| Dynamic join\u002Fleave | Yes | No | Yes | Yes | Yes (per round) |\n| Checkpoint relay | Yes | No | No | No | No |\n| Community platform | airtrain.dev | No | No | No | No |\n| Sleep Swarms (24\u002F7) | Yes | No | No | No | No |\n| Target hardware | Mac (Apple Silicon) | NVIDIA GPU | Any GPU | Any GPU | Any |\n\n### When to Use AirTrain vs Alternatives\n\n- **AirTrain** — You have Macs and want to train models collaboratively with friends\u002Fcommunity, either live or asynchronously via relay\n- **PyTorch DDP** — You have a homogeneous GPU cluster with fast interconnect (InfiniBand)\n- **Petals** — You 
want to run inference on huge models (70B+) by pooling GPUs across the internet\n- **Hivemind** — You want decentralized training across heterogeneous GPU machines\n- **Flower** — You need federated learning where data stays private on each device\n\n## Roadmap\n\n### v0.1 (Current)\n- [x] DiLoCo data-parallel training\n- [x] mDNS zero-config discovery\n- [x] Async TCP transport with heartbeat\n- [x] FP16 + gzip gradient compression\n- [x] Checkpoint save\u002Fload\u002Frelay\n- [x] CLI (start, join, pause, relay)\n- [x] Local web dashboard\n- [x] Public website (swarm browser, relay board, leaderboard)\n- [x] GPT-2 model\n\n### v0.2 (Planned)\n- [ ] Pipeline parallelism for models too large for a single Mac\n- [ ] Real dataset loaders (HuggingFace datasets integration)\n- [ ] More model architectures (LLaMA, Mistral, Phi)\n- [ ] Thunderbolt JACCL backend for same-room high-speed training\n- [ ] Website: real-time session metrics via WebSocket\n\n### v0.3 (Future)\n- [ ] NAT traversal for peer-to-peer across the internet without relay\n- [ ] Differential privacy for gradient sharing\n- [ ] Mobile support (iOS Neural Engine contribution)\n- [ ] Model Hub integration (auto-publish to HuggingFace on convergence)\n- [ ] Browser-based training viewer\n\n## Contributing\n\nWe welcome contributions! 
Areas where help is especially valuable:\n\n- **Model implementations** — Port more architectures to MLX\n- **Dataset loaders** — Integration with HuggingFace datasets, custom formats\n- **Testing** — Multi-node integration tests, benchmarks\n- **Website** — UI\u002FUX improvements, mobile responsiveness\n- **Documentation** — Tutorials, guides, video walkthroughs\n\n## License\n\nMIT License — see [LICENSE](LICENSE) for details.\n\n## Acknowledgements\n\nAirTrain builds on the work of:\n\n- [MLX](https:\u002F\u002Fgithub.com\u002Fml-explore\u002Fmlx) by Apple — Native Apple Silicon ML framework\n- [DiLoCo](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.08105) by Douillard et al. — The low-communication distributed training algorithm\n- [OpenDiLoCo](https:\u002F\u002Fgithub.com\u002FPrimeIntellect-ai\u002FOpenDiloco) by PrimeIntellect — Open-source DiLoCo implementation and validation\n- [Petals](https:\u002F\u002Fgithub.com\u002Fbigscience-workshop\u002Fpetals) — Proving collaborative ML training works over the internet\n- [Hivemind](https:\u002F\u002Fgithub.com\u002Flearning-at-home\u002Fhivemind) — Decentralized deep learning primitives\n- [python-zeroconf](https:\u002F\u002Fgithub.com\u002Fpython-zeroconf\u002Fpython-zeroconf) — Pure Python mDNS\u002FDNS-SD implementation\n","AirTrain is a project for distributed machine learning training across multiple Apple Silicon Mac devices. Using the DiLoCo algorithm, it achieves near-linear scaling and reduces network communication by 500x compared with traditional distributed training methods, which makes Wi-Fi-based training feasible. Core features include zero-config device discovery, fault tolerance, checkpoint relay, and a local monitoring dashboard optimized for the Apple Silicon architecture. It suits machine learning training tasks that need lower costs and have modest network bandwidth requirements, especially when no cloud GPU resources are available and existing Mac devices can be used to complete the training.",2,"2026-06-11 02:49:01","CREATED_QUERY"]