[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-76184":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":10,"pushedAt":10,"updatedAt":36,"readmeContent":37,"aiSummary":38,"trendingCount":15,"starSnapshotCount":15,"syncStatus":39,"lastSyncTime":40,"discoverSource":41},76184,"comfyui-mesh","shootthesound\u002Fcomfyui-mesh","shootthesound","Split FLUX.2 and LTX 2.3 across two GPUs (LAN or same-machine) — NVENC compresses activations live on the wire. Icarus (ComfyUI node) + Daedalus (back-half server).","",null,"Python",114,14,4,0,1,31,3.53,false,"main",true,[23,24,25,26,27,28,29,30,31,32,33,34,35],"comfyui","comfyui-node","diffusion-models","distributed-inference","flux","flux2","image-generation","ltx-video","multi-gpu","nvenc","pipeline-parallelism","pytorch","video-generation","2026-06-12 02:03:40","# ComfyUI-Mesh Icarus & Daedalus\n\n**ComfyUI Mesh : Icarus** *(the ComfyUI client node)* ↔\n**ComfyUI Mesh : Daedalus** *(the back-half server)*\n\n**Split a diffusion model across two GPUs — either over a gigabit\nnetwork OR between two cards in the same machine. The activations\nbetween them get compressed live by NVIDIA's NVENC idle silicon\nthrough a codec I designed to abstract model activation data.**\n\n> **Supported today:** FLUX.2 Dev, FLUX.2 Klein 9B, and **LTX 2.3\n> (LTX-AV 22B Dev)**. Each has its own paired node + server launcher\n> — see \"Quick start\" and the LTX section below. 
Other architectures\n> (Wan, FLUX.1, SD3.5, …) are on the roadmap further down — let me\n> know which one you want next.\n\n> **Headline:** FLUX.2 Klein 9B at 1024² generates in **~4.4 seconds\n> per image** split across an RTX 5090 + RTX 4090 over plain gigabit\n> ethernet. Only ~0.5 s of that is wire overhead (4 sampler timesteps\n> × ~130 ms round-trip) — the rest is diffusion that would have\n> happened anyway. Full numbers (incl. 1536² and lossless modes)\n> further down.\n\n![Demo workflow with Icarus inline](screenshots\u002Fworkflow-screenshot.png)\n\n\u003Ca href=\"https:\u002F\u002Fbuymeacoffee.com\u002Florasandlenses\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBuy%20me%20a%20coffee-FFDD00?style=for-the-badge&logo=buy-me-a-coffee&logoColor=black\" alt=\"Buy Me A Coffee\">\u003C\u002Fa>\n\nIf this project saves you buying a new GPU, please consider donating — it helps me support more models beyond the FLUX.2 family and keep this thing maintained.\n\nFLUX.2 Klein 9B \u002F FLUX.2 Dev (9 GB and ~22 GB respectively) running\non one Nvidia card with its back half offloaded to another Nvidia\ncard elsewhere on the LAN.\n**Any modern Nvidia GPU with NVENC works** — 3080 + 4080, 4070 + 5070,\n5090 + 4090, whatever you have. The two cards don't have to be the\nsame model or generation. Or two cards in the same box without NVLink.\nOr your friend's GPU over VPN. 
The bandwidth that would normally\nmake this miserable stops being the bottleneck because NVENC compresses\nthe bytes on the wire while they're already on the GPU.\n\n```\n                ┌─────────────────┐                     ┌─────────────────┐\n                │     Icarus      │   NVENC HEVC wire   │    Daedalus     │\n                │  (client node)  │ ─── ~10 MB \u002F step ─►│  (back-half)    │\n   img latent ──┤ front-half      │                     │ slim-loaded     │── img latent\n                │ blocks + VAE    │ ◄────────────────── │ server          │\n                └─────────────────┘    ~10 MB \u002F step    └─────────────────┘\n                       LoRAs work transparently across the wire\n```\n\n---\n\n## The thing that's new\n\nEvery modern Nvidia GPU has dedicated **NVENC** silicon that encodes\nH.265\u002FHEVC video in under a millisecond per frame. During ML inference it\nsits 100% idle — none of the compute uses it.\n\nThis rig treats ML activations as video frames and feeds them to NVENC.\nA FLUX activation tensor is `[batch, tokens, channels]`; we pack\nmultiple channels per Y\u002FU\u002FV plane in a grid, quantize per-channel to\nuint8, and the codec compresses the result by 3–10× depending on QP.\nThe output bitstream is what crosses the wire. The receiver runs NVDEC\nto reverse it.\n\n**The codec runs on dedicated silicon that wasn't going to do anything\nelse anyway.** So you get compression \"for free\" in the sense that you\nweren't using those transistors. The wire-byte savings convert directly\nto wall-clock savings on any bandwidth-limited transport.\n\nThat's it. 
The rest is plumbing.\n\n---\n\n## What works today\n\n- **FLUX.2 Dev and FLUX.2 Klein 9B.** These are the two FLUX.2\n  checkpoints Black Forest Labs ships today; both tested end-to-end.\n  (FLUX.1 schnell is a separate architecture and is on the roadmap\n  below, not in this list.)\n- **LTX 2.3 (LTX-AV 22B Dev and Distilled).** The Lightricks LTX video model with\n  audio+video transformer blocks. Uses a separate Icarus LTX node\n  and a separate Daedalus LTX server GUI — see the LTX section below\n  for the small UX differences from the FLUX pair.\n- **LoRAs**: any format ComfyUI itself supports — Kohya, Diffusers PEFT,\n  BFL Flux, USO, Wan Fun, SimpleTuner, native, etc. Includes the full\n  weight-adapter family: **lora, loha, lokr, glora, oft, boft**, plus\n  `diff` \u002F `set` style patches.\n- **Topologies**:\n  - Cross-machine over LAN (gigabit OK, 2.5G\u002F10G better)\n  - Cross-machine over VPN (residential broadband fine for\n    FLUX.2-distilled's 4-step samplers)\n  - Same machine with two GPUs (no NVLink required) — *the rig is\n    set up to handle this cleanly but I haven't been able to test it\n    end-to-end myself; community feedback welcome. See the same-host\n    quickstart below for the expected setup.*\n\n**Models that are NOT supported (yet)** — anything that isn't FLUX.2\nor LTX 2.3. The architectural differences are real (block signatures,\nmodulation, vec structure), but they're not blockers — just code that\nneeds writing. 
Top of the queue when community demand says go:\n\n- **Wan** (2.1 \u002F 2.2 \u002F VACE) — video model with very similar DiT shape\n- **FLUX.1** — same family as FLUX.2, minor handling differences\n- **SD3 \u002F SD3.5** — MMDiT, related architecture\n- **HunyuanVideo \u002F Qwen-Image \u002F Chroma** — each has its own quirks\n\nIf you want one of these added, **please consider supporting the\nproject** (link below) and tell me which — community demand drives\nthe priority list.\n\n---\n\n## Quick start — cross-machine (the headline use case)\n\nTwo physical machines: one running ComfyUI (**Icarus** lives here),\none running the back-half server (**Daedalus** lives here). Connected\nover your LAN or VPN.\n\n### 1. Install Icarus (the ComfyUI node) on the ComfyUI host\n\nPick whichever route you prefer:\n\n**Via ComfyUI Manager** (easiest if you have it):\n- Open Manager → \"Install via Git URL\" → paste this repo's URL → restart.\n\n**Via git clone:**\n```\ncd ComfyUI\u002Fcustom_nodes\ngit clone \u003Crepo-url> comfyui-mesh\n```\n\n**Drop-in copy:**\n- Copy or unzip the `comfyui-mesh\u002F` folder into\n  `ComfyUI\u002Fcustom_nodes\u002F`.\n\nAfter ComfyUI restarts, the Manager runs `pip install -r requirements.txt`\nautomatically — it's a single line, `cuda-bindings`. The codec wrapper\n(`nvenc_pframe\u002F`) is **bundled in the folder**, no separate install.\nThe Icarus node appears in the node menu under the `mesh` category.\n\n### 2. Deploy Daedalus (the back-half server) on the OTHER machine\n\nThe `server\u002F` subfolder of this repo is a self-contained deploy. 
Get\nit onto the other machine by whichever means you prefer:\n\n- **git clone the whole repo** there too, and use just the `server\u002F`\n  subfolder.\n- **Copy the `server\u002F` folder over the network** (USB stick \u002F SMB\n  share \u002F SCP \u002F `rsync` etc).\n- **Zip + transfer** if cross-OS.\n\nYou don't need ComfyUI installed on the back-half host beforehand —\nthe installer below clones it for you.\n\nThen on the back-half host, in a terminal in the `server\u002F` folder:\n\n1. Run the one-shot installer:\n   ```\n   install.bat\n   ```\n   Creates a `.venv`, clones ComfyUI INTO this server folder if missing,\n   installs CUDA-enabled torch (cu128 — covers RTX 30\u002F40\u002F50 series),\n   installs dependencies, runs a pre-flight check. Multi-GB,\n   takes a minute or two on a fast connection. Re-runs are idempotent.\n\n2. Drop your FLUX.2 or LTX 2.3 safetensors checkpoint into the same\n   `server\u002F` folder (e.g. `flux-2-klein-9b-fp8.safetensors`,\n   `flux2_dev_fp8mixed.safetensors`, or `ltx-2.3-22b-dev-fp8.safetensors`).\n   Where to get the right files:\n   - **FLUX.2 Dev:** the ComfyUI docs page\n     [Flux.2 Dev](https:\u002F\u002Fdocs.comfy.org\u002Ftutorials\u002Fflux\u002Fflux-2-dev)\n     has direct links (the fp8 variants are what fit comfortably on\n     consumer cards).\n   - **FLUX.2 Klein 9B:** Black Forest Labs' HuggingFace repo at\n     [black-forest-labs\u002FFLUX.2-klein-9B](https:\u002F\u002Fhuggingface.co\u002Fblack-forest-labs\u002FFLUX.2-klein-9B\u002Ftree\u002Fmain).\n   - **LTX 2.3 (LTX-AV 22B Dev and Distilled):** Lightricks' HuggingFace\n     repo at [Lightricks\u002FLTX-2.3-fp8](https:\u002F\u002Fhuggingface.co\u002FLightricks\u002FLTX-2.3-fp8\u002Ftree\u002Fmain).\n     Repo contains both the base model and a pre-distilled variant —\n     pick whichever fits your workflow.\n\n3. 
Launch via the GUI (recommended for first run):\n   ```\n   run_server_flux2_gui.bat     (FLUX 2)\n   run_server_ltx_gui.bat       (LTX-AV)\n   ```\n   Pick the model file, pick `n_blocks` (how many blocks to host —\n   spinbox shows the range, default 4), pick port, click **Start\n   Server**.\n\n   ![Daedalus server GUI](screenshots\u002Fserver.png)\n\nThe server prints `[server] READY — listening on 0.0.0.0:7777 (n_blocks=4: 4D + 0S)` when ready.\n\n**For more detail** (manual install path, headless `.bat` launchers,\nsame-host two-GPU pinning, troubleshooting matrix, log-format\nreference, network setup tips, `update_comfy.bat` for keeping the\nserver's ComfyUI in sync with the client's): see the\n[`server\u002FREADME.md`](server\u002FREADME.md).\n\n### Wire it up in a workflow\n\n```\nUNETLoader  →  (optional LoraLoader)  →  Icarus  →  KSampler\n```\n\nA ready-to-load demo workflow ships with the repo at\n[`workflows\u002Fklein-9b-example.json`](workflows\u002Fklein-9b-example.json) —\ndrag-and-drop it into ComfyUI to see the full graph wired up.\n\nSet on the `Icarus` node:\n\n- **`n_blocks_remote`** = same number as the server's GUI (default 4)\n- **`remote_host`** = the server's LAN IP, e.g. `192.168.0.18`, or\n  VPN IP `100.x.x.x`\n- **`remote_port`** = `7777`\n- **`codec_mode`** = `nvenc`, **`codec_qp`** = `18`, **`codec_tile_dim`** = `8`\n- **`forward_client_loras`** = **ON** (so any LoraLoader-loaded LoRA\n  affects the back half too)\n\nQueue a generation. Server log shows one `[server] forward …` line\nper timestep, with byte counts. Done.\n\n#### ⚠️ Important: FLUX.2 Dev + the FLUX.2 turbo LoRA — load it on BOTH sides, NOT via forwarding\n\nIf you're running FLUX.2 Dev with the **FLUX.2 turbo LoRA** (the\ndistillation LoRA that lets you sample in 4 steps instead of the\ndefault ~30), do NOT let Icarus forward it across the wire. The\nturbo LoRA is **~2.5 GB** — much bigger than typical character \u002F\nstyle LoRAs. 
Forwarding it pushes the laptop \u002F smaller GPU server\ninto a memory-pressure regime that **adds seconds per timestep** to\nthe codec encode time, completely killing the speed-up the rig is\nsupposed to give you.\n\nThe right setup is a **two-sided load** that never ships the LoRA\nacross the wire:\n\n1. **On the server:** load the turbo LoRA via Daedalus's GUI LoRA\n   picker. Applies to the back-half blocks server-side.\n2. **On the client:** place a `LoraLoader` for the same turbo LoRA\n   in the workflow **to the RIGHT of the Icarus node** (i.e.\n   between Icarus and KSampler). Applies to the front-half blocks\n   locally.\n\nThe \"after Icarus\" position is the trick — `forward_client_loras`\ndeliberately does NOT capture post-Icarus patches (so they stay\nlocal-only), and the server already has its copy from step 1. Net\nresult: the whole model gets the LoRA, but the wire never carries\nit.\n\nThis applies to any large LoRA (>~500 MB), not just turbo. For\nsmall character \u002F style LoRAs the normal \"LoraLoader BEFORE Icarus\n+ `forward_client_loras=ON`\" pattern is fine and convenient.\n\n---\n\n## LTX 2.3 — separate node + separate server GUI\n\nLTX 2.3 (the Lightricks LTX-AV 22B Dev model) uses its own paired\nclient node and server launcher, alongside the FLUX ones. Both halves\nof the rig live in the same install — pick which node + launcher to\nuse based on which model you're running.\n\n### Client side\n\nIn the ComfyUI node menu under `mesh` you'll see two nodes:\n\n- **`ComfyUI Mesh : Icarus`** — for FLUX.2 Dev \u002F Klein 9B.\n- **`ComfyUI Mesh : Icarus LTX`** — for LTX 2.3.\n\nDrop `Icarus LTX` between the LTX model loader and the LoraLoader \u002F\nsampler chain. 
A ready-to-load demo workflow ships at\n[`workflows\u002FLTX-example.json`](workflows\u002FLTX-example.json).\n\nThe LTX node has a deliberately smaller surface than the FLUX one:\n\n| Parameter | Default | What it controls |\n|---|---|---|\n| `model` | — | The loaded LTX-AV MODEL |\n| `n_blocks_remote` | 8 | How many of LTX's 48 transformer_blocks run remotely. Increase = more offload, smaller client VRAM. |\n| `remote_host` | `127.0.0.1` | Hostname\u002FIP of the back-half server |\n| `remote_port` | `7777` | TCP port |\n| `codec_mode` | `raw` | `raw` = uncompressed bf16 (default — safe on every wire). `Nvenc LTX (5090 optimized)` = LTX-tuned codec (NVENC HEVC + per-channel percentile-clip quant + sparse exact-correction of outliers; near-raw quality, ~3× smaller than raw, roughly the same wall-clock as raw on gigabit; tuned and validated on RTX 5090, behaviour on older NVENC generations untested). `nvenc` = plain NVENC HEVC, lighter wire, can show contrast crush on LTX. |\n| `forward_client_loras` | **OFF** | Same semantics as the FLUX node, but defaulted OFF on LTX because the LTX-AV 22B back-half is heavy enough that forwarding LoRAs can push \u003C24 GB back-half cards into ComfyUI's dynamic-offload thrash regime (see warning below). Turn ON only if your back-half has 24+ GB free for LoRA buffers. |\n\n**Default is `raw`** — uncompressed bf16, the safe choice on every\nwire.\n\n**Which codec mode to pick:**\n\n- **RTX 50-series (Blackwell) on both ends** → use\n  `Nvenc LTX (5090 optimized)`. This is where the tuned mode was\n  developed and validated; it's the best speed\u002Fquality balance\n  when both client and back-half server have a 50-series card.\n- **Anything older than 50-series** → stick with `raw`, or try\n  `nvenc` if you want a smaller wire and don't mind a quality\n  trade-off (plain `nvenc` on LTX content sometimes shows lower\n  contrast — A\u002FB against `raw` on your specific workflow before\n  trusting it). 
The `Nvenc LTX (5090 optimized)` mode is\n  untested on older NVENC generations.\n\nThere are no `codec_qp` \u002F `codec_lossless` \u002F `codec_tile_dim` widgets\non the LTX node — those are pinned internally at the sweet spot that\nworks for LTX activations.\n\n#### ⚠️ Workflow placement — LoraLoader position depends on intent\n\nThe Icarus LTX node should sit between the LTX model loader and\nKSampler. **Where you put your LoraLoader relative to Icarus LTX\ncontrols whether the LoRA reaches the server's back-half blocks:**\n\n- **LoraLoader BEFORE Icarus LTX** (with `forward_client_loras=ON`)\n  → the LoRA's patches are on the patcher when Icarus LTX captures\n  it; per timestep the node filters + remaps the back-half-targeting\n  patches, ships them via safetensors-encoded blob (only on session\n  change, not every step), and the server applies them to its slim\n  model. Net effect: the LoRA covers the whole model.\n  ```\n  LTX model loader → LoraLoader → Icarus LTX → KSampler\n  ```\n\n- **LoraLoader AFTER Icarus LTX** → the LoRA applies only to the\n  front-half blocks running locally; the server's back-half never\n  sees it. Use this when you specifically want a local-only LoRA, or\n  when the LoRA is already loaded server-side (see next note) and\n  you don't want to ship it.\n  ```\n  LTX model loader → Icarus LTX → LoraLoader → KSampler\n  ```\n\nThe first call after a LoRA change ships the encoded blob (can be\nhundreds of MB for big LoRAs); subsequent timesteps within the same\ngeneration send only the small session id. Swap LoRAs across gens\nand the next gen pays the blob cost once.\n\n#### ⚠️ Distilled LoRA — load it in the server, not in the workflow\n\nIf you're using the LTX 2.3 distilled LoRA (the standalone .safetensors,\nnot the pre-distilled model variant), **always load it via the\nDaedalus LTX server GUI's \"Distill LoRA\" row, NOT via a workflow\nLoraLoader**. 
The server applies it once at startup to the\nback-half blocks; the client applies the same LoRA locally to the\nfront-half blocks (place a LoraLoader AFTER Icarus LTX with the same\nfile, so it stays local-only). Net effect: the LoRA covers the\nwhole model without ever crossing the wire.\n\nThe alternative — putting it in a workflow LoraLoader BEFORE Icarus\nLTX with `forward_client_loras=ON` — would ship the LoRA bytes\nacross the network on every fresh generation that triggers a\nsession-id change. For the LTX distilled LoRA specifically that's\nhundreds of MB of wasted wire time for a LoRA that never changes.\nThe server-side slot is the right home for any \"always-on\" LoRA\nyou'd otherwise forward.\n\n#### ⚠️ Big LTX LoRAs — load on server + LoraLoader to the RIGHT of Icarus LTX\n\nLTX-family LoRAs are frequently **large** — often >500 MB, sometimes\ninto the gigabytes. Shipping a LoRA that big across the wire on\nevery session change adds seconds (or tens of seconds on slower\nlinks) to the generation, AND adds memory pressure on the server\nthat can push the codec encode into a slow regime — same mechanism\nas the FLUX turbo-LoRA gotcha further up.\n\nThe right pattern for a LoRA >~500 MB is **two-sided load** that\nnever crosses the wire:\n\n1. **On the server:** load it via the Daedalus LTX GUI's primary\n   LoRA row (the one above the Distill LoRA row).\n2. **On the client:** place a `LoraLoader` for the same file\n   **to the RIGHT of Icarus LTX** (i.e. between Icarus LTX and\n   KSampler). 
The post-Icarus position means the LoRA applies\n   to the front-half blocks locally and `forward_client_loras`\n   never sees it.\n\nNet result: the whole model gets the LoRA, but the wire only ever\ncarries activations.\n\nFor sub-500 MB LoRAs the \"LoraLoader BEFORE Icarus LTX +\n`forward_client_loras=ON`\" pattern is fine and convenient — the\nencoded blob ships once per session-id change, subsequent timesteps\nwithin the gen send only the small id.\n\n#### ⚠️ Strongly discouraged — `forward_client_loras=ON` on a server with \u003C24 GB VRAM\n\n**Don't use `forward_client_loras=ON` on a back-half server with\nless than 24 GB of VRAM for LTX.** The LTX-AV 22B back-half slim-load\nplus any active LoRAs plus codec scratch buffers will tip a 12 \u002F 16 \u002F\n20 GB server into ComfyUI's dynamic offload regime, where weights\nget paged in and out of VRAM mid-generation. Symptom: per-step\nforward time on the server jumps from ~1–2 s to ~5–10 s once a\nforwarded LoRA gets applied — that's offload thrash.\n\nOn a 24+ GB back-half server, forwarding is fine and the bytes\namortise well across the session-id cache.\n\nIf your back-half is under 24 GB, use the **two-sided load** pattern\nabove for ANY LoRA you'd otherwise forward, not just the big ones:\n\n- Server-side slot (primary or Distill LoRA row in the Daedalus LTX GUI)\n- Local-only LoraLoader on the same file, placed to the RIGHT of\n  Icarus LTX in the workflow so `forward_client_loras` doesn't see it.\n\nThat covers the whole model without ever asking the server to\nallocate a transient LoRA buffer on the fly.\n\n#### Server VRAM headroom — 16+ GB recommended\n\nA practical note on sizing the back-half host: **with 16+ GB of\nVRAM the LTX server has comfortable room** to slim-load 8–24\nback-half blocks AND stack one or two LoRAs on top. 
On a 12 GB\ncard (or smaller) you may need to keep `n_blocks_remote` near the\ndefault of 8 (or even lower) to fit a sizeable LoRA without\nbumping into ComfyUI's\ndynamic-offload heuristics, which start swapping weights in and\nout of VRAM mid-generation and can slow the per-step forward\nnoticeably. Symptom to watch for: per-step forward time on the\nserver jumping from ~1–2 s to ~5–10 s once a big LoRA is applied\n— that's the offload thrash signature, not anything we ship.\n\n### Server side\n\nThe server side ships **two** GUIs side-by-side in the same `server\u002F`\nfolder:\n\n- **`run_server_flux2_gui.bat`** — Daedalus for FLUX 2 (existing).\n- **`run_server_ltx_gui.bat`** — Daedalus LTX for LTX 2.3.\n\nLaunch whichever matches the client node you're using. They have\ndistinct settings files and run as independent processes — you can\nkeep both installed and toggle by closing one and launching the other\n(they default to the same port 7777 so don't try to run them\nsimultaneously without changing one).\n\nThe LTX server GUI has the same model picker + n_blocks + port +\ndevice + dtype rows as the FLUX one, plus **two LoRA slots** instead\nof one:\n\n- **LoRA \u002F LoRA strength** — primary slot (default strength 1.0).\n  Use this for your character or style LoRA.\n- **Distill LoRA \u002F Distill strength** — secondary slot (default\n  strength **0.5**). Intended for the **LTX 2.3 Distilled LoRA**,\n  which most LTX workflows stack on top of the base model. Default\n  0.5 matches the typical strength.\n\nBoth server-side LoRAs apply to the slim-loaded back-half blocks at\nserver startup (or restart on a settings change). Stacking them\nserver-side avoids the wire cost of forwarding them per generation.\n\n`n_blocks` on the server GUI defaults to **8** to match the LTX\nclient node's default. 
Persisted settings still win on subsequent\nlaunches.\n\n---\n\n## Quick start — same machine, two GPUs (no NVLink)\n\nComfyUI (Icarus) and the back-half server (Daedalus) live on the same\nmachine, each pinned to a different GPU. You get to use any pair\nof NVENC-capable Nvidia cards (3080 + 4080, 4090 + 5090,\nmix-and-match) without buying NVLink-capable cards.\n\n> **Heads-up on testing.** I've set this rig up to handle the\n> same-host two-GPU topology cleanly — the GPU-pinning launchers, the\n> `codec_mode=raw` shortcut on PCIe, the loopback host — but I haven't\n> personally been able to run an end-to-end same-host test on a\n> two-GPU box. The pieces should all work; if you hit a snag, please\n> open an issue (or ping me) so we can shake it out. Community\n> feedback on this path is genuinely useful.\n\n1. On the same host, install Icarus (as above) and deploy Daedalus\n   (as above) — both can sit on the same machine.\n2. Launch ComfyUI normally — it grabs whatever GPU it sees first\n   (usually `cuda:0`).\n3. Launch Daedalus **pinned to the OTHER GPU**:\n   ```\n   run_server_flux2_gpu1.bat   # pins FLUX 2 server to physical GPU 1\n   run_server_ltx_gpu1.bat     # or the LTX variant\n   ```\n   (or the `_gpu0` equivalents if ComfyUI is on GPU 1.) These set\n   `CUDA_VISIBLE_DEVICES` so the two processes don't fight over the\n   same card.\n4. In the workflow, set `remote_host = 127.0.0.1` (loopback).\n\nThat's it. The two processes share PCIe but each only sees its own\nGPU.\n\n**Specifically for same-host setups:** PCIe between two GPUs in\nthe same desktop moves ~32 GB\u002Fs, so sending raw activations takes less\ntime than the codec encode\u002Fdecode adds. You should get better wall-clock with\n`codec_mode = raw` instead of `nvenc`. The codec is the right tool for\nslow wires (LAN, VPN, residential broadband); on PCIe it's\noverkill and adds latency.\n\n---\n\n## What the two nodes do\n\n### `Icarus`\n\nPass-through MODEL node. 
Slot it between the model loader (or\nLoraLoader) and the sampler.\n\n\u003Cimg src=\"screenshots\u002Ficarus%20node.png\" alt=\"Icarus node\" width=\"380\">\n\nIts parameters:\n\n| Parameter | Default | What it controls |\n|---|---|---|\n| `model` | — | The loaded FLUX MODEL |\n| `n_blocks_remote` | 4 | How many transformer blocks run remotely (counts double-blocks first, then single-blocks). For FLUX.2 Klein 9B: max 32. For FLUX.2 Dev: max 56. Change handling is inline — see \"Live UX\" below. |\n| `remote_host` | `127.0.0.1` | Hostname or IP of the back-half server. 127.0.0.1 = same machine. 192.168.x.x = LAN. 100.x.x.x = VPN. |\n| `remote_port` | `7777` | TCP port the back-half server is listening on |\n| `codec_mode` | `nvenc` | `nvenc` for slow wires (LAN, VPN, residential broadband). `raw` for same-host PCIe (faster than codec encode\u002Fdecode latency). |\n| `codec_qp` | `18` | NVENC quality. 10=near-lossless, 18=sharp (default), towards 28 the image gets noticeably softer with visible noise |\n| `codec_lossless` | OFF | NVENC lossless tuning (still has uint8 quant floor) |\n| `codec_tile_dim` | `8` | Channels-per-frame tile size. Higher = fewer larger NVENC frames = ~5× faster. 8 is the default; 4 is also fine if you want a touch more codec headroom. |\n| `forward_client_loras` | ON | Ship client-side LoraLoader patches to server so the LoRA effect covers back-half blocks too |\n\n---\n\n## Live UX on the node\n\nThe `Icarus` node has a few inline UI behaviours so you don't\nhave to hunt the console for status:\n\n- **Always-on connection indicator** at the bottom: green dot = client\n  connected to the mesh server, red = disconnected (server died or\n  network gone), grey = idle (no queue this session yet). Right side\n  shows `host:port · server n=N` so you can see at a glance what\n  you're talking to and what its `--n-blocks` is. 
Polls every 3s.\n\n- **Confirm-restart button** (orange, bold) appears when you change\n  `n_blocks_remote` — it's a pending state that won't actually take\n  effect until you click it. Clicking POSTs the new value to the\n  server, which restarts itself with the new `--n-blocks`. The button\n  disappears when the round-trip completes. Until then, queueing the\n  workflow is blocked with a clear \"click Confirm first\" message —\n  prevents the silent-wrong-output footgun of mismatched n on the\n  two sides.\n\n- **Inline banner** under the node body for important warnings — most\n  notably \"decreasing n_blocks_remote requires a ComfyUI restart\"\n  (the client's stripped weights for the back-half blocks are gone\n  for the session and can only be reloaded from disk by a fresh\n  ComfyUI launch).\n\n- **Last-used values remembered** across fresh node drops. Drop an\n  `Icarus` into a brand-new workflow and your last\n  `remote_host` \u002F `n_blocks_remote` \u002F `codec_qp` etc. come back\n  pre-filled. Loading a saved workflow always wins over the\n  remembered defaults.\n\n- **Transparent reconnect** if the server dies and comes back. The\n  cached client socket gets reset; the connection indicator's\n  background poll attempts a fresh handshake every few seconds and\n  flips green automatically the moment the server is back — no need\n  to queue a workflow first to see whether you're back online. Works\n  whether the server crashed, restarted itself for a reconfigure, or\n  you killed and re-launched it manually.\n\n- **Per-control tooltips** on hover. Every parameter has a one-line\n  explanation that ComfyUI surfaces when your mouse rests on the\n  pill. 
Useful for picking sensible values without hunting the docs.\n\n- **❓ Help button** at the very bottom of the node opens an inline\n  modal with categorised tips: connection troubleshooting, the\n  Confirm-restart flow, LoRA workflow ordering, codec quality\n  trade-offs, and the ComfyUI-version-mismatch gotcha (which we\n  can't auto-detect because the wire protocol doesn't carry the\n  server's ComfyUI version — `update_comfy.bat` on the server side\n  is the standing fix).\n\n---\n\n## Honest performance numbers\n\nEnd-to-end wall-clock per generated image. FLUX.2 Klein 9B distilled,\n4 sampler steps, RTX 5090 desktop client + RTX 4090 laptop server,\ngigabit ethernet, `n_blocks_remote = 12`, `tile_dim = 8`.\n\n| Resolution | NVENC `qp = 18` | NVENC lossless | Raw (no codec) |\n|---|---:|---:|---:|\n| 1024 × 1024 | **4.38 s** | 4.70 s | 7.20 s |\n| 1536 × 1536 | **4.41 s** | 5.15 s | 9.13 s |\n\nA few things worth pulling out of that table:\n\n- **Compression beats raw by a wide margin and the gap widens with\n  resolution.** At 1024² the codec saves you ~2.8s per image (39%\n  faster); at 1536² it saves ~4.7s (52%). Activations grow with\n  resolution but NVENC compresses bigger frames just as well, so the\n  wire stops being the bottleneck.\n- **NVENC `qp=18` is essentially free vs lossless** at default settings\n  — 0.3s difference at 1024², 0.7s at 1536² — and FLUX's residual\n  stream absorbs the QP=18 codec noise comfortably (cosine similarity\n  > 0.995 per round-trip; visually indistinguishable from all-local at\n  the same seed up to roughly QP=28).\n- **Resolution barely affects NVENC wall-clock** (4.38 → 4.41s, ~1%\n  increase from 1024² to 1536²), because the codec compresses the\n  bigger activations to similar wire payloads. 
Raw mode jumps 27% over\n  the same resolution increase because every extra byte crosses the\n  wire uncompressed.\n\nHeadline: at QP=18 you're paying about **130 ms per timestep** for\nthe codec encode + LAN + remote forward + LAN + codec decode, regardless\nof which side has the bigger activations. The rest of the per-image time is\ndiffusion that would have happened anyway.\n\n---\n\n## Honest limits\n\n- **FLUX.2 and LTX 2.3 family today.** Other architectures (Wan,\n  FLUX.1, SD3.5, HunyuanVideo, etc.) need per-model code (block\n  signatures, modulation, vec structure differ). Open to contributions\n  or sponsored work.\n- **Workflow ordering for client-LoRA forwarding**: LoraLoader\n  position relative to `Icarus` \u002F `Icarus LTX` controls intent.\n  BEFORE the mesh node + `forward_client_loras=ON` → LoRA covers the\n  whole model (front locally + back forwarded to server). AFTER the\n  mesh node → LoRA stays local. Tooltips on the nodes warn about\n  this.\n\n  Correct (LoraLoader → Icarus → KSampler):\n\n  ![LoRA placed before Icarus — works](screenshots\u002FLora%20Before%20Icarus.png)\n\n  Wrong for forwarding (Icarus → LoraLoader → KSampler) — LoRA only affects the\n  front-half blocks; the back-half running on Daedalus never sees it\n  unless the same LoRA is also loaded server-side:\n\n  ![LoRA placed after Icarus — back half misses it](screenshots\u002FLora%20After%20Icarus.png)\n- **Decreasing `n_blocks_remote` requires a ComfyUI restart.**\n  The client slim-load strips back-half block weights in place to free\n  VRAM (the whole point — the server already has those blocks, the\n  client doesn't need to hold them too). **Increasing** `n_blocks_remote`\n  works seamlessly — the strip extends incrementally to cover more\n  blocks, the Confirm button restarts the server, no client reload\n  needed. **Decreasing** would require un-stripping the weights, but\n  those weights are gone for the session. 
The inline banner under the\n  node tells you to restart ComfyUI; the next launch re-reads the\n  model from disk and applies the new (smaller) `n_blocks_remote`.\n- **Sequential request\u002Fresponse.** No CUDA-stream overlap of codec\n  work with compute. The FLUX sampler is inherently sequential per\n  timestep, so this caps the headroom anyway.\n- **One client at a time.** Server is single-tenant. Connecting a\n  second client kicks the first.\n\n---\n\n## Support the project\n\nThis is independent work by one person. If it saves you the cost of an\nextra GPU, or you'd just like more FLUX-family models \u002F architectures\nsupported, **donations make this go faster**:\n\n\u003Ca href=\"https:\u002F\u002Fbuymeacoffee.com\u002Florasandlenses\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBuy%20me%20a%20coffee-FFDD00?style=for-the-badge&logo=buy-me-a-coffee&logoColor=black\" alt=\"Buy Me A Coffee\">\u003C\u002Fa>\n\n**Why it matters to me**\n\nI'm a parent working from home, supporting a long-term ill child alongside my wife. 
As much as this dev work is a passion, it's also a needed distraction — and donations genuinely help keep the lights on right now.\n\nWhat more support unlocks:\n- **More model architectures.** Highest leverage targets:\n  **Wan** (image + video, hugely popular ComfyUI workload),\n  **FLUX.1** (small lift from FLUX.2),\n  **SD3.5**, **HunyuanVideo**, **Qwen-Image**, **Chroma**.\n- **Multi-LoRA server-side stacking on FLUX** (the LTX server GUI\n  already has two LoRA slots — primary + distill — but the FLUX GUI\n  is still single-slot)\n- **Multi-client server mode** — rent your back-half GPU out\n- **CUDA-stream overlap** — codec hides behind compute for genuine\n  wall-clock parity with all-local\n- **Activation pre-stage cache** — the LTX path already does this\n  (per-generation constants cache for context tensors + PE pairs);\n  porting to FLUX is the symmetric extension\n\n---\n\n## Files\n\n```\ncomfyui-mesh\u002F\n├── README.md                     ← this file\n├── requirements.txt              ← ComfyUI auto-installs (cuda-bindings)\n├── __init__.py                   ← ComfyUI node registration + WEB_DIRECTORY\n├── mesh_node.py                  ← MeshSplitFlux + \u002Fmesh\u002Fstatus + \u002Fmesh\u002Freconfigure HTTP routes\n├── mesh_node_ltx.py              ← MeshSplitLTX + \u002Fmesh\u002Fltx\u002Fstatus + \u002Fmesh\u002Fltx\u002Freconfigure\n├── codec.py                      ← tensor ↔ NVENC bitstream (per-channel uint8 + HEVC, plus \"Nvenc LTX (5090 optimized)\" mode)\n├── protocol.py                   ← length-prefixed TCP framing\n├── vec_io.py                     ← FLUX.2 vec\u002Fmodulation tuple (de)serializer\n├── payload_ltx.py                ← LTX-AV per-block payload (de)serializer (constants cache, PE 3-tuples, CompressedTimestep)\n├── lora_io.py                    ← safetensors-based LoRA patch shipping\n├── web\u002Fmesh.js                   ← FLUX Icarus pill widgets, banner, Confirm button, connection light\n├── web\u002Fmesh_ltx.js     
          ← LTX Icarus equivalents (polls \u002Fmesh\u002Fltx\u002Fstatus)\n├── workflows\u002Fklein-9b-example.json  ← drop-in FLUX.2 Klein 9B demo workflow\n├── workflows\u002FLTX-example.json    ← drop-in LTX 2.3 demo workflow\n├── smoke_test_codec.py           ← standalone codec roundtrip test\n├── nvenc_pframe\u002F                 ← BUNDLED NVENC codec wrapper (no separate install)\n└── server\u002F                       ← deploy folder for the back-half host\n    ├── README.md\n    ├── CLAUDE.md\n    ├── install.bat               ← one-shot installer\n    ├── requirements.txt\n    ├── mesh_server.py            ← FLUX slim-load TCP server\n    ├── mesh_server_gui.py        ← FLUX Tkinter wrapper\n    ├── mesh_server_ltx.py        ← LTX slim-load TCP server (LTX-AV variant detection, dual LoRA slots)\n    ├── mesh_server_ltx_gui.py    ← LTX Tkinter wrapper (extra Distill LoRA row)\n    ├── codec.py \u002F protocol.py \u002F vec_io.py \u002F lora_io.py \u002F payload_ltx.py \u002F nvenc_pframe\u002F  ← wire-contract mirrors (byte-identical to client)\n    ├── smoke_test_server.py\n    ├── install_check.py\n    └── run_server*.bat           ← launchers: run_server_flux2{,_gpu0,_gpu1,_cpu,_gui}.bat + run_server_ltx{,_gpu0,_gpu1,_cpu,_gui}.bat + install\n```\n\n---\n\n## Sibling repo — the codec foundation\n\nThe NVENC codec wrapper that powers the wire compression in this rig\nlives in its own home repo:\n\n- **[shootthesound\u002Ftorch-nvenc-compress](https:\u002F\u002Fgithub.com\u002Fshootthesound\u002Ftorch-nvenc-compress)**\n  — the public reference codec library. ComfyUI Mesh is the first\n  application built on top of it; LLM KV-cache work, distributed\n  training, and other tensor-data use cases are coming. 
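The\n  `nvenc_pframe\u002Fdirect\u002F` folder bundled inside this repo is a\n  vendored copy of that codec.\n\n---\n\n## License & contact\n\nAuthor: Peter Neill — `peter@shootthesound.com`\n\nThe bundled `nvenc_pframe` codec wrapper is Apache-2.0; its\ncomponents retain their original license.\n\nBug reports, feature requests, and architecture additions — open an\nissue, email me, or donate to push them up the queue. ☕\n","ComfyUI-Mesh is a project for running diffusion models (such as FLUX.2 and LTX 2.3) split across two GPUs, either over a LAN or within the same machine. It uses NVIDIA's NVENC to compress activation data in real time so that it can be transferred efficiently between the two cards. Its core functionality is distributed inference of the front and back halves of the model through the Icarus node and the Daedalus server, which significantly reduces the need for a single high-end GPU. It currently supports FLUX.2 Dev, FLUX.2 Klein 9B, and LTX 2.3, with more architectures planned. It suits researchers and developers who need high-performance image generation but cannot afford a top-tier GPU, lowering hardware cost while maintaining performance.",2,"2026-06-11 03:54:46","CREATED_QUERY"]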