[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79998":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":16,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":18,"rankGlobal":10,"rankLanguage":10,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":32,"readmeContent":33,"aiSummary":34,"trendingCount":16,"starSnapshotCount":16,"syncStatus":13,"lastSyncTime":35,"discoverSource":36},79998,"UniVidX_ComfyUI","dreamrec\u002FUniVidX_ComfyUI","dreamrec","UniVidX intrinsic & alpha video decomposition custom nodes for ComfyUI","https:\u002F\u002Fhouyuanchen111.github.io\u002FUniVidX.github.io\u002F",null,"Python",72,2,71,3,0,1,41.53,"GNU General Public License v3.0",false,"main",true,[24,25,26,27,28,29,30,31],"alpha-matting","comfyui","comfyui-custom-nodes","intrinsic-decomposition","siggraph-2026","unividx","video-diffusion","wan21","2026-06-12 04:01:26","![UniVidX banner](assets\u002Fregistry-banner-20260510.svg)\n\n# UniVidX Intrinsic & Alpha Decomposition for ComfyUI\n\n[![Smoke 
Test](https:\u002F\u002Fgithub.com\u002Fdreamrec\u002FUniVidX_ComfyUI\u002Factions\u002Fworkflows\u002Fsmoke.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fdreamrec\u002FUniVidX_ComfyUI\u002Factions\u002Fworkflows\u002Fsmoke.yml)\n![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-GPL--3.0-2f855a)\n![PyTorch](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftorch-%E2%89%A52.7%2Bcu128-ee4c2c)\n![Nodes](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fnodes-5-f59e0b)\n![Tests](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftests-85%20unit%20%2B%2010%20integration-success)\n![GPU](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGPU-%E2%89%A532%20GB%20VRAM%20%7C%20RTX%205090%20validated-blueviolet)\n\nComfyUI custom nodes for [UniVidX](https:\u002F\u002Fhouyuanchen111.github.io\u002FUniVidX.github.io\u002F) (SIGGRAPH 2026): unified video diffusion that decomposes a clip into **RGB \u002F Albedo \u002F Irradiance \u002F Normal** (intrinsic) or **Composite RGB \u002F Alpha matte \u002F Foreground \u002F Background** (alpha). 30 task modes across two model variants, all driven from a single five-node graph.\n\n> ## ⚠️ Hardware requirements (read this first)\n>\n> This is a **14-billion-parameter video diffusion model** running locally. It is **not lightweight**. 
Verify your system can handle it before installing.\n>\n> **Minimum to run at all:**\n>\n> | Resource | Requirement |\n> |---|---|\n> | **GPU VRAM** | **≥ 24 GB** with the 0.5.0 FP8 path; **≥ 32 GB** for the BF16 baseline |\n> | **GPU architecture** | CUDA compute capability **8.0+** (Ampere \u002F RTX 3000+, Ada \u002F RTX 4000+, Hopper \u002F H100, Blackwell \u002F RTX 5000+) |\n> | **System RAM** | **≥ 32 GB** (peak ~28 GB during cold-load of the BF16 DiT shards); 64+ GB comfortable |\n> | **Disk** | **~85 GB free** (Wan2.1-T2V-14B 69 GB + UniVidX 1.6 GB + optional LightX2V 0.6 GB + working space) |\n> | **PyTorch** | **≥ 2.7 with CUDA 12.8** (older torch errors on Blackwell GPUs with `no kernel image is available`) |\n> | **Python** | **3.10+** (tested on 3.12.9) |\n> | **ComfyUI** | **0.20+** |\n>\n> **Validated configuration (all benchmarks in this README):** RTX 5090 (32 GB Blackwell sm_120), Windows 11, Python 3.12.9, PyTorch 2.7.0+cu128, ComfyUI Desktop 0.20.1.\n>\n> **Card-by-card honest read:**\n>\n> | GPU | Status | Notes |\n> |---|---|---|\n> | **RTX 5090** (32 GB Blackwell) | ✅ **Validated** | The benchmark target. FP8 path: 9.43 min\u002Fchunk; preview: 4.59 min\u002Fchunk. |\n> | **RTX 4090** (24 GB Ada, FP8 native) | 🟢 **Should work; not validated by us** | FP8 path fits cleanly (~14 GB DiT + activations). BF16 path tight but possible with `vram_buffer_gb=8+`. Expect ~10-15% slower than 5090 due to memory bandwidth. |\n> | **RTX 6000 Ada \u002F A6000** (48 GB; Ada \u002F Ampere) | 🟢 **Should work** | Plenty of headroom for both paths. Could batch 2-3 clips in parallel with custom orchestration. |\n> | **RTX 6000 Pro Blackwell** (96 GB) | 🟢 **Should work, ideal for batch** | Sweet spot if you process many clips per day. Same per-clip speed as 5090; massive parallelism headroom. |\n> | **H100 \u002F H200 \u002F B200** (80-192 GB datacenter) | 🟢 **Should work, overkill for inference** | Same per-clip speed as Blackwell consumer. 
Worth the cost only if you're also fine-tuning. |\n> | **RTX 3090 \u002F 3090 Ti** (24 GB Ampere, no native FP8) | 🟡 **Should work, slower** | FP8 path runs via software cast (Ampere has no FP8 tensor cores). Memory fits; per-step ~20-30% slower than 4090. |\n> | **RTX 4080 \u002F 5080** (16 GB) | 🔴 **Will OOM** | FP8 DiT alone is ~14 GB. Add activations + VAE + text encoder → exceeds 16 GB at production resolution. |\n> | **RTX 3080 \u002F 4070 \u002F 4060 Ti** (12 GB) | ❌ **Cannot run** | Insufficient VRAM even with aggressive layer streaming. |\n> | **Pre-Ampere** (RTX 20-series, V100, P100, etc.) | ❌ **Cannot run** | CUDA compute capability \u003C 8.0. Wan2.1's Flash-Attention-2 path needs sm_80+. |\n>\n> **Cloud option:** if you don't have local hardware, **rent an L40 (48 GB) or RTX 6000 (48 GB) instance** from RunPod \u002F Vast.ai \u002F Lambda Labs — they're typically $0.50-1.00\u002Fhour and a single 1-min clip via the chunked sampler (FP8 preview, ~4.4 hr) runs for ~$3-5 of compute.\n>\n> **Per-clip wall times** (RTX 5090 reference; all measured):\n> - One 21-frame chunk @ 480×640: **~9.43 min** (production FP8) or **~4.59 min** (fast preview FP8+distill)\n> - 1-minute @ 24 fps clip via chunked sampler: **~14 hours** (production) or **~4.4 hours** (preview)\n> - This is **not real-time**. Plan workflows around overnight \u002F multi-hour processing.\n\n**What you'd use it for:** relighting (swap the irradiance channel, recombine), VFX alpha pulls without a green screen (a clean matte from any clip), 3D reconstruction pipelines that need normals + albedo as conditioning, ControlNet-style guidance for *other* video models that consume normal maps.\n\n**Strategy A wrapper** — UniVidX's official pipeline runs as an opaque black box. 
The four output IMAGE batches become standard ComfyUI tensors that flow into any downstream node (VHS video combine, alpha compositing, 3D reconstruction, ControlNet for *other* models, etc.).\n\n### Use this when\n\n- You specifically need **intrinsic** (RGB \u002F Albedo \u002F Irradiance \u002F Normal) or **alpha** (matte \u002F fg \u002F bg) decomposition of a video clip — that's UniVidX's whole reason to exist. No other Wan2.1 wrapper does this.\n- You want clean drag-and-drop ComfyUI workflows for the 30 task modes without writing pipeline code.\n\n### Use [`kijai\u002FComfyUI-WanVideoWrapper`](https:\u002F\u002Fgithub.com\u002Fkijai\u002FComfyUI-WanVideoWrapper) when\n\n- You want generic Wan2.1\u002F2.2 T2V or I2V (just RGB out, not decomposition).\n- You need finer-grained per-block CPU swap, async prefetch, or kijai-curated FP8\u002FGGUF Wan checkpoints. Their wrapper has more model-management surface; ours has the UniVidX-specific decomposition head.\n\n## See it move — `R2AIN` intrinsic decomposition\n\nThe first 21 contiguous frames of a 24 fps portrait clip in (~0.87 sec of real motion). Three physically-grounded decompositions out, plus a clean normal map — all from one ComfyUI graph. FP8 prequantized DiT, **9.52 minutes** wall on an RTX 5090 *(20 steps · cfg 5.0 · seed 42 · the production preset)*. 
GIFs play back at the native 24 fps.\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Ctd align=\"center\" width=\"50%\">\u003Cb>① RGB input\u003C\u002Fb>\u003Cbr>\u003Csub>the conditioning clip\u003C\u002Fsub>\u003Cbr>\u003Cimg src=\"assets\u002Fresults\u002Fdemo_rgb.gif\" width=\"100%\" alt=\"RGB input frames\">\u003C\u002Ftd>\n    \u003Ctd align=\"center\" width=\"50%\">\u003Cb>② Albedo\u003C\u002Fb>\u003Cbr>\u003Csub>lighting stripped — pure surface color\u003C\u002Fsub>\u003Cbr>\u003Cimg src=\"assets\u002Fresults\u002Fdemo_albedo.gif\" width=\"100%\" alt=\"Albedo decomposition\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd align=\"center\" width=\"50%\">\u003Cb>③ Irradiance\u003C\u002Fb>\u003Cbr>\u003Csub>incoming light field, soft &amp; smooth\u003C\u002Fsub>\u003Cbr>\u003Cimg src=\"assets\u002Fresults\u002Fdemo_irradiance.gif\" width=\"100%\" alt=\"Irradiance field\">\u003C\u002Ftd>\n    \u003Ctd align=\"center\" width=\"50%\">\u003Cb>④ Normal\u003C\u002Fb>\u003Cbr>\u003Csub>encoded surface orientation\u003C\u002Fsub>\u003Cbr>\u003Cimg src=\"assets\u002Fresults\u002Fdemo_normal.gif\" width=\"100%\" alt=\"Normal map\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\nThis is the **`R2AIN` task mode** of the **intrinsic** variant: one conditioning RGB clip, three target decompositions in a single pass. The **albedo** is lighting-independent surface color — relight the scene by swapping the irradiance and re-multiplying. The **normal map** is physically-meaningful surface orientation, usable as ControlNet conditioning for any downstream video\u002Fimage model that consumes normals. Every frame above was emitted by UniVidX itself; no compositing tricks. Reproduce: `examples\u002F_gif_demo_runner.py`.\n\n### Flagship workflow — five nodes end-to-end\n\n![t2RAIN workflow](assets\u002Fworkflow_t2RAIN.png)\n\nDrag-and-drop ready: `Loader → TaskMode → Sampler → Decode → Save`. 
The same five-node shape works for `t2RAIN` (text-only), `R2AIN` (the demo above), or any of the other 28 task modes — pick your mode in the `TaskMode` node, the sampler validates required inputs and routes accordingly.\n\n### Alpha decomposition (same clip, alpha variant)\n\n![Alpha quad](assets\u002Fresults\u002FLTX_alpha_quad.jpg)\n\nSame source clip, alpha variant + mode `R2PFB`. The **alpha matte** is a true binary-quality mask. **Background** is the most striking output: the model inpaints the wallpaper, chair, and candle stand *behind* where the subject was sitting.\n\n## Quick start\n\n```bash\n# 1. Install. The --recurse-submodules flag pulls in the UniVidX vendor\n#    repo (~500 MB of upstream Python + small assets — no Git LFS needed).\ncd ComfyUI\u002Fcustom_nodes\ngit clone --recurse-submodules https:\u002F\u002Fgithub.com\u002Fdreamrec\u002FUniVidX_ComfyUI.git\ncd UniVidX_ComfyUI\npython -m pip install -r requirements.txt\npython install.py        # creates Win junction \u002F POSIX symlink. No admin needed.\n\n# 2. Install the Hugging Face CLI if you don't have it, then download models.\n#    ~83 GB total — UniVidX is built on Wan2.1-T2V-14B which is the bulk.\npip install -U \"huggingface_hub[cli]\"\nhf download Wan-AI\u002FWan2.1-T2V-14B  --local-dir ComfyUI\u002Fmodels\u002Fwan21_t2v_14b\nhf download houyuanchen\u002FUniVidX    --local-dir ComfyUI\u002Fmodels\u002Funividx\n\n# 3. Restart ComfyUI, drag examples\u002Ft2RAIN_basic.json onto the canvas, queue.\n```\n\n**ComfyUI Desktop \u002F portable \u002F manual** — paths above assume a layout where `ComfyUI\u002Fmodels\u002F` is a sibling of `ComfyUI\u002Fcustom_nodes\u002F`. 
ComfyUI Desktop installs put `models\u002F` under `Documents\u002FComfyUI\u002F`; the `python install.py` step auto-resolves either layout.\n\nFor real video-clip conditioning (your own MP4), use [`examples\u002FR2AIN_video_api.json`](examples\u002FR2AIN_video_api.json) (intrinsic) or [`examples\u002FR2PFB_video_api.json`](examples\u002FR2PFB_video_api.json) (alpha). They load 21 evenly-spaced frames from disk via `VHS_LoadVideoPath`, which means you'll also need [ComfyUI-VideoHelperSuite](https:\u002F\u002Fgithub.com\u002FKosinkadink\u002FComfyUI-VideoHelperSuite) installed.\n\n## Two recommendations (0.5.0)\n\n**For production finals (best quality):**\n\n```\nUniVidXLoader.dit_weight_mode = fp8_prequantized\n```\n\nEverything else default. Wall: **9.43 min** on R2AIN_video. Quality verified PSNR ≥ 30 dB per modality against BF16. Use this for anything you'd ship.\n\n**For iteration \u002F long-clip processing (fast):**\n\n```\nUniVidXLoader.dit_weight_mode    = fp8_prequantized\nUniVidXLoader.step_distill_lora  = lightx2v\nUniVidXLoader.step_distill_strength = 1.0\nUniVidXSampler.num_inference_steps = 4\nUniVidXSampler.cfg_scale           = 1.0\n```\n\nWall: **4.59 min** on R2AIN_video — **3.15× faster than 0.3.x PRODUCTION**, 2× faster than 0.4.0 FP8 alone. Quality is ~22-26 dB PSNR vs BF16 — visibly different decompositions but plausible content. **Use this for iteration loops, long-clip processing (with `chunked_clip_sampler.py`), or anywhere \"fast and pretty good\" beats \"slow and pristine.\"** Don't ship this output as a final deliverable without an eye check.\n\n## Full performance matrix (0.5.0, RTX 5090, R2AIN_video @ 480×640×21 frames)\n\n| Configuration | Steps | cfg | Wall (min) | Δ vs BF16 baseline | Notes |\n|---|---|---|---|---|---|\n| **BF16 baseline** (no extras) | 20 | 5.0 | 10.85 | 0% | The reference point. ~28 GB DiT, vram_buffer streaming. 
|\n| **🏆 FP8 baseline** (`dit_weight_mode=fp8_prequantized`) | 20 | 5.0 | **9.43** | **−13.1%** | **Production default.** ~14 GB DiT fully resident. |\n| **🚀 FP8 + distill (lightx2v)** | 4 | 1.0 | **4.59** | **−57.7%** | **Fast preview \u002F iteration.** 3.15× vs old PRODUCTION. Quality ~22-26 dB PSNR. |\n| BF16 + distill (lightx2v) | 4 | 1.0 | 5.77 | −46.8% | FP8 strictly better than BF16 under distill too. |\n| BF16 + sage | 20 | 5.0 | 14.48 | +33.5% | sage_attn is +33% wall on this workload; *not* the −18% advertised by 0.2.0. |\n| FP8 + sage | 20 | 5.0 | 11.75 | +8.3% | sage compounds with FP8 negatively. |\n| FP8 + compile_dit | 20 | 5.0 | 11.65 | +7.4% | Graph-captures cleanly on FP8Linear but no per-step speedup on top of FP8's residency. |\n| FP8 alpha (R2PFB) | 20 | 5.0 | 12.36 | +14% | Alpha variant works; slightly slower than intrinsic. |\n| FP8 PREVIEW + sage | 4 | 1.0 | 6.20 | (different config) | Cold-load dominates short runs. |\n| FP8 text-only tiny (t2RAIN 256×256×5×3) | 3 | 5.0 | 4.66 | (different config) | Tiny smoke baseline. |\n\n### Quality (FP8 vs BF16, R2AIN_video, same seed, 21 frames)\n\n| Modality | PSNR (dB) | Threshold | Verdict |\n|---|---|---|---|\n| placeholder (RGB output slot when RGB is condition) | inf (exact) | ≥30 | PASS |\n| albedo | 30.89 | ≥30 | PASS |\n| irradiance | 39.17 | ≥30 | PASS, comfortable margin |\n| normal | 36.28 | ≥30 | PASS, comfortable margin |\n\n## Using FP8 (new in 0.4.0)\n\nSet this on `UniVidXLoader`:\n\n```\nvariant            = intrinsic     (or alpha)\ndtype              = bfloat16\ndit_weight_mode    = fp8_prequantized\nvram_buffer_gb     = 4.0\nprefer_sage_attn   = False\ncompile_dit        = False\n```\n\nHow it works under the hood: after UniVidX's standard BF16 cold-load, the loader walks the DiT, computes per-tensor absmax scales for each Linear layer, casts the weights to `torch.float8_e4m3fn`, and replaces each Linear with an `FP8Linear` that dequantizes on forward. 
UniVidX's per-modality LoRA adapters (the four `lora_A\u002FB_\u003Cmod>` pairs at each attention block) are preserved at BF16 by walking through PEFT wrappers and replacing only the inner base layer. No external file is needed. If a Kijai `Wan2_1-T2V-14B_fp8_e4m3fn_scaled.safetensors` lands upstream and is dropped into `models\u002Fdiffusion_models\u002F`, the loader will use it directly instead of runtime-quantizing.\n\n## Processing longer clips (chunked sampler)\n\nUniVidX is trained at 21 frames per inference. For source clips longer than ~1 second, use `examples\u002Fchunked_clip_sampler.py` — it slices your source into overlapping 21-frame windows, runs UniVidX on each, and stitches the per-modality outputs into 4 MP4s with a linear crossfade across the overlap.\n\n```bash\n# Fast iteration \u002F preview: ~4.4 hours per 1 min @ 24 fps clip\npython examples\u002Fchunked_clip_sampler.py \\\n    --input  C:\u002Fpath\u002Fto\u002Fyour_clip.mp4 \\\n    --mode   R2AIN \\\n    --output-dir  C:\u002Fpath\u002Fto\u002Foutput \\\n    --preset FP8_DISTILL_PREVIEW    # default in 0.5.0\n\n# Production finals: ~14 hours per 1 min @ 24 fps clip\npython examples\u002Fchunked_clip_sampler.py \\\n    --input  C:\u002Fpath\u002Fto\u002Fyour_clip.mp4 \\\n    --mode   R2AIN \\\n    --output-dir  C:\u002Fpath\u002Fto\u002Foutput \\\n    --preset FP8\n```\n\nWall-time guide for a 1-minute @ 24 fps clip (1440 frames → 90 chunks):\n\n| Preset | Per-chunk | Full clip | Quality |\n|---|---|---|---|\n| **FP8_DISTILL_PREVIEW** *(0.5.0 default)* | **4.59 min** | **~4.4 hours** | ~22-26 dB PSNR vs BF16; iteration \u002F preview |\n| FP8 | 9.43 min | ~14 hours | Production-quality finals (PSNR ≥ 30 dB) |\n| PRODUCTION (legacy: BF16+sage) | 14.48 min | ~22 hours | Same quality as FP8 baseline, slower |\n\nCaveat: every chunk is sampled independently (the seed is the same across chunks, but the trajectories still diverge from per-chunk numerical drift), so global identity drift between chunks is possible on lighting-varying content. 
The overlap crossfade hides per-pixel seams but not global drift. For clips with consistent lighting throughout, drift is minor; for clips with cuts or lighting changes, expect visible breathing in the per-modality channels at chunk boundaries.\n\nThe `FP8_DISTILL_PREVIEW` preset requires the `lightx2v` LoRA at `models\u002Floras\u002Flightx2v\u002Floras\u002FWan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors`. Download:\n\n```bash\nhf download lightx2v\u002FWan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v \\\n    loras\u002FWan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank64.safetensors \\\n    --local-dir ComfyUI\u002Fmodels\u002Floras\u002Flightx2v\n```\n\n## What stopped helping on Blackwell\n\nThese knobs were useful in 0.2.0 \u002F 0.3.0 but FP8 (0.4.0) is strictly better — leave them off unless you have a specific reason:\n\n- **`prefer_sage_attn=True`** — measured +33% wall on BF16 baseline, +25% on FP8 baseline. SageAttention's INT8 quantized kernels apparently don't win on Wan2.1-14B's attention shapes at this resolution. Earlier docs (0.2.0) advertised \"−18% wall\" — that measurement was on a different config; not reproducible on R2AIN_video as of 0.4.0.\n- **`compile_dit=True`** — measured +24% wall on FP8 baseline. `torch.compile` graph-captures cleanly on `FP8Linear` (good news, no crash) but the speedup it was designed to deliver doesn't materialize on top of FP8's already-resident state. Graph capture itself adds ~90 s overhead on first step.\n- **`dtype=fp8_e4m3fn` \u002F `fp8_e5m2`** — **DEPRECATED, removed in 0.5.0.** The legacy `mmgp.offload.quantize` path hangs during cold-load. Use `dit_weight_mode=fp8_prequantized` instead.\n- **Flash Attention 3** — Hopper-only (H100\u002FH800). 
Doesn't apply to RTX 5090.\n- **Flash Attention 4** — Linux-only on PyPI; module name (`flash_attn.cute`) doesn't match DiffSynth's auto-detect.\n\n## Other tuning knobs (Loader)\n\nThese have non-trivial effects and are worth understanding:\n\n| Knob | Effect | Notes |\n|---|---|---|\n| `vram_buffer_gb` | GB kept free for activations; passed to `model.pipe.enable_vram_management()`. Controls layer-streaming aggressiveness on the BF16 path. | **+65% wall going 4.0 → 12.0** measured at BF16. Lower = more residency = faster; raise only if you hit OOM. 4.0 GB default is near-optimal on 32 GB cards. **Effectively no-op when `dit_weight_mode=fp8_prequantized`** because the FP8 DiT fits fully resident. |\n| `dit_weight_mode` | `auto \u002F bf16_shards \u002F fp8_prequantized \u002F fp8_runtime_experimental`. `auto` (default) preserves 0.3.x behaviour based on the legacy `dtype` widget. | See \"Two recommendations (0.5.0)\" above. |\n\n### SageAttention install (for `prefer_sage_attn=True`)\n\nPyPI ships only sage 1.0.6 (head_dim restricted to {64,96,128}, Hopper\u002FAda-tuned only). For Blackwell + Windows + cp312 + Torch 2.7, use the prebuilt wheel from [woct0rdho\u002FSageAttention](https:\u002F\u002Fgithub.com\u002Fwoct0rdho\u002FSageAttention\u002Freleases):\n\n```bash\npip install \"https:\u002F\u002Fgithub.com\u002Fwoct0rdho\u002FSageAttention\u002Freleases\u002Fdownload\u002Fv2.2.0-windows\u002Fsageattention-2.2.0+cu128torch2.7.1-cp312-cp312-win_amd64.whl\"\n```\n\nMatch the wheel to your stack: `cu128`\u002F`cu130` (CUDA), `torch2.7.1`\u002F`2.8.0` (PyTorch), `cp310`\u002F`cp311`\u002F`cp312`\u002F`cp313` (Python).\n\n> **Cross-plugin gotcha — Stable3DGen SDPA pollution.** ComfyUI-3D-Pack's `Stable3DGen\u002Ftrellis\u002Fbackend_config.py` does `F.scaled_dot_product_attention = sageattn` *globally* at module import when sageattention is importable. 
That hostile swap breaks any other custom node using SDPA with head_dim outside sage's set (UniVidX's VAE has 1-head SDPA where head_dim = channel_count, hits 384). Our `runtime.load_model()` defensively restores `F.scaled_dot_product_attention` from `torch._C._nn.scaled_dot_product_attention` (the C++ impl, immune to Python alias rebinding). If other custom nodes broke after you installed sage, this is why.\n\n### FP8 status\n\nWe wired `dtype=fp8_e4m3fn` \u002F `fp8_e5m2` via `mmgp.offload.quantize(model.pipe.dit, weights=\"qfloat8\", exclude=[\"*lora_*\"])`. **The quantize() pass hung in our cold-load test** (no completion after 22 min, required killing ComfyUI). Likely cause: quanto walks all ~720 Linear layers in Wan2.1-14B + UniVidX's PEFT-attached LoRA pairs, computing per-tensor scales over the 28 GB BF16 DiT through mmgp's read-only mmap — possibly genuinely slow, possibly genuinely deadlocked.\n\nThe widget is shipped but **flagged EXPERIMENTAL in tooltip + this README**. Use at your own risk on this stack today. See [Roadmap](#roadmap) for the planned proper fix.\n\n## Node overview\n\nFive nodes, all under the `UniVidX` category. Custom socket types — `UNIVIDX_MODEL` (purple), `UNIVIDX_TASK` (teal), `UNIVIDX_RESULT` (pink) — keep the graph type-safe; standard `IMAGE` (green) is used everywhere a frame batch flows.\n\n\u003Ctable>\n\u003Ctr>\u003Ctd width=\"380\">\n\n![Loader](assets\u002Fnodes\u002Floader.svg)\n\n\u003C\u002Ftd>\u003Ctd>\n\n**`UniVidXLoader`** — Loads `intrinsic` or `alpha` variant, exposes the perf knobs (`compile_dit`, `prefer_sage_attn`, `dtype`, `vram_buffer_gb`, `dit_weight_mode`). 
Models are cached per `(variant, ckpt, device, dtype, vram_buffer, fp8_qtype, compile_dit, prefer_sage_attn, dit_weight_mode)` so toggling any of them triggers a clean re-load. `vram_buffer_gb` is in the key as of 0.3.0 because it now actually controls VRAM management (was a no-op in 0.1.0–0.2.1). `dit_weight_mode` is in the key as of 0.4.0 — picking `fp8_prequantized` drops DiT steady-state VRAM ~50% (with −13% wall as a bonus); see the perf table above.\n\n\u003C\u002Ftd>\u003C\u002Ftr>\n\u003Ctr>\u003Ctd>\n\n![Task Mode](assets\u002Fnodes\u002Ftask_mode.svg)\n\n\u003C\u002Ftd>\u003Ctd>\n\n**`UniVidXTaskMode`** — Picks one of 30 modes from a dropdown. Outputs `UNIVIDX_TASK` carrying the mode + family. The sampler validates the family against the loaded model variant.\n\n\u003C\u002Ftd>\u003C\u002Ftr>\n\u003Ctr>\u003Ctd>\n\n![Sampler](assets\u002Fnodes\u002Fsampler.svg)\n\n\u003C\u002Ftd>\u003Ctd>\n\n**`UniVidXSampler`** — Runs UniVidX's `pipe()` end-to-end inside a `chdir(vendor\u002FUniVidX)` context. Accepts the model + task + a text prompt + up to 7 optional `IMAGE` inputs (one per modality across both families). Inputs not required by the active mode are silently ignored.\n\n\u003C\u002Ftd>\u003C\u002Ftr>\n\u003Ctr>\u003Ctd>\n\n![Decode Intrinsic](assets\u002Fnodes\u002Fdecode_intrinsic.svg)\n\n\u003C\u002Ftd>\u003Ctd>\n\n**`UniVidXDecodeIntrinsic`** — Splays an intrinsic-family `UNIVIDX_RESULT` into 4 `IMAGE` batches: `rgb \u002F albedo \u002F irradiance \u002F normal`. Modalities that were *conditions* come back as a black placeholder of the right shape, so downstream graphs never break on missing slots.\n\n\u003C\u002Ftd>\u003C\u002Ftr>\n\u003Ctr>\u003Ctd>\n\n![Decode Alpha](assets\u002Fnodes\u002Fdecode_alpha.svg)\n\n\u003C\u002Ftd>\u003Ctd>\n\n**`UniVidXDecodeAlpha`** — Same shape as above but for the alpha family: `composite_rgb \u002F alpha \u002F foreground \u002F background`. 
Raises `ValueError` if you try to feed it an intrinsic-family result (and vice versa).\n\n\u003C\u002Ftd>\u003C\u002Ftr>\n\u003C\u002Ftable>\n\n## Models\n\n| Pack | Where | Size |\n|---|---|---|\n| [Wan-AI\u002FWan2.1-T2V-14B](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.1-T2V-14B) | `ComfyUI\u002Fmodels\u002Fwan21_t2v_14b\u002F` | ~69 GB |\n| [houyuanchen\u002FUniVidX](https:\u002F\u002Fhuggingface.co\u002Fhouyuanchen\u002FUniVidX) | `ComfyUI\u002Fmodels\u002Funividx\u002F` | ~1.6 GB |\n\n`install.py` verifies the vendored UniVidX submodule is at the pinned commit, copies bundled demo workflows into ComfyUI's user workflow directory, and prints a hint about the model files you still need to download. The actual **path bridging** — Windows directory junction (or POSIX symlink) from `vendor\u002FUniVidX\u002Fmodels\u002F` to `ComfyUI\u002Fmodels\u002Fwan21_t2v_14b\u002F`, plus hardlinks for the two LoRA adapters — happens **at runtime** on first model load via `src\u002Fpath_resolver.ensure_symlinks()` (called from `runtime.initialize()`). This lazy approach lets the install step stay quick and avoids touching the filesystem if the user never queues a UniVidX workflow.\n\n## Mode reference\n\nMode names encode `\u003Cconditions>2\u003Ctargets>`. 
`t` on the left = \"text-only\".\n\n**Intrinsic** (variant `intrinsic`): R=RGB, A=Albedo, I=Irradiance, N=Normal — 15 modes total: `t2RAIN`, `R2AIN`, `A2RIN`, `I2RAN`, `N2RAI`, `RA2IN`, `RI2AN`, `RN2AI`, `AI2RN`, `AN2RI`, `IN2RA`, `RAI2N`, `RAN2I`, `RIN2A`, `AIN2R`.\n\n**Alpha** (variant `alpha`): R=Composite RGB, P=Pha (matte), F=Fgr, B=Bgr — 15 modes total: `t2RPFB`, `R2PFB`, `P2RFB`, `F2RPB`, `B2RPF`, `RP2FB`, `RF2PB`, `RB2PF`, `PF2RB`, `PB2RF`, `FB2RP`, `RPF2B`, `RPB2F`, `RFB2P`, `PFB2R`.\n\nFor modes where a modality is a *condition*, the corresponding decoder output is a black tensor of the right shape — downstream nodes still get a valid `IMAGE`.\n\n## Roadmap\n\nFull v0.3 execution plan in [`ROADMAP_v0.3.md`](ROADMAP_v0.3.md). Summary of priority order (corrected after second-pass review):\n\n1. ~~**`vram_buffer_gb` correctness fix**~~ — **shipped in 0.3.0.** Cache key now includes `vram_buffer` so distinct values get distinct cache entries (was the actual bug). The wiring itself (`model.pipe.enable_vram_management(...)`) was already correct in 0.2.1 — the roadmap's \"wrong method target\" diagnosis was disproven by a live runtime probe; see CHANGELOG 0.3.0 \"Diagnosis correction.\" Wrapped in `if\u002Felse` with explicit INFO-on-success + WARNING-on-missing logging so a future regression can't recur silently. Tier-A5 bench measured **+65% wall going `vram_buffer` 4.0 → 12.0** — this is the biggest single perf lever in the system, not the \"deprecated\" knob it was labelled.\n2. **FP8 via pre-quantized Kijai weights** (replaces the hung runtime-quantize path). The current `dtype=fp8_*` knob calls `mmgp.offload.quantize()` AFTER constructing a full BF16 DiT, which both hangs and forfeits the main benefit (skipping the BF16 cold load). The right design is a deeper refactor:\n   1. 
**Split the public knob** into `compute_dtype = {bf16, fp16}` and `dit_weight_mode = {bf16_shards, fp8_prequantized, fp8_runtime_experimental}` so users get clear choices and runtime quantization is hidden behind an explicit experimental flag.\n   2. **Add path resolution** for pre-quantized weights at `ComfyUI\u002Fmodels\u002Fdiffusion_models\u002FWan2_1-T2V-14B_fp8_e4m3fn.safetensors` (Kijai\u002FComfyUI convention).\n   3. **Implement an alternate DiT loader** that bypasses upstream's hardcoded six-shard BF16 loop in [`vendor\u002FUniVidX\u002Fsrc\u002Fpipelines\u002Funivid_intrinsic.py:447`](vendor\u002FUniVidX\u002Fsrc\u002Fpipelines\u002Funivid_intrinsic.py) and `univid_alpha.py:425`. Instantiate `WanModel`, normalize key prefixes, stream-load FP8 safetensors, keep norms\u002Fbias\u002Ftime\u002Ftext\u002Fpatch\u002Fhead in BF16\u002FFP32, and preserve scale tensors when present.\n   4. **Keep UniVidX LoRA adapters in BF16.** PEFT's dynamic `adapter_names=[...]` switching is central to UniVidX's per-modality routing — a generic FP8 `nn.Linear` substitution would break it.\n   5. **Phase 1 (\"memory-safe FP8\"):** FP8 base weights + BF16 LoRA + dequantize-or-scaled-linear base path. Correctness first.\n   6. **Phase 2 (\"fast FP8\"):** adapter-aware `_scaled_mm` for the base projection plus BF16 LoRA residuals. Needs a custom adapter-aware Linear or a real PEFT integration — not just Kijai's no-LoRA fast path.\n   7. **Validate** with the tiny R2AIN\u002FR2PFB workflows first, then benchmark BF16 vs FP8 on cold-load time, peak VRAM, per-step time, and output sanity (per-modality SSIM\u002FPSNR vs the BF16 reference).\n3. **Step-distill LoRA stacking** — try [LightX2V's `Wan21_T2V_14B_lightx2v_cfg_step_distill_lora_rank32.safetensors`](https:\u002F\u002Fhuggingface.co\u002Flightx2v\u002FWan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v) on top of UniVidX's per-modality LoRA. Could close the small quality gap of the PREVIEW 4-step preset. 
Needs PEFT compatibility verification with UniVidX's `add_multiple_loras_to_model` machinery.\n\n## Requirements (detailed)\n\nThe summary table at the top of this README covers the \"can I run this?\" question. The detail below is for users who want to know exactly why each requirement matters.\n\n### Software\n\n| Dependency | Version | Why |\n|---|---|---|\n| **Python** | 3.10+ (tested 3.12.9) | UniVidX uses Python 3.10's `str \\| None` PEP-604 union syntax. The runtime explicitly fails fast on older Pythons. |\n| **PyTorch** | ≥ 2.7 with CUDA 12.8 | Blackwell sm_120 (RTX 5090) requires `cu128`. Older PyTorch builds error with `no kernel image is available for execution on the device.` `torch.float8_e4m3fn` (used by the 0.4.0+ FP8 path) needs PyTorch 2.1+. |\n| **ComfyUI** | 0.20+ | The node-registration API and the `INPUT_TYPES` schema we ship target this version's frontend. Older versions may render widgets incorrectly. |\n| **DiffSynth-Studio** | ≥ 2.0 | UniVidX wraps DiffSynth's `WanVideoPipeline`. The pipeline-level VRAM management and tokenizer config use DiffSynth 2.x APIs. Auto-installed via `requirements.txt`. |\n| **mmgp** | latest | Provides the read-only memory-mapped safetensors loader that prevents the Windows paging-file commit blowup when six 9.84 GB DiT shards are mapped concurrently. |\n| **PEFT** | ≥ 0.10 | UniVidX uses `peft.inject_adapter_in_model` for the four per-modality LoRA adapters. The FP8 substitution and step-distill merge both descend through PEFT wrappers. |\n| **safetensors** | ≥ 0.4 | All model weights ship as safetensors; the FP8 and step-distill loaders read them via `safetensors.safe_open`. |\n\n### GPU\n\nThe binding constraint is **VRAM**. The Wan2.1-T2V-14B base model is ~28 GB in BF16 and ~14 GB in FP8. 
On top of that you need ~4-6 GB for activations \u002F KV cache \u002F VAE decode \u002F text encoder during sampling.\n\n| Path | DiT footprint | Total VRAM during sampling | Min card |\n|---|---|---|---|\n| BF16 baseline | ~28 GB | ~32-34 GB | 32 GB+ |\n| **FP8 prequantized (0.5.0 default)** | **~14 GB** | **~18-20 GB** | **24 GB+** |\n| FP8 + step-distill (fast preview) | ~14 GB | ~18-20 GB | 24 GB+ |\n\n**CUDA compute capability**: 8.0 or higher (Ampere generation onward). UniVidX's attention path uses Flash-Attention-2's pattern which requires sm_80+. SageAttention 1.x (optional) supports head_dim ∈ {64, 96, 128} and is Hopper\u002FAda tuned; SageAttention 2.x from [woct0rdho's prebuilt wheels](https:\u002F\u002Fgithub.com\u002Fwoct0rdho\u002FSageAttention\u002Freleases) covers Blackwell. As of 0.5.0, neither sage nor compile_dit help on the FP8 path — leave them off (see the full perf matrix above).\n\n### System RAM\n\nPeak host RAM during a cold-load:\n- BF16 path: ~28 GB peak (loading the six DiT shards before VRAM management kicks in)\n- FP8 path: ~28 GB peak (same cold-load; FP8 quantization runs after on the loaded BF16 weights)\n\n64 GB system RAM gives comfortable headroom for ComfyUI + browser + other applications during the load. 32 GB is the practical minimum; the OS swaps to page file under stress on smaller systems.\n\n### Disk\n\n| Pack | Size | Required? 
|\n|---|---|---|\n| [Wan-AI\u002FWan2.1-T2V-14B](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.1-T2V-14B) | ~69 GB | Yes — the base text-to-video DiT |\n| [houyuanchen\u002FUniVidX](https:\u002F\u002Fhuggingface.co\u002Fhouyuanchen\u002FUniVidX) | ~1.6 GB | Yes — the per-modality LoRA adapters (intrinsic + alpha) |\n| [LightX2V step-distill](https:\u002F\u002Fhuggingface.co\u002Flightx2v\u002FWan2.1-T2V-14B-StepDistill-CfgDistill-Lightx2v) (rank-64 LoRA only) | ~600 MB | Optional — enables the 0.5.0 fast-preview mode |\n| [Kijai `_scaled` Wan2.1 FP8](https:\u002F\u002Fhuggingface.co\u002FKijai\u002FWanVideo_comfy) (if\u002Fwhen it lands) | ~14 GB | Optional — auto-engages the file-based FP8 loader instead of runtime quantize |\n\nPlan for **85 GB** of disk including the optional packs + working space for output PNGs and MP4s (a 1-min clip at 480×640×24fps × 4 modalities is ~500 MB of intermediate PNGs).\n\n### Operating system\n\n| OS | Status |\n|---|---|\n| Windows 11 (validated) | ✅ All three Windows-specific patches (JSON path escaping, mmgp readonly mmap, junctions instead of symlinks) ship enabled |\n| Windows 10 | 🟢 Should work, untested |\n| Linux | 🟢 Should work; the Windows-specific patches no-op gracefully |\n| macOS | 🔴 Not viable; no CUDA support for the Wan2.1 + FP8 path |\n\n## Windows-specific patches\n\nThree patches are applied automatically on Windows and are no-ops on POSIX. They are fully documented inline in `src\u002Fruntime.py` and `src\u002Fpath_resolver.py`:\n\n1. **JSON path escaping** — `json.dumps([t5, vae])` so Windows backslashes don't break UniVidX's `json.loads`.\n2. **Read-only mmap for safetensors** — patch `mmgp.safetensors2.torch_load_file` to use `writable_tensors=False` (avoids `[WinError 1455] paging file too small` on six 9.84 GB DiT shards).\n3. 
**Junctions + hardlinks instead of symlinks** — `mklink \u002FJ` + `os.link()` (no Admin \u002F Developer Mode required).\n\n## Troubleshooting\n\n- **`MissingModelFile`** — re-run the `hf download` commands.\n- **`R2AIN` rgb output is black** — expected: RGB was the input, so the decoder emits a black placeholder of the right shape.\n- **Text-only alpha matte (`t2RPFB`) is white** — known model limit, not a bug. Use `R2PFB_video_api.json` instead.\n- **Per-step time > 1 min on 32 GB+ GPU** — VRAM management didn't activate. To confirm, check the GPU: a temperature \u003C60°C at 99% utilization indicates a memory-bound run.\n- **`CUDA error: no kernel image is available`** — the installed torch is too old for Blackwell; upgrade to `torch>=2.7+cu128`.\n- **Custom nodes broke after installing sageattention** — see the Stable3DGen pollution note above. Our defensive un-pollute fixes UniVidX; other affected nodes need the same fix or `pip uninstall sageattention`.\n\nFor workflow-specific gotchas see [`examples\u002FREADME.md`](examples\u002FREADME.md) and [`examples\u002Ftest_matrix\u002FREADME.md`](examples\u002Ftest_matrix\u002FREADME.md).\n\n## Out of scope (Strategy A boundary)\n\nThese would require porting UniVidX's Cross-Modal Self-Attention onto a different DiT class (a multi-week project):\n\n- Stacking community Wan2.1\u002F2.2 LoRAs on UniVidX's DiT\n- Injecting ControlNet \u002F IP-Adapter inside UniVidX's denoising loop\n- Replacing UniVidX's sampler with a ComfyUI KSampler\n- Native `MODEL`-type integration (interop with kijai's WanVideoWrapper)\n\nStrategy A's value is at the **I\u002FO boundary** — composing UniVidX outputs with arbitrary downstream ComfyUI nodes. 
Validated end-to-end in `examples\u002Ftest_matrix\u002F` (10\u002F10 passing).\n\n## Credits\n\n- [UniVidX](https:\u002F\u002Fgithub.com\u002Fhouyuanchen111\u002FUniVidX) — vendored at a pinned commit\n- [Wan-AI \u002F Wan2.1-T2V-14B](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.1-T2V-14B) — base text-to-video DiT\n- [DiffSynth-Studio](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FDiffSynth-Studio) — pipeline runtime\n- [mmgp](https:\u002F\u002Fpypi.org\u002Fproject\u002Fmmgp\u002F) — paged memory loading\n- [woct0rdho\u002FSageAttention](https:\u002F\u002Fgithub.com\u002Fwoct0rdho\u002FSageAttention) — Blackwell sage 2.x wheels\n- [ComfyUI](https:\u002F\u002Fgithub.com\u002Fcomfyanonymous\u002FComfyUI) — host runtime\n\n## License\n\n[GPL-3.0](LICENSE). Vendored upstream deps keep their own licenses (UniVidX, Wan-AI\u002FWan2.1-T2V-14B).\n","UniVidX_ComfyUI is a custom node project for ComfyUI that performs intrinsic and alpha decomposition of video. Its core functionality decomposes a video clip into RGB\u002FAlbedo\u002FIrradiance\u002FNormal (intrinsic) or Composite RGB\u002FAlpha matte\u002FForeground\u002FBackground (alpha), with 30 task modes supported. The project is built on a 14-billion-parameter video diffusion model and developed with PyTorch, and it has demanding hardware requirements, such as at least 24 GB of GPU VRAM and CUDA compute capability 8.0 or higher. It suits professional settings that need high-quality video processing, particularly video editing and visual effects production.","2026-06-11 03:58:50","CREATED_QUERY"]