[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79989":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":25,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},79989,"DEMON","daydreamlive\u002FDEMON","daydreamlive","DEMON: Diffusion Engine for Musical Orchestrated Noise","https:\u002F\u002Fdaydreamlive.github.io\u002FDEMON\u002F",null,"Python",234,26,3,22,0,15,64,125,45,4.29,"Other",false,"main",true,[],"2026-06-12 02:03:56","# DEMON\n\n**Diffusion Engine for Musical Orchestrated Noise**\n\nDEMON is a streaming diffusion engine for ACE-Step v1.5. Think StreamDiffusion, for audio: a ring buffer holds several in-flight generations at different denoising stages, advanced together per tick. After warmup, finished latents stream out at a steady rate of `depth\u002Fsteps` generations per tick. End-to-end TensorRT keeps the tick tight; per-frame modulation knobs accept scalars or `[T]` curves and are hot-mutable mid-stream; ring buffer depth itself is hot-resizable. Streaming output is bit-identical to batch.\n\n> Don't have a GPU, or just want to play first? Try the hosted instance at **[music.daydream.live](https:\u002F\u002Fmusic.daydream.live)**.\n\n## What DEMON is\n\nThe engine lives in [`acestep\u002F`](acestep\u002F). One process loads the model once and exposes two things:\n\n1. 
A programmatic **Session API** ([`acestep\u002Fengine\u002Fsession.py`](acestep\u002Fengine\u002Fsession.py)) that wraps the streaming pipeline, the typed node graph, and the TRT runtime in a small set of methods (`prepare_source`, `encode_text`, `generate`, `decode`, `stream`, `apply_lora`).\n2. A **typed node graph** ([`acestep\u002Fnodes\u002F`](acestep\u002Fnodes\u002F)) of 32 composable operations (latent \u002F audio \u002F conditioning \u002F curve \u002F mask \u002F solver \u002F config \u002F DCW \u002F channel guidance) wired through `NodeDefinition` \u002F `NodePort` \u002F `NodeParam`, with kwarg-validation at registration.\n\nAnything on top, a CLI, a notebook, a VST, the bundled web demo, an MCP tool, or your own protocol, drives the same primitives. The library does not know or care which one you use.\n\n## What the engine does\n\n- **Streaming diffusion for ACE-Step v1.5.** `StreamPipeline` ([`acestep\u002Fengine\u002Fstream.py`](acestep\u002Fengine\u002Fstream.py)) maintains a ring buffer of in-flight generations. Each tick runs a batched decoder forward pass (two when CFG is active: positive + negative) that advances every active slot by one denoising step. The decoder dispatches to TensorRT or PyTorch through the same code path. Depth is hot-resizable mid-stream (`pipeline.set_depth(n)`); active slots drain naturally.\n- **Heterogeneous slots.** Every in-flight slot carries its own `SlotRequest`: its own seed, its own `denoise` strength (with its own cached timestep schedule), its own source latent, its own per-frame curves, its own conditioning (one or more `SlotCondition`s with per-frame `temporal_weight` and per-condition `step_range`), its own CFG mode, its own x0 target, and its own latent-noise mask. 
A single ring buffer can mix a `denoise=1.0` regeneration, a `denoise=0.5` style transfer, and an RCFG-`self` request simultaneously and batch them in one forward pass.\n- **Scalar-or-curve per-frame modulation.** Velocity scale, SDE re-noise, ODE noise injection, guidance scale, x0 target strength, x0 target curve, initial noise mix, APG momentum, CFG rescale, DCW scalers, and condition temporal weights all accept either a Python scalar or a `[T]` tensor, canonicalized through `normalize_curve` at the boundary so the kernels see one shape.\n- **Channel guidance.** A `[1, T, 64]` per-channel gain applied to `xt` before each forward pass. Lives in its own surface (set via `pipeline.set_channel_gain_tensor(...)`) because its per-channel-and-per-frame shape doesn't fit the `[T]`-curve pattern.\n- **Shared mutable curves.** Layered on top of the heterogeneous slots: `pipeline.set_shared_curve(name, value)` overrides one of the curve-shaped fields (`velocity_scale`, `sde_denoise_curve`, `ode_noise_curve`, `apg_momentum`, `x0_target_strength`, `cfg_rescale_curve`) for the next tick on every in-flight slot at once. The override takes effect immediately rather than waiting for new submissions to make their way through the pipeline. Pass `None` to revert that name to per-slot behavior.\n- **Multi-condition compositing.** Within a single slot, the decoder runs once per active condition and velocities are blended per frame by `temporal_weight`; conditions are gated in and out of the schedule by `step_range`. `ConditioningBlend` (scalar alpha) and `ConditioningCombine` (per-frame temporal weights) are the typed entry points.\n- **Three CFG modes.** Standard CFG (uncond forward every step), RCFG-`initialize` (one uncond forward per slot, cached for the rest of the schedule), and RCFG-`self` (zero uncond forwards: the slot's initial noise stands in as the virtual uncond velocity). 
All three layer APG momentum and an optional per-frame CFG rescale curve on top.\n- **Latent-noise-mask inpainting.** Two-sided x0 blending matching ComfyUI semantics: pre-blend on `xt` (so the decoder sees correctly-noised context in preserved regions) and post-blend on the predicted `x0`. Supports a per-step strength function for progressive masking.\n- **DCW post-step correction.** Wavelet-domain sampler-side correction from Yu et al. CVPR 2026, ported from upstream ACE-Step v0.1.7. Four modes (low \u002F high \u002F double \u002F pix), with an optional advanced surface (`mult_blend`, `mag_phase`, `soft_thresh`) that at zero is byte-identical to the upstream reference. Hot-updatable via `pipeline.set_dcw(...)`.\n- **Hot LoRA.** Register a directory once, then enable \u002F set_strength \u002F remove without rebuilding anything. The LoRA manager ([`acestep\u002Fengine\u002Flora.py`](acestep\u002Fengine\u002Flora.py)) handles the lifecycle and delta math; when the decoder is in TRT mode, applies route through a refitter against the live engine.\n- **TRT acceleration end-to-end.** The DiT decoder, VAE encode, and VAE decode each pick `tensorrt | compile | eager` independently. The TRT decoder is refit-enabled, so LoRA swaps do not rebuild the engine. The VAE decode has a windowed variant (`vae_decode_fp16_3to30s`, range 3 to 30 s) that is built once and reused across all durations; the caller specifies the window start via `t_start`.\n- **Bit-identical streaming vs. batch.** The streaming and one-shot paths compose the same pure step primitives from [`acestep\u002Fengine\u002Fode_steps.py`](acestep\u002Fengine\u002Fode_steps.py); they produce the same output.\n\n## Tested on\n\nNVIDIA RTX 3090, 4090, and 5090. The headline numbers below are from a 5090.\n\n## Tuning: ring buffer depth, song duration, VAE windowing\n\nThree knobs trade off against each other. 
Picking the right point on the curve is what makes DEMON run well on a given card.\n\n**Ring buffer depth (`pipeline_depth`, 1 to 8).** The pipeline keeps `depth` in-flight generations at different denoise stages, advanced together each tick. After warmup, throughput is `depth\u002Fsteps` finished generations per tick.\n\n- Higher depth: parameter sweeps glide more smoothly (more slots in different denoise phases, so a curve change blends through finer intermediate states), at the cost of more per-tick batch compute and higher VRAM.\n- Lower depth: knob changes feel snappier and more discrete (fewer slots between a parameter override and the next finished latent), with lower per-tick VRAM and compute.\n\n**Song duration.** TRT engines are profile-specific. Each engine reserves workspace sized to its profile, so a 240 s engine costs more VRAM than a 60 s engine even when the workload is only 60 seconds. Per-engine peak workspace, each measured in isolation on a 5090:\n\n| Component       | 60s engine | 240s engine |          Δ |\n|-----------------|-----------:|------------:|-----------:|\n| Decoder (refit) |  13,511 MB |   15,911 MB |  +2,400 MB |\n| VAE decode      |  10,547 MB |   10,814 MB |    +267 MB |\n| VAE encode      |   4,178 MB |   10,614 MB |  +6,436 MB |\n\nThese are per-engine peaks captured in separate subprocesses, not a live-runtime sum. At inference time the decoder peak dominates and the VAE workspaces do not peak alongside it, which is why the live demo fits on a 24 GB card. The comparison is what matters: switching three engines from 240 s to 60 s frees about 9 GB. Source: [`scripts\u002Fbenchmarks\u002Fvram_60s_vs_240s_results.md`](scripts\u002Fbenchmarks\u002Fvram_60s_vs_240s_results.md). Longer engines also pay more per-tick latency since the diffusion sequence length scales with duration. Build only the durations you need.\n\n**VAE windowing.** Optional. 
When `vae_window > 0`, decode happens in overlapped time windows (range 3 to 30 s) instead of full-length, controlled by a `t_start` parameter on each decode call. This is what unlocks low-latency streaming updates: only the requested window is decoded per call rather than the full latent. Set to 0 to fall back to full-length decode.\n\n## Performance\n\nRTX 5090, ACE-Step v1.5 turbo (2B), all-TRT, `depth=4`, `steps=8`, `vae_window=3s`, 60 s source.\n\n| Metric | Value |\n|---|---|\n| Tick (decoder forward, depth=4) | ~43 ms |\n| Decode (windowed VAE, 3 s) | 4.5 ms |\n| Throughput | 11.3 generations\u002Fsecond |\n| Parameter convergence | ~248 ms |\n| Per-frame control resolution | 25 Hz (40 ms latent steps) |\n| Streaming vs. batch quality | bit-identical output |\n\n## Acceleration backends\n\nThe DiT decoder and the VAE pick a backend independently. Three values each: `tensorrt`, `compile`, `eager`.\n\n| Component         | Backend     | Notes |\n|-------------------|-------------|-------|\n| Decoder           | `tensorrt`  | Fastest. Requires a built decoder engine for the target duration and checkpoint. Refit-enabled engines support LoRA swaps. |\n| Decoder           | `compile`   | `torch.compile`. Long warmup, no engine to build, good fallback. |\n| Decoder           | `eager`     | Plain PyTorch. Useful for debugging. |\n| VAE encode\u002Fdecode | `tensorrt`  | Fastest. The windowed-decode engine (`vae_decode_fp16_3to30s`) is built once and reused across all durations. |\n| VAE encode\u002Fdecode | `compile`   | `torch.compile`. |\n| VAE encode\u002Fdecode | `eager`     | Plain PyTorch. |\n\nFrom the bundled web demo, pass `--accel {tensorrt|compile|eager}` to set both at once, or `--decoder-accel` \u002F `--vae-accel` to override one component at a time:\n\n```bash\n# All-TRT (recommended).\nuv run python -u -m demos.realtime_motion_graph_web.run -- --accel tensorrt\n\n# TRT decoder, eager VAE (e.g. 
for debugging the decode path).\nuv run python -u -m demos.realtime_motion_graph_web.run -- \\\n    --accel tensorrt --vae-accel eager\n```\n\n**Recommended baseline: TRT windowed VAE decoder at minimum.** It is the cheapest TRT engine to build, it is checkpoint- and duration-agnostic, and it unlocks the low-latency streaming path. Pair it with `--decoder-accel compile` if you do not want to build the decoder engine yet.\n\n## Requirements\n\n- Python 3.11\n- NVIDIA GPU. Tested on RTX 3090, 4090, and 5090.\n- ACE-Step v1.5 checkpoints in `checkpoints\u002F` (auto-downloaded on first run)\n- Node.js 20+ (only if you run the bundled web demo; first run installs `web\u002Fnode_modules` automatically)\n\n## Setup\n\n```bash\nuv sync\n```\n\nThat is it for Python. Audio fixtures pull on first use from the [`daydreamlive\u002Fdemon-fixtures`](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fdaydreamlive\u002Fdemon-fixtures) Hugging Face dataset and cache under `~\u002F.cache\u002Fhuggingface\u002F`. See [`acestep\u002Ffixtures.py`](acestep\u002Ffixtures.py) for the canonical set.\n\nLoRAs are not auto-downloaded. Drop a `.safetensors` file into `$ACESTEP_MODELS_DIR\u002Floras\u002F` (defaults to `~\u002F.daydream-scope\u002Fmodels\u002Fdemon\u002Floras\u002F`) and it will appear in any consumer that scans the library on next refresh. See [`acestep\u002Fpaths.py`](acestep\u002Fpaths.py).\n\n## Programmatic use: the Session API\n\nThe Session API is the engine's primary surface. Load the model once, then iterate.\n\n```python\nfrom acestep.engine.session import Session\nfrom acestep.constants import TASK_INSTRUCTIONS\n\nsession = Session(\n    decoder_backend=\"compile\",  # or \"tensorrt\", \"eager\"\n    vae_backend=\"compile\",\n    vae_window=3.0,             # 0 = full decode; >0 enables windowed decode\n)\n\n# Load audio, encode it, extract semantic context (cache across iterations).\nsource = session.prepare_source(audio)\n\n# Encode text once. 
Reused across generations.\ncond = session.encode_text(\n    tags=\"deathstep death\",\n    instruction=TASK_INSTRUCTIONS[\"cover\"],\n    refer_latent=source.latent,\n    bpm=136, duration=60.0, key=\"G# minor\",\n)\n\n# Generate, decode, save. Cheap after warmup (~310 ms per iteration).\nfor seed in [1528, 9999, 42]:\n    latent = session.generate(\n        conditioning=cond,\n        context_latent=source.context_latent,\n        source_latent=source.latent,\n        seed=seed,\n    )\n    save_audio(session.decode(latent), f\"out_{seed}.wav\")\n```\n\nStreaming is the same primitives wrapped in a `StreamHandle`:\n\n```python\nhandle = session.stream(source=source, conditioning=cond, pipeline_depth=4)\nfor _ in range(N_TICKS):\n    # Mutate handle.conditioning \u002F handle.context_latent between ticks\n    # to swap prompts or blend semantic hints live.\n    latent = handle.tick()\n    if latent is not None:\n        audio = handle.decode(latent, t_start=window_start_s)\n\n# Per-frame curve overrides bypass the ring buffer (1-tick latency):\nhandle.pipeline.set_shared_curve(\"velocity_scale\", 1.2)\nhandle.pipeline.set_shared_curve(\"sde_denoise_curve\", torch.tensor([...]))\n```\n\nQuick-start scripts:\n\n- [`examples\u002Fsession_demo.py`](examples\u002Fsession_demo.py): persistent session, iterate covers with different seeds.\n- [`examples\u002Frealtime_cover.py`](examples\u002Frealtime_cover.py): a full real-time cover workflow with dual prompts, dual LoRAs, timbre \u002F hint references, temporal masking, and engine-exclusive per-frame curves.\n- [`examples\u002Fcovers\u002F`](examples\u002Fcovers\u002F): one standalone script per feature.\n\n| Script | Feature |\n|---|---|\n| `cover_basic.py` | Standard cover pipeline (encode, condition, generate, decode) |\n| `prompt_blend.py` | Two prompts blended with a temporal curve |\n| `sde_denoise_curve.py` | Per-frame SDE re-noise modulation |\n| `velocity_scaling.py` | Per-frame transformation rate control |\n| 
`lora_generation.py` | LoRA-conditioned generation |\n| `x0_target_blend.py` | Two-pass morphing toward a target latent |\n| `conditioning_average.py` | Fuse two conditionings |\n| `guidance_curve.py` | Per-frame CFG scale |\n| `latent_noise_mask.py` | Latent-space inpainting |\n| `initial_noise_curve.py` | Per-frame noise \u002F source init mix |\n| `ode_noise_injection.py` | Stochastic ODE step |\n| `cover_semantic_blend.py` | Blend semantic hints from two sources |\n| `x0_target_from_reference.py` | Pre-generate a target latent, morph toward it |\n\n## Building TensorRT engines\n\nDEMON targets TensorRT 10.16.x. Plans are version- and GPU-architecture-specific by default, so rebuild after changing TensorRT, CUDA, driver, or the GPU used for inference.\n\n```bash\n# Full matrix (decoder refit + VAE for 60s \u002F 120s \u002F 240s).\nuv run python -m acestep.engine.trt.build --all\n\n# 60s only (recommended starting point).\nuv run python -m acestep.engine.trt.build --all --duration 60\n\n# Just the windowed VAE decoder (smallest, fastest to build, biggest payoff).\nuv run python -m acestep.engine.trt.build --vae-only --duration 60\n\n# Preview what would be built.\nuv run python -m acestep.engine.trt.build --all --dry-run\n\n# Force rebuild even if engines already exist.\nuv run python -m acestep.engine.trt.build --all --force-rebuild\n\n# Force ONNX re-export as well.\nuv run python -m acestep.engine.trt.build --all --duration 60 --force-rebuild --force-onnx\n```\n\nONNX intermediates are duration-agnostic and auto-reused across builds; the model is only loaded when an export is actually needed.\n\n```\ntrt_engines\u002F\n  _onnx\u002F                          # shared, auto-reused across durations\n    vae_encode\u002Fvae_encode.onnx\n    vae_decode\u002Fvae_decode.onnx\n    decoder\u002Fdecoder.onnx          # + external data shards\n    decoder_refit\u002Fdecoder_refit.onnx\n  decoder_mixed_refit_b8_60s\u002F\n    decoder_mixed_refit_b8_60s.engine\n  
vae_decode_fp16_3to30s\u002F\n    vae_decode_fp16_3to30s.engine\n  ...\n```\n\nPass engine paths to `Session` when using the API directly:\n\n```python\nsession = Session(\n    decoder_backend=\"tensorrt\",\n    vae_backend=\"tensorrt\",\n    vae_window=3.0,\n    trt_engines={\n        \"decoder\": \"trt_engines\u002Fdecoder_mixed_refit_b8_60s\u002Fdecoder_mixed_refit_b8_60s.engine\",\n        \"vae_encode\": \"trt_engines\u002Fvae_encode_fp16_60s\u002Fvae_encode_fp16_60s.engine\",\n        \"vae_decode\": \"trt_engines\u002Fvae_decode_fp16_3to30s\u002Fvae_decode_fp16_3to30s.engine\",\n    },\n)\n```\n\n## Demo applications\n\nThe engine is meant to be driven. The repository ships a flagship reference application plus a handful of focused entry points.\n\n### realtime_motion_graph_web (the headline demo)\n\nA Python backend plus a Next.js front-end in a single launcher. Feed it audio and a prompt, then twist knobs, draw automation curves, blend prompts, hot-swap timbre \u002F structure references, and toggle LoRAs while the model generates and plays back continuously. Most of the engine surface above is exposed as a live control.\n\n```bash\nuv run python -u -m demos.realtime_motion_graph_web.run\n# then open http:\u002F\u002Flocalhost:6660\n```\n\nThe launcher starts the backend on `:1318` and the Next.js dev server on `:6660`. Forward backend flags after `--`:\n\n```bash\nuv run python -u -m demos.realtime_motion_graph_web.run -- --accel tensorrt\nuv run python -u -m demos.realtime_motion_graph_web.run -- --checkpoint xl\n```\n\nHighlights:\n\n- **Prompt A ↔ B blending.** Two text fields plus a blend slider. One encoder pass per submission; the slider lerps per tick.\n- **LoRA library.** Browse genre-grouped LoRAs, click to enable, drag faders for strength. 
Optional auto-prepend of trigger words to keep prompts honest.\n- **Timbre and structure references.** Independent fixtures, uploaded clips, or short mic recordings bias instrument character and section \u002F rhythm \u002F dynamics. Mix freely.\n- **Source-audio swap.** Library, upload, or record a 60 s snippet from your mic.\n- **Schedule curves.** Draw automation over the timeline for denoise, hint strength, feedback, shift, and any LoRA strength. Smooth \u002F linear \u002F step interpolation.\n- **MIDI learn.** Right-click any slider, wiggle a physical control, done. Mappings persist per option-profile.\n- **Audio-reactive video.** WebGL2 shader pipeline with saturation-driven color parallax and bloom-on-kick.\n- **Recording.** Capture audio (Opus\u002FWebM, AAC\u002FM4A fallback) or the live graph canvas as video with audio muxed in.\n- **Config import \u002F export.** Snapshot full live session state (knobs, prompts, LoRAs, curves) to JSON.\n- **Onboard MCP server.** Every user-facing action exposed as an MCP tool. Drive the demo from Claude Code or any MCP client.\n\nAll defaults (knob positions, MIDI map seed, walk-window behavior, idle reset, LUFS matcher, audio-reactive shader params, XL-checkpoint overrides) live in [`demos\u002Frealtime_motion_graph_web\u002Fweb\u002Fpublic\u002Fconfig.json`](demos\u002Frealtime_motion_graph_web\u002Fweb\u002Fpublic\u002Fconfig.json). 
Edit, refresh, done.\n\nSee [`demos\u002Frealtime_motion_graph_web\u002FREADME.md`](demos\u002Frealtime_motion_graph_web\u002FREADME.md) for backend args, wire protocol, onboard MCP setup, and the front-end architecture.\n\n### Other entry points\n\n- [`examples\u002Fsession_demo.py`](examples\u002Fsession_demo.py): one-shot generation, persistent session.\n- [`examples\u002Frealtime_cover.py`](examples\u002Frealtime_cover.py): real-time cover workflow exercising dual prompts, dual LoRAs, timbre \u002F hint references, temporal masking, and engine-exclusive per-frame curves.\n- [`examples\u002Fcovers\u002F`](examples\u002Fcovers\u002F): standalone per-feature scripts (see table above).\n- [`demos\u002Ftest_stream_cover_graph.py`](demos\u002Ftest_stream_cover_graph.py): a streaming cover graph driven from Python.\n\n## Tests\n\n```bash\nuv run pytest tests\u002F -v\n```\n\n## Research\n\nThe DEMON paper and two companion technical notes are forthcoming:\n\n- DEMON paper (main)\n- FastOobleckDecoder (VAE distillation)\n- Latent Channel Semantics (64-channel VAE characterization)\n\nLinks land here as artifacts are released.\n\n## Acknowledgments\n\nDEMON is built on top of [ACE-Step](https:\u002F\u002Fgithub.com\u002Face-step\u002FACE-Step). The base diffusion model, VAE, text encoder, and 5 Hz LM are all ACE-Step's work; without them, none of this exists. Huge thanks to the ACE-Step team for releasing the v1.5 weights and code under MIT.\n\nIf you use DEMON in your work, please also cite ACE-Step.\n\n## Authors\n\nDEMON was originally created by Ryan Fosdick ([@RyanOnTheInside](https:\u002F\u002Fryanontheinside.com)). Maintained by [Daydream Live](https:\u002F\u002Fdaydream.live) and contributors.\n","DEMON is a streaming diffusion engine for musical orchestrated noise. Its core capabilities include real-time audio generation with ACE-Step v1.5, a ring buffer that manages and advances multiple generations at different denoising stages, per-frame modulation parameters that can be adjusted and hot-updated dynamically, and end-to-end TensorRT optimization for efficient processing. The engine also provides a rich graph of composable operation nodes that allows flexible customization of the processing flow. DEMON suits applications that need high-quality, low-latency audio generation, such as online music creation platforms, virtual instrument plugin development, or any sound design work with strict real-time requirements.",2,"2026-06-11 03:58:48","CREATED_QUERY"]