[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74233":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":9,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},74233,"ComfyUI-Qwen-TTS","flybirdxx\u002FComfyUI-Qwen-TTS","flybirdxx","A Simple Implementation of Qwen3-TTS's ComfyUI",null,"Python",1583,168,9,53,0,8,31,88,24,19.68,false,"main",true,[],"2026-06-12 02:03:24","# ComfyUI-Qwen-TTS\n\nEnglish | [中文版](README_CN.md)\n\n> **⚠️ CRITICAL: transformers Version Requirement**\n>\n> Qwen3-TTS is **incompatible** with `transformers >= 5.0`. Versions 5.0+ introduce breaking API changes that will cause model loading failures and runtime errors. 
Please pin your version:\n> ```bash\n> pip install transformers==4.57.3\n> ```\n> If you have already installed a newer version, **downgrade immediately** before using this plugin.\n\n![Nodes Screenshot](example\u002Fexample.png)\n\nComfyUI custom nodes for speech synthesis, voice cloning, and voice design, based on the open-source **Qwen3-TTS** project by the Alibaba Qwen team.\n\n## 📋 Changelog\n\n- **2026-04-12 (v1.0.7)**: Removed `QwenTTSConfigNode` due to voice inconsistency; fixed MPS precision bug & CustomVoice channel mismatch; code cleanup ([update.md](doc\u002Fupdate.md))\n- **2026-02-04**: Added `extra_model_paths.yaml` support ([update.md](doc\u002Fupdate.md))\n- **2026-01-29**: Feature Update: Support for loading custom fine-tuned models & speakers ([update.md](doc\u002Fupdate.md))\n  - *Note: Fine-tuning is currently experimental; zero-shot cloning is recommended for best results.*\n- **2026-01-27**: UI Optimization: Sleek LoadSpeaker UI; fixed PyTorch 2.6+ compatibility ([update.md](doc\u002Fupdate.md))\n- **2026-01-26**: Functional Update: New voice persistence system (SaveVoice \u002F LoadSpeaker) ([update.md](doc\u002Fupdate.md))\n- **2026-01-24**: Added attention mechanism selection & model memory management features ([update.md](doc\u002Fupdate.md))\n- **2026-01-24**: Added generation parameters (top_p, top_k, temperature, repetition_penalty) to all TTS nodes ([update.md](doc\u002Fupdate.md))\n- **2026-01-23**: Dependency compatibility & Mac (MPS) support, New nodes: VoiceClonePromptNode, DialogueInferenceNode ([update.md](doc\u002Fupdate.md))\n\n## Online Workflows\n\n- **Qwen3-TTS Multi-Role Multi-Round Dialogue Generation Workflow**:\n  - [workflow](https:\u002F\u002Fwww.runninghub.ai\u002Fpost\u002F2014703508829769729\u002F?inviteCode=rh-v1041)\n- **Qwen3-TTS 3-in-1 (Clone, Design, Custom) Workflow**:\n  - [workflow](https:\u002F\u002Fwww.runninghub.ai\u002Fpost\u002F2014962110224142337\u002F?inviteCode=rh-v1041)\n\n## Key Features\n\n- 🎵 
**Speech Synthesis**: High-quality text-to-speech conversion.\n- 🎭 **Voice Cloning**: Zero-shot voice cloning from short reference audio.\n- 🎨 **Voice Design**: Create custom voice characteristics based on natural language descriptions.\n- 🚀 **Efficient Inference**: Supports both 12Hz and 25Hz speech tokenizer architectures.\n- 🎯 **Multilingual**: Native support for 10 languages (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, and Italian).\n- ⚡ **Integrated Loading**: No separate loader nodes required; model loading is managed on-demand with global caching.\n- ⏱️ **Ultra-Low Latency**: Supports high-fidelity speech reconstruction with low-latency streaming.\n- 🧠 **Attention Mechanism Selection**: Choose from multiple attention implementations (sage_attn, flash_attn, sdpa, eager) with auto-detection and graceful fallback.\n- 💾 **Memory Management**: Optional model unloading after generation to free GPU memory for users with limited VRAM.\n\n## Nodes List\n\n### 1. Qwen3-TTS Voice Design (`VoiceDesignNode`)\nGenerate unique voices based on text descriptions.\n- **Inputs**:\n  - `text`: Target text to synthesize.\n  - `instruct`: Description of the voice (e.g., \"A gentle female voice with a high pitch\").\n  - `model_choice`: Currently locked to **1.7B** for VoiceDesign features.\n  - `attention`: Attention mechanism (auto, sage_attn, flash_attn, sdpa, eager).\n  - `unload_model_after_generate`: Unload model from memory after generation to free GPU memory.\n- **Capabilities**: Best for creating \"imaginary\" voices or specific character archetypes.\n\n### 2. 
Qwen3-TTS Voice Clone (`VoiceCloneNode`)\nClone a voice from a reference audio clip.\n- **Inputs**:\n  - `ref_audio`: A short (5-15s) audio clip to clone.\n  - `ref_text`: Text spoken in the `ref_audio` (helps improve quality).\n  - `target_text`: The new text you want the cloned voice to say.\n  - `model_choice`: Choose between **0.6B** (fast) or **1.7B** (high quality).\n  - `attention`: Attention mechanism (auto, sage_attn, flash_attn, sdpa, eager).\n  - `unload_model_after_generate`: Unload model from memory after generation to free GPU memory.\n\n### 3. Qwen3-TTS Custom Voice (`CustomVoiceNode`)\nStandard TTS using preset speakers.\n- **Inputs**:\n  - `text`: Target text.\n  - `speaker`: Selection from preset voices (Aiden, Eric, Serena, etc.).\n  - `instruct`: Optional style instructions.\n  - `attention`: Attention mechanism (auto, sage_attn, flash_attn, sdpa, eager).\n  - `unload_model_after_generate`: Unload model from memory after generation to free GPU memory.\n\n### 4. Qwen3-TTS Role Bank (`RoleBankNode`) [New]\nCollect and manage multiple voice prompts for dialogue generation.\n- **Inputs**:\n  - Up to 8 roles, each with:\n    - `role_name_N`: Name of the role (e.g., \"Alice\", \"Bob\", \"Narrator\")\n    - `prompt_N`: Voice clone prompt from `VoiceClonePromptNode`\n- **Capabilities**: Create named voice registry for use in `DialogueInferenceNode`. Supports up to 8 different voices per bank.\n\n### 5. 
Qwen3-TTS Voice Clone Prompt (`VoiceClonePromptNode`) [New]\nExtract and reuse voice features from reference audio.\n- **Inputs**:\n  - `ref_audio`: A short (5-15s) audio clip to extract features from.\n  - `ref_text`: Text spoken in the `ref_audio` (highly recommended for better quality).\n  - `model_choice`: Choose between **0.6B** (fast) or **1.7B** (high quality).\n  - `attention`: Attention mechanism (auto, sage_attn, flash_attn, sdpa, eager).\n  - `unload_model_after_generate`: Unload model from memory after generation to free GPU memory.\n- **Capabilities**: Extract a \"prompt item\" once and use it multiple times across different `VoiceCloneNode` instances for faster and more consistent generation.\n\n### 6. Qwen3-TTS Multi-role Dialogue (`DialogueInferenceNode`) [New]\nSynthesize complex dialogues with multiple speakers.\n- **Inputs**:\n  - `script`: Dialogue script in format \"RoleName: Text\".\n  - `role_bank`: Role bank from `RoleBankNode` containing voice prompts.\n  - `model_choice`: Choose between **0.6B** (fast) or **1.7B** (high quality).\n  - `attention`: Attention mechanism (auto, sage_attn, flash_attn, sdpa, eager).\n  - `unload_model_after_generate`: Unload model from memory after generation to free GPU memory.\n  - `pause_seconds`: Silence duration between sentences.\n  - `merge_outputs`: Merge all dialogue segments into a single long audio.\n  - `batch_size`: Number of lines to process in parallel (larger = faster but more VRAM).\n- **Capabilities**: Handles multi-role speech synthesis in a single node, ideal for audiobook narration or roleplay scenarios.\n\n### 7. Qwen3-TTS Load Speaker (`LoadSpeakerNode`) [New]\nLoad saved voice features and metadata with zero configuration.\n- **Capabilities**: Enables a \"Select & Play\" experience by auto-loading pre-computed features and metadata.\n\n### 8. 
Qwen3-TTS Save Voice (`SaveVoiceNode`) [New]\nPersist extracted voice features and metadata to disk for future use.\n- **Capabilities**: Build a permanent voice library for reuse via `LoadSpeakerNode`.\n\n## Attention Mechanisms\n\nAll nodes support multiple attention implementations with automatic detection and graceful fallback:\n\n| Mechanism | Description | Speed | Installation |\n|-----------|-------------|-------|--------------|\n| **sage_attn** | SAGE attention implementation | ⚡⚡⚡ Fastest | `pip install sage_attn` |\n| **flash_attn** | Flash Attention 2 | ⚡⚡ Fast | `pip install flash_attn` |\n| **sdpa** | Scaled Dot Product Attention (PyTorch built-in) | ⚡ Medium | Built-in (no installation) |\n| **eager** | Standard attention (fallback) | 🐢 Slowest | Built-in (no installation) |\n| **auto** | Automatically selects best available option | Varies | N\u002FA |\n\n### Auto-Detection Priority\n\nWhen `attention: \"auto\"` is selected, the system checks in this order:\n1. **sage_attn** → If installed, use SAGE attention (fastest)\n2. **flash_attn** → If installed, use Flash Attention 2\n3. **sdpa** → Always available (PyTorch built-in)\n4. 
**eager** → Always available (fallback, slowest)\n\nThe selected mechanism is logged to the console for transparency.\n\n### Graceful Fallback\n\nIf you select an attention mechanism that's not available:\n- Falls back to `sdpa` (if available)\n- Falls back to `eager` (as last resort)\n- Logs the fallback decision with a warning message\n\n### Model Caching\n\n- Models are cached with attention-specific keys\n- Changing attention mechanism automatically clears cache and reloads model\n- Same model with different attention mechanisms coexists in cache\n\n## Memory Management\n\n### Model Unloading After Generation\n\nThe `unload_model_after_generate` toggle is available on all nodes:\n- **Enabled**: Clears model cache, GPU memory, and runs garbage collection after generation\n- **Disabled**: Model remains in cache for faster subsequent generations (default)\n\n**When to use:**\n- ✅ Enable if you have limited VRAM (\u003C 8GB)\n- ✅ Enable if you need to run multiple different models sequentially\n- ✅ Enable if you're done with generation and want to free memory\n- ❌ Disable if you're generating multiple clips with the same model (faster)\n\n**Console Output:**\n```\n🗑️ [Qwen3-TTS] Unloading 1 cached model(s)...\n✅ [Qwen3-TTS] Model cache and GPU memory cleared\n```\n\n\n\n## Installation\n\nEnsure you have the required dependencies:\n\n```bash\npip install torch torchaudio transformers librosa accelerate\n```\n\n### Model Directory Structure\n\nComfyUI-Qwen-TTS automatically searches for models in the following priority:\n\n```text\nComfyUI\u002F\n├── models\u002F\n│   └── qwen-tts\u002F\n│       ├── Qwen\u002FQwen3-TTS-12Hz-1.7B-Base\u002F\n│       ├── Qwen\u002FQwen3-TTS-12Hz-0.6B-Base\u002F\n│       ├── Qwen\u002FQwen3-TTS-12Hz-1.7B-VoiceDesign\u002F\n│       ├── Qwen\u002FQwen3-TTS-Tokenizer-12Hz\u002F\n│       └── voices\u002F (Saved presets .wav\u002F.qvp)\n```\n\n**Note**: You can also use `extra_model_paths.yaml` to define a custom model 
path:\n```yaml\nqwen-tts: D:\\MyModels\\Qwen\n```\n\n## Tips for Best Results\n\n### Audio Quality\n- **Cloning**: Use clean, noise-free reference audio (5-15 seconds).\n- **Reference Text**: Providing text spoken in reference audio significantly improves quality.\n- **Language**: Select the correct language for best pronunciation and prosody.\n\n### Performance & Memory\n- **VRAM**: Use `bf16` precision to save significant memory with minimal quality loss.\n- **Attention**: Use `attention: \"auto\"` for automatic selection of fastest available mechanism.\n- **Model Unloading**: Enable `unload_model_after_generate` if you have limited VRAM (\u003C 8GB) or need to run multiple different models.\n- **Local Models**: Pre-download weights to `models\u002Fqwen-tts\u002F` to prioritize local loading and avoid HuggingFace timeouts.\n\n### Attention Mechanisms\n- **Best Performance**: Install `sage_attn` or `flash_attn` for 2-3x speedup over sdpa.\n- **Compatibility**: Use `sdpa` (default) for maximum compatibility - no installation required.\n- **Low VRAM**: Use `eager` with smaller models (0.6B) if other mechanisms cause OOM errors.\n\n### Dialogue Generation\n- **Batch Size**: Increase `batch_size` for faster generation (more VRAM usage).\n- **Pauses**: Adjust `pause_seconds` to control timing between dialogue segments.\n- **Merge**: Enable `merge_outputs` for continuous dialogue; disable for separate clips.\n\n## Acknowledgments\n\n- [Qwen3-TTS](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-TTS): Official open-source repository by Alibaba Qwen team.\n\n## License\n\n- This project is licensed under the **Apache License 2.0**.\n- Model weights are subject to the [Qwen3-TTS License Agreement](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-TTS#License).\n\n## Author\n\n- **Bilibili**: [Space](https:\u002F\u002Fspace.bilibili.com\u002F5594117?spm_id_from=333.1007.0.0)\n- **YouTube**: 
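[Channel](https:\u002F\u002Fwww.youtube.com\u002Fchannel\u002FUCx5L-wKf93YNbcP_55vDCeg)\n","ComfyUI-Qwen-TTS is a plugin for speech synthesis, voice cloning, and voice design, based on the open-source Qwen3-TTS project by the Alibaba Qwen team. Its core features include high-quality text-to-speech, zero-shot voice cloning, and creating custom voice characteristics from natural-language descriptions. The project is written in Python, supports 10 languages, and offers efficient inference and ultra-low-latency streaming. Users can also choose among different attention mechanisms to optimize model performance. In addition, it has built-in model loading management: models are loaded on demand and cached globally, with no separate loader nodes required. The project suits applications that need to generate diverse voices quickly, such as multi-role dialogue systems and personalized voice assistants. Note that, to ensure compatibility, the specified version of the transformers library (4.57.3) must be installed.",2,"2026-06-11 03:49:37","high_star"]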