[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-76083":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},76083,"scenema-audio","ScenemaAI\u002Fscenema-audio","ScenemaAI","Zero-shot expressive voice cloning and speech generation. Generate anything from short clips to full-length audiobooks with realistic emotional delivery, pacing, and breath control. Clone any voice from a 10-second reference and perform emotions the original speaker never recorded.","https:\u002F\u002Fscenema.ai\u002Faudio",null,"Python",514,74,4,9,0,5,12,86,15,9.63,"MIT License",false,"main",true,[],"2026-06-12 02:03:39","# Scenema Audio\n\n**Zero-shot expressive voice cloning and speech generation.**\n\n**[Visit scenema.ai\u002Faudio to hear all demos and try it out.](https:\u002F\u002Fscenema.ai\u002Faudio)**\n\n[![Demo Video](https:\u002F\u002Fcdn.scenema.ai\u002Fpublic\u002Fcontent\u002Fscenema-audio-waveform-thumbnail.png)](https:\u002F\u002Fyoutu.be\u002FVnEQ_ImOaAc)\n\nEvery existing text-to-speech system converts words into sound, but none of them perform. Speech that merely pronounces words correctly is functionally useless for filmmaking, audiobooks, or any context where the emotional delivery carries as much meaning as the words themselves. 
Scenema Audio generates speech with intention, pacing, breath control, and emotional arcs that shift within a single generation, all from a text prompt that describes not just what to say but how to say it.\n\nBuilt on an audio diffusion transformer extracted from [LTX 2.3](https:\u002F\u002Fgithub.com\u002FLightricks\u002FLTX-2)'s 22B parameter audiovisual model, it learned how people actually sound in real scenes: angry, laughing, whispering, crying, exhausted, terrified.\n\n## Quick Start\n\n### Docker (Recommended)\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FScenemaAI\u002Fscenema-audio.git\ncd scenema-audio\n\n# Set your HuggingFace token (Gemma 3 access required)\nexport HF_TOKEN=your_huggingface_token\n\n# Build and run (models are downloaded on first start)\ndocker compose up\n```\n\nRuns on any NVIDIA GPU with 16 GB+ VRAM. The default configuration uses INT8 audio transformer + NF4 Gemma quantization, with automatic model offloading on smaller cards. First startup downloads ~38 GB of model checkpoints and caches them in a Docker volume. Subsequent starts are fast.\n\n### Generate Audio\n\n```bash\n# Using the included script\npython generate.py output.wav\n\n# Or with curl\ncurl -X POST http:\u002F\u002Flocalhost:8000\u002Fgenerate \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"prompt\": \"\u003Cspeak voice=\\\"A warm, clear male voice with a slight British accent. Measured, thoughtful pacing.\\\" gender=\\\"male\\\">The old lighthouse had stood on the cliff for over a century, its beam cutting through the fog like a blade of light.\u003C\u002Fspeak>\",\n    \"seed\": 42\n  }' \\\n  --output output.wav\n```\n\n### Voice Design (Preview a Voice)\n\n```bash\ncurl -X POST http:\u002F\u002Flocalhost:8000\u002Fgenerate \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"prompt\": \"\u003Cspeak voice=\\\"A young woman with a smoky, low register voice. 
Intimate, confessional tone.\\\" gender=\\\"female\\\">The city never really sleeps. It just closes its eyes and pretends for a while.\u003C\u002Fspeak>\",\n    \"mode\": \"voice_design\"\n  }' \\\n  --output voice_preview.wav\n```\n\n### Zero-Shot Voice Cloning\n\nProvide 10-20 seconds of reference audio with some emotional variability. The model generates expressive speech from the prompt, then transfers the reference voice's identity onto the performance. References that contain a range of pitch and intonation produce significantly better identity transfer than flat, monotone clips.\n\n```bash\ncurl -X POST http:\u002F\u002Flocalhost:8000\u002Fgenerate \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"prompt\": \"\u003Cspeak voice=\\\"Gravelly male voice, fast talking, rough.\\\" gender=\\\"male\\\">\u003Caction>He completely loses it, shouting\u003C\u002Faction>What are you waiting for?!\u003C\u002Fspeak>\",\n    \"reference_voice_url\": \"https:\u002F\u002Fexample.com\u002Fcalm-reference.wav\",\n    \"seed\": 42\n  }' \\\n  --output cloned_angry.wav\n```\n\nAny voice can perform any emotion, even if that voice has never been recorded in that emotional state. The reference provides identity. The performance comes from the prompt.\n\n### Web UI (Gradio)\n\nA built-in web interface for experimenting with voice descriptions, action tags, and all generation parameters without writing code.\n\n![Scenema Audio Gradio UI](https:\u002F\u002Fcdn.scenema.ai\u002Fpublic\u002Fcontent\u002Fscenema-audio\u002Fgradio-demo.webp)\n\n```bash\n# Enable the web UI\nENABLE_GRADIO=1 HF_TOKEN=your_token docker compose up\n```\n\nOpen [http:\u002F\u002Flocalhost:8000\u002Fui](http:\u002F\u002Flocalhost:8000\u002Fui) in your browser. 
The UI provides four tabs:\n\n- **Generate**: Build prompts from individual fields (voice description, speech text, scene, action tags) with preset examples\n- **Voice Design**: Quick 15-second voice previews for iterating on voice descriptions\n- **Voice Cloning**: Upload reference audio and generate with voice identity transfer\n- **Advanced**: Write raw `\u003Cspeak>` XML directly for full control\n\nFor remote GPU servers, forward the port via SSH:\n\n```bash\nssh -i \u002Fpath\u002Fto\u002Fkey -L 8000:localhost:8000 user@your_gpu_server\n# Then open http:\u002F\u002Flocalhost:8000\u002Fui\u002F locally\n```\n\nOr use a public share link (no port forwarding needed). Gradio opens an outbound tunnel and gives you a `*.gradio.live` URL that anyone can access:\n\n```bash\nENABLE_GRADIO=1 GRADIO_SHARE=1 HF_TOKEN=your_token docker compose up\n# Look for the gradio.live URL in the logs\n```\n\n## Prompt Format\n\n```xml\n\u003Cspeak voice=\"VOICE_DESCRIPTION\" gender=\"male|female\"\n       scene=\"OPTIONAL_SCENE\" language=\"OPTIONAL_LANG_CODE\"\n       shot=\"closeup|wide|scene\">\n  \u003Caction>Performance direction.\u003C\u002Faction>\n  Speech text here.\n  \u003Csound>Environmental audio event.\u003C\u002Fsound>\n  More speech.\n\u003C\u002Fspeak>\n```\n\n### Attributes\n\n| Attribute | Required | Default | Description |\n|-----------|----------|---------|-------------|\n| `voice` | Yes | | Detailed voice description. Drives vocal quality, emotion, accent, age, timbre, delivery style. The more specific and theatrical, the better. |\n| `gender` | Yes | | `\"male\"` or `\"female\"`. Controls pronoun assignment in the compiled prompt sent to the diffusion model. |\n| `scene` | No | | Environmental context. Conditions the ambient audio environment around the speech (rain, office hum, crowd noise). |\n| `language` | No | `\"en\"` | Language code. The model supports major world languages with native-sounding output. 
|\n| `shot` | No | `\"closeup\"` | Controls SFX prominence. `\"closeup\"`: speech-focused, SFX minimal. `\"wide\"`: environment + speech. `\"scene\"`: maximum environmental audio, SFX reinforced. |\n\n### Child Elements\n\n| Element | Description |\n|---------|-------------|\n| Text nodes | The actual speech content. Write natural prose. |\n| `\u003Caction>` | Performance directions that shape HOW the speech is delivered. Not spoken aloud. Stage directions for the diffusion model: emotional shifts, physical delivery, pacing cues, breath control. |\n| `\u003Csound>` | Environmental audio events generated alongside the speech. Thunder cracks, doors slamming, rain starting. Only effective in `wide` or `scene` shot modes. |\n\n### Voice Description\n\nThe `voice` attribute is the primary control for the entire output. Be specific and theatrical:\n\n```xml\n\u003C!-- Weak -->\n\u003Cspeak voice=\"A man speaking\" gender=\"male\">...\u003C\u002Fspeak>\n\n\u003C!-- Strong -->\n\u003Cspeak voice=\"Male, mid 60s. Deep baritone with gravel. Slight Southern American inflection.\nWorn but warm. Nostalgic, firelight cadence. The voice of someone who has seen too much\nand chosen kindness anyway.\" gender=\"male\" scene=\"Fireside, night, crickets\">...\u003C\u002Fspeak>\n```\n\n### Action Tags\n\nAction tags are the primary tool for controlling emotional performance. Place them between speech segments to direct delivery shifts:\n\n```xml\n\u003Cspeak voice=\"Middle-aged man, warm but weathered.\" gender=\"male\">\n  \u003Caction>Calm, almost casual. Staring at his hands.\u003C\u002Faction>\n  I used to think I had all the time in the world.\n  \u003Caction>Voice tightens. Swallows. Fighting to stay composed.\u003C\u002Faction>\n  Then one Tuesday morning, the doctor said three words that changed everything.\n  \u003Caction>Long pause. Deep breath. When he speaks again, his voice is raw but steady.\u003C\u002Faction>\n  And I realized... 
I hadn't called my son in six months.\n  \u003Caction>Voice breaks on the last word. Clears throat. Forces a half-laugh.\u003C\u002Faction>\n  Funny how that works, isn't it?\n\u003C\u002Fspeak>\n```\n\nDescribe what the speaker is DOING and FEELING, not what the audio should sound like. Combine physical and emotional cues for richer performance.\n\n## API Reference\n\n### POST \u002Fgenerate\n\n#### Request Body\n\n| Field | Type | Default | Description                                                                                                                                                                                                                                                                                                                                                                             |\n|-------|------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|\n| `prompt` | string | **required** | `\u003Cspeak>` XML string. See Prompt Format above.                                                                                                                                                                                                                                                                                                                                          |\n| `mode` | string | `\"generate\"` | `\"generate\"` for full pipeline with chunking. `\"voice_design\"` for a single 15-second voice sample (no chunking, useful for previewing a voice description).                                                                                                                                                      
                                                                      |\n| `reference_voice_url` | string | `null` | URL to reference audio (WAV or MP3) for zero-shot voice cloning. 10-20 seconds of clean speech with some emotional variability is ideal. The reference provides identity; emotional performance comes from the prompt.                                                                                                                                                                                             |\n| `background_sfx` | bool | `false` | Keep generated environmental sound effects in the output. When `false`, non-vocal audio is removed. Set to `true` when using `shot=\"scene\"` or `shot=\"wide\"` with `\u003Csound>` tags.                                                                                                                                                                                                       |\n| `validate` | bool | `true` | Enable Whisper speech validation. Each generated chunk is transcribed by faster-whisper and compared against expected text. If word match ratio falls below the threshold, the chunk is regenerated with extended duration and a new seed (up to 3 retries), keeping the best result. Adds \u003C1s per chunk on GPU. Disable for faster generation when prompt reliability is not critical. |\n| `seed` | int | `-1` | Generation seed. `-1` for random. Fixed seeds produce deterministic output for the same prompt and configuration.                                                                                                                                                                                                                                                                       |\n| `pace` | float | `1.5` | Duration allocation multiplier. Higher values give the model more time, resulting in slower, more deliberate speech. Lower values produce faster speech. 
The default 1.5x accounts for LTX's naturally slower speaking pace compared to real-time speech.                                                                                                                               |\n| `min_match_ratio` | float | `0.90` | Whisper validation threshold. Minimum word match ratio (0.0 to 1.0) between generated audio transcription and expected text. Only used when `validate` is `true`. Lower values accept more pronunciation variance. Lower threshold recommended for languages with accents.                                                                                                              |\n| `skip_vc` | bool | `false` | Skip voice conversion (SeedVC) post-processing entirely. When `true`, no voice identity transfer or cross-chunk voice consistency normalization is applied. Useful for single-chunk generations where the voice description alone is sufficient.                                                                                                                                        |\n| `vc_steps` | int | `25` | SeedVC diffusion steps. More steps produce higher-quality voice identity transfer at the cost of processing time. Range: 10-50.                                                                                                                                                                                                                                                         |\n| `vc_cfg_rate` | float | `0.5` | SeedVC classifier-free guidance rate. Controls how strongly the target voice identity is applied. Higher values produce stronger identity transfer but may reduce naturalness. Range: 0.0-1.0.                                                                                                                                                                                          
|\n\n#### Response\n\nReturns JSON with base64-encoded WAV audio:\n\n```json\n{\n  \"status\": \"succeeded\",\n  \"audio\": \"\u003Cbase64-encoded WAV>\",\n  \"content_type\": \"audio\u002Fwav\",\n  \"metadata\": {\n    \"duration_s\": 12.4,\n    \"sample_rate\": 48000,\n    \"processing_ms\": 8200,\n    \"seed\": 42,\n    \"mode\": \"generate\",\n    \"has_reference_voice\": false\n  }\n}\n```\n\nOn error:\n\n```json\n{\n  \"status\": \"failed\",\n  \"error\": \"Description of what went wrong\"\n}\n```\n\n## Capabilities\n\n### Emotional Acting\n\nEmotional state shifts within a single generation. Action tags function as stage directions at specific points in the script.\n\n```xml\n\u003Cspeak voice=\"A man on the edge. Explosive rage. Italian-American inflection.\"\n       gender=\"male\" scene=\"A dimly lit office, late at night\">\n  \u003Caction>He stands up slowly, voice dangerously low\u003C\u002Faction>\n  You come into my house, you eat my food, and then you got the nerve\n  to tell me how to run my business.\n  \u003Caction>Voice rising, finger pointing\u003C\u002Faction>\n  I built this thing from nothing while you were sitting on your ass.\n\u003C\u002Fspeak>\n```\n\n### Child Voices\n\n```xml\n\u003Cspeak voice=\"A six-year-old girl, bright and excited, speaking fast\nwith breathless enthusiasm. Slight lisp on S sounds.\"\ngender=\"female\">\n  Mommy look! There is a rainbow and it goes all the way across the whole sky!\n\u003C\u002Fspeak>\n```\n\n### Scene-Aware Audio (Voice + Environment)\n\nSet `shot=\"scene\"` and `background_sfx: true` to generate speech with environmental audio in the same diffusion pass.\n\n```xml\n\u003Cspeak voice=\"Male, mid 40s. Weathered. Urgent, projecting over wind.\"\n       gender=\"male\" scene=\"Open dock in a thunderstorm, heavy rain\"\n       shot=\"scene\">\n  \u003Csound>Heavy rain and wind howling\u003C\u002Fsound>\n  \u003Caction>He shouts over the storm\u003C\u002Faction>\n  Get the lines! 
She is pulling loose!\n  \u003Csound>Thunder cracks overhead\u003C\u002Fsound>\n  Move! I said move!\n\u003C\u002Fspeak>\n```\n\n### Multilingual\n\nThe model supports major world languages with native fluency. Set the `language` attribute and write the voice description to match.\n\n```xml\n\u003Cspeak voice=\"Female, mid 70s. Soft alto. Native French speaker, Parisian accent.\nWarm like wool blankets. Unhurried.\" gender=\"female\"\nscene=\"Cozy bedroom, lamplight\" language=\"fr\">\n  \u003Caction>Elle s'assied au bord du lit\u003C\u002Faction>\n  Alors, mon petit. Tu veux que je te raconte l'histoire du renard\n  qui a trompé la lune?\n\u003C\u002Fspeak>\n```\n\n### Long-Form Narration\n\nText is automatically split at sentence boundaries using [Kokoro](https:\u002F\u002Fgithub.com\u002Fhexgrad\u002Fkokoro) phoneme-level duration estimation. Voice identity is maintained across chunks via A2V latent conditioning.\n\n```xml\n\u003Cspeak voice=\"An elderly storyteller with a weathered knowing voice.\nDeep baritone, slow deliberate pacing.\"\ngender=\"male\">\n  Many years later, as he faced the firing squad, Colonel Aureliano Buendia\n  was to remember that distant afternoon when his father took him to discover ice.\n  At that time Macondo was a village of twenty adobe houses, built on the bank\n  of a river of clear water that ran along a bed of polished stones, which were\n  white and enormous, like prehistoric eggs.\n\u003C\u002Fspeak>\n```\n\n## Hardware Requirements\n\n### Minimum: 16 GB VRAM (RTX 4060 Ti 16GB, RTX A4000)\n\nINT8 audio transformer + NF4 Gemma quantization. Models are automatically offloaded between GPU and CPU RAM between pipeline stages (encode, diffuse, decode, voice convert). Requires 32 GB system RAM. Default configuration via `docker compose up`.\n\n### Recommended: 24 GB VRAM (RTX 4090, RTX A5000)\n\nSame INT8 + NF4 config with all models resident on GPU simultaneously. 
No offloading overhead, fastest generation.\n\n### Full Precision: 48 GB VRAM (A6000 Ada, A40, L40S)\n\nbf16 audio transformer + bf16 Gemma, all models resident on GPU. Best quality. Set environment variables:\n\n```\nAUDIO_CKPT=\u002Fapp\u002Fmodels\u002Fscenema-audio-transformer.safetensors\nGEMMA_QUANTIZE=\n```\n\n### VRAM Configurations\n\n| VRAM | Audio Model | Gemma | Behavior | Notes |\n|------|------------|-------|----------|-------|\n| 16 GB | INT8 (4.9 GB) | NF4 (~8 GB) | Auto-offload per stage | **Default config** |\n| 24 GB | INT8 (4.9 GB) | NF4 (~8 GB) | All models resident | Fastest with quantization |\n| 48 GB | bf16 (9.8 GB) | bf16 (24 GB) | All models resident | Best quality |\n\nVRAM strategy is auto-detected. The engine measures available VRAM at startup and decides whether to offload models between stages or keep everything resident.\n\n## Performance\n\nBenchmarked on NVIDIA RTX 4090, 100-word passage (~55 seconds of audio, 4 chunks):\n\n| Configuration | Total Time | Real-Time Factor |\n|--------------|-----------|-----------------|\n| bf16 + bf16 (CPU streaming) | 83s | 0.66x |\n| INT8 + bf16 (CPU streaming) | 66s | 0.83x |\n| INT8 + NF4 (all GPU) | 35s | 1.57x |\n| INT8 + NF4 + SageAttention 2 | 35s | 1.57x |\n\n## Pipeline Architecture\n\n```\nXML prompt (voice description + scene + stage directions + text)\n  |\n  v\n[Text Splitting] -----------> Sentence boundaries via Kokoro, ~15s max per segment\n  |\n  v\n[Gemma 3 12B Encode] -------> Text conditioning (per segment)\n  |\n  v\n[8-Step Diffusion] ---------> Audio latent generation\n  |                            Voice continuity via A2V latent conditioning between segments\n  v\n[Audio Decode] --------------> Waveform\n  |\n  v\n[MelBandRoFormer] ----------> Vocal separation (strips SFX unless background_sfx=true)\n  |\n  v\n[SeedVC] -------------------> Voice identity transfer (when reference_voice_url provided\n  |                            or multi-chunk for cross-chunk 
consistency)\n  v\nOutput WAV (48kHz stereo)\n```\n\n### Key Design Decisions\n\n**Kokoro for duration estimation.** Kokoro TTS (82M params, CPU) provides phoneme-level duration estimates. The chunker splits text at sentence boundaries when accumulated Kokoro estimates exceed 15 seconds (with a configurable `pace` multiplier for LTX's naturally slower speaking pace). No word counting.\n\n**15-second chunk cap.** The model was trained on 20-second clips, but quality degrades (repetition, pronunciation failure) beyond ~15 seconds. The 15s cap ensures consistent quality.\n\n**Voice continuity across segments.** The tail of each segment's audio is encoded and used as a voice reference for the next segment. This maintains consistent voice identity across arbitrarily long outputs without requiring a separate voice embedding model.\n\n**Zero-shot voice cloning.** A2V latent conditioning gets about 60% of the way to matching a reference voice. SeedVC post-processing brings it to full identity transfer. No training, no enrollment, no voice database.\n\n**Emotion and identity are independent controls.** The voice description drives the emotional performance. The reference audio drives the voice identity. 
For maximum emotional range with a cloned voice, use a strong character archetype in the voice description and let the reference audio handle identity.\n\n**INT8 quantization.** Per-channel INT8 reduces the transformer from 9.8 GB to 4.9 GB with no measurable quality difference, enabling generation on consumer GPUs.\n\n## Model Checkpoints\n\nHosted on HuggingFace: [ScenemaAI\u002Fscenema-audio](https:\u002F\u002Fhuggingface.co\u002FScenemaAI\u002Fscenema-audio)\n\n| File | Size | Description |\n|------|------|-------------|\n| `scenema-audio-transformer.safetensors` | 9.8 GB | Audio diffusion transformer (bf16) |\n| `scenema-audio-transformer-int8.safetensors` | 4.9 GB | Audio diffusion transformer (INT8, identical quality) |\n| `scenema-audio-pipeline.safetensors` | 6.7 GB | Audio VAE decoder + vocoder + text projection |\n| `scenema-audio-vae-encoder.safetensors` | 42.7 MB | Audio VAE encoder for reference voice encoding |\n\n## Building from Source\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FScenemaAI\u002Fscenema-audio.git\ncd scenema-audio\n\nexport HF_TOKEN=your_huggingface_token\ndocker compose build\ndocker compose up\n```\n\n### Environment Variables\n\nSet in `docker-compose.yml` or pass via `docker run -e`:\n\n| Variable | Default | Description |\n|----------|---------|-------------|\n| `HF_TOKEN` | **required** | HuggingFace token with Gemma 3 access |\n| `AUDIO_CKPT` | `\u002Fapp\u002Fmodels\u002Fscenema-audio-transformer-int8.safetensors` | Path to audio transformer checkpoint |\n| `PIPELINE_CKPT` | `\u002Fapp\u002Fmodels\u002Fscenema-audio-pipeline.safetensors` | Path to pipeline checkpoint |\n| `GEMMA_ROOT` | `\u002Fapp\u002Fmodels\u002Fgemma-3-12b-it` | Path to Gemma 3 12B model directory |\n| `GEMMA_QUANTIZE` | `nf4` | Gemma quantization. 
`nf4` for 16-24 GB cards, empty for bf16 on 48 GB+ |\n| `PORT` | `8000` | HTTP service port |\n| `MODEL_DIR` | `\u002Fapp\u002Fmodels` | Base directory for model downloads and cache |\n| `ENABLE_GRADIO` | (empty) | Set to `1` to enable the Gradio web UI at `\u002Fui` |\n| `GRADIO_SHARE` | (empty) | Set to `1` to create a public share link (no port forwarding needed) |\n\n## Limitations\n\n- **Pronunciation**: The model occasionally garbles complex multi-syllable words and proper nouns. Spelling out difficult words phonetically can help.\n- **15-second generation window**: Each audio segment is limited to ~15 seconds. Longer text is automatically split, but very long single sentences may be divided at suboptimal points.\n- **Emotional range with voice cloning**: Voice cloning optimizes for identity accuracy, which can reduce the extremes of emotional delivery. For maximum expressiveness, use a strong emotional archetype in the voice description and provide a reference clip with natural emotional variability (10-20 seconds, not monotone).\n- **Multilingual pronunciation**: When a character switches languages mid-speech, the model may apply the primary language's phonetics to the foreign words. Use separate requests per language.\n- **Generation speed**: Each 15-second segment takes 3-8 seconds depending on hardware. Audio is returned as a complete file, not streamed.\n- **Gemma 3 12B is gated**: Requires accepting Google's terms of use and a HuggingFace token with access.\n- **Reference audio quality sensitivity**: Low-quality references (compressed MP3, background noise) significantly degrade output. 
Use clean reference audio or rely on the voice description alone with SeedVC as a post-processing step.\n\n## Acknowledgments\n\n- [LTX-2](https:\u002F\u002Fgithub.com\u002FLightricks\u002FLTX-2) by Lightricks for the base audiovisual model\n- [Gemma 3](https:\u002F\u002Fai.google.dev\u002Fgemma) by Google for the text encoder\n- [SeedVC](https:\u002F\u002Fgithub.com\u002FPlachtaa\u002Fseed-vc) by Plachta for voice refinement\n- [Kokoro](https:\u002F\u002Fgithub.com\u002Fhexgrad\u002Fkokoro) by hexgrad for duration estimation\n- [SageAttention](https:\u002F\u002Fgithub.com\u002Fthu-ml\u002FSageAttention) for attention acceleration\n\n## License\n\n**Model weights:** [LTX-2 Community License Agreement](https:\u002F\u002Fgithub.com\u002FLightricks\u002FLTX-2\u002Fblob\u002Fmain\u002FLICENSE). The audio diffusion transformer is derived from LTX 2.3's audiovisual model, and its weights are subject to the same license terms.\n\n**Code:** [MIT License](LICENSE). The inference server, chunking pipeline, and all supporting code are MIT licensed.\n\n[Gemma 3 12B](https:\u002F\u002Fai.google.dev\u002Fgemma\u002Fterms) (text encoder) is a gated model requiring acceptance of Google's terms of use.\n","Scenema Audio is a zero-shot expressive voice cloning and speech generation tool that produces speech with realistic emotion, pacing, and breath control from a text prompt. The project is built on an audio diffusion transformer extracted from LTX 2.3's 22B-parameter audiovisual model, supports high-quality voice cloning from a 10-second reference clip, and can perform emotions the original speaker never recorded. It suits filmmaking, audiobooks, and other settings that need rich emotional expression. The project is written in Python, offers quick deployment with Docker, requires an NVIDIA GPU (16 GB+ VRAM), and uses HuggingFace for model management.",2,"2026-06-11 03:54:25","CREATED_QUERY"]