[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83000":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},83000,"JoyAI-Echo","jd-opensource\u002FJoyAI-Echo","jd-opensource","JoyAI-Echo: Pushing the Frontier of Long Audio-Visual Generation","https:\u002F\u002Fecho-team-joy-future-academy-jd.github.io\u002FEcho-LongVideo-Page\u002F",null,"Python",1418,125,11,5,0,78,1048,1189,538,19.3,"Other",false,"main",[],"2026-06-12 02:04:30","\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fimage.png\" alt=\"JoyAI-Echo generated video gallery\" width=\"100%\">\n\u003C\u002Fp>\n\n\u003Cdiv align=\"center\">\n\n\u003Ch1>JoyAI-Echo\u003C\u002Fh1>\n\n\u003Cp>\u003Cstrong>🎬 Pushing the Frontier of Long Video Generation\u003C\u002Fstrong>\u003C\u002Fp>\n\n\u003Cp>Standalone, inference-only release for \u003Cstrong>minute-level multi-shot audio-video generation\u003C\u002Fstrong> with a distilled DMD generator, paired cross-modal memory, and story-level consistency.\u003C\u002Fp>\n\n\u003Cp>\n  \u003Ca href=\"https:\u002F\u002Fwww.researchgate.net\u002Fpublication\u002F405770309_JoyAI-Echo_Pushing_the_Frontier_of_Long_Audio-Visual_Generation\">\u003Cb>📄 Paper\u003C\u002Fb>\u003C\u002Fa> |\n  \u003Ca href=\"https:\u002F\u002Fecho-team-joy-future-academy-jd.github.io\u002FEcho-LongVideo-Page\u002F\">\u003Cb>🌐 Project Page\u003C\u002Fb>\u003C\u002Fa> |\n  \u003Ca 
href=\"#quickstart\">\u003Cb>🚀 Quickstart\u003C\u002Fb>\u003C\u002Fa> |\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fjdopensource\u002FJoyAI-Echo\">\u003Cb>🤗 Hugging Face\u003C\u002Fb>\u003C\u002Fa> |\n  \u003Ca href=\"#results\">\u003Cb>📊 Results\u003C\u002Fb>\u003C\u002Fa> |\n  \u003Ca href=\"#citation\">\u003Cb>📝 Citation\u003C\u002Fb>\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp>\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.11-3776AB?style=flat-square&logo=python&logoColor=white\" alt=\"Python 3.11\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPyTorch-2.8-EE4C2C?style=flat-square&logo=pytorch&logoColor=white\" alt=\"PyTorch 2.8\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCUDA-12.8-76B900?style=flat-square&logo=nvidia&logoColor=white\" alt=\"CUDA 12.8\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRelease-Inference--Only-black?style=flat-square\" alt=\"Inference\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLong%20Video-5%20min-d61f2c?style=flat-square\" alt=\"5 minute long video\">\n\u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\n## Abstract\n\nLong video generation still suffers from error accumulation, weak temporal coherence, and prohibitive latency, limiting its applicability to interactive scenarios. 
We present **JoyAI-Echo**, a framework that breaks these barriers through four key advances.\nCentral to its performance, a cross-modal audio-visual memory bank preserves character appearance and voice timbre consistently over five-minute videos, while a post-training pipeline combines memory-based reinforcement learning with distribution matching distillation for a **7.5× speedup** to substantially boost visual quality and alignment.\nEmpowered by these two components, **JoyAI-Echo** decisively outperforms *HappyOyster* (directing mode) on long-form generation and even surpasses the short-video specialist *Wan 2.6* on human-centric tasks.\nBeyond raw generation quality, an interactive agent enables real-time user editing through conversational instructions, and a lightweight super-resolution module maintains high definition under streaming latency, further elevating the overall experience and delivering instantly editable, conversation-speed video creation.\nFor the first time, **JoyAI-Echo** simultaneously achieves long-range cross-modal consistency, real-time inference for minute-long video, conversational interactivity, and high-resolution output — without compromise, inaugurating a new era of interactive video generation.\nCodes and weights will be open-sourced.\n\n## Highlights\n\n- 🎞️ **Minute-level multi-shot stories**: generate a sequence of coherent shots from one prompt JSON.\n- ⚡ **DMD-distilled few-step inference**: ~7.5x faster than the original pipeline.\n- 🔊 **Joint audio-video generation**: one pipeline produces synchronized video and audio.\n- 🧠 **Paired cross-modal memory bank**: conditions each new shot on prior visual identity and voice context for story-level consistency.\n\n## Demo Gallery\n\nExplore long-form and short-form JoyAI-Echo cases on the [Project Page](https:\u002F\u002Fecho-team-joy-future-academy-jd.github.io\u002FEcho-LongVideo-Page\u002F). 
🍿\n\n## Results\n\n### Reported Scale\n\n| Item | Value |\n| --- | ---: |\n| 🎬 Long-form coherent story length | **5 min** |\n| ⚡ Generation speedup over the original multi-step pipeline | **7.5x** |\n| 📚 Benchmark stories | **100** |\n| 🎞️ Generated evaluation shots | **3,000** |\n| 🕒 Frames per shot | **241 @ 25 fps** |\n\n### Human Evaluation\n\nGSB user study on long- and short-video generation. The numbers denote the percentage of user preferences.\n\n| Aspect\u003Cbr>(Long Video) | JoyAI-Echo | Tie | HappyOyster\u003Cbr> (Directing) | \n| --- | ---: | ---: | ---: | \n| Visual aesthetics | **63.6%** | 8.8% | 27.6% | \n| Audio quality | **81.7%** | 6.5% | 11.8% |\n| Prompt following | **80.6%** | 13.5% | 5.9% | \n| IP consistency | **59.4%** | 12.9% | 27.7% |\n\n| Aspect\u003Cbr>(Short Video) | JoyAI-Echo | Tie | Wan 2.6 |\n| ---  | ---: | ---: | ---: |\n| Visual aesthetics | **58.8%** | 14.7% | 26.5% |\n| Audio quality  | 32.3% | 30.9% | 36.8% |\n| Prompt following | 33.8% | 36.8% | 29.4% |\n\n## Repository Layout\n\n```text\n.\n+-- configs\u002F\n|   `-- inference.yaml                # all inference parameters (YAML)\n+-- checkpoints\u002F                      # model weights (download separately)\n|   +-- echo-longvideo-release.safetensors\n|   `-- gemma-3-12b\u002F\n+-- prompts\u002F                          # multi-shot prompt JSON files\n|   +-- example_single_shot.json\n|   `-- example_multi_shot.json\n+-- ltx-core\u002Fsrc\u002Fltx_core\u002F            # transformer, VAE, text-encoder building blocks\n+-- ltx-pipelines\u002Fsrc\u002Fltx_pipelines\u002F  # sampler and pipeline utilities\n+-- ltx-distillation\u002F\n|   +-- src\u002Fltx_distillation\u002F         # DMD wrappers, AV pipelines, memory bank, utils\n|   `-- scripts\u002Fmultishot_inference_dmd.py\n+-- inference.py                      # main entrypoint (load once, infer all)\n+-- requirements.txt\n`-- environment.yml\n```\n\n## Quickstart\n\n### 1. 
Clone\n\n```bash\n\ngit clone https:\u002F\u002Fgithub.com\u002Fjd-opensource\u002FJoyAI-Echo.git\ncd JoyAI-Echo\n```\n\n### 2. Create the environment\n\nThe reference environment is **Python 3.11 + PyTorch 2.8 + CUDA 12.8**.\n\nWith conda:\n\n```bash\nconda env create -f environment.yml\nconda activate echo-long\n```\n\nWith `uv`:\n\n```bash\nuv venv --python 3.11 .venv\nsource .venv\u002Fbin\u002Factivate\nuv pip install --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu128 -r requirements.txt\n```\n\n[`ffmpeg`](https:\u002F\u002Fffmpeg.org\u002Fdownload.html) must be available on `PATH` for shot concatenation. The conda recipe includes it. If you use `uv`, install it with your system package manager:\n\n```bash\nsudo apt install ffmpeg\n# macOS:\nbrew install ffmpeg\n```\n\n### 3. Download checkpoint\n\nDownload the JoyAI-Echo release checkpoint and Gemma text encoder:\n\n| File | Description | Size | Link |\n| --- | --- | --- | --- |\n| `echo-longvideo-release.safetensors` | Full model (transformer + VAE + vocoder) | ~46 GB |[`JoyAI-Echo`](https:\u002F\u002Fhuggingface.co\u002Fjdopensource\u002FJoyAI-Echo)  |\n| `gemma-3-12b\u002F` | Instruction-tuned model (text encoder) | ~24 GB | [`gemma-3-12b-it`](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fgemma-3-12b-it) |\n\nPlace them under `checkpoints\u002F`:\n\n```text\ncheckpoints\u002F\n+-- echo-longvideo-release.safetensors\n`-- gemma-3-12b\u002F\n```\n\n### 4. Write a story prompt\n\nCreate a JSON file under `prompts\u002F`. Each file is a single object with a `prompts` list, where **every string is one complete shot**. 
A single string produces one shot; multiple strings produce a multi-shot story, with each new shot conditioned on the previous ones through the paired audio-video memory bank.\n\nInside each string, write these parts in order:\n\n| Part | What to describe |\n| --- | --- |\n| **Roles & Subjects** | Describe the appearance of all visible people, including age, build, hair, face, wardrobe, and speaking voice timbre when applicable. |\n| **Action & Dialogue** | What the subject does and says. |\n| **Style** | The overall visual and emotional aesthetic, e.g. realistic motorsport film language, cool daylight, restrained cinematic tension. |\n| **Camera Movement** | The shot type and framing or movement, e.g. a stable close-up on the face, or a medium shot from the waist up. |\n| **Background** | The setting and scene details behind the subject. |\n| **Sound Effects & BGM** | The sounds in the scene and the background music, e.g. room tone, wind, footsteps and fabric, with a soft low music bed under the dialogue or no background music. |\n\nTo turn a story into these shot descriptions automatically, pair an LLM with the story-writer system prompt at `prompts\u002Fstory_writer_system_prompt.md`. A more convenient prompt-writing workflow will be released as a **director agent** for everyone to use.\n\n### 5. Run inference\n\n```bash\npython inference.py\n```\n\nThis loads the model once and processes all prompt files under `prompts\u002F`.\n\n> 💡 **Note**: The inference pipeline is optimized to run on lower-VRAM\n> GPUs. Peak GPU usage is around **46–50 GB**, at the cost of slightly\n> longer per-shot inference time.\n\nOutputs are written to:\n\n```text\ninference_result\u002Foutputs\u002F\u003Cprompt-name>\u002Finference_\u003Ctimestamp>\u002F\n```\n\n## Configuration\n\nAll inference parameters are managed in `configs\u002Finference.yaml`. 
The file is organized into sections:\n\n| Section | Contents |\n| --- | --- |\n| `paths` | Checkpoint path, prompts directory, output root |\n| `video` | Resolution, frame count, FPS, seed |\n| `denoising` | Step list and sigma schedule |\n| `memory` | Memory bank size, save mode, LoRA settings |\n| `audio_memory` | Audio window, mel-spectrogram params |\n| `inference` | Device, dtype, grad scale |\n\n### Override via CLI\n\nAny YAML parameter can be overridden from the command line:\n\n```bash\npython inference.py --seed 42 --num-frames 121 --video-height 480 --video-width 832\n```\n\nUse a custom config file:\n\n```bash\npython inference.py --config configs\u002Fmy_experiment.yaml\n```\n\nThe Python entrypoint exposes the full configuration surface:\n\n```bash\npython inference.py --help\n```\n\n## Hardware\n\nPeak GPU usage is around **46–50 GB** for the default **25 fps x 241 frames x 1280 x 736** setting, so a single H100\u002FA100-class (80 GB) or 48 GB GPU is sufficient.\n\nFor smaller GPUs, reduce resolution\u002Fframes:\n\n```bash\npython inference.py --num-frames 121 --video-height 480 --video-width 832\n```\n\n## TODO List\n\n- [x] Release inference code\n- [x] Release model checkpoints\n- [x] Add prompt examples\n- [ ] Release Echo-SR (Super-resolution)\n- [ ] Release Director Agent \n\n## Links\n\n- Project page: [`https:\u002F\u002Fecho-team-joy-future-academy-jd.github.io\u002FEcho-LongVideo-Page\u002F`](https:\u002F\u002Fecho-team-joy-future-academy-jd.github.io\u002FEcho-LongVideo-Page\u002F)\n- Repository: [`https:\u002F\u002Fgithub.com\u002Fjd-opensource\u002FJoyAI-Echo`](https:\u002F\u002Fgithub.com\u002Fjd-opensource\u002FJoyAI-Echo)\n- huggingface: [`https:\u002F\u002Fhuggingface.co\u002Fjdopensource\u002FJoyAI-Echo`](https:\u002F\u002Fhuggingface.co\u002Fjdopensource\u002FJoyAI-Echo)\n\n## Acknowledgements\n\nWe gratefully acknowledge the open-source projects this work builds upon — in particular 
[LTX2.3](https:\u002F\u002Fhuggingface.co\u002FLightricks\u002FLTX-2.3) for the base video generator and [Gemma](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fgemma-3-12b-it) for the text encoder. Thanks to the broader research community whose contributions made this release possible.\n\n**For academic research and non-commercial use only.**\n\n## Citation\n\nIf JoyAI-Echo helps your research or products, please cite:\n\n```bibtex\n@techreport{echo2026longvideo,\n  title        = {JoyAI-Echo: Pushing the Frontier of Long Video Generation},\n  author       = {{Echo Team @ Joy Future Academy, JD}},\n  institution  = {Joy Future Academy, JD},\n  year         = {2026},\n  month        = {May}\n}\n```\n\n## License\n\nThis project is based on LTX-2 by Lightricks Ltd.\n\nPortions of the original LTX-2 codebase have been modified by JD.com for academic and research purposes only. \nThis project is not intended for commercial use. For commercial use of LTX-2 or its derivatives, please contact Lightricks Ltd.\n\nAll original copyright, license, patent, trademark, and attribution notices from LTX-2 are retained. \nThis project remains subject to the LTX-2 Community License Agreement.\n","JoyAI-Echo is a framework focused on long video generation, capable of producing high-quality audio-visual content several minutes long. Its core features include a cross-modal audio-visual memory bank that keeps character appearance and voice consistent across five-minute videos, and a post-training pipeline that combines memory-based reinforcement learning with distribution matching distillation to achieve a 7.5x speedup while substantially improving visual quality and alignment. It also supports real-time user editing and high-resolution output, making it suitable for scenarios that require long-range consistency and interactivity, such as online education, virtual meetings, and creative video production.",2,"2026-06-11 04:09:51","CREATED_QUERY"]