[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72252":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":17,"lastSyncTime":29,"discoverSource":30},72252,"mochi","genmoai\u002Fmochi","genmoai","The best OSS video generation models, created by Genmo","",null,"Python",3663,482,45,50,0,2,3,17,6,67.25,"Apache License 2.0",false,"main",[],"2026-06-12 04:01:04","# Mochi 1\n[Blog](https:\u002F\u002Fwww.genmo.ai\u002Fblog) | [Direct Download](https:\u002F\u002Fweights.genmo.dev\u002Fweights.zip) | [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fgenmo\u002Fmochi-1-preview) | [Playground](https:\u002F\u002Fwww.genmo.ai\u002Fplay) | [Careers](https:\u002F\u002Fjobs.ashbyhq.com\u002Fgenmo)\n\nA state of the art video generation model by [Genmo](https:\u002F\u002Fgenmo.ai).\n\nhttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F4d268d02-906d-4cb0-87cc-f467f1497108\n\n## News\n\n- ⭐ **November 26, 2024**: Added support for [LoRA fine-tuning](demos\u002Ffine_tuner\u002FREADME.md)\n- ⭐ **November 5, 2024**: Consumer-GPU support for Mochi [natively in ComfyUI](https:\u002F\u002Fx.com\u002FComfyUI\u002Fstatus\u002F1853838184012251317)\n\n## Overview\n\nMochi 1 preview is an open state-of-the-art video generation model with high-fidelity motion and strong prompt adherence in preliminary evaluation. This model dramatically closes the gap between closed and open video generation systems. 
We’re releasing the model under a permissive Apache 2.0 license. Try this model for free on [our playground](https:\u002F\u002Fgenmo.ai\u002Fplay).\n\n## Installation\n\nInstall using [uv](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv):\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fgenmoai\u002Fmochi\ncd mochi\npip install uv\nuv venv .venv\nsource .venv\u002Fbin\u002Factivate\nuv pip install setuptools\nuv pip install -e . --no-build-isolation\n```\n\nIf you want to install flash attention, you can use:\n```\nuv pip install -e .[flash] --no-build-isolation\n```\n\nYou will also need to install [FFmpeg](https:\u002F\u002Fwww.ffmpeg.org\u002F) to turn your outputs into videos.\n\n## Download Weights\n\nUse [download_weights.py](scripts\u002Fdownload_weights.py) to download the model + VAE to a local directory. Use it like this:\n```bash\npython3 .\u002Fscripts\u002Fdownload_weights.py weights\u002F\n```\n\nOr, directly download the weights from [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fgenmo\u002Fmochi-1-preview\u002Ftree\u002Fmain) or via `magnet:?xt=urn:btih:441da1af7a16bcaa4f556964f8028d7113d21cbb&dn=weights&tr=udp:\u002F\u002Ftracker.opentrackr.org:1337\u002Fannounce` to a folder on your computer.\n\n## Running\n\nStart the Gradio UI with\n\n```bash\npython3 .\u002Fdemos\u002Fgradio_ui.py --model_dir weights\u002F --cpu_offload\n```\n\nOr generate videos directly from the CLI with\n\n```bash\npython3 .\u002Fdemos\u002Fcli.py --model_dir weights\u002F --cpu_offload\n```\n\nIf you have a fine-tuned LoRA in the safetensors format, you can add `--lora_path \u003Cpath\u002Fto\u002Fmy_mochi_lora.safetensors>` to either `gradio_ui.py` or `cli.py`.\n\n## API\n\nThis repository comes with a simple, composable API, so you can programmatically call the model. You can find a full example [here](demos\u002Fapi_example.py). 
But, roughly, it looks like this:\n\n```python\nfrom genmo.mochi_preview.pipelines import (\n    DecoderModelFactory,\n    DitModelFactory,\n    MochiSingleGPUPipeline,\n    T5ModelFactory,\n    linear_quadratic_schedule,\n)\n\npipeline = MochiSingleGPUPipeline(\n    text_encoder_factory=T5ModelFactory(),\n    dit_factory=DitModelFactory(\n        model_path=f\"weights\u002Fdit.safetensors\", model_dtype=\"bf16\"\n    ),\n    decoder_factory=DecoderModelFactory(\n        model_path=f\"weights\u002Fdecoder.safetensors\",\n    ),\n    cpu_offload=True,\n    decode_type=\"tiled_spatial\",\n)\n\nvideo = pipeline(\n    height=480,\n    width=848,\n    num_frames=31,\n    num_inference_steps=64,\n    sigma_schedule=linear_quadratic_schedule(64, 0.025),\n    cfg_schedule=[6.0] * 64,\n    batch_cfg=False,\n    prompt=\"your favorite prompt here ...\",\n    negative_prompt=\"\",\n    seed=12345,\n)\n```\n\n## Fine-tuning with LoRA\n\nWe provide [an easy-to-use trainer](demos\u002Ffine_tuner\u002FREADME.md) that allows you to build LoRA fine-tunes of Mochi on your own videos. The model can be fine-tuned on one H100 or A100 80GB GPU.\n\n## Model Architecture\n\nMochi 1 represents a significant advancement in open-source video generation, featuring a 10 billion parameter diffusion model built on our novel Asymmetric Diffusion Transformer (AsymmDiT) architecture. Trained entirely from scratch, it is the largest video generative model ever openly released. And best of all, it’s a simple, hackable architecture. Additionally, we are releasing an inference harness that includes an efficient context parallel implementation. \n\nAlongside Mochi, we are open-sourcing our video AsymmVAE. We use an asymmetric encoder-decoder structure to build an efficient high quality compression model. Our AsymmVAE causally compresses videos to a 128x smaller size, with an 8x8 spatial and a 6x temporal compression to a 12-channel latent space. 
\n\n### AsymmVAE Model Specs\n|Params \u003Cbr> Count | Enc Base \u003Cbr>  Channels | Dec Base \u003Cbr> Channels |Latent \u003Cbr> Dim | Spatial \u003Cbr> Compression | Temporal \u003Cbr> Compression | \n|:--:|:--:|:--:|:--:|:--:|:--:|\n|362M   | 64  | 128  | 12   | 8x8   | 6x   | \n\nAn AsymmDiT efficiently processes user prompts alongside compressed video tokens by streamlining text processing and focusing neural network capacity on visual reasoning. AsymmDiT jointly attends to text and visual tokens with multi-modal self-attention and learns separate MLP layers for each modality, similar to Stable Diffusion 3. However, our visual stream has nearly 4 times as many parameters as the text stream via a larger hidden dimension. To unify the modalities in self-attention, we use non-square QKV and output projection layers. This asymmetric design reduces inference memory requirements.\nMany modern diffusion models use multiple pretrained language models to represent user prompts. In contrast, Mochi 1 simply encodes prompts with a single T5-XXL language model.\n\n### AsymmDiT Model Specs\n|Params \u003Cbr> Count | Num \u003Cbr> Layers | Num \u003Cbr> Heads | Visual \u003Cbr> Dim | Text \u003Cbr> Dim | Visual \u003Cbr> Tokens | Text \u003Cbr> Tokens | \n|:--:|:--:|:--:|:--:|:--:|:--:|:--:|\n|10B   | 48   | 24   | 3072   | 1536   | 44520   |   256   |\n\n## Hardware Requirements\nThe repository supports both multi-GPU operation (splitting the model across multiple graphics cards) and single-GPU operation, though it requires approximately 60GB VRAM when running on a single GPU. While ComfyUI can optimize Mochi to run on less than 20GB VRAM, this implementation prioritizes flexibility over memory efficiency. When using this repository, we recommend using at least 1 H100 GPU.\n\n## Safety\nGenmo video models are general text-to-video diffusion models that inherently reflect the biases and preconceptions found in their training data. 
While steps have been taken to limit NSFW content, organizations should implement additional safety protocols and exercise careful consideration before deploying these model weights in any commercial services or products.\n\n## Limitations\nUnder the research preview, Mochi 1 is a living and evolving checkpoint. There are a few known limitations. The initial release generates videos at 480p today. In some edge cases with extreme motion, minor warping and distortions can also occur. Mochi 1 is also optimized for photorealistic styles, so it does not perform well with animated content. We also anticipate that the community will fine-tune the model to suit various aesthetic preferences.\n\n## Related Work\n- [ComfyUI-MochiWrapper](https:\u002F\u002Fgithub.com\u002Fkijai\u002FComfyUI-MochiWrapper) adds ComfyUI support for Mochi. The integration of PyTorch's SDPA attention was based on their repository.\n- [ComfyUI-MochiEdit](https:\u002F\u002Fgithub.com\u002Flogtd\u002FComfyUI-MochiEdit) adds ComfyUI nodes for video editing, such as object insertion and restyling.\n- [mochi-xdit](https:\u002F\u002Fgithub.com\u002Fxdit-project\u002Fmochi-xdit) is a fork of this repository that improves parallel inference speed with [xDiT](https:\u002F\u002Fgithub.com\u002Fxdit-project\u002Fxdit).\n- [Modal script](contrib\u002Fmodal\u002Freadme.md) for fine-tuning Mochi on Modal GPUs.\n\n\n## BibTeX\n```\n@misc{genmo2024mochi,\n      title={Mochi 1},\n      author={Genmo Team},\n      year={2024},\n      publisher = {GitHub},\n      journal = {GitHub repository},\n      howpublished={\\url{https:\u002F\u002Fgithub.com\u002Fgenmoai\u002Fmodels}}\n}\n```\n","Mochi is an advanced open-source video generation model developed by Genmo. It offers high-fidelity motion and strong prompt adherence, significantly narrowing the gap between closed and open video generation systems. Mochi is written in Python, supports LoRA fine-tuning to improve model performance, and is compatible with consumer-grade GPUs. The project also provides a simple, composable API that lets developers call the model programmatically. The model is well suited to applications that need high-quality video generation, such as creative design, digital entertainment, or educational content production.","2026-06-11 03:41:02","high_star"]