[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79985":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":16,"lastSyncTime":28,"discoverSource":29},79985,"OmniNFT","zghhui\u002FOmniNFT","zghhui","Code for \"OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation\"","",null,"Python",90,5,73,0,2,8,16,6,48.93,false,"master",true,[],"2026-06-12 04:01:26","\u003Ch2 align=\"center\">OmniNFT\u003C\u002Fh2>\n\u003Ch4 align=\"center\">Modality-wise Omni Diffusion Negative-aware Fine-Tuning for Joint Audio and Video Generation\u003C\u002Fh4>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fzghhui\u002FOmniNFT\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20HuggingFace-OmniNFT-ffc107?logoColor=white\" alt=\"HuggingFace\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.12480\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-Paper-b5212f?logo=arxiv\" alt=\"ArXiv\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fzghhui.github.io\u002FOmniNFT\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🌐-Project%20Page-blue\" alt=\"Project Page\"\u002F>\u003C\u002Fa>\n\u003C\u002Fp>\n\n---\n## 🔈 News\n- [2026-05-21] Comfy compatible format is 
[here](https:\u002F\u002Fhuggingface.co\u002FKijai\u002FLTX2.3_comfy\u002Fblob\u002Fmain\u002Floras\u002FLTX-2.3-OmniNFT-RL-Lora_bf16.safetensors).\n- [2026-05-19] LTX-2.3 is now supported 🚀. LoRA weights for LTX-2.3 are now available!\n- [2026-05-13] OmniNFT is released on [arXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.12480).\n- [2026-05-11] Code and LoRA weights for LTX-2 are available.\n\n  \n---\n\n## 🏗️ Method Overview\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fframework.png\" width=\"90%\" \u002F>\n\u003C\u002Fp>\n\n**Modality-wise Advantage Routing** — Instead of collapsing all rewards into a single global advantage, OmniNFT computes independent per-reward advantages for video, audio, and cross-modal synchronization, then routes each to its responsible generation branch — uni-modal advantages supervise only their own branch while the synchronization advantage is broadcast to both — resolving the advantage inconsistency where roughly half of samples receive opposing rewards across modalities.\n\n**Layer-wise Gradient Surgery** — To address gradient imbalance where video-branch gradients leak into shallow audio layers dedicated to intra-modal generation, OmniNFT applies a partial stop-gradient on the audio key-value projections in A2V cross-attention at shallow Transformer blocks, suppressing erroneous gradient injection while preserving full gradient flow through the deeper cross-modal alignment layers (AV-Sync Zone).\n\n**Region-wise Loss Reweighting** — Leveraging V2A cross-attention maps from late denoising steps as an intrinsic proxy for sound-emitting critical regions, OmniNFT aggregates them into per-token importance weights that modulate the video-branch RL loss, providing fine-grained credit assignment that concentrates optimization capacity on regions most critical for audio-video synchronization without requiring external detection modules.\n\n---\n\n## ⚡ Installation\n\n```bash\nconda create -n omnninft python=3.11\nconda 
activate omnninft\npip install -r requirements.txt\n```\n\n\u003C!-- --- -->\n\n## 📦 Model Checkpoints\n\n| Env Variable | Description | Source |\n|---|---|---|\n| `LTX-MODEL` | LTX base model | [LTX-2](https:\u002F\u002Fhuggingface.co\u002FLightricks\u002FLTX-2) [LTX-2.3](https:\u002F\u002Fhuggingface.co\u002FLightricks\u002FLTX-2.3) |\n| `OmniNFT_LTX` | LTX + OmniNFT | [OmniNFT](https:\u002F\u002Fhuggingface.co\u002Fzghhui\u002FOmniNFT) |\n| `REWARD_MODELS` | All reward models (HPSv3, CLAP, AudioBox, Synchformer, ImageBind, etc.) | [OmniNFT-Reward-Series](https:\u002F\u002Fhuggingface.co\u002Fzghhui\u002FOmniNFT-Reward-Series) |\n\n\n## 🚀 Training\n\n### Step 0: Download Reward Models\n\nDownload all reward model weights from HuggingFace:\n\n```bash\nhuggingface-cli download --resume-download zghhui\u002FOmniNFT-Reward-Series --local-dir Omni_Reward_Series\n```\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>Reward model checkpoints under \u003Ccode>Omni_Reward_Series\u002F\u003C\u002Fcode>\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n| Env Variable | Path | Description |\n|---|---|---|\n| `HPSV3_CKPT_PATH` | `Omni_Reward_Series\u002FHPSv3\u002FHPSv3.safetensors` | HPSv3 image quality scorer |\n| `VIDEOALIGN_CKPT_DIR` | `Omni_Reward_Series\u002FVideoReward` | VideoAlign video quality scorer |\n| `AUDIOBOX_CKPT` | `Omni_Reward_Series\u002Faudiobox-aesthetics\u002Fcheckpoint.pt` | AudioBox aesthetics predictor |\n| `CLAP_CKPT` | `Omni_Reward_Series\u002FCLAP` | CLAP audio-text alignment model |\n| `IMAGEBIND_CKPT` | `Omni_Reward_Series\u002FImageBind\u002Fimagebind_huge.pth` | ImageBind multimodal embeddings |\n| `SYNCHFORMER_CKPT` | `Omni_Reward_Series\u002Fsynchformer\u002Fsynchformer_state_dict.pth` | Synchformer AV sync scorer |\n\nAll paths are pre-configured in `bash_train_omninft_ltx_fsdp.sh` as relative paths.\n\n\u003C\u002Fdetails>\n\n### Step 1: Launch Reward Servers\n\nHPSv3 and VideoAlign run as remote HTTP servers. 
Start them **before** training:\n\n```bash\n# Terminal 1: HPSv3 server\nbash flow_grpo\u002Fserver\u002Frun_remote_hpsv3.sh\n\n# Terminal 2: VideoAlign server\nbash flow_grpo\u002Fserver\u002Frun_remote_videoalign.sh\n```\n\n### Step 2: Multi(Single)-Node Training\n\n```bash\nbash bash_train_omninft_ltx_fsdp.sh branch_aware_layer_surgery_avweight\n```\n\n\n## 🎬 Inference\n\n### Step 1: Merge LoRA into base model\n\nAfter training, merge the LoRA weights into the base checkpoint:\n\n```bash\npython scripts\u002Fmerge_lora.py \\\n    --checkpoint-path $LTX_MODEL_PATH \\\n    --lora-dir $OUTPUT_DIR\u002Fcheckpoint-latest\u002Flora \\\n    --output-path .\u002Fmerged_model.safetensors \\\n    --dtype bf16\n```\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>Arguments\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n| Argument | Description |\n|---|---|\n| `--checkpoint-path` | LTX-Video base checkpoint used during training |\n| `--lora-dir` | LoRA output directory (contains `adapter_model.safetensors` + `adapter_config.json`) |\n| `--output-path` | Output path for the merged model |\n| `--dtype` | Output precision: `bf16` (default) \u002F `fp16` \u002F `fp32` \u002F `keep` |\n\n\u003C\u002Fdetails>\n\n### Step 2: Generate audio-video\n\n```bash\npython scripts\u002Finference.py \\\n    --model_path .\u002Fmerged_model.safetensors \\\n    --gemma_path $GEMMA_MODEL_PATH \\\n    --prompt \"A man plays acoustic guitar on a wooden stage, warm applause from the audience\" \\\n    --seed 42 \\\n    --output_dir .\u002Fresults\n```\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>Arguments\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n| Argument | Default | Description |\n|---|---|---|\n| `--model_path` | (required) | Path to merged `.safetensors` model |\n| `--gemma_path` | env `GEMMA_MODEL_PATH` | Path to Gemma 3 text encoder |\n| `--prompt` | (required) | Text prompt for generation |\n| `--num_frames` | `121` | Number of video frames |\n| `--height` \u002F `--width` | model default | Video 
resolution |\n| `--num_inference_steps` | model default | Number of denoising steps |\n| `--video_guidance_scale` | model default | Video CFG scale |\n| `--audio_guidance_scale` | model default | Audio CFG scale |\n| `--seed` | `42` | Random seed |\n| `--no_audio` | `false` | Disable audio generation |\n| `--dtype` | `bf16` | Inference precision |\n\n\u003C\u002Fdetails>\n\nOutputs are saved to `--output_dir`: `.mp4` (video with audio) and `.wav` (audio only).\n\n## 🖊️ Citation\n\n```bibtex\n@article{zhang2026omninft,\n  title={OmniNFT: Modality-wise Omni Diffusion Reinforcement for Joint Audio-Video Generation},\n  author={Zhang, Guohui and Ma, XiaoXiao and Huang, Jie and Xu, Hang and Yu, Hu and Fu, Siming and Li, Yuming and Xue, Zeyue and Song, Lin and Huang, Haoyang and Duan, Nan and Zhao, Feng},\n  journal={arXiv preprint arXiv:2605.12480},\n  year={2026}\n}\n```\n\n## 🤝 Acknowledgements\n\n[LTX-2](https:\u002F\u002Fgithub.com\u002FLightricks\u002FLTX-2) · [DiffusionNFT](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FDiffusionNFT)\n\n---\n\n## ⚠️ License\n\nResearch use only. See individual submodule licenses (HPSv3, ImageBind, LTX-Video, etc.) for their terms.\n","OmniNFT is a project for joint audio and video generation, based on modality-wise omni diffusion negative-aware fine-tuning. Its core features are modality-wise advantage routing, layer-wise gradient surgery, and region-wise loss reweighting, which together improve cross-modal synchronization quality and address the gradient imbalance common in multimodal generation. The project is particularly suited to scenarios that need high-quality synchronized audio-video content, such as multimedia art creation and virtual reality experience development. It is written in Python, supports the LTX-2 and LTX-2.3 models, is easy to install, and provides pretrained weights through the Hugging Face platform.","2026-06-11 03:58:48","CREATED_QUERY"]