[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1952":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":17,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},1952,"CoInteract","luoxyhappy\u002FCoInteract","luoxyhappy","Official Implementation of CoInteract: Spatially-Structured Co-Generation for Interactive Human-Object Video Synthesis",null,"Python",156,11,8,6,0,1,3,7,3.24,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:00:35","\u003Ch1 align=\"center\">CoInteract: Spatially-Structured Co-Generation for Interactive Human-Object Video Synthesis\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n  Xiangyang Luo\u003Csup>1,2*\u003C\u002Fsup>, Xiaozhe Xin\u003Csup>2*✉\u003C\u002Fsup>, Tao Feng\u003Csup>1\u003C\u002Fsup>, Xu Guo\u003Csup>1\u003C\u002Fsup>, Meiguang Jin\u003Csup>2\u003C\u002Fsup>, Junfeng Ma\u003Csup>2\u003C\u002Fsup>\n  \u003Cbr>\n  \u003Csup>1\u003C\u002Fsup> Tsinghua University &nbsp; \u003Csup>2\u003C\u002Fsup> Alibaba Group\n  \u003Cbr>\n  \u003Csup>*\u003C\u002Fsup> Equal contribution &nbsp; \u003Csup>✉\u003C\u002Fsup> Corresponding author\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.19636\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-Paper-red\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fgeorgexin\u002Fcointeract\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20HuggingFace-Model-yellow\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fxinxiaozhe12345.github.io\u002FCoInteract_Project\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-blue\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n## Demo\n\u003Cp align=\"center\">\n  \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Ffe59a768-38e8-4403-bd13-155662c628d6\" controls width=\"80%\">\u003C\u002Fvideo>\n\u003C\u002Fp>\n\n## 🔥News\n\n- [May 6, 2026] We release the [training guideline](.\u002Fexamples\u002Fwanvideo\u002Fmodel_training\u002FREADME.md) and **pose-driven inference** of CoInteract.\n- [April 27, 2026] We release the inference code and checkpoint of CoInteract.\n- [April 22, 2026] We release the [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.19636) and [Project](https:\u002F\u002Fxinxiaozhe12345.github.io\u002FCoInteract_Project\u002F) page of CoInteract.\n\n## 🗺️Roadmap\n\n| Stage | Status | Description | Date |\n|-------|--------|-------------|------|\n| 1 | ✅ | Release inference code and model weights | - |\n| 2 | ✅ | Release pose-driven checkpoint and inference | - |\n| 3 | ✅ | Release training code | - |\n\n## Installation\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fluoxyhappy\u002FCoInteract.git\ncd CoInteract\nconda create -n cointeract python=3.10\n# Activate the new environment so the package installs into it\nconda activate cointeract\npip install -e .\n```\n\n### Model Weights\n\nWe rely on two base models plus our CoInteract checkpoint. 
The easiest way to fetch everything is the HuggingFace CLI:\n\n```bash\n# Wan2.2-S2V-14B base model\nhf download Wan-AI\u002FWan2.2-S2V-14B \\\n    --local-dir .\u002Fmodels\u002FWan2.2-S2V-14B\n\n# Chinese Wav2Vec2 audio encoder\nhf download jonatasgrosman\u002Fwav2vec2-large-xlsr-53-chinese-zh-cn \\\n    --local-dir .\u002Fmodels\u002Fchinese-wav2vec2-large\n\n# CoInteract checkpoint \nhf download georgexin\u002Fcointeract \\\n    --local-dir .\u002Fmodels\u002FCoInteract\n```\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>Note: alternative endpoint for restricted networks\u003C\u002Fb>\u003C\u002Fsummary>\n\nIf `huggingface.co` is unreachable from your environment, configure the community mirror before running the download commands:\n\n```bash\nexport HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com\n```\n\nTo persist this setting, append the line to `~\u002F.bashrc` or `~\u002F.zshrc`.\n\n\u003C\u002Fdetails>\n\n| Model | Link |\n|-------|------|\n| Wan2.2-S2V-14B | [Wan-AI\u002FWan2.2-S2V-14B](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-S2V-14B) |\n| Chinese Wav2Vec2 Large | [jonatasgrosman\u002Fwav2vec2-large-xlsr-53-chinese-zh-cn](https:\u002F\u002Fhuggingface.co\u002Fjonatasgrosman\u002Fwav2vec2-large-xlsr-53-chinese-zh-cn) |\n| CoInteract Checkpoint | [georgexin\u002Fcointeract](https:\u002F\u002Fhuggingface.co\u002Fgeorgexin\u002Fcointeract\u002Ftree\u002Fmain) |\n\n## Inference\n\nRun batch inference with the default demos CSV (paths resolve under `.\u002Fmodels\u002F`):\n\n```bash\npython batch_infer.py \\\n    --csv_path .\u002Fexamples\u002Fdemos\u002Fdemos.csv \\\n    --output_dir .\u002Foutput_videos \\\n    --height 1280 \\\n    --width 720 \\\n    --cfg_scale 7.0 \\\n    --num_clips 3\n```\n\nWe recommend running at **720p** (`--height 1280 --width 720`) for the best visual quality. 
A lower-resolution **480p** setting (`--height 832 --width 480`) is available for memory-constrained GPUs.\n\n| Resolution | Height × Width | Peak GPU Memory |\n|------------|----------------|-----------------|\n| 720p (recommended) | 1280 × 720 | ~59 GB |\n| 480p | 832 × 480 | ~45 GB |\n\nInput CSV must contain columns `audio`, `person_image`, `prompt`. Optional columns: `product_image`, `prompt2`, `prompt3`.\n\n- `person_image`: path to the reference image of the speaker (identity \u002F first frame).\n- `product_image`: path to the product reference image (object appearance). Leave empty for pure speech-driven generation.\n- `prompt2`, `prompt3`: optional per-clip prompts used for **interactive generation**, allowing different textual instructions across sequential clips to drive multi-turn interactions.\n\nWe provide our generated results for the demos in [`.\u002Foutput_videos`](.\u002Foutput_videos) for reference.\n\n> **Notes.** If you want to try your own cases, we recommend using product images with a clean white background for best results, and keeping your prompt in a format consistent with the examples provided in [`.\u002Fexamples\u002Fdemos\u002Fdemos.csv`](.\u002Fexamples\u002Fdemos\u002Fdemos.csv).\n\n## Pose-Driven Inference\n\nBeyond audio-only control, CoInteract also supports **pose-driven** generation, where a pre-extracted pose skeleton video guides the full-body motion of the speaker while the audio still drives lip-sync.\n\n### 1. Download the pose checkpoint\n\nThe pose-driven checkpoint lives in the **same** HuggingFace repository as the default one. \n\n```bash\nhf download georgexin\u002Fcointeract \\\n    --local-dir .\u002Fmodels\u002FCoInteract\n```\n\nAfter downloading, you should see an additional `checkpoint_pose.safetensors` under `.\u002Fmodels\u002FCoInteract\u002F`.\n\n### 2. 
Prepare the CSV\n\nA ready-to-use CSV is shipped at [`.\u002Fexamples\u002Fdemos\u002Fposedriven\u002Fposedriven.csv`](.\u002Fexamples\u002Fdemos\u002Fposedriven\u002Fposedriven.csv).\n\n### 3. Run batch inference\n\n```bash\npython batch_infer.py \\\n    --csv_path .\u002Fexamples\u002Fdemos\u002Fposedriven\u002Fposedriven.csv \\\n    --lora_path .\u002Fmodels\u002FCoInteract\u002Fcheckpoint_pose.safetensors \\\n    --output_dir .\u002Foutput_videos\u002Fposedriven \\\n    --height 1280 \\\n    --width 720 \\\n    --cfg_scale 7.0 \\\n    --num_clips 3\n```\n\n\nWe provide our generated results in [`.\u002Foutput_videos\u002Fposedriven`](.\u002Foutput_videos\u002Fposedriven) for reference.\n\n## Training\nPlease refer to [`.\u002Fexamples\u002Fwanvideo\u002Fmodel_training\u002FREADME.md`](.\u002Fexamples\u002Fwanvideo\u002Fmodel_training\u002FREADME.md) for the end-to-end walkthrough.\n\n## ✨Highlights\n\nCoInteract enables high-quality **speech-driven human-object interaction video synthesis** with fine-grained spatial control. 
It supports diverse generation modes including video generation, unified generation, and interactive generation.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fteaser.jpg\" width=\"80%\">\n\u003C\u002Fp>\n\nKey contributions:\n\n- **Human-Aware Mixture-of-Experts (MoE)**: A spatial routing mechanism that dynamically dispatches tokens to specialized expert networks, supervised by GT bounding boxes during training and fully automatic at inference.\n- **Spatially-Structured Co-Generation**: Joint training of RGB video and HOI depth maps provides structural guidance for realistic interactions, without requiring depth input at inference time.\n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fpipeline.jpg\" width=\"70%\">\n\u003C\u002Fp>\n\n## Citation\n\n```bibtex\n@article{luo2026cointeract,\n  title={CoInteract: Physically-Consistent Human-Object Interaction Video Synthesis via Spatially-Structured Co-Generation},\n  author={Luo, Xiangyang and Xin, Xiaozhe and Feng, Tao and Guo, Xu and Jin, Meiguang and Ma, Junfeng},\n  journal={arXiv preprint arXiv:2604.19636},\n  year={2026}\n}\n```\n\n## Acknowledgments\n\n- [DiffSynth-Studio](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FDiffSynth-Studio)\n- [Wan2.2](https:\u002F\u002Fgithub.com\u002FWan-Video\u002FWan2.2)\n\n## License\n\nThis project is released under the [Apache License 2.0](.\u002FLICENSE). 
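Note that the underlying base models (e.g., [Wan2.2-S2V-14B](https:\u002F\u002Fhuggingface.co\u002FWan-AI\u002FWan2.2-S2V-14B) and [jonatasgrosman\u002Fwav2vec2-large-xlsr-53-chinese-zh-cn](https:\u002F\u002Fhuggingface.co\u002Fjonatasgrosman\u002Fwav2vec2-large-xlsr-53-chinese-zh-cn)) are governed by their own licenses; please comply with them when using the corresponding weights.\n","CoInteract is a project for interactive human-object video synthesis, built on spatially-structured co-generation. Its core features include pose-driven video synthesis, and it supports generating high-quality human-object interaction videos from text or audio input. The project is written in Python, relies on pretrained models such as Wan2.2-S2V-14B and a Chinese Wav2Vec2 audio encoder, and provides model weights for download on HuggingFace. CoInteract suits applications that need virtual human-object interaction scenes, such as virtual reality, augmented reality, film and television production, and online education.",2,"2026-06-11 02:46:59","CREATED_QUERY"]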