[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1291":3},{"id":4,"name":5,"fullName":6,"owner":5,"repo":5,"description":7,"homepage":7,"htmlUrl":7,"language":8,"languages":7,"totalLinesOfCode":7,"stars":9,"forks":10,"watchers":11,"openIssues":12,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":14,"stars7d":15,"stars30d":16,"stars90d":13,"forks30d":13,"starsTrendScore":17,"compositeScore":18,"rankGlobal":7,"rankLanguage":7,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":7,"pushedAt":7,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":13,"starSnapshotCount":13,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},1291,"SentiAvatar","SentiAvatar\u002FSentiAvatar",null,"Python",305,43,1,16,0,4,13,41,12,60.53,"Other",false,"main",true,[],"2026-06-12 04:00:08","\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fbanner.png\" alt=\"Banner\" width=\"100%\">\n\u003C\u002Fp>\n\n# SentiAvatar: Towards Expressive and Interactive Digital Humans\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.02908\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2604.02908-b31b1b.svg\" alt=\"arXiv\">\u003C\u002Fa>\n  \u003Ca href=\"#license\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-SentiPulse%20--NC%201.0-lightgrey.svg\" alt=\"License\">\u003C\u002Fa>\n  \u003Ca href=\"#dataset\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDataset-SuSuInterActs-blue.svg\" alt=\"Dataset\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"#\">Chuhao Jin\u003C\u002Fa>\u003Csup>1,2,*\u003C\u002Fsup>&ensp;\n  \u003Ca href=\"#\">Rui Zhang\u003C\u002Fa>\u003Csup>2,*\u003C\u002Fsup>&ensp;\n  \u003Ca href=\"#\">Qingzhe Gao\u003C\u002Fa>\u003Csup>2\u003C\u002Fsup>&ensp;\n  \u003Ca href=\"#\">Haoyu Shi\u003C\u002Fa>\u003Csup>3\u003C\u002Fsup>&ensp;\n  \u003Ca 
href=\"#\">Dayu Wu\u003C\u002Fa>\u003Csup>2\u003C\u002Fsup>&ensp;\n  \u003Ca href=\"#\">Yichen Jiang\u003C\u002Fa>\u003Csup>2\u003C\u002Fsup>&ensp;\n  \u003Ca href=\"#\">Yihan Wu\u003C\u002Fa>\u003Csup>1\u003C\u002Fsup>&ensp;\n  \u003Ca href=\"#\">Ruihua Song\u003C\u002Fa>\u003Csup>1,†\u003C\u002Fsup>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Csup>1\u003C\u002Fsup> Gaoling School of Artificial Intelligence, Renmin University of China\u003Cbr>\n  \u003Csup>2\u003C\u002Fsup> SentiPulse &ensp;\u003Cbr>\n  \u003Csup>3\u003C\u002Fsup> College of Computer Science, Inner Mongolia University\u003Cbr>\n  \u003Csub>* Equal contribution. Chuhao Jin led this project. &ensp; † Corresponding author.\u003C\u002Fsub>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.02908\">📄 Paper\u003C\u002Fa> &ensp;|&ensp;\n  \u003Ca href=\"https:\u002F\u002Fsentiavatar.github.io\u002F\">🌐 Project Page\u003C\u002Fa> &ensp;|&ensp;\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FChuhaojin\u002FSuSuInterActs\">🤗 Dataset\u003C\u002Fa> &ensp;|&ensp;\n  \u003Ca href=\"https:\u002F\u002Fsentiavatar.github.io\u002F#demo\">🎬 Demo Video\u003C\u002Fa>\n\u003C\u002Fp>\n\n---\n\n## 🔥 Highlights\n\n- 📊 **SuSuInterActs Dataset** — 21K clips, 37 hours of synchronized speech + full-body motion + facial expressions captured via optical motion capture\n- 🧠 **Plan-then-Infill Architecture** — Decouples sentence-level semantic planning from frame-level prosody-driven interpolation\n- 🏆 **State-of-the-Art** — R@1 43.64% (nearly 2× the best baseline) on SuSuInterActs; FGD 4.941, BC 8.078 on BEATv2\n- ⚡ **Real-time** — Generates 6 seconds of motion in 0.3 seconds with unlimited multi-turn streaming\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fteaser.png\" width=\"95%\" alt=\"SentiAvatar Teaser\">\n  \u003Cbr>\n  \u003Csub>\u003Cb>Figure 1:\u003C\u002Fb> SentiAvatar generates high-quality 3D human motion and 
expression, which are semantically aligned and frame-level synchronized. The same color indicates the same time step.\u003C\u002Fsub>\n\u003C\u002Fp>\n\n## Abstract\n\nWe present **SentiAvatar**, a framework for building expressive interactive 3D digital humans, and use it to create **SuSu**, a virtual character that speaks, gestures, and emotes in real time. Achieving such a system remains challenging, as it requires jointly addressing three key problems: the lack of large-scale high-quality multimodal data, robust semantic-to-motion mapping, and fine-grained frame-level motion-prosody synchronization.\n\nTo solve these problems, first, we build **SuSuInterActs** (21K clips, 37 hours), a dialogue corpus captured via optical motion capture around a single character with synchronized speech, full-body motion, and facial expressions. Second, we pre-train a **Motion Foundation Model** on 200K+ motion sequences, equipping it with rich action priors that go well beyond the conversation. We then propose an audio-aware **plan-then-infill** architecture that decouples sentence-level semantic planning from frame-level prosody-driven interpolation, so that generated motions are both semantically appropriate and rhythmically aligned with speech.\n\n\u003Cdetails>\n\u003Csummary>Chinese Abstract (translated)\u003C\u002Fsummary>\n\nWe present **SentiAvatar**, a framework for building expressive, interactive 3D digital humans. The system uses a three-stage pipeline: (1) an LLM Motion Planner predicts sparse keyframes from action tags and audio; (2) a Mask Transformer performs sliding-window frame interpolation conditioned on audio features; (3) an RVQVAE Decoder decodes discrete tokens into continuous motion sequences. It also integrates a facial animation module built on Face VQVAE + HuBERT.\n\n\u003C\u002Fdetails>\n\n## 📊 Dataset: SuSuInterActs\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fdataset_overview.png\" width=\"80%\" alt=\"Dataset Overview\">\n\u003C\u002Fp>\n\nWe open-source the **SuSuInterActs** dataset with the following content:\n\n| Type | Directory | Format | Description |\n|------|-----------|--------|-------------|\n| 🎭 Face | `SuSuInterActs\u002Farkit_data\u002F` | `.npy` | ARKit facial BlendShape values 
(51 dims) |\n| 🔊 Audio | `SuSuInterActs\u002Fwav_data\u002F` | `.wav` | 16kHz mono speech audio |\n| 💃 Motion | `SuSuInterActs\u002Fmotion_data\u002F` | `.npy` | 63-joint 6D rotation + root displacement |\n| 📝 Text | `SuSuInterActs\u002Ftext_data\u002F` | `.json` | Action\u002Fexpression tags + dialogue text |\n| 📋 Splits | `SuSuInterActs\u002Fsplit\u002F` | `.txt` | Train (19K) \u002F Val (635) \u002F Test (1479) |\n\n\n### Motion Data Format\n\nEach `.npy` file is a dictionary:\n```python\n{\n    \"body\": np.ndarray,   # (T, 153) = root_offset(3) + body_6d(25×6)\n    \"left\": np.ndarray,   # (T, 120) = left_hand_6d(20×6)\n    \"right\": np.ndarray,  # (T, 120) = right_hand_6d(20×6)\n}\n```\n- Frame rate: **20 FPS**\n- Joints: **63** (25 body + 20 left hand + 20 right hand)\n- Rotation: 6D rotation representation\n- Root displacement: velocity form (differential encoding)\n\n### Text Format\n\n`text_data\u002Fmotion2text.json`:\n```json\n{\n    \"path\u002Fto\u002Fsample_name\": \"【表情：认真聆听】【动作：缓慢点头】嗯嗯，这样啊...\",\n    ...\n}\n```\n\n## 🧠 Method Overview\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fframework.png\" width=\"90%\" alt=\"SentiAvatar Framework\">\n  \u003Cbr>\n  \u003Csub>Overview of SentiAvatar. (a) Multi-modal inputs are quantized into tokens via encoders. (b) LLM planner predicts sparse keyframe tokens for high-level dialogue content. (c) Audio-aware Infill Transformer performs dense, prosody-driven interpolation for fine-grained temporal synchronization.\u003C\u002Fsub>\n\u003C\u002Fp>\n\nOur pipeline consists of three stages:\n\n1. **Motion VQ-VAE (RVQVAE)** — Encodes continuous motion into discrete tokens with a 4-layer residual codebook (512 codes each)\n2. **LLM Motion Planner** — A fine-tuned Qwen2-0.5B predicts sparse keyframe motion tokens (every 4th frame) conditioned on action tags + audio tokens\n3. 
**Audio-Aware Infill Transformer** — A masked transformer fills in the remaining 3 frames between each pair of keyframes using HuBERT audio features, achieving prosody-aligned dense motion\n\n## ⚙️ Installation\n\n```bash\n# Clone the repository\ngit clone https:\u002F\u002Fgithub.com\u002FSentiAvatar\u002FSentiAvatar.git\ncd SentiAvatar\n\n# Create environment\nconda create -n sentiavatar python=3.10 -y\nconda activate sentiavatar\n\n# Install dependencies\npip install -r requirements.txt\n```\n\n## 📦 Model Checkpoints\n\nDownload all model weights from 🤗 HuggingFace:\n\n👉 **[https:\u002F\u002Fhuggingface.co\u002FChuhaojin\u002FSentiAvatar](https:\u002F\u002Fhuggingface.co\u002FChuhaojin\u002FSentiAvatar)**\n\n```bash\n# Option 1: Using git lfs\ngit lfs install\ngit clone https:\u002F\u002Fhuggingface.co\u002FChuhaojin\u002FSentiAvatar checkpoints\u002F\n\n# Option 2: Using huggingface-cli\npip install huggingface_hub\nhuggingface-cli download Chuhaojin\u002FSentiAvatar --local-dir checkpoints\u002F\n```\n\nPlace the downloaded files into the `checkpoints\u002F` directory. 
The expected structure:\n\n| Model | Description | Size |\n|-------|-------------|------|\n| `checkpoints\u002Fllm\u002F` | Qwen2-0.5B SFT (Motion Token Planner) | 1.1 GB |\n| `checkpoints\u002Fmask_transformer\u002F` | Audio-Motion Mask Transformer | 276 MB |\n| `checkpoints\u002Frvqvae\u002F` | Residual VQ-VAE (body motion codec) | 754 MB |\n| `checkpoints\u002Fface_vqvae\u002F` | Face VQ-VAE + weight matrices | 50 MB |\n| `checkpoints\u002Fchinese-hubert-base\u002F` | Chinese HuBERT audio encoder | 361 MB |\n| `checkpoints\u002Fhubert_kmeans\u002F` | HuBERT K-means quantizer (layer9 → tokens) | 1.5 MB |\n| `checkpoints\u002Feval_model\u002F` | ChronAccRet evaluation model | 434 MB |\n\n## 🚀 Inference\n\n### Data Preprocessing (Required for Batch Mode)\n\nBefore running batch inference, you need to preprocess the raw dataset to generate intermediate data:\n\n```bash\n# Preprocess all data (audio features + audio tokens + motion tokens)\npython scripts\u002Fpreprocess_data.py --all --device cuda:0\n\n# Or separately:\npython scripts\u002Fpreprocess_data.py --audio   # HuBERT features + K-means tokens\npython scripts\u002Fpreprocess_data.py --motion  # RVQVAE motion tokens\n```\n\nThis generates three directories under `data\u002F`:\n- `audio_features_hubert_layer9_fps10\u002F` — HuBERT layer9 features @10fps\n- `audio_tokens_hubert_layer9_fps10\u002F` — K-means quantized audio tokens @10fps\n- `motion_token_data\u002F` — RVQVAE encoded motion tokens (for GT comparison)\n\n### Mode 1: Test Set Evaluation (Batch Mode)\n\nRun inference on the entire test set and generate BVH\u002FJSON outputs:\n\n```bash\n# Step 1: Preprocess data (if not done)\npython scripts\u002Fpreprocess_data.py --all\n\n# Step 2: Start vLLM service (background)\nbash scripts\u002Fstart_vllm_server.sh checkpoints\u002Fllm 8095 0\n\n# Step 3: Run batch inference\nbash scripts\u002Frun_test.sh 8095 0\n```\n\nOutput: `output\u002Freconstructed\u002F` (BVH + JSON + WAV per sample)\n\n### Mode 2: 
Single Case Inference\n\nGenerate motion from your own audio + action tag:\n\n```bash\n# Make sure vLLM service is running\nbash scripts\u002Fstart_vllm_server.sh checkpoints\u002Fllm 8095 0\n\n# 🚀 Quick demo (uses built-in example audio, no extra data needed)\nbash scripts\u002Frun_single_infer.sh\n\n# Custom inference with your own audio\nbash scripts\u002Frun_single_infer.sh \\\n    --audio_path \u002Fpath\u002Fto\u002Fyour\u002Faudio.wav \\\n    --action_text \"动作：点头微笑\" \\\n    --output_dir .\u002Foutput_single\n```\n\nOr use the Python script directly:\n```bash\ncd motion_generation\npython single_case_infer.py \\\n    --audio_path \u002Fpath\u002Fto\u002Faudio.wav \\\n    --action_text \"动作：挥手打招呼\" \\\n    --output_dir .\u002Foutput_single \\\n    --vllm_port 8095\n```\n\n**Output files:**\n- `\u003Cname>.bvh` — BVH motion file (viewable in Blender)\n- `\u003Cname>.json` — Animation data (UE engine format)\n- `\u003Cname>.wav` — Corresponding audio file\n\n## 📊 Experimental Results\n\n### Quantitative Comparison on SuSuInterActs\n\n**Bold**: best; ↑\u002F↓: higher\u002Flower is better. ESD in seconds. 
\"†\" indicates token-by-token autoregressive generation.\n\n| Method | Condition | R@1 ↑ | R@2 ↑ | R@3 ↑ | FID ↓ | ESD ↓ | Diversity ↑ |\n|--------|-----------|-------|-------|-------|-------|-------|-------------|\n| Real Motion | — | 62.20 | 73.56 | 78.70 | 0.000 | 0.308 | 22.61 |\n| *Audio-only methods* | | | | | | | |\n| EMAGE | Audio | 5.00 | 9.40 | 13.32 | 441.6 | 0.606 | 12.92 |\n| A2M-GPT† | Audio | 8.72 | 15.96 | 20.08 | 13.66 | 0.477 | 22.23 |\n| *Text-only methods* | | | | | | | |\n| HunYuan-Motion | Text | 5.21 | 8.59 | 11.9 | 352.56 | 0.708 | 16.92 |\n| T2M-GPT | Text | 23.12 | 30.49 | 35.43 | 67.78 | 0.721 | 20.65 |\n| MoMask | Text | 34.55 | 46.58 | 54.29 | 36.25 | 0.471 | 22.03 |\n| *Audio + Text methods* | | | | | | | |\n| AT2M-GPT† | Audio, Text | 27.52 | 36.11 | 41.38 | 18.491 | 0.503 | 22.36 |\n| **SentiAvatar (Ours)** | **Audio, Text** | **43.64** | **54.94** | **61.84** | **8.912** | **0.456** | **22.41** |\n| *Improvement (%)* | | *+26.3* | *+17.9* | *+13.9* | *+34.8* | *+3.2* | *+0.2* |\n\n### Qualitative Comparison\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fcase_study.png\" width=\"95%\" alt=\"Qualitative Comparison\">\n  \u003Cbr>\n  \u003Csub>Qualitative comparison of generated motions across methods. Texts and arrows of the same color indicate the same time step. 
Red arrows indicate incorrect actions.\u003C\u002Fsub>\n\u003C\u002Fp>\n\n## 📏 Evaluation\n\nEvaluate generated motion quality using our ChronAccRet evaluation model:\n\n```bash\nbash scripts\u002Frun_eval.sh .\u002Foutput\u002Freconstructed 0\n```\n\n**Metrics:**\n| Metric | Description | Better |\n|--------|-------------|--------|\n| **R@K** | Text-motion retrieval recall @K | Higher ↑ |\n| **FID** | Fréchet Inception Distance | Lower ↓ |\n| **Diversity** | Generation diversity in latent space | Higher ↑ |\n| **ESD** | Event Sync Distance (seconds) | Lower ↓ |\n\n## 🔧 Motion Visualization\n\nConvert `.npy` motion data to BVH files for viewing in Blender or other 3D software:\n\n```bash\n# Single file conversion\npython tools\u002Fvisualize_motion.py \\\n    --input data\u002Fmotion_data\u002Fpath\u002Fto\u002Fsample.npy \\\n    --output output_vis\u002Fsample.bvh\n\n# Batch conversion (max 10 files)\npython tools\u002Fvisualize_motion.py \\\n    --input_dir data\u002Fmotion_data \\\n    --output_dir output_bvh \\\n    --max_files 10\n\n# Output both BVH and JSON\npython tools\u002Fvisualize_motion.py \\\n    --input data\u002Fmotion_data\u002Fsample.npy \\\n    --output sample.bvh \\\n    --save_json\n```\n\n## 🏗️ Project Structure\n\n```\nSentiAvatar\u002F\n├── motion_generation\u002F          # 🎯 Motion generation module\n│   ├── pipeline_infer.py       #    LLM + Mask Transformer pipeline\n│   ├── single_case_infer.py    #    Single-case inference script\n│   ├── reconstruct_from_tokens.py  # Token → BVH\u002FJSON decoder\n│   ├── vllm_server.py          #    vLLM server for LLM inference\n│   ├── models\u002F                 #    Model definitions (RVQVAE, Mask Transformer)\n│   ├── actions\u002F                #    Post-processing (BVH\u002FJSON conversion)\n│   ├── utils\u002F                  #    Utilities and rotation tools\n│   └── meta\u002F                   #    Skeleton templates, normalization params\n├── evaluation\u002F                 # 📊 
Evaluation module (ChronAccRet)\n├── tools\u002F                      # 🔧 Visualization tools\n├── scripts\u002F                    # 🚀 Shell scripts\n├── data\u002F                       # 📁 Dataset (SuSuInterActs)\n└── checkpoints\u002F                # 💾 Model weights\n```\n\n## 📝 Citation\n\nIf you find this work useful, please cite our paper:\n\n```bibtex\n@article{jin2026sentiavatar,\n  title={SentiAvatar: Towards Expressive and Interactive Digital Humans},\n  author={Jin, Chuhao and Zhang, Rui and Gao, Qingzhe and Shi, Haoyu and Wu, Dayu and Jiang, Yichen and Wu, Yihan and Song, Ruihua},\n  journal={arXiv preprint arXiv:2604.02908},\n  year={2026}\n}\n```\n\n## ⭐ Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=SentiAvatar\u002FSentiAvatar&type=Date)](https:\u002F\u002Fstar-history.com\u002F#SentiAvatar\u002FSentiAvatar&Date)\n\n##### If you like this project, please give it a star ⭐! It would be a great encouragement for us and help more people discover this work.\n\n## 🙏 Acknowledgments\nThe authors would like to sincerely thank all collaborators for their valuable contributions to this work. In particular, special thanks to Shi Xueliang and Pan Xuanyue for leading the art design and data production efforts. The project also benefited greatly from the contributions of team members: Shi Xueliang, Yu Yongchang, Li Xing, and Liu Xueying in art design; Pan Xuanyue, Li Huixian, Yang Yijia, Zhang Wenxuan, and Wang Wei (UE) in data production. 
Their dedicated work and collaboration were essential to the successful completion of this research.\n\nWe also thank the following open-source projects:\n\n- [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) — High-throughput LLM inference engine\n- [HuggingFace Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) — Pre-trained model framework\n- [Chinese-HuBERT](https:\u002F\u002Fhuggingface.co\u002FTencentGameMate\u002Fchinese-hubert-base) — Chinese speech encoder\n- [Qwen2](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2) — Base language model\n\n## License\n\nThis project is licensed under [SentiPulse Non-Commercial Source License v1.0](LICENSE).\n\n**You are free to**: share, adapt, and build upon this work for non-commercial purposes.  \n**You may NOT**: use this project, its models, or data for any commercial purpose.\n\nFor commercial licensing, please contact the authors.\n\n---\n\n\u003Cp align=\"center\">\n  Made with ❤️ by the SentiPulse Team\n\u003C\u002Fp>\n","SentiAvatar is a framework for building expressive, interactive 3D digital humans. Its core capabilities are semantic-to-motion mapping and fine-grained frame-level motion-prosody synchronization, built on the large-scale multimodal SuSuInterActs dataset (21K clips, 37 hours in total). The project adopts a plan-then-infill architecture that decouples sentence-level semantic planning from frame-level prosody-driven interpolation, enabling real-time generation of high-quality 3D character motion and facial expressions. SentiAvatar suits applications that need virtual characters able to converse, gesture, and express emotion in real time, such as virtual assistants, online education, and games and entertainment.",2,"2026-06-11 02:42:49","CREATED_QUERY"]