[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82988":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":15,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":30,"readmeContent":31,"aiSummary":32,"trendingCount":16,"starSnapshotCount":16,"syncStatus":33,"lastSyncTime":34,"discoverSource":35},82988,"SoulX-Transcriber","Soul-AILab\u002FSoulX-Transcriber","Soul-AILab","An end-to-end framework for multi-speaker transcription that jointly models who spoke, when, and what.","https:\u002F\u002Fsoul-ailab.github.io\u002Fsoulx-transcriber\u002F",null,"Python",230,10,1,4,0,63,104,41,81.12,"Apache License 2.0",false,"main",[25,26,27,28,29],"asr","llm","sd","sdr","speech-recognition","2026-06-12 04:01:39","\u003Cdiv align=\"center\">\n  \u003Cimg src=\"figs\u002Fsoulx_transcriber_logo.png\" width=\"100%\" alt=\"Demo image\">\n\u003C\u002Fdiv>\n\n\u003Ch1 align=\"center\">SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription\u003C\u002Fh1>\n\n\u003Cdiv align=\"center\">\n\n\u003Cdiv style=\"text-align: center;\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.10+-blue\" alt=\"Python\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-green\" alt=\"License\">\n  \u003Ca href=\"https:\u002F\u002Fsoul-ailab.github.io\u002Fsoulx-transcriber\u002F\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemo-page-blue\" alt=\"Demo\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.02400\">\n    
\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-paper-red\" alt=\"arXiv Paper\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FSoul-AILab\u002FSoulX-Transcriber\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingFace-Models-ffd21e\" alt=\"HuggingFace\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FSoul-AILab\u002FSoulX-Transcriber\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGitHub-repo-black\" alt=\"GitHub\">\n  \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n  \u003Ch3>\n    Yuhang Dai\u003Csup>1,2\u003C\u002Fsup>\u003Csup>*\u003C\u002Fsup>, Haopeng Lin\u003Csup>2\u003C\u002Fsup>\u003Csup>*\u003C\u002Fsup>, Zhennan Lin\u003Csup>1\u003C\u002Fsup>, Jiale Qian\u003Csup>2\u003C\u002Fsup>, Jun Wu\u003Csup>2\u003C\u002Fsup>, Hao Meng\u003Csup>2\u003C\u002Fsup>, Hanke Xie\u003Csup>1,2\u003C\u002Fsup>, Hanlin Wen\u003Csup>2\u003C\u002Fsup>, Chuang Ding\u003Csup>3\u003C\u002Fsup>, Shunshun Yin\u003Csup>2\u003C\u002Fsup>, Ming Tao\u003Csup>2\u003C\u002Fsup>, Lei Xie\u003Csup>1\u003C\u002Fsup>, Xinsheng Wang\u003Csup>2†\u003C\u002Fsup>\n  \u003C\u002Fh3>\n\n  \u003Cp>\n    \u003Csup>*\u003C\u002Fsup>Equal contribution.&nbsp;&nbsp;\n    \u003Csup>†\u003C\u002Fsup>Corresponding author\n  \u003C\u002Fp>\n\n  \u003Cp>\n    \u003Csup>1\u003C\u002Fsup>Audio, Speech and Language Processing Group (ASLP@NPU), Northwestern Polytechnical University, Xi’an, China\u003Cbr>\n    \u003Csup>2\u003C\u002Fsup>Soul AI Lab, China\u003Cbr>\n    \u003Csup>3\u003C\u002Fsup>Moonstep AI, China\u003Cbr>\n  \u003C\u002Fp>\n\u003C\u002Fdiv>\n\n\n\n\n## 🎬 Demo Video\n\n\u003Cdiv align=\"center\">\n\n\u003Chttps:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F4020c95b-8cce-4611-a7b5-ffe6f49c1fb6>\n\n\u003C\u002Fdiv>\n\nPlease visit our  ✨[demopage](https:\u002F\u002Fsoul-ailab.github.io\u002Fsoulx-transcriber\u002F)✨ for 
more demos.\n\u003C!-- \u003Cdiv align=\"center\">\n  \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F9a57a227-9ac9-4bfb-961f-39df8f93b680\" controls width=\"80%\">\u003C\u002Fvideo>\n\u003C\u002Fdiv> -->\n\n## 🏆 SoulX-Transcriber Performance Overview\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"figs\u002Fperformance.png\" width=\"80%\" alt=\"Demo image\">\n\u003C\u002Fdiv>\n\n\n## 📖 Introduction\n\nSoulX-Transcriber is a unified end-to-end large audio language model for **multi-speaker diarization and recognition** in dialogue scenarios. Rather than relying on a cascaded pipeline, the model directly learns speaker attribution, timestamped segmentation, and transcription in a single framework, producing coherent, speaker-consistent transcripts for overlapping and fast-turn conversations. \n\n\n## 🌟 Highlights\n- **State-of-the-art performance**. SoulX-Transcriber achieves strong results on the AISHELL-4 and AliMeeting benchmarks, including the lowest DER among the compared systems, via a unified diarization and recognition framework that directly produces structured outputs consisting of timestamps, speaker labels, and transcripts.\n- **Speaker-aware multi-stage training**. Speaker-aware multi-task continued pre-training followed by supervised fine-tuning strengthens speaker representation and robustness in conversational speech, mitigating same-gender confusion, overlap, and boundary errors.\n- **A more natural and authentic approach to dialogue generation**. 
We propose a speaker characteristics-driven audio matching pipeline that automatically selects the most suitable reference audio for each utterance, producing more natural, context-aligned simulated dialogues.\n\n\n## 📊 Results\n\n### Utterance-level Evaluation on open-source datasets\n\n\n\u003Cdiv style=\"overflow-x: auto;\">\n\u003Ctable style=\"white-space: nowrap;\">\n  \u003Cthead>\n    \u003Ctr>\n      \u003Cth rowspan=\"2\">Model\u003C\u002Fth>\n      \u003Cth colspan=\"4\" style=\"text-align:center;\">AISHELL-4\u003C\u002Fth>\n      \u003Cth colspan=\"4\" style=\"text-align:center;\">Alimeeting\u003C\u002Fth>\n      \u003Cth colspan=\"4\" style=\"text-align:center;\">AMI-SDM\u003C\u002Fth>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Cth>DER↓\u003C\u002Fth>\u003Cth>WER↓\u003C\u002Fth>\u003Cth>cpWER↓\u003C\u002Fth>\u003Cth>∆cp↓\u003C\u002Fth>\n      \u003Cth>DER↓\u003C\u002Fth>\u003Cth>WER↓\u003C\u002Fth>\u003Cth>cpWER↓\u003C\u002Fth>\u003Cth>∆cp↓\u003C\u002Fth>\n      \u003Cth>DER↓\u003C\u002Fth>\u003Cth>WER↓\u003C\u002Fth>\u003Cth>cpWER↓\u003C\u002Fth>\u003Cth>∆cp↓\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\n      \u003Ctd>VibeVoice-ASR\u003C\u002Ftd>\n      \u003Ctd>6.77\u003C\u002Ftd>\u003Ctd>21.40\u003C\u002Ftd>\u003Ctd>24.99\u003C\u002Ftd>\u003Ctd>3.59\u003C\u002Ftd>\n      \u003Ctd>10.92\u003C\u002Ftd>\u003Ctd>27.40\u003C\u002Ftd>\u003Ctd>29.33\u003C\u002Ftd>\u003Ctd>1.93\u003C\u002Ftd>\n      \u003Ctd>13.43\u003C\u002Ftd>\u003Ctd>24.65\u003C\u002Ftd>\u003Ctd>28.82\u003C\u002Ftd>\u003Ctd>4.17\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>Gemini-2.5-Pro†\u003C\u002Ftd>\n      \u003Ctd>36.07\u003C\u002Ftd>\u003Ctd>19.81\u003C\u002Ftd>\u003Ctd>25.11\u003C\u002Ftd>\u003Ctd>5.30\u003C\u002Ftd>\n      \u003Ctd>56.39\u003C\u002Ftd>\u003Ctd>30.16\u003C\u002Ftd>\u003Ctd>39.29\u003C\u002Ftd>\u003Ctd>9.13\u003C\u002Ftd>\n      
\u003Ctd>50.28\u003C\u002Ftd>\u003Ctd>31.66\u003C\u002Ftd>\u003Ctd>39.98\u003C\u002Ftd>\u003Ctd>8.32\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>Gemini-3.1-pro-preview†\u003C\u002Ftd>\n      \u003Ctd>24.84\u003C\u002Ftd>\u003Ctd>24.86\u003C\u002Ftd>\u003Ctd>24.81\u003C\u002Ftd>\u003Ctd>-0.05\u003C\u002Ftd>\n      \u003Ctd>30.76\u003C\u002Ftd>\u003Ctd>18.82\u003C\u002Ftd>\u003Ctd>18.99\u003C\u002Ftd>\u003Ctd>0.17\u003C\u002Ftd>\n      \u003Ctd>40.40\u003C\u002Ftd>\u003Ctd>30.82\u003C\u002Ftd>\u003Ctd>32.97\u003C\u002Ftd>\u003Ctd>2.15\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>Qwen3.5-omni†\u003C\u002Ftd>\n      \u003Ctd>22.33\u003C\u002Ftd>\u003Ctd>15.13\u003C\u002Ftd>\u003Ctd>14.71\u003C\u002Ftd>\u003Ctd>\u003Cb>-0.42\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd>26.46\u003C\u002Ftd>\u003Ctd>\u003Cb>12.44\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>12.79\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>0.35\u003C\u002Ftd>\n      \u003Ctd>30.05\u003C\u002Ftd>\u003Ctd>28.57\u003C\u002Ftd>\u003Ctd>33.46\u003C\u002Ftd>\u003Ctd>4.89\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>\u003Cb>SoulX-Transcriber\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd>\u003Cb>2.89\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>14.16\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>13.90\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>-0.26\u003C\u002Ftd>\n      \u003Ctd>\u003Cb>5.39\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>13.07\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>13.61\u003C\u002Ftd>\u003Ctd>\u003Cb>0.54\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd>\u003Cb>11.67\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cu>25.55\u003C\u002Fu>\u003C\u002Ftd>\u003Ctd>32.78\u003C\u002Ftd>\u003Ctd>7.23\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\u003C\u002Fdiv>\n\n### Segmented Evaluation (5-minute segments)\n\n\u003Cdiv style=\"overflow-x: auto;\">\n\u003Ctable style=\"white-space: nowrap;\">\n  \u003Cthead>\n    \u003Ctr>\n      \u003Cth 
rowspan=\"2\">Model\u003C\u002Fth>\n      \u003Cth colspan=\"4\" style=\"text-align:center;\">Alimeeting\u003C\u002Fth>\n      \u003Cth colspan=\"4\" style=\"text-align:center;\">AISHELL-4\u003C\u002Fth>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Cth>DER↓\u003C\u002Fth>\u003Cth>CER↓\u003C\u002Fth>\u003Cth>cpCER↓\u003C\u002Fth>\u003Cth>∆cp↓\u003C\u002Fth>\n      \u003Cth>DER↓\u003C\u002Fth>\u003Cth>CER↓\u003C\u002Fth>\u003Cth>cpCER↓\u003C\u002Fth>\u003Cth>∆cp↓\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\u003Ctd colspan=\"9\">\u003Cb>End-to-End Baselines\u003C\u002Fb>\u003C\u002Ftd>\u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>VibeVoice-ASR\u003C\u002Ftd>\n      \u003Ctd>\u003Cu>18.00\u003C\u002Fu>\u003C\u002Ftd>\u003Ctd>29.72\u003C\u002Ftd>\u003Ctd>31.94\u003C\u002Ftd>\u003Ctd>2.22\u003C\u002Ftd>\n      \u003Ctd>\u003Cu>9.17\u003C\u002Fu>\u003C\u002Ftd>\u003Ctd>19.54\u003C\u002Ftd>\u003Ctd>22.95\u003C\u002Ftd>\u003Ctd>3.41\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>Gemini-2.5-Pro†\u003C\u002Ftd>\n      \u003Ctd>58.14\u003C\u002Ftd>\u003Ctd>31.69\u003C\u002Ftd>\u003Ctd>42.22\u003C\u002Ftd>\u003Ctd>10.53\u003C\u002Ftd>\n      \u003Ctd>40.87\u003C\u002Ftd>\u003Ctd>20.26\u003C\u002Ftd>\u003Ctd>26.31\u003C\u002Ftd>\u003Ctd>6.05\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>Gemini-3.1-pro-preview†\u003C\u002Ftd>\n      \u003Ctd>38.75\u003C\u002Ftd>\u003Ctd>26.75\u003C\u002Ftd>\u003Ctd>32.84\u003C\u002Ftd>\u003Ctd>6.09\u003C\u002Ftd>\n      \u003Ctd>22.03\u003C\u002Ftd>\u003Ctd>22.75\u003C\u002Ftd>\u003Ctd>27.43\u003C\u002Ftd>\u003Ctd>4.68\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>Qwen3-omni-30B-Instruct\u003C\u002Ftd>\n      \u003Ctd>38.36\u003C\u002Ftd>\u003Ctd>25.28\u003C\u002Ftd>\u003Ctd>37.54\u003C\u002Ftd>\u003Ctd>12.26\u003C\u002Ftd>\n      
\u003Ctd>34.71\u003C\u002Ftd>\u003Ctd>\u003Cu>15.95\u003C\u002Fu>\u003C\u002Ftd>\u003Ctd>23.63\u003C\u002Ftd>\u003Ctd>7.68\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\u003Ctd colspan=\"9\">\u003Cb>Ours\u003C\u002Fb>\u003C\u002Ftd>\u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>\u003Cb>SoulX-Transcriber\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd>\u003Cb>4.40\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>10.34\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>11.58\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>1.24\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd>\u003Cb>6.12\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>12.87\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>15.45\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cu>2.58\u003C\u002Fu>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\u003C\u002Fdiv>\n\n### Internal Multi-domain Evaluation\n\u003Cdiv style=\"overflow-x: auto;\">\n\u003Ctable style=\"white-space: nowrap;\">\n  \u003Cthead>\n    \u003Ctr>\n      \u003Cth rowspan=\"2\">Model\u003C\u002Fth>\n      \u003Cth colspan=\"4\" style=\"text-align:center;\">Social conversation\u003C\u002Fth>\n      \u003Cth colspan=\"4\" style=\"text-align:center;\">Drama\u003C\u002Fth>\n      \u003Cth colspan=\"4\" style=\"text-align:center;\">Podcast\u003C\u002Fth>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Cth>DER↓\u003C\u002Fth>\u003Cth>WER↓\u003C\u002Fth>\u003Cth>cpWER↓\u003C\u002Fth>\u003Cth>∆cp↓\u003C\u002Fth>\n      \u003Cth>DER↓\u003C\u002Fth>\u003Cth>WER↓\u003C\u002Fth>\u003Cth>cpWER↓\u003C\u002Fth>\u003Cth>∆cp↓\u003C\u002Fth>\n      \u003Cth>DER↓\u003C\u002Fth>\u003Cth>WER↓\u003C\u002Fth>\u003Cth>cpWER↓\u003C\u002Fth>\u003Cth>∆cp↓\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\n      \u003Ctd>VibeVoice-ASR\u003C\u002Ftd>\n      \u003Ctd>2.76\u003C\u002Ftd>\u003Ctd>30.34\u003C\u002Ftd>\u003Ctd>31.77\u003C\u002Ftd>\u003Ctd>1.43\u003C\u002Ftd>\n      
\u003Ctd>27.78\u003C\u002Ftd>\u003Ctd>21.86\u003C\u002Ftd>\u003Ctd>45.87\u003C\u002Ftd>\u003Ctd>24.01\u003C\u002Ftd>\n      \u003Ctd>\u003Cb>14.7\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>8.88\u003C\u002Ftd>\u003Ctd>\u003Cb>14.58\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>5.7\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>Gemini-3.1-pro-preview†\u003C\u002Ftd>\n      \u003Ctd>38.69\u003C\u002Ftd>\u003Ctd>29.14\u003C\u002Ftd>\u003Ctd>36.72\u003C\u002Ftd>\u003Ctd>7.58\u003C\u002Ftd>\n      \u003Ctd>34.87\u003C\u002Ftd>\u003Ctd>10.01\u003C\u002Ftd>\u003Ctd>21.03\u003C\u002Ftd>\u003Ctd>11.02\u003C\u002Ftd>\n      \u003Ctd>24.56\u003C\u002Ftd>\u003Ctd>23.89\u003C\u002Ftd>\u003Ctd>27.21\u003C\u002Ftd>\u003Ctd>\u003Cb>3.32\u003C\u002Fb>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>\u003Cb>SoulX-Transcriber\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd>\u003Cb>1.32\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>6.73\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>7.31\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>0.58\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd>\u003Cb>23.56\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>5.17\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>20.58\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>\u003Cb>15.41\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd>21.15\u003C\u002Ftd>\u003Ctd>\u003Cb>7.5\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd>19.37\u003C\u002Ftd>\u003Ctd>11.87\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\u003C\u002Fdiv>\n\n† Closed-source model.\n\n## 🧪 Multi-speaker Dialogue Simulation Pipeline\n\nTo improve out-of-domain generalization, we build an **agent-based multi-speaker dialogue simulation pipeline** with a **speaker-aware prompt audio matching** mechanism. 
Given a target dialogue text, the system analyzes speaker tags, selects the most suitable **reference audio** for each speaker using **multi-dimensional speaker representations**, and synthesizes context-consistent multi-turn dialogue audio.\n\n**Workflow:** build the dialogue text database → build the reference audio database → analyze the target text → match reference audio → generate dialogue audio. Details are shown in the figure below.\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"figs\u002Fsimulation_pipeline.png\" width=\"100%\" alt=\"simulation_pipeline\">\n\u003C\u002Fdiv>\n\n- **Dialogue text database.** We collect multi-speaker dialogue texts from Chinese\u002FEnglish podcasts and novels. An LLM annotates speaker tags and controls the number of speakers; we keep segments with **3–8 speakers** to ensure natural, coherent dialogue context.\n\n- **Dialogue context analysis.** We use **Qwen3-8B** as the LLM for speaker-tag and context analysis, and **SoulX-Podcast & MOSS-TTSD** for long-form, multi-speaker, multi-turn TTS synthesis.\n\n- **Reference audio database.** We run VAD on long-form drama audio, cut it into **3–10 s** clips, and filter by **UTMOS** and **SNR** to ensure quality. Each clip is annotated by **Gemini-3.1-pro-preview** with multi-dimensional speaker attributes (e.g., gender\u002Fage\u002Femotion\u002Fspeech rate\u002Fpitch\u002Ftimbre\u002Fstyle\u002Ftone\u002Frole state). We embed each attribute using **bge-m3** and stack the embeddings into a per-clip feature matrix, forming an embedding index for retrieval.\n\n- **Best reference-audio matching.** Given a target dialogue with speaker tags, an LLM analyzes each speaker’s attributes and builds the same multi-dimensional embedding representation. We compute similarity against all reference clips, apply a weighted score across attribute dimensions, and retrieve the **top-k (k=3)** candidates per speaker. 
A final selection enforces diversity (different source speakers) and **UTMOS consistency (|Δ| ≤ 0.5)** to produce the best reference audio set for synthesis.\n\n\n## Installation\n\n### Environment Setup\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FSoul-AILab\u002FSoulX-Transcriber.git\ncd SoulX-Transcriber\n\nconda create -n soulx_transcriber python=3.12 -y\nconda activate soulx_transcriber\n```\n\nInstall ms-swift and its dependencies:\n\n```bash\npip install ms-swift\n```\n\n## Model Download\n\nWe provide the pre-trained model weights on Hugging Face and ModelScope. Download them from whichever host suits you:\n\n| Model Version | Description | Language | Download |\n| :--- | :--- | :---: | :---: |\n| **SoulX-Transcriber** | Full version of SoulX-Transcriber | ZH\u002FEN | [🤗 Hugging Face](https:\u002F\u002Fhuggingface.co\u002FSoul-AILab\u002FSoulX-Transcriber) |\n| **SoulX-Transcriber** | Full version of SoulX-Transcriber | ZH\u002FEN | [\u003Cimg src=\"https:\u002F\u002Favatars.githubusercontent.com\u002Fu\u002F109945100?s=48&v=4\" width=\"20\" height=\"20\" alt=\"ModelScope\" \u002F> ModelScope](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FSoul-AILab\u002FSoulX-Transcriber) |\n\n## Training & Fine-tuning\nSoulX-Transcriber shares its architecture with Qwen3-Omni-30B-A3B-Instruct. We recommend running continued pre-training and fine-tuning of this model with the [ms-swift](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fms-swift) toolkit.\n## Inference\n\n### vLLM-omni\n\nSoulX-Transcriber is built on top of [Qwen3-Omni-30B-A3B-Instruct](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen3-Omni-30B-A3B-Instruct). 
We recommend using vllm-omni for inference.\n\n```bash\ncd your_env_path\u002F\n# install uv:\ncurl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\n# create a new uv environment (using the Aliyun mirror)\nuv venv vllm_omni --python 3.12 --seed --index-url https:\u002F\u002Fmirrors.aliyun.com\u002Fpypi\u002Fsimple\u002F\n# activate the uv environment\nsource vllm_omni\u002Fbin\u002Factivate\n# install vllm:\nuv pip install vllm --torch-backend=auto --index-url https:\u002F\u002Fmirrors.aliyun.com\u002Fpypi\u002Fsimple\u002F\n# install vllm-omni:\nuv pip install vllm-omni --index-url https:\u002F\u002Fmirrors.aliyun.com\u002Fpypi\u002Fsimple\u002F\n# install gradio (optional):\nuv pip install 'vllm-omni[demo]' --index-url https:\u002F\u002Fmirrors.aliyun.com\u002Fpypi\u002Fsimple\u002F\n# If you hit an \"Undefined symbol\" error while using VLLM_USE_PRECOMPILED=1, use \"pip install -e . -v\" to build from source.\ngit clone https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm-omni.git\ncd vllm-omni\nuv pip install -e . 
--index-url https:\u002F\u002Fmirrors.aliyun.com\u002Fpypi\u002Fsimple\u002F\n```\n> For more details on compiling vLLM from source, refer to the [vLLM official documentation](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Fgetting_started\u002Finstallation\u002Fgpu.html#set-up-using-python-only-build-without-compilation).\n\n### Infer single wav file\n\n\n```bash\n# stage1: download pretrained model\n# stage2: inference\nsource your_env_path\u002Fvllm_omni\u002Fbin\u002Factivate  # source the env\nbash .\u002Finference.sh\n```\n\n### Infer single wav file with retry mechanism\n\n\n```bash\n# stage1: download pretrained model\n# stage2: inference\nsource your_env_path\u002Fvllm_omni\u002Fbin\u002Factivate  # source the env\nbash .\u002Finference_with_retry.sh\n```\n\n\n\n## 🙏 Acknowledgements\n\nSpecial thanks to the following open-source projects:\n\n- [Qwen3-omni](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-Omni)\n- [Qwen3](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3)\n- [SoulX-Podcast](https:\u002F\u002Fgithub.com\u002FSoul-AILab\u002FSoulX-Podcast)\n- [MOSS-TTSD](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTSD)\n- [FireRedASR](https:\u002F\u002Fgithub.com\u002FFireRedTeam\u002FFireRedASR)\n- [Paraformer](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fiic\u002Fspeech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch)\n- [utmos](https:\u002F\u002Fgithub.com\u002Ffakerybakery\u002Futmos)\n\n\n\n\n\n## Citation\n\nIf you find this work useful, please cite:\n\n```bibtex\n@misc{dai2026soulxtranscriber,\n      title={SoulX-Transcriber: A Robust End-to-End Framework for Multi-Speaker Speech Transcription}, \n      author={Yuhang Dai and Haopeng Lin and Zhennan Lin and Jiale Qian and Jun Wu and Hanke Xie and Hao Meng and Hanlin Wen and Chuang Ding and Shunshun Yin and Ming Tao and Lei Xie and Xinsheng Wang},\n      year={2026},\n      eprint={2606.02400},\n      archivePrefix={arXiv},\n      primaryClass={eess.AS},\n   
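   url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.02400}, \n}\n```\n\n## License\nWe use the **Apache 2.0 License**. Researchers and developers are free to use the code and model weights of SoulX-Transcriber. See [LICENSE](LICENSE) for more details.\n## Contact\n\n- **Issues**: Please open a GitHub Issue for bug reports or suggestions.\n- **Email**: yhdai@mail.nwpu.edu.cn, haopenglin@soulapp.cn, lxie@nwpu.edu.cn, wangxinsheng@soulapp.cn\n\n","SoulX-Transcriber is an end-to-end framework for multi-speaker transcription that jointly identifies who spoke, when, and what was said. The project uses a unified model architecture that directly learns speaker attribution, timestamped segmentation, and speech-to-text transcription. It targets multi-party conversations and performs particularly well on overlapping or fast-turn dialogue. Developed in Python and released under the Apache License 2.0, SoulX-Transcriber shows state-of-the-art performance on the AISHELL-4 and AliMeeting benchmarks and produces integrated output containing timestamps, speaker labels, and text, which makes it well suited to applications that need high-accuracy multi-speaker transcription, such as meeting minutes and podcast analysis.",2,"2026-06-11 04:09:47","CREATED_QUERY"]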