[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-994":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":24,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":16,"starSnapshotCount":16,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},994,"MOSS-Audio","OpenMOSS\u002FMOSS-Audio","OpenMOSS","MOSS-Audio is an open-source foundation model for unified audio understanding, enabling speech, sound, music, captioning, QA, and reasoning in real-world scenarios.","https:\u002F\u002Fopenmoss.github.io\u002FMOSS-Audio\u002F",null,"Python",569,39,8,13,0,12,60,129,36,8.81,false,"main",true,[],"2026-06-12 02:00:21","# MOSS-Audio\n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fmoss-audio-logo.png\" width=\"55%\" \u002F>\n\u003C\u002Fp>\n\n\n\n\u003Cdiv align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FOpenMOSS-Team\u002Fmoss-audio\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Models-orange?logo=huggingface&amp\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fcollections\u002Fopenmoss\u002FMOSS-Audio\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Models-624AFF?logo=data:image\u002Fsvg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAyNCAyNCI+PHBhdGggZmlsbD0id2hpdGUiIGQ9Ik0xMiAyQzYuNDggMiAyIDYuNDggMiAxMnM0LjQ4IDEwIDEwIDEwIDEwLTQuNDggMTAtMTBTMTcuNTIgMiAxMiAyeiIvPjwvc3ZnPg==&amp\">\u003C\u002Fa>\n  \u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-Coming_Soon-blue?logo=internet-explorer&amp\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-Coming_Soon-red?logo=Arxiv&amp\">\n\n  \u003Ca href=\"https:\u002F\u002Fx.com\u002FOpen_MOSS\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTwitter-Follow-black?logo=x&amp\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FXf3aXddCjc\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDiscord-Join-5865F2?logo=discord&amp\">\u003C\u002Fa>\n  \u003Ca href=\".\u002Fassets\u002Fwechat.png\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWeChat-Join-07C160?logo=wechat&amp;logoColor=white\" alt=\"WeChat\">\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\".\u002FREADME.md\">English\u003C\u002Fa> | \u003Ca href=\".\u002FREADME_zh.md\">简体中文\u003C\u002Fa>\n\u003C\u002Fp>\n\n\n\n\nMOSS-Audio is an open-source **audio understanding model** from [MOSI.AI](https:\u002F\u002Fmosi.cn\u002F#hero), the [OpenMOSS team](https:\u002F\u002Fwww.open-moss.com\u002F), and [Shanghai Innovation Institute](https:\u002F\u002Fwww.sii.edu.cn\u002F). It performs unified modeling over complex real-world audio, supporting **speech understanding, environmental sound understanding, music understanding, audio captioning, time-aware QA, and complex reasoning**. In this release, we provide **four models**: **MOSS-Audio-4B-Instruct**, **MOSS-Audio-4B-Thinking**, **MOSS-Audio-8B-Instruct**, and **MOSS-Audio-8B-Thinking**. The Instruct variants are optimized for direct instruction following, while the Thinking variants provide stronger chain-of-thought reasoning capabilities.\n\n\n## News\n* 2026.4.20: We have added the MOSS-Audio fine-tuning code and documentation. 
See `finetune\u002FFINETUNE.md` for LoRA and full-parameter training examples.\n* 2026.4.13: 🎉🎉🎉 We have released [MOSS-Audio](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FOpenMOSS-Team\u002Fmoss-audio). Blog and paper coming soon!\n\n\n## Contents\n\n- [Introduction](#introduction)\n- [Model Architecture](#model-architecture)\n  - [DeepStack Cross-Layer Feature Injection](#deepstack-cross-layer-feature-injection)\n  - [Time-Aware Representation](#time-aware-representation)\n- [Released Models](#released-models)\n- [Evaluation](#evaluation)\n- [Quickstart](#quickstart)\n  - [Environment Setup](#environment-setup)\n  - [Basic Usage](#basic-usage)\n  - [Fine-tuning](#fine-tuning)\n  - [Gradio App](#gradio-app)\n  - [SGLang Serving](#sglang-serving)\n- [More Information](#more-information)\n- [Citation](#citation)\n\n\n## Introduction\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fmoss-audio-image.png\" width=\"95%\" \u002F>\n\u003C\u002Fp>\n\n\n\nUnderstanding audio requires more than simply transcribing words — it demands the ability to perceive acoustic cues, recognize speakers and emotions, interpret environmental sounds, reason over temporal context, and handle complex multi-step inference. **MOSS-Audio** is built to unify these capabilities within a single model.\n\n- **Speech & Content Understanding**: Accurately recognizes and transcribes spoken content from audio inputs, producing clean and well-structured text outputs. 
Supports both word-level and sentence-level timestamp alignment.\n- **Speaker, Emotion & Event Analysis**: Identifies speaker characteristics, analyzes emotional states based on tone, timbre, and context, and detects key acoustic events within the audio.\n- **Scene & Sound Cue Extraction**: Extracts meaningful cues from background sounds, environmental noise, music, and non-speech signals to infer scene context and atmosphere.\n- **Music Understanding**: Analyzes musical style, emotional progression, instrumentation, and salient acoustic features in music segments.\n- **Audio Question Answering & Summarization**: Answers questions and generates summaries about speech, podcasts, meetings, interviews, and environmental recordings, helping users efficiently extract key information.\n- **Time-Aware QA**: Supports time-aware questions, including word-level and sentence-level timestamp ASR.\n- **Complex Reasoning**: Performs multi-hop reasoning over audio content, powered by chain-of-thought training and reinforcement learning.\n\n## Model Architecture\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Farc2.png\" width=\"95%\" \u002F>\n\u003C\u002Fp>\n\nMOSS-Audio follows a modular design comprising three components: an audio encoder, a modality adapter, and a large language model. Raw audio is first encoded by **MOSS-Audio-Encoder** into continuous temporal representations at **12.5 Hz**, which are then projected into the language model's embedding space through the adapter and finally consumed by the LLM for auto-regressive text generation. \n\nRather than relying on off-the-shelf audio frontends, we train a dedicated encoder from scratch to obtain more robust speech representations, tighter temporal alignment, and better extensibility across acoustic domains.\n\n\n### DeepStack Cross-Layer Feature Injection\n\nUsing only the encoder's top-layer features tends to lose low-level prosody, transient events, and local time-frequency structure. 
To address this, we design a **DeepStack**-inspired cross-layer injection module between the encoder and the language model: in addition to the encoder's final-layer output, features from earlier and intermediate layers are selected, independently projected, and injected into the language model's early layers, preserving multi-granularity information from low-level acoustic details to high-level semantic abstractions.\n\nThis design is especially well-suited for audio understanding tasks, as it helps retain rhythm, timbre, transients, and background structure — information that a single high-level representation cannot fully capture.\n\n### Time-Aware Representation\n\nTime is a critical dimension in audio understanding. To enhance explicit temporal awareness, we adopt a **time-marker insertion** strategy during pretraining: explicit time tokens are inserted between audio frame representations at fixed time intervals to indicate temporal positions. This design enables the model to learn \"what happened when\" within a unified text generation framework, naturally supporting timestamp ASR, event localization, time-based QA, and long-audio retrospection.\n\n\n## Released Models\n\n\n| Model | Audio Encoder | LLM Backbone | Total Size | Hugging Face | ModelScope |\n|---|---|---|---:|---|---|\n| **MOSS-Audio-4B-Instruct** | MOSS-Audio-Encoder | Qwen3-4B | ~4.6B | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-Audio-4B-Instruct) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-624AFF)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-Audio-4B-Instruct) |\n| **MOSS-Audio-4B-Thinking** | MOSS-Audio-Encoder | Qwen3-4B | ~4.6B | [![Hugging 
Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-Audio-4B-Thinking) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-624AFF)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-Audio-4B-Thinking) |\n| **MOSS-Audio-8B-Instruct** | MOSS-Audio-Encoder | Qwen3-8B | ~8.6B | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-Audio-8B-Instruct) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-624AFF)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-Audio-8B-Instruct) |\n| **MOSS-Audio-8B-Thinking** | MOSS-Audio-Encoder | Qwen3-8B | ~8.6B | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-Audio-8B-Thinking) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-624AFF)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-Audio-8B-Thinking) |\n\n> More model families, sizes, and variants will be released in the future. Stay tuned!\n\n\n## Evaluation\n\nWe evaluate MOSS-Audio on a comprehensive set of audio understanding benchmarks. 
Key results:\n\n- **General Audio Understanding**: MOSS-Audio-8B-Thinking achieves an average accuracy of **71.08**, with **77.33** on MMAU, **64.92** on MMAU-Pro, **66.53** on MMAR, and **75.52** on MMSU, the highest average among the open-source models compared, including those with 30B+ parameters.\n- **Speech Captioning**: MOSS-Audio-Instruct variants lead across **11 out of 13** fine-grained speech description dimensions, with **MOSS-Audio-8B-Instruct** achieving the best overall average score (**3.7252**).\n- **ASR**: On a diverse ASR benchmark suite spanning 12 evaluation dimensions, MOSS-Audio-8B-Instruct achieves the **lowest overall CER (11.30)**, with particular strength in health-condition, code-switching, dialect, singing, and non-speech scenarios.\n- **Timestamp ASR**: MOSS-Audio-8B-Instruct achieves **35.77 AAS** on AISHELL-1 and **131.61 AAS** on LibriSpeech (lower is better), far ahead of Qwen3-Omni-30B-A3B-Instruct (833.66) and Gemini-3.1-Pro (708.24) on AISHELL-1.\n\n### General Audio Understanding (Accuracy↑)\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fgeneral_audio_bar.svg\" width=\"75%\" \u002F>\n\u003C\u002Fp>\n\n\u003Ctable>\n  \u003Cthead>\n    \u003Ctr>\n      \u003Cth>Model\u003C\u002Fth>\n      \u003Cth>Model Size\u003C\u002Fth>\n      \u003Cth>MMAU\u003C\u002Fth>\n      \u003Cth>MMAU-Pro\u003C\u002Fth>\n      \u003Cth>MMAR\u003C\u002Fth>\n      \u003Cth>MMSU\u003C\u002Fth>\n      \u003Cth>Avg\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\u003Ctd colspan=\"7\">\u003Cem>\u003Cstrong>Open Source (small)\u003C\u002Fstrong>\u003C\u002Fem>\u003C\u002Ftd>\u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>Kimi-Audio\u003C\u002Ftd>\u003Ctd>7B\u003C\u002Ftd>\u003Ctd>72.41\u003C\u002Ftd>\u003Ctd>56.58\u003C\u002Ftd>\u003Ctd>60.82\u003C\u002Ftd>\u003Ctd>54.74\u003C\u002Ftd>\u003Ctd>61.14\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      
\u003Ctd>Qwen2.5-Omni\u003C\u002Ftd>\u003Ctd>7B\u003C\u002Ftd>\u003Ctd>65.60\u003C\u002Ftd>\u003Ctd>52.20\u003C\u002Ftd>\u003Ctd>56.70\u003C\u002Ftd>\u003Ctd>61.32\u003C\u002Ftd>\u003Ctd>58.96\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>Audio Flamingo 3\u003C\u002Ftd>\u003Ctd>7B\u003C\u002Ftd>\u003Ctd>61.23\u003C\u002Ftd>\u003Ctd>51.70\u003C\u002Ftd>\u003Ctd>57.96\u003C\u002Ftd>\u003Ctd>60.04\u003C\u002Ftd>\u003Ctd>57.73\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>MiMo-Audio-7B\u003C\u002Ftd>\u003Ctd>7B\u003C\u002Ftd>\u003Ctd>74.90\u003C\u002Ftd>\u003Ctd>53.35\u003C\u002Ftd>\u003Ctd>61.70\u003C\u002Ftd>\u003Ctd>61.94\u003C\u002Ftd>\u003Ctd>62.97\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>MiniCPM-o-4.5\u003C\u002Ftd>\u003Ctd>9B\u003C\u002Ftd>\u003Ctd>70.97\u003C\u002Ftd>\u003Ctd>39.65\u003C\u002Ftd>\u003Ctd>55.75\u003C\u002Ftd>\u003Ctd>60.96\u003C\u002Ftd>\u003Ctd>56.83\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>\u003Cstrong>MOSS-Audio-4B-Instruct\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>\u003Cstrong>4B\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>75.79\u003C\u002Ftd>\u003Ctd>58.16\u003C\u002Ftd>\u003Ctd>59.68\u003C\u002Ftd>\u003Ctd>59.68\u003C\u002Ftd>\u003Ctd>64.04\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>\u003Cstrong>MOSS-Audio-4B-Thinking\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>\u003Cstrong>4B\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>\u003Cstrong>77.64\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>60.75\u003C\u002Ftd>\u003Ctd>63.91\u003C\u002Ftd>\u003Ctd>71.20\u003C\u002Ftd>\u003Ctd>68.37\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>\u003Cstrong>MOSS-Audio-8B-Instruct\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>\u003Cstrong>8B\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>77.03\u003C\u002Ftd>\u003Ctd>57.48\u003C\u002Ftd>\u003Ctd>64.42\u003C\u002Ftd>\u003Ctd>66.36\u003C\u002Ftd>\u003Ctd>66.32\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    
\u003Ctr>\n      \u003Ctd>\u003Cstrong>MOSS-Audio-8B-Thinking\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>\u003Cstrong>8B\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>\u003Cstrong>77.33\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>\u003Cstrong>64.92\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>\u003Cstrong>66.53\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>\u003Cstrong>75.52\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>\u003Cstrong>71.08\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\u003Ctd colspan=\"7\">\u003Cem>\u003Cstrong>Open Source (large)\u003C\u002Fstrong>\u003C\u002Fem>\u003C\u002Ftd>\u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>Qwen3-Omni-30B-A3B-Instruct\u003C\u002Ftd>\u003Ctd>30B\u003C\u002Ftd>\u003Ctd>75.00\u003C\u002Ftd>\u003Ctd>\u003Cstrong>61.22\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>66.40\u003C\u002Ftd>\u003Ctd>69.00\u003C\u002Ftd>\u003Ctd>67.91\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>Step-Audio-R1.1\u003C\u002Ftd>\u003Ctd>33B\u003C\u002Ftd>\u003Ctd>72.18\u003C\u002Ftd>\u003Ctd>60.80\u003C\u002Ftd>\u003Ctd>68.75\u003C\u002Ftd>\u003Ctd>64.18\u003C\u002Ftd>\u003Ctd>66.48\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>Step-Audio-R1\u003C\u002Ftd>\u003Ctd>33B\u003C\u002Ftd>\u003Ctd>\u003Cstrong>78.67\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>59.68\u003C\u002Ftd>\u003Ctd>\u003Cstrong>69.15\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>\u003Cstrong>75.18\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>\u003Cstrong>70.67\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\u003Ctd colspan=\"7\">\u003Cem>\u003Cstrong>Closed Source\u003C\u002Fstrong>\u003C\u002Fem>\u003C\u002Ftd>\u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>GPT4o-Audio\u003C\u002Ftd>\u003Ctd>-\u003C\u002Ftd>\u003Ctd>65.66\u003C\u002Ftd>\u003Ctd>52.30\u003C\u002Ftd>\u003Ctd>59.78\u003C\u002Ftd>\u003Ctd>58.76\u003C\u002Ftd>\u003Ctd>59.13\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      
\u003Ctd>Gemini-3-Pro\u003C\u002Ftd>\u003Ctd>-\u003C\u002Ftd>\u003Ctd>80.15\u003C\u002Ftd>\u003Ctd>68.28\u003C\u002Ftd>\u003Ctd>81.73\u003C\u002Ftd>\u003Ctd>81.28\u003C\u002Ftd>\u003Ctd>77.86\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd>Gemini-3.1-Pro\u003C\u002Ftd>\u003Ctd>-\u003C\u002Ftd>\u003Ctd>\u003Cstrong>81.10\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>\u003Cstrong>73.47\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>\u003Cstrong>83.70\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>\u003Cstrong>81.30\u003C\u002Fstrong>\u003C\u002Ftd>\u003Ctd>\u003Cstrong>79.89\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n### Speech Captioning (LLM-as-a-Judge Score↑)\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fspeech_caption_radar.png\" width=\"70%\" \u002F>\n\u003C\u002Fp>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>Speech Captioning (click to expand)\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n\n| Model | gender | age | accent | pitch | volume | speed | texture | clarity | fluency | emotion | tone | personality | summary | Avg |\n|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|\n| Qwen3-Omni-30B-A3B-Instruct | 4.436 | 3.936 | 4.356 | 3.590 | 3.682 | 3.614 | 3.093 | 3.521 | 3.531 | 3.328 | 3.224 | 3.292 | 3.179 | 3.5986 |\n| Qwen3-Omni-30B-A3B-Thinking | 4.419 | **4.026** | 4.327 | 3.610 | 3.577 | 3.610 | 3.179 | 3.403 | 3.526 | 3.232 | 3.154 | 3.197 | 3.107 | 3.5667 |\n| Gemini-3-Pro | 4.191 | 3.835 | 4.181 | 3.392 | 3.254 | 3.320 | 2.998 | 3.347 | 3.524 | 3.055 | 2.997 | 3.023 | 2.775 | 3.3763 |\n| Gemini-3.1-Pro| 4.436 | 3.936 | 4.356 | 3.590 | 3.682 | 3.614 | 3.093 | 3.521 | 3.531 | **3.328** | 3.224 | 3.292 | 3.179 | 3.5986 |\n| MOSS-Audio-4B-Instruct | **4.697** | 3.980 | 4.497 | 3.628 | **3.722** | 3.564 | **3.407** | 3.841 | 3.744 | 3.311 | **3.282** | **3.305** | 3.259 | 3.7105 |\n| MOSS-Audio-8B-Instruct | 4.683 | 3.979 | **4.572** | **3.682** | 3.709 
| **3.638** | 3.403 | **3.869** | **3.747** | 3.314 | 3.253 | 3.272 | **3.307** | **3.7252** |\n\n\u003C\u002Fdetails>\n\n### ASR \n\n| Model | Overall | Health Condition | Dialect | Singing | Non-Speech Vocalizations | Code-Switching | Acoustic Environment (Clean) | Acoustic Environment (Noisy) | Acoustic Characteristics: Whisper | Acoustic Characteristics: Far-Field \u002F Near-Field | Multi-Speaker | Age | Semantic Content |\n|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|\n| Paraformer-Large | 15.77 | 22.18 | 43.45 | 32.34 | 4.95 | 12.65 | 3.11 | 4.67 | 5.02 | 17.46 | 20.33 | 14.96 | 7.14 |\n| GLM-ASR-Nano | 17.29 | 24.49 | 22.39 | 51.95 | 4.65 | 11.88 | 3.68 | 5.02 | 4.94 | 27.51 | 28.02 | 17.19 | 7.32 |\n| Fun-ASR-Nano | 12.04 | 21.99 | 7.80 | 19.35 | 4.76 | 11.23 | 2.98 | 3.46 | 3.78 | 18.38 | 19.82 | **14.95** | 6.08 |\n| SenseVoice-Small | 14.50 | 24.04 | 8.89 | 23.79 | 4.92 | 13.90 | 4.13 | 4.93 | 5.57 | 26.66 | 24.06 | 17.63 | 7.55 |\n| Kimi-Audio-7B-Instruct | 14.12 | 21.11 | 29.34 | 21.76 | 4.68 | 16.38 | **2.20** | **2.15** | 2.66 | 21.02 | 20.61 | 16.74 | 6.12 |\n| Qwen2.5-Omni-3B | 15.26 | 24.65 | 33.87 | 24.24 | 5.54 | 11.66 | 2.76 | 3.56 | 4.32 | 22.15 | 22.91 | 15.17 | 7.24 |\n| Qwen2.5-Omni-7B | 15.05 | 23.85 | 31.91 | 22.69 | 4.56 | 12.97 | 2.52 | 3.16 | 3.64 | 25.38 | 21.01 | 16.13 | 6.78 |\n| Qwen3-Omni-30B-A3B-Instruct | 11.39 | 20.73 | 15.63 | 16.01 | 4.73 | 11.30 | 2.23 | 2.47 | **1.90** | **17.08** | **18.15** | **11.46** | **5.74** |\n| **MOSS-Audio-4B-Instruct** | 11.58 | 21.11 | 11.84 | 10.79 | **4.01** | **10.11** | 3.11 | 3.72 | 3.29 | 18.48 | 20.33 | 15.09 | 8.15 |\n| **MOSS-Audio-8B-Instruct** | **11.30** | **19.18** | **8.76** | **9.81** | 4.31 | 10.18 | 2.70 | 3.20 | 2.75 | 24.04 | 24.36 | 15.26 | 7.69 |\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>Detailed ASR Results (click to expand)\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Cth rowspan=\"2\">Model\u003C\u002Fth>\n    
\u003Cth colspan=\"3\">Acoustic Environment (Clean)\u003C\u002Fth>\n    \u003Cth colspan=\"1\">Acoustic Environment (Noisy)\u003C\u002Fth>\n    \u003Cth colspan=\"1\">Acoustic Characteristics: Whisper\u003C\u002Fth>\n    \u003Cth colspan=\"1\">Acoustic Characteristics: Far-Field \u002F Near-Field\u003C\u002Fth>\n    \u003Cth colspan=\"1\">Multi-Speaker\u003C\u002Fth>\n    \u003Cth colspan=\"2\">Age\u003C\u002Fth>\n    \u003Cth colspan=\"2\">Health Condition\u003C\u002Fth>\n    \u003Cth colspan=\"2\">Semantic Content\u003C\u002Fth>\n    \u003Cth colspan=\"3\">Code-Switching\u003C\u002Fth>\n    \u003Cth colspan=\"2\">Dialect\u003C\u002Fth>\n    \u003Cth colspan=\"2\">Singing\u003C\u002Fth>\n    \u003Cth colspan=\"1\">Non-Speech Vocalizations\u003C\u002Fth>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Cth>AISHELL-1\u003Cbr>\u003Cem>test\u003C\u002Fem>\u003C\u002Fth>\n    \u003Cth>AISHELL-2\u003Cbr>\u003Cem>Android | IOS | Mic\u003C\u002Fem>\u003C\u002Fth>\n    \u003Cth>THCHS-30\u003Cbr>\u003Cem>test\u003C\u002Fem>\u003C\u002Fth>\n    \u003Cth>MAGICDATA-READ\u003Cbr>\u003Cem>test\u003C\u002Fem>\u003C\u002Fth>\n    \u003Cth>AISHELL6-Whisper\u003Cbr>\u003Cem>normal | whisper\u003C\u002Fem>\u003C\u002Fth>\n    \u003Cth>AliMeeting\u003Cbr>\u003Cem>Test_Ali_far | Test_Ali_near\u003C\u002Fem>\u003C\u002Fth>\n    \u003Cth>AISHELL-4\u003Cbr>\u003Cem>test\u003C\u002Fem>\u003C\u002Fth>\n    \u003Cth>SeniorTalk\u003Cbr>\u003Cem>sentence\u003C\u002Fem>\u003C\u002Fth>\n    \u003Cth>ChildMandarin\u003Cbr>\u003Cem>test\u003C\u002Fem>\u003C\u002Fth>\n    \u003Cth>AISHELL-6A\u003Cbr>\u003Cem>mild | moderate | severe | StutteringSpeech\u003C\u002Fem>\u003C\u002Fth>\n    \u003Cth>AISHELL_6B\u003Cbr>\u003Cem>LRDWWS | Uncontrol\u003C\u002Fem>\u003C\u002Fth>\n    \u003Cth>WenetSpeech\u003Cbr>\u003Cem>test-meeting\u003C\u002Fem>\u003C\u002Fth>\n    \u003Cth>Fleurs\u003Cbr>\u003Cem>cmn_hans_cn\u003C\u002Fem>\u003C\u002Fth>\n    
\u003Cth>CS-Dialogue\u003Cbr>\u003Cem>test\u003C\u002Fem>\u003C\u002Fth>\n    \u003Cth>TALCS\u003Cbr>\u003Cem>test\u003C\u002Fem>\u003C\u002Fth>\n    \u003Cth>ASCEND\u003Cbr>\u003Cem>test\u003C\u002Fem>\u003C\u002Fth>\n    \u003Cth>KeSpeech\u003Cbr>\u003Cem>test\u003C\u002Fem>\u003C\u002Fth>\n    \u003Cth>WSYue-ASR-eval\u003Cbr>\u003Cem>short\u003C\u002Fem>\u003C\u002Fth>\n    \u003Cth>MIR-1K\u003Cbr>\u003Cem>test\u003C\u002Fem>\u003C\u002Fth>\n    \u003Cth>openc-pop\u003Cbr>\u003Cem>test\u003C\u002Fem>\u003C\u002Fth>\n    \u003Cth>MNV_17\u003C\u002Fth>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>Paraformer-Large\u003C\u002Ftd>\n    \u003Ctd>1.98\u003C\u002Ftd>\n    \u003Ctd>3.28 | 3.21 | 3.00\u003C\u002Ftd>\n    \u003Ctd>4.07\u003C\u002Ftd>\n    \u003Ctd>4.67\u003C\u002Ftd>\n    \u003Ctd>1.11 | 8.92\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>25.64\u003C\u002Fstrong> | 9.27\u003C\u002Ftd>\n    \u003Ctd>20.33\u003C\u002Ftd>\n    \u003Ctd>17.31\u003C\u002Ftd>\n    \u003Ctd>12.60\u003C\u002Ftd>\n    \u003Ctd>6.98 | 9.30 | 13.34 | 10.74\u003C\u002Ftd>\n    \u003Ctd>47.59 | 45.08\u003C\u002Ftd>\n    \u003Ctd>7.88\u003C\u002Ftd>\n    \u003Ctd>6.40\u003C\u002Ftd>\n    \u003Ctd>10.64\u003C\u002Ftd>\n    \u003Ctd>10.77\u003C\u002Ftd>\n    \u003Ctd>16.55\u003C\u002Ftd>\n    \u003Ctd>11.48\u003C\u002Ftd>\n    \u003Ctd>75.42\u003C\u002Ftd>\n    \u003Ctd>57.70\u003C\u002Ftd>\n    \u003Ctd>6.98\u003C\u002Ftd>\n    \u003Ctd>4.95\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>GLM-ASR-Nano\u003C\u002Ftd>\n    \u003Ctd>2.89\u003C\u002Ftd>\n    \u003Ctd>3.75 | 3.73 | 3.78\u003C\u002Ftd>\n    \u003Ctd>4.23\u003C\u002Ftd>\n    \u003Ctd>5.02\u003C\u002Ftd>\n    \u003Ctd>0.83 | 9.06\u003C\u002Ftd>\n    \u003Ctd>40.27 | 14.76\u003C\u002Ftd>\n    \u003Ctd>28.02\u003C\u002Ftd>\n    \u003Ctd>20.33\u003C\u002Ftd>\n    \u003Ctd>14.06\u003C\u002Ftd>\n    \u003Ctd>8.74 | 12.11 | 14.38 | 12.29\u003C\u002Ftd>\n    \u003Ctd>50.34 | 49.09\u003C\u002Ftd>\n    
\u003Ctd>9.70\u003C\u002Ftd>\n    \u003Ctd>4.94\u003C\u002Ftd>\n    \u003Ctd>11.06\u003C\u002Ftd>\n    \u003Ctd>11.07\u003C\u002Ftd>\n    \u003Ctd>13.50\u003C\u002Ftd>\n    \u003Ctd>9.72\u003C\u002Ftd>\n    \u003Ctd>35.07\u003C\u002Ftd>\n    \u003Ctd>95.87\u003C\u002Ftd>\n    \u003Ctd>8.03\u003C\u002Ftd>\n    \u003Ctd>4.65\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>Fun-ASR-Nano\u003C\u002Ftd>\n    \u003Ctd>2.16\u003C\u002Ftd>\n    \u003Ctd>3.04 | 2.99 | 3.07\u003C\u002Ftd>\n    \u003Ctd>3.65\u003C\u002Ftd>\n    \u003Ctd>3.46\u003C\u002Ftd>\n    \u003Ctd>0.81 | 6.76\u003C\u002Ftd>\n    \u003Ctd>27.21 | 9.55\u003C\u002Ftd>\n    \u003Ctd>19.82\u003C\u002Ftd>\n    \u003Ctd>16.96\u003C\u002Ftd>\n    \u003Ctd>12.94\u003C\u002Ftd>\n    \u003Ctd>6.60 | \u003Cstrong>8.81\u003C\u002Fstrong> | 12.98 | 10.30\u003C\u002Ftd>\n    \u003Ctd>47.42 | 45.84\u003C\u002Ftd>\n    \u003Ctd>7.39\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>4.76\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>10.47\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>8.09\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>15.13\u003C\u002Ftd>\n    \u003Ctd>7.43\u003C\u002Ftd>\n    \u003Ctd>8.17\u003C\u002Ftd>\n    \u003Ctd>35.85\u003C\u002Ftd>\n    \u003Ctd>2.84\u003C\u002Ftd>\n    \u003Ctd>4.76\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>SenseVoice-Small\u003C\u002Ftd>\n    \u003Ctd>3.23\u003C\u002Ftd>\n    \u003Ctd>4.16 | 4.02 | 3.96\u003C\u002Ftd>\n    \u003Ctd>5.26\u003C\u002Ftd>\n    \u003Ctd>4.93\u003C\u002Ftd>\n    \u003Ctd>1.25 | 9.88\u003C\u002Ftd>\n    \u003Ctd>37.01 | 16.31\u003C\u002Ftd>\n    \u003Ctd>24.06\u003C\u002Ftd>\n    \u003Ctd>21.07\u003C\u002Ftd>\n    \u003Ctd>14.18\u003C\u002Ftd>\n    \u003Ctd>7.62 | 9.85 | 14.39 | 11.47\u003C\u002Ftd>\n    \u003Ctd>52.92 | 47.97\u003C\u002Ftd>\n    \u003Ctd>8.35\u003C\u002Ftd>\n    \u003Ctd>6.75\u003C\u002Ftd>\n    \u003Ctd>12.81\u003C\u002Ftd>\n    \u003Ctd>10.52\u003C\u002Ftd>\n    \u003Ctd>18.38\u003C\u002Ftd>\n    
\u003Ctd>10.45\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>7.34\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>39.51\u003C\u002Ftd>\n    \u003Ctd>8.07\u003C\u002Ftd>\n    \u003Ctd>4.92\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>Kimi-Audio-7B-Instruct\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>0.79\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>2.91 | 3.03 | 2.88\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>1.39\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>2.15\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>0.69 | 4.63\u003C\u002Ftd>\n    \u003Ctd>28.22 | 13.82\u003C\u002Ftd>\n    \u003Ctd>20.61\u003C\u002Ftd>\n    \u003Ctd>19.70\u003C\u002Ftd>\n    \u003Ctd>13.79\u003C\u002Ftd>\n    \u003Ctd>7.00 | 9.34 | 12.56 | 10.75\u003C\u002Ftd>\n    \u003Ctd>44.44 | 42.57\u003C\u002Ftd>\n    \u003Ctd>7.15\u003C\u002Ftd>\n    \u003Ctd>5.10\u003C\u002Ftd>\n    \u003Ctd>14.56\u003C\u002Ftd>\n    \u003Ctd>12.74\u003C\u002Ftd>\n    \u003Ctd>21.83\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>5.51\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>53.17\u003C\u002Ftd>\n    \u003Ctd>38.35\u003C\u002Ftd>\n    \u003Ctd>5.17\u003C\u002Ftd>\n    \u003Ctd>4.68\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>Qwen2.5-Omni-3B\u003C\u002Ftd>\n    \u003Ctd>1.51\u003C\u002Ftd>\n    \u003Ctd>3.10 | 2.94 | 2.93\u003C\u002Ftd>\n    \u003Ctd>3.32\u003C\u002Ftd>\n    \u003Ctd>3.56\u003C\u002Ftd>\n    \u003Ctd>0.82 | 7.82\u003C\u002Ftd>\n    \u003Ctd>32.14 | 12.16\u003C\u002Ftd>\n    \u003Ctd>22.91\u003C\u002Ftd>\n    \u003Ctd>17.38\u003C\u002Ftd>\n    \u003Ctd>12.96\u003C\u002Ftd>\n    \u003Ctd>6.87 | 10.55 | 14.57 | 11.33\u003C\u002Ftd>\n    \u003Ctd>54.54 | 50.03\u003C\u002Ftd>\n    \u003Ctd>9.04\u003C\u002Ftd>\n    \u003Ctd>5.45\u003C\u002Ftd>\n    \u003Ctd>10.78\u003C\u002Ftd>\n    \u003Ctd>10.94\u003C\u002Ftd>\n    \u003Ctd>13.25\u003C\u002Ftd>\n    \u003Ctd>7.67\u003C\u002Ftd>\n    \u003Ctd>60.06\u003C\u002Ftd>\n    \u003Ctd>45.00\u003C\u002Ftd>\n    
\u003Ctd>3.47\u003C\u002Ftd>\n    \u003Ctd>5.54\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>Qwen2.5-Omni-7B\u003C\u002Ftd>\n    \u003Ctd>1.16\u003C\u002Ftd>\n    \u003Ctd>2.88 | 2.77 | 2.73\u003C\u002Ftd>\n    \u003Ctd>3.06\u003C\u002Ftd>\n    \u003Ctd>3.16\u003C\u002Ftd>\n    \u003Ctd>0.71 | 6.57\u003C\u002Ftd>\n    \u003Ctd>32.03 | 18.73\u003C\u002Ftd>\n    \u003Ctd>21.01\u003C\u002Ftd>\n    \u003Ctd>19.96\u003C\u002Ftd>\n    \u003Ctd>12.29\u003C\u002Ftd>\n    \u003Ctd>7.27 | 10.94 | 12.92 | 10.53\u003C\u002Ftd>\n    \u003Ctd>51.99 | 49.45\u003C\u002Ftd>\n    \u003Ctd>8.43\u003C\u002Ftd>\n    \u003Ctd>5.13\u003C\u002Ftd>\n    \u003Ctd>14.02\u003C\u002Ftd>\n    \u003Ctd>10.46\u003C\u002Ftd>\n    \u003Ctd>14.42\u003C\u002Ftd>\n    \u003Ctd>6.40\u003C\u002Ftd>\n    \u003Ctd>57.43\u003C\u002Ftd>\n    \u003Ctd>42.62\u003C\u002Ftd>\n    \u003Ctd>2.75\u003C\u002Ftd>\n    \u003Ctd>4.56\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>Qwen3-Omni-30B-A3B-Instruct\u003C\u002Ftd>\n    \u003Ctd>0.95\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>2.70\u003C\u002Fstrong> | \u003Cstrong>2.72\u003C\u002Fstrong> | \u003Cstrong>2.57\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>2.21\u003C\u002Ftd>\n    \u003Ctd>2.47\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>0.59\u003C\u002Fstrong> | \u003Cstrong>3.22\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>25.72 | \u003Cstrong>8.44\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>18.15\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>14.13\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>8.79\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>6.20 | 8.88 | 11.59 | 10.25\u003C\u002Ftd>\n    \u003Ctd>45.80 | 41.65\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>6.64\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>4.84\u003C\u002Ftd>\n    \u003Ctd>12.94\u003C\u002Ftd>\n    \u003Ctd>8.33\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>12.64\u003C\u002Fstrong>\u003C\u002Ftd>\n    
\u003Ctd>5.87\u003C\u002Ftd>\n    \u003Ctd>25.39\u003C\u002Ftd>\n    \u003Ctd>30.81\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>1.21\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>4.73\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>\u003Cstrong>MOSS-Audio-4B-Instruct\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>2.26\u003C\u002Ftd>\n    \u003Ctd>3.22 | 3.20 | 3.33\u003C\u002Ftd>\n    \u003Ctd>3.53\u003C\u002Ftd>\n    \u003Ctd>3.72\u003C\u002Ftd>\n    \u003Ctd>0.73 | 5.86\u003C\u002Ftd>\n    \u003Ctd>27.27 | 9.68\u003C\u002Ftd>\n    \u003Ctd>20.33\u003C\u002Ftd>\n    \u003Ctd>16.93\u003C\u002Ftd>\n    \u003Ctd>13.25\u003C\u002Ftd>\n    \u003Ctd>6.36 | 9.77 | 12.68 | 10.28\u003C\u002Ftd>\n    \u003Ctd>43.35 | 44.25\u003C\u002Ftd>\n    \u003Ctd>8.17\u003C\u002Ftd>\n    \u003Ctd>8.13\u003C\u002Ftd>\n    \u003Ctd>9.14\u003C\u002Ftd>\n    \u003Ctd>8.37\u003C\u002Ftd>\n    \u003Ctd>12.83\u003C\u002Ftd>\n    \u003Ctd>14.65\u003C\u002Ftd>\n    \u003Ctd>9.04\u003C\u002Ftd>\n    \u003Ctd>18.47\u003C\u002Ftd>\n    \u003Ctd>3.10\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>4.01\u003C\u002Fstrong>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>\u003Cstrong>MOSS-Audio-8B-Instruct\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>1.82\u003C\u002Ftd>\n    \u003Ctd>2.97 | 2.95 | 2.91\u003C\u002Ftd>\n    \u003Ctd>2.82\u003C\u002Ftd>\n    \u003Ctd>3.20\u003C\u002Ftd>\n    \u003Ctd>0.69 | 4.80\u003C\u002Ftd>\n    \u003Ctd>36.82 | 11.25\u003C\u002Ftd>\n    \u003Ctd>24.36\u003C\u002Ftd>\n    \u003Ctd>17.42\u003C\u002Ftd>\n    \u003Ctd>13.10\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>5.84\u003C\u002Fstrong> | 8.94 | \u003Cstrong>11.52\u003C\u002Fstrong> | \u003Cstrong>9.72\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>39.76\u003C\u002Fstrong> | \u003Cstrong>39.27\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>7.86\u003C\u002Ftd>\n    \u003Ctd>7.52\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>9.07\u003C\u002Fstrong>\u003C\u002Ftd>\n    
\u003Ctd>8.22\u003C\u002Ftd>\n    \u003Ctd>13.26\u003C\u002Ftd>\n    \u003Ctd>9.18\u003C\u002Ftd>\n    \u003Ctd>8.33\u003C\u002Ftd>\n    \u003Ctd>\u003Cstrong>17.24\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003Ctd>2.39\u003C\u002Ftd>\n    \u003Ctd>4.31\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\u003C\u002Fdetails>\n\n\n### Timestamp ASR (AAS↓)\n\n| Model | AISHELL-1 (zh) | LibriSpeech (en) |\n|---|---:|---:|\n| Qwen3-Omni-30B-A3B-Instruct | 833.66 | 646.95 |\n| Gemini-3.1-Pro | 708.24 | 871.19 |\n| MOSS-Audio-4B-Instruct | 76.96 | 358.13 |\n| **MOSS-Audio-8B-Instruct** | **35.77** | **131.61** |\n\n\n## Quickstart\n\n### Environment Setup\n\nWe recommend Python 3.12 in a clean Conda environment. The commands below are sufficient for local inference.\n\n#### Recommended setup\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-Audio.git\ncd MOSS-Audio\n\nconda create -n moss-audio python=3.12 -y\nconda activate moss-audio\n\nconda install -c conda-forge \"ffmpeg=7\" -y\npip install --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu128 -e \".[torch-runtime]\"\n```\n\n#### Optional: FlashAttention 2\n\nIf your GPU supports FlashAttention 2, you can replace the last install command with:\n\n```bash\npip install --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu128 -e \".[torch-runtime,flash-attn]\"\n```\n\n\n### Basic Usage\n\nDownload the model weights first. Each command below fetches one variant, so run only the ones you need:\n\n```bash\nhf download OpenMOSS-Team\u002FMOSS-Audio-4B-Instruct --local-dir .\u002Fweights\u002FMOSS-Audio-4B-Instruct\nhf download OpenMOSS-Team\u002FMOSS-Audio-4B-Thinking --local-dir .\u002Fweights\u002FMOSS-Audio-4B-Thinking\nhf download OpenMOSS-Team\u002FMOSS-Audio-8B-Instruct --local-dir .\u002Fweights\u002FMOSS-Audio-8B-Instruct\nhf download OpenMOSS-Team\u002FMOSS-Audio-8B-Thinking --local-dir .\u002Fweights\u002FMOSS-Audio-8B-Thinking\n```\n\nThen edit `MODEL_PATH` \u002F `AUDIO_PATH` in `infer.py` as needed, and 
run:\n\n```bash\npython infer.py\n```\n\nThe default prompt in `infer.py` is `Describe this audio.` You can directly edit that line if you want to try transcription, audio QA, or speech captioning.\n\n\u003Ca id=\"fine-tuning\">\u003C\u002Fa>\n\n### Fine-tuning\n\nWe now provide an official fine-tuning script in `finetune\u002Ffinetune.py`, with full instructions in `finetune\u002FFINETUNE.md`.\n\nInstall the extra dependencies needed for training:\n\n```bash\npip install librosa peft\n```\n\nMinimal example for LoRA fine-tuning:\n\n```bash\naccelerate launch finetune\u002Ffinetune.py \\\n    --model_dir .\u002Fweights\u002FMOSS-Audio-4B-Instruct \\\n    --data_path train.jsonl \\\n    --output_dir .\u002Foutput\u002Flora \\\n    --use_lora \\\n    --bf16\n```\n\nThe training data should be a JSONL file containing audio-text conversations. For data format, supported arguments, multi-GPU examples, and full-parameter fine-tuning, see `finetune\u002FFINETUNE.md`.\n\n### Gradio App\n\nStart the Gradio demo with:\n\n```bash\npython app.py\n```\n\n\n\n### SGLang Serving\n\nIf you want to serve MOSS-Audio with SGLang, see the full guide in `moss_audio_usage_guide.md`.\n\nThe shortest setup is:\n\n```bash\ngit clone -b moss-audio https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002Fsglang.git\ncd sglang\npip install -e \"python[all]\"\npip install nvidia-cudnn-cu12==9.16.0.29\ncd ..\nsglang serve --model-path .\u002Fweights\u002FMOSS-Audio --trust-remote-code\n```\n\nIf you use the default `torch==2.9.1+cu128` runtime, installing `nvidia-cudnn-cu12==9.16.0.29` is recommended before starting `sglang serve`.\n\n\n\u003Ca id=\"more-information\">\u003C\u002Fa>\n\n## More Information\n- **MOSI.AI**: [https:\u002F\u002Fmosi.cn](https:\u002F\u002Fmosi.cn)\n- **OpenMOSS**: [https:\u002F\u002Fwww.open-moss.com](https:\u002F\u002Fwww.open-moss.com)\n\n\n## LICENSE\n\nModels in MOSS-Audio are licensed under the Apache License 2.0.\n\n\n## Citation\n\n```bibtex\n@misc{mossaudio2026,\n     
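 title={MOSS-Audio Technical Report},\n      author={OpenMOSS Team},\n      year={2026},\n      howpublished={\\url{https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-Audio}},\n      note={GitHub repository}\n}\n```\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=OpenMOSS\u002FMOSS-Audio&type=date&legend=top-left)](https:\u002F\u002Fwww.star-history.com\u002F#OpenMOSS\u002FMOSS-Audio&type=date&legend=top-left)\n","MOSS-Audio is an open-source audio understanding model that supports speech, environmental sound, and music understanding, as well as audio captioning, time-aware QA, and complex reasoning. The project is developed in Python and features core techniques such as DeepStack cross-layer feature injection and time-aware representations, which allow it to handle complex real-world audio. It provides four model configurations, including Instruct variants optimized for direct instruction following and Thinking variants with stronger chain-of-thought reasoning. It suits applications that need high-quality audio analysis and understanding, such as smart speakers, voice assistants, and automatic audio content tagging.",2,"2026-06-11 02:40:47","CREATED_QUERY"]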