[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74123":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},74123,"Qwen3-ASR","QwenLM\u002FQwen3-ASR","QwenLM","Qwen3-ASR is an open-source series of ASR models developed by the Qwen team at Alibaba Cloud, supporting stable multilingual speech\u002Fmusic\u002Fsong recognition, language detection and timestamp prediction.",null,"Python",2874,290,10,23,0,38,88,255,114,29.39,"Apache License 2.0",false,"main",[],"2026-06-12 02:03:22","# Qwen3-ASR\n\n\u003Cbr>\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Fqianwen-res.oss-cn-beijing.aliyuncs.com\u002FQwen3-ASR-Repo\u002Flogo.png\" width=\"400\"\u002F>\n\u003Cp>\n\n\u003Cp align=\"center\">\n&nbsp&nbsp🤗 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FQwen\u002Fqwen3-asr\">Hugging Face\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp🤖 \u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fcollections\u002FQwen\u002FQwen3-ASR\">ModelScope\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp📑 \u003Ca href=\"https:\u002F\u002Fqwen.ai\u002Fblog?id=qwen3asr\">Blog\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp📑 \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.21337\">Paper\u003C\u002Fa>&nbsp&nbsp\n\u003Cbr>\n🖥️ \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FQwen\u002FQwen3-ASR\">Hugging Face Demo\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp 🖥️ 
\u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fstudios\u002FQwen\u002FQwen3-ASR\">ModelScope Demo\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp💬 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen\u002Fblob\u002Fmain\u002Fassets\u002Fwechat.png\">WeChat (微信)\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp🫨 \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FCV4E9rpNSD\">Discord\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp📑 \u003Ca href=\"https:\u002F\u002Fhelp.aliyun.com\u002Fzh\u002Fmodel-studio\u002Fqwen-speech-recognition\">API\u003C\u002Fa>\n\n\u003C\u002Fp>\n\nWe release **Qwen3-ASR**, a family that includes two powerful all-in-one speech recognition models that support language identification and ASR for 52 languages and dialects, as well as a novel non-autoregressive speech forced-alignment model that can align text–speech pairs in 11 languages.\n\n\n## News\n* 2026.1.29: 🎉🎉🎉 We have released the [Qwen3-ASR](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FQwen\u002Fqwen3-asr) series (0.6B\u002F1.7B) and the Qwen3-ForcedAligner-0.6B model. 
Please check out our [blog](https:\u002F\u002Fqwen.ai\u002Fblog?id=qwen3asr)!\n\n\n## Contents \u003C!-- omit in toc -->\n\n- [Overview](#overview)\n  - [Introduction](#introduction)\n  - [Model Architecture](#model-architecture)\n  - [Released Models Description and Download](#released-models-description-and-download)\n- [Quickstart](#quickstart)\n  - [Environment Setup](#environment-setup)\n  - [Python Package Usage](#python-package-usage)\n    - [Quick Inference](#quick-inference)\n    - [vLLM Backend](#vllm-backend)\n    - [Streaming Inference](#streaming-inference)\n    - [ForcedAligner Usage](#forcedaligner-usage)\n  - [DashScope API Usage](#dashscope-api-usage)\n- [Launch Local Web UI Demo](#launch-local-web-ui-demo)\n  - [Gradio Demo](#gradio-demo)\n  - [Streaming Demo](#streaming-demo)\n- [Deployment with vLLM](#deployment-with-vllm)\n- [Fine Tuning](#fine-tuning)\n- [Docker](#docker)\n- [Evaluation](#evaluation)\n- [Citation](#citation)\n\n\n## Overview\n\n### Introduction\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Fqianwen-res.oss-cn-beijing.aliyuncs.com\u002FQwen3-ASR-Repo\u002Fqwen3_asr_introduction.png\" width=\"90%\"\u002F>\n\u003C\u002Fp>\n\nThe Qwen3-ASR family includes Qwen3-ASR-1.7B and Qwen3-ASR-0.6B, which support language identification and ASR for 52 languages and dialects. Both leverage large-scale speech training data and the strong audio understanding capability of their foundation model, Qwen3-Omni. Experiments show that the 1.7B version achieves state-of-the-art performance among open-source ASR models and is competitive with the strongest proprietary commercial APIs. 
Here are the main features:\n\n* **All-in-one**: Qwen3-ASR-1.7B and Qwen3-ASR-0.6B support language identification and speech recognition for 30 languages and 22 Chinese dialects, as well as English accents from multiple countries and regions.\n\n* **Excellent and Fast**: The Qwen3-ASR models maintain high-quality, robust recognition under complex acoustic environments and challenging text patterns. Qwen3-ASR-1.7B achieves strong performance on both open-source and internal benchmarks. The 0.6B version offers an accuracy-efficiency trade-off, reaching 2000 times throughput at a concurrency of 128. Both models unify streaming and offline inference in a single model and support transcribing long audio.\n\n* **Novel and strong forced alignment solution**: We introduce Qwen3-ForcedAligner-0.6B, which supports timestamp prediction for arbitrary units within up to 5 minutes of speech in 11 languages. Evaluations show its timestamp accuracy surpasses that of E2E-based forced-alignment models.\n\n* **Comprehensive inference toolkit**: In addition to open-sourcing the architectures and weights of the Qwen3-ASR series, we also release a powerful, full-featured inference framework that supports vLLM-based batch inference, asynchronous serving, streaming inference, timestamp prediction, and more.\n\n### Model Architecture\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Fqianwen-res.oss-cn-beijing.aliyuncs.com\u002FQwen3-ASR-Repo\u002Foverview.jpg\" width=\"100%\"\u002F>\n\u003C\u002Fp>\n\n\n### Released Models Description and Download\n\nBelow are descriptions of and download information for the Qwen3-ASR models. 
Please select and download the model that fits your needs.\n\n| Model | Supported Languages | Supported Dialects | Inference Mode | Audio Types |\n|---|---|---|---|---|\n| Qwen3-ASR-1.7B & Qwen3-ASR-0.6B | Chinese (zh), English (en), Cantonese (yue), Arabic (ar), German (de), French (fr), Spanish (es), Portuguese (pt), Indonesian (id), Italian (it), Korean (ko), Russian (ru), Thai (th), Vietnamese (vi), Japanese (ja), Turkish (tr), Hindi (hi), Malay (ms), Dutch (nl), Swedish (sv), Danish (da), Finnish (fi), Polish (pl), Czech (cs), Filipino (fil), Persian (fa), Greek (el), Hungarian (hu), Macedonian (mk), Romanian (ro) | Anhui, Dongbei, Fujian, Gansu, Guizhou, Hebei, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shandong, Shaanxi, Shanxi, Sichuan, Tianjin, Yunnan, Zhejiang, Cantonese (Hong Kong accent), Cantonese (Guangdong accent), Wu language, Minnan language. | Offline \u002F Streaming | Speech, Singing Voice, Songs with BGM |\n| Qwen3-ForcedAligner-0.6B | Chinese, English, Cantonese, French, German, Italian, Japanese, Korean, Portuguese, Russian, Spanish | -- | NAR | Speech |\n\nDuring model loading in the `qwen-asr` package or vLLM, model weights will be downloaded automatically based on the model name. 
However, if your runtime environment does not allow downloading weights during execution, you can use the following commands to manually download the model weights to a local directory:\n\n```bash\n# Download through ModelScope (recommended for users in Mainland China)\npip install -U modelscope\nmodelscope download --model Qwen\u002FQwen3-ASR-1.7B  --local_dir .\u002FQwen3-ASR-1.7B\nmodelscope download --model Qwen\u002FQwen3-ASR-0.6B --local_dir .\u002FQwen3-ASR-0.6B\nmodelscope download --model Qwen\u002FQwen3-ForcedAligner-0.6B --local_dir .\u002FQwen3-ForcedAligner-0.6B\n# Download through Hugging Face\npip install -U \"huggingface_hub[cli]\"\nhuggingface-cli download Qwen\u002FQwen3-ASR-1.7B --local-dir .\u002FQwen3-ASR-1.7B\nhuggingface-cli download Qwen\u002FQwen3-ASR-0.6B --local-dir .\u002FQwen3-ASR-0.6B\nhuggingface-cli download Qwen\u002FQwen3-ForcedAligner-0.6B --local-dir .\u002FQwen3-ForcedAligner-0.6B\n```\n\n\n## Quickstart\n\n### Environment Setup\n\nThe easiest way to use Qwen3-ASR is to install the `qwen-asr` Python package from PyPI. This will pull in the required runtime dependencies and allow you to load any released Qwen3-ASR model. If you’d like to simplify environment setup further, you can also use our official [Docker image](#docker). The `qwen-asr` package provides two backends: the transformers backend and the vLLM backend. For usage instructions for different backends, please refer to [Python Package Usage](#python-package-usage). We recommend using a **fresh, isolated environment** to avoid dependency conflicts with existing packages. 
You can create a clean Python 3.12 environment like this:\n\n```bash\nconda create -n qwen3-asr python=3.12 -y\nconda activate qwen3-asr\n```\n\nRun the following command to get the minimal installation with transformers-backend support:\n\n```bash\npip install -U qwen-asr\n```\n\nTo enable the vLLM backend for faster inference and streaming support, run:\n\n```bash\npip install -U qwen-asr[vllm]\n```\n\nIf you want to develop or modify the code locally, install from source in editable mode:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-ASR.git\ncd Qwen3-ASR\npip install -e .\n# support vLLM backend\n# pip install -e \".[vllm]\"\n```\n\nAdditionally, we recommend using FlashAttention 2 to reduce GPU memory usage and accelerate inference speed, especially for long inputs and large batch sizes.\n\n```bash\npip install -U flash-attn --no-build-isolation\n```\n\nIf your machine has less than 96GB of RAM and lots of CPU cores, run:\n\n```bash\nMAX_JOBS=4 pip install -U flash-attn --no-build-isolation\n```\n\nAlso, you should have hardware that is compatible with FlashAttention 2. Read more about it in the official documentation of the [FlashAttention repository](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention). FlashAttention 2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.\n\n### Python Package Usage\n\n#### Quick Inference\n\nThe `qwen-asr` package provides two backends: **transformers backend** and **vLLM backend**. You can pass audio inputs as a local path, a URL, base64 data, or a `(np.ndarray, sr)` tuple, and run batch inference. 
To quickly try Qwen3-ASR, you can use `Qwen3ASRModel.from_pretrained(...)` for the transformers backend with the following code:\n\n```python\nimport torch\nfrom qwen_asr import Qwen3ASRModel\n\nmodel = Qwen3ASRModel.from_pretrained(\n    \"Qwen\u002FQwen3-ASR-1.7B\",\n    dtype=torch.bfloat16,\n    device_map=\"cuda:0\",\n    # attn_implementation=\"flash_attention_2\",\n    max_inference_batch_size=32, # Batch size limit for inference. -1 means unlimited. Smaller values can help avoid OOM.\n    max_new_tokens=256, # Maximum number of tokens to generate. Set a larger value for long audio input.\n)\n\nresults = model.transcribe(\n    audio=\"https:\u002F\u002Fqianwen-res.oss-cn-beijing.aliyuncs.com\u002FQwen3-ASR-Repo\u002Fasr_en.wav\",\n    language=None, # set \"English\" to force the language\n)\n\nprint(results[0].language)\nprint(results[0].text)\n```\n\nIf you want to return timestamps, pass `forced_aligner` and its init kwargs. Here is an example of batch inference with timestamps output:\n\n```python\nimport torch\nfrom qwen_asr import Qwen3ASRModel\n\nmodel = Qwen3ASRModel.from_pretrained(\n    \"Qwen\u002FQwen3-ASR-1.7B\",\n    dtype=torch.bfloat16,\n    device_map=\"cuda:0\",\n    # attn_implementation=\"flash_attention_2\",\n    max_inference_batch_size=32, # Batch size limit for inference. -1 means unlimited. Smaller values can help avoid OOM.\n    max_new_tokens=256, # Maximum number of tokens to generate. 
Set a larger value for long audio input.\n    forced_aligner=\"Qwen\u002FQwen3-ForcedAligner-0.6B\",\n    forced_aligner_kwargs=dict(\n        dtype=torch.bfloat16,\n        device_map=\"cuda:0\",\n        # attn_implementation=\"flash_attention_2\",\n    ),\n)\n\nresults = model.transcribe(\n    audio=[\n      \"https:\u002F\u002Fqianwen-res.oss-cn-beijing.aliyuncs.com\u002FQwen3-ASR-Repo\u002Fasr_zh.wav\",\n      \"https:\u002F\u002Fqianwen-res.oss-cn-beijing.aliyuncs.com\u002FQwen3-ASR-Repo\u002Fasr_en.wav\",\n    ],\n    language=[\"Chinese\", \"English\"], # can also be set to None for automatic language detection\n    return_time_stamps=True,\n)\n\nfor r in results:\n    print(r.language, r.text, r.time_stamps[0])\n```\n\nFor more detailed usage examples, please refer to the [example code](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-ASR\u002Fblob\u002Fmain\u002Fexamples\u002Fexample_qwen3_asr_transformers.py) for the transformers backend.\n\n#### vLLM Backend\n\nIf you want the fastest inference speed with Qwen3-ASR, we strongly recommend using the vLLM backend by initializing the model with `Qwen3ASRModel.LLM(...)`. Example code is provided below. Note that you must install it via `pip install -U qwen-asr[vllm]`. If you want the model to output timestamps, it’s best to install FlashAttention via `pip install -U flash-attn --no-build-isolation` to speed up inference for the forced aligner model. Remember to wrap your code under `if __name__ == '__main__':` to avoid the `spawn` error described in [vLLM Troubleshooting](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Fusage\u002Ftroubleshooting\u002F#python-multiprocessing).\n\n```python\nimport torch\nfrom qwen_asr import Qwen3ASRModel\n\nif __name__ == '__main__':\n    model = Qwen3ASRModel.LLM(\n        model=\"Qwen\u002FQwen3-ASR-1.7B\",\n        gpu_memory_utilization=0.7,\n        max_inference_batch_size=128, # Batch size limit for inference. -1 means unlimited. 
Smaller values can help avoid OOM.\n        max_new_tokens=4096, # Maximum number of tokens to generate. Set a larger value for long audio input.\n        forced_aligner=\"Qwen\u002FQwen3-ForcedAligner-0.6B\",\n        forced_aligner_kwargs=dict(\n            dtype=torch.bfloat16,\n            device_map=\"cuda:0\",\n            # attn_implementation=\"flash_attention_2\",\n        ),\n    )\n\n    results = model.transcribe(\n        audio=[\n        \"https:\u002F\u002Fqianwen-res.oss-cn-beijing.aliyuncs.com\u002FQwen3-ASR-Repo\u002Fasr_zh.wav\",\n        \"https:\u002F\u002Fqianwen-res.oss-cn-beijing.aliyuncs.com\u002FQwen3-ASR-Repo\u002Fasr_en.wav\",\n        ],\n        language=[\"Chinese\", \"English\"], # can also be set to None for automatic language detection\n        return_time_stamps=True,\n    )\n\n    for r in results:\n        print(r.language, r.text, r.time_stamps[0])\n```\n\nFor more detailed usage examples, please refer to the [example code](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-ASR\u002Fblob\u002Fmain\u002Fexamples\u002Fexample_qwen3_asr_vllm.py) for the vLLM backend. In addition, you can start a vLLM server via the `qwen-asr-serve` command, which is a wrapper around `vllm serve`. 
You can pass any arguments supported by `vllm serve`, for example:\n\n```bash\nqwen-asr-serve Qwen\u002FQwen3-ASR-1.7B --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8000\n```\n\nAnd send requests to the server via:\n\n```python\nimport requests\n\nurl = \"http:\u002F\u002Flocalhost:8000\u002Fv1\u002Fchat\u002Fcompletions\"\nheaders = {\"Content-Type\": \"application\u002Fjson\"}\n\ndata = {\n    \"messages\": [\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\n                    \"type\": \"audio_url\",\n                    \"audio_url\": {\n                        \"url\": \"https:\u002F\u002Fqianwen-res.oss-cn-beijing.aliyuncs.com\u002FQwen3-ASR-Repo\u002Fasr_en.wav\"\n                    },\n                }\n            ],\n        }\n    ]\n}\n\nresponse = requests.post(url, headers=headers, json=data, timeout=300)\nresponse.raise_for_status()\ncontent = response.json()['choices'][0]['message']['content']\nprint(content)\n\n# parse ASR output if you want\nfrom qwen_asr import parse_asr_output\nlanguage, text = parse_asr_output(content)\nprint(language)\nprint(text)\n```\n\n#### Streaming Inference\n\nQwen3-ASR fully supports streaming inference. Currently, streaming inference is only available with the vLLM backend. Note that streaming inference does not support batch inference or returning timestamps. Please refer to the [example code](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-ASR\u002Fblob\u002Fmain\u002Fexamples\u002Fexample_qwen3_asr_vllm_streaming.py) for details. You can also launch a streaming web demo through the [guide](#streaming-demo) to experience Qwen3-ASR’s streaming transcription capabilities. \n\n#### ForcedAligner Usage\n\n`Qwen3-ForcedAligner-0.6B` can align text–speech pairs and return word or character level timestamps. 
Here is an example of using the forced aligner directly:\n\n```python\nimport torch\nfrom qwen_asr import Qwen3ForcedAligner\n\nmodel = Qwen3ForcedAligner.from_pretrained(\n    \"Qwen\u002FQwen3-ForcedAligner-0.6B\",\n    dtype=torch.bfloat16,\n    device_map=\"cuda:0\",\n    # attn_implementation=\"flash_attention_2\",\n)\n\nresults = model.align(\n    audio=\"https:\u002F\u002Fqianwen-res.oss-cn-beijing.aliyuncs.com\u002FQwen3-ASR-Repo\u002Fasr_zh.wav\",\n    text=\"甚至出现交易几乎停滞的情况。\",\n    language=\"Chinese\",\n)\n\nprint(results[0])\nprint(results[0][0].text, results[0][0].start_time, results[0][0].end_time)\n```\n\nIn addition, the forced aligner supports local paths \u002F URLs \u002F base64 data \u002F `(np.ndarray, sr)` inputs and batch inference. Please refer to the [example code](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-ASR\u002Fblob\u002Fmain\u002Fexamples\u002Fexample_qwen3_forced_aligner.py) for details.\n\n### DashScope API Usage\n\nTo further explore Qwen3-ASR, we encourage you to try our DashScope API for a faster and more efficient experience. For detailed API information and documentation, please refer to the following:\n\n| API Description | API Documentation (Mainland China) | API Documentation (International) |\n|------------------|-----------------------------------|------------------------------------|\n| Real-time API for Qwen3-ASR. | [https:\u002F\u002Fhelp.aliyun.com\u002Fzh\u002Fmodel-studio\u002Fqwen-real-time-speech-recognition](https:\u002F\u002Fhelp.aliyun.com\u002Fzh\u002Fmodel-studio\u002Fqwen-real-time-speech-recognition) | [https:\u002F\u002Fwww.alibabacloud.com\u002Fhelp\u002Fen\u002Fmodel-studio\u002Fqwen-real-time-speech-recognition](https:\u002F\u002Fwww.alibabacloud.com\u002Fhelp\u002Fen\u002Fmodel-studio\u002Fqwen-real-time-speech-recognition) |\n| FileTrans API for Qwen3-ASR. 
| [https:\u002F\u002Fhelp.aliyun.com\u002Fzh\u002Fmodel-studio\u002Fqwen-speech-recognition](https:\u002F\u002Fhelp.aliyun.com\u002Fzh\u002Fmodel-studio\u002Fqwen-speech-recognition) | [https:\u002F\u002Fwww.alibabacloud.com\u002Fhelp\u002Fen\u002Fmodel-studio\u002Fqwen-speech-recognition](https:\u002F\u002Fwww.alibabacloud.com\u002Fhelp\u002Fen\u002Fmodel-studio\u002Fqwen-speech-recognition) |\n\n\n## Launch Local Web UI Demo\n\n### Gradio Demo\n\nTo launch the Qwen3-ASR web UI gradio demo, install the `qwen-asr` package and run `qwen-asr-demo`. Use the command below for help:\n\n```bash\nqwen-asr-demo --help\n```\n\nTo launch the demo, you can use the following commands:\n\n```bash\n# Transformers backend\nqwen-asr-demo \\\n  --asr-checkpoint Qwen\u002FQwen3-ASR-1.7B \\\n  --backend transformers \\\n  --cuda-visible-devices 0 \\\n  --ip 0.0.0.0 --port 8000\n\n# Transformers backend + Forced Aligner (enable timestamps)\nqwen-asr-demo \\\n  --asr-checkpoint Qwen\u002FQwen3-ASR-1.7B \\\n  --aligner-checkpoint Qwen\u002FQwen3-ForcedAligner-0.6B \\\n  --backend transformers \\\n  --cuda-visible-devices 0 \\\n  --backend-kwargs '{\"device_map\":\"cuda:0\",\"dtype\":\"bfloat16\",\"max_inference_batch_size\":8,\"max_new_tokens\":256}' \\\n  --aligner-kwargs '{\"device_map\":\"cuda:0\",\"dtype\":\"bfloat16\"}' \\\n  --ip 0.0.0.0 --port 8000\n\n# vLLM backend + Forced Aligner (enable timestamps)\nqwen-asr-demo \\\n  --asr-checkpoint Qwen\u002FQwen3-ASR-1.7B \\\n  --aligner-checkpoint Qwen\u002FQwen3-ForcedAligner-0.6B \\\n  --backend vllm \\\n  --cuda-visible-devices 0 \\\n  --backend-kwargs '{\"gpu_memory_utilization\":0.7,\"max_inference_batch_size\":8,\"max_new_tokens\":2048}' \\\n  --aligner-kwargs '{\"device_map\":\"cuda:0\",\"dtype\":\"bfloat16\"}' \\\n  --ip 0.0.0.0 --port 8000\n```\n\nThen open `http:\u002F\u002F\u003Cyour-ip>:8000`, or access it via port forwarding in tools like VS Code.\n\n#### Backend Notes\n\nThis demo supports two backends: transformers and 
vLLM. All backend-specific initialization parameters should be passed via `--backend-kwargs` as a JSON dict. If not provided, the demo will use sensible defaults.\n\n```bash\n# Example: override transformers init args with flash attention\n--backend-kwargs '{\"device_map\":\"cuda:0\",\"dtype\":\"bfloat16\",\"attn_implementation\":\"flash_attention_2\"}'\n\n# Example: override vLLM init args with 65% GPU memory\n--backend-kwargs '{\"gpu_memory_utilization\":0.65}'\n```\n\n#### CUDA Device Notes\n\nBecause vLLM does not follow `cuda:0` style device selection, this demo selects GPUs by setting `CUDA_VISIBLE_DEVICES` via `--cuda-visible-devices`.\n\n```bash\n# Use GPU 0\n--cuda-visible-devices 0\n\n# Use GPU 1\n--cuda-visible-devices 1\n```\n\n#### Timestamps Notes\n\nTimestamps are only available when `--aligner-checkpoint` is provided. If you launch the demo without a forced aligner, the timestamps UI will be hidden automatically.\n\n```bash\n# No forced aligner\nqwen-asr-demo --asr-checkpoint Qwen\u002FQwen3-ASR-1.7B\n\n# With forced aligner\nqwen-asr-demo \\\n  --asr-checkpoint Qwen\u002FQwen3-ASR-1.7B \\\n  --aligner-checkpoint Qwen\u002FQwen3-ForcedAligner-0.6B\n```\n\n#### HTTPS Notes\n\nTo avoid browser microphone permission issues after deploying the server, we recommend running the gradio service over HTTPS, especially when it is accessed remotely or through a gateway. Use `--ssl-certfile` and `--ssl-keyfile` to enable HTTPS. 
First, generate a private key and a self-signed certificate (valid for 365 days):\n\n```bash\nopenssl req -x509 -newkey rsa:2048 \\\n  -keyout key.pem -out cert.pem \\\n  -days 365 -nodes \\\n  -subj \"\u002FCN=localhost\"\n```\n\nThen run the demo with HTTPS:\n\n```bash\nqwen-asr-demo \\\n  --asr-checkpoint Qwen\u002FQwen3-ASR-1.7B \\\n  --backend transformers \\\n  --cuda-visible-devices 0 \\\n  --ip 0.0.0.0 --port 8000 \\\n  --ssl-certfile cert.pem \\\n  --ssl-keyfile key.pem \\\n  --no-ssl-verify\n```\n\nThen open `https:\u002F\u002F\u003Cyour-ip>:8000` to use it. If your browser shows a warning, that’s expected for self-signed certificates. For production, use a real certificate.\n\n### Streaming Demo\n\nTo experience Qwen3-ASR’s streaming transcription capability in a web UI, we provide a minimal Flask-based streaming demo. The demo captures microphone audio in the browser, resamples it to 16,000 Hz, and continuously pushes PCM chunks to the model. Run the demo with the following command:\n\n```bash\nqwen-asr-demo-streaming \\\n  --asr-model-path Qwen\u002FQwen3-ASR-1.7B \\\n  --gpu-memory-utilization 0.9 \\\n  --host 0.0.0.0 \\\n  --port 8000\n```\n\nThen open `http:\u002F\u002F\u003Cyour-ip>:8000`, or access it via port forwarding in tools like VS Code.\n\n## Deployment with vLLM\n\nvLLM officially provides day-0 model support for Qwen3-ASR for efficient inference. \n\n### Installation\nYou can run Qwen3-ASR with vLLM nightly wheel or docker image. 
To install the nightly version of vLLM, we recommend using `uv` as the environment manager:\n```bash\nuv venv\nsource .venv\u002Fbin\u002Factivate\nuv pip install -U vllm --pre \\\n    --extra-index-url https:\u002F\u002Fwheels.vllm.ai\u002Fnightly\u002Fcu129 \\\n    --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu129 \\\n    --index-strategy unsafe-best-match\nuv pip install \"vllm[audio]\" # For additional audio dependencies\n```\n\n### Online Serving\nYou can deploy Qwen3-ASR with vLLM by running the following command:\n```bash\nvllm serve Qwen\u002FQwen3-ASR-1.7B\n```\nAfter the model server is successfully deployed, you can interact with it in multiple ways.\n\n#### Using OpenAI SDK\n```python\nfrom openai import OpenAI\n\n# Initialize client\nclient = OpenAI(\n    base_url=\"http:\u002F\u002Flocalhost:8000\u002Fv1\",\n    api_key=\"EMPTY\"\n)\n\n# Create multimodal chat completion request\nresponse = client.chat.completions.create(\n    model=\"Qwen\u002FQwen3-ASR-1.7B\",\n    messages=[\n        {\n            \"role\": \"user\",\n            \"content\": [\n                {\n                    \"type\": \"audio_url\",\n                    \"audio_url\": {\n                        \"url\": \"https:\u002F\u002Fqianwen-res.oss-cn-beijing.aliyuncs.com\u002FQwen3-ASR-Repo\u002Fasr_en.wav\"\n                    }\n                }\n            ]\n        }\n    ],\n)\n\nprint(response.choices[0].message.content)\n```\nvLLM also serves this model through the OpenAI transcription API.\n```python\nimport httpx\nfrom openai import OpenAI\n\n# Initialize client\nclient = OpenAI(\n    base_url=\"http:\u002F\u002Flocalhost:8000\u002Fv1\",\n    api_key=\"EMPTY\"\n)\naudio_url = \"https:\u002F\u002Fqianwen-res.oss-cn-beijing.aliyuncs.com\u002FQwen3-ASR-Repo\u002Fasr_en.wav\"\naudio_file = httpx.get(audio_url).content\n\ntranscription = client.audio.transcriptions.create(\n    
model=\"Qwen\u002FQwen3-ASR-1.7B\",\n    file=audio_file,\n)\n\nprint(transcription.text)\n```\n\n#### Using cURL\n```bash\ncurl http:\u002F\u002Flocalhost:8000\u002Fv1\u002Fchat\u002Fcompletions \\\n    -H \"Content-Type: application\u002Fjson\" \\\n    -d '{\n    \"messages\": [\n    {\"role\": \"user\", \"content\": [\n        {\"type\": \"audio_url\", \"audio_url\": {\"url\": \"https:\u002F\u002Fqianwen-res.oss-cn-beijing.aliyuncs.com\u002FQwen3-ASR-Repo\u002Fasr_en.wav\"}}\n    ]}\n    ]\n    }'\n```\n\n### Offline Inference\nThe following example uses vLLM to run offline inference with Qwen3-ASR:\n```python\nfrom vllm import LLM, SamplingParams\nfrom vllm.assets.audio import AudioAsset\n\n# Initialize the LLM\nllm = LLM(\n    model=\"Qwen\u002FQwen3-ASR-1.7B\"\n)\n\n# Load audio\naudio_asset = AudioAsset(\"winning_call\")\n\n# Create conversation with audio content\nconversation = [\n    {\n        \"role\": \"user\",\n        \"content\": [\n            {\n                \"type\": \"audio_url\",\n                \"audio_url\": {\"url\": audio_asset.url}\n            }\n        ]\n    }\n]\n\nsampling_params = SamplingParams(temperature=0.01, max_tokens=256)\n\n# Run inference using .chat()\noutputs = llm.chat(conversation, sampling_params=sampling_params)\nprint(outputs[0].outputs[0].text)\n```\n\n\n## Fine Tuning\n\nPlease refer to [Qwen3-ASR-Finetuning](finetuning\u002F) for detailed instructions on fine-tuning Qwen3-ASR.\n\n\n## Docker\n\nTo make it easier to use our `qwen-asr` Python package, we provide a pre-built Docker image: [qwenllm\u002Fqwen3-asr](https:\u002F\u002Fhub.docker.com\u002Fr\u002Fqwenllm\u002Fqwen3-asr). You only need to install the GPU driver and download the model files to run the code. 
Please follow the [NVIDIA Container Toolkit installation guide](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html) to ensure Docker can access your GPU. If you are in Mainland China and have trouble reaching Docker Hub, you may use a registry mirror to accelerate image pulls.\n\nFirst, pull the image and start a container:\n\n```bash\nLOCAL_WORKDIR=\u002Fpath\u002Fto\u002Fyour\u002Fworkspace\nHOST_PORT=8000\nCONTAINER_PORT=80\ndocker run --gpus all --name qwen3-asr \\\n    -v \u002Fvar\u002Frun\u002Fdocker.sock:\u002Fvar\u002Frun\u002Fdocker.sock -p $HOST_PORT:$CONTAINER_PORT \\\n    --mount type=bind,source=$LOCAL_WORKDIR,target=\u002Fdata\u002Fshared\u002FQwen3-ASR \\\n    --shm-size=4gb \\\n    -it qwenllm\u002Fqwen3-asr:latest\n```\n\nAfter running the command, you will enter the container’s bash shell. Your local workspace (**replace** `\u002Fpath\u002Fto\u002Fyour\u002Fworkspace` **with the actual path**) will be mounted inside the container at `\u002Fdata\u002Fshared\u002FQwen3-ASR`. Port `8000` on the host is mapped to port `80` in the container, so you can access services running in the container via `http:\u002F\u002F\u003Chost-ip>:8000`. Note that services inside the container must bind to `0.0.0.0` (not `127.0.0.1`) for port forwarding to work.\n\nIf you exit the container, you can start it again and re-enter it with:\n\n```bash\ndocker start qwen3-asr\ndocker exec -it qwen3-asr bash\n```\n\nTo remove the container completely, run:\n\n```bash\ndocker rm -f qwen3-asr\n```\n\n\n## Evaluation\n\nDuring evaluation, we ran inference for all models with `dtype=torch.bfloat16` and set `max_new_tokens=1024` using vLLM. Greedy search was used for all decoding, and none of the tests specified a language parameter. 
The detailed evaluation results are shown below.\n\n\u003Cdetails>\n\u003Csummary>ASR Benchmarks on Public Datasets (WER ↓)\u003C\u002Fsummary>\n\n\u003Ctable>\n  \u003Cthead>\n    \u003Ctr>\n      \u003Cth colspan=\"2\" style=\"text-align: left;\">\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">GPT-4o\u003Cbr>-Transcribe\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Gemini-2.5\u003Cbr>-Pro\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Doubao-ASR\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Whisper\u003Cbr>-large-v3\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Fun-ASR\u003Cbr>-MLT-Nano\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Qwen3-ASR\u003Cbr>-0.6B\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Qwen3-ASR\u003Cbr>-1.7B\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\n      \u003Ctd colspan=\"9\" style=\"text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;\">English (en)\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"2\" style=\"text-align: left;\">Librispeech\u003Cbr>clean | other\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>1.39\u003C\u002Fstrong> | 3.75\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">2.89 | 3.56\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">2.78 | 5.70\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">1.51 | 3.97\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">1.68 | 4.03\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">2.11 | 4.55\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">1.63 | \u003Cstrong>3.38\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"2\" style=\"text-align: left;\">GigaSpeech\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">25.50\u003C\u002Ftd>\n 
     \u003Ctd style=\"text-align: center;\">9.37\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">9.55\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">9.76\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">8.88\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>8.45\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"2\" style=\"text-align: left;\">CV-en\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">9.08\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">14.49\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">13.78\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">9.90\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">9.90\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">9.92\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>7.39\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"2\" style=\"text-align: left;\">Fleurs-en\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>2.40\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">2.94\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">6.31\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">4.08\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">5.49\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">4.39\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">3.35\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"2\" style=\"text-align: left;\">MLS-en\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">5.12\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>3.68\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">7.09\u003C\u002Ftd>\n      
\u003Ctd style=\"text-align: center;\">4.87\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">6.00\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">4.58\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"2\" style=\"text-align: left;\">Tedlium\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">7.69\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">6.15\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">4.91\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">6.84\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>3.85\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>4.50\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"2\" style=\"text-align: left;\">VoxPopuli\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">10.29\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">11.36\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">12.12\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">12.05\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>9.96\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>9.15\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"9\" style=\"text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;\">Chinese (zh)\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"2\" style=\"text-align: left;\">WenetSpeech\u003Cbr>net | meeting\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">15.30 | 32.27\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">14.43 
| 13.47\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">N\u002FA\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">9.86 | 19.11\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">6.35 | -\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">5.97 | 6.88\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>4.97\u003C\u002Fstrong> | \u003Cstrong>5.88\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"2\" style=\"text-align: left;\">AISHELL-2-test\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">4.24\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">11.62\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">2.85\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">5.06\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">3.15\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>2.71\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"2\" style=\"text-align: left;\">SpeechIO\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">12.86\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">5.30\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">2.93\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">7.56\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">3.44\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>2.88\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"2\" style=\"text-align: left;\">Fleurs-zh\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">2.44\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">2.71\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">2.69\u003C\u002Ftd>\n      
\u003Ctd style=\"text-align: center;\">4.09\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">3.51\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">2.88\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>2.41\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"2\" style=\"text-align: left;\">CV-zh\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">6.32\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">7.70\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">5.95\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">12.91\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">6.20\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">6.89\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>5.35\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"9\" style=\"text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;\">Chinese Dialect\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"2\" style=\"text-align: left;\">KeSpeech\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">26.87\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">24.71\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">5.27\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">28.79\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">7.08\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>5.10\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"2\" style=\"text-align: left;\">Fleurs-yue\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">4.98\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">9.43\u003C\u002Ftd>\n      \u003Ctd 
style=\"text-align: center;\">4.98\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">9.18\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">5.79\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>3.98\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"2\" style=\"text-align: left;\">CV-yue\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">11.36\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">18.76\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">13.20\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">16.23\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">9.50\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>7.57\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"2\" style=\"text-align: left;\">CV-zh-tw\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">6.32\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">7.31\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">4.06\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">7.84\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">5.59\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>3.77\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"2\" style=\"text-align: left;\">WenetSpeech-Yue\u003Cbr>short | long\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">15.62 | 25.29\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">25.19 | 11.23\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">9.74 | 11.40\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">32.26 | 
46.64\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">- | -\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">7.54 | 9.92\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>5.82\u003C\u002Fstrong> | \u003Cstrong>8.85\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"2\" style=\"text-align: left;\">WenetSpeech-Chuan\u003Cbr>easy | hard\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">34.81 | 53.98\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">43.79 | 67.30\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>11.40\u003C\u002Fstrong> | \u003Cstrong>20.20\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">14.35 | 26.80\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">- | -\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">13.92 | 24.45\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">11.99 | 21.63\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>ASR Benchmarks on Internal Datasets (WER ↓)\u003C\u002Fsummary>\n\n\u003Ctable>\n  \u003Cthead>\n    \u003Ctr>\n      \u003Cth style=\"text-align: left;\">\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">GPT-4o\u003Cbr>-Transcribe\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Gemini-2.5\u003Cbr>-Pro\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Doubao-ASR\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Whisper\u003Cbr>-large-v3\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Fun-ASR\u003Cbr>-MLT-Nano\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Qwen3-ASR\u003Cbr>-0.6B\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Qwen3-ASR\u003Cbr>-1.7B\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\n      \u003Ctd 
colspan=\"8\" style=\"text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;\">Accented English\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Dialog-Accented English\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">28.56\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">23.85\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">20.41\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">21.30\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">19.96\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>16.62\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>16.07\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"8\" style=\"text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;\">Chinese Mandarin\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Elders&Kids\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">14.27\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">36.93\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">4.17\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">10.61\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">4.54\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">4.48\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>3.81\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">ExtremeNoise\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">36.11\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">29.06\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">17.04\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">63.17\u003C\u002Ftd>\n      \u003Ctd 
style=\"text-align: center;\">36.55\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">17.88\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>16.17\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">TongueTwister\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">20.87\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">4.97\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">3.47\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">16.63\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">9.02\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">4.06\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>2.44\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Dialog-Mandarin\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">20.73\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">12.50\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">6.61\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">14.01\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">7.32\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">7.06\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>6.54\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"8\" style=\"text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;\">Chinese Dialect\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Dialog-Cantonese\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">16.05\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">14.98\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">7.56\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: 
center;\">31.04\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">5.85\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>4.80\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>4.12\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Dialog-Chinese Dialects\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">45.37\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">47.70\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">19.85\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">44.55\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">19.41\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>18.24\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>15.94\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\u003Cp>\u003Cstrong>Dialect coverage:\u003C\u002Fstrong> Results for \u003Cem>Dialog-Accented English\u003C\u002Fem> are averaged over 16 accents, and results for \u003Cem>Dialog-Chinese Dialects\u003C\u002Fem> are averaged over 22 Chinese dialects.\u003C\u002Fp>\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Multilingual ASR Benchmarks (WER ↓)\u003C\u002Fsummary>\n\n\u003Ctable>\n  \u003Cthead>\n    \u003Ctr>\n      \u003Cth style=\"text-align: left;\">\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">GLM-ASR\u003Cbr>-Nano-2512\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Whisper\u003Cbr>-large-v3\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Fun-ASR\u003Cbr>-MLT-Nano\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Qwen3-ASR\u003Cbr>-0.6B\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Qwen3-ASR\u003Cbr>-1.7B\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\n      
\u003Ctd colspan=\"6\" style=\"text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;\">Open-sourced Benchmarks\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">MLS\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">13.32\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">8.62\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">28.70\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">13.19\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>8.55\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">CommonVoice\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">19.40\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">10.77\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">17.25\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">12.75\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>9.18\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">MLC-SLM\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">34.93\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">15.68\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">29.94\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">15.84\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>12.74\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Fleurs\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">16.08\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">5.27\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">10.03\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">7.57\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: 
center;\">\u003Cstrong>4.90\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Fleurs\u003Csup>†\u003C\u002Fsup>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">20.05\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">6.85\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">31.89\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">10.37\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>6.62\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Fleurs\u003Csup>††\u003C\u002Fsup>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">24.83\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>8.16\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">47.84\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">21.80\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">12.60\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"6\" style=\"text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;\">Qwen-ASR Internal Benchmarks\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">News-Multilingual\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">49.40\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">14.80\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">65.07\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">17.39\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>12.80\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\u003Cp>\u003Cstrong>Language coverage:\u003C\u002Fstrong> \u003Cem>MLS\u003C\u002Fem> includes 8 languages: {da, de, en, es, fr, it, pl, pt}.\u003Cbr>\u003Cem>CommonVoice\u003C\u002Fem> 
includes 13 languages: {en, zh, yue, zh_TW, ar, de, es, fr, it, ja, ko, pt, ru}.\u003Cbr>\u003Cem>MLC-SLM\u003C\u002Fem> includes 11 languages: {en, fr, de, it, pt, es, ja, ko, ru, th, vi}.\u003Cbr>\u003Cem>Fleurs\u003C\u002Fem> includes 12 languages: {en, zh, yue, ar, de, es, fr, it, ja, ko, pt, ru}.\u003Cbr>\u003Cem>Fleurs\u003Csup>†\u003C\u002Fsup>\u003C\u002Fem> includes 8 additional languages beyond Fleurs: {hi, id, ms, nl, pl, th, tr, vi}.\u003Cbr>\u003Cem>Fleurs\u003Csup>††\u003C\u002Fsup>\u003C\u002Fem> includes 10 additional languages beyond Fleurs\u003Csup>†\u003C\u002Fsup>: {cs, da, el, fa, fi, fil, hu, mk, ro, sv}.\u003Cbr>\u003Cem>News-Multilingual\u003C\u002Fem> includes 15 languages: {ar, de, es, fr, hi, id, it, ja, ko, nl, pl, pt, ru, th, vi}.\u003C\u002Fp>\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Language Identification Accuracy (%) ↑\u003C\u002Fsummary>\n\n\u003Ctable>\n  \u003Cthead>\n    \u003Ctr>\n      \u003Cth style=\"text-align: left;\">\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Whisper-large-v3\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Qwen3-ASR-0.6B\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Qwen3-ASR-1.7B\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">MLS\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>99.9\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">99.3\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>99.9\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">CommonVoice\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">92.7\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>98.2\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: 
center;\">\u003Cstrong>98.7\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">MLC-SLM\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">89.2\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>92.7\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>94.1\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Fleurs\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">94.6\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>97.1\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>98.7\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr style=\"border-top: 1px solid #ddd;\">\n      \u003Ctd style=\"text-align: left;\">\u003Cem>Avg.\u003C\u002Fem>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">94.1\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>96.8\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>97.9\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\u003Cp>\u003Cstrong>Language coverage:\u003C\u002Fstrong> The language sets follow Multilingual ASR Benchmarks. 
Here, Fleurs corresponds to Fleurs\u003Csup>††\u003C\u002Fsup> in Multilingual ASR Benchmarks and covers 30 languages.\u003C\u002Fp>\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Singing Voice & Song Transcription (WER ↓)\u003C\u002Fsummary>\n\n\u003Ctable>\n  \u003Cthead>\n    \u003Ctr>\n      \u003Cth style=\"text-align: left;\">\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">GPT-4o\u003Cbr>-Transcribe\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Gemini-2.5\u003Cbr>-Pro\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Doubao-ASR\u003Cbr>-1.0\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Whisper\u003Cbr>-large-v3\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Fun-ASR-MLT\u003Cbr>-Nano\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Qwen3-ASR\u003Cbr>-1.7B\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\n      \u003Ctd colspan=\"7\" style=\"text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;\">Singing\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">M4Singer\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">16.77\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">20.88\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">7.88\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">13.58\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">7.29\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>5.98\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">MIR-1k-vocal\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">11.87\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">9.85\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">6.56\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: 
center;\">11.71\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">8.17\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>6.25\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Opencpop\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">7.93\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">6.49\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">3.80\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">9.52\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>2.98\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">3.08\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Popcs\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">32.84\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">15.13\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">8.97\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">13.77\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">9.42\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>8.52\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"7\" style=\"text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;\">Songs with BGM\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">EntireSongs-en\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">30.71\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>12.18\u003C\u002Fstrong>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">33.51\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">N\u002FA\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">N\u002FA\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: 
center;\">14.60\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">EntireSongs-zh\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">34.86\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">18.68\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">23.99\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">N\u002FA\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">N\u002FA\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>13.91\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>ASR Inference Mode Performance (WER ↓)\u003C\u002Fsummary>\n\n\u003Ctable>\n  \u003Cthead>\n    \u003Ctr>\n      \u003Cth style=\"text-align: left;\">Model\u003C\u002Fth>\n      \u003Cth style=\"text-align: left;\">Infer. Mode\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Librispeech\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Fleurs-en\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Fleurs-zh\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Avg.\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\n      \u003Ctd rowspan=\"2\" style=\"text-align: left; vertical-align: middle;\">Qwen3-ASR-1.7B\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: left;\">Offline\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">1.63 | 3.38\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">3.35\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">2.41\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">2.69\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Streaming\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">1.95 | 4.51\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">4.02\u003C\u002Ftd>\n      
\u003Ctd style=\"text-align: center;\">2.84\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">3.33\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr style=\"border-top: 1px solid #ddd;\">\n      \u003Ctd rowspan=\"2\" style=\"text-align: left; vertical-align: middle;\">Qwen3-ASR-0.6B\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: left;\">Offline\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">2.11 | 4.55\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">4.39\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">2.88\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">3.48\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Streaming\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">2.54 | 6.27\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">5.38\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">3.40\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">4.40\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Forced Alignment Benchmarks (AAS ms ↓)\u003C\u002Fsummary>\n\n\u003Ctable>\n  \u003Cthead>\n    \u003Ctr>\n      \u003Cth style=\"text-align: left;\">\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Monotonic-Aligner\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">NFA\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">WhisperX\u003C\u002Fth>\n      \u003Cth style=\"text-align: center;\">Qwen3-ForcedAligner-0.6B\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\n      \u003Ctd colspan=\"5\" style=\"text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;\">MFA-Labeled Raw\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Chinese\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: 
center;\">161.1\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">109.8\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>33.1\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">English\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">107.5\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">92.1\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>37.5\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">French\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">100.7\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">145.3\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>41.7\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">German\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">122.7\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">165.1\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>46.5\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Italian\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">142.7\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">155.5\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>75.5\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Japanese\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: 
center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>42.2\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Korean\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>37.2\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Portuguese\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>38.4\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Russian\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">200.7\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>40.2\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Spanish\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">124.7\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">108.0\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>36.8\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">\u003Cem>Avg.\u003C\u002Fem>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: 
center;\">161.1\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">129.8\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">133.2\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>42.9\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"5\" style=\"text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;\">MFA-Labeled Concat-300s\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Chinese\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">1742.4\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">235.0\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>36.5\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">English\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">226.7\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">227.2\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>58.6\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">French\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">230.6\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">2052.2\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>53.4\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">German\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">220.3\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">993.4\u003C\u002Ftd>\n      \u003Ctd 
style=\"text-align: center;\">\u003Cstrong>62.4\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Italian\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">290.5\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">5719.4\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>81.6\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Japanese\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>81.3\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Korean\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>42.2\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Portuguese\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>50.0\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Russian\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">283.3\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: 
center;\">\u003Cstrong>43.0\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Spanish\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">240.2\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">4549.9\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>39.6\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Cross-lingual\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>34.2\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">\u003Cem>Avg.\u003C\u002Fem>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">1742.4\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">246.7\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">2708.4\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>52.9\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd colspan=\"5\" style=\"text-align: left; font-style: italic; border-top: 1px solid #ddd; border-bottom: 1px solid #ddd;\">Human-Labeled\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Raw\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">49.9\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">88.6\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>27.8\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Raw-Noisy\u003C\u002Ftd>\n      \u003Ctd 
style=\"text-align: center;\">53.3\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">89.5\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>41.8\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Concat-60s\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">51.1\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">86.7\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>25.3\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Concat-300s\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">410.8\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">140.0\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>24.8\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">Concat-Cross-lingual\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>42.5\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd style=\"text-align: left;\">\u003Cem>Avg.\u003C\u002Fem>\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">141.3\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">101.2\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">-\u003C\u002Ftd>\n      \u003Ctd style=\"text-align: center;\">\u003Cstrong>32.4\u003C\u002Fstrong>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n\u003C\u002Fdetails>\n\n\n## 
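Citation\n\nIf you find our paper and code useful in your research, please consider giving the repository a star :star: and citing the paper :pencil: :)\n\n```BibTeX\n@article{Qwen3-ASR,\n  title={Qwen3-ASR Technical Report},\n  author={Xian Shi and Xiong Wang and Zhifang Guo and Yongqi Wang and Pei Zhang and Xinyu Zhang and Zishan Guo and Hongkun Hao and Yu Xi and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},\n  journal={arXiv preprint arXiv:2601.21337},\n  year={2026}\n}\n```\n\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=QwenLM\u002FQwen3-ASR&type=Date)](https:\u002F\u002Fstar-history.com\u002F#QwenLM\u002FQwen3-ASR&Date)\n\n\u003Cbr>","Qwen3-ASR is a series of open-source automatic speech recognition models developed by the Qwen team at Alibaba Cloud. It supports multilingual recognition of speech, music, and song lyrics, along with language detection and timestamp prediction. The project is built in Python, recognizes 52 languages and dialects, and provides a novel non-autoregressive speech forced-alignment model that aligns text-speech pairs in 11 languages. Its core strength lies in combining large-scale speech training data with the strong audio understanding of its foundation model, Qwen3-Omni, which brings the 1.7B version to a leading level among open-source ASR models. It suits applications that need high-accuracy multilingual speech recognition, such as transcribing multinational meetings and generating multilingual video subtitles.",2,"2026-06-11 03:48:55","high_star"]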