[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82881":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},82881,"X-ASR","Gilgamesh-J\u002FX-ASR","Gilgamesh-J","X-ASR is a series of automatic speech recognition models based on the icefall framework, focusing on streaming ASR and low-latency deployment.",null,"Swift",111,11,1,7,0,3,23,54,12,66.14,"Apache License 2.0",false,"main",[],"2026-06-12 04:01:39","\u003Ch1 align=\"center\">🎙️ X-ASR\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n  \u003Cb>Streaming-focused automatic speech recognition models based on icefall\u002Fk2, Zipformer, and sherpa-onnx.\u003C\u002Fb>\n\u003C\u002Fp>\n\n\u003Ctable align=\"center\" border=\"0\" cellspacing=\"0\" cellpadding=\"0\">\n  \u003Ctr>\n    \u003Ctd align=\"center\" width=\"25%\" style=\"border: none; padding: 0 14px;\">\n      \u003Ca href=\"https:\u002F\u002Fwww.sjtu.edu.cn\u002F\">\u003Cimg src=\"assets\u002Finstitutions\u002Fsjtu.png\" height=\"64\" alt=\"Shanghai Jiao Tong University\">\u003C\u002Fa>\n    \u003C\u002Ftd>\n    \u003Ctd align=\"center\" width=\"25%\" style=\"border: none; padding: 0 14px;\">\n      \u003Ca href=\"https:\u002F\u002Fwww.sii.edu.cn\u002F\">\u003Cimg src=\"assets\u002Finstitutions\u002Fsii.png\" height=\"64\" alt=\"Shanghai Innovation Institute\">\u003C\u002Fa>\n    \u003C\u002Ftd>\n    \u003Ctd align=\"center\" width=\"25%\" 
style=\"border: none; padding: 0 14px;\">\n      \u003Ca href=\"https:\u002F\u002Fwww.fudan.edu.cn\u002Fen\u002F\">\u003Cimg src=\"assets\u002Finstitutions\u002Ffudan.png\" height=\"64\" alt=\"Fudan University\">\u003C\u002Fa>\n    \u003C\u002Ftd>\n    \u003Ctd align=\"center\" width=\"25%\" style=\"border: none; padding: 0 14px;\">\n      \u003Ca href=\"https:\u002F\u002Fwww.hust.edu.cn\u002F\">\u003Cimg src=\"assets\u002Finstitutions\u002Fhust.png\" height=\"64\" alt=\"Huazhong University of Science and Technology\">\u003C\u002Fa>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\u003Cp align=\"center\">\n  \u003Csub>\u003Cb>Participating Institutions\u003C\u002Fb>\u003C\u002Fsub>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cb>🌐 \u003Ca href=\"README_zh.md\">中文版\u003C\u002Fa>\u003C\u002Fb>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FGilgameshWind\u002FX-ASR-zh-en\">🤗 Hugging Face Hub\u003C\u002Fa> |\n  \u003Ca href=\"https:\u002F\u002Fwww.modelscope.ai\u002FGilgamesh-J\u002FX-ASR-zh-en\">🧩 ModelScope\u003C\u002Fa> |\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fchenxie95\u002FX-ASR\">🪐 Hugging Face Space\u003C\u002Fa> |\n  \u003Ca href=\"https:\u002F\u002Fstream-asr.sjtuxlance.com\u002F\">🎧 Online Demo\u003C\u002Fa> |\n  \u003Ca href=\"X-ASR-zh-en\u002Fdeployment\u002Fx-asr-live-demo\u002FREADME.md\">🎙️ Local Live Demo\u003C\u002Fa> |\n  \u003Ca href=\"X-ASR-zh-en\u002Fdeployment\u002FREADME.md\">🚀 Deployment Guide\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cb>📄 X-ASR-zh-en Technical Report: Coming Soon\u003C\u002Fb>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel%20Released-X--ASR--zh--en-blue\" alt=\"Model released\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLanguages-zh%20%7C%20en-green\" alt=\"Languages\">\n  \u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FStreaming-low%20latency%20%7C%20multi--mode-orange\" alt=\"Streaming\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDeployment-sherpa--onnx-red\" alt=\"Deployment\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache--2.0-lightgrey\" alt=\"License\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"#overview\">🔍 Overview\u003C\u002Fa> |\n  \u003Ca href=\"#timeline\">📅 Timeline\u003C\u002Fa> |\n  \u003Ca href=\"#model-releases\">📦 Model Releases\u003C\u002Fa> |\n  \u003Ca href=\"#applications\">🎙️ Applications\u003C\u002Fa> |\n  \u003Ca href=\"#evaluation\">📊 Evaluation\u003C\u002Fa> |\n  \u003Ca href=\"#quick-start\">🚀 Quick Start\u003C\u002Fa> |\n  \u003Ca href=\"#repository-layout\">🗂️ Repository Layout\u003C\u002Fa>\n\u003C\u002Fp>\n\n---\n\n\u003Ca id=\"overview\">\u003C\u002Fa>\n\n## 🔍 Overview\n\n### 🧩 X-ASR\n\n**X-ASR** is a series of automatic speech recognition models built with the **icefall** framework. The series focuses on **streaming ASR** and **low-latency deployment**, while also supporting offline recognition. This repository currently releases an initial batch of **Chinese-English streaming ASR models**, and the X-ASR series will be continuously maintained, updated, and scaled across **languages**, **model architectures**, and **training data**.\n\n### 🤖 X-ASR-zh-en\n\n**X-ASR-zh-en** is trained on approximately **1 million hours** of open-source and collected speech data. It is designed as an **offline-streaming unified transducer ASR model** with the **Zipformer architecture**, supporting both **offline decoding** and **true streaming decoding**. 
The model provides multiple streaming chunk sizes: **160 ms**, **480 ms**, **960 ms**, and **1920 ms**, supports **punctuation and casing**, and can be conveniently deployed with **sherpa-onnx**.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Ffigures\u002Fzipformer.png\" width=\"700\" alt=\"Zipformer architecture\">\n\u003C\u002Fp>\n\n\u003Ca id=\"timeline\">\u003C\u002Fa>\n\n## 📅 Timeline\n\n| Status | Item | Details |\n|:---:|:---:|:---:|\n| ✅ Released | `X-ASR-zh-en` initial release | Chinese-English offline-streaming unified ASR models, sherpa-onnx deployment artifacts, and online demo are available. |\n| 📄 Coming Soon | `X-ASR-zh-en` technical report | Training recipe, model architecture, evaluation protocol, deployment details, and ablation analysis will be released. |\n| 🌏 Upcoming | Thai, Indonesian, and Vietnamese ASR | Streaming ASR models for the next language releases are under preparation. |\n| 🔄 Ongoing | Model and data updates | Continued work on model scaling, architecture improvements, data refinement, latency, stability, punctuation, and casing. 
|\n\n\u003Ca id=\"model-releases\">\u003C\u002Fa>\n\n## 📦 Model Releases\n\n| Model | Languages | Type | Streaming chunks | Deployment | Report | Model files |\n|:---:|:---:|:---:|:---:|:---:|:---:|:---:|\n| `X-ASR-zh-en` | Chinese, English | Offline-streaming unified transducer ASR | 160 ms, 480 ms, 960 ms, 1920 ms | sherpa-onnx | **Coming Soon** | [GitHub](X-ASR-zh-en\u002Fdeployment), [Hugging Face](https:\u002F\u002Fhuggingface.co\u002FGilgameshWind\u002FX-ASR-zh-en), [ModelScope](https:\u002F\u002Fwww.modelscope.ai\u002FGilgamesh-J\u002FX-ASR-zh-en) |\n\n## ⭐ Highlights\n\n| Category | Description |\n|:---:|:---:|\n| **Framework** | icefall \u002F k2 |\n| **Architecture** | Zipformer transducer |\n| **Training scale** | Approximately 1 million hours of open-source and collected speech data |\n| **Current languages** | Chinese and English |\n| **Decoding modes** | Offline decoding and true streaming decoding |\n| **Streaming chunks** | 160 ms, 480 ms, 960 ms, 1920 ms |\n| **Text output** | Supports punctuation and casing |\n| **Runtime** | sherpa-onnx |\n| **Interface** | WebSocket streaming server and WAV-file client |\n\n\u003Ca id=\"applications\">\u003C\u002Fa>\n\n## 🎙️ Applications\n\nWe welcome more experiments and real-world use cases built on top of **X-ASR**. 
The following downstream applications are based on X-ASR and have been synced into this repository.\n\n### 🧪 Vibe-Coding Application with FireRedVAD\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Ctd width=\"100%\" valign=\"top\" align=\"center\">\n      \u003Ca href=\"X-ASR-zh-en\u002Fdeployment\u002Fx-asr-live-demo\u002FREADME.md\">\n        \u003Cimg src=\"X-ASR-zh-en\u002Fdeployment\u002Fx-asr-live-demo\u002Fassets\u002Fstreaming-demo.gif\" width=\"720\" alt=\"X-ASR local offline live recognition demo\">\n      \u003C\u002Fa>\n      \u003Cbr>\n      \u003Cb>Local Offline Vibe-Coding ASR Demo\u003C\u002Fb>\n      \u003Cbr>\n      \u003Csub>Microphone\u002FWAV → FireRedVAD endpointing → X-ASR streaming decoding → live partial\u002Ffinal output. Designed for local offline dictation, voice-input prototypes, and vibe-coding workflows.\u003C\u002Fsub>\n      \u003Cbr>\u003Cbr>\n      \u003Cp align=\"left\">\n        This application turns X-ASR from a model release into a complete local voice-input loop. FireRedVAD detects when speech starts and ends, while X-ASR performs low-latency streaming recognition during the utterance. A short pause commits the current sentence as final text.\n      \u003C\u002Fp>\n      \u003Cp align=\"left\">\n        The main idea is that streaming ASR alone is not enough for interactive use: the decoder can produce partial text, but it does not know when a user has finished speaking. Adding VAD-based endpointing makes the system usable for local dictation, voice-IME prototypes, and vibe-coding scenarios where speech can be turned into text without sending audio to a server.\n      \u003C\u002Fp>\n      \u003Cp align=\"left\">\n        As a starting point, the demo prints final results in the terminal. 
A natural next step is to replace that final-text callback with an editor or focused-input injection layer, turning X-ASR into a local hands-free coding and writing interface.\n      \u003C\u002Fp>\n      \u003Cbr>\u003Cbr>\n      \u003Ca href=\"X-ASR-zh-en\u002Fdeployment\u002Fx-asr-live-demo\u002FREADME.md\">\u003Cb>Open Guide\u003C\u002Fb>\u003C\u002Fa> ·\n      \u003Ca href=\"X-ASR-zh-en\u002Fdeployment\u002Fx-asr-live-demo\u002FREADME_zh.md\">中文\u003C\u002Fa>\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n### ⬇️ Desktop Package Download\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fapplications\u002Fvibe-xasr\u002Ficon.png\" width=\"88\" alt=\"Vibe XASR app icon\">\n  \u003Cbr>\n  \u003Cb>Vibe XASR\u003C\u002Fb> · a local voice input method powered by X-ASR\n  \u003Cbr>\u003Cbr>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FGilgamesh-J\u002FX-ASR\u002Freleases\">\u003Cb>⬇️&nbsp; Download for macOS &nbsp;→\u003C\u002Fb>\u003C\u002Fa>\n  \u003Cbr>\n  \u003Csub>Universal (Apple Silicon + Intel) · macOS 15.0+ · signed &amp; notarized · auto-updates in-app\u003C\u002Fsub>\n\u003C\u002Fp>\n\n> **Hold a hotkey, speak, and the text lands right at your cursor — 100% local & offline, your data never leaves the device.** The X-ASR streaming engine turns Chinese & English speech (freely code-switched) into text in real time, system-wide.\n\n**Core features**\n\n- 🎙️ **Three dictation modes** — insert-on-finish · live streaming (types as you talk) · OnCall standby (floating window)\n- 🀄 **Chinese ⇄ English** free code-switching, inserted in real time at the cursor\n- 📋 **Built-in pad & history** — saved by date; copy \u002F edit \u002F export\n- 📖 **Personal dictionary** — hotwords, homophone correction, replace rules\n- 🔢 **Number normalization & filler cleanup** — “二零二六” → “2026”, drops “um \u002F uh \u002F 那个”\n- 🌐 **Localized UI** — 中文 \u002F English \u002F 日本語 \u002F 한국어\n- 🔒 **Privacy-first & auto-update** — fully offline; 
one-click upgrades inside the app\n\n\u003Csub>🪟 A **Windows** build is also available (in [Releases](https:\u002F\u002Fgithub.com\u002FGilgamesh-J\u002FX-ASR\u002Freleases)) — an early **preview**, not yet fully tested, kept in sync with the latest macOS features. Please [report issues](https:\u002F\u002Fgithub.com\u002FGilgamesh-J\u002FX-ASR\u002Fissues) as you run into them.\u003C\u002Fsub>\n\n\u003Ca id=\"evaluation\">\u003C\u002Fa>\n\n## 📊 Evaluation\n\nThe following results are for the current **X-ASR-zh-en** release. All results are reported with **greedy search**. **Measurement:** English results use **WER (%)**, and Chinese results use **CER (%)**; lower is better.\n\n### 🧪 Public ASR Benchmarks\n\n\u003Ctable>\n  \u003Cthead>\n    \u003Ctr>\n      \u003Cth align=\"center\" rowspan=\"2\">⚙️ Mode\u003C\u002Fth>\n      \u003Cth align=\"center\" rowspan=\"2\">⏱️ Chunk size\u003C\u002Fth>\n      \u003Cth align=\"center\" colspan=\"2\">📚 LibriSpeech\u003C\u002Fth>\n      \u003Cth align=\"center\" rowspan=\"2\">🎙️ GigaSpeech\u003C\u002Fth>\n      \u003Cth align=\"center\" colspan=\"2\">🗣️ WenetSpeech\u003C\u002Fth>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Cth align=\"center\">clean\u003C\u002Fth>\n      \u003Cth align=\"center\">other\u003C\u002Fth>\n      \u003Cth align=\"center\">net\u003C\u002Fth>\n      \u003Cth align=\"center\">meeting\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\n      \u003Ctd align=\"center\">Streaming\u003C\u002Ftd>\n      \u003Ctd align=\"center\">160 ms\u003C\u002Ftd>\n      \u003Ctd align=\"center\">3.49\u003C\u002Ftd>\n      \u003Ctd align=\"center\">8.75\u003C\u002Ftd>\n      \u003Ctd align=\"center\">10.32\u003C\u002Ftd>\n      \u003Ctd align=\"center\">8.72\u003C\u002Ftd>\n      \u003Ctd align=\"center\">10.47\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd align=\"center\">Streaming\u003C\u002Ftd>\n      \u003Ctd align=\"center\">480 ms\u003C\u002Ftd>\n      
\u003Ctd align=\"center\">2.99\u003C\u002Ftd>\n      \u003Ctd align=\"center\">7.36\u003C\u002Ftd>\n      \u003Ctd align=\"center\">9.70\u003C\u002Ftd>\n      \u003Ctd align=\"center\">7.46\u003C\u002Ftd>\n      \u003Ctd align=\"center\">9.11\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd align=\"center\">Streaming\u003C\u002Ftd>\n      \u003Ctd align=\"center\">960 ms\u003C\u002Ftd>\n      \u003Ctd align=\"center\">2.87\u003C\u002Ftd>\n      \u003Ctd align=\"center\">6.77\u003C\u002Ftd>\n      \u003Ctd align=\"center\">9.59\u003C\u002Ftd>\n      \u003Ctd align=\"center\">6.97\u003C\u002Ftd>\n      \u003Ctd align=\"center\">8.40\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd align=\"center\">Streaming\u003C\u002Ftd>\n      \u003Ctd align=\"center\">1920 ms\u003C\u002Ftd>\n      \u003Ctd align=\"center\">2.75\u003C\u002Ftd>\n      \u003Ctd align=\"center\">6.33\u003C\u002Ftd>\n      \u003Ctd align=\"center\">9.43\u003C\u002Ftd>\n      \u003Ctd align=\"center\">6.58\u003C\u002Ftd>\n      \u003Ctd align=\"center\">7.88\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Ctd align=\"center\">Offline\u003C\u002Ftd>\n      \u003Ctd align=\"center\">-\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cb>2.56\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cb>5.56\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cb>9.17\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cb>5.83\u003C\u002Fb>\u003C\u002Ftd>\n      \u003Ctd align=\"center\">\u003Cb>7.06\u003C\u002Fb>\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n**Note:** Bold numbers indicate the best result among the listed modes for each benchmark column.\n\n### 🏆 Public Benchmark Model Comparison\n\nThe following table compares representative ASR models on the same public benchmark columns. Ranks are computed by **AVG** across the five listed columns; lower is better. 
Parameter sizes are shown when provided by the source sheet.\n\n\u003Ctable>\n  \u003Cthead>\n    \u003Ctr>\n      \u003Cth align=\"center\" rowspan=\"2\">🏅 Rank\u003C\u002Fth>\n      \u003Cth align=\"center\" rowspan=\"2\">Model\u003C\u002Fth>\n      \u003Cth align=\"center\" rowspan=\"2\">Params\u003C\u002Fth>\n      \u003Cth align=\"center\" colspan=\"2\">📚 LibriSpeech\u003C\u002Fth>\n      \u003Cth align=\"center\" rowspan=\"2\">🎙️ GigaSpeech\u003C\u002Fth>\n      \u003Cth align=\"center\" colspan=\"2\">🗣️ WenetSpeech\u003C\u002Fth>\n      \u003Cth align=\"center\" rowspan=\"2\">AVG\u003C\u002Fth>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Cth align=\"center\">clean\u003C\u002Fth>\n      \u003Cth align=\"center\">other\u003C\u002Fth>\n      \u003Cth align=\"center\">net\u003C\u002Fth>\n      \u003Cth align=\"center\">meeting\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\u003Ctd align=\"center\">1\u003C\u002Ftd>\u003Ctd align=\"center\">Qwen3-ASR\u003C\u002Ftd>\u003Ctd align=\"center\">1.7B\u003C\u002Ftd>\u003Ctd align=\"center\">1.65\u003C\u002Ftd>\u003Ctd align=\"center\">3.45\u003C\u002Ftd>\u003Ctd align=\"center\">8.56\u003C\u002Ftd>\u003Ctd align=\"center\">5.29\u003C\u002Ftd>\u003Ctd align=\"center\">5.46\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>4.882\u003C\u002Fb>\u003C\u002Ftd>\u003C\u002Ftr>\n    \u003Ctr>\u003Ctd align=\"center\">2\u003C\u002Ftd>\u003Ctd align=\"center\">Qwen3-ASR\u003C\u002Ftd>\u003Ctd align=\"center\">0.6B\u003C\u002Ftd>\u003Ctd align=\"center\">2.18\u003C\u002Ftd>\u003Ctd align=\"center\">4.54\u003C\u002Ftd>\u003Ctd align=\"center\">8.94\u003C\u002Ftd>\u003Ctd align=\"center\">5.97\u003C\u002Ftd>\u003Ctd align=\"center\">6.88\u003C\u002Ftd>\u003Ctd align=\"center\">5.702\u003C\u002Ftd>\u003C\u002Ftr>\n    \u003Ctr>\u003Ctd align=\"center\">3\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>X-ASR-zh-en\u003C\u002Fb> (offline)\u003C\u002Ftd>\u003Ctd 
align=\"center\">0.16B\u003C\u002Ftd>\u003Ctd align=\"center\">2.56\u003C\u002Ftd>\u003Ctd align=\"center\">5.56\u003C\u002Ftd>\u003Ctd align=\"center\">9.17\u003C\u002Ftd>\u003Ctd align=\"center\">5.83\u003C\u002Ftd>\u003Ctd align=\"center\">7.06\u003C\u002Ftd>\u003Ctd align=\"center\">6.036\u003C\u002Ftd>\u003C\u002Ftr>\n    \u003Ctr>\u003Ctd align=\"center\">4\u003C\u002Ftd>\u003Ctd align=\"center\">SenseVoice-small\u003C\u002Ftd>\u003Ctd align=\"center\">234M\u003C\u002Ftd>\u003Ctd align=\"center\">3.16\u003C\u002Ftd>\u003Ctd align=\"center\">7.21\u003C\u002Ftd>\u003Ctd align=\"center\">11.24\u003C\u002Ftd>\u003Ctd align=\"center\">5.73\u003C\u002Ftd>\u003Ctd align=\"center\">6.47\u003C\u002Ftd>\u003Ctd align=\"center\">6.762\u003C\u002Ftd>\u003C\u002Ftr>\n    \u003Ctr>\u003Ctd align=\"center\">5\u003C\u002Ftd>\u003Ctd align=\"center\">VibeVoice-ASR\u003C\u002Ftd>\u003Ctd align=\"center\">9B\u003C\u002Ftd>\u003Ctd align=\"center\">2.18\u003C\u002Ftd>\u003Ctd align=\"center\">5.65\u003C\u002Ftd>\u003Ctd align=\"center\">9.49\u003C\u002Ftd>\u003Ctd align=\"center\">14.45\u003C\u002Ftd>\u003Ctd align=\"center\">17.19\u003C\u002Ftd>\u003Ctd align=\"center\">9.792\u003C\u002Ftd>\u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n### 🧭 Vertical-Domain Benchmarks\n\nThe following results report **GigaSpeechBench vertical-domain** performance for the current **X-ASR-zh-en** release. Values are **WER\u002FCER percentages**; lower is better. 
Domain abbreviations follow the GigaSpeechBench vertical-domain labels.\n\n#### CH\n\n\u003Ctable>\n  \u003Cthead>\n    \u003Ctr>\n      \u003Cth align=\"center\">⚙️ Mode\u003C\u002Fth>\n      \u003Cth align=\"center\">⏱️ Chunk size\u003C\u002Fth>\n      \u003Cth align=\"center\">ARG\u003C\u002Fth>\n      \u003Cth align=\"center\">AIT\u003C\u002Fth>\n      \u003Cth align=\"center\">ART\u003C\u002Fth>\n      \u003Cth align=\"center\">BIO\u003C\u002Fth>\n      \u003Cth align=\"center\">ECM\u003C\u002Fth>\n      \u003Cth align=\"center\">ENG\u003C\u002Fth>\n      \u003Cth align=\"center\">ENT\u003C\u002Fth>\n      \u003Cth align=\"center\">FIN\u003C\u002Fth>\n      \u003Cth align=\"center\">HUM\u003C\u002Fth>\n      \u003Cth align=\"center\">LAW\u003C\u002Fth>\n      \u003Cth align=\"center\">MED\u003C\u002Fth>\n      \u003Cth align=\"center\">MIL\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\u003Ctd align=\"center\">Streaming\u003C\u002Ftd>\u003Ctd align=\"center\">160 ms\u003C\u002Ftd>\u003Ctd align=\"center\">9.88\u003C\u002Ftd>\u003Ctd align=\"center\">6.76\u003C\u002Ftd>\u003Ctd align=\"center\">4.39\u003C\u002Ftd>\u003Ctd align=\"center\">7.32\u003C\u002Ftd>\u003Ctd align=\"center\">4.13\u003C\u002Ftd>\u003Ctd align=\"center\">3.58\u003C\u002Ftd>\u003Ctd align=\"center\">8.45\u003C\u002Ftd>\u003Ctd align=\"center\">3.23\u003C\u002Ftd>\u003Ctd align=\"center\">10.42\u003C\u002Ftd>\u003Ctd align=\"center\">6.58\u003C\u002Ftd>\u003Ctd align=\"center\">4.25\u003C\u002Ftd>\u003Ctd align=\"center\">2.55\u003C\u002Ftd>\u003C\u002Ftr>\n    \u003Ctr>\u003Ctd align=\"center\">Streaming\u003C\u002Ftd>\u003Ctd align=\"center\">480 ms\u003C\u002Ftd>\u003Ctd align=\"center\">8.67\u003C\u002Ftd>\u003Ctd align=\"center\">6.17\u003C\u002Ftd>\u003Ctd align=\"center\">3.60\u003C\u002Ftd>\u003Ctd align=\"center\">6.22\u003C\u002Ftd>\u003Ctd align=\"center\">3.78\u003C\u002Ftd>\u003Ctd align=\"center\">3.04\u003C\u002Ftd>\u003Ctd 
align=\"center\">7.04\u003C\u002Ftd>\u003Ctd align=\"center\">2.78\u003C\u002Ftd>\u003Ctd align=\"center\">9.43\u003C\u002Ftd>\u003Ctd align=\"center\">5.84\u003C\u002Ftd>\u003Ctd align=\"center\">3.76\u003C\u002Ftd>\u003Ctd align=\"center\">2.11\u003C\u002Ftd>\u003C\u002Ftr>\n    \u003Ctr>\u003Ctd align=\"center\">Streaming\u003C\u002Ftd>\u003Ctd align=\"center\">960 ms\u003C\u002Ftd>\u003Ctd align=\"center\">8.00\u003C\u002Ftd>\u003Ctd align=\"center\">5.69\u003C\u002Ftd>\u003Ctd align=\"center\">3.44\u003C\u002Ftd>\u003Ctd align=\"center\">6.10\u003C\u002Ftd>\u003Ctd align=\"center\">3.69\u003C\u002Ftd>\u003Ctd align=\"center\">2.88\u003C\u002Ftd>\u003Ctd align=\"center\">6.71\u003C\u002Ftd>\u003Ctd align=\"center\">2.72\u003C\u002Ftd>\u003Ctd align=\"center\">9.07\u003C\u002Ftd>\u003Ctd align=\"center\">5.58\u003C\u002Ftd>\u003Ctd align=\"center\">3.69\u003C\u002Ftd>\u003Ctd align=\"center\">2.11\u003C\u002Ftd>\u003C\u002Ftr>\n    \u003Ctr>\u003Ctd align=\"center\">Streaming\u003C\u002Ftd>\u003Ctd align=\"center\">1920 ms\u003C\u002Ftd>\u003Ctd align=\"center\">7.24\u003C\u002Ftd>\u003Ctd align=\"center\">5.58\u003C\u002Ftd>\u003Ctd align=\"center\">3.27\u003C\u002Ftd>\u003Ctd align=\"center\">5.82\u003C\u002Ftd>\u003Ctd align=\"center\">3.48\u003C\u002Ftd>\u003Ctd align=\"center\">2.74\u003C\u002Ftd>\u003Ctd align=\"center\">6.55\u003C\u002Ftd>\u003Ctd align=\"center\">2.57\u003C\u002Ftd>\u003Ctd align=\"center\">8.59\u003C\u002Ftd>\u003Ctd align=\"center\">4.97\u003C\u002Ftd>\u003Ctd align=\"center\">3.53\u003C\u002Ftd>\u003Ctd align=\"center\">1.94\u003C\u002Ftd>\u003C\u002Ftr>\n    \u003Ctr>\u003Ctd align=\"center\">Offline\u003C\u002Ftd>\u003Ctd align=\"center\">-\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>6.56\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>4.54\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>2.77\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>5.04\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd 
align=\"center\">\u003Cb>2.99\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>2.32\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>6.02\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>1.94\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>7.64\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>4.20\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>2.90\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>1.68\u003C\u002Fb>\u003C\u002Ftd>\u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n#### EN\n\n\u003Ctable>\n  \u003Cthead>\n    \u003Ctr>\n      \u003Cth align=\"center\">⚙️ Mode\u003C\u002Fth>\n      \u003Cth align=\"center\">⏱️ Chunk size\u003C\u002Fth>\n      \u003Cth align=\"center\">ARG\u003C\u002Fth>\n      \u003Cth align=\"center\">AIT\u003C\u002Fth>\n      \u003Cth align=\"center\">ART\u003C\u002Fth>\n      \u003Cth align=\"center\">BIO\u003C\u002Fth>\n      \u003Cth align=\"center\">ECM\u003C\u002Fth>\n      \u003Cth align=\"center\">ENG\u003C\u002Fth>\n      \u003Cth align=\"center\">ENT\u003C\u002Fth>\n      \u003Cth align=\"center\">FIN\u003C\u002Fth>\n      \u003Cth align=\"center\">HUM\u003C\u002Fth>\n      \u003Cth align=\"center\">LAW\u003C\u002Fth>\n      \u003Cth align=\"center\">MED\u003C\u002Fth>\n      \u003Cth align=\"center\">MIL\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\u003Ctd align=\"center\">Streaming\u003C\u002Ftd>\u003Ctd align=\"center\">160 ms\u003C\u002Ftd>\u003Ctd align=\"center\">5.29\u003C\u002Ftd>\u003Ctd align=\"center\">8.57\u003C\u002Ftd>\u003Ctd align=\"center\">8.55\u003C\u002Ftd>\u003Ctd align=\"center\">7.31\u003C\u002Ftd>\u003Ctd align=\"center\">4.33\u003C\u002Ftd>\u003Ctd align=\"center\">5.01\u003C\u002Ftd>\u003Ctd align=\"center\">16.25\u003C\u002Ftd>\u003Ctd align=\"center\">5.58\u003C\u002Ftd>\u003Ctd align=\"center\">7.36\u003C\u002Ftd>\u003Ctd 
align=\"center\">13.39\u003C\u002Ftd>\u003Ctd align=\"center\">6.03\u003C\u002Ftd>\u003Ctd align=\"center\">6.20\u003C\u002Ftd>\u003C\u002Ftr>\n    \u003Ctr>\u003Ctd align=\"center\">Streaming\u003C\u002Ftd>\u003Ctd align=\"center\">480 ms\u003C\u002Ftd>\u003Ctd align=\"center\">4.62\u003C\u002Ftd>\u003Ctd align=\"center\">8.40\u003C\u002Ftd>\u003Ctd align=\"center\">7.73\u003C\u002Ftd>\u003Ctd align=\"center\">6.12\u003C\u002Ftd>\u003Ctd align=\"center\">4.19\u003C\u002Ftd>\u003Ctd align=\"center\">4.65\u003C\u002Ftd>\u003Ctd align=\"center\">14.50\u003C\u002Ftd>\u003Ctd align=\"center\">5.21\u003C\u002Ftd>\u003Ctd align=\"center\">6.79\u003C\u002Ftd>\u003Ctd align=\"center\">11.51\u003C\u002Ftd>\u003Ctd align=\"center\">5.59\u003C\u002Ftd>\u003Ctd align=\"center\">6.02\u003C\u002Ftd>\u003C\u002Ftr>\n    \u003Ctr>\u003Ctd align=\"center\">Streaming\u003C\u002Ftd>\u003Ctd align=\"center\">960 ms\u003C\u002Ftd>\u003Ctd align=\"center\">4.58\u003C\u002Ftd>\u003Ctd align=\"center\">8.35\u003C\u002Ftd>\u003Ctd align=\"center\">7.45\u003C\u002Ftd>\u003Ctd align=\"center\">6.00\u003C\u002Ftd>\u003Ctd align=\"center\">4.13\u003C\u002Ftd>\u003Ctd align=\"center\">4.44\u003C\u002Ftd>\u003Ctd align=\"center\">13.99\u003C\u002Ftd>\u003Ctd align=\"center\">5.12\u003C\u002Ftd>\u003Ctd align=\"center\">6.58\u003C\u002Ftd>\u003Ctd align=\"center\">10.86\u003C\u002Ftd>\u003Ctd align=\"center\">5.52\u003C\u002Ftd>\u003Ctd align=\"center\">6.04\u003C\u002Ftd>\u003C\u002Ftr>\n    \u003Ctr>\u003Ctd align=\"center\">Streaming\u003C\u002Ftd>\u003Ctd align=\"center\">1920 ms\u003C\u002Ftd>\u003Ctd align=\"center\">4.33\u003C\u002Ftd>\u003Ctd align=\"center\">8.32\u003C\u002Ftd>\u003Ctd align=\"center\">6.90\u003C\u002Ftd>\u003Ctd align=\"center\">5.89\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>4.00\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">4.37\u003C\u002Ftd>\u003Ctd align=\"center\">13.61\u003C\u002Ftd>\u003Ctd align=\"center\">4.98\u003C\u002Ftd>\u003Ctd 
align=\"center\">6.39\u003C\u002Ftd>\u003Ctd align=\"center\">10.52\u003C\u002Ftd>\u003Ctd align=\"center\">5.45\u003C\u002Ftd>\u003Ctd align=\"center\">5.78\u003C\u002Ftd>\u003C\u002Ftr>\n    \u003Ctr>\u003Ctd align=\"center\">Offline\u003C\u002Ftd>\u003Ctd align=\"center\">-\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>4.09\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>8.28\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>6.73\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>5.48\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">4.12\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>4.30\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>12.30\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>4.94\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>6.17\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>10.41\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>5.35\u003C\u002Fb>\u003C\u002Ftd>\u003Ctd align=\"center\">\u003Cb>5.61\u003C\u002Fb>\u003C\u002Ftd>\u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n## 🎧 Demo\n\nA **sherpa-onnx based online demo** is available here:\n\n- [https:\u002F\u002Fstream-asr.sjtuxlance.com\u002F](https:\u002F\u002Fstream-asr.sjtuxlance.com\u002F)\n\nDemo video:\n\n\u003Ca href=\"assets\u002Fdemos\u002Fdemo.mov\">\n  \u003Cimg src=\"assets\u002Ffigures\u002Fdemo-preview.png\" width=\"700\" alt=\"X-ASR demo video preview\">\n\u003C\u002Fa>\n\n[Open demo video](assets\u002Fdemos\u002Fdemo.mov)\n\n\u003Ca id=\"quick-start\">\u003C\u002Fa>\n\n## 🚀 Quick Start\n\nThis section shows how to build and run the **sherpa-onnx WebSocket streaming server** and the corresponding **WebSocket client**. For complete deployment arguments, model switching, runtime options, and production notes, see the [deployment guide](X-ASR-zh-en\u002Fdeployment\u002FREADME.md).\n\n### 1. 
Clone or download model artifacts\n\nThis repository uses **Git LFS** for ONNX model artifacts and demo media. Install Git LFS before cloning or before pulling large files.\n\n#### GitHub\n\nUse GitHub when you want the full project repository, bilingual documentation, training references, deployment examples, and issue-tracking context.\n\n```bash\ngit lfs install\ngit clone https:\u002F\u002Fgithub.com\u002FGilgamesh-J\u002FX-ASR.git\ncd X-ASR\ngit lfs pull\n```\n\n#### Hugging Face\n\nUse Hugging Face when you want the model artifact page and standard HF Hub download tooling.\n\n```bash\nhf download GilgameshWind\u002FX-ASR-zh-en \\\n  --local-dir .\u002FX-ASR-zh-en\n```\n\n#### ModelScope\n\nUse ModelScope when you prefer the ModelScope mirror or Git LFS clone from ModelScope.\n\n```bash\ngit lfs install\ngit clone https:\u002F\u002Fwww.modelscope.ai\u002FGilgamesh-J\u002FX-ASR-zh-en.git\ncd X-ASR-zh-en\ngit lfs pull\n```\n\n### 2. Prepare the sherpa-onnx runtime\n\nIf you cloned the full GitHub project, enter:\n\n```bash\ncd X-ASR\u002FX-ASR-zh-en\u002Fdeployment\n```\n\nIf you downloaded from Hugging Face or cloned from ModelScope, enter:\n\n```bash\ncd X-ASR-zh-en\u002Fdeployment\n```\n\nThen prepare the Python environment:\n\n```bash\npython -m venv .venv\nsource .venv\u002Fbin\u002Factivate\npython -m pip install --upgrade pip\npython -m pip install -r requirements.txt\n```\n\n### 3. Start the WebSocket server\n\nThe server wraps `sherpa_onnx.OnlineRecognizer` and exposes a WebSocket endpoint. Each WebSocket connection keeps an independent recognizer session, so concurrent clients do not share decoding state. 
The example below starts the **160 ms streaming model** on CPU and listens on `ws:\u002F\u002F0.0.0.0:6666`.\n\n```bash\npython infer_and_client\u002Fsherpa_streaming_server.py \\\n  --host 0.0.0.0 \\\n  --port 6666 \\\n  --tokens models\u002Fchunk-160ms-model\u002Ftokens.txt \\\n  --encoder models\u002Fchunk-160ms-model\u002Fencoder-160ms.onnx \\\n  --decoder models\u002Fchunk-160ms-model\u002Fdecoder-160ms.onnx \\\n  --joiner models\u002Fchunk-160ms-model\u002Fjoiner-160ms.onnx \\\n  --provider cpu \\\n  --sample-rate 16000 \\\n  --feature-dim 80 \\\n  --decoding-method greedy_search \\\n  --model-type zipformer2 \\\n  --text-format none\n```\n\nThe `--tokens`, `--encoder`, `--decoder`, and `--joiner` files must come from the same model directory.\n\n### 4. Run the WebSocket client\n\nOpen another terminal:\n\n```bash\ncd X-ASR-zh-en\u002Fdeployment\nsource .venv\u002Fbin\u002Factivate\n\npython infer_and_client\u002Fsherpa_streaming_client.py \\\n  --server-uri ws:\u002F\u002F127.0.0.1:6666 \\\n  --wav \u002Fpath\u002Fto\u002Ftest.wav \\\n  --chunk-ms 100 \\\n  --simulate-realtime 1\n```\n\nThe client loads a WAV file, converts or resamples it to **16 kHz mono int16 PCM**, sends binary PCM chunks over WebSocket, and prints partial\u002Ffinal recognition results returned by the server. With `--simulate-realtime 1`, `--chunk-ms 100` means one audio packet is sent roughly every 100 ms.\n\n### 5. 
WebSocket protocol\n\nThe provided client and server use a minimal streaming protocol:\n\n| Step | Message | Purpose |\n|:---:|:---|:---|\n| 1 | JSON: `{\"type\": \"start\", \"sample_rate\": 16000}` | Start one recognition session |\n| 2 | Binary: int16 PCM audio chunks | Stream audio to the recognizer |\n| 3 | JSON: `{\"type\": \"end\"}` | Finish the session and flush final results |\n\nFor detailed deployment instructions, see [X-ASR-zh-en\u002Fdeployment\u002FREADME.md](X-ASR-zh-en\u002Fdeployment\u002FREADME.md).\n\n\u003Ca id=\"repository-layout\">\u003C\u002Fa>\n\n## 🗂️ Repository Layout\n\n```text\nX-ASR\u002F\n|-- README.md\n|-- README_zh.md\n|-- LICENSE\n|-- assets\u002F\n|   |-- figures\u002F\n|   |   |-- demo-preview.png\n|   |   `-- zipformer.png\n|   |-- demos\u002F\n|   |   `-- demo.mov\n|   `-- institutions\u002F\n|       |-- sjtu.png\n|       |-- sii.png\n|       |-- fudan.png\n|       `-- hust.png\n`-- X-ASR-zh-en\u002F\n    |-- deployment\u002F\n    |   |-- README.md\n    |   |-- requirements.txt\n    |   |-- infer_and_client\u002F\n    |   |   |-- sherpa_streaming_infer.py\n    |   |   |-- sherpa_streaming_server.py\n    |   |   `-- sherpa_streaming_client.py\n    |   |-- x-asr-live-demo\u002F\n    |   |   |-- README.md\n    |   |   |-- README_zh.md\n    |   |   |-- live_asr.py\n    |   |   |-- download_models.sh\n    |   |   |-- requirements.txt\n    |   |   `-- assets\u002F\n    |   `-- models\u002F\n    |       |-- README.md\n    |       |-- chunk-160ms-model\u002F\n    |       |-- chunk-480ms-model\u002F\n    |       |-- chunk-960ms-model\u002F\n    |       `-- chunk-1920ms-model\u002F\n    `-- zipformer\u002F\n        |-- README.md\n        |-- train.py\n        |-- finetune.py\n        |-- decode.py\n        |-- streaming_decode.py\n        |-- export.py\n        |-- export-onnx.py\n        |-- export-onnx-streaming.py\n        |-- model.py\n        |-- zipformer.py\n        |-- data\u002F\n        |   |-- lang_5000\u002F\n        |   |   
|-- bpe.model\n        |   |   `-- tokens.txt\n        |   `-- lang_5000_with_punctuation\u002F\n        |       |-- bpe_punc.model\n        |       `-- tokens.txt\n        `-- checkpoint\u002F\n            |-- pretrained.pt\n            `-- fintuned_with_punctuation.pt\n```\n\n`X-ASR-zh-en\u002Fdeployment\u002F` contains runnable sherpa-onnx deployment artifacts, including the WebSocket server\u002Fclient path and the local live ASR application demo. `X-ASR-zh-en\u002Fzipformer\u002F` contains the icefall\u002FZipformer training, decoding, and export recipe files, the tokenizer\u002Fdata files, and the released PyTorch checkpoints for the model.\n\n## 🤝 Contributing\n\nWe welcome feedback and contributions in the following areas:\n\n- Deployment issues on different CPU\u002FGPU environments\n- Streaming latency and stability reports\n- Evaluation results on new datasets or domains\n- Requests for new languages or future releases\n- Improvements to documentation and examples\n\nWhen reporting deployment problems, please include the **environment**, **command**, **input audio format**, and **error log**.\n\n## 📜 License\n\nThis project is released under the **Apache-2.0 License**.\n\n## 🙏 Acknowledgements\n\nThis model series is trained with **icefall** and deployed with **sherpa-onnx**.\n\n- icefall: https:\u002F\u002Fgithub.com\u002Fk2-fsa\u002Ficefall\n- sherpa-onnx: https:\u002F\u002Fgithub.com\u002Fk2-fsa\u002Fsherpa-onnx\n","X-ASR is a series of automatic speech recognition models built on the icefall framework, focused on streaming speech recognition and low-latency deployment. The project uses technologies such as Zipformer and sherpa-onnx for efficient, accurate speech-to-text, and also supports an offline recognition mode. It suits applications that need real-time speech conversion, such as online meetings, live-stream captioning, and intelligent customer-service systems.",2,"2026-06-11 04:09:31","CREATED_QUERY"]