[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-650":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":35,"readmeContent":36,"aiSummary":37,"trendingCount":16,"starSnapshotCount":16,"syncStatus":38,"lastSyncTime":39,"discoverSource":40},650,"MOSS-TTS-Nano","OpenMOSS\u002FMOSS-TTS-Nano","OpenMOSS","MOSS-TTS-Nano is an open-source multilingual tiny speech generation model from MOSI.AI and the OpenMOSS team. With only 0.1B parameters, it is designed for realtime speech generation, can run directly on CPU without a GPU, and keeps the deployment stack simple enough for local demos, web serving, and lightweight product integration.","https:\u002F\u002Fopenmoss.github.io\u002FMOSS-TTS-Nano-Demo\u002F",null,"Python",3443,447,20,46,0,49,130,564,147,29.95,"Apache License 2.0",false,"main",[26,27,28,29,30,31,32,33,34],"audio-tokenizer","chinese","english","multi-modality","multilingual","realtime","streaming-audio","tts","voice-clone","2026-06-12 02:00:16","# MOSS-TTS-Nano\n\n\u003Cbr>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fimages\u002FOpenMOSS_Logo.png\" height=\"70\" align=\"middle\" \u002F>\n  &nbsp;&nbsp;&nbsp;&nbsp;\n  \u003Cimg src=\".\u002Fassets\u002Fimages\u002Fmosi-logo.png\" height=\"50\" align=\"middle\" \u002F>\n\u003C\u002Fp>\n\n\u003Cdiv align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fclawhub.ai\u002Fluogao2333\u002Fmoss-tts-voice\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🦞_OpenClaw-Skills-8A2BE2\" 
alt=\"OpenClaw\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTS-Nano\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Models-orange?logo=huggingface&amp\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-TTS-Nano\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Models-7B61FF?logo=modelscope&amp;logoColor=white\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fopenmoss.github.io\u002FMOSS-TTS-Nano-Demo\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-View-blue?logo=internet-explorer&amp\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.18090\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2603.18090-red?logo=arxiv&amp\">\u003C\u002Fa>\n\n  \u003Ca href=\"https:\u002F\u002Fstudio.mosi.cn\u002Fexperiments\u002Fmoss-tts-nano\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAIStudio-Try-green?logo=internet-explorer&amp\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fstudio.mosi.cn\u002Fdocs\u002Fmoss-tts-nano\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAPI-Docs-00A3FF?logo=fastapi&amp\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fx.com\u002FOpen_MOSS\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTwitter-Follow-black?logo=x&amp\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FXf3aXddCjc\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDiscord-Join-5865F2?logo=discord&amp\">\u003C\u002Fa>\n  \u003Ca href=\".\u002Fassets\u002Fimages\u002Fwechat.jpg\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWeChat-Join-07C160?logo=wechat&amp;logoColor=white\" alt=\"WeChat\">\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n[English](README.md) | [简体中文](README_zh.md)\n\n\n\nMOSS-TTS-Nano is an open-source **multilingual 
tiny speech generation model** from [MOSI.AI](https:\u002F\u002Fmosi.cn\u002F#hero) and the [OpenMOSS team](https:\u002F\u002Fwww.open-moss.com\u002F). With only **0.1B parameters**, it is designed for **realtime speech generation**, can run directly on **CPU without a GPU**, and keeps the deployment stack simple enough for local demos, web serving, and lightweight product integration.\n\n## MOSS-TTS 2.0 Feedback Collection\n\nMOSS-TTS 2.0 is coming soon. To better optimize model capabilities and product experience, we are collecting feedback and suggestions from TTS users. Please take 2-3 minutes to fill out the [requirements collection form](https:\u002F\u002Facnc6zeentra.feishu.cn\u002Fshare\u002Fbase\u002Fform\u002FshrcnyAe1LwqKWjCSuW4wiZ2Hef). Feature requests of any kind are welcome.\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Facnc6zeentra.feishu.cn\u002Fshare\u002Fbase\u002Fform\u002FshrcnyAe1LwqKWjCSuW4wiZ2Hef\">\n    \u003Cimg src=\".\u002Fassets\u002Fimages\u002Fmoss_tts_2_requirements_gathering.jpg\" width=\"360\" alt=\"MOSS-TTS 2.0 requirements collection QR code\" \u002F>\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n[demo_video.mp4](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F25aca215-0bd7-4d0c-be95-8d1f6737aec8)\n\n## News\n\n* 2026.5.6: **MOSS-TTS**, **MOSS-TTS-Nano**, and **MOSS-Audio-Tokenizer** now support [**mlx-audio**](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-audio). Visit the [mlx-audio GitHub repository](https:\u002F\u002Fgithub.com\u002FBlaizzy\u002Fmlx-audio) for details.\n* 2026.4.29: MOSS-TTS 2.0 is coming soon! 
We are collecting TTS feedback, suggestions, and feature requests via the [requirements collection form](https:\u002F\u002Facnc6zeentra.feishu.cn\u002Fshare\u002Fbase\u002Fform\u002FshrcnyAe1LwqKWjCSuW4wiZ2Hef).\n* 2026.4.27: We added updated evaluation results for [**MOSS-Audio-Tokenizer-Nano**](#moss-audio-tokenizer-nano), including reconstruction quality comparisons on speech, audio, and music benchmarks.\n* 2026.4.17: We are excited to release a more efficient and fully standalone [**ONNX CPU Version**](#onnx-cpu-version), backed by the Hugging Face repositories [**MOSS-TTS-Nano-100M-ONNX**](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTS-Nano-100M-ONNX) and [**MOSS-Audio-Tokenizer-Nano-ONNX**](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-Audio-Tokenizer-Nano-ONNX). It preserves the full voice cloning workflow while removing the PyTorch dependency during inference. In our tests, it delivers nearly **2x** the processing efficiency of the original version, and runs smoothly on a **single CPU core** on a **MacBook Air M4**. Built on top of this ONNX CPU version, we have also updated [**MOSS-TTS-Nano-Reader**](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTS-Nano-Reader), which can now run the model directly inside the browser as an extension, without requiring a separate local inference service.\n* 2026.4.16: We release the **MOSS-TTS-Nano finetuning code**. See [.\u002Ffinetuning\u002FREADME.md](.\u002Ffinetuning\u002FREADME.md) for training and usage details.\n* 2026.4.14: We release [**MOSS-TTS-Nano-Reader**](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTS-Nano-Reader), a local browser reading application built on top of **MOSS-TTS-Nano**.\n* 2026.4.10: We release **MOSS-TTS-Nano**. A demo Space is available at [OpenMOSS-Team\u002FMOSS-TTS-Nano](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenMOSS-Team\u002FMOSS-TTS-Nano). 
You can also view the demo and more details at [openmoss.github.io\u002FMOSS-TTS-Nano-Demo\u002F](https:\u002F\u002Fopenmoss.github.io\u002FMOSS-TTS-Nano-Demo\u002F).\n\n## Demo\n\n- Online Demo: [https:\u002F\u002Fopenmoss.github.io\u002FMOSS-TTS-Nano-Demo\u002F](https:\u002F\u002Fopenmoss.github.io\u002FMOSS-TTS-Nano-Demo\u002F)\n- Hugging Face Space: [OpenMOSS-Team\u002FMOSS-TTS-Nano](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenMOSS-Team\u002FMOSS-TTS-Nano)\n\n## Contents\n\n- [News](#news)\n- [Demo](#demo)\n- [Introduction](#introduction)\n  - [Main Features](#main-features)\n- [Supported Languages](#supported-languages)\n- [Quickstart](#quickstart)\n  - [Environment Setup](#environment-setup)\n  - [Voice Clone with `infer.py`](#voice-clone-with-inferpy)\n  - [Local Web Demo with `app.py`](#local-web-demo-with-apppy)\n  - [ONNX CPU Inference](#onnx-cpu-version)\n  - [Export TTS-only ONNX Weights](#export-tts-only-onnx-weights)\n  - [CLI Command: `moss-tts-nano generate`](#cli-command-moss-tts-nano-generate)\n  - [CLI Command: `moss-tts-nano serve`](#cli-command-moss-tts-nano-serve)\n  - [Finetuning](#finetuning)\n- [MOSS-Audio-Tokenizer-Nano](#moss-audio-tokenizer-nano)\n- [MOSS-TTS Family](#moss-tts)\n- [License](#license)\n- [Citation](#citation)\n- [Star History](#star-history)\n\n## Introduction\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fimages\u002Fconcept.png\" alt=\"MOSS-TTS-Nano concept\" width=\"85%\" \u002F>\n\u003C\u002Fp>\n\nMOSS-TTS-Nano focuses on the part of TTS deployment that matters most in practice: **small footprint**, **low latency**, **good enough quality for realtime products**, and **simple local setup**. 
It uses a pure autoregressive **Audio Tokenizer + LLM** pipeline and keeps the inference workflow friendly for both terminal users and web-demo users.\n\n### Main Features\n\n- **Tiny model size**: only **0.1B parameters**\n- **Native audio format**: **48 kHz**, **2-channel** output\n- **Multilingual**: supports **Chinese, English, and more**\n- **Pure autoregressive architecture**: built on **Audio Tokenizer + LLM**\n- **Streaming inference**: low realtime latency and fast first audio\n- **CPU friendly**: streaming generation can run on a **4-core CPU**\n- **Long-text capable**: supports long input with automatic chunked voice cloning\n- **Open-source deployment**: direct `python infer.py`, `python app.py`, and packaged CLI support\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fimages\u002Farch_moss_tts_nano.png\" alt=\"MOSS-TTS-Nano architecture\" width=\"80%\" \u002F>\n  \u003Cbr \u002F>\n  Architecture of MOSS-TTS-Nano\n\u003C\u002Fp>\n\n## Supported Languages\n\nMOSS-TTS-Nano currently supports **20 languages**:\n\n| Language | Code | Flag | Language | Code | Flag | Language | Code | Flag |\n|---|---|---|---|---|---|---|---|---|\n| Chinese | zh | 🇨🇳 | English | en | 🇺🇸 | German | de | 🇩🇪 |\n| Spanish | es | 🇪🇸 | French | fr | 🇫🇷 | Japanese | ja | 🇯🇵 |\n| Italian | it | 🇮🇹 | Hungarian | hu | 🇭🇺 | Korean | ko | 🇰🇷 |\n| Russian | ru | 🇷🇺 | Persian (Farsi) | fa | 🇮🇷 | Arabic | ar | 🇸🇦 |\n| Polish | pl | 🇵🇱 | Portuguese | pt | 🇵🇹 | Czech | cs | 🇨🇿 |\n| Danish | da | 🇩🇰 | Swedish | sv | 🇸🇪 | Greek | el | 🇬🇷 |\n| Turkish | tr | 🇹🇷 |  |  |  |  |  |  |\n\n## Quickstart\n\n### Environment Setup\n\nWe recommend a clean Python environment first, then installing the project in editable mode so the `moss-tts-nano` command becomes available locally.\nThe examples below intentionally keep arguments minimal and rely on the repository defaults.\nBy default, the code loads `OpenMOSS-Team\u002FMOSS-TTS-Nano` and 
`OpenMOSS-Team\u002FMOSS-Audio-Tokenizer-Nano`.\n\n#### Using Conda\n\n```bash\nconda create -n moss-tts-nano python=3.12 -y\nconda activate moss-tts-nano\n\ngit clone https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTS-Nano.git\ncd MOSS-TTS-Nano\n\npip install -r requirements.txt\npip install -e .\n```\n\nIf `WeTextProcessing` or `pynini` fails to install from `requirements.txt`, install `pynini` first in the same environment, then install `WeTextProcessing`, remove `WeTextProcessing` from `requirements.txt`, and finally rerun `pip install -r requirements.txt`.\n\nWith Conda, we recommend:\n\n```bash\nconda install -c conda-forge pynini=2.1.6.post1 -y\npip install git+https:\u002F\u002Fgithub.com\u002FWhizZest\u002FWeTextProcessing.git\npip install -r requirements.txt\n```\n\nIf you are not using Conda, make sure you download a `pynini` wheel that matches your Python version and platform before installing `WeTextProcessing`. For a community-tested example, see [Issue #6](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTS-Nano\u002Fissues\u002F6).\n\n### Voice Clone with `infer.py`\n\nThis repository keeps the direct Python entrypoint for local inference. 
The example below uses **voice clone mode**, which is the main recommended workflow for MOSS-TTS-Nano.\n\n```bash\npython infer.py \\\n  --prompt-audio-path assets\u002Faudio\u002Fzh_1.wav \\\n  --text \"欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。\"\n```\n\nThis writes audio to `generated_audio\u002Finfer_output.wav` by default.\n\n### Local Web Demo with `app.py`\n\nYou can launch the local FastAPI demo for browser-based testing:\n\n```bash\npython app.py\n```\n\nThen open `http:\u002F\u002F127.0.0.1:18083` in your browser.\n\n\u003Ca id=\"onnx-cpu-version\">\u003C\u002Fa>\n\n### ONNX CPU Inference\n\nWe now strongly recommend trying the **ONNX CPU version** first for lightweight local deployment and CPU inference.\n\nThis version is designed to be more deployment-friendly while keeping the same core MOSS-TTS-Nano experience:\n\n- **No PyTorch dependency during inference**: it runs directly on ONNX Runtime CPU.\n- **Fully standalone CPU deployment**: suitable for local demos, services, and lightweight integration.\n- **Feature-complete voice cloning workflow**: supports direct reference audio input, built-in voices, and `Realtime Streaming Decode`.\n- **Faster in practice**: in our tests, processing efficiency is nearly **2x** that of the original version.\n- **Strong single-core usability**: on a **MacBook Air M4**, we observed smooth inference with only **1 CPU core**.\n\nThe ONNX entrypoints are `infer_onnx.py`, `app_onnx.py`, and the packaged CLI with `--backend onnx`.\n\nBy default, all ONNX commands use `--execution-provider cpu`, so existing commands keep the same CPU behavior. 
If you have an NVIDIA GPU and a compatible `onnxruntime-gpu` installation, you can opt in to CUDA with `--execution-provider cuda`.\n\nTo prepare a CUDA ONNX Runtime environment, replace the CPU ONNX Runtime wheel with the GPU wheel:\n\n```bash\npip uninstall -y onnxruntime\npip install \"onnxruntime-gpu>=1.20.0\"\n```\n\nIf `--model-dir` is omitted, the script automatically checks `.\u002Fmodels`. When the model files are missing, it downloads them on first run from:\n\n- [OpenMOSS-Team\u002FMOSS-TTS-Nano-100M-ONNX](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTS-Nano-100M-ONNX)\n- [OpenMOSS-Team\u002FMOSS-Audio-Tokenizer-Nano-ONNX](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-Audio-Tokenizer-Nano-ONNX)\n\nDownloaded files are stored under:\n\n- `models\u002FMOSS-TTS-Nano-100M-ONNX`\n- `models\u002FMOSS-Audio-Tokenizer-Nano-ONNX`\n\nExample:\n\n```bash\npython infer_onnx.py \\\n  --prompt-audio-path assets\u002Faudio\u002Fzh_1.wav \\\n  --text \"Welcome to the ONNX Runtime CPU demo.\"\n```\n\nOptional CUDA execution:\n\n```bash\npython infer_onnx.py \\\n  --execution-provider cuda \\\n  --prompt-audio-path assets\u002Faudio\u002Fzh_1.wav \\\n  --text \"Welcome to the ONNX Runtime CUDA demo.\"\n```\n\nCUDA execution requires `onnxruntime-gpu`. 
\n\nIf you already have the ONNX assets in another directory, pass it explicitly:\n\n```bash\npython infer_onnx.py \\\n  --model-dir \u002Fpath\u002Fto\u002Fmodels \\\n  --prompt-audio-path assets\u002Faudio\u002Fzh_1.wav \\\n  --text \"Welcome to the ONNX Runtime CPU demo.\"\n```\n\n### ONNX Local Web Demo with `app_onnx.py`\n\nYou can also launch the ONNX-backed local web demo:\n\n```bash\npython app_onnx.py\n```\n\nTo run the ONNX web demo with CUDA, start it with:\n\n```bash\npython app_onnx.py \\\n  --execution-provider cuda\n```\n\nThen open `http:\u002F\u002F127.0.0.1:18083` in your browser.\n\nThe first startup may spend extra time downloading assets if `models\u002F` does not contain the ONNX weights yet.\n\n### Export TTS-only ONNX Weights\n\nIf you retrain `MOSS-TTS-Nano`, you need to re-export the TTS-side ONNX weights. The exporter under [`onnx\u002F`](.\u002Fonnx) takes a local Hugging Face-format `MOSS-TTS-Nano` checkpoint and outputs a TTS-only ONNX model directory.\n\nExample:\n\n```bash\npython onnx\u002Fexport_hf_to_tts_onnx.py \\\n  --checkpoint-path \u002Fpath\u002Fto\u002FMOSS-TTS-Nano \\\n  --output-dir \u002Fpath\u002Fto\u002FMOSS-TTS-Nano-100M-ONNX\n```\n\nThe output directory contains:\n\n- `moss_tts_prefill.onnx`\n- `moss_tts_decode_step.onnx`\n- `moss_tts_local_decoder.onnx`\n- `moss_tts_local_cached_step.onnx`\n- `moss_tts_local_fixed_sampled_frame.onnx`\n- `moss_tts_global_shared.data`\n- `moss_tts_local_shared.data`\n- `tts_browser_onnx_meta.json`\n- `tokenizer.model`\n\nThis is intended for the ONNX deployment path only. 
Existing prompt audio codes produced by `MOSS-Audio-Tokenizer-Nano` do not need to be regenerated when the audio tokenizer stays fixed.\n\n### CLI Command: `moss-tts-nano generate`\n\nAfter `pip install -e .`, you can call the packaged CLI directly:\n\n```bash\nmoss-tts-nano generate \\\n  --prompt-speech assets\u002Faudio\u002Fzh_1.wav \\\n  --text \"欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。\"\n```\n\nFor the ONNX CPU backend, add `--backend onnx`:\n\n```bash\nmoss-tts-nano generate \\\n  --backend onnx \\\n  --prompt-speech assets\u002Faudio\u002Fzh_1.wav \\\n  --text \"欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。\"\n```\n\nTo opt in to CUDA:\n\n```bash\nmoss-tts-nano generate \\\n  --backend onnx \\\n  --execution-provider cuda \\\n  --prompt-speech assets\u002Faudio\u002Fzh_1.wav \\\n  --text \"欢迎关注模思智能、上海创智学院与复旦大学自然语言处理实验室。\"\n```\n\nUseful notes:\n\n- `moss-tts-nano generate` writes to `generated_audio\u002Fmoss_tts_nano_output.wav` by default.\n- `--prompt-speech` is the friendly alias for the reference audio path used by voice cloning.\n- `--text-file` is supported for long-form synthesis.\n- ONNX CUDA execution requires `onnxruntime-gpu`; without `--execution-provider cuda`, ONNX inference remains CPU-only.\n\n### CLI Command: `moss-tts-nano serve`\n\nYou can also launch the web demo through the packaged CLI:\n\n```bash\nmoss-tts-nano serve\n```\n\nFor the ONNX web demo:\n\n```bash\nmoss-tts-nano serve \\\n  --backend onnx\n```\n\nFor the ONNX web demo with CUDA:\n\n```bash\nmoss-tts-nano serve \\\n  --backend onnx \\\n  --execution-provider cuda\n```\n\nThis command forwards to the corresponding web app, keeps the model loaded in memory, and serves the local browser demo plus HTTP generation endpoints.\n\nFor server deployment with paged KV cache, streaming, and an OpenAI-compatible `\u002Fv1\u002Faudio\u002Fspeech` endpoint, please read the [vLLM-Omni MOSS-TTS-Nano 
README](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm-omni\u002Fblob\u002Fmain\u002Fexamples\u002Fonline_serving\u002Fmoss_tts_nano\u002FREADME.md).\n\n### Finetuning\n\nFinetuning tutorials are already provided.\n\nSee [.\u002Ffinetuning\u002FREADME.md](.\u002Ffinetuning\u002FREADME.md) for details.\n\n## MOSS-Audio-Tokenizer-Nano\n\n\u003Ca id=\"mat-intro\">\u003C\u002Fa>\n### Introduction\n**MOSS-Audio-Tokenizer** is the unified discrete audio interface for the entire MOSS-TTS family. It is built on the **Cat** (**C**ausal **A**udio **T**okenizer with **T**ransformer) architecture, a CNN-free audio tokenizer composed entirely of causal Transformer blocks. It serves as the shared audio backbone for MOSS-TTS, MOSS-TTS-Nano, MOSS-TTSD, MOSS-VoiceGenerator, MOSS-SoundEffect, and MOSS-TTS-Realtime, providing a consistent audio representation across the full product family.\n\nTo further improve perceptual quality while reducing inference cost, we trained **MOSS-Audio-Tokenizer-Nano**, a lightweight tokenizer with approximately **20 million parameters** designed for high-fidelity audio compression. It supports **48 kHz** input and output as well as **stereo audio**, which helps reduce compression loss and improve listening quality. 
It can compress **48 kHz stereo audio** into a **12.5 Hz** token stream and uses **RVQ with 16 codebooks**, enabling high-fidelity reconstruction across variable bitrates from **0.125 kbps to 2 kbps**.\n\n\nTo learn more about setup, advanced usage, and evaluation metrics, please visit the [MOSS-Audio-Tokenizer Repository](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-Audio-Tokenizer).\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fimages\u002Farch_moss_audio_tokenizer_nano.png\" alt=\"MOSS-Audio-Tokenizer-Nano architecture\" width=\"100%\" \u002F>\n  Architecture of MOSS-Audio-Tokenizer-Nano\n\u003C\u002Fp>\n\n### Model Weights\n\n| Model | Hugging Face | ModelScope |\n|:-----:|:------------:|:----------:|\n| **MOSS-Audio-Tokenizer-Nano** | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-Audio-Tokenizer-Nano) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-7B61FF?logo=modelscope&logoColor=white)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-Audio-Tokenizer-Nano) |\n\n\n### Evaluation Metrics\n\nThe table below compares the reconstruction quality of MOSS-Audio-Tokenizer-Nano with open-source audio tokenizers with **no more than 120M parameters** on speech, audio, and music data. As shown in the table and figure, MOSS-Audio-Tokenizer-Nano achieves the best overall reconstruction quality while remaining one of the smallest models in the comparison.\n\n- Speech metrics are evaluated on LibriSpeech test-clean (English) and AISHELL-2 (Chinese), reported as EN\u002FZH.\n- Audio metrics are evaluated on the AudioSet evaluation subset, while music metrics are evaluated on MUSDB, reported as audio\u002Fmusic.\n- STFT-Dist. denotes the STFT distance.\n- Higher is better for speech metrics, while lower is better for audio\u002Fmusic metrics (Mel-Loss, STFT-Dist.).\n- Ch. 
denotes the number of input\u002Foutput channels supported by the audio tokenizer: `ch=1` means mono audio, and `ch=2` means stereo audio.\n- Nvq denotes the number of quantizers.\n\n\u003Cbr>\n\u003Cp align=\"center\">\n    \u003Cimg src=\"assets\u002Fimages\u002Fevaluation_table_moss_audio_tokenizer_nano.png\" width=\"100%\"> \u003Cbr>\n    Reconstruction quality comparison of open-source audio tokenizers on speech and audio\u002Fmusic data.\n\u003C\u002Fp>\n\u003Cbr>\n\n### LibriSpeech Speech Metrics (MOSS-Audio-Tokenizer-Nano vs. Open-source Tokenizers)\n\nThe plots below compare MOSS-Audio-Tokenizer-Nano with other open-source audio tokenizers and codecs with **no more than 120M parameters** on the LibriSpeech dataset. The models are evaluated with SIM, STOI, PESQ-NB, and PESQ-WB, where higher values indicate better reconstruction quality.\nFor the same model, we control the bitrate by adjusting the number of RVQ codebooks used during inference.\n\n\u003Cbr>\n\u003Cp align=\"center\">\n    \u003Cimg src=\"assets\u002Fimages\u002Fevaluation_fig_moss_audio_tokenizer.png\" width=\"100%\"> \u003Cbr>\n    LibriSpeech reconstruction quality comparison across different bitrates.\n\u003C\u002Fp>\n\u003Cbr>\n\n\n\u003Ca id=\"moss-tts\">\u003C\u002Fa>\n## MOSS-TTS Family\n\n### Introduction\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fimages\u002Fmoss_tts_family.jpeg\" width=\"85%\" \u002F>\n\u003C\u002Fp>\n\nMOSS‑TTS Family is an open‑source **speech and sound generation model family** from [MOSI.AI](https:\u002F\u002Fmosi.cn\u002F#hero) and the [OpenMOSS team](https:\u002F\u002Fwww.open-moss.com\u002F). 
It is designed for **high‑fidelity**, **high‑expressiveness**, and **complex real‑world scenarios**, covering stable long‑form speech, multi‑speaker dialogue, voice\u002Fcharacter design, environmental sound effects, and real‑time streaming TTS.\n\nThe family currently includes:\n\n- **MOSS-TTS**: the flagship model for **high-fidelity zero-shot voice cloning**, **long-speech generation**, **fine-grained control over Pinyin, phonemes, and duration**, and **multilingual\u002Fcode-switched synthesis**.\n- **MOSS-TTS-Local-Transformer**: a smaller model in the family based on `MossTTSLocal`, designed to keep the MOSS-TTS style of speech generation in a lighter model size.\n- **MOSS-TTSD-v1.0**: a spoken dialogue generation model for **expressive**, **multi-speaker**, and **ultra-long** dialogue audio.\n- **MOSS-VoiceGenerator**: a voice design model that can generate diverse voices and speaking styles directly from **text prompts**, without reference speech.\n- **MOSS-SoundEffect**: a controllable sound generation model for natural ambience, city scenes, animals, human actions, and short music-like audio fragments.\n- **MOSS-TTS-Realtime**: a realtime speech model for low-latency voice agents, designed to keep replies natural, coherent, and voice-consistent across turns.\n\n\n\n### Released Models\n\n| Model | Architecture | Size | Hugging Face | ModelScope |\n|---|---|---:|---|---|\n| **MOSS-TTS** | `MossTTSDelay` | 8B | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTS) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-7B61FF?logo=modelscope&logoColor=white)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-TTS) |\n| **MOSS-TTS-Local-Transformer** | `MossTTSLocal` | 1.7B | [![Hugging 
Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTS-Local-Transformer) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-7B61FF?logo=modelscope&logoColor=white)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-TTS-Local-Transformer) |\n| **MOSS-TTSD-v1.0** | `MossTTSDelay` | 8B | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTSD-v1.0) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-7B61FF?logo=modelscope&logoColor=white)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-TTSD-v1.0) |\n| **MOSS-VoiceGenerator** | `MossTTSDelay` | 1.7B | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-VoiceGenerator) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-7B61FF?logo=modelscope&logoColor=white)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-VoiceGenerator) |\n| **MOSS-SoundEffect** | `MossTTSDelay` | 8B | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-SoundEffect) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-7B61FF?logo=modelscope&logoColor=white)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-SoundEffect) |\n| **MOSS-TTS-Realtime** | `MossTTSRealtime` | 1.7B | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-TTS-Realtime) | 
[![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-7B61FF?logo=modelscope&logoColor=white)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-TTS-Realtime) |\n\n## License\n\nThis repository will follow the license specified in the root `LICENSE` file. If you are reading this before that file is published, please treat the repository as **not yet licensed for redistribution**.\n\n## Citation\n\nIf you use the MOSS-TTS work in your research or product, please cite:\n\n```bibtex\n@misc{openmoss2026mossttsnano,\n  title={MOSS-TTS-Nano},\n  author={OpenMOSS Team},\n  year={2026},\n  howpublished={GitHub repository},\n  url={https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-TTS-Nano}\n}\n```\n\n```bibtex\n@misc{gong2026mossttstechnicalreport,\n  title={MOSS-TTS Technical Report},\n  author={Yitian Gong and Botian Jiang and Yiwei Zhao and Yucheng Yuan and Kuangwei Chen and Yaozhou Jiang and Cheng Chang and Dong Hong and Mingshu Chen and Ruixiao Li and Yiyang Zhang and Yang Gao and Hanfu Chen and Ke Chen and Songlin Wang and Xiaogui Yang and Yuqian Zhang and Kexin Huang and ZhengYuan Lin and Kang Yu and Ziqi Chen and Jin Wang and Zhaoye Fei and Qinyuan Cheng and Shimin Li and Xipeng Qiu},\n  year={2026},\n  eprint={2603.18090},\n  archivePrefix={arXiv},\n  primaryClass={cs.SD},\n  url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.18090}\n}\n```\n\n```bibtex\n@misc{gong2026mossaudiotokenizerscalingaudiotokenizers,\n  title={MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models}, \n  author={Yitian Gong and Kuangwei Chen and Zhaoye Fei and Xiaogui Yang and Ke Chen and Yang Wang and Kexin Huang and Mingshu Chen and Ruixiao Li and Qingyuan Cheng and Shimin Li and Xipeng Qiu},\n  year={2026},\n  eprint={2602.10934},\n  archivePrefix={arXiv},\n  primaryClass={cs.SD},\n  url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.10934}, \n}\n```\n\n## Star History\n\n[![Star History 
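Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=OpenMOSS\u002FMOSS-TTS-Nano&type=Date)](https:\u002F\u002Fstar-history.com\u002F#OpenMOSS\u002FMOSS-TTS-Nano&Date)\n","MOSS-TTS-Nano is an open-source multilingual tiny speech generation model developed by MOSI.AI and the OpenMOSS team. With only 0.1B parameters, it is designed for realtime speech generation, runs directly on CPU without a GPU, and is simple to deploy, making it suitable for local demos, web serving, and lightweight product integration. Its core technical features include support for multiple languages (such as Chinese and English), realtime speech synthesis, and streaming audio processing. The project also provides detailed API documentation and community support to help developers get started quickly.",2,"2026-06-11 02:38:21","CREATED_QUERY"]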