[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79258":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":14,"stars30d":15,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":16,"rankGlobal":9,"rankLanguage":9,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":18,"topics":20,"createdAt":9,"pushedAt":9,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":14,"starSnapshotCount":14,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},79258,"speech-tokenizer-arena","andraiming\u002Fspeech-tokenizer-arena","andraiming","A side-by-side benchmarking playground for discrete speech tokenizers (EnCodec, HuBERT-units, SpeechTokenizer, etc.).",null,"Python",222,9209,6,0,146,60,"Other",false,"main",[],"2026-06-12 04:01:24","\u003Cdiv align=\"center\">\n\n# Speech Tokenizer Arena 🏟️\n\n**Drop in a tokenizer, get a leaderboard.**\n\n![Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.9+-3776AB?style=for-the-badge&logo=python&logoColor=white)\n![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-green?style=for-the-badge)\n![PyTorch](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpytorch-2.0+-EE4C2C?style=for-the-badge&logo=pytorch&logoColor=white)\n![Audio](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdomain-speech%20audio-orange?style=for-the-badge)\n\n[Install](#install) · [Usage](#quick-start) · [Tokenizers](#supported-tokenizers) · [Results](#sample-results) · [Citation](#citation)\n\n\u003C\u002Fdiv>\n\n---\n\n## Overview\n\nDiscrete speech tokenizers (EnCodec, DAC, SpeechTokenizer, HuBERT k-means units, …) are everywhere in\n2024-era speech models, but comparing them honestly is a pain: every paper picks its own dataset,\nbitrate, and metric set. 
**Speech Tokenizer Arena** is a small, opinionated harness that runs the same\naudio through several tokenizers and produces a single markdown leaderboard.\n\nThe goal is *side-by-side* numbers — reconstruction quality (mel-SD, SI-SDR, STOI, PESQ),\neffective bitrate, and a downstream ASR-WER proxy on the reconstructed waveforms. Pick the\ntokenizer that fits your downstream budget, not the one with the prettiest demo page.\n\n## Supported Tokenizers\n\n| Tokenizer | Type | Sample rate | Bitrate | Source |\n|---|---|---|---|---|\n| `encodec` | acoustic RVQ | 24 kHz | 1.5 – 24 kbps | HF `transformers` |\n| `dac` | acoustic RVQ | 16 \u002F 24 \u002F 44 kHz | ~8 kbps | `descript-audio-codec` |\n| `speechtokenizer` | semantic + acoustic | 16 kHz | 4 kbps | ZhangXInFD\u002FSpeechTokenizer |\n| `hubert-units` | semantic only | 16 kHz | ~0.4 kbps | HF HuBERT + k-means |\n\nSee [`docs\u002Ftokenizers.md`](docs\u002Ftokenizers.md) for adding your own.\n\n## Install\n\n```bash\npip install speech-tokenizer-arena\n\n# with optional metric backends (PESQ, STOI) and ASR\npip install \"speech-tokenizer-arena[metrics,asr]\"\n\n# from source\ngit clone https:\u002F\u002Fgithub.com\u002Fandraiming\u002Fspeech-tokenizer-arena\ncd speech-tokenizer-arena\npip install -e \".[all]\"\n```\n\n## Quick Start\n\nCompare two tokenizers on a single utterance:\n\n```python\nimport torchaudio\nfrom sta.tokenizers.encodec_wrap import EncodecTokenizer\nfrom sta.tokenizers.dac_wrap import DACTokenizer\nfrom sta.metrics.reconstruction import all_recon_metrics\n\nwav, sr = torchaudio.load(\"sample.flac\")\nwav = wav.mean(0)\n\nfor tok in [EncodecTokenizer(bandwidth=6.0), DACTokenizer(model_type=\"24khz\")]:\n    codes = tok.encode(wav, sr)\n    recon = tok.decode(codes)\n    print(tok.info.name, all_recon_metrics(wav.unsqueeze(0), recon, sr=tok.info.sample_rate))\n```\n\n## Arena Mode\n\nRun a full sweep from a YAML config:\n\n```bash\nsta run --config configs\u002Fall.yaml --out 
outputs\u002Frun1\u002F\n```\n\nSample console output:\n\n```\n[arena] evaluating encodec[6.0kbps] ...\n[arena] evaluating dac-24khz ...\n[arena] evaluating hubert-units[L6\u002Fk200] ...\n[report] wrote outputs\u002Frun1\u002Freport.md\n```\n\nThe report is a markdown table you can drop straight into a paper appendix.\n\n## Metrics\n\n- **mel-SD** — log-mel spectral distortion, the cheapest sanity check.\n- **SI-SDR** — scale-invariant signal-to-distortion ratio (dB).\n- **STOI** — short-time objective intelligibility (0–1, higher is better).\n- **PESQ** — perceptual evaluation of speech quality (wide-band mode at 16 kHz).\n- **WER (downstream)** — Whisper-tiny WER on the reconstructed audio vs. the reference transcript.\n- **bitrate_kbps** — effective bits-per-second of the discrete code stream.\n\n## Sample Results\n\nLibriSpeech `test-clean`, 100 utterances, 10 s max each, single A6000:\n\n| tokenizer | bitrate (kbps) | mel-SD ↓ | SI-SDR ↑ | STOI ↑ | PESQ ↑ | WER ↓ |\n|---|---|---|---|---|---|---|\n| encodec[1.5] | 1.5 | 1.27 | 1.4 | 0.81 | 1.91 | 18.4% |\n| encodec[6.0] | 6.0 | 0.78 | 6.9 | 0.92 | 2.84 | 7.1% |\n| dac-24khz | 8.0 | 0.71 | 8.2 | 0.94 | 3.07 | 5.8% |\n| speechtokenizer | 4.0 | 0.86 | 5.5 | 0.91 | 2.66 | 6.4% |\n| hubert-units (k=200) | 0.4 | 1.65 | — | — | — | 11.2% |\n\nNumbers are illustrative — re-run on your own audio before trusting them.\n\n## Limitations\n\n- The `pesq` Python binding segfaults intermittently on Apple Silicon; we fall back to `NaN` and keep going.\n- HuBERT units are encoder-only here — you need a unit vocoder (e.g. unit-HiFiGAN) for true reconstruction.\n  Until that lands, decode-side metrics are skipped for that row.\n- DAC and SpeechTokenizer ship under different licenses than this repo. 
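Read their cards before\n  publishing benchmarks.\n\n## Citation\n\n```bibtex\n@software{an2024speechtokenizerarena,\n  author = {An, Jiaming},\n  title  = {Speech Tokenizer Arena: a benchmarking harness for discrete speech tokenizers},\n  year   = {2024},\n  url    = {https:\u002F\u002Fgithub.com\u002Fandraiming\u002Fspeech-tokenizer-arena}\n}\n```\n\n## License\n\nMIT — see [LICENSE](LICENSE).\n","Speech Tokenizer Arena is a benchmarking tool for comparing the performance of different discrete speech tokenizers (such as EnCodec, HuBERT-units, and SpeechTokenizer). Its core function is to run the same audio sample through multiple tokenizers and generate a leaderboard covering reconstruction quality (e.g. mel-SD, SI-SDR, STOI, PESQ), effective bitrate, and downstream automatic speech recognition error rate. The tool is written in Python and uses PyTorch to compare tokenizers across multiple sample rates and bitrate settings. It suits researchers and developers who need to choose a speech tokenizer for a specific application, for example when building low-latency communication systems or efficient storage solutions.",2,"2026-06-01 03:48:15","CREATED_QUERY"]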