[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79192":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":14,"stars30d":15,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":16,"rankGlobal":9,"rankLanguage":9,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":14,"starSnapshotCount":14,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},79192,"asr-rescore-bench","okturro\u002Fasr-rescore-bench","okturro","Benchmark for LLM-based ASR n-best rescoring (ngram, neural-LM, MLM-PLL, LLM-prompt strategies).",null,"Python",221,972,6,0,186,58.96,"Other",false,"main",true,[],"2026-06-12 04:01:24","# ASR-Rescore-Bench: A Benchmark for LLM-Based ASR Rescoring\n\nCode for the ASR-Rescore-Bench paper.\n\n[Paper (preprint)](https:\u002F\u002Farxiv.org\u002Fabs\u002Fxxxx.xxxxx) · [Leaderboard](LEADERBOARD.md) · [Data card](docs\u002Fdata.md)\n\n## Abstract\n\nModern ASR systems often emit an n-best list per utterance.  Rescoring that\nlist with an external language model — historically an n-gram or a small\nneural LM — has consistently lowered WER.  With the rise of LLMs the natural\nquestion is whether a 7B-class instruction-tuned model can do better than a\nspecialised neural LM, and at what cost.  
We benchmark **eleven** rescoring\nstrategies (n-gram, masked-LM, causal-LM, instruction-tuned chat) across\n**five** corpora (LibriSpeech, AISHELL-1, FLEURS, CommonVoice-zh, internal\nmedical-dialogue), and find that prompt-based rescoring with mid-sized LLMs\nnarrows the WER gap by 40-60% on average — but only when the n-best list is\nlarge (n ≥ 10) and pre-cleaned for repetition.\n\n## Requirements\n\n```bash\npip install -e \".[full]\"\n```\n\n`kenlm` is optional (only required for the n-gram baseline).\n\n## Reproducing Results\n\nThe pipeline has three stages: **load** an n-best list, **rescore** with a\nstrategy, **score** the chosen-1-best against references.\n\n```bash\narb rescore --nbest data\u002Flibrispeech_test_clean_nbest10.jsonl \\\n            --strategy llm_prompt \\\n            --model Qwen\u002FQwen2.5-7B-Instruct \\\n            --output predictions\u002Fqwen2.5-7b__librispeech.jsonl\n\narb score predictions\u002Fqwen2.5-7b__librispeech.jsonl \\\n          --refs data\u002Flibrispeech_test_clean_refs.jsonl\n```\n\nTo reproduce **Table 3**:\n\n```bash\nbash scripts\u002Frepro_table3.sh\n```\n\n(This runs the full grid of strategy × corpus.  Roughly 14 GPU-hours on a single A100.)\n\n## Pre-trained Models\n\nWe don't ship any pre-trained model — every rescorer is built on top of an\noff-the-shelf LM (Qwen2.5, LLaMA-3, etc.).  
Cached n-best lists for the four\npublic corpora are available in the **release attachments** (see\n`docs\u002Fdata.md`).\n\n## Strategies\n\n| Strategy        | Module                   | Notes                                   |\n|-----------------|--------------------------|-----------------------------------------|\n| oracle          | `arb.strategies.oracle`  | upper bound — picks the lowest-WER hyp  |\n| ngram_lm        | `arb.strategies.ngram`   | kenlm ARPA, length-normalised log-prob  |\n| neural_lm       | `arb.strategies.neural`  | causal LM perplexity                    |\n| mlm_pll         | `arb.strategies.mlm`     | masked-LM pseudo-log-likelihood         |\n| llm_prompt      | `arb.strategies.llm`     | \"rank these hypotheses\" prompt          |\n| llm_listwise    | `arb.strategies.llm`     | one-shot listwise selection             |\n\n## Headline results\n\n| Strategy        | LibriSpeech-clean ↓ | AISHELL-1 (CER) ↓ |\n|-----------------|---------------------|--------------------|\n| 1-best baseline | 2.45                | 5.32               |\n| ngram_lm        | 2.21                | 5.18               |\n| neural_lm (GPT-2 medium) | 2.04        | 5.02               |\n| mlm_pll (BERT-large) | 1.98           | 4.94               |\n| llm_prompt (Qwen2.5-7B) | 1.78        | 4.61               |\n| llm_listwise (Qwen2.5-7B) | 1.74     | **4.55**           |\n| oracle (lower bound) | 0.62           | 1.41               |\n\n## Citation\n\n```bibtex\n@article{asrrescorebench,\n  title  = {{ASR-Rescore-Bench}: A Benchmark for LLM-based ASR Rescoring},\n  author = {Ruijie Tang},\n  year   = {2025},\n  journal = {arXiv preprint arXiv:xxxx.xxxxx}\n}\n```\n\n## License\n\nCode: Apache-2.0.  
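Data: per-corpus license (see `docs\u002Fdata.md`).\n","ASR-Rescore-Bench is a benchmark project for evaluating LLM-based n-best rescoring strategies for automatic speech recognition (ASR). It supports several rescoring methods, including n-gram, neural language model, masked-LM pseudo-log-likelihood, and instruction-tuned LLM prompting strategies, and has been tested on five different corpora. The comparison finds that prompt-based rescoring with mid-sized LLMs can significantly reduce word error rate (WER), with the strongest gains when the n-best list is large and has been pre-cleaned for repetition. The tool is suited to researchers and developers who need to improve the performance of existing ASR systems, especially in applications that demand higher accuracy.",2,"2026-06-01 03:48:11","CREATED_QUERY"]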