[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80678":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":12,"contributorsCount":12,"subscribersCount":12,"size":12,"stars1d":12,"stars7d":14,"stars30d":15,"stars90d":12,"forks30d":12,"starsTrendScore":12,"compositeScore":12,"rankGlobal":9,"rankLanguage":9,"license":16,"archived":17,"fork":17,"defaultBranch":18,"hasWiki":17,"hasPages":17,"topics":19,"createdAt":9,"pushedAt":9,"updatedAt":20,"readmeContent":21,"aiSummary":22,"trendingCount":12,"starSnapshotCount":12,"syncStatus":23,"lastSyncTime":24,"discoverSource":25},80678,"speech-eval-arena","vvt004\u002Fspeech-eval-arena","vvt004","A small CLI harness for evaluating speech LLMs and ASR models on standard benchmarks (LibriSpeech, FLEURS, VoxPopuli).",null,"Python",221,0,5,28,176,"MIT License",false,"main",[],"2026-06-12 02:04:05","# Speech Eval Arena — pit your speech model against a small but ornery test set\n\nA no-frills harness for evaluating speech LLMs and ASR models on a curated\nmix of public benchmarks. Designed to give you a single number per task\nwithout a full lab setup.\n\n## Why?\n\nMost ASR \u002F speech-LLM evals are bundled into giant frameworks that take a\nweekend to set up. I kept rewriting the same 200 lines of glue code for\nevery paper. 
This is that glue, packaged.\n\n## Install & Run\n\n```bash\npip install -e .\nsea run --model openai\u002Fwhisper-large-v3 --task librispeech-clean\nsea run --model nvidia\u002Fcanary-1b --task fleurs-zh\nsea list                          # see all tasks\nsea report runs\u002F                  # aggregate table\n```\n\n## Supported tasks\n\n| Task | Type | Metric |\n|------|------|--------|\n| `librispeech-clean` | ASR (English) | WER |\n| `librispeech-other` | ASR (English, noisy) | WER |\n| `fleurs-zh` | ASR (Mandarin) | CER |\n| `voxpopuli-en` | ASR (European English) | WER |\n| `spoken-squad` | Spoken QA | F1 |\n| `air-bench-foundation` | Sound understanding | Accuracy |\n\n## Configuration\n\nModels and tasks are described as plain YAML under `configs\u002F`. To add a new\nmodel:\n\n```yaml\n# configs\u002Fmodels\u002Fmy_model.yaml\nname: my-org\u002Fmy-model\nloader: transformers\nsampling_rate: 16000\nbatch_size: 8\ngeneration:\n  max_new_tokens: 256\n  do_sample: false\n```\n\n## Examples\n\nEvaluate Whisper-large-v3 on LibriSpeech clean and other in one shot:\n\n```bash\nsea run --model openai\u002Fwhisper-large-v3 --task librispeech-clean,librispeech-other\nsea report runs\u002F --format markdown\n```\n\nCompare two models side by side:\n\n```bash\nsea compare runs\u002Fwhisper-large-v3 runs\u002Fcanary-1b --task librispeech-clean\n```\n\n## License\n\nMIT.\n","speech-eval-arena is a lightweight command-line tool for evaluating speech LLMs and automatic speech recognition (ASR) models on standard benchmarks. It supports multiple tasks, including but not limited to LibriSpeech, FLEURS, and VoxPopuli, and reports metrics such as word error rate (WER), character error rate (CER), and F1 score. Its simple design reduces setup complexity, so users can quickly get a single number per task without a full lab setup. It suits researchers and developers who need to quickly validate or compare speech models.",2,"2026-06-11 04:01:37","CREATED_QUERY"]