[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80144":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":13,"contributorsCount":13,"subscribersCount":13,"size":13,"stars1d":13,"stars7d":15,"stars30d":16,"stars90d":13,"forks30d":13,"starsTrendScore":13,"compositeScore":17,"rankGlobal":10,"rankLanguage":10,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":19,"hasPages":19,"topics":21,"createdAt":10,"pushedAt":10,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":13,"starSnapshotCount":13,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},80144,"speechkv-trim","jelllott\u002Fspeechkv-trim","jelllott","Speech-aware KV cache pruning for long-form speech LLMs (Qwen2-Audio, SALMONN). Token\u002Fhead\u002Fchunk-level pruners + eval on LibriSpeech-long & GigaSpeech.","",null,"Python",219,0,4,48,165,70,"Other",false,"main",[],"2026-06-12 04:01:26","# Hush KV: Speech-Aware KV Cache Pruning\n\n> Companion code for our group's ongoing work on long-context speech LLMs.\n> Most decoder-only speech LLMs blow up the KV cache once you push past\n> ~30s of audio prefix. We look at *what* lives in those caches and how to\n> drop it without hurting downstream ASR \u002F SQA performance.\n\nThis repo contains the reference implementation for the \"Hush KV\" pruning\nfamily (token-level + head-level + chunk-level) and the small evaluation\nharness we use on LibriSpeech-long, GigaSpeech-test, and an in-house\nspoken QA split. The code is organized so each pruner is a single file\nunder `speechkv_trim\u002Fpruners\u002F` and is selectable by name from the CLI.\n\n## TL;DR\n\nSpeech tokenization is *dense*. With a typical 50 Hz audio tokenizer, 60s\nof audio is 3000 tokens before the text prompt even starts. 
Naive sliding\nwindow cache eviction drops too many \"anchor\" frames (silence + sentence\nboundaries) and hurts disfluency \u002F repetition handling. We score keys\nusing a weighted mix of attention recency and a small acoustic-saliency\nhead, then evict per layer with a budget.\n\nYou should be able to reproduce the LibriSpeech-long numbers in the paper\ndraft (`docs\u002Fdraft.md`) on a single A100-40G. The GigaSpeech runs need\n2x A100s for the larger backbones.\n\n## Why \"Hush KV\"?\n\nBecause most of the speech KV cache is silence, breathing, and \"uhm\".\nWe hush it.\n\n## Status\n\n| component               | state         | notes                                |\n|-------------------------|---------------|---------------------------------------|\n| token-level pruner      | stable        | default in CLI                        |\n| head-level pruner       | stable        |                                       |\n| chunk-level pruner      | beta          | needs better boundary detector        |\n| streaming evictor       | experimental  | works for Qwen2-Audio, breaks on SALMONN sometimes |\n| eval harness            | stable        | LibriSpeech-long, GigaSpeech, in-house |\n| paper                   | drafting      | see `docs\u002Fdraft.md`                   |\n\n## Install\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fjelllott\u002Fspeechkv-trim\ncd speechkv-trim\npython -m venv .venv && source .venv\u002Fbin\u002Factivate\npip install -e \".[dev]\"\n```\n\nTested with Python 3.10 \u002F 3.11. Requires `torch>=2.1`, `transformers>=4.40`,\n`torchaudio>=2.1`, `soundfile`, and `librosa` for the eval scripts. 
The\nstreaming evictor uses `flash-attn>=2.5` if it's installed and falls back\nto SDPA otherwise (slower, slightly different numbers).\n\n## Quickstart\n\nRun token-level pruning on a single audio file with a Qwen2-Audio\nbackbone:\n\n```bash\npython -m speechkv_trim.cli prune \\\n    --model Qwen\u002FQwen2-Audio-7B-Instruct \\\n    --audio examples\u002Flong_lecture.wav \\\n    --pruner token_attn_recency \\\n    --budget 1024 \\\n    --out out\u002Ftranscript.json\n```\n\nReproduce the paper Table 2 (LibriSpeech-long, all pruners):\n\n```bash\nbash scripts\u002Freproduce_table2.sh\n```\n\nThis will download the prepared metadata (~6 MB) and stream audio from a\nlocal LibriSpeech mirror. See `docs\u002Fdata.md` for the expected layout.\n\n## Pruners\n\nAll pruners implement the `BasePruner` interface in\n`speechkv_trim\u002Fpruners\u002Fbase.py`:\n\n```python\nclass BasePruner:\n    def score(self, k, v, attn_weights, meta): ...\n    def select(self, scores, budget): ...\n    def apply(self, k, v, kept_idx): ...\n```\n\nCurrently shipped:\n\n- `token_attn_recency` — weighted mix of mean attention received and\n  position recency. Cheap, surprisingly strong baseline.\n- `token_saliency` — adds a small acoustic-saliency head (trained on\n  forced-aligned silence\u002Fvoice labels) to the recency score.\n- `head_l1` — drops *whole heads* per layer based on attention L1 norm.\n  Combines well with token-level pruning.\n- `chunk_vad` — chunk-level pruning gated by Silero-VAD boundaries.\n- `streaming_anchor` — incremental, anchored on sentence-start tokens\n  from the text side. 
Use with `--stream`.\n\nYou can register your own pruner via the `speechkv_trim.pruners` entry\npoint group (see `pyproject.toml`).\n\n## Repo layout\n\n```\nspeechkv_trim\u002F\n  cli.py                 # argparse front end\n  models\u002F                # backbone wrappers\n    qwen2_audio.py\n    salmonn.py\n    whisper_llm.py       # ASR-only baseline\n  pruners\u002F\n    base.py\n    token_attn_recency.py\n    token_saliency.py\n    head_l1.py\n    chunk_vad.py\n    streaming_anchor.py\n  eval\u002F\n    librispeech_long.py\n    gigaspeech.py\n    in_house_sqa.py\n    metrics.py\n  utils\u002F\n    audio.py\n    cache.py             # KV-cache surgery primitives\n    profile.py\ndocs\u002F\n  draft.md\n  data.md\n  pruners.md\nscripts\u002F\n  reproduce_table2.sh\n  reproduce_table3.sh\ntests\u002F\n```\n\n## Reproducibility\n\nWe seed `torch`, `numpy`, and `random` from the `--seed` flag; results\nshould be bit-identical with `flash-attn` off. With `flash-attn` on you\nget small ULP-level deltas; the paper numbers are reported with\n`flash-attn` on because that's what we actually use.\n\n## Citing\n\nIf this code is useful for your work, please cite the (in-progress)\npreprint; see `docs\u002Fdraft.md` for the current BibTeX stub. We'll update\nthis section once the arXiv ID is live.\n\n## License\n\nApache-2.0. See `LICENSE`.\n\n## Acknowledgements\n\nThanks to the WUT MIIT group for compute, and to the SALMONN authors\nfor releasing checkpoints under a permissive license.\n","This project studies KV cache pruning for speech LLMs (such as Qwen2-Audio and SALMONN) on long audio sequences. It provides token-, head-, and chunk-level pruners and evaluates them on the LibriSpeech-long and GigaSpeech datasets. Its core functionality scores keys using attention recency and a small acoustic-saliency head, then evicts entries layer by layer under a budget. The technique is particularly suited to applications that must handle long audio input under memory constraints, such as automatic speech recognition (ASR) or spoken question answering (SQA); it can effectively reduce the KV cache growth caused by audio length without hurting downstream task performance.",2,"2026-06-11 03:59:26","CREATED_QUERY"]