[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80663":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":12,"contributorsCount":12,"subscribersCount":12,"size":12,"stars1d":12,"stars7d":14,"stars30d":15,"stars90d":12,"forks30d":12,"starsTrendScore":12,"compositeScore":16,"rankGlobal":9,"rankLanguage":9,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":18,"topics":20,"createdAt":9,"pushedAt":9,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":12,"starSnapshotCount":12,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},80663,"whisperkv","pulgog\u002Fwhisperkv","pulgog","KV-cache compression for Whisper-family speech models. Drop-in patch, three eviction policies.",null,"Python",221,0,5,37,174,68.5,"MIT License",false,"main",[],"2026-06-12 04:01:29","\u003Cdiv align=\"center\">\n\n# WhisperKV ⚡\n\n**Lightweight KV-cache compression for Whisper-style speech models.**\n\n[![Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.9+-blue?style=for-the-badge)](https:\u002F\u002Fpython.org)\n[![License: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-green?style=for-the-badge)](LICENSE)\n[![PyTorch](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpytorch-2.1+-ee4c2c?style=for-the-badge&logo=pytorch&logoColor=white)](https:\u002F\u002Fpytorch.org)\n\n[Install](#install) · [Usage](#usage) · [How It Works](#how-it-works) · [Benchmarks](#benchmarks)\n\n\u003C\u002Fdiv>\n\n---\n\nWhisperKV is a tiny library that wraps `transformers.WhisperForConditionalGeneration` (and similar encoder-decoder speech models) with a configurable KV-cache eviction policy. 
The goal is to shrink decoder memory during long-form generation without retraining the model, so you can stream ASR or speech-LLM inference on smaller GPUs.\n\nThe default policy is a simple \"heavy-hitter\" rule inspired by H2O: keep the most recent tokens plus the top-k tokens by accumulated attention weight. There's also a magnitude rule and a fixed-window rule for ablation.\n\n## Install\n\n```bash\npip install whisperkv\n```\n\nOr from source:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fpulgog\u002Fwhisperkv\ncd whisperkv\npip install -e .\n```\n\n## Usage\n\n```python\nimport torch\nfrom transformers import WhisperProcessor, WhisperForConditionalGeneration\nfrom whisperkv import patch, EvictionConfig\n\nproc  = WhisperProcessor.from_pretrained(\"openai\u002Fwhisper-small\")\nmodel = WhisperForConditionalGeneration.from_pretrained(\"openai\u002Fwhisper-small\")\n\n# wrap with a heavy-hitter policy that keeps the last 64 tokens\n# plus the top-32 by accumulated attention weight\npatch(model, EvictionConfig(policy=\"heavy_hitter\", recent=64, heavy=32))\n\naudio, sr = ...  # 30 s of waveform at 16 kHz\ninputs = proc(audio, sampling_rate=sr, return_tensors=\"pt\")\nids = model.generate(**inputs, max_new_tokens=440)\nprint(proc.batch_decode(ids, skip_special_tokens=True))\n```\n\nThat's the whole API. `patch()` monkey-patches the decoder layers in place; you can call it once and forget about it.\n\n## How It Works\n\nThe Whisper decoder keeps two caches per layer: a self-attention K\u002FV over previously generated tokens, and a cross-attention K\u002FV over encoder outputs. For long generations the self-attention cache dominates. WhisperKV intercepts each layer's `forward` to maintain a small \"evictable\" view of that cache:\n\n1. Track running attention weights (sum over heads, decayed).\n2. On every step, if the cache exceeds the budget, drop the lowest-weight token *outside* the protected window.\n3. 
Forward only the surviving K\u002FV.\n\nThe cross-attention cache is left alone: it's keyed on a fixed-size encoder output and isn't the bottleneck.\n\n```\n                         budget = recent + heavy\n                    ┌───────────────────────────────┐\n                    ▼                               ▼\n  [t-N, ..., t-recent-1] | [t-recent, ..., t]   keep all\n   ↑ scored & evicted    | protected window\n```\n\n## Benchmarks\n\nSetup: Whisper-large-v3 on a single A100, 30-second clips, greedy decoding.\n\n| Cache policy | Decoder peak mem | WER on test-clean | Throughput |\n|--------------|-----------------:|------------------:|-----------:|\n| full         | 1.00x | 4.1% | 1.00x |\n| fixed window (128) | 0.34x | 6.8% | 1.21x |\n| heavy_hitter (64+32) | 0.41x | 4.4% | 1.18x |\n| heavy_hitter (32+16) | 0.22x | 5.1% | 1.27x |\n\nNumbers are approximate; see [`benchmarks\u002F`](benchmarks\u002F) for the scripts that produced them.\n\n## License\n\nMIT; see [LICENSE](LICENSE).\n","WhisperKV is a lightweight library that provides KV-cache compression for Whisper-like speech models. It applies a configurable KV-cache eviction policy to reduce the decoder memory needed during long-form generation without retraining the model, which makes it particularly suited to streaming automatic speech recognition or speech-LLM inference on smaller GPUs. Its core features are three eviction policies: the default \"heavy-hitter\" rule, a magnitude rule, and a fixed-window rule, so users can pick whichever fits their needs. The project is written in Python and depends on PyTorch 2.1+. To use it, call the `patch()` function to wrap the model and it takes effect immediately.",2,"2026-06-11 04:01:32","CREATED_QUERY"]