[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-77940":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":14,"forks30d":14,"starsTrendScore":18,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":14,"starSnapshotCount":14,"syncStatus":15,"lastSyncTime":28,"discoverSource":29},77940,"yapsnap","kouhxp\u002Fyapsnap","kouhxp","Snap any video URL or audio file into plaintext. No GPU. No cloud. One command.",null,"Python",265,10,3,0,2,5,99,7,59.52,"Apache License 2.0",false,"main",true,[],"2026-06-12 04:01:22","# yapsnap\n\n> **Snap any video URL or audio file into plaintext. No GPU. No cloud. One command.**\n\n![Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.9+-blue) ![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache--2.0-green) ![Platforms](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fplatforms-macOS%20%7C%20Linux%20%7C%20Windows-lightgrey)\n\n```bash\nyapsnap \"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=dQw4w9WgXcQ\"\n```\n\nThat's it. You get a `.txt` in `.\u002Ftranscripts\u002F` under your current directory, transcribed on your CPU, in less time than it took the video to play.\n\n---\n\n## Why yapsnap\n\n- ⚡ **Fast on CPU.** Streaming Zipformer transducer (Kroko English) chews through audio at several times realtime on a laptop. No CUDA. No M-series-only tricks. Plain old cores.\n- 🌐 **Any video URL, plus local files.** YouTube. X. TikTok. Instagram Reels. Direct `.mp4`\u002F`.mp3` links. Or just point it at a file on disk. 
yt-dlp handles the fetch, ffmpeg handles the decode, the rest is yours.\n- 📴 **Offline after first run.** ~80 MB model downloads once to your cache and stays there. No API keys. No quotas. Your audio never leaves your machine.\n- 🪶 **Lean deps.** `sherpa-onnx`, `numpy`, `yt-dlp` — that's the whole runtime, diarization included. No PyTorch, no cloud SDKs.\n- 🗣 **Ten-plus languages.** English out of the box; French, German, Spanish, Italian, Portuguese, Dutch, Swedish, Swiss German, Hebrew, and Turkish are a one-line `--model` swap away. See [Other languages](#other-languages).\n- ⏱ **Sentence-level timestamps when you want them.** `--timestamps` adds `[MM:SS]` per sentence using Kroko's built-in punctuation. Timing stays correct even when you transcribe at 2x.\n- 🗣️ **Speaker labels, optional.** `--diarize` answers \"who spoke when\" and prefixes each line with `SPEAKER_00`, `SPEAKER_01`, … Still CPU-only, still ONNX — no PyTorch, no extra runtime deps. See [Diarization](#diarization).\n\n---\n\n## Quickstart\n\n```bash\n# 1. ffmpeg on PATH (one-time, per OS — see below)\n# 2. Install (from PyPI, or `pip install .` from a clone)\npip install yapsnap\n\n# 3. Snap something\nyapsnap https:\u002F\u002Fwww.tiktok.com\u002F@user\u002Fvideo\u002F7234567890123456789\nyapsnap meeting.mp4 --timestamps\nyapsnap interview.mp3 --diarize          # label speakers\nyapsnap podcast.mp3 -o ~\u002Fnotes\u002Fepisode.txt\n```\n\nThe first run downloads the model (~80 MB). Every run after is offline.\n\n---\n\n## What it handles\n\nAny URL `yt-dlp` understands works. 
The big ones:\n\n| Source            | Example                                                 |\n|-------------------|---------------------------------------------------------|\n| YouTube           | `https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=...`                   |\n| YouTube Shorts    | `https:\u002F\u002Fwww.youtube.com\u002Fshorts\u002F...`                    |\n| X \u002F Twitter       | `https:\u002F\u002Fx.com\u002Fuser\u002Fstatus\u002F...\u002Fvideo\u002F1`                 |\n| TikTok            | `https:\u002F\u002Fwww.tiktok.com\u002F@user\u002Fvideo\u002F...`                |\n| Instagram Reels   | `https:\u002F\u002Fwww.instagram.com\u002Freel\u002F...\u002F`                   |\n| Direct media URL  | `https:\u002F\u002Fexample.com\u002Fclip.mp4`                          |\n\nPlus any local file ffmpeg can decode: `.mp3`, `.mp4`, `.m4a`, `.wav`, `.webm`, `.mov`, `.mkv`, `.aac`, `.opus`, `.ogg`, `.flac`, and friends.\n\n---\n\n## Install\n\n### 1. ffmpeg\n\n| OS      | Command                                                  |\n|---------|----------------------------------------------------------|\n| macOS   | `brew install ffmpeg`                                    |\n| Linux   | `sudo apt install ffmpeg` *or* `sudo dnf install ffmpeg` |\n| Windows | `winget install ffmpeg` *or* `choco install ffmpeg`      |\n\n### 2. 
yapsnap\n\nFrom PyPI (recommended):\n\n```bash\npip install yapsnap\n```\n\nFrom source:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fkouhxp\u002Fyapsnap\ncd yapsnap\npip install .\n```\n\nInstalls two equivalent commands on your `PATH`: **`yapsnap`** (canonical) and **`transcribe`** (alias, for when the name slips your mind).\n\n---\n\n## Usage\n\n```bash\n# Local file\nyapsnap path\u002Fto\u002Faudio.mp3\n\n# Any video URL\nyapsnap \"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=dQw4w9WgXcQ\"\n\n# Sentence-level timestamps\nyapsnap input.mp4 --timestamps\n\n# Speaker labels (\"who spoke when\")\nyapsnap interview.mp3 --diarize\n\n# Speaker labels with a known speaker count (more reliable than auto-detect)\nyapsnap call.mp3 --diarize --num-speakers 2\n\n# Custom output path\nyapsnap input.mp4 -o .\u002Ftranscripts\u002Ftalk.txt\n\n# Don't speed audio up before transcribing (default is 1.5x, pitch preserved)\nyapsnap input.mp4 --speed 1.0\n\n# Keep the downloaded audio (URL inputs only)\nyapsnap \"https:\u002F\u002F...\" --keep-audio\n```\n\n---\n\n## Output\n\nPlaintext, UTF-8. Default location is `.\u002Ftranscripts\u002F` (created if missing) under the current working directory; override with `-o`. For URL inputs the filename is derived from the video ID (`dQw4w9WgXcQ_transcript.txt`, etc.).\n\n**Without `--timestamps`** — one paragraph of recognized text:\n\n```\nWelcome to the show. Today we're talking about transcription. 
Let's get started.\n```\n\n**With `--timestamps`** — one sentence per line, timed against the original audio:\n\n```\n[00:00] Welcome to the show.\n[00:03] Today we're talking about transcription.\n[00:08] Let's get started.\n```\n\nTimestamps stay in original-audio time even at `--speed 1.5` or higher.\n\n**With `--diarize`** — one sentence per line, each tagged with a speaker and timestamp:\n\n```\nSPEAKER_00 [00:00]: Welcome to the show.\nSPEAKER_01 [00:03]: Glad to be here, thanks for having me.\nSPEAKER_00 [00:08]: Let's get started.\n```\n\nSpeaker numbers are assigned in order of appearance and are stable within a single run, but they carry no identity across files — `SPEAKER_00` in one transcript is unrelated to `SPEAKER_00` in another.\n\n---\n\n## Flags\n\n| Flag              | Description                                                          |\n|-------------------|----------------------------------------------------------------------|\n| `-o`, `--output`  | Output `.txt` path. Default: `.\u002Ftranscripts\u002F\u003Cinput>_transcript.txt`. |\n| `--timestamps`    | Emit `[MM:SS] sentence.` lines instead of a single paragraph.        |\n| `--diarize`       | Label speakers (`SPEAKER_00 [MM:SS]: …`). Implies `--timestamps`.     |\n| `--diarize-model` | Segmentation model: `pyannote` (default) or `reverb`. See below.     |\n| `--num-speakers`  | Known speaker count for `--diarize`. Default `-1` (auto-detect).     |\n| `--speed`         | Pre-transcription speedup factor, pitch preserved. Default `1.5`.    |\n| `--keep-audio`    | Keep the downloaded audio (URL inputs only).                         |\n| `--model`         | Override the model directory. Also reads `KROKO_MODEL` env var.      |\n\n---\n\n## How it works\n\n1. **Fetch.** If the input is a URL, `yt-dlp` grabs the best audio-only stream to a temp directory. If it's a local path, this step is skipped.\n2. **Decode.** `ffmpeg` pipes the media into 16 kHz mono PCM. 
The optional `atempo` filter speeds it up without raising pitch.\n3. **Recognize.** A streaming Zipformer2 transducer (Kroko English, INT8 ONNX, ~80 MB) eats the PCM in chunks. CPU-only. Greedy decode.\n4. **Format.** Plain text by default. With `--timestamps`, token timestamps are grouped on `.!?` into sentences and scaled back to original-audio time.\n\nWith `--diarize`, a second pass runs the audio (decoded at original speed) through a speaker-segmentation model and a speaker-embedding model, clusters the voiceprints into speakers, and tags each sentence with the speaker active at its start. All ONNX, all CPU.\n\nNo frame is sent anywhere. No state is kept between runs except the cached model.\n\n---\n\n## Model & cache\n\nThe default Kroko English model is downloaded on first run to:\n\n- **macOS** — `~\u002FLibrary\u002FCaches\u002Fyapsnap\u002F`\n- **Linux** — `$XDG_CACHE_HOME\u002Fyapsnap\u002F` (or `~\u002F.cache\u002Fyapsnap\u002F`)\n- **Windows** — `%LOCALAPPDATA%\\yapsnap\\`\n\nTo use a different streaming transducer (other languages, larger Kroko variants, etc.), point `--model` at a directory containing `encoder(.int8).onnx`, `decoder(.int8).onnx`, `joiner(.int8).onnx`, and `tokens.txt`. Or set `KROKO_MODEL` in your environment.\n\nIf you use `--diarize`, the segmentation and embedding models download to a `diarization-models\u002F` subfolder of the same cache directory on first use, and are reused offline thereafter.\n\n---\n\n## Other languages\n\nThe default model is English, but yapsnap isn't limited to it. To transcribe another language, just download the matching model and point yapsnap at it — no code changes, no reinstall.\n\nKroko publishes streaming models for a growing list of languages on Hugging Face: \u003Chttps:\u002F\u002Fhuggingface.co\u002FBanafo\u002FKroko-ASR\u002Ftree\u002Fmain>. 
As of now that includes:\n\n- Dutch\n- French\n- German\n- Hebrew\n- Italian\n- Portuguese\n- Spanish\n- Swedish\n- Swiss German\n- Turkish\n\nDownload the one you need, unpack it into its own folder, and run:\n\n```bash\n# Per-run: pass the model folder explicitly\nyapsnap interview.mp3 --model \u002Fpath\u002Fto\u002Fkroko-french\n\n# Or set it once as your default for the session\nexport KROKO_MODEL=\u002Fpath\u002Fto\u002Fkroko-french\nyapsnap interview.mp3\n```\n\nEach model is single-language, so to work across several languages keep them in separate folders and switch with `--model` (or re-export `KROKO_MODEL`) as you go. Any other sherpa-onnx streaming transducer with the standard `encoder` \u002F `decoder` \u002F `joiner` \u002F `tokens.txt` layout works too, not just the Kroko ones.\n\n---\n\n## Diarization\n\n`--diarize` adds speaker labels to the transcript — \"who spoke when\" — so each line is prefixed with `SPEAKER_00`, `SPEAKER_01`, and so on:\n\n```bash\nyapsnap interview.mp3 --diarize\n```\n\n```\nSPEAKER_00 [00:00]: Welcome to the show.\nSPEAKER_01 [00:03]: Glad to be here, thanks for having me.\nSPEAKER_00 [00:08]: Let's get started.\n```\n\nIt stays true to yapsnap's design: CPU-only, ONNX, no PyTorch, no extra runtime dependencies beyond the `sherpa-onnx` you already have. Two small models download once on first use (a speaker-segmentation model plus a speaker-embedding model) and cache alongside the ASR model.\n\n### How the labels are produced\n\n`--diarize` implies `--timestamps` — the two share a clock. Transcription runs on the sped-up audio as usual, while diarization runs on the same source decoded at original speed (`1.0x`), because speeding audio up degrades both speaker-boundary detection and the voiceprint embeddings. 
Each transcript sentence is then matched to whichever speaker was active at its start time.\n\nBecause diarization needs sentence timestamps to attach labels to, `--diarize` will stop with an error if your `sherpa-onnx` build doesn't expose timestamp data, rather than silently dropping the speaker labels.\n\n### Speaker count\n\nBy default the number of speakers is detected automatically. Auto-detection is solid up to about seven speakers and degrades above that. If you know the count, pass it — it's more reliable:\n\n```bash\nyapsnap call.mp3 --diarize --num-speakers 2\n```\n\n### Choosing a segmentation model\n\n| Model        | `--diarize-model` | License        | Notes                                          |\n|--------------|-------------------|----------------|------------------------------------------------|\n| pyannote 3.0 | `pyannote` (default) | CC-BY-4.0   | Attribution only; the safe default.            |\n| Reverb v1    | `reverb`          | **Non-commercial** | Same architecture, fine-tuned for accuracy. |\n\n```bash\nyapsnap panel.mp4 --diarize --diarize-model reverb\n```\n\n`pyannote` is the default because its license is clean for most uses. `reverb` (Rev's fine-tune of the same architecture) can be more accurate but is distributed under a **non-commercial** license — yapsnap prints a reminder the first time you download it. Check the Rev model card before using it for anything commercial.\n\n### Limits\n\n- **No overlapping speech.** Each moment is assigned to exactly one speaker; simultaneous talking isn't modeled.\n- **Speaker counting weakens past ~7 speakers.** Pass `--num-speakers` when you know it.\n- **Labels are per-run.** `SPEAKER_00` is not the same person across different files.\n\nTo override the embedding model (for example if the default asset name ever changes), set `YAPSNAP_EMBEDDING_MODEL` to a different filename from the sherpa-onnx speaker-recognition release.\n\n---\n\n## Notes & limits\n\n- The default model is **English**. 
For other languages, download a matching model and pass it with `--model`; see [Other languages](#other-languages) for the current list and instructions.\n- `--speed 1.5` shaves about a third off transcription time with minimal accuracy cost. Try `2.0` if you want it even faster, or `1.0` for noisy, mumbled, or fast-speech sources.\n- Some social-media URLs are geo-locked or login-walled; `yt-dlp` will say so explicitly.\n- This is a streaming model, so timestamps come from token positions in the recognized stream. They're accurate enough for navigation, not for subtitling-grade alignment.\n\n---\n\n## License\n\nApache-2.0 for this project. The Kroko model is distributed under its own license; see \u003Chttps:\u002F\u002Fhuggingface.co\u002FBanafo\u002FKroko-ASR>. Powered by [sherpa-onnx](https:\u002F\u002Fgithub.com\u002Fk2-fsa\u002Fsherpa-onnx) and [yt-dlp](https:\u002F\u002Fgithub.com\u002Fyt-dlp\u002Fyt-dlp).\n\nThe optional diarization models carry their own licenses, separate from yapsnap's: the default **pyannote** segmentation model is CC-BY-4.0 (attribution), the speaker-embedding model is Apache-2.0, and the opt-in **reverb** segmentation model (`--diarize-model reverb`) is **non-commercial**. If you use diarization, review the license of the model you select before relying on it.","yapsnap is a tool that converts video URLs or audio files into plain text. Its core features include fast transcription on the CPU, support for many video platforms and local files, and fully offline operation after the first run, with no GPU or cloud service required. The project is written in Python and has lean dependencies, needing only a few libraries to run. In addition, yapsnap supports transcription in more than ten languages and can provide sentence-level timestamps and speaker labels. It suits any scenario that requires extracting text from multimedia content, such as meeting notes and interview write-ups, and is especially well suited to users with strict privacy requirements.","2026-06-11 03:56:14","CREATED_QUERY"]