[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79967":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":13,"stars30d":15,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":16,"rankGlobal":9,"rankLanguage":9,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":20,"hasPages":18,"topics":21,"createdAt":9,"pushedAt":9,"updatedAt":22,"readmeContent":23,"aiSummary":24,"trendingCount":14,"starSnapshotCount":14,"syncStatus":25,"lastSyncTime":26,"discoverSource":27},79967,"OmniVAD-Kit","lifeiteng\u002FOmniVAD-Kit","lifeiteng","Cross-platform VAD & Audio Event Detection toolkit — Python (PyPI) + TypeScript (npm) + C API. DFSMN models ~2MB, 200x real-time. Runs everywhere: native, browser (WASM), Node.js.",null,"Python",79,5,1,0,3,2.33,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:03:56","# OmniVAD\n\n[![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fomnivad)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fomnivad\u002F)\n[![npm](https:\u002F\u002Fimg.shields.io\u002Fnpm\u002Fv\u002Fomnivad)](https:\u002F\u002Fwww.npmjs.com\u002Fpackage\u002Fomnivad)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Flifeiteng\u002FOmniVAD-Kit)](LICENSE)\n\n**English** | [中文](README.zh.md)\n\nCross-platform toolkit for [FireRedVAD](https:\u002F\u002Fgithub.com\u002FFireRedTeam\u002FFireRedVAD) — SOTA voice activity detection and audio event detection.\n\n**Three models, one toolkit, runs everywhere:**\n\n| Model | What it does | Output |\n|-------|-------------|--------|\n| **VAD** | Speech detection (non-stream) | Speech timestamps |\n| **Stream-VAD** | Real-time speech detection (frame-by-frame) | Per-frame speech probability |\n| **AED** | Audio event detection (non-stream) | 
Speech \u002F Singing \u002F Music timestamps |\n\nAll models are based on DFSMN architecture, ~2.2MB each (~588K params), support 100+ languages.\n\n## Packages\n\n### Python (`omnivad\u002F`)\n\nPyPI package with native C bindings (ncnn). Models bundled in wheel.\n\n```bash\npip install omnivad\n```\n\n**CLI:**\n\n```bash\nomnivad audio.wav                        # VAD + AED → audio.TextGrid\nomnivad audio.wav -o out.json            # Output as JSON\nomnivad audio.wav -o out.srt             # Output as SRT\nomnivad audio.wav -o out.vtt             # Output as WebVTT\nomnivad audio.wav -f srt                 # Format flag (textgrid\u002Fjson\u002Fsrt\u002Fvtt)\nomnivad audio.wav -m vad                 # VAD only\nomnivad audio.wav -m aed                 # AED only (speech\u002Fsinging\u002Fmusic)\nomnivad long.wav --chunk 600 --overlap 2 # Chunked processing for large audio\npython -m omnivad audio.wav              # Also works\n```\n\n**Python API:**\n\n```python\nfrom omnivad import OmniVAD, OmniStreamVAD, OmniAED\nimport numpy as np\n\nvad = OmniVAD()\n\n# File path — auto-loads as float32 [-1,1]\nresult = vad.detect(\"audio.wav\")\n# {'duration': 2.24, 'timestamps': [(0.26, 1.82)]}\n\n# Float32 array [-1.0, 1.0] — from soundfile, torchaudio, librosa\nresult = vad.detect(float32_array)\n\n# Int16 array — from raw WAV, microphone PCM\nresult = vad.detect(np.array([...], dtype=np.int16))\n\n# Large audio — chunked processing with overlap\n# overlap_seconds must be smaller than chunk_seconds\nresult = vad.detect(\"long.wav\", chunk_seconds=600, overlap_seconds=2)\n\n# Stream VAD — real-time, feed 160 samples (10ms) at a time\n# Accepts float32 in [-1, 1] (Web Audio, soundfile, torch) or int16 PCM\nsvad = OmniStreamVAD()\nframe = None\nwhile frame is None:\n    frame = svad.process(pcm_160)  # np.float32 or np.int16\n# StreamResult(time=0.420s, confidence=0.95, is_speech=True)\n\n# FastClone — share model weights, minimal memory per stream\nclone = svad.clone()  # 
instant, ~0 memory overhead\nclone.process(pcm_160)  # fully independent state\n\n# AED — speech + singing + music\naed = OmniAED()\nevents = aed.detect(\"audio.wav\")\n# {'duration': 22.0, 'events': {'speech': [...], 'singing': [...], 'music': [...]}}\n```\n\n**Platforms:** macOS (arm64\u002Fx86_64), Linux (x86_64\u002Faarch64), Windows (x86_64)\n\n### C\u002FC++ Native Library (`native\u002F`)\n\nUnified C API with [ncnn](https:\u002F\u002Fgithub.com\u002FTencent\u002Fncnn) backend. Single header, single library.\n\n```c\n#include \"omnivad.h\"\n\nint err = OMNI_OK;\n\n\u002F\u002F VAD — whole audio to speech segments\nOmniVadHandle vad = omni_vad_create(\"vad.omnivad\", &err);\nomni_vad_detect_int16(vad, pcm, num_samples, &config, &segments, &count);\n\u002F\u002F segments[0] = { start: 0.44, end: 1.82 }\n\n\u002F\u002F Stream VAD — real-time, 10ms per frame\n\u002F\u002F Two entries: omni_stream_vad_process (float [-1,1]), _int16 (int16 PCM)\nOmniStreamVadHandle svad = omni_stream_vad_create(\"stream-vad.omnivad\", 0.5f, &err);\nomni_stream_vad_process(svad, float_160_samples, 160, &result);   \u002F\u002F FP32\nomni_stream_vad_process_int16(svad, pcm_160_samples, 160, &result); \u002F\u002F int16\n\n\u002F\u002F FastClone — share model weights across streams\nOmniStreamVadHandle clone = omni_stream_vad_clone(svad, &err);\nomni_stream_vad_process_int16(clone, other_pcm, 160, &result);  \u002F\u002F independent state\n\n\u002F\u002F AED — speech + singing + music detection\nOmniAedHandle aed = omni_aed_create(\"aed.omnivad\", &err);\nomni_aed_detect_int16(aed, pcm, num_samples, &config, &segments, &count);\n\u002F\u002F segments[0] = { start: 0.09, end: 12.32, cls: OMNI_AED_MUSIC }\n```\n\n**Build:**\n\n```bash\n# Prerequisites: cmake, ncnn (brew install ncnn)\ncd native\ncmake -B build && cmake --build build -j$(nproc)\n\n# Test\n.\u002Fbuild\u002Ftest_all ..\u002Fmodels\u002F audio.wav\n```\n\n**Platforms:** macOS (arm64\u002Fx86_64), Linux 
(x86_64\u002Faarch64), Windows (x86_64), Android (armeabi-v7a\u002Farm64-v8a)\n\n### TypeScript\u002FJavaScript (`packages\u002Fomnivad\u002F`)\n\nWorks in both **browser** and **Node.js** via ncnn WebAssembly. **Zero dependencies**, models bundled.\n\n```ts\nimport { OmniVAD, OmniStreamVAD, OmniAED } from 'omnivad';\n\n\u002F\u002F Non-stream VAD — models loaded automatically from bundled WASM\nconst vad = await OmniVAD.create();\nconst result = vad.detect(audioFloat32Array);  \u002F\u002F Float32Array [-1.0, 1.0]\n\u002F\u002F { duration: 2.32, timestamps: [[0.44, 1.82]] }\n\n\u002F\u002F Also accepts Int16Array (raw PCM)\nconst result2 = vad.detect(pcmInt16Array);\n\n\u002F\u002F Stream VAD — frame-by-frame or full-audio batch mode\nconst svad = await OmniStreamVAD.create();\n\u002F\u002F processFrame() accepts Float32Array [-1, 1] or Int16Array — dispatch by dtype\nconst frame = svad.processFrame(float32_160);  \u002F\u002F null until enough audio is buffered\nconst full = svad.detectFull(audioFloat32Array);\n\u002F\u002F { probabilities: Float32Array(...), numFrames: 98, duration: 1.0 }\n\n\u002F\u002F AED — speech + singing + music\nconst aed = await OmniAED.create();\nconst events = aed.detect(audioFloat32Array);\n\u002F\u002F { duration: 22.0, events: { speech: [...], singing: [...], music: [...] }, ratios: { ... 
} }\n```\n\n**Build:**\n\n```bash\ncd packages\u002Fomnivad\npnpm install && pnpm build\n# Output: dist\u002Findex.js + dist\u002Findex.cjs + dist\u002Findex.d.ts + dist\u002Fwasm\u002F*\n```\n\n## Thread Safety\n\n| Component | Shared handle | Independent handles | Notes |\n|-----------|:---:|:---:|-------|\n| **OmniVAD** | **Safe** | **Safe** | `ncnn::Net` is read-only; each call creates a local `Fbank` and `Extractor` |\n| **OmniAED** | **Safe** | **Safe** | Same architecture as VAD |\n| **OmniStreamVAD** | **Unsafe** | **Safe** | Mutable internal state (`audio_buffer`, `cache`, `frame_offset`) |\n\n**Guidelines:**\n\n- `OmniVAD` and `OmniAED` instances can be safely shared across threads for concurrent inference. The Python `workers` parameter in `detect(..., workers=N)` already uses this pattern.\n- `OmniStreamVAD` instances must **not** be shared across threads. Create one instance per thread for parallel streaming.\n- Handle creation (`omni_*_create`) should be done sequentially — ncnn's model loading is not designed for highly concurrent initialization.\n- Never call `close()` \u002F `destroy()` on a handle while another thread is using it.\n\n**Running thread-safety tests:**\n\n```bash\n# Python\npytest tests\u002Ftest_thread_safety.py -v\n\n# C++ (requires ncnn)\n.\u002Fnative\u002Fbuild\u002Ftest_thread_safety models\u002F tests\u002Fdata\u002Fhello_en.wav [threads] [repeats]\n```\n\n## Audio Input\n\nHigh-level APIs accept 16kHz mono audio only. Two formats, same convention\nacross all 3 model types and all 3 layers (C \u002F Python \u002F TypeScript):\n\n- `float32` \u002F `Float32Array` in `[-1, 1]` (Web Audio, soundfile, torch)\n- `int16` \u002F `Int16Array` PCM (WAV, microphone)\n\nWrappers dispatch by dtype to the matching C entry — **never scale or\nconvert in Python\u002FJS**. 
All scaling lives in the C library: the `f32`\nentry multiplies by 32768.0f, the `_int16` entry casts to float.\n\n| Method | FP32 entry | int16 entry |\n|---|---|---|\n| `OmniVAD.detect \u002F detect_probs` | `omni_vad_detect[_probs]` | `omni_vad_detect[_probs]_int16` |\n| `OmniAED.detect \u002F detect_probs` | `omni_aed_detect[_probs]` | `omni_aed_detect[_probs]_int16` |\n| `OmniStreamVAD.process` | `omni_stream_vad_process` | `omni_stream_vad_process_int16` |\n| `OmniStreamVAD.detect_full` | `omni_stream_vad_detect_full` | `omni_stream_vad_detect_full_int16` |\n\nFor exact contracts see [`native\u002Finclude\u002Fomnivad.h`](native\u002Finclude\u002Fomnivad.h).\n\n## Audio Pipeline\n\n```\n16kHz PCM → Fbank (80-dim, 25ms window, 10ms shift) → CMVN → DFSMN → Sigmoid → Post-processing → Segments\n                     Povey window                        μ\u002Fσ    ~2.2MB   [0,1]    4-state machine\n                     pre-emphasis 0.97                                            merge\u002Fsplit\u002Fextend\n```\n\n## Streaming VAD — `OmniStreamVAD`\n\nFor long audio (live streams, hours-long recordings, real-time captioning),\n`OmniStreamVAD` processes audio frame-by-frame and emits **segment-boundary\nevents on the same call** that confirms the boundary — bit-identical to\nupstream [FireRedVAD's `FireRedStreamVad`](https:\u002F\u002Fgithub.com\u002FFireRedTeam\u002FFireRedVAD\u002Fblob\u002Fmain\u002Ffireredvad\u002Fstream_vad.py).\n\nEach successful `process()` call returns a result with both per-frame\nprobabilities AND segment-boundary flags:\n\n| Field | Meaning |\n|-------|---------|\n| `confidence` | raw model probability `[0, 1]` |\n| `smoothed_prob` | causal moving-average over `smooth_window_size` frames |\n| `is_speech` | `smoothed_prob >= threshold` |\n| `is_speech_start` | `True` on the frame that confirms a new SPEECH segment |\n| `is_speech_end` | `True` on the frame that confirms a SPEECH segment end |\n| `frame_idx` | 1-based frame index 
(multiply by 0.01 for seconds) |\n| `speech_start_frame` | 1-based segment start (when `is_speech_start`) |\n| `speech_end_frame` | 1-based segment end (when `is_speech_end`) |\n\n### Configuration (defaults match upstream FireRedVAD)\n\n| Parameter | Default | Meaning |\n|-----------|---------|---------|\n| `threshold` | `0.5` | Speech activation threshold |\n| `smooth_window_size` | `5` | Causal moving-average window (frames) |\n| `pad_start_frame` | `5` | Extend confirmed segment START backward by N frames |\n| `min_speech_frame` | `8` | Min continuous speech frames to confirm START (~80ms) |\n| `max_speech_frame` | `2000` | Force-split when SPEECH-state count hits this (~20s) |\n| `min_silence_frame` | `20` | Min continuous silence frames to confirm END (~200ms) |\n\n### Python\n\n```python\nfrom omnivad import OmniStreamVAD\nimport numpy as np\n\nvad = OmniStreamVAD()                              # upstream defaults\npcm = np.fromfile(\"speech.pcm\", dtype=np.int16)\n\nfor i in range(0, len(pcm), 160):                  # 10ms chunks\n    result = vad.process(pcm[i : i + 160])\n    if result is None:\n        continue\n    if result.is_speech_start:\n        print(f\"START @ {result.speech_start_frame * 0.01:.2f}s\")\n    if result.is_speech_end:\n        print(f\"END   @ {result.speech_end_frame * 0.01:.2f}s\")\n\n# Or get [(start_sec, end_sec), ...] 
in one call:\nsegments = OmniStreamVAD().detect_segments(\"speech.wav\")\n```\n\n### TypeScript\n\n```typescript\nimport { OmniStreamVAD } from \"omnivad\";\n\nconst vad = await OmniStreamVAD.create();\nfor (let i = 0; i + 160 \u003C= pcm.length; i += 160) {\n    const result = vad.processFrame(pcm.subarray(i, i + 160));\n    if (!result) continue;\n    if (result.isSpeechStart) {\n        console.log(`START @ ${(result.speechStartFrame * 0.01).toFixed(2)}s`);\n    }\n    if (result.isSpeechEnd) {\n        console.log(`END   @ ${(result.speechEndFrame * 0.01).toFixed(2)}s`);\n    }\n}\n```\n\n### Pairing with `merge_chunks`\n\n`OmniStreamVAD` emits raw VAD segments. To pack them into Whisper-sized\n30s chunks for downstream ASR, feed the emitted `[start, end]` pairs to\n`merge_chunks` (see next section).\n\n## Chunking — `merge_chunks` \u002F `mergeChunks`\n\nAfter VAD produces a list of speech `(start, end)` segments, the chunking\nutility groups them into duration-bounded chunks suitable for downstream\nASR \u002F forced alignment \u002F TTS. It is a **pure function** with no model\ndependency — Python uses `ctypes`, TypeScript uses Emscripten WASM, and\nC calls the native function directly. 
All three bindings share a single C\nimplementation in `native\u002Fsrc\u002Fchunking.cpp`.\n\n```python\nfrom omnivad import merge_chunks\nchunks = merge_chunks(timestamps, max_chunk_secs=30.0, mode=\"greedy\")\n```\n\n```ts\nimport { mergeChunks } from \"omnivad\";\nconst chunks = await mergeChunks(timestamps, { maxChunkSecs: 30.0, mode: \"longest_gap\" });\n```\n\n### Pipeline (5 steps; Steps 1–2 and 4–5 are shared by both modes)\n\n```\ninput (sorted segments)\n  │\n  ├─ Step 1: drop segments with duration \u003C min_speech_secs\n  │\n  ├─ Step 2: pre-merge consecutive segments with gap \u003C min_silence_secs\n  │          (cascades; takes max(end) on overlap)\n  │\n  ├─ Step 3: pack into chunks  ─┬─ mode = \"greedy\"\n  │                              │     sequential append; split when next\n  │                              │     would exceed max_chunk_secs OR gap > max_gap_secs\n  │                              │\n  │                              └─ mode = \"longest_gap\"\n  │                                    recursive split at the longest gap\n  │                                    until every chunk's span ≤ max_chunk_secs\n  │\n  ├─ Step 4: equal hard-split any chunk still longer than max_chunk_secs\n  │          (only triggers when a single segment alone exceeds max_chunk_secs)\n  │\n  └─ Step 5: apply pad_onset_secs (clamped to ≥ 0) and pad_offset_secs\n             output chunks: (start, end, seg_start_idx, seg_count)\n```\n\n### Mode comparison\n\n| Property | `greedy` (default) | `longest_gap` |\n|---|---|---|\n| Strategy | Sequential append until next overflow | Recursive split at longest internal gap until each chunk fits `max_chunk_secs` |\n| Honors `max_chunk_secs` | **Yes** — hard upper bound | **Yes** — recursion stops when chunk span ≤ `max_chunk_secs` |\n| Boundary location | First overflow point | Longest pause inside the over-long span |\n| Honors `max_gap_secs` | **Yes** — split at first `gap > max_gap_secs` | **Yes** — recursion also 
stops only when no internal gap exceeds `max_gap_secs` |\n| Single seg > `max_chunk_secs` | Step 4 equal hard-split | Same — Step 4 fallback |\n| Determinism | Deterministic | Deterministic; **leftmost** wins on tie |\n| Recommended for | **Whisper \u002F whisperX-style ASR** (fixed-length input, padded to 30s) | **Variable-length-input models** — forced alignment, TTS, encoder-style ASR. Splits at natural pauses; no fixed-length padding required. |\n\nExample with the same input, both modes (`max_chunk_secs=20`):\n\n```\nInput (max_chunk_secs = 20):\n  seg 0 = (0, 5)\n  seg 1 = (8, 10)     gap from seg 0 = 3\n  seg 2 = (20, 25)    gap from seg 1 = 10   ← longer\n\ngreedy\n  start cur = (0, 5)\n  accept seg 1            → cur = (0, 10)   [length 10 ≤ 20 ✓]\n  next seg 2 would_exceed:  25 - 0 = 25 > 20  → SPLIT\n  chunks: [(0, 10, 0, 2), (20, 25, 2, 1)]\n\nlongest_gap\n  span = 25 > 20            → must split\n  longest gap = 10 at idx 1 → cut between seg 1 and seg 2\n    left  = [seg 0, seg 1]  span = 10 ≤ 20 ✓ → keep\n    right = [seg 2]         span = 5  ≤ 20 ✓ → keep\n  chunks: [(0, 10, 0, 2), (20, 25, 2, 1)]\n```\n\n(In this minimal example both modes happen to agree. They diverge whenever\nthe longest gap is not the **first** overflow point.)\n\n### `seg_start_idx` \u002F `seg_count` semantics\n\nThese index into the **post-Step-1+Step-2** view of the input — segments\ndropped by `min_speech_secs` and pre-merged by `min_silence_secs` are\nNOT in the indexing space. 
Both modes follow this convention.\n\n### Defaults\n\n`omni_chunk_config_default()` (C \u002F `default_chunk_config()` Python \u002F\n`DEFAULT_CHUNK_CONFIG` TS) returns:\n\n| field | default | source |\n|---|---|---|\n| `max_chunk_secs` | `30.0` | seconds; matches Whisper's 30s input window |\n| `max_gap_secs` | `INFINITY` | disabled |\n| `pad_onset_secs` \u002F `pad_offset_secs` | `0.04` \u002F `0.04` | |\n| `min_speech_secs` | `0.0` | pairs with VAD `min_speech_frames` |\n| `min_silence_secs` | `0.20` | matches VAD `min_silence_frames=20` @ 10ms shift |\n| `mode` | `OMNI_CHUNK_GREEDY` | backward-compatible |\n\n> **Heads-up — Python convenience defaults differ.** The Python kwargs of\n> `merge_chunks(...)` use zeros for `pad_onset_secs`, `pad_offset_secs`,\n> `min_silence_secs` (so the simplest call gives raw output). To match\n> the canonical defaults, use the values returned by `default_chunk_config()`.\n> See `tests\u002Ftest_chunking.py::test_python_convenience_defaults_differ_from_canonical`.\n\n### Whisper \u002F WhisperX-style ASR pipeline\n\n`OmniVAD` (whole-audio, batch) + `merge_chunks(mode=\"greedy\")` is the\n1:1 equivalent of WhisperX's `Binarize(max_duration=chunk_size)` +\ngreedy packing. 
Use this recipe when feeding chunks into Whisper-family\nASR models that expect a fixed 30s input window:\n\n```python\nfrom omnivad import OmniVAD, merge_chunks\n\nvad = OmniVAD()                              # threshold=0.4 default — safer for Whisper\nresult = vad.detect(\"long-audio.wav\")        # whole-audio batch VAD\n\nchunks = merge_chunks(\n    timestamps=result[\"timestamps\"],\n    max_chunk_secs=30.0,                     # Whisper's input window\n    mode=\"greedy\",                           # WhisperX behavior\n    pad_onset_secs=0.04,\n    pad_offset_secs=0.04,\n    min_silence_secs=0.20,                   # matches VAD min_silence_frames=20\n)\n# Each chunk: { start, end, seg_start_idx, seg_count }\n# Slice the audio at [start, end] and feed each slice to Whisper.\n```\n\nNotes:\n\n- Keep the default `threshold=0.4`. Whisper tolerates extra padding silence\n  but is sensitive to clipped word edges (raising to 0.5 risks dropping\n  weak word-initial\u002Ffinal consonants and triggering hallucinations).\n- Do **not** use `mode=\"longest_gap\"` here — that mode targets\n  variable-length-input models (forced alignment, TTS), not WhisperX.\n- For very long audio (>1 hour), pass `chunk_seconds=600, overlap_seconds=2`\n  to `vad.detect(...)` to limit peak memory.\n\n## Model Files\n\nPrebuilt `.omnivad` bundles used by the Python package, TypeScript package, and local examples are already included in this repo under `models\u002F`.\n\nYou only need to download upstream FireRedVAD checkpoints if you want to re-export ONNX or regenerate the native assets yourself.\n\n```bash\n# Download upstream PyTorch models + export to ONNX\npip install fireredvad\npython -m fireredvad.bin.export_onnx --all\n\n# Or download pre-exported ONNX models directly\n# fireredvad_vad.onnx              — Non-stream VAD (2.3MB)\n# fireredvad_aed.onnx              — Non-stream AED (2.3MB)\n# fireredvad_stream_vad_with_cache.onnx — Stream VAD (2.2MB)\n\n# For C\u002Fncnn: convert 
ONNX → ncnn with pnnx\npip install pnnx\npnnx fireredvad_vad.onnx \"inputshape=[1,100,80]\"\n```\n\n## Local Development\n\nThis section covers building OmniVAD from source and consuming the in-tree\nbuild from another project on the same machine — the loop you want when\nhacking on the C\u002FC++ core, the Python wrapper, or the TS bindings.\n\n### Prerequisites\n\n| Target | Required | Notes |\n|--------|----------|-------|\n| Python wheel | Python 3.10+, CMake 3.15+, a C++14 toolchain | `pip install -e .` runs scikit-build-core, which **fetches ncnn automatically** via CMake `FetchContent`. |\n| Standalone C\u002FC++ library | CMake 3.15+, a pre-installed ncnn (`brew install ncnn` or build from source) | `native\u002FCMakeLists.txt` does **not** fetch ncnn — set `-DNCNN_ROOT=...` if it isn't on the default search path. |\n| TypeScript bundle | Node 18+, [pnpm](https:\u002F\u002Fpnpm.io\u002F) | Builds `dist\u002Findex.{js,cjs,d.ts}` only — does **not** rebuild the WASM. |\n| WASM module | [emsdk](https:\u002F\u002Femscripten.org\u002Fdocs\u002Fgetting_started\u002Fdownloads.html) (any recent version) | Required only when you change C\u002FC++ code and need a fresh `dist\u002Fwasm\u002Fomnivad.wasm`. |\n\n### Build the Python package (editable install)\n\n```bash\npip install -e \".[dev]\"\n```\n\nWhat this produces:\n\n- `omnivad\u002Flibomnivad.{dylib,so,dll}` — the shared library actually loaded\n  at runtime by `omnivad\u002F_binding.py`.\n- `omnivad\u002Fmodels\u002F*.omnivad` — bundled model files (copied by CMake `install(...)`).\n- An editable entry in your environment's `site-packages` pointing back at\n  the source tree.\n\nWhen you change **C\u002FC++ code** in `native\u002F`, re-run `pip install -e .` to\nrelink the dylib. (CMake's incremental build means this is fast.) 
Pure\nPython edits don't need a reinstall.\n\n### Build the TypeScript package\n\n```bash\ncd packages\u002Fomnivad\npnpm install\npnpm build          # tsup → dist\u002Findex.{js,cjs,d.ts}\npnpm typecheck      # tsc --noEmit\n```\n\nThis step **does not** rebuild the WASM — it consumes whatever's already in\n`dist\u002Fwasm\u002F`. If you only edited TS, you're done.\n\n### Build the WASM module (when you change C\u002FC++)\n\n```bash\nEMSDK=\u002Fpath\u002Fto\u002Femsdk packages\u002Fomnivad\u002Fwasm\u002Fbuild.sh\n```\n\nThe script writes `omnivad.{js,cjs,wasm}` directly into\n`packages\u002Fomnivad\u002Fdist\u002Fwasm\u002F`. After this, re-run `pnpm build` only if you\nalso changed TS.\n\n> The `EMSDK` env var must point at your emsdk root (the directory that\n> contains `emsdk_env.sh` and `upstream\u002Femscripten\u002F`). The script aborts\n> with a clear error if it's missing.\n\n### Consume the in-tree build from another repo\n\n#### Python — `pip install -e \u003Cpath>`\n\n```bash\n# In the target project's venv:\npip install -e \u002Fabs\u002Fpath\u002Fto\u002FOmniVAD-Kit          # editable, picks up your edits\n# or, isolated wheel:\npip install \u002Fabs\u002Fpath\u002Fto\u002FOmniVAD-Kit             # builds and installs a fresh wheel\n```\n\n`pip install -e` is what you want for the dev loop — re-running it after a\nC\u002FC++ edit relinks the dylib in place; pure Python edits are picked up\nwithout reinstalling.\n\n#### TypeScript — three options, pick by use case\n\n| Option | Command | When to use |\n|--------|---------|-------------|\n| **A. Tarball (closest to npm)** | `cd packages\u002Fomnivad && pnpm pack`\u003Cbr>then in target: `pnpm add \u002Fabs\u002Fpath\u002Fomnivad-0.2.8.tgz` | Verifying what real consumers will install. Clean, no symlink quirks. |\n| **B. `file:` protocol** | In target `package.json`: `\"omnivad\": \"file:..\u002FOmniVAD-Kit\u002Fpackages\u002Fomnivad\"` | In-tree monorepo-style consumption. 
Re-run `pnpm install` to pick up rebuilds. |\n| **C. Global link** | `cd packages\u002Fomnivad && pnpm link --global`\u003Cbr>then in target: `pnpm link --global omnivad` | Fast iteration across many projects. Watch for peer\u002Fhoist quirks. |\n\nFor all three, **rebuild before testing**:\n\n```bash\ncd packages\u002Fomnivad\npnpm build                                       # if only TS changed\nEMSDK=\u002Fpath\u002Fto\u002Femsdk wasm\u002Fbuild.sh && pnpm build # if C\u002FC++ changed\n```\n\n### Full rebuild after a C\u002FC++ change (cheat sheet)\n\n```bash\n# From the repo root:\npip install -e .                                       # Python dylib\nEMSDK=\u002Fpath\u002Fto\u002Femsdk packages\u002Fomnivad\u002Fwasm\u002Fbuild.sh    # WASM (.wasm + glue)\n( cd packages\u002Fomnivad && pnpm build )                  # TS bundle\n```\n\n### Standalone C\u002FC++ build (for native tests \u002F embedding)\n\n```bash\ncd native\ncmake -B build -DNCNN_ROOT=\u002Fpath\u002Fto\u002Fncnn   # only if ncnn isn't auto-discovered\ncmake --build build -j$(nproc 2>\u002Fdev\u002Fnull || sysctl -n hw.ncpu)\n.\u002Fbuild\u002Ftest_all ..\u002Fmodels ..\u002Ftests\u002Fdata\u002Fhello_en.wav\n```\n\nThis is independent from the Python wheel build — the wheel uses CMake\n`FetchContent` to pull a pinned ncnn, while `native\u002F` expects a\npre-installed one.\n\n### Lint \u002F format\n\n```bash\nruff check --fix . && ruff format .                    
# Python (line-length 120)\n( cd packages\u002Fomnivad && pnpm typecheck )              # TypeScript\n```\n\n## Testing\n\n```bash\n# Run the full Python test suite\npip install -e \".[dev]\"\npytest tests -v\n\n# Utility scripts (not pytest — require external FireRedVAD models)\npython tests\u002Fgenerate_reference.py            # Generate Python reference data\npython tests\u002Fcheck_timestamp_accuracy.py      # Strict C vs Python comparison\npython tests\u002Fvad_to_textgrid.py audio.wav     # Audio → TextGrid + RTF benchmark\n```\n\n**Accuracy (C\u002Fncnn vs Python, 5 audio files × 3 models):**\n\n| Model | Timestamp Δ | Probability Δ | Status |\n|-------|------------|---------------|--------|\n| VAD | ≤ 0.020s | ≤ 0.001 | Exact match |\n| AED (singing\u002Fmusic) | ≤ 0.010s | ≤ 0.013 | Exact match |\n| AED (speech) | ≤ 0.030s | ≤ 0.015 | Match (ncnn fp16 edge cases on `event.wav`) |\n| Stream-VAD (detect_full) | ≤ 0.010s | ≤ 0.001 | Exact match |\n\n## Project Structure\n\n```\nomnivad\u002F\n├── omnivad\u002F                         # Python PyPI package\n│   ├── __init__.py                  #   Public API: OmniVAD, OmniStreamVAD, OmniAED\n│   ├── cli.py                       #   CLI entry point (omnivad command)\n│   ├── _binding.py                  #   ctypes bindings to libomnivad\n│   ├── vad.py                       #   OmniVAD (non-stream)\n│   ├── stream_vad.py                #   OmniStreamVAD (real-time)\n│   └── aed.py                       #   OmniAED (3-class)\n├── native\u002F                          # C\u002FC++ library (ncnn backend)\n│   ├── include\u002Fomnivad.h            #   Unified C API header\n│   ├── src\u002Fomnivad.cpp              #   Core implementation\n│   ├── frontend\u002F                    #   Fbank\u002FFFT\u002FWAV (from FireRedVAD)\n│   ├── test\u002F                        #   4 test programs\n│   └── CMakeLists.txt\n├── packages\u002Fomnivad\u002F                # TypeScript npm package\n│   ├── src\u002F\n│   │   ├── 
vad.ts                   #   OmniVAD (non-stream)\n│   │   ├── stream-vad.ts            #   OmniStreamVAD (real-time)\n│   │   ├── aed.ts                   #   OmniAED (3-class)\n│   │   ├── wasm-binding.ts          #   Emscripten\u002FWASM bindings\n│   │   ├── types.ts                 #   Public TypeScript types\n│   │   ├── index.ts                 #   Package exports\n│   │   └── wasm.d.ts                #   WASM module declarations\n│   ├── package.json\n│   └── tsconfig.json\n└── tests\u002F                           # Test suite\n    ├── test_c_vs_python.py          #   Accuracy: omnivad vs Python reference\n    ├── test_determinism.py          #   Repeated-run determinism\n    ├── test_edge_cases.py           #   Edge cases: tiny\u002Fempty\u002Fsilence inputs\n    ├── smoke_test.py                #   CI smoke test (import + detect)\n    ├── test_memory.sh               #   Native memory\u002Fleak checks\n    ├── check_timestamp_accuracy.py  #   Strict C vs Python comparison (manual)\n    ├── check_native.py              #   Native C binary validation (manual)\n    ├── generate_reference.py        #   Generate Python reference data\n    ├── vad_to_textgrid.py           #   Audio → TextGrid + RTF benchmark\n    └── data\u002F                        #   5 test audio files + reference JSON\n```\n\n## Performance\n\nRTF (Real-Time Factor) on Apple M-series, lower = faster:\n\n| Model | RTF | Speed |\n|-------|-----|-------|\n| VAD | ~0.003 | ~330x real-time |\n| Stream-VAD | ~0.002 | ~500x real-time |\n| AED | ~0.002 | ~500x real-time |\n\n## Origin & Attribution\n\nOmniVAD is a cross-platform deployment toolkit built on top of [**FireRedVAD**](https:\u002F\u002Fgithub.com\u002FFireRedTeam\u002FFireRedVAD), developed by [Xiaohongshu (小红书)](https:\u002F\u002Fwww.xiaohongshu.com\u002F). 
FireRedVAD provides high-quality Voice Activity Detection models and a lightweight Audio Event Detection model that can distinguish speech, singing, and music.\n\n**Original paper:** [FireRedVAD (arXiv:2603.10420)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.10420)\n\n**What FireRedVAD provides:** DFSMN-based models (~2.2MB each), Python inference code, PyTorch training, strong VAD benchmark results (FLEURS-VAD-102 F1: 97.57%).\n\n**What OmniVAD adds:** Unified C API (ncnn backend) for native deployment, TypeScript\u002FJavaScript npm package (ncnn WebAssembly) for browser and Node.js, cross-platform build system, comprehensive test suite with accuracy validation.\n\n## License\n\nApache-2.0, same as the upstream FireRedVAD.\n\n## Credits\n\n- [**FireRedVAD**](https:\u002F\u002Fgithub.com\u002FFireRedTeam\u002FFireRedVAD): Kaituo Xu, Wenpeng Li, Kai Huang, Kun Liu (Xiaohongshu)\n- [ncnn](https:\u002F\u002Fgithub.com\u002FTencent\u002Fncnn): Tencent\n- [Emscripten](https:\u002F\u002Femscripten.org\u002F): WebAssembly toolchain\n","OmniVAD-Kit is a cross-platform voice activity detection (VAD) and audio event detection (AED) toolkit that supports Python (PyPI), TypeScript (npm), and a C API. Its core features are non-streaming speech detection, real-time speech detection, and audio event detection (speech, singing, music). The DFSMN-based models are about 2MB each, support more than 100 languages, and run natively, in the browser (via WASM), and in Node.js. The toolkit suits applications that need efficient audio processing, such as speech recognition preprocessing and audio content analysis.",2,"2026-06-11 03:58:43","CREATED_QUERY"]