[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79190":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":14,"stars30d":15,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":16,"rankGlobal":9,"rankLanguage":9,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":18,"topics":20,"createdAt":9,"pushedAt":9,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":14,"starSnapshotCount":14,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},79190,"av-curator","henliveira\u002Fav-curator","henliveira","Audio-visual data curation pipeline — scene cuts, silence trim, dedup, CLIP\u002FWhisper filtering for messy web video.",null,"Python",230,1841,6,0,191,59.8,"Other",false,"main",[],"2026-06-12 04:01:24","# AV-Curator\n\n## The Problem\n\nIf you've ever tried to train (or evaluate) an audio-visual model on\n\"web video\", you know the data pipeline is usually 90% of the work.  
Raw web\nvideo is full of:\n\n- title cards and black frames at scene boundaries\n- music intros and dead silence at clip ends\n- near-duplicate clips from re-uploads\n- watermarks, hardcoded subtitles, and PIP overlays\n- audio that's actually just the platform's stock background music\n- \"speech\" that's actually a melody or a foreign language you didn't ask for\n\nA `youtube-dl | ffmpeg | shuf | go` pipeline picks up all of this, and your\ndownstream training\u002Feval ends up reflecting the noise rather than the signal.\n\n## The Approach\n\n`av-curator` is a small, opinionated **audio-visual data curation pipeline**\nthat runs a sequence of *cheap, swappable filters* over an input manifest\nand writes a clean output manifest (plus optional re-encoded clips).\n\nThe pipeline is intentionally modular — each filter is a function from\n`(clip metadata, sources) -> (passes?, decision_record)` — so you can:\n\n- start with just **scene-cut + silence-trim** and iterate from there\n- swap a heavy filter (e.g. CLIP-based dedup) for a fast one\n  (perceptual hash) without rewiring anything\n- log every filter decision per clip, for auditability\n\n## Show Me\n\n```bash\n# 1. Build a manifest from a directory of raw clips\nav-curate manifest data\u002Fraw\u002F --out manifest.jsonl\n\n# 2. Run the full default pipeline\nav-curate run manifest.jsonl --config configs\u002Fdefault.yaml \\\n    --out manifest.clean.jsonl --report report.html\n\n# 3. 
Slice the surviving clips with ffmpeg\nav-curate slice manifest.clean.jsonl --out data\u002Fprocessed\u002F\n```\n\n`report.html` contains a small per-stage funnel:\n\n```\ninput:     10000 clips\n├─ codec_filter             ──▶  9871  (dropped 129 unreadable)\n├─ duration_filter (≥2s)    ──▶  9612  (dropped 259 too short)\n├─ silence_filter           ──▶  8954  (trimmed 982 \u002F dropped 658 silent)\n├─ scene_cut (max 1 cut)    ──▶  8112  (split 612 \u002F dropped 842)\n├─ phash_dedup              ──▶  7444  (dropped 668 near-dupes)\n├─ whisper_lang(en)         ──▶  6320  (kept en\u002Fzh)\n└─ clip_text_align          ──▶  5907  (dropped 413)\n```\n\n## Getting Started\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhenliveira\u002Fav-curator\ncd av-curator\npip install -e \".[full]\"\n```\n\nSystem dependencies: `ffmpeg`, `ffprobe`.  For CLIP\u002FWhisper filters you'll\nwant a CUDA-capable PyTorch install.\n\n## How it works\n\n### Manifest format\n\nEvery stage reads and writes a JSONL of `Clip` records:\n\n```json\n{\n  \"id\": \"abc123\",\n  \"path\": \"data\u002Fraw\u002Fabc123.mp4\",\n  \"duration\": 12.4,\n  \"video\": {\"fps\": 25.0, \"width\": 1280, \"height\": 720, \"codec\": \"h264\"},\n  \"audio\": {\"sr\": 44100, \"channels\": 2, \"codec\": \"aac\"},\n  \"trims\": [[0.0, 12.4]],\n  \"tags\": [\"scene_clean\", \"lang=en\"],\n  \"scores\": {\"phash_min_dist\": 18.0, \"clip_text_align\": 0.27},\n  \"decisions\": [\n    {\"stage\": \"silence_filter\", \"verdict\": \"kept\", \"note\": \"trimmed 0.6s tail\"},\n    ...\n  ]\n}\n```\n\n### Filter contract\n\nA filter is a Python callable:\n\n```python\ndef my_filter(clip: Clip, ctx: Context) -> Decision:\n    ...\n    # Return exactly one of the following verdicts:\n    return Decision.keep(note=\"ok\")\n    # return Decision.drop(reason=\"too dark\")\n    # return Decision.trim([(0.2, clip.duration - 0.1)])\n    # return Decision.split([(0.0, 4.1), (4.6, 12.4)])\n```\n\n`Decision` carries the verdict, an optional new trim list, and a free-form\nnote 
that's logged into the clip's `decisions` field.\n\n### Built-in filters\n\n| Filter            | Module                          | Cost |\n|-------------------|---------------------------------|------|\n| codec_filter      | `avcurator.filters.codec`       | cheap (ffprobe) |\n| duration_filter   | `avcurator.filters.duration`    | cheap |\n| black_frame       | `avcurator.filters.black_frame` | cheap (ffmpeg blackdetect) |\n| silence_filter    | `avcurator.filters.silence`     | cheap (ffmpeg silencedetect) |\n| scene_cut         | `avcurator.filters.scene_cut`   | medium (PySceneDetect) |\n| phash_dedup       | `avcurator.filters.phash`       | medium |\n| clip_phash_dedup  | `avcurator.filters.clip_dedup`  | heavy (CLIP embedding) |\n| whisper_lang      | `avcurator.filters.whisper`     | heavy (Whisper inference) |\n| clip_text_align   | `avcurator.filters.clip_align`  | heavy (CLIP) |\n| watermark_detect  | `avcurator.filters.watermark`   | heavy (template match + CNN) |\n\nEach filter is configured via a YAML stanza; see `configs\u002Fdefault.yaml` for\nthe canonical example.\n\n### Caching\n\nHeavy filters cache their per-clip results to disk (`.cache\u002F\u003Cfilter>\u002F`) keyed\nby a hash of `(filter version, clip id, args)`.  
Re-running the pipeline\nafter tweaking a cheap filter doesn't re-run the heavy ones.\n\n## Examples\n\n### \"I just want speech-heavy clips for ASR pre-training\"\n\n```yaml\nfilters:\n  - codec_filter\n  - duration_filter:\n      min: 2.0\n      max: 30.0\n  - silence_filter:\n      noise_db: -30\n      min_silence: 0.4\n  - whisper_lang:\n      keep: [en, zh]\n      min_speech_ratio: 0.6\n```\n\n### \"I want clean visual clips for video-LM pre-training\"\n\n```yaml\nfilters:\n  - codec_filter\n  - duration_filter: {min: 3.0}\n  - black_frame\n  - scene_cut: {max_cuts_per_clip: 1}\n  - phash_dedup: {hamming: 12}\n  - clip_text_align: {threshold: 0.22}\n```\n\n## Performance notes\n\n- The cheap filters run in a process pool over ffprobe \u002F ffmpeg subprocesses\n  — single-machine throughput of ~5k clips \u002F minute on 32 cores.\n- The heavy filters are batched on GPU; rough numbers for one A100-40G:\n  - CLIP filter — ~1200 clips \u002F min (8-frame sampling)\n  - Whisper-large — ~600 clips \u002F min (chunked)\n- Heavy filters can be sharded across nodes via `--shard idx\u002Ftotal` if you\n  bring your own scheduler.\n\n## What this isn't\n\n- A downloader.  `av-curate` starts from clips on disk; bring your own\n  `yt-dlp` step.\n- A training framework.  This produces clean **data**, nothing else.\n- A perfect filter.  No filter is.  Many filters are tuned conservatively\n  (drop is cheap; re-running training is not).\n\n## License\n\nBSD 3-Clause.\n\n## Acknowledgments\n\nBuilt on `ffmpeg`, [PySceneDetect](https:\u002F\u002Fwww.scenedetect.com\u002F),\n[CLIP](https:\u002F\u002Fgithub.com\u002Fopenai\u002FCLIP), and OpenAI's Whisper.  
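Inspired by\nseveral years of staring at messy web-video datasets.\n","AV-Curator is an audio-visual data curation pipeline that handles scene cuts, silence trimming, deduplication, and content filtering for web video. Its core function is to run a series of swappable filters over an input video manifest and write a clean manifest plus optional re-encoded clips. It is written in Python and has a modular design, so users can adapt the filter sequence to their needs, for example by using only scene-cut and silence-trim or by switching to CLIP-based dedup. The result of every filter step is logged for auditing. It suits cases where high-quality training or evaluation data must be extracted from raw web video, and is especially useful for researchers who want to reduce noise.",2,"2026-06-01 03:48:11","CREATED_QUERY"]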