[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2393":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":23,"hasPages":23,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":31,"readmeContent":32,"aiSummary":33,"trendingCount":16,"starSnapshotCount":16,"syncStatus":34,"lastSyncTime":35,"discoverSource":36},2393,"whisperX","m-bain\u002FwhisperX","m-bain","WhisperX:  Automatic Speech Recognition with Word-level Timestamps (& Diarization)","",null,"Python",22395,2297,158,168,0,19,125,564,100,120,"BSD 2-Clause \"Simplified\" License",false,"main",[26,27,28,29,30],"asr","speech","speech-recognition","speech-to-text","whisper","2026-06-12 04:00:14","\u003Ch1 align=\"center\">WhisperX\u003C\u002Fh1>\n\n## Recall.ai - Meeting Transcription API\n\nIf you’re looking for a transcription API for meetings, consider checking out [Recall.ai's Meeting Transcription API](https:\u002F\u002Fwww.recall.ai\u002Fproduct\u002Fmeeting-transcription-api?utm_source=github&utm_medium=sponsorship&utm_campaign=mbain-whisperx), an API that works with Zoom, Google Meet, Microsoft Teams, and more. 
Recall.ai diarizes by pulling the speaker data and separate audio streams from the meeting platforms, which means 100% accurate speaker diarization with actual speaker names.\n\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fm-bain\u002FwhisperX\u002Fstargazers\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fm-bain\u002FwhisperX.svg?colorA=orange&colorB=orange&logo=github\"\n         alt=\"GitHub stars\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fm-bain\u002FwhisperX\u002Fissues\">\n        \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002Fm-bain\u002Fwhisperx.svg\"\n             alt=\"GitHub issues\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fm-bain\u002FwhisperX\u002Fblob\u002Fmaster\u002FLICENSE\">\n        \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fm-bain\u002FwhisperX.svg\"\n             alt=\"GitHub license\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.00747\">\n        \u003Cimg src=\"http:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-2303.00747-B31B1B.svg\"\n             alt=\"ArXiv paper\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Ftwitter.com\u002Fintent\u002Ftweet?text=&url=https%3A%2F%2Fgithub.com%2Fm-bain%2FwhisperX\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Furl\u002Fhttps\u002Fgithub.com\u002Fm-bain\u002FwhisperX.svg?style=social\" alt=\"Twitter\">\n  \u003C\u002Fa>      \n\u003C\u002Fp>\n\n\u003Cimg width=\"1216\" align=\"center\" alt=\"whisperx-arch\" src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fm-bain\u002FwhisperX\u002Frefs\u002Fheads\u002Fmain\u002Ffigures\u002Fpipeline.png\">\n\n\u003C!-- \u003Cp align=\"left\">Whisper-Based Automatic Speech Recognition (ASR) with improved timestamp accuracy + quality via forced phoneme alignment and voice-activity based batching 
for fast inference.\u003C\u002Fp> -->\n\n\u003C!-- \u003Ch2 align=\"left\", id=\"what-is-it\">What is it 🔎\u003C\u002Fh2> -->\n\nThis repository provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.\n\n- ⚡️ Batched inference for 70x realtime transcription using whisper large-v2\n- 🪶 [faster-whisper](https:\u002F\u002Fgithub.com\u002Fguillaumekln\u002Ffaster-whisper) backend, requires \u003C8GB gpu memory for large-v2 with beam_size=5\n- 🎯 Accurate word-level timestamps using wav2vec2 alignment\n- 👯‍♂️ Multispeaker ASR using speaker diarization from [pyannote-audio](https:\u002F\u002Fgithub.com\u002Fpyannote\u002Fpyannote-audio) (speaker ID labels)\n- 🗣️ VAD preprocessing, which reduces hallucination and enables batching with no WER degradation\n\n**Whisper** is an ASR model [developed by OpenAI](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper), trained on a large dataset of diverse audio. Whilst it produces highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds. OpenAI's whisper does not natively support batching.\n\n**Phoneme-Based ASR** refers to a suite of models fine-tuned to recognise the smallest unit of speech distinguishing one word from another, e.g. the element p in \"tap\". 
A popular example model is [wav2vec2.0](https:\u002F\u002Fhuggingface.co\u002Ffacebook\u002Fwav2vec2-large-960h-lv60-self).\n\n**Forced Alignment** refers to the process by which orthographic transcriptions are aligned to audio recordings to automatically generate phone-level segmentation.\n\n**Voice Activity Detection (VAD)** is the detection of the presence or absence of human speech.\n\n**Speaker Diarization** is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker.\n\n\u003Ch2 align=\"left\", id=\"highlights\">New🚨\u003C\u002Fh2>\n\n- 1st place at [Ego4d transcription challenge](https:\u002F\u002Feval.ai\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F1637\u002Fleaderboard\u002F3931\u002FWER) 🏆\n- _WhisperX_ accepted at INTERSPEECH 2023\n- v3 transcript segment-per-sentence: using nltk sent_tokenize for better subtitling & better diarization\n- v3 released, 70x speed-up open-sourced. Using batched whisper with [faster-whisper](https:\u002F\u002Fgithub.com\u002Fguillaumekln\u002Ffaster-whisper) backend!\n- v2 released: code cleanup, imports whisper library. VAD filtering is now turned on by default, as in the paper.\n- Paper drop🎓👨‍🏫! Please see our [arXiv preprint](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.00747) for benchmarking and details of WhisperX. We also introduce more efficient batch inference resulting in large-v2 with \\*60-70x REAL TIME speed.\n\n\u003Ch2 align=\"left\" id=\"setup\">Setup ⚙️\u003C\u002Fh2>\n\n### 0. CUDA Installation\n\nTo use WhisperX with GPU acceleration, install the CUDA toolkit 12.8 before WhisperX. 
Skip this step if using only the CPU.\n\n- For **Linux** users, install the CUDA toolkit 12.8 following this guide:\n  [CUDA Installation Guide for Linux](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-installation-guide-linux\u002F).\n- For **Windows** users, download and install the CUDA toolkit 12.8:\n  [CUDA Downloads](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-12-8-1-download-archive).\n\n### 1. Simple Installation (Recommended)\n\nThe easiest way to install WhisperX is through PyPi:\n\n```bash\npip install whisperx\n```\n\nOr if using [uvx](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002Fguides\u002Ftools\u002F#running-tools):\n\n```bash\nuvx whisperx\n```\n\n### 2. Advanced Installation Options\n\nThese installation methods are for developers or users with specific needs. If you're not sure, stick with the simple installation above.\n\n#### Option A: Install from GitHub\n\nTo install directly from the GitHub repository:\n\n```bash\nuvx git+https:\u002F\u002Fgithub.com\u002Fm-bain\u002FwhisperX.git\n```\n\n#### Option B: Developer Installation\n\nIf you want to modify the code or contribute to the project:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fm-bain\u002FwhisperX.git\ncd whisperX\nuv sync --all-extras --dev\n```\n\n> **Note**: The development version may contain experimental features and bugs. Use the stable PyPI release for production environments.\n\nYou may also need to install ffmpeg, rust etc. 
Follow the OpenAI instructions here: https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper#setup.\n\n### Speaker Diarization\n\nTo **enable Speaker Diarization**, pass a Hugging Face access token (read), which you can generate [here](https:\u002F\u002Fhuggingface.co\u002Fsettings\u002Ftokens), after the `--hf_token` argument, and accept the user agreement for the [speaker-diarization-community-1](https:\u002F\u002Fhuggingface.co\u002Fpyannote\u002Fspeaker-diarization-community-1) model.\n\n\u003Ch2 align=\"left\" id=\"example\">Usage 💬 (command line)\u003C\u002Fh2>\n\n### English\n\nRun whisper on an example segment (using default params, whisper small). Add `--highlight_words True` to visualise word timings in the .srt file.\n\n    whisperx path\u002Fto\u002Faudio.wav\n\nResult using _WhisperX_ with forced alignment to wav2vec2.0 large:\n\nhttps:\u002F\u002Fuser-images.githubusercontent.com\u002F36994049\u002F208253969-7e35fe2a-7541-434a-ae91-8e919540555d.mp4\n\nCompare this to original whisper out of the box, where many transcriptions are out of sync:\n\nhttps:\u002F\u002Fuser-images.githubusercontent.com\u002F36994049\u002F207743923-b4f0d537-29ae-4be2-b404-bb941db73652.mov\n\nFor increased timestamp accuracy, at the cost of higher GPU memory use, use bigger models (a bigger alignment model was not found to be that helpful, see paper), e.g.\n\n    whisperx path\u002Fto\u002Faudio.wav --model large-v2 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 4\n\nTo label the transcript with speaker IDs (set number of speakers if known e.g. 
`--min_speakers 2` `--max_speakers 2`):\n\n    whisperx path\u002Fto\u002Faudio.wav --model large-v2 --diarize --highlight_words True\n\nTo run on CPU instead of GPU (and for running on Mac OS X):\n\n    whisperx path\u002Fto\u002Faudio.wav --compute_type int8 --device cpu\n\n### Other languages\n\nThe phoneme ASR alignment model is _language-specific_, for tested languages these models are [automatically picked from torchaudio pipelines or huggingface](https:\u002F\u002Fgithub.com\u002Fm-bain\u002FwhisperX\u002Fblob\u002Ff2da2f858e99e4211fe4f64b5f2938b007827e17\u002Fwhisperx\u002Falignment.py#L24-L58).\nJust pass in the `--language` code, and use the whisper `--model large`.\n\nCurrently default models provided for `{en, fr, de, es, it}` via torchaudio pipelines and many other languages via Hugging Face. Please find the list of currently supported languages under `DEFAULT_ALIGN_MODELS_HF` on [alignment.py](https:\u002F\u002Fgithub.com\u002Fm-bain\u002FwhisperX\u002Fblob\u002Fmain\u002Fwhisperx\u002Falignment.py). If the detected language is not in this list, you need to find a phoneme-based ASR model from [huggingface model hub](https:\u002F\u002Fhuggingface.co\u002Fmodels) and test it on your data.\n\n#### E.g. German\n\n    whisperx --model large-v2 --language de path\u002Fto\u002Faudio.wav\n\nhttps:\u002F\u002Fuser-images.githubusercontent.com\u002F36994049\u002F208298811-e36002ba-3698-4731-97d4-0aebd07e0eb3.mov\n\nSee more examples in other languages [here](EXAMPLES.md).\n\n## Python usage 🐍\n\n```python\nimport whisperx\nimport gc\nfrom whisperx.diarize import DiarizationPipeline\n\ndevice = \"cuda\"\naudio_file = \"audio.mp3\"\nbatch_size = 16 # reduce if low on GPU mem\ncompute_type = \"float16\" # change to \"int8\" if low on GPU mem (may reduce accuracy)\n\n# 1. 
Transcribe with original whisper (batched)\nmodel = whisperx.load_model(\"large-v2\", device, compute_type=compute_type)\n\n# save model to local path (optional)\n# model_dir = \"\u002Fpath\u002F\"\n# model = whisperx.load_model(\"large-v2\", device, compute_type=compute_type, download_root=model_dir)\n\naudio = whisperx.load_audio(audio_file)\nresult = model.transcribe(audio, batch_size=batch_size)\nprint(result[\"segments\"]) # before alignment\n\n# delete model if low on GPU resources\n# import gc; import torch; gc.collect(); torch.cuda.empty_cache(); del model\n\n# 2. Align whisper output\nmodel_a, metadata = whisperx.load_align_model(language_code=result[\"language\"], device=device)\nresult = whisperx.align(result[\"segments\"], model_a, metadata, audio, device, return_char_alignments=False)\n\nprint(result[\"segments\"]) # after alignment\n\n# delete model if low on GPU resources\n# import gc; import torch; gc.collect(); torch.cuda.empty_cache(); del model_a\n\n# 3. Assign speaker labels\ndiarize_model = DiarizationPipeline(token=YOUR_HF_TOKEN, device=device)\n\n# add min\u002Fmax number of speakers if known\ndiarize_segments = diarize_model(audio)\n# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)\n\nresult = whisperx.assign_word_speakers(diarize_segments, result)\nprint(diarize_segments)\nprint(result[\"segments\"]) # segments are now assigned speaker IDs\n```\n\n## Demos 🚀\n\n[![Replicate (large-v3](https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=Replicate+WhisperX+large-v3&message=Demo+%26+Cloud+API&color=blue)](https:\u002F\u002Freplicate.com\u002Fvictor-upmeet\u002Fwhisperx)\n[![Replicate (large-v2](https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=Replicate+WhisperX+large-v2&message=Demo+%26+Cloud+API&color=blue)](https:\u002F\u002Freplicate.com\u002Fdaanelson\u002Fwhisperx)\n[![Replicate 
(medium)](https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=Replicate+WhisperX+medium&message=Demo+%26+Cloud+API&color=blue)](https:\u002F\u002Freplicate.com\u002Fcarnifexer\u002Fwhisperx)\n\nIf you don't have access to your own GPUs, use the links above to try out WhisperX.\n\n\u003Ch2 align=\"left\" id=\"whisper-mod\">Technical Details 👷‍♂️\u003C\u002Fh2>\n\nFor specific details on the batching and alignment, the effect of VAD, as well as the chosen alignment model, see the preprint [paper](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fpublications\u002F2023\u002FBain23\u002Fbain23.pdf).\n\nTo reduce GPU memory requirements, try any of the following (2. & 3. can affect quality):\n\n1.  Reduce the batch size, e.g. `--batch_size 4`\n2.  Use a smaller ASR model, e.g. `--model base`\n3.  Use a lighter compute type, e.g. `--compute_type int8`\n\nTranscription differences from openai's whisper:\n\n1. Transcription without timestamps. To enable single-pass batching, whisper inference is performed with `--without_timestamps True`, which ensures 1 forward pass per sample in the batch. However, this can cause discrepancies with the default whisper output.\n2. VAD-based segment transcription, unlike the buffered transcription of openai's. In the WhisperX paper we show this reduces WER and enables accurate batched inference.\n3. `--condition_on_prev_text` is set to `False` by default (reduces hallucination)\n\n\u003Ch2 align=\"left\" id=\"limitations\">Limitations ⚠️\u003C\u002Fh2>\n\n- Transcript words which do not contain characters in the alignment model's dictionary e.g. 
\"2014.\" or \"£13.60\" cannot be aligned and therefore are not given a timing.\n- Overlapping speech is not handled particularly well by whisper nor whisperx\n- Diarization is far from perfect\n- Language specific wav2vec2 model is needed\n\n\u003Ch2 align=\"left\" id=\"contribute\">Contribute 🧑‍🏫\u003C\u002Fh2>\n\nIf you are multilingual, a major way you can contribute to this project is to find phoneme models on huggingface (or train your own) and test them on speech for the target language. If the results look good send a pull request and some examples showing its success.\n\nBug finding and pull requests are also highly appreciated to keep this project going, since it's already diverging from the original research scope.\n\n\u003Ch2 align=\"left\" id=\"coming-soon\">TODO 🗓\u003C\u002Fh2>\n\n- [x] Multilingual init\n\n- [x] Automatic align model selection based on language detection\n\n- [x] Python usage\n\n- [x] Incorporating speaker diarization\n\n- [x] Model flush, for low gpu mem resources\n\n- [x] Faster-whisper backend\n\n- [x] Add max-line etc. see (openai's whisper utils.py)\n\n- [x] Sentence-level segments (nltk toolbox)\n\n- [x] Improve alignment logic\n\n- [ ] update examples with diarization and word highlighting\n\n- [ ] Subtitle .ass output \u003C- bring this back (removed in v3)\n\n- [ ] Add benchmarking code (TEDLIUM for spd\u002FWER & word segmentation)\n\n- [x] Allow silero-vad as alternative VAD option\n\n- [ ] Improve diarization (word level). 
_Harder than first thought..._\n\n\u003Ch2 align=\"left\" id=\"contact\">Contact\u002FSupport 📇\u003C\u002Fh2>\n\nContact maxhbain@gmail.com for queries.\n\n\u003Ca href=\"https:\u002F\u002Fwww.buymeacoffee.com\u002Fmaxhbain\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fcdn.buymeacoffee.com\u002Fbuttons\u002Fdefault-orange.png\" alt=\"Buy Me A Coffee\" height=\"41\" width=\"174\">\u003C\u002Fa>\n\n\u003Ch2 align=\"left\" id=\"acks\">Acknowledgements 🙏\u003C\u002Fh2>\n\nThis work, and my PhD, is supported by the [VGG (Visual Geometry Group)](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002F) and the University of Oxford.\n\nOf course, this builds on [openAI's whisper](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fwhisper).\nIt borrows important alignment code from the [PyTorch tutorial on forced alignment](https:\u002F\u002Fpytorch.org\u002Ftutorials\u002Fintermediate\u002Fforced_alignment_with_torchaudio_tutorial.html)\nand uses the wonderful pyannote VAD \u002F Diarization https:\u002F\u002Fgithub.com\u002Fpyannote\u002Fpyannote-audio\n\nValuable VAD & Diarization Models from:\n\n- [pyannote-audio](https:\u002F\u002Fgithub.com\u002Fpyannote\u002Fpyannote-audio) — Speaker diarization powered by the [speaker-diarization-community-1](https:\u002F\u002Fhuggingface.co\u002Fpyannote\u002Fspeaker-diarization-community-1) model, licensed under [CC-BY-4.0](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby\u002F4.0\u002F) by [pyannoteAI](https:\u002F\u002Fwww.pyannote.ai)\n- [silero-vad](https:\u002F\u002Fgithub.com\u002Fsnakers4\u002Fsilero-vad)\n\nGreat backend from [faster-whisper](https:\u002F\u002Fgithub.com\u002Fguillaumekln\u002Ffaster-whisper) and [CTranslate2](https:\u002F\u002Fgithub.com\u002FOpenNMT\u002FCTranslate2)\n\nThose who have [supported this work financially](https:\u002F\u002Fwww.buymeacoffee.com\u002Fmaxhbain) 🙏\n\nFinally, thanks to the OS 
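[contributors](https:\u002F\u002Fgithub.com\u002Fm-bain\u002FwhisperX\u002Fgraphs\u002Fcontributors) of this project, keeping it going and identifying bugs.\n\n\u003Ch2 align=\"left\" id=\"cite\">Citation\u003C\u002Fh2>\nIf you use this in your research, please cite the paper:\n\n```bibtex\n@article{bain2022whisperx,\n  title={WhisperX: Time-Accurate Speech Transcription of Long-Form Audio},\n  author={Bain, Max and Huh, Jaesung and Han, Tengda and Zisserman, Andrew},\n  journal={INTERSPEECH 2023},\n  year={2023}\n}\n```\n","WhisperX is an automatic speech recognition system built on the Whisper model, with support for word-level timestamps and speaker diarization. Its core features include accurate word-level timestamps via wav2vec2 alignment, multi-speaker recognition using pyannote-audio, and reduced hallucination through VAD preprocessing. Technically, WhisperX uses the faster-whisper backend and reaches 70x realtime transcription speed with the large model while needing less than 8GB of GPU memory. The project suits scenarios such as meeting notes and interviews that need precise timestamps and speaker labels, and is particularly suited to transcribing audio from online meeting platforms such as Zoom, Google Meet, and Microsoft Teams.",2,"2026-06-11 02:49:44","top_language"]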