[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-82777":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":9,"pushedAt":9,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":15,"starSnapshotCount":15,"syncStatus":14,"lastSyncTime":29,"discoverSource":30},82777,"parakeet.cpp","mudler\u002Fparakeet.cpp","mudler","Parakeet implementation in C++ with ggml",null,"C++",357,26,1,2,0,5,140,226,37,4.29,"MIT License",false,"master",true,[],"2026-06-12 02:04:27","# parakeet.cpp\n\n**Brought to you by the [LocalAI](https:\u002F\u002Fgithub.com\u002Fmudler\u002FLocalAI) team**, the folks behind LocalAI, the open-source AI engine that runs any model (LLMs, vision, voice, image, video) on any hardware, no GPU required.\n\n[![Model on Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhuggingface\u002Fbadges\u002Fresolve\u002Fmain\u002Fmodel-on-hf-md.svg)](https:\u002F\u002Fhuggingface.co\u002Fmudler\u002Fparakeet-cpp-gguf)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-green)](LICENSE)\n[![LocalAI](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLocalAI-Run_Locally-orange)](https:\u002F\u002Fgithub.com\u002Fmudler\u002FLocalAI)\n\nparakeet.cpp is a C++17 inference port of NVIDIA's [NeMo](https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FNeMo) Parakeet speech-recognition models, built on [ggml](https:\u002F\u002Fgithub.com\u002Fggml-org\u002Fggml). 
It gives you fast, dependency-light automatic speech recognition on CPU (and on GPU through ggml's backends), with no Python runtime needed at inference time.\n\nIt covers all the offline Parakeet families (CTC, RNNT, TDT, and hybrid TDT-CTC, in 0.6B\u002F1.1B\u002F110M sizes, English plus multilingual v3), each validated at WER 0 against NeMo on every published checkpoint. It also does **cache-aware streaming with end-of-utterance (EOU) detection** for `parakeet_realtime_eou_120m-v1`, where the streaming transcript matches NeMo's cache-aware streaming byte for byte. The full coverage matrix lives in `docs\u002Fparity.md`.\n\nIt's faster than NeMo's PyTorch runtime on both CPU and GPU, with byte-identical transcripts. The full numbers, methodology, and all the plots are in [benchmarks\u002FBENCHMARK.md](benchmarks\u002FBENCHMARK.md).\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"benchmarks\u002FBENCHMARK.md\">\u003Cimg src=\"benchmarks\u002Fplots\u002Fspeedup.png\" width=\"49%\" alt=\"CPU speedup vs NeMo (RTFx ratio per dtype)\">\u003C\u002Fa>\n  \u003Ca href=\"benchmarks\u002FBENCHMARK.md\">\u003Cimg src=\"benchmarks\u002Fplots\u002Fgpu_speedup.png\" width=\"49%\" alt=\"GPU speedup vs NeMo on the NVIDIA GB10\">\u003C\u002Fa>\n\u003C\u002Fp>\n\nIt also runs circles around whisper.cpp on the same audio: the 110M Parakeet is faster than whisper base.en and far faster than large-v3-turbo, while the larger Parakeets match or beat whisper's accuracy (see [benchmarks\u002FBENCHMARK.md](benchmarks\u002FBENCHMARK.md)).\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"benchmarks\u002FBENCHMARK.md\">\u003Cimg src=\"benchmarks\u002Fplots\u002Fvs_whisper.png\" width=\"88%\" alt=\"parakeet.cpp vs whisper.cpp RTFx on CPU and GPU\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n---\n\n## Performance\n\nparakeet.cpp is faster than NeMo's PyTorch runtime on every Parakeet model, on both CPU and GPU, and the transcripts come out byte-identical (WER 0 vs NeMo). 
Full methodology, all 10 models, quantization tradeoffs, and plots are in [`benchmarks\u002FBENCHMARK.md`](benchmarks\u002FBENCHMARK.md).\n\n### See it run\n\nThe same clip fed to parakeet.cpp and to NeMo's own PyTorch runtime on the same GPU. The output comes out byte-for-byte identical, parakeet.cpp just gets there first (slowed down so the sub-100ms race is watchable):\n\n![parakeet.cpp vs NeMo on GPU: identical output, parakeet.cpp finishes first](benchmarks\u002Fmedia\u002Fgpu_duel.gif)\n\nThe same race on CPU, against NeMo's own PyTorch runtime: [parakeet.cpp vs NeMo on CPU](benchmarks\u002Fmedia\u002Fcpu_nemo_duel.mp4) (about 1.5x faster, still byte-for-byte identical). And vs whisper.cpp turbo, same accuracy and far less compute: [on GPU](benchmarks\u002Fmedia\u002Fgpu_whisper_duel.mp4) (about 12x faster) and [on CPU](benchmarks\u002Fmedia\u002Fcpu_duel.mp4) (about 27x faster).\n\nCPU numbers (20-core x86, vs NeMo PyTorch-CPU, LibriSpeech test-clean, threads=8; RTFx is audio-seconds over processing-seconds, so higher is faster):\n\n| dtype | size vs f32 | speedup vs NeMo | accuracy |\n| ----- | ----------- | --------------- | -------- |\n| f32   | 100%        | 1.11 to 1.69x (median 1.40x) | WER 0, byte-identical to NeMo |\n| f16   | 57%         | up to 1.70x     | near-lossless |\n| q8_0  | 37%         | up to 1.86x     | near-lossless |\n| q4_k  | 26%         | n\u002Fa             | small, monotonic WER cost |\n\nPeak RAM is also roughly 2x lower than NeMo, and lower still once quantized.\n\nGPU numbers (NVIDIA GB10, Grace-Blackwell, vs NeMo-GPU in the `nvcr.io\u002Fnvidia\u002Fnemo` container, since NeMo can't run on the host's torch\u002FCUDA stack directly): parakeet.cpp wins on all 10 models, with a median of 1.25x and up to 4.3x on the large TDT\u002Fhybrid models. NeMo's TDT greedy decode isn't CUDA-graph accelerated and ours is a lean C++ loop, which is most of that gap. 
The log-mel front end runs on the GPU via a ggml DFT-matmul graph; the CPU path is unchanged.\n\n---\n\n## Build\n\nClone with submodules (ggml is vendored at `third_party\u002Fggml`):\n\n```sh\ngit clone --recursive https:\u002F\u002Fgithub.com\u002Fmudler\u002Fparakeet.cpp\ncd parakeet.cpp\ncmake -B build -DPARAKEET_BUILD_TESTS=ON && cmake --build build -j\n```\n\nUse `-DGGML_NATIVE=OFF` for portable or CI builds (it disables host-specific ISA extensions). For the shared library (LocalAI \u002F dlopen):\n\n```sh\ncmake -B build-shared -DPARAKEET_SHARED=ON -DPARAKEET_BUILD_CLI=ON\ncmake --build build-shared -j\n# -> build-shared\u002Flibparakeet.so\n```\n\n### CMake options\n\n| Option                   | Default | Purpose                                    |\n| ------------------------ | ------- | ------------------------------------------ |\n| `PARAKEET_BUILD_TESTS`   | OFF     | Compile and register ctest targets         |\n| `PARAKEET_BUILD_CLI`     | ON      | Build `parakeet-cli`                       |\n| `PARAKEET_SHARED`        | OFF     | Build libparakeet as a shared library      |\n| `PARAKEET_GGML_CUDA`     | OFF     | Forward GGML_CUDA to the submodule         |\n| `PARAKEET_GGML_METAL`    | OFF     | Forward GGML_METAL to the submodule        |\n| `PARAKEET_GGML_VULKAN`   | OFF     | Forward GGML_VULKAN to the submodule       |\n| `PARAKEET_GGML_HIP`      | OFF     | Forward GGML_HIP (ROCm) to the submodule   |\n\nTo build for a GPU backend, forward its flag, e.g. Apple Metal:\n\n```sh\ncmake -B build -DPARAKEET_GGML_METAL=ON && cmake --build build -j\n```\n\nThe CLI auto-selects the first GPU device the ggml registry reports, so no runtime flag is needed (set `PARAKEET_DEVICE=cpu` to force CPU). Ops the chosen backend has no kernel for run on the CPU automatically, so a model always runs even when one op lacks a GPU kernel. 
On an Apple M4, Metal is up to about 5x faster than CPU on the larger models; see [Apple Metal](benchmarks\u002FBENCHMARK.md#apple-metal-m4).\n\n---\n\n## Python environment setup\n\nYou need this once, for model conversion and validation. It's not needed for inference:\n\n```sh\npython3 -m venv .venv\n.venv\u002Fbin\u002Fpip install torch --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcpu\n.venv\u002Fbin\u002Fpip install -r scripts\u002Frequirements.txt   # nemo_toolkit[asr] + gguf\n```\n\nNeMo 2.7.3 is the validated version. The anchor checkpoint `nvidia\u002Fparakeet-tdt_ctc-110m` (about 440 MB) is downloaded automatically by NeMo on first use.\n\n---\n\n## Converting a model\n\nConvert a HuggingFace or local `.nemo` checkpoint to GGUF:\n\n```sh\n# Default (F32), lossless and largest\n.venv\u002Fbin\u002Fpython scripts\u002Fconvert_parakeet_to_gguf.py \\\n    --model nvidia\u002Fparakeet-tdt_ctc-110m \\\n    --output m.gguf\n\n# F16, about 0.58x the size, WER 0 vs NeMo\n.venv\u002Fbin\u002Fpython scripts\u002Fconvert_parakeet_to_gguf.py \\\n    --model nvidia\u002Fparakeet-tdt_ctc-110m --dtype f16 --output m.gguf\n\n# Q8_0, about 0.39x the size, WER 0 vs NeMo\n.venv\u002Fbin\u002Fpython scripts\u002Fconvert_parakeet_to_gguf.py \\\n    --model nvidia\u002Fparakeet-tdt_ctc-110m --dtype q8_0 --output m.gguf\n```\n\nSupported `--dtype`: `f32` (default), `f16`, `q8_0`.\n\n---\n\n## Quantization\n\nThe Python `gguf` writer can't produce K-quants (`q4_k`, `q5_k`, `q6_k`), so re-quantize an existing F32 GGUF with the CLI instead:\n\n```sh\nparakeet-cli quantize \u003Cin.gguf> \u003Cout.gguf> \u003Ctype>\n# e.g.\nparakeet-cli quantize m.gguf m_q4k.gguf q4_k\nparakeet-cli quantize m.gguf m_q6k.gguf q6_k\n```\n\nSupported types: `q4_0`, `q5_0`, `q8_0`, `q4_k`, `q5_k`, `q6_k`.\n\nOnly the large linear `ggml_mul_mat`-consumed weights (encoder FFN, attention projections, joint enc\u002Fpred projections, subsampling output projection) get quantized. 
The conv, LSTM, featurizer, batch_norm, and bias tensors stay F32. See `docs\u002Fquantization.md` for the full policy, allowlist, and measured size and WER per type.\n\n---\n\n## Running inference\n\n```sh\n# Default decoder (TDT for hybrid\u002FTDT models, CTC for standalone CTC)\nparakeet-cli transcribe --model m.gguf --input audio.wav\n\n# Force a decoder\nparakeet-cli transcribe --model m.gguf --input audio.wav --decoder ctc\nparakeet-cli transcribe --model m.gguf --input audio.wav --decoder tdt\n\n# Per-word timestamps + confidence: one line per word\n#   \u003Cstart>-\u003Cend>  \u003Cword>  (\u003Cconf>)   (times in seconds)\nparakeet-cli transcribe --model m.gguf --input audio.wav --timestamps\n\n# JSON with the flat text plus per-word and per-token timestamps + confidence:\n#   {\"text\":\"...\",\"words\":[{\"w\":..,\"start\":..,\"end\":..,\"conf\":..}],\n#    \"tokens\":[{\"id\":..,\"t\":..,\"conf\":..}]}\nparakeet-cli transcribe --model m.gguf --input audio.wav --json\n\n# Print model metadata (arch, dims, mel params, vocab size, TDT durations)\nparakeet-cli info m.gguf\n\n# Cache-aware streaming (EOU model parakeet_realtime_eou_120m-v1): feeds the WAV\n# in the model's chunk schedule, prints partial text incrementally and\n# [EOU @ \u003Ct>s] \u002F [EOB @ \u003Ct>s] event markers, then the finalized tail. Add\n# --timestamps to also print per-word [start-end] (conf) lines as words finalize.\nparakeet-cli transcribe --model eou.gguf --input audio.wav --stream\n```\n\nTimestamps and confidence match NeMo's `transcribe(timestamps=True)` with the `max_prob` confidence method exactly (word offsets to 0.0 s, per-token and per-word confidence within `5e-6`), for both the TDT and CTC heads. See `docs\u002Fparity.md`. 
Word start and end are in seconds (`frame x hop x subsampling \u002F sample_rate`, which works out to 0.08 s\u002Fframe here); confidence is the rescaled softmax probability of the emitted token, aggregated per word with NeMo's `min`.\n\nThe `parakeet-cli` binary lands at `build\u002Fexamples\u002Fcli\u002Fparakeet-cli`.\n\n---\n\n## Batching\n\nSingle-clip transcription is the default and needs no flags: every `transcribe` call runs one clip at a time, byte-for-byte identical to before. Batching is an opt-in path for decoding several clips together, which matters when you serve many concurrent requests on a GPU.\n\nThe win is on the **decode** side. A transducer (TDT\u002FRNN-T) decodes autoregressively with tiny per-step prediction-LSTM and joint GEMMs; one clip launches hundreds of these matvec-sized kernels and leaves the GPU mostly idle between launches. Decoding N clips together coalesces each step into one batched GEMM, so the device stays busy. On the NVIDIA GB10 this reaches about **10-12x** at batch size 16 (CPU about 3-5x); the encoder is already compute-bound, so batching it gives no throughput win. CTC has no autoregressive decode, so batching does not apply to standalone CTC models. The batched path is bit-identical to running the clips one by one (greedy decode is deterministic). 
Full numbers and per-model tables are in [`benchmarks\u002FBENCHMARK.md`](benchmarks\u002FBENCHMARK.md#batched-decode-throughput).\n\nMeasure it yourself:\n\n```bash\n# Decode-only: serial vs batched decode of one clip replicated B times (the win in isolation).\nparakeet-cli bench-decode --model \u003Cmodel.gguf> --audio \u003Cwav> [--batch-sizes 1,4,8,16] [--threads N] [--reps R] [--json \u003Cout>]\n\n# Full transcribe (encoder + decode) over a manifest at several batch sizes.\nparakeet-cli bench-batch --model \u003Cmodel.gguf> --manifest \u003Cfile> [--decoder ctc|tdt] [--threads N] [--batch-sizes 1,4,8] [--json \u003Cout>]\n```\n\nTo batch from code, use the batched entry points (single-clip B=1 is just N=1):\n\n- C++ (`src\u002Fmodel.hpp`): `Model::transcribe_16k_batch(pcms16k, decoder)` and `transcribe_16k_batch_with_timestamps(...)` take N clips of 16 kHz mono float PCM and return N results.\n- C-API (`include\u002Fparakeet_capi.h`): `parakeet_capi_transcribe_pcm_batch(...)` (N transcripts) and `parakeet_capi_transcribe_pcm_batch_json(...)` (one JSON array of N `{text,words,tokens}` objects). These are what LocalAI's `parakeet-cpp` backend calls to coalesce concurrent requests; it leaves batching off by default and exposes a `batch_max_size` option to opt in.\n\n---\n\n## C-API (`libparakeet.so`)\n\n`include\u002Fparakeet_capi.h` defines a flat, exception-free C-API meant for `dlopen` \u002F FFI \u002F LocalAI integration. 
Build the shared library with `-DPARAKEET_SHARED=ON`:\n\n```c\n#include \"parakeet_capi.h\"\n\nparakeet_ctx *ctx = parakeet_capi_load(\"model.gguf\");  \u002F\u002F load ONCE\nif (!ctx) { fprintf(stderr, \"%s\\n\", parakeet_capi_last_error(ctx)); return 1; }\n\nchar *text = parakeet_capi_transcribe_path(ctx, \"audio.wav\", 0 \u002F*default*\u002F);\nif (text) { printf(\"%s\\n\", text); parakeet_capi_free_string(text); }\n\nparakeet_capi_free(ctx);\n```\n\nIn-memory PCM:\n```c\nchar *text = parakeet_capi_transcribe_pcm(ctx, samples, n_samples,\n                                          sample_rate, 0 \u002F*default*\u002F);\n```\n\nTimestamps and confidence as JSON (matches NeMo `timestamps=True` + `max_prob`):\n```c\nchar *json = parakeet_capi_transcribe_path_json(ctx, \"audio.wav\", 0 \u002F*default*\u002F);\n\u002F\u002F {\"text\":\"...\",\n\u002F\u002F  \"words\":[{\"w\":\"Well,\",\"start\":0.480,\"end\":0.640,\"conf\":0.7859}, ...],\n\u002F\u002F  \"tokens\":[{\"id\":639,\"t\":0.480,\"conf\":0.9969}, ...]}\nif (json) { printf(\"%s\\n\", json); parakeet_capi_free_string(json); }\n```\n`start`\u002F`end`\u002F`t` are in seconds; `conf` is the rescaled softmax probability of the emitted token in `(0,1]` (a word's `conf` is the `min` over its tokens).\n\n### Streaming (cache-aware EOU model)\n\nFor `parakeet_realtime_eou_120m-v1`, a streaming session decodes 16 kHz mono f32 PCM as it arrives, returning newly-finalized text and signalling EOU\u002FEOB events:\n\n```c\nparakeet_stream *s = parakeet_capi_stream_begin(ctx);\nint eou = 0;\nchar *t = parakeet_capi_stream_feed(s, pcm, n_samples, &eou); \u002F\u002F \"\" if none yet\nif (t) { printf(\"%s\", t); parakeet_capi_free_string(t); }\nif (eou) printf(\" [EOU]\");\n\u002F\u002F ...feed more chunks...\nchar *tail = parakeet_capi_stream_finalize(s);                \u002F\u002F flush the tail\nif (tail) { printf(\"%s\\n\", tail); parakeet_capi_free_string(tail); }\nparakeet_capi_stream_free(s);\n```\n\n`\u003CEOU>` 
(end-of-utterance) and `\u003CEOB>` (backchannel) are stripped from the text and surfaced via `*eou_out` (the CLI `--stream` prints them as `[EOU @ \u003Ct>s]` markers). The streaming transcript matches NeMo's cache-aware streaming exactly, and `finalize` flushes the end-of-stream tail without fabricating an `\u003CEOU>` that NeMo would not emit.\n\nThe LocalAI backend (in the LocalAI repo) dlopens `libparakeet.so` and uses these symbols directly: the offline `parakeet_capi_transcribe_*` \u002F `parakeet_capi_transcribe_path_json` and the streaming `parakeet_capi_stream_*`. See `include\u002Fparakeet_capi.h` for the full API. The C++ streaming session (`pk::StreamingSession`) also exposes per-word timestamps and confidence as words finalize, via `drain_words()` alongside the EOU events, which the CLI `--stream --timestamps` path prints.\n\n---\n\n## Model coverage\n\nSee `docs\u002Fparity.md` for the full coverage matrix. In short:\n\n| Family | Representative checkpoints | Heads | WER vs NeMo |\n| --- | --- | --- | --- |\n| Hybrid TDT+CTC | `parakeet-tdt_ctc-110m`, `parakeet-tdt_ctc-1.1b` | TDT + CTC | 0.0 |\n| TDT (hybrid) | `parakeet-tdt-0.6b-v2`, `parakeet-tdt-0.6b-v3` (multilingual) | TDT | 0.0 |\n| Pure TDT | `parakeet-tdt-1.1b` | TDT | 0.0 |\n| CTC | `parakeet-ctc-0.6b`, `parakeet-ctc-1.1b` | CTC | 0.0 |\n| RNNT | `parakeet-rnnt-0.6b`, `parakeet-rnnt-1.1b` | RNNT | 0.0 |\n\nAll 10 published offline checkpoints are validated at WER 0 vs NeMo 2.7.3. Sizes: 110M (512\u002F17 layers), 0.6B (1024\u002F24), 1.1B (1024\u002F42).\n\nCache-aware streaming and EOU (`parakeet_realtime_eou_120m-v1`) is implemented too: `layer_norm` plus causal conv, causal subsampling, chunked-limited attention, per-layer conv\u002Fattention caches, carried RNN-T decoder state, and `\u003CEOU>`\u002F`\u003CEOB>` events. The streaming transcript matches NeMo's cache-aware streaming byte for byte. 
See `docs\u002Fparity.md` (the Streaming + EOU section).\n\n---\n\n## Running tests\n\nModel-independent (run anywhere):\n\n```sh\nctest --test-dir build --output-on-failure -LE model\n```\n\nModel-dependent (need venv + checkpoint):\n\n```sh\nexport PARAKEET_TEST_GGUF=\u002Ftmp\u002Fpk110m.gguf\nexport PARAKEET_TEST_BASELINE=\u002Ftmp\u002Fbaseline.gguf\nexport PARAKEET_TEST_BASELINE_SPEECH=\u002Ftmp\u002Fbaseline_speech.gguf\nctest --test-dir build --output-on-failure\n```\n\nTests labelled `model` return exit code 77 (ctest SKIP) when their required env vars are absent, so they never break a CI environment that has no model.\n\n---\n\n## Roadmap \u002F TODO\n\n- **Tune the GPU encoder kernels.** On the GB10 GPU, parakeet.cpp is faster than NeMo on all 10 models (median 1.25x, up to 4.3x), but the gains are smallest on the pure-encoder CTC models (around 1.2x), because ggml's generic CUDA conv\u002Fattention kernels still trail NeMo's tuned cuDNN. Closing that gap (better conv1d and flash-attention paths for the FastConformer encoder) is the main remaining GPU headroom. The log-mel already runs on the backend (`GpuMel`); the CPU path is unaffected.\n\n---\n\n## Why parakeet.cpp\n\nNeMo is a great training framework, but running Parakeet just for inference drags in a heavy Python\u002FPyTorch stack. 
parakeet.cpp is a from-scratch C++17\u002Fggml port focused purely on inference:\n\n- **No Python at inference.** A single `libparakeet.so` (or static lib) behind a flat C API (`include\u002Fparakeet_capi.h`), easy to embed from C, C++, Go, or Rust.\n- **Faster than NeMo** on CPU and GPU (see [Performance](#performance)), with byte-identical output.\n- **Small and portable.** GGUF models with f16 \u002F q8_0 \u002F K-quant variants, running on CPU and any ggml GPU backend (CUDA, Metal, Vulkan, HIP).\n- **Full family coverage.** CTC, RNNT, TDT, hybrid TDT-CTC, multilingual, and cache-aware streaming with EOU, all validated at WER 0 vs NeMo.\n\n---\n\n## Citation\n\nIf you use parakeet.cpp, please cite this repository and the original models:\n\n```bibtex\n@software{parakeet_cpp,\n  title  = {parakeet.cpp: a C++\u002Fggml inference engine for NVIDIA Parakeet ASR},\n  author = {Di Giacinto, Ettore and Palethorpe, Richard},\n  url    = {https:\u002F\u002Fgithub.com\u002Fmudler\u002Fparakeet.cpp},\n  year   = {2026}\n}\n```\n\nThe Parakeet models are by NVIDIA NeMo ([NVIDIA-NeMo\u002FNeMo](https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FNeMo)).\n\n## Author\n\nEttore Di Giacinto ([@mudler](https:\u002F\u002Fgithub.com\u002Fmudler)).\n\n## License\n\nparakeet.cpp is released under the [MIT License](LICENSE). The model weights are governed by NVIDIA's original Parakeet model licenses, so check each model card on HuggingFace.\n","parakeet.cpp is a speech-recognition project built on C++17 and ggml that implements inference for NVIDIA NeMo Parakeet models. Its core features include support for the offline Parakeet model families (CTC, RNNT, TDT, and their hybrid variants), and it delivers fast, lightweight automatic speech recognition on both CPU and GPU with no Python runtime. It is particularly suited to local deployments that need few dependencies and high performance, such as edge devices or servers without a GPU. It also supports cache-aware streaming with end-of-utterance detection for efficient, accurate real-time use. Compared with NeMo's PyTorch implementation, parakeet.cpp is clearly faster while keeping the same recognition accuracy.","2026-06-11 04:09:11","CREATED_QUERY"]