[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-74708":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":14,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":9,"rankLanguage":9,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":9,"pushedAt":9,"updatedAt":25,"readmeContent":26,"aiSummary":27,"trendingCount":15,"starSnapshotCount":15,"syncStatus":28,"lastSyncTime":29,"discoverSource":30},74708,"voxtral.c","antirez\u002Fvoxtral.c","antirez","Pure C inference of Mistral Voxtral Realtime 4B speech to text model",null,"C",1690,118,16,4,0,17,42,12,74.93,"MIT License",false,"main",true,[],"2026-06-12 04:01:15","# Voxtral Realtime 4B Pure C Implementation\n\nThis is a C implementation of the inference pipeline for [Mistral AI's Voxtral Realtime 4B model](https:\u002F\u002Fhuggingface.co\u002Fmistralai\u002FVoxtral-Mini-4B-Realtime-2602). It has zero external dependencies beyond the C standard library. The MPS inference is decently fast, while the BLAS acceleration is usable but slow (it continuously converts the bf16 weights to fp32).\n\nAudio processing uses a chunked encoder with overlapping windows, bounding memory usage regardless of input length. Audio can also be piped from stdin (`--stdin`), or captured live from the microphone (`--from-mic`, macOS), making it easy to transcode and transcribe any format via ffmpeg. A streaming C API (`vox_stream_t`) lets you feed audio incrementally and receive token strings as they become available.\n\n**More testing needed:** please note that this project was mostly tested against a few samples, and likely requires some more work to be production quality. 
However, the hard part (understanding the model inference and reproducing the inference pipeline) is done, so the rest can likely be done easily. Testing it against very long transcriptions that stress the KV cache circular buffer would be a useful task.\n\n![demo](samples\u002Fdemo.gif)\n\n## Motivations (and some rant)\n\n**Thank you to Mistral** for releasing such a great model in an Open Weights fashion. However, the author of this project believes that limiting the inference to a partnership with vLLM, without providing a self-contained reference implementation in Python, limits the model's actual reach and the potential good effects it could have. For this reason, this project was created: it provides both a pure C inference engine and a simple, self-contained Python reference implementation (`python_simple_implementation.py`) that anyone can read and understand without digging through the vLLM codebase.\n\n## Quick Start\n\n```bash\n# Build (choose your backend)\nmake mps       # Apple Silicon (fastest)\n# or: make blas    # Intel Mac \u002F Linux with OpenBLAS\n\n# Download the model (~8.9GB)\n.\u002Fdownload_model.sh\n\n# Transcribe audio (tokens stream to stdout as generated)\n.\u002Fvoxtral -d voxtral-model -i audio.wav\n\n# Live microphone transcription (macOS, Ctrl+C to stop)\n.\u002Fvoxtral -d voxtral-model --from-mic\n\n# Pipe any format via ffmpeg\nffmpeg -i audio.mp3 -f s16le -ar 16000 -ac 1 - 2>\u002Fdev\u002Fnull | \\\n    .\u002Fvoxtral -d voxtral-model --stdin\n\n# Real-time streaming with low latency\nffmpeg -i audio.mp3 -f s16le -ar 16000 -ac 1 - 2>\u002Fdev\u002Fnull | \\\n    .\u002Fvoxtral -d voxtral-model --stdin -I 0.5\n```\n\nThat's it. 
No Python runtime, no CUDA toolkit, no `mistral_common` or vLLM required at inference time.\n\n### Python Reference Implementation\n\nA self-contained Python implementation is also provided for reading and understanding the model:\n\n```bash\npip install torch safetensors soundfile soxr\npython python_simple_implementation.py voxtral-model audio.wav\n```\n\nThis requires just PyTorch and a few standard libraries.\n\n## Features\n\n- **Zero dependencies**: Pure C implementation, works standalone for MPS. BLAS required for other targets (OpenBLAS on Linux).\n- **Metal GPU acceleration**: Automatic on Apple Silicon Macs with fused GPU operations and batched attention.\n- **Streaming output**: Tokens are printed to stdout as they are generated, word by word.\n- **Streaming C API**: Feed audio incrementally, get token strings back as they become available.\n- **Memory-mapped weights**: BF16 weights are mmap'd directly from safetensors, loading is near-instant.\n- **Live microphone input**: `--from-mic` captures and transcribes from the default microphone (macOS) with automatic silence detection.\n- **WAV input**: Supports 16-bit PCM WAV files at any sample rate (auto-resampled to 16kHz).\n- **Chunked encoder**: Processes audio in overlapping chunks, bounding memory regardless of length.\n- **Rolling KV cache**: Decoder KV cache is automatically compacted when it exceeds the sliding window (8192 positions), capping memory usage and allowing unlimited-length audio.\n\n## Usage\n\n### Basic Transcription\n\n```bash\n.\u002Fvoxtral -d voxtral-model -i recording.wav\n```\n\nTokens stream to stdout as they are generated. By default, timing info is printed to stderr. 
Use `--silent` or `--debug` to control verbosity:\n\n```bash\n.\u002Fvoxtral -d voxtral-model -i samples\u002Ftest_speech.wav --silent    # no stderr output\n.\u002Fvoxtral -d voxtral-model -i samples\u002Ftest_speech.wav --debug     # per-layer\u002Fper-chunk details\n.\u002Fvoxtral -d voxtral-model -i samples\u002Ftest_speech.wav --alt 0.5   # show alternative tokens\n```\n\n### Alternative Tokens\n\nWhen the model is uncertain between similar-sounding words, `--alt \u003Ccutoff>` shows the competing candidates inline:\n\n```\n.\u002Fvoxtral -d voxtral-model -i audio.wav --alt 0.95\nHello, this is a test of the[ V| Vo]ox[T|tral]roll speech-to-text system.\n```\n\nThe cutoff (0.0–1.0) controls how close an alternative must be to the best token. A token qualifies if `1 - prob[i]\u002Fprob[0] \u003C= cutoff`. Lower values show only very close alternatives, higher values are more permissive.\n\n### Processing Interval (`-I`)\n\nThe `-I \u003Cseconds>` flag controls how often the encoder processes accumulated audio. This is the key latency\u002Fefficiency tradeoff:\n\n```bash\n.\u002Fvoxtral -d voxtral-model --stdin -I 0.5    # low latency (responsive, more GPU overhead)\n.\u002Fvoxtral -d voxtral-model --stdin -I 5.0    # high efficiency (batches more audio per encoder call)\n```\n\nThe default is 2.0 seconds. Lower values make streaming more responsive (text appears sooner after speech) but increase GPU overhead because each encoder call has a fixed startup cost (~50ms). Higher values batch more audio into fewer, larger encoder calls, improving GPU utilization.\n\nThe overhead is significant: on a 60-second clip, batch mode takes ~2.9s for the encoder, while `-I 0.1` takes ~15.8s (5.4x slower) because of hundreds of small encoder calls each paying the fixed cost. For **real-time streaming**, values between 1.0 and 2.0 work well. Going below 0.5 wastes most of the GPU time on per-call overhead. 
For **offline file transcription** the interval is irrelevant since all audio is available at once.\n\n### Monitor Mode (`--monitor`)\n\nThe `--monitor` flag prints non-intrusive unicode symbols to stderr, inline with the transcription output, showing what the engine is doing in real time. Useful for diagnosing latency issues or verifying that the pipeline is running smoothly.\n\n| Symbol | Meaning |\n|--------|---------|\n| `▶` | Encoder processed a chunk of audio |\n| `·` | Decoder prefill (initial prompt injection) |\n| `⌛` | Decoder waiting for enough adapter tokens to prefill |\n| `▪` | Decoder generated a batch of tokens (normal speed) |\n| `▸` | Decoder generated a batch of tokens (slow, >40ms\u002Fstep) |\n| `▫` | Decoder generated control tokens only (token id \u003C 1000, normal speed) |\n| `▹` | Decoder generated control tokens only (slow, >40ms\u002Fstep) |\n| `✗` | Decoder generated invalid text-range tokens (empty\u002Finvalid decode, normal speed) |\n| `✘` | Decoder generated invalid text-range tokens (slow, >40ms\u002Fstep) |\n| `⚠` | Elevated non-text streak (appended to control\u002Finvalid decode symbols) |\n| `☠` | Critical non-text streak, restart imminent (appended to control\u002Finvalid decode symbols) |\n| `◦` | EOS-only decode step |\n| `↺` | Decoder restarted after end-of-sequence |\n| `⟳` | Decoder restarted due to KV cache overflow |\n| `↯` | Decoder restarted due to non-text stall |\n| `⌚` | Decoder restarted due to no-decode watchdog timeout |\n| `✂` | Decoder-only hard reset |\n| `♻` | Full stream reset (mel + encoder + decoder state) |\n\nA healthy stream looks like `▶·▪▪▶▪▪▶▪▪` — encoder chunks interleaved with fast decode batches. If `▸`, `▹`, `⚠`, or `☠` appear frequently, decode is under stress. Restart symbols are normal in long continuous streams; you will typically see pairs like `↺✂`, `⟳♻`, `↯♻`, or `⌚♻`.\n\n### Reading Audio from Stdin\n\nThe **`--stdin` flag** reads audio from standard input instead of a file. 
The format is auto-detected: if the data starts with a RIFF header it is parsed as WAV, otherwise it is treated as **raw signed 16-bit little-endian, 16 kHz, mono** (`s16le`).\n\nThis makes it trivial to transcode any audio\u002Fvideo format on the fly with ffmpeg:\n\n```bash\n# Transcribe an MP3 file\nffmpeg -i podcast.mp3 -f s16le -ar 16000 -ac 1 - 2>\u002Fdev\u002Fnull | \\\n    .\u002Fvoxtral -d voxtral-model --stdin\n\n# Pipe a WAV directly (auto-detected)\ncat recording.wav | .\u002Fvoxtral -d voxtral-model --stdin\n\n# Live transcription of a web radio stream\ncurl -sL http:\u002F\u002Fstream.live.vc.bbcmedia.co.uk\u002Fbbc_world_service | \\\n    ffmpeg -i pipe:0 -ar 16000 -ac 1 -f wav pipe:1 2>\u002Fdev\u002Fnull | \\\n    .\u002Fvoxtral -d voxtral-model --stdin\n```\n\n### Live Microphone Input\n\nThe **`--from-mic` flag** captures audio from the default microphone (macOS only, uses AudioQueue Services). Press Ctrl+C to stop. Silence is automatically detected and stripped to reduce encoder\u002Fdecoder work when you pause speaking — only actual speech is processed.\n\n```bash\n.\u002Fvoxtral -d voxtral-model --from-mic                # default 2s processing interval\n.\u002Fvoxtral -d voxtral-model --from-mic -I 1.0          # lower latency\n.\u002Fvoxtral -d voxtral-model --from-mic --silent         # no stderr status\n```\n\nIf the model falls behind real-time, a warning is printed and audio is skipped to catch up.\n\n`--from-mic`, `--stdin`, and `-i` are mutually exclusive.\n\nTo convert files to WAV format, just use `ffmpeg`:\n\n    ffmpeg -i input.ogg output.wav\n\nThe above command line works for many file types, not just for OGG files, of course.\nThere are two example wave files under the `samples` directory.\n\n### C API\n\nThe library exposes a streaming API (`vox_stream_t`) that works for both offline and real-time use. 
You feed audio samples and retrieve decoded token strings as they become available.\n\n**Offline transcription** — feed all audio, then collect results:\n\n```c\n#include \"voxtral.h\"\n\nvox_ctx_t *ctx = vox_load(\"voxtral-model\");\n\n\u002F* Load audio (your own code, or use vox_load_wav) *\u002F\nint n_samples;\nfloat *samples = vox_load_wav(\"audio.wav\", &n_samples);\n\n\u002F* Transcribe *\u002F\nvox_stream_t *s = vox_stream_init(ctx);\nvox_stream_feed(s, samples, n_samples);\nvox_stream_finish(s);\n\n\u002F* Collect token strings *\u002F\nconst char *tokens[64];\nint n;\nwhile ((n = vox_stream_get(s, tokens, 64)) > 0) {\n    for (int i = 0; i \u003C n; i++)\n        printf(\"%s\", tokens[i]);\n}\nprintf(\"\\n\");\n\nvox_stream_free(s);\nfree(samples);\nvox_free(ctx);\n```\n\n**Real-time streaming** — feed audio incrementally, retrieve tokens as they arrive:\n\n```c\nvox_stream_t *s = vox_stream_init(ctx);\n\nwhile (have_more_audio()) {\n    float chunk[4096];\n    int n_read = read_audio(chunk, 4096);\n    vox_stream_feed(s, chunk, n_read);\n\n    const char *tokens[16];\n    int n;\n    while ((n = vox_stream_get(s, tokens, 16)) > 0) {\n        for (int i = 0; i \u003C n; i++)\n            printf(\"%s\", tokens[i]);\n        fflush(stdout);\n    }\n}\n\nvox_stream_finish(s);\nconst char *tokens[16];\nint n;\nwhile ((n = vox_stream_get(s, tokens, 16)) > 0) {\n    for (int i = 0; i \u003C n; i++)\n        printf(\"%s\", tokens[i]);\n}\nprintf(\"\\n\");\n\nvox_stream_free(s);\n```\n\n`feed()` runs the mel spectrogram, encoder, and decoder on available data, queuing output tokens. `finish()` adds padding and processes remaining audio. `get()` retrieves pending tokens — call it after each `feed()` or whenever convenient. 
Token string pointers returned by `vox_stream_get()` are valid until `vox_stream_free()`.\n\n`vox_stream_flush(s)` forces the encoder to process whatever audio is buffered, regardless of the processing interval, and feeds right-padding so the decoder emits tokens that are behind the delay window. Unlike `finish()`, the stream stays open — you can continue feeding audio afterwards. This is useful for silence detection: when the speaker pauses, flush to get the pending transcription without ending the stream.\n\nUse `vox_set_processing_interval(s, seconds)` to control the latency\u002Fefficiency tradeoff (equivalent to `-I` on the CLI). When set, `feed()` accumulates audio but only runs the encoder\u002Fdecoder after at least the specified duration of new audio has been fed. Lower values give more responsive streaming (text appears sooner), higher values batch more audio per encoder call for better GPU utilization. Default is 2.0 seconds. See the `-I` flag documentation above for guidance on choosing values.\n\n**Alternative tokens** — when the model is uncertain, retrieve competing candidates:\n\n```c\nvox_stream_set_alt(s, 3, 0.5);  \u002F* up to 3 alternatives, cutoff 0.5 *\u002F\n\nconst int n_alt = 3;\nconst char *tokens[16 * 3];\nint n;\nwhile ((n = vox_stream_get_alt(s, tokens, 16, n_alt)) > 0) {\n    for (int i = 0; i \u003C n; i++) {\n        printf(\"%s\", tokens[i * n_alt]);  \u002F* best token *\u002F\n        for (int a = 1; a \u003C n_alt && tokens[i * n_alt + a]; a++)\n            printf(\" [alt: %s]\", tokens[i * n_alt + a]);\n    }\n}\n```\n\n`vox_stream_get()` is unaffected — it always returns just the best token.\n\nThere is also a one-shot convenience function if you don't need streaming:\n\n```c\nchar *text = vox_transcribe(ctx, \"audio.wav\");\nprintf(\"%s\\n\", text);\nfree(text);\n```\n\n## Building\n\nChoose a backend when building:\n\n```bash\nmake            # Show available backends\nmake blas       # BLAS acceleration (Accelerate on 
macOS, OpenBLAS on Linux)\nmake mps        # Apple Silicon Metal GPU (fastest, macOS only)\n```\n\n**Recommended:**\n- macOS Apple Silicon: `make mps`\n- macOS Intel: `make blas`\n- Linux with OpenBLAS: `make blas`\n\nFor `make blas` on Linux, install OpenBLAS first:\n```bash\n# Ubuntu\u002FDebian\nsudo apt install libopenblas-dev\n\n# Fedora\nsudo dnf install openblas-devel\n```\n\nOther targets:\n```bash\nmake clean      # Clean build artifacts\nmake info       # Show available backends for this platform\nmake inspect    # Build safetensors weight inspector\n```\n\n## Model Download\n\nDownload model weights (~8.9GB) from HuggingFace:\n\n```bash\n.\u002Fdownload_model.sh\n```\n\nThis downloads to `.\u002Fvoxtral-model\u002F` containing:\n- `consolidated.safetensors` — all weights, BF16 (~8.9GB)\n- `tekken.json` — Tekken tokenizer vocabulary (~15MB)\n- `params.json` — model configuration\n\nThe model is [Apache-2.0 licensed](https:\u002F\u002Fhuggingface.co\u002Fmistralai\u002FVoxtral-Mini-4B-Realtime-2602).\n\n## How Fast Is It?\n\nBenchmarks on **Apple M3 Max** (40-core GPU, 128GB RAM, 400 GB\u002Fs bandwidth):\n\n| Backend | Encoder (3.6s audio) | Prefill | Decoder |\n|---------|---------------------|---------|---------|\n| MPS | 284 ms | 252 ms | 23.5 ms\u002Fstep (short) |\n| BLAS | ~8s | ~1.2s | 335 ms\u002Fstep |\n\nThe MPS backend runs the entire decoder in a single Metal command buffer per token, with custom GPU kernels for attention, RoPE, and KV cache management. All weights are pre-converted to f16 on GPU at load time. The BLAS backend uses Accelerate's multi-threaded sgemm with on-the-fly BF16→F32 conversion.\n\nDecoder speed depends on sequence length: attention scans the full KV cache each step, so longer transcriptions are slower per token. For a 60-second clip (~760 steps), the average is ~31.6 ms\u002Fstep. For short clips (~15 steps) it's ~23.5 ms\u002Fstep. 
Either way, the decoder generates one token per ~80ms of audio, so even at 31.6 ms\u002Fstep transcription runs ~2.5x faster than real-time.\n\nProcessing time scales linearly with audio length for both the encoder (O(n) with sliding window attention) and the decoder (one token per 80ms of audio).\n\n## Model Architecture\n\nVoxtral Realtime 4B is a streaming speech-to-text model with ~4B parameters:\n\n**Pipeline:**\n```\nWAV → 16kHz → Mel Spectrogram → Conv Stem → Encoder → Downsample 4x → Adapter → Decoder → Tokens\n```\n\n| Component | Architecture |\n|-----------|-------------|\n| Audio Encoder | 32-layer causal transformer, 1280 dim, 32 heads, sliding window 750 |\n| Adapter | Linear(5120→3072) → GELU → Linear(3072→3072) |\n| LLM Decoder | 26-layer transformer (Ministral-3 based), 3072 dim, GQA (32 heads \u002F 8 KV) |\n\n| Parameter | Value |\n|-----------|-------|\n| Total parameters | ~4B (0.6B encoder + 3.4B decoder) |\n| Weight format | BF16 |\n| Vocab size | 131,072 (Tekken tokenizer) |\n| Audio frame rate | 12.5 Hz (1 token = 80ms) |\n| Max audio length | Unlimited (rolling KV cache) |\n| Supported languages | EN, ES, FR, PT, HI, DE, NL, IT, AR, RU, ZH, JA, KO |\n\n## Memory Requirements\n\n| Component | Size |\n|-----------|------|\n| Model weights (mmap'd) | 8.9 GB on disk, mapped on-demand |\n| MPS GPU weight cache | ~8.4 GB (BF16→F16 cached on GPU) |\n| KV cache (decoder) | ~1.8 GB max (rolling, capped at sliding window) |\n| Working buffers | ~200 MB |\n\n## License\n\nMIT\n","Voxtral.c is a pure C inference implementation of Mistral AI's Voxtral Realtime 4B speech-to-text model. The project depends on no external libraries and uses only the C standard library; it supports fast inference via MPS and offers a BLAS acceleration option (although it is slower). Audio processing uses a chunked encoder with overlapping windows, keeping memory usage bounded regardless of input length. It also supports transcribing audio captured in real time from standard input or the microphone, and provides a streaming C API that receives audio incrementally and outputs text as it becomes available. It suits real-time speech-to-text scenarios that call for a lightweight, dependency-free environment, such as embedded systems or resource-constrained devices. Note that the project still needs further testing to reach production quality.",2,"2026-06-11 03:50:30","high_star"]