[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83219":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":8,"language":10,"languages":8,"totalLinesOfCode":8,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":18,"compositeScore":19,"rankGlobal":8,"rankLanguage":8,"license":8,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":22,"topics":23,"createdAt":8,"pushedAt":8,"updatedAt":24,"readmeContent":25,"aiSummary":8,"trendingCount":15,"starSnapshotCount":15,"syncStatus":14,"lastSyncTime":26,"discoverSource":27},83219,"Audio-Interaction","xzf-thu\u002FAudio-Interaction","xzf-thu",null,"https:\u002F\u002Fxzf-thu.github.io\u002FAudio-Interaction\u002F","Python",347,18,8,2,0,27,232,168,87.84,false,"main",true,[],"2026-06-12 04:01:40","# Audio Interaction Model\n\n\u003Cp align=\"center\">\n  \u003Cb>English\u003C\u002Fb> | \u003Ca href=\"README_ZH.md\">简体中文\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Ffigures\u002Ftop.png\" alt=\"AudioInteraction Logo\" width=\"100%\">\n\u003C\u002Fp>\n\nToday's Large Audio Language Models (LALMs) are stuck in an offline paradigm: you hand them a complete audio clip, wait, and get a reply. Streaming audio models exist, but each one only handles a single, isolated task. There has never been a general streaming audio language model. 
We formalize that missing capability as a new concept, **the Audio Interaction Model**, and build the first one.\nAudioInteraction is a unified Audio Interaction Model that:\n\n✅ Runs conventional offline audio tasks (ASR, S2TT, AQA...)\n\n✅ Runs streaming audio tasks in real time (voice chatting...)\n\n✅ Achieves general streaming audio instruction following on a live stream\n\n✅ Does all of the above inside a single, all-in-one model that is always-on and proactive\n\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F2606.05121\">Technical Report 📖\u003C\u002Fa> \u002F\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fzhifeixie\u002FStreamAudio-2M\">StreamAudio-2M 🤗\u003C\u002Fa> \u002F\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fzhifeixie\u002FAudioInteraction\">AudioInteraction Model 🤗\u003C\u002Fa> \u002F\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmasaz14\u002FProactive-Sound-Effect-Benchmark\">Streaming-Audio-Bench 🏆\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FAudioInteraction\u002FAudioInteraction\u002Fraw\u002Fmain\u002Fassets\u002Fwechat.jpg\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWeChat-Join%20Group-07C160?logo=wechat&logoColor=white\" alt=\"WeChat\">\u003C\u002Fa>&nbsp;\u003Ca href=\"https:\u002F\u002Fxzf-thu.github.io\u002FAudio-Interaction\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FProject-Page-blue\" alt=\"Project Page\">\u003C\u002Fa>&nbsp;\u003Ca href=\"https:\u002F\u002Fx.com\u002FXieZhifei14110\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FX-@audiointeraction-black?logo=x&logoColor=white\" alt=\"X\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=4YuBkMm1cmU\">\n    \u003Cimg 
src=\"https:\u002F\u002Fimg.youtube.com\u002Fvi\u002F4YuBkMm1cmU\u002Fmaxresdefault.jpg\" alt=\"Watch AudioInteraction running live\" width=\"95%\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\u003Cp align=\"center\">\u003Cem>▶ Click to watch AudioInteraction listen, decide, and speak — live (YouTube)\u003C\u002Fem>\u003C\u002Fp>\n\n\n\n## 🔥 News\n\n- [Coming]: We will release the full dataset and data curation pipeline.\n- [Coming]: The full training configs and pipeline.\n\n\n- **May 20, 2026**: 🔥 We release **StreamAudio-2M**.\n- **May 20, 2026**: 🔥 We release the **AudioInteraction Inference and Training Codebase**.\n- **May 19, 2026**: 🔥 **AudioInteraction** model weights are now available on Hugging Face.\n- **May 19, 2026**: 🔥 We release the **AudioInteraction Technical Report**.\n\n\n## Contents\n\n* **[Quick Start](#quick-start)**\n* **[Demos](#demos)** \n* **[SoundFlow: Train your own Audio Interaction Model](#how-it-works)**\n* **[StreamAudio-2M dataset](#datasets)**\n* **[Evaluation results](#evaluation)**\n* **[License, Citation & Stars](#citation)**\n\n\n## \u003Ca id=\"quick-start\">\u003C\u002Fa>⚡ Quick Start\n\nAudioInteraction is an always-on model: it keeps listening to incoming audio frames and **decides for itself when to speak**. 
By default it stays in a `⟨Silent⟩` state and only emits output when the task or the acoustic context warrants it — so you can open a single session, stream audio into it continuously, and watch every capability take turns on its own.\n\n**Installation**\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FAudioInteraction\u002FAudioInteraction.git\ncd AudioInteraction\n\nconda create -n AudioInteraction python=3.12 -y\nconda activate AudioInteraction\n# Check that you are using a CUDA-enabled PyTorch build\npip install -r requirements.txt\n# Install ffmpeg\nconda install -c conda-forge ffmpeg\n```\n\n**Download Weights**\n```bash\n# Download model weights from Hugging Face\nexport PYTHONPATH=.\u002F\npython download.py\n```\n\n## Inference and WebUI\n\nRun inference first, then start the WebUI demo.\n\n```bash\n# Add project root to PYTHONPATH\nexport PYTHONPATH=.\u002F\n\n# 1. Offline inference\npython infer_offline.py\n\n# To test bundled samples, set input_path in infer_offline.py to one of:\n# sample\u002F01_count_bark\u002Fsequence.json\n# sample\u002F02_translate\u002Fsequence.json\n# sample\u002F03_cough_music\u002Fsequence.json\n\n# 2. Real-time inference\npython infer_online.py\n```\n\n### WebUI real-time demo\n\n```bash\n# Download model weights from Hugging Face first\npython web\u002Fserver.py\n\n# Open in browser:\n# http:\u002F\u002Flocalhost:5001\n```\n\n\n\n## \u003Ca id=\"demos\">\u003C\u002Fa>🎬 Demos\n\nMost audio models do one job and wait to be asked. AudioInteraction's defining trait is that **all of its abilities live in the same continuous stream**, and the model itself decides which one is needed at each moment. 
The demo below is **one unbroken session, one model, no mode switches, no prompts** — transcription, understanding, conversation, and proactive intervention simply happen as the soundscape changes.\n\n\u003Cdiv align=\"center\">\n  \u003Cvideo src=\"assets\u002Fdemo\u002Fall_in_one_session.mp4\" controls width=\"320\">\u003C\u002Fvideo>\n\u003C\u002Fdiv>\n\n\n#### Capability 1 — Online audio understanding\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Cth valign=\"top\">Input (streaming)\u003C\u002Fth>\n    \u003Cth valign=\"top\">gpt-audio\u003C\u002Fth>\n    \u003Cth valign=\"top\">doubao-voicechat\u003C\u002Fth>\n    \u003Cth valign=\"top\">gemini-omni\u003C\u002Fth>\n    \u003Cth valign=\"top\">AudioInteraction (Ours)\u003C\u002Fth>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd valign=\"top\">Continuous ambient audio: footsteps, a door opening, distant traffic.\u003C\u002Ftd>\n    \u003Ctd valign=\"top\">❌ Record-then-infer: waits for the clip to end, then returns one summary — no incremental narration.\u003C\u002Ftd>\n    \u003Ctd valign=\"top\">⚠️ Speech-centric: lumps non-speech into \"background noise\" and misses individual events.\u003C\u002Ftd>\n    \u003Ctd valign=\"top\">⚠️ Buffers a fixed window first, so narration lags several seconds behind the sound.\u003C\u002Ftd>\n    \u003Ctd valign=\"top\">✅ Detects each event incrementally and narrates the scene in real time, without waiting for the clip to end.\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>Capabilities 2 – 4 (transcription &amp; translation · full-spectrum chat · proactive intervention)\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n#### Capability 2 — Real-time transcription &amp; translation\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Cth valign=\"top\">Input (streaming)\u003C\u002Fth>\n    \u003Cth valign=\"top\">gpt-audio\u003C\u002Fth>\n    \u003Cth valign=\"top\">doubao-voicechat\u003C\u002Fth>\n    \u003Cth 
valign=\"top\">gemini-omni\u003C\u002Fth>\n    \u003Cth valign=\"top\">AudioInteraction (Ours)\u003C\u002Fth>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd valign=\"top\">A speaker talking continuously while the model listens.\u003C\u002Ftd>\n    \u003Ctd valign=\"top\">⚠️ Clean transcript, but only after the utterance finishes — no mid-sentence partials.\u003C\u002Ftd>\n    \u003Ctd valign=\"top\">⚠️ Streams ASR well, but translation is turn-based and only fires at sentence boundaries.\u003C\u002Ftd>\n    \u003Ctd valign=\"top\">⚠️ Emits chunks but re-decodes aggressively, causing flicker and unstable partials.\u003C\u002Ftd>\n    \u003Ctd valign=\"top\">✅ Emits partial transcripts and translations chunk by chunk with low latency, correcting incrementally as context arrives.\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n#### Capability 3 — Voice chat beyond speech\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Cth valign=\"top\">Input (streaming)\u003C\u002Fth>\n    \u003Cth valign=\"top\">gpt-audio\u003C\u002Fth>\n    \u003Cth valign=\"top\">doubao-voicechat\u003C\u002Fth>\n    \u003Cth valign=\"top\">gemini-omni\u003C\u002Fth>\n    \u003Cth valign=\"top\">AudioInteraction (Ours)\u003C\u002Fth>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd valign=\"top\">A user asks about a song playing in the background while talking.\u003C\u002Ftd>\n    \u003Ctd valign=\"top\">⚠️ Hears the speech but ignores the music — answers as if no song were playing.\u003C\u002Ftd>\n    \u003Ctd valign=\"top\">❌ Treats the music as noise to suppress; can't reason about it.\u003C\u002Ftd>\n    \u003Ctd valign=\"top\">⚠️ Can ID the song in isolation, but can't fuse it with the ongoing conversation.\u003C\u002Ftd>\n    \u003Ctd valign=\"top\">✅ Jointly perceives speech, music, and general audio, and responds in a context-aware, full-spectrum conversation.\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n#### Capability 4 — Proactive intervention\n\n\u003Ctable>\n  \u003Ctr>\n    
\u003Cth valign=\"top\">Input (streaming)\u003C\u002Fth>\n    \u003Cth valign=\"top\">gpt-audio\u003C\u002Fth>\n    \u003Cth valign=\"top\">doubao-voicechat\u003C\u002Fth>\n    \u003Cth valign=\"top\">gemini-omni\u003C\u002Fth>\n    \u003Cth valign=\"top\">AudioInteraction (Ours)\u003C\u002Fth>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd valign=\"top\">A smoke alarm starts beeping while the user is silent.\u003C\u002Ftd>\n    \u003Ctd valign=\"top\">❌ Stays silent — only responds when prompted; no self-initiated speech.\u003C\u002Ftd>\n    \u003Ctd valign=\"top\">❌ Waits for a wake word \u002F user turn; never volunteers a warning.\u003C\u002Ftd>\n    \u003Ctd valign=\"top\">❌ No notion of \u003Cem>when\u003C\u002Fem> to speak; requires an explicit query.\u003C\u002Ftd>\n    \u003Ctd valign=\"top\">✅ Holds \u003Ccode>⟨Silent⟩\u003C\u002Fcode> until the acoustic cue appears, then switches to \u003Ccode>⟨Speak⟩\u003C\u002Fcode> and warns the user — no prompt required.\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\u003C\u002Fdetails>\n\n\n\n## \u003Ca id=\"how-it-works\">\u003C\u002Fa>⚙️ SoundFlow: Train your own Audio Interaction Model\nOffline audio models answer a finished clip, but real audio needs a model that listens continuously and decides, moment to moment, whether to speak. SoundFlow trains a single model that at every chunk chooses between `⟨Speak⟩` and `⟨Silent⟩`, so recognition, translation, and dialogue become instructions inside one always-on perceive–decide–respond loop — a Large Audio Interaction Model (LAIM) — instead of separate per-task models. 
The framework covers the whole pipeline: stitching short clips into long interactions to build training data, chunk-level decision training with history review and comprehension-aware silence, and asynchronous FIFO inference that cuts first-frame latency by 4.5×.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Ffigures\u002Fsoundflow.png\" alt=\"SoundFlow framework\" width=\"92%\">\n\u003C\u002Fp>\n\n&nbsp;\n\n## \u003Ca id=\"finetuning\">\u003C\u002Fa>🔧 Finetuning\n\nData samples are in `\u002Fsrc\u002Faudiointeraction\u002Fdataset\u002Fexamples`.\n\nYou can fine-tune AudioInteraction on your own streaming data, and you can also use this repository to train standard offline audio language models. There are two steps: build the training data, then train.\n\n### 1. Prepare training data\n\nEdit the path constants at the top of each script first:\n\n| File | Constants to fill in |\n|---|---|\n| `src\u002Faudiointeraction\u002Fdataset\u002Fget_feat.py` | `QWEN_OMNI_CKPT`, `AUDIO_TOWER_CKPT` |\n| `src\u002Faudiointeraction\u002Fdataset\u002Fget_dataset_online.py` | `QWEN_OMNI_CKPT` |\n| `src\u002Faudiointeraction\u002Fdataset\u002Fget_dataset_offline.py` | `QWEN_OMNI_CKPT`, `AUDIO_TOWER_CKPT` |\n\n#### Input JSONL format\n\n**Online** (streaming, multi-turn audio). One JSON object per line:\n\n```json\n{\"conversation\": [\n    {\"audio_path\": \"\u002Fpath\u002Fto\u002Fturn1.wav\", \"assistant\": \"reply 1\", \"emotion\": \"normal\"},\n    {\"audio_path\": \"\u002Fpath\u002Fto\u002Fturn2.wav\", \"assistant\": \"reply 2\", \"emotion\": \"happy\"}\n]}\n```\n\n- `audio_path` and `assistant` are required on every turn.\n- `emotion` is optional and defaults to `\"normal\"`. 
Allowed values: `happy`, `sad`, `angry`, `surprise`, `normal`, `urgent`.\n- To make the model stay silent on a turn, set `assistant` to `\"\u003Cno need to response>\"`.\n\nA single-turn shorthand is also accepted:\n\n```json\n{\"merge_path\": \"\u002Fpath\u002Fto\u002Faudio.wav\", \"assistant\": \"reply\", \"emotion\": \"normal\"}\n```\n\n**Offline** (single-turn). One JSON object per line, either the flat form:\n\n```json\n{\"user\": \"user text\", \"assistant\": \"reply\", \"audio_path\": \"\u002Fpath\u002Fto\u002Faudio.wav\"}\n```\n\nor the online-style multi-turn shape, in which case only the **first** turn is used:\n\n```json\n{\"conversation\": [{\"user\": \"...\", \"assistant\": \"...\", \"audio_path\": \"...\"}, ...]}\n```\n\n`assistant` is always required. The task variant is decided by which other fields are present:\n\n| Has `audio_path`? | Has `user`? | Task |\n|:---:|:---:|---|\n| ✓ | ✓ | `A_T_T` — audio + user text → assistant |\n| ✓ |   | `A_T` — audio → assistant |\n|   | ✓ | `T_T` — user text → assistant |\n\n#### Data processing\n\n```bash\n# Online: \u003Cinput.jsonl> \u003Coutput.jsonl> \u003Cerror.log> \u003Cfeature_dir>\npython src\u002Faudiointeraction\u002Fdataset\u002Fget_dataset_online.py \\\n    \u003Cinput.jsonl> \u003Coutput.jsonl> \u003Cerror.log> \u003Cfeature_dir>\n# Example:\n# python src\u002Faudiointeraction\u002Fdataset\u002Fget_dataset_online.py \\\n#     data\u002Fonline_raw.jsonl data\u002Fonline.jsonl logs\u002Fonline.err features\u002Fonline\n\n# Offline: \u003Cinput.jsonl> \u003Coutput.jsonl> \u003Cerror.log> \u003Cfeature_dir>\npython src\u002Faudiointeraction\u002Fdataset\u002Fget_dataset_offline.py \\\n    \u003Cinput.jsonl> \u003Coutput.jsonl> \u003Cerror.log> \u003Cfeature_dir>\n# Example:\n# python src\u002Faudiointeraction\u002Fdataset\u002Fget_dataset_offline.py \\\n#     data\u002Foffline_raw.jsonl data\u002Foffline.jsonl logs\u002Foffline.err features\u002Foffline\n```\n\nBoth scripts are resumable: re-running picks 
up where the previous run stopped, skipping any `idx` that was already written. For a parallel multi-GPU template, see `src\u002Faudiointeraction\u002Fdataset\u002Fprocess_get_feature.sh`.\n\n### 2. Train\n\n```bash\n# 1. Set the two data roots referenced by config.yaml\nexport DATA_ROOT=\u002Fpath\u002Fto\u002Fyour\u002Fjsonl\u002Fdata\nexport CHECKPOINT_ROOT=\u002Fpath\u002Fto\u002Fyour\u002Fcheckpoints\n# Example:\n# export DATA_ROOT=\u002Fdata\u002Faudiointeraction\u002Fjsonl\n# export CHECKPOINT_ROOT=\u002Fdata\u002Faudiointeraction\u002Fckpts\n\n# 2. Edit hyperparameters \u002F data sources in src\u002Faudiointeraction\u002Ffinetune\u002Fconfig.yaml\n\n# 3. Launch\npython src\u002Faudiointeraction\u002Ffinetune\u002Ffull.py --config src\u002Faudiointeraction\u002Ffinetune\u002Fconfig.yaml\n```\n\n## \u003Ca id=\"datasets\">\u003C\u002Fa> 🎊 StreamAudio-2M: a large-scale streaming audio instruction-following corpus\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Ffigures\u002Fdataset.png\" alt=\"StreamAudio-2M dataset\" width=\"92%\">\n\u003C\u002Fp>\n\nStreamAudio-2M is a ~2.6M-item streaming instruction-following corpus (7.4M rounds, 66.7K hours) covering seven capabilities, including audio understanding, real-time ASR, speech translation, voice chatting, proactive response, and environment-aware agent. It is built by collecting clips from real-world datasets (AudioSet, CommonVoice, CoVoST2, MOSS, …), synthesizing text into speech with CosyVoice, and then concatenating the clips into streaming sequences with environmental noise and token-level annotation.\n\n### Sample structure\n\nEach line is one streaming sequence made of multiple turns:\n\n```json\n{\n  \"id\": \"voice_chatting_000123\",\n  \"stream_scene_type\": \"Home Smart\",\n  \"num_turns\": 2,\n  \"turns\": [\n    {\n      \"user\": \"Turn the living room lights down a 
bit.\",\n      \"assistant\": \"Sure, dimming them to 40%.\",\n      \"emotion\": \"normal\",\n      \"scene_type\": \"Home Smart\",\n      \"audio_path\": \"voice_chatting\u002F000123\u002Fturn_0.wav\"\n    },\n    {\n      \"user\": \"Thanks. What's the temperature in here?\",\n      \"assistant\": \"It's 22.5 degrees in the living room.\",\n      \"emotion\": \"normal\",\n      \"scene_type\": \"Home Smart\",\n      \"audio_path\": \"voice_chatting\u002F000123\u002Fturn_1.wav\"\n    }\n  ]\n}\n```\n\nSet `assistant` to `\"\u003Cno need to response>\"` for a turn where the model should stay silent.\n\n## \u003Ca id=\"evaluation\">\u003C\u002Fa>📊 Experimental results of Audio-Interaction\n\n### Table 1: Results on MMAU Benchmark\n\n| Model | Size | Stream. | Multi-turn | Text Sound | Text Music | Text Speech | Text Avg. | Audio Sound | Audio Music | Audio Speech | Audio Avg. |\n|---|---:|:---:|:---:|---:|---:|---:|---:|---:|---:|---:|---:|\n| **_Large Audio Language Models_** |  |  |  |  |  |  |  |  |  |  |  |\n| Audio Flamingo 2 | 3B | ✗ | ✗ | **71.47** | **70.96** | 44.74 | 62.40 | 1.50 | 1.49 | 0.35 | 1.16 |\n| Qwen2-Audio-Instruct | 8.4B | ✗ | ✓ | 54.95 | 50.98 | 42.04 | 49.20 | 22.32 | 19.16 | 16.31 | 19.41 |\n| Voxtral-Mini | 3B | ✗ | ✓ | 58.56 | 49.70 | 43.53 | 50.60 | 46.08 | 34.13 | 30.50 | 37.24 |\n| Audio-Reasoner | 8.4B | ✗ | ✗ | 60.06 | 64.30 | **60.70** | 61.71 | 20.48 | 26.65 | 13.48 | 20.57 |\n| **_Omni Language Models_** |  |  |  |  |  |  |  |  |  |  |  |\n| Qwen2.5-Omni | 3B | ✗ | ✓ | 65.36 | 48.94 | 57.78 | 57.81 | 51.81 | 44.01 | 29.79 | 42.51 |\n| Qwen2.5-Omni | 7B | ✗ | ✓ | \u003Cu>67.87\u003C\u002Fu> | \u003Cu>69.16\u003C\u002Fu> | \u003Cu>59.76\u003C\u002Fu> | **65.60** | 60.54 | \u003Cu>50.90\u003C\u002Fu> | \u003Cu>35.11\u003C\u002Fu> | \u003Cu>49.58\u003C\u002Fu> |\n| Phi-4-multimodal | 7B | ✗ | ✓ | 60.97 | 52.87 | 52.83 | 55.56 | 44.65 | 27.84 | 21.99 | 31.75 |\n| Baichuan-Omni-1.5 | 11B | ✗ | ✓ | 65.47 | 58.98 | 55.26 | 59.90 | 57.53 | 
36.53 | 24.82 | 40.40 |\n| **_Streaming Audio Language Models_** |  |  |  |  |  |  |  |  |  |  |  |\n| **Audio-Interaction** | **3B** | **✓** | **✓** | 64.12 | 47.80 | 55.13 | 55.68 | **65.63** | **57.93** | **39.68** | **58.15** |\n\n### Table 2: Performance on Spoken-Dialogue Benchmarks\n\n| Model | Size | SpokenQA LLa. Q. | SpokenQA Web Q. | Voicebench Alpa. | Voicebench SD-QA |\n|---|---:|---:|---:|---:|---:|\n| **_Specialized Models_** |  |  |  |  |  |\n| Moshi | 7B | 62.20 | 26.30 | 2.01 | 15.01 |\n| Freeze-Omni | 7B | 72.00 | 44.73 | 4.14 | 50.16 |\n| **_Omni & Audio Language Models_** |  |  |  |  |  |\n| Baichuan-Omni-1.5 | 7B | **78.50** | \u003Cu>59.10\u003C\u002Fu> | **4.50** | 43.40 |\n| Qwen2-Audio | 7B | 69.67 | 45.20 | 3.74 | 35.71 |\n| Qwen2.5-Omni | 3B | 66.00 | 27.95 | 4.32 | 49.37 |\n| Qwen2.5-Omni | 7B | 75.33 | **62.80** | \u003Cu>4.49\u003C\u002Fu> | **55.71** |\n| Phi-4-multimodal | 7B | 60.2 | 26.6 | 3.81 | 39.78 |\n| **_Streaming Audio Language Models_** |  |  |  |  |  |\n| **Audio-Interaction** | **3B** | 67.31 | 54.34 | 4.28 | \u003Cu>52.14\u003C\u002Fu> |\n\n### Table 3: ASR WER and S2TT BLEU on LibriSpeech and CoVoST2\n\n| Model | Size | ASR Clean ↓ | ASR Other ↓ | S2TT en-zh ↑ | S2TT zh-en ↑ |\n|---|---:|---:|---:|---:|---:|\n| **_Specialized Models_** |  |  |  |  |  |\n| Canary | 1B | **1.48** | **2.93** | - | - |\n| Canary-Qwen | 2.5B | 1.49 | \u003Cu>3.10\u003C\u002Fu> | - | - |\n| **_Omni & Audio Language Models_** |  |  |  |  |  |\n| Baichuan-Omni-1.5 | 7B | 5.71 | 10.09 | - | - |\n| Qwen2-Audio | 7B | 1.60 | 3.60 | 45.20 | 24.40 |\n| Qwen2.5-Omni | 3B | 2.87 | 5.90 | 39.50 | 18.17 |\n| Qwen2.5-Omni | 7B | \u003Cu>1.80\u003C\u002Fu> | 3.40 | 41.40 | \u003Cu>29.40\u003C\u002Fu> |\n| Phi-4-multimodal | 5.6B | 1.69 | 3.82 | \u003Cu>46.30\u003C\u002Fu> | 22.39 |\n| **_Streaming Audio Language Models_** |  |  |  |  |  |\n| **Audio-Interaction** | **3B** | 3.17 | 6.04 | **55.22** | **35.21** |\n\n\n## Acknowledgements\n\nWe sincerely 
thank the creators, maintainers, and contributors of the public datasets and resources used in this work. We also thank the broader large audio language model community for laying the groundwork that made streaming audio modeling possible.\n\nIn particular, this project builds on the following open-source repositories:\n\n- [Qwen2.5-Omni](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-Omni) — the audio encoder and language model backbone behind AudioInteraction.\n- [LitGPT](https:\u002F\u002Fgithub.com\u002FLightning-AI\u002Flitgpt) — the training framework our finetuning code is built on.\n- [CosyVoice](https:\u002F\u002Fgithub.com\u002FFunAudioLLM\u002FCosyVoice) — the text-to-speech model used to synthesize speech during data construction.\n\n\n## \u003Ca id=\"citation\">\u003C\u002Fa>License, Citation & Stars\n\nThis project will be released under the **Apache-2.0 License**. You can do everything with AudioInteraction 🎉\n\n**Citation**: You can cite AudioInteraction using the following BibTeX entry. 
Thank you for your kindness 🙂\n\n```bibtex\n@misc{xie2026audiointeractionmodel,\n      title={Audio Interaction Model}, \n      author={Zhifei Xie and Zihang Liu and Ze An and Xiaobin Hu and Yue Liao and Ziyang Ma and Dongchao Yang and Mingbao Lin and Deheng Ye and Shuicheng Yan and Chunyan Miao},\n      year={2026},\n      eprint={2606.05121},\n      archivePrefix={arXiv},\n      primaryClass={cs.SD},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.05121}, \n}\n```\n\n\u003Ca href=\"https:\u002F\u002Fwww.star-history.com\u002F?repos=xzf-thu%2FAudioInteraction&type=date&legend=top-left\">\n \u003Cpicture>\n   \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=xzf-thu\u002FAudio-Interaction&type=date&theme=dark&legend=top-left\" \u002F>\n   \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=xzf-thu\u002FAudio-Interaction&type=date&legend=top-left\" \u002F>\n   \u003Cimg alt=\"Star History Chart\" src=\"https:\u002F\u002Fapi.star-history.com\u002Fchart?repos=xzf-thu\u002FAudio-Interaction&type=date&legend=top-left\" \u002F>\n \u003C\u002Fpicture>\n\u003C\u002Fa>","2026-06-11 04:10:26","CREATED_QUERY"]