[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-84002":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":14,"stars7d":17,"stars30d":17,"stars90d":16,"forks30d":16,"starsTrendScore":18,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":25,"readmeContent":26,"aiSummary":10,"trendingCount":16,"starSnapshotCount":16,"syncStatus":14,"lastSyncTime":27,"discoverSource":28},84002,"MMAE","ddlBoJack\u002FMMAE","ddlBoJack","MMAE: A Massive Multitask Audio Editing Benchmark","",null,"Python",89,3,2,1,0,24,28,1.81,"MIT License",false,"main",true,[],"2026-06-12 02:04:37","# MMAE: A Massive Multitask Audio Editing Benchmark\n[**📖 arXiv**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2606.07229) | [**🎬 MMAE Demo Video**](https:\u002F\u002Fyoutu.be\u002F6At5nTWhlXI) | [**🛠️ GitHub Code**](https:\u002F\u002Fgithub.com\u002FddlBoJack\u002FMMAE) | [**🔊 HuggingFace Audio Download**](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBoJack\u002FMMAE)\n\n\u003Cp align=\"center\">\u003Cimg src=\"assets\u002Flogo.png\" alt=\"MMAE Benchmark Logo\" width=\"300\"\u002F>\u003C\u002Fp>\n\n## Overview of MMAE\nWe introduce **MMAE**, a **M**assive **M**ultitask **A**udio **E**diting benchmark, serving as the first comprehensive evaluation testbed designed for general-purpose instruction-based audio editing. \nMMAE extends to a broad spectrum of real-world scenarios, encompassing 7 distinct audio modalities, including sound, speech, music, and their mixtures. 
\nMMAE establishes a comprehensive taxonomy spanning 6 levels of task complexity, from basic modifications to multi-hop reasoning and multi-round editing, 2 levels of granularity, and 8 distinct operation types. \nMeticulously curated through human-agent collaboration, MMAE comprises 2,000 high-fidelity samples paired with a pioneering rubric-based evaluation framework. By decomposing free-form tasks into 17,741 verifiable criteria, this robust rubric-based paradigm enables a precise, multi-dimensional assessment of both instruction following and context consistency. \nWe hope MMAE will serve as a catalyst for future advances in the intelligent creation community, providing a clear diagnostic roadmap and establishing a standardized, long-lasting evaluation paradigm for next-generation audio editing systems. \n\nExamples of the MMAE benchmark:\n\n![Example](assets\u002Fexamples.png)\n\nDistribution of the MMAE benchmark across three taxonomy dimensions: modality, difficulty, and operation. \n\u003Cp float=\"left\">\n  \u003Cimg src=\"assets\u002Fpie_modality.png\" width=\"32%\" \u002F>\n  \u003Cimg src=\"assets\u002Fpie_difficulty.png\" width=\"32%\" \u002F>\n  \u003Cimg src=\"assets\u002Fpie_operation.png\" width=\"32%\" \u002F>\n\u003C\u002Fp>\n\n## Data Curation Pipeline\nMMAE is constructed through a systematic five-stage pipeline designed to ensure both the diversity and the high quality of the benchmark:\n1. Brainstorming.\n2. Taxonomy & Paradigm Construction.\n3. Instruction-Centric Data Collection.\n4. Rubrics Annotation.\n5. 
Quality Inspection.\n![Pipeline](assets\u002Fpipeline.png)\n\n## Evaluation\n\nWe use [Qwen3-Omni](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-Omni) as the judge model to evaluate audio editing outputs against our rubric-based criteria.\n\n### Step 1: Deploy Qwen3-Omni\n\nClone the official repository and set up the environment following its instructions:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen3-Omni.git\ncd Qwen3-Omni\n# Follow the official README to install dependencies\n```\n\nThen launch the vLLM serving instances. We provide a reference deployment script [`launch_qwen3_omni.sh`](eval\u002Flaunch_qwen3_omni.sh).\n\nThis starts two Qwen3-Omni instances (tensor-parallel=4 each) on 8 GPUs, serving on ports 8001 and 8002. Edit `MODEL_DIR` in the script to point to your local model weights.\n\n### Step 2: Prepare Predictions\n\nRun your audio editing model on the MMAE benchmark inputs ([metadata](MMAE-meta.json)). For each entry, append an `assistant` turn to the original ChatML-format `messages` that points to the output audio path, e.g.,\n\n```json\n[\n  {\n    \"id\": \"69e897fbf1844435bec75eca\",\n    \"messages\": [\n      {\n        \"role\": \"user\",\n        \"content\": [\n          {\"type\": \"text\", \"text\": \"Extract the music component from the audio.\"},\n          {\"type\": \"audio\", \"audio_url\": \"wav\u002F69e897fbf1844435bec75eca\u002Faudio1.wav\"}\n        ]\n      },\n      {\n        \"role\": \"assistant\",\n        \"content\": [\n          {\"type\": \"audio\", \"audio_url\": \"your_output_wav_path\"}\n        ]\n      }\n    ]\n  }\n]\n```\n\nThe `audio_url` paths can be absolute, or relative to the `--audio_root` you specify (by default, the predictions file's parent directory). 
Save the modified metadata as a separate JSON file; this file holds your model's predictions.\n\n### Step 3: Run Evaluation\n\n```bash\npython -m eval.score \\\n  --predictions path\u002Fto\u002Fyour_predictions.json \\\n  --base_urls \"http:\u002F\u002Flocalhost:8001\u002Fv1,http:\u002F\u002Flocalhost:8002\u002Fv1\" \\\n  --audio_root path\u002Fto\u002Faudio_root \\\n  --output_dir outputs\u002Fyour_model \\\n  --concurrency 8\n```\n\n**Arguments:**\n\n| Argument | Description |\n|----------|-------------|\n| `--predictions` | **(required)** Path to your predictions JSON file. |\n| `--base_urls` | **(required)** Comma-separated Qwen3-Omni endpoint URLs. |\n| `--metadata` | Path to MMAE metadata. Default: `MMAE-meta.json`. |\n| `--audio_root` | Base directory for resolving relative audio paths. Default: parent directory of the predictions file. |\n| `--output_dir` | Where to write results. Default: `outputs\u002Fscores`. |\n| `--concurrency` | Number of samples scored in parallel. Default: 16. |\n| `--retries` | Number of valid judge responses to collect per rubric. Should be 3. |\n| `--max_attempts` | Maximum total attempts (including failures) per rubric. Default: 10. |\n| `--timeout` | Timeout in seconds per judge request. Default: 300. |\n| `--model` | Model name served by vLLM. Default: `Qwen3Omni-Instruct`. |\n\n**Output files** (written to `--output_dir`):\n\n| File | Description |\n|------|-------------|\n| `results.jsonl` | Per-rubric detailed results: each rubric's 3 judge responses, per-attempt choices, scores, and raw model outputs. |\n| `per_sample.json` | Per-sample aggregated scores: Instruction Following Rate, Consistency Rate, and Exact Match Rate for each data entry. |\n| `taxonomy.json` | Scores grouped by modality, complexity, cross dimensions, and operation type. |\n","2026-06-11 04:12:01","CREATED_QUERY"]