[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-79936":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":10,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":21,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":16,"starSnapshotCount":16,"syncStatus":15,"lastSyncTime":27,"discoverSource":28},79936,"MOSS-Music","OpenMOSS\u002FMOSS-Music","OpenMOSS","MOSS-Music is an open-source music understanding model for targeting musical captioning, lyrics ASR, structural analysis, chord \u002F key \u002F tempo reasoning, and long-form musical question answering.","",null,"Python",90,6,1,2,0,3,7,9,2.54,false,"main",[],"2026-06-12 02:03:55","# MOSS-Music\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002FMOSS-Music.png\" width=\"58%\" alt=\"MOSS-Music logo\" \u002F>\n\u003C\u002Fp>\n\n\u003Cdiv align=\"center\">\n\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-Music-8B-Instruct\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Models-orange?logo=huggingface&amp\">\u003C\u002Fa>\n\u003C!-- \u003Ca href=\"https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-Music-8B-Instruct\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Models-624AFF?logo=data:image\u002Fsvg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAyNCAyNCI+PHBhdGggZmlsbD0id2hpdGUiIGQ9Ik0xMiAyQzYuNDggMiAyIDYuNDggMiAxMnM0LjQ4IDEwIDEwIDEwIDEwLTQuNDggMTAtMTBTMTcuNTIgMiAxMiAyeiIvPjwvc3ZnPg==&amp\">\u003C\u002Fa> -->\n\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBlog-Coming_Soon-blue?logo=internet-explorer&amp\">\n\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FArxiv-Coming_Soon-red?logo=Arxiv&amp\">\n\n\u003Ca href=\"https:\u002F\u002Fx.com\u002FOpen_MOSS\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTwitter-Follow-black?logo=x&amp\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FXf3aXddCjc\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDiscord-Join-5865F2?logo=discord&amp\">\u003C\u002Fa>\n\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\".\u002FREADME.md\">English\u003C\u002Fa> | \u003Ca href=\".\u002FREADME_zh.md\">简体中文\u003C\u002Fa>\n\u003C\u002Fp>\n\n**MOSS-Music** is an open-source **music understanding model** from\n[MOSI.AI](https:\u002F\u002Fmosi.cn\u002F#hero), the [OpenMOSS team](https:\u002F\u002Fwww.open-moss.com\u002F),\nand [Shanghai Innovation Institute](https:\u002F\u002Fwww.sii.edu.cn\u002F). Built on the same\naudio backbone as [MOSS-Audio](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-Audio),\nMOSS-Music is further specialised on music via dedicated continual pre-training\nand supervised fine-tuning — targeting **musical captioning, lyrics ASR,\nstructural analysis, chord \u002F key \u002F tempo reasoning, and long-form musical\nquestion answering**. In this release, we provide **two 8B models**:\n[**MOSS-Music-8B-Instruct**](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-Music-8B-Instruct) and [**MOSS-Music-8B-Thinking**](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-Music-8B-Thinking). 
The Instruct variant\nis optimised for direct instruction following on musical prompts, while the\nThinking variant provides stronger chain-of-thought reasoning for musical\nanalysis.\n\n## News\n\n* 2026.05.01: 🎉🎉🎉 We have released [MOSS-Music](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-Music-8B-Instruct).\n* 2026.05.01: 🎉🎉🎉 We have released [MOSS-Music-Data-Pipeline](https:\u002F\u002Fgithub.com\u002Fwx9songs\u002FMOSS-Music-Data-Pipeline) for large-scale music data annotation and processing.\n\n## Contents\n\n- [Introduction](#introduction)\n- [Model Architecture](#model-architecture)\n  - [DeepStack Cross-Layer Feature Injection](#deepstack-cross-layer-feature-injection)\n  - [Time-Aware Representation](#time-aware-representation)\n- [Released Models](#released-models)\n- [Music Data Pipeline](#music-data-pipeline)\n- [Evaluation](#evaluation)\n- [Quickstart](#quickstart)\n  - [Environment Setup](#environment-setup)\n  - [SGLang Serving](#sglang-serving)\n  - [Local Inference (Transformers)](#local-inference-transformers)\n  - [Gradio App](#gradio-app)\n- [More Information](#more-information)\n- [LICENSE](#license)\n- [Citation](#citation)\n\n## Introduction\n\nMusic is not just audio plus lyrics — understanding it requires perceiving\nharmonic structure, rhythm, timbre, instrumentation, performance nuance, and\nthe textual content of the lyrics, and reasoning about them jointly across\ntime. 
**MOSS-Music** is built to unify these capabilities within a single\nmodel.\n\n- **Lyrics ASR & time-aligned transcription**: Accurate singing ASR with\n  sentence- and word-level timestamps, robust to backing tracks.\n- **Musical captioning & tagging**: Natural-language descriptions of mood,\n  genre, instrumentation, production style, and emotional trajectory.\n- **Key \u002F tempo \u002F chord reasoning**: Identifies musical key, beats, downbeats,\n  and chord progressions, including timestamped chord transcription.\n- **Structural analysis**: Segments a song into intro \u002F verse \u002F chorus \u002F\n  bridge \u002F outro and reasons about repetition and contrast.\n- **Instrument & voice recognition**: Identifies prominent instruments and\n  singing voices (solo \u002F chorus, gender, register).\n- **Musical QA and long-form analysis**: Open-ended question answering\n  grounded in a full track, including chain-of-thought reasoning in the\n  *Thinking* variant.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fmoss-music_img.png\" width=\"98%\" alt=\"MOSS-Music overview\" \u002F>\n\u003C\u002Fp>\n\n## Model Architecture\n\nMOSS-Music inherits the MOSS-Audio modular design, comprising three\ncomponents: an audio encoder, a modality adapter, and a large language model.\nRaw audio is first encoded by **MOSS-Audio-Encoder** into continuous temporal\nrepresentations at **12.5 Hz**, which are then projected into the language\nmodel's embedding space through the adapter and finally consumed by the LLM\nfor auto-regressive text generation.\n\nRather than relying on off-the-shelf audio frontends, we train a dedicated\nencoder from scratch to obtain more robust acoustic representations, tighter\ntemporal alignment, and better extensibility across musical styles, singing,\nand non-speech acoustic content.\n\n### DeepStack Cross-Layer Feature Injection\n\nUsing only the encoder's top-layer features tends to lose low-level prosody,\ntransient events, and 
local time-frequency structure. To address this, we\nadopt a **DeepStack**-inspired cross-layer injection module between the\nencoder and the language model: in addition to the encoder's final-layer\noutput, features from earlier and intermediate layers are selected,\nindependently projected, and injected into the language model's early layers,\npreserving multi-granularity information from low-level acoustic details to\nhigh-level semantic abstractions.\n\nThis design is especially well-suited for music understanding, as it helps\nretain rhythm, timbre, transients, and instrumental texture — information\nthat a single high-level representation cannot fully capture, yet is critical\nfor chord recognition, structural analysis, and nuanced musical description.\n\n### Time-Aware Representation\n\nTime is a critical dimension in music understanding. To enhance explicit\ntemporal awareness, we adopt a **time-marker insertion** strategy during\npre-training: explicit time tokens are inserted between audio frame\nrepresentations at fixed time intervals to indicate temporal positions.\nThis design enables the model to learn \"what happened when\" within a unified\ntext generation framework, naturally supporting timestamped lyrics ASR,\nbeat \u002F downbeat localisation, section boundary detection, and long-song\nretrospective QA.\n\nBuilding on the MOSS-Audio backbone, MOSS-Music is further enhanced through:\n\n- **continual pre-training** on a large, diverse music corpus produced by\n  the data annotation and processing pipeline\n  [`MOSS-Music-Data-Pipeline`](https:\u002F\u002Fgithub.com\u002Fwx9songs\u002FMOSS-Music-Data-Pipeline),\n  with an emphasis on singing, lyrics, and full-song coverage;\n- **supervised fine-tuning (SFT)** on music-centric instruction data covering\n  captioning, lyrics ASR, chord \u002F key \u002F structural analysis, and long-form\n  musical QA;\n- additional **reasoning tuning** for the *Thinking* variant.\n\n## Released Models\n\n| Model | 
Audio Encoder | LLM Backbone | Total Size | Hugging Face | ModelScope |\n|---|---|---|---:|---|---|\n| **MOSS‑Music‑8B‑Instruct** | MOSS-Audio-Encoder | Qwen3-8B | ~9.1B | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-Music-8B-Instruct) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-624AFF)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-Music-8B-Instruct) |\n| **MOSS‑Music‑8B‑Thinking** | MOSS-Audio-Encoder | Qwen3-8B | ~9.1B | [![Hugging Face](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingface-Model-orange?logo=huggingface)](https:\u002F\u002Fhuggingface.co\u002FOpenMOSS-Team\u002FMOSS-Music-8B-Thinking) | [![ModelScope](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModelScope-Model-624AFF)](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenmoss\u002FMOSS-Music-8B-Thinking) |\n\n> Smaller (4B) variants and additional sizes may follow. Stay tuned!\n\n## Music Data Pipeline\n\nThe training data used by MOSS-Music is produced by an end-to-end pipeline\nthat goes from raw audio to chat-formatted training samples. That pipeline is\navailable at\n[`MOSS-Music-Data-Pipeline`](https:\u002F\u002Fgithub.com\u002Fwx9songs\u002FMOSS-Music-Data-Pipeline),\nwhich hosts duration detection, MIR feature extraction, song-structure\nsegmentation, lyrics ASR, metadata cleanup, and ALM-driven caption \u002F query\ngeneration with models such as Qwen3-Omni, MusicFlamingo, and other\naudio-language models.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fmusic_pipeline.png\" width=\"94%\" \u002F>\n\u003C\u002Fp>\n\n## Evaluation\n\nWe evaluate MOSS-Music on a diverse suite of public music understanding\nbenchmarks. 
Key results:\n\n- **Music QA and understanding**: **MOSS-Music-8B-Instruct** achieves **80.38**\n  average accuracy across **8 public music QA benchmarks** (excluding the\n  three NSynth note-recognition tracks), ranking first among all compared\n  models in our current evaluation set.\n- **Music captioning**: In our preliminary **GPT-5.4-as-a-Judge** evaluation,\n  the MOSS-Music series leads both caption benchmarks, with\n  `MOSS-Music-8B-Thinking` reaching **4.53** on `MusicCaps` and\n  `MOSS-Music-8B-Instruct` reaching **4.58** on `SDD`.\n- **Lyrics ASR for singing voice**: **MOSS-Music-8B-Thinking** achieves the\n  lowest average lyrics recognition error across `MUSDB18`, `MIR-1K` and\n  `Opencpop` (**15.88%** avg WER\u002FCER), clearly ahead of all compared\n  audio-language baselines including `Gemini-3.1-Pro-Preview`,\n  `MusicFlamingo` and `Qwen3-Omni`. Detailed timestamped-ASR results will be\n  released in a later update.\n- **Chord transcription**: MOSS-Music supports chord transcription, including\n  timestamped chord transcription for harmonic analysis, accompaniment\n  reference, and related downstream use cases. 
Detailed benchmark results will\n  be released in a later update.\n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\".\u002Fassets\u002Fmusic_bench.png\" width=\"98%\" \u002F>\n\u003C\u002Fp>\n\n### Music QA & Understanding (Accuracy↑)\n\n| Model | MMAU-music | MMAU-mini-music | MMAU-Pro-music | MMAR-music | MuChoMusic | Music-AVQA | NSynth (instrument) | NSynth (source) | NSynth (pitch) | GTZAN | Medley-Solos-DB | Avg |\n|-----|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|\n| **MOSS‑Music‑8B‑Instruct** | **79.33** | **80.78** | 71.02 | 59.70 | **89.39** | **76.78** | **86.55** | 61.07 | **86.94** | **93.59** | 92.42 | **80.38** |\n| Gemini‑3.1‑Pro | 71.69 | 77.18 | **73.06** | **71.64** | 79.53 | 61.51 | 13.38 | 38.90 | 6.47 | 86.39 | 80.34 | 75.17 |\n| **MOSS‑Music‑8B‑Thinking** | 74.09 | 77.78 | 67.98 | 50.25 | 82.90 | 68.90 | 56.17 | 57.48 | 77.83 | 84.78 | 87.42 | 74.26 |\n| MusicFlamingo | 76.83 | 76.35 | 65.60 | 48.66 | 74.58 | 73.60 | 80.76 | **75.89** | 0.00 | 84.45 | 90.86 | 73.87 |\n| Audio‑Flamingo‑Next | 72.39 | 72.07 | 61.64 | 45.27 | 75.62 | 62.94 | 86.40 | 66.73 | 0.05 | 77.68 | 91.47 | 69.89 |\n| MiMo‑Audio‑7B‑Instruct | 66.36 | 72.97 | 66.50 | 45.77 | 75.40 | 57.05 | 25.01 | 1.49 | 4.86 | 65.67 | **93.81** | 67.94 |\n| Step‑Audio‑R1 | 66.46 | 75.08 | 62.34 | 50.75 | 72.62 | 57.98 | 13.75 | 15.87 | 2.39 | 73.67 | 82.45 | 67.67 |\n| Qwen3‑Omni | 65.76 | 68.77 | 66.27 | 48.54 | 78.77 | 56.05 | 30.92 | 44.30 | 28.08 | 80.15 | 69.65 | 66.75 |\n| Kimi‑Audio‑7B‑Instruct | 47.95 | 52.25 | 59.10 | 45.27 | 70.18 | 68.90 | 6.01 | 0.81 | 3.88 | 39.54 | 71.98 | 56.90 |\n\n> `Avg` is computed over 8 public music QA benchmarks:\n> `MMAU-music`, `MMAU-mini-music`, `MMAU-Pro-music`, `MMAR-music`,\n> `MuChoMusic`, `Music-AVQA`, `GTZAN`, and `Medley-Solos-DB`.\n>\n> We exclude the three `NSynth` tracks from the main average because they focus\n> on fine-grained isolated-note recognition, including instrument-family,\n> acoustic\u002Felectronic source, 
and exact pitch discrimination from short\n> single-note clips. Some compared audio-language models are not explicitly\n> designed for this note-level classification setting, so we report NSynth\n> separately for reference rather than mixing it into the headline average.\n\n### Music Captioning\n\nWe further report a preliminary **GPT-5.4-as-a-Judge** music captioning\ncomparison on `MusicCaps` and `Song Describer Dataset (SDD)`. Scores are on a\n1-5 scale across 9 dimensions: `genre\u002Fstyle`, `mood\u002Faffect`, `tempo\u002Frhythm`,\n`instrumentation\u002Ftimbre`, `vocals`, `melody\u002Fharmony`, `structure\u002Fform`,\n`production\u002Faudio quality`, and `scene\u002Fuse case`.\n\n- **Overall caption quality**: the MOSS-Music series remains strongest across\n  both caption benchmarks, with `MOSS-Music-8B-Thinking` reaching **4.53** on\n  `MusicCaps` and `MOSS-Music-8B-Instruct` reaching **4.58** on `SDD`.\n- **Stronger structural descriptions**: MOSS-Music shows the clearest gains on\n  `structure \u002F form \u002F progression`, especially on `SDD`.\n- **Competitive baselines on instrumentation and scene semantics**:\n  `MusicFlamingo` and `Gemini-3.1-Pro` remain competitive on\n  `instrumentation\u002Ftimbre`, while `Gemini-3.1-Pro` is strongest on\n  `scene \u002F use case`.\n\n#### MusicCaps\n\n| Model | Genre | Mood | Tempo | Instr. 
| Vocals | Melody\u002FHarmony | Structure | Production | Scene | Avg |\n|-----|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|\n| **MOSS‑Music‑8B‑Thinking** | 4.78 | **4.69** | **4.62** | 4.40 | **4.46** | **4.40** | **4.86** | 4.35 | 4.18 | **4.53** |\n| Gemini‑3.1‑Pro | 4.70 | 4.60 | 4.48 | **4.68** | 4.18 | 4.18 | 3.86 | **4.40** | **4.72** | 4.42 |\n| **MOSS‑Music‑8B‑Instruct** | 4.60 | 4.52 | 4.46 | 4.02 | 4.30 | 4.38 | 4.78 | 4.20 | 3.96 | 4.36 |\n| MusicFlamingo | **4.80** | 4.36 | 4.50 | 4.64 | 3.94 | 4.08 | 3.58 | 4.30 | 3.72 | 4.21 |\n| Audio‑Flamingo‑Next | 4.34 | 4.56 | 4.08 | 4.30 | 4.18 | 3.78 | 3.66 | 4.04 | 3.92 | 4.10 |\n| MiMo‑Audio‑7B‑Instruct | 4.02 | 4.20 | 4.46 | 4.28 | 4.36 | 3.62 | 3.30 | 4.08 | 3.50 | 3.98 |\n| Step‑Audio‑R1 | 4.22 | 4.02 | 4.20 | 3.96 | 3.84 | 4.02 | 3.24 | 4.10 | 3.54 | 3.90 |\n| Qwen3‑Omni | 4.58 | 4.50 | 4.26 | 3.62 | 3.64 | 3.48 | 2.98 | 4.18 | 4.42 | 3.96 |\n| Kimi‑Audio‑7B‑Instruct | 3.98 | 3.92 | 4.32 | 3.88 | 4.48 | 3.28 | 2.72 | 3.72 | 3.24 | 3.73 |\n\n#### Song Describer Dataset (SDD)\n\n| Model | Genre | Mood | Tempo | Instr. 
| Vocals | Melody\u002FHarmony | Structure | Production | Scene | Avg |\n|-----|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|\n| **MOSS‑Music‑8B‑Instruct** | **4.84** | **4.76** | **4.68** | 4.24 | **4.52** | **4.56** | **4.92** | 4.42 | 4.24 | **4.58** |\n| Gemini‑3.1‑Pro | 4.72 | 4.64 | 4.52 | **4.72** | 4.22 | 4.24 | 3.94 | **4.46** | **4.82** | 4.48 |\n| **MOSS‑Music‑8B‑Thinking** | 4.66 | 4.58 | 4.50 | 4.36 | 4.36 | 4.44 | 4.84 | 4.26 | 4.02 | 4.45 |\n| MusicFlamingo | 4.82 | 4.40 | 4.52 | 4.70 | 3.98 | 4.14 | 3.66 | 4.36 | 3.80 | 4.26 |\n| Audio‑Flamingo‑Next | 4.40 | 4.62 | 4.14 | 4.36 | 4.22 | 3.84 | 3.74 | 4.10 | 4.00 | 4.16 |\n| MiMo‑Audio‑7B‑Instruct | 4.08 | 4.26 | 4.52 | 4.34 | 4.42 | 3.70 | 3.38 | 4.16 | 3.58 | 4.05 |\n| Step‑Audio‑R1 | 4.30 | 4.10 | 4.26 | 4.02 | 3.92 | 4.10 | 3.32 | 4.18 | 3.62 | 3.98 |\n| Qwen3‑Omni | 4.62 | 4.54 | 4.30 | 3.68 | 3.70 | 3.56 | 3.06 | 4.24 | 4.50 | 4.02 |\n| Kimi‑Audio‑7B‑Instruct | 4.04 | 3.98 | 4.38 | 3.96 | 4.54 | 3.36 | 2.80 | 3.80 | 3.32 | 3.80 |\n\n### Lyrics ASR (WER \u002F CER↓)\n\nWe further evaluate MOSS-Music on **singing-voice lyrics ASR** across three\nrepresentative benchmarks:\n\n- `MUSDB18` — English pop songs **with backing tracks**, scored with **WER**;\n- `MIR-1K` — **Chinese karaoke** clips with background music, scored with **CER**;\n- `Opencpop` — **clean Mandarin studio singing**, scored with **CER**.\n\n`Avg` is the unweighted mean of the three dataset-level error rates.\n\n| Model | MUSDB18 WER | MIR-1K CER | Opencpop CER | Avg |\n|-----|---:|---:|---:|---:|\n| **MOSS‑Music‑8B‑Thinking** | 29.19% | **15.84%** | 2.60% | **15.88%** |\n| **MOSS‑Music‑8B‑Instruct** | 32.99% | 23.96% | 4.62% | 20.52% |\n| Gemini‑3.1‑Pro‑Preview | 26.25% | 36.37% | 6.00% | 22.87% |\n| MusicFlamingo | **23.41%** | 38.98% | 18.73% | 27.04% |\n| Qwen3‑Omni‑30B‑A3B‑Instruct | 62.67% | 20.48% | **2.26%** | 28.47% |\n| MiMo‑Audio‑7B‑Instruct | 94.16% | 23.34% | 6.77% | 41.42% |\n| Kimi‑Audio‑7B‑Instruct | 97.53% | 
25.83% | 4.90% | 42.75% |\n| Step‑Audio‑R1 | 81.67% | 48.03% | 4.15% | 44.62% |\n| Audio‑Flamingo‑Next | 94.93% | 55.63% | 12.47% | 54.34% |\n\n> **MOSS-Music-8B-Thinking** achieves the lowest average lyrics-ASR error\n> (**15.88%**) across these three datasets, with particular gains on\n> `MIR-1K` (Chinese karaoke with accompaniment) and `Opencpop` (clean Mandarin\n> singing). MOSS-Music also inherits the strong timestamp-aware ASR ability\n> from MOSS-Audio; detailed singing-timestamp ASR results will be added soon.\n\n### Chord Transcription\n\nMOSS-Music supports chord transcription, including timestamped chord\ntranscription that tracks chord progression over time. This can be useful for\nharmonic analysis, accompaniment reference, music education, and related use\ncases. Detailed benchmark results will be added soon.\n\n## Quickstart\n\n### Environment Setup\n\nWe recommend Python 3.12 with a clean Conda environment. The commands below\nare enough for local inference.\n\n#### Recommended setup\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-Music.git\ncd MOSS-Music\n\nconda create -n moss-music python=3.12 -y\nconda activate moss-music\n\nconda install -c conda-forge \"ffmpeg=7\" -y\npip install --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu128 -e \".[torch-runtime]\"\n```\n\n#### Optional: FlashAttention 2\n\nIf your GPU supports FlashAttention 2, you can replace the last install\ncommand with:\n\n```bash\npip install --extra-index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu128 -e \".[torch-runtime,flash-attn]\"\n```\n\n### SGLang Serving\n\n> [!IMPORTANT]\n> To achieve the best generation quality and fully leverage the model's capabilities, we\n> **strongly recommend using SGLang Serving for inference**.\n\nSee the full SGLang guide in `moss_music_usage_guide.md`.\n\nDownload the model first:\n\n```bash\nhf download OpenMOSS-Team\u002FMOSS-Music-8B-Instruct --local-dir 
.\u002Fweights\u002FMOSS-Music-8B-Instruct\nhf download OpenMOSS-Team\u002FMOSS-Music-8B-Thinking --local-dir .\u002Fweights\u002FMOSS-Music-8B-Thinking\n```\n\nThe shortest setup is:\n\n```bash\ncd sglang\npip install -e \"python[all]\"\npip install nvidia-cudnn-cu12==9.16.0.29\ncd ..\n\nsglang serve \\\n  --model-path .\u002Fweights\u002FMOSS-Music-8B-Instruct \\\n  --trust-remote-code\n```\n\nYou can replace `.\u002Fweights\u002FMOSS-Music-8B-Instruct` with\n`.\u002Fweights\u002FMOSS-Music-8B-Thinking` if needed.\n\nIf you use the default `torch==2.9.1+cu128` runtime, installing\n`nvidia-cudnn-cu12==9.16.0.29` is recommended before starting `sglang serve`.\n\n### Local Inference (Transformers)\n\nFor a quick local sanity check without SGLang, simply run:\n\n```bash\npython infer.py\n```\n\nEdit `MODEL_PATH`, `AUDIO_PATH`, the prompt and the sampling\nhyper-parameters at the top of `infer.py` to point to your own model\nweights and audio.\n\n> [!NOTE]\n> This Transformers path is mainly for quick verification and debugging.\n> For best generation quality and throughput, please prefer\n> [SGLang Serving](#sglang-serving).\n\n### Gradio App\n\nStart the Gradio demo with:\n\n```bash\npython app.py\n```\n\nThe server address and port can be overridden via the\n`MOSS_MUSIC_SERVER_NAME` and `MOSS_MUSIC_SERVER_PORT` environment variables,\nand the default model ID via `MOSS_MUSIC_MODEL_ID`.\n\n## More Information\n\n- **MOSI.AI**: [https:\u002F\u002Fmosi.cn](https:\u002F\u002Fmosi.cn)\n- **OpenMOSS**: [https:\u002F\u002Fwww.open-moss.com](https:\u002F\u002Fwww.open-moss.com)\n- **MOSS-Audio (backbone)**: [https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-Audio](https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-Audio)\n- **MOSS-Music Data Pipeline**: [https:\u002F\u002Fgithub.com\u002Fwx9songs\u002FMOSS-Music-Data-Pipeline](https:\u002F\u002Fgithub.com\u002Fwx9songs\u002FMOSS-Music-Data-Pipeline)\n\n## LICENSE\n\nModels in MOSS-Music are licensed under the Apache 
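License 2.0.\n\n## Citation\n\n```bibtex\n@misc{mossmusic2026,\n      title={MOSS-Music Technical Report},\n      author={OpenMOSS Team},\n      year={2026},\n      howpublished={\\url{https:\u002F\u002Fgithub.com\u002FOpenMOSS\u002FMOSS-Music}},\n      note={GitHub repository}\n}\n```\n","MOSS-Music is an open-source music understanding model targeting musical captioning, lyrics ASR, structural analysis, chord\u002Fkey\u002Ftempo reasoning, and long-form musical question answering. It is built on the same audio backbone as MOSS-Audio and is further specialised on music through dedicated continual pre-training and supervised fine-tuning. It provides two 8B-parameter models: MOSS-Music-8B-Instruct, optimised for direct instruction following on musical prompts, and MOSS-Music-8B-Thinking, which offers stronger chain-of-thought reasoning for musical analysis. MOSS-Music is particularly suited to scenarios that require in-depth analysis of audio content or the building of intelligent music applications.","2026-06-11 03:58:36","CREATED_QUERY"]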