[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71913":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":9,"rankLanguage":9,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":9,"pushedAt":9,"updatedAt":26,"readmeContent":27,"aiSummary":28,"trendingCount":15,"starSnapshotCount":15,"syncStatus":29,"lastSyncTime":30,"discoverSource":31},71913,"csm","SesameAILabs\u002Fcsm","SesameAILabs","A Conversational Speech Generation Model",null,"Python",14659,1485,739,9,0,1,13,37,3,81.72,"Apache License 2.0",false,"main",true,[],"2026-06-12 04:01:02","# CSM\n\n**2025\u002F05\u002F20** - CSM is available natively in [Hugging Face Transformers](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fen\u002Fmodel_doc\u002Fcsm) 🤗 as of version `4.52.1`. More information is available [in our model repo](https:\u002F\u002Fhuggingface.co\u002Fsesame\u002Fcsm-1b).\n\n**2025\u002F03\u002F13** - We are releasing the 1B CSM variant. The checkpoint is [hosted on Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fsesame\u002Fcsm_1b).\n\n---\n\nCSM (Conversational Speech Model) is a speech generation model from [Sesame](https:\u002F\u002Fwww.sesame.com) that generates RVQ audio codes from text and audio inputs. 
The model architecture employs a [Llama](https:\u002F\u002Fwww.llama.com\u002F) backbone and a smaller audio decoder that produces [Mimi](https:\u002F\u002Fhuggingface.co\u002Fkyutai\u002Fmimi) audio codes.\n\nA fine-tuned variant of CSM powers the [interactive voice demo](https:\u002F\u002Fwww.sesame.com\u002Fvoicedemo) shown in our [blog post](https:\u002F\u002Fwww.sesame.com\u002Fresearch\u002Fcrossing_the_uncanny_valley_of_voice).\n\nA hosted [Hugging Face space](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fsesame\u002Fcsm-1b) is also available for testing audio generation.\n\n## Requirements\n\n* A CUDA-compatible GPU\n* The code has been tested on CUDA 12.4 and 12.6, but it may also work on other versions\n* Similarly, Python 3.10 is recommended, but newer versions may be fine\n* For some audio operations, `ffmpeg` may be required\n* Access to the following Hugging Face models:\n  * [Llama-3.2-1B](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FLlama-3.2-1B)\n  * [CSM-1B](https:\u002F\u002Fhuggingface.co\u002Fsesame\u002Fcsm-1b)\n\n### Setup\n\n```bash\ngit clone git@github.com:SesameAILabs\u002Fcsm.git\ncd csm\npython3.10 -m venv .venv\nsource .venv\u002Fbin\u002Factivate\npip install -r requirements.txt\n\n# Disable lazy compilation in Mimi\nexport NO_TORCH_COMPILE=1\n\n# You will need access to CSM-1B and Llama-3.2-1B\nhuggingface-cli login\n```\n\n### Windows Setup\n\nThe `triton` package cannot be installed in Windows. 
Instead, use `pip install triton-windows`.\n\n## Quickstart\n\nThis script generates a conversation between two characters, using a prompt for each character.\n\n```bash\npython run_csm.py\n```\n\n## Usage\n\nIf you want to write your own applications with CSM, the following examples show basic usage.\n\n#### Generate a sentence\n\nThis will use a random speaker identity, as no prompt or context is provided.\n\n```python\nfrom generator import load_csm_1b\nimport torchaudio\nimport torch\n\nif torch.backends.mps.is_available():\n    device = \"mps\"\nelif torch.cuda.is_available():\n    device = \"cuda\"\nelse:\n    device = \"cpu\"\n\ngenerator = load_csm_1b(device=device)\n\naudio = generator.generate(\n    text=\"Hello from Sesame.\",\n    speaker=0,\n    context=[],\n    max_audio_length_ms=10_000,\n)\n\ntorchaudio.save(\"audio.wav\", audio.unsqueeze(0).cpu(), generator.sample_rate)\n```\n\n#### Generate with context\n\nCSM sounds best when provided with context. You can prompt or provide context to the model using a `Segment` for each speaker's utterance.\n\nNOTE: The following example is instructional and the audio files do not exist. 
It is intended as an example for using context with CSM.\n\n```python\nimport torchaudio\nfrom generator import Segment\n\n# Reuses the `generator` created with load_csm_1b in the previous example.\n\nspeakers = [0, 1, 0, 0]\ntranscripts = [\n    \"Hey how are you doing.\",\n    \"Pretty good, pretty good.\",\n    \"I'm great.\",\n    \"So happy to be speaking to you.\",\n]\naudio_paths = [\n    \"utterance_0.wav\",\n    \"utterance_1.wav\",\n    \"utterance_2.wav\",\n    \"utterance_3.wav\",\n]\n\ndef load_audio(audio_path):\n    audio_tensor, sample_rate = torchaudio.load(audio_path)\n    audio_tensor = torchaudio.functional.resample(\n        audio_tensor.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate\n    )\n    return audio_tensor\n\nsegments = [\n    Segment(text=transcript, speaker=speaker, audio=load_audio(audio_path))\n    for transcript, speaker, audio_path in zip(transcripts, speakers, audio_paths)\n]\naudio = generator.generate(\n    text=\"Me too, this is some cool stuff huh?\",\n    speaker=1,\n    context=segments,\n    max_audio_length_ms=10_000,\n)\n\ntorchaudio.save(\"audio.wav\", audio.unsqueeze(0).cpu(), generator.sample_rate)\n```\n\n## FAQ\n\n**Does this model come with any voices?**\n\nThe model open-sourced here is a base generation model. It is capable of producing a variety of voices, but it has not been fine-tuned on any specific voice.\n\n**Can I converse with the model?**\n\nCSM is trained to be an audio generation model and not a general-purpose multimodal LLM. It cannot generate text. We suggest using a separate LLM for text generation.\n\n**Does it support other languages?**\n\nThe model has some capacity for non-English languages due to data contamination in the training data, but it likely won't do well.\n\n## Misuse and abuse ⚠️\n\nThis project provides a high-quality speech generation model for research and educational purposes. 
While we encourage responsible and ethical use, we **explicitly prohibit** the following:\n\n- **Impersonation or Fraud**: Do not use this model to generate speech that mimics real individuals without their explicit consent.\n- **Misinformation or Deception**: Do not use this model to create deceptive or misleading content, such as fake news or fraudulent calls.\n- **Illegal or Harmful Activities**: Do not use this model for any illegal, harmful, or malicious purposes.\n\nBy using this model, you agree to comply with all applicable laws and ethical guidelines. We are **not responsible** for any misuse, and we strongly condemn unethical applications of this technology.\n\n---\n\n## Authors\nJohan Schalkwyk, Ankit Kumar, Dan Lyth, Sefik Emre Eskimez, Zack Hodari, Cinjon Resnick, Ramon Sanabria, Raven Jiang, and the Sesame team.\n","CSM (Conversational Speech Model) is a speech generation model developed by Sesame that generates RVQ audio codes from text and audio inputs. The model is built on a Llama backbone and uses a small audio decoder to produce Mimi audio codes, giving it strong conversational speech synthesis capabilities. CSM suits applications that need high-quality speech synthesis, such as virtual assistants, online education, or entertainment content creation, especially where a natural, fluent conversational experience is the goal. In addition, with support from the Hugging Face platform, users can easily access and test the CSM-1B version, which lowers the technical barrier and lets more developers build on this tool.",2,"2026-06-11 03:39:24","high_star"]