[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72484":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":16,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":21,"hasPages":21,"topics":23,"createdAt":10,"pushedAt":10,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":16,"starSnapshotCount":16,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},72484,"Emu3","baaivision\u002FEmu3","baaivision","Next-Token Prediction is All You Need","https:\u002F\u002Femu.baai.ac.cn",null,"Python",2417,99,27,66,0,5,15,62,"Apache License 2.0",false,"main",[],"2026-06-12 04:01:06","\u003Cdiv align='center'>\n\u003Ch1>Emu3: Next-Token Prediction is All You Need\u003C\u002Fh1>\n\u003Ch3>\u003C\u002Fh3>\n\n[Emu3 Team, BAAI](https:\u002F\u002Fwww.baai.ac.cn\u002Fenglish.html)\n\n| [Project Page](https:\u002F\u002Femu.baai.ac.cn) | [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.18869) | [🤗HF Models](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FBAAI\u002Femu3-66f4e64f70850ff358a2e60f) | [Modelscope](https:\u002F\u002Fmodelscope.cn\u002Fcollections\u002FEmu3-9eacc8668b1043) | [Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FBAAI\u002FEmu3) |\n\n\n\u003C\u002Fdiv>\n\n\u003Cdiv align='center'>\n\u003Cimg src=\".\u002Fassets\u002Farch.png\" class=\"interpolation-image\" alt=\"arch.\" height=\"80%\" width=\"70%\" \u002F>\n\u003C\u002Fdiv>\n\nWe introduce **Emu3**, a new suite of state-of-the-art multimodal models trained solely with **\u003Ci>next-token prediction\u003C\u002Fi>**! 
By tokenizing images, text, and videos into a discrete space, we train a single transformer from scratch on a mixture of multimodal sequences.\n\n### Emu3 excels in both generation and perception\n**Emu3** outperforms several well-established task-specific models in both generation and perception tasks, surpassing flagship open models such as SDXL, LLaVA-1.6 and OpenSora-1.2, while eliminating the need for diffusion or compositional architectures.\n\n\u003Cdiv align='center'>\n\u003Cimg src=\".\u002Fassets\u002Fcomparison.png\" class=\"interpolation-image\" alt=\"comparison.\" height=\"80%\" width=\"80%\" \u002F>\n\u003C\u002Fdiv>\n\n### Highlights\n- **Emu3** generates high-quality images from text input simply by predicting the next vision token. The model naturally supports flexible resolutions and styles.\n- **Emu3** shows strong vision-language understanding: it can see the physical world and provide coherent text responses. Notably, this capability is achieved without depending on CLIP or a pretrained LLM.\n- **Emu3** generates a video causally by predicting the next token in a video sequence, unlike video diffusion models such as Sora. With a video in context, Emu3 can also naturally extend the video and predict what will happen next.\n\n## News\n- [2025.08] **[Emu3-Chat](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FEmu3-Chat)** with the Transformers backend is now supported by [vLLM](https:\u002F\u002Fdocs.vllm.ai\u002Fen\u002Flatest\u002Fmodels\u002Fsupported_models\u002F#text-generation_1) as Emu3ForConditionalGeneration.\n- [2024.10] We release the image pretrained model **[Emu3-Stage1](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FEmu3-Stage1)** and the SFT scripts. The model supports image captioning and can generate images at a resolution of 512x512. You can use our training scripts for further instruction tuning on more image generation and perception tasks. 
🔥🔥🔥\n- [2024.09] We release **[Emu3-Chat](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FEmu3-Chat)** and **[Emu3-Gen](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FEmu3-Gen)**, post-trained models for vision-language understanding and vision generation, respectively.\n- [2024.09] We introduce Emu3, a new suite of state-of-the-art multimodal models trained solely with next-token prediction.\n\n\n### TODO\n\n- [X] Release model weights of tokenizer, Emu3-Chat and Emu3-Gen\n- [X] Release the inference code.\n- [ ] Release the evaluation code.\n- [X] Release training scripts for SFT.\n- [ ] Release training scripts for pretraining and DPO.\n\n\n### Setup\n\nClone this repository and install required packages:\n\n```shell\ngit clone https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEmu3\ncd Emu3\n\npip install -r requirements.txt\n```\n\n### Model Weights\n\n| Model name               | HF Weight                                                      | Modelscope                                                                | Wisemodel                                                               |\n| ------------------------ | -------------------------------------------------------------- | ------------------------------------------------------------------------- | ----------------------------------------------------------------------- |\n| **Emu3-Stage1**          | [🤗 HF link](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FEmu3-Stage1)          | [Modelscope link](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FBAAI\u002FEmu3-Stage1)          |  |\n| **Emu3-Chat**            | [🤗 HF link](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FEmu3-Chat)            | [Modelscope link](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FBAAI\u002FEmu3-Chat)            | [Wisemodel link](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FBAAI\u002FEmu3-Chat)            |\n| **Emu3-Gen**             | [🤗 HF link](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FEmu3-Gen)       
      | [Modelscope link](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FBAAI\u002FEmu3-Gen)             | [Wisemodel link](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FBAAI\u002FEmu3-Gen)             |\n| **Emu3-VisionTokenizer** | [🤗 HF link](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FEmu3-VisionTokenizer) | [Modelscope link](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FBAAI\u002FEmu3-VisionTokenizer) | [Wisemodel link](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FBAAI\u002FEmu3-VisionTokenizer) |\n\n### Quickstart\n\n#### Use 🤗Transformers to run Emu3-Gen\u002FStage1 for image generation\n```python\nfrom PIL import Image\nfrom transformers import AutoTokenizer, AutoModel, AutoImageProcessor, AutoModelForCausalLM\nfrom transformers.generation.configuration_utils import GenerationConfig\nfrom transformers.generation import LogitsProcessorList, PrefixConstrainedLogitsProcessor, UnbatchedClassifierFreeGuidanceLogitsProcessor\nimport torch\n\nfrom emu3.mllm.processing_emu3 import Emu3Processor\n\n\n# model path\nEMU_HUB = \"BAAI\u002FEmu3-Gen\"\nVQ_HUB = \"BAAI\u002FEmu3-VisionTokenizer\"\n\n# prepare model and processor\nmodel = AutoModelForCausalLM.from_pretrained(\n    EMU_HUB,\n    device_map=\"cuda:0\",\n    torch_dtype=torch.bfloat16,\n    attn_implementation=\"flash_attention_2\",\n    trust_remote_code=True,\n)\n\ntokenizer = AutoTokenizer.from_pretrained(EMU_HUB, trust_remote_code=True, padding_side=\"left\")\nimage_processor = AutoImageProcessor.from_pretrained(VQ_HUB, trust_remote_code=True)\nimage_tokenizer = AutoModel.from_pretrained(VQ_HUB, device_map=\"cuda:0\", trust_remote_code=True).eval()\nprocessor = Emu3Processor(image_processor, image_tokenizer, tokenizer)\n\n# prepare input\nPOSITIVE_PROMPT = \" masterpiece, film grained, best quality.\"\nNEGATIVE_PROMPT = \"lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, 
watermark, username, blurry.\"\n\nclassifier_free_guidance = 3.0\nprompt = \"a portrait of young girl.\"\nprompt += POSITIVE_PROMPT\n\nkwargs = dict(\n    mode='G',\n    ratio=\"1:1\",\n    image_area=model.config.image_area,\n    return_tensors=\"pt\",\n    padding=\"longest\",\n)\npos_inputs = processor(text=prompt, **kwargs)\nneg_inputs = processor(text=NEGATIVE_PROMPT, **kwargs)\n\n# prepare hyper parameters\nGENERATION_CONFIG = GenerationConfig(\n    use_cache=True,\n    eos_token_id=model.config.eos_token_id,\n    pad_token_id=model.config.pad_token_id,\n    max_new_tokens=40960,\n    do_sample=True,\n    top_k=2048,\n)\n\nh = pos_inputs.image_size[:, 0]\nw = pos_inputs.image_size[:, 1]\nconstrained_fn = processor.build_prefix_constrained_fn(h, w)\nlogits_processor = LogitsProcessorList([\n    UnbatchedClassifierFreeGuidanceLogitsProcessor(\n        classifier_free_guidance,\n        model,\n        unconditional_ids=neg_inputs.input_ids.to(\"cuda:0\"),\n    ),\n    PrefixConstrainedLogitsProcessor(\n        constrained_fn ,\n        num_beams=1,\n    ),\n])\n\n# generate\noutputs = model.generate(\n    pos_inputs.input_ids.to(\"cuda:0\"),\n    GENERATION_CONFIG,\n    logits_processor=logits_processor,\n    attention_mask=pos_inputs.attention_mask.to(\"cuda:0\"),\n)\n\nmm_list = processor.decode(outputs[0])\nfor idx, im in enumerate(mm_list):\n    if not isinstance(im, Image.Image):\n        continue\n    im.save(f\"result_{idx}.png\")\n```\n\n#### Use 🤗Transformers to run Emu3-Chat\u002FStage1 for vision-language understanding\n\n```python\nfrom PIL import Image\nfrom transformers import AutoTokenizer, AutoModel, AutoImageProcessor, AutoModelForCausalLM\nfrom transformers.generation.configuration_utils import GenerationConfig\nimport torch\n\nfrom emu3.mllm.processing_emu3 import Emu3Processor\n\n\n# model path\nEMU_HUB = \"BAAI\u002FEmu3-Chat\"\nVQ_HUB = \"BAAI\u002FEmu3-VisionTokenizer\"\n\n# prepare model and processor\nmodel = 
AutoModelForCausalLM.from_pretrained(\n    EMU_HUB,\n    device_map=\"cuda:0\",\n    torch_dtype=torch.bfloat16,\n    attn_implementation=\"flash_attention_2\",\n    trust_remote_code=True,\n)\n\n# used for Emu3-Chat\ntokenizer = AutoTokenizer.from_pretrained(EMU_HUB, trust_remote_code=True, padding_side=\"left\")\n# used for Emu3-Stage1\n# tokenizer = AutoTokenizer.from_pretrained(\n#     EMU_HUB,\n#     trust_remote_code=True,\n#     chat_template=\"{image_prompt}{text_prompt}\",\n#     padding_side=\"left\",\n# )\nimage_processor = AutoImageProcessor.from_pretrained(VQ_HUB, trust_remote_code=True)\nimage_tokenizer = AutoModel.from_pretrained(VQ_HUB, device_map=\"cuda:0\", trust_remote_code=True).eval()\nprocessor = Emu3Processor(image_processor, image_tokenizer, tokenizer)\n\n# prepare input\ntext = \"Please describe the image\"\nimage = Image.open(\"assets\u002Fdemo.png\")\n\ninputs = processor(\n    text=text,\n    image=image,\n    mode='U',\n    return_tensors=\"pt\",\n    padding=\"longest\",\n)\n\n# prepare hyper parameters\nGENERATION_CONFIG = GenerationConfig(\n    pad_token_id=tokenizer.pad_token_id,\n    bos_token_id=tokenizer.bos_token_id,\n    eos_token_id=tokenizer.eos_token_id,\n    max_new_tokens=1024,\n)\n\n# generate\noutputs = model.generate(\n    inputs.input_ids.to(\"cuda:0\"),\n    GENERATION_CONFIG,\n    attention_mask=inputs.attention_mask.to(\"cuda:0\"),\n)\n\noutputs = outputs[:, inputs.input_ids.shape[-1]:]\nprint(processor.batch_decode(outputs, skip_special_tokens=True)[0])\n```\n\n#### Use 🤗Transformers to run Emu3-VisionTokenizer for vision encoding and decoding\n```python\nimport os\nimport os.path as osp\n\nfrom PIL import Image\nimport torch\nfrom transformers import AutoModel, AutoImageProcessor\n\nMODEL_HUB = \"BAAI\u002FEmu3-VisionTokenizer\"\n\nmodel = AutoModel.from_pretrained(MODEL_HUB, trust_remote_code=True).eval().cuda()\nprocessor = AutoImageProcessor.from_pretrained(MODEL_HUB, trust_remote_code=True)\n\n# TODO: you need 
to modify the path here\nVIDEO_FRAMES_PATH = \"YOUR_VIDEO_FRAMES_PATH\"\n\nvideo = os.listdir(VIDEO_FRAMES_PATH)\nvideo.sort()\nvideo = [Image.open(osp.join(VIDEO_FRAMES_PATH, v)) for v in video]\n\nimages = processor(video, return_tensors=\"pt\")[\"pixel_values\"]\nimages = images.unsqueeze(0).cuda()\n\n# image autoencode\nimage = images[:, 0]\nprint(image.shape)\nwith torch.no_grad():\n    # encode\n    codes = model.encode(image)\n    # decode\n    recon = model.decode(codes)\n\nrecon = recon.view(-1, *recon.shape[2:])\nrecon_image = processor.postprocess(recon)[\"pixel_values\"][0]\nrecon_image.save(\"recon_image.png\")\n\n# video autoencode\nimages = images.view(\n    -1,\n    model.config.temporal_downsample_factor,\n    *images.shape[2:],\n)\n\nprint(images.shape)\nwith torch.no_grad():\n    # encode\n    codes = model.encode(images)\n    # decode\n    recon = model.decode(codes)\n\nrecon = recon.view(-1, *recon.shape[2:])\nrecon_images = processor.postprocess(recon)[\"pixel_values\"]\nfor idx, im in enumerate(recon_images):\n    im.save(f\"recon_video_{idx}.png\")\n```\n\n## Acknowledgement\n\nWe thank the great work from [Emu Series](https:\u002F\u002Fgithub.com\u002Fbaaivision\u002FEmu), [Qwen2-VL](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2-VL) and [MoVQGAN](https:\u002F\u002Fgithub.com\u002Fai-forever\u002FMoVQGAN).\n\nThis work is supported by the New Generation Artificial Intelligence National Science and Technology Major Project (No. 2022ZD0116314).\n\n
## Citation\n\nIf you find Emu3 useful for your research and applications, please consider starring this repository and citing:\n\n```\n@article{wang2024emu3,\n  title={Emu3: Next-Token Prediction is All You Need},\n  author={Wang, Xinlong and Zhang, Xiaosong and Luo, Zhengxiong and Sun, Quan and Cui, Yufeng and Wang, Jinsheng and Zhang, Fan and Wang, Yueze and Li, Zhen and Yu, Qiying and others},\n  journal={arXiv preprint arXiv:2409.18869},\n  year={2024}\n}\n```\n\n\n","Emu3 is a suite of multimodal models trained with next-token prediction that can process image, text, and video data. Its core idea is to convert data from different modalities into tokens in a discrete space and train a single Transformer from scratch, which yields strong results in both generation and perception tasks. Technical features include high-quality image generation, strong vision-language understanding, and video sequence prediction, all achieved without relying on complex diffusion or compositional architectures. Emu3 suits applications that need efficient cross-modal processing, such as multimedia content creation, intelligent dialogue systems, and video prediction and analysis.",2,"2026-06-11 03:42:16","high_star"]