[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2768":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":16,"stars30d":17,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":15,"starSnapshotCount":15,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},2768,"GenLIP","YanFangCS\u002FGenLIP","YanFangCS","Official repo for \"Let ViT Speak: Generative Language-Image Pre-training\"",null,"Python",120,4,51,7,0,3,32,46.8,"Apache License 2.0",false,"main",true,[],"2026-06-12 04:00:15","\u003Cdiv align=\"center\">\n\n\u003Ch1>Let ViT Speak: Generative Language-Image Pre-training\u003C\u002Fh1>\n\n\u003Cb>Yan Fang\u003C\u002Fb>\u003Csup>1,2,&#42;\u003C\u002Fsup> · \u003Cb>\u003Ca href=\"https:\u002F\u002Fmc-lan.github.io\">Mengcheng Lan\u003C\u002Fa>\u003C\u002Fb>\u003Csup>2,3,&#42;\u003C\u002Fsup> · \u003Cb>\u003Ca href=\"https:\u002F\u002Fspeedinghzl.github.io\">Zilong Huang\u003C\u002Fa>\u003C\u002Fb>\u003Csup>2,&dagger;\u003C\u002Fsup> · \u003Cb>Weixian Lei\u003C\u002Fb>\u003Csup>2\u003C\u002Fsup> · \u003Cb>\u003Ca href=\"https:\u002F\u002Fyunqing-me.github.io\">Yunqing Zhao\u003C\u002Fa>\u003C\u002Fb>\u003Csup>2\u003C\u002Fsup> · \u003Cb>\u003Ca href=\"https:\u002F\u002Fy-zhong.info\">Yujie Zhong\u003C\u002Fa>\u003C\u002Fb>\u003Csup>2\u003C\u002Fsup> · \u003Cb>\u003Ca href=\"https:\u002F\u002Fyingchen001.github.io\">Yingchen Yu\u003C\u002Fa>\u003C\u002Fb>\u003Csup>2\u003C\u002Fsup> · \u003Cb>\u003Ca href=\"https:\u002F\u002Fqi-she.net\u002F\">Qi She\u003C\u002Fa>\u003C\u002Fb>\u003Csup>2\u003C\u002Fsup> · 
\u003Cb>Yao Zhao\u003C\u002Fb>\u003Csup>1\u003C\u002Fsup> · \u003Cb>\u003Ca href=\"https:\u002F\u002Fweiyc.github.io\">Yunchao Wei\u003C\u002Fa>\u003C\u002Fb>\u003Csup>1,&dagger;\u003C\u002Fsup>\n\nBeijing Jiaotong University\u003Csup>1\u003C\u002Fsup> & ByteDance\u003Csup>2\u003C\u002Fsup> & Nanyang Technological University\u003Csup>3\u003C\u002Fsup>\n\n\u003Ca href=\"https:\u002F\u002Fyanfangcs.github.io\u002Fvitspeak\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGithub-Page-blue\" alt=\"Home Page\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.00809\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-Arxiv-red\" alt=\"Paper Arxiv\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FYanFang\u002Fgenlip\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FModel-HuggingFace-orange\" alt=\"Model HuggingFace\">\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n**TL;DR:** **GenLIP -- lets ViT speak.** We show that a strong MLLM vision encoder can be pretrained with just **one Transformer** and **one autoregressive language modeling objective** -- no contrastive loss, no dual-tower architecture, and no extra text decoder. Despite its simplicity, GenLIP scales effectively and performs well as a vision encoder in MLLMs, with particularly strong gains on Doc & OCR tasks.\n\n\u003Cdiv align='center'>\n  \u003Cimg src=\"assets\u002Fteaser.png\" alt=\"teaser\" style=\"height: 200px; width: auto;\">\n\u003C\u002Fdiv>\n\n---\n\n## Table of Contents\n- [News](#news)\n- [Getting Started](#getting-started)\n  - [Installation](#installation)\n  - [Datasets](#datasets)\n  - [Configuration](#configuration)\n- [Training](#training)\n- [Model Checkpoints](#model-checkpoints)\n- [Acknowledgments](#acknowledgments)\n- [Citation](#citation)\n\n## News\n- 2025-05-03: Code released. 
[✔]\n\n## Getting Started\n\n### Installation\n\n```bash\n# Clone the repository\ngit clone https:\u002F\u002Fgithub.com\u002FYanFangCS\u002FGenLIP\ncd GenLIP\n\n# Install dependencies\npip install -r requirements.txt\npip install -e .   # install veomni from this repo\n```\n\n> **Note:** If you are using PyTorch >= 2.6.0, you need to install ByteCheckpoint manually:\n>\n> ```bash\n> git clone https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FByteCheckpoint.git\n> cd ByteCheckpoint\n> # Modify the torch version assert statement in bytecheckpoint\u002Fcheckpointer\u002Ffsdp_checkpointer.py#L232-L234 to support torch >= 2.6.0\n> # assert \"2.1.0\" \u003C= torch.__version__.strip()\n> pip install -e .\n> ```\n\n### Datasets\n\n#### Data Source\n\nWe use several caption datasets during pretraining:\n\n**Stage 1:**\n- [Recap-DataComp-1B](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FUCSC-VLAA\u002FRecap-DataComp-1B)\n\n**Stage 2:**\n- [Infinity-MM](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBAAI\u002FInfinity-MM) (stage1 subset)\n- [BLIP3o-Pretrain-Long-Caption](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBLIP3o\u002FBLIP3o-Pretrain-Long-Caption)\n\n**Optional for Stage 2:**\n- [CapRL-2M](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Finternlm\u002FCapRL-2M)\n- [PLM-Image-Auto](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ffacebook\u002FPLM-Image-Auto) (caption subset only)\n\nFor Stage 1, training GenLIP with 1B seen samples is sufficient to obtain a strong vision encoder.\nFor Stage 2, training GenLIP with Infinity-MM and BLIP3o-Long-Caption using NaViT is sufficient.\nTraining with the two additional datasets (CapRL and PLM-Image-Auto) does not bring further performance gains, but we list them here as potential alternatives.\n\n#### Data Format\n\nAll datasets need to be downloaded and processed into suitable formats for pretraining. 
Please ensure your preprocessing function can correctly consume your data.\n\nBelow are example data formats:\n\n```python\n# Stage 1 caption data\n# sample keys: ['__key__', '__url__', 'jpg', '__local_path__', 'json']\njson_content = {\n  'caption': 'A modern coffee machine with a digital display and two white coffee cups filled with coffee is shown. The machine has a stainless steel finish and is accompanied by a milk frothing pitcher with a white liquid inside. The coffee machine is placed on a surface with a white background.'\n}\n\n# Stage 2 caption data\n# sample keys: ['__key__', '__url__', 'jpg', '__local_path__', 'json']\njson_content = {\n  'conversation': [\n    {\n      'from': 'user',\n      'value': '\u003Cimage>Describe this image in detail.'\n    },\n    {\n      'from': 'assistant',\n      'value': 'The image depicts a serene waterfront scene with calm, slightly rippled water in the foreground...'\n    }\n  ]\n}\n```\n\nYou can also process the datasets into other formats as needed. To ensure training runs smoothly, check and modify the `process_sample` function implementation to match your data format.\n\n### Configuration\n\nWe provide three model configurations in `configs\u002Fmodel_configs\u002Fgenlip\u002F`:\n- `genlip_l16_224.json`\n- `genlip_so16_224.json`\n- `genlip_g16_224.json`\n\nAlong with corresponding training configurations in `configs\u002Fpretrain\u002Fgenlip\u002F`:\n- `stage1\u002Ftrain_genlip_*_recap.yaml`\n- `stage2\u002Ftrain_genlip_*_navit.yaml`\n\nYou may need to modify `model.config_path` in the YAML config files to point to the correct model configuration.\n\n**Remember to update the dataset paths in the config files before starting training.**\n\n## Training\n\nA training script is provided in `jobs\u002Ftrain.sh`. 
You can start training with:\n\n```bash\nbash jobs\u002Ftrain.sh \u003Cmain_func> \u003Ctrain_config>\n\n# Stage 1 example:\nbash jobs\u002Ftrain.sh tasks\u002Ftrain_genlip_stage1.py configs\u002Fpretrain\u002Fgenlip\u002Fstage1\u002Ftrain_genlip_so16_224_recap.yaml\n\n# Stage 2 example:\nbash jobs\u002Ftrain.sh tasks\u002Ftrain_genlip_navit.py configs\u002Fpretrain\u002Fgenlip\u002Fstage2\u002Ftrain_genlip_so16_navit.yaml\n```\n\n- `\u003Cmain_func>`: the training script to execute (e.g., `tasks\u002Ftrain_genlip_stage1.py` for Stage 1, `tasks\u002Ftrain_genlip_navit.py` for Stage 2).\n- `\u003Ctrain_config>`: the training configuration file to use.\n\nAll you need to do is set the paths and appropriate hyperparameters in the config files, then launch the script and wait for training to complete.\n\nFor **multi-node training**, we also provide `jobs\u002Ftrain_multinode.sh` and `jobs\u002Ftrain_slurm_multinode.sh`. You can modify them to fit your cluster setup and launch distributed training across multiple nodes.\n\n\n## Model Checkpoints\n\nThe pretrained models are available on [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FYanFang\u002Fgenlip).\n\n## Acknowledgments\n\nOur codebase is built upon:\n- [VeOmni](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FVeOmni): A simple and high-performance multi-modal model training framework developed by the ByteDance Seed team.\n\n## License\n\nThis project is licensed under the Apache License 2.0. 
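See the [LICENSE](LICENSE) file for details.\n\n## Citation\n\nIf you find this project helpful, please give us a star and cite our [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.00809):\n\n```bibtex\n@article{fang2026letvitspeakgenerative,\n  title={Let ViT Speak: Generative Language-Image Pre-training}, \n  author={Yan Fang and Mengcheng Lan and Zilong Huang and Weixian Lei and Yunqing Zhao and Yujie Zhong and Yingchen Yu and Qi She and Yao Zhao and Yunchao Wei},\n  journal={arXiv preprint arXiv:2605.00809},\n  year={2026}\n}\n```","GenLIP is a project that pretrains a strong vision encoder for multimodal large language models (MLLMs) using a single Transformer and a single autoregressive language modeling objective. Its simple architecture scales effectively and performs well across a range of vision tasks, with particularly strong results on document and OCR tasks. It requires no contrastive loss, no dual-tower architecture, and no extra text decoder: one Transformer handles the entire pretraining. It suits scenarios that need an efficient, strong vision encoder, such as document processing and image recognition.",2,"2026-06-11 02:51:10","CREATED_QUERY"]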