[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-71027":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":8,"htmlUrl":8,"language":9,"languages":8,"totalLinesOfCode":8,"stars":10,"forks":11,"watchers":12,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":14,"stars30d":15,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":16,"rankGlobal":8,"rankLanguage":8,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":18,"topics":20,"createdAt":8,"pushedAt":8,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":14,"starSnapshotCount":14,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},71027,"ml-ferret","apple\u002Fml-ferret","apple",null,"Python",8679,519,160,7,0,1,39.15,"Other",false,"main",[],"2026-06-12 02:02:46","# \u003Cimg src=\"figs\u002Fferret_icon.png\" alt=\"Ferret icon\" width=\"40\" height=\"45\"> Ferret: Refer and Ground Anything Anywhere at Any Granularity\n\n*An End-to-End MLLM that Accepts Any-Form Referring and Grounds Anything in Response.* [[Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.07704)]\n\n[Haoxuan You*](https:\u002F\u002Fhxyou.github.io\u002F), [Haotian Zhang*](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=1vz0kKUAAAAJ&hl=en\u002F), [Zhe Gan](https:\u002F\u002Fzhegan27.github.io\u002F), [Xianzhi Du](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=l1hP40AAAAAJ&hl=en), [Bowen Zhang](https:\u002F\u002Fzbwglory.github.io\u002F), [Zirui Wang](https:\u002F\u002Fwww.cs.cmu.edu\u002F~ziruiw\u002F), [Liangliang Cao](http:\u002F\u002Fllcao.net\u002F), [Shih-Fu Chang](https:\u002F\u002Fwww.ee.columbia.edu\u002F~sfchang\u002F), [Yinfei 
Yang](https:\u002F\u002Fsites.google.com\u002Fsite\u002Fyinfeiyang\u002F) \n[*: equal contribution]\n\n\n## Release\n- [10\u002F08\u002F2024] 🔥 We release [Ferret-UI](ferretui\u002F), the first UI-centric MLLM capable of effectively executing **referring, grounding, and reasoning** tasks.\n- [07\u002F10\u002F2024] 🔥 [Ferret-v2](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.07973) is accepted to COLM 2024. \n- [02\u002F15\u002F2024] 🔥 Ferret is accepted to ICLR 2024 as a [Spotlight](https:\u002F\u002Ficlr.cc\u002Fvirtual\u002F2024\u002Fposter\u002F19537)!!! \n- [12\u002F14\u002F2023] 🔥 We release the Ferret [checkpoints (7B, 13B)](#checkpoints).\n- [10\u002F30\u002F2023] 🔥 We release the code of the **FERRET** model and [Ferret-Bench](ferret\u002Feval\u002Fferret_gpt4_data).\n\n## Overview\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"figs\u002Fferret_fig_diagram_v2.png\" width=\"100%\"> \u003Cbr>\n    Diagram of Ferret Model.\n\u003C\u002Fp>\n\nKey Contributions:\n* Ferret Model - **Hybrid Region Representation + Spatial-aware Visual Sampler** enable fine-grained and open-vocabulary referring and grounding in MLLMs.\n* GRIT Dataset (~1.1M) - A **Large-scale, Hierarchical, Robust** ground-and-refer instruction tuning dataset.\n* Ferret-Bench - A multimodal evaluation benchmark that jointly requires **Referring\u002FGrounding, Semantics, Knowledge, and Reasoning**.\n\n\n**Usage and License Notices**: The data and code are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes. \n\n## Contents\n- [Install](#install)\n- [Train](#train)\n- [Evaluation](#evaluation)\n- [Demo](#demo)\n\n## Install\n\n1. 
Clone this repository and navigate to the `ml-ferret` folder\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fapple\u002Fml-ferret\ncd ml-ferret\n```\n\n2. Install the package\n```Shell\nconda create -n ferret python=3.10 -y\nconda activate ferret\npip install --upgrade pip  # enable PEP 660 support\npip install -e .\npip install pycocotools\npip install protobuf==3.20.0\n```\n\n3. Install additional packages for training\n```Shell\npip install ninja\npip install flash-attn --no-build-isolation\n```\n\n\n## Train\n\nFERRET is trained on 8 A100 GPUs with 80GB memory. To train on fewer GPUs, you can reduce the `per_device_train_batch_size` and increase the `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `per_device_train_batch_size` x `gradient_accumulation_steps` x `num_gpus`.\n\n### Hyperparameters\nWe use a set of hyperparameters similar to LLaVA (Vicuna) for finetuning.  \n\n| Model | Global Batch Size | Learning rate | Epochs | Max length | Weight decay |\n| --- | ---: | ---: | ---: | ---: | ---: |\n| FERRET-7B | 128 | 2e-5 | 3 | 2048 | 0 |\n| FERRET-13B | 128 | 2e-5 | 3 | 2048 | 0 |\n\n### Prepare Vicuna checkpoint and LLaVA's projector\n\nBefore you start, prepare our base model Vicuna, which is an instruction-tuned chatbot. Please download its weights following the instructions [here](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat#model-weights). 
Vicuna v1.3 is used in FERRET.\n\nThen download LLaVA's first-stage pre-trained projector weights ([7B](https:\u002F\u002Fhuggingface.co\u002Fliuhaotian\u002Fllava-336px-pretrain-vicuna-7b-v1.3), [13B](https:\u002F\u002Fhuggingface.co\u002Fliuhaotian\u002Fllava-336px-pretrain-vicuna-13b-v1.3)).\n\n\n### FERRET Training\n\nTraining scripts are provided ([7B](experiments\u002Fferret_7b_train.sh), [13B](experiments\u002Fferret_13b_train.sh)).\n\n\n## Evaluation\n\nPlease see this [doc](EVAL.md) for details.\n\n## Checkpoints\nWe extracted the `delta` between our pre-trained model and Vicuna. First, download the Vicuna weights following the [previous instruction](#prepare-vicuna-checkpoint-and-llavas-projector). Then download our prepared weight offsets ([7B](https:\u002F\u002Fdocs-assets.developer.apple.com\u002Fml-research\u002Fmodels\u002Fferret\u002Fferret-7b\u002Fferret-7b-delta.zip), [13B](https:\u002F\u002Fdocs-assets.developer.apple.com\u002Fml-research\u002Fmodels\u002Fferret\u002Fferret-13b\u002Fferret-13b-delta.zip)) using `wget` or `curl`, and unzip them. Lastly, apply the offsets to the Vicuna weights by running the following script:\n```Shell\n# 7B\npython3 -m ferret.model.apply_delta \\\n    --base .\u002Fmodel\u002Fvicuna-7b-v1-3 \\\n    --target .\u002Fmodel\u002Fferret-7b-v1-3 \\\n    --delta path\u002Fto\u002Fferret-7b-delta\n# 13B\npython3 -m ferret.model.apply_delta \\\n    --base .\u002Fmodel\u002Fvicuna-13b-v1-3 \\\n    --target .\u002Fmodel\u002Fferret-13b-v1-3 \\\n    --delta path\u002Fto\u002Fferret-13b-delta\n```\n\n**Notices**: Apple's rights in the attached weight differentials are hereby licensed under the CC-BY-NC license. 
Apple makes no representations with regards to LLaMa or any other third party software, which are subject to their own terms.\n\nSee the next section for how to set up a local demo with the pre-trained weights.\n\n## Demo\n\nTo run our demo, you need FERRET checkpoints available locally, either trained yourself or built from the released deltas. The demo uses a Gradio web UI. Please run the following commands one by one. \n\n#### Launch a controller\n```Shell\npython -m ferret.serve.controller --host 0.0.0.0 --port 10000\n```\n\n#### Launch a Gradio web server\n```Shell\npython -m ferret.serve.gradio_web_server --controller http:\u002F\u002Flocalhost:10000 --model-list-mode reload --add_region_feature\n```\n\n#### Launch a model worker\n\nThis worker loads the checkpoint and runs inference on the GPU.  Each worker is responsible for a single model specified in `--model-path`.\n\n```Shell\nCUDA_VISIBLE_DEVICES=0 python -m ferret.serve.model_worker --host 0.0.0.0 --controller http:\u002F\u002Flocalhost:10000 --port 40000 --worker http:\u002F\u002Flocalhost:40000 --model-path .\u002Fcheckpoints\u002FFERRET-13B-v0 --add_region_feature\n```\nWait until the process finishes loading the model and you see \"Uvicorn running on ...\".  
Now, refresh your Gradio web UI, and you will see the model you just launched in the model list.\n\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"figs\u002Fferret_demo.png\" width=\"105%\"> \u003Cbr>\n    Example of Ferret Interactive Demo.\n\u003C\u002Fp>\n\n\n## Citation\n\nIf you find Ferret useful, please cite using this BibTeX:\n\n```bibtex\n@article{you2023ferret,\n  title={Ferret: Refer and Ground Anything Anywhere at Any Granularity},\n  author={You, Haoxuan and Zhang, Haotian and Gan, Zhe and Du, Xianzhi and Zhang, Bowen and Wang, Zirui and Cao, Liangliang and Chang, Shih-Fu and Yang, Yinfei},\n  journal={arXiv preprint arXiv:2310.07704},\n  year={2023}\n}\n```\n\n## Acknowledgement\n\n- [LLaVA](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA): the codebase we built upon. \n- [Vicuna](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat): the LLM codebase.\n","Ferret is an end-to-end multimodal large language model that accepts any form of referring and grounds objects at any granularity in its response. Its core components are a hybrid region representation and a spatial-aware visual sampler, which support fine-grained, open-vocabulary referring and grounding. The project also provides GRIT, a large-scale, hierarchical, and robust dataset, and Ferret-Bench, a multimodal evaluation benchmark. Ferret suits tasks that combine vision and language processing, such as image understanding and interactive question answering. The project is intended for research use; the code and data are restricted to non-commercial use.",2,"2026-06-11 03:35:31","high_star"]