[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80590":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":12,"contributorsCount":12,"subscribersCount":12,"size":12,"stars1d":12,"stars7d":14,"stars30d":15,"stars90d":12,"forks30d":12,"starsTrendScore":12,"compositeScore":16,"rankGlobal":9,"rankLanguage":9,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":18,"topics":20,"createdAt":9,"pushedAt":9,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":12,"starSnapshotCount":12,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},80590,"vlm-probe-suite","mrvellang\u002Fvlm-probe-suite","mrvellang","FineVLM-Probe: a lightweight harness for fine-grained probing of frozen vision-language models (CLIP \u002F SigLIP \u002F BLIP-2 \u002F LLaVA).",null,"Python",220,0,5,54,167,70,"Other",false,"main",[],"2026-06-12 04:01:29","# FineVLM-Probe: Fine-grained Probing of Vision-Language Models\n\n> Code for the report *FineVLM-Probe: A Lightweight Suite for Probing Fine-grained\n> Visual-Language Alignment in Frozen VLMs.* In submission; pre-print pending.\n\nThis repository contains the evaluation code, probe definitions, and dataset loaders\nused in the report. The goal is **not** to introduce yet another VQA benchmark, but\nto provide a small, hackable harness for asking targeted questions like:\n\n- *Does this VLM actually look at the object, or just the global scene gist?*\n- *How does fine-grained alignment degrade as we scale image resolution down?*\n- *Is the text encoder doing the heavy lifting, or the vision encoder?*\n\nWe support frozen-encoder evaluation across CLIP, SigLIP, BLIP-2, and LLaVA-1.5\nvariants. 
Adding a new model means writing a ~30-line adapter (see `models\u002F`).\n\n---\n\n## Table of contents\n\n- [What is in here](#what-is-in-here)\n- [Probes](#probes)\n- [Supported models](#supported-models)\n- [Datasets](#datasets)\n- [Quick start](#quick-start)\n- [Running a full sweep](#running-a-full-sweep)\n- [Reproducing the report numbers](#reproducing-the-report-numbers)\n- [Adding a new model](#adding-a-new-model)\n- [Adding a new probe](#adding-a-new-probe)\n- [Caveats and known issues](#caveats-and-known-issues)\n- [Citation](#citation)\n- [License](#license)\n\n---\n\n## What is in here\n\n```\nfinevlm_probe\u002F\n  models\u002F            # thin adapters: load model, expose .encode_image, .encode_text, .score\n  probes\u002F            # each probe is a self-contained file producing a metric\n  datasets\u002F          # loaders for COCO subsets, Winoground, ARO, EqBench, our small Cantonese subset\n  runners\u002F           # CLI entry points\n  reporting\u002F         # aggregate JSON outputs into LaTeX\u002Fmarkdown tables\nconfigs\u002F             # YAML configs for each (model, probe, dataset) combination\nscripts\u002F             # download\u002Fprepare helpers; see scripts\u002FREADME.md\ntests\u002F               # unit + smoke tests\n```\n\n## Probes\n\n| Probe ID         | Question                                                            | Metric         |\n|------------------|----------------------------------------------------------------------|----------------|\n| `attr-swap`      | Can the model distinguish \"red cube on blue ball\" from \"blue cube on red ball\"?  
| top-1 accuracy |\n| `count-coarse`   | 1 vs 3 vs 5+ objects of same class                                  | macro-F1       |\n| `spatial-binary` | left-of \u002F right-of \u002F above \u002F below                                  | top-1 accuracy |\n| `resolution-sweep` | re-run any probe at 224 \u002F 336 \u002F 448 \u002F 672                          | curve          |\n| `text-shuffle`   | does shuffling word order destroy the model's score (bag-of-words check)? | delta |\n| `object-occlusion` | partially masked target object still recoverable from caption?    | top-1 accuracy |\n| `cantonese-cap`  | Cantonese captions over the same images: does alignment hold?       | top-1 accuracy |\n\nEach probe is a single Python file under `finevlm_probe\u002Fprobes\u002F`. They share a\n`Probe` protocol (`name`, `prepare()`, `step(sample, model) -> dict`, `summarize()`).\n\n## Supported models\n\nWe test against the following at the time of writing:\n\n| Family   | Variants                                                  |\n|----------|-----------------------------------------------------------|\n| CLIP     | ViT-B\u002F32, ViT-B\u002F16, ViT-L\u002F14, ViT-L\u002F14@336                |\n| SigLIP   | base-patch16-224, large-patch16-384, so400m-patch14-384   |\n| BLIP-2   | flan-t5-xl (frozen Q-Former mode)                         |\n| LLaVA-1.5| 7b, 13b (uses HF transformers `LlavaForConditionalGeneration`)  |\n\nAdding a model means writing an adapter in `models\u002F\u003Cname>.py`; see the\n[adding a new model](#adding-a-new-model) section.\n\n## Datasets\n\nDownloaders live in `scripts\u002F`. 
We do **not** ship the data.\n\n| Dataset                | Used by                                  | Notes                          |\n|------------------------|------------------------------------------|--------------------------------|\n| COCO val2017 (subset)  | attr-swap, count-coarse, spatial-binary  | 2k images; subset list in `configs\u002Fcoco_subset.json` |\n| Winoground             | text-shuffle                             | obtain from HF datasets (gated) |\n| ARO                    | attr-swap (cross-check)                  | https:\u002F\u002Fgithub.com\u002Fmertyg\u002Fvision-language-models-are-bows |\n| EqBench                | count-coarse                             | https:\u002F\u002Fgithub.com\u002FWangt-CN\u002FEqBen |\n| Cantonese-CC (ours)    | cantonese-cap                            | tiny: 300 images, human-written Cantonese captions; see `data\u002FCANTONESE_CC.md` for collection notes |\n\n## Quick start\n\n```bash\n# 1. install\ngit clone https:\u002F\u002Fgithub.com\u002Fmrvellang\u002Fvlm-probe-suite.git\ncd vlm-probe-suite\npip install -e .[dev]\n\n# 2. download the COCO val2017 subset we use (puts ~3GB under data\u002F)\npython scripts\u002Fget_coco_subset.py\n\n# 3. run one probe on one model\npython -m finevlm_probe.runners.run_one \\\n    --model clip_vit_b32 \\\n    --probe attr-swap \\\n    --dataset coco-subset \\\n    --out runs\u002Fclip_b32_attrswap.json\n```\n\n`run_one` emits a single JSON file. To aggregate:\n\n```bash\npython -m finevlm_probe.reporting.aggregate runs\u002F*.json --format markdown > table.md\n```\n\n## Running a full sweep\n\n`runners.sweep` reads a YAML and dispatches each (model, probe, dataset) cell.\nGPUs are picked round-robin by default; pin with `CUDA_VISIBLE_DEVICES=0,1`.\n\n```bash\npython -m finevlm_probe.runners.sweep configs\u002Fmain_sweep.yaml --out runs\u002Fmain\u002F\n```\n\n`configs\u002Fmain_sweep.yaml` mirrors the report's Table 2. 
Expect ~6 GPU-hours on a\nsingle A100-40G for the full main sweep; the resolution-sweep takes another ~3.\n\n## Reproducing the report numbers\n\nThe exact configs that produced the report:\n\n```bash\nbash scripts\u002Freproduce_table2.sh    # main results\nbash scripts\u002Freproduce_table3.sh    # resolution sweep\nbash scripts\u002Freproduce_figure4.sh   # cantonese subset\n```\n\nIf your numbers move by more than ~0.5 acc., open an issue with your environment\n(PyTorch \u002F CUDA \u002F transformers versions). Most drift we've seen is from\n`transformers` BLIP-2 changes between 4.36 and 4.40.\n\n## Adding a new model\n\n```python\n# finevlm_probe\u002Fmodels\u002Fmy_model.py\nfrom .base import ModelAdapter\n\nclass MyModel(ModelAdapter):\n    name = \"my-model-v1\"\n\n    def __init__(self, device=\"cuda\"):\n        self.device = device\n        # load weights here\n\n    def encode_image(self, images):  # PIL.Image list -> [N, D] tensor\n        ...\n\n    def encode_text(self, texts):    # list[str] -> [N, D] tensor\n        ...\n```\n\nThen register in `finevlm_probe\u002Fmodels\u002F__init__.py` and reference by name in your\nconfig. The `score(image, text)` method is optional; if absent we use cosine of\nthe embeddings.\n\n## Adding a new probe\n\nA probe is one file that subclasses `Probe`:\n\n```python\nfrom finevlm_probe.probes.base import Probe\n\nclass MyProbe(Probe):\n    name = \"my-probe\"\n\n    def prepare(self, dataset):\n        self.samples = list(dataset.iter_for_probe(self.name))\n\n    def step(self, sample, model):\n        # do the eval, return per-sample dict\n        ...\n\n    def summarize(self, records):\n        # aggregate per-sample dicts into a metric dict\n        ...\n```\n\nRegister in `finevlm_probe\u002Fprobes\u002F__init__.py`.\n\n## Caveats and known issues\n\n- **Frozen-encoder only.** We do not fine-tune anything. 
Some VLMs perform very\n  differently after a tiny LoRA adapter; that is out of scope.\n- **Cantonese subset is tiny.** 300 images, 1 caption per image, by one annotator\n  (the author). It is intended as a *sanity probe*, not a benchmark.\n- **No human eval.** All metrics here are automatic.\n- **BLIP-2 score path.** We use the image-text-matching head logits rather than\n  generation likelihood; this matters for `attr-swap` results.\n- **LLaVA models are slow.** A full sweep with LLaVA-13B takes ~40 GPU-hours.\n  Use `--max-samples` to subset during development.\n- We have observed numerical drift when running on CPU vs CUDA for SigLIP; the\n  ranking is stable, the absolute numbers move by ~0.3 acc.\n- TODO: Add Korean and Hindi subsets like we did for Cantonese.\n- FIXME: `resolution-sweep` re-downloads the image at each resolution; it should\n  cache the largest and resize locally.\n\n## Citation\n\nIf you use this suite, please cite the report:\n\n```bibtex\n@techreport{cheung2025finevlmprobe,\n  title  = {FineVLM-Probe: A Lightweight Suite for Probing Fine-grained\n            Visual-Language Alignment in Frozen VLMs},\n  author = {Cheung, Ka Yiu},\n  year   = {2025},\n  institution = {HKUST}\n}\n```\n\nFor the Cantonese subset specifically, also cite the data notes in `data\u002FCANTONESE_CC.md`.\n\n## License\n\nCode is Apache-2.0 (see `LICENSE`). The Cantonese caption annotations under\n`data\u002Fcantonese_cc\u002F` are released CC-BY-4.0 as noted in that directory's README.\n\n## Acknowledgements\n\nThanks to labmates for catching the Winoground loader bug in #14, and to the\nauthors of CLIP, SigLIP, BLIP-2, LLaVA, and the ARO and EqBench teams whose\nwork this suite stands on.\n","FineVLM-Probe is a lightweight toolkit for fine-grained probing of frozen vision-language models such as CLIP, SigLIP, BLIP-2, and LLaVA. The project provides a range of probe definitions and dataset loaders that let users investigate targeted questions, for example evaluating how a model behaves at different image resolutions or checking the relative contributions of the text encoder and the vision encoder. Its core feature is a flexible model-adapter mechanism: adding a new model takes only about 30 lines of code. It suits research settings that require close examination of specific behaviors of pretrained vision-language models.",2,"2026-06-11 04:01:19","CREATED_QUERY"]