[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-80013":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":14,"stars7d":14,"stars30d":15,"stars90d":14,"forks30d":14,"starsTrendScore":14,"compositeScore":16,"rankGlobal":9,"rankLanguage":9,"license":17,"archived":18,"fork":18,"defaultBranch":19,"hasWiki":18,"hasPages":18,"topics":20,"createdAt":9,"pushedAt":9,"updatedAt":21,"readmeContent":22,"aiSummary":23,"trendingCount":14,"starSnapshotCount":14,"syncStatus":24,"lastSyncTime":25,"discoverSource":26},80013,"mllm-jailbreak-bench","pardcomper\u002Fmllm-jailbreak-bench","pardcomper","Reproducible benchmark for adversarial attacks on multimodal large language models",null,"Python",230,50,25,0,148,5.12,"Other",false,"main",[],"2026-06-12 02:03:56","\u003Cdiv align=\"center\">\n\n# MLLM-Jailbreak-Bench\n\n**A reproducible benchmark for adversarial attacks on multimodal large language models.**\n\n[![Python 3.10+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10+-blue.svg?style=for-the-badge)](https:\u002F\u002Fwww.python.org\u002F)\n[![License: BSD-3](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-BSD--3-orange.svg?style=for-the-badge)](LICENSE)\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpaper-arXiv:2410.xxxxx-b31b1b.svg?style=for-the-badge)](#citation)\n[![Tests](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftests-passing-success.svg?style=for-the-badge)](#)\n\n[Overview](#overview) · [Threat model](#threat-model) · [Quick start](#quick-start) · [Attacks](#attacks-included) · [Defenses](#defenses-included) · [Citation](#citation)\n\n\u003C\u002Fdiv>\n\n---\n\n## Overview\n\nMLLM-Jailbreak-Bench measures how reliably a multimodal large language model can be persuaded to produce harmful output across **five 
categories** of attack — image-injection, audio-injection, text-image collusion, jailbreak-via-OCR, and visual-prompt-leakage. It is designed to be:\n\n- **Reproducible** — fixed seeds, frozen attack budgets, deterministic eval loops.\n- **Model-agnostic** — works with any HF-compatible MLLM via a thin adapter.\n- **Honest about limits** — every attack reports both attack success rate (ASR) and a calibration estimate so we don't conflate \"jailbroken\" with \"noise\".\n\n> ⚠️ **Responsible use.** This benchmark exists to make MLLMs safer to deploy, not the opposite. Attack templates are drawn from the published literature and are paired with refusal-quality metrics. Do not extend the attack inventory in ways that significantly lower the bar for misuse without prior discussion.\n\n## Threat model\n\nThe benchmark considers an attacker who:\n\n- Has **black-box query access** to the target MLLM (no gradients, no internal state).\n- Can submit arbitrary text *and* up to one image (or one audio clip) per query.\n- Knows the model name and a high-level description of the safety policy, but not the system prompt or refusal triggers.\n\nThis is the most common deployment setting. 
White-box attacks are out of scope.\n\n## Quick start\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fpardcomper\u002Fmllm-jailbreak-bench\ncd mllm-jailbreak-bench\npip install -e .\n\n# Evaluate one MLLM against all attacks at the default budget\njbb run --target llava-1.5-7b --attacks all --out results\u002Fllava15\u002F\n\n# Aggregate into the leaderboard format\njbb leaderboard --results-dir results\u002F --out LEADERBOARD.md\n```\n\nOr from Python:\n\n```python\nfrom jbb import Benchmark, load_target\n\ntarget = load_target(\"Qwen\u002FQwen2-VL-7B-Instruct\", device=\"cuda\")\nbench = Benchmark(attacks=[\"text_image_collusion\", \"ocr_jailbreak\"], n_per_attack=200)\nreport = bench.run(target)\nprint(report.summary())\n```\n\n## Attacks included\n\n| Category                     | Attack name                | Reference style                  |\n|------------------------------|----------------------------|----------------------------------|\n| Image-injection              | `vis_prompt_injection`     | image-as-instruction             |\n| Image-injection              | `gradient_free_perturb`    | NES-style query-only perturbation|\n| Text-image collusion         | `harmful_in_text_safe_img` | exploits modality routing        |\n| Text-image collusion         | `harmful_in_img_safe_text` | mirror of the above              |\n| OCR jailbreak                | `ocr_jailbreak`            | harmful instruction rendered in pixels |\n| Audio injection              | `audio_prompt_injection`   | TTS-rendered harmful instruction |\n| Visual-prompt leakage        | `sys_prompt_leak`          | tries to extract system prompt   |\n\nEach attack has its own hyperparameter set and a fixed compute budget (queries per sample). 
Budgets are listed in `configs\u002Fbudgets.yaml`.\n\n## Defenses included\n\nYou are not limited to evaluating the undefended model — the benchmark ships with three reference defenses so you can measure how much they help:\n\n- **Input filter** — small text+image classifier upstream of the model.\n- **Self-critique pass** — the model is asked to grade its own response post-hoc.\n- **Refusal-aware decoding** — biases the LM toward known refusal tokens at risky completions.\n\n## Metrics\n\nWe report three numbers per (model, attack) pair:\n\n| Metric                | Definition                                                                  |\n|-----------------------|-----------------------------------------------------------------------------|\n| **ASR**               | Attack success rate: fraction of (target_behavior, sample) pairs the model complied with. |\n| **Refusal quality**   | Among refusals, fraction that gave a *substantive* refusal vs. a generic one. |\n| **Calibration error** | Difference in ASR between adversarial and clean baseline — what's actually due to the attack. |\n\nA high ASR with a low calibration error means the model complies even on the clean baseline: *the model is broken in general,* not \"the attack worked.\"\n\n## Leaderboard (snapshot)\n\nThe leaderboard is regenerated on each release. Latest snapshot:\n\n| Model              | ASR (avg) | Refusal qual. | Calib. err. |\n|--------------------|-----------|---------------|-------------|\n| GPT-4V (Apr-2024)  | 0.18      | 0.79          | 0.04        |\n| Qwen2-VL-7B        | 0.31      | 0.62          | 0.07        |\n| LLaVA-1.6-13B      | 0.39      | 0.51          | 0.05        |\n| LLaVA-1.5-7B       | 0.46      | 0.42          | 0.06        |\n| Idefics2-8B        | 0.34      | 0.58          | 0.05        |\n\n(Lower ASR + lower calib. err. + higher refusal quality = better.)\n\n## Reproducing the paper results\n\n```bash\nbash scripts\u002Freproduce_paper.sh\n```\n\nThis will:\n1. Download the prompt pool (~22 MB).\n2. 
Render OCR-jailbreak images deterministically.\n3. Run every (model, attack) pair at the paper budget.\n4. Emit a `LEADERBOARD.md` matching Table 2 of the paper.\n\nExpected wall-clock on 8 × A100 80 GB: ~12 hours.\n\n## Configuration\n\n| Argument         | Default      | Description                                            |\n|------------------|--------------|--------------------------------------------------------|\n| `--target`       | (required)   | HuggingFace model ID or registered adapter name        |\n| `--attacks`      | `all`        | Comma-separated subset of the attack list              |\n| `--budget`       | `default`    | Named budget preset (`default`, `small`, `paper`)      |\n| `--n-per-attack` | `200`        | Adversarial samples per attack                         |\n| `--seed`         | `0`          | Random seed for sampling and OCR rendering             |\n| `--defenses`     | `none`       | Comma-separated list (`filter`, `self_critique`, `ratd`) |\n| `--out`          | `results\u002F`   | Output directory                                       |\n\n## Adding a target\n\nImplement `BaseTarget` in `jbb\u002Ftargets\u002F` and register it in `jbb\u002Ftargets\u002F__init__.py`:\n\n```python\nfrom jbb.targets.base import BaseTarget, TargetResponse\n\nclass MyMLLM(BaseTarget):\n    name = \"my-mllm\"\n    def query(self, text, image=None, audio=None) -> TargetResponse:\n        ...\n```\n\n## Citation\n\n```bibtex\n@article{wang2024mllmjailbreakbench,\n  title   = {{MLLM-Jailbreak-Bench}: Reproducible Adversarial Evaluation of Multimodal {LLMs}},\n  author  = {Wang, Ziyu},\n  journal = {arXiv preprint arXiv:2410.xxxxx},\n  year    = {2024}\n}\n```\n\n## Acknowledgements\n\nBuilt on PyTorch, HuggingFace Transformers, and the broader LLM safety community.\nCompute for the leaderboard sweeps was provided by the PKU CS department cluster.\n\n## License\n\nBSD-3-Clause.\n","MLLM-Jailbreak-Bench 
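is a reproducible benchmark tool for evaluating adversarial attacks on multimodal large language models. Its core function is to measure how likely a model is to produce harmful output under five types of attack (image injection, audio injection, text-image collusion, OCR jailbreak, and visual-prompt leakage), and it supports any HF-compatible large language model. The tool is designed for reproducibility, using fixed seeds and frozen attack budgets to keep the evaluation process consistent; it also reports attack success rate and a calibration estimate honestly, which helps distinguish a genuine \"jailbreak\" from noise. It suits scenarios that require evaluating and improving the safety of multimodal large language models, such as safety testing before deployment.",2,"2026-06-01 03:49:30","CREATED_QUERY"]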