[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-76384":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":15,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":21,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":15,"starSnapshotCount":15,"syncStatus":26,"lastSyncTime":27,"discoverSource":28},76384,"MMSkills","DeepExperience\u002FMMSkills","DeepExperience","MMSkills: Towards Multimodal Skills for General Visual Agents",null,"Python",320,22,13,5,0,219,54.09,"Apache License 2.0",false,"main",true,[],"2026-06-12 04:01:21","\u003Ch1 align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fdeepexperience.github.io\u002FMMSkills\u002Fassets\u002Fmmskills_title.svg\" alt=\"MMSkills\" width=\"440\"\u002F>\u003Cbr>\n  Towards Multimodal Skills for General Visual Agents\n\u003C\u002Fh1>\n\n\u003Cdiv align=\"center\">\n\n[![Python 3.10+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.10+-blue.svg)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache--2.0-green.svg)](LICENSE)\n[![OSWorld](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBenchmark-OSWorld-7b39e2.svg)](https:\u002F\u002Fgithub.com\u002Fxlang-ai\u002FOSWorld)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2605.13527-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.13527)\n[![Website](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWebsite-MMSkills-0f766e.svg)](https:\u002F\u002Fdeepexperience.github.io\u002FMMSkills\u002F)\n[![Skill 
Library](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSkill%20Library-515%20MMSkills-4420A8.svg)](https:\u002F\u002Fdeepexperience.github.io\u002FMMSkills\u002Fskills.html)\n[![Demos](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDemos-4%20Video%20Comparisons-a15c11.svg)](https:\u002F\u002Fdeepexperience.github.io\u002FMMSkills\u002Fcases.html)\n[![Agent Adapter](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAgent%20Adapter-Codex%20%7C%20OpenClaw%20%7C%20Claude%20Code-0f766e.svg)](agent_integrations\u002Fmmskills-agent-adapter\u002F)\n[![Submit MMSkill](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FSubmit-MMSkill%20Package-a15c11.svg)](https:\u002F\u002Fdeepexperience.github.io\u002FMMSkills\u002Fsubmit.html)\n[![GitHub stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FDeepExperience\u002FMMSkills?style=social)](https:\u002F\u002Fgithub.com\u002FDeepExperience\u002FMMSkills\u002Fstargazers)\n\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"#-latest-news\">News\u003C\u002Fa> |\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.13527\">Paper\u003C\u002Fa> |\n  \u003Ca href=\"https:\u002F\u002Fdeepexperience.github.io\u002FMMSkills\u002F\">Website\u003C\u002Fa> |\n  \u003Ca href=\"https:\u002F\u002Fdeepexperience.github.io\u002FMMSkills\u002Fskills.html\">Skill Library\u003C\u002Fa> |\n  \u003Ca href=\"https:\u002F\u002Fdeepexperience.github.io\u002FMMSkills\u002Fcases.html\">Demos\u003C\u002Fa> |\n  \u003Ca href=\"#-agent-adapter\">Agent Adapter\u003C\u002Fa> |\n  \u003Ca href=\"#-community-submissions\">Submit MMSkills\u003C\u002Fa> |\n  \u003Ca href=\"#-overview\">Overview\u003C\u002Fa> |\n  \u003Ca href=\"#-installation\">Installation\u003C\u002Fa> |\n  \u003Ca href=\"#-quick-start\">Quick Start\u003C\u002Fa> |\n  \u003Ca href=\"#-citation\">Citation\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Ch5 align=\"center\">If you find this project helpful, please give us a star ⭐ for the latest 
updates.\u003C\u002Fh5>\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Freadme-typing-svg.herokuapp.com?font=Orbitron&size=18&duration=3000&pause=1000&color=4420A8&center=true&vCenter=true&width=820&lines=Welcome+to+MMSkills;Reusable+Multimodal+Procedural+Knowledge;Skill-Augmented+Visual+Agents+for+Desktop+Tasks\" alt=\"Typing Animation purple MMSkills\" \u002F>\n\u003C\u002Fdiv>\n\n## 📣 Latest News\n\n- 🏆 **[May 2026]** MMSkills ranked **#1 on Hugging Face Daily Papers** on **2026.5.18**.\n- 🤗 **[May 2026]** The MMSkills dataset is now available on [Hugging Face Datasets](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fzhangkangning\u002Fmmskills); the paper page is also available on [Hugging Face Papers](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2605.13527).\n- 🌐 **[May 2026]** The project website is live with [demo comparisons](https:\u002F\u002Fdeepexperience.github.io\u002FMMSkills\u002Fcases.html) and a searchable [MMSkills Library](https:\u002F\u002Fdeepexperience.github.io\u002FMMSkills\u002Fskills.html) indexing **515 skills** across Ubuntu, macOS, VAB-Minecraft, and Mario.\n- 🚀 **[May 2026]** The public release includes a compact multimodal desktop-skill subset, OSWorld-ready runtime adapters, task mappings, and model-agnostic skill modes.\n- 🔌 **[May 2026]** We added the **MMSkills Agent Adapter** for Codex, OpenClaw, and Claude Code, with one-line Codex installation and on-demand Hugging Face skill retrieval.\n- 🌱 **[May 2026]** Community MMSkill submissions are open for new domains such as autonomous driving, robotics, mobile agents, and beyond.\n\n## 🎬 Demos\n\nFour OSWorld demos compare the same task under no skills, text-only skill guidance, and multimodal MMSkills. These videos show selected trajectory excerpts to highlight behavioral differences between the three settings; they are not complete end-to-end trajectories. 
To keep GUI text readable in the GitHub README, each case uses three separate 1080p MP4 players instead of a compressed side-by-side composite. The full video layout is also available at [deepexperience.github.io\u002FMMSkills\u002Fcases.html](https:\u002F\u002Fdeepexperience.github.io\u002FMMSkills\u002Fcases.html).\n\n\u003Cdetails open>\n\u003Csummary>\u003Ch3>1. Calc merged headers\u003C\u002Fh3>\u003C\u002Fsummary>\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Cth>No skills\u003C\u002Fth>\n    \u003Cth>Text-only\u003C\u002Fth>\n    \u003Cth>MMSkills\u003C\u002Fth>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fcfe1cde8-5da1-4f69-9e90-1a3ee0b82023\" width=\"280\" controls>\u003C\u002Fvideo>\u003C\u002Ftd>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fce092ee3-4e10-44cb-bfd3-bb4780e5c9c4\" width=\"280\" controls>\u003C\u002Fvideo>\u003C\u002Ftd>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F24c8ca7a-a028-422a-8207-52b14c8b5d1e\" width=\"280\" controls>\u003C\u002Fvideo>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\nCreates Sheet2, merges the requested header ranges, and writes the target labels. MMSkills follows the intended spreadsheet workflow while the other modes make slower or less reliable progress.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Ch3>2. 
VS Code local VSIX install\u003C\u002Fh3>\u003C\u002Fsummary>\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Cth>No skills\u003C\u002Fth>\n    \u003Cth>Text-only\u003C\u002Fth>\n    \u003Cth>MMSkills\u003C\u002Fth>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F2296cd12-733e-4f25-95d5-402b2845ae37\" width=\"280\" controls>\u003C\u002Fvideo>\u003C\u002Ftd>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F4cfdfe99-6bb6-4a40-86ae-c7703eb1182c\" width=\"280\" controls>\u003C\u002Fvideo>\u003C\u002Ftd>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F90abd134-1e2c-4bbd-833f-83938b81383a\" width=\"280\" controls>\u003C\u002Fvideo>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\nInstalls a local VSIX extension through the GUI workflow. The comparison highlights how multimodal skill references reduce detours around extension discovery and confirmation steps.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Ch3>3. 
GIMP text-layer move\u003C\u002Fh3>\u003C\u002Fsummary>\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Cth>No skills\u003C\u002Fth>\n    \u003Cth>Text-only\u003C\u002Fth>\n    \u003Cth>MMSkills\u003C\u002Fth>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F6f0d27ba-25a4-4b31-b34b-8480eb3d5fa0\" width=\"280\" controls>\u003C\u002Fvideo>\u003C\u002Ftd>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F57a4719d-0d62-4c00-a0b0-befecf5ac256\" width=\"280\" controls>\u003C\u002Fvideo>\u003C\u002Ftd>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F0e221a36-29a8-4b7f-8eac-1af5e492fbc7\" width=\"280\" controls>\u003C\u002Fvideo>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\nMoves a specific text layer in GIMP. The multimodal skill package provides visual grounding for the relevant layer and toolbar state, making the edit path clearer.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Ch3>4. 
Calc chart creation\u003C\u002Fh3>\u003C\u002Fsummary>\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Cth>No skills\u003C\u002Fth>\n    \u003Cth>Text-only\u003C\u002Fth>\n    \u003Cth>MMSkills\u003C\u002Fth>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F55a01c94-a748-4a22-9c40-cab707aca386\" width=\"280\" controls>\u003C\u002Fvideo>\u003C\u002Ftd>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fca310dd1-252a-4608-a0c7-7e613b31ee08\" width=\"280\" controls>\u003C\u002Fvideo>\u003C\u002Ftd>\n    \u003Ctd>\u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F1a485289-ceb5-4601-8e16-1be439593145\" width=\"280\" controls>\u003C\u002Fvideo>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\nBuilds the requested clustered chart in LibreOffice Calc. The side-by-side run shows the effect of reusable spreadsheet procedure knowledge on multi-step GUI manipulation.\n\n\u003C\u002Fdetails>\n\n## 💡 Overview\n\n**MMSkills** is a framework for representing, loading, and using reusable multimodal procedural knowledge for visual agents. Each skill combines textual procedure guidance, compact state-card metadata, and optional visual references. At inference time, the agent keeps only lightweight skill hints in the main context, then opens a temporary skill branch when task state suggests that a skill may help.\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fdeepexperience.github.io\u002FMMSkills\u002Fassets\u002Ffull_figure.png\" width=\"95%\" alt=\"MMSkills overview\" \u002F>\n\u003C\u002Fdiv>\n\nThis repository is a focused open-source release. 
It is not a full OSWorld fork; instead, it provides the MMSkill runtime layer, an install script, OSWorld runner patches, task-to-skill mappings, and a representative public skill library.\n\nProject pages:\n\n- [arXiv paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.13527)\n- [MMSkills website](https:\u002F\u002Fdeepexperience.github.io\u002FMMSkills\u002F)\n- [Searchable Multidomain Skill Library](https:\u002F\u002Fdeepexperience.github.io\u002FMMSkills\u002Fskills.html)\n- [Demo video comparisons](https:\u002F\u002Fdeepexperience.github.io\u002FMMSkills\u002Fcases.html)\n\nWebsite frontend files are published from the `gh-pages` branch. The `main` branch is kept focused on the open-source code, runtime integration, skills, and documentation.\n\n## ✨ Highlights\n\n\u003Ctable>\n  \u003Ctr>\n    \u003Ctd width=\"50%\">\u003Cstrong>🧩 Self-contained skill packages\u003C\u002Fstrong>\u003Cbr>Each skill directory contains \u003Ccode>SKILL.md\u003C\u002Fcode>, runtime state cards, audit state cards, and visual keyframes.\u003C\u002Ftd>\n    \u003Ctd width=\"50%\">\u003Cstrong>👁️ Multimodal evidence gating\u003C\u002Fstrong>\u003Cbr>The runtime first decides whether visual references are needed, then loads only the requested state views.\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd width=\"50%\">\u003Cstrong>🧠 Branch-loaded planning\u003C\u002Fstrong>\u003Cbr>A temporary planner branch consults selected skills and returns concise guidance, fallback advice, and verification cues.\u003C\u002Ftd>\n    \u003Ctd width=\"50%\">\u003Cstrong>🔌 OSWorld ready\u003C\u002Fstrong>\u003Cbr>Helper scripts install the agent files, runner integration, skills, and task mappings into a local OSWorld checkout.\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd width=\"50%\">\u003Cstrong>⚡ Agent-product adapter\u003C\u002Fstrong>\u003Cbr>The \u003Ccode>mmskills-agent-adapter\u003C\u002Fcode> can be installed as a Codex skill and reused by OpenClaw or Claude 
Code through the same package contract.\u003C\u002Ftd>\n    \u003Ctd width=\"50%\">\u003Cstrong>📦 On-demand skill retrieval\u003C\u002Fstrong>\u003Cbr>Agents search the 515-skill Hugging Face library, download only task-relevant packages, then read \u003Ccode>SKILL.md\u003C\u002Fcode>, runtime states, and visual references as needed.\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd width=\"50%\">\u003Cstrong>🌱 Community-extensible library\u003C\u002Fstrong>\u003Cbr>Researchers can submit MMSkill packages for new domains such as autonomous driving, robotics, mobile apps, web agents, and games.\u003C\u002Ftd>\n    \u003Ctd width=\"50%\">\u003Cstrong>✅ Review-first publishing\u003C\u002Fstrong>\u003Cbr>Submissions open GitHub issues, notify maintainers, and are reviewed before being normalized into the public Hugging Face library and website.\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n## 🔌 Agent Adapter\n\nThe [`mmskills-agent-adapter`](agent_integrations\u002Fmmskills-agent-adapter\u002F) module turns MMSkills into an installable, product-neutral skill adapter for agent systems. It keeps one shared MMSkills package format across Codex, OpenClaw, Claude Code, and future agent products instead of maintaining separate copies for each ecosystem.\n\nThe adapter is intentionally lightweight. It does not bundle the full 515-skill asset set inside the repository branch. 
Instead, it points agents to the public [Hugging Face MMSkills dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fzhangkangning\u002Fmmskills), searches the metadata index, and downloads only the skill package needed for the current task.\n\nOne-line Codex install:\n\n```bash\ncurl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002FDeepExperience\u002FMMSkills\u002Fmain\u002Fscripts\u002Finstall_codex_mmskills.sh | bash\n```\n\nDirect Codex skill-installer form:\n\n```bash\npython ~\u002F.codex\u002Fskills\u002F.system\u002Fskill-installer\u002Fscripts\u002Finstall-skill-from-github.py \\\n  --repo DeepExperience\u002FMMSkills \\\n  --path agent_integrations\u002Fmmskills-agent-adapter\n```\n\nAfter restarting Codex, invoke `$mmskills` for GUI-agent or computer-use tasks. The adapter scripts provide the standard flow:\n\n```bash\npython scripts\u002Fsearch_skills.py \"chrome bookmark\" --package ubuntu\npython scripts\u002Fdownload_skill.py ubuntu\u002Fchrome\u002FCHROME_Manage_Bookmarks_Reading_List_And_Shortcuts\npython scripts\u002Finspect_skill.py ~\u002F.cache\u002Fmmskills\u002Fskills\u002Fubuntu\u002Fchrome\u002FCHROME_Manage_Bookmarks_Reading_List_And_Shortcuts\n```\n\nFor OpenClaw and Claude Code, use the same adapter contract: call the search\u002Fdownload scripts, parse `SKILL.md` and `runtime_state_cards.json`, and route `Images\u002F` into the product's visual grounding or verification layer only when visual evidence is needed.\n\n## 🌱 Community Submissions\n\nWe welcome MMSkill packages from new domains. 
A submission can be a single reusable skill or a new domain collection, such as autonomous driving, robotics, mobile agents, browser workflows, scientific software, games, or other visual-agent environments.\n\nSubmit through the website entrypoint or directly through the GitHub issue form:\n\n- Website entry: [deepexperience.github.io\u002FMMSkills\u002Fsubmit.html](https:\u002F\u002Fdeepexperience.github.io\u002FMMSkills\u002Fsubmit.html)\n- GitHub issue form: [Submit an MMSkill package](https:\u002F\u002Fgithub.com\u002FDeepExperience\u002FMMSkills\u002Fissues\u002Fnew?template=skill_submission.yml)\n- Format guide: [docs\u002Fsubmit_mmskills.md](docs\u002Fsubmit_mmskills.md)\n\nEach submission creates a GitHub issue assigned to the maintainer account, so maintainers can receive email notifications through GitHub's repository notification settings. After review, accepted packages are normalized into the MMSkills library, uploaded to the public Hugging Face dataset, and surfaced on the website Skill Library.\n\n## 🗂️ Repository Layout\n\n```text\nMMSkills\u002F\n├── agent_integrations\u002F        # Codex\u002FOpenClaw\u002FClaude Code agent adapters and download helpers\n├── mm_agents\u002F                 # MMSkill runtime architecture and model adapters\n├── osworld_integration\u002F       # MMSkills-aware OSWorld runner files\n├── skills_library\u002F            # Public multimodal skills subset for direct runtime use\n├── task_skill_mappings\u002F       # OSWorld task-to-skill mapping for released skills\n└── scripts\u002F\n    ├── install_into_osworld.py # Install this release into an OSWorld checkout\n    └── sync_from_sources.py    # Maintainer sync helper for source checkouts\n```\n\n## 🧠 Architecture\n\nThe public runtime entrypoint is [`mm_agents\u002Fmm_skill_agent.py`](mm_agents\u002Fmm_skill_agent.py), exposed in OSWorld as:\n\n```bash\n--agent_type mm_skill\n```\n\nThe architecture is model-agnostic. 
A main visual agent receives compact skill hints; when a skill may apply, the runtime opens a branch that decides whether visual evidence is needed, requests relevant state views, compares them with the live screenshot, and returns structured guidance for the next grounded action.\n\nThe reference integration supports:\n\n- `mm_skill`: multimodal branch-loaded skill consultation.\n- `general_text_skill`: text-only skill consultation for ablation and lightweight runs.\n- `general`: baseline model-agnostic screenshot-to-pyautogui visual-agent routing.\n\nLegacy `gemini`, `gemini_skill`, and `gemini_text_skill` CLI names are still accepted by the runner as aliases for compatibility, but the public files and recommended commands use the model-agnostic `general*` names.\n\nAny screenshot-capable VLM served through an OpenAI-compatible chat-completions API can use the same `general*` and `mm_skill` interfaces by setting `--model`, `--api_model` when needed, `--base_url`, and `--api_key`.\n\n## 🔧 Installation\n\n### 1. Clone MMSkills\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FDeepExperience\u002FMMSkills.git\ncd MMSkills\n```\n\n### 2. Install Python dependencies\n\n```bash\npython3 -m venv .venv\nsource .venv\u002Fbin\u002Factivate\npip install -r requirements.txt\n```\n\n### 3. Install into OSWorld\n\nClone and install OSWorld following its upstream instructions, then run:\n\n```bash\npython3 scripts\u002Finstall_into_osworld.py \u002Fpath\u002Fto\u002FOSWorld --with-runner --with-skills\n```\n\nThis copies the MMSkill agent files into `OSWorld\u002Fmm_agents\u002F`, installs the MMSkills-aware runner files, and copies the released `skills_library\u002F` plus `task_skill_mappings\u002F`.\n\n### 4. 
Configure model endpoints\n\nFor an OpenAI-compatible endpoint:\n\n```bash\nexport OPENAI_BASE_URL=\"https:\u002F\u002Fyour-openai-compatible-endpoint\u002Fv1\"\nexport OPENAI_API_KEY=\"your_api_key\"\n```\n\nFor native Gemini-compatible routing, pass `--api_backend gemini` and set:\n\n```bash\nexport GEMINI_BASE_URL=\"https:\u002F\u002Fyour-gemini-compatible-endpoint\u002Fv1\"\nexport GEMINI_API_KEY=\"your_api_key\"\n```\n\n### 5. Install the Codex Agent Adapter\n\nMMSkills also ships a lightweight agent-product adapter under [`agent_integrations\u002Fmmskills-agent-adapter\u002F`](agent_integrations\u002Fmmskills-agent-adapter\u002F). The adapter is installable as a Codex skill and points agents to the full Hugging Face skill dataset for on-demand retrieval. See [Agent Adapter](#-agent-adapter) for the full cross-agent contract.\n\nOne-line Codex install:\n\n```bash\ncurl -fsSL https:\u002F\u002Fraw.githubusercontent.com\u002FDeepExperience\u002FMMSkills\u002Fmain\u002Fscripts\u002Finstall_codex_mmskills.sh | bash\n```\n\nDirect Codex skill-installer form:\n\n```bash\npython ~\u002F.codex\u002Fskills\u002F.system\u002Fskill-installer\u002Fscripts\u002Finstall-skill-from-github.py \\\n  --repo DeepExperience\u002FMMSkills \\\n  --path agent_integrations\u002Fmmskills-agent-adapter\n```\n\nAfter restarting Codex, use `$mmskills` to search and load task-relevant packages. 
The same adapter contract is intended for OpenClaw and Claude Code: share the MMSkills package format, keep product-specific behavior in thin adapters, and download only the skills needed for the current task from [Hugging Face Datasets](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fzhangkangning\u002Fmmskills).\n\n## 🏃 Quick Start\n\nRun commands from the OSWorld checkout after installation.\n\n### Baseline Without Skills\n\n```bash\npython run.py \\\n  --agent_type general \\\n  --model gpt-4o \\\n  --api_backend openai \\\n  --observation_type screenshot \\\n  --action_space pyautogui \\\n  --max_steps 20 \\\n  --test_all_meta_path evaluation_examples\u002Ftest_nogdrive.json \\\n  --domain chrome \\\n  --result_dir results\u002Fno_skills\n```\n\n### Text-Only Skills\n\n```bash\npython run.py \\\n  --agent_type general_text_skill \\\n  --model gpt-4o \\\n  --api_backend openai \\\n  --observation_type screenshot \\\n  --action_space pyautogui \\\n  --max_steps 20 \\\n  --skills_library_dir skills_library \\\n  --task_skill_mapping_root task_skill_mappings\u002Ftask_skill_mapping.json \\\n  --skill_mode text_only \\\n  --text_skill_mode branch_planner \\\n  --test_all_meta_path evaluation_examples\u002Ftest_nogdrive.json \\\n  --domain chrome \\\n  --result_dir results\u002Ftext_only\n```\n\n### Multimodal MMSkill Agent\n\n```bash\npython run.py \\\n  --agent_type mm_skill \\\n  --model gpt-4o \\\n  --api_backend openai \\\n  --observation_type screenshot \\\n  --action_space pyautogui \\\n  --max_steps 20 \\\n  --skills_library_dir skills_library \\\n  --task_skill_mapping_root task_skill_mappings\u002Ftask_skill_mapping.json \\\n  --skill_mode multimodal \\\n  --task_skill_top_k 6 \\\n  --save_conversation_json \\\n  --test_all_meta_path evaluation_examples\u002Ftest_nogdrive.json \\\n  --domain chrome \\\n  --result_dir results\u002Fmm_skill_multimodal\n```\n\nUse `--domain all` for the full no-Google-Drive OSWorld split. 
The runner writes trajectories, screenshots, `skill_invocations.json`, `skill_usage_summary.json`, and aggregate metrics under the selected `--result_dir`.\n\n## 📚 Skill Library\n\nThe website indexes **515 skills** from the open-source Ubuntu, macOS, VAB-Minecraft, and Mario skill assets. Each skill card links to a structured view of its `SKILL.md`, runtime state cards, and ordered visual references.\n\nBrowse the live library at [deepexperience.github.io\u002FMMSkills\u002Fskills.html](https:\u002F\u002Fdeepexperience.github.io\u002FMMSkills\u002Fskills.html).\n\nThe repository also includes a compact runtime-ready subset under [`skills_library\u002F`](skills_library\u002F) for immediate OSWorld integration.\n\n## 📦 Skill Package Format\n\n```text\nskills_library\u002F\u003Cdomain>\u002F\u003Cskill_name>\u002F\n├── SKILL.md                  # Procedure, applicability, transfer limits, checks\n├── runtime_state_cards.json  # Compact state\u002Fview metadata used at inference time\n├── state_cards.json          # Audit-grade state metadata for inspection\n├── plan.json                 # Generated plan metadata, when available\n└── Images\u002F                   # Full frames, focus crops, before\u002Fafter references\n```\n\nThe main agent sees only concise skill names and state hints. Detailed visual evidence is loaded lazily by the branch planner, which keeps the main context compact while preserving access to state-specific multimodal references.\n\n`runtime_state_cards.json` is the inference-facing version: it contains compact state descriptions, when-to-use rules, visible cues, verification cues, and selected image views for branch-time loading. 
`state_cards.json` is the richer authoring\u002Faudit version: it keeps transfer-limit notes, highlight targets, grounding queries, bounding boxes, crop decisions, and evidence-source metadata for inspection and regeneration.\n\n## 🧪 Outputs\n\nMMSkills adds skill-aware artifacts to OSWorld result directories:\n\n| File | Purpose |\n|------|---------|\n| `skill_invocations.json` | Per-branch consultation records, selected states, requested views, and planner outputs |\n| `skill_usage_summary.json` | Aggregate skill counts, branch success counts, exhausted skills, and final actions |\n| `conversation.json` | Optional main and branch conversation trace when `--save_conversation_json` is enabled |\n\n## 🤝 Contributing\n\nContributions are welcome for new skills, runtime integrations, documentation, and reproducibility fixes. Please read [`CONTRIBUTING.md`](CONTRIBUTING.md) before opening an issue or pull request.\n\n## 📄 License\n\nThis project is released under the [Apache License 2.0](LICENSE). 
Portions of the OSWorld integration are derived from OSWorld; see [NOTICE](NOTICE) for attribution details.\n\n## 📝 Citation\n\nIf you use MMSkills in your research or applications, please cite our arXiv paper:\n\n```bibtex\n@misc{zhang2026mmskills,\n  title = {MMSkills: Towards Multimodal Skills for General Visual Agents},\n  author = {Kangning Zhang and Shuai Shao and Qingyao Li and Jianghao Lin and Lingyue Fu and Shijian Wang and Wenxiang Jiao and Yuan Lu and Weiwen Liu and Weinan Zhang and Yong Yu},\n  year = {2026},\n  eprint = {2605.13527},\n  archivePrefix = {arXiv},\n  primaryClass = {cs.AI},\n  url = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.13527}\n}\n```\n\nYou can also use the machine-readable citation metadata in [`CITATION.cff`](CITATION.cff).\n","The MMSkills project aims to develop multimodal skills for general visual agents. It provides a library of 515 multimodal skills and supports agent adapters for Codex, OpenClaw, and Claude Code, enabling visual agents to carry out complex desktop tasks. The project is written in Python 3.10+ and is released under the Apache License 2.0. MMSkills suits scenarios that call for stronger visual-agent capabilities, such as automating office tasks, human-computer interaction systems, and any application that requires multimodal processing. With its rich skill library and straightforward integration, MMSkills can significantly improve the performance of visual agents across a variety of environments.",2,"2026-06-01 03:45:45","CREATED_QUERY"]