[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-83016":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":15,"forks30d":15,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":39,"readmeContent":40,"aiSummary":41,"trendingCount":15,"starSnapshotCount":15,"syncStatus":42,"lastSyncTime":43,"discoverSource":44},83016,"MoleCode","AtomFlow-AI\u002FMoleCode","AtomFlow-AI","Molecode presents molecules as code and enables LLMs to operate and reason on chemistry directly.","https:\u002F\u002Farxiv.org\u002Fpdf\u002F2605.16480",null,"Python",235,3,56,0,9,158,174,66,1.81,"MIT License",false,"main",[25,26,27,28,29,30,31,32,33,34,35,36,37,38],"ai-agent","ai-for-science","chemistry","coding-agent","language-model","large-language-models","llm","molecular-modeling","molecular-structures","molecule","representation","smiles","smiles-code","smiles-strings","2026-06-12 02:04:30","\u003Cdiv align=\"center\">\n\n# 🧬 MoleCode\n\n### An LLM-native, graph-explicit molecular language\n\nOfficial repository for [**MoleCode unlocks structural intelligence in large language models**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2605.16480).\n\n*Molecode presents molecules as code and enables LLMs to operate and reason on chemistry directly.*\n\u003Cbr>*Instead of making language models reconstruct molecular structure from cryptic strings,*\n\u003Cbr>*MoleCode lets them read, write, and edit directly on the 
structures.*\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-MoleCode-firebrick)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2605.16480)\n[![PDF](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPDF-DADBDD)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2605.16480)\n[![Website](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAtomFlow-atomflow--ai.com-0b7285.svg)](https:\u002F\u002Fatomflow-ai.com\u002F)\n[![GitHub Stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FAtomFlow-AI\u002FMoleCode.svg?style=social\\&label=Stars)](https:\u002F\u002Fgithub.com\u002FAtomFlow-AI\u002FMoleCode)\n\u003Cbr>[![License: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-yellow.svg)](LICENSE)\n[![Python 3.9+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.9+-blue.svg)](https:\u002F\u002Fwww.python.org\u002F)\n[![Powered by RDKit](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpowered%20by-RDKit-green.svg)](https:\u002F\u002Fwww.rdkit.org\u002F)\n\u003Cbr>[![Works with Claude Code](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWorks%20with-Claude%20Code-d97757.svg)](.claude\u002Fskills\u002Fmolecode\u002F)\n[![Works with Codex](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWorks%20with-Codex-412991.svg)](AGENTS.md)\n[![Agent Skill](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAgent-Skill-blue.svg)](.claude\u002Fskills\u002Fmolecode\u002FSKILL.md)\n\n**English** | [中文](README.zh-CN.md)\n\n\u003Cbr>\n\u003Cbr>\n\n\u003Cimg src=\"docs\u002Fassets\u002Foverview.png\" alt=\"MoleCode overview\" width=\"100%\">\n\n\u003C\u002Fdiv>\n\n---\n\n## Try our latest products on the official website!\nPlease visit the [AtomFlow website](https:\u002F\u002Fatomflow-ai.com\u002F).\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"docs\u002Fassets\u002Fdemo_atomchat.png\" width=\"33%\" \u002F>\n  \u003Cimg src=\"docs\u002Fassets\u002Fdemo_retro.png\" width=\"33%\" \u002F>\n  \u003Cimg src=\"docs\u002Fassets\u002Fdemo_paper_read.png\" 
width=\"33%\" \u002F>\n\u003C\u002Fp>\n\n\u003Cbr>\n    \n## What is MoleCode?\n\nA molecule **is** a graph: atoms are nodes, bonds are edges, and chemistry emerges from the topology. Yet large language models are almost always fed molecules as *linear strings* like SMILES, where the graph is **implicit** — connectivity is positional, branches are syntactic, and rings hide inside index digits. Before an LLM can do any chemistry, it must first *reconstruct the graph from the syntax*, spending reasoning budget on structural bookkeeping.\n\n**MoleCode makes the structure the language.** Every atom and bond is written as a typed declaration with a persistent identifier, serialized as a [Mermaid](https:\u002F\u002Fmermaid.js.org\u002F) graph. Topology becomes directly readable, editable, and auditable inside the context window — and the format is **deterministically and losslessly inter-convertible with SMILES \u002F MOL via RDKit** (no learned model, no information loss).\n\n```mermaid\ngraph TB\n    subgraph chlorophenol[\"para-chlorophenol\"]\n        chlorophenol_C_1[C]\n        chlorophenol_O_1[OH]\n        chlorophenol_C_2[CH]\n        chlorophenol_C_3[CH]\n        chlorophenol_C_4[C]\n        chlorophenol_Cl_1[Cl]\n        chlorophenol_C_5[CH]\n        chlorophenol_C_6[CH]\n        chlorophenol_C_1 === chlorophenol_C_2\n        chlorophenol_C_2 --- chlorophenol_C_3\n        chlorophenol_C_3 === chlorophenol_C_4\n        chlorophenol_C_4 --- chlorophenol_C_5\n        chlorophenol_C_5 === chlorophenol_C_6\n        chlorophenol_C_6 --- chlorophenol_C_1\n        chlorophenol_C_1 --- chlorophenol_O_1\n        chlorophenol_C_4 --- chlorophenol_Cl_1\n    end\n```\n\n> The same `Subgraph → Node → Edge` grammar covers **small molecules, polymers, and Markush structures** — and extends to reaction mechanisms and multimodal document parsing.\n\n---\n\n## Why it matters\n\n| | SMILES | **MoleCode** |\n| --- | --- | --- |\n| Topology | implicit, positional | **explicit, named 
nodes & edges** |\n| Atom identity | none | **persistent IDs** (stable across prompt → reasoning → output) |\n| Editing | whole-string rewrite | **local graph op** (add a methyl = 1 node + 1 edge) |\n| Validation | fragile string parsing | **deterministic RDKit round-trip** |\n| Reasoning behavior | memorizes syntax | **generalizes over structure** |\n\nEmpirically (see the [MoleCode paper](#-citation) and [docs\u002F06-why-it-works.md](docs\u002F06-why-it-works.md)):\n\n- **Generalization, not memorization.** SMILES accuracy collapses from ~42% on familiar molecules to ~20% on novel ones; MoleCode holds **~76–80%** across all familiarity tiers.\n- **Cheaper reasoning.** MoleCode has longer *input* but its chain-of-thought grows **sub-linearly** with molecule size (~C^0.52) versus SMILES' super-linear ~C^1.65 — about a **5× lower total token cost** per query.\n- **Scales to big, repetitive objects.** Full-chain SMILES accuracy falls toward **0%** as polymer chains grow; MoleCode stays flat.\n- **Markush understanding** jumps from **38.1% → 84.0%**.\n\n---\n\n## Install\n\n```bash\npip install molecode          # from PyPI — pulls in rdkit + networkx\n```\n\nOr from source (for the examples, the Agent Skill, and development):\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FAtomFlow-AI\u002FMoleCode.git\ncd MoleCode\npip install -e .\n```\n\n> `pip install molecode` gives you the **library** (`molecode.molecule`,\n> `molecode.polymer`, `molecode.markush`, `molecode.prompts`, `molecode.llm`).\n> The runnable [`examples\u002F`](examples) and the [Agent Skill](.claude\u002Fskills\u002Fmolecode\u002F)\n> live in the repository. 
Full API reference → [docs\u002Fapi.md](docs\u002Fapi.md).\n\n## Quick start\n\n```python\nfrom rdkit import Chem\nfrom molecode import mol_to_mermaid, mermaid_to_mol, mol_to_smiles\n\n# SMILES  ->  MoleCode graph\ngraph = mol_to_mermaid(Chem.MolFromSmiles(\"CC(=O)Oc1ccccc1C(=O)O\"), name=\"Aspirin\")\nprint(graph)\n\n# MoleCode graph  ->  SMILES  (lossless round-trip)\nassert mol_to_smiles(mermaid_to_mol(graph)) == Chem.CanonSmiles(\"CC(=O)Oc1ccccc1C(=O)O\")\n```\n\n---\n\n## Works with your coding agent (Claude Code · Codex)\n\nMoleCode ships as a ready-to-use **[Agent Skill](https:\u002F\u002Fdocs.claude.com\u002Fen\u002Fdocs\u002Fclaude-code\u002Fskills)**,\nso coding agents can clone this repo and immediately reason over and edit\nmolecules at the explicit-graph level — no extra setup, no MCP server required.\n\n| Agent | How it picks MoleCode up |\n| --- | --- |\n| **Claude Code** | Auto-discovers the skill at [`.claude\u002Fskills\u002Fmolecode\u002F`](.claude\u002Fskills\u002Fmolecode\u002F). Just ask it to understand or edit a molecule. |\n| **Codex** (and other agents) | Reads [`AGENTS.md`](AGENTS.md) at the repo root and uses the bundled CLI; interface metadata in [`agents\u002Fopenai.yaml`](.claude\u002Fskills\u002Fmolecode\u002Fagents\u002Fopenai.yaml). 
|\n\nInstead of asking the model to hand-write SMILES — error-prone for anything\nnon-trivial — the skill has it **convert → inspect the named atoms\u002Fbonds → edit\nthe graph → validate**, all through one stable CLI:\n\n```bash\npython .claude\u002Fskills\u002Fmolecode\u002Fscripts\u002Fmolecode_convert.py doctor\npython .claude\u002Fskills\u002Fmolecode\u002Fscripts\u002Fmolecode_convert.py smiles-to-molecode \"CCO\" --name Ethanol\npython .claude\u002Fskills\u002Fmolecode\u002Fscripts\u002Fmolecode_convert.py validate --input edited.mmd     # formula, counts, round-trip\npython .claude\u002Fskills\u002Fmolecode\u002Fscripts\u002Fmolecode_convert.py molecode-to-smiles --input edited.mmd\n```\n\nThe skill bundles the six conversion forms (SMILES \u002F PSMILES \u002F Markush ↔\nMoleCode) plus `validate`, `compare` (Markush-aware isomorphism) and `doctor`, a\nsyntax reference for hand-editing graphs, and a file-based edit workflow built\nfor large molecules. See [`.claude\u002Fskills\u002Fmolecode\u002FSKILL.md`](.claude\u002Fskills\u002Fmolecode\u002FSKILL.md).\n\n## Three domains, one grammar\n\n### 🧪 Small molecules — [`molecode.molecule`](molecode\u002Fmolecule)\n\nAtoms are `prefix_Element_Number[Label]` nodes; bonds are `---` (single), `===` (double), `-.-` (triple), with `===|E|`\u002F`===|Z|` and `_R`\u002F`_S` for stereochemistry. → [syntax reference](docs\u002F02-syntax.md)\n\n### 🔗 Polymers — [`molecode.polymer`](molecode\u002Fpolymer)\n\nThe repeat unit stays **explicit** as a subgraph carrying a symbolic `×n` count, with `TL`\u002F`TR` terminus markers — so the graph does not blow up with chain length. 
→ [polymer docs](docs\u002F03-polymers.md)\n\n```python\nfrom molecode.polymer import polymer_to_mermaid, mermaid_to_psmiles\n\ngraph = polymer_to_mermaid(\"*NCCCCCC(=O)*\", n=8, name=\"Nylon-6\")   # PSMILES -> graph\nmermaid_to_psmiles(graph)                                          # -> '*NCCCCCC(=O)*'\n```\n\n### 🧩 Markush structures — [`molecode.markush`](molecode\u002Fmarkush)\n\nVariable R-groups and named substituents become **abbreviation nodes** in curly braces — `{R1}`, `{Boc}`, `{Ar}` — something plain SMILES cannot express. A built-in graph-isomorphism comparator scores predictions up to abbreviation expansion. → [Markush docs](docs\u002F04-markush.md)\n\n```mermaid\ngraph TB\n    subgraph Mol[\"molecule name\"]\n        Mol_C_1[C]\n        Mol_O_1[OH]\n        Mol_X_1{Boc}\n        Mol_X_2{R1}\n        Mol_C_1 --- Mol_O_1\n        Mol_C_1 --- Mol_X_1\n    end\n```\n\n---\n\n## Run the tasks: understand · generate · edit · reason\n\nMoleCode is a **drop-in representation for any LLM** — feed the grammar as a system prompt, hand the model a graph, and validate its output deterministically. 
The [`examples\u002F`](examples) folder has runnable scripts for all four task families (they run **offline by default**, printing the exact prompt; set `MOLECODE_API_KEY` to call a model):\n\n```bash\npython examples\u002F01_molecule_roundtrip.py   # SMILES \u003C-> graph (lossless)\npython examples\u002F02_polymer_roundtrip.py    # polymers with ×n\npython examples\u002F03_markush_roundtrip.py    # abbreviation nodes & isomorphism\npython examples\u002F04_understanding.py        # count atoms \u002F formula \u002F rings ...\npython examples\u002F05_generation.py           # de novo design under constraints\npython examples\u002F06_editing.py              # local graph edits (add\u002Fdel\u002Fsubstitute)\npython examples\u002F07_reasoning.py            # reaction-product prediction\npython examples\u002F08_image_to_molecode.py    # OCSR: molecule image -> MoleCode (vision model)\n```\n\nThe reusable ingredients:\n\n```python\nfrom molecode.prompts import MOLECULE_SYSTEM_PROMPT   # give this to the LLM as the system prompt\nfrom molecode.molecule import mol_to_mermaid          # your molecule -> what the model reads\nfrom molecode.molecule import mermaid_to_mol           # model output -> validated RDKit Mol\n```\n\n### Calling an LLM\n\nMoleCode is just a representation, so you can use **any** LLM SDK — the prompts\nare plain strings. For convenience the package also ships a tiny,\ndependency-free, **OpenAI-compatible** client (`molecode.llm.LLMClient`, built on\nstdlib `urllib`). 
You supply the API key and base URL — nothing is hard-coded, so\nit works with OpenAI, DeepSeek, Azure, Together, vLLM, Ollama, …\n\n```python\nfrom molecode import LLMClient\nfrom molecode.prompts import MOLECULE_SYSTEM_PROMPT\nfrom molecode.molecule import mol_to_mermaid, mermaid_to_mol\nfrom rdkit import Chem\n\nclient = LLMClient(api_key=\"sk-...\", base_url=\"https:\u002F\u002Fapi.openai.com\u002Fv1\", model=\"\")\n# (or set MOLECODE_API_KEY \u002F MOLECODE_BASE_URL \u002F MOLECODE_MODEL and call LLMClient())\n\ngraph = mol_to_mermaid(Chem.MolFromSmiles(\"CC(=O)Oc1ccccc1C(=O)O\"), name=\"Aspirin\")\nreply = client.chat(f\"How many carbons are in this molecule?\\n```mermaid\\n{graph}\\n```\",\n                    system=MOLECULE_SYSTEM_PROMPT)\nprint(reply)\n```\n\nPrefer the official `openai` SDK? Pass the same prompt strings straight to\n`openai.OpenAI().chat.completions.create(...)` — you don't need `LLMClient` at all.\n\nSee [docs\u002F05-tasks.md](docs\u002F05-tasks.md) for the full task catalog.\n\n| Domain | Understanding | Generation | Editing | Reasoning |\n| --- | :---: | :---: | :---: | :---: |\n| Molecules | ✅ | ✅ | ✅ | ✅ |\n| Polymers | ✅ | ✅ | ✅ | — |\n| Markush | ✅ | — | — | — |\n\n---\n\n## Repository layout\n\n```\nmolecode\u002F                # the library (pip-installable)\n├── molecule\u002F            # small-molecule  \u003C-> Mermaid  (rdkit_to_mermaid, mermaid_to_rdkit)\n├── polymer\u002F             # polymer         \u003C-> Mermaid  (polymer_to_mermaid, mermaid_to_psmiles)\n├── markush\u002F             # Markush         \u003C-> Mermaid  + graph isomorphism + abbreviation_map\n├── prompts\u002F             # LLM system prompts (molecule + markush grammars)\n└── llm.py               # optional OpenAI-compatible client (you supply key + base_url)\nexamples\u002F                # 8 runnable demos (round-trips + 4 task families + image-to-MoleCode)\ndocs\u002F                    # overview, syntax, polymers, markush, tasks, why-it-works\nAGENTS.md          
      # entrypoint for coding agents\n.claude\u002Fskills\u002Fmolecode\u002F # Agent Skill: SKILL.md + CLI + references (Claude Code \u002F Codex)\n```\n\n---\n\n## Results at a glance\n\n| Generalization & reasoning | Goal-directed design | Scaling | Long molecules | General language |\n| :---: | :---: | :---: | :---: | :---: |\n| ![](docs\u002Fassets\u002Fresults_1_main.png) | ![](docs\u002Fassets\u002Fresults_2_chemistry.png) | ![](docs\u002Fassets\u002Fresults_3_scaling.png) | ![](docs\u002Fassets\u002Fresults_4_long_molecules.png) | ![](docs\u002Fassets\u002Fresults_5_extension.png) |\n\n---\n\n## About AtomFlow\n\nMoleCode is built and maintained by **[AtomFlow](https:\u002F\u002Fatomflow-ai.com\u002F)**.\n\nAtomFlow builds **LLM-native AI for chemistry** — letting language models operate\ndirectly on molecular structure rather than on opaque strings. Our work centers on\nmolecule-grounded applications, including:\n\n- **Molecular chat & interaction** — converse with a molecule; select atoms, bonds,\n  or fragments and edit them in natural language.\n- **Structure-aware editing** — auditable, graph-level molecular edits.\n- **Retrosynthesis** — LLM-native synthesis planning over explicit structures.\n- **Literature reading & structure parsing** — extracting structures from papers\n  and patents, including optical chemical structure recognition (OCSR).\n\nMoleCode is the open representation layer underneath these products — making\nmolecular structure explicit, editable, and auditable for LLMs.\n\n🌐 Learn more at **[atomflow-ai.com](https:\u002F\u002Fatomflow-ai.com\u002F)**.\n\n## 📚 Citation\n\nIf you use MoleCode in your research, please cite the MoleCode technical report:\n\n```bibtex\n@article{yan2026molecode,\n  title={MoleCode unlocks structural intelligence in large language models},\n  author={Yan, Zhiyuan and Liu, Chen and Zhao, Boxuan and Lin, Kaiqing and Zhao, Jixiang and Wang, Yimi and Lv, Liuzhenghao and Li, Hao and Zhang, Shanzhuo and Yuan, Li 
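and others},\n  journal={arXiv preprint arXiv:2605.16480},\n  year={2026}\n}\n```\n\n## License\n\n[MIT](LICENSE) © 2026 AtomFlow-AI\n","MoleCode is a project that represents molecules as code, enabling large language models to operate on and reason about chemical structures directly. Its core feature is writing every atom and bond as a typed declaration with a persistent identifier and serializing them as a Mermaid graph, so that molecular topology can be read, edited, and audited directly. MoleCode is implemented in Python and integrates with RDKit to ensure lossless inter-conversion with the SMILES and MOL formats. It suits scenarios that require automated processing and analysis of chemical molecules, such as drug discovery and materials science.",2,"2026-06-11 04:09:53","CREATED_QUERY"]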