[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-1794":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":15,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":22,"hasPages":22,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":28,"readmeContent":29,"aiSummary":30,"trendingCount":16,"starSnapshotCount":16,"syncStatus":15,"lastSyncTime":31,"discoverSource":32},1794,"DISCO","DISCO-design\u002FDISCO","DISCO-design","Code for the DISCO model: General Multimodal Protein Design Enables DNA-Encoding of Chemistry","https:\u002F\u002Fdisco-design.github.io\u002F",null,"Python",196,24,4,2,0,5,33,6,53.99,"Apache License 2.0",false,"main",[25,26,27],"diffusion-models","enzyme-design","protein-design","2026-06-12 04:00:11","\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"assets\u002Fdisco.png\" alt=\"DISCO: Diffusion for Sequence-Structure Co-design\" width=\"900\"\u002F>\u003Cbr>\n    \u003Cimg src=\"assets\u002Fcarbene.gif\" width=\"700\"\u002F>\u003Cbr>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.05181\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-94133F?style=for-the-badge&logo=arxiv\" alt=\"arXiv\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fdisco-design.github.io\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📝%20Blog-007A87?style=for-the-badge&logoColor=white\" alt=\"Blog\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002FDISCO-Design\u002FDISCO\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FHuggingFace-DE9B35.svg?style=for-the-badge&logo=HuggingFace\" 
alt=\"HF\"\u002F>\u003C\u002Fa>\n\u003C\u002Fp>\n\nDISCO (DIffusion for Sequence-structure CO-design) is a multimodal generative model that simultaneously co-designs protein sequences and 3D structures, conditioned on and co-folded with arbitrary biomolecules — including small-molecule ligands, DNA, and RNA. Unlike sequential pipelines that first generate a backbone and then apply inverse folding, DISCO generates both modalities jointly, enabling sequence-based objectives to inform structure generation and vice versa.\n\nDISCO achieves state-of-the-art in silico performance in generating binders for diverse biomolecular targets with fine-grained property control, performing best on 178\u002F179 evaluated ligands, as well as DNA and RNA. Applied to new-to-nature catalysis, DISCO was conditioned solely on reactive intermediates — without pre-specifying catalytic residues or relying on template scaffolds — to design diverse heme enzymes with novel active-site geometries. These enzymes catalyze new-to-nature carbene-transfer reactions, including alkene cyclopropanation, spirocyclopropanation, B–H and C(sp³)–H insertions, with top activities exceeding those of engineered enzymes. Random mutagenesis of a selected design further yielded a fourfold activity gain, indicating that the designed enzymes are evolvable.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fconditional_design_results.png\" width=\"95%\" alt=\"DISCO vs baselines on conditional protein design\" \u002F>\n\u003C\u002Fp>\n\n## Quick Start\n\n1. **Install** — see [Installation](#-installation) below.\n2. **Set up prerequisites** — (optionally) configure [CUTLASS](#cutlass-optional) (see below).\n3. 
**Run:**\n\n```bash\npython runner\u002Finference.py \\\n  experiment=designable \\\n  input_json_path=input_jsons\u002Funconditional_config.json \\\n  seeds=\\[0,1,2,3,4\\]\n```\n\nIf a run is interrupted, simply rerun the same command; DISCO automatically skips samples that have already been generated.\n\n> **Note:** The first time you run inference, it may take some time before inference steps begin, as the pairformer kernels are compiled just in time.\n\n## 📦 Installation\n\nDISCO uses [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F) for dependency management (you may need to [install uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002Fgetting-started\u002Finstallation\u002F) first).\n\n> **AMD GPUs:** DeepSpeed does not support AMD GPUs. If you are using an AMD GPU, remove the `deepspeed` dependency from `pyproject.toml` before running `uv sync`, and run with `use_deepspeed_evo_attention=false`.\n\nTo install:\n\n```bash\nuv sync\n```\n\nBy default, `uv sync` installs PyTorch with its default backend. If you need a specific CUDA or CPU backend, uninstall torch and reinstall it with the desired backend. For example, for CUDA 12.4:\n\n```bash\nuv pip uninstall torch\nuv pip install torch --torch-backend=cu124\n```\n\nTo activate the environment, run the following from the top level of the repository:\n\n```bash\nsource .venv\u002Fbin\u002Factivate\n```\n\n## 🔧 Prerequisites\n\n### CUTLASS (optional)\n\nBy default, DISCO uses [DeepSpeed4Science EvoformerAttention](https:\u002F\u002Fwww.deepspeed.ai\u002Ftutorials\u002Fds4sci_evoformerattention\u002F) for memory-efficient attention, which significantly reduces GPU memory usage and enables inference on longer sequences. This requires [NVIDIA CUTLASS](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass) to be available on disk and a GPU with **Ampere or newer architecture** (e.g. 
A100, L40S, H100, H200, B100, B200).\n\nTo set it up, clone the CUTLASS repository and set the `CUTLASS_PATH` environment variable:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcutlass.git \u002Fpath\u002Fto\u002Fcutlass\nexport CUTLASS_PATH=\u002Fpath\u002Fto\u002Fcutlass\n```\n\nYou can add `CUTLASS_PATH` to your shell profile so it persists across sessions. The attention kernels will be compiled the first time they are invoked.\n\nIf you prefer to skip the CUTLASS installation, disable DeepSpeed attention on the command line:\n\n```bash\npython runner\u002Finference.py use_deepspeed_evo_attention=false ...\n```\n\nThis falls back to a naive attention implementation that materializes the full attention matrix and uses substantially more GPU memory.\n\n## 🚀 Running Inference\n\nInference is run through the Hydra-based runner:\n\n```bash\npython runner\u002Finference.py \\\n  experiment=designable \\\n  input_json_path=input_jsons\u002Fyour_config.json \\\n  seeds=\\[$(seq -s \",\" 0 4)\\]\n```\n\n### Key command-line options\n\n| Option | Description |\n|--------|-------------|\n| `experiment=` | Experiment preset (`designable` or `diverse`). See [Experiment Presets](#experiment-presets) below. |\n| `input_json_path=` | Path to the input JSON file describing what to generate. |\n| `seeds=` | List of random seeds, e.g. `[0,1,2]`. Each seed produces one sample per job in the input JSON, so the total number of generated samples equals `len(seeds) * len(jobs)`. |\n| `num_inference_seeds=` | Alternative to `seeds=`: generates seeds `[0, 1, ..., N-1]`. For example, `num_inference_seeds=100` produces 100 samples per job. |\n| `effort=` | Compute preset: `fast` (default) or `max`. **We only recommend `effort=fast` for unconditional generation; for conditional generation (e.g. ligand- or DNA\u002FRNA-conditioned) use `effort=max`.** See [Trading off quality for speed](#trading-off-quality-for-speed). 
|\n| `dump_dir=` | Output directory for generated structures. Defaults to `.\u002Foutput`. |\n\n### Experiment presets\n\nDISCO ships with two experiment presets that control the trade-off between designability and diversity:\n\n- **`designable`** — Uses entropy-adaptive temperature scaling and noisy guidance over both sequence and structure. This steers the model toward samples that are more likely to refold correctly under an external structure predictor, at the cost of reduced structural variety.\n- **`diverse`** — Disables noisy guidance and entropy-adaptive temperature. The model samples more freely from its learned distribution, producing greater structural variety at the cost of lower average designability.\n\nWhich preset to use depends on the task — see [Reproducing Paper Experiments](#-reproducing-paper-experiments) for guidance on which preset was used in each experiment.\n\n> **Tip: cheaper designable runs.** The `designable` preset uses noisy guidance, which increases the effective batch size of each forward pass and slows down inference. You can disable it while keeping the rest of the designable settings by adding `sample_diffusion.noisy_guidance.enabled=false` on the command line. This gives slightly lower designability scores but reduces compute costs, which can be useful for rapid prototyping or large-scale screening runs.\n\n### Trading off quality for speed\n\nDISCO provides two `effort` presets that control the number of recycling cycles and diffusion steps:\n\n| Preset | Diffusion steps | Recycling cycles | Description |\n|--------|:-:|:-:|-------------|\n| `effort=fast` | 100 | 2 | ~4x faster inference with only ~10% lower co-designability. Good for prototyping and large screening runs. This is the default. |\n| `effort=max` | 200 | 4 | Full quality used in the paper. |\n\n> **⚠️ Important:** We only recommend `effort=fast` for **unconditional** generation. For **conditional** generation (e.g. 
ligand- or DNA\u002FRNA-conditioned), use `effort=max` for best results.\n\n```bash\n# Fast (default) — good for prototyping\npython runner\u002Finference.py \\\n  experiment=designable \\\n  input_json_path=input_jsons\u002Fyour_config.json \\\n  seeds=\\[0,1,2,3,4\\]\n\n# Max quality — reproducing paper results\npython runner\u002Finference.py \\\n  experiment=designable \\\n  effort=max \\\n  input_json_path=input_jsons\u002Fyour_config.json \\\n  seeds=\\[0,1,2,3,4\\]\n```\n\nYou can also override the individual parameters directly with `model.N_cycle=` and `sample_diffusion.N_step=`.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"assets\u002Fn_cycle_steps_ablation.png\" width=\"85%\" alt=\"Co-designability and structural diversity vs. number of steps and cycles\" \u002F>\n\u003C\u002Fp>\n\nThe figure above shows the trade-off between co-designability, structural diversity, and compute, measured using the `designable` preset with noisy guidance disabled. Beyond 2 cycles and 100 steps, returns diminish quickly.\n\n> **Note:** When benchmarking against DISCO, use `effort=max` to reproduce the full-quality results reported in the paper.\n\n### Output directory\n\nGenerated structures are saved under `dump_dir` with the following layout:\n\n```\ndump_dir\u002F\n  pdbs\u002F\n    \u003Cname>_sample_\u003Cseed>.pdb\n    \u003Cname>_sample_\u003Cseed>_ligands.txt   # only if ligands are present\n  sequences\u002F\n    \u003Cname>_sample_\u003Cseed>.txt\n  ERR\u002F\n    \u003Cname>.txt                         # only for failed samples\n```\n\nHere `\u003Cname>` is the job name from the input JSON (e.g. 
`length_200_heme_b`) and `\u003Cseed>` is the random seed used for that sample.\n\nYou can override `dump_dir` on the command line:\n\n```bash\npython runner\u002Finference.py dump_dir=\u002Fmy\u002Foutput\u002Fdir ...\n```\n\nBy default, it resolves to `.\u002Foutput` relative to the working directory.\n\n## 🔬 Reproducing Paper Experiments\n\nThe sections below walk through each class of experiment from the paper. We provide all input JSON files needed to reproduce the reported results. To make comparisons to DISCO easier, the raw generated samples and results for **all** *in silico* experiments are available on [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FDISCO-Design\u002FDISCO_benchmark_data).\n\n### 🧬 Unconditional protein generation\n\nIn the unconditional setting, DISCO receives no conditioning target and generates both a protein sequence and a 3D structure from scratch. We evaluate at chain lengths 70, 100, 200, and 300 to assess how generation quality varies with protein size.\n\n```bash\npython runner\u002Finference.py \\\n  experiment=designable \\\n  input_json_path=input_jsons\u002Funconditional_config.json \\\n  seeds=\\[$(seq -s \",\" 0 99)\\]\n```\n\nThe unconditional config (`input_jsons\u002Funconditional_config.json`) contains fully masked protein chains of varying lengths. 
\n\n### 💊 Ligand-conditioned protein design\n\nIn the ligand-conditioned setting, DISCO is given only a small molecule and generates a protein sequence and structure with a complementary binding site, without requiring a template structure or seed sequence.\n\nWe provide five representative ligand input files in `input_jsons\u002F`:\n\n| File | Ligand | Description |\n|------|--------|-------------|\n| `heme_b.json` | Heme B | Iron-porphyrin cofactor central to the paper's enzyme design experiments |\n| `NDI.json` | NDI | Naphthalenediimide derivative |\n| `PLP.json` | PLP | Pyridoxal phosphate, a common enzyme cofactor |\n| `thyroxine.json` | Thyroxine | Thyroid hormone |\n| `warfarin.json` | Warfarin | Anticoagulant drug |\n\nEach file specifies fully masked protein chains (at lengths 150, 200, and 250) alongside a ligand provided as an SDF file. To run:\n\n```bash\npython runner\u002Finference.py \\\n  experiment=diverse \\\n  effort=max \\\n  input_json_path=input_jsons\u002Fheme_b.json \\\n  seeds=\\[$(seq -s \",\" 0 4)\\]\n```\n\nTo reproduce the paper's ligand-conditioned results, use `experiment=diverse`.\n\n### 🧪 Nucleic acid-conditioned protein design\n\nDISCO can also design proteins conditioned on nucleic acid sequences, generating protein chains that form complexes with DNA or RNA.\n\nTwo nucleic acid conditioning files are provided:\n\n| File | Target | Description |\n|------|--------|-------------|\n| `6YMC_rna.json` | RNA | Protein design conditioned on a 26-nt RNA sequence (PDB: 6YMC) |\n| `7S03_dna.json` | DNA | Protein design conditioned on double-stranded DNA (PDB: 7S03) |\n\nThese files sweep over protein chain lengths (50--80) while keeping the nucleic acid sequence fixed. 
To run:\n\n```bash\npython runner\u002Finference.py \\\n  experiment=diverse \\\n  effort=max \\\n  input_json_path=input_jsons\u002F6YMC_rna.json \\\n  seeds=\\[$(seq -s \",\" 0 4)\\]\n```\n\nTo reproduce the paper's nucleic acid results, use `experiment=diverse`.\n\n## 🕺 Studio-179: A Ligand Benchmark for Generative Protein Design 💃\n\nTo systematically evaluate ligand-conditioned protein design, we curated **Studio-179**: a benchmark of 170 natural and non-natural ligands — plus 9 multi-ligand combinations — spanning catalysis, pharmaceuticals, luminescence, and sensing.\n\nThe library covers a range of chemical and geometric properties relevant to protein-ligand interactions:\n\n- **Rigid molecules** — e.g., the persistent organic pollutant tetrachlorodibenzodioxin\n- **Large or flexible molecules** — e.g., CoQ10, a 50-heavy-atom cofactor with a long isoprenoid tail\n- **Metals and metalloclusters** — e.g., [4Fe-4S] iron-sulfur clusters\n\nThe SDF files for all 179 ligands are included in the `studio-179\u002F` directory, organized by priority tier. The full benchmark is split across four input files for parallel execution:\n\n```\ninput_jsons\u002Fall_priorities_ligands_split_0.json\ninput_jsons\u002Fall_priorities_ligands_split_1.json\ninput_jsons\u002Fall_priorities_ligands_split_2.json\ninput_jsons\u002Fall_priorities_ligands_split_3.json\n```\n\nTo run a split:\n\n```bash\npython runner\u002Finference.py \\\n  experiment=diverse \\\n  effort=max \\\n  input_json_path=input_jsons\u002Fall_priorities_ligands_split_0.json \\\n  seeds=\\[$(seq -s \",\" 0 4)\\]\n```\n\nTo run against a **single ligand** instead of all 179, create an input JSON with jobs for that ligand at the three benchmark protein lengths (150, 200, 250 residues). 
For example, to benchmark only heme B:\n\n```json\n[\n  {\n    \"name\": \"length_150_heme_b\",\n    \"sequences\": [\n      {\"proteinChain\": {\"sequence\": \"\u003C150 '-' characters>\", \"count\": 1}},\n      {\"ligand\": {\"ligand\": \"FILE_studio-179\u002Fpriority_1\u002Fheme_b_final_0.sdf\", \"count\": 1}}\n    ]\n  },\n  {\n    \"name\": \"length_200_heme_b\",\n    \"sequences\": [\n      {\"proteinChain\": {\"sequence\": \"\u003C200 '-' characters>\", \"count\": 1}},\n      {\"ligand\": {\"ligand\": \"FILE_studio-179\u002Fpriority_1\u002Fheme_b_final_0.sdf\", \"count\": 1}}\n    ]\n  },\n  {\n    \"name\": \"length_250_heme_b\",\n    \"sequences\": [\n      {\"proteinChain\": {\"sequence\": \"\u003C250 '-' characters>\", \"count\": 1}},\n      {\"ligand\": {\"ligand\": \"FILE_studio-179\u002Fpriority_1\u002Fheme_b_final_0.sdf\", \"count\": 1}}\n    ]\n  }\n]\n```\n\nThe SDF files for each ligand can be found in `studio-179\u002Fpriority_{0,1,2,3}\u002F`. See `input_jsons\u002Fheme_b.json` for a complete working example.\n\n### Evaluation metric: co-designability\n\nFor each ligand, we quantify the fraction of generated designs that are both structurally diverse and **co-designable**. A design is considered co-designable if the protein backbone and all ligand centroids have an RMSD \u003C 2 Å upon refolding with Chai-1. This evaluates whether the generated sequence encodes the intended structure and binding mode, rather than only assessing the plausibility of the generated structure in isolation.\n\nStudio-179 is intended as a community resource. We encourage others to evaluate their ligand-conditioned design methods on this benchmark and to report results using the same co-designability metric for comparability.\n\n## 🎨 Running Your Own Designs\n\nThe sections above reproduce experiments from the paper using provided input files. 
This section walks through how to set up DISCO for your own custom targets.\n\n### Custom ligand-conditioned design\n\nTo design a protein around your own small molecule, create an input JSON with a fully masked protein chain and your ligand. There are three ways to specify a ligand:\n\n#### Option 1: SMILES string\n\nThe simplest approach — no files needed. DISCO will automatically generate a 3D conformer from the SMILES string using RDKit.\n\n```json\n[\n  {\n    \"name\": \"my_ligand_design\",\n    \"sequences\": [\n      {\n        \"proteinChain\": {\n          \"sequence\": \"--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\",\n          \"count\": 1\n        }\n      },\n      {\n        \"ligand\": {\n          \"ligand\": \"CC(=O)Oc1ccccc1C(=O)O\",\n          \"count\": 1\n        }\n      }\n    ]\n  }\n]\n```\n\nThe protein sequence length is determined by the number of `-` characters (200 in this example). Replace the SMILES string with whatever molecule you want to target.\n\n```bash\npython runner\u002Finference.py \\\n  experiment=diverse \\\n  effort=max \\\n  input_json_path=input_jsons\u002Fmy_ligand_design.json \\\n  seeds=\\[$(seq -s \",\" 0 4)\\]\n```\n\n#### Option 2: Molecular structure file (SDF, MOL, MOL2, or PDB)\n\nIf you already have a 3D conformer for your molecule, you can pass it directly. Prefix the file path with `FILE_`:\n\n```json\n{\n  \"ligand\": {\n    \"ligand\": \"FILE_\u002Fabsolute\u002Fpath\u002Fto\u002Fmolecule.sdf\",\n    \"count\": 1\n  }\n}\n```\n\nSupported file formats are **SDF**, **MOL**, **MOL2**, and **PDB**. The file **must** contain a 3D conformer (2D structures will be rejected). 
Paths can be absolute or relative to the repository root:\n\n```json\n{\n  \"ligand\": {\n    \"ligand\": \"FILE_my_ligands\u002Fcaffeine.mol2\",\n    \"count\": 1\n  }\n}\n```\n\n> **Note:** XYZ files are not currently supported. If you have an XYZ file, convert it to SDF or MOL2 first using a tool like Open Babel (`obabel input.xyz -O output.sdf`).\n\n#### Option 3: CCD code\n\nFor standard ligands in the PDB Chemical Component Dictionary, use the CCD code prefixed with `CCD_`:\n\n```json\n{\n  \"ligand\": {\n    \"ligand\": \"CCD_ATP\",\n    \"count\": 1\n  }\n}\n```\n\nFor multi-component ligands (e.g., glycans), concatenate CCD codes with underscores: `\"CCD_NAG_BMA_BGC\"`.\n\n#### Putting it together: a full ligand example\n\nHere is a complete input JSON that designs 150- and 200-residue proteins around a ligand provided as an SDF file, similar to the provided heme B example:\n\n```json\n[\n  {\n    \"name\": \"my_mol_length_150\",\n    \"sequences\": [\n      {\n        \"proteinChain\": {\n          \"sequence\": \"------------------------------------------------------------------------------------------------------------------------------------------------------\",\n          \"count\": 1\n        }\n      },\n      {\n        \"ligand\": {\n          \"ligand\": \"FILE_my_ligands\u002Fmy_molecule.sdf\",\n          \"count\": 1\n        }\n      }\n    ]\n  },\n  {\n    \"name\": \"my_mol_length_200\",\n    \"sequences\": [\n      {\n        \"proteinChain\": {\n          \"sequence\": \"--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\",\n          \"count\": 1\n        }\n      },\n      {\n        \"ligand\": {\n          \"ligand\": \"FILE_my_ligands\u002Fmy_molecule.sdf\",\n          \"count\": 1\n        }\n      }\n    ]\n  }\n]\n```\n\nEach entry in the top-level list is a separate design job. 
Here we sweep over two protein lengths for the same ligand.\n\n### Custom nucleic acid-conditioned design\n\nDISCO can design proteins that bind DNA or RNA sequences. Provide the nucleic acid as a fixed sequence alongside a fully masked protein chain.\n\n#### RNA-binding protein design\n\n```json\n[\n  {\n    \"name\": \"my_rna_binder\",\n    \"sequences\": [\n      {\n        \"proteinChain\": {\n          \"sequence\": \"----------------------------------------------------------------------\",\n          \"count\": 1\n        }\n      },\n      {\n        \"rnaSequence\": {\n          \"sequence\": \"GGCUAGCCAUUUGAC\",\n          \"count\": 1\n        }\n      }\n    ]\n  }\n]\n```\n\nRNA sequences use the letters `A`, `U`, `G`, `C`, and `N` (unknown). Masking RNA positions with `-` is **not supported** — all nucleic acid positions must be fully specified.\n\n```bash\npython runner\u002Finference.py \\\n  experiment=diverse \\\n  effort=max \\\n  input_json_path=input_jsons\u002Fmy_rna_binder.json \\\n  seeds=\\[$(seq -s \",\" 0 4)\\]\n```\n\n#### DNA-binding protein design\n\nDNA works the same way, but uses `dnaSequence` with the letters `A`, `T`, `G`, `C`, and `N`. As with RNA, masking DNA positions with `-` is **not supported** — all positions must be fully specified. For double-stranded DNA, add each strand as a separate entry:\n\n```json\n[\n  {\n    \"name\": \"my_dna_binder\",\n    \"sequences\": [\n      {\n        \"proteinChain\": {\n          \"sequence\": \"----------------------------------------------------------------------\",\n          \"count\": 1\n        }\n      },\n      {\n        \"dnaSequence\": {\n          \"sequence\": \"GATTACAGATC\",\n          \"count\": 1\n        }\n      },\n      {\n        \"dnaSequence\": {\n          \"sequence\": \"GATCTGTAATC\",\n          \"count\": 1\n        }\n      }\n    ]\n  }\n]\n```\n\n> **Important:** `dnaSequence` represents a single strand. 
For double-stranded DNA, you must add the reverse complement as a second `dnaSequence` entry, as shown above.\n\n### Tips for custom designs\n\n- **Protein length:** The number of `-` characters sets the designed protein length. If you're unsure what length to use, try sweeping over a range (e.g., 100, 150, 200) by including multiple jobs in the same JSON file.\n- **Partial masking:** You can fix specific residues and mask others. For example, `\"MKTL----VPEG\"` fixes the termini and designs the middle region.\n- **Multiple seeds:** Use `seeds=[0,1,...,N-1]` or `num_inference_seeds=N` to generate N independent samples per job for diversity.\n- **Experiment preset:** Use `experiment=diverse` for maximum structural diversity (recommended for ligand and nucleic acid targets). Use `experiment=designable` when you want higher confidence that the designed sequence will refold to the intended structure.\n- **Covalent bonds:** You can specify covalent bonds between entities (e.g., a covalently attached ligand) using the optional `covalent_bonds` field of a job.\n\n## 📝 Input JSON Format\n\nThe input JSON closely follows the [AlphaFold Server](https:\u002F\u002Falphafoldserver.com\u002F) format. Each file is a list of jobs, where each job specifies the entities to model. The `covalent_bonds` field is optional and may be omitted:\n\n```json\n[\n  {\n    \"name\": \"my_design\",\n    \"sequences\": [\n      {\n        \"proteinChain\": {\n          \"sequence\": \"MKTL------VPEG\",\n          \"count\": 1\n        }\n      },\n      {\n        \"ligand\": {\n          \"ligand\": \"CCD_ATP\",\n          \"count\": 1\n        }\n      }\n    ],\n    \"covalent_bonds\": []\n  }\n]\n```\n\n### Masking with `-`\n\nThe character **`-`** (hyphen) denotes a **masked position** in protein sequences. DISCO will generate residues at these positions. Fixed residues are specified with their standard one-letter codes. 
For example:\n\n- `\"--------\"` — fully masked 8-residue protein (design all positions)\n- `\"MKTL----VPEG\"` — fix the N- and C-terminal residues, design the middle\n\nMasking is only supported for protein chains. DNA and RNA sequences must be fully specified.\n\n### Supported entity types\n\n- **`proteinChain`** — Protein sequence (standard 20 amino acids, `X` for unknown, `-` for masked)\n- **`dnaSequence`** — Single-stranded DNA (`A`, `T`, `G`, `C`, `N` for unknown). Masking is not supported; all positions must be specified.\n- **`rnaSequence`** — Single-stranded RNA (`A`, `U`, `G`, `C`, `N` for unknown). Masking is not supported; all positions must be specified.\n- **`ligand`** — Small molecule, specified as:\n  - A CCD code (e.g., `CCD_ATP`)\n  - A SMILES string\n  - A path to an SDF\u002FMOL\u002FMOL2\u002FPDB file, prefixed with `FILE_` (e.g., `FILE_path\u002Fto\u002Fligand.sdf`). The path can be **absolute** or **relative to the repository root**. For example, `FILE_studio-179\u002Fpriority_1\u002Fheme_b_final_0.sdf` would resolve relative to the repo base directory.\n- **`ion`** — Ion specified by CCD code (e.g., `MG`, `ZN`)\n\n## ⚠️ Known Limitations\n\n- **No protein-protein complex design.** DISCO does not currently support designing multi-chain protein complexes. The language model used to replace the MSA module (DPLM) was trained exclusively on single-chain proteins, and as a result it predicts multi-chain proteins poorly. Inputs containing more than one protein chain will raise an error. 
Protein-ligand, protein-DNA, and protein-RNA complexes with a single protein chain are fully supported.\n- **No motif scaffolding.** Motif scaffolding (designing a protein around a fixed structural motif) is not currently supported but will be added in a future update.\n\n## 🔜 Coming Soon\n\n- **Feynman-Kac correctors.** The Feynman-Kac correctors code for improved sampling will be added to the repository shortly.\n- **Training code.** Code for training DISCO from scratch will also be released.\n\n## 🙏 Acknowledgements\n\nWe gratefully acknowledge the authors of [Protenix](https:\u002F\u002Fgithub.com\u002Fbytedance\u002Fprotenix), as this codebase is built on top of their repository.\n\n## 📖 Citation\n\n```bibtex\n@Article{disco2026,\n      title={General Multimodal Protein Design Enables DNA-Encoding of Chemistry},\n      author={Jarrid Rector-Brooks and Théophile Lambert and Marta Skreta and Daniel Roth and Yueming Long and Zi-Qi Li and Xi Zhang and Miruna Cretu and Francesca-Zhoufan Li and Tanvi Ganapathy and Emily Jin and Avishek Joey Bose and Jason Yang and Kirill Neklyudov and Yoshua Bengio and Alexander Tong and Frances H. Arnold and Cheng-Hao Liu},\n      year={2026},\n      eprint={2604.05181},\n      archivePrefix={arXiv},\n      primaryClass={cs.LG},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2604.05181},\n}\n```\n","DISCO is a multimodal generative model that simultaneously co-designs protein sequences and 3D structures and can co-fold them with arbitrary biomolecules, including small-molecule ligands, DNA, and RNA. By generating both modalities jointly, it lets sequence-based objectives inform structure generation and vice versa, achieving state-of-the-art in silico performance in generating binders for diverse biomolecular targets. It can also design heme enzymes with novel active-site geometries without pre-specifying catalytic residues or relying on template scaffolds. These enzymes catalyze new-to-nature carbene-transfer reactions, with activities exceeding those of engineered enzymes. DISCO is suited to fields such as drug discovery and enzyme engineering that require precise control over protein properties and function. The project is written in Python and released under the Apache License 2.0.","2026-06-11 02:46:05","CREATED_QUERY"]