[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-77633":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":13,"contributorsCount":14,"subscribersCount":14,"size":14,"stars1d":13,"stars7d":15,"stars30d":16,"stars90d":14,"forks30d":14,"starsTrendScore":17,"compositeScore":18,"rankGlobal":9,"rankLanguage":9,"license":19,"archived":20,"fork":20,"defaultBranch":21,"hasWiki":22,"hasPages":20,"topics":23,"createdAt":9,"pushedAt":9,"updatedAt":24,"readmeContent":25,"aiSummary":26,"trendingCount":14,"starSnapshotCount":14,"syncStatus":27,"lastSyncTime":28,"discoverSource":29},77633,"carbon","huggingface\u002Fcarbon","huggingface","The home of Carbon Genomic Foundation Model 🧬",null,"Python",193,27,1,0,6,104,3,4.34,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:03:43","# Carbon\n\nGenomic foundation models from Hugging Face. Carbon is a family of causal\nlanguage models trained on **1T tokens of DNA \u002F 6T DNA base pairs** from the\n[Carbon Pretraining Corpus](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHuggingFaceBio\u002Fcarbon-pretraining-corpus),\na curated mix of DNA & RNA sequences.\n\nThis repo contains:\n\n* the eval code for Carbon tasks: sequence recovery, variant\neffect prediction, and perturbations. 
We put this together because the zero-shot DNA eval landscape is\ncurrently scattered — useful tasks live in different repos, often buried\nalongside evals that need finetuning or that are already saturated, which\nmakes reproducibility harder.\n* scripts for fine-tuning the Carbon models on downstream tasks.\n\n## Contents\n\n- [Models](#models)\n- [Installation](#installation)\n- [Inference](#inference)\n- [Pretraining](#pretraining)\n- [Evaluation](#evaluation)\n- [Finetuning](#finetuning)\n\n## Models\n\n| Model | Params | Notes |\n|---|---|---|\n| [`HuggingFaceBio\u002FCarbon-500M`](https:\u002F\u002Fhuggingface.co\u002FHuggingFaceBio\u002FCarbon-500M) | 500M | Draft model for speculative decoding. |\n| **[`HuggingFaceBio\u002FCarbon-3B`](https:\u002F\u002Fhuggingface.co\u002FHuggingFaceBio\u002FCarbon-3B)** | 3B | **Flagship.** Matches or beats Evo2 7B. |\n| [`HuggingFaceBio\u002FCarbon-8B`](https:\u002F\u002Fhuggingface.co\u002FHuggingFaceBio\u002FCarbon-8B) | 8B | Larger model for more performance. |\n\nThe Carbon checkpoints use a **hybrid tokenizer**: BPE for English text and 6-mer\nfor DNA, switched by a `\u003Cdna>` tag mid-sequence. 
That's why every inference\nor eval snippet below wraps DNA inputs with `\u003Cdna>` — see\n[evaluation\u002FREADME.md](evaluation\u002FREADME.md) for the full DNA-tag explanation.\n\n## Installation\n\nInstall the core runtime dependencies with:\n\n```bash\nuv sync\n```\n\nTo include evaluation dependencies, run:\n\n```bash\nuv sync --group evaluation\n```\n\nFor Evo2-backed evaluation, install the evaluation and Evo2 dependency groups:\n\n```bash\nuv sync --group evaluation --group evo2\n```\n\n## Inference\n\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\nmodel_id = \"HuggingFaceBio\u002FCarbon-3B\"\ntok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)\nmodel = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True,\n                                             torch_dtype=\"bfloat16\").to(\"cuda\")\n\n# DNA generation: wrap the prompt with \u003Cdna> so the tokenizer routes to 6-mer mode.\ncontext = \"ATGGCCTCGAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG\"\nprompt = f\"\u003Cdna>{context}\"\ninputs = tok(prompt, return_tensors=\"pt\", add_special_tokens=False).to(\"cuda\")\n\nout = model.generate(**inputs, max_new_tokens=10, do_sample=False)\nprint(tok.decode(out[0]))\n```\n\nFor zero-shot variant scoring, feed the model the full sequence and read\nthe log-likelihood — see [`evaluation\u002Fvep_eval.py`](evaluation\u002Fvep_eval.py).\n\n## Pretraining\n\n### Training data\n\nCarbon was trained on **1 T tokens (≈ 6 T DNA base pairs)** drawn from the\n[Carbon Pretraining Corpus](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHuggingFaceBio\u002Fcarbon-pretraining-corpus), a mix of:\n\n- **Eukaryote genes** (animals, plants, fungi, protists) — functional genomic regions extracted from RefSeq (from the GENERator training mix).\n- **mRNA transcripts** — processed, spliced mRNA from OpenGenome2.\n- **Prokaryote genomes** — long chromosomal chunks from bacteria and archaea\n  (GTDB v220 + IMG\u002FPR), included 
as a smaller fraction (~10 % of the\n  training mixture).\n\nThe mixture is **eukaryote-heavy by design**, because Carbon's target use case is\neukaryotic sequences. The prokaryote share is kept at ~10 % of the pretraining mixture so that the model\ncan still be continually pretrained on prokaryote species.\n\n### Pretraining code\n\nCarbon was trained with our Megatron-LM fork:\n[**huggingface\u002FMegatron-LM-Carbon**](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002FMegatron-LM-Carbon).\nThe fork adds:\n\n- Hybrid loss: a loss that bridges coarse 6-mer tokenization and single-nucleotide resolution.\n- Carbon training scripts.\n\n## Evaluation\n\nThis repo ships a **suite of seven zero-shot DNA evaluations** with\nreproducible code. The benchmark datasets are available in this [collection](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FHuggingFaceBio\u002Fdna-benchmarks).\n\nThe suite covers four modes of zero-shot evaluation:\n\n- **Variant effect prediction**, with three established benchmarks spanning\n  both coding (BRCA2) and non-coding regulatory variants (TraitGym\n  Mendelian), plus ClinVar for broad pathogenic-vs-benign coverage.\n- **A generative task** — sequence recovery, ported from the GENERator paper.\n- **Two perturbation tasks** we built — CAG repeat insertion and\n  synonymous-codon substitution — to probe regulatory-motif awareness and\n  codon-usage structure.\n- **A long-context retrieval task** we built — Genome-NIAH, a needle-in-a-haystack eval\n  adapted to DNA (four tasks × six context lengths up to 786 kbp).\n\nAll eval scripts live in [`evaluation\u002F`](evaluation). Each one runs on Carbon,\nGENERator, or Evo2 via a single backend flag, so numbers are directly\ncomparable across model families.\n\n| Benchmark | What it measures | Script |\n|---|---|---|\n| **Sequence recovery** | Given a DNA context, generate the next 30 bp; score per-base accuracy against the held-out continuation. Training-free generative eval from the GENERator paper. 
| [`sequence_recovery.py`](evaluation\u002Fsequence_recovery.py) |\n| **CAG repeat insertion** | A 30 bp codon-aligned region 60 bp into the CDS exon is replaced with 10 consecutive CAG triplets, mimicking polyglutamine expansion disorders (HD, SCAs, DRPLA). The patch is length- and reading-frame-preserving; all sequence outside is identical. The model should assign higher likelihood to the intact native sequence. | [`perturbation_tasks.py`](evaluation\u002Fperturbation_tasks.py) `--task motif_human` |\n| **Synonymous codon substitution** | CDS codons are replaced with the highest-frequency synonym for the target species (human or mouse); amino acid identity is preserved by construction. The model should prefer native codon usage over the codon-optimised variant. Probes coding-sequence structure and species-specific codon bias. | [`perturbation_tasks.py`](evaluation\u002Fperturbation_tasks.py) `--task syn_human` \u002F `--task syn_mouse` |\n| **BRCA2 VEP** | Zero-shot VEP on saturation-mutagenesis BRCA2 ([Huang 2025](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41586-024-08388-8)). Centered 8 kb window + full-LL delta. | [`vep_eval.py`](evaluation\u002Fvep_eval.py) |\n| **TraitGym Mendelian** | 3,380 fine-mapped non-coding regulatory variants for 113 Mendelian diseases ([Benegas et al. 2025](https:\u002F\u002Fwww.biorxiv.org\u002Fcontent\u002F10.1101\u002F2025.02.11.637758v1)). Centered 8 kb window + full-LL delta. | [`vep_eval.py`](evaluation\u002Fvep_eval.py) |\n| **ClinVar** | Pathogenic vs benign on curated coding + noncoding ClinVar variants. Right-end \u002F next-token scoring with 24 kb left context. | [`clinvar_vep_eval.py`](evaluation\u002Fclinvar_vep_eval.py) (uses [`HuggingFaceBio\u002Fclinvar-vep-final`](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHuggingFaceBio\u002Fclinvar-vep-final) directly) |\n| **Genome-NIAH** | Long-context retrieval: insert a (key, value) pair in a real-genome haystack, ask the model to retrieve the value. 
Four tasks × six context lengths (up to 786 kbp). | [`genome_niah_eval.py`](evaluation\u002Fgenome_niah_eval.py) |\n\nSee [`evaluation\u002FREADME.md`](evaluation\u002FREADME.md) for run commands, DNA-tag\nflags, and per-benchmark details.\n\n## Finetuning\n\nA minimal end-to-end finetuning example (promoter detection from the\nNucleotide Transformer downstream benchmark) lives in\n[`finetuning\u002F`](finetuning). It uses the standard 🤗 Transformers `Trainer`\nwith `AutoModelForSequenceClassification` on top of the Carbon backbone — swap\nin any other classification dataset by changing one flag.\n\nTo specialise Carbon on a new clade (e.g. a specific bacterium or protist\nthat wasn't well represented in the pretraining mix), the same scaffolding\nworks for **continual pretraining**: load the model with\n`AutoModelForCausalLM`, feed it sequences with the `\u003Cdna>` tag, and continue\ntraining on next-token loss. The ~10 % prokaryote slice in the pretraining\ndata means the model already has a reasonable starting point even for\nbacterial sequences.\n\n## Acknowledgements\n\nCarbon is a collaboration between the research teams at Hugging Face, Zhongguancun Academy, and TIGEM\u002FUniversity of Naples “Federico II”.\n\n## License\n\nApache 2.0.\n","Carbon is a genomic foundation model developed by Hugging Face for processing DNA and RNA sequences. It is trained on a dataset of about 1T tokens (about 6T DNA base pairs) and is released as several pretrained models at 500M, 3B, and 8B parameters. Its core contents are evaluation code for sequence recovery, variant effect prediction, and perturbation analysis, plus scripts for fine-tuning on downstream tasks. Carbon uses a hybrid tokenizer that handles both English text and DNA sequences. The project suits bioinformatics researchers who need to process large-scale genomic data efficiently, especially for zero-shot prediction and analysis of DNA sequences.",2,"2026-06-11 03:55:41","CREATED_QUERY"]