[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-72589":3},{"id":4,"name":5,"fullName":6,"owner":5,"repo":5,"description":7,"homepage":8,"htmlUrl":9,"language":10,"languages":9,"totalLinesOfCode":9,"stars":11,"forks":12,"watchers":13,"openIssues":14,"contributorsCount":15,"subscribersCount":15,"size":15,"stars1d":15,"stars7d":15,"stars30d":16,"stars90d":15,"forks30d":15,"starsTrendScore":15,"compositeScore":17,"rankGlobal":9,"rankLanguage":9,"license":18,"archived":19,"fork":19,"defaultBranch":20,"hasWiki":21,"hasPages":19,"topics":22,"createdAt":9,"pushedAt":9,"updatedAt":23,"readmeContent":24,"aiSummary":25,"trendingCount":15,"starSnapshotCount":15,"syncStatus":26,"lastSyncTime":27,"discoverSource":28},72589,"OpenCoder-llm","OpenCoder-llm\u002FOpenCoder-llm","The Open Cookbook for Top-Tier Code Large Language Model","https:\u002F\u002Fopencoder-llm.github.io\u002F",null,"Python",2088,124,32,11,0,6,58.89,"MIT License",false,"main",true,[],"2026-06-12 04:01:06","\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FOpenCoder-llm\u002Fopencoder-llm.github.io\u002Fblob\u002Fmain\u002Fstatic\u002Fimages\u002Fopencoder_icon.jpg?raw=true\" width=\"30%\" alt=\"OpenCoder-Icon\" \u002F>\n\u003C\u002Fdiv>\n\n\n\u003Cp align=\"center\">\n    \u003Ch1 align=\"center\">\n\u003C!--         \u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F93406728-e93f-4a90-9edc-adc346dedbf3\"\n         alt=\"Logo\" width=\"65\"\n        height=\"65\" style=\"vertical-align: middle;\"> -->\n        OpenCoder\n    \u003C\u002Fh1>\n     \u003Cp align=\"center\">⚡ The Open Cookbook for Top-Tier Code Large Language Models ⚡\u003C\u002Fp>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n        🏠\u003Ca href=\"https:\u002F\u002Fopencoder-llm.github.io\u002F\">Home Page\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp🤗\u003Ca 
href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Finfly\u002Fopencoder-672cec44bbb86c39910fb55e\">Model\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp📊\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FOpenCoder-LLM\u002Fopencoder-datasets-672e6db6a0fed24bd69ef1c2\">Dataset\u003C\u002Fa>&nbsp&nbsp | &nbsp&nbsp📄\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04905\">Paper\u003C\u002Fa>&nbsp ｜ 🚀\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenCoder-LLM\u002FOpenCoder-8B-Instruct\">Demo\u003C\u002Fa>&nbsp&nbsp\n\u003C\u002Fp>\n\n![12](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F3aa8dd8f-b12a-46e7-a543-d81cfd175d30)\n\n## News\n- 🔥🔥🔥 ```2024\u002F12\u002F08``` We have released our pretraining data cleaning pipeline: [opc_data_filtering](https:\u002F\u002Fgithub.com\u002FOpenCoder-llm\u002Fopc_data_filtering). Try to use this pipeline to create your own high-quality code pretraining corpus!\n- 🔥 ```2024\u002F11\u002F19``` We have released intermediate checkpoints from our pretraining stage: 🤗 [OpenCoder-1.5B-Base-Checkpoints](https:\u002F\u002Fhuggingface.co\u002FOpenCoder-LLM\u002FOpenCoder-1.5B-Base-Checkpoints) and 🤗 [OpenCoder-8B-Base-Checkpoints](https:\u002F\u002Fhuggingface.co\u002FOpenCoder-LLM\u002FOpenCoder-8B-Base-Checkpoints).\n- 🔥 ```2024\u002F11\u002F15``` We have released the metadata of **RefineCode** 📊 [RefineCode-code-corpus-meta](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenCoder-LLM\u002FRefineCode-raw-code-meta). 
You can collect your own **RefineCode** by referring to this dataset!\n- 🔥 ```2024\u002F11\u002F12``` We have released our efficient CodeLLM evaluation framework: [OpenCodeEval](https:\u002F\u002Fgithub.com\u002FOpenCoder-llm\u002FOpenCoder-llm\u002Ftree\u002Fmain\u002FOpenCodeEval).\n- 🔥 ```2024\u002F11\u002F12``` We have released high-quality annealing data 📊 [opc-annealing-corpus](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenCoder-LLM\u002Fopc-annealing-corpus), which includes algorithmic-corpus along with corresponding synthetic data.\n- 🔥 ```2024\u002F11\u002F11``` We have released 55B of recalled pages from [Fineweb](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHuggingFaceFW\u002Ffineweb), including 📊 [fineweb-code-corpus](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenCoder-LLM\u002Ffineweb-code-corpus) and 📊 [fineweb-math-corpus](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenCoder-LLM\u002Ffineweb-math-corpus).\n- 🔥 ```2024\u002F11\u002F09``` We have released 4.5M Post-training data: 📊 [Dataset](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FOpenCoder-LLM\u002Fopencoder-datasets-672e6db6a0fed24bd69ef1c2).\n- 🔥 ```2024\u002F11\u002F08``` We have released our models! Please download them from 🤗 [Model](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Finfly\u002Fopencoder-672cec44bbb86c39910fb55e).\n- 🔥 ```2024\u002F11\u002F07``` We have released our paper on Arxiv: 📄 [OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04905).\n\n\n## Releases\n- [x] Data cleaning pipeline\n- [x] **RefineCode**: Code-related web data\n- [x] **RefineCode**: Metadata of raw code data \n- [x] Intermediate Checkpoints\n- [x] CodeLLM evaluation framework: OpenCodeEval\n- [x] High-quality annealing data\n- [x] Post-training data\n- [x] Final model weights\n- [x] Paper\n\nWe are working hard to release all those resources! 
💪 \n\n\n## Introduction\n\n**OpenCoder** is an open and reproducible code LLM family which includes 1.5B and 8B base and chat models, supporting both English and Chinese languages. Starting from scratch, OpenCoder is pretrained on 2.5 trillion tokens composed of 90% raw code and 10% code-related web data, and supervised finetuned on over 4.5M high-quality SFT examples, finally reaching the performance of top-tier code LLMs. We provide not only model weights and inference code, but also the reproducible training data, the complete data processing pipeline, rigorous experimental ablation results, and detailed training protocols. Empowering researchers to build and innovate, OpenCoder is your open foundation for advancing code AI. \n\n- **Complete Open Source**: OpenCoder ensures full transparency by releasing not only the model weights and forthcoming inference code but also the complete data-cleaning code for training. This release includes high-quality synthetic data, an extensive set of checkpoints, and a dataset of over 4.5 million supervised fine-tuning (SFT) entries, making OpenCoder one of the most comprehensively open-sourced models available.\n- **Comprehensive Experimental Analysis**: OpenCoder is rigorously tested through extensive ablation studies on various data-cleaning strategies and training processes, including file-level and repository-level deduplication experiments, ensuring thorough exploration and validation of the model’s performance.\n- **High-Quality Synthetic Data**: OpenCoder provides a fully developed synthetic data generation process and over 4.5 million SFT data entries, establishing a robust data foundation for model training and evaluation.\n- **Exceptional Performance**: OpenCoder achieves high performance across multiple language model benchmarks, positioning it among the leading open-source models for code.\n\n\n## Models\n\n\u003C!-- |         Model         | Sequence Length |                                Download                 
                |\n|:---------------------:|:---------------:|:-----------------------------------------------------------------------:|\n| OpenCoder-1.5B-Base  |      4K       | 🤗 [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Finfly\u002FOpenCoder-1.5B-Base)  |\n| OpenCoder-8B-Base  |      8K       | 🤗 [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Finfly\u002FOpenCoder-8B-Base)  |\n| OpenCoder-1.5B-Instruct  |      4K       | 🤗 [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Finfly\u002FOpenCoder-1.5B-Instruct) |\n| OpenCoder-8B-Instruct  |      8K       | 🤗 [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Finfly\u002FOpenCoder-8B-Instruct) | -->\n\n|         Model         | Sequence Length |                   HuggingFace                 |      wisemodel    |        \n|:---------------------:|:---------------:|:-----------------------------------------------------------------------:|:------------------------------------------|\n| OpenCoder-1.5B-Base  |      4K       | [🤗HuggingFace](https:\u002F\u002Fhuggingface.co\u002Finfly\u002FOpenCoder-1.5B-Base)  |  [\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FOpenCoder-llm\u002Fopencoder-llm.github.io\u002Fblob\u002Fmain\u002Fstatic\u002Fimages\u002Fwisemodel_logo.png?raw=true\" height=\"12\">](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FOpenCoder\u002FOpenCoder-1.5B-Base) |\n| OpenCoder-8B-Base  |      8K       | [🤗HuggingFace](https:\u002F\u002Fhuggingface.co\u002Finfly\u002FOpenCoder-8B-Base)  | [\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FOpenCoder-llm\u002Fopencoder-llm.github.io\u002Fblob\u002Fmain\u002Fstatic\u002Fimages\u002Fwisemodel_logo.png?raw=true\" height=\"12\">](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FOpenCoder\u002FOpenCoder-8B-Base) |\n| OpenCoder-1.5B-Instruct  |      4K       | [🤗HuggingFace](https:\u002F\u002Fhuggingface.co\u002Finfly\u002FOpenCoder-1.5B-Instruct) | [\u003Cimg 
src=\"https:\u002F\u002Fgithub.com\u002FOpenCoder-llm\u002Fopencoder-llm.github.io\u002Fblob\u002Fmain\u002Fstatic\u002Fimages\u002Fwisemodel_logo.png?raw=true\" height=\"12\">](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FOpenCoder\u002FOpenCoder-1.5B-Instruct) |\n| OpenCoder-8B-Instruct  |      8K       | [🤗HuggingFace](https:\u002F\u002Fhuggingface.co\u002Finfly\u002FOpenCoder-8B-Instruct) | [\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FOpenCoder-llm\u002Fopencoder-llm.github.io\u002Fblob\u002Fmain\u002Fstatic\u002Fimages\u002Fwisemodel_logo.png?raw=true\" height=\"12\">](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FOpenCoder\u002FOpenCoder-8B-Instruct) |\n\n\n## Datasets\n\n### Pre-training\n|         Dataset       | Size |                                Download                                 |\n|:---------------------:|:---------------:|:-----------------------------------------------------------------------:|\n| fineweb-code-corpus  |      148 GB       | [🤗HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenCoder-LLM\u002Ffineweb-code-corpus)  |\n| fineweb-math-corpus  |       10 GB    | [🤗HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenCoder-LLM\u002Ffineweb-math-corpus)  |\n| opc-annealing-corpus  |      24 GB    | [🤗HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenCoder-LLM\u002Fopc-annealing-corpus)  |\n\n\n### Post-training\n\n|         Dataset       | Num |                                Download                                 |\n|:---------------------:|:---------------:|:-----------------------------------------------------------------------:|\n| opc-sft-stage1  |      4.21 M       | [🤗HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenCoder-LLM\u002Fopc-sft-stage1)  |\n| opc-sft-stage2  |      375 K      | [🤗HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FOpenCoder-LLM\u002Fopc-sft-stage2)  |\n\n\n**This is not the end; we are organizing the 
remaining data and uploading it progressively.**\n\n## Performance\n\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F7f5a49b2-9539-4185-91fa-fd32c1315b2a\" width=\"75%\">\n\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F81c6e686-0ed0-4eb5-8fb8-a651750ec346\" width=\"75%\">\n\n## Get Started\n```python\nimport torch\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\n\nmodel_name = \"infly\u002FOpenCoder-8B-Instruct\"\nmodel = AutoModelForCausalLM.from_pretrained(model_name,\n                                             torch_dtype=torch.bfloat16,\n                                             device_map=\"auto\",\n                                             trust_remote_code=True)\ntokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\n\nmessages=[\n    { 'role': 'user', 'content': \"write a quick sort algorithm in python.\"}\n]\n\ninputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors=\"pt\")\n\noutputs = model.generate(inputs, max_new_tokens=512, do_sample=False)\n\nresult = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)\nprint(result)\n```\n\n## Citation\nIf you find our work helpful, feel free to give us a cite :-)\n\n```bibtex\n@inproceedings{Huang2024OpenCoderTO,\n  title={OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models},\n  author={Siming Huang and Tianhao Cheng and Jason Klein Liu and Jiaran Hao and Liuyihan Song and Yang Xu and J. Yang and J. H. 
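Liu and Chenchen Zhang and Linzheng Chai and Ruifeng Yuan and Zhaoxiang Zhang and Jie Fu and Qian Liu and Ge Zhang and Zili Wang and Yuan Qi and Yinghui Xu and Wei Chu},\n  year={2024},\n  url={https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.04905}\n}\n```\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Fapi.star-history.com\u002Fsvg?repos=OpenCoder-llm\u002FOpenCoder-llm&type=Date)](https:\u002F\u002Fstar-history.com\u002F#OpenCoder-llm\u002FOpenCoder-llm&Date)\n","OpenCoder is an open-source project for building top-tier code large language models. It provides an end-to-end solution spanning data preprocessing, model training, and evaluation, helping users build high-quality models for code generation and understanding. Its core features include an efficient data cleaning pipeline, base model checkpoints at multiple scales, and an evaluation framework optimized for code generation tasks. OpenCoder is implemented in Python and shares its resources through the Hugging Face platform. It suits developers and researchers who need to customize or study advanced code generation models, with broad application prospects in areas such as software engineering education and automated programming assistance.",2,"2026-06-11 03:42:42","high_star"]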