[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-2586":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":17,"stars7d":18,"stars30d":19,"stars90d":16,"forks30d":16,"starsTrendScore":20,"compositeScore":21,"rankGlobal":10,"rankLanguage":10,"license":22,"archived":23,"fork":23,"defaultBranch":24,"hasWiki":25,"hasPages":23,"topics":26,"createdAt":10,"pushedAt":10,"updatedAt":27,"readmeContent":28,"aiSummary":29,"trendingCount":16,"starSnapshotCount":16,"syncStatus":30,"lastSyncTime":31,"discoverSource":32},2586,"trl","huggingface\u002Ftrl","huggingface","Train transformer language models with reinforcement learning.","http:\u002F\u002Fhf.co\u002Fdocs\u002Ftrl",null,"Python",18612,2784,103,426,0,7,71,268,49,45,"Apache License 2.0",false,"main",true,[],"2026-06-12 02:00:42","# TRL - Transformers Reinforcement Learning\n\n\u003Cdiv style=\"text-align: center\">\n    \u003Cpicture>\n        \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftrl-lib\u002Fdocumentation-images\u002Fresolve\u002Fmain\u002FTRL%20banner%20light.png\">\n        \u003Cimg src=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftrl-lib\u002Fdocumentation-images\u002Fresolve\u002Fmain\u002Ftrl_banner_dark.png\" alt=\"TRL Banner\">\n    \u003C\u002Fpicture>\n\u003C\u002Fdiv>\n\n\u003Chr> \u003Cbr>\n\n\u003Ch3 align=\"center\">\n    \u003Cp>A comprehensive library to post-train foundation models\u003C\u002Fp>\n\u003C\u002Fh3>\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fblob\u002Fmain\u002FLICENSE\">\u003Cimg alt=\"License\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fhuggingface\u002Ftrl.svg?color=blue\">\u003C\u002Fa>\n    
\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Findex\">\u003Cimg alt=\"Documentation\" src=\"https:\u002F\u002Fimg.shields.io\u002Fwebsite?label=documentation&url=https%3A%2F%2Fhuggingface.co%2Fdocs%2Ftrl%2Findex&down_color=red&down_message=offline&up_color=blue&up_message=online\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Freleases\">\u003Cimg alt=\"GitHub release\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frelease\u002Fhuggingface\u002Ftrl.svg\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Ftrl-lib\">\u003Cimg alt=\"Hugging Face Hub\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🤗%20Hub-trl--lib-yellow\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n## 🎉 What's New\n\n**TRL v1:** We released TRL v1 — a major milestone that marks a real shift in what TRL is. Read the [blog post](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Ftrl-v1) to learn more.\n\n## Overview\n\nTRL is a cutting-edge library designed for post-training foundation models using advanced techniques like Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO). 
Built on top of the [🤗 Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) ecosystem, TRL supports a variety of model architectures and modalities, and can be scaled up across various hardware setups.\n\n## Highlights\n\n- **Trainers**: Various fine-tuning methods are easily accessible via trainers like [`SFTTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fsft_trainer), [`GRPOTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fgrpo_trainer), [`DPOTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fdpo_trainer), [`RewardTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Freward_trainer) and more.\n\n- **Efficient and scalable**:\n  - Leverages [🤗 Accelerate](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate) to scale from single GPU to multi-node clusters using methods like [DDP](https:\u002F\u002Fpytorch.org\u002Ftutorials\u002Fintermediate\u002Fddp_tutorial.html) and [DeepSpeed](https:\u002F\u002Fgithub.com\u002Fdeepspeedai\u002FDeepSpeed).\n  - Full integration with [🤗 PEFT](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpeft) enables training on large models with modest hardware via quantization and LoRA\u002FQLoRA.\n  - Integrates [🦥 Unsloth](https:\u002F\u002Fgithub.com\u002Funslothai\u002Funsloth) for accelerating training using optimized kernels.\n\n- **Command Line Interface (CLI)**: A simple interface lets you fine-tune models without needing to write code.\n\n## Installation\n\n### Python Package\n\nInstall the library using `pip`:\n\n```bash\npip install trl\n```\n\n### From source\n\nIf you want to use the latest features before an official release, you can install TRL from source:\n\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl.git\n```\n\n### Repository\n\nIf you want to use the examples you can clone the repository with the following command:\n\n```bash\ngit clone 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl.git\n```\n\n## Quick Start\n\nFor more flexibility and control over training, TRL provides dedicated trainer classes to post-train language models or PEFT adapters on a custom dataset. Each trainer in TRL is a light wrapper around the 🤗 Transformers trainer and natively supports distributed training methods like DDP, DeepSpeed ZeRO, and FSDP.\n\n### `SFTTrainer`\n\nHere is a basic example of how to use the [`SFTTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fsft_trainer):\n\n```python\nfrom trl import SFTTrainer\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"trl-lib\u002FCapybara\", split=\"train\")\n\ntrainer = SFTTrainer(\n    model=\"Qwen\u002FQwen2.5-0.5B\",\n    train_dataset=dataset,\n)\ntrainer.train()\n```\n\n### `GRPOTrainer`\n\n[`GRPOTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fgrpo_trainer) implements the [Group Relative Policy Optimization (GRPO) algorithm](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2402.03300) that is more memory-efficient than PPO and was used to train [Deepseek AI's R1](https:\u002F\u002Fhuggingface.co\u002Fdeepseek-ai\u002FDeepSeek-R1).\n\n```python\nfrom datasets import load_dataset\nfrom trl import GRPOTrainer\nfrom trl.rewards import accuracy_reward\n\ndataset = load_dataset(\"trl-lib\u002FDeepMath-103K\", split=\"train\")\n\ntrainer = GRPOTrainer(\n    model=\"Qwen\u002FQwen2.5-0.5B-Instruct\",\n    reward_funcs=accuracy_reward,\n    train_dataset=dataset,\n)\ntrainer.train()\n```\n\n> [!NOTE]\n> For reasoning models, use the `reasoning_accuracy_reward()` function for better results.\n\n### `DPOTrainer`\n\n[`DPOTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fdpo_trainer) implements the popular [Direct Preference Optimization (DPO) algorithm](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2305.18290) that was used to post-train [Llama 
3](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2407.21783) and many other models. Here is a basic example of how to use the `DPOTrainer`:\n\n```python\nfrom datasets import load_dataset\nfrom trl import DPOTrainer\n\ndataset = load_dataset(\"trl-lib\u002Fultrafeedback_binarized\", split=\"train\")\n\ntrainer = DPOTrainer(\n    model=\"Qwen\u002FQwen3-0.6B\",\n    train_dataset=dataset,\n)\ntrainer.train()\n```\n\n### `RewardTrainer`\n\nHere is a basic example of how to use the [`RewardTrainer`](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Freward_trainer):\n\n```python\nfrom trl import RewardTrainer\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"trl-lib\u002Fultrafeedback_binarized\", split=\"train\")\n\ntrainer = RewardTrainer(\n    model=\"Qwen\u002FQwen2.5-0.5B-Instruct\",\n    train_dataset=dataset,\n)\ntrainer.train()\n```\n\n## Command Line Interface (CLI)\n\nYou can use the TRL Command Line Interface (CLI) to quickly get started with post-training methods like Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO):\n\n**SFT:**\n\n```bash\ntrl sft --model_name_or_path Qwen\u002FQwen2.5-0.5B \\\n    --dataset_name trl-lib\u002FCapybara \\\n    --output_dir Qwen2.5-0.5B-SFT\n```\n\n**DPO:**\n\n```bash\ntrl dpo --model_name_or_path Qwen\u002FQwen2.5-0.5B-Instruct \\\n    --dataset_name argilla\u002FCapybara-Preferences \\\n    --output_dir Qwen2.5-0.5B-DPO \n```\n\nRead more about CLI in the [relevant documentation section](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fclis) or use `--help` for more details.\n\n## Development\n\nIf you want to contribute to `trl` or customize it to your needs make sure to read the [contribution guide](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl\u002Fblob\u002Fmain\u002FCONTRIBUTING.md) and make sure you make a dev install:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl.git\ncd trl\u002F\npip install -e .[dev]\n```\n\n## 
Experimental\n\nA minimal incubation area is available under `trl.experimental` for unstable \u002F fast-evolving features. Anything there may change or be removed in any release without notice.\n\nExample:\n\n```python\nfrom trl.experimental.new_trainer import NewTrainer\n```\n\nRead more in the [Experimental docs](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftrl\u002Fexperimental_overview).\n\n## Citation\n\n```bibtex\n@software{vonwerra2020trl,\n  title   = {{TRL: Transformers Reinforcement Learning}},\n  author  = {von Werra, Leandro and Belkada, Younes and Tunstall, Lewis and Beeching, Edward and Thrush, Tristan and Lambert, Nathan and Huang, Shengyi and Rasul, Kashif and Gallouédec, Quentin},\n  license = {Apache-2.0},\n  url     = {https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl},\n  year    = {2020}\n}\n```\n\n## License\n\nThis repository's source code is available under the [Apache-2.0 License](LICENSE).\n","TRL is a comprehensive library for training transformer language models with reinforcement learning. It supports a range of advanced techniques, including Supervised Fine-Tuning (SFT), Group Relative Policy Optimization (GRPO), and Direct Preference Optimization (DPO), and is built on Hugging Face's Transformers ecosystem, making it compatible with a variety of model architectures and modalities. TRL uses Hugging Face's Accelerate library to scale efficiently from a single GPU to multi-node clusters, and integrates with PEFT so that large models can be trained even with limited resources. TRL also provides a command-line interface that lets users fine-tune models without writing code. The project is aimed at researchers and developers who need to further customize pretrained language models for specific tasks or environments.",2,"2026-06-11 02:50:26","top_language"]