[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-9752":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":17,"compositeScore":19,"rankGlobal":10,"rankLanguage":10,"license":20,"archived":21,"fork":21,"defaultBranch":22,"hasWiki":23,"hasPages":21,"topics":24,"createdAt":10,"pushedAt":10,"updatedAt":31,"readmeContent":32,"aiSummary":33,"trendingCount":16,"starSnapshotCount":16,"syncStatus":34,"lastSyncTime":35,"discoverSource":36},9752,"PaLM-rlhf-pytorch","lucidrains\u002FPaLM-rlhf-pytorch","lucidrains","Implementation of RLHF (Reinforcement Learning with Human Feedback) on top of the PaLM architecture. Basically ChatGPT but with PaLM","",null,"Python",7864,676,133,17,0,1,4,39.49,"MIT License",false,"main",true,[25,26,27,28,29,30],"artificial-intelligence","attention-mechanisms","deep-learning","human-feedback","reinforcement-learning","transformers","2026-06-12 02:02:12","\u003Cimg src=\".\u002Fchatgpt.png\" width=\"450px\">\u003C\u002Fimg>\n\n*\u003Ca href=\"https:\u002F\u002Fopenai.com\u002Fblog\u002Fchatgpt\u002F\">official chatgpt blogpost\u003C\u002Fa>*\n\n## PaLM + RLHF - Pytorch (wip)\n\nImplementation of RLHF (Reinforcement Learning with Human Feedback) on top of the PaLM architecture. 
Maybe I'll add retrieval functionality too, à la \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Flucidrains\u002FRETRO-pytorch\">RETRO\u003C\u002Fa>\n\nIf you are interested in replicating something like ChatGPT out in the open, please consider joining \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FxBPBXfcFHd\">Laion \u003Cimg alt=\"Join us on Discord\" src=\"https:\u002F\u002Fimg.shields.io\u002Fdiscord\u002F823813159592001537?color=5865F2&logo=discord&logoColor=white\">\u003C\u002Fa>\n\nPotential successor: \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.18290\">Direct Preference Optimization\u003C\u002Fa> - all the code in this repo becomes ~ binary cross entropy loss, \u003C 5 loc. So much for Reward models and PPO\n\n## FAQ\n\n- Does this contain a model for inference?\n\nThere is no trained model. This is just the ship and overall map. We still need millions of dollars of compute + data to sail to the correct point in high dimensional parameter space. Even then, you need professional sailors (like Robin Rombach of Stable Diffusion fame) to actually guide the ship through turbulent times to that point.\n\n## Community\n\n\u003Ca href=\"https:\u002F\u002Fcarper.ai\u002F\">CarperAI\u003C\u002Fa> had been working on \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FCarperAI\u002Ftrlx\">an RLHF framework\u003C\u002Fa> for large language models for many months prior to the release of ChatGPT.\n\n\u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=sswA4j_IUxg\">Yannic Kilcher\u003C\u002Fa> is also working on an \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FLAION-AI\u002FOpen-Assistant\">open sourced implementation\u003C\u002Fa>\n\n\u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=SWwQ3k-DWyo\">AI Coffeebreak w\u002F Letitia\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=NpmnWgQgcsA\">Code Emporium\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=_MPJ3CyDokU\">Code 
Emporium Part 2\u003C\u002Fa>\n\n## Appreciation\n\n- \u003Ca href=\"https:\u002F\u002Fstability.ai\u002F\">Stability.ai\u003C\u002Fa> for the generous sponsorship to work on cutting edge artificial intelligence research\n\n- \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002F\">🤗 Hugging Face\u003C\u002Fa> and \u003Ca href=\"https:\u002F\u002Fcarper.ai\u002F\">CarperAI\u003C\u002Fa> for penning the blog post \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fblog\u002Frlhf\">Illustrating Reinforcement Learning from Human Feedback (RLHF)\u003C\u002Fa>, and the former also for their \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Faccelerate\u002Findex\">accelerate\u003C\u002Fa> library\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fkisseternity\">@kisseternity\u003C\u002Fa> and \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Ftaynoel84\">@taynoel84\u003C\u002Fa> for the code review and finding bugs\n\n- \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fconceptofmind\">Enrico\u003C\u002Fa> for integrating \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.14135\">Flash Attention\u003C\u002Fa> from Pytorch 2.0\n\n- [bycloud](https:\u002F\u002Fwww.youtube.com\u002F@bycloudAI) for his educational video for [Reasoning with Exploration](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=uOrJUksvIhs)\n\n## Install\n\n```bash\n$ pip install palm-rlhf-pytorch\n```\n\n## Usage\n\nFirst train `PaLM`, like any other autoregressive transformer\n\n```python\nimport torch\nfrom palm_rlhf_pytorch import PaLM\n\npalm = PaLM(\n    num_tokens = 20000,\n    dim = 512,\n    depth = 12,\n    flash_attn = True # https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.14135\n).cuda()\n\nseq = torch.randint(0, 20000, (1, 2048)).cuda()\n\nloss = palm(seq, return_loss = True)\nloss.backward()\n\n# after much training, you can now generate sequences\n\ngenerated = palm.generate(2048) # (1, 2048)\n```\n\nThen train your reward model, with the curated human feedback. 
In the original paper, they could not get the reward model to be finetuned from a pretrained transformer without overfitting, but I gave the option to finetune with `LoRA` anyways, since it is still open research.\n\n```python\nimport torch\nfrom palm_rlhf_pytorch import PaLM, RewardModel\n\npalm = PaLM(\n    num_tokens = 20000,\n    dim = 512,\n    depth = 12,\n    causal = False\n)\n\nreward_model = RewardModel(\n    palm,\n    num_binned_output = 5 # say rating from 1 to 5\n).cuda()\n\n# mock data\n\nseq = torch.randint(0, 20000, (1, 1024)).cuda()\nprompt_mask = torch.zeros(1, 1024).bool().cuda() # which part of the sequence is prompt, which part is response\nlabels = torch.randint(0, 5, (1,)).cuda()\n\n# train\n\nloss = reward_model(seq, prompt_mask = prompt_mask, labels = labels)\nloss.backward()\n\n# after much training\n\nreward = reward_model(seq, prompt_mask = prompt_mask)\n```\n\nThen you will pass your transformer and the reward model to the `RLHFTrainer`\n\n```python\nimport torch\nfrom palm_rlhf_pytorch import PaLM, RewardModel, RLHFTrainer\n\n# load your pretrained palm\n\npalm = PaLM(\n    num_tokens = 20000,\n    dim = 512,\n    depth = 12\n).cuda()\n\npalm.load('.\u002Fpath\u002Fto\u002Fpretrained\u002Fpalm.pt')\n\n# load your pretrained reward model\n\nreward_model = RewardModel(\n    palm,\n    num_binned_output = 5\n).cuda()\n\nreward_model.load('.\u002Fpath\u002Fto\u002Fpretrained\u002Freward_model.pt')\n\n# ready your list of prompts for reinforcement learning\n\nprompts = torch.randint(0, 256, (50000, 512)).cuda() # 50k prompts\n\n# pass it all to the trainer and train\n\ntrainer = RLHFTrainer(\n    palm = palm,\n    reward_model = reward_model,\n    prompt_token_ids = prompts\n)\n\ntrainer.train(num_episodes = 50000)\n\n# then, if it succeeded...\n# generate say 10 samples and use the reward model to return the best one\n\nanswer = trainer.generate(2048, prompt = prompts[0], num_samples = 10) # (\u003C= 2048,)\n```\n\n## Todo\n\n- [x] clone 
base transformer with separate lora for critic\n- [x] also allow for non-LoRA based finetuning\n- [x] redo normalize to be able to have a masked version, not sure if anyone will ever use per token rewards \u002F values, but good practice to implement\n- [x] equip with \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FhazyResearch\u002Fflash-attention\">the best attention\u003C\u002Fa>\n\n- [ ] add Hugging Face accelerate and test out wandb instrumentation\n- [ ] search literature to figure out what is the latest SOTA for PPO, assuming the RL field is still making progress.\n- [ ] test the system using a pretrained sentiment network as reward model\n- [ ] write the memory in PPO to memmapped numpy file\n- [ ] get sampling with variable-length prompts working, even if it is not needed given bottleneck is human feedback\n- [ ] allow for finetuning penultimate N layers only in either actor or critic, assuming it is pretrained\n- [ ] incorporate some learning points from Sparrow, given Letitia's video\n- [ ] simple web interface with django + htmx for collecting human feedback\n- [ ] consider \u003Ca href=\"https:\u002F\u002Fwww.anthropic.com\u002Fconstitutional.pdf\">RLAIF\u003C\u002Fa>\n\n## Citations\n\n```bibtex\n@article{Stiennon2020LearningTS,\n    title   = {Learning to summarize from human feedback},\n    author  = {Nisan Stiennon and Long Ouyang and Jeff Wu and Daniel M. Ziegler and Ryan J. 
Lowe and Chelsea Voss and Alec Radford and Dario Amodei and Paul Christiano},\n    journal = {ArXiv},\n    year    = {2020},\n    volume  = {abs\u002F2009.01325}\n}\n```\n\n```bibtex\n@inproceedings{Chowdhery2022PaLMSL,\n    title   = {PaLM: Scaling Language Modeling with Pathways},\n    author  = {Aakanksha Chowdhery and Sharan Narang and Jacob Devlin and Maarten Bosma and Gaurav Mishra and Adam Roberts and Paul Barham and Hyung Won Chung and Charles Sutton and Sebastian Gehrmann and Parker Schuh and Kensen Shi and Sasha Tsvyashchenko and Joshua Maynez and Abhishek Rao and Parker Barnes and Yi Tay and Noam M. Shazeer and Vinodkumar Prabhakaran and Emily Reif and Nan Du and Benton C. Hutchinson and Reiner Pope and James Bradbury and Jacob Austin and Michael Isard and Guy Gur-Ari and Pengcheng Yin and Toju Duke and Anselm Levskaya and Sanjay Ghemawat and Sunipa Dev and Henryk Michalewski and Xavier Garc{\\'i}a and Vedant Misra and Kevin Robinson and Liam Fedus and Denny Zhou and Daphne Ippolito and David Luan and Hyeontaek Lim and Barret Zoph and Alexander Spiridonov and Ryan Sepassi and David Dohan and Shivani Agrawal and Mark Omernick and Andrew M. Dai and Thanumalayan Sankaranarayana Pillai and Marie Pellat and Aitor Lewkowycz and Erica Oliveira Moreira and Rewon Child and Oleksandr Polozov and Katherine Lee and Zongwei Zhou and Xuezhi Wang and Brennan Saeta and Mark Diaz and Orhan Firat and Michele Catasta and Jason Wei and Kathleen S. Meier-Hellstern and Douglas Eck and Jeff Dean and Slav Petrov and Noah Fiedel},\n    year    = {2022}\n}\n```\n\n```bibtex\n@article{Hu2021LoRALA,\n    title   = {LoRA: Low-Rank Adaptation of Large Language Models},\n    author  = {Edward J. 
Hu and Yelong Shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Weizhu Chen},\n    journal = {ArXiv},\n    year    = {2021},\n    volume  = {abs\u002F2106.09685}\n}\n```\n\n```bibtex\n@inproceedings{Sun2022ALT,\n    title     = {A Length-Extrapolatable Transformer},\n    author    = {Yutao Sun and Li Dong and Barun Patra and Shuming Ma and Shaohan Huang and Alon Benhaim and Vishrav Chaudhary and Xia Song and Furu Wei},\n    year      = {2022}\n}\n```\n\n```bibtex\n@misc{gilmer2023intriguing,\n    title  = {Intriguing Properties of Transformer Training Instabilities},\n    author = {Justin Gilmer and Andrea Schioppa and Jeremy Cohen},\n    year   = {2023},\n    status = {to be published - one attention stabilization technique is circulating within Google Brain, being used by multiple teams}\n}\n```\n\n```bibtex\n@inproceedings{dao2022flashattention,\n    title   = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},\n    author  = {Dao, Tri and Fu, Daniel Y. 
and Ermon, Stefano and Rudra, Atri and R{\\'e}, Christopher},\n    booktitle = {Advances in Neural Information Processing Systems},\n    year    = {2022}\n}\n```\n\n```bibtex\n@misc{Rubin2024,\n    author  = {Ohad Rubin},\n    url     = {https:\u002F\u002Fmedium.com\u002F@ohadrubin\u002Fexploring-weight-decay-in-layer-normalization-challenges-and-a-reparameterization-solution-ad4d12c24950}\n}\n```\n\n```bibtex\n@inproceedings{Yuan2024FreePR,\n    title   = {Free Process Rewards without Process Labels},\n    author  = {Lifan Yuan and Wendi Li and Huayu Chen and Ganqu Cui and Ning Ding and Kaiyan Zhang and Bowen Zhou and Zhiyuan Liu and Hao Peng},\n    year    = {2024},\n    url     = {https:\u002F\u002Fapi.semanticscholar.org\u002FCorpusID:274445748}\n}\n```\n\n```bibtex\n@article{Shao2024DeepSeekMathPT,\n    title   = {DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models},\n    author  = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Jun-Mei Song and Mingchuan Zhang and Y. K. 
Li and Yu Wu and Daya Guo},\n    journal = {ArXiv},\n    year    = {2024},\n    volume  = {abs\u002F2402.03300},\n    url     = {https:\u002F\u002Fapi.semanticscholar.org\u002FCorpusID:267412607}\n}\n```\n\n```bibtex\n@article{Farebrother2024StopRT,\n    title   = {Stop Regressing: Training Value Functions via Classification for Scalable Deep RL},\n    author  = {Jesse Farebrother and Jordi Orbay and Quan Ho Vuong and Adrien Ali Taiga and Yevgen Chebotar and Ted Xiao and Alex Irpan and Sergey Levine and Pablo Samuel Castro and Aleksandra Faust and Aviral Kumar and Rishabh Agarwal},\n    journal = {ArXiv},\n    year   = {2024},\n    volume = {abs\u002F2403.03950},\n    url    = {https:\u002F\u002Fapi.semanticscholar.org\u002FCorpusID:268253088}\n}\n```\n\n```bibtex\n@misc{Liu2025,\n    title   = {Understanding R1-Zero-Like Training: A Critical Perspective},\n    author  = {Zichen Liu and Changyu Chen and Wenjun Li and Penghui Qi and Tianyu Pang and Chao Du and Wee Sun Lee and Min Lin},\n    url     = {https:\u002F\u002Fgithub.com\u002Fsail-sg\u002Funderstand-r1-zero\u002Fblob\u002Fmain\u002Funderstand-r1-zero.pdf}\n}\n```\n\n```bibtex\n@inproceedings{Yue2025DoesRL,\n    title   = {Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?},\n    author  = {Yang Yue and Zhiqi Chen and Rui Lu and Andrew Zhao and Zhaokai Wang and Shiji Song and Gao Huang},\n    year    = {2025},\n    url     = {https:\u002F\u002Fapi.semanticscholar.org\u002FCorpusID:277940134}\n}\n```\n\n```bibtex\n@misc{xie2025simplepolicyoptimization,\n    title   = {Simple Policy Optimization}, \n    author  = {Zhengpeng Xie and Qiang Zhang and Fan Yang and Marco Hutter and Renjing Xu},\n    year    = {2025},\n    eprint  = {2401.16025},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.LG},\n    url = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.16025}, \n}\n```\n\n```bibtex\n@misc{cheng2025reasoningexplorationentropyperspective,\n    title   = {Reasoning with Exploration: 
An Entropy Perspective on Reinforcement Learning for LLMs}, \n    author  = {Daixuan Cheng and Shaohan Huang and Xuekai Zhu and Bo Dai and Wayne Xin Zhao and Zhenliang Zhang and Furu Wei},\n    year    = {2025},\n    eprint  = {2506.14758},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.CL},\n    url     = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14758}, \n}\n```\n\n```bibtex\n@article{Zhu2025FlowRL,\n    title   = {FlowRL: Matching Reward Distributions for LLM Reasoning},\n    author  = {Xuekai Zhu and Daixuan Cheng and Dinghuai Zhang and Hengli Li and Kaiyan Zhang and Che Jiang and Youbang Sun and Ermo Hua and Yuxin Zuo and Xingtai Lv and Qizheng Zhang and Lin Chen and Fanghao Shao and Bo Xue and Yunchong Song and Zhenjie Yang and Ganqu Cui and Ning Ding and Jianfeng Gao and Xiaodong Liu and Bowen Zhou and Hongyuan Mei and Zhouhan Lin},\n    journal = {ArXiv},\n    year    = {2025},\n    volume  = {abs\u002F2509.15207},\n    eprint  = {2509.15207},\n    archivePrefix = {arXiv},\n    primaryClass = {cs.LG},\n    url     = {https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.15207}\n}\n```\n","This project implements RLHF (Reinforcement Learning with Human Feedback) on top of the PaLM architecture, similar to ChatGPT. Its core functionality uses deep learning and attention mechanisms to improve the quality of model output through human feedback. Technically, the project is written in Python and uses the PyTorch framework. It suits applications that need to keep improving text generation quality based on user feedback, such as customer-service chatbots and content-creation assistants. It may also support a retrieval extension. Note that the project does not include a pretrained model; substantial compute and expert tuning are still required before actual deployment.",2,"2026-06-11 03:24:34","top_topic"]