[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"project-9735":3},{"id":4,"name":5,"fullName":6,"owner":7,"repo":5,"description":8,"homepage":9,"htmlUrl":10,"language":11,"languages":10,"totalLinesOfCode":10,"stars":12,"forks":13,"watchers":14,"openIssues":15,"contributorsCount":16,"subscribersCount":16,"size":16,"stars1d":16,"stars7d":17,"stars30d":18,"stars90d":16,"forks30d":16,"starsTrendScore":19,"compositeScore":20,"rankGlobal":10,"rankLanguage":10,"license":21,"archived":22,"fork":22,"defaultBranch":23,"hasWiki":24,"hasPages":22,"topics":25,"createdAt":10,"pushedAt":10,"updatedAt":28,"readmeContent":29,"aiSummary":30,"trendingCount":16,"starSnapshotCount":16,"syncStatus":31,"lastSyncTime":32,"discoverSource":33},9735,"reinforcement-learning-an-introduction","ShangtongZhang\u002Freinforcement-learning-an-introduction","ShangtongZhang","Python Implementation of Reinforcement Learning: An Introduction","",null,"Python",14676,4965,552,16,0,6,38,1,45,"MIT License",false,"master",true,[26,27],"artificial-intelligence","reinforcement-learning","2026-06-12 02:02:11","# Reinforcement Learning: An Introduction\n\n[![Build Status](https:\u002F\u002Ftravis-ci.org\u002FShangtongZhang\u002Freinforcement-learning-an-introduction.svg?branch=master)](https:\u002F\u002Ftravis-ci.org\u002FShangtongZhang\u002Freinforcement-learning-an-introduction)\n\nPython replication for Sutton & Barto's book [*Reinforcement Learning: An Introduction (2nd Edition)*](http:\u002F\u002Fincompleteideas.net\u002Fbook\u002Fthe-book-2nd.html)\n\n> If you have any confusion about the code or want to report a bug, please open an issue instead of emailing me directly, and unfortunately I do not have exercise answers for the book.\n\n# Contents \n\n### Chapter 1\n1. Tic-Tac-Toe\n\n### Chapter 2\n1. 
[Figure 2.1: An exemplary bandit problem from the 10-armed testbed](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_2_1.png)\n2. [Figure 2.2: Average performance of epsilon-greedy action-value methods on the 10-armed testbed](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_2_2.png)\n3. [Figure 2.3: Optimistic initial action-value estimates](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_2_3.png)\n4. [Figure 2.4: Average performance of UCB action selection on the 10-armed testbed](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_2_4.png)\n5. [Figure 2.5: Average performance of the gradient bandit algorithm](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_2_5.png)\n6. [Figure 2.6: A parameter study of the various bandit algorithms](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_2_6.png)\n\n### Chapter 3\n1. [Figure 3.2: Grid example with random policy](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_3_2.png)\n2. [Figure 3.5: Optimal solutions to the gridworld example](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_3_5.png)\n\n### Chapter 4\n1. 
[Figure 4.1: Convergence of iterative policy evaluation on a small gridworld](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_4_1.png)\n2. [Figure 4.2: Jack’s car rental problem](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_4_2.png)\n3. [Figure 4.3: The solution to the gambler’s problem](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_4_3.png)\n\n### Chapter 5\n1. [Figure 5.1: Approximate state-value functions for the blackjack policy](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_5_1.png)\n2. [Figure 5.2: The optimal policy and state-value function for blackjack found by Monte Carlo ES](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_5_2.png)\n3. [Figure 5.3: Weighted importance sampling](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_5_3.png)\n4. [Figure 5.4: Ordinary importance sampling with surprisingly unstable estimates](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_5_4.png)\n\n### Chapter 6\n1. [Example 6.2: Random walk](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Fexample_6_2.png)\n2. [Figure 6.2: Batch updating](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_6_2.png)\n3. 
[Figure 6.3: Sarsa applied to windy grid world](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_6_3.png)\n4. [Figure 6.4: The cliff-walking task](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_6_4.png)\n5. [Figure 6.6: Interim and asymptotic performance of TD control methods](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_6_6.png)\n6. [Figure 6.7: Comparison of Q-learning and Double Q-learning](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_6_7.png)\n\n### Chapter 7\n1. [Figure 7.2: Performance of n-step TD methods on 19-state random walk](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_7_2.png)\n\n### Chapter 8\n1. [Figure 8.2: Average learning curves for Dyna-Q agents varying in their number of planning steps](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_8_2.png)\n2. [Figure 8.4: Average performance of Dyna agents on a blocking task](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_8_4.png)\n3. [Figure 8.5: Average performance of Dyna agents on a shortcut task](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_8_5.png)\n4. 
[Example 8.4: Prioritized sweeping significantly shortens learning time on the Dyna maze task](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Fexample_8_4.png)\n5. [Figure 8.7: Comparison of efficiency of expected and sample updates](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_8_7.png)\n6. [Figure 8.8: Relative efficiency of different update distributions](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_8_8.png)\n\n### Chapter 9\n1. [Figure 9.1: Gradient Monte Carlo algorithm on the 1000-state random walk task](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_9_1.png)\n2. [Figure 9.2: Semi-gradient n-steps TD algorithm on the 1000-state random walk task](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_9_2.png)\n3. [Figure 9.5: Fourier basis vs polynomials on the 1000-state random walk task](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_9_5.png)\n4. [Figure 9.8: Example of feature width’s effect on initial generalization and asymptotic accuracy](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_9_8.png)\n5. [Figure 9.10: Single tiling and multiple tilings on the 1000-state random walk task](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_9_10.png)\n\n### Chapter 10\n1. 
[Figure 10.1: The cost-to-go function for Mountain Car task in one run](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_10_1.png)\n2. [Figure 10.2: Learning curves for semi-gradient Sarsa on Mountain Car task](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_10_2.png)\n3. [Figure 10.3: One-step vs multi-step performance of semi-gradient Sarsa on the Mountain Car task](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_10_3.png)\n4. [Figure 10.4: Effect of the alpha and n on early performance of n-step semi-gradient Sarsa](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_10_4.png)\n5. [Figure 10.5: Differential semi-gradient Sarsa on the access-control queuing task](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_10_5.png)\n\n### Chapter 11\n1. [Figure 11.2: Baird's Counterexample](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_11_2.png)\n2. [Figure 11.6: The behavior of the TDC algorithm on Baird’s counterexample](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_11_6.png)\n3. [Figure 11.7: The behavior of the ETD algorithm in expectation on Baird’s counterexample](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_11_7.png)\n\n### Chapter 12\n1. 
[Figure 12.3: Off-line λ-return algorithm on 19-state random walk](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_12_3.png)\n2. [Figure 12.6: TD(λ) algorithm on 19-state random walk](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_12_6.png)\n3. [Figure 12.8: True online TD(λ) algorithm on 19-state random walk](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_12_8.png)\n4. [Figure 12.10: Sarsa(λ) with replacing traces on Mountain Car](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_12_10.png)\n5. [Figure 12.11: Summary comparison of Sarsa(λ) algorithms on Mountain Car](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_12_11.png)\n\n### Chapter 13\n1. [Example 13.1: Short corridor with switched actions](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Fexample_13_1.png)\n2. [Figure 13.1: REINFORCE on the short-corridor grid world](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_13_1.png)\n3. 
[Figure 13.2: REINFORCE with baseline on the short-corridor grid world](https:\u002F\u002Fraw.githubusercontent.com\u002FShangtongZhang\u002Freinforcement-learning-an-introduction\u002Fmaster\u002Fimages\u002Ffigure_13_2.png)\n\n\n# Environment\n* Python 3.6\n* numpy\n* matplotlib\n* [seaborn](https:\u002F\u002Fseaborn.pydata.org\u002Findex.html)\n* [tqdm](https:\u002F\u002Fpypi.org\u002Fproject\u002Ftqdm\u002F)\n\n# Usage\n> All files are self-contained\n```commandline\npython any_file_you_want.py\n```\n\n# Contribution\nTo contribute missing examples or fix bugs, feel free to open an issue or make a pull request.\n","This project is a Python implementation of the algorithms and concepts in Sutton & Barto's book Reinforcement Learning: An Introduction. It covers key examples and experiments from many of the book's chapters, including multi-armed bandit problems, gridworlds, and blackjack, and provides detailed code that reproduces these classic cases. Written in Python, it is easy to understand and modify, and it suits reinforcement learning beginners and researchers in teaching and research settings who want to get started quickly and build a deeper understanding of the underlying theory and practical methods.",2,"2026-06-11 03:24:28","top_topic"]