I Foundation 1
1 Introduction 3
1.1 AI Breakthrough in Games . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 What is Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . 9
1.3 Agent-Environment in Reinforcement Learning . . . . . . . . . . . . . 10
1.4 Examples of Reinforcement Learning . . . . . . . . . . . . . . . . . . 15
1.5 Common terms in Reinforcement Learning . . . . . . . . . . . . . . . 17
1.6 Why study Reinforcement Learning . . . . . . . . . . . . . . . . . . . 20
1.7 The Challenges in Reinforcement Learning . . . . . . . . . . . . . . . 23
2 Markov Decision Processes 29
2.1 Overview of MDP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 Model Reinforcement Learning Problem using MDP . . . . . . . . . . 31
2.3 Markov Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.4 Markov Reward Process . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.5 Markov Decision Process . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.6 Alternative Bellman Equations for Value Functions . . . . . . . . . . 54
2.7 Optimal Policy and Optimal Value Functions . . . . . . . . . . . . . 56
3 Dynamic Programming 61
3.1 Policy Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.2 Policy Improvement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.3 Policy Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.4 General Policy Iteration . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.5 Value Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
4 Monte Carlo Methods 79
4.1 Monte Carlo Policy Evaluation . . . . . . . . . . . . . . . . . . . . . 80
4.2 Incremental Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
4.3 Exploration vs. Exploitation . . . . . . . . . . . . . . . . . . . . . . . 89
4.4 Monte Carlo Control (Policy Improvement) . . . . . . . . . . . . . . . 93
5 Temporal Difference Learning 99
5.1 Temporal Difference Learning . . . . . . . . . . . . . . . . . . . . . . 99
5.2 Temporal Difference Policy Evaluation . . . . . . . . . . . . . . . . . 100
5.3 Simplified ε-greedy Policy for Exploration . . . . . . . . . . . . . . . . 107
5.4 TD Control - SARSA . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.5 On-policy vs. Off-policy . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.6 Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.7 Double Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.8 N-step Bootstrapping . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
II Value Function Approximation 143
6 Linear Value Function Approximation 145
6.1 The Challenge of Large-scale MDPs . . . . . . . . . . . . . . . . . . . 145
6.2 Value Function Approximation . . . . . . . . . . . . . . . . . . . . . . 148
6.3 Stochastic Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . 152
6.4 Linear Value Function Approximation . . . . . . . . . . . . . . . . . . 159
7 Nonlinear Value Function Approximation 171
7.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
7.2 Training Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . 183
7.3 Policy Evaluation with Neural Networks . . . . . . . . . . . . . . . . 188
7.4 Naive Deep Q-learning . . . . . . . . . . . . . . . . . . . . . . . . . . 190
7.5 Deep Q-learning with Experience Replay and Target Network . . . . 191
7.6 DQN for Atari Games . . . . . . . . . . . . . . . . . . . . . . . . . . 200
8 Improvement to DQN 211
8.1 DQN with Double Q-learning . . . . . . . . . . . . . . . . . . . . . . 211
8.2 Prioritized Experience Replay . . . . . . . . . . . . . . . . . . . . . . 214
8.3 Advantage function and Dueling Network Architecture . . . . . . . . 219
III Policy Approximation 225
9 Policy Gradient Methods 227
9.1 Policy-based methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
9.2 Policy Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 231
9.3 REINFORCE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 237
9.4 REINFORCE with Baseline . . . . . . . . . . . . . . . . . . . . . . . 239
9.5 Actor-Critic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
9.6 Using Entropy to Encourage Exploration . . . . . . . . . . . . . . . . 249
10 Problems with Continuous Action Space 255
10.1 The Challenges of Problems with Continuous Action Space . . . . . . 256
10.2 MuJoCo Environments . . . . . . . . . . . . . . . . . . . . . . . . . . 257
10.3 Policy Gradient for Problems with Continuous Action Space . . . . . 260
11 Advanced Policy Gradient Methods 267
11.1 Problems with the standard Policy Gradient methods . . . . . . . . . 267
11.2 Policy Performance Bounds . . . . . . . . . . . . . . . . . . . . . . . 270
11.3 Proximal Policy Optimization . . . . . . . . . . . . . . . . . . . . . . 277
IV Advanced Topics 287
12 Distributed Reinforcement Learning 289
12.1 Why use Distributed Reinforcement Learning . . . . . . . . . . . . . 289
12.2 General Distributed Reinforcement Learning Architecture . . . . . . . 290
12.3 Data Parallelism for Distributed Reinforcement Learning . . . . . . . 298
13 Curiosity-Driven Exploration 301
13.1 Hard-to-explore problems vs. Sparse Reward problems . . . . . . . . 302
13.2 Curiosity-Driven Exploration . . . . . . . . . . . . . . . . . . . . . . . 303
13.3 Random Network Distillation . . . . . . . . . . . . . . . . . . . . . . 305
14 Planning with a Model - AlphaZero 317
14.1 Why We Need to Plan in Reinforcement Learning . . . . . . . . . . . 317
14.2 Monte Carlo Tree Search . . . . . . . . . . . . . . . . . . . . . . . . . 320
14.3 AlphaZero . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
Michael Hu is a skilled software engineer with over a decade of experience in designing and implementing enterprise-level applications. He is a passionate coder who loves to delve into the world of mathematics and has a keen interest in cutting-edge technologies like machine learning and deep learning, with a particular focus on deep reinforcement learning. He has built various open-source projects on GitHub that closely mimic the state-of-the-art reinforcement learning algorithms developed by DeepMind, such as AlphaZero, MuZero, and Agent57. Fluent in both English and Chinese, Michael currently resides in the bustling city of Shanghai, China.