Foreword  xix
Preface  xxi
Acknowledgments  xxv
About the Authors  xxvii
|
1 Introduction to Reinforcement Learning  1 (22)
1.1 Reinforcement Learning  1 (5)
1.2 Reinforcement Learning as MDP  6 (3)
1.3 Learnable Functions in Reinforcement Learning  9 (2)
1.4 Deep Reinforcement Learning Algorithms  11 (6)
1.4.1 Policy-Based Algorithms  12 (1)
1.4.2 Value-Based Algorithms  13 (1)
1.4.3 Model-Based Algorithms  13 (2)
1.4.4 Combined Methods  15 (1)
1.4.5 Algorithms Covered in This Book  15 (1)
1.4.6 On-Policy and Off-Policy Algorithms  16 (1)
1.4.7 Summary  16 (1)
1.5 Deep Learning for Reinforcement Learning  17 (2)
1.6 Reinforcement Learning and Supervised Learning  19 (2)
1.6.1 Lack of an Oracle  19 (1)
1.6.2 Sparsity of Feedback  20 (1)
1.6.3 Data Generation  20 (1)
1.7 Summary  21 (2)
|
I Policy-Based and Value-Based Algorithms  23 (110)
|
|
2 Reinforce  25 (28)
2.1 Policy  26 (1)
2.2 The Objective Function  26 (1)
2.3 The Policy Gradient  27 (3)
2.3.1 Policy Gradient Derivation  28 (2)
2.4 Monte Carlo Sampling  30 (1)
2.5 Reinforce Algorithm  31 (2)
2.5.1 Improving Reinforce  32 (1)
2.6 Implementing Reinforce  33 (11)
2.6.1 A Minimal Reinforce Implementation  33 (3)
2.6.2 Constructing Policies with PyTorch  36 (2)
2.6.3 Sampling Actions  38 (1)
2.6.4 Calculating Policy Loss  39 (1)
2.6.5 Reinforce Training Loop  40 (1)
2.6.6 On-Policy Replay Memory  41 (3)
2.7 Training a Reinforce Agent  44 (3)
2.8 Experimental Results  47 (4)
2.8.1 Experiment: The Effect of Discount Factor γ  47 (2)
2.8.2 Experiment: The Effect of Baseline  49 (2)
2.9 Summary  51 (1)
2.10 Further Reading  51 (1)
2.11 History  51 (2)
|
|
3 Sarsa  53 (28)
3.1 The Q- and V-Functions  54 (2)
3.2 Temporal Difference Learning  56 (9)
3.2.1 Intuition for Temporal Difference Learning  59 (6)
3.3 Action Selection in Sarsa  65 (2)
3.3.1 Exploration and Exploitation  66 (1)
3.4 Sarsa Algorithm  67 (2)
3.4.1 On-Policy Algorithms  68 (1)
3.5 Implementing Sarsa  69 (5)
3.5.1 Action Function: ε-Greedy  69 (1)
3.5.2 Calculating the Q-Loss  70 (1)
3.5.3 Sarsa Training Loop  71 (1)
3.5.4 On-Policy Batched Replay Memory  72 (2)
3.6 Training a Sarsa Agent  74 (2)
3.7 Experimental Results  76 (2)
3.7.1 Experiment: The Effect of Learning Rate  77 (1)
3.8 Summary  78 (1)
3.9 Further Reading  79 (1)
3.10 History  79 (2)
|
|
4 Deep Q-Networks (DQN)  81 (22)
4.1 Learning the Q-Function in DQN  82 (1)
4.2 Action Selection in DQN  83 (5)
4.2.1 The Boltzmann Policy  86 (2)
4.3 Experience Replay  88 (1)
4.4 DQN Algorithm  89 (2)
4.5 Implementing DQN  91 (5)
4.5.1 Calculating the Q-Loss  91 (1)
4.5.2 DQN Training Loop  92 (1)
4.5.3 Replay Memory  93 (3)
4.6 Training a DQN Agent  96 (3)
4.7 Experimental Results  99 (2)
4.7.1 Experiment: The Effect of Network Architecture  99 (2)
4.8 Summary  101 (1)
4.9 Further Reading  102 (1)
4.10 History  102 (1)
|
|
5 Improving DQN  103 (30)
5.1 Target Networks  104 (2)
5.2 Double DQN  106 (3)
5.3 Prioritized Experience Replay (PER)  109 (3)
5.3.1 Importance Sampling  111 (1)
5.4 Modified DQN Implementation  112 (11)
5.4.1 Network Initialization  113 (1)
5.4.2 Calculating the Q-Loss  113 (2)
5.4.3 Updating the Target Network  115 (1)
5.4.4 DQN with Target Networks  116 (1)
5.4.5 Double DQN  116 (1)
5.4.6 Prioritized Experience Replay  117 (6)
5.5 Training a DQN Agent to Play Atari Games  123 (5)
5.6 Experimental Results  128 (4)
5.6.1 Experiment: The Effect of Double DQN and PER  128 (4)
5.7 Summary  132 (1)
5.8 Further Reading  132 (1)

II Combined Methods  133 (74)
|
6 Advantage Actor-Critic (A2C)  135 (30)
6.1 The Actor  136 (1)
6.2 The Critic  136 (5)
6.2.1 The Advantage Function  136 (4)
6.2.2 Learning the Advantage Function  140 (1)
6.3 A2C Algorithm  141 (2)
6.4 Implementing A2C  143 (5)
6.4.1 Advantage Estimation  144 (3)
6.4.2 Calculating Value Loss and Policy Loss  147 (1)
6.4.3 Actor-Critic Training Loop  147 (1)
6.5 Network Architecture  148 (2)
6.6 Training an A2C Agent  150 (7)
6.6.1 A2C with n-Step Returns on Pong  150 (3)
6.6.2 A2C with GAE on Pong  153 (2)
6.6.3 A2C with n-Step Returns on BipedalWalker  155 (2)
6.7 Experimental Results  157 (4)
6.7.1 Experiment: The Effect of n-Step Returns  158 (1)
6.7.2 Experiment: The Effect of λ of GAE  159 (2)
6.8 Summary  161 (1)
6.9 Further Reading  162 (1)
6.10 History  162 (3)
|
7 Proximal Policy Optimization (PPO)  165 (30)
7.1 Surrogate Objective  165 (9)
7.1.1 Performance Collapse  166 (2)
7.1.2 Modifying the Objective  168 (6)
7.2 Proximal Policy Optimization (PPO)  174 (3)
7.3 PPO Algorithm  177 (2)
7.4 Implementing PPO  179 (3)
7.4.1 Calculating the PPO Policy Loss  179 (1)
7.4.2 PPO Training Loop  180 (2)
7.5 Training a PPO Agent  182 (6)
7.5.1 PPO on Pong  182 (3)
7.5.2 PPO on BipedalWalker  185 (3)
7.6 Experimental Results  188 (4)
7.6.1 Experiment: The Effect of λ of GAE  188 (2)
7.6.2 Experiment: The Effect of Clipping Variable ε  190 (2)
7.7 Summary  192 (1)
7.8 Further Reading  192 (3)
|
8 Parallelization Methods  195 (10)
8.1 Synchronous Parallelization  196 (1)
8.2 Asynchronous Parallelization  197 (3)
8.2.1 Hogwild!  198 (2)
8.3 Training an A3C Agent  200 (3)
8.4 Summary  203 (1)
8.5 Further Reading  204 (1)

9 Algorithm Summary  205 (2)

III Practical Details  207 (80)
|
10 Getting Deep RL to Work  209 (30)
10.1 Software Engineering Practices  209 (9)
10.1.1 Unit Tests  210 (5)
10.1.2 Code Quality  215 (1)
10.1.3 Git Workflow  216 (2)
10.2 Debugging Tips  218 (10)
10.2.1 Signs of Life  219 (1)
10.2.2 Policy Gradient Diagnoses  219 (1)
10.2.3 Data Diagnoses  220 (2)
10.2.4 Preprocessor  222 (1)
10.2.5 Memory  222 (1)
10.2.6 Algorithmic Functions  222 (1)
10.2.7 Neural Networks  222 (3)
10.2.8 Algorithm Simplification  225 (1)
10.2.9 Problem Simplification  226 (1)
10.2.10 Hyperparameters  226 (1)
10.2.11 Lab Workflow  226 (2)
10.3 Atari Tricks  228 (3)
10.4 Deep RL Almanac  231 (7)
10.4.1 Hyperparameter Tables  231 (3)
10.4.2 Algorithm Performance Comparison  234 (4)
10.5 Summary  238 (1)
|
|
11 SLM Lab  239 (12)
11.1 Algorithms Implemented in SLM Lab  239 (2)
11.2 Spec File  241 (5)
11.2.1 Search Spec Syntax  243 (3)
11.3 Running SLM Lab  246 (1)
11.3.1 Lab Commands  246 (1)
11.4 Analyzing Experiment Results  247 (2)
11.4.1 Overview of the Experiment Data  247 (2)
11.5 Summary  249 (2)
|
|
12 Network Architectures  251 (22)
12.1 Types of Neural Networks  251 (5)
12.1.1 Multilayer Perceptrons (MLPs)  252 (1)
12.1.2 Convolutional Neural Networks (CNNs)  253 (2)
12.1.3 Recurrent Neural Networks (RNNs)  255 (1)
12.2 Guidelines for Choosing a Network Family  256 (6)
12.2.1 MDPs vs. POMDPs  256 (3)
12.2.2 Choosing Networks for Environments  259 (3)
12.3 The Net API  262 (9)
12.3.1 Input and Output Layer Shape Inference  264 (2)
12.3.2 Automatic Network Construction  266 (3)
12.3.3 Training Step  269 (1)
12.3.4 Exposure of Underlying Methods  270 (1)
12.4 Summary  271 (1)
12.5 Further Reading  271 (2)
|
|
13 Hardware  273 (14)
13.1 Computer  273 (5)
13.2 Data Types  278 (2)
13.3 Optimizing Data Types in RL  280 (5)
13.4 Choosing Hardware  285 (1)
13.5 Summary  285 (2)
|
|
IV Environment Design  287 (56)
|
|
14 States  289 (26)
14.1 Examples of States  289 (7)
14.2 State Completeness  296 (1)
14.3 State Complexity  297 (4)
14.4 State Information Loss  301 (5)
14.4.1 Image Grayscaling  301 (1)
14.4.2 Discretization  302 (1)
14.4.3 Hash Conflict  303 (1)
14.4.4 Metainformation Loss  303 (3)
14.5 Preprocessing  306 (7)
14.5.1 Standardization  307 (1)
14.5.2 Image Preprocessing  308 (2)
14.5.3 Temporal Preprocessing  310 (3)
14.6 Summary  313 (2)
|
|
15 Actions  315 (12)
15.1 Examples of Actions  315 (3)
15.2 Action Completeness  318 (1)
15.3 Action Complexity  319 (4)
15.4 Summary  323 (1)
15.5 Further Reading: Action Design in Everyday Things  324 (3)
|
|
16 Rewards  327 (6)
16.1 The Role of Rewards  327 (1)
16.2 Reward Design Guidelines  328 (4)
16.3 Summary  332 (1)
|
|
17 Transition Function  333 (10)
17.1 Feasibility Checks  333 (2)
17.2 Reality Check  335 (2)
17.3 Summary  337 (1)
17.4 Further Reading  338 (5)
|
A Deep Reinforcement Learning Timeline  343 (2)
|
|
B Example Environments  345 (8)
B.1 Discrete Environments  346 (4)
B.1.1 CartPole-v0  346 (1)
B.1.2 MountainCar-v0  347 (1)
B.1.3 LunarLander-v2  347 (1)
B.1.4 PongNoFrameskip-v4  348 (1)
B.1.5 BreakoutNoFrameskip-v4  349 (1)
B.2 Continuous Environments  350 (3)
B.2.1 Pendulum-v0  350 (1)
B.2.2 BipedalWalker-v2  350 (3)

References  353 (10)
Index  363