Preface  xiii
Acknowledgments  xv
About This Book  xvi
About the Authors  xix
About the Cover Illustration  xx
|
Part 1  1
|
1  What is reinforcement learning?  3
   1.1  The "deep" in deep reinforcement learning  4
   1.2  Reinforcement learning  6
   1.3  Dynamic programming versus Monte Carlo  9
   1.4  The reinforcement learning framework  10
   1.5  What can I do with reinforcement learning?  14
   1.6  Why deep reinforcement learning?  16
   1.7  Our didactic tool: String diagrams  18
|
2  Modeling reinforcement learning problems: Markov decision processes  23
   2.1  String diagrams and our teaching methods  23
   2.2  Solving the multi-arm bandit  28
        Exploration and exploitation  29
   2.3  Applying bandits to optimize ad placements  37
   2.4  Building networks with PyTorch  40
        Automatic differentiation  40
   2.5  Solving contextual bandits  42
   2.7  Predicting future rewards: Value and policy functions  49
|
3  Predicting the best states and actions: Deep Q networks  54
   3.2  Navigating with Q-learning  56
        Introducing the Gridworld game engine  63
        A neural network as the Q function  65
   3.3  Preventing catastrophic forgetting: Experience replay  75
   3.4  Improving stability with a target network  80
|
4  Learning to pick the best policy: Policy gradient methods  90
   4.1  Policy function using neural networks  91
        Neural network as the policy function  91
        Stochastic policy gradient  92
   4.2  Reinforcing good actions: The policy gradient algorithm  95
   4.3  Working with OpenAI Gym  100
   4.4  The REINFORCE algorithm  103
        Creating the policy network  104
        Having the agent interact with the environment  104
|
5  Tackling more complex problems with actor-critic methods  111
   5.1  Combining the value and policy function  113
   5.3  Advantage actor-critic  123
|
Part 2  139
|
6  Alternative optimization methods: Evolutionary algorithms  141
   6.1  A different approach to reinforcement learning  142
   6.2  Reinforcement learning with evolution strategies  143
   6.3  A genetic algorithm for CartPole  151
   6.4  Pros and cons of evolutionary algorithms  158
        Evolutionary algorithms explore more  158
        Evolutionary algorithms are incredibly sample intensive  158
   6.5  Evolutionary algorithms as a scalable alternative  159
        Scaling evolutionary algorithms  160
        Parallel vs. serial processing  161
        Communicating between nodes  163
        Scaling gradient-based approaches  165
|
7  Distributional DQN: Getting the full story  167
   7.1  What's wrong with Q-learning?  168
   7.2  Probability and statistics revisited  173
        The distributional Bellman equation  180
   7.4  Distributional Q-learning  181
        Representing a probability distribution in Python  182
        Implementing the Dist-DQN  191
   7.5  Comparing probability distributions  193
   7.6  Dist-DQN on simulated data  198
   7.7  Using distributional Q-learning to play Freeway  203
|
8  Curiosity-driven exploration  210
   8.1  Tackling sparse rewards with predictive coding  212
   8.2  Inverse dynamics prediction  215
   8.3  Setting up Super Mario Bros  218
   8.4  Preprocessing and the Q-network  221
   8.5  Setting up the Q-network and policy function  223
   8.6  Intrinsic curiosity module  226
   8.7  Alternative intrinsic reward mechanisms  239
|
9  Multi-agent reinforcement learning  243
   9.1  From one to many agents  244
   9.2  Neighborhood Q-learning  248
   9.4  Mean field Q-learning and the 2D Ising model  261
   9.5  Mixed cooperative-competitive games  271
|
10  Interpretable reinforcement learning: Attention and relational models  283
    10.1  Machine learning interpretability with attention and relational biases  284
          Invariance and equivariance  286
    10.2  Relational reasoning with attention  287
    10.3  Implementing self-attention for MNIST  298
          Tensor contractions and Einstein notation  303
          Training the relational module  306
    10.4  Multi-head attention and relational DQN  310
    10.6  Training and attention visualization  319
          Visualizing attention weights  323
|
11  In conclusion: A review and roadmap  329
    11.2  The uncharted topics in deep reinforcement learning  331
          Prioritized experience replay  331
          Proximal policy optimization (PPO)  332
          Hierarchical reinforcement learning and the options framework  333
          Monte Carlo tree search (MCTS)  334
Appendix  Mathematics, deep learning, PyTorch  336
Reference list  348
Index  351