Preface to the Second Edition | xiii |
Preface to the First Edition | xvii |
Summary of Notation | xix |
1 Introduction | 1 | (22)
   1.1 Reinforcement Learning | 1 | (3)
   1.2 Examples | 4 | (2)
   1.3 Elements of Reinforcement Learning | 6 | (1)
   1.4 Limitations and Scope | 7 | (1)
   1.5 An Extended Example: Tic-Tac-Toe | 8 | (5)
   1.6 Summary | 13 | (1)
   1.7 Early History of Reinforcement Learning | 13 | (10)
I Tabular Solution Methods | 23 | (172)
2 Multi-armed Bandits | 25 | (22)
   2.1 A k-armed Bandit Problem | 25 | (2)
   2.2 Action-value Methods | 27 | (1)
   2.3 The 10-armed Testbed | 28 | (2)
   2.4 Incremental Implementation | 30 | (2)
   2.5 Tracking a Nonstationary Problem | 32 | (2)
   2.6 Optimistic Initial Values | 34 | (1)
   2.7 Upper-Confidence-Bound Action Selection | 35 | (2)
   2.8 Gradient Bandit Algorithms | 37 | (4)
   2.9 Associative Search (Contextual Bandits) | 41 | (1)
   2.10 Summary | 42 | (5)
3 Finite Markov Decision Processes | 47 | (26)
   3.1 The Agent-Environment Interface | 47 | (6)
   3.2 Goals and Rewards | 53 | (1)
   3.3 Returns and Episodes | 54 | (3)
   3.4 Unified Notation for Episodic and Continuing Tasks | 57 | (1)
   3.5 Policies and Value Functions | 58 | (4)
   3.6 Optimal Policies and Optimal Value Functions | 62 | (5)
   3.7 Optimality and Approximation | 67 | (1)
   3.8 Summary | 68 | (5)
4 Dynamic Programming | 73 | (18)
   4.1 Policy Evaluation (Prediction) | 74 | (2)
   4.2 Policy Improvement | 76 | (4)
   4.3 Policy Iteration | 80 | (2)
   4.4 Value Iteration | 82 | (3)
   4.5 Asynchronous Dynamic Programming | 85 | (1)
   4.6 Generalized Policy Iteration | 86 | (1)
   4.7 Efficiency of Dynamic Programming | 87 | (1)
   4.8 Summary | 88 | (3)
5 Monte Carlo Methods | 91 | (28)
   5.1 Monte Carlo Prediction | 92 | (4)
   5.2 Monte Carlo Estimation of Action Values | 96 | (1)
   5.3 Monte Carlo Control | 97 | (3)
   5.4 Monte Carlo Control without Exploring Starts | 100 | (3)
   5.5 Off-policy Prediction via Importance Sampling | 103 | (6)
   5.6 Incremental Implementation | 109 | (1)
   5.7 Off-policy Monte Carlo Control | 110 | (2)
   5.8 *Discounting-aware Importance Sampling | 112 | (2)
   5.9 *Per-decision Importance Sampling | 114 | (1)
   5.10 Summary | 115 | (4)
6 Temporal-Difference Learning | 119 | (22)
   6.1 TD Prediction | 119 | (5)
   6.2 Advantages of TD Prediction Methods | 124 | (2)
   6.3 Optimality of TD(0) | 126 | (3)
   6.4 Sarsa: On-policy TD Control | 129 | (2)
   6.5 Q-learning: Off-policy TD Control | 131 | (2)
   6.6 Expected Sarsa | 133 | (1)
   6.7 Maximization Bias and Double Learning | 134 | (2)
   6.8 Games, Afterstates, and Other Special Cases | 136 | (2)
   6.9 Summary | 138 | (3)
7 n-step Bootstrapping | 141 | (18)
   7.1 n-step TD Prediction | 142 | (3)
   7.2 n-step Sarsa | 145 | (3)
   7.3 n-step Off-policy Learning | 148 | (2)
   7.4 *Per-decision Methods with Control Variates | 150 | (2)
   7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm | 152 | (2)
   7.6 *A Unifying Algorithm: n-step Q(σ) | 154 | (3)
   7.7 Summary | 157 | (2)
8 Planning and Learning with Tabular Methods | 159 | (36)
   8.1 Models and Planning | 159 | (2)
   8.2 Dyna: Integrated Planning, Acting, and Learning | 161 | (5)
   8.3 When the Model Is Wrong | 166 | (2)
   8.4 Prioritized Sweeping | 168 | (4)
   8.5 Expected vs. Sample Updates | 172 | (2)
   8.6 Trajectory Sampling | 174 | (3)
   8.7 Real-time Dynamic Programming | 177 | (3)
   8.8 Planning at Decision Time | 180 | (1)
   8.9 Heuristic Search | 181 | (2)
   8.10 Rollout Algorithms | 183 | (2)
   8.11 Monte Carlo Tree Search | 185 | (3)
   8.12 Summary of the Chapter | 188 | (1)
   8.13 Summary of Part I: Dimensions | 189 | (6)
II Approximate Solution Methods | 195 | (144)
9 On-policy Prediction with Approximation | 197 | (46)
   9.1 Value-function Approximation | 198 | (1)
   9.2 The Prediction Objective (VE) | 199 | (1)
   9.3 Stochastic-gradient and Semi-gradient Methods | 200 | (4)
   9.4 Linear Methods | 204 | (6)
   9.5 Feature Construction for Linear Methods | 210 | (12)
      9.5.1 Polynomials | 210 | (1)
      9.5.2 Fourier Basis | 211 | (4)
      9.5.3 Coarse Coding | 215 | (2)
      9.5.4 Tile Coding | 217 | (4)
      9.5.5 Radial Basis Functions | 221 | (1)
   9.6 Selecting Step-Size Parameters Manually | 222 | (1)
   9.7 Nonlinear Function Approximation: Artificial Neural Networks | 223 | (5)
   9.8 Least-Squares TD | 228 | (2)
   9.9 Memory-based Function Approximation | 230 | (2)
   9.10 Kernel-based Function Approximation | 232 | (2)
   9.11 Looking Deeper at On-policy Learning: Interest and Emphasis | 234 | (2)
   9.12 Summary | 236 | (7)
10 On-policy Control with Approximation | 243 | (14)
   10.1 Episodic Semi-gradient Control | 243 | (4)
   10.2 Semi-gradient n-step Sarsa | 247 | (2)
   10.3 Average Reward: A New Problem Setting for Continuing Tasks | 249 | (4)
   10.4 Deprecating the Discounted Setting | 253 | (2)
   10.5 Differential Semi-gradient n-step Sarsa | 255 | (1)
   10.6 Summary | 256 | (1)
11 *Off-policy Methods with Approximation | 257 | (30)
   11.1 Semi-gradient Methods | 258 | (2)
   11.2 Examples of Off-policy Divergence | 260 | (4)
   11.3 The Deadly Triad | 264 | (2)
   11.4 Linear Value-function Geometry | 266 | (3)
   11.5 Gradient Descent in the Bellman Error | 269 | (5)
   11.6 The Bellman Error is Not Learnable | 274 | (4)
   11.7 Gradient-TD Methods | 278 | (3)
   11.8 Emphatic-TD Methods | 281 | (2)
   11.9 Reducing Variance | 283 | (1)
   11.10 Summary | 284 | (3)
12 Eligibility Traces | 287 | (34)
   12.1 The λ-return | 288 | (4)
   12.2 TD(λ) | 292 | (3)
   12.3 n-step Truncated λ-return Methods | 295 | (2)
   12.4 Redoing Updates: Online λ-return Algorithm | 297 | (2)
   12.5 True Online TD(λ) | 299 | (2)
   12.6 *Dutch Traces in Monte Carlo Learning | 301 | (2)
   12.7 Sarsa(λ) | 303 | (4)
   12.8 Variable λ and γ | 307 | (2)
   12.9 Off-policy Traces with Control Variates | 309 | (3)
   12.10 Watkins's Q(λ) to Tree-Backup(λ) | 312 | (2)
   12.11 Stable Off-policy Methods with Traces | 314 | (2)
   12.12 Implementation Issues | 316 | (1)
   12.13 Conclusions | 317 | (4)
13 Policy Gradient Methods | 321 | (18)
   13.1 Policy Approximation and its Advantages | 322 | (2)
   13.2 The Policy Gradient Theorem | 324 | (2)
   13.3 REINFORCE: Monte Carlo Policy Gradient | 326 | (3)
   13.4 REINFORCE with Baseline | 329 | (2)
   13.5 Actor-Critic Methods | 331 | (2)
   13.6 Policy Gradient for Continuing Problems | 333 | (2)
   13.7 Policy Parameterization for Continuous Actions | 335 | (2)
   13.8 Summary | 337 | (2)
III Looking Deeper | 339 | (142)
14 Psychology | 341 | (36)
   14.1 Prediction and Control | 342 | (1)
   14.2 Classical Conditioning | 343 | (14)
      14.2.1 Blocking and Higher-order Conditioning | 345 | (1)
      14.2.2 The Rescorla-Wagner Model | 346 | (3)
      14.2.3 The TD Model | 349 | (1)
      14.2.4 TD Model Simulations | 350 | (7)
   14.3 Instrumental Conditioning | 357 | (4)
   14.4 Delayed Reinforcement | 361 | (2)
   14.5 Cognitive Maps | 363 | (1)
   14.6 Habitual and Goal-directed Behavior | 364 | (4)
   14.7 Summary | 368 | (9)
15 Neuroscience | 377 | (44)
   15.1 Neuroscience Basics | 378 | (2)
   15.2 Reward Signals, Reinforcement Signals, Values, and Prediction Errors | 380 | (1)
   15.3 The Reward Prediction Error Hypothesis | 381 | (2)
   15.4 Dopamine | 383 | (4)
   15.5 Experimental Support for the Reward Prediction Error Hypothesis | 387 | (3)
   15.6 TD Error/Dopamine Correspondence | 390 | (5)
   15.7 Neural Actor-Critic | 395 | (3)
   15.8 Actor and Critic Learning Rules | 398 | (4)
   15.9 Hedonistic Neurons | 402 | (2)
   15.10 Collective Reinforcement Learning | 404 | (3)
   15.11 Model-based Methods in the Brain | 407 | (2)
   15.12 Addiction | 409 | (1)
   15.13 Summary | 410 | (11)
16 Applications and Case Studies | 421 | (38)
   16.1 TD-Gammon | 421 | (5)
   16.2 Samuel's Checkers Player | 426 | (3)
   16.3 Watson's Daily-Double Wagering | 429 | (3)
   16.4 Optimizing Memory Control | 432 | (4)
   16.5 Human-level Video Game Play | 436 | (5)
   16.6 Mastering the Game of Go | 441 | (9)
      16.6.1 AlphaGo | 444 | (3)
      16.6.2 AlphaGo Zero | 447 | (3)
   16.7 Personalized Web Services | 450 | (3)
   16.8 Thermal Soaring | 453 | (6)
17 Frontiers | 459 | (22)
   17.1 General Value Functions and Auxiliary Tasks | 459 | (2)
   17.2 Temporal Abstraction via Options | 461 | (3)
   17.3 Observations and State | 464 | (5)
   17.4 Designing Reward Signals | 469 | (3)
   17.5 Remaining Issues | 472 | (3)
   17.6 Reinforcement Learning and the Future of Artificial Intelligence | 475 | (6)
References | 481 | (38)
Index | 519 |