Preface    xv
Author Biographies    xix
Summary of Notation    xxi

Chapter 1 Overview    1
1.1 Learning Reinforcement Learning    1
1.2 What You Will Learn from This Book    2
1.3 Expected Background to Read This Book    3
1.4 Decluttering the Jargon Linked to Reinforcement Learning    4
1.5 Introduction to the Markov Decision Process (MDP) Framework    5
1.6 Real-World Problems That Fit the MDP Framework    8
1.7 The Inherent Difficulty in Solving MDPs    9
1.8 Value Function, Bellman Equations, Dynamic Programming and RL    10
1.9    12
1.9.1 Module I: Processes and Planning Algorithms    13
1.9.2 Module II: Modeling Financial Applications    13
1.9.3 Module III: Reinforcement Learning Algorithms    15
1.9.4 Module IV: Finishing Touches    16
1.9.5 Short Appendix Chapters    16

Chapter 2 Programming and Design    19
2.1    19
2.2    20
2.3 Classes and Interfaces    21
2.3.1 A Distribution Interface    22
2.3.2 A Concrete Distribution    22
2.3.3    24
2.3.4    26
2.3.5    27
2.3.6    28
2.3.7    29
2.3.8    30
2.4 Abstracting over Computation    32
2.4.1 First-Class Functions    32
2.4.1.1    34
2.4.2 Iterative Algorithms    34
2.4.2.1 Iterators and Generators    35
2.5 Key Takeaways from This Chapter    38

Module I: Processes and Planning Algorithms

Chapter 3 Markov Processes    41
3.1 The Concept of State in a Process    41
3.2 Understanding Markov Property from Stock Price Examples    41
3.3 Formal Definitions for Markov Processes    48
3.3.1    49
3.3.2    50
3.3.3 Markov Process Implementation    50
3.4 Stock Price Examples Modeled as Markov Processes    52
3.5 Finite Markov Processes    53
3.6 Simple Inventory Example    55
3.7 Stationary Distribution of a Markov Process    58
3.8 Formalism of Markov Reward Processes    60
3.9 Simple Inventory Example as a Markov Reward Process    63
3.10 Finite Markov Reward Processes    65
3.11 Simple Inventory Example as a Finite Markov Reward Process    66
3.12 Value Function of a Markov Reward Process    69
3.13 Summary of Key Learnings from This Chapter    71

Chapter 4 Markov Decision Processes    73
4.1 Simple Inventory Example: How Much to Order?    73
4.2 The Difficulty of Sequential Decisioning under Uncertainty    74
4.3 Formal Definition of a Markov Decision Process    75
4.4    77
4.5 [Markov Decision Process, Policy] := Markov Reward Process    80
4.6 Simple Inventory Example with Unlimited Capacity (Infinite State/Action Space)    81
4.7 Finite Markov Decision Processes    83
4.8 Simple Inventory Example as a Finite Markov Decision Process    85
4.9 MDP Value Function for a Fixed Policy    88
4.10 Optimal Value Function and Optimal Policies    91
4.11 Variants and Extensions of MDPs    95
4.11.1 Size of Spaces and Discrete versus Continuous    95
4.11.1.1    95
4.11.1.2    97
4.11.1.3    97
4.11.2 Partially-Observable Markov Decision Processes (POMDPs)    98
4.12 Summary of Key Learnings from This Chapter    101

Chapter 5 Dynamic Programming Algorithms    103
5.1 Planning versus Learning    103
5.2 Usage of the Term Dynamic Programming    104
5.3    105
5.4 Bellman Policy Operator and Policy Evaluation Algorithm    109
5.5    112
5.6    113
5.7 Policy Iteration Algorithm    115
5.8 Bellman Optimality Operator and Value Iteration Algorithm    118
5.9 Optimal Policy from Optimal Value Function    121
5.10 Revisiting the Simple Inventory Example    122
5.11 Generalized Policy Iteration    124
5.12 Asynchronous Dynamic Programming    127
5.13 Finite-Horizon Dynamic Programming: Backward Induction    128
5.14 Dynamic Pricing for End-of-Life/End-of-Season of a Product    133
5.15 Generalization to Non-Tabular Algorithms    137
5.16 Summary of Key Learnings from This Chapter    138

Chapter 6 Function Approximation and Approximate Dynamic Programming    139
6.1 Function Approximation    140
6.2 Linear Function Approximation    144
6.3 Neural Network Function Approximation    150
6.4 Tabular as a Form of FunctionApprox    161
6.5 Approximate Policy Evaluation    164
6.6 Approximate Value Iteration    165
6.7 Finite-Horizon Approximate Policy Evaluation    166
6.8 Finite-Horizon Approximate Value Iteration    167
6.9 Finite-Horizon Approximate Q-Value Iteration    168
6.10 How to Construct the Non-Terminal States Distribution    169
6.11 Key Takeaways from This Chapter    170

Module II: Modeling Financial Applications

Chapter 7 Utility Theory    173
7.1 Introduction to the Concept of Utility    173
7.2 A Simple Financial Example    174
7.3 The Shape of the Utility Function    175
7.4 Calculating the Risk-Premium    177
7.5 Constant Absolute Risk-Aversion (CARA)    179
7.6 A Portfolio Application of CARA    180
7.7 Constant Relative Risk-Aversion (CRRA)    181
7.8 A Portfolio Application of CRRA    182
7.9 Key Takeaways from This Chapter    183

Chapter 8 Dynamic Asset-Allocation and Consumption    185
8.1 Optimization of Personal Finance    185
8.2 Merton's Portfolio Problem and Solution    187
8.3 Developing Intuition for the Solution to Merton's Portfolio Problem    192
8.4 A Discrete-Time Asset-Allocation Example    195
8.5 Porting to Real-World    199
8.6 Key Takeaways from This Chapter    206

Chapter 9 Derivatives Pricing and Hedging    207
9.1 A Brief Introduction to Derivatives    208
9.1.1    208
9.1.2    209
9.1.3    210
9.2 Notation for the Single-Period Simple Setting    210
9.3 Portfolios, Arbitrage and Risk-Neutral Probability Measure    211
9.4 First Fundamental Theorem of Asset Pricing (1st FTAP)    212
9.5 Second Fundamental Theorem of Asset Pricing (2nd FTAP)    214
9.6 Derivatives Pricing in Single-Period Setting    215
9.6.1 Derivatives Pricing When Market Is Complete    216
9.6.2 Derivatives Pricing When Market Is Incomplete    218
9.6.3 Derivatives Pricing When Market Has Arbitrage    226
9.7 Derivatives Pricing in Multi-Period/Continuous-Time    227
9.7.1 Multi-Period Complete-Market Setting    228
9.7.2 Continuous-Time Complete-Market Setting    229
9.8 Optimal Exercise of American Options Cast as a Finite MDP    229
9.9 Generalizing to Optimal-Stopping Problems    236
9.10 Pricing/Hedging in an Incomplete Market Cast as an MDP    238
9.11 Key Takeaways from This Chapter    240

Chapter 10 Order-Book Trading Algorithms    243
10.1 Basics of Order Book and Price Impact    243
10.2 Optimal Execution of a Market Order    252
10.2.1 Simple Linear Price Impact Model with No Risk-Aversion    257
10.2.2 Paper by Bertsimas and Lo on Optimal Order Execution    262
10.2.3 Incorporating Risk-Aversion and Real-World Considerations    263
10.3 Optimal Market-Making    264
10.3.1 Avellaneda-Stoikov Continuous-Time Formulation    266
10.3.2 Solving the Avellaneda-Stoikov Formulation    267
10.3.3 Analytical Approximation to the Solution to Avellaneda-Stoikov Formulation    272
10.3.4 Real-World Market-Making    275
10.4 Key Takeaways from This Chapter    275

Module III: Reinforcement Learning Algorithms

Chapter 11 Monte-Carlo and Temporal-Difference for Prediction    279
11.1 Overview of the Reinforcement Learning Approach    279
11.2    281
11.3 Monte-Carlo (MC) Prediction    282
11.4 Temporal-Difference (TD) Prediction    288
11.5 TD versus MC    293
11.5.1 TD Learning Akin to Human Learning    293
11.5.2 Bias, Variance and Convergence    294
11.5.3 Fixed-Data Experience Replay on TD versus MC    296
11.5.4 Bootstrapping and Experiencing    302
11.6    305
11.6.1 N-Step Bootstrapping Prediction Algorithm    305
11.6.2 λ-Return Prediction Algorithm    306
11.6.3 Eligibility Traces    308
11.6.4 Implementation of the TD(λ) Prediction Algorithm    312
11.7 Key Takeaways from This Chapter    314

Chapter 12 Monte-Carlo and Temporal-Difference for Control    315
12.1 Refresher on Generalized Policy Iteration (GPI)    315
12.2 GPI with Evaluation as Monte-Carlo    316
12.3 GLIE Monte-Carlo Control    320
12.4    325
12.5    331
12.6    332
12.6.1    333
12.6.2    335
12.6.3 Importance Sampling    342
12.7 Conceptual Linkage between DP and TD Algorithms    344
12.8 Convergence of RL Algorithms    346
12.9 Key Takeaways from This Chapter    348

Chapter 13 Batch RL, Experience-Replay, DQN, LSPI, Gradient TD    349
13.1 Batch RL and Experience-Replay    350
13.2 A Generic Implementation of Experience-Replay    353
13.3 Least-Squares RL Prediction    354
13.3.1 Least-Squares Monte-Carlo (LSMC)    355
13.3.2 Least-Squares Temporal-Difference (LSTD)    355
13.3.3    358
13.3.4 Convergence of Least-Squares Prediction    359
13.4 Q-Learning with Experience-Replay    360
13.4.1 Deep Q-Networks (DQN) Algorithm    362
13.5 Least-Squares Policy Iteration (LSPI)    363
13.5.1 Saving Your Village from a Vampire    365
13.5.2 Least-Squares Control Convergence    368
13.6 RL for Optimal Exercise of American Options    369
13.6.1 LSPI for American Options Pricing    370
13.6.2 Deep Q-Learning for American Options Pricing    372
13.7 Value Function Geometry    372
13.7.1 Notation and Definitions    373
13.7.2 Bellman Policy Operator and Projection Operator    374
13.7.3 Vectors of Interest in the Φ Subspace    374
13.8 Gradient Temporal-Difference (Gradient TD)    378
13.9 Key Takeaways from This Chapter    379

Chapter 14 Policy Gradient Algorithms    381
14.1 Advantages and Disadvantages of Policy Gradient Algorithms    382
14.2 Policy Gradient Theorem    383
14.2.1 Notation and Definitions    383
14.2.2 Statement of the Policy Gradient Theorem    384
14.2.3 Proof of the Policy Gradient Theorem    385
14.3 Score Function for Canonical Policy Functions    387
14.3.1 Canonical π(s, a; θ) for Finite Action Spaces    387
14.3.2 Canonical π(s, a; θ) for Single-Dimensional Continuous Action Spaces    388
14.4 REINFORCE Algorithm (Monte-Carlo Policy Gradient)    388
14.5 Optimal Asset Allocation (Revisited)    391
14.6 Actor-Critic and Variance Reduction    395
14.7 Overcoming Bias with Compatible Function Approximation    400
14.8 Policy Gradient Methods in Practice    403
14.8.1 Natural Policy Gradient    403
14.8.2 Deterministic Policy Gradient    404
14.9 Evolutionary Strategies    406
14.10 Key Takeaways from This Chapter    408

Module IV: Finishing Touches

Chapter 15 Multi-Armed Bandits: Exploration versus Exploitation    411
15.1 Introduction to the Multi-Armed Bandit Problem    411
15.1.1 Some Examples of Explore-Exploit Dilemma    412
15.1.2 Problem Definition    412
15.1.3    413
15.1.4    413
15.2    415
15.2.1 Greedy and ε-Greedy    415
15.2.2 Optimistic Initialization    415
15.2.3 Decaying ε_t-Greedy Algorithm    416
15.3    418
15.4 Upper Confidence Bound Algorithms    418
15.4.1 Hoeffding's Inequality    421
15.4.2    421
15.4.3    422
15.5 Probability Matching    423
15.5.1    425
15.6    428
15.7    431
15.8 Information State Space MDP    434
15.9 Extending to Contextual Bandits and RL Control    435
15.10 Key Takeaways from This Chapter    437

Chapter 16 Blending Learning and Planning    439
16.1 Planning versus Learning    439
16.1.1 Planning the Solution of Prediction/Control    440
16.1.2 Learning the Solution of Prediction/Control    441
16.1.3 Advantages and Disadvantages of Planning versus Learning    441
16.1.4 Blending Planning and Learning    442
16.2 Decision-Time Planning    443
16.3 Monte-Carlo Tree-Search (MCTS)    444
16.4 Adaptive Multi-Stage Sampling    445
16.5 Summary of Key Learnings from This Chapter    449

Chapter 17 Summary and Real-World Considerations    451
17.1 Summary of Key Learnings from This Book    451
17.2 RL in the Real-World    455

Appendix A Moment Generating Function and Its Applications    459
A.1 The Moment Generating Function (MGF)    459
A.2 MGF for Linear Functions of Random Variables    460
A.3 MGF for the Normal Distribution    460
A.4 Minimizing the MGF    460
A.4.1 Minimizing the MGF When x Follows a Normal Distribution    461
A.4.2 Minimizing the MGF When x Follows a Symmetric Binary Distribution    461

Appendix B Portfolio Theory    463
B.1    463
B.2    463
B.3 Derivation of Efficient Frontier Curve    463
B.4 Global Minimum Variance Portfolio (GMVP)    464
B.5 Orthogonal Efficient Portfolios    464
B.6    465
B.7 An Example of the Efficient Frontier for 16 Assets    465
B.8 CAPM: Linearity of Covariance Vector w.r.t. Mean Returns    465
B.9 Useful Corollaries of CAPM    466
B.10 Cross-Sectional Variance    466
B.11 Efficient Set with a Risk-Free Asset    466

Appendix C Introduction to and Overview of Stochastic Calculus Basics    467
C.1    467
C.2 Brownian Motion as Scaled Random Walk    468
C.3 Continuous-Time Stochastic Processes    469
C.4 Properties of Brownian Motion Sample Traces    469
C.5    470
C.6    471
C.7    471
C.8 A Mean-Reverting Process    472

Appendix D The Hamilton-Jacobi-Bellman (HJB) Equation    475
D.1 HJB as a Continuous-Time Version of Bellman Optimality Equation    475
D.2 HJB with State Transitions as an Ito Process    476

Appendix E Black-Scholes Equation and Its Solution for Call/Put Options    477
E.1    477
E.2 Derivation of the Black-Scholes Equation    478
E.3 Solution of the Black-Scholes Equation for Call/Put Options    479

Appendix F Function Approximations as Affine Spaces    481
F.1    481
F.2    481
F.3 Linear Map of Vector Spaces    481
F.4    482
F.5    482
F.6 Function Approximations    483
F.6.1 D[R] as an Affine Space P    483
F.6.2    483
F.7 Stochastic Gradient Descent    484
F.8 SGD Update for Linear Function Approximations    485

Appendix G Conjugate Priors for Gaussian and Bernoulli Distributions    487
G.1 Conjugate Prior for Gaussian Distribution    487
G.2 Conjugate Prior for Bernoulli Distribution    488

Bibliography    489
Index    493