
E-book: Reinforcement Learning, second edition

Richard S. Sutton (University of Alberta), Andrew G. Barto (University of Massachusetts Amherst)
  • Format: EPUB+DRM
  • Series: Reinforcement Learning
  • Publication date: 13-Nov-2018
  • Publisher: MIT Press
  • Language: eng
  • ISBN-13: 9780262352703
  • Price: 97,01 €*
  • * the price is final, i.e. no further discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means you need to install special software to read it. You also need to create an Adobe ID. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), install the free PocketBook Reader app (iOS / Android).

    To read on a PC or Mac, install Adobe Digital Editions (a free application made specifically for reading e-books; not to be confused with Adobe Reader, which is probably already installed on your computer).

    This e-book cannot be read on an Amazon Kindle.

The significantly expanded and updated new edition of a widely used text on reinforcement learning, one of the most active research areas in artificial intelligence.

Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the field's key ideas and algorithms. This second edition has been significantly expanded and updated, presenting new topics and updating coverage of other topics.
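
For readers new to the setting, the sketch below illustrates the agent-environment loop the description refers to: at each step the agent picks an action, the environment returns a reward and a next state, and the agent's goal is to maximize the accumulated reward. The environment dynamics, the random policy, and the reward rule here are made-up placeholders for illustration; they are not taken from the book.

```python
# Minimal sketch of the agent-environment interaction loop (illustrative only).
import random

def run_episode(num_steps=100):
    state = 0                     # hypothetical starting state
    total_reward = 0.0
    for _ in range(num_steps):
        action = random.choice([0, 1])        # the agent's (here: random) policy
        # hypothetical environment response: a reward and a next state
        reward = 1.0 if action == 1 and state % 2 == 0 else 0.0
        state = (state + action + 1) % 10
        total_reward += reward                # the quantity the agent tries to maximize
    return total_reward

if __name__ == "__main__":
    print("return of one episode:", run_episode())
```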

Like the first edition, this second edition focuses on core online learning algorithms, with the more mathematical material set off in shaded boxes. Part I covers as much of reinforcement learning as possible without going beyond the tabular case for which exact solutions can be found. Many algorithms presented in this part are new to the second edition, including UCB, Expected Sarsa, and Double Learning. Part II extends these ideas to function approximation, with new sections on such topics as artificial neural networks and the Fourier basis, and offers expanded treatment of off-policy learning and policy-gradient methods. Part III has new chapters on reinforcement learning's relationships to psychology and neuroscience, as well as an updated case-studies chapter including AlphaGo and AlphaGo Zero, Atari game playing, and IBM Watson's wagering strategy. The final chapter discusses the future societal impacts of reinforcement learning.
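
As a taste of the tabular algorithms mentioned above, here is a minimal sketch of upper-confidence-bound (UCB) action selection on a k-armed bandit. The Gaussian reward distributions and the exploration constant c are illustrative assumptions, not the book's exact pseudocode or parameter choices.

```python
# Minimal sketch of UCB action selection for a k-armed bandit (illustrative only).
import math
import random

k = 10
c = 2.0                                            # exploration parameter (assumed value)
q_true = [random.gauss(0, 1) for _ in range(k)]    # hypothetical true action values
Q = [0.0] * k                                      # sample-average value estimates
N = [0] * k                                        # how often each action was taken

for t in range(1, 1001):
    # Pick the action with the highest upper confidence bound;
    # untried actions (N[a] == 0) are treated as maximal so each arm is tried once.
    ucb = [float("inf") if N[a] == 0 else Q[a] + c * math.sqrt(math.log(t) / N[a])
           for a in range(k)]
    a = ucb.index(max(ucb))
    reward = random.gauss(q_true[a], 1)            # noisy reward from the chosen arm
    N[a] += 1
    Q[a] += (reward - Q[a]) / N[a]                 # incremental sample-average update

print("estimated best arm:", Q.index(max(Q)), "true best arm:", q_true.index(max(q_true)))
```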

Preface to the Second Edition xiii
Preface to the First Edition xvii
Summary of Notation xix
1 Introduction
1(22)
1.1 Reinforcement Learning
1(3)
1.2 Examples
4(2)
1.3 Elements of Reinforcement Learning
6(1)
1.4 Limitations and Scope
7(1)
1.5 An Extended Example: Tic-Tac-Toe
8(5)
1.6 Summary
13(1)
1.7 Early History of Reinforcement Learning
13(10)
I Tabular Solution Methods 23(172)
2 Multi-armed Bandits
25(22)
2.1 A k-armed Bandit Problem
25(2)
2.2 Action-value Methods
27(1)
2.3 The 10-armed Testbed
28(2)
2.4 Incremental Implementation
30(2)
2.5 Tracking a Nonstationary Problem
32(2)
2.6 Optimistic Initial Values
34(1)
2.7 Upper-Confidence-Bound Action Selection
35(2)
2.8 Gradient Bandit Algorithms
37(4)
2.9 Associative Search (Contextual Bandits)
41(1)
2.10 Summary
42(5)
3 Finite Markov Decision Processes
47(26)
3.1 The Agent-Environment Interface
47(6)
3.2 Goals and Rewards
53(1)
3.3 Returns and Episodes
54(3)
3.4 Unified Notation for Episodic and Continuing Tasks
57(1)
3.5 Policies and Value Functions
58(4)
3.6 Optimal Policies and Optimal Value Functions
62(5)
3.7 Optimality and Approximation
67(1)
3.8 Summary
68(5)
4 Dynamic Programming
73(18)
4.1 Policy Evaluation (Prediction)
74(2)
4.2 Policy Improvement
76(4)
4.3 Policy Iteration
80(2)
4.4 Value Iteration
82(3)
4.5 Asynchronous Dynamic Programming
85(1)
4.6 Generalized Policy Iteration
86(1)
4.7 Efficiency of Dynamic Programming
87(1)
4.8 Summary
88(3)
5 Monte Carlo Methods
91(28)
5.1 Monte Carlo Prediction
92(4)
5.2 Monte Carlo Estimation of Action Values
96(1)
5.3 Monte Carlo Control
97(3)
5.4 Monte Carlo Control without Exploring Starts
100(3)
5.5 Off-policy Prediction via Importance Sampling
103(6)
5.6 Incremental Implementation
109(1)
5.7 Off-policy Monte Carlo Control
110(2)
5.8 *Discounting-aware Importance Sampling
112(2)
5.9 *Per-decision Importance Sampling
114(1)
5.10 Summary
115(4)
6 Temporal-Difference Learning
119(22)
6.1 TD Prediction
119(5)
6.2 Advantages of TD Prediction Methods
124(2)
6.3 Optimality of TD(0)
126(3)
6.4 Sarsa: On-policy TD Control
129(2)
6.5 Q-learning: Off-policy TD Control
131(2)
6.6 Expected Sarsa
133(1)
6.7 Maximization Bias and Double Learning
134(2)
6.8 Games, Afterstates, and Other Special Cases
136(2)
6.9 Summary
138(3)
7 n-step Bootstrapping
141(18)
7.1 n-step TD Prediction
142(3)
7.2 n-step Sarsa
145(3)
7.3 n-step Off-policy Learning
148(2)
7.4 *Per-decision Methods with Control Variates
150(2)
7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm
152(2)
7.6 *A Unifying Algorithm: n-step Q(σ)
154(3)
7.7 Summary
157(2)
8 Planning and Learning with Tabular Methods
159(36)
8.1 Models and Planning
159(2)
8.2 Dyna: Integrated Planning, Acting, and Learning
161(5)
8.3 When the Model Is Wrong
166(2)
8.4 Prioritized Sweeping
168(4)
8.5 Expected vs. Sample Updates
172(2)
8.6 Trajectory Sampling
174(3)
8.7 Real-time Dynamic Programming
177(3)
8.8 Planning at Decision Time
180(1)
8.9 Heuristic Search
181(2)
8.10 Rollout Algorithms
183(2)
8.11 Monte Carlo Tree Search
185(3)
8.12 Summary of the Chapter
188(1)
8.13 Summary of Part I: Dimensions
189(6)
II Approximate Solution Methods 195(144)
9 On-policy Prediction with Approximation
197(46)
9.1 Value-function Approximation
198(1)
9.2 The Prediction Objective (VE)
199(1)
9.3 Stochastic-gradient and Semi-gradient Methods
200(4)
9.4 Linear Methods
204(6)
9.5 Feature Construction for Linear Methods
210(12)
9.5.1 Polynomials
210(1)
9.5.2 Fourier Basis
211(4)
9.5.3 Coarse Coding
215(2)
9.5.4 Tile Coding
217(4)
9.5.5 Radial Basis Functions
221(1)
9.6 Selecting Step-Size Parameters Manually
222(1)
9.7 Nonlinear Function Approximation: Artificial Neural Networks
223(5)
9.8 Least-Squares TD
228(2)
9.9 Memory-based Function Approximation
230(2)
9.10 Kernel-based Function Approximation
232(2)
9.11 Looking Deeper at On-policy Learning: Interest and Emphasis
234(2)
9.12 Summary
236(7)
10 On-policy Control with Approximation
243(14)
10.1 Episodic Semi-gradient Control
243(4)
10.2 Semi-gradient n-step Sarsa
247(2)
10.3 Average Reward: A New Problem Setting for Continuing Tasks
249(4)
10.4 Deprecating the Discounted Setting
253(2)
10.5 Differential Semi-gradient n-step Sarsa
255(1)
10.6 Summary
256(1)
11 *Off-policy Methods with Approximation
257(30)
11.1 Semi-gradient Methods
258(2)
11.2 Examples of Off-policy Divergence
260(4)
11.3 The Deadly Triad
264(2)
11.4 Linear Value-function Geometry
266(3)
11.5 Gradient Descent in the Bellman Error
269(5)
11.6 The Bellman Error is Not Learnable
274(4)
11.7 Gradient-TD Methods
278(3)
11.8 Emphatic-TD Methods
281(2)
11.9 Reducing Variance
283(1)
11.10 Summary
284(3)
12 Eligibility Traces
287(34)
12.1 The λ-return
288(4)
12.2 TD(λ)
292(3)
12.3 n-step Truncated λ-return Methods
295(2)
12.4 Redoing Updates: Online λ-return Algorithm
297(2)
12.5 True Online TD(λ)
299(2)
12.6 *Dutch Traces in Monte Carlo Learning
301(2)
12.7 Sarsa(λ)
303(4)
12.8 Variable λ and γ
307(2)
12.9 Off-policy Traces with Control Variates
309(3)
12.10 Watkins's Q(λ) to Tree-Backup(λ)
312(2)
12.11 Stable Off-policy Methods with Traces
314(2)
12.12 Implementation Issues
316(1)
12.13 Conclusions
317(4)
13 Policy Gradient Methods
321(18)
13.1 Policy Approximation and its Advantages
322(2)
13.2 The Policy Gradient Theorem
324(2)
13.3 REINFORCE: Monte Carlo Policy Gradient
326(3)
13.4 REINFORCE with Baseline
329(2)
13.5 Actor-Critic Methods
331(2)
13.6 Policy Gradient for Continuing Problems
333(2)
13.7 Policy Parameterization for Continuous Actions
335(2)
13.8 Summary
337(2)
III Looking Deeper 339(142)
14 Psychology
341(36)
14.1 Prediction and Control
342(1)
14.2 Classical Conditioning
343(14)
14.2.1 Blocking and Higher-order Conditioning
345(1)
14.2.2 The Rescorla-Wagner Model
346(3)
14.2.3 The TD Model
349(1)
14.2.4 TD Model Simulations
350(7)
14.3 Instrumental Conditioning
357(4)
14.4 Delayed Reinforcement
361(2)
14.5 Cognitive Maps
363(1)
14.6 Habitual and Goal-directed Behavior
364(4)
14.7 Summary
368(9)
15 Neuroscience
377(44)
15.1 Neuroscience Basics
378(2)
15.2 Reward Signals, Reinforcement Signals, Values, and Prediction Errors
380(1)
15.3 The Reward Prediction Error Hypothesis
381(2)
15.4 Dopamine
383(4)
15.5 Experimental Support for the Reward Prediction Error Hypothesis
387(3)
15.6 TD Error/Dopamine Correspondence
390(5)
15.7 Neural Actor-Critic
395(3)
15.8 Actor and Critic Learning Rules
398(4)
15.9 Hedonistic Neurons
402(2)
15.10 Collective Reinforcement Learning
404(3)
15.11 Model-based Methods in the Brain
407(2)
15.12 Addiction
409(1)
15.13 Summary
410(11)
16 Applications and Case Studies
421(38)
16.1 TD-Gammon
421(5)
16.2 Samuel's Checkers Player
426(3)
16.3 Watson's Daily-Double Wagering
429(3)
16.4 Optimizing Memory Control
432(4)
16.5 Human-level Video Game Play
436(5)
16.6 Mastering the Game of Go
441(9)
16.6.1 AlphaGo
444(3)
16.6.2 AlphaGo Zero
447(3)
16.7 Personalized Web Services
450(3)
16.8 Thermal Soaring
453(6)
17 Frontiers
459(22)
17.1 General Value Functions and Auxiliary Tasks
459(2)
17.2 Temporal Abstraction via Options
461(3)
17.3 Observations and State
464(5)
17.4 Designing Reward Signals
469(3)
17.5 Remaining Issues
472(3)
17.6 Reinforcement Learning and the Future of Artificial Intelligence
475(6)
References 481(38)
Index 519
Richard S. Sutton is Professor of Computing Science and AITF Chair in Reinforcement Learning and Artificial Intelligence at the University of Alberta, and also Distinguished Research Scientist at DeepMind. Andrew G. Barto is Professor Emeritus in the College of Computer and Information Sciences at the University of Massachusetts Amherst.