Preface to the Second Edition | xiii |
Preface to the First Edition | xvii |
Summary of Notation | xix |
1 Introduction | 1 | (22)
   1.1 Reinforcement Learning | 1 | (3)
   1.2 Examples | 4 | (2)
   1.3 Elements of Reinforcement Learning | 6 | (1)
   1.4 Limitations and Scope | 7 | (1)
   1.5 An Extended Example: Tic-Tac-Toe | 8 | (5)
   1.6 Summary | 13 | (1)
   1.7 Early History of Reinforcement Learning | 13 | (10)
I Tabular Solution Methods | 23 | (172)
2 Multi-armed Bandits | 25 | (22)
   2.1 A k-armed Bandit Problem | 25 | (2)
   2.2 Action-value Methods | 27 | (1)
   2.3 The 10-armed Testbed | 28 | (2)
   2.4 Incremental Implementation | 30 | (2)
   2.5 Tracking a Nonstationary Problem | 32 | (2)
   2.6 Optimistic Initial Values | 34 | (1)
   2.7 Upper-Confidence-Bound Action Selection | 35 | (2)
   2.8 Gradient Bandit Algorithms | 37 | (4)
   2.9 Associative Search (Contextual Bandits) | 41 | (1)
   2.10 Summary | 42 | (5)
3 Finite Markov Decision Processes | 47 | (26)
   3.1 The Agent-Environment Interface | 47 | (6)
   3.2 Goals and Rewards | 53 | (1)
   3.3 Returns and Episodes | 54 | (3)
   3.4 Unified Notation for Episodic and Continuing Tasks | 57 | (1)
   3.5 Policies and Value Functions | 58 | (4)
   3.6 Optimal Policies and Optimal Value Functions | 62 | (5)
   3.7 Optimality and Approximation | 67 | (1)
   3.8 Summary | 68 | (5)
4 Dynamic Programming | 73 | (18)
   4.1 Policy Evaluation (Prediction) | 74 | (2)
   4.2 Policy Improvement | 76 | (4)
   4.3 Policy Iteration | 80 | (2)
   4.4 Value Iteration | 82 | (3)
   4.5 Asynchronous Dynamic Programming | 85 | (1)
   4.6 Generalized Policy Iteration | 86 | (1)
   4.7 Efficiency of Dynamic Programming | 87 | (1)
   4.8 Summary | 88 | (3)
5 Monte Carlo Methods | 91 | (28)
   5.1 Monte Carlo Prediction | 92 | (4)
   5.2 Monte Carlo Estimation of Action Values | 96 | (1)
   5.3 Monte Carlo Control | 97 | (3)
   5.4 Monte Carlo Control without Exploring Starts | 100 | (3)
   5.5 Off-policy Prediction via Importance Sampling | 103 | (6)
   5.6 Incremental Implementation | 109 | (1)
   5.7 Off-policy Monte Carlo Control | 110 | (2)
   5.8 *Discounting-aware Importance Sampling | 112 | (2)
   5.9 *Per-decision Importance Sampling | 114 | (1)
   5.10 Summary | 115 | (4)
6 Temporal-Difference Learning | 119 | (22)
   6.1 TD Prediction | 119 | (5)
   6.2 Advantages of TD Prediction Methods | 124 | (2)
   6.3 Optimality of TD(0) | 126 | (3)
   6.4 Sarsa: On-policy TD Control | 129 | (2)
   6.5 Q-learning: Off-policy TD Control | 131 | (2)
   6.6 Expected Sarsa | 133 | (1)
   6.7 Maximization Bias and Double Learning | 134 | (2)
   6.8 Games, Afterstates, and Other Special Cases | 136 | (2)
   6.9 Summary | 138 | (3)
7 n-step Bootstrapping | 141 | (18)
   7.1 n-step TD Prediction | 142 | (3)
   7.2 n-step Sarsa | 145 | (3)
   7.3 n-step Off-policy Learning | 148 | (2)
   7.4 *Per-decision Methods with Control Variates | 150 | (2)
   7.5 Off-policy Learning Without Importance Sampling: The n-step Tree Backup Algorithm | 152 | (2)
   7.6 *A Unifying Algorithm: n-step Q(σ) | 154 | (3)
   7.7 Summary | 157 | (2)
8 Planning and Learning with Tabular Methods | 159 | (36)
   8.1 Models and Planning | 159 | (2)
   8.2 Dyna: Integrated Planning, Acting, and Learning | 161 | (5)
   8.3 When the Model Is Wrong | 166 | (2)
   8.4 Prioritized Sweeping | 168 | (4)
   8.5 Expected vs. Sample Updates | 172 | (2)
   8.6 Trajectory Sampling | 174 | (3)
   8.7 Real-time Dynamic Programming | 177 | (3)
   8.8 Planning at Decision Time | 180 | (1)
   8.9 Heuristic Search | 181 | (2)
   8.10 Rollout Algorithms | 183 | (2)
   8.11 Monte Carlo Tree Search | 185 | (3)
   8.12 Summary of the Chapter | 188 | (1)
   8.13 Summary of Part I: Dimensions | 189 | (6)
II Approximate Solution Methods | 195 | (144)
9 On-policy Prediction with Approximation | 197 | (46)
   9.1 Value-function Approximation | 198 | (1)
   9.2 The Prediction Objective (VE) | 199 | (1)
   9.3 Stochastic-gradient and Semi-gradient Methods | 200 | (4)
   9.4 Linear Methods | 204 | (6)
   9.5 Feature Construction for Linear Methods | 210 | (12)
      9.5.1 Polynomials | 210 | (1)
      9.5.2 Fourier Basis | 211 | (4)
      9.5.3 Coarse Coding | 215 | (2)
      9.5.4 Tile Coding | 217 | (4)
      9.5.5 Radial Basis Functions | 221 | (1)
   9.6 Selecting Step-Size Parameters Manually | 222 | (1)
   9.7 Nonlinear Function Approximation: Artificial Neural Networks | 223 | (5)
   9.8 Least-Squares TD | 228 | (2)
   9.9 Memory-based Function Approximation | 230 | (2)
   9.10 Kernel-based Function Approximation | 232 | (2)
   9.11 Looking Deeper at On-policy Learning: Interest and Emphasis | 234 | (2)
   9.12 Summary | 236 | (7)
10 On-policy Control with Approximation | 243 | (14)
   10.1 Episodic Semi-gradient Control | 243 | (4)
   10.2 Semi-gradient n-step Sarsa | 247 | (2)
   10.3 Average Reward: A New Problem Setting for Continuing Tasks | 249 | (4)
   10.4 Deprecating the Discounted Setting | 253 | (2)
   10.5 Differential Semi-gradient n-step Sarsa | 255 | (1)
   10.6 Summary | 256 | (1)
11 *Off-policy Methods with Approximation | 257 | (30)
   11.1 Semi-gradient Methods | 258 | (2)
   11.2 Examples of Off-policy Divergence | 260 | (4)
   11.3 The Deadly Triad | 264 | (2)
   11.4 Linear Value-function Geometry | 266 | (3)
   11.5 Gradient Descent in the Bellman Error | 269 | (5)
   11.6 The Bellman Error is Not Learnable | 274 | (4)
   11.7 Gradient-TD Methods | 278 | (3)
   11.8 Emphatic-TD Methods | 281 | (2)
   11.9 Reducing Variance | 283 | (1)
   11.10 Summary | 284 | (3)
12 Eligibility Traces | 287 | (34)
   12.1 The λ-return | 288 | (4)
   12.2 TD(λ) | 292 | (3)
   12.3 n-step Truncated λ-return Methods | 295 | (2)
   12.4 Redoing Updates: Online λ-return Algorithm | 297 | (2)
   12.5 True Online TD(λ) | 299 | (2)
   12.6 *Dutch Traces in Monte Carlo Learning | 301 | (2)
   12.7 Sarsa(λ) | 303 | (4)
   12.8 Variable λ and γ | 307 | (2)
   12.9 Off-policy Traces with Control Variates | 309 | (3)
   12.10 Watkins's Q(λ) to Tree-Backup(λ) | 312 | (2)
   12.11 Stable Off-policy Methods with Traces | 314 | (2)
   12.12 Implementation Issues | 316 | (1)
   12.13 Conclusions | 317 | (4)
13 Policy Gradient Methods | 321 | (18)
   13.1 Policy Approximation and its Advantages | 322 | (2)
   13.2 The Policy Gradient Theorem | 324 | (2)
   13.3 REINFORCE: Monte Carlo Policy Gradient | 326 | (3)
   13.4 REINFORCE with Baseline | 329 | (2)
   13.5 Actor-Critic Methods | 331 | (2)
   13.6 Policy Gradient for Continuing Problems | 333 | (2)
   13.7 Policy Parameterization for Continuous Actions | 335 | (2)
   13.8 Summary | 337 | (2)
III Looking Deeper | 339 | (142)
14 Psychology | 341 | (36)
   14.1 Prediction and Control | 342 | (1)
   14.2 Classical Conditioning | 343 | (14)
      14.2.1 Blocking and Higher-order Conditioning | 345 | (1)
      14.2.2 The Rescorla-Wagner Model | 346 | (3)
      14.2.3 The TD Model | 349 | (1)
      14.2.4 TD Model Simulations | 350 | (7)
   14.3 Instrumental Conditioning | 357 | (4)
   14.4 Delayed Reinforcement | 361 | (2)
   14.5 Cognitive Maps | 363 | (1)
   14.6 Habitual and Goal-directed Behavior | 364 | (4)
   14.7 Summary | 368 | (9)
15 Neuroscience | 377 | (44)
   15.1 Neuroscience Basics | 378 | (2)
   15.2 Reward Signals, Reinforcement Signals, Values, and Prediction Errors | 380 | (1)
   15.3 The Reward Prediction Error Hypothesis | 381 | (2)
   15.4 Dopamine | 383 | (4)
   15.5 Experimental Support for the Reward Prediction Error Hypothesis | 387 | (3)
   15.6 TD Error/Dopamine Correspondence | 390 | (5)
   15.7 Neural Actor-Critic | 395 | (3)
   15.8 Actor and Critic Learning Rules | 398 | (4)
   15.9 Hedonistic Neurons | 402 | (2)
   15.10 Collective Reinforcement Learning | 404 | (3)
   15.11 Model-based Methods in the Brain | 407 | (2)
   15.12 Addiction | 409 | (1)
   15.13 Summary | 410 | (11)
16 Applications and Case Studies | 421 | (38)
   16.1 TD-Gammon | 421 | (5)
   16.2 Samuel's Checkers Player | 426 | (3)
   16.3 Watson's Daily-Double Wagering | 429 | (3)
   16.4 Optimizing Memory Control | 432 | (4)
   16.5 Human-level Video Game Play | 436 | (5)
   16.6 Mastering the Game of Go | 441 | (9)
      16.6.1 AlphaGo | 444 | (3)
      16.6.2 AlphaGo Zero | 447 | (3)
   16.7 Personalized Web Services | 450 | (3)
   16.8 Thermal Soaring | 453 | (6)
17 Frontiers | 459 | (22)
   17.1 General Value Functions and Auxiliary Tasks | 459 | (2)
   17.2 Temporal Abstraction via Options | 461 | (3)
   17.3 Observations and State | 464 | (5)
   17.4 Designing Reward Signals | 469 | (3)
   17.5 Remaining Issues | 472 | (3)
   17.6 Reinforcement Learning and the Future of Artificial Intelligence | 475 | (6)
References | 481 | (38)
Index | 519 |