Machine Learning in Production: Developing and Optimizing Data Science Workflows and Applications [Paperback]

  • Format: Paperback / softback, 288 pages, height x width x thickness: 230x180x20 mm, weight: 561 g
  • Series: Addison-Wesley Data & Analytics Series
  • Publication date: 15-May-2019
  • Publisher: Addison Wesley
  • ISBN-10: 0134116542
  • ISBN-13: 9780134116549

The typical data science task in industry starts with an “ask” from the business. But few data scientists have been taught what to do with that ask. This book shows them how to assess it in the context of the business’s goals, reframe it to work optimally for both the data scientist and the employer, and then execute on it. Written by two of the experts who’ve achieved breakthrough optimizations at BuzzFeed, it’s packed with real-world examples that take you from start to finish: from ask to actionable insight.

Andrew Kelleher and Adam Kelleher walk you through well-formed, concrete principles for approaching common data science problems, giving you an easy-to-use checklist for effective execution. Using their principles and techniques, you’ll gain a deeper understanding of your data, learn how to analyze noise and confounding variables so they don’t compromise your analysis, and save weeks of iterative improvement by planning your projects more effectively up front.

Once you’ve mastered their principles, you’ll put them to work in two realistic, beginning-to-end site optimization tasks. These extended examples come complete with reusable code and recommended open-source solutions designed for easy adaptation to your everyday challenges. They will be especially valuable for anyone seeking their first data science job, and for everyone who’s found that job and wants to succeed in it.

Foreword xv
Preface xvii
About the Authors xxi
I: Principles of Framing 1
1 The Role of the Data Scientist 3
1.1 Introduction 3
1.2 The Role of the Data Scientist 3
1.2.1 Company Size 3
1.2.2 Team Context 4
1.2.3 Ladders and Career Development 5
1.2.4 Importance 5
1.2.5 The Work Breakdown 6
1.3 Conclusion 6
2 Project Workflow 7
2.1 Introduction 7
2.2 The Data Team Context 7
2.2.1 Embedding vs. Pooling Resources 8
2.2.2 Research 8
2.2.3 Prototyping 9
2.2.4 A Combined Workflow 10
2.3 Agile Development and the Product Focus 10
2.3.1 The 12 Principles 11
2.4 Conclusion 15
3 Quantifying Error 17
3.1 Introduction 17
3.2 Quantifying Error in Measured Values 17
3.3 Sampling Error 19
3.4 Error Propagation 21
3.5 Conclusion 23
4 Data Encoding and Preprocessing 25
4.1 Introduction 25
4.2 Simple Text Preprocessing 26
4.2.1 Tokenization 26
4.2.2 N-grams 27
4.2.3 Sparsity 28
4.2.4 Feature Selection 28
4.2.5 Representation Learning 30
4.3 Information Loss 33
4.4 Conclusion 34
5 Hypothesis Testing 37
5.1 Introduction 37
5.2 What Is a Hypothesis? 37
5.3 Types of Errors 39
5.4 P-values and Confidence Intervals 40
5.5 Multiple Testing and "P-hacking" 41
5.6 An Example 42
5.7 Planning and Context 43
5.8 Conclusion 44
6 Data Visualization 45
6.1 Introduction 45
6.2 Distributions and Summary Statistics 45
6.2.1 Distributions and Histograms 46
6.2.2 Scatter Plots and Heat Maps 51
6.2.3 Box Plots and Error Bars 55
6.3 Time-Series Plots 58
6.3.1 Rolling Statistics 58
6.3.2 Auto-Correlation 60
6.4 Graph Visualization 61
6.4.1 Layout Algorithms 62
6.4.2 Time Complexity 64
6.5 Conclusion 64
II: Algorithms and Architectures 67
7 Introduction to Algorithms and Architectures 69
7.1 Introduction 69
7.2 Architectures 70
7.2.1 Services 71
7.2.2 Data Sources 72
7.2.3 Batch and Online Computing 72
7.2.4 Scaling 73
7.3 Models 74
7.3.1 Training 74
7.3.2 Prediction 75
7.3.3 Validation 76
7.4 Conclusion 77
8 Comparison 79
8.1 Introduction 79
8.2 Jaccard Distance 79
8.2.1 The Algorithm 80
8.2.2 Time Complexity 81
8.2.3 Memory Considerations 81
8.2.4 A Distributed Approach 81
8.3 MinHash 82
8.3.1 Assumptions 83
8.3.2 Time and Space Complexity 83
8.3.3 Tools 83
8.3.4 A Distributed Approach 83
8.4 Cosine Similarity 84
8.4.1 Complexity 85
8.4.2 Memory Considerations 85
8.4.3 A Distributed Approach 86
8.5 Mahalanobis Distance 86
8.5.1 Complexity 86
8.5.2 Memory Considerations 87
8.5.3 A Distributed Approach 87
8.6 Conclusion 88
9 Regression 89
9.1 Introduction 89
9.1.1 Choosing the Model 90
9.1.2 Choosing the Objective Function 90
9.1.3 Fitting 91
9.1.4 Validation 92
9.2 Linear Least Squares 96
9.2.1 Assumptions 97
9.2.2 Complexity 97
9.2.3 Memory Considerations 97
9.2.4 Tools 98
9.2.5 A Distributed Approach 98
9.2.6 A Worked Example 98
9.3 Nonlinear Regression with Linear Regression 105
9.3.1 Uncertainty 107
9.4 Random Forest 109
9.4.1 Decision Trees 109
9.4.2 Random Forests 112
9.5 Conclusion 115
10 Classification and Clustering 117
10.1 Introduction 117
10.2 Logistic Regression 118
10.2.1 Assumptions 121
10.2.2 Time Complexity 121
10.2.3 Memory Considerations 122
10.2.4 Tools 122
10.3 Bayesian Inference, Naive Bayes 122
10.3.1 Assumptions 124
10.3.2 Complexity 124
10.3.3 Memory Considerations 124
10.3.4 Tools 124
10.4 K-Means 125
10.4.1 Assumptions 127
10.4.2 Complexity 128
10.4.3 Memory Considerations 128
10.4.4 Tools 128
10.5 Leading Eigenvalue 128
10.5.1 Complexity 129
10.5.2 Memory Considerations 130
10.5.3 Tools 130
10.6 Greedy Louvain 130
10.6.1 Assumptions 130
10.6.2 Complexity 130
10.6.3 Memory Considerations 131
10.6.4 Tools 131
10.7 Nearest Neighbors 131
10.7.1 Assumptions 132
10.7.2 Complexity 132
10.7.3 Memory Considerations 133
10.7.4 Tools 133
10.8 Conclusion 133
11 Bayesian Networks 135
11.1 Introduction 135
11.2 Causal Graphs, Conditional Independence, and Markovity 136
11.2.1 Causal Graphs and Conditional Independence 136
11.2.2 Stability and Dependence 137
11.3 D-separation and the Markov Property 138
11.3.1 Markovity and Factorization 138
11.3.2 D-separation 139
11.4 Causal Graphs as Bayesian Networks 142
11.4.1 Linear Regression 142
11.5 Fitting Models 143
11.6 Conclusion 147
12 Dimensional Reduction and Latent Variable Models 149
12.1 Introduction 149
12.2 Priors 149
12.3 Factor Analysis 151
12.4 Principal Components Analysis 152
12.4.1 Complexity 154
12.4.2 Memory Considerations 154
12.4.3 Tools 154
12.5 Independent Component Analysis 154
12.5.1 Assumptions 158
12.5.2 Complexity 158
12.5.3 Memory Considerations 159
12.5.4 Tools 159
12.6 Latent Dirichlet Allocation 159
12.7 Conclusion 165
13 Causal Inference 167
13.1 Introduction 167
13.2 Experiments 168
13.3 Observation: An Example 171
13.4 Controlling to Block Non-causal Paths 177
13.4.1 The G-formula 179
13.5 Machine-Learning Estimators 182
13.5.1 The G-formula Revisited 182
13.5.2 An Example 183
13.6 Conclusion 187
14 Advanced Machine Learning 189
14.1 Introduction 189
14.2 Optimization 189
14.3 Neural Networks 191
14.3.1 Layers 192
14.3.2 Capacity 193
14.3.3 Overfitting 196
14.3.4 Batch Fitting 199
14.3.5 Loss Functions 200
14.4 Conclusion 201
III: Bottlenecks and Optimizations 203
15 Hardware Fundamentals 205
15.1 Introduction 205
15.2 Random Access Memory 205
15.2.1 Access 205
15.2.2 Volatility 206
15.3 Nonvolatile/Persistent Storage 206
15.3.1 Hard Disk Drives or "Spinning Disks" 207
15.3.2 SSDs 207
15.3.3 Latency 207
15.3.4 Paging 207
15.3.5 Thrashing 208
15.4 Throughput 208
15.4.1 Locality 208
15.4.2 Execution-Level Locality 208
15.4.3 Network Locality 209
15.5 Processors 209
15.5.1 Clock Rate 209
15.5.2 Cores 210
15.5.3 Threading 210
15.5.4 Branch Prediction 210
15.6 Conclusion 212
16 Software Fundamentals 213
16.1 Introduction 213
16.2 Paging 213
16.3 Indexing 214
16.4 Granularity 214
16.5 Robustness 216
16.6 Extract, Transfer/Transform, Load 216
16.7 Conclusion 216
17 Software Architecture 217
17.1 Introduction 217
17.2 Client-Server Architecture 217
17.3 N-tier/Service-Oriented Architecture 218
17.4 Microservices 220
17.5 Monolith 220
17.6 Practical Cases (Mix-and-Match Architectures) 221
17.7 Conclusion 221
18 The CAP Theorem 223
18.1 Introduction 223
18.2 Consistency/Concurrency 223
18.2.1 Conflict-Free Replicated Data Types 224
18.3 Availability 225
18.3.1 Redundancy 225
18.3.2 Front Ends and Load Balancers 225
18.3.3 Client-Side Load Balancing 228
18.3.4 Data Layer 228
18.3.5 Jobs and Taskworkers 230
18.3.6 Failover 230
18.4 Partition Tolerance 231
18.4.1 Split Brains 231
18.5 Conclusion 232
19 Logical Network Topological Nodes 233
19.1 Introduction 233
19.2 Network Diagrams 233
19.3 Load Balancing 234
19.4 Caches 235
19.4.1 Application-Level Caching 236
19.4.2 Cache Services 237
19.4.3 Write-Through Caches 238
19.5 Databases 238
19.5.1 Primary and Replica 238
19.5.2 Multimaster 239
19.5.3 A/B Replication 240
19.6 Queues 241
19.6.1 Task Scheduling and Parallelization 241
19.6.2 Asynchronous Process Execution 242
19.6.3 API Buffering 243
19.7 Conclusion 243
Bibliography 245
Index 247
Andrew Kelleher is a staff software engineer and distributed systems architect at Venmo. He was previously a staff software engineer at BuzzFeed and has worked on data pipelines and algorithm implementations for modern optimization. He graduated with a BS in physics from Clemson University. He runs a meetup in New York City that studies the fundamentals behind distributed systems in the context of production applications, and was ranked one of Fast Company's most creative people two years in a row.

Adam Kelleher wrote this book while working as principal data scientist at BuzzFeed and adjunct professor at Columbia University in the City of New York. As of May 2018, he is chief data scientist for research at Barclays and teaches causal inference and machine learning products at Columbia. He graduated from Clemson University with a BS in physics, and has a PhD in cosmology from the University of North Carolina at Chapel Hill.