Machine Learning in Production: Developing and Optimizing Data Science Workflows and Applications [Paperback]

  • Format: Paperback / softback, 288 pages, height x width x thickness: 230x180x20 mm, weight: 561 g
  • Series: Addison-Wesley Data & Analytics Series
  • Publication date: 15-May-2019
  • Publisher: Addison Wesley
  • ISBN-10: 0134116542
  • ISBN-13: 9780134116549

The typical data science task in industry starts with an “ask” from the business. But few data scientists have been taught what to do with that ask. This book shows them how to assess it in the context of the business’s goals, reframe it to work optimally for both the data scientist and the employer, and then execute on it. Written by two of the experts who’ve achieved breakthrough optimizations at BuzzFeed, it’s packed with real-world examples that take you from start to finish: from ask to actionable insight.

Andrew Kelleher and Adam Kelleher walk you through well-formed, concrete principles for approaching common data science problems, giving you an easy-to-use checklist for effective execution. Using their principles and techniques, you’ll gain a deeper understanding of your data, learn how to analyze noise and confounding variables so they don’t compromise your analysis, and save weeks of iterative improvement by planning your projects more effectively up front.

Once you’ve mastered their principles, you’ll put them to work in two realistic, beginning-to-end site optimization tasks. These extended examples come complete with reusable code and recommended open-source solutions designed for easy adaptation to your everyday challenges. They will be especially valuable for anyone seeking their first data science job, and for everyone who’s found that job and wants to succeed in it.

Foreword xv
Preface xvii
About the Authors xxi
I: Principles of Framing 1
1 The Role of the Data Scientist 3
1.1 Introduction 3
1.2 The Role of the Data Scientist 3
1.2.1 Company Size 3
1.2.2 Team Context 4
1.2.3 Ladders and Career Development 5
1.2.4 Importance 5
1.2.5 The Work Breakdown 6
1.3 Conclusion 6
2 Project Workflow 7
2.1 Introduction 7
2.2 The Data Team Context 7
2.2.1 Embedding vs. Pooling Resources 8
2.2.2 Research 8
2.2.3 Prototyping 9
2.2.4 A Combined Workflow 10
2.3 Agile Development and the Product Focus 10
2.3.1 The 12 Principles 11
2.4 Conclusion 15
3 Quantifying Error 17
3.1 Introduction 17
3.2 Quantifying Error in Measured Values 17
3.3 Sampling Error 19
3.4 Error Propagation 21
3.5 Conclusion 23
4 Data Encoding and Preprocessing 25
4.1 Introduction 25
4.2 Simple Text Preprocessing 26
4.2.1 Tokenization 26
4.2.2 N-grams 27
4.2.3 Sparsity 28
4.2.4 Feature Selection 28
4.2.5 Representation Learning 30
4.3 Information Loss 33
4.4 Conclusion 34
5 Hypothesis Testing 37
5.1 Introduction 37
5.2 What Is a Hypothesis? 37
5.3 Types of Errors 39
5.4 P-values and Confidence Intervals 40
5.5 Multiple Testing and "P-hacking" 41
5.6 An Example 42
5.7 Planning and Context 43
5.8 Conclusion 44
6 Data Visualization 45
6.1 Introduction 45
6.2 Distributions and Summary Statistics 45
6.2.1 Distributions and Histograms 46
6.2.2 Scatter Plots and Heat Maps 51
6.2.3 Box Plots and Error Bars 55
6.3 Time-Series Plots 58
6.3.1 Rolling Statistics 58
6.3.2 Auto-Correlation 60
6.4 Graph Visualization 61
6.4.1 Layout Algorithms 62
6.4.2 Time Complexity 64
6.5 Conclusion 64
II: Algorithms and Architectures 67
7 Introduction to Algorithms and Architectures 69
7.1 Introduction 69
7.2 Architectures 70
7.2.1 Services 71
7.2.2 Data Sources 72
7.2.3 Batch and Online Computing 72
7.2.4 Scaling 73
7.3 Models 74
7.3.1 Training 74
7.3.2 Prediction 75
7.3.3 Validation 76
7.4 Conclusion 77
8 Comparison 79
8.1 Introduction 79
8.2 Jaccard Distance 79
8.2.1 The Algorithm 80
8.2.2 Time Complexity 81
8.2.3 Memory Considerations 81
8.2.4 A Distributed Approach 81
8.3 MinHash 82
8.3.1 Assumptions 83
8.3.2 Time and Space Complexity 83
8.3.3 Tools 83
8.3.4 A Distributed Approach 83
8.4 Cosine Similarity 84
8.4.1 Complexity 85
8.4.2 Memory Considerations 85
8.4.3 A Distributed Approach 86
8.5 Mahalanobis Distance 86
8.5.1 Complexity 86
8.5.2 Memory Considerations 87
8.5.3 A Distributed Approach 87
8.6 Conclusion 88
9 Regression 89
9.1 Introduction 89
9.1.1 Choosing the Model 90
9.1.2 Choosing the Objective Function 90
9.1.3 Fitting 91
9.1.4 Validation 92
9.2 Linear Least Squares 96
9.2.1 Assumptions 97
9.2.2 Complexity 97
9.2.3 Memory Considerations 97
9.2.4 Tools 98
9.2.5 A Distributed Approach 98
9.2.6 A Worked Example 98
9.3 Nonlinear Regression with Linear Regression 105
9.3.1 Uncertainty 107
9.4 Random Forest 109
9.4.1 Decision Trees 109
9.4.2 Random Forests 112
9.5 Conclusion 115
10 Classification and Clustering 117
10.1 Introduction 117
10.2 Logistic Regression 118
10.2.1 Assumptions 121
10.2.2 Time Complexity 121
10.2.3 Memory Considerations 122
10.2.4 Tools 122
10.3 Bayesian Inference, Naive Bayes 122
10.3.1 Assumptions 124
10.3.2 Complexity 124
10.3.3 Memory Considerations 124
10.3.4 Tools 124
10.4 K-Means 125
10.4.1 Assumptions 127
10.4.2 Complexity 128
10.4.3 Memory Considerations 128
10.4.4 Tools 128
10.5 Leading Eigenvalue 128
10.5.1 Complexity 129
10.5.2 Memory Considerations 130
10.5.3 Tools 130
10.6 Greedy Louvain 130
10.6.1 Assumptions 130
10.6.2 Complexity 130
10.6.3 Memory Considerations 131
10.6.4 Tools 131
10.7 Nearest Neighbors 131
10.7.1 Assumptions 132
10.7.2 Complexity 132
10.7.3 Memory Considerations 133
10.7.4 Tools 133
10.8 Conclusion 133
11 Bayesian Networks 135
11.1 Introduction 135
11.2 Causal Graphs, Conditional Independence, and Markovity 136
11.2.1 Causal Graphs and Conditional Independence 136
11.2.2 Stability and Dependence 137
11.3 D-separation and the Markov Property 138
11.3.1 Markovity and Factorization 138
11.3.2 D-separation 139
11.4 Causal Graphs as Bayesian Networks 142
11.4.1 Linear Regression 142
11.5 Fitting Models 143
11.6 Conclusion 147
12 Dimensional Reduction and Latent Variable Models 149
12.1 Introduction 149
12.2 Priors 149
12.3 Factor Analysis 151
12.4 Principal Components Analysis 152
12.4.1 Complexity 154
12.4.2 Memory Considerations 154
12.4.3 Tools 154
12.5 Independent Component Analysis 154
12.5.1 Assumptions 158
12.5.2 Complexity 158
12.5.3 Memory Considerations 159
12.5.4 Tools 159
12.6 Latent Dirichlet Allocation 159
12.7 Conclusion 165
13 Causal Inference 167
13.1 Introduction 167
13.2 Experiments 168
13.3 Observation: An Example 171
13.4 Controlling to Block Non-causal Paths 177
13.4.1 The G-formula 179
13.5 Machine-Learning Estimators 182
13.5.1 The G-formula Revisited 182
13.5.2 An Example 183
13.6 Conclusion 187
14 Advanced Machine Learning 189
14.1 Introduction 189
14.2 Optimization 189
14.3 Neural Networks 191
14.3.1 Layers 192
14.3.2 Capacity 193
14.3.3 Overfitting 196
14.3.4 Batch Fitting 199
14.3.5 Loss Functions 200
14.4 Conclusion 201
III: Bottlenecks and Optimizations 203
15 Hardware Fundamentals 205
15.1 Introduction 205
15.2 Random Access Memory 205
15.2.1 Access 205
15.2.2 Volatility 206
15.3 Nonvolatile/Persistent Storage 206
15.3.1 Hard Disk Drives or "Spinning Disks" 207
15.3.2 SSDs 207
15.3.3 Latency 207
15.3.4 Paging 207
15.3.5 Thrashing 208
15.4 Throughput 208
15.4.1 Locality 208
15.4.2 Execution-Level Locality 208
15.4.3 Network Locality 209
15.5 Processors 209
15.5.1 Clock Rate 209
15.5.2 Cores 210
15.5.3 Threading 210
15.5.4 Branch Prediction 210
15.6 Conclusion 212
16 Software Fundamentals 213
16.1 Introduction 213
16.2 Paging 213
16.3 Indexing 214
16.4 Granularity 214
16.5 Robustness 216
16.6 Extract, Transfer/Transform, Load 216
16.7 Conclusion 216
17 Software Architecture 217
17.1 Introduction 217
17.2 Client-Server Architecture 217
17.3 N-tier/Service-Oriented Architecture 218
17.4 Microservices 220
17.5 Monolith 220
17.6 Practical Cases (Mix-and-Match Architectures) 221
17.7 Conclusion 221
18 The CAP Theorem 223
18.1 Introduction 223
18.2 Consistency/Concurrency 223
18.2.1 Conflict-Free Replicated Data Types 224
18.3 Availability 225
18.3.1 Redundancy 225
18.3.2 Front Ends and Load Balancers 225
18.3.3 Client-Side Load Balancing 228
18.3.4 Data Layer 228
18.3.5 Jobs and Taskworkers 230
18.3.6 Failover 230
18.4 Partition Tolerance 231
18.4.1 Split Brains 231
18.5 Conclusion 232
19 Logical Network Topological Nodes 233
19.1 Introduction 233
19.2 Network Diagrams 233
19.3 Load Balancing 234
19.4 Caches 235
19.4.1 Application-Level Caching 236
19.4.2 Cache Services 237
19.4.3 Write-Through Caches 238
19.5 Databases 238
19.5.1 Primary and Replica 238
19.5.2 Multimaster 239
19.5.3 A/B Replication 240
19.6 Queues 241
19.6.1 Task Scheduling and Parallelization 241
19.6.2 Asynchronous Process Execution 242
19.6.3 API Buffering 243
19.7 Conclusion 243
Bibliography 245
Index 247
Andrew Kelleher is a staff software engineer and distributed systems architect at Venmo. He was previously a staff software engineer at BuzzFeed and has worked on data pipelines and algorithm implementations for modern optimization. He graduated with a BS in physics from Clemson University. He runs a meetup in New York City that studies the fundamentals behind distributed systems in the context of production applications, and was ranked one of Fast Company's most creative people two years in a row.

Adam Kelleher wrote this book while working as principal data scientist at BuzzFeed and adjunct professor at Columbia University in the City of New York. As of May 2018, he is chief data scientist for research at Barclays and teaches causal inference and machine learning products at Columbia. He graduated from Clemson University with a BS in physics, and has a PhD in cosmology from the University of North Carolina at Chapel Hill.