Foreword |
|
xv | |
Preface |
|
xvii | |
About the Authors |
|
xxi | |
I: Principles of Framing |
|
1 | (66) |
|
1 The Role of the Data Scientist |
|
|
3 | (4) |
|
|
3 | (1) |
|
1.2 The Role of the Data Scientist |
|
|
3 | (3) |
|
|
3 | (1) |
|
|
4 | (1) |
|
1.2.3 Ladders and Career Development |
|
|
5 | (1) |
|
|
5 | (1) |
|
|
6 | (1) |
|
|
6 | (1) |
|
|
7 | (10) |
|
|
7 | (1) |
|
2.2 The Data Team Context |
|
|
7 | (3) |
|
2.2.1 Embedding vs. Pooling Resources |
|
|
8 | (1) |
|
|
8 | (1) |
|
|
9 | (1) |
|
2.2.4 A Combined Workflow |
|
|
10 | (1) |
|
2.3 Agile Development and the Product Focus |
|
|
10 | (5) |
|
|
11 | (4) |
|
|
15 | (2) |
|
|
17 | (8) |
|
|
17 | (1) |
|
3.2 Quantifying Error in Measured Values |
|
|
17 | (2) |
|
|
19 | (2) |
|
|
21 | (2) |
|
|
23 | (2) |
|
4 Data Encoding and Preprocessing |
|
|
25 | (12) |
|
|
25 | (1) |
|
4.2 Simple Text Preprocessing |
|
|
26 | (7) |
|
|
26 | (1) |
|
|
27 | (1) |
|
|
28 | (1) |
|
|
28 | (2) |
|
4.2.5 Representation Learning |
|
|
30 | (3) |
|
|
33 | (1) |
|
|
34 | (3) |
|
|
37 | (8) |
|
|
37 | (1) |
|
5.2 What Is a Hypothesis? |
|
|
37 | (2) |
|
|
39 | (1) |
|
5.4 P-values and Confidence Intervals |
|
|
40 | (1) |
|
5.5 Multiple Testing and "P-hacking" |
|
|
41 | (1) |
|
|
42 | (1) |
|
|
43 | (1) |
|
|
44 | (1) |
|
|
45 | (22) |
|
|
45 | (1) |
|
6.2 Distributions and Summary Statistics |
|
|
45 | (13) |
|
6.2.1 Distributions and Histograms |
|
|
46 | (5) |
|
6.2.2 Scatter Plots and Heat Maps |
|
|
51 | (4) |
|
6.2.3 Box Plots and Error Bars |
|
|
55 | (3) |
|
|
58 | (3) |
|
|
58 | (2) |
|
|
60 | (1) |
|
|
61 | (3) |
|
|
62 | (2) |
|
|
64 | (1) |
|
|
64 | (3) |
II: Algorithms and Architectures |
|
67 | (136) |
|
7 Introduction to Algorithms and Architectures |
|
|
69 | (10) |
|
|
69 | (1) |
|
|
70 | (4) |
|
|
71 | (1) |
|
|
72 | (1) |
|
7.2.3 Batch and Online Computing |
|
|
72 | (1) |
|
|
73 | (1) |
|
|
74 | (3) |
|
|
74 | (1) |
|
|
75 | (1) |
|
|
76 | (1) |
|
|
77 | (2) |
|
|
79 | (10) |
|
|
79 | (1) |
|
|
79 | (3) |
|
|
80 | (1) |
|
|
81 | (1) |
|
8.2.3 Memory Considerations |
|
|
81 | (1) |
|
8.2.4 A Distributed Approach |
|
|
81 | (1) |
|
|
82 | (2) |
|
|
83 | (1) |
|
8.3.2 Time and Space Complexity |
|
|
83 | (1) |
|
|
83 | (1) |
|
8.3.4 A Distributed Approach |
|
|
83 | (1) |
|
|
84 | (2) |
|
|
85 | (1) |
|
8.4.2 Memory Considerations |
|
|
85 | (1) |
|
8.4.3 A Distributed Approach |
|
|
86 | (1) |
|
|
86 | (2) |
|
|
86 | (1) |
|
8.5.2 Memory Considerations |
|
|
87 | (1) |
|
8.5.3 A Distributed Approach |
|
|
87 | (1) |
|
|
88 | (1) |
|
|
89 | (28) |
|
|
89 | (7) |
|
|
90 | (1) |
|
9.1.2 Choosing the Objective Function |
|
|
90 | (1) |
|
|
91 | (1) |
|
|
92 | (4) |
|
|
96 | (9) |
|
|
97 | (1) |
|
|
97 | (1) |
|
9.2.3 Memory Considerations |
|
|
97 | (1) |
|
|
98 | (1) |
|
9.2.5 A Distributed Approach |
|
|
98 | (1) |
|
|
98 | (7) |
|
9.3 Nonlinear Regression with Linear Regression |
|
|
105 | (4) |
|
|
107 | (2) |
|
|
109 | (6) |
|
|
109 | (3) |
|
|
112 | (3) |
|
|
115 | (2) |
|
10 Classification and Clustering |
|
|
117 | (18) |
|
|
117 | (1) |
|
|
118 | (4) |
|
|
121 | (1) |
|
|
121 | (1) |
|
10.2.3 Memory Considerations |
|
|
122 | (1) |
|
|
122 | (1) |
|
10.3 Bayesian Inference, Naive Bayes |
|
|
122 | (3) |
|
|
124 | (1) |
|
|
124 | (1) |
|
10.3.3 Memory Considerations |
|
|
124 | (1) |
|
|
124 | (1) |
|
|
125 | (3) |
|
|
127 | (1) |
|
|
128 | (1) |
|
10.4.3 Memory Considerations |
|
|
128 | (1) |
|
|
128 | (1) |
|
|
128 | (2) |
|
|
129 | (1) |
|
10.5.2 Memory Considerations |
|
|
130 | (1) |
|
|
130 | (1) |
|
|
130 | (1) |
|
|
130 | (1) |
|
|
130 | (1) |
|
10.6.3 Memory Considerations |
|
|
131 | (1) |
|
|
131 | (1) |
|
|
131 | (2) |
|
|
132 | (1) |
|
|
132 | (1) |
|
10.7.3 Memory Considerations |
|
|
133 | (1) |
|
|
133 | (1) |
|
|
133 | (2) |
|
|
135 | (14) |
|
|
135 | (1) |
|
11.2 Causal Graphs, Conditional Independence, and Markovity |
|
|
136 | (2) |
|
11.2.1 Causal Graphs and Conditional Independence |
|
|
136 | (1) |
|
11.2.2 Stability and Dependence |
|
|
137 | (1) |
|
11.3 D-separation and the Markov Property |
|
|
138 | (4) |
|
11.3.1 Markovity and Factorization |
|
|
138 | (1) |
|
|
139 | (3) |
|
11.4 Causal Graphs as Bayesian Networks |
|
|
142 | (1) |
|
|
142 | (1) |
|
|
143 | (4) |
|
|
147 | (2) |
|
12 Dimensional Reduction and Latent Variable Models |
|
|
149 | (18) |
|
|
149 | (1) |
|
|
149 | (2) |
|
|
151 | (1) |
|
12.4 Principal Components Analysis |
|
|
152 | (2) |
|
|
154 | (1) |
|
12.4.2 Memory Considerations |
|
|
154 | (1) |
|
|
154 | (1) |
|
12.5 Independent Component Analysis |
|
|
154 | (5) |
|
|
158 | (1) |
|
|
158 | (1) |
|
12.5.3 Memory Considerations |
|
|
159 | (1) |
|
|
159 | (1) |
|
12.6 Latent Dirichlet Allocation |
|
|
159 | (6) |
|
|
165 | (2) |
|
|
167 | (22) |
|
|
167 | (1) |
|
|
168 | (3) |
|
13.3 Observation: An Example |
|
|
171 | (6) |
|
13.4 Controlling to Block Non-causal Paths |
|
|
177 | (5) |
|
|
179 | (3) |
|
13.5 Machine-Learning Estimators |
|
|
182 | (5) |
|
13.5.1 The G-formula Revisited |
|
|
182 | (1) |
|
|
183 | (4) |
|
|
187 | (2) |
|
14 Advanced Machine Learning |
|
|
189 | (14) |
|
|
189 | (1) |
|
|
189 | (2) |
|
|
191 | (10) |
|
|
192 | (1) |
|
|
193 | (3) |
|
|
196 | (3) |
|
|
199 | (1) |
|
|
200 | (1) |
|
|
201 | (2) |
III: Bottlenecks and Optimizations |
|
203 | (42) |
|
|
205 | (8) |
|
|
205 | (1) |
|
15.2 Random Access Memory |
|
|
205 | (1) |
|
|
205 | (1) |
|
|
206 | (1) |
|
15.3 Nonvolatile/Persistent Storage |
|
|
206 | (2) |
|
15.3.1 Hard Disk Drives or "Spinning Disks" |
|
|
207 | (1) |
|
|
207 | (1) |
|
|
207 | (1) |
|
|
207 | (1) |
|
|
208 | (1) |
|
|
208 | (1) |
|
|
208 | (1) |
|
15.4.2 Execution-Level Locality |
|
|
208 | (1) |
|
|
209 | (1) |
|
|
209 | (3) |
|
|
209 | (1) |
|
|
210 | (1) |
|
|
210 | (1) |
|
|
210 | (2) |
|
|
212 | (1) |
|
|
213 | (4) |
|
|
213 | (1) |
|
|
213 | (1) |
|
|
214 | (1) |
|
|
214 | (2) |
|
|
216 | (1) |
|
16.6 Extract, Transfer/Transform, Load |
|
|
216 | (1) |
|
|
216 | (1) |
|
|
217 | (6) |
|
|
217 | (1) |
|
17.2 Client-Server Architecture |
|
|
217 | (1) |
|
17.3 N-tier/Service-Oriented Architecture |
|
|
218 | (2) |
|
|
220 | (1) |
|
|
220 | (1) |
|
17.6 Practical Cases (Mix-and-Match Architectures) |
|
|
221 | (1) |
|
|
221 | (2) |
|
|
223 | (10) |
|
|
223 | (1) |
|
18.2 Consistency/Concurrency |
|
|
223 | (2) |
|
18.2.1 Conflict-Free Replicated Data Types |
|
|
224 | (1) |
|
|
225 | (6) |
|
|
225 | (1) |
|
18.3.2 Front Ends and Load Balancers |
|
|
225 | (3) |
|
18.3.3 Client-Side Load Balancing |
|
|
228 | (1) |
|
|
228 | (2) |
|
18.3.5 Jobs and Taskworkers |
|
|
230 | (1) |
|
|
230 | (1) |
|
|
231 | (1) |
|
|
231 | (1) |
|
|
232 | (1) |
|
19 Logical Network Topological Nodes |
|
|
233 | (12) |
|
|
233 | (1) |
|
|
233 | (1) |
|
|
234 | (1) |
|
|
235 | (3) |
|
19.4.1 Application-Level Caching |
|
|
236 | (1) |
|
|
237 | (1) |
|
19.4.3 Write-Through Caches |
|
|
238 | (1) |
|
|
238 | (3) |
|
19.5.1 Primary and Replica |
|
|
238 | (1) |
|
|
239 | (1) |
|
|
240 | (1) |
|
|
241 | (2) |
|
19.6.1 Task Scheduling and Parallelization |
|
|
241 | (1) |
|
19.6.2 Asynchronous Process Execution |
|
|
242 | (1) |
|
|
243 | (1) |
|
|
243 | (2) |
Bibliography |
|
245 | (2) |
Index |
|
247 | |