Preface |
|
vii | |
1 An Introduction |
|
1 | (4) |
|
1.1 Why we wrote this book |
|
|
2 | (1) |
|
|
2 | (1) |
|
1.3 How this book is structured |
|
|
3 | (2) |
2 The Case for Programming |
|
5 | (8) |
|
2.1 Doing visual analytics since the 1780s |
|
|
5 | (2) |
|
2.2 How does programming work? |
|
|
7 | (1) |
|
2.3 Setting up R and RStudio |
|
|
8 | (3) |
|
|
8 | (1) |
|
|
9 | (1) |
|
2.3.3 DIY: Running your first code snippet |
|
|
10 | (1) |
|
2.4 Making the case for open-source software |
|
|
11 | (2) |
3 Elements of Programming |
|
13 | (20) |
|
|
13 | (1) |
|
|
14 | (3) |
|
|
14 | (1) |
|
|
14 | (1) |
|
|
14 | (2) |
|
|
16 | (1) |
|
|
16 | (1) |
|
|
16 | (1) |
|
|
17 | (1) |
|
|
18 | (4) |
|
|
18 | (1) |
|
|
18 | (1) |
|
|
19 | (1) |
|
|
20 | (1) |
|
3.4.5 The class function, v2 |
|
|
21 | (1) |
|
|
22 | (1) |
|
|
22 | (2) |
|
3.5.1 Base R and the need to extend functionality |
|
|
22 | (1) |
|
3.5.2 Installing packages |
|
|
22 | (1) |
|
|
23 | (1) |
|
3.5.4 Package management and pacman |
|
|
23 | (1) |
|
|
24 | (3) |
|
|
24 | (2) |
|
|
26 | (1) |
|
|
26 | (1) |
|
|
27 | (1) |
|
|
27 | (1) |
|
3.7.2 Google and online communities |
|
|
27 | (1) |
|
|
27 | (2) |
|
|
27 | (1) |
|
|
28 | (1) |
|
3.9 DIY: Loading solar energy data from the web |
|
|
29 | (4) |
4 Transforming Data |
|
33 | (28) |
|
4.1 Importing and assembling data |
|
|
34 | (4) |
|
|
35 | (3) |
|
|
38 | (7) |
|
4.2.1 Text manipulation functions |
|
|
39 | (1) |
|
4.2.2 Regular Expressions (RegEx) |
|
|
40 | (3) |
|
4.2.3 DIY: Working with PII |
|
|
43 | (1) |
|
|
44 | (1) |
|
4.3 The structure of data |
|
|
45 | (6) |
|
4.3.1 Matrix or data frame? |
|
|
45 | (1) |
|
|
45 | (1) |
|
|
46 | (1) |
|
4.3.4 Sorting and re-ordering |
|
|
47 | (1) |
|
|
48 | (1) |
|
|
49 | (2) |
|
|
51 | (5) |
|
|
52 | (1) |
|
|
53 | (2) |
|
|
55 | (1) |
|
|
56 | (1) |
|
|
57 | (4) |
|
|
57 | (1) |
|
|
58 | (3) |
5 Record Linkage |
|
61 | (22) |
|
5.1 Edward Kennedy, Bill de Blasio, and Bayerische Motoren Werke |
|
|
61 | (1) |
|
5.2 How does record linkage work? |
|
|
62 | (1) |
|
5.3 Pre-processing the data |
|
|
63 | (3) |
|
|
66 | (1) |
|
5.5 Deterministic record linkage |
|
|
67 | (3) |
|
|
70 | (4) |
|
|
70 | (1) |
|
5.6.2 Phonetic algorithms |
|
|
71 | (2) |
|
5.6.3 New tricks, same heuristics |
|
|
73 | (1) |
|
5.7 Probabilistic record linkage |
|
|
74 | (2) |
|
|
76 | (1) |
|
5.9 DIY: Matching people in the UK-UN sanction lists |
|
|
77 | (3) |
|
|
80 | (3) |
|
|
80 | (1) |
|
|
81 | (2) |
6 Exploratory Data Analysis |
|
83 | (30) |
|
6.1 Visually detecting patterns |
|
|
83 | (2) |
|
|
85 | (2) |
|
6.3 Visualizing distributions |
|
|
87 | (7) |
|
|
92 | (2) |
|
6.4 Exploring missing values |
|
|
94 | (9) |
|
|
94 | (1) |
|
6.4.2 Missing value functions |
|
|
95 | (1) |
|
6.4.3 Exploring missingness |
|
|
96 | (2) |
|
6.4.4 Treating missingness |
|
|
98 | (5) |
|
6.5 Analyzing time series |
|
|
103 | (2) |
|
6.6 Finding visual correlations |
|
|
105 | (4) |
|
6.6.1 Visual analysis on high-dimensional datasets |
|
|
108 | (1) |
|
|
109 | (4) |
7 Regression Analysis |
|
113 | (26) |
|
7.1 Measuring and predicting the preferences of society |
|
|
113 | (1) |
|
7.2 Simple linear regression |
|
|
114 | (7) |
|
|
116 | (1) |
|
7.2.2 Ordinary least squares |
|
|
117 | (1) |
|
7.2.3 DIY: A simple hedonic model |
|
|
118 | (3) |
|
7.3 Checking for linearity |
|
|
121 | (2) |
|
|
123 | (14) |
|
|
124 | (1) |
|
|
125 | (2) |
|
|
127 | (1) |
|
7.4.4 Measures of model fitness |
|
|
128 | (1) |
|
7.4.5 DIY: Choosing between models |
|
|
129 | (3) |
|
7.4.6 DIY: Housing prices over time |
|
|
132 | (5) |
|
|
137 | (2) |
8 Framing Classification |
|
139 | (24) |
|
|
139 | (2) |
|
|
139 | (1) |
|
8.1.2 What's a classifier? |
|
|
140 | (1) |
|
8.2 The basics of classifiers |
|
|
141 | (5) |
|
8.2.1 The anatomy of a classifier |
|
|
141 | (1) |
|
8.2.2 Finding signal in classification contexts |
|
|
142 | (1) |
|
|
142 | (4) |
|
|
146 | (10) |
|
8.3.1 The social science workhorse |
|
|
146 | (1) |
|
8.3.2 Telling the story from coefficients |
|
|
147 | (1) |
|
8.3.3 How are coefficients learned? |
|
|
148 | (1) |
|
|
148 | (2) |
|
8.3.5 DIY: Expanding health care coverage |
|
|
150 | (6) |
|
8.4 Regularized regression |
|
|
156 | (5) |
|
8.4.1 From regularization to interpretation |
|
|
158 | (1) |
|
8.4.2 DIY: Re-visiting health care coverage |
|
|
158 | (3) |
|
|
161 | (2) |
9 Three Quantitative Perspectives |
|
163 | (22) |
|
|
164 | (1) |
|
|
165 | (9) |
|
9.2.1 Potential outcomes framework |
|
|
166 | (1) |
|
9.2.2 Regression discontinuity |
|
|
167 | (5) |
|
9.2.3 Difference-in-differences |
|
|
172 | (2) |
|
|
174 | (8) |
|
9.3.1 Understanding accuracy |
|
|
175 | (5) |
|
|
180 | (2) |
|
|
182 | (3) |
10 Prediction |
|
185 | (32) |
|
10.1 The role of algorithms |
|
|
185 | (2) |
|
10.2 Data science pipelines |
|
|
187 | (2) |
|
10.3 K-Nearest Neighbors (k-NN) |
|
|
189 | (6) |
|
|
190 | (2) |
|
10.3.2 DIY: Predicting the extent of storm damage |
|
|
192 | (3) |
|
|
195 | (15) |
|
10.4.1 Classification and Regression Trees (CART) |
|
|
196 | (5) |
|
|
201 | (2) |
|
|
203 | (1) |
|
10.4.4 DIY: Wage prediction with CART and random forests |
|
|
204 | (6) |
|
10.5 An introduction to other algorithms |
|
|
210 | (5) |
|
|
211 | (1) |
|
|
212 | (3) |
|
|
215 | (2) |
11 Cluster Analysis |
|
217 | (20) |
|
11.1 Things closer together are more related |
|
|
217 | (1) |
|
11.2 Foundational concepts |
|
|
218 | (1) |
|
|
219 | (7) |
|
|
219 | (2) |
|
|
221 | (2) |
|
11.3.3 DIY: Clustering for economic development |
|
|
223 | (3) |
|
11.4 Hierarchical clustering |
|
|
226 | (8) |
|
|
227 | (2) |
|
|
229 | (1) |
|
11.4.3 DIY: Clustering time series |
|
|
230 | (4) |
|
|
234 | (3) |
12 Spatial Data |
|
237 | (22) |
|
12.1 Anticipating climate impacts |
|
|
237 | (2) |
|
12.2 Classes of spatial data |
|
|
239 | (1) |
|
|
239 | (5) |
|
|
241 | (1) |
|
|
242 | (1) |
|
12.3.3 DIY: Working with raster math |
|
|
242 | (2) |
|
|
244 | (12) |
|
|
244 | (1) |
|
12.4.2 Converting points to spatial objects |
|
|
245 | (1) |
|
12.4.3 Coordinate Reference Systems |
|
|
246 | (2) |
|
12.4.4 DIY: Converting coordinates into point vectors |
|
|
248 | (1) |
|
12.4.5 Reading shapefiles |
|
|
249 | (1) |
|
|
250 | (2) |
|
12.4.7 DIY: Analyzing spatial relationships |
|
|
252 | (4) |
|
|
256 | (3) |
13 Natural Language |
|
259 | (24) |
|
13.1 Transforming text into data |
|
|
260 | (6) |
|
13.1.1 Processing textual data |
|
|
260 | (2) |
|
|
262 | (1) |
|
13.1.3 Document similarities |
|
|
263 | (1) |
|
13.1.4 DIY: Basic text processing |
|
|
263 | (3) |
|
|
266 | (5) |
|
13.2.1 Sentiment lexicons |
|
|
267 | (1) |
|
13.2.2 Calculating sentiment scores |
|
|
267 | (2) |
|
13.2.3 DIY: Scoring text for sentiment |
|
|
269 | (2) |
|
|
271 | (9) |
|
|
271 | (1) |
|
13.3.2 How do topic models work? |
|
|
272 | (1) |
|
13.3.3 DIY: Finding topics in presidential speeches |
|
|
273 | (7) |
|
|
280 | (3) |
|
|
280 | (1) |
|
|
281 | (2) |
14 The Ethics of Data Science |
|
283 | (16) |
|
|
283 | (1) |
|
|
284 | (5) |
|
|
285 | (2) |
|
|
287 | (2) |
|
|
289 | (1) |
|
|
289 | (2) |
|
14.3.1 Score-based fairness |
|
|
290 | (1) |
|
14.3.2 Accuracy-based fairness |
|
|
290 | (1) |
|
14.3.3 Other considerations |
|
|
291 | (1) |
|
14.4 Transparency and Interpretability |
|
|
291 | (4) |
|
|
292 | (1) |
|
|
293 | (2) |
|
|
295 | (2) |
|
14.5.1 An evolving landscape |
|
|
295 | (1) |
|
14.5.2 Privacy strategies |
|
|
295 | (2) |
|
|
297 | (2) |
15 Developing Data Products |
|
299 | (18) |
|
15.1 Meeting people where they are |
|
|
299 | (2) |
|
15.2 Designing for impact |
|
|
301 | (3) |
|
15.2.1 Identify a user need |
|
|
301 | (1) |
|
15.2.2 Size up the situation |
|
|
302 | (1) |
|
|
303 | (1) |
|
15.2.4 Test and evaluate its impact, then iterate |
|
|
303 | (1) |
|
15.3 Communicating data science projects |
|
|
304 | (4) |
|
|
304 | (2) |
|
|
306 | (2) |
|
15.4 Reporting dashboards |
|
|
308 | (3) |
|
|
311 | (2) |
|
15.5.1 Prioritization and targeting lists |
|
|
311 | (1) |
|
|
311 | (2) |
|
15.6 Continuing to hone your craft |
|
|
313 | (2) |
|
|
315 | (2) |
16 Building Data Teams |
|
317 | (14) |
|
16.1 Establishing a baseline |
|
|
317 | (3) |
|
|
320 | (6) |
|
16.2.1 Center of excellence |
|
|
320 | (1) |
|
|
321 | (2) |
|
|
323 | (1) |
|
16.2.4 Matrix organizations |
|
|
324 | (2) |
|
|
326 | (2) |
|
|
326 | (1) |
|
|
326 | (1) |
|
16.3.3 Data product roles |
|
|
327 | (1) |
|
16.3.4 Titles in the civil service system |
|
|
328 | (1) |
|
|
328 | (2) |
|
16.4.1 Job postings and application review |
|
|
328 | (1) |
|
|
329 | (1) |
|
|
330 | (1) |
Appendix A: Planning a Data Product |
|
331 | (4) |
|
|
331 | (4) |
Appendix B: Interview Questions |
|
335 | (8) |
|
Getting to know the candidate |
|
|
335 | (1) |
|
|
335 | (1) |
|
|
335 | (1) |
|
|
336 | (5) |
|
|
336 | (1) |
|
|
337 | (1) |
|
Estimation versus prediction |
|
|
337 | (1) |
|
|
338 | (1) |
|
|
339 | (1) |
|
Communication and visualization |
|
|
339 | (1) |
|
|
340 | (1) |
|
|
341 | (2) |
References |
|
343 | (14) |
Index |
|
357 | |