Preface |
|
xi | |
Author |
|
xiii | |
|
1 Data, Exploratory Analysis, and R |
|
|
1 | (28) |
|
1.1 Why do we analyze data? |
|
|
1 | (1) |
|
1.2 The view from 90,000 feet |
|
|
2 | (9) |
|
|
2 | (2) |
|
1.2.2 Exploratory analysis |
|
|
4 | (3) |
|
1.2.3 Computers, software, and R |
|
|
7 | (4) |
|
1.3 A representative R session |
|
|
11 | (10) |
|
1.4 Organization of this book |
|
|
21 | (5) |
|
|
26 | (3) |
|
|
29 | (50) |
|
2.1 Exploratory vs. explanatory graphics |
|
|
29 | (3) |
|
2.2 Graphics systems in R |
|
|
32 | (5) |
|
|
33 | (1) |
|
|
33 | (1) |
|
|
34 | (2) |
|
2.2.4 The ggplot2 package |
|
|
36 | (1) |
|
|
37 | (7) |
|
2.3.1 The flexibility of the plot function |
|
|
37 | (3) |
|
2.3.2 S3 classes and generic functions |
|
|
40 | (2) |
|
2.3.3 Optional parameters for base graphics |
|
|
42 | (2) |
|
2.4 Adding details to plots |
|
|
44 | (8) |
|
2.4.1 Adding points and lines to a scatterplot |
|
|
44 | (4) |
|
2.4.2 Adding text to a plot |
|
|
48 | (1) |
|
2.4.3 Adding a legend to a plot |
|
|
49 | (1) |
|
|
50 | (2) |
|
2.5 A few different plot types |
|
|
52 | (5) |
|
2.5.1 Pie charts and why they should be avoided |
|
|
53 | (1) |
|
|
54 | (1) |
|
2.5.3 The symbols function |
|
|
55 | (2) |
|
|
57 | (7) |
|
2.6.1 Setting up simple arrays with mfrow |
|
|
58 | (3) |
|
2.6.2 Using the layout function |
|
|
61 | (3) |
|
|
64 | (6) |
|
2.7.1 A few general guidelines |
|
|
64 | (2) |
|
|
66 | (2) |
|
2.7.3 The tableplot function |
|
|
68 | (2) |
|
|
70 | (9) |
|
3 Exploratory Data Analysis: A First Look |
|
|
79 | (62) |
|
3.1 Exploring a new dataset |
|
|
80 | (7) |
|
|
81 | (1) |
|
3.1.2 Examining the basic data characteristics |
|
|
82 | (2) |
|
3.1.3 Variable types in practice |
|
|
84 | (3) |
|
3.2 Summarizing numerical data |
|
|
87 | (13) |
|
3.2.1 "Typical" values: the mean |
|
|
88 | (1) |
|
3.2.2 "Spread": the standard deviation |
|
|
88 | (2) |
|
3.2.3 Limitations of simple summary statistics |
|
|
90 | (2) |
|
3.2.4 The Gaussian assumption |
|
|
92 | (3) |
|
3.2.5 Is the Gaussian assumption reasonable? |
|
|
95 | (5) |
|
3.3 Anomalies in numerical data |
|
|
100 | (30) |
|
3.3.1 Outliers and their influence |
|
|
100 | (4) |
|
3.3.2 Detecting univariate outliers |
|
|
104 | (12) |
|
3.3.3 Inliers and their detection |
|
|
116 | (2) |
|
|
118 | (2) |
|
3.3.5 Missing data, possibly disguised |
|
|
120 | (5) |
|
|
125 | (5) |
|
3.4 Visualizing relations between variables |
|
|
130 | (7) |
|
3.4.1 Scatterplots between numerical variables |
|
|
131 | (2) |
|
3.4.2 Boxplots: numerical vs. categorical variables |
|
|
133 | (2) |
|
3.4.3 Mosaic plots: categorical scatterplots |
|
|
135 | (2) |
|
|
137 | (4) |
|
4 Working with External Data |
|
|
141 | (40) |
|
|
142 | (3) |
|
|
145 | (3) |
|
4.2.1 Entering the data by hand |
|
|
145 | (2) |
|
4.2.2 Manual data entry is bad but sometimes expedient |
|
|
147 | (1) |
|
4.3 Interacting with the Internet |
|
|
148 | (4) |
|
4.3.1 Previews of three Internet data examples |
|
|
148 | (3) |
|
4.3.2 A very brief introduction to HTML |
|
|
151 | (1) |
|
4.4 Working with CSV files |
|
|
152 | (6) |
|
4.4.1 Reading and writing CSV files |
|
|
152 | (2) |
|
4.4.2 Spreadsheets and CSV files are not the same thing |
|
|
154 | (1) |
|
4.4.3 Two potential problems with CSV files |
|
|
155 | (3) |
|
4.5 Working with other file types |
|
|
158 | (7) |
|
4.5.1 Working with text files |
|
|
158 | (4) |
|
4.5.2 Saving and retrieving R objects |
|
|
162 | (1) |
|
|
163 | (2) |
|
4.6 Merging data from different sources |
|
|
165 | (3) |
|
4.7 A brief introduction to databases |
|
|
168 | (10) |
|
4.7.1 Relational databases, queries, and SQL |
|
|
169 | (2) |
|
4.7.2 An introduction to the sqldf package |
|
|
171 | (3) |
|
4.7.3 An overview of R's database support |
|
|
174 | (1) |
|
4.7.4 An introduction to the RSQLite package |
|
|
175 | (3) |
|
|
178 | (3) |
|
5 Linear Regression Models |
|
|
181 | (48) |
|
5.1 Modeling the whiteside data |
|
|
181 | (7) |
|
5.1.1 Describing lines in the plane |
|
|
182 | (3) |
|
5.1.2 Fitting lines to points in the plane |
|
|
185 | (1) |
|
5.1.3 Fitting the whiteside data |
|
|
186 | (2) |
|
5.2 Overfitting and data splitting |
|
|
188 | (13) |
|
5.2.1 An overfitting example |
|
|
188 | (4) |
|
5.2.2 The training/validation/holdout split |
|
|
192 | (4) |
|
5.2.3 Two useful model validation tools |
|
|
196 | (5) |
|
5.3 Regression with multiple predictors |
|
|
201 | (10) |
|
|
202 | (5) |
|
5.3.2 The problem of collinearity |
|
|
207 | (4) |
|
5.4 Using categorical predictors |
|
|
211 | (3) |
|
5.5 Interactions in linear regression models |
|
|
214 | (3) |
|
5.6 Variable transformations in linear regression |
|
|
217 | (4) |
|
5.7 Robust regression: a very brief introduction |
|
|
221 | (3) |
|
|
224 | (5) |
|
|
229 | (18) |
|
6.1 Crafting good data stories |
|
|
229 | (3) |
|
6.1.1 The importance of clarity |
|
|
230 | (1) |
|
6.1.2 The basic elements of an effective data story |
|
|
231 | (1) |
|
6.2 Different audiences have different needs |
|
|
232 | (3) |
|
6.2.1 The executive summary or abstract |
|
|
233 | (1) |
|
|
234 | (1) |
|
|
235 | (1) |
|
6.3 Three example data stories |
|
|
235 | (12) |
|
6.3.1 The Big Mac and Grande Latte economic indices |
|
|
236 | (4) |
|
6.3.2 Small losses in the Australian vehicle insurance data |
|
|
240 | (3) |
|
6.3.3 Unexpected heterogeneity: the Boston housing data |
|
|
243 | (4) |
|
|
247 | (42) |
|
7.1 Interactive use versus programming |
|
|
247 | (9) |
|
7.1.1 A simple example: computing Fibonacci numbers |
|
|
248 | (4) |
|
7.1.2 Creating your own functions |
|
|
252 | (4) |
|
7.2 Key elements of the R language |
|
|
256 | (19) |
|
7.2.1 Functions and their arguments |
|
|
256 | (4) |
|
|
260 | (2) |
|
|
262 | (6) |
|
7.2.4 Replacing loops with apply functions |
|
|
268 | (2) |
|
7.2.5 Generic functions revisited |
|
|
270 | (5) |
|
7.3 Good programming practices |
|
|
275 | (2) |
|
7.3.1 Modularity and the DRY principle |
|
|
275 | (1) |
|
|
275 | (1) |
|
|
276 | (1) |
|
7.3.4 Testing and debugging |
|
|
276 | (1) |
|
7.4 Five programming examples |
|
|
277 | (7) |
|
7.4.1 The function ValidationRsquared |
|
|
277 | (1) |
|
7.4.2 The function TVHsplit |
|
|
278 | (1) |
|
7.4.3 The function PredictedVsObservedPlot |
|
|
278 | (1) |
|
7.4.4 The function BasicSummary |
|
|
279 | (2) |
|
7.4.5 The function FindOutliers |
|
|
281 | (3) |
|
|
284 | (1) |
|
|
285 | (4) |
|
|
289 | (68) |
|
8.1 The fundamentals of text data analysis |
|
|
290 | (8) |
|
8.1.1 The basic steps in analyzing text data |
|
|
290 | (3) |
|
8.1.2 An illustrative example |
|
|
293 | (5) |
|
8.2 Basic character functions in R |
|
|
298 | (13) |
|
|
298 | (3) |
|
|
301 | (1) |
|
8.2.3 Application to missing data and alternative spellings |
|
|
302 | (2) |
|
8.2.4 The sub and gsub functions |
|
|
304 | (2) |
|
8.2.5 The strsplit function |
|
|
306 | (1) |
|
8.2.6 Another application: ConvertAutoMpgRecords |
|
|
307 | (2) |
|
|
309 | (2) |
|
8.3 A brief introduction to regular expressions |
|
|
311 | (8) |
|
8.3.1 Regular expression basics |
|
|
311 | (2) |
|
8.3.2 Some useful regular expression examples |
|
|
313 | (6) |
|
8.4 An aside: ASCII vs. UNICODE |
|
|
319 | (1) |
|
8.5 Quantitative text analysis |
|
|
320 | (10) |
|
8.5.1 Document-term and document-feature matrices |
|
|
320 | (2) |
|
8.5.2 String distances and approximate matching |
|
|
322 | (8) |
|
8.6 Three detailed examples |
|
|
330 | (23) |
|
8.6.1 Characterizing a book |
|
|
331 | (5) |
|
8.6.2 The cpus data frame |
|
|
336 | (8) |
|
8.6.3 The unclaimed bank account data |
|
|
344 | (9) |
|
|
353 | (4) |
|
9 Exploratory Data Analysis: A Second Look |
|
|
357 | (102) |
|
9.1 An example: repeated measurements |
|
|
358 | (6) |
|
9.1.1 Summary and practical implications |
|
|
358 | (1) |
|
|
359 | (5) |
|
9.2 Confidence intervals and significance |
|
|
364 | (11) |
|
9.2.1 Probability models versus data |
|
|
364 | (2) |
|
9.2.2 Quantiles of a distribution |
|
|
366 | (2) |
|
9.2.3 Confidence intervals |
|
|
368 | (4) |
|
9.2.4 Statistical significance and p-values |
|
|
372 | (3) |
|
9.3 Characterizing a binary variable |
|
|
375 | (11) |
|
9.3.1 The binomial distribution |
|
|
375 | (2) |
|
9.3.2 Binomial confidence intervals |
|
|
377 | (5) |
|
|
382 | (4) |
|
9.4 Characterizing count data |
|
|
386 | (7) |
|
9.4.1 The Poisson distribution and rare events |
|
|
387 | (2) |
|
9.4.2 Alternative count distributions |
|
|
389 | (1) |
|
9.4.3 Discrete distribution plots |
|
|
390 | (3) |
|
9.5 Continuous distributions |
|
|
393 | (16) |
|
9.5.1 Limitations of the Gaussian distribution |
|
|
394 | (4) |
|
9.5.2 Some alternatives to the Gaussian distribution |
|
|
398 | (6) |
|
9.5.3 The qqPlot function revisited |
|
|
404 | (2) |
|
9.5.4 The problems of ties and implosion |
|
|
406 | (3) |
|
9.6 Associations between numerical variables |
|
|
409 | (18) |
|
9.6.1 Product-moment correlations |
|
|
409 | (4) |
|
9.6.2 Spearman's rank correlation measure |
|
|
413 | (2) |
|
9.6.3 The correlation trick |
|
|
415 | (3) |
|
9.6.4 Correlation matrices and correlation plots |
|
|
418 | (3) |
|
9.6.5 Robust correlations |
|
|
421 | (2) |
|
9.6.6 Multivariate outliers |
|
|
423 | (4) |
|
9.7 Associations between categorical variables |
|
|
427 | (11) |
|
|
427 | (2) |
|
9.7.2 The chi-squared measure and Cramer's V |
|
|
429 | (4) |
|
9.7.3 Goodman and Kruskal's tau measure |
|
|
433 | (5) |
|
9.8 Principal component analysis (PCA) |
|
|
438 | (9) |
|
9.9 Working with date variables |
|
|
447 | (2) |
|
|
449 | (10) |
|
10 More General Predictive Models |
|
|
459 | (66) |
|
10.1 A predictive modeling overview |
|
|
459 | (3) |
|
10.1.1 The predictive modeling problem |
|
|
460 | (1) |
|
10.1.2 The model-building process |
|
|
461 | (1) |
|
10.2 Binary classification and logistic regression |
|
|
462 | (16) |
|
10.2.1 Basic logistic regression formulation |
|
|
462 | (2) |
|
10.2.2 Fitting logistic regression models |
|
|
464 | (3) |
|
10.2.3 Evaluating binary classifier performance |
|
|
467 | (7) |
|
10.2.4 A brief introduction to glms |
|
|
474 | (4) |
|
10.3 Decision tree models |
|
|
478 | (13) |
|
10.3.1 Structure and fitting of decision trees |
|
|
479 | (6) |
|
10.3.2 A classification tree example |
|
|
485 | (2) |
|
10.3.3 A regression tree example |
|
|
487 | (4) |
|
10.4 Combining trees with regression |
|
|
491 | (7) |
|
10.5 Introduction to machine learning models |
|
|
498 | (8) |
|
10.5.1 The instability of simple tree-based models |
|
|
499 | (1) |
|
10.5.2 Random forest models |
|
|
500 | (2) |
|
10.5.3 Boosted tree models |
|
|
502 | (4) |
|
10.6 Three practical details |
|
|
506 | (15) |
|
10.6.1 Partial dependence plots |
|
|
507 | (6) |
|
10.6.2 Variable importance measures |
|
|
513 | (6) |
|
10.6.3 Thin levels and data partitioning |
|
|
519 | (2) |
|
|
521 | (4) |
|
11 Keeping It All Together |
|
|
525 | (14) |
|
11.1 Managing your R installation |
|
|
525 | (3) |
|
|
526 | (1) |
|
|
526 | (1) |
|
|
527 | (1) |
|
11.2 Managing files effectively |
|
|
528 | (5) |
|
11.2.1 Organizing directories |
|
|
528 | (3) |
|
11.2.2 Use appropriate file extensions |
|
|
531 | (1) |
|
11.2.3 Choose good file names |
|
|
532 | (1) |
|
|
533 | (3) |
|
|
533 | (1) |
|
|
534 | (1) |
|
11.3.3 Documenting results |
|
|
535 | (1) |
|
11.4 Introduction to reproducible computing |
|
|
536 | (3) |
|
11.4.1 The key ideas of reproducibility |
|
|
536 | (1) |
|
|
537 | (2) |
Bibliography |
|
539 | (5) |
Index |
|
544 | |