Preface  xi
Acknowledgments  xiii
List of Figures  xv
List of Tables  xix
1 Introduction  1
1.1 How to Read This Book  2
1.2 Reproducibility  3
Part I: R and Data Mining  5
2 Introduction to R  7
2.1 Starting with R  7
2.2 Basic Interaction with the R Console  9
2.3 R Objects and Variables  10
2.4 R Functions  12
2.5 Vectors  16
2.6 Vectorization  18
2.7 Factors  19
2.8 Generating Sequences  22
2.9 Sub-Setting  24
2.10 Matrices and Arrays  26
2.11 Lists  30
2.12 Data Frames  32
2.13 Useful Extensions to Data Frames  36
2.14 Objects, Classes, and Methods  40
2.15 Managing Your Sessions  41
3 Introduction to Data Mining  43
3.1 A Bird's Eye View on Data Mining  43
3.2 Data Collection and Business Understanding  45
3.2.1 Data and Datasets  45
3.2.2 Importing Data into R  46
3.2.2.1 Text Files  47
3.2.2.2 Databases  49
3.2.2.3 Spreadsheets  52
3.2.2.4 Other Formats  52
3.3 Data Pre-Processing  53
3.3.1 Data Cleaning  53
3.3.1.1 Tidy Data  53
3.3.1.2 Handling Dates  56
3.3.1.3 String Processing  58
3.3.1.4 Dealing with Unknown Values  60
3.3.2 Transforming Variables  62
3.3.2.1 Handling Different Scales of Variables  62
3.3.2.2 Discretizing Variables  63
3.3.3 Creating Variables  65
3.3.3.1 Handling Case Dependencies  65
3.3.3.2 Handling Text Datasets  74
3.3.4 Dimensionality Reduction  78
3.3.4.1 Sampling Rows  78
3.3.4.2 Variable Selection  82
3.4 Modeling  87
3.4.1 Exploratory Data Analysis  87
3.4.1.1 Data Summarization  87
3.4.1.2 Data Visualization  96
3.4.2 Dependency Modeling using Association Rules  110
3.4.3 Clustering  119
3.4.3.1 Measures of Dissimilarity  119
3.4.3.2 Clustering Methods  120
3.4.4 Anomaly Detection  131
3.4.4.1 Univariate Outlier Detection Methods  132
3.4.4.2 Multi-Variate Outlier Detection Methods  133
3.4.5 Predictive Analytics  140
3.4.5.1 Evaluation Metrics  141
3.4.5.2 Tree-Based Models  145
3.4.5.3 Support Vector Machines  151
3.4.5.4 Artificial Neural Networks and Deep Learning  158
3.4.5.5 Model Ensembles  165
3.5 Evaluation  172
3.5.1 The Holdout and Random Subsampling  174
3.5.2 Cross Validation  177
3.5.3 Bootstrap Estimates  179
3.5.4 Recommended Procedures  181
3.6 Reporting and Deployment  182
3.6.1 Reporting Through Dynamic Documents  183
3.6.2 Deployment through Web Applications  186
Part II: Case Studies  191
4 Predicting Algae Blooms  193
4.1 Problem Description and Objectives  193
4.2 Data Description  194
4.3 Loading the Data into R  194
4.4 Data Visualization and Summarization  196
4.5 Unknown Values  205
4.5.1 Removing the Observations with Unknown Values  205
4.5.2 Filling in the Unknowns with the Most Frequent Values  207
4.5.3 Filling in the Unknown Values by Exploring Correlations  208
4.5.4 Filling in the Unknown Values by Exploring Similarities between Cases  212
4.6 Obtaining Prediction Models  214
4.6.1 Multiple Linear Regression  215
4.6.2 Regression Trees  220
4.7 Model Evaluation and Selection  225
4.8 Predictions for the Seven Algae  237
4.9 Summary  239
5 Predicting Stock Market Returns  241
5.1 Problem Description and Objectives  241
5.2 The Available Data  242
5.2.1 Reading the Data from the CSV File  243
5.2.2 Getting the Data from the Web  243
5.3 Defining the Prediction Tasks  244
5.3.1 What to Predict?  244
5.3.2 Which Predictors?  247
5.3.3 The Prediction Tasks  251
5.3.4 Evaluation Criteria  252
5.4 The Prediction Models  254
5.4.1 How Will the Training Data Be Used?  254
5.4.2 The Modeling Tools  256
5.4.2.1 Artificial Neural Networks  256
5.4.2.2 Support Vector Machines  259
5.4.2.3 Multivariate Adaptive Regression Splines  260
5.5 From Predictions into Actions  263
5.5.1 How Will the Predictions Be Used?  263
5.5.2 Trading-Related Evaluation Criteria  264
5.5.3 Putting Everything Together: A Simulated Trader  265
5.6 Model Evaluation and Selection  271
5.6.1 Monte Carlo Estimates  271
5.6.2 Experimental Comparisons  272
5.6.3 Results Analysis  278
5.7 The Trading System  286
5.7.1 Evaluation of the Final Test Data  286
5.7.2 An Online Trading System  291
5.8 Summary  292
6 Detecting Fraudulent Transactions  295
6.1 Problem Description and Objectives  295
6.2 The Available Data  296
6.2.1 Loading the Data into R  296
6.2.2 Exploring the Dataset  297
6.2.3 Data Problems  304
6.2.3.1 Unknown Values  304
6.2.3.2 Few Transactions of Some Products  309
6.3 Defining the Data Mining Tasks  313
6.3.1 Different Approaches to the Problem  313
6.3.1.1 Unsupervised Techniques  313
6.3.1.2 Supervised Techniques  314
6.3.1.3 Semi-Supervised Techniques  315
6.3.2 Evaluation Criteria  316
6.3.2.1 Precision and Recall  316
6.3.2.2 Lift Charts and Precision/Recall Curves  317
6.3.2.3 Normalized Distance to Typical Price  320
6.3.3 Experimental Methodology  321
6.4 Obtaining Outlier Rankings  323
6.4.1 Unsupervised Approaches  323
6.4.1.1 The Modified Box Plot Rule  323
6.4.1.2 Local Outlier Factors (LOF)  327
6.4.1.3 Clustering-Based Outlier Rankings (ORh)  330
6.4.2 Supervised Approaches  332
6.4.2.1 The Class Imbalance Problem  333
6.4.2.2 Naive Bayes  335
6.4.2.3 AdaBoost  339
6.4.3 Semi-Supervised Approaches  344
6.5 Summary  350
7 Classifying Microarray Samples  353
7.1 Problem Description and Objectives  353
7.1.1 Brief Background on Microarray Experiments  353
7.1.2 The ALL Dataset  354
7.2 The Available Data  354
7.2.1 Exploring the Dataset  357
7.3 Gene (Feature) Selection  359
7.3.1 Simple Filters Based on Distribution Properties  360
7.3.2 ANOVA Filters  362
7.3.3 Filtering Using Random Forests  364
7.3.4 Filtering Using Feature Clustering Ensembles  367
7.4 Predicting Cytogenetic Abnormalities  368
7.4.1 Defining the Prediction Task  368
7.4.2 The Evaluation Metric  369
7.4.3 The Experimental Procedure  369
7.4.4 The Modeling Techniques  370
7.4.5 Comparing the Models  373
7.5 Summary  381
Bibliography  383
Subject Index  395
Index of Data Mining Topics  399
Index of R Functions  401