Preface  xvii
Foreword  xxi
Foreword from the French language edition  xxiii
List of trademarks  xxv
|
1 Overview of data mining  1
1.1 What is data mining?  1
1.2 What is data mining used for?  4
1.2.1 Data mining in different sectors  4
1.2.2 Data mining in different applications  8
1.3 Data mining and statistics  11
1.4 Data mining and information technology  12
1.5 Data mining and protection of personal data  16
1.6 Implementation of data mining  23
|
2 The development of a data mining study  25
2.1 …  26
2.2 Listing the existing data  26
2.3 Collecting the data  27
2.4 Exploring and preparing the data  30
2.5 Population segmentation  33
2.6 Drawing up and validating predictive models  35
2.7 Synthesizing predictive models of different segments  36
2.8 Iteration of the preceding steps  37
2.9 Deploying the models  37
2.10 Training the model users  38
2.11 Monitoring the models  38
2.12 Enriching the models  40
2.13 …  41
2.14 Life cycle of a model  41
2.15 Costs of a pilot project  41
|
3 Data exploration and preparation  43
3.1 The different types of data  43
3.2 Examining the distribution of variables  44
3.3 Detection of rare or missing values  45
3.4 Detection of aberrant values  49
3.5 Detection of extreme values  52
3.6 Tests of normality  52
3.7 Homoscedasticity and heteroscedasticity  58
3.8 Detection of the most discriminating variables  59
3.8.1 Qualitative, discrete or binned independent variables  60
3.8.2 Continuous independent variables  62
3.8.3 Details of single-factor non-parametric tests  65
3.8.4 ODS and automated selection of discriminating variables  70
3.9 Transformation of variables  73
3.10 Choosing ranges of values of binned variables  74
3.11 Creating new variables  81
3.12 Detecting interactions  82
3.13 Automatic variable selection  85
3.14 Detection of collinearity  86
3.15 Sampling  89
3.15.1 …  89
3.15.2 Random sampling methods  90
|
4 Using commercial data  93
4.1 Data used in commercial applications  93
4.1.1 Data on transactions and RFM data  93
4.1.2 Data on products and contracts  94
4.1.3 …  94
4.1.4 …  96
4.1.5 Relational, attitudinal and psychographic data  96
4.1.6 Sociodemographic data  97
4.1.7 When data are unavailable  97
4.1.8 …  98
4.2 …  98
4.2.1 Geodemographic data  98
4.2.2 …  105
4.3 Data used by business sector  106
4.3.1 Data used in banking  106
4.3.2 Data used in insurance  108
4.3.3 Data used in telephony  108
4.3.4 Data used in mail order  109
|
5 Statistical and data mining software  111
5.1 Types of data mining and statistical software  111
5.2 Essential characteristics of the software  114
5.2.1 Points of comparison  114
5.2.2 Methods implemented  115
5.2.3 Data preparation functions  116
5.2.4 …  116
5.2.5 Technical characteristics  117
5.3 The main software packages  117
5.3.1 …  117
5.3.2 …  119
5.3.3 …  122
5.3.4 R  124
5.3.5 Some elements of the R language  133
5.4 Comparison of R, SAS and IBM SPSS  136
5.5 How to reduce processing time  164
|
6 An outline of data mining methods  167
6.1 Classification of the methods  167
6.2 Comparison of the methods  174
|
7 Factor analysis  175
7.1 Principal component analysis  175
7.1.1 …  175
7.1.2 Representation of variables  181
7.1.3 Representation of individuals  185
7.1.4 …  187
7.1.5 Choosing the number of factor axes  189
7.1.6 …  192
7.2 Variants of principal component analysis  192
7.2.1 …  192
7.2.2 …  193
7.2.3 PCA on qualitative variables  194
7.3 Correspondence analysis  194
7.3.1 …  194
7.3.2 Implementing CA with IBM SPSS Statistics  197
7.4 Multiple correspondence analysis  201
7.4.1 …  201
7.4.2 Review of CA and MCA  205
7.4.3 Implementing MCA and CA with SAS  207
|
8 Neural networks  217
8.1 General information on neural networks  217
8.2 Structure of a neural network  220
8.3 Choosing the learning sample  221
8.4 Some empirical rules for network design  222
8.5 …  223
8.5.1 Continuous variables  223
8.5.2 Discrete variables  223
8.5.3 Qualitative variables  224
8.6 …  224
8.7 The main neural networks  224
8.7.1 The multilayer perceptron  225
8.7.2 The radial basis function network  227
8.7.3 The Kohonen network  231
|
9 Cluster analysis  235
9.1 Definition of clustering  235
9.2 Applications of clustering  236
9.3 Complexity of clustering  236
9.4 Clustering structures  237
9.4.1 Structure of the data to be clustered  237
9.4.2 Structure of the resulting clusters  237
9.5 Some methodological considerations  238
9.5.1 The optimum number of clusters  238
9.5.2 The use of certain types of variables  238
9.5.3 The use of illustrative variables  239
9.5.4 Evaluating the quality of clustering  239
9.5.5 Interpreting the resulting clusters  240
9.5.6 The criteria for correct clustering  242
9.6 Comparison of factor analysis and clustering  242
9.7 Within-cluster and between-cluster sum of squares  243
9.8 Measurements of clustering quality  244
9.8.1 All types of clustering  245
9.8.2 Agglomerative hierarchical clustering  246
9.9 Partitioning methods  247
9.9.1 The moving centres method  247
9.9.2 k-means and dynamic clouds  248
9.9.3 Processing qualitative data  249
9.9.4 k-medoids and their variants  249
9.9.5 Advantages of the partitioning methods  250
9.9.6 Disadvantages of the partitioning methods  251
9.9.7 Sensitivity to the choice of initial centres  252
9.10 Agglomerative hierarchical clustering  253
9.10.1 …  253
9.10.2 The main distances used  254
9.10.3 Density estimation methods  258
9.10.4 Advantages of agglomerative hierarchical clustering  259
9.10.5 Disadvantages of agglomerative hierarchical clustering  261
9.11 Hybrid clustering methods  261
9.11.1 …  261
9.11.2 Illustration using SAS Software  262
9.12 …  272
9.12.1 …  272
9.12.2 …  272
9.13 Clustering by similarity aggregation  273
9.13.1 Principle of relational analysis  273
9.13.2 Implementing clustering by similarity aggregation  274
9.13.3 Example of use of the R amap package  275
9.13.4 Advantages of clustering by similarity aggregation  277
9.13.5 Disadvantages of clustering by similarity aggregation  278
9.14 Clustering of numeric variables  278
9.15 Overview of clustering methods  286
|
10 …  287
10.1 …  287
10.2 …  291
10.3 Using supplementary variables  292
10.4 …  292
10.5 …  294
|
11 Classification and prediction methods  301
11.1 …  301
11.2 Inductive and transductive methods  302
11.3 Overview of classification and prediction methods  304
11.3.1 The qualities expected from a classification and prediction method  304
11.3.2 …  305
11.3.3 Vapnik's learning theory  308
11.3.4 …  310
11.4 Classification by decision tree  313
11.4.1 Principle of the decision trees  313
11.4.2 Definitions - the first step in creating the tree  313
11.4.3 Splitting criterion  316
11.4.4 Distribution among nodes - the second step in creating the tree  318
11.4.5 Pruning - the third step in creating the tree  319
11.4.6 A pitfall to avoid  320
11.4.7 The CART, C5.0 and CHAID trees  321
11.4.8 Advantages of decision trees  327
11.4.9 Disadvantages of decision trees  328
11.5 Prediction by decision tree  330
11.6 Classification by discriminant analysis  332
11.6.1 …  332
11.6.2 Geometric descriptive discriminant analysis (discriminant factor analysis)  333
11.6.3 Geometric predictive discriminant analysis  338
11.6.4 Probabilistic discriminant analysis  342
11.6.5 Measurements of the quality of the model  345
11.6.6 Syntax of discriminant analysis in SAS  350
11.6.7 Discriminant analysis on qualitative variables (DISQUAL method)  352
11.6.8 Advantages of discriminant analysis  354
11.6.9 Disadvantages of discriminant analysis  354
11.7 Prediction by linear regression  355
11.7.1 Simple linear regression  356
11.7.2 Multiple linear regression and regularized regression  359
11.7.3 Tests in linear regression  365
11.7.4 Tests on residuals  371
11.7.5 The influence of observations  375
11.7.6 Example of linear regression  377
11.7.7 Further details of the SAS linear regression syntax  383
11.7.8 Problems of collinearity in linear regression: an example using R  387
11.7.9 Problems of collinearity in linear regression: diagnosis and solutions  394
11.7.10 …  397
11.7.11 Handling regularized regression with SAS and R  400
11.7.12 Robust regression  430
11.7.13 The general linear model  434
11.8 Classification by logistic regression  437
11.8.1 Principles of binary logistic regression  437
11.8.2 Logit, probit and log-log logistic regressions  441
11.8.3 …  443
11.8.4 Illustration of division into categories  445
11.8.5 Estimating the parameters  446
11.8.6 Deviance and quality measurement in a model  449
11.8.7 Complete separation in logistic regression  453
11.8.8 Statistical tests in logistic regression  454
11.8.9 Effect of division into categories and choice of the reference category  458
11.8.10 Effect of collinearity  459
11.8.11 The effect of sampling on logit regression  460
11.8.12 The syntax of logistic regression in SAS Software  461
11.8.13 An example of modelling by logistic regression  463
11.8.14 Logistic regression with R  474
11.8.15 Advantages of logistic regression  477
11.8.16 Advantages of the logit model compared with probit  478
11.8.17 Disadvantages of logistic regression  478
11.9 Developments in logistic regression  479
11.9.1 Logistic regression on individuals with different weights  479
11.9.2 Logistic regression with correlated data  479
11.9.3 Ordinal logistic regression  482
11.9.4 Multinomial logistic regression  482
11.9.5 PLS logistic regression  483
11.9.6 The generalized linear model  484
11.9.7 Poisson regression  487
11.9.8 The generalized additive model  491
11.10 Bayesian methods  492
11.10.1 The naive Bayesian classifier  492
11.10.2 Bayesian networks  497
11.11 Classification and prediction by neural networks  499
11.11.1 Advantages of neural networks  499
11.11.2 Disadvantages of neural networks  500
11.12 Classification by support vector machines  501
11.12.1 Introduction to SVMs  501
11.12.2 …  506
11.12.3 Advantages of SVMs  508
11.12.4 Disadvantages of SVMs  508
11.13 Prediction by genetic algorithms  510
11.13.1 Random generation of initial rules  511
11.13.2 Selecting the best rules  512
11.13.3 Generating new rules  512
11.13.4 End of the algorithm  513
11.13.5 Applications of genetic algorithms  513
11.13.6 Disadvantages of genetic algorithms  514
11.14 Improving the performance of a predictive model  514
11.15 Bootstrapping and ensemble methods  516
11.15.1 Bootstrapping  516
11.15.2 …  518
11.15.3 …  521
11.15.4 Some applications  528
11.15.5 …  532
11.16 Using classification and prediction methods  534
11.16.1 Choosing the modelling methods  534
11.16.2 The training phase of a model  537
11.16.3 …  539
11.16.4 The test phase of a model  540
11.16.5 The ROC curve, the lift curve and the Gini index  542
11.16.6 The classification table of a model  551
11.16.7 The validation phase of a model  553
11.16.8 The application phase of a model  553
|
12 An application of data mining: scoring  555
12.1 The different types of score  555
12.2 Using propensity scores and risk scores  556
12.3 …  558
12.3.1 Determining the objectives  558
12.3.2 Data inventory and preparation  559
12.3.3 Creating the analysis base  559
12.3.4 Developing a predictive model  561
12.3.5 …  561
12.3.6 Deploying the score  562
12.3.7 Monitoring the available tools  562
12.4 Implementing a strategic score  562
12.5 Implementing an operational score  563
12.6 Scoring solutions used in a business  564
12.6.1 In-house or outsourced?  564
12.6.2 Generic or personalized score  567
12.6.3 Summary of the possible solutions  567
12.7 An example of credit scoring (data preparation)  567
12.8 An example of credit scoring (modelling by logistic regression)  594
12.9 An example of credit scoring (modelling by DISQUAL discriminant analysis)  604
12.10 A brief history of credit scoring  615
12.10.1 …  616
|
13 Factors for success in a data mining project  617
13.1 …  617
13.2 …  618
13.3 …  618
13.4 …  619
13.5 The business culture  620
13.6 Data mining: eight common misconceptions  621
13.6.1 No a priori knowledge is needed  621
13.6.2 No specialist staff are needed  621
13.6.3 No statisticians are needed ('you can just press a button')  622
13.6.4 Data mining will reveal unbelievable wonders  622
13.6.5 Data mining is revolutionary  623
13.6.6 You must use all the available data  623
13.6.7 You must always sample  623
13.6.8 You must never sample  623
13.7 Return on investment  624
|
14 Text mining  627
14.1 Definition of text mining  627
14.2 …  629
14.3 …  629
14.4 Information retrieval  630
14.4.1 Linguistic analysis  630
14.4.2 Application of statistics and data mining  633
14.4.3 …  633
14.5 Information extraction  635
14.5.1 Principles of information extraction  635
14.5.2 Example of application: transcription of business interviews  635
14.6 Multi-type data mining  636
|
15 Web mining  637
15.1 The aims of web mining  637
15.2 …  638
15.2.1 What can they be used for?  638
15.2.2 The structure of the log file  638
15.2.3 Using the log file  639
15.3 …  641
15.4 …  642
|
Appendix A Elements of statistics  645
A.1 …  645
A.1.1 …  645
A.1.2 From statistics ... to data mining  645
A.2 Elements of statistics  648
A.2.1 Statistical characteristics  648
A.2.2 Box and whisker plot  649
A.2.3 …  649
A.2.4 Asymptotic, exact, parametric and non-parametric tests  652
A.2.5 Confidence interval for a mean: Student's t test  652
A.2.6 Confidence interval of a frequency (or proportion)  654
A.2.7 The relationship between two continuous variables: the linear correlation coefficient  656
A.2.8 The relationship between two numeric or ordinal variables: Spearman's rank correlation coefficient and Kendall's tau  657
A.2.9 The relationship between n sets of several continuous or binary variables: canonical correlation analysis  658
A.2.10 The relationship between two nominal variables: the χ² test  659
A.2.11 Example of use of the χ² test  660
A.2.12 The relationship between two nominal variables: Cramér's coefficient  661
A.2.13 The relationship between a nominal variable and a numeric variable: the variance test (one-way ANOVA test)  662
A.2.14 The Cox semi-parametric survival model  664
A.3 Statistical tables  665
A.3.1 Table of the standard normal distribution  665
A.3.2 Table of Student's t distribution  665
A.3.3 Table of the χ² distribution  666
A.3.4 Table of the Fisher-Snedecor distribution at the 0.05 significance level  667
A.3.5 Table of the Fisher-Snedecor distribution at the 0.10 significance level  673
|
Appendix B Further reading  675
B.1 Statistics and data analysis  675
B.2 Data mining and statistical learning  678
B.3 …  680
B.4 …  680
B.5 …  680
B.6 …  681
B.7 …  682
B.8 …  682
Index  685