E-book: Data Mining and Statistics for Decision Making [Wiley Online]

Stéphane Tufféry (Universities of Paris-Dauphine and Rennes, France)
  • Wiley Online
  • Price: 108.85 €*
  • * price guaranteeing access for an unlimited number of simultaneous users for an unlimited period

Data mining is the process of automatically searching large volumes of data for models and patterns, using computational techniques from statistics, machine learning and information theory; this makes it an ideal tool for extracting knowledge from data. Data mining is usually associated with a business's or an organization's need to identify trends and profiles, allowing, for example, retailers to discover patterns on which to base marketing objectives.

This book looks at both classical and recent data mining techniques: clustering, discriminant analysis, logistic regression, generalized linear models, regularized regression, PLS regression, decision trees, neural networks, support vector machines, Vapnik's learning theory, the naive Bayesian classifier, ensemble learning and the detection of association rules. Illustrative examples throughout the book explain the theory behind these methods, as well as their strengths and limitations.

Key Features:

  • Presents a comprehensive introduction to all techniques used in data mining and statistical learning, from the classical to the latest.
  • Starts from basic principles and builds up to advanced concepts.
  • Includes many step-by-step examples with the main software packages (R, SAS, IBM SPSS), together with a thorough discussion and comparison of them.
  • Gives practical tips for implementing data mining to solve real-world problems.
  • Looks at a range of tools and applications, such as association rules, web mining and text mining, with a special focus on credit scoring.
  • Supported by an accompanying website hosting datasets and user analysis.

Statisticians, business intelligence analysts and students, as well as computer science, biology, marketing and financial risk professionals in both commercial and government organizations across all business and industry sectors, will benefit from this book.

Preface
Foreword
Foreword from the French language edition
List of trademarks
1 Overview of data mining
1.1 What is data mining?
1.2 What is data mining used for?
1.2.1 Data mining in different sectors
1.2.2 Data mining in different applications
1.3 Data mining and statistics
1.4 Data mining and information technology
1.5 Data mining and protection of personal data
1.6 Implementation of data mining
2 The development of a data mining study
2.1 Defining the aims
2.2 Listing the existing data
2.3 Collecting the data
2.4 Exploring and preparing the data
2.5 Population segmentation
2.6 Drawing up and validating predictive models
2.7 Synthesizing predictive models of different segments
2.8 Iteration of the preceding steps
2.9 Deploying the models
2.10 Training the model users
2.11 Monitoring the models
2.12 Enriching the models
2.13 Remarks
2.14 Life cycle of a model
2.15 Costs of a pilot project
3 Data exploration and preparation
3.1 The different types of data
3.2 Examining the distribution of variables
3.3 Detection of rare or missing values
3.4 Detection of aberrant values
3.5 Detection of extreme values
3.6 Tests of normality
3.7 Homoscedasticity and heteroscedasticity
3.8 Detection of the most discriminating variables
3.8.1 Qualitative, discrete or binned independent variables
3.8.2 Continuous independent variables
3.8.3 Details of single-factor non-parametric tests
3.8.4 ODS and automated selection of discriminating variables
3.9 Transformation of variables
3.10 Choosing ranges of values of binned variables
3.11 Creating new variables
3.12 Detecting interactions
3.13 Automatic variable selection
3.14 Detection of collinearity
3.15 Sampling
3.15.1 Using sampling
3.15.2 Random sampling methods
4 Using commercial data
4.1 Data used in commercial applications
4.1.1 Data on transactions and RFM data
4.1.2 Data on products and contracts
4.1.3 Lifetimes
4.1.4 Data on channels
4.1.5 Relational, attitudinal and psychographic data
4.1.6 Sociodemographic data
4.1.7 When data are unavailable
4.1.8 Technical data
4.2 Special data
4.2.1 Geodemographic data
4.2.2 Profitability
4.3 Data used by business sector
4.3.1 Data used in banking
4.3.2 Data used in insurance
4.3.3 Data used in telephony
4.3.4 Data used in mail order
5 Statistical and data mining software
5.1 Types of data mining and statistical software
5.2 Essential characteristics of the software
5.2.1 Points of comparison
5.2.2 Methods implemented
5.2.3 Data preparation functions
5.2.4 Other functions
5.2.5 Technical characteristics
5.3 The main software packages
5.3.1 Overview
5.3.2 IBM SPSS
5.3.3 SAS
5.3.4 R
5.3.5 Some elements of the R language
5.4 Comparison of R, SAS and IBM SPSS
5.5 How to reduce processing time
6 An outline of data mining methods
6.1 Classification of the methods
6.2 Comparison of the methods
7 Factor analysis
7.1 Principal component analysis
7.1.1 Introduction
7.1.2 Representation of variables
7.1.3 Representation of individuals
7.1.4 Use of PCA
7.1.5 Choosing the number of factor axes
7.1.6 Summary
7.2 Variants of principal component analysis
7.2.1 PCA with rotation
7.2.2 PCA of ranks
7.2.3 PCA on qualitative variables
7.3 Correspondence analysis
7.3.1 Introduction
7.3.2 Implementing CA with IBM SPSS Statistics
7.4 Multiple correspondence analysis
7.4.1 Introduction
7.4.2 Review of CA and MCA
7.4.3 Implementing MCA and CA with SAS
8 Neural networks
8.1 General information on neural networks
8.2 Structure of a neural network
8.3 Choosing the learning sample
8.4 Some empirical rules for network design
8.5 Data normalization
8.5.1 Continuous variables
8.5.2 Discrete variables
8.5.3 Qualitative variables
8.6 Learning algorithms
8.7 The main neural networks
8.7.1 The multilayer perceptron
8.7.2 The radial basis function network
8.7.3 The Kohonen network
9 Cluster analysis
9.1 Definition of clustering
9.2 Applications of clustering
9.3 Complexity of clustering
9.4 Clustering structures
9.4.1 Structure of the data to be clustered
9.4.2 Structure of the resulting clusters
9.5 Some methodological considerations
9.5.1 The optimum number of clusters
9.5.2 The use of certain types of variables
9.5.3 The use of illustrative variables
9.5.4 Evaluating the quality of clustering
9.5.5 Interpreting the resulting clusters
9.5.6 The criteria for correct clustering
9.6 Comparison of factor analysis and clustering
9.7 Within-cluster and between-cluster sum of squares
9.8 Measurements of clustering quality
9.8.1 All types of clustering
9.8.2 Agglomerative hierarchical clustering
9.9 Partitioning methods
9.9.1 The moving centres method
9.9.2 k-means and dynamic clouds
9.9.3 Processing qualitative data
9.9.4 k-medoids and their variants
9.9.5 Advantages of the partitioning methods
9.9.6 Disadvantages of the partitioning methods
9.9.7 Sensitivity to the choice of initial centres
9.10 Agglomerative hierarchical clustering
9.10.1 Introduction
9.10.2 The main distances used
9.10.3 Density estimation methods
9.10.4 Advantages of agglomerative hierarchical clustering
9.10.5 Disadvantages of agglomerative hierarchical clustering
9.11 Hybrid clustering methods
9.11.1 Introduction
9.11.2 Illustration using SAS Software
9.12 Neural clustering
9.12.1 Advantages
9.12.2 Disadvantages
9.13 Clustering by similarity aggregation
9.13.1 Principle of relational analysis
9.13.2 Implementing clustering by similarity aggregation
9.13.3 Example of use of the R amap package
9.13.4 Advantages of clustering by similarity aggregation
9.13.5 Disadvantages of clustering by similarity aggregation
9.14 Clustering of numeric variables
9.15 Overview of clustering methods
10 Association analysis
10.1 Principles
10.2 Using taxonomy
10.3 Using supplementary variables
10.4 Applications
10.5 Example of use
11 Classification and prediction methods
11.1 Introduction
11.2 Inductive and transductive methods
11.3 Overview of classification and prediction methods
11.3.1 The qualities expected from a classification and prediction method
11.3.2 Generalizability
11.3.3 Vapnik's learning theory
11.3.4 Overfitting
11.4 Classification by decision tree
11.4.1 Principle of the decision trees
11.4.2 Definitions -- the first step in creating the tree
11.4.3 Splitting criterion
11.4.4 Distribution among nodes -- the second step in creating the tree
11.4.5 Pruning -- the third step in creating the tree
11.4.6 A pitfall to avoid
11.4.7 The CART, C5.0 and CHAID trees
11.4.8 Advantages of decision trees
11.4.9 Disadvantages of decision trees
11.5 Prediction by decision tree
11.6 Classification by discriminant analysis
11.6.1 The problem
11.6.2 Geometric descriptive discriminant analysis (discriminant factor analysis)
11.6.3 Geometric predictive discriminant analysis
11.6.4 Probabilistic discriminant analysis
11.6.5 Measurements of the quality of the model
11.6.6 Syntax of discriminant analysis in SAS
11.6.7 Discriminant analysis on qualitative variables (DISQUAL Method)
11.6.8 Advantages of discriminant analysis
11.6.9 Disadvantages of discriminant analysis
11.7 Prediction by linear regression
11.7.1 Simple linear regression
11.7.2 Multiple linear regression and regularized regression
11.7.3 Tests in linear regression
11.7.4 Tests on residuals
11.7.5 The influence of observations
11.7.6 Example of linear regression
11.7.7 Further details of the SAS linear regression syntax
11.7.8 Problems of collinearity in linear regression: an example using R
11.7.9 Problems of collinearity in linear regression: diagnosis and solutions
11.7.10 PLS regression
11.7.11 Handling regularized regression with SAS and R
11.7.12 Robust regression
11.7.13 The general linear model
11.8 Classification by logistic regression
11.8.1 Principles of binary logistic regression
11.8.2 Logit, probit and log-log logistic regressions
11.8.3 Odds ratios
11.8.4 Illustration of division into categories
11.8.5 Estimating the parameters
11.8.6 Deviance and quality measurement in a model
11.8.7 Complete separation in logistic regression
11.8.8 Statistical tests in logistic regression
11.8.9 Effect of division into categories and choice of the reference category
11.8.10 Effect of collinearity
11.8.11 The effect of sampling on logit regression
11.8.12 The syntax of logistic regression in SAS Software
11.8.13 An example of modelling by logistic regression
11.8.14 Logistic regression with R
11.8.15 Advantages of logistic regression
11.8.16 Advantages of the logit model compared with probit
11.8.17 Disadvantages of logistic regression
11.9 Developments in logistic regression
11.9.1 Logistic regression on individuals with different weights
11.9.2 Logistic regression with correlated data
11.9.3 Ordinal logistic regression
11.9.4 Multinomial logistic regression
11.9.5 PLS logistic regression
11.9.6 The generalized linear model
11.9.7 Poisson regression
11.9.8 The generalized additive model
11.10 Bayesian methods
11.10.1 The naive Bayesian classifier
11.10.2 Bayesian networks
11.11 Classification and prediction by neural networks
11.11.1 Advantages of neural networks
11.11.2 Disadvantages of neural networks
11.12 Classification by support vector machines
11.12.1 Introduction to SVMs
11.12.2 Example
11.12.3 Advantages of SVMs
11.12.4 Disadvantages of SVMs
11.13 Prediction by genetic algorithms
11.13.1 Random generation of initial rules
11.13.2 Selecting the best rules
11.13.3 Generating new rules
11.13.4 End of the algorithm
11.13.5 Applications of genetic algorithms
11.13.6 Disadvantages of genetic algorithms
11.14 Improving the performance of a predictive model
11.15 Bootstrapping and ensemble methods
11.15.1 Bootstrapping
11.15.2 Bagging
11.15.3 Boosting
11.15.4 Some applications
11.15.5 Conclusion
11.16 Using classification and prediction methods
11.16.1 Choosing the modelling methods
11.16.2 The training phase of a model
11.16.3 Reject inference
11.16.4 The test phase of a model
11.16.5 The ROC curve, the lift curve and the Gini index
11.16.6 The classification table of a model
11.16.7 The validation phase of a model
11.16.8 The application phase of a model
12 An application of data mining: scoring
12.1 The different types of score
12.2 Using propensity scores and risk scores
12.3 Methodology
12.3.1 Determining the objectives
12.3.2 Data inventory and preparation
12.3.3 Creating the analysis base
12.3.4 Developing a predictive model
12.3.5 Using the score
12.3.6 Deploying the score
12.3.7 Monitoring the available tools
12.4 Implementing a strategic score
12.5 Implementing an operational score
12.6 Scoring solutions used in a business
12.6.1 In-house or outsourced?
12.6.2 Generic or personalized score
12.6.3 Summary of the possible solutions
12.7 An example of credit scoring (data preparation)
12.8 An example of credit scoring (modelling by logistic regression)
12.9 An example of credit scoring (modelling by DISQUAL discriminant analysis)
12.10 A brief history of credit scoring
References
13 Factors for success in a data mining project
13.1 The subject
13.2 The people
13.3 The data
13.4 The IT systems
13.5 The business culture
13.6 Data mining: eight common misconceptions
13.6.1 No a priori knowledge is needed
13.6.2 No specialist staff are needed
13.6.3 No statisticians are needed ('you can just press a button')
13.6.4 Data mining will reveal unbelievable wonders
13.6.5 Data mining is revolutionary
13.6.6 You must use all the available data
13.6.7 You must always sample
13.6.8 You must never sample
13.7 Return on investment
14 Text mining
14.1 Definition of text mining
14.2 Text sources used
14.3 Using text mining
14.4 Information retrieval
14.4.1 Linguistic analysis
14.4.2 Application of statistics and data mining
14.4.3 Suitable methods
14.5 Information extraction
14.5.1 Principles of information extraction
14.5.2 Example of application: transcription of business interviews
14.6 Multi-type data mining
15 Web mining
15.1 The aims of web mining
15.2 Global analyses
15.2.1 What can they be used for?
15.2.2 The structure of the log file
15.2.3 Using the log file
15.3 Individual analyses
15.4 Personal analysis
Appendix A Elements of statistics
A.1 A brief history
A.1.1 A few dates
A.1.2 From statistics...to data mining
A.2 Elements of statistics
A.2.1 Statistical characteristics
A.2.2 Box and whisker plot
A.2.3 Hypothesis testing
A.2.4 Asymptotic, exact, parametric and non-parametric tests
A.2.5 Confidence interval for a mean: Student's t test
A.2.6 Confidence interval of a frequency (or proportion)
A.2.7 The relationship between two continuous variables: the linear correlation coefficient
A.2.8 The relationship between two numeric or ordinal variables: Spearman's rank correlation coefficient and Kendall's tau
A.2.9 The relationship between n sets of several continuous or binary variables: canonical correlation analysis
A.2.10 The relationship between two nominal variables: the χ² test
A.2.11 Example of use of the χ² test
A.2.12 The relationship between two nominal variables: Cramér's coefficient
A.2.13 The relationship between a nominal variable and a numeric variable: the variance test (one-way ANOVA test)
A.2.14 The Cox semi-parametric survival model
A.3 Statistical tables
A.3.1 Table of the standard normal distribution
A.3.2 Table of Student's t distribution
A.3.3 Chi-square table
A.3.4 Table of the Fisher-Snedecor distribution at the 0.05 significance level
A.3.5 Table of the Fisher-Snedecor distribution at the 0.10 significance level
Appendix B Further reading
B.1 Statistics and data analysis
B.2 Data mining and statistical learning
B.3 Text mining
B.4 Web mining
B.5 R software
B.6 SAS software
B.7 IBM SPSS software
B.8 Websites
Index

Dr Stéphane Tufféry teaches data mining and statistics at the Universities of Paris-Dauphine and Rennes 1, France.

Translator: Rod Riesco, UK.