Customer Support: +372 7440010

Help | New account | Log In

E-book: Data Mining and Statistics for Decision Making

3.42/5 (19 ratings by Goodreads)

Stéphane Tufféry (Universities of Paris-Dauphine and Rennes, France)

Format: PDF+DRM
Series: Wiley Series in Computational Statistics
Pub. Date: 18-Feb-2011
Publisher: John Wiley & Sons Inc
Language: eng
ISBN-13: 9780470979167

Other books in subject:

Data mining

Format - PDF+DRM
Price: 87,62 €*
* the price is final i.e. no additional discount will apply
Add to basket
Add to Wishlist
This ebook is for personal use only. E-Books are non-refundable.
For Libraries

Format: PDF+DRM
Series: Wiley Series in Computational Statistics
Pub. Date: 18-Feb-2011
Publisher: John Wiley & Sons Inc
Language: eng
ISBN-13: 9780470979167

Other books in subject:

Data mining

DRM restrictions

Copying (copy/paste):

not allowed
Printing:

not allowed
Usage:

Digital Rights Management (DRM)
The publisher has supplied this book in encrypted form, which means that you need to install free software in order to unlock and read it. To read this e-book you have to create Adobe ID More info here. Ebook can be read and downloaded up to 6 devices (single user with the same Adobe ID).

Required software
To read this ebook on a mobile device (phone or tablet) you'll need to install this free app: PocketBook Reader (iOS / Android)

To download and read this eBook on a PC or Mac you need Adobe Digital Editions (This is a free app specially developed for eBooks. It's not the same as Adobe Reader, which you probably already have on your computer.)

You can't read this ebook with Amazon Kindle

Data mining is the process of automatically searching large volumes of data for models and patterns using computational techniques from statistics, machine learning and information theory; it is the ideal tool for such an extraction of knowledge. Data mining is usually associated with a business or an organization's need to identify trends and profiles, allowing, for example, retailers to discover patterns on which to base marketing objectives.

This book looks at both classical and recent techniques of data mining, such as clustering, discriminant analysis, logistic regression, generalized linear models, regularized regression, PLS regression, decision trees, neural networks, support vector machines, Vapnik theory, naive Bayesian classifier, ensemble learning and detection of association rules. They are discussed along with illustrative examples throughout the book to explain the theory of these methods, as well as their strengths and limitations.

Key Features:

Presents a comprehensive introduction to all techniques used in data mining and statistical learning, from classical to latest techniques. Starts from basic principles up to advanced concepts. Includes many step-by-step examples with the main software (R, SAS, IBM SPSS) as well as a thorough discussion and comparison of those software. Gives practical tips for data mining implementation to solve real world problems. Looks at a range of tools and applications, such as association rules, web mining and text mining, with a special focus on credit scoring. Supported by an accompanying website hosting datasets and user analysis.

Statisticians and business intelligence analysts, students as well as computer science, biology, marketing and financial risk professionals in both commercial and government organizations across all business and industry sectors will benefit from this book.

Reviews

"Business intelligence analysts and statisticians, compliance and financial experts in both commercial and government organizations across all industry sectors will benefit from this book." (Zentralblatt MATH, 2011)

Preface

xvii

Foreword

xxi

Foreword from the French language edition

xxiii

List of trademarks

xxv

1 Overview of data mining

(24)

1.1 What is data mining?

(3)

1.2 What is data mining used for?

(7)

1.2.1 Data mining in different sectors

(4)

1.2.2 Data mining in different applications

(3)

1.3 Data mining and statistics

(1)

1.4 Data mining and information technology

(4)

1.5 Data mining and protection of personal data

(7)

1.6 Implementation of data mining

(2)

2 The development of a data mining study

(18)

2.1 Defining the aims

(1)

2.2 Listing the existing data

(1)

2.3 Collecting the data

(3)

2.4 Exploring and preparing the data

(3)

2.5 Population segmentation

(2)

2.6 Drawing up and validating predictive models

(1)

2.7 Synthesizing predictive models of different segments

(1)

2.8 Iteration of the preceding steps

(1)

2.9 Deploying the models

(1)

2.10 Training the model users

(1)

2.11 Monitoring the models

(2)

2.12 Enriching the models

(1)

2.13 Remarks

(1)

2.14 Life cycle of a model

(1)

2.15 Costs of a pilot project

(2)

3 Data exploration and preparation

(50)

3.1 The different types of data

(1)

3.2 Examining the distribution of variables

(1)

3.3 Detection of rare or missing values

(4)

3.4 Detection of aberrant values

(3)

3.5 Detection of extreme values

(1)

3.6 Tests of normality

(6)

3.7 Homoscedasticity and heteroscedasticity

(1)

3.8 Detection of the most discriminating variables

(14)

3.8.1 Qualitative, discrete or binned independent variables

(2)

3.8.2 Continuous independent variables

(3)

3.8.3 Details of single-factor non-parametric tests

(5)

3.8.4 ODS and automated selection of discriminating variables

(3)

3.9 Transformation of variables

(1)

3.10 Choosing ranges of values of binned variables

(7)

3.11 Creating new variables

(1)

3.12 Detecting interactions

(3)

3.13 Automatic variable selection

(1)

3.14 Detection of collinearity

(3)

3.15 Sampling

(4)

3.15.1 Using sampling

(1)

3.15.2 Random sampling methods

(3)

4 Using commercial data

(18)

4.1 Data used in commercial applications

(5)

4.1.1 Data on transactions and RFM data

(1)

4.1.2 Data on products and contracts

(1)

4.1.3 Lifetimes

(2)

4.1.4 Data on channels

(1)

4.1.5 Relational, attitudinal and psychographic data

(1)

4.1.6 Sociodemographic data

(1)

4.1.7 When data are unavailable

(1)

4.1.8 Technical data

(1)

4.2 Special data

(8)

4.2.1 Geodemographic data

(7)

4.2.2 Profitability

105

(1)

4.3 Data used by business sector

106

(5)

4.3.1 Data used in banking

106

(2)

4.3.2 Data used in insurance

108

(1)

4.3.3 Data used in telephony

108

(1)

4.3.4 Data used in mail order

109

(2)

5 Statistical and data mining software

111

(56)

5.1 Types of data mining and statistical software

111

(3)

5.2 Essential characteristics of the software

114

(3)

5.2.1 Points of comparison

114

(1)

5.2.2 Methods implemented

115

(1)

5.2.3 Data preparation functions

116

(1)

5.2.4 Other functions

116

(1)

5.2.5 Technical characteristics

117

(1)

5.3 The main software packages

117

(19)

5.3.1 Overview

117

(2)

5.3.2 IBM SPSS

119

(3)

5.3.3 SAS

122

(2)

5.3.4 R

124

(9)

5.3.5 Some elements of the R language

133

(3)

5.4 Comparison of R, SAS and IBM SPSS

136

(28)

5.5 How to reduce processing time

164

(3)

6 An outline of data mining methods

167

(8)

6.1 Classification of the methods

167

(7)

6.2 Comparison of the methods

174

(1)

7 Factor analysis

175

(42)

7.1 Principal component analysis

175

(17)

7.1.1 Introduction

175

(6)

7.1.2 Representation of variables

181

(4)

7.1.3 Representation of individuals

185

(2)

7.1.4 Use of PCA

187

(2)

7.1.5 Choosing the number of factor axes

189

(3)

7.1.6 Summary

192

(1)

7.2 Variants of principal component analysis

192

(2)

7.2.1 PCA with rotation

192

(1)

7.2.2 PCA of ranks

193

(1)

7.2.3 PCA on qualitative variables

194

(1)

7.3 Correspondence analysis

194

(7)

7.3.1 Introduction

194

(3)

7.3.2 Implementing CA with IBM SPSS Statistics

197

(4)

7.4 Multiple correspondence analysis

201

(16)

7.4.1 Introduction

201

(4)

7.4.2 Review of CA and MCA

205

(2)

7.4.3 Implementing MCA and CA with SAS

207

(10)

8 Neural networks

217

(18)

8.1 General information on neural networks

217

(3)

8.2 Structure of a neural network

220

(1)

8.3 Choosing the learning sample

221

(1)

8.4 Some empirical rules for network design

222

(1)

8.5 Data normalization

223

(1)

8.5.1 Continuous variables

223

(1)

8.5.2 Discrete variables

223

(1)

8.5.3 Qualitative variables

224

(1)

8.6 Learning algorithms

224

(1)

8.7 The main neural networks

224

(11)

8.7.1 The multilayer perceptron

225

(2)

8.7.2 The radial basis function network

227

(4)

8.7.3 The Kohonen network

231

(4)

9 Cluster analysis

235

(52)

9.1 Definition of clustering

235

(1)

9.2 Applications of clustering

236

(1)

9.3 Complexity of clustering

236

(1)

9.4 Clustering structures

237

(1)

9.4.1 Structure of the data to be clustered

237

(1)

9.4.2 Structure of the resulting clusters

237

(1)

9.5 Some methodological considerations

238

(4)

9.5.1 The optimum number of clusters

238

(1)

9.5.2 The use of certain types of variables

238

(1)

9.5.3 The use of illustrative variables

239

(1)

9.5.4 Evaluating the quality of clustering

239

(1)

9.5.5 Interpreting the resulting clusters

240

(2)

9.5.6 The criteria for correct clustering

242

(1)

9.6 Comparison of factor analysis and clustering

242

(1)

9.7 Within-cluster and between-cluster sum of squares

243

(1)

9.8 Measurements of clustering quality

244

(3)

9.8.1 All types of clustering

245

(1)

9.8.2 Agglomerative hierarchical clustering

246

(1)

9.9 Partitioning methods

247

(6)

9.9.1 The moving centres method

247

(1)

9.9.2 k-means and dynamic clouds

248

(1)

9.9.3 Processing qualitative data

249

(1)

9.9.4 k-medoids and their variants

249

(1)

9.9.5 Advantages of the partitioning methods

250

(1)

9.9.6 Disadvantages of the partitioning methods

251

(1)

9.9.7 Sensitivity to the choice of initial centres

252

(1)

9.10 Agglomerative hierarchical clustering

253

(8)

9.10.1 Introduction

253

(1)

9.10.2 The main distances used

254

(4)

9.10.3 Density estimation methods

258

(1)

9.10.4 Advantages of agglomerative hierarchical clustering

259

(2)

9.10.5 Disadvantages of agglomerative hierarchical clustering

261

(1)

9.11 Hybrid clustering methods

261

(11)

9.11.1 Introduction

261

(1)

9.11.2 Illustration using SAS Software

262

(10)

9.12 Neural clustering

272

(1)

9.12.1 Advantages

272

(1)

9.12.2 Disadvantages

272

(1)

9.13 Clustering by similarity aggregation

273

(5)

9.13.1 Principle of relational analysis

273

(1)

9.13.2 Implementing clustering by similarity aggregation

274

(1)

9.13.3 Example of use of the R amap package

275

(2)

9.13.4 Advantages of clustering by similarity aggregation

277

(1)

9.13.5 Disadvantages of clustering by similarity aggregation

278

(1)

9.14 Clustering of numeric variables

278

(8)

9.15 Overview of clustering methods

286

(1)

10 Association analysis

287

(14)

10.1 Principles

287

(4)

10.2 Using taxonomy

291

(1)

10.3 Using supplementary variables

292

(1)

10.4 Applications

292

(2)

10.5 Example of use

294

(7)

11 Classification and prediction methods

301

(254)

11.1 Introduction

301

(1)

11.2 Inductive and transductive methods

302

(2)

11.3 Overview of classification and prediction methods

304

(9)

11.3.1 The qualities expected from a classification and prediction method

304

(1)

11.3.2 Generalizability

305

(3)

11.3.3 Vapnik's learning theory

308

(2)

11.3.4 Overfitting

310

(3)

11.4 Classification by decision tree

313

(17)

11.4.1 Principle of the decision trees

313

(1)

11.4.2 Definitions -- the first step in creating the tree

313

(3)

11.4.3 Splitting criterion

316

(2)

11.4.4 Distribution among nodes - the second step in creating the tree

318

(1)

11.4.5 Pruning -- the third step in creating the tree

319

(1)

11.4.6 A pitfall to avoid

320

(1)

11.4.7 The CART, C5.0 and CHAID trees

321

(6)

11.4.8 Advantages of decision trees

327

(1)

11.4.9 Disadvantages of decision trees

328

(2)

11.5 Prediction by decision tree

330

(2)

11.6 Classification by discriminant analysis

332

(23)

11.6.1 The problem

332

(1)

11.6.2 Geometric descriptive discriminant analysis (discriminant factor analysis)

333

(5)

11.6.3 Geometric predictive discriminant analysis

338

(4)

11.6.4 Probabilistic discriminant analysis

342

(3)

11.6.5 Measurements of the quality of the model

345

(5)

11.6.6 Syntax of discriminant analysis in SAS

350

(2)

11.6.7 Discriminant analysis on qualitative variables (DISQUAL Method)

352

(2)

11.6.8 Advantages of discriminant analysis

354

(1)

11.6.9 Disadvantages of discriminant analysis

354

(1)

11.7 Prediction by linear regression

355

(82)

11.7.1 Simple linear regression

356

(3)

11.7.2 Multiple linear regression and regularized regression

359

(6)

11.7.3 Tests in linear regression

365

(6)

11.7.4 Tests on residuals

371

(4)

11.7.5 The influence of observations

375

(2)

11.7.6 Example of linear regression

377

(6)

11.7.7 Further details of the SAS linear regression syntax

383

(4)

11.7.8 Problems of collinearity in linear regression: an example using R

387

(7)

11.7.9 Problems of collinearity in linear regression: diagnosis and solutions

394

(3)

11.7.10 PLS regression

397

(3)

11.7.11 Handling regularized regression with SAS and R

400

(30)

11.7.12 Robust regression

430

(4)

11.7.13 The general linear model

434

(3)

11.8 Classification by logistic regression

437

(42)

11.8.1 Principles of binary logistic regression

437

(4)

11.8.2 Logit, probit and log-log logistic regressions

441

(2)

11.8.3 Odds ratios

443

(2)

11.8.4 Illustration of division into categories

445

(1)

11.8.5 Estimating the parameters

446

(3)

11.8.6 Deviance and quality measurement in a model

449

(4)

11.8.7 Complete separation in logistic regression

453

(1)

11.8.8 Statistical tests in logistic regression

454

(4)

11.8.9 Effect of division into categories and choice of the reference category

458

(1)

11.8.10 Effect of collinearity

459

(1)

11.8.11 The effect of sampling on logit regression

460

(1)

11.8.12 The syntax of logistic regression in SAS Software

461

(2)

11.8.13 An example of modelling by logistic regression

463

(11)

11.8.14 Logistic regression with R

474

(3)

11.8.15 Advantages of logistic regression

477

(1)

11.8.16 Advantages of the logit model compared with probit

478

(1)

11.8.17 Disadvantages of logistic regression

478

(1)

11.9 Developments in logistic regression

479

(13)

11.9.1 Logistic regression on individuals with different weights

479

(1)

11.9.2 Logistic regression with correlated data

479

(3)

11.9.3 Ordinal logistic regression

482

(1)

11.9.4 Multinomial logistic regression

482

(1)

11.9.5 PLS logistic regression

483

(1)

11.9.6 The generalized linear model

484

(3)

11.9.7 Poisson regression

487

(4)

11.9.8 The generalized additive model

491

(1)

11.10 Bayesian methods

492

(7)

11.10.1 The naive Bayesian classifier

492

(5)

11.10.2 Bayesian networks

497

(2)

11.11 Classification and prediction by neural networks

499

(2)

11.11.1 Advantages of neural networks

499

(1)

11.11.2 Disadvantages of neural networks

500

(1)

11.12 Classification by support vector machines

501

(9)

11.12.1 Introduction to SVMs

501

(5)

11.12.2 Example

506

(2)

11.12.3 Advantages of SVMs

508

(1)

11.12.4 Disadvantages of SVMs

508

(2)

11.13 Prediction by genetic algorithms

510

(4)

11.13.1 Random generation of initial rules

511

(1)

11.13.2 Selecting the best rules

512

(1)

11.13.3 Generating new rules

512

(1)

11.13.4 End of the algorithm

513

(1)

11.13.5 Applications of genetic algorithms

513

(1)

11.13.6 Disadvantages of genetic algorithms

514

(1)

11.14 Improving the performance of a predictive model

514

(2)

11.15 Bootstrapping and ensemble methods

516

(18)

11.15.1 Bootstrapping

516

(2)

11.15.2 Bagging

518

(3)

11.15.3 Boosting

521

(7)

11.15.4 Some applications

528

(4)

11.15.5 Conclusion

532

(2)

11.16 Using classification and prediction methods

534

(21)

11.16.1 Choosing the modelling methods

534

(3)

11.16.2 The training phase of a model

537

(2)

11.16.3 Reject inference

539

(1)

11.16.4 The test phase of a model

540

(2)

11.16.5 The ROC curve, the lift curve and the Gini index

542

(9)

11.16.6 The classification table of a model

551

(2)

11.16.7 The validation phase of a model

553

(1)

11.16.8 The application phase of a model

553

(2)

12 An application of data mining: scoring

555

(62)

12.1 The different types of score

555

(1)

12.2 Using propensity scores and risk scores

556

(2)

12.3 Methodology

558

(4)

12.3.1 Determining the objectives

558

(1)

12.3.2 Data inventory and preparation

559

(1)

12.3.3 Creating the analysis base

559

(2)

12.3.4 Developing a predictive model

561

(1)

12.3.5 Using the score

561

(1)

12.3.6 Deploying the score

562

(1)

12.3.7 Monitoring the available tools

562

(1)

12.4 Implementing a strategic score

562

(1)

12.5 Implementing an operational score

563

(1)

12.6 Scoring solutions used in a business

564

(3)

12.6.1 In-house or outsourced?

564

(3)

12.6.2 Generic or personalized score

567

(1)

12.6.3 Summary of the possible solutions

567

(1)

12.7 An example of credit scoring (data preparation)

567

(27)

12.8 An example of credit scoring (modelling by logistic regression)

594

(10)

12.9 An example of credit scoring (modelling by DISQUAL discriminant analysis)

604

(11)

12.10 A brief history of credit scoring

615

(2)

References

616

(1)

13 Factors for success in a data mining project

617

(10)

13.1 The subject

617

(1)

13.2 The people

618

(1)

13.3 The data

618

(1)

13.4 The IT systems

619

(1)

13.5 The business culture

620

(1)

13.6 Data mining: eight common misconceptions

621

(3)

13.6.1 No a priori knowledge is needed

621

(1)

13.6.2 No specialist staff are needed

621

(1)

13.6.3 No statisticians are needed (`you can just press a button')

622

(1)

13.6.4 Data mining will reveal unbelievable wonders

622

(1)

13.6.5 Data mining is revolutionary

623

(1)

13.6.6 You must use all the available data

623

(1)

13.6.7 You must always sample

623

(1)

13.6.8 You must never sample

623

(1)

13.7 Return on investment

624

(3)

14 Text mining

627

(10)

14.1 Definition of text mining

627

(2)

14.2 Text sources used

629

(1)

14.3 Using text mining

629

(1)

14.4 Information retrieval

630

(5)

14.4.1 Linguistic analysis

630

(3)

14.4.2 Application of statistics and data mining

633

(1)

14.4.3 Suitable methods

633

(2)

14.5 Information extraction

635

(1)

14.5.1 Principles of information extraction

635

(1)

14.5.2 Example of application: transcription of business interviews

635

(1)

14.6 Multi-type data mining

636

(1)

15 Web mining

637

(8)

15.1 The aims of web mining

637

(1)

15.2 Global analyses

638

(3)

15.2.1 What can they be used for?

638

(1)

15.2.2 The structure of the log file

638

(1)

15.2.3 Using the log file

639

(2)

15.3 Individual analyses

641

(1)

15.4 Personal analysis

642

(3)

Appendix A Elements of statistics

645

(30)

A.1 A brief history

645

(3)

A.1.1 A few dates

645

(1)

A.1.2 From statistics...to data mining

645

(3)

A.2 Elements of statistics

648

(17)

A.2.1 Statistical characteristics

648

(1)

A.2.2 Box and whisker plot

649

(1)

A.2.3 Hypothesis testing

649

(3)

A.2.4 Asymptotic, exact, parametric and non-parametric tests

652

(1)

A.2.5 Confidence interval for a mean: student's r lest

652

(2)

A.2.6 Confidence interval of a frequency (or proportion)

654

(2)

A.2.7 The relationship between two continuous variables: the linear correlation coefficient

656

(1)

A.2.8 The relationship between two numeric or ordinal variables: Spearman's rank correlation coefficient and Kendall's tau

657

(1)

A.2.9 The relationship between n sets of several continuous or binary variables: canonical correlation analysis

658

(1)

A.2.10 The relationship between two nominal variables: the Χ2 test

659

(1)

A.2.11 Example of use of the Χ2 test

660

(1)

A.2.12 The relationship between two nominal variables: Cramer's coefficient

661

(1)

A.2.13 The relationship between a nominal variable and a numeric variable: the variance test (one-way ANOVA test)

662

(2)

A.2.14 The cox semi-parametric survival model

664

(1)

A.3 Statistical tables

665

(10)

A.3.1 Table of the standard normal distribution

665

(1)

A.3.2 Table of student's t distribution

665

(1)

A.3.3 Chi-Square table

666

(1)

A.3.4 Table of the Fisher-Snedecor distribution at the 0.05 significance level

667

(6)

A.3.5 Table of the Fisher-Snedecor distribution at the 0.10 significance level

673

(2)

Appendix B Further reading

675

(10)

B.1 Statistics and data analysis

675

(3)

B.2 Data mining and statistical learning

678

(2)

B.3 Text mining

680

(1)

B.4 Web mining

680

(1)

B.5 R software

680

(1)

B.6 SAS software

681

(1)

B.7 IBM SPSS software

682

(1)

B.8 Websites

682

(3)

Index

685

Dr Stephane Tuffery teaches Data Mining and statistics, University Rennes 1, Paris, France.

Translator, Rod Riesco, UK.

More information about ebooks

Permanent link: https://www.kriso.ee/db/97804709791672e.html

Keywords:

E-book: Data Mining and Statistics for Decision Making

DRM restrictions

Copying (copy/paste):

Printing:

Usage:

Reviews

Account & settings

Search

Search database

Refine By

Subjects Ebook Subjects

Choose shopping cart