E-book: Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data, Third Edition [Taylor & Francis e-book]

Bruce Ratner (DM STAT-1 Consulting, New York, New York, USA)

Format: 690 pages
Publication date: 30-Jun-2020
Publisher: Chapman & Hall/CRC
ISBN-13: 9781315156316

Taylor & Francis e-book
Price: 170,80 €*
* Price covers access for an unlimited number of simultaneous users for an unlimited time
Regular price: 244,00 €
You save 30%

Book homepage: https://www.taylorfrancis.com/books/9781315156316

Interest in predictive analytics of big data has grown exponentially in the four years since the publication of Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data, Second Edition. In the third edition of this bestseller, the author has completely revised, reorganized, and repositioned the original chapters and produced 13 new chapters of creative and useful machine-learning data mining techniques. In sum, the 43 chapters of simple yet insightful quantitative techniques make this book unique in the field of data mining literature.

What is new in the Third Edition:

The current chapters have been completely rewritten.

The core content has been extended with strategies and methods for problems drawn from the top predictive analytics conference and statistical modeling workshops.

Adds thirteen new chapters including coverage of data science and its rise, market share estimation, share of wallet modeling without survey data, latent market segmentation, statistical regression modeling that deals with incomplete data, decile analysis assessment in terms of the predictive power of the data, and a user-friendly version of text mining, not requiring an advanced background in natural language processing (NLP).

Includes SAS subroutines that can be easily converted to other languages.

As in the previous edition, this book offers detailed background, discussion, and illustration of specific methods for solving the most commonly experienced problems in predictive modeling and analysis of big data. The author addresses each methodology and assigns its application to a specific type of problem. To better ground readers, the book provides an in-depth discussion of the basic methodologies of predictive modeling and analysis. While this type of overview has been attempted before, this approach offers a truly nitty-gritty, step-by-step method that both tyros and experts in the field can enjoy playing with.

Preface to Third Edition

xxiii

Preface of Second Edition

xxvii

Acknowledgments

xxxi

Author

xxxiii

1 Introduction

(12)

1.1 The Personal Computer and Statistics

(2)

1.2 Statistics and Data Analysis

(1)

1.3 EDA

(1)

1.4 The EDA Paradigm

(1)

1.5 EDA Weaknesses

(1)

1.6 Small and Big Data

(1)

1.6.1 Data Size Characteristics

(1)

1.6.2 Data Size: Personal Observation of One

(1)

1.7 Data Mining Paradigm

(1)

1.8 Statistics and Machine Learning

(1)

1.9 Statistical Data Mining

(3)

References

(2)

2 Science Dealing with Data: Statistics and Data Science

(12)

2.1 Introduction

(1)

2.2 Background

(2)

2.3 The Statistics and Data Science Comparison

(6)

2.3.1 Statistics versus Data Science

(6)

2.4 Discussion: Are Statistics and Data Science Different?

(2)

2.4.1 Analysis: Are Statistics and Data Science Different?

(1)

2.5 Summary

(1)

2.6 Epilogue

(2)

References

(2)

3 Two Basic Data Mining Methods for Variable Assessment

(12)

3.1 Introduction

(1)

3.2 Correlation Coefficient

(2)

3.3 Scatterplots

(1)

3.4 Data Mining

(2)

3.4.1 Example 3.1

(1)

3.4.2 Example 3.2

(1)

3.5 Smoothed Scatterplot

(3)

3.6 General Association Test

(1)

3.7 Summary

(3)

References

(2)

4 CHAID-Based Data Mining for Paired-Variable Assessment

(10)

4.1 Introduction

(1)

4.2 The Scatterplot

(1)

4.2.1 An Exemplar Scatterplot

(1)

4.3 The Smooth Scatterplot

(1)

4.4 Primer on CHAID

(1)

4.5 CHAID-Based Data Mining for a Smoother Scatterplot

(5)

4.5.1 The Smoother Scatterplot

(3)

4.6 Summary

(2)

Reference

(2)

5 The Importance of Straight Data: Simplicity and Desirability for Good Model-Building Practice

(8)

5.1 Introduction

(1)

5.2 Straightness and Symmetry in Data

(1)

5.3 Data Mining Is a High Concept

(1)

5.4 The Correlation Coefficient

(2)

5.5 Scatterplot of (xx3, yy3)

(1)

5.6 Data Mining the Relationship of (xx3, yy3)

(3)

5.6.1 Side-by-Side Scatterplot

(1)

5.7 What Is the GP-Based Data Mining Doing to the Data?

(1)

5.8 Straightening a Handful of Variables and a Baker's Dozen of Variables

(1)

5.9 Summary

(1)

References

(1)

6 Symmetrizing Ranked Data: A Statistical Data Mining Method for Improving the Predictive Power of Data

(14)

6.1 Introduction

(1)

6.2 Scales of Measurement

(2)

6.3 Stem-and-Leaf Display

(1)

6.4 Box-and-Whiskers Plot

(1)

6.5 Illustration of the Symmetrizing Ranked Data Method

(10)

6.5.1 Illustration 1

(1)

6.5.1.1 Discussion of Illustration 1

(2)

6.5.2 Illustration 2

(1)

6.5.2.1 Titanic Dataset

(1)

6.5.2.2 Looking at the Recoded Titanic Ordinal Variables CLASS_, AGE_, GENDER_, CLASS_AGE_, and CLASS_GENDER_

(2)

6.5.2.3 Looking at the Symmetrized-Ranked Titanic Ordinal Variables rCLASS_, rAGE_, rGENDER_, rCLASS_AGE_, and rCLASS_GENDER_

(1)

6.5.2.4 Building a Preliminary Titanic Model

(3)

6.6 Summary

(1)

References

(1)

7 Principal Component Analysis: A Statistical Data Mining Method for Many-Variable Assessment

(12)

7.1 Introduction

(1)

7.2 EDA Reexpression Paradigm

(1)

7.3 What Is the Big Deal?

(1)

7.4 PCA Basics

(1)

7.5 Exemplary Detailed Illustration

(1)

7.5.1 Discussion

(1)

7.6 Algebraic Properties of PCA

(1)

7.7 Uncommon Illustration

(3)

7.7.1 PCA of R_CD Elements (X1, X2, X3, X4, X5, X6)

(1)

7.7.2 Discussion of the PCA of R_CD Elements

(2)

7.8 PCA in the Construction of Quasi-Interaction Variables

(4)

7.8.1 SAS Program for the PCA of the Quasi-Interaction Variable

(2)

7.9 Summary

(1)

8 Market Share Estimation: Data Mining for an Exceptional Case

(16)

8.1 Introduction

(1)

8.2 Background

(1)

8.3 Data Mining for an Exceptional Case

(1)

8.3.1 Exceptional Case: Infant Formula YUM

(1)

8.4 Building the RAL-YUM Market Share Model

(10)

8.4.1 Decile Analysis of YUM_3mos MARKET-SHARE Model

(1)

8.4.2 Conclusion of YUM_3mos MARKET-SHARE Model

(1)

8.5 Summary

(4)

Appendix 8.A Dummify PROMO_Code

(1)

Appendix 8.B PCA of PROMO_Code Dummy Variables

(1)

Appendix 8.C Logistic Regression YUM_3mos on PROMO_Code Dummy Variables

(1)

Appendix 8.D Creating YUM_3mos_wo_PROMO_CodeEff

(1)

Appendix 8.E Normalizing a Variable to Lie Within [0, 1]

(1)

References

(1)

9 The Correlation Coefficient: Its Values Range between Plus and Minus 1, or Do They?

(8)

9.1 Introduction

(1)

9.2 Basics of the Correlation Coefficient

(2)

9.3 Calculation of the Correlation Coefficient

(1)

9.4 Rematching

(2)

9.5 Calculation of the Adjusted Correlation Coefficient

101

(1)

9.6 Implication of Rematching

102

(1)

9.7 Summary

102

(3)

10 Logistic Regression: The Workhorse of Response Modeling

105

(46)

10.1 Introduction

105

(1)

10.2 Logistic Regression Model

106

(3)

10.2.1 Illustration

106

(1)

10.2.2 Scoring an LRM

107

(2)

10.3 Case Study

109

(1)

10.3.1 Candidate Predictor and Dependent Variables

110

(1)

10.4 Logits and Logit Plots

110

(2)

10.4.1 Logits for Case Study

111

(1)

10.5 The Importance of Straight Data

112

(1)

10.6 Reexpressing for Straight Data

112

(3)

10.6.1 Ladder of Powers

113

(1)

10.6.2 Bulging Rule

114

(1)

10.6.3 Measuring Straight Data

114

(1)

10.7 Straight Data for Case Study

115

(3)

10.7.1 Reexpressing FD2_OPEN

116

(1)

10.7.2 Reexpressing INVESTMENT

116

(2)

10.8 Techniques when the Bulging Rule Does Not Apply

118

(1)

10.8.1 Fitted Logit Plot

118

(1)

10.8.2 Smooth Predicted-versus-Actual Plot

119

(1)

10.9 Reexpressing MOS_OPEN

119

(4)

10.9.1 Plot of Smooth Predicted versus Actual for MOS_OPEN

120

(3)

10.10 Assessing the Importance of Variables

123

(2)

10.10.1 Computing the G Statistic

123

(1)

10.10.2 Importance of a Single Variable

124

(1)

10.10.3 Importance of a Subset of Variables

124

(1)

10.10.4 Comparing the Importance of Different Subsets of Variables

124

(1)

10.11 Important Variables for Case Study

125

(2)

10.11.1 Importance of the Predictor Variables

126

(1)

10.12 Relative Importance of the Variables

127

(1)

10.12.1 Selecting the Best Subset

127

(1)

10.13 Best Subset of Variables for Case Study

128

(1)

10.14 Visual Indicators of Goodness of Model Predictions

129

(7)

10.14.1 Plot of Smooth Residual by Score Groups

130

(1)

10.14.1.1 Plot of the Smooth Residual by Score Groups for Case Study

130

(2)

10.14.2 Plot of Smooth Actual versus Predicted by Decile Groups

132

(1)

10.14.2.1 Plot of Smooth Actual versus Predicted by Decile Groups for Case Study

132

(2)

10.14.3 Plot of Smooth Actual versus Predicted by Score Groups

134

(1)

10.14.3.1 Plot of Smooth Actual versus Predicted by Score Groups for Case Study

134

(2)

10.15 Evaluating the Data Mining Work

136

(5)

10.15.1 Comparison of Plots of Smooth Residual by Score Groups: EDA versus Non-EDA Models

137

(2)

10.15.2 Comparison of the Plots of Smooth Actual versus Predicted by Decile Groups: EDA versus Non-EDA Models

139

(1)

10.15.3 Comparison of Plots of Smooth Actual versus Predicted by Score Groups: EDA versus Non-EDA Models

140

(1)

10.15.4 Summary of the Data Mining Work

141

(1)

10.16 Smoothing a Categorical Variable

141

(4)

10.16.1 Smoothing FD_TYPE with CHAID

142

(2)

10.16.2 Importance of CH_FTY_1 and CH_FTY_2

144

(1)

10.17 Additional Data Mining Work for Case Study

145

(5)

10.17.1 Comparison of Plots of Smooth Residual by Score Group: 4var-EDA versus 3var-EDA Models

146

(1)

10.17.2 Comparison of the Plots of Smooth Actual versus Predicted by Decile Groups: 4var-EDA versus 3var-EDA Models

147

(1)

10.17.3 Comparison of Plots of Smooth Actual versus Predicted by Score Groups: 4var-EDA versus 3var-EDA Models

147

(2)

10.17.4 Final Summary of the Additional Data Mining Work

149

(1)

10.18 Summary

150

(1)

11 Predicting Share of Wallet without Survey Data

151

(18)

11.1 Introduction

151

(1)

11.2 Background

151

(2)

11.2.1 SOW Definition

152

(1)

11.2.1.1 SOW_q Definition

152

(1)

11.2.1.2 SOW_q Likeliness Assumption

152

(1)

11.3 Illustration of Calculation of SOW_q

153

(5)

11.3.1 Query of Interest

153

(1)

11.3.2 DOLLARS and TOTAL DOLLARS

153

(5)

11.4 Building the AMPECS SOW_q Model

158

(1)

11.5 SOW_q Model Definition

159

(2)

11.5.1 SOW_q Model Results

160

(1)

11.6 Summary

161

(8)

Appendix 11.A Six Steps

162

(2)

Appendix 11.B Seven Steps

164

(3)

References

167

(2)

12 Ordinary Regression: The Workhorse of Profit Modeling

169

(20)

12.1 Introduction

169

(1)

12.2 Ordinary Regression Model

169

(3)

12.2.1 Illustration

170

(1)

12.2.2 Scoring an OLS Profit Model

171

(1)

12.3 Mini Case Study

172

(8)

12.3.1 Straight Data for Mini Case Study

172

(2)

12.3.1.1 Reexpressing INCOME

174

(1)

12.3.1.2 Reexpressing AGE

175

(2)

12.3.2 Plot of Smooth Predicted versus Actual

177

(1)

12.3.3 Assessing the Importance of Variables

178

(1)

12.3.3.1 Defining the F Statistic and R-Squared

179

(1)

12.3.3.2 Importance of a Single Variable

179

(1)

12.3.3.3 Importance of a Subset of Variables

179

(1)

12.3.3.4 Comparing the Importance of Different Subsets of Variables

180

(1)

12.4 Important Variables for Mini Case Study

180

(2)

12.4.1 Relative Importance of the Variables

181

(1)

12.4.2 Selecting the Best Subset

181

(1)

12.5 Best Subset of Variables for Case Study

182

(3)

12.5.1 PROFIT Model with gINCOME and AGE

183

(2)

12.5.2 Best PROFIT Model

185

(1)

12.6 Suppressor Variable AGE

185

(1)

12.7 Summary

186

(3)

References

187

(2)

13 Variable Selection Methods in Regression: Ignorable Problem, Notable Solution

189

(14)

13.1 Introduction

189

(1)

13.2 Background

189

(3)

13.3 Frequently Used Variable Selection Methods

192

(1)

13.4 Weakness in the Stepwise

193

(1)

13.5 Enhanced Variable Selection Method

194

(2)

13.6 Exploratory Data Analysis

196

(4)

13.7 Summary

200

(3)

References

200

(3)

14 CHAID for Interpreting a Logistic Regression Model

203

(16)

14.1 Introduction

203

(1)

14.2 Logistic Regression Model

203

(1)

14.3 Database Marketing Response Model Case Study

204

(1)

14.3.1 Odds Ratio

205

(1)

14.4 CHAID

205

(3)

14.4.1 Proposed CHAID-Based Method

206

(2)

14.5 Multivariable CHAID Trees

208

(2)

14.6 CHAID Market Segmentation

210

(3)

14.7 CHAID Tree Graphs

213

(3)

14.8 Summary

216

(3)

15 The Importance of the Regression Coefficient

219

(10)

15.1 Introduction

219

(1)

15.2 The Ordinary Regression Model

219

(1)

15.3 Four Questions

220

(1)

15.4 Important Predictor Variables

220

(1)

15.5 P-Values and Big Data

221

(1)

15.6 Returning to Question 1

222

(1)

15.7 Effect of Predictor Variable on Prediction

222

(1)

15.8 The Caveat

223

(2)

15.9 Returning to Question 2

225

(1)

15.10 Ranking Predictor Variables by Effect on Prediction

225

(1)

15.11 Returning to Question 3

226

(1)

15.12 Returning to Question 4

227

(1)

15.13 Summary

227

(2)

References

228

(1)

16 The Average Correlation: A Statistical Data Mining Measure for Assessment of Competing Predictive Models and the Importance of the Predictor Variables

229

(10)

16.1 Introduction

229

(1)

16.2 Background

229

(2)

16.3 Illustration of the Difference between Reliability and Validity

231

(1)

16.4 Illustration of the Relationship between Reliability and Validity

231

(1)

16.5 The Average Correlation

232

(5)

16.5.1 Illustration of the Average Correlation with an LTV5 Model

232

(4)

16.5.2 Continuing with the Illustration of the Average Correlation with an LTV5 Model

236

(1)

16.5.3 Continuing with the Illustration with a Competing LTV5 Model

236

(1)

16.5.3.1 The Importance of the Predictor Variables

237

(1)

16.6 Summary

237

(2)

Reference

237

(2)

17 CHAID for Specifying a Model with Interaction Variables

239

(12)

17.1 Introduction

239

(1)

17.2 Interaction Variables

239

(1)

17.3 Strategy for Modeling with Interaction Variables

240

(1)

17.4 Strategy Based on the Notion of a Special Point

240

(1)

17.5 Example of a Response Model with an Interaction Variable

241

(1)

17.6 CHAID for Uncovering Relationships

242

(1)

17.7 Illustration of CHAID for Specifying a Model

243

(3)

17.8 An Exploratory Look

246

(1)

17.9 Database Implication

247

(1)

17.10 Summary

248

(3)

References

249

(2)

18 Market Segmentation Classification Modeling with Logistic Regression

251

(14)

18.1 Introduction

251

(1)

18.2 Binary Logistic Regression

251

(1)

18.2.1 Necessary Notation

252

(1)

18.3 Polychotomous Logistic Regression Model

252

(1)

18.4 Model Building with PLR

253

(1)

18.5 Market Segmentation Classification Model

254

(9)

18.5.1 Survey of Cellular Phone Users

254

(1)

18.5.2 CHAID Analysis

255

(3)

18.5.3 CHAID Tree Graphs

258

(3)

18.5.4 Market Segmentation Classification Model

261

(2)

18.6 Summary

263

(2)

19 Market Segmentation Based on Time-Series Data Using Latent Class Analysis

265

(22)

19.1 Introduction

265

(1)

19.2 Background

265

(5)

19.2.1 K-Means Clustering

265

(1)

19.2.2 PCA

266

(1)

19.2.3 FA

266

(1)

19.2.3.1 FA Model

267

(1)

19.2.3.2 FA Model Estimation

267

(1)

19.2.3.3 FA versus OLS Graphical Depiction

268

(1)

19.2.4 LCA versus FA Graphical Depiction

268

(2)

19.3 LCA

270

(2)

19.3.1 LCA of Universal and Particular Study

270

(1)

19.3.1.1 Discussion of LCA Output

270

(1)

19.3.1.2 Discussion of Posterior Probability

271

(1)

19.4 LCA versus k-Means Clustering

272

(2)

19.5 LCA Market Segmentation Model Based on Time-Series Data

274

(8)

19.5.1 Objective

274

(2)

19.5.2 Best LCA Models

276

(2)

19.5.2.1 Cluster Sizes and Conditional Probabilities/Means

278

(3)

19.5.2.2 Indicator-Level Posterior Probabilities

281

(1)

19.6 Summary

282

(5)

Appendix 19.A Creating Trend3 for UNITS

282

(2)

Appendix 19.B POS-ZER-NEG Creating Trend4

284

(1)

References

285

(2)

20 Market Segmentation: An Easy Way to Understand the Segments

287

(6)

20.1 Introduction

287

(1)

20.2 Background

287

(1)

20.3 Illustration

288

(1)

20.4 Understanding the Segments

289

(1)

20.5 Summary

290

(3)

Appendix 20.A Dataset SAMPLE

290

(1)

Appendix 20.B Segmentor-Means

291

(1)

Appendix 20.C Indexed Profiles

291

(1)

References

292

(1)

21 The Statistical Regression Model: An Easy Way to Understand the Model

293

(14)

21.1 Introduction

293

(1)

21.2 Background

293

(1)

21.3 EZ-Method Applied to the LR Model

294

(2)

21.4 Discussion of the LR EZ-Method Illustration

296

(3)

21.5 Summary

299

(8)

Appendix 21.A M65-Spread Base Means X10-X14

299

(2)

Appendix 21.B Create Ten Datasets for Each Decile

301

(1)

Appendix 21.C Indexed Profiles of Deciles

302

(5)

22 CHAID as a Method for Filling in Missing Values

307

(16)

22.1 Introduction

307

(1)

22.2 Introduction to the Problem of Missing Data

307

(2)

22.3 Missing Data Assumption

309

(1)

22.4 CHAID Imputation

310

(1)

22.5 Illustration

311

(5)

22.5.1 CHAID Mean-Value Imputation for a Continuous Variable

312

(1)

22.5.2 Many Mean-Value CHAID Imputations for a Continuous Variable

313

(1)

22.5.3 Regression Tree Imputation for LIFE_DOL

314

(2)

22.6 CHAID Most Likely Category Imputation for a Categorical Variable

316

(4)

22.6.1 CHAID Most Likely Category Imputation for GENDER

316

(2)

22.6.2 Classification Tree Imputation for GENDER

318

(2)

22.7 Summary

320

(3)

References

321

(2)

23 Model Building with Big Complete and Incomplete Data

323

(12)

23.1 Introduction

323

(1)

23.2 Background

323

(1)

23.3 The CCA-PCA Method: Illustration Details

324

(2)

23.3.1 Determining the Complete and Incomplete Datasets

324

(2)

23.4 Building the RESPONSE Model with Complete (CCA) Dataset

326

(2)

23.4.1 CCA RESPONSE Model Results

327

(1)

23.5 Building the RESPONSE Model with Incomplete (ICA) Dataset

328

(1)

23.5.1 PCA on BICA Data

329

(1)

23.6 Building the RESPONSE Model on PCA-BICA Data

329

(3)

23.6.1 PCA-BICA RESPONSE Model Results

330

(1)

23.6.2 Combined CCA and PCA-BICA RESPONSE Model Results

331

(1)

23.7 Summary

332

(3)

Appendix 23.A NMISS

333

(1)

Appendix 23.B Testing CCA Samsizes

333

(1)

Appendix 23.C CCA-CIA Datasets

333

(1)

Appendix 23.D Ones and Zeros

333

(1)

Reference

334

(1)

24 Art, Science, Numbers, and Poetry

335

(6)

24.1 Introduction

335

(1)

24.2 Zeros and Ones

336

(1)

24.3 Power of Thought

336

(2)

24.4 The Statistical Golden Rule: Measuring the Art and Science of Statistical Practice

338

(2)

24.4.1 Background

338

(1)

24.4.1.1 The Statistical Golden Rule

339

(1)

24.5 Summary

340

(1)

Reference

340

(1)

25 Identifying Your Best Customers: Descriptive, Predictive, and Look-Alike Profiling

341

(14)

25.1 Introduction

341

(1)

25.2 Some Definitions

341

(1)

25.3 Illustration of a Flawed Targeting Effort

342

(1)

25.4 Well-Defined Targeting Effort

343

(2)

25.5 Predictive Profiles

345

(3)

25.6 Continuous Trees

348

(2)

25.7 Look-Alike Profiling

350

(3)

25.8 Look-Alike Tree Characteristics

353

(1)

25.9 Summary

353

(2)

26 Assessment of Marketing Models

355

(12)

26.1 Introduction

355

(1)

26.2 Accuracy for Response Model

355

(1)

26.3 Accuracy for Profit Model

356

(2)

26.4 Decile Analysis and Cum Lift for Response Model

358

(1)

26.5 Decile Analysis and Cum Lift for Profit Model

359

(1)

26.6 Precision for Response Model

360

(2)

26.7 Precision for Profit Model

362

(1)

26.7.1 Construction of SWMAD

363

(1)

26.8 Separability for Response and Profit Models

363

(1)

26.9 Guidelines for Using Cum Lift, HL/SWMAD, and CV

364

(1)

26.10 Summary

364

(3)

27 Decile Analysis: Perspective and Performance

367

(20)

27.1 Introduction

367

(1)

27.2 Background

367

(4)

27.2.1 Illustration

369

(1)

27.2.1.1 Discussion of Classification Table of RESPONSE Model

370

(1)

27.3 Assessing Performance: RESPONSE Model versus Chance Model

371

(1)

27.4 Assessing Performance: The Decile Analysis

372

(5)

27.4.1 The RESPONSE Decile Analysis

372

(5)

27.5 Summary

377

(10)

Appendix 27.A Incremental Gain in Accuracy: Model versus Chance

378

(1)

Appendix 27.B Incremental Gain in Precision: Model versus Chance

379

(1)

Appendix 27.C RESPONSE Model Decile PROB_est Values

380

(2)

Appendix 27.D 2×2 Tables by Decile

382

(3)

References

385

(2)

28 Net T-C Lift Model: Assessing the Net Effects of Test and Control Campaigns

387

(26)

28.1 Introduction

387

(1)

28.2 Background

387

(2)

28.3 Building TEST and CONTROL Response Models

389

(5)

28.3.1 Building TEST Response Model

390

(2)

28.3.2 Building CONTROL Response Model

392

(2)

28.4 Net T-C Lift Model

394

(4)

28.4.1 Building the Net T-C Lift Model

395

(1)

28.4.1.1 Discussion of the Net T-C Lift Model

395

(2)

28.4.1.2 Discussion of Equal-Group Sizes Decile of the Net T-C Lift Model

397

(1)

28.5 Summary

398

(15)

Appendix 28.A TEST Logistic with Xs

400

(2)

Appendix 28.B CONTROL Logistic with Xs

402

(3)

Appendix 28.C Merge Score

405

(1)

Appendix 28.D NET T-C Decile Analysis

406

(4)

References

410

(3)

29 Bootstrapping in Marketing: A New Approach for Validating Models

413

(16)

29.1 Introduction

413

(1)

29.2 Traditional Model Validation

413

(1)

29.3 Illustration

414

(1)

29.4 Three Questions

415

(1)

29.5 The Bootstrap Method

416

(1)

29.5.1 Traditional Construction of Confidence Intervals

416

(1)

29.6 How to Bootstrap

417

(2)

29.6.1 Simple Illustration

418

(1)

29.7 Bootstrap Decile Analysis Validation

419

(1)

29.8 Another Question

420

(1)

29.9 Bootstrap Assessment of Model Implementation Performance

421

(5)

29.9.1 Illustration

424

(2)

29.10 Bootstrap Assessment of Model Efficiency

426

(2)

29.11 Summary

428

(1)

References

428

(1)

30 Validating the Logistic Regression Model: Try Bootstrapping

429

(2)

30.1 Introduction

429

(1)

30.2 Logistic Regression Model

429

(1)

30.3 The Bootstrap Validation Method

429

(1)

30.4 Summary

430

(1)

Reference

430

(1)

31 Visualization of Marketing Models: Data Mining to Uncover Innards of a Model

431

(22)

31.1 Introduction

431

(1)

31.2 Brief History of the Graph

431

(1)

31.3 Star Graph Basics

432

(2)

31.3.1 Illustration

433

(1)

31.4 Star Graphs for Single Variables

434

(1)

31.5 Star Graphs for Many Variables Considered Jointly

435

(2)

31.6 Profile Curves Method

437

(1)

31.6.1 Profile Curves Basics

437

(1)

31.6.2 Profile Analysis

438

(1)

31.7 Illustration

438

(6)

31.7.1 Profile Curves for RESPONSE Model

440

(2)

31.7.2 Decile Group Profile Curves

442

(2)

31.8 Summary

444

(9)

Appendix 31.A Star Graphs for Each Demographic Variable about the Deciles

445

(2)

Appendix 31.B Star Graphs for Each Decile about the Demographic Variables

447

(3)

Appendix 31.C Profile Curves: All Deciles

450

(2)

References

452

(1)

32 The Predictive Contribution Coefficient: A Measure of Predictive Importance

453

(12)

32.1 Introduction

453

(1)

32.2 Background

453

(2)

32.3 Illustration of Decision Rule

455

(2)

32.4 Predictive Contribution Coefficient

457

(1)

32.5 Calculation of Predictive Contribution Coefficient

458

(1)

32.6 Extra-Illustration of Predictive Contribution Coefficient

459

(3)

32.7 Summary

462

(3)

Reference

463

(2)

33 Regression Modeling Involves Art, Science, and Poetry, Too

465

(6)

33.1 Introduction

465

(1)

33.2 Shakespearean Modelogue

465

(1)

33.3 Interpretation of the Shakespearean Modelogue

466

(3)

33.4 Summary

469

(2)

References

469

(2)

34 Opening the Dataset: A Twelve-Step Program for Dataholics

471

(6)

34.1 Introduction

471

(1)

34.2 Background

471

(1)

34.3 Stepping

471

(2)

34.4 Brush Marking

473

(1)

34.5 Summary

474

(3)

Appendix 34.A Dataset IN

474

(1)

Appendix 34.B SamsizePlus

475

(1)

Appendix 34.C Copy-Pasteable

475

(1)

Appendix 34.D Missings

475

(1)

References

476

(1)

35 Genetic and Statistic Regression Models: A Comparison

477

(10)

35.1 Introduction

477

(1)

35.2 Background

477

(1)

35.3 Objective

478

(1)

35.4 The GenIQ Model, the Genetic Logistic Regression

478

(2)

35.4.1 Illustration of "Filling Up the Upper Deciles"

479

(1)

35.5 A Pithy Summary of the Development of Genetic Programming

480

(2)

35.6 The GenIQ Model: A Brief Review of Its Objective and Salient Features

482

(1)

35.6.1 The GenIQ Model Requires Selection of Variables and Function: An Extra Burden?

482

(1)

35.7 The GenIQ Model: How It Works

483

(3)

35.7.1 The GenIQ Model Maximizes the Decile Table

485

(1)

35.8 Summary

486

(1)

References

486

(1)

36 Data Reuse: A Powerful Data Mining Effect of the GenIQ Model

487

(8)

36.1 Introduction

487

(1)

36.2 Data Reuse

487

(1)

36.3 Illustration of Data Reuse

488

(3)

36.3.1 The GenIQ Profit Model

488

(1)

36.3.2 Data-Reused Variables

489

(1)

36.3.3 Data-Reused Variables GenIQvar_1 and GenIQvar_2

490

(1)

36.4 Modified Data Reuse: A GenIQ-Enhanced Regression Model

491

(2)

36.4.1 Illustration of a GenIQ-Enhanced LRM

491

(2)

36.5 Summary

493

(2)

37 A Data Mining Method for Moderating Outliers Instead of Discarding Them

495

(6)

37.1 Introduction

495

(1)

37.2 Background

495

(1)

37.3 Moderating Outliers Instead of Discarding Them

496

(3)

37.3.1 Illustration of Moderating Outliers Instead of Discarding Them

496

(2)

37.3.2 The GenIQ Model for Moderating the Outlier

498

(1)

37.4 Summary

499

(2)

Reference

499

(2)

38 Overfitting: Old Problem, New Solution

501

(8)

38.1 Introduction

501

(1)

38.2 Background

501

(2)

38.2.1 Idiomatic Definition of Overfitting to Help Remember the Concept

502

(1)

38.3 The GenIQ Model Solution to Overfitting

503

(5)

38.3.1 RANDOM_SPLIT GenIQ Model

505

(1)

38.3.2 RANDOM_SPLIT GenIQ Model Decile Analysis

505

(2)

38.3.3 Quasi N-tile Analysis

507

(1)

38.4 Summary

508

(1)

39 The Importance of Straight Data: Revisited

509

(4)

39.1 Introduction

509

(1)

39.2 Restatement of Why It Is Important to Straighten Data

509

(1)

39.3 Restatement of Section 12.3.1.1 "Reexpressing INCOME"

510

(1)

39.3.1 Complete Exposition of Reexpressing INCOME

510

(1)

39.3.1.1 The GenIQ Model Detail of the gINCOME Structure

511

(1)

39.4 Restatement of Section 5.6 "Data Mining the Relationship of (xx3, yy3)"

511

(1)

39.4.1 The GenIQ Model Detail of the GenIQvar(yy3) Structure

511

(1)

39.5 Summary

512

(1)

40 The GenIQ Model: Its Definition and an Application

513

(16)

40.1 Introduction

513

(1)

40.2 What Is Optimization?

513

(1)

40.3 What Is Genetic Modeling?

514

(1)

40.4 Genetic Modeling: An Illustration

515

(4)

40.4.1 Reproduction

517

(1)

40.4.2 Crossover

518

(1)

40.4.3 Mutation

518

(1)

40.5 Parameters for Controlling a Genetic Model Run

519

(1)

40.6 Genetic Modeling: Strengths and Limitations

519

(1)

40.7 Goals of Marketing Modeling

520

(1)

40.8 The GenIQ Response Model

520

(1)

40.9 The GenIQ Profit Model

521

(1)

40.10 Case Study: Response Model

522

(2)

40.11 Case Study: Profit Model

524

(3)

40.12 Summary

527

(2)

Reference

527

(2)

41 Finding the Best Variables for Marketing Models

529

(18)

41.1 Introduction

529

(1)

41.2 Background

529

(2)

41.3 Weakness in the Variable Selection Methods

531

(1)

41.4 Goals of Modeling in Marketing

532

(1)

41.5 Variable Selection with GenIQ

533

(9)

41.5.1 GenIQ Modeling

535

(2)

41.5.2 GenIQ Structure Identification

537

(2)

41.5.3 GenIQ Variable Selection

539

(3)

41.6 Nonlinear Alternative to Logistic Regression Model

542

(3)

41.7 Summary

545

(2)

References

546

(1)

42 Interpretation of Coefficient-Free Models

547

(22)

42.1 Introduction

547

(1)

42.2 The Linear Regression Coefficient

547

(2)

42.2.1 Illustration for the Simple Ordinary Regression Model

548

(1)

42.2.2 Illustration for the Simple Logistic Regression Model

548

(1)

42.3 The Quasi-Regression Coefficient for Simple Regression Models

549

(4)

42.3.1 Illustration of Quasi-RC for the Simple Ordinary Regression Model

549

(1)

42.3.2 Illustration of Quasi-RC for the Simple Logistic Regression Model

550

(1)

42.3.3 Illustration of Quasi-RC for Nonlinear Predictions

551

(2)

42.4 Partial Quasi-RC for the Everymodel

553

(7)

42.4.1 Calculating the Partial Quasi-RC for the Everymodel

554

(1)

42.4.2 Illustration for the Multiple Logistic Regression Model

555

(5)

42.5 Quasi-RC for a Coefficient-Free Model

560

(7)

42.5.1 Illustration of Quasi-RC for a Coefficient-Free Model

560

(7)

42.6 Summary

567

(2)

43 Text Mining: Primer, Illustration, and TXTDM Software

569

(24)

43.1 Introduction

569

(1)

43.2 Background

569

(2)

43.2.1 Text Mining Software: Free versus Commercial versus TXTDM

570

(1)

43.3 Primer of Text Mining

571

(2)

43.4 Statistics of the Words

573

(1)

43.5 The Binary Dataset of Words in Documents

574

(1)

43.6 Illustration of TXTDM Text Mining

575

(9)

43.7 Analysis of the Text-Mined GenIQ_FAVORED Model

584

(1)

43.7.1 Text-Based Profiling of Respondents Who Prefer GenIQ

584

(1)

43.7.2 Text-Based Profiling of Respondents Who Prefer OLS-Logistic

585

(1)

43.8 Weighted TXTDM

585

(1)

43.9 Clustering Documents

586

(7)

43.9.1 Clustering GenIQ Survey Documents

586

(6)

43.9.1.1 Conclusion of Clustering GenIQ Survey Documents

592

(1)

43.10 Summary

593

(1)

Appendix

593

(52)

Appendix 43.A Loading Corpus TEXT Dataset

594

(1)

Appendix 43.B Intermediate Step Creating Binary Words

594

(1)

Appendix 43.C Creating the Final Binary Words

595

(1)

Appendix 43.D Calculate Statistics TF, DF, NUM_DOCS, and N(=Num of Words)

596

(1)

Appendix 43.E Append GenIQ_FAVORED to WORDS Dataset

597

(1)

Appendix 43.F Logistic GenIQ_FAVORED Model

598

(1)

Appendix 43.G Average Correlation among Words

599

(1)

Appendix 43.H Creating TF-IDF

600

(2)

Appendix 43.I WORD_TF-IDF Weights by Concat of WORDS and TF-IDF

602

(2)

Appendix 43.J WORD_RESP WORD_TF-IDF RESP

604

(1)

Appendix 43.K Stemming

604

(1)

Appendix 43.L WORD Times TF-IDF

604

(1)

Appendix 43.M Dataset Weighted with Words for Profile

605

(1)

Appendix 43.N VARCLUS for Two-Class Solution

606

(1)

Appendix 43.O Scoring VARCLUS for Two-Cluster Solution

606

(1)

Appendix 43.P Direction of Words with Its Cluster 1

607

(2)

Appendix 43.Q Performance of GenIQ Model versus Chance Model

609

(1)

Appendix 43.R Performance of Liberal-Cluster Model versus Chance Model

609

(2)

References

610

(1)

44 Some of My Favorite Statistical Subroutines

611

(34)

44.1 List of Subroutines

611

(1)

44.2 Smoothplots (Mean and Median) of Chapter 5 - X1 versus X2

611

(4)

44.3 Smoothplots of Chapter 10 - Logit and Probability

615

(3)

44.4 Average Correlation of Chapter 16 - Among Var1 Var2 Var3

618

(2)

44.5 Bootstrapped Decile Analysis of Chapter 29 - Using Data from Table 23.4

620

(7)

44.6 H-Spread Common Region of Chapter 42

627

(3)

44.7 Favorite - Proc Corr with Option Rank, Vertical Output

630

(1)

44.8 Favorite - Decile Analysis - Response

631

(4)

44.9 Favorite - Decile Analysis - Profit

635

(3)

44.10 Favorite - Smoothing Time-Series Data (Running Medians of Three)

638

(5)

44.11 Favorite - First Cut Is the Deepest - Among Variables with Large Skew Values

643

(2)

Index

645

Bruce Ratner, The Significant Statistician™, is President and Founder of DM STAT-1 Consulting, the ensample for Statistical Modeling, Analysis and Data Mining, and Machine-learning Data Mining in the DM Space. DM STAT-1 specializes in all standard statistical techniques, and methods using machine-learning/statistics algorithms, such as its patented GenIQ Model, to achieve its clients' goals across industries including Direct and Database Marketing, Banking, Insurance, Finance, Retail, Telecommunications, Healthcare, Pharmaceutical, Publication & Circulation, Mass & Direct Advertising, Catalog Marketing, e-Commerce, Web-mining, B2B, Human Capital Management, Risk Management, and Nonprofit Fundraising. Bruce holds a doctorate in mathematics and statistics, with a concentration in multivariate statistics and response model simulation. His research interests include developing hybrid-modeling techniques, which combine traditional statistics and machine learning methods. He holds a patent for a unique application in solving the two-group classification problem with genetic programming.

Permalink: https://www.kriso.ee/db/9781315156316_pe.html
