
E-book: Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data, Third Edition [Taylor & Francis e-book]

Bruce Ratner (DM STAT-1 Consulting, New York, New York, USA)
  • Format: 690 pages
  • Publication date: 30-Jun-2020
  • Publisher: Chapman & Hall/CRC
  • ISBN-13: 9781315156316
  • Taylor & Francis e-book
  • Price: 170,80 €*
  • * price that provides access for an unlimited number of simultaneous users for an unlimited time
  • Regular price: 244,00 €
  • You save 30%
Interest in predictive analytics of big data has grown exponentially in the four years since the publication of Statistical and Machine-Learning Data Mining: Techniques for Better Predictive Modeling and Analysis of Big Data, Second Edition. In the third edition of this bestseller, the author has completely revised, reorganized, and repositioned the original chapters and produced 13 new chapters of creative and useful machine-learning data mining techniques. In sum, the 43 chapters of simple yet insightful quantitative techniques make this book unique in the field of data mining literature.

What is new in the Third Edition:

  • The current chapters have been completely rewritten.
  • The core content has been extended with strategies and methods for problems drawn from the top predictive analytics conference and statistical modeling workshops.
  • Adds thirteen new chapters including coverage of data science and its rise, market share estimation, share of wallet modeling without survey data, latent market segmentation, statistical regression modeling that deals with incomplete data, decile analysis assessment in terms of the predictive power of the data, and a user-friendly version of text mining, not requiring an advanced background in natural language processing (NLP).
  • Includes SAS subroutines which can be easily converted to other languages.

As in the previous edition, this book offers detailed background, discussion, and illustration of specific methods for solving the most commonly experienced problems in predictive modeling and analysis of big data. The author addresses each methodology and assigns its application to a specific type of problem. To better ground readers, the book provides an in-depth discussion of the basic methodologies of predictive modeling and analysis. While this type of overview has been attempted before, this approach offers a truly nitty-gritty, step-by-step method that both tyros and experts in the field can enjoy playing with.
Preface to Third Edition xxiii
Preface of Second Edition xxvii
Acknowledgments xxxi
Author xxxiii
1 Introduction
1(12)
1.1 The Personal Computer and Statistics
1(2)
1.2 Statistics and Data Analysis
3(1)
1.3 EDA
4(1)
1.4 The EDA Paradigm
5(1)
1.5 EDA Weaknesses
6(1)
1.6 Small and Big Data
7(1)
1.6.1 Data Size Characteristics
7(1)
1.6.2 Data Size: Personal Observation of One
8(1)
1.7 Data Mining Paradigm
8(1)
1.8 Statistics and Machine Learning
9(1)
1.9 Statistical Data Mining
10(3)
References
11(2)
2 Science Dealing with Data: Statistics and Data Science
13(12)
2.1 Introduction
13(1)
2.2 Background
13(2)
2.3 The Statistics and Data Science Comparison
15(6)
2.3.1 Statistics versus Data Science
15(6)
2.4 Discussion: Are Statistics and Data Science Different?
21(2)
2.4.1 Analysis: Are Statistics and Data Science Different?
22(1)
2.5 Summary
23(1)
2.6 Epilogue
23(2)
References
23(2)
3 Two Basic Data Mining Methods for Variable Assessment
25(12)
3.1 Introduction
25(1)
3.2 Correlation Coefficient
25(2)
3.3 Scatterplots
27(1)
3.4 Data Mining
28(2)
3.4.1 Example 3.1
28(1)
3.4.2 Example 3.2
29(1)
3.5 Smoothed Scatterplot
30(3)
3.6 General Association Test
33(1)
3.7 Summary
34(3)
References
35(2)
4 CHAID-Based Data Mining for Paired-Variable Assessment
37(10)
4.1 Introduction
37(1)
4.2 The Scatterplot
37(1)
4.2.1 An Exemplar Scatterplot
38(1)
4.3 The Smooth Scatterplot
38(1)
4.4 Primer on CHAID
39(1)
4.5 CHAID-Based Data Mining for a Smoother Scatterplot
40(5)
4.5.1 The Smoother Scatterplot
42(3)
4.6 Summary
45(2)
Reference
45(2)
5 The Importance of Straight Data: Simplicity and Desirability for Good Model-Building Practice
47(8)
5.1 Introduction
47(1)
5.2 Straightness and Symmetry in Data
47(1)
5.3 Data Mining Is a High Concept
48(1)
5.4 The Correlation Coefficient
48(2)
5.5 Scatterplot of (xx3, yy3)
50(1)
5.6 Data Mining the Relationship of (xx3, yy3)
50(3)
5.6.1 Side-by-Side Scatterplot
53(1)
5.7 What Is the GP-Based Data Mining Doing to the Data?
53(1)
5.8 Straightening a Handful of Variables and a Baker's Dozen of Variables
53(1)
5.9 Summary
54(1)
References
54(1)
6 Symmetrizing Ranked Data: A Statistical Data Mining Method for Improving the Predictive Power of Data
55(14)
6.1 Introduction
55(1)
6.2 Scales of Measurement
55(2)
6.3 Stem-and-Leaf Display
57(1)
6.4 Box-and-Whiskers Plot
58(1)
6.5 Illustration of the Symmetrizing Ranked Data Method
58(10)
6.5.1 Illustration 1
59(1)
6.5.1.1 Discussion of Illustration 1
59(2)
6.5.2 Illustration 2
61(1)
6.5.2.1 Titanic Dataset
62(1)
6.5.2.2 Looking at the Recoded Titanic Ordinal Variables CLASS_, AGE_, GENDER_, CLASS_AGE_, and CLASS_GENDER_
62(2)
6.5.2.3 Looking at the Symmetrized-Ranked Titanic Ordinal Variables rCLASS_, rAGE_, rGENDER_, rCLASS_AGE_, and rCLASS_GENDER_
64(1)
6.5.2.4 Building a Preliminary Titanic Model
65(3)
6.6 Summary
68(1)
References
68(1)
7 Principal Component Analysis: A Statistical Data Mining Method for Many-Variable Assessment
69(12)
7.1 Introduction
69(1)
7.2 EDA Reexpression Paradigm
69(1)
7.3 What Is the Big Deal?
70(1)
7.4 PCA Basics
70(1)
7.5 Exemplary Detailed Illustration
71(1)
7.5.1 Discussion
71(1)
7.6 Algebraic Properties of PCA
72(1)
7.7 Uncommon Illustration
73(3)
7.7.1 PCA of R_CD Elements (X1, X2, X3, X4, X5, X6)
74(1)
7.7.2 Discussion of the PCA of R_CD Elements
74(2)
7.8 PCA in the Construction of Quasi-Interaction Variables
76(4)
7.8.1 SAS Program for the PCA of the Quasi-Interaction Variable
78(2)
7.9 Summary
80(1)
8 Market Share Estimation: Data Mining for an Exceptional Case
81(16)
8.1 Introduction
81(1)
8.2 Background
81(1)
8.3 Data Mining for an Exceptional Case
82(1)
8.3.1 Exceptional Case: Infant Formula YUM
82(1)
8.4 Building the RAL-YUM Market Share Model
83(10)
8.4.1 Decile Analysis of YUM_3mos MARKET-SHARE Model
92(1)
8.4.2 Conclusion of YUM_3mos MARKET-SHARE Model
92(1)
8.5 Summary
93(4)
Appendix 8.A Dummify PROMO_Code
93(1)
Appendix 8.B PCA of PROMO_Code Dummy Variables
94(1)
Appendix 8.C Logistic Regression YUM_3mos on PROMO_Code Dummy Variables
94(1)
Appendix 8.D Creating YUM_3mos_wo_PROMO_CodeEff
94(1)
Appendix 8.E Normalizing a Variable to Lie Within [0, 1]
95(1)
References
96(1)
9 The Correlation Coefficient: Its Values Range between Plus and Minus 1, or Do They?
97(8)
9.1 Introduction
97(1)
9.2 Basics of the Correlation Coefficient
97(2)
9.3 Calculation of the Correlation Coefficient
99(1)
9.4 Rematching
99(2)
9.5 Calculation of the Adjusted Correlation Coefficient
101(1)
9.6 Implication of Rematching
102(1)
9.7 Summary
102(3)
10 Logistic Regression: The Workhorse of Response Modeling
105(46)
10.1 Introduction
105(1)
10.2 Logistic Regression Model
106(3)
10.2.1 Illustration
106(1)
10.2.2 Scoring an LRM
107(2)
10.3 Case Study
109(1)
10.3.1 Candidate Predictor and Dependent Variables
110(1)
10.4 Logits and Logit Plots
110(2)
10.4.1 Logits for Case Study
111(1)
10.5 The Importance of Straight Data
112(1)
10.6 Reexpressing for Straight Data
112(3)
10.6.1 Ladder of Powers
113(1)
10.6.2 Bulging Rule
114(1)
10.6.3 Measuring Straight Data
114(1)
10.7 Straight Data for Case Study
115(3)
10.7.1 Reexpressing FD2_OPEN
116(1)
10.7.2 Reexpressing INVESTMENT
116(2)
10.8 Techniques when the Bulging Rule Does Not Apply
118(1)
10.8.1 Fitted Logit Plot
118(1)
10.8.2 Smooth Predicted-versus-Actual Plot
119(1)
10.9 Reexpressing MOS_OPEN
119(4)
10.9.1 Plot of Smooth Predicted versus Actual for MOS_OPEN
120(3)
10.10 Assessing the Importance of Variables
123(2)
10.10.1 Computing the G Statistic
123(1)
10.10.2 Importance of a Single Variable
124(1)
10.10.3 Importance of a Subset of Variables
124(1)
10.10.4 Comparing the Importance of Different Subsets of Variables
124(1)
10.11 Important Variables for Case Study
125(2)
10.11.1 Importance of the Predictor Variables
126(1)
10.12 Relative Importance of the Variables
127(1)
10.12.1 Selecting the Best Subset
127(1)
10.13 Best Subset of Variables for Case Study
128(1)
10.14 Visual Indicators of Goodness of Model Predictions
129(7)
10.14.1 Plot of Smooth Residual by Score Groups
130(1)
10.14.1.1 Plot of the Smooth Residual by Score Groups for Case Study
130(2)
10.14.2 Plot of Smooth Actual versus Predicted by Decile Groups
132(1)
10.14.2.1 Plot of Smooth Actual versus Predicted by Decile Groups for Case Study
132(2)
10.14.3 Plot of Smooth Actual versus Predicted by Score Groups
134(1)
10.14.3.1 Plot of Smooth Actual versus Predicted by Score Groups for Case Study
134(2)
10.15 Evaluating the Data Mining Work
136(5)
10.15.1 Comparison of Plots of Smooth Residual by Score Groups: EDA versus Non-EDA Models
137(2)
10.15.2 Comparison of the Plots of Smooth Actual versus Predicted by Decile Groups: EDA versus Non-EDA Models
139(1)
10.15.3 Comparison of Plots of Smooth Actual versus Predicted by Score Groups: EDA versus Non-EDA Models
140(1)
10.15.4 Summary of the Data Mining Work
141(1)
10.16 Smoothing a Categorical Variable
141(4)
10.16.1 Smoothing FD_TYPE with CHAID
142(2)
10.16.2 Importance of CH_FTY_1 and CH_FTY_2
144(1)
10.17 Additional Data Mining Work for Case Study
145(5)
10.17.1 Comparison of Plots of Smooth Residual by Score Group: 4var-EDA versus 3var-EDA Models
146(1)
10.17.2 Comparison of the Plots of Smooth Actual versus Predicted by Decile Groups: 4var-EDA versus 3var-EDA Models
147(1)
10.17.3 Comparison of Plots of Smooth Actual versus Predicted by Score Groups: 4var-EDA versus 3var-EDA Models
147(2)
10.17.4 Final Summary of the Additional Data Mining Work
149(1)
10.18 Summary
150(1)
11 Predicting Share of Wallet without Survey Data
151(18)
11.1 Introduction
151(1)
11.2 Background
151(2)
11.2.1 SOW Definition
152(1)
11.2.1.1 SOW_q Definition
152(1)
11.2.1.2 SOW_q Likeliness Assumption
152(1)
11.3 Illustration of Calculation of SOW_q
153(5)
11.3.1 Query of Interest
153(1)
11.3.2 DOLLARS and TOTAL DOLLARS
153(5)
11.4 Building the AMPECS SOW_q Model
158(1)
11.5 SOW_q Model Definition
159(2)
11.5.1 SOW_q Model Results
160(1)
11.6 Summary
161(8)
Appendix 11.A Six Steps
162(2)
Appendix 11.B Seven Steps
164(3)
References
167(2)
12 Ordinary Regression: The Workhorse of Profit Modeling
169(20)
12.1 Introduction
169(1)
12.2 Ordinary Regression Model
169(3)
12.2.1 Illustration
170(1)
12.2.2 Scoring an OLS Profit Model
171(1)
12.3 Mini Case Study
172(8)
12.3.1 Straight Data for Mini Case Study
172(2)
12.3.1.1 Reexpressing INCOME
174(1)
12.3.1.2 Reexpressing AGE
175(2)
12.3.2 Plot of Smooth Predicted versus Actual
177(1)
12.3.3 Assessing the Importance of Variables
178(1)
12.3.3.1 Defining the F Statistic and R-Squared
179(1)
12.3.3.2 Importance of a Single Variable
179(1)
12.3.3.3 Importance of a Subset of Variables
179(1)
12.3.3.4 Comparing the Importance of Different Subsets of Variables
180(1)
12.4 Important Variables for Mini Case Study
180(2)
12.4.1 Relative Importance of the Variables
181(1)
12.4.2 Selecting the Best Subset
181(1)
12.5 Best Subset of Variables for Case Study
182(3)
12.5.1 PROFIT Model with gINCOME and AGE
183(2)
12.5.2 Best PROFIT Model
185(1)
12.6 Suppressor Variable AGE
185(1)
12.7 Summary
186(3)
References
187(2)
13 Variable Selection Methods in Regression: Ignorable Problem, Notable Solution
189(14)
13.1 Introduction
189(1)
13.2 Background
189(3)
13.3 Frequently Used Variable Selection Methods
192(1)
13.4 Weakness in the Stepwise
193(1)
13.5 Enhanced Variable Selection Method
194(2)
13.6 Exploratory Data Analysis
196(4)
13.7 Summary
200(3)
References
200(3)
14 CHAID for Interpreting a Logistic Regression Model
203(16)
14.1 Introduction
203(1)
14.2 Logistic Regression Model
203(1)
14.3 Database Marketing Response Model Case Study
204(1)
14.3.1 Odds Ratio
205(1)
14.4 CHAID
205(3)
14.4.1 Proposed CHAID-Based Method
206(2)
14.5 Multivariable CHAID Trees
208(2)
14.6 CHAID Market Segmentation
210(3)
14.7 CHAID Tree Graphs
213(3)
14.8 Summary
216(3)
15 The Importance of the Regression Coefficient
219(10)
15.1 Introduction
219(1)
15.2 The Ordinary Regression Model
219(1)
15.3 Four Questions
220(1)
15.4 Important Predictor Variables
220(1)
15.5 P-Values and Big Data
221(1)
15.6 Returning to Question 1
222(1)
15.7 Effect of Predictor Variable on Prediction
222(1)
15.8 The Caveat
223(2)
15.9 Returning to Question 2
225(1)
15.10 Ranking Predictor Variables by Effect on Prediction
225(1)
15.11 Returning to Question 3
226(1)
15.12 Returning to Question 4
227(1)
15.13 Summary
227(2)
References
228(1)
16 The Average Correlation: A Statistical Data Mining Measure for Assessment of Competing Predictive Models and the Importance of the Predictor Variables
229(10)
16.1 Introduction
229(1)
16.2 Background
229(2)
16.3 Illustration of the Difference between Reliability and Validity
231(1)
16.4 Illustration of the Relationship between Reliability and Validity
231(1)
16.5 The Average Correlation
232(5)
16.5.1 Illustration of the Average Correlation with an LTV5 Model
232(4)
16.5.2 Continuing with the Illustration of the Average Correlation with an LTV5 Model
236(1)
16.5.3 Continuing with the Illustration with a Competing LTV5 Model
236(1)
16.5.3.1 The Importance of the Predictor Variables
237(1)
16.6 Summary
237(2)
Reference
237(2)
17 CHAID for Specifying a Model with Interaction Variables
239(12)
17.1 Introduction
239(1)
17.2 Interaction Variables
239(1)
17.3 Strategy for Modeling with Interaction Variables
240(1)
17.4 Strategy Based on the Notion of a Special Point
240(1)
17.5 Example of a Response Model with an Interaction Variable
241(1)
17.6 CHAID for Uncovering Relationships
242(1)
17.7 Illustration of CHAID for Specifying a Model
243(3)
17.8 An Exploratory Look
246(1)
17.9 Database Implication
247(1)
17.10 Summary
248(3)
References
249(2)
18 Market Segmentation Classification Modeling with Logistic Regression
251(14)
18.1 Introduction
251(1)
18.2 Binary Logistic Regression
251(1)
18.2.1 Necessary Notation
252(1)
18.3 Polychotomous Logistic Regression Model
252(1)
18.4 Model Building with PLR
253(1)
18.5 Market Segmentation Classification Model
254(9)
18.5.1 Survey of Cellular Phone Users
254(1)
18.5.2 CHAID Analysis
255(3)
18.5.3 CHAID Tree Graphs
258(3)
18.5.4 Market Segmentation Classification Model
261(2)
18.6 Summary
263(2)
19 Market Segmentation Based on Time-Series Data Using Latent Class Analysis
265(22)
19.1 Introduction
265(1)
19.2 Background
265(5)
19.2.1 K-Means Clustering
265(1)
19.2.2 PCA
266(1)
19.2.3 FA
266(1)
19.2.3.1 FA Model
267(1)
19.2.3.2 FA Model Estimation
267(1)
19.2.3.3 FA versus OLS Graphical Depiction
268(1)
19.2.4 LCA versus FA Graphical Depiction
268(2)
19.3 LCA
270(2)
19.3.1 LCA of Universal and Particular Study
270(1)
19.3.1.1 Discussion of LCA Output
270(1)
19.3.1.2 Discussion of Posterior Probability
271(1)
19.4 LCA versus k-Means Clustering
272(2)
19.5 LCA Market Segmentation Model Based on Time-Series Data
274(8)
19.5.1 Objective
274(2)
19.5.2 Best LCA Models
276(2)
19.5.2.1 Cluster Sizes and Conditional Probabilities/Means
278(3)
19.5.2.2 Indicator-Level Posterior Probabilities
281(1)
19.6 Summary
282(5)
Appendix 19.A Creating Trend3 for UNITS
282(2)
Appendix 19.B POS-ZER-NEG Creating Trend4
284(1)
References
285(2)
20 Market Segmentation: An Easy Way to Understand the Segments
287(6)
20.1 Introduction
287(1)
20.2 Background
287(1)
20.3 Illustration
288(1)
20.4 Understanding the Segments
289(1)
20.5 Summary
290(3)
Appendix 20.A Dataset SAMPLE
290(1)
Appendix 20.B Segmentor-Means
291(1)
Appendix 20.C Indexed Profiles
291(1)
References
292(1)
21 The Statistical Regression Model: An Easy Way to Understand the Model
293(14)
21.1 Introduction
293(1)
21.2 Background
293(1)
21.3 EZ-Method Applied to the LR Model
294(2)
21.4 Discussion of the LR EZ-Method Illustration
296(3)
21.5 Summary
299(8)
Appendix 21.A M65-Spread Base Means X10-X14
299(2)
Appendix 21.B Create Ten Datasets for Each Decile
301(1)
Appendix 21.C Indexed Profiles of Deciles
302(5)
22 CHAID as a Method for Filling in Missing Values
307(16)
22.1 Introduction
307(1)
22.2 Introduction to the Problem of Missing Data
307(2)
22.3 Missing Data Assumption
309(1)
22.4 CHAID Imputation
310(1)
22.5 Illustration
311(5)
22.5.1 CHAID Mean-Value Imputation for a Continuous Variable
312(1)
22.5.2 Many Mean-Value CHAID Imputations for a Continuous Variable
313(1)
22.5.3 Regression Tree Imputation for LIFE_DOL
314(2)
22.6 CHAID Most Likely Category Imputation for a Categorical Variable
316(4)
22.6.1 CHAID Most Likely Category Imputation for GENDER
316(2)
22.6.2 Classification Tree Imputation for GENDER
318(2)
22.7 Summary
320(3)
References
321(2)
23 Model Building with Big Complete and Incomplete Data
323(12)
23.1 Introduction
323(1)
23.2 Background
323(1)
23.3 The CCA-PCA Method: Illustration Details
324(2)
23.3.1 Determining the Complete and Incomplete Datasets
324(2)
23.4 Building the RESPONSE Model with Complete (CCA) Dataset
326(2)
23.4.1 CCA RESPONSE Model Results
327(1)
23.5 Building the RESPONSE Model with Incomplete (ICA) Dataset
328(1)
23.5.1 PCA on BICA Data
329(1)
23.6 Building the RESPONSE Model on PCA-BICA Data
329(3)
23.6.1 PCA-BICA RESPONSE Model Results
330(1)
23.6.2 Combined CCA and PCA-BICA RESPONSE Model Results
331(1)
23.7 Summary
332(3)
Appendix 23.A NMISS
333(1)
Appendix 23.B Testing CCA Samsizes
333(1)
Appendix 23.C CCA-CIA Datasets
333(1)
Appendix 23.D Ones and Zeros
333(1)
Reference
334(1)
24 Art, Science, Numbers, and Poetry
335(6)
24.1 Introduction
335(1)
24.2 Zeros and Ones
336(1)
24.3 Power of Thought
336(2)
24.4 The Statistical Golden Rule: Measuring the Art and Science of Statistical Practice
338(2)
24.4.1 Background
338(1)
24.4.1.1 The Statistical Golden Rule
339(1)
24.5 Summary
340(1)
Reference
340(1)
25 Identifying Your Best Customers: Descriptive, Predictive, and Look-Alike Profiling
341(14)
25.1 Introduction
341(1)
25.2 Some Definitions
341(1)
25.3 Illustration of a Flawed Targeting Effort
342(1)
25.4 Well-Defined Targeting Effort
343(2)
25.5 Predictive Profiles
345(3)
25.6 Continuous Trees
348(2)
25.7 Look-Alike Profiling
350(3)
25.8 Look-Alike Tree Characteristics
353(1)
25.9 Summary
353(2)
26 Assessment of Marketing Models
355(12)
26.1 Introduction
355(1)
26.2 Accuracy for Response Model
355(1)
26.3 Accuracy for Profit Model
356(2)
26.4 Decile Analysis and Cum Lift for Response Model
358(1)
26.5 Decile Analysis and Cum Lift for Profit Model
359(1)
26.6 Precision for Response Model
360(2)
26.7 Precision for Profit Model
362(1)
26.7.1 Construction of SWMAD
363(1)
26.8 Separability for Response and Profit Models
363(1)
26.9 Guidelines for Using Cum Lift, HL/SWMAD, and CV
364(1)
26.10 Summary
364(3)
27 Decile Analysis: Perspective and Performance
367(20)
27.1 Introduction
367(1)
27.2 Background
367(4)
27.2.1 Illustration
369(1)
27.2.1.1 Discussion of Classification Table of RESPONSE Model
370(1)
27.3 Assessing Performance: RESPONSE Model versus Chance Model
371(1)
27.4 Assessing Performance: The Decile Analysis
372(5)
27.4.1 The RESPONSE Decile Analysis
372(5)
27.5 Summary
377(10)
Appendix 27.A Incremental Gain in Accuracy: Model versus Chance
378(1)
Appendix 27.B Incremental Gain in Precision: Model versus Chance
379(1)
Appendix 27.C RESPONSE Model Decile PROB_est Values
380(2)
Appendix 27.D 2×2 Tables by Decile
382(3)
References
385(2)
28 Net T-C Lift Model: Assessing the Net Effects of Test and Control Campaigns
387(26)
28.1 Introduction
387(1)
28.2 Background
387(2)
28.3 Building TEST and CONTROL Response Models
389(5)
28.3.1 Building TEST Response Model
390(2)
28.3.2 Building CONTROL Response Model
392(2)
28.4 Net T-C Lift Model
394(4)
28.4.1 Building the Net T-C Lift Model
395(1)
28.4.1.1 Discussion of the Net T-C Lift Model
395(2)
28.4.1.2 Discussion of Equal-Group Sizes Decile of the Net T-C Lift Model
397(1)
28.5 Summary
398(15)
Appendix 28.A TEST Logistic with Xs
400(2)
Appendix 28.B CONTROL Logistic with Xs
402(3)
Appendix 28.C Merge Score
405(1)
Appendix 28.D NET T-C Decile Analysis
406(4)
References
410(3)
29 Bootstrapping in Marketing: A New Approach for Validating Models
413(16)
29.1 Introduction
413(1)
29.2 Traditional Model Validation
413(1)
29.3 Illustration
414(1)
29.4 Three Questions
415(1)
29.5 The Bootstrap Method
416(1)
29.5.1 Traditional Construction of Confidence Intervals
416(1)
29.6 How to Bootstrap
417(2)
29.6.1 Simple Illustration
418(1)
29.7 Bootstrap Decile Analysis Validation
419(1)
29.8 Another Question
420(1)
29.9 Bootstrap Assessment of Model Implementation Performance
421(5)
29.9.1 Illustration
424(2)
29.10 Bootstrap Assessment of Model Efficiency
426(2)
29.11 Summary
428(1)
References
428(1)
30 Validating the Logistic Regression Model: Try Bootstrapping
429(2)
30.1 Introduction
429(1)
30.2 Logistic Regression Model
429(1)
30.3 The Bootstrap Validation Method
429(1)
30.4 Summary
430(1)
Reference
430(1)
31 Visualization of Marketing Models: Data Mining to Uncover Innards of a Model
431(22)
31.1 Introduction
431(1)
31.2 Brief History of the Graph
431(1)
31.3 Star Graph Basics
432(2)
31.3.1 Illustration
433(1)
31.4 Star Graphs for Single Variables
434(1)
31.5 Star Graphs for Many Variables Considered Jointly
435(2)
31.6 Profile Curves Method
437(1)
31.6.1 Profile Curves Basics
437(1)
31.6.2 Profile Analysis
438(1)
31.7 Illustration
438(6)
31.7.1 Profile Curves for RESPONSE Model
440(2)
31.7.2 Decile Group Profile Curves
442(2)
31.8 Summary
444(9)
Appendix 31.A Star Graphs for Each Demographic Variable about the Deciles
445(2)
Appendix 31.B Star Graphs for Each Decile about the Demographic Variables
447(3)
Appendix 31.C Profile Curves: All Deciles
450(2)
References
452(1)
32 The Predictive Contribution Coefficient: A Measure of Predictive Importance
453(12)
32.1 Introduction
453(1)
32.2 Background
453(2)
32.3 Illustration of Decision Rule
455(2)
32.4 Predictive Contribution Coefficient
457(1)
32.5 Calculation of Predictive Contribution Coefficient
458(1)
32.6 Extra-Illustration of Predictive Contribution Coefficient
459(3)
32.7 Summary
462(3)
Reference
463(2)
33 Regression Modeling Involves Art, Science, and Poetry, Too
465(6)
33.1 Introduction
465(1)
33.2 Shakespearean Modelogue
465(1)
33.3 Interpretation of the Shakespearean Modelogue
466(3)
33.4 Summary
469(2)
References
469(2)
34 Opening the Dataset: A Twelve-Step Program for Dataholics
471(6)
34.1 Introduction
471(1)
34.2 Background
471(1)
34.3 Stepping
471(2)
34.4 Brush Marking
473(1)
34.5 Summary
474(3)
Appendix 34.A Dataset IN
474(1)
Appendix 34.B SamsizePlus
475(1)
Appendix 34.C Copy-Pasteable
475(1)
Appendix 34.D Missings
475(1)
References
476(1)
35 Genetic and Statistic Regression Models: A Comparison
477(10)
35.1 Introduction
477(1)
35.2 Background
477(1)
35.3 Objective
478(1)
35.4 The GenIQ Model, the Genetic Logistic Regression
478(2)
35.4.1 Illustration of "Filling Up the Upper Deciles"
479(1)
35.5 A Pithy Summary of the Development of Genetic Programming
480(2)
35.6 The GenIQ Model: A Brief Review of Its Objective and Salient Features
482(1)
35.6.1 The GenIQ Model Requires Selection of Variables and Function: An Extra Burden?
482(1)
35.7 The GenIQ Model: How It Works
483(3)
35.7.1 The GenIQ Model Maximizes the Decile Table
485(1)
35.8 Summary
486(1)
References
486(1)
36 Data Reuse: A Powerful Data Mining Effect of the GenIQ Model
487(8)
36.1 Introduction
487(1)
36.2 Data Reuse
487(1)
36.3 Illustration of Data Reuse
488(3)
36.3.1 The GenIQ Profit Model
488(1)
36.3.2 Data-Reused Variables
489(1)
36.3.3 Data-Reused Variables GenIQvar_1 and GenIQvar_2
490(1)
36.4 Modified Data Reuse: A GenIQ-Enhanced Regression Model
491(2)
36.4.1 Illustration of a GenIQ-Enhanced LRM
491(2)
36.5 Summary
493(2)
37 A Data Mining Method for Moderating Outliers Instead of Discarding Them
495(6)
37.1 Introduction
495(1)
37.2 Background
495(1)
37.3 Moderating Outliers Instead of Discarding Them
496(3)
37.3.1 Illustration of Moderating Outliers Instead of Discarding Them
496(2)
37.3.2 The GenIQ Model for Moderating the Outlier
498(1)
37.4 Summary
499(2)
Reference
499(2)
38 Overfitting: Old Problem, New Solution
501(8)
38.1 Introduction
501(1)
38.2 Background
501(2)
38.2.1 Idiomatic Definition of Overfitting to Help Remember the Concept
502(1)
38.3 The GenIQ Model Solution to Overfitting
503(5)
38.3.1 RANDOM_SPLIT GenIQ Model
505(1)
38.3.2 RANDOM_SPLIT GenIQ Model Decile Analysis
505(2)
38.3.3 Quasi N-tile Analysis
507(1)
38.4 Summary
508(1)
39 The Importance of Straight Data: Revisited
509(4)
39.1 Introduction
509(1)
39.2 Restatement of Why It Is Important to Straighten Data
509(1)
39.3 Restatement of Section 12.3.1.1 "Reexpressing INCOME"
510(1)
39.3.1 Complete Exposition of Reexpressing INCOME
510(1)
39.3.1.1 The GenIQ Model Detail of the gINCOME Structure
511(1)
39.4 Restatement of Section 5.6 "Data Mining the Relationship of (xx3, yy3)"
511(1)
39.4.1 The GenIQ Model Detail of the GenIQvar(yy3) Structure
511(1)
39.5 Summary
512(1)
40 The GenIQ Model: Its Definition and an Application
513(16)
40.1 Introduction
513(1)
40.2 What Is Optimization?
513(1)
40.3 What Is Genetic Modeling?
514(1)
40.4 Genetic Modeling: An Illustration
515(4)
40.4.1 Reproduction
517(1)
40.4.2 Crossover
518(1)
40.4.3 Mutation
518(1)
40.5 Parameters for Controlling a Genetic Model Run
519(1)
40.6 Genetic Modeling: Strengths and Limitations
519(1)
40.7 Goals of Marketing Modeling
520(1)
40.8 The GenIQ Response Model
520(1)
40.9 The GenIQ Profit Model
521(1)
40.10 Case Study: Response Model
522(2)
40.11 Case Study: Profit Model
524(3)
40.12 Summary
527(2)
Reference
527(2)
41 Finding the Best Variables for Marketing Models
529(18)
41.1 Introduction
529(1)
41.2 Background
529(2)
41.3 Weakness in the Variable Selection Methods
531(1)
41.4 Goals of Modeling in Marketing
532(1)
41.5 Variable Selection with GenIQ
533(9)
41.5.1 GenIQ Modeling
535(2)
41.5.2 GenIQ Structure Identification
537(2)
41.5.3 GenIQ Variable Selection
539(3)
41.6 Nonlinear Alternative to Logistic Regression Model
542(3)
41.7 Summary
545(2)
References
546(1)
42 Interpretation of Coefficient-Free Models
547(22)
42.1 Introduction
547(1)
42.2 The Linear Regression Coefficient
547(2)
42.2.1 Illustration for the Simple Ordinary Regression Model
548(1)
42.2.2 Illustration for the Simple Logistic Regression Model
548(1)
42.3 The Quasi-Regression Coefficient for Simple Regression Models
549(4)
42.3.1 Illustration of Quasi-RC for the Simple Ordinary Regression Model
549(1)
42.3.2 Illustration of Quasi-RC for the Simple Logistic Regression Model
550(1)
42.3.3 Illustration of Quasi-RC for Nonlinear Predictions
551(2)
42.4 Partial Quasi-RC for the Everymodel
553(7)
42.4.1 Calculating the Partial Quasi-RC for the Everymodel
554(1)
42.4.2 Illustration for the Multiple Logistic Regression Model
555(5)
42.5 Quasi-RC for a Coefficient-Free Model
560(7)
42.5.1 Illustration of Quasi-RC for a Coefficient-Free Model
560(7)
42.6 Summary
567(2)
43 Text Mining: Primer, Illustration, and TXTDM Software
569(24)
43.1 Introduction
569(1)
43.2 Background
569(2)
43.2.1 Text Mining Software: Free versus Commercial versus TXTDM
570(1)
43.3 Primer of Text Mining
571(2)
43.4 Statistics of the Words
573(1)
43.5 The Binary Dataset of Words in Documents
574(1)
43.6 Illustration of TXTDM Text Mining
575(9)
43.7 Analysis of the Text-Mined GenIQ_FAVORED Model
584(1)
43.7.1 Text-Based Profiling of Respondents Who Prefer GenIQ
584(1)
43.7.2 Text-Based Profiling of Respondents Who Prefer OLS-Logistic
585(1)
43.8 Weighted TXTDM
585(1)
43.9 Clustering Documents
586(7)
43.9.1 Clustering GenIQ Survey Documents
586(6)
43.9.1.1 Conclusion of Clustering GenIQ Survey Documents
592(1)
43.10 Summary
593(1)
Appendix
593(52)
Appendix 43.A Loading Corpus TEXT Dataset
594(1)
Appendix 43.B Intermediate Step Creating Binary Words
594(1)
Appendix 43.C Creating the Final Binary Words
595(1)
Appendix 43.D Calculate Statistics TF, DF, NUM_DOCS, and N(=Num of Words)
596(1)
Appendix 43.E Append GenIQ_FAVORED to WORDS Dataset
597(1)
Appendix 43.F Logistic GenIQ_FAVORED Model
598(1)
Appendix 43.G Average Correlation among Words
599(1)
Appendix 43.H Creating TF-IDF
600(2)
Appendix 43.I WORD_TF-IDF Weights by Concat of WORDS and TF-IDF
602(2)
Appendix 43.J WORD_RESP WORD_TF-IDF RESP
604(1)
Appendix 43.K Stemming
604(1)
Appendix 43.L WORD Times TF-IDF
604(1)
Appendix 43.M Dataset Weighted with Words for Profile
605(1)
Appendix 43.N VARCLUS for Two-Class Solution
606(1)
Appendix 43.O Scoring VARCLUS for Two-Cluster Solution
606(1)
Appendix 43.P Direction of Words with Its Cluster 1
607(2)
Appendix 43.Q Performance of GenIQ Model versus Chance Model
609(1)
Appendix 43.R Performance of Liberal-Cluster Model versus Chance Model
609(2)
References
610(1)
44 Some of My Favorite Statistical Subroutines
611(34)
44.1 List of Subroutines
611(1)
44.2 Smoothplots (Mean and Median) of Chapter 5 - X1 versus X2
611(4)
44.3 Smoothplots of Chapter 10 - Logit and Probability
615(3)
44.4 Average Correlation of Chapter 16 - Among Var1 Var2 Var3
618(2)
44.5 Bootstrapped Decile Analysis of Chapter 29 - Using Data from Table 23.4
620(7)
44.6 H-Spread Common Region of Chapter 42
627(3)
44.7 Favorite - Proc Corr with Option Rank, Vertical Output
630(1)
44.8 Favorite - Decile Analysis - Response
631(4)
44.9 Favorite - Decile Analysis - Profit
635(3)
44.10 Favorite - Smoothing Time-Series Data (Running Medians of Three)
638(5)
44.11 Favorite - First Cut Is the Deepest - Among Variables with Large Skew Values
643(2)
Index 645
Bruce Ratner, The Significant Statistician™, is President and Founder of DM STAT-1 Consulting, the ensample for Statistical Modeling, Analysis and Data Mining, and Machine-learning Data Mining in the DM Space. DM STAT-1 specializes in all standard statistical techniques, and methods using machine-learning/statistics algorithms, such as its patented GenIQ Model, to achieve its clients' goals across industries including Direct and Database Marketing, Banking, Insurance, Finance, Retail, Telecommunications, Healthcare, Pharmaceutical, Publication & Circulation, Mass & Direct Advertising, Catalog Marketing, e-Commerce, Web-mining, B2B, Human Capital Management, Risk Management, and Nonprofit Fundraising. Bruce holds a doctorate in mathematics and statistics, with a concentration in multivariate statistics and response model simulation. His research interests include developing hybrid-modeling techniques, which combine traditional statistics and machine learning methods. He holds a patent for a unique application in solving the two-group classification problem with genetic programming.