Discovering Knowledge in Data: An Introduction to Data Mining, 2nd Edition [Hardcover]

Chantal D. Larose (Eastern Connecticut State University (ECSU)), Daniel T. Larose (Central Connecticut State University)
"This is a new edition of a highly praised, successful reference on data mining, now more important than ever due to the growth of the field and wide range of applications. This edition features new chapters on multivariate statistical analysis, coveringanalysis of variance and chi-square procedures; cost-benefit analyses; and time-series data analysis. There is also extensive coverage of the R statistical programming language. Graduate and advanced undergraduate students of computer science and statistics, managers/CEOs/CFOs, marketing executives, market researchers and analysts, sales analysts, and medical professionals will want this comprehensive reference"--

To help alleviate the shortage of trained and skilled data analysts in business, statisticians Larose and Larose explain the models and techniques used to uncover hidden nuggets of information, offer insight into how data mining algorithms really work, and allow readers to experience data mining on large data sets. In most chapters, a section provides the actual R code needed to obtain the results shown in the text, along with screenshots of some of the output in RStudio. The topics include data preprocessing, multivariate statistics, decision trees, Kohonen networks, and imputation of missing data. Annotation ©2014 Ringgold, Inc., Portland, OR (protoview.com)
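The book's R Zone sections are not reproduced in this listing, but as a rough illustration of the kind of base-R code they describe, here is a minimal sketch (not taken from the book; the data vector is hypothetical) of two of the Chapter 2 transformations, min-max normalization and z-score standardization:

    # Illustrative sketch only: hypothetical data, not an example from the book
    x <- c(12, 35, 7, 48, 23)

    # Min-max normalization: rescales x to the interval [0, 1]
    minmax <- (x - min(x)) / (max(x) - min(x))

    # Z-score standardization: centers x at 0 with standard deviation 1
    zscore <- (x - mean(x)) / sd(x)

    minmax
    zscore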

  • The second edition of a highly praised, successful reference on data mining, with thorough coverage of big data applications, predictive analytics, and statistical analysis.
  • Includes new chapters on Multivariate Statistics, Preparing to Model the Data, and Imputation of Missing Data, as well as an appendix on Data Summarization and Visualization
  • Offers extensive coverage of the R statistical programming language
  • Contains 280 end-of-chapter exercises
  • Includes a companion website with further resources for all readers, plus PowerPoint slides, a solutions manual, and suggested projects for instructors who adopt the book
Preface xi
Chapter 1 An Introduction To Data Mining 1(15)
1.1 What is Data Mining? 1(1)
1.2 Wanted: Data Miners 2(1)
1.3 The Need for Human Direction of Data Mining 3(1)
1.4 The Cross-Industry Standard Process for Data Mining 4(2)
1.4.1 CRISP-DM: The Six Phases 5(1)
1.5 Fallacies of Data Mining 6(2)
1.6 What Tasks Can Data Mining Accomplish? 8(8)
1.6.1 Description 8(1)
1.6.2 Estimation 8(2)
1.6.3 Prediction 10(1)
1.6.4 Classification 10(2)
1.6.5 Clustering 12(2)
1.6.6 Association 14(1)
References 14(1)
Exercises 15(1)
Chapter 2 Data Preprocessing 16(35)
2.1 Why do We Need to Preprocess the Data? 17(1)
2.2 Data Cleaning 17(2)
2.3 Handling Missing Data 19(3)
2.4 Identifying Misclassifications 22(1)
2.5 Graphical Methods for Identifying Outliers 22(1)
2.6 Measures of Center and Spread 23(3)
2.7 Data Transformation 26(1)
2.8 Min-Max Normalization 26(1)
2.9 Z-Score Standardization 27(1)
2.10 Decimal Scaling 28(1)
2.11 Transformations to Achieve Normality 28(7)
2.12 Numerical Methods for Identifying Outliers 35(1)
2.13 Flag Variables 36(1)
2.14 Transforming Categorical Variables into Numerical Variables 37(1)
2.15 Binning Numerical Variables 38(1)
2.16 Reclassifying Categorical Variables 39(1)
2.17 Adding an Index Field 39(1)
2.18 Removing Variables that are Not Useful 39(1)
2.19 Variables that Should Probably Not Be Removed 40(1)
2.20 Removal of Duplicate Records 41(1)
2.21 A Word About ID Fields 41(10)
The R Zone 42(6)
References 48(1)
Exercises 48(2)
Hands-On Analysis 50(1)
Chapter 3 Exploratory Data Analysis 51(40)
3.1 Hypothesis Testing Versus Exploratory Data Analysis 51(1)
3.2 Getting to Know the Data Set 52(3)
3.3 Exploring Categorical Variables 55(7)
3.4 Exploring Numeric Variables 62(7)
3.5 Exploring Multivariate Relationships 69(2)
3.6 Selecting Interesting Subsets of the Data for Further Investigation 71(1)
3.7 Using EDA to Uncover Anomalous Fields 71(1)
3.8 Binning Based on Predictive Value 72(2)
3.9 Deriving New Variables: Flag Variables 74(3)
3.10 Deriving New Variables: Numerical Variables 77(1)
3.11 Using EDA to Investigate Correlated Predictor Variables 77(3)
3.12 Summary 80(11)
The R Zone 82(6)
Reference 88(1)
Exercises 88(1)
Hands-On Analysis 89(2)
Chapter 4 Univariate Statistical Analysis 91(18)
4.1 Data Mining Tasks in Discovering Knowledge in Data 91(1)
4.2 Statistical Approaches to Estimation and Prediction 92(1)
4.3 Statistical Inference 93(1)
4.4 How Confident are We in Our Estimates? 94(1)
4.5 Confidence Interval Estimation of the Mean 95(2)
4.6 How to Reduce the Margin of Error 97(1)
4.7 Confidence Interval Estimation of the Proportion 98(1)
4.8 Hypothesis Testing for the Mean 99(2)
4.9 Assessing the Strength of Evidence Against the Null Hypothesis 101(1)
4.10 Using Confidence Intervals to Perform Hypothesis Tests 102(2)
4.11 Hypothesis Testing for the Proportion 104(5)
The R Zone 105(1)
Reference 106(1)
Exercises 106(3)
Chapter 5 Multivariate Statistics 109(29)
5.1 Two-Sample t-Test for Difference in Means 110(1)
5.2 Two-Sample Z-Test for Difference in Proportions 111(1)
5.3 Test for Homogeneity of Proportions 112(2)
5.4 Chi-Square Test for Goodness of Fit of Multinomial Data 114(1)
5.5 Analysis of Variance 115(3)
5.6 Regression Analysis 118(4)
5.7 Hypothesis Testing in Regression 122(1)
5.8 Measuring the Quality of a Regression Model 123(1)
5.9 Dangers of Extrapolation 123(2)
5.10 Confidence Intervals for the Mean Value of y Given x 125(1)
5.11 Prediction Intervals for a Randomly Chosen Value of y Given x 125(1)
5.12 Multiple Regression 126(1)
5.13 Verifying Model Assumptions 127(11)
The R Zone 131(4)
Reference 135(1)
Exercises 135(1)
Hands-On Analysis 136(2)
Chapter 6 Preparing To Model The Data 138(11)
6.1 Supervised Versus Unsupervised Methods 138(1)
6.2 Statistical Methodology and Data Mining Methodology 139(1)
6.3 Cross-Validation 139(2)
6.4 Overfitting 141(1)
6.5 Bias-Variance Trade-Off 142(2)
6.6 Balancing the Training Data Set 144(1)
6.7 Establishing Baseline Performance 145(4)
The R Zone 146(1)
Reference 147(1)
Exercises 147(2)
Chapter 7 K-Nearest Neighbor Algorithm 149(16)
7.1 Classification Task 149(1)
7.2 k-Nearest Neighbor Algorithm 150(3)
7.3 Distance Function 153(3)
7.4 Combination Function 156(2)
7.4.1 Simple Unweighted Voting 156(1)
7.4.2 Weighted Voting 156(2)
7.5 Quantifying Attribute Relevance: Stretching the Axes 158(1)
7.6 Database Considerations 158(1)
7.7 k-Nearest Neighbor Algorithm for Estimation and Prediction 159(1)
7.8 Choosing k 160(1)
7.9 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler 160(5)
The R Zone 162(1)
Exercises 163(1)
Hands-On Analysis 164(1)
Chapter 8 Decision Trees 165(22)
8.1 What is a Decision Tree? 165(2)
8.2 Requirements for Using Decision Trees 167(1)
8.3 Classification and Regression Trees 168(6)
8.4 C4.5 Algorithm 174(5)
8.5 Decision Rules 179(1)
8.6 Comparison of the C5.0 and CART Algorithms Applied to Real Data 180(7)
The R Zone 183(1)
References 184(1)
Exercises 185(1)
Hands-On Analysis 185(2)
Chapter 9 Neural Networks 187(22)
9.1 Input and Output Encoding 188(2)
9.2 Neural Networks for Estimation and Prediction 190(1)
9.3 Simple Example of a Neural Network 191(2)
9.4 Sigmoid Activation Function 193(1)
9.5 Back-Propagation 194(4)
9.5.1 Gradient Descent Method 194(1)
9.5.2 Back-Propagation Rules 195(1)
9.5.3 Example of Back-Propagation 196(2)
9.6 Termination Criteria 198(1)
9.7 Learning Rate 198(1)
9.8 Momentum Term 199(2)
9.9 Sensitivity Analysis 201(1)
9.10 Application of Neural Network Modeling 202(7)
The R Zone 204(3)
References 207(1)
Exercises 207(1)
Hands-On Analysis 207(2)
Chapter 10 Hierarchical And K-Means Clustering 209(19)
10.1 The Clustering Task 209(3)
10.2 Hierarchical Clustering Methods 212(1)
10.3 Single-Linkage Clustering 213(1)
10.4 Complete-Linkage Clustering 214(1)
10.5 k-Means Clustering 215(1)
10.6 Example of k-Means Clustering at Work 216(3)
10.7 Behavior of MSB, MSE, and Pseudo-F as the k-Means Algorithm Proceeds 219(1)
10.8 Application of k-Means Clustering Using SAS Enterprise Miner 220(3)
10.9 Using Cluster Membership to Predict Churn 223(5)
The R Zone 224(2)
References 226(1)
Exercises 226(1)
Hands-On Analysis 226(2)
Chapter 11 Kohonen Networks 228(19)
11.1 Self-Organizing Maps 228(2)
11.2 Kohonen Networks 230(1)
11.2.1 Kohonen Networks Algorithm 231(1)
11.3 Example of a Kohonen Network Study 231(4)
11.4 Cluster Validity 235(1)
11.5 Application of Clustering Using Kohonen Networks 235(2)
11.6 Interpreting the Clusters 237(5)
11.6.1 Cluster Profiles 240(2)
11.7 Using Cluster Membership as Input to Downstream Data Mining Models 242(5)
The R Zone 243(2)
References 245(1)
Exercises 245(1)
Hands-On Analysis 245(2)
Chapter 12 Association Rules 247(19)
12.1 Affinity Analysis and Market Basket Analysis 247(2)
12.1.1 Data Representation for Market Basket Analysis 248(1)
12.2 Support, Confidence, Frequent Itemsets, and the A Priori Property 249(2)
12.3 How Does the A Priori Algorithm Work? 251(4)
12.3.1 Generating Frequent Itemsets 251(2)
12.3.2 Generating Association Rules 253(2)
12.4 Extension from Flag Data to General Categorical Data 255(1)
12.5 Information-Theoretic Approach: Generalized Rule Induction Method 256(2)
12.5.1 J-Measure 257(1)
12.6 Association Rules are Easy to do Badly 258(1)
12.7 How Can We Measure the Usefulness of Association Rules? 259(1)
12.8 Do Association Rules Represent Supervised or Unsupervised Learning? 260(1)
12.9 Local Patterns Versus Global Models 261(5)
The R Zone 262(1)
References 263(1)
Exercises 263(1)
Hands-On Analysis 264(2)
Chapter 13 Imputation Of Missing Data 266(11)
13.1 Need for Imputation of Missing Data 266(1)
13.2 Imputation of Missing Data: Continuous Variables 267(3)
13.3 Standard Error of the Imputation 270(1)
13.4 Imputation of Missing Data: Categorical Variables 271(1)
13.5 Handling Patterns in Missingness 272(5)
The R Zone 273(3)
Reference 276(1)
Exercises 276(1)
Hands-On Analysis 276(1)
Chapter 14 Model Evaluation Techniques 277(17)
14.1 Model Evaluation Techniques for the Description Task 278(1)
14.2 Model Evaluation Techniques for the Estimation and Prediction Tasks 278(2)
14.3 Model Evaluation Techniques for the Classification Task 280(1)
14.4 Error Rate, False Positives, and False Negatives 280(3)
14.5 Sensitivity and Specificity 283(1)
14.6 Misclassification Cost Adjustment to Reflect Real-World Concerns 284(1)
14.7 Decision Cost/Benefit Analysis 285(1)
14.8 Lift Charts and Gains Charts 286(3)
14.9 Interweaving Model Evaluation with Model Building 289(1)
14.10 Confluence of Results: Applying a Suite of Models 290(4)
The R Zone 291(1)
Reference 291(1)
Exercises 291(1)
Hands-On Analysis 291(3)
Appendix: Data Summarization And Visualization 294(15)
Index 309
Daniel T. Larose earned his PhD in Statistics at the University of Connecticut. He is Professor of Mathematical Sciences and Director of the Data Mining programs at Central Connecticut State University. His consulting clients have included Microsoft, Forbes Magazine, the CIT Group, KPMG International, Computer Associates, and Deloitte, Inc. This is Larose's fourth book for Wiley.

Chantal D. Larose is an Assistant Professor of Statistics & Data Science at Eastern Connecticut State University (ECSU).  She has co-authored three books on data science and predictive analytics.  She helped develop data science programs at ECSU and at SUNY New Paltz.  She received her PhD in Statistics from the University of Connecticut, Storrs in 2015 (dissertation title: Model-based Clustering of Incomplete Data).