E-book: Data Science Using Python and R [Wiley Online]

Chantal D. Larose (Eastern Connecticut State University (ECSU)), Daniel T. Larose (Central Connecticut State University)
  • Wiley Online
  • Price: 121.54 €*
  * price that provides access for an unlimited number of simultaneous users for an unlimited period

Learn data science by doing data science! 

Data Science Using Python and R will get you plugged into the world’s two most widespread open-source platforms for data science: Python and R.

Data science is hot. Bloomberg called data scientist “the hottest job in America.” Python and R are the top two open-source data science tools in the world. In Data Science Using Python and R, you will learn step-by-step how to produce hands-on solutions to real-world business problems, using state-of-the-art techniques. 

Data Science Using Python and R is written for the general reader with no previous analytics or programming experience. An entire chapter is dedicated to learning the basics of Python and R. Then, each chapter presents step-by-step instructions and walkthroughs for solving data science problems using Python and R.

Those with analytics experience will appreciate having a one-stop shop for learning how to do data science using Python and R. Topics covered include data preparation, exploratory data analysis, preparing to model the data, decision trees, model evaluation, misclassification costs, naïve Bayes classification, neural networks, clustering, regression modeling, dimension reduction, and association rules mining.

Exciting newer topics, such as random forests and generalized linear models, are also included. The book emphasizes data-driven error costs to enhance profitability and to avoid the common pitfalls that can cost a company millions of dollars.

Data Science Using Python and R provides exercises at the end of every chapter, totaling over 500 exercises in the book. Readers will therefore have plenty of opportunity to test their newfound data science skills and expertise. In the Hands-on Analysis exercises, readers are challenged to solve interesting business problems using real-world data sets.

Preface xi
About The Authors xv
Acknowledgements xvii
Chapter 1 Introduction To Data Science 1(8)
1.1 Why Data Science? 1(1)
1.2 What is Data Science? 1(1)
1.3 The Data Science Methodology 2(3)
1.4 Data Science Tasks 5(3)
1.4.1 Description 6(1)
1.4.2 Estimation 6(1)
1.4.3 Classification 6(1)
1.4.4 Clustering 7(1)
1.4.5 Prediction 7(1)
1.4.6 Association 7(1)
Exercises 8(1)
Chapter 2 The Basics Of Python And R 9(20)
2.1 Downloading Python 9(1)
2.2 Basics of Coding in Python 9(8)
2.2.1 Using Comments in Python 9(1)
2.2.2 Executing Commands in Python 10(1)
2.2.3 Importing Packages in Python 11(1)
2.2.4 Getting Data into Python 12(1)
2.2.5 Saving Output in Python 13(1)
2.2.6 Accessing Records and Variables in Python 14(1)
2.2.7 Setting Up Graphics in Python 15(2)
2.3 Downloading R and RStudio 17(2)
2.4 Basics of Coding in R 19(7)
2.4.1 Using Comments in R 19(1)
2.4.2 Executing Commands in R 20(1)
2.4.3 Importing Packages in R 20(1)
2.4.4 Getting Data into R 21(2)
2.4.5 Saving Output in R 23(1)
2.4.6 Accessing Records and Variables in R 24(2)
References 26(1)
Exercises 26(3)
Chapter 3 Data Preparation 29(18)
3.1 The Bank Marketing Data Set 29(1)
3.2 The Problem Understanding Phase 29(2)
3.2.1 Clearly Enunciate the Project Objectives 29(1)
3.2.2 Translate These Objectives into a Data Science Problem 30(1)
3.3 Data Preparation Phase 31(1)
3.4 Adding an Index Field 31(2)
3.4.1 How to Add an Index Field Using Python 31(1)
3.4.2 How to Add an Index Field Using R 32(1)
3.5 Changing Misleading Field Values 33(3)
3.5.1 How to Change Misleading Field Values Using Python 34(1)
3.5.2 How to Change Misleading Field Values Using R 34(2)
3.6 Reexpression of Categorical Data as Numeric 36(3)
3.6.1 How to Reexpress Categorical Field Values Using Python 36(2)
3.6.2 How to Reexpress Categorical Field Values Using R 38(1)
3.7 Standardizing the Numeric Fields 39(1)
3.7.1 How to Standardize Numeric Fields Using Python 40(1)
3.7.2 How to Standardize Numeric Fields Using R 40(1)
3.8 Identifying Outliers 40(3)
3.8.1 How to Identify Outliers Using Python 41(1)
3.8.2 How to Identify Outliers Using R 42(1)
References 43(1)
Exercises 44(3)
Chapter 4 Exploratory Data Analysis 47(22)
4.1 EDA Versus HT 47(1)
4.2 Bar Graphs with Response Overlay 47(4)
4.2.1 How to Construct a Bar Graph with Overlay Using Python 49(1)
4.2.2 How to Construct a Bar Graph with Overlay Using R 50(1)
4.3 Contingency Tables 51(2)
4.3.1 How to Construct Contingency Tables Using Python 52(1)
4.3.2 How to Construct Contingency Tables Using R 53(1)
4.4 Histograms with Response Overlay 53(5)
4.4.1 How to Construct Histograms with Overlay Using Python 55(3)
4.4.2 How to Construct Histograms with Overlay Using R 58(1)
4.5 Binning Based on Predictive Value 58(5)
4.5.1 How to Perform Binning Based on Predictive Value Using Python 59(3)
4.5.2 How to Perform Binning Based on Predictive Value Using R 62(1)
References 63(1)
Exercises 63(6)
Chapter 5 Preparing To Model The Data 69(12)
5.1 The Story So Far 69(1)
5.2 Partitioning the Data 69(3)
5.2.1 How to Partition the Data in Python 70(1)
5.2.2 How to Partition the Data in R 71(1)
5.3 Validating your Partition 72(1)
5.4 Balancing the Training Data Set 73(4)
5.4.1 How to Balance the Training Data Set in Python 74(1)
5.4.2 How to Balance the Training Data Set in R 75(2)
5.5 Establishing Baseline Model Performance 77(1)
References 78(1)
Exercises 78(3)
Chapter 6 Decision Trees 81(16)
6.1 Introduction to Decision Trees 81(2)
6.2 Classification and Regression Trees 83(5)
6.2.1 How to Build CART Decision Trees Using Python 84(2)
6.2.2 How to Build CART Decision Trees Using R 86(2)
6.3 The C5.0 Algorithm for Building Decision Trees 88(3)
6.3.1 How to Build C5.0 Decision Trees Using Python 89(1)
6.3.2 How to Build C5.0 Decision Trees Using R 90(1)
6.4 Random Forests 91(2)
6.4.1 How to Build Random Forests in Python 92(1)
6.4.2 How to Build Random Forests in R 92(1)
References 93(1)
Exercises 93(4)
Chapter 7 Model Evaluation 97(16)
7.1 Introduction to Model Evaluation 97(1)
7.2 Classification Evaluation Measures 97(2)
7.3 Sensitivity and Specificity 99(1)
7.4 Precision, Recall, and Fβ Scores 99(1)
7.5 Method for Model Evaluation 100(1)
7.6 An Application of Model Evaluation 100(4)
7.6.1 How to Perform Model Evaluation Using R 103(1)
7.7 Accounting for Unequal Error Costs 104(2)
7.7.1 Accounting for Unequal Error Costs Using R 105(1)
7.8 Comparing Models with and without Unequal Error Costs 106(1)
7.9 Data-Driven Error Costs 107(2)
Exercises 109(4)
Chapter 8 Naive Bayes Classification 113(16)
8.1 Introduction to Naive Bayes 113(1)
8.2 Bayes Theorem 113(1)
8.3 Maximum a Posteriori Hypothesis 114(1)
8.4 Class Conditional Independence 114(1)
8.5 Application of Naive Bayes Classification 115(10)
8.5.1 Naive Bayes in Python 121(2)
8.5.2 Naive Bayes in R 123(2)
References 125(1)
Exercises 126(3)
Chapter 9 Neural Networks 129(12)
9.1 Introduction to Neural Networks 129(1)
9.2 The Neural Network Structure 129(2)
9.3 Connection Weights and the Combination Function 131(2)
9.4 The Sigmoid Activation Function 133(1)
9.5 Backpropagation 134(1)
9.6 An Application of a Neural Network Model 134(2)
9.7 Interpreting the Weights in a Neural Network Model 136(1)
9.8 How to Use Neural Networks in R 137(1)
References 138(1)
Exercises 138(3)
Chapter 10 Clustering 141(10)
10.1 What is Clustering? 141(1)
10.2 Introduction to the K-Means Clustering Algorithm 142(1)
10.3 An Application of K-Means Clustering 143(1)
10.4 Cluster Validation 144(1)
10.5 How to Perform K-Means Clustering Using Python 145(2)
10.6 How to Perform K-Means Clustering Using R 147(2)
Exercises 149(2)
Chapter 11 Regression Modeling 151(16)
11.1 The Estimation Task 151(1)
11.2 Descriptive Regression Modeling 151(1)
11.3 An Application of Multiple Regression Modeling 152(2)
11.4 How to Perform Multiple Regression Modeling Using Python 154(2)
11.5 How to Perform Multiple Regression Modeling Using R 156(1)
11.6 Model Evaluation for Estimation 157(4)
11.6.1 How to Perform Estimation Model Evaluation Using Python 159(1)
11.6.2 How to Perform Estimation Model Evaluation Using R 160(1)
11.7 Stepwise Regression 161(1)
11.7.1 How to Perform Stepwise Regression Using R 162(1)
11.8 Baseline Models for Regression 162(1)
References 163(1)
Exercises 164(3)
Chapter 12 Dimension Reduction 167(20)
12.1 The Need for Dimension Reduction 167(1)
12.2 Multicollinearity 168(3)
12.3 Identifying Multicollinearity Using Variance Inflation Factors 171(4)
12.3.1 How to Identify Multicollinearity Using Python 172(1)
12.3.2 How to Identify Multicollinearity in R 173(2)
12.4 Principal Components Analysis 175(1)
12.5 An Application of Principal Components Analysis 175(1)
12.6 How Many Components Should We Extract? 176(2)
12.6.1 The Eigenvalue Criterion 176(1)
12.6.2 The Proportion of Variance Explained Criterion 177(1)
12.7 Performing PCA with k = 4 178(1)
12.8 Validation of the Principal Components 178(1)
12.9 How to Perform Principal Components Analysis Using Python 179(2)
12.10 How to Perform Principal Components Analysis Using R 181(2)
12.11 When is Multicollinearity Not a Problem? 183(1)
References 184(1)
Exercises 184(3)
Chapter 13 Generalized Linear Models 187(12)
13.1 An Overview of General Linear Models 187(1)
13.2 Linear Regression as a General Linear Model 188(1)
13.3 Logistic Regression as a General Linear Model 188(1)
13.4 An Application of Logistic Regression Modeling 189(3)
13.4.1 How to Perform Logistic Regression Using Python 190(1)
13.4.2 How to Perform Logistic Regression Using R 191(1)
13.5 Poisson Regression 192(1)
13.6 An Application of Poisson Regression Modeling 192(3)
13.6.1 How to Perform Poisson Regression Using Python 193(1)
13.6.2 How to Perform Poisson Regression Using R 194(1)
Reference 195(1)
Exercises 195(4)
Chapter 14 Association Rules 199(16)
14.1 Introduction to Association Rules 199(1)
14.2 A Simple Example of Association Rule Mining 200(1)
14.3 Support, Confidence, and Lift 200(2)
14.4 Mining Association Rules 202(5)
14.4.1 How to Mine Association Rules Using R 203(4)
14.5 Confirming Our Metrics 207(1)
14.6 The Confidence Difference Criterion 208(1)
14.6.1 How to Apply the Confidence Difference Criterion Using R 208(1)
14.7 The Confidence Quotient Criterion 209(2)
14.7.1 How to Apply the Confidence Quotient Criterion Using R 210(1)
References 211(1)
Exercises 211(4)
Appendix Data Summarization And Visualization 215(16)
Part 1: Summarization 1: Building Blocks of Data Analysis 215(2)
Part 2: Visualization: Graphs and Tables for Summarizing and Organizing Data 217(5)
Part 3: Summarization 2: Measures of Center, Variability, and Position 222(3)
Part 4: Summarization and Visualization of Bivariate Relationships 225(6)
Index 231
CHANTAL D. LAROSE, PHD, is an Assistant Professor of Statistics & Data Science at Eastern Connecticut State University (ECSU). She has co-authored three books on data science and predictive analytics and helped develop data science programs at ECSU and SUNY New Paltz. Her PhD dissertation, Model-Based Clustering of Incomplete Data, tackles the persistent problem of trying to do data science with incomplete data.

DANIEL T. LAROSE, PHD, is a Professor of Data Science and Statistics and Director of the Data Science programs at Central Connecticut State University. He has published many books on data science, data mining, predictive analytics, and statistics. His consulting clients include The Economist, Forbes Magazine, the CIT Group, and Microsoft.