E-book: Data Mining with R: Learning with Case Studies, Second Edition

Luís Torgo (University of Porto, Portugal)
  • Format: PDF+DRM
  • Price: 58,49 €*
  • * The price is final, i.e. no further discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means that you must install special software to read it. You must also create an Adobe ID. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), you must install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, you must install Adobe Digital Editions (a free application designed specifically for reading e-books; not to be confused with Adobe Reader, which is probably already installed on your computer).

    This e-book cannot be read on an Amazon Kindle.

Data Mining with R: Learning with Case Studies, Second Edition uses practical examples to illustrate the power of R and data mining. An extensive update to the best-selling first edition, this new edition is divided into two parts. The first part features introductory material, including a new chapter that introduces data mining and complements the existing introduction to R. The second part presents the case studies, with R code that has been substantially revised to use packages that have emerged more recently in R.

The book does not assume any prior knowledge of R. Readers who are new to R and data mining should be able to follow the case studies, which are designed to be self-contained so that the reader can start anywhere in the book.

The book is accompanied by a set of freely available R source files that can be obtained at the book's web site. These files include all the code used in the case studies, and they facilitate the "do-it-yourself" approach followed in the book.

Designed for users of data analysis tools, as well as researchers and developers, the book should be useful for anyone interested in entering the "world" of R and data mining.

About the Author

Luís Torgo is an associate professor in the Department of Computer Science at the University of Porto in Portugal. He teaches Data Mining in R in the NYU Stern School of Business MS in Business Analytics program. An active researcher in machine learning and data mining for more than 20 years, Dr. Torgo is also a researcher in the Laboratory of Artificial Intelligence and Data Analysis (LIAAD) of INESC Porto LA.

Table of Contents

Preface
Acknowledgments
List of Figures
List of Tables
1 Introduction
1.1 How to Read This Book
1.2 Reproducibility
Part I: R and Data Mining
2 Introduction to R
2.1 Starting with R
2.2 Basic Interaction with the R Console
2.3 R Objects and Variables
2.4 R Functions
2.5 Vectors
2.6 Vectorization
2.7 Factors
2.8 Generating Sequences
2.9 Sub-Setting
2.10 Matrices and Arrays
2.11 Lists
2.12 Data Frames
2.13 Useful Extensions to Data Frames
2.14 Objects, Classes, and Methods
2.15 Managing Your Sessions
3 Introduction to Data Mining
3.1 A Bird's Eye View on Data Mining
3.2 Data Collection and Business Understanding
3.2.1 Data and Datasets
3.2.2 Importing Data into R
3.2.2.1 Text Files
3.2.2.2 Databases
3.2.2.3 Spreadsheets
3.2.2.4 Other Formats
3.3 Data Pre-Processing
3.3.1 Data Cleaning
3.3.1.1 Tidy Data
3.3.1.2 Handling Dates
3.3.1.3 String Processing
3.3.1.4 Dealing with Unknown Values
3.3.2 Transforming Variables
3.3.2.1 Handling Different Scales of Variables
3.3.2.2 Discretizing Variables
3.3.3 Creating Variables
3.3.3.1 Handling Case Dependencies
3.3.3.2 Handling Text Datasets
3.3.4 Dimensionality Reduction
3.3.4.1 Sampling Rows
3.3.4.2 Variable Selection
3.4 Modeling
3.4.1 Exploratory Data Analysis
3.4.1.1 Data Summarization
3.4.1.2 Data Visualization
3.4.2 Dependency Modeling using Association Rules
3.4.3 Clustering
3.4.3.1 Measures of Dissimilarity
3.4.3.2 Clustering Methods
3.4.4 Anomaly Detection
3.4.4.1 Univariate Outlier Detection Methods
3.4.4.2 Multi-Variate Outlier Detection Methods
3.4.5 Predictive Analytics
3.4.5.1 Evaluation Metrics
3.4.5.2 Tree-Based Models
3.4.5.3 Support Vector Machines
3.4.5.4 Artificial Neural Networks and Deep Learning
3.4.5.5 Model Ensembles
3.5 Evaluation
3.5.1 The Holdout and Random Subsampling
3.5.2 Cross Validation
3.5.3 Bootstrap Estimates
3.5.4 Recommended Procedures
3.6 Reporting and Deployment
3.6.1 Reporting Through Dynamic Documents
3.6.2 Deployment through Web Applications
Part II: Case Studies
4 Predicting Algae Blooms
4.1 Problem Description and Objectives
4.2 Data Description
4.3 Loading the Data into R
4.4 Data Visualization and Summarization
4.5 Unknown Values
4.5.1 Removing the Observations with Unknown Values
4.5.2 Filling in the Unknowns with the Most Frequent Values
4.5.3 Filling in the Unknown Values by Exploring Correlations
4.5.4 Filling in the Unknown Values by Exploring Similarities between Cases
4.6 Obtaining Prediction Models
4.6.1 Multiple Linear Regression
4.6.2 Regression Trees
4.7 Model Evaluation and Selection
4.8 Predictions for the Seven Algae
4.9 Summary
5 Predicting Stock Market Returns
5.1 Problem Description and Objectives
5.2 The Available Data
5.2.1 Reading the Data from the CSV File
5.2.2 Getting the Data from the Web
5.3 Defining the Prediction Tasks
5.3.1 What to Predict?
5.3.2 Which Predictors?
5.3.3 The Prediction Tasks
5.3.4 Evaluation Criteria
5.4 The Prediction Models
5.4.1 How Will the Training Data Be Used?
5.4.2 The Modeling Tools
5.4.2.1 Artificial Neural Networks
5.4.2.2 Support Vector Machines
5.4.2.3 Multivariate Adaptive Regression Splines
5.5 From Predictions into Actions
5.5.1 How Will the Predictions Be Used?
5.5.2 Trading-Related Evaluation Criteria
5.5.3 Putting Everything Together: A Simulated Trader
5.6 Model Evaluation and Selection
5.6.1 Monte Carlo Estimates
5.6.2 Experimental Comparisons
5.6.3 Results Analysis
5.7 The Trading System
5.7.1 Evaluation of the Final Test Data
5.7.2 An Online Trading System
5.8 Summary
6 Detecting Fraudulent Transactions
6.1 Problem Description and Objectives
6.2 The Available Data
6.2.1 Loading the Data into R
6.2.2 Exploring the Dataset
6.2.3 Data Problems
6.2.3.1 Unknown Values
6.2.3.2 Few Transactions of Some Products
6.3 Defining the Data Mining Tasks
6.3.1 Different Approaches to the Problem
6.3.1.1 Unsupervised Techniques
6.3.1.2 Supervised Techniques
6.3.1.3 Semi-Supervised Techniques
6.3.2 Evaluation Criteria
6.3.2.1 Precision and Recall
6.3.2.2 Lift Charts and Precision/Recall Curves
6.3.2.3 Normalized Distance to Typical Price
6.3.3 Experimental Methodology
6.4 Obtaining Outlier Rankings
6.4.1 Unsupervised Approaches
6.4.1.1 The Modified Box Plot Rule
6.4.1.2 Local Outlier Factors (LOF)
6.4.1.3 Clustering-Based Outlier Rankings (ORh)
6.4.2 Supervised Approaches
6.4.2.1 The Class Imbalance Problem
6.4.2.2 Naive Bayes
6.4.2.3 AdaBoost
6.4.3 Semi-Supervised Approaches
6.5 Summary
7 Classifying Microarray Samples
7.1 Problem Description and Objectives
7.1.1 Brief Background on Microarray Experiments
7.1.2 The ALL Dataset
7.2 The Available Data
7.2.1 Exploring the Dataset
7.3 Gene (Feature) Selection
7.3.1 Simple Filters Based on Distribution Properties
7.3.2 ANOVA Filters
7.3.3 Filtering Using Random Forests
7.3.4 Filtering Using Feature Clustering Ensembles
7.4 Predicting Cytogenetic Abnormalities
7.4.1 Defining the Prediction Task
7.4.2 The Evaluation Metric
7.4.3 The Experimental Procedure
7.4.4 The Modeling Techniques
7.4.5 Comparing the Models
7.5 Summary
Bibliography
Subject Index
Index of Data Mining Topics
Index of R Functions