E-book: Exploratory Data Analysis Using R

Ronald K. Pearson (GeoVera Holdings, Inc., CA, USA)
  • Format: PDF+DRM
  • Price: €58.49*
  • * The price is final, i.e. no further discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means that you must install special software to read it. You must also create an Adobe ID. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), you must install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, you must install Adobe Digital Editions (a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is probably already installed on your computer).

    This e-book cannot be read on an Amazon Kindle.

This textbook will introduce exploratory data analysis (EDA) and will cover the range of interesting features we can expect to find in data. The book will also explore the practical mechanics of using R to do EDA. Based on the author's course at the University of Connecticut, the book assumes no prior exposure to data analysis or programming, and is designed to be as non-mathematical as possible. Exercises are included throughout, and a Solutions Manual will be available. The author will also provide a supplemental R package through the Comprehensive R Archive Network that will include implementations of some of the features in this book, along with data examples, tools, and datasets.

Exploratory Data Analysis Using R provides a classroom-tested introduction to exploratory data analysis (EDA) and introduces the range of interesting – good, bad, and ugly – features that can be found in data, and why it is important to find them. It also introduces the mechanics of using R to explore and explain data.

The book begins with a detailed overview of data, exploratory analysis, and R, as well as graphics in R. It then explores working with external data, linear regression models, and crafting data stories. The second part of the book focuses on developing R programs, including good programming practices and examples, working with text data, and general predictive models. The book ends with a chapter on keeping it all together that includes managing the R installation, managing files, documenting, and an introduction to reproducible computing.

The book is designed for advanced undergraduates, entry-level graduate students, and working professionals with little to no prior exposure to data analysis, modeling, statistics, or programming. It keeps the treatment relatively non-mathematical, even though data analysis is an inherently mathematical subject. Exercises are included at the end of most chapters, and an instructor's solution manual is available.

About the Author: Ronald K. Pearson holds the position of Senior Data Scientist with GeoVera, a property insurance company in Fairfield, California, and he has previously held similar positions in a variety of application areas, including software development, drug safety data analysis, and the analysis of industrial process data. He holds a PhD in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology and has published conference and journal papers on topics ranging from nonlinear dynamic model structure selection to the problems of disguised missing data in predictive modeling. Dr. Pearson has authored or co-authored books including Exploring Data in Engineering, the Sciences, and Medicine (Oxford University Press, 2011) and Nonlinear Digital Filtering with Python. He is also the developer of the DataCamp course on base R graphics and is an author of the datarobot and GoodmanKruskal R packages available from CRAN (the Comprehensive R Archive Network).
Preface
Author
1 Data, Exploratory Analysis, and R
1.1 Why do we analyze data?
1.2 The view from 90,000 feet
1.2.1 Data
1.2.2 Exploratory analysis
1.2.3 Computers, software, and R
1.3 A representative R session
1.4 Organization of this book
1.5 Exercises
2 Graphics in R
2.1 Exploratory vs. explanatory graphics
2.2 Graphics systems in R
2.2.1 Base graphics
2.2.2 Grid graphics
2.2.3 Lattice graphics
2.2.4 The ggplot2 package
2.3 The plot function
2.3.1 The flexibility of the plot function
2.3.2 S3 classes and generic functions
2.3.3 Optional parameters for base graphics
2.4 Adding details to plots
2.4.1 Adding points and lines to a scatterplot
2.4.2 Adding text to a plot
2.4.3 Adding a legend to a plot
2.4.4 Customizing axes
2.5 A few different plot types
2.5.1 Pie charts and why they should be avoided
2.5.2 Barplot summaries
2.5.3 The symbols function
2.6 Multiple plot arrays
2.6.1 Setting up simple arrays with mfrow
2.6.2 Using the layout function
2.7 Color graphics
2.7.1 A few general guidelines
2.7.2 Color options in R
2.7.3 The tableplot function
2.8 Exercises
3 Exploratory Data Analysis: A First Look
3.1 Exploring a new dataset
3.1.1 A general strategy
3.1.2 Examining the basic data characteristics
3.1.3 Variable types in practice
3.2 Summarizing numerical data
3.2.1 "Typical" values: the mean
3.2.2 "Spread": the standard deviation
3.2.3 Limitations of simple summary statistics
3.2.4 The Gaussian assumption
3.2.5 Is the Gaussian assumption reasonable?
3.3 Anomalies in numerical data
3.3.1 Outliers and their influence
3.3.2 Detecting univariate outliers
3.3.3 Inliers and their detection
3.3.4 Metadata errors
3.3.5 Missing data, possibly disguised
3.3.6 QQ-plots revisited
3.4 Visualizing relations between variables
3.4.1 Scatterplots between numerical variables
3.4.2 Boxplots: numerical vs. categorical variables
3.4.3 Mosaic plots: categorical scatterplots
3.5 Exercises
4 Working with External Data
4.1 File management in R
4.2 Manual data entry
4.2.1 Entering the data by hand
4.2.2 Manual data entry is bad but sometimes expedient
4.3 Interacting with the Internet
4.3.1 Previews of three Internet data examples
4.3.2 A very brief introduction to HTML
4.4 Working with CSV files
4.4.1 Reading and writing CSV files
4.4.2 Spreadsheets and csv files are not the same thing
4.4.3 Two potential problems with CSV files
4.5 Working with other file types
4.5.1 Working with text files
4.5.2 Saving and retrieving R objects
4.5.3 Graphics files
4.6 Merging data from different sources
4.7 A brief introduction to databases
4.7.1 Relational databases, queries, and SQL
4.7.2 An introduction to the sqldf package
4.7.3 An overview of R's database support
4.7.4 An introduction to the RSQLite package
4.8 Exercises
5 Linear Regression Models
5.1 Modeling the whiteside data
5.1.1 Describing lines in the plane
5.1.2 Fitting lines to points in the plane
5.1.3 Fitting the whiteside data
5.2 Overfitting and data splitting
5.2.1 An overfitting example
5.2.2 The training/validation/holdout split
5.2.3 Two useful model validation tools
5.3 Regression with multiple predictors
5.3.1 The Cars93 example
5.3.2 The problem of collinearity
5.4 Using categorical predictors
5.5 Interactions in linear regression models
5.6 Variable transformations in linear regression
5.7 Robust regression: a very brief introduction
5.8 Exercises
6 Crafting Data Stories
6.1 Crafting good data stories
6.1.1 The importance of clarity
6.1.2 The basic elements of an effective data story
6.2 Different audiences have different needs
6.2.1 The executive summary or abstract
6.2.2 Extended summaries
6.2.3 Longer documents
6.3 Three example data stories
6.3.1 The Big Mac and Grande Latte economic indices
6.3.2 Small losses in the Australian vehicle insurance data
6.3.3 Unexpected heterogeneity: the Boston housing data
7 Programming in R
7.1 Interactive use versus programming
7.1.1 A simple example: computing Fibonacci numbers
7.1.2 Creating your own functions
7.2 Key elements of the R language
7.2.1 Functions and their arguments
7.2.2 The list data type
7.2.3 Control structures
7.2.4 Replacing loops with apply functions
7.2.5 Generic functions revisited
7.3 Good programming practices
7.3.1 Modularity and the DRY principle
7.3.2 Comments
7.3.3 Style guidelines
7.3.4 Testing and debugging
7.4 Five programming examples
7.4.1 The function ValidationRsquared
7.4.2 The function TVHsplit
7.4.3 The function PredictedVsObservedPlot
7.4.4 The function BasicSummary
7.4.5 The function FindOutliers
7.5 R scripts
7.6 Exercises
8 Working with Text Data
8.1 The fundamentals of text data analysis
8.1.1 The basic steps in analyzing text data
8.1.2 An illustrative example
8.2 Basic character functions in R
8.2.1 The nchar function
8.2.2 The grep function
8.2.3 Application to missing data and alternative spellings
8.2.4 The sub and gsub functions
8.2.5 The strsplit function
8.2.6 Another application: ConvertAutoMpgRecords
8.2.7 The paste function
8.3 A brief introduction to regular expressions
8.3.1 Regular expression basics
8.3.2 Some useful regular expression examples
8.4 An aside: ASCII vs. UNICODE
8.5 Quantitative text analysis
8.5.1 Document-term and document-feature matrices
8.5.2 String distances and approximate matching
8.6 Three detailed examples
8.6.1 Characterizing a book
8.6.2 The cpus data frame
8.6.3 The unclaimed bank account data
8.7 Exercises
9 Exploratory Data Analysis: A Second Look
9.1 An example: repeated measurements
9.1.1 Summary and practical implications
9.1.2 The gory details
9.2 Confidence intervals and significance
9.2.1 Probability models versus data
9.2.2 Quantiles of a distribution
9.2.3 Confidence intervals
9.2.4 Statistical significance and p-values
9.3 Characterizing a binary variable
9.3.1 The binomial distribution
9.3.2 Binomial confidence intervals
9.3.3 Odds ratios
9.4 Characterizing count data
9.4.1 The Poisson distribution and rare events
9.4.2 Alternative count distributions
9.4.3 Discrete distribution plots
9.5 Continuous distributions
9.5.1 Limitations of the Gaussian distribution
9.5.2 Some alternatives to the Gaussian distribution
9.5.3 The qqPlot function revisited
9.5.4 The problems of ties and implosion
9.6 Associations between numerical variables
9.6.1 Product-moment correlations
9.6.2 Spearman's rank correlation measure
9.6.3 The correlation trick
9.6.4 Correlation matrices and correlation plots
9.6.5 Robust correlations
9.6.6 Multivariate outliers
9.7 Associations between categorical variables
9.7.1 Contingency tables
9.7.2 The chi-squared measure and Cramer's V
9.7.3 Goodman and Kruskal's tau measure
9.8 Principal component analysis (PCA)
9.9 Working with date variables
9.10 Exercises
10 More General Predictive Models
10.1 A predictive modeling overview
10.1.1 The predictive modeling problem
10.1.2 The model-building process
10.2 Binary classification and logistic regression
10.2.1 Basic logistic regression formulation
10.2.2 Fitting logistic regression models
10.2.3 Evaluating binary classifier performance
10.2.4 A brief introduction to glms
10.3 Decision tree models
10.3.1 Structure and fitting of decision trees
10.3.2 A classification tree example
10.3.3 A regression tree example
10.4 Combining trees with regression
10.5 Introduction to machine learning models
10.5.1 The instability of simple tree-based models
10.5.2 Random forest models
10.5.3 Boosted tree models
10.6 Three practical details
10.6.1 Partial dependence plots
10.6.2 Variable importance measures
10.6.3 Thin levels and data partitioning
10.7 Exercises
11 Keeping It All Together
11.1 Managing your R installation
11.1.1 Installing R
11.1.2 Updating packages
11.1.3 Updating R
11.2 Managing files effectively
11.2.1 Organizing directories
11.2.2 Use appropriate file extensions
11.2.3 Choose good file names
11.3 Document everything
11.3.1 Data dictionaries
11.3.2 Documenting code
11.3.3 Documenting results
11.4 Introduction to reproducible computing
11.4.1 The key ideas of reproducibility
11.4.2 Using R Markdown
Bibliography
Index
Ronald K. Pearson currently works for GeoVera, a property insurance company in Fairfield, California, primarily in the analysis of text data. He holds a PhD in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology and has published conference and journal papers on topics ranging from nonlinear dynamic model structure selection to the problems of disguised missing data in predictive modeling. Dr. Pearson has authored or co-authored books including Exploring Data in Engineering, the Sciences, and Medicine (Oxford University Press, 2011) and Nonlinear Digital Filtering with Python, co-authored with Moncef Gabbouj (CRC Press, 2015). He is also the developer of the DataCamp course on base R graphics.