E-book: Exploratory Data Analysis Using R

Ronald K. Pearson (GeoVera Holdings, Inc., CA, USA)

Format: 562 pages
Series: Chapman & Hall/CRC Data Mining and Knowledge Discovery Series
Publication date: 04-May-2018
Publisher: CRC Press
Language: eng
ISBN-13: 9780429847035

Format: PDF+DRM
Price: 58,49 €*
* The price is final, i.e. no further discounts apply.
This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

Copying (copy/paste): not allowed

Printing: not allowed

Usage:

Digital rights management (DRM)
The publisher has issued this e-book in encrypted form, which means that you must install special software to read it. You must also create an Adobe ID. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

Required software
To read on a mobile device (phone or tablet), you must install this free app: PocketBook Reader (iOS / Android)

To read on a PC or Mac, you must install Adobe Digital Editions (a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is probably already installed on your computer).

This e-book cannot be read on an Amazon Kindle.

This textbook will introduce exploratory data analysis (EDA) and will cover the range of interesting features we can expect to find in data. The book will also explore the practical mechanics of using R to do EDA. Based on the author's course at the University of Connecticut, the book assumes no prior exposure to data analysis or programming, and is designed to be as non-mathematical as possible. Exercises are included throughout, and a Solutions Manual will be available. The author will also provide a supplemental R package through the Comprehensive R Archive Network that will include implementations of some of the features in this book, along with data examples, tools, and datasets.

Exploratory Data Analysis Using R provides a classroom-tested introduction to exploratory data analysis (EDA) and introduces the range of interesting features (good, bad, and ugly) that can be found in data, and why it is important to find them. It also introduces the mechanics of using R to explore and explain data.

The book begins with a detailed overview of data, exploratory analysis, and R, as well as graphics in R. It then explores working with external data, linear regression models, and crafting data stories. The second part of the book focuses on developing R programs, including good programming practices and examples, working with text data, and general predictive models. The book ends with a chapter on keeping it all together that includes managing the R installation, managing files, documenting, and an introduction to reproducible computing.

The book is designed for advanced undergraduates, entry-level graduate students, and working professionals with little to no prior exposure to data analysis, modeling, statistics, or programming. It keeps the treatment relatively non-mathematical, even though data analysis is an inherently mathematical subject. Exercises are included at the end of most chapters, and an instructor's solution manual is available.

About the Author: Ronald K. Pearson holds the position of Senior Data Scientist with GeoVera, a property insurance company in Fairfield, California, and he has previously held similar positions in a variety of application areas, including software development, drug safety data analysis, and the analysis of industrial process data. He holds a PhD in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology and has published conference and journal papers on topics ranging from nonlinear dynamic model structure selection to the problems of disguised missing data in predictive modeling. Dr. Pearson has authored or co-authored books including Exploring Data in Engineering, the Sciences, and Medicine (Oxford University Press, 2011) and Nonlinear Digital Filtering with Python. He is also the developer of the DataCamp course on base R graphics and is an author of the datarobot and GoodmanKruskal R packages available from CRAN (the Comprehensive R Archive Network).

Preface

Author


1 Data, Exploratory Analysis, and R
1.1 Why do we analyze data?
1.2 The view from 90,000 feet
1.2.1 Data
1.2.2 Exploratory analysis
1.2.3 Computers, software, and R
1.3 A representative R session
1.4 Organization of this book
1.5 Exercises

2 Graphics in R
2.1 Exploratory vs. explanatory graphics
2.2 Graphics systems in R
2.2.1 Base graphics
2.2.2 Grid graphics
2.2.3 Lattice graphics
2.2.4 The ggplot2 package
2.3 The plot function
2.3.1 The flexibility of the plot function
2.3.2 S3 classes and generic functions
2.3.3 Optional parameters for base graphics
2.4 Adding details to plots
2.4.1 Adding points and lines to a scatterplot
2.4.2 Adding text to a plot
2.4.3 Adding a legend to a plot
2.4.4 Customizing axes
2.5 A few different plot types
2.5.1 Pie charts and why they should be avoided
2.5.2 Barplot summaries
2.5.3 The symbols function
2.6 Multiple plot arrays
2.6.1 Setting up simple arrays with mfrow
2.6.2 Using the layout function
2.7 Color graphics
2.7.1 A few general guidelines
2.7.2 Color options in R
2.7.3 The tableplot function
2.8 Exercises

3 Exploratory Data Analysis: A First Look
3.1 Exploring a new dataset
3.1.1 A general strategy
3.1.2 Examining the basic data characteristics
3.1.3 Variable types in practice
3.2 Summarizing numerical data
3.2.1 "Typical" values: the mean
3.2.2 "Spread": the standard deviation
3.2.3 Limitations of simple summary statistics
3.2.4 The Gaussian assumption
3.2.5 Is the Gaussian assumption reasonable?
3.3 Anomalies in numerical data
3.3.1 Outliers and their influence
3.3.2 Detecting univariate outliers
3.3.3 Inliers and their detection
3.3.4 Metadata errors
3.3.5 Missing data, possibly disguised
3.3.6 QQ-plots revisited
3.4 Visualizing relations between variables
3.4.1 Scatterplots between numerical variables
3.4.2 Boxplots: numerical vs. categorical variables
3.4.3 Mosaic plots: categorical scatterplots
3.5 Exercises

4 Working with External Data
4.1 File management in R
4.2 Manual data entry
4.2.1 Entering the data by hand
4.2.2 Manual data entry is bad but sometimes expedient
4.3 Interacting with the Internet
4.3.1 Previews of three Internet data examples
4.3.2 A very brief introduction to HTML
4.4 Working with CSV files
4.4.1 Reading and writing CSV files
4.4.2 Spreadsheets and csv files are not the same thing
4.4.3 Two potential problems with CSV files
4.5 Working with other file types
4.5.1 Working with text files
4.5.2 Saving and retrieving R objects
4.5.3 Graphics files
4.6 Merging data from different sources
4.7 A brief introduction to databases
4.7.1 Relational databases, queries, and SQL
4.7.2 An introduction to the sqldf package
4.7.3 An overview of R's database support
4.7.4 An introduction to the RSQLite package
4.8 Exercises

5 Linear Regression Models
5.1 Modeling the whiteside data
5.1.1 Describing lines in the plane
5.1.2 Fitting lines to points in the plane
5.1.3 Fitting the whiteside data
5.2 Overfitting and data splitting
5.2.1 An overfitting example
5.2.2 The training/validation/holdout split
5.2.3 Two useful model validation tools
5.3 Regression with multiple predictors
5.3.1 The Cars93 example
5.3.2 The problem of collinearity
5.4 Using categorical predictors
5.5 Interactions in linear regression models
5.6 Variable transformations in linear regression
5.7 Robust regression: a very brief introduction
5.8 Exercises

6 Crafting Data Stories
6.1 Crafting good data stories
6.1.1 The importance of clarity
6.1.2 The basic elements of an effective data story
6.2 Different audiences have different needs
6.2.1 The executive summary or abstract
6.2.2 Extended summaries
6.2.3 Longer documents
6.3 Three example data stories
6.3.1 The Big Mac and Grande Latte economic indices
6.3.2 Small losses in the Australian vehicle insurance data
6.3.3 Unexpected heterogeneity: the Boston housing data

7 Programming in R
7.1 Interactive use versus programming
7.1.1 A simple example: computing Fibonacci numbers
7.1.2 Creating your own functions
7.2 Key elements of the R language
7.2.1 Functions and their arguments
7.2.2 The list data type
7.2.3 Control structures
7.2.4 Replacing loops with apply functions
7.2.5 Generic functions revisited
7.3 Good programming practices
7.3.1 Modularity and the DRY principle
7.3.2 Comments
7.3.3 Style guidelines
7.3.4 Testing and debugging
7.4 Five programming examples
7.4.1 The function ValidationRsquared
7.4.2 The function TVHsplit
7.4.3 The function PredictedVsObservedPlot
7.4.4 The function BasicSummary
7.4.5 The function FindOutliers
7.5 R scripts
7.6 Exercises

8 Working with Text Data
8.1 The fundamentals of text data analysis
8.1.1 The basic steps in analyzing text data
8.1.2 An illustrative example
8.2 Basic character functions in R
8.2.1 The nchar function
8.2.2 The grep function
8.2.3 Application to missing data and alternative spellings
8.2.4 The sub and gsub functions
8.2.5 The strsplit function
8.2.6 Another application: ConvertAutoMpgRecords
8.2.7 The paste function
8.3 A brief introduction to regular expressions
8.3.1 Regular expression basics
8.3.2 Some useful regular expression examples
8.4 An aside: ASCII vs. UNICODE
8.5 Quantitative text analysis
8.5.1 Document-term and document-feature matrices
8.5.2 String distances and approximate matching
8.6 Three detailed examples
8.6.1 Characterizing a book
8.6.2 The cpus data frame
8.6.3 The unclaimed bank account data
8.7 Exercises

9 Exploratory Data Analysis: A Second Look
9.1 An example: repeated measurements
9.1.1 Summary and practical implications
9.1.2 The gory details
9.2 Confidence intervals and significance
9.2.1 Probability models versus data
9.2.2 Quantiles of a distribution
9.2.3 Confidence intervals
9.2.4 Statistical significance and p-values
9.3 Characterizing a binary variable
9.3.1 The binomial distribution
9.3.2 Binomial confidence intervals
9.3.3 Odds ratios
9.4 Characterizing count data
9.4.1 The Poisson distribution and rare events
9.4.2 Alternative count distributions
9.4.3 Discrete distribution plots
9.5 Continuous distributions
9.5.1 Limitations of the Gaussian distribution
9.5.2 Some alternatives to the Gaussian distribution
9.5.3 The qqPlot function revisited
9.5.4 The problems of ties and implosion
9.6 Associations between numerical variables
9.6.1 Product-moment correlations
9.6.2 Spearman's rank correlation measure
9.6.3 The correlation trick
9.6.4 Correlation matrices and correlation plots
9.6.5 Robust correlations
9.6.6 Multivariate outliers
9.7 Associations between categorical variables
9.7.1 Contingency tables
9.7.2 The chi-squared measure and Cramer's V
9.7.3 Goodman and Kruskal's tau measure
9.8 Principal component analysis (PCA)
9.9 Working with date variables
9.10 Exercises

10 More General Predictive Models
10.1 A predictive modeling overview
10.1.1 The predictive modeling problem
10.1.2 The model-building process
10.2 Binary classification and logistic regression
10.2.1 Basic logistic regression formulation
10.2.2 Fitting logistic regression models
10.2.3 Evaluating binary classifier performance
10.2.4 A brief introduction to glms
10.3 Decision tree models
10.3.1 Structure and fitting of decision trees
10.3.2 A classification tree example
10.3.3 A regression tree example
10.4 Combining trees with regression
10.5 Introduction to machine learning models
10.5.1 The instability of simple tree-based models
10.5.2 Random forest models
10.5.3 Boosted tree models
10.6 Three practical details
10.6.1 Partial dependence plots
10.6.2 Variable importance measures
10.6.3 Thin levels and data partitioning
10.7 Exercises

11 Keeping It All Together
11.1 Managing your R installation
11.1.1 Installing R
11.1.2 Updating packages
11.1.3 Updating R
11.2 Managing files effectively
11.2.1 Organizing directories
11.2.2 Use appropriate file extensions
11.2.3 Choose good file names
11.3 Document everything
11.3.1 Data dictionaries
11.3.2 Documenting code
11.3.3 Documenting results
11.4 Introduction to reproducible computing
11.4.1 The key ideas of reproducibility
11.4.2 Using R Markdown

Bibliography

Index

Ronald K. Pearson currently works for GeoVera, a property insurance company in Fairfield, California, primarily in the analysis of text data. He holds a PhD in Electrical Engineering and Computer Science from the Massachusetts Institute of Technology and has published conference and journal papers on topics ranging from nonlinear dynamic model structure selection to the problems of disguised missing data in predictive modeling. Dr. Pearson has authored or co-authored books including Exploring Data in Engineering, the Sciences, and Medicine (Oxford University Press, 2011) and Nonlinear Digital Filtering with Python, co-authored with Moncef Gabbouj (CRC Press, 2015). He is also the developer of the DataCamp course on base R graphics.


Permalink: https://www.kriso.ee/db/97804298470352e.html
