
Introduction to Data Science: Data Analysis and Prediction Algorithms with R [Hardcover]

By Rafael A. Irizarry (Dana-Farber Cancer Institute and Harvard University)
  • Format: Hardback, 713 pages, height x width: 254x178 mm, weight: 1720 g
  • Series: Chapman & Hall/CRC Data Science Series
  • Publication date: 08-Nov-2019
  • Publisher: Chapman & Hall/CRC
  • ISBN-10: 0367357984
  • ISBN-13: 9780367357986
  • Hardcover
  • Price: 124.74 €*
  • * We will send you an offer for a used copy; its price may differ from the price shown on the website.
  • This book is out of print, but we will send you an offer for a used copy.
  • Free shipping
"The book begins by going over the basics of R and the tidyverse. You learn R throughout the book, but in the first part we go over the building blocks needed to keep learning during the rest of the book"--

Introduction to Data Science: Data Analysis and Prediction Algorithms with R introduces concepts and skills that can help you tackle real-world data analysis challenges. It covers concepts from probability, statistical inference, linear regression, and machine learning. It also helps you develop skills such as R programming, data wrangling, data visualization, predictive algorithm building, file organization with UNIX/Linux shell, version control with Git and GitHub, and reproducible document preparation.
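
To give a flavor of the kind of code the book teaches, here is a minimal sketch of the wrangle-then-visualize workflow described above, written with the tidyverse packages dplyr and ggplot2 on R's built-in mtcars dataset (an illustrative example of ours, not an excerpt from the book):

    # Minimal sketch of a tidyverse workflow (illustrative; not from the book)
    library(dplyr)
    library(ggplot2)

    mtcars %>%
      group_by(cyl) %>%                    # wrangling: group by cylinder count
      summarize(avg_mpg = mean(mpg)) %>%   # compute a per-group summary
      ggplot(aes(factor(cyl), avg_mpg)) +  # visualization: bar chart
      geom_col() +
      labs(x = "Cylinders", y = "Average miles per gallon")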

This book is a textbook for a first course in data science. No previous knowledge of R is necessary, although some experience with programming may be helpful. The book is divided into six parts: R, data visualization, statistics with R, data wrangling, machine learning, and productivity tools. Each part has several chapters, and each chapter is meant to be presented as one lecture.

The author uses motivating case studies that realistically mimic a data scientist’s experience. He starts by asking specific questions and answers these through data analysis so concepts are learned as a means to answering the questions. Examples of the case studies included are: US murder rates by state, self-reported student heights, trends in world health and economics, the impact of vaccines on infectious disease rates, the financial crisis of 2007-2008, election forecasting, building a baseball team, image processing of hand-written digits, and movie recommendation systems.
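
Many of these case studies draw on datasets shipped in the author's companion dslabs R package. As an illustrative sketch of our own (not an excerpt from the book), the first case study's murder-rate computation could start like this, assuming the murders data frame with its population and total columns:

    # Illustrative sketch of the US murders case study,
    # using the dslabs companion package (not an excerpt from the book)
    library(dslabs)
    library(dplyr)

    data(murders)

    murders %>%
      mutate(rate = total / population * 10^5) %>%  # murders per 100,000
      arrange(desc(rate)) %>%                       # highest rates first
      select(state, region, rate) %>%
      head(5)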

The statistical concepts used to answer the case study questions are only briefly introduced, so complementing with a probability and statistics textbook is highly recommended for in-depth understanding of these concepts. If you read and understand the chapters and complete the exercises, you will be prepared to learn the more advanced concepts and skills needed to become an expert.

Reviews

"I think the book would be perfect for schools looking to make a transition to a model where introduction to data science takes the place of introduction to statistics and maybe introductory computer science." ~Arend Kuyper, Northwestern University

"A great introduction to data science and modern R programing, with tons of examples of application of the R abilities throughout the whole volume. The book suggests multiple links to the internet websites related to the topics under consideration that makes it an incredibly useful source of contemporary data science and programing, helping to students and researchers in their projects." ~Technometrics

"Introduction to Data Science will teach you to juggle with your data and get maximum results from it using R. I highly recommended this book for students and everybody taking the first steps in data science using R." ~ Maria Ivanchuk, ISCB News

Table of Contents

Preface
Acknowledgments
Introduction
1 Getting started with R and RStudio
1.1 Why R?
1.2 The R console
1.3 Scripts
1.4 RStudio
1.4.1 The panes
1.4.2 Key bindings
1.4.3 Running commands while editing scripts
1.4.4 Changing global options
1.5 Installing R packages
I R
2 R basics
2.1 Case study: US Gun Murders
2.2 The very basics
2.2.1 Objects
2.2.2 The workspace
2.2.3 Functions
2.2.4 Other prebuilt objects
2.2.5 Variable names
2.2.6 Saving your workspace
2.2.7 Motivating scripts
2.2.8 Commenting your code
2.3 Exercises
2.4 Data types
2.4.1 Data frames
2.4.2 Examining an object
2.4.3 The accessor: $
2.4.4 Vectors: numerics, characters, and logical
2.4.5 Factors
2.4.6 Lists
2.4.7 Matrices
2.5 Exercises
2.6 Vectors
2.6.1 Creating vectors
2.6.2 Names
2.6.3 Sequences
2.6.4 Subsetting
2.7 Coercion
2.7.1 Not availables (NA)
2.8 Exercises
2.9 Sorting
2.9.1 sort
2.9.2 order
2.9.3 max and which.max
2.9.4 rank
2.9.5 Beware of recycling
2.10 Exercises
2.11 Vector arithmetics
2.11.1 Rescaling a vector
2.11.2 Two vectors
2.12 Exercises
2.13 Indexing
2.13.1 Subsetting with logicals
2.13.2 Logical operators
2.13.3 which
2.13.4 match
2.13.5 %in%
2.14 Exercises
2.15 Basic plots
2.15.1 plot
2.15.2 hist
2.15.3 boxplot
2.15.4 image
2.16 Exercises
3 Programming basics
3.1 Conditional expressions
3.2 Defining functions
3.3 Namespaces
3.4 For-loops
3.5 Vectorization and functionals
3.6 Exercises
4 The tidyverse
4.1 Tidy data
4.2 Exercises
4.3 Manipulating data frames
4.3.1 Adding a column with mutate
4.3.2 Subsetting with filter
4.3.3 Selecting columns with select
4.4 Exercises
4.5 The pipe: %>%
4.6 Exercises
4.7 Summarizing data
4.7.1 summarize
4.7.2 pull
4.7.3 Group then summarize with group_by
4.8 Sorting data frames
4.8.1 Nested sorting
4.8.2 The top n
4.9 Exercises
4.10 Tibbles
4.10.1 Tibbles display better
4.10.2 Subsets of tibbles are tibbles
4.10.3 Tibbles can have complex entries
4.10.4 Tibbles can be grouped
4.10.5 Create a tibble using tibble instead of data.frame
4.11 The dot operator
4.12 do
4.13 The purrr package
4.14 Tidyverse conditionals
4.14.1 case_when
4.14.2 between
4.15 Exercises
5 Importing data
5.1 Paths and the working directory
5.1.1 The filesystem
5.1.2 Relative and full paths
5.1.3 The working directory
5.1.4 Generating path names
5.1.5 Copying files using paths
5.2 The readr and readxl packages
5.2.1 readr
5.2.2 readxl
5.3 Exercises
5.4 Downloading files
5.5 R-base importing functions
5.5.1 scan
5.6 Text versus binary files
5.7 Unicode versus ASCII
5.8 Organizing data with spreadsheets
5.9 Exercises
II Data Visualization
6 Introduction to data visualization
7 ggplot2
7.1 The components of a graph
7.2 ggplot objects
7.3 Geometries
7.4 Aesthetic mappings
7.5 Layers
7.5.1 Tinkering with arguments
7.6 Global versus local aesthetic mappings
7.7 Scales
7.8 Labels and titles
7.9 Categories as colors
7.10 Annotation, shapes, and adjustments
7.11 Add-on packages
7.12 Putting it all together
7.13 Quick plots with qplot
7.14 Grids of plots
7.15 Exercises
8 Visualizing data distributions
8.1 Variable types
8.2 Case study: describing student heights
8.3 Distribution function
8.4 Cumulative distribution functions
8.5 Histograms
8.6 Smoothed density
8.6.1 Interpreting the y-axis
8.6.2 Densities permit stratification
8.7 Exercises
8.8 The normal distribution
8.9 Standard units
8.10 Quantile-quantile plots
8.11 Percentiles
8.12 Boxplots
8.13 Stratification
8.14 Case study: describing student heights (continued)
8.15 Exercises
8.16 ggplot2 geometries
8.16.1 Barplots
8.16.2 Histograms
8.16.3 Density plots
8.16.4 Boxplots
8.16.5 QQ-plots
8.16.6 Images
8.16.7 Quick plots
8.17 Exercises
9 Data visualization in practice
9.1 Case study: new insights on poverty
9.1.1 Hans Rosling's quiz
9.2 Scatterplots
9.3 Faceting
9.3.1 facet_wrap
9.3.2 Fixed scales for better comparisons
9.4 Time series plots
9.4.1 Labels instead of legends
9.5 Data transformations
9.5.1 Log transformation
9.5.2 Which base?
9.5.3 Transform the values or the scale?
9.6 Visualizing multimodal distributions
9.7 Comparing multiple distributions with boxplots and ridge plots
9.7.1 Boxplots
9.7.2 Ridge plots
9.7.3 Example: 1970 versus 2010 income distributions
9.7.4 Accessing computed variables
9.7.5 Weighted densities
9.8 The ecological fallacy and importance of showing the data
9.8.1 Logistic transformation
9.8.2 Show the data
10 Data visualization principles
10.1 Encoding data using visual cues
10.2 Know when to include 0
10.3 Do not distort quantities
10.4 Order categories by a meaningful value
10.5 Show the data
10.6 Ease comparisons
10.6.1 Use common axes
10.6.2 Align plots vertically to see horizontal changes and horizontally to see vertical changes
10.6.3 Consider transformations
10.6.4 Visual cues to be compared should be adjacent
10.6.5 Use color
10.7 Think of the color blind
10.8 Plots for two variables
10.8.1 Slope charts
10.8.2 Bland-Altman plot
10.9 Encoding a third variable
10.10 Avoid pseudo-three-dimensional plots
10.11 Avoid too many significant digits
10.12 Know your audience
10.13 Exercises
10.14 Case study: vaccines and infectious diseases
10.15 Exercises
11 Robust summaries
11.1 Outliers
11.2 Median
11.3 The interquartile range (IQR)
11.4 Tukey's definition of an outlier
11.5 Median absolute deviation
11.6 Exercises
11.7 Case study: self-reported student heights
III Statistics with R
12 Introduction to statistics with R
13 Probability
13.1 Discrete probability
13.1.1 Relative frequency
13.1.2 Notation
13.1.3 Probability distributions
13.2 Monte Carlo simulations for categorical data
13.2.1 Setting the random seed
13.2.2 With and without replacement
13.3 Independence
13.4 Conditional probabilities
13.5 Addition and multiplication rules
13.5.1 Multiplication rule
13.5.2 Multiplication rule under independence
13.5.3 Addition rule
13.6 Combinations and permutations
13.6.1 Monte Carlo example
13.7 Examples
13.7.1 Monty Hall problem
13.7.2 Birthday problem
13.8 Infinity in practice
13.9 Exercises
13.10 Continuous probability
13.11 Theoretical continuous distributions
13.11.1 Theoretical distributions as approximations
13.11.2 The probability density
13.12 Monte Carlo simulations for continuous variables
13.13 Continuous distributions
13.14 Exercises
14 Random variables
14.1 Random variables
14.2 Sampling models
14.3 The probability distribution of a random variable
14.4 Distributions versus probability distributions
14.5 Notation for random variables
14.6 The expected value and standard error
14.6.1 Population SD versus the sample SD
14.7 Central Limit Theorem
14.7.1 How large is large in the Central Limit Theorem?
14.8 Statistical properties of averages
14.9 Law of large numbers
14.9.1 Misinterpreting law of averages
14.10 Exercises
14.11 Case study: The Big Short
14.11.1 Interest rates explained with chance model
14.11.2 The Big Short
14.12 Exercises
15 Statistical inference
15.1 Polls
15.1.1 The sampling model for polls
15.2 Populations, samples, parameters, and estimates
15.2.1 The sample average
15.2.2 Parameters
15.2.3 Polling versus forecasting
15.2.4 Properties of our estimate: expected value and standard error
15.3 Exercises
15.4 Central Limit Theorem in practice
15.4.1 A Monte Carlo simulation
15.4.2 The spread
15.4.3 Bias: why not run a very large poll?
15.5 Exercises
15.6 Confidence intervals
15.6.1 A Monte Carlo simulation
15.6.2 The correct language
15.7 Exercises
15.8 Power
15.9 p-values
15.10 Association tests
15.10.1 Lady Tasting Tea
15.10.2 Two-by-two tables
15.10.3 Chi-square Test
15.10.4 The odds ratio
15.10.5 Confidence intervals for the odds ratio
15.10.6 Small count correction
15.10.7 Large samples, small p-values
15.11 Exercises
16 Statistical models
16.1 Poll aggregators
16.1.1 Poll data
16.1.2 Pollster bias
16.2 Data-driven models
16.3 Exercises
16.4 Bayesian statistics
16.4.1 Bayes theorem
16.5 Bayes theorem simulation
16.5.1 Bayes in practice
16.6 Hierarchical models
16.7 Exercises
16.8 Case study: election forecasting
16.8.1 Bayesian approach
16.8.2 The general bias
16.8.3 Mathematical representations of models
16.8.4 Predicting the electoral college
16.8.5 Forecasting
16.9 Exercises
16.10 The t-distribution
17 Regression
17.1 Case study: is height hereditary?
17.2 The correlation coefficient
17.2.1 Sample correlation is a random variable
17.2.2 Correlation is not always a useful summary
17.3 Conditional expectations
17.4 The regression line
17.4.1 Regression improves precision
17.4.2 Bivariate normal distribution (advanced)
17.4.3 Variance explained
17.4.4 Warning: there are two regression lines
17.5 Exercises
18 Linear models
18.1 Case study: Moneyball
18.1.1 Sabermetrics
18.1.2 Baseball basics
18.1.3 No awards for BB
18.1.4 Base on balls or stolen bases?
18.1.5 Regression applied to baseball statistics
18.2 Confounding
18.2.1 Understanding confounding through stratification
18.2.2 Multivariate regression
18.3 Least squares estimates
18.3.1 Interpreting linear models
18.3.2 Least Squares Estimates (LSE)
18.3.3 The lm function
18.3.4 LSE are random variables
18.3.5 Predicted values are random variables
18.4 Exercises
18.5 Linear regression in the tidyverse
18.5.1 The broom package
18.6 Exercises
18.7 Case study: Moneyball (continued)
18.7.1 Adding salary and position information
18.7.2 Picking nine players
18.8 The regression fallacy
18.9 Measurement error models
18.10 Exercises
19 Association is not causation
19.1 Spurious correlation
19.2 Outliers
19.3 Reversing cause and effect
19.4 Confounders
19.4.1 Example: UC Berkeley admissions
19.4.2 Confounding explained graphically
19.4.3 Average after stratifying
19.5 Simpson's paradox
19.6 Exercises
IV Data Wrangling
20 Introduction to data wrangling
21 Reshaping data
21.1 gather
21.2 spread
21.3 separate
21.4 unite
21.5 Exercises
22 Joining tables
22.1 Joins
22.1.1 Left join
22.1.2 Right join
22.1.3 Inner join
22.1.4 Full join
22.1.5 Semi join
22.1.6 Anti join
22.2 Binding
22.2.1 Binding columns
22.2.2 Binding by rows
22.3 Set operators
22.3.1 Intersect
22.3.2 Union
22.3.3 setdiff
22.3.4 setequal
22.4 Exercises
23 Web scraping
23.1 HTML
23.2 The rvest package
23.3 CSS selectors
23.4 JSON
23.5 Exercises
24 String processing
24.1 The stringr package
24.2 Case study 1: US murders data
24.3 Case study 2: self-reported heights
24.4 How to escape when defining strings
24.5 Regular expressions
24.5.1 Strings are a regexp
24.5.2 Special characters
24.5.3 Character classes
24.5.4 Anchors
24.5.5 Quantifiers
24.5.6 White space \s
24.5.7 Quantifiers: *, ?, +
24.5.8 Not
24.5.9 Groups
24.6 Search and replace with regex
24.6.1 Search and replace using groups
24.7 Testing and improving
24.8 Trimming
24.9 Changing lettercase
24.10 Case study 2: self-reported heights (continued)
24.10.1 The extract function
24.10.2 Putting it all together
24.11 String splitting
24.12 Case study 3: extracting tables from a PDF
24.13 Recoding
24.14 Exercises
25 Parsing dates and times
25.1 The date data type
25.2 The lubridate package
25.3 Exercises
26 Text mining
26.1 Case study: Trump tweets
26.2 Text as data
26.3 Sentiment analysis
26.4 Exercises
V Machine Learning
27 Introduction to machine learning
27.1 Notation
27.2 An example
27.3 Exercises
27.4 Evaluation metrics
27.4.1 Training and test sets
27.4.2 Overall accuracy
27.4.3 The confusion matrix
27.4.4 Sensitivity and specificity
27.4.5 Balanced accuracy and F1 score
27.4.6 Prevalence matters in practice
27.4.7 ROC and precision-recall curves
27.4.8 The loss function
27.5 Exercises
27.6 Conditional probabilities and expectations
27.6.1 Conditional probabilities
27.6.2 Conditional expectations
27.6.3 Conditional expectation minimizes squared loss function
27.7 Exercises
27.8 Case study: is it a 2 or a 7?
28 Smoothing
28.1 Bin smoothing
28.2 Kernels
28.3 Local weighted regression (loess)
28.3.1 Fitting parabolas
28.3.2 Beware of default smoothing parameters
28.4 Connecting smoothing to machine learning
28.5 Exercises
29 Cross validation
29.1 Motivation with k-nearest neighbors
29.1.1 Over-training
29.1.2 Over-smoothing
29.1.3 Picking the k in kNN
29.2 Mathematical description of cross validation
29.3 K-fold cross validation
29.4 Exercises
29.5 Bootstrap
29.6 Exercises
30 The caret package
30.1 The caret train function
30.2 Cross validation
30.3 Example: fitting with loess
31 Examples of algorithms
31.1 Linear regression
31.1.1 The predict function
31.2 Exercises
31.3 Logistic regression
31.3.1 Generalized linear models
31.3.2 Logistic regression with more than one predictor
31.4 Exercises
31.5 k-nearest neighbors
31.6 Exercises
31.7 Generative models
31.7.1 Naive Bayes
31.7.2 Controlling prevalence
31.7.3 Quadratic discriminant analysis
31.7.4 Linear discriminant analysis
31.7.5 Connection to distance
31.8 Case study: more than three classes
31.9 Exercises
31.10 Classification and regression trees (CART)
31.10.1 The curse of dimensionality
31.10.2 CART motivation
31.10.3 Regression trees
31.10.4 Classification (decision) trees
31.11 Random forests
31.12 Exercises
32 Machine learning in practice
32.1 Preprocessing
32.2 k-nearest neighbor and random forest
32.3 Variable importance
32.4 Visual assessments
32.5 Ensembles
32.6 Exercises
33 Large datasets
33.1 Matrix algebra
33.1.1 Notation
33.1.2 Converting a vector to a matrix
33.1.3 Row and column summaries
33.1.4 apply
33.1.5 Filtering columns based on summaries
33.1.6 Indexing with matrices
33.1.7 Binarizing the data
33.1.8 Vectorization for matrices
33.1.9 Matrix algebra operations
33.2 Exercises
33.3 Distance
33.3.1 Euclidean distance
33.3.2 Distance in higher dimensions
33.3.3 Euclidean distance example
33.3.4 Predictor space
33.3.5 Distance between predictors
33.4 Exercises
33.5 Dimension reduction
33.5.1 Preserving distance
33.5.2 Linear transformations (advanced)
33.5.3 Orthogonal transformations (advanced)
33.5.4 Principal component analysis
33.5.5 Iris example
33.5.6 MNIST example
33.6 Exercises
33.7 Recommendation systems
33.7.1 Movielens data
33.7.2 Recommendation systems as a machine learning challenge
33.7.3 Loss function
33.7.4 A first model
33.7.5 Modeling movie effects
33.7.6 User effects
33.8 Exercises
33.9 Regularization
33.9.1 Motivation
33.9.2 Penalized least squares
33.9.3 Choosing the penalty terms
33.10 Exercises
33.11 Matrix factorization
33.11.1 Factor analysis
33.11.2 Connection to SVD and PCA
33.12 Exercises
34 Clustering
34.1 Hierarchical clustering
34.2 k-means
34.3 Heatmaps
34.4 Filtering features
34.5 Exercises
VI Productivity Tools
35 Introduction to productivity tools
36 Organizing with Unix
36.1 Naming convention
36.2 The terminal
36.3 The filesystem
36.3.1 Directories and subdirectories
36.3.2 The home directory
36.3.3 Working directory
36.3.4 Paths
36.4 Unix commands
36.4.1 ls: Listing directory content
36.4.2 mkdir and rmdir: make and remove a directory
36.4.3 cd: navigating the filesystem by changing directories
36.5 Some examples
36.6 More Unix commands
36.6.1 mv: moving files
36.6.2 cp: copying files
36.6.3 rm: removing files
36.6.4 less: looking at a file
36.7 Preparing for a data science project
36.8 Advanced Unix
36.8.1 Arguments
36.8.2 Getting help
36.8.3 Pipes
36.8.4 Wild cards
36.8.5 Environment variables
36.8.6 Shells
36.8.7 Executables
36.8.8 Permissions and file types
36.8.9 Commands you should learn
36.8.10 File manipulation in R
37 Git and GitHub
37.1 Why use Git and GitHub?
37.2 GitHub accounts
37.3 GitHub repositories
37.4 Overview of Git
37.4.1 Clone
37.5 Initializing a Git directory
37.6 Using Git and GitHub in RStudio
38 Reproducible projects with RStudio and R markdown
38.1 RStudio projects
38.2 R markdown
38.2.1 The header
38.2.2 R code chunks
38.2.3 Global options
38.2.4 knitR
38.2.5 More on R markdown
38.3 Organizing a data science project
38.3.1 Create directories in Unix
38.3.2 Create an RStudio project
38.3.3 Edit some R scripts
38.3.4 Create some more directories using Unix
38.3.5 Add a README file
38.3.6 Initializing a Git directory
38.3.7 Add, commit, and push files using RStudio
Index
About the author

Rafael A. Irizarry is professor of data sciences at the Dana-Farber Cancer Institute, professor of biostatistics at Harvard, and a fellow of the American Statistical Association. Dr. Irizarry is an applied statistician and during the last 20 years has worked in diverse areas, including genomics, sound engineering, and public health. He disseminates solutions to data analysis challenges as open source software; these tools are widely downloaded and used. Prof. Irizarry has also developed and taught several data science courses at Harvard as well as popular online courses.