
Introduction to Data Science: Data Analysis and Prediction Algorithms with R [Hardcover]

By Rafael A. Irizarry (Dana-Farber Cancer Institute and Harvard University)
  • Format: Hardback, 713 pages, height x width: 254x178 mm, weight: 1720 g
  • Series: Chapman & Hall/CRC Data Science Series
  • Publication date: 08-Nov-2019
  • Publisher: Chapman & Hall/CRC
  • ISBN-10: 0367357984
  • ISBN-13: 9780367357986
  • Hardcover
  • Price: 124.74 €*
  • * We will send you an offer for a used copy; its price may differ from the price shown on the website.
  • This book is out of print, but we will send you an offer for a used copy.
  • Free shipping
"The book begins by going over the basics of R and the tidyverse. You learn R throughout the book, but in the first part we go over the building blocks needed to keep learning during the rest of the book"--

Introduction to Data Science: Data Analysis and Prediction Algorithms with R introduces concepts and skills that can help you tackle real-world data analysis challenges. It covers concepts from probability, statistical inference, linear regression, and machine learning. It also helps you develop skills such as R programming, data wrangling, data visualization, predictive algorithm building, file organization with UNIX/Linux shell, version control with Git and GitHub, and reproducible document preparation.
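
To give a flavor of the kind of code the book teaches, here is a minimal sketch of the wrangle-then-visualize workflow described above, written with the tidyverse packages dplyr and ggplot2 on R's built-in mtcars dataset (an illustrative example of ours, not an excerpt from the book):

    # Minimal sketch of a tidyverse workflow (illustrative; not from the book)
    library(dplyr)
    library(ggplot2)

    mtcars %>%
      group_by(cyl) %>%                    # wrangling: group by cylinder count
      summarize(avg_mpg = mean(mpg)) %>%   # compute a per-group summary
      ggplot(aes(factor(cyl), avg_mpg)) +  # visualization: bar chart
      geom_col() +
      labs(x = "Cylinders", y = "Average miles per gallon")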

This book is a textbook for a first course in data science. No previous knowledge of R is necessary, although some experience with programming may be helpful. The book is divided into six parts: R, data visualization, statistics with R, data wrangling, machine learning, and productivity tools. Each part has several chapters, and each chapter is meant to be presented as one lecture.

The author uses motivating case studies that realistically mimic a data scientist’s experience. He starts by asking specific questions and answers these through data analysis so concepts are learned as a means to answering the questions. Examples of the case studies included are: US murder rates by state, self-reported student heights, trends in world health and economics, the impact of vaccines on infectious disease rates, the financial crisis of 2007-2008, election forecasting, building a baseball team, image processing of hand-written digits, and movie recommendation systems.
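
Many of these case studies draw on datasets shipped in the author's companion dslabs R package. As an illustrative sketch of our own (not an excerpt from the book), the first case study's murder-rate computation could start like this, assuming the murders data frame with its population and total columns:

    # Illustrative sketch of the US murders case study,
    # using the dslabs companion package (not an excerpt from the book)
    library(dslabs)
    library(dplyr)

    data(murders)

    murders %>%
      mutate(rate = total / population * 10^5) %>%  # murders per 100,000
      arrange(desc(rate)) %>%                       # highest rates first
      select(state, region, rate) %>%
      head(5)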

The statistical concepts used to answer the case study questions are only briefly introduced, so complementing with a probability and statistics textbook is highly recommended for in-depth understanding of these concepts. If you read and understand the chapters and complete the exercises, you will be prepared to learn the more advanced concepts and skills needed to become an expert.

Reviews

"I think the book would be perfect for schools looking to make a transition to a model where introduction to data science takes the place of introduction to statistics and maybe introductory computer science." ~Arend Kuyper, Northwestern University

"A great introduction to data science and modern R programing, with tons of examples of application of the R abilities throughout the whole volume. The book suggests multiple links to the internet websites related to the topics under consideration that makes it an incredibly useful source of contemporary data science and programing, helping to students and researchers in their projects." ~Technometrics

"Introduction to Data Science will teach you to juggle with your data and get maximum results from it using R. I highly recommended this book for students and everybody taking the first steps in data science using R." ~ Maria Ivanchuk, ISCB News

Table of Contents

Preface
Acknowledgments
Introduction
1 Getting started with R and RStudio
1.1 Why R?
1.2 The R console
1.3 Scripts
1.4 RStudio
1.4.1 The panes
1.4.2 Key bindings
1.4.3 Running commands while editing scripts
1.4.4 Changing global options
1.5 Installing R packages
I R
2 R basics
2.1 Case study: US Gun Murders
2.2 The very basics
2.2.1 Objects
2.2.2 The workspace
2.2.3 Functions
2.2.4 Other prebuilt objects
2.2.5 Variable names
2.2.6 Saving your workspace
2.2.7 Motivating scripts
2.2.8 Commenting your code
2.3 Exercises
2.4 Data types
2.4.1 Data frames
2.4.2 Examining an object
2.4.3 The accessor: $
2.4.4 Vectors: numerics, characters, and logical
2.4.5 Factors
2.4.6 Lists
2.4.7 Matrices
2.5 Exercises
2.6 Vectors
2.6.1 Creating vectors
2.6.2 Names
2.6.3 Sequences
2.6.4 Subsetting
2.7 Coercion
2.7.1 Not availables (NA)
2.8 Exercises
2.9 Sorting
2.9.1 sort
2.9.2 order
2.9.3 max and which.max
2.9.4 rank
2.9.5 Beware of recycling
2.10 Exercises
2.11 Vector arithmetics
2.11.1 Rescaling a vector
2.11.2 Two vectors
2.12 Exercises
2.13 Indexing
2.13.1 Subsetting with logicals
2.13.2 Logical operators
2.13.3 which
2.13.4 match
2.13.5 %in%
2.14 Exercises
2.15 Basic plots
2.15.1 plot
2.15.2 hist
2.15.3 boxplot
2.15.4 image
2.16 Exercises
3 Programming basics
3.1 Conditional expressions
3.2 Defining functions
3.3 Namespaces
3.4 For-loops
3.5 Vectorization and functionals
3.6 Exercises
4 The tidyverse
4.1 Tidy data
4.2 Exercises
4.3 Manipulating data frames
4.3.1 Adding a column with mutate
4.3.2 Subsetting with filter
4.3.3 Selecting columns with select
4.4 Exercises
4.5 The pipe: %>%
4.6 Exercises
4.7 Summarizing data
4.7.1 summarize
4.7.2 pull
4.7.3 Group then summarize with group_by
4.8 Sorting data frames
4.8.1 Nested sorting
4.8.2 The top n
4.9 Exercises
4.10 Tibbles
4.10.1 Tibbles display better
4.10.2 Subsets of tibbles are tibbles
4.10.3 Tibbles can have complex entries
4.10.4 Tibbles can be grouped
4.10.5 Create a tibble using tibble instead of data.frame
4.11 The dot operator
4.12 do
4.13 The purrr package
4.14 Tidyverse conditionals
4.14.1 case_when
4.14.2 between
4.15 Exercises
5 Importing data
5.1 Paths and the working directory
5.1.1 The filesystem
5.1.2 Relative and full paths
5.1.3 The working directory
5.1.4 Generating path names
5.1.5 Copying files using paths
5.2 The readr and readxl packages
5.2.1 readr
5.2.2 readxl
5.3 Exercises
5.4 Downloading files
5.5 R-base importing functions
5.5.1 scan
5.6 Text versus binary files
5.7 Unicode versus ASCII
5.8 Organizing data with spreadsheets
5.9 Exercises
II Data Visualization
6 Introduction to data visualization
7 ggplot2
7.1 The components of a graph
7.2 ggplot objects
7.3 Geometries
7.4 Aesthetic mappings
7.5 Layers
7.5.1 Tinkering with arguments
7.6 Global versus local aesthetic mappings
7.7 Scales
7.8 Labels and titles
7.9 Categories as colors
7.10 Annotation, shapes, and adjustments
7.11 Add-on packages
7.12 Putting it all together
7.13 Quick plots with qplot
7.14 Grids of plots
7.15 Exercises
8 Visualizing data distributions
8.1 Variable types
8.2 Case study: describing student heights
8.3 Distribution function
8.4 Cumulative distribution functions
8.5 Histograms
8.6 Smoothed density
8.6.1 Interpreting the y-axis
8.6.2 Densities permit stratification
8.7 Exercises
8.8 The normal distribution
8.9 Standard units
8.10 Quantile-quantile plots
8.11 Percentiles
8.12 Boxplots
8.13 Stratification
8.14 Case study: describing student heights (continued)
8.15 Exercises
8.16 ggplot2 geometries
8.16.1 Barplots
8.16.2 Histograms
8.16.3 Density plots
8.16.4 Boxplots
8.16.5 QQ-plots
8.16.6 Images
8.16.7 Quick plots
8.17 Exercises
9 Data visualization in practice
9.1 Case study: new insights on poverty
9.1.1 Hans Rosling's quiz
9.2 Scatterplots
9.3 Faceting
9.3.1 facet_wrap
9.3.2 Fixed scales for better comparisons
9.4 Time series plots
9.4.1 Labels instead of legends
9.5 Data transformations
9.5.1 Log transformation
9.5.2 Which base?
9.5.3 Transform the values or the scale?
9.6 Visualizing multimodal distributions
9.7 Comparing multiple distributions with boxplots and ridge plots
9.7.1 Boxplots
9.7.2 Ridge plots
9.7.3 Example: 1970 versus 2010 income distributions
9.7.4 Accessing computed variables
9.7.5 Weighted densities
9.8 The ecological fallacy and importance of showing the data
9.8.1 Logistic transformation
9.8.2 Show the data
10 Data visualization principles
10.1 Encoding data using visual cues
10.2 Know when to include 0
10.3 Do not distort quantities
10.4 Order categories by a meaningful value
10.5 Show the data
10.6 Ease comparisons
10.6.1 Use common axes
10.6.2 Align plots vertically to see horizontal changes and horizontally to see vertical changes
10.6.3 Consider transformations
10.6.4 Visual cues to be compared should be adjacent
10.6.5 Use color
10.7 Think of the color blind
10.8 Plots for two variables
10.8.1 Slope charts
10.8.2 Bland-Altman plot
10.9 Encoding a third variable
10.10 Avoid pseudo-three-dimensional plots
10.11 Avoid too many significant digits
10.12 Know your audience
10.13 Exercises
10.14 Case study: vaccines and infectious diseases
10.15 Exercises
11 Robust summaries
11.1 Outliers
11.2 Median
11.3 The interquartile range (IQR)
11.4 Tukey's definition of an outlier
11.5 Median absolute deviation
11.6 Exercises
11.7 Case study: self-reported student heights
III Statistics with R
12 Introduction to statistics with R
13 Probability
13.1 Discrete probability
13.1.1 Relative frequency
13.1.2 Notation
13.1.3 Probability distributions
13.2 Monte Carlo simulations for categorical data
13.2.1 Setting the random seed
13.2.2 With and without replacement
13.3 Independence
13.4 Conditional probabilities
13.5 Addition and multiplication rules
13.5.1 Multiplication rule
13.5.2 Multiplication rule under independence
13.5.3 Addition rule
13.6 Combinations and permutations
13.6.1 Monte Carlo example
13.7 Examples
13.7.1 Monty Hall problem
13.7.2 Birthday problem
13.8 Infinity in practice
13.9 Exercises
13.10 Continuous probability
13.11 Theoretical continuous distributions
13.11.1 Theoretical distributions as approximations
13.11.2 The probability density
13.12 Monte Carlo simulations for continuous variables
13.13 Continuous distributions
13.14 Exercises
14 Random variables
14.1 Random variables
14.2 Sampling models
14.3 The probability distribution of a random variable
14.4 Distributions versus probability distributions
14.5 Notation for random variables
14.6 The expected value and standard error
14.6.1 Population SD versus the sample SD
14.7 Central Limit Theorem
14.7.1 How large is large in the Central Limit Theorem?
14.8 Statistical properties of averages
14.9 Law of large numbers
14.9.1 Misinterpreting law of averages
14.10 Exercises
14.11 Case study: The Big Short
14.11.1 Interest rates explained with chance model
14.11.2 The Big Short
14.12 Exercises
15 Statistical inference
15.1 Polls
15.1.1 The sampling model for polls
15.2 Populations, samples, parameters, and estimates
15.2.1 The sample average
15.2.2 Parameters
15.2.3 Polling versus forecasting
15.2.4 Properties of our estimate: expected value and standard error
15.3 Exercises
15.4 Central Limit Theorem in practice
15.4.1 A Monte Carlo simulation
15.4.2 The spread
15.4.3 Bias: why not run a very large poll?
15.5 Exercises
15.6 Confidence intervals
15.6.1 A Monte Carlo simulation
15.6.2 The correct language
15.7 Exercises
15.8 Power
15.9 p-values
15.10 Association tests
15.10.1 Lady Tasting Tea
15.10.2 Two-by-two tables
15.10.3 Chi-square Test
15.10.4 The odds ratio
15.10.5 Confidence intervals for the odds ratio
15.10.6 Small count correction
15.10.7 Large samples, small p-values
15.11 Exercises
16 Statistical models
16.1 Poll aggregators
16.1.1 Poll data
16.1.2 Pollster bias
16.2 Data-driven models
16.3 Exercises
16.4 Bayesian statistics
16.4.1 Bayes theorem
16.5 Bayes theorem simulation
16.5.1 Bayes in practice
16.6 Hierarchical models
16.7 Exercises
16.8 Case study: election forecasting
16.8.1 Bayesian approach
16.8.2 The general bias
16.8.3 Mathematical representations of models
16.8.4 Predicting the electoral college
16.8.5 Forecasting
16.9 Exercises
16.10 The t-distribution
17 Regression
17.1 Case study: is height hereditary?
17.2 The correlation coefficient
17.2.1 Sample correlation is a random variable
17.2.2 Correlation is not always a useful summary
17.3 Conditional expectations
17.4 The regression line
17.4.1 Regression improves precision
17.4.2 Bivariate normal distribution (advanced)
17.4.3 Variance explained
17.4.4 Warning: there are two regression lines
17.5 Exercises
18 Linear models
18.1 Case study: Moneyball
18.1.1 Sabermetrics
18.1.2 Baseball basics
18.1.3 No awards for BB
18.1.4 Base on balls or stolen bases?
18.1.5 Regression applied to baseball statistics
18.2 Confounding
18.2.1 Understanding confounding through stratification
18.2.2 Multivariate regression
18.3 Least squares estimates
18.3.1 Interpreting linear models
18.3.2 Least Squares Estimates (LSE)
18.3.3 The lm function
18.3.4 LSE are random variables
18.3.5 Predicted values are random variables
18.4 Exercises
18.5 Linear regression in the tidyverse
18.5.1 The broom package
18.6 Exercises
18.7 Case study: Moneyball (continued)
18.7.1 Adding salary and position information
18.7.2 Picking nine players
18.8 The regression fallacy
18.9 Measurement error models
18.10 Exercises
19 Association is not causation
19.1 Spurious correlation
19.2 Outliers
19.3 Reversing cause and effect
19.4 Confounders
19.4.1 Example: UC Berkeley admissions
19.4.2 Confounding explained graphically
19.4.3 Average after stratifying
19.5 Simpson's paradox
19.6 Exercises
IV Data Wrangling
20 Introduction to data wrangling
21 Reshaping data
21.1 gather
21.2 spread
21.3 separate
21.4 unite
21.5 Exercises
22 Joining tables
22.1 Joins
22.1.1 Left join
22.1.2 Right join
22.1.3 Inner join
22.1.4 Full join
22.1.5 Semi join
22.1.6 Anti join
22.2 Binding
22.2.1 Binding columns
22.2.2 Binding by rows
22.3 Set operators
22.3.1 Intersect
22.3.2 Union
22.3.3 setdiff
22.3.4 setequal
22.4 Exercises
23 Web scraping
23.1 HTML
23.2 The rvest package
23.3 CSS selectors
23.4 JSON
23.5 Exercises
24 String processing
24.1 The stringr package
24.2 Case study 1: US murders data
24.3 Case study 2: self-reported heights
24.4 How to escape when defining strings
24.5 Regular expressions
24.5.1 Strings are a regexp
24.5.2 Special characters
24.5.3 Character classes
24.5.4 Anchors
24.5.5 Quantifiers
24.5.6 White space \s
24.5.7 Quantifiers: *, ?, +
24.5.8 Not
24.5.9 Groups
24.6 Search and replace with regex
24.6.1 Search and replace using groups
24.7 Testing and improving
24.8 Trimming
24.9 Changing lettercase
24.10 Case study 2: self-reported heights (continued)
24.10.1 The extract function
24.10.2 Putting it all together
24.11 String splitting
24.12 Case study 3: extracting tables from a PDF
24.13 Recoding
24.14 Exercises
25 Parsing dates and times
25.1 The date data type
25.2 The lubridate package
25.3 Exercises
26 Text mining
26.1 Case study: Trump tweets
26.2 Text as data
26.3 Sentiment analysis
26.4 Exercises
V Machine Learning
27 Introduction to machine learning
27.1 Notation
27.2 An example
27.3 Exercises
27.4 Evaluation metrics
27.4.1 Training and test sets
27.4.2 Overall accuracy
27.4.3 The confusion matrix
27.4.4 Sensitivity and specificity
27.4.5 Balanced accuracy and F1 score
27.4.6 Prevalence matters in practice
27.4.7 ROC and precision-recall curves
27.4.8 The loss function
27.5 Exercises
27.6 Conditional probabilities and expectations
27.6.1 Conditional probabilities
27.6.2 Conditional expectations
27.6.3 Conditional expectation minimizes squared loss function
27.7 Exercises
27.8 Case study: is it a 2 or a 7?
28 Smoothing
28.1 Bin smoothing
28.2 Kernels
28.3 Local weighted regression (loess)
28.3.1 Fitting parabolas
28.3.2 Beware of default smoothing parameters
28.4 Connecting smoothing to machine learning
28.5 Exercises
29 Cross validation
29.1 Motivation with k-nearest neighbors
29.1.1 Over-training
29.1.2 Over-smoothing
29.1.3 Picking the k in kNN
29.2 Mathematical description of cross validation
29.3 K-fold cross validation
29.4 Exercises
29.5 Bootstrap
29.6 Exercises
30 The caret package
30.1 The caret train function
30.2 Cross validation
30.3 Example: fitting with loess
31 Examples of algorithms
31.1 Linear regression
31.1.1 The predict function
31.2 Exercises
31.3 Logistic regression
31.3.1 Generalized linear models
31.3.2 Logistic regression with more than one predictor
31.4 Exercises
31.5 k-nearest neighbors
31.6 Exercises
31.7 Generative models
31.7.1 Naive Bayes
31.7.2 Controlling prevalence
31.7.3 Quadratic discriminant analysis
31.7.4 Linear discriminant analysis
31.7.5 Connection to distance
31.8 Case study: more than three classes
31.9 Exercises
31.10 Classification and regression trees (CART)
31.10.1 The curse of dimensionality
31.10.2 CART motivation
31.10.3 Regression trees
31.10.4 Classification (decision) trees
31.11 Random forests
31.12 Exercises
32 Machine learning in practice
32.1 Preprocessing
32.2 k-nearest neighbor and random forest
32.3 Variable importance
32.4 Visual assessments
32.5 Ensembles
32.6 Exercises
33 Large datasets
33.1 Matrix algebra
33.1.1 Notation
33.1.2 Converting a vector to a matrix
33.1.3 Row and column summaries
33.1.4 apply
33.1.5 Filtering columns based on summaries
33.1.6 Indexing with matrices
33.1.7 Binarizing the data
33.1.8 Vectorization for matrices
33.1.9 Matrix algebra operations
33.2 Exercises
33.3 Distance
33.3.1 Euclidean distance
33.3.2 Distance in higher dimensions
33.3.3 Euclidean distance example
33.3.4 Predictor space
33.3.5 Distance between predictors
33.4 Exercises
33.5 Dimension reduction
33.5.1 Preserving distance
33.5.2 Linear transformations (advanced)
33.5.3 Orthogonal transformations (advanced)
33.5.4 Principal component analysis
33.5.5 Iris example
33.5.6 MNIST example
33.6 Exercises
33.7 Recommendation systems
33.7.1 Movielens data
33.7.2 Recommendation systems as a machine learning challenge
33.7.3 Loss function
33.7.4 A first model
33.7.5 Modeling movie effects
33.7.6 User effects
33.8 Exercises
33.9 Regularization
33.9.1 Motivation
33.9.2 Penalized least squares
33.9.3 Choosing the penalty terms
33.10 Exercises
33.11 Matrix factorization
33.11.1 Factor analysis
33.11.2 Connection to SVD and PCA
33.12 Exercises
34 Clustering
34.1 Hierarchical clustering
34.2 k-means
34.3 Heatmaps
34.4 Filtering features
34.5 Exercises
VI Productivity Tools
35 Introduction to productivity tools
36 Organizing with Unix
36.1 Naming convention
36.2 The terminal
36.3 The filesystem
36.3.1 Directories and subdirectories
36.3.2 The home directory
36.3.3 Working directory
36.3.4 Paths
36.4 Unix commands
36.4.1 ls: Listing directory content
36.4.2 mkdir and rmdir: make and remove a directory
36.4.3 cd: navigating the filesystem by changing directories
36.5 Some examples
36.6 More Unix commands
36.6.1 mv: moving files
36.6.2 cp: copying files
36.6.3 rm: removing files
36.6.4 less: looking at a file
36.7 Preparing for a data science project
36.8 Advanced Unix
36.8.1 Arguments
36.8.2 Getting help
36.8.3 Pipes
36.8.4 Wild cards
36.8.5 Environment variables
36.8.6 Shells
36.8.7 Executables
36.8.8 Permissions and file types
36.8.9 Commands you should learn
36.8.10 File manipulation in R
37 Git and GitHub
37.1 Why use Git and GitHub?
37.2 GitHub accounts
37.3 GitHub repositories
37.4 Overview of Git
37.4.1 Clone
37.5 Initializing a Git directory
37.6 Using Git and GitHub in RStudio
38 Reproducible projects with RStudio and R markdown
38.1 RStudio projects
38.2 R markdown
38.2.1 The header
38.2.2 R code chunks
38.2.3 Global options
38.2.4 knitR
38.2.5 More on R markdown
38.3 Organizing a data science project
38.3.1 Create directories in Unix
38.3.2 Create an RStudio project
38.3.3 Edit some R scripts
38.3.4 Create some more directories using Unix
38.3.5 Add a README file
38.3.6 Initializing a Git directory
38.3.7 Add, commit, and push files using RStudio
Index
About the author

Rafael A. Irizarry is professor of data sciences at the Dana-Farber Cancer Institute, professor of biostatistics at Harvard, and a fellow of the American Statistical Association. Dr. Irizarry is an applied statistician and during the last 20 years has worked in diverse areas, including genomics, sound engineering, and public health. He disseminates solutions to data analysis challenges as open source software; these tools are widely downloaded and used. Prof. Irizarry has also developed and taught several data science courses at Harvard as well as popular online courses.