Modern Data Science with R, 2nd edition [Hardcover]

(Amherst College, Amherst, MA), (Smith College, Northampton, MA), (Smith College, Northampton, MA)

Modern Data Science with R is a comprehensive data science textbook for undergraduates that incorporates statistical and computational thinking to solve real-world data problems. Rather than focus exclusively on case studies or programming syntax, this book illustrates how statistical programming in the state-of-the-art R/RStudio computing environment can be leveraged to extract meaningful information from a variety of data in the service of addressing compelling questions.

The second edition is updated to reflect the growing influence of the tidyverse set of packages. All code in the book has been revised and styled to be more readable and easier to understand. New functionality from packages like sf, purrr, tidymodels, and tidytext is now integrated into the text. All chapters have been revised, and several have been split, re-organized, or re-imagined to meet the shifting landscape of best practice.

From a review of the first edition: "Modern Data Science with R ... is rich with examples and is guided by a strong narrative voice. What's more, it presents an organizing framework that makes a convincing argument that data science is a course distinct from applied statistics" (The American Statistician).

Reviews

"This text continues to be fantastic! There are a number of courses for which I would require this book and others that I would recommend it as a supplement. I would likely require it for courses focused on computing in R or courses in data science. I would include it as a recommended text in introductory and other statistics courses that used R as the software of choice, where this text could be used as a supplemental resource in how to use R to work with data." -Hunter Glanz, Cal Poly San Luis Obispo

"Easy for students to read and relate to the exercises and examples. Many questions and hands-on activities with data sets to practice skills." -Lynn Collen, St. Cloud State University

"I used the first edition of this book as the primary text for an intermediate data science course a few years ago and I liked it very much. I think that the technical breadth, writing style, and level of difficulty are very clear strengths. Also, my students and I found the `tidyverse` approach to be particularly well-suited for teaching and learning R, and I love that the MDSR book includes such complete code. Students can program everything they see in the book, and often times there are tips & tricks for them to discover along the way just by studying expert code provided by the authors. This really sets MDSR apart from other books I considered for the course." -Matthew Beckman, Penn State University

"[...] To answer a wide range of modern research questions, this book by Baumer, Kaplan, and Horton features an excellent introduction to data wrangling, visualization, statistical modeling, machine learning, and other advanced statistical applications through the RStudio environment following the tidyverse syntax. [...] Overall, Modern Data Science with R, 2nd edition serves as an excellent introductory resource to help develop techniques to extract, transform, visualize, and learn from datasets through the R environment. It focuses on implementing those techniques in R and does not provide a theoretical background for the discussed methods. The book will be a perfect reference for a broad audience ranging from undergraduates in data science courses to advanced graduate students and professionals from a variety of research fields." -Kohma Arai and Vyacheslav Lyubchich, in Technometrics, July 2022

"Overall, I enjoyed reading this book. The authors were very good at creating a complete tool for studying data science. Therefore, I recommend this book, for its content, writing, and organization, to graduate students in data science and statistics. I also recommend the book to professionals who should prepare themselves for the challenges they are going to face in the future with the voluminous and heterogenous amount of data that should be timely analyzed to extract meaningful information to guide action." -Georgios Nikolopoulos, in ISCB News, June 2022

"The authors have successfully completed the job of choosing the content with relevant topics and, deciding the extent of knowledge to be delivered, and finally, putting them in an understandable sequence. This is a well-written book and does not cover much theory. ... The book's second edition contents are updated, expanded, revised, split, rewritten and rearranged compared to the first edition. The key changes are the use of recently developed R packages, ... (and) updated exercises in the chapters ..." -Shalabh, in Journal of the Royal Statistical Society Series A, August 2021

"[This book] provides an excellent basis for statisticians who want to dig deeper into, for example, data handling, for computer scientists who aim to strengthen their knowledge of statistical methods as well as for all other researchers who are interested in data science in general. ... Each section is structured as an interplay between R-code and explanatory text for understanding. The division into several stand-alone segments is an advantage, because the reader may easily choose the section she or he is interested in without missing relevant information. A key feature of the book is its focus on different example data sets that are available via R-packages or from URLs that are embedded in the text. These data sets are used to illustrate the methodology presented using R-code. Their availability allows the reader to reproduce the code while working with the book. ... It can be warmly recommended to practical researchers who seek a comprehensive overview of different topics in data science with focus on implementations in R." -Annika Hoyer, in Biometrical Journal, August 2021

"The authors have covered almost all aspects of data science, a revolutionary field that marries elements of computational thinking and traditional statistical theory. The book can thus equip the readers with the necessary knowledge and skills to extract data from a variety of sources, restructure observations in a form that allows analysis, store data in efficient databases, and work effectively on massive and complex data sets in order to produce actionable information." - Georgios Nikolopoulos, University of Cyprus, ISCB Book Reviews, June 2022.

About the Authors
Preface
Part I Introduction to Data Science
1 Prologue: Why data science?
1.1 What is data science?
1.2 Case study: The evolution of sabermetrics
1.3 Datasets
1.4 Further resources
2 Data visualization
2.1 The 2012 federal election cycle
2.2 Composing data graphics
2.3 Importance of data graphics: Challenger
2.4 Creating effective presentations
2.5 The wider world of data visualization
2.6 Further resources
2.7 Exercises
2.8 Supplementary exercises
3 A grammar for graphics
3.1 A grammar for data graphics
3.2 Canonical data graphics in R
3.3 Extended example: Historical baby names
3.4 Further resources
3.5 Exercises
3.6 Supplementary exercises
4 Data wrangling on one table
4.1 A grammar for data wrangling
4.2 Extended example: Ben's time with the Mets
4.3 Further resources
4.4 Exercises
4.5 Supplementary exercises
5 Data wrangling on multiple tables
5.1 inner_join()
5.2 left_join()
5.3 Extended example: Manny Ramirez
5.4 Further resources
5.5 Exercises
5.6 Supplementary exercises
6 Tidy data
6.1 Tidy data
6.2 Reshaping data
6.3 Naming conventions
6.4 Data intake
6.5 Further resources
6.6 Exercises
6.7 Supplementary exercises
7 Iteration
7.1 Vectorized operations
7.2 Using across() with dplyr functions
7.3 The map() family of functions
7.4 Iterating over a one-dimensional vector
7.5 Iteration over subgroups
7.6 Simulation
7.7 Extended example: Factors associated with BMI
7.8 Further resources
7.9 Exercises
7.10 Supplementary exercises
8 Data science ethics
8.1 Introduction
8.2 Truthful falsehoods
8.3 Role of data science in society
8.4 Some settings for professional ethics
8.5 Some principles to guide ethical action
8.6 Algorithmic bias
8.7 Data and disclosure
8.8 Reproducibility
8.9 Ethics, collectively
8.10 Professional guidelines for ethical conduct
8.11 Further resources
8.12 Exercises
8.13 Supplementary exercises
Part II Statistics and Modeling
9 Statistical foundations
9.1 Samples and populations
9.2 Sample statistics
9.3 The bootstrap
9.4 Outliers
9.5 Statistical models: Explaining variation
9.6 Confounding and accounting for other factors
9.7 The perils of p-values
9.8 Further resources
9.9 Exercises
9.10 Supplementary exercises
10 Predictive modeling
10.1 Predictive modeling
10.2 Simple classification models
10.3 Evaluating models
10.4 Extended example: Who has diabetes?
10.5 Further resources
10.6 Exercises
10.7 Supplementary exercises
11 Supervised learning
11.1 Non-regression classifiers
11.2 Parameter tuning
11.3 Example: Evaluation of income models redux
11.4 Extended example: Who has diabetes this time?
11.5 Regularization
11.6 Further resources
11.7 Exercises
11.8 Supplementary exercises
12 Unsupervised learning
12.1 Clustering
12.2 Dimension reduction
12.3 Further resources
12.4 Exercises
12.5 Supplementary exercises
13 Simulation
13.1 Reasoning in reverse
13.2 Extended example: Grouping cancers
13.3 Randomizing functions
13.4 Simulating variability
13.5 Random networks
13.6 Key principles of simulation
13.7 Further resources
13.8 Exercises
13.9 Supplementary exercises
Part III Topics in Data Science
14 Dynamic and customized data graphics
14.1 Rich Web content using D3.js and htmlwidgets
14.2 Animation
14.3 Flexdashboard
14.4 Interactive web apps with Shiny
14.5 Customization of ggplot2 graphics
14.6 Extended example: Hot dog eating
14.7 Further resources
14.8 Exercises
14.9 Supplementary exercises
15 Database querying using SQL
15.1 From dplyr to SQL
15.2 Flat-file databases
15.3 The SQL universe
15.4 The SQL data manipulation language
15.5 Extended example: FiveThirtyEight flights
15.6 SQL vs. R
15.7 Further resources
15.8 Exercises
15.9 Supplementary exercises
16 Database administration
16.1 Constructing efficient SQL databases
16.2 Changing SQL data
16.3 Extended example: Building a database
16.4 Scalability
16.5 Further resources
16.6 Exercises
16.7 Supplementary exercises
17 Working with geospatial data
17.1 Motivation: What's so great about geospatial data?
17.2 Spatial data structures
17.3 Making maps
17.4 Extended example: Congressional districts
17.5 Effective maps: How (not) to lie
17.6 Projecting polygons
17.7 Playing well with others
17.8 Further resources
17.9 Exercises
17.10 Supplementary exercises
18 Geospatial computations
18.1 Geospatial operations
18.2 Geospatial aggregation
18.3 Geospatial joins
18.4 Extended example: Trail elevations at MacLeish
18.5 Further resources
18.6 Exercises
18.7 Supplementary exercises
19 Text as data
19.1 Regular expressions using Macbeth
19.2 Extended example: Analyzing textual data from arXiv.org
19.3 Ingesting text
19.4 Further resources
19.5 Exercises
19.6 Supplementary exercises
20 Network science
20.1 Introduction to network science
20.2 Extended example: Six degrees of Kristen Stewart
20.3 PageRank
20.4 Extended example: 1996 men's college basketball
20.5 Further resources
20.6 Exercises
20.7 Supplementary exercises
21 Epilogue: Towards "big data"
21.1 Notions of big data
21.2 Tools for bigger data
21.3 Alternatives to R
21.4 Closing thoughts
21.5 Further resources
Part IV Appendices
A Packages used in this book
A.1 The mdsr package
A.2 Other packages
A.3 Further resources
B Introduction to R and RStudio
B.1 Installation
B.2 Learning R
B.3 Fundamental structures and objects
B.4 Add-ons: Packages
B.5 Further resources
B.6 Exercises
B.7 Supplementary exercises
C Algorithmic thinking
C.1 Introduction
C.2 Simple example
C.3 Extended example: Law of large numbers
C.4 Non-standard evaluation
C.5 Debugging and defensive coding
C.6 Further resources
C.7 Exercises
C.8 Supplementary exercises
D Reproducible analysis and workflow
D.1 Scriptable statistical computing
D.2 Reproducible analysis with R Markdown
D.3 Projects and version control
D.4 Further resources
D.5 Exercises
D.6 Supplementary exercises
E Regression modeling
E.1 Simple linear regression
E.2 Multiple regression
E.3 Inference for regression
E.4 Assumptions underlying regression
E.5 Logistic regression
E.6 Further resources
E.7 Exercises
E.8 Supplementary exercises
F Setting up a database server
F.1 SQLite
F.2 MySQL
F.3 PostgreSQL
F.4 Connecting to SQL
Bibliography
Indices
Subject index
R index

Benjamin S. Baumer is an associate professor in the Statistical & Data Sciences program at Smith College. He has been a practicing data scientist since 2004, when he became the first full-time statistical analyst for the New York Mets. Ben is a co-author of The Sabermetric Revolution and Analyzing Baseball Data with R. He received the 2019 Waller Education Award and the 2016 Significant Contributor Award from the Society for American Baseball Research.

Daniel T. Kaplan is the DeWitt Wallace emeritus professor of mathematics and computer science at Macalester College. He is the author of several textbooks on statistical modeling and statistical computing. Danny received the 2006 Macalester Excellence in Teaching award and the 2017 CAUSE Lifetime Achievement Award.

Nicholas J. Horton is Beitzel Professor of Technology and Society (Statistics and Data Science) at Amherst College. He is a Fellow of the ASA and the AAAS, co-chair of the National Academies Committee on Applied and Theoretical Statistics, recipient of a number of national teaching awards, author of a series of books on statistical computing, and actively involved in data science curriculum efforts to help students "think with data".