
Data Scientist's Guide to Acquiring, Cleaning, and Managing Data in R [Hardcover]

  • Format: Hardback, 312 pages, height x width x depth: 231x155x20 mm, weight: 522 g
  • Publication date: 01-Dec-2017
  • Publisher: John Wiley & Sons Inc
  • ISBN-10: 1119080029
  • ISBN-13: 9781119080022

The only how-to guide offering a unified, systematic approach to acquiring, cleaning, and managing data in R

Every experienced practitioner knows that preparing data for modeling is a painstaking, time-consuming process. Adding to the difficulty is that most modelers learn the steps involved in cleaning and managing data piecemeal, often on the fly, or they develop their own ad hoc methods. This book helps simplify their task by providing a unified, systematic approach to acquiring, modeling, manipulating, cleaning, and maintaining data in R. 

Starting with the very basics, data scientists Samuel E. Buttrey and Lyn R. Whitaker walk readers through the entire process. From what data looks like and what it should look like, they progress through all the steps involved in getting data ready for modeling. They describe best practices for acquiring data from numerous sources; explore key issues in data handling, including text/regular expressions, big data, parallel processing, merging, matching, and checking for duplicates; and outline highly efficient and reliable techniques for documenting data and recordkeeping, including audit trails, getting data back out of R, and more.

  • The only single-source guide to R data and its preparation, it describes best practices for acquiring, manipulating, cleaning, and maintaining data
  • Begins with the basics and walks readers through all the steps necessary to get data ready for the modeling process
  • Provides expert guidance on how to document the processes described so that they are reproducible
  • Written by seasoned professionals, it provides both introductory and advanced techniques
  • Features case studies with supporting data and R code, hosted on a companion website

A Data Scientist's Guide to Acquiring, Cleaning, and Managing Data in R is a valuable working resource/bench manual for practitioners who collect and analyze data, lab scientists and research associates of all levels of experience, and graduate-level data mining students.

About the Authors
Preface
Acknowledgments
1 R
1.1 Introduction
1.1.1 What Is R?
1.1.2 Who Uses R and Why?
1.1.3 Acquiring and Installing R
1.1.4 Starting and Quitting R
1.2 Data
1.2.1 Acquiring Data
1.2.2 Cleaning Data
1.2.3 The Goal of Data Cleaning
1.2.4 Making Your Work Reproducible
1.3 The Very Basics of R
1.3.1 Top Ten Quick Facts You Need to Know about R
1.3.2 Vocabulary
1.3.3 Calculating and Printing in R
1.4 Running an R Session
1.4.1 Where Your Data Is Stored
1.4.2 Options
1.4.3 Scripts
1.4.4 R Packages
1.4.5 RStudio and Other GUIs
1.4.6 Locales and Character Sets
1.5 Getting Help
1.5.1 At the Command Line
1.5.2 The Online Manuals
1.5.3 On the Internet
1.5.4 Further Reading
1.6 How to Use This Book
1.6.1 Syntax and Conventions in This Book
1.6.2 The Chapters
2 R Data, Part 1: Vectors
2.1 Vectors
2.1.1 Creating Vectors
2.1.2 Sequences
2.1.3 Logical Vectors
2.1.4 Vector Operations
2.1.5 Names
2.2 Data Types
2.2.1 Some Less-Common Data Types
2.2.2 What Type of Vector Is This?
2.2.3 Converting from One Type to Another
2.3 Subsets of Vectors
2.3.1 Extracting
2.3.2 Vectors of Length 0
2.3.3 Assigning or Replacing Elements of a Vector
2.4 Missing Data (NA) and Other Special Values
2.4.1 The Effect of NAs in Expressions
2.4.2 Identifying and Removing or Replacing NAs
2.4.3 Indexing with NAs
2.4.4 NaN and Inf Values
2.4.5 NULL Values
2.5 The table() Function
2.5.1 Two- and Higher-Way Tables
2.5.2 Operating on Elements of a Table
2.6 Other Actions on Vectors
2.6.1 Rounding
2.6.2 Sorting and Ordering
2.6.3 Vectors as Sets
2.6.4 Identifying Duplicates and Matching
2.6.5 Finding Runs of Duplicate Values
2.7 Long Vectors and Big Data
2.8 Chapter Summary and Critical Data Handling Tools
3 R Data, Part 2: More Complicated Structures
3.1 Introduction
3.2 Matrices
3.2.1 Extracting and Assigning
3.2.2 Row and Column Names
3.2.3 Applying a Function to Rows or Columns
3.2.4 Missing Values in Matrices
3.2.5 Using a Matrix Subscript
3.2.6 Sparse Matrices
3.2.7 Three- and Higher-Way Arrays
3.3 Lists
3.3.1 Extracting and Assigning
3.3.2 Lists in Practice
3.4 Data Frames
3.4.1 Missing Values in Data Frames
3.4.2 Extracting and Assigning in Data Frames
3.4.3 Extracting Things That Aren't There
3.5 Operating on Lists and Data Frames
3.5.1 Split, Apply, Combine
3.5.2 All-Numeric Data Frames
3.5.3 Convenience Functions
3.5.4 Re-Ordering, De-Duplicating, and Sampling from Data Frames
3.6 Date and Time Objects
3.6.1 Formatting Dates
3.6.2 Common Operations on Date Objects
3.6.3 Differences between Dates
3.6.4 Dates and Times
3.6.5 Creating POSIXt Objects
3.6.6 Mathematical Functions for Date and Times
3.6.7 Missing Values in Dates
3.6.8 Using Apply Functions with Dates and Times
3.7 Other Actions on Data Frames
3.7.1 Combining by Rows or Columns
3.7.2 Merging Data Frames
3.7.3 Comparing Two Data Frames
3.7.4 Viewing and Editing Data Frames Interactively
3.8 Handling Big Data
3.9 Chapter Summary and Critical Data Handling Tools
4 R Data, Part 3: Text and Factors
4.1 Character Data
4.1.1 The length() and nchar() Functions
4.1.2 Tab, New-Line, Quote, and Backslash Characters
4.1.3 The Empty String
4.1.4 Substrings
4.1.5 Changing Case and Other Substitutions
4.2 Converting Numbers into Text
4.2.1 Formatting Numbers
4.2.2 Scientific Notation
4.2.3 Discretizing a Numeric Variable
4.3 Constructing Character Strings: Paste in Action
4.3.1 Constructing Column Names
4.3.2 Tabulating Dates by Year and Month or Quarter Labels
4.3.3 Constructing Unique Keys
4.3.4 Constructing File and Path Names
4.4 Regular Expressions
4.4.1 Types of Regular Expressions
4.4.2 Tools for Regular Expressions in R
4.4.3 Special Characters in Regular Expressions
4.4.4 Examples
4.4.5 The regexpr() Function and Its Variants
4.4.6 Using Regular Expressions in Replacement
4.4.7 Splitting Strings at Regular Expressions
4.4.8 Regular Expressions versus Wildcard Matching
4.4.9 Common Data Cleaning Tasks Using Regular Expressions
4.4.10 Documenting and Debugging Regular Expressions
4.5 UTF-8 and Other Non-ASCII Characters
4.5.1 Extended ASCII for Latin Alphabets
4.5.2 Non-Latin Alphabets
4.5.3 Character and String Encoding in R
4.6 Factors
4.6.1 What Is a Factor?
4.6.2 Factor Levels
4.6.3 Converting and Combining Factors
4.6.4 Missing Values in Factors
4.6.5 Factors in Data Frames
4.7 R Object Names and Commands as Text
4.7.1 R Object Names as Text
4.7.2 R Commands as Text
4.8 Chapter Summary and Critical Data Handling Tools
5 Writing Functions and Scripts
5.1 Functions
5.1.1 Function Arguments
5.1.2 Global versus Local Variables
5.1.3 Return Values
5.1.4 Creating and Editing Functions
5.2 Scripts and Shell Scripts
5.2.1 Line-by-Line Parsing
5.3 Error Handling and Debugging
5.3.1 Debugging Functions
5.3.2 Issuing Error and Warning Messages
5.3.3 Catching and Processing Errors
5.4 Interacting with the Operating System
5.4.1 File and Directory Handling
5.4.2 Environment Variables
5.5 Speeding Things Up
5.5.1 Profiling
5.5.2 Vectorizing Functions
5.5.3 Other Techniques to Speed Things Up
5.6 Chapter Summary and Critical Data Handling Tools
5.6.1 Programming Style
5.6.2 Common Bugs
5.6.3 Objects, Classes, and Methods
6 Getting Data into and out of R
6.1 Reading Tabular ASCII Data into Data Frames
6.1.1 Files with Delimiters
6.1.2 Column Classes
6.1.3 Common Pitfalls in Reading Tables
6.1.4 An Example of When read.table() Fails
6.1.5 Other Uses of the scan() Function
6.1.6 Writing Delimited Files
6.1.7 Reading and Writing Fixed-Width Files
6.1.8 A Note on End-of-Line Characters
6.2 Reading Large, Non-Tabular, or Non-ASCII Data
6.2.1 Opening and Closing Files
6.2.2 Reading and Writing Lines
6.2.3 Reading and Writing UTF-8 and Other Encodings
6.2.4 The Null Character
6.2.5 Binary Data
6.2.6 Reading Problem Files in Action
6.3 Reading Data From Relational Databases
6.3.1 Connecting to the Database Server
6.3.2 Introduction to SQL
6.4 Handling Large Numbers of Input Files
6.5 Other Formats
6.5.1 Using the Clipboard
6.5.2 Reading Data from Spreadsheets
6.5.3 Reading Data from the Web
6.5.4 Reading Data from Other Statistical Packages
6.6 Reading and Writing R Data Directly
6.7 Chapter Summary and Critical Data Handling Tools
7 Data Handling in Practice
7.1 Acquiring and Reading Data
7.2 Cleaning Data
7.3 Combining Data
7.3.1 Combining by Row
7.3.2 Combining by Column
7.3.3 Merging by Key
7.4 Transactional Data
7.4.1 Example of Transactional Data
7.4.2 Combining Tabular and Transactional Data
7.5 Preparing Data
7.6 Documentation and Reproducibility
7.7 The Role of Judgment
7.8 Data Cleaning in Action
7.8.1 Reading and Cleaning BedBath1.csv
7.8.2 Reading and Cleaning BedBath2.csv
7.8.3 Combining the BedBath Data Frames
7.8.4 Reading and Cleaning EnergyUsage.csv
7.8.5 Merging the BedBath and EnergyUsage Data Frames
7.9 Chapter Summary and Critical Data Handling Tools
8 Extended Exercise
8.1 Introduction to the Problem
8.1.1 The Goal
8.1.2 Modeling Considerations
8.1.3 Examples of Things to Check
8.2 The Data
8.3 Five Important Fields
8.4 Loan and Application Portfolios
8.4.1 Layout of the Beachside Lenders Data
8.4.2 Layout of the Wilson and Sons Data
8.4.3 Combining the Two Portfolios
8.5 Scores
8.5.1 Scores Layout
8.6 Co-borrower Scores
8.6.1 Co-borrower Score Examples
8.7 Updated KScores
8.7.1 Updated KScores Layout
8.8 Loans to Be Excluded
8.8.1 Sample Exclusion File
8.9 Response Variable
8.10 Assembling the Final Data Sets
8.10.1 Final Data Layout
8.10.2 Concluding Remarks
A Hints and Pseudocode
A.1 Loan Portfolios
A.1.1 Things to Check
A.2 Scores Database
A.2.1 Things to Check
A.3 Co-borrower Scores
A.3.1 Things to Check
A.4 Updated KScores
A.4.1 Things to Check
A.5 Excluder Files
A.5.1 Things to Check
A.6 Payment Matrix
A.6.1 Things to Check
A.7 Starting the Modeling Process
Bibliography
Index

SAMUEL E. BUTTREY, PhD, is an Associate Professor of Operations Research at the Naval Postgraduate School, Monterey, California, USA.

LYN R. WHITAKER, PhD, is an Associate Professor of Operations Research at the Naval Postgraduate School, Monterey, California, USA.