
E-book: Cody's Data Cleaning Techniques Using SAS, Third Edition

  • Length: 234 pages
  • Publication date: 15-Mar-2017
  • Publisher: SAS Institute
  • Language: English
  • ISBN-13: 9781635260694
  • Format: PDF+DRM
  • Price: 33.92 €*
  • * the price is final, i.e. no further discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste): not allowed

  • Printing: not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means that you need to install special software to read it. You also need to create an Adobe ID. More information here. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), install this free app: PocketBook Reader (iOS / Android).

    To read on a PC or Mac, install Adobe Digital Editions (a free application designed specifically for reading e-books; not to be confused with Adobe Reader, which is most likely already installed on your computer).

    This e-book cannot be read on an Amazon Kindle.

Find errors and clean up data easily using SAS®! Thoroughly updated, Cody's Data Cleaning Techniques Using SAS, Third Edition, addresses a task that nearly every data analyst faces: making sure that data errors are located and corrected. Written in Ron Cody's signature informal, tutorial style, this book develops and demonstrates data cleaning programs and macros that you can use as written or modify, making your job of data cleaning easier, faster, and more efficient.

Building on the author's experience from teaching a data cleaning course for over 10 years, as well as on advances in SAS, this third edition includes four new chapters, covering topics such as the use of Perl regular expressions for checking the format of character values (such as ZIP codes or email addresses) and how to standardize company names and addresses.

With this book, you will learn how to:

  • find and correct errors in character and numeric values
  • develop programming techniques related to dates and missing values
  • deal with highly skewed data
  • develop techniques for correcting your data errors
  • use integrity constraints and audit trails to prevent errors from being added to a clean data set
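To give a flavor of the regular-expression checks covered in Chapter 2, here is a minimal sketch (not taken from the book) of a DATA step that uses PRXMATCH to flag invalid US ZIP codes. The library mylib, the data set customers, and the variables CustID and Zip are hypothetical names used only for illustration.

    /* Hypothetical input: mylib.customers with variables CustID and Zip.     */
    data zip_errors;
       set mylib.customers;
       /* Keep only rows whose Zip fails the pattern: five digits, optionally */
       /* followed by a hyphen and four more digits (ZIP+4).                  */
       if prxmatch('/^\d{5}(-\d{4})?$/', strip(Zip)) = 0 then output;
    run;

    title "Customers with an invalid ZIP code";
    proc print data=zip_errors noobs;
       var CustID Zip;
    run;

PRXMATCH returns the position of the first match, or 0 when the pattern does not match, so any record that fails the check is written to the exception data set for review.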
Table of Contents

List of Programs ix
About This Book xv
About The Author xvii
Acknowledgments xix
Introduction xxi

Chapter 1 Working with Character Data 1
  Introduction 1
  Using PROC FREQ to Detect Character Variable Errors 4
  Changing the Case of All Character Variables in a Data Set 6
  A Summary of Some Character Functions (Useful for Data Cleaning) 8
    Upcase, Lowcase, and Propcase 8
    Notdigit, Notalpha, and Notalnum 8
    Verify 9
    Compbl 9
    Compress 9
    Missing 10
    Trimn and Strip 10
  Checking that a Character Value Conforms to a Pattern 11
  Using a DATA Step to Detect Character Data Errors 12
  Using PROC PRINT with a WHERE Statement to Identify Data Errors 13
  Using Formats to Check for Invalid Values 14
  Creating Permanent Formats 16
  Removing Units from a Value 17
  Removing Non-Printing Characters from a Character Value 18
  Conclusions 19

Chapter 2 Using Perl Regular Expressions to Detect Data Errors 21
  Introduction 21
  Describing the Syntax of Regular Expressions 21
  Checking for Valid ZIP Codes and Canadian Postal Codes 23
  Searching for Invalid Email Addresses 25
  Verifying Phone Numbers 26
  Converting All Phone Numbers to a Standard Form 27
  Developing a Macro to Test Regular Expressions 28
  Conclusions 29

Chapter 3 Standardizing Data 31
  Introduction 31
  Using Formats to Standardize Company Names 31
  Creating a Format from a SAS Data Set 33
  Using TRANWRD and Other Functions to Standardize Addresses 36
  Using Regular Expressions to Help Standardize Addresses 38
  Performing a "Fuzzy" Match between Two Files 40
  Conclusions 44

Chapter 4 Data Cleaning Techniques for Numeric Data 45
  Introduction 45
  Using PROC UNIVARIATE to Examine Numeric Variables 45
  Describing an ODS Option to List Selected Portions of the Output 49
  Listing Output Objects Using the Statement TRACE ON 52
  Using a PROC UNIVARIATE Option to List More Extreme Values 52
  Presenting a Program to List the 10 Highest and Lowest Values 53
  Presenting a Macro to List the n Highest and Lowest Values 55
  Describing Two Programs to List the Highest and Lowest Values by Percentage 58
    Using PROC UNIVARIATE 58
    Presenting a Macro to List the Highest and Lowest n% Values 60
    Using PROC RANK 62
  Using Pre-Determined Ranges to Check for Possible Data Errors 66
  Identifying Invalid Values versus Missing Values 67
  Checking Ranges for Several Variables and Generating a Single Report 69
  Conclusions 72

Chapter 5 Automatic Outlier Detection for Numeric Data 73
  Introduction 73
  Automatic Outlier Detection (Using Means and Standard Deviations) 73
  Detecting Outliers Based on a Trimmed Mean and Standard Deviation 75
  Describing a Program that Uses Trimmed Statistics for Multiple Variables 78
  Presenting a Macro Based on Trimmed Statistics 81
  Detecting Outliers Based on the Interquartile Range 83
  Conclusions 86

Chapter 6 More Advanced Techniques for Finding Errors in Numeric Data 87
  Introduction 87
  Introducing the Banking Data Set 87
  Running the Auto Outliers Macro on Bank Deposits 91
  Identifying Outliers Within Each Account 92
  Using Box Plots to Inspect Suspicious Deposits 95
  Using Regression Techniques to Identify Possible Errors in the Banking Data 99
  Using Regression Diagnostics to Identify Outliers 104
  Conclusions 108

Chapter 7 Describing Issues Related to Missing and Special Values (Such as 999) 109
  Introduction 109
  Inspecting the SAS Log 109
  Using PROC MEANS and PROC FREQ to Count Missing Values 110
    Counting Missing Values for Numeric Variables 110
    Counting Missing Values for Character Variables 111
  Using DATA Step Approaches to Identify and Count Missing Values 113
  Locating Patient Numbers for Records Where Patno Is Either Missing or Invalid 113
  Searching for a Specific Numeric Value 117
  Creating a Macro to Search for Specific Numeric Values 119
  Converting Values Such as 999 to a SAS Missing Value 121
  Conclusions 121

Chapter 8 Working with SAS Dates 123
  Introduction 123
  Changing the Storage Length for SAS Dates 123
  Checking Ranges for Dates (Using a DATA Step) 124
  Checking Ranges for Dates (Using PROC PRINT) 125
  Checking for Invalid Dates 125
  Working with Dates in Nonstandard Form 128
  Creating a SAS Date When the Day of the Month Is Missing 129
  Suspending Error Checking for Known Invalid Dates 131
  Conclusions 131

Chapter 9 Looking for Duplicates and Checking Data with Multiple Observations per Subject 133
  Introduction 133
  Eliminating Duplicates by Using PROC SORT 133
  Demonstrating a Possible Problem with the NODUPRECS Option 136
  Reviewing First. and Last. Variables 138
  Detecting Duplicates by Using DATA Step Approaches 140
  Using PROC FREQ to Detect Duplicate IDs 141
  Working with Data Sets with More Than One Observation per Subject 143
  Identifying Subjects with n Observations Each (DATA Step Approach) 144
  Identifying Subjects with n Observations Each (Using PROC FREQ) 146
  Conclusions 146

Chapter 10 Working with Multiple Files 147
  Introduction 147
  Checking for an ID in Each of Two Files 147
  Checking for an ID in Each of n Files 150
  A Macro for ID Checking 152
  Conclusions 154

Chapter 11 Using PROC COMPARE to Perform Data Verification 155
  Introduction 155
  Conducting a Simple Comparison of Two Data Files 155
  Simulating Double Entry Verification Using PROC COMPARE 160
  Other Features of PROC COMPARE 161
  Conclusions 162

Chapter 12 Correcting Errors 163
  Introduction 163
  Hard Coding Corrections 163
  Describing Named Input 164
  Reviewing the UPDATE Statement 166
  Using the UPDATE Statement to Correct Errors in the Patients Data Set 168
  Conclusions 171

Chapter 13 Creating Integrity Constraints and Audit Trails 173
  Introduction 173
  Demonstrating General Integrity Constraints 174
  Describing PROC APPEND 177
  Demonstrating How Integrity Constraints Block the Addition of Data Errors 178
  Adding Your Own Messages to Violations of an Integrity Constraint 179
  Deleting an Integrity Constraint Using PROC DATASETS 180
  Creating an Audit Trail Data Set 180
  Demonstrating an Integrity Constraint Involving More Than One Variable 183
  Demonstrating a Referential Constraint 186
  Attempting to Delete a Primary Key When a Foreign Key Still Exists 188
  Attempting to Add a Name to the Child Data Set 190
  Demonstrating How to Delete a Referential Constraint 191
  Demonstrating the CASCADE Feature of a Referential Constraint 191
  Demonstrating the SET NULL Feature of a Referential Constraint 192
  Conclusions 193

Chapter 14 A Summary of Useful Data Cleaning Macros 195
  Introduction 195
  A Macro to Test Regular Expressions 195
  A Macro to List the n Highest and Lowest Values of a Variable 196
  A Macro to List the n% Highest and Lowest Values of a Variable 197
  A Macro to Perform Range Checks on Several Variables 198
  A Macro that Uses Trimmed Statistics to Automatically Search for Outliers 200
  A Macro to Search a Data Set for Specific Values Such as 999 202
  A Macro to Check for ID Values in Multiple Data Sets 203
  Conclusions 204

Index 205