|
|
ix | |
About This Book |
|
xv | |
About The Author |
|
xvii | |
Acknowledgments |
|
xix | |
Introduction |
|
xxi | |
|
Chapter 1 Working with Character Data |
|
|
1 | (20) |
|
|
1 | (3) |
|
Using PROC FREQ to Detect Character Variable Errors |
|
|
4 | (2) |
|
Changing the Case of All Character Variables in a Data Set |
|
|
6 | (2) |
|
A Summary of Some Character Functions (Useful for Data Cleaning) |
|
|
8 | (3) |
|
Upcase, Lowcase, And Propcase |
|
|
8 | (1) |
|
Notdigit, Notalpha, And Notalnum |
|
|
8 | (1) |
|
|
9 | (1) |
|
|
9 | (1) |
|
|
9 | (1) |
|
|
10 | (1) |
|
|
10 | (1) |
|
Checking that a Character Value Conforms to a Pattern |
|
|
11 | (1) |
|
Using a DATA Step to Detect Character Data Errors |
|
|
12 | (1) |
|
Using PROC PRINT with a WHERE Statement to Identify Data Errors |
|
|
13 | (1) |
|
Using Formats to Check for Invalid Values |
|
|
14 | (2) |
|
Creating Permanent Formats |
|
|
16 | (1) |
|
Removing Units from a Value |
|
|
17 | (1) |
|
Removing Non-Printing Characters from a Character Value |
|
|
18 | (1) |
|
|
19 | (2) |
|
Chapter 2 Using Perl Regular Expressions to Detect Data Errors |
|
|
21 | (10) |
|
|
21 | (1) |
|
Describing the Syntax of Regular Expressions |
|
|
21 | (2) |
|
Checking for Valid ZIP Codes and Canadian Postal Codes |
|
|
23 | (2) |
|
Searching for Invalid Email Addresses |
|
|
25 | (1) |
|
|
26 | (1) |
|
Converting All Phone Numbers to a Standard Form |
|
|
27 | (1) |
|
Developing a Macro to Test Regular Expressions |
|
|
28 | (1) |
|
|
29 | (2) |
|
Chapter 3 Standardizing Data |
|
|
31 | (14) |
|
|
31 | (1) |
|
Using Formats to Standardize Company Names |
|
|
31 | (2) |
|
Creating a Format from a SAS Data Set |
|
|
33 | (3) |
|
Using TRANWRD and Other Functions to Standardize Addresses |
|
|
36 | (2) |
|
Using Regular Expressions to Help Standardize Addresses |
|
|
38 | (2) |
|
Performing a "Fuzzy" Match between Two Files |
|
|
40 | (4) |
|
|
44 | (1) |
|
Chapter 4 Data Cleaning Techniques for Numeric Data |
|
|
45 | (28) |
|
|
45 | (1) |
|
Using PROC UNIVARIATE to Examine Numeric Variables |
|
|
45 | (4) |
|
Describing an ODS Option to List Selected Portions of the Output |
|
|
49 | (3) |
|
Listing Output Objects Using the Statement TRACE ON |
|
|
52 | (1) |
|
Using a PROC UNIVARIATE Option to List More Extreme Values |
|
|
52 | (1) |
|
Presenting a Program to List the 10 Highest and Lowest Values |
|
|
53 | (2) |
|
Presenting a Macro to List the n Highest and Lowest Values |
|
|
55 | (3) |
|
Describing Two Programs to List the Highest and Lowest Values by Percentage |
|
|
58 | (8) |
|
|
58 | (2) |
|
Presenting a Macro to List the Highest and Lowest n% Values |
|
|
60 | (2) |
|
|
62 | (4) |
|
Using Pre-Determined Ranges to Check for Possible Data Errors |
|
|
66 | (1) |
|
Identifying Invalid Values versus Missing Values |
|
|
67 | (2) |
|
Checking Ranges for Several Variables and Generating a Single Report |
|
|
69 | (3) |
|
|
72 | (1) |
|
Chapter 5 Automatic Outlier Detection for Numeric Data |
|
|
73 | (14) |
|
|
73 | (1) |
|
Automatic Outlier Detection (Using Means and Standard Deviations) |
|
|
73 | (2) |
|
Detecting Outliers Based on a Trimmed Mean and Standard Deviation |
|
|
75 | (3) |
|
Describing a Program that Uses Trimmed Statistics for Multiple Variables |
|
|
78 | (3) |
|
Presenting a Macro Based on Trimmed Statistics |
|
|
81 | (2) |
|
Detecting Outliers Based on the Interquartile Range |
|
|
83 | (3) |
|
|
86 | (1) |
|
Chapter 6 More Advanced Techniques for Finding Errors in Numeric Data |
|
|
87 | (22) |
|
|
87 | (1) |
|
Introducing the Banking Data Set |
|
|
87 | (4) |
|
Running the Auto Outliers Macro on Bank Deposits |
|
|
91 | (1) |
|
Identifying Outliers Within Each Account |
|
|
92 | (3) |
|
Using Box Plots to Inspect Suspicious Deposits |
|
|
95 | (4) |
|
Using Regression Techniques to Identify Possible Errors in the Banking Data |
|
|
99 | (5) |
|
Using Regression Diagnostics to Identify Outliers |
|
|
104 | (4) |
|
|
108 | (1) |
|
Chapter 7 Describing Issues Related to Missing and Special Values (Such as 999) |
|
|
109 | (14) |
|
|
109 | (1) |
|
|
109 | (1) |
|
Using PROC MEANS and PROC FREQ to Count Missing Values |
|
|
110 | (3) |
|
Counting Missing Values for Numeric Variables |
|
|
110 | (1) |
|
Counting Missing Values for Character Variables |
|
|
111 | (2) |
|
Using DATA Step Approaches to Identify and Count Missing Values |
|
|
113 | (1) |
|
Locating Patient Numbers for Records Where Patno Is Either Missing or Invalid |
|
|
113 | (4) |
|
Searching for a Specific Numeric Value |
|
|
117 | (2) |
|
Creating a Macro to Search for Specific Numeric Values |
|
|
119 | (2) |
|
Converting Values Such as 999 to a SAS Missing Value |
|
|
121 | (1) |
|
|
121 | (2) |
|
Chapter 8 Working with SAS Dates |
|
|
123 | (10) |
|
|
123 | (1) |
|
Changing the Storage Length for SAS Dates |
|
|
123 | (1) |
|
Checking Ranges for Dates (Using a DATA Step) |
|
|
124 | (1) |
|
Checking Ranges for Dates (Using PROC PRINT) |
|
|
125 | (1) |
|
Checking for Invalid Dates |
|
|
125 | (3) |
|
Working with Dates in Nonstandard Form |
|
|
128 | (1) |
|
Creating a SAS Date When the Day of the Month Is Missing |
|
|
129 | (2) |
|
Suspending Error Checking for Known Invalid Dates |
|
|
131 | (1) |
|
|
131 | (2) |
|
Chapter 9 Looking for Duplicates and Checking Data with Multiple Observations per Subject |
|
|
133 | (14) |
|
|
133 | (1) |
|
Eliminating Duplicates by Using PROC SORT |
|
|
133 | (3) |
|
Demonstrating a Possible Problem with the NODUPRECS Option |
|
|
136 | (2) |
|
Reviewing First, and Last. Variables |
|
|
138 | (2) |
|
Detecting Duplicates by Using DATA Step Approaches |
|
|
140 | (1) |
|
Using PROC FREQ to Detect Duplicate IDs |
|
|
141 | (2) |
|
Working with Data Sets with More Than One Observation per Subject |
|
|
143 | (1) |
|
Identifying Subjects with n Observations Each (DATA Step Approach) |
|
|
144 | (2) |
|
Identifying Subjects with n Observations Each (Using PROC FREQ) |
|
|
146 | (1) |
|
|
146 | (1) |
|
Chapter 10 Working with Multiple Files |
|
|
147 | (8) |
|
|
147 | (1) |
|
Checking for an ID in Each of Two Files |
|
|
147 | (3) |
|
Checking for an ID in Each of n Files |
|
|
150 | (2) |
|
|
152 | (2) |
|
|
154 | (1) |
|
Chapter 11 Using PROC COMPARE to Perform Data Verification |
|
|
155 | (8) |
|
|
155 | (1) |
|
Conducting a Simple Comparison of Two Data Files |
|
|
155 | (5) |
|
Simulating Double Entry Verification Using PROC COMPARE |
|
|
160 | (1) |
|
Other Features of PROC COMPARE |
|
|
161 | (1) |
|
|
162 | (1) |
|
Chapter 12 Correcting Errors |
|
|
163 | (10) |
|
|
163 | (1) |
|
|
163 | (1) |
|
|
164 | (2) |
|
Reviewing the UPDATE Statement |
|
|
166 | (2) |
|
Using the UPDATE Statement to Correct Errors in the Patients Data Set |
|
|
168 | (3) |
|
|
171 | (2) |
|
Chapter 13 Creating Integrity Constraints and Audit Trails |
|
|
173 | (22) |
|
|
173 | (1) |
|
Demonstrating General Integrity Constraints |
|
|
174 | (3) |
|
|
177 | (1) |
|
Demonstrating How Integrity Constraints Block the Addition of Data Errors |
|
|
178 | (1) |
|
Adding Your Own Messages to Violations of an Integrity Constraint |
|
|
179 | (1) |
|
Deleting an Integrity Constraint Using PROC DATASETS |
|
|
180 | (1) |
|
Creating an Audit Trail Data Set |
|
|
180 | (3) |
|
Demonstrating an Integrity Constraint Involving More Than One Variable |
|
|
183 | (3) |
|
Demonstrating a Referential Constraint |
|
|
186 | (2) |
|
Attempting to Delete a Primary Key When a Foreign Key Still Exists |
|
|
188 | (2) |
|
Attempting to Add a Name to the Child Data Set |
|
|
190 | (1) |
|
Demonstrating How to Delete a Referential Constraint |
|
|
191 | (1) |
|
Demonstrating the CASCADE Feature of a Referential Constraint |
|
|
191 | (1) |
|
Demonstrating the SET NULL Feature of a Referential Constraint |
|
|
192 | (1) |
|
|
193 | (2) |
|
Chapter 14 A Summary of Useful Data Cleaning Macros |
|
|
195 | (10) |
|
|
195 | (1) |
|
A Macro to Test Regular Expressions |
|
|
195 | (1) |
|
A Macro to List the n Highest and Lowest Values of a Variable |
|
|
196 | (1) |
|
A Macro to List the n% Highest and Lowest Values of a Variable |
|
|
197 | (1) |
|
A Macro to Perform Range Checks on Several Variables |
|
|
198 | (2) |
|
A Macro that Uses Trimmed Statistics to Automatically Search for Outliers |
|
|
200 | (2) |
|
A Macro to Search a Data Set for Specific Values Such as 999 |
|
|
202 | (1) |
|
A Macro to Check for ID Values in Multiple Data Sets |
|
|
203 | (1) |
|
|
204 | (1) |
Index |
|
205 | |