Preface  xvii
Introduction  1

Data Exploration as a Process  9
    The Data Exploration Process  10
    Stage 1: Exploring the Problem Space  12
    Stage 2: Exploring the Solution Space  19
    Stage 3: Specifying the Implementation Method  22
    Exploration: Mining and Modeling  28
    Data Mining, Modeling, and Modeling Tools  28
    Introducing Modeling Tools  30
    Active and Passive Models  33
    Explanatory and Predictive Models  33
    Static and Continuously Learning Models  35
    A Continuously Learning Model Application  39
    How the Continuously Learning Model Worked  40

The Nature of the World and Its Impact on Data Preparation  45
    Typing Measurements to the Real World  53
    Continua of Attributes of Variables  60
    The Qualitative-Quantitative Continuum  61
    The Discrete-Continuous Continuum  61
    Scale Measurement Example  66
    Transformations and Difficulties---Variables, Data, and Information  66
    Building Mineable Data Representations  67
    Building Data---Dealing with Variables  69
    Building Mineable Data Sets  77

Data Preparation as a Process  89
    Data Preparation: Inputs, Outputs, Models, and Decisions  90
    Modeling Tools and Data Preparation  100
    How Modeling Tools Drive Data Preparation  102
    Modeling Data with the Tools  107
    Missing Data and Modeling Tools  111
    Stages of Data Preparation  112
    Stage 1: Accessing the Data  112
    Stage 2: Auditing the Data  113
    Stage 3: Enhancing and Enriching the Data  114
    Stage 4: Looking for Sampling Bias  114
    Stage 5: Determining Data Structure (Super-, Macro-, and Micro-)  115
    Stage 6: Building the PIE  116
    Stage 7: Surveying the Data  121
    Stage 8: Modeling the Data  122

Getting the Data: Basic Preparation  125
    Detail/Aggregation Level (Granularity)  129
    Duplicate or Redundant Variables  135
    Physical or Behavioral Data Sets  138
    Data Enhancement or Enrichment  139
    Relationships between Variables  146
    Relationships between Variables  150

Sampling, Variability, and Confidence  155
    Sampling, or First Catch Your Hare!  155
    Converging on a Representative Sample  159
    Variability and Deviation  162
    Variability of Numeric Variables  167
    Variability and Convergence  168
    Variability and Confidence in Alpha Variables  170
    Ordering and Rate of Discovery  171
    Modeling and Confidence with the Whole Population  172
    Confidence Tests and Variability  176
    Confidence in Capturing Variability  178
    A Brief Introduction to the Normal Distribution  178
    Normally Distributed Probabilities  180
    Capturing Normally Distributed Probabilities: An Example  181
    Capturing Confidence, Capturing Variance  182
    Problems and Shortcomings of Taking Samples Using Variability  184
    Constants (Variables with Only One Value)  185
    Monotonic Variable Detection  186
    Confidence and Instance Count  188

Handling Nonnumerical Variables  191
    Representing Alphas and Remapping  192
    Remapping to Eliminate Ordering  195
    Remapping One-to-Many Patterns, or Ill-Formed Problems  196
    Remapping Circular Discontinuity  200
    Pythagoras in State Space  204
    Nearby and Distant Nearest Neighbors  211
    Normalizing Measured Point Separation  211
    Contours, Peaks, and Valleys  213
    Location, Location, Location!  216
    Numerics, Alphas, and the Montreal Canadiens  216
    Joint Distribution Tables  222
    More Values, More Variables, and Meaning of the Numeration  228
    Dealing with Low-Frequency Alpha Labels and Other Problems  229
    Practical Consideration---Implementing Alpha Numeration in the Demonstration Code  235
    Implementing Neighborhoods  235
    Implementing Numeration in All Alpha Data Sets  237
    Implementing Dimensionality Reduction for Variables  237

Normalizing and Redistributing Variables  239
    Normalizing a Variable's Range  240
    Review of Data Preparation and Modeling (Training, Testing, and Execution)  241
    The Nature and Scope of the Out-of-Range Values Problem  242
    Discovering the Range of Values When Building the PIE  243
    Out-of-Range Values When Training  247
    Out-of-Range Values When Testing  249
    Out-of-Range Values When Executing  250
    Redistributing Variable Values  259
    The Nature of Distributions  259
    Distributive Difficulties  260
    Modifying the Linear Part of the Logistic Function Range  274

Replacing Missing and Empty Values  275
    Retaining Information about Missing Values  275
    Variability Relationships  279
    Relationships between Variables  282
    Preserving Between-Variable Relationships  284
    Using Regression to Find Least Information-Damaging Missing Values  286
    Alternative Methods of Missing-Value Replacement  294

    Describing a Series---Fourier  303
    Describing a Series---Spectrum  307
    Describing a Series---Trend, Seasonality, Cycles, Noise  314
    Describing a Series---Autocorrelation  316
    Repairing Series Data Problems  320
    Smoothing 1---PVM Smoothing  333
    Smoothing 2---Median Smoothing, Resmoothing, and Hanning  333
    Signposts on the Rocky Road  346

    Using Sparsely Populated Variables  351
    Increasing Information Density Using Sparsely Populated Variables  352
    Binning Sparse Numerical Values  353
    Present-Value Patterns (PVPs)  353
    Problems with High-Dimensionality Data Sets  355
    Information Representation  357
    Representing High-Dimensionality Data in Fewer Dimensions  358
    Introducing the Neural Network  360
    Training a Neural Network  361
    Reshaping the Logistic Curve  363
    Networking Neurons to Estimate a Function  368
    Network Prediction---Hidden Layer  371
    Network Prediction---Output Layer  371
    Stochastic Network Performance  372
    Network Architecture 1---The Autoassociative Network  373
    Network Architecture 2---The Sparsely Connected Network  375
    Using Compressed Dimensionality Data  376
    Estimating Variable Importance 1: What Doesn't Work  379
    Estimating Variable Importance 2: Clues  379
    Estimating Variable Importance 3: Configuring and Training the Network  380
    Capturing Joint Variability  390
    Beyond Joint Distribution  392
    Collapsing Extremely Sparsely Populated Variables  397
    Reducing Excessive Dimensionality  397
    Measuring Variable Importance  398

    Introduction to the Data Survey  402
    Information and Communication  403
    Measuring Information: Signals and Dictionaries  405
    Measuring Information: Signals  406
    Measuring Information: Bits of Information  407
    Measuring Information: Surprise  410
    Measuring Information: Entropy  411
    Measuring Information: Dictionaries  412
    Conditional Entropy between Inputs and Outputs  417
    Other Survey Uses for Entropy and Information  420
    Identifying Problems with a Data Survey  423
    Confidence and Sufficient Data  424
    Entropic Analysis---Example  446

    Data Mining vs. Exploratory Data Analysis  486
    Neural Networks and Regression  493
    Prepared Data and Modeling Algorithms  494
    Neural Networks and the Credit Data Set  494
    Decision Trees and the Credit Data Set  499
    Practical Use of Data Preparation and Prepared Data  500
    Looking at Present Modeling Tools and Future Directions  501

Appendix: Using the Demonstration Code on the CD-ROM  505
Further Reading  509
Index  513
About the Author  537
About the CD-ROM  539