
Data Preparation for Data Mining [Paperback]

Dorian Pyle (Chief Scientist and Founder of PTI, Leominster, MA, USA)

Data Preparation for Data Mining addresses an issue unfortunately ignored by most authorities on data mining: data preparation. Thanks largely to its perceived difficulty, data preparation has traditionally taken a backseat to the more alluring question of how best to extract meaningful knowledge. But without adequate preparation of your data, the return on the resources invested in mining is certain to be disappointing.

Dorian Pyle corrects this imbalance. A twenty-five-year veteran of what has become the data mining industry, Pyle shares his own successful data preparation methodology, offering both a conceptual overview for managers and complete technical details for IT professionals. Apply his techniques and watch your mining efforts pay off, in the form of improved performance, reduced distortion, and more valuable results.

On the enclosed CD-ROM, you'll find a suite of programs as C source code and compiled into a command-line-driven toolkit. This code illustrates how the author's techniques can be applied to arrive at an automated preparation solution that works for you. Also included are demonstration versions of three commercial products that help with data preparation, along with sample data with which you can practice and experiment.

* Offers in-depth coverage of an essential but largely ignored subject.
* Goes far beyond theory, leading you step by step through the author's own data preparation techniques.
* Provides practical illustrations of the author's methodology using realistic sample data sets.
* Includes algorithms you can apply directly to your own project, along with instructions for understanding when automation is possible and when greater intervention is required.
* Explains how to identify and correct data problems that may be present in your application.
* Prepares miners, helping them head into preparation with a better understanding of data sets and their limitations.
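For a flavor of what the book covers, its contents include a section on softmax scaling, a transformation that squashes a numeric variable into the open interval (0, 1) so that values near the mean map almost linearly while extremes are compressed, keeping later out-of-range values inside bounds. A minimal sketch of one common formulation follows; the parameter name `lam` and the exact scaling constant are illustrative assumptions, not the book's own code:

```python
import math

def softmax_scale(values, lam=2.0):
    """Softmax-scale a list of numbers into the open interval (0, 1).

    Values near the mean map almost linearly; extreme values are
    squashed asymptotically, so inputs far outside the training
    range can never fall outside (0, 1). `lam` controls roughly how
    many standard deviations fall in the near-linear region (an
    illustrative choice, not a prescription from the book).
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(var) or 1.0  # guard against a constant variable
    scale = lam * std / (2 * math.pi)
    # Logistic squash of the linearly scaled value.
    return [1.0 / (1.0 + math.exp(-(v - mean) / scale)) for v in values]

# A gross outlier (100.0) lands near 1 instead of dominating the range.
scaled = softmax_scale([1.0, 2.0, 3.0, 4.0, 100.0])
print(all(0.0 < s < 1.0 for s in scaled))
```

Note how the transformation preserves ordering while bounding the output, which is the property that makes it attractive when a model trained today must digest tomorrow's unseen extremes.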


Additional information

Preface xvii
Introduction 1(8)
Data Exploration as a Process 9(36)
The Data Exploration Process 10(18)
Stage 1: Exploring the Problem Space 12(7)
Stage 2: Exploring the Solution Space 19(3)
Stage 3: Specifying the Implementation Method 22(1)
Stage 4: Mining the Data 22(6)
Exploration: Mining and Modeling 28(1)
Data Mining, Modeling, and Modeling Tools 28(9)
Ten Golden Rules 29(1)
Introducing Modeling Tools 30(2)
Types of Models 32(1)
Active and Passive Models 33(1)
Explanatory and Predictive Models 33(2)
Static and Continuously Learning Models 35(2)
Summary 37(2)
Supplemental Material 39(6)
A Continuously Learning Model Application 39(1)
How the Continuously Learning Model Worked 40(5)
The Nature of the World and Its Impact on Data Preparation 45(44)
Measuring the World 46(7)
Objects 46(1)
Capturing Measurements 47(1)
Errors of Measurement 48(5)
Typing Measurements to the Real World 53(1)
Types of Measurements 53(7)
Scalar Measurements 54(6)
Nonscalar Measurements 60(1)
Continua of Attributes of Variables 60(6)
The Qualitative-Quantitative Continuum 61(1)
The Discrete-Continuous Continuum 61(5)
Scale Measurement Example 66(1)
Transformations and Difficulties---Variables, Data, and Information 66(1)
Building Mineable Data Representations 67(19)
Data Representation 68(1)
Building Data---Dealing with Variables 69(8)
Building Mineable Data Sets 77(9)
Summary 86(1)
Supplemental Material 87(2)
Combinations 87(2)
Data Preparation as a Process 89(36)
Data Preparation: Inputs, Outputs, Models, and Decisions 90(10)
Step 1: Prepare the Data 92(5)
Step 2: Survey the Data 97(1)
Step 3: Model the Data 98(1)
Use the Model 98(2)
Modeling Tools and Data Preparation 100(12)
How Modeling Tools Drive Data Preparation 102(2)
Decision Trees 104(1)
Decision Lists 104(3)
Neural Networks 107(1)
Evolution Programs 107(1)
Modeling Data with the Tools 107(2)
Predictions and Rules 109(2)
Choosing Techniques 111(1)
Missing Data and Modeling Tools 111(1)
Stages of Data Preparation 112(10)
Stage 1: Accessing the Data 112(1)
Stage 2: Auditing the Data 113(1)
Stage 3: Enhancing and Enriching the Data 114(1)
Stage 4: Looking for Sampling Bias 114(1)
Stage 5: Determining Data Structure (Super-, Macro-, and Micro-) 115(1)
Stage 6: Building the PIE 116(5)
Stage 7: Surveying the Data 121(1)
Stage 8: Modeling the Data 122(1)
And the Result Is . . .? 122(3)
Getting the Data: Basic Preparation 125(30)
Data Discovery 127(2)
Data Access Issues 127(2)
Data Characterization 129(6)
Detail/Aggregation Level (Granularity) 129(2)
Consistency 131(1)
Pollution 132(1)
Objects 133(1)
Relationship 133(1)
Domain 133(1)
Defaults 134(1)
Integrity 134(1)
Concurrency 135(1)
Duplicate or Redundant Variables 135(1)
Data Set Assembly 135(6)
Reverse Pivoting 136(1)
Feature Extraction 137(1)
Physical or Behavioral Data Sets 138(1)
Explanatory Structure 138(1)
Data Enhancement or Enrichment 139(1)
Sampling Bias 140(1)
Example 1: Credit 141(8)
Looking at the Variables 141(5)
Relationships between Variables 146(3)
Example 2: Shoe 149(2)
Looking at the Variables 149(1)
Relationships between Variables 150(1)
The Data Assay 151(4)
Sampling, Variability, and Confidence 155(36)
Sampling, or First Catch Your Hare! 155(11)
How Much Data? 155(1)
Variability 156(3)
Converging on a Representative Sample 159(3)
Measuring Variability 162(1)
Variability and Deviation 162(4)
Confidence 166(1)
Variability of Numeric Variables 167(3)
Variability and Sampling 168(1)
Variability and Convergence 168(2)
Variability and Confidence in Alpha Variables 170(2)
Ordering and Rate of Discovery 171(1)
Measuring Confidence 172(6)
Modeling and Confidence with the Whole Population 172(1)
Testing for Confidence 173(3)
Confidence Tests and Variability 176(2)
Confidence in Capturing Variability 178(6)
A Brief Introduction to the Normal Distribution 178(2)
Normally Distributed Probabilities 180(1)
Capturing Normally Distributed Probabilities: An Example 181(1)
Capturing Confidence, Capturing Variance 182(2)
Problems and Shortcomings of Taking Samples Using Variability 184(4)
Missing Values 184(1)
Constants (Variables with Only One Value) 185(1)
Problems with Sampling 185(1)
Monotonic Variable Detection 186(1)
Interstitial Linearity 187(1)
Rate of Discovery 187(1)
Confidence and Instance Count 188(1)
Summary 188(1)
Supplemental Material 189(2)
Confidence Samples 189(2)
Handling Nonnumerical Variables 191(48)
Representing Alphas and Remapping 192(10)
One-of-n Remapping 193(1)
m-of-n Remapping 194(1)
Remapping to Eliminate Ordering 195(1)
Remapping One-to-Many Patterns, or Ill-Formed Problems 196(4)
Remapping Circular Discontinuity 200(2)
State Space 202(20)
Unit State Space 202(2)
Pythagoras in State Space 204(1)
Position in State Space 204(1)
Neighbors and Associates 205(1)
Density and Sparsity 206(5)
Nearby and Distant Nearest Neighbors 211(1)
Normalizing Measured Point Separation 211(2)
Contours, Peaks, and Valleys 213(1)
Mapping State Space 213(1)
Objects in State Space 213(1)
Phase Space 214(1)
Mapping Alpha Values 215(1)
Location, Location, Location! 216(1)
Numerics, Alphas, and the Montreal Canadiens 216(6)
Joint Distribution Tables 222(8)
Two-Way Tables 223(5)
More Values, More Variables, and Meaning of the Numeration 228(1)
Dealing with Low-Frequency Alpha Labels and Other Problems 229(1)
Dimensionality 230(5)
Multidimensional Scaling 230(1)
Squashing a Triangle 231(3)
Projecting Alpha Values 234(1)
Scree Plots 234(1)
Practical Consideration---Implementing Alpha Numeration in the Demonstration Code 235(3)
Implementing Neighborhoods 235(2)
Implementing Numeration in All Alpha Data Sets 237(1)
Implementing Dimensionality Reduction for Variables 237(1)
Summary 238(1)
Normalizing and Redistributing Variables 239(36)
Normalizing a Variable's Range 240(19)
Review of Data Preparation and Modeling (Training, Testing, and Execution) 241(1)
The Nature and Scope of the Out-of-Range Values Problem 242(1)
Discovering the Range of Values When Building the PIE 243(4)
Out-of-Range Values When Training 247(2)
Out-of-Range Values When Testing 249(1)
Out-of-Range Values When Executing 250(1)
Scaling Transformations 251(6)
Softmax Scaling 257(1)
Normalizing Ranges 258(1)
Redistributing Variable Values 259(10)
The Nature of Distributions 259(1)
Distributive Difficulties 260(1)
Adjusting Distributions 261(5)
Modified Distributions 266(3)
Summary 269(2)
Supplemental Material 271(4)
The Logistic Function 271(3)
Modifying the Linear Part of the Logistic Function Range 274(1)
Replacing Missing and Empty Values 275(24)
Retaining Information about Missing Values 275(3)
Missing-Value Patterns 276(1)
Capturing Patterns 277(1)
Replacing Missing Values 278(7)
Unbiased Estimators 279(1)
Variability Relationships 279(3)
Relationships between Variables 282(2)
Preserving Between-Variable Relationships 284(1)
Summary 285(1)
Supplemental Material 286(13)
Using Regression to Find Least Information-Damaging Missing Values 286(8)
Alternative Methods of Missing-Value Replacement 294(5)
Series Variables 299(52)
Here There Be Dragons! 300(1)
Types of Series 300(1)
Describing Series Data 301(19)
Constructing a Series 302(1)
Features of a Series 302(1)
Describing a Series---Fourier 303(4)
Describing a Series---Spectrum 307(7)
Describing a Series---Trend, Seasonality, Cycles, Noise 314(2)
Describing a Series---Autocorrelation 316(4)
Modeling Series Data 320(1)
Repairing Series Data Problems 320(5)
Missing Values 320(2)
Outliers 322(1)
Nonuniform Displacement 322(1)
Trend 323(2)
Tools 325(14)
Filtering 325(1)
Moving Averages 326(7)
Smoothing 1---PVM Smoothing 333(1)
Smoothing 2---Median Smoothing, Resmoothing, and Hanning 333(2)
Extraction 335(1)
Differencing 336(3)
Other Problems 339(5)
Numerating Alpha Values 341(1)
Distribution 341(3)
Normalization 344(1)
Preparing Series Data 344(4)
Looking at the Data 346(1)
Signposts on the Rocky Road 346(2)
Implementation Notes 348(3)
Preparing the Data Set 351(50)
Using Sparsely Populated Variables 351(4)
Increasing Information Density Using Sparsely Populated Variables 352(1)
Binning Sparse Numerical Values 353(1)
Present-Value Patterns (PVPs) 353(2)
Problems with High-Dimensionality Data Sets 355(5)
Information Representation 357(1)
Representing High-Dimensionality Data in Fewer Dimensions 358(2)
Introducing the Neural Network 360(16)
Training a Neural Network 361(1)
Neurons 362(1)
Reshaping the Logistic Curve 363(1)
Single-Input Neurons 363(3)
Multiple-Input Neurons 366(2)
Networking Neurons to Estimate a Function 368(1)
Network Learning 368(3)
Network Prediction---Hidden Layer 371(1)
Network Prediction---Output Layer 371(1)
Stochastic Network Performance 372(1)
Network Architecture 1---The Autoassociative Network 373(2)
Network Architecture 2---The Sparsely Connected Network 375(1)
Compressing Variables 376(2)
Using Compressed Dimensionality Data 376(2)
Removing Variables 378(5)
Estimating Variable Importance 1: What Doesn't Work 379(1)
Estimating Variable Importance 2: Clues 379(1)
Estimating Variable Importance 3: Configuring and Training the Network 380(3)
How Much Data Is Enough? 383(9)
Joint Distribution 384(6)
Capturing Joint Variability 390(1)
Degrees of Freedom 391(1)
Beyond Joint Distribution 392(4)
Enhancing the Data Set 393(3)
Data Sets in Perspective 396(1)
Implementation Notes 396(3)
Collapsing Extremely Sparsely Populated Variables 397(1)
Reducing Excessive Dimensionality 397(1)
Measuring Variable Importance 398(1)
Feature Enhancement 398(1)
Where Next? 399(2)
The Data Survey 401(82)
Introduction to the Data Survey 402(1)
Information and Communication 403(11)
Measuring Information: Signals and Dictionaries 405(1)
Measuring Information: Signals 406(1)
Measuring Information: Bits of Information 407(3)
Measuring Information: Surprise 410(1)
Measuring Information: Entropy 411(1)
Measuring Information: Dictionaries 412(2)
Mapping Using Entropy 414(9)
Whole Data Set Entropy 416(1)
Conditional Entropy between Inputs and Outputs 417(3)
Mutual Information 420(1)
Other Survey Uses for Entropy and Information 420(1)
Looking for Information 421(2)
Identifying Problems with a Data Survey 423(12)
Confidence and Sufficient Data 424(2)
Detecting Sparsity 426(1)
Manifold Definition 427(8)
Clusters 435(1)
Sampling Bias 436(3)
Making the Data Survey 439(3)
Novelty Detection 442(1)
Other Directions 443(3)
Supplemental Material 446(37)
Entropic Analysis---Example 446(5)
Surveying Data Sets 451(32)
Using Prepared Data 483(22)
Modeling Data 485(4)
Assumptions 485(1)
Models 485(1)
Data Mining vs. Exploratory Data Analysis 486(3)
Characterizing Data 489(5)
Decision Trees 490(1)
Clusters 491(1)
Nearest Neighbor 492(1)
Neural Networks and Regression 493(1)
Prepared Data and Modeling Algorithms 494(6)
Neural Networks and the Credit Data Set 494(5)
Decision Trees and the Credit Data Set 499(1)
Practical Use of Data Preparation and Prepared Data 500(1)
Looking at Present Modeling Tools and Future Directions 501(4)
Near Future 503(1)
Farther Out 504(1)
Appendix
Using the Demonstration Code on the CD-ROM 505(4)
Further Reading 509(4)
Index 513(24)
About the Author 537(2)
About the CD-ROM 539
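The contents above list one-of-n remapping among the techniques for handling nonnumerical (alpha) variables: each distinct label becomes its own 0/1 pseudo-variable, so a modeling tool sees no spurious ordering among the labels. As an illustration of the general idea only, not the book's demonstration code, a minimal sketch:

```python
def one_of_n(labels):
    """Remap an alpha (categorical) variable using one-of-n encoding.

    Returns the sorted list of distinct categories and, for each
    input label, a row with a 1 in that label's position and 0s
    elsewhere. Sorting is just to make the column order reproducible.
    """
    categories = sorted(set(labels))
    index = {c: i for i, c in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0] * len(categories)
        row[index[label]] = 1
        rows.append(row)
    return categories, rows

cats, encoded = one_of_n(["red", "green", "red", "blue"])
print(cats)     # ['blue', 'green', 'red']
print(encoded)  # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

The trade-off the book's chapter explores is that one-of-n multiplies the number of input columns by the number of labels, which is one reason it also covers alternatives such as m-of-n remapping and numeration.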


Dorian Pyle is Chief Scientist and Founder of PTI (www.pti.com), which develops and markets Powerhouse predictive and explanatory analytics software. Dorian has over 20 years' experience with the artificial intelligence and machine learning techniques used in what is known today as "data mining" or "predictive analytics". He has applied this knowledge as a consultant with Knowledge Stream Partners, Xchange, Naviant, Thinking Machines, and Data Miners, as well as with companies directly involved in credit card marketing for banks and with manufacturing companies using industrial automation. In 1976 he was involved in building artificially intelligent machine learning systems using the pioneering technologies now known as neural computing and associative memories. He is familiar with the most advanced technologies in data mining, including entropic analysis (information theory), chaotic and fractal decomposition, neural technologies, evolutionary and genetic optimization, algebra evolvers, case-based reasoning, concept induction, and other advanced statistical techniques.