Data Profiling [Hardback]

  • Format: Hardback, 154 pages, height x width: 235x191 mm, weight: 333 g
  • Series: Synthesis Lectures on Data Management
  • Publication date: 08-Nov-2018
  • Publisher: Morgan & Claypool Publishers
  • ISBN-10: 1681734486
  • ISBN-13: 9781681734484

Data profiling refers to the activity of collecting data about data, i.e., metadata. Most IT professionals and researchers who work with data have engaged in data profiling, at least informally, to understand and explore an unfamiliar dataset or to determine whether a new dataset is appropriate for a particular task at hand. Data profiling results are also important in a variety of other situations, including query optimization, data integration, and data cleaning. Simple metadata are statistics, such as the number of rows and columns, schema and datatype information, the number of distinct values, statistical value distributions, and the number of null or empty values in each column. More complex types of metadata are statements about multiple columns and their correlation, such as candidate keys, functional dependencies, and other types of dependencies.
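
The metadata types named above can be illustrated with a minimal sketch. It is not taken from the book: the helper names (`profile_column`, `is_unique`, `holds_fd`) and the sample rows are hypothetical, and nulls are assumed to compare as equal to each other, which is only one of several possible null semantics.

```python
from collections import Counter


def profile_column(values):
    """Single-column metadata: row count, null/empty count, distinct count, distribution."""
    non_null = [v for v in values if v is not None and v != ""]
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(set(non_null)),
        "distribution": Counter(non_null),
    }


def is_unique(rows, columns):
    """True if no two rows share the same values in `columns` (a candidate key)."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


def holds_fd(rows, lhs, rhs):
    """True if the functional dependency lhs -> rhs holds on `rows`."""
    mapping = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        # The first right-hand value seen for a key must match all later ones.
        if mapping.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


# Hypothetical sample relation, invented for illustration only.
rows = [
    {"id": 1, "zip": "10115", "city": "Berlin"},
    {"id": 2, "zip": "10115", "city": "Berlin"},
    {"id": 3, "zip": "14482", "city": "Potsdam"},
    {"id": 4, "zip": None, "city": "Potsdam"},
]

zip_stats = profile_column([r["zip"] for r in rows])
print(zip_stats)
print(is_unique(rows, ["id"]), is_unique(rows, ["zip"]))
print(holds_fd(rows, ["zip"], "city"), holds_fd(rows, ["city"], "zip"))
```

This brute-force validation checks one candidate at a time. Discovering all keys or dependencies means searching a candidate space that grows exponentially with the number of columns, which is the problem the discovery algorithms listed in the contents below address.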

This book provides a classification of the various types of profilable metadata, discusses popular data profiling tasks, and surveys state-of-the-art profiling algorithms. While most of the book focuses on tasks and algorithms for relational data profiling, we also briefly discuss systems and techniques for profiling non-relational data such as graphs and text. We conclude with a discussion of data profiling challenges and directions for future work in this area.

Preface
Acknowledgments
1 Discovering Metadata
1.1 Motivation and Overview
1.2 Data Profiling and Data Mining
1.3 Use Cases
1.4 Organization of This Book
2 Data Profiling Tasks
2.1 Single-Column Analysis
2.2 Dependency Discovery
2.3 Relaxed Dependencies
3 Single-Column Analysis
3.1 Cardinalities
3.2 Value Distributions
3.3 Data Types, Patterns, and Domains
3.4 Data Completeness
3.5 Approximate Statistics
3.6 Summary and Discussion
4 Dependency Discovery
4.1 Dependency Definitions
4.1.1 Functional Dependencies
4.1.2 Unique Column Combinations
4.1.3 Inclusion Dependencies
4.2 Search Space and Data Structures
4.2.1 Lattices and Search Space Sizes
4.2.2 Position List Indexes and Search Space Validation
4.2.3 Search Complexity
4.2.4 Null Semantics
4.3 Discovering Unique Column Combinations
4.3.1 Gordian
4.3.2 HCA
4.3.3 Ducc
4.3.4 HyUCC
4.3.5 Swan
4.4 Discovering Functional Dependencies
4.4.1 Tane
4.4.2 Fun
4.4.3 FD_Mine
4.4.4 Dfd
4.4.5 Dep-Miner
4.4.6 FastFDs
4.4.7 Fdep
4.4.8 HyFD
4.5 Discovering Inclusion Dependencies
4.5.1 SQL-Based IND Validation
4.5.2 B&B
4.5.3 DeMarchi
4.5.4 Binder
4.5.5 Spider
4.5.6 S-IndD
4.5.7 Sindy
4.5.8 Mind
4.5.9 Find2
4.5.10 ZigZag
4.5.11 Mind2
5 Relaxed and Other Dependencies
5.1 Relaxing the Extent of a Dependency
5.1.1 Partial Dependencies
5.1.2 Conditional Dependencies
5.2 Relaxing Attribute Comparisons
5.2.1 Metric and Matching Dependencies
5.2.2 Order and Sequential Dependencies
5.3 Approximating the Dependency Discovery
5.4 Generalizing Functional Dependencies
5.4.1 Denial Constraints
5.4.2 Multivalued Dependencies
6 Use Cases
6.1 Data Exploration
6.2 Schema Engineering
6.3 Data Cleaning
6.4 Query Optimization
6.5 Data Integration
7 Profiling Non-Relational Data
7.1 XML
7.2 RDF
7.3 Time Series
7.4 Graphs
7.5 Text
8 Data Profiling Tools
8.1 Research Prototypes
8.2 Commercial Tools
9 Data Profiling Challenges
9.1 Functional Challenges
9.1.1 Profiling Dynamic Data
9.1.2 Interactive Profiling
9.1.3 Profiling for Integration
9.1.4 Interpreting Profiling Results
9.2 Non-Functional Challenges
9.2.1 Efficiency and Scalability
9.2.2 Profiling on New Architectures
9.2.3 Benchmarking Profiling Methods
10 Conclusions
Bibliography
Authors' Biographies