
E-book: Data Mining: Concepts and Techniques

(Professor, Department of Computer Science, University of Illinois at Urbana-Champaign, USA), (Simon Fraser University, Burnaby, Canada), (Associate Professor, Department of Computer Science, University of Illinois at Urbana-Champaign, USA)
  • Format: EPUB+DRM
  • Price: €76.43*
  • * the price is final, i.e. no further discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital rights management (DRM)
    The publisher has issued this e-book in encrypted form, which means that you need to install special software to read it. You also need to create an Adobe ID. More information here. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), you need to install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, you need to install Adobe Digital Editions. (This is a free application designed specifically for reading e-books. It should not be confused with Adobe Reader, which is probably already installed on your computer.)

    This e-book cannot be read on an Amazon Kindle.

Data Mining: Concepts and Techniques, Fourth Edition provides the theories and methods for processing the data and information used in a wide range of applications. Specifically, it explains data mining and the tools used in discovering knowledge from collected data, a process known as knowledge discovery from data (KDD). The book focuses on the feasibility, usefulness, effectiveness, and scalability of techniques for large data sets. After introducing data mining, the authors explain the methods for getting to know, preprocessing, and warehousing data. They then present information about data warehouses, online analytical processing (OLAP), and data cube technology, followed by the methods for mining frequent patterns, associations, and correlations in large data sets.

The book details the methods for data classification and introduces the concepts and methods for data clustering. The remaining chapters discuss outlier detection and the trends, applications, and research frontiers in data mining. Readers ranging from computer science students to application developers, business professionals, and researchers seeking information on data mining will find this resource very helpful.

  • Presents dozens of algorithms and implementation examples, all in pseudo-code and suitable for use in real-world, large-scale data mining projects
  • Addresses advanced topics such as mining object-relational databases, spatial databases, multimedia databases, time-series databases, text databases, the World Wide Web, and applications in several fields
  • Provides a comprehensive, practical look at the concepts and techniques needed to get the most out of your data
Foreword xvii
Foreword to second edition xix
Preface xxi
Acknowledgments xxvii
About the authors xxix
Chapter 1 Introduction
1(22)
1.1 What is data mining?
1(1)
1.2 Data mining: an essential step in knowledge discovery
2(2)
1.3 Diversity of data types for data mining
4(1)
1.4 Mining various kinds of knowledge
5(7)
1.4.1 Multidimensional data summarization
6(1)
1.4.2 Mining frequent patterns, associations, and correlations
6(1)
1.4.3 Classification and regression for predictive analysis
7(2)
1.4.4 Cluster analysis
9(1)
1.4.5 Deep learning
9(1)
1.4.6 Outlier analysis
10(1)
1.4.7 Are all mining results interesting?
10(2)
1.5 Data mining: confluence of multiple disciplines
12(5)
1.5.1 Statistics and data mining
12(1)
1.5.2 Machine learning and data mining
13(2)
1.5.3 Database technology and data mining
15(1)
1.5.4 Data mining and data science
15(1)
1.5.5 Data mining and other disciplines
16(1)
1.6 Data mining and applications
17(2)
1.7 Data mining and society
19(1)
1.8 Summary
19(1)
1.9 Exercises
20(1)
1.10 Bibliographic notes
21(2)
Chapter 2 Data, measurements, and data preprocessing
23(62)
2.1 Data types
24(3)
2.1.1 Nominal attributes
24(1)
2.1.2 Binary attributes
25(1)
2.1.3 Ordinal attributes
25(1)
2.1.4 Numeric attributes
26(1)
2.1.5 Discrete vs. continuous attributes
27(1)
2.2 Statistics of data
27(16)
2.2.1 Measuring the central tendency
28(3)
2.2.2 Measuring the dispersion of data
31(3)
2.2.3 Covariance and correlation analysis
34(4)
2.2.4 Graphic displays of basic statistics of data
38(5)
2.3 Similarity and distance measures
43(12)
2.3.1 Data matrix vs. dissimilarity matrix
43(1)
2.3.2 Proximity measures for nominal attributes
44(2)
2.3.3 Proximity measures for binary attributes
46(2)
2.3.4 Dissimilarity of numeric data: Minkowski distance
48(1)
2.3.5 Proximity measures for ordinal attributes
49(1)
2.3.6 Dissimilarity for attributes of mixed types
50(2)
2.3.7 Cosine similarity
52(1)
2.3.8 Measuring similar distributions: the Kullback-Leibler divergence
53(2)
2.3.9 Capturing hidden semantics in similarity measures
55(1)
2.4 Data quality, data cleaning, and data integration
55(8)
2.4.1 Data quality measures
55(1)
2.4.2 Data cleaning
56(6)
2.4.3 Data integration
62(1)
2.5 Data transformation
63(8)
2.5.1 Normalization
64(1)
2.5.2 Discretization
65(3)
2.5.3 Data compression
68(2)
2.5.4 Sampling
70(1)
2.6 Dimensionality reduction
71(8)
2.6.1 Principal components analysis
71(1)
2.6.2 Attribute subset selection
72(2)
2.6.3 Nonlinear dimensionality reduction methods
74(5)
2.7 Summary
79(1)
2.8 Exercises
80(3)
2.9 Bibliographic notes
83(2)
Chapter 3 Data warehousing and online analytical processing
85(60)
3.1 Data warehouse
85(11)
3.1.1 Data warehouse: what and why?
85(3)
3.1.2 Architecture of data warehouses: enterprise data warehouses and data marts
88(5)
3.1.3 Data lakes
93(3)
3.2 Data warehouse modeling: schema and measures
96(10)
3.2.1 Data cube: a multidimensional data model
97(2)
3.2.2 Schemas for multidimensional data models: stars, snowflakes, and fact constellations
99(4)
3.2.3 Concept hierarchies
103(2)
3.2.4 Measures: categorization and computation
105(1)
3.3 OLAP operations
106(7)
3.3.1 Typical OLAP operations
106(2)
3.3.2 Indexing OLAP data: bitmap index and join index
108(3)
3.3.3 Storage implementation: column-based databases
111(2)
3.4 Data cube computation
113(7)
3.4.1 Terminology of data cube computation
113(2)
3.4.2 Data cube materialization: ideas
115(2)
3.4.3 OLAP server architectures: ROLAP vs. MOLAP vs. HOLAP
117(2)
3.4.4 General strategies for data cube computation
119(1)
3.5 Data cube computation methods
120(13)
3.5.1 Multiway array aggregation for full cube computation
121(4)
3.5.2 BUC: computing iceberg cubes from the apex cuboid downward
125(4)
3.5.3 Precomputing shell fragments for fast high-dimensional OLAP
129(3)
3.5.4 Efficient processing of OLAP queries using cuboids
132(1)
3.6 Summary
133(2)
3.7 Exercises
135(7)
3.8 Bibliographic notes
142(3)
Chapter 4 Pattern mining: basic concepts and methods
145(30)
4.1 Basic concepts
145(4)
4.1.1 Market basket analysis: a motivating example
145(2)
4.1.2 Frequent itemsets, closed itemsets, and association rules
147(2)
4.2 Frequent itemset mining methods
149(14)
4.2.1 Apriori algorithm: finding frequent itemsets by confined candidate generation
150(3)
4.2.2 Generating association rules from frequent itemsets
153(2)
4.2.3 Improving the efficiency of Apriori
155(2)
4.2.4 A pattern-growth approach for mining frequent itemsets
157(3)
4.2.5 Mining frequent itemsets using the vertical data format
160(2)
4.2.6 Mining closed and max patterns
162(1)
4.3 Which patterns are interesting?--Pattern evaluation methods
163(6)
4.3.1 Strong rules are not necessarily interesting
163(1)
4.3.2 From association analysis to correlation analysis
164(1)
4.3.3 A comparison of pattern evaluation measures
165(4)
4.4 Summary
169(1)
4.5 Exercises
170(3)
4.6 Bibliographic notes
173(2)
Chapter 5 Pattern mining: advanced methods
175(64)
5.1 Mining various kinds of patterns
175(12)
5.1.1 Mining multilevel associations
175(4)
5.1.2 Mining multidimensional associations
179(1)
5.1.3 Mining quantitative association rules
180(3)
5.1.4 Mining high-dimensional data
183(2)
5.1.5 Mining rare patterns and negative patterns
185(2)
5.2 Mining compressed or approximate patterns
187(4)
5.2.1 Mining compressed patterns by pattern clustering
187(2)
5.2.2 Extracting redundancy-aware top-k patterns
189(2)
5.3 Constraint-based pattern mining
191(7)
5.3.1 Pruning pattern space with pattern pruning constraints
193(3)
5.3.2 Pruning data space with data pruning constraints
196(1)
5.3.3 Mining space pruning with succinctness constraints
197(1)
5.4 Mining sequential patterns
198(13)
5.4.1 Sequential pattern mining: concepts and primitives
198(2)
5.4.2 Scalable methods for mining sequential patterns
200(10)
5.4.3 Constraint-based mining of sequential patterns
210(1)
5.5 Mining subgraph patterns
211(12)
5.5.1 Methods for mining frequent subgraphs
212(7)
5.5.2 Mining variant and constrained substructure patterns
219(4)
5.6 Pattern mining: application examples
223(9)
5.6.1 Phrase mining in massive text data
223(7)
5.6.2 Mining copy and paste bugs in software programs
230(2)
5.7 Summary
232(1)
5.8 Exercises
233(2)
5.9 Bibliographic notes
235(4)
Chapter 6 Classification: basic concepts and methods
239(68)
6.1 Basic concepts
239(4)
6.1.1 What is classification?
239(1)
6.1.2 General approach to classification
240(3)
6.2 Decision tree induction
243(16)
6.2.1 Decision tree induction
244(4)
6.2.2 Attribute selection measures
248(9)
6.2.3 Tree pruning
257(2)
6.3 Bayes classification methods
259(7)
6.3.1 Bayes' theorem
260(2)
6.3.2 Naive Bayesian classification
262(4)
6.4 Lazy learners (or learning from your neighbors)
266(3)
6.4.1 K-nearest-neighbor classifiers
266(3)
6.4.2 Case-based reasoning
269(1)
6.5 Linear classifiers
269(9)
6.5.1 Linear regression
270(2)
6.5.2 Perceptron: turning linear regression to classification
272(2)
6.5.3 Logistic regression
274(4)
6.6 Model evaluation and selection
278(12)
6.6.1 Metrics for evaluating classifier performance
278(5)
6.6.2 Holdout method and random subsampling
283(1)
6.6.3 Cross-validation
283(1)
6.6.4 Bootstrap
284(1)
6.6.5 Model selection using statistical tests of significance
285(1)
6.6.6 Comparing classifiers based on cost-benefit and ROC curves
286(4)
6.7 Techniques to improve classification accuracy
290(8)
6.7.1 Introducing ensemble methods
290(1)
6.7.2 Bagging
291(1)
6.7.3 Boosting
292(4)
6.7.4 Random forests
296(1)
6.7.5 Improving classification accuracy of class-imbalanced data
297(1)
6.8 Summary
298(1)
6.9 Exercises
299(3)
6.10 Bibliographic notes
302(5)
Chapter 7 Classification: advanced methods
307(72)
7.1 Feature selection and engineering
307(8)
7.1.1 Filter methods
308(3)
7.1.2 Wrapper methods
311(1)
7.1.3 Embedded methods
312(3)
7.2 Bayesian belief networks
315(3)
7.2.1 Concepts and mechanisms
315(2)
7.2.2 Training Bayesian belief networks
317(1)
7.3 Support vector machines
318(9)
7.3.1 Linear support vector machines
319(5)
7.3.2 Nonlinear support vector machines
324(3)
7.4 Rule-based and pattern-based classification
327(15)
7.4.1 Using IF-THEN rules for classification
328(2)
7.4.2 Rule extraction from a decision tree
330(1)
7.4.3 Rule induction using a sequential covering algorithm
331(4)
7.4.4 Associative classification
335(3)
7.4.5 Discriminative frequent pattern-based classification
338(4)
7.5 Classification with weak supervision
342(9)
7.5.1 Semisupervised classification
343(2)
7.5.2 Active learning
345(1)
7.5.3 Transfer learning
346(2)
7.5.4 Distant supervision
348(1)
7.5.5 Zero-shot learning
349(2)
7.6 Classification with rich data type
351(8)
7.6.1 Stream data classification
352(2)
7.6.2 Sequence classification
354(1)
7.6.3 Graph data classification
355(4)
7.7 Potpourri: other related techniques
359(10)
7.7.1 Multiclass classification
359(3)
7.7.2 Distance metric learning
362(2)
7.7.3 Interpretability of classification
364(3)
7.7.4 Genetic algorithms
367(1)
7.7.5 Reinforcement learning
367(2)
7.8 Summary
369(1)
7.9 Exercises
370(4)
7.10 Bibliographic notes
374(5)
Chapter 8 Cluster analysis: basic concepts and methods
379(52)
8.1 Cluster analysis
379(6)
8.1.1 What is cluster analysis?
380(1)
8.1.2 Requirements for cluster analysis
381(2)
8.1.3 Overview of basic clustering methods
383(2)
8.2 Partitioning methods
385(9)
8.2.1 k-Means: a centroid-based technique
386(2)
8.2.2 Variations of k-means
388(6)
8.3 Hierarchical methods
394(13)
8.3.1 Basic concepts of hierarchical clustering
394(3)
8.3.2 Agglomerative hierarchical clustering
397(3)
8.3.3 Divisive hierarchical clustering
400(2)
8.3.4 BIRCH: scalable hierarchical clustering using clustering feature trees
402(2)
8.3.5 Probabilistic hierarchical clustering
404(3)
8.4 Density-based and grid-based methods
407(10)
8.4.1 DBSCAN: density-based clustering based on connected regions with high density
408(3)
8.4.2 DENCLUE: clustering based on density distribution functions
411(3)
8.4.3 Grid-based methods
414(3)
8.5 Evaluation of clustering
417(8)
8.5.1 Assessing clustering tendency
417(2)
8.5.2 Determining the number of clusters
419(1)
8.5.3 Measuring clustering quality: extrinsic methods
420(4)
8.5.4 Intrinsic methods
424(1)
8.6 Summary
425(2)
8.7 Exercises
427(2)
8.8 Bibliographic notes
429(2)
Chapter 9 Cluster analysis: advanced methods
431(54)
9.1 Probabilistic model-based clustering
431(10)
9.1.1 Fuzzy clusters
433(2)
9.1.2 Probabilistic model-based clusters
435(3)
9.1.3 Expectation-maximization algorithm
438(3)
9.2 Clustering high-dimensional data
441(6)
9.2.1 Why is clustering high-dimensional data challenging?
441(4)
9.2.2 Axis-parallel subspace approaches
445(2)
9.2.3 Arbitrarily oriented subspace approaches
447(1)
9.3 Biclustering
447(7)
9.3.1 Why and where is biclustering useful?
448(2)
9.3.2 Types of biclusters
450(2)
9.3.3 Biclustering methods
452(1)
9.3.4 Enumerating all biclusters using MaPle
453(1)
9.4 Dimensionality reduction for clustering
454(9)
9.4.1 Linear dimensionality reduction methods for clustering
455(3)
9.4.2 Nonnegative matrix factorization (NMF)
458(2)
9.4.3 Spectral clustering
460(3)
9.5 Clustering graph and network data
463(12)
9.5.1 Applications and challenges
463(2)
9.5.2 Similarity measures
465(5)
9.5.3 Graph clustering methods
470(5)
9.6 Semisupervised clustering
475(4)
9.6.1 Semisupervised clustering on partially labeled data
475(1)
9.6.2 Semisupervised clustering on pairwise constraints
476(1)
9.6.3 Other types of background knowledge for semisupervised clustering
477(2)
9.7 Summary
479(1)
9.8 Exercises
480(2)
9.9 Bibliographic notes
482(3)
Chapter 10 Deep learning
485(72)
10.1 Basic concepts
485(15)
10.1.1 What is deep learning?
485(4)
10.1.2 Backpropagation algorithm
489(9)
10.1.3 Key challenges for training deep learning models
498(1)
10.1.4 Overview of deep learning architecture
499(1)
10.2 Improve training of deep learning models
500(17)
10.2.1 Responsive activation functions
500(1)
10.2.2 Adaptive learning rate
501(3)
10.2.3 Dropout
504(3)
10.2.4 Pretraining
507(2)
10.2.5 Cross-entropy
509(2)
10.2.6 Autoencoder: unsupervised deep learning
511(3)
10.2.7 Other techniques
514(3)
10.3 Convolutional neural networks
517(9)
10.3.1 Introducing convolution operation
517(2)
10.3.2 Multidimensional convolution
519(4)
10.3.3 Convolutional layer
523(3)
10.4 Recurrent neural networks
526(13)
10.4.1 Basic RNN models and applications
526(6)
10.4.2 Gated RNNs
532(4)
10.4.3 Other techniques for addressing long-term dependence
536(3)
10.5 Graph neural networks
539(8)
10.5.1 Basic concepts
540(1)
10.5.2 Graph convolutional networks
541(4)
10.5.3 Other types of GNNs
545(2)
10.6 Summary
547(1)
10.7 Exercises
548(4)
10.8 Bibliographic notes
552(5)
Chapter 11 Outlier detection
557(48)
11.1 Basic concepts
557(8)
11.1.1 What are outliers?
558(1)
11.1.2 Types of outliers
559(2)
11.1.3 Challenges of outlier detection
561(1)
11.1.4 An overview of outlier detection methods
562(3)
11.2 Statistical approaches
565(7)
11.2.1 Parametric methods
565(4)
11.2.2 Nonparametric methods
569(3)
11.3 Proximity-based approaches
572(4)
11.3.1 Distance-based outlier detection
572(1)
11.3.2 Density-based outlier detection
573(3)
11.4 Reconstruction-based approaches
576(9)
11.4.1 Matrix factorization-based methods for numerical data
577(5)
11.4.2 Pattern-based compression methods for categorical data
582(3)
11.5 Clustering- vs. classification-based approaches
585(5)
11.5.1 Clustering-based approaches
585(3)
11.5.2 Classification-based approaches
588(2)
11.6 Mining contextual and collective outliers
590(3)
11.6.1 Transforming contextual outlier detection to conventional outlier detection
591(1)
11.6.2 Modeling normal behavior with respect to contexts
591(1)
11.6.3 Mining collective outliers
592(1)
11.7 Outlier detection in high-dimensional data
593(7)
11.7.1 Extending conventional outlier detection
594(1)
11.7.2 Finding outliers in subspaces
595(1)
11.7.3 Outlier detection ensemble
596(1)
11.7.4 Taming high dimensionality by deep learning
597(2)
11.7.5 Modeling high-dimensional outliers
599(1)
11.8 Summary
600(1)
11.9 Exercises
601(1)
11.10 Bibliographic notes
602(3)
Chapter 12 Data mining trends and research frontiers
605(50)
12.1 Mining rich data types
605(12)
12.1.1 Mining text data
605(5)
12.1.2 Spatial-temporal data
610(2)
12.1.3 Graph and networks
612(5)
12.2 Data mining applications
617(12)
12.2.1 Data mining for sentiment and opinion
617(3)
12.2.2 Truth discovery and misinformation identification
620(3)
12.2.3 Information and disease propagation
623(3)
12.2.4 Productivity and team science
626(3)
12.3 Data mining methodologies and systems
629(13)
12.3.1 Structuring unstructured data for knowledge mining: a data-driven approach
629(3)
12.3.2 Data augmentation
632(3)
12.3.3 From correlation to causality
635(2)
12.3.4 Network as a context
637(3)
12.3.5 Auto-ML: methods and systems
640(2)
12.4 Data mining, people, and society
642(13)
12.4.1 Privacy-preserving data mining
642(4)
12.4.2 Human-algorithm interaction
646(2)
12.4.3 Mining beyond maximizing accuracy: fairness, interpretability, and robustness
648(4)
12.4.4 Data mining for social good
652(3)
APPENDIX A Mathematical background
655(26)
A.1 Probability and statistics
655(6)
A.1.1 PDF of typical distributions
655(1)
A.1.2 MLE and MAP
656(1)
A.1.3 Significance test
657(1)
A.1.4 Density estimation
658(1)
A.1.5 Bias-variance tradeoff
659(1)
A.1.6 Cross-validation and Jackknife
660(1)
A.2 Numerical optimization
661(7)
A.2.1 Gradient descent
661(1)
A.2.2 Variants of gradient descent
662(2)
A.2.3 Newton's method
664(2)
A.2.4 Coordinate descent
666(1)
A.2.5 Quadratic programming
666(2)
A.3 Matrix and linear algebra
668(5)
A.3.1 Linear system Ax = b
668(1)
A.3.2 Norms of vectors and matrices
669(1)
A.3.3 Matrix decompositions
669(2)
A.3.4 Subspace
671(1)
A.3.5 Orthogonality
672(1)
A.4 Concepts and tools from signal processing
673(5)
A.4.1 Entropy
673(1)
A.4.2 Kullback-Leibler divergence (KL-divergence)
674(1)
A.4.3 Mutual information
675(1)
A.4.4 Discrete Fourier transform (DFT) and fast Fourier transform (FFT)
676(2)
A.5 Bibliographic notes
678(3)
Bibliography 681(54)
Index 735
Jiawei Han is Professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign. Well known for his research in the areas of data mining and database systems, he has received many awards for his contributions to the field, including the 2004 ACM SIGKDD Innovations Award. He has served as Editor-in-Chief of ACM Transactions on Knowledge Discovery from Data, and on the editorial boards of several journals, including IEEE Transactions on Knowledge and Data Engineering and Data Mining and Knowledge Discovery.

Jian Pei is currently a Canada Research Chair (Tier 1) in Big Data Science and a Professor in the School of Computing Science at Simon Fraser University. He is also an associate member of the Department of Statistics and Actuarial Science. He is a well-known leading researcher in the general areas of data science, big data, data mining, and database systems. His expertise is in developing effective and efficient data analysis techniques for novel data-intensive applications. He is recognized as a Fellow of the Association for Computing Machinery (ACM) "for his contributions to the foundation, methodology and applications of data mining" and as a Fellow of the Institute of Electrical and Electronics Engineers (IEEE) "for his contributions to data mining and knowledge discovery". He is the Editor-in-Chief of IEEE Transactions on Knowledge and Data Engineering (TKDE), a director of the Special Interest Group on Knowledge Discovery in Data (SIGKDD) of the Association for Computing Machinery (ACM), and a general co-chair or program committee co-chair of many premier conferences.

Hanghang Tong, Ph.D., is currently an associate professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign. Before that, he was an associate professor at the School of Computing, Informatics, and Decision Systems Engineering (CIDSE), Arizona State University. He received his M.Sc. and Ph.D. degrees from Carnegie Mellon University in 2008 and 2009, both in Machine Learning. His research interest is in large-scale data mining for graphs and multimedia. He has received several awards, including the SDM/IBM Early Career Data Mining Research award (2018), the NSF CAREER award (2017), the ICDM 10-Year Highest Impact Paper award (2015), four best paper awards (TUP'14, CIKM'12, SDM'08, ICDM'06), seven "bests of conference", one best demo honorable mention (SIGMOD'17), and one best demo candidate, second place (CIKM'17). He has published over 100 refereed articles. He is the Editor-in-Chief of SIGKDD Explorations (ACM), an action editor of Data Mining and Knowledge Discovery (Springer), and an associate editor of Knowledge and Information Systems (Springer) and the Neurocomputing Journal (Elsevier), and has served as a program committee member in multiple data mining, database, and artificial intelligence venues (e.g., SIGKDD, SIGMOD, AAAI, WWW, and CIKM).