Muutke küpsiste eelistusi

E-raamat: Data Clustering: Algorithms and Applications [Taylor & Francis e-raamat]

Edited by (IBM Research, Yorktown Heights, New York, USA), Edited by (Wayne State University, Detroit, Michigan, USA)
Teised raamatud teemal:
  • Taylor & Francis e-raamat
  • Hind: 184,65 €*
  • * hind, mis tagab piiramatu üheaegsete kasutajate arvuga ligipääsu piiramatuks ajaks
  • Tavahind: 263,78 €
  • Säästad 30%
Teised raamatud teemal:
Research on the problem of clustering tends to be fragmented across the pattern recognition, database, data mining, and machine learning communities. Addressing this problem in a unified way, Data Clustering: Algorithms and Applications provides complete coverage of the entire area of clustering, from basic methods to more refined and complex data clustering approaches. It pays special attention to recent issues in graphs, social networks, and other domains.

The book focuses on three primary aspects of data clustering:





Methods, describing key techniques commonly used for clustering, such as feature selection, agglomerative clustering, partitional clustering, density-based clustering, probabilistic clustering, grid-based clustering, spectral clustering, and nonnegative matrix factorization Domains, covering methods used for different domains of data, such as categorical data, text data, multimedia data, graph data, biological data, stream data, uncertain data, time series clustering, high-dimensional clustering, and big data Variations and Insights, discussing important variations of the clustering process, such as semisupervised clustering, interactive clustering, multiview clustering, cluster ensembles, and cluster validation

In this book, top researchers from around the world explore the characteristics of clustering problems in a variety of application areas. They also explain how to glean detailed insight from the clustering processincluding how to verify the quality of the underlying clustersthrough supervision, human intervention, or the automated generation of alternative clusters.
Preface xxi
Editor Biographies xxiii
Contributors xxv
1 An Introduction to Cluster Analysis 1(28)
Charu C. Aggarwal
1.1 Introduction
2(1)
1.2 Common Techniques Used in Cluster Analysis
3(12)
1.2.1 Feature Selection Methods
4(1)
1.2.2 Probabilistic and Generative Models
4(1)
1.2.3 Distance-Based Algorithms
5(2)
1.2.4 Density- and Grid-Based Methods
7(1)
1.2.5 Leveraging Dimensionality Reduction Methods
8(3)
1.2.5.1 Generative Models for Dimensionality Reduction
8(1)
1.2.5.2 Matrix Factorization and Co-Clustering
8(2)
1.2.5.3 Spectral Methods
10(1)
1.2.6 The High Dimensional Scenario
11(2)
1.2.7 Scalable Techniques for Cluster Analysis
13(2)
1.2.7.1 I/O Issues in Database Management
13(1)
1.2.7.2 Streaming Algorithms
14(1)
1.2.7.3 The Big Data Framework
14(1)
1.3 Data Types Studied in Cluster Analysis
15(4)
1.3.1 Clustering Categorical Data
15(1)
1.3.2 Clustering Text Data
16(1)
1.3.3 Clustering Multimedia Data
16(1)
1.3.4 Clustering Time-Series Data
17(1)
1.3.5 Clustering Discrete Sequences
17(1)
1.3.6 Clustering Network Data
18(1)
1.3.7 Clustering Uncertain Data
19(1)
1.4 Insights Gained from Different Variations of Cluster Analysis
19(3)
1.4.1 Visual Insights
20(1)
1.4.2 Supervised Insights
20(1)
1.4.3 Multiview and Ensemble-Based Insights
21(1)
1.4.4 Validation-Based Insights
21(1)
1.5 Discussion and Conclusions
22(7)
2 Feature Selection for Clustering: A Review 29(32)
Salem Alelyani
Jiliang Tang
Huan Liu
2.1 Introduction
30(5)
2.1.1 Data Clustering
32(1)
2.1.2 Feature Selection
32(1)
2.1.3 Feature Selection for Clustering
33(2)
2.1.3.1 Filter Model
34(1)
2.1.3.2 Wrapper Model
35(1)
2.1.3.3 Hybrid Model
35(1)
2.2 Feature Selection for Clustering
35(18)
2.2.1 Algorithms for Generic Data
36(5)
2.2.1.1 Spectral Feature Selection (SPEC)
36(1)
2.2.1.2 Laplacian Score (LS)
36(1)
2.2.1.3 Feature Selection for Sparse Clustering
37(1)
2.2.1.4 Localized Feature Selection Based on Scatter Separability (LFSBSS)
38(1)
2.2.1.5 Multicluster Feature Selection (MCFS)
39(1)
2.2.1.6 Feature Weighting k-Means
40(1)
2.2.2 Algorithms for Text Data
41(6)
2.2.2.1 Term Frequency (TF)
41(1)
2.2.2.2 Inverse Document Frequency (IDF)
42(1)
2.2.2.3 Term Frequency-Inverse Document Frequency (TF-IDF)
42(1)
2.2.2.4 Chi Square Statistic
42(2)
2.2.2.5 Frequent Term-Based Text Clustering
44(1)
2.2.2.6 Frequent Term Sequence
45(2)
2.2.3 Algorithms for Streaming Data
47(3)
2.2.3.1 Text Stream Clustering Based on Adaptive Feature Selection (TSC-AFS)
47(1)
2.2.3.2 High-Dimensional Projected Stream Clustering (HPStream)
48(2)
2.2.4 Algorithms for Linked Data
50(3)
2.2.4.1 Challenges and Opportunities
50(1)
2.2.4.2 LUFS: An Unsupervised Feature Selection Framework for Linked Data
51(1)
2.2.4.3 Conclusion and Future Work for Linked Data
52(1)
2.3 Discussions and Challenges
53(8)
2.3.1 The Chicken or the Egg Dilemma
53(1)
2.3.2 Model Selection: K and l
54(1)
2.3.3 Scalability
54(1)
2.3.4 Stability
55(6)
3 Probabilistic Models for Clustering 61(26)
Hongbo Deng
Jiawei Han
3.1 Introduction
61(1)
3.2 Mixture Models
62(7)
3.2.1 Overview
62(2)
3.2.2 Gaussian Mixture Model
64(3)
3.2.3 Bernoulli Mixture Model
67(1)
3.2.4 Model Selection Criteria
68(1)
3.3 EM Algorithm and Its Variations
69(7)
3.3.1 The General EM Algorithm
69(4)
3.3.2 Mixture Models Revisited
73(2)
3.3.3 Limitations of the EM Algorithm
75(1)
3.3.4 Applications of the EM Algorithm
76(1)
3.4 Probabilistic Topic Models
76(5)
3.4.1 Probabilistic Latent Semantic Analysis
77(2)
3.4.2 Latent Dirichlet Allocation
79(2)
3.4.3 Variations and Extensions
81(1)
3.5 Conclusions and Summary
81(6)
4 A Survey of Partitional and Hierarchical Clustering Algorithms 87(24)
Chandan K. Reddy
Bhanukiran Vinzamuri
4.1 Introduction
88(1)
4.2 Partitional Clustering Algorithms
89(11)
4.2.1 K-Means Clustering
89(1)
4.2.2 Minimization of Sum of Squared Errors
90(1)
4.2.3 Factors Affecting K-Means
91(2)
4.2.3.1 Popular Initialization Methods
91(1)
4.2.3.2 Estimating the Number of Clusters
92(1)
4.2.4 Variations of K-Means
93(7)
4.2.4.1 K-Medoids Clustering
93(1)
4.2.4.2 K-Medians Clustering
94(1)
4.2.4.3 K-Modes Clustering
94(1)
4.2.4.4 Fuzzy K-Means Clustering
95(1)
4.2.4.5 X-Means Clustering
95(1)
4.2.4.6 Intelligent K-Means Clustering
96(1)
4.2.4.7 Bisecting K-Means Clustering
97(1)
4.2.4.8 Kernel K-Means Clustering
97(1)
4.2.4.9 Mean Shift Clustering
98(1)
4.2.4.10 Weighted K-Means Clustering
98(1)
4.2.4.11 Genetic K-Means Clustering
99(1)
4.2.5 Making K-Means Faster
100(1)
4.3 Hierarchical Clustering Algorithms
100(6)
4.3.1 Agglomerative Clustering
101(3)
4.3.1.1 Single and Complete Link
101(1)
4.3.1.2 Group Averaged and Centroid Agglomerative Clustering .
102(1)
4.3.1.3 Ward's Criterion
103(1)
4.3.1.4 Agglomerative Hierarchical Clustering Algorithm
103(1)
4.3.1.5 Lance-Williams Dissimilarity Update Formula
103(1)
4.3.2 Divisive Clustering
104(2)
4.3.2.1 Issues in Divisive Clustering
104(1)
4.3.2.2 Divisive Hierarchical Clustering Algorithm
105(1)
4.3.2.3 Minimum Spanning Tree-Based Clustering
105(1)
4.3.3 Other Hierarchical Clustering Algorithms
106(1)
4.4 Discussion and Summary
106(5)
5 Density-Based Clustering 111(16)
Martin Ester
5.1 Introduction
111(2)
5.2 DBSCAN
113(2)
5.3 DENCLUE
115(1)
5.4 OPTICS
116(1)
5.5 Other Algorithms
116(2)
5.6 Subspace Clustering
118(2)
5.7 Clustering Networks
120(3)
5.8 Other Directions
123(1)
5.9 Conclusion
124(3)
6 Grid-Based Clustering 127(22)
Wei Cheng
Wei Wang
Sandra Batista
6.1 Introduction
128(3)
6.2 The Classical Algorithms
131(4)
6.2.1 Earliest Approaches: GRIDCLUS and BANG
131(1)
6.2.2 STING and STING+: The Statistical Information Grid Approach
132(2)
6.2.3 WaveCluster: Wavelets in Grid-Based Clustering
134(1)
6.3 Adaptive Grid-Based Algorithms
135(1)
6.3.1 AMR: Adaptive Mesh Refinement Clustering
135(1)
6.4 Axis-Shifting Grid-Based Algorithms
136(3)
6.4.1 NSGC: New Shifting Grid Clustering Algorithm
136(1)
6.4.2 ADCC: Adaptable Deflect and Conquer Clustering
137(1)
6.4.3 ASGC: Axis-Shifted Grid-Clustering
137(1)
6.4.4 GDILC: Grid-Based Density-IsoLine Clustering Algorithm
138(1)
6.5 High-Dimensional Algorithms
139(6)
6.5.1 CLIQUE: The Classical High-Dimensional Algorithm
139(1)
6.5.2 Variants of CLIQUE
140(1)
6.5.2.1 ENCLUS: Entropy-Based Approach
140(1)
6.5.2.2 MAFIA: Adaptive Grids in High Dimensions
141(1)
6.5.3 OptiGrid: Density-Based Optimal Grid Partitioning
141(2)
6.5.4 Variants of the OptiGrid Approach
143(7)
6.5.4.1 0-Cluster: A Scalable Approach
143(1)
6.5.4.2 CBF: Cell-Based Filtering
144(1)
6.6 Conclusions and Summary
145(4)
7 Nonnegative Matrix Factorizations for Clustering: A Survey 149(28)
Tao Li
Chris Ding
7.1 Introduction
150(1)
7.1.1 Background
150(1)
7.1.2 NMF Formulations
151(1)
7.2 NMF for Clustering: Theoretical Foundations
151(2)
7.2.1 NMF and K-Means Clustering
151(1)
7.2.2 NMF and Probabilistic Latent Semantic Indexing
152(1)
7.2.3 NMF and Kernel K-Means and Spectral Clustering
152(1)
7.2.4 NMF Boundedness Theorem
153(1)
7.3 NMF Clustering Capabilities
153(2)
7.3.1 Examples
153(1)
7.3.2 Analysis
153(2)
7.4 NMF Algorithms
155(3)
7.4.1 Introduction
155(1)
7.4.2 Algorithm Development
155(1)
7.4.3 Practical Issues in NMF Algorithms
156(5)
7.4.3.1 Initialization
156(1)
7.4.3.2 Stopping Criteria
156(1)
7.4.3.3 Objective Function vs. Clustering Performance
157(1)
7.4.3.4 Scalability
157(1)
7.5 NMF Related Factorizations
158(3)
7.6 NMF for Clustering: Extensions
161(4)
7.6.1 Co-Clustering
161(1)
7.6.2 Semisupervised Clustering
162(1)
7.6.3 Semisupervised Co-Clustering
162(1)
7.6.4 Consensus Clustering
163(1)
7.6.5 Graph Clustering
164(1)
7.6.6 Other Clustering Extensions
164(1)
7.7 Conclusions
165(12)
8 Spectral Clustering 177(24)
Jialu Liu
Jiawei Han
8.1 Introduction
177(2)
8.2 Similarity Graph
179(1)
8.3 Unnormalized Spectral Clustering
180(2)
8.3.1 Notation
180(1)
8.3.2 Unnormalized Graph Laplacian
180(1)
8.3.3 Spectrum Analysis
181(1)
8.3.4 Unnormalized Spectral Clustering Algorithm
182(1)
8.4 Normalized Spectral Clustering
182(3)
8.4.1 Normalized Graph Laplacian
183(1)
8.4.2 Spectrum Analysis
184(1)
8.4.3 Normalized Spectral Clustering Algorithm
184(1)
8.5 Graph Cut View
185(3)
8.5.1 Ratio Cut Relaxation
186(1)
8.5.2 Normalized Cut Relaxation
187(1)
8.6 Random Walks View
188(1)
8.7 Connection to Laplacian Eigenmap
189(2)
8.8 Connection to Kernel k-Means and Nonnegative Matrix Factorization
191(1)
8.9 Large Scale Spectral Clustering
192(2)
8.10 Further Reading
194(7)
9 Clustering High-Dimensional Data 201(30)
Arthur Zimek
9.1 Introduction
201(1)
9.2 The "Curse of Dimensionality"
202(4)
9.2.1 Different Aspects of the "Curse"
202(4)
9.2.2 Consequences
206(1)
9.3 Clustering Tasks in Subspaces of High-Dimensional Data
206(2)
9.3.1 Categories of Subspaces
206(1)
9.3.1.1 Axis-Parallel Subspaces
206(1)
9.3.1.2 Arbitrarily Oriented Subspaces
207(1)
9.3.1.3 Special Cases
207(1)
9.3.2 Search Spaces for the Clustering Problem
207(1)
9.4 Fundamental Algorithmic Ideas
208(10)
9.4.1 Clustering in Axis-Parallel Subspaces
208(7)
9.4.1.1 Cluster Model
208(1)
9.4.1.2 Basic Techniques
208(2)
9.4.1.3 Clustering Algorithms
210(5)
9.4.2 Clustering in Arbitrarily Oriented Subspaces
215(18)
9.4.2.1 Cluster Model
215(1)
9.4.2.2 Basic Techniques and Example Algorithms
216(2)
9.5 Open Questions and Current Research Directions
218(1)
9.6 Conclusion
219(12)
10 A Survey of Stream Clustering Algorithms 231(28)
Charu C. Aggarwal
10.1 Introduction
231(2)
10.2 Methods Based on Partitioning Representatives
233(6)
10.2.1 The STREAM Algorithm
233(2)
10.2.2 CluStream: The Microclustering Framework
235(4)
10.2.2.1 Microcluster Definition
235(1)
10.2.2.2 Pyramidal Time Frame
236(1)
10.2.2.3 Online Clustering with CluStream
237(2)
10.3 Density-Based Stream Clustering
239(4)
10.3.1 DenStream: Density-Based Microclustering
240(1)
10.3.2 Grid-Based Streaming Algorithms
241(2)
10.3.2.1 D-Stream Algorithm
241(1)
10.3.2.2 Other Grid-Based Algorithms
242(1)
10.4 Probabilistic Streaming Algorithms
243(1)
10.5 Clustering High-Dimensional Streams
243(2)
10.5.1 The HPSTREAM Method
244(1)
10.5.2 Other High-Dimensional Streaming Algorithms
244(1)
10.6 Clustering Discrete and Categorical Streams
245(4)
10.6.1 Clustering Binary Data Streams with k-Means
245(1)
10.6.2 The StreamCluCD Algorithm
245(1)
10.6.3 Massive-Domain Clustering
246(3)
10.7 Text Stream Clustering
249(3)
10.8 Other Scenarios for Stream Clustering
252(2)
10.8.1 Clustering Uncertain Data Streams
253(1)
10.8.2 Clustering Graph Streams
253(1)
10.8.3 Distributed Clustering of Data Streams
254(1)
10.9 Discussion and Conclusions
254(5)
11 Big Data Clustering 259(18)
Hanghang Tong
U. Kang
11.1 Introduction
259(1)
11.2 One-Pass Clustering Algorithms
260(3)
11.2.1 CLARANS: Fighting with Exponential Search Space
260(1)
11.2.2 BIRCH: Fighting with Limited Memory
261(2)
11.2.3 CURE: Fighting with the Irregular Clusters
263(1)
11.3 Randomized Techniques for Clustering Algorithms
263(5)
11.3.1 Locality-Preserving Projection
264(2)
11.3.2 Global Projection
266(2)
11.4 Parallel and Distributed Clustering Algorithms
268(6)
11.4.1 General Framework
268(1)
11.4.2 DBDC: Density-Based Clustering
269(1)
11.4.3 ParMETIS: Graph Partitioning
269(1)
11.4.4 PKMeans: K-Means with MapReduce
270(1)
11.4.5 DisCo: Co-Clustering with MapReduce
271(1)
11.4.6 BoW: Subspace Clustering with MapReduce
272(2)
11.5 Conclusion
274(3)
12 Clustering Categorical Data 277(28)
Bill Andreopoulos
12.1 Introduction
278(1)
12.2 Goals of Categorical Clustering
279(3)
12.2.1 Clustering Road Map
280(2)
12.3 Similarity Measures for Categorical Data
282(2)
12.3.1 The Hamming Distance in Categorical and Binary Data
282(1)
12.3.2 Probabilistic Measures
283(1)
12.3.3 Information-Theoretic Measures
283(1)
12.3.4 Context-Based Similarity Measures
284(1)
12.4 Descriptions of Algorithms
284(14)
12.4.1 Partition-Based Clustering
284(3)
12.4.1.1 k-Modes
284(1)
12.4.1.2 k-Prototypes (Mixed Categorical and Numerical)
285(1)
12.4.1.3 Fuzzy k-Modes
286(1)
12.4.1.4 Squeezer
286(1)
12.4.1.5 COOLCAT
286(1)
12.4.2 Hierarchical Clustering
287(2)
12.4.2.1 ROCK
287(1)
12.4.2.2 COBWEB
288(1)
12.4.2.3 LIMBO
289(1)
12.4.3 Density-Based Clustering
289(7)
12.4.3.1 Projected (Subspace) Clustering
290(1)
12.4.3.2 CACTUS
290(1)
12.4.3.3 CLICKS
291(1)
12.4.3.4 STIRR
291(1)
12.4.3.5 CLOPE
292(1)
12.4.3.6 HIERDENC: Hierarchical Density-Based Clustering
292(1)
12.4.3.7 MULIC: Multiple Layer Incremental Clustering
293(3)
12.4.4 Model-Based Clustering
296(10)
12.4.4.1 BILCOM Empirical Bayesian (Mixed Categorical and Numerical)
296(1)
12.4.4.2 AutoClass (Mixed Categorical and Numerical)
296(1)
12.4.4.3 SVM Clustering (Mixed Categorical and Numerical)
297(1)
12.5 Conclusion
298(7)
13 Document Clustering: The Next Frontier 305(34)
David C. Anastasiu
Andrea Tagarelli
George Karypis
13.1 Introduction
306(1)
13.2 Modeling a Document
306(5)
13.2.1 Preliminaries
306(1)
13.2.2 The Vector Space Model
307(2)
13.2.3 Alternate Document Models
309(1)
13.2.4 Dimensionality Reduction for Text
309(1)
13.2.5 Characterizing Extremes
310(1)
13.3 General Purpose Document Clustering
311(4)
13.3.1 Similarity/Dissimilarity-Based Algorithms
311(1)
13.3.2 Density-Based Algorithms
312(1)
13.3.3 Adjacency-Based Algorithms
313(1)
13.3.4 Generative Algorithms
313(2)
13.4 Clustering Long Documents
315(8)
13.4.1 Document Segmentation
315(2)
13.4.2 Clustering Segmented Documents
317(4)
13.4.3 Simultaneous Segment Identification and Clustering
321(2)
13.5 Clustering Short Documents
323(5)
13.5.1 General Methods for Short Document Clustering
323(1)
13.5.2 Clustering with Knowledge Infusion
324(1)
13.5.3 Clustering Web Snippets
325(1)
13.5.4 Clustering Microblogs
326(2)
13.6 Conclusion
328(11)
14 Clustering Multimedia Data 339(18)
Shen-Fu Tsai
Guo-Jun Qi
Shiyu Chang
Min-Hsuan Tsai
Thomas S. Huang
14.1 Introduction
340(1)
14.2 Clustering with Image Data
340(7)
14.2.1 Visual Words Learning
341(1)
14.2.2 Face Clustering and Annotation
342(1)
14.2.3 Photo Album Event Recognition
343(1)
14.2.4 Image Segmentation
344(1)
14.2.5 Large-Scale Image Classification
345(2)
14.3 Clustering with Video and Audio Data
347(4)
14.3.1 Video Summarization
348(1)
14.3.2 Video Event Detection
349(1)
14.3.3 Video Story Clustering
350(1)
14.3.4 Music Summarization
350(1)
14.4 Clustering with Multimodal Data
351(2)
14.5 Summary and Future Directions
353(4)
15 Time-Series Data Clustering 357(24)
Dimitrios Kotsakos
Goce Trajcevski
Dimitrios Gunopulos
Charu C. Aggarwal
15.1 Introduction
358(1)
15.2 The Diverse Formulations for Time-Series Clustering
359(1)
15.3 Online Correlation-Based Clustering
360(3)
15.3.1 Selective Muscles and Related Methods
361(1)
15.3.2 Sensor Selection Algorithms for Correlation Clustering
362(1)
15.4 Similarity and Distance Measures
363(6)
15.4.1 Univariate Distance Measures
363(3)
15.4.1.1 LP Distance
363(1)
15.4.1.2 Dynamic Time Warping Distance
364(1)
15.4.1.3 EDIT Distance
365(1)
15.4.1.4 Longest Common Subsequence
365(1)
15.4.2 Multivariate Distance Measures
366(3)
15.4.2.1 Multidimensional LP Distance
366(1)
15.4.2.2 Multidimensional DTW
367(1)
15.4.2.3 Multidimensional LCSS
368(1)
15.4.2.4 Multidimensional Edit Distance
368(1)
15.4.2.5 Multidimensional Subsequence Matching
368(1)
15.5 Shape-Based Time-Series Clustering Techniques
369(5)
15.5.1 k-Means Clustering
370(1)
15.5.2 Hierarchical Clustering
371(1)
15.5.3 Density-Based Clustering
372(1)
15.5.4 Trajectory Clustering
372(2)
15.6 Time-Series Clustering Applications
374(1)
15.7 Conclusions
375(6)
16 Clustering Biological Data 381(34)
Chandan K. Reddy
Mohammad Al Hasan
Mohammed J. Zaki
16.1 Introduction
382(1)
16.2 Clustering Microarray Data
383(11)
16.2.1 Proximity Measures
383(1)
16.2.2 Categorization of Algorithms
384(1)
16.2.3 Standard Clustering Algorithms
385(3)
16.2.3.1 Hierarchical Clustering
385(1)
16.2.3.2 Probabilistic Clustering
386(1)
16.2.3.3 Graph-Theoretic Clustering
386(1)
16.2.3.4 Self-Organizing Maps
387(1)
16.2.3.5 Other Clustering Methods
387(1)
16.2.4 Biclustering
388(3)
16.2.4.1 Types and Structures of Biclusters
389(1)
16.2.4.2 Biclustering Algorithms
390(1)
16.2.4.3 Recent Developments
391(1)
16.2.5 Triclustering
391(1)
16.2.6 Time-Series Gene Expression Data Clustering
392(1)
16.2.7 Cluster Validation
393(1)
16.3 Clustering Biological Networks
394(3)
16.3.1 Characteristics of PPI Network Data
394(1)
16.3.2 Network Clustering Algorithms
394(3)
16.3.2.1 Molecular Complex Detection
394(1)
16.3.2.2 Markov Clustering
395(1)
16.3.2.3 Neighborhood Search Methods
395(1)
16.3.2.4 Clique Percolation Method
395(1)
16.3.2.5 Ensemble Clustering
396(1)
16.3.2.6 Other Clustering Methods
396(1)
16.3.3 Cluster Validation and Challenges
397(1)
16.4 Biological Sequence Clustering
397(6)
16.4.1 Sequence Similarity Metrics
397(2)
16.4.1.1 Alignment-Based Similarity
398(1)
16.4.1.2 Keyword-Based Similarity
398(1)
16.4.1.3 Kernel-Based Similarity
399(1)
16.4.1.4 Model-Based Similarity
399(1)
16.4.2 Sequence Clustering Algorithms
399(20)
16.4.2.1 Subsequence-Based Clustering
399(1)
16.4.2.2 Graph-Based Clustering
400(2)
16.4.2.3 Probabilistic Models
402(1)
16.4.2.4 Suffix Tree and Suffix Array-Based Method
403(1)
16.5 Software Packages
403(2)
16.6 Discussion and Summary
405(10)
17 Network Clustering 415(42)
Srinivasan Parthasarathy
S.M. Faisal
17.1 Introduction
416(1)
17.2 Background and Nomenclature
417(1)
17.3 Problem Definition
417(1)
17.4 Common Evaluation Criteria
418(1)
17.5 Partitioning with Geometric Information
419(2)
17.5.1 Coordinate Bisection
419(1)
17.5.2 Inertial Bisection
419(1)
17.5.3 Geometric Partitioning
420(1)
17.6 Graph Growing and Greedy Algorithms
421(2)
17.6.1 Kernighan-Lin Algorithm
422(1)
17.7 Agglomerative and Divisive Clustering
423(1)
17.8 Spectral Clustering
424(4)
17.8.1 Similarity Graphs
425(1)
17.8.2 Types of Similarity Graphs
425(1)
17.8.3 Graph Laplacians
426(1)
17.8.3.1 Unnormalized Graph Laplacian
426(1)
17.8.3.2 Normalized Graph Laplacians
427(1)
17.8.4 Spectral Clustering Algorithms
427(1)
17.9 Markov Clustering
428(2)
17.9.1 Regularized MCL (RMCL): Improvement over MCL
429(1)
17.10 Multilevel Partitioning
430(2)
17.11 Local Partitioning Algorithms
432(1)
17.12 Hypergraph Partitioning
433(2)
17.13 Emerging Methods for Partitioning Special Graphs
435(8)
17.13.1 Bipartite Graphs
435(1)
17.13.2 Dynamic Graphs
436(1)
17.13.3 Heterogeneous Networks
437(1)
17.13.4 Directed Networks
438(1)
17.13.5 Combining Content and Relationship Information
439(1)
17.13.6 Networks with Overlapping Communities
440(2)
17.13.7 Probabilistic Methods
442(1)
17.14 Conclusion
443(14)
18 A Survey of Uncertain Data Clustering Algorithms 457(26)
Charu C. Aggarwal
18.1 Introduction
457(2)
18.2 Mixture Model Clustering of Uncertain Data
459(1)
18.3 Density-Based Clustering Algorithms
460(2)
18.3.1 FDBSCAN Algorithm
460(1)
18.3.2 FOPTICS Algorithm
461(1)
18.4 Partitional Clustering Algorithms
462(4)
18.4.1 The UK-Means Algorithm
462(1)
18.4.2 The CK-Means Algorithm
463(1)
18.4.3 Clustering Uncertain Data with Voronoi Diagrams
464(1)
18.4.4 Approximation Algorithms for Clustering Uncertain Data
464(1)
18.4.5 Speeding Up Distance Computations
465(1)
18.5 Clustering Uncertain Data Streams
466(6)
18.5.1 The UMicro Algorithm
466(5)
18.5.2 The LuMicro Algorithm
471(1)
18.5.3 Enhancements to Stream Clustering
471(1)
18.6 Clustering Uncertain Data in High Dimensionality
472(5)
18.6.1 Subspace Clustering of Uncertain Data
473(1)
18.6.2 UPStream: Projected Clustering of Uncertain Data Streams
474(3)
18.7 Clustering with the Possible Worlds Model
477(1)
18.8 Clustering Uncertain Graphs
478(1)
18.9 Conclusions and Summary
478(5)
19 Concepts of Visual and Interactive Clustering 483(22)
Alexander Hinneburg
19.1 Introduction
483(1)
19.2 Direct Visual and Interactive Clustering
484(7)
19.2.1 Scatterplots
485(3)
19.2.2 Parallel Coordinates
488(3)
19.2.3 Discussion
491(1)
19.3 Visual Interactive Steering of Clustering
491(4)
19.3.1 Visual Assessment of Convergence of Clustering Algorithm
491(1)
19.3.2 Interactive Hierarchical Clustering
492(2)
19.3.3 Visual Clustering with SOMs
494(1)
19.3.4 Discussion
494(1)
19.4 Interactive Comparison and Combination of Clusterings
495(2)
19.4.1 Space of Clusterings
495(2)
19.4.2 Visualization
497(1)
19.4.3 Discussion
497(1)
19.5 Visualization of Clusters for Sense-Making
497(3)
19.6 Summary
500(5)
20 Semisupervised Clustering 505(30)
Amrudin Agovic
Arindam Banerjee
20.1 Introduction
506(1)
20.2 Clustering with Pointwise and Pairwise Semisupervision
507(6)
20.2.1 Semisupervised Clustering Based on Seeding
507(1)
20.2.2 Semisupervised Clustering Based on Pairwise Constraints
508(3)
20.2.3 Active Learning for Semisupervised Clustering
511(1)
20.2.4 Semisupervised Clustering Based on User Feedback
512(1)
20.2.5 Semisupervised Clustering Based on Nonnegative Matrix Factorization
513(1)
20.3 Semisupervised Graph Cuts
513(4)
20.3.1 Semisupervised Unnormalized Cut
515(1)
20.3.2 Semisupervised Ratio Cut
515(1)
20.3.3 Semisupervised Normalized Cut
516(1)
20.4 A Unified View of Label Propagation
517(4)
20.4.1 Generalized Label Propagation
517(1)
20.4.2 Gaussian Fields
517(1)
20.4.3 Tikhonov Regularization (TIKREG)
518(1)
20.4.4 Local and Global Consistency
518(1)
20.4.5 Related Methods
519(2)
20.4.5.1 Cluster Kernels
519(1)
20.4.5.2 Gaussian Random Walks EM (GWEM)
519(1)
20.4.5.3 Linear Neighborhood Propagation
520(1)
20.4.6 Label Propagation and Green's Function
521(1)
20.4.7 Label Propagation and Semisupervised Graph Cuts
521(1)
20.5 Semisupervised Embedding
521(3)
20.5.1 Nonlinear Manifold Embedding
522(1)
20.5.2 Semisupervised Embedding
522(2)
20.5.2.1 Unconstrained Semisupervised Embedding
523(1)
20.5.2.2 Constrained Semisupervised Embedding
523(1)
20.6 Comparative Experimental Analysis
524(6)
20.6.1 Experimental Results
524(5)
20.6.2 Semisupervised Embedding Methods
529(1)
20.7 Conclusions
530(5)
21 Alternative Clustering Analysis: A Review 535(16)
James Bailey
21.1 Introduction
535(2)
21.2 Technical Preliminaries
537(1)
21.3 Multiple Clustering Analysis Using Alternative Clusterings
538(7)
21.3.1 Alternative Clustering Algorithms: A Taxonomy
538(1)
21.3.2 Unguided Generation
539(2)
21.3.2.1 Naive
539(1)
21.3.2.2 Meta Clustering
539(1)
21.3.2.3 Eigenvectors of the Laplacian Matrix
540(1)
21.3.2.4 Decorrelated k-Means and Convolutional EM
540(1)
21.3.2.5 CAMI
540(1)
21.3.3 Guided Generation with Constraints
541(2)
21.3.3.1 COALA
541(1)
21.3.3.2 Constrained Optimization Approach
541(1)
21.3.3.3 MAXIMUS
542(1)
21.3.4 Orthogonal Transformation Approaches
543(1)
21.3.4.1 Orthogonal Views
543(1)
21.3.4.2 ADFT
543(1)
21.3.5 Information Theoretic
544(14)
21.3.5.1 Conditional Information Bottleneck (CIB)
544(1)
21.3.5.2 Conditional Ensemble Clustering
544(1)
21.3.5.3 NACI
544(1)
21.3.5.4 mSC
545(1)
21.4 Connections to Multiview Clustering and Subspace Clustering
545(2)
21.5 Future Research Issues
547(1)
21.6 Summary
547(4)
22 Cluster Ensembles: Theory and Applications 551(20)
Joydeep Ghosh
Ayan Acharya
22.1 Introduction
551(3)
22.2 The Cluster Ensemble Problem
554(1)
22.3 Measuring Similarity Between Clustering Solutions
555(3)
22.4 Cluster Ensemble Algorithms
558(6)
22.4.1 Probabilistic Approaches to Cluster Ensembles
558(2)
22.4.1.1 A Mixture Model for Cluster Ensembles (MMCE)
558(1)
22.4.1.2 Bayesian Cluster Ensembles (BCE)
558(1)
22.4.1.3 Nonparametric Bayesian Cluster Ensembles (NPBCE)
559(1)
22.4.2 Pairwise Similarity-Based Approaches
560(2)
22.4.2.1 Methods Based on Ensemble Co-Association Matrix
560(2)
22.4.2.2 Relating Consensus Clustering to Other Optimization Formulations
562(1)
22.4.3 Direct Approaches Using Cluster Labels
562(2)
22.4.3.1 Graph Partitioning
562(1)
22.4.3.2 Cumulative Voting
563(1)
22.5 Applications of Consensus Clustering
564(2)
22.5.1 Gene Expression Data Analysis
564(1)
22.5.2 Image Segmentation
564(2)
22.6 Concluding Remarks
566(5)
23 Clustering Validation Measures 571(36)
Hui Xiong
Zhongmou Li
23.1 Introduction
572(1)
23.2 External Clustering Validation Measures
573(16)
23.2.1 An Overview of External Clustering Validation Measures
574(1)
23.2.2 Defective Validation Measures
575(2)
23.2.2.1 K-Means: The Uniform Effect
575(1)
23.2.2.2 A Necessary Selection Criterion
576(1)
23.2.2.3 The Cluster Validation Results
576(1)
23.2.2.4 The Issues with the Defective Measures
577(1)
23.2.2.5 Improving the Defective Measures
577(1)
23.2.3 Measure Normalization
577(7)
23.2.3.1 Normalizing the Measures
578(3)
23.2.3.2 The DCV Criterion
581(2)
23.2.3.3 The Effect of Normalization
583(1)
23.2.4 Measure Properties
584(5)
23.2.4.1 The Consistency Between Measures
584(2)
23.2.4.2 Properties of Measures
586(3)
23.2.4.3 Discussions
589(1)
23.3 Internal Clustering Validation Measures
589(12)
23.3.1 An Overview of Internal Clustering Validation Measures
589(3)
23.3.2 Understanding of Internal Clustering Validation Measures
592(8)
23.3.2.1 The Impact of Monotonicity
592(1)
23.3.2.2 The Impact of Noise
593(1)
23.3.2.3 The Impact of Density
594(1)
23.3.2.4 The Impact of Subclusters
595(1)
23.3.2.5 The Impact of Skewed Distributions
596(2)
23.3.2.6 The Impact of Arbitrary Shapes
598(2)
23.3.3 Properties of Measures
600(1)
23.4 Summary
601(6)
24 Educational and Software Resources for Data Clustering 607(10)
Charu C. Aggarwal
Chandan K. Reddy
24.1 Introduction
607(1)
24.2 Educational Resources
608(2)
24.2.1 Books on Data Clustering
608(1)
24.2.2 Popular Survey Papers on Data Clustering
608(2)
24.3 Software for Data Clustering
610(2)
24.3.1 Free and Open-Source Software
610(1)
24.3.1.1 General Clustering Software
610(1)
24.3.1.2 Specialized Clustering Software
610(1)
24.3.2 Commercial Packages
611(1)
24.3.3 Data Benchmarks for Software and Research
611(1)
24.4 Summary
612(5)
Index 617
Charu C. Aggarwal is a Research Scientist at the IBM T. J. Watson Research Center in Yorktown Heights, New York. He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from Massachusetts Institute of Technology in 1996. His research interest during his Ph.D. years was in combinatorial optimization (network flow algorithms), and his thesis advisor was Professor James B. Orlin. He has since worked in the field of performance analysis, databases, and data mining. He has published over 200 papers in refereed conferences and journals, and has applied for or been granted over 80 patents. He is author or editor of nine books, including this one. Because of the commercial value of the above-mentioned patents, he has received several invention achievement awards and has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of an IBM Research Division Award (2008) for his scientific contributions to data stream research. He has served on the program committees of most major database/data mining conferences, and served as program vice-chairs of the SIAM Conference on Data Mining, 2007, the IEEE ICDM Conference, 2007, the WWW Conference 2009, and the IEEE ICDM Conference, 2009. He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering Journal from 2004 to 2008. He is an associate editor of the ACM TKDD Journal, an action editor of the Data Mining and Knowledge Discovery Journal, an associate editor of the ACM SIGKDD Explorations, and an associate editor of the Knowledge and Information Systems Journal. He is a fellow of the IEEE for "contributions to knowledge discovery and data mining techniques", and a life-member of the ACM. Chandan K. Reddy is an Assistant Professor in the Department of Computer Science at Wayne State University. He received his PhD from Cornell University and MS from Michigan State University. His primary research interests are in the areas of data mining and machine learning with applications to healthcare, bioinformatics, and social network analysis. His research is funded by the National Science Foundation, Department of Transportation, and the Susan G. Komen for the Cure Foundation. He has published over 40 peer-reviewed articles in leading conferences and journals. He received the Best Application Paper Award at the ACM SIGKDD conference in 2010 and was a finalist of the INFORMS Franz Edelman Award Competition in 2011. He is a member of IEEE, ACM, and SIAM.