Klienditugi: 7440010 (E-R 10-18)

E-raamat: Data Clustering: Algorithms and Applications [Taylor & Francis e-raamat]

Edited by Charu C. Aggarwal (IBM Research, Yorktown Heights, New York, USA), Edited by Chandan K. Reddy (Wayne State University, Detroit, Michigan, USA)

Formaat: 652 pages, 36 Tables, black and white; 102 Illustrations, black and white
Sari: Chapman & Hall/CRC Data Mining and Knowledge Discovery Series
Ilmumisaeg: 21-Aug-2013
Kirjastus: CRC Press Inc
ISBN-13: 9781315373515

Teised raamatud teemal:

Taylor & Francis e-raamat
Hind: 184,65 €*
* hind, mis tagab piiramatu üheaegsete kasutajate arvuga ligipääsu piiramatuks ajaks
Tavahind: 263,78 €
Säästad 30%

Formaat: 652 pages, 36 Tables, black and white; 102 Illustrations, black and white
Sari: Chapman & Hall/CRC Data Mining and Knowledge Discovery Series
Ilmumisaeg: 21-Aug-2013
Kirjastus: CRC Press Inc
ISBN-13: 9781315373515

Teised raamatud teemal:

Rohkem infot Taylor & Francis e-raamatute kohta

Raamatu kodulehekülg: https://www.taylorfrancis.com/books/9781315373515

Research on the problem of clustering tends to be fragmented across the pattern recognition, database, data mining, and machine learning communities. Addressing this problem in a unified way, Data Clustering: Algorithms and Applications provides complete coverage of the entire area of clustering, from basic methods to more refined and complex data clustering approaches. It pays special attention to recent issues in graphs, social networks, and other domains.

The book focuses on three primary aspects of data clustering:

Methods, describing key techniques commonly used for clustering, such as feature selection, agglomerative clustering, partitional clustering, density-based clustering, probabilistic clustering, grid-based clustering, spectral clustering, and nonnegative matrix factorization Domains, covering methods used for different domains of data, such as categorical data, text data, multimedia data, graph data, biological data, stream data, uncertain data, time series clustering, high-dimensional clustering, and big data Variations and Insights, discussing important variations of the clustering process, such as semisupervised clustering, interactive clustering, multiview clustering, cluster ensembles, and cluster validation

In this book, top researchers from around the world explore the characteristics of clustering problems in a variety of application areas. They also explain how to glean detailed insight from the clustering processincluding how to verify the quality of the underlying clustersthrough supervision, human intervention, or the automated generation of alternative clusters.

Preface

xxi

Editor Biographies

xxiii

Contributors

xxv

1 An Introduction to Cluster Analysis

(28)

Charu C. Aggarwal

1.1 Introduction

(1)

1.2 Common Techniques Used in Cluster Analysis

(12)

1.2.1 Feature Selection Methods

(1)

1.2.2 Probabilistic and Generative Models

(1)

1.2.3 Distance-Based Algorithms

(2)

1.2.4 Density- and Grid-Based Methods

(1)

1.2.5 Leveraging Dimensionality Reduction Methods

(3)

1.2.5.1 Generative Models for Dimensionality Reduction

(1)

1.2.5.2 Matrix Factorization and Co-Clustering

(2)

1.2.5.3 Spectral Methods

(1)

1.2.6 The High Dimensional Scenario

(2)

1.2.7 Scalable Techniques for Cluster Analysis

(2)

1.2.7.1 I/O Issues in Database Management

(1)

1.2.7.2 Streaming Algorithms

(1)

1.2.7.3 The Big Data Framework

(1)

1.3 Data Types Studied in Cluster Analysis

(4)

1.3.1 Clustering Categorical Data

(1)

1.3.2 Clustering Text Data

(1)

1.3.3 Clustering Multimedia Data

(1)

1.3.4 Clustering Time-Series Data

(1)

1.3.5 Clustering Discrete Sequences

(1)

1.3.6 Clustering Network Data

(1)

1.3.7 Clustering Uncertain Data

(1)

1.4 Insights Gained from Different Variations of Cluster Analysis

(3)

1.4.1 Visual Insights

(1)

1.4.2 Supervised Insights

(1)

1.4.3 Multiview and Ensemble-Based Insights

(1)

1.4.4 Validation-Based Insights

(1)

1.5 Discussion and Conclusions

(7)

2 Feature Selection for Clustering: A Review

(32)

Salem Alelyani

Jiliang Tang

Huan Liu

2.1 Introduction

(5)

2.1.1 Data Clustering

(1)

2.1.2 Feature Selection

(1)

2.1.3 Feature Selection for Clustering

(2)

2.1.3.1 Filter Model

(1)

2.1.3.2 Wrapper Model

(1)

2.1.3.3 Hybrid Model

(1)

2.2 Feature Selection for Clustering

(18)

2.2.1 Algorithms for Generic Data

(5)

2.2.1.1 Spectral Feature Selection (SPEC)

(1)

2.2.1.2 Laplacian Score (LS)

(1)

2.2.1.3 Feature Selection for Sparse Clustering

(1)

2.2.1.4 Localized Feature Selection Based on Scatter Separability (LFSBSS)

(1)

2.2.1.5 Multicluster Feature Selection (MCFS)

(1)

2.2.1.6 Feature Weighting k-Means

(1)

2.2.2 Algorithms for Text Data

(6)

2.2.2.1 Term Frequency (TF)

(1)

2.2.2.2 Inverse Document Frequency (IDF)

(1)

2.2.2.3 Term Frequency-Inverse Document Frequency (TF-IDF)

(1)

2.2.2.4 Chi Square Statistic

(2)

2.2.2.5 Frequent Term-Based Text Clustering

(1)

2.2.2.6 Frequent Term Sequence

(2)

2.2.3 Algorithms for Streaming Data

(3)

2.2.3.1 Text Stream Clustering Based on Adaptive Feature Selection (TSC-AFS)

(1)

2.2.3.2 High-Dimensional Projected Stream Clustering (HPStream)

(2)

2.2.4 Algorithms for Linked Data

(3)

2.2.4.1 Challenges and Opportunities

(1)

2.2.4.2 LUFS: An Unsupervised Feature Selection Framework for Linked Data

(1)

2.2.4.3 Conclusion and Future Work for Linked Data

(1)

2.3 Discussions and Challenges

(8)

2.3.1 The Chicken or the Egg Dilemma

(1)

2.3.2 Model Selection: K and l

(1)

2.3.3 Scalability

(1)

2.3.4 Stability

(6)

3 Probabilistic Models for Clustering

(26)

Hongbo Deng

Jiawei Han

3.1 Introduction

(1)

3.2 Mixture Models

(7)

3.2.1 Overview

(2)

3.2.2 Gaussian Mixture Model

(3)

3.2.3 Bernoulli Mixture Model

(1)

3.2.4 Model Selection Criteria

(1)

3.3 EM Algorithm and Its Variations

(7)

3.3.1 The General EM Algorithm

(4)

3.3.2 Mixture Models Revisited

(2)

3.3.3 Limitations of the EM Algorithm

(1)

3.3.4 Applications of the EM Algorithm

(1)

3.4 Probabilistic Topic Models

(5)

3.4.1 Probabilistic Latent Semantic Analysis

(2)

3.4.2 Latent Dirichlet Allocation

(2)

3.4.3 Variations and Extensions

(1)

3.5 Conclusions and Summary

(6)

4 A Survey of Partitional and Hierarchical Clustering Algorithms

(24)

Chandan K. Reddy

Bhanukiran Vinzamuri

4.1 Introduction

(1)

4.2 Partitional Clustering Algorithms

(11)

4.2.1 K-Means Clustering

(1)

4.2.2 Minimization of Sum of Squared Errors

(1)

4.2.3 Factors Affecting K-Means

(2)

4.2.3.1 Popular Initialization Methods

(1)

4.2.3.2 Estimating the Number of Clusters

(1)

4.2.4 Variations of K-Means

(7)

4.2.4.1 K-Medoids Clustering

(1)

4.2.4.2 K-Medians Clustering

(1)

4.2.4.3 K-Modes Clustering

(1)

4.2.4.4 Fuzzy K-Means Clustering

(1)

4.2.4.5 X-Means Clustering

(1)

4.2.4.6 Intelligent K-Means Clustering

(1)

4.2.4.7 Bisecting K-Means Clustering

(1)

4.2.4.8 Kernel K-Means Clustering

(1)

4.2.4.9 Mean Shift Clustering

(1)

4.2.4.10 Weighted K-Means Clustering

(1)

4.2.4.11 Genetic K-Means Clustering

(1)

4.2.5 Making K-Means Faster

100

(1)

4.3 Hierarchical Clustering Algorithms

100

(6)

4.3.1 Agglomerative Clustering

101

(3)

4.3.1.1 Single and Complete Link

101

(1)

4.3.1.2 Group Averaged and Centroid Agglomerative Clustering .

102

(1)

4.3.1.3 Ward's Criterion

103

(1)

4.3.1.4 Agglomerative Hierarchical Clustering Algorithm

103

(1)

4.3.1.5 Lance-Williams Dissimilarity Update Formula

103

(1)

4.3.2 Divisive Clustering

104

(2)

4.3.2.1 Issues in Divisive Clustering

104

(1)

4.3.2.2 Divisive Hierarchical Clustering Algorithm

105

(1)

4.3.2.3 Minimum Spanning Tree-Based Clustering

105

(1)

4.3.3 Other Hierarchical Clustering Algorithms

106

(1)

4.4 Discussion and Summary

106

(5)

5 Density-Based Clustering

111

(16)

Martin Ester

5.1 Introduction

111

(2)

5.2 DBSCAN

113

(2)

5.3 DENCLUE

115

(1)

5.4 OPTICS

116

(1)

5.5 Other Algorithms

116

(2)

5.6 Subspace Clustering

118

(2)

5.7 Clustering Networks

120

(3)

5.8 Other Directions

123

(1)

5.9 Conclusion

124

(3)

6 Grid-Based Clustering

127

(22)

Wei Cheng

Wei Wang

Sandra Batista

6.1 Introduction

128

(3)

6.2 The Classical Algorithms

131

(4)

6.2.1 Earliest Approaches: GRIDCLUS and BANG

131

(1)

6.2.2 STING and STING+: The Statistical Information Grid Approach

132

(2)

6.2.3 WaveCluster: Wavelets in Grid-Based Clustering

134

(1)

6.3 Adaptive Grid-Based Algorithms

135

(1)

6.3.1 AMR: Adaptive Mesh Refinement Clustering

135

(1)

6.4 Axis-Shifting Grid-Based Algorithms

136

(3)

6.4.1 NSGC: New Shifting Grid Clustering Algorithm

136

(1)

6.4.2 ADCC: Adaptable Deflect and Conquer Clustering

137

(1)

6.4.3 ASGC: Axis-Shifted Grid-Clustering

137

(1)

6.4.4 GDILC: Grid-Based Density-IsoLine Clustering Algorithm

138

(1)

6.5 High-Dimensional Algorithms

139

(6)

6.5.1 CLIQUE: The Classical High-Dimensional Algorithm

139

(1)

6.5.2 Variants of CLIQUE

140

(1)

6.5.2.1 ENCLUS: Entropy-Based Approach

140

(1)

6.5.2.2 MAFIA: Adaptive Grids in High Dimensions

141

(1)

6.5.3 OptiGrid: Density-Based Optimal Grid Partitioning

141

(2)

6.5.4 Variants of the OptiGrid Approach

143

(7)

6.5.4.1 0-Cluster: A Scalable Approach

143

(1)

6.5.4.2 CBF: Cell-Based Filtering

144

(1)

6.6 Conclusions and Summary

145

(4)

7 Nonnegative Matrix Factorizations for Clustering: A Survey

149

(28)

Tao Li

Chris Ding

7.1 Introduction

150

(1)

7.1.1 Background

150

(1)

7.1.2 NMF Formulations

151

(1)

7.2 NMF for Clustering: Theoretical Foundations

151

(2)

7.2.1 NMF and K-Means Clustering

151

(1)

7.2.2 NMF and Probabilistic Latent Semantic Indexing

152

(1)

7.2.3 NMF and Kernel K-Means and Spectral Clustering

152

(1)

7.2.4 NMF Boundedness Theorem

153

(1)

7.3 NMF Clustering Capabilities

153

(2)

7.3.1 Examples

153

(1)

7.3.2 Analysis

153

(2)

7.4 NMF Algorithms

155

(3)

7.4.1 Introduction

155

(1)

7.4.2 Algorithm Development

155

(1)

7.4.3 Practical Issues in NMF Algorithms

156

(5)

7.4.3.1 Initialization

156

(1)

7.4.3.2 Stopping Criteria

156

(1)

7.4.3.3 Objective Function vs. Clustering Performance

157

(1)

7.4.3.4 Scalability

157

(1)

7.5 NMF Related Factorizations

158

(3)

7.6 NMF for Clustering: Extensions

161

(4)

7.6.1 Co-Clustering

161

(1)

7.6.2 Semisupervised Clustering

162

(1)

7.6.3 Semisupervised Co-Clustering

162

(1)

7.6.4 Consensus Clustering

163

(1)

7.6.5 Graph Clustering

164

(1)

7.6.6 Other Clustering Extensions

164

(1)

7.7 Conclusions

165

(12)

8 Spectral Clustering

177

(24)

Jialu Liu

Jiawei Han

8.1 Introduction

177

(2)

8.2 Similarity Graph

179

(1)

8.3 Unnormalized Spectral Clustering

180

(2)

8.3.1 Notation

180

(1)

8.3.2 Unnormalized Graph Laplacian

180

(1)

8.3.3 Spectrum Analysis

181

(1)

8.3.4 Unnormalized Spectral Clustering Algorithm

182

(1)

8.4 Normalized Spectral Clustering

182

(3)

8.4.1 Normalized Graph Laplacian

183

(1)

8.4.2 Spectrum Analysis

184

(1)

8.4.3 Normalized Spectral Clustering Algorithm

184

(1)

8.5 Graph Cut View

185

(3)

8.5.1 Ratio Cut Relaxation

186

(1)

8.5.2 Normalized Cut Relaxation

187

(1)

8.6 Random Walks View

188

(1)

8.7 Connection to Laplacian Eigenmap

189

(2)

8.8 Connection to Kernel k-Means and Nonnegative Matrix Factorization

191

(1)

8.9 Large Scale Spectral Clustering

192

(2)

8.10 Further Reading

194

(7)

9 Clustering High-Dimensional Data

201

(30)

Arthur Zimek

9.1 Introduction

201

(1)

9.2 The "Curse of Dimensionality"

202

(4)

9.2.1 Different Aspects of the "Curse"

202

(4)

9.2.2 Consequences

206

(1)

9.3 Clustering Tasks in Subspaces of High-Dimensional Data

206

(2)

9.3.1 Categories of Subspaces

206

(1)

9.3.1.1 Axis-Parallel Subspaces

206

(1)

9.3.1.2 Arbitrarily Oriented Subspaces

207

(1)

9.3.1.3 Special Cases

207

(1)

9.3.2 Search Spaces for the Clustering Problem

207

(1)

9.4 Fundamental Algorithmic Ideas

208

(10)

9.4.1 Clustering in Axis-Parallel Subspaces

208

(7)

9.4.1.1 Cluster Model

208

(1)

9.4.1.2 Basic Techniques

208

(2)

9.4.1.3 Clustering Algorithms

210

(5)

9.4.2 Clustering in Arbitrarily Oriented Subspaces

215

(18)

9.4.2.1 Cluster Model

215

(1)

9.4.2.2 Basic Techniques and Example Algorithms

216

(2)

9.5 Open Questions and Current Research Directions

218

(1)

9.6 Conclusion

219

(12)

10 A Survey of Stream Clustering Algorithms

231

(28)

Charu C. Aggarwal

10.1 Introduction

231

(2)

10.2 Methods Based on Partitioning Representatives

233

(6)

10.2.1 The STREAM Algorithm

233

(2)

10.2.2 CluStream: The Microclustering Framework

235

(4)

10.2.2.1 Microcluster Definition

235

(1)

10.2.2.2 Pyramidal Time Frame

236

(1)

10.2.2.3 Online Clustering with CluStream

237

(2)

10.3 Density-Based Stream Clustering

239

(4)

10.3.1 DenStream: Density-Based Microclustering

240

(1)

10.3.2 Grid-Based Streaming Algorithms

241

(2)

10.3.2.1 D-Stream Algorithm

241

(1)

10.3.2.2 Other Grid-Based Algorithms

242

(1)

10.4 Probabilistic Streaming Algorithms

243

(1)

10.5 Clustering High-Dimensional Streams

243

(2)

10.5.1 The HPSTREAM Method

244

(1)

10.5.2 Other High-Dimensional Streaming Algorithms

244

(1)

10.6 Clustering Discrete and Categorical Streams

245

(4)

10.6.1 Clustering Binary Data Streams with k-Means

245

(1)

10.6.2 The StreamCluCD Algorithm

245

(1)

10.6.3 Massive-Domain Clustering

246

(3)

10.7 Text Stream Clustering

249

(3)

10.8 Other Scenarios for Stream Clustering

252

(2)

10.8.1 Clustering Uncertain Data Streams

253

(1)

10.8.2 Clustering Graph Streams

253

(1)

10.8.3 Distributed Clustering of Data Streams

254

(1)

10.9 Discussion and Conclusions

254

(5)

11 Big Data Clustering

259

(18)

Hanghang Tong

U. Kang

11.1 Introduction

259

(1)

11.2 One-Pass Clustering Algorithms

260

(3)

11.2.1 CLARANS: Fighting with Exponential Search Space

260

(1)

11.2.2 BIRCH: Fighting with Limited Memory

261

(2)

11.2.3 CURE: Fighting with the Irregular Clusters

263

(1)

11.3 Randomized Techniques for Clustering Algorithms

263

(5)

11.3.1 Locality-Preserving Projection

264

(2)

11.3.2 Global Projection

266

(2)

11.4 Parallel and Distributed Clustering Algorithms

268

(6)

11.4.1 General Framework

268

(1)

11.4.2 DBDC: Density-Based Clustering

269

(1)

11.4.3 ParMETIS: Graph Partitioning

269

(1)

11.4.4 PKMeans: K-Means with MapReduce

270

(1)

11.4.5 DisCo: Co-Clustering with MapReduce

271

(1)

11.4.6 BoW: Subspace Clustering with MapReduce

272

(2)

11.5 Conclusion

274

(3)

12 Clustering Categorical Data

277

(28)

Bill Andreopoulos

12.1 Introduction

278

(1)

12.2 Goals of Categorical Clustering

279

(3)

12.2.1 Clustering Road Map

280

(2)

12.3 Similarity Measures for Categorical Data

282

(2)

12.3.1 The Hamming Distance in Categorical and Binary Data

282

(1)

12.3.2 Probabilistic Measures

283

(1)

12.3.3 Information-Theoretic Measures

283

(1)

12.3.4 Context-Based Similarity Measures

284

(1)

12.4 Descriptions of Algorithms

284

(14)

12.4.1 Partition-Based Clustering

284

(3)

12.4.1.1 k-Modes

284

(1)

12.4.1.2 k-Prototypes (Mixed Categorical and Numerical)

285

(1)

12.4.1.3 Fuzzy k-Modes

286

(1)

12.4.1.4 Squeezer

286

(1)

12.4.1.5 COOLCAT

286

(1)

12.4.2 Hierarchical Clustering

287

(2)

12.4.2.1 ROCK

287

(1)

12.4.2.2 COBWEB

288

(1)

12.4.2.3 LIMBO

289

(1)

12.4.3 Density-Based Clustering

289

(7)

12.4.3.1 Projected (Subspace) Clustering

290

(1)

12.4.3.2 CACTUS

290

(1)

12.4.3.3 CLICKS

291

(1)

12.4.3.4 STIRR

291

(1)

12.4.3.5 CLOPE

292

(1)

12.4.3.6 HIERDENC: Hierarchical Density-Based Clustering

292

(1)

12.4.3.7 MULIC: Multiple Layer Incremental Clustering

293

(3)

12.4.4 Model-Based Clustering

296

(10)

12.4.4.1 BILCOM Empirical Bayesian (Mixed Categorical and Numerical)

296

(1)

12.4.4.2 AutoClass (Mixed Categorical and Numerical)

296

(1)

12.4.4.3 SVM Clustering (Mixed Categorical and Numerical)

297

(1)

12.5 Conclusion

298

(7)

13 Document Clustering: The Next Frontier

305

(34)

David C. Anastasiu

Andrea Tagarelli

George Karypis

13.1 Introduction

306

(1)

13.2 Modeling a Document

306

(5)

13.2.1 Preliminaries

306

(1)

13.2.2 The Vector Space Model

307

(2)

13.2.3 Alternate Document Models

309

(1)

13.2.4 Dimensionality Reduction for Text

309

(1)

13.2.5 Characterizing Extremes

310

(1)

13.3 General Purpose Document Clustering

311

(4)

13.3.1 Similarity/Dissimilarity-Based Algorithms

311

(1)

13.3.2 Density-Based Algorithms

312

(1)

13.3.3 Adjacency-Based Algorithms

313

(1)

13.3.4 Generative Algorithms

313

(2)

13.4 Clustering Long Documents

315

(8)

13.4.1 Document Segmentation

315

(2)

13.4.2 Clustering Segmented Documents

317

(4)

13.4.3 Simultaneous Segment Identification and Clustering

321

(2)

13.5 Clustering Short Documents

323

(5)

13.5.1 General Methods for Short Document Clustering

323

(1)

13.5.2 Clustering with Knowledge Infusion

324

(1)

13.5.3 Clustering Web Snippets

325

(1)

13.5.4 Clustering Microblogs

326

(2)

13.6 Conclusion

328

(11)

14 Clustering Multimedia Data

339

(18)

Shen-Fu Tsai

Guo-Jun Qi

Shiyu Chang

Min-Hsuan Tsai

Thomas S. Huang

14.1 Introduction

340

(1)

14.2 Clustering with Image Data

340

(7)

14.2.1 Visual Words Learning

341

(1)

14.2.2 Face Clustering and Annotation

342

(1)

14.2.3 Photo Album Event Recognition

343

(1)

14.2.4 Image Segmentation

344

(1)

14.2.5 Large-Scale Image Classification

345

(2)

14.3 Clustering with Video and Audio Data

347

(4)

14.3.1 Video Summarization

348

(1)

14.3.2 Video Event Detection

349

(1)

14.3.3 Video Story Clustering

350

(1)

14.3.4 Music Summarization

350

(1)

14.4 Clustering with Multimodal Data

351

(2)

14.5 Summary and Future Directions

353

(4)

15 Time-Series Data Clustering

357

(24)

Dimitrios Kotsakos

Goce Trajcevski

Dimitrios Gunopulos

Charu C. Aggarwal

15.1 Introduction

358

(1)

15.2 The Diverse Formulations for Time-Series Clustering

359

(1)

15.3 Online Correlation-Based Clustering

360

(3)

15.3.1 Selective Muscles and Related Methods

361

(1)

15.3.2 Sensor Selection Algorithms for Correlation Clustering

362

(1)

15.4 Similarity and Distance Measures

363

(6)

15.4.1 Univariate Distance Measures

363

(3)

15.4.1.1 LP Distance

363

(1)

15.4.1.2 Dynamic Time Warping Distance

364

(1)

15.4.1.3 EDIT Distance

365

(1)

15.4.1.4 Longest Common Subsequence

365

(1)

15.4.2 Multivariate Distance Measures

366

(3)

15.4.2.1 Multidimensional LP Distance

366

(1)

15.4.2.2 Multidimensional DTW

367

(1)

15.4.2.3 Multidimensional LCSS

368

(1)

15.4.2.4 Multidimensional Edit Distance

368

(1)

15.4.2.5 Multidimensional Subsequence Matching

368

(1)

15.5 Shape-Based Time-Series Clustering Techniques

369

(5)

15.5.1 k-Means Clustering

370

(1)

15.5.2 Hierarchical Clustering

371

(1)

15.5.3 Density-Based Clustering

372

(1)

15.5.4 Trajectory Clustering

372

(2)

15.6 Time-Series Clustering Applications

374

(1)

15.7 Conclusions

375

(6)

16 Clustering Biological Data

381

(34)

Chandan K. Reddy

Mohammad Al Hasan

Mohammed J. Zaki

16.1 Introduction

382

(1)

16.2 Clustering Microarray Data

383

(11)

16.2.1 Proximity Measures

383

(1)

16.2.2 Categorization of Algorithms

384

(1)

16.2.3 Standard Clustering Algorithms

385

(3)

16.2.3.1 Hierarchical Clustering

385

(1)

16.2.3.2 Probabilistic Clustering

386

(1)

16.2.3.3 Graph-Theoretic Clustering

386

(1)

16.2.3.4 Self-Organizing Maps

387

(1)

16.2.3.5 Other Clustering Methods

387

(1)

16.2.4 Biclustering

388

(3)

16.2.4.1 Types and Structures of Biclusters

389

(1)

16.2.4.2 Biclustering Algorithms

390

(1)

16.2.4.3 Recent Developments

391

(1)

16.2.5 Triclustering

391

(1)

16.2.6 Time-Series Gene Expression Data Clustering

392

(1)

16.2.7 Cluster Validation

393

(1)

16.3 Clustering Biological Networks

394

(3)

16.3.1 Characteristics of PPI Network Data

394

(1)

16.3.2 Network Clustering Algorithms

394

(3)

16.3.2.1 Molecular Complex Detection

394

(1)

16.3.2.2 Markov Clustering

395

(1)

16.3.2.3 Neighborhood Search Methods

395

(1)

16.3.2.4 Clique Percolation Method

395

(1)

16.3.2.5 Ensemble Clustering

396

(1)

16.3.2.6 Other Clustering Methods

396

(1)

16.3.3 Cluster Validation and Challenges

397

(1)

16.4 Biological Sequence Clustering

397

(6)

16.4.1 Sequence Similarity Metrics

397

(2)

16.4.1.1 Alignment-Based Similarity

398

(1)

16.4.1.2 Keyword-Based Similarity

398

(1)

16.4.1.3 Kernel-Based Similarity

399

(1)

16.4.1.4 Model-Based Similarity

399

(1)

16.4.2 Sequence Clustering Algorithms

399

(20)

16.4.2.1 Subsequence-Based Clustering

399

(1)

16.4.2.2 Graph-Based Clustering

400

(2)

16.4.2.3 Probabilistic Models

402

(1)

16.4.2.4 Suffix Tree and Suffix Array-Based Method

403

(1)

16.5 Software Packages

403

(2)

16.6 Discussion and Summary

405

(10)

17 Network Clustering

415

(42)

Srinivasan Parthasarathy

S.M. Faisal

17.1 Introduction

416

(1)

17.2 Background and Nomenclature

417

(1)

17.3 Problem Definition

417

(1)

17.4 Common Evaluation Criteria

418

(1)

17.5 Partitioning with Geometric Information

419

(2)

17.5.1 Coordinate Bisection

419

(1)

17.5.2 Inertial Bisection

419

(1)

17.5.3 Geometric Partitioning

420

(1)

17.6 Graph Growing and Greedy Algorithms

421

(2)

17.6.1 Kernighan-Lin Algorithm

422

(1)

17.7 Agglomerative and Divisive Clustering

423

(1)

17.8 Spectral Clustering

424

(4)

17.8.1 Similarity Graphs

425

(1)

17.8.2 Types of Similarity Graphs

425

(1)

17.8.3 Graph Laplacians

426

(1)

17.8.3.1 Unnormalized Graph Laplacian

426

(1)

17.8.3.2 Normalized Graph Laplacians

427

(1)

17.8.4 Spectral Clustering Algorithms

427

(1)

17.9 Markov Clustering

428

(2)

17.9.1 Regularized MCL (RMCL): Improvement over MCL

429

(1)

17.10 Multilevel Partitioning

430

(2)

17.11 Local Partitioning Algorithms

432

(1)

17.12 Hypergraph Partitioning

433

(2)

17.13 Emerging Methods for Partitioning Special Graphs

435

(8)

17.13.1 Bipartite Graphs

435

(1)

17.13.2 Dynamic Graphs

436

(1)

17.13.3 Heterogeneous Networks

437

(1)

17.13.4 Directed Networks

438

(1)

17.13.5 Combining Content and Relationship Information

439

(1)

17.13.6 Networks with Overlapping Communities

440

(2)

17.13.7 Probabilistic Methods

442

(1)

17.14 Conclusion

443

(14)

18 A Survey of Uncertain Data Clustering Algorithms

457

(26)

Charu C. Aggarwal

18.1 Introduction

457

(2)

18.2 Mixture Model Clustering of Uncertain Data

459

(1)

18.3 Density-Based Clustering Algorithms

460

(2)

18.3.1 FDBSCAN Algorithm

460

(1)

18.3.2 FOPTICS Algorithm

461

(1)

18.4 Partitional Clustering Algorithms

462

(4)

18.4.1 The UK-Means Algorithm

462

(1)

18.4.2 The CK-Means Algorithm

463

(1)

18.4.3 Clustering Uncertain Data with Voronoi Diagrams

464

(1)

18.4.4 Approximation Algorithms for Clustering Uncertain Data

464

(1)

18.4.5 Speeding Up Distance Computations

465

(1)

18.5 Clustering Uncertain Data Streams

466

(6)

18.5.1 The UMicro Algorithm

466

(5)

18.5.2 The LuMicro Algorithm

471

(1)

18.5.3 Enhancements to Stream Clustering

471

(1)

18.6 Clustering Uncertain Data in High Dimensionality

472

(5)

18.6.1 Subspace Clustering of Uncertain Data

473

(1)

18.6.2 UPStream: Projected Clustering of Uncertain Data Streams

474

(3)

18.7 Clustering with the Possible Worlds Model

477

(1)

18.8 Clustering Uncertain Graphs

478

(1)

18.9 Conclusions and Summary

478

(5)

19 Concepts of Visual and Interactive Clustering

483

(22)

Alexander Hinneburg

19.1 Introduction

483

(1)

19.2 Direct Visual and Interactive Clustering

484

(7)

19.2.1 Scatterplots

485

(3)

19.2.2 Parallel Coordinates

488

(3)

19.2.3 Discussion

491

(1)

19.3 Visual Interactive Steering of Clustering

491

(4)

19.3.1 Visual Assessment of Convergence of Clustering Algorithm

491

(1)

19.3.2 Interactive Hierarchical Clustering

492

(2)

19.3.3 Visual Clustering with SOMs

494

(1)

19.3.4 Discussion

494

(1)

19.4 Interactive Comparison and Combination of Clusterings

495

(2)

19.4.1 Space of Clusterings

495

(2)

19.4.2 Visualization

497

(1)

19.4.3 Discussion

497

(1)

19.5 Visualization of Clusters for Sense-Making

497

(3)

19.6 Summary

500

(5)

20 Semisupervised Clustering

505

(30)

Amrudin Agovic

Arindam Banerjee

20.1 Introduction

506

(1)

20.2 Clustering with Pointwise and Pairwise Semisupervision

507

(6)

20.2.1 Semisupervised Clustering Based on Seeding

507

(1)

20.2.2 Semisupervised Clustering Based on Pairwise Constraints

508

(3)

20.2.3 Active Learning for Semisupervised Clustering

511

(1)

20.2.4 Semisupervised Clustering Based on User Feedback

512

(1)

20.2.5 Semisupervised Clustering Based on Nonnegative Matrix Factorization

513

(1)

20.3 Semisupervised Graph Cuts

513

(4)

20.3.1 Semisupervised Unnormalized Cut

515

(1)

20.3.2 Semisupervised Ratio Cut

515

(1)

20.3.3 Semisupervised Normalized Cut

516

(1)

20.4 A Unified View of Label Propagation

517

(4)

20.4.1 Generalized Label Propagation

517

(1)

20.4.2 Gaussian Fields

517

(1)

20.4.3 Tikhonov Regularization (TIKREG)

518

(1)

20.4.4 Local and Global Consistency

518

(1)

20.4.5 Related Methods

519

(2)

20.4.5.1 Cluster Kernels

519

(1)

20.4.5.2 Gaussian Random Walks EM (GWEM)

519

(1)

20.4.5.3 Linear Neighborhood Propagation

520

(1)

20.4.6 Label Propagation and Green's Function

521

(1)

20.4.7 Label Propagation and Semisupervised Graph Cuts

521

(1)

20.5 Semisupervised Embedding

521

(3)

20.5.1 Nonlinear Manifold Embedding

522

(1)

20.5.2 Semisupervised Embedding

522

(2)

20.5.2.1 Unconstrained Semisupervised Embedding

523

(1)

20.5.2.2 Constrained Semisupervised Embedding

523

(1)

20.6 Comparative Experimental Analysis

524

(6)

20.6.1 Experimental Results

524

(5)

20.6.2 Semisupervised Embedding Methods

529

(1)

20.7 Conclusions

530

(5)

21 Alternative Clustering Analysis: A Review

535

(16)

James Bailey

21.1 Introduction

535

(2)

21.2 Technical Preliminaries

537

(1)

21.3 Multiple Clustering Analysis Using Alternative Clusterings

538

(7)

21.3.1 Alternative Clustering Algorithms: A Taxonomy

538

(1)

21.3.2 Unguided Generation

539

(2)

21.3.2.1 Naive

539

(1)

21.3.2.2 Meta Clustering

539

(1)

21.3.2.3 Eigenvectors of the Laplacian Matrix

540

(1)

21.3.2.4 Decorrelated k-Means and Convolutional EM

540

(1)

21.3.2.5 CAMI

540

(1)

21.3.3 Guided Generation with Constraints

541

(2)

21.3.3.1 COALA

541

(1)

21.3.3.2 Constrained Optimization Approach

541

(1)

21.3.3.3 MAXIMUS

542

(1)

21.3.4 Orthogonal Transformation Approaches

543

(1)

21.3.4.1 Orthogonal Views

543

(1)

21.3.4.2 ADFT

543

(1)

21.3.5 Information Theoretic

544

(14)

21.3.5.1 Conditional Information Bottleneck (CIB)

544

(1)

21.3.5.2 Conditional Ensemble Clustering

544

(1)

21.3.5.3 NACI

544

(1)

21.3.5.4 mSC

545

(1)

21.4 Connections to Multiview Clustering and Subspace Clustering

545

(2)

21.5 Future Research Issues

547

(1)

21.6 Summary

547

(4)

22 Cluster Ensembles: Theory and Applications

551

(20)

Joydeep Ghosh

Ayan Acharya

22.1 Introduction

551

(3)

22.2 The Cluster Ensemble Problem

554

(1)

22.3 Measuring Similarity Between Clustering Solutions

555

(3)

22.4 Cluster Ensemble Algorithms

558

(6)

22.4.1 Probabilistic Approaches to Cluster Ensembles

558

(2)

22.4.1.1 A Mixture Model for Cluster Ensembles (MMCE)

558

(1)

22.4.1.2 Bayesian Cluster Ensembles (BCE)

558

(1)

22.4.1.3 Nonparametric Bayesian Cluster Ensembles (NPBCE)

559

(1)

22.4.2 Pairwise Similarity-Based Approaches

560

(2)

22.4.2.1 Methods Based on Ensemble Co-Association Matrix

560

(2)

22.4.2.2 Relating Consensus Clustering to Other Optimization Formulations

562

(1)

22.4.3 Direct Approaches Using Cluster Labels

562

(2)

22.4.3.1 Graph Partitioning

562

(1)

22.4.3.2 Cumulative Voting

563

(1)

22.5 Applications of Consensus Clustering

564

(2)

22.5.1 Gene Expression Data Analysis

564

(1)

22.5.2 Image Segmentation

564

(2)

22.6 Concluding Remarks

566

(5)

23 Clustering Validation Measures

571

(36)

Hui Xiong

Zhongmou Li

23.1 Introduction

572

(1)

23.2 External Clustering Validation Measures

573

(16)

23.2.1 An Overview of External Clustering Validation Measures

574

(1)

23.2.2 Defective Validation Measures

575

(2)

23.2.2.1 K-Means: The Uniform Effect

575

(1)

23.2.2.2 A Necessary Selection Criterion

576

(1)

23.2.2.3 The Cluster Validation Results

576

(1)

23.2.2.4 The Issues with the Defective Measures

577

(1)

23.2.2.5 Improving the Defective Measures

577

(1)

23.2.3 Measure Normalization

577

(7)

23.2.3.1 Normalizing the Measures

578

(3)

23.2.3.2 The DCV Criterion

581

(2)

23.2.3.3 The Effect of Normalization

583

(1)

23.2.4 Measure Properties

584

(5)

23.2.4.1 The Consistency Between Measures

584

(2)

23.2.4.2 Properties of Measures

586

(3)

23.2.4.3 Discussions

589

(1)

23.3 Internal Clustering Validation Measures

589

(12)

23.3.1 An Overview of Internal Clustering Validation Measures

589

(3)

23.3.2 Understanding of Internal Clustering Validation Measures

592

(8)

23.3.2.1 The Impact of Monotonicity

592

(1)

23.3.2.2 The Impact of Noise

593

(1)

23.3.2.3 The Impact of Density

594

(1)

23.3.2.4 The Impact of Subclusters

595

(1)

23.3.2.5 The Impact of Skewed Distributions

596

(2)

23.3.2.6 The Impact of Arbitrary Shapes

598

(2)

23.3.3 Properties of Measures

600

(1)

23.4 Summary

601

(6)

24 Educational and Software Resources for Data Clustering

607

(10)

Charu C. Aggarwal

Chandan K. Reddy

24.1 Introduction

607

(1)

24.2 Educational Resources

608

(2)

24.2.1 Books on Data Clustering

608

(1)

24.2.2 Popular Survey Papers on Data Clustering

608

(2)

24.3 Software for Data Clustering

610

(2)

24.3.1 Free and Open-Source Software

610

(1)

24.3.1.1 General Clustering Software

610

(1)

24.3.1.2 Specialized Clustering Software

610

(1)

24.3.2 Commercial Packages

611

(1)

24.3.3 Data Benchmarks for Software and Research

611

(1)

24.4 Summary

612

(5)

Index

617

Charu C. Aggarwal is a Research Scientist at the IBM T. J. Watson Research Center in Yorktown Heights, New York. He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from Massachusetts Institute of Technology in 1996. His research interest during his Ph.D. years was in combinatorial optimization (network flow algorithms), and his thesis advisor was Professor James B. Orlin. He has since worked in the field of performance analysis, databases, and data mining. He has published over 200 papers in refereed conferences and journals, and has applied for or been granted over 80 patents. He is author or editor of nine books, including this one. Because of the commercial value of the above-mentioned patents, he has received several invention achievement awards and has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, and a recipient of an IBM Research Division Award (2008) for his scientific contributions to data stream research. He has served on the program committees of most major database/data mining conferences, and served as program vice-chairs of the SIAM Conference on Data Mining, 2007, the IEEE ICDM Conference, 2007, the WWW Conference 2009, and the IEEE ICDM Conference, 2009. He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering Journal from 2004 to 2008. He is an associate editor of the ACM TKDD Journal, an action editor of the Data Mining and Knowledge Discovery Journal, an associate editor of the ACM SIGKDD Explorations, and an associate editor of the Knowledge and Information Systems Journal. He is a fellow of the IEEE for "contributions to knowledge discovery and data mining techniques", and a life-member of the ACM. Chandan K. Reddy is an Assistant Professor in the Department of Computer Science at Wayne State University. He received his PhD from Cornell University and MS from Michigan State University. His primary research interests are in the areas of data mining and machine learning with applications to healthcare, bioinformatics, and social network analysis. His research is funded by the National Science Foundation, Department of Transportation, and the Susan G. Komen for the Cure Foundation. He has published over 40 peer-reviewed articles in leading conferences and journals. He received the Best Application Paper Award at the ACM SIGKDD conference in 2010 and was a finalist of the INFORMS Franz Edelman Award Competition in 2011. He is a member of IEEE, ACM, and SIAM.

Püsilink: https://www.kriso.ee/db/9781315373515_pe.html

Märksõnad:

E-raamat: Data Clustering: Algorithms and Applications [Taylor & Francis e-raamat]

Konto & seaded

Otsing

Otsingu andmebaas

Filtreeri tulemusi

Teemad Kirjastuste teemad

Vali ostukorv