|
|
1 | (16) |
|
1.1 What is the World Wide Web? |
|
|
1 | (1) |
|
1.2 A Brief History of the Web and the Internet |
|
|
2 | (2) |
|
|
4 | (4) |
|
1.3.1 What is Data Mining? |
|
|
6 | (1) |
|
1.3.2 What is Web Mining? |
|
|
7 | (1) |
|
|
8 | (3) |
|
1.5 How to Read this Book |
|
|
11 | (1) |
|
|
12 | (1) |
|
|
13 | (4) |
|
Part I Data Mining Foundations |
|
|
|
2 Association Rules and Sequential Patterns |
|
|
17 | (46) |
|
2.1 Basic Concepts of Association Rules |
|
|
17 | (3) |
|
|
20 | (6) |
|
2.2.1 Frequent Itemset Generation |
|
|
20 | (4) |
|
2.2.2 Association Rule Generation |
|
|
24 | (2) |
|
2.3 Data Formats for Association Rule Mining |
|
|
26 | (1) |
|
2.4 Mining with Multiple Minimum Supports |
|
|
26 | (10) |
|
|
28 | (2) |
|
|
30 | (5) |
|
|
35 | (1) |
|
2.5 Mining Class Association Rules |
|
|
36 | (5) |
|
|
36 | (2) |
|
|
38 | (3) |
|
2.5.3 Mining with Multiple Minimum Supports |
|
|
41 | (1) |
|
2.6 Basic Concepts of Sequential Patterns |
|
|
41 | (2) |
|
2.7 Mining Sequential Patterns Based on GSP |
|
|
43 | (6) |
|
|
43 | (2) |
|
2.7.2 Mining with Multiple Minimum Supports |
|
|
45 | (4) |
|
2.8 Mining Sequential Patterns Based on PrefixSpan |
|
|
49 | (4) |
|
2.8.1 PrefixSpan Algorithm |
|
|
50 | (2) |
|
2.8.2 Mining with Multiple Minimum Supports |
|
|
52 | (1) |
|
2.9 Generating Rules from Sequential Patterns |
|
|
53 | (3) |
|
|
54 | (1) |
|
2.9.2 Label Sequential Rules |
|
|
54 | (1) |
|
2.9.3 Class Sequential Rules |
|
|
55 | (1) |
|
|
56 | (2) |
|
|
58 | (5) |
|
|
63 | (70) |
|
|
63 | (4) |
|
3.2 Decision Tree Induction |
|
|
67 | (12) |
|
|
70 | (1) |
|
|
71 | (4) |
|
3.2.3 Handling of Continuous Attributes |
|
|
75 | (1) |
|
|
76 | (3) |
|
3.3 Classifier Evaluation |
|
|
79 | (8) |
|
|
79 | (2) |
|
3.3.2 Precision, Recall, F-score and Breakeven Point |
|
|
81 | (2) |
|
3.3.3 Receiver Operating Characteristic Curve |
|
|
83 | (3) |
|
|
86 | (1) |
|
|
87 | (6) |
|
3.4.1 Sequential Covering |
|
|
87 | (3) |
|
3.4.2 Rule Learning: Learn-One-Rule Function |
|
|
90 | (3) |
|
|
93 | (1) |
|
3.5 Classification Based on Associations |
|
|
93 | (7) |
|
3.5.1 Classification Using Class Association Rules |
|
|
94 | (4) |
|
3.5.2 Class Association Rules as Features |
|
|
98 | (1) |
|
3.5.3 Classification Using Normal Association Rules |
|
|
99 | (1) |
|
3.6 Naive Bayesian Classification |
|
|
100 | (3) |
|
3.7 Naive Bayesian Text Classification |
|
|
103 | (6) |
|
3.7.1 Probabilistic Framework |
|
|
104 | (1) |
|
3.7.2 Naive Bayesian Model |
|
|
105 | (3) |
|
|
108 | (1) |
|
3.8 Support Vector Machines |
|
|
109 | (15) |
|
3.8.1 Linear SVM: Separable Case |
|
|
111 | (6) |
|
3.8.2 Linear SVM: Non-Separable Case |
|
|
117 | (3) |
|
3.8.3 Nonlinear SVM: Kernel Functions |
|
|
120 | (4) |
|
3.9 K-Nearest Neighbor Learning |
|
|
124 | (2) |
|
3.10 Ensemble of Classifiers |
|
|
126 | (2) |
|
|
126 | (1) |
|
|
126 | (2) |
|
|
128 | (1) |
|
|
129 | (4) |
|
|
133 | (38) |
|
|
133 | (3) |
|
|
136 | (8) |
|
|
136 | (3) |
|
4.2.2 Disk Version of the K-means Algorithm |
|
|
139 | (1) |
|
4.2.3 Strengths and Weaknesses |
|
|
140 | (4) |
|
4.3 Representation of Clusters |
|
|
144 | (3) |
|
4.3.1 Common Ways of Representing Clusters |
|
|
145 | (1) |
|
4.3.2 Clusters of Arbitrary Shapes |
|
|
146 | (1) |
|
4.4 Hierarchical Clustering |
|
|
147 | (4) |
|
|
149 | (1) |
|
4.4.2 Complete-Link Method |
|
|
149 | (1) |
|
4.4.3 Average-Link Method |
|
|
150 | (1) |
|
4.4.4 Strengths and Weaknesses |
|
|
150 | (1) |
|
|
151 | (4) |
|
|
151 | (1) |
|
4.5.2 Binary and Nominal Attributes |
|
|
152 | (2) |
|
|
154 | (1) |
|
|
155 | (2) |
|
4.7 Handling of Mixed Attributes |
|
|
157 | (2) |
|
4.8 Which Clustering Algorithm to Use? |
|
|
159 | (1) |
|
|
159 | (3) |
|
4.10 Discovering Holes and Data Regions |
|
|
162 | (3) |
|
|
165 | (1) |
|
|
166 | (5) |
|
5 Partially Supervised Learning |
|
|
171 | (40) |
|
5.1 Learning from Labeled and Unlabeled Examples |
|
|
171 | (13) |
|
5.1.1 EM Algorithm with Naive Bayesian Classification |
|
|
173 | (3) |
|
|
176 | (2) |
|
|
178 | (1) |
|
5.1.4 Transductive Support Vector Machines |
|
|
179 | (1) |
|
5.1.5 Graph-Based Methods |
|
|
180 | (3) |
|
|
183 | (1) |
|
5.2 Learning from Positive and Unlabeled Examples |
|
|
184 | (18) |
|
5.2.1 Applications of PU Learning |
|
|
185 | (2) |
|
5.2.2 Theoretical Foundation |
|
|
187 | (3) |
|
5.2.3 Building Classifiers: Two-Step Approach |
|
|
190 | (7) |
|
5.2.4 Building Classifiers: Biased-SVM |
|
|
197 | (2) |
|
5.2.5 Building Classifiers: Probability Estimation |
|
|
199 | (2) |
|
|
201 | (1) |
|
Appendix: Derivation of EM for Naive Bayesian Classification |
|
|
202 | (2) |
|
|
204 | (2) |
|
|
206 | (5) |
|
|
|
6 Information Retrieval and Web Search |
|
|
211 | (58) |
|
6.1 Basic Concepts of Information Retrieval |
|
|
212 | (3) |
|
6.2 Information Retrieval Models |
|
|
215 | (5) |
|
|
216 | (1) |
|
|
217 | (2) |
|
6.2.3 Statistical Language Model |
|
|
219 | (1) |
|
|
220 | (3) |
|
|
223 | (4) |
|
6.5 Text and Web Page Pre-Processing |
|
|
227 | (5) |
|
|
227 | (1) |
|
|
228 | (1) |
|
6.5.3 Other Pre-Processing Tasks for Text |
|
|
228 | (1) |
|
6.5.4 Web Page Pre-Processing |
|
|
229 | (2) |
|
6.5.5 Duplicate Detection |
|
|
231 | (1) |
|
6.6 Inverted Index and Its Compression |
|
|
232 | (10) |
|
|
232 | (2) |
|
6.6.2 Search Using an Inverted Index |
|
|
234 | (1) |
|
|
235 | (1) |
|
|
236 | (6) |
|
6.7 Latent Semantic Indexing |
|
|
242 | (7) |
|
6.7.1 Singular Value Decomposition |
|
|
243 | (2) |
|
6.7.2 Query and Retrieval |
|
|
245 | (1) |
|
|
246 | (3) |
|
|
249 | (1) |
|
|
249 | (3) |
|
6.9 Meta-Search: Combining Multiple Rankings |
|
|
252 | (5) |
|
6.9.1 Combination Using Similarity Scores |
|
|
254 | (1) |
|
6.9.2 Combination Using Rank Positions |
|
|
255 | (2) |
|
|
257 | (6) |
|
|
258 | (1) |
|
|
259 | (1) |
|
|
260 | (1) |
|
|
261 | (2) |
|
|
263 | (1) |
|
|
264 | (5) |
|
7 Social Network Analysis |
|
|
269 | (42) |
|
7.1 Social Network Analysis |
|
|
270 | (5) |
|
|
270 | (3) |
|
|
273 | (2) |
|
7.2 Co-Citation and Bibliographic Coupling |
|
|
275 | (2) |
|
|
276 | (1) |
|
7.2.2 Bibliographic Coupling |
|
|
277 | (1) |
|
|
277 | (11) |
|
|
278 | (7) |
|
7.3.2 Strengths and Weaknesses of PageRank |
|
|
285 | (1) |
|
7.3.3 Timed PageRank and Recency Search |
|
|
286 | (2) |
|
|
288 | (6) |
|
|
289 | (2) |
|
7.4.2 Finding Other Eigenvectors |
|
|
291 | (1) |
|
7.4.3 Relationships with Co-Citation and Bibliographic Coupling |
|
|
292 | (1) |
|
7.4.4 Strengths and Weaknesses of HITS |
|
|
293 | (1) |
|
|
294 | (10) |
|
|
295 | (2) |
|
7.5.2 Bipartite Core Communities |
|
|
297 | (1) |
|
7.5.3 Maximum Flow Communities |
|
|
298 | (3) |
|
7.5.4 Email Communities Based on Betweenness |
|
|
301 | (2) |
|
7.5.5 Overlapping Communities of Named Entities |
|
|
303 | (1) |
|
|
304 | (1) |
|
|
305 | (6) |
|
|
311 | (52) |
|
8.1 A Basic Crawler Algorithm |
|
|
312 | (3) |
|
8.1.1 Breadth-First Crawlers |
|
|
313 | (1) |
|
8.1.2 Preferential Crawlers |
|
|
314 | (1) |
|
8.2 Implementation Issues |
|
|
315 | (8) |
|
|
315 | (1) |
|
|
316 | (2) |
|
8.2.3 Stopword Removal and Stemming |
|
|
318 | (1) |
|
8.2.4 Link Extraction and Canonicalization |
|
|
318 | (2) |
|
|
320 | (1) |
|
|
321 | (1) |
|
|
322 | (1) |
|
|
323 | (4) |
|
|
324 | (2) |
|
8.3.2 Coverage vs. Freshness vs. Importance |
|
|
326 | (1) |
|
|
327 | (3) |
|
|
330 | (18) |
|
8.5.1 Topical Locality and Cues |
|
|
332 | (6) |
|
8.5.2 Best-First Variations |
|
|
338 | (3) |
|
|
341 | (7) |
|
|
348 | (5) |
|
8.7 Crawler Ethics and Conflicts |
|
|
353 | (3) |
|
8.8 Some New Developments |
|
|
356 | (2) |
|
|
358 | (1) |
|
|
359 | (4) |
|
9 Structured Data Extraction: Wrapper Generation |
|
|
363 | (62) |
|
|
364 | (6) |
|
9.1.1 Two Types of Data Rich Pages |
|
|
364 | (2) |
|
|
366 | (2) |
|
9.1.3 HTML Mark-Up Encoding of Data Instances |
|
|
368 | (2) |
|
|
370 | (8) |
|
9.2.1 Extraction from a Page |
|
|
370 | (3) |
|
9.2.2 Learning Extraction Rules |
|
|
373 | (4) |
|
9.2.3 Identifying Informative Examples |
|
|
377 | (1) |
|
9.2.4 Wrapper Maintenance |
|
|
378 | (1) |
|
9.3 Instance-Based Wrapper Learning |
|
|
378 | (3) |
|
9.4 Automatic Wrapper Generation: Problems |
|
|
381 | (3) |
|
9.4.1 Two Extraction Problems |
|
|
382 | (1) |
|
9.4.2 Patterns as Regular Expressions |
|
|
383 | (1) |
|
9.5 String Matching and Tree Matching |
|
|
384 | (6) |
|
9.5.1 String Edit Distance |
|
|
384 | (2) |
|
|
386 | (4) |
|
|
390 | (6) |
|
|
390 | (1) |
|
9.6.2 Partial Tree Alignment |
|
|
391 | (5) |
|
|
396 | (1) |
|
9.8 Extraction Based on a Single List Page: Flat Data Records |
|
|
397 | (10) |
|
9.8.1 Two Observations about Data Records |
|
|
398 | (1) |
|
9.8.2 Mining Data Regions |
|
|
399 | (5) |
|
9.8.3 Identifying Data Records in Data Regions |
|
|
404 | (1) |
|
9.8.4 Data Item Alignment and Extraction |
|
|
405 | (1) |
|
9.8.5 Making Use of Visual Information |
|
|
406 | (1) |
|
9.8.6 Some Other Techniques |
|
|
406 | (1) |
|
9.9 Extraction Based on a Single List Page: Nested Data Records |
|
|
407 | (6) |
|
9.10 Extraction Based on Multiple Pages |
|
|
413 | (2) |
|
9.10.1 Using Techniques in Previous Sections |
|
|
413 | (1) |
|
9.10.2 RoadRunner Algorithm |
|
|
414 | (1) |
|
|
415 | (4) |
|
9.11.1 Extraction from Other Pages |
|
|
416 | (1) |
|
9.11.2 Disjunction or Optional |
|
|
416 | (1) |
|
9.11.3 A Set Type or a Tuple Type |
|
|
417 | (1) |
|
9.11.4 Labeling and Integration |
|
|
418 | (1) |
|
9.11.5 Domain Specific Extraction |
|
|
418 | (1) |
|
|
419 | (1) |
|
|
419 | (2) |
|
|
421 | (4) |
|
10 Information Integration |
|
|
425 | (34) |
|
10.1 Introduction to Schema Matching |
|
|
426 | (2) |
|
10.2 Pre-Processing for Schema Matching |
|
|
428 | (1) |
|
10.3 Schema-Level Matching |
|
|
429 | (2) |
|
10.3.1 Linguistic Approaches |
|
|
429 | (1) |
|
10.3.2 Constraint Based Approaches |
|
|
430 | (1) |
|
10.4 Domain and Instance-Level Matching |
|
|
431 | (3) |
|
10.5 Combining Similarities |
|
|
434 | (1) |
|
|
435 | (1) |
|
|
436 | (2) |
|
10.7.1 Reuse of Previous Match Results |
|
|
436 | (1) |
|
10.7.2 Matching a Large Number of Schemas |
|
|
437 | (1) |
|
10.7.3 Schema Match Results |
|
|
437 | (1) |
|
|
438 | (1) |
|
10.8 Integration of Web Query Interfaces |
|
|
438 | (12) |
|
10.8.1 A Clustering Based Approach |
|
|
441 | (3) |
|
10.8.2 A Correlation Based Approach |
|
|
444 | (3) |
|
10.8.3 An Instance Based Approach |
|
|
447 | (3) |
|
10.9 Constructing a Unified Global Query Interface |
|
|
450 | (4) |
|
10.9.1 Structural Appropriateness and the Merge Algorithm |
|
|
451 | (2) |
|
10.9.2 Lexical Appropriateness |
|
|
453 | (1) |
|
10.9.3 Instance Appropriateness |
|
|
454 | (1) |
|
|
454 | (1) |
|
|
455 | (4) |
|
11 Opinion Mining and Sentiment Analysis |
|
|
459 | (68) |
|
1.1.1 The Problem of Opinion Mining |
|
|
460 | (1) |
|
11.1.1 Problem Definitions |
|
|
460 | (7) |
|
11.1.2 Aspect-Based Opinion Summary |
|
|
467 | (2) |
|
11.2 Document Sentiment Classification |
|
|
469 | (5) |
|
11.2.1 Classification Based on Supervised Learning |
|
|
470 | (2) |
|
11.2.2 Classification Based on Unsupervised Learning |
|
|
472 | (2) |
|
11.3 Sentence Subjectivity and Sentiment Classification |
|
|
474 | (3) |
|
11.4 Opinion Lexicon Expansion |
|
|
477 | (3) |
|
11.5 Aspect-Based Opinion Mining |
|
|
480 | (13) |
|
11.5.1 Aspect Sentiment Classification |
|
|
481 | (2) |
|
11.5.2 Basic Rules of Opinions |
|
|
483 | (3) |
|
|
486 | (4) |
|
11.5.4 Simultaneous Opinion Lexicon Expansion and Aspect Extraction |
|
|
490 | (3) |
|
11.6 Mining Comparative Opinions |
|
|
493 | (5) |
|
11.6.1 Problem Definitions |
|
|
493 | (2) |
|
11.6.2 Identification of Comparative Sentences |
|
|
495 | (1) |
|
11.6.3 Identification of Preferred Entities |
|
|
496 | (2) |
|
|
498 | (5) |
|
11.8 Opinion Search and Retrieval |
|
|
503 | (3) |
|
11.9 Opinion Spam Detection |
|
|
506 | (8) |
|
11.9.1 Types of Spam and Spammers |
|
|
506 | (2) |
|
|
508 | (1) |
|
11.9.3 Spam Detection Based on Supervised Learning |
|
|
509 | (2) |
|
11.9.4 Spam Detection Based on Abnormal Behaviors |
|
|
511 | (2) |
|
11.9.5 Group Spam Detection |
|
|
513 | (1) |
|
|
514 | (1) |
|
|
515 | (2) |
|
|
517 | (10) |
|
|
527 | (78) |
|
12.1 Data Collection and Pre-Processing |
|
|
528 | (12) |
|
12.1.1 Sources and Types of Data |
|
|
530 | (3) |
|
12.1.2 Key Elements of Web Usage Data Pre-Processing |
|
|
533 | (7) |
|
12.2 Data Modeling for Web Usage Mining |
|
|
540 | (4) |
|
12.3 Discovery and Analysis of Web Usage Patterns |
|
|
544 | (11) |
|
12.3.1 Session and Visitor Analysis |
|
|
544 | (1) |
|
12.3.2 Cluster Analysis and Visitor Segmentation |
|
|
545 | (4) |
|
12.3.3 Association and Correlation Analysis |
|
|
549 | (1) |
|
12.3.4 Analysis of Sequential and Navigational Patterns |
|
|
550 | (4) |
|
12.3:5 Classification and Prediction based on Web User Transactions |
|
|
554 | (1) |
|
12.4 Recommender Systems and Collaborative Filtering |
|
|
555 | (16) |
|
12.4.1 The Recommendation Problem |
|
|
556 | (1) |
|
12.4.2 Content-Based Recommendation |
|
|
557 | (2) |
|
12.4.3 Collaborative Filtering: K-Nearest Neighbor (KNN) |
|
|
559 | (2) |
|
12.4.4 Collaborative Filtering: Using Association Rules |
|
|
561 | (4) |
|
12.4.5 Collaborative Filtering: Matrix Factorization |
|
|
565 | (6) |
|
|
571 | (18) |
|
12.5.1 Data Sources, Characteristics, and Challenges |
|
|
573 | (1) |
|
12.5.2 Query Log Data Preparation |
|
|
574 | (3) |
|
12.5.3 Query Log Data Models |
|
|
577 | (5) |
|
12.5.4 Query Log Feature Extraction |
|
|
582 | (1) |
|
12.5.5 Query Log Mining Applications |
|
|
583 | (3) |
|
12.5.6 Query Log Mining Methods |
|
|
586 | (3) |
|
12.6 Computational Advertising |
|
|
589 | (4) |
|
12.7 Discussion and Outlook |
|
|
593 | (1) |
|
|
593 | (1) |
|
|
594 | (11) |
Subject Index |
|
605 | |