1 Overview of Text Mining |
|
1 | (12) |
|
1.1 What's Special About Text Mining? |
|
|
1 | (5) |
|
1.1.1 Structured or Unstructured Data? |
|
|
2 | (1) |
|
1.1.2 Is Text Different from Numbers? |
|
|
3 | (3) |
|
1.2 What Types of Problems Can Be Solved? |
|
|
5 |
1.3 Document Classification |
|
|
6 | (1) |
|
1.4 Information Retrieval |
|
|
6 | (1) |
|
1.5 Clustering and Organizing Documents |
|
|
7 | (1) |
|
1.6 Information Extraction |
|
|
8 | (1) |
|
1.7 Prediction and Evaluation |
|
|
9 | (1) |
|
|
10 | (1) |
|
|
11 | (1) |
|
1.10 Historical and Bibliographical Remarks |
|
|
11 | (1) |
|
1.11 Questions and Exercises |
|
|
12 | (1) |
2 From Textual Information to Numerical Vectors |
|
13 | (28) |
|
|
13 | (2) |
|
2.2 Document Standardization |
|
|
15 | (2) |
|
|
17 | (2) |
|
|
19 | (2) |
|
2.4.1 Inflectional Stemming |
|
|
19 | (2) |
|
|
21 | (1) |
|
2.5 Vector Generation for Prediction |
|
|
21 | (9) |
|
|
26 | (3) |
|
2.5.2 Labels for the Right Answers |
|
|
29 | (1) |
|
2.5.3 Feature Selection by Attribute Ranking |
|
|
29 | (1) |
|
2.6 Sentence Boundary Determination |
|
|
30 | (1) |
|
2.7 Part-of-Speech Tagging |
|
|
30 | (2) |
|
2.8 Word Sense Disambiguation |
|
|
32 | (1) |
|
|
33 | (1) |
|
2.10 Named Entity Recognition |
|
|
33 | (1) |
|
|
34 | (1) |
|
|
35 | (2) |
|
|
37 | (1) |
|
2.14 Historical and Bibliographical Remarks |
|
|
37 | (2) |
|
2.15 Questions and Exercises |
|
|
39 | (2) |
3 Using Text for Prediction |
|
41 | (40) |
|
3.1 Recognizing that Documents Fit a Pattern |
|
|
43 | (1) |
|
3.2 How Many Documents Are Enough? |
|
|
44 | (1) |
|
3.3 Document Classification |
|
|
45 | (1) |
|
3.4 Learning to Predict from Text |
|
|
46 | (23) |
|
3.4.1 Similarity and Nearest-Neighbor Methods |
|
|
47 | (1) |
|
3.4.2 Document Similarity |
|
|
48 | (2) |
|
|
50 | (6) |
|
|
56 | (1) |
|
3.4.5 Scoring by Probabilities |
|
|
57 | (3) |
|
3.4.6 Linear Scoring Methods |
|
|
60 | (9) |
|
3.5 Evaluation of Performance |
|
|
69 | (5) |
|
3.5.1 Estimating Current and Future Performance |
|
|
69 | (2) |
|
3.5.2 Getting the Most from a Learning Method |
|
|
71 | (1) |
|
3.5.3 Errors and Pitfalls in Big Data Evaluation |
|
|
72 | (2) |
|
|
74 | (1) |
|
3.7 Graph Models for Social Networks |
|
|
74 | (2) |
|
|
76 | (1) |
|
3.9 Historical and Bibliographical Remarks |
|
|
77 | (2) |
|
3.10 Questions and Exercises |
|
|
79 | (2) |
4 Information Retrieval and Text Mining |
|
81 | (16) |
|
4.1 Is Information Retrieval a Form of Text Mining? |
|
|
81 | (1) |
|
|
82 | (1) |
|
4.3 Nearest-Neighbor Methods |
|
|
83 | (1) |
|
|
84 | (3) |
|
|
84 | (1) |
|
4.4.2 Word Count and Bonus |
|
|
85 | (1) |
|
|
86 | (1) |
|
4.5 Web-Based Document Search |
|
|
87 | (4) |
|
|
88 | (3) |
|
|
91 | (1) |
|
|
92 | (1) |
|
4.8 Evaluation of Performance |
|
|
93 | (1) |
|
|
94 | (1) |
|
4.10 Historical and Bibliographical Remarks |
|
|
95 | (1) |
|
4.11 Questions and Exercises |
|
|
95 | (2) |
5 Finding Structure in a Document Collection |
|
97 | (22) |
|
5.1 Clustering Documents by Similarity |
|
|
99 | (1) |
|
5.2 Similarity of Composite Documents |
|
|
100 | (12) |
|
|
102 | (4) |
|
5.2.2 Hierarchical Clustering |
|
|
106 | (2) |
|
|
108 | (4) |
|
5.3 What Do a Cluster's Labels Mean? |
|
|
112 | (1) |
|
|
113 | (1) |
|
5.5 Evaluation of Performance |
|
|
114 | (2) |
|
|
116 | (1) |
|
5.7 Historical and Bibliographical Remarks |
|
|
116 | (2) |
|
5.8 Questions and Exercises |
|
|
118 | (1) |
6 Looking for Information in Documents |
|
119 | (28) |
|
6.1 Goals of Information Extraction |
|
|
119 | (2) |
|
6.2 Finding Patterns and Entities from Text |
|
|
121 | (14) |
|
6.2.1 Entity Extraction as Sequential Tagging |
|
|
122 | (1) |
|
6.2.2 Tag Prediction as Classification |
|
|
123 | (1) |
|
6.2.3 The Maximum Entropy Method |
|
|
124 | (5) |
|
6.2.4 Linguistic Features and Encoding |
|
|
129 | (1) |
|
6.2.5 Local Sequence Prediction Models |
|
|
130 | (4) |
|
6.2.6 Global Sequence Prediction Models |
|
|
134 | (1) |
|
6.3 Coreference and Relationship Extraction |
|
|
135 | (4) |
|
6.3.1 Coreference Resolution |
|
|
135 | (3) |
|
6.3.2 Relationship Extraction |
|
|
138 | (1) |
|
6.4 Template Filling and Database Construction |
|
|
139 | (1) |
|
|
140 | (3) |
|
6.5.1 Information Retrieval |
|
|
140 | (1) |
|
6.5.2 Commercial Extraction Systems |
|
|
140 | (1) |
|
|
141 | (1) |
|
|
142 | (1) |
|
|
143 | (1) |
|
6.7 Historical and Bibliographical Remarks |
|
|
143 | (2) |
|
6.8 Questions and Exercises |
|
|
145 | (2) |
7 Data Sources for Prediction: Databases, Hybrid Data and the Web |
|
147 | (18) |
|
|
147 | (3) |
|
7.1.1 Ideal Data for Prediction |
|
|
147 | (1) |
|
7.1.2 Ideal Data for Text and Unstructured Data |
|
|
148 | (1) |
|
7.1.3 Hybrid and Mixed Data |
|
|
148 | (2) |
|
7.2 Practical Data Sourcing |
|
|
150 | (1) |
|
7.3 Prototypical Examples |
|
|
151 | (7) |
|
7.3.1 Web-Based Spreadsheet Data |
|
|
152 | (1) |
|
|
152 | (1) |
|
7.3.3 Opinion Data and Sentiment Analysis |
|
|
153 | (5) |
|
7.4 Hybrid Example: Independent Sources of Numerical and Text Data |
|
|
158 | (1) |
|
7.5 Mixed Data in Standard Table Format |
|
|
159 | (1) |
|
|
160 | (2) |
|
7.7 Historical and Bibliographical Remarks |
|
|
162 | (1) |
|
7.8 Questions and Exercises |
|
|
162 | (3) |
8 Case Studies |
|
165 | (38) |
|
8.1 Market Intelligence from the Web |
|
|
165 | (5) |
|
|
165 | (1) |
|
|
166 | (1) |
|
8.1.3 Methods and Procedures |
|
|
167 | (1) |
|
|
168 | (2) |
|
8.2 Lightweight Document Matching for Digital Libraries |
|
|
170 | (3) |
|
|
170 | (1) |
|
|
170 | (1) |
|
8.2.3 Methods and Procedures |
|
|
171 | (2) |
|
|
173 | (1) |
|
8.3 Generating Model Cases for Help Desk Applications |
|
|
173 | (4) |
|
|
173 | (1) |
|
|
174 | (1) |
|
8.3.3 Methods and Procedures |
|
|
174 | (2) |
|
|
176 | (1) |
|
8.4 Assigning Topics to News Articles |
|
|
177 | (5) |
|
|
177 | (1) |
|
|
177 | (1) |
|
8.4.3 Methods and Procedures |
|
|
178 | (4) |
|
|
182 | (1) |
|
|
182 | (4) |
|
|
182 | (1) |
|
|
183 | (1) |
|
8.5.3 Methods and Procedures |
|
|
184 | (1) |
|
|
185 | (1) |
|
|
186 | (4) |
|
|
186 | (1) |
|
|
186 | (1) |
|
8.6.3 Methods and Procedures |
|
|
187 | (1) |
|
|
188 | (2) |
|
8.7 Extracting Named Entities from Documents |
|
|
190 | (4) |
|
|
190 | (1) |
|
|
190 | (1) |
|
8.7.3 Methods and Procedures |
|
|
191 | (2) |
|
|
193 | (1) |
|
|
194 | (3) |
|
|
194 | (1) |
|
|
195 | (1) |
|
8.8.3 Methods and Procedures |
|
|
196 | (1) |
|
|
197 | (1) |
|
8.9 Customized Newspapers |
|
|
197 | (3) |
|
|
197 | (1) |
|
|
198 | (1) |
|
8.9.3 Methods and Procedures |
|
|
198 | (1) |
|
|
199 | (1) |
|
|
200 | (1) |
|
8.11 Historical and Bibliographical Remarks |
|
|
200 | (1) |
|
8.12 Questions and Exercises |
|
|
201 | (2) |
9 Emerging Directions |
|
203 | (20) |
|
|
203 | (3) |
|
|
206 | (1) |
|
9.3 Learning with Unlabeled Data |
|
|
207 | (1) |
|
9.4 Different Ways of Collecting Samples |
|
|
208 | (7) |
|
9.4.1 Ensembles and Voting Methods |
|
|
208 | (2) |
|
|
210 | (1) |
|
|
211 | (3) |
|
9.4.4 Cost-Sensitive Learning |
|
|
214 | (1) |
|
9.4.5 Unbalanced Samples and Rare Events |
|
|
214 | (1) |
|
9.5 Distributed Text Mining |
|
|
215 | (2) |
|
|
217 | (1) |
|
|
218 | (1) |
|
|
219 | (1) |
|
9.9 Historical and Bibliographical Remarks |
|
|
219 | (3) |
|
9.10 Questions and Exercises |
|
|
222 | (1) |
References |
|
223 | (8) |
Author Index |
|
231 | (4) |
Subject Index |
|
235 | |