1 Overview of Text Mining | 1 | (12)
1.1 What's Special About Text Mining? | 1 | (4)
1.1.1 Structured or Unstructured Data? | 2 | (1)
1.1.2 Is Text Different from Numbers? | 3 | (2)
1.2 What Types of Problems Can Be Solved? | 5 | (1)
1.3 Document Classification | 6 | (1)
1.4 Information Retrieval | 6 | (1)
1.5 Clustering and Organizing Documents | 7 | (1)
1.6 Information Extraction | 8 | (1)
1.7 Prediction and Evaluation | 9 | (1)
| 10 | (1)
| 10 | (1)
1.10 Historical and Bibliographical Remarks | 11 | (1)
1.11 Questions and Exercises | 12 | (1)
2 From Textual Information to Numerical Vectors | 13 | (26)
| 13 | (2)
2.2 Document Standardization | 15 | (1)
| 16 | (1)
| 17 | (4)
2.4.1 Inflectional Stemming | 19 | (1)
| 19 | (2)
2.5 Vector Generation for Prediction | 21 | (8)
| 26 | (2)
2.5.2 Labels for the Right Answers | 28 | (1)
2.5.3 Feature Selection by Attribute Ranking | 29 | (1)
2.6 Sentence Boundary Determination | 29 | (2)
2.7 Part-of-Speech Tagging | 31 | (1)
2.8 Word Sense Disambiguation | 32 | (1)
| 32 | (1)
2.10 Named Entity Recognition | 33 | (1)
| 33 | (2)
| 35 | (1)
| 36 | (1)
2.14 Historical and Bibliographical Remarks | 36 | (2)
2.15 Questions and Exercises | 38 | (1)
3 Using Text for Prediction | 39 | (36)
3.1 Recognizing that Documents Fit a Pattern | 41 | (1)
3.2 How Many Documents Are Enough? | 42 | (1)
3.3 Document Classification | 43 | (1)
3.4 Learning to Predict from Text | 44 | (22)
3.4.1 Similarity and Nearest-Neighbor Methods | 45 | (1)
3.4.2 Document Similarity | 46 | (2)
| 48 | (6)
| 54 | (1)
3.4.5 Scoring by Probabilities | 55 | (3)
3.4.6 Linear Scoring Methods | 58 | (8)
3.5 Evaluation of Performance | 66 | (3)
3.5.1 Estimating Current and Future Performance | 66 | (3)
3.5.2 Getting the Most from a Learning Method | 69 | (1)
| 69 | (1)
| 70 | (1)
3.8 Historical and Bibliographical Remarks | 70 | (2)
3.9 Questions and Exercises | 72 | (3)
4 Information Retrieval and Text Mining | 75 | (16)
4.1 Is Information Retrieval a Form of Text Mining? | 75 | (1)
| 76 | (1)
4.3 Nearest-Neighbor Methods | 77 | (1)
| 78 | (2)
| 78 | (1)
4.4.2 Word Count and Bonus | 78 | (1)
| 79 | (1)
4.5 Web-based Document Search | 80 | (5)
| 81 | (4)
| 85 | (1)
| 85 | (2)
4.8 Evaluation of Performance | 87 | (1)
| 88 | (1)
4.10 Historical and Bibliographical Remarks | 88 | (1)
4.11 Questions and Exercises | 89 | (2)
5 Finding Structure in a Document Collection | 91 | (22)
5.1 Clustering Documents by Similarity | 93 | (1)
5.2 Similarity of Composite Documents | 94 | (11)
| 96 | (3)
5.2.2 Hierarchical Clustering | 99 | (3)
| 102 | (3)
5.3 What Do a Cluster's Labels Mean? | 105 | (2)
| 107 | (1)
5.5 Evaluation of Performance | 108 | (2)
| 110 | (1)
5.7 Historical and Bibliographical Remarks | 110 | (1)
5.8 Questions and Exercises | 111 | (2)
6 Looking for Information in Documents | 113 | (28)
6.1 Goals of Information Extraction | 113 | (2)
6.2 Finding Patterns and Entities from Text | 115 | (14)
6.2.1 Entity Extraction as Sequential Tagging | 116 | (1)
6.2.2 Tag Prediction as Classification | 117 | (1)
6.2.3 The Maximum Entropy Method | 118 | (5)
6.2.4 Linguistic Features and Encoding | 123 | (1)
6.2.5 Local Sequence Prediction Models | 124 | (4)
6.2.6 Global Sequence Prediction Models | 128 | (1)
6.3 Coreference and Relationship Extraction | 129 | (3)
6.3.1 Coreference Resolution | 129 | (2)
6.3.2 Relationship Extraction | 131 | (1)
6.4 Template Filling and Database Construction | 132 | (1)
| 133 | (3)
6.5.1 Information Retrieval | 133 | (1)
6.5.2 Commercial Extraction Systems | 134 | (1)
| 135 | (1)
| 135 | (1)
| 136 | (1)
6.7 Historical and Bibliographical Remarks | 137 | (1)
6.8 Questions and Exercises | 138 | (3)
7 Data Sources for Prediction: Databases, Hybrid Data and the Web | 141 | (16)
| 141 | (3)
7.1.1 Ideal Data for Prediction | 141 | (1)
7.1.2 Ideal Data for Text and Unstructured Data | 142 | (1)
7.1.3 Hybrid and Mixed Data | 142 | (2)
7.2 Practical Data Sourcing | 144 | (1)
7.3 Prototypical Examples | 145 | (6)
7.3.1 Web-based Spreadsheet Data | 146 | (1)
| 146 | (2)
7.3.3 Opinion Data and Sentiment Analysis | 148 | (3)
7.4 Hybrid Example: Independent Sources of Numerical and Text Data | 151 | (1)
7.5 Mixed Data in Standard Table Format | 152 | (1)
| 153 | (1)
7.7 Historical and Bibliographical Remarks | 154 | (1)
7.8 Questions and Exercises | 154 | (3)
8 Case Studies | 157 | (32)
8.1 Market Intelligence from the Web | 157 | (4)
| 157 | (1)
| 158 | (1)
8.1.3 Methods and Procedures | 159 | (1)
| 160 | (1)
8.2 Lightweight Document Matching for Digital Libraries | 161 | (4)
| 161 | (1)
| 162 | (1)
8.2.3 Methods and Procedures | 163 | (1)
| 164 | (1)
8.3 Generating Model Cases for Help Desk Applications | 165 | (4)
| 165 | (1)
| 165 | (1)
8.3.3 Methods and Procedures | 166 | (2)
| 168 | (1)
8.4 Assigning Topics to News Articles | 169 | (5)
| 169 | (1)
| 169 | (1)
8.4.3 Methods and Procedures | 169 | (4)
| 173 | (1)
| 174 | (3)
| 174 | (1)
| 174 | (1)
8.5.3 Methods and Procedures | 175 | (2)
| 177 | (1)
| 177 | (4)
| 177 | (1)
| 177 | (1)
8.6.3 Methods and Procedures | 178 | (1)
| 179 | (2)
8.7 Extracting Named Entities from Documents | 181 | (3)
| 181 | (1)
| 181 | (1)
8.7.3 Methods and Procedures | 182 | (2)
| 184 | (1)
8.8 Customized Newspapers | 184 | (3)
| 184 | (1)
| 185 | (1)
8.8.3 Methods and Procedures | 186 | (1)
| 187 | (1)
| 187 | (1)
8.10 Historical and Bibliographical Remarks | 188 | (1)
8.11 Questions and Exercises | 188 | (1)
9 Emerging Directions | 189 | (18)
| 189 | (3)
| 192 | (1)
9.3 Learning with Unlabeled Data | 193 | (1)
9.4 Different Ways of Collecting Samples | 194 | (4)
9.4.1 Ensembles and Voting Methods | 194 | (2)
| 196 | (1)
9.4.3 Cost-Sensitive Learning | 197 | (1)
9.4.4 Unbalanced Samples and Rare Events | 198 | (1)
9.5 Distributed Text Mining | 198 | (2)
| 200 | (1)
| 201 | (1)
| 202 | (1)
9.9 Historical and Bibliographical Remarks | 203 | (1)
9.10 Questions and Exercises | 204 | (3)
A Software Notes | 207 | (4)
| 207 | (1)
| 208 | (1)
A.3 Download Instructions | 208 | (3)
References | 211 | (8)
Author Index | 219 | (4)
Subject Index | 223