|
|
xiii | |
|
|
xv | |
Preface |
|
xvii | |
Acknowledgments |
|
xxiii | |
|
|
1 | (6) |
|
|
1 | (1) |
|
Text Mining and Related Fields |
|
|
2 | (3) |
|
|
2 | (1) |
|
|
3 | (1) |
|
|
3 | (1) |
|
|
3 | (1) |
|
|
4 | (1) |
|
|
4 | (1) |
|
|
5 | (1) |
|
|
5 | (1) |
|
Advice for Reading this Book |
|
|
5 | (2) |
|
|
7 | (52) |
|
|
7 | (1) |
|
|
8 | (7) |
|
First Regex: Finding the Word Cat |
|
|
8 | (2) |
|
Character Ranges and Finding Telephone Numbers |
|
|
10 | (2) |
|
Testing Regexes with Perl |
|
|
12 | (3) |
|
|
15 | (6) |
|
|
15 | (2) |
|
Nineteenth-Century Literature |
|
|
17 | (1) |
|
Perl Variables and the Function split |
|
|
17 | (3) |
|
|
20 | (1) |
|
Decomposing Poe's ``The Tell-Tale Heart'' into Words |
|
|
21 | (7) |
|
Dashes and String Substitutions |
|
|
23 | (1) |
|
|
24 | (3) |
|
|
27 | (1) |
|
|
28 | (6) |
|
|
33 | (1) |
|
|
33 | (1) |
|
First Attempt at Extracting Sentences |
|
|
34 | (12) |
|
Sentence Segmentation Preliminaries |
|
|
35 | (2) |
|
Sentence Segmentation for A Christmas Carol |
|
|
37 | (4) |
|
Leftmost Greediness and Sentence Segmentation |
|
|
41 | (5) |
|
|
46 | (6) |
|
Match Variables and Backreferences |
|
|
47 | (1) |
|
Regular Expression Operators and Their Output |
|
|
48 | (2) |
|
|
50 | (2) |
|
|
52 | (1) |
|
|
52 | (7) |
|
Quantitative Text Summaries |
|
|
59 | (46) |
|
|
59 | (1) |
|
Scalars, Interpolation, and Context in Perl |
|
|
59 | (1) |
|
Arrays and Context in Perl |
|
|
60 | (4) |
|
Word Lengths in Poe's ``The Tell-Tale Heart'' |
|
|
64 | (2) |
|
|
66 | (7) |
|
Adding and Removing Entries from Arrays |
|
|
66 | (3) |
|
Selecting Subsets of an Array |
|
|
69 | (1) |
|
|
69 | (4) |
|
|
73 | (4) |
|
|
74 | (3) |
|
|
77 | (9) |
|
Zipf's Law for A Christmas Carol |
|
|
77 | (6) |
|
|
83 | (1) |
|
An Aid to Crossword Puzzles |
|
|
83 | (1) |
|
|
84 | (1) |
|
Finding Words in a Set of Letters |
|
|
85 | (1) |
|
|
86 | (11) |
|
|
87 | (3) |
|
Arrays of Arrays and Beyond |
|
|
90 | (2) |
|
Application: Comparing the Words in Two Poe Stories |
|
|
92 | (5) |
|
|
97 | (1) |
|
|
97 | (1) |
|
|
97 | (8) |
|
Probability and Text Sampling |
|
|
105 | (28) |
|
|
105 | (1) |
|
|
105 | (10) |
|
Probability and Coin Flipping |
|
|
106 | (2) |
|
|
108 | (1) |
|
Estimating Letter Probabilities for Poe and Dickens |
|
|
109 | (3) |
|
Estimating Letter Bigram Probabilities |
|
|
112 | (3) |
|
|
115 | (3) |
|
|
117 | (1) |
|
Mean and Variance of Random Variables |
|
|
118 | (5) |
|
Sampling and Error Estimates |
|
|
120 | (3) |
|
The Bag-of-Words Model for Poe's ``The Black Cat'' |
|
|
123 | (1) |
|
The Effect of Sample Size |
|
|
124 | (4) |
|
Tokens vs. Types in Poe's ``Hans Pfaall'' |
|
|
124 | (4) |
|
|
128 | (1) |
|
|
129 | (4) |
|
Applying Information Retrieval to Text Mining |
|
|
133 | (28) |
|
|
133 | (1) |
|
Counting Letters and Words |
|
|
134 | (4) |
|
Counting Letters in Poe with Perl |
|
|
134 | (2) |
|
Counting Pronouns Occurring in Poe |
|
|
136 | (2) |
|
|
138 | (5) |
|
Vectors and Angles for Two Poe Stories |
|
|
139 | (1) |
|
Computing Angles Between Vectors |
|
|
140 | (1) |
|
|
140 | (3) |
|
Computing the Angle between Vectors |
|
|
143 | (1) |
|
The Term-Document Matrix Applied to Poe |
|
|
143 | (4) |
|
|
147 | (3) |
|
Matrix Multiplications Applied to Poe |
|
|
148 | (2) |
|
|
150 | (2) |
|
|
152 | (5) |
|
Inverse Document Frequency |
|
|
153 | (1) |
|
Poe Story Angles Revisited |
|
|
154 | (3) |
|
|
157 | (1) |
|
|
157 | (4) |
|
Concordance Lines and Corpus Linguistics |
|
|
161 | (30) |
|
|
161 | (1) |
|
|
162 | (2) |
|
Statistical Survey Sampling |
|
|
162 | (1) |
|
|
163 | (1) |
|
|
164 | (5) |
|
Function vs. Content Words in Dickens, London, and Shelley |
|
|
168 | (1) |
|
|
169 | (10) |
|
Sorting Concordance Lines |
|
|
170 | (1) |
|
Code for Sorting Concordance Lines |
|
|
171 | (1) |
|
Application: Word Usage Differences between London and Shelley |
|
|
172 | (4) |
|
Application: Word Morphology of Adverbs |
|
|
176 | (3) |
|
Collocations and Concordance Lines |
|
|
179 | (6) |
|
More Ways to Sort Concordance Lines |
|
|
179 | (2) |
|
Application: Phrasal Verbs in The Call of the Wild |
|
|
181 | (3) |
|
Grouping Words: Colors in The Call of the Wild |
|
|
184 | (1) |
|
Applications with References |
|
|
185 | (2) |
|
|
187 | (1) |
|
|
188 | (3) |
|
Multivariate Techniques with Text |
|
|
191 | (28) |
|
|
191 | (1) |
|
|
192 | (10) |
|
|
193 | (2) |
|
Word Correlations among Poe's Short Stories |
|
|
195 | (4) |
|
|
199 | (2) |
|
Correlations and Covariances |
|
|
201 | (1) |
|
|
202 | (3) |
|
2 by 2 Correlation Matrices |
|
|
202 | (3) |
|
Principal Components Analysis |
|
|
205 | (6) |
|
Finding the Principal Components |
|
|
206 | (1) |
|
PCA Applied to the 68 Poe Short Stories |
|
|
206 | (3) |
|
Another PCA Example with Poe's Short Stories |
|
|
209 | (1) |
|
|
209 | (2) |
|
|
211 | (1) |
|
A Word on Factor Analysis |
|
|
211 | (1) |
|
Applications and References |
|
|
211 | (1) |
|
|
212 | (7) |
|
|
219 | (24) |
|
|
219 | (1) |
|
|
220 | (15) |
|
Two-Variable Example of k-Means |
|
|
220 | (3) |
|
|
223 | (1) |
|
He versus She in Poe's Short Stories |
|
|
224 | (5) |
|
Poe Clusters Using Eight Pronouns |
|
|
229 | (1) |
|
Clustering Poe Using Principal Components |
|
|
230 | (4) |
|
Hierarchical Clustering of Poe's Short Stories |
|
|
234 | (1) |
|
|
235 | (1) |
|
Decision Trees and Overfitting |
|
|
236 | (1) |
|
|
236 | (1) |
|
|
236 | (1) |
|
|
236 | (7) |
|
A Sample of Additional Topics |
|
|
243 | (16) |
|
|
243 | (1) |
|
|
243 | (5) |
|
|
244 | (1) |
|
|
245 | (1) |
|
The Sentence Segmentation Module |
|
|
245 | (2) |
|
An Object-Oriented Module for Tagging |
|
|
247 | (1) |
|
|
248 | (1) |
|
Other Languages: Analyzing Goethe in German |
|
|
248 | (3) |
|
|
251 | (7) |
|
Runs and Hypothesis Testing |
|
|
252 | (2) |
|
Distribution of Character Names in Dickens and London |
|
|
254 | (4) |
|
|
258 | (1) |
|
Appendix A: Overview of Perl for Text Mining |
|
|
259 | (16) |
|
|
259 | (4) |
|
Special Variables and Arrays |
|
|
262 | (1) |
|
|
263 | (3) |
|
|
266 | (4) |
|
|
270 | (1) |
|
Introduction to Regular Expressions |
|
|
271 | (4) |
|
Appendix B: Summary of R used in this Book |
|
|
275 | (8) |
|
|
275 | (4) |
|
|
276 | (1) |
|
|
277 | (1) |
|
|
278 | (1) |
|
|
279 | (4) |
References
|
283 | (8) |
Index |
|
291 | |