Preface | xiii
|
I Natural Language Features | 1 (98)
|
1 Language and modeling | 3 (6)
1.1 Linguistics for text analysis | 3 (2)
1.2 A glimpse into one area: morphology | 5 (1)
1.3 Different languages | 6 (1)
1.4 Other ways text can vary | 7 (1)
1.5 Summary | 8 (1)
1.5.1 In this chapter, you learned | 8 (1)
|
2 Tokenization | 9 (28)
2.1 What is a token? | 9 (4)
2.2 Types of tokens | 13 (12)
2.2.1 Character tokens | 16 (2)
2.2.2 Word tokens | 18 (1)
2.2.3 Tokenizing by n-grams | 19 (3)
2.2.4 Lines, sentence, and paragraph tokens | 22 (3)
2.3 Where does tokenization break down? | 25 (1)
2.4 Building your own tokenizer | 26 (7)
2.4.1 Tokenize to characters, only keeping letters | 27 (2)
2.4.2 Allow for hyphenated words | 29 (3)
2.4.3 Wrapping it in a function | 32 (1)
2.5 Tokenization for non-Latin alphabets | 33 (1)
2.6 Tokenization benchmark | 34 (1)
2.7 Summary | 35 (2)
2.7.1 In this chapter, you learned | 35 (2)
|
3 Stop words | 37 (16)
3.1 Using premade stop word lists | 38 (5)
3.1.1 Stop word removal in R | 41 (2)
3.2 Creating your own stop words list | 43 (5)
3.3 All stop word lists are context-specific | 48 (1)
3.4 What happens when you remove stop words | 49 (1)
3.5 Stop words in languages other than English | 50 (2)
3.6 Summary | 52 (1)
3.6.1 In this chapter, you learned | 52 (1)
|
4 Stemming | 53 (20)
4.1 How to stem text in R | 54 (4)
4.2 Should you use stemming at all? | 58 (3)
4.3 Understand a stemming algorithm | 61 (2)
4.4 Handling punctuation when stemming | 63 (2)
4.5 Compare some stemming options | 65 (3)
4.6 Lemmatization and stemming | 68 (2)
4.7 Stemming and stop words | 70 (1)
4.8 Summary | 71 (2)
4.8.1 In this chapter, you learned | 72 (1)
|
5 Word embeddings | 73 (26)
5.1 Motivating embeddings for sparse, high-dimensional data | 73 (4)
5.2 Understand word embeddings by finding them yourself | 77 (4)
5.3 Exploring CFPB word embeddings | 81 (7)
5.4 Use pre-trained word embeddings | 88 (5)
5.5 Fairness and word embeddings | 93 (2)
5.6 Using word embeddings in the real world | 95 (1)
5.7 Summary | 96 (3)
5.7.1 In this chapter, you learned | 97 (2)
|
II Machine Learning Methods | 99 (124)
|
Overview | 101 (4)
|
6 Regression | 105 (50)
6.1 A first regression model | 106 (11)
6.1.1 Building our first regression model | 107 (5)
6.1.2 Evaluation | 112 (5)
6.2 Compare to the null model | 117 (2)
6.3 Compare to a random forest model | 119 (3)
6.4 Case study: removing stop words | 122 (4)
6.5 Case study: varying n-grams | 126 (3)
6.6 Case study: lemmatization | 129 (4)
6.7 Case study: feature hashing | 133 (6)
6.7.1 Text normalization | 137 (2)
6.8 What evaluation metrics are appropriate? | 139 (3)
6.9 The full game: regression | 142 (11)
6.9.1 Preprocess the data | 142 (1)
6.9.2 Specify the model | 143 (1)
6.9.3 Tune the model | 144 (2)
6.9.4 Evaluate the modeling | 146 (7)
6.10 Summary | 153 (2)
6.10.1 In this chapter, you learned | 153 (2)
|
7 Classification | 155 (68)
7.1 A first classification model | 156 (10)
7.1.1 Building our first classification model | 158 (3)
7.1.2 Evaluation | 161 (5)
7.2 Compare to the null model | 166 (1)
7.3 Compare to a lasso classification model | 167 (3)
7.4 Tuning lasso hyperparameters | 170 (9)
7.5 Case study: sparse encoding | 179 (4)
7.6 Two-class or multiclass? | 183 (8)
7.7 Case study: including non-text data | 191 (4)
7.8 Case study: data censoring | 195 (6)
7.9 Case study: custom features | 201 (5)
7.9.1 Detect credit cards | 202 (2)
7.9.2 Calculate percentage censoring | 204 (1)
7.9.3 Detect monetary amounts | 205 (1)
7.10 What evaluation metrics are appropriate? | 206 (2)
7.11 The full game: classification | 208 (12)
7.11.1 Feature selection | 209 (1)
7.11.2 Specify the model | 210 (2)
7.11.3 Evaluate the modeling | 212 (8)
7.12 Summary | 220 (3)
7.12.1 In this chapter, you learned | 221 (2)
|
III Deep Learning Methods | 223 (120)
|
Overview | 225 (6)
|
8 Dense neural networks | 231 (42)
8.1 Kickstarter data | 232 (5)
8.2 A first deep learning model | 237 (16)
8.2.1 Preprocessing for deep learning | 237 (3)
8.2.2 One-hot sequence embedding of text | 240 (4)
8.2.3 Simple flattened dense network | 244 (4)
8.2.4 Evaluation | 248 (5)
8.3 Using bag-of-words features | 253 (4)
8.4 Using pre-trained word embeddings | 257 (6)
8.5 Cross-validation for deep learning models | 263 (4)
8.6 Compare and evaluate DNN models | 267 (4)
8.7 Limitations of deep learning | 271 (1)
8.8 Summary | 272 (1)
8.8.1 In this chapter, you learned | 272 (1)
|
9 Long short-term memory (LSTM) networks | 273 (30)
9.1 A first LSTM model | 273 (10)
9.1.1 Building an LSTM | 275 (4)
9.1.2 Evaluation | 279 (4)
9.2 Compare to a recurrent neural network | 283 (3)
9.3 Case study: bidirectional LSTM | 286 (2)
9.4 Case study: stacking LSTM layers | 288 (1)
9.5 Case study: padding | 289 (3)
9.6 Case study: training a regression model | 292 (3)
9.7 Case study: vocabulary size | 295 (2)
9.8 The full game: LSTM | 297 (4)
9.8.1 Preprocess the data | 297 (1)
9.8.2 Specify and fit the model | 298 (3)
9.9 Summary | 301 (2)
9.9.1 In this chapter, you learned | 302 (1)
|
10 Convolutional neural networks | 303 (40)
10.1 What are CNNs? | 303 (2)
10.1.1 Kernel | 304 (1)
10.1.2 Kernel size | 304 (1)
10.2 A first CNN model | 305 (4)
10.3 Case study: adding more layers | 309 (8)
10.4 Case study: byte pair encoding | 317 (7)
10.5 Case study: explainability with LIME | 324 (6)
10.6 Case study: hyperparameter search | 330 (4)
10.7 Cross-validation for evaluation | 334 (3)
10.8 The full game: CNN | 337 (4)
10.8.1 Preprocess the data | 337 (1)
10.8.2 Specify and fit the model | 338 (3)
10.9 Summary | 341 (2)
10.9.1 In this chapter, you learned | 342 (1)
|
Concluding remarks | 343 (4)
Text models in the real world | 345 (2)
|
Appendix | 347 (22)

A Regular expressions | 347 (2)
A.1 Literal characters | 347 (2)
A.1.1 Meta characters | 349 (1)
A.2 Full stop, the wildcard | 349 (8)
A.3 Character classes | 350 (2)
A.3.1 Shorthand character classes | 352 (1)
A.4 Quantifiers | 353 (2)
A.5 Anchors | 355 (1)
A.6 Additional resources | 355 (2)
|
B Data | 357 (4)
B.1 Hans Christian Andersen fairy tales | 357 (1)
B.2 Opinions of the Supreme Court of the United States | 358 (1)
B.3 Consumer Financial Protection Bureau (CFPB) complaints | 359 (1)
B.4 Kickstarter campaign blurbs | 359 (2)
|
C Baseline linear classifier | 361 (8)
C.1 Read in the data | 361 (1)
C.2 Split into test/train and create resampling folds | 362 (1)
C.3 Recipe for data preprocessing | 363 (1)
C.4 Lasso regularized classification model | 363 (1)
C.5 A model workflow | 364 (2)
C.6 Tune the workflow | 366 (3)

References | 369 (10)
Index | 379