Foreword  xi
1.1.1 What is Text Mining in Practice?  2
1.1.2 Where Does Text Mining Fit?  2
1.2 Why We Care About Text Mining  2
1.2.1 What Are the Consequences of Ignoring Text?  3
1.2.2 What Are the Benefits of Text Mining?  5
1.2.3 Setting Expectations: When Text Mining Should (and Should Not) Be Used  6
1.3 A Basic Workflow -- How the Process Works  9
1.4 What Tools Do I Need to Get Started with This?  12
1.6 A Real World Use Case  13
2.1 What is Text Mining in a Practical Sense?  17
2.2 Types of Text Mining: Bag of Words  20
2.2.1 Types of Text Mining: Syntactic Parsing  22
2.3 The Text Mining Process in Context  24
2.4 String Manipulation: Number of Characters and Substitutions  25
2.4.1 String Manipulations: Paste, Character Splits and Extractions  29
2.6 String Packages stringr and stringi  36
2.7 Preprocessing Steps for Bag of Words Text Mining  37
2.9 Frequent Terms and Associations  47
3 Common Text Mining Visualizations  51
3.1 A Tale of Two (or Three) Cultures  51
3.2 Simple Exploration: Term Frequency, Associations and Word Networks  53
3.3 Simple Word Clusters: Hierarchical Dendrograms  67
3.4 Word Clouds: Overused but Effective  73
3.4.1 One Corpus Word Clouds  74
3.4.2 Comparing and Contrasting Corpora in Word Clouds  75
4.1 What is Sentiment Analysis?  85
4.2 Sentiment Scoring: Parlor Trick or Insightful?  88
4.3 Polarity: Simple Sentiment Scoring  89
4.3.1 Subjectivity Lexicons  89
4.3.2 Qdap's Scoring for Positive and Negative Word Choice  93
4.3.3 Revisiting Word Clouds -- Sentiment Word Clouds  96
4.4 Emoticons -- Dealing with These Perplexing Clues  103
4.4.1 Symbol-Based Emoticons Native to R  105
4.4.2 Punctuation-Based Emoticons  106
4.5 R's Archived Sentiment Scoring Library  113
4.6 Sentiment the Tidytext Way  118
4.7 Airbnb.com Boston Wrap Up  126
5 Hidden Structures: Clustering, String Distance, Text Vectors and Topic Modeling  129
5.1.2 Spherical K-Means Clustering  139
5.1.3 K-Medoid Clustering  144
5.1.4 Evaluating the Cluster Approaches  145
5.2 Calculating and Exploring String Distance  147
5.2.1 What is String Distance?  148
5.2.2 Fuzzy Matching -- amatch, ain  151
5.2.3 Similarity Distances -- stringdist, stringdistmatrix  152
5.3 LDA Topic Modeling Explained  154
5.3.1 Topic Modeling Case Study  156
5.4 Text to Vectors using text2vec  169
6 Document Classification: Finding Clickbait from Headlines  181
6.1 What is Document Classification?  181
6.2.1 Session and Data Set-Up  185
6.2.3 GLMNet Test Predictions  196
6.2.4 Test Set Evaluation  198
6.2.5 Finding the Most Impactful Words  200
6.2.6 Case Study Wrap Up: Model Accuracy and Improving Performance Recommendations  206
7 Predictive Modeling: Using Text for Classifying and Predicting Outcomes  209
7.1 Classification vs Prediction  209
7.2 Case Study I: Will This Patient Come Back to the Hospital?  210
7.2.1 Patient Readmission in the Text Mining Workflow  211
7.2.2 Session and Data Set-Up  211
7.2.4 More Model KPIs: AUC, Recall, Precision and F1  216
7.2.4.1 Additional Evaluation Metrics  218
7.2.5 Apply the Model to New Patients  222
7.2.6 Patient Readmission Conclusion  223
7.3 Case Study II: Predicting Box Office Success  224
7.3.1 Opening Weekend Revenue in the Text Mining Workflow  225
7.3.2 Session and Data Set-Up  225
7.3.3 Opening Weekend Modeling  228
7.3.5 Apply the Model to New Movie Reviews  234
7.3.6 Movie Revenue Conclusion  235
8.1 What is the OpenNLP project?  237
8.3 Named Entities in Hillary Clinton's Email  242
8.3.2 Minor Text Cleaning  245
8.3.3 Using OpenNLP on a single email  246
8.3.4 Using OpenNLP on Multiple Documents  251
8.3.5 Revisiting the Text Mining Workflow  254
8.4 Analyzing the Named Entities  255
8.4.1 Worldwide Map of Hillary Clinton's Location Mentions  256
8.4.2 Mapping Only European Locations  259
8.4.3 Entities and Polarity: How Does Hillary Clinton Feel About an Entity?  262
8.4.4 Stock Charts for Entities  266
8.4.5 Reach an Insight or Conclusion About Hillary Clinton's Emails  268
9.2.1 Web Scraping a Single Page with rvest  272
9.2.2 Web Scraping Multiple Pages with rvest  276
9.2.3 Application Program Interfaces (APIs)  282
9.2.4 Newspaper Articles from the Guardian Newspaper  283
9.2.5 Tweets Using the twitteR Package  285
9.2.6 Calling an API Without a Dedicated R Package  287
9.2.7 Using jsonlite to Access the New York Times  288
9.2.8 Using RCurl and XML to Parse Google Newsfeeds  290
9.2.9 The tm Library Web-Mining Plugin  292
9.3 Getting Text from File Sources  293
9.3.1 Individual CSV, TXT and Microsoft Office Files  294
9.3.2 Reading Multiple Files Quickly  296
9.3.3 Extracting Text from PDFs  298
9.3.4 Optical Character Recognition: Extracting Text from Images  299
Index  305