
Supervised Machine Learning for Text Analysis in R [Hardcover]

  • Format: Hardback, 402 pages, height x width: 234x156 mm, weight: 900 g, 1 table, black and white; 57 line drawings, color; 8 line drawings, black and white; 57 illustrations, color; 8 illustrations, black and white
  • Series: Chapman & Hall/CRC Data Science Series
  • Publication date: 04-Nov-2021
  • Publisher: Chapman & Hall/CRC
  • ISBN-10: 0367554186
  • ISBN-13: 9780367554187

Text data is important for many domains, from healthcare to marketing to the digital humanities, but specialized approaches are necessary to create features for machine learning from language. Supervised Machine Learning for Text Analysis in R explains how to preprocess text data for modeling, train models, and evaluate model performance using tools from the tidyverse and tidymodels ecosystem. Models like these can be used to make predictions for new observations, to understand what natural language features or characteristics contribute to differences in the output, and more. If you are already familiar with the basics of predictive modeling, use the comprehensive, detailed examples in this book to extend your skills to the domain of natural language processing.

This book provides practical guidance and directly applicable knowledge for data scientists and analysts who want to integrate unstructured text data into their modeling pipelines. Learn how to use text data for both regression and classification tasks, and how to apply more straightforward algorithms like regularized regression or support vector machines as well as deep learning approaches. Natural language must be dramatically transformed to be ready for computation, so we explore typical text preprocessing and feature engineering steps like tokenization and word embeddings from the ground up. These steps influence model results in ways we can measure, both in terms of model metrics and other tangible consequences such as how fair or appropriate model results are. The book assumes that the reader is somewhat familiar with R, predictive modeling concepts for non-text data, and the tidyverse family of packages.
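To give a flavor of what these preprocessing and feature engineering steps look like in practice, here is a minimal sketch of a text preprocessing recipe using the tidymodels and textrecipes packages the book is built around. The complaints_train data frame and its product and text columns are hypothetical placeholders, not an example taken from the book.

# A minimal sketch, assuming a data frame `complaints_train` with an
# outcome column `product` and a raw text column `text` (hypothetical names)
library(tidymodels)
library(textrecipes)

complaints_rec <- recipe(product ~ text, data = complaints_train) %>%
  step_tokenize(text) %>%                      # split raw text into word tokens
  step_stopwords(text) %>%                     # remove common stop words
  step_tokenfilter(text, max_tokens = 500) %>% # keep the 500 most frequent tokens
  step_tfidf(text)                             # weight token counts by tf-idf

The resulting recipe can be bundled with a model specification in a workflow() and then fit or tuned like any other tidymodels preprocessing object.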

Reviews

"I find this book very useful, as predictive modelling with text is an important field in data science and statistics, and yet the one that has been consistently under-represented in technical literature. Given the growing volume, complexity and accessibility of unstructured data sources, as well as the rapid development of NLP algorithms, knowledge and skills in this domain is in increasing demand. In particular, theres a demand for pragmatic guidelines that offer not just the theoretical background to the NLP issues but also explain the end-to-end modelling process and good practices supported with code examples, just like "Supervised Machine Learning for Text Analysis in R" does. Data scientists and computational linguists would be a prime audience for this kind of publication and would most likely use it as both, (coding) reference and a textbook." ~Kasia Kulma, data science consultant

"This book fills a critical gap between the plethora of text mining books (even in R) that are too basic for practical use and the more complex text mining books that are not accessible to most data scientists. In addition, this book uses statistical techniques to do text mining and text prediction and classification. Not all text mining books take this approach, and given the level of this book, it is one of its strongest features." ~Carol Haney, Quatrics

"This book would be valuable for advanced undergraduates and early PhD students in a wide range of areas that have started using text as dataThe main strength of the book is its connection to the tidyverse environment in R. It's relatively easy to pick up and do powerful things." ~David Mimno, Cornell University

"The authors do a great job of presenting R programmers a variety of deep learning applications to text-based problems. Perhaps one of the best parts of this book is the section on interpretability, where the authors showcase methods to diagnose features on which these complex models rely to make their prediction. Considering how important the area of interpretability is to natural language processing research and is often skipped in applied textbooks, the authors should be commended for incorporating it in this book." ~Kanishka Misra, Purdue University

"In conclusion, the presented book is extremely useful for graduate students, advanced researchers, and practitioners of statistics and data science who are interested in learning cutting-edge supervised ML techniques for text data. By utilizing the tidyverse environment and providing easy-to-understand R code examples with detailed study cases of real-world text mining problems, this book stands out and is a worthwhile read." -Han-Ming Wu, National Chengchi University, Biometrics, September 2022

"The volume is a valuable methodological resource, primarily for students interested in data science, concerned with: understanding the fundamentals of preprocessing steps required to transform a corpus, not always large, into a structure that is a good fit for modeling; implementation of machine learning and deep learning algorithms for building text predictive models under given research contexts in which they have to be integrated." -Anca Vitcu in ISCB Book Reviews, September 2022

Table of contents

Preface xiii
I Natural Language Features 1
1 Language and modeling 3
1.1 Linguistics for text analysis 3
1.2 A glimpse into one area: morphology 5
1.3 Different languages 6
1.4 Other ways text can vary 7
1.5 Summary 8
1.5.1 In this chapter, you learned 8
2 Tokenization 9
2.1 What is a token? 9
2.2 Types of tokens 13
2.2.1 Character tokens 16
2.2.2 Word tokens 18
2.2.3 Tokenizing by n-grams 19
2.2.4 Lines, sentence, and paragraph tokens 22
2.3 Where does tokenization break down? 25
2.4 Building your own tokenizer 26
2.4.1 Tokenize to characters, only keeping letters 27
2.4.2 Allow for hyphenated words 29
2.4.3 Wrapping it in a function 32
2.5 Tokenization for non-Latin alphabets 33
2.6 Tokenization benchmark 34
2.7 Summary 35
2.7.1 In this chapter, you learned 35
3 Stop words 37
3.1 Using premade stop word lists 38
3.1.1 Stop word removal in R 41
3.2 Creating your own stop words list 43
3.3 All stop word lists are context-specific 48
3.4 What happens when you remove stop words 49
3.5 Stop words in languages other than English 50
3.6 Summary 52
3.6.1 In this chapter, you learned 52
4 Stemming 53
4.1 How to stem text in R 54
4.2 Should you use stemming at all? 58
4.3 Understand a stemming algorithm 61
4.4 Handling punctuation when stemming 63
4.5 Compare some stemming options 65
4.6 Lemmatization and stemming 68
4.7 Stemming and stop words 70
4.8 Summary 71
4.8.1 In this chapter, you learned 72
5 Word Embeddings 73
5.1 Motivating embeddings for sparse, high-dimensional data 73
5.2 Understand word embeddings by finding them yourself 77
5.3 Exploring CFPB word embeddings 81
5.4 Use pre-trained word embeddings 88
5.5 Fairness and word embeddings 93
5.6 Using word embeddings in the real world 95
5.7 Summary 96
5.7.1 In this chapter, you learned 97
II Machine Learning Methods 99
Overview 101
6 Regression 105
6.1 A first regression model 106
6.1.1 Building our first regression model 107
6.1.2 Evaluation 112
6.2 Compare to the null model 117
6.3 Compare to a random forest model 119
6.4 Case study: removing stop words 122
6.5 Case study: varying n-grams 126
6.6 Case study: lemmatization 129
6.7 Case study: feature hashing 133
6.7.1 Text normalization 137
6.8 What evaluation metrics are appropriate? 139
6.9 The full game: regression 142
6.9.1 Preprocess the data 142
6.9.2 Specify the model 143
6.9.3 Tune the model 144
6.9.4 Evaluate the modeling 146
6.10 Summary 153
6.10.1 In this chapter, you learned 153
7 Classification 155
7.1 A first classification model 156
7.1.1 Building our first classification model 158
7.1.2 Evaluation 161
7.2 Compare to the null model 166
7.3 Compare to a lasso classification model 167
7.4 Tuning lasso hyperparameters 170
7.5 Case study: sparse encoding 179
7.6 Two-class or multiclass? 183
7.7 Case study: including non-text data 191
7.8 Case study: data censoring 195
7.9 Case study: custom features 201
7.9.1 Detect credit cards 202
7.9.2 Calculate percentage censoring 204
7.9.3 Detect monetary amounts 205
7.10 What evaluation metrics are appropriate? 206
7.11 The full game: classification 208
7.11.1 Feature selection 209
7.11.2 Specify the model 210
7.11.3 Evaluate the modeling 212
7.12 Summary 220
7.12.1 In this chapter, you learned 221
III Deep Learning Methods 223
Overview 225
8 Dense neural networks 231
8.1 Kickstarter data 232
8.2 A first deep learning model 237
8.2.1 Preprocessing for deep learning 237
8.2.2 One-hot sequence embedding of text 240
8.2.3 Simple flattened dense network 244
8.2.4 Evaluation 248
8.3 Using bag-of-words features 253
8.4 Using pre-trained word embeddings 257
8.5 Cross-validation for deep learning models 263
8.6 Compare and evaluate DNN models 267
8.7 Limitations of deep learning 271
8.8 Summary 272
8.8.1 In this chapter, you learned 272
9 Long short-term memory (LSTM) networks 273
9.1 A first LSTM model 273
9.1.1 Building an LSTM 275
9.1.2 Evaluation 279
9.2 Compare to a recurrent neural network 283
9.3 Case study: bidirectional LSTM 286
9.4 Case study: stacking LSTM layers 288
9.5 Case study: padding 289
9.6 Case study: training a regression model 292
9.7 Case study: vocabulary size 295
9.8 The full game: LSTM 297
9.8.1 Preprocess the data 297
9.8.2 Specify the model 298
9.9 Summary 301
9.9.1 In this chapter, you learned 302
10 Convolutional neural networks 303
10.1 What are CNNs? 303
10.1.1 Kernel 304
10.1.2 Kernel size 304
10.2 A first CNN model 305
10.3 Case study: adding more layers 309
10.4 Case study: byte pair encoding 317
10.5 Case study: explainability with LIME 324
10.6 Case study: hyperparameter search 330
10.7 Cross-validation for evaluation 334
10.8 The full game: CNN 337
10.8.1 Preprocess the data 337
10.8.2 Specify the model 338
10.9 Summary 341
10.9.1 In this chapter, you learned 342
IV Conclusion 343
Text models in the real world 345
Appendix 347
A Regular expressions 347
A.1 Literal characters 347
A.1.1 Meta characters 349
A.2 Full stop, the wildcard 349
A.3 Character classes 350
A.3.1 Shorthand character classes 352
A.4 Quantifiers 353
A.5 Anchors 355
A.6 Additional resources 355
B Data 357
B.1 Hans Christian Andersen fairy tales 357
B.2 Opinions of the Supreme Court of the United States 358
B.3 Consumer Financial Protection Bureau (CFPB) complaints 359
B.4 Kickstarter campaign blurbs 359
C Baseline linear classifier 361
C.1 Read in the data 361
C.2 Split into test/train and create resampling folds 362
C.3 Recipe for data preprocessing 363
C.4 Lasso regularized classification model 363
C.5 A model workflow 364
C.6 Tune the workflow 366
References 369
Index 379
About the authors

Emil Hvitfeldt is a clinical data analyst working in healthcare, and an adjunct professor at American University, where he teaches statistical machine learning with tidymodels. He is also an open source R developer and the author of the textrecipes package.

Julia Silge is a data scientist and software engineer at RStudio PBC, where she works on open source modeling tools. She is an author, an international keynote speaker and educator, and a real-world practitioner focusing on data analysis and machine learning practice.