Preface  xvii
Prerequisites and Notation  xvii
Uses for This Book  xviii
What This Book Is Not  xix
|
PART I PRELIMINARIES  1

Chapter 1 Introduction  3
1.1 How This Book Informs the Social Sciences  5
1.2 How This Book Informs the Digital Humanities  8
1.3 How This Book Informs Data Science in Industry and Government  9
|
Chapter 2 Social Science Research and Text Analysis  13
2.4 Social Science as an Iterative and Cumulative Process  17
2.5 An Agnostic Approach to Text Analysis  18
2.6 Discovery, Measurement, and Causal Inference: How the Chinese Government Censors Social Media  20
2.7 Six Principles of Text Analysis  22
2.7.1 Social Science Theories and Substantive Knowledge Are Essential for Research Design  22
2.7.2 Text Analysis Does Not Replace Humans--It Augments Them  24
2.7.3 Building, Refining, and Testing Social Science Theories Requires Iteration and Cumulation  26
2.7.4 Text Analysis Methods Distill Generalizations from Language  28
2.7.5 The Best Method Depends on the Task  29
2.7.6 Validations Are Essential and Depend on the Theory and the Task  30
2.8 Conclusion: Text Data and Social Science  32
|
PART II SELECTION AND REPRESENTATION  33

Chapter 3 Principles of Selection and Representation  35
3.1 Principle 1: Question-Specific Corpus Construction  35
3.2 Principle 2: No Values-Free Corpus Construction  36
3.3 Principle 3: No Right Way to Represent Text  37
3.4 Principle 4: Validation  38
3.5 State of the Union Addresses  38
3.6 The Authorship of the Federalist Papers  39
|
Chapter 4 Selecting Documents  41
4.1 Populations and Quantities of Interest  42
4.3 Considerations of "Found Data"  46
|
|
Chapter 5 Bag of Words  48
5.1 The Bag of Words Model  48
5.2 Choose the Unit of Analysis  49
5.4.4 Create Equivalence Classes (Lemmatize/Stem)  54
5.4.5 Filter by Frequency  55
5.5 Construct Document-Feature Matrix  55
5.6 Rethinking the Defaults  57
5.6.1 Authorship of the Federalist Papers  57
5.6.2 The Scale Argument against Preprocessing  58
|
Chapter 6 The Multinomial Language Model  60
6.1 Multinomial Distribution  61
6.2 Basic Language Modeling  63
6.3 Regularization and Smoothing  66
6.4 The Dirichlet Distribution  66
|
Chapter 7 The Vector Space Model and Similarity Metrics  70
|
Chapter 8 Distributed Representations of Words  78
8.2 Estimating Word Embeddings  81
8.2.1 The Self-Supervision Insight  81
8.2.2 Design Choices in Word Embeddings  81
8.2.3 Latent Semantic Analysis  82
8.2.4 Neural Word Embeddings  82
8.2.5 Pretrained Embeddings  84
8.3 Aggregating Word Embeddings to the Document Level  86
8.5 Contextualized Word Embeddings  88
|
Chapter 9 Representations from Language Sequences  90
9.2 Parts of Speech Tagging  91
9.2.1 Using Phrases to Improve Visualization  92
9.3 Named-Entity Recognition  94
9.5 Broader Information Extraction Tasks  96
|
|
PART III DISCOVERY  99

Chapter 10 Principles of Discovery  103
10.1 Principle 1: Context Relevance  103
10.2 Principle 2: No Ground Truth  104
10.3 Principle 3: Judge the Concept, Not the Method  105
10.4 Principle 4: Separate Data Is Best  106
10.5 Conceptualizing the US Congress  106
|
Chapter 11 Discriminating Words  111
11.3 Fictitious Prediction Problems  117
11.3.1 Standardized Test Statistics as Measures of Separation  118
11.3.2 χ² Test Statistics  118
11.3.3 Multinomial Inverse Regression  121
|
|
Chapter 12 Clustering  123
12.1 An Initial Example Using k-Means Clustering  124
12.2 Representations for Clustering  127
12.3 Approaches to Clustering  127
12.3.1 Components of a Clustering Method  128
12.3.2 Styles of Clustering Methods  130
12.3.3 Probabilistic Clustering Models  132
12.3.4 Algorithmic Clustering Models  134
12.3.5 Connections between Probabilistic and Algorithmic Clustering  137
12.4.3 Choosing the Number of Clusters  140
12.5 The Human Side of Clustering  144
12.5.2 Interactive Clustering  144
|
|
Chapter 13 Topic Models  147
13.1 Latent Dirichlet Allocation  147
13.1.2 Example: Discovering Credit Claiming for Fire Grants in Congressional Press Releases  149
13.2 Interpreting the Output of Topic Models  151
13.3 Incorporating Structure into LDA  153
13.3.1 Structure with Upstream, Known Prevalence Covariates  154
13.3.2 Structure with Upstream, Known Content Covariates  154
13.3.3 Structure with Downstream, Known Covariates  156
13.3.4 Additional Sources of Structure  157
13.4 Structural Topic Models  157
13.4.1 Example: Discovering the Components of Radical Discourse  159
13.5 Labeling Topic Models  159
|
Chapter 14 Low-Dimensional Document Embeddings  162
14.1 Principal Component Analysis  162
14.1.1 Automated Methods for Labeling Principal Components  163
14.1.2 Manual Methods for Labeling Principal Components  164
14.1.3 Principal Component Analysis of Senate Press Releases  164
14.1.4 Choosing the Number of Principal Components  165
14.2 Classical Multidimensional Scaling  167
14.2.1 Extensions of Classical MDS  168
14.2.2 Applying Classical MDS to Senate Press Releases  168
|
|
PART IV MEASUREMENT  171

Chapter 15 Principles of Measurement  173
15.1 From Concept to Measurement  174
15.2 What Makes a Good Measurement  174
15.2.1 Principle 1: Measures Should Have Clear Goals  175
15.2.2 Principle 2: Source Material Should Always Be Identified and Ideally Made Public  175
15.2.3 Principle 3: The Coding Process Should Be Explainable and Reproducible  175
15.2.4 Principle 4: The Measure Should Be Validated  175
15.2.5 Principle 5: Limitations Should Be Explored, Documented, and Communicated to the Audience  176
15.3 Balancing Discovery and Measurement with Sample Splits  176
|
|
Chapter 16 Word Counting  178
16.3 Limitations and Validations of Dictionary Methods  181
16.3.1 Moving Beyond Dictionaries: Wordscores  182
|
Chapter 17 An Overview of Supervised Classification  184
17.1 Example: Discursive Governance  185
17.2 Create a Training Set  186
17.3 Classify Documents with Supervised Learning  186
|
Chapter 18 Coding a Training Set  189
18.1 Characteristics of a Good Training Set  190
18.2.1 1: Decide on a Codebook  191
18.2.3 3: Select Documents to Code  191
18.2.5 5: Check Reliability  192
18.2.7 Example: Making the News  192
18.4 Supervision with Found Data  195
|
Chapter 19 Classifying Documents with Supervised Learning  197
19.1.1 The Assumptions in Naive Bayes Are Almost Certainly Wrong  200
19.1.2 Naive Bayes Is a Generative Model  200
19.1.3 Naive Bayes Is a Linear Classifier  201
19.2.1 Fixed Basis Functions  203
19.2.2 Adaptive Basis Functions  205
19.2.4 Concluding Thoughts on Supervised Learning with Random Samples  207
19.3 Example: Estimating Jihad Scores  207
|
Chapter 20 Checking Performance  211
20.1 Validation with Gold-Standard Data  211
20.1.3 The Importance of Gold-Standard Data  213
20.1.4 Ongoing Evaluations  214
20.2 Validation without Gold-Standard Data  214
20.2.2 Partial Category Replication  215
20.2.3 Nonexpert Human Evaluation  215
20.2.4 Correspondence to External Information  215
20.3 Example: Validating Jihad Scores  216
|
Chapter 21 Repurposing Discovery Methods  219
21.1 Unsupervised Methods Tend to Measure Subject Better than Subtleties  219
21.2 Example: Scaling via Differential Word Rates  220
21.3 A Workflow for Repurposing Unsupervised Methods for Measurement  221
21.3.3 3: Validate the Model  223
21.3.4 4: Fit to the Test Data and Revalidate  225
21.4 Concerns in Repurposing Unsupervised Methods for Measurement  225
21.4.1 Concern 1: The Method Always Returns a Result  226
21.4.2 Concern 2: Opaque Differences in Estimation Strategies  226
21.4.3 Concern 3: Sensitivity to Unintuitive Hyperparameters  227
21.4.4 Concern 4: Instability in Results  227
21.4.5 Rethinking Stability  228
|
|
PART V INFERENCE  231

Chapter 22 Principles of Inference  233
22.2.1 Causal Inference Places Identification First  235
22.2.2 Prediction Is about Outcomes That Will Happen, Causal Inference Is about Outcomes from Interventions  235
22.2.3 Prediction and Causal Inference Require Different Validations  236
22.2.4 Prediction and Causal Inference Use Features Differently  237
22.3 Comparing Prediction and Causal Inference  238
22.4 Partial and General Equilibrium in Prediction and Causal Inference  238
|
|
Chapter 23 Prediction  241
23.1 The Basic Task of Prediction  242
23.2 Similarities and Differences between Prediction and Measurement  243
23.3 Five Principles of Prediction  244
23.3.1 Predictive Features Do Not Have to Cause the Outcome  244
23.3.2 Cross-Validation Is Not Always a Good Measure of Predictive Power  244
23.3.3 It's Not Always Better to Be More Accurate on Average  246
23.3.4 There Can Be Practical Value in Interpreting Models for Prediction  247
23.3.5 It Can Be Difficult to Apply Prediction to Policymaking  247
23.4 Using Text as Data for Prediction: Examples  249
23.4.2 Linguistic Prediction  253
23.4.3 Social Forecasting  254
|
Chapter 24 Causal Inference  259
24.1 Introduction to Causal Inference  260
24.2 Similarities and Differences between Prediction and Measurement, and Causal Inference  263
24.3 Key Principles of Causal Inference with Text  263
24.3.1 The Core Problems of Causal Inference Remain, Even When Working with Text  263
24.3.2 Our Conceptualization of the Treatment and Outcome Remains a Critical Component of Causal Inference with Text  264
24.3.3 The Challenges of Making Causal Inferences with Text Underscore the Need for Sequential Science  264
24.4 The Mapping Function  266
24.4.1 Causal Inference with g  267
24.4.2 Identification and Overfitting  268
24.5 Workflows for Making Causal Inferences with Text  269
24.5.1 Define g before Looking at the Documents  269
24.5.2 Use a Train/Test Split  269
24.5.3 Run Sequential Experiments  271
|
Chapter 25 Text as Outcome  272
25.1 An Experiment on Immigration  272
25.2 The Effect of Presidential Public Appeals  275
|
Chapter 26 Text as Treatment  277
26.1 An Experiment Using Trump's Tweets  279
26.2 A Candidate Biography Experiment  281
|
Chapter 27 Text as Confounder  285
27.1 Regression Adjustments for Text Confounders  287
27.2 Matching Adjustments for Text  290
|
|
PART VI CONCLUSION  295

Chapter 28 Conclusion  297
28.1 How to Use Text as Data in the Social Sciences  298
28.1.1 The Focus on Social Science Tasks  298
28.1.2 Iterative and Sequential Nature of the Social Sciences  298
28.1.3 Model Skepticism and the Application of Machine Learning to the Social Sciences  299
28.2 Applying Our Principles beyond Text Data  299
28.3 Avoiding the Cycle of Creation and Destruction in Social Science Methodology  300
Acknowledgments  303
Bibliography  307
Index  331