Preface  xvii
Prerequisites and Notation  xvii
Uses for This Book  xviii
What This Book Is Not  xix
|
PART I PRELIMINARIES  1

Chapter 1 Introduction  3
1.1 How This Book Informs the Social Sciences  5
1.2 How This Book Informs the Digital Humanities  8
1.3 How This Book Informs Data Science in Industry and Government  9
|
Chapter 2 Social Science Research and Text Analysis  13
2.4 Social Science as an Iterative and Cumulative Process  17
2.5 An Agnostic Approach to Text Analysis  18
2.6 Discovery, Measurement, and Causal Inference: How the Chinese Government Censors Social Media  20
2.7 Six Principles of Text Analysis  22
2.7.1 Social Science Theories and Substantive Knowledge Are Essential for Research Design  22
2.7.2 Text Analysis Does Not Replace Humans--It Augments Them  24
2.7.3 Building, Refining, and Testing Social Science Theories Requires Iteration and Cumulation  26
2.7.4 Text Analysis Methods Distill Generalizations from Language  28
2.7.5 The Best Method Depends on the Task  29
2.7.6 Validations Are Essential and Depend on the Theory and the Task  30
2.8 Conclusion: Text Data and Social Science  32
|
PART II SELECTION AND REPRESENTATION  33

Chapter 3 Principles of Selection and Representation  35
3.1 Principle 1: Question-Specific Corpus Construction  35
3.2 Principle 2: No Values-Free Corpus Construction  36
3.3 Principle 3: No Right Way to Represent Text  37
3.4 Principle 4: Validation  38
3.5 State of the Union Addresses  38
3.6 The Authorship of the Federalist Papers  39
|
Chapter 4 Selecting Documents  41
4.1 Populations and Quantities of Interest  42
4.3 Considerations of "Found Data"  46
|
|
Chapter 5 Bag of Words  48
5.1 The Bag of Words Model  48
5.2 Choose the Unit of Analysis  49
5.4.4 Create Equivalence Classes (Lemmatize/Stem)  54
5.4.5 Filter by Frequency  55
5.5 Construct Document-Feature Matrix  55
5.6 Rethinking the Defaults  57
5.6.1 Authorship of the Federalist Papers  57
5.6.2 The Scale Argument against Preprocessing  58
|
Chapter 6 The Multinomial Language Model  60
6.1 Multinomial Distribution  61
6.2 Basic Language Modeling  63
6.3 Regularization and Smoothing  66
6.4 The Dirichlet Distribution  66
|
Chapter 7 The Vector Space Model and Similarity Metrics  70
|
Chapter 8 Distributed Representations of Words  78
8.2 Estimating Word Embeddings  81
8.2.1 The Self-Supervision Insight  81
8.2.2 Design Choices in Word Embeddings  81
8.2.3 Latent Semantic Analysis  82
8.2.4 Neural Word Embeddings  82
8.2.5 Pretrained Embeddings  84
8.3 Aggregating Word Embeddings to the Document Level  86
8.5 Contextualized Word Embeddings  88
|
Chapter 9 Representations from Language Sequences  90
9.2 Parts of Speech Tagging  91
9.2.1 Using Phrases to Improve Visualization  92
9.3 Named-Entity Recognition  94
9.5 Broader Information Extraction Tasks  96
|
|
PART III DISCOVERY  99

Chapter 10 Principles of Discovery  103
10.1 Principle 1: Context Relevance  103
10.2 Principle 2: No Ground Truth  104
10.3 Principle 3: Judge the Concept, Not the Method  105
10.4 Principle 4: Separate Data Is Best  106
10.5 Conceptualizing the US Congress  106
|
Chapter 11 Discriminating Words  111
11.3 Fictitious Prediction Problems  117
11.3.1 Standardized Test Statistics as Measures of Separation  118
11.3.2 χ² Test Statistics  118
11.3.3 Multinomial Inverse Regression  121
|
|
Chapter 12 Clustering  123
12.1 An Initial Example Using k-Means Clustering  124
12.2 Representations for Clustering  127
12.3 Approaches to Clustering  127
12.3.1 Components of a Clustering Method  128
12.3.2 Styles of Clustering Methods  130
12.3.3 Probabilistic Clustering Models  132
12.3.4 Algorithmic Clustering Models  134
12.3.5 Connections between Probabilistic and Algorithmic Clustering  137
12.4.3 Choosing the Number of Clusters  140
12.5 The Human Side of Clustering  144
12.5.2 Interactive Clustering  144
|
|
Chapter 13 Topic Models  147
13.1 Latent Dirichlet Allocation  147
13.1.2 Example: Discovering Credit Claiming for Fire Grants in Congressional Press Releases  149
13.2 Interpreting the Output of Topic Models  151
13.3 Incorporating Structure into LDA  153
13.3.1 Structure with Upstream, Known Prevalence Covariates  154
13.3.2 Structure with Upstream, Known Content Covariates  154
13.3.3 Structure with Downstream, Known Covariates  156
13.3.4 Additional Sources of Structure  157
13.4 Structural Topic Models  157
13.4.1 Example: Discovering the Components of Radical Discourse  159
13.5 Labeling Topic Models  159
|
Chapter 14 Low-Dimensional Document Embeddings  162
14.1 Principal Component Analysis  162
14.1.1 Automated Methods for Labeling Principal Components  163
14.1.2 Manual Methods for Labeling Principal Components  164
14.1.3 Principal Component Analysis of Senate Press Releases  164
14.1.4 Choosing the Number of Principal Components  165
14.2 Classical Multidimensional Scaling  167
14.2.1 Extensions of Classical MDS  168
14.2.2 Applying Classical MDS to Senate Press Releases  168
|
|
PART IV MEASUREMENT  171

Chapter 15 Principles of Measurement  173
15.1 From Concept to Measurement  174
15.2 What Makes a Good Measurement  174
15.2.1 Principle 1: Measures Should Have Clear Goals  175
15.2.2 Principle 2: Source Material Should Always Be Identified and Ideally Made Public  175
15.2.3 Principle 3: The Coding Process Should Be Explainable and Reproducible  175
15.2.4 Principle 4: The Measure Should Be Validated  175
15.2.5 Principle 5: Limitations Should Be Explored, Documented, and Communicated to the Audience  176
15.3 Balancing Discovery and Measurement with Sample Splits  176
|
|
Chapter 16 Word Counting  178
16.3 Limitations and Validations of Dictionary Methods  181
16.3.1 Moving Beyond Dictionaries: Wordscores  182
|
Chapter 17 An Overview of Supervised Classification  184
17.1 Example: Discursive Governance  185
17.2 Create a Training Set  186
17.3 Classify Documents with Supervised Learning  186
|
Chapter 18 Coding a Training Set  189
18.1 Characteristics of a Good Training Set  190
18.2.1 1: Decide on a Codebook  191
18.2.3 3: Select Documents to Code  191
18.2.5 5: Check Reliability  192
18.2.7 Example: Making the News  192
18.4 Supervision with Found Data  195
|
Chapter 19 Classifying Documents with Supervised Learning  197
19.1.1 The Assumptions in Naive Bayes Are Almost Certainly Wrong  200
19.1.2 Naive Bayes Is a Generative Model  200
19.1.3 Naive Bayes Is a Linear Classifier  201
19.2.1 Fixed Basis Functions  203
19.2.2 Adaptive Basis Functions  205
19.2.4 Concluding Thoughts on Supervised Learning with Random Samples  207
19.3 Example: Estimating Jihad Scores  207
|
Chapter 20 Checking Performance  211
20.1 Validation with Gold-Standard Data  211
20.1.3 The Importance of Gold-Standard Data  213
20.1.4 Ongoing Evaluations  214
20.2 Validation without Gold-Standard Data  214
20.2.2 Partial Category Replication  215
20.2.3 Nonexpert Human Evaluation  215
20.2.4 Correspondence to External Information  215
20.3 Example: Validating Jihad Scores  216
|
Chapter 21 Repurposing Discovery Methods  219
21.1 Unsupervised Methods Tend to Measure Subject Better than Subtleties  219
21.2 Example: Scaling via Differential Word Rates  220
21.3 A Workflow for Repurposing Unsupervised Methods for Measurement  221
21.3.3 3: Validate the Model  223
21.3.4 4: Fit to the Test Data and Revalidate  225
21.4 Concerns in Repurposing Unsupervised Methods for Measurement  225
21.4.1 Concern 1: The Method Always Returns a Result  226
21.4.2 Concern 2: Opaque Differences in Estimation Strategies  226
21.4.3 Concern 3: Sensitivity to Unintuitive Hyperparameters  227
21.4.4 Concern 4: Instability in Results  227
21.4.5 Rethinking Stability  228
|
|
PART V INFERENCE  231

Chapter 22 Principles of Inference  233
22.2.1 Causal Inference Places Identification First  235
22.2.2 Prediction Is about Outcomes That Will Happen, Causal Inference Is about Outcomes from Interventions  235
22.2.3 Prediction and Causal Inference Require Different Validations  236
22.2.4 Prediction and Causal Inference Use Features Differently  237
22.3 Comparing Prediction and Causal Inference  238
22.4 Partial and General Equilibrium in Prediction and Causal Inference  238
|
|
Chapter 23 Prediction  241
23.1 The Basic Task of Prediction  242
23.2 Similarities and Differences between Prediction and Measurement  243
23.3 Five Principles of Prediction  244
23.3.1 Predictive Features Do Not Have to Cause the Outcome  244
23.3.2 Cross-Validation Is Not Always a Good Measure of Predictive Power  244
23.3.3 It's Not Always Better to Be More Accurate on Average  246
23.3.4 There Can Be Practical Value in Interpreting Models for Prediction  247
23.3.5 It Can Be Difficult to Apply Prediction to Policymaking  247
23.4 Using Text as Data for Prediction: Examples  249
23.4.2 Linguistic Prediction  253
23.4.3 Social Forecasting  254
|
Chapter 24 Causal Inference  259
24.1 Introduction to Causal Inference  260
24.2 Similarities and Differences between Prediction and Measurement, and Causal Inference  263
24.3 Key Principles of Causal Inference with Text  263
24.3.1 The Core Problems of Causal Inference Remain, Even When Working with Text  263
24.3.2 Our Conceptualization of the Treatment and Outcome Remains a Critical Component of Causal Inference with Text  264
24.3.3 The Challenges of Making Causal Inferences with Text Underscore the Need for Sequential Science  264
24.4 The Mapping Function  266
24.4.1 Causal Inference with g  267
24.4.2 Identification and Overfitting  268
24.5 Workflows for Making Causal Inferences with Text  269
24.5.1 Define g before Looking at the Documents  269
24.5.2 Use a Train/Test Split  269
24.5.3 Run Sequential Experiments  271
|
Chapter 25 Text as Outcome  272
25.1 An Experiment on Immigration  272
25.2 The Effect of Presidential Public Appeals  275
|
Chapter 26 Text as Treatment  277
26.1 An Experiment Using Trump's Tweets  279
26.2 A Candidate Biography Experiment  281
|
Chapter 27 Text as Confounder  285
27.1 Regression Adjustments for Text Confounders  287
27.2 Matching Adjustments for Text  290
|
|
PART VI CONCLUSION  295

Chapter 28 Conclusion  297
28.1 How to Use Text as Data in the Social Sciences  298
28.1.1 The Focus on Social Science Tasks  298
28.1.2 Iterative and Sequential Nature of the Social Sciences  298
28.1.3 Model Skepticism and the Application of Machine Learning to the Social Sciences  299
28.2 Applying Our Principles beyond Text Data  299
28.3 Avoiding the Cycle of Creation and Destruction in Social Science Methodology  300
Acknowledgments  303
Bibliography  307
Index  331