xxiii
Introduction   1 (6)
Acknowledgments   7 (1)
References   7 (2)

Section 1 The New Survey Landscape   9 (122)
1 Why Machines Matter for Survey and Social Science Researchers: Exploring Applications of Machine Learning Methods for Design, Data Collection, and Analysis   11 (52)
11 (2)
1.2 Overview of Machine Learning Methods and Their Evaluation   13 (3)
1.3 Creating Sample Designs and Constructing Sampling Frames Using Machine Learning Methods   16 (7)
1.3.1 Sample Design Creation   16 (2)
1.3.2 Sample Frame Construction   18 (2)
1.3.3 Considerations and Implications for Applying Machine Learning Methods for Creating Sampling Frames and Designs   20 (1)
1.3.3.1 Considerations About Algorithmic Optimization   20 (1)
1.3.3.2 Implications About Machine Learning Model Error   21 (1)
1.3.3.3 Data Type Considerations and Implications About Data Errors   22 (1)
1.4 Questionnaire Design and Evaluation Using Machine Learning Methods   23 (5)
24 (2)
1.4.2 Evaluation and Testing   26 (1)
1.4.3 Instrumentation and Interviewer Training   27 (1)
1.4.4 Alternative Data Sources   28 (1)
1.5 Survey Recruitment and Data Collection Using Machine Learning Methods   28 (5)
1.5.1 Monitoring and Interviewer Falsification   29 (1)
1.5.2 Responsive and Adaptive Designs   29 (4)
1.6 Survey Data Coding and Processing Using Machine Learning Methods   33 (4)
1.6.1 Coding Unstructured Text   33 (2)
1.6.2 Data Validation and Editing   35 (1)
35 (1)
1.6.4 Record Linkage and Duplicate Detection   36 (1)
1.7 Sample Weighting and Survey Adjustments Using Machine Learning Methods   37 (6)
1.7.1 Propensity Score Estimation   37 (4)
41 (2)
1.8 Survey Data Analysis and Estimation Using Machine Learning Methods   43 (4)
1.8.1 Gaining Insights Among Survey Variables   44 (1)
1.8.2 Adapting Machine Learning Methods to the Survey Setting   45 (1)
1.8.3 Leveraging Machine Learning Algorithms for Finite Population Inference   46 (1)
1.9 Discussion and Conclusions   47 (16)
48 (12)
60 (3)

2 The Future Is Now: How Surveys Can Harness Social Media to Address Twenty-first Century Challenges   63 (36)
63 (4)
2.2 New Ways of Thinking About Survey Research   67 (1)
2.3 The Challenge with Sampling People   67 (5)
2.3.1 The Social Media Opportunities   68 (1)
2.3.1.1 Venue-Based, Time-Space Sampling   68 (2)
2.3.1.2 Respondent-Driven Sampling   70 (1)
2.3.2 Outstanding Challenges   71 (1)
2.4 The Challenge with Identifying People   72 (2)
2.4.1 The Social Media Opportunity   73 (1)
2.4.2 Outstanding Challenges   73 (1)
2.5 The Challenge with Reaching People   74 (3)
2.5.1 The Social Media Opportunities   75 (1)
75 (1)
2.5.1.2 Paid Social Media Advertising   76 (1)
2.5.2 Outstanding Challenges   77 (1)
2.6 The Challenge with Persuading People to Participate   77 (4)
2.6.1 The Social Media Opportunities   78 (1)
2.6.1.1 Paid Social Media Advertising   78 (1)
2.6.1.2 Online Influencers   79 (1)
2.6.2 Outstanding Challenges   80 (1)
2.7 The Challenge with Interviewing People   81 (6)
2.7.1 Social Media Opportunities   82 (1)
2.7.1.1 Passive Social Media Data Mining   82 (1)
2.7.1.2 Active Data Collection   83 (1)
2.7.2 Outstanding Challenges   84 (3)
87 (12)
89 (10)

3 Linking Survey Data with Commercial or Administrative Data for Data Quality Assessment   99 (32)
99 (2)
3.2 Thinking About Quality Features of Analytic Data Sources   101 (3)
3.2.1 What Is the Purpose of the Data Linkage?   101 (1)
3.2.2 What Kind of Data Linkage for What Analytic Purpose?   102 (2)
3.3 Data Used in This Chapter   104 (12)
3.3.1 NSECE Household Survey   104 (1)
3.3.2 Proprietary Research Files from Zillow   105 (2)
3.3.3 Linking the NSECE Household Survey with Zillow Proprietary Datafiles   107 (1)
3.3.3.1 Nonuniqueness of Matches   107 (3)
3.3.3.2 Misalignment of Units of Observation   110 (1)
3.3.3.3 Ability to Identify Matches   110 (2)
3.3.3.4 Identifying Matches   112 (2)
3.3.3.5 Implications of the Linking Process for Intended Analyses   114 (2)
3.4 Assessment of Data Quality Using the Linked File   116 (9)
3.4.1 What Variables in the Zillow Datafile Are Most Appropriate for Use in Substantive Analyses Linked to Survey Data?   116 (3)
3.4.2 How Did Different Steps in the Survey Administration Process Contribute to Representativeness of the NSECE Survey Data?   119 (4)
3.4.3 How Well Does the Linked Datafile Represent the Overall NSECE Dataset (Including Unlinked Records)?   123 (2)
125 (6)
127 (2)
129 (2)

Section 2 Total Error and Data Quality   131 (142)

4 Total Error Frameworks for Found Data   133 (30)
133 (1)
4.2 Data Integration and Estimation   134 (4)
135 (2)
4.2.2 The Integration Process   137 (1)
137 (1)
138 (3)
4.4 Errors in Hybrid Estimates   141 (15)
4.4.1 Error-Generating Processes   141 (4)
4.4.2 Components of Bias, Variance, and Mean Squared Error   145 (3)
148 (5)
153 (1)
4.4.4.1 Sample Recruitment Error   153 (3)
4.4.4.2 Data Encoding Error   156 (1)
4.5 Other Error Frameworks   156 (2)
4.6 Summary and Conclusions   158 (5)
160 (3)

5 Measuring the Strength of Attitudes in Social Media Data   163 (30)
163 (2)
165 (9)
165 (1)
5.2.1.1 European Social Survey Data   166 (1)
167 (2)
169 (1)
169 (1)
170 (1)
171 (2)
173 (1)
173 (1)
174 (6)
5.3.1 Overall Comparisons   174 (1)
175 (2)
177 (1)
178 (2)
180 (4)
5.A 2016 German ESS Questions Used in Analysis   184 (2)
5.B Search Terms Used to Identify Topics in Reddit Posts (2016 and 2018)   186 (1)
186 (1)
5.B.2 Interest in Politics   186 (1)
186 (1)
187 (1)
187 (1)
187 (1)
5.C Example of Coding Steps Used to Identify Topics and Assign Sentiment in Reddit Submissions (2016 and 2018)   188 (5)
189 (4)

6 Attention to Campaign Events: Do Twitter and Self-Report Metrics Tell the Same Story?   193 (24)
6.1 What Can Social Media Tell Us About Social Phenomena?   193 (2)
6.2 The Empirical Evidence to Date   195 (1)
6.3 Tweets as Public Attention   196 (1)
197 (1)
198 (6)
6.6 Did Events Peak at the Same Time Across Data Streams?   204 (1)
6.7 Were Event Words Equally Prominent Across Data Streams?   205 (1)
6.8 Were Event Terms Similarly Associated with Particular Candidates?   206 (1)
6.9 Were Event Trends Similar Across Data Streams?   207 (4)
6.10 Unpacking Differences Between Samples   211 (1)
212 (1)
213 (4)

7 Improving Quality of Administrative Data: A Case Study with FBI's National Incident-Based Reporting System Data   217 (28)
217 (3)
220 (2)
7.2.1 Administrative Crime Statistics and the History of NIBRS Data   220 (1)
7.2.2 Construction of the NIBRS Dataset   221 (1)
7.3 Data Quality Improvement Based on the Total Error Framework   222 (12)
7.3.1 Data Quality Assessment Using the Row-Column-Cell Framework   224 (1)
7.3.1.1 Phase I: Evaluating Each Data Table   224 (1)
225 (1)
226 (1)
226 (1)
7.3.1.5 Row-Column-Cell Errors Impacting NIBRS   227 (1)
7.3.1.6 Phase II: Evaluating the Integrated Data   227 (1)
7.3.1.7 Errors in Data Integration Process   227 (1)
7.3.1.8 Coverage Errors Due to Nonreporting Agencies   228 (1)
7.3.1.9 Nonresponse Errors in the Incident Data Table Due to Unreported Incident Reports   229 (1)
7.3.1.10 Invalid, Unknown, and Missing Values Within the Incident Reports   230 (1)
7.3.2 Improving Data Quality via Sampling, Weighting, and Imputation   231 (1)
7.3.2.1 Sample-Based Method to Improve Data Representativeness at the Agency Level   231 (1)
7.3.2.2 Statistical Weighting to Adjust for Coverage Errors at the Agency Level   232 (1)
7.3.2.3 Imputation to Compensate for Unreported Incidents and Missing Values in the Incident Reports   233 (1)
7.4 Utilizing External Data Sources in Improving Data Quality of the Administrative Data   234 (4)
7.4.1 Understanding the External Data Sources   234 (1)
7.4.1.1 Data Quality Assessment of External Data Sources   234 (1)
7.4.1.2 Producing Population Counts at the Agency Level Through Auxiliary Data   235 (1)
7.4.2 Administrative vs. Survey Data for Crime Statistics   236 (2)
7.4.3 A Pilot Study on Crime in the Bakken Region   238 (1)
7.5 Summary and Future Work   239 (6)
241 (4)

8 Performance and Sensitivities of Home Detection on Mobile Phone Data   245 (28)
245 (4)
8.1.1 Mobile Phone Data and Official Statistics   245 (2)
8.1.2 The Home Detection Problem   247 (2)
8.2 Deploying Home Detection Algorithms to a French CDR Dataset   249 (6)
249 (2)
8.2.2 The French Mobile Phone Dataset   251 (1)
8.2.3 Defining Nine Home Detection Algorithms   252 (1)
8.2.4 Different Observation Periods   253 (2)
8.2.5 Summary of Data and Setup   255 (1)
8.3 Assessing Home Detection Performance at Nationwide Scale   255 (3)
256 (1)
8.3.2 Assessing Performance and Sensitivities   256 (1)
8.3.2.1 Correlation with Ground Truth Data   256 (2)
8.3.2.2 Ratio and Spatial Patterns   258 (1)
8.3.2.3 Temporality and Sensitivity   258 (1)
258 (9)
8.4.1 Relations between HDAs' User Counts and Ground Truth   258 (2)
8.4.2 Spatial Patterns of Ratios Between User Counts and Population Counts   260 (1)
8.4.3 Temporality of Correlations   260 (6)
8.4.4 Sensitivity to the Duration of Observation   266 (1)
8.4.5 Sensitivity to Criteria Choice   266 (1)
8.5 Discussion and Conclusion   267 (6)
270 (3)

Section 3 Big Data in Official Statistics   273 (114)

9 Big Data Initiatives in Official Statistics   275 (28)
275 (1)
9.2 Some Characteristics of the Changing Survey Landscape   276 (4)
9.3 Current Strategies to Handle the Changing Survey Landscape   280 (5)
281 (1)
9.3.2 Forming Partnerships   281 (1)
9.3.3 Cooperation Between European NSIs   282 (1)
9.3.4 Creating Big Data Centers   282 (1)
9.3.5 Experimental Statistics   283 (1)
9.3.6 Organizing Hackathons   283 (1)
9.3.7 IT Infrastructure, Tools, and Methods   284 (1)
9.4 The Potential of Big Data and the Use of New Methods in Official Statistics   285 (5)
285 (1)
9.4.1.1 Green Areas in the Swedish City of Lidingö   285 (1)
9.4.1.2 Innovative Companies   285 (1)
9.4.1.3 Coding Commodity Flow Survey   286 (1)
287 (1)
287 (1)
9.4.2.2 Expenditure Surveys   288 (1)
9.4.2.3 Examples of Improving Statistics by Adjusting for Bias   288 (1)
289 (1)
289 (1)
289 (1)
9.4.4.1 Consumer Price Index (CPI)   289 (1)
289 (1)
9.4.4.3 ISCO and NACE Coding at Statistics Finland   290 (1)
290 (3)
293 (2)
9.6.1 Allowing Access to Data   293 (1)
9.6.2 Providing Access to Data   294 (1)
295 (8)
296 (7)

10 Big Data in Official Statistics: A Perspective from Statistics Netherlands   303 (36)
303 (1)
10.2 Big Data and Official Statistics   304 (1)
10.3 Examples of Big Data in Official Statistics   305 (4)
305 (1)
306 (1)
10.3.3 Social Media Messages   307 (1)
308 (1)
10.4 Principles for Assessing the Quality of Big Data Statistics   309 (7)
310 (1)
10.4.2 Models in Official Statistics   311 (1)
10.4.3 Objectivity and Reliability   312 (2)
314 (1)
10.4.5 Some Examples of Quality Assessments of Big Data Statistics   315 (1)
10.5 Integration of Big Data with Other Statistical Sources   316 (9)
10.5.1 Big Data as Auxiliary Data   316 (1)
10.5.2 Size of the Internet Economy   317 (2)
10.5.3 Improving the Consumer Confidence Index   319 (2)
10.5.4 Big Data and the Quality of Gross National Product Estimates   321 (1)
10.5.5 Google Trends for Nowcasting   322 (1)
10.5.6 Multisource Statistics: Combination of Survey and Sensor Data   323 (1)
10.5.7 Combining Administrative and Open Data Sources to Complete Energy Statistics   324 (1)
10.6 Disclosure Control with Big Data   325 (2)
326 (1)
326 (1)
326 (1)
10.7 The Way Ahead: A Chance for Paradigm Fusion   327 (3)
10.7.1 Measurement and Selection Bias   328 (1)
329 (1)
329 (1)
10.7.4 Phenomenon-Oriented Statistics   330 (1)
330 (9)
331 (6)
337 (2)

11 Mining the New Oil for Official Statistics   339 (20)
339 (2)
11.2 Statistical Inference for Binary Variables from Nonprobability Samples   341 (2)
11.3 Integrating Data Source B Subject to Undercoverage Bias   343 (1)
11.4 Integrating Data Sources Subject to Measurement Errors   344 (1)
11.5 Integrating Probability Sample A Subject to Unit Nonresponse   345 (2)
347 (3)
11.7 Examples of Official Statistics Applications   350 (3)
353 (1)
354 (5)
354 (3)
357 (2)

12 Investigating Alternative Data Sources to Reduce Respondent Burden in United States Census Bureau Retail Economic Data Products   359 (28)
359 (3)
12.1.1 Overview of the Economic Directorate   360 (1)
361 (1)
12.1.3 Overview of the Census Bureau Retail Programs   361 (1)
362 (4)
366 (3)
12.3.1 Background on Point-of-Sale Data   366 (2)
368 (1)
369 (12)
12.4.1 Selection of Retailers   370 (1)
12.4.2 National-Level Data   371 (4)
375 (2)
377 (4)
381 (6)
384 (1)
384 (1)
384 (3)

Section 4 Combining Big Data with Survey Statistics: Methods and Applications   387 (148)

13 Effects of Incentives in Smartphone Data Collection   389 (26)
389 (1)
13.2 The Influence of Incentives on Participation   390 (2)
13.3 Institut für Arbeitsmarkt- und Berufsforschung (IAB)-SMART Study Design   392 (6)
13.3.1 Sampling Frame and Sample Restrictions   393 (1)
13.3.2 Invitation and Data Request   394 (3)
13.3.3 Experimental Design for Incentive Study   397 (1)
397 (1)
398 (7)
398 (2)
13.4.2 Number of Initially Activated Data-Sharing Functions   400 (1)
13.4.3 Deactivating Functions   401 (1)
402 (1)
403 (2)
405 (10)
13.5.1 Limitations and Future Research   407 (5)
412 (3)

14 Using Machine Learning Models to Predict Attrition in a Survey Panel   415 (20)
415 (3)
417 (1)
418 (5)
418 (1)
14.2.2 Support Vector Machines   419 (1)
420 (1)
14.2.4 Evaluation Criteria   420 (2)
14.2.4.1 Tuning Parameters   422 (1)
423 (2)
14.3.1 Which Are the Important Predictors?   425 (1)
425 (3)
14.A Questions Used in the Analysis   428 (7)
431 (4)

15 Assessing Community Wellbeing Using Google Street-View and Satellite Imagery   435 (52)
435 (2)
437 (14)
15.2.1 Sampling Units and Frames   437 (1)
438 (1)
15.2.2.1 Study Outcomes from Survey Data   438 (2)
15.2.2.2 Study Predictors from Built Environment Data   440 (7)
15.2.2.3 Study Predictors from Geospatial Imagery   447 (3)
15.2.2.4 Model Development, Testing, and Evaluation   450 (1)
451 (6)
451 (4)
455 (1)
456 (1)
457 (2)
15.A Amazon Mechanical Turk Questionnaire   459 (2)
461 (2)
15.C Descriptive Statistics   463 (6)
15.D Stepwise AIC OLS Regression Models   469 (3)
15.E Generalized Linear Models via Penalized Maximum Likelihood with k-Fold Cross-Validation   472 (5)
15.F Heat Maps: Actual vs. Model-Based Outcomes   477 (10)
485 (2)

16 Nonparametric Bootstrap and Small Area Estimation to Mitigate Bias in Crowdsourced Data: Simulation Study and Application to Perceived Safety   487 (32)
487 (2)
16.2 The Rise of Crowdsourcing and Implications   489 (1)
16.3 Crowdsourcing Data to Analyze Social Phenomena: Limitations   490 (2)
16.3.1 Self-Selection Bias   490 (1)
16.3.2 Unequal Participation   491 (1)
16.3.3 Underrepresentation of Certain Areas and Times   492 (1)
16.3.4 Unreliable Area-Level Direct Estimates and Difficulty to Interpret Results   492 (1)
16.4 Previous Approaches for Reweighting Crowdsourced Data   492 (1)
16.5 A New Approach: Small Area Estimation Under a Nonparametric Bootstrap Estimator   493 (3)
16.5.1 Step 1: Nonparametric Bootstrap   494 (2)
16.5.2 Step 2: Area-Level Model-Based Small Area Estimation   496 (1)
496 (7)
16.6.1 Population Generation   497 (1)
16.6.2 Sample Selection and Simulation Steps   497 (2)
499 (4)
16.7 Case Study: Safety Perceptions in London   503 (8)
16.7.1 The Spatial Study of Safety Perceptions   503 (1)
504 (1)
16.7.2.1 Place Pulse 2.0 Dataset   504 (2)
16.7.2.2 Area-Level Covariates   506 (1)
506 (1)
16.7.3.1 Model Diagnostics and External Validation   506 (4)
16.7.3.2 Mapping Safety Perceptions at Neighborhood Level   510 (1)
16.8 Discussion and Conclusions   511 (8)
513 (6)

17 Using Big Data to Improve Sample Efficiency   519 (16)
17.1 Introduction and Background   519 (4)
17.2 Methods to More Efficiently Sample Unregistered Boat-Owning Households   523 (7)
17.2.1 Model 1: Spatial Boat Density Model   525 (1)
17.2.2 Model 2: Address-Level Boat-Ownership Propensity   526 (4)
530 (3)
533 (2)
534 (1)
534 (1)

Section 5 Combining Big Data with Survey Statistics: Tools   535 (90)

18 Feedback Loop: Using Surveys to Build and Assess Registration-Based Sample Religious Flags for Survey Research   537 (24)
537 (1)
538 (1)
539 (1)
540 (1)
541 (2)
543 (2)
545 (1)
545 (7)
18.9 Considering Systematic Matching Rates   552 (2)
18.10 Discussion and Conclusions   554 (7)
557 (4)

19 Artificial Intelligence and Machine Learning Derived Efficiencies for Large-Scale Survey Estimation Efforts   561 (36)
561 (1)
562 (1)
563 (1)
19.3 Accelerating the MEPS Imputation Processes: Development of Fast-Track MEPS Analytic Files   563 (9)
19.3.1 MEPS Data Files and Variables   566 (1)
19.3.2 Identification of Predictors of Medical Care Sources of Payment   567 (4)
19.3.2.1 Class Variables Used in the Imputation   571 (1)
19.3.3 Weighted Sequential Hot Deck Imputation   572 (1)
19.4 Building the Prototype   572 (3)
19.4.1 Learning from the Data: Results for the 2012 MEPS   573 (2)
19.5 An Artificial Intelligence Approach to Fast-Track MEPS Imputation   575 (13)
19.5.1 Why Artificial Intelligence for Health-Care Cost Prediction   577 (1)
19.5.1.1 Imputation Strategies   578 (2)
19.5.1.2 Testing of Imputation Strategies   580 (1)
580 (1)
19.5.1.4 Raw Data Extraction   581 (1)
19.5.1.5 Attribute Selection   582 (2)
19.5.1.6 Inter-Variable Correlation   584 (1)
19.5.1.7 Multi-Output Random Forest   584 (1)
585 (3)
588 (9)
592 (1)
593 (4)

20 Worldwide Population Estimates for Small Geographic Areas: Can We Do a Better Job?   597 (28)
597 (1)
598 (2)
20.3 Gridded Population Estimates   600 (8)
600 (1)
20.3.2 Basic Gridded Population Models   601 (1)
601 (1)
602 (1)
603 (1)
604 (1)
20.3.7 Challenges, Pros, and Cons of Gridded Population Estimates   605 (3)
20.4 Population Estimates in Surveys   608 (5)
20.4.1 Standard Sampling Strategies   608 (1)
20.4.2 Gridded Population Sampling from 1 km × 1 km Grid Cells   609 (1)
609 (2)
20.4.3 Gridded Population Sampling from 100 m × 100 m Grid Cells   611 (1)
20.4.3.1 GridSample R Package   611 (1)
20.4.3.2 GridSample2.0 and www.GridSample.org   611 (2)
20.4.4 Implementation of Gridded Population Surveys   613 (1)
613 (3)
20.6 Conclusions and Next Steps   616 (9)
617 (1)
617 (8)

Section 6 The Fourth Paradigm, Regulations, Ethics, Privacy   625 (108)

21 Reproducibility in the Era of Big Data: Lessons for Developing Robust Data Management and Data Analysis Procedures   627 (30)
627 (1)
627 (2)
21.3 Challenges Researchers Face in the Era of Big Data and Reproducibility   629 (1)
630 (2)
21.5 Reliability and Validity of Administrative Data   632 (1)
632 (14)
632 (1)
633 (1)
21.6.3 The Administrative Data   634 (1)
21.6.4 The Six Research Fallacies   635 (1)
21.6.4.1 More Data Are Better!   635 (2)
21.6.4.2 Merging Is About Matching by IDs/Getting the Columns to Align   637 (2)
21.6.4.3 Saving Your Syntax Is Enough to Ensure Reproducibility   639 (2)
21.6.4.4 Transparency in Your Process Ensures Transparency in Your Final Product   641 (2)
21.6.4.5 Administrative Data Are Higher Quality Than Self-Reported Data   643 (1)
21.6.4.6 If Relevant Administrative Data Exist, They Will Help Answer Your Research Question   644 (2)
646 (11)
649 (5)
654 (3)

22 Combining Active and Passive Mobile Data Collection: A Survey of Concerns   657 (26)
657 (2)
659 (2)
22.2.1 Concern with Smartphone Data Collection   659 (2)
22.2.2 Differential Concern across Subgroups of Users   661 (1)
661 (5)
662 (1)
662 (1)
662 (1)
663 (1)
663 (1)
664 (2)
666 (4)
670 (3)
673 (1)
22.A.1 Frequency of Smartphone Use   673 (1)
673 (1)
22.A.3 Smartphone Activities   674 (1)
22.A.4 General Privacy Concern   674 (1)
675 (8)
679 (1)
679 (4)

23 Attitudes Toward Data Linkage: Privacy, Ethics, and the Potential for Harm   683 (30)
23.1 Introduction: Big Data and the Federal Statistical System in the United States   683 (1)
684 (6)
23.2.1 Focus Groups 2015 and 2016   685 (4)
23.2.2 Cognitive Interviews   689 (1)
690 (18)
23.3.1 What Do Respondents Say They Expect and Believe About the Federal Government's Stewardship of Data?   690 (1)
690 (5)
695 (2)
23.3.1.3 Trust in Statistics   697 (1)
23.3.2 How Do Expectations and Beliefs About the Federal Government's Stewardship of Data Change or Remain When Asked About Data Linkage or Sharing?   698 (3)
23.3.3 Under What Circumstances Do Respondents Support Sharing or Linking Data?   701 (5)
23.3.4 What Fears and Preoccupations Worry Respondents When Asked About Data Sharing in the Federal Government?   706 (1)
706 (1)
707 (1)
23.4 Discussion: Toward an Ethical Framework   708 (5)
709 (1)
23.4.2 Transparency in Need for Data and Potential Uses of Data   709 (1)
23.4.3 Connecting Data Collections to Benefits   709 (1)
710 (3)

24 Moving Social Science into the Fourth Paradigm: The Data Life Cycle   713 (20)
24.1 Consequences and Reality of the Availability of Big Data and Massive Compute Power for Survey Research and Social Science   717 (1)
717 (1)
718 (1)
718 (1)
718 (1)
24.2 Technical Challenges for Data-Intensive Social Science Research   718 (5)
719 (1)
24.2.2 Uncertainty Characterization and Quantification Or True, Useful, and New Information: Where Is It?   720 (2)
722 (1)
24.3 The Solution: Social Science Researchers Become "Data-Aware"   723 (2)
725 (2)
24.4.1 Acquire/Create/Collect   725 (1)
725 (1)
726 (1)
726 (1)
727 (1)
24.5 Bridge the Gap Between Silos   727 (2)
729 (4)
729 (4)

Index   733