
E-book: Big Data Meets Survey Science: A Collection of Innovative Methods

Edited by Craig A. Hill (RTI International), Paul P. Biemer (RTI International), Trent D. Buskirk (Bowling Green State University), Lilli Japec (Statistics Sweden), Antje Kirchner (RTI International), Stas Kolenikov (Abt Associates), and Lars E. Lyberg (Statistics Sweden)
  • Format: EPUB+DRM
  • Price: €123.44*
  • * the price is final, i.e. no further discounts apply
  • This e-book is intended for personal use only. E-books cannot be returned.

DRM restrictions

  • Copying (copy/paste):

    not allowed

  • Printing:

    not allowed

  • Usage:

    Digital Rights Management (DRM)
    The publisher has issued this e-book in encrypted form, which means that you must install special software to read it. You must also create an Adobe ID. The e-book can be read by 1 user and downloaded to up to 6 devices (all authorized with the same Adobe ID).

    Required software
    To read on a mobile device (phone or tablet), install this free app: PocketBook Reader (iOS / Android)

    To read on a PC or Mac, install Adobe Digital Editions (this is a free application designed specifically for reading e-books; it should not be confused with Adobe Reader, which is probably already installed on your computer).

    This e-book cannot be read on an Amazon Kindle.

"Written and painstakingly edited by leading experts in their respective fields, this volume offers a state-of-the-art overview of Big Data issues, concerns, and responses in survey methodology. Like several other books in the Wiley Series in Survey Methodology, this work has been prepared in conjunction with an international conference on the topic by the Survey Research Methods Section of the American Statistical Association. The conference and book constitute part of an ongoing effort by a group of international researchers to promote quality in Big Data and to raise the level of methodological expertise in various applied fields. The basic content, in light of emerging techniques and technologies, includes in-depth coverage of topics such as combining Big Data with traditional data sources; multiplicity; data sparseness; data streams; using Big Data for reducing, controlling, and evaluating total survey error; handling confidentiality and privacy; and ethical concerns and the concept of harm; among a host of others. The editors and contributors are eminent, varied, and reflective of the international marketplace. Copious tables, figures, and references, as well as an extensive glossary, supplement the high-quality discussion throughout the text"--

Offers a clear view of the utility and place for survey data within the broader Big Data ecosystem 

This book presents a collection of snapshots from two sides of the Big Data perspective. It assembles an array of tangible tools, methods, and approaches that illustrate how Big Data sources and methods are being used in the survey and social sciences to improve official statistics and estimates for human populations. It also provides examples of how survey data are being used to evaluate and improve the quality of insights derived from Big Data.

Big Data Meets Survey Science: A Collection of Innovative Methods shows how survey data and Big Data can be used together to the benefit of both, with numerous chapters providing consistent illustrations and examples of survey data enriching the evaluation of Big Data sources. Examples of how machine learning, data mining, and other data science techniques are inserted into virtually every stage of the survey lifecycle are presented. Topics covered include: Total Error Frameworks for Found Data; Performance and Sensitivities of Home Detection on Mobile Phone Data; Assessing Community Wellbeing Using Google Street View and Satellite Imagery; Using Surveys to Build and Assess Registration-Based Sample (RBS) Religious Flags; and more.

  • Presents groundbreaking survey methods being utilized today in the field of Big Data 
  • Explores how machine learning methods can be applied to the design, collection, and analysis of social science data 
  • Filled with examples and illustrations that show how survey data benefits Big Data evaluation 
  • Covers methods and applications used in combining Big Data with survey statistics 
  • Examines regulations as well as ethical and privacy issues  

Big Data Meets Survey Science: A Collection of Innovative Methods is an excellent book for both the survey and social science communities as they learn to capitalize on this new revolution. It will also appeal to the broader data and computer science communities looking for new areas of application for emerging methods and data sources. 

List of Contributors
xxiii
Introduction 1(6)
Craig A. Hill
Paul P. Biemer
Trent D. Buskirk
Lilli Japec
Antje Kirchner
Stas Kolenikov
Lars E. Lyberg
Acknowledgments 7(1)
References 7(2)
Section 1 The New Survey Landscape
9(122)
1 Why Machines Matter for Survey and Social Science Researchers: Exploring Applications of Machine Learning Methods for Design, Data Collection, and Analysis
11(52)
Trent D. Buskirk
Antje Kirchner
1.1 Introduction
11(2)
1.2 Overview of Machine Learning Methods and Their Evaluation
13(3)
1.3 Creating Sample Designs and Constructing Sampling Frames Using Machine Learning Methods
16(7)
1.3.1 Sample Design Creation
16(2)
1.3.2 Sample Frame Construction
18(2)
1.3.3 Considerations and Implications for Applying Machine Learning Methods for Creating Sampling Frames and Designs
20(1)
1.3.3.1 Considerations About Algorithmic Optimization
20(1)
1.3.3.2 Implications About Machine Learning Model Error
21(1)
1.3.3.3 Data Type Considerations and Implications About Data Errors
22(1)
1.4 Questionnaire Design and Evaluation Using Machine Learning Methods
23(5)
1.4.1 Question Wording
24(2)
1.4.2 Evaluation and Testing
26(1)
1.4.3 Instrumentation and Interviewer Training
27(1)
1.4.4 Alternative Data Sources
28(1)
1.5 Survey Recruitment and Data Collection Using Machine Learning Methods
28(5)
1.5.1 Monitoring and Interviewer Falsification
29(1)
1.5.2 Responsive and Adaptive Designs
29(4)
1.6 Survey Data Coding and Processing Using Machine Learning Methods
33(4)
1.6.1 Coding Unstructured Text
33(2)
1.6.2 Data Validation and Editing
35(1)
1.6.3 Imputation
35(1)
1.6.4 Record Linkage and Duplicate Detection
36(1)
1.7 Sample Weighting and Survey Adjustments Using Machine Learning Methods
37(6)
1.7.1 Propensity Score Estimation
37(4)
1.7.2 Sample Matching
41(2)
1.8 Survey Data Analysis and Estimation Using Machine Learning Methods
43(4)
1.8.1 Gaining Insights Among Survey Variables
44(1)
1.8.2 Adapting Machine Learning Methods to the Survey Setting
45(1)
1.8.3 Leveraging Machine Learning Algorithms for Finite Population Inference
46(1)
1.9 Discussion and Conclusions
47(16)
References
48(12)
Further Reading
60(3)
2 The Future Is Now: How Surveys Can Harness Social Media to Address Twenty-first Century Challenges
63(36)
Amelia Burke-Garcia
Brad Edwards
Ting Yan
2.1 Introduction
63(4)
2.2 New Ways of Thinking About Survey Research
67(1)
2.3 The Challenge with Sampling People
67(5)
2.3.1 The Social Media Opportunities
68(1)
2.3.1.1 Venue-Based, Time-Space Sampling
68(2)
2.3.1.2 Respondent-Driven Sampling
70(1)
2.3.2 Outstanding Challenges
71(1)
2.4 The Challenge with Identifying People
72(2)
2.4.1 The Social Media Opportunity
73(1)
2.4.2 Outstanding Challenges
73(1)
2.5 The Challenge with Reaching People
74(3)
2.5.1 The Social Media Opportunities
75(1)
2.5.1.1 Tracing
75(1)
2.5.1.2 Paid Social Media Advertising
76(1)
2.5.2 Outstanding Challenges
77(1)
2.6 The Challenge with Persuading People to Participate
77(4)
2.6.1 The Social Media Opportunities
78(1)
2.6.1.1 Paid Social Media Advertising
78(1)
2.6.1.2 Online Influencers
79(1)
2.6.2 Outstanding Challenges
80(1)
2.7 The Challenge with Interviewing People
81(6)
2.7.1 Social Media Opportunities
82(1)
2.7.1.1 Passive Social Media Data Mining
82(1)
2.7.1.2 Active Data Collection
83(1)
2.7.2 Outstanding Challenges
84(3)
2.8 Conclusion
87(12)
References
89(10)
3 Linking Survey Data with Commercial or Administrative Data for Data Quality Assessment
99(32)
A. Rupa Datta
Gabriel Ugarte
Dean Resnick
3.1 Introduction
99(2)
3.2 Thinking About Quality Features of Analytic Data Sources
101(3)
3.2.1 What Is the Purpose of the Data Linkage?
101(1)
3.2.2 What Kind of Data Linkage for What Analytic Purpose?
102(2)
3.3 Data Used in This Chapter
104(12)
3.3.1 NSECE Household Survey
104(1)
3.3.2 Proprietary Research Files from Zillow
105(2)
3.3.3 Linking the NSECE Household Survey with Zillow Proprietary Datafiles
107(1)
3.3.3.1 Nonuniqueness of Matches
107(3)
3.3.3.2 Misalignment of Units of Observation
110(1)
3.3.3.3 Ability to Identify Matches
110(2)
3.3.3.4 Identifying Matches
112(2)
3.3.3.5 Implications of the Linking Process for Intended Analyses
114(2)
3.4 Assessment of Data Quality Using the Linked File
116(9)
3.4.1 What Variables in the Zillow Datafile Are Most Appropriate for Use in Substantive Analyses Linked to Survey Data?
116(3)
3.4.2 How Did Different Steps in the Survey Administration Process Contribute to Representativeness of the NSECE Survey Data?
119(4)
3.4.3 How Well Does the Linked Datafile Represent the Overall NSECE Dataset (Including Unlinked Records)?
123(2)
3.5 Conclusion
125(6)
References
127(2)
Further Reading
129(2)
Section 2 Total Error and Data Quality
131(142)
4 Total Error Frameworks for Found Data
133(30)
Paul P. Biemer
Ashley Amaya
4.1 Introduction
133(1)
4.2 Data Integration and Estimation
134(4)
4.2.1 Source Datasets
135(2)
4.2.2 The Integration Process
137(1)
4.2.3 Unified Dataset
137(1)
4.3 Errors in Datasets
138(3)
4.4 Errors in Hybrid Estimates
141(15)
4.4.1 Error-Generating Processes
141(4)
4.4.2 Components of Bias, Variance, and Mean Squared Error
145(3)
4.4.3 Illustrations
148(5)
4.4.4 Error Mitigation
153(1)
4.4.4.1 Sample Recruitment Error
153(3)
4.4.4.2 Data Encoding Error
156(1)
4.5 Other Error Frameworks
156(2)
4.6 Summary and Conclusions
158(5)
References
160(3)
5 Measuring the Strength of Attitudes in Social Media Data
163(30)
Ashley Amaya
Ruben Bach
Frauke Kreuter
Florian Keusch
5.1 Introduction
163(2)
5.2 Methods
165(9)
5.2.1 Data
165(1)
5.2.1.1 European Social Survey Data
166(1)
5.2.1.2 Reddit 2016 Data
167(2)
5.2.1.3 Reddit Survey
169(1)
5.2.1.4 Reddit 2018 Data
169(1)
5.2.2 Analysis
170(1)
5.2.2.1 Missingness
171(2)
5.2.2.2 Measurement
173(1)
5.2.2.3 Coding
173(1)
5.3 Results
174(6)
5.3.1 Overall Comparisons
174(1)
5.3.2 Missingness
175(2)
5.3.3 Measurement
177(1)
5.3.4 Coding
178(2)
5.4 Summary
180(4)
5.A 2016 German ESS Questions Used in Analysis
184(2)
5.B Search Terms Used to Identify Topics in Reddit Posts (2016 and 2018)
186(1)
5.B.1 Political Ideology
186(1)
5.B.2 Interest in Politics
186(1)
5.B.3 Gay Rights
186(1)
5.B.4 EU
187(1)
5.B.5 Immigration
187(1)
5.B.6 Climate
187(1)
5.C Example of Coding Steps Used to Identify Topics and Assign Sentiment in Reddit Submissions (2016 and 2018)
188(5)
References
189(4)
6 Attention to Campaign Events: Do Twitter and Self-Report Metrics Tell the Same Story?
193(24)
Josh Pasek
Lisa O. Singh
Yifang Wei
Stuart N. Soroka
Jonathan M. Ladd
Michael W. Traugott
Ceren Budak
Leticia Bode
Frank Newport
6.1 What Can Social Media Tell Us About Social Phenomena?
193(2)
6.2 The Empirical Evidence to Date
195(1)
6.3 Tweets as Public Attention
196(1)
6.4 Data Sources
197(1)
6.5 Event Detection
198(6)
6.6 Did Events Peak at the Same Time Across Data Streams?
204(1)
6.7 Were Event Words Equally Prominent Across Data Streams?
205(1)
6.8 Were Event Terms Similarly Associated with Particular Candidates?
206(1)
6.9 Were Event Trends Similar Across Data Streams?
207(4)
6.10 Unpacking Differences Between Samples
211(1)
6.11 Conclusion
212(1)
References
213(4)
7 Improving Quality of Administrative Data: A Case Study with FBI's National Incident-Based Reporting System Data
217(28)
Dan Liao
Marcus E. Berzofsky
G. Lance Couzens
Ian Thomas
Alexia Cooper
7.1 Introduction
217(3)
7.2 The NIBRS Database
220(2)
7.2.1 Administrative Crime Statistics and the History of NIBRS Data
220(1)
7.2.2 Construction of the NIBRS Dataset
221(1)
7.3 Data Quality Improvement Based on the Total Error Framework
222(12)
7.3.1 Data Quality Assessment Using the Row-Column-Cell Framework
224(1)
7.3.1.1 Phase I: Evaluating Each Data Table
224(1)
7.3.1.2 Row Errors
225(1)
7.3.1.3 Column Errors
226(1)
7.3.1.4 Cell Errors
226(1)
7.3.1.5 Row-Column-Cell Errors Impacting NIBRS
227(1)
7.3.1.6 Phase II: Evaluating the Integrated Data
227(1)
7.3.1.7 Errors in Data Integration Process
227(1)
7.3.1.8 Coverage Errors Due to Nonreporting Agencies
228(1)
7.3.1.9 Nonresponse Errors in the Incident Data Table Due to Unreported Incident Reports
229(1)
7.3.1.10 Invalid, Unknown, and Missing Values Within the Incident Reports
230(1)
7.3.2 Improving Data Quality via Sampling, Weighting, and Imputation
231(1)
7.3.2.1 Sample-Based Method to Improve Data Representativeness at the Agency Level
231(1)
7.3.2.2 Statistical Weighting to Adjust for Coverage Errors at the Agency Level
232(1)
7.3.2.3 Imputation to Compensate for Unreported Incidents and Missing Values in the Incident Reports
233(1)
7.4 Utilizing External Data Sources in Improving Data Quality of the Administrative Data
234(4)
7.4.1 Understanding the External Data Sources
234(1)
7.4.1.1 Data Quality Assessment of External Data Sources
234(1)
7.4.1.2 Producing Population Counts at the Agency Level Through Auxiliary Data
235(1)
7.4.2 Administrative vs. Survey Data for Crime Statistics
236(2)
7.4.3 A Pilot Study on Crime in the Bakken Region
238(1)
7.5 Summary and Future Work
239(6)
References
241(4)
8 Performance and Sensitivities of Home Detection on Mobile Phone Data
245(28)
Maarten Vanhoof
Clement Lee
Zbigniew Smoreda
8.1 Introduction
245(4)
8.1.1 Mobile Phone Data and Official Statistics
245(2)
8.1.2 The Home Detection Problem
247(2)
8.2 Deploying Home Detection Algorithms to a French CDR Dataset
249(6)
8.2.1 Mobile Phone Data
249(2)
8.2.2 The French Mobile Phone Dataset
251(1)
8.2.3 Defining Nine Home Detection Algorithms
252(1)
8.2.4 Different Observation Periods
253(2)
8.2.5 Summary of Data and Setup
255(1)
8.3 Assessing Home Detection Performance at Nationwide Scale
255(3)
8.3.1 Ground Truth Data
256(1)
8.3.2 Assessing Performance and Sensitivities
256(1)
8.3.2.1 Correlation with Ground Truth Data
256(2)
8.3.2.2 Ratio and Spatial Patterns
258(1)
8.3.2.3 Temporality and Sensitivity
258(1)
8.4 Results
258(9)
8.4.1 Relations between HDAs' User Counts and Ground Truth
258(2)
8.4.2 Spatial Patterns of Ratios Between User Counts and Population Counts
260(1)
8.4.3 Temporality of Correlations
260(6)
8.4.4 Sensitivity to the Duration of Observation
266(1)
8.4.5 Sensitivity to Criteria Choice
266(1)
8.5 Discussion and Conclusion
267(6)
References
270(3)
Section 3 Big Data in Official Statistics
273(114)
9 Big Data Initiatives in Official Statistics
275(28)
Lilli Japec
Lars Lyberg
9.1 Introduction
275(1)
9.2 Some Characteristics of the Changing Survey Landscape
276(4)
9.3 Current Strategies to Handle the Changing Survey Landscape
280(5)
9.3.1 Training Staff
281(1)
9.3.2 Forming Partnerships
281(1)
9.3.3 Cooperation Between European NSIs
282(1)
9.3.4 Creating Big Data Centers
282(1)
9.3.5 Experimental Statistics
283(1)
9.3.6 Organizing Hackathons
283(1)
9.3.7 IT Infrastructure, Tools, and Methods
284(1)
9.4 The Potential of Big Data and the Use of New Methods in Official Statistics
285(5)
9.4.1 Wider and Deeper
285(1)
9.4.1.1 Green Areas in the Swedish City of Lidingö
285(1)
9.4.1.2 Innovative Companies
285(1)
9.4.1.3 Coding Commodity Flow Survey
286(1)
9.4.2 Better Statistics
287(1)
9.4.2.1 AIS
287(1)
9.4.2.2 Expenditure Surveys
288(1)
9.4.2.3 Examples of Improving Statistics by Adjusting for Bias
288(1)
9.4.3 Quicker Statistics
289(1)
9.4.3.1 Early Estimates
289(1)
9.4.4 Cheaper Statistics
289(1)
9.4.4.1 Consumer Price Index (CPI)
289(1)
9.4.4.2 Smart Meter Data
289(1)
9.4.4.3 ISCO and NACE Coding at Statistics Finland
290(1)
9.5 Big Data Quality
290(3)
9.6 Legal Issues
293(2)
9.6.1 Allowing Access to Data
293(1)
9.6.2 Providing Access to Data
294(1)
9.7 Future Developments
295(8)
References
296(7)
10 Big Data in Official Statistics: A Perspective from Statistics Netherlands
303(36)
Barteld Braaksma
Kees Zeelenberg
Sofie De Broe
10.1 Introduction
303(1)
10.2 Big Data and Official Statistics
304(1)
10.3 Examples of Big Data in Official Statistics
305(4)
10.3.1 Scanner Data
305(1)
10.3.2 Traffic-Loop Data
306(1)
10.3.3 Social Media Messages
307(1)
10.3.4 Mobile Phone Data
308(1)
10.4 Principles for Assessing the Quality of Big Data Statistics
309(7)
10.4.1 Accuracy
310(1)
10.4.2 Models in Official Statistics
311(1)
10.4.3 Objectivity and Reliability
312(2)
10.4.4 Relevance
314(1)
10.4.5 Some Examples of Quality Assessments of Big Data Statistics
315(1)
10.5 Integration of Big Data with Other Statistical Sources
316(9)
10.5.1 Big Data as Auxiliary Data
316(1)
10.5.2 Size of the Internet Economy
317(2)
10.5.3 Improving the Consumer Confidence Index
319(2)
10.5.4 Big Data and the Quality of Gross National Product Estimates
321(1)
10.5.5 Google Trends for Nowcasting
322(1)
10.5.6 Multisource Statistics: Combination of Survey and Sensor Data
323(1)
10.5.7 Combining Administrative and Open Data Sources to Complete Energy Statistics
324(1)
10.6 Disclosure Control with Big Data
325(2)
10.6.1 Volume
326(1)
10.6.2 Velocity
326(1)
10.6.3 Variety
326(1)
10.7 The Way Ahead: A Chance for Paradigm Fusion
327(3)
10.7.1 Measurement and Selection Bias
328(1)
10.7.2 Timeliness
329(1)
10.7.3 Quality
329(1)
10.7.4 Phenomenon-Oriented Statistics
330(1)
10.8 Conclusion
330(9)
References
331(6)
Further Reading
337(2)
11 Mining the New Oil for Official Statistics
339(20)
Siu-Ming Tam
Jae-Kwang Kim
Lyndon Ang
Han Pham
11.1 Introduction
339(2)
11.2 Statistical Inference for Binary Variables from Nonprobability Samples
341(2)
11.3 Integrating Data Source B Subject to Undercoverage Bias
343(1)
11.4 Integrating Data Sources Subject to Measurement Errors
344(1)
11.5 Integrating Probability Sample A Subject to Unit Nonresponse
345(2)
11.6 Empirical Studies
347(3)
11.7 Examples of Official Statistics Applications
350(3)
11.8 Limitations
353(1)
11.9 Conclusion
354(5)
References
354(3)
Further Reading
357(2)
12 Investigating Alternative Data Sources to Reduce Respondent Burden in United States Census Bureau Retail Economic Data Products
359(28)
Rebecca J. Hutchinson
12.1 Introduction
359(3)
12.1.1 Overview of the Economic Directorate
360(1)
12.1.2 Big Data Vision
361(1)
12.1.3 Overview of the Census Bureau Retail Programs
361(1)
12.2 Respondent Burden
362(4)
12.3 Point-of-Sale Data
366(3)
12.3.1 Background on Point-of-Sale Data
366(2)
12.3.2 Background on NPD
368(1)
12.4 Project Description
369(12)
12.4.1 Selection of Retailers
370(1)
12.4.2 National-Level Data
371(4)
12.4.3 Store-Level Data
375(2)
12.4.4 Product Data
377(4)
12.5 Summary
381(6)
Disclaimer
384(1)
Disclosure
384(1)
References
384(3)
Section 4 Combining Big Data with Survey Statistics: Methods and Applications
387(148)
13 Effects of Incentives in Smartphone Data Collection
389(26)
Georg-Christoph Haas
Frauke Kreuter
Florian Keusch
Mark Trappmann
Sebastian Bähr
13.1 Introduction
389(1)
13.2 The Influence of Incentives on Participation
390(2)
13.3 Institut für Arbeitsmarkt- und Berufsforschung (IAB)-SMART Study Design
392(6)
13.3.1 Sampling Frame and Sample Restrictions
393(1)
13.3.2 Invitation and Data Request
394(3)
13.3.3 Experimental Design for Incentive Study
397(1)
13.3.4 Analysis Plan
397(1)
13.4 Results
398(7)
13.4.1 App Installation
398(2)
13.4.2 Number of Initially Activated Data-Sharing Functions
400(1)
13.4.3 Deactivating Functions
401(1)
13.4.4 Retention
402(1)
13.4.5 Analysis of Costs
403(2)
13.5 Summary
405(10)
13.5.1 Limitations and Future Research
407(5)
References
412(3)
14 Using Machine Learning Models to Predict Attrition in a Survey Panel
415(20)
Mingnan Liu
14.1 Introduction
415(3)
14.1.1 Data
417(1)
14.2 Methods
418(5)
14.2.1 Random Forests
418(1)
14.2.2 Support Vector Machines
419(1)
14.2.3 LASSO
420(1)
14.2.4 Evaluation Criteria
420(2)
14.2.4.1 Tuning Parameters
422(1)
14.3 Results
423(2)
14.3.1 Which Are the Important Predictors?
425(1)
14.4 Discussion
425(3)
14.A Questions Used in the Analysis
428(7)
References
431(4)
15 Assessing Community Wellbeing Using Google Street-View and Satellite Imagery
435(52)
Pablo Diego-Rosell
Stafford Nichols
Rajesh Srinivasan
Ben Dilday
15.1 Introduction
435(2)
15.2 Methods
437(14)
15.2.1 Sampling Units and Frames
437(1)
15.2.2 Data Sources
438(1)
15.2.2.1 Study Outcomes from Survey Data
438(2)
15.2.2.2 Study Predictors from Built Environment Data
440(7)
15.2.2.3 Study Predictors from Geospatial Imagery
447(3)
15.2.2.4 Model Development, Testing, and Evaluation
450(1)
15.3 Application Results
451(6)
15.3.1 Baltimore
451(4)
15.3.2 San Francisco
455(1)
15.3.3 Generalizability
456(1)
15.4 Conclusions
457(2)
15.A Amazon Mechanical Turk Questionnaire
459(2)
15.B Pictures and Maps
461(2)
15.C Descriptive Statistics
463(6)
15.D Stepwise AIC OLS Regression Models
469(3)
15.E Generalized Linear Models via Penalized Maximum Likelihood with k-Fold Cross-Validation
472(5)
15.F Heat Maps - Actual vs. Model-Based Outcomes
477(10)
References
485(2)
16 Nonparametric Bootstrap and Small Area Estimation to Mitigate Bias in Crowdsourced Data: Simulation Study and Application to Perceived Safety
487(32)
David Buil-Gil
Reka Solymosi
Angelo Moretti
16.1 Introduction
487(2)
16.2 The Rise of Crowdsourcing and Implications
489(1)
16.3 Crowdsourcing Data to Analyze Social Phenomena: Limitations
490(2)
16.3.1 Self-Selection Bias
490(1)
16.3.2 Unequal Participation
491(1)
16.3.3 Underrepresentation of Certain Areas and Times
492(1)
16.3.4 Unreliable Area-Level Direct Estimates and Difficulty to Interpret Results
492(1)
16.4 Previous Approaches for Reweighting Crowdsourced Data
492(1)
16.5 A New Approach: Small Area Estimation Under a Nonparametric Bootstrap Estimator
493(3)
16.5.1 Step 1: Nonparametric Bootstrap
494(2)
16.5.2 Step 2: Area-Level Model-Based Small Area Estimation
496(1)
16.6 Simulation Study
496(7)
16.6.1 Population Generation
497(1)
16.6.2 Sample Selection and Simulation Steps
497(2)
16.6.3 Results
499(4)
16.7 Case Study: Safety Perceptions in London
503(8)
16.7.1 The Spatial Study of Safety Perceptions
503(1)
16.7.2 Data and Methods
504(1)
16.7.2.1 Place Pulse 2.0 Dataset
504(2)
16.7.2.2 Area-Level Covariates
506(1)
16.7.3 Results
506(1)
16.7.3.1 Model Diagnostics and External Validation
506(4)
16.7.3.2 Mapping Safety Perceptions at Neighborhood Level
510(1)
16.8 Discussion and Conclusions
511(8)
References
513(6)
17 Using Big Data to Improve Sample Efficiency
519(16)
Jamie Ridenhour
Joe McMichael
Kami Krotki
Howard Speizer
17.1 Introduction and Background
519(4)
17.2 Methods to More Efficiently Sample Unregistered Boat-Owning Households
523(7)
17.2.1 Model 1: Spatial Boat Density Model
525(1)
17.2.2 Model 2: Address-Level Boat-Ownership Propensity
526(4)
17.3 Results
530(3)
17.4 Conclusions
533(2)
Acknowledgments
534(1)
References
534(1)
Section 5 Combining Big Data with Survey Statistics: Tools
535(90)
18 Feedback Loop: Using Surveys to Build and Assess Registration-Based Sample Religious Flags for Survey Research
537(24)
David Dutwin
18.1 Introduction
537(1)
18.2 The Turn to Trees
538(1)
18.3 Research Agenda
539(1)
18.4 Data
540(1)
18.5 Combining the Data
541(2)
18.6 Building Models
543(2)
18.7 Variables
545(1)
18.8 Results
545(7)
18.9 Considering Systematic Matching Rates
552(2)
18.10 Discussion and Conclusions
554(7)
References
557(4)
19 Artificial Intelligence and Machine Learning Derived Efficiencies for Large-Scale Survey Estimation Efforts
561(36)
Steven B. Cohen
Jamie Shorey
19.1 Introduction
561(1)
19.2 Background
562(1)
19.2.1 Project Goal
563(1)
19.3 Accelerating the MEPS Imputation Processes: Development of Fast-Track MEPS Analytic Files
563(9)
19.3.1 MEPS Data Files and Variables
566(1)
19.3.2 Identification of Predictors of Medical Care Sources of Payment
567(4)
19.3.2.1 Class Variables Used in the Imputation
571(1)
19.3.3 Weighted Sequential Hot Deck Imputation
572(1)
19.4 Building the Prototype
572(3)
19.4.1 Learning from the Data: Results for the 2012 MEPS
573(2)
19.5 An Artificial Intelligence Approach to Fast-Track MEPS Imputation
575(13)
19.5.1 Why Artificial Intelligence for Health-Care Cost Prediction
577(1)
19.5.1.1 Imputation Strategies
578(2)
19.5.1.2 Testing of Imputation Strategies
580(1)
19.5.1.3 Approach
580(1)
19.5.1.4 Raw Data Extraction
581(1)
19.5.1.5 Attribute Selection
582(2)
19.5.1.6 Inter-Variable Correlation
584(1)
19.5.1.7 Multi-Output Random Forest
584(1)
19.5.2 Evaluation
585(3)
19.6 Summary
588(9)
Acknowledgments
592(1)
References
593(4)
20 Worldwide Population Estimates for Small Geographic Areas: Can We Do a Better Job?
597(28)
Safaa Amer
Dana Thomson
Rob Chew
Amy Rose
20.1 Introduction
597(1)
20.2 Background
598(2)
20.3 Gridded Population Estimates
600(8)
20.3.1 Data Sources
600(1)
20.3.2 Basic Gridded Population Models
601(1)
20.3.3 LandScan Global
601(1)
20.3.4 WorldPop
602(1)
20.3.5 LandScan HD
603(1)
20.3.6 GRID3
604(1)
20.3.7 Challenges, Pros, and Cons of Gridded Population Estimates
605(3)
20.4 Population Estimates in Surveys
608(5)
20.4.1 Standard Sampling Strategies
608(1)
20.4.2 Gridded Population Sampling from 1 km × 1 km Grid Cells
609(1)
20.4.2.1 Geosampling
609(2)
20.4.3 Gridded Population Sampling from 100 m × 100 m Grid Cells
611(1)
20.4.3.1 GridSample R Package
611(1)
20.4.3.2 GridSample2.0 and www.GridSample.org
611(2)
20.4.4 Implementation of Gridded Population Surveys
613(1)
20.5 Case Study
613(3)
20.6 Conclusions and Next Steps
616(9)
Acknowledgments
617(1)
References
617(8)
Section 6 The Fourth Paradigm, Regulations, Ethics, Privacy
625(108)
21 Reproducibility in the Era of Big Data: Lessons for Developing Robust Data Management and Data Analysis Procedures
627(30)
D. Betsy McCoach
Jennifer N. Dineen
Sandra M. Chafouleas
Amy Briesch
21.1 Introduction
627(1)
21.2 Big Data
627(2)
21.3 Challenges Researchers Face in the Era of Big Data and Reproducibility
629(1)
21.4 Reproducibility
630(2)
21.5 Reliability and Validity of Administrative Data
632(1)
21.6 Data and Methods
632(14)
21.6.1 The Case
632(1)
21.6.2 The Survey Data
633(1)
21.6.3 The Administrative Data
634(1)
21.6.4 The Six Research Fallacies
635(1)
21.6.4.1 More Data Are Better!
635(2)
21.6.4.2 Merging Is About Matching by IDs/Getting the Columns to Align
637(2)
21.6.4.3 Saving Your Syntax Is Enough to Ensure Reproducibility
639(2)
21.6.4.4 Transparency in Your Process Ensures Transparency in Your Final Product
641(2)
21.6.4.5 Administrative Data Are Higher Quality Than Self-Reported Data
643(1)
21.6.4.6 If Relevant Administrative Data Exist, They Will Help Answer Your Research Question
644(2)
21.7 Discussion
646(11)
References
649(5)
Further Reading
654(3)
22 Combining Active and Passive Mobile Data Collection: A Survey of Concerns
657(26)
Florian Keusch
Bella Struminskaya
Frauke Kreuter
Martin Weichbold
22.1 Introduction
657(2)
22.2 Previous Research
659(2)
22.2.1 Concern with Smartphone Data Collection
659(2)
22.2.2 Differential Concern across Subgroups of Users
661(1)
22.3 Methods and Data
661(5)
22.3.1 Sample 1
662(1)
22.3.2 Sample 2
662(1)
22.3.3 Sample 3
662(1)
22.3.4 Sample 4
663(1)
22.3.5 Measures
663(1)
22.3.6 Analysis Plan
664(2)
22.4 Results
666(4)
22.5 Conclusion
670(3)
22.A Appendix
673(1)
22.A.1 Frequency of Smartphone Use
673(1)
22.A.2 Smartphone Skills
673(1)
22.A.3 Smartphone Activities
674(1)
22.A.4 General Privacy Concern
674(1)
22.B Appendix
675(8)
Funding
679(1)
References
679(4)
23 Attitudes Toward Data Linkage: Privacy, Ethics, and the Potential for Harm
683(30)
Aleia C. Fobia
Jennifer H. Childs
Casey Eggleston
23.1 Introduction: Big Data and the Federal Statistical System in the United States
683(1)
23.2 Data and Methods
684(6)
23.2.1 Focus Groups 2015 and 2016
685(4)
23.2.2 Cognitive Interviews
689(1)
23.3 Results
690(18)
23.3.1 What Do Respondents Say They Expect and Believe About the Federal Government's Stewardship of Data?
690(1)
23.3.1.1 Confidentiality
690(5)
23.3.1.2 Privacy
695(2)
23.3.1.3 Trust in Statistics
697(1)
23.3.2 How Do Expectations and Beliefs About the Federal Government's Stewardship of Data Change or Remain When Asked About Data Linkage or Sharing?
698(3)
23.3.3 Under What Circumstances Do Respondents Support Sharing or Linking Data?
701(5)
23.3.4 What Fears and Preoccupations Worry Respondents When Asked About Data Sharing in the Federal Government?
706(1)
23.3.4.1 Individual Harm
706(1)
23.3.4.2 Community Harm
707(1)
23.4 Discussion: Toward an Ethical Framework
708(5)
23.4.1 Data Security
709(1)
23.4.2 Transparency in Need for Data and Potential Uses of Data
709(1)
23.4.3 Connecting Data Collections to Benefits
709(1)
References
710(3)
24 Moving Social Science into the Fourth Paradigm: The Data Life Cycle
713(20)
Craig A. Hill
24.1 Consequences and Reality of the Availability of Big Data and Massive Compute Power for Survey Research and Social Science
717(1)
24.1.1 Variety
717(1)
24.1.2 Volume
718(1)
24.1.3 Velocity
718(1)
24.1.4 Validity
718(1)
24.2 Technical Challenges for Data-Intensive Social Science Research
718(5)
24.2.1 The Long Tail
719(1)
24.2.2 Uncertainty Characterization and Quantification, or True, Useful, and New Information: Where Is It?
720(2)
24.2.3 Reproducibility
722(1)
24.3 The Solution: Social Science Researchers Become "Data-Aware"
723(2)
24.4 Data Awareness
725(2)
24.4.1 Acquire/Create/Collect
725(1)
24.4.2 Munge
725(1)
24.4.3 Use/Reuse
726(1)
24.4.4 Disseminate
726(1)
24.4.5 Stewardship
727(1)
24.5 Bridge the Gap Between Silos
727(2)
24.6 Conclusion
729(4)
References
729(4)
Index 733
Craig A. Hill, PhD, is Senior Vice President at RTI International and focuses on application of new technology to quantitative social science research. He is also the lead editor of Social Media, Sociality, and Survey Research (Wiley, 2013).

Paul P. Biemer, PhD, is Distinguished Fellow, Statistics at RTI International. He is an author, co-author, and co-editor of 6 other books published by Wiley.

Trent D. Buskirk, PhD, is the Novak Family Distinguished Professor of Data Science and the Chair of the Applied Statistics and Operations Research Department in the College of Business at Bowling Green State University.

Lilli Japec, PhD, is a former Director of the Research and Development Department at Statistics Sweden. She co-chaired AAPOR's Task Force on Big Data.

Antje Kirchner, PhD, is a Survey Methodologist at RTI International. She is the Chair of the Scientific Committee of the Big Data Meets Survey Science (BigSurv20) conference.

Stanislav (Stas) Kolenikov, PhD, is Principal Scientist at Abt Associates. His work focuses on survey statistics, including issues in sampling, weighting, variance estimation, multiple imputation, and small area estimation.

Lars E. Lyberg, PhD, is former Head of the Research and Development Department at Statistics Sweden. He is the founder of the Journal of Official Statistics (JOS) and served as its Chief Editor for 25 years.