Analysis of Integrated Data [Hardback]

Edited by Li-Chun Zhang (Department of Social Statistics, University of Southampton, UK) and Raymond L. Chambers (University of Wollongong, Australia)
  • Format: Hardback, 272 pages, height x width: 234x156 mm, weight: 526 g, 55 Tables, black and white; 22 Line drawings, black and white; 22 Illustrations, black and white
  • Series: Chapman & Hall/CRC Statistics in the Social and Behavioral Sciences
  • Publication date: 08-May-2019
  • Publisher: Chapman & Hall/CRC
  • ISBN-10: 1498727980
  • ISBN-13: 9781498727983
The advent of "Big Data" has brought with it a rapid diversification of data sources, requiring analysis that accounts for the fact that these data have often been generated and recorded for different reasons. Data integration involves combining data residing in different sources to enable statistical inference, or to generate new statistical data for purposes that cannot be served by each source on its own. This can yield significant gains for scientific as well as commercial investigations.

However, valid analysis of such data should allow for the additional uncertainty due to entity ambiguity, whenever it is not possible to state with certainty that the integrated source is the target population of interest. Analysis of Integrated Data aims to provide a solid theoretical basis for this statistical analysis in three generic settings of entity ambiguity: statistical analysis of linked datasets that may contain linkage errors; datasets created by a data fusion process, where joint statistical information is simulated using the information in marginal data from non-overlapping sources; and estimation of target population size when target units are either partially or erroneously covered in each source.

  • Covers a range of topics under an overarching perspective of data integration.
  • Focuses on statistical uncertainty and inference issues arising from entity ambiguity.
  • Features state-of-the-art methods for analysis of integrated data.
  • Identifies the important themes that will define future research and teaching in the statistical analysis of integrated data.

Analysis of Integrated Data is aimed primarily at researchers and methodologists interested in statistical methods for data from multiple sources, with a focus on data analysts in the social sciences, and in the public and private sectors.
Preface
Contributors
1 Introduction
Raymond L. Chambers
1.1 Why this book?
1.2 The structure of this book
1.3 Summary
References
2 On secondary analysis of datasets that cannot be linked without errors
Li-Chun Zhang
2.1 Introduction
2.1.1 Related work
2.1.2 Outline of investigation
2.2 The linkage data structure
2.2.1 Definitions
2.2.2 Agreement partition of match space
2.3 On maximum likelihood estimation
2.4 On analysis under the comparison data model
2.4.1 Linear regression under the linkage model
2.4.2 Linear regression under the comparison data model
2.4.3 Comparison data modelling (I)
2.4.4 Comparison data modelling (II)
2.5 On link subset analysis
2.5.1 Non-informative balanced selection
2.5.2 Illustration for the C-PR data
2.6 Concluding remarks
Bibliography
3 Capture-recapture methods in the presence of linkage errors
Loredana Di Consiglio, Tiziana Tuoto, Li-Chun Zhang
3.1 Introduction
3.2 The capture-recapture model: short formalization and notation
3.3 The linkage models and the linkage errors
3.3.1 The Fellegi and Sunter linkage model
3.3.2 Definition and estimation of linkage errors
3.3.3 Bayesian approaches to record linkage
3.4 The DSE in the presence of linkage errors
3.4.1 The Ding and Fienberg estimator
3.4.2 The modified Ding and Fienberg estimator
3.4.3 Some remarks
3.4.4 Examples
3.5 Linkage-error adjustments in the case of multiple lists
3.5.1 Log-linear model-based estimators
3.5.2 An alternative modelling approach
3.5.3 A Bayesian proposal
3.5.4 Examples
3.6 Concluding remarks
Bibliography
4 An overview on uncertainty and estimation in statistical matching
Pier Luigi Conti, Daniela Marella, Mauro Scanu
4.1 Introduction
4.2 Statistical matching problem: notations and technicalities
4.3 The joint distribution of variables not jointly observed: estimation and uncertainty
4.3.1 Matching error
4.3.2 Bounding the matching error via measures of uncertainty
4.4 Statistical matching for complex sample surveys
4.4.1 Technical assumptions on the sample designs
4.4.2 A proposal for choosing a matching distribution
4.4.3 Reliability of the matching distribution
4.4.4 Evaluation of the matching reliability as a hypothesis problem
4.5 Conclusions and pending issues: relationship between the statistical matching problem and ecological inference
Bibliography
5 Auxiliary variable selection in a statistical matching problem
Marcello D'Orazio, Marco Di Zio, Mauro Scanu
5.1 Introduction
5.2 Choice of the matching variables
5.2.1 Traditional methods based on association
5.2.2 Choosing the matching variables by uncertainty reduction
5.2.3 An illustrative example
5.2.4 The penalised uncertainty measure
5.3 Simulations with European Social Survey data
5.4 Conclusions
Bibliography
6 Minimal inference from incomplete 2 × 2-tables
Li-Chun Zhang, Raymond L. Chambers
6.1 Introduction
6.2 Corroboration
6.3 Maximum corroboration set
6.4 High assurance estimation of $$0
6.5 A corroboration test
6.6 Application: missing OCBGT data
Bibliography
7 Dual- and multiple-system estimation with fully and partially observed covariates
Peter G. M. van der Heijden, Paul A. Smith, Joe Whittaker, Maarten Cruyff, Bart F. M. Bakker
7.1 Introduction
7.2 Theory concerning invariant population-size estimates
7.2.1 Terminology and properties
7.2.2 Example
7.2.3 Graphical representation of log-linear models
7.2.4 Three registers
7.3 Applications of invariant population-size estimation
7.3.1 Modelling strategies with active and passive covariates
7.3.2 Working with invariant population-size estimates
7.4 Dealing with partially observed covariates
7.4.1 Framework for population-size estimation with partially observed covariates
7.4.2 Example
7.4.3 Interaction graphs for models with incomplete covariates
7.4.4 Results of model fitting
7.5 Precision and sensitivity
7.5.1 Precision
7.5.2 Sensitivity
7.5.3 Comparison of the EM algorithm with the classical model
7.6 An application when the same variable is measured differently in both registers
7.6.1 Example: Injuries in road accidents in the Netherlands
7.6.2 More detailed breakdown of transport mode in accidents
7.7 Discussion
7.7.1 Alternative approaches
7.7.2 Quality issues
Bibliography
8 Estimating population size in multiple record systems with uncertainty of state identification
Davide Di Cecco
8.1 Introduction
8.2 A latent class model for capture-recapture
8.2.1 Decomposable models
8.2.2 Identifiability
8.2.3 EM algorithm
8.2.4 Fixing parameters
8.2.5 A mixture of different components
8.2.6 Model selection
8.3 Observed heterogeneity of capture probabilities
8.3.1 Use of covariates
8.3.2 Incomplete lists
8.4 Evaluating the interpretation of the latent classes
8.5 A Bayesian approach
8.5.1 MCMC algorithm
8.5.2 Simulations results
Bibliography
9 Log-linear models of erroneous list data
Li-Chun Zhang
9.1 Introduction
9.2 Log-linear models of incomplete contingency tables
9.3 Modelling marginally classified list errors
9.3.1 The models
9.3.2 Maximum likelihood estimation
9.3.3 Estimation based on list-survey data
9.4 Model selection with zero degree of freedom
9.4.1 Latent likelihood ratio criterion
9.4.2 Illustration
9.5 Homelessness data in the Netherlands
9.5.1 Data and previous study
9.5.2 Analysis allowing for erroneous enumeration
Bibliography
10 Sampling design and analysis using geo-referenced data
Danila Filipponi, Federica Piersimoni, Roberto Benedetti, Maria Michela Dickson, Giuseppe Espa, Diego Giuliani
10.1 Introduction
10.2 Geo-referenced data and potential locational errors
10.3 A brief review of spatially balanced sampling methods
10.3.1 Local pivotal methods
10.3.2 Spatially correlated Poisson sampling
10.3.3 Balanced sampling through the cube method
10.3.4 Local cube method
10.4 Spatial sampling for estimation of under-coverage rate
10.5 Business surveys in the presence of locational errors
10.6 Conclusions
Bibliography
Index
Li-Chun Zhang is Professor in Social Statistics at the University of Southampton, UK, Senior Researcher at Statistics Norway, Norway, and Professor in Official Statistics at the University of Oslo, Norway.

Raymond Chambers is Professor of Statistical Methodology at the University of Wollongong, Australia.