Methodological Developments in Data Linkage [Hardback]

Edited by Katie Harron (London School of Hygiene and Tropical Medicine, UK), Chris Dibben (University of Edinburgh, UK), Harvey Goldstein (University of Bristol and University College London, UK)
  • Format: Hardback, 288 pages, height x width x thickness: 252x178x20 mm, weight: 590 g
  • Series: Wiley Series in Probability and Statistics
  • Publication date: 11-Dec-2015
  • Publisher: John Wiley & Sons Inc
  • ISBN-10: 1118745876
  • ISBN-13: 9781118745878

A comprehensive compilation of new developments in data linkage methodology

The increasing availability of large administrative databases has led to a dramatic rise in the use of data linkage, yet the standard texts on linkage are still those describing the seminal work of the 1950s and 1960s, with some updates. Linkage and analysis of data across sources remain problematic because of a lack of discriminating and accurate identifiers, missing data and regulatory issues. Recent developments in data linkage methodology have concentrated on bias and analysis of linked data, novel approaches to organising relationships between databases, and privacy-preserving linkage.

Methodological Developments in Data Linkage brings together a collection of contributions from members of the international data linkage community, covering cutting-edge methodology in this field. It presents the opportunities and challenges provided by linkage of large and often complex datasets, including analysis problems, legal and security aspects, models for data access and the development of novel research areas. New methods for handling uncertainty in analysis of linked data, solutions for anonymised linkage and alternative models for data collection are also discussed.

Key Features:

  • Presents cutting-edge methods for a topic of increasing importance to a wide range of research areas, with applications to data linkage systems internationally
  • Covers the essential issues associated with data linkage today
  • Includes examples based on real data linkage systems, highlighting the opportunities, successes and challenges that the increasing availability of linkage data provides
  • Novel approach incorporates technical aspects of linkage, management and analysis of linked data

This book will be of core interest to academics, government employees, data holders, data managers, analysts and statisticians who use administrative data. It will also appeal to researchers in a variety of areas, including epidemiology, biostatistics, social statistics, informatics, policy and public health.

Foreword
Contributors
1 Introduction
Katie Harron
Harvey Goldstein
Chris Dibben
1.1 Introduction: data linkage as it exists
1.2 Background and issues
1.3 Data linkage methods
1.3.1 Deterministic linkage
1.3.2 Probabilistic linkage
1.3.3 Data preparation
1.4 Linkage error
1.5 Impact of linkage error on analysis of linked data
1.6 Data linkage: the future
2 Probabilistic linkage
William E. Winkler
2.1 Introduction
2.2 Overview of methods
2.2.1 The Fellegi-Sunter model of record linkage
2.2.2 Learning parameters
2.2.3 Additional methods for matching
2.2.4 An empirical example
2.3 Data preparation
2.3.1 Description of a matching project
2.3.2 Initial file preparation
2.3.3 Name standardisation and parsing
2.3.4 Address standardisation and parsing
2.3.5 Summarising comments on preprocessing
2.4 Advanced methods
2.4.1 Estimating false-match rates without training data
2.4.2 Adjusting analyses for linkage error
2.5 Concluding comments
3 The data linkage environment
Chris Dibben
Mark Elliot
Heather Gowans
Darren Lightfoot
3.1 Introduction
3.2 The data linkage context
3.2.1 Administrative or routine data
3.2.2 The law and the use of administrative (personal) data for research
3.2.3 The identifiability problem in data linkage
3.3 The tools used in the production of functional anonymity through a data linkage environment
3.3.1 Governance, rules and the researcher
3.3.2 Application process, ethics scrutiny and peer review
3.3.3 Shaping 'safe' behaviour: training, sanctions, contracts and licences
3.3.4 'Safe' data analysis environments
3.3.5 Fragmentation: separation of linkage process and temporary linked data
3.4 Models for data access and data linkage
3.4.1 Single centre
3.4.2 Separation of functions: firewalls within single centre
3.4.3 Separation of functions: TTP linkage
3.4.4 Secure multiparty computation
3.5 Four case study data linkage centres
3.5.1 Population Data BC
3.5.2 The Secure Anonymised Information Linkage Databank, United Kingdom
3.5.3 Centre for Data Linkage (Population Health Research Network), Australia
3.5.4 The Centre for Health Record Linkage, Australia
3.6 Conclusion
4 Bias in data linkage studies
Megan Bohensky
4.1 Background
4.2 Description of types of linkage error
4.2.1 Missed matches from missing linkage variables
4.2.2 Missed matches from inconsistent case ascertainment
4.2.3 False matches: Description of cases incorrectly matched
4.3 How linkage error impacts research findings
4.3.1 Results
4.3.2 Assessment of linkage bias
4.4 Discussion
4.4.1 Potential biases in the review process
4.4.2 Recommendations and implications for practice
5 Secondary analysis of linked data
Raymond Chambers
Gunky Kim
5.1 Introduction
5.2 Measurement error issues arising from linkage
5.2.1 Correct links, incorrect links and non-links
5.2.2 Characterising linkage errors
5.2.3 Characterising errors from non-linkage
5.3 Models for different types of linking errors
5.3.1 Linkage errors under binary linking
5.3.2 Linkage errors under multi-linking
5.3.3 Incomplete linking
5.3.4 Modelling the linkage error
5.4 Regression analysis using complete binary-linked data
5.4.1 Linear regression
5.4.2 Logistic regression
5.5 Regression analysis using incomplete binary-linked data
5.5.1 Linear regression using incomplete sample to register linked data
5.6 Regression analysis with multi-linked data
5.6.1 Uncorrelated multi-linking: Complete linkage
5.6.2 Uncorrelated multi-linking: Sample to register linkage
5.6.3 Correlated multi-linkage
5.6.4 Incorporating auxiliary population information
5.7 Conclusion and discussion
6 Record linkage: A missing data problem
Harvey Goldstein
Katie Harron
6.1 Introduction
6.2 Probabilistic Record Linkage (PRL)
6.3 Multiple Imputation (MI)
6.4 Prior-Informed Imputation (PII)
6.4.1 Estimating matching probabilities
6.5 Example 1: Linking electronic healthcare data to estimate trends in bloodstream infection
6.5.1 Methods
6.5.2 Results
6.5.3 Conclusions
6.6 Example 2: Simulated data including non-random linkage error
6.6.1 Methods
6.6.2 Results
6.7 Discussion
6.7.1 Non-random linkage error
6.7.2 Strengths and limitations: Handling linkage error
6.7.3 Implications for data linkers and data users
7 Using graph databases to manage linked data
James M. Farrow
7.1 Summary
7.2 Introduction
7.2.1 Flat approach
7.2.2 Oops, your legacy is showing
7.2.3 Shortcomings
7.3 Graph approach
7.3.1 Overview of graph concepts
7.3.2 Graph queries versus relational queries
7.3.3 Comparison of data in flat database versus graph database
7.3.4 Relaxing the notion of 'truth'
7.3.5 Not a linkage approach per se but a management approach which enables novel linkage approaches
7.3.6 Linkage engine independent
7.3.7 Separates out linkage from cluster identification phase (and clerical review)
7.4 Methodologies
7.4.1 Overview of storage and extraction approach
7.4.2 Overall management of data as collections
7.4.3 Data loading
7.4.4 Identification of equivalence sets and deterministic linkage
7.4.5 Probabilistic linkage
7.4.6 Clerical review
7.4.7 Determining cut-off thresholds
7.4.8 Final cluster extraction
7.4.9 Graph partitioning
7.4.10 Data management/curation
7.4.11 User interface challenges
7.4.12 Final cluster extraction
7.4.13 A typical end-to-end workflow
7.5 Algorithm implementation
7.5.1 Graph traversal
7.5.2 Cluster identification
7.5.3 Partitioning visitor
7.5.4 Encapsulating edge following policies
7.5.5 Graph partitioning
7.5.6 Insertion of review links
7.5.7 How to migrate while preserving current clusters
7.6 New approaches facilitated by graph storage approach
7.6.1 Multiple threshold extraction
7.6.2 Possibility of returning graph to end users
7.6.3 Optimised cluster analysis
7.6.4 Other link types
7.7 Conclusion
8 Large-scale linkage for total populations in official statistics
Owen Abbott
Peter Jones
Martin Ralphs
8.1 Introduction
8.2 Current practice in record linkage for population censuses
8.2.1 Introduction
8.2.2 Case study: the 2011 England and Wales Census assessment of coverage
8.3 Population-level linkage in countries that operate a population register: register-based censuses
8.3.1 Introduction
8.3.2 Case study 1: Finland
8.3.3 Case study 2: The Netherlands Virtual Census
8.3.4 Case study 3: Poland
8.3.5 Case study 4: Germany
8.3.6 Summary
8.4 New challenges in record linkage: the Beyond 2011 Programme
8.4.1 Introduction
8.4.2 Beyond 2011 linking methodology
8.4.3 The anonymisation process in Beyond 2011
8.4.4 Beyond 2011 linkage strategy using pseudonymised data
8.4.5 Linkage quality
8.4.6 Next steps
8.4.7 Conclusion
8.5 Summary
9 Privacy-preserving record linkage
Rainer Schnell
9.1 Introduction
9.2 Chapter outline
9.3 Linking with and without personal identification numbers
9.3.1 Linking using a trusted third party
9.3.2 Linking with encrypted PIDs
9.3.3 Linking with encrypted quasi-identifiers
9.3.4 PPRL in decentralised organisations
9.4 PPRL approaches
9.4.1 Phonetic codes
9.4.2 High-dimensional embeddings
9.4.3 Reference tables
9.4.4 Secure multiparty computations for PPRL
9.4.5 Bloom filter-based PPRL
9.5 PPRL for very large databases: blocking
9.5.1 Blocking for PPRL with Bloom filters
9.5.2 Blocking Bloom filters with MBT
9.5.3 Empirical comparison of blocking techniques for Bloom filters
9.5.4 Current recommendations for linking very large datasets with Bloom filters
9.6 Privacy considerations
9.6.1 Probability of attacks
9.6.2 Kind of attacks
9.6.3 Attacks on Bloom filters
9.7 Hardening Bloom filters
9.7.1 Randomly selected hash values
9.7.2 Random bits
9.7.3 Avoiding padding
9.7.4 Standardising the length of identifiers
9.7.5 Sampling bits for composite Bloom filters
9.7.6 Rehashing
9.7.7 Salting keys with record-specific data
9.7.8 Fake injections
9.7.9 Evaluation of Bloom filter hardening procedures
9.8 Future research
9.9 PPRL research and implementation with national databases
10 Summary
Katie Harron
Chris Dibben
Harvey Goldstein
10.1 Introduction
10.2 Part 1: Data linkage as it exists today
10.3 Part 2: Analysis of linked data
10.3.1 Quality of identifiers
10.3.2 Quality of linkage methods
10.3.3 Quality of evaluation
10.4 Part 3: Data linkage in practice: new developments
10.5 Concluding remarks
References
Index
Editors:

Katie Harron, London School of Hygiene and Tropical Medicine, UK

Harvey Goldstein, University of Bristol and University College London, UK

Chris Dibben, University of Edinburgh, UK