Foreword | xi | |
Contributors | xiii | |
| 1 | (7) |
1.1 Introduction: data linkage as it exists | 1 | (1) |
1.2 Background and issues | 2 | (1) |
| 3 | (2) |
1.3.1 Deterministic linkage | 3 | (1) |
1.3.2 Probabilistic linkage | 3 | (1) |
| 4 | (1) |
| 5 | (1) |
1.5 Impact of linkage error on analysis of linked data | 6 | (1) |
1.6 Data linkage: the future | 7 | (1) |
| 8 | (28) |
| 8 | (2) |
| 10 | (13) |
2.2.1 The Fellegi-Sunter model of record linkage | 10 | (3) |
2.2.2 Learning parameters | 13 | (7) |
2.2.3 Additional methods for matching | 20 | (2) |
2.2.4 An empirical example | 22 | (1) |
| 23 | (5) |
2.3.1 Description of a matching project | 24 | (1) |
2.3.2 Initial file preparation | 25 | (1) |
2.3.3 Name standardisation and parsing | 26 | (1) |
2.3.4 Address standardisation and parsing | 27 | (1) |
2.3.5 Summarising comments on preprocessing | 27 | (1) |
| 28 | (7) |
2.4.1 Estimating false-match rates without training data | 28 | (4) |
2.4.2 Adjusting analyses for linkage error | 32 | (3) |
| 35 | (1) |
3 The data linkage environment | 36 | (27) |
| 36 | (1) |
3.2 The data linkage context | 37 | (5) |
3.2.1 Administrative or routine data | 37 | (1) |
3.2.2 The law and the use of administrative (personal) data for research | 38 | (4) |
3.2.3 The identifiability problem in data linkage | 42 | (1) |
3.3 The tools used in the production of functional anonymity through a data linkage environment | 42 | (8) |
3.3.1 Governance, rules and the researcher | 43 | (1) |
3.3.2 Application process, ethics scrutiny and peer review | 43 | (1) |
3.3.3 Shaping 'safe' behaviour: training, sanctions, contracts and licences | 43 | (1) |
3.3.4 'Safe' data analysis environments | 44 | (3) |
3.3.5 Fragmentation: separation of linkage process and temporary linked data | 47 | (3) |
3.4 Models for data access and data linkage | 50 | (4) |
| 50 | (1) |
3.4.2 Separation of functions: firewalls within single centre | 51 | (2) |
3.4.3 Separation of functions: TTP linkage | 53 | (1) |
3.4.4 Secure multiparty computation | 53 | (1) |
3.5 Four case study data linkage centres | 54 | (8) |
| 54 | (4) |
3.5.2 The Secure Anonymised Information Linkage Databank, United Kingdom | 58 | (1) |
3.5.3 Centre for Data Linkage (Population Health Research Network), Australia | 59 | (2) |
3.5.4 The Centre for Health Record Linkage, Australia | 61 | (1) |
| 62 | (1) |
4 Bias in data linkage studies | 63 | (20) |
| 63 | (2) |
4.2 Description of types of linkage error | 65 | (3) |
4.2.1 Missed matches from missing linkage variables | 65 | (1) |
4.2.2 Missed matches from inconsistent case ascertainment | 66 | (1) |
4.2.3 False matches: Description of cases incorrectly matched | 66 | (2) |
4.3 How linkage error impacts research findings | 68 | (10) |
| 68 | (7) |
4.3.2 Assessment of linkage bias | 75 | (3) |
| 78 | (5) |
4.4.1 Potential biases in the review process | 79 | (1) |
4.4.2 Recommendations and implications for practice | 79 | (4) |
5 Secondary analysis of linked data | 83 | (26) |
| 83 | (1) |
5.2 Measurement error issues arising from linkage | 84 | (2) |
5.2.1 Correct links, incorrect links and non-links | 84 | (1) |
5.2.2 Characterising linkage errors | 85 | (1) |
5.2.3 Characterising errors from non-linkage | 86 | (1) |
5.3 Models for different types of linking errors | 86 | (4) |
5.3.1 Linkage errors under binary linking | 86 | (2) |
5.3.2 Linkage errors under multi-linking | 88 | (1) |
| 88 | (1) |
5.3.4 Modelling the linkage error | 89 | (1) |
5.4 Regression analysis using complete binary-linked data | 90 | (5) |
| 91 | (4) |
5.4.2 Logistic regression | 95 | (1) |
5.5 Regression analysis using incomplete binary-linked data | 95 | (4) |
5.5.1 Linear regression using incomplete sample to register linked data | 97 | (2) |
5.6 Regression analysis with multi-linked data | 99 | (8) |
5.6.1 Uncorrelated multi-linking: Complete linkage | 100 | (1) |
5.6.2 Uncorrelated multi-linking: Sample to register linkage | 101 | (4) |
5.6.3 Correlated multi-linkage | 105 | (1) |
5.6.4 Incorporating auxiliary population information | 105 | (2) |
5.7 Conclusion and discussion | 107 | (2) |
6 Record linkage: A missing data problem | 109 | (16) |
| 109 | (2) |
6.2 Probabilistic Record Linkage (PRL) | 111 | (1) |
6.3 Multiple Imputation (MI) | 112 | (1) |
6.4 Prior-Informed Imputation (PII) | 113 | (2) |
6.4.1 Estimating matching probabilities | 115 | (1) |
6.5 Example 1: Linking electronic healthcare data to estimate trends in bloodstream infection | 115 | (3) |
| 115 | (2) |
| 117 | (1) |
| 118 | (1) |
6.6 Example 2: Simulated data including non-random linkage error | 118 | (4) |
| 118 | (1) |
| 119 | (3) |
| 122 | (3) |
6.7.1 Non-random linkage error | 122 | (1) |
6.7.2 Strengths and limitations: Handling linkage error | 122 | (1) |
6.7.3 Implications for data linkers and data users | 123 | (2) |
7 Using graph databases to manage linked data | 125 | (45) |
| 125 | (1) |
| 126 | (5) |
| 127 | (1) |
7.2.2 Oops, your legacy is showing | 128 | (1) |
| 128 | (3) |
| 131 | (8) |
7.3.1 Overview of graph concepts | 131 | (2) |
7.3.2 Graph queries versus relational queries | 133 | (3) |
7.3.3 Comparison of data in flat database versus graph database | 136 | (1) |
7.3.4 Relaxing the notion of 'truth' | 137 | (1) |
7.3.5 Not a linkage approach per se but a management approach which enables novel linkage approaches | 138 | (1) |
7.3.6 Linkage engine independent | 139 | (1) |
7.3.7 Separates out linkage from cluster identification phase (and clerical review) | 139 | (1) |
| 139 | (17) |
7.4.1 Overview of storage and extraction approach | 140 | (1) |
7.4.2 Overall management of data as collections | 141 | (1) |
| 142 | (1) |
7.4.4 Identification of equivalence sets and deterministic linkage | 143 | (1) |
7.4.5 Probabilistic linkage | 144 | (1) |
| 144 | (1) |
7.4.7 Determining cut-off thresholds | 145 | (2) |
7.4.8 Final cluster extraction | 147 | (1) |
| 147 | (3) |
7.4.10 Data management/curation | 150 | (1) |
7.4.11 User interface challenges | 150 | (4) |
7.4.12 Final cluster extraction | 154 | (1) |
7.4.13 A typical end-to-end workflow | 155 | (1) |
7.5 Algorithm implementation | 156 | (2) |
| 156 | (1) |
7.5.2 Cluster identification | 157 | (1) |
7.5.3 Partitioning visitor | 158 | (1) |
7.5.4 Encapsulating edge following policies | 158 | (1) |
| 158 | (1) |
7.5.6 Insertion of review links | 158 | (1) |
7.5.7 How to migrate while preserving current clusters | 158 | (1) |
7.6 New approaches facilitated by graph storage approach | 158 | (9) |
7.6.1 Multiple threshold extraction | 160 | (5) |
7.6.2 Possibility of returning graph to end users | 165 | (1) |
7.6.3 Optimised cluster analysis | 166 | (1) |
| 167 | (1) |
| 167 | (3) |
8 Large-scale linkage for total populations in official statistics | 170 | (31) |
| 170 | (1) |
8.2 Current practice in record linkage for population censuses | 171 | (7) |
| 171 | (1) |
8.2.2 Case study: the 2011 England and Wales Census assessment of coverage | 172 | (6) |
8.3 Population-level linkage in countries that operate a population register: register-based censuses | 178 | (4) |
| 178 | (1) |
8.3.2 Case study 1: Finland | 179 | (1) |
8.3.3 Case study 2: The Netherlands Virtual Census | 180 | (1) |
8.3.4 Case study 3: Poland | 180 | (1) |
8.3.5 Case study 4: Germany | 181 | (1) |
| 181 | (1) |
8.4 New challenges in record linkage: the Beyond 2011 Programme | 182 | (17) |
| 182 | (1) |
8.4.2 Beyond 2011 linking methodology | 183 | (1) |
8.4.3 The anonymisation process in Beyond 2011 | 184 | (1) |
8.4.4 Beyond 2011 linkage strategy using pseudonymised data | 185 | (10) |
| 195 | (2) |
| 197 | (1) |
| 198 | (1) |
| 199 | (2) |
9 Privacy-preserving record linkage | 201 | (25) |
| 201 | (1) |
| 202 | (1) |
9.3 Linking with and without personal identification numbers | 202 | (4) |
9.3.1 Linking using a trusted third party | 203 | (1) |
9.3.2 Linking with encrypted PIDs | 204 | (1) |
9.3.3 Linking with encrypted quasi-identifiers | 204 | (1) |
9.3.4 PPRL in decentralised organisations | 204 | (2) |
| 206 | (3) |
| 206 | (1) |
9.4.2 High-dimensional embeddings | 206 | (1) |
| 207 | (1) |
9.4.4 Secure multiparty computations for PPRL | 207 | (1) |
9.4.5 Bloom filter-based PPRL | 207 | (2) |
9.5 PPRL for very large databases: blocking | 209 | (4) |
9.5.1 Blocking for PPRL with Bloom filters | 210 | (1) |
9.5.2 Blocking Bloom filters with MBT | 211 | (1) |
9.5.3 Empirical comparison of blocking techniques for Bloom filters | 211 | (2) |
9.5.4 Current recommendations for linking very large datasets with Bloom filters | 213 | (1) |
9.6 Privacy considerations | 213 | (4) |
9.6.1 Probability of attacks | 214 | (1) |
| 215 | (1) |
9.6.3 Attacks on Bloom filters | 215 | (2) |
9.7 Hardening Bloom filters | 217 | (7) |
9.7.1 Randomly selected hash values | 218 | (1) |
| 218 | (2) |
| 220 | (1) |
9.7.4 Standardising the length of identifiers | 220 | (1) |
9.7.5 Sampling bits for composite Bloom filters | 221 | (1) |
| 221 | (2) |
9.7.7 Salting keys with record-specific data | 223 | (1) |
| 223 | (1) |
9.7.9 Evaluation of Bloom filter hardening procedures | 223 | (1) |
| 224 | (1) |
9.9 PPRL research and implementation with national databases | 225 | (1) |
| 226 | (7) |
| 226 | (1) |
10.2 Part 1: Data linkage as it exists today | 226 | (1) |
10.3 Part 2: Analysis of linked data | 227 | (2) |
10.3.1 Quality of identifiers | 227 | (1) |
10.3.2 Quality of linkage methods | 228 | (1) |
10.3.3 Quality of evaluation | 228 | (1) |
10.4 Part 3: Data linkage in practice: new developments | 229 | (2) |
| 231 | (2) |
References | 233 | (20) |
Index | 253 | |