Estimating matching variable error rates and match probabilities for linkage of large administrative data sources

 

Principal Investigator: Ruth Gilbert, University College London

Co-Investigators: Katie Harron (UCL), Harvey Goldstein (UCL and University of Bristol), Mario Cortina-Borja (UCL), Nirupa Dattani (UCL and City University), Berit Muller-Pebody (Health Protection Agency), Roger Parslow (University of Leeds)

Project duration: 1 June 2013 - 30 September 2014

 

Linking existing data relating to the same people but recorded in different places is an efficient way to perform research, as it avoids the time and cost required to collect the same data multiple times. However, when there is no way of uniquely identifying the same person in different datasets, links can be wrong. If we incorrectly link records belonging to different people, or fail to link records that belong to the same person, analyses based on linked data can be flawed. For example, we might underestimate the rate of a particular condition or fail to identify a relationship between a risk factor and an outcome.


Linkage is usually based on methods proposed in the 1950s and 1960s. However, statistical methods have recently been suggested that handle more efficiently the uncertainty arising from typographical errors, missing values or changes in identifiers (e.g. married women’s surnames). Results based on data linked using these new methods have been shown to be less biased. These methods rely on there being a way of estimating how likely it is that two records belong to the same person.

Current methods for estimating these probabilities do not work well for several reasons. In particular, we need to assume that for any pair of records, agreement on one identifier is not related to agreement on any other identifier. This is not always the case, and can lead to misclassifying records as belonging to the same person or not. This study will address this problem by investigating alternative ways of estimating the probability of a match.
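To make the independence assumption concrete, the following is a minimal sketch of a traditional probabilistic match weight, in which each identifier contributes its own term and the terms are simply added together. The identifier names, the m- and u-probabilities and the prior match rate are all hypothetical values chosen for illustration; they are not estimates from this project.

```python
import math

# Hypothetical m- and u-probabilities for three identifiers (illustrative only).
# m = P(identifier agrees | records belong to the same person)
# u = P(identifier agrees | records belong to different people)
M_PROBS = {"surname": 0.95, "date_of_birth": 0.98, "postcode": 0.90}
U_PROBS = {"surname": 0.01, "date_of_birth": 0.003, "postcode": 0.002}


def match_weight(agreement):
    """Sum of log2 likelihood ratios, one per identifier.

    Adding the terms is only valid if agreement on one identifier is
    unrelated to agreement on any other (the independence assumption).
    """
    total = 0.0
    for field, agrees in agreement.items():
        m, u = M_PROBS[field], U_PROBS[field]
        ratio = m / u if agrees else (1 - m) / (1 - u)
        total += math.log2(ratio)
    return total


def match_probability(agreement, prior):
    """Posterior P(match | agreement pattern), given a prior match rate."""
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * 2 ** match_weight(agreement)
    return posterior_odds / (1 + posterior_odds)
```

If, say, surname and postcode tend to change together (for example after marriage and a house move), the summed weight treats two related disagreements as independent evidence, which is one way the resulting probability can be misleading.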

Stage 1 of the study will involve investigation of large administrative data sources to identify how and why identifiers might contain errors. This will take into account that error rates may differ for particular groups of people. For example, people who are sick and visit hospital often may have fewer data errors, as their details are checked more frequently compared with people who rarely go to hospital. The researchers in this project will investigate how often linkage goes wrong in terms of linking records that are nearly the same but belong to different people (e.g. twins) and failing to link records that do belong to the same person (e.g. when NHS number is recorded incorrectly). Accurate estimation of these error rates will provide improved estimates of the probability that records belong to the same person.

Stage 2 will use information from stage 1 to investigate methods for estimating the probability of a match. For example, if we estimate the probability that two records are a match given agreement between a set of identifiers at the same time, we do not need to assume that identifiers are unrelated to each other. Statistical methods for estimating these probabilities will be compared in terms of their accuracy.
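One simple way to picture the idea of conditioning on a set of identifiers at the same time is sketched below: given record pairs whose true match status is known, the match probability is estimated separately for each joint agreement pattern, so no assumption about independence between identifiers is needed. This is an illustrative sketch only, not the method the project will adopt; the function name and the input format are assumptions.

```python
from collections import Counter


def pattern_match_probabilities(labelled_pairs):
    """Estimate P(match | joint agreement pattern) by direct counting.

    labelled_pairs: iterable of (pattern, is_match), where pattern is a
    tuple of booleans (one per identifier) and is_match is the known
    true status of the pair. Both are hypothetical inputs.
    """
    totals = Counter()
    matches = Counter()
    for pattern, is_match in labelled_pairs:
        totals[pattern] += 1
        if is_match:
            matches[pattern] += 1
    # Proportion of true matches among pairs showing each pattern.
    return {pattern: matches[pattern] / totals[pattern] for pattern in totals}
```

The trade-off is that rare patterns are estimated from few pairs, so in practice some form of smoothing or modelling would probably be needed.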

Stage 3 will evaluate the methods produced in stage 2. This will be done by analysing data using i) methods developed within this project and ii) traditional methods, and comparing results. The researchers in this project will quantify how useful their approach is by calculating how different results are from what should have been seen had there been no errors.
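The comparison described above can be summarised with a percentage bias, i.e. how far a result obtained from linked data lies from the result that would have been seen with no linkage errors. The sketch below assumes a simple rate as the quantity of interest and uses made-up counts; the function names are illustrative.

```python
def observed_rate(events_linked, population):
    """Rate of a condition as seen in the linked data."""
    return events_linked / population


def relative_bias(estimate, truth):
    """Percentage difference between an estimate and the error-free value."""
    return 100.0 * (estimate - truth) / truth


# Hypothetical example: 200 true events in 10,000 people, but 10% of the
# event records fail to link, so only 180 events appear in the linked data.
true_rate = observed_rate(200, 10_000)
linked_rate = observed_rate(180, 10_000)
bias_percent = relative_bias(linked_rate, true_rate)  # about -10: an underestimate
```

Computing the same measure for each linkage method would allow the methods to be ranked by how closely they recover the error-free result.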

Although primarily focussing on healthcare data, the information gained in stage 1 and the methods produced in stage 2 will be relevant to anyone intending to link or to analyse many types of administrative data. The evaluation in stage 3 should help to change the way people think about linkage methods by challenging existing methods and persuading other researchers to use more appropriate methods for linking and analysing data.

 

For further information please see the project website