Validity of Deterministic Record Linkage Using Multiple Indirect Personal Identifiers
Linking a Large Registry to Claims Data
Background—Linking patient registries with administrative databases can enhance the utility of the databases for epidemiological and comparative effectiveness research. However, registries often lack direct personal identifiers, and the validity of record linkage using multiple indirect personal identifiers is not well understood.
Methods and Results—Using a large contemporary national cardiovascular device registry and 100% Medicare inpatient data, we linked hospitalization-level records. The main outcomes were the validity measures of several deterministic linkage rules using multiple indirect personal identifiers compared with rules using both direct and indirect personal identifiers. Linkage rules using 2 or 3 indirect, patient-level identifiers (ie, date of birth, sex, admission date) and hospital ID produced linkages with sensitivity of 95% and specificity of 98% compared with a gold standard linkage rule using a combination of both direct and indirect identifiers.
Conclusions—Ours is the first large-scale study to validate the performance of deterministic linkage rules without direct personal identifiers. When linking hospitalization-level records in the absence of direct personal identifiers, provider information is necessary for successful linkage.
- databases, factual
- information storage and retrieval
- insurance claim reporting
- medical record linkage
- medical records systems, computerized
Device and disease registries can provide clinically rich information and have been used for various types of research, including health services and outcomes research. However, registries often have limited or no diagnostic and therapeutic information on nontarget conditions and treatments or on follow-up for outcomes such as hospitalization and death.1 Administrative databases have been widely used to assess the safety of medications, health service utilizations, and outcomes because they provide detailed longitudinal information on drug use, long-term follow-up on clinically important outcomes, and a general picture of health status and health services use in large population-based samples. However, these databases generally lack detailed clinical information on disease severity, medical devices, and laboratory and imaging results.2–4 Either data source alone is insufficient for conducting high-quality comparative effectiveness research. Therefore, an efficient and effective approach to improving the validity of comparative effectiveness studies may be to create a hybrid database by combining multiple data sources through record linkage.
Record linkage can combine multiple data sets to create a richer data set. The goal of record linkage is to combine person- or event-specific data from one source with additional data for the same people or events from another source.5 Record linkage is expected to be most accurate if a unique record identifier is common to, accurately recorded in, and not missing in both sources. However, registries often do not collect direct identifiers such as Social Security number and Medicare beneficiary identification number, or even names or addresses that are not unique but commonly used and often sufficient for linkage. Even if collected, the information is not usually released to researchers for ethical or technical reasons. Nonetheless, hospitalization records from registries and administrative databases can be linked using multiple indirect identifiers, such as date of birth, sex, admission date, and provider information such as hospital ID. The linked data sets have been used for clinical and comparative effectiveness research and other types of research.6–9
Although previous studies have demonstrated the feasibility of record linkage between registries and claims data using multiple indirect identifiers,7,10 the validity of this method relative to linkage using direct identifiers, generally considered the gold standard, has not been assessed. We compared the validity of several deterministic record linkage methods with multiple indirect identifiers by using data from the Centers for Medicare and Medicaid Services (CMS) implantable cardioverter-defibrillator (ICD) registry and administrative Medicare inpatient claims data.
Data for the study were from the CMS ICD registry and the Medicare Provider Analysis and Review (MedPAR) file for 2005 through 2008. The CMS ICD registry is a subset of the American College of Cardiology–National Cardiovascular Data Registry, which is the sole repository for ICD implantation data for Medicare beneficiaries.9–13 The data are entered by hospital personnel and are only included in the analytic file if hospitals achieve certain completeness on specific data elements.11 In addition, a subset of hospitals is randomly selected for quality control review to evaluate data accuracy. More than 400 000 patients are included in the CMS ICD registry, which contains 37 of 170 data elements that the American College of Cardiology–National Cardiovascular Data Registry collects. These 37 elements include patients’ identifying information, history and clinical characteristics, medications, facility information, provider information, ICD indications, device information, and in-hospital complications.
The MedPAR file contains claims data for services provided to fee-for-service Medicare beneficiaries admitted to Medicare-certified inpatient hospitals. It includes information on beneficiary demographic characteristics, diagnoses, procedures, and health resource use from hospitals (for inpatients only) and skilled nursing facilities, as well as detailed data on accommodations, departmental charges, days of care, entitlement, and Medicare enrollment status.
We used deterministic linkage12 requiring matches on different combinations of direct and indirect identifiers (ie, linkage variables). Linkage variables must be variables that are common to both data sets being linked. Based on our and others’ previous experience,7,10,13,14 we developed multiple linkage rules using the following linkage variables: date of birth, sex, admission date, and hospital ID. Both the CMS registry and the Medicare data included Medicare hospital IDs. Because discharge date was not available in the ICD registry, we did not include it as a linkage variable.
We developed 17 linkage rules (15 test rules and 2 gold standard rules) using various combinations and granularity of information from indirect and direct identifiers (Table 1). We included only 6 test rules using 4 indirect identifiers in the final presentation, which were representative of the entire range of results, to avoid complexity and redundancy. Two rules containing both direct and indirect identifiers were considered the gold standard. All rules contained the admission date for ICD placement to make a record unique at the hospitalization level. A few rules represented situations when less granular information may be available in 1 or both of the databases (eg, month or year of birth but not full date of birth, provider state but not hospital ID). Our deterministic linkage rules required exact matches on values of all linkage variables specified in the rules (Table 1).
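To illustrate how a linkage rule with coarser granularity differs from one using the full date of birth, the following minimal Python sketch represents a rule as a tuple of fields from which an exact-match key is built. The field names (`dob`, `sex`, `admit_date`, `hospital_id`), the helper functions, and the two example rules are hypothetical and are not taken from Table 1.

```python
from datetime import date

# Hypothetical field extractors; the actual rules are defined in Table 1.
def birth_full(rec):
    return rec["dob"]

def birth_year(rec):
    return rec["dob"].year

def make_key(rec, fields):
    """Build the exact-match key for one record under one rule."""
    return tuple(f(rec) if callable(f) else rec[f] for f in fields)

# Two illustrative rules: full date of birth vs year of birth only.
RULE_FULL_DOB = (birth_full, "sex", "admit_date", "hospital_id")
RULE_YEAR_ONLY = (birth_year, "sex", "admit_date", "hospital_id")

# Two different patients admitted to the same hospital on the same day.
a = {"dob": date(1931, 4, 2), "sex": "M",
     "admit_date": date(2006, 5, 1), "hospital_id": "H1"}
b = {"dob": date(1931, 9, 17), "sex": "M",
     "admit_date": date(2006, 5, 1), "hospital_id": "H1"}

# The full-date rule separates the two records; the year-only rule does not.
same_full = make_key(a, RULE_FULL_DOB) == make_key(b, RULE_FULL_DOB)
same_year = make_key(a, RULE_YEAR_ONLY) == make_key(b, RULE_YEAR_ONLY)
```

In this toy example, the year-only key collides for two distinct patients, which is the mechanism by which less granular rules can lose unique linkages.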
The Figure shows the steps taken in linking the CMS ICD registry records with MedPAR inpatient records. First, we identified redundant records that had the same combination of values for all of the linkage variables, including the unique identifiers, within each data set. We assumed that these duplicates were created by administrative or technical errors in the process of creating or managing each database and kept only 1 record among the duplicates. The number of such records was small relative to the size of the data: 45 in the registry and 7429 in the MedPAR file. We then restricted both data sources to hospitalization records belonging to patients aged ≥66 years. Next, we excluded records with missing or obviously invalid values in the linkage variables specified in each linkage rule. These numbers (the seventh box in the Figure) were used as the denominator in calculating the linkage rate (as discussed in the next section). To implement deterministic linkage of the 2 databases, we required exact matches on the values specified in each rule. Because we expected 1 unique hospitalization record in the registry to be linked to 1 hospitalization record in the Medicare data, we considered 1-to-1 linkages to be successful linkages. Finally, we compared the performance and validity of the 6 test rules containing indirect identifiers with the 2 gold standards. In calculating these measures, we restricted to records with no missing or invalid values in all 5 linkage variables (ie, beneficiary ID, sex, date of birth, admission date, and provider ID) required for creating the gold standards to avoid changing the denominator for the potentially linkable records (the last box in the Figure).
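The steps above (removing duplicates, excluding records with missing linkage values, requiring exact matches, and keeping only 1-to-1 linkages) can be sketched in Python as follows. This is a simplified illustration, not the SAS code used in the study: the field names, the `row` identifier, and the toy records are hypothetical, and the age restriction and invalid-value checks are omitted.

```python
from collections import defaultdict

# Hypothetical names for the 5 linkage variables.
ALL_FIELDS = ("bene_id", "dob", "sex", "admit_date", "hospital_id")

def rec(row, *values):
    """Build a toy record; `row` is a hypothetical row identifier."""
    return {"row": row, **dict(zip(ALL_FIELDS, values))}

def dedupe(records):
    """Keep 1 record per distinct combination of all linkage variables."""
    seen, kept = set(), []
    for r in records:
        sig = tuple(r.get(f) for f in ALL_FIELDS)
        if sig not in seen:
            seen.add(sig)
            kept.append(r)
    return kept

def index_by_key(records, fields):
    """Group row ids by rule key, skipping records with missing values."""
    groups = defaultdict(list)
    for r in records:
        key = tuple(r.get(f) for f in fields)
        if None not in key:
            groups[key].append(r["row"])
    return groups

def link_one_to_one(registry, claims, fields):
    """Return (registry_row, claims_row) pairs whose key is unique in both files."""
    reg = index_by_key(dedupe(registry), fields)
    clm = index_by_key(dedupe(claims), fields)
    return sorted(
        (rows[0], clm[key][0])
        for key, rows in reg.items()
        if len(rows) == 1 and len(clm.get(key, ())) == 1
    )

registry = [
    rec("r1", "A", "1930-01-01", "M", "2006-03-01", "H1"),
    rec("r2", "B", "1932-05-05", "F", "2006-03-02", None),   # missing hospital ID
    rec("r3", "C", "1935-07-07", "F", "2007-01-09", "H2"),
]
claims = [
    rec("c1", "A", "1930-01-01", "M", "2006-03-01", "H1"),
    rec("c2", "A", "1930-01-01", "M", "2006-03-01", "H1"),   # exact duplicate of c1
    rec("c3", "C", "1935-07-07", "F", "2007-01-09", "H2"),
    rec("c4", "D", "1935-07-07", "F", "2007-01-09", "H2"),   # same indirect key as c3
]

INDIRECT_RULE = ("dob", "sex", "admit_date", "hospital_id")
GOLD_RULE = ("bene_id", "admit_date", "hospital_id")

indirect_links = link_one_to_one(registry, claims, INDIRECT_RULE)
gold_links = link_one_to_one(registry, claims, GOLD_RULE)
```

In this toy data, the rule with the direct identifier links `r3` to `c3`, whereas the indirect rule does not, because 2 claims records share the same indirect key and the match is no longer 1-to-1.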
Linkage Rates and Validity Measures
Perfect linkage would identify all true linkages without false-positive or false-negative linkages. In deterministic linkage, missing values, invalid values, and coding or typing errors in any of the linkage variables prevent perfect linkage, even in the presence of direct identifiers. The error rate can be high for direct identifiers consisting of long strings of numbers or letters, especially if the information was entered manually. Linkage rules with direct identifiers cannot, therefore, be considered the true gold standard; however, in practice, they are often considered the best available with the greatest face validity.
We calculated the linkage rate as the number of 1-to-1 linkages by each rule divided by the total number of linkable records in the ICD registry (the seventh box in the Figure). The size of the denominator varied for each rule because the number of linkable records depended on the number of records with missing or invalid values in the linkage variables. The linkage rate gives a general idea about the size of the final linked data set. This measure is affected by both the ability of each linkage rule to identify unique records (1-to-1 linkage) and the amount of missing or invalid values in the linkage variables for each linkage rule. We did not expect that 100% of records for patients aged >65 years in the ICD registry would be linked to Medicare data, even if all linkage variables were complete and accurately recorded. The expected best linkage rate was between 55% and 65% because our Medicare data did not include claims for ICD procedures for patients enrolled in Medicare-managed care plans or in the Veterans Health Administration (VA), patients receiving ICD at VA hospitals,15 patients with supplemental insurance covering inpatient care, or patients with ICD placements not resulting in a hospitalization. We verified this estimate by analyzing data for 5% of Medicare beneficiaries from 2005 to 2008; among ICD recipients in these data, ≈20% were outpatients and 80% were inpatients. Furthermore, 20% to 30% of all inpatient ICD implantations were in patients enrolled in Medicare-managed care, a VA plan, or supplemental insurance. Therefore, only the remaining 55% to 65% of all implantations were linkable.
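The arithmetic behind the expected range can be sketched as follows, under the assumption (ours, for illustration) that the inpatient share and the share of inpatient implantations not captured in the claims combine multiplicatively. The function names are hypothetical.

```python
def expected_linkable_share(inpatient_share, excluded_share):
    """Share of all implantations expected to appear in the inpatient claims.

    inpatient_share: fraction of implantations performed in inpatients.
    excluded_share: fraction of inpatient implantations not captured in the
    claims (managed care, VA plan, or supplemental insurance).
    """
    return inpatient_share * (1.0 - excluded_share)

def linkage_rate(n_one_to_one, n_linkable):
    """Number of 1-to-1 linkages divided by the number of linkable records."""
    return n_one_to_one / n_linkable

# 80% inpatient, with 20% to 30% of inpatient implantations not captured.
low = expected_linkable_share(0.80, 0.30)   # about 0.56
high = expected_linkable_share(0.80, 0.20)  # about 0.64
```

Under this assumption, the product gives roughly 56% to 64%, which is consistent with the stated range of 55% to 65%.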
Requiring 1-to-1 linkages does not ensure that all linkages are true linkages without false-positive or false-negative linkages. When there is no error in linkage variables, minimally necessary information to make a hospitalization record unique would be a combination of direct patient identifiers such as Social Security number or Medicare beneficiary ID, date of admission, and hospital ID. As primary and secondary gold standards, we developed 2 linkage rules using a direct identifier (beneficiary ID) and a few indirect identifiers (Table 1). The Medicare beneficiary ID is a unique beneficiary identifier field that consists of 15 characters. The CMS ICD registry includes Social Security numbers, and the MedPAR data include scrambled Medicare beneficiary IDs. We obtained a crosswalk file between Social Security numbers and the scrambled beneficiary IDs from CMS. For all analyses, we used the scrambled beneficiary ID as the direct ID.
We calculated sensitivity, specificity, and positive predictive value for the linkage rules using indirect identifiers, with the primary gold standard (which included beneficiary ID, admission date, and hospital ID) as the reference. We also conducted sensitivity analyses using the secondary gold standard, which included beneficiary ID, admission date, date of birth, and hospital ID.
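A minimal Python sketch of these validity measures is shown below. It makes a simplifying assumption that is not spelled out in the text: each registry record is classified only as linked or not linked by the test rule and by the gold standard, without checking whether both rules linked it to the same claims record. The function name and the toy record identifiers are hypothetical.

```python
import math

def validity_measures(test_linked, gold_linked, all_rows):
    """Compare a test rule with a gold standard at the registry-record level.

    test_linked, gold_linked: registry records linked by each rule.
    all_rows: all potentially linkable registry records.
    """
    test, gold, universe = set(test_linked), set(gold_linked), set(all_rows)
    tp = len(test & gold)             # linked by both rules
    fp = len(test - gold)             # linked by the test rule only
    fn = len(gold - test)             # linked by the gold standard only
    tn = len(universe - test - gold)  # linked by neither rule

    def ratio(num, den):
        # Undefined when the denominator is zero.
        return num / den if den else math.nan

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
    }

all_rows = {f"r{i}" for i in range(1, 11)}   # 10 toy registry records
gold = {"r1", "r2", "r3", "r4", "r5"}
test = {"r1", "r2", "r3", "r4", "r6"}

measures = validity_measures(test, gold, all_rows)
no_links = validity_measures(set(), gold, all_rows)
```

In the toy example, all 3 measures equal 0.8. For a rule that produces no linkages, the positive predictive value is undefined in this sketch, whereas the article reports it as 0%.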
We first described the frequency of missing values and obvious errors in both registry and Medicare data. We were not able to identify errors in the linkage variables that had plausible but incorrect values. We then described the characteristics of the study population with linkable records in the registry. Finally, we calculated linkage rate and validity measures. All analyses were conducted using SAS version 9.2 (SAS Institute Inc, Cary, NC). The study was approved by the institutional review board of Brigham and Women’s Hospital.
During the study period between 2005 and 2008, among 264 918 hospitalizations for ICD placements in the ICD registry, 3% had a missing value and 1% had an invalid value for Social Security number. Hospital ID was missing in 7% of the records in the registry, compared with none in the Medicare inpatient data. In general, rates of missing or invalid values were small. Nonetheless, the rates of missingness were much higher in the registry than in the Medicare data (Table I in the Data Supplement).
Among 211 229 records for elderly Medicare beneficiaries in the CMS ICD registry, the mean age was 76.1 years, 88% were white, and 75% were men. The mean ejection fraction was 28%, and most patients (83%) had New York Heart Association class II or III heart failure (Table II in the Data Supplement).
The linkage rules requiring exact matches on multiple indirect identifiers and on the direct identifier, beneficiary ID (primary and secondary gold standards) identified 110 504 and 104 770 unique linkages with linkage rates of 58.1% and 55.1%, respectively (Table 2). Among the test rules based solely on indirect identifiers, rule 5 (exact matches on date of birth, admission date, and hospital ID) had the highest linkage rate (56.4%), closely followed by 56.0% for rule 1 (exact matches on date of birth, sex, admission date, and hospital ID) and 55.6% for rule 3 (exact matches on date of birth, sex, admission date, and hospital state). The linkage rates for the primary and secondary gold standards and rules 5, 1, and 3 were all within the expected range of 55% to 65% based on the 5% Medicare data analysis. The rules that represented imperfect information on date of birth (rules 2 and 6) had lower linkage rates (47.3% and 37.7%, respectively). The rule that did not use any provider information (rule 4) produced zero unique linkages.
The gold standard linkage using beneficiary ID with additional linkage variables was likely to produce linkages with very high specificity but also to miss some true linkages because of potential errors in the values of identifiers. We did not expect our data to be free from errors in linkage variables. To better understand the validity of the a priori gold standard rule (gold standard 1), which produced a higher number of linkages compared with gold standard 2, we investigated the reasons for nonlinkage among records linked by gold standard 1 but not by gold standard 2. Assessing the patterns of values used only in gold standard 2 (date of birth and sex), we found that >90% of the records had discrepancies in their values in only 1 of the following: sex, day in date of birth, month in date of birth, or year in date of birth (Table III in the Data Supplement). These findings support the use of gold standard 1 as a reasonable and the best available gold standard for evaluating the validity of linkage using indirect identifiers.
Table IV in the Data Supplement shows the characteristics of patients in linked versus nonlinked data by gold standard 1. The 2 groups were comparable in demographic and clinical characteristics, except that those in nonlinked data were more likely to be hospitalized for ICDs, to have lower New York Heart Association classes, and to receive ICDs in later years of the study period, consistent with our expectation that the majority of these patients received ICDs as outpatients or were enrolled in Medicare-managed care plans.
Compared with gold standard 1, rule 5 had the highest validity (sensitivity, 95.4%; specificity, 98.3%; positive predictive value, 98.0%), closely followed by rule 1 (sensitivity, 94.8%; specificity, 98.4%; positive predictive value, 98.1%), rule 3 (sensitivity, 88.8%; specificity, 94.1%; positive predictive value, 92.5%), rule 2 (sensitivity, 64.0%; specificity, 85.4%; positive predictive value, 78.3%), and rule 6 (sensitivity, 43.8%; specificity, 82.3%; positive predictive value, 67.4%). Rule 4, which had 0 unique linkages, had 100% specificity and 0% sensitivity and positive predictive value (Table 3). When we calculated these measures using the secondary gold standard with more restrictive linkage rules, rule 1 had the highest validity (sensitivity, 99.8%; specificity, 98.5%; positive predictive value, 98.1%), closely followed by rule 5 (sensitivity, 99.7%; specificity, 97.8%; positive predictive value, 97.2%), whereas the order for the other rules remained the same (Table IV in the Data Supplement).
Using a large contemporary national registry for ICDs and 100% Medicare inpatient data, we assessed the performance and validity of a deterministic linkage using multiple indirect identifiers compared with linkage using both direct and indirect identifiers. We found that linkage rules using 2 or 3 indirect, patient-level identifiers and hospital ID had appropriate linkage rates and high validity compared with the gold standard linkage rule with direct identifiers. Our study is the first large-scale demonstration, using a US registry and Medicare data, of the validity of deterministic linkage methods without direct identifiers or names or addresses of patients.
We previously used a similar linkage method requiring an exact match on date of birth, sex, admission date, and hospital ID to link a clinical registry for heart failure and myocardial infarction without direct patient identifiers to Medicare patients in pharmacy assistance programs in New Jersey and Pennsylvania.7,13,14,16 In this study, we estimated the success of linkage by comparing the observed linkage rate with the expected rate based on the assumed overlap between the 2 databases. Hammill et al10 described the use of multiple indirect identifiers to link hospitalization records from a national heart failure registry to 100% Medicare inpatient data. Exploring various rules with different combinations of indirect identifiers, they concluded that a high level of record uniqueness can be achieved using different combinations of indirect identifiers blocked by provider. These findings are consistent with our finding that linkage rates and validity are lower in the absence of provider information when linking on multiple indirect identifiers.
Hammill et al10 achieved a linkage rate of 81%, whereas our best linkage rate using a similar linkage rule was 56%. In fact, our linkage rates were lower in general, even for the rules with direct identifiers. The difference in the linkage rates between the study by Hammill et al10 and ours is probably because of differences in the overlap between the linked databases. In neither study was the linkage rate expected to be 100% because the Medicare data include claims only for patients who are fee-for-service beneficiaries but not claims from subjects enrolled in a Medicare-managed care plan or who are receiving VA benefits, received ICDs at VA hospitals, or had their inpatient stay fully covered by private or employer-sponsored insurance plans. Moreover, the penetration of Medicare-managed care programs increased over time. It was 13% in the time frame of the study by Hammill et al10 (2003–2004) compared with 15% to 23% during the time frame of our study (2005–2008).17 In addition, although the heart failure registry in the study by Hammill et al10 consisted of only inpatient records, the CMS ICD registry included both inpatient and outpatient procedures, and the information in the registry could not be used to distinguish between the two. Finally, outpatient ICD implantation records in the registry could not be linked as only inpatient claims are included in MedPAR data.
Most recently, Bohensky et al18 compared the linkage rates for a linkage rule using indirect identifiers with one using both direct and indirect identifiers to link records in the Australia/New Zealand critical care registry to a state financial claims database in a subset. They found that linkage rates were similar for rules with and without direct identifiers (95% for the rule using direct identifiers and 92% for the rule using indirect identifiers), which is consistent with what we found in our study. Sensitivity and specificity for the linkage with indirect identifiers were 97.5% and 97.0%, respectively, similar to our large validation study.
Although we used the best available linkage rules using direct identifiers as the gold standard to assess the validity of linkage without direct identifiers, they were not true gold standards because the values for the direct identifier, beneficiary ID, were likely to contain errors or typos. Therefore, false-negative and false-positive linkages are a concern. We used a deterministic linkage method and required exact matches on >3 variables in our rules. The expected error rates in these values are low, and the rate for false-positive linkages is anticipated to be small. However, false-negative linkages are likely in all rules, including the primary and secondary gold standards. We previously described the impact of a gold standard with imperfect sensitivity, which would result in the underestimation of sensitivity and positive predictive value and overestimation of specificity for the test rules.19 The degree of bias from the imperfect gold standard depends on the number of false-negative links in the gold standard and the prevalence of the true linkage.
The change in the relative performance of 2 test rules (ie, rules 1 and 5) can be explained by the CMS data management practice of assigning sex as female when information on sex is missing. Nonetheless, our conclusion is that these 2 rules are equally valid because the differences in the validity measures were so small. To minimize false-negative linkages, probabilistic linkage methods would be useful, and the utility of the probabilistic linkage method needs further evaluation in linking registries to claims data.
In linking hospitalization records from registries to Medicare claims data, we have demonstrated that linkage rules using multiple indirect identifiers including provider IDs produced highly valid linkages compared with the gold standard rule(s) that included both direct and indirect identifiers. Our results are likely generalizable to attempts that link hospitalization-level records. Multiple noncardiovascular registries exist that register patients in outpatient settings, such as the Consortium of Rheumatology Researchers of North America20 and the Childhood Arthritis and Rheumatology Research Alliance for adult and pediatric rheumatology populations, as well as large ambulatory cardiovascular quality improvement registries, such as Practice Innovation and Clinical Excellence (PINNACLE) registry by the American College of Cardiology Foundation or Guideline Advantage by the American Heart Association. Similar to inpatient record linkages, a provider (physician) ID is likely to be a crucial variable in linking outpatient records. However, in outpatient records, completeness and accuracy of physician IDs may be more compromised because of a more ambiguous definition of providers in outpatient settings where physicians work in a group practice. Detailed methodological work and separate validation are needed to understand the best practice in linking outpatient records.
When generalizing our results, expected error rates of linkage variables should be considered. The error rates in the ICD registry are likely to be low (ie, <5%) based on the quality of American College of Cardiology–National Cardiovascular Data Registry and the reported missingness rates. Our results may not be applicable in settings in which databases have high error rates in linkage variables. When error rates are high, a deterministic linkage method using multiple identifiers will produce a large number of false-negative links. A probabilistic linkage method should be considered to overcome this limitation. Finally, the generalizability of our results can be affected by the prevalence of the target condition or procedure of a registry. In general, when the condition or procedure is less common than our example procedure (implantation of ICD), the specificity, sensitivity, and positive predictive values for the methods using indirect identifiers are likely to be higher than what we reported and vice versa.
In conclusion, deterministic linkage using multiple indirect identifiers including provider IDs can produce reliable and valid linkage compared with that using a combination of direct and indirect identifiers to link hospitalization records from a registry to inpatient claims data. In the absence of direct personal identifiers, provider information was the key to identifying unique records and conducting successful linkage. Further studies are needed to understand the validity of similar methods to link outpatient records and the performance of deterministic versus probabilistic linkage methods in real-world record linkages for comparative effectiveness research.
Damon M. Seils, MA, Duke University, assisted with article preparation. He did not receive compensation for his assistance apart from his employment at the institution where the study was conducted.
Sources of Funding
This project was funded under contract #HHSA29020050016I from the Agency for Healthcare Research and Quality, US Department of Health and Human Services, as part of the Developing Evidence to Inform Decisions about Effectiveness (DEcIDE) program; and contract No. HHSM500201000001I from the Centers for Medicare and Medicaid Services, US Department of Health and Human Services. Dr Setoguchi was supported by midcareer development award K02HS017731 from the Agency for Healthcare Research and Quality. The funding agency had no role in the design and conduct of the study and in the collection, analysis, and interpretation of the data. The manuscript was based on a report done under contract to AHRQ; AHRQ had the draft report reviewed by independent peer reviewers before acceptance of the final report.
The authors of this report are responsible for its content. Statements in the report should not be construed as endorsement by the Agency for Healthcare Research and Quality, the Centers for Medicare & Medicaid Services, or the US Department of Health and Human Services.
The Data Supplement is available at http://circoutcomes.ahajournals.org/lookup/suppl/doi:10.1161/CIRCOUTCOMES.113.000294/-/DC1.
- Received April 12, 2013.
- Accepted March 26, 2014.
- © 2014 American Heart Association, Inc.
- Gliklich RE,
- Dreyer NA
- Wilkinson NM,
- Page J,
- Uribe AG,
- Espinosa V,
- Cabral DA
- Clark DE
- Hernandez AF,
- Fonarow GC,
- Hammill BG,
- Al-Khatib SM,
- Yancy CW,
- O’Connor CM,
- Schulman KA,
- Peterson ED,
- Curtis LH
- Douglas PS,
- Brennan JM,
- Anstrom KJ,
- Sedrakyan A,
- Eisenstein EL,
- Haque G,
- Dai D,
- Kong DF,
- Hammill B,
- Curtis L,
- Matchar D,
- Brindis R,
- Peterson ED
- Weintraub WS,
- Grau-Sepulveda MV,
- Weiss JM,
- O’Brien SM,
- Peterson ED,
- Kolm P,
- Zhang Z,
- Klein LW,
- Shaw RE,
- McKay C,
- Ritzenthaler LL,
- Popma JJ,
- Messenger JC,
- Shahian DM,
- Grover FL,
- Mayer JE,
- Shewan CM,
- Garratt KN,
- Moussa ID,
- Dangas GD,
- Edwards FH
- 11. National Cardiovascular Data Registry Program Requirements. http://www.ncdr.com/WebNCDR/COMMON/DATACOLLECTION.ASPX. Accessed March 20, 2012.
- Blakely T,
- Salmond C
- 13. Addressing Knowledge Gaps in the Treatment of Hypertension Using ACE/ARB Therapies. http://effectivehealthcare.ahrq.gov/healthInfo.cfm?infotype=nr&ProcessID=71. Accessed December 1, 2008.
- 14. Improving Methods for Comparative Effectiveness Research in Cardiovascular Care, 2008. http://www.effectivehealthcare.ahrq.gov/index.cfm/comparative-effectiveness-research-grant-and-arra-awards/?grantid=81. Accessed March 30, 2012.
- 15. Medicare and Other Health Benefits: Your Guide to Who Pays First. http://www.medicare.gov/publications/pubs/pdf/02179.pdf. Accessed March 20, 2012.
- 17. Medicare Managed Care Enrollees and the Medicare Utilization File, 2011. http://www.resdac.org/tools/TBs/TN-009_MedicareManagedCareEnrolleesandUtilFiles_508.pdf. Accessed March 20, 2012.
- Greenberg JD,
- Kremer JM,
- Curtis JR,
- Hochberg MC,
- Reed G,
- Tsao P,
- Farkouh ME,
- Nasir A,
- Setoguchi S,
- Solomon DH