Methods for Evaluating Deduplication

Brandy Altstadter, Scientific Technologies Corporation
National Immunization Conference, March 24, 2005

Agenda
- Background and Overview
- Methods of Evaluation
  - Process
  - Results
  - Lessons Learned
- Lessons Learned

Background
- Deterministic deduplication algorithm
  - Developed over 7 years ago
  - Used in 8 state registries
- Probabilistic deduplication algorithm
  - Developed over 2 years ago
  - Used in the Master Patient Index product (in 7 states)

I am not going to address the differences between deterministic and probabilistic algorithms in this presentation, but I want to highlight how and why I am interested in deduplication. Currently, STC supports and maintains two deduplication algorithms that are integrated into our licensed products. We also provide support for other registries, which includes deduplication. Over years of supporting deduplication, it has become obvious that deduplication is a process of continuous improvement.

Overview
- Key characteristics of registries and master patient indexes
  - Extremely large datasets
  - Multiple different sources
  - Wide range of ages and differences in patient demographics

Registries are intended to be population-based databases. This is particularly true in states with cradle-to-grave registries. As adult immunizations are included in the registry more frequently, the registry gains significant age cohorts outside early childhood. The registry gets data from multiple sources, including direct data entry, loads from vital statistics, and loads from billing systems and EMRs.

Master patient indexes are intended to provide one source of patient identification across the public health organization. As a result, this data comes from many different sources, covers many different age groups and, of course, is quite large, nearing the population of the state. The data may also carry different identifiers. Some data may have a Medicaid ID, while other data has a WIC ID or birth certificate number, and still other data has an SSN. The master patient index is the only source that brings all of those identifiers together in a single place.

Evaluation Methods
- Matching based on a 3rd record
- Comparing two or more algorithms with a large dataset
- User feedback
- CDC Deduplication Toolkit

In working with deduplication in registries and in master patient indexes, we have used four methods over the years for evaluating the quality of the algorithm.

Matching Based on a 3rd Record
- Process
  - Compare incoming record against the existing data set
  - Determine cases where the incoming record is an exact match with two or more records

Matching based on a 3rd record can be run either by checking every incoming record for 3rd record matches or by re-processing the entire data set to review the quality of the existing deduplication. If the deduplication algorithm indicates that a record is an exact match with two existing records that do not otherwise match each other, that generally indicates that the incoming record overlaps enough with both records that all three match. The record can then be sent to automatic review or manual review, depending on the confidence level.
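
The 3rd record check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names, the toy `is_match` rule, and the `third_record_bridges` helper are hypothetical and are not the algorithm used in any STC product. The sample values are modeled on the Emily Smith example in the slides that follow.

```python
from itertools import combinations

# Hypothetical field names; a real registry would use its own schema.
IDENTITY_FIELDS = ("first_name", "last_name", "birth_date")
SUPPORTING_FIELDS = ("street_address", "guardian_first_name", "mother_maiden_name")


def is_match(a, b):
    """Toy matching rule (an assumption, not a production algorithm):
    all identity fields agree and at least one supporting field is
    populated in both records with the same value."""
    if any(a.get(f) != b.get(f) for f in IDENTITY_FIELDS):
        return False
    return any(a.get(f) and a.get(f) == b.get(f) for f in SUPPORTING_FIELDS)


def third_record_bridges(incoming, existing):
    """Return ID pairs of existing records that each match the incoming
    record but do not match each other."""
    hits = [rec for rec in existing if is_match(incoming, rec)]
    return [(a["id"], b["id"]) for a, b in combinations(hits, 2)
            if not is_match(a, b)]


base = {"first_name": "Emily", "last_name": "Smith", "birth_date": "2004-03-15"}
existing = [
    {**base, "id": 1, "street_address": "330 Maple Street",
     "guardian_first_name": "James", "mother_maiden_name": "Monroe"},
    {**base, "id": 2, "street_address": "650 Elm Street"},
]
incoming = {**base, "street_address": "650 Elm Street",
            "guardian_first_name": "James", "mother_maiden_name": "Monroe"}

bridges = third_record_bridges(incoming, existing)
print(bridges)  # pairs of existing records linked only through the incoming record
```

Each pair returned is a candidate for review: the two existing records were not merged on their own, but the incoming record matches both.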

3rd Record Matching Example
- Two patient records do not have sufficient information to merge automatically

Field                  Incoming Record    Current Record
Patient First Name     Emily              Emily
Patient Middle Name    (blank)            (blank)
Patient Last Name      Smith              Smith
Patient Birth Date     03/15/2004         03/15/2004
Street Address         650 Elm Street     330 Maple Street
City                   Phoenix            Phoenix
State                  AZ                 AZ
Zip                    85032              85024
Guardian First Name    (blank)            James
Mother Maiden Name     (blank)            Monroe

In this example, the two records would most likely go to manual review. The person doing the manual review would have to decide whether the patient has moved or is a different patient. In this case, the patient has a very common name, so a manual reviewer without personal knowledge of the patient may determine that it is not safe to merge the two patients. If the patient has been loaded by two different providers, contacting the provider may not even settle the question, because the new provider may or may not have address history information.

3rd Record Matching Example (cont.)
- Third patient record provides enough information to determine that all three are the same patient

Field                  Incoming Record    1st Current Record    2nd Current Record
Patient First Name     Emily              Emily                 Emily
Patient Middle Name    (blank)            (blank)               (blank)
Patient Last Name      Smith              Smith                 Smith
Patient Birth Date     03/15/2004         03/15/2004            03/15/2004
Street Address         650 Elm Street     330 Maple Street      650 Elm Street
City                   Phoenix            Phoenix               Phoenix
State                  AZ                 AZ                    AZ
Zip                    85032              85024                 85032
Guardian First Name    James              James                 (blank)
Mother Maiden Name     Monroe             Monroe                (blank)

The third patient record contains the address from the second patient record and the matching family information from the first patient record. Therefore, it is possible to determine that this patient is in fact the same person and that the address is different because the patient moved.

User Feedback
- Types of Feedback
  - Incorrect Merges
  - Missed Merges
  - Unnecessary Manual Reviews

There are three instances where you might receive user feedback:
- Records that were incorrectly merged.
- Duplicate records in the system that should have been merged.
- Records sent to manual review that could have been automatically merged.

User Feedback Process
- Develop a procedure for reporting issues with deduplication
- Manually fix issues
- Review issues and fine-tune algorithm

User Feedback Examples
- Data composition varies by source
- Twins
  - Similar first and middle names

Field                   Incoming Record    Current Record
Guardian First Name     JOAN E             JOAN
Guardian Middle Name    (blank)            ELAINE

Here are a couple of examples of things that we have discovered through user feedback. The first is that different sources of data often represent the same data slightly differently. For instance, one provider represented guardian first name and middle initial in the same field. It was important to process this data after receipt so that the first name and middle initial appeared in the correct fields and “JOAN” would match with “JOAN”. Without logic to separate the names and initials for comparison, automatic processing cannot determine that “JOAN E” and “JOAN ELAINE” are the same, even though that is obvious to a human.

Another example is twins. Through user feedback, we have discovered that there is no foolproof way to handle parents who give their children similar first and middle names. The variations include same-sex twins with a one-letter difference in the first name and same-sex twins with the same first name and different middle names. To handle these cases without sending records to manual review, we have added birth order and multiple birth counts.
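
A minimal sketch of the name-splitting step described above, assuming the only problem is a middle name or initial appended to the first-name field. The function names are hypothetical, and the whitespace split is a simplification: it would wrongly split two-word first names such as "MARY ANN", which a real implementation would need to handle.

```python
def split_given_name(first, middle=""):
    """Move a trailing middle name or initial out of the first-name field.

    Only splits when the middle-name field is empty, so data that is
    already in the correct fields is left alone.
    """
    parts = first.strip().upper().split()
    middle = middle.strip().upper()
    if len(parts) > 1 and not middle:
        first_part, middle = parts[0], " ".join(parts[1:])
    else:
        first_part = " ".join(parts)
    return first_part, middle.rstrip(".")


def middle_names_compatible(a, b):
    """A blank middle name matches anything; an initial matches any name
    that starts with the same letter; full names must be equal."""
    if not a or not b:
        return True
    if len(a) == 1 or len(b) == 1:
        return a[0] == b[0]
    return a == b


def guardian_names_match(rec_a, rec_b):
    """Compare two (first_name, middle_name) tuples after splitting."""
    first_a, mid_a = split_given_name(*rec_a)
    first_b, mid_b = split_given_name(*rec_b)
    return first_a == first_b and middle_names_compatible(mid_a, mid_b)


print(split_given_name("JOAN E"))
print(guardian_names_match(("JOAN E", ""), ("JOAN", "ELAINE")))
```

Treating an initial as compatible with any name that starts with the same letter is a design choice: it favors automatic matching, so it would normally be combined with other fields rather than used on its own.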

Comparing Algorithms Process
- Select a large dataset
- Load only records that validate and do not error in either algorithm

Comparing Algorithms Process (cont.)
- Compare aggregate counts
  - Number of records queued for manual review
  - Number of records automatically merged
- Retrieve ID pairs for merged records
- Eliminate records that both systems agree on
- Manually review differences to look for merging errors or missed merges
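
The pair comparison in the steps above amounts to a set difference. The sketch below assumes each algorithm can export its merges as pairs of record IDs; the function names and the sample IDs are made up for illustration.

```python
def normalize_pairs(pairs):
    """Store each merged pair order-independently so (1, 2) equals (2, 1)."""
    return {frozenset(pair) for pair in pairs}


def as_sorted_tuples(pair_set):
    """Convert a set of frozensets back to a sorted list of tuples for reading."""
    return sorted(tuple(sorted(pair)) for pair in pair_set)


def compare_merges(pairs_a, pairs_b):
    """Split merged ID pairs into those both algorithms agree on and
    those only one algorithm produced (the ones to review manually)."""
    a, b = normalize_pairs(pairs_a), normalize_pairs(pairs_b)
    return {
        "agreed": as_sorted_tuples(a & b),
        "only_a": as_sorted_tuples(a - b),
        "only_b": as_sorted_tuples(b - a),
    }


# Made-up merge output from two algorithms run on the same dataset.
merged_by_a = [(101, 102), (103, 104), (105, 106)]
merged_by_b = [(102, 101), (105, 106), (107, 108)]

result = compare_merges(merged_by_a, merged_by_b)
print(result["only_a"], result["only_b"])
```

Pairs in `only_a` and `only_b` are the differences to review: each one is either a merging error in one algorithm or a missed merge in the other.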

Comparing Algorithms Results
- SSN was weighted too heavily
  - Commonly, the guardian’s SSN is entered, so siblings or twins may show the same SSN
- Gender was not given enough weight
  - Distinguish opposite-sex twins
- Additional logic was added to normalize addresses
  - e.g. 123 E. Flower Cir., Apt. 12 vs. 123 East Flower Circle, Lot 12

The two algorithms that we compared were developed independently of each other, and we discovered several key areas in the newer of the two algorithms that needed to be fine-tuned. Much of the logic in the established algorithm had been developed and enhanced over the years through user feedback. Comparing the algorithms allowed us to leverage that user feedback and apply that logic to the new algorithm.

First, SSN was weighted too heavily. In registries and other databases where information often arrives through health insurance, the child’s SSN may be populated with the guardian’s SSN rather than the child’s own. So, two siblings would both reflect the parent’s SSN.

Second, gender was not given enough weight. While it is difficult to decide to merge two patients with similar first names because they may be twins, it is easier to keep them separate automatically. One important indication that two patients with similar names are twins is that they are of opposite gender. For example, Juan/Male and Juana/Female are probably twins.

Finally, address normalization is an important part of deduplication. Comparing two algorithms helps highlight the wide variations in spellings, abbreviations, etc. that are used in addresses and allows the algorithms to be updated so that both contain all of the normalization logic.
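
As a rough illustration of address normalization, the sketch below expands a handful of abbreviations and separates the street line from the unit. The abbreviation table, the unit words, and the function names are assumptions made for this example; real normalization tables are far larger, and the naive lookup here would mishandle ambiguous tokens (for instance "ST" meaning "SAINT").

```python
import re

# Small illustrative table (an assumption); production tables are much larger.
ABBREVIATIONS = {
    "E": "EAST", "W": "WEST", "N": "NORTH", "S": "SOUTH",
    "CIR": "CIRCLE", "ST": "STREET", "AVE": "AVENUE", "APT": "APARTMENT",
}
UNIT_WORDS = {"APARTMENT", "LOT", "UNIT", "SUITE"}


def normalize_address(address):
    """Uppercase, strip punctuation, and expand known abbreviations."""
    tokens = re.sub(r"[^\w\s]", " ", address.upper()).split()
    return " ".join(ABBREVIATIONS.get(token, token) for token in tokens)


def street_and_unit(address):
    """Split a normalized address into (street line, unit designation)."""
    tokens = normalize_address(address).split()
    for i, token in enumerate(tokens):
        if token in UNIT_WORDS:
            return " ".join(tokens[:i]), " ".join(tokens[i:])
    return " ".join(tokens), ""


a = street_and_unit("123 E. Flower Cir., Apt. 12")
b = street_and_unit("123 East Flower Circle, Lot 12")
print(a)
print(b)
print(a[0] == b[0])  # the street lines agree once normalized
```

Separating the unit from the street line lets the comparison treat "Apt. 12" and "Lot 12" as a minor difference instead of a mismatch on the whole address. How much weight that difference deserves is a tuning decision.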

CDC Deduplication Toolkit
- Available from the CDC
- Sample set of data containing known duplicates (550 records)

Process
- Download and install from the Web site
- Load the data into your registry
- Run your deduplication algorithm on the data
- Export the dedup’ed results from your registry to the toolkit and analyze the results

Results
- Sensitivity
  - How well the system performs at recognizing known duplicate records
- Specificity
  - How accurate the duplicate record detection is (measured by the rate at which non-duplicate records are misidentified)

The toolkit provides two result measurements: sensitivity and specificity. Sensitivity is how well the system performs at recognizing known duplicate records. Specificity is how accurate the duplicate record detection is, measured by the rate at which non-duplicate records are misidentified. Both measurements are important. However, you would expect the specificity measurement to be higher than the sensitivity measurement. The specificity should actually be at 100%. Because registries deal with patient health issues, it is a bigger mistake to merge two records that are not duplicates than it is to fail to merge two records that are duplicates.
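
The two measurements can be computed over record pairs using the standard definitions: sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The sketch below is an assumption about how one might reproduce them by hand; it does not describe how the CDC toolkit calculates its figures, and the sample pairs are invented.

```python
from itertools import combinations


def sensitivity_specificity(true_duplicates, flagged, all_pairs):
    """Compute (sensitivity, specificity) over sets of record-ID pairs.

    Assumes there is at least one true duplicate pair and at least one
    non-duplicate pair; otherwise a division by zero occurs.
    """
    tp = len(true_duplicates & flagged)              # duplicates found
    fn = len(true_duplicates - flagged)              # duplicates missed
    fp = len(flagged - true_duplicates)              # non-duplicates merged
    tn = len(all_pairs - true_duplicates - flagged)  # non-duplicates left alone
    return tp / (tp + fn), tn / (tn + fp)


# Invented sample: five records, two known duplicate pairs, one detected.
record_ids = [1, 2, 3, 4, 5]
all_pairs = set(combinations(record_ids, 2))
true_duplicates = {(1, 2), (3, 4)}
flagged = {(1, 2)}

sensitivity, specificity = sensitivity_specificity(true_duplicates, flagged, all_pairs)
print(sensitivity, specificity)
```

In this sample the algorithm misses one duplicate pair but never merges a non-duplicate, which is the trade-off the notes above prefer: lower sensitivity is tolerable, but specificity should stay at 100%.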

Lessons Learned
- Impossible to anticipate every scenario
  - Creative users find a new way to make a data entry mistake (that you never thought of)
  - Creative parents find a way to name their multiple-birth children similarly (that you never thought of)
- Continue to evolve: there’s always room for improvement

The balance with twins is that you want to process as many first name typos automatically as possible, but you definitely do not want to merge twins. Even if you sent every first name difference to manual deduplication, you still could not be sure that you never merged twins, because that does not cover the case where twins have the same first name but different middle names. In addition to those cases, you will also encounter foreign names that you have never thought of.

Finally, it is important to continue to evolve. Not only is there always room for improvement, but registries also have an ever-growing role in the overall public health picture. As registries play a bigger and broader role, the demographic and clinical elements that are tracked change and grow in number.
