Andrew Borthwick, PhD§ Vikki Papadouka, PhD, MPH* Deborah Walker, PhD* *New York City Department of Health


1 The NY Citywide Immunization Registry’s MEDD De-Duplication Project
Andrew Borthwick, PhD§; Vikki Papadouka, PhD, MPH*; Deborah Walker, PhD*
* New York City Department of Health: vpapadou@dohlan.cn.ci.nyc.ny.us, dwalker@dohlan.cn.ci.nyc.ny.us
§ ChoiceMaker Technologies, Inc.: Andrew.Borthwick@choicemaker.com
Adapted from a presentation at the 34th National Immunization Conference, Washington, DC, July 7, 2000

2 New York Citywide Immunization Registry: The MEDD De-duplication Project
The NYC CIR
• The New York Citywide Immunization Registry (CIR) was mandated in January 1997
• All health-care providers are required to submit immunizations
• Goals of the system:
  • Doctors look up kids’ immunization statuses to determine which shots to give
  • Notify parents when their children are due for an appointment
  • Identify citywide immunization trends
• Similar registries are being built at the state and local level around the country

3 NYC CIR Background
• About 122,000 children are born in NYC every year
• Each month the CIR receives 50,000-100,000 patient records and 80,000-200,000 immunization records
• These come from more than 1,100 institutions and private providers
• Given this volume, hand-matching each new record before it enters the CIR is unrealistic

4 NYC CIR: Background
• Contains 1.8 million records
• Very high duplication rate, estimated at 3 records for every 2 children, because the criteria for automatic merging are very strict
• During April-September 1998, CIR staff spent 1,700 hours reviewing and manually de-duplicating about 260,000 record pairs

5 MEDD: What it is
• A system for deciding when two records represent the same child
• Fast and accurate
• Replicates the human decision-making process

6 MEDD’s Decision-Making Process
• For every record pair, MEDD computes a probability between 0 and 100% that the pair should be merged
• High probabilities → “merge”
• Low probabilities → “don’t merge”
• Intermediate probabilities (close to 50%) indicate “don’t know” and require human review
• Thresholds dividing the merge / don’t know / don’t merge cases are set by the user
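
The three-way decision on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `decide` and the default thresholds of 10% and 90% are assumptions made for the example, since the slides say only that the user sets the thresholds.

```python
def decide(p_merge: float, lower: float = 0.10, upper: float = 0.90) -> str:
    """Map a merge probability (0.0 to 1.0) to one of three decisions.

    `lower` and `upper` stand in for the user-set thresholds; the default
    values are placeholders, not figures from the presentation.
    """
    if not 0.0 <= lower <= upper <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= lower <= upper <= 1")
    if p_merge >= upper:
        return "merge"
    if p_merge <= lower:
        return "don't merge"
    return "don't know"  # held for human review


if __name__ == "__main__":
    for p in (0.995, 0.54, 0.02):
        print(f"{p:.1%} -> {decide(p)}")
```

Raising `upper` makes automatic merging more conservative and sends more pairs to human review, which is the effort vs. accuracy trade-off the registry controls.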

7 Maximum Entropy Modeling
• MEDD uses “Maximum Entropy Modeling”
• A new statistical decision-making technique
• Learns the human judgment process by training from examples
• Has been used in sentence parsing, computer vision, financial modeling, and proper-name identification
• Has achieved state-of-the-art results on these problems

8 Maximum Entropy Modeling: Features
• Maximum Entropy uses “features”
• Feature = a function that looks at specific fields in the pair of records and predicts “merge” or “don’t merge”
• MEDD has many different features, each of which is assigned a “weight” during training

9 Sample MEDD Features
• Mother’s birthday
  • A match of Mom’s birthday predicts “merge”
  • A mismatch of Mom’s birthday predicts “no-merge”
  • Neither feature fires unless Mom’s birthday is filled in on both records, because otherwise we have no evidence
• Many other features
  • Child’s birthday
  • Child’s first and last name
  • Medicaid number
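
The mother’s-birthday feature can be pictured as a small function over a pair of records. The sketch below is hypothetical: the dictionary layout and the field name `mother_dob` are assumptions for illustration, not MEDD’s actual data model.

```python
from typing import Optional


def mother_dob_feature(rec_a: dict, rec_b: dict) -> Optional[str]:
    """Return the prediction of the feature that fires, or None if neither fires.

    The key "mother_dob" is an assumed field name for this example.
    """
    a = rec_a.get("mother_dob")
    b = rec_b.get("mother_dob")
    if not a or not b:
        # Missing on at least one record: no evidence, so nothing fires.
        return None
    return "merge" if a == b else "no-merge"


if __name__ == "__main__":
    print(mother_dob_feature({"mother_dob": "12/04/76"}, {"mother_dob": "12/04/76"}))
    print(mother_dob_feature({"mother_dob": "12/04/76"}, {}))
```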

10 Training the System
• Inputs: record pairs hand-marked with merge/no-merge decisions, and a set of features
• Process: Maximum Entropy Parameter Estimator
• Output: a weight for each feature

11 Probability Computation
• Merge = product of the weights of all features predicting “merge” for the pair
• NoMerge = product of the weights of all features predicting “no merge” for the pair
• For a pair of records, MEDD computes the probability that the pair should be merged as:
  P(merge) = Merge / (Merge + NoMerge)

12 High Probability. Human Decision: Merge
| Field name | Record 1 / Record 2 | Feature | Weight | Prediction |
|---|---|---|---|---|
| Last name | Smith | Match | 1.153 | Merge |
| First name | Emily / Emely | No-match | 1.350 | No-merge |
| First name | Emily / Emely | Soundex | 4.708 | Merge |
| DOB | [04/28/97] | Match | 1.138 | Merge |
| Multiple birth | N / N | | | |
| Mom’s maiden name | CRUZ | | | |
| Mother’s DOB | 12/04/76 | | | |
| Street | 4528 3rd Ave | Match | 4.342 | Merge |
| City | Bronx | Match | 1.103 | Merge |
| State | NY | | | |
| Zip | 10462 | Match | 3.013 | Merge |
| Phone | 718-123-4567 / 718-123-6789 | No-match | 2.130 | No-merge |
| Med Rec Number | 11856437503 | Match | 6.587 | Merge |
Merge total = 587.2
No-merge total = 2.9
MEDD predicts “Merge” with 99.5% confidence
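
The totals on this slide can be reproduced from the listed weights. The sketch below assumes the merge probability is Merge / (Merge + NoMerge), which is consistent with the 99.5% shown; the function name `merge_probability` is made up for the example.

```python
from math import prod


def merge_probability(merge_weights, no_merge_weights):
    """Probability of "merge", assuming P = Merge / (Merge + NoMerge).

    Each total is the product of the weights of the features that fired;
    an empty list contributes a neutral product of 1.
    """
    merge_total = prod(merge_weights)
    no_merge_total = prod(no_merge_weights)
    return merge_total / (merge_total + no_merge_total)


# Weights copied from the slide above.
merge_weights = [1.153, 4.708, 1.138, 4.342, 1.103, 3.013, 6.587]
no_merge_weights = [1.350, 2.130]

if __name__ == "__main__":
    print(f"Merge total    = {prod(merge_weights):.1f}")     # 587.2
    print(f"No-merge total = {prod(no_merge_weights):.1f}")  # 2.9
    print(f"P(merge)       = {merge_probability(merge_weights, no_merge_weights):.1%}")  # 99.5%
```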

13 Low Probability. Human Decision: No-Merge
| Field name | Record 1 / Record 2 | Feature | Weight | Prediction |
|---|---|---|---|---|
| Last name | Lopez | Match | 1.153 | Merge |
| First name | Girl / Susan | | | |
| DOB | [1/11/97] / [1/2/97] | No-match | 28.949 | No-merge |
| Multiple birth | N / N | | | |
| Mom’s maiden name | Lopez | | | |
| Mother’s DOB | | | | |
| Street | 987 Cornelia / 456 Park | No-match | 2.937 | No-merge |
| City | Brooklyn | Match | 1.103 | Merge |
| State | NY | | | |
| Zip | 11211 | Match | 3.013 | Merge |
| Phone | 718-123-4567 / 718-234-5678 | No-match | 2.130 | No-merge |
| Med Rec Number | 1001002567435 | | | |
Merge total = 3.8
No-merge total = 181.1
MEDD predicts “No-merge” with 97.9% confidence

14 Intermediate Probability. Human Decision: Merge
| Field name | Record 1 / Record 2 | Feature | Weight | Prediction |
|---|---|---|---|---|
| Last name | Hernandez | Match | 1.153 | Merge |
| First name | Boy / David | | | |
| DOB | [2/14/97] | Match | 1.138 | Merge |
| Multiple birth | N / N | | | |
| Mom’s maiden name | Hernandez | | | |
| Mother’s DOB | 11/4/78 | | | |
| Street | 142 4th Ave | Match | 4.342 | Merge |
| City | Bronx | Match | 1.103 | Merge |
| State | NY | | | |
| Zip | 11051 / 11052 | No-match | 2.551 | No-merge |
| Phone | 718-524-4879 / 718-524-4878 | No-match | 2.130 | No-merge |
| Med Rec Number | 1001002567435 | | | |
Merge total = 6.3
No-merge total = 5.4
MEDD predicts “Merge” with 53.9% confidence (human review)

15 Sophisticated MEDD features: Name Frequency
• “Rodriguez” is 9 times more common than “Walker” in NYC
• Fewer than 3 kids per year are born with the names “Borthwick” and “Papadouka”
• Hence we build features categorizing names as “very common”, “somewhat common”, “very rare”, etc.
• Given that we have a name match, the fact that the names are very common is a feature predicting “don’t merge”
• A match between rare names is a feature predicting “merge”

16 Sophisticated MEDD features: Partial Name Match
• Soundex: a phonetic representation of names
  • Connor = Conor = Conner = CNR
  • When the Soundex representations of two names match, a feature fires predicting “merge”
• Edit distance: features fire when two names have an edit distance of 1
  • Borthwich, Borthwick, Bortwick
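
An edit-distance feature of this kind could look roughly like the sketch below. It uses the standard Levenshtein distance (insertions, deletions, substitutions), which is an assumption, since the slides do not say which edit-distance variant MEDD uses. The function names, and the choice to have the feature predict “merge”, are likewise illustrative.

```python
from typing import Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def near_match_feature(name_a: str, name_b: str) -> Optional[str]:
    """Fire (assumed to predict "merge") when names differ by exactly one edit."""
    if edit_distance(name_a.lower(), name_b.lower()) == 1:
        return "merge"
    return None


if __name__ == "__main__":
    print(edit_distance("Borthwich", "Borthwick"))  # one substitution
    print(edit_distance("Borthwick", "Bortwick"))   # one deletion
```

Note that under this definition “Borthwich” and “Bortwick” are two edits apart, so the feature would fire for each of them against “Borthwick” but not for that pair.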

17 Special Situation Features
• Every database has its quirks
• HMO XYZ always sends its data to the CIR with Day of Birth = “1”
  • Birthday = July 1, 1998, not July 15, 1998
• We have a special feature:
  • If Provider = “HMO XYZ” AND Day of Birth = 1 AND the dates differ only on day of birth, THEN predict merge
• We plan to allow users to define these types of features themselves
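
The special-situation rule above translates almost directly into code. In this sketch the record layout (`provider` and `dob` keys, with `dob` as a `datetime.date`) is an assumption for illustration, and the rule is checked in both directions because either record in the pair could be the one from HMO XYZ.

```python
from datetime import date
from typing import Optional


def hmo_day_one_feature(rec_a: dict, rec_b: dict) -> Optional[str]:
    """Predict "merge" when an HMO XYZ record has day of birth 1 and the
    two dates of birth differ only in the day. Returns None otherwise."""
    for src, other in ((rec_a, rec_b), (rec_b, rec_a)):
        if (
            src["provider"] == "HMO XYZ"
            and src["dob"].day == 1
            and src["dob"].year == other["dob"].year
            and src["dob"].month == other["dob"].month
            and src["dob"].day != other["dob"].day
        ):
            return "merge"
    return None


if __name__ == "__main__":
    hmo = {"provider": "HMO XYZ", "dob": date(1998, 7, 1)}
    clinic = {"provider": "Clinic A", "dob": date(1998, 7, 15)}
    print(hmo_day_one_feature(hmo, clinic))
```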

18 Test Procedure
• MEDD was tested on about 3,000 pairs under NYC DOH supervision
• Pairs were carefully hand-scored by NYC DOH as Merge/Don’t Merge
• ChoiceMaker never saw the test data

19 MEDD Evaluation Results
| Requested accuracy | % of records needing human review |
|---|---|
| 1% false positive, 1% false negative | 1.4% |
| 0.5% false positive, 0.5% false negative | 2.6% |
| 0.3% false positive, 0.3% false negative | 3.2% |
Even with double-checking, the human error rate is no better than 0.3%

20 Summary: What MEDD Offers
• Can be trained on just 3,000 record pairs
• Judges nearly 1,000 record pairs per second
• Achieves very high accuracy by finding the optimal weighting of the different clues (“features”) indicating merge/don’t merge
• Says “merge”, “don’t merge”, or “I don’t know”
• Can be rigorously tested
• Registry management can make informed judgments regarding the effort vs. accuracy trade-off

21 The 5 Stages of the De-duplication Process
1. “Blocking”: Identify a list of possible duplicates (SmartSearch)
2. “Decision-Making”: Identify a definitive list of duplicate records (MEDD)
3. Human review of:
   a. Records marked as “don’t know” by MEDD
   b. Records held by special filters (twins, scanty records, etc.)
4. Linkage: Link together records that belong to the same child (if A=B and B=C then A=C)
5. Update the CIR
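
The linkage stage (if A=B and B=C then A=C) amounts to grouping records by the transitive closure of the accepted merge pairs. One common way to do this is a union-find structure, sketched below. This is an illustrative choice, not a description of how the CIR implements linkage, and `link_records` is a made-up name.

```python
def link_records(merge_pairs):
    """Group record IDs into clusters from pairwise merge decisions.

    `merge_pairs` is an iterable of (id_a, id_b) tuples judged to be the
    same child. Returns a sorted list of sorted ID lists.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in merge_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two clusters

    groups = {}
    for rec in list(parent):
        groups.setdefault(find(rec), set()).add(rec)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    print(link_records([("A", "B"), ("B", "C"), ("D", "E")]))
```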

22 Project Avalanche
• Project Avalanche: a project in which we systematically de-duplicate the whole CIR by comparing every record to every record meeting certain criteria
• Uses our querying tool SmartSearch and our de-duplication tool MEDD
• Project Avalanche I: February-April 2000
• Project Avalanche II: May-July 2000

23 Project Avalanche I
• Used strict blocking criteria for finding possible duplicates to be passed on to MEDD, such as:
  • Exact match on DOB + Medical Record, or
  • Exact match on Medicaid number, or
  • First name + gender + DOB + last name = maiden name (and vice versa), or
  • Last name + first name + DOB
• Used 98% as the cut-off for automatic merging
• Hand-reviewed records produced by the filters
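
Blocking of this kind can be sketched by grouping records under one or more keys and pairing up only the records that share a key. The example below implements two of the criteria listed above (exact Medicaid number; last name + first name + DOB). The record layout and function names are assumptions for illustration and do not describe SmartSearch itself.

```python
from collections import defaultdict
from itertools import combinations


def medicaid_key(rec):
    # Records without a Medicaid number are not blocked on this criterion.
    return ("medicaid", rec["medicaid"]) if rec.get("medicaid") else None


def name_dob_key(rec):
    return ("name_dob", rec["last"].lower(), rec["first"].lower(), rec["dob"])


def candidate_pairs(records, key_funcs=(medicaid_key, name_dob_key)):
    """Return sorted, de-duplicated ID pairs that share at least one blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        for key_func in key_funcs:
            key = key_func(rec)
            if key is not None:
                blocks[key].append(rec["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)


if __name__ == "__main__":
    records = [
        {"id": 1, "last": "Smith", "first": "Emily", "dob": "04/28/97", "medicaid": "X1"},
        {"id": 2, "last": "Smith", "first": "Emely", "dob": "04/28/97", "medicaid": "X1"},
        {"id": 3, "last": "SMITH", "first": "Emily", "dob": "04/28/97", "medicaid": None},
        {"id": 4, "last": "Lopez", "first": "Susan", "dob": "01/02/97", "medicaid": None},
    ]
    print(candidate_pairs(records))
```

Only the pairs returned here would be scored, which is what keeps a whole-registry comparison tractable. Looser keys produce more candidates and catch more duplicates at the cost of more scoring and review.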

24 Project Avalanche I: Results
[Results table not preserved in this transcript]
* Estimated

25 Project Avalanche II
• In April 2000 we loaded 4 months’ worth of data that had been held due to Y2K problems
• Used more liberal blocking criteria:
  • Medical Record Number +
    • month and year of DOB, or
    • day and year of DOB, or
    • day and month of DOB, or
    • first name
• Used 90% as the cut-off for automatic merging
• Currently hand-reviewing records produced by the filters

26 Project Avalanche II: Results
[Results table not preserved in this transcript]
* Estimated

27 Project Avalanche: Discussion
• Using a very conservative cut-off for automatic merging, we reduced the duplicates by about 27.5% each time, and by more than 30% including human review
• As a result of Project Avalanche, 81% of records now have immunizations vs. 58% 6 months ago
• Since MEDD is not yet implemented on the front end of the CIR, you don’t see the total number of duplicates decreasing over time in these early runs

28 Future of MEDD at the CIR
• As part of the Lead and CIR integration, MEDD will be inserted on the front end, thus reducing the number of duplicates being created
• Improving MEDD’s performance will enable us to automatically merge more duplicates with the same error rate
• We will continue with Project Avalanche until we bring the duplication rate down to an acceptable level

29 Summary: ChoiceMaker Status
• Currently have two employees
  • Andrew Borthwick, Ph.D.
  • Prof. Arthur Goldberg
• Have several major contracts with the New York City Dept. of Health
• Good prospects of finding similar work with other state and municipal health departments

30 Summary: De-duplication Marketplace
• Immunization registries have very difficult duplicate-record problems
• Many others have similar problems
  • Medical researchers (correlating birth certificate and maternal death records)
  • Banks, phone companies (correlating clients from different lines of business)
  • Direct marketers (merging mailing lists)

31 Summary: ChoiceMaker’s Plans
• Do further research to decrease the amount of consulting time needed to deploy MEDD
• Seeking first-round investors to fund expansion of R&D and marketing
• Have an opening for someone with an M.S. in C.S. or similar qualifications, starting 10/1/2000, and for a C.S. Ph.D. starting 11/1/2000

