Andrew Borthwick, Ph.D. Martin Buechi, Ph.D. ChoiceMaker Technologies


1 Accurate, Customizable Matching: The Heart of the NYC Master Child Index
Andrew Borthwick, Ph.D.
Martin Buechi, Ph.D.
ChoiceMaker Technologies
Phone:
Vikki Papadouka, Ph.D.
Paul Schaeffer, M.P.A.
Deborah Walker, Ph.D.
Alexandra Ternier, M.P.H.
NYC Department of Health
Phone:

2 Goals of Talk
Background
  Overview of NYC DOH Master Child Index project
  History of ChoiceMaker’s use at NYC DOH
ChoiceMaker 2.0
  ChoiceMaker 2.0 improvements over 1.0
  ChoiceMaker’s new “ClueMaker” programming language

3 Master Child Index (MCI)
Integrates the New York Citywide Immunization Registry (CIR) with other child health databases, starting with LeadQuest (LQ), to produce a comprehensive Child Health Registry
Benefits:
  Improved surveillance
  Information sharing with providers
  Identification of children in need of immunizations and/or lead tests
  Information sharing among different DOH programs (e.g., addresses, providers)
  Improved data quality

4 CIR and LQ
CIR contains 2 million records and >13 million immunization events
  1,250 pediatric provider sites report
  ~170,000 immunization events reported monthly
  Also loads Vital Records: ~125,000 children/year
LeadQuest contains 1.9 million records and 4 million blood tests
  60 laboratories report
  ~39,000 test results reported monthly

5 MCI system
Records in LQ and CIR will be linked with each other through the MCI
New records will be checked against MCI first
  ChoiceMaker on front end
  Duplicates caught before they enter
[Diagram labels: LQ, CIR, MCI]

6 Accurate Matching Is Key to MCI
Matching system correlates records in the 2 databases
  Data from the systems can be linked
Matching not only prevents duplicates from entering the system but also reduces the duplication problem in existing databases
  Creates more complete, less fragmented records and hence increases the usefulness of the data
Matching allows for approximate searches
  More successful identification of records by doctors and DOH staff who are doing lookups
Accurate matching is the heart of the MCI

7 Record Matching Challenges
No unique IDs: SSN not allowed
Incorrectly submitted data: Borthwick vs. Borthwich
Data change over time: names change, addresses change
Variations in data received: Andrew vs. Andy
Incomplete data: fields aren’t filled in or are filled in with generic values
Data in wrong fields: first and last names reversed
Large volume of data: need for automation

8 The MCI’s Record Matching Solution
ChoiceMaker™ 2.0 (formerly known as “MEDD”)
Identifies and merges fragmented database records
  Searches databases for approximate matches, decides if same child, and merges if appropriate
Links related records across multiple databases (MCI-CIR-LQ)

9 ChoiceMaker 1.0’s Use by CIR
Used successfully in batch mode by the CIR since 1999
Merged over 700,000 records
Accuracy measured at 99.7% in tests supervised by NYC DOH

10 ChoiceMaker 1.0 vs. 2.0
Both ChoiceMaker 1.0 and 2.0 are:
  Holistic: take into account many aspects of the record simultaneously
  Modular: clues can be added and taken out depending on the data
  Based on an artificial intelligence technique called “maximum entropy modeling” that lets the system learn from examples (hand-scored by people)

11 ChoiceMaker 1.0 vs. 2.0
Improvements in ChoiceMaker 2 vs. ChoiceMaker 1
  Handles “stacked” data (more than one value for a field)
  Is written in Java (vs. C++), which makes it more portable
  Includes “Blocking” component
  Includes a new programming language called “ClueMaker” for writing clues
  Can be called online from MCI, CIR, and LQ applications, not just in batch mode

12 Technology: Production Matching
[Flow diagram: Search Record -> Blocking -> Many Possible Matches -> Maximum Entropy Matching -> Match Probabilities of Likely Matches]
Match probability:
  Low: Non-Match
  Intermediate: Human Review
  High: Match
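The flow on this slide can be sketched roughly as follows. This is only an illustration in Python, not ChoiceMaker code: the blocking rule (exact match on one key), the field name `dob`, and both threshold values are assumptions, since the slides do not say how blocking works or where the cutoffs sit.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]

# Hypothetical cutoffs; the slides do not give the values used at NYC DOH.
LOW_THRESHOLD = 0.20
HIGH_THRESHOLD = 0.80


def block(query: Record, database: List[Record], key: str = "dob") -> List[Record]:
    """Cheap first pass: keep only records that share a blocking key with the query.

    Exact equality on a single field is an assumed, simplified blocking rule.
    """
    wanted = query.get(key)
    if wanted is None:
        return []
    return [rec for rec in database if rec.get(key) == wanted]


def decide(match_probability: float,
           low: float = LOW_THRESHOLD,
           high: float = HIGH_THRESHOLD) -> str:
    """Map a match probability onto the three outcomes shown on the slide."""
    if not 0.0 <= match_probability <= 1.0:
        raise ValueError("match_probability must be between 0 and 1")
    if match_probability < low:
        return "non-match"
    if match_probability > high:
        return "match"
    return "human review"
```

Only the pairs that survive `block` would be scored by the matching model, and only the intermediate band would reach a human reviewer.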

13 ClueMaker Clues
Encode business rules for matching
Take a pair of records and suggest that they match or differ
Written in a Java-based language, ClueMaker™
Importance, or weight, of each clue determined by maximum entropy training
Clues can also be used as filters (rules)
Clue weights are combined to get probability
ClueMaker is new to ChoiceMaker with version 2.0
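One common way to combine clue weights into a probability in a two-outcome maximum entropy model is sketched below. The slides do not give ChoiceMaker's formula, so the exponential form, the function name, and the example weights are assumptions for illustration only.

```python
import math
from typing import Dict, Iterable, Tuple


def match_probability(fired_clues: Iterable[Tuple[str, str]],
                      weights: Dict[str, float]) -> float:
    """Combine the weights of the clues that fired into P(match).

    fired_clues holds (clue_name, decision) pairs, where decision is
    "match" or "differ". Each clue adds its trained weight to the score
    of the outcome it predicts; the scores are then normalized.
    """
    score = {"match": 0.0, "differ": 0.0}
    for name, decision in fired_clues:
        score[decision] += weights[name]
    exp_match = math.exp(score["match"])
    exp_differ = math.exp(score["differ"])
    return exp_match / (exp_match + exp_differ)


# Hypothetical weights; in practice they would come from training
# on hand-scored record pairs.
example_weights = {"mFirstNames": 2.0, "dTwin": 3.0}
p = match_probability([("mFirstNames", "match"), ("dTwin", "differ")], example_weights)
```

With no clues firing, this form returns 0.5. A heavily weighted differ clue such as the twin clue can outweigh a matching first name.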

14 Generic clues
Do first names match?
clue mFirstNames {
  match same(r.firstName);
}
Do first names match approximately using “phonetic matches” such as Soundex?
clue mSoundexFirstNames {
  match same(Soundex.soundex(r.firstName));
}
Do uncommon first names match?
clue mFrequencyFirstName {
  match foreach(int freq : {0, 1, 2, 3};
    same(r.firstName) && Maps.lookupInt("firstFreq", q.firstName) == freq);
}
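For readers unfamiliar with Soundex, here is a small Python sketch of the standard American Soundex coding. It is not the `Soundex.soundex` implementation that ClueMaker calls, which the slides do not show, and that implementation may differ in edge cases.

```python
# Letter-to-digit table for American Soundex; vowels, H, W and Y have no code.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of a name, or "" if it has no letters."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    encoded = letters[0]
    previous = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != previous:
            encoded += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (encoded + "000")[:4]
```

Under this coding, "Borthwick" and "Borthwich" both map to B632, so a Soundex clue catches that misspelling. "Andrew" (A536) and "Andy" (A530) do not share a code, so phonetic clues alone would not catch that nickname.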

15 Healthcare Clue
Do we have an indication of a twin: matching last names and birthdates, but different first names and consecutive medical record numbers?
clue dTwin {
  differ same(r.last_name) && same(r.birthday) && different(r.first_name)
    && valid(q.medical_rcrd) && valid(m.medical_rcrd)
    && Math.abs(q.medical_rcrd - m.medical_rcrd) == 1;
}
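The logic of the twin clue can be spelled out in plain Python as below. The `Child` record type is hypothetical, and the reading of `same`, `different`, and `valid` (both values present and equal, both present and unequal, value present) is an assumption about ClueMaker's semantics, which the slides do not define.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Child:
    """Hypothetical record layout mirroring the fields used on the slide."""
    first_name: Optional[str]
    last_name: Optional[str]
    birthday: Optional[date]
    medical_rcrd: Optional[int]


def valid(value) -> bool:
    """Assumed meaning: the field is filled in."""
    return value is not None and value != ""


def same(a, b) -> bool:
    """Assumed meaning: both values are valid and equal."""
    return valid(a) and valid(b) and a == b


def different(a, b) -> bool:
    """Assumed meaning: both values are valid and not equal."""
    return valid(a) and valid(b) and a != b


def d_twin(q: Child, m: Child) -> bool:
    """True when the pair looks like twins, which suggests the records differ."""
    return (same(q.last_name, m.last_name)
            and same(q.birthday, m.birthday)
            and different(q.first_name, m.first_name)
            and valid(q.medical_rcrd) and valid(m.medical_rcrd)
            and abs(q.medical_rcrd - m.medical_rcrd) == 1)
```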

16 Clues for Peculiarities of Data
Do month and year of birth match, and does the record come from facility XYZ? Due to an error in their system, they always report the day of birth as '1'.
clue mDobXyz {
  match same(DateUtils.MonthAndYear(r.dob))
    && ("XYZ" == q.facility || "XYZ" == m.facility);
}
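The same rule can be written out in Python as a rough sketch. The function signature is hypothetical, and the check simply compares year and month while ignoring the unreliable day.

```python
from datetime import date
from typing import Optional


def m_dob_xyz(q_dob: Optional[date], q_facility: Optional[str],
              m_dob: Optional[date], m_facility: Optional[str]) -> bool:
    """True when birth month and year agree and either record is from facility XYZ."""
    if q_dob is None or m_dob is None:
        return False
    same_month_and_year = (q_dob.year, q_dob.month) == (m_dob.year, m_dob.month)
    return same_month_and_year and "XYZ" in (q_facility, m_facility)
```

Restricting the clue to facility XYZ keeps the looser date comparison from applying to sources whose day of birth is trustworthy.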

17 ClueMaker: Built for Stacked Data
Stacked data: multiple values for a single field
Examples:
  Stacked first names: name, nickname, misspelling, middle name that grandma prefers and reports as first name, etc.
  Stacked addresses: address history; some doctors may continue to report old address
Stacking of data improves matching accuracy
ChoiceMaker and ClueMaker built for stacking
// There exists a valid matching first name
clue mFirstNames {
  match same(r.names.firstName);
}
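Read literally, the comment "there exists a valid matching first name" suggests that a stacked comparison succeeds when any valid value on one record equals any valid value on the other. A minimal Python sketch of that reading follows; the function name and the case-insensitive, whitespace-trimmed comparison are assumptions.

```python
from typing import Iterable, Optional, Set


def _valid_values(values: Iterable[Optional[str]]) -> Set[str]:
    """Drop empty entries and normalize the rest (the normalization is an assumption)."""
    return {v.strip().upper() for v in values if v and v.strip()}


def same_stacked(q_values: Iterable[Optional[str]],
                 m_values: Iterable[Optional[str]]) -> bool:
    """True when the two stacks share at least one valid value."""
    return not _valid_values(q_values).isdisjoint(_valid_values(m_values))
```

For example, a record carrying both "Andrew" and "Andy" as first names would match a record that lists only "Andy", which a single-valued comparison of "Andrew" against "Andy" would miss.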

18 Benefits of ChoiceMaker for MCI
Integrates easily into DOH computing environment
Can capture peculiarities of data
  Fast & inexpensive to modify if a new peculiarity comes up (e.g., new data provider with quirks)
  Can easily create many clues
Designed to handle stacked data well
Bottom line: High accuracy

