THRio Database Linkage and THRio Database Issues.


Database matching
- There are several systems that do not “talk” to each other
  - SINAN – reportable diseases (TB, AIDS)
  - SIM – Mortality
  - SICOM – Pharmaceutical database (ARVs)
  - THRio – Our DB
- Original plan
  - Match THRio with all 3 other DBs above

Database matching
- Problems
  - There is no unique identifier common to all systems
  - We use name, gender, DOB and mother’s name as surrogates
  - The information is not uniform – many missing variables, especially mother’s name
- THRio
  - Standardization of name abbreviations
  - Double data entry
  - Not enough – names are still misspelled
- The other databases – even worse
  - No QC

Database matching
- Proposed strategy
  - Compare different approaches
    - Translated SOUNDEX
    - Reclink – probabilistic linkage
    - Other algorithms
  - Apply to different examples and get sensitivity/specificity for each one
- SICOM
  - Sequential matching
  - Match TB before doing the sequential matching

Database matching
- The project was split:
  - ARV database revisited
  - Development of a new algorithm for database linkage

Database matching
- ARV database revisited
  - Consistency problems (as pointed out before)
  - First HAART abstracted for THRio
  - Inconsistency confirmed
    - Dates did not match (40%)
    - Drugs did not match
  - Now all the ART history will be collected (since HAART only)
  - Should we insist and compare the database with the whole history?

Database matching
- Development of algorithm for database linkage
  - Using Python to implement the interface
  - Adapted soundex algorithm
  - “Gestalt” algorithm – rather hyperbolic
  - Direct field comparisons
  - Including a hierarchical structure for searching and comparing records
    - Means taking advantage of differences in the amount of information available
  - Computational problems
    - Optimization

Database matching
- Blocking
  - Speeds up computation
  - I’ll be concerned only with records that are a little similar to begin with
  - Soundex
    - First and last names
    - Mother’s first and last names
    - First name and mother’s last name
  - Needed to expand to account for errors in the first letter of the first and last names
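The blocking step above could be sketched roughly as follows. This is only an illustration: the slides mention an adapted ("translated") Soundex whose rules are not given, so standard American Soundex is used as a stand-in, and the record field names (`first`, `last`, `mother_first`, `mother_last`) and the sample records are hypothetical.

```python
import unicodedata
from collections import defaultdict

# Standard American Soundex codes (a stand-in for the adapted version).
_CODES = {ch: digit
          for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                                 ("L", "4"), ("MN", "5"), ("R", "6"))
          for ch in letters}


def soundex(name):
    """Return the 4-character Soundex code of a name ('' if no letters)."""
    # Decompose accented letters so that, e.g., "João" is coded like "Joao".
    text = unicodedata.normalize("NFKD", name or "").upper()
    letters = [c for c in text if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = _CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "HW":  # H and W do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def blocking_keys(rec):
    """Yield the three Soundex blocking keys listed on the slide."""
    first, last = soundex(rec.get("first")), soundex(rec.get("last"))
    m_first = soundex(rec.get("mother_first"))
    m_last = soundex(rec.get("mother_last"))
    for label, a, b in (("name", first, last),
                        ("mother", m_first, m_last),
                        ("mixed", first, m_last)):
        if a and b:  # skip keys that would rely on a missing field
            yield (label, a, b)


def candidate_pairs(file_a, file_b):
    """Return index pairs (i, j) that share at least one blocking key."""
    index = defaultdict(set)
    for j, rec in enumerate(file_b):
        for key in blocking_keys(rec):
            index[key].add(j)
    pairs = set()
    for i, rec in enumerate(file_a):
        for key in blocking_keys(rec):
            pairs.update((i, j) for j in index.get(key, ()))
    return pairs


# Hypothetical records, for illustration only.
thrio = [{"first": "Maria", "last": "Souza",
          "mother_first": "Ana", "mother_last": "Souza"}]
sim = [{"first": "Marya", "last": "Sousa", "mother_first": "", "mother_last": ""},
       {"first": "Pedro", "last": "Lima",
        "mother_first": "Rita", "mother_last": "Lima"}]
pairs = candidate_pairs(thrio, sim)
```

Only pairs that share a key go on to the full comparison, which is where the speed-up comes from. Because Soundex keeps the first letter of the name, an error in that letter puts the record in a different block; that is presumably why the keys had to be expanded, and this sketch does not attempt that expansion.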

Database matching
- Full comparison
  - All fields exactly the same
  - Small error in DOB
  - Similar names (gestalt) – generates scores
  - A combination of the above
- Several “levels” created
- Have to choose 2 cutoffs
  - Not a match
  - Definitely a match
  - Have to manually decide
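A minimal sketch of name scoring with two cutoffs is shown below. Python's standard `difflib.SequenceMatcher` is documented as closely related to Ratcliff and Obershelp's "gestalt pattern matching"; whether the THRio code used it is an assumption, and the cutoff values (0.75 and 0.92) are invented for illustration.

```python
from difflib import SequenceMatcher


def name_score(a, b):
    """Gestalt-style similarity between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def classify(score, lower=0.75, upper=0.92):
    """Map a score to one of three zones using two cutoffs (values assumed)."""
    if score >= upper:
        return "match"
    if score < lower:
        return "non-match"
    return "manual review"


# Hypothetical name pairs, for illustration only.
same = classify(name_score("Maria Silva", "MARIA SILVA"))
close = classify(name_score("Antonio", "Antonia"))
far = classify(name_score("Joao", "Pedro"))
```

Pairs scoring between the two cutoffs fall into the zone that has to be decided manually, so moving the cutoffs trades manual workload against the risk of wrong automatic decisions.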

Database matching
- Computational problems – testing phase
  - Using PostgreSQL and Python
  - Too slow when matching with the TB database
    - > 100,000 records
  - Changed the algorithm to Python only
- Computational times (currently)
  - THRio x SIM (12,689 x 2,922)
    - 3-4 minutes
  - THRio x TB (12,689 x 102,919)
    - minutes

Database matching
- Results
  - First we chose a sample of the mortality database
    - Year 2005
    - AIDS only
    - 871 records
  - Matched with THRio database
    - 10,344 records at the time

Database matching
- Compared Manual x Reclink x Algorithm
  - We were going to use the manual linkage as the gold standard
  - The algorithm found 13 extra correct matches
  - We used the combination of those as the standard
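The comparison of linkage methods against a combined reference standard could be computed along these lines. The function name, the pair identifiers and the counts are all invented for illustration; none of them come from the study.

```python
def evaluate(found, reference, total_pairs):
    """Return sensitivity, specificity and PPV for a set of linked pairs.

    found and reference are sets of (id_a, id_b) pairs; total_pairs is the
    number of candidate pairs that were compared.
    """
    tp = len(found & reference)   # links that are in the standard
    fp = len(found - reference)   # links that are not in the standard
    fn = len(reference - found)   # true links that were missed
    tn = total_pairs - tp - fp - fn
    nan = float("nan")
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else nan,
        "specificity": tn / (tn + fp) if tn + fp else nan,
        "ppv": tp / (tp + fp) if tp + fp else nan,
    }


# Toy data: the reference is the manual links plus one extra confirmed link.
manual = {(1, 10), (2, 20), (3, 30)}
algorithm = {(1, 10), (2, 20), (3, 30), (4, 40), (5, 50)}
reference = manual | {(4, 40)}

manual_result = evaluate(manual, reference, total_pairs=100)
algorithm_result = evaluate(algorithm, reference, total_pairs=100)
```

In record linkage the non-matching pairs vastly outnumber the matches, so specificity tends to sit close to 1 for any reasonable method; sensitivity and PPV are usually the more informative figures.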

Database matching

- The algorithm outperformed both RecLink and manual check
  - But only after some adjustments
  - That was just the “training phase”
- The only mistake still has to be checked: it may be a twin brother
  - Full info and only one different letter in the first name
- We still have to test it again with a different sample and with TB

Database matching
- THRio (latest) x SIM ( )
  - 340 matches (total)
  - Only 79 (23%) to be manually checked
  - This means that both DBs have good quality, at least in terms of completeness
  - Ended up with 273 matches and one possible mistake
- When we actually implement it…
  - Extra check with date of last annotation in the chart

Database matching
- Challenge
  - TB database
    - Data quality is much poorer than SIM
    - Might lead to lower sensitivity
    - Will lead to much more manual checking
  - Development of interface to help work

Database matching
- THRio (latest) x TB ( )
  - 6453 matches (total)
  - 3870 (60%) to be manually checked
  - 721 (11%) with names only
  - Quality is much worse than SIM
  - Many duplicates
- Proposed solutions:
  - Reduce time frame (for prospective TB cases only)
  - Use date of TB diagnosis to exclude duplicates
  - GUI to help

Database matching
- Further discussion for mortality:
  - What database to use?
    - All causes x HIV/AIDS as the underlying cause
      - Patients may be dying of other causes
    - Municipality x State
      - Patients may live in other cities
      - Municipality just records deaths that occurred in the city

Data analysis issues

- Complex structure
  - Currently 17 tables with information
  - Dates are not date fields
    - We need dates!!!
- We don’t collect information about specific visits
  - It is the information since last annotation up to the current one – could mean multiple visits
  - Definitions are hard to make

Data analysis issues
- All the events have to be based on dates
- Partial missing dates
  - In general I’ll accept missing days – set to the 15th
- What to use as a surrogate?
  - For data collected under the study – date of last annotation
  - What about baseline data?
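The rule for partially missing dates might look like the sketch below. It assumes the day, month and year are available as separate values (the slides say only that dates are not stored in date fields), and the function name is hypothetical.

```python
from datetime import date


def impute_date(year, month, day):
    """Build a date, replacing a missing day with the 15th.

    Returns None when the year or month is missing, or when the parts do
    not form a valid date; such cases need a surrogate date instead.
    """
    if year is None or month is None:
        return None
    try:
        return date(year, month, 15 if day is None else day)
    except ValueError:
        return None  # impossible date such as 31 February


complete = impute_date(2006, 6, 1)
missing_day = impute_date(2005, 9, None)
missing_month = impute_date(2005, None, 3)
```

A `None` result marks the records for which a surrogate, such as the date of last annotation, would have to be chosen.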

Data analysis issues
- Definition of baseline data
  - Study begins on September 1st, 2005
  - Baseline data collection finished in June 2006
  - “Baseline form” doesn’t mean baseline information
  - Is it baseline for the study or for the patient?
  - What about new patients? Do they have baseline data?

Data analysis issues
- Definition of a new patient
  - We have two “candidate” dates
    - Date of enrollment in the clinic
      - Could be long before HIV diagnosis
    - Date of HIV diagnosis
      - Could be long before enrollment in that clinic
  - A “new” patient is not necessarily new, depending on what we want
    - Do we need newly diagnosed or newly enrolled?
    - Should we use both?

Data analysis issues
- Several possible outcomes
  - Primary outcome of study (TB)
  - Secondary outcome (death)
  - Operational outcomes
    - Waiting for PPD
    - PPD placed and read
    - Reactive PPD
    - INH started
- How to deal with all of these?

Data analysis issues
- General output for data analysis
  - For each patient, look for baseline status
    - As of Sept 2005 or at enrollment
  - Look for all changes in time
    - Need the dates!!!
  - Set up like a database for survival analysis
    - For every change repeat records with
      - Initial status
      - Initial date
      - Final status
      - Final date
  - Possible to customize for specific outcomes
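One way to produce the survival-style layout described above is sketched here. The input format (a list of dated status observations per patient), the status labels and the dates are assumptions made for illustration, not the THRio schema.

```python
from datetime import date


def status_intervals(patient_id, observations, end_of_followup):
    """Turn (date, status) observations into one row per status change."""
    ordered = sorted(observations, key=lambda obs: obs[0])
    # Keep only the observations where the status actually changes.
    changes = []
    for when, status in ordered:
        if not changes or changes[-1][1] != status:
            changes.append((when, status))
    rows = []
    for (start, status), (stop, next_status) in zip(changes, changes[1:]):
        rows.append({"patient": patient_id,
                     "initial_status": status, "initial_date": start,
                     "final_status": next_status, "final_date": stop})
    if changes:
        # The last status is carried unchanged to the end of follow-up.
        start, status = changes[-1]
        rows.append({"patient": patient_id,
                     "initial_status": status, "initial_date": start,
                     "final_status": status, "final_date": end_of_followup})
    return rows


# Hypothetical patient history, for illustration only.
history = [(date(2005, 10, 15), "reactive PPD"),
           (date(2005, 9, 1), "waiting for PPD"),
           (date(2006, 1, 15), "INH started")]
rows = status_intervals(1, history, end_of_followup=date(2006, 6, 30))
```

Each row is an interval with a start state and an end state, so an outcome-specific dataset can be obtained by filtering on `final_status`; a last row whose status does not change corresponds to censoring at the end of follow-up.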

Thank you!