Data Quality Sharp project 5 June 2010. Statistical Problems with Data Quality in EHR: Missing Data, Uncertain Diagnosis.

Presentation transcript:

1 Data Quality Sharp project 5 June 2010

2 Statistical Problems with Data Quality in EHR
Missing Data
Uncertain Diagnosis
Uneven/unequal precision / measurement error
Bias
…

3 Missing Data: (Rage in Statistical Theory)
Common problem with observational/retrospective data
Statistical approaches
– Imputation
– Multiple imputation (MI) (Statisticians have acronyms too)
– Regression with residual error
– → draw from posterior distribution

4 Missing Data: Empirical approach
Regression on Y with missing X-variables
“X is missing” is also information.
Analyze data set using
– Imputation (mean?)
– “missing” indicator
– Empirical approach: let data tell you what to do
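
The "missing" indicator idea on this slide can be sketched in a few lines. This is a minimal illustration, assuming simple mean imputation; the function name and the toy values are hypothetical and do not come from the presentation.

```python
from statistics import mean

def impute_with_indicator(xs):
    """Replace None with the observed mean and return a 0/1 missingness flag."""
    observed = [x for x in xs if x is not None]
    fill = mean(observed)
    filled = [fill if x is None else x for x in xs]
    missing = [1 if x is None else 0 for x in xs]
    return filled, missing

# Hypothetical toy data: X is missing for two of five subjects.
x = [1.0, None, 3.0, None, 5.0]
x_filled, x_missing = impute_with_indicator(x)
print(x_filled)   # [1.0, 3.0, 3.0, 3.0, 5.0]
print(x_missing)  # [0, 1, 0, 1, 0]
```

Both columns would then enter the regression on Y. A nonzero coefficient on the indicator column is one way the data can "tell you" that missingness itself carries information.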

5 Uncertain diagnosis
Universal problem with health data
No Gold standard
Disease/health is a spectrum, not a dichotomy
Probabilistic perspective
– Probability (Peripheral Arterial Disease)
– From {0,1} to [0,1] as phenotype
– More realistic phenotype?

6 Uncertain Diagnosis
Result is a probability
Probability is a posterior distribution of a 0/1 variable
– Use p itself (certainty equivalent)
  Analogous to single imputation
– Use multiple imputation
  “1” with probability p, “0” with probability 1-p
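
The two options on this slide can be compared on a toy example. The sketch below assumes a hypothetical list of per-subject probabilities and estimates prevalence both ways; the function name, the seed, and the numbers are illustrative only.

```python
import random

def impute_diagnoses(probs, n_imputations, seed=0):
    """Draw completed 0/1 data sets; subject i is "1" with probability probs[i]."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p else 0 for p in probs]
            for _ in range(n_imputations)]

# Hypothetical posterior probabilities of disease for five subjects.
probs = [0.05, 0.20, 0.50, 0.90, 1.00]

# Certainty equivalent: use p itself (analogous to single imputation).
prevalence_ce = sum(probs) / len(probs)

# Multiple imputation: average the prevalence over many imputed data sets.
draws = impute_diagnoses(probs, n_imputations=200)
prevalence_mi = sum(sum(d) / len(d) for d in draws) / len(draws)

print(prevalence_ce, prevalence_mi)
```

For a simple mean the two estimates agree up to simulation noise. The imputed data sets additionally show how much the estimate varies because the diagnoses are uncertain, which using p alone does not.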

7 Uncertain Diagnosis: PAD example (eMERGE)
Mayo Vascular Lab Database: n=18000
Gold Standard: Ankle/Brachial Index (ABI)
Use of Diagnostic / procedural codes
– ICD-9 / HICDA / CPT
Logistic regression of gold standard (PAD by ABI) on diagnostic codes
→ posterior probability of PAD
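
As an illustration of the last step, the sketch below fits a logistic regression of a 0/1 gold standard on a single 0/1 code indicator and returns a probability per patient. It is a toy under stated assumptions, not the eMERGE model: the data are invented, there is one hypothetical code, and the fit uses plain gradient ascent in pure Python.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_prob(w, x):
    """Probability of PAD for code indicators x; w[0] is the intercept."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(rows, labels, n_iter=5000, lr=0.5):
    """Fit by gradient ascent on the mean log-likelihood."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(n_iter):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = y - predict_prob(w, x)
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wj + lr * g / len(rows) for wj, g in zip(w, grad)]
    return w

# Hypothetical training data: 10 patients with the code (9 have PAD by ABI)
# and 10 patients without it (2 have PAD by ABI).
rows = [[1]] * 10 + [[0]] * 10
labels = [1] * 9 + [0] + [1] * 2 + [0] * 8

w = fit_logistic(rows, labels)
print(predict_prob(w, [1]))  # close to the observed proportion 0.9
print(predict_prob(w, [0]))  # close to the observed proportion 0.2
```

With a single binary predictor the fitted probabilities reproduce the observed proportions in each group. With many codes, the model pools their evidence into one probability per patient.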

8 Uncertain Diagnosis
Model for Pr{PAD}: 90% predictive value
Export model for Pr{PAD} to patients without gold standard ascertainment?
(Coding practices?)

9 Uncertain Diagnosis
Use Pr{PAD} in analysis of
– Incidence of PAD
– Incidence trends
– Surveillance
– Analysis of etiology, risk factors

10 Unequal Precision of continuous phenotype
eMERGE example: Red Blood Count
Use retrospective Laboratory Data
N=3000, K=20,000
– 1 → 100 measurements/subject
Account for differential precision
Components of variance
Weighted regression?
Posterior distribution: same model fits
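
One way to read "components of variance" plus "weighted regression" is to weight each subject's average by the inverse of its variance. The sketch below assumes the between-subject and within-subject variances are already known; the function names and all numbers are hypothetical.

```python
def subject_mean_variance(n_measurements, var_between, var_within):
    """Variance of a subject's average: between-subject + within-subject / n."""
    return var_between + var_within / n_measurements

def precision_weighted_mean(means, counts, var_between, var_within):
    """Average subject means with weights 1 / Var(subject mean)."""
    weights = [1.0 / subject_mean_variance(n, var_between, var_within)
               for n in counts]
    return sum(w * m for w, m in zip(weights, means)) / sum(weights)

# Hypothetical subjects: one measured once, one measured 100 times.
means = [4.5, 4.8]
counts = [1, 100]

# Assumed variance components (illustrative values only).
print(subject_mean_variance(1, 0.10, 0.04))
print(subject_mean_variance(100, 0.10, 0.04))
print(precision_weighted_mean(means, counts, 0.10, 0.04))
```

When the between-subject variance dominates, the weights are nearly equal and extra measurements add little. When the within-subject variance dominates, the subject with 100 measurements carries most of the weight.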

11 Sample from Posterior Distribution
Missing Data, uncertain diagnosis, unequal precision can all be represented by sampling from posterior distribution
They are all the “same problem”
Statistical / computational tools for this have been developed
– Markov Chain Monte Carlo (MCMC)
– Multiple Imputation

12 Summary: Data Quality
‘Data’ is not ‘a number’ but ‘a posterior distribution’
– Mean and variance
– Posterior probability
Data quality
– Don’t try to change it
– Measure it
– Allow for it: propagation of error

13 What is “Data”?
Data is whatever input goes into the next procedure.
(= output from previous procedure)
‘Propagation of error’
Output of NLP is also “Data”

14 How to Assess Data Quality?
What if there is no Gold Standard?
Use any external standard
– E.g. outcome data
Stronger predictive relationship = better signal/noise ratio?
“Errors-in-variables” principle
– Larger error in X → smaller beta for Y|X
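
The errors-in-variables principle can be checked with a small simulation. Under the classical model (independent additive error in X), the least-squares slope shrinks by the factor Var(X) / (Var(X) + Var(error)). The sketch below assumes that model; the sample size, true slope, and seed are arbitrary illustrative choices.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def simulate_slopes(n=20000, beta=2.0, error_sd=1.0, seed=1):
    """Return (slope using true X, slope using X measured with error)."""
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [beta * x + rng.gauss(0.0, 1.0) for x in x_true]
    x_noisy = [x + rng.gauss(0.0, error_sd) for x in x_true]
    return ols_slope(x_true, y), ols_slope(x_noisy, y)

slope_true, slope_noisy = simulate_slopes()
# Var(X) = 1 and Var(error) = 1, so the noisy slope should be near 2.0 * 1/2.
print(slope_true, slope_noisy)
```

Read in reverse, this is the slide's suggestion: of two versions of the same X, the one with the stronger relationship to an external outcome is likely the one with less measurement error.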

15 Summary: Help!
What are the important tasks in Data Quality?
– Measurement?
– Allowance for?
Important tasks for this Project?
– Integrate with other projects

