Published by Lesley Neal. Modified over 9 years ago.
1
Data Quality
Sharp project
5 June 2010
2
Statistical Problems with Data Quality in EHR
Missing data
Uncertain diagnosis
Uneven/unequal precision / measurement error
Bias
…
3
Missing Data: (Rage in Statistical Theory)
Common problem with observational/retrospective data
Statistical approaches
  – Imputation
  – Multiple imputation (MI) (statisticians have acronyms too)
  – Regression with residual error: draw from posterior distribution
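A minimal sketch of the multiple-imputation idea, not the presenter's own code. It assumes a variable `x` with missing entries (`None`) and a fully observed covariate `z`; those names, the helper functions, and the use of a bootstrap to approximate a posterior draw of the regression parameters are illustrative assumptions. Each completed data set is analyzed separately and the results are pooled with Rubin's rules.

```python
import random
import statistics


def fit_line(xs, ys):
    """Simple least-squares line; returns (intercept, slope, residual_sd)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    resid = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    sigma = (sum(r * r for r in resid) / (len(xs) - 2)) ** 0.5
    return intercept, slope, sigma


def multiply_impute_mean(z, x, m=20, seed=0):
    """Estimate the mean of x (None = missing) using m imputed data sets.

    Hypothetical helper: z is a fully observed predictor of x.
    Returns (pooled_mean, total_variance) following Rubin's rules.
    """
    rng = random.Random(seed)
    observed = [(zi, xi) for zi, xi in zip(z, x) if xi is not None]
    estimates, variances = [], []
    for _ in range(m):
        # Bootstrap the observed cases to mimic parameter uncertainty.
        boot = [rng.choice(observed) for _ in observed]
        a, b, s = fit_line([p[0] for p in boot], [p[1] for p in boot])
        # Fill each gap with prediction plus residual error.
        completed = [
            xi if xi is not None else a + b * zi + rng.gauss(0.0, s)
            for zi, xi in zip(z, x)
        ]
        estimates.append(statistics.fmean(completed))
        variances.append(statistics.variance(completed) / len(completed))
    pooled = statistics.fmean(estimates)
    within = statistics.fmean(variances)
    between = statistics.variance(estimates)
    total = within + (1 + 1 / m) * between
    return pooled, total
```

The between-imputation term is what single imputation leaves out: it carries the extra uncertainty due to the values being missing.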
4
Missing Data: Empirical approach
Regression on Y with missing X-variables
“X is missing” is also information.
Analyze data set using
  – Imputation (mean?)
  – “Missing” indicator
  – Empirical approach: let data tell you what to do
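The missing-indicator approach can be sketched as follows, assuming a single X-variable with `None` for missing values. The function names and the small normal-equations solver are illustrative, not taken from the slides. The missing X is mean-imputed and a 0/1 "missing" column is added, so the regression itself estimates how much information "X is missing" carries about Y.

```python
import statistics


def add_missing_indicator(x):
    """Mean-impute x and return (x_filled, is_missing) as two columns."""
    observed = [v for v in x if v is not None]
    mean_x = statistics.fmean(observed)
    x_filled = [mean_x if v is None else v for v in x]
    is_missing = [1.0 if v is None else 0.0 for v in x]
    return x_filled, is_missing


def ols(columns, y):
    """Least squares with intercept; returns [intercept, b1, b2, ...]."""
    rows = [[1.0] + [c[i] for c in columns] for i in range(len(y))]
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]


def fit_with_missing_indicator(x, y):
    """Regress y on mean-imputed x plus the 'x is missing' indicator."""
    x_filled, is_missing = add_missing_indicator(x)
    return ols([x_filled, is_missing], y)
```

A coefficient on the indicator that is far from zero suggests the missingness is informative about Y, which is one way of letting the data say what to do.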
5
Uncertain Diagnosis
Universal problem with health data
No gold standard
Disease/health is a spectrum, not a dichotomy
Probabilistic perspective
  – Probability (Peripheral Arterial Disease)
  – From {0,1} to [0,1] as phenotype
  – More realistic phenotype?
6
Uncertain Diagnosis
Result is a probability
Probability is a posterior distribution of a 0/1 variable
  – Use p itself (certainty equivalent)
    Analogous to single imputation
  – Use multiple imputation
    “1” with probability p, “0” with probability 1-p
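The two options can be compared in a short sketch, here for estimating prevalence from a list `p` of per-patient disease probabilities. The function names and the choice of prevalence as the target quantity are assumptions made for illustration. The first function uses p directly; the second draws 0/1 phenotypes repeatedly and pools the results with Rubin's rules.

```python
import random
import statistics


def prevalence_from_p(p):
    """Certainty-equivalent estimate: average the probabilities themselves."""
    return statistics.fmean(p)


def prevalence_by_imputation(p, m=50, seed=0):
    """Draw m sets of 0/1 phenotypes ("1" with probability p_i) and pool.

    Returns (pooled_prevalence, total_variance).
    """
    rng = random.Random(seed)
    n = len(p)
    estimates, variances = [], []
    for _ in range(m):
        drawn = [1 if rng.random() < pi else 0 for pi in p]
        est = sum(drawn) / n
        estimates.append(est)
        variances.append(est * (1 - est) / n)  # binomial variance of a proportion
    pooled = statistics.fmean(estimates)
    total = statistics.fmean(variances) + (1 + 1 / m) * statistics.variance(estimates)
    return pooled, total
```

The point estimates agree up to simulation noise; the imputation version also returns a variance that includes the uncertainty in each diagnosis.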
7
Uncertain Diagnosis: PAD example (eMERGE)
Mayo Vascular Lab Database: n=18000
Gold standard: Ankle/Brachial Index (ABI)
Use of diagnostic / procedural codes
  – ICD-9 / HICDA / CPT
Logistic regression of gold standard (PAD by ABI) on diagnostic codes
Posterior probability of PAD
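As an illustration of the modeling step only, the sketch below fits a logistic regression of a 0/1 gold-standard label on 0/1 code indicators and returns a probability for a new patient. It is not the eMERGE model: the plain gradient-ascent fit, the function names, and the idea of one indicator column per code are assumptions chosen to keep the example self-contained.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_logistic(X, y, steps=2000, lr=0.5):
    """Fit by gradient ascent on the mean log-likelihood.

    X: list of rows of 0/1 code indicators (hypothetical coding).
    y: list of 0/1 gold-standard labels (e.g. PAD by ABI).
    Returns weights [intercept, w1, w2, ...].
    """
    k = len(X[0])
    w = [0.0] * (k + 1)
    n = len(y)
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = yi - p
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w


def disease_probability(w, codes):
    """Model-based probability of disease for one patient's code indicators."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], codes)))
```

The fitted probability for a patient is then the [0,1] phenotype that later analyses can use in place of a hard 0/1 label.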
8
Uncertain Diagnosis
Model for Pr{PAD}: 90% predictive value
Export model for Pr{PAD} to patients without gold standard ascertainment?
(Coding practices?)
9
Uncertain Diagnosis
Use Pr{PAD} in analysis of
  – Incidence of PAD
  – Incidence trends
  – Surveillance
  – Analysis of etiology, risk factors
10
Unequal Precision of Continuous Phenotype
eMERGE example: Red Blood Count
Use retrospective laboratory data
N=3000, K=20,000
  – 1 to 100 measurements/subject
Account for differential precision
Components of variance
Weighted regression?
Posterior distribution: same model fits
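One way to account for differential precision is sketched below, assuming each subject contributes a list of repeated measurements. Under a simple two-component model the variance of a subject's mean is tau2 + sigma2/n, where tau2 is the between-subject variance and sigma2 the within-subject variance, so subjects with more measurements get more weight. The function names and the rough moment estimator for tau2 are illustrative assumptions, not the method used in the eMERGE analysis.

```python
import statistics


def precision_weights(counts, sigma2, tau2):
    """Weight for each subject mean: 1 / (tau2 + sigma2 / n_i)."""
    return [1.0 / (tau2 + sigma2 / n) for n in counts]


def variance_components(groups):
    """Rough moment estimates of (sigma2, tau2).

    groups: one list of measurements per subject; at least one subject
    needs two or more measurements.
    """
    means = [statistics.fmean(g) for g in groups]
    repeated = [g for g in groups if len(g) > 1]
    ss = sum(sum((v - statistics.fmean(g)) ** 2 for v in g) for g in repeated)
    df = sum(len(g) - 1 for g in repeated)
    sigma2 = ss / df  # pooled within-subject variance
    mean_inv_n = statistics.fmean([1.0 / len(g) for g in groups])
    # Variance of subject means is about tau2 + sigma2 * mean(1/n_i).
    tau2 = max(statistics.variance(means) - sigma2 * mean_inv_n, 0.0)
    return sigma2, tau2


def weighted_mean(groups):
    """Precision-weighted population mean and its approximate variance."""
    sigma2, tau2 = variance_components(groups)
    w = precision_weights([len(g) for g in groups], sigma2, tau2)
    means = [statistics.fmean(g) for g in groups]
    est = sum(wi * mi for wi, mi in zip(w, means)) / sum(w)
    return est, 1.0 / sum(w)
```

The same weights could feed a weighted regression of the phenotype on covariates.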
11
Sample from Posterior Distribution
Missing data, uncertain diagnosis, unequal precision can all be represented by sampling from posterior distribution
They are all the “same problem”
Statistical / computational tools for this have been developed
  – Markov Chain Monte Carlo (MCMC)
  – Multiple imputation
12
Summary: Data Quality
‘Data’ is not ‘a number’ but ‘a posterior distribution’
  – Mean and variance
  – Posterior probability
Data quality
  – Don’t try to change it
  – Measure it
  – Allow for it: propagation of error
13
What is “Data”?
Data is whatever input goes into the next procedure.
(= output from previous procedure)
‘Propagation of error’
Output of NLP is also “Data”
14
How to Assess Data Quality?
What if there is no gold standard?
Use any external standard
  – E.g. outcome data
Stronger predictive relationship = better signal/noise ratio?
“Errors-in-variables” principle
  – Larger error in X -> smaller beta for Y|X
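The errors-in-variables principle can be checked with a small simulation, given here as a sketch under stated assumptions: X is standard normal, Y = beta·X + noise, and the recorded X carries independent normal error. The function names and parameter values are illustrative. In this setup the fitted slope shrinks toward beta / (1 + error_sd²), so a weaker observed relationship with an external outcome points to noisier measurement of X.

```python
import random


def slope(x, y):
    """Least-squares slope of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def attenuated_slopes(error_sds, n=5000, beta=2.0, seed=0):
    """Fit Y on error-contaminated X for each measurement-error SD.

    Returns {error_sd: fitted_slope}.
    """
    rng = random.Random(seed)
    x_true = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [beta * x + rng.gauss(0.0, 1.0) for x in x_true]
    result = {}
    for sd in error_sds:
        x_obs = [x + sd * rng.gauss(0.0, 1.0) for x in x_true]
        result[sd] = slope(x_obs, y)
    return result
```

Calling `attenuated_slopes([0.0, 1.0, 2.0])` should give slopes near 2, 1 and 0.4 respectively, up to sampling noise.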
15
Summary: Help!
What are the important tasks in Data Quality?
  – Measurement?
  – Allowance for?
Important tasks for this project?
  – Integrate with other projects