© 2009, KAISER PERMANENTE CENTER FOR HEALTH RESEARCH
Getting to Know Your Data: Basic Data Cleaning Principles
Believe it or not, most good data analysts probably spend the majority of their time cleaning data and only a relatively small percentage doing formal statistical analyses. Regardless of how good your quality control is, errors creep into datasets. In addition, missing data and skip patterns need to be dealt with, especially when creating new variables.
Things to look for
- impossible values
- improbable values
- obvious outliers
- do the data make sense?
- are there inconsistent or illogical patterns?
- are there missing data? If yes, and they are due to skip patterns, are there logical codes we can assign?
- are there text or alpha variables?
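To make the first few checks concrete, here is a minimal sketch in Python/pandas (not part of the original slides); the column names and plausibility limits are purely hypothetical.

```python
import pandas as pd

# Hypothetical example data; column names and values are illustrative only.
df = pd.DataFrame({
    "id":  [1, 2, 3, 4, 5],
    "age": [34, 52, 141, 67, -2],      # 141 and -2 are impossible/improbable
    "sbp": [118, 240, 125, 97, 132],   # 240 is possible but worth a second look
})

# Plausibility limits chosen for illustration, not clinical rules.
limits = {"age": (0, 110), "sbp": (60, 260)}

for var, (lo, hi) in limits.items():
    suspect = df[(df[var] < lo) | (df[var] > hi)]
    if not suspect.empty:
        print(f"Suspect values for {var}:")
        print(suspect[["id", var]])
```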
Strategies for exploring your data
- simple frequencies of all categorical variables
- univariate statistics (mean, SD, percentiles, minimum and maximum values) for all continuous variables
- selected crosstabs, especially for nested questions (e.g., if "yes" to Q1, then ask Q2)
- listings of selected variables
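As an illustration of these four strategies, here is a small pandas sketch; the dataset and variable names (q1, q2, age, fev1) are made up for the example.

```python
import pandas as pd
import numpy as np

# Small made-up dataset; variable names are illustrative.
df = pd.DataFrame({
    "id":   [1, 2, 3, 4, 5, 6],
    "q1":   ["yes", "no", "yes", "no", "yes", np.nan],
    "q2":   ["daily", np.nan, "weekly", "daily", np.nan, np.nan],
    "age":  [45, 62, 38, 71, 55, 49],
    "fev1": [3.1, 2.2, 7.6, 1.9, 2.8, 3.4],
})

# Simple frequencies for a categorical variable (include missing values).
print(df["q1"].value_counts(dropna=False))

# Univariate statistics for continuous variables.
print(df[["age", "fev1"]].describe())

# Crosstab for a nested question pair: q2 should only be answered when q1 = "yes".
print(pd.crosstab(df["q1"].fillna("missing"), df["q2"].fillna("missing")))

# Listing of selected variables for records that look odd.
print(df.loc[df["fev1"] > 6, ["id", "age", "fev1"]])
```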
Sample frequency table
Consider the following frequency table for the number of asthma hospitalizations in the past year. Are the "5" and "10" values real? How might you analyze such data?
Sample univariate stats
Does anything strike you as peculiar or suspect about this variable? The 4.42 was a data-entry error; it should have been 7.62.
Sample univariate stats
Does anything catch your attention? The 25.5 should have been 77.
Sample univariate stats
Does this table suggest any problems? What if I said this was from a study of survival in patients with > 6 months on LTOT (long-term oxygen therapy)?
Dealing with missing data
Consider the following table. How might we resolve the 3 people who answered both questions? What about the 7 folks who skipped Q5c but shouldn't have?
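One way to surface both problems at once is a crosstab of the two questions. Below is a sketch that assumes a hypothetical yes/no coding of Q5 (current smoker) and Q5c (ever smoker), where Q5c should be skipped when Q5 = "yes".

```python
import pandas as pd
import numpy as np

# Hypothetical coding; the actual study data are not reproduced here.
df = pd.DataFrame({
    "q5":  ["yes", "yes", "no",  "no",  "no", np.nan],
    "q5c": [np.nan, "yes", "no", np.nan, "yes", np.nan],
})

# A crosstab makes both kinds of problems visible: people who answered both,
# and people who skipped q5c but should not have.
print(pd.crosstab(df["q5"].fillna("missing"), df["q5c"].fillna("missing")))

# Flag records that violate the skip pattern for follow-up.
answered_both = (df["q5"] == "yes") & df["q5c"].notna()
skipped_wrongly = (df["q5"] == "no") & df["q5c"].isna()
print(df[answered_both | skipped_wrongly])
```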
Dealing with missing data
How would we define the following variable? Are there any problems with the following?
Smoke = 1 if Q5 (current smoker) = yes
Smoke = 2 if Q5c (ever smoker) = yes
Smoke = 3 otherwise
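A short sketch of why the "otherwise" bucket is dangerous: under this naive recode, anyone with missing data on both questions is silently classified as a never smoker (code 3). The coding below is illustrative, not the slide's actual data.

```python
import numpy as np

# Naive recode, as written on the slide (illustrative q5 / q5c coding).
def smoke_naive(q5, q5c):
    if q5 == "yes":
        return 1          # current smoker
    if q5c == "yes":
        return 2          # former smoker
    return 3              # "otherwise" -- including people with missing data!

# A respondent who skipped both questions is silently coded as a never smoker.
print(smoke_naive(np.nan, np.nan))   # -> 3, which is almost certainly wrong
```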
Dealing with missing data
What might be a better definition of Smoke that properly deals with missing data?
Smoke = 1 if Q5 (current smoker) = yes
Smoke = 2 if Q5c (ever smoker) = yes
Smoke = 3 if Q5 = no and Q5c = no
Smoke = "." (missing) otherwise
Even this doesn't work if we still have to deal with the 3 inconsistent responses!
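Sketched in pandas, this improved definition might look like the following; the data and the inconsistency flag are illustrative assumptions, and the flagged responders still need a manual decision.

```python
import pandas as pd
import numpy as np

# Made-up responses; the last record answered both questions.
df = pd.DataFrame({
    "q5":  ["yes", "no",  "no", np.nan, "yes"],
    "q5c": [np.nan, "yes", "no", np.nan, "yes"],
})

conditions = [
    df["q5"] == "yes",                          # current smoker
    df["q5c"] == "yes",                         # former smoker
    (df["q5"] == "no") & (df["q5c"] == "no"),   # never smoker
]
df["smoke"] = np.select(conditions, [1, 2, 3], default=np.nan)

# Anyone missing on both questions stays missing, and responders who
# answered both questions are flagged so they are not analyzed as-is.
df["inconsistent"] = (df["q5"] == "yes") & df["q5c"].notna()
print(df)
```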
Imputing values for logical skip patterns
Consider the following two questions: Q3a will be skipped, and hence missing, for everyone who answers "no" to Q3. Is there a logical value to assign in this case? What are the merits of assigning "0" (no) vs. leaving it missing (NA)?
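Both options can be sketched in pandas as follows; the q3/q3a coding is hypothetical.

```python
import pandas as pd
import numpy as np

# Hypothetical coding: q3 is the gate question, q3a is asked only when q3 = "yes".
df = pd.DataFrame({
    "q3":  ["yes", "no", "yes", "no"],
    "q3a": ["yes", np.nan, "no", np.nan],
})

# Option 1: assign a logical "no" (0) where the skip pattern explains the gap,
# so these respondents stay in denominator-type analyses.
df["q3a_imputed"] = df["q3a"].map({"yes": 1, "no": 0})
df.loc[df["q3"] == "no", "q3a_imputed"] = 0

# Option 2: leave the value missing (NA) and restrict analyses of q3a
# to the subgroup that was actually asked, i.e. df[df["q3"] == "yes"].
print(df)
```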
Listing data to check recodes
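The slide's listing itself is not reproduced here, but a typical recode check lists the source questions next to the derived variable. A small sketch, continuing the hypothetical smoke example from the previous slides:

```python
import pandas as pd
import numpy as np

# Hypothetical source questions and derived variable from the smoke recode.
df = pd.DataFrame({
    "q5":    ["yes", "no", "no", np.nan],
    "q5c":   [np.nan, "yes", "no", np.nan],
    "smoke": [1, 2, 3, np.nan],
})

# Side-by-side listing of source questions and the derived variable,
# plus a targeted listing of any record where the recode came out missing.
print(df[["q5", "q5c", "smoke"]].head(20))
print(df[df["smoke"].isna()])
```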
The Bottom Line: Garbage In = Garbage Out!
Spending the time getting to know and understand your data will pay off in the long run.
Statistics: Inside the Black Box
Statistics can be said to be about estimating quantities of interest (e.g., the prevalence of TB in a Rio favela, or the rate of decline of lung function with age) and then making inferences about those quantities (e.g., does TB prevalence vary by HIV status?). We will focus on the estimation step, including model building and interpreting the coefficients in your models.
Statistical Estimation Step: Maximum Likelihood Estimation - 1
Most people have heard of the normal distribution. When we say that some variable is normally distributed with mean μ and variance σ², we are tacitly assuming that we can write an equation describing the probability (or likelihood) of the observed data as a function of μ and σ². The values of μ and σ² that maximize that probability are termed "maximum likelihood estimates" (MLEs).
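For reference, the standard form of that likelihood and the values that maximize it (textbook results, not taken from the slide):

```latex
% Likelihood of n independent observations x_1,...,x_n from N(mu, sigma^2)
L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
                   \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)

% The values that maximize L are the maximum likelihood estimates:
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2
```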
Statistical Estimation Step: Maximum Likelihood Estimation - 2
Whenever you fit a regression model, you are asking the computer to generate maximum likelihood estimates. However, rather than simply estimating a single overall mean μ, we typically want to describe the mean in terms of other explanatory variables. For example:
mean FEV1 = β0 + β1·Age + β2·Height
The coefficients in this model (the βs) are also MLEs!
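As an illustration (not the course's actual software or data), such a model could be fit in Python with statsmodels; with normally distributed errors the least-squares coefficients are also the MLEs.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up illustrative data; fev1, age, and height are hypothetical column names.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({"age": rng.uniform(20, 80, n), "height": rng.normal(170, 10, n)})
df["fev1"] = 5.0 - 0.03 * df["age"] + 0.02 * df["height"] + rng.normal(0, 0.4, n)

# Fit FEV1 = b0 + b1*Age + b2*Height; under normal errors these are the MLEs.
model = smf.ols("fev1 ~ age + height", data=df).fit()
print(model.params)   # the estimated betas
```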
Statistical Estimation Step: Maximum Likelihood Estimation - 3
MLEs have two very desirable properties for statisticians:
- If the source data are normally distributed, then the MLEs will be normally distributed.
- Even if the source data are not normally distributed, the MLEs derived from such data will be approximately normally distributed for large enough sample sizes.
We use these properties to test specific hypotheses of interest (e.g., H0: β = 0).
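One common use of this approximate normality is the Wald test; as a textbook sketch:

```latex
% Approximate normality of an MLE \hat{\beta} gives the familiar Wald statistic
z = \frac{\hat{\beta}}{\widehat{\mathrm{SE}}(\hat{\beta})} \;\approx\; N(0, 1)
\quad \text{under } H_0\colon \beta = 0,

% so |z| > 1.96 corresponds to rejecting H_0 at the two-sided 0.05 level, and
% \hat{\beta} \pm 1.96\,\widehat{\mathrm{SE}}(\hat{\beta}) is an approximate 95% CI.
```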
Statistical Estimation Step: Maximum Likelihood Estimation - 4
In addition to the normal distribution, other common distributions and models used in the medical literature are:
- the binomial distribution, which forms the basis for logistic regression and is used to analyze binary (yes/no) data;
- the Poisson distribution, useful for modeling rates of occurrence; and
- the Cox proportional hazards model, used to analyze time-to-event data.
Statistical Estimation Step: Maximum Likelihood Estimation - 5
Each distribution gives rise to an equation that relates a basic parameter of the model to a collection of predictor variables. For example:
normal: μ = β0 + β1·Age + β2·Height
binomial: ln[P/(1−P)] = β0 + β1·Age + β2·Male
Cox: ln[λ(t)] = ln[λ0(t)] + β1·Pkyrs + β2·Male
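As an illustration of the binomial (logistic) case, here is a hedged Python sketch with made-up data; exponentiating the fitted coefficients gives odds ratios.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up illustrative data for the logistic (binomial) equation above.
rng = np.random.default_rng(1)
n = 500
df = pd.DataFrame({"age": rng.uniform(30, 80, n), "male": rng.integers(0, 2, n)})
logit_p = -4.0 + 0.05 * df["age"] + 0.7 * df["male"]
df["case"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

# ln[P/(1-P)] = b0 + b1*Age + b2*Male; the fitted betas are log odds ratios,
# so exp(beta) is the odds ratio per unit change in the predictor.
fit = smf.logit("case ~ age + male", data=df).fit()
print(np.exp(fit.params))
```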
Statistical Estimation Step: Maximum Likelihood Estimation - 6
We will teach you a systematic way to use these equations to help you interpret the coefficients in your model. We will also teach you how to construct your models so as to test specific biological hypotheses of interest.