Basic Biostatistics Prof Paul Rheeder Division of Clinical Epidemiology
Overview: Bias vs chance; Types of data; Descriptive statistics; Histograms and boxplots; Inferential statistics; Hypothesis testing (P and CI); Comparing groups; Correlation and regression
Research Questions? Does CK level predict in-hospital mortality post MI? Is there an association between troponin I and renal function? What is the incidence of amputation in diabetics with renal failure? HOW ARE THEY MEASURED?
Research question: Does aspirin reduce CV mortality in diabetics when used for primary prevention? Is cell phone use associated with an increased risk of brain cancer? Does level of SES correlate with depression?
Research question: Phrase your research question so that it can be answered YES or NO, or with some form of quantification.
Data analysis: The aim is to provide information on the study sample and to answer the research question!
Problems !
Problems: Bias and confounding, also called systematic error. Typically dealt with in the planning and execution of the study; confounding can also be controlled for in the data analysis (e.g. multivariate analysis). Chance, also called random error. Classically, P values (and CIs) are used to judge the role of chance.
First important issues: What type of data are you collecting? Typically one has an outcome variable and one or more exposure variables. How and with what are they measured?
Outcome and exposure? Does CK level predict in-hospital mortality post MI? Is there an association between troponin I and renal function? What is the incidence of amputation in diabetics with renal failure? HOW ARE THEY MEASURED?
Types of data: Categorical, e.g. hypertension (HT) yes or no, sex, smoking status (usually summarised as a %); ordinal versus nominal. Continuous data; spread of continuous data.
Data analysis: Descriptive stats: mean or median, with SD or range
Hypothesis testing: Differences between groups. Examples: t-test / Mann-Whitney (2 groups); ANOVA / Kruskal-Wallis (>2 groups); chi-square if the outcome is a proportion (%)
Associations between variables: Does coffee cause cancer (OR, RR)? Efficacy of Rx (RRR, ARR, NNT). Is BMI associated with BP (correlation and regression)?
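The efficacy measures named above (ARR, RRR, NNT) follow from two event rates. A minimal Python sketch, assuming a hypothetical trial with made-up counts (the function name and the numbers are illustrative, not from the slides):

```python
import math

def treatment_effect(events_control, n_control, events_treated, n_treated):
    """Return ARR, RRR and NNT from event counts in two trial arms."""
    cer = events_control / n_control   # control event rate
    eer = events_treated / n_treated   # experimental (treated) event rate
    arr = cer - eer                    # absolute risk reduction
    rrr = arr / cer                    # relative risk reduction
    # NNT is conventionally rounded up; undefined if there is no benefit
    nnt = math.ceil(1 / arr) if arr > 0 else None
    return {"CER": cer, "EER": eer, "ARR": arr, "RRR": rrr, "NNT": nnt}

# Hypothetical data: 20/100 events on placebo, 14/100 on treatment
effect = treatment_effect(20, 100, 14, 100)
print(effect)
```

The same relative reduction can hide very different absolute benefits, which is why ARR and NNT are usually reported alongside RRR.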
2 X 2 table
              Cancer   No cancer
Smoker          a          b
Non-smoker      c          d
RR = [a/(a+b)] / [c/(c+d)]
OR = (a/b) / (c/d) = ad/bc
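The two formulas can be checked numerically. A small Python sketch using the a, b, c, d cell labels of the table; the counts are hypothetical and chosen only for illustration:

```python
def risk_ratio(a, b, c, d):
    """RR = [a/(a+b)] / [c/(c+d)]: risk in exposed over risk in unexposed."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """OR = (a/b) / (c/d), which simplifies to ad/bc."""
    return (a * d) / (b * c)

# Hypothetical counts: smokers 30 cancer / 70 no cancer,
# non-smokers 10 cancer / 90 no cancer
a, b, c, d = 30, 70, 10, 90
rr = risk_ratio(a, b, c, d)
odds = odds_ratio(a, b, c, d)
print(f"RR = {rr:.2f}, OR = {odds:.2f}")
```

With these counts the OR is larger than the RR; the two only approximate each other when the outcome is rare.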
TYPES OF DATA
DESCRIPTIVE STATS
Graphics
Using the SD and the Normal Curve
Mean ± 1.96 SD = range covering about 95% of the sample values (for normally distributed data); Mean ± 1.96 SEM = 95% confidence interval for the mean
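The difference between the two intervals is easiest to see on actual numbers. A minimal Python sketch, assuming a small hypothetical set of glucose values and SEM = SD / sqrt(n); for a sample this small a t multiplier would normally replace 1.96, which is kept here only to match the slide:

```python
import math
import statistics

def describe(sample):
    """Return the mean, SD, SEM, the 95% range and the 95% CI of the mean."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)     # sample SD (n - 1 denominator)
    sem = sd / math.sqrt(n)           # standard error of the mean
    return {
        "mean": mean,
        "sd": sd,
        "sem": sem,
        "range95": (mean - 1.96 * sd, mean + 1.96 * sd),
        "ci95": (mean - 1.96 * sem, mean + 1.96 * sem),
    }

# Hypothetical fasting glucose values in mmol/l
glucose = [4.8, 5.1, 5.6, 6.0, 5.3, 4.9, 5.7, 6.2, 5.0, 5.4]
summary = describe(glucose)
print(summary)
```

The SD-based range describes the spread of individuals, while the SEM-based interval describes the uncertainty about the mean and shrinks as n grows.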
One of many samples
95% Confidence Intervals
Hypothesis Testing
Type I & II Errors Have an Inverse Relationship: If you reduce the probability of one error, the probability of the other increases, provided everything else is unchanged.
Factors Affecting Type II Error (β): True value of population parameter: β increases when the difference between the hypothesized parameter and its true value decreases. Significance level: β increases when α decreases. Population standard deviation: β increases when σ increases. Sample size: β increases when n decreases.
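These four relationships can be verified with a simple calculation. A Python sketch, assuming a two-sided one-sample z-test with known population SD (a simplification chosen for illustration; the function name and the input values are hypothetical):

```python
import math
from statistics import NormalDist

def type_ii_error(delta, sigma, n, alpha=0.05):
    """Beta for a two-sided one-sample z-test.

    delta: true difference from the hypothesized value
    sigma: population standard deviation
    n:     sample size
    alpha: significance level
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(delta) * math.sqrt(n) / sigma   # true difference in SE units
    # Probability that the test statistic stays inside the acceptance region
    return z.cdf(z_crit - shift) - z.cdf(-z_crit - shift)

base = type_ii_error(delta=5, sigma=10, n=30)
print("baseline beta:      ", round(base, 3))
print("smaller difference: ", round(type_ii_error(2, 10, 30), 3))
print("stricter alpha:     ", round(type_ii_error(5, 10, 30, alpha=0.01), 3))
print("larger SD:          ", round(type_ii_error(5, 20, 30), 3))
print("smaller sample:     ", round(type_ii_error(5, 10, 10), 3))
```

Each of the four variations should give a larger β than the baseline, in line with the list above. Power is 1 − β.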
Examples Difference in glucose between survivors and non survivors = 5 mmol/l (95% CI -5 to 10 mmol/l) RR for cancer =1.4 (95% CI 0.7 to 1.3)
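One way to read such intervals is to ask whether they contain the "no effect" value: 0 for a difference, 1 for a ratio. A short Python sketch using the two intervals quoted above (the helper name is illustrative):

```python
def includes_null(ci_low, ci_high, null_value):
    """True if the confidence interval contains the no-effect value."""
    return ci_low <= null_value <= ci_high

# Intervals quoted on the slide, paired with their no-effect value
examples = [
    ("glucose difference (mmol/l)", (-5.0, 10.0), 0.0),  # differences: null = 0
    ("RR for cancer", (0.7, 1.3), 1.0),                  # ratios: null = 1
]

for label, (low, high), null in examples:
    if includes_null(low, high, null):
        verdict = "includes the null value: compatible with no effect"
    else:
        verdict = "excludes the null value"
    print(f"{label}: 95% CI {low} to {high} {verdict}")
```

A 95% CI that includes the null value corresponds to P > 0.05 for the matching two-sided test.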
P value: H0 is NO difference, BUT I can find a difference by chance. E.g. WHAT is the probability of finding a difference between groups of 5 mmol/l or more when in TRUTH the difference is ZERO? P = 0.10
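The idea of "a difference by chance" can be made concrete with a permutation test: shuffle the group labels many times and count how often a difference at least as large as the observed one appears. This is a sketch of one possible approach, not the method used in the slides; the data and the function name are hypothetical:

```python
import random

def permutation_p_value(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation P value for a difference in means."""
    rng = random.Random(seed)          # fixed seed so the result is repeatable
    n_a = len(group_a)
    pooled = list(group_a) + list(group_b)

    def mean_diff(values):
        a, b = values[:n_a], values[n_a:]
        return abs(sum(a) / len(a) - sum(b) / len(b))

    observed = mean_diff(pooled)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)            # reassign group labels at random
        # small tolerance guards against floating-point rounding
        if mean_diff(pooled) >= observed - 1e-12:
            extreme += 1
    return extreme / n_permutations

# Hypothetical glucose values (mmol/l) in survivors and non-survivors
survivors = [6.1, 7.4, 5.8, 9.0, 6.6, 7.2, 8.1, 5.9]
non_survivors = [9.5, 12.3, 8.8, 14.1, 7.9, 11.0, 10.4, 13.2]
p = permutation_p_value(survivors, non_survivors)
print("P =", p)
```

The returned proportion estimates how often chance alone produces a difference as large as the one observed when H0 is true.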
[Stata cross-tabulation output; cell values not recoverable] Key: frequency, column percentage. Row totals: N = 48, Y = 49, Total = 97. Pearson chi2(1) = (value missing), Pr = 0.356
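For a 2 x 2 table the Pearson chi-square statistic and its P value (1 degree of freedom) can be computed directly. A Python sketch with hypothetical cell counts, since the cell counts of the table above are not available; no continuity correction is applied:

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 d.f., P(X > chi2) = erfc(sqrt(chi2 / 2))
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical counts: exposed 15 with outcome / 35 without,
# unexposed 25 with outcome / 25 without
chi2, p_value = chi_square_2x2(15, 35, 25, 25)
print(f"chi2(1) = {chi2:.3f}, P = {p_value:.3f}")
```

When expected cell counts are small, Fisher's exact test is usually preferred over this approximation.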
Differences between groups
Parametric comparisons
T-test ?
What about 3 groups? anova age ethngr, cat(ethngr) [Stata output; values lost in extraction] Number of obs = 37. Header fields: R-squared, Root MSE, Adj R-squared. Table columns: Source, Partial SS, df, MS, F, Prob > F; rows: Model, ethngr, Residual, Total.
Differences between the 3: regress [Stata output; most values lost in extraction] F(2, 34) = 1.13. Header fields: Number of obs, Prob > F, R-squared, Adj R-squared, Root MSE. Coefficient table for age: Coef., Std. Err., t, P>|t|, [95% Conf. Interval]; rows: _cons, ethngr (dropped).
Repeated measures: One group of schoolkids. Muscle strength in January; muscle strength again in March. Did things change significantly over time? Paired t-test. Two or more groups: RM ANOVA
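The paired t-test works on the within-child differences rather than on the two sets of measurements separately. A Python sketch that computes the t statistic and degrees of freedom, assuming hypothetical strength values; the P value is not computed here because the standard library has no t distribution:

```python
import math
import statistics

def paired_t_statistic(before, after):
    """Return (t, degrees of freedom) for paired measurements."""
    if len(before) != len(after):
        raise ValueError("paired data need equal-length lists")
    diffs = [a - b for b, a in zip(before, after)]   # change per subject
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    se_diff = statistics.stdev(diffs) / math.sqrt(n)  # SE of the mean change
    return mean_diff / se_diff, n - 1

# Hypothetical muscle strength (kg) for the same five children
january = [20, 22, 19, 24, 21]
march = [22, 25, 20, 27, 23]
t_stat, df = paired_t_statistic(january, march)
print(f"t = {t_stat:.2f} on {df} d.f.")
```

The statistic is then compared with the t distribution on n − 1 degrees of freedom, for example in a table or a statistics package.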
Non-parametric comparisons, two groups: ranksum age, by(menopaus) [Stata output; values lost in extraction] Two-sample Wilcoxon rank-sum (Mann-Whitney) test. Table columns: menopaus, obs, rank sum, expected; also reported: unadjusted variance, adjustment for ties, adjusted variance. Ho: age(menopaus==0) = age(menopaus==1); z and Prob > |z| not recoverable.
Non-parametric, three groups: kwallis s_tg, by(ethngr) [Stata output; rank sums and test statistics lost in extraction] Test: Equality of populations (Kruskal-Wallis test). Obs by ethngr: group 1 = 17, group 2 = 10, group 3 = 10. Reported: chi-squared with 2 d.f. and probability, with and without ties.
Summary: Continuous, non-normal: 2 groups: Mann-Whitney; 3 groups: Kruskal-Wallis. Continuous, normal: 2 groups: t-test; 3 groups: ANOVA.
Categorical data
Relationships
Linear Regression
Here the DEPENDENT (logTG) and INDEPENDENT (waist) VARIABLES are both continuous. So how much does logTG increase if waist increases by 1 cm? That is the beta coefficient.
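The beta coefficient is the slope of the least-squares line. A Python sketch of the calculation, assuming made-up waist and logTG values that lie exactly on a line so the result is easy to check:

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx                  # beta: change in y per 1-unit change in x
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: waist in cm, logTG constructed as 0.5 + 0.01 * waist
waist = [80, 90, 100, 110, 120]
log_tg = [0.5 + 0.01 * w for w in waist]
intercept, beta = least_squares(waist, log_tg)
print(f"logTG = {intercept:.2f} + {beta:.3f} * waist")
```

Here a beta of 0.01 would mean logTG rises by 0.01 for every extra centimetre of waist.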
What if the INDEPENDENT variable is categorical? regress age menop [Stata output; values lost in extraction] F(1, 84) reported. Coefficient table for age: Coef., Std. Err., t, P>|t|, [95% Conf. Interval]; rows: menopaus, _cons. Menop = 0 or 1. INTERPRETATION??
Logistic regression: Outcome is heart disease (Yes/No); independent variable = age. logistic CVD age [Stata output; most values lost in extraction] Number of obs = 48; LR chi2(1) = 2.51. Table for died: Odds Ratio, Std. Err., z, P>|z|, [95% Conf. Interval]; row: age = ?
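With a 0/1 predictor, the regression coefficient is the difference between the two group means and the constant is the mean of the group coded 0. A Python sketch demonstrating this on hypothetical ages (the data and names are illustrative):

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: menopause status (0 = no, 1 = yes) and age in years
menopause = [0, 0, 0, 0, 1, 1, 1, 1]
age = [42, 45, 47, 46, 55, 58, 53, 54]

constant, coefficient = least_squares(menopause, age)

ages_0 = [a for m, a in zip(menopause, age) if m == 0]
ages_1 = [a for m, a in zip(menopause, age) if m == 1]
mean_0 = sum(ages_0) / len(ages_0)
mean_1 = sum(ages_1) / len(ages_1)

print("constant:", constant, " mean age, group 0:", mean_0)
print("coefficient:", coefficient, " difference in means:", mean_1 - mean_0)
```

The t-test on the coefficient is therefore equivalent to an equal-variance two-sample t-test comparing the groups.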
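The odds ratio reported by a logistic model is the exponentiated coefficient, so it describes the multiplicative change in odds per one-unit increase in the predictor (here, per year of age). A Python sketch, assuming a hypothetical coefficient and standard error, since the fitted values are not available in the output above:

```python
import math

def odds_ratio_from_coefficient(beta, se, z=1.96):
    """Return (OR, (lower, upper)) from a log-odds coefficient and its SE."""
    return math.exp(beta), (math.exp(beta - z * se), math.exp(beta + z * se))

def odds_ratio_for_increment(beta, units):
    """OR for an increase of `units` in the predictor, e.g. 10 years of age."""
    return math.exp(beta * units)

# Hypothetical values: log-odds coefficient for age and its standard error
beta_age, se_age = 0.05, 0.03
or_per_year, (ci_low, ci_high) = odds_ratio_from_coefficient(beta_age, se_age)
or_per_decade = odds_ratio_for_increment(beta_age, 10)

print(f"OR per year = {or_per_year:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
print(f"OR per 10 years = {or_per_decade:.3f}")
```

An OR whose confidence interval includes 1 is compatible with no association between age and the outcome.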