CPH Exam Review Biostatistics


1 CPH Exam Review Biostatistics
Lisa Sullivan, PhD Associate Dean for Education Professor and Chair, Department of Biostatistics Boston University School of Public Health

2 Outline and Goals Overview of Biostatistics (Core Area)
Terminology and Definitions Practice Questions An archived version of this review, along with the PPT file, will be available on the NBPHE website under Study Resources

3 Biostatistics Two Areas of Applied Biostatistics:
Descriptive Statistics Summarize a sample selected from a population Inferential Statistics Make inferences about population parameters based on sample statistics.

4 Variable Types Dichotomous variables have 2 possible responses (e.g., Yes/No) Ordinal and categorical variables have more than two responses and responses are ordered and unordered, respectively Continuous (or measurement) variables assume in theory any values between a theoretical minimum and maximum

5 Dichotomous Ordinal Categorical Continuous
We want to study whether individuals over 45 years are at greater risk of diabetes than those younger than 45. What kind of variable is age? Dichotomous Ordinal Categorical Continuous

6 Dichotomous Ordinal Categorical Continuous
We are interested in assessing disparities in infant morbidity by race/ethnicity. What kind of variable is race/ethnicity? Dichotomous Ordinal Categorical Continuous

7 Numerical Summaries of Dichotomous, Categorical and Ordinal Variables
Frequency Distribution Table

Health Status   Freq.   Rel. Freq.   Cumulative Freq.   Cumulative Rel. Freq.
Excellent       19      38%          19                 38%
Very Good       12      24%          31                 62%
Good             9      18%          40                 80%
Fair             6      12%          46                 92%
Poor             4       8%          50                 100%
n=50
Cumulative frequencies and cumulative relative frequencies apply to ordinal variables only.
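The columns of the frequency distribution table can be rebuilt from the raw counts. The sketch below is illustrative only: it uses the five counts from the health status slide, and the variable names are assumptions, not part of the original deck.

```python
# Counts from the health status table (levels ordered from best to worst).
counts = {"Excellent": 19, "Very Good": 12, "Good": 9, "Fair": 6, "Poor": 4}

n = sum(counts.values())
running = 0
rows = []
for level, freq in counts.items():
    running += freq  # cumulative frequency is only meaningful for ordered levels
    rows.append((level, freq, freq / n, running, running / n))

# Print one line per level: freq, rel. freq, cumulative freq, cumulative rel. freq.
for level, freq, rel, cum, cum_rel in rows:
    print(f"{level:<10} {freq:>3} {rel:>5.0%} {cum:>3} {cum_rel:>5.0%}")
```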

8 Frequency Bar Chart

9 Relative Frequency Histogram

10 Continuous Variables Assume, in theory, any value between a theoretical minimum and maximum Quantitative, measurement variables Example – systolic blood pressure Standard Summary: n = 75, X̄ = 123.6, s = 19.4 Second sample: n = 75, X̄ = 128.1, s = 6.4

11 Summarizing Location and Variability
When there are no outliers, the sample mean and standard deviation summarize location and variability When there are outliers, the median and interquartile range (IQR) summarize location and variability, where IQR = Q3-Q1 Outliers <Q1–1.5 IQR or >Q3+1.5 IQR
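The outlier rule can be sketched in a few lines. The readings below are hypothetical (they do not come from the slides), and the quartile convention is an assumption: textbooks and packages compute Q1 and Q3 in slightly different ways, so the fences may differ a little from a hand calculation.

```python
import statistics

# Hypothetical systolic blood pressure readings, for illustration only.
sbp = [100, 110, 115, 120, 125, 130, 135, 140, 200]

# "inclusive" is one of several quartile conventions (an assumption here).
q1, median, q3 = statistics.quantiles(sbp, n=4, method="inclusive")
iqr = q3 - q1

# Values outside these fences are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in sbp if x < lower_fence or x > upper_fence]

print(f"median={median}, IQR={iqr}, fences=({lower_fence}, {upper_fence})")
print("outliers:", outliers)
```

With an outlier present, the median and IQR would be the preferred summaries for this sample.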

12 Mean Vs. Median

13 Box and Whisker Plot
[Figure: box and whisker plot marking Min, Q1, Median, Q3 and Max]

14 Comparing Samples with Box and Whisker Plots
[Figure: box and whisker plots of systolic blood pressure for samples 1 and 2]

15 What type of display is shown below?
Percent Patients by Disease Stage Frequency bar chart Relative frequency bar chart Frequency histogram Relative frequency histogram

16 The distribution of SBP in men, 20-29 years is shown below
What is the best summary of a typical value? Mean Median Interquartile range Standard Deviation

17 When data are skewed, the mean is higher than the median.
True False

18 The best summary of variability for the following continuous variable is
Mean Median Interquartile range Standard Deviation

19 Numerical and Graphical Summaries
Dichotomous and categorical: frequencies and relative frequencies; bar charts (freq. or relative freq.)
Ordinal: frequencies, relative frequencies, cumulative frequencies and cumulative relative frequencies; histograms (freq. or relative freq.)
Continuous: n, X̄ and s, or median and IQR (if outliers); box and whisker plot

20 What is the probability of selecting a male with optimal blood pressure?
Blood Pressure Category Optimal Normal Pre-Htn Htn Total Male Female Total 20/25 20/80 20/150

21 What is the probability of selecting a patient with Pre-Htn or Htn?
Blood Pressure Category Optimal Normal Pre-Htn Htn Total Male Female Total 95/150 45/80 55/150

22 What proportion of men have prevalent CVD?
CVD Free of CVD Men Women 35/80 35/265 35/300

23 What proportion of patients with CVD are men ?
CVD Free of CVD Men Women 35/700 35/80 80/300

24 Are Family History and Current Status Independent?
Example. Consider the following table which cross classifies subjects by their family history of CVD and current (prevalent) CVD status.

                     Current CVD
Family History       No      Yes
No                   215     25
Yes                  90      15

P(Current CVD | Family Hx) = 15/105 = 0.143
P(Current CVD | No Family Hx) = 25/240 = 0.104
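The two conditional probabilities can be checked directly from the four cell counts. This is a minimal sketch; the dictionary layout and function name are illustrative assumptions. Two events are independent when the conditional probability equals the unconditional one, so the overall prevalence is computed for comparison.

```python
# Cell counts from the family history by current CVD table.
table = {
    "family_hx": {"cvd": 15, "no_cvd": 90},
    "no_family_hx": {"cvd": 25, "no_cvd": 215},
}

def p_cvd_given(group):
    """P(current CVD | family history group)."""
    cell = table[group]
    return cell["cvd"] / (cell["cvd"] + cell["no_cvd"])

total = sum(c["cvd"] + c["no_cvd"] for c in table.values())
p_cvd = sum(c["cvd"] for c in table.values()) / total  # unconditional P(CVD)

p_given_hx = p_cvd_given("family_hx")
p_given_no_hx = p_cvd_given("no_family_hx")

print(f"P(CVD | Family Hx)    = {p_given_hx:.3f}")
print(f"P(CVD | No Family Hx) = {p_given_no_hx:.3f}")
print(f"P(CVD)                = {p_cvd:.3f}")
```

In this sample the conditional probabilities differ from each other and from the overall prevalence, so the two characteristics are not independent in the sample; whether the difference is statistically significant is a separate question.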

25 Are symptoms independent of disease?
Disease No Disease Total Symptoms No Symptoms No Yes

26 Probability Models – Binomial Distribution
Two possible outcomes: success and failure Replications of process are independent P(success) is constant for each replication Mean=np, variance=np(1-p)
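As a quick check of the mean and variance formulas, the binomial probabilities can be computed from the standard formula P(X = k) = C(n, k) p^k (1-p)^(n-k). The values n = 10 and p = 0.3 are hypothetical and chosen only for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent replications."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 10, 0.3  # hypothetical values, not from the slides
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

# Mean and variance computed from the distribution itself.
mean = sum(k * pk for k, pk in enumerate(pmf))
variance = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(f"mean = {mean:.2f} (np = {n * p:.2f})")
print(f"variance = {variance:.2f} (np(1-p) = {n * p * (1 - p):.2f})")
```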

27 Probability Models – Poisson Distribution
Two possible outcomes: success and failure Replications of process are independent Often used to model counts (often used to model rare events) Mean=μ, variance=μ

28 Probability Models – Normal Distribution
Model for continuous outcome Mean=median=mode

29 Normal Distribution Properties of Normal Distribution
i) The normal distribution is symmetric about the mean (i.e., P(X > μ) = P(X < μ) = 0.5).
ii) The mean and variance (μ and σ²) completely characterize the normal distribution.
iii) The mean = the median = the mode
iv) Approximately 68% of observations fall between mean ± 1 sd, 95% between mean ± 2 sd, and >99% between mean ± 3 sd

30 Normal Distribution Body mass index (BMI) for men age 60 is normally distributed with a mean of 29 and standard deviation of 6. What is the probability that a male has BMI < 29? P(X<29)= 0.5

31 Normal Distribution What is the probability that a male has BMI less than 30? P(X<30)=?

32 Standard Normal Distribution Z
Normal distribution with μ=0 and σ=1

33 Normal Distribution Z = (30 - 29) / 6 = 0.17, so P(X<30) = P(Z<0.17) = 0.5675
From a table of standard normal probabilities or statistical computing package.
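The table lookup can be reproduced with Python's standard library. This sketch uses the BMI mean of 29 and standard deviation of 6 from the slides. A computing package works with the unrounded Z, so its answer differs slightly from the table value, which rounds Z to 0.17 first.

```python
from statistics import NormalDist

bmi = NormalDist(mu=29, sigma=6)

# Exact probability, no rounding of Z.
p_exact = bmi.cdf(30)

# Table-style calculation: standardize, round Z to two decimals, look it up.
z = round((30 - 29) / 6, 2)
p_table = NormalDist().cdf(z)

print(f"P(X < 29) = {bmi.cdf(29):.4f}")
print(f"P(X < 30) with unrounded Z = {p_exact:.4f}")
print(f"P(Z < {z}) = {p_table:.4f}")
```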

34 Comparing Systolic Blood Pressure (SBP)
Suppose for Males Age 50, SBP is approximately normally distributed with a mean of 108 and a standard deviation of 14 Suppose for Females Age 50, SBP is approximately normally distributed with a mean of 100 and a standard deviation of 8 If a Male Age 50 has a SBP = 140 and a Female Age 50 has a SBP = 120, who has the “relatively” higher SBP ?

35 Normal Distribution
ZM = (140 - 108) / 14 = 2.29
ZF = (120 - 100) / 8 = 2.50
Which is more extreme?
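The comparison comes down to standardizing each reading against its own reference distribution. A minimal sketch using the means and standard deviations given for males and females age 50 (the function name is an illustrative assumption):

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above the mean."""
    return (x - mean) / sd

z_male = z_score(140, mean=108, sd=14)
z_female = z_score(120, mean=100, sd=8)

print(f"Z (male)   = {z_male:.2f}")
print(f"Z (female) = {z_female:.2f}")
print("Relatively higher SBP:", "female" if z_female > z_male else "male")
```

The female's reading is lower in absolute terms but sits further above the mean of her reference group.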

36 Percentiles of the Normal Distribution
The kth percentile is defined as the score that holds k percent of the scores below it. Eg., 90th percentile is the score that holds 90% of the scores below it. Q1 = 25th percentile, median = 50th percentile, Q3 = 75th percentile

37 Percentiles For the normal distribution, the following is used to compute percentiles: X = μ + Z σ where μ = mean of the random variable X, σ = standard deviation, and Z = value from the standard normal distribution for the desired percentile (e.g., 95th, Z=1.645). 95th percentile of BMI for Men: 29 + 1.645 (6) = 38.9
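The percentile formula X = μ + Zσ can be evaluated without a table by asking for the inverse of the standard normal distribution. The sketch below uses the BMI mean (29) and standard deviation (6) from the slides.

```python
from statistics import NormalDist

mu, sigma = 29, 6

# Z value that holds 95% of the standard normal distribution below it.
z95 = NormalDist().inv_cdf(0.95)

# Percentile on the original BMI scale.
p95 = mu + z95 * sigma

print(f"Z for the 95th percentile = {z95:.3f}")
print(f"95th percentile of BMI = {p95:.1f}")
```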

38 Central Limit Theorem (Non-normal) population with μ, σ
Take samples of size n – as long as n is sufficiently large (usually n > 30 suffices) The distribution of the sample mean is approximately normal, therefore can use Z to compute probabilities Standard error = σ/√n

39 Statistical Inference
There are two broad areas of statistical inference, estimation and hypothesis testing. Estimation. Population parameter is unknown, sample statistics are used to generate estimates. Hypothesis Testing. A statement is made about parameter, sample statistics support or refute statement.

40 What Analysis To Do When
Nature of primary outcome variable Continuous, dichotomous, categorical, time to event Number of comparison groups One, 2 independent, 2 matched or paired, > 2 Associations between variables Regression analysis

41 Estimation Process of determining likely values for unknown population parameter Point estimate is best single-valued estimate for parameter Confidence interval is range of values for parameter: point estimate ± margin of error, i.e., point estimate ± t SE(point estimate)

42 Hypothesis Testing Procedures
1. Set up null and research hypotheses, select α
2. Select test statistic
3. Set up decision rule
4. Compute test statistic
5. Draw conclusion & summarize significance (p-value)

43 P-values P-values represent the exact significance of the data
Estimate p-values when rejecting H0 to summarize significance of the data (approximate with statistical tables, exact value with computing package) If p < α then reject H0

44 Errors in Hypothesis Tests
                 Conclusion of Statistical Test
                 Do Not Reject H0    Reject H0
H0 true          Correct             Type I error
H0 false         Type II error       Correct

45 Continuous Outcome Confidence Interval for μ
Continuous outcome - 1 Sample
n > 30: X̄ ± Z s/√n
n < 30: X̄ ± t s/√n (df = n-1)
Example. 95% CI for mean waiting time at ED. Data: n=100, X̄ = 37.85 and s = 9.5 mins. 37.85 ± 1.96 (9.5/√100) gives (35.99 to 39.71).
Statistical computing packages use t throughout.
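The waiting-time interval can be reproduced as follows. This is a sketch that assumes the large-sample form, point estimate ± Z × SE with SE = s/√n; a package using t with 99 degrees of freedom would give a marginally wider interval.

```python
from math import sqrt
from statistics import NormalDist

n, x_bar, s = 100, 37.85, 9.5  # waiting-time example from the slide

z = NormalDist().inv_cdf(0.975)  # about 1.96 for 95% confidence
se = s / sqrt(n)                 # standard error of the sample mean
margin = z * se

ci_lower = x_bar - margin
ci_upper = x_bar + margin

print(f"95% CI for the mean: ({ci_lower:.2f}, {ci_upper:.2f})")
```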

46 New Scenario Outcome is dichotomous
Result of surgery (success, failure) Cancer remission (yes/no) One study sample Data On each participant, measure outcome (yes/no) n, x=# positive responses,

47 Dichotomous Outcome Confidence Interval for p
Dichotomous outcome - 1 Sample Example. In the Framingham Offspring Study (n=3532), 1219 patients were on antihypertensive medications. Generate 95% CI. (0.329, 0.361)
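The Framingham interval can be checked with the usual large-sample formula for a proportion, p̂ ± Z √(p̂(1-p̂)/n). The formula itself is not printed on the slide, so treat it as an assumption; it does reproduce the stated limits.

```python
from math import sqrt
from statistics import NormalDist

n, x = 3532, 1219  # participants, number on antihypertensive medication

p_hat = x / n
z = NormalDist().inv_cdf(0.975)
se = sqrt(p_hat * (1 - p_hat) / n)  # standard error of the sample proportion

ci_lower = p_hat - z * se
ci_upper = p_hat + z * se

print(f"p-hat = {p_hat:.3f}")
print(f"95% CI for p: ({ci_lower:.3f}, {ci_upper:.3f})")
```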

48 One Sample Procedures – Comparisons with Historical/External Control
Continuous: H0: μ=μ0; H1: μ>μ0, μ<μ0 or μ≠μ0 (test statistic depends on n>30 or n<30)
Dichotomous: H0: p=p0; H1: p>p0, p<p0 or p≠p0

49 One Sample Procedures – Comparisons with Historical/External Control
Categorical or Ordinal outcome χ² Goodness of fit test H0: p1=p10, p2=p20, …, pk=pk0 H1: H0 is false

50 New Scenario Outcome is continuous SBP, Weight, cholesterol
Two independent study samples Data On each participant, identify group and measure outcome

51 Two Independent Samples
Cohort Study - Set of Subjects Who Meet Study Inclusion Criteria Group 1 Group 2 Mean Group 1 Mean Group 2

52 Two Independent Samples
RCT: Set of Subjects Who Meet Study Eligibility Criteria Randomize Treatment 1 Treatment 2 Mean Trt 1 Mean Trt 2

53 Continuous Outcome Confidence Interval for (μ1-μ2)
Continuous outcome - 2 Independent Samples n1>30 and n2>30 n1<30 or n2<30

54 Hypothesis Testing for (μ1-μ2)
Continuous outcome 2 Independent Samples H0: μ1=μ2 (μ1-μ2 = 0) H1: μ1>μ2, μ1<μ2, μ1≠μ2

55 Hypothesis Testing for (μ1-μ2)
Test Statistic n1>30 and n2> 30 n1<30 or n2<30

56 An RCT is planned to show the efficacy of a new drug vs. placebo to lower total cholesterol. What are the hypotheses?
H0: μP=μN H1: μP>μN
H0: μP=μN H1: μP<μN
H0: μP=μN H1: μP≠μN

57 New Scenario Outcome is dichotomous
Result of surgery (success, failure) Cancer remission (yes/no) Two independent study samples Data On each participant, identify group and measure outcome (yes/no)

58 Dichotomous Outcome Confidence Interval for (p1-p2)
Dichotomous outcome - 2 Independent Samples

59 Measures of Effect for Dichotomous Outcomes
Outcome = dichotomous (Y/N or 0/1) Risk=proportion of successes = x/n Odds=ratio of successes to failures=x/(n-x)

60 Measures of Effect for Dichotomous Outcomes
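Risk Difference = p1 - p2
Relative Risk = p1 / p2
Odds Ratio = [p1/(1-p1)] / [p2/(1-p2)]
where p1 and p2 are the risks (proportions of successes) in groups 1 and 2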
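The three measures of effect follow directly from the definitions of risk (x/n) and odds (x/(n-x)). The counts below are hypothetical, and the function names are illustrative assumptions.

```python
def risk(x, n):
    """Proportion of successes."""
    return x / n

def odds(x, n):
    """Ratio of successes to failures."""
    return x / (n - x)

# Hypothetical two-group data: events / group size.
x1, n1 = 30, 100  # group 1 (e.g., exposed)
x2, n2 = 15, 100  # group 2 (e.g., unexposed)

risk_difference = risk(x1, n1) - risk(x2, n2)
relative_risk = risk(x1, n1) / risk(x2, n2)
odds_ratio = odds(x1, n1) / odds(x2, n2)

print(f"Risk difference = {risk_difference:.2f}")
print(f"Relative risk   = {relative_risk:.2f}")
print(f"Odds ratio      = {odds_ratio:.2f}")
```

The null value is 0 for the risk difference and 1 for both ratios.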

61 Confidence Intervals for Relative Risk (RR)
Dichotomous outcome 2 Independent Samples exp(lower limit), exp(upper limit)

62 Confidence Intervals for Odds Ratio (OR)
Dichotomous outcome 2 Independent Samples exp(lower limit), exp(upper limit)
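The "exp(lower limit), exp(upper limit)" step reflects that intervals for ratios are built on the log scale and then exponentiated. The sketch below assumes the standard large-sample standard errors for ln(RR) and ln(OR), which are not printed on the slides, and uses hypothetical counts.

```python
from math import exp, log, sqrt
from statistics import NormalDist

# Hypothetical two-group data: events / group size.
x1, n1 = 30, 100
x2, n2 = 15, 100
z = NormalDist().inv_cdf(0.975)  # 95% confidence

# Relative risk: CI for ln(RR), then exponentiate both limits.
rr = (x1 / n1) / (x2 / n2)
se_ln_rr = sqrt((n1 - x1) / (x1 * n1) + (n2 - x2) / (x2 * n2))
rr_ci = (exp(log(rr) - z * se_ln_rr), exp(log(rr) + z * se_ln_rr))

# Odds ratio: CI for ln(OR), then exponentiate both limits.
odds_ratio = (x1 / (n1 - x1)) / (x2 / (n2 - x2))
se_ln_or = sqrt(1 / x1 + 1 / (n1 - x1) + 1 / x2 + 1 / (n2 - x2))
or_ci = (exp(log(odds_ratio) - z * se_ln_or), exp(log(odds_ratio) + z * se_ln_or))

print(f"RR = {rr:.2f}, 95% CI ({rr_ci[0]:.2f}, {rr_ci[1]:.2f})")
print(f"OR = {odds_ratio:.2f}, 95% CI ({or_ci[0]:.2f}, {or_ci[1]:.2f})")
```

Because the intervals are symmetric on the log scale, they are not symmetric around the point estimate on the original scale.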

63 Hypothesis Testing for (p1-p2)
Dichotomous outcome 2 Independent Sample H0: p1=p2 H1: p1>p2, p1<p2, p1≠p2 Test Statistic

64 Two (Independent) Group Comparisons
Difference in birth weight is -106 g, 95% CI for difference in mean Birth weight: ( to -36.7)

65 New Scenario Outcome is continuous SBP, Weight, cholesterol
Two matched study samples Data On each participant, measure outcome under each experimental condition Compute differences (D=X1-X2)

66 Two Dependent/Matched Samples
Subject ID Measure 1 Measure 2 . Measures taken serially in time or under different experimental conditions

67 Crossover Trial
[Diagram: eligible participants are randomized (R) to the order in which they receive treatment and placebo]
Each participant is measured on treatment and placebo

68 Confidence Intervals for μd
Continuous outcome 2 Matched/Paired Samples n > 30 n < 30

69 Hypothesis Testing for μd
Continuous outcome 2 Matched/Paired Samples H0: μd=0 H1: μd>0, μd<0, μd≠0 Test Statistic n>30 n<30

70 Independent Vs Matched Design

71 Statistical Significance versus Effect Size
P-value summarizes significance Confidence intervals give magnitude of effect (If null value is included in CI, then no statistical significance)

72 The null value of a difference in means is…
0.5 1 2

73 The null value of a mean difference is…
0.5 1 2

74 The null value of a relative risk is…
0.5 1 2

75 The null value of a difference in proportions is…
0.5 1 2

76 The null value of an odds ratio is…
0.5 1 2

77 A two sided test for the equality of means produces p=0.20. Reject H0?
Yes No Maybe

78 Hypothesis Testing for More than 2 Means - Analysis of Variance
Continuous outcome k Independent Samples, k > 2 H0: μ1=μ2=μ3 … =μk H1: Means are not all equal Test Statistic F is ratio of between group variation to within group variation (error)

79 ANOVA Table

Source of Variation   Sums of Squares   df     Mean Squares      F
Between Treatments    SSB               k-1    MSB = SSB/(k-1)   MSB/MSE
Error                 SSE               N-k    MSE = SSE/(N-k)
Total                 SST               N-1
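The entries of the ANOVA table can be computed by hand for a small data set. The three groups below are hypothetical; the sums of squares follow the usual definitions (between-group variation around the grand mean, within-group variation around each group mean).

```python
# Hypothetical outcome values for k = 3 independent groups.
groups = [
    [1, 2, 3],
    [2, 3, 4],
    [5, 6, 7],
]

k = len(groups)
N = sum(len(g) for g in groups)
grand_mean = sum(sum(g) for g in groups) / N
group_means = [sum(g) / len(g) for g in groups]

# Between-treatments and error sums of squares.
ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
sse = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

msb = ssb / (k - 1)  # mean square between
mse = sse / (N - k)  # mean square error
f_stat = msb / mse

print(f"SSB={ssb:.2f}, df={k - 1}, MSB={msb:.2f}")
print(f"SSE={sse:.2f}, df={N - k}, MSE={mse:.2f}")
print(f"F = {f_stat:.2f}")
```

The F statistic would then be compared with a critical value from the F distribution with k-1 and N-k degrees of freedom.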

80 ANOVA When the sample sizes are equal, the design is said to be balanced Balanced designs give greatest power and are more robust to violations of the normality assumption

81 Extensions Multiple Comparison Procedures – Used to test for specific differences in means after rejecting equality of all means (e.g., Tukey, Scheffe) Higher-Order ANOVA - Tests for differences in means as a function of several factors

82 Extensions Repeated Measures ANOVA - Tests for differences in means when there are multiple measurements in the same participants (e.g., measures taken serially in time)

83 χ² Test of Independence Dichotomous, ordinal or categorical outcome
2 or More Samples H0: The distribution of the outcome is independent of the groups H1: H0 is false Test Statistic: χ² = Σ (O-E)²/E, where O and E are the observed and expected cell frequencies
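A sketch of the test statistic, assuming the usual form in which each expected count is (row total × column total) / N. For illustration it reuses the family history by current CVD counts from the independence slide; the function name is an assumption.

```python
def chi_square_statistic(observed):
    """Chi-square statistic and degrees of freedom for an r by c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total  # expected count under H0
            chi2 += (o - e) ** 2 / e

    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, df

# Rows: no family history, family history. Columns: no current CVD, current CVD.
observed = [[215, 25], [90, 15]]
chi2, df = chi_square_statistic(observed)

print(f"chi-square = {chi2:.2f} with df = {df}")
print("Reject H0 at the 5% level:", chi2 > 3.84)  # 3.84 is the df=1 critical value
```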

84 χ² Test of Independence Data organization (r by c table)
Is the distribution of the outcome different across (associated with) groups? Outcome Group 1 2 3 A 20% 40% B 50% 25% C 90% 5%

85 What Tests Were Used?

86 In Framingham Heart Study, we want to assess risk factors for Impaired Glucose
Outcome = Glucose Category Diabetes (glucose > 126), Impaired Fasting Glucose (glucose ), Normal Glucose Risk Factors Sex Age BMI (normal weight, overweight, obese) Genetics

87 What test would be used to assess whether sex is associated with Glucose Category?
ANOVA Chi-Square GOF Chi-Square test of independence Test for equality of means Other

88 What test would be used to assess whether age is associated with Glucose Category?
ANOVA Chi-Square GOF Chi-Square test of independence Test for equality of means Other

89 What test would be used to assess whether BMI is associated with Glucose Category?
ANOVA Chi-Square GOF Chi-Square test of independence Test for equality of means Other

90 In Framingham Heart Study, we want to assess risk factors for Glucose Level
Consider a Secondary Outcome = Fasting Glucose Level Risk Factors Sex Age BMI (normal weight, overweight, obese) Genetics

91 What test would be used to assess whether sex is associated with Glucose Level?
ANOVA Chi-Square GOF Chi-Square test of independence Test for equality of means Other

92 What test would be used to assess whether BMI is associated with Glucose Level?
ANOVA Chi-Square GOF Chi-Square test of independence Test for equality of means Other

93 What test would be used to assess whether age is associated with Glucose Level?
ANOVA Chi-Square GOF Chi-Square test of independence Test for equality of means Other

94 In Framingham Heart Study, we want to assess risk factors for Diabetes
Consider a Tertiary Outcome = Diabetes Vs No Diabetes Risk Factors Sex Age BMI (normal weight, overweight, obese) Genetics

95 What test would be used to assess whether sex is associated with Diabetes?
ANOVA Chi-Square GOF Chi-Square test of independence Test for equality of means Other

96 What test would be used to assess whether BMI is associated with Diabetes?
ANOVA Chi-Square GOF Chi-Square test of independence Test for equality of means Other

97 What test would be used to assess whether age is associated with Diabetes?
ANOVA Chi-Square GOF Chi-Square test of independence Test for equality of means Other

98 Correlation Correlation (r)– measures the nature and strength of linear association between two variables at a time Regression – equation that best describes relationship between variables

99 What is the most likely value of r for the data shown below?

100 What is the most likely value of r for the data shown below?

101 Simple Linear Regression
Y = Dependent, Outcome variable X = Independent, Predictor variable Ŷ = b0 + b1 x b0 is the Y-intercept, b1 is the slope
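The intercept and slope are usually obtained by least squares. The sketch below assumes the standard estimates, b1 = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)² and b0 = ȳ - b1 x̄, and uses a small hypothetical data set that lies exactly on a line.

```python
# Hypothetical data, for illustration only.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]

x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# Least squares estimates of the slope and intercept.
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
b0 = y_bar - b1 * x_bar

def predict(x_new):
    """Predicted outcome for a given value of the predictor."""
    return b0 + b1 * x_new

print(f"Y-hat = {b0:.2f} + {b1:.2f} x")
print(f"Predicted Y at x = 5: {predict(5):.2f}")
```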

102 Simple Linear Regression Assumptions
Linear relationship between X and Y Independence of errors Homoscedasticity (constant variance) of the errors Normality of errors

103 Multiple Linear Regression
Useful when we want to jointly examine the effect of several X variables on the outcome Y variable. Y = continuous outcome variable X1, X2, …, Xp = set of independent or predictor variables .

104 Multiple Regression Analysis
Model is conditional, parameter estimates are conditioned on other variables in model Perform overall test of regression If significant, examine individual predictors Relative importance of predictors by p-values (or standardized coefficients)

105 Multiple Regression Analysis
Predictors can be continuous, indicator variables (0/1) or a set of dummy variables Dummy variables (for categorical predictors) Race: white, black, Hispanic Black (1 if black, 0 otherwise) Hispanic (1 if Hispanic, 0 otherwise)

106 Definitions Confounding – the distortion of the effect of a risk factor on an outcome Effect Modification – a different relationship between the risk factor and an outcome depending on the level of another variable

107 Multiple Regression for SBP: Comparison of Parameter Estimates
Simple Models Multiple Regression b p b p Age < <.0001 Male BMI < <.0001 BP Meds < <.0001 Focus on the association between BP meds and SBP…

108 RCT of New Drug to Raise HDL Example of Effect Modification
            N    Mean    Std Dev
Women
  New drug  40   38.88   3.97
  Placebo   41   39.24   4.21
Men
  New drug  10   45.25   1.89
  Placebo    9   39.06   2.22

109 Simple Logistic Regression
Outcome is dichotomous (binary) We model the probability p of having the disease.

110 Multiple Logistic Regression
Outcome is dichotomous (1=event, 0=non-event) and p=P(event) Outcome is modeled as log odds
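Modeling the outcome as log odds means the linear predictor is ln(p/(1-p)), and exponentiating a coefficient gives an odds ratio. The sketch below illustrates that relationship with hypothetical coefficients b0 and b1; they are assumptions, not estimates from any study in the slides.

```python
import math

def logit(p):
    """Log odds of a probability."""
    return math.log(p / (1 - p))

def inverse_logit(value):
    """Probability corresponding to a log odds value."""
    return 1 / (1 + math.exp(-value))

b0, b1 = -2.0, 0.3  # hypothetical intercept and slope

def predicted_probability(x):
    """P(event) for predictor value x under the hypothetical model."""
    return inverse_logit(b0 + b1 * x)

def odds(p):
    return p / (1 - p)

# A one-unit increase in x multiplies the odds by exp(b1).
odds_ratio = math.exp(b1)
ratio_check = odds(predicted_probability(5)) / odds(predicted_probability(4))

print(f"P(event | x=4) = {predicted_probability(4):.3f}")
print(f"OR per unit of x = {odds_ratio:.3f} (check: {ratio_check:.3f})")
```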

111 Multiple Logistic Regression for Birth Defect (Y/N)
Predictor b p OR (95% CI for OR) Intercept Smoke (0.34, 22.51) Age (1.02, 1.78) Interpretation of OR for age: The odds of having a birth defect for the older of two mothers differing in age by one year is estimated to be 1.35 times higher after adjusting for smoking.

112 Survival Analysis Outcome is the time to an event.
An event could be heart attack, cancer remission or death. Measure whether person has event or not (Yes/No) and if so, their time to event. Determine factors associated with longer survival.

113 Survival Analysis Incomplete follow-up information Censoring
Measure follow-up time and not time to event We know survival time > follow-up time Log rank test to compare survival in two or more independent groups

114 Survival Curve – Survival Function

115 Comparing Survival Curves
H0: Two survival curves are equal χ² Test with df=1. Reject H0 if χ² > 3.84 χ² = Reject H0.

116 Cox Proportional Hazards Model
ln(h(t)/h0(t)) = b1X1 + b2X2 + … + bpXp Exp(bi) = hazard ratio Model used to jointly assess effects of independent variables on outcome (time to an event).

117 Outcome= all-cause mortality
Age and Sex as predictors bi p HR Age Male Sex

118 Sample Size Determination
Need sample to ensure precision in analysis Sample size determined based on type of planned analysis CI Test of hypothesis

119 Determining Sample Size for Confidence Interval Estimates
Goal is to estimate an unknown parameter using a confidence interval estimate Plan a study to sample individuals, collect appropriate data and generate CI estimate How many individuals should we sample?

120 Determining Sample Size for Confidence Interval Estimates
Confidence intervals: point estimate ± margin of error Determine n to ensure small margin of error (precision) – accounting for attrition! Must specify desired margin of error, confidence level and variability of parameter
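For a confidence interval on a mean, the three inputs combine as n = (Zσ/E)², where E is the desired margin of error. That formula is a standard one assumed here rather than taken from the slide, and the numbers, including the attrition rate, are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean_ci(sigma, margin, confidence=0.95):
    """Smallest n giving a margin of error no larger than `margin`."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / margin) ** 2)

# Hypothetical planning values: sd of 20 units, margin of error of 5 units.
n_required = sample_size_for_mean_ci(sigma=20, margin=5)

# Inflate enrollment so that n_required remain after 10% attrition (hypothetical).
attrition = 0.10
n_to_enroll = ceil(n_required / (1 - attrition))

print(f"Required n = {n_required}, enroll {n_to_enroll} to allow for attrition")
```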

121 Determining Sample Size for Hypothesis Testing
How many participants are needed to ensure that there is a high probability of rejecting H0 when it is really false? Determine n to ensure high power (usually 80% or 90%) – accounting for attrition! Must specify desired power, α and effect size (difference in parameter under H0 versus H1)

122 Determining Sample Size for Hypothesis Testing
β and Power are related to the sample size, level of significance (α) and the effect size (difference in parameter of interest under H0 versus H1) Power is higher with larger α Power is higher with larger effect size Power is higher with larger sample size
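These relationships can be seen in a common sample size formula for comparing two independent means with a two-sided test: n per group = 2[(Z(1-α/2) + Z(1-β)) / ES]², where ES = |μ1 - μ2|/σ. The formula is an assumption for illustration (it is not shown on the slide), and the effect sizes are hypothetical.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Participants needed in each of two groups (two-sided test of means)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

base = n_per_group(effect_size=0.5)                 # alpha=0.05, 80% power
higher_power = n_per_group(effect_size=0.5, power=0.90)
larger_effect = n_per_group(effect_size=0.8)
larger_alpha = n_per_group(effect_size=0.5, alpha=0.10)

print(f"ES=0.5, 80% power: {base} per group")
print(f"ES=0.5, 90% power: {higher_power} per group")
print(f"ES=0.8, 80% power: {larger_effect} per group")
print(f"ES=0.5, alpha=0.10: {larger_alpha} per group")
```

Read the other way around: for a fixed n, power rises with a larger effect size or a larger α, which is why those settings need fewer participants here.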

123 Sample Size Determination
Critical Ethical Sometimes difficult

