Presentation is loading. Please wait.

Presentation is loading. Please wait.

Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.1 Counts and Proportions.

Similar presentations


Presentation on theme: "Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.1 Counts and Proportions."— Presentation transcript:

1 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.1 Counts and Proportions

2 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.2 Counts and Proportions  Coin flipping  What’s the chance that heads comes up 8+ times in 10 flips of a coin?  Binomial distribution  Celtics’ winning record  Are the Celtics really an average (.500) team despite their great start this season?  One-sample test of proportion  Who is the better free throw shooter, Kevin Garnett or Ray Allen?  Two-sample test of two proportions  Biomedical Research…

3 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.3 Where we’ve been… and where we’re going Variables of Interest? One Variable Continuous (Methods from Before Midterm) Binary One-sample test of proportion Two-sample test t of proportion Exact Methods Normal Approximation Two Variables Both Continuous Interested in prediction Simple Linear Regression Interested in association Both variables normal Pearson Correlation Not normal Spearman Correlation One Continuous, one categorical ANOVA Both Binary (in two weeks) More than Two Variables Multiple Linear Regression

4 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.4 Binary Data  Often in medical and public health studies, our endpoint of interest is binary or dichotomous  Examples  disease vs. no disease  response vs. no response  death vs. no death  success vs. failure

5 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.5 Binary Data  Continuous endpoints often dichotomized into a binary endpoint as well  For example, in a study of the effect of a drug on LDL levels, for each subject, the LDL measurement at the end of the study (a continuous measure) may be dichotomized into “response” vs. “no response”  Based on a cut-point defining whether the LDL level has been reduced to acceptable, normal, or safe levels

6 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.6 Binary Data 1.There are only two possible (mutually exclusive) outcomes, often called a “success” and a “failure” 2.Each experiment is identical to all the others, and the probability of a “success” is p  Thus, the probability of a “failure” is 1-p  Each experiment is called a Bernoulli trial  E.g. throwing the die and observing whether or not it comes up six, tossing a coin, or investigating the survival of a cancer patient

7 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.7 Binary Data: Smoking Status Example  Define:  X=1 for a smoker and  X=0 for a non-smoker.  If “success” is the event that a randomly selected individual is a smoker, and from previous research it is known that about 29% of the adults in the United States smoke, then: and  This is an example of a Bernoulli trial  We select just one individual at random, each selection is carried out independently, and each time the probability of that individual being a “success” is constant

8 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.8 Binary Data: Smoking Status Example  If we selected two adults, and looked at their smoking status, then the possible outcomes are:  Neither is a smoker  Only one is a smoker  Both are smokers  If we define X as the number of smokers between these two individuals, then  X=0: Neither is a smoker  X=1: Only one is a smoker  X=2: Both are smokers

9 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.9 Binary Data: Smoking Status Example P(X=0) = (1-p) 2 = (0.71) 2 = 0.5041 P(X=1) = P( 1st individual is a smoker OR 2nd individual is a smoker ) =p(1-p)+(1-p)p = 2p(1-p) = 0.4118 P(X=2) = p 2 = (0.29) 2 = 0.0841 Notice that: P(X=0)+P(X=1)+P(X=2)=0.5041+0.4118+0.0841=1.000

10 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.10 Binary Data: Smoking Status Example  Bar chart  A plot of the probability distribution of X, covering all of the possible numbers that X can attain (in the previous example those were n=0, 1, and 2)

11 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.11 Binomial Distribution  The binomial distribution is a special distribution that closely models the behavior of variables that corresponds to repeated Bernoulli experiments  When describing the binomial distribution, we need to specify two parameters:  The probability of “success” (p)  The number of Bernoulli experiments (n)  One way of looking at p is as the proportion of time that an experiment is successful when repeated a large number of times

12 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.12 Binomial Distribution  The “Exact Binomial Probability” is the probability calculated from the binomial distribution  Done by repeatedly plugging in the values for n, x, and p  Always correct to do so, but tedious  Only use if have computer to calculate for you!

13 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.13 Binomial Distribution  Given these parameters, the mean and standard deviation of a binomial distribution are:

14 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.14 Binomial Distribution  An easier way…  If n is sufficiently large, then the statistic is approximately distributed as normal with mean 0 and standard deviation 1 (a std. normal distribution) In general, we say n is “large” if np and n(1-p) are both ≥ 5

15 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.15 Binomial Distribution  A better approximation to the normal distribution is given by when X < np, and by when X >np  This is called a continuity correction

16 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.16 Binary Data: Smoking Status Example  For example, if we sample n=30 subjects from the population, what is the probability that at most 6 of these individuals smoke?  In other words, we want to find the proportion of samples of size n=30 in which at most 6 people smoke  With p=0.29 and n=30, we have:  np=8.75 and n(1-p)=21.3>5  X=6<np=8.7  should apply the continuity correction

17 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.17 Binary Data: Smoking Status Example  Normal approximation to binomial with continuity correction:  The exact binomial probability is 0.190, which is very close to the approximate value given above

18 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.18 Sampling Distribution of Proportions  Our thinking in terms of estimation (including confidence intervals) and hypothesis testing does not change when dealing with proportions  We may use the Central Limit Theorem for binary data too (as the CLT applies to all distributions)  Again, we must be careful when n is small

19 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.19 Sampling Distribution of Proportions  Since the proportion of successes in the general population will not be known, we must estimate it  In general, such an estimate is derived by calculating the proportion of successes in a sample of n experiments (trials) as follows:  Where x is the number of successes in the sample of size n

20 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.20 Sampling Distribution of Proportions  The sampling distribution of a proportion has mean, μ, and standard deviation, σ: Mean: μ = p Standard deviation: σ = And the statistic: is distributed according to the standard normal distribution (again, the normal approximation is particularly good when np ≥ 5 and n(1-p) ≥ 5)  NOTE: Dividing our previous parameters by n (x  x/n)

21 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.21 Sampling Distribution of Proportions  Note that if we multiply the numerator and denominator of Z by n we have, which is the familiar form encountered in our study of the binomial distribution

22 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.22 Hypothesis Testing  Similar to hypothesis testing with continuous data, one may perform hypothesis tests on binary data:  1-sample test of a proportion  H 0 : p=p 0  H A : p ≠ p 0  2-sample test comparing proportions  H 0 : p 1 =p 2  H A : p 1 ≠ p 2

23 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.23 Lung Cancer Example  Consider the five-year survival among patients who have been diagnosed with lung cancer  The mean proportion of individuals surviving is p=0.10  This implies that the standard deviation of the 5-year survival is:

24 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.24 Lung Cancer Example  Suppose we select n=50 lung cancer patients and are interested in the probability that 10 or more of these patients are still alive after 5 years  In other words, if we select repeated samples of size n=50 patients diagnosed with lung cancer, what fraction of the samples will have 20% or more survivors?

25 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.25 Lung Cancer Example  Since np = 50(0.1) = 5 and n(1-p) = 50(0.9) =45>5 the normal approximation should be adequate  Then,  Only 0.9% probability that 10 (20%) or more cancer patients would survive after five years from a sample of 50

26 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.26 Lung Cancer Example  In the previous example, we did not know the true proportion of 5-year survivors among individuals under 40 years of age that have been diagnosed with lung cancer  If it is known from previous studies that the five-year survival rate of lung cancer patients that are older than 40 years old is 8.2%, we might want to test whether the five-year survival among the younger lung-cancer patients is the same as that of the older ones

27 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.27 Lung Cancer Example  From a sample of n=52 patients under the age of 40 that have been diagnosed with lung cancer, the proportion surviving after five years is (or 6 patients)  Is this within sampling variability of the known 5-year survival of older patients? 0.082

28 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.28 Lung Cancer Example  The test of hypothesis is constructed as follows:

29 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.29  The test statistic is: Lung Cancer Example P(|Z|>0.87)=P(Z>0.87)+P(Z<-0.87) = 0.384 0.082

30 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.30  Since the p value associated with the two-sided test is 0.380 > α = 0.05, we do NOT reject the null hypothesis  That is, there is not sufficient evidence to indicate that the five-year survival of lung cancer patients who are younger than 40 years of age is different than that of the older patients Lung Cancer Example

31 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.31 Lung Cancer Example  STATA output for this normal approximation:  Again, we do not reject the null hypothesis  Note that any differences with the hand calculations are due to round-off error (See Lab #8 for appropriate STATA code)

32 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.32 Lung Cancer Example  Carrying out an exact binomial test in STATA we have:  We see that the p value associated with the two-sided test is 0.318, which is close to that calculated previously

33 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.33 Confidence Intervals  Similar to the testing of hypothesis involving proportions, we can construct confidence intervals for a population proportion (or for a difference between two proportions)  Again these intervals will be based on the statistic: where and are the estimates of the proportion and its associated standard deviation, respectively

34 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.34 Confidence Intervals  Two-sided Confidence Interval:  One-sided Confidence Intervals:

35 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.35  In the previous example, if 6 out of 52 lung cancer patients under 40 years of age were alive after five years, and using the normal approximation (which is justified since np=52(0.115)=5.98 ≥ 5, and n(1-p)= 52(1- 0.115)=46.02 ≥ 5), an approximate 95% confidence interval for the true proportion p is given by Lung Cancer Example

36 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.36  In other words, we are 95% confident that the true five-year survival of lung-cancer patients < 40 years of age is between 2.8% and 20.2%  Note that this interval contains 8.2% (the five-year survival rate among lung cancer patients that are older than 40 years of age)  Thus, it is equivalent to a hypothesis test that did not reject the null hypothesis of equal five-year survival between lung cancer patients that are older than 40 years of age versus younger subjects Lung Cancer Example

37 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.37 Exact Confidence Intervals  “Exact” confidence intervals for a binomial parameter are possible  These do not rely on the normal approximation to the binomial (i.e., use of the CLT)  Computationally very intensive (particularly for large N)  Typically require special programming/software

38 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.38 Exact Confidence Intervals  General Rule:  Use exact confidence intervals whenever software is available and is feasible given the computing resources  If N is large…  It is OK to use normal approximation (as CLT kicks in)  If N is small…  The normal approximation may not be appropriate  Use exact CIs if possible

39 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.39 Lung Cancer Example  In the previous example, using exact binomial confidence intervals we have which is close to our calculations that used the normal approximation

40 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.40 Two Proportions  Comparing two proportions is similar to a two- mean comparison…  First and second group proportions: and  Under assumptions of equality of the two population proportions, we may want to derive a pooled estimate of the sample proportion:

41 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.41 Two Proportions  Using this pooled estimate, we can derive a pooled estimate of the standard deviation of the unknown proportion (assumed equal between the two groups) as:  The hypothesis testing of comparisons between two proportions is based on the statistic:

42 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.42 Two Proportions: Hypothesis Test  Hypothesis test for difference in two proportions:

43 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.43 Car Accident Example  A study investigating morbidity and mortality among pediatric victims of motor vehicles accidents  Two random samples were selected to investigate the effectiveness of seat belts: 1.n 1 =123 from a population of children that were wearing seat belts at the time of the accident 2.n 2 =290 from a group of children that were not wearing seat belts at the time of the accident

44 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.44 Car Accident Example  In the first case, x 1 =3 children died, while in the second x 2 =13 died  Consequently, and  The estimated difference between these proportions:

45 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.45 Car Accident Example  We wish to determine if the death rate is different in the two groups and carry out the test of hypothesis as proposed earlier:  Thus, there is not sufficient evidence to conclude that children not wearing seat belts are safer (die at different rates) than children wearing seat belts

46 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.46 Two Proportions: Confidence Intervals  Confidence intervals of the difference of two proportions are also based on the statistic:  The standard deviation estimate is

47 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.47 Two Proportions: Confidence Intervals  Note discrepancy between hypothesis testing and CI’s:  We no longer need to assume that the two proportions are equal, so the estimate of the standard deviation in the denominator is not a pooled estimate, but simply the sum of the standard deviations in each group  This deviation from hypothesis testing may lead to disagreements between decisions reached through usual hypothesis testing versus hypothesis testing performed using confidence intervals (infrequent issue)

48 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.48 Two Proportions: Confidence Intervals

49 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.49 Car Accident Example  A two-sided 95% confidence interval for the true difference in death rates among children wearing seat belts versus those that did not is given by:

50 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.50 Car Accident Example  STATA output:

51 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.51 Car Accident Example  That is, the true difference between the two groups will be between 5.7% in favor of those children wearing seat belts, to 1.6% in favor of those children not wearing seat belts  Since the zero (hypothesized under the null hypothesis) difference is included in the confidence interval we fail to reject the null hypothesis  There is no evidence to suggest a benefit of seat belts  But enough of a trend, that we may want to do further studies…

52 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.52 Special Case  What if we observe 0 events or responses?  How do we get a CI for the response rate when the variability is 0?  Example: ACTG A5129, “Absence of Sustained Hyperlactatemia Among HIV- Infected Patients with Risk Factors for Mitochondrial Toxicity”

53 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.53 Special Case  From abstract: Wohl DA, Pilcher CD, Evans SR, Revuelta M, McComsey G, Yang Y, Zackin R, Alston B, Welch S, Basar M, Kashuba A, Kondo P, Martinez A, Giardini J,Quinn J, Littles M, Wingfield H, Koletar SL, “Absence of Sustained Hyperlactatemia Among HIV-Infected Patients with Risk Factors for Mitochondrial Toxicity”, Journal of Acquired Immune Deficiency Syndromes (JAIDS), 35:3:274-278, 2004.

54 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.54 Special Case

55 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.55 Special Case  There were no episodes of symptomatic hyperlactatemia or lactic acidosis during the study  A 95% confidence interval for the prevalence of hyperlactatemia given our data (i.e., 0 out of 83) is:  In other words, the true prevalence of hyperlactatemia as defined by this study is likely to be less than 3.54% 95% CI = (0,1-α 1/n ) = (0, 0.0354)

56 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.56 Review: Coin Flipping  What’s the chance that heads comes up 8+ in 10 flips of a coin?  Since x=8>np=10(0.5)=5, we apply the normal approximation with continuity correction:  What’s the area in the upper tail of the standard normal, above 1.581?

57 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.57 Review: Coin Flipping  Thus, there’s a 5.7% chance that we could have 8, 9, or 10 heads on 10 flips of the coin

58 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.58 Review: Basketball Statistics  As of November 27th, the Celtics have 10 wins out of 11 games:  How likely is it that they just had a good start are a really just a.500 team?  Ho: Celtics win 50% of games (p=0.500)  Ha: Celtics win >50% of games (p>0.500)  Set α=0.05

59 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.59 Review: Basketball Statistics  One-sample test of proportion (one-sided)  0.3% chance we would observe this data (the Celtics present winning record) given they are really a.500 team  Since P<0.05, we can say the Celtics are NOT a.500 team!  We reject the null that their winning percentage is 50% 0.500

60 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.60 Review: Basketball Statistics  One-Sided Confidence Interval  We know that 95% of the time, the interval from 0.66 to infinity will cover the true winning percentage of the Celtics  Since 0.5 does not fall in the interval, we can say the Celtics are NOT a.500 team!  We reject the null that their winning percentage is 50%

61 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.61 Review: Basketball Statistics  Comparing Free Throw Percentages:  Kevin Garnett:  Ray Allen:  Combined:  And pooled estimate of S.D.:

62 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.62 Review: Basketball Statistics  Ho: and Ha:  Two-sample test of proportions:  Two-sided test this time…  No statistically significant difference between Garnett and Allen’s free throw percentages… 0

63 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.63 Review: Basketball Statistics  Confidence interval for difference in proportions:  Since zero does fall in this confidence interval, we can’t reject the null hypothesis that Garnett and Allen are equally proficient free throw shooters

64 Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.64 Where we’ve been… and where we’re going Variables of Interest? One Variable Continuous (Methods from Before Midterm) Binary One-sample test of proportion Two-sample test t of proportion Exact Methods Normal Approximation Two Variables Both Continuous Interested in prediction Simple Linear Regression Interested in association Both variables normal Pearson Correlation Not normal Spearman Correlation One Continuous, one categorical ANOVA Both Binary (in two weeks) More than Two Variables Multiple Linear Regression


Download ppt "Introduction to Biostatistics, Harvard Extension School, Fall 2007 © Scott Evans, Ph.D. & Lynne Peeples, M.S.1 Counts and Proportions."

Similar presentations


Ads by Google