Copyright (c) Bani K. Mallick. STAT 651 Lecture #15
Topics in Lecture #15: Some basic probability. The binomial distribution. Inference about a single population proportion.
Book Sections Covered in Lecture #15: Chapter 10.2.
Lecture 14 Review: Nonparametric Methods. Replace each observation by its rank in the pooled data. Do the usual ANOVA F-test on the ranks. Kruskal-Wallis test.
Lecture 14 Review: Nonparametric Methods. Once you have decided that the populations are different in their means, there is no version of an LSD. You simply have to do each comparison in turn. This is a bit of a pain in SPSS, because you physically must do each 2-population comparison, defining the groups as you go.
Categorical Data: Not all experiments are based on numerical outcomes. We will deal with categorical outcomes, i.e., outcomes where the result for each individual is a category. The simplest categorical variable is binary: success or failure, male or female.
Categorical Data: For example, consider flipping a fair coin, and let X = 0 mean “tails” and X = 1 mean “heads”.
Categorical Data: The fraction of the population who are “successes” will be denoted by the Greek symbol $\pi$. Note that because it is a Greek symbol, it represents something to do with a population. For coin flipping, if you flipped all the fair coins in the world (the population), the fraction of the times they turn up heads equals $\pi = 1/2$.
Categorical Data: The fraction of the sample of size n who are “successes” is going to be denoted by $\hat{\pi}$. We want to relate $\hat{\pi}$ to the population fraction $\pi$. Let X = number of successes in the sample. The fraction $\hat{\pi}$ = (# successes)/n = X / n.
Categorical Data: Suppose you flip a coin 10 times and get 6 heads. The proportion of heads is $\hat{\pi}$ = 6/10 = 0.60. The percentage of heads is 60%.
Categorical Data: The number of successes X in n experiments, each with probability of success $\pi$, is called a binomial random variable. There is a formula for this: $\Pr(X = k) = \frac{n!}{k!\,(n-k)!}\,\pi^{k}(1-\pi)^{n-k}$, where 0! = 1, 1! = 1, 2! = 2 x 1 = 2, 3! = 3 x 2 x 1 = 6, 4! = 4 x 3 x 2 x 1 = 24, etc.
Categorical Data: The idea is to relate the sample fraction $\hat{\pi}$ to the population fraction $\pi$ using the binomial formula. Key Point: if we knew $\pi$, then we could entirely characterize the fraction of experiments that have k successes.
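The binomial formula is easy to evaluate by machine. The following is a minimal Python sketch, not part of the original lecture; the function name `binomial_pmf` and the argument name `pi` (the population fraction of successes) are illustrative choices.

```python
from math import comb


def binomial_pmf(k, n, pi):
    """Pr(X = k) = n! / (k! (n-k)!) * pi^k * (1 - pi)^(n - k)."""
    # comb(n, k) computes n! / (k! (n-k)!) without overflow issues
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)


# Chance of exactly 6 heads in 10 flips of a fair coin
p6 = binomial_pmf(6, 10, 0.5)
print(round(p6, 4))

# The probabilities over k = 0, 1, ..., n must add up to 1
total = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
print(total)
```

Knowing `pi` is what makes the whole distribution of X computable, which is the key point of the slide.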
Categorical Data: The probability that the coin lands on heads will be denoted by the Greek symbol $\pi$. Suppose you flip a coin 2 times and count the number of heads. So here, X = number of heads that arise when you flip a coin 2 times. X takes on the values 0, 1 and 2, so $\hat{\pi}$ = X/2 takes on the values 0/2, 1/2, 2/2.
Categorical Data: What the binomial formula does. The experiment results in 4 equally likely outcomes; each occurs 1/4 of the time:
                    Tails on toss #1    Heads on toss #1
Tails on toss #2          1/4                 1/4
Heads on toss #2          1/4                 1/4
Categorical Data: Take heads = “success”. Of the four equally likely outcomes of two tosses (each with probability 1/4), one has 0 heads, two have exactly 1 head, and one has 2 heads, so Pr(X = 0) = 1/4, Pr(X = 1) = 1/2 and Pr(X = 2) = 1/4. The binomial formula can be used to give these results without thinking.
Categorical Data: With n = 2 and k = 1, we have k! = 1, n! = 2 and (n-k)! = 1, so $\Pr(X = 1) = \frac{2!}{1!\,1!}\,(1/2)^{1}(1/2)^{1} = 1/2$. The binomial formula gives the answer 1/2, which we know to be correct.
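The two-toss example can be checked by listing every outcome and comparing the counts with the binomial formula. This is an illustrative sketch only; the variable names are assumptions, and exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product
from math import comb

# All 4 equally likely outcomes of two tosses: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))

# Probability of k heads by counting rows of the table
by_table = {
    k: Fraction(sum(1 for o in outcomes if o.count("H") == k), len(outcomes))
    for k in range(3)
}

# The same probabilities from the binomial formula with n = 2, pi = 1/2
half = Fraction(1, 2)
by_formula = {k: comb(2, k) * half**k * (1 - half) ** (2 - k) for k in range(3)}

print(by_table)
print(by_table == by_formula)
```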
Categorical Data: Roll a fair die. Each of the six faces (1 through 6) is equally likely, so what are the probabilities? Each face has probability 1/6. What is the chance of rolling a 1 or a 2? 2/6 = 1/3.
Categorical Data: Now roll two fair dice. Every combination of first die and second die is equally likely, so what are the probabilities? Each of the 36 combinations has probability 1/36. Define a success as rolling a 1 or a 2. What is the chance of two successes? 4/36 = 1/9. What is the chance of two failures? 16/36 = 4/9.
Categorical Data: So, a success occurs when you roll a 1 or a 2. Pr(success on a single die) = 2/6 = 1/3 = $\pi$. Pr(2 successes) = 1/3 x 1/3 = 1/9. Use the binomial formula for Pr(X = k) with n = 2 and k = 2: k! = 2, n! = 2 and (n-k)! = 0! = 1, so $\Pr(X = 2) = \frac{2!}{2!\,0!}\,(1/3)^{2}(2/3)^{0} = 1/9$.
Categorical Data: In other words, the binomial formula works in these simple cases, where we can draw nice tables. Now think of rolling 4 dice, and ask for the chance that on 3 of the 4 dice you get a 1 or a 2. Too big a table: we need a formula.
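A computer can still enumerate the table that is too big to draw. The sketch below is an illustration rather than part of the lecture: it reads the question as "exactly 3 successes among 4 dice", counts the matching rows among all 6^4 = 1296 rolls, and compares the result with the binomial formula using n = 4, k = 3 and a success probability of 1/3.

```python
from fractions import Fraction
from itertools import product
from math import comb


def is_success(face):
    """A success is rolling a 1 or a 2."""
    return face in (1, 2)


# Every equally likely roll of four dice
rolls = product(range(1, 7), repeat=4)

# Count the rolls with exactly 3 successes
hits = sum(1 for roll in rolls if sum(is_success(f) for f in roll) == 3)
by_enumeration = Fraction(hits, 6**4)

# Binomial formula with n = 4, k = 3, pi = 1/3
pi = Fraction(1, 3)
by_formula = comb(4, 3) * pi**3 * (1 - pi) ** 1

print(by_enumeration, by_formula)  # both are 8/81
```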
Categorical Data: Does it matter what you call a “success” and what you call a “failure”? No, as long as you keep track. For example, in a class experiment many years ago, men were asked whether they preferred to wear boxers or briefs. This is binary, because there are only 2 outcomes. “Success” = ?????
Categorical Data: Binary experiments have sampling variability, just like sample means, etc. Experiment: “success” = being under 5’10” in height. Compare the first 6 men with SSN < 5 to the first 6 men with SSN > 5. Note how the number of “successes” was not the same! (I might have to do this a few times.)
Categorical Data: The sample fraction is a random variable. This means that if I do the experiment over and over, I will get different values. These different values have a standard deviation.
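Repeating the experiment is cheap to simulate. The following sketch is purely illustrative: the sample size n = 25, the population fraction 0.30, the number of repetitions and the random seed are all assumed values chosen for the demonstration, not data from the lecture.

```python
import random
from statistics import mean, pstdev

rng = random.Random(651)      # fixed seed (arbitrary) so the run is repeatable
n, pi, reps = 25, 0.30, 5000  # assumed values for illustration


def sample_fraction():
    """One experiment: n binary trials, return the fraction of successes."""
    successes = sum(rng.random() < pi for _ in range(n))
    return successes / n


sample_fractions = [sample_fraction() for _ in range(reps)]

# The values differ from experiment to experiment ...
print(sample_fractions[:5])
# ... they center near pi and have a standard deviation of their own
print(mean(sample_fractions), pstdev(sample_fractions))
```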
Categorical Data: The sample fraction $\hat{\pi}$ has a standard error. Its standard error is $\sqrt{\pi(1-\pi)/n}$. Note how if you have a bigger sample, the standard error decreases. The standard error is biggest when $\pi$ = 0.50.
Categorical Data: The standard error of the sample fraction is $\sqrt{\pi(1-\pi)/n}$. The estimated standard error based on the sample replaces $\pi$ with $\hat{\pi}$: $\sqrt{\hat{\pi}(1-\hat{\pi})/n}$.
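Both properties of the standard error, shrinking with larger samples and peaking at a fraction of 0.50, can be seen numerically. This is a small sketch under the standard formula sqrt(pi (1 - pi) / n); the function name and the particular values of n and pi are illustrative assumptions.

```python
from math import sqrt


def standard_error(pi, n):
    """Standard error of a sample fraction: sqrt(pi * (1 - pi) / n)."""
    return sqrt(pi * (1 - pi) / n)


# A bigger sample gives a smaller standard error (here pi = 0.30)
for n in (25, 100, 400):
    print(n, round(standard_error(0.30, n), 4))

# For fixed n, search a grid of fractions for the largest standard error
grid = [i / 100 for i in range(1, 100)]
worst = max(grid, key=lambda p: standard_error(p, 100))
print(worst)  # 0.5
```

Passing the sample fraction in place of `pi` gives the estimated standard error.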
Categorical Data: It is possible to make confidence intervals for the population fraction if the number of successes > 5 and the number of failures > 5. If this is not satisfied, consult a statistician. Under these conditions, the Central Limit Theorem says that the sample fraction is approximately normally distributed (in repeated experiments).
Categorical Data: A $(1-\alpha)100\%$ CI for the population fraction is $\hat{\pi} \pm z_{\alpha/2}\sqrt{\hat{\pi}(1-\hat{\pi})/n}$, where $z_{\alpha/2}$ is found by looking up $1-\alpha/2$ in Table 1.
Categorical Data: Often, you will only know the sample proportion/percentage and the sample size. There are two ways of computing the confidence interval for the population proportion: by hand, or by SPSS (this is a pain if you do not have the data entered already). Because you may need to do this by hand, I will make you do this.
Categorical Data: Example of the $(1-\alpha)100\%$ CI for the population fraction, $\hat{\pi} \pm z_{\alpha/2}\sqrt{\hat{\pi}(1-\hat{\pi})/n}$. For a 95% CI, $z_{\alpha/2}$ = 1.96. With n = 25 and $\hat{\pi}$ = 0.30, the interval is $0.30 \pm 1.96\sqrt{0.30 \times 0.70/25} = 0.30 \pm 0.18$, i.e., 0.12 to 0.48. Interpretation? The proportion of successes in the population is from 0.12 to 0.48 (12% to 48%) with 95% confidence.
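The hand calculation can be reproduced in a few lines. This is a sketch of the normal-approximation interval only; the function name `proportion_ci` is an illustrative choice, and the z value is taken from Python's `statistics.NormalDist` rather than from a printed table, so it is 1.95996... instead of the rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist


def proportion_ci(p_hat, n, confidence=0.95):
    """Normal-approximation CI: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95%
    se = sqrt(p_hat * (1 - p_hat) / n)       # estimated standard error
    return p_hat - z * se, p_hat + z * se


lower, upper = proportion_ci(0.30, 25)
print(round(lower, 2), round(upper, 2))  # 0.12 0.48
```

The sketch does not check the requirement that successes and failures both exceed 5; here there are 7.5 and 17.5, so the condition holds.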
Categorical Data: You can use SPSS as long as the number of successes and the number of failures both exceed 5. To get the confidence intervals, you first have to define a numeric version of your variable that classifies whether an observation is a success or a failure. You then compute the 1-sample confidence interval using “Descriptives” and then “Explore”. (Demo.)
Categorical Data: If you set up your data in SPSS as 0’s and 1’s, the “mean” will be the proportion/fraction/percentage of 1’s. For example, with n = 10 observations of which 4 are 1’s, Mean = 4/10 = 0.40, so $\hat{\pi}$ = 0.40.
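The same fact holds in any software: coding successes as 1 and failures as 0 makes the mean equal to the sample fraction. In the sketch below the ten data values are hypothetical, made up only so that four of them are 1's.

```python
# Hypothetical 0/1 data: 1 = success, 0 = failure (four 1's out of ten)
data = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]

n = len(data)
successes = sum(data)   # adding 0's and 1's counts the 1's
p_hat = successes / n   # the mean of the data is the sample fraction

print(n, successes, p_hat)  # 10 4 0.4
```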
Boxers versus briefs for males: In this output, boxers = 1 and briefs = 0.
Boxers versus briefs for males: what % prefer boxers? In the sample, 46.81%. In the population??? [SPSS Descriptives output for “Boxers or Briefs Preference”: Mean with Std. Error, and 95% Confidence Interval for Mean (Lower Bound, Upper Bound).] In this output, boxers = 1 and briefs = 0. The proportion of 1’s is the mean.
Boxers versus briefs for males: what % prefer boxers? Between 39.61% and 54.01%. [SPSS Descriptives output for “Numeric Boxers: 0 = Briefs, 1 = Boxers”, Gender = Male: Mean with Std. Error, and 95% Confidence Interval for Mean (Lower Bound, Upper Bound).]
Boxers versus briefs: In the sample, 46.81% of the men preferred boxers to briefs; 53.19% preferred briefs. Between 39.61% and 54.01% of men prefer boxers to briefs (95% CI). Is there enough evidence to conclude that men generally prefer briefs? No, since 50% is in the CI! This means that it is plausible (at 95% confidence) that 50% prefer boxers and 50% prefer briefs, i.e., $\pi$ = 0.50.
Sample Size Calculations: The standard error of the sample fraction is $\sqrt{\pi(1-\pi)/n}$. If you want a $(1-\alpha)100\%$ CI to be $\hat{\pi} \pm E$, you should set $E = z_{\alpha/2}\sqrt{\pi(1-\pi)/n}$.
Sample Size Calculations: This means that $n = z_{\alpha/2}^{2}\,\pi(1-\pi)/E^{2}$.
Sample Size Calculations: The small problem is that you do not know $\pi$. You have two choices: (1) make a guess for $\pi$, or (2) set $\pi$ = 0.50 and calculate (most conservative, since it results in the largest sample size). Most polling operations make the latter choice, since it is most conservative.
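Sample Size Calculations: Examples. Set E = 0.04 with a 95% CI. If you guess that $\pi$ = 0.30: $n = (1.96)^{2}(0.30)(0.70)/(0.04)^{2} = 504.21$, so take n = 505. If you have no good guess, use $\pi$ = 0.50: $n = (1.96)^{2}(0.50)(0.50)/(0.04)^{2} = 600.25$, so take n = 601.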
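The sample size calculation can be sketched as a short function. This is an illustration under the formula n = z^2 pi (1 - pi) / E^2; the function name, the default of 0.50 for the guessed fraction, and the choice to round up to the next whole observation are assumptions of this sketch.

```python
from math import ceil
from statistics import NormalDist


def sample_size(E, pi=0.50, confidence=0.95):
    """Smallest n with z * sqrt(pi * (1 - pi) / n) <= E.

    pi defaults to 0.50, the most conservative guess.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    return ceil(z**2 * pi * (1 - pi) / E**2)


print(sample_size(0.04, pi=0.30))  # 505
print(sample_size(0.04))           # 601, using the conservative pi = 0.50
```

Halving E roughly quadruples the required n, because E enters the formula squared.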