Review of Basic Statistical Concepts
Statistical Literacy means knowing…
…how to read rates and percentages: e.g., the percentage of MPPAL students who are full-time employees versus the percentage of full-time employees who are MPPAL students
…how to interpret different definitions of a group: e.g., which rate is bigger, the child birth rate among all women or the child birth rate among women ages 20-44?
…the difference between (1) deterministic causes and (2) probabilistic causes: e.g., (1) gravity causes the pen to fall and (2) drunk driving causes automobile accidents
Probability
The likelihood or chance that an event will occur
Expressed as a number between 0 (impossibility) and 1 (certainty), or as a percentage between 0% (impossibility) and 100% (certainty)
Objectivist Frame: when repeating an experiment, how often does the event occur?
Subjectivist Frame: what is the degree of belief in the likelihood of an event occurring? (e.g., Bayesian probability)
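The objectivist frame can be illustrated with a small simulation. This is only a sketch: the function name `relative_frequency`, the event probability of 0.5 (a fair coin) and the fixed seed are assumptions made for illustration. It shows the observed frequency settling near the underlying probability as the experiment is repeated more often.

```python
import random


def relative_frequency(trials: int, p_event: float, seed: int = 0) -> float:
    """Fraction of `trials` independent repetitions in which the event occurs."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(rng.random() < p_event for _ in range(trials))
    return hits / trials


# Hypothetical experiment: a fair coin, so the event probability is assumed to be 0.5.
for trials in (10, 100, 10_000, 100_000):
    print(f"{trials:>7} repetitions: observed frequency = "
          f"{relative_frequency(trials, p_event=0.5):.4f}")
```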
Normal Distribution
Population Mean: μ and Standard Deviation: σ
Describing the Normal Distribution
If X is normally distributed with mean μ = 0 and variance σ² = 1 (so σ = 1), then:
95.4% of the values will fall within μ ± 2*σ = μ ± 2
95% will fall within a slightly smaller interval: μ ± 1.96
90% will fall within μ ± 1.645
99% will fall within μ ± 2.576
99.7% will fall within μ ± 3
What percentage of all conceivable means will lie between -1.96 and +1.96? 95%
A 95% “Confidence Interval” is an interval constructed so that, over repeated samples, it contains the population mean (μ) 95% of the time
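The percentages above can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper names `coverage` and `critical_value` are hypothetical, introduced only for this illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
Z = NormalDist(mu=0.0, sigma=1.0)


def coverage(z: float) -> float:
    """Share of values falling within mu ± z*sigma."""
    return Z.cdf(z) - Z.cdf(-z)


def critical_value(confidence: float) -> float:
    """Two-sided critical z value for a given confidence level, e.g. 0.95."""
    alpha = 1.0 - confidence
    return Z.inv_cdf(1.0 - alpha / 2.0)


for z in (1.645, 1.96, 2.0, 2.576, 3.0):
    print(f"within ±{z}: {coverage(z):.1%}")

for confidence in (0.90, 0.95, 0.99):
    print(f"{confidence:.0%} confidence: z = {critical_value(confidence):.3f}")
```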
95% Confidence Interval and Confidence Level
What is the probability that we will observe a sample mean that lies outside the 95% interval around μ? 5% or 0.05
This probability is the significance level, noted as α = 0.05; the Confidence Level is 1 - α = 0.95 for a 95% Confidence Interval
For α = 0.05 and σ = 1, the critical value is 1.96
When μ = 0 and the standard error is 1, 95% of the sample means will fall within (-1.96, +1.96)
For an observed mean of 6.5 and a 95% interval half-width of 0.47 (the figure reported as “Confidence Level (95%)”), what is the 95% Confidence Interval?
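The closing question can be worked through in a few lines. This sketch assumes that the 0.47 in the question is the half-width of the interval (the distance from the centre to either end); the function name `confidence_interval` is hypothetical.

```python
def confidence_interval(center: float, half_width: float) -> tuple:
    """Return (lower, upper) for an interval written as center ± half_width."""
    return (center - half_width, center + half_width)


alpha = 0.05                    # probability of falling outside the interval
confidence_level = 1.0 - alpha  # 0.95 for a 95% Confidence Interval

# Values taken from the question; reading 0.47 as the half-width is an assumption.
lower, upper = confidence_interval(center=6.5, half_width=0.47)
print(f"{confidence_level:.0%} interval: ({lower:.2f}, {upper:.2f})")
```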
The Challenge
We can never know if we are observing the “true” population mean, since any observed mean will deviate from it, typically by an amount on the order of σ (a “standard deviation”)
Any census of one graduating class is still only a sample from the “true” population of all graduating classes
Uncertainty Source #1: If we are seeking to explain the key factors determining the CGPA of graduates, we have to account for the fact that the observed CGPA might deviate from the true mean by σ
Uncertainty Source #2: If it is infeasible to conduct a census of all graduating students in even one year, and all we can do is sample the sample (draw a sample from a class that is itself a sample), then we have additional uncertainty related to the size of the sample
Uncertainty from Sampling
We know there is one inescapable source of uncertainty (Uncertainty Source #1)
The sampling error (Uncertainty Source #2) complicates this uncertainty … but in a predictable way
We know that as the sample size n increases, the standard deviation of the observed mean values (the “standard error of the mean”, denoted “s”) shrinks in proportion to 1/√n, so the observed mean approaches the true mean (μ) of the theoretical population: s = σ/√n
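To see how predictable this is, the standard error can be tabulated for a few sample sizes. A minimal sketch, assuming σ = 1 purely for illustration; `standard_error` is a hypothetical helper name.

```python
import math


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean: s = sigma / sqrt(n)."""
    if n <= 0:
        raise ValueError("sample size n must be positive")
    return sigma / math.sqrt(n)


# Quadrupling the sample size halves the standard error.
for n in (1, 4, 25, 100, 400):
    print(f"n = {n:>3}: s = {standard_error(1.0, n):.3f}")
```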
Margin of Error
Whenever a sample is less than the census, there is a chance that the sample standard deviation ≠ σ and the sample mean (avgX) ≠ μ
When σ has to be estimated from the sample, the standardized sample mean follows a “Student t” distribution; the logic of the Confidence Interval follows but some of the vocabulary changes
We are interested in how close avgX from our sample is likely to be to μ
The Margin of Error is the half-width of the Confidence Interval: the largest gap between avgX and μ expected at the chosen confidence level
At a 95% confidence level, the interval avgX ± Margin of Error would include μ in 95% of similar samples
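The phrase "in 95% of similar samples" can be made concrete by drawing many samples and counting how often the interval around each sample mean contains μ. This is a sketch under stated assumptions: the population is normal, σ is treated as known (so the z value 1.96 is used rather than a Student t value), and μ = 6.5, σ = 1 and n = 25 are hypothetical values. The name `coverage_rate` is also hypothetical.

```python
import math
import random


def coverage_rate(mu: float, sigma: float, n: int, repeats: int,
                  z: float = 1.96, seed: int = 1) -> float:
    """Share of simulated samples whose interval avgX ± z*sigma/sqrt(n) contains mu."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    margin = z * sigma / math.sqrt(n)  # half-width of each interval
    hits = 0
    for _ in range(repeats):
        avg_x = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if abs(avg_x - mu) <= margin:
            hits += 1
    return hits / repeats


rate = coverage_rate(mu=6.5, sigma=1.0, n=25, repeats=2000)
print(f"intervals containing mu: {rate:.1%}")
```

The printed share should land close to 95%, with some run-to-run variation if the seed is changed.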
Confidence Interval, Margin of Error and Sample Size
At a 5% significance level (α = 0.05), we obtain a 95% Confidence Interval around the sample mean
For an approximately normal distribution, the standard error is s = σ/√n
Then the 95% Confidence Interval = avgX ± 1.96*s, where 1.96 is the critical Z value for 95%; the interval (avgX - 1.96*s, avgX + 1.96*s) is 2*1.96*s = 3.92*s units wide (= 3.92*σ/√n)
If σ is known and the Margin of Error B (the half-width 1.96*σ/√n) is specified, one could precisely calculate the sample size needed: n = (1.96*σ/B)²
Generally, n = 1/B², where B is the margin of error expressed as a proportion; this rule of thumb follows from rounding 1.96 up to 2 and taking σ = 0.5, the largest standard deviation a proportion can have
To limit the Margin of Error to 5%, we need a sample size n = 400 [= 1/(0.05)²]
For a larger Margin of Error of, say, 10%, we need n = 100 [= 1/(0.10)²]
If n = 50, we have a Margin of Error of about 14% [= 1/√50]
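The sample-size arithmetic on this slide can be sketched in code. The function names below are hypothetical, and the example with σ = 0.5 is an illustrative assumption rather than a value from the slides. The sketch covers the exact calculation for a known σ, the 1/B² rule of thumb, and the reverse question of what margin a given n delivers.

```python
import math


def sample_size_known_sigma(sigma: float, margin: float, z: float = 1.96) -> int:
    """Smallest n whose half-width z*sigma/sqrt(n) does not exceed `margin`."""
    return math.ceil((z * sigma / margin) ** 2)


def rule_of_thumb_n(margin: float) -> int:
    """Rule of thumb n = 1/B^2, rounded to the nearest whole observation."""
    return round(1.0 / margin ** 2)


def rule_of_thumb_margin(n: int) -> float:
    """Margin of error implied by the same rule of thumb: B = 1/sqrt(n)."""
    return 1.0 / math.sqrt(n)


print(sample_size_known_sigma(sigma=0.5, margin=0.05))  # exact formula, assumed sigma
for margin in (0.10, 0.05):
    print(f"B = {margin:.0%}: n = {rule_of_thumb_n(margin)}")
print(f"n = 50: B = {rule_of_thumb_margin(50):.1%}")
```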
Sample Size when σ is unknown
When the true standard deviation of the population is unknown, an indirect method of determining the sample size (n = 1/B²) yields:
For a 95% confidence interval and Margin of Error of …
… 5%, a sample of 400 (n = 400) is needed
… 10%, n = 100
… 3%, n ≈ 1,100 [1/(0.03)² ≈ 1,111]
… 1%, n = 10,000
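One common way to arrive at figures like these without knowing σ is to plan for the worst case of a proportion, p = 0.5, where p(1 - p) is largest. Whether this is the exact indirect method the slide has in mind is an assumption; the sketch below (with the hypothetical name `sample_size_proportion`) shows only that the worst case with z rounded to 2 reproduces n = 1/B², while z = 1.96 gives slightly smaller samples.

```python
import math


def sample_size_proportion(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Sample size n = z^2 * p * (1 - p) / B^2, rounded up to a whole observation."""
    raw = z ** 2 * p * (1.0 - p) / margin ** 2
    # Round away floating-point noise before taking the ceiling.
    return math.ceil(round(raw, 9))


for margin in (0.10, 0.05, 0.03, 0.01):
    exact = sample_size_proportion(margin)            # z = 1.96
    rounded = sample_size_proportion(margin, z=2.0)   # z rounded to 2, i.e. 1/B^2
    print(f"B = {margin:.0%}: n = {exact} (z = 1.96), n = {rounded} (z = 2)")
```

Planning with p = 0.5 is conservative: any other proportion needs a smaller sample for the same margin.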