Statistics: Statistical Inference. Krishna V. Palem, Kenneth and Audrey Kennedy Professor of Computing, Department of Computer Science, Rice University
Sampling distribution. [Figure: Sample 1, Sample 2, Sample 3, Sample 4, ..., Sample k are drawn from the Population; the statistics computed from the samples form the Sampling Distribution.]
Central Limit Theorem. (4) The mean of the sampling distribution of x̄ is equal to the population mean, i.e. $\mu_{\bar{x}} = \mu$. (5) The standard deviation of the sampling distribution of x̄ is the population standard deviation divided by the square root of the sample size, i.e. $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
Sampling distribution of x̄ for a Normal population
Sampling distribution of x̄ for a non-Normal population
Computer simulation of the sampling distribution of the sample mean. Pick any probability distribution and specify a mean and standard deviation. Tell the computer to randomly generate 1000 observations from that probability distribution (the computer is more likely to generate values that have high probability). Plot the “observed” values in a histogram. Next, tell the computer to randomly generate 1000 averages-of-2 (randomly pick 2 values and take their average) from that probability distribution. Plot the “observed” averages in a histogram. Repeat for averages-of-10, and averages-of
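The simulation described above can be sketched in Python roughly as follows. This is a minimal sketch, not the code used to produce the histograms on the following slides: the function names (`simulate_averages`, `uniform01`, `exponential1`), the fixed seed, and the choice of an Exponential distribution with rate 1 are assumptions made for illustration. Instead of plotting, it prints the mean and standard deviation of the 1000 averages for several sample sizes.

```python
import random
import statistics


def simulate_averages(draw, sample_size, n_averages=1000, seed=0):
    """Return n_averages sample means, each over sample_size values from draw(rng)."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    return [
        statistics.fmean(draw(rng) for _ in range(sample_size))
        for _ in range(n_averages)
    ]


def uniform01(rng):
    """One observation from the Uniform distribution on [0, 1]."""
    return rng.random()


def exponential1(rng):
    """One observation from an Exponential distribution with rate 1 (assumed)."""
    return rng.expovariate(1.0)


for name, draw in [("uniform", uniform01), ("exponential", exponential1)]:
    for size in (1, 2, 5, 100):
        averages = simulate_averages(draw, size)
        # Mean and spread of the 1000 simulated averages for this sample size
        print(name, size,
              round(statistics.fmean(averages), 3),
              round(statistics.stdev(averages), 3))
```

As the sample size grows, the printed standard deviation should shrink roughly like σ/√n while the mean stays near the population mean. Passing `averages` to a plotting library of your choice would give the histograms.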
Uniform Distribution on [0,1]: average of 1 sample (original distribution)
Uniform Distribution: 1000 averages of 2 samples
Uniform Distribution: 1000 averages of 5 samples
Uniform Distribution: 1000 averages of 100 samples
Exponential Distribution: 1000 averages of 2 samples
Exponential Distribution: average of 1 sample (original distribution)
Exponential Distribution: 1000 averages of 5 samples
Exponential Distribution: 1000 averages of 100 samples
Contents: Summary of Statistics Learnt so Far; Statistical Inference; Central Limit Theorem and its implications; Estimation theory; Interval Estimation; What is Confidence Interval?; Tutorial
Estimation Theory. In statistics, estimation refers to the process by which one makes inferences about a population, based on information obtained from a sample. Statisticians use sample statistics to estimate population parameters. For example, sample means are used to estimate population means; sample proportions, to estimate population proportions.
Two Types of Estimates. Point estimate: a point estimate of a population parameter is a single value of a statistic. For example, the sample mean x̄ is a point estimate of the population mean μ. When we estimate the mean μ by x̄, the probability that we are exactly correct is close to zero, i.e. P(x̄ = μ) ≈ 0, assuming the population is heterogeneous and the sample size n << population size N. Hence, we cannot be very “confident” about estimates made using point estimates.
Two Types of Estimates (contd.) How can we be more confident about our estimates? We want the probability of being correct to be larger than zero. We can increase our confidence by using a less precise estimate: an interval instead of a point. Interval estimate: an interval estimate is defined by two numbers, between which a population parameter is said to lie. For example, a < μ < b is an interval estimate of the population mean μ. It indicates that the population mean is greater than a but less than b.
History of Interval Estimation. Neyman (1937) identified interval estimation ("estimation by interval") as distinct from point estimation ("estimation by unique estimate"). He was the first to recognize and formulate interval estimation; work that quotes results in the form of an estimate plus or minus a standard deviation is a form of interval estimation. His paper on this was titled "On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection", given at the Royal Statistical Society on 19 June.
What is an Interval Estimate? In statistics, interval estimation is the use of sample data to calculate an interval of possible (or probable) values of an unknown population parameter, in contrast to point estimation, which gives a single number. An interval estimate is defined by two numbers, between which a population parameter is said to lie. For example, a < μ < b is an interval estimate of the population mean μ; it indicates that the population mean is greater than a but less than b. We use the sample mean x̄ to estimate this interval. Interval estimates provide both a "best estimate" of a parameter and an indication of the precision with which the parameter is known.
Types of Interval Estimation. The most prevalent forms of interval estimation are confidence intervals (a frequentist method) and credible intervals (a Bayesian method). Other common approaches to interval estimation, which are encompassed by statistical theory, are tolerance intervals and prediction intervals (used mainly in Regression Analysis). Of these, confidence intervals are the most common and widely used and hence will be covered in more detail in this class.
What is a Confidence Interval? In statistics, a confidence interval (CI) is an interval estimate of a population parameter. Instead of estimating the parameter by a single value, an interval likely to include the parameter is given. Confidence intervals are used to indicate the reliability of an estimate. How likely the interval is to contain the parameter is determined by the confidence level; increasing the desired confidence level will widen the confidence interval. Confidence intervals, and interval estimates more generally, have applications across the whole range of quantitative studies.
Example of Confidence Interval. A confidence interval can be used to describe how reliable some opinion survey results are. In a survey of election voting intentions, the result might be that 40% of respondents intend to vote for a certain party. A 95% confidence interval for the proportion in the whole population having the same intention on the survey date might be 36% to 44%. From the same survey data one may calculate a narrower interval at the lower 90% confidence level, for instance 38% to 42%. All other things being equal, a survey result with a narrower confidence interval at a higher confidence level is more desirable.
Example. In the whole of Houston, what percentage of adults do you think will want to watch a movie sometime in the next 10 days? Assume a variance of 0.25 for the whole population. Choose a random sample of 10 adults and ask their opinion. Let X be the random variable denoting the proportion of adults in the sample who want to watch a movie, and let X_i be its value in the i-th sample. Will this be anywhere close to the actual percentage? How can we be sure to be closer to the actual mean? Take a very large number of samples.
Example (contd.) But taking a large number of samples is generally not feasible. We want to arrive at an estimate based on fewer samples. In the previous example, if you take only 1 sample of 10 people and find that 5 of the 10 would like to go to a movie, then you can say: "We are pretty sure that 50% of the adult population would want to go to a movie in the next 10 days." Isn’t this ambiguous? How sure is "pretty sure"? We need to be more definitive.
Example (contd.) We use confidence intervals to remove the ambiguity. What if we want to be 100% sure? The only statement we can make with 100% certainty is that between 0% and 100% of the adult population would want to watch a movie in the next 10 days. What if we want to be 50% sure? Such a statement doesn’t hold much importance, as you are wrong half the time. Then what kind of statements make sense? Statements that are 90% sure, 95% sure, 98% sure or 99% sure. These are the confidence levels.
Calculating Confidence Level. The general norm is to vary the interval in multiples of σ, extending it equally on either side of the mean, and to compute the resulting confidence level. Here σ stands for the standard deviation of the sample mean x̄. The probability that μ lies in the interval [x̄ − σ, x̄ + σ] can be calculated as $P(\bar{x} - \sigma \le \mu \le \bar{x} + \sigma)$. Assuming a Normal distribution, we get $P(-1 \le Z \le 1) \approx 0.68$. What if we increase the width of the interval from 2σ to 4σ?
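The calculation on this slide can be sketched numerically. For a Normal distribution, the probability of falling within k standard deviations of the mean equals erf(k/√2). The helper name `coverage_within` is a hypothetical one chosen for this illustration.

```python
import math


def coverage_within(k):
    """Probability that a Normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))


# Interval [x - k*sigma, x + k*sigma] for k = 1, 2, 3
for k in (1, 2, 3):
    print(k, round(coverage_within(k), 4))
```

Under the Normal assumption, widening the interval from 2σ (k = 1) to 4σ (k = 2) raises the confidence level from about 68% to about 95%.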
Confidence Level Table. Some of the most commonly used confidence levels in statistics are given in the table below. Less than 90% is generally not considered a strong enough confidence level to make a statement.
Confidence Level    Number of σs away from mean
90%                 1.645
95%                 1.96
98%                 2.326
99%                 2.576
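The number of σs for a given confidence level can be obtained from the inverse of the standard Normal cumulative distribution. A minimal sketch, assuming Python 3.8+ for `statistics.NormalDist`; the function name `sigmas_for_level` is hypothetical, and the four levels are the ones mentioned earlier in the example.

```python
from statistics import NormalDist


def sigmas_for_level(level):
    """Number of standard deviations on each side of the mean for a two-sided level."""
    # A two-sided level leaves (1 - level) / 2 probability in each tail.
    return NormalDist().inv_cdf(0.5 + level / 2.0)


for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%}", round(sigmas_for_level(level), 3))
```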
Example (Contd.) Let us continue with computing the confidence interval for our movie example. Assume that we took a random sample of 10 adults. Among them, 5 adults said that they would like to go to a movie in the next 10 days. Hence we get mean x̄ = 0.5 (i.e. 50%) and, taking the population variance to be σ² = 0.25, standard deviation $\sqrt{\sigma^2/n} = \sqrt{0.25/10} \approx 0.158$ (since $\mathrm{Var}(\bar{x}) = \sigma^2/n$). Say we want to be 95% confident about our estimate.
Example (Contd.) From the table we can see that we have to be 1.96 standard deviations away from the mean. Hence, we need to be $1.96 \times 0.158 \approx 0.31$ away from the mean. Summarizing, we can now say with 95% confidence that the mean of the actual population will be in $[0.5 - 0.31,\ 0.5 + 0.31] = [0.19, 0.81]$, which is between 19% and 81% of the total population. What if you want to be 98% confident?
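The arithmetic of the movie example can be checked with a short sketch. It assumes a population variance of 0.25 (the value implied by the 0.31 margin quoted for n = 10) and uses a hypothetical helper, `mean_interval`; it illustrates this one calculation and is not a general-purpose routine.

```python
import math
from statistics import NormalDist


def mean_interval(sample_mean, variance, n, level):
    """Two-sided interval for the mean when the population variance is known."""
    std_error = math.sqrt(variance / n)           # standard deviation of the sample mean
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 for a 95% level
    margin = z * std_error
    return sample_mean - margin, sample_mean + margin


# Movie example: 5 of 10 adults, assumed population variance 0.25
low95, high95 = mean_interval(0.5, 0.25, 10, 0.95)
low98, high98 = mean_interval(0.5, 0.25, 10, 0.98)
print(round(low95, 2), round(high95, 2))
print(round(low98, 2), round(high98, 2))
```

Under these assumptions the 98% interval comes out wider than the 95% one, roughly [0.13, 0.87].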
Graphical Representation of Confidence Intervals. [Figure: a plot of a normal distribution (or bell curve). Each colored band has a width of one standard deviation.]
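Confidence Interval for μ when σ is known. A 95% confidence interval for μ if σ is known is given by: $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$. 95% of the x̄’s lie between $\mu - 1.96\,\sigma/\sqrt{n}$ and $\mu + 1.96\,\sigma/\sqrt{n}$.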
Rationale for Confidence Interval. From the sampling distribution of x̄, conclude that μ and x̄ are within 1.96 standard errors ($\sigma/\sqrt{n}$) of each other 95% of the time. Otherwise stated, 95% of the intervals $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$ contain μ. So the interval $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$ can be taken as an interval that typically would include μ.
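The claim that about 95% of such intervals contain μ can be checked by simulation. The sketch below assumes a Normal population with hypothetical parameters (μ = 0, σ = 1, n = 25) and a fixed seed; the function name `coverage_rate` is made up for this illustration.

```python
import math
import random


def coverage_rate(mu, sigma, n, trials=5000, z=1.96, seed=1):
    """Fraction of intervals x_bar +/- z*sigma/sqrt(n) that contain mu."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    margin = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n and compute its mean
        x_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials


print(coverage_rate(mu=0.0, sigma=1.0, n=25))
```

The printed fraction should be close to 0.95; it will not be exactly 0.95 because only a finite number of samples is simulated.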
Example. A random sample of 80 tablets had an average potency of 15 mg. Assume σ is known to be 4 mg. x̄ = 15, σ = 4, n = 80. A 95% confidence interval for μ is $15 \pm 1.96 \times 4/\sqrt{80} = (14.12, 15.88)$.