Intro to Inference & The Central Limit Theorem
Learning Objectives
By the end of this lecture, you should be able to:
– Describe what is meant by the term ‘inference’.
– Name two potential pitfalls that can cause us to come up with false values for our population when doing inference calculations.
– Explain what is meant by the term ‘distribution of sample means’.
– Describe what distribution is found when a large series of sample means is obtained.
– Define the Central Limit Theorem.
– Calculate the mean and SD for a distribution of sample means (as opposed to those of an individual sample).
Inference
Inference is the process of taking information from a sample and using it to draw conclusions about the population.
Example: In a sample of 100 undergraduate DePaul students, 38% identify themselves as Republicans. Our goal, of course, is to use this 38% to infer the percentage of ALL DePaul undergraduates who identify themselves as Republican. The process of taking this value obtained from the sample and turning it into a prediction about the population is called inference.
You will see that the answer is NOT simply to report 38%. Over the next couple of lectures, we will begin our discussion of how to do inference and the proper way to report the results.
Two important issues to bear in mind about inference:
1. Your sample is only ONE estimate. That is, if you randomly sampled again, you would get a different result. Calculate the average height of 20 people. Then do it again with another 20 people: you will almost certainly get a somewhat different result. The importance of this fact will become clearer over time.
2. Your estimate of the population is only as good as your sampling design, i.e. do all you can to eliminate biases. For example, a relationship between number of beers and blood alcohol content (BAC) determined by sampling a group of NFL athletes will not properly generalize to the population of all Americans.
Sampling variability
Recall that your one sample is only one estimate of the true value. IMPORTANT: By ‘true’ value, we mean the value of the actual population. Recall that the population value is the piece of information that we are interested in.
Every time we take a random sample, we are going to get a different set of individuals and, therefore, will obtain a different value. This concept is called sampling variability.
Recognizing that no ONE PARTICULAR sample reliably gives you the “true” (i.e. population) value, since all samples will likely be different, is a key concept to keep in mind when doing inference calculations.
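To see sampling variability in action, here is a minimal Python sketch. The population of heights is made up for illustration (an assumed Normal population with mean 170 cm and SD 10 cm), and the names `population` and `sample_means` are just placeholders, not anything from the lecture.

```python
import random
import statistics

random.seed(1)  # fixed seed so the illustration is repeatable

# Hypothetical population: heights (in cm) of 10,000 people (made-up numbers).
population = [random.gauss(170, 10) for _ in range(10_000)]
population_mean = statistics.fmean(population)

# Take five separate random samples of 20 people and record each sample mean.
sample_means = [
    statistics.fmean(random.sample(population, 20))
    for _ in range(5)
]

print(f"Population mean: {population_mean:.1f}")
for i, m in enumerate(sample_means, start=1):
    print(f"Sample {i} mean:  {m:.1f}")
```

Each of the five sample means should land near the population mean, but no two of them should match exactly. That spread from sample to sample is the sampling variability.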
Distribution of MANY (i.e. repeated) samples:
There is an interesting and very important property of sampling variability. (If you’ve already forgotten what sampling variability is, that’s okay. However, go back and review it!)
Important Property to Keep in Mind: Suppose we were to take MANY random samples (of the same size) from a given population and record the mean each time. If we plotted all of those means on a histogram and drew a density curve, we would encounter a very familiar distribution! Can you guess which distribution??? Hint: It rhymes with ‘Schnormal’.
Much of statistical inference is based on this fact: The ‘distribution of sample means’ follows an approximately Normal distribution, provided the samples are reasonably large.
If we take repeated samples and calculate the mean of each, the distribution of all of those means will be approximately Normal. This distribution is an example of a sampling distribution of sample means. Note that the mean of this distribution turns out to be the true (i.e. population) mean.
The what of the who???
Be sure to understand what is meant by the ‘Distribution of Sample Means’… Restated: The sampling distribution of sample means refers to the distribution we would find if we were to take many, many samples and calculate the mean of each sample.
Central Limit Theorem
As we have just said (repeatedly): If you were to plot the distribution of sample means and draw a density curve, you would quickly find that the distribution of all of those means is approximately Normal. This leads us to one of the most important concepts in an introductory statistics course: When the samples are large enough, the distribution of sample means is approximately Normal, even if the original dataset was NOT Normal! This (very) important property is known as the Central Limit Theorem.
Example:
1. If you sample the incomes of 100 people on the street, you will have one result for the mean. Of course, this is only one sample.
2. If we repeat this sample again, we will almost certainly obtain a different result, so we will have 2 different means. If we repeat this sample 100 times, we will have 100 means. As we have discussed, we call the distribution of these means a sampling distribution of sample means.
3. If we plot these 100 means on a histogram, we would find that the distribution of all of these values is approximately Normal.
4. IMPORTANT: Note that income is NOT Normally distributed (it is skewed). This is one of the “powerful” aspects of the Central Limit Theorem: The distribution of means is approximately Normal even if the original dataset was not!
I am well aware that this isn’t exactly blowing your socks off. Statistically, however, it turns out to have some very important ramifications.
Who cares?
Are you impressed yet? Okay, I agree that the fact that the sampling distribution of sample means is approximately Normal may not seem like an earth-shattering revelation. However, there are some aspects to it that end up allowing us to use very powerful statistical tools down the road.
The key one to bear in mind for now is the idea that the Central Limit Theorem applies even when the original dataset is NOT Normal.
For example, suppose we looked at the distribution of 100 incomes. This distribution would be right-skewed. Now suppose we took the mean of those 100 incomes. Then we took another sample of 100 incomes and calculated that mean. Then we repeated this process a few hundred times. If we plotted all of those means on a histogram, the distribution would be approximately Normal. In other words, the distribution of sample means is approximately Normal even when the original dataset itself was not Normal.
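If you would like to check the income example for yourself, the following Python sketch is one way to do it. Everything in it is an assumption for illustration: the incomes are simulated from a lognormal model (a common stand-in for right-skewed data, not real income data), and `skewness` and `draw_income` are hypothetical helper names. Skewness is roughly 0 for a symmetric, Normal-looking distribution and clearly positive for a right-skewed one.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

def skewness(values):
    """Skewness: about 0 for a symmetric distribution, positive if right-skewed."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)

def draw_income():
    # Hypothetical right-skewed income model (lognormal); not real data.
    return random.lognormvariate(10.5, 0.5)

# Individual incomes: this is the "original dataset".
incomes = [draw_income() for _ in range(20_000)]

# Means of 1,000 samples, each containing 100 incomes.
sample_means = [
    statistics.fmean([draw_income() for _ in range(100)])
    for _ in range(1_000)
]

print(f"Skewness of individual incomes: {skewness(incomes):.2f}")
print(f"Skewness of the sample means:   {skewness(sample_means):.2f}")
```

The individual incomes should show a large positive skewness, while the skewness of the sample means should be much closer to 0, which is what we expect if the means are approximately Normal.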
Example of the Central Limit Theorem
[Figure: two histograms. Left: Distribution of EVERY Call. Right: Distribution of 500 samples (80 calls in each sample).]
1. The lengths of phone calls at a call center are right-skewed.
2. The graph on the left is a record of thousands and thousands of phone calls.
3. We take a sample of 80 phone calls and calculate the mean length. Then we repeat with another sample of 80 phone calls and calculate the mean length. We repeat for 500 samples.
4. Note that when we look at the distribution of these 500 sample means, the distribution is approximately Normal.
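The call center experiment can also be imitated in Python. This is only a sketch under stated assumptions: the call lengths are simulated from an exponential model with an average of 5 minutes (both the model and the 5 minutes are made up), and the names are placeholders. As a rough check of Normality, it counts how many of the 500 sample means fall within 1 and 2 standard deviations of their center. For a Normal distribution we would expect roughly 68% and 95%.

```python
import random
import statistics

random.seed(7)  # fixed seed so the run is repeatable

MEAN_CALL_MINUTES = 5.0  # hypothetical average call length
CALLS_PER_SAMPLE = 80
NUM_SAMPLES = 500

def sample_mean_call_length():
    # Exponential call lengths are strongly right-skewed (an assumed model).
    calls = [
        random.expovariate(1 / MEAN_CALL_MINUTES)
        for _ in range(CALLS_PER_SAMPLE)
    ]
    return statistics.fmean(calls)

means = [sample_mean_call_length() for _ in range(NUM_SAMPLES)]
center = statistics.fmean(means)
spread = statistics.stdev(means)

# Share of the sample means within 1 and 2 SDs of their center.
within_1_sd = sum(abs(m - center) <= spread for m in means) / NUM_SAMPLES
within_2_sd = sum(abs(m - center) <= 2 * spread for m in means) / NUM_SAMPLES

print(f"Center of the sample means: {center:.2f} minutes")
print(f"Within 1 SD: {within_1_sd:.0%}   Within 2 SD: {within_2_sd:.0%}")
```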
Calculation of Mean and SD of sampling distribution: Sampling distribution of $\bar{x}$
For any population with mean $\mu$ and standard deviation $\sigma$:
– The mean, or center, of the sampling distribution of $\bar{x}$ is equal to the population mean: $\mu_{\bar{x}} = \mu$.
– The standard deviation of the sampling distribution is $\sigma_{\bar{x}} = \sigma / \sqrt{n}$, where $n$ is the sample size.
Mean of a sampling distribution: There is no tendency for a sample mean to fall systematically above or below the population mean, even if the distribution of the raw data is skewed. Thus, the mean of the sampling distribution is an unbiased estimate of the population mean: it will be “correct on average” in many samples.
Key Point: Mean of a Sampling Distribution = Mean of the population
Standard deviation of a sampling distribution: The standard deviation of the sampling distribution measures how much the sample statistic varies from sample to sample. This SD is smaller than the standard deviation of the population by a factor of √n.
Key Point: SD of a Sampling Distribution = SD of the population / square root of n
But isn’t this backward??? Don’t we typically start with a sample and from there try to infer about the population? In a word: Yes! At the moment, this is backwards. For the time being, if I were to ask you to tell me the mean or SD of a sampling distribution on quizzes/exams, I would have to give you the population mean/SD. I realize that this may seem ridiculous, since the whole point of statistical sampling is to DISCOVER the population values that we don’t know yet!! At the moment, however, we are doing things this way to help us understand the theory of how things work. If/when you progress with stats, you will learn how to get around this seemingly backwards way of doing things.
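The two key points can be checked by simulation. The sketch below assumes a made-up, right-skewed population (exponential, with mean 10 and therefore SD 10) and compares the observed mean and SD of 2,000 sample means against the population mean and SD / √n for three sample sizes. The function name `simulate` and all of the numbers are illustrative assumptions.

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the run is repeatable

POP_MEAN = 10.0  # assumed exponential population with mean 10 ...
POP_SD = 10.0    # ... and, for an exponential, SD equal to the mean

def simulate(n, num_samples=2_000):
    """Return (mean, SD) of num_samples sample means, each from a sample of size n."""
    means = [
        statistics.fmean([random.expovariate(1 / POP_MEAN) for _ in range(n)])
        for _ in range(num_samples)
    ]
    return statistics.fmean(means), statistics.stdev(means)

results = {}
for n in (4, 25, 100):
    observed_mean, observed_sd = simulate(n)
    predicted_sd = POP_SD / math.sqrt(n)  # SD of the population / sqrt(n)
    results[n] = (observed_mean, observed_sd, predicted_sd)
    print(
        f"n = {n:3d}: mean of means = {observed_mean:.2f}, "
        f"SD of means = {observed_sd:.2f}, predicted SD = {predicted_sd:.2f}"
    )
```

The mean of the sample means should stay close to 10 for every sample size, while the SD of the sample means should shrink as n grows, tracking 10 / √n (that is, 5, 2, and 1).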
Restated (population vs. sampling distribution): If the population is $N(\mu, \sigma)$, then the distribution of sample means is $N(\mu, \sigma/\sqrt{n})$.
Example: In a large population of adults, the mean IQ is 112 with standard deviation 20. Suppose 200 adults are randomly selected for a market research campaign. What is the distribution of the sample means?
A) Normal, mean 112, standard deviation 20
B) Normal, mean 112, standard deviation 20
C) Normal, mean 112, standard deviation 1.414
D) Unable to Determine
Answer: C) Approximately normal, mean 112, standard deviation 1.414
Population distribution: $N(\mu = 112;\ \sigma = 20)$
Sampling distribution for $n = 200$: $N(\mu = 112;\ \sigma/\sqrt{n} = 20/\sqrt{200} \approx 1.414)$
KEY POINT: Note that the question asks for the distribution of the sample means.
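The arithmetic in the IQ example is easy to reproduce. Here is a small Python sketch; the helper name `sampling_distribution_of_mean` is made up for this illustration.

```python
import math

def sampling_distribution_of_mean(pop_mean, pop_sd, n):
    """Return (mean, SD) of the sampling distribution of the sample mean."""
    return pop_mean, pop_sd / math.sqrt(n)

# IQ example: population mean 112, population SD 20, samples of n = 200.
mean_of_means, sd_of_means = sampling_distribution_of_mean(112, 20, 200)
print(f"mean = {mean_of_means}, SD = {sd_of_means:.3f}")  # mean = 112, SD = 1.414
```

Notice that the function needs the population mean and SD as inputs. Without them, there is nothing to calculate.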
Example
Using our example from earlier: In a sample of 100 DePaul students, 38 of them identify themselves as Republicans. What is the distribution of the sample means?
A) Normal, mean 38, standard deviation 3
B) Normal, mean 38, standard deviation 3
C) Normal, mean 38, standard deviation 0.3
D) Unable to Determine
Answer: D) Unable to determine
Population distribution: $N(\mu = \text{???};\ \sigma = \text{???})$
Sampling distribution: $N(\mu = \text{population mean};\ \sigma/\sqrt{n} = \text{population SD}/\sqrt{100})$
We are missing two key pieces of information (the population mean and SD), so we can’t answer! Again, I recognize that being provided with the population mean and SD seems backward. In subsequent stats study, you will learn how to get around the need for these two pieces of information.