Sociology 5811: Lecture 7: Samples, Populations, The Sampling Distribution Copyright © 2005 by Evan Schofer Do not copy or distribute without permission.


1 Sociology 5811: Lecture 7: Samples, Populations, The Sampling Distribution Copyright © 2005 by Evan Schofer Do not copy or distribute without permission

2 Announcements Problem Set #2 due today!

3 Review: Populations Population: The entire set of persons, objects, or events that have at least one common characteristic of interest to a researcher (Knoke, p. 15) Beyond the literal definition, a population is the general group that we wish to study and gain insight into Sample: A subset of a population Random Sample: A sample chosen from a population such that each observation has an equal chance of being selected (Knoke, p. 77) Randomness is one strategy for avoiding biased samples.

4 Review: Statistical Inference Statistical inference: making statistical generalizations about a population from evidence contained in a sample (Knoke, p. 77) When is statistical inference likely to work? 1. When a sample is large If a sample approaches the size of the population, it is likely to be a good reflection of that population 2. When a sample is representative of the entire population As opposed to a sample that is atypical in some way, and thus not reflective of the larger group.

5 Populations and Samples Population parameters (μ, σ) are constants There is one true value, but it is usually unknown Sample statistics (Y-bar, s) are variables Up until now we’ve treated them as constants But, there are many possible samples The values of the mean and S.D. vary depending on which sample you have Like any variable, the mean and S.D. have a distribution Called the “sampling distribution” Made up of the values from all possible samples of a given population

6 Populations and Samples: Overview
                    Population                       Sample
Characteristics     “parameters”                     “statistics”
                    constant (one per population)    variables (vary from sample to sample)
Notation            Greek (μ, σ)                     Roman (Y-bar, s)
Estimate            “hat” (μ-hat)                    “point estimate” based on the sample

7 Population and Sample Distributions [Figure: the population distribution, with parameters μ and σ, alongside a sample distribution, with statistics Y-bar and s]

8 Estimating the Mean Suppose we want to know the mean of a population (μ). What do we do? Plan A: Spend $100 million to survey our entire population If it is even possible to survey the whole population Plan B: Spend $1,000 sampling a few hundred people. Estimate the mean Simply use sample formulas to estimate μ (i.e., take Y-bar as our estimate, μ-hat)

9 Estimating the Mean Question: Given our sample, what is our best guess of the population mean? Answer: The sample mean: Y-bar Look at Y-bar, and assume that it is a “good guess” Thus, we calculate: μ-hat = Y-bar = (Σ Y-sub-i) / N
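As a quick illustrative sketch (not part of the original lecture; the survey responses below are hypothetical), computing this point estimate looks like this in Python:

```python
# Minimal sketch of point estimation: the sample mean Y-bar serves as
# our best guess (mu-hat) of the unknown population mean.
# The data below are hypothetical survey responses.
sample = [12, 45, 7, 30, 22, 18, 40]

y_bar = sum(sample) / len(sample)  # Y-bar = (sum of Y_i) / N
print(f"Point estimate of mu (Y-bar): {y_bar:.2f}")
```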

10 Estimating the Mean Issue: There are a vast number of possible samples that one can take from any population –Each possible sample has its own mean, and most of these means differ from one another Some are close to the population mean, some are not Q: How do we know if we got a “good guess”? A: We can’t know for sure. We may draw incorrect conclusions about the mean But: We can use probability theory to determine if our guess is likely to be good!

11 Estimates and Sampling Distributions It is possible to take more than one sample And calculate more than one estimate of the mean If we took many samples (and calculated many means), we’d see a range of estimates We could even plot a histogram of the many estimates Our confidence in our guess depends on how “spread out” the range of guesses tends to be The “standard deviation” of that particular histogram.

12 Sampling Distributions Sampling Distribution: The distribution of estimates created by taking all possible unique samples (of a fixed size) from a population Example: Take every possible 10-person sample of sociology graduate students (all combinations) 1. Calculate the mean of each sample 2. Graph a histogram of all estimates This is called “the sampling distribution of the mean” Note: The sampling distribution is rarely known It is typically thought of as a probability distribution.

13 Sampling Distribution Notation Population mean and S.D. are: μ, σ Each sample has a mean and S.D.: Y-bar, s The sampling distribution of the mean (i.e., the distribution of mean-estimates) also has a mean And a S.D., aka the “standard error” Mean, S.D. of sampling distribution: μ-sub-Y-bar, σ-sub-Y-bar Question: Why are they Greek? A: Because all possible samples represent a population Question: Why is there a sub-Y-bar? Because it is the mean of all possible Y-bars (means)

14 Sampling Distribution of the Mean It turns out that under some circumstances, the shape of the sampling distribution of the mean can be determined –Thus allowing one to get a sense of the range of estimates of the mean one is likely to see If the distribution is narrow, our guess is probably good! If the S.D. is large, our guess may be quite bad This provides insight into the probable location of the population mean Even if you only have a single sample to look at This “trick” lets us draw conclusions!!!

15 Sampling Distribution Example Let’s create a sampling distribution from a small population, μ = 52 (sample N = 3)
Case   # of CDs
1      30
2      100
3      20
4      70
5      40
Note how the mean varies depending on the sample Mean of cases 1,2,3 = 50 Mean of cases 2,4,5 = 70 For this population (N=5) we can calculate all possible means based on sample size 3

16 Sampling Distribution Example First, we must calculate every possible mean
Case   # of CDs
1      30
2      100
3      20
4      70
5      40
Sample means (by cases included):
1,2,3 = 50      1,2,4 = 66.67   1,2,5 = 56.67
1,3,4 = 40      1,3,5 = 30      1,4,5 = 46.67
2,3,4 = 63.33   2,3,5 = 53.33   2,4,5 = 70
3,4,5 = 43.33
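A short sketch of this enumeration using Python’s standard library (the five CD counts are the ones from the slide):

```python
from itertools import combinations
from statistics import mean

# The five-case population of CD counts (cases 1-5) from the slide.
population = {1: 30, 2: 100, 3: 20, 4: 70, 5: 40}

# Enumerate every unique 3-case sample and compute its mean.
# Together, these 10 means form the sampling distribution of the mean.
sample_means = []
for cases in combinations(population, 3):
    m = mean(population[c] for c in cases)
    sample_means.append(m)
    print(f"Cases {cases}: mean = {m:.2f}")

# The mean of all sample means equals the population mean (52).
print(f"Mean of sampling distribution: {mean(sample_means):.2f}")
print(f"Population mean:               {mean(population.values()):.2f}")
```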

17 Sampling Distribution Example Here, you can see how the sample mean is really a variable This complete list of all possible means is the sampling distribution As a probability distribution, this tells us the probability of picking a sample with each mean Note: Sampling Dist mean = 52 Same as population mean!
Sample   Y-bar
1        50
2        66.67
3        56.67
4        40
5        30
6        46.67
7        63.33
8        53.33
9        70
10       43.33

18 Sampling Distribution Example Histogram of Sampling Distribution (N=3): [Histogram: bins 17-27, 27-37, 37-47, 47-57, 57-67, 67-77, 77-87; frequency axis 0 to 4; μ = 52 marked] Note: The distribution centers around the population mean And, it is roughly symmetrical

19 Sampling Distribution Example As a probability distribution, the sampling distribution gives a sense of the quality of our estimate of μ [Histogram: bins 17-27 through 77-87; probability axis 0 to .4; μ = 52 marked] Probability = Frequency / N The probability of picking a sample with a mean that is within +/- 5 of μ is p = .3 (30%) The probability of overestimating μ by more than 15 is about p = .1 (10%) Q: What is the probability of a “poor” estimate of μ?
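Continuing the sketch, these probabilities can be read directly off the list of ten sample means from the earlier slide:

```python
# The ten sample means from the slides: the sampling distribution.
sample_means = [50, 66.67, 56.67, 40, 30, 46.67, 63.33, 53.33, 70, 43.33]
mu = 52  # population mean

# Probability = frequency / number of possible samples.
p_within_5 = sum(abs(m - mu) <= 5 for m in sample_means) / len(sample_means)
p_over_15  = sum(m - mu > 15 for m in sample_means) / len(sample_means)

print(f"P(sample mean within +/- 5 of mu):  {p_within_5:.1f}")  # 0.3
print(f"P(overestimate mu by more than 15): {p_over_15:.1f}")   # 0.1
```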

20 Sampling Distribution Example Note: If the sampling distribution is narrow, most of our estimates of the mean will be good That is, they will be close to μ, the population mean If the sampling distribution is wide, the probability of a “bad” estimate goes up A measure of dispersion can help us assess the sampling distribution Recall: the standard deviation of a sampling distribution is called: the standard error It tells us the width of the sampling distribution!

21 The Central Limit Theorem But, how do we know the width of the sampling distribution? Statisticians have shown that the sampling distribution will have consistent properties, if we have a large sample Several of these properties constitute the “Central Limit Theorem” These properties provide the basis for drawing statistical inferences about the mean.

22 The Central Limit Theorem If you have a large sample (Large N): 1. The sampling distribution of the mean (and thus all possible estimates of the mean) cluster around the true population mean 2. They cluster as a normal curve Even if the population distribution is not normal 3. The estimates are dispersed around the population mean by a knowable standard deviation (sigma over root N)
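A small simulation sketch of these three properties (this uses NumPy, which is an assumption on my part; any random-number routine would do). The population here is deliberately skewed, yet the sample means still pile up in a roughly normal shape around the population mean with spread σ/√N:

```python
import numpy as np

rng = np.random.default_rng(0)

# A deliberately non-normal (skewed) population: exponential with
# mean = 10 and standard deviation = 10.
mu, sigma, n = 10.0, 10.0, 100

# Draw 20,000 samples of size n and record each sample's mean.
sample_means = rng.exponential(scale=mu, size=(20_000, n)).mean(axis=1)

print(f"Mean of sample means:         {sample_means.mean():.2f}  (population mean = {mu})")
print(f"SD of sample means:           {sample_means.std():.2f}")
print(f"CLT prediction sigma/sqrt(N): {sigma / np.sqrt(n):.2f}")
# A histogram of sample_means would look approximately normal,
# even though the underlying population is strongly skewed.
```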

23 The Central Limit Theorem Formally stated: 1. As N grows large, the sampling distribution of the mean approaches normality 2. Its mean equals the population mean: μ-sub-Y-bar = μ 3. Its standard deviation (the standard error) is σ-sub-Y-bar = σ-sub-Y / √N

24 Central Limit Theorem: Visually [Figure: population distribution with μ and σ, sample distribution with Y-bar and s, and the normal sampling distribution of the mean]

25 Implications of the C.L.T What does this mean for us? Typically, we only have one sample, and thus only one estimate of μ The actual value of μ is unknown So we don’t know the center of the sampling distribution All we know for certain is that our estimate falls somewhere in the sampling distribution This is always true by definition And, later, we’ll estimate its width.

26 Implications of the C.L.T Visually: Suppose we observe μ-hat = 16 But, μ-hat always falls within the sampling distribution [Figure: sampling distribution curve around μ-hat = 16; there are many possible locations of μ]

27 Implications of the C.L.T We know that the mean from our sample falls somewhere in this sampling distribution Which has mean μ and standard deviation σ over square root N If we can estimate σ, we can estimate σ over root N... The “Standard Error” of the mean We don’t know exactly where the sample falls But, laws of probability suggest that we are most likely to draw a sample w/ mean from near the center Recall: roughly 68% fall within +/- 1 SD, and 95% within +/- 2 SD, in a normal curve So, we can determine the range around μ in which 95% (or 99%, or 99.9%) of cases will fall.

28 Implications of the C.L.T What is the relation between the Standard Error and the size of our sample (N)? Answer: It is an inverse relationship. The standard deviation of the sampling distribution shrinks as N gets larger Formula: σ-sub-Y-bar = σ-sub-Y / √N Conclusion: Estimates of the mean based on larger samples tend to cluster closer around the true population mean.
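To make the inverse relationship concrete, here is a tiny sketch (with a hypothetical population standard deviation of 10) showing how the standard error shrinks as N grows:

```python
import math

sigma = 10.0  # hypothetical population standard deviation

# Standard error = sigma / sqrt(N): quadrupling N halves the standard error.
for n in (10, 50, 100, 400, 1600):
    print(f"N = {n:5d}  ->  standard error = {sigma / math.sqrt(n):.2f}")
```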

29 Implications of the CLT Visually: The width of the sampling distribution is an inverse function of N (sample size) –The distribution of mean estimates based on N = 10 will be more dispersed. Mean estimates based on N = 50 will cluster closer to μ. [Figure: two sampling distributions, a wider curve for the smaller sample size and a narrower curve for the larger sample size]

30 Confidence Intervals Benefits of knowing the width of the sampling distribution: 1. You can figure out the general range of error around a given point estimate, based on the range around the true mean within which most estimates fall 2. And, this defines the range around an estimate that is likely to hold the population mean A “confidence interval” Note: These only work if N is large!

31 Confidence Interval Confidence Interval: “A range of values around a point estimate that makes it possible to state the probability that an interval contains the population parameter between its lower and upper bounds.” (Bohrnstedt & Knoke p. 90) It involves a range and a probability Examples: We are 95% confident that the mean number of CDs owned by grad students is between 20 and 45 We are 50% confident the mean rainfall this year will be between 12 and 22 inches.

32 Confidence Interval Visually: It is probable that μ falls near μ-hat [Figure: sampling distribution centered near μ-hat, showing the probable values of μ and the range where μ is unlikely to be] Q: Can μ be this far from μ-hat? Answer: Yes, but it is very improbable

33 Confidence Interval To figure out the range of “error” in our mean estimate, we need to know the width of the sampling distribution –The Standard Error! (The S.D. of this distribution) The Central Limit Theorem provides a formula: σ-sub-Y-bar = σ-sub-Y / √N Problem: We do not know the exact value of σ-sub-Y, the population standard deviation!

34 Confidence Interval Question: How do we calculate the standard error if we don’t know the population S.D.? Answer: We estimate it using the information we have Formula for best estimate: σ-hat-sub-Y-bar = s-sub-Y / √N Where N is the sample size and s-sub-Y is the sample standard deviation

35 95% Confidence Interval Example Suppose a sample of 100 students with mean SAT score of 1020, standard deviation of 200 How do we find the 95% Confidence Interval? If N is large, we know that: 1. The sampling distribution is roughly normal 2. Therefore 95% of samples will yield a mean estimate within 2 standard deviations (of the sampling distribution) of the population mean (μ) Thus, 95% of the time, our estimates of μ (Y-bar) are within two “standard errors” of the actual value of μ.

36 95% Confidence Interval Formula for 95% confidence interval: 95% C.I. = Y-bar +/- 2(σ-sub-Y-bar) Where Y-bar is the mean estimate and σ-sub-Y-bar is the standard error Result: Two values – an upper and lower bound Adding our estimate of the standard error: 95% C.I. = Y-bar +/- 2(s-sub-Y / √N)

37 95% Confidence Interval Suppose a sample of 100 students with mean SAT score of 1020, standard deviation of 200 Calculate: standard error = 200 / √100 = 20, so 95% C.I. = 1020 +/- 2(20) = 1020 +/- 40 Thus, we are 95% confident that the population mean falls between 980 and 1060.
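The same calculation as a short sketch, using the lecture’s rule of thumb of 2 standard errors (1.96 is the more exact multiplier for 95%):

```python
import math

# The SAT example from the slide: N = 100, Y-bar = 1020, s = 200.
n, y_bar, s = 100, 1020, 200

std_error = s / math.sqrt(n)      # estimated standard error = 200/10 = 20
lower = y_bar - 2 * std_error     # 1020 - 40 = 980
upper = y_bar + 2 * std_error     # 1020 + 40 = 1060

print(f"95% confidence interval: ({lower:.0f}, {upper:.0f})")
```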

