Unit 3: Sampling Distributions, Parameters and Parameter Estimates

What are the risks of excessive drinking at a department party?
A picture is worth a thousand words……

Inferential Statistics
Inferential statistics are used to estimate “parameters” in the population from parameter estimates in a sample drawn from that population. In inferential statistics, we use these parameter estimates to test hypotheses (predictions; null and alternative hypotheses) about the size of the population parameter. These predictions about the size of population parameters typically map directly onto research questions about (causal) relationships between variables (IVs and DV). Answers from inferential statistics are probabilistic. In other words, all answers have the potential to be wrong, and you will provide an index of that probability along with your results.

Populations
A population is any clearly defined set of objects or events (people, occurrences, animals, etc.). Populations usually represent all events in a particular class (e.g., all college students, all alcoholics, all depressed people, all people). A population is often an abstract concept because in most instances you will never have access to the entire population. For example, many of our studies may have the population of all people as their target. Nonetheless, researchers usually want to describe or draw conclusions about populations (e.g., we don’t care whether some new drug is an effective treatment for the 100 people in your sample. Will it work, on average, for everyone we might treat?).

(Population) Parameters
A parameter is a value used to describe a certain characteristic of a population. It is usually unknown and therefore has to be estimated. For example, the population mean is a parameter that is often used to indicate the average/typical value of a variable in the population. Within a population, a parameter is a fixed value that does not vary at the time of measurement (e.g., the mean height of people in the US at the present moment). You typically can’t calculate these parameters directly because you don’t have access to the entire population. We use Greek letters to represent population parameters (μ, σ, σ², β0, βj).

Samples & Parameter Estimates
A sample is a finite group of units (e.g., participants) selected from the population of interest. A sample is generally selected for study because the population is too large to study in its entirety. We typically have only one sample in a study. We use the sample to estimate and test parameters in the population. These estimates are called parameter estimates. We use Roman letters to represent sample parameter estimates (X̄, s, s², b0, bj).

Sampling Error
Since a sample does not include all members of the population, parameter estimates generally differ from the parameters of the entire population (e.g., using the mean height of a sample of 1000 people to estimate the mean height of the US population). The difference between the (sample) parameter estimate and the (population) parameter is sampling error. You will not be able to calculate the sampling error of your parameter estimate directly because you don’t know the value of the population parameter. However, you can estimate it by probabilistic modeling of the hypothetical sampling distribution for that parameter.
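Sampling error can be computed directly only when the population is known, which is possible in a simulation. The sketch below is illustrative: the population is simulated (mean 0, sd 23.7 are assumptions chosen for this example, not the course data), and the names pop, samp, and sampling_error are hypothetical.

```r
set.seed(1)

# Hypothetical population (assumption): 10,000 simulated scores
pop <- rnorm(10000, mean = 0, sd = 23.7)
mu <- mean(pop)                # population parameter (known only because we simulated it)

samp <- sample(pop, 10)        # one sample of N = 10, drawn without replacement
x_bar <- mean(samp)            # parameter estimate from the sample

# Sampling error: difference between the estimate and the parameter
sampling_error <- x_bar - mu
sampling_error
```

In real research mu is unknown, so this subtraction cannot be done; that is why the sampling distribution is modeled instead.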

Hypothetical Sampling Distribution
A sampling distribution is the probability distribution of a parameter estimate across all possible samples of size N taken from a population. A sampling distribution can be formed for the estimate of any population parameter. Each time you draw a sample of size N from a population, you can calculate an estimate of that population parameter from that sample. Because of sampling error, these parameter estimates will not exactly equal the population parameter. They will not equal each other either. They will form a distribution. A sampling distribution, like a population, is an abstract concept that represents the outcome of repeated (infinite) sampling. You will typically only have one sample.

What if we didn’t need samples?
Research question: How do inhabitants of a remote Pacific island feel about the ocean?
Population size = 10,000
Dependent measure: Ocean Liking Scale (OLS) scores range from -100 (strongly dislike) to 100 (strongly like); 0 represents neutral.
Hypotheses: H0: μ = 0; Ha: μ ≠ 0
How would you answer this question if you had unlimited resources (time, money, and patience!)?
Administer the Ocean Liking Scale to all 10,000 inhabitants in the population and calculate the population mean score. Is it 0? If not, the inhabitants are not neutral on average.

Ocean Liking Scale Scores in Full Population
> setwd("C:/Users/LocalUser/Desktop/GLM")
> d = lm.readDat('3_SamplingDistributions_Like.dat')
> str(d)
'data.frame': obs. of 2 variables: $ Like0: num
> lm.describeData(d)
var n mean sd median min max skew kurtosis
Like

Ocean Liking Scale Scores in Full Population
> windows() #quartz() for Mac users
> par('cex' = 1.5, 'lwd' = 2, 'font.axis'=1.5, 'font.lab' = 2)
> hist(d$Like0, col='yellow')

Parameter Estimation and Testing
What do you conclude?
Inhabitants of the island ARE neutral on average on the Ocean Liking Scale; μ = 0.
How confident are you about this conclusion?
Excluding issues of measurement of the scale (i.e., reliability), you are 100% confident that the population mean score on this scale is 0 (μ = 0).
Of course, this approach to answering a research question is not typical. Why? And how would you normally answer this question?
You will very rarely have access to all scores in the population. Instead, you have to use inferential statistics to “infer” (estimate) the size of the population parameter from a sample.

Obtain a Sample
You are a poor graduate student. All you can afford is N=10.
> dS = data.frame(Sample1 = sample(d$Like0,10))
> lm.describeData(dS$Sample1,1)
n mean sd min max
Sample
What do you conclude and why?
A sample mean of 2.40 is not 0. However, you know that the sample mean will not match the population mean exactly. How likely is it to get a sample mean of 2.40 if the population mean is 0? (Think about it!)

Obtain a Sample
Your friend is a poor graduate student too. All she can afford is N=10 too.
> dS$Sample2 = sample(d$Like0,10)
> lm.describeData(dS$Sample2,1)
n mean sd min max
sample
What do you conclude and why?
A sample mean of 1.04 is not 0. However, you know that the sample mean will not match the population mean exactly. It is more likely to get a sample mean of 1.04 than 2.40 if the population mean is 0, but you still don’t know how likely either outcome is. What if she obtained a sample with a mean of 30?

Sampling Distribution of the Mean
You can construct a sampling distribution for any sample statistic (e.g., mean, s, min, max, r, b0, b1). For the mean, you can think of the sampling distribution conceptually as follows:
Imagine drawing many samples (let’s say 1000 samples, but in theory the sampling distribution is infinite) of N=10 participants (10 participants in each sample) from your population.
Next, calculate the mean for each of these samples of 10 participants.
Finally, create a histogram (or density plot) of these sample means.
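The three steps above can be sketched in a few lines of base R. This is a minimal sketch under stated assumptions: the population is simulated (the actual Like0 data file is not reproduced here), and pop, n_samples, and sample_means are names chosen for the example.

```r
set.seed(123)

# Hypothetical stand-in for the island population (assumption, not the course data)
pop <- rnorm(10000, mean = 0, sd = 23.7)

n_samples <- 1000   # number of samples to draw
n <- 10             # participants per sample

# Steps 1 and 2: draw 1000 samples of N = 10 and keep the mean of each
sample_means <- replicate(n_samples, mean(sample(pop, n)))

# Step 3: histogram of the sample means approximates the sampling distribution
hist(sample_means, col = "yellow",
     main = "Sampling distribution of the mean (N = 10)")
```

Increasing n_samples makes the histogram a closer approximation to the theoretical (infinite) sampling distribution.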

1000 Samples of N=10 OLS Scores
Descriptives for each of 1000 samples of N=10
[Table: n, mean, sd, min, max for each sample]

Descriptives for 1000 sample means of N=10
[Table: n, mean, sd, median, min, max, skew, kurtosis for the sample means]
NOTE: In your research, you don’t form a sampling distribution. You (typically) only have one sample.

Raw Score Distribution vs. Sampling Distribution
NOTE: The distinction between the raw score distribution and the sampling distribution is very important to keep clear in your mind!

What will the mean of the sample means be? In other words, what is the mean of the sampling distribution? The mean of the sample means (i.e., the mean of the sampling distribution) will equal the population mean of raw scores on the dependent measure. This is important b/c it indicates that the sample mean is an unbiased estimator of the population mean.

The mean is an unbiased estimator: The mean of the sample means will equal the mean of the population. Therefore, individual sample means will neither systematically underestimate nor overestimate the population mean.
[Table: descriptives (n, mean, sd, median, min, max, skew, kurtosis) for the raw Ocean Liking scores and for the sample (N=10) means]
The sample variance (s²; with n-1 denominator) is also an unbiased estimator of the population variance (σ²). In other words, the mean of the sample s²’s will approximate the population variance. The sample standard deviation (s), however, is negatively biased: on average, it underestimates σ.
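These bias claims can be checked by simulation. The sketch below assumes a simulated normal population (not the course data) and uses 20,000 samples of N = 10; all object names are hypothetical.

```r
set.seed(42)

# Hypothetical population (assumption): 10,000 simulated scores
pop <- rnorm(10000, mean = 0, sd = 23.7)
pop_var <- mean((pop - mean(pop))^2)   # population variance (N denominator)
pop_sd <- sqrt(pop_var)

# 20,000 samples of N = 10, stored as the columns of a 10 x 20000 matrix
samples <- replicate(20000, sample(pop, 10))

means <- colMeans(samples)
vars <- apply(samples, 2, var)   # var() uses the n - 1 denominator
sds <- sqrt(vars)

c(mean_of_means = mean(means), pop_mean = mean(pop))  # close: unbiased
c(mean_of_vars = mean(vars), pop_var = pop_var)       # close: unbiased
c(mean_of_sds = mean(sds), pop_sd = pop_sd)           # mean of s falls below sigma
```

The first two pairs should agree closely, while the average s should come out a few percent below the population standard deviation at this sample size.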

Will all of the sample means be the same?
No, there was a distribution of means that varied from each other. The mean of the sampling distribution was the population mean, but the standard deviation was not zero.
[Table: descriptives (n, mean, sd, median, min, max, skew, kurtosis) for the sample means]

Standard Error (SE)
The standard deviation of the sampling distribution (i.e., the standard deviation of the infinite sample means) is equal to:
σ / √(Nsample)
where σ is the standard deviation of the population of raw scores.
This variability in the sampling distribution is due to sampling error. Therefore, b/c we use sample statistics (parameter estimates) to estimate population parameters, we would like to minimize sampling error.
The standard deviation of the sampling distribution of a parameter estimate has a technical name: the standard error of the statistic. Here, we are talking about the standard error of the mean.
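As a rough check on the formula, the standard deviation of many simulated sample means can be compared with σ divided by the square root of N. The population below is simulated and the object names are hypothetical; this is a sketch, not the course's own analysis.

```r
set.seed(7)

# Hypothetical population (assumption): 10,000 simulated scores
pop <- rnorm(10000, mean = 0, sd = 23.7)
sigma <- sqrt(mean((pop - mean(pop))^2))   # population standard deviation
n <- 10

# Standard error from the formula
se_formula <- sigma / sqrt(n)

# Standard error estimated as the sd of 10,000 simulated sample means
sample_means <- replicate(10000, mean(sample(pop, n)))
se_empirical <- sd(sample_means)

c(formula = se_formula, empirical = se_empirical)
```

The two values should agree to within simulation noise.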

Standard Error
What factors affect the size of the sampling error of the mean (i.e., the standard error)?
σ / √(Nsample)
The standard deviation of the population raw scores (σ) and the sample size (Nsample).

Factors that Affect the Standard Error (SE)
Variation among raw scores for a variable in the population is broadly caused by two factors. What are they?
(a) Individual differences (b) Measurement error (the opposite of reliability)
What is the relationship between population variability (σ) and SE?
As the variability of the variable increases in the population, the SE increases.
What would happen to SE if there were no variation in population scores?
The SE would be 0 b/c no matter which participants you sampled, they would all have the same scores.

Factors that Affect the Standard Error (SE)
What is the relationship between sample size and SE?
As the sample size increases, the SE for the statistic will decrease.
What would the SE be if the sample size = population size?
If the sample contained ALL participants from the population, the SE would be equal to 0 because each sample mean would have exactly the same value as the overall population mean (b/c every sample would contain the same scores).
What would happen if the samples contained only 1 participant?
If each sample contained only 1 participant, the SE would be equal to the variation (σ) observed within the population.
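The effect of sample size can be seen by evaluating the standard error formula at several sample sizes. The value of sigma below is an assumption chosen for illustration.

```r
sigma <- 23.7                 # assumed population standard deviation
n <- c(1, 10, 100, 1000)      # sample sizes to compare

# Standard error of the mean at each sample size
se <- sigma / sqrt(n)

data.frame(n = n, se = round(se, 2))
```

With N = 1 the SE equals sigma, and it shrinks as N grows. Note that this simple formula assumes a very large population; reproducing SE = 0 when the sample is the whole population requires a finite-population correction, which this sketch omits.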

Shape of the Sampling Distribution
Central Limit Theorem: The shape of the sampling distribution of the mean approaches normal as N increases. It is roughly normal even for moderate sample sizes, assuming that the original distribution isn’t really weird (i.e., extremely non-normal).
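One way to see the Central Limit Theorem at work is to start from a clearly skewed population and watch the skew of the sample means shrink as N grows. The exponential population, the sample sizes, and the helper skewness() below are all assumptions made for this sketch.

```r
set.seed(99)

# Hypothetical, strongly right-skewed population (assumption)
pop <- rexp(10000, rate = 1)

# Simple moment-based skewness (hypothetical helper for this example)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

sample_sizes <- c(2, 10, 50)

# Skewness of 5000 simulated sample means at each sample size
skew_by_n <- sapply(sample_sizes, function(n) {
  skewness(replicate(5000, mean(sample(pop, n))))
})
names(skew_by_n) <- paste0("N=", sample_sizes)

skew_by_n   # skew moves toward 0 (normal) as N increases
```

A histogram of the means at each N would show the same pattern visually.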

Normal Pop and Various Sampling Distributions
NOTES: Population size = 100,000; Simulated 10,000 samples

Uniform Pop and Various Sampling Distributions

Skewed Pop and Various Sampling Distributions
NOTE: x-axis scale changes across figures on this slide

An Important Normal Distribution: Z-scores
Z-scores are normally distributed scores with a mean of 0 and a standard deviation of 1. You can therefore think of a z-score as telling you the position of a score in standard deviations above (or below) the mean. The probability distribution is known for z-scores.
[Figure: standard normal curve with tail areas of 16%, 2.5%, and 0.5% marked in each tail]
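Because the z distribution is known, tail areas can be looked up with pnorm(). The cutoffs used below (1, 1.96, and 2.58) are an assumption about which z values correspond to tail areas of roughly 16%, 2.5%, and 0.5%; the slide itself does not list them.

```r
# Assumed z cutoffs for the three tail areas
z_cutoffs <- c(1, 1.96, 2.58)

# Area in the upper tail beyond each cutoff (the lower tail is symmetric)
upper_tail <- pnorm(z_cutoffs, mean = 0, sd = 1, lower.tail = FALSE)

# Express as percentages
round(100 * upper_tail, 1)
```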

Probability of Parameter estimate given H0
How could you use the z-score distribution to determine the probability of obtaining a sample mean (parameter estimate) of 2.40 if you draw a sample of N=10 from a population of Ocean Liking scores with a population mean (parameter) of 0? Think about it……

Hypothetical Sampling Distribution for H0
If H0 is true, the sampling distribution has a mean of 0 and a standard deviation of σ / √(Nsample) = σ / √10 = 7.5

Hypothetical Sampling Distribution for H0
If H0 is true and this is the sampling distribution (in blue), how likely is it to get a sample mean of 2.4 or more extreme? Pretty likely… But we can do better than that…

Our first inferential test: the z-test
z = (2.4 - 0) / 7.5 = 0.32; p = .749
pnorm(0.32, mean=0, sd=1, lower.tail=FALSE) * 2
[Figure: z distribution with 37.4% of the area in each tail beyond z = ±0.32]
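The z-test above can be written out step by step. The values 2.4, 0, and 7.5 come from the slides; the variable names are chosen for this sketch.

```r
x_bar <- 2.4   # sample mean (parameter estimate)
mu_0 <- 0      # population mean if H0 is true
se <- 7.5      # standard error of the mean under H0

# z: how many standard errors the estimate lies from the H0 value
z <- (x_bar - mu_0) / se

# Two-tailed p-value: area beyond |z| in both tails
p <- 2 * pnorm(abs(z), mean = 0, sd = 1, lower.tail = FALSE)

round(c(z = z, p = p), 3)
```

Using abs(z) keeps the calculation correct when the sample mean falls below the H0 value.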

t vs. z
z = (2.4 - 0) / 7.5
Where did we get the 2.4 from in our z-test?
Our sample mean from our study. This is our parameter estimate of the population mean of OLS (Like0) scores.
Where did we get the 0 from in our z-test?
This is the mean of the sampling distribution of OLS scores if H0 is true.
Where did we get the 7.5 from in our z-test and what is the problem with this?
This was the standard deviation of the sampling distribution (the standard error): σ / √(Nsample). The problem: we do not know σ.

t vs. z
How can we estimate σ?
We can use our sample standard deviation (s), but s is a negatively biased parameter estimate. On average, it will underestimate σ.
So what do we do?
We account for this underestimation of σ, and therefore of the standard deviation (standard error) of the sampling distribution, by using the t distribution rather than the z distribution to calculate the probability of our parameter estimate if H0 is true. The t distribution is slightly wider, particularly for small sample sizes, to correct for our underestimate of the standard deviation.

Our second inferential test: t-test
t(df) = (Parameter estimate - Parameter under H0) / (Standard error of parameter estimate)
where the SE is estimated using s from the sample data.
df = N - P = 10 - 1 = 9
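A one-sample t-test can be computed by hand from these pieces and compared with R's built-in t.test(). The sample below is simulated because the slide's sample values are not available, so the resulting numbers are purely illustrative.

```r
set.seed(2024)

# Hypothetical sample of N = 10 scores (assumption, not the slide's data)
samp <- rnorm(10, mean = 0, sd = 23.7)

n <- length(samp)
se_hat <- sd(samp) / sqrt(n)        # SE estimated with s instead of sigma
t_stat <- (mean(samp) - 0) / se_hat # parameter estimate minus H0 value, over SE
df <- n - 1                         # N - P, with P = 1 estimated parameter

# Two-tailed p-value from the t distribution
p <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# Same test using the built-in function
check <- t.test(samp, mu = 0)
c(t_by_hand = t_stat, t_builtin = unname(check$statistic))
c(p_by_hand = p, p_builtin = check$p.value)
```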

t vs. z
The bias in s decreases with increasing N. Therefore, t approaches z with larger sample sizes.
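The convergence of t toward z can be seen by comparing two-tailed .05 critical values across degrees of freedom. The particular df values are chosen arbitrarily for illustration.

```r
df <- c(2, 9, 29, 99, 999)      # illustrative degrees of freedom

# Critical values that cut off 2.5% in the upper tail
t_crit <- qt(0.975, df = df)
z_crit <- qnorm(0.975)

data.frame(df = df,
           t_crit = round(t_crit, 3),
           z_crit = round(z_crit, 3))
```

The t critical value is always larger than the z critical value, and the gap narrows as df increases.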

Null Hypothesis Significance Testing (NHST)
Divide reality regarding the size of the population parameter into two non-overlapping possibilities (null hypothesis & alternative hypothesis).
Assume that the null hypothesis is true.
Collect data.
Calculate the probability (p-value) of obtaining your parameter estimate (or a more extreme estimate) given your assumption (i.e., the null hypothesis is true).
Compare the probability to some cut-off value (alpha level).
(a) If this parameter estimate is less probable than the cut-off value, reject the null hypothesis in favor of the alternative hypothesis.
(b) If the data are not less probable, fail to reject the null hypothesis.
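The decision rule above can be sketched as a small function. The function name nhst_decision, its arguments, and the default alpha of .05 are all assumptions made for this example; the SE of 7.5 is reused here as if it had been estimated from a sample with 9 df.

```r
# Hypothetical helper: two-tailed test of a parameter estimate against H0
nhst_decision <- function(estimate, null_value, se, df, alpha = 0.05) {
  # Distance of the estimate from the H0 value, in standard errors
  t_stat <- (estimate - null_value) / se
  # Probability of an estimate at least this extreme if H0 is true
  p <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)
  # Compare the p-value to the cut-off (alpha)
  decision <- if (p < alpha) "reject H0" else "fail to reject H0"
  list(t = t_stat, p = p, decision = decision)
}

# Sample mean of 2.4 tested against mu = 0
result <- nhst_decision(estimate = 2.4, null_value = 0, se = 7.5, df = 9)
result$decision
```

Calling the same function with a sample mean of 30 leads to the opposite decision, because an estimate that far from 0 is very improbable if H0 is true.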
