At the beginning of the term, we talked about populations and samples. What are they? Why do we take samples?
Generally, we want to know about the population. But studying/surveying the entire population is problematic! ▪ Too costly ▪ May be impossible!
So, we typically study samples rather than entire populations. But we are not usually interested in the sample itself; we hope that the sample will give us insight into the population
Starting here, we will look at the relationship between samples and populations: what we can learn, and how precise/reliable the information is
Suppose we were interested in knowing the average travel time for students coming to Seneca. We don’t want to ask every Seneca student, so we take a sample. We hope that the sample mean will give us insight into the population mean
Will the sample mean be exactly equal to the population mean?
No, because it depends on exactly who winds up in our sample
Will the sample mean be the same for every sample?
No, because it depends on exactly who winds up in our sample
Get into groups (samples) of two, and calculate your average travel time
1. The sample mean is RANDOM: it depends on exactly who winds up in the sample
Do these samples give us reliable estimates of the population mean?
These samples are VERY SMALL -> subject to a great deal of randomness
Groups of 3
Groups of 5
Groups of 10
1. The sample mean is RANDOM: it depends on exactly who winds up in the sample
2. The larger the sample, the more likely that the sample mean will be close to the population mean. In larger samples, the randomness tends to ‘average out’, meaning less random fluctuation from sample to sample. Larger samples give more reliable results
Because the sample mean is random, we can describe it using a probability distribution. I.e., for any given sample mean, there is some probability. And we can ask, ‘What is the probability that we get a sample mean in the range ______?’ This is called the ‘sampling distribution’
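If you want to see a sampling distribution being built, here is a rough sketch in Python (not an Excel tool we use in class; the travel-time population and the 35–45 minute range are invented purely for illustration):

    # Sketch: simulate the sampling distribution of the mean (illustrative numbers only)
    import numpy as np

    rng = np.random.default_rng(1)
    population = rng.exponential(scale=40, size=100_000)   # made-up travel times, in minutes

    for n in (2, 5, 10, 30):
        # draw 10,000 samples of size n and compute each sample's mean
        sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)
        spread = sample_means.std()
        in_range = np.mean((sample_means >= 35) & (sample_means <= 45))
        print(f"n={n:2d}  spread of sample means={spread:5.2f}  P(35 <= mean <= 45) ~ {in_range:.2f}")

As n grows, the sample means bunch more tightly around the population mean, and the probability of landing in a range near it goes up.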
Depending on the actual raw data distribution, the distribution of the sample mean can have many different shapes. On the next slide, we look at three different data distributions and what the distribution of the sample mean looks like ▪ When the sample size, n, = 2
[Figure: raw data distributions and the corresponding distributions of the sample mean, n = 2. Source: Dawson B, Trapp RG: Basic & Clinical Biostatistics, 4th edition]
Those distributions look strange! But as sample size increases, wonderful things happen. First, the sample mean gets more accurate ▪ The distribution gets narrower ▪ I.e., the probability of getting a sample mean far from the real population mean is low. Second, the distribution changes shape
[Figure: raw data distributions and the corresponding distributions of the sample mean, for n = 2, n = 10, and n = 30. Source: Dawson B, Trapp RG: Basic & Clinical Biostatistics, 4th edition]
As we take larger samples, the distribution of the sample mean approaches the normal distribution! (Almost) regardless of the shape of the actual data! Because of this, we can use what we have learned about the normal distribution to, e.g., judge how reliable/accurate our sample results are!
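Here is a minimal sketch of that idea (this is the central limit theorem) in Python; the strongly skewed population is made up just to show the change in shape, and scipy’s skew function is used as a rough ‘how non-normal is this?’ measure:

    # Sketch: as n grows, the distribution of the sample mean becomes more symmetric (normal-like)
    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(2)
    population = rng.exponential(scale=1.0, size=100_000)   # strongly right-skewed raw data
    print("skewness of raw data:", round(float(skew(population)), 2))

    for n in (2, 10, 30):
        sample_means = rng.choice(population, size=(10_000, n)).mean(axis=1)
        # skewness near 0 means the shape is close to symmetric, like the normal curve
        print(f"n={n:2d}  skewness of sample means: {skew(sample_means):.2f}")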
As discussed, if the sample size is large, the sampling distribution approaches the normal distribution. But it’s not exactly equal to the normal distribution ▪ Especially if n is small! For this reason, we use another distribution, which is closely related
The t-distribution takes sample size into account. It is wider and flatter than the normal distribution. The smaller the sample, the wider and flatter! ▪ Reflecting that the information is less reliable ▪ I.e., that we are more likely to get a result far from the real population mean
To use the t-distribution, we need to provide the degrees of freedom. This is just n – 1 ▪ (Sample size – 1)
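To see ‘wider and flatter’ in numbers, here is a hedged sketch comparing tail probabilities under the normal curve and under t (Python/scipy rather than Excel; the cut-off of 2 is arbitrary):

    # Sketch: probability of being more than 2 standard errors above the mean
    from scipy import stats

    print("normal:          P(Z > 2) =", round(stats.norm.sf(2), 4))
    for n in (3, 5, 10, 30):
        df = n - 1                      # degrees of freedom = sample size - 1
        tail = stats.t.sf(2, df)        # P(T > 2) with that many degrees of freedom
        print(f"t, n={n:2d} (df={df:2d}): P(T > 2) = {tail:.4f}")

The smaller the sample, the more probability sits far out in the tails: exactly the ‘wider and flatter’ point above.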
We can use the t-distribution to determine the probability of getting a mean in a given range, in the same way we used the normal distribution to find the probability of getting a value in a certain range
When using t, there is no built-in ‘one-step’ function like norm.dist; it is a 2-step process:
1. Convert the x-value(s) into t-scores ▪ Like z-scores!
2. Use the t-score(s) to look up the probability ▪ Using t.dist ▪ And the same structure: ‘Less than’ -> t.dist; ‘Greater than’ -> 1 – t.dist; ‘Between’ -> t.dist(big) – t.dist(small)
Recall: z = (value – mean)/SD. T-score: t = (value – mean)/(SD/sqrt(n)). Divide the standard deviation by the square root of the sample size. The bigger the sample size, the bigger the number you divide SD by -> smaller SD for the sample mean -> less spread out/more accurate!
=t.dist(t-score, degrees of freedom, True)
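As a sketch of the same 2-step recipe outside Excel (Python/scipy; the mean, SD, sample size, and cut-off values below are placeholders, not from the slides):

    # Sketch: the 2-step process -- convert to a t-score, then look up the probability
    from math import sqrt
    from scipy import stats

    mean, sd, n = 100, 15, 25            # placeholder population mean, SD, and sample size
    df = n - 1

    def t_score(x):
        return (x - mean) / (sd / sqrt(n))

    # 'Less than'    -> t.dist            (cdf)
    print("P(sample mean < 105) =", stats.t.cdf(t_score(105), df))
    # 'Greater than' -> 1 - t.dist        (sf is 1 - cdf)
    print("P(sample mean > 105) =", stats.t.sf(t_score(105), df))
    # 'Between'      -> t.dist(big) - t.dist(small)
    print("P(95 < sample mean < 105) =", stats.t.cdf(t_score(105), df) - stats.t.cdf(t_score(95), df))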
I will walk you through an example, but first, note that we cover this primarily so you will understand what comes later. Direct business applications (or at least marketing applications) aren’t as common as for other techniques
Heights for a particular segment are normally distributed, with an average of 176 cm, and a standard deviation of 7.1 cm. If you select an individual at random, what is the probability that he has a height greater than 180 cm?
Heights for a particular segment are normally distributed, with an average of 176 cm, and a standard deviation of 7.1 cm. If you select an individual at random, what is the probability that he has a height greater than 180 cm? =1 – norm.dist(180, 176, 7.1, true) ≈ 0.287
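If you want to sanity-check that number outside Excel, a quick Python/scipy equivalent (just a sketch; scipy isn’t part of our toolkit here) is:

    # Sketch: same calculation as =1 - norm.dist(180, 176, 7.1, TRUE)
    from scipy import stats

    prob = stats.norm.sf(180, loc=176, scale=7.1)   # P(one individual is taller than 180 cm)
    print(round(prob, 3))                           # ~0.287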
Heights for a particular segment are normally distributed, with an average of 176 cm, and a standard deviation of 7.1 cm. If you select a random sample of size 5, what is the probability that the mean height is greater than 180 cm?
Heights for a particular segment are normally distributed, with an average of 176 cm, and a standard deviation of 7.1 cm. If you select a random sample of size 5, what is the probability that the mean height is greater than 180 cm? t = (180 – 176)/(7.1/sqrt(5)) ≈ 1.26 prob = 1 – t.dist(1.26, 4, true) ≈ 0.138
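And the same hedged check for the sample-of-5 version (a Python/scipy sketch, mirroring the t-score and t.dist steps above):

    # Sketch: same calculation as the t.dist version for a sample of size 5
    from math import sqrt
    from scipy import stats

    n = 5
    t = (180 - 176) / (7.1 / sqrt(n))    # t-score, about 1.26
    prob = stats.t.sf(t, df=n - 1)       # P(sample mean > 180), df = n - 1 = 4
    print(round(t, 2), round(prob, 3))   # ~1.26, ~0.138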
Repeat, with: a sample size of 15, and a sample size of 30. What happens to the probability? Why?
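One way to answer this without repeating the Excel steps by hand is the sketch below (Python/scipy, same assumptions as the example above):

    # Sketch: P(sample mean > 180 cm) for several sample sizes
    from math import sqrt
    from scipy import stats

    for n in (5, 15, 30):
        t = (180 - 176) / (7.1 / sqrt(n))   # larger n -> larger t-score for the same 4 cm gap
        prob = stats.t.sf(t, df=n - 1)
        print(f"n={n:2d}  t={t:4.2f}  P(mean > 180) = {prob:.3f}")

The probability shrinks as n grows: the sample mean is less spread out around 176 cm, so a mean above 180 cm becomes less and less likely.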