Published by Hugh Briggs. Modified over 7 years ago.
0
2017 Statistics Review John Glenn College of Public Affairs
Aditi Vaishali Thapar
1
Outline
Sampling
  Sampling Terms: An Example
Measurement
Descriptive Statistics:
  Measures of Central Tendency
  Measures of Dispersion
  The Normal Distribution
Inferential Statistics:
  Correlation vs. Causation
  Hypothesis Testing
  P-values
  Standard Error
  Confidence Intervals and Z-Scores
2
Sampling: Population vs. Sample
Population: The entire group of people or things about which we want information. It is unlikely that we will be able to collect data for the entire population.
Sample: A representative portion of the population about which data is collected.
3
Sampling: Statistics vs. Parameters
Parameters: Summarise data for an entire population
Statistics: Summarise data for a sample
Unit of Analysis: The entity that is being analyzed in a study
Variable: A characteristic of the unit of analysis
4
Sampling Terms: Example
What is the demographic information for students who attend statistics boot camp?
Population:
Sample:
Unit of Analysis:
Variables:
Parameter:
Statistic:
5
Sampling Terms: Example
What is the demographic information for students who attend statistics boot camp?
Population: All students who attend statistics boot camp
Sample: 20 randomly selected students at statistics boot camp
Unit of Analysis: The individual (i.e. the student)
Variables: Age, gender, income, race, etc.
Parameter: Average age of all students at statistics boot camp, etc.
Statistic: Average age of the 20 randomly selected students at boot camp, etc.
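The distinction between a parameter and a statistic can be sketched in a few lines of Python. This is only an illustration: the population of 200 students and their ages are made-up values, not data from the boot camp.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 200 boot camp students (made-up values).
population_ages = [random.randint(21, 45) for _ in range(200)]

# Parameter: summarises the entire population.
parameter_mean_age = statistics.mean(population_ages)

# Sample: 20 students drawn at random, without replacement.
sample_ages = random.sample(population_ages, 20)

# Statistic: summarises the sample only.
statistic_mean_age = statistics.mean(sample_ages)

print(f"Parameter (population mean age): {parameter_mean_age:.2f}")
print(f"Statistic (sample mean age):     {statistic_mean_age:.2f}")
```

The two numbers will usually be close but not identical, which is why we later need tools such as the standard error.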
6
Measurement
Nominal
  Numerical values just "name" the attribute uniquely
  No ordering of the cases is implied
  Example: Numbers on football/basketball jerseys
Ordinal
  Attributes can be rank-ordered numerically
  Distances between attributes do not have any meaning
  Example: Coding educational attainment as 0 = less than high school, 1 = high school degree, 2 = college degree, 3 = Masters, PhD, etc.
7
Measurement
Interval
  The distance between attributes does have meaning
  Example: When measuring temperature, the distance between 30F and 40F is the same as that between 70F and 80F
Ratio
  There is always a meaningful absolute zero, i.e. you can construct a meaningful fraction/ratio
8
Measures of Central Tendency
Measures of central tendency tell us where most of the data lie
Mean: Also known as the average. Add up all the values for your variable, then divide by the total number of values.
Median: The middle score for a set of data that has been arranged in order of magnitude
Mode: The most frequent value in the dataset
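As a quick illustration, the three measures can be computed with Python's standard `statistics` module. The five scores below are hypothetical values chosen only to keep the arithmetic easy to follow.

```python
import statistics

# Hypothetical scores, chosen for easy arithmetic.
scores = [2, 3, 3, 5, 7]

mean_score = statistics.mean(scores)      # (2 + 3 + 3 + 5 + 7) / 5
median_score = statistics.median(scores)  # middle value once sorted
mode_score = statistics.mode(scores)      # most frequent value

print("Mean:  ", mean_score)
print("Median:", median_score)
print("Mode:  ", mode_score)
```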
9
Which Measure Should We Use?
It depends on both the type of variable and the distribution of the data
Mode: Typically used when we have categorical data (e.g. gender, race, educational attainment, etc.)
Mean: When we want the average value of a variable, unless our data are skewed
Median: When we have skewed data and/or outliers
Question: What measure of central tendency would you use to calculate the average salary for a group of 10 people where 9 people earn $1 and 1 person earns $100?
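One way to explore the salary question is to compute both measures and compare them. The sketch below uses exactly the figures from the question (nine people earning $1 and one earning $100).

```python
import statistics

# Nine people earn $1 and one person earns $100, as in the question.
salaries = [1] * 9 + [100]

mean_salary = statistics.mean(salaries)      # pulled upward by the single outlier
median_salary = statistics.median(salaries)  # unaffected by the outlier

print(f"Mean salary:   ${mean_salary:.2f}")
print(f"Median salary: ${median_salary:.2f}")
```

The mean works out to $10.90, a salary that nobody in the group earns, while the median of $1 describes nine of the ten people. This suggests the median is the better choice for skewed data like these.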
10
Measures of Dispersion
Measures of dispersion describe the spread of the data
Range: Maximum - Minimum
Variance: The average squared distance of the observations in the sample dataset from the mean
Standard Deviation: The square root of the variance. A low standard deviation tells us that data points tend to be close to the mean.
11
Measures of Dispersion
Question: Given the data below on test scores, what are the sample size (N), mean, median, mode, range, standard deviation and variance?
6, 10, 8, 7, 4, 9, 3, 0, 6, 6
12
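Measures of Dispersion
Answer: Start by arranging the data in order of magnitude: 0, 3, 4, 6, 6, 6, 7, 8, 9, 10
Sample size: 10
Mean: 59 / 10 = 5.9
Median: 6
Mode: 6
Range: 10 - 0 = 10
Variance: approximately 8.77, calculated using $s^2 = \frac{(0-5.9)^2 + (3-5.9)^2 + (4-5.9)^2 + 3(6-5.9)^2 + (7-5.9)^2 + (8-5.9)^2 + (9-5.9)^2 + (10-5.9)^2}{10-1} = \frac{78.9}{9}$
Standard deviation: $\sqrt{78.9 / 9} \approx 2.96$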
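The answer can be checked with Python's standard `statistics` module. The sketch assumes the ten ordered scores listed in the answer and uses the sample formulas (dividing by N - 1), which is what `statistics.variance` and `statistics.stdev` do.

```python
import statistics

# The ten test scores from the answer, in order of magnitude.
scores = [0, 3, 4, 6, 6, 6, 7, 8, 9, 10]

sample_size = len(scores)
mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
mode_score = statistics.mode(scores)
score_range = max(scores) - min(scores)
variance = statistics.variance(scores)  # sample variance, divides by N - 1
std_dev = statistics.stdev(scores)      # square root of the sample variance

print(f"N = {sample_size}, mean = {mean_score}, median = {median_score}, mode = {mode_score}")
print(f"range = {score_range}, variance = {variance:.2f}, standard deviation = {std_dev:.2f}")
```

Note that 78.9 / 9 is 8.7667 before rounding, so the variance rounds to 8.77.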
13
The Normal Distribution
The normal distribution is a symmetric, bell-shaped distribution that is completely described by its mean and standard deviation
The mean describes the centre of the curve
The standard deviation determines the spread (how wide or narrow the curve is)
14
Central Limit Theorem
As the sample size grows larger, the sampling distribution of the sample mean approaches a normal distribution, even when the variable itself is not normally distributed
What does this theorem tell us?
A sample with more observations gives us a truer picture of the actual population
Drawing conclusions from samples that are "too small" may make for an unreliable analysis
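A small simulation can make the theorem more concrete. The sketch below is an illustration under assumptions of our own: it draws from a skewed exponential distribution with a true mean of 1, and the sample sizes and number of repeats are arbitrary choices.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

def sample_means(sample_size, repeats=2000):
    """Return the means of `repeats` random samples from a skewed
    (exponential, true mean 1) distribution."""
    return [
        statistics.mean([random.expovariate(1.0) for _ in range(sample_size)])
        for _ in range(repeats)
    ]

for n in (2, 30, 200):
    means = sample_means(n)
    print(
        f"sample size {n:>3}: "
        f"average of sample means = {statistics.mean(means):.3f}, "
        f"spread of sample means = {statistics.stdev(means):.3f}"
    )
```

As the sample size grows, the sample means cluster more tightly around the true mean of 1, even though the underlying variable is far from bell-shaped.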
15
Correlation
Correlation: A single number that describes the degree of relationship between two variables. The value of the correlation coefficient ranges from -1 to 1.
If the correlation coefficient is positive, the two variables move together
  Example: Education and salary (as the level of education increases, so does salary)
If the correlation coefficient is negative, the two variables have an inverse relationship
  Example: Education and unemployment rate (as the level of education increases, the unemployment rate decreases)
If the correlation coefficient is zero, the two variables do not have a linear relationship
  Example: The weather and salary
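To show how the sign of the coefficient behaves, here is a minimal sketch of the Pearson correlation coefficient written out by hand. The education, salary and unemployment figures are hypothetical values invented for illustration, and `pearson_r` is our own helper name.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and sums of squared deviations.
    co_deviation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    squares_x = sum((x - mean_x) ** 2 for x in xs)
    squares_y = sum((y - mean_y) ** 2 for y in ys)
    return co_deviation / math.sqrt(squares_x * squares_y)

# Hypothetical data for five groups of people.
years_of_education = [10, 12, 14, 16, 18]
salary_thousands = [30, 35, 45, 50, 60]
unemployment_rate = [9, 8, 6, 5, 3]

print(f"Education vs salary:       {pearson_r(years_of_education, salary_thousands):.2f}")
print(f"Education vs unemployment: {pearson_r(years_of_education, unemployment_rate):.2f}")
```

The first coefficient comes out close to +1 and the second close to -1, matching the positive and inverse relationships described above.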
16
Causation
Causation is a much stronger relationship than correlation: two variables can be correlated without one causing the other
17
Hypothesis Testing
Hypothesis testing is used to compare our observed statistic to another statistic or to a parameter.
But what does that really mean? You're testing whether your results are valid by calculating how likely it is that they are a product of chance.
The null hypothesis (H0) is the hypothesis that we are trying to disprove. Usually, the null hypothesis is a statement of no effect or no difference.
The alternative hypothesis (H1) describes the relationship as we expect it to be.
Tests can be either one-tailed or two-tailed.
18
Hypothesis Testing
One-tailed test example: A researcher claims that individuals aged 17 have an average body temperature higher than the commonly accepted average of 98.6F.
H0: Individuals aged 17 have an average body temperature that is not greater than 98.6F (average temp <= 98.6F)
H1: Individuals aged 17 have an average body temperature that is greater than 98.6F (average temp > 98.6F)
19
Hypothesis Testing
One-tailed test example: A researcher claims that consuming a drug she developed increases student performance on exams. The average student test score is 87.
H0: The drug will have no effect on average student test scores, i.e. they stay constant (average test score = 87)
H1: The drug will increase average student test scores, i.e. they rise (average test score > 87)
20
P-values
The p-value is the probability of finding a result at least as extreme as the one observed, assuming that the null hypothesis is true.
There are several common significance levels (1%, 5% and 10%) that we use as thresholds to test our claims. The most frequently used is 5% (0.05).
If the p-value obtained is higher than the 0.05 threshold, we say that our finding is not statistically significant. Therefore, we cannot reject our null hypothesis.
If the p-value obtained is lower than the 0.05 threshold, we say that our finding is statistically significant. Therefore, we can reject our null hypothesis in favour of the alternative hypothesis.
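A one-tailed p-value for the exam-score example (null average of 87) could be computed roughly as follows. Only the value 87 comes from the slides. The sample mean, the standard deviation (treated here as known, which makes this a z-test) and the sample size are hypothetical numbers chosen for illustration.

```python
import math
from statistics import NormalDist

null_mean = 87      # from the slides: average student test score under H0
sample_mean = 90    # hypothetical: average score of students who took the drug
population_sd = 9   # hypothetical: standard deviation of scores, assumed known
sample_size = 36    # hypothetical: number of students sampled

# How many standard errors the sample mean lies above the null mean.
standard_error = population_sd / math.sqrt(sample_size)
z = (sample_mean - null_mean) / standard_error

# One-tailed p-value: probability of a z at least this large if H0 is true.
p_value = 1 - NormalDist().cdf(z)

alpha = 0.05
reject_null = p_value < alpha
print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0 at 5%: {reject_null}")
```

With these made-up numbers the p-value is roughly 0.023, below the 0.05 threshold, so the null hypothesis of no effect would be rejected.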
21
Standard Error
Standard error is how far the sample mean is likely to be from the population mean.
How does this differ from the standard deviation? Standard deviation is the degree to which individuals within the sample differ from the sample mean.
Calculated using: $SE = \frac{s}{\sqrt{n}}$, where $s$ is the sample standard deviation and $n$ is the sample size
Example: if we only sample 5 universities to examine the impact of ownership on test scores, how likely is it that the true average test score is close to that in our sample?
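As a rough illustration of the five-university example, the sketch below assumes the usual formula for the standard error of the mean (sample standard deviation divided by the square root of the sample size). The five average test scores are hypothetical.

```python
import math
import statistics

# Hypothetical average test scores at five sampled universities.
university_scores = [70, 75, 80, 85, 90]

n = len(university_scores)
sample_mean = statistics.mean(university_scores)
sample_sd = statistics.stdev(university_scores)  # spread of universities around the sample mean

# Standard error of the mean: how far the sample mean is likely to be
# from the population mean.
standard_error = sample_sd / math.sqrt(n)

print(f"Sample mean:        {sample_mean:.2f}")
print(f"Standard deviation: {sample_sd:.2f}")
print(f"Standard error:     {standard_error:.2f}")
```

Because the sample size sits in the denominator, sampling more universities would shrink the standard error even if the standard deviation stayed the same.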
22
Confidence Intervals and Z-Scores
A Z-score is a numerical measurement of a value's distance from the mean, in units of standard deviation. A Z-score of 0 means the value is identical to the mean.
Calculated using: $z = \frac{x - \text{mean}}{\text{standard deviation}}$
At the 95% confidence level, we use a Z-score of 1.96.
A confidence interval is a range of values within which we are confident, at a chosen level such as 95%, that the true mean lies.
Calculated using: mean +/- (standard error * Z-score)
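The Z-score formula, and the origin of the 1.96 value, can be sketched with Python's standard library. The example reuses the mean (5.9) and standard deviation (2.96) from the test-score answer earlier in the deck; `z_score` is our own helper name.

```python
from statistics import NormalDist

def z_score(x, mean, standard_deviation):
    """Distance of x from the mean, in units of standard deviation."""
    return (x - mean) / standard_deviation

# Test-score example: mean 5.9, standard deviation 2.96.
z_top_score = z_score(10, 5.9, 2.96)  # the highest score lies above the mean
z_at_mean = z_score(5.9, 5.9, 2.96)   # a score equal to the mean gives 0

# A 95% interval leaves 2.5% in each tail of the standard normal
# distribution, so the cut-off is the 97.5th percentile.
z_95 = NormalDist().inv_cdf(0.975)

print(f"Z-score of a test score of 10: {z_top_score:.2f}")
print(f"Z-score of the mean itself:    {z_at_mean:.2f}")
print(f"Z-score for a 95% interval:    {z_95:.2f}")
```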
23
Finding the Confidence Interval
Question: You want to investigate the impact of a college degree on income. Therefore, you sample 20 persons who have a college degree (Group A) and 20 persons who do not (Group B). You get the following statistics. What is the 95% confidence interval for each group? How can we interpret the results?
          Mean     Min      Max       SE       Variance
Group A   70,000   20,000   130,000   25,000   200
Group B   68,000            200,000   15,000   400
Group A: 70,000 + 25,000*1.96 = 119,000 and 70,000 - 25,000*1.96 = 21,000
Group B: 68,000 + 15,000*1.96 = 97,400 and 68,000 - 15,000*1.96 = 38,600
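The interval arithmetic can be reproduced with a short sketch that applies mean +/- (standard error * Z-score) to the means and standard errors given in the question. The function name `confidence_interval` is our own.

```python
def confidence_interval(mean, standard_error, z=1.96):
    """Return (lower, upper) bounds: mean +/- (standard error * Z-score)."""
    margin = standard_error * z
    return mean - margin, mean + margin

# Means and standard errors taken from the question.
group_a = confidence_interval(70_000, 25_000)
group_b = confidence_interval(68_000, 15_000)

print(f"Group A 95% interval: {group_a[0]:,.0f} to {group_a[1]:,.0f}")
print(f"Group B 95% interval: {group_b[0]:,.0f} to {group_b[1]:,.0f}")
```

One possible reading of the result: the two intervals overlap heavily, so these samples alone do not show a clear difference in average income between the two groups.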
24
Let’s move to our worksheets!