2.5: Numerical Measures of Variability (Spread) The mean, median and mode give us an idea of the central tendency, or where the “middle” of the data is. Variability gives us an idea of how spread out the data are around that middle. McClave, Statistics , 11th ed. Chapter 2: Methods for Describing Sets of Data
2.5: Numerical Measures of Variability
- Range
- Variance
- Standard Deviation (SD)
- IQR (interquartile range)
Range The range is equal to the largest value minus the smallest value in the data set. Easy to compute, but not very informative. Considers only two observations (the smallest and largest).
Range Ex: Data (Number of Vacation Days in a Company): 22, 17, 15, 16, 14, 20, 25, 11, 26, 14, 23, 21, 13, 15, 15, 28, 30, 20, 14, 33, 27, 28, 15, 22, 16, 12, 25, 31, 19, 23, 26, 21, 11, 18, 17
Range = largest - smallest = 33 - 11 = 22
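The range calculation above can be checked in a few lines. This is a minimal Python sketch; the names `vacation_days` and `data_range` are illustrative and do not come from the slides.

```python
# Vacation-day data copied from the slide.
vacation_days = [22, 17, 15, 16, 14, 20, 25, 11, 26, 14, 23, 21, 13, 15, 15,
                 28, 30, 20, 14, 33, 27, 28, 15, 22, 16, 12, 25, 31, 19, 23,
                 26, 21, 11, 18, 17]


def data_range(values):
    """Return the largest value minus the smallest value."""
    return max(values) - min(values)


print(data_range(vacation_days))  # 33 - 11 = 22
```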
Sample Variance The sample variance, $s^2$, for a sample of n measurements is equal to the sum of the squared distances from the mean, divided by (n - 1): $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Sample Standard Deviation (SD) The sample standard deviation, s, for a sample of n measurements is equal to the square root of the sample variance: $s = \sqrt{s^2}$
Example 1: Say a small data set consists of the measurements 1, 3, 5, and 3. The mean is $\bar{x} = 3$, so $s^2 = \frac{(1-3)^2 + (3-3)^2 + (5-3)^2 + (3-3)^2}{4 - 1} = \frac{8}{3} \approx 2.67$ and $s = \sqrt{8/3} \approx 1.63$. (With only a few numbers, the definition and the calculation formula take about the same time; for data sets with many numbers, the calculation formula is easier to use.)
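The comparison between the definition and the calculation formula can be sketched as follows. The slide does not show the calculation formula, so this sketch assumes it is the usual shortcut $s^2 = \frac{\sum x_i^2 - (\sum x_i)^2 / n}{n - 1}$; the function names are illustrative.

```python
import math


def sample_variance_definition(xs):
    """Sum of squared distances from the mean, divided by (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def sample_variance_shortcut(xs):
    """Assumed 'calculation formula': (sum of x^2 - (sum of x)^2 / n) / (n - 1)."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)


data = [1, 3, 5, 3]
s2_definition = sample_variance_definition(data)
s2_shortcut = sample_variance_shortcut(data)
s = math.sqrt(s2_definition)

print(s2_definition, s2_shortcut, s)
```

Both functions should agree up to floating-point rounding; the shortcut needs only the running totals of x and x squared, which is why it is quicker by hand for long data sets.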
Example 2: Data (Number of Vacation Days in a Company): 22, 17, 15, 16, 14, 20, 25, 11, 26, 14, 23, 21, 13, 15, 15, 28, 30, 20, 14, 33, 27, 28, 15, 22, 16, 12, 25, 31, 19, 23, 26, 21, 11, 18, 17
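For a data set of this size, a library routine is the practical route. The sketch below uses Python's standard `statistics` module, whose `variance` and `stdev` functions divide by (n - 1) as in the sample formulas; the variable names are illustrative.

```python
import statistics

# Vacation-day data copied from the slide.
vacation_days = [22, 17, 15, 16, 14, 20, 25, 11, 26, 14, 23, 21, 13, 15, 15,
                 28, 30, 20, 14, 33, 27, 28, 15, 22, 16, 12, 25, 31, 19, 23,
                 26, 21, 11, 18, 17]

mean = statistics.mean(vacation_days)
sample_var = statistics.variance(vacation_days)  # divides by n - 1
sample_sd = statistics.stdev(vacation_days)      # square root of the above

print(f"mean = {mean:.2f}, s^2 = {sample_var:.2f}, s = {sample_sd:.2f}")
```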
2.5: Numerical Measures of Variability As before, Greek letters are used for populations and Roman letters for samples:
$s^2$ = sample variance
$s$ = sample standard deviation
$\sigma^2$ = population variance
$\sigma$ = population standard deviation
2.6: Interpreting the Standard Deviation
- Chebyshev’s Rule
- The Empirical Rule
Both tell us something about where the data will be relative to the mean.
Chebyshev’s Rule
Valid for any data set. For any number k > 1, at least $(1 - 1/k^2) \times 100\%$ of the observations will lie within k standard deviations of the mean.
k    k^2    1/k^2     (1 - 1/k^2) × 100%
2    4      0.25      75%
3    9      0.11      89%
4    16     0.0625    93.75%
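The table rows follow directly from the formula. A minimal sketch, with `chebyshev_lower_bound` as an illustrative name:

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of observations within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule requires k > 1")
    return 1 - 1 / k ** 2


for k in (2, 3, 4):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.2%}")
```

Because the rule holds for any data set, the bound is conservative: real data usually have far more than this fraction inside the interval.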
Chebyshev’s Rule Ex: Data (Number of Vacation Days in a Company): 22, 17, 15, 16, 14, 20, 25, 11, 26, 14, 23, 21, 13, 15, 15, 28, 30, 20, 14, 33, 27, 28, 15, 22, 16, 12, 25, 31, 19, 23, 26, 21, 11, 18, 17
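One way to apply Chebyshev’s Rule to this data set is to count how many observations actually fall within k sample standard deviations of the mean and compare that with the guaranteed minimum. This is a sketch rather than the worked solution from the slides; `fraction_within` is an illustrative helper.

```python
import statistics

vacation_days = [22, 17, 15, 16, 14, 20, 25, 11, 26, 14, 23, 21, 13, 15, 15,
                 28, 30, 20, 14, 33, 27, 28, 15, 22, 16, 12, 25, 31, 19, 23,
                 26, 21, 11, 18, 17]


def fraction_within(values, k):
    """Fraction of values within k sample standard deviations of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    inside = [x for x in values if abs(x - mean) <= k * sd]
    return len(inside) / len(values)


for k in (2, 3):
    guaranteed = 1 - 1 / k ** 2
    observed = fraction_within(vacation_days, k)
    print(f"k = {k}: observed {observed:.1%}, guaranteed at least {guaranteed:.1%}")
```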
Empirical Rule
Useful for bell-shaped, symmetrical distributions. For a perfectly symmetrical and bell-shaped distribution:
- ~68% of the measurements will be within the range $\bar{x} \pm s$
- ~95% of the measurements will be within the range $\bar{x} \pm 2s$
- ~99.7% of the measurements will be within the range $\bar{x} \pm 3s$
If the distribution is not perfectly bell-shaped and symmetrical, these values are approximations.
Empirical Rule Ex: Data (Number of Vacation Days in a Company): 22, 17, 15, 16, 14, 20, 25, 11, 26, 14, 23, 21, 13, 15, 15, 28, 30, 20, 14, 33, 27, 28, 15, 22, 16, 12, 25, 31, 19, 23, 26, 21, 11, 18, 17
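To see how closely this data set follows the Empirical Rule, the observed percentages within 1, 2, and 3 sample standard deviations can be set beside the nominal 68%, 95%, and 99.7%. The sketch below is one possible check, not the slides' own solution; all names are illustrative.

```python
import statistics

vacation_days = [22, 17, 15, 16, 14, 20, 25, 11, 26, 14, 23, 21, 13, 15, 15,
                 28, 30, 20, 14, 33, 27, 28, 15, 22, 16, 12, 25, 31, 19, 23,
                 26, 21, 11, 18, 17]

mean = statistics.mean(vacation_days)
sd = statistics.stdev(vacation_days)

# Nominal Empirical Rule percentages for k = 1, 2, 3.
nominal = {1: 0.68, 2: 0.95, 3: 0.997}

observed = {}
for k in nominal:
    count = sum(1 for x in vacation_days if abs(x - mean) <= k * sd)
    observed[k] = count / len(vacation_days)
    print(f"within {k} SD: observed {observed[k]:.1%}, rule says about {nominal[k]:.1%}")
```

Agreement will only be rough when the data are not bell-shaped, which is the point of the caveat on the Empirical Rule slide.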
Empirical Rule (Example 3) Hummingbirds beat their wings in flight an average of 55 times per second. Assume the standard deviation is 10, and that the distribution is symmetrical and bell-shaped. Approximately what percentage of hummingbirds beat their wings between 45 and 65 times per second? Between 55 and 65? Less than 45?
Empirical Rule (Example 3) Between 45 and 65: since 45 and 65 are exactly one standard deviation below and above the mean of 55, the empirical rule says that about 68% of the hummingbirds will be in this range.
Empirical Rule (Example 3) Between 55 and 65: this range runs from the mean to one standard deviation above it, which is one-half of the range from 45 to 65. So about one-half of 68%, or 34%, of the hummingbirds will be in this range.
Empirical Rule (Example 3) Less than 45: half of the entire data set lies below the mean of 55, and by symmetry ~34% lie between 45 and 55 (between one standard deviation below the mean and the mean). That leaves ~50% - 34% = ~16% below 45.
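The three answers can be compared with exact normal-curve areas. This sketch makes a stronger assumption than the slide does: it treats the wingbeat distribution as exactly normal with mean 55 and standard deviation 10, using `statistics.NormalDist` from the Python standard library (Python 3.8+).

```python
from statistics import NormalDist

# Assumption: wingbeats per second are exactly normal(55, 10).
wingbeats = NormalDist(mu=55, sigma=10)

between_45_65 = wingbeats.cdf(65) - wingbeats.cdf(45)  # within 1 SD of the mean
between_55_65 = wingbeats.cdf(65) - wingbeats.cdf(55)  # mean to 1 SD above
below_45 = wingbeats.cdf(45)                           # more than 1 SD below

print(f"45 to 65: {between_45_65:.1%}")
print(f"55 to 65: {between_55_65:.1%}")
print(f"below 45: {below_45:.1%}")
```

The results should land close to the rounded 68%, 34%, and 16% obtained from the Empirical Rule.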
2.7: Numerical Measures of Relative Standing: Percentiles
Percentiles: for any (large) set of n measurements (arranged in ascending order), the (100p)th percentile is a number such that at least 100p% of the measurements fall at or below that number and at least 100(1 - p)% fall at or above it.
Percentiles Finding percentiles is similar to finding the median – the median is the 50th percentile. If you are in the 50th percentile for the GRE, half of the test-takers scored like you or better and half scored like you or worse. If you are in the 75th percentile, three-quarters of the test-takers scored like you or worse. If you are in the 90th percentile, only 10% of all the test-takers scored like you or better.
Finding Percentiles
1. Order the data from smallest to largest.
2. Find k = n × p.
(a) If k is an integer, the (100p)th percentile is the average of the kth and (k+1)th values.
(b) If k is not an integer, round it up to the next integer, say q. Then the (100p)th percentile is the qth value.
Finding Percentiles Ex: Data (Number of Vacation Days in a Company):
Ordered: 11, 11, 12, 13, 14, 14, 14, 15, 15, 15, 15, 16, 16, 17, 17, 18, 19, 20, 20, 21, 21, 22, 22, 23, 23, 25, 25, 26, 26, 27, 28, 28, 30, 31, 33
(i) Find the 80th percentile. k = n × p = 35 × 0.80 = 28 is an integer, so the 80th percentile is the average of the 28th and 29th values, i.e., 80th percentile = (26 + 26)/2 = 26.
Check: 29 out of 35 values (~83% of the data) are ≤ 26, which satisfies "at least 80%". 8 out of 35 values (~23% of the data) are ≥ 26, which satisfies "at least 20%".
Finding Percentiles
(ii) Find the 25th percentile (first quartile or lower quartile, QL). k = n × p = 35 × 0.25 = 8.75 is not an integer, so the 25th percentile is the 9th value, i.e., 25th percentile = 15.
(iii) Find the 75th percentile (third quartile or upper quartile, QU). k = n × p = 35 × 0.75 = 26.25 is not an integer, so the 75th percentile is the 27th value, i.e., 75th percentile = 25.
Interquartile Range (IQR) The IQR is another measure of spread or variability. It is the difference between the third quartile and the first quartile, that is, IQR = 75th percentile - 25th percentile = QU - QL. Ex: for the number of vacation days of the 35 employees, IQR = 25 - 15 = 10.
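The percentile procedure and the IQR can be sketched in code. This is a minimal sketch of the slide's rule; the function name `percentile` and the choice to pass the percentile as a whole number (80 rather than 0.80, which keeps k = n × p in exact integer arithmetic) are assumptions made here. Statistical software often uses interpolation rules that give slightly different quartiles.

```python
def percentile(values, pct):
    """Slide rule: k = n * pct / 100; average kth and (k+1)th values if k is
    an integer, otherwise round k up and take that value. pct is an integer."""
    if not 0 < pct < 100:
        raise ValueError("pct must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    k, remainder = divmod(n * pct, 100)
    if remainder == 0:
        # k is an integer: average the kth and (k+1)th values (1-based).
        return (ordered[k - 1] + ordered[k]) / 2
    # k is not an integer: the rounded-up position is k + 1 (1-based).
    return ordered[k]


vacation_days = [22, 17, 15, 16, 14, 20, 25, 11, 26, 14, 23, 21, 13, 15, 15,
                 28, 30, 20, 14, 33, 27, 28, 15, 22, 16, 12, 25, 31, 19, 23,
                 26, 21, 11, 18, 17]

p80 = percentile(vacation_days, 80)
q_lower = percentile(vacation_days, 25)
q_upper = percentile(vacation_days, 75)
iqr = q_upper - q_lower

print(p80, q_lower, q_upper, iqr)
```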
2.7: Numerical Measures of Relative Standing: z-score
The z-score tells us how many standard deviations above or below the mean a particular measurement is.
Sample z-score: $z = \frac{x - \bar{x}}{s}$
Population z-score: $z = \frac{x - \mu}{\sigma}$
2.6: Interpreting the Standard Deviation Hummingbirds beat their wings in flight an average of 55 times per second. Assume the standard deviation is 10, and that the distribution is symmetrical and bell-shaped. An individual hummingbird is measured with 75 beats per second. What is this bird’s z-score? The measurement is 20 beats above the mean, so $z = (75 - 55)/10 = 2$.
2.6: Interpreting the Standard Deviation Since ~95% of all the measurements will be within 2 standard deviations of the mean, only ~5% will be more than 2 standard deviations from the mean. About half of this 5% will be far below the mean, leaving only about 2.5% of the measurements at least 2 standard deviations above the mean.
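The hummingbird z-score and the "about 2.5%" tail can be reproduced as follows. The sketch assumes an exactly normal distribution for the tail area, which is stronger than the slide's "symmetrical and bell-shaped"; `z_score` is an illustrative name.

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z = z_score(75, 55, 10)

# Assumption: exactly normal. Area more than z standard deviations above the mean.
upper_tail = 1 - NormalDist().cdf(z)

print(f"z = {z}, upper tail = {upper_tail:.2%}")
```

The exact normal tail is a little under the 2.5% given by the Empirical Rule, because "95% within 2 standard deviations" is itself a rounded figure.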
2.7: Numerical Measures of Relative Standing
z-scores are related to the empirical rule. For a perfectly symmetrical and bell-shaped distribution:
- ~68% will have z-scores between -1 and 1
- ~95% will have z-scores between -2 and 2
- ~99.7% will have z-scores between -3 and 3
2.8: Methods for Determining Outliers
An outlier is a measurement that is unusually large or small relative to the other values. Three possible causes:
- Observation, recording or data entry error
- Item is from a different population
- A rare, chance event
2.8: Methods for Determining Outliers The box plot is a graph representing information about certain percentiles for a data set and can be used to identify outliers.
2.8: Methods for Determining Outliers [Figure: box plot with labels for the Lower Quartile (QL), Median, Upper Quartile (QU), Minimum Value inside the inner fence, and Maximum Value inside the inner fence]
2.8: Methods for Determining Outliers Interquartile Range (IQR) = QU - QL
2.8: Methods for Determining Outliers
Left Outer Fence = QL - 3(IQR)
Left Inner Fence = QL - 1.5(IQR)
Right Inner Fence = QU + 1.5(IQR)
Right Outer Fence = QU + 3(IQR)
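The fence formulas can be applied to the vacation-day quartiles (QL = 15, QU = 25). The sketch below is illustrative: the helper names and the neutral labels returned by `classify` are assumptions, and the values 45 and 60 are hypothetical measurements added only to show values outside the fences. Textbooks commonly treat values between the inner and outer fences as suspect outliers and values beyond the outer fences as more extreme ones.

```python
def fences(q_lower, q_upper):
    """Inner and outer fences computed from the quartiles."""
    iqr = q_upper - q_lower
    return {
        "left_outer": q_lower - 3 * iqr,
        "left_inner": q_lower - 1.5 * iqr,
        "right_inner": q_upper + 1.5 * iqr,
        "right_outer": q_upper + 3 * iqr,
    }


def classify(value, f):
    """Locate a value relative to the fences."""
    if value < f["left_outer"] or value > f["right_outer"]:
        return "beyond outer fences"
    if value < f["left_inner"] or value > f["right_inner"]:
        return "between inner and outer fences"
    return "inside inner fences"


f = fences(15, 25)
print(f)
print(classify(33, f))  # largest observed vacation-day value
print(classify(45, f))  # hypothetical value, not in the data
print(classify(60, f))  # hypothetical value, not in the data
```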
2.8: Methods for Determining Outliers Outliers and z-scores For a bell-shaped distribution, the chance that a z-score is between -3 and +3 is over 99%. Any measurement with |z| > 3 is considered an outlier.
2.8: Methods for Determining Outliers Outliers and z-scores Here are the descriptive statistics for the games won at the All-Star break, except one team had its total wins for 2006 recorded. That team, with 104 wins recorded, had a z-score of (104 - 45.68)/12.11 = 4.82. That’s a very unlikely result, which isn’t surprising given what we know about the observation.
# of Wins
n                            30
Mean                         45.68
Sample Variance              146.69
Sample Standard Deviation    12.11
Minimum                      25
Maximum                      104
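The |z| > 3 rule can be written as a small check using the summary statistics from the table. The function name `z_outlier_check` and its `threshold` parameter are illustrative.

```python
def z_outlier_check(x, mean, sd, threshold=3.0):
    """Return the z-score of x and whether |z| exceeds the threshold."""
    z = (x - mean) / sd
    return z, abs(z) > threshold


# Mean and sample standard deviation taken from the # of Wins table.
z, flagged = z_outlier_check(104, 45.68, 12.11)
print(f"z = {z:.2f}, outlier by the |z| > 3 rule: {flagged}")
```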
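Robustness to Outliers
Ex: Consider the data: 2, 7, 5, 6, 4, 2, 5, 1, 5, 6
Mean = 4.3, Median = 5, Mode = 5
Range = 6, Variance = 4.01, SD = 2.0, IQR = 3.25
Ex: What if the data were 2, 7, 5, 6, 4, 2, 5, 1, 5, 100? Then
Mean = 13.7, Median = 5, Mode = 5
Range = 99, Variance = 923.12, SD = 30.38, IQR = 3.25
(The IQR of 3.25 comes from interpolated quartiles, QL = 2.5 and QU = 5.75, as many software packages compute them. The rounding rule from the Finding Percentiles slides gives QL = 2 and QU = 6, so IQR = 4, for both data sets.)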
Robustness to Outliers Hence, the mean, range, variance, and standard deviation are highly affected by outliers (or extreme values), while the median, mode, and IQR are not affected by the outlier here, i.e., they are robust to outliers.
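The two data sets can be summarized side by side to see which measures move when 6 is replaced by 100. This sketch uses Python's `statistics` module; `summarize` is an illustrative helper, and `method="inclusive"` in `statistics.quantiles` is chosen here because it interpolates the quartiles in a way that matches the IQR of 3.25 on the slide.

```python
import statistics


def summarize(values):
    """Center and spread measures for one data set."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "variance": statistics.variance(values),
        "sd": statistics.stdev(values),
        "iqr": q3 - q1,
    }


original = summarize([2, 7, 5, 6, 4, 2, 5, 1, 5, 6])
with_outlier = summarize([2, 7, 5, 6, 4, 2, 5, 1, 5, 100])

for name in original:
    print(f"{name:>8}: {original[name]:8.2f} -> {with_outlier[name]:8.2f}")
```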
2.9: Graphing Bivariate Relationships A scattergram (or scatterplot) shows the relationship between two quantitative variables.
2.9: Graphing Bivariate Relationships If there is no linear relationship between the variables, the scatterplot may look like a cloud, a horizontal line or a more complex curve. Source: Quantitative Environmental Learning Project http://www.seattlecentral.org/qelp/index.html
2.10: Distorting the Truth with Deceptive Statistics
Distortions:
- Stretching the axis (and the truth)
- Is the average the center? Mean, median or mode?
- Is the average relevant? What about the spread?