Chapter 5 Describing Distributions Numerically
Describing the Distribution. Center: median (0.5 quantile, 2nd quartile, 50th percentile) and mean. Spread: range, interquartile range, and standard deviation.
Median: literally the middle number (data value). It has the same units as the data. When n (the number of observations) is odd: order the data from smallest to largest; the median is the middle number on the list, the (n+1)/2-th number from the smallest value. Ex: If n = 11, the median is the (11+1)/2 = 6th number from the smallest value. Ex: If n = 37, the median is the (37+1)/2 = 19th number from the smallest value.
Example – Frank Thomas. Career home runs: 4 7 15 18 24 28 29 32 35 38 40 40 41 42 43. Remember to order the values if they aren’t already in order! There are 15 observations, so the median is the (15+1)/2 = 8th observation from the bottom. Median = 32 HRs.
Median when n is even: order the data from smallest to largest; the median is the average of the two middle numbers. The position (n+1)/2 falls halfway between the positions of these two numbers. Ex: If n = 10, (10+1)/2 = 5.5, so the median is the average of the 5th and 6th numbers from the smallest value.
Example – Ryne Sandberg. Career home runs: 0 5 7 8 9 12 14 16 19 19 25 26 26 26 30 40. Remember to order the values if they aren’t already in order! There are 16 observations, so (16+1)/2 = 8.5: average the 8th and 9th observations from the bottom. Median = average of 16 and 19 = 17.5 HRs.
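The odd and even cases of the median rule can be sketched in a few lines of Python. This is a minimal illustration rather than a definitive implementation; the function name `median_by_position` is hypothetical, and the data are the two home run lists from the examples.

```python
def median_by_position(values):
    """Median using the (n + 1) / 2 position rule."""
    ordered = sorted(values)  # always order the data first
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the (n + 1) / 2-th value, which is index n // 2 when counting from 0.
        return ordered[mid]
    # Even n: (n + 1) / 2 falls between two positions, so average those two values.
    return (ordered[mid - 1] + ordered[mid]) / 2


thomas = [4, 7, 15, 18, 24, 28, 29, 32, 35, 38, 40, 40, 41, 42, 43]
sandberg = [0, 5, 7, 8, 9, 12, 14, 16, 19, 19, 25, 26, 26, 26, 30, 40]

print(median_by_position(thomas))    # 32
print(median_by_position(sandberg))  # 17.5
```

Sorting inside the function means the result does not depend on whether the caller remembered to order the values.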
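Mean: the ordinary average. Add up all observations and divide by the number of observations. It has the same units as the data. In the formula, there are n observations and y1, y2, y3, …, yn are the values.
Mean: $\bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n} = \frac{1}{n}\sum_{i=1}^{n} y_i$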
Examples: find the mean for Thomas and for Sandberg.
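One way to check an answer to this exercise is to add up the observations and divide by their count. The sketch below assumes the two home run lists from the median examples; the helper name `mean_of` is hypothetical.

```python
def mean_of(values):
    """Ordinary average: add up all observations, divide by how many there are."""
    return sum(values) / len(values)


thomas = [4, 7, 15, 18, 24, 28, 29, 32, 35, 38, 40, 40, 41, 42, 43]
sandberg = [0, 5, 7, 8, 9, 12, 14, 16, 19, 19, 25, 26, 26, 26, 30, 40]

print(round(mean_of(thomas), 2))  # 29.07  (436 / 15)
print(mean_of(sandberg))          # 17.625 (282 / 16)
```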
Mean vs. Median. Median = middle number. Mean = value where the histogram balances. The mean and median are similar when the data are symmetric. They differ when the data are skewed or there are outliers.
Mean vs. Median. The mean is influenced by unusually high or unusually low values. Example: income in a small town of 6 people: $25,000 $27,000 $29,000 $35,000 $37,000 $38,000. The mean income is $31,833. The median income is $32,000.
Mean vs. Median. Bill Gates moves to town: $25,000 $27,000 $29,000 $35,000 $37,000 $38,000 $40,000,000. The mean income is $5,741,571. The median income is $35,000. The mean is pulled by the outlier; the median is not. The mean is not a good measure of center for these data.
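The effect of the new resident can be reproduced with Python's standard `statistics` module. This is a small sketch using the income figures from the example; the variable names are illustrative only.

```python
from statistics import mean, median

town = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000]
with_gates = town + [40_000_000]  # one very large outlier joins the town

# Before: mean and median are close together.
print(round(mean(town)), median(town))

# After: the mean jumps into the millions, the median barely moves.
print(round(mean(with_gates)), median(with_gates))
```

A single extreme value changes the mean by a factor of roughly 180, while the median only shifts from $32,000 to $35,000.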
Mean vs. Median. Skewness pulls the mean in the direction of the tail: skewed to the right typically means mean > median; skewed to the left typically means mean < median. Outliers pull the mean in their direction: a large outlier gives mean > median; a small outlier gives mean < median.
Spread. Range = maximum – minimum. Thomas: Min = 4, Max = 43, Range = 43 - 4 = 39 HRs. Sandberg: Min = 0, Max = 40, Range = 40 - 0 = 40 HRs.
Spread. The range is a very basic measure of spread. It is highly affected by outliers, which can make the spread appear larger than it really is. Ex. The annual numbers of deaths from tornadoes in the U.S. from 1990 to 2000: 53 39 39 33 69 30 25 67 130 94 40. Range with outlier: 130 – 25 = 105 deaths. Range without outlier: 94 – 25 = 69 deaths.
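The sensitivity of the range to a single value is easy to check directly. The sketch below uses the tornado data from the example and treats 130 as the outlier; the helper name `value_range` is hypothetical.

```python
def value_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)


deaths = [53, 39, 39, 33, 69, 30, 25, 67, 130, 94, 40]
without_outlier = [d for d in deaths if d != 130]  # drop the one extreme year

print(value_range(deaths))           # 105
print(value_range(without_outlier))  # 69
```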
Spread: Interquartile Range (IQR). IQR = Q3 – Q1. First quartile (Q1): larger than about 25% of the data. Third quartile (Q3): larger than about 75% of the data. The IQR spans the center (middle) 50% of the values.
Finding Quartiles. Order the data. Split into two halves at the median: when n is odd, include the median in both halves; when n is even, do not include the median in either half. Q1 = median of the lower half. Q3 = median of the upper half.
Example – Frank Thomas. Order the values (15 values): 4 7 15 18 24 28 29 32 35 38 40 40 41 42 43. Lower half = 4 7 15 18 24 28 29 32, so Q1 = median of lower half = 21 HRs. Upper half = 32 35 38 40 40 41 42 43, so Q3 = median of upper half = 40 HRs. IQR = 40 – 21 = 19 HRs.
Example – Ryne Sandberg. Order the values (16 values): 0 5 7 8 9 12 14 16 19 19 25 26 26 26 30 40. Lower half = 0 5 7 8 9 12 14 16, so Q1 = median of lower half = 8.5 HRs. Upper half = 19 19 25 26 26 26 30 40, so Q3 = median of upper half = 26 HRs. IQR = Q3 – Q1 = 26 – 8.5 = 17.5 HRs.
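The split-at-the-median rule used in these two examples can be sketched as follows. Note that this is the convention from this chapter (include the median in both halves when n is odd); statistical software often uses other quartile rules and may give slightly different answers. The function name `quartiles` is hypothetical.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q3) by splitting the ordered data at the median."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        # Odd n: the median value belongs to both halves.
        lower = ordered[:half + 1]
        upper = ordered[half:]
    else:
        # Even n: the data split cleanly into two halves.
        lower = ordered[:half]
        upper = ordered[half:]
    return median(lower), median(upper)


thomas = [4, 7, 15, 18, 24, 28, 29, 32, 35, 38, 40, 40, 41, 42, 43]
sandberg = [0, 5, 7, 8, 9, 12, 14, 16, 19, 19, 25, 26, 26, 26, 30, 40]

for name, data in [("Thomas", thomas), ("Sandberg", sandberg)]:
    q1, q3 = quartiles(data)
    print(name, "Q1 =", q1, "Q3 =", q3, "IQR =", q3 - q1)
```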
Five Number Summary: Minimum, Q1, Median, Q3, Maximum.
Examples. Thomas: Min = 4 HRs, Q1 = 21 HRs, Median = 32 HRs, Q3 = 40 HRs, Max = 43 HRs. Sandberg: Min = 0 HRs, Q1 = 8.5 HRs, Median = 17.5 HRs, Q3 = 26 HRs, Max = 40 HRs.
Graph of Five Number Summary: Boxplot. The box runs between Q1 and Q3. A line in the box marks the median. Lines extend out to the minimum and maximum. Best used for comparisons. Use this simpler method.
Example – Thomas & Sandberg. Boxplot of Thomas home runs: box from 21 to 40, line in box at 32, lines extend out from the box to 4 and 43. Boxplot of Sandberg home runs: box from 8.5 to 26, line in box at 17.5, lines extend out from the box to 0 and 40.
Side by Side Boxplots of Thomas & Sandberg Home Runs
Spread: Standard deviation. “Average” spread from the mean. Most common measure of spread (although it is influenced by skewness and outliers). Denoted by the letter s. Make a table when calculating by hand.
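Standard Deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 1}}$, where $\bar{y}$ is the mean and n is the number of observations.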
Example – Deaths from Tornadoes (mean = 56.27 deaths)
  y     y - 56.27     (y - 56.27)^2
 53       -3.27           10.69
 39      -17.27          298.25
 39      -17.27          298.25
 33      -23.27          541.49
 69       12.73          162.05
 30      -26.27          690.11
 25      -31.27          977.81
 67       10.73          115.13
130       73.73         5436.11
 94       37.73         1423.55
 40      -16.27          264.71
Sum of squared deviations = 10218.15, so s = sqrt(10218.15 / 10) = 31.97 deaths.
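The by-hand table can be reproduced with a short script: one row per observation, holding the value, its deviation from the mean, and the squared deviation. This sketch assumes the sample formula with an n - 1 divisor, which is consistent with the value s = 31.97 reported for these data; the variable names are illustrative.

```python
from math import sqrt

deaths = [53, 39, 39, 33, 69, 30, 25, 67, 130, 94, 40]

n = len(deaths)
y_bar = sum(deaths) / n  # mean number of deaths per year

# One table row per observation: value, deviation, squared deviation.
rows = [(y, y - y_bar, (y - y_bar) ** 2) for y in deaths]
for y, dev, sq in rows:
    print(f"{y:>4} {dev:>9.2f} {sq:>10.2f}")

sum_sq = sum(sq for _, _, sq in rows)
s = sqrt(sum_sq / (n - 1))  # divide by n - 1, then take the square root

print(f"mean = {y_bar:.2f}, s = {s:.2f}")  # mean = 56.27, s = 31.97
```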
Example – Frank Thomas Find the standard deviation of the number of home runs given the following statistic:
Properties of s. s = 0 only when all observations are equal; otherwise, s > 0. s has the same units as the data. s is not resistant: skewness and outliers affect s, just like the mean. Tornado example: s with outlier = 31.97 deaths; s without outlier = 21.70 deaths.
Which summaries should you use? Numbers affected by outliers: mean, standard deviation, range. Numbers not affected by outliers: median, IQR.
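To see which summaries move when an outlier is removed, the tornado data can be summarized with and without the value 130. This is a sketch, not a definitive implementation: `iqr` and `summarize` are hypothetical helpers, and the IQR uses this chapter's split-at-the-median quartile rule.

```python
from statistics import mean, median, stdev


def iqr(values):
    """IQR = Q3 - Q1, with quartiles found by splitting at the median."""
    ordered = sorted(values)
    half = len(ordered) // 2
    # Odd n: the median is included in both halves.
    lower = ordered[:half + 1] if len(ordered) % 2 else ordered[:half]
    upper = ordered[half:]
    return median(upper) - median(lower)


def summarize(values):
    return {
        "mean": round(mean(values), 2),
        "sd": round(stdev(values), 2),  # sample standard deviation (n - 1)
        "range": max(values) - min(values),
        "median": median(values),
        "iqr": iqr(values),
    }


deaths = [53, 39, 39, 33, 69, 30, 25, 67, 130, 94, 40]
trimmed = [d for d in deaths if d != 130]

print("with outlier:   ", summarize(deaths))
print("without outlier:", summarize(trimmed))
```

On these data the mean, standard deviation, and range all drop sharply once 130 is removed, while the median and IQR change only slightly.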
Which summaries should you use? Five number summary: skewed data or data with outliers. Mean and standard deviation: symmetric data. ALWAYS PLOT YOUR DATA!!