Chapter 3: Numerically Summarizing Data
Sections: 3.1 Measures of Central Tendency; 3.2 Measures of Dispersion; 3.3 Measures of Central Tendency and Dispersion from Grouped Data; 3.4 Measures of Position; 3.5 The Five-Number Summary and Boxplots
Chapter 3, Section 1: Measures of Central Tendency
Analyzing populations versus analyzing samples. For populations: we know all of the data; descriptive measures of populations are called parameters; parameters are often written using Greek letters (μ). For samples: we know only part of the entire data; descriptive measures of samples are called statistics; statistics are often written using Roman letters (x̄).
The arithmetic mean of a variable is often what people mean by the “average”: add up all the values and divide by how many there are. Example: compute the arithmetic mean of 6, 1, 5. Add up the three numbers and divide by 3: (6 + 1 + 5) / 3 = 4.0. The arithmetic mean is 4.0.
The arithmetic mean is usually called the mean. For a population, the population mean is computed using all the observations in the population, is denoted μ, and is a parameter. For a sample, the sample mean is computed using only the observations in the sample, is denoted x̄, and is a statistic.
The median of a variable is the “center”: when the data is sorted in order, the median is the middle value. The calculation of the median is slightly different depending on whether there is an odd number of points or an even number of points.
To calculate the median (M) of a data set: arrange the data in order, then count the number of observations, n. If n is odd, there is a value that is exactly in the middle, and that value is the median M. If n is even, there are two values on either side of the exact middle; take their mean to be the median M.
An example with an odd number of observations (5 observations): compute the median of 6, 1, 11, 2, 11. Sort them in order: 1, 2, 6, 11, 11. The middle number is 6, so the median is 6.
An example with an even number of observations (4 observations): compute the median of 6, 1, 11, 2. Sort them in order: 1, 2, 6, 11. Take the mean of the two middle values: (2 + 6) / 2 = 4. The median is 4.
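The odd and even cases can be sketched in a few lines of Python. This is a minimal illustration rather than part of the original slides; the function name `median` is an assumption chosen for readability.

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)   # arrange the data in order
    n = len(ordered)           # count the number of observations
    mid = n // 2
    if n % 2 == 1:
        # Odd n: one value sits exactly in the middle.
        return ordered[mid]
    # Even n: take the mean of the two values on either side of the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([6, 1, 11, 2, 11]))  # the odd example: 6
print(median([6, 1, 11, 2]))      # the even example: 4.0
```

Sorting a copy with `sorted` leaves the caller's list unchanged, which is usually the safer choice for a helper like this.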
One interpretation: the median splits the data into halves. For the data 62, 68, 71, 74, 77, 82, 84, 88, 90, 94, the median is M = 79.5. There are 5 values on the left (62, 68, 71, 74, 77) and 5 values on the right (82, 84, 88, 90, 94).
The mode of a variable is the most frequently occurring value. Example: find the mode of 6, 1, 2, 6, 11, 7, 3. The distinct values are 1, 2, 3, 6, 7, 11. The value 6 occurs twice; all the other values occur only once. The mode is 6.
Qualitative data: values are one of a set of categories. We cannot add or order them, so the mean and median do not exist. The mode is the only one of these three measurements that exists. Example: find the mode of blue, blue, blue, red, green. The mode is “blue” because it is the value that occurs most often.
Quantitative data: the mode can be computed, but sometimes it is not meaningful. Sometimes each value will occur only once (which often happens with precise measurements). Example: find the mode of 5.1, 6.6, 6.8, 9.3, 1.9. Each value occurs only once, so the mode is not a meaningful measurement.
One interpretation: in primary elections, the candidate who receives the most votes is often called “the winner”. The votes (data values) are:
Candidate   Number of votes
Henry       194
Kayla       215
Jason       172
The mode is “Kayla”, so Kayla is the winner.
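The mode examples can be reproduced with a short Python sketch. The helper name `modes` is an assumption for illustration; it returns a list because several values can tie for most frequent, as happens when every value occurs once.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs most often (ties are all returned)."""
    counts = Counter(values)            # value -> number of occurrences
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]


print(modes([6, 1, 2, 6, 11, 7, 3]))
print(modes(["blue", "blue", "blue", "red", "green"]))
print(modes([5.1, 6.6, 6.8, 9.3, 1.9]))  # every value ties: not meaningful

# When the data is already tallied, the mode is the category with the
# largest count.
votes = {"Henry": 194, "Kayla": 215, "Jason": 172}
winner = max(votes, key=votes.get)
print(winner)
```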
The mean and the median are often different. This difference gives us clues about the shape of the distribution: Is it symmetric? Is it skewed left? Is it skewed right? Are there any extreme values?
Symmetric: the mean will usually be close to the median. Skewed left: the mean will usually be smaller than the median. Skewed right: the mean will usually be larger than the median.
If a distribution is symmetric, the data values above and below the mean will balance. The mean will be in the “middle” and the median will be in the “middle”. Thus the mean will be close to the median, in general, for a distribution that is symmetric.
If a distribution is skewed left, there will be some data values that are much smaller than the others. These values pull the mean down, while the median will not decrease as much. Thus the mean will be smaller than the median, in general, for a distribution that is skewed left.
If a distribution is skewed right, there will be some data values that are much larger than the others. These values pull the mean up, while the median will not increase as much. Thus the mean will be larger than the median, in general, for a distribution that is skewed right.
For a mostly symmetric distribution, the mean and the median will be roughly equal. Many variables, such as birth weights, are approximately symmetric.
What if one value is extremely different from the others (a so-called outlier)? Suppose we made a mistake and 6, 1, 2 was recorded as 6000, 1, 2. The mean is now (6000 + 1 + 2) / 3 = 2001. The median is still 2. The median is “resistant to extreme values”.
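A quick check with Python's standard `statistics` module shows the same effect. The variable names `correct` and `mistyped` are illustrative assumptions.

```python
from statistics import mean, median

correct = [6, 1, 2]
mistyped = [6000, 1, 2]   # the data-entry error from the example

for label, data in (("correct", correct), ("mistyped", mistyped)):
    print(label, "mean =", mean(data), "median =", median(data))
```

The mean jumps from 3 to 2001 while the median stays at 2, which is what "resistant" means in practice.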
Summary of Chapter 3, Section 1. Mean: the center of gravity; useful for roughly symmetric quantitative data. Median: splits the data into halves; useful for highly skewed quantitative data. Mode: the most frequent value; useful for qualitative data.
Chapter 3, Section 2: Measures of Dispersion
Learning objectives: (1) the range of a variable; (2) the variance of a variable; (3) the standard deviation of a variable; (4) use the Empirical Rule; (5) use Chebyshev’s inequality.
Comparing two sets of data: the measures of central tendency (mean, median, mode) compare the “average” or “typical” values of two sets of data. The measures of dispersion in this section compare how far “spread out” the data values are.
The range of a variable is the largest data value minus the smallest data value. Example: compute the range of 6, 1, 2, 6, 11, 7, 3, 3. The largest value is 11 and the smallest value is 1. Subtracting the two gives 11 – 1 = 10, so the range is 10.
The range only uses two values in the data set: the largest value and the smallest value. The range is not resistant. If we made a mistake and 6, 1, 2 was recorded as 6000, 1, 2, the range is now 6000 – 1 = 5999.
The variance is based on the deviations from the mean: (xᵢ – μ) for populations and (xᵢ – x̄) for samples. To treat positive and negative differences alike, we square the deviations: (xᵢ – μ)² for populations and (xᵢ – x̄)² for samples.
The population variance of a variable is the sum of these squared deviations divided by the number in the population, N: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$. The population variance is represented by σ². Note: for accuracy, use as many decimal places as allowed by your calculator.
Compute the population variance of 6, 1, 2, 11. Compute the population mean first: μ = (6 + 1 + 2 + 11) / 4 = 5. Now compute the squared deviations: (1 – 5)² = 16, (2 – 5)² = 9, (6 – 5)² = 1, (11 – 5)² = 36. Average the squared deviations: (16 + 9 + 1 + 36) / 4 = 15.5. The population variance σ² is 15.5.
The sample variance of a variable is the sum of these squared deviations divided by one less than the number in the sample, n – 1: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$. The sample variance is represented by s². We say that this statistic has n – 1 degrees of freedom.
Compute the sample variance of 6, 1, 2, 11. Compute the sample mean first: x̄ = (6 + 1 + 2 + 11) / 4 = 5. Now compute the squared deviations: (1 – 5)² = 16, (2 – 5)² = 9, (6 – 5)² = 1, (11 – 5)² = 36. Divide the sum of the squared deviations by n – 1 = 3: (16 + 9 + 1 + 36) / 3 ≈ 20.7. The sample variance s² is about 20.7.
Why are the population variance (15.5) and the sample variance (20.7) different for the same set of numbers? In the first case, { 6, 1, 2, 11 } was the entire population (divide by N). In the second case, { 6, 1, 2, 11 } was just a sample from the population (divide by n – 1). These are two different situations.
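The only difference between the two calculations is the denominator, which a short Python sketch makes explicit. The function name `variance` and its `sample` flag are illustrative assumptions, not notation from the slides.

```python
def variance(values, sample=False):
    """Population variance by default; sample variance if sample=True."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    # Population: divide by N.  Sample: divide by n - 1 (degrees of freedom).
    denominator = n - 1 if sample else n
    return sum(squared_deviations) / denominator


data = [6, 1, 2, 11]
print(variance(data))                          # population variance
print(round(variance(data, sample=True), 1))   # sample variance, rounded
```

The standard library offers the same pair as `statistics.pvariance` and `statistics.variance`; the hand-written version is shown only to keep the denominators visible.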
Why do we use different formulas? The reason is that using the sample mean is not quite as accurate as using the population mean. If we used “n” in the denominator for the sample variance calculation, we would get a “biased” result. Bias here means that we would tend to underestimate the true variance.
The standard deviation is the square root of the variance. The population standard deviation is the square root of the population variance (σ²) and is represented by σ. The sample standard deviation is the square root of the sample variance (s²) and is represented by s.
If the population is { 6, 1, 2, 11 }, the population variance is σ² = 15.5 and the population standard deviation is σ = √15.5 ≈ 3.9. If the sample is { 6, 1, 2, 11 }, the sample variance is s² ≈ 20.7 and the sample standard deviation is s = √20.7 ≈ 4.5. The population standard deviation and the sample standard deviation apply in different situations.
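Both square roots can be checked with the standard `statistics` module, whose `pstdev` and `stdev` functions compute the population and sample standard deviations. The variable names here are assumptions for illustration.

```python
from statistics import pstdev, stdev

data = [6, 1, 2, 11]
sigma = pstdev(data)   # population standard deviation: sqrt(15.5)
s = stdev(data)        # sample standard deviation: sqrt(62 / 3)
print(round(sigma, 1), round(s, 1))
```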
The standard deviation is very useful for estimating probabilities.
The empirical rule: if the distribution is roughly bell shaped, then approximately 68% of the data will lie within 1 standard deviation of the mean, approximately 95% of the data will lie within 2 standard deviations of the mean, and approximately 99.7% of the data (i.e. almost all) will lie within 3 standard deviations of the mean.
For a variable with mean 17 and standard deviation 3.4: approximately 68% of the values will lie between (17 – 3.4) and (17 + 3.4), i.e. 13.6 and 20.4; approximately 95% of the values will lie between (17 – 2 × 3.4) and (17 + 2 × 3.4), i.e. 10.2 and 23.8; approximately 99.7% of the values will lie between (17 – 3 × 3.4) and (17 + 3 × 3.4), i.e. 6.8 and 27.2. A value of 2.1 and a value of 33.2 would both be very unusual.
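The three intervals follow mechanically from the mean and standard deviation, as this Python sketch shows. The function name `empirical_rule_intervals` is an assumption, and the rule itself only applies when the distribution is roughly bell shaped.

```python
def empirical_rule_intervals(mean, sd):
    """Map k = 1, 2, 3 to (low, high, approximate percent of data)."""
    approx_percent = {1: 68.0, 2: 95.0, 3: 99.7}
    return {k: (mean - k * sd, mean + k * sd, pct)
            for k, pct in approx_percent.items()}


intervals = empirical_rule_intervals(17, 3.4)
for k, (low, high, pct) in intervals.items():
    print(f"about {pct}% between {low:.1f} and {high:.1f}")

# Values outside the 3-standard-deviation interval are very unusual.
low3, high3, _ = intervals[3]
for value in (2.1, 33.2):
    print(value, "typical" if low3 <= value <= high3 else "very unusual")
```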
Chebyshev’s inequality gives a lower bound on the percentage of observations that lie within k standard deviations of the mean (where k > 1). This lower bound is a conservative estimate: the actual percentage for any variable cannot be lower than this number. Therefore the actual percentage must be this value or higher.
Chebyshev’s inequality: for any data set, at least $\left(1 - \frac{1}{k^2}\right) \cdot 100\%$ of the observations will lie within k standard deviations of the mean, where k is any number greater than 1.
How much of the data lies within 1.5 standard deviations of the mean? From Chebyshev’s inequality, $\left(1 - \frac{1}{1.5^2}\right) \cdot 100\% \approx 55.6\%$, so at least 55.6% of the data will lie within 1.5 standard deviations of the mean.
If the mean is equal to 20 and the standard deviation is equal to 4, how much of the data lies between 14 and 26? The values 14 and 26 are each 1.5 standard deviations from 20 (6 / 4 = 1.5), so at least $\left(1 - \frac{1}{1.5^2}\right) \cdot 100\% \approx 55.6\%$ of the data will lie between 14 and 26.
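The bound is easy to evaluate in code. The following Python sketch assumes the standard form of the inequality, 1 – 1/k², and uses a hypothetical helper name, `chebyshev_lower_bound`.

```python
def chebyshev_lower_bound(k):
    """Minimum fraction of any data set within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("this form of the inequality requires k > 1")
    return 1 - 1 / k ** 2


mean, sd = 20, 4
low, high = 14, 26
k = (high - mean) / sd          # 1.5; the interval is symmetric about the mean
assert k == (mean - low) / sd
print(f"at least {100 * chebyshev_lower_bound(k):.1f}%")
```

Unlike the empirical rule, this bound makes no assumption about the shape of the distribution, which is why it is so much weaker (75% rather than about 95% for k = 2).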
Summary of Chapter 3, Section 2. Range: the maximum minus the minimum; not a resistant measurement. Variance and standard deviation: measure deviations from the mean. Empirical rule: about 68% of the data is within 1 standard deviation; about 95% of the data is within 2 standard deviations.
Chapter 3, Section 3: Measures of Central Tendency and Dispersion from Grouped Data
Learning objectives: (1) the mean from grouped data; (2) the weighted mean; (3) the variance and standard deviation for grouped data.
Data may come in groups rather than individually. The values may have been summarized in frequency distributions, such as ranges of ages (20 – 29, 30 – 39, ...) or ranges of incomes ($10,000 – $19,999, $20,000 – $39,999, $40,000 – $79,999, ...). In that case, the exact values for the mean, variance, and standard deviation cannot be calculated.
To compute the mean for grouped data, assume that, within each class, the mean of the data is equal to the class midpoint. Use the class midpoint in the formula for the mean. The number of times the class midpoint value is used is equal to the frequency of the class. If 6 values are in the interval [ 8, 10 ], then we assume that all 6 values are equal to 9 (the midpoint of [ 8, 10 ]).
As an example, for the following frequency table, we calculate the mean as if the value 1 occurred 3 times, the value 3 occurred 7 times, the value 5 occurred 6 times, and the value 7 occurred 1 time.
Class      Midpoint   Frequency
0 – 1.9    1          3
2 – 3.9    3          7
4 – 5.9    5          6
6 – 7.9    7          1
The calculation for the mean would be $\frac{1 + 1 + 1 + 3 + \cdots + 7}{17}$, with each midpoint repeated as many times as its frequency, or equivalently $\frac{(1)(3) + (3)(7) + (5)(6) + (7)(1)}{3 + 7 + 6 + 1}$.
Evaluating this formula gives 61 / 17, so the mean is about 3.6. In mathematical notation, with class midpoints xᵢ and class frequencies fᵢ, the mean is $\frac{\sum x_i f_i}{\sum f_i}$. This would be μ for the population mean and x̄ for the sample mean.
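The grouped-data mean can be sketched in Python using the midpoints 1, 3, 5, 7 and frequencies 3, 7, 6, 1 from the example. The function name `grouped_mean` is an assumption for illustration.

```python
def grouped_mean(midpoints, frequencies):
    """Approximate mean: each class midpoint counted once per observation."""
    total = sum(x * f for x, f in zip(midpoints, frequencies))
    return total / sum(frequencies)


midpoints = [1, 3, 5, 7]
frequencies = [3, 7, 6, 1]
print(round(grouped_mean(midpoints, frequencies), 1))  # 61 / 17, about 3.6
```

The result is only an approximation of the true mean, because the individual values inside each class are unknown.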
Sometimes not all data values are equally important. To compute a grade point average (GPA), a grade in a 4 credit class is worth more than a grade in a 1 credit class. The weights wᵢ quantify the relative importance of the different values. Higher weights correspond to more important values.
As an example, the following grades would yield a GPA (on a 4 point scale) of
Course              Credits   Grade
Statistics          3         A
French Literature             B
Biochemistry        5
Badminton           1         D
In mathematical notation, if wᵢ is the weight corresponding to the data value xᵢ, then the weighted mean is $\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i}$. This formula looks similar to the one for the mean for grouped data, and the concepts are similar.
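A weighted mean can be sketched in Python as below. The transcript in the example is hypothetical (it is not the table from the slides), and the mapping of letter grades to points (A = 4, B = 3, C = 2) is an assumed common 4 point scale.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical transcript: an A in a 3 credit course, a B in a 4 credit
# course, and a C in a 1 credit course.
grade_points = [4.0, 3.0, 2.0]
credits = [3, 4, 1]
print(weighted_mean(grade_points, credits))  # (12 + 12 + 2) / 8 = 3.25
```

With all weights equal, the weighted mean reduces to the ordinary arithmetic mean, and with weights equal to class frequencies it reduces to the grouped-data mean.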
To compute the variance for grouped data, assume again that, within each class, the mean of the data is equal to the class midpoint. Use the class midpoint in the formula for the variance. The number of times the class midpoint value is used is equal to the frequency of the class. If 6 values are in the interval [ 8, 10 ], then we assume that all 6 values are equal to 9 (the midpoint of [ 8, 10 ]). This is the same approach as for the mean.
As an example, for the following frequency table, we calculate the variance as if the value 1 occurred 3 times, the value 3 occurred 7 times, the value 5 occurred 6 times, and the value 7 occurred 1 time.
Class      Midpoint   Frequency
0 – 1.9    1          3
2 – 3.9    3          7
4 – 5.9    5          6
6 – 7.9    7          1
From our previous example, the mean is 3.6. Just as for the mean, the calculation for the variance would then be $\frac{(1 - 3.6)^2 (3) + (3 - 3.6)^2 (7) + (5 - 3.6)^2 (6) + (7 - 3.6)^2 (1)}{3 + 7 + 6 + 1}$.
Evaluating this formula, the variance is about 2.7. The standard deviation would be about √2.7 ≈ 1.6.
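In mathematical notation, with class midpoints xᵢ and class frequencies fᵢ, the population variance would be $\sigma^2 = \frac{\sum (x_i - \mu)^2 f_i}{\sum f_i}$ and the sample variance would be $s^2 = \frac{\sum (x_i - \bar{x})^2 f_i}{\left(\sum f_i\right) - 1}$. The standard deviations would be the corresponding square roots.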
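The grouped-data variance can be sketched in Python with the same midpoints (1, 3, 5, 7) and frequencies (3, 7, 6, 1) as the example. The function name `grouped_variance` and its `sample` flag are illustrative assumptions.

```python
from math import sqrt


def grouped_variance(midpoints, frequencies, sample=False):
    """Approximate variance from class midpoints and class frequencies."""
    n = sum(frequencies)
    mean = sum(x * f for x, f in zip(midpoints, frequencies)) / n
    # Each squared deviation is counted once per observation in the class.
    total = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, frequencies))
    return total / (n - 1) if sample else total / n


midpoints = [1, 3, 5, 7]
frequencies = [3, 7, 6, 1]
population_variance = grouped_variance(midpoints, frequencies)
sample_variance = grouped_variance(midpoints, frequencies, sample=True)
print(round(population_variance, 1), round(sqrt(population_variance), 1))
print(round(sample_variance, 1))
```

The population form reproduces the variance of about 2.7 from the example; treating the same table as a sample gives a slightly larger value because of the n – 1 denominator.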
Summary of Chapter 3, Section 3. The mean for grouped data: use the class midpoints; obtain an approximation for the mean. The variance and standard deviation for grouped data: use the class midpoints; obtain an approximation for the variance and standard deviation.
Chapter 3, Section 4: Measures of Position
Learning objectives: (1) determine and interpret z-scores; (2) determine and interpret percentiles; (3) determine and interpret quartiles; (4) check a set of data for outliers.
The mean and median describe the “center” of the data. The variance and standard deviation describe the “spread” of the data. This section discusses more precise ways to describe the relative position of a data value within the entire set of data.
The standard deviation is a measure of dispersion that is expressed in the same units as the data (remember the empirical rule). The distance of a data value from the mean, calculated as the number of standard deviations, would be a useful measurement. This distance is called the z-score.
If the mean was 20 and the standard deviation was 6: the value 26 would have a z-score of 1.0 (1.0 standard deviation higher than the mean); the value 14 would have a z-score of –1.0 (1.0 standard deviation lower than the mean); the value 17 would have a z-score of –0.5 (0.5 standard deviations lower than the mean); the value 20 would have a z-score of 0.0.
The population z-score is calculated using the population mean and population standard deviation: $z = \frac{x - \mu}{\sigma}$. The sample z-score is calculated using the sample mean and sample standard deviation: $z = \frac{x - \bar{x}}{s}$.
z-scores can be used to compare the relative positions of data values in different samples. Pat received a grade of 82 on her statistics exam, where the mean grade was 74 and the standard deviation was 12. Pat received a grade of 72 on her biology exam, where the mean grade was 65 and the standard deviation was 10. Pat received a grade of 91 on her kayaking exam, where the mean grade was 88 and the standard deviation was 6.
Statistics: grade of 82, z-score of (82 – 74) / 12 = 0.67. Biology: grade of 72, z-score of (72 – 65) / 10 = 0.70. Kayaking: grade of 91, z-score of (91 – 88) / 6 = 0.50. Biology was the highest relative grade.
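The comparison of the three exams can be reproduced with a short Python sketch. The names `z_score`, `exams`, and `best` are assumptions chosen for illustration.

```python
def z_score(value, mean, sd):
    """Distance of value from the mean, in units of standard deviations."""
    return (value - mean) / sd


# (grade, mean grade, standard deviation) for each exam
exams = {
    "Statistics": (82, 74, 12),
    "Biology": (72, 65, 10),
    "Kayaking": (91, 88, 6),
}
scores = {name: z_score(*info) for name, info in exams.items()}
for name, z in scores.items():
    print(f"{name}: z = {z:.2f}")

best = max(scores, key=scores.get)
print("highest relative grade:", best)
```

Kayaking has the highest raw grade but the lowest z-score, which is why raw grades from different exams should not be compared directly.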
The median divides the lower 50% of the data from the upper 50%. The median is the 50th percentile. If a number divides the lower 34% of the data from the upper 66%, that number is the 34th percentile.
The computation is similar to the one for the median. To calculate the kth percentile of n data values: arrange the data in ascending order; compute the index i using the formula $i = \frac{k}{100}(n + 1)$; if i is an integer, take the ith data value; if i is not an integer, take the mean of the two data values on either side of position i.
Compute the 60th percentile of 1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 34. Calculations: there are 14 numbers (n = 14); for the 60th percentile, k = 60; the index is i = (60 / 100)(14 + 1) = 9. Take the 9th value, or P60 = 23, as the 60th percentile.
Compute the 28th percentile of 1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54. Calculations: there are 14 numbers (n = 14); for the 28th percentile, k = 28; the index is i = (28 / 100)(14 + 1) = 4.2. Take the average of the 4th and 5th values, or P28 = (7 + 8) / 2 = 7.5, as the 28th percentile.
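The percentile procedure can be sketched in Python. This sketch assumes the index formula i = (k / 100)(n + 1), which is consistent with both worked examples (i = 9 for k = 60 and i = 4.2 for k = 28 when n = 14); other textbooks and software use different conventions and can give slightly different answers. The function name `percentile` and the choice to raise an error for an out-of-range index are illustrative assumptions.

```python
def percentile(values, k):
    """kth percentile (k a whole number) using the index i = (k / 100)(n + 1)."""
    ordered = sorted(values)
    n = len(ordered)
    # Integer arithmetic avoids floating point error when testing whether
    # i is a whole number: i = whole + remainder / 100.
    whole, remainder = divmod(k * (n + 1), 100)
    if remainder == 0:
        if not 1 <= whole <= n:
            raise ValueError("index falls outside the data")
        return ordered[whole - 1]          # positions are 1-based
    if not 1 <= whole < n:
        raise ValueError("index falls outside the data")
    # i is not an integer: mean of the values on either side of position i.
    return (ordered[whole - 1] + ordered[whole]) / 2


print(percentile([1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 34], 60))
print(percentile([1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54], 28))
```

For k = 50 this rule agrees with the median procedure: with 14 values the index is 7.5, so the result is the mean of the 7th and 8th values.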
The quartiles are the 25th, 50th, and 75th percentiles: Q1 = 25th percentile, Q2 = 50th percentile = median, Q3 = 75th percentile. Quartiles are the most commonly used percentiles. The 50th percentile and the second quartile Q2 are both other ways of defining the median.
Quartiles divide the data set into four equal parts. The top quarter are the values between Q3 and the maximum. The bottom quarter are the values between the minimum and Q1.
The interquartile range (IQR) is the difference between the third and first quartiles: IQR = Q3 – Q1. The IQR is a resistant measurement of dispersion.
Extreme observations in the data are referred to as outliers. Outliers should be investigated. Outliers could be chance occurrences, measurement errors, data entry errors, or sampling errors. Outliers are not necessarily invalid data.
Chapter 3 – Section 4 One way to check for outliers uses the quartiles Outliers can be detected as values that are significantly too high or too low, based on the known spread The fences used to identify outliers are Lower fence = LF = Q1 – 1.5 × IQR Upper fence = UF = Q3 + 1.5 × IQR Values less than the lower fence or more than the upper fence could be considered outliers
Chapter 3 – Section 4 Is the value 54 an outlier? 1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54 Calculations Q1 = (4 + 7) / 2 = 5.5 Q3 = (27 + 31) / 2 = 29 IQR = 29 – 5.5 = 23.5 UF = Q3 + 1.5 × IQR = 29 + 1.5 × 23.5 = 64.25 Because 54 is below the upper fence, the fence rule does not flag the value 54 as an outlier
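The fence check can be sketched as follows. The helper names `percentile` and `fences` are hypothetical, and the quartiles are assumed to come from the index i = (k / 100)(n + 1), averaging the two neighbouring values when the index is fractional, which reproduces Q1 = 5.5 and Q3 = 29 for this data set.

```python
def percentile(data, k):
    """k-th percentile using the index i = (k / 100)(n + 1) (assumed rule)."""
    values = sorted(data)
    n = len(values)
    whole, rem = divmod(k * (n + 1), 100)
    whole = int(whole)
    if whole < 1 or whole > n or (rem and whole == n):
        raise ValueError("index falls outside the data")
    if rem == 0:
        return values[whole - 1]
    return (values[whole - 1] + values[whole]) / 2


def fences(data):
    """Return (lower fence, upper fence) = (Q1 - 1.5 IQR, Q3 + 1.5 IQR)."""
    q1 = percentile(data, 25)
    q3 = percentile(data, 75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


data = [1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54]
lower, upper = fences(data)
# Values strictly outside the fences are flagged as possible outliers.
outliers = [x for x in data if x < lower or x > upper]
print(lower, upper, outliers)  # -29.75 64.25 []
```

The empty list agrees with the hand calculation: 54 lies inside the fences, so it is not flagged.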
Summary: Chapter 3 – Section 4 z-scores Measure the distance from the mean in units of standard deviations Can compare relative positions in different samples Percentiles and quartiles Divide the data so that a certain percent is lower and a certain percent is higher Outliers Extreme values of the variable Can be identified using the upper and lower fences
Chapter 3 Section 5 The Five-Number Summary and Boxplots
Chapter 3 – Section 5 Learning objectives 1. Compute the five-number summary 2. Draw and interpret boxplots
Chapter 3 – Section 5 The five-number summary is the collection of The smallest value The first quartile (Q1 or P25) The median (M or Q2 or P50) The third quartile (Q3 or P75) The largest value These five numbers give a concise description of the distribution of a variable
Chapter 3 – Section 5 The median Information about the center of the data Resistant The first quartile and the third quartile Information about the spread of the data The smallest value and the largest value Information about the tails of the data Not resistant
Chapter 3 – Section 5 Compute the five-number summary for 1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54 Calculations The minimum = 1 Q1 = P25, the index i = 3.75, Q1 = (4 + 7) / 2 = 5.5 M = Q2 = P50 = (16 + 19) / 2 = 17.5 Q3 = P75, the index i = 11.25, Q3 = (27 + 31) / 2 = 29 The maximum = 54 The five-number summary is 1, 5.5, 17.5, 29, 54
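A short sketch of the five-number summary calculation, under the same assumptions as the hand calculation: quartiles come from the index i = (k / 100)(n + 1), and a fractional index means averaging the two neighbouring values. The function names `percentile` and `five_number_summary` are hypothetical.

```python
def percentile(data, k):
    """k-th percentile using the index i = (k / 100)(n + 1) (assumed rule)."""
    values = sorted(data)
    n = len(values)
    whole, rem = divmod(k * (n + 1), 100)
    whole = int(whole)
    if whole < 1 or whole > n or (rem and whole == n):
        raise ValueError("index falls outside the data")
    if rem == 0:
        return values[whole - 1]
    return (values[whole - 1] + values[whole]) / 2


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum)."""
    values = sorted(data)
    return (
        values[0],
        percentile(values, 25),
        percentile(values, 50),
        percentile(values, 75),
        values[-1],
    )


data = [1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 54]
print(five_number_summary(data))  # (1, 5.5, 17.5, 29.0, 54)
```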
Chapter 3 – Section 5 The five-number summary can be illustrated using a graph called the boxplot In a (basic) boxplot The middle box shows Q1, Q2, and Q3 The horizontal lines (sometimes called “whiskers”) extend to the minimum and maximum
Chapter 3 – Section 5 To draw a (basic) boxplot: Calculate the five-number summary Draw a horizontal line that will cover all the data from the minimum to the maximum Draw a box with the left edge at Q1 and the right edge at Q3 Draw a line inside the box at M = Q2 Draw a horizontal line from the Q1 edge of the box to the minimum and one from the Q3 edge of the box to the maximum
Chapter 3 – Section 5 To draw a (basic) boxplot Draw the middle box Draw in the median Draw the minimum and maximum Voila!
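As a rough illustration of the drawing steps, the sketch below renders a basic boxplot as one line of text from a five-number summary. Everything about the rendering is an assumption made for illustration: the function name `text_boxplot`, the characters used, and the default width. It also assumes the maximum is larger than the minimum.

```python
def text_boxplot(summary, width=54):
    """Render (min, Q1, median, Q3, max) as a single line of characters."""
    lo, q1, med, q3, hi = summary
    scale = (width - 1) / (hi - lo)  # assumes hi > lo

    def col(x):
        # Map a data value to a column on the line.
        return round((x - lo) * scale)

    row = [" "] * width
    # Whiskers: a line covering the data from minimum to maximum.
    for i in range(col(lo), col(hi) + 1):
        row[i] = "-"
    # Box: from Q1 to Q3.
    for i in range(col(q1), col(q3) + 1):
        row[i] = "="
    row[col(lo)] = "|"
    row[col(hi)] = "|"
    row[col(q1)] = "["
    row[col(q3)] = "]"
    row[col(med)] = "M"  # median line inside the box
    return "".join(row)


line = text_boxplot((1, 5.5, 17.5, 29, 54))
print(line)
```

A plotting library would normally be used for real work; the text version only mirrors the order of the steps (box, median, then whiskers to the minimum and maximum).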
Chapter 3 – Section 5 In a more sophisticated boxplot The middle box shows Q1, Q2, and Q3 The horizontal lines (sometimes called “whiskers”) extend to the smallest and largest values that are not outliers An asterisk marks each outlier (determined by using the lower and upper fences)
Chapter 3 – Section 5 To draw this boxplot (in a slightly different way than the text) Draw the center box and mark the median, as before Compute the upper fence and the lower fence Temporarily remove the outliers as identified by the upper fence and the lower fence (but we will add them back later with asterisks) Draw the horizontal lines to the new minimum and new maximum Mark each of the outliers with an asterisk
Chapter 3 – Section 5 To draw this boxplot Draw the middle box and the median Draw in the fences, remove the outliers (temporarily) Draw the new minimum and maximum Draw the outliers as asterisks
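The pieces needed for this kind of boxplot can be computed as in the sketch below. The function names and the data set are hypothetical: the data are the earlier example with the largest value changed to 70 so that one value falls beyond the upper fence. Quartiles are assumed to use the index i = (k / 100)(n + 1).

```python
def percentile(data, k):
    """k-th percentile using the index i = (k / 100)(n + 1) (assumed rule)."""
    values = sorted(data)
    n = len(values)
    whole, rem = divmod(k * (n + 1), 100)
    whole = int(whole)
    if whole < 1 or whole > n or (rem and whole == n):
        raise ValueError("index falls outside the data")
    if rem == 0:
        return values[whole - 1]
    return (values[whole - 1] + values[whole]) / 2


def boxplot_parts(data):
    """Box, whisker ends, and outliers for a boxplot with fences."""
    values = sorted(data)
    q1, median, q3 = (percentile(values, k) for k in (25, 50, 75))
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Temporarily set the outliers aside; whiskers use what remains.
    inside = [x for x in values if lower_fence <= x <= upper_fence]
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return {
        "box": (q1, median, q3),
        "whiskers": (inside[0], inside[-1]),  # new minimum and maximum
        "outliers": outliers,                 # drawn as asterisks
    }


# Hypothetical data: the earlier example with 54 replaced by 70.
sample = [1, 3, 4, 7, 8, 15, 16, 19, 23, 24, 27, 31, 33, 70]
parts = boxplot_parts(sample)
print(parts)
```

Here the upper fence is 64.25, so 70 is set aside as an outlier and the right whisker stops at 33.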
Chapter 3 – Section 5 The distribution shape and boxplot are related Symmetry (or lack of symmetry) Quartiles Maximum and minimum Relate the distribution shape to the boxplot for Symmetric distributions Skewed left distributions Skewed right distributions
Chapter 3 – Section 5 Symmetric distributions Q1 is equally far from the median as Q3 is The median line is in the center of the box The min is equally far from the median as the max is The left whisker is equal to the right whisker
Chapter 3 – Section 5 Skewed left distributions Q1 is further from the median than Q3 is The median line is to the right of center in the box The min is further from the median than the max is The left whisker is longer than the right whisker
Chapter 3 – Section 5 Skewed right distributions Q1 is closer to the median than Q3 is The median line is to the left of center in the box The min is closer to the median than the max is The left whisker is shorter than the right whisker
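The shape rules for symmetric, skewed left, and skewed right distributions can be turned into a rough check on a five-number summary. This is only a sketch: the function names are hypothetical, the whisker comparison uses whisker lengths (Q1 minus min versus max minus Q3), and exact equality is used for "symmetric", which real data will rarely satisfy.

```python
def compare(left, right):
    """Classify by which side is longer."""
    if left > right:
        return "skewed left"
    if left < right:
        return "skewed right"
    return "symmetric"


def shape_hints(minimum, q1, median, q3, maximum):
    """Apply the box rule and the whisker rule separately."""
    return {
        # Position of the median line inside the box.
        "box": compare(median - q1, q3 - median),
        # Length of the left whisker versus the right whisker.
        "whiskers": compare(q1 - minimum, maximum - q3),
    }


# Five-number summary of the chapter's example data.
print(shape_hints(1, 5.5, 17.5, 29, 54))
# {'box': 'skewed left', 'whiskers': 'skewed right'}
```

For the example data the two rules disagree: the median sits slightly right of center in the box, but the right whisker is much longer. The rules are visual guides and can give mixed signals on small data sets.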
Chapter 3 – Section 5 We can compare two distributions by examining their boxplots We draw the boxplots on the same horizontal scale We can visually compare the centers We can visually compare the spreads We can visually compare the extremes
Chapter 3 – Section 5 Comparing the “flight” with the “control” samples Center Spread
Summary: Chapter 3 – Section 5 5-number summary Minimum, first quartile, median, third quartile, maximum Resistant measures of center (median) and spread (interquartile range) Boxplots Visual representation of the 5-number summary Related to the shape of the distribution Can be used to compare multiple distributions
Chapter 3 Summary Numeric summaries of data Means, medians, modes Ranges, variances, standard deviations, IQRs Calculations for grouped data Measures of relative position z-scores Percentiles and quartiles Exploratory data analysis Five-number summaries Boxplots