Introduction to Summary Statistics
Introduction to Engineering Design
Unit 3 – Measurement and Statistics

Statistics: the collection, evaluation, and interpretation of data.
Statistical analysis of measurements can help verify the quality of a design or process.

Central Tendency: the “center” of a distribution (mean, median, mode)
Variation: the spread of values around the center (range, standard deviation, interquartile range)
Distribution: a summary of the frequency of values (frequency tables, histograms, normal distribution)
[click] The average value of a variable (like rainfall depth) is one type of statistic that indicates central tendency; it gives an indication of the center of the data. The median and mode are two other indications of central tendency. But often we need more details on how much a quantity can vary. [click] Statistical dispersion is the variability or spread of data. The range, standard deviation, and interquartile range are indications of dispersion. [click] Even more detail about a variable can be shown by a frequency distribution, which summarizes how the data values are distributed throughout the range of values. Frequency distributions can be shown by frequency tables, histograms, box plots, or other summaries.

Central Tendency – Mean
The mean is the sum of the values of a set of data divided by the number of values in that data set.
$\mu = \frac{\sum x_i}{N}$
The mean is the most frequently used measure of central tendency. It is strongly influenced by outliers, which are very large or very small data values that do not seem to fit with the majority of the data.

Central Tendency – Mean
$\mu = \frac{\sum x_i}{N}$
$\mu$ = mean value
$x_i$ = individual data value
$\sum x_i$ = summation of all data values
$N$ = number of data values in the data set

Central Tendency – Mean
Data Set: 3 7 12 17 21 21 23 27 32 36 44
Sum of the values = 243
Number of values = 11
Mean = $\mu = \frac{\sum x_i}{N} = \frac{243}{11} \approx 22.09$

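The calculation on this slide can be sketched in a few lines of Python. This is only an illustration; the function name `mean` and the variable names are assumptions made for the example, not part of the lesson.

```python
# Data set from the slide.
data = [3, 7, 12, 17, 21, 21, 23, 27, 32, 36, 44]

def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

total = sum(data)   # sum of the values
count = len(data)   # number of values, N
mu = mean(data)     # unrounded mean

print(total, count, mu)
```

The unrounded result is kept in `mu` so that later calculations (such as the standard deviation) do not inherit rounding error.
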
A Note about Rounding in Statistics
General rule: Don’t round until the final answer.
If you are writing intermediate results you may round values, but keep the unrounded number in memory.
Mean: round to one more decimal place than the original data.
Standard deviation: round to one more decimal place than the original data.
Ref: Elementary Statistics, 7th edition, by Bluman, McGraw-Hill, 2009.

Mean – Rounding
Data Set: 3 7 12 17 21 21 23 27 32 36 44
Sum of the values = 243
Number of values = 11
Mean = $\mu = \frac{\sum x_i}{N} = \frac{243}{11} = 22.\overline{09}$
Reported: Mean = 22.1
In this case, the result of the mean calculation is 22.09 with the 09 repeating. Notice that the bar above the 09 indicates a repeating decimal. Keep this number saved in your calculator if you will need it for future calculations (such as the standard deviation, which we will present later). But report the mean to one more decimal place than the original data. Since the original data is reported in whole numbers, report the mean to one decimal place, or 22.1.

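As a minimal sketch of the rounding rule, the snippet below keeps the unrounded mean for later use and rounds only the reported value. The names `data_decimals` and `reported_mean` are hypothetical, chosen for this example.

```python
data = [3, 7, 12, 17, 21, 21, 23, 27, 32, 36, 44]

# Unrounded mean: keep this value for any later calculation.
mu = sum(data) / len(data)

# The original data are whole numbers (0 decimal places),
# so the reported mean gets one more decimal place.
data_decimals = 0
reported_mean = round(mu, data_decimals + 1)

print("stored:  ", mu)
print("reported:", reported_mean)  # 22.1
```
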
Central Tendency – Mode
Measure of central tendency.
The most frequently occurring value in a set of data is the mode.
Symbol is M.
Data Set: 27 17 12 7 21 44 23 3 36 32 21

Central Tendency – Mode
The most frequently occurring value in a set of data is the mode.
Data Set: 3 7 12 17 21 21 23 27 32 36 44
Mode = M = 21
There are two occurrences of 21. [click] Every other data value has only one occurrence. Therefore, 21 is the mode. [click]

Central Tendency – Mode
The most frequently occurring value in a set of data is the mode.
Bimodal data set: two numbers of equal (highest) frequency stand out.
Multimodal data set: more than two numbers of equal (highest) frequency stand out.

Central Tendency – Mode
Determine the mode of 48, 63, 62, 49, 58, 2, 63, 5, 60, 59, 55
Mode = 63
Determine the mode of 48, 63, 62, 59, 58, 2, 63, 5, 60, 59, 55
Mode = 63 & 59 (bimodal)
Determine the mode of 48, 63, 62, 59, 48, 2, 63, 5, 60, 59, 55
Mode = 63, 59, & 48 (multimodal)

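The three examples above can be checked with a short Python sketch. The helper `modes` is a hypothetical name; it returns every value that shares the highest frequency, so it covers the bimodal and multimodal cases as well. Note that under this simple definition a data set in which every value occurs once would return all of its values.

```python
from collections import Counter

def modes(values):
    """Return every value that shares the highest frequency, in sorted order."""
    counts = Counter(values)
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([48, 63, 62, 49, 58, 2, 63, 5, 60, 59, 55]))  # [63]
print(modes([48, 63, 62, 59, 58, 2, 63, 5, 60, 59, 55]))  # [59, 63]
print(modes([48, 63, 62, 59, 48, 2, 63, 5, 60, 59, 55]))  # [48, 59, 63]
```
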
Central Tendency – Median
Measure of central tendency.
The median is the value that occurs in the middle of a set of data that has been arranged in numerical order.
Symbol is $\tilde{x}$, pronounced “x-tilde”.
The median divides the data into two sets which contain an equal number of data values.

Central Tendency – Median
The median is the value that occurs in the middle of a set of data that has been arranged in numerical order.
Data Set: 27 17 12 7 21 44 23 3 36 32 21

Central Tendency – Median
A data set that contains an odd number of values always has a median that is one of its values: the middle value of the ordered data.
Data Set: 3 7 12 17 21 21 23 27 32 36 44
The middle (sixth) value is 21, so the median is 21.

Central Tendency – Median
For a data set that contains an even number of values, the two middle values are averaged, and the result is the median.
Data Set: 3 7 12 17 21 21 23 27 31 32 36 44
The two middle values are 21 and 23, so the median is 22.

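The odd and even cases can be sketched together in Python. The function name `median` and the variable names are assumptions made for this illustration.

```python
def median(values):
    """Middle value of the ordered data; average of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_set = [27, 17, 12, 7, 21, 44, 23, 3, 36, 32, 21]        # 11 values
even_set = [3, 7, 12, 17, 21, 21, 23, 27, 31, 32, 36, 44]   # 12 values

print(median(odd_set))   # 21
print(median(even_set))  # 22.0
```

Sorting inside the function means the data do not need to be arranged in numerical order beforehand.
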
Variation – Range
Measure of data variation.
The range is the difference between the largest and smallest values that occur in a set of data.
Symbol is R.
Data Set: 3 7 12 17 21 21 23 27 32 36 44
Range = R = 44 - 3 = 41

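In code, the range is a single subtraction. A minimal Python sketch, with `data_range` as an illustrative variable name:

```python
data = [3, 7, 12, 17, 21, 21, 23, 27, 32, 36, 44]

# Range: largest value minus smallest value.
data_range = max(data) - min(data)
print(data_range)  # 41
```
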
Variation – Standard Deviation
Measure of data variation.
The standard deviation is a measure of the spread of data values.
A larger standard deviation indicates a wider spread in data values.

Variation – Standard Deviation
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
$\sigma$ = standard deviation
$x_i$ = individual data value ($x_1, x_2, x_3, \ldots$)
$\mu$ = mean
$N$ = size of population

Variation – Standard Deviation
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
Procedure:
1. Calculate the mean, $\mu$.
2. Subtract the mean from each value and then square each difference.
3. Sum all squared differences.
4. Divide the summation by the size of the population (number of data values), $N$.
5. Calculate the square root of the result.
Note that this is the formula for the population standard deviation, which statisticians distinguish from the sample standard deviation. This formula provides the standard deviation of the data set used in the calculation. We will later differentiate between the population standard deviation and the sample standard deviation.

A Note about Rounding in Statistics, Again
General rule: Don’t round until the final answer.
If you are writing intermediate results you may round values, but keep the unrounded number in memory.
Standard deviation: round to one more decimal place than the original data.
Remember, don’t round intermediate calculations. But when you find the standard deviation, round the reported value to one more decimal place than the original data.

Variation – Standard Deviation
Calculate the standard deviation for the data array 2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63
$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
1. Calculate the mean.
$\mu = \frac{\sum x_i}{N} = \frac{524}{11} = 47.\overline{63}$
2. Subtract the mean from each data value and square each difference, $(x_i - \mu)^2$.
$(2 - 47.\overline{63})^2 = 2082.6777$
$(5 - 47.\overline{63})^2 = 1817.8595$
$(48 - 47.\overline{63})^2 = 0.1322$
$(49 - 47.\overline{63})^2 = 1.8595$
$(55 - 47.\overline{63})^2 = 54.2231$
$(58 - 47.\overline{63})^2 = 107.4050$
$(59 - 47.\overline{63})^2 = 129.1322$
$(60 - 47.\overline{63})^2 = 152.8595$
$(62 - 47.\overline{63})^2 = 206.3140$
$(63 - 47.\overline{63})^2 = 236.0413$ (counted twice, since 63 occurs twice in the data)
Since we are given a finite data set, and are not told otherwise, we assume we have a data value for each member of the entire population. We use the POPULATION standard deviation. Find the mean of the data. [3 clicks] NOTE that if we were asked for the mean, we would report the mean to be 47.6. However, we will use the unrounded value stored in the calculator when calculating the standard deviation. Find the difference between the mean and each data value and square each difference. [many clicks] Note that these values are calculated using the unrounded mean, but are shown rounded (here to four decimal places) since it is unwieldy to report a large number of decimal places. It is preferable to save each UNROUNDED squared difference in your calculator so that we can add them together.

Variation – Standard Deviation
3. Sum all squared differences.
$\sum (x_i - \mu)^2$ = 2082.6777 + 1817.8595 + 0.1322 + 1.8595 + 54.2231 + 107.4050 + 129.1322 + 152.8595 + 206.3140 + 236.0413 + 236.0413 = 5,024.5455
Note that this is the sum of the unrounded squared differences.
4. Divide the summation by the number of data values.
$\frac{\sum (x_i - \mu)^2}{N} = \frac{5024.5455}{11} = 456.7769$
5. Calculate the square root of the result.
$\sqrt{\frac{\sum (x_i - \mu)^2}{N}} = \sqrt{456.7769} \approx 21.4$
3. Sum the squared differences. [click] Note that the sum shown is calculated using the unrounded values (and so the total is slightly different from the total you get when adding the rounded numbers). If you add the squared differences rounded to two decimal places instead, the sum is nearly the same: 5024.53. 4. Divide the sum by the number of data values. [click] Remember to use the unrounded result from the previous step (even though it does not matter in this case). 5. Take the square root of the quotient. [click] Remember to use the unrounded result from the previous step. Report the standard deviation to one more decimal place than the original data.

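The five steps of the population standard deviation procedure can be sketched in Python as follows. Variable names such as `squared_diffs` and `sum_sq` are illustrative assumptions; values are rounded only when printed, in keeping with the rounding rule.

```python
import math

data = [2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63]

# Step 1: calculate the mean (kept unrounded).
n_values = len(data)
mu = sum(data) / n_values

# Step 2: subtract the mean from each value and square each difference.
squared_diffs = [(x - mu) ** 2 for x in data]

# Step 3: sum all squared differences.
sum_sq = sum(squared_diffs)

# Step 4: divide by the number of data values, N.
variance = sum_sq / n_values

# Step 5: take the square root.
sigma = math.sqrt(variance)

# Round only for reporting.
print(round(sum_sq, 4))    # 5024.5455
print(round(variance, 4))  # 456.7769
print(round(sigma, 1))     # 21.4
```
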
A Note about Standard Deviation
Two distinct calculations:
Population standard deviation: the measure of the spread of data within a population. Used when you have a data value for every member of the entire population of interest.
Sample standard deviation: an estimate of the spread of data within a larger population. Used when you do not have a data value for every member of the entire population of interest. Uses a subset (sample) of the data to generalize the results to the larger population.
[click] We have just calculated the population standard deviation of a data set. For that calculation, we were calculating the standard deviation of the data set only; the data set was the entire population. [click] If you are given data for only a portion of the population of interest and would like to estimate the standard deviation for the larger population, you would use the sample standard deviation formula.

A Note about Standard Deviation
Population standard deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
$\sigma$ = population standard deviation
$x_i$ = individual data value ($x_1, x_2, x_3, \ldots$)
$\mu$ = population mean
$N$ = size of population
Sample standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
$s$ = sample standard deviation
$x_i$ = individual data value ($x_1, x_2, x_3, \ldots$)
$\bar{x}$ = sample mean
$n$ = size of sample
[click] So, for instance, if you are asked to measure the height of each student in your class, and then are asked to find the standard deviation of those heights, you would use the POPULATION standard deviation. You would have a data value for every member of the population: the students in your class. We call it the POPULATION standard deviation because the value is based on the entire population. [click] However, if you are asked to estimate the spread of heights of all of the high school students in your county (and you believe that your class provides a good representation on which to base that estimate), you would use the SAMPLE standard deviation. In this case, you would have only a sample (subset) of the heights of the entire population, since your class is a subset of the county high school population. We call this the SAMPLE standard deviation because the value is based on a sample of the entire population. [click] Notice that the main difference between the two formulas is the denominator. The population standard deviation uses N, the population size. The sample standard deviation uses n - 1, which is one less than the size of the sample used in the calculation.

Variation – Sample Standard Deviation
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Procedure:
1. Calculate the sample mean, $\bar{x}$.
2. Subtract the mean from each value and then square each difference.
3. Sum all squared differences.
4. Divide the summation by the number of data values minus one, $n - 1$.
5. Calculate the square root of the result.
So, the procedure to find the sample standard deviation is basically the same procedure we used to find the population standard deviation. Notice that here we use the sample mean and the n - 1 denominator.

Central Tendency – Sample Mean
$\bar{x} = \frac{\sum x_i}{n}$
Essentially the same calculation as the population mean.
$\bar{x}$ = sample mean
$x_i$ = individual data value
$\sum x_i$ = summation of all data values
$n$ = number of data values in the sample
The sample mean is simply the mean of the data values in the sample. This calculation is essentially no different from the mean calculation presented earlier. We are just using a different variable to represent the number of data values. [click] Here we use lower case n to represent the number of data values in the sample, as opposed to upper case N, which represents the size of the larger population. A sample is normally smaller than the population that it represents; when you have all of the data values for a population (n = N), you are really calculating the population mean.

Variation – Sample Standard Deviation
Estimate the standard deviation for a population for which the following data is a sample.
2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
1. Calculate the sample mean.
$\bar{x} = \frac{\sum x_i}{n} = \frac{524}{11} = 47.\overline{63}$
2. Subtract the sample mean from each data value and square the difference, $(x_i - \bar{x})^2$.
$(2 - 47.\overline{63})^2 = 2082.6777$
$(5 - 47.\overline{63})^2 = 1817.8595$
$(48 - 47.\overline{63})^2 = 0.1322$
$(49 - 47.\overline{63})^2 = 1.8595$
$(55 - 47.\overline{63})^2 = 54.2231$
$(58 - 47.\overline{63})^2 = 107.4050$
$(59 - 47.\overline{63})^2 = 129.1322$
$(60 - 47.\overline{63})^2 = 152.8595$
$(62 - 47.\overline{63})^2 = 206.3140$
$(63 - 47.\overline{63})^2 = 236.0413$ (counted twice, since 63 occurs twice in the data)
Since we are told that the data is a sample of a larger population, we use the SAMPLE standard deviation. Find the sample mean of the data. [3 clicks] NOTE that if we were asked for the mean, we would report the mean to be 47.6. However, we will use the unrounded value stored in the calculator when calculating the standard deviation. Find the difference between the mean and each data value and square each difference. [many clicks] These first two steps give the same results as in the population standard deviation calculation. Note that these values are calculated using the unrounded mean, but are shown rounded (here to four decimal places) since it is unwieldy to report a large number of decimal places. It is preferable to save each UNROUNDED squared difference in your calculator so that we can add them together.

Variation – Sample Standard Deviation
3. Sum all squared differences.
$\sum (x_i - \bar{x})^2$ = 2082.6777 + 1817.8595 + 0.1322 + 1.8595 + 54.2231 + 107.4050 + 129.1322 + 152.8595 + 206.3140 + 236.0413 + 236.0413 = 5,024.5455
4. Divide the summation by the number of sample data values minus one.
$\frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{5024.5455}{10} = 502.4545$
5. Calculate the square root of the result.
$\sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{502.4545} \approx 22.4$
3. Sum the squared differences. [click] Again, this step yields the same result as the corresponding step in the population standard deviation calculation. Remember that the sum shown reflects the sum of the unrounded squared differences, which is slightly different from the sum you would get if you added the rounded values shown. 4. Divide the sum by the sample size minus one. [click] Here we use n - 1 = 10 in the denominator. 5. Take the square root of the quotient. [click] So the standard deviation of the larger population estimated by the sample standard deviation formula is 22.4. This is slightly larger than the population standard deviation of the data set itself (which was 21.4).

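A short Python sketch can make the effect of the n - 1 denominator concrete by computing both results from the same data. The variable names are illustrative; the cross-check uses `statistics.stdev` (sample) and `statistics.pstdev` (population) from the Python standard library.

```python
import math
import statistics

sample = [2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63]

n = len(sample)
x_bar = sum(sample) / n                          # sample mean, unrounded
sum_sq = sum((x - x_bar) ** 2 for x in sample)   # sum of squared differences

s = math.sqrt(sum_sq / (n - 1))   # sample standard deviation (divide by n - 1)
sigma = math.sqrt(sum_sq / n)     # population formula on the same data (divide by n)

print(round(s, 1))      # 22.4
print(round(sigma, 1))  # 21.4

# Cross-check against the standard library implementations.
assert math.isclose(s, statistics.stdev(sample))
assert math.isclose(sigma, statistics.pstdev(sample))
```

Only the divisor differs between the two lines that compute `s` and `sigma`; everything before that point is shared.
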
A Note about Standard Deviation
Population standard deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
Sample standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
As $n \to N$, $s \to \sigma$
The two different formulas for standard deviation can be confusing and are often misapplied. However, it is important to note that if you compare the population standard deviation to the estimated standard deviation of that same population provided by a sample standard deviation, the sample standard deviation generally tends to get closer and closer to the population standard deviation as the sample size increases. [click] In mathematical terms we say, as the sample size approaches the population size (n → N), the sample standard deviation approaches the population standard deviation (s → σ). We use the arrow to represent “approaches”. In other words, the larger the sample, the better the estimate of the population standard deviation.

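One part of this behavior can be shown directly. When both formulas are applied to the same n data values, they differ only by a factor that depends on n. The worked equations below are a sketch of that divisor effect only; they do not account for how well a particular sample represents the population.

```latex
\[
s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}
  = \sqrt{\frac{n}{n - 1}} \cdot \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}
\]
% For the 11-value example: the divide-by-n result was about 21.37.
\[
n = 11: \quad \sqrt{\tfrac{11}{10}} \approx 1.0488,
\qquad 1.0488 \times 21.37 \approx 22.4
\]
% The factor shrinks toward 1 as the number of values grows.
\[
\sqrt{\frac{n}{n - 1}} \to 1 \quad \text{as } n \to \infty
\]
```

For n = 11 the factor is about 1.05, which accounts for the difference between 21.4 and 22.4 in the worked example; for n = 1000 it is about 1.0005.
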
A Note about Standard Deviation
Population standard deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
Sample standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Given the ACT score of every student in your class, use the population standard deviation formula to find the standard deviation of ACT scores in the class.
So, for example, suppose you are interested in the spread of the ACT scores of the students in your class. If you have the ACT score of every student in your class, you would use the population standard deviation formula to determine the standard deviation. [click]

A Note about Standard Deviation
Population standard deviation: $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$
Sample standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
Given the ACT scores of every student in your class, use the sample standard deviation formula to estimate the standard deviation of the ACT scores of all students at your school.
However, if you wanted to estimate the spread of ACT scores of all of the students in your high school using only the scores of students from your class (and you felt that your class is a good representation of the students that attend your school), you would use the sample standard deviation formula to estimate the standard deviation of the larger population, and you might get fairly close to the actual standard deviation for the entire school. [click]

Distribution – Histogram
A histogram is a common data distribution chart that is used to show the frequency with which specific values, or values within ranges, occur in a set of data.
An engineer might use a histogram to show the variation of a dimension that exists among a group of parts that are intended to be identical.

Distribution – Histogram
Large sets of data are often divided into a limited number of groups. These groups are called class intervals.
Class intervals: -16 to -6, -5 to 5, 6 to 16

Distribution – Histogram
The number of data elements in each class interval is shown by the frequency, which is indicated along the y-axis of the graph.
[Histogram: frequency (y-axis) versus class intervals -16 to -6, -5 to 5, 6 to 16 (x-axis)]

Distribution – Histogram
Example
Data: 1, 7, 15, 4, 8, 8, 5, 12, 10
Ordered: 1, 4, 5, 7, 8, 8, 10, 12, 15
Class intervals: 1 to 5 (0.5 < x ≤ 5.5), 6 to 10 (5.5 < x ≤ 10.5), 11 to 15 (10.5 < x ≤ 15.5)
[Histogram axes: frequency (y-axis, 1 to 4) versus class intervals (x-axis, boundaries at 0.5, 5.5, 10.5, 15.5)]
Let’s create a histogram to represent this data set. [click] For this example, we will break the data into ranges of 1 to 5, 6 to 10, and 11 to 15. So, we will place labels on the x-axis to indicate these ranges. [click] Again, these ranges of values are referred to as class intervals. Note that the class intervals should include all the values along the x-axis. Therefore, the class intervals are, technically, 0.5 < x ≤ 5.5 [click], 5.5 < x ≤ 10.5 [click], and 10.5 < x ≤ 15.5 [click]. But, for simplicity, we indicate only whole numbers since all of the data values fall within the intervals shown. Let’s reorder the data to make it easier to divide into the ranges. [click] [click]

Distribution – Histogram
The height of each bar in the chart indicates the number of data elements, or frequency of occurrence, within each range.
Data: 1, 4, 5, 7, 8, 8, 10, 12, 15
[Histogram: frequency (y-axis, 1 to 4) versus class intervals 1 to 5, 6 to 10, 11 to 15 (x-axis)]
The height of each bar in a histogram indicates the number of data elements, or frequency of occurrence, within each range. Now, looking at the data, you can see there are three data values in the set that are in the range 1 to 5. [click] There are four data values in the range 6 to 10. [click] And there are two data values in the range 11 to 15. [click] Note that you can always determine the number of data points in a data set from the histogram by adding the frequencies from all of the data ranges. In this case we add three, four, and two to get a total of nine data points.

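The counting behind the bar heights can be sketched in Python. The class interval boundaries are taken from the example; the list name `intervals` and the text rendering of the bars are assumptions made for illustration.

```python
data = [1, 7, 15, 4, 8, 8, 5, 12, 10]

# Class intervals as (lower, upper] boundaries: lower < x <= upper.
intervals = [(0.5, 5.5), (5.5, 10.5), (10.5, 15.5)]

# Frequency = number of data values that fall inside each interval.
frequencies = [
    sum(1 for x in data if low < x <= high)
    for low, high in intervals
]
print(frequencies)  # [3, 4, 2]

# A rough text histogram: one '#' per data value in the interval.
for (low, high), freq in zip(intervals, frequencies):
    print(f"{low:>4} < x <= {high:<4} | " + "#" * freq)
```

Adding the frequencies recovers the number of data points, as noted in the slide.
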
Distribution – Histogram
[Histogram of part lengths, with callouts for the class intervals 0.7455 < x ≤ 0.7465 and 0.7495 < x ≤ 0.7505]
MINIMUM = 0.745 in.
MAXIMUM = 0.760 in.
This histogram represents the lengths of twenty-seven nearly identical parts. The minimum measurement was 0.745 in. [click] and the maximum length measurement was 0.760 in. Each data value between the minimum and maximum falls within one of the class intervals on the horizontal (x-) axis. Note that in this case, each value indicated on the horizontal axis actually represents a class interval, that is, a range of values. For instance, 0.745 represents the values 0.7445 < x ≤ 0.7455. In general, the interval endpoints should have one more decimal place than the data values. Due to the precision of the measurement device, the data are recorded to the thousandth (e.g. 0.745) and will therefore correspond to one of the values shown on the axis. [click] The bars are drawn to show the number of parts that have a length within each interval, which is the frequency of occurrence. For instance, of the twenty-seven parts, two have a length of 0.746 in. (and therefore fall within the 0.7455 < x ≤ 0.7465 class interval). [click] And four have a measured length of 0.750 in. (and fall within the 0.7495 < x ≤ 0.7505 class interval). [click]

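The rule that interval endpoints carry one more decimal place than the data can be sketched with Python's `decimal` module, which avoids binary floating-point surprises at the boundaries. The function name `class_interval` is hypothetical, and it assumes each measurement is supplied as a decimal string exactly as recorded.

```python
from decimal import Decimal

def class_interval(recorded):
    """Return the (lower, upper] boundaries for a value given as a decimal string."""
    value = Decimal(recorded)
    # One unit in the last recorded decimal place, e.g. 0.001 for "0.745".
    step = Decimal(1).scaleb(value.as_tuple().exponent)
    half_step = step / 2
    return value - half_step, value + half_step

low, high = class_interval("0.745")
print(low, high)  # 0.7445 0.7455

low, high = class_interval("0.750")
print(low, high)  # 0.7495 0.7505
```
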
Distribution – Dot Plot
Data: 3 -1 -3 3 2 1 -1 -1 2 1 1 -1 -2 1 2 1 -2 -4
[Dot plot: one dot per data point above its value; x-axis from -6 to 6]
Another way to represent data is a dot plot. A dot plot is similar to a histogram in that it shows the frequency of occurrence of data values. To represent the data in the table, we place a dot for each data point directly above the data value. [click] So there is one dot for each data point. [click many times]

Distribution – Dot Plot
Data: 3 -1 -3 3 2 1 -1 -1 2 1 1 -1 -2 1 2 1 -2 -4
[Histogram: frequency (y-axis) versus data value (x-axis from -6 to 6)]
Dot plots are easily changed to histograms by replacing the dots with bars of the appropriate height indicating the frequency. [click]

Distribution – Normal Distribution
“Is the data distribution normal?”
Translation: Is the histogram/dot plot bell-shaped?
Does the greatest frequency of the data values occur at about the mean value?
Does the curve decrease on both sides away from the mean?
Is the curve symmetric about the mean?

Distribution – Normal Distribution
Bell-shaped curve
[Figure: frequency (y-axis) versus data elements (x-axis from -6 to 6)]
In this example the data values are fairly evenly distributed about the mean. Approximately half of the values that are not mean values are less than the mean and approximately half are greater than the mean. And the frequency of occurrence decreases as the value of the data point moves farther away from the mean. The data appears to form a bell-shaped curve. [click] This data set looks to be normally distributed.

Distribution – Normal Distribution
Does the greatest frequency of the data values occur at about the mean value?
[Figure: frequency (y-axis) versus data elements (x-axis from -6 to 6), with the mean value marked]
The highest frequency of values in this example occurs at zero, which is the mean value of the data set.

Distribution – Normal Distribution
Does the curve decrease on both sides away from the mean?
[Figure: frequency (y-axis) versus data elements (x-axis from -6 to 6), with the mean value marked]

Distribution – Normal Distribution
Is the curve symmetric about the mean?
[Figure: frequency (y-axis) versus data elements (x-axis from -6 to 6), with the mean value marked]

What if things are not equal?
Histogram interpretation: skewed right (non-normal)
Although a normal distribution is the most common probability distribution in statistics and science, some quantities are not normally distributed. A visual analysis can help you decide if a normal distribution is a good representation for your data (although mathematical tests are usually necessary). If the data is skewed, you should not assume a normal distribution of data. This histogram shows that the data is skewed to the right; that is, there is a longer “tail” to the right.

Distribution – Normal Distribution
If the data are normally distributed:
68% of the observations fall within 1 standard deviation of the mean.
95% of the observations fall within 2 standard deviations of the mean.
99.7% of the observations fall within 3 standard deviations of the mean.
Many quantities tend to follow a normal distribution: heights of people, test scores, errors in measurement, etc. Given normally distributed data, 68% of the data values should fall within 1 standard deviation of the mean, 95% should fall within 2 standard deviations of the mean, and 99.7% should fall within 3 standard deviations of the mean. This is referred to as the Empirical Rule. Of course, with small samples/populations, these percentages may not hold exactly true because the number of values will not allow divisions to this precision.

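One way to see how closely a data set follows these percentages is to count the values within 1, 2, and 3 standard deviations of the mean. The sketch below does this for the 11-value data array from the standard deviation example, treating it as a population; the helper name `fraction_within` is an assumption for illustration. That small data set has two values far below the rest, so it is not expected to match 68%, 95%, and 99.7%.

```python
import statistics

def fraction_within(values, k):
    """Fraction of values within k population standard deviations of the mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    inside = [x for x in values if abs(x - mu) <= k * sigma]
    return len(inside) / len(values)

data = [2, 5, 48, 49, 55, 58, 59, 60, 62, 63, 63]

for k in (1, 2, 3):
    print(k, round(fraction_within(data, k), 3))
# 1 0.818   (9 of 11 values)
# 2 0.909   (10 of 11 values)
# 3 1.0     (all 11 values)
```

With only 11 values, each value changes the fraction by about 9 percentage points, which illustrates why the Empirical Rule percentages cannot hold exactly for small data sets.
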
Normal Distribution Example
Data from a sample of a larger population
Mean = $\bar{x}$ = 0.08
Standard deviation = $s$ = 1.77 (sample)
Let’s assume that this data was gathered from a sample taken from a larger population. Assume that we are interested in finding statistics for the larger population. The sample data values are fairly evenly distributed about the mean. Approximately half of the values that are not mean values are less than the mean and approximately half are greater than the mean. And the frequency of occurrence decreases as the value of the data point moves farther away from the mean. The data appears to form a bell-shaped curve. [click] This data set looks to be normally distributed. The mean is 0.08. The sample standard deviation formula is used to estimate the standard deviation of the larger population and is found to be 1.77.

Distribution – Normal Distribution
68% of the data: within one standard deviation of the mean
$\bar{x} - s$ = 0.08 - 1.77 = -1.69
$\bar{x} + s$ = 0.08 + 1.77 = 1.85
[Figure: frequency versus data elements, with $\bar{x}$ = 0.08 marked and a band from -1.69 to 1.85]
Since the data appears to be normally distributed, we can estimate that approximately 68% of the population data will fall within one standard deviation of the mean. [4 clicks] That is, about two thirds of the data will be between -1.69 and 1.85.

Distribution – Normal Distribution
95% of the data: within two standard deviations of the mean
$\bar{x} - 2s$ = 0.08 - 3.54 = -3.46
$\bar{x} + 2s$ = 0.08 + 3.54 = 3.62
[Figure: frequency versus data elements, with $\bar{x}$ = 0.08 marked and a band from -3.46 to 3.62]
And, again, because the data is assumed to be normally distributed, we can estimate that 95% of the population data will fall within 2 standard deviations of the mean. That is, approximately 95% of the data will fall between -3.46 and 3.62.

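The interval endpoints follow directly from the mean and the sample standard deviation. A minimal Python sketch, using the reported values 0.08 and 1.77 as inputs (the function name `interval` is an assumption, and the results are rounded to two decimal places to match the reported precision):

```python
x_bar = 0.08   # sample mean reported in the example
s = 1.77       # sample standard deviation reported in the example

def interval(mean, std_dev, k):
    """Endpoints of the band within k standard deviations of the mean."""
    return round(mean - k * std_dev, 2), round(mean + k * std_dev, 2)

print(interval(x_bar, s, 1))  # (-1.69, 1.85), about 68% of normally distributed data
print(interval(x_bar, s, 2))  # (-3.46, 3.62), about 95% of normally distributed data
```
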