Published by Katrina Plass. Modified over 10 years ago.
1
Introduction Our daily lives often involve a great deal of data, or numbers in context. It is important to understand how data is found, what it means, and how the information is used. The focus of this lesson is on how to calculate and understand statistics—the numbers that summarize, describe, or represent sets of data. 1.1.1: Describing Data Sets
2
Key Concepts Data can be described, summarized, and graphed in a variety of ways. We can represent a data set using a measure of center. Measures of Center A measure of center is a single number used to represent the middle value, expected value, or most typical value of a data set. Two commonly used measures of center are the median and the mean. 1.1.1: Describing Data Sets
3
Key Concepts, continued
The median is the middle-most value of a data set; 50% of the data is less than this value, and 50% is greater than it. To find the median, arrange the data values from least to greatest. The median is the middle value in an ordered data set if the number of data values is odd. If the data set contains an even number of values, the median is the average of the two middle numbers. 1.1.1: Describing Data Sets
4
Key Concepts, continued
The mean is found by adding the values in a data set and then dividing the sum by the number of values in the data set. It is also considered the average of all the values in a data set. The mean can be found using the formula $\bar{x} = \frac{\sum x_i}{n}$, where $\bar{x}$ (pronounced “x bar”) represents the mean, each $x_i$ is a data value, and $n$ is the number of data values.
5
Key Concepts, continued
$\Sigma$ is the uppercase Greek letter sigma, and is used to represent a sum. So, $\sum x_i$ represents the sum of the $n$ data values in the data set: $\sum x_i = x_1 + x_2 + \cdots + x_n$.
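The two measures of center described above can be sketched in a few lines of Python. This is only an illustration: the function names `mean` and `median` are chosen here, and the sample values are the small data set 3, 5, 6, 8, 8 that this lesson uses in its later examples.

```python
def mean(values):
    """Return the sum of the values divided by the number of values."""
    return sum(values) / len(values)


def median(values):
    """Return the middle value of the ordered data.

    When the number of values is even, average the two middle values.
    """
    ordered = sorted(values)          # order from least to greatest first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd count: a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


data = [3, 5, 6, 8, 8]
print(mean(data))    # 6.0
print(median(data))  # 6
```

Sorting inside `median` guards against the common error of finding the middle of unordered data.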
6
Key Concepts, continued
The Five-Number Summary The five-number summary of a data set consists of the following key numbers: the minimum, the first quartile (Q1), the median (Q2), the third quartile (Q3), and the maximum. The minimum is the smallest value in the data set and the maximum is the largest value in the data set. The median, also known as the second quartile, is represented by Q2. 1.1.1: Describing Data Sets
7
Key Concepts, continued
When the data values are ordered from least to greatest, the first quartile, Q1, is the value that identifies the lower 25% of the data. It is also the median of the lower half of the data set; 75% of all data is greater than this value. The third quartile, Q3, is the value that identifies the upper 25% of the data. It is also the median of the upper half of the data set; 75% of all data is less than this value. 1.1.1: Describing Data Sets
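The quartile definitions above ("median of the lower half" and "median of the upper half") could be coded roughly as follows. Two things are assumptions of this sketch rather than statements from the lesson: the sample values are made up for illustration, and when the number of values is odd the overall median is left out of both halves (a common classroom convention; statistical software often uses other quartile rules).

```python
def median(ordered):
    """Median of a list that is already ordered from least to greatest."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum) using the median-of-halves rule."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for an odd count, skip the middle value itself.
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])


sample = [1, 3, 4, 7, 9, 10, 12, 15]   # hypothetical data
print(five_number_summary(sample))     # (1, 3.5, 8.0, 11.0, 15)
```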
8
Key Concepts, continued
Measures of Spread or Variability A measure of spread is a number used to describe how far apart certain key values are from each other, or how far a typical value is from the mean of a data set. Measures of spread are also known as measures of variability. The most common measures of spread are the range, interquartile range, and standard deviation. The range is the difference from the minimum to the maximum in a data set; that is, range = maximum – minimum. The range describes the spread of the entire data set. 1.1.1: Describing Data Sets
9
Key Concepts, continued
The interquartile range, IQR, is the difference from the first quartile to the third quartile: IQR = Q3 – Q1. The interquartile range describes the spread of the middle “half” of the data set. Note: In some cases, the data values between Q1 and Q3 do not form exactly half the data set. But data sets often have many values, and in those cases the middle “half” is very close to half, so the distinction is not important. For example, if a data set has 1,001 values, then the middle “half” has 501 values, which is approximately 50.05% of the data set.
10
Key Concepts, continued
The mean absolute deviation, MAD, is the average absolute value of the difference between each data point in a data set and the mean. It is found by summing the absolute value of each difference (or deviation from the mean), then dividing the sum by the total number of data points. The formula for mean absolute deviation is $MAD = \frac{\sum |x_i - \bar{x}|}{n}$, where $x_i$ is a data point, $\bar{x}$ is the mean, and $n$ is the number of data values.
11
Key Concepts, continued
Shown in expanded form, the formula looks like this: $MAD = \frac{|x_1 - \bar{x}| + |x_2 - \bar{x}| + \cdots + |x_n - \bar{x}|}{n}$. Consider this data set: 3, 5, 6, 8, 8. The mean is 6: $\bar{x} = \frac{3 + 5 + 6 + 8 + 8}{5} = \frac{30}{5} = 6$. Use the mean to find the mean absolute deviation by substituting each of the values in the data set for $x_i$ and 6 for $\bar{x}$, as shown on the next slide.
12
Key Concepts, continued
$MAD = \frac{|3 - 6| + |5 - 6| + |6 - 6| + |8 - 6| + |8 - 6|}{5} = \frac{3 + 1 + 0 + 2 + 2}{5} = \frac{8}{5} = 1.6$ The mean absolute deviation is 1.6.
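As a quick check of the result above, the mean absolute deviation can be computed directly. The function name `mean_absolute_deviation` is just a label chosen for this sketch.

```python
def mean_absolute_deviation(values):
    """Average of the absolute deviations of each value from the mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


data = [3, 5, 6, 8, 8]
print(mean_absolute_deviation(data))  # 1.6
```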
13
Key Concepts, continued
The lowercase Greek letter sigma, σ, is used in two measures of spread, or variability: variance and standard deviation. The variance, σ², is a measure of spread, or variability; it is the average of the squares of the deviations of all the data values in a data set from the mean. The variance is found using the formula $\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n}$, where $x_i$ is a data point, $\bar{x}$ is the mean, and $n$ is the number of data values.
14
Key Concepts, continued
Shown in expanded form, the formula looks like this: $\sigma^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}$. Consider the same data set as before: 3, 5, 6, 8, 8, with a mean of 6. Find the variance by substituting each of the values in the data set for $x_i$ and 6 for $\bar{x}$, as shown on the next slide.
15
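Key Concepts, continued
$\sigma^2 = \frac{(3 - 6)^2 + (5 - 6)^2 + (6 - 6)^2 + (8 - 6)^2 + (8 - 6)^2}{5} = \frac{9 + 1 + 0 + 4 + 4}{5} = \frac{18}{5} = 3.6$ The variance is 3.6.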
16
Key Concepts, continued
The standard deviation, σ, is another measure of spread, or variability; it is the square root of the average squared difference from the mean (that is, the square root of the variance), denoted by the lowercase Greek letter sigma, σ. The standard deviation is found using the formula $\sigma = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}}$, where $x_i$ is a data point, $\bar{x}$ is the mean, and $n$ is the number of data values.
17
Key Concepts, continued
Shown in expanded form, the formula looks like this: $\sigma = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n}}$. Consider the same data set as earlier: 3, 5, 6, 8, 8. The variance, found previously, is 3.6. Take the square root of the variance to find the standard deviation: $\sigma = \sqrt{3.6}$, so $\sigma \approx 1.897$.
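The variance and standard deviation for the data set 3, 5, 6, 8, 8 can be reproduced with a short sketch. The helper names are chosen here for illustration. Note that these formulas divide by n (the "population" form used in this lesson); Python's standard `statistics` module provides the same calculations as `pvariance` and `pstdev`, while its `variance` and `stdev` functions divide by n - 1 and would give different answers.

```python
import math
import statistics


def variance(values):
    """Average of the squared deviations from the mean (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def standard_deviation(values):
    """Square root of the variance."""
    return math.sqrt(variance(values))


data = [3, 5, 6, 8, 8]
print(variance(data))                       # 3.6
print(round(standard_deviation(data), 3))   # 1.897

# Cross-check against the standard library's population formulas.
print(math.isclose(variance(data), statistics.pvariance(data)))         # True
print(math.isclose(standard_deviation(data), statistics.pstdev(data)))  # True
```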
18
Key Concepts, continued
The standard deviation describes how much the data values vary, or deviate, from the mean. That is, it describes the deviation of a typical data value from the mean. When the mean is used as the measure of center, the standard deviation should be used as a measure of spread. 1.1.1: Describing Data Sets
19
Key Concepts, continued
Outliers and Extreme Values An outlier is a data value that is much less or much greater than most of the values in the data set. A data value is an outlier if it is less than Q1 – 1.5(IQR) or if it is greater than Q3 + 1.5(IQR). An extreme value is a data value that seems to be much less or much greater than most of the other data values. Note: All outliers are extreme values, but not all extreme values are outliers.
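The outlier rule above amounts to two cutoff values, one on each side of the box. A minimal sketch follows; the function names and the sample data are hypothetical, and the quartiles are passed in rather than computed so that the example focuses only on the rule itself.

```python
def outlier_cutoffs(q1, q3):
    """Return (lower, upper) cutoffs: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values, q1, q3):
    """Return the values that fall below the lower cutoff or above the upper one."""
    lower, upper = outlier_cutoffs(q1, q3)
    return [x for x in values if x < lower or x > upper]


# Hypothetical data whose quartiles are Q1 = 10 and Q3 = 14.
sample = [1, 10, 11, 12, 13, 14, 30]
print(outlier_cutoffs(10, 14))        # (4.0, 20.0)
print(find_outliers(sample, 10, 14))  # [1, 30]
```

A value exactly equal to a cutoff is not flagged, because the rule uses "less than" and "greater than".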
20
Key Concepts, continued
The term “extreme value” is less precise than the term “outlier” because there is no rule for identifying extreme values; they are a matter of opinion. Nevertheless, extreme values can affect the choices of measures of center and spread. Extreme values that are not outliers are those values that fall within the limits discussed previously for outliers. When there are no outliers or other extreme data values, the mean is generally a better measure of center than the median. 1.1.1: Describing Data Sets
21
Key Concepts, continued
When there is an outlier, or in some cases one or more other extreme values, the median is generally a better measure of center than the mean. 1.1.1: Describing Data Sets
22
Key Concepts, continued
Box Plots and Dot Plots A box plot is a graph that shows the five-number summary of a data set. 1.1.1: Describing Data Sets
23
Key Concepts, continued
The vertical line segment inside the box in a box plot represents the median (Q2). The length of the box in a box plot is the interquartile range (IQR). A dot plot is a graph that uses dots to show the number of times each value in a data set appears in that data set. 1.1.1: Describing Data Sets
24
Key Concepts, continued
The mean is the balance point on the dot plot of any data set; that is, if the dots were weights on a scale, the mean would be the point at which the scale would be balanced, or level. A data distribution is an arrangement of data values. When the data values are displayed in a dot plot, the distribution might have a shape that can be named. Two shapes of particular interest are symmetric and skewed. 1.1.1: Describing Data Sets
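The balance-point description of the mean can be checked numerically: the deviations below the mean exactly cancel the deviations above it, so they sum to zero. The sketch below uses the data set 3, 5, 6, 8, 8 from earlier in the lesson; the variable names are chosen here for illustration.

```python
data = [3, 5, 6, 8, 8]
mean = sum(data) / len(data)             # 6.0

deviations = [x - mean for x in data]    # [-3.0, -1.0, 0.0, 2.0, 2.0]
print(deviations)
print(sum(deviations))                   # 0.0, so the "scale" is level at the mean
```

With other data, floating-point rounding may leave a sum that is extremely close to zero rather than exactly zero.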
25
Key Concepts, continued
In a symmetric distribution, a line can be drawn so that the left and right sides are mirror images of each other, as shown. In a skewed distribution, most of the data values are concentrated on one side of the median. 1.1.1: Describing Data Sets
26
Key Concepts, continued
A distribution in which there is a “tail” of isolated, spread-out data points to the right of the median is called skewed to the right. (“Tail” describes the visual appearance of the data points.) Data that is skewed to the right is also called positively skewed. 1.1.1: Describing Data Sets
27
Key Concepts, continued
A distribution is skewed to the right if most of the data values are concentrated on the left. That is, many of the values are clustered on the left side of the distribution, and few values are on the right side (creating the “tail”). There may be one or more outliers or other extreme values on the right. 1.1.1: Describing Data Sets
28
Key Concepts, continued
A distribution in which there is a tail to the left of the median is called skewed to the left. Data that is skewed to the left is also called negatively skewed. 1.1.1: Describing Data Sets
29
Key Concepts, continued
A distribution is skewed to the left if most of the data values are concentrated on the right. That is, many of the values are clustered on the right side of the distribution, and few values are on the left side (creating the “tail”). There may be one or more outliers or other extreme values on the left. 1.1.1: Describing Data Sets
30
Key Concepts, continued
Representing a Given Data Set Accurately It is not always obvious how to choose the most appropriate measures of center and spread as well as the most appropriate graph for a data set. Furthermore, it is not always clear that one particular choice is better than another. Use the table on the next slide to help guide your decisions. 1.1.1: Describing Data Sets
31
Key Concepts, continued
Selecting Appropriate Measures of Center and Spread and Appropriate Graphs

| | If there is an outlier, use: | If there is no outlier, use: |
|---|---|---|
| Measure of center | Median (Q2) | Mean |
| Rough measure of spread | Range | Range |
| Additional measure of spread | Interquartile range (IQR) | Standard deviation (σ)* |
| Graph | Box plot (The median is the vertical segment inside the box.) | Dot plot (The mean is the balance point.) |

*Mean absolute deviation (MAD) and variance (σ²) may be used sometimes as well.
32
Common Errors/Misconceptions
• confusing the terms mean and median, and how to calculate each measure
• confusing the terms mean absolute deviation, variance, and standard deviation, and how to calculate each measure
• forgetting to order the data values from least to greatest before calculating the median, first and third quartiles, and interquartile range
33
Common Errors/Misconceptions, continued
• choosing the data value whose position number is $\frac{n}{2}$ as the median when there are n data values and n is even; for example, choosing the fifth data value as the median when there are ten data values
• forgetting that when the median is used as the measure of center, the interquartile range should be used as a measure of spread
• confusing the terms skewed to the left and skewed to the right
34
Guided Practice Example 1
The following data set shows the numbers of minutes it took 10 chemistry students to complete a quiz: Describe the data set, using appropriate measures of center and spread. Identify any outliers or other extreme values and describe their effects. 1.1.1: Describing Data Sets
35
Guided Practice: Example 1, continued Make a plan.
The choice of spread depends on the choice of center. The choice of center depends on whether there are any outliers. To identify outliers, you need the interquartile range. To find the interquartile range, you need to first find the quartiles Q1 and Q3. So, begin by finding the five-number summary of the data set. 1.1.1: Describing Data Sets
36
Guided Practice: Example 1, continued Find the five-number summary.
The five-number summary includes the minimum value, the first quartile (Q1), the second quartile (Q2) or median, the third quartile (Q3), and the maximum value. Begin by ordering the data values from least to greatest. The minimum is 2 and the maximum is 13. 1.1.1: Describing Data Sets
37
Guided Practice: Example 1, continued
The median, Q2, is the average of the two middle values because the number of values, 10, is even. The two middle values are 10 and 11, so add them and divide by 2 to find the median: $\frac{10 + 11}{2} = 10.5$. The median is 10.5. There are 5 data values on either side of 10.5; since each half contains an odd number of data values, we can find Q1 and Q3 without averaging values.
38
Guided Practice: Example 1, continued
The first quartile, Q1, is the middle value of the lower half (the data values to the left of the median): 9. The third quartile, Q3, is the middle value of the upper half (the data values to the right of the median): 11. The five-number summary is: minimum = 2, Q1 = 9, Q2 = 10.5, Q3 = 11, maximum = 13.
39
Guided Practice: Example 1, continued
Find the interquartile range (IQR). The interquartile range is the difference between Q3 (11) and Q1 (9). IQR = Q3 – Q1 IQR = (11) – (9) IQR = 2 The interquartile range is 2. 1.1.1: Describing Data Sets
40
Guided Practice: Example 1, continued Identify any outliers.
A data value is an outlier if it is less than Q1 – 1.5(IQR) or greater than Q3 + 1.5(IQR). Calculate Q1 – 1.5(IQR) for Q1 = 9 and IQR = 2. Q1 – 1.5(IQR) = (9) – 1.5(2) Q1 – 1.5(IQR) = 9 – 3 Q1 – 1.5(IQR) = 6 The data values 2 and 2 are outliers because 2 < 6.
41
Guided Practice: Example 1, continued
Calculate Q3 + 1.5(IQR) for Q3 = 11 and IQR = 2. Q3 + 1.5(IQR) = (11) + 1.5(2) Q3 + 1.5(IQR) = 11 + 3 Q3 + 1.5(IQR) = 14 There are no data values greater than 14. The only outliers are 2 and 2.
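The two cutoff calculations for Example 1 can be double-checked with a few lines of arithmetic. This sketch uses only the summary values stated in the example (Q1 = 9, Q3 = 11, minimum = 2, maximum = 13); the variable names are chosen here for illustration.

```python
q1, q3 = 9, 11
minimum, maximum = 2, 13

iqr = q3 - q1                    # 2
lower_cutoff = q1 - 1.5 * iqr    # 6.0
upper_cutoff = q3 + 1.5 * iqr    # 14.0

print(minimum < lower_cutoff)    # True: 2 is an outlier on the low side
print(maximum > upper_cutoff)    # False: 13 is not an outlier
```

Because the maximum does not exceed the upper cutoff, no other value can either, which is why only the low end produces outliers here.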
42
Guided Practice: Example 1, continued
Choose an appropriate measure of center for the data. The median, 10.5, is an appropriate measure of center because there are two extreme values, 2 and 2, that are also outliers of the data set. 1.1.1: Describing Data Sets
43
Guided Practice: Example 1, continued
Choose an appropriate measure of spread for the data. The range is useful for any data set, but it is only a rough measure because it does not give any information about data values between the minimum and the maximum. Because the median has been chosen as the more appropriate measure of center, the additional measure of spread should be the interquartile range. 1.1.1: Describing Data Sets
44
Guided Practice: Example 1, continued
Draw a box plot and a dot plot to display the data set. Use the five-number summary to create the box plot. 1.1.1: Describing Data Sets
45
Guided Practice: Example 1, continued
Create the dot plot by marking occurrences of each data set value on a number line that has the same increments as your box plot. 1.1.1: Describing Data Sets
46
Guided Practice: Example 1, continued
Use the plots to describe the data set. The distribution is skewed to the left because there are two values that are on the left, relatively far from the rest of the data, which is concentrated at the right. The median, Q2 = 10.5, represents the data set. The median is represented by the vertical line segment inside the box of the box plot. The interquartile range, 2, is the difference between the upper quartile (Q3), which is 11, and the lower quartile (Q1), which is 9. 1.1.1: Describing Data Sets
47
Guided Practice: Example 1, continued
The data values 2 and 2 are extreme values in this data set; their effect is to make the mean too low to be an accurate measure of center. The extreme data values 2 and 2 can be called outliers because they are less than Q1 – 1.5(IQR). On a box plot, outliers are data values that are outside the box by a distance of more than 1.5 times the interquartile range; that is, outside the box by a distance of more than 1.5 times the length of the box. Looking at the box plot, it appears that the distance between 2 and the left side of the box is more than twice the length of the box itself.
49
Guided Practice Example 2
Eight friends are discussing their part-time jobs. They worked the following numbers of hours last week: Describe the data set, using appropriate measures of center and spread. Identify any outliers or other extreme values and describe their effects. 1.1.1: Describing Data Sets
50
Guided Practice: Example 2, continued Make a plan.
The choice of spread depends on the choice of center. The choice of center depends on whether there are any outliers. To identify outliers, you need the interquartile range. To find the interquartile range, you need to first find the quartiles Q1 and Q3. So, begin by finding the five-number summary of the data set. 1.1.1: Describing Data Sets
51
Guided Practice: Example 2, continued Find the five-number summary.
Order the data values from least to greatest: 4, 6, 8, 8, 8, 10, 14, 14. The minimum is 4 and the maximum is 14. The median is the average of the two middle values, because the number of data values is even: $\frac{8 + 8}{2} = 8$.
52
Guided Practice: Example 2, continued
The median, 8, is the average of the fourth and fifth data values, so it splits the data set into two halves of four values each. Because each half has an even number of data values, we will need to average values to find Q1 and Q3. Q1 is the average of the two middle values of the lower half of the data set (the data to the left of the median): $Q_1 = \frac{6 + 8}{2} = 7$.
53
Guided Practice: Example 2, continued
Q3 is the average of the two middle values of the upper half of the data set (the data to the right of the median): $Q_3 = \frac{10 + 14}{2} = 12$. The five-number summary is: minimum = 4, Q1 = 7, Q2 = 8, Q3 = 12, maximum = 14.
54
Guided Practice: Example 2, continued
Find the interquartile range (IQR). The interquartile range is the difference between Q3 (12) and Q1 (7). IQR = Q3 – Q1 IQR = (12) – (7) IQR = 5 1.1.1: Describing Data Sets
55
Guided Practice: Example 2, continued Identify any outliers.
A data value is an outlier if it is less than Q1 – 1.5(IQR) or greater than Q3 + 1.5(IQR). Calculate Q1 – 1.5(IQR) for Q1 = 7 and IQR = 5. Q1 – 1.5(IQR) = (7) – 1.5(5) Q1 – 1.5(IQR) = 7 – 7.5 Q1 – 1.5(IQR) = –0.5 There are no data values less than –0.5.
56
Guided Practice: Example 2, continued
Calculate Q3 + 1.5(IQR) for Q3 = 12 and IQR = 5. Q3 + 1.5(IQR) = (12) + 1.5(5) Q3 + 1.5(IQR) = 12 + 7.5 Q3 + 1.5(IQR) = 19.5 There are no data values greater than 19.5. There are no outliers.
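The steps of Example 2 so far (five-number summary, interquartile range, and outlier check) can be traced in one short script. It uses the eight values from this example, 4, 6, 8, 8, 8, 10, 14, 14; the helper `median` and the variable names are choices made for this sketch.

```python
def median(ordered):
    """Median of a list that is already ordered from least to greatest."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


hours = sorted([4, 6, 8, 8, 8, 10, 14, 14])
half = len(hours) // 2           # n = 8 is even, so each half has 4 values

q1 = median(hours[:half])        # (6 + 8) / 2 = 7.0
q2 = median(hours)               # (8 + 8) / 2 = 8.0
q3 = median(hours[half:])        # (10 + 14) / 2 = 12.0
iqr = q3 - q1                    # 5.0

lower_cutoff = q1 - 1.5 * iqr    # -0.5
upper_cutoff = q3 + 1.5 * iqr    # 19.5
outliers = [x for x in hours if x < lower_cutoff or x > upper_cutoff]

print(hours[0], q1, q2, q3, hours[-1])   # 4 7.0 8.0 12.0 14
print(iqr, lower_cutoff, upper_cutoff)   # 5.0 -0.5 19.5
print(outliers)                          # []
```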
57
Guided Practice: Example 2, continued
Choose an appropriate measure of center. There are no outliers; therefore, look at the ordered list of data values and decide whether there are any values that seem to be extreme, even if they do not qualify as outliers. Do this by informally comparing the differences between consecutive values. 1.1.1: Describing Data Sets
58
Guided Practice: Example 2, continued
Ordered data values: 4, 6, 8, 8, 8, 10, 14, 14 There are no large differences between consecutive data values, so there do not seem to be any extreme values. The mean is an appropriate measure of center because there are no outliers or other extreme values. 1.1.1: Describing Data Sets
59
Guided Practice: Example 2, continued Find the mean, $\bar{x}$.
The mean is the average of all the data values. Formula for calculating mean: $\bar{x} = \frac{\sum x_i}{n}$, where $\sum x_i$ is the sum of the n data values. Substitute values from the data set for $x_1$, $x_2$, etc. There are 8 data values, so n = 8: $\bar{x} = \frac{4 + 6 + 8 + 8 + 8 + 10 + 14 + 14}{8}$
60
Guided Practice: Example 2, continued Simplify. $\bar{x} = \frac{72}{8} = 9$ The mean is 9.
1.1.1: Describing Data Sets
61
Guided Practice: Example 2, continued
Choose appropriate measures of spread. Because the mean has been chosen as the measure of center, appropriate measures of spread are the range, mean absolute deviation (MAD), variance (σ²), and standard deviation (σ).
62
Guided Practice: Example 2, continued Find the range.
The range is the difference between the maximum and minimum. In this data set, the maximum is 14 and the minimum is 4. range = maximum – minimum range = (14) – (4) range = 10 The range is 10. 1.1.1: Describing Data Sets
63
Guided Practice: Example 2, continued
To prepare for calculating the mean absolute deviation, the variance, and the standard deviation, first work with the individual data values. For each value, find its deviation from the mean, then take the absolute value of the deviation, and then square the deviation. Organize the data values and results in a table, as shown on the next slide.
64
Guided Practice: Example 2, continued
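| Data value $x_i$ | Mean $\bar{x}$ | Deviation from mean $x_i - \bar{x}$ | Absolute deviation $\lvert x_i - \bar{x} \rvert$ | Deviation squared $(x_i - \bar{x})^2$ |
|---|---|---|---|---|
| 4 | 9 | –5 | 5 | 25 |
| 6 | 9 | –3 | 3 | 9 |
| 8 | 9 | –1 | 1 | 1 |
| 8 | 9 | –1 | 1 | 1 |
| 8 | 9 | –1 | 1 | 1 |
| 10 | 9 | 1 | 1 | 1 |
| 14 | 9 | 5 | 5 | 25 |
| 14 | 9 | 5 | 5 | 25 |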
65
Guided Practice: Example 2, continued
Find the mean absolute deviation (MAD), the variance, and the standard deviation for the data set. Find the sum in each of the last two columns of the table from the previous step. 1.1.1: Describing Data Sets
66
Guided Practice: Example 2, continued
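| Data value $x_i$ | Mean $\bar{x}$ | Deviation from mean $x_i - \bar{x}$ | Absolute deviation $\lvert x_i - \bar{x} \rvert$ | Deviation squared $(x_i - \bar{x})^2$ |
|---|---|---|---|---|
| 4 | 9 | –5 | 5 | 25 |
| 6 | 9 | –3 | 3 | 9 |
| 8 | 9 | –1 | 1 | 1 |
| 8 | 9 | –1 | 1 | 1 |
| 8 | 9 | –1 | 1 | 1 |
| 10 | 9 | 1 | 1 | 1 |
| 14 | 9 | 5 | 5 | 25 |
| 14 | 9 | 5 | 5 | 25 |
| Sum | | | 22 | 88 |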
67
Guided Practice: Example 2, continued
The sum of the absolute deviations for the individual data values is 22. The sum of the squares of the deviations is 88. The mean absolute deviation is the average of the absolute deviations (their sum divided by the number of data values), as shown on the next slide.
68
Guided Practice: Example 2, continued
Formula for mean absolute deviation: $MAD = \frac{\sum |x_i - \bar{x}|}{n}$ Substitute 22 for $\sum |x_i - \bar{x}|$, the sum of the absolute deviations, and 8 for n, the number of data values: $MAD = \frac{22}{8}$ Simplify: $MAD = 2.75$ The mean absolute deviation is 2.75.
69
Guided Practice: Example 2, continued
The variance is the average of the squares of the deviations: Formula for variance: $\sigma^2 = \frac{\sum (x_i - \bar{x})^2}{n}$ Substitute 88 for $\sum (x_i - \bar{x})^2$, the sum of the squares of the deviations, and 8 for n, the number of data values: $\sigma^2 = \frac{88}{8}$ Simplify: $\sigma^2 = 11$ The variance is 11.
70
Guided Practice: Example 2, continued
The standard deviation is the square root of the variance: Formula for standard deviation: $\sigma = \sqrt{\sigma^2}$ Substitute 11 for the variance, σ²: $\sigma = \sqrt{11}$ Simplify: $\sigma \approx 3.32$ The standard deviation is approximately 3.32.
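The table sums and the three measures of spread for Example 2 can be reproduced as follows. This is a sketch with variable names chosen here; it divides by n, as the lesson's formulas do, and cross-checks the results against `statistics.pvariance` and `statistics.pstdev` from Python's standard library.

```python
import math
import statistics

hours = [4, 6, 8, 8, 8, 10, 14, 14]
n = len(hours)
mean = sum(hours) / n                           # 72 / 8 = 9.0

abs_devs = [abs(x - mean) for x in hours]       # "Absolute deviation" column
sq_devs = [(x - mean) ** 2 for x in hours]      # "Deviation squared" column

mad = sum(abs_devs) / n                         # 22 / 8 = 2.75
variance = sum(sq_devs) / n                     # 88 / 8 = 11.0
std_dev = math.sqrt(variance)                   # about 3.32

print(sum(abs_devs), sum(sq_devs))              # 22.0 88.0
print(mad, variance, round(std_dev, 2))         # 2.75 11.0 3.32

# Cross-check with the standard library's population formulas.
print(math.isclose(variance, statistics.pvariance(hours)))  # True
print(math.isclose(std_dev, statistics.pstdev(hours)))      # True
```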
71
Guided Practice: Example 2, continued Draw a box plot.
Use the five-number summary to create the box plot. 1.1.1: Describing Data Sets
72
Guided Practice: Example 2, continued Draw a dot plot.
Create the dot plot by marking occurrences of each data set value on a number line that has the same increments as your box plot. 1.1.1: Describing Data Sets
73
Guided Practice: Example 2, continued
Use the plots to describe the data set. The distribution is neither significantly skewed nor symmetric, though it is nearly symmetric about the value 8. The mean, $\bar{x} = 9$, and median, Q2 = 8, are both reasonable choices as appropriate measures of center. But the mean is a slightly better choice because it is the balance point of the entire data set, and the data set has no outliers or other extreme values.
76
Guided Practice: Example 2, continued
The range, 10, describes the spread of the entire data set, from minimum to maximum. The standard deviation, σ ≈ 3.32, describes the difference, or deviation, between a typical data value and the mean. (The mean absolute deviation, MAD = 2.75, and the variance, σ² = 11, are associated with the standard deviation.) There are no extreme values or outliers.