Chapter 3 Displaying and Summarizing Quantitative Data.

1 Chapter 3 Displaying and Summarizing Quantitative Data

2 Objectives The student will be able to: Appropriately display quantitative data using a frequency distribution, histogram, relative frequency histogram, and stem-and-leaf display. Describe the general shape of a distribution in terms of shape, center and spread. Describe any anomalies or extraordinary features revealed by the display of a variable. Compute and apply the concepts of mean and median to a set of data. Compute and apply the concept of the standard deviation and IQR to a set of data. Select a suitable measure of center/spread for a variable based on information about its distribution. Create a five-number summary of a variable. Construct a boxplot by hand and using technology, using fences to identify possible outliers. Use the 1.5 IQR rule to identify possible outliers.

3 3.1 Displaying Quantitative Variables

4 Histograms Histogram: A chart that displays quantitative data. Great for seeing the distribution of the data.
A histogram of tsunami-generating earthquakes: Most tsunami-generating earthquakes have magnitudes between 6.5 and 8. The Japan and Sumatra quakes (9.0 and 9.1) are rare. Quakes under 5 rarely cause tsunamis. Quakes between 7.0 and 7.5 are the most common cause of tsunamis.

5 Choosing the Bin Width Different bin widths tell different stories.
Choose the width that best shows the important features. Presentations can feature two histograms that present the same data in different ways. A gap in the histogram means that there were no occurrences in that range.

6 Relative Frequency Histograms
The vertical axis represents the relative frequency, the frequency divided by the total. The horizontal axis is the same as the horizontal axis for the frequency histogram. The shape of the relative frequency histogram is the same as the frequency histogram. Only the scale of the y-axis is different.

7 Practice Suppose a basketball player scored the following number of points in his last 15 games: 4, 4, 3, 4, 7, 16, 12, 15, 6, 8, 5, 9, 8, 25, 11. Fill in the following frequency (and relative frequency) distribution.
Bin | Frequency | Relative Frequency
1-6 | |
7-12 | |
13-18 | |
19-24 | |
25-30 | |
Total | 15 | 100%

8 Practice (continued) What percentage of games did the player score 12 points or less? What percentage of games did the player score between 7 and 18 points (inclusive i.e. 7<=points<=18)?
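
A frequency table like the one above can also be tallied by a short program. The following is a minimal Python sketch, not part of the original slides; the function names `frequency_table` and `percent_between` are illustrative assumptions, and the bins are the inclusive ranges from the Practice table.

```python
# Points scored in the last 15 games (from the Practice slide).
points = [4, 4, 3, 4, 7, 16, 12, 15, 6, 8, 5, 9, 8, 25, 11]

# Bins from the Practice table, as inclusive (low, high) pairs.
bins = [(1, 6), (7, 12), (13, 18), (19, 24), (25, 30)]


def frequency_table(data, bins):
    """Return one (low, high, frequency, relative_frequency) row per bin."""
    total = len(data)
    rows = []
    for low, high in bins:
        count = sum(1 for x in data if low <= x <= high)
        rows.append((low, high, count, count / total))
    return rows


def percent_between(data, low, high):
    """Percentage of values with low <= value <= high."""
    return 100 * sum(1 for x in data if low <= x <= high) / len(data)


table = frequency_table(points, bins)
for low, high, count, rel in table:
    print(f"{low:>2}-{high:<2} | {count:>2} | {rel:6.1%}")
print(f"Total | {len(points)} | {sum(row[3] for row in table):.0%}")

# The two follow-up questions are special cases of percent_between.
print(percent_between(points, 1, 12))   # 12 points or less
print(percent_between(points, 7, 18))   # between 7 and 18 inclusive
```

Dividing each count by the total is the only difference between the frequency and relative frequency columns, which is why the two histograms have the same shape.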

9 Using the TI to make a histogram
First, enter the data in L1: STAT -> EDIT -> ENTER.
Turn STAT PLOT on: [2nd] [Y=] opens the stat plot menu. With the cursor on 1:, hit ENTER. With the cursor on On for Plot1, hit ENTER.
Select the type of plot desired.
If raw data is in L1, set Xlist:L1, Freq:1. If frequencies are in L2, set Xlist:L1, Freq:L2.
Zoom -> ZoomStat to display the data.
Example: Let's make a histogram of the following dataset: If we want a histogram that groups the data using the classes 10-19, 20-29, 30-39, etc., we must adjust the window (to be discussed in class or see separate technology instructions).

10 Histograms and StatCrunch
Enter Data. Graphics → Histogram Click on the data variable and Next. Select Frequency or Relative Frequency. Put in starting value and/or Binwidth if desired. Click Next twice, and type in labels. Click Create Graph.

11 Stem-and-Leaf Displays
Stem-and-Leaf: Shows both the shape of the distribution and all of the individual values. Not as visually pleasing as a histogram; more technical looking. Can only be used for small collections of data. The first column (stems) represents the leftmost digit. The second column (leaves) shows the remaining digit(s).

12 Constructing a Stem-and-Leaf Display
First, cut each data value into leading digits (“stems”) and trailing digits (“leaves”). Use the stems to label the bins. Use only one digit for each leaf: either round or truncate the data values to one decimal place after the stem. Write a key at the bottom: 8|3 indicates 83 or 830. Examples: make stem-and-leaf displays for number of siblings, heights, and number of college semesters completed using our class data set.
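
The construction steps can be sketched in Python. This is an illustrative sketch rather than the method StatCrunch uses; the function name `stem_and_leaf` is an assumption, it expects a non-empty list of non-negative whole numbers, and it reuses the basketball scores from the earlier Practice slide as sample data.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Build display lines with the tens digit(s) as stem and the units digit as leaf.

    Assumes a non-empty list of non-negative whole numbers.
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # cut each value into stem and leaf
        leaves[stem].append(leaf)
    lines = []
    # Walk every stem from smallest to largest so empty stems (gaps) stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem}|{row}")
    return lines


points = [4, 4, 3, 4, 7, 16, 12, 15, 6, 8, 5, 9, 8, 25, 11]
display = stem_and_leaf(points)
for line in display:
    print(line)
print("Key: 2|5 indicates 25")
```

Sorting first puts the leaves in order on each stem, so the display doubles as a sorted list of the data.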

13 Stem and Leaf with StatCrunch
Enter Data. Graphics → Stem and Leaf. Click on the variable name and Next. Select Outlier Trimming Type and Create Graph!

14 Dotplots Dotplot: Displays dots to describe the shape of the distribution. There were 30 races with a winning time of 122 seconds. Good for smaller data sets. Visually more appealing than stem-and-leaf. In StatCrunch: Graphics → Dotplot

15 Think Before you Draw Is the variable quantitative? Is the answer to the survey question or result of the experiment a number whose units are known? Histograms, stem-and-leaf diagrams, and dotplots can only display quantitative data. Bar and pie charts display categorical data.

16 3.2 Shape

17 What is the Shape of the Distribution?
Does the histogram have a single, central hump or several separated humps? Is the histogram symmetric? Do any unusual features stick out?

18 Modes A Mode of a histogram is a hump or high-frequency bin.
One mode → Unimodal. Two modes → Bimodal. Three or more modes → Multimodal.

19 Uniform Distributions
Uniform Distribution: All the bins have the same frequency, or at least close to the same frequency. The histogram for a uniform distribution will be flat.

20 Symmetry The histogram for a symmetric distribution will look the same on the left and the right of its center. Example histograms: not symmetric, symmetric, symmetric.

21 Skew
A histogram is skewed right if the longer tail is on the right side of the mode. A histogram is skewed left if the longer tail is on the left side of the mode.

22 Outliers An Outlier is a data value that is far above or far below the rest of the data values. An outlier is sometimes just an error in the data collection. An outlier can also be the most important data value. Examples: income of a CEO, temperature of a person with a high fever, elevation at Death Valley.

23 Summary - Shape of the Distribution
Does the histogram have a single, central hump or several separated humps? unimodal, bimodal, multimodal, uniform Is the histogram symmetric? Symmetric, skewed left, skewed right Do any unusual features stick out? Outliers, gaps

24 Example The histogram shows the amount of money spent by a credit card company’s customers. Describe and interpret the distribution. The distribution is unimodal. Customers most commonly spent a small amount of money. The distribution is skewed right. Many customers spent only a small amount and a few were spread out at the high end. There is an outlier at around $ One customer spent much more than the rest of the customers.

25 3.3 Center

26 The Median Median: The center of the data values
Half of the data values are to the left of the median and half are to the right of the median. For symmetric distributions, the median is directly in the middle.

27 Calculating the Median: Odd Sample Size
First order the numbers. If there is an odd number of values, n, the median is at position $\frac{n+1}{2}$. Find the median of the numbers: 2, 4, 5, 6, 7, 9, 9. Here n = 7, so the median is the fourth number: 6. Note that there are 3 numbers to the left of 6 and 3 to the right.

28 Calculating the Median: Even Sample Size
First order the numbers. If there is an even number of values, n, the median is the average of the two middle numbers, at positions $\frac{n}{2}$ and $\frac{n}{2}+1$. Find the median of the numbers: 2, 2, 4, 6, 7, 8. The median is the average of the third and the fourth numbers: $\frac{4+6}{2} = 5$.
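
Both cases, odd and even sample size, fit in one small function. The sketch below is illustrative only; the name `median` is chosen here for clarity (Python's standard library also offers `statistics.median`), and the sample lists are the ones from the two median slides.

```python
def median(values):
    """Median by the sort-then-pick-the-middle rule."""
    ordered = sorted(values)          # always order the numbers first
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the value at 1-based position (n + 1) / 2.
        return ordered[mid]
    # Even n: average of the values at 1-based positions n / 2 and n / 2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([2, 4, 5, 6, 7, 9, 9]))   # odd sample size
print(median([2, 2, 4, 6, 7, 8]))      # even sample size
```

Because the function sorts its input, it gives the same answer whatever order the data arrives in.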

29 3.4 Spread

30 Spread Locating the center is only part of the story
Are the data all near the center or are they spread out? Is the highest value much higher than the lowest value? To describe data, we must discuss both the center and the spread.

31 Range The range is the difference between the maximum and minimum values. Range = Maximum – Minimum. The ages of the guests at your dinner party are: 16, 18, 23, 23, 27, 35, 74. The range is: 74 – 16 = 58. The range is sensitive to outliers. A single high or low value will affect the range significantly.

32 Percentiles and Quartiles
Percentiles divide the data into one hundred groups. The nth percentile is the data value such that n percent of the data lies below that value. For large data sets, the median is the 50th percentile. The median of the lower half of the data is the 25th percentile and is called the first quartile (Q1). The median of the upper half of the data is the 75th percentile and is called the third quartile (Q3).

33 Using the TI to calculate summary statistics
To enter raw data in L1: STAT -> EDIT [1]. With the cursor on L1, hit [CLEAR] to delete old values, then fill the list with individual values.
To calculate summary statistics: STAT -> CALC[1] [L1] [ENTER] (L1 is found by pressing [2nd][1]). Scroll down to find the median, quartiles, min and max.
To enter a frequency distribution, enter the values in L1 and the frequency counts in L2. To calculate summary statistics use: STAT -> CALC[1] [L1] [ , ] [L2] [ENTER].

34 Using StatCrunch to calculate summary statistics
Load Data Stat->Summary Stats -> Columns (select column)

35 StatCrunch, Q1, Median, and Q3
Enter the data. Stat → Summary Stats → Columns Click on the variable and then Calculate.

36 The Interquartile Range
The Interquartile Range (IQR) is the difference between the upper quartile and the lower quartile: IQR = Q3 – Q1. The IQR measures the range of the middle half of the data. Example: If Q1 = 23 and Q3 = 44, then IQR = 44 – 23 = 21.
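
The quartiles and the IQR can be computed with the "median of each half" rule described earlier. The following Python sketch makes one assumption the slides do not state: when the sample size is odd, the median itself is left out of both halves. The function names are illustrative, and the data are the dinner party ages from the Range slide.

```python
def _median_sorted(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    Assumption: for an odd sample size the median is excluded from both halves.
    Needs at least two values.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return _median_sorted(lower), _median_sorted(ordered), _median_sorted(upper)


ages = [16, 18, 23, 23, 27, 35, 74]
q1, med, q3 = quartiles(ages)
iqr = q3 - q1
print(q1, med, q3, iqr)
```

For these ages the range is dominated by the single 74-year-old guest, while the IQR only looks at the middle half of the data.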

37 The Interquartile Range
The Interquartile Range for tsunami-causing earthquakes is 0.9. The picture below shows the meaning of the IQR.

38 Benefits and Drawbacks of the IQR
The Interquartile Range is not sensitive to outliers. The IQR provides a reasonable summary of the spread of the distribution. The IQR shows where typical values are, except for the case of a bimodal distribution. The IQR is not great for a general audience since most people do not know what it is.

39 3.5 Boxplots and 5-Number Summaries

40 5-Number Summary The 5-Number Summary provides a numerical description of the data. It consists of: Minimum, First Quartile (Q1), Median, Third Quartile (Q3), Maximum. The list to the right shows the 5-Number Summary for the tsunami data.

41 Interpreting the 5-Number Summary
The smallest tsunami-causing earthquake had magnitude 3.7. The largest tsunami-causing earthquake had magnitude 9.1. The middle half of tsunami-causing earthquakes is between 6.7 and 7.6. Half of tsunami-causing earthquakes have magnitudes below 7.2 and half are above 7.2. A tsunami-causing earthquake less than 6.7 is small. A tsunami-causing earthquake more than 7.6 is large.

42 Boxplots A Boxplot is a chart that displays the 5-Number Summary and the outliers. The Box shows the Interquartile Range. The dashed lines are called fences; the outliers lie outside the fences. Above and below the box are the whiskers, which extend to the most extreme data values within the fences. The line inside the box shows the median.

43 Finding the Fences The lower fence is defined by Lower Fence = Q1 – 1.5 × IQR. The upper fence is defined by Upper Fence = Q3 + 1.5 × IQR. Tsunami Example: Q1 = 6.7, Q3 = 7.6, IQR = 7.6 – 6.7 = 0.9. Lower Fence = 6.7 – 1.5 × 0.9 = 5.35. Upper Fence = 7.6 + 1.5 × 0.9 = 8.95.

44 Identifying Outliers Use the 1.5*IQR rule to identify potential outliers: values above Q3 + 1.5*IQR and values below Q1 – 1.5*IQR. If there are any clear outliers and you are reporting the mean and standard deviation, report them with the outliers present and with the outliers removed. The differences may be quite revealing. Note: The median and IQR are not likely to be affected by the outliers.
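
The 1.5 IQR rule is simple enough to check with a few lines of code. This Python sketch is illustrative; the function names `fences` and `possible_outliers` are assumptions, and the inputs are the tsunami quartiles (Q1 = 6.7, Q3 = 7.6) together with the values of the tsunami 5-Number Summary.

```python
def fences(q1, q3, k=1.5):
    """Return (lower_fence, upper_fence) for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def possible_outliers(values, q1, q3):
    """Values that fall below the lower fence or above the upper fence."""
    lower, upper = fences(q1, q3)
    return [v for v in values if v < lower or v > upper]


lower, upper = fences(6.7, 7.6)
print(round(lower, 2), round(upper, 2))

# Min, Q1, median, Q3 and max of the tsunami data.
summary = [3.7, 6.7, 7.2, 7.6, 9.1]
flagged = possible_outliers(summary, 6.7, 7.6)
print(flagged)
```

The results are rounded when printed because decimal values such as 6.7 are not stored exactly in floating point. A flagged value is only a possible outlier; whether it is an error or the most interesting observation still needs a judgment call.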

45 Practice (by hand) Suppose a basketball player scored the following number of points in his last 15 games: 4, 4, 3, 4, 7, 16, 12, 15, 6, 8, 5, 9, 8, 25, 11. Construct a (modified) boxplot for these scores What were your fences?

46 Using the TI to make boxplots
Weekly Salaries of Mooseburgers Employees: 123 136 144 150 110 131 140 160 120 130
Similar to plotting histograms: enter the data in lists.
To turn on stat plots: STATPLOT -> Plot1 -> ENTER, then select On -> ENTER.
Select the first boxplot pictured (this is a modified boxplot, which indicates outliers, rather than a standard boxplot, whose whiskers extend to the maximum and minimum).
Set Xlist (L1 or L2) and Frequency (1 if all data is entered individually; another list if using a frequency table).
Use Plot2 to display another data set.
Zoom -> 9 (ZoomStat).
Use trace to explore the boxplot.

47 StatCrunch and Boxplots
Enter data and go to Graphics → Boxplot. Click on the variable and Next. Check “Use fences to identify outliers.” Then Next Type in labels and click on Create Graph.

48 3.6 The Center of Symmetric Distributions: The Mean

49 The Mean The Mean is what most people think of as the average.
Add up all the numbers and divide by the number of numbers: $\bar{y} = \frac{\sum y}{n}$. Recall that Σ means “Add them all.” In StatCrunch, the mean is listed in the Summary Statistics.

50 The Mean is the “Balancing Point”
If you put your finger on the mean, the histogram will balance perfectly.

51 Mean Vs. Median For symmetric distributions, the mean and the median are equal. The balancing point is at the center. For skewed distributions, the tail “pulls” the mean towards it more than it does to the median. The mean is more sensitive to outliers than the median.

52 The Mean Is Attracted to the Outlier
The mean is larger than the median since it is “pulled” to the right by the outlier. The median is a better measure of the center for data that is skewed.
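
A quick computation shows the pull of an outlier on the mean. The sketch below uses Python's standard `statistics` module and the dinner party ages from the Range slide, where the 74-year-old guest plays the role of the outlier; the variable names are illustrative.

```python
import statistics

ages = [16, 18, 23, 23, 27, 35, 74]
ages_without_outlier = [a for a in ages if a != 74]

mean_with = statistics.mean(ages)
median_with = statistics.median(ages)
mean_without = statistics.mean(ages_without_outlier)
median_without = statistics.median(ages_without_outlier)

# The mean shifts noticeably when the outlier is removed; the median does not.
print(f"with outlier:    mean = {mean_with:.1f}, median = {median_with}")
print(f"without outlier: mean = {mean_without:.1f}, median = {median_without}")
```

Comparing the two lines of output is the same "with and without the outliers" report that the slides recommend when the mean is used.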

53 Why Use the Mean? Although the median is a better measure of the center, the mean weighs in large and small values better. The mean is easier to work with. For symmetric data, statisticians would rather use the mean.

54 3.7 The Spread of Symmetric Distributions: The Standard Deviation

55 The Variance The variance is a measure of how far the data is spread out from the mean. The difference from the mean is: $y - \bar{y}$. To make it positive, square it. Then find the average of all of these squared distances, except instead of dividing by n, divide by n – 1: $s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$. Use $s^2$ to represent the variance. The variance will mostly be used to find the standard deviation s, which is the square root of the variance.

56 Standard Deviation The variance’s units are the square of the original units. Taking the square root of the variance gives the standard deviation, which will have the same units as y. The standard deviation is a number that is close to the average distances that the y values are from the mean. If data values are close to the mean (less spread out), then the standard deviation will be small. If data values are far from the mean (more spread out), then the standard deviation will be large.
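
The variance and standard deviation recipe can be followed step by step in code. This is a minimal Python sketch; the function names are illustrative, and the three quiz-score groups are hypothetical data invented for this example (each has mean 10), not the groups from the slides.

```python
import math


def variance(values):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("variance needs at least two values")
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)


def standard_deviation(values):
    """Square root of the sample variance, in the same units as the data."""
    return math.sqrt(variance(values))


# Hypothetical quiz scores; every group has mean 10 but a different spread.
group_a = [10, 10, 10, 10, 10]
group_b = [8, 9, 10, 11, 12]
group_c = [0, 5, 10, 15, 20]

for name, group in [("A", group_a), ("B", group_b), ("C", group_c)]:
    print(name, round(standard_deviation(group), 2))
```

All three groups share the same center, so only a measure of spread such as the standard deviation tells them apart.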

57 The Standard Deviation and Histograms
Order the histograms below from smallest standard deviation to largest standard deviation. A B C Answer: C, A, B

58 The Standard Deviation (cont.)
A class has been divided into groups of five students each. Each group completed an independent study project and then took an individual pop quiz of 20-points. Their scores are reported by group: Note that all groups had a mean of 10. Notice that the SD for group 1 is 0 What are the other standard deviations? 1 2 3 4 5 6 10 8 12 18 14 20 Center alone cannot describe the differences we see among these groups. 3 of the groups have a range of 20 so that also does not adequately describe the differences.

59 Recall… Using the TI to calculate summary statistics
To enter raw data in L1: STAT -> EDIT [1]. With the cursor on L1, hit [CLEAR] to delete old values, then fill the list with individual values.
To calculate summary statistics: STAT -> CALC[1] [L1] [ENTER] (L1 is found by pressing [2nd][1]). Scroll down to find the median, quartiles, min and max.
To enter a frequency distribution, enter the values in L1 and the frequency counts in L2. To calculate summary statistics use: STAT -> CALC[1] [L1] [ , ] [L2] [ENTER].

60 Recall… Using StatCrunch to calculate summary statistics
Load Data Stat->Summary Stats -> Columns (select column)

61 3.8 Summary—What to Tell About a Quantitative Variable

62 What to Tell Histogram, Stem-and-Leaf, Boxplot
Describe modality, symmetry, outliers. Center and Spread: Median and IQR if not symmetric; mean and standard deviation if symmetric. For unimodal symmetric data, expect IQR > s; check for errors if not. Unusual Features: For multiple modes, possibly split the data into groups. When there are outliers, report the mean and standard deviation with and without the outliers.

63 Example: Fuel Efficiency
The car owner has checked the fuel efficiency each time he filled the tank. How would you describe the fuel efficiency? Plan: Summarize the distribution of the car’s fuel efficiency. Variable: mpg for 100 fill-ups, quantitative. Mechanics: Show a histogram. Fairly symmetric. Low outlier.

64 Fuel Efficiency Continued
Which to report? The mean and median are close. Report the mean and standard deviation. Conclusion: Distribution is unimodal and symmetric. Mean is 22.4 mpg. The low outlier may be investigated, but it has limited effect on the mean. s = 2.45; from one filling to the next, fuel efficiency differs from the mean by an average of about 2.45 mpg.

65 What Can Go Wrong? Don’t make a histogram for categorical data.
Don’t look for shape, center, and spread for a bar chart. Choose a bin width appropriate for the data.

66 What Can Go Wrong? Continued
Do a reality check. Don’t blindly trust your calculator. For example, a mean student age of 193 years old is nonsense. Sort before finding the median and percentiles. 315, 8, 2, 49, 97 does not have a median of 2. Don’t worry about small differences in the quartile calculation. Don’t compute numerical summaries for a categorical variable. The mean Social Security number is meaningless.
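
The point about small differences in the quartile calculation can be seen directly in software. The sketch below uses `statistics.quantiles` from Python's standard library (Python 3.8 or later), which offers two interpolation conventions; the dinner party ages from the Range slide serve as sample data. Which convention a given calculator or package follows is not stated in the slides, so treat this only as an illustration that conventions differ.

```python
import statistics

ages = [16, 18, 23, 23, 27, 35, 74]

# Same data, two quartile conventions offered by the standard library.
q_exclusive = statistics.quantiles(ages, n=4, method="exclusive")
q_inclusive = statistics.quantiles(ages, n=4, method="inclusive")

print("exclusive:", q_exclusive)   # [Q1, median, Q3]
print("inclusive:", q_inclusive)
```

Both results describe the same distribution; the median agrees and the quartiles differ only slightly, which matters little for a summary.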

67 What Can Go Wrong? Continued
Don’t report too many decimal places. Citing the mean fuel efficiency as is going overboard. Don’t round in the middle of a calculation. For multiple modes, think about separating groups. Heights of people → Separate men and women Beware of outliers, the mean and standard deviation are sensitive to outliers. Use a histogram or dotplot to ensure that the mean and standard deviation really do describe the data.

68 Practice Recall: Suppose a basketball player scored the following number of points in his last 15 games: 4, 4, 3, 4, 7, 16, 12, 15, 6, 8, 5, 9, 8, 25, 11 Describe the shape of the distribution (modality, skew, and unusual features) What measures of center or spread would be most appropriate for this data set?

69 Practice #26: A meteorologist preparing a talk about global warming compiled a list of weekly low temperatures (in degrees Fahrenheit) he observed at his south Florida home last year. The coldest temp. for any week was 36F, but he inadvertently recorded the Celsius value of 2 degrees. Assuming he correctly listed all the other temperatures, explain how this error will affect these summary statistics: Measures of center: mean and median Measures of spread: range, IQR, and standard deviation

70 Practice The table displays the heights (in inches) of 130 members of a choir. Find the median and IQR. Find the mean and standard deviation. Display these data with a histogram. Write a few sentences describing the distribution.
Height | Count | Height | Count
60 | 2 | 69 | 5
61 | 6 | 70 | 11
62 | 9 | 71 | 8
63 | 7 | 72 |
64 | | 73 | 4
65 | 20 | 74 |
66 | 18 | 75 |
67 | | 76 | 1
68 | 12 | |

71 Practice During his 20 seasons in the NHL, Wayne Gretzky scored 50% more points than anyone who ever played professional hockey. Here are the number of games he played during each season: 79, 80, 80, 80, 74, 80, 80, 79, 64, 78, 73, 78, 74, 45, 81, 48, 80, 82, 82, 70. Create a stem and leaf display, using split stems. Describe the shape of the distribution. Describe the center and spread of the distribution. What unusual features do you see? What might explain this?

72 Example : weights of pennies (grams)
Create a histogram using bins which are 0.10 grams wide (use StatCrunch). Be sure to label your axes. What can be said about the distribution? In fact we have TWO different distributions here because in the early 1980s the mint changed from copper to zinc. Let's separate our data into two groups. If we want to compare the two distributions, would it be more appropriate to use mean and sd as measures of center and spread, or median and IQR? Calculate the median, quartiles, and IQR for the data (separated by group). Calculate the mean and sd (using your calculator or StatCrunch).
2.57 2.56 3.14 3.03 3.13 2.47 2.43 3.11 3.06 2.48 2.51 2.50 3.07 3.01 2.45 3.08 3.12 3.10 2.46 2.44 2.54 3.09 2.49

73 Using StatCrunch and/or your TI
Pick one of our class variables from our class survey data set. Create a histogram with appropriately sized bins. Describe the distribution. Calculate the median, quartiles, and interquartile range. Calculate the mean and standard deviation. Decide which measure of center and spread is most appropriate for the data, and explain why.

