Chapter 3 Displaying and Summarizing Quantitative Data


1 Chapter 3 Displaying and Summarizing Quantitative Data
CHAPTER OBJECTIVES
At the conclusion of this chapter you should be able to:
1) Construct graphs that appropriately describe quantitative data.
2) Calculate and interpret numerical summaries of quantitative data.
3) Combine numerical methods with graphical methods to analyze a data set.
4) Apply graphical methods of summarizing data to choose appropriate numerical summaries.
5) Apply software and/or calculators to automate graphical and numerical summary procedures.

2 Warmup Based on the histogram, about what percent of the values are between 47.5 and 52.5?

3 Displaying Quantitative Data
Histograms
Stem and Leaf Displays

4 Frequency Histogram

5 Relative Frequency Histogram of Exam Grades
[Figure: relative frequency histogram; horizontal axis: Grade (40 to 100); vertical axis: Relative frequency (.05 to .30)]
Sample histogram; do not confuse with a bar chart.

6 Histograms A histogram shows three general types of information:
1) Center: it provides a visual indication of where the approximate center of the data is.
2) Spread: we can gain an understanding of the degree of spread, or variation, in the data.
3) Shape: we can observe the shape of the distribution.

7 Histograms Showing Different Centers (football head coach salaries)

8 Histograms - Same Center, Different Spread (football head coach salaries)

9 Histograms: Shape
Symmetric distribution: A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.
Skewed distribution: A distribution is skewed to the right if the right side of the histogram (the side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.
Complex, multimodal distribution: Not all distributions have a simple overall shape, especially when there are few observations.

10 Shape (cont.): Female heart attack patients in New York state
Age: left-skewed Cost: right-skewed

11 Shape (cont.): outliers All 200 m Races, 20.2 secs or less

12 Shape (cont.): Outliers
An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them. A large gap in the distribution is typically a sign of an outlier.
Example (from the book): the overall pattern is fairly symmetric except for 2 states that clearly do not belong to the main pattern. Alaska and Florida have unusual proportions of elderly residents in their populations.
Imagine you are doing a study of health care in the 50 US states and need to know how they differ in terms of their elderly population. This is a histogram of the number of states grouped by the percentage of their residents who are 65 or over. You can see there is one very small value and one very large value, each separated by a gap from the rest of the distribution. Values that fall outside the overall pattern are called outliers. They might be interesting, or they might be mistakes; I get those in my data from typos made when entering RNA sequence data into the computer. They might only indicate that you need more samples. We will be paying a lot of attention to outliers throughout the class, both for what we can learn about biology and because they can cause trouble with your statistics. Guess which states they are (Florida and Alaska).

13 Excel Example: 2016 NFL Salaries

14 Statcrunch Example: 2016 NFL Salaries

15 Heights of Students in Recent Stats Class (Bimodal)

16 Grades on a statistics exam
Data:

17 Frequency Distribution of Grades
Class Limits: 40 up to 50; 50 up to 60; 60 up to 70; 70 up to 80; 80 up to 90; 90 up to 100
Frequency (as listed): 2, 6, 8, 7, 5
Total: 30

18 Relative Frequency Distribution of Grades
Class Limits: 40 up to 50; 50 up to 60; 60 up to 70; 70 up to 80; 80 up to 90; 90 up to 100
Relative Frequency (as listed): 2/30 = .067; 6/30 = .200; 8/30 = .267; 7/30 = .233; 5/30 = .167
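The relative frequencies on this slide are each class frequency divided by the total of 30. A minimal Python sketch of that arithmetic follows; the variable names are illustrative, and only the five frequencies that survive in this transcript are used.

```python
# Class frequencies as listed on the slide; the reported total is 30.
listed_counts = [2, 6, 8, 7, 5]
total = 30

# Relative frequency = class frequency / total number of observations.
relative = [round(count / total, 3) for count in listed_counts]

for count, rel in zip(listed_counts, relative):
    print(f"{count}/{total} = {rel:.3f}")
```

Because every frequency is divided by the same total, the bar heights keep the same proportions, which is why the frequency and relative frequency histograms have the same shape.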

19 Relative Frequency Histogram of Grades
[Figure: relative frequency histogram; horizontal axis: Grade (40 to 100); vertical axis: Relative frequency (.05 to .30)]
Frequency and relative frequency histograms of the same data will have the same shape.

20 Recall: Warmup Based on the histogram, about what percent of the values are between 47.5 and 52.5?
Answer: 17% (interval from 47.5 to 52.5)

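21 Stem and leaf displays
Stem and leaf displays have the following general appearance: a column of stems, with the leaves for each stem listed in the row beside it (for example, 6 | 4).
You probably haven’t seen one before now; here’s what one looks like.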

22 Stem and Leaf Displays
Partition each number in the data into a “stem” and a “leaf”.
Constructing a stem and leaf display:
1) Determine the stem and leaf partition (5-20 stems).
2) Write the stems in a column with the smallest stem at the top; include all stems in the range of the data.
3) Use only 1 digit in the leaves; drop digits or round off.
4) Record the leaf for each number in the corresponding stem row; ordering the leaves in each row helps.
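The construction steps above can be sketched in Python. This is only an illustration under stated assumptions: the function name `stem_and_leaf` is made up here, the data are non-negative whole numbers, the stem is the 10's digit and the leaf is the 1's digit, and the ages in the usage example are hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display rows, assuming stem = 10's digit(s) and leaf = 1's digit."""
    rows = defaultdict(list)
    for v in sorted(values):            # sorting first keeps each row's leaves ordered
        stem, leaf = divmod(v, 10)      # step 1: partition each number
        rows[stem].append(leaf)         # step 4: record the leaf in its stem row
    lines = []
    # Step 2: smallest stem at the top, and every stem in the range is included,
    # even when it has no leaves.
    for stem in range(min(rows), max(rows) + 1):
        leaves = "".join(str(leaf) for leaf in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

# Hypothetical employee ages, including a 95-year-old.
for line in stem_and_leaf([18, 95, 64, 22, 21]):
    print(line)
```

Stems 3, 4, 5, 7 and 8 are printed with no leaves, which makes the gap before the 95-year-old visible.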

23 Example: employee ages at a small company
Stem: 10’s digit; leaf: 1’s digit.
Example: 18 has stem = 1 and leaf = 8, so 18 = 1 | 8; likewise 64 = 6 | 4.
Constructing the display: order the leaves in each stem row.

24 Suppose a 95 yr. old is hired
stem | leaf
...
6 | 4
7 |
8 |
9 | 5
Include all stems in the range of the data.

25 Number of TD passes by NFL teams: 2012-2013 season (stems are 10’s digit)
[Figure: stem-and-leaf display of TD passes]
Smallest number? Largest number?

26 Pulse Rates n = 138
Ignore the circles showing in the graphic.
* Stem rows have leaves 0 through 4. Note that the leaves in each row are ordered.

27 Advantages/Disadvantages of Stem-and-Leaf Displays
Advantages
1) each measurement is displayed
2) ascending order in each stem row
3) relatively simple (when the data set is not too large)
Disadvantages
The display becomes unwieldy for large data sets.

28 Population of 185 US cities with between 100,000 and 500,000
Multiply stems by 100,000; for example, 3|6 = 360,000.
Each leaf should be just 1 digit to keep the display simple; sometimes data has to be rounded or truncated.
Note the four rows for each stem value, so the display is not too wide.

29 Back-to-back stem-and-leaf displays
TD passes by NFL teams; multiply stems by 10.
[Figure: back-to-back stem-and-leaf display]

30 Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77? Stems are 10’s digits 4 6 8 10 12

31 Other Graphical Methods for Data
Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis.
** Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)
Heat maps, word walls

32 Unemployment Rate, by Educational Attainment

33 Water Use During Super Bowl XLV (Packers 31, Steelers 25)

34 Heat Maps

35 Word Wall (customer feedback)

36 Numerical Summaries of Quantitative Data
Numerical and More Graphical Methods to Describe Univariate Data

37 2 characteristics of a data set to measure
Center: measures where the “middle” of the data is located.
Variability: measures how “spread out” the data is.

38 Warmup Six people in a room have a median age of 45 years and mean age of 45 years. One person who is 40 years old leaves the room. Questions: What is the median age of the 5 people remaining in the room? What is the mean age of the 5 people remaining in the room?

39 The median: a measure of center
Given a set of n measurements arranged in order of magnitude,
Median = middle value, if n is odd
Median = mean of the 2 middle values, if n is even
Ex. 2, 4, 6, 8, 10; n=5; median=6
Ex. 2, 4, 6, 8; n=4; median=(4+6)/2=5

40 Student Pulse Rates (n=62)
38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76, 77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87, 90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103
Median = (75+76)/2 = 75.5
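As a check on the student pulse rate example, here is a short Python sketch of the median rule (middle value for odd n, mean of the 2 middle values for even n). The helper name `median` is illustrative; the standard library's `statistics.median` is used only as a cross-check.

```python
import statistics

# Student pulse rates from the slide (n = 62), in ascending order.
pulses = [38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70,
          70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76,
          77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87,
          90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103]

def median(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2

print(len(pulses))                 # 62
print(median(pulses))              # 75.5
print(statistics.median(pulses))   # 75.5
```

With n = 62 the two middle values sit in ordered positions 31 and 32, which are 75 and 76.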

41 Medians are used often Year 2016 baseball salaries
Year 2016 baseball salaries: median $1,450,000 (max=$33,000,000 Clayton Kershaw; min=$500,000)
Median fan age: MLB 45; NFL 43; NBA 41; NHL 39
Median existing home sales price: May 2011 $166,500; May 2010 $174,600
Median household income (2008 dollars): 2009 $50,221; 2008 $52,029

42 The median splits the histogram into 2 halves of equal area

43 Examples
Example: n = 7 (ordered): m = 14.1
Example: n = 8 (ordered): m = ( )/2 = 15.8

44 Below are the annual tuition charges at 7 public universities
What is the median tuition? 4429 4960 4971 5245 5546 7586 5245 4965.5 4960 4971

45 Below are the annual tuition charges at 7 public universities
What is the median tuition?
Tuition charges: 4429, 4960, 5245, 5546, 4971, 5587, 7586
Answer choices: 5245, 4965.5, 5546, 4971

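46 Recall: Warmup Six people in a room have a median age of 45 years and mean age of 45 years. One person who is 40 years old leaves the room. Questions: What is the median age of the 5 people remaining in the room? What is the mean age of the 5 people remaining in the room?
Median: can’t answer; it depends on the ages of the other people in the room.
Mean: the six ages sum to 6 × 45 = 270, so the five remaining ages sum to 270 − 40 = 230 and their mean is 230/5 = 46.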
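A small Python sketch can make the warmup concrete. The two rooms below are hypothetical age lists, chosen only so that each has median 45, mean 45, and a 40-year-old; the function names are illustrative. The mean of the remaining five is the same in both rooms, while the median is not.

```python
import statistics

def mean_after_leaving(n, mean_age, leaver_age):
    """Mean of the remaining n - 1 ages once one person leaves."""
    total = n * mean_age                      # sum of all n ages
    return (total - leaver_age) / (n - 1)

def without_one(ages, age):
    """Copy of ages with a single occurrence of age removed."""
    rest = list(ages)
    rest.remove(age)
    return rest

# Hypothetical rooms: both have median 45 and mean 45.
room_a = [40, 45, 45, 45, 45, 50]
room_b = [40, 40, 40, 50, 50, 50]

print(mean_after_leaving(6, 45, 40))          # 46.0
for room in (room_a, room_b):
    rest = without_one(room, 40)
    print(statistics.mean(rest), statistics.median(rest))
```

The mean depends only on the total, so it can be recomputed from the summary alone; the median depends on the individual values, which the summary does not give.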

47 The range and interquartile range
Measures of Spread The range and interquartile range

48 Ways to measure variability
range = largest − smallest
OK sometimes; in general, too crude; sensitive to one large or small data value.
The range measures spread by examining the ends of the data.
A better way to measure spread is to examine the middle portion of the data.

49 Quartiles: Measuring spread by examining the middle
The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data).
The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).
Q1 = first quartile = 2.3; m = median = 3.4; Q3 = third quartile = 4.2
We are going to start with a very general way to describe spread, one that works whether or not the distribution is symmetric: quartiles. As the word suggests, quartiles are like quarters or quartets; they divide the distribution into 4 parts. To get the median, we divided the data into two parts. To get the quartiles, we do exactly the same thing to each of the two halves, using the same rules as for the median for an even or odd number of observations. Now, what can we do with these that helps us understand the biology of these diseases?

50 Quartiles and median divide data into 4 pieces
[Figure: Q1, M, and Q3 divide the data into 4 pieces, each containing 1/4 of the data]

51 Quartiles are common measures of spread
NCSU freshman profile.pdf
University of Southern California
Economic Value of College Majors (exec. summ., p.10 or see next slide)

52 The Economic Value of College Majors

53 Rules for Calculating Quartiles
Step 1: find the median of all the data (the median divides the data in half).
Step 2a: find the median of the lower half; this median is Q1.
Step 2b: find the median of the upper half; this median is Q3.
Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.
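The quartile rules above can be sketched in Python as follows. The function names are illustrative, and the test data are the small examples from the median slide. Note that this follows the rule stated on this slide; statistical software often uses other quartile conventions, so results from a package may differ slightly.

```python
def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the rule on this slide."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    if n % 2 == 1:
        lower = data[:half + 1]   # n odd: the overall median is in both halves
        upper = data[half:]
    else:
        lower = data[:half]       # n even: the median is not a data value,
        upper = data[half:]       # so the halves simply split the data
    return median(lower), median(data), median(upper)

print(quartiles([2, 4, 6, 8, 10]))   # (4, 6, 8)
print(quartiles([2, 4, 6, 8]))       # (3.0, 5.0, 7.0)
```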

54 Example: n = 10
Median m = (10+12)/2 = 22/2 = 11
Q1: median of lower half; Q1 = 6
Q3: median of upper half; Q3 = 16

55 Quartile example: odd no. of data values
HR’s hit by Babe Ruth in each season as a Yankee
Ordered values:
Median: value in ordered position 8; median = 46
Lower half (including overall median):
Upper half (including overall median):

56 Pulse Rates n = 138
Median: mean of pulses in locations 69 & 70; median = (70+70)/2 = 70
Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63
Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78

57 Below are the weights of 31 linemen on the NCSU football team
What is the value of the first quartile Q1?
  #   stem | leaf
  2    22  | 55
  4    23  | 57
  6    24  | 26
  7    25  |
 10    26  | 257
 12    27  | 59
 (4)   28  | 1567
 15    29  | 35599
       30  | 333
       31  | 45
  5    32  | 155
       33  | 1
       34  |
Answer choices: 287, 257.5, 263.5, 262.5

58 Interquartile range
lower quartile: Q1
middle quartile: median
upper quartile: Q3
interquartile range (IQR): IQR = Q3 – Q1; measures the spread of the middle 50% of the data

59 Example: beginning pulse rates
Q3 = 78; Q1 = 63 IQR = 78 – 63 = 15

60 Below are the weights of 31 linemen on the NCSU football team
The first quartile Q1 is
What is the value of the IQR?
  #   stem | leaf
  2    22  | 55
  4    23  | 57
  6    24  | 26
  7    25  |
 10    26  | 257
 12    27  | 59
 (4)   28  | 1567
 15    29  | 35599
       30  | 333
       31  | 45
  5    32  | 155
       33  | 1
       34  |
Answer choices: 23.5, 39.5, 46, 69.5

61 5-number summary of data
Minimum Q1 median Q3 maximum Pulse data

62 Boxplot: display of 5-number summary
BOXPLOT
Five-number summary: min, Q1, m, Q3, max
Smallest = min = 0.6; Q1 = first quartile = 2.3; m = median = 3.4; Q3 = third quartile = 4.2; Largest = max = 6.1
Add in one other thing we know about the spread, the largest and smallest values, and make a box plot. Now, why would you want to make one of these?

63 Boxplot: display of 5-number summary
Example: age of 66 “crush” victims at rock concerts 5-number summary:

64 Boxplot construction
1) Construct a box with ends located at Q1 and Q3; in the box mark the location of the median (usually with a line or a “+”).
2) Fences are determined by moving a distance 1.5(IQR) from each end of the box:
2a) the upper fence is 1.5*IQR above the upper quartile;
2b) the lower fence is 1.5*IQR below the lower quartile.
Note: the fences only help with constructing the boxplot; they do not appear in the final boxplot display.

65 Box plot construction (cont.)
3) Whiskers: draw lines from the ends of the box left and right to the most extreme data values found within the fences.
4) Outliers: special symbols represent each data value beyond the fences.
4a) Sometimes a different symbol is used for “far outliers” that are more than 3 IQRs from the quartiles.
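The construction steps can be sketched in Python as a function that returns the pieces needed to draw a boxplot. This is an illustration, not a plotting routine: the name `boxplot_parts` and the dictionary keys are made up here, Q1 and Q3 are passed in rather than computed, and the data list in the usage example is hypothetical (only Q1 = 63 and Q3 = 78 come from the pulse example).

```python
def boxplot_parts(values, q1, q3):
    """Fences, whisker ends and outliers for a boxplot with quartiles q1 and q3."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr          # step 2b
    upper_fence = q3 + 1.5 * iqr          # step 2a
    inside = [x for x in values if lower_fence <= x <= upper_fence]
    outliers = sorted(x for x in values if x < lower_fence or x > upper_fence)
    return {
        "iqr": iqr,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        # Step 3: whiskers stop at the most extreme values within the fences.
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        # Step 4: values beyond the fences are plotted individually.
        "outliers": outliers,
        # Step 4a: "far outliers" lie more than 3 IQRs from the quartiles.
        "far_outliers": [x for x in outliers
                         if x < q1 - 3 * iqr or x > q3 + 3 * iqr],
    }

# Hypothetical pulse values, with Q1 = 63 and Q3 = 78 as in the pulse example.
parts = boxplot_parts([38, 45, 63, 70, 78, 98, 103], q1=63, q3=78)
print(parts["lower_fence"], parts["upper_fence"])   # 40.5 100.5
print(parts["whisker_low"], parts["whisker_high"])  # 45 98
print(parts["outliers"])                            # [38, 103]
```

The fences are returned only because they are needed to find the whisker ends and the outliers; as the slide notes, they are not drawn in the final display.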

66 Boxplot: display of 5-number summary
BOXPLOT
Largest = max = 7.9; Q3 = third quartile = 4.2; Q1 = first quartile = 2.3
Distance to Q3: 7.9 − 4.2 = 3.7
Interquartile range: Q3 – Q1 = 4.2 − 2.3 = 1.9
1.5 * IQR = 1.5*1.9 = 2.85
Individual #25 has a value of 7.9 years, which is 3.7 years above the third quartile. This is more than 2.85 = 1.5*IQR above Q3. Thus, individual #25 is a suspected outlier.

67 ATM Withdrawals by Day, Month, Holidays

68

69 Beg. of class pulses (n=138)
Q1 = 63, Q3 = 78
IQR = 78 − 63 = 15
1.5(IQR) = 1.5(15) = 22.5
Q1 − 1.5(IQR): 63 − 22.5 = 40.5
Q3 + 1.5(IQR): 78 + 22.5 = 100.5
Values marked on the boxplot: 40.5, 63, 70, 78, 100.5, 45

70 Pass Catching Yards by Receivers
Below is a box plot of the yards gained in a recent season by the 136 NFL receivers who gained at least 50 yards. What is the approximate value of Q3?
[Figure: box plot, Pass Catching Yards by Receivers; axis from 136 to 1369]
Answer choices: 450, 750, 215, 545

71 Rock concert deaths: histogram and boxplot

72 Automating Boxplot Construction
Excel “out of the box” does not draw boxplots.
Many add-ins are available on the internet that give Excel the capability to draw box plots.
Statcrunch draws box plots.

73 Statcrunch Boxplot
Largest = max = 7.9; Q3 = third quartile = 4.2; Q1 = first quartile = 2.3

74 Tuition 4-yr Colleges

75 Statcrunch: 2016 NFL Salaries by Position

76 College Football Head Coach Salaries by Conference

77 2017 Major League Baseball Salaries by Team

78 End of Chapter 3, Part 1: General Numerical Summaries
Next: Numerical Summaries of Symmetric Data

