Published by Felicia Neal. Modified over 9 years ago.
1
Chapter 4 Displaying and Summarizing Quantitative Data
Display: Histograms, Stem and Leaf Plots
Numerical Summaries: Median, Mean, Quartiles, Standard Deviation
2
Relative Frequency Histogram of Exam Grades
[Figure: relative frequency histogram of exam grades. x-axis: Grade, 40 to 100; y-axis: Relative frequency, .05 to .30.] Sample histogram; do not confuse with a bar chart.
3
Frequency Histograms Sample histogram
4
Frequency Histograms
A histogram shows three general types of information:
* It provides a visual indication of where the approximate center of the data is.
* We can gain an understanding of the degree of spread, or variation, in the data.
* We can observe the shape of the distribution.
Center, spread, shape
5
All 200 m Races 20.2 secs or less
6
Histograms Showing Different Centers
Same shape and spread
7
Histograms - Same Center, Different Spread
8
Frequency and Relative Frequency Histograms
Outline of procedure for constructing a histogram:
* identify the smallest and largest values in the data set
* divide the interval between the largest and smallest values into between 5 and 20 subintervals called classes
  * each data value is in one and only one class
  * no data value is on a boundary
Construction of a histogram is typically automated. It is more important to understand what a histogram tells you about the data than construction minutiae.
9
How Many Classes? Approximations; not one correct answer to number of intervals (classes) to use
10
Histogram Construction (cont.)
* compute frequency or relative frequency of observations in each class * x-axis: class boundaries; y-axis: frequency or relative frequency scale * over each class draw a rectangle with height corresponding to the frequency or relative frequency in that class Tedious, that’s why we automate
11
Example. Number of daily employee absences from work
106 obs; approx. no. of classes:
(2 × 106)^(1/3) = 212^(1/3) = 5.96
1 + log(106)/log(2) = 1 + 6.73 = 7.73
There is no single “correct” answer for the number of classes. For example, you can choose 6, 7, 8, or 9 classes; don’t choose 15 classes.
Data on p. 5 of coursepack and in Excel file
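The two rules of thumb used for the absences data can be checked with a short Python sketch. The function name `suggested_class_counts` is made up for illustration; the formulas are the two shown on the slide, and the results are only rough guides.

```python
import math

def suggested_class_counts(n):
    """Return two rough guides for the number of histogram classes."""
    cube_root_rule = (2 * n) ** (1 / 3)        # (2n)^(1/3)
    log_rule = 1 + math.log(n) / math.log(2)   # 1 + log(n)/log(2)
    return cube_root_rule, log_rule

cube_root_rule, log_rule = suggested_class_counts(106)
print(round(cube_root_rule, 2), round(log_rule, 2))  # 5.96 7.73
```

Both values are approximations, so any whole number of classes in roughly the 6 to 9 range is reasonable here.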
12
EXCEL Histogram Histogram produced by Excel (after a few modifications)
13
Absences from Work (cont.)
6 classes
class width: (158 − 121)/6 = 37/6 ≈ 6.17; round up to 7
6 classes, each of width 7; classes span 6(7) = 42 units
data spans 158 − 121 = 37 units
classes overlap the span of the actual data values by 42 − 37 = 5
lower boundary of 1st class: (1/2)(5) units below 121 = 121 − 2.5 = 118.5
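A minimal Python sketch of the same boundary calculation follows. It assumes the smallest absence count is 121 and the largest is 158 (the largest value is inferred from the smallest value plus the 37-unit span), and it assumes the raw class width is rounded up to a whole number. The function name `class_boundaries` is hypothetical.

```python
import math

def class_boundaries(smallest, largest, num_classes):
    """Center equal-width classes over the data and return their boundaries."""
    span = largest - smallest
    width = math.ceil(span / num_classes)   # assumption: round the raw width up
    overlap = num_classes * width - span    # how far the classes overhang the data
    start = smallest - overlap / 2          # split the overhang evenly at both ends
    return [start + i * width for i in range(num_classes + 1)]

boundaries = class_boundaries(121, 158, 6)
print(boundaries)  # [118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5]
```

Because the boundaries end in .5 and the data are whole numbers, no data value falls on a boundary.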
14
EXCEL histogram
15
Grades on a statistics exam
Data:
16
Frequency Distribution of Grades
Class Limits    Frequency
40 up to 50     2
50 up to 60     6
60 up to 70     8
70 up to 80     7
80 up to 90     5
90 up to 100    2
Total           30
17
Relative Frequency Distribution of Grades
Class Limits    Relative Frequency
40 up to 50     2/30 = .067
50 up to 60     6/30 = .200
60 up to 70     8/30 = .267
70 up to 80     7/30 = .233
80 up to 90     5/30 = .167
90 up to 100    2/30 = .067
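Relative frequencies are just class counts divided by the total. A small Python sketch, assuming the grade frequencies are 2, 6, 8, 7, 5, 2 (the last count is inferred from the total of 30); the function name is made up for illustration.

```python
def relative_frequencies(freqs):
    """Divide each class count by the total number of observations."""
    total = sum(freqs.values())
    return {label: count / total for label, count in freqs.items()}

grade_counts = {
    "40 up to 50": 2,
    "50 up to 60": 6,
    "60 up to 70": 8,
    "70 up to 80": 7,
    "80 up to 90": 5,
    "90 up to 100": 2,   # assumption: inferred so the counts sum to 30
}

rel = relative_frequencies(grade_counts)
for label, value in rel.items():
    print(f"{label:<14}{value:.3f}")
```

The relative frequencies always sum to 1, which is a quick check on the arithmetic.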
18
Relative Frequency Histogram of Grades
[Figure: relative frequency histogram of grades. x-axis: Grade, 40 to 100; y-axis: Relative frequency, .05 to .30.] Frequency and relative frequency histograms of the same data have the same shape.
19
Based on the histogram, about what percent of the values are between 47.5 and 52.5?
50%   5%   17%   30%
20
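Stem and leaf displays
Stem and leaf displays have the following general appearance: a column of stems, each followed by its row of leaves. You probably haven’t seen one before now; here’s what one looks like. [Figure: example stem and leaf display.]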
21
Stem and Leaf Displays
Partition each number in the data into a “stem” and a “leaf”.
Constructing a stem and leaf display:
1) determine the stem and leaf partition (5-20 stems)
2) write the stems in a column with the smallest stem at the top; include all stems in the range of the data
3) only 1 digit in leaves; drop digits or round off
4) record the leaf for each number in the corresponding stem row; ordering the leaves in each row helps
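The four construction steps can be sketched in Python. This is a minimal illustration, assuming non-negative whole numbers with the 10's digit(s) as stem and the 1's digit as leaf; the function name and the ages below are made up, not the data from the slides.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Build display lines: stem = tens digit(s), leaf = ones digit."""
    rows = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v, 10)          # step 1: partition each number
        rows[stem].append(leaf)
    lines = []
    # step 2: smallest stem first, and every stem in the range gets a row
    for stem in range(min(rows), max(rows) + 1):
        # step 4: record the leaves, in ascending order within the row
        leaves = "".join(str(d) for d in sorted(rows.get(stem, [])))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 64, 95]   # hypothetical ages
lines = stem_and_leaf(ages)
print("\n".join(lines))
```

Stems 7 and 8 appear with empty rows, which keeps the gap before the value 95 visible.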
22
Example: employee ages at a small company
Stem: 10’s digit; leaf: 1’s digit. For example, 18: stem = 1; leaf = 8; 18 = 1 | 8. When constructing the display, order the leaves in each stem row. [Figure: stem and leaf display of the employee ages.]
23
Suppose a 95 yr. old is hired
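[Figure: the employee-age display extended to stem 9; the last rows are 6 | 4, 7 |, 8 |, 9 | 5.] Include all stems in the range of the data, even those with no leaves.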
24
Number of TD passes by NFL teams: 2010 season (stems are 10’s digit)
[Figure: stem and leaf display of TD passes.] Smallest number? Largest number?
25
Pulse Rates n = 138. Ignore the circles showing in the graphic.
Stem rows marked * have leaves 0 through 4. Note that the leaves in each row are ordered.
26
Advantages/Disadvantages of Stem-and-Leaf Displays
Advantages
1) each measurement displayed
2) ascending order in each stem row
3) relatively simple (data set not too large)
Disadvantages
display becomes unwieldy for large data sets
27
Population of 185 US cities with between 100,000 and 500,000
Multiply stems by 100,000; for example, 3|6 = 360,000. Each leaf should be just 1 digit to keep the display simple; sometimes data has to be rounded or truncated. Note the four rows for each stem value so the display is not too wide.
28
Back-to-back stem-and-leaf displays
TD passes by NFL teams: 1999, 2009; multiply stems by 10. [Figure: back-to-back stem-and-leaf display with the 1999 and 2009 leaves on opposite sides of a shared stem column.]
29
Below is a stem-and-leaf display for the pulse rates of 24 women at a health clinic. How many pulses are between 67 and 77? Stems are 10’s digits. [Figure: stem-and-leaf display of the 24 pulse rates.]
4   6   8   10   12
30
Interpreting Graphical Displays: Shape
Symmetric distribution: A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.
Skewed distribution: A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.
Complex, multimodal distribution: Not all distributions have a simple overall shape, especially when there are few observations.
31
Shape (cont.): Female heart attack patients in New York state
Age: left-skewed Cost: right-skewed
32
Shape (cont.): Outliers
An important kind of deviation is an outlier. Outliers are observations that lie outside the overall pattern of a distribution. Always look for outliers and try to explain them. A large gap in the distribution is typically a sign of an outlier.
[Figure: histogram of the 50 US states grouped by the percentage of residents aged 65 or over, with Alaska and Florida labeled.] The overall pattern is fairly symmetrical except for 2 states clearly not belonging to the main trend. Alaska and Florida have unusual representation of the elderly in their population.
This is from the book. Imagine you are doing a study of health care in the 50 US states and need to know how they differ in terms of their elderly population. This is a histogram of the number of states grouped by the percentage of their residents that are 65 or over. There is one very small value and one very large value, each separated by a gap from the rest of the distribution. Values that fall outside the overall pattern are called outliers. They might be interesting, or they might be mistakes (I get those in my data from typos when entering RNA sequence data into the computer). They might only indicate that you need more samples. We will pay a lot of attention to outliers throughout the class, both for what we can learn about biology and because they can cause trouble with your statistics. Guess which states they are (Florida and Alaska).
33
Center: typical value of frozen personal pizza? ~$2.65
34
Spread: fuel efficiency 4, 8 cylinders
4 cylinders: more spread 8 cylinders: less spread
35
Other Graphical Methods for Economic Data
Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis.
Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)
36
Unemployment Rate, by Educational Attainment
37
Water Use During Super Bowl
38
Winning Times 100 M Dash
39
Annual Mean Temperature
40
End of Histograms, Stem and Leaf plots
41
Describing Distributions Numerically: Medians and Quartiles
42
2 characteristics of a data set to measure
center: measures where the “middle” of the data is located
variability: measures how “spread out” the data is
43
The median: a measure of center
Given a set of n measurements arranged in order of magnitude,
Median = middle value, if n is odd
Median = mean of the 2 middle values, if n is even
Ex. 2, 4, 6, 8, 10; n = 5; median = 6
Ex. 2, 4, 6, 8; n = 4; median = (4 + 6)/2 = 5
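The odd/even rule translates directly into code. A minimal Python sketch (Python's standard library already offers `statistics.median`; the hand-written version below is only meant to show the rule):

```python
def median(values):
    """Middle value if n is odd; mean of the 2 middle values if n is even."""
    ordered = sorted(values)        # the rule applies to the ordered data
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 4, 6, 8, 10]))  # 6
print(median([2, 4, 6, 8]))      # 5.0
```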
44
Student Pulse Rates (n=62)
38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70, 70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76, 77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87, 90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103 Median = (75+76)/2 = 75.5
45
Medians are used often
Year 2011 baseball salaries: median $1,450,000 (max = $32,000,000, Alex Rodriguez; min = $414,000)
Median fan age: MLB 45; NFL 43; NBA 41; NHL 39
Median existing home sales price: May 2011 $166,500; May 2010 $174,600
Median household income (2008 dollars): 2009 $50,221; 2008 $52,029
46
The median splits the histogram into 2 halves of equal area
47
Examples
Example: n = 7 (ordered): m = 14.1
Example: n = 8 (ordered): m = ( )/2 = 15.8
48
Below are the annual tuition charges at 7 public universities
What is the median tuition?
4429 4960 4971 5245 5546 7586
Answer choices: 5245   4965.5   4960   4971
49
Below are the annual tuition charges at 7 public universities
What is the median tuition?
4429 4960 5245 5546 4971 5587 7586
Answer choices: 5245   4965.5   5546   4971
50
The range and interquartile range
Measures of Spread The range and interquartile range
51
Ways to measure variability
range = largest − smallest
OK sometimes; in general, too crude; sensitive to one large or small data value.
The range measures spread by examining the ends of the data.
A better way to measure spread is to examine the middle portion of the data.
52
Quartiles: Measuring spread by examining the middle
The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (Q1 is the median of the lower half of the sorted data). The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (Q3 is the median of the upper half of the sorted data).
[Figure: sorted data with Q1 = first quartile = 2.3, m = median = 3.4, Q3 = third quartile = 4.2.]
We start with a very general way to describe spread that works whether or not the distribution is symmetric: quartiles. As the word suggests (think quarters or quartets), quartiles divide the distribution into 4 parts. To get the median, we divided the data into two parts; to get the quartiles, we do the same thing to each of the two halves, using the same rules as for the median when the number of observations is even or odd. Now, what can we do with these that helps us understand the biology of these diseases?
53
Quartiles and median divide data into 4 pieces
[Figure: Q1, the median M, and Q3 divide the ordered data into 4 pieces, each containing 1/4 of the observations.]
54
Quartiles are common measures of spread
University of Southern California UNC-CH
55
Rules for Calculating Quartiles
Step 1: find the median of all the data (the median divides the data in half)
Step 2a: find the median of the lower half; this median is Q1
Step 2b: find the median of the upper half; this median is Q3
Important: when n is odd, include the overall median in both halves; when n is even, do not include the overall median in either half.
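The quartile rules above can be sketched in Python as follows. The function name `quartiles` and the sample data are made up for illustration. Other software (Excel, calculators, statistics packages) may use slightly different quartile conventions, so results can differ a little from this rule.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the halves rule from the slides."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:
        lower = ordered[: half + 1]   # n odd: overall median goes in both halves
        upper = ordered[half:]
    else:
        lower = ordered[:half]        # n even: halves do not share a value
        upper = ordered[half:]
    return median(lower), median(ordered), median(upper)

data = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]   # hypothetical data, n = 10
print(quartiles(data))                   # (6, 11.0, 16)
print(quartiles([1, 2, 3, 4, 5, 6, 7]))  # (2.5, 4, 5.5)
```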
56
Example: n = 10. Median m = (10 + 12)/2 = 22/2 = 11. Q1: median of lower half, Q1 = 6. Q3: median of upper half, Q3 = 16.
57
Pulse Rates n = 138
Median: mean of pulses in locations 69 & 70: median = (70 + 70)/2 = 70
Q1: median of lower half (lower half = 69 smallest pulses); Q1 = pulse in ordered position 35; Q1 = 63
Q3: median of upper half (upper half = 69 largest pulses); Q3 = pulse in position 35 from the high end; Q3 = 78
58
Below are the weights of 31 linemen on the NCSU football team
What is the value of the first quartile Q1?
 #    stem  leaf
 2     22   55
 4     23   57
 6     24   26
 7     25   ?
10     26   257
12     27   59
(4)    28   1567
15     29   35599
10     30   333
 7     31   45
 5     32   155
 2     33   1
 1     34   ?
Answer choices: 287   257.5   263.5   262.5
59
Interquartile range
lower quartile: Q1; middle quartile: median; upper quartile: Q3
interquartile range (IQR): IQR = Q3 – Q1; measures the spread of the middle 50% of the data
60
Example: beginning pulse rates
Q3 = 78; Q1 = 63 IQR = 78 – 63 = 15
61
Below are the weights of 31 linemen on the NCSU football team
The first quartile Q1 is ( ). What is the value of the IQR?
 #    stem  leaf
 2     22   55
 4     23   57
 6     24   26
 7     25   ?
10     26   257
12     27   59
(4)    28   1567
15     29   35599
10     30   333
 7     31   45
 5     32   155
 2     33   1
 1     34   ?
Answer choices: 23.5   39.5   46   69.5
62
5-number summary of data
Minimum, Q1, median, Q3, maximum. [Figure: 5-number summary of the pulse data.]
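As an illustration, here is a Python sketch that computes a 5-number summary and the IQR. It uses the 62 student pulse rates listed earlier (not the n = 138 data set), and it assumes the halves rule for quartiles given earlier; the function name is hypothetical.

```python
from statistics import median

pulses = [38, 59, 60, 60, 62, 62, 63, 63, 64, 64, 65, 67, 68, 70, 70, 70,
          70, 70, 70, 70, 71, 71, 72, 72, 73, 74, 74, 75, 75, 75, 75, 76,
          77, 77, 77, 77, 78, 78, 79, 79, 80, 80, 80, 84, 84, 85, 85, 87,
          90, 90, 91, 92, 93, 94, 94, 95, 96, 96, 96, 98, 98, 103]

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[: half + n % 2]   # includes the overall median only when n is odd
    upper = ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

summary = five_number_summary(pulses)
iqr = summary[3] - summary[1]         # IQR = Q3 - Q1
print(summary, iqr)                   # (38, 70, 75.5, 85, 103) 15
```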
63
End of Medians and Quartiles
64
Numerical Summaries of Symmetric Data.
Measure of Center: Mean
Measure of Variability: Standard Deviation
65
Symmetric Data Body temp. of 93 adults
66
Recall: 2 characteristics of a data set to measure
center: measures where the “middle” of the data is located
variability: measures how “spread out” the data is
67
Measure of Center When Data Approx. Symmetric
mean (arithmetic mean) notation
69
Connection Between Mean and Histogram
A histogram balances when supported at the mean. Mean x̄ = 140.6
70
Mean: balance point Median: 50% area each half right histo: mean 55
Mean: balance point. Median: 50% of the area in each half. Right histogram: mean about 55 yrs, median 57.7 yrs.
71
Properties of Mean, Median
1. The mean and median are unique; that is, a data set has only 1 mean and 1 median (the mean and median are not necessarily equal). 2. The mean uses the value of every number in the data set; the median does not.
72
Example: class pulse rates
73
2010, 2011 baseball salaries
2010: n = 845; mean = $3,297,828; median = $1,330,000; max = $33,000,000
2011: n = 848; mean = $3,305,393; median = $1,450,000; max = $32,000,000
74
Disadvantage of the mean
Can be greatly influenced by just a few observations that are much greater or much smaller than the rest of the data
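A quick Python sketch of this sensitivity, using made-up salary figures (in $1000s) rather than the baseball data: adding one very large value moves the mean a lot and the median only a little.

```python
from statistics import mean, median

salaries = [40, 45, 50, 55, 60]        # hypothetical values, in $1000s
with_outlier = salaries + [1000]       # add one extreme salary

print(mean(salaries), median(salaries))                        # 50 50
print(round(mean(with_outlier), 1), median(with_outlier))      # 208.3 52.5
```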
75
Mean, Median, Maximum BB Salaries
76
Skewness: comparing the mean and median
Skewed to the right (positively skewed) mean>median
77
Skewed to the left; negatively skewed
Mean < median mean=78; median=87;
78
Symmetric data mean, median approx. equal
79
Describing Variability of symmetric data
80
Describing Symmetric Data (cont.)
Measure of center for symmetric data: the mean x̄. Measure of variability for symmetric data?
81
Example, 2 data sets: x1 = 49, x2 = 51, x̄ = 50; y1 = 0, y2 = 100, ȳ = 50
82
On average, they’re both comfortable
0 100
83
Ways to measure variability
1. range=largest-smallest ok sometimes; in general, too crude; sensitive to one large or small obs.
84
Previous Example
85
The Sample Standard Deviation, a measure of spread around the mean
Square the deviation of each observation from the mean; find the square root of the “average” of these squared deviations
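The verbal recipe can be sketched in Python. The “average” here divides by n − 1, as in the calculation on the next slide; the function name `sample_std` is made up. The two small data sets are the ones from the earlier example (49, 51 and 0, 100), which share a mean of 50 but differ greatly in spread.

```python
import math

def sample_std(values):
    """Sample standard deviation: sqrt of sum of squared deviations / (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    squared_devs = [(v - mean) ** 2 for v in values]   # square each deviation
    variance = sum(squared_devs) / (n - 1)             # "average" with n - 1
    return math.sqrt(variance)

print(round(sample_std([49, 51]), 2))   # 1.41
print(round(sample_std([0, 100]), 2))   # 70.71
```

Python's standard library function `statistics.stdev` computes the same quantity.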
86
Calculations … Mean = 63.4 Sum of squared deviations from mean = 85.2
Women height (inches)
Mean = 63.4
Sum of squared deviations from mean = 85.2
(n − 1) = 13; (n − 1) is called the degrees of freedom (df)
s² = variance = 85.2/13 = 6.55 inches squared
s = standard deviation = √6.55 = 2.56 inches
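The last two steps of the height calculation can be reproduced from the summary numbers alone. A short Python sketch, assuming n = 14 (which follows from n − 1 = 13); the variable names are illustrative.

```python
import math

sum_sq_dev = 85.2      # sum of squared deviations from the mean (from the slide)
n = 14                 # assumption: follows from (n - 1) = 13
df = n - 1             # degrees of freedom

variance = sum_sq_dev / df        # s^2, in inches squared
std_dev = math.sqrt(variance)     # s, in inches

print(round(variance, 2), round(std_dev, 2))  # 6.55 2.56
```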
87
2. Then take the square root to get the standard deviation s.
1. First calculate the variance s². 2. Then take the square root to get the standard deviation s. [Figure: distribution with mean ± 1 s.d. marked.] We’ll never calculate these by hand, so make sure you know how to get the standard deviation using your calculator, Excel, or other software.
88
Population Standard Deviation
89
Remarks 1. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement
90
Remarks (cont.) 2. Note that s and σ are always greater than or equal to zero. 3. The larger the value of s (or σ), the greater the spread of the data. When does s = 0? When does σ = 0? When all data values are the same.
91
Remarks (cont.) 4. The standard deviation is the most commonly used measure of risk in finance and business (stocks, mutual funds, etc.). 5. Variance: s² = sample variance; σ² = population variance. Units are squared units of the original data: square $, square gallons??
92
Remarks 6): Why divide by n-1 instead of n?
degrees of freedom: each observation has 1 degree of freedom; however, when you estimate an unknown population parameter like μ, you lose 1 degree of freedom
93
Remarks 6) (cont.): Why divide by n-1 instead of n? Example
Suppose we have 3 numbers whose average is 9. Choose ANY values for x1 and x2. Since the average (mean) is 9, x1 + x2 + x3 must equal 9*3 = 27, so x3 = 27 – (x1 + x2). Once we selected x1 and x2, x3 was determined since the average was 9: 3 numbers but only 2 “degrees of freedom”.
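A tiny Python sketch of the same argument: with the mean fixed at 9, any choice of x1 and x2 forces x3. The function name and the chosen values 10 and 5 are arbitrary illustrations.

```python
def third_value(x1, x2, mean=9, n=3):
    """With the mean fixed, the last value is determined by the others."""
    return mean * n - (x1 + x2)    # x3 = 27 - (x1 + x2)

x1, x2 = 10, 5                     # any two values will do
x3 = third_value(x1, x2)
print(x3, (x1 + x2 + x3) / 3)      # 12 9.0
```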
94
Computational Example
95
class pulse rates
96
Review: Properties of s and σ
s and σ are always greater than or equal to 0. When does s = 0? σ = 0? The larger the value of s (or σ), the greater the spread of the data. The standard deviation of a set of measurements is an estimate of the likely size of the chance error in a single measurement.
97
Summary of Notation
98
End of Chapter 4