Chapter 41
2 In Chapter 3… … we used stemplots to look at shape, central location, and spread of a distribution. In this chapter we use numerical summaries to look at central location and spread.
Chapter 43 Summary Statistics Central location statistics –Mean –Median –Mode Spread statistics –Range –Interquartile range (IQR) –Variance and standard deviation Shape statistics exist but are seldom used in practice (not covered)
Chapter 44 Notation n sample size X variable (e.g., ages of subjects) x i value for individual i sum all values (“capital sigma”) Example: Let X = AGE (n = 10): x 1 = 21, x 2 = 42, …, x 10 = 52 x i = x 1 + x 2 + … + x 10 = … + 52 = 290
Chapter 45 Central Location: Sample Mean For the data on the previous slide: Most common measure of central location
Chapter 46 Example: Sample Mean The mean is the gravitational center of a batch of numbers
Chapter 47 Gravitational Center A skew tips the distribution causing the mean to shift toward the tail
Chapter 48 Uses of the Sample Mean Predicts value of an observation drawn at random from the sample Predicts value of an observation drawn at random from the population Predicts population mean µ
Chapter 49 Population Mean Same operation as sample mean applied to entire population (N ≡ population size) Not readily (never?) available in practice but conceptually important
Chapter 410 Central Location: Median The value with a depth of (n+1) / 2 When n is even → median is obvious When n is even → average the two middle values Example (below): Depth (M) = (10+1) / 2 = 5.5 Median = Average (27 and 28) = median Average the adjacent values: M = 27.5
Chapter 411 More Examples Example A: Median = 4 Example B: Median = 5 (average of 4 and 6) Example C: Median 2 (Values must be ordered first)
Chapter 412 The Median is Robust This data set has x-bar = 1636: The median is 1614 in both instances, The median is resistant to skews and outlier Same data set with a data entry error highlighted This data has x-bar = 2743
Chapter 413 Mode Most frequent value in the dataset This data set has a mode of 7 {4, 7, 7, 7, 8, 8, 9} This data set has no mode {4, 6, 7, 8} (each point appears once) The mode is useful only in large data sets with repeating values
Chapter 414 Comparison of Mean, Median, Mode Mean gets “pulled” by tail: mean = median → symmetrical mean > median → positive skew mean < median → negative skew
Chapter 415 Spread ≡ extent to which data vary around middle point Site 1| |Site |2| 8|2| 2|3|234 86|3|6689 2|4|0 |4| |5| |5| |6| 8|6| ×10 particulates in air (μg/m 3 ) Sites have similar central locations medians = 36 and 38, respectively Site 1 exhibits much greater spread (visually)
Chapter 416 Spread: Range Range = maximum – minimum Illustrative example: Site 1 range = 68 – 22 = 46 Site 2 range = 40 – 32 = 8 The sample range is not a good measure of spread: tends to underestimate population range Always supplement the range with at least one addition measure of spread Site 1| |Site |2| 8|2| 2|3|234 86|3|6689 2|4|0 |4| |5| |5| |6| 8|6| ×10
Chapter 417 Spread: Interquartile Range Quartile 1 (Q1): marks bottom quarter of data = middle of the lower half of the data set Quartile 3 (Q3): marks top quarter of data = middle of the top half of data set Interquartile Range (IQR) = Q3 – Q1 covers middle 50% of \distribution Q1 median Q3 Q1 = 21, Q3 = 42, and IQR = 42 – 21 = 21
Chapter 418 Five-Point Summary Q0 (the minimum) Q1 (25 th percentile) Q2 (median) Q3 (75 th percentile) Q4 (the maximum) Q1 median Q3 5 point summary = 5, 21, 27.5, 42, 52
Chapter 419 Quartiles: Tukey’s Hinges Data: metabolic rates (cal/day), n = median When n is odd, include the median in both “halves” Bottom half: Top half: Q1 = Q3 = 1729
Chapter 420 §4.6 Boxplots 1.Draw box from Q1 to Q3 2.Draw line for median. 3.Calculate fences: Fence Lower = Q1 – 1.5(IQR) Fence Upper = Q (IQR) 4.Do not draw fences 5.Any values outside the fences (outside values). are plotted separately. 6.Determine most extreme values still inside the fences (inside values) 7.Draw whiskers quartiles to inside values
Chapter 421 Example 1: Boxplot 1.5 pt summary: {5, 21, 27.5, 42, 52}; Box from 21 to 42 with IQR = 42 – 21 = 21. F U = Q (IQR) = 42 + (1.5)(21) = 73.5 F L = Q1 – 1.5(IQR) = 21 – (1.5)(21) = – No values above upper fence None values below lower fence 4.Upper inside value = 52 Lower inside value = 5 Draws whiskers Data: Upper inside = 52 Q3 = 42 Q1 = 21 Lower inside = 5 Q2 = 27.5
Chapter 422 Example 2: Boxplot Data: point summary: 3, 22, 25.5, 29, 51; hinges at 22 and 29 2.IQR = 29 – 22 = 7 F U = Q (IQR) = 29 + (1.5)(7) = 39.5 F L = Q1 – 1.5(IQR) = 22 – (1.5)(7) = One upper outside value (51) One lower outside value (3) 4.Upper inside value is 31 Lower inside value is 21 Draw whiskers
Chapter 423 Example 3: Boxplot Seven metabolic rates (cal / day): point summary: 1362, , 1614, 1729, IQR = 1729 – = F U = Q (IQR) = (279.5) = F L = Q1 – 1.5(IQR) = –1.5(279.5) = None outside 4. Inside values: 1867 and 1362
Chapter 424 Boxplots: Interpretation Central location: position of median and box (IQR) Spread: Hinge-spread (IQR), whisker spread, range Shape: symmetry of median within box and box within whiskers, tail length (kurtosis), outside values
Chapter 425 Spread: Standard Deviation The standard deviation is the most popular measure of spread σ ≡ population standard deviation s = sample standard deviation Based on deviations around the mean
Chapter 426 Deviations Deviation = distance from the mean = This example shows a deviation of −3 for the data point 33 It show a deviation of 4 for data point 40
Chapter 427 “Sum of squares” ObservationDeviationsSq. deviations 36 = = 36 = = 36 = = 36 = = 36 = = 36 = 2 2 2 = 36 = 3 3 2 = 36 = 4 4 2 = 16 SUMS 0 SS = 58
Chapter 428 Sum of Squares (SS), variance (s 2 ), Standard Deviation (s)
Chapter 429 Variance & Standard Deviation Sample variance Standard deviation
Chapter 430 Interpretation of Sample Standard Deviation s Measure spread Estimator of population standard deviation rule (Normal distributions) Chebychev’s rule (all distributions)
Chapter Rule Applies to Normal distributions only! 68% of values within μ ± σ 95% within μ ± 2σ 99.7% within μ ± 3σ Example: Normal distribution with μ = 30 and σ = 10 : 68% of values in 30 ± 10 = 20 to 40 95% in 30 ± (2)(10) = 10 to % in 30 ± (3)(10) = 0 to 60
Chapter 432 Chebychev’s Rule Applies to all distributions At least 75% of the values within μ ± 2σ Example: Distribution with μ = 30 and σ = 10 has at least 75% of the values within 30 ± (2)(10) = 30 ± 20 = 10 to 50
Chapter 433 Rounding There is no single rule for rounding. The number of significant digits should reflect the precision of the measurement U se judgment and “be kind to your reader” Rough guide: carry at least four significant digits during calculations & round as final stepsignificant digits
Chapter 434 Choosing Summary Statistics Always report a measure of central location, a measure of spread, and the sample size Symmetrical distributions report the mean and standard deviation Asymmetrical distributions report the 5- point summaries (or median and IQR)
Chapter 435 Software and Calculators Use ‘em