Download presentation
Presentation is loading. Please wait.
Published byColin Allison Modified over 8 years ago
1
3.1 - 1 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Lecture Slides Elementary Statistics Eleventh Edition and the Triola Statistics Series by Mario F. Triola
2
3.1 - 2 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Chapter 3 Statistics for Describing, Exploring, and Comparing Data 3-1 Review and Preview 3-2 Measures of Center 3-3 Measures of Variation 3-4 Measures of Relative Standing and Boxplots
3
3.1 - 3 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Created by Tom Wegleitner, Centreville, Virginia Section 3-1 Review and Preview
4
3.1 - 4 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Chapter 1 Distinguish between population and sample, parameter and statistic Good sampling methods: simple random sample, collect in appropriate ways Chapter 2 Frequency distribution: summarizing data Graphs designed to help understand data Center, variation, distribution, outliers, changing characteristics over time Review
5
3.1 - 5 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Section 3-2 Measures of Center
6
3.1 - 6 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Basics Concepts of Measures of Center Part 1
7
3.1 - 7 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Measure of Center Measure of Center the value at the center or middle of a data set -- no single way to measure center, there are a number of them.
8
3.1 - 8 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Arithmetic Mean (Mean) the measure of center obtained by adding the values and dividing the total by the number of values What most people call an average.
9
3.1 - 9 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Mathematical notation: Learn right away how to get the mean using your calculators.
10
3.1 - 10 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Advantages Is relatively reliable, means of samples drawn from the same population don’t vary as much as other measures of center Takes every data value into account Mean Disadvantage Is sensitive to every data value, one extreme value can affect it dramatically; is not a resistant measure of center
11
3.1 - 11 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Median Median the middle value when the original data values are arranged in order of increasing (or decreasing) magnitude. -- is the midpoint of a distribution—the number such that half of the observations are smaller and half are larger. often denoted by x (pronounced ‘x-tilde’) ~ is not affected by an extreme value - is a resistant measure of the center
12
3.1 - 12 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Finding the Median 1.If the number of data values is odd, the median is the number located in the exact middle of the list. 2.If the number of data values is even, the median is found by computing the mean of the two middle numbers. First sort the values (arrange them in order), the follow one of these
13
3.1 - 13 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. 5.40 1.10 0.420.73 0.48 1.10 0.66 0.42 0.48 0.660.73 1.10 1.10 5.40 (in order - odd number of values) exact middle MEDIAN is 0.73 5.40 1.10 0.420.73 0.48 1.10 0.42 0.48 0.731.10 1.10 5.40 0.73 + 1.10 2 (in order - even number of values – no exact middle shared by two numbers) MEDIAN is 0.915 unsorted sorted
14
3.1 - 14 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Another example - median 1. Sort observations by size. n = number of observations ______________________________ n = 24 n/2 = 12 Median = (3.3+3.4) /2 = 3.35 2.b. If n is even, the median is the mean of the two middle observations. n = 25 (n+1)/2 = 26/2 = 13 Median = 3.4 2.a. If n is odd, the median is observation (n+1)/2 down the list
15
3.1 - 15 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Say you have five people that are 61, 62, 63, 64, 65, and 90 inches tall. Obviously 90 is the outlier. With it in there, the mean is 67.5, and the median is 63.5 If you take the 90 out, the new mean is 63, and the median is also 63. So basically, it affects mean more because when dealing with median it doesn't matter what the actual value of the outlier is. Whether it was 90 or 130, taking it out would do the same thing to the median. The mean depends on what the outlier is. > x = c(61, 62, 63, 64, 65, 90) > mean(x) [1] 67.5 > median(x) [1] 63.5 > x = c(61, 62, 63, 64, 65) # remove outlier > mean(x) [1] 63 > median(x) [1] 63 Example: Differences between mean and median TI83 – once you entered list with Stat->Edit-> enter data into L1, Stat->Calc->1-Var Stats STATDISK Enter data or open existing data Data->Descriptive Statistics Click on Evaluate
16
3.1 - 16 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. FHEALTH.XLS
17
3.1 - 17 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Mode Mode the value that occurs with the greatest frequency Data set can have one, more than one, or no mode Bimodal two data values occur with the same greatest frequency Multimodalmore than two data values occur with the same greatest frequency No Modeno data value is repeated
18
3.1 - 18 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. a. 5.40 1.10 0.42 0.73 0.48 1.10 b. 27 27 27 55 55 55 88 88 99 c. 1 2 3 6 7 8 9 10 Mode - Examples Mode is 1.10 Bimodal - 27 & 55 No Mode, all occur only once Complex, multimodal distribution Bimodal distribution:
19
3.1 - 19 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Critical Thinking 1) Think about whether the results are reasonable. It does NOT make sense to do numerical calculations (like mean or median) with categorical data like names, labels, etc… Ex: zip codes, party (D,R,I), numbers on athletes t-shorts, letter codes for grades, etc… 2) Think about the method used to collect the sample data. Ex 7, p 89: Consider the following procedure: 1 st find mean per capita income (total amount / # people ) in each State, then average over 50 states. Do we get per capita income in whole US? NO The problem is that some states (like California) have many more people than the others.
20
3.1 - 20 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Beyond the Basics of Measures of Center Part 2
21
3.1 - 21 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Weighted Mean x = w (w x) When data values are assigned different weights, we can compute a weighted mean.
22
3.1 - 22 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Best Measure of Center
23
3.1 - 23 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Symmetric distribution of data is symmetric if the left half of its histogram is roughly a mirror image of its right half Skewed distribution of data is skewed if it is not symmetric and extends more to one side than the other Skewed and Symmetric
24
3.1 - 24 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Skewed to the left (also called negatively skewed) have a longer left tail, mean and median are to the left of the mode Skewed to the right (also called positively skewed) have a longer right tail, mean and median are to the right of the mode Skewed Left or Right
25
3.1 - 25 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. The mean and median cannot always be used to identify the shape of the distribution. For skewed graphs mean and median can be to the left or right of each other depending on the data. Always!!! Look at the plot. Shape of the Distribution Income is the most common example of skewed distribution
26
3.1 - 26 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.Copyright © 2010 Pearson Education Section 3-3 Measures of Variation
27
3.1 - 27 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.Copyright © 2010 Pearson Education Basics Concepts of Measures of Variation Part 1
28
3.1 - 28 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.Copyright © 2010 Pearson Education Definition The standard deviation of a set of sample values, denoted by s, is a measure of variation of values about the mean.
29
3.1 - 29 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.Copyright © 2010 Pearson Education Sample Standard Deviation Formula ( x – x ) 2 n – 1 s =s = Let x = (2, 5,11), xbar = (2+5+11)/3 = 6. Begin with individual deviations (xi – xbar) = -4, -1, 5 Just adding them produces sum( xi – xbar ) = 0 – always! -- positive and negative deviations cancel each other out. Best way is to square them, average, and take square root. -- average deviation from the mean. Why (n-1), not n? -- because with given mean = sum(x)/n, there are only (n-1) independent values (degrees of freedom)
30
3.1 - 30 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.Copyright © 2010 Pearson Education Sample Standard Deviation (Shortcut Formula) n ( n – 1) s =s = n x 2 ) – ( x ) 2
31
3.1 - 31 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Example: list of numbers: x = (1, 3, 4, 6, 9, 19) mean: xbar = sum(x)/n = (1+3+4+6+9+19) / 6 = 42 / 6 = 7 list of deviations: x-xbar = -6, -4, -3, -1, 2, 12 squares of deviations: (x-xbar)^2 = 36, 16, 9, 1, 4, 144 sum of squares of deviations: sum((x-xbar)^2 ) = 36+16+9+1+4+144 = 210 divided by one less than the number of items in the list: 210 / 5 = 42 square root of this number: square root (42) = about 6.48 > x = c(1, 3, 4, 6, 9, 19) > xbar = mean(x) > x-xbar [1] -6 -4 -3 -1 2 12 > (x-xbar)^2 [1] 36 16 9 1 4 144 > sum2 = sum( (x-xbar)^2 ) > sum2 [1] 210 > n = length(x) > n [1] 6 > Var = sum2/(n-1) > Var [1] 42 > var(x) # command from R [1] 42 > Std = sqrt(Var) > Std [1] 6.480741 > sd(x) # command from R [1] 6.480741 TI83 – once you entered list with {1,3,4…} 2 nd STO 2 nd L1, Stat->Calc->1-Var Stats STATDISK Enter data or open existing data Data->Descriptive Statistics Click on Evaluate Excel – see previous slide 16
32
3.1 - 32 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.Copyright © 2010 Pearson Education Standard Deviation - Important Properties The standard deviation is a measure of variation of all values from the mean. The value of the standard deviation s is positive. The value of the standard deviation s can increase dramatically with the inclusion of one or more outliers (data values far away from all others). The units of the standard deviation s are the same as the units of the original data values.
33
3.1 - 33 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.Copyright © 2010 Pearson Education Population Standard Deviation 2 ( x – µ ) N = This formula is similar to the previous formula, but instead, the population mean and population size are used. > # Now assume x gives the population, not a sample > x = c(1, 3, 4, 6, 9, 19) > mu = mean(x) > x-mu [1] -6 -4 -3 -1 2 12 > (x-mu)^2 [1] 36 16 9 1 4 144 > sum( (x-mu)^2 ) [1] 210 > N = length(x) > N [1] 6 > sigma2 = sum( (x-mu)^2 )/N > sigma2 [1] 35 > sigma = sqrt(sigma2) > sigma [1] 5.91608
34
3.1 - 34 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.Copyright © 2010 Pearson Education Population variance: 2 - Square of the population standard deviation Variance The variance of a set of values is a measure of variation equal to the square of the standard deviation. Sample variance: s 2 - Square of the sample standard deviation s
35
3.1 - 35 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.Copyright © 2010 Pearson Education Unbiased Estimator The sample variance s 2 is an unbiased estimator of the population variance 2, which means values of s 2 tend to target the value of 2 instead of systematically tending to overestimate or underestimate 2. Dividing by n – 1 yields better results than dividing by n. It causes s 2 to target 2 whereas division by n causes s 2 to underestimate 2.
36
3.1 - 36 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.Copyright © 2010 Pearson Education Beyond the Basics of Measures of Variation Part 2
37
3.1 - 37 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.Copyright © 2010 Pearson Education Range Rule of Thumb is based on the principle that for many data sets, the vast majority (such as 95%) of sample values lie within two standard deviations of the mean. Informally define usual values in a data set to be those that are typical and not too extreme. Find rough estimates of the minimum and maximum “usual” sample values as follows: Minimum “usual” value (mean) – 2 (standard deviation) Maximum “usual” value (mean) + 2 (standard deviation)
38
3.1 - 38 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Ex: IQ scores. Mean = 100, sd = 15 Min usual value = (mean) – 2 (standard deviation) = 70 Max usual value = (mean) + 2 (standard deviation) = 130
39
3.1 - 39 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.Copyright © 2010 Pearson Education Empirical (or 68-95-99.7) Rule For data sets having a distribution that is approximately bell shaped, the following properties apply: About 68% of all values fall within 1 standard deviation of the mean. About 95% of all values fall within 2 standard deviations of the mean. About 99.7% of all values fall within 3 standard deviations of the mean.
40
3.1 - 40 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.Copyright © 2010 Pearson Education The Empirical Rule Ex: IQ scores. Mean ± 1*sd = (85,115) 68% of IQ scores lie in these bounds
41
3.1 - 41 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.Copyright © 2010 Pearson Education The Empirical Rule Ex: IQ scores. Mean ± 2*sd = (70,130) 95% of IQ scores lie in these bounds
42
3.1 - 42 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.Copyright © 2010 Pearson Education The Empirical Rule Ex: IQ scores. Mean ± 3*sd = (55,145) 99.7% of IQ scores lie in these bounds
43
3.1 - 43 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.Copyright © 2010 Pearson Education Chebyshev’s Theorem The proportion (or fraction) of any set of data (not just normal) lying within K standard deviations of the mean is always at least 1–1/K 2, where K is any positive number greater than 1. For K = 2, at least 3/4 (or 75% not 95% like for normal) of all values lie within 2 standard deviations of the mean. For K = 3, at least 8/9 (or 89%, not 99.7% like for normal) of all values lie within 3 standard deviations of the mean. optional
44
3.1 - 44 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Ex: Consider college salaries, which are not bell-shaped but skewed right. Assume mean = 50, sd = 10 Mean ± 2*sd = (30,70) 75% of all salaries lie in these bounds Mean ± 3*sd = (20,80) 89% of all salaries lie in these bounds optional
45
3.1 - 45 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Section 3-4 Measures of Relative Standing and Boxplots
46
3.1 - 46 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Basics of z Scores, Percentiles, Quartiles, and Boxplots Part 1
47
3.1 - 47 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. z Score (or standardized value) the number of standard deviations that a given value x is above or below the mean Z score
48
3.1 - 48 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Sample Population x – µ z = z-score is measure of position, it describes the location of value in terms of standard deviations relative to the mean. Measures of Position z Score z = x – x s
49
3.1 - 49 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved.
50
3.1 - 50 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Example z-scores can be used to compare values even coming from different populations: # How extreme are individual heigths and weights > DataFrame = read.csv("FHEALTH.csv") > x = DataFrame$HT # there are some NA (Not Available) in the set > x = x[is.na(x) == FALSE] > y = DataFrame$WT y = y[is.na(y) == FALSE] > xbar = mean(x); sx = sd(x); > c(xbar,sx) [1] 63.195000 2.741228 > ybar = mean(y); sy = sd(y); > c(ybar,sy) [1] 146.22000 37.62104 > # Test height 70in and weight 210lb > xtest = 70; ytest = 210; > zx = (xtest - xbar)/sx > zy = (ytest - ybar)/sy > c(zx,zy) [1] 2.482464 1.695328 # Heights have much less variation than weights – as expected I.e 70 inch toll women differs more from average height than 210lb one from average weight. On TI83 or STATDISK, 1 st create descriptive statistics for each data set to find means and standard deviations. Here xbar = 63.2 xsd = 2.7 ybar = 146.2 ysd = 37.6 Then use calculator to compute z-scores. On Excel, computer descriptive statistics then set up formula with =
51
3.1 - 51 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Interpreting Z Scores Before, we defined range rule of thumb to conclude that the value is unusual if it is > (2*standard deviation) away from the mean. => Ordinary values: –2 ≤ z score ≤ 2 Unusual Values: z score 2 Whenever a value is less than the mean, its corresponding z score is negative: > # Test height 60in in the Female Health File, xbar was 63.2, sx = 2.74 > xtest = 60 > zx = (xtest - xbar)/sx > zx [1] -1.165536 --NOT unusual. Note that the fact that zx<0 means that height of 60in is below the mean
52
3.1 - 52 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Percentiles -- are measures of location. There are 99 percentiles denoted P 1, P 2,... P 99, which divide a set of data into 100 groups with about 1% of the values in each group. P 50 -- 50% of data below and 50% above => -- same as median
53
3.1 - 53 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Percentile of value x = 100 number of values less than x total number of values > DataFrame = read.csv("Movies.csv") > x = DataFrame$Budget > x [1] 41.0 20.0 116.0 70.0 75.0 52.0 120.0 65.0 6.5 60.0 125.0 20.0 [13] 5.0 150.0 4.5 7.0 100.0 30.0 225.0 70.0 80.0 40.0 70.0 50.0 [25] 74.0 200.0 113.0 68.0 35.0 72.0 160.0 68.0 29.0 132.0 40.0 > n = length(x) > n [1] 35 > xsorted = sort(x) > xsorted [1] 4.5 5.0 6.5 7.0 20.0 20.0 29.0 30.0 35.0 40.0 40.0 41.0 [13] 50.0 52.0 60.0 65.0 68.0 68.0 70.0 70.0 70.0 72.0 74.0 75.0 [25] 80.0 100.0 113.0 116.0 120.0 125.0 132.0 150.0 160.0 200.0 225.0 > xp = 60 > percentile = sum( xsorted < xp )*100/n # 14 values less than 60 14*100/35 = 40 > percentile [1] 40 Let x = [ 2 5 7 9 10 12 15 16] - already sorted n = 8 - total number of elements xp = 7 – point to investigate percentile = 2*100/8 = 25% Finding the Percentile of a Data Value
54
3.1 - 54 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. n total number of values in the data set k percentile being used L locator that gives the position of a value P k k th percentile L = n k 100 Notation Converting from the kth Percentile to the Corresponding Data Value optional
55
3.1 - 55 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Converting from the kth Percentile to the Corresponding Data Value > xsorted [1] 4.5 5.0 6.5 7.0 20.0 20.0 29.0 30.0 35.0 40.0 40.0 41.0 [13] 50.0 52.0 60.0 65.0 68.0 68.0 70.0 70.0 70.0 72.0 74.0 75.0 [25] 80.0 100.0 113.0 116.0 120.0 125.0 132.0 150.0 160.0 200.0 225.0 > # How to convert given percentile to corresponding data set value > is.wholenumber <- function(x, tol =.Machine$double.eps^0.5) { + abs(x - round(x)) < tol + } > > k = 30 # percentile 30% > L = k*n/100 > L [1] 10.5 > if ( is.wholenumber(L) ) { + Pk = ( xsorted[L] + xsorted[L+1] ) / 2 + } else { + L = ceiling(L) + Pk = xsorted[L] + } > Pk [1] 40 # 30% have budget below 40 million and 70% above If set k=80 in above code, get L = 28 - integer and Pk = 118 – average of 116 and 120. 80% have budget below 118 mil, and 20% above Let x = [ 2 5 7 9 10 12 15 16] - already sorted n = 8 - total number of elements percentile = 40 L = 40*8/100 = 3.2 => round up to 4 => x[4] = 9 optional Excel has a function: =PERCENTILE(A2:A5,0.3)
56
3.1 - 56 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Quartiles Q 1 (First Quartile) separates the bottom 25% of sorted values from the top 75%. Q 2 (Second Quartile) same as the median; separates the bottom 50% of sorted values from the top 50%. Q 3 (Third Quartile) separates the bottom 75% of sorted values from the top 25%. -- measures of location, denoted Q 1, Q 2, and Q 3, which divide a set of data into four groups with about 25% of the values in each group.
57
3.1 - 57 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Q 1, Q 2, Q 3 divide ranked scores into four equal parts Quartiles 25% Q3Q3 Q2Q2 Q1Q1 (minimum)(maximum) (median) Q1 = P25; Q2 = P50; Q3 = P75;
58
3.1 - 58 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. For a set of data, the 5-number summary consists of the: minimum value; the first quartile Q 1 ; the median (or second quartile Q 2 ); the third quartile, Q 3 ; and the maximum value. 5-Number Summary
59
3.1 - 59 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. A boxplot (or box-and-whisker-diagram) is a graph of a data set that consists of a line extending from the minimum value to the maximum value, and a box with lines drawn at the first quartile, Q 1 ; the median; and the third quartile, Q 3. Boxplot > DataFrame = read.csv("FHEALTH.csv") > x = DataFrame$HT > fivenum(x) [1] 57.00 61.35 63.35 64.90 68.00 > boxplot(x,xlab="x",ylab="height") STATDISK: Data->Descriptive Statistics Data->Boxplot Excel and TI instructions p 125 For Excel have to install DDXL package from textbook CD.
60
3.1 - 60 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Boxplots Boxplot of Movie Budget Amounts
61
3.1 - 61 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Boxplots - Normal Distribution Normal Distribution: Heights from a Simple Random Sample of Women
62
3.1 - 62 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Boxplots - Skewed Distribution Skewed Distribution: Salaries (in thousands of dollars) of NCAA Football Coaches
63
3.1 - 63 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. > DataFrame = read.csv("WORDS.csv") > x = c(DataFrame$X1M,DataFrame$X2M,DataFrame$X3M, + DataFrame$X4M,DataFrame$X5M,DataFrame$X6M) > y = c(DataFrame$X1F,DataFrame$X2F,DataFrame$X3F, + DataFrame$X4F,DataFrame$X5F,DataFrame$X6F) > fivenum(x) [1] 695 10009 14290 20565 47016 > fivenum(y) [1] 1674 11010 15917 20571 40055 > boxplot(x,y)
64
3.1 - 64 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. # Ex: Comparing heights of women and men rm( list = ls()) print("---------------------------------------") DataFrame = read.csv("FHEALTH.csv") x = DataFrame$HT DataFrame = read.csv("MHEALTH.csv") y = DataFrame$HT > fivenum(x) [1] 57.00 61.35 63.35 64.90 68.00 > fivenum(y) [1] 61.30 66.30 68.30 70.15 76.20 > boxplot(x,y)
65
3.1 - 65 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Outliers and Modified Boxplots Part 2
66
3.1 - 66 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Outliers An outlier is a value that lies very far away from the vast majority of the other values in a data set. An outlier can have a dramatic effect on the mean. An outlier can have a dramatic effect on the standard deviation. An outlier can have a dramatic effect on the scale of the histogram so that the true nature of the distribution is totally obscured.
67
3.1 - 67 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Outliers for Modified Boxplots For purposes of constructing modified boxplots, we can consider outliers to be data values meeting specific criteria. In modified boxplots, a data value is an outlier if it is : above Q 3 by an amount greater than 1.5 IQR, where Inter-Quartile Range (IQR) = Q 3 – Q 1 below Q 1 by an amount greater than 1.5 IQR or
68
3.1 - 68 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Modified Boxplots Boxplots described earlier are called skeletal (or regular) boxplots. Statistical packages provide modified boxplots which represent outliers as special points.
69
3.1 - 69 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Modified Boxplot Construction A special symbol (such as an asterisk) is used to identify outliers. The solid horizontal line extends only as far as the minimum data value that is not an outlier and the maximum data value that is not an outlier. A modified boxplot is constructed with these specifications:
70
3.1 - 70 Copyright © 2010, 2007, 2004 Pearson Education, Inc. All Rights Reserved. Modified Boxplots - Example Pulse rates of females listed in Data Set 1 in Appendix B.
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.