Chapter 4 Describing Data (Ⅱ): Numerical Measures
Mean/Average Indicator
Deviation Indicator
Dispersion refers to the spread or variability in the data. Measures of dispersion include the following: range, mean deviation, variance, and standard deviation.
Deviation Indicator: Measures of variability
Measures of central location fail to tell the whole story about the distribution. A question of interest remains unanswered: how typical is the average value of all the measurements in the data set? In other words, how spread out are the measurements about the average value?
Observe two hypothetical data sets.
Low variability data set: the average value provides a good representation of the values in the data set.
High variability data set: the same average value does not provide as good a representation of the values in the data set as before.
1. The range
The range of a set of measurements is the difference between the largest and smallest measurements. Its major advantage is the ease with which it can be computed. Its major shortcoming is its failure to provide information on the dispersion of the values between the two end points: how do all the measurements between the smallest and the largest spread out?
[Figure: number line running from the smallest measurement to the largest measurement; the range spans the two end points, and the positions of the measurements in between are unknown.]
Example: The following represents the current year’s Return on Equity of the 25 companies in an investor’s portfolio.
Highest value: 22.1
Lowest value: -8.1
Range = highest value - lowest value = 22.1 - (-8.1) = 30.2
2. Mean deviation
Mean deviation: the arithmetic mean of the absolute values of the deviations from the arithmetic mean:
$MD = \frac{\sum |X - \bar{X}|}{n}$
The main features of the mean deviation are:
All values are used in the calculation.
It is not unduly influenced by large or small values.
The absolute values are generally difficult to work with.
Example: The weights of a sample of crates containing books for the bookstore (in pounds) are: 103, 97, 101, 106, 103. Find the mean deviation.
$\bar{X} = 510/5 = 102$
$MD = (1 + 5 + 1 + 4 + 1)/5 = 12/5 = 2.4$ pounds
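The crate-weight example can be checked with a few lines of Python. This is only a minimal sketch; the function name `mean_deviation` is an illustrative choice and does not come from the slides.

```python
def mean_deviation(values):
    """Arithmetic mean of the absolute deviations from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


# Crate weights in pounds, taken from the example
crate_weights = [103, 97, 101, 106, 103]
crate_md = mean_deviation(crate_weights)
print(crate_md)  # 2.4
```

Every weight enters the calculation, which is the first feature listed for the mean deviation.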
Self-review
The weights of containers being shipped to Hong Kong are (in thousands of pounds): 95 103 110 104 105 112 90
(1) What is the range of the weights?
(2) Compute the arithmetic mean weight.
(3) Compute the mean deviation of the weights.
Solution:
(1) 22 thousand pounds, found by 112 - 90
(2) 103 thousand pounds
(3) MD = 46/8 = 5.75 thousand pounds
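3. The variance
The variance is the arithmetic mean of the squared deviations from the mean. This measure of dispersion reflects the values of all the measurements.
The variance of a population of N measurements $x_1, x_2, \ldots, x_N$ having a mean $\mu$ is
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
The variance of a sample of n measurements $x_1, x_2, \ldots, x_n$ having a mean $\bar{x}$ is
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$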
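The major characteristics of the population variance are:
All values are used in the calculation.
Unlike the range, it does not depend on the two extreme values alone, although squaring gives large deviations extra weight.
The units are awkward: they are the square of the original units.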
Consider two small populations:
Population A: 8, 9, 10, 11, 12
Population B: 4, 7, 10, 13, 16
The mean of both populations is 10, but the measurements in B are much more dispersed than those in A.
Deviations from the mean:
A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2; sum = 0
B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6; sum = 0
The sum of deviations is zero in both cases, so another measure is needed. The sum of squared deviations is used in calculating the variance.
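Let us calculate the variance of the two populations.
Why not use the sum of squared deviations itself to compare the dispersion of data sets? After all, the sum of squared deviations increases in magnitude when the dispersion of a data set increases. The difficulty is that it also increases with the number of measurements, so it cannot fairly compare data sets of different sizes; dividing by the number of measurements removes this effect.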
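The calculation for populations A and B can be sketched in Python as follows. The helper names are illustrative, and the "doubled" data set at the end is a made-up illustration (not from the slides) of why the raw sum of squared deviations is not used on its own.

```python
def sum_squared_deviations(values):
    """Sum of squared deviations from the arithmetic mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def population_variance(values):
    """Sum of squared deviations divided by the number of measurements N."""
    return sum_squared_deviations(values) / len(values)


population_a = [8, 9, 10, 11, 12]
population_b = [4, 7, 10, 13, 16]

for name, data in (("A", population_a), ("B", population_b)):
    print(name, sum_squared_deviations(data), population_variance(data))

# Illustration only: listing every value of A twice doubles the sum of
# squared deviations but leaves the variance unchanged.
doubled_a = population_a + population_a
print(sum_squared_deviations(doubled_a), population_variance(doubled_a))
```

Population A has a variance of 2 and population B a variance of 18, which matches the visual impression that B is far more dispersed.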
e.g.
x_i    Mean   Deviation   Squared deviation
46     44        2            4
54     44       10          100
42     44       -2            4
46     44        2            4
32     44      -12          144
4. The standard deviation
It is the square root of the variance of the measurements.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Example
Rates of return over the past 10 years for two mutual funds are shown below. Which one has a higher level of risk?
Fund A: 8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05
Fund B: 12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4
Solution
Let us use the Excel printout that is run from the “Descriptive statistics” sub-menu (use file Xm04-10).
Fund A should be considered riskier because its standard deviation is larger
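The same comparison can be reproduced without the Excel printout. The sketch below uses `statistics.stdev` from Python's standard library, which computes the sample standard deviation (divisor n - 1); the variable names are illustrative.

```python
import statistics

# Rates of return from the example
fund_a = [8.3, -6.2, 20.9, -2.7, 33.6, 42.9, 24.4, 5.2, 3.1, 30.05]
fund_b = [12.1, -2.8, 6.4, 12.2, 27.8, 25.3, 18.2, 10.7, -1.3, 11.4]

sd_a = statistics.stdev(fund_a)
sd_b = statistics.stdev(fund_b)

print(f"Fund A: mean = {statistics.mean(fund_a):.2f}, s = {sd_a:.2f}")
print(f"Fund B: mean = {statistics.mean(fund_b):.2f}, s = {sd_b:.2f}")

# The fund with the larger standard deviation is treated as the riskier one
riskier = "A" if sd_a > sd_b else "B"
print("Riskier fund:", riskier)
```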
Chebyshev’s theorem: For any set of observations, whatever the shape of the distribution, the proportion of the values that lie within k standard deviations of the mean is at least
$1 - \frac{1}{k^2}$
where k is any constant greater than 1.
Example: In a symmetrical, bell-shaped score distribution, the arithmetic mean is 71.54 and the standard deviation is 7.51. At least what percent of the scores lie within plus 3.5 standard deviations and minus 3.5 standard deviations of the mean?
Solution: About 92%, found by $1 - \frac{1}{(3.5)^2} = 1 - \frac{1}{12.25} \approx 0.92$
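Chebyshev's bound is simple enough to wrap in a small helper. The following is a minimal sketch; the function name `chebyshev_lower_bound` is an illustrative choice.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


# The example uses k = 3.5
print(round(chebyshev_lower_bound(3.5), 4))  # 0.9184
```

Because the result is a lower bound, the actual proportion for a given data set may well be higher.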
Empirical Rule: For any symmetrical, bell-shaped distribution:
About 68% of the observations will lie within 1s of the mean.
About 95% of the observations will lie within 2s of the mean.
Virtually all (99.7%) of the observations will lie within 3s of the mean.
Interpreting Standard Deviation
(1) The standard deviation can be used to compare the variability of several distributions and to make a statement about the general shape of a distribution.
(2) The empirical rule
[Figure: bell-shaped curve with 68%, 95%, and 99.7% of the area lying within μ ± 1σ, μ ± 2σ, and μ ± 3σ respectively.]
Example
The durations of 30 long-distance telephone calls are shown next. Check the empirical rule for this set of measurements.
Solution
First check whether the histogram has an approximate mound shape.
Calculate the mean and the standard deviation: Mean = 10.26; Standard deviation = 4.29.
Calculate the intervals:
Interval            Empirical Rule   Actual percentage
(5.97, 14.55)       68%              70%
(1.68, 18.84)       95%              96.7%
(-2.61, 23.13)      100%             100%
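The three intervals follow directly from the mean and the standard deviation. The sketch below recomputes them; the 30 call durations themselves are not reproduced here, so the actual percentages are not recomputed. The function name is an illustrative choice.

```python
def empirical_rule_intervals(mean, sd):
    """Intervals mean ± k*sd for k = 1, 2, 3."""
    return {k: (mean - k * sd, mean + k * sd) for k in (1, 2, 3)}


# Mean and standard deviation of the telephone call durations (in minutes)
intervals = empirical_rule_intervals(10.26, 4.29)
for k, (low, high) in intervals.items():
    print(f"within {k} standard deviation(s): ({low:.2f}, {high:.2f})")
```

With the raw durations available, the actual percentage for each interval would be the share of calls whose duration falls between the two limits.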
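Other conclusions
By the empirical rule, approximately 95% of the area under a mound-shaped histogram lies between $\bar{x} - 2s$ and $\bar{x} + 2s$. Since about 95% of all the measurements fall within two standard deviations of the mean, the range is roughly four standard deviations wide, so the standard deviation can be approximated by range/4.
For the telephone call duration problem the range is 19.5 - 2.3 = 17.2 minutes, which gives s ≈ 17.2/4 = 4.3, close to the calculated value of 4.29.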
Self-review
The Pitney Pipe Company is one of several domestic manufacturers of PVC pipe. The quality control department sampled 600 10-foot lengths. At a point 1 foot from the end of the pipe they measured the outside diameter. The mean was 14.0 inches and the standard deviation 0.1 inches.
(1) If the shape of the distribution is not known, at least what percent of the observations will be between 13.85 inches and 14.15 inches?
(2) If we assume that the distribution of diameters is symmetrical and bell-shaped, about 95% of the observations will be between what two values?
Solution:
(1) About 56%, found by k = (14.15 - 14.0)/0.1 = 1.5 and $1 - \frac{1}{(1.5)^2} = 1 - \frac{1}{2.25} \approx 0.56$
(2) 13.8 and 14.2, found by 14.0 ± 2(0.1)
Self-review
(1) The weights of the contents of several small aspirin sample bottles are (in grams): 4, 2, 5, 4, 5, 2, and 6. What is the sample variance?
(2) The room rates for a sample of 10 motels are (in $): 101, 97, 103, 110, 78, 87, 101, 80, 106, 88. What is the sample variance?
Solution:
(1) 2.33, found by $s^2 = \frac{14}{7 - 1}$
(2) 123.66, found by $s^2 = \frac{1112.9}{10 - 1}$
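Both answers can be verified with `statistics.variance` from Python's standard library, which uses the sample divisor n - 1. This is just a quick check; the variable names are illustrative.

```python
import statistics

aspirin_weights = [4, 2, 5, 4, 5, 2, 6]                      # grams
motel_rates = [101, 97, 103, 110, 78, 87, 101, 80, 106, 88]  # dollars

aspirin_var = statistics.variance(aspirin_weights)
motel_var = statistics.variance(motel_rates)

print(round(aspirin_var, 2), round(motel_var, 2))  # 2.33 123.66
```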
Can we say that a standard deviation of US $120 for a distribution of annual incomes is greater than a standard deviation of 4.5 days for a distribution of absences from work?
Can we say that the annual income distribution of top executives, with a standard deviation of US $1200, is more dispersed than that of unskilled employees, with a standard deviation of US $120?
Standard deviations cannot be compared directly when:
1. The data are in different units (such as dollars and days absent).
2. The data are in the same units, but the means are far apart (such as the incomes of the top executives and those of the unskilled employees).
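Further qualities of variance: when the data are grouped into classes, the total variance can be split into two parts:
Total variance = mean of the within-class variances + variance among the classes
The first part refers to the variance within a class (the spread of the values about their own class mean); the second refers to the variance among the classes (the spread of the class means about the overall mean).
Example:
Class   Value   Deviation   Squared deviation   Class mean   Class variance
A       0.6     -0.1        0.01                0.7          0.005
        0.8      0.1        0.01
B       1.6      0          0                   1.6          0.092
        1.1     -0.5        0.25
        1.5     -0.1        0.01
        1.8      0.2        0.04
        2.0      0.4        0.16
C       3.0     -1          1                   4            1.167
        3.5     -0.5        0.25
        5.5      1.5        2.25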
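The split of the total variance into a within-class part and an among-classes part can be checked numerically. The sketch below uses only the values of classes B and C as they appear in the example. It assumes population variances (divisor n, which is consistent with the class variances 0.092 and 1.167 shown for B and C) and weights each class by its size; all helper names are illustrative.

```python
def mean(values):
    return sum(values) / len(values)


def population_variance(values):
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / len(values)


# Values of classes B and C from the example
classes = {
    "B": [1.6, 1.1, 1.5, 1.8, 2.0],
    "C": [3.0, 3.5, 5.5],
}

all_values = [x for values in classes.values() for x in values]
n_total = len(all_values)
grand_mean = mean(all_values)

# Size-weighted mean of the within-class variances
within = sum(len(v) * population_variance(v) for v in classes.values()) / n_total

# Size-weighted variance of the class means about the overall mean
between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in classes.values()) / n_total

total = population_variance(all_values)
print(round(within, 3), round(between, 3), round(total, 3))
```

The printed total equals the sum of the two parts, up to floating-point rounding.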
5. The coefficient of variation (relative dispersion)
The coefficient of variation of a set of measurements is the standard deviation divided by the mean value. This coefficient provides a proportionate measure of variation.
The same idea applies to the other measures of dispersion:
The coefficient of range = range / mean
The coefficient of mean deviation = mean deviation / mean
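The coefficient of standard deviation (coefficient of variation)
The coefficient of population standard deviation: $CV = \frac{\sigma}{\mu}$
The coefficient of sample standard deviation: $cv = \frac{s}{\bar{x}}$
A standard deviation of 10 may be perceived as large when the mean value is 100 (coefficient of variation 10%), but only moderately large when the mean value is 500 (coefficient of variation 2%).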
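The comparison of a standard deviation of 10 against means of 100 and 500 can be written as a short Python sketch; the function name is an illustrative choice.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a proportion of the mean."""
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    return sd / mean


print(coefficient_of_variation(10, 100))  # 0.1
print(coefficient_of_variation(10, 500))  # 0.02
```

Because the result is unit-free, it can also be used to compare data measured in different units, such as dollars and days.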
Z scores
The Z score of a value is its deviation from the mean divided by the standard deviation:
$Z = \frac{x - \bar{x}}{s}$
It is useful in identifying extreme values (outliers), which are located far away from the mean. The larger the absolute value of the Z score, the farther the value lies from the mean.
Example:
Time (in minutes)   Z score
39                  -0.09
29                  -1.57
43                   0.50
……
Mean = 39.6, S = 6.77
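The three Z scores in the example can be reproduced from the stated mean and standard deviation. This is a minimal sketch; the function name `z_score` is an illustrative choice.

```python
def z_score(value, mean, sd):
    """Deviation from the mean, measured in standard deviations."""
    return (value - mean) / sd


# Mean and standard deviation as given in the example
mean_time, sd_time = 39.6, 6.77

for time in (39, 29, 43):
    print(time, round(z_score(time, mean_time, sd_time), 2))
```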
Self-review
The variation in the annual incomes of executives in Nash Inc. is to be compared with the variation in incomes of unskilled employees. For a sample of executives, and . For a sample of unskilled employees, the annual income and . Compare the relative dispersion in the two groups.
Solution: There is no difference in the relative dispersion of the two groups.
Decile / Proportion
Proportion: the fraction, ratio, or percent indicating the part of the sample or the population having a particular trait of interest.
N1: Yes (have some kind of attribute or property)
N2: No (do not have that attribute or property)
N = N1 + N2
$P = \frac{N_1}{N}$, $Q = \frac{N_2}{N}$
Quartiles
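The location of the P-th percentile in the ordered data is
$L_p = (n + 1)\frac{P}{100}$
where n is the number of observations. The first quartile, the median, and the third quartile are the 25th, 50th, and 75th percentiles.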
Stock prices on twelve consecutive days for a major publicly traded company
Using the twelve stock prices, we can find the median, 25th, and 75th percentiles as follows:
Ordered prices:
Position   1    2    3    4    5    6    7    8    9    10   11   12
Price      69   78   79   82   83   84   85   86   88   91   92   96
Q1, 25th percentile: price at position 3.25 = 79 + .25(82 - 79) = 79.75
Q2, 50th percentile (median): price at position 6.50 = 84 + .5(85 - 84) = 84.50
Q3, 75th percentile: price at position 9.75 = 88 + .75(91 - 88) = 90.25
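The location rule (n + 1) × P/100 with linear interpolation between neighbouring observations can be sketched in Python as follows. The function name `percentile_by_location` is an illustrative choice, and clamping positions that fall outside the data to the smallest or largest observation is an assumption made here, not something stated in the slides.

```python
def percentile_by_location(values, p):
    """P-th percentile using the location (n + 1) * P / 100 on the sorted data."""
    data = sorted(values)
    n = len(data)
    location = (n + 1) * p / 100      # 1-based position in the ordered data
    lower = int(location)             # whole part of the position
    fraction = location - lower       # fractional part used for interpolation
    if lower < 1:                     # assumption: clamp below the first value
        return data[0]
    if lower >= n:                    # assumption: clamp above the last value
        return data[-1]
    return data[lower - 1] + fraction * (data[lower] - data[lower - 1])


# The twelve stock prices from the example
prices = [96, 92, 91, 88, 86, 85, 84, 83, 82, 79, 78, 69]

q1 = percentile_by_location(prices, 25)
median = percentile_by_location(prices, 50)
q3 = percentile_by_location(prices, 75)
print(q1, median, q3)  # 79.75 84.5 90.25
```

Other software may place percentiles at slightly different positions, so results can differ a little from this rule on small data sets.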
The interquartile range is the distance between the third quartile Q3 and the first quartile Q1. This distance will include the middle 50 percent of the observations.
Interquartile range = Q3 - Q1
For a set of observations the third quartile is 24 and the first quartile is 10. What is the interquartile range?
The interquartile range is 24 - 10 = 14. Fifty percent of the observations will occur between 10 and 24.
Self-review
The quality control department of the Plainsville Peanut Company is responsible for checking the weight of the 8-ounce jar of peanut butter. The weights of a sample of nine jars produced in the last hour are: 7.69 7.72 7.80 7.86 7.90 7.94 7.97 8.06 8.09
(1) What is the median weight?
(2) Determine the weights corresponding to the first and the third quartiles.
(3) Determine the interquartile range.
Solution:
(1) 7.90
(2) Q1 = 7.76, Q3 = 8.015
(3) Q3 - Q1 = 8.015 - 7.76 = 0.255