1
Measures of Dispersion
2
What is Dispersion? Refers to the way in which quantitative data values are dispersed or spread out in a dataset. The most powerful dispersion statistics calculate the quantitative spread of the data values around the arithmetic mean and are called measures of deviation. The various measures of deviation calculate the arithmetic differences between each data value and the arithmetic mean of the dataset.
3
Why bother with measuring deviation?
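Consider the following datasets:

First dataset: 3, 3, 3, 3, 3
Second dataset: 1, 1, 1, 2, 10

First we calculate their arithmetic means using:

$$\bar{x} = \frac{\sum x}{n}$$

First dataset: $\bar{x} = \frac{3 + 3 + 3 + 3 + 3}{5} = 3$
Second dataset: $\bar{x} = \frac{1 + 1 + 1 + 2 + 10}{5} = 3$

Are they the same? According to the mean they are.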
4
Then we calculate their standard deviations using:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

First dataset (3, 3, 3, 3, 3):

$$s = \sqrt{\frac{(3-3)^2 + (3-3)^2 + (3-3)^2 + (3-3)^2 + (3-3)^2}{4}} = 0$$

Second dataset (1, 1, 1, 2, 10):

$$s = \sqrt{\frac{(1-3)^2 + (1-3)^2 + (1-3)^2 + (2-3)^2 + (10-3)^2}{4}} = 3.94$$

Same means, very different standard deviations. So are the datasets the same, or not?
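As a quick check, the two results above can be reproduced with Python's standard `statistics` module, whose `stdev` function uses the n − 1 denominator. This is only a sketch: the variable names are illustrative, and the two lists are the values that appear in the calculations above.

```python
import statistics

# The two five-value datasets from the calculations above.
first_dataset = [3, 3, 3, 3, 3]
second_dataset = [1, 1, 1, 2, 10]

for label, data in (("First", first_dataset), ("Second", second_dataset)):
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
    print(f"{label} dataset: mean = {mean:.2f}, s = {s:.2f}")
```

Both datasets report the same mean, while only the second has a non-zero standard deviation.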
5
Measures of Dispersion and Deviation
The Range (a measure of dispersion): The difference between the lowest value (called MIN) and the highest value (called MAX) in a dataset.
The Standard Deviation (a measure of deviation): Measures the average difference between a data value and the arithmetic mean of all data values.
The Variance (a measure of deviation): Averages the squared differences between each data value and the arithmetic mean of the dataset. Thus it is the standard deviation squared.
6
The Range (Range = MAX-MIN)
7
The Range
The range describes the span of your dataset, from the minimum value (MIN) to the maximum value (MAX), using:
Range = MAX − MIN
It is used as a measure of data dispersion, NOT deviation, because deviation implies a difference between your data values and something else, e.g. the arithmetic mean.
The range is also used in finding histogram (or bar chart) classes.
8
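            Dataset #1      Dataset #2      Dataset #3
Data        $45,000.00      $80,000.00      $80,000.00
            $45,000.00      $45,000.00      $45,000.00
            $45,000.00      $45,000.00      $45,000.00
            $43,000.00      $43,000.00      $43,000.00
            $41,000.00      $41,000.00      $41,000.00
            $40,000.00      $40,000.00      $40,000.00
            $37,000.00      $37,000.00      $37,000.00
            $35,000.00      $35,000.00           $1.00
Sum        $331,000.00     $366,000.00     $331,001.00
n                    8               8               8
Mean        $41,375.00      $45,750.00      $41,375.13
Median      $42,000.00      $42,000.00      $42,000.00
Mode        $45,000.00      $45,000.00      $45,000.00
MAX         $45,000.00      $80,000.00      $80,000.00
MIN         $35,000.00      $35,000.00           $1.00
Range       $10,000.00      $45,000.00      $79,999.00

Even the range is telling us more about the data than just the central tendency measures do. Compare dataset #1 with #3.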
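A minimal sketch of the range calculation in Python. The individual salary values are an assumption: they were chosen to be consistent with the sums, means, medians and ranges reported in the table. The names `value_range` and `datasets` are illustrative.

```python
def value_range(values):
    """Return MAX - MIN for a sequence of numbers."""
    return max(values) - min(values)


# Assumed data values, consistent with the summary statistics in the table.
datasets = {
    "#1": [45000, 45000, 45000, 43000, 41000, 40000, 37000, 35000],
    "#2": [80000, 45000, 45000, 43000, 41000, 40000, 37000, 35000],
    "#3": [80000, 45000, 45000, 43000, 41000, 40000, 37000, 1],
}

for name, values in datasets.items():
    mean = sum(values) / len(values)
    print(f"Dataset {name}: mean = {mean:,.2f}, range = {value_range(values):,.2f}")
```

Datasets #1 and #3 have almost the same mean, yet their ranges differ by a factor of about eight.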
9
The Standard Deviation (s)
10
The Standard Deviation
The standard deviation measures the average difference between a data value and the arithmetic mean of all data values. It is given by:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

Where:
s is the sample standard deviation
x is a value in the dataset
$\bar{x}$ is the arithmetic mean of the dataset
n is the number of values in the dataset

If you're wondering how the $\sum (x - \bar{x})^2$ part works, it is saying: subtract the mean from each data value, square each difference, then add up all these squared values. It is not saying subtract the mean from every data value, sum the differences, then square that sum; you'd obviously get zero.

The standard deviation and the variance are related insofar as s is the square root of the variance (or the variance is s²). s is the most widely used measure of deviation, though it should always be used in conjunction with the variance.
11
Interpreting the Standard Deviation Formula
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

Subtract the arithmetic mean $\bar{x}$ from each data value x and sum the differences:

$$\sum (x - \bar{x})$$

But this returns a set of plus and minus differences that add to zero. So to remove the signs we square each difference and sum the squared differences:

$$\sum (x - \bar{x})^2$$

Then we divide by n − 1 and take the square root, which returns the result to the magnitude of the original values:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
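As a worked illustration of these steps, take the five values 1, 1, 1, 2, 10 used earlier, whose mean is 3:

```latex
\begin{align*}
\sum (x - \bar{x})   &= (-2) + (-2) + (-2) + (-1) + 7 = 0 \\
\sum (x - \bar{x})^2 &= 4 + 4 + 4 + 1 + 49 = 62 \\
s &= \sqrt{\frac{62}{5 - 1}} = \sqrt{15.5} \approx 3.94
\end{align*}
```

The raw differences cancel out, the squared differences do not, and the square root brings the result back to the scale of the data.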
12
A reminder of the effect of squaring…
#    1  2  3   4   5   6   7   8   9   10   11   12   13   14   15   16   17   18   19   20
#²   1  4  9  16  25  36  49  64  81  100  121  144  169  196  225  256  289  324  361  400

The # row is an arithmetic progression. The #² row is a quadratic progression: the gap between successive squares keeps growing.
… it emphasizes higher values
13
Why Squares and Roots? This is a list of numbers, x.
x     x − mean   (x − mean)²   √((x − mean)²)
1       -9.5       90.25          9.5
2       -8.5       72.25          8.5
3       -7.5       56.25          7.5
4       -6.5       42.25          6.5
5       -5.5       30.25          5.5
6       -4.5       20.25          4.5
7       -3.5       12.25          3.5
8       -2.5        6.25          2.5
9       -1.5        2.25          1.5
10      -0.5        0.25          0.5
11       0.5        0.25          0.5
12       1.5        2.25          1.5
13       2.5        6.25          2.5
14       3.5       12.25          3.5
15       4.5       20.25          4.5
16       5.5       30.25          5.5
17       6.5       42.25          6.5
18       7.5       56.25          7.5
19       8.5       72.25          8.5
20       9.5       90.25          9.5

Mean of x: 10.5
Sum of (x − mean): 0.0

The difference $x - \bar{x}$ produces negative numbers and a sum of zero, but the square of a number is always positive, and differences between squares increase more rapidly than differences between the original numbers, so taking the square root of the squared differences simply returns them to their original magnitude, and also removes the sign.
14
s values do not indicate skewness. They do indicate kurtosis.
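            Dataset #1      Dataset #2      Dataset #3
Data        $45,000.00      $80,000.00      $80,000.00
            $45,000.00      $45,000.00      $45,000.00
            $45,000.00      $45,000.00      $45,000.00
            $43,000.00      $43,000.00      $43,000.00
            $41,000.00      $41,000.00      $41,000.00
            $40,000.00      $40,000.00      $40,000.00
            $37,000.00      $37,000.00      $37,000.00
            $35,000.00      $35,000.00           $1.00
Sum        $331,000.00     $366,000.00     $331,001.00
n                    8               8               8
Mean        $41,375.00      $45,750.00      $41,375.13
Median      $42,000.00      $42,000.00      $42,000.00
Mode        $45,000.00      $45,000.00      $45,000.00
MAX         $45,000.00      $80,000.00      $80,000.00
MIN         $35,000.00      $35,000.00           $1.00
Range       $10,000.00      $45,000.00      $79,999.00
s            $3,852.18      $14,290.36      $21,559.86

Low s means that the data are clustered around the mean (data are leptokurtic or ‘peaked’).
High s means that the data are spread out around the mean (data are platykurtic or ‘flat’).
REMEMBER: s values do not indicate skewness. They do indicate kurtosis.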
15
Standard deviation calculations the hard way
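A minimal sketch of what the calculation "the hard way" could look like in Python, following the formula step by step instead of calling a library function. The salary values are an assumption, chosen to be consistent with the statistics reported for Dataset #1 (sum $331,000, n = 8, s = $3,852.18); the variable names are illustrative.

```python
import math
import statistics

# Assumed values for Dataset #1, consistent with its reported summary statistics.
salaries = [45000, 45000, 45000, 43000, 41000, 40000, 37000, 35000]

n = len(salaries)
mean = sum(salaries) / n                      # arithmetic mean
deviations = [x - mean for x in salaries]     # x minus the mean, signs kept
squared = [d ** 2 for d in deviations]        # squaring removes the signs
sum_of_squares = sum(squared)                 # numerator of the formula
variance = sum_of_squares / (n - 1)           # sample variance, s squared
s = math.sqrt(variance)                       # sample standard deviation

print(f"mean = {mean:,.2f}")
print(f"sum of deviations = {sum(deviations):.2f}")
print(f"sum of squared deviations = {sum_of_squares:,.2f}")
print(f"s = {s:,.2f}")

# Cross-check against the standard library, which also divides by n - 1.
print(f"statistics.stdev = {statistics.stdev(salaries):,.2f}")
```

Printing the intermediate sums makes it easy to see that the raw deviations cancel while the squared deviations do not.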
16
Review Slide Standard Deviation and the ‘Shape’ of Data
[Figure: frequency curves centred on $\bar{x}$ for a ‘small’, a ‘normal’ and a ‘large’ standard deviation]
This ‘peakedness’ of the distribution is called kurtosis. Use the kurtosis statistic to test for normality.
17
The Variance (s²)
18
The variance averages the squared differences between each data value and the arithmetic mean of the dataset. It is given by:

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

Where:
s² is the sample variance
x is a value in the dataset
$\bar{x}$ is the arithmetic mean of the dataset
n is the number of values in the dataset

Since it uses the arithmetic mean, it is subject to the same effect of extreme values, only much more so because of the effect of squaring.
19
Interpreting the Variance Formula
$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

Subtract the arithmetic mean $\bar{x}$ from each data value x and sum the differences:

$$\sum (x - \bar{x})$$

But this returns a set of plus and minus differences that add to zero. So to remove the signs we square each difference and sum the squared differences:

$$\sum (x - \bar{x})^2$$

Dividing this sum by n − 1 gives the variance.
20
Variance and SD Compared
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \qquad s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

By squaring the differences you remove the negative signs and exaggerate more extreme differences to make them more obvious for analysis.
By taking the square root you return the differences to their original magnitude, but the signs are removed so the differences no longer sum to zero.
In comparing the two, when s is small, the difference between the variance (s²) and s is smaller than if s is large. That's what happens when you square numbers.
21
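            Dataset #1          Dataset #2          Dataset #3
Data        $45,000.00          $80,000.00          $80,000.00
            $45,000.00          $45,000.00          $45,000.00
            $45,000.00          $45,000.00          $45,000.00
            $43,000.00          $43,000.00          $43,000.00
            $41,000.00          $41,000.00          $41,000.00
            $40,000.00          $40,000.00          $40,000.00
            $37,000.00          $37,000.00          $37,000.00
            $35,000.00          $35,000.00               $1.00
Sum        $331,000.00         $366,000.00         $331,001.00
n                    8                   8                   8
Mean        $41,375.00          $45,750.00          $41,375.13
Median      $42,000.00          $42,000.00          $42,000.00
Mode        $45,000.00          $45,000.00          $45,000.00
MAX         $45,000.00          $80,000.00          $80,000.00
MIN         $35,000.00          $35,000.00               $1.00
Range       $10,000.00          $45,000.00          $79,999.00
s            $3,852.18          $14,290.36          $21,559.86
s²      $14,839,285.71     $204,214,285.71     $464,827,464.41

Note that the highest s is 5.6 times the lowest, whereas the highest s² is 31 times the lowest. This is the effect of squaring extreme values.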
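The two ratios in the note can be checked with a few lines of Python. This sketch uses only the s values reported in the table; squaring them gives variances that differ slightly from the tabulated s² in the last digits because the s values are rounded. The names are illustrative.

```python
# Standard deviations as reported in the table (rounded to cents).
s_values = {"#1": 3852.18, "#2": 14290.36, "#3": 21559.86}

# The variance is the standard deviation squared.
variances = {name: s ** 2 for name, s in s_values.items()}

s_ratio = max(s_values.values()) / min(s_values.values())
variance_ratio = max(variances.values()) / min(variances.values())

print(f"highest s / lowest s   = {s_ratio:.1f}")
print(f"highest s2 / lowest s2 = {variance_ratio:.0f}")
```

Because the variance is s squared, the ratio of the variances is simply the square of the ratio of the standard deviations.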
22
N and n-1
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \qquad s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$

Why do the sample standard deviation and sample variance formulas have n − 1 as the denominator?
Because n − 1 gives a more conservative estimate of deviation by increasing the standard deviation and variance values. If you have a larger standard deviation or variance, you have a higher standard to pass in making your case.
Why? Suppose you want to find out whether a value in a dataset is different from the others. To count as different, that value has to be 1.96 s away from the mean of its dataset. A larger s therefore means the value has to meet a stricter test: it has to lie further from the mean to be significantly different.
Never mind right now why the number 1.96 comes up; we'll deal with it in a couple of weeks.
23
Sample versus population – n-1 versus N
Columns:
(a) Sample size (n)
(b) Value of numerator in standard deviation formula, $\sum (x - \bar{x})^2$
(c) Biased estimate of population standard deviation (i.e. dividing by N)
(d) Unbiased estimate of population standard deviation (dividing by n − 1)
(e) Difference between biased and unbiased estimates
(f) Difference as a percentage

(a)     (b)    (c)                    (d)                         (e)      (f)
10      500    √(500/10) = 7.07       √(500/(10−1)) = 7.45        .38      5.0%
100     500    √(500/100) = 2.24      √(500/(100−1)) = 2.25       .01      0.4%
1000    500    √(500/1000) = 0.7071   √(500/(1000−1)) = 0.7075    .0004    0.056%

Source: After Salkind, page 40.

Note:
1. With n − 1 the standard deviation is higher.
2. The larger the sample, the smaller the effect of n − 1.
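The table can be reproduced with a short Python sketch. The numerator of 500 and the three sample sizes come from the table; the function name `estimates` is illustrative.

```python
import math

SUM_OF_SQUARES = 500  # numerator of the standard deviation formula, from the table


def estimates(n, sum_of_squares=SUM_OF_SQUARES):
    """Return (dividing by n, dividing by n - 1) standard deviation estimates."""
    biased = math.sqrt(sum_of_squares / n)
    unbiased = math.sqrt(sum_of_squares / (n - 1))
    return biased, unbiased


for n in (10, 100, 1000):
    biased, unbiased = estimates(n)
    print(f"n = {n:>4}: /N = {biased:.4f}, /(n-1) = {unbiased:.4f}, "
          f"difference = {unbiased - biased:.4f}")
```

The difference between the two estimates shrinks quickly as n grows, which is the second point in the note.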
25
Interpreting Variance & Standard Deviation
s gives the average difference between each data value and the mean of a dataset, and s² squares it and so exaggerates it.
The larger the values of s and s², the more spread out the data values are and the larger the differences between them. If s and s² are equal to zero then there are no differences between your data values.
The standard deviation and the variance each require an arithmetic mean to work, not the median or the mode. Therefore they require the same rigour as the mean and are sensitive to extreme values as well, especially the variance.
26
The Coefficient of Variation (Cv)
27
Calculating the Coefficient Of Variation
The equation for the sample coefficient of variation is:

$$C_v = \frac{s}{\bar{x}} \times 100$$

And, for the population:

$$C_v = \frac{\sigma}{\mu} \times 100$$
28
Interpreting The Coefficient Of Variation
The coefficient of variation expresses the standard deviation as a percentage of the mean. Allows easy comparison of standard deviations with one another.
29
Interpreting The Coefficient Of Variation
By way of example: compare an s of $2,400 on a per capita average income of $55,000 against an s of $300 on a per capita average income of $2,000. How to interpret? Here the coefficients of variation are 4.4% and 15%, indicating much wider variability in the poorer nation, that is, a much wider gap between rich and poor. Case in point: the coefficient of variation for global GNI is 108.9%! This indicates an extraordinary gap between rich and poor nations.
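The two percentages in this example can be checked with a small Python sketch of the sample formula (s divided by the mean, times 100). The function and variable names are illustrative.

```python
def coefficient_of_variation(s, mean):
    """Express the standard deviation s as a percentage of the mean."""
    return s / mean * 100


# Figures from the example: (standard deviation, per capita average income).
richer_nation = coefficient_of_variation(2400, 55000)
poorer_nation = coefficient_of_variation(300, 2000)

print(f"Richer nation: Cv = {richer_nation:.1f}%")
print(f"Poorer nation: Cv = {poorer_nation:.1f}%")
```

Although the richer nation has the larger s in dollar terms, its s is the smaller share of its mean.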
30
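            Dataset #1          Dataset #2          Dataset #3
Data        $45,000.00          $80,000.00          $80,000.00
            $45,000.00          $45,000.00          $45,000.00
            $45,000.00          $45,000.00          $45,000.00
            $43,000.00          $43,000.00          $43,000.00
            $41,000.00          $41,000.00          $41,000.00
            $40,000.00          $40,000.00          $40,000.00
            $37,000.00          $37,000.00          $37,000.00
            $35,000.00          $35,000.00               $1.00
Sum        $331,000.00         $366,000.00         $331,001.00
n                    8                   8                   8
Mean        $41,375.00          $45,750.00          $41,375.13
Median      $42,000.00          $42,000.00          $42,000.00
Mode        $45,000.00          $45,000.00          $45,000.00
MAX         $45,000.00          $80,000.00          $80,000.00
MIN         $35,000.00          $35,000.00               $1.00
Range       $10,000.00          $45,000.00          $79,999.00
s            $3,852.18          $14,290.36          $21,559.86
s²      $14,839,285.71     $204,214,285.71     $464,827,464.41
Cv               9.31%              31.24%              52.11%

Note that the highest Cv is 5.6 times the lowest, indicating that dataset #3 is considerably more variable than dataset #1. The effect of the two extreme values is evident.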
31
Summary Stats So Far
Arithmetic mean and standard deviation are fundamental to statistics.
They form the heart of descriptive statistics.
They are the essential building blocks of all other statistical methods; look for them as elements in future formulas.
Other measures of dispersion have their roles and are more robust, but are not as powerful.
32
All Geography students are deviants.
33
All Geography students are above average deviants.
mg!