Variability Dr. Richard Jackson

1 Variability
Dr. Richard Jackson (jackson_r@mercer.edu)
As mentioned in the previous module, one way to describe a set of data is through the use of averages, numbers that represent a large group of data. The variability of the measures is also very important, and in this module we will look at several statistics that indicate the variability of subjects within a sample. © Mercer University 2005 All Rights Reserved RJ Variability

2 Frequency Polygons
The importance of determining the variability of subjects within a sample is illustrated by these six frequency polygons. As you can see, they are all very different. Yet if calculated, the means of all six distributions would be approximately the same. It is therefore necessary to determine the variability of a set of data in order to describe the data more accurately.

3 Variability
Range
Q (Quartile Deviation)
AD (Average Deviation)
SD (Standard Deviation, s)
Variance (s²)
SEM (Standard Error of the Mean)
There are several measures of variability that we will be discussing: the range, which we have already mentioned; the Quartile Deviation (Q); the Average Deviation (AD); the Standard Deviation (SD or s); the Variance, which equals the standard deviation squared (s²); and the Standard Error of the Mean (SEM).

4 Range
High Measure − Low Measure = Range
17.8 − 12.1 = 5.7
The simplest measure of variability in a set of data is the range. As we have discussed previously, the range is determined by taking the high measure in a distribution and subtracting the low measure. In our example involving the hemoglobin levels of 90 subjects (Table 4.2), the high measure was 17.8 and the low measure was 12.1, giving a range of 5.7.
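The slide's arithmetic can be sketched in a couple of lines of Python; the hemoglobin extremes 17.8 and 12.1 are the values given on the slide.

```python
def data_range(high: float, low: float) -> float:
    """Range = high measure minus low measure."""
    return round(high - low, 10)  # round away floating-point noise

# Hemoglobin example from the slide: high = 17.8, low = 12.1
print(data_range(17.8, 12.1))  # 5.7
```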

5 Quartile Deviation ("Q")
Q1, Q2, Q3, and Q4 represent quartiles: Q1 = C25, Q2 = C50, Q3 = C75.
The quartile deviation is symbolized by the capital letter Q. Please understand that this is not the same as the quartiles discussed in a previous module. The quartiles, labeled Q1, Q2, Q3, and Q4, are measures with a given percentage of cases below them (25, 50, 75, and 100 respectively), so Q1, Q2, and Q3 correspond to the 25th, 50th, and 75th centiles. The quartile deviation is a different statistic, symbolized by the capital letter Q.

6 Quartile Deviation ("Q")
"Q" is the statistic of choice for skewed data.
Formula: Q = (Q3 − Q1) / 2
The quartile deviation is the statistic of choice for describing variability when the data are skewed. The formula is Q = (Q3 − Q1) / 2: the third quartile minus the first quartile, divided by 2.

7 Quartile Deviation Example from Table 4.2
Calculating the Quartile Deviation: Calculate Q1
(90)(0.25) = 22.5

Class Interval   f    Cumulative Frequency
17.5–17.9        2    90
17.0–17.4        3    88
16.5–16.9        3    85
16.0–16.4        5    82
15.5–15.9        8    77
15.0–15.4       13    69
14.5–14.9       20    56
14.0–14.4       14    36
13.5–13.9       11    22
13.0–13.4        5    11
12.5–12.9        3     6
12.0–12.4        3     3

Again we will use the data in the sample of hemoglobin levels to calculate and illustrate the quartile deviation. Going back to the cumulative frequency distribution for our example, we first calculate Q1 to use in the formula for the quartile deviation. Q1 is the point in the distribution below which 25% of the subjects fall; in other words, we want to determine the hemoglobin level with 25% of the subjects below it. 25% of the 90 subjects is (0.25)(90) = 22.5, so we want the point in this distribution with 22.5 subjects below it. Reading up the cumulative frequency column, the cumulative frequency through the first three intervals is 11 and through the first four intervals is 22. We need the point that corresponds to 22.5 cases, so it falls somewhere in the interval 14.0 to 14.4. Through the bottom of that interval we have 22 subjects; we need 0.5 subjects more, and that interval contains 14 subjects, so we go that fraction, 0.5 divided by 14, of the way through the interval. The interval size is 0.5. Taking the lower limit of that interval, 13.95, and adding to it the fraction of the interval times the interval size, we get:
Q1 = 13.95 + (0.5 / 14)(0.5) = 13.97
That value corresponds to the point in the distribution below which 22.5 persons, or 25% of the subjects, fall.

8 Quartile Deviation Example from Table 4.2
Calculating the Quartile Deviation: Calculate Q3
(90)(0.75) = 67.5
We follow a similar procedure for determining Q3, the point in the distribution below which 75% of the cases fall. 75% of 90 is 67.5. Going through the same cumulative frequency distribution, we see that this point falls in the interval 15.0 to 15.4. Through the lower limit of that interval we have 56 cases; 67.5 minus 56 is 11.5, indicating that we need 11.5 of the 13 subjects in the interval 15.0 to 15.4. We therefore go that fraction, 11.5 divided by 13, of the way through the interval. The lower limit of the interval is 14.95, so going that fraction of the way through the interval size of 0.5 and adding it to the lower limit gives:
Q3 = 14.95 + (11.5 / 13)(0.5) = 15.39
This is the point below which 75% of the subjects fall.
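The interpolation steps on slides 7 and 8 can be sketched in Python. The helper name `centile_point` is my own, and the real lower limits (13.95 and 14.95, i.e. 0.05 below the nominal interval starts) are assumptions inferred from the slide's arithmetic.

```python
def centile_point(lower_limit, cum_below, f_in_interval, target_cases, width=0.5):
    """Linear interpolation within a class interval of a grouped frequency
    distribution: go (target - cum_below) / f of the way through the
    interval, starting from its real lower limit."""
    return lower_limit + (target_cases - cum_below) / f_in_interval * width

n = 90
q1 = centile_point(13.95, 22, 14, 0.25 * n)  # 25th centile
q3 = centile_point(14.95, 56, 13, 0.75 * n)  # 75th centile
q = (q3 - q1) / 2                            # quartile deviation

print(round(q1, 2), round(q3, 2), round(q, 1))  # 13.97 15.39 0.7
```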

9 Quartile Deviation
Q = (Q3 − Q1) / 2
Q = (15.39 − 13.97) / 2
Q = 1.42 / 2
Q = 0.71 ≈ 0.7
Now we are ready to calculate the quartile deviation using the formula Q = (Q3 − Q1) / 2. We find that the quartile deviation for this set of data is 0.7.

10 Quartile Deviation
Interpretation of Q: the median plus and minus one quartile deviation includes approximately 50% of the subjects.
14.7 ± 1 Q ≈ 50% of subjects
The interpretation of the quartile deviation is as follows: the median plus and minus one quartile deviation includes approximately 50% of the subjects. The median, which we calculated earlier, is 14.7, and the quartile deviation is 0.7, so plus and minus one quartile deviation from the median is 14.0 to 15.4, and 50% of the subjects fall between these two measures. In other words, if you know the quartile deviation, you know that half of the subjects fall within plus and minus one quartile deviation of the median: with a median of 14.7 and Q = 0.7, 50% of subjects fall between 14.0 and 15.4.

11 Quartile Deviation
Approximately 8 Q's cover the range of the data.
0.7 × 8 = 5.6 (recap: range was 5.7)
Another characteristic of the quartile deviation is that 8 quartile deviations (note: quartile deviations, not quartiles) cover the range of the data. Our quartile deviation for these data was 0.7, and 0.7 times 8 is 5.6; if you recall, when we calculated the range of these data it was 5.7. The frequency polygon on the slide illustrates this, showing that 50% of the subjects fall within plus or minus one quartile deviation of the median, and that the range of the data covers 8 quartile deviations: 4 quartile deviations above the median and 4 below, totaling 8.

12 Average Deviation (AD)
X     x = (X − X̄)
15     4
13     2
12     1
10    −1
 9    −2
 7    −4
N = 6, ΣX = 66, Σ|x| = 14
X̄ = ΣX / N = 66 / 6 = 11
AD = Σ|x| / N = 14 / 6 = 2.33
A third measure of variability is the Average Deviation. The average deviation is used very infrequently in the literature as a measure of variability, but understanding it will greatly enhance our understanding of the next measure, the standard deviation. The average deviation, abbreviated AD, represents on average how much each measure in a distribution varies from the mean of the distribution. In this example we have six measures, labeled with the capital letter X: 15, 13, 12, 10, 9, and 7. The sum of X is 66, and the mean is 11. In the right-hand column we have calculated the deviation of each measure from the mean, labeled with a small letter x. Any time you see a small x it represents a deviation from something, in this case a deviation from the mean; mathematically it is X minus X̄. So in that column we have taken each individual measure X and subtracted the mean of 11: 15 − 11 is 4, 13 − 11 is 2, and so forth. The formula for the average deviation is the sum of the absolute values of the x values divided by N. Taking the absolute value of each deviation (disregarding the minus signs) and summing gives 14, and dividing by the 6 subjects in the sample gives an average deviation of 2.33. The interpretation is that in this distribution of six measures, each measure varies from the mean by an average of about 2.33. That is the definition of the average deviation: the average amount that each measure in a distribution varies from the mean.
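The average deviation calculation on this slide can be sketched directly from the six measures:

```python
data = [15, 13, 12, 10, 9, 7]        # the six measures from the slide
mean = sum(data) / len(data)          # 66 / 6 = 11.0
deviations = [x - mean for x in data]             # 4, 2, 1, -1, -2, -4
ad = sum(abs(d) for d in deviations) / len(data)  # 14 / 6 ≈ 2.33
print(round(ad, 2))  # 2.33
```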

13 Standard Deviation (s, SD)
X     x = (X − X̄)    x²
15     4              16
13     2               4
12     1               1
10    −1               1
 9    −2               4
 7    −4              16
Σx² = 42
s = √(Σx² / N) = √(42 / 6) ≈ 2.65
The standard deviation, abbreviated with a small letter s or the capital letters SD, is the most widely used measure of variability. Its formula is s equals the square root of the sum of the squared deviations divided by N. We will go through a calculation using the same data we had for the average deviation: the same six measures and the same deviations in the middle column, but in the last column we square each deviation. Summing the last column gives Σx², the sum of the squared deviations, which we plug into the formula. Substituting into the formula for the standard deviation gives a value of about 2.65, somewhat different from the 2.33 we got for the average deviation. The standard deviation is loosely interpreted as something like the average amount that each measure in a distribution varies from the mean; not exactly (if that were the case it would be the average deviation), but for the most part people interpret it that way. The real utility of the standard deviation lies in other characteristics.
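The standard deviation calculation on this slide, with N in the denominator as the formula shows, can be sketched as:

```python
import math

data = [15, 13, 12, 10, 9, 7]
mean = sum(data) / len(data)             # 11.0
ss = sum((x - mean) ** 2 for x in data)  # Σx² = 42
s = math.sqrt(ss / len(data))            # √(42 / 6) = √7 ≈ 2.65
print(round(s, 2))  # 2.65
```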

14 Standard Deviation Interpretation
s = 1.2 for the 90 subjects (the cumulative frequency distribution from the quartile deviation example)
X̄ ± 1 SD ≈ 68%
X̄ ± 2 SD ≈ 95%
X̄ ± 3 SD ≈ 99%
The real value of the standard deviation lies in the characteristics associated with the normal distribution. In a normal distribution, the mean plus or minus one standard deviation includes about two thirds, or 68%, of the subjects in a sample; the mean plus or minus two standard deviations includes about 95% of the subjects; and the mean plus or minus three standard deviations includes approximately 99% of the subjects. The standard deviation for our set of data was 1.2. If we take the mean plus and minus 1.2, that should include about 68% of our cases, and the mean plus or minus two standard deviations (2 times 1.2) should include about 95% of the subjects; if you actually checked this for our data, you would see that it comes very close. The standard deviation is the measure of variability to use when the distribution is normal. These relationships between the standard deviation and the proportion of subjects do not apply when the distribution is skewed; they apply only when the data are relatively normal.
SD is used when the distribution is normal.
SD does not apply when the distribution is skewed.

15 Standard Deviation
Standard deviations in a normal distribution: X̄ = 50, SD = 10
These characteristics of the standard deviation and the number of subjects within a sample are further illustrated by this distribution, which has a mean of 50 and a standard deviation of 10. As we stated, the mean plus and minus one standard deviation includes about 68% of the measures; in this case about 68% of the subjects fall between 40 and 60. The mean plus and minus two standard deviations includes 95% of the subjects; stated in other words, 95% of the subjects fall between the scores 30 and 70. And 99% of the subjects fall between 20 and 80, corresponding to the mean plus and minus three standard deviations.
−3SD  −2SD  −1SD   X̄   +1SD  +2SD  +3SD
 20    30    40    50    60    70    80
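A quick simulation (an illustration I am adding, not something from the slides) shows the 68-95-99 relationship for a normal distribution with the slide's mean of 50 and SD of 10:

```python
import random

random.seed(1)  # fixed seed so the proportions are reproducible
xs = [random.gauss(50, 10) for _ in range(10_000)]

def within(lo, hi):
    """Proportion of the simulated subjects falling between lo and hi."""
    return sum(lo <= x <= hi for x in xs) / len(xs)

print(within(40, 60))  # ~0.68  (mean ± 1 SD)
print(within(30, 70))  # ~0.95  (mean ± 2 SD)
print(within(20, 80))  # ~0.99  (mean ± 3 SD)
```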

16 Variance (s²)
A measure of variability; it will be discussed later.
The last measure of variability is the variance. The variance is equal to the standard deviation squared, and it will be discussed in a later module. It is not often used as a descriptive statistic the way the standard deviation is. The standard deviation is probably the most important statistic here because it is used in the calculation of so many other statistics, and it is used extensively in the interpretation of data as well. So it is important to know the standard deviation as a measure of variability and a descriptive statistic, but it is also vitally important for understanding many other statistics and procedures we will be discussing later.

17 Standard Error of the Mean (SEM)
How accurately a sample mean (X̄) estimates a population mean (μ)
Sample mean (X̄): unbiased estimator of the population mean (μ)
Sample standard deviation: biased estimator of the population standard deviation
Another very important statistic is the standard error of the mean, which indicates how accurately a sample mean estimates a population mean. Earlier we said that the sample and the population are different. If we take a sample and calculate its mean, its symbol is X̄; the mean of the population, which quite often we do not know, is symbolized by the Greek letter μ. When we take a sample from a population and calculate a mean, we use that sample mean to estimate the population mean. We know that samples vary from one to another, so there will be some variability among the means calculated from a large number of samples; that variability from sample to sample is what the standard error of the mean describes. The sample mean X̄ is said, in the statistical vernacular, to be an unbiased estimator of the population mean. In other words, consider all the values or measurements in a population. If we graphed them on a frequency polygon, these individual values would be approximately normal. Say they are the blood pressures of the pharmacists in the state of Georgia: if we determined the blood pressures of all the pharmacists in Georgia and graphed them, the result would be approximately normal. If we took a sample of pharmacists and measured their blood pressures, the chances are equal that some would fall on one side of the population mean and an equal number on the other side, so the mean calculated from the sample is said to be an unbiased estimator of the population mean. If we calculated the standard deviation of the sample, however, we would find that it is somewhat biased.
If we chose one sample from a population, the chances are pretty good that most of the subjects in the sample would come from the center of the distribution. It is unlikely that we would get exceedingly high or low measures, because there are so few of them. Therefore, when we calculate the standard deviation of a sample, it is going to be smaller than the standard deviation of the population. That is why, in the statistical vernacular, the sample standard deviation is said to be a biased estimator of the population standard deviation, while, as we said earlier, the sample mean is an unbiased estimator of the population mean.
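This bias can be seen in a simulation (an illustration I am adding; the population mean of 140 and SD of 10 are assumed values, loosely echoing the blood-pressure example): draw many small samples from a normal population and compare the average sample mean and the average sample SD (computed with N in the denominator) to the population parameters.

```python
import random

random.seed(0)
POP_MEAN, POP_SD, N = 140.0, 10.0, 5

def sd_with_n(xs):
    """Sample SD with N in the denominator (the biased form)."""
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

means, sds = [], []
for _ in range(2000):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    means.append(sum(sample) / len(sample))
    sds.append(sd_with_n(sample))

print(sum(means) / len(means))  # close to 140: the mean is unbiased
print(sum(sds) / len(sds))      # noticeably below 10: the SD is biased low
```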

18 N − 1 Formula for s
"N − 1" is sometimes used in the denominator to correct for the fact that the sample standard deviation is a biased estimator of the population standard deviation:
s = √(Σx² / N)   versus   s = √(Σx² / (N − 1))
As you encounter formulas for the standard deviation, you will sometimes see N − 1 in the denominator. If you recall the formula from a previous slide, the denominator was N. When inferences are to be made about the population from the sample, some statisticians use N − 1 in the denominator to overcome the bias in the sample standard deviation as an estimate of the population standard deviation. That is why you may see some formulas with N and some with N − 1 in the denominator of the standard deviation.
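Python's standard library happens to expose both conventions: `statistics.pstdev` divides by N, while `statistics.stdev` divides by N − 1. Using the six measures from the earlier slides:

```python
import statistics

data = [15, 13, 12, 10, 9, 7]
s_n  = statistics.pstdev(data)  # √(42 / 6) ≈ 2.65  (N in the denominator)
s_n1 = statistics.stdev(data)   # √(42 / 5) ≈ 2.90  (N − 1 in the denominator)
print(round(s_n, 2), round(s_n1, 2))
```

As expected, the N − 1 version is slightly larger, pushing the sample estimate upward toward the population value.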

19 Standard Error of the Mean Example
Determining the systolic blood pressure of the population of R.Ph.s in Georgia
N = 100, X̄ = 150 mm Hg, s = 25
Sample means: 1. 150 mm Hg, 2. 155 mm Hg, 3. 140 mm Hg, 4. 145 mm Hg, ...
Sampling distribution of the mean: mean of the X̄'s = μ = 150
To better understand the standard error of the mean, consider an example. Assume we want to know the average systolic blood pressure of the pharmacists in the state of Georgia, and to do that we draw a sample of 100 pharmacists from this very large population. In our sample, the mean comes out to 150 with a standard deviation of 25. The question is: how well does this sample mean estimate the population mean? The answer comes from the calculation of the standard error of the mean, and to understand it we must consider a hypothetical situation. Assume we did this study again, taking another sample of 100 pharmacists and calculating a mean. Chances are it would not be 150 again; as indicated on the slide, in our second study it might be 155. If we did the study again with another sample of 100, the mean might come out to 140; a fourth time, it might be 145. Now suppose we did this study over and over, an infinitely large number of times, each time taking a sample of 100 and calculating a mean, and then took all the sample means and constructed a frequency polygon. The result would be a normal distribution, as illustrated on the slide. That distribution has a name: the sampling distribution of the mean. It is the graphic representation of all the sample means of size 100 that could be chosen from this population. It has another characteristic: if we did this study over and over, hundreds and thousands of times, the mean of all those sample means would be equal to the population mean.
In other words, the mean of all the sample means would be the same as if we went out and measured the systolic blood pressures of all the pharmacists in the state of Georgia. This is a hypothetical situation, but understanding it will help us understand the standard error of the mean.

20 Sampling Distribution of the Mean
SEM = s / √N = 25 / √100 = 2.5
The SEM is the SD of the sampling distribution of the mean.
X̄ ± 1 SEM ≈ 68% (here, 147.5 to 152.5, with X̄ = μ = 150)
The standard error of the mean is calculated by taking the standard deviation of our one sample and dividing it by the square root of the number of subjects in the sample. In the one sample we chose, the sample size was 100 and the standard deviation was 25, so the standard error of the mean comes out to 2.5. The standard error of the mean is itself a standard deviation: it is the standard deviation of all the sample means about the mean of the population, that is, the standard deviation of the sampling distribution of the mean. In other words, if we did this study over and over again (which of course one would not do) and constructed a frequency polygon, the standard deviation of all those sample means would be equal to 2.5. We already know the characteristic of the standard deviation that the mean plus and minus one standard deviation includes about 68% of the subjects; the same is true for this normal distribution. The mean of the population plus or minus one standard deviation, which in this case is called the standard error of the mean, includes 68% of all the sample means of size 100 that we might draw. Taking the mean of 150 (considering that our sample mean is an unbiased estimate of the population mean) and going plus or minus one standard error of the mean, 2.5, gives 147.5 on the lower end and 152.5 on the upper end. What this means is that if we did the study over and over again an infinitely large number of times, 68% of the time we would get a mean that falls between 147.5 and 152.5.
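The SEM formula on this slide is a one-liner, using the sample values given (s = 25, N = 100, X̄ = 150):

```python
import math

s, n = 25.0, 100                  # sample SD and size from the slide
sem = s / math.sqrt(n)            # 25 / 10 = 2.5
mean = 150.0
print(mean - sem, mean + sem)     # 147.5 152.5  (≈ 68% of sample means)
```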

21 SEM Example
SEM = s / √N = 25 / √100 = 2.5
X̄ ± 2 SEM ≈ 95%
95% CI = X̄ ± 2(SEM) = 150 ± 2(2.5) = 150 ± 5 = 145 to 155
CI = Confidence Interval
Now let's take this example one step further. Our standard error of the mean is 2.5. The mean plus or minus two standard errors of the mean, like the mean plus or minus two standard deviations, includes 95% of the cases. This particular interval is known as the 95% confidence interval, and it is obtained by taking the mean plus and minus two standard errors of the mean. In this case the 95% confidence interval equals 150 ± 2(2.5), or 150 ± 5, which is 145 to 155. Stated in other words, if we did this study over and over again an infinitely large number of times, we would get means that fall between 145 and 155 about 95% of the time. Therefore we can be 95% confident that a sample mean would fall within this range. This is known as the 95% confidence interval. The narrower the confidence interval, the better our ability to conclude that our sample mean is a good estimator of the population mean; the wider the 95% confidence interval, the less confidence we have in the sample mean as an estimator of the population mean.
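The confidence-interval arithmetic can be sketched as below; note that the multiplier of 2 used on the slide is the usual classroom rounding of 1.96.

```python
mean, sem = 150.0, 2.5
ci_low, ci_high = mean - 2 * sem, mean + 2 * sem
print(ci_low, ci_high)  # 145.0 155.0  (the 95% confidence interval)
```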

22 SEM Example
The larger the sample size (e.g., N = 1000):
SEM = s / √N
The smaller the SEM
The smaller (narrower) the 95% CI
The more confidence we have in the sample mean estimating the population mean

23 SEM Example
The smaller the sample size (e.g., N = 9):
SEM = s / √N = 25 / √9 = 8.3
CI = 150 ± 2(8.3) = 150 ± 16.6 = 133.4 to 166.6
The larger the SEM
The wider the 95% CI
The less confidence we have in X̄ estimating μ
Let's look at what happens if we vary the number of subjects in our sample. We will keep the standard deviation the same (which would probably be the case whether we had a large or small sample) and increase the sample size from 100 to 1000. I think you can see that the standard error of the mean would become much smaller: mathematically, we would have a much larger number in the denominator. The larger the sample size, the smaller the standard error of the mean, and therefore the smaller the 95% confidence interval; in other words, it would be much narrower, and the more confidence we would have in the sample mean estimating the population mean. Stated another way, if we increase the number of subjects in a sample, the sampling distribution of the mean (the frequency polygon representing all possible sample means of that size) changes: as the sample size goes from 100 to 1000, the sampling distribution becomes more leptokurtic, meaning the variability is much less, the standard error of the mean is much less, and we can have more confidence in our sample mean estimating our population mean. Conversely, with a much smaller sample, say an N of 9 instead of an N of 100, we get a much larger standard error of the mean, a wider 95% confidence interval, and less confidence in our sample mean estimating our population mean. This module has dealt with statistics that describe the variability of data.
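The effect of sample size on the SEM (and hence on the width of the 95% CI) can be sketched by holding s = 25 fixed, as the slide does, and varying N:

```python
import math

def sem(s: float, n: int) -> float:
    """Standard error of the mean: s / √N."""
    return s / math.sqrt(n)

s = 25.0
for n in (9, 100, 1000):
    half_width = 2 * sem(s, n)  # half-width of the 95% CI
    print(n, round(sem(s, n), 2), round(half_width, 1))
```

As N grows, the SEM shrinks with √N, so going from 9 to 100 to 1000 subjects narrows the confidence interval sharply.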

