The Sample Variance © Chistine Crisp Edited by Dr Mike Hughes
Can you find the medians and means for the following 3 data sets? (Table: Sets A, B and C, each with its mean and median.) Although the medians and means are the same, the data sets are not really alike. The spread or variability of the numbers is quite different. How can we measure the spread within the data sets? ANS: The range and inter-quartile range both measure spread, but neither uses all the data items. We will do the inter-quartile range later, with cumulative frequency.
If you had to invent a method of measuring spread that used all the data items, what could you do? One thing we could do is find out how far each item is from the mean and add up these differences, that is, calculate $\sum (x - \bar{x})$. e.g. For Set A the differences add up to 0. Data sets B and C give the same result. The negative and positive values have cancelled each other out.
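The cancelling effect can be checked with a short Python sketch. The two data sets below are made up for illustration (the values of Sets A, B and C are not reproduced here), and `sum_of_deviations` is a hypothetical helper name.

```python
def sum_of_deviations(data):
    """Add up the signed differences between each item and the mean."""
    mean = sum(data) / len(data)
    return sum(x - mean for x in data)

# Hypothetical data sets: same mean (5), different spread.
narrow = [4, 5, 6]
wide = [1, 5, 9]

print(sum_of_deviations(narrow))  # negative and positive differences cancel
print(sum_of_deviations(wide))    # same total, despite the larger spread
```

Both totals are zero, so this quantity cannot tell the two data sets apart.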
To avoid the effect of the negative values we can either ignore the negative signs, or square each difference (since the squares will all be positive). Squaring is more convenient for developing theory, so we calculate $\sum (x - \bar{x})^2$. Let’s do this calculation for all 3 data sets:
Comparing the values of $\sum (x - \bar{x})^2$ for Set A, Set B and Set C: the larger value for Set B shows greater variability. Set C has the least variability. Can you see a snag with this measurement? ANS: The calculated value increases if we have more data, so comparing data sets with different numbers of items would not be possible. To allow for this, we need to take n, the number of items, into account.
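A small Python sketch of the snag, using made-up data (an assumption, not the slide's data sets): repeating every item leaves the spread unchanged but doubles the sum of squared differences. Dividing by the number of items removes the effect.

```python
def sum_of_squared_deviations(data):
    """Sum of the squared differences between each item and the mean."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)

sample = [1, 5, 9]        # hypothetical data
repeated = sample * 2     # same spread, twice as many items

total_small = sum_of_squared_deviations(sample)
total_large = sum_of_squared_deviations(repeated)

print(total_small, total_large)                                # the total grows with n
print(total_small / len(sample), total_large / len(repeated))  # per-item values agree
```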
There are 2 formulae that can be used: the mean square deviation, $\text{msd} = \frac{\sum (x - \bar{x})^2}{n}$, or the sample variance, $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$. Our data is nearly always a sample from a large unknown set of data (the population) and we take samples to find out about the population. The 1st formula does not give the best estimate of the variance of the population, so is not used.
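So, there are 2 quantities, and their square roots, that we need to be clear about:
the mean square deviation, $\text{msd} = \frac{\sum (x - \bar{x})^2}{n}$, also called the POPULATION VARIANCE, and the root mean square deviation, $\text{rmsd} = \sqrt{\text{msd}}$, also called the POPULATION STANDARD DEVIATION;
the sample variance, $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$, and the sample standard deviation, $s = \sqrt{s^2}$.
We nearly ALWAYS use the sample variance and the sample standard deviation.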
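The four quantities can be written out directly from their definitions. This is only a sketch: the function names and the data list are illustrative assumptions, not part of the slides.

```python
import math

def s_xx(data):
    """Sum of squared differences from the mean."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)

def msd(data):
    """Mean square deviation: divide by n."""
    return s_xx(data) / len(data)

def sample_variance(data):
    """Sample variance: divide by n - 1."""
    return s_xx(data) / (len(data) - 1)

def rmsd(data):
    return math.sqrt(msd(data))

def sample_sd(data):
    return math.sqrt(sample_variance(data))

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data with mean 5
print(msd(data), rmsd(data))
print(sample_variance(data), sample_sd(data))
```

For any data set with some spread, the sample values come out larger than the msd and rmsd, because the same total is divided by a smaller number.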
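e.g. Find the rmsd and msd of the following data:
x: 7, 9, 14
Mean, $\bar{x} = 10$
(i) $\text{msd} = \frac{\sum (x - \bar{x})^2}{n} = \frac{(-3)^2 + (-1)^2 + 4^2}{3} = \frac{26}{3}$, so msd = 8·67 and rmsd = 2·94 (3 s.f.)
(ii) $\text{msd} = \frac{\sum x^2}{n} - \bar{x}^2 = \frac{326}{3} - 100 = \frac{26}{3}$
The 2nd form is exactly the same as the first form but quicker to use!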
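e.g. Find the sample SD and variance of the following data:
x: 7, 9, 14
Mean, $\bar{x} = 10$
(i) $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{(-3)^2 + (-1)^2 + 4^2}{2} = 13$, so s = 3·61 (3 s.f.)
(ii) $s^2 = \frac{\sum x^2 - n\bar{x}^2}{n - 1} = \frac{326 - 300}{2} = 13$
The 2nd form is in general quicker to use.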
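As a check that the two forms agree, here is a minimal Python sketch. It assumes the data row reads 7, 9, 14, and the variable names are illustrative.

```python
data = [7, 9, 14]   # assumed reading of the data row
n = len(data)
mean = sum(data) / n

# 1st form: square each difference from the mean, then add.
s_xx_first = sum((x - mean) ** 2 for x in data)

# 2nd form: sum of squares minus (sum squared) / n.
s_xx_second = sum(x * x for x in data) - sum(data) ** 2 / n

msd = s_xx_second / n                    # divide by n
sample_variance = s_xx_second / (n - 1)  # divide by n - 1

print(s_xx_first, s_xx_second)
print(msd, sample_variance)
```

The second form needs only the totals of x and x squared, which is why it is usually quicker by hand.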
This all seems very complicated but help is at hand. Both quantities, rmsd and s, are given by your calculator.
e.g. Find the root mean square deviation, rmsd, and the sample standard deviation, s, for the following data:
x: 7, 9, 14
Use the Statistics function on your calculator and enter the data. Select the list of calculations. You will be able to find both the rmsd and s. Correct to 3 s.f. we have rmsd = 2·94 and s = 3·61. The rmsd is smaller than s (because we are dividing by a larger number).
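For readers working on a computer rather than a calculator, Python's standard `statistics` module plays the same role: `pstdev` divides by n (the rmsd) and `stdev` divides by n - 1 (the sample standard deviation). The sketch assumes the data row reads 7, 9, 14.

```python
import statistics

data = [7, 9, 14]   # assumed reading of the data row

rmsd = statistics.pstdev(data)  # divides by n
s = statistics.stdev(data)      # divides by n - 1

# Format both values to 3 significant figures.
print(f"rmsd = {rmsd:.3g}, s = {s:.3g}")  # prints: rmsd = 2.94, s = 3.61
```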
So, for the data x: 7, 9, 14 we have rmsd = 2·94 and s = 3·61 (3 s.f.). Squaring the full calculator values gives $s^2 = 13$ (sample variance) and msd = 8·67 (mean square deviation). The part of the formula, $\sum (x - \bar{x})^2$, is in your formulae sheet, labelled $S_{xx}$ (said as "sum of squares x x"). An expanded form of the expression is also given: $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$. All you have to do is divide by the correct quantity.
SUMMARY
The mean square deviation, msd, and the sample variance, $s^2$, both measure the spread or variability in the data.
If we have raw data, we use the statistical functions on the calculator to find the rmsd or the sample standard deviation, s.
To find the msd or sample variance, we square the relevant quantity given by the calculator: msd = (rmsd)$^2$ and sample variance = $s^2$.
Your formulae sheet gives the formula $S_{xx} = \sum (x - \bar{x})^2$ or its equivalent, $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$. Then we divide by n for the msd or by (n – 1) for $s^2$.
The sample standard deviation is larger than the rmsd because we divide by (n – 1) rather than n.
Frequency Data
The formula for the variance can be easily adapted to find the variance of frequency data.
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ becomes, for FREQUENCY DATA, $s^2 = \frac{\sum f(x - \bar{x})^2}{\sum f - 1}$
We usually only use the formulae if we are given summary data. With raw data we enter the data into the calculator and use the statistical functions to get the answers directly.
Frequency Data
But note that n becomes $\sum f$, so $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$ becomes $S_{xx} = \sum fx^2 - \frac{(\sum fx)^2}{\sum f}$
Frequency Data
So $\text{msd} = \frac{S_{xx}}{n}$ and variance $s^2 = \frac{S_{xx}}{n - 1}$ become $\text{msd} = \frac{S_{xx}}{\sum f}$ and $s^2 = \frac{S_{xx}}{\sum f - 1}$
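The frequency versions of the formulae can be sketched in Python as follows. The helper name `frequency_s_xx` and the small table are assumptions made for illustration only.

```python
def frequency_s_xx(values, freqs):
    """Return (S_xx, n) for a frequency table, where n is the total frequency."""
    n = sum(freqs)
    sum_fx = sum(f * x for x, f in zip(values, freqs))
    sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))
    return sum_fx2 - sum_fx ** 2 / n, n

# Hypothetical table: x = 1, 2, 3 with frequencies 1, 2, 1.
values = [1, 2, 3]
freqs = [1, 2, 1]

s_xx, n = frequency_s_xx(values, freqs)
msd = s_xx / n                    # divide by the total frequency
sample_variance = s_xx / (n - 1)  # divide by the total frequency minus 1

print(s_xx, msd, sample_variance)
```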
e.g.1 Find the mean and sample standard deviation of the following data:
x: 1, 2, 5, 10
Frequency, f: 3, 5, 8, 4
Solution: Using the calculator functions, the mean, $\bar{x}$ = 4·65 and the sample standard deviation, s = 3·17 (3 s.f.). Although we don’t need the formula for this question, let’s check we have the correct value by using the formula:
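e.g.1 (continued) Find the mean and sample standard deviation of the following data:
x: 1, 2, 5, 10
Frequency, f: 3, 5, 8, 4
Solution: $\sum f = 20$, $\sum fx = 93$ and $\sum fx^2 = 623$, so $S_{xx} = 623 - \frac{93^2}{20}$ = 190·55.
So, $s^2 = \frac{S_{xx}}{\sum f - 1}$ = 190·55 ÷ 19 = 10·0 and s = 3·17 (3 s.f.)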
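The working for e.g.1 can be cross-checked in Python. The sketch assumes the table reads x = 1, 2, 5, 10 with frequencies 3, 5, 8, 4, applies the frequency formula, and then compares the answer with `statistics.stdev` on the table expanded into raw data.

```python
import math
import statistics

x_values = [1, 2, 5, 10]   # assumed reading of the x row
freqs = [3, 5, 8, 4]       # assumed reading of the frequency row

n = sum(freqs)
sum_fx = sum(f * x for x, f in zip(x_values, freqs))
sum_fx2 = sum(f * x * x for x, f in zip(x_values, freqs))

mean = sum_fx / n
s_xx = sum_fx2 - sum_fx ** 2 / n
s = math.sqrt(s_xx / (n - 1))

# Cross-check: write each value out f times and treat it as raw data.
raw = [x for x, f in zip(x_values, freqs) for _ in range(f)]
s_check = statistics.stdev(raw)

print(n, sum_fx, sum_fx2)
print(mean, round(s, 2), round(s_check, 2))
```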
e.g.2 Find the sample standard deviation of the following lengths: (Table: Length (cm) against Frequency, f.)
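e.g.2 Find the sample standard deviation of the following lengths: (Table: Length (cm), class mid-value x, Frequency f, with working columns for $x^2$, $fx$ and $fx^2$.)
Solution: We need the class mid-values, x. The standard deviation, s, is then found from the mid-values and the frequencies.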
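A sketch of the mid-value method in Python. The classes and frequencies below are hypothetical (the lengths table is not reproduced here); each class is replaced by its mid-value and the frequency formula is then applied as usual.

```python
import math

# Hypothetical grouped data: (lowest length, highest length) for each class.
classes = [(1, 10), (11, 20), (21, 30)]
freqs = [2, 5, 3]

# The mid-value of a class is halfway between its two ends.
mid_values = [(low + high) / 2 for low, high in classes]

n = sum(freqs)
sum_fx = sum(f * x for x, f in zip(mid_values, freqs))
sum_fx2 = sum(f * x * x for x, f in zip(mid_values, freqs))

s_xx = sum_fx2 - sum_fx ** 2 / n
s = math.sqrt(s_xx / (n - 1))

print(mid_values)
print(round(s, 2))
```

Because every item in a class is treated as if it sat at the mid-value, the result is an estimate of the standard deviation of the original lengths.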
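e.g.3 Find the mean and sample variance of 20 values of x given summary totals for the data.
Solution: Since we only have summary data, we must use the formulae:
sample mean, $\bar{x} = \frac{\sum x}{n}$ and sample variance, $s^2 = \frac{S_{xx}}{n - 1}$, where $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$ and n = 20.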
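With summary data the calculation needs only three numbers. In this sketch the totals passed in (a sum of 100 and a sum of squares of 595 for 20 values) are invented for illustration, and `from_summary` is a hypothetical helper.

```python
def from_summary(n, sum_x, sum_x2):
    """Return (mean, sample variance) from n, the sum of x and the sum of x squared."""
    mean = sum_x / n
    s_xx = sum_x2 - sum_x ** 2 / n
    return mean, s_xx / (n - 1)

# Hypothetical summary totals for 20 values of x.
mean, variance = from_summary(n=20, sum_x=100, sum_x2=595)
print(mean, variance)
```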
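SUMMARY
Raw data: $\text{msd} = \frac{S_{xx}}{n}$ and $s^2 = \frac{S_{xx}}{n - 1}$, where $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$
Frequency data: $\text{msd} = \frac{S_{xx}}{\sum f}$ and $s^2 = \frac{S_{xx}}{\sum f - 1}$, where $S_{xx} = \sum fx^2 - \frac{(\sum fx)^2}{\sum f}$
Take the square root for the rmsd and the sample standard deviation.
The MSD is called the POPULATION VARIANCE. The RMSD is called the POPULATION STANDARD DEVIATION.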
Exercise
Find the mean, sample standard deviation and sample variance for each of the following samples, using calculator functions where appropriate: a frequency table for x = 1, 2, 3, 4, 5; a grouped frequency table of Time (mins); and a set of observations for which only summary data are given.
Answer: For the table with x = 1, 2, 3, 4, 5, find the mean, the standard deviation, s, and the variance. For the Time (mins) table, use the class mid-values, x, then find the mean, the standard deviation, s, and the variance.
N.B. To find the variance we need to use the full calculator value for s, not the answer to 3 s.f.
Solution for the observations given as summary data: mean, $\bar{x} = \frac{\sum x}{n}$; variance, $s^2 = \frac{S_{xx}}{n - 1}$; standard deviation, $s = \sqrt{s^2}$.
There are 2 formulae that can be used to measure spread: the mean square deviation, $\text{msd} = \frac{\sum (x - \bar{x})^2}{n}$, or the sample variance, $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$. In many books you will find the word variance used for the 1st of these formulae and you may have used it at GCSE. However, our data is nearly always a sample from a large unknown set of data (the population) and we take the sample to find out about the population. The 1st formula does not give the best estimate of the variance of the population, so is not used.
e.g. Find the root mean square deviation, rmsd, and the sample standard deviation, s, for the following data:
x: 7, 9, 14
Use the Statistics function on your calculator and enter the data. Select the list of calculations. You will be able to find both the rmsd and s. Ignore the calculator notation. Correct to 3 s.f. we have rmsd = 2·94 and s = 3·61. The rmsd is smaller than s (because we are dividing by a larger number).
Squaring the rmsd gives the mean square deviation and squaring s gives the variance.
Using the formulae: If summary data are given, you will need to use the formulae instead of the calculator functions. The part of the formula, $\sum (x - \bar{x})^2$, is in your formulae booklet (see correlation and regression), labelled $S_{xx}$. An expanded form of the expression is also given: $S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n}$. All you have to do is divide by the correct quantity, n or n – 1.
SUMMARY
The mean square deviation, msd, and the sample variance, $s^2$, both measure the spread or variability in the data.
If we have raw data, we use the stats functions on the calculator to find the rmsd or the sample standard deviation, s.
To find the msd or sample variance, we square the relevant quantity given by the calculator: msd = (rmsd)$^2$ and sample variance = $s^2$.
For summary data, we use the formulae book, choosing the appropriate form of $S_{xx}$. Then we divide by n for the msd or by (n – 1) for $s^2$.
The sample standard deviation is the larger of the two standard deviations.
e.g.1 For the following sample data, find (a) the root mean square deviation, rmsd, (b) the mean square deviation, msd, (c) the sample standard deviation, s, and (d) the sample variance, $s^2$. (Table: values of x.)
Answer: Using the calculator functions, (a) and (c) are read from the list of calculations, and squaring them gives (b) and (d).
e.g.2 Given the following summary of data for a sample of size 5, find (a) the mean square deviation, msd, (b) the root mean square deviation, rmsd, (c) the sample variance, $s^2$, and (d) the sample standard deviation, s.
Solution: Using the formulae book, (a) $\text{msd} = \frac{S_{xx}}{n} = \frac{S_{xx}}{5}$ (b) $\text{rmsd} = \sqrt{\text{msd}}$ (c) $s^2 = \frac{S_{xx}}{n - 1} = \frac{S_{xx}}{4}$ (d) $s = \sqrt{s^2}$
Frequency Data
The formula for the variance can be easily adapted to find the variance of frequency data: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ becomes $s^2 = \frac{\sum f(x - \bar{x})^2}{\sum f - 1}$. As before, we only use the formulae if we are given summary data.
e.g.2 Find the sample standard deviation of the following lengths: (Table: Length (cm) against Frequency, f.)
Solution: We need the class mid-values, x. We can now enter the values of x and f on our calculators to find the standard deviation, s.
SUMMARY
To find the root mean square deviation, rmsd, or the sample standard deviation, s, using the calculator functions:
the values of x (and f) are entered and checked,
the table of calculations gives both values,
the larger value is the sample standard deviation, s, and this is the value that is most often used by statisticians,
the variance is the square of the standard deviation.
Outliers
We’ve already seen that an outlier is a data item that lies well away from the other data. It may be a genuine observation or an error in the data.
e.g. 1 Consider a data set in which one value, 81, lies well away from the rest. With this data set, we would immediately suspect an error. The value 81 is likely to have been 18. An error like this has a large effect on the mean and standard deviation, although the median is not affected and there is little effect on the IQR. The presence of possible outliers is an argument in favour of using the median and IQR as measures of data.
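The effect of such an error can be seen in a short Python sketch. The two lists are made up for illustration (the slide's own data are not reproduced here); they differ only in that 18 has been mistyped as 81.

```python
import statistics

# Hypothetical data: the second list contains 81 where 18 was intended.
correct = [12, 14, 15, 16, 17, 18, 20]
with_error = [12, 14, 15, 16, 17, 81, 20]

for label, data in (("correct", correct), ("with error", with_error)):
    print(
        label,
        "mean =", statistics.mean(data),
        "median =", statistics.median(data),
        "s =", round(statistics.stdev(data), 2),
    )
```

In this made-up example the median stays the same while the mean and the sample standard deviation both increase noticeably.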
In an earlier section, we met a method of identifying outliers using a measure of 1·5 IQR above or below the median. A 2nd method used to identify outliers is to find points that are further than 2 standard deviations from the mean.
e.g. 2 Consider a sample that includes the value 33. We find the sample mean, $\bar{x}$, and the sample standard deviation, s, and then calculate $\bar{x} + 2s$ and $\bar{x} - 2s$. The point 33 is more than 2 standard deviations above the mean so, using this measure, it is an outlier.
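The 2 standard deviation rule is easy to sketch in Python. The sample below is hypothetical (chosen so that it contains the value 33), and `outliers_by_two_sd` is an illustrative helper name.

```python
import statistics

def outliers_by_two_sd(sample):
    """Return the items more than 2 sample standard deviations from the mean."""
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)
    lower, upper = mean - 2 * s, mean + 2 * s
    return [x for x in sample if x < lower or x > upper]

# Hypothetical sample containing one suspiciously large value.
sample = [10, 12, 13, 14, 15, 16, 17, 33]
print(outliers_by_two_sd(sample))  # prints: [33]
```

Note that the outlier itself inflates both the mean and s, so with very small samples this rule can fail to flag a value that looks extreme.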