Data description
Statistics
A statistic is a number calculated from the values of one or more variables in a sample. Various statistics are routinely used to describe samples. The following data give the total cost of drugs (in Burundi francs) received by 84 adults aged 20-29 visiting five different health centres in the Muyinga province of Burundi in 1991-2.
… The data
… There are many statistics that one could calculate from these data - the values of some of the more common ones are listed in the following table.
Medians
The median is the value that halves the distribution: 50% of the values lie below it and 50% lie above it. So, for example, in the class of 15 children below, the median height is 121 cm.
… The median by itself is of limited use, so we also find the upper (Qu) and lower (Ql) quartiles, which together with the median (the middle quartile) split the data into four equal parts. An idea of the spread is given by calculating the inter-quartile range, IQR = Qu - Ql. For the child height data, the upper quartile is 134 cm, the lower quartile is 111 cm and the IQR is 23 cm.
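The median and quartiles quoted above can be checked with a short sketch in Python. The list of heights is an assumption: it is taken from the sum in the Means section, with 111 entered three times as stated in the Mode section, because the data table itself is not reproduced here. The standard library's default ("exclusive") quartile method happens to match the quoted values; other conventions can give slightly different quartiles.

```python
import statistics

# Assumed data: 15 child heights in cm (111 entered three times, see lead-in).
heights = [103, 104, 107, 111, 111, 111, 119, 121,
           124, 127, 133, 134, 137, 140, 150]

# The median is the middle (8th) value of the 15 sorted heights.
median = statistics.median(heights)

# quantiles(n=4) returns the three cut points: lower quartile, median, upper quartile.
q_lower, q_middle, q_upper = statistics.quantiles(heights, n=4)

# The inter-quartile range is the distance between the outer quartiles.
iqr = q_upper - q_lower

print(f"median = {median} cm")
print(f"Ql = {q_lower} cm, Qu = {q_upper} cm, IQR = {iqr} cm")
```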
Means
The arithmetic mean is the most commonly used measure of the central value of a distribution. It is the sum of the observations divided by N (the number of observations).
… In the child height example, the mean is (103+104+107+111+111+111+119+121+124+127+133+134+137+140+150)/15 = 1832/15 = 122.13 cm. This value is very close to the median (121 cm); this will generally be the case when the data are distributed roughly symmetrically around the central value.
… When, however, the data contain a few extreme values, the mean and the median can be very different. Normal practice is then to use the median, as it is far more robust to extreme values. The mean, however, uses all the information that has been collected, possibly at great time and expense, and so is extensively used. It is possible to transform the data to make the distribution more symmetric, so that the mean can be used.
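To see how a single extreme value affects the two measures, here is a minimal sketch. The numbers are hypothetical and are not taken from the drug cost or height data. The log transform at the end is also an assumption: the notes do not name a particular transformation, and the log is just one common choice for right-skewed data.

```python
import math
import statistics

# Hypothetical values (not from the source data): five typical values plus one extreme value.
typical = [20, 22, 23, 25, 27]
with_outlier = typical + [200]

mean_typical = statistics.mean(typical)
median_typical = statistics.median(typical)
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

# Assumed transformation: take natural logs, average them, then transform back.
# The back-transformed value is the geometric mean of the data.
log_values = [math.log(x) for x in with_outlier]
back_transformed_mean = math.exp(statistics.mean(log_values))

print(f"without extreme value: mean = {mean_typical:.2f}, median = {median_typical}")
print(f"with extreme value:    mean = {mean_outlier:.2f}, median = {median_outlier}")
print(f"mean after log transform (back-transformed) = {back_transformed_mean:.2f}")
```

The median barely moves when the extreme value is added, whereas the mean more than doubles. The back-transformed mean lies between the two.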
Mode
The mode is the ‘most frequent’ observation. For example, in the drug cost example it is 45.4 (which occurs 9 times); in the child height example it is 111 (which occurs 3 times).
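As a quick check of the child height mode, the following sketch counts how often each value occurs. The list of heights is an assumption reconstructed from the sum in the Means section, with 111 entered three times as stated above.

```python
import statistics
from collections import Counter

# Assumed data: 15 child heights in cm (111 entered three times, see lead-in).
heights = [103, 104, 107, 111, 111, 111, 119, 121,
           124, 127, 133, 134, 137, 140, 150]

# mode() returns the most frequent value.
mode = statistics.mode(heights)

# Counter tells us how many times that value occurs.
mode_count = Counter(heights)[mode]

# multimode() lists every value tied for most frequent, which is useful
# when a data set has more than one mode.
all_modes = statistics.multimode(heights)

print(f"mode = {mode} cm (occurs {mode_count} times)")
```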
In Excel
Suppose we have the number of clients placed by an employment agency over a period of 11 working days.
The mean can be found using the AVERAGE function: =AVERAGE(B2:N2), which gives 27.
The median can be found using =MEDIAN(B2:N2), which gives 23.
The interquartile range is =QUARTILE(B2:N2,3)-QUARTILE(B2:N2,1), which gives 20.
The mode is =MODE(B2:N2), which gives 15.
Weighted averages
Suppose that 60% and 70% were obtained in two assignments for this course (well done!). The simple average mark would be (60+70)/2 = 65%. However, if the second assignment was deemed to be more important, it might be given a higher ‘weight’ than the first. Assume that the second assignment is given a weight of 0.7; the first must then have a weight of 0.3 (as the weights must sum to 1).
… To calculate the overall average we multiply each mark by its weight and then add the weighted marks together: (0.3*60%)+(0.7*70%) = 18%+49% = 67%. This is 2 percentage points higher than the simple average: it pays to score higher marks in the more heavily weighted assignments!
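The same calculation can be sketched in Python. The function name weighted_average is hypothetical and chosen only for illustration; the marks and weights are those of the example above.

```python
import math

def weighted_average(values, weights):
    """Multiply each value by its weight and add up the results."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    # The weights must sum to 1 (allowing for floating-point rounding).
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must sum to 1")
    return sum(w * v for v, w in zip(values, weights))

marks = [60, 70]        # marks for the two assignments, in percent
weights = [0.3, 0.7]    # the second assignment counts for more

simple = sum(marks) / len(marks)
weighted = weighted_average(marks, weights)

print(f"simple average   = {simple:.0f}%")
print(f"weighted average = {weighted:.0f}%")
```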
In Excel
Note that wa and wb are named cells.
COUNTIF
Now suppose we want to see how many of the students passed the course. The pass mark is 40% (put into cell D2 and named passmark).
We can then use IF to see whether a student passed: =IF(C4>passmark, "Pass", "Fail")
Finally, we can count the number of passes and fails using COUNTIF:
Passes: =COUNTIF(D4:D212, "=Pass")
Fails: =COUNTIF(D4:D212, "=Fail")
With the pass and fail counts in E2 and F2, the pass and fail rates will then be =E2/(E2+F2) and =F2/(E2+F2).
… This is example 8.6 from Whigham (p143, W8_2.xls), which you might like to try for yourselves.
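For comparison, the IF and COUNTIF steps can be sketched in Python. The marks below are hypothetical and are not the data in W8_2.xls. The comparison uses a strict ">" to mirror the Excel formula above, so a mark exactly equal to the pass mark is labelled "Fail".

```python
PASSMARK = 40  # pass mark in percent, as in the named cell passmark

# Hypothetical marks for six students (not the data from W8_2.xls).
marks = [35, 40, 52, 67, 38, 81]

# Equivalent of =IF(C4>passmark, "Pass", "Fail") applied to every student.
results = ["Pass" if mark > PASSMARK else "Fail" for mark in marks]

# Equivalent of the two COUNTIF formulas.
passes = results.count("Pass")
fails = results.count("Fail")

# Pass and fail rates as proportions of all students.
pass_rate = passes / (passes + fails)
fail_rate = fails / (passes + fails)

print(f"passes = {passes}, fails = {fails}")
print(f"pass rate = {pass_rate:.0%}, fail rate = {fail_rate:.0%}")
```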