Where are we? Measure of central tendency FETP India
Competency to be gained from this lecture Calculate a measure of central tendency that is suited to the sample studied
Key issues Measures of central tendency Mode Median Mean Geometric mean Appropriate applications
Summary statistics A single value that summarizes the observed value of a variable Part of the data reduction process Two types: Measures of location/central tendency/average Measures of dispersion/variability/spread Describe the shape of the distribution of a set of observations Necessary for precise and efficient comparisons of different sets of data The location (average) and shape (variability) of different distributions may be different
Different variability, same location
Different location, same variability
Quick definitions of measures of central tendency Mode: the most frequently occurring observation Median: the mid-point of a set of ordered observations Arithmetic mean: the sum of the observations divided by the number of observations
The mode Definition The mode of a distribution is the value that is observed most frequently in a given set of data How to obtain it? Arrange the data in sequence from low to high Count the number of times each value occurs The most frequently occurring value is the mode Mode
The mode [Figure: frequency distribution (N on the vertical axis) with the mode marked]
Examples of mode (1/2): Annual salary (in 100,000 rupees) 4, 3, 3, 2, 3, 8, 4, 3, 7, 2 Arranging the values in order: 2, 2, 3, 3, 3, 3, 4, 4, 7, 8 The mode is 3, which occurs four times
Examples of mode (2/2): Incubation period for hepatitis affected persons (in days) 29, 31, 24, 29, 30, 25 Arranging the values in order: 24, 25, 29, 29, 30, 31 Mode is 29 Mode
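The counting procedure used in the two examples above can be sketched in Python. This is a minimal illustration, not part of the original lecture; the helper name `mode_of` is an assumption introduced here.

```python
from collections import Counter


def mode_of(values):
    """Return (value, count) for the most frequently occurring observation.

    If several values tie for the highest count, only the first one
    encountered is returned.
    """
    counts = Counter(values)                  # how many times each value occurs
    value, count = counts.most_common(1)[0]   # the highest count
    return value, count


salaries = [4, 3, 3, 2, 3, 8, 4, 3, 7, 2]    # annual salary, in 100,000 rupees
incubation = [29, 31, 24, 29, 30, 25]        # incubation period, in days

print(mode_of(salaries))     # (3, 4): the value 3 occurs four times
print(mode_of(incubation))   # (29, 2): the value 29 occurs twice
```

Sorting is not needed when a computer does the counting; arranging the data in order mainly helps when counting by hand.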
The mode is the only location statistic that can be used when a characteristic cannot be measured numerically Example: colour preference of people for their cars
Colour preference   Number of people
Green               354
Blue                852
Gray                310
Red                 474
The modal colour is Blue (852 people)
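For a characteristic that is counted rather than measured, the mode is simply the category with the largest count. A minimal sketch using the colour-preference figures from the slide (the variable names are assumptions made for illustration):

```python
# Number of people preferring each car colour (figures from the slide).
colour_counts = {"Green": 354, "Blue": 852, "Gray": 310, "Red": 474}

# The mode is the category with the highest frequency.
modal_colour = max(colour_counts, key=colour_counts.get)

print(modal_colour, colour_counts[modal_colour])   # Blue 852
```

A mean or median of these colours would have no meaning, because the categories have no numerical value or natural order.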
Specific features of the mode There may be no mode When each value is unique There may be more than one mode When more than 1 peak occurs Bimodal distribution The mode can be misinterpreted Is a distribution skewed or bimodal ? The mode is not amenable to statistical tests The mode is not based upon all observations Mode
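The "no mode" and "more than one mode" situations can be illustrated with a short sketch. The two data sets below are hypothetical and the helper `modes` is an assumption introduced here; `statistics.multimode` is available in the Python standard library from version 3.8.

```python
from statistics import multimode


def modes(values):
    """Return the list of modes; an empty list means there is no mode."""
    if len(set(values)) == len(values):
        return []                 # every value is unique: no mode
    return multimode(values)      # all values that share the highest count


bimodal = [1, 2, 2, 3, 4, 4, 5]   # hypothetical data with two peaks
all_unique = [1, 2, 3, 4]         # hypothetical data, each value occurs once

print(modes(bimodal))      # [2, 4]
print(modes(all_unique))   # []
```

A list with two or more modes is a prompt to plot the data before summarising it with a single number.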
The median The median describes literally the middle value of the data It is defined as the value above or below which half (50%) the observations fall Median
Computing the median Arrange the observations in order from smallest to largest (ascending order) or vice versa Count the number of observations “n” If “n” is an odd number: Median = value of the ((n+1)/2)th observation If “n” is an even number: Median = the average of the (n/2)th and ((n/2)+1)th observations
Computing the median: example What is the median of the following values: 10, 20, 12, 3, 18, 16, 14, 25, 2? Arrange the numbers in increasing order: 2, 3, 10, 12, 14, 16, 18, 20, 25 Median = 14 (the 5th of 9 values) Suppose there is one more observation (8): 2, 3, 8, 10, 12, 14, 16, 18, 20, 25 Median = mean of 12 and 14 = 13 (the 5th and 6th of 10 values)
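The odd/even rule for the median can be written as a short function. This is a sketch for illustration (the function name `median_by_rank` is an assumption); the standard library's `statistics.median` gives the same results.

```python
def median_by_rank(values):
    """Median using the rank rule: middle value if n is odd,
    average of the two middle values if n is even."""
    ordered = sorted(values)          # arrange in ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2                      # 0-based index of the upper middle value
    if n % 2 == 1:
        return ordered[mid]           # the ((n+1)/2)th observation
    # average of the (n/2)th and ((n/2)+1)th observations
    return (ordered[mid - 1] + ordered[mid]) / 2


values = [10, 20, 12, 3, 18, 16, 14, 25, 2]
print(median_by_rank(values))         # 14
print(median_by_rank(values + [8]))   # 13.0
```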
Advantages and disadvantages of the median Advantages The median is unaffected by extreme values Disadvantages The median does not contain information on the other values of the distribution Only selected by its rank You can change 50% of the values without affecting the median The median is less amenable to statistical tests Median
The median is not sensitive to extreme values [Figure: two distributions that differ in their extreme values but share the same median]
Mean (Arithmetic mean / Average) Most commonly used measure of location Definition Calculated by adding all observed values and dividing by the total number of observations Notations Each observation is denoted as $x_1, x_2, \ldots, x_n$ The total number of observations: $n$ Summation process: $\sum$ (sigma) The mean: $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} x_i$ Mean
Computation of the mean Duration of stay in days in a hospital: 8, 25, 7, 5, 8, 3, 10, 12, 9 9 observations (n = 9) Sum of all observations = 87 Mean duration of stay = 87 / 9 = 9.67 Incubation period in days of a disease: 8, 45, 7, 5, 8, 3, 10, 12, 9 9 observations (n = 9) Sum of all observations = 107 Mean incubation period = 107 / 9 = 11.89
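The two computations on this slide can be checked with a few lines of Python. This is a minimal sketch; the function name `arithmetic_mean` is an assumption made for illustration.

```python
def arithmetic_mean(values):
    """Sum of all observations divided by the number of observations."""
    return sum(values) / len(values)


stay = [8, 25, 7, 5, 8, 3, 10, 12, 9]         # duration of stay, in days
incubation = [8, 45, 7, 5, 8, 3, 10, 12, 9]   # incubation period, in days

print(sum(stay), round(arithmetic_mean(stay), 2))               # 87 9.67
print(sum(incubation), round(arithmetic_mean(incubation), 2))   # 107 11.89
```

The two series differ in a single observation (25 versus 45), yet the mean moves from 9.67 to 11.89, which shows how one large value pulls the mean.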
Advantages and disadvantages of the mean Advantages Has a lot of good theoretical properties Used as the basis of many statistical tests Good summary statistic for a symmetrical distribution Disadvantages Less useful for an asymmetric distribution Can be distorted by outliers, therefore giving a less “typical” value Mean
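Mean of several groups combined [Table of group sizes and group means not recovered] Mean of all groups = total of all observations / total number of observations = 2000 / 50 = 40 Crude average (unweighted mean of the group means) = 39.7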
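The difference between the combined mean and the crude average can be sketched as follows. The group sizes and group means below are hypothetical, chosen only so that the totals match the 2000 / 50 on the slide; they are not the values from the original table, so the crude average here differs from the slide's 39.7.

```python
# Hypothetical groups as (group size, group mean); NOT the slide's table.
groups = [(10, 30.0), (15, 40.0), (25, 44.0)]

total_n = sum(n for n, _ in groups)            # 50 observations in all
total_sum = sum(n * m for n, m in groups)      # 2000.0, sum of all observations

combined_mean = total_sum / total_n            # weights each group by its size
crude_average = sum(m for _, m in groups) / len(groups)   # ignores group sizes

print(combined_mean)   # 40.0
print(crude_average)   # 38.0
```

The two values agree only when all groups have the same size or the same mean; otherwise the combined mean is the one that describes the pooled observations.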
The geometric mean Background Some distributions appear symmetric after log transformation (e.g., neutrophil counts) A log transformation may help to describe the central tendency Definition The geometric mean is the antilog of the mean of the log values Geometric mean
Calculating a geometric mean Observe the set of observations: 5, 10, 20, 25, 40 Take the logarithm (base 10) of these values: 0.70, 1.00, 1.30, 1.40, 1.60 Calculate the mean of the log values: sum = 6.00, 6.00 / 5 = 1.20 Take the antilog of the mean of the log values: antilog(1.20) = 15.85 = geometric mean
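The log, mean, antilog steps can be followed in Python. This is a sketch assuming base-10 logarithms, as in the slide's worked values; the function name `geometric_mean_log10` is an assumption made for illustration. The standard library's `statistics.geometric_mean` (Python 3.8+) returns the same result.

```python
import math


def geometric_mean_log10(values):
    """Antilog of the mean of the base-10 logs (values must be positive)."""
    logs = [math.log10(v) for v in values]   # take the logarithm of each value
    mean_log = sum(logs) / len(logs)         # mean of the log values
    return 10 ** mean_log                    # antilog of that mean


observations = [5, 10, 20, 25, 40]

print(round(sum(math.log10(v) for v in observations), 2))   # 6.0
print(round(geometric_mean_log10(observations), 2))         # 15.85
```

For comparison, the arithmetic mean of the same five values is 20, so the geometric mean is less influenced by the largest observation.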
Geometric mean of several groups combined Overall GM = antilog of ( / 50) = antilog ( ) = 9.3 Geometric mean
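[Figure: frequency distribution (N on the vertical axis) with the mean, median and mode marked; labels as extracted: "Mean = Median = 10", "Mode = 13.5"] Choosing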
What measure of location to use? Consider the duration (days) of absence from work of 21 labourers owing to sickness 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 6, 6, 6, 7, 8, 9, 10, 10, 59, 80 Mean = 11 days Not typical of the series as 19 of the 21 labourers were absent for less than 11 days Distorted by extreme values Median = 5 days Better measure Choosing
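The comparison on this slide can be reproduced with the standard library. This is a minimal sketch; the variable names are assumptions made for illustration.

```python
from statistics import mean, median

# Duration (days) of absence from work of 21 labourers owing to sickness.
absence_days = [1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5,
                6, 6, 6, 7, 8, 9, 10, 10, 59, 80]

below_11 = sum(1 for d in absence_days if d < 11)   # labourers absent < 11 days

print(len(absence_days))             # 21
print(round(mean(absence_days), 1))  # 11.1
print(median(absence_days))          # 5
print(below_11)                      # 19
```

Two observations (59 and 80 days) are enough to lift the mean above 19 of the 21 values, while the median stays at the 11th ranked value.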
Choice of measure of central tendency for symmetric distributions Any one of the central/location measures can be used The mean has definite advantages if subsequent computations are needed Choosing
Choice of measure of central tendency for asymmetric distributions For skewed distributions, the mean is not suitable Positively skewed: the mean gives a higher value than the median Negatively skewed: the mean gives a lower value than the median If some observations deviate much more than others in the series, the median is the appropriate measure If the log-transformed distribution is symmetric, the geometric mean may be used
Key messages The mode is the most common value The median is suited to data with extreme values The mean is suited to symmetric distributions The geometric mean may be useful when log-transformed data are symmetric The type of distribution determines the measure of central tendency to use