Central tendency and spread Stats Club 7 Marnie Brennan
References Petrie and Sabin - Medical Statistics at a Glance: Chapter 5, 6, 10, 35 Good Petrie and Watson - Statistics for Veterinary and Animal Science: Chapter 2, 4 Good Thrusfield – Veterinary Epidemiology: Chapter 12 Kirkwood and Sterne – Essential Medical Statistics: Chapter 4
Terminology! Along similar lines of previous Stats Clubs, we are talking about ways of describing your numerical data Gives you basic calculations to do to explore your data (get a feel for it) Enables you to compare your data with those collected by other researchers
Central tendency Central tendency = a measure of location or position of data, i.e. the ‘average’ This basically means calculating things like: Mean (arithmetic mean) Median Mode Others E.g. geometric mean (distn. skewed to the right), weighted mean Nice table in Petrie and Sabin (Chapter 5) summarising advantages and disadvantages of all measurements
Central tendency – Mean, Median Mean = Sum of your data/total number of measurements Algebraically defined Affected by skewed data THEREFORE good to use for normally distributed variables Median = The midpoint of your values i.e. what the ‘halfway’ value in your data is If the observations are arranged in increasing order, the median would be the middle value Not algebraically defined Not affected by skewed data THEREFORE good to use for non-normally distributed variables
Distributions Median Mean Mean and median the same
Central tendency - Mode Mode = the value that occurs the most frequently in a data set Generally means more if you have discrete data e.g. The most common litter size of bearded collie dogs is 7 Not often used What is the mode?
Spread Spread = measure of dispersion or variability (variation) of data This basically means calculating things like: Range Percentiles (Quartiles, Interquartile range) Variance Standard deviation Others E.g. coefficient of variation Nice table in Petrie and Sabin (Chapter 6) summarising main points about these measurements
Range and percentiles Range = the range between the minimum and maximum values of your data Gives an indication of spread at a very basic level Distorted by outliers (get a large range) Percentiles = if data is ordered from lowest to highest, these divide the data up into ‘compartments’ E.g. The 5th percentile = the point along the data below which 5% of the data lies; the 20th percentile = the point in the data below which 20% of the data lies Special types of percentiles are called ‘quartiles’ – these divide the data into 4 equal parts (the 25th, 50th and 75th percentiles) From these, you get an ‘interquartile range’ - IQR, which is values between the 25th and 75th percentiles The 50th percentile is the median Not distorted by outliers
Range = 22-28 (6) Q1 (25th percentile) = 24 Q3 (75th percentile) = 26 IQR = 24-26 (2) Range = 0.12-134 (133.9) Q1 (25th percentile) = 6 Q3 (75th percentile) = 36 IQR = 6-36 (30) What conclusions can we draw about what to use when??
Rule of thumb Mean and range = good to use for normally distributed variables Median and interquartile range = good to use for non-normally distributed variables
Variance Variance = the deviations of the data values from the mean e.g. If the data are bunched around the mean, the variance is small; if the data are spread out, the variance is large Calculated by squaring each distance between the observations and the mean - we then take the mean of this (add all values together and divide by the total number of observations minus 1) Reason for squaring – the negative values are ‘cancelled’ out DON’T WORRY ABOUT HOW TO DO THIS! This is what computers are for! Measured in the same units as the observations, but squared e.g. If the units are grams, the variance will be in grams squared
Mean = 26 Variance = 430 Mean = 23 Variance = 11090
Example If we had 6 observations (with mean = 0.17): 15, 18, -14, -17, -3 and 2 What is the variance? = (15 – 0.17)2 + (18-0.17) 2 + (-14 – 0.17) 2 + (-17 – 0.17) 2 + (-3 – 0.17) 2 + (2-0.17) 2/6-1 = 209.37 It is n-1 to reduce bias (we have a sample vs. whole population - again don’t worry too much!)
Standard Deviation (SD) Standard deviation = square root of the variance Similar to variance – also relates to the deviations of the observations from the mean (the ‘average’ deviation) Therefore the units are the same as for the observations – more convenient than variance? If we have a normally distributed dataset, then the mean +/- 2 x standard deviations approximately encompasses the central 95% of observations
Mean = 26 Variance = 430 Standard deviation = 21 Mean = 23 Variance = 11090 Standard deviation = 105
http://www.stark-labs.com/help/blog/files/StackingMethods.php
What about the standard error of the mean (SE or SEM)? Relates to the precision of the sample mean as an estimate of the population mean Standard deviation describes the variation in the data values from the mean Can use SEM to construct confidence intervals This will be covered in greater detail in another session
General rule Standard deviation, variance and SEM are for normally distributed variables For non-normally distributed variables, stick with interquartile range
Examples in papers
= Equal variances? Comparing groups of numerical values It is an assumption of some of the tests used to compare different numerical data groups (e.g. T-tests, ANOVAs) that the variances must be equal (homogeneity of variance) in the groups compared You need to know whether your groups meet these criteria – if they do not: Use other non-parametric tests Transform your data to fit the assumptions Again DON’T WORRY – we’ll cover this in another session
When we start again… The bunfight that is: P-values.................! Type I and Type II errors