Published by Christopher Bell. Modified over 6 years ago.
1
Measures of central tendency and dispersion Tunis, 28th October 2014
Dr Ghada Abou Mrad Ministry of Public Health, Lebanon
2
Learning objectives Define the different types of variables and data within a population or a sample Describe data using the common measures of central tendency (Mode, Median, arithmetic Mean) Describe data in terms of their measures of dispersion (range, standard deviation/variance, standard error)
3
Variable
A population is any complete group of units (such as persons or businesses) with at least one characteristic in common. It needs to be clearly identified at the beginning of a study.
A sample is a subset of the units in a population, selected to represent all units in the population of interest.
A variable is any characteristic, number, or quantity that can be measured or counted. It is called a variable because its value may vary in the population and over time; it is represented by “X” in a population and “x” in a sample.
4
Data
Data are the measurements, observations, or values that are collected for a specific variable in a population or a sample; an observation can be represented by “Xi” in a population and “xi” in a sample.
A data unit (or unit record or record) is one entity (such as a person or business) in the population being studied, for which data are collected.
A data item (or variable) is a characteristic (or attribute) of a data unit which is measured or counted, such as height.
#    age   sex   height
1    20    M     175
2    16    F     163
3    23          170
Each row is a data unit; each column is a data item.
5
Dataset
A dataset is a complete collection of all observations for a specific variable in a population or a sample; it is called a raw dataset if the data have not been organized. The total number of observations in a dataset can be represented by “N” for a population and “n” for a sample.
Example: Ages of students in a class (years)
Age: 27 30 28 31 36 29 37 34 32
6
Types of variables
Variable
  Qualitative: nominal, ordinal
  Quantitative: discrete, continuous
7
Types of variables
Qualitative variable: has values that describe a 'quality'; it is also called a categorical variable.
Nominal: Observations take values that cannot be organized in a logical sequence, like sex or eye color.
Ordinal: Observations take values that can be logically ordered from lowest to highest, like clothing size (e.g. small, medium, large).
The data collected for a qualitative variable are qualitative data.
8
Types of variables
Quantitative variable: has values that describe a measurable quantity; it is also called a numeric variable; its values can be ordered from lowest to highest.
Discrete: Observations take a value based on a count from a set of values. They cannot take the value of a fraction between one value and the next closest value. Ex: number of children in a family.
Continuous: Observations can take any value within a certain range of real numbers. Ex: height.
The data collected for a quantitative variable are quantitative data.
9
Descriptive statistics
Statistics describe or summarize data Most data can be ordered from lowest to highest The frequency is the number of times an observation occurs for a variable; the frequency distribution can be shown in a table or in a graph such as histogram Quantitative data can be described using the common measures of central tendency (Mode, Median, Mean) and the measures of dispersion (range, standard deviation/variance, standard error)
10
Frequency distribution
From the raw data we can create a frequency distribution.
Raw data
Obs   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Age  27 27 28 28 28 29 29 29 29 30 30 30 30 30 31 31 32 34 36 37
Frequency distribution
Age     Frequency
27      2
28      3
29      4
30      5
31      2
32      1
33      0
34      1
35      0
36      1
37      1
Total   20
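For anyone following along on a computer, the tally above can be reproduced with a few lines of Python. This is only a sketch: the `ages` list is typed in from the example dataset, and the variable names are illustrative, not part of the original slides.

```python
from collections import Counter

# Ages of the 20 students (years), typed in from the example dataset.
ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]

counts = Counter(ages)

# Include ages that never occur (frequency 0) so the table covers 27 to 37.
frequency = {age: counts.get(age, 0) for age in range(min(ages), max(ages) + 1)}

for age, freq in frequency.items():
    print(f"{age:>5} {freq:>3}")
print(f"Total {sum(frequency.values()):>3}")
```

Listing the zero-frequency ages (33 and 35) keeps the table aligned with the x-axis of a histogram drawn from the same data.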
11
Histogram
The frequency distribution can be represented by a histogram.
[Figure: histogram of the age frequency distribution, ages 27 to 37]
12
Histogram - Outliers
Outliers are extreme or atypical data values that are notably different from the rest of the data. A histogram can show outliers at a glance.
13
Epidemic curve
[Figure: histogram of number of people by age, annotated “Central Location?” and “Spread?”]
Consider a group of people with ages ranging from the teens to the 80s. This histogram displays the frequency distribution by 10-year age group. If you had to present a single number that best characterized this age distribution, what would it be? That is the function of a measure of central location. Later we will also talk about measures that reflect the spread of the distribution.
14
Measures of central tendency and spread
Central Location / Position / Tendency: a single value at the center of the distribution of data that is a good summary of the entire distribution.
Spread / Dispersion / Variability: how much the distribution is spread or dispersed from its central location.
15
Measure of Central Tendency
Also known as measure of central position or location
It is a single value that summarizes an entire distribution of data
Common measures: Mode, Median, Arithmetic mean
Let’s talk about measures of central location first. We will discuss each of the common measures in turn.
16
Mode Mode is the value that occurs most frequently
Method for identification
1. Arrange data into a frequency distribution or histogram, showing the values of the variable and the frequency with which each value occurs
2. Identify the value that occurs most often
The mode is the simplest measure of central location. It requires no calculations. It is simply the value that occurs most frequently.
17
Mode
Age     Frequency
27      2
28      3
29      4
30      5
31      2
32      1
33      0
34      1
35      0
36      1
37      1
Total   20
From the raw data we can create a frequency distribution. The mode is the value that occurs most frequently. From the frequency distribution you can see that age 30 is the most common age (occurring 5 times), so the mode of this distribution is 30.
18
Mode
Mode = 30
[Figure: histogram of the age frequency distribution, ages 27 to 37, with the tallest bar at age 30]
On the histogram, the mode is the value under the tallest bar: age 30, which occurs 5 times.
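As a quick check of the mode, here is a minimal sketch using Python's standard `statistics` module (Python 3.8 or later is assumed for `multimode`). The `ages` list is typed in from the example dataset; the second list is a made-up illustration of a bimodal case, not data from the slides.

```python
from statistics import multimode

# Ages of the 20 students (years), typed in from the example dataset.
ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]

# multimode returns every value tied for the highest frequency.
modes = multimode(ages)
print(modes)

# Hypothetical bimodal example: two values share the top frequency.
bimodal = [1, 1, 1, 2, 3, 3, 3]
print(multimode(bimodal))
```

Note that when every value occurs equally often, `multimode` returns all of the values, which corresponds to the "no mode" case discussed in the slides.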
19
Unimodal Distribution
[Figure: histogram of a unimodal distribution (one peak)]
Bimodal Distribution
[Figure: histogram of a bimodal distribution (two peaks)]
Depending on the variable, there can be one or multiple modes in a dataset. Can the class think of any examples? Can you think of any distributions that may not have a mode? [Answer: when there is the same number of observations at each value]
20
Mode – Properties / Uses
Easiest measure to understand, explain, identify
Always equals an original value
Does not use all the data
Insensitive to extreme values (outliers)
May be more than one mode
May be no mode
So let’s summarize what you should know about modes. The mode is a perfectly fine descriptive measure (what is the most common or popular value?), but it has poor statistical properties: we don’t do calculations based on the mode.
21
Median
Median is the middle value; it splits the distribution into two equal parts
50% of observations are below the median
50% of observations are above the median
Method for identification
1. Arrange observations in order
2. Find middle position as (n + 1) / 2
3. Identify the value in the middle
The median is the middle value, or the value that splits the distribution into two equal parts. One half of the observations are smaller than the median, and one half are larger.
22
Median: uneven number of values
n = 19
Obs   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Age  27 27 28 28 28 29 29 29 29 30 30 30 30 30 31 31 32 34 36
$\text{Median observation} = \frac{n+1}{2} = \frac{19+1}{2} = \frac{20}{2} = 10$
Median age = 30 years
When there is an odd number of values, such as the 19 values shown here, the middle value of the dataset is the median. An easy way to find the middle position is to add 1 to the number of values and divide by 2. Here, 19 + 1 = 20, divided by 2 = 10. Therefore the middle value is the 10th observation = 30 years.
23
Median: even number of values
n = 20
Obs   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Age  27 27 28 28 28 29 29 29 29 30 30 30 30 30 31 31 32 34 36 37
$\text{Median observation} = \frac{n+1}{2} = \frac{20+1}{2} = \frac{21}{2} = 10.5$
$\text{Median age} = \text{average of the 10th and 11th observations} = \frac{30+30}{2} = 30 \text{ years}$
When there is an even number of values, such as this series of 20 values, the median is the average of the 2 middle values. The median observation works out to 10.5, halfway between 10 and 11. Therefore the median value is the average of the 10th and 11th observations = (30+30)/2 = 30 years.
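The (n + 1) / 2 rule for odd and even n can be sketched in Python as follows. The function name `median_by_position` is a hypothetical helper written for this illustration; the standard library's `statistics.median` is used only as a cross-check.

```python
import math
from statistics import median

def median_by_position(values):
    """Return (middle position, median) using the (n + 1) / 2 rule."""
    if not values:
        raise ValueError("median of an empty dataset is undefined")
    ordered = sorted(values)                    # step 1: arrange in order
    n = len(ordered)
    position = (n + 1) / 2                      # step 2: 1-based middle position
    lower = ordered[math.floor(position) - 1]   # value just at/below the middle
    upper = ordered[math.ceil(position) - 1]    # value just at/above the middle
    return position, (lower + upper) / 2        # step 3: average if they differ

# Ages of the 20 students (years), typed in from the example dataset.
ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]

print(median_by_position(ages[:19]))  # odd n: first 19 observations
print(median_by_position(ages))       # even n: all 20 observations
print(median(ages))                   # cross-check with the standard library
```

When n is odd, the floor and ceiling of the position coincide, so the "average" is just the single middle value.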
24
Median – Properties / Uses
Does not use all the data available
Insensitive to extreme values (outliers)
Measure of choice for skewed data
So let’s summarize what you should know about medians. The median does not use all the data in the distribution, only 1 or 2 values in the middle, so it is insensitive to extreme values (outliers). Like the mode, the median is a good descriptive measure, but has poor statistical properties. Therefore, the median is not commonly used for additional statistical manipulations. Because the median is indifferent to values in the tails of a distribution, it is the measure of choice for skewed data. We will discuss this again later. Finally, the median equals an original value if n is odd, but it is the average of 2 values if n is even.
25
Arithmetic Mean
Arithmetic mean = “average” value: $\mu = \frac{\sum x_i}{n}$
Method for identification
1. Sum up ($\Sigma$) all of the values ($x_i$)
2. Divide the sum by the number of observations ($n$)
Now let’s move on to the arithmetic mean. The arithmetic mean is what is commonly called the average.
26
Arithmetic Mean
Obs   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Age  27 27 28 28 28 29 29 29 29 30 30 30 30 30 31 31 32 34 36 37
$n = 20$, $\sum x_i = 605$
$\mu = \frac{\sum x_i}{n} = \frac{605}{20} = 30.25$
The arithmetic mean (mean) is the measure of central location calculated by dividing the sum of the values in a dataset by the number of values in the dataset. It is often used interchangeably with the word “average”. The mean best reflects the typical value of a dataset when there are few outliers and/or the dataset is generally symmetrical. Here there are 20 values (n = 20), the sum of the 20 values is 605, and the mean is 605 / 20 = 30.25.
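The same two steps (sum the values, divide by n) look like this in Python. This is just a sketch with illustrative variable names; the `ages` list is typed in from the example dataset.

```python
# Ages of the 20 students (years), typed in from the example dataset.
ages = [27, 27, 28, 28, 28, 29, 29, 29, 29, 30,
        30, 30, 30, 30, 31, 31, 32, 34, 36, 37]

total = sum(ages)        # step 1: sum up all of the values
n = len(ages)            # number of observations
mean_age = total / n     # step 2: divide the sum by n

print(f"n = {n}, sum = {total}, mean = {mean_age}")
```

Unlike the mode and median for this dataset, the result (30.25) is not one of the original observed values.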
27
Since the mean uses all data, it is sensitive to outliers
[Figure: two histograms of nights of stay. Top: Mean = 12.0. Bottom: Mean = 15.3.]
Look at the top distribution. This is the distribution of hospital stays for the 30 people in the study. The mean equals 12. But, as we have discussed before, what if the person who stayed 49 days had actually stayed 149 days? The mean increases dramatically, to 15.3. So the mean IS sensitive to outliers, and even a few extreme values can shift it. Tell the class: Now I want you to flip over your notes so you can’t see the next slide.
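To see the effect numerically, the sketch below recomputes the mean and the median after changing the longest stay from 49 to 149 nights. The `stays` list is typed in from the length-of-stay data used elsewhere in this presentation; the variable names are illustrative.

```python
from statistics import median

# Nights of stay for the 30 hospitalized patients.
stays = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10,
         10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

# Same data, but the longest stay is recorded as 149 instead of 49.
stays_outlier = stays[:-1] + [149]

mean_original = sum(stays) / len(stays)
mean_outlier = sum(stays_outlier) / len(stays_outlier)

print(f"mean:   {mean_original:.1f} -> {mean_outlier:.1f}")
print(f"median: {median(stays)} -> {median(stays_outlier)}")
```

A single changed value moves the mean by more than 3 nights, while the median stays where it was.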
28
When to use the arithmetic mean?
Centered distribution Approximately symmetrical Few extreme values (outliers)
29
When to use the arithmetic mean? (ii)
[Figure: four example distributions, numbered 1 to 4; one is marked “OK!”]
Instructor note: This slide has animation. When you are reviewing this presentation, to see the animation either play the slide show, or on the top toolbar choose Slide Show -> Custom Animation. On the bottom of the custom animation box you can click on Play and it will go through the animation quickly, or click on Slide Show and it will only show you this slide (so you don’t have to go through the whole presentation to find this one).
When the slide first appears: The mean can be misleading as a summary measure in these examples. The best time to use the mean is when the data have a centered distribution that is approximately symmetrical and has few extreme values, i.e. outliers.
ASK: Which graphs do not meet these criteria? Why? Wait for the answer, then click for the next animation: the large red X’s. Click one more time for the “OK!”
30
Arithmetic Mean – Properties / Uses
Uses all of the data
Affected by extreme values (outliers)
Best for normally distributed data
Not usually equal to one of the original values
So let’s summarize what you should know about the arithmetic mean. The mean is probably the best-known measure of central location. It uses all of the data, so it is affected by extreme values (outliers). Consequently, the mean is best when you have normally distributed (bell-shaped curve) data. It is not usually equal to one of the original values (but who cares?). It has good statistical properties, so many statistical tests and other techniques are based on the mean.
31
Var A   Var B   Var C
1       6       3
9       4       6
1       6       2
0       4       0
5       10      5
9       5       1
0       4       7
9       6       4
10      0       9
1       5       8
For each variable, find the: Sum, Mean, Median, Mode, Minimum value, Maximum value
Instructors: Split the class into 3 groups. Ask Group 1 to use the Variable A data, Group 2 to use the Variable B data, and Group 3 to use the Variable C data.
32
          Var A   Var B   Var C
Sum:      55      55      55
Mean:
Median:
Mode:
Min:
Max:
Ask class for answers.
33
          Var A   Var B   Var C
Sum:      55      55      55
Mean:     5       5       5
Median:   5       5       5
Mode:     1, 9    4, 5, 6 none
Min:      0       0       0
Max:
34
How does the shape of a distribution influence
the Measures of Central Tendency? Symmetrical: Mode = Median = Mean Skewed right: Mode < Median < Mean Skewed left: Mean < Median < Mode The mean, median, and mode can be the same or different. With a classic bell-shaped curve (“normal distribution”), the mode is at the peak, the median is at the half-way point (which is at the peak), and the mean is in the center (which is at the peak). With a distribution that is skewed to the right (i.e., tail points to the right), the peak tends to be toward the left, so the mode is usually the smallest of the 3 values. Then comes the median and mean. If the tail is long enough, the mean will be pulled to the right farther than the median. With a distribution that is skewed to the left (i.e., tail points to the left), the opposite is true, with the mean pulled to the left by the tail, followed by the median, and finally the mode.
35
Epidemic curve
[Figure: histogram of number of people by age, annotated “Central Location?” and “Spread?”]
So far we have reviewed the measures of central location. Now we will review the measures that reflect the spread of the distribution.
36
Same center but … different dispersions
We already talked about measures of central location, such as mean, median, and mode. Here are three different graphs with similar means, medians, and modes. What is different? How would you describe this difference? The dispersion is different. A graph with less variation could indicate that the population is more homogeneous.
37
Measures of Spread
Measures that quantify the variation or dispersion of a set of data from its central location
Also known as “measures of dispersion / variation”
Common measures: Range, Variance / standard deviation, Standard error
To quantify the spread or variation in the data we use measures of spread. Besides the common measures listed on the slide, the interquartile range and the 95% confidence interval are also used to describe spread.
38
Range Range = Difference between largest and smallest values in a dataset Properties / Uses: Greatly affected by outliers Usually used with median The range is the simplest measure of dispersion. It is, simply, the difference between the largest and smallest values.
39
Finding the Range of Length of Stay Data
0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10, 10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49
[Figure: histogram of nights of stay]
Here are some data from a study of hospitalized patients. Each number represents the number of nights one of the 30 patients spent in the hospital. What is the minimum value? [Answer = 0] What is the maximum value? [Answer = 49] So what is the range? [Let students answer. Some might say “0 to 49,” others might say “49”.] Is the range one number or two? The range can be thought of in 2 ways:
1) As a quantity, or in a statistics class: the difference between the highest and lowest scores in a distribution. “The range is 49 - 0 = 49.”
2) As an interval, or in common usage: the lowest and highest scores may be reported as a range. “The range is 0 to 49.”
40
Range – Sensitive to Outliers?
[Figure: two histograms of nights of stay. Top: Range = 49 - 0 = 49. Bottom (longest stay recorded as 149 instead of 49): Range = 149 - 0 = 149.]
Obviously, the range is EXTREMELY sensitive to outliers, since it is based on the most extreme values. Finally, when do we use the range in epidemiology? We may use it to describe the age range of cases (“cases at the school ranged in age from 8 to 15 years”), or the observed incubation period during an outbreak (“incubation period of cases ranged from 18 to 31 days”).
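Both ways of reporting the range (as a single quantity and as an interval) can be sketched in Python. The helper name `describe_range` is hypothetical, and the `stays` list is typed in from the length-of-stay data.

```python
def describe_range(values):
    """Return the range as an interval (low, high) and as a quantity."""
    low, high = min(values), max(values)
    return (low, high), high - low

# Nights of stay for the 30 hospitalized patients.
stays = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10,
         10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

interval, quantity = describe_range(stays)
print(f"Range is {interval[0]} to {interval[1]} (= {quantity})")

# Changing only the largest value changes the range by the same amount.
interval_outlier, quantity_outlier = describe_range(stays[:-1] + [149])
print(f"Range is {interval_outlier[0]} to {interval_outlier[1]} (= {quantity_outlier})")
```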
41
Interquartile Range
Quartiles divide an ordered dataset into four equal parts, and refer to the values of the points between the quarters
Interquartile range is the central 50% of a distribution; it is the difference between the upper (Q3) and lower (Q1) quartiles; it is used with the median
[Figure: ordered data divided by Q1, Q2, and Q3 into four parts, each holding 25% of values]
[Consider skipping or deleting the Interquartile Range slides - rcd]
The interquartile range is the central 50% of a distribution. It is almost always used in conjunction with the median and the quartiles. Q1 represents the 25% point in the data and Q3 represents the 75% point, so the interquartile range is the range from Q1 (25%) to Q3 (75%), which is the middle 50% of the data. It is also commonly used, along with the minimum and maximum, to draw a box-and-whiskers diagram.
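A minimal sketch of the interquartile range in Python, using `statistics.quantiles` from the standard library (Python 3.8 or later assumed). Be aware that textbooks and software use several slightly different conventions for locating quartiles, so the exact Q1 and Q3 may differ a little from a hand calculation; the default "exclusive" method is used here as an assumption, not as the method intended by the slides.

```python
from statistics import quantiles

# Nights of stay for the 30 hospitalized patients.
stays = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10,
         10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(stays, n=4)
iqr = q3 - q1

print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}")
print(f"Interquartile range = {iqr}")
```

Because Q1 and Q3 ignore the tails, replacing the longest stay (49) with 149 would leave the interquartile range unchanged.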
42
Variance and Standard Deviation
Measures of variation that quantify how closely clustered the observed values are to the mean; measures of the spread of the data around the mean
Variance = average of squared deviations from the mean = Sum (each value – mean)² / (n – 1)
Standard deviation = square root of variance
The variance and standard deviation are related measures that quantify how closely clustered the observed values are to the mean. The variance is the average of the squared deviations or differences from the mean. To get back to the original units, you have to take the square root of the variance. That is called the standard deviation.
43
Variance and Standard Deviation (ii)
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$
$\bar{x}$: mean; $x_i$: value; $n$: number; $s^2$: variance; $s$: standard deviation
Here are the equations for the variance and the standard deviation. For epidemiologic purposes, the variance is rarely used on its own; it is just an intermediate step to get to the standard deviation. You see the standard deviation is just the square root of the variance. Familiarize the students with the notation.
44
Steps to Calculate Variance and Standard Deviation
$\bar{x}$: mean; $x_i$: value; $n$: number; $s^2$: variance; $s$: standard deviation
Now let’s go through the formula more closely. Here are the steps.
1. Calculate the arithmetic mean: $\bar{x}$
2. Subtract the mean from each observation: $x_i - \bar{x}$
3. Square the difference: $(x_i - \bar{x})^2$
4. Sum the squared differences: $\sum (x_i - \bar{x})^2$
5. Divide the sum of the squared differences by n – 1. That gives you the variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
6. Take the square root of the variance to get the standard deviation: $s = \sqrt{s^2}$
45
Length of Stay Data
(0 – 12)² = 144     (9 – 12)² = 9      (12 – 12)² = 0
(2 – 12)² = 100     (9 – 12)² = 9      (13 – 12)² = 1
(3 – 12)² = 81      (10 – 12)² = 4     (14 – 12)² = 4
(4 – 12)² = 64      (10 – 12)² = 4     (16 – 12)² = 16
(5 – 12)² = 49      (10 – 12)² = 4     (18 – 12)² = 36
(5 – 12)² = 49      (10 – 12)² = 4     (18 – 12)² = 36
(6 – 12)² = 36      (10 – 12)² = 4     (19 – 12)² = 49
(7 – 12)² = 25      (11 – 12)² = 1     (22 – 12)² = 100
(8 – 12)² = 16      (12 – 12)² = 0     (27 – 12)² = 225
(9 – 12)² = 9       (12 – 12)² = 0     (49 – 12)² = 1369
Sum = 2448; Var = 2448 / 29 = 84.4; SD = √84.4 = 9.2
Here are the data from the length of hospital stay study.
Step 1 is to calculate the arithmetic mean. We did that last time. The mean is 12.
Step 2. Subtract the mean from each observation, as shown. The mean was 12, so we subtract 12 from each observation.
Step 3. Square the difference. The results are shown.
Step 4. Sum the squared differences. The sum is 2448.
Step 5. Divide the sum of the squared differences by (n – 1): 2448 / 29 = 84.4. That gives you the variance.
Step 6. Take the square root of the variance to get the standard deviation. The square root of 84.4 is 9.2.
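The six steps can be followed one by one in Python. This is a sketch with illustrative variable names; the `stays` list is typed in from the length-of-stay data, and the last lines simply cross-check the result against the standard library's sample variance and standard deviation, which also divide by n – 1.

```python
import math
from statistics import stdev, variance

# Nights of stay for the 30 hospitalized patients.
stays = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10,
         10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

n = len(stays)
mean_stay = sum(stays) / n                       # step 1: arithmetic mean
deviations = [x - mean_stay for x in stays]      # step 2: subtract the mean
squared = [d ** 2 for d in deviations]           # step 3: square each difference
sum_of_squares = sum(squared)                    # step 4: sum the squares
sample_variance = sum_of_squares / (n - 1)       # step 5: divide by n - 1
sample_sd = math.sqrt(sample_variance)           # step 6: square root

print(f"mean = {mean_stay}, sum of squares = {sum_of_squares}")
print(f"variance = {sample_variance:.1f}, SD = {sample_sd:.1f}")

# Cross-check against the standard library (sample formulas, n - 1).
assert math.isclose(sample_variance, variance(stays))
assert math.isclose(sample_sd, stdev(stays))
```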
46
Standard Deviation
Standard deviation is usually calculated only when data are more or less normally distributed (bell-shaped curve)
For normally distributed data,
68.3% of the data fall within plus/minus 1 SD
95.5% of the data fall within plus/minus 2 SD
95.0% of the data fall within plus/minus 1.96 SD
99.7% of the data fall within plus/minus 3 SD
The standard deviation of a normal distribution enables the calculation of confidence intervals
The standard deviation is usually calculated only when the data are more or less normally distributed. (The length of stay data is skewed to the right, so perhaps that was not the best example to use.)
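The percentages above are properties of the normal distribution itself and can be checked with the error function from Python's `math` module. This is a sketch: the helper name `coverage_within` is hypothetical, and it relies on the standard identity that the probability of falling within k standard deviations of the mean is erf(k / √2).

```python
import math

def coverage_within(k):
    """Proportion of a normal distribution within +/- k standard deviations."""
    return math.erf(k / math.sqrt(2))

for k in (1, 1.96, 2, 3):
    print(f"within +/- {k} SD: {coverage_within(k) * 100:.1f}%")
```

The slide's figures are these proportions rounded to one decimal place.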
47
Normal Distribution
[Figure: normal curve with the mean at the center and standard deviations marked on the x-axis; 68% of the area within 1 standard deviation, 95% in the central region, and 2.5% in each tail]
As illustrated here, the mean is at the center, and ±1, 2, and 3 standard deviations are marked on the x-axis. For normally distributed data, approximately two-thirds (68.3%, to be exact) of the data fall within one standard deviation on either side of the mean; 95.5% of the data fall within two standard deviations of the mean; and 99.7% of the data fall within three standard deviations. Exactly 95.0% of the data fall within 1.96 standard deviations of the mean.
This is important to us in epidemiology because we can compare any one individual’s characteristics (I can compare my height to the class height, or my blood pressure to the mean value of the nation) or any group’s characteristic (a rayon’s reported cases this week to the nation’s) with a population’s values and determine how far they are from the mean. The standard deviation can be used to set limits as to what is normal (normal birth weight, normal temperature, normal number of reported cases), and then to detect when something is not normal or is rare (more than 2 or 3 standard deviations from the mean). We will use the mean and standard deviation to help us determine what is a normal number of reported cases and what is abnormal when we cover surveillance analysis.
48
Normal Distribution
49
Properties of Measures of Central Location and Spread
For quantitative / continuous variables
Mode – simple, descriptive, not always useful
Median – best for skewed data
Arithmetic mean – best for normally distributed data
Range – use with median
Standard deviation – use with mean
Standard error – used to construct confidence intervals
To summarize, measures of central location and spread are used to summarize quantitative / continuous variables such as height or CD4 counts.
Among the measures of central location, the mode is the most common value. It is simple and descriptive, but not always useful, because it is not always near the center of the distribution, there may be more than 1 mode, or there may not be any. The median is the middle value. It is the best for skewed data, and it is safe when you don’t know whether the data are skewed or not. The arithmetic mean is the average value, best for normally distributed data. In fact, it should only be used with normally distributed data. The geometric mean is used to calculate the average of dilutional lab titers.
Among the measures of spread, the range from smallest to largest value is a simple, descriptive measure, often used with the median for skewed data or data of unknown skewness. The standard deviation is used with the mean, so it should be used only with normally distributed data. And finally, the standard error of the mean is derived from the standard deviation, and is used to construct confidence intervals.
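The slides do not give a formula for the standard error, so the sketch below assumes the usual definition, standard error of the mean = s / √n, and the usual large-sample 95% confidence interval, mean ± 1.96 × standard error. It reuses the length-of-stay data purely for illustration; as noted earlier, those data are skewed, so the interval should be read as a demonstration of the arithmetic rather than a recommended analysis.

```python
import math
from statistics import mean, stdev

# Nights of stay for the 30 hospitalized patients.
stays = [0, 2, 3, 4, 5, 5, 6, 7, 8, 9, 9, 9, 10, 10, 10,
         10, 10, 11, 12, 12, 12, 13, 14, 16, 18, 18, 19, 22, 27, 49]

n = len(stays)
mean_stay = mean(stays)
sd = stdev(stays)                     # sample standard deviation (n - 1)
standard_error = sd / math.sqrt(n)    # assumed definition: s / sqrt(n)

# Assumed large-sample 95% confidence interval for the mean.
margin = 1.96 * standard_error
ci_lower, ci_upper = mean_stay - margin, mean_stay + margin

print(f"mean = {mean_stay}, SD = {sd:.2f}, SE = {standard_error:.2f}")
print(f"95% CI: {ci_lower:.2f} to {ci_upper:.2f}")
```

The standard error shrinks as n grows, so a larger sample gives a narrower confidence interval for the same standard deviation.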
50
Name the appropriate measures of central Location and Spread
Distribution                    Central Location    Spread
Single peak, symmetrical
Skewed or data with outliers
For a distribution with a single peak and symmetrical shape, which measure of central location would you recommend? Which measure of spread? For skewed data or data with a few extreme values, which measure of central location would you recommend? Which measure of spread?
51
Name the appropriate measures of central Location and Spread
Distribution                    Central Location    Spread
Single peak, symmetrical        Mean*               Standard deviation
Skewed or data with outliers    Median              Range or interquartile range
* Median and mode will be similar
The choice of measures of central location and spread will depend in large part on the nature of the distribution of the observations. For continuous variables with a single-peaked and symmetric distribution, the mean, median, and mode will be similar or identical. The mean is usually preferred, and it is paired with the standard deviation. For data that are skewed or have a few extreme values, the median is the measure of choice because it is insensitive to extreme values. For descriptive purposes, the median is often paired with the range. For comparison purposes, the interquartile range may be used.
52
Any questions?
[Figure: histogram of population by age, annotated with minimum, 1st quartile, median, mode, 3rd quartile, maximum, range, and interquartile interval]
Here we can see how the positions and distributions of the data are used together to describe the overall dataset.
53
Ministry of Public Health, Lebanon
Thank you! Dr Ghada Abou Mrad Ministry of Public Health, Lebanon