Honors Stats Chapter 4 Part 5 Displaying and Summarizing Quantitative Data
Learning Goals Know how to display the distribution of a quantitative variable with a histogram, a stem-and-leaf display, or a dotplot. Know how to display the relative position of a quantitative variable with a Cumulative Frequency Curve and analyze the Cumulative Frequency Curve. Be able to describe the distribution of a quantitative variable in terms of its shape. Be able to describe any anomalies or extraordinary features revealed by the display of a variable.
Learning Goals Be able to determine the shape of the distribution of a variable by knowing something about the data. Know the basic properties and how to compute the mean and median of a set of data. Understand the properties of a skewed distribution. Know the basic properties and how to compute the standard deviation and IQR of a set of data.
Learning Goals Understand which measures of center and spread are resistant and which are not. Be able to select a suitable measure of center and a suitable measure of spread for a variable based on information about its distribution. Be able to describe the distribution of a quantitative variable in terms of its shape, center, and spread.
Learning Goal 6 Know the basic properties and how to compute the mean and median of a set of data.
Learning Goal 6: Measures of Central Tendency A measure of central tendency for a collection of data values is a number that is meant to convey the idea of centralness or center of the data set. The most commonly used measures of central tendency for sample data are the: mean, median, and mode.
Learning Goal 6: Measures of Central Tendency Overview Central Tendency: Mean (arithmetic average), Median (midpoint of ranked values), Mode (most frequently observed value)
Learning Goal 6: The Mean Mean: The mean of a set of numerical (data) values is the (arithmetic) average for the set of values. When computing the value of the mean, the data values can be population values or sample values. Hence we can compute either the population mean or the sample mean.
Learning Goal 6: Mean Notation NOTATION: The population mean is denoted by the Greek letter µ (read as “mu”). NOTATION: The sample mean is denoted by $\bar{x}$ (read as “x-bar”). Normally the population mean is unknown.
Learning Goal 6: The Mean The mean is the most common measure of central tendency. The mean is also the preferred measure of center, because it uses all the data in calculating the center. For a sample of size n: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$, where the $x_i$ are the observed values and n is the sample size.
Learning Goal 6: The Mean - Example What is the mean of the following 11 sample values? 3 8 6 14 0 -4 0 12 -7 0 -10
Learning Goal 6: The Mean - Example (Continued) Solution: $\bar{x} = \frac{3 + 8 + 6 + 14 + 0 + (-4) + 0 + 12 + (-7) + 0 + (-10)}{11} = \frac{22}{11} = 2$
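The arithmetic for this example can be checked with a short Python sketch. Python is not part of the original slides (which use the TI-84), and the variable names are illustrative only.

```python
from statistics import mean

# The 11 sample values from the example.
values = [3, 8, 6, 14, 0, -4, 0, 12, -7, 0, -10]

total = sum(values)   # sum of the observed values
n = len(values)       # sample size
x_bar = total / n     # sample mean: sum divided by sample size

print(total, n, x_bar)  # 22 11 2.0

# The standard library gives the same value.
assert mean(values) == x_bar
```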
Learning Goal 6: Mean – Frequency Table When a data set has a large number of values, we summarize it as a frequency table. The frequencies represent the number of times each value occurs. When the mean is calculated from a frequency table it is an approximation, because the raw data is not known.
Learning Goal 6: Mean – Frequency Table Example What is the mean of the following 11 sample values (the same data as before)?
Class          Frequency
-10 to < -4    2
-4 to < 2      4
2 to < 8       2
8 to < 14      2
14 to < 20     1
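Learning Goal 6: Mean – Frequency Table Example Solution:
Class          Midpoint   Frequency
-10 to < -4    -7         2
-4 to < 2      -1         4
2 to < 8       5          2
8 to < 14      11         2
14 to < 20     17         1
$\bar{x} \approx \frac{\sum (\text{midpoint} \times \text{frequency})}{n} = \frac{(-7)(2) + (-1)(4) + (5)(2) + (11)(2) + (17)(1)}{11} = \frac{31}{11} \approx 2.82$. The mean computed from the raw data is 2, so the frequency-table result is only an approximation.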
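The frequency-table calculation can be sketched in Python as a weighted average of class midpoints. This is an illustration, not part of the original slides. The frequencies of 2 for the classes 2 to < 8 and 8 to < 14 are an assumption obtained by counting the raw values (3, 6 and 8, 12) from the earlier example.

```python
# Class midpoints and frequencies for the five classes.
# Assumption: the two middle frequencies (2 and 2) are counted from the raw data.
midpoints = [-7, -1, 5, 11, 17]
frequencies = [2, 4, 2, 2, 1]

n = sum(frequencies)  # total number of observations
weighted_sum = sum(m * f for m, f in zip(midpoints, frequencies))
grouped_mean = weighted_sum / n  # approximate mean from the table

print(n, weighted_sum, round(grouped_mean, 2))  # 11 31 2.82
```

The result differs from the raw-data mean of 2 because each value is replaced by its class midpoint.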
Learning Goal 6: Calculate Mean on TI-84 Raw Data
1. Enter the raw data into a list: STAT/Edit.
2. Calculate the mean: STAT/CALC/1-Var Stats, with List: L1 and FreqList: (leave blank), then Calculate.
Learning Goal 6: Calculate Mean on TI-84 Frequency Table Data
1. Enter the frequency table data into two lists (L1: Class Midpoint, L2: Frequency): STAT/Edit.
2. Calculate the mean: STAT/CALC/1-Var Stats, with List: L1 and FreqList: L2, then Calculate.
Same data:
Class      Mark   Freq
0-50       25     1
50-100     75
100-150    125    3
150-200    175    4
200-250    225    7
250-300    275
Learning Goal 6: Calculate Mean on TI-84 – Your Turn Raw Data:
548, 405, 375, 400, 475, 450, 412,
375, 364, 492, 482, 384, 490, 492,
490, 435, 390, 500, 400, 491, 945,
435, 848, 792, 700, 572, 739, 572
Learning Goal 6: Calculate Mean on TI-84 – Your Turn Frequency Table Data (same):
Class Limits    Frequency
350 to < 450    11
450 to < 550    10
550 to < 650    2
650 to < 750    2
750 to < 850    2
850 to < 950    1
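One way to check your work on this exercise is to build the frequency table directly from the raw data. The sketch below is illustrative and not part of the original slides. The helper name `class_frequencies` is hypothetical, and the class width of 100 is read off the class limits.

```python
raw = [548, 405, 375, 400, 475, 450, 412,
       375, 364, 492, 482, 384, 490, 492,
       490, 435, 390, 500, 400, 491, 945,
       435, 848, 792, 700, 572, 739, 572]

lower_limits = [350, 450, 550, 650, 750, 850]
width = 100  # each class covers lower <= x < lower + width


def class_frequencies(data, lower_limits, width):
    """Count how many values fall in each half-open class [lower, lower + width)."""
    counts = [0] * len(lower_limits)
    for x in data:
        for i, lower in enumerate(lower_limits):
            if lower <= x < lower + width:
                counts[i] += 1
                break
    return counts


freqs = class_frequencies(raw, lower_limits, width)
midpoints = [lower + width / 2 for lower in lower_limits]

raw_mean = sum(raw) / len(raw)  # mean from the raw data
grouped_mean = sum(m * f for m, f in zip(midpoints, freqs)) / sum(freqs)  # approximation

print(freqs)
print(raw_mean, grouped_mean)
```

Comparing the two printed means shows how close the frequency-table approximation is for this data set.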
Learning Goal 6: Median The median is the midpoint of the observations when they are ordered from the smallest to the largest (or from the largest to the smallest). If the number of observations is odd, then the median is the middle observation. If the number of observations is even, then the median is the average of the two middle observations.
Center of a Distribution -- Median The median is the value with exactly half the data values below it and half above it. It is the middle data value (once the data values have been ordered) that divides the histogram into two equal areas. It has the same units as the data.
Learning Goal 6: Finding the Median The location of the median is position $\frac{n+1}{2}$ in the ranked data. If the number of values is odd, the median is the middle number. If the number of values is even, the median is the average of the two middle numbers. Note that $\frac{n+1}{2}$ is not the value of the median, only the position of the median in the ranked data.
Learning Goal 6: Finding the Median – Example (n odd) What is the median for the following sample values? 3 8 6 14 0 -4 2 12 -7 -1 -10
Learning Goal 6: Finding the Median – Example (n odd) First, arrange the data set in order (on the TI-84, STAT/SortA). The ordered set is: -10 -7 -4 -1 0 2 3 6 8 12 14 Since the number of values is odd, the median is in the 6th position of the ordered set (divide the number of values by 2 and round up: 11/2 = 5.5, which rounds up to 6). Thus, the value of the median is 2, the 6th value.
Learning Goal 6: Finding the Median – Example (n even) Find the median age for the following eight college students. 23 19 32 25 26 22 24 20
Learning Goal 6: Finding the Median – Example (n even) First, order the values as shown below. 19 20 22 23 24 25 26 32 Since there is an even number of ages, the median is the average of the two middle values (divide the number of values by 2; that position and the next hold the two middle values: 8/2 = 4, so the 4th and 5th values are the middle two). Thus, median = (23 + 24)/2 = 23.5.
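The sort-then-pick-the-middle procedure used in both examples can be sketched in Python. This is an illustration, not part of the original slides, and the function name `median_by_position` is hypothetical.

```python
from statistics import median


def median_by_position(data):
    """Sort the data, then take the middle value (n odd) or average the two middle values (n even)."""
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2  # 0-based index of the upper middle value
    if n % 2 == 1:
        return ordered[mid]  # 1-based position (n + 1) / 2
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of the two middle values


odd_sample = [3, 8, 6, 14, 0, -4, 2, 12, -7, -1, -10]
ages = [23, 19, 32, 25, 26, 22, 24, 20]

print(median_by_position(odd_sample))  # 2
print(median_by_position(ages))        # 23.5

# The standard library agrees with the hand procedure.
assert median(odd_sample) == median_by_position(odd_sample)
assert median(ages) == median_by_position(ages)
```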
Learning Goal 6: The Median - Summary The median is the midpoint of a distribution: the number such that half of the observations are smaller and half are larger.
1. Sort observations from smallest to largest. n = number of observations.
2. If n is odd, the median is observation n/2 (rounded up) down the list. Example: n = 25, n/2 = 12.5, which rounds up to 13; Median = 3.4.
3. If n is even, the median is the mean of the two center observations. Example: n = 24, n/2 = 12, so the center observations are the 12th and 13th; Median = (3.3 + 3.4)/2 = 3.35.
Learning Goal 6: Finding the Median on the TI-84 Enter data into L1 STAT; CALC; 1:1-Var Stats
Learning Goal 6: Find the Mean and Median – Your Turn CO2 pollution levels in the 8 largest nations, measured in metric tons per person: 2.3 1.1 19.7 9.8 1.8 1.2 0.7 0.2
(a) Mean = 4.6, Median = 1.5
(b) Mean = 4.6, Median = 5.8
(c) Mean = 1.5, Median = 4.6
Learning Goal 6: Mode A measure of central tendency. The value that occurs most often. Used for either numerical or categorical data. There may be no mode or several modes. Not used as a measure of center. [Figure: two dot plots, one on a 0 to 14 axis with Mode = 9 and one on a 0 to 6 axis with no mode]
Learning Goal 6: Mode - Example The mode is the measurement which occurs most frequently.
The set: 2, 4, 9, 8, 8, 5, 3. The mode is 8, which occurs twice.
The set: 2, 2, 9, 8, 8, 5, 3. There are two modes, 8 and 2 (bimodal).
The set: 2, 4, 9, 8, 5, 3. There is no mode (each value is unique).
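The three cases in this example can be reproduced with a small Python sketch that counts how often each value occurs. It is an illustration, not part of the original slides. The helper name `modes` is hypothetical, and returning an empty list when every value is unique is a choice made here to match the slide's "no mode" convention.

```python
from collections import Counter


def modes(data):
    """Return the most frequent value(s); an empty list means no mode."""
    if not data:
        return []
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value is unique, so there is no mode
    return sorted(value for value, count in counts.items() if count == highest)


print(modes([2, 4, 9, 8, 8, 5, 3]))  # [8]
print(modes([2, 2, 9, 8, 8, 5, 3]))  # [2, 8]
print(modes([2, 4, 9, 8, 5, 3]))     # []
```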
Learning Goal 6: Summary Measures of Center
Learning Goal 7 Understand the properties of a skewed distribution.
Learning Goal 7: Where is the Center of the Distribution? If you had to pick a single number to describe all the data what would you pick? It’s easy to find the center when a histogram is unimodal and symmetric—it’s right in the middle. On the other hand, it’s not so easy to find the center of a skewed histogram or a histogram with outliers.
Learning Goal 7: Meaningful measure of Center Your measure of center must be meaningful. The distribution of women’s height appears coherent and symmetrical. The mean is a good measure of center. [Figure: height of 25 women in a class]
While we are looking at a number of histograms at once, and talking about means, here is another example of how you might use histograms and descriptive statistics like means to find out something of biological interest. You are interested in studying what pollinators visit a particular species of plant. Let’s say that there has been an increase in agriculture in the area, with all the pesticide spraying that comes along with that. If insects are needed to pollinate the plant, and the pesticides kill the insects, the plant species may go extinct. Here is the mean of this distribution, but is it a good description of the center? Why would we care? Maybe plant height is a measure of plant age, and we wonder how well the population is holding up. Here you see there are not very many little plants, which might make you worry that there has been insufficient pollination. One of the things you have noticed about the plants is that the flower color varies. Pollinators are attracted to flower color, so you happen to have the plants divided up into three groups: red, pink, and white flowers. Typically hummingbirds pollinate red flowers and moths pollinate white flowers, which makes you start to wonder about your sample. So group them by flower color and get means for each group. Is the mean always a good measure of center?
Learning Goal 7: Impact of Skewed Data [Figure: mean and median of a symmetric distribution and a skewed distribution. Disease X: mean and median are the same. Multiple myeloma: the mean is pulled toward the skew.]
It is maybe easier to see that by comparing the two distributions we just looked at that show time to death after diagnosis. For both disease X and MM you have on average 3 years to live. Does that mean you don’t care which one you get? Well, of the 25 people getting disease X, only 1 died in the first year after diagnosis. Of the ones getting MM, 7 did. So if you get X, according to what we see here only 1/25 or about 4 percent of people don’t make it through year one. But if you get MM, well, if 7 in 25 die in year one, it means you have an almost 30% chance of not making it even a year. Now, you might be one of these very few who live a long time, but it is much more likely that it is time to get your will together and hurry around to say goodbye to your loved ones. Means are the same, medians are different, because of the shape of the distribution. This is one of the major take-home messages from this class: you all thought you knew what an average meant, and you did, but you should also realize that what the average is telling you is different depending on the distribution. When the doctor diagnoses you with some disease, and people with that disease live on average for 3 years, you say, Doctor! Show me the distribution! And as you go on in biology and you see charts like this in journal articles or even in the paper, you now know why they are showing them to you. Statistical descriptors, like using the mean to describe the center, are only telling you so much. To really understand what is going on you have to plot the data and look at the distribution for things like overall shape, symmetry, and the presence of outliers, and you have to understand the effect they have on things like the mean.
Now, the next obvious question for a biologist of course is why you see these different types of patterns. The top is a normal distribution, which represents lots of things in the natural world, as we have seen in our women’s height and toucan bill examples. The distribution on the bottom is very different, and when you see something like this it challenges researchers to understand it. Why do such a large percentage of people die so quickly? Is there one single thing that, if we could figure it out, would save a huge chunk of the people dying down here? Could they figure out what it is about either these people or their treatment that allowed them to live so long? Lots is still not known, but a big part of it is that this diagnosis, MM, does not have the word multiple in its name for no reason. When you get down to the level of the cells involved, there are lots of different ones, so it is really a suite of diseases. So this diagnosis is like “cancer” in general: a term that covers a broad range of biological phenomena that you can study and pick apart and understand from the cell biology to the epidemiological level using not your intuition, but statistics. Now let’s move on from describing the center to describing the spread and symmetry, which are, again, really different for these two distributions.
Learning Goal 7: The Mean Nonresistant – The mean is sensitive to the influence of extreme values and/or outliers. Skewed distributions pull the mean away from the center towards the longer tail. The mean is located at the balancing point of the histogram. For a skewed distribution, the mean is not a good measure of center.
Learning Goal 7: Mean – Nonresistant Example The most common measure of central tendency. Affected by extreme values (skewed distributions or outliers). [Figure: two dot plots on a 0 to 10 axis, the first with Mean = 3 and the second with Mean = 4]
Learning Goal 7: The Median Resistant – The median is said to be resistant, because extreme values and/or outliers have little effect on the median. In an ordered array, the median is the “middle” number (50% above, 50% below).
Learning Goal 7: Median – Resistant Example Not affected by extreme values (skewed distributions or outliers). [Figure: two dot plots on a 0 to 10 axis, both with Median = 3]
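The contrast between the two examples (mean moving from 3 to 4 while the median stays at 3) can be sketched in Python. The data values behind the slide figures are not recoverable, so the two small data sets below are hypothetical, chosen only because they reproduce the stated means and medians.

```python
from statistics import mean, median

# Hypothetical data sets (assumption): the second replaces the largest value with an extreme one.
original = [1, 2, 3, 4, 5]
with_extreme = [1, 2, 3, 4, 10]

# Mean is 3 and median is 3 for the original data.
print(mean(original), median(original))

# Mean rises to 4, but the median is still 3, with the extreme value.
print(mean(with_extreme), median(with_extreme))

mean_shift = mean(with_extreme) - mean(original)        # the mean moves by 1
median_shift = median(with_extreme) - median(original)  # the median does not move
```

Changing a single value shifted the mean but left the median untouched, which is what "resistant" means here.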
Learning Goal 7: Mean vs. Median with Outliers [Figure: two histograms of percent of people dying, one without the outliers and one with the outliers] Here is the same data set with some outliers: some lucky people who managed to live longer than the others. The few large values moved the mean up from 3.5 to 4.0. However, the median, the number of years it takes for half the people to die, only went from 3.4 to 3.6. This is typical behavior for the mean and median. The mean is sensitive to outliers, because when you add all the values up to get the mean the outliers are weighted disproportionately by their large size. However, when you get the median, they are just another two points to count; the fact that their size is so large does not matter much. The mean (non-resistant) is pulled to the right a lot by the outliers (from 3.4 to 4.2). The median (resistant), on the other hand, is only slightly pulled to the right by the outliers (from 3.4 to 3.6).
Learning Goal 7: Effect of Skewed Distributions The figure below shows the relative positions of the mean and median for right-skewed, symmetric, and left-skewed distributions. Note that the mean is pulled in the direction of skewness, that is, in the direction of the extreme observations. For a right-skewed distribution, the mean is greater than the median; for a symmetric distribution, the mean and the median are equal; and, for a left-skewed distribution, the mean is less than the median. [Figure 3.1: relative positions of the mean and median for right-skewed, symmetric, and left-skewed distributions]
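The comparison rule on this slide can be turned into a rough check in Python. This is a sketch, not part of the original slides. The function name `skew_hint`, its `tolerance` parameter, and the sample data sets are all hypothetical. It only compares the mean with the median, so it is a rule of thumb: equal mean and median do not by themselves prove a distribution is symmetric, and a plot is still needed.

```python
from statistics import mean, median


def skew_hint(data, tolerance=0.0):
    """Suggest a skew direction by comparing the mean with the median."""
    m = mean(data)
    med = median(data)
    if m - med > tolerance:
        return "right-skewed (mean > median)"
    if med - m > tolerance:
        return "left-skewed (mean < median)"
    return "roughly symmetric (mean = median)"


# Hypothetical data sets for illustration.
long_right_tail = [1, 2, 2, 3, 3, 3, 4, 20]
long_left_tail = [1, 17, 18, 18, 19, 19, 19, 20]
balanced = [1, 2, 3, 4, 5]

print(skew_hint(long_right_tail))  # right-skewed (mean > median)
print(skew_hint(long_left_tail))   # left-skewed (mean < median)
print(skew_hint(balanced))         # roughly symmetric (mean = median)
```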
Learning Goal 7: Comparing the mean and the median The mean and the median are the same when the distribution is symmetrical; in a skewed distribution they generally differ. The median is a measure of center that is resistant to skew and outliers. The mean is not. [Figure: positions of the mean and median for a symmetric distribution, a left-skewed distribution, and a right-skewed distribution]
Learning Goal 7: Which measure of location is the “best”? Because the median considers only the order of values, it is resistant to values that are extraordinarily large or small; it simply notes that they are one of the “big ones” or “small ones” and ignores their distance from center. To choose between the mean and median, start by looking at the distribution. The mean is used for unimodal symmetric distributions, unless extreme values (outliers) exist. The median is used for skewed distributions or when outliers are present, since the median is not sensitive to extreme values.
Learning Goal 7: Class Problem Observed mean = 2.28, median = 3, mode = 3.1. What is the shape of the distribution and why?
Learning Goal 7: Example Five houses on a hill by the beach. House Prices: $2,000,000 500,000 300,000 100,000 100,000
Learning Goal 7: Example – Measures of Center House Prices: $2,000,000 500,000 300,000 100,000 100,000
Sum: $3,000,000
Mean: $3,000,000/5 = $600,000
Median: middle value of ranked data = $300,000
Mode: most frequent value = $100,000
Which is the best measure of center? The median.
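The three measures for the house prices can be checked with Python's standard library. This is an illustrative sketch, not part of the original slides, and the variable names are made up for the example.

```python
from statistics import mean, median, mode

# House prices in dollars, from the example.
prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

price_mean = mean(prices)      # pulled up by the single $2,000,000 house
price_median = median(prices)  # middle value of the ranked data
price_mode = mode(prices)      # most frequent value

# Mean 600000, median 300000, mode 100000.
print(price_mean, price_median, price_mode)

# Four of the five houses cost less than the mean.
below_mean = sum(1 for p in prices if p < price_mean)
print(below_mean)  # 4
```

Because four of the five houses cost less than the mean, the median gives a more typical price for this skewed data set.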
Conclusion – Mean or Median? Mean – use with symmetrical distributions (no outliers), because it is nonresistant. Median – use with skewed distribution or distribution with outliers, because it is resistant.