Published by Matthias Lange. Modified over 6 years ago.
1
The Practice of Statistics in the Life Sciences Fourth Edition
Chapter 2: Numerical descriptors Copyright © 2018 W. H. Freeman and Company
2
Objectives Describing distributions with numbers
Measures of center: mean and median
Measures of spread: quartiles and standard deviation
The five-number summary and boxplots
IQR and outliers
Dealing with outliers
Choosing among summary statistics
Organizing a statistical problem
3
Measure of center: the mean
The mean, or arithmetic average: to calculate the average (mean) of a data set, add all values, then divide by the number of individuals. It is the “center of mass.”
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
4
Measure of center: the median (1 of 2)
The median is the midpoint of a distribution—the number such that half of the observations are smaller, and half are larger. The data are sorted from small to large. The gray column has the ranks, the orange column the data points.
5
Measure of center: the median (2 of 2)
Sort observations from smallest to largest. With n = number of observations, the location of the median is (n + 1)/2 in the sorted list.
If n is odd, the median is the value of the center observation. Example: n = 25, (n + 1)/2 = 13, median = 3.4.
If n is even, the median is the mean of the two center observations. Example: n = 24, (n + 1)/2 = 12.5, so the median is the mean of observations #12 and #13: median = 3.35.
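The location rule can be sketched in a few lines of Python. This is a minimal illustration, not the book's code: the function name `median_by_location` and the two small data sets are hypothetical and were chosen only to show the odd and even cases.

```python
def median_by_location(values):
    """Return the median using the (n + 1) / 2 location rule."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    location = (n + 1) / 2                  # 1-based position in the sorted list
    if n % 2 == 1:
        # Odd n: the location is a whole number, take that observation.
        return ordered[int(location) - 1]
    # Even n: the location ends in .5, average the two neighboring observations.
    lower = ordered[int(location - 0.5) - 1]
    upper = ordered[int(location + 0.5) - 1]
    return (lower + upper) / 2


odd_sample = [7, 3, 9, 4, 5]        # hypothetical data, n = 5
even_sample = [7, 3, 9, 4, 5, 6]    # hypothetical data, n = 6

print(median_by_location(odd_sample))   # 5
print(median_by_location(even_sample))  # 5.5
```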
6
Comparing the mean and the median (1 of 2)
The median is a measure of center that is resistant to skew and outliers. The mean is not, because it is computed using ALL the numerical values in the data set. The median only requires finding the middle value and thus is not directly affected by values on the edges of the distribution. When the distribution is symmetrical, the mean and the median are (approximately) the same. Figure: Mean and median for a symmetric distribution.
7
Comparing the mean and the median (2 of 2)
When the distribution is skewed, the mean is pulled toward the long tail while the median stays close to the bulk of the data, so the two no longer agree. Figure: Mean and median for skewed distributions.
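A quick numerical check makes the difference in resistance concrete. The sketch below uses Python's standard `statistics` module on two hypothetical data sets (not from the book): a symmetric one, and the same values with the largest observation replaced by an outlier.

```python
from statistics import mean, median

symmetric = [2, 3, 4, 5, 6]        # hypothetical, symmetric around 4
with_outlier = [2, 3, 4, 5, 60]    # same data, largest value replaced by an outlier

# Symmetric data: mean and median agree.
print(mean(symmetric), median(symmetric))

# One extreme value moves the mean a lot but leaves the median unchanged.
print(mean(with_outlier), median(with_outlier))
```

Here the single outlier raises the mean from 4 to 14.8, while the median stays at 4.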
8
Example—mean and median (1 of 2)
A study of freely forming groups in taverns all over Europe recorded the size (number of individuals) of each group that was naturally laughing. (There were a total of 501 groups in the study.) The median laughter group size is: 2, 2.5, 3, 3.5, or 4? The median laughter group size is 2: with n = 501 groups, the location of the median is (n + 1)/2 = 251, and group #251 in the ordered list falls in the first column (group size 2).
9
Example—mean and median (2 of 2)
The average laughter group size is A) smaller than the median. B) about the same as the median. C) larger than the median. The mean is larger than the median (answer C): the data show a strong skew, which influences the mean but not the median. [Note that the mean is about 2.72, versus a median of 2.]
10
Measure of spread: quartiles
The first quartile, Q1, is the median of the values below the median in the sorted data set. The third quartile, Q3, is the median of the values above the median in the sorted data set. Different technology platforms may use slightly different definitions for the quartiles, so don’t be surprised if you get a different answer when using technology, or even from one software package to another.
11
Example—quartiles (1 of 2)
How fast do skin wounds heal? The skin healing rates of 18 newts were measured in micrometers per hour. From the sorted data, find the median and the quartiles.
12
Example—quartiles (2 of 2)
With n = 18, the location of the median is (n + 1)/2 = 9.5. So the median is the midpoint of values #9 and #10 in the sorted list: 26 and 27, respectively. Therefore, median = 26.5 micrometers per hour. The first quartile is the median of the points below the median, so points #1 through #9; this corresponds to location #5. Therefore, Q1 = 22 micrometers per hour. The third quartile is the median of the points above the median in the sorted list, so points #10 through #18; this corresponds to location #14. Therefore, Q3 = 33 micrometers per hour.
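The same steps can be sketched in Python. Two things here are assumptions. The function `textbook_quartiles` is a hypothetical helper that implements the "median of each half" definition used on these slides. The 18 values are illustrative only: the data table is not reproduced in this transcript, so they were chosen to agree with the ranks quoted above (values #9 and #10 are 26 and 27, value #5 is 22, value #14 is 33).

```python
from statistics import median, quantiles


def textbook_quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the sorted data.

    When n is odd, the overall median is left out of both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]          # values below the median
    upper = ordered[n - half:]      # values above the median
    return median(lower), median(upper)


# Illustrative data (assumed, see the note above), in micrometers per hour.
rates = [11, 12, 14, 18, 22, 22, 23, 23, 26, 27, 28, 29, 30, 33, 34, 35, 35, 40]

q1, q3 = textbook_quartiles(rates)
print(median(rates), q1, q3)        # 26.5 22 33

# The standard library uses a different quartile definition by default,
# so its Q1 and Q3 need not match the hand calculation.
print(quantiles(rates, n=4))
```

Comparing the two printed lines is a simple way to see why software may report slightly different quartiles than the hand method.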
13
Measure of spread: interquartile range
The interquartile range (IQR) is the distance between the first and third quartiles: IQR = Q3 – Q1. Because the quartiles are themselves medians (of each half of the data set), the IQR is a resistant statistic. The IQR equals zero when Q3 and Q1 are equal. The quartiles, the median, the minimum, and the maximum are called the five-number summary.
14
Measure of spread: standard deviation (1 of 2)
The standard deviation is used to describe the variation around the mean. To get the standard deviation of a SAMPLE of data, first calculate the variance s²:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
Standard deviation measures spread by looking at how far the observations are from their mean. Although variance is a useful measure of spread, its units are units squared. The standard deviation (square root of the variance) is more intuitive, because it has the same units as the raw data and the mean.
The following is for your information only and is not discussed in the book. Why do we divide by n − 1 instead of n? We are dividing by the number of independent pieces of information that go into the estimate of a parameter. This number is called the degrees of freedom (df); it is equal to the number of independent scores that go into the estimate minus the number of parameters estimated as intermediate steps in the estimation of the parameter itself. But why the term “degrees of freedom”? When we calculate the variance of a random sample, we must first calculate the mean of that sample and then compute the sum of the squared deviations from that mean. While there are n such squared deviations, only (n − 1) of them are free to assume any value whatsoever. This is because the deviations from the sample mean always sum to zero, so once (n − 1) of them are known, the last one is determined. For these reasons, the sample variance is said to have only (n − 1) degrees of freedom.
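The degrees-of-freedom argument can be checked numerically. The sketch below uses a small hypothetical data set (not from the book) to show that the deviations from the sample mean sum to zero, so the last deviation is fixed once the other n − 1 are known.

```python
# Hypothetical sample, chosen only so the arithmetic is easy to follow.
data = [4, 7, 7, 10]
n = len(data)

x_bar = sum(data) / n                      # sample mean
deviations = [x - x_bar for x in data]     # x_i - x_bar for each observation

# The deviations always sum to zero, so the last one is determined by the others.
implied_last = -sum(deviations[:-1])

# Sample variance: sum of squared deviations divided by df = n - 1.
df = n - 1
variance = sum(d ** 2 for d in deviations) / df

print(deviations)      # [-3.0, 0.0, 0.0, 3.0]
print(implied_last)    # 3.0, the same as the last deviation
print(variance)        # 6.0
```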
15
Measure of spread: standard deviation (2 of 2)
Take the square root to get the standard deviation, s:
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
Learn how to obtain the standard deviation of a sample using technology.
16
Example—calculating the standard deviation (1 of 2)
A person’s metabolic rate is the rate at which the body consumes energy. Find the mean and standard deviation for the metabolic rates of a sample of 7 men (in kilocalories, Cal, per 24 hours).
17
Example—calculating the standard deviation (2 of 2)
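$$\bar{x} = \frac{\sum x_i}{n} = 1600$$
$$\sum (x_i - \bar{x})^2 = 214{,}870$$
$$df = n - 1 = 6$$
$$s^2 = \frac{1}{df}\sum (x_i - \bar{x})^2 = \frac{214{,}870}{6} = 35{,}811.7$$
$$s = \sqrt{35{,}811.7} \approx 189.2$$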
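The calculation can be reproduced step by step in Python. The seven metabolic rates below are an assumption: the data table is not shown in this transcript, and these values were chosen because they give exactly the mean (1600) and sum of squared deviations (214,870) quoted on the slide. The last line cross-checks the hand calculation against `statistics.stdev`, which also divides by n − 1.

```python
from math import sqrt
from statistics import stdev

# Assumed metabolic rates for the 7 men, in Cal per 24 hours (see note above).
rates = [1792, 1666, 1362, 1614, 1460, 1867, 1439]

n = len(rates)
x_bar = sum(rates) / n                              # mean
sum_sq = sum((x - x_bar) ** 2 for x in rates)       # sum of squared deviations
df = n - 1                                          # degrees of freedom
variance = sum_sq / df                              # s squared
s = sqrt(variance)                                  # standard deviation

print(x_bar, sum_sq, df)                   # 1600.0 214870.0 6
print(round(variance, 1), round(s, 1))     # 35811.7 189.2
print(round(stdev(rates), 1))              # 189.2
```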
18
Features of the standard deviation
s measures spread about the mean, and should only be used when the mean is the measure of center. s is always zero or greater than zero; s = 0 only when all the values in the sample are identical. s has the same units of measurement as the original observations. s², the variance, has squared units of the original observations, and is harder to interpret. s, like the mean, is not resistant. Outliers have an even larger effect on s than they do on the mean.
19
Graphical displays: boxplots
The boxplot is a graphical view of the five-number summary. Five-number summary: min, Q1, M, Q3, max. Boxplots are sometimes also called “box-and-whiskers” plots.
20
IQR and suspected outliers
Recall that the interquartile range (IQR) is the distance between the first and third quartiles (the length of the box in the boxplot): IQR = Q3 – Q1. An outlier is an individual value that falls outside the overall pattern. How far outside the overall pattern does a value have to fall to be considered a suspected outlier? Suspected low outlier: any value < Q1 – 1.5 IQR. Suspected high outlier: any value > Q3 + 1.5 IQR.
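The 1.5 IQR rule translates directly into code. In this sketch the function name `suspected_outliers` and the nine data values are hypothetical. The quartiles passed in (10.5 and 15.5) are those of the example data under the "median of each half" definition.

```python
def suspected_outliers(values, q1, q3):
    """Return the values flagged by the 1.5 IQR rule, given Q1 and Q3."""
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr      # anything below this is a suspected low outlier
    high_fence = q3 + 1.5 * iqr     # anything above this is a suspected high outlier
    return [x for x in values if x < low_fence or x > high_fence]


# Hypothetical sorted data, n = 9: the median is 13,
# Q1 = (10 + 11) / 2 = 10.5 and Q3 = (15 + 16) / 2 = 15.5.
values = [1, 10, 11, 12, 13, 14, 15, 16, 30]

print(suspected_outliers(values, q1=10.5, q3=15.5))   # [1, 30]
```

With IQR = 5, the fences sit at 3 and 23, so only 1 and 30 are flagged.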
21
Example 1—using IQR to identify outliers
Some software programs create “modified boxplots,” in which suspected outliers (according to the 1.5 IQR rule) are displayed by a star or asterisk. The “whiskers” then only extend to the next value in the sorted list. Individual #25 has a survival of 7.9 years, which is 3.55 years above the third quartile. This is more than 1.5 IQR, so individual #25 is a suspected outlier.
22
Example 2—using IQR to identify outliers
Anonymous class survey: weight (lbs) and height (in) were used to compute BMI. A modified boxplot helps distinguish between points that are part of a skewed pattern and the presence of an outlier. The three values around 33–35 are close to the rest of the pattern and appear to simply be part of the skew. The largest value is clearly an outlier, far from the rest of the data.
23
Dealing with outliers
What should you do if you find outliers in your data? It depends in part on what kind of outliers they are:
Human error in recording information
Human error in experimentation or data collection
Unexplainable but apparently legitimate wild observations
Are you interested in ALL individuals? Are you interested only in typical individuals? Don’t discard outliers just to make your data look better, and don’t act as if they did not exist. Refer to the Chapter 2 discussion on the treatment of outliers.
24
Choosing among summary statistics
Because the mean is not resistant to outliers or skew, use it to describe distributions that are fairly symmetrical and don’t have outliers. Plot the mean and use the standard deviation for error bars. Otherwise, use the median and the five-number summary, which can be plotted as a boxplot.
25
Example 1—choosing summary statistics
Deep-sea sediments: Phytopigment concentrations in deep-sea sediments collected worldwide show a very strong right-skew. Which of these two values is the mean and which is the median? 0.015 and [...] grams per square meter of bottom surface. Which would be a better summary statistic for these data? We know that the mean is not a robust measure of center and that it is influenced by skews and outliers. Because the data are strongly right-skewed, we expect the mean to be larger than the median. Therefore, the larger value is the mean and the smaller value is the median. Given the skewed nature of these data, the median would probably be a better summary statistic (depending on intended use).
26
Example 2—choosing summary statistics (1 of 3)
Researchers grafted human cancerous cells onto 20 healthy adult mice. Then 10 of the mice were injected with tumor-specific antibodies (anti-CD47), while the other 10 mice were not (IgG). Here is what a table of the raw data would look like. What summary statistics would you use for each of these two variables? Presence of metastases: This is a categorical variable. Compute the count of mice with metastases (10 for IgG versus 1 for anti-CD47) or the proportion of mice with metastases (1 versus 0.1) for each group. Number of metastases: This is a quantitative variable. Compute the mean and standard deviation of the number of metastases for each group (2.4 and 0.97 for IgG versus 0.1 and 0.32 for anti-CD47). The five-number summary can be computed as well, but with just 10 values in each group it would not condense the data much.
27
Example 2—choosing summary statistics (2 of 3)
Table of raw data, mice 1–13. Columns: Mouse, Treatment, Presence of metastases, Number of metastases. Mice 1–10: IgG; mice 11–13: anti-CD47.
28
Example 2—choosing summary statistics (3 of 3)
Table of raw data, mice 14–20 (all anti-CD47). Columns: Mouse, Treatment, Presence of metastases, Number of metastases. Mouse 20 is the only anti-CD47 mouse with metastases (number of metastases: 1).
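The summaries quoted for the anti-CD47 group can be reproduced with a short sketch. The raw values are a reconstruction rather than a copy of the table: with one mouse out of ten showing a single metastasis and a group mean of 0.1, the other nine counts must be 0. The IgG counts are not recoverable from this transcript, so they are not included.

```python
from statistics import mean, stdev

# Reconstructed anti-CD47 group: nine mice with 0 metastases, one mouse with 1.
anti_cd47 = [0] * 9 + [1]

# Categorical variable: presence of metastases (yes/no) for each mouse.
has_metastases = [count > 0 for count in anti_cd47]
count_yes = sum(has_metastases)
proportion_yes = count_yes / len(has_metastases)

# Quantitative variable: number of metastases per mouse.
group_mean = mean(anti_cd47)
group_sd = stdev(anti_cd47)         # sample standard deviation (divides by n - 1)

print(count_yes, proportion_yes)                      # 1 0.1
print(round(group_mean, 2), round(group_sd, 2))       # 0.1 0.32
```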
29
Organizing a statistical problem
State: What is the practical question, in the context of a real-world setting?
Plan: What specific statistical operations does this problem call for?
Solve: Make the graphs and carry out the calculations needed for this problem.
Conclude: Give your practical conclusion in the real-world setting.