1
1.2 Describing Distributions with Numbers
2
Center and spread are the most basic descriptions of what a data set “looks like.” They are intuitively meant to measure exactly what comes to your mind when you hear those terms, but the best way to define them isn’t so obvious. We first investigate center.
3
Center What is the center? Good question! The most popular measure of center is the mean, or average value, of a data set. The mean of a data set is computed by summing all data values and then dividing by the number of values, n. We denote the mean by $\bar{x}$ (read "x-bar"). The Greek letter Σ (Sigma) is used as shorthand to mean “sum up.”
4
Notation Suppose we are given the data values 10 4 7 12 1 4 10 8. Compute Σx. We may also combine operations on x with the summation. Compute Σ(x+1). A nice formula for the mean may then be written as $\bar{x} = \frac{\sum x}{n}$.
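As a quick check of the notation, the two sums and the mean can be computed directly. This is only a sketch in Python; the variable names are illustrative and not part of the slides.

```python
# Data values from the slide.
data = [10, 4, 7, 12, 1, 4, 10, 8]

n = len(data)                             # number of observations
sum_x = sum(data)                         # Σx
sum_x_plus_1 = sum(x + 1 for x in data)   # Σ(x+1), which equals Σx + n
mean = sum_x / n                          # mean = Σx / n

print(sum_x, sum_x_plus_1, mean)  # 56 64 7.0
```

Note that Σ(x+1) adds 1 once per data value, so it exceeds Σx by exactly n = 8.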
5
Problems with mean Consider the data set 3 2 6 3 5 70 2 4 4 1. Compute the mean of the data. We have that $\bar{x} = 10$, which hardly seems like a good “measure of average,” since nine of the ten values are 6 or smaller. The problem can be attributed to the value 70. A value that is considerably larger (or smaller) than the rest of the data is known as an outlier. Any measure of central tendency that is “sensitive” to extreme values, as the mean is here, is not a resistant measure.
6
A Resistant Measure The median of a collection of data values is the number in the middle position of the list when the data are arranged in increasing order. If the number of values is even, the median is the average of the two middle values. We use M to denote the median. Let’s find the median in the example above.
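To see resistance in action, the sketch below compares the mean and the median of the data set from the previous slide, with and without the outlier 70. It uses Python's standard `statistics` module; the name `without_outlier` is just illustrative.

```python
from statistics import mean, median

# Data set from the "Problems with mean" slide.
data = [3, 2, 6, 3, 5, 70, 2, 4, 4, 1]

# Same data with the outlier 70 dropped, for comparison.
without_outlier = [x for x in data if x != 70]

# Sorted data: 1 2 2 3 3 4 4 5 6 70, so the median averages 3 and 4.
print("with 70:   ", mean(data), median(data))
print("without 70:", mean(without_outlier), median(without_outlier))
```

Dropping the outlier moves the mean from 10 to about 3.33, while the median only moves from 3.5 to 3.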
7
Comparison The median is a resistant measure of central tendency, unlike the mean, which is a strong advantage. However, the mean takes into account the numerical size of every value, while the median depends only on the order of the values: changing the largest value does not change the median as long as it stays the largest. There is a formula for the mean but only a location for the median.
8
Inadequacies with Centers What centers do not take into account is the spread of the data. For example, consider the following data sets: A={19 20 20 21} and B={1 20 20 39}. In both data sets, the mean and median are both 20, but the data of B is much more spread out than the data of A. Thus, using just a center to describe our data is not good enough. We also need spread.
9
Measuring spread The most obvious measure of spread is the range of a set of data; that is, the difference between the highest and lowest data values. The range of a data set is denoted R. Above, R(A)=21-19=2 and R(B)=39-1=38. But consider the data sets C={1 50 50 50 50 50 50 50 50 50 99} and D={1 10 20 30 40 50 60 70 80 90 99}. Not only are the mean and median in both data sets the same, but so is the range, even though the values of D are far more spread out than those of C. The range alone is therefore not enough either.
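The claim about C and D is easy to verify numerically. The following is a small sketch using Python's `statistics` module; `data_range` is a helper name introduced here for illustration.

```python
from statistics import mean, median

def data_range(values):
    """Range R: difference between the highest and lowest data values."""
    return max(values) - min(values)

C = [1] + [50] * 9 + [99]
D = [1, 10, 20, 30, 40, 50, 60, 70, 80, 90, 99]

for name, values in (("C", C), ("D", D)):
    # Each row prints: name, mean, median, range.
    print(name, mean(values), median(values), data_range(values))
```

Both rows report a mean of 50, a median of 50, and a range of 98, so none of these three numbers distinguishes C from D.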
10
Percentile The mth percentile is the number that separates the bottom m% of the data from the top (100-m)% of the data. It is denoted by $P_m$. Note that the median of a data set is the 50th percentile, so that $P_{50} = M$.
11
Quartiles and IQR We define the first, second, and third quartiles by $Q_1 = P_{25}$, $Q_2 = P_{50}$, and $Q_3 = P_{75}$. The interquartile range, denoted IQR, is defined by $\mathrm{IQR} = Q_3 - Q_1$. It is easy to convince yourself that IQR is a resistant measure!
12
The 5 Number Summary We study a way to present a data set in which the reader can easily read off the quartiles and the high and low values of the data set. The following is an example of a boxplot. [Figure: boxplot marking Min, $Q_1$, $M$, $Q_3$, Max]
13
E.g. Consider the data set {33, 36, 37, 37, 38, 41, 42, 42, 42, 45, 47, 52, 54, 55, 56, 56, 57, 60, 78, 92}. Construct a boxplot. To identify outliers, we use a modified boxplot. The idea is that instead of drawing the whiskers from $Q_1$ to the lowest value and from $Q_3$ to the highest value, we draw the upper whisker from $Q_3$ to the largest data value between $Q_3$ and $Q_3 + 1.5 \times \mathrm{IQR}$. The lower whisker is drawn from $Q_1$ to the smallest data value between $Q_1 - 1.5 \times \mathrm{IQR}$ and $Q_1$. Any data value not covered by the box or whiskers is considered an outlier and is plotted individually. Construct a modified boxplot for the above data.
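The numbers needed for both boxplots can be sketched in Python. One assumption to flag: the slides do not say how the quartiles are computed, so this sketch takes Q1 and Q3 to be the medians of the lower and upper halves of the sorted data; other percentile conventions (including those used by many software packages) can give slightly different quartiles. The function names are illustrative.

```python
from statistics import median

def five_number_summary(values):
    """Return (Min, Q1, M, Q3, Max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves
    of the sorted data (the middle value is excluded when n is odd).
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    lower = xs[:half]             # values below the median position
    upper = xs[half + n % 2:]     # values above the median position
    return xs[0], median(lower), median(xs), median(upper), xs[-1]

def modified_boxplot_parts(values):
    """Whisker ends and outliers under the 1.5 x IQR rule."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in values if low_fence <= x <= high_fence]
    outliers = [x for x in values if x < low_fence or x > high_fence]
    return {
        "fences": (low_fence, high_fence),
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
    }

data = [33, 36, 37, 37, 38, 41, 42, 42, 42, 45,
        47, 52, 54, 55, 56, 56, 57, 60, 78, 92]

print(five_number_summary(data))
print(modified_boxplot_parts(data))
```

Under this quartile convention the summary is Min = 33, Q1 = 39.5, M = 46, Q3 = 56, Max = 92, so IQR = 16.5 and the cutoffs are 14.75 and 80.75. The upper whisker stops at 78, and 92 is plotted individually as an outlier.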
14
Another measure of spread There are other things we can do. Let’s experiment with the data set 2 3 7 12 on the board. We have just “discovered” the standard deviation. The standard deviation is denoted by $s$, and the variance is denoted by $s^2$. They are defined by $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$ and $s = \sqrt{s^2}$.
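For reference, here is how the board computation for 2 3 7 12 works out, assuming the usual sample definitions in which the squared deviations from the mean are summed and divided by n - 1:

```latex
\begin{align*}
\bar{x} &= \frac{2 + 3 + 7 + 12}{4} = 6 \\
s^2 &= \frac{(2-6)^2 + (3-6)^2 + (7-6)^2 + (12-6)^2}{4 - 1}
     = \frac{16 + 9 + 1 + 36}{3} = \frac{62}{3} \approx 20.67 \\
s &= \sqrt{\frac{62}{3}} \approx 4.55
\end{align*}
```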
15
Problems and problems Compute the standard deviation for the data sets A and B. “[The standard deviation]… will be large if the observations are widely spread about their mean, and small if the observations are close to their mean.” Unfortunately, the variance and the standard deviation, like the range, are not resistant; that is, they are sensitive to outliers.
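Both points can be checked with a short sketch. It assumes the sample standard deviation (divisor n - 1), which is what `statistics.stdev` in Python's standard library computes; the variable names are illustrative.

```python
from statistics import stdev

# Data sets A and B from the "Inadequacies with Centers" slide.
A = [19, 20, 20, 21]
B = [1, 20, 20, 39]

# Both have mean 20, but B is far more spread out.
print("s(A) =", round(stdev(A), 3))   # sqrt(2/3)
print("s(B) =", round(stdev(B), 3))   # sqrt(722/3)

# Sensitivity to outliers: the earlier data set with and without 70.
with_outlier = [3, 2, 6, 3, 5, 70, 2, 4, 4, 1]
without_outlier = [x for x in with_outlier if x != 70]
print("with 70:   ", round(stdev(with_outlier), 3))
print("without 70:", round(stdev(without_outlier), 3))
```

The standard deviation of B is roughly nineteen times that of A, and the single value 70 inflates the standard deviation of the earlier data set by more than a factor of ten.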