1
Prepared by Lloyd R. Jaisingh
A PowerPoint Presentation Package to Accompany Applied Statistics in Business & Economics, 4th edition, by David P. Doane and Lori E. Seward. Title page for Chapter 4.
2
Descriptive Statistics
Chapter 4 Chapter Contents
4.1 Numerical Description
4.2 Measures of Center
4.3 Measures of Variability
4.4 Standardized Data
4.5 Percentiles, Quartiles, and Box Plots
4.6 Correlation and Covariance
4.7 Grouped Data
4.8 Skewness and Kurtosis
Contents of the Chapter: This chapter has eight sections, listed above.
3
Descriptive Statistics
Chapter 4 Chapter Learning Objectives
LO4-1: Explain the concepts of center, variability, and shape.
LO4-2: Use Excel to obtain descriptive statistics and visual displays.
LO4-3: Calculate and interpret common measures of center.
LO4-4: Calculate and interpret common measures of variability.
LO4-5: Transform a data set into standardized values.
LO4-6: Apply the Empirical Rule and recognize outliers.
Learning Objectives: Listed are the first six learning objectives for this chapter. These objectives will be discussed as the chapter progresses.
4
Descriptive Statistics
Chapter 4 Chapter Learning Objectives
LO4-7: Calculate quartiles and other percentiles.
LO4-8: Make and interpret box plots.
LO4-9: Calculate and interpret a correlation coefficient and covariance.
LO4-10: Calculate the mean and standard deviation from grouped data.
LO4-11: Assess skewness and kurtosis in a sample.
Learning Objectives: Listed are the last five learning objectives for this chapter. These objectives will be discussed as the chapter progresses.
5
4.1 Numerical Description
LO4-1: Explain the concepts of center, variability, and shape. The three key characteristics of numerical data are center, variability, and shape. Numerical Description: Each characteristic has its own measures (measures of center, measures of variability, and measures of shape). Some of these measures will be discussed in this chapter.
6
4.1 Numerical Description
LO4-2: Use Excel to obtain descriptive statistics and visual displays. Excel histogram display for Table 4.3. Numerical Description: We can use any appropriate technology to graphically display the data before any measures are computed. One purpose of this is to get a visual idea of some of the properties of the distribution. Figures 4.1 and 4.3 show examples of some of the graphical displays you can create to get an idea of the behavior of the data set.
7
4.2 Measures of Center Mean Chapter 4 LO4-3
LO4-3: Calculate and interpret common measures of center. Mean: a familiar measure of center. Population mean: μ = (x1 + x2 + … + xN)/N. Sample mean: x̄ = (x1 + x2 + … + xn)/n. Measures of Center: The most common measures of center are the mean, median, and mode. The mean is computed by adding all the values and dividing by the number of values. The population mean and the sample mean are computed in the same manner; only the notation differs (μ and N for a population, x̄ and n for a sample). In Excel, use the function =AVERAGE(Data), where Data is an array of data values.
8
4.2 Measures of Center Median Chapter 4 LO4-3
The median (M) is the 50th percentile or midpoint of the sorted sample data. M separates the upper and lower halves of the sorted observations. If n is odd, the median is the middle observation in the data array. If n is even, the median is the average of the middle two observations in the data array. Measures of Center: Another measure of center is the median. It is that middle number in the data set when the data are ordered from smallest to largest. When the sample size n is odd, the median will be the middle number in the ordered set. When the sample size n is even, the median will be the average of the two middle numbers in the ordered set. The same procedures will apply if the data are from a population.
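The odd/even rule for the median can be sketched in a few lines of Python. The function name and the two small samples are hypothetical, chosen only to illustrate the rule.

```python
def median(values):
    """Middle value of the sorted data (n odd) or mean of the two middle values (n even)."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                      # n is odd: the single middle observation
    return (data[mid - 1] + data[mid]) / 2    # n is even: average the two middle observations

odd_sample = [7, 3, 9, 1, 5]        # hypothetical data, n = 5
even_sample = [7, 3, 9, 1, 5, 11]   # hypothetical data, n = 6
print(median(odd_sample))    # 5
print(median(even_sample))   # 6.0
```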
9
4.2 Measures of Center Mode Chapter 4 LO4-3
The most frequently occurring data value. May have multiple modes or no mode. The mode is most useful for discrete or categorical data with only a few distinct data values. For continuous data or data with a wide range, the mode is rarely useful. Measures of Center: The mode is another measure of center. It is the most frequently occurring value in the data set. There may be more than one mode for a data set. The mode is most useful for discrete or categorical data. The mode is not very useful for continuous data.
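A small sketch of finding the mode, including the multiple-mode and no-mode cases mentioned above. It uses `statistics.multimode` from the Python standard library (Python 3.8 or later); treating "every value occurs once" as "no mode" is an assumption made here for illustration, and the sample data are hypothetical.

```python
from statistics import multimode

def modes(values):
    """Return all most-frequent values, or an empty list when no value repeats."""
    if len(set(values)) == len(values):
        return []                 # every value is distinct: no mode
    return multimode(values)      # one or more values tied for the highest frequency

print(modes([2, 3, 3, 4, 5]))     # [3]
print(modes([1, 1, 2, 2, 3]))     # [1, 2]
print(modes([1.2, 3.4, 5.6]))     # []
```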
10
4.2 Measures of Center Shape Chapter 4 LO4-1
LO4-1: Explain the concepts of center, variability, and shape. Shape: Compare the mean and median, or look at the histogram, to determine the degree of skewness. Figure 4.10 shows prototype population shapes with varying degrees of skewness.
Mean < Median < Mode: negatively skewed distribution.
Mean = Median = Mode: symmetric distribution.
Mean > Median > Mode: positively skewed distribution.
Measures of Center: We can use the mean, median, and mode to characterize the shape of a distribution. Figure 4.10 shows the general shapes of data distributions.
11
4.2 Measures of Center Geometric Mean Growth Rates Chapter 4 LO4-3
The geometric mean (G) is a multiplicative average: G = (x1 · x2 · … · xn)^(1/n). Growth Rates: A variation on the geometric mean is used to find the average growth rate for a time series: GR = (xn/x1)^(1/(n – 1)) – 1. Measures of Center: The geometric mean is the multiplicative average for a set of data. A variation of the geometric mean is used to find the average growth rate for a set of time series data.
12
4.2 Measures of Center Growth Rates Chapter 4 LO4-3
For example, from 2006 to 2010, JetBlue Airlines revenues are:
Year  Revenue (mil)
2006  2,361
2007  2,843
2008  3,392
2009  3,292
2010  3,779
The average growth rate: GR = (3,779/2,361)^(1/4) – 1 ≈ 0.125, or 12.5% per year. Measures of Center: This slide presents an example for the average growth rate.
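The growth-rate example can be checked with a short Python sketch. The function names are hypothetical, and the growth-rate formula used, (last/first)^(1/(n-1)) - 1, is the standard one and is assumed to match the slide because it reproduces the stated 12.5%.

```python
def geometric_mean(values):
    """n-th root of the product of n positive values."""
    product = 1.0
    for x in values:
        product *= x
    return product ** (1 / len(values))

def average_growth_rate(series):
    """Average per-period growth rate between the first and last values of a time series."""
    periods = len(series) - 1          # n observations span n - 1 growth periods
    return (series[-1] / series[0]) ** (1 / periods) - 1

revenue = [2361, 2843, 3392, 3292, 3779]   # JetBlue revenue (mil), 2006 to 2010
print(f"{average_growth_rate(revenue):.1%}")   # 12.5%
```

Only the first and last values enter the growth-rate calculation, so the dip in 2009 does not affect the result.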
13
4.2 Measures of Center Midrange Chapter 4 LO4-3
The midrange is the point halfway between the lowest and highest values of X: Midrange = (xmin + xmax)/2. It is easy to use but sensitive to extreme data values. For the J.D. Power quality data, the midrange (126.5) is higher than the mean (114.70) or median (113). Measures of Center: The midrange is another measure of center. It is the point halfway between the smallest and largest values in a data set. This measure is very sensitive to outliers in the data set.
14
4.2 Measures of Center Trimmed Mean Chapter 4 LO4-3
To calculate the trimmed mean, first remove the highest and lowest k percent of the observations. For example, for the n = 33 P/E ratios, we want a 5 percent trimmed mean (i.e., k = .05). To determine how many observations to trim, multiply k by n, which is 0.05 × 33 = 1.65, or 2 observations after rounding. So, we would remove the two smallest and two largest observations before averaging the remaining values. Measures of Center: Another measure of center is the trimmed mean. Computing the trimmed mean is a way of dealing with any outliers in the data set. The chosen percentage of values is removed from both the lower end and the upper end of the ordered data set.
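The trimming rule described above (multiply k by n, round, and drop that many values from each end) might be sketched as follows. The function name and the 33-value sample are hypothetical; the sample is not the P/E data from the slide. Spreadsheet functions may round the trim count differently, so results can differ slightly from software output.

```python
def trimmed_mean(values, k):
    """Mean after removing round(k * n) observations from each end of the sorted data."""
    data = sorted(values)
    n = len(data)
    trim = int(k * n + 0.5)              # round to nearest, e.g. 0.05 * 33 = 1.65 -> 2
    if 2 * trim >= n:
        raise ValueError("k is too large: no observations would remain")
    kept = data[trim:n - trim]
    return sum(kept) / len(kept)

# Hypothetical sample: 32 ordinary values plus one extreme value (n = 33)
sample = list(range(10, 42)) + [200]
print(trimmed_mean(sample, 0.05))        # 26.0
print(round(sum(sample) / len(sample), 2))   # 30.79, the untrimmed mean
```

The extreme value 200 pulls the ordinary mean upward, while the trimmed mean stays near the middle of the bulk of the data.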
15
4.2 Measures of Center Chapter 4 LO4-3 Trimmed Mean
Here is a summary of all the measures of central tendency for the J.D. Power data.
Mean: 114.70, =AVERAGE(Data)
Median: 113, =MEDIAN(Data)
Mode: 111, =MODE.SNGL(Data)
Geometric Mean: 113.35, =GEOMEAN(Data)
Midrange: 126.5, =(MIN(Data)+MAX(Data))/2
5% Trim Mean: 113.94, =TRIMMEAN(Data, 0.1) (the second argument is the total fraction trimmed, 5% from each end)
The trimmed mean mitigates the effects of very high values, but still exceeds the median. Measures of Center: This slide shows these measures computed for a data set with the Excel software.
16
4.3 Measures of Variability Chapter 4 LO4-4
LO4-4: Calculate and interpret common measures of variability. Variation is the “spread” of data points about the center of the distribution in a sample. Consider the following measures of variability:
Statistic: Range. Formula: xmax – xmin. Excel: =MAX(Data)-MIN(Data). Pro: Easy to calculate. Con: Sensitive to extreme data values.
Statistic: Sample variance (s²). Formula: s² = Σ(xi – x̄)²/(n – 1). Excel: =VAR.S(Data). Pro: Plays a key role in mathematical statistics. Con: Nonintuitive meaning.
Measures of Variability: These measures quantify the spread of a data set about the center of the distribution. The larger the computed values for these measures, the more variable the data set is about its mean. Several measures of variability are discussed in this section. These measures can be computed for both a sample and a population. The variance for the sample is the most used measure of variation for a sample.
17
4.3 Measures of Variability
LO4-4 Chapter 4 Measures of Variation
Statistic: Sample standard deviation (s). Formula: s = √[Σ(xi – x̄)²/(n – 1)]. Excel: =STDEV.S(Data). Pro: Most common measure. Uses same units as the raw data ($, £, ¥, grams, etc.). Con: Nonintuitive meaning.
Statistic: Sample coefficient of variation (CV). Formula: CV = 100 × s/x̄. Excel: None. Pro: Measures relative variation in percent so can compare data sets. Con: Requires non-negative data.
Measures of Variability: The sample standard deviation is usually given as a statistic for a sample data set because it has the same unit as the variable in the data set. The coefficient of variation is used to compare the variability of two or more different variables measured in different units.
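As a rough illustration of the range, sample variance, sample standard deviation, and coefficient of variation, here is a minimal Python sketch. The function names and the eight-value data set are hypothetical; the formulas assumed are the usual n - 1 sample versions.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def sample_std(values):
    """Square root of the sample variance, in the same units as the data."""
    return math.sqrt(sample_variance(values))

def coefficient_of_variation(values):
    """Standard deviation as a percent of the mean; requires a positive mean."""
    mean = sum(values) / len(values)
    if mean <= 0:
        raise ValueError("CV is undefined when the mean is zero or negative")
    return 100 * sample_std(values) / mean

data = [2, 4, 4, 4, 5, 5, 7, 9]                  # hypothetical data, mean = 5
print(max(data) - min(data))                     # 7
print(round(sample_variance(data), 3))           # 4.571
print(round(sample_std(data), 3))                # 2.138
print(round(coefficient_of_variation(data), 2))  # 42.76
```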
18
4.3 Measures of Variability
LO4-4 Chapter 4 Measures of Variability
Statistic: Mean absolute deviation (MAD). Formula: MAD = Σ|xi – x̄|/n. Excel: =AVEDEV(Data). Pro: Easy to understand. Con: Lacks “nice” theoretical properties.
Population variance: σ² = Σ(xi – μ)²/N. Population standard deviation: σ = √[Σ(xi – μ)²/N].
Measures of Variability: Another measure of variability is the mean absolute deviation. This measure computes the average of the absolute deviations from the mean of a data set.
19
4.3 Measures of Variability
LO4-4 Chapter 4 Coefficient of Variation: CV = 100 × s/x̄. Useful for comparing variables measured in different units or with different means. A unit-free measure of dispersion, expressed as a percent of the mean. Only appropriate for nonnegative data; it is undefined if the mean is zero or negative. Measures of Variability: The coefficient of variation is unit-free, so it can be used to compare the variability of different variables.
20
4.3 Measures of Variability
LO4-4 Chapter 4 Mean Absolute Deviation This statistic reveals the average distance from the center. Absolute values must be used since otherwise the deviations around the mean would sum to zero. It is stated in the unit of measurement. Measures of Variability: Another measure of variability is the mean absolute deviation. This measure computes the average of the absolute deviations from the mean of a data set. The MAD is appealing because of its simple interpretation.
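The point about absolute values can be seen in a short sketch: the signed deviations cancel, while the absolute deviations give an average distance from the mean. The function name and data are hypothetical; dividing by n is assumed here, which is also what Excel's AVEDEV does.

```python
def mean_absolute_deviation(values):
    """Average absolute distance of the observations from their mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

data = [2, 4, 4, 4, 5, 5, 7, 9]            # hypothetical data, mean = 5
mean = sum(data) / len(data)
print(sum(x - mean for x in data))         # 0.0, signed deviations cancel out
print(mean_absolute_deviation(data))       # 1.5
```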
21
4.3 Measures of Variability
LO4-1 Chapter 4 Central Tendency vs. Dispersion: Manufacturing Measures of Variability: Figure 4.19 shows two different distributions with different variability and different means. Take frequent samples to monitor quality.
22
4.4 Standardized Data Chebyshev’s Theorem Chapter 4
For any population with mean μ and standard deviation σ, the percentage of observations that lie within k standard deviations of the mean must be at least 100[1 – 1/k²]. For k = 2 standard deviations, 100[1 – 1/2²] = 75%, so at least 75.0% will lie within μ ± 2σ. For k = 3 standard deviations, 100[1 – 1/3²] = 88.9%, so at least 88.9% will lie within μ ± 3σ. Although applicable to any data set, these limits tend to be rather wide. Standardized Data: Chebyshev’s Theorem allows us to find the minimum percentage of the data that will lie within a certain number of standard deviations from the mean.
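The Chebyshev bound is simple enough to evaluate directly. The sketch below (function name hypothetical) reproduces the two percentages quoted for k = 2 and k = 3.

```python
def chebyshev_min_percent(k):
    """Minimum percent of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 100 * (1 - 1 / k ** 2)

for k in (2, 3):
    print(k, round(chebyshev_min_percent(k), 1))   # 2 75.0, then 3 88.9
```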
23
4.4 Standardized Data The Empirical Rule Chapter 4
The normal distribution is symmetric and is also known as the bell-shaped curve. The Empirical Rule states that for data from a normal distribution, we expect the interval μ ± kσ to contain a known percentage of data. For
k = 1, 68.26% will lie within μ ± 1σ
k = 2, 95.44% will lie within μ ± 2σ
k = 3, 99.73% will lie within μ ± 3σ
Standardized Data: The Empirical Rule applies to the normal distribution. It states that the percentage of data within a given number of standard deviations of the mean is the same for ANY normal distribution: approximately 68% within one standard deviation, approximately 95.5% within two, and approximately 99.7% within three.
24
4.4 Standardized Data The Empirical Rule Chapter 4
Note: No upper bound is given. Data values outside μ ± 3σ are rare. Standardized Data: Figure 4.20 graphically depicts the Empirical Rule.
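The Empirical Rule percentages can be computed from the normal distribution itself. This sketch uses the identity P(|Z| < k) = erf(k/√2) with `math.erf` from the standard library; the function name is hypothetical. The results differ from the slide's figures in the last digit for k = 1 and k = 2, presumably because the slide's values come from a rounded normal table.

```python
import math

def normal_percent_within(k):
    """Percent of a normal distribution lying within k standard deviations of its mean."""
    return 100 * math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(normal_percent_within(k), 2))   # 68.27, 95.45, 99.73
```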
25
4.4 Standardized Data Chapter 4 LO4-5
LO4-5: Transform a data set into standardized values. A standardized variable (Z) redefines each observation in terms of the number of standard deviations from the mean. Standardization formula for a population: z = (x – μ)/σ. Standardization formula for a sample (for n > 30): z = (x – x̄)/s. A negative z value means the observation is to the left of the mean. A positive z value means the observation is to the right of the mean. Standardized Data: One way of standardizing values in a data set is to compute the z-scores.
26
4.4 Standardized Data Chapter 4 LO4-6
LO4-6: Apply the Empirical Rule and recognize outliers. Standardized Data: We can use the Empirical Rule to help determine whether a standardized data value is an unusual observation or an outlier. A data value is classified as an unusual observation if it lies more than two standard deviations from the mean (|z| > 2). A data value is classified as an outlier if it lies more than three standard deviations from the mean (|z| > 3).
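Standardizing a sample and applying the two-sigma and three-sigma cutoffs could look like the sketch below. The function names, labels, and 16-value sample are hypothetical; the sample standard deviation (n - 1 divisor) is assumed.

```python
import math

def z_scores(values):
    """Standardize each observation: (x - mean) / sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return [(x - mean) / s for x in values]

def classify(z):
    """Label a standardized value using the 2- and 3-standard-deviation cutoffs."""
    if abs(z) > 3:
        return "outlier"
    if abs(z) > 2:
        return "unusual"
    return "typical"

# Hypothetical sample: fifteen values near 50 and one far from the rest
sample = [50, 52, 48, 51, 49, 50, 53, 47, 50, 51, 49, 50, 52, 48, 50, 70]
labels = [classify(z) for z in z_scores(sample)]
print(labels[-1])   # outlier (the value 70)
```

In small samples an extreme value inflates s and so limits how large its own z-score can get, which is one reason the cutoffs work best with larger samples.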
27
4.4 Standardized Data Estimating Sigma Chapter 4
For a normal distribution, the range of values is almost 6σ (from μ – 3σ to μ + 3σ). If you know the range R (high – low), you can estimate the standard deviation as σ = R/6. Useful for approximating the standard deviation when only R is known. This estimate depends on the assumption of normality. Standardized Data: We can estimate the population standard deviation if we know the range of the data set. This estimate assumes the data set is normally distributed.
28
Percentiles 4.5 Percentiles, Quartiles, and Box-Plots Chapter 4 LO4-7
LO4-7: Calculate quartiles and other percentiles. Percentiles are data that have been divided into 100 groups. For example, you score in the 83rd percentile on a standardized test. That means that 83% of the test-takers scored below you. Deciles are data that have been divided into 10 groups. Quintiles are data that have been divided into 5 groups. Quartiles are data that have been divided into 4 groups. Percentiles, Quartiles, and Box Plots: Percentiles divide the data set into 100 equal groups. Special percentiles are deciles, quintiles, and quartiles.
29
Percentiles 4.5 Percentiles, Quartiles, and Box Plots Chapter 4 LO4-7
Percentiles may be used to establish benchmarks for comparison purposes (e.g. health care, manufacturing, and banking industries use 5th, 25th, 50th, 75th and 90th percentiles). Quartiles (25, 50, and 75 percent) are commonly used to assess financial performance and stock portfolios. Percentiles can be used in employee merit evaluation and salary benchmarking. Percentiles, Quartiles, and Box Plots: Percentiles can be used in many situations. A few examples are given in the slide. 29
30
Quartiles 4.5 Percentiles, Quartiles, and Box Plots Chapter 4 LO4-7
Quartiles are scale points that divide the sorted data into four groups of approximately equal size: Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%. Percentiles, Quartiles, and Box Plots: Quartiles divide the ordered data set into four equal parts. The three values that separate the four groups are called Q1, Q2, and Q3, respectively.
31
4.5 Percentiles, Quartiles, and Box Plots
Chapter 4 Quartiles The second quartile Q2 is the median, a measure of central tendency: Lower 50% | Q2 | Upper 50%. Q1 and Q3 measure dispersion since the interquartile range Q3 – Q1 measures the degree of spread in the middle 50 percent of data values: Lower 25% | Q1 | Middle 50% | Q3 | Upper 25%. Percentiles, Quartiles, and Box Plots: The second quartile is the same as the median for the data set. The interquartile range, Q3 – Q1, is the width of the middle 50% of the ordered data set.
32
4.5 Percentiles, Quartiles, and Box Plots Chapter 4 LO4-7
Quartiles – The method of medians: The first quartile Q1 is the median of the data values below Q2, and the third quartile Q3 is the median of the data values above Q2. Lower 25% | Q1 | Second 25% | Q2 | Third 25% | Q3 | Upper 25%. For the first half of the data, 50% lie above and 50% lie below Q1. For the second half of the data, 50% lie above and 50% lie below Q3. Percentiles, Quartiles, and Box Plots: One method of finding the quartiles is called the method of medians.
33
4.5 Percentiles, Quartiles, and Box Plots
Chapter 4 Method of Medians For small data sets, find quartiles using the method of medians:
Step 1: Sort the observations.
Step 2: Find the median Q2.
Step 3: Find the median of the data values that lie below Q2.
Step 4: Find the median of the data values that lie above Q2.
Percentiles, Quartiles, and Box Plots: This slide gives the steps involved in computing the quartiles by the method of medians.
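The four steps of the method of medians can be sketched as below. The function names and sample data are hypothetical. One assumption is made explicit in the code: when n is odd, the middle observation (Q2 itself) is left out of both halves.

```python
def median_of_sorted(data):
    """Median of an already sorted list."""
    n = len(data)
    mid = n // 2
    return data[mid] if n % 2 == 1 else (data[mid - 1] + data[mid]) / 2

def quartiles_by_medians(values):
    """Return (Q1, Q2, Q3) using the method of medians."""
    data = sorted(values)                  # Step 1: sort the observations
    n = len(data)
    q2 = median_of_sorted(data)            # Step 2: find the median Q2
    half = n // 2
    lower = data[:half]                    # values below Q2 (middle value excluded if n is odd)
    upper = data[half + n % 2:]            # values above Q2
    q1 = median_of_sorted(lower)           # Step 3: median of the lower half
    q3 = median_of_sorted(upper)           # Step 4: median of the upper half
    return q1, q2, q3

print(quartiles_by_medians([21, 7, 35, 12, 30, 15, 24, 18]))   # (13.5, 19.5, 27.0)
print(quartiles_by_medians([1, 2, 3, 4, 5, 6, 7]))             # (2, 4, 6)
```

Statistical software often uses interpolation-based quartile definitions instead, so its output may not match this method exactly on small samples.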
34
4.5 Percentiles, Quartiles, and Box Plots
Chapter 4 Method of Medians Example: Percentiles, Quartiles, and Box Plots: Slide shows an example using the method of medians to compute the quartiles. 34
35
4.5 Percentiles, Quartiles, and Box Plots Chapter 4 LO4-7
Example: P/E Ratios and Quartiles. So, to summarize: Q1 = 27, Q2 = 35.5, Q3 = 40.5. Lower 25% of P/E Ratios | 27 | Second 25% of P/E Ratios | 35.5 | Third 25% of P/E Ratios | 40.5 | Upper 25% of P/E Ratios. These quartiles express central tendency and dispersion. What is the interquartile range? Percentiles, Quartiles, and Box Plots: Example continued with the quartiles.
36
4.5 Percentiles, Quartiles, and Box Plots
Chapter 4 LO4-8: Make and interpret box plots. A useful tool of exploratory data analysis (EDA). Also called a box-and-whisker plot. Based on a five-number summary: Xmin, Q1, Q2, Q3, Xmax. Consider the five-number summary for the previous P/E ratios example. Percentiles, Quartiles, and Box Plots: The box plot is based on a five-number summary. These numbers are the minimum, the first, second, and third quartiles, and the maximum value.
37
4.5 Percentiles, Quartiles, and Box Plots
Chapter 4 Box Plots The box plot is displayed visually, like this. Percentiles, Quartiles, and Box Plots: Using the five numbers, a box plot can be displayed. The slide shows an example of a box plot. A box plot shows variability and shape. 37
38
4.5 Percentiles, Quartiles, and Box Plots
Chapter 4 Box Plots Percentiles, Quartiles, and Box Plots: Slide shows the shapes of different distributions and the associated box plots. 38
39
4.5 Percentiles, Quartiles, and Box Plots Chapter 4 LO4-8
Box Plots: Fences and Unusual Data Values. Use quartiles to detect unusual data points. The cutoff points are called fences and can be found using the following formulas:
Lower fence: inner Q1 – 1.5 (Q3 – Q1); outer Q1 – 3.0 (Q3 – Q1)
Upper fence: inner Q3 + 1.5 (Q3 – Q1); outer Q3 + 3.0 (Q3 – Q1)
Values outside the inner fences are unusual while those outside the outer fences are outliers. Percentiles, Quartiles, and Box Plots: We can use quartiles to determine unusual data values and outliers by computing the lower and upper fence limits.
40
4.5 Percentiles, Quartiles, and Box Plots Chapter 4 LO4-8
Box Plots: Fences and Unusual Data Values. For example, consider the J.D. Power quality data (Q1 = 107, Q3 = 126):
Lower fence: inner 107 – 1.5 (126 – 107) = 78.5; outer 107 – 3.0 (126 – 107) = 50
Upper fence: inner 126 + 1.5 (126 – 107) = 154.5; outer 126 + 3.0 (126 – 107) = 183
One value (170) lies above the upper inner fence (154.5) and is flagged as an outlier. No values exceed the upper outer fence (183), so there are no extreme outliers. Percentiles, Quartiles, and Box Plots: Example demonstrating the inner and outer fences.
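The fence arithmetic in this example can be reproduced with a short sketch. The function names and the neutral labels are hypothetical; the quartiles Q1 = 107 and Q3 = 126 and the value 170 are taken from the example.

```python
def fences(q1, q3):
    """Inner (1.5 x IQR) and outer (3.0 x IQR) fences as (lower, upper) pairs."""
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }

def position(x, q1, q3):
    """Say where a value falls relative to the inner and outer fences."""
    f = fences(q1, q3)
    if x < f["outer"][0] or x > f["outer"][1]:
        return "beyond outer fence"
    if x < f["inner"][0] or x > f["inner"][1]:
        return "beyond inner fence"
    return "inside fences"

print(fences(107, 126))          # {'inner': (78.5, 154.5), 'outer': (50.0, 183.0)}
print(position(170, 107, 126))   # beyond inner fence
```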
41
4.5 Percentiles, Quartiles, and Box Plots Chapter 4 LO4-8
Box Plots: Fences and Unusual Data Values. Truncate the whisker at the fences and display unusual values and outliers as dots. Based on these fences, there is only one outlier. Percentiles, Quartiles, and Box Plots: Figure 4.29 shows the box plot and the outlier.
42
4.5 Percentiles, Quartiles, and Box Plots
Chapter 4 Box Plots: Midhinge The midhinge is the average of the first and third quartiles: Midhinge = (Q1 + Q3)/2. The name midhinge derives from the idea that, if the “box” were folded in half, it would resemble a “hinge”. Percentiles, Quartiles, and Box Plots: The midhinge in a box plot is the average of the first and third quartiles.
43
4.6 Correlation and Covariance
LO4-9 Chapter 4 LO4-9: Calculate and interpret a correlation coefficient and covariance. Correlation Coefficient The sample correlation coefficient is a statistic that describes the degree of linearity between paired observations on two quantitative variables X and Y. Correlation and Covariance: The sample correlation coefficient quantifies the strength of the linear association between bivariate variables. It also gives us the direction of the association depending on the sign of the computed value. Note: -1 ≤ r ≤ +1. 43
44
4.6 Correlation and Covariance
LO4-9 Chapter 4 Correlation Coefficient Illustration of Correlation Coefficients Correlation and Covariance: Figure 4.33 shows examples of correlation and the corresponding scatter plots. 44
45
4.6 Correlation and Covariance
LO4-9 Chapter 4 Covariance The covariance of two random variables X and Y (denoted σXY ) measures the degree to which the values of X and Y change together. Correlation and Covariance: Covariance measures the degree to which values in bivariate data change together. 45
46
4.6 Correlation and Covariance
LO4-9 Chapter 4 Covariance A correlation coefficient is the covariance divided by the product of the standard deviations of X and Y: r = sXY/(sX · sY). Correlation and Covariance: Slide shows the relationship between the correlation coefficient and the covariance.
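The relationship between covariance and correlation can be made concrete with a small sketch. The function names and paired data are hypothetical, and the sample (n - 1) versions of the covariance and standard deviations are assumed.

```python
import math

def sample_covariance(xs, ys):
    """Average cross-product of deviations from the means, using n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def sample_correlation(xs, ys):
    """Covariance divided by the product of the two sample standard deviations."""
    s_x = math.sqrt(sample_covariance(xs, xs))   # covariance of X with itself is its variance
    s_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (s_x * s_y)

xs = [1, 2, 3, 4, 5]          # hypothetical paired observations
ys = [2, 4, 5, 4, 5]
print(sample_covariance(xs, ys))              # 1.5
print(round(sample_correlation(xs, ys), 4))   # 0.7746
```

Unlike the covariance, the correlation is unit-free and always falls between -1 and +1, which makes it easier to interpret.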
47
4.7 Grouped Data Chapter 4 LO4-10
LO4-10: Calculate the mean and standard deviation from grouped data. Weighted Mean Group Mean and Standard Deviation Grouped Data: We can compute statistics for grouped data. However these will just be estimates because we use the midpoints of the intervals (groups). Slide shows the formula used to compute the group mean. 47
48
4.7 Grouped Data Chapter 4 LO4-10 Group Mean and Standard Deviation
Group Data: Slide shows the formula used to compute the group standard deviation. 48
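Since the slide formulas are not reproduced in this transcript, the sketch below assumes the usual grouped-data estimates: each interval is represented by its midpoint m with frequency f, the mean is Σ f·m / n, and the standard deviation is √[Σ f·(m - mean)² / (n - 1)]. The function name, intervals, and frequencies are hypothetical.

```python
import math

def grouped_mean_std(intervals, freqs):
    """Estimate mean and standard deviation from class intervals and frequencies."""
    mids = [(low + high) / 2 for low, high in intervals]   # midpoint stands in for each group
    n = sum(freqs)
    mean = sum(f * m for f, m in zip(freqs, mids)) / n
    variance = sum(f * (m - mean) ** 2 for f, m in zip(freqs, mids)) / (n - 1)
    return mean, math.sqrt(variance)

intervals = [(0, 10), (10, 20), (20, 30)]   # hypothetical class limits
freqs = [2, 5, 3]                           # hypothetical frequencies, n = 10
mean, std = grouped_mean_std(intervals, freqs)
print(mean)             # 16.0
print(round(std, 3))    # 7.379
```

These are estimates only: every observation in a group is treated as if it sat at the group's midpoint.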
49
4.8 Skewness and Kurtosis Chapter 4 LO4-11
LO4-11: Assess skewness and kurtosis in a sample. Skewness Skewness and Kurtosis: Slide shows the formula used to compute the skewness of a distribution. 49
50
4.8 Skewness and Kurtosis Chapter 4 LO4-11
LO4-11: Assess skewness and kurtosis in a sample. Kurtosis Skewness and Kurtosis: Slide shows the formula used to compute the kurtosis for a distribution of values. Figure 4.37 displays the three categories of kurtosis. 50
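The slide formulas for skewness and kurtosis are not reproduced in this transcript. As a stand-in, the sketch below uses the adjusted sample formulas that Excel's SKEW and KURT functions are documented to use; that these match the slide is an assumption. The function names and data are hypothetical.

```python
import math

def _standardized(values):
    """Deviations from the mean divided by the sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return [(x - mean) / s for x in values]

def sample_skewness(values):
    """Adjusted skewness: n / ((n-1)(n-2)) times the sum of cubed z-scores (needs n >= 3)."""
    n = len(values)
    z = _standardized(values)
    return n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)

def sample_kurtosis(values):
    """Adjusted excess kurtosis, where 0 corresponds to a normal shape (needs n >= 4)."""
    n = len(values)
    z = _standardized(values)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * sum(v ** 4 for v in z) - correction

symmetric = [1, 2, 3, 4, 5]        # hypothetical symmetric data
right_skewed = [1, 1, 1, 2, 10]    # hypothetical data with a long right tail
print(sample_skewness(symmetric))            # 0.0
print(round(sample_kurtosis(symmetric), 4))  # -1.2
print(sample_skewness(right_skewed) > 0)     # True
```

Positive skewness indicates a longer right tail; negative excess kurtosis indicates a flatter shape with lighter tails than the normal distribution.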