Chapter 3: Descriptive Statistics
Learning Objectives
LO1 Apply various measures of central tendency, including the mean, median, and mode, to a set of ungrouped data.
LO2 Apply various measures of variability, including the range, interquartile range, mean absolute deviation, variance, and standard deviation (using the empirical rule and Chebyshev's theorem), to a set of ungrouped data.
LO3 Compute the mean, median, mode, standard deviation, and variance of grouped data.
LO4 Describe a data distribution statistically and graphically using skewness, kurtosis, and box-and-whisker plots.
LO5 Use computer packages to compute various measures of central tendency, variation, and shape on a set of data, as well as to describe the data distribution graphically.
Measures of Central Tendency: Ungrouped Data
Ungrouped data are raw values that have not been summarized by statistical techniques. Measures of central tendency reveal information about the values at the center, or middle part, of a group of numbers (or ordered array). The common measures covered here are the mean, median, mode, percentiles, and quartiles.
The Arithmetic Mean
The arithmetic mean is commonly called "the mean". It is the average of a group of numbers. It is applicable to interval and ratio data, and not to nominal or ordinal data. The mean is computed by summing all values in the data set and dividing the sum by the number of values. Its value is therefore affected by each value in the data set, including extreme values.
Application of Arithmetic Mean in Statistics
The mean serves as a summary statistic of central tendency for data produced by business and economic processes. In these settings it is important to distinguish between the population mean, µ, and the sample mean, x̄. The population mean is based on all of the values within the population. The sample mean uses only some of the values within a population.
Computing Population Mean
Suppose a company has five departments with 24, 13, 19, 26, and 11 workers. The population mean number of workers per department is 18.6. The computations follow:
µ = Σx / N = (24 + 13 + 19 + 26 + 11) / 5 = 93 / 5 = 18.6
Computing Sample Mean
The calculation of a sample mean uses the same algorithm as for a population mean and produces the same answer if computed on the same data. However, separate symbols are used: µ for the population mean and x̄ for the sample mean. Given the following set of numbers: 57, 86, 42, 38, 90, and 66, the sample mean is 63.167. The computations follow:
x̄ = Σx / n = (57 + 86 + 42 + 38 + 90 + 66) / 6 = 379 / 6 = 63.167
Impact of Extreme Values on the Mean
The mean is the most commonly used measure of central tendency because of its mathematical properties and because it uses all the data points in the data set. However, the mean is affected by extremely large or extremely small numbers. In the sample mean example, if the largest number, 90, is replaced by 1,000, the mean becomes 214.833 as opposed to 63.167. If the smallest number, 38, is replaced by 5, the mean becomes 57.667 as opposed to 63.167. Extreme values can significantly distort the mean.
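The effect described on this slide can be checked with a minimal Python sketch. The helper name `replace_value` is an assumption introduced here for illustration; the data are the six sample values from the slides.

```python
from statistics import mean

# Sample values from the "Computing Sample Mean" slide.
sample = [57, 86, 42, 38, 90, 66]

def replace_value(values, old, new):
    """Return a copy of values with the first occurrence of old replaced by new."""
    copy = list(values)
    copy[copy.index(old)] = new
    return copy

base_mean = mean(sample)                                   # 379 / 6
high_outlier_mean = mean(replace_value(sample, 90, 1000))  # largest value replaced
low_outlier_mean = mean(replace_value(sample, 38, 5))      # smallest value replaced

print(round(base_mean, 3))          # 63.167
print(round(high_outlier_mean, 3))  # 214.833
print(round(low_outlier_mean, 3))   # 57.667
```

A single replaced value moves the mean from about 63 to about 215, which is why the median is often reported alongside the mean.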
The Median
The median is the middle value in an ordered array of numbers. It applies to ordinal, interval, and ratio data. Advantage: the median is unaffected by extremely large and extremely small values in the data set. Disadvantage: not all the information from the numbers is used.
Computing the Median
First Step: Arrange the observations in an ordered array.
Second Step: For an array with an odd number of terms, the median is the middle number.
Third Step: For an array with an even number of terms, the median is the average of the two middle numbers.
Locating the Median: The median's location in an ordered array is found by (n + 1)/2.
Median Example with an Odd Number of Data Let X be an ordered array such that X has the following values: 3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22 There are 17 values in the ordered array Position of median = (n+1)/2 = (17+1)/2 = 9th position Counting from left to right to the 9th position, the median is 15 Advantage - extreme values do not distort the median Note that if 22 (maximum value) is replaced by 100, the median is still 15 If 3 (minimum value) is replaced by -103, the median is still 15
Median Example with an Even Number of Data Let X be an ordered array such that X assumes the following values: 3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21 There are 16 values in the ordered array Position of median = (n+1)/2 = (16+1)/2 = 8.5th position The median is a value between the 8th and 9th observations in the ordered array. The median is 14 + 0.5(15-14) = 14.5 or simply, (14+15)/2 =14.5 Advantage - extreme values do not distort the median If 21 (maximum value) is replaced by 100, the median is still 14.5 If 3 (minimum value) is replaced by -88, the median is still 14.5
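The two median examples can be reproduced with a short Python sketch of the (n + 1)/2 location rule. The function name `median_by_location` is an assumption for illustration, and the sketch assumes a non-empty list.

```python
def median_by_location(values):
    """Median using the (n + 1) / 2 location rule; assumes values is non-empty."""
    ordered = sorted(values)
    n = len(ordered)
    location = (n + 1) / 2           # 1-based position in the ordered array
    lower = ordered[int(location) - 1]
    if location == int(location):    # odd n: the location is a whole number
        return lower
    upper = ordered[int(location)]   # even n: average the two middle values
    return (lower + upper) / 2

odd = [3, 4, 5, 7, 8, 9, 11, 14, 15, 16, 16, 17, 19, 19, 20, 21, 22]
even = odd[:-1]                      # the 16-value array from the second example

print(median_by_location(odd))                 # 15
print(median_by_location(even))                # 14.5
print(median_by_location(odd[:-1] + [100]))    # 15, maximum replaced by 100
print(median_by_location([-103] + odd[1:]))    # 15, minimum replaced by -103
```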
The Mode
The mode is the value that occurs most frequently in an array of data. The mode applies to all levels of data measurement: nominal, ordinal, interval, and ratio.
Unimodal: describes data sets with a single mode.
Bimodal: describes data sets that have two modes.
Multimodal: describes data sets that contain more than two modes.
Example of the Mode Organizing the data into an ordered array helps to locate the mode The arrangement of the numbers represents an ordered array 44 is the value that occurs most frequently (occurs 5 times). The mode is 44
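As a minimal sketch, the mode can be found with `statistics.multimode` from the Python standard library (available from Python 3.8). The three data sets below are hypothetical and invented for illustration only; they are not the data behind the example on this slide.

```python
from statistics import multimode

# Hypothetical data sets, invented for illustration only.
unimodal = [41, 44, 44, 44, 47, 52]
bimodal = [1, 2, 2, 3, 4, 4, 5]
nominal = ["red", "blue", "red", "green"]   # the mode also works for nominal data

print(multimode(unimodal))  # [44]
print(multimode(bimodal))   # [2, 4]
print(multimode(nominal))   # ['red']
```

`multimode` returns every value tied for the highest frequency, so the length of the result distinguishes unimodal, bimodal, and multimodal data.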
Percentiles
Percentiles are measures of central tendency that divide a group of data into 100 parts. The nth percentile is the value such that at least n percent of the data are below that value and at most (100 - n) percent are above that value. For example, if a plant operator takes a safety examination and 87.6% of the safety exam scores are below that person's score, he or she still scores at only the 87th percentile, even though more than 87% of the scores are lower. The median is the 50th percentile.
Percentiles Percentiles are stair step values: for example, the 87th and 88th percentile have no values between them Percentile methods are applicable for ordinal, interval, and ratio data and are not applicable for nominal data In general percentiles are not influenced by extreme values in the data set
Steps in Determining the Location of the Percentile
1. Organize the data into ascending order.
2. Calculate the percentile location (i) using: i = (P / 100) n, where P = percentile, i = percentile location, and n = number in the data set.
3. Determine the location. If i is a whole number, the Pth percentile is the average of the value at the ith location and the value at the (i + 1)th location. If i is not a whole number, the Pth percentile value is located at the whole-number part of i + 1.
Calculating Percentiles: An Example
Raw Data: 14, 12, 19, 23, 5, 13, 28, 17
Ordered Array: 5, 12, 13, 14, 17, 19, 23, 28
Problem: Find the 30th percentile.
Number of observations: n = 8
Location of 30th percentile: i = (30 / 100)(8) = 2.4
The location index, i, is not a whole number. Therefore put the location at the whole-number portion of i + 1 = 2.4 + 1 = 3.4, which is 3. The 30th percentile is at the 3rd location of the array: 30th percentile = 13.
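The location rule can be sketched in Python as follows. The function name `percentile` is an assumption for illustration, and the sketch assumes P is an integer with 0 < P < 100. It uses integer arithmetic for i = (P/100)n so that the "is i a whole number?" test is not affected by floating-point rounding. Statistical packages often use other interpolation rules, so their results may differ slightly.

```python
def percentile(values, p):
    """Pth percentile by the slide's location rule.

    Assumes p is an integer with 0 < p < 100 and values is non-empty.
    """
    ordered = sorted(values)
    n = len(ordered)
    # i = (p / 100) * n, split into its whole part and remainder.
    whole, remainder = divmod(p * n, 100)
    if remainder == 0:
        # i is whole: average the values at locations i and i + 1 (1-based).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # i is not whole: take the value at the whole-number part of i + 1.
    return ordered[whole]

raw = [14, 12, 19, 23, 5, 13, 28, 17]
print(percentile(raw, 30))  # 13
print(percentile(raw, 50))  # 15.5, the median of the eight values
```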
Quartiles
Quartiles are measures of central tendency that divide a group of data into four subgroups or parts.
Q1: 25% of the data set is below the first quartile.
Q2: 50% of the data set is below the second quartile.
Q3: 75% of the data set is below the third quartile.
Relationship between quartiles and percentiles: Q1 is equal to the 25th percentile, Q2 is located at the 50th percentile and equals the median, and Q3 is equal to the 75th percentile. Quartile values are not necessarily members of the data set.
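Calculating Quartiles: An Example
Let X be an ordered array: X = {106, 109, 114, 116, 121, 122, 125, 129}, so n = 8. Then
Q1: i = (25 / 100)(8) = 2, so Q1 = (109 + 114) / 2 = 111.5
Q2: i = (50 / 100)(8) = 4, so Q2 = (116 + 121) / 2 = 118.5
Q3: i = (75 / 100)(8) = 6, so Q3 = (122 + 125) / 2 = 123.5
Note that when i is a whole number the quartile is the average of the ith and (i + 1)th values in the ordered set.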
Measures of Variability: Ungrouped Data Measures of variability are used to describe the spread or dispersion of data By using variability with measures of central tendency, the result is a more complete description of data Measures of variability for ungrouped data include: range, interquartile range, mean absolute deviation, variance, standard deviation, z scores and coefficient of variation
Measures of Variability: Ungrouped Data
Measures of variability describe how dispersed (spread out) a set of data is, or conversely how tightly it clusters. Dispersion describes how far the data lie from the mean. Convergence describes how closely the data gather around the mean. Variability is most frequently expressed in terms of deviation from the mean. The images in the next slides express this visually.
Variability
[Figure: cash flows plotted around the mean. One panel shows no variability in cash flow (same amounts); the other shows variability in cash flow (different amounts).]
Variability
[Figure: no variability versus variability.]
Range
The range is the difference between the largest and smallest values in the data set.
Advantage: simple to compute.
Disadvantages: ignores all data points except the two extremes; is influenced by extreme values; has no reference point; has limited use by itself.
Example of range using data provided:
Interquartile Range
Interquartile Range = Q3 - Q1
The interquartile range contains all values in the interval between the first and third quartiles, so it accounts for the middle 50% of values in the ordered data set. It is especially useful where data users are more interested in values toward the middle and less interested in extremes, because it is less influenced by extremes than the range.
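As a sketch, the range and interquartile range can be computed for the eight-value array from the quartile example (106, 109, 114, 116, 121, 122, 125, 129). The helper `quartile` is an assumption introduced here; it applies the slides' percentile location rule with P = 25, 50, or 75.

```python
def quartile(values, q):
    """q-th quartile (q = 1, 2, or 3) via the location rule i = (25q / 100) * n."""
    ordered = sorted(values)
    whole, remainder = divmod(25 * q * len(ordered), 100)
    if remainder == 0:
        # i is whole: average the ith and (i + 1)th values (1-based).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # i is not whole: value at the whole-number part of i + 1.
    return ordered[whole]

x = [106, 109, 114, 116, 121, 122, 125, 129]
q1, q2, q3 = quartile(x, 1), quartile(x, 2), quartile(x, 3)
data_range = max(x) - min(x)   # largest minus smallest
iqr = q3 - q1                  # spread of the middle 50%

print(q1, q2, q3)    # 111.5 118.5 123.5
print(data_range)    # 23
print(iqr)           # 12.0
```

Replacing 129 with a much larger value would change the range but leave the interquartile range at 12.0.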
Deviation from the Mean
An examination of deviations from the mean can reveal information about the variability of data. However, the individual deviations are used mostly as a tool to compute other measures of variability. Example: the data set 5, 9, 16, 17, 18 has a mean of µ = 13. The deviations (x - µ) show the distance of each value from the mean: -8, -4, 3, 4, 5. Note that the deviations sum to zero.
Mean Absolute Deviation
The mean absolute deviation (MAD) is the average of the absolute values of the deviations from the mean. It is easy to calculate but is less useful in further statistical work than the variance and standard deviation. Example, using the data set 5, 9, 16, 17, 18 with µ = 13:
MAD = Σ|x - µ| / N = (8 + 4 + 3 + 4 + 5) / 5 = 24 / 5 = 4.8
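A minimal Python sketch of the mean absolute deviation for the data set 5, 9, 16, 17, 18 from the deviation slide. It treats the five values as a population, which is an assumption consistent with the use of µ on that slide.

```python
data = [5, 9, 16, 17, 18]

mu = sum(data) / len(data)               # population mean, 13.0
deviations = [x - mu for x in data]      # -8, -4, 3, 4, 5
mad = sum(abs(d) for d in deviations) / len(data)

print(mu)                # 13.0
print(sum(deviations))   # 0.0, raw deviations always cancel out
print(mad)               # 4.8
```

Because the raw deviations sum to zero, the absolute value (or, for the variance, the square) is needed to obtain a usable measure of spread.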
Population Variance
The population variance is the sum of the squared deviations from the mean divided by the number of observations: σ² = Σ(x - µ)² / N. Variance is expressed in squared units of measurement. Squared units are hard to interpret, so the variance is typically used as a step toward obtaining the standard deviation of a data set.
Example of Population Variance
Using the x values 5, 9, 16, 17, 18 from the deviation example (µ = 13), the solution is 26.0 units squared:
σ² = Σ(x - µ)² / N = (64 + 16 + 9 + 16 + 25) / 5 = 130 / 5 = 26.0
Population Standard Deviation Square root of the population variance Easier to interpret in practice than the variance Measures the dispersion of the population data from the mean
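As a sketch, the population variance and standard deviation can be computed directly from the definition. The data set 5, 9, 16, 17, 18 is taken from the deviation slide; using it here is an assumption, although it reproduces the 26.0 reported in the population variance example. The standard-library functions `pvariance` and `pstdev` are included as a cross-check.

```python
from math import sqrt
from statistics import pstdev, pvariance

data = [5, 9, 16, 17, 18]
N = len(data)

mu = sum(data) / N
variance = sum((x - mu) ** 2 for x in data) / N   # divide by N for a population
std_dev = sqrt(variance)

print(variance)            # 26.0
print(round(std_dev, 2))   # 5.1
print(pvariance(data), round(pstdev(data), 2))
```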
Example of Sample Variance Sample variances are also expressed as units squared. For example:
Example of Sample Standard Deviation The sample standard deviation is the square root of the sample variance Easier to interpret in practice than square units Sample standard deviation is used as a good estimator of the population standard deviation
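A short sketch of the sample variance and sample standard deviation. The data are the six values from the sample mean slide (57, 86, 42, 38, 90, 66); applying them here is an assumption, since the worked sample variance example did not survive on this slide. The only change from the population formula is the divisor, n - 1 instead of N.

```python
from math import sqrt
from statistics import stdev, variance

sample = [57, 86, 42, 38, 90, 66]
n = len(sample)

x_bar = sum(sample) / n
sum_sq_dev = sum((x - x_bar) ** 2 for x in sample)

s2 = sum_sq_dev / (n - 1)   # sample variance: divide by n - 1
s = sqrt(s2)                # sample standard deviation

print(round(s2, 2), round(s, 2))
print(round(variance(sample), 2), round(stdev(sample), 2))  # library cross-check
```

Dividing by n - 1 rather than n makes the sample variance an unbiased estimator of the population variance.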
Standard Deviation
The standard deviation is the square root of the variance. The standard deviation of a population is denoted by σ. The standard deviation of a sample is denoted by s.
Uses of Standard Deviation
Indicator of financial risk.
Quality control: construction of quality control charts; process capability studies.
Comparing two or more populations: household incomes in two cities; employee absenteeism at two plants.
Used as a percentage of the mean: the coefficient of variation (CV).
Standard Deviation as an Indicator of Financial Risk
Symmetric and Asymmetric Distributions
Data are either symmetric or non-symmetric with respect to some measure of central tendency. Statisticians have observed that distributions describing many types of business and economic data tend to be symmetric or approximately normal in shape. For such data, the concentration of values around the mean follows well-known approximate proportions (the empirical rule). For any distribution, symmetric or not, the concentration of values around the mean satisfies at least a guaranteed minimum (Chebyshev's theorem).
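Empirical Rule
When data are normally distributed or approximately normal, the following approximate percentages of values lie within the stated distance from the mean:
µ ± 1σ   about 68%
µ ± 2σ   about 95%
µ ± 3σ   about 99.7%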
Chebyshev's Theorem
Used when data are not normally distributed or are non-symmetric. Chebyshev's theorem applies to all distributions. It gives the minimum proportion of the data that lies within a specified number of standard deviations of the mean.
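Chebyshev's Theorem: Number of Standard Deviations
A general theorem applying to all distributions: at least 1 - 1/k² of the values lie within k standard deviations of the mean, for k > 1. At k = 1 the bound is 0, so it gives no information. Calculations for k = 2, 3, 4:
k   Distance from the Mean   Minimum Proportion of Values Falling within Distance from the Mean
2   µ ± 2σ                   1 - 1/2² = 0.75
3   µ ± 3σ                   1 - 1/3² = 0.89
4   µ ± 4σ                   1 - 1/4² = 0.94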
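The proportions in the table follow from the bound 1 - 1/k², which can be sketched in Python. The function name `chebyshev_min_proportion` is an assumption for illustration.

```python
def chebyshev_min_proportion(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        # For k <= 1 the bound is zero or negative and says nothing useful.
        raise ValueError("Chebyshev's bound is only informative for k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3, 4):
    print(k, round(chebyshev_min_proportion(k), 2))
# 2 0.75
# 3 0.89
# 4 0.94
```

The bound also works for non-integer k, for example k = 1.5 gives a minimum of about 0.56. For approximately normal data the empirical rule gives much tighter figures than these guaranteed minimums.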
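Z Scores
The z score represents the number of standard deviations a value (x) is above or below the mean. It translates a raw value into standard deviation units. A z score can be computed for any data set, but it is most often interpreted with the empirical rule when the data are approximately normally distributed.
Z score formula: z = (x - µ) / σ for a population, or z = (x - x̄) / s for a sample.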
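A minimal sketch of the z score calculation, z = (x - µ)/σ, applied to the population 5, 9, 16, 17, 18 from the deviation slide. Choosing that data set is an assumption made here for illustration.

```python
from math import sqrt

data = [5, 9, 16, 17, 18]
N = len(data)

mu = sum(data) / N
sigma = sqrt(sum((x - mu) ** 2 for x in data) / N)   # population standard deviation

z_scores = [(x - mu) / sigma for x in data]
print([round(z, 2) for z in z_scores])   # [-1.57, -0.78, 0.59, 0.78, 0.98]
```

A negative z score marks a value below the mean and a positive one a value above it. The z scores of a complete data set always have mean 0 and standard deviation 1.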
Coefficient of Variation
The ratio of the standard deviation to the mean, expressed as a percentage. It is a measure of relative dispersion:
CV = (σ / µ)(100)
Examples of Coefficient of Variation
µ1 = 29, σ1 = 4.6: CV1 = (σ1 / µ1)(100) = (4.6 / 29)(100) = 15.86
µ2 = 84, σ2 = 10: CV2 = (σ2 / µ2)(100) = (10 / 84)(100) = 11.90
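The figures on this slide (a mean of 29 with a standard deviation of 4.6, and a mean of 84 with a standard deviation of 10) can be checked with a short sketch. The function name `coefficient_of_variation` is an assumption for illustration.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean; assumes mean is non-zero."""
    return std_dev / mean * 100

cv1 = coefficient_of_variation(4.6, 29)
cv2 = coefficient_of_variation(10, 84)

print(round(cv1, 2))  # 15.86
print(round(cv2, 2))  # 11.9
```

Although the second data set has the larger standard deviation (10 versus 4.6), its variability relative to its mean is smaller, which is the comparison the coefficient of variation is designed to make.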
Measures of Central Tendency and Variability: Grouped Data Mean Median Mode Measures of Variability Variance Standard Deviation
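Mean of Grouped Data
The grouped mean is a weighted average of the class midpoints, with the class frequencies as the weights.
Mean of grouped data: µ = Σ f M / N, where f = class frequency, M = class midpoint, and N = Σ f = total number of observations.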
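A minimal sketch of the grouped mean as a frequency-weighted average of class midpoints. The frequency distribution below is hypothetical, invented for illustration, and is not necessarily the one used in the worked example on the next slide.

```python
# Hypothetical frequency distribution: (lower bound, upper bound, frequency).
# Each class runs from lower up to, but not including, upper.
classes = [
    (1, 3, 4),
    (3, 5, 12),
    (5, 7, 13),
    (7, 9, 19),
    (9, 11, 7),
    (11, 13, 5),
]

midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]
frequencies = [f for _, _, f in classes]

N = sum(frequencies)                                          # total observations
sum_fM = sum(f * m for f, m in zip(frequencies, midpoints))   # sum of f * M
grouped_mean = sum_fM / N

print(N, sum_fM)               # 60 416.0
print(round(grouped_mean, 2))  # 6.93
```

The result is an approximation of the mean of the raw data, because every observation in a class is treated as if it sat at the class midpoint.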
Example Calculation of Grouped Mean
Median of Grouped Data
Calculating the Median of Grouped Data
Estimating the Mode from Grouped Data
The modal class is the class interval with the greatest frequency (7 to under 9 for the example below). The mode for the grouped data is the class midpoint of the modal class: Mode = 8 for the example below.
Variance and Standard Deviation from Grouped Data
Population Variance and Standard Deviation of Grouped Data
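As a hedged sketch, the population variance of grouped data can be computed by applying the ungrouped definition to the class midpoints, weighted by the class frequencies: σ² = Σ f (M - µ)² / N. This standard form is assumed here because the formula did not survive on the slide. The frequency distribution is hypothetical, invented for illustration.

```python
from math import sqrt

# Hypothetical grouped data: class midpoints M and class frequencies f.
midpoints = [2, 4, 6, 8, 10, 12]
frequencies = [4, 12, 13, 19, 7, 5]

N = sum(frequencies)
mu = sum(f * m for f, m in zip(frequencies, midpoints)) / N   # grouped mean

# Weighted sum of squared deviations of the midpoints from the grouped mean.
sum_sq = sum(f * (m - mu) ** 2 for f, m in zip(frequencies, midpoints))
variance = sum_sq / N          # population form: divide by N
std_dev = sqrt(variance)

print(round(variance, 3), round(std_dev, 3))
```

For a sample, the same sum of squares would be divided by n - 1 instead of N.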
Descriptions and Measures of Shape
Skewness: absence of symmetry; presence of extreme values on one side of a distribution.
Kurtosis: peakedness of a distribution.
  Leptokurtic: high and thin peak.
  Mesokurtic: normal or mound-shaped top.
  Platykurtic: flat-topped and spread out.
Box-and-Whisker Plots: graphic display of a distribution using five summary statistics; reveals skewness and data location or clustering.
Probability Distributions Showing Symmetry and Skewness Symmetrical Right or Positively Skewed Left or Negatively Skewed
Symmetrical Shape Frequency Histogram Showing Relationship of Mean, Median and Mode
Coefficient of Skewness
A summary measure for skewness based on the relationship of the mean to the median and the variation in the data: Sk = 3(µ - Md) / σ, where Md is the median.
If Sk < 0, the distribution is negatively skewed (skewed to the left).
If Sk = 0, the distribution is symmetric (not skewed).
If Sk > 0, the distribution is positively skewed (skewed to the right).
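A small sketch of the coefficient of skewness, assuming the Pearsonian form Sk = 3(µ - Md)/σ, which matches the description "relationship of mean to median and the variation in the data". The data set 5, 9, 16, 17, 18 is reused from the deviation slide as an illustration.

```python
from math import sqrt
from statistics import median

data = [5, 9, 16, 17, 18]
N = len(data)

mu = sum(data) / N                                   # 13.0
md = median(data)                                    # 16
sigma = sqrt(sum((x - mu) ** 2 for x in data) / N)   # population standard deviation

sk = 3 * (mu - md) / sigma

if sk < 0:
    shape = "negatively skewed (skewed to the left)"
elif sk > 0:
    shape = "positively skewed (skewed to the right)"
else:
    shape = "symmetric (not skewed)"

print(round(sk, 3), shape)
```

Here the mean (13) lies below the median (16) because the low values 5 and 9 pull it down, so the coefficient is negative.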
Effect of Changes in Mean on the Coefficient of Skewness
Types of Kurtosis
Requirements for a Box-and-Whisker Plot
Five specific numbers are used:
  Median, Q2
  First quartile, Q1
  Third quartile, Q3
  Minimum value in the data set
  Maximum value in the data set
Inner fences: first indicators of extreme values
  IQR = Q3 - Q1
  Lower inner fence = Q1 - 1.5 IQR
  Upper inner fence = Q3 + 1.5 IQR
Outer fences: strong indicators of extreme values
  Lower outer fence = Q1 - 3.0 IQR
  Upper outer fence = Q3 + 3.0 IQR
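The fence calculations can be sketched in Python. The example uses Q1 = 111.5 and Q3 = 123.5, which are the quartiles obtained by applying the slides' location rule to the array 106, 109, 114, 116, 121, 122, 125, 129. The function names and the labels "mild outlier" and "extreme outlier" are assumptions introduced for illustration, and the test values 150 and 170 are hypothetical.

```python
def fences(q1, q3):
    """Return the inner and outer fences as (lower, upper) pairs."""
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }

def classify(value, q1, q3):
    """Label a value by where it falls relative to the fences."""
    limits = fences(q1, q3)
    lower_inner, upper_inner = limits["inner"]
    lower_outer, upper_outer = limits["outer"]
    if value < lower_outer or value > upper_outer:
        return "extreme outlier"   # beyond an outer fence
    if value < lower_inner or value > upper_inner:
        return "mild outlier"      # between an inner and an outer fence
    return "within fences"

q1, q3 = 111.5, 123.5
print(fences(q1, q3))   # {'inner': (93.5, 141.5), 'outer': (75.5, 159.5)}

for value in (129, 150, 170):   # 150 and 170 are hypothetical values
    print(value, classify(value, q1, q3))
# 129 within fences
# 150 mild outlier
# 170 extreme outlier
```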
Skewness and the Box Plot
A box-and-whisker plot can be used to judge the skewness of a distribution. The location of the median in the box indicates the skewness of the middle 50% of the data. If the median is located on the right side of the box, the middle 50% are skewed to the left. If the median is on the left side, the middle 50% are skewed to the right. A researcher can also make a judgment about skewness based on the length of the whiskers: if the longest whisker is to the right of the box, the outer data are skewed to the right, and vice versa. See the box-and-whisker plot in the next slide.
Box and Whisker Plot
COPYRIGHT Copyright © 2014 John Wiley & Sons Canada, Ltd. All rights reserved. Reproduction or translation of this work beyond that permitted by Access Copyright (The Canadian Copyright Licensing Agency) is unlawful. Requests for further information should be addressed to the Permissions Department, John Wiley & Sons Canada, Ltd. The purchaser may make back-up copies for his or her own use only and not for distribution or resale. The author and the publisher assume no responsibility for errors, omissions, or damages caused by the use of these programs or from the use of the information contained herein.