Data Analysis and Statistical Software I Quarter: Spring 2003 Daniela Stan Raicu School of CTI, DePaul University 1/18/2019 Daniela Stan - CSC323
Outline
- Describing distributions with numbers (continued from the previous lecture)
- The 1.5 × IQR criterion for suspected outliers
- Measuring spread: the standard deviation
- Normal distribution
- Standard normal distribution
Describing Distributions (cont.)
Measuring spread: the quartiles
The pth percentile of a distribution is the value such that p percent of the observations fall at or below it.
- The 50th percentile = median, M
- The 25th percentile = first quartile, Q1
- The 75th percentile = third quartile, Q3
Describing Distributions (cont.)
To calculate the quartiles:
1. Arrange the observations in increasing order and locate the median M in the list of observations.
2. The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the location of the overall median.
3. The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the location of the overall median.
Example: 1.13
13 16 19 21 21 23 23 24 26 26 27 27 27 28 28 30 30
M=?, Q1=?, Q3=?
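The three steps above can be sketched in Python. This is a minimal illustration rather than the course's own software: the function name `quartiles` is made up here, and the data assume that "1.13" is the example label and that the 17 integers after it are the observations.

```python
from statistics import median

def quartiles(observations):
    """Return (Q1, M, Q3) using the median-of-halves rule described above."""
    ordered = sorted(observations)      # step 1: arrange in increasing order
    n = len(ordered)                    # assumes at least two observations
    half = n // 2
    lower = ordered[:half]              # observations left of the median's location
    upper = ordered[half + n % 2:]      # observations right of the median's location
    return median(lower), median(ordered), median(upper)

# Assumed data: the 17 values listed in the example (one reading of the slide).
values = [13, 16, 19, 21, 21, 23, 23, 24, 26, 26, 27, 27, 27, 28, 28, 30, 30]
q1, m, q3 = quartiles(values)
print("Q1 =", q1, " M =", m, " Q3 =", q3)
```

When n is odd, the median itself is left out of both halves. Other software may use a different quartile rule and report slightly different values.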
Describing Distributions (cont.)
The five-number summary of a set of observations consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest. In symbols, the five-number summary is
Minimum Q1 M Q3 Maximum
A boxplot is a graph of the five-number summary:
- A central box spans the quartiles Q1 and Q3
- A line in the box marks the median M
- Lines extend from the box out to the smallest and largest observations
Weight Data: Sorted
Weight Data: Quartiles
Stemplot (stem: hundreds and tens digits; leaf: ones digit):
10 | 0166
11 | 009
12 | 0034578
13 | 00359
14 | 08
15 | 00257
16 | 555
17 | 000255
18 | 000055567
19 | 245
20 | 3
21 | 025
22 | 0
23 |
24 |
25 |
26 | 0
Q1 = 127.5 (first quartile)
Q2 = 165 (median, or second quartile)
Q3 = 185 (third quartile)
Five-Number Summary
minimum = 100
first quartile = 127.5
second quartile (median) = 165
third quartile = 185
maximum = 260
interquartile range = Q3 - Q1 = 57.5
range = max - min = 160
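As a rough check, the summary can be recomputed from the stemplot on the "Weight Data: Quartiles" slide. The sketch below assumes each stem holds the hundreds and tens digits and each leaf the ones digit; the names `stemplot` and `five_number_summary` are illustrative only.

```python
from statistics import median

# Stemplot from the slide, read as stem = hundreds and tens, leaf = ones (assumption).
stemplot = {
    10: "0166", 11: "009", 12: "0034578", 13: "00359", 14: "08",
    15: "00257", 16: "555", 17: "000255", 18: "000055567", 19: "245",
    20: "3", 21: "025", 22: "0", 23: "", 24: "", 25: "", 26: "0",
}
weights = sorted(10 * stem + int(leaf)
                 for stem, leaves in stemplot.items()
                 for leaf in leaves)

def five_number_summary(ordered):
    """Return (min, Q1, M, Q3, max) for an already sorted list."""
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])            # median of the lower half
    q3 = median(ordered[half + n % 2:])    # median of the upper half
    return ordered[0], q1, median(ordered), q3, ordered[-1]

minimum, q1, m, q3, maximum = five_number_summary(weights)
iqr = q3 - q1
data_range = maximum - minimum
print(minimum, q1, m, q3, maximum, "IQR =", iqr, "range =", data_range)
```

Under that reading of the stemplot, the 53 decoded weights should reproduce the values listed on the slide.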
Five-Number Summary: Boxplot
[Figure: boxplot of the weight data marking min, Q1, M, Q3, and max; horizontal axis: Weight, from 100 to 275]
Recommended Problems
Chapter 1: Section 1.1
IPS web site: http://www.whfreeman.com/ips4e
The 1.5 × IQR criterion
The interquartile range (IQR) is the distance between the first and third quartiles: IQR = Q3 - Q1
The 1.5 × IQR criterion for outliers: an observation is a suspected outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.
Modified boxplot:
- the lines extend out from the central box only to the smallest and largest observations that are not suspected outliers.
- the suspected outliers are plotted as individual points.
The 1.5 × IQR criterion (cont.)
Examples 1.9 (page 14) and 1.17 (page 46)
The 1.5 × IQR criterion (cont.)
[Figure: histogram of the percent of Hispanics in the adult population]
Shape? Skewed to the right, with a single peak at the left.
Outliers? The one state that stands out is New Mexico, with 38.7%.
The 1.5 × IQR criterion (cont.)
The five-number summary is:
Minimum Q1 M Q3 Maximum
0.6 2.0 4.1 7.0 38.7
The 1.5 × IQR criterion for outliers:
IQR = Q3 - Q1 = 7.0 - 2.0 = 5
1.5 × IQR = 7.5
Suspected outlier: any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR
Q1 - 1.5 × IQR = 2.0 - 7.5 = -5.5
Q3 + 1.5 × IQR = 7.0 + 7.5 = 14.5
There are 7 suspected outliers, all above 14.5, since the minimum (0.6) is well above the lower cutoff of -5.5.
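The cutoff arithmetic can be sketched as a small helper. The function names below are hypothetical, and only the quartiles Q1 = 2.0 and Q3 = 7.0 and the two extreme values come from the slide; the full state-by-state data are not reproduced here.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs of the k x IQR criterion."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

low, high = iqr_fences(2.0, 7.0)   # quartiles from the slide

def is_suspected_outlier(x):
    """True if x falls outside the cutoffs computed above."""
    return x < low or x > high

print("cutoffs:", low, high)
print("38.7 flagged?", is_suspected_outlier(38.7))   # the maximum (New Mexico)
print("0.6 flagged?", is_suspected_outlier(0.6))     # the minimum
```

Applied to the full data set, the same test would pick out every observation above the upper cutoff.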
The 1.5 × IQR criterion (cont.)
[Figure: modified boxplot; the points represent the suspected outliers]
Measuring the spread: Variance and Standard Deviation
- If all values are the same, what is the variation in the data?
- Variation exists when some values are above or below the mean.
- Each data value has an associated deviation from the mean: $x_i - \bar{x}$
Deviations and Variance
A deviation: what is a typical deviation from the mean?
- small values of this typical deviation indicate small variation in the data
- large values of this typical deviation indicate large variation in the data
Variance:
1. Find the mean
2. Find the deviation of each value from the mean
3. Square the deviations
4. Sum the squared deviations
5. Divide the sum by n-1
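The five steps for the variance translate almost line for line into Python. The data values below are made up purely for illustration; the result is compared with `statistics.variance` from the standard library, which also divides by n-1.

```python
from statistics import variance

data = [1, 2, 3, 4, 5]                        # made-up values, for illustration only

n = len(data)
mean = sum(data) / n                          # 1. find the mean
deviations = [x - mean for x in data]         # 2. deviation of each value from the mean
squared = [d ** 2 for d in deviations]        # 3. square the deviations
total = sum(squared)                          # 4. sum the squared deviations
s2 = total / (n - 1)                          # 5. divide the sum by n-1

print("deviations:", deviations)
print("variance by hand:", s2, " library:", variance(data))
```

The deviations always sum to zero, which is why they are squared before being added up.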
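Measuring Spread: The standard deviation
The variance $s^2$ of a set of observations $x_1, x_2, \ldots, x_n$ is the average of the squares of the deviations of the observations from their mean:
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$
or, in more compact notation,
$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$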
Measuring Spread: The standard deviation
The standard deviation $s$ is the square root of the variance $s^2$:
$$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
The number n-1 is called the degrees of freedom of the variance or standard deviation.
- When is the standard deviation s equal to zero?
- Is the standard deviation s a resistant measure?
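A short experiment can help with the two questions above. The sketch below computes s as the square root of the variance (sum of squared deviations divided by n-1); the helper `sample_std` and all three data sets are hypothetical, chosen only to contrast identical values, a tight cluster, and a cluster with one extreme value.

```python
import math

def sample_std(values):
    """Square root of the variance, with n-1 in the denominator."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

constant = [5, 5, 5, 5]          # hypothetical: every value identical
typical = [4, 5, 5, 6]           # hypothetical: small spread
with_outlier = [4, 5, 5, 60]     # hypothetical: one extreme value

for name, values in [("constant", constant),
                     ("typical", typical),
                     ("with_outlier", with_outlier)]:
    print(name, round(sample_std(values), 3))
```

Comparing the three printed values suggests how strongly a single extreme observation can pull s.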
The standard deviation (cont.)
Example: Problem 1.59
Choosing measures for center and spread:
- if the distribution is skewed, choose the five-number summary
- if the distribution is symmetric and free of outliers, choose the mean and the standard deviation
Density curves
Sometimes the overall pattern of a large number of observations is so regular that we can describe it by a smooth curve. The curve is a mathematical model for the distribution.
A density curve is a curve that is always on or above the horizontal axis and has area exactly 1 underneath it.
[Figure: histogram of the scores of all 947 seventh-grade students in Gary, Indiana, on the vocabulary part of the Iowa test]
[Figure: a symmetric density curve]
The normal distributions
Normal curves are density curves that are:
- Symmetric
- Unimodal
- Bell-shaped
The normal distributions (cont.)
A normal distribution is specified by:
- Mean $\mu$
- Standard deviation $\sigma$
Notation: $N(\mu, \sigma)$
The equation of the normal curve ($f(x)$ gives the height of the curve at $x$):
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$
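To make the height function concrete, here is a minimal sketch of the standard normal density formula, checked against `statistics.NormalDist` from the Python standard library. The function name `normal_pdf` and the choice of mean 0 and standard deviation 1 are illustrative assumptions; the crude sum at the end only approximates the area under the curve.

```python
import math
from statistics import NormalDist

def normal_pdf(x, mu, sigma):
    """Height of the normal curve with mean mu and standard deviation sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

mu, sigma = 0.0, 1.0                      # illustrative choice
peak = normal_pdf(mu, mu, sigma)          # the curve is highest at the mean

# Crude Riemann sum from mu - 8*sigma to mu + 8*sigma to approximate the total area.
step = sigma / 1000
area = sum(normal_pdf(mu - 8 * sigma + i * step, mu, sigma)
           for i in range(16001)) * step

print("height at the mean:", peak)
print("library value at 1.3:", NormalDist(mu, sigma).pdf(1.3))
print("approximate area:", area)
```

The area coming out very close to 1 is what makes the normal curve a density curve.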
The normal distributions (cont.)
Example of two normal curves specified by their mean and standard deviation
[Figure: two normal curves; vertical axis: f(x)]
Can we locate the standard deviation by eye?
The 68-95-99.7 rule
In the normal distribution $N(\mu, \sigma)$:
- Approximately 68% of the observations are between $\mu - \sigma$ and $\mu + \sigma$
- Approximately 95% of the observations are between $\mu - 2\sigma$ and $\mu + 2\sigma$
- Approximately 99.7% of the observations are between $\mu - 3\sigma$ and $\mu + 3\sigma$
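The three percentages can be checked numerically. The sketch below uses `statistics.NormalDist` with its default mean 0 and standard deviation 1; because any normal distribution can be standardized, the same shares apply to every choice of mean and standard deviation. The dictionary name `within` is illustrative.

```python
from statistics import NormalDist

z = NormalDist()   # standard normal: mean 0, standard deviation 1

# Share of observations within k standard deviations of the mean.
within = {k: z.cdf(k) - z.cdf(-k) for k in (1, 2, 3)}

for k, share in within.items():
    print(f"within {k} standard deviation(s): {share:.4f}")
```

The exact shares are slightly different from 68%, 95%, and 99.7%, which is why the rule says "approximately".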
Empirical Rule for Any Normal Curve
[Figure: normal curve with 68% of the area within $\mu \pm 1\sigma$, 95% within $\mu \pm 2\sigma$, and 99.7% within $\mu \pm 3\sigma$]
Health and Nutrition Examination Study of 1976-1980 (HANES)
Heights of adults, aged 18-24
- women: mean 65.0 inches, standard deviation 2.5 inches
- men: mean 70.0 inches, standard deviation 2.8 inches
Health and Nutrition Examination Study of 1976-1980 (HANES)
Empirical Rule
women
- 68% are between 62.5 and 67.5 inches [mean ± 1 std dev = 65.0 ± 2.5]
- 95% are between 60.0 and 70.0 inches
- 99.7% are between 57.5 and 72.5 inches
men
- 68% are between 67.2 and 72.8 inches
- 95% are between 64.4 and 75.6 inches
- 99.7% are between 61.6 and 78.4 inches
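These intervals are just mean ± k standard deviations for k = 1, 2, 3. A minimal sketch, using the HANES means and standard deviations quoted on the slides (the helper name `empirical_intervals` is made up for this example):

```python
def empirical_intervals(mean, sd):
    """Return the 68%, 95% and 99.7% intervals as (low, high) pairs, rounded to 0.1."""
    return {
        label: (round(mean - k * sd, 1), round(mean + k * sd, 1))
        for label, k in (("68%", 1), ("95%", 2), ("99.7%", 3))
    }

women = empirical_intervals(65.0, 2.5)   # HANES women, inches
men = empirical_intervals(70.0, 2.8)     # HANES men, inches

for group, intervals in (("women", women), ("men", men)):
    for label, (low, high) in intervals.items():
        print(f"{group}: {label} are between {low} and {high} inches")
```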
With the Mean and Standard Deviation of the Normal Distribution We Can Determine:
- What proportion of individuals fall into any range of values. Example: What proportion of men are less than 68 inches tall?
- At what percentile a given individual falls, if you know their value
- What value corresponds to a given percentile
[Figure: normal curve of height values with 68 and 70 marked on the axis and the unknown proportion marked "?"]
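All three questions can be explored with `statistics.NormalDist` from the Python standard library. The sketch below assumes men's heights follow N(70.0, 2.8), as in the HANES slides; the 68-to-72 inch range and the 90th percentile are arbitrary choices made only for illustration.

```python
from statistics import NormalDist

men = NormalDist(mu=70.0, sigma=2.8)        # HANES men, inches (normal model assumed)

# Proportion of men shorter than 68 inches; this is also the percentile of a 68-inch man.
p_below_68 = men.cdf(68)

# Proportion in a range of values (68 to 72 inches, an illustrative choice).
p_68_to_72 = men.cdf(72) - men.cdf(68)

# Height at a given percentile (the 90th, an illustrative choice).
height_90th = men.inv_cdf(0.90)

print(f"proportion below 68 inches: {p_below_68:.3f}")
print(f"proportion between 68 and 72 inches: {p_68_to_72:.3f}")
print(f"90th percentile height: {height_90th:.1f} inches")
```

The same calculation can be done by hand by standardizing, z = (68 - 70) / 2.8, and looking up z in a standard normal table.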