
1 Data Analysis and Statistical Software I (323-21-403) Quarter: Autumn 02/03
Daniela Stan, PhD
Course homepage:
Office hours (no appointment needed):
- M, 3:00pm - 3:45pm at LOOP, CST 471
- W, 3:00pm - 3:45pm at LOOP, CST 471
1/13/2019 Daniela Stan - CSC323

2 Outline
- The 1.5 X IQR criterion for suspected outliers
- Measuring spread: the standard deviation
- Normal Distribution
- Standard Normal Distribution
- Introduction to SAS

3 The 1.5 X IQR criterion
The interquartile range (IQR) is the distance between the first and third quartiles: IQR = Q3 - Q1
The 1.5 X IQR criterion for outliers: an observation is a suspected outlier if it falls more than 1.5 X IQR above the third quartile or below the first quartile.
Modified boxplot:
- the lines extend out from the central box only to the smallest and largest observations that are not suspected outliers.
- the suspected outliers are plotted as individual points.
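A minimal Python sketch of the criterion and of the whisker ends of a modified boxplot follows. It assumes quartiles are taken as the medians of the lower and upper halves of the sorted data; other software may interpolate differently. The function names and the sample data are made up for illustration.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    xs = sorted(values)
    half = len(xs) // 2
    lower = xs[:half]
    # With an odd number of observations, the overall median is left out of both halves.
    upper = xs[half:] if len(xs) % 2 == 0 else xs[half + 1:]
    return median(lower), median(upper)

def suspected_outliers(values):
    """Return (flagged points, (low whisker end, high whisker end))."""
    q1, q3 = quartiles(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    flagged = [x for x in values if x < low or x > high]
    kept = [x for x in values if low <= x <= high]
    # Whiskers stop at the most extreme observations that are not flagged.
    return flagged, (min(kept), max(kept))

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # hypothetical sample
flagged, whiskers = suspected_outliers(data)
print(flagged)   # [30]
print(whiskers)  # (1, 8)
```

Here Q1 = 2.5 and Q3 = 7.5, so the cutoffs are -5 and 15; only 30 is flagged, and the whiskers run from 1 to 8.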

4 The 1.5 X IQR criterion (cont.)
Examples 1.9/page 14 & 1.17/page 46

5 The 1.5 X IQR criterion (cont.)
[Figure: histogram of the percent of Hispanics in the adult population]
Shape? Skewed to the right, with a single peak at the left.
Outliers? The one state that stands out is New Mexico, with 38.7%.

6 The 1.5 X IQR criterion (cont.)
The five-number summary is:
Minimum = 0.6, Q1 = 2.0, M = 4.1, Q3 = 7.0, Maximum = 38.7
The 1.5 X IQR criterion for outliers:
IQR = Q3 - Q1 = 7.0 - 2.0 = 5.0, so 1.5 X IQR = 7.5
Suspected outlier: any value below Q1 - 1.5 X IQR or above Q3 + 1.5 X IQR
Q1 - 1.5 X IQR = 2.0 - 7.5 = -5.5
Q3 + 1.5 X IQR = 7.0 + 7.5 = 14.5
There are 7 suspected outliers, all above 14.5 (the minimum, 0.6, is well above -5.5).
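The arithmetic on this slide can be checked in a few lines of Python. This is only a sketch: the function name `outlier_fences` is hypothetical, and the summary is read as Minimum 0.6, Q1 2.0, M 4.1, Q3 7.0, Maximum 38.7, which is the reading consistent with 1.5 X IQR = 7.5.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (lower, upper) cutoffs Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Five-number summary from the slide (percent Hispanic, by state).
minimum, q1, m, q3, maximum = 0.6, 2.0, 4.1, 7.0, 38.7

low, high = outlier_fences(q1, q3)
print(low, high)       # -5.5 14.5
print(minimum < low)   # False: no suspected outliers on the low side
print(maximum > high)  # True: New Mexico (38.7) is a suspected outlier
```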

7 The 1.5 X IQR criterion (cont.)
Modified boxplot: The points represent the suspected outliers.

8 Measuring Spread: The standard deviation
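The variance s^2 of a set of observations x_1, x_2, …, x_n is the average of the squares of the deviations of the observations from their mean:
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$
or, in more compact notation,
$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$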

9 Measuring Spread: The standard deviation
The standard deviation s is the square root of the variance s^2:
$$s = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
The number n - 1 is called the degrees of freedom of the variance or standard deviation.
When is the standard deviation s equal to zero?
Is the standard deviation s a resistant measure?
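As a rough illustration of the definition and of the two questions above, the sketch below computes s directly with the n - 1 divisor and compares three small data sets. The data sets and the function name `sample_std` are made up for illustration; `statistics.stdev` from the Python standard library is used only as a cross-check.

```python
from math import sqrt
from statistics import mean, stdev

def sample_std(xs):
    """Standard deviation with the n - 1 divisor."""
    x_bar = mean(xs)
    variance = sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)
    return sqrt(variance)

same = [5, 5, 5, 5]             # no spread at all
clean = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 50]  # one extreme value replaces the 5

print(sample_std(same))                    # 0.0
print(round(sample_std(clean), 2))         # 1.58
print(round(sample_std(with_outlier), 2))  # 21.27
print(abs(sample_std(clean) - stdev(clean)) < 1e-12)  # True
```

In this sketch s is zero only when every observation equals the mean, and a single extreme value inflates s by more than a factor of ten, which suggests that s is not resistant.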

10 The standard deviation (cont.)
Example: Problem 1.59
Choosing measures for center and spread:
- if the distribution is skewed, choose the five-number summary
- if the distribution is symmetric and free of outliers, choose the mean and the standard deviation

11 The normal distributions
Sometimes the overall pattern of a large number of observations is so regular that we can describe it by a smooth curve. The curve is a mathematical model for the distribution.
A density curve is a curve that is always on or above the horizontal axis and has area exactly 1 underneath it.
[Figure: histogram of the scores of all 947 seventh-grade students in Gary, Indiana, on the vocabulary part of the Iowa test]
[Figure: a symmetric density curve]
Notation:
- Mean: μ
- Standard deviation: σ

12 The normal distributions (cont.)
Normal curves are density curves that are:
- Symmetric
- Unimodal
- Bell-shaped
A normal distribution is specified by:
- Mean μ
- Standard deviation σ
Notation: N(μ, σ)
The equation of the normal distribution is:
$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}$$
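The density formula can be written out in Python and checked against two properties of a density curve: symmetry about the mean and total area 1. This is a sketch, not part of the course material; the function name `normal_density` is hypothetical, the values 64.5 and 2.5 are borrowed from the heights example later in the deck, and the area is only approximated by a simple Riemann sum.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_density(x, mu, sigma):
    """Height of the N(mu, sigma) curve at x."""
    z = (x - mu) / sigma
    return exp(-0.5 * z * z) / (sigma * sqrt(2 * pi))

mu, sigma = 64.5, 2.5

# Symmetry: equal heights at equal distances on either side of the mean.
left = normal_density(mu - 1.0, mu, sigma)
right = normal_density(mu + 1.0, mu, sigma)

# Area under the curve, approximated on [mu - 8*sigma, mu + 8*sigma].
h = 0.01
steps = int(16 * sigma / h)
area = sum(normal_density(mu - 8 * sigma + i * h, mu, sigma) * h
           for i in range(steps + 1))

# Cross-check against the standard library's implementation at the peak.
peak_matches = abs(normal_density(mu, mu, sigma) - NormalDist(mu, sigma).pdf(mu)) < 1e-12

print(left == right, abs(area - 1) < 1e-6, peak_matches)
```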

13 The normal distributions (cont.)
[Figure: two normal curves f(x), each specified by its mean and standard deviation]
Can we locate the standard deviation by eye?

14 The 68-95-99.7 rule
In the normal distribution N(μ, σ):
- Approximately 68% of the observations are between μ - σ and μ + σ
- Approximately 95% of the observations are between μ - 2σ and μ + 2σ
- Approximately 99.7% of the observations are between μ - 3σ and μ + 3σ
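The three percentages can be reproduced with `NormalDist` from the Python standard library. The sketch works on the standard normal curve, on the assumption (justified by standardizing) that the share within k standard deviations of the mean is the same for every normal distribution.

```python
from statistics import NormalDist

Z = NormalDist()  # mean 0, standard deviation 1

# Share of observations within k standard deviations of the mean.
shares = {k: Z.cdf(k) - Z.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(k, round(share, 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```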

15 Standardizing and z-Score
If x is an observation from a distribution N(μ, σ), the standardized value of x, called the z-value, is:
$$z = \frac{x - \mu}{\sigma}$$
- If the z-value is negative, the observation x is less than the mean
- If the z-value is positive, the observation x is greater than the mean
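A small sketch of standardizing in Python is given below. The function name `z_value` is hypothetical; the mean 64.5 and standard deviation 2.5 come from the heights example later in the deck, and the height of 62 inches is an extra made-up observation added to show a negative z-value.

```python
def z_value(x, mu, sigma):
    """Standardized value: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma

mu, sigma = 64.5, 2.5

z_tall = z_value(68, mu, sigma)   # above the mean, so positive
z_short = z_value(62, mu, sigma)  # below the mean, so negative

print(round(z_tall, 2), round(z_short, 2))  # 1.4 -1.0
```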

16 The standard normal distribution
The standard normal distribution N(0,1) is the normal distribution with mean 0 and standard deviation 1.
If a variable X has any normal distribution N(μ, σ), then the standardized variable Z = (X - μ)/σ has the standard normal distribution N(0,1).
Why are normal distributions so important?
- Many statistical inference procedures based on normal distributions work well for other roughly symmetric distributions.
- They are good descriptions of real data.

17 Normal distribution calculations
Example: The heights of young women are approximately normal with mean μ = 64.5 inches and standard deviation σ = 2.5 inches. What proportion of women are less than 68 inches tall?
1. State the problem: X = height, X < 68
2. Standardize: z = (68 - 64.5)/2.5 = 1.4, so X < 68 corresponds to Z < 1.4

18 Normal distribution calculations
3. What proportion of observations (women) have values of the standard normal variable Z less than 1.4?
Table A at the end of the book gives areas (proportions of observations) under the standard normal curve. The table entry is the area to the left of z.
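Where Table A is not at hand, the same area can be computed with `NormalDist` from the Python standard library. This is a sketch of the three steps above, not the course's own method (the course uses the printed table); both routes, standardizing first or working with N(64.5, 2.5) directly, should give the same proportion.

```python
from statistics import NormalDist

heights = NormalDist(mu=64.5, sigma=2.5)  # heights of young women, in inches

# Steps 1-2: standardize x = 68.
z = (68 - heights.mean) / heights.stdev

# Step 3: area to the left of z under the standard normal curve.
p_table = NormalDist().cdf(z)

# Same proportion without standardizing by hand.
p_direct = heights.cdf(68)

print(round(p_table, 4))   # 0.9192
print(round(p_direct, 4))  # 0.9192
```

So about 92% of young women are less than 68 inches tall under this model.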

19 Assignment #1
Due Date: 09/25/02 at 1:30pm
Chapter 1:
- Problem 1.124/page 95
- Problem 1.134/page 99

