SW388R7 Data Analysis & Computers II Slide 1 Detecting Outliers Detecting univariate outliers Detecting multivariate outliers.

Presentation transcript:

SW388R7 Data Analysis & Computers II Slide 1 Detecting Outliers Detecting univariate outliers Detecting multivariate outliers

SW388R7 Data Analysis & Computers II Slide 2 Outliers • Outliers are cases that have data values that are very different from the data values for the majority of cases in the data set. • Outliers are important because they can change the results of our data analysis. Our results may reflect the characteristics of the outliers at the expense of the majority of cases. • Whether we include or exclude outliers from a data analysis depends on the reason why the case is an outlier and the purpose of the analysis.

SW388R7 Data Analysis & Computers II Slide 3 Univariate and Multivariate Outliers • Univariate outliers are cases that have an unusual value for a single variable. In our analyses, we will be concerned with univariate outliers for the dependent variable in our data analysis. • Multivariate outliers are cases that have an unusual combination of values for a number of variables. The value for any individual variable may not be a univariate outlier, but the combination of values across the variables occurs very rarely. In our analyses, we will be concerned with multivariate outliers for the set of independent variables in our data analysis.

SW388R7 Data Analysis & Computers II Slide 4 Standard Scores Detect Univariate Outliers • One way to identify univariate outliers is to convert all of the scores for a variable to standard scores (z scores). • The criterion for a univariate outlier in the text is a standard score that is less than –3.29 or greater than +3.29, the standard score values that correspond to a two-tailed probability of 0.001. • This criterion applies to interval level variables, and to ordinal level variables that are treated as metric. It does not apply to nominal level variables.
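The standard-score criterion can also be checked outside SPSS. The following is a minimal Python sketch, not part of the original slides: the data are hypothetical, the function name univariate_outliers is made up for illustration, and the sample standard deviation (n - 1 denominator) is assumed.

```python
from statistics import mean, stdev

Z_CUTOFF = 3.29  # criterion from the slide


def univariate_outliers(scores, cutoff=Z_CUTOFF):
    """Return (index, score, z) for every score whose |z| exceeds the cutoff."""
    m = mean(scores)
    s = stdev(scores)  # sample standard deviation (n - 1 denominator)
    flagged = []
    for i, x in enumerate(scores):
        z = (x - m) / s
        if abs(z) > cutoff:
            flagged.append((i, x, round(z, 2)))
    return flagged


# Hypothetical data: 29 cases with 0 to 3 children and one case with 12.
children = [0, 1, 2, 3, 1, 2, 0, 2, 1, 3] * 3
children[-1] = 12
flagged = univariate_outliers(children)
print(flagged)
```

Only the case with 12 children is flagged here. In a very small sample no score can reach the cutoff, because the largest possible z score is limited by the number of cases.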

SW388R7 Data Analysis & Computers II Slide 5 Mahalanobis D² and Multivariate Outliers • Mahalanobis D² is a multidimensional version of a z-score. It measures the distance of a case from the centroid (multidimensional mean) of a distribution, given the covariance (multidimensional variance) of the distribution. • A case is a multivariate outlier if the probability associated with its D² is 0.001 or less. D² follows a chi-square distribution with degrees of freedom equal to the number of variables included in the calculation. • Mahalanobis D² requires that the variables be metric, i.e. interval level or ordinal level variables that are treated as metric.
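To make the idea of a distance from the centroid concrete, here is a rough Python sketch of D² for the two-variable case, where the inverse of the 2x2 covariance matrix can be written out by hand. It is an illustration only: the data and the function name mahalanobis_d2_two_vars are hypothetical, and the sample covariance (n - 1 denominator) is assumed.

```python
from statistics import mean


def mahalanobis_d2_two_vars(xs, ys):
    """Squared Mahalanobis distance of each (x, y) case from the centroid."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Sample variances and covariance (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the covariance matrix
    d2 = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # (dx, dy) S^-1 (dx, dy)^T written out for the 2x2 case
        d2.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return d2


# Hypothetical data: years of education and a prestige score for 10 cases.
# The last case pairs low education with high prestige.
educ = [8, 10, 12, 12, 14, 16, 16, 18, 20, 10]
prestige = [25, 30, 36, 38, 44, 50, 52, 58, 65, 62]
d2 = mahalanobis_d2_two_vars(educ, prestige)
most_unusual = d2.index(max(d2))
print(most_unusual, round(d2[most_unusual], 2))
```

Neither value of the last case is extreme on its own, yet it has the largest D² because the combination runs against the positive relationship in the other cases. With only ten cases its D² cannot reach the 0.001 criterion, so the example shows the ranking rather than a flagged outlier.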

SW388R7 Data Analysis & Computers II Slide 6 Problem 1 In the dataset GSS2000R, is the following statement true, false, or an incorrect application of a statistic? Use a level of significance of 0.01 for evaluating missing data and assumptions. In pre-screening the data for use in a multiple regression of the dependent variable "number of children" [childs] with the independent variables "number of hours worked in the past week" [hrs1], "occupational prestige score" [prestg80], and "highest year of school completed" [educ], no univariate or multivariate outliers were detected. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic

SW388R7 Data Analysis & Computers II Slide 7 Level of measurement In the dataset GSS2000R, is the following statement true, false, or an incorrect application of a statistic? Use a level of significance of 0.01 for evaluating missing data and assumptions. In pre-screening the data for use in a multiple regression of the dependent variable "number of children" [childs] with the independent variables "number of hours worked in the past week" [hrs1], "occupational prestige score" [prestg80], and "highest year of school completed" [educ], no univariate or multivariate outliers were detected. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic Since we are pre-screening for a multiple regression problem, we should make sure we satisfy the level of measurement requirements before proceeding. "Number of children" [childs] is interval, satisfying the metric level of measurement requirement for the dependent variable. "Number of hours worked in the past week" [hrs1], "occupational prestige score" [prestg80], and "highest year of school completed" [educ] are interval, satisfying the metric or dichotomous level of measurement requirement for independent variables.

SW388R7 Data Analysis & Computers II Slide 8 Descriptive statistics compute standard scores To compute standard scores in SPSS, select the Descriptive Statistics | Descriptives… command from the Analyze menu.

SW388R7 Data Analysis & Computers II Slide 9 Select the variable(s) for the analysis First, click on the variable to be included in the analysis, childs, to highlight it. Second, click on the right arrow button to move the highlighted variable to the list of variables.

SW388R7 Data Analysis & Computers II Slide 10 Mark the option for computing standard scores First, click on the checkbox to Save standardized values as variables in the dataset. The new variable will have the letter z prepended to its name, e.g. the standard score variable for “childs” will be “zchilds”. Second, click on the OK button to complete the analysis request.

SW388R7 Data Analysis & Computers II Slide 11 The z-score variable in the data editor The variable containing the standard scores will be added to the list of variables in the data editor. To identify outliers below –3.29, we sort the database in ascending order. Right click on the variable header zchilds and select the Sort Ascending command from the popup menu.

SW388R7 Data Analysis & Computers II Slide 12 Outliers with unusually low scores Cases that are outliers because they have unusually low scores for the variable will appear at the top of the sorted list. In this example, we have no cases with z-scores less than –3.29.

SW388R7 Data Analysis & Computers II Slide 13 Detecting outliers with unusually high scores To identify outliers above +3.29, we sort the database in descending order. Right click on the variable header zchilds and select the Sort Descending command from the popup menu.

SW388R7 Data Analysis & Computers II Slide 14 Outliers with unusually high scores Cases that are outliers because they have unusually high scores for the variable will now appear at the top of the sorted list. In this example, there are two cases with z-score values above 3.29.

SW388R7 Data Analysis & Computers II Slide 15 Additional information about the outliers To see additional information about the outliers, we highlight the rows containing the outliers and scroll horizontally to other variables in which we are interested, for example, the case id numbers for the subjects.

SW388R7 Data Analysis & Computers II Slide 16 The raw data scores for the outliers Before deciding whether we retain or omit outliers from the analysis, we should examine the raw scores that made these cases outliers. In this example, the two subjects who were outliers each had seven children.

SW388R7 Data Analysis & Computers II Slide 17 Comparing the raw scores to the mean When we compare the raw data values of 7 to the mean (1.76) and standard deviation (1.532) of the distribution for the variable, we see why these cases are outliers for this distribution. Having 7 children is unusual in a distribution that has a mean of less than 2 children. Having 7 children makes these two cases univariate outliers. The Descriptives output helps us in evaluating the raw data scores for the outliers.
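The comparison can be verified by hand from the figures on this slide. A short Python check, using only the mean, standard deviation, and raw score reported above (the variable names are illustrative):

```python
mean_childs = 1.76  # mean reported in the Descriptives output
sd_childs = 1.532   # standard deviation reported in the Descriptives output
raw_score = 7       # number of children for each outlier case

# Standard score: distance from the mean in standard deviation units.
z = (raw_score - mean_childs) / sd_childs
print(round(z, 2))  # 3.42
```

The result exceeds the 3.29 criterion, which is why a raw score of 7 is flagged.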

SW388R7 Data Analysis & Computers II Slide 18 Deleting the z-score variable Once we are finished with the outlier analysis, we should delete the variables that were added to the data set. First, click on the zchilds column header to select the entire column. Second, select the Clear command from the Edit menu to delete the column from the dataset.

SW388R7 Data Analysis & Computers II Slide 19 Mahalanobis D² is computed by Regression To compute Mahalanobis D² in SPSS, select the Regression | Linear… command from the Analyze menu.

SW388R7 Data Analysis & Computers II Slide 20 Adding the independent variables The SPSS Linear Regression procedure computes Mahalanobis D² for the set of independent variables entered into the dialog box. Move the variables: hrs1, prestg80, and educ to the list of independent variables.

SW388R7 Data Analysis & Computers II Slide 21 Adding an arbitrary dependent variable First, arbitrarily select a variable to use as the dependent variable. The variable should be a numeric variable that does not have any missing cases. For example, click on the first numeric variable in the list of variables: wrkstat. Second, click on the right arrow button to move wrkstat to the text box for the dependent variable. SPSS will not compute the Regression unless we specify a dependent variable, even though the dependent variable is not used in the analysis of multivariate outliers.

SW388R7 Data Analysis & Computers II Slide 22 Adding Mahalanobis D² to the dataset To request that SPSS add the value of Mahalanobis D² to the data set, click on the Save button to open the save dialog box.

SW388R7 Data Analysis & Computers II Slide 23 Specify saving Mahalanobis D² distance First, mark the checkbox for Mahalanobis in the Distances panel. All other checkboxes can be unchecked. Second, complete the request for Mahalanobis distance by clicking on the Continue button.

SW388R7 Data Analysis & Computers II Slide 24 Specify the statistics output needed To understand why a particular case is an outlier, we want to examine the descriptive statistics for each variable. Click on the Statistics… button to request the statistics.

SW388R7 Data Analysis & Computers II Slide 25 Request descriptive statistics First, mark the checkbox for Descriptives. All other checkboxes can be unchecked. Second, complete the request for descriptive statistics by clicking on the Continue button.

SW388R7 Data Analysis & Computers II Slide 26 Complete the request for Mahalanobis D² To complete the request for the regression analysis that will compute Mahalanobis D², click on the OK button.

SW388R7 Data Analysis & Computers II Slide 27 Mahalanobis D 2 scores in the data editor If we look in the column farthest to the right in the data editor, we see that SPSS has calculated the Mahalanobis D² scores for us in a variable it has named "mah_1." The evaluation for outliers, however, requires the probability for the Mahalanobis D² and not the scores themselves.

SW388R7 Data Analysis & Computers II Slide 28 Computing the probability of D² To compute the probability of D², we will use an SPSS function in a Compute command. First, select the Compute… command from the Transform menu.

SW388R7 Data Analysis & Computers II Slide 29 Specifying the variable name and function First, in the target variable text box, type the name "p_mah_1" as an acronym for the probability of mah_1, the Mahalanobis D² score. Second, scroll down the list of functions to find CDF.CHISQ, which calculates the cumulative probability for a variable that follows a chi-square distribution, like Mahalanobis D². Third, click on the up arrow button to move the highlighted function to the Numeric Expression text box.

SW388R7 Data Analysis & Computers II Slide 30 Completing the specifications for the function First, to complete the specifications for the CDF.CHISQ function, type the name of the variable containing the D² scores, mah_1, followed by a comma, followed by the number of variables used in the calculations, 3. Since the CDF function (cumulative distribution function) computes the cumulative probability from the left end of the distribution up through a given value, we subtract it from 1 to obtain the probability in the upper tail of the distribution, so the complete numeric expression is 1 - CDF.CHISQ(mah_1, 3). Second, click on the OK button to signal completion of the Compute Variable dialog.
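For readers who want to check the upper-tail probability without SPSS, the sketch below computes it in Python using only the standard library. It is not the SPSS implementation: it relies on the closed-form expressions for the chi-square upper tail that hold when the degrees of freedom are a positive integer, and the function name chi2_upper_tail is hypothetical.

```python
import math


def chi2_upper_tail(x, df):
    """P(X > x) for a chi-square variable with positive integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum over j < df/2 of (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite sum over half-integer powers.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)  # (x/2)^(1/2) / Gamma(3/2)
    for j in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * term
        term *= half / (j + 0.5)  # next: (x/2)^(j+1/2) / Gamma(j+3/2)
    return total


# D-squared values reported later in the presentation, with 3 variables.
p_small = chi2_upper_tail(17.10, 3)
p_tiny = chi2_upper_tail(35.43, 3)
print(round(p_small, 4), p_tiny < 0.0001)
```

The two values reproduce the probabilities reported for the multivariate outliers in the answer to the problem (0.0007 and less than 0.0001).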

SW388R7 Data Analysis & Computers II Slide 31 Probabilities for D² in the data editor SPSS used the compute command to calculate the probabilities for the D² scores and list them in the data editor. To find the smallest probability value, we will sort the data set in ascending order. To sort the data set, right click on the column header p_mah_1, and select Sort Ascending from the popup menu.

SW388R7 Data Analysis & Computers II Slide 32 Identifying outliers Scroll down the data editor past the probabilities with missing values, which are the result of the compute command when one or more variables has missing data. There are two values less than 0.001, the smaller of which is displayed as .0000. Two cases had an unusual combination of values on the three variables resulting in their designation as multivariate outliers.

SW388R7 Data Analysis & Computers II Slide 33 Answer 1 In the dataset GSS2000R, is the following statement true, false, or an incorrect application of a statistic? Use a level of significance of 0.01 for evaluating missing data and assumptions. In pre-screening the data for use in a multiple regression of the dependent variable "number of children" [childs] with the independent variables "number of hours worked in the past week" [hrs1], "occupational prestige score" [prestg80], and "highest year of school completed" [educ], no univariate or multivariate outliers were detected. 1. True 2. True with caution 3. False 4. Inappropriate application of a statistic The original question stated that there were no univariate or multivariate outliers. The answer to this question is false because there are four outliers: two cases were univariate outliers (z score=3.42 for each), one case was a multivariate outlier with Mahalanobis D²=35.43 (p<0.0001), and one case was a multivariate outlier with Mahalanobis D²=17.10 (p=0.0007).

SW388R7 Data Analysis & Computers II Slide 34 Evaluating Multivariate Outliers • Before we can decide whether we should omit or retain an outlier in our data analysis, we should try to understand why it is an outlier. • To accomplish this, we will move the columns for the variables adjacent to each other in the data editor so that we can compare the values for each case. • We will compare the values for each case to the mean and standard deviation for each variable, computed in the descriptive statistics section of the regression output.

SW388R7 Data Analysis & Computers II Slide 35 Moving columns in the data editor – step 1 We will move the column for the variable prestg80 next to the column for hrs1. First, click on the column header prestg80 for the variable we want to move, so that the column is selected.

SW388R7 Data Analysis & Computers II Slide 36 Moving columns in the data editor – step 2 Next, click and hold the left mouse button down on the column header of the variable we want to move. A box outline will appear at the bottom of the arrow cursor, indicating that SPSS is prepared to move the column.

SW388R7 Data Analysis & Computers II Slide 37 Moving columns in the data editor – step 3 Next, while holding the mouse button down, move the arrow cursor over columns to the left or right. A vertical red line will appear between the columns to indicate where the column will be relocated. When the red line is located where we want to position the column we are moving, release the mouse button. The column will now be relocated.

SW388R7 Data Analysis & Computers II Slide 38 Moving columns in the data editor – step 4 The columns for the variables are now adjacent to one another, making it easier to compare values. Hint: when we move a column, the command “Undo Move Variables” will appear at the top of the Edit menu. I find this command the easiest way to return the columns to their original locations in the data editor. Leaving columns in different locations can make it harder to find a variable we are looking for.

SW388R7 Data Analysis & Computers II Slide 39 Highlighting the outliers for analysis When I finished relocating the three variables, I moved the p_mah_1 column also, so I could easily identify which cases were outliers. Then I highlighted the outlier rows and scrolled them to the top row in the data editor. I can now compare the values for these two cases to the mean and standard deviation of the distribution for the three variables.

SW388R7 Data Analysis & Computers II Slide 40 Evaluating the outlier cases The number of hours worked for both cases is well below the average for the sample. The first case has an above average occupational prestige score combined with below average years of education. The second case has a below average occupational prestige score combined with above average education.

SW388R7 Data Analysis & Computers II Slide 41 Deleting variables added to dataset Once we are finished with the outlier analysis, we should delete the variables that were added to the data set. First, select the mah_1 and p_mah_1 columns. Second, select the Clear command from the Edit menu to delete the columns from the dataset.

SW388R7 Data Analysis & Computers II Slide 42 Steps in evaluating outliers The following is a guide to the decision process for answering problems about outliers: 1. Is the dependent variable metric and the independent variables metric or dichotomous? If no: incorrect application of a statistic. If yes: continue to the next question. 2. Is the standard score for the dependent variable for a case less than –3.29 or greater than +3.29, or is the probability of Mahalanobis D² for the independent variables <= 0.001? If yes: the case is an outlier. If no: the case is not an outlier.
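The decision process on this slide can be sketched as a small Python function. This is an illustrative reading of the flowchart, not code from the course: the function name evaluate_case and its argument names are hypothetical, and missing values are assumed to be passed as None.

```python
Z_CUTOFF = 3.29   # standard score criterion for the dependent variable
P_CUTOFF = 0.001  # probability criterion for Mahalanobis D^2


def evaluate_case(dv_metric, ivs_metric_or_dichotomous, z_dv, p_d2):
    """Apply the two-question decision process to a single case.

    z_dv is the standard score on the dependent variable and p_d2 is the
    upper-tail probability of Mahalanobis D^2; either may be None when the
    value is missing for the case.
    """
    if not (dv_metric and ivs_metric_or_dichotomous):
        return "incorrect application of a statistic"
    univariate = z_dv is not None and abs(z_dv) > Z_CUTOFF
    multivariate = p_d2 is not None and p_d2 <= P_CUTOFF
    if univariate or multivariate:
        return "outlier"
    return "not an outlier"


# z = 3.42 and p = 0.0007 come from the problem; 0.35 and 1.10 are made up.
print(evaluate_case(True, True, 3.42, 0.35))    # outlier
print(evaluate_case(True, True, 1.10, 0.0007))  # outlier
print(evaluate_case(True, True, 1.10, 0.35))    # not an outlier
print(evaluate_case(False, True, 3.42, 0.35))   # incorrect application of a statistic
```

Checking the level of measurement first mirrors the order of the flowchart: the outlier criteria are only evaluated once the variables qualify for the analysis.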