
1 Week 6: Assumptions in Regression Analysis

2 The Assumptions
1. The distribution of residuals is normal (at each value of the independent variables).
2. The variance of the residuals is equal for every set of values of the independent variables – violation is called heteroscedasticity.
3. The error term is additive – no interactions.
4. At every value of the independent variables, the expected (mean) value of the residuals is zero – no non-linear relationships.

3
5. The expected correlation between residuals, for any two cases, is 0 – the independence assumption (lack of autocorrelation).
6. All independent variables are uncorrelated with the error term.
7. No independent variable is a perfect linear function of other independent variables (no perfect multicollinearity).
8. The mean of the error term is zero.

4 What are we going to do …
– deal with some of these assumptions in some detail
– deal with others in passing only

5 Assumption 1: The Distribution of Residuals is Normal at Every Value of the Independent Variables

6 Look at Normal Distributions
A normal distribution is
– symmetrical
– bell-shaped (so they say)

7 What can go wrong?
Skew
– non-symmetry: one tail longer than the other
Kurtosis
– a distribution that is too flat or too peaked is kurtosed
Outliers
– individual cases far from the rest of the distribution
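A quick way to put numbers on the first two problems – a minimal Python sketch, with simulated stand-in residuals rather than the slides' SPSS data:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)
    resid = rng.gamma(shape=2.0, scale=1.0, size=500)  # stand-in, positively skewed "residuals"

    print("skew:", stats.skew(resid))                  # 0 for a symmetrical distribution
    print("excess kurtosis:", stats.kurtosis(resid))   # 0 for a normal distribution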

8 Effects on the Mean
Skew
– biases the mean, in the direction of the skew
Kurtosis
– the mean is not biased
– the standard deviation is
– and hence so are standard errors and significance tests

9 Examining Univariate Distributions
– Histograms
– Boxplots
– P-P Plots
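All three checks in one go – a sketch using matplotlib and statsmodels rather than SPSS, with simulated residuals standing in for your own:

    import matplotlib.pyplot as plt
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    resid = rng.normal(size=500)                      # stand-in residuals

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))
    axes[0].hist(resid, bins=20)                      # histogram
    axes[0].set_title("Histogram")
    axes[1].boxplot(resid)                            # boxplot
    axes[1].set_title("Boxplot")
    sm.ProbPlot(resid).ppplot(line="45", ax=axes[2])  # P-P plot against the normal
    axes[2].set_title("P-P plot")
    plt.tight_layout()
    plt.show()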

10 Histograms A and B

11 C and D

12 E & F

13 Histograms can be tricky …

14 Boxplots

15 P-P Plots A & B

16 C & D

17 E & F

18 Bivariate Normality
We didn’t just say “residuals normally distributed”
We said “at every value of the independent variables”
Two variables can each be normally distributed
– univariate normal
– but not bivariate normal

19 Couples’ IQs
– male and female
– seem reasonably normal

20 But wait!!

21 When we look at bivariate normality
– not normal – there is an outlier
So plot X against Y
A case can be OK for bivariate normality
– but still be a multivariate outlier
– to see that we would need a graph in 3+ dimensions
– and we can’t draw graphs in more than 3 dimensions
But we can look at the residuals instead …

22 IQ histogram of residuals

23 Multivariate Outliers …
– will be explored later in the exercises
So we move on …

24 What to do about Non-Normality
Skew and kurtosis
– skew is much easier to deal with
– kurtosis is less serious anyway
Transform the data to remove skew
– positive skew: log transform
– negative skew: square
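What those two transformations look like in Python – a sketch with invented, simulated variables (with real data, add a constant before logging if zeros are present):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    income = rng.lognormal(mean=10, sigma=1, size=500)  # positively skewed
    mood = rng.beta(a=5, b=1.5, size=500) * 100         # negatively skewed

    income_log = np.log(income)   # log transform removes positive skew
    mood_sq = mood ** 2           # squaring reduces negative skew

    print("income skew:", stats.skew(income), "->", stats.skew(income_log))
    print("mood skew:  ", stats.skew(mood), "->", stats.skew(mood_sq))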

25 Transformation
May need to transform the IV and/or the DV
– more often the DV
– time, income, symptoms (e.g. depression) are all positively skewed
But beware: transformation
– can cause non-linear effects (more later) if only one variable is transformed
– alters the interpretation of the unstandardised parameter
– may alter the meaning of the variable
– may add or remove non-linear and moderator effects

26 Change measures
– increase sensitivity at the relevant ranges, avoiding floor and ceiling effects
Outliers
– can be tricky
– why did the outlier occur?
– an error? Delete it.
– a weird person? Probably delete them.
– a normal person? Tricky.

27 You are trying to model a process
– is the data point ‘outside’ the process?
– e.g. lottery winners, when looking at salary
– e.g. a yawn, when looking at reaction time
Which is better?
– a good model, which explains 99% of your data?
– or a poor model, which explains all of it?
Pedhazur and Schmelkin (1991)
– analyse the data twice: with and without the outlier

28 We will spend much less time on the other 7 assumptions

29 Assumption 2: The variance of the residuals for every set of values for the independent variable is equal.

30 Heteroscedasticity
This assumption is about heteroscedasticity of the residuals
– hetero = different
– scedastic = scattered
We don’t want heteroscedasticity
– we want our data to be homoscedastic
Draw a scatterplot to investigate

31

32 A scatterplot only works with one IV
– with more IVs we would need every combination of them
The predicted values give us an easy way around this
– plot the predicted values against the residuals
– or standardised residuals
– or deleted residuals
– or standardised deleted residuals
– or studentised residuals
A bit like turning the scatterplot on its side
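A sketch of that predicted-vs-residuals plot with statsmodels, on simulated data with heteroscedastic errors built in so the tell-tale fan shape appears:

    import matplotlib.pyplot as plt
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(7)
    X = sm.add_constant(rng.normal(size=(200, 2)))
    y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=np.exp(X[:, 1] / 2))  # heteroscedastic

    model = sm.OLS(y, X).fit()

    plt.scatter(model.fittedvalues, model.resid)  # raw residuals; studentised etc. also work
    plt.axhline(0, linestyle="--")
    plt.xlabel("Predicted values")
    plt.ylabel("Residuals")
    plt.show()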

33 Good – no heteroscedasticity

34 Bad – heteroscedasticity

35 Testing Heteroscedasticity
White’s test
– not automatic in SPSS (it is in SAS)
– luckily, not hard to do
– more luckily, we aren’t going to do it
(In the very unlikely event you ever have to, look it up. Google: White’s test SPSS)
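For the record, White’s test is a one-liner in Python’s statsmodels – a sketch on simulated data, since the slides assume SPSS:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_white

    rng = np.random.default_rng(0)
    X = sm.add_constant(rng.normal(size=(200, 2)))
    y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(scale=1 + np.abs(X[:, 1]))

    model = sm.OLS(y, X).fit()
    lm_stat, lm_pvalue, f_stat, f_pvalue = het_white(model.resid, X)
    print(f"White's test LM p-value: {lm_pvalue:.4f}")  # small p => heteroscedasticity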

36 Plot of Predicted Values and Residuals

37 Magnitude of Heteroscedasticity
Chop the data into “slices”
– 5 slices, based on X (or the predicted score) – done in SPSS
– calculate the variance of each slice
– check the ratio of the smallest to the largest
– less than 10:1 is OK
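The same slicing can be done with pandas instead of SPSS – a sketch where `fitted` and `resid` are simulated stand-ins for your own model’s output:

    import numpy as np
    import pandas as pd

    rng = np.random.default_rng(3)
    fitted = rng.normal(size=500)                           # stand-in predicted scores
    resid = rng.normal(scale=1 + np.abs(fitted), size=500)  # stand-in residuals

    bands = pd.qcut(fitted, q=5, labels=False)              # 5 equal-sized slices
    variances = pd.Series(resid).groupby(bands).var()

    print(variances)
    print("largest/smallest:", variances.max() / variances.min())  # under 10:1 is OK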

38 The Visual Bander – new in SPSS 12

39 Variances of the 5 groups
We have a problem
– 3 / 0.2 ≈ 15

40 Assumption 3: The Error Term is Additive

41 Additivity
We sum the scores in the regression equation
– is that legitimate?
– we can test for it, but it is hard work
You have to know it from your theory
Violation is a specification error

42 Additivity and Theory
Two IVs
Alcohol has a sedative effect
– a bit makes you a bit tired
– a lot makes you very tired
Some painkillers have a sedative effect
– a bit makes you a bit tired
– a lot makes you very tired
But a bit of alcohol and a bit of painkiller together doesn’t just make you a bit tired – it makes you very tired
– the effects multiply together, they don’t add together

43 If you don’t test for it
– it’s very hard to know whether it is happening
There are so many possible non-additive effects
– we cannot test for all of them
– we can test for the obvious ones
In medicine
– researchers choose to test for salient non-additive effects
– e.g. interactions with sex or race
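Testing one salient interaction is straightforward with the statsmodels formula API – a sketch in which the data frame and the tiredness/alcohol/painkiller columns are invented for illustration:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(5)
    df = pd.DataFrame({"alcohol": rng.uniform(0, 4, 300),
                       "painkiller": rng.uniform(0, 4, 300)})
    df["tiredness"] = (df.alcohol + df.painkiller
                       + 2 * df.alcohol * df.painkiller  # built-in multiplicative effect
                       + rng.normal(size=300))

    # 'a * b' expands to both main effects plus their product
    fit = smf.ols("tiredness ~ alcohol * painkiller", data=df).fit()
    print(fit.summary().tables[1])  # a significant alcohol:painkiller row => non-additivity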

44 Assumption 4: At every value of the independent variables, the expected (mean) value of the residuals is zero

45 Linearity
Relationships between variables should be linear
– best represented by a straight line
Not a very common problem in the social sciences
– except economics
– our measures are not sufficiently accurate for it to make a difference – R² is too low
– unlike, say, physics

46 Relationship between speed of travel and fuel used

47 R² is high
– looks pretty good
– if we know the speed, we can make a good prediction of the fuel used
BUT
– look at the chart
– if we know the speed we can make a perfect prediction of the fuel used
– R² should be 1.00

48 Detecting Non-Linearity
Residual plot
– just like heteroscedasticity
Using this example
– very, very obvious
– usually pretty obvious

49 Residual plot

50 Linearity: A Case of Additivity
Linearity = additivity along the range of the IV
Jeremy rides his bicycle harder
– the increase in speed depends on the current speed
– not additive, but multiplicative
MacCallum and Mar (1995). Distinguishing between moderator and quadratic effects in multiple regression. Psychological Bulletin.

51 Assumption 5: The expected correlation between residuals, for any two cases, is 0. The independence assumption (lack of autocorrelation)

52 Independence Assumption
Also known as: lack of autocorrelation
A tricky one
– often ignored
– yet it exists for almost all tests
All cases should be independent of one another
– knowing the value of one case should not tell you anything about the value of other cases

53 How is it Detected?
Can be difficult
– needs some clever statistics (multilevel models)
– you are better off avoiding situations where it arises
Two checks:
– residual plots
– the Durbin-Watson test
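The Durbin-Watson statistic is easy to get in Python – a sketch on simulated, deliberately autocorrelated errors:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.stattools import durbin_watson

    rng = np.random.default_rng(11)
    x = np.arange(100.0)
    e = np.zeros(100)
    for t in range(1, 100):             # build autocorrelated errors
        e[t] = 0.8 * e[t - 1] + rng.normal()
    y = 2 + 0.1 * x + e

    model = sm.OLS(y, sm.add_constant(x)).fit()
    print("Durbin-Watson:", durbin_watson(model.resid))  # ~2 = fine; well below 2 = positive autocorrelation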

54 Residual Plots
Were the data collected in time order?
– if so, plot case (ID) number against the residuals
– look for any pattern: a linear relationship, a non-linear relationship, or heteroscedasticity

55

56 How does it arise?
Two main ways.
Time-series analyses
– when cases are time periods
– weather on Tuesday and weather on Wednesday are correlated
– inflation in 1972 and inflation in 1973 are correlated
Clusters of cases
– patients treated by three doctors
– children from different classes
– people assessed in groups

57 Why does it matter?
Standard errors can be wrong
– therefore significance tests can be wrong
Parameter estimates can be wrong
– really, really wrong
– even flipping from positive to negative
An example
– students do an exam (on statistics)
– each chooses one of three questions
– IV: time; DV: grade

58 Result, with line of best fit

59 The result shows that
– people who spent longer in the exam achieved better grades
BUT …
– we haven’t considered which question people answered
– we might have violated the independence assumption
– the DV will be autocorrelated
Look again
– with the questions marked

60 Now somewhat different

61 Now, people who spent longer got lower grades
– the questions differed in difficulty
– do a hard one: spend longer, get a lower grade
– if you can do a question, you can do it quickly
Very difficult to analyse well
– you need multilevel models
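A sketch of the multilevel fix with statsmodels’ MixedLM – a random intercept per exam question; the column names and numbers are invented (and with only three questions, a fixed effect per question would also do):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(9)
    question = np.repeat([0, 1, 2], 40)
    difficulty = np.array([0.0, -10.0, -20.0])[question]  # harder question => lower grade ...
    time = rng.uniform(30, 90, size=120) + 20 * question  # ... and more time spent

    df = pd.DataFrame({"question": question,
                       "time": time,
                       "grade": 50 + 0.2 * time + difficulty + rng.normal(scale=5, size=120)})

    fit = smf.mixedlm("grade ~ time", data=df, groups=df["question"]).fit()
    print(fit.summary())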

62 Assumption 6: All independent variables are uncorrelated with the error term.

63 Uncorrelated with the Error Term
A curious assumption
– by definition, the sample residuals are uncorrelated with the independent variables (try it and see, if you like)
The assumption is about the true error term
– everything that influences the DV but is left out of the model ends up in the error term
– those omitted influences must be uncorrelated with the IVs

64 A problem in economics
– demand increases supply
– supply increases wages
– higher wages increase demand
OLS estimates will be (badly) biased in this case
– we need a different estimation procedure
– two-stage least squares, or simultaneous equation modelling
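Two-stage least squares can be done by hand with two OLS fits – a sketch on simulated data, where z is an instrument (correlated with the troublesome IV x, but not with the error):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(13)
    n = 1000
    z = rng.normal(size=n)                      # instrument
    e = rng.normal(size=n)                      # structural error
    x = 0.8 * z + 0.6 * e + rng.normal(size=n)  # x is correlated with the error term
    y = 1.0 + 2.0 * x + e

    x_hat = sm.OLS(x, sm.add_constant(z)).fit().fittedvalues  # stage 1
    stage2 = sm.OLS(y, sm.add_constant(x_hat)).fit()          # stage 2

    print("plain OLS slope (biased):", sm.OLS(y, sm.add_constant(x)).fit().params[1])
    print("2SLS slope (near the true 2.0):", stage2.params[1])
    # NB: for correct standard errors use a dedicated 2SLS routine, not this by-hand version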

65 Assumption 7: No independent variables are a perfect linear function of other independent variables (no perfect multicollinearity)

66 No Perfect Multicollinearity
IVs must not be linear functions of one another
– if they are, the matrix of correlations of the IVs is not positive definite
– it cannot be inverted
– so the analysis cannot proceed
We have seen this with
– age, age at start, and time working
– it also occurs with a subscale and the total score
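A sketch of what that does to the maths, using the age example with simulated numbers: the correlation matrix of the IVs becomes singular, so it cannot be inverted.

    import numpy as np

    rng = np.random.default_rng(21)
    age_start = rng.uniform(16, 30, size=100)
    time_working = rng.uniform(0, 40, size=100)
    age = age_start + time_working            # a perfect linear function of the other two

    X = np.column_stack([age_start, time_working, age])
    corr = np.corrcoef(X, rowvar=False)

    print("determinant:", np.linalg.det(corr))        # ~0: the matrix is singular
    print("condition number:", np.linalg.cond(corr))  # astronomically large: inversion fails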

67 Large amounts of collinearity
– sometimes a problem (as we shall see)
– but not an assumption violation

68 Assumption 8: The mean of the error term is zero. You will like this one.

69 Mean of the Error Term = 0
Mean of the residuals = 0
That is what the constant is for
– if the mean of the error term, μ, deviates from zero, the constant β₀ soaks it up: the estimated intercept becomes β₀ + μ
– note the Greek letters, because we are talking about population values
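A tiny simulated demonstration of the constant soaking it up: add 5 to every error and only the estimated intercept moves; the slope is untouched.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(17)
    x = rng.normal(size=200)
    e = rng.normal(size=200)

    fit0 = sm.OLS(1 + 2 * x + e, sm.add_constant(x)).fit()
    fit5 = sm.OLS(1 + 2 * x + (e + 5), sm.add_constant(x)).fit()

    print("intercepts:", fit0.params[0], fit5.params[0])  # differ by ~5
    print("slopes:    ", fit0.params[1], fit5.params[1])  # essentially identical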