University of Warwick, Department of Sociology, 2014/15 SO 201: SSAASS (Surveys and Statistics) (Richard Lampard) Week 6 Regression: ‘Loose Ends’


Multiple Regression
GHQ = (−0.47 × INCOME) + (−1.95 × HOUSING)
For B = −0.47 (income), t = … (p > 0.05)
For B = −1.95 (housing), t = … (p < 0.05)
The r-squared value for this regression is 0.236 (23.6%).

Comparing Bs It does not make much sense to compare values of B directly, since they relate to different units (in this case pounds, as compared with a distinction between types of housing). To overcome this lack of comparability, one can look at beta (β) values, which quantify (in standard deviations of the dependent variable) the impact of a one standard deviation change in each independent variable. In other words, they adjust the effect of each independent variable with reference to its standard deviation together with that of the dependent variable. In the example here, the beta values are β = 0.220 for income and β = … for housing. Hence housing has a more substantial impact than income (per standard deviation), although this was in any case evident from the values of the t-statistics...
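The adjustment just described can be expressed as β = B × sd(x) / sd(y). A minimal sketch in Python (the function name and the illustrative numbers are my own, not figures from the slides):

```python
def standardised_beta(b, sd_x, sd_y):
    """Convert an unstandardised slope B into a standardised beta.

    beta is the change in the dependent variable, measured in its own
    standard deviations, for a one-standard-deviation change in x.
    """
    return b * sd_x / sd_y

# Illustrative numbers only: a slope of 2.0 with sd(x) = 3.0 and
# sd(y) = 6.0 gives beta = 1.0.
example_beta = standardised_beta(2.0, 3.0, 6.0)
```

Because both variables are rescaled by their standard deviations, betas from different predictors can be compared on a common footing, which is exactly the point made above.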

Dummy variables Categorical variables can be included in regression analyses via the use of one or more dummy variables (two-category variables with values of 0 and 1). In the case of a comparison of men and women, a dummy variable could compare men (coded 1) with women (coded 0).

Interaction effects…
[Figure: length of residence plotted against age, with separate lines for women, all respondents, and men.]
In this situation there is an interaction between the effects of age and of gender, so B (the slope) varies according to gender and is greater for women.

Creating a variable to check for an interaction effect We may want to see whether an effect varies according to the level of another variable. Multiplying the values of two independent variables together, and including this product as a third variable alongside the other two, allows us to do this.

Interaction effects (continued)
[Figure: length of residence against age, with separate lines for women, all respondents, and men.]
Slope of line for women = B_AGE
Slope of line for men = B_AGE + B_AGESEXD
SEXDUMMY = 1 for men & 0 for women
AGESEXD = AGE × SEXDUMMY
For men AGESEXD = AGE & for women AGESEXD = 0
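The construction of AGESEXD is just the product of the two variables; a minimal sketch (the variable names follow the slides, the helper functions are my own illustration):

```python
def sex_dummy(is_male):
    """SEXDUMMY: 1 for men, 0 for women, as coded in the slides."""
    return 1 if is_male else 0

def agesexd(age, sexdummy):
    """Interaction term AGESEXD = AGE x SEXDUMMY."""
    return age * sexdummy

# For a 40-year-old man AGESEXD equals his age; for a woman it is 0,
# so the interaction term only shifts the slope for men.
man_term = agesexd(40, sex_dummy(True))
woman_term = agesexd(40, sex_dummy(False))
```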

Transformations There is quite a useful chapter on transformations in Marsh and Elliott (2009). This makes the point that transformations can be applied to one or more independent variables, or to the dependent variable, within a regression analysis. Transformations often take the form of raising a variable to a particular power (e.g. squaring or cubing it), and can take the inverse form of these too (e.g. taking its square root). Logarithmic transformations are also fairly common, e.g. in relation to variables such as income.
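The transformations mentioned above (powers, roots, logs) are one-line operations; the function below is an illustrative sketch, not something from the slides:

```python
import math

def transform(x, kind):
    """Apply one of the common transformations discussed above."""
    if kind == "square":
        return x ** 2          # raising to a power
    if kind == "sqrt":
        return math.sqrt(x)    # the inverse of squaring
    if kind == "log":
        return math.log(x)     # common for income-type variables
    raise ValueError(f"unknown transformation: {kind}")
```

In practice one would apply the chosen transformation to a whole column of data before refitting the regression.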

Reasons for transformations Transformations can lead to a more accurate representation of the form of the relationship between two variables. But they can also sometimes resolve deviations from regression assumptions more generally!

An example These lengths of residence at current address relate to a sample of cases from the 1995 General Household Survey.

Some immediate problems If we are using length of residence as the dependent variable in a regression analysis, then one of the regression assumptions looks problematic: the lengths of residence of individuals within households (e.g. members of couples) are likely to be related, and hence the residuals are unlikely to be independent. We will also need to take account of (control for) age in some way, since this has obvious implications for length of residence. But, as the next slide shows, we might expect the diversity of lengths of residence to increase with age, so the assumption of homoscedasticity seems problematic too...

[Figure: length of residence plotted against age. A line shows the maximum possible length of residence for a given age; two-headed arrows show the increasing scope for diversity as age rises.]

A bivariate regression analysis

[Figure: scatterplot of length of residence (y) against age (x), with the regression line y = Bx + C + ε, where B is the slope, C the constant, and ε the error term (residual); one outlying point is marked.]

Do the residuals have a normal distribution? The distribution of the residuals is, unsurprisingly, asymmetric. A One-Sample Kolmogorov-Smirnov test with a statistic of … shows it differs significantly from a normal distribution (p < 0.001).
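The slides use SPSS's One-Sample Kolmogorov-Smirnov test; outside SPSS, scipy's `kstest` is a rough stand-in. The data below are simulated, right-skewed "residuals" for illustration, not the GHS data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated, strongly right-skewed values standing in for the
# asymmetric length-of-residence residuals described in the slides.
residuals = rng.exponential(scale=5.0, size=500)

# Standardise, then compare against a standard normal distribution.
z = (residuals - residuals.mean()) / residuals.std(ddof=1)
ks_statistic, ks_p = stats.kstest(z, "norm")
# With skew this pronounced, normality is clearly rejected (small p).
```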

What happens if we add age²? It looks as if age-squared does a rather better job of representing the relationship than age does when they are included together! But how does that help us? A One-Sample Kolmogorov-Smirnov test with a statistic of … shows the distribution of residuals still differs significantly from a normal distribution (p < 0.001)...

What if we take the square root of LoR rather than square age?

Are the residuals now closer to a normal distribution? The distribution of the residuals is now much more symmetric. But a One-Sample Kolmogorov-Smirnov test with a statistic of … shows it still differs significantly from a normal distribution (p = 0.022).

Adding sex to the regression... Adding sex to the regression in the form of the dummy variable described earlier doesn’t seem to have achieved much...

But wait... Is there an interaction? ‘asd’ is the AGESEXD interaction term described earlier. Its effect is (just) significant: p=0.041 < 0.05 Meanwhile, the K-S statistic is now only just significant (1.373; p=0.046)

Adding a set of class dummies
SC1 is 1 for Class I, 0 otherwise
SC2 is 1 for Class II, 0 otherwise
SC3 is 1 for Class III NM, 0 otherwise
SC4 is 1 for Class III M, 0 otherwise
SC5 is 1 for Class IV, 0 otherwise
SC7 is 1 for the Armed Forces, 0 otherwise
So the sixth class, Class V, becomes the 'reference category', i.e. the point of reference.
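This coding scheme can be sketched in a few lines of Python (the category labels follow the slide; the function itself is my own illustration, not SPSS syntax):

```python
# Dummy codes for social class, with Class V as the reference category.
CLASS_CODES = {
    "SC1": "I",
    "SC2": "II",
    "SC3": "III NM",
    "SC4": "III M",
    "SC5": "IV",
    "SC7": "Armed Forces",
}

def class_dummies(social_class):
    """Return the six 0/1 dummies; Class V scores 0 on all of them."""
    return {name: int(social_class == label)
            for name, label in CLASS_CODES.items()}
```

Each respondent scores 1 on at most one dummy; the reference category (Class V) scores 0 on all six, so its effect is absorbed into the constant.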

Regression with class...

Hurray! A One-Sample Kolmogorov-Smirnov test with a statistic of … now shows that the residuals do not differ significantly from a normal distribution (p = 0.111 > 0.05).

But should we include the statistically non-significant effects? The age/sex interaction term is now non-significant... (p = 0.066 > 0.05) And some of the dummy variables are non-significant too! But these might be viewed as 'part' of the overall, hierarchical class effect. Nevertheless, we might consider asking SPSS to include only significant effects, with the variables added in a 'stepwise' fashion...

Is this the best we can do? (i.e. Model 4)

Not necessarily... SPSS has added variables one at a time, and stopped when nothing more can be added that has a significant effect... But if the sex variable and the age/sex interaction term were added together, they might improve the model significantly! And if we combined Classes III NM, III M and IV (i.e. used a single dummy rather than SC3, SC4 and SC5), the difference between these categories and the others might be statistically significant...
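Combining Classes III NM, III M and IV into a single dummy is then just a membership test (an illustrative sketch, not SPSS syntax):

```python
def combined_middle_dummy(social_class):
    """Single dummy replacing SC3, SC4 and SC5: 1 if the respondent
    is in Class III NM, III M or IV, and 0 otherwise."""
    return int(social_class in {"III NM", "III M", "IV"})
```

Pooling categories in this way trades detail for statistical power: one coefficient is estimated instead of three, so a shared effect is easier to detect.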

What about heteroscedasticity? It is worth noting that taking the square root of LoR reduced (but did not remove) the problem of the diversity of LoR increasing with age. Levene's test is an F-test available as an option within the One-Way ANOVA menu. Here the diversity of LoR across nine ten-year age groups was examined (comparing teens, twenties, thirties, etc.).
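Levene's test is also available outside SPSS, e.g. as scipy's `levene`. The groups below are simulated with deliberately increasing spread to mimic the pattern described; they are not the actual nine GHS age groups:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Three simulated age groups whose spread in length of residence grows,
# mimicking the heteroscedasticity discussed in the slides.
groups = [rng.normal(loc=10.0, scale=s, size=100) for s in (1.0, 3.0, 6.0)]

# H0 for Levene's test: all groups have equal variances.
levene_f, levene_p = stats.levene(*groups)
# Spread this unequal should lead to rejection of H0 (small p).
```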