Multiple Regression in SPSS GV917
Multiple Regression

Multiple regression involves more than one predictor variable. For example, in the turnout model:

Yi = a + b1·Xi1 + b2·Xi2 + ei

If Ŷi = a + b1·Xi1 + b2·Xi2, then Yi − Ŷi = ei

where:
Yi is the observed value of Reported Turnout
Xi1 is the observed value of Actual Turnout
Xi2 is the Effective Number of Parties index
a is the intercept
b1 and b2 are the slope coefficients relating Reported Turnout to Actual Turnout and to the Effective Number of Parties respectively
Ŷi is the predicted value of Reported Turnout from the linear relationship with Actual Turnout and the Effective Number of Parties
ei is the residual, or error term
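The prediction and residual can be sketched in a few lines of Python. The coefficient and data values below are purely hypothetical, chosen for illustration; they are not estimates from the slides.

```python
def predict_reported_turnout(a, b1, b2, x1, x2):
    # Y-hat = a + b1*X1 + b2*X2
    return a + b1 * x1 + b2 * x2

# Hypothetical values for one country (illustration only)
y_obs = 78.0                                        # observed reported turnout
y_hat = predict_reported_turnout(10.0, 0.8, -1.5,   # hypothetical a, b1, b2
                                 80.0, 3.0)          # actual turnout, no. of parties
residual = y_obs - y_hat                             # e_i = Y_i - Y-hat
```

The residual is simply whatever part of the observed value the fitted plane fails to predict.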
Add an Effective Number of Parties Index to the Turnout Model

This measure was devised by Laakso and Taagepera (Comparative Political Studies, 1979). It is designed to summarize the degree of fragmentation of the party system in a country. It is defined as:

N = 1 / Σ (Pv)²

where Pv is each party's proportion of the total vote.
Two Examples

Suppose there is a two-party system in a country and the votes are shared 60% to 40%. This is not a fragmented system, so:

N = 1 / Σ (Pv)² = 1 / ((0.60)² + (0.40)²) = 1 / 0.52 = 1.92

Intuitively this means that the party system contains 1.92 'equally sized' parties. But suppose in the country next door the vote is divided among four parties as follows: 35%, 30%, 20%, 15%. This is much more fragmented:

N = 1 / Σ (Pv)² = 1 / ((0.35)² + (0.30)² + (0.20)² + (0.15)²) = 1 / 0.275 = 3.64

In this case there are 3.64 'equally sized' parties.
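The two worked examples can be checked with a small Python function implementing the Laakso–Taagepera index as defined above:

```python
def effective_number_of_parties(vote_shares):
    # Laakso-Taagepera index: N = 1 / sum of squared vote proportions
    return 1.0 / sum(p * p for p in vote_shares)

two_party = effective_number_of_parties([0.60, 0.40])                # 1.92...
four_party = effective_number_of_parties([0.35, 0.30, 0.20, 0.15])   # 3.63...
```

Rounding to two decimal places reproduces the 1.92 and 3.64 quoted in the slide.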
[Table: Reported Turnout, Actual Turnout, and Effective Number of Parties for 21 countries: Austria, Belgium, Switzerland, Czech Republic, Germany, Denmark, Spain, Finland, France, Britain, Greece, Hungary, Ireland, Israel, Italy, Luxembourg, Netherlands, Norway, Poland, Portugal, Slovenia]
Reported Turnout Regression with Two Predictors
Why this effect? Note that the fragmentation of parties tends to reduce reported turnout. This effect has been attributed to information-processing costs: if the average citizen has to choose among many alternatives before voting, this raises the costs of voting and so reduces turnout. The parties effect is independent of the actual turnout effect, since in multiple regression we identify the effect of one predictor controlling for all other predictors.
In the Turnout model we are fitting a regression plane to a Three Dimensional Scattergram
How Does Controlling Work?

Step One: Regress Reported Turnout on the Effective Number of Parties:

Yi = a + b1·Xi2 + vi

Note that vi represents the variation in Reported Turnout NOT accounted for by the Effective Number of Parties. We have removed the number of parties as an influence on Reported Turnout.

Step Two: Regress Actual Turnout on the Effective Number of Parties:

Xi1 = a + b2·Xi2 + ui

Thus ui represents the variation in Actual Turnout NOT accounted for by the Effective Number of Parties. We have removed the number of parties as an influence on Actual Turnout.
Controlling in Multiple Regression

Step Three: In the multiple regression model

Yi = a + b1·Xi1 + b2·Xi2 + ei

b1, the effect of Actual Turnout on Reported Turnout, can be found by regressing the residuals vi on the residuals ui, because both are independent of the Effective Number of Parties. This is in effect what multiple regression does.

[Path diagram: Actual Turnout and the Effective Number of Parties each pointing to Reported Turnout]
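The three steps can be verified numerically. The sketch below uses synthetic data (the real country data are not reproduced here) to show that regressing the residuals vi on the residuals ui recovers exactly the coefficient b1 from the full multiple regression:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 21
x1 = rng.normal(70, 10, n)                        # actual turnout (synthetic)
x2 = rng.normal(3.5, 1.0, n)                      # effective no. of parties (synthetic)
y = 5 + 0.6 * x1 - 2 * x2 + rng.normal(0, 3, n)   # reported turnout (synthetic)

def ols(y, X):
    # fit OLS with an intercept; return (residuals, coefficients)
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta, beta

# Full multiple regression: coefficient on actual turnout
_, beta_full = ols(y, np.column_stack([x1, x2]))
b1_full = beta_full[1]

# Steps one and two: purge the number of parties from both variables
v, _ = ols(y, x2)    # reported turnout residuals
u, _ = ols(x1, x2)   # actual turnout residuals

# Step three: regress the residuals on each other
_, beta_res = ols(v, u)
b1_resid = beta_res[1]   # identical to b1_full
```

The equality of the two coefficients is the Frisch–Waugh–Lovell theorem, which is the formal name for the controlling logic the slide describes.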
Controlling in Regression

In this model we are regressing the Reported Turnout residuals (vi) on the Actual Turnout residuals (ui), both purged of the Effective Number of Parties. This produces the same regression coefficient (0.636) as in the earlier multivariate model.
Another Look at ANOVA and the F Test in Multiple Regression

The F test compares the Regression Mean Square with the Residual Mean Square. If it has a high value then the regression explains much more variation than is left unexplained; if it has a low value then the regression explains very little variation. The theoretical F distribution gives the probability that the F statistic will take on a particular value if the Null Hypothesis (the regression explains nothing) is correct.
F Test in Multiple Regression

Regression Mean Square = Regression Sum of Squares / Degrees of Freedom (= 2)

Residual Mean Square = Residual Sum of Squares / Degrees of Freedom (= 18)

F = Regression Mean Square / Residual Mean Square = 55.86
What Are Degrees of Freedom? They are usable bits of information.

Total: If we had only one observation we could not say anything about the total variation; we need more than one case. This is why the degrees of freedom, or usable bits of information, is n − 1, or 20 (given 21 cases).

Residual: If we had two observations we could fit the regression line in a bivariate model, since the shortest distance between two points is a straight line, but there would be no residuals since the line would fit perfectly. In a three-variable model we would need three observations to fit the regression plane, since it sits in a three-dimensional space. So to define residuals we need n − 3, or 18, degrees of freedom.

Since Total Variation = Explained Variation + Residual Variation, Explained Variation = Total Variation − Residual Variation, and so the explained degrees of freedom = (n − 1) − (n − 3) = 2.
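The F calculation above can be written as a short Python function. The sums of squares passed in below are made-up numbers for illustration (the slide's own output values are not reproduced here); the degrees of freedom follow the n = 21, k = 2 case discussed above.

```python
def f_statistic(reg_ss, resid_ss, k, n):
    # reg_ss: regression (explained) sum of squares
    # resid_ss: residual sum of squares
    # k: number of predictors, n: number of observations
    ms_regression = reg_ss / k            # df = k (here 2)
    ms_residual = resid_ss / (n - k - 1)  # df = n - k - 1 (here 21 - 2 - 1 = 18)
    return ms_regression / ms_residual

# Hypothetical sums of squares, illustration only
f = f_statistic(200.0, 90.0, k=2, n=21)   # (200/2) / (90/18) = 100 / 5 = 20.0
```

A large F means the explained variation per degree of freedom dwarfs the unexplained variation per degree of freedom.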
The F Test

F = Regression Mean Square / Residual Mean Square follows an F distribution. If we start by assuming that the regression explains nothing, the F ratio will still not be zero, because by chance we might get a small positive value. The F distribution maps the probability that a ratio of a given size will occur if the regression actually explains nothing. The larger the value of F, the smaller the likelihood that it will occur by chance if the regression explains nothing. In this case the probability of an F of 55.86 occurring due to chance is much smaller than 0.05, so we can say that the F statistic is significant at the 0.05 level.
The F Distribution – (named after Ronald Fisher)
Another Model – Explaining Happiness in the ESS 2002 Dataset

[Frequency table: happy, 'How happy are you'. Valid responses run from 0 'Extremely unhappy' to 10 'Extremely happy'; missing codes cover Refusal, Don't know, and No answer.]
Income Scale in the European Social Survey 2002

[Frequency table: hinctnt, 'Household's total net income, all sources'. The income bands are coded with letters (J, R, C, M, F, S, K, P, D, H, U, N); missing codes cover Refusal, Don't know, and No answer.]
Does Money Buy Happiness?

[SPSS output: Model Summary, ANOVA, and Coefficients tables for the regression of happy ('How happy are you') on income. Predictors: (Constant), income.]
Is the Specification Correct? Perhaps we should use a quadratic version of the income variable:

* Calculating quadratic functions in the ESS.
compute income = hinctnt.
compute incomsq = hinctnt*hinctnt.

Here incomsq is the square of the hinctnt (household income) variable. If we use incomsq in the model in addition to income, this captures a non-linear relationship between income and happiness: more income increases happiness, but at a declining rate of change.
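The same quadratic specification can be sketched in Python. The data below are synthetic, generated under the assumption the slide makes (happiness rises with income at a declining rate); the fitted coefficient on the squared term then comes out negative, which is the signature of that curvature:

```python
import numpy as np

rng = np.random.default_rng(1)
income = rng.uniform(1, 12, 200)                       # synthetic income scale
happy = 4 + 0.8 * income - 0.04 * income**2 \
        + rng.normal(0, 0.5, 200)                      # synthetic happiness scores

# Fit happy = a + b1*income + b2*income^2 by least squares
X = np.column_stack([np.ones_like(income), income, income**2])
beta, *_ = np.linalg.lstsq(X, happy, rcond=None)
# beta[1] > 0 and beta[2] < 0: happiness rises with income at a declining rate
```

In SPSS terms, the income**2 column plays exactly the role of the computed incomsq variable.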
Regression of Income on Happiness in the ESS 2002 – Does Money Buy Happiness?
Quadratic Relationship Between Two Variables
Suppose we want to use Occupational Status as a predictor in the Happiness model; we would have to create this variable.

This is done with the assistance of the variable iscoco, a classification of the many occupations which exist in Europe. For example:

iscoco  Occupation
100     Armed forces
1100    Legislators and senior officials
1110    Legislators, senior government officials
1140    Senior officials of special-interest org
1141    Senior officials of political-party org
1142    Senior officials of economic-interest org

To put this in a form which is usable in the regression model we recode it as follows:

recode iscoco (2000 thru 2470=6)(1000 thru 1319=5)(3000 thru 3480=4)(4000 thru 4223=3)(5000 thru 8340=2)(9000 thru 9330=1)(else=sysmis) into occup.
value labels occup 1 'unskilled or semi-skilled manual workers' 2 'skilled manual workers' 3 'white collar clerical & administrative workers' 4 'white collar technical workers' 5 'middle managers' 6 'professionals and senior managers'.
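The same recode logic can be expressed as a Python function, which makes the band boundaries from the SPSS recode command easy to check; codes outside every band return None, mirroring SPSS's sysmis:

```python
def recode_iscoco(code):
    # Bands copied from the SPSS recode command above
    bands = [
        (2000, 2470, 6),  # professionals and senior managers
        (1000, 1319, 5),  # middle managers
        (3000, 3480, 4),  # white collar technical workers
        (4000, 4223, 3),  # white collar clerical & administrative workers
        (5000, 8340, 2),  # skilled manual workers
        (9000, 9330, 1),  # unskilled or semi-skilled manual workers
    ]
    for low, high, occup in bands:
        if low <= code <= high:
            return occup
    return None  # else=sysmis: e.g. 100 'Armed forces' is excluded
```

For instance, a health professional (iscoco in the 2000s) maps to category 6, while a code of 100 falls through to missing.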
The Recoded Occupational Status Variable in the ESS 2002 Data
Suppose we want to add a gender variable, to see if women are happier than men.

If statements can be used to create new variables in SPSS. These are recodes which are carried out only if certain conditions are met. For example:

compute female=0.
if (gndr eq 2) female=1.

The first line creates a new variable consisting only of zeroes; the second changes it to a score of 1 if the existing variable gndr has a score of 2.
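The dummy-variable construction is the same in any language. A minimal Python equivalent of the compute/if pair, assuming (as the slide does) that gndr code 2 denotes female:

```python
def make_female(gndr):
    # female = 0 by default; set to 1 when gndr equals 2
    return 1 if gndr == 2 else 0

# Applied to a column of gender codes
gndr_codes = [1, 2, 2, 1, 2]
female = [make_female(g) for g in gndr_codes]   # [0, 1, 1, 0, 1]
```

In the regression, the coefficient on this 0/1 variable is then the estimated difference in mean happiness between women and men, holding the other predictors constant.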
If Statements in SPSS – gndr and Female
Revised Happiness Model

[SPSS output: Model Summary, ANOVA, and Coefficients tables for the regression of happy ('How happy are you') on incomsq, female, occup, and income. Predictors: (Constant), incomsq, female, occup, income.]
Conclusions

Multiple regression is a relatively simple extension of two-variable regression. Unlike two-variable regression, in multiple regression we control for the influence of additional variables when examining the relationship between each independent variable and the dependent variable; it is a bit like a statistical experiment. The great majority of social science models are multivariate, and so we commonly use multiple regression.