Department of Applied Economics National Chung Hsing University

Presentation transcript:

Simple Regression (Department of Applied Economics, National Chung Hsing University)

Linear Functions. The formula Y = a + bX is a linear function: if you graphed X and Y for any chosen values of a and b, you would get a straight line. It is a family of functions: for each pair of values a and b, you get a particular line. a is referred to as the "constant" or "intercept"; b is referred to as the "slope". To graph a linear function: pick values for X, compute the corresponding values of Y, then connect the dots to graph the line.
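
The slides graph these by hand; as a minimal sketch of the same recipe in Python (the values of a and b are arbitrary illustration choices, not from the slides):

```python
import numpy as np
import matplotlib.pyplot as plt

a, b = 3.0, -1.5               # illustration values: intercept and slope
x = np.linspace(-10, 10, 50)   # pick values for X
y = a + b * x                  # compute corresponding values of Y

plt.plot(x, y)                 # connect the dots to graph the line
plt.axhline(0, color="gray")
plt.axvline(0, color="gray")
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Y = 3 - 1.5X")
plt.show()
```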

Linear Functions: Y = a + bX. The "constant" or "intercept" (a) determines where the line intersects the Y-axis; if a increases (decreases), the line moves up (down). [Figure: three parallel lines with the same slope but different intercepts: Y = 14 - 1.5X, Y = 3 - 1.5X, Y = -9 - 1.5X.]

Linear Functions: Y = a + bX. The slope (b) determines the steepness of the line. [Figure: three lines with the same intercept but different slopes: Y = 3 + 3X, Y = 3 + .2X, Y = 3 - 1.5X.]

Linear Functions: Slopes. The slope (b) is the ratio of change in Y to change in X. [Figure: the line Y = 3 + 3X; a change in X of 5 produces a change in Y of 15, so the slope is b = 15/5 = 3.] The slope tells you how many points Y will increase for any single-point increase in X.

Linear Functions as Summaries. A linear function can be used to summarize the relationship between two variables. [Figure: happiness plotted against income; a change in X of $40,000 corresponds to a change in Y of 2 points, so the slope is b = 2 / 40,000 = .00005 points per dollar.] If you change units: b = .05 points per $1K, b = .5 points per $10K, b = 5 points per $100K.

Linear Functions as Summaries. The slope and constant can be "eyeballed" to approximate a formula: Happy = 2 + .00005(Income). Slope (b): b = 2 / 40,000 = .00005 points per dollar. Constant (a): the value where the line hits the Y-axis; here, a = 2.

Linear Functions as Summaries. Linear functions can powerfully summarize data. The formula Happy = 2 + .00005(Income) gives a sense of how the two variables are related: people get a .00005-point increase in happiness for every extra dollar of income (or 5 points per $100K). It also lets you "predict" values. What if someone earns $150,000? Happy = 2 + .00005(150,000) = 9.5. But be careful: you shouldn't assume that a relationship remains linear indefinitely, and values like negative income or negative happiness make no sense.

Linear Functions as Summaries. Come up with a linear function that summarizes this real data: years of education vs. job prestige. It isn't always easy! The line you choose depends on how much you "weight" each of these points.

Computing Regressions. Regression coefficients can be calculated in SPSS; you will rarely, if ever, compute them by hand. SPSS will estimate: the value of the constant (a), the value of the slope (b), plus a large number of related statistics and results of hypothesis testing procedures.
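
The slides do everything in SPSS; for readers without it, here is a minimal sketch of the same estimation in Python with statsmodels. The data arrays are hypothetical stand-ins, not the data behind the slides:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: years of education (X) and job prestige scores (Y)
educ = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20], dtype=float)
prestige = np.array([28, 32, 40, 38, 45, 50, 47, 55, 60], dtype=float)

X = sm.add_constant(educ)          # adds the intercept term (a)
model = sm.OLS(prestige, X).fit()
print(model.params)                # [a, b]: constant and slope
print(model.summary())             # t-values, p-values, R-square, etc.
```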

Example: Education & Job Prestige. Previously, we made an "eyeball" estimate of the line relating years of education to job prestige. Our estimate: Y = 5 + 3X.

Example: Education & Job Prestige. The actual SPSS regression results for that data give estimates of a and b: "Constant" = a = 9.427; slope for "Year of School" = b = 2.487. Equation: Prestige = 9.4 + 2.5(Education). A year of education adds about 2.5 points of job prestige.

Example: Education & Job Prestige. Comparing our "eyeball" estimate to the actual OLS regression line. [Figure: scatterplot showing our estimate, Y = 5 + 3X, alongside the actual OLS regression line computed in SPSS.]

R-Square. The R-square statistic indicates how well the regression line "explains" variation in Y. It is based on partitioning variance into: 1. explained ("regression") variance, the portion of deviation from Y-bar accounted for by the regression line; and 2. unexplained ("error") variance, the portion of deviation from Y-bar that is "error". Formula: R² = SSregression / SStotal, where SStotal = SSregression + SSerror.
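
A sketch of this partition in Python (hypothetical numbers; the identity in the last line holds because the fitted values come from an OLS line with an intercept):

```python
import numpy as np

# Hypothetical data; fit the OLS line with numpy, then partition variance
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([4.0, 5.0, 7.0, 8.0, 11.0])
b, a = np.polyfit(x, y, 1)                        # slope and intercept
y_hat = a + b * x

ss_total = np.sum((y - y.mean()) ** 2)            # total deviation from Y-bar
ss_regression = np.sum((y_hat - y.mean()) ** 2)   # "explained" variance
ss_error = np.sum((y - y_hat) ** 2)               # "error" variance

r_square = ss_regression / ss_total
print(r_square, 1 - ss_error / ss_total)          # same number, two routes
```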

R-Square. Visually, deviation from Y-bar is partitioned into two parts. [Figure: the regression line Y = 2 + .5X with a data point; the gap between the point and the line is the "error variance", and the gap between the line and Y-bar is the "explained variance".]

Example: Education & Job Prestige. R-square and hypothesis testing information: the R and R-square values indicate how well the line summarizes the data. This information allows us to do hypothesis tests about the constant and slope.

Hypothesis Tests: Slopes. Given: the observed slope relating education to job prestige is 2.487. Question: can we generalize this to the population of all Americans? How likely is it that this observed slope was actually drawn from a population with slope = 0? Solution: conduct a hypothesis test. Notation: sample slope = b, population slope = β. H0: population slope β = 0. H1: population slope β ≠ 0 (two-tailed test).

Example: Slope Hypothesis Test. The actual SPSS regression results for that data: the t-value and "Sig." (p-value) are for hypothesis tests about the slope. Reject H0 if: the t-value > critical t (with N - 2 df), or "Sig." (the p-value) is less than α (often α = .05).

Hypothesis Tests: Slopes. What information lets us do a hypothesis test? Answer: estimates of a slope (b) have a sampling distribution, like any other statistic. It is the distribution of every value of the slope, based on all possible samples (of size N). If certain assumptions are met, the sampling distribution approximates the t-distribution. Thus, we can assess the probability that a given value of b would be observed if β = 0. If that probability is low (below alpha), we reject H0.

Hypothesis Tests: Slopes. Visually: if the population slope (β) is zero, then the sampling distribution of the slope would center at zero. Since the sampling distribution is a probability distribution, we can identify the likely values of b if the population slope is zero: if β = 0, observed slopes should commonly fall near zero, too. [Figure: sampling distribution of the slope, centered at zero.] If the observed slope falls very far from 0, it is improbable that β is really zero; thus, we can reject H0.

Regression Assumptions. Assumptions of simple (bivariate) regression; if the assumptions aren't met, hypothesis tests may be inaccurate:
1. Random sample with sufficient N (N > ~20).
2. Linear relationship among variables. Check the scatterplot for a non-linear pattern (a "cloud" is OK).
3. Conditional normality: Y is normal at all values of X. Check histograms of Y for normality at several values of X.
4. Homoskedasticity: equal error variance at all values of X. Check the scatterplot for "bulges" or "fanning out" of error across values of X.
Additional assumptions are required for multivariate regression.
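
As a rough check of assumption 4, a common habit is to plot residuals against X and look for fanning; a minimal sketch, again with hypothetical data:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical data, refit with numpy so the sketch is self-contained
x = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20], dtype=float)
y = np.array([28, 32, 40, 38, 45, 50, 47, 55, 60], dtype=float)
b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)

plt.scatter(x, residuals)          # look for "bulges" or fanning out
plt.axhline(0, color="gray")
plt.xlabel("X")
plt.ylabel("Residual (error)")
plt.title("Homoskedasticity check: spread should be roughly equal across X")
plt.show()
```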

Bivariate Regression Assumptions. Normality: examine sub-samples at different values of X; make histograms and check for normality. [Figure: two histograms of Y at different values of X, one labeled "good" (roughly normal) and one "not very good".]

Bivariate Regression Assumptions. Homoskedasticity (equal error variance): examine the error at different values of X. Is it roughly equal? [Figure: scatterplot where the error spread is similar across X; here, things look pretty good.]

Bivariate Regression Assumptions. Heteroskedasticity (unequal error variance): [Figure: scatterplot where, at higher values of X, the error variance increases a lot; this looks pretty bad.]

Regression Hypothesis Tests. If the assumptions are met, the sampling distribution of the slope (b) approximates a t-distribution. The standard deviation of the sampling distribution is called the standard error of the slope (s_b). Formula for the standard error: s_b = sqrt( s_e² / Σ(Xi - X-bar)² ), where s_e² is the variance of the regression error.

Regression Hypothesis Tests. Finally, a t-value can be calculated: it is the slope divided by the standard error, t = b / s_b, where s_b is the sample point estimate of the standard error. The t-value is based on N - 2 degrees of freedom. Reject H0 if the observed t > critical t (e.g., 1.96).
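
Putting the last two slides together, a sketch that computes s_b, the t-value, and the two-tailed p-value by hand (hypothetical data again):

```python
import numpy as np
from scipy import stats

x = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20], dtype=float)
y = np.array([28, 32, 40, 38, 45, 50, 47, 55, 60], dtype=float)
n = len(x)

b, a = np.polyfit(x, y, 1)
residuals = y - (a + b * x)
s_e2 = np.sum(residuals ** 2) / (n - 2)             # error variance estimate
s_b = np.sqrt(s_e2 / np.sum((x - x.mean()) ** 2))   # standard error of slope

t_value = b / s_b
p_value = 2 * stats.t.sf(abs(t_value), df=n - 2)    # two-tailed "Sig."
print(t_value, p_value)
```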

Example: Education & Job Prestige. t-values can be compared to the critical t. SPSS estimates the standard error of the slope; this is used to calculate a t-value. The t-value can be compared to the "critical value" to test hypotheses, or just compare "Sig." to alpha. If t > critical t, or Sig. < alpha, reject H0.

Multiple Regression 1 (Department of Applied Economics, National Chung Hsing University)

Multiple Regression. Question: what if a dependent variable is affected by more than one independent variable? Strategy #1: do separate bivariate regressions, one regression for each independent variable. This yields separate slope estimates for each independent variable. Bivariate slope estimates implicitly assume that neither independent variable mediates the other; in reality, there might be no effect of family wealth over and above education.

Multiple Regression. Job Prestige: two separate bivariate regression models. [Figure: SPSS output for both models.] Both variables have positive, significant slopes.

Multiple Regression. Strategy #2: use multiple regression. Multiple regression can examine "partial" relationships: relationships after the effects of other variables have been "controlled" (taken into account). This lets you determine the effects of variables "over and above" other variables, shows the relative impact of different factors on a dependent variable, and lets you use several independent variables to improve your predictions of the dependent variable.

Multiple Regression. Job Prestige: two-variable multiple regression. [Figure: SPSS output.] The education slope is basically unchanged; the family income slope decreases compared to the bivariate analysis (bivariate: b = 2.07), and the outcome of the hypothesis test changes: t < 1.96.

Multiple Regression. Ex: Job Prestige, two-variable multiple regression. 1. Education has a large slope effect controlling for (i.e., "over and above") family income. 2. Family income does not have much effect controlling for education, despite a strong bivariate relationship. Possible interpretations: family income may lead to education, but education is the critical predictor of job prestige; or family income is wholly unrelated to job prestige but is coincidentally correlated with a variable that is related (education), which generated a spurious "effect".

The Multiple Regression Model. A two-independent-variable regression model: Y = a + b1X1 + b2X2 + e. Note: there are now two X variables, and a slope (b) is estimated for each one. The full multiple regression model for k independent variables is: Y = a + b1X1 + b2X2 + … + bkXk + e.
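
A sketch of fitting this two-variable model in Python with statsmodels (hypothetical data; in the slides this is done in SPSS):

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: education (X1), family income (X2), job prestige (Y)
educ = np.array([8, 10, 12, 12, 14, 16, 16, 18, 20], dtype=float)
income = np.array([20, 25, 30, 28, 40, 55, 50, 60, 75], dtype=float)
prestige = np.array([28, 32, 40, 38, 45, 50, 47, 55, 60], dtype=float)

X = sm.add_constant(np.column_stack([educ, income]))
model = sm.OLS(prestige, X).fit()
print(model.params)   # [a, b1, b2]: one slope per independent variable
```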

Multiple Regression: Slopes. Regression slope for the two-variable case, written in terms of correlations (a standard form, consistent with the discussion below): b1 = [(rY1 - rY2·r12) / (1 - r12²)]·(sY / s1). This is the slope for X1 controlling for the other independent variable, X2; b2 is computed symmetrically (swap the X1 and X2 terms). Compare to the bivariate slope: b = rYX·(sY / sX).

Multiple Regression Slopes. Let's look more closely at the formulas. What happens to b1 if X1 and X2 are totally uncorrelated? Answer: with r12 = 0, the formula reduces to the bivariate slope. What if X1 and X2 are correlated with each other AND X2 is more correlated with Y than X1 is? Answer: b1 gets smaller (compared to the bivariate slope).
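
A sketch verifying the correlation form of the slope numerically against a least-squares fit (simulated data with made-up coefficients; if r12 were exactly 0, the formula would collapse to the bivariate slope):

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=200)

# b1 from the correlation formula
r_y1 = np.corrcoef(y, x1)[0, 1]
r_y2 = np.corrcoef(y, x2)[0, 1]
r_12 = np.corrcoef(x1, x2)[0, 1]
b1_formula = (r_y1 - r_y2 * r_12) / (1 - r_12 ** 2) * (y.std() / x1.std())

# b1 from an ordinary least-squares fit
X = np.column_stack([np.ones_like(x1), x1, x2])
b1_ols = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(b1_formula, b1_ols)   # the two routes agree
```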

Regression Slopes. So, if two variables (X1, X2) are correlated and both predict Y: the X variable that is more correlated with Y will have the higher slope in the multivariate regression, and the slope of the less-correlated variable will shrink. Thus, the slope for each variable is adjusted according to how well the other variable predicts Y: it is the slope "controlling" for the other variables.

Multiple Regression Slopes. One last thing to keep in mind: what happens to b1 if X1 and X2 are almost perfectly correlated? Answer: the denominator (1 - r12²) approaches zero, so the slope "blows up", approaching infinity. Highly correlated independent variables can cause trouble for regression models; watch out.

Interpreting Results. (Over)simplified rules for interpretation, assuming good sample, measures, models, etc., for a multivariate regression with two variables A and B: if the slopes of A and B are the same as in the bivariate regressions, each has an independent effect. If A remains large while B shrinks to zero, we typically conclude that the effect of B was spurious, or operates through A. If both A and B shrink a little, each has an effect, but some overlap or mediation is occurring.

Interpreting Multivariate Results Things to watch out for: 1. Remember: Correlation is not causation Ability to “control” for many variables can help detect spurious relationships… but it isn’t perfect. Be aware that other (omitted) variables may be affecting your model. Don’t over-interpret results. 2. Reverse causality Many sociological processes involve bi-directional causality. Regression slopes (and correlations) do not identify which variable “causes” the other. Ex: self-esteem and test scores.

Standardized Regression Coefficients. Regression slopes reflect the units of the independent variables. Question: how do you compare how "strong" the effects of two variables are if they have totally different units? Example: education, family wealth, and job prestige. Education is measured in years, b = 2.5; family wealth is measured on a 1-5 scale, b = .18. Which is the "bigger" effect? The units aren't comparable! Answer: create "standardized" coefficients.

Standardized Regression Coefficients. Standardized coefficients are also called "betas" or "beta weights". Symbol: Greek beta with an asterisk, β*. They are equivalent to Z-scoring (standardizing) all variables, dependent and independent, before doing the regression. Formula of the coefficient for Xj: β*j = bj·(sXj / sY). Result: the unit is standard deviations. Betas indicate the effect of a 1 standard deviation change in Xj on Y.
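
A sketch of that equivalence: z-score every variable, rerun the regression, and the slopes are the betas (simulated data with made-up coefficients):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
educ = rng.normal(13, 3, 200)       # years of education
income = rng.normal(50, 15, 200)    # family income
prestige = 5 + 2.5 * educ + 0.1 * income + rng.normal(0, 5, 200)

def zscore(v):
    return (v - v.mean()) / v.std()

# Standardize everything, then regress: the slopes are now betas (β*)
X = sm.add_constant(np.column_stack([zscore(educ), zscore(income)]))
betas = sm.OLS(zscore(prestige), X).fit().params
print(betas)   # effect of a 1 SD change in each X, in SDs of Y
```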

Standardized Regression Coefficients. Ex: education, family income, and job prestige: an increase of 1 standard deviation in education results in a .52 standard deviation increase in job prestige. What is the interpretation of the "family income" beta? Betas give you a sense of which variables "matter most".

R-Square in Multiple Regression. The multivariate R-square is much like the bivariate one, R² = SSregression / SStotal, but SSregression is now based on the multivariate regression. The addition of new variables results in better prediction of Y, less error (e), and a higher R-square.

R-Square in Multiple Regression. Example: an R-square of .272 indicates that education and parents' wealth together explain about 27% of the variance in job prestige. "Adjusted R-square" is a more conservative, more accurate measure in multiple regression; it penalizes R-square for the number of predictors (adjusted R² = 1 - (1 - R²)(N - 1)/(N - k - 1)). Generally, you should report adjusted R-square.

Dummy Variables. Question: how can we incorporate nominal variables (e.g., race, gender) into regression? Option 1: analyze each sub-group separately; this generates a different slope and constant for each group. Option 2: dummy variables. A "dummy" is a dichotomous variable coded to indicate the presence or absence of something: absence is coded as 0, presence as 1.

Dummy Variables. Strategy: create a separate dummy variable for each nominal category. Ex: gender: make female and male variables. DFEMALE: coded as 1 for all women, 0 for men. DMALE: coded as 1 for all men, 0 for women. Next: include all but one of the dummy variables in the multiple regression model. If there are two dummies, include 1; if 5 dummies, include 4.
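
A sketch of this strategy in Python with pandas and statsmodels (hypothetical toy data):

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical data with a nominal variable
df = pd.DataFrame({
    "gender":   ["female", "male", "female", "male", "female", "male"],
    "educ":     [12, 12, 16, 16, 20, 20],
    "prestige": [40, 38, 52, 49, 63, 58],
})
df["DFEMALE"] = (df["gender"] == "female").astype(int)  # 1 = present, 0 = absent

# Include all but one dummy: DFEMALE only, so males are the omitted contrast
X = sm.add_constant(df[["educ", "DFEMALE"]])
model = sm.OLS(df["prestige"], X).fit()
print(model.params)   # constant = line for men; DFEMALE = shift for women
```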

Dummy Variables. Question: why can't you include DFEMALE and DMALE in the same regression model? Answer: they are perfectly (negatively) correlated: r = -1. Result: the regression model "blows up". For any set of nominal categories, a full set of dummies contains redundant information: DMALE and DFEMALE contain the same information, and dropping one removes the redundancy.

Dummy Variables: Interpretation. Consider the following regression equation: Y = a + b1X + b2(DFEMALE) + e. Question: what if the case is a male? Answer: DFEMALE is 0, so the entire b2 term becomes zero. Result: males are modeled using the familiar regression model: Y = a + b1X + e.

Dummy Variables: Interpretation. Consider the same regression equation: Y = a + b1X + b2(DFEMALE) + e. Question: what if the case is a female? Answer: DFEMALE is 1, so b2(1) stays in the equation (and is added to the constant). Result: females are modeled using a different regression line: Y = (a + b2) + b1X + e. Thus, the coefficient b2 reflects the difference in the constant for women.

Dummy Variables: Interpretation. Remember, a different constant generates a different line, either higher or lower. Variable: DFEMALE (women = 1, men = 0). A positive coefficient (b2) indicates that women are consistently higher than men on the dependent variable; a negative coefficient indicates that women are lower. Example: if the DFEMALE coefficient is 1.2, women are on average 1.2 points higher than men.

Dummy Variables: Interpretation. Visually: [Figure: HAPPY (1-10) plotted against INCOME (20,000-100,000), women in blue, men in red, with the overall slope drawn through all data points.] Note: the lines for men and women have the same slope, but one is higher and the other lower; the constant differs! If women = 1 and men = 0, the constant (a) reflects men only, and the dummy coefficient (b2) reflects the increase for women relative to men.

Dummy Variables. What if you want to compare more than 2 groups? Example: race, coded 1 = white, 2 = black, 3 = other (as in the GSS). Make 3 dummy variables: "DWHITE" is 1 for whites, 0 for everyone else; "DBLACK" is 1 for African Americans, 0 for everyone else; "DOTHER" is 1 for "others", 0 for everyone else. Then, include two of the three variables in the multiple regression model.
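
pandas can build the full dummy set in one call; a sketch (note that get_dummies names the columns D_black, D_other, D_white rather than the slides' DBLACK, DOTHER, DWHITE):

```python
import pandas as pd

race = pd.Series(["white", "black", "other", "white", "black"])
dummies = pd.get_dummies(race, prefix="D").astype(int)
print(dummies)   # columns D_black, D_other, D_white, coded 0/1

# Include two of the three in the regression; e.g., drop the white column
# so whites become the omitted contrast category:
X_cols = dummies.drop(columns=["D_white"])
```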

Dummy Variables: Interpretation. Ex: job prestige. A negative coefficient for DBLACK indicates a lower level of job prestige compared to whites; the t- and p-values indicate whether the difference is significant.

Dummy Variables: Interpretation. Comments: 1. Dummy coefficients shouldn't be called slopes; referring to the "slope" of gender doesn't make sense. Rather, a dummy coefficient is the difference in the constant (or "level"). 2. The contrast is always with the nominal category that was left out of the equation: if DFEMALE is included, the contrast is with males; if DBLACK and DOTHER are included, the coefficients reflect differences in the constant compared to whites.