Regression 1 Sociology 8811 Copyright © 2007 by Evan Schofer

Announcements None!

Linear Functions
Formula: Y = a + bX is a linear formula. If you graphed X and Y for any chosen values of a and b, you'd get a straight line.
It is a family of functions: for any value of a and b, you get a particular line.
a is referred to as the "constant" or "intercept"
b is referred to as the "slope"
To graph a linear function: pick values for X, compute the corresponding values of Y, then connect the dots to graph the line.

Linear Functions: Y = a + bX
The "constant" or "intercept" (a) determines where the line intersects the Y-axis. If a increases (decreases), the line moves up (down).
[Figure: three parallel lines with slope -1.5 and different intercepts: Y = 14 - 1.5X, Y = 3 - 1.5X, Y = -9 - 1.5X]

Linear Functions: Y = a + bX
The slope (b) determines the steepness of the line.
[Figure: three lines with different slopes: Y = 3 + 3X, Y = 3 - 1.5X, Y = 2 + .2X]

Linear Functions: Slopes
The slope (b) is the ratio of change in Y to change in X. The slope tells you how many points Y will increase for any single-point increase in X.
[Figure: the line Y = 3 + 3X, with change in X = 5 and change in Y = 15 marked; slope b = 15/5 = 3]

Linear Functions as Summaries
A linear function can be used to summarize the relationship between two variables.
[Figure: happiness plotted against income, with change in X = 40,000 and change in Y = 2 marked]
Slope: b = 2 / 40,000 = .00005 pts/$
If you change units: b = .05 pts/$1K = .5 pts/$10K = 5 pts/$100K

Linear Functions as Summaries
Slope and constant can be "eyeballed" to approximate a formula: Happy = 2 + .00005*Income
Slope (b): b = 2 / 40,000 = .00005 pts/$
Constant (a): the value where the line hits the Y-axis; here a = 2

Linear Functions as Summaries
Linear functions can powerfully summarize data. The formula Happy = 2 + .00005*Income gives a sense of how the two variables are related: people get a .00005 increase in happiness for every extra dollar of income (or 5 pts per $100K).
It also lets you "predict" values. What if someone earns $150,000? Happy = 2 + .00005($150,000) = 9.5
But be careful: you shouldn't assume that a relationship remains linear indefinitely. Also, negative income or happiness makes no sense.
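As an illustration (not part of the original slides), the eyeballed formula can be turned into a small Python function; the name predict_happy is ours:

```python
def predict_happy(income):
    """Predicted happiness from the slide's eyeballed line: Happy = 2 + .00005*Income."""
    a, b = 2.0, 0.00005   # constant and slope read off the graph
    return a + b * income

print(round(predict_happy(150_000), 2))  # 9.5, matching the slide
```

The same caution applies in code: the function will happily extrapolate to incomes where the linear summary no longer makes sense.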

Linear Functions as Summaries Come up with a linear function that summarizes this real data: years of education vs. job prestige It isn’t always easy! The line you choose depends on how much you “weight” these points.

Linear Functions as Summaries One estimate of the linear function The line meets the Y-axis at Y=5. Thus a = 5 The line increases to about 65 as X reaches 20. The increase is 60 in Y per 20 in X. Thus: b = 60/20 = 3 Formula: Y = 5 + 3X

Linear Functions as Summaries Questions: How much additional job prestige do you get by going to college (an extra 4 years of education)? Formula: Prestige = 5 + 3*Education Answer: About 12 points of job prestige Change in X is 4… Slope is 3. 3 x 4 = 12 points If X=12, Y=5+3*12 = 41; If X=16, Y=5+3*16 = 53 What is the interpretation of the constant? It is the predicted job prestige of someone with zero years of education… (Prestige = 5)

Linear Functions as Prediction
Linear functions can summarize the relationship between two variables: Happy = 2 + .05*Income (in 1,000s)
Linear functions can also be used to "predict" (estimate) a case's value on one variable (Y_i) based on its value on another variable (X_i), if you know the constant and slope: Y-hat_i = a + b_YX*X_i
"Y-hat" indicates an estimation function; b_YX denotes the slope of Y with respect to X.

Prediction with Linear Functions
If X_i (Income) = 60K, what is our estimate of Y_i (Happiness)? Happy-hat = 2 + .05(60) = 5
There is a case with Income = 60K in the data, and the prediction is imperfect: that case falls at Y = 5.3 (above the line).

The Linear Regression Model
To model real data, we must take into account that points will miss the line. Similar to ANOVA, we refer to the deviation of points from the estimated value as "error" (e_i).
In ANOVA the estimated value is the group mean (i.e., the grand mean plus the group effect). In regression the estimated value is derived from the formula Y = a + bX: estimation is based on the value of X, the slope, and the constant (and assumes a linear relationship between X and Y).

The Linear Regression Model
The value of any point (Y_i) can be modeled as: Y_i = a + b_YX*X_i + e_i
The value of Y for case i is made up of: a constant (a); a sloping function of the case's value on variable X (b_YX); and an error term (e_i), the deviation from the line.
By adding error (e), an abstract mathematical function can be applied to real data points.

The Linear Regression Model
Visually: Y_i = a + b*X_i + e_i
[Figure: the line Y = 2 + .5X with Case 7 (X=3, Y=5) plotted above it. For this case: constant a = 2, bX = 3(.5) = 1.5, so the line gives 3.5 and the error is e = 1.5]
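The decomposition can be verified numerically; a quick Python check (ours) of the slide's Case 7 on the line Y = 2 + .5X:

```python
a, b = 2.0, 0.5        # constant and slope of the line in the figure
x_i, y_i = 3, 5        # Case 7: X=3, Y=5
y_hat = a + b * x_i    # value on the line: 2 + 1.5 = 3.5
e_i = y_i - y_hat      # deviation of the point from the line: 1.5
assert y_i == a + b * x_i + e_i   # Y_i = a + b*X_i + e_i holds exactly
print(y_hat, e_i)      # 3.5 1.5
```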

Estimating Linear Equations
Question: How do we choose the best line to describe our real data? Previously, we just "eyeballed" it.
Answer: Look at the error. If a given line formula misses points by a lot, the observed error will be large. If the line is as close to all points as possible, observed error will be small.
Of course, even the best line has some error, except when all data points fall perfectly on a line.

Estimating Linear Equations
A poor estimation (big error)
[Figure: scatterplot with the line Y = 1.5 - 1X missing most of the points]

Estimating Linear Equations
Better estimation (less error)
[Figure: the same scatterplot with the line Y = 2 + .5X passing close to the points]

Estimating Linear Equations
Look at the improvement (reduction) in error:
[Figure: the two scatterplots side by side, high error vs. low error]

Estimating Linear Equations
Idea: the "best" line is the one that has the least error (deviation from the line).
Total deviation from the line can be expressed as: sum of e_i = sum of (Y_i - Y-hat_i)
But positive and negative deviations cancel, so to make all deviation positive we square it, producing the "sum of squares error": SSE = sum of (Y_i - Y-hat_i)²
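The sum of squares error translates into a one-line Python helper (the name sse and the toy points are ours):

```python
def sse(xs, ys, a, b):
    """Sum of squared deviations of the points (xs, ys) from the line Y = a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# A line passing through both points has zero error; a worse line has more:
print(sse([0, 1], [2, 2.5], 2, 0.5))  # 0.0
print(sse([0, 1], [2, 2.5], 1, 0))    # 3.25
```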

Estimating Linear Equations
Goal: find the values of the constant (a) and slope (b) that produce the lowest squared error: the "least squares" regression line.
The formula for the slope (b) that yields the "least squares error" is: b = s_YX / s²_X
where s²_X is the variance of X and s_YX is the covariance of Y and X.

Covariance
Variance: the sum of squared deviations about Y-bar, divided by N-1: s²_Y = Σ(Y_i - Y-bar)² / (N-1)
Covariance (s_YX): the sum of deviations about Y-bar multiplied by deviations about X-bar, divided by N-1: s_YX = Σ(Y_i - Y-bar)(X_i - X-bar) / (N-1)
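Both definitions map directly to code; a minimal Python sketch (the helper names are ours):

```python
def variance(vals):
    """Sum of squared deviations about the mean, divided by N-1."""
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / (len(vals) - 1)

def covariance(xs, ys):
    """Sum of X deviations times Y deviations, divided by N-1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

print(variance([1, 2, 3]))               # 1.0
print(covariance([1, 2, 3], [2, 4, 6]))  # 2.0 (Y deviates exactly twice as much as X)
```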

Covariance
Covariance: a measure of how much deviation of a case in X is accompanied by deviation in Y. It measures whether deviation (from the mean) in X tends to be accompanied by similar deviation in Y, or whether cases with positive deviation in X have negative deviation in Y. This is summed up over all cases in the data.
The covariance is one numerical measure that characterizes the extent of linear association, as is the correlation coefficient (r).

Regression Example
Example: study time and student achievement.
X variable: average # hours spent studying per day
Y variable: score on reading test

Case    X     Y
1      2.6   28
2      1.4   13
3      .65   17
4      4.1   31
5      .25    8
6      1.9   16

X-bar = 1.8, Y-bar = 18.8
[Figure: scatterplot of the six cases]

Regression Example
Slope = covariance (X and Y) / variance of X. X-bar = 1.8, Y-bar = 18.8

Case    X     Y    X Dev    Y Dev    XD*YD
1      2.6   28     0.8      9.2      7.36
2      1.4   13    -0.4     -5.8      2.32
3      .65   17    -1.15    -1.8      2.07
4      4.1   31     2.3     12.2     28.06
5      .25    8    -1.55   -10.8     16.74
6      1.9   16     0.1     -2.8     -0.28

Sum of X deviation * Y deviation = 56.27

Regression Example
Calculating the covariance: s_YX = 56.27 / (N-1) = 56.27 / 5 = 11.25
Standard deviation of X = 1.4; variance = square of S.D. = 1.96
Finally: b = s_YX / s²_X = 11.25 / 1.96 ≈ 5.7

Regression Example
Results: slope b ≈ 5.7, constant a = Y-bar - b*X-bar ≈ 8.4
Equation: TestScore = 8.4 + 5.7*HrsStudied
Question: what is the interpretation of b? Answer: for every additional hour studied, test scores increase by about 5.7 points.
Question: what is the interpretation of the constant? Answer: individuals who studied zero hours are predicted to score about 8.4 on the test.
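Recomputing the example in Python with unrounded means (the table above rounds X-bar and Y-bar) gives essentially the same line; this sketch is ours, not from the slides:

```python
xs = [2.6, 1.4, 0.65, 4.1, 0.25, 1.9]   # hours studied, from the table
ys = [28, 13, 17, 31, 8, 16]            # test scores

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)

b = cov / var_x          # slope = covariance / variance of X
a = y_bar - b * x_bar    # constant = Y-bar minus b times X-bar
print(round(b, 2), round(a, 2))  # 5.73 8.43
```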

Computing Regressions
Regression coefficients can be calculated in SPSS; you will rarely, if ever, do them by hand. SPSS will estimate:
the value of the constant (a)
the value of the slope (b)
plus a large number of related statistics and results of hypothesis testing procedures

Example: Education & Job Prestige Example: Years of Education versus Job Prestige Previously, we made an “eyeball” estimate of the line Our estimate: Y = 5 + 3X

Example: Education & Job Prestige
The actual SPSS regression results for that data give the estimates of a and b: "Constant" = a = 9.427; slope for "Year of School" = b = 2.487
Equation: Prestige = 9.4 + 2.5*Education
A year of education adds about 2.5 points of job prestige.

Example: Education & Job Prestige
Comparing our "eyeball" estimate (Y = 5 + 3X) to the actual OLS regression line computed in SPSS.
[Figure: scatterplot with both lines drawn]

R-Square
The R-Square statistic indicates how well the regression line "explains" variation in Y. It is based on partitioning variance into:
1. Explained ("regression") variance: the portion of deviation from Y-bar accounted for by the regression line
2. Unexplained ("error") variance: the portion of deviation from Y-bar that is "error"
Formula: R² = explained variance / total variance = 1 - SSE/SST
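The formula can be sketched in Python (the function name and toy data are ours):

```python
def r_square(xs, ys, a, b):
    """R² = 1 - SSE/SST: the share of variation in Y explained by the line."""
    y_bar = sum(ys) / len(ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # unexplained ("error")
    sst = sum((y - y_bar) ** 2 for y in ys)                    # total variation in Y
    return 1 - sse / sst

print(r_square([0, 1, 2], [1, 3, 5], 1, 2))  # 1.0: a perfect fit explains everything
```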

R-Square
Visually: deviation from Y-bar is partitioned into two parts.
[Figure: the line Y = 2 + .5X with a point's total deviation from Y-bar split into "explained variance" (from Y-bar up to the line) and "error variance" (from the line to the point)]

Example: Education & Job Prestige
R-Square & hypothesis testing information: the R and R-Square indicate how well the line summarizes the data. This information allows us to do hypothesis tests about the constant & slope.

Hypothesis Tests: Slopes
Given: observed slope relating Education to Job Prestige = 2.487
Question: can we generalize this to the population of all Americans? How likely is it that this observed slope was actually drawn from a population with slope = 0?
Solution: conduct a hypothesis test. Notation: sample slope = b, population slope = β
H0: population slope β = 0
H1: population slope β ≠ 0 (two-tailed test)

Example: Slope Hypothesis Test
The actual SPSS regression results for that data: the t-value and "sig." (p-value) are for hypothesis tests about the slope.
Reject H0 if: t-value > critical t (N-2 df), or "sig." (p-value) is less than α (often α = .05)

Hypothesis Tests: Slopes
What information lets us do a hypothesis test?
Answer: estimates of a slope (b) have a sampling distribution, like any other statistic. It is the distribution of every value of the slope, based on all possible samples (of size N). If certain assumptions are met, the sampling distribution approximates the t-distribution.
Thus, we can assess the probability that a given value of b would be observed if β = 0. If that probability is low (below alpha), we reject H0.

Hypothesis Tests: Slopes
Visually: if the population slope (β) is zero, then the sampling distribution would center at zero. Since the sampling distribution is a probability distribution, we can identify the likely values of b if the population slope is zero.
If β = 0, observed slopes should commonly fall near zero, too.
[Figure: sampling distribution of the slope, centered at β]
If the observed slope falls very far from 0, it is improbable that β is really equal to zero. Thus, we can reject H0.
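A small simulation (ours, not from the slides) makes this concrete: draw many samples in which Y is unrelated to X, so the population slope is truly 0, and look at where the estimated slopes fall:

```python
import random

random.seed(0)

def ols_slope(xs, ys):
    """Least-squares slope: sum of cross-deviations over sum of squared X deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = sum((x - x_bar) ** 2 for x in xs)
    return num / den

slopes = []
for _ in range(2000):
    xs = [random.uniform(0, 10) for _ in range(30)]
    ys = [random.gauss(5, 2) for _ in range(30)]  # Y unrelated to X: beta = 0
    slopes.append(ols_slope(xs, ys))

mean_slope = sum(slopes) / len(slopes)
print(mean_slope)  # close to zero: the sampling distribution centers at beta
```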

Regression Assumptions
Assumptions of simple (bivariate) regression. If assumptions aren't met, hypothesis tests may be inaccurate.
1. Random sample with sufficient N (N > ~20)
2. Linear relationship among variables: check the scatterplot for a non-linear pattern (a "cloud" is OK)
3. Conditional normality: Y is normal at all values of X. Check histograms of Y for normality at several values of X
4. Homoskedasticity: equal error variance at all values of X. Check the scatterplot for "bulges" or "fanning out" of error across values of X
Additional assumptions are required for multivariate regression.

Bivariate Regression Assumptions
Normality: examine sub-samples at different values of X; make histograms and check for normality.
[Figure: two example histograms, one labeled "Good", one "Not very good"]

Bivariate Regression Assumptions Homoskedasticity: Equal Error Variance Examine error at different values of X. Is it roughly equal? Here, things look pretty good.

Bivariate Regression Assumptions Heteroskedasticity: Unequal Error Variance At higher values of X, error variance increases a lot. This looks pretty bad.

Regression Hypothesis Tests
If assumptions are met, the sampling distribution of the slope (b) approximates a t-distribution. The standard deviation of the sampling distribution is called the standard error of the slope (s_b).
Population formula for the standard error: σ_b = sqrt( σ²_e / Σ(X_i - X-bar)² )
where σ²_e is the variance of the regression error.

Regression Hypothesis Tests
Estimating σ²_e lets us estimate the standard error. The error variance is estimated from the sum of squared errors: s²_e = Σ(Y_i - Y-hat_i)² / (N-2)
Now we can estimate the S.E. of the slope: s_b = sqrt( s²_e / Σ(X_i - X-bar)² )

Regression Hypothesis Tests
Finally, a t-value can be calculated: it is the slope divided by the standard error: t = b / s_b
where s_b is the sample point estimate of the S.E. The t-value is based on N-2 degrees of freedom.
Reject H0 if the observed t > critical t (e.g., 1.96 for a large sample at α = .05).
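Putting the pieces together on the study-time data from earlier (a sketch; b, a, and the standard error are recomputed from the raw table):

```python
import math

xs = [2.6, 1.4, 0.65, 4.1, 0.25, 1.9]   # hours studied
ys = [28, 13, 17, 31, 8, 16]            # test scores
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

ss_x = sum((x - x_bar) ** 2 for x in xs)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ss_x
a = y_bar - b * x_bar

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
s_e2 = sse / (n - 2)               # estimated error variance, N-2 df
se_b = math.sqrt(s_e2 / ss_x)      # standard error of the slope
t = b / se_b
print(t)  # compare to critical t with N-2 = 4 df (2.776 at alpha = .05, two-tailed)
```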

Example: Education & Job Prestige
T-values can be compared to the critical t. SPSS estimates the standard error of the slope; this is used to calculate a t-value. The t-value can be compared to the "critical value" to test hypotheses, or just compare "Sig." to alpha.
If t > critical t, or Sig. < alpha, reject H0.