Regression Analysis

Introduction to Regression Analysis (RA): Regression analysis is used to estimate a function f( ) that describes the relationship between a continuous dependent variable and one or more independent variables:

Y = f(X1, X2, X3, …, Xn) + ε

Note: f( ) describes the systematic variation in the relationship; ε represents the unsystematic variation (or random error) in the relationship.

In other words, the observations we are interested in can be separated into two parts:

Y = f(X1, X2, X3, …, Xn) + ε
Observations = Model + Error
Observations = Signal + Noise

Ideally, the noise should be very small compared to the model.

Signal to Noise: What we observe can be divided into two parts: what we see = signal + noise.

Model specification: If the true function is

y_i = β0 + β1 X_i + β2 Z_i

and we fit

y_i = b0 + b1 X_i + b2 Z_i + e_i

our model is exactly specified and we obtain unbiased and efficient estimates.

Model specification: If instead the true function is

y_i = β0 + β1 X_i + β2 Z_i + β3 X_i Z_i + β4 Z_i²

and we fit

y_i = b0 + b1 X_i + b2 Z_i + e_i

our model is underspecified: we excluded some necessary terms, and we obtain biased estimates.

Model specification: On the other hand, if the true function is

y_i = β0 + β1 X_i + β2 Z_i

and we fit

y_i = b0 + b1 X_i + b2 Z_i + b3 X_i Z_i + e_i

our model is overspecified: we included some unnecessary terms, and we obtain inefficient estimates.

Model specification:
–If you specify the model exactly, there is no bias.
–If you overspecify the model (add more terms than needed), the result is unbiased but inefficient.
–If you underspecify the model (omit one or more necessary terms), the result is biased.
Overall strategy:
–The best option is to exactly specify the true function.
–We would prefer to err by overspecifying our model, because that only leads to inefficiency.
–Therefore, start with a likely overspecified model and reduce it.
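The bias from underspecification can be seen in a small simulation. This is a sketch in Python/numpy with made-up data, not part of the slides: the true function is quadratic, and omitting the squared term (which is correlated with the included x) biases the slope estimate, while the exactly specified fit recovers it.

```python
# Hypothetical illustration: true model y = 2 + 3x + 4x^2 + noise.
# An underspecified straight-line fit gives a badly biased slope;
# the exactly specified quadratic fit recovers the true slope.
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(0, 10, n)
y = 2 + 3 * x + 4 * x**2 + rng.normal(0, 1, n)

# Exactly specified: y = b0 + b1*x + b2*x^2 (polyfit returns high degree first)
b2_exact, b1_exact, b0_exact = np.polyfit(x, y, 2)

# Underspecified: y = b0 + b1*x, with the necessary x^2 term omitted
b1_under, b0_under = np.polyfit(x, y, 1)

print(f"exact slope:  {b1_exact:.2f}")   # close to the true value 3
print(f"biased slope: {b1_under:.2f}")   # far above 3: omitted-term bias
```

Note the bias appears because the omitted x² term is correlated with the included x; an omitted term uncorrelated with the regressors would inflate the error instead.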

An Example: Consider the relationship between advertising (X1) and sales (Y) for a company. There probably is a relationship: as advertising increases, sales should increase. But how would we measure and quantify this relationship?

A Scatter Plot of the Data: [Figure: Advertising (in $1,000s) on the X axis versus Sales (in $1,000s) on the Y axis.]

The Nature of a Statistical Relationship: [Figure: a regression curve passing through the probability distributions for Y at different levels of X.]

A Simple Linear Regression Model: The scatter plot shows a linear relation between advertising and sales, so the following regression model is suggested by the data:

Y = β0 + β1 X1 + ε

This refers to the true relationship between the entire population of advertising and sales values. The estimated regression function (based on our sample) will be represented as:

Ŷ = b0 + b1 X1

Determining the Best Fit: Numerical values must be assigned to b0 and b1. The method of "least squares" selects the values that minimize the error sum of squares:

ESS = Σ (Yi − Ŷi)²

If ESS = 0, our estimated function fits the data perfectly. We could solve this problem using Solver…

Estimation – Linear Regression: Formula for a straight line:

y = b0 + b1 x + e

We want to solve for b0 (the intercept) and b1 (the slope, Δy / Δx), where y is the outcome and x the program input.
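The slides solve this with Solver; the same intercept and slope also have closed-form least squares solutions, sketched below in Python/numpy with hypothetical data (not the slides' advertising dataset): b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and b0 = ȳ − b1·x̄.

```python
# Closed-form least squares estimates for a straight line (hypothetical data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # e.g. advertising ($1,000s)
y = np.array([7.1, 9.0, 11.2, 12.9, 15.1])  # e.g. sales ($1,000s)

# b1 = Sxy / Sxx, b0 = ybar - b1 * xbar
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
print(f"y-hat = {b0:.3f} + {b1:.3f} x")
```

These are the same values Solver would find by minimizing ESS, since ESS is a smooth convex function of b0 and b1 with a unique minimum.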

The Estimated Regression Function: The estimated regression function is Ŷ = b0 + b1 X1, with b0 and b1 the least squares intercept and slope.

Evaluating the "Fit": [Figure: scatter plot of Advertising (in $000s) versus Sales (in $000s) with the fitted regression line and its R² value.]

The R² Statistic: The R² statistic indicates how well an estimated regression function fits the data (0 ≤ R² ≤ 1). It measures the proportion of the total variation in Y around its mean that is accounted for by the estimated regression equation. To understand this better, consider the following graph…
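The definition "proportion of total variation accounted for" translates directly into R² = 1 − ESS/TSS. A minimal numpy sketch, again on hypothetical data:

```python
# R^2 as the proportion of variation in y explained by the fitted line
# (hypothetical data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([7.1, 9.0, 11.2, 12.9, 15.1])
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

ess = np.sum((y - y_hat) ** 2)        # error (residual) sum of squares
tss = np.sum((y - y.mean()) ** 2)     # total sum of squares around the mean
r_squared = 1 - ess / tss
print(f"R^2 = {r_squared:.4f}")       # 1.0 would be a perfect fit
```

In Excel this is the value returned by RSQ([y-range],[x-range]).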

Error Decomposition: For each observation, the deviation of the actual value Yi from the mean Ȳ splits into an explained part and a residual:

Yi − Ȳ = (Ŷi − Ȳ) + (Yi − Ŷi)

where Ŷi = b0 + b1 Xi is the estimated value.

Partition of the Total Sum of Squares:

Σ (Yi − Ȳ)² = Σ (Yi − Ŷi)² + Σ (Ŷi − Ȳ)²

or, TSS = ESS + RSS, where TSS is the total sum of squares, ESS the error sum of squares, and RSS the regression sum of squares.

Degree of Linear Correlation:
–R² = 1 means perfect linear correlation; R² = 0 means no correlation.
–A high R² indicates a good fit only if the linear model is appropriate; always check with a scatter plot.
–Correlation does not prove causation; x and y may both be correlated with a third (possibly unidentified) variable.
–A more popular (but less meaningful) measure is the "correlation coefficient" r.
In Excel: R² = RSQ([y-range],[x-range]) and r = CORREL([y-range],[x-range]).

[Figure: scatter plot with R² = 0.67.]

Testing for Significance: F Test
Hypotheses: H0: β1 = 0 versus Ha: β1 ≠ 0.
Test statistic: F = MSR / MSE.
Rejection rule: Reject H0 if F > Fα, where Fα is based on an F distribution with 1 degree of freedom in the numerator and n − 2 degrees of freedom in the denominator.
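The F statistic is just the regression mean square over the error mean square. A numpy sketch on hypothetical data follows; the critical value F(0.05; 1, 3) ≈ 10.13 is the standard table value for these degrees of freedom.

```python
# F test for overall significance of a simple regression (hypothetical data):
# F = MSR / MSE with (1, n - 2) degrees of freedom.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([7.1, 9.0, 11.2, 12.9, 15.1])
n = len(x)
b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

msr = np.sum((y_hat - y.mean()) ** 2) / 1     # regression mean square (1 df)
mse = np.sum((y - y_hat) ** 2) / (n - 2)      # error mean square (n - 2 df)
f_stat = msr / mse

f_crit = 10.13                                # table value F(0.05; 1, 3)
print(f"F = {f_stat:.1f}; "
      + ("reject H0: beta1 = 0" if f_stat > f_crit else "fail to reject H0"))
```

For simple regression this F test is equivalent to the two-sided t test on b1, since F = t² when the numerator has 1 degree of freedom.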

Some Cautions about the Interpretation of Significance Tests: Rejecting H0: β1 = 0 and concluding that the relationship between x and y is significant does not enable us to conclude that a cause-and-effect relationship is present between x and y. Likewise, being able to reject H0: β1 = 0 and demonstrate statistical significance does not enable us to conclude that the relationship between x and y is linear.

An Example of Inappropriate Interpretation: A study shows that, in elementary schools, spelling ability is stronger among students with larger feet. Could we conclude that foot size influences spelling ability? Or does another factor (such as age) influence both foot size and spelling ability?

Making Predictions: Suppose we want to estimate the average level of sales expected if $65,000 is spent on advertising. Plugging X1 = 65 into the estimated regression function gives Estimated Sales = 397.092, so we expect the average sales level to be $397,092.

The Standard Error: The standard error measures the scatter in the actual data around the estimated regression line:

Se = sqrt( ESS / (n − k − 1) )

where k = the number of independent variables. For our example, Se = 20.421. This is helpful in making predictions…

An Approximate Prediction Interval: An approximate 95% prediction interval for a new value of Y when X1 = X1h is given by

Ŷh ± 2 Se,   where Ŷh = b0 + b1 X1h

Example: If $65,000 is spent on advertising:
95% lower prediction limit = 397.092 − 2 × 20.421 = 356.250
95% upper prediction limit = 397.092 + 2 × 20.421 = 437.934
If we spend $65,000 on advertising, we are approximately 95% confident actual sales will be between $356,250 and $437,934.
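The Ŷh ± 2·Se recipe is easy to compute once Se is in hand. A numpy sketch with hypothetical data (not the advertising dataset):

```python
# Approximate 95% prediction interval: y-hat +/- 2 * Se, where
# Se = sqrt(ESS / (n - k - 1)) and k = 1 for simple regression.
# Data are hypothetical.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([7.1, 9.0, 11.2, 12.9, 15.1])
n, k = len(x), 1
b1, b0 = np.polyfit(x, y, 1)
se = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (n - k - 1))

x_new = 3.5                         # a new X value inside the sample range
y_new = b0 + b1 * x_new
lower, upper = y_new - 2 * se, y_new + 2 * se
print(f"prediction: {y_new:.3f}, approx 95% PI: ({lower:.3f}, {upper:.3f})")
```

The "2" stands in for the t critical value; the exact interval on the next slide replaces it with t(α/2, n−2) and widens Se to the standard prediction error Sp.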

An Exact Prediction Interval: A (1 − α)100% prediction interval for a new value of Y when X1 = X1h is given by

Ŷh ± t(α/2, n−2) × Sp

where Ŷh = b0 + b1 X1h and

Sp = Se × sqrt( 1 + 1/n + (X1h − X̄1)² / Σ(X1i − X̄1)² )

Example: If $65,000 is spent on advertising:
95% lower prediction limit = 347.556
95% upper prediction limit = 446.666
If we spend $65,000 on advertising, we are 95% confident actual sales will be between $347,556 and $446,666. This interval is only about $20,000 wider than the approximate one calculated earlier, but was much more difficult to create. The greater accuracy is not always worth the trouble.

Comparison of Prediction Interval Techniques: [Figure: Sales versus Advertising Expenditures, showing the regression line, prediction intervals created using the standard error Se, and prediction intervals created using the standard prediction error Sp.]

Confidence Intervals for the Mean: A (1 − α)100% confidence interval for the true mean value of Y when X1 = X1h is given by

Ŷh ± t(α/2, n−2) × Se × sqrt( 1/n + (X1h − X̄1)² / Σ(X1i − X̄1)² )

A Note About Extrapolation Predictions made using an estimated regression function may have little or no validity for values of the independent variables that are substantially different from those represented in the sample.

What Does “Regression” Mean?

1. Draw a "best-fit" line free hand.
2. Find mothers with height = 60″ and find the average daughter's height.
3. Repeat for mothers' heights = 62″, 64″, … 70″; draw a "best-fit" line through these points.
4. Draw the line daughter's height = mother's height.
5. For a given mother's height, the daughter's height tends to be between the mother's height and the mean height: "regression toward the mean."

What Does “Regression” Mean?

Residual Analysis: The residual for observation i is yi − ŷi. The standardized residual for observation i is (yi − ŷi) / s(yi − ŷi), where s(yi − ŷi) is the standard deviation of residual i.


Residual Analysis: Detecting Outliers
–An outlier is an observation that is unusual in comparison with the other data.
–Minitab classifies an observation as an outlier if its standardized residual is less than −2 or greater than +2.
–This standardized residual rule sometimes fails to identify an unusually large observation as an outlier.
–This shortcoming can be circumvented by using studentized deleted residuals.
–The absolute value of the i-th studentized deleted residual will be larger than that of the i-th standardized residual.

Multiple Regression Analysis: Most regression problems involve more than one independent variable. If each independent variable varies in a linear manner with Y, the estimated regression function in this case is:

Ŷ = b0 + b1 X1 + b2 X2 + … + bk Xk

The optimal values for the bi can again be found by minimizing the ESS. The resulting function fits a hyperplane to our sample data.
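Minimizing ESS over several coefficients is a linear least squares problem on a design matrix. A numpy sketch with hypothetical (noiseless) data, so the fitted hyperplane recovers the true coefficients exactly:

```python
# Fitting a hyperplane y = b0 + b1*x1 + b2*x2 by least squares
# (hypothetical data, constructed without noise for illustration).
import numpy as np

rng = np.random.default_rng(1)
x1 = rng.uniform(0, 1, 50)
x2 = rng.uniform(0, 1, 50)
y = 4.0 + 2.5 * x1 - 1.5 * x2      # assumed true relationship

# Design matrix: a column of ones for the intercept, then the regressors
X = np.column_stack([np.ones_like(x1), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"b0 = {b[0]:.2f}, b1 = {b[1]:.2f}, b2 = {b[2]:.2f}")
```

The same pattern extends to k regressors by adding columns to the design matrix, which is exactly why Excel requires the X-range to be one contiguous block.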

Example Regression Surface for Two Independent Variables: [Figure: a fitted plane over the (X1, X2) axes with Y on the vertical axis, and sample points (*) scattered above and below the plane.]

Multiple Regression Example: Real Estate Appraisal A real estate appraiser wants to develop a model to help predict the fair market values of residential properties. Three independent variables will be used to estimate the selling price of a house: –total square footage –number of bedrooms –size of the garage

Selecting the Model: We want to identify the simplest model that adequately accounts for the systematic variation in the Y variable. Arbitrarily using all the independent variables may result in overfitting. A sample reflects two kinds of characteristics: those representative of the population, and those specific to the sample. We want to avoid fitting sample-specific characteristics — that is, overfitting the data.

Models with One Independent Variable: With simplicity in mind, suppose we fit three simple linear regression functions. Key regression results are:

Variables in the Model | R² | Adjusted R² | Se | Parameter Estimates
X1 | … | … | … | b0 = 9.503, b1 = …
X2 | … | … | … | b0 = 78.290, b2 = …
X3 | … | … | … | b0 = 16.250, b3 = …

The model using X1 accounts for 87% of the variation in Y, leaving 13% unaccounted for.

Important Software Note When using more than one independent variable, all variables for the X-range must be in one contiguous block of cells (that is, in adjacent columns).

Models with Two Independent Variables: Now suppose we fit the following models with two independent variables. Key regression results are:

Variables in the Model | R² | Adjusted R² | Se | Parameter Estimates
X1 | … | … | … | b0 = 9.503, b1 = …
X1 & X2 | … | … | … | b0 = 27.684, b1 = …, b2 = …
X1 & X3 | … | … | … | b0 = 8.311, b1 = …, b3 = 6.743

The model using X1 and X2 accounts for 93.9% of the variation in Y, leaving 6.1% unaccounted for.

The Adjusted R 2 Statistic As additional independent variables are added to a model: –The R 2 statistic can only increase. –The Adjusted-R 2 statistic can increase or decrease. The R 2 statistic can be artificially inflated by adding any independent variable to the model. We can compare adjusted-R 2 values as a heuristic to tell if adding an additional independent variable really helps.
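The adjustment applies the penalty 1 − (1 − R²)(n − 1)/(n − k − 1). A small sketch with hypothetical numbers showing how a tiny R² gain from an extra variable can still lower the adjusted value:

```python
# Adjusted R^2 penalizes extra regressors:
#   adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)
# Example figures below are hypothetical.
def adjusted_r2(r2: float, n: int, k: int) -> float:
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Suppose adding a variable nudges R^2 from 0.870 to 0.872 on n = 30 rows:
before = adjusted_r2(0.870, n=30, k=1)   # one regressor
after = adjusted_r2(0.872, n=30, k=2)    # two regressors
print(f"adjusted R^2: {before:.4f} -> {after:.4f}")  # goes down, not up
```

Since plain R² can only rise when a variable is added, a drop in adjusted R² is the signal that the new variable is not pulling its weight.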

A Comment On Multicollinearity It should not be surprising that adding X 3 (# of bedrooms) to the model with X 1 (total square footage) did not significantly improve the model. Both variables represent the same (or very similar) things -- the size of the house. These variables are highly correlated (or collinear). Multicollinearity should be avoided.

Testing for Significance: Multicollinearity The term multicollinearity refers to the correlation among the independent variables. When the independent variables are highly correlated (say, |r | >.7), it is not possible to determine the separate effect of any particular independent variable on the dependent variable. If the estimated regression equation is to be used only for predictive purposes, multicollinearity is usually not a serious problem. Every attempt should be made to avoid including independent variables that are highly correlated.

Model with Three Independent Variables: Now suppose we fit the following model with three independent variables. Key regression results are:

Variables in the Model | R² | Adjusted R² | Se | Parameter Estimates
X1 | … | … | … | b0 = 9.503, b1 = …
X1 & X2 | … | … | … | b0 = 27.684, b1 = 38.576, b2 = …
X1, X2 & X3 | … | … | … | b0 = 26.440, b1 = 30.803, b2 = 12.567, b3 = 4.576

The model using X1 and X2 appears to be best:
–Highest adjusted R²
–Lowest Se (most precise prediction intervals)

Making Predictions: Let's estimate the average selling price of a house with 2,100 square feet and a 2-car garage. The estimated average selling price is $134,444. A 95% prediction interval for the actual selling price is approximately:
95% lower prediction limit = 134.444 − 2 × 7.471 = $119,502
95% upper prediction limit = 134.444 + 2 × 7.471 = $149,386

Binary Independent Variables: Other non-quantitative factors can be included as independent variables in the analysis using binary (dummy) variables. Example: the presence (or absence) of a swimming pool. Example: whether the roof is in good, average, or poor condition.
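The coding convention is one 0/1 column for a two-level factor, and L − 1 columns for an L-level factor with one level as the baseline. A sketch with hypothetical housing rows (pool yes/no; roof good/average/poor, "good" as baseline):

```python
# Dummy-variable coding sketch (hypothetical data):
# - pool:  one 0/1 column
# - roof:  3 levels -> 2 columns, with "good" as the omitted baseline
import numpy as np

pool = ["yes", "no", "no", "yes"]
roof = ["good", "average", "poor", "good"]

pool_dummy = np.array([1 if p == "yes" else 0 for p in pool])
roof_average = np.array([1 if r == "average" else 0 for r in roof])
roof_poor = np.array([1 if r == "poor" else 0 for r in roof])

# These columns would be appended to the numeric design matrix
X_extra = np.column_stack([pool_dummy, roof_average, roof_poor])
print(X_extra)
```

Using L − 1 columns rather than L avoids making the dummy columns sum to the intercept column, which would create perfect multicollinearity.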

Polynomial Regression Sometimes the relationship between a dependent and independent variable is not linear. This graph suggests a quadratic relationship between square footage (X) and selling price (Y).

The Regression Model: An appropriate regression function in this case might be

Ŷ = b0 + b1 X1 + b2 X1²

or equivalently

Ŷ = b0 + b1 X1 + b2 X2,   where X2 = X1²
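The "or equivalently" trick means a quadratic model is still linear least squares: just add an X1² column to the design matrix. A numpy sketch with hypothetical (noiseless) data:

```python
# Quadratic regression as linear least squares on [1, X1, X1^2]
# (hypothetical data built from an exact quadratic for illustration).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = 1.0 + 0.5 * x + 0.8 * x**2     # assumed true quadratic, no noise

X = np.column_stack([np.ones_like(x), x, x**2])   # columns: 1, X1, X2 = X1^2
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"y-hat = {b[0]:.2f} + {b[1]:.2f} X1 + {b[2]:.2f} X1^2")
```

The model is "linear" because it is linear in the coefficients b0, b1, b2, even though the fitted curve in X1 is not a straight line.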

Implementing the Model

Graph of Estimated Quadratic Regression Function

Fitting a Third-Order Polynomial Model: We could also fit a third-order polynomial model,

Ŷ = b0 + b1 X1 + b2 X1² + b3 X1³

or equivalently

Ŷ = b0 + b1 X1 + b2 X2 + b3 X3,   where X2 = X1², X3 = X1³

Graph of Estimated Third Order Polynomial Regression Function

Overfitting When fitting polynomial models, care must be taken to avoid overfitting. The adjusted-R 2 statistic can be used for this purpose here also.
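The adjusted-R² check for polynomial overfitting can be sketched directly: on seeded hypothetical data whose true relationship is linear, raising the degree can only raise plain R², while adjusted R² applies a penalty for the extra terms.

```python
# Overfitting sketch (hypothetical, seeded data with a truly linear signal):
# plain R^2 never decreases with degree, but adjusted R^2 is penalized.
import numpy as np

rng = np.random.default_rng(42)
n = 30
x = np.linspace(0, 10, n)
y = 2 + 1.5 * x + rng.normal(0, 2, n)    # assumed true model is linear

def r2_pair(x, y, degree):
    coefs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coefs, x)
    r2 = 1 - np.sum(resid**2) / np.sum((y - y.mean()) ** 2)
    adj = 1 - (1 - r2) * (n - 1) / (n - degree - 1)
    return r2, adj

r2_1, adj_1 = r2_pair(x, y, 1)
r2_5, adj_5 = r2_pair(x, y, 5)
print(f"degree 1: R^2 = {r2_1:.4f}, adjusted = {adj_1:.4f}")
print(f"degree 5: R^2 = {r2_5:.4f}, adjusted = {adj_5:.4f}")
```

Because the degree-5 family contains every degree-1 line, its R² is guaranteed to be at least as high; it is the adjusted value that can reveal the extra terms as noise-chasing.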

Example: Programmer Salary Survey A software firm collected data for a sample of 20 computer programmers. A suggestion was made that regression analysis could be used to determine if salary was related to the years of experience and the score on the firm’s programmer aptitude test. The years of experience, score on the aptitude test, and corresponding annual salary ($1000s) for a sample of 20 programmers is shown on the next slide.

Example: Programmer Salary Survey — [Table: years of experience (Exper.), aptitude test score (Score), and annual salary ($1000s) for each of the 20 programmers.]

Example: Programmer Salary Survey — Multiple Regression Model: Suppose we believe that salary (y) is related to the years of experience (x1) and the score on the programmer aptitude test (x2) by the following regression model:

y = β0 + β1 x1 + β2 x2 + ε

where y = annual salary ($000s), x1 = years of experience, and x2 = score on the programmer aptitude test.

Example: Programmer Salary Survey — Multiple Regression Equation: Using the assumption E(ε) = 0, we obtain

E(y) = β0 + β1 x1 + β2 x2

Estimated regression equation: b0, b1, and b2 are the least squares estimates of β0, β1, and β2. Thus

ŷ = b0 + b1 x1 + b2 x2

Example: Programmer Salary Survey — Solving for the Estimates of β0, β1, β2: [Diagram: the input data (x1, x2, y) feed into a computer package for solving multiple regression problems; the least squares output reports b0, b1, b2, R², etc.]

Example: Programmer Salary Survey — Data Analysis Output: The fitted regression is Salary = b0 + b1 Exper + b2 Score.

Predictor | Coef | Stdev | t-ratio | p
Constant | … | … | … | …
Exper | … | … | … | …
Score | … | … | … | …

s = …, R-sq = 83.4%, R-sq(adj) = 81.5%

Example: Programmer Salary Survey — Computer Output (continued): Analysis of Variance

SOURCE | DF | SS | MS | F | P
Regression | … | … | … | … | …
Error | … | … | … | … | …
Total | … | … | … | … | …