LECTURE 04: LINEAR REGRESSION PT. 2 February 3, 2016 SDS 293 Machine Learning

Q&A 1: variance vs. bias Variance: the amount by which the model would change if we estimated it using different training data

Q&A 1: variance vs. bias Bias: the error that is introduced by approximating a real problem using a simple model

Q&A 2: statistical vs. practical significance So far we've only measured whether or not a variable is important using its statistical significance (p-value) What does the p-value actually tell us? Should we ever include variables that are not statistically significant? If yes, when?

Outline Motivation and Running Example: Advertising Simple Linear Regression - Estimating coefficients - How good is this estimate? - How good is the model? Multiple Linear Regression - Estimating coefficients - Important questions Dealing with Qualitative Predictors Extending the Linear Model - Removing the additive assumption - Non-linear relationships Potential Problems

Extending the linear model The linear regression model provides nice, interpretable results and is a good starting point for many applications (Thanks, Roger Hargreaves!)

Assumption 1: independent effects Think back to our model of the Advertising dataset

Reality: interaction effects Spend a lot on either TV or radio alone = the model overestimates sales; spend a balanced amount on both TV and radio = the model underestimates sales

Reality: interaction effects Suppose that spending money on radio advertising actually increases the effectiveness of TV advertising This means that the slope term for TV should increase as radio increases In the standard linear model, we didn't account for that: sales = β₀ + β₁ × TV + β₂ × radio + ε (the effect of TV on sales is always β₁, no matter how much is spent on radio)

Solution: interaction terms One solution: add a new term for the product of the two predictors: sales = β₀ + β₁ × TV + β₂ × radio + β₃ × (TV × radio) + ε How does this fix the problem?
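
A minimal sketch of fitting this model with the Python stack from today's lab (statsmodels' formula API); the file name Advertising.csv is an assumption about where you've saved the ISLR data:

import pandas as pd
import statsmodels.formula.api as smf

# Assumed file name; columns TV, radio, sales as in the ISLR dataset
ads = pd.read_csv("Advertising.csv")

# In a patsy formula, TV:radio adds just the interaction term;
# TV*radio would expand to TV + radio + TV:radio automatically
fit = smf.ols("sales ~ TV + radio + TV:radio", data=ads).fit()
print(fit.summary())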

Solution: interaction terms p-value for TV×radio is very low (indicating what?) R² without the interaction term is 89.7%; with it: 96.8% (96.8 − 89.7) / (100 − 89.7) ≈ 69% of the variability that our previous model missed is explained by the interaction term

Important caveat In this case, p-values for all 3 predictors are significant That may not always be true Hierarchical principle: if we include an interaction term, we should include the main effects too (why?)

Assumption 2: linear relationships LR assumes that there is a straight-line relationship between the predictors and the response If the true relationship is far from linear: - the conclusions we draw from a linear model are probably flawed - the prediction accuracy of the model is likely going to be pretty low

Assumption 2: linear relationships For example, in the Auto dataset, the relationship between mpg and horsepower is clearly curved rather than linear

Solution: polynomial regression One simple approach is to use polynomial transformations on the predictors Note: this is still a linear model, because it's linear in the coefficients! (Want more? We've got some shinier non-linear models coming up in Ch. 7)
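
A minimal sketch, assuming the ISLR Auto data saved as Auto.csv:

import pandas as pd
import statsmodels.formula.api as smf

auto = pd.read_csv("Auto.csv")  # assumed file name for the ISLR Auto data

# I(...) makes patsy evaluate the square as plain arithmetic;
# the model is still linear in its coefficients
quad = smf.ols("mpg ~ horsepower + I(horsepower ** 2)", data=auto).fit()
print(quad.summary())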

Okay, so wait…

How to tell if you need more power Residuals plots can help identify problem areas in the model (by highlighting patterns in the errors) Ex. LR of mpg on horsepower in the Auto dataset shows a clear U-shaped pattern in the residuals, strong evidence of nonlinearity
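
A sketch of such a plot using matplotlib (which the lab's seaborn builds on), again assuming the Auto data saved as Auto.csv:

import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.formula.api as smf

auto = pd.read_csv("Auto.csv")  # assumed file name
fit = smf.ols("mpg ~ horsepower", data=auto).fit()

# A systematic pattern here (e.g., a U-shape) suggests nonlinearity
plt.scatter(fit.fittedvalues, fit.resid, s=10)
plt.axhline(0, color="gray", linewidth=1)
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()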

Discussion: breaking LR What other problems might we run into when using LR? How could we fix them?

1. Correlated error terms LR assumes that the error terms are uncorrelated If these terms are correlated, the estimated standard error will tend to underestimate the true standard error What does this mean for the associated confidence intervals and p-values? Question: when might we want to be wary of this? (hint: tick, tock…)

How to tell if error terms are correlated In the time-sampled case, we can plot the residuals from our model as a function of time If the errors are uncorrelated, there should be no discernible pattern; if adjacent residuals have similar values (“tracking”), the errors are likely correlated
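
A sketch, reusing the fitted model `fit` from the earlier residual-plot sketch; the Durbin-Watson statistic is a conventional numeric check that isn't named on the slide:

import matplotlib.pyplot as plt
from statsmodels.stats.stattools import durbin_watson

# Only meaningful if the observations are in time order
plt.plot(fit.resid.values, marker="o", markersize=3, linewidth=1)
plt.xlabel("Observation (time order)")
plt.ylabel("Residual")
plt.show()

# Roughly 2 = no first-order autocorrelation; values well below 2
# suggest positively correlated errors
print(durbin_watson(fit.resid))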

2. Non-constant variance of error terms LR assumes that error terms have constant variance This is often not the case (ex. error terms might increase with the value of the response) Non-constant variance in the errors = heteroscedasticity

How to identify / fix heteroscedasticity The residuals plot will show a funnel shape You can sometimes fix this by transforming the response using a concave function (like log or sqrt) Another option is to try weighting the observations proportional to the inverse variance
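
A sketch of both fixes with statsmodels; the DataFrame df and its columns y and x are hypothetical stand-ins for your own data:

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("mydata.csv")  # hypothetical file with columns y and x

# Fix 1: a concave transform of the response damps the large values
log_fit = smf.ols("np.log(y) ~ x", data=df).fit()

# Fix 2: weighted least squares with weights proportional to the
# inverse variance -- here assuming Var(error_i) grows with x_i
wls_fit = smf.wls("y ~ x", data=df, weights=1.0 / df["x"]).fit()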

3. Outliers An outlier is a point whose true value is really far from the one predicted by the model. Outliers sometimes indicate a problem with the model (like a missing predictor), or they might just be a data collection error Outliers generally don't cause too much of a ruckus with the least-squares fit, but they can wreak havoc on RSE and R², which can lead us to misinterpret the model's fit

How to identify outliers Residual plots can help identify outliers, but sometimes it's hard to pick a cutoff point (how far is “too far”?) Quick fix: divide each residual by its estimated standard error (giving studentized residuals), and flag anything larger than 3 in absolute value
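
A sketch using statsmodels' influence measures, reusing `fit` from the earlier sketches:

import numpy as np

influence = fit.get_influence()
student_resid = influence.resid_studentized_external

# Flag observations with |studentized residual| > 3
print(np.where(np.abs(student_resid) > 3)[0])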

“Studentizing”? Named for English statistician Wm. Gosset Graduated from Oxford with degrees in chemistry and math in 1899 Published under the pseudonym “Student”

4. High leverage points Outliers = unusual values in the response; high leverage points = unusual values in the predictor(s) The more predictors you have, the harder they can be to spot (why?) These points can have a major impact on the least squares line (why?), which could invalidate the entire fit

How to identify high leverage points Compute the leverage statistic. For SLR: hᵢ = 1/n + (xᵢ − x̄)² / Σⱼ (xⱼ − x̄)² Straightforward extension to the case of multiple predictors (hard to fit on a slide, though)
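
A sketch of computing the leverage statistic with statsmodels, reusing `fit` from the earlier sketches:

import numpy as np

leverage = fit.get_influence().hat_matrix_diag

# Average leverage is (p + 1)/n; a common rule of thumb flags points
# well above that average (here, more than 3x)
n, p_plus_1 = fit.model.exog.shape
print(np.where(leverage > 3 * p_plus_1 / n)[0])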

5. Collinearity Problems can also arise when two or more predictor variables are closely related to one another Hard to isolate the individual effects of each predictor, which increases uncertainty This makes it harder to detect whether or not an effect is actually present (why?)

Detecting collinearity One simple way to check for collinearity is by looking at the correlation matrix of the predictors For example: looking at the Auto dataset, we see that just about everything is highly correlated The correlation matrix won't help you find collinearity among three or more variables when no single pair is highly correlated (a situation called multicollinearity)
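
A sketch for the Auto data; the variance inflation factor (VIF) is a standard diagnostic for multicollinearity, though it isn't named on the slide. Assumes Auto.csv with missing values already cleaned:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

auto = pd.read_csv("Auto.csv")  # assumed file name
cols = ["cylinders", "displacement", "horsepower", "weight"]

# Pairwise correlations catch collinearity between pairs of variables...
print(auto[cols].corr())

# ...while VIFs also catch multicollinearity: VIF = 1 means no
# collinearity, and values above 5-10 are a common red flag
X = sm.add_constant(auto[cols])
for i, name in enumerate(X.columns):
    if name != "const":
        print(name, variance_inflation_factor(X.values, i))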

Approaches for dealing with collinearity 1. Drop one of the problematic variables from the regression (collinearity implies they're redundant) 2. Combine the collinear variables together into a single predictor (see the sketch below)
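
A hedged sketch of option 2, with hypothetical collinear predictors x1 and x2 in a DataFrame df: standardize each, then average them into a single combined predictor.

# Standardize each predictor so neither dominates the average
z1 = (df["x1"] - df["x1"].mean()) / df["x1"].std()
z2 = (df["x2"] - df["x2"].mean()) / df["x2"].std()
df["combined"] = (z1 + z2) / 2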

Lab: Linear Regression To do today's lab in R, you'll need the car, MASS and ISLR packages (you don't need to download the data) To do it in Python, you'll need numpy, pandas, and statsmodels (and seaborn if you want pretty pictures) Instructions and code can be found at: Original version can be found beginning on p. 109 of ISLR

Assignment 1 PDF posted on course website: Problems from ISLR 3.7 (p ) - Conceptual: 3.1, 3.4, and Applied: 3.8, 3.10 Due Wednesday Feb. 10 by 11:59pm