Linear Regression CSC 600: Data Mining Class 13.


Today…
Linear Regression
Logistic Regression

Advertising Dataset

import pandas as pd
advertising = pd.read_csv('../datasets/Advertising.csv')
advertising.head(5)

      TV  Radio  Newspaper  Sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3    9.3
3  151.5   41.3       58.5   18.5
4  180.8   10.8       58.4   12.9

Simple Linear Regression Model for Advertising Dataset

Advertising Dataset
Scatter plot visualization for TV and Sales:

%matplotlib inline
advertising.plot.scatter(x='TV', y='Sales');

Advertising Dataset
Simple linear model in Python (using pandas and scikit-learn). Predictor x: TV; response y: Sales.

from sklearn import linear_model

reg = linear_model.LinearRegression()
reg.fit(advertising['TV'].values.reshape(-1, 1),
        advertising['Sales'].values.reshape(-1, 1))
print('Coefficients: \n', reg.coef_)
print('Intercept: \n', reg.intercept_)

Coefficients: [[ 0.04753664]]
Intercept: [ 7.03259355]

Fitted model: Sales = 7.03259 + 0.04754 * TV
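A quick check of the fitted line (a sketch; the TV budget value of 100 is arbitrary):

import numpy as np

# predicted sales for a TV advertising budget of 100
print(reg.predict(np.array([[100.0]])))   # ≈ 7.0326 + 0.04754 * 100 ≈ 11.79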

Assessing the Accuracy of the Model
Trying to quantify the extent to which the model fits the data. Typically assessed with:
Residual standard error (RSE)
R2 statistic
This is different from measuring how well the model's predictions do on a test set (e.g., with the root mean squared error, RMSE).

Residual Standard Error (RSE)
RSE is the average amount that the response will deviate from the true regression line (we can never perfectly predict Y from X because of the error term ε).
For simple linear regression: RSE = sqrt(RSS / (n - 2)).
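A minimal sketch of computing RSE for the TV model, continuing with reg and advertising from the earlier slides:

import numpy as np

X = advertising['TV'].values.reshape(-1, 1)
y = advertising['Sales'].values.reshape(-1, 1)

# RSS: sum of squared residuals around the fitted line
rss = np.sum((y - reg.predict(X)) ** 2)

# RSE = sqrt(RSS / (n - 2)); n - 2 because two parameters
# (intercept and slope) were estimated
n = len(y)
print(np.sqrt(rss / (n - 2)))   # ≈ 3.26 (next slide)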

Advertising Dataset
RSE = 3.26: actual sales in each market deviate from the true regression line by approximately 3.26 units, on average.
Is this error amount acceptable? Business answer: depends on the problem context.
Worth noting the percentage error: 3.26 / mean(Sales) = 3.26 / 14.02 ≈ 23%.

Concluding Thoughts on RSE
RSE measures the "lack of fit" that a model may have.
It is measured in the units of Y.
It is not always clear what constitutes a good RSE.

R2 Statistic
Proportion of variance explained.
Always a value between 0 and 1.
Independent of the scale of Y (unlike RSE).

R2 Statistic
TSS: total variance in the response Y; the amount of variability inherent in the response before the regression is performed.
RSS: the amount of variability that is left unexplained after performing the regression.
TSS - RSS: the amount of variability that is explained.
R2 = (TSS - RSS) / TSS = 1 - RSS / TSS
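These quantities are easy to compute for the TV model; a sketch continuing with reg and advertising from earlier:

import numpy as np

X = advertising['TV'].values.reshape(-1, 1)
y = advertising['Sales'].values.reshape(-1, 1)

tss = np.sum((y - y.mean()) ** 2)          # total variability in Sales
rss = np.sum((y - reg.predict(X)) ** 2)    # variability left unexplained
print(1 - rss / tss)                       # R^2 ≈ 0.61 (next slide)
print(reg.score(X, y))                     # scikit-learn reports the same value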

Advertising Dataset
R2 = 0.61: just under two-thirds of the variability in sales is explained by a linear regression on TV.

Q: What is a good R2 value?
A: It depends on the application.
R2 has an interpretational advantage over RSE.
Example: in a physics problem where it is known that a linear relationship exists, we can expect a high R2 value.
Example: in other domains where the linear model is only a rough approximation, a much lower R2 may still be useful.

R2 Statistic vs. Correlation
Correlation is also a measure of the linear relationship between X and Y, but it only quantifies the association between a single pair of variables.
For simple linear regression (one predictor): R2 = r2.
Next: multiple linear regression (more than one predictor), where pairwise correlation no longer suffices and we use R2.
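A quick numerical check of R2 = r2 for the TV model (a sketch using the correlation value from the matrix shown later):

r = advertising['TV'].corr(advertising['Sales'])   # r ≈ 0.7822
print(r ** 2)                                      # ≈ 0.61, matching R^2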

Multiple Linear Regression In practice, often have more than one predictor Yes, we can run three separate simple linear regressions for the Advertising dataset But, Unclear how to make single prediction of sales given all three predictor values Each regression equation ignores the other two media BAD! Media may be correlated with each other

Multiple Linear Regression Model
Extends the simple linear regression model with one coefficient per predictor.
Response variable Y is numeric (continuous). For p predictor variables:
Y = β0 + β1x1 + β2x2 + … + βpxp + ε
Since the error ε has mean zero (with variance σ2 and a normal distribution), we usually omit it when predicting.
A one-unit change in a predictor variable xj changes the expected mean response by βj units, holding the other predictors fixed.

Advertising Dataset

Estimating the Parameters β0, β1, β2, …
Parameters (regression coefficients) are typically estimated through the method of least squares, just like with simple linear regression.
Automatic in R and Python (data mining toolkits).
We want to minimize the RSS.
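A sketch of the least-squares fit with all three predictors, continuing with the advertising DataFrame (mreg is just a new name used in this sketch):

from sklearn import linear_model

X = advertising[['TV', 'Radio', 'Newspaper']]
y = advertising['Sales']

mreg = linear_model.LinearRegression()   # least squares: minimizes RSS
mreg.fit(X, y)
print('Coefficients:', mreg.coef_)       # [ 0.04576465  0.18853002 -0.00103749]
print('Intercept:', mreg.intercept_)     # 2.93888937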

Advertising Dataset
Sales = 2.938889 + 0.045765 * TV + 0.188530 * Radio - 0.001037 * Newspaper

Simple and Multiple Linear Regression Coefficients can be Quite Different
In a simple regression, the slope term (e.g., the newspaper coefficient) represents the average effect of a $1,000 increase in newspaper advertising, ignoring the other predictors (TV and Radio):
TV model: coefficient 0.04753664, intercept 7.03259355
Radio model: coefficient 0.20249578, intercept 9.3116381
Newspaper model: coefficient 0.0546931, intercept 12.35140707
In the multiple regression, the coefficient for newspaper represents the average effect of increasing newspaper spending by $1,000 while holding TV and Radio fixed:
Coefficients: [[ 0.04576465 0.18853002 -0.00103749]] Intercept: [ 2.93888937]

Correlation Matrix

              TV     Radio  Newspaper     Sales
TV      1.000000  0.054809   0.056648  0.782224
Radio             1.000000   0.354104  0.576223
Newspaper                    1.000000  0.228299
Sales                                  1.000000
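The matrix above can be reproduced directly with pandas:

print(advertising[['TV', 'Radio', 'Newspaper', 'Sales']].corr())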

Correlation Matrix
Correlation between Radio and Newspaper is 0.35. Barely any correlation ("not correlated") for TV/Radio and TV/Newspaper.
This reveals a tendency to spend more on newspaper advertising in markets where more is spent on radio advertising.
Sales are higher in markets where more is spent on radio, but more also tends to be spent on newspaper there. In the simple linear model, Newspaper "gets credit" for the effect of Radio on Sales.

Qualitative Predictors
So far we have assumed that all variables in the linear regression model are quantitative. How do we deal with qualitative variables?

Credit Dataset
Response: Balance (individual's average credit card debt)
Quantitative predictors: Age (years), Cards (number of credit cards), Education (years of education), Income (in thousands of dollars), Limit (credit limit), Rating (credit rating)
Qualitative predictors: Gender {Male, Female}, Student {Yes, No}, Married {Yes, No}, Ethnicity {Caucasian, African American, Asian}

Qualitative Predictors: Two Levels
Levels (sometimes called factors): the possible values of a discrete variable.
Solution: create a dummy variable (or indicator) that takes on two possible numerical values.
Credit dataset, Gender variable {Male, Female}: create a new dummy variable
xi = 1 if the ith person is female, 0 if the ith person is male

Qualitative Predictors: Two Levels
… for now assuming that Gender is the only predictor in the model …
Simple linear regression model: yi = β0 + β1xi + εi
Estimate coefficients β0, β1. The β1xi term zeros out for males (xi = 0).
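A pandas/scikit-learn sketch of this model; the Credit file path and the exact Gender level strings are assumptions:

import pandas as pd
from sklearn import linear_model

credit = pd.read_csv('../datasets/Credit.csv')   # assumed path
# dummy variable: 1 if the ith person is female, 0 if male
x = (credit['Gender'].str.strip() == 'Female').astype(int).values.reshape(-1, 1)
y = credit['Balance'].values.reshape(-1, 1)

greg = linear_model.LinearRegression()
greg.fit(x, y)
print(greg.intercept_, greg.coef_)   # ≈ 509.80 and 19.73 (next slide)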

Qualitative Predictors: Two Levels
Interpretation:
β0: average credit card balance among males
β0 + β1: average credit card balance among females
β1: average difference in credit card balance between females and males
Balance = 509.80 + 19.73 * xi
Average credit card debt for males is estimated to be $509.80. Females are estimated to carry $19.73 in additional debt, for a total of $509.80 + $19.73 = $529.53.

Qualitative Predictors: Two Levels
The decision to code females as 1 and males as 0 is arbitrary, but it does alter the interpretation of the coefficients.
What would happen if we coded males as 1 and females as 0?

Qualitative Predictors: Two Levels
Interpretation:
β0: average credit card balance among females
β0 + β1: average credit card balance among males
β1: average difference in credit card balance between males and females
Balance = 529.54 - 19.73 * xi
Average credit card debt for females is estimated to be $529.54. Males are estimated to carry $19.73 less debt, for a total of $529.54 - $19.73 = $509.80 (up to rounding).
Same exact model!

Qualitative Predictors: Two Levels
A third option: code females as +1 and males as -1.
Interpretation:
β0: overall average credit card balance (ignoring gender)
β1: amount that females are above the average, and males are below the average
Balance = 519.67 + 9.865 * xi
Average credit card debt, ignoring gender, is $519.67. The average difference between males and females is $9.865 * 2 = $19.73.
Same exact model! It doesn't matter which coding scheme is used, as long as the coefficients are interpreted correctly.

Qualitative Predictors: More than Two Levels
A single dummy variable cannot represent all possible values of a qualitative predictor with more than two levels.
Solution: create additional dummy variables. For the Ethnicity variable:
xi1 = 1 if the ith person is Asian, 0 otherwise
xi2 = 1 if the ith person is Caucasian, 0 otherwise
(African American is the baseline level: xi1 = xi2 = 0.)
Simple linear model, ignoring all other predictors…
Always one fewer dummy variable than the number of levels.
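pandas can generate this coding automatically; a sketch assuming the credit DataFrame from before and these exact level names:

import pandas as pd

# one dummy column per level; drop_first drops the alphabetically first
# level (African American), which becomes the baseline
dummies = pd.get_dummies(credit['Ethnicity'], drop_first=True)
print(dummies.head())   # columns: Asian, Caucasian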

Qualitative Predictors: More than Two Levels
Interpretation:
β0: average credit card balance for African Americans
β1: difference in average balance between Asians and African Americans
β2: difference in average balance between Caucasians and African Americans
Balance = 531.00 - 18.69 * xi1 - 12.50 * xi2
The estimated balance for African Americans is $531.00. The Asian category carries $18.69 less debt than the African American category, and the Caucasian category carries $12.50 less.
Once again, the coding scheme is arbitrary.

Qualitative Predictors: More than Two Levels
An alternative coding uses one dummy variable per level: xi1 (African American), xi2 (Asian), xi3 (Caucasian).
Balance = 520.60 + 10.40 * xi1 - 8.29 * xi2 - 2.11 * xi3
Coefficients: [[ 10.39626236 -8.29001215 -2.10625021]] Intercept: [ 520.60373764]
Same predictions as before: the estimated balance for African Americans is $520.60 + $10.40 = $531.00, the Asian category still carries $18.69 less debt than the African American category, and the Caucasian category $12.50 less. Same exact model.

Multiple Quantitative and Qualitative Predictors
Not a problem: use as many dummy variables as needed.
Note that scikit-learn does not create dummy variables automatically; encode the qualitative predictors first (e.g., with pandas' get_dummies or scikit-learn's OneHotEncoder).

In conclusion…
Pros of the linear regression model:
Provides nice, interpretable results.
Works well on many real-world problems.
Cons of the linear regression model:
Assumes a linear relationship between response and predictors: the change in the response Y due to a one-unit change in Xi is constant.
Assumes an additive relationship: the effect of changes in a predictor Xi on the response Y is independent of the values of the other predictors.

Extensions of the Linear Model Beyond the scope of this course… Can remove the additive assumption by specifying interaction terms Can remove the linear assumption using polynomial regression

References
Fundamentals of Machine Learning for Predictive Data Analytics, 1st edition, Kelleher et al.
Data Science from Scratch, 1st edition, Grus
Data Mining and Business Analytics in R, 1st edition, Ledolter
An Introduction to Statistical Learning, 1st edition, James et al.
Discovering Knowledge in Data, 2nd edition, Larose et al.
Introduction to Data Mining, 1st edition, Tan et al.