Linear Regression CSC 600: Data Mining Class 13.

Linear Regression CSC 600: Data Mining Class 13

Today… Linear Regression Logistic Regression

Advertising Dataset TV Radio Newspaper Sales 230.1 37.8 69.2 22.1 1
import pandas as pd advertising = pd.read_csv('../datasets/Advertising.csv') advertising.head(5) TV Radio Newspaper Sales 230.1 37.8 69.2 22.1 1 44.5 39.3 45.1 10.4 2 17.2 45.9 69.3 9.3 3 151.5 41.3 58.5 18.5 4 180.8 10.8 58.4 12.9

Simple Linear Regression Model for Advertising Dataset

Advertising Dataset Scatter plot visualization for TV and Sales.
%matplotlib inline advertising.plot.scatter(x='TV', y='Sales');

Advertising Dataset Simple Linear Model in Python (using pandas and scikit): Predictor: x Response: y reg = linear_model.LinearRegression() reg.fit(advertising['TV'].reshape(-1,1), advertising['Sales'].reshape(-1,1)) print('Coefficients: \n', reg.coef_) print('Intercept: \n', reg.intercept_) Coefficients: [[ ]] Intercept: [ ] Sales = * TV

Assessing the Accuracy of the Model
Trying to quantify the extent to which the model fits the data Typically assessed with: Residual standard error (RSE) R2 statistic Different than measuring how well model’s predictions were on a test set Root Mean Squared Error (RMSE)

Residual Standard Error (RSE)
RSE is the average amount that the response will deviate from the true regression line (never can perfectly predict Y from X because of the error term ε)

Advertising Dataset RSE = 3.26 Is this error amount acceptable?
Actual sales in each market deviate from the true regression line by approximately 3.26 units, on average. Is this error amount acceptable? Business answer: depends on problem context Worth noting the percentage error:

Concluding Thoughts on RSE
RSE measures the “lack of fit” that a model may have. Measured in the units of Y Not always clear what constitutes a good RSE

R2 Statistic Proportion of variance explained
Always a value between 0 and 1 Independent of the scale of Y (unlike RSE)

R2 Statistic TSS: total variance in the response Y
Amount of variability inherent in the response, before the regression is performed RSS: amount of variability that is left unexplained after performing the regression TSS-RSS : the amount of variability that is explained

Advertising Dataset R2 = 0.61
Just under two-thirds of the variability in sales is explained by a linear regression on TV.

Q: What is a good R2 value? R2 has an interpretational advantage over RSE A: Depends on the application. Example: problem from physics where it is known that a linear relationship exists, can expect a good R2 value Example: other domains where linear model is rough approximation…

R2 Statistic vs. Correlation
Correlation only quantifies the association between a single pair of variables. Correlation is also a measure of the linear relationship between X and Y. For simple linear regression (one predictor): R2 = r2 Next: multiple linear regression (more than one predictor): use RSE

Multiple Linear Regression
In practice, often have more than one predictor Yes, we can run three separate simple linear regressions for the Advertising dataset But, Unclear how to make single prediction of sales given all three predictor values Each regression equation ignores the other two media BAD! Media may be correlated with each other

Multiple Linear Regression Model
Extend the simple linear regression model for each predictor Response variable Y is numeric (continuous) For p predictor variables: Since error ε has mean zero, variance σ2, with normal distribution, we usually omit it. A one-unit change in any predictor variable xj will change the expected mean response by βj units.

Advertising Dataset

Estimating the Parameters β0β1β2…
Parameters (regression coefficients) are typically estimated through the method of least squares Just like with simple linear regression Automatic in R, Python (data mining toolkits) We want to minimize the RSS

Advertising Dataset Sales = * TV * radio * newspaper

Simple and Multiple Linear Regression Coefficients can be Quite Different
Slope term (newspaper coefficient) represents the average effect of a $1,000 increase in newspaper advertising, ignoring other predictors (TV and radio). TV Model: [[ ]] [ ] Radio Model: [[ ]] [ ] Newspaper Model: [[ ]] [ ] Coefficient for newspaper represents the average effect of increasing newspaper spending by $1,000 while holding TV and radio fixed. Coefficients: [[ ]] Intercept: [ ]

Correlation Matrix TV Radio Newspaper Sales 1.000000 0.054809 0.056648

Correlation Matrix Correlation between radio and newspaper is 0.35
TV Radio Newspaper Sales Correlation Matrix Correlation between radio and newspaper is 0.35 Barely any correlation (or “not correlated”) for TV/radio and TV/newspaper Reveals tendency to spend more on Newspaper advertising in markets where more is spent on Radio advertising. Sales higher in markets where more is spent on Radio, but more also tends to be spent on Newspaper. In Simple LM: Newspaper “gets credit” for effect of Radio on Sales.

Qualitative Predictors
So far have assumed that all variables in linear regression model are quantitative. How to deal with qualitative variables?

Credit Dataset Response: Quantitative Predictors:
Balance (individual’s average credit card debt) Quantitative Predictors: Age (years) Cards (number of credit cards) Education (years of education) Income (in thousands of dollars) Limit (credit limit) Rating (credit rating) Qualitative Predictors: Gender {Male, Female} Student {Yes, No} Married {Yes, No} Ethnicity {Caucasian, African American, Asian}

Qualitative Predictors: Two Levels
Levels (sometimes called factors): possible values of discrete variable Solution: create a dummy variable (or indicator) that takes on two possible numerical values Credit dataset, Gender variable: {Male, Female} Create new dummy variable:

… for now assuming that Gender is the only predictor in model … Simple Linear Regression Model Estimate coefficients B0, B1 Term zeros out for males

Interpretation: B0: average credit card balance among males B0 + B1: average credit card balance among females B1: average difference in credit card balance between females and males

Interpretation: B0: average credit card balance among males B0 + B1: average credit card balance among females B1: average difference in credit card balance between females and males Average credit card debt for males is estimated to be $ Females are estimated to carry $19.73 in additional debt, for a total of: $ $19.73=$529.53 Balance = * xi

Decision to code females as 1 and males as 0 is arbitrary. It does alter the interpretation of the coefficients What would happen if we coded males as 1 and females as 0?

Interpretation: B0: average credit card balance among females B0 + B1: average credit card balance among males B1: average difference in credit card balance between females and males Average credit card debt for females is estimated to be $ Males are estimated to carry $19.73 in less debt, for a total of: $ $19.73=$509.80 Balance = * xi Same exact model!

Interpretation: B0: overall average credit card balance (ignoring gender) B1: amount that females are above the average, and males are below the average Average credit card debt, ignoring gender is $ The average difference between males and females is: $9.865 * 2 = $19.73 Balance = * xi Same exact model! It doesn’t matter which coding scheme is used, as long as coefficients are correctly interpreted.

Qualitative Predictors: More than Two Levels
Single dummy variable cannot represent all possible values for qualitative predictors with more than two levels Solution: create additional dummy variables For Ethnicity variable: Simple linear model, ignoring all other predictors…. Always one fewer dummy variable than number of levels.

Interpretation: B0: average credit card balance for African Americans B1: difference in average balance between Asians and African Americans B2: difference in average balance between Caucasians and African Americans Balance = – 18.69* xi1 – 12.50* xi2 Estimated balance for African Americans is $531.00 Asian category will have $18.69 in less debt than African American category Caucasian category will have $12.50 in less debt than African American category Once again, arbitrary coding scheme.

xi1 Qualitative Predictors: Two Levels African American xi2 xi3 African American Balance = * xi1 – 8.29* xi2 – 2.11* xi3 Coefficients: [[ ]] Intercept: [ ] Estimated balance for African Americans is $531.00 Asian category will have $18.69 in less debt than African American category Caucasian category will have $12.50 in less debt than African American category

Multiple Quantitative and Qualitative Predictors
Not a problem Use as many dummy variables as needed scikit creates dummy variables automatically for the qualitative predictors

In conclusion… Pros of Linear Regression Model:
Provides nice interpretable results Works well on many real-world problems Cons of Linear Regression Model: Assumes linear relationship between response and predictors: Change in the response Y due to a one-unit change in Xi is constant Assumes additive relationship Effect of changes in a predictor Xi on response Y is independent of the values of the other predictors

Extensions of the Linear Model
Beyond the scope of this course… Can remove the additive assumption by specifying interaction terms Can remove the linear assumption using polynomial regression

References Fundamentals of Machine Learning for Predictive Data Analytics, 1st Edition, Kelleher et al. Data Science from Scratch, 1st Edition, Grus Data Mining and Business Analytics in R, 1st edition, Ledolter An Introduction to Statistical Learning, 1st edition, James et al. Discovering Knowledge in Data, 2nd edition, Larose et al. Introduction to Data Mining, 1st edition, Tam et al.

Linear Regression CSC 600: Data Mining Class 13.

Similar presentations

Presentation on theme: "Linear Regression CSC 600: Data Mining Class 13."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Linear Regression CSC 600: Data Mining Class 13.

Similar presentations

Presentation on theme: "Linear Regression CSC 600: Data Mining Class 13."— Presentation transcript:

Similar presentations

About project

Feedback