LECTURE 03: LINEAR REGRESSION PT. 1 February 1, 2016 SDS 293 Machine Learning.

2 Announcements 1/3 Smith RStudio Server accounts should now be set up for everyone who requested one. When you have time, please try logging in:

3 Announcements 2/3 Some of you have asked how to get Jupyter to play with R. Instructions are posted to Piazza; email if you get stuck!

4 Announcements 3/3 Deadline for registration: Feb. 4th. SDS has some (limited) funding to support students who wish to attend. Interested? Email Jordan!

5 Outline
Motivation and Running Example: Advertising
Simple Linear Regression
- Estimating coefficients
- How good is this estimate?
- How good is the model?
Multiple Linear Regression
- Estimating coefficients
- Important questions
Dealing with Qualitative Predictors
Extending the Linear Model
- Removing the additive assumption
- Non-linear relationships
Potential Problems

6 Motivation Why start a ML course with linear regression?

7 Running example

8 Last year’s advertising budget [Figure: Sales (in 1000s of units sold) plotted against Budget (in $1,000s)]

9 Your task

10 Questions you might ask
1. Is there a relationship between budget and sales?
2. How strong is the relationship?
3. Which media contribute to sales?
4. How accurately can we estimate the effect?
5. How accurately can we predict future sales?
6. Is the relationship linear?
7. Is there synergy among the advertising media?
Linear Regression

11 Simple Linear Regression
A straightforward, linear approach for predicting a quantitative response on the basis of a single predictor.
Assumption: there is a roughly linear relationship between X (the predictor) and Y (the response).
The response is approximately modeled as a linear function of the predictor:
$Y \approx \beta_0 + \beta_1 X$
where $\beta_0$ is the intercept and $\beta_1$ is the slope.

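12 Simple Linear Regression
In practice, $\beta_0$ and $\beta_1$ are unknown.
Before we can use the linear model to predict future response values, we need to estimate them from the data:
$(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$
Our goal: find estimated coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ so that
$y_i \approx \hat{\beta}_0 + \hat{\beta}_1 x_i$ for $i = 1, \ldots, n$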

13 What does “pretty close” mean? Here: the line that minimizes RSS (other ways in Ch. 6)

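14 Estimating coefficients with RSS
Back to our hypothetical model: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
Define the residual: $e_i = y_i - \hat{y}_i$ (difference between observed and predicted responses)
We then define the residual sum of squares (RSS) to be:
$\mathrm{RSS} = e_1^2 + e_2^2 + \cdots + e_n^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2$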

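15 Minimizing RSS: least squares
Goal: find values for $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize RSS.
Dusting off our calculus, the minimizers are:
$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
where $\bar{x}$ and $\bar{y}$ are the mean values of the sample.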
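A minimal sketch of the least squares recipe in plain Python: the slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept follows from the sample means. The function names and the toy `budget`/`sales` numbers are made up for illustration; they are not the Advertising data.

```python
from statistics import mean


def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) minimizing RSS for y ~ b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def rss(x, y, intercept, slope):
    """Residual sum of squares for a candidate line."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Hypothetical toy data (NOT the Advertising data set).
budget = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_simple_linear_regression(budget, sales)
print(f"intercept = {b0:.3f}, slope = {b1:.3f}")
print(f"RSS at the least squares fit: {rss(budget, sales, b0, b1):.4f}")
print(f"RSS after nudging the slope:  {rss(budget, sales, b0, b1 + 0.1):.4f}")
```

Nudging either coefficient away from the fitted values can only increase the RSS, which is one quick way to sanity-check a fit.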

16 Advertising example
We’d expect to sell about 7,030 units without advertising (an intercept of about 7.03, in 1000s of units).
Every additional $1k = approx. 47.5 additional units sold (a slope of about 0.0475).

17 How good is this estimate?
Recall: we assume the true relationship between the predictor and the response takes the form:
$Y = \beta_0 + \beta_1 X + \epsilon$
Consider this:

18 How good is this estimate?
Different sets of observations will produce different estimated coefficients
- Avg(huge number of samples) → perfect!
- Least squares is an unbiased estimator, i.e. it doesn’t systematically over- or under-estimate
In reality, we usually don’t have a huge number of samples to average over… but we still want to be able to predict.
So how close are our estimated parameters $\hat{\beta}_0$ and $\hat{\beta}_1$ to the true values?
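One way to see the "average over many samples" idea is a small simulation. The sketch below assumes a true line with intercept 2 and slope 3 and Gaussian noise; those values, the x range, and the sample sizes are all arbitrary choices for illustration. Each simulated data set gives a different slope estimate, but the average of the estimates should land close to the true slope.

```python
import random
from statistics import mean


def fit_slr(x, y):
    """Least squares (intercept, slope) for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope


def simulate_slopes(true_b0, true_b1, n_obs, n_datasets, noise_sd, seed=0):
    """Fit one regression per simulated data set and collect the slopes."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    slopes = []
    for _ in range(n_datasets):
        x = [rng.uniform(-2, 2) for _ in range(n_obs)]
        y = [true_b0 + true_b1 * xi + rng.gauss(0, noise_sd) for xi in x]
        slopes.append(fit_slr(x, y)[1])
    return slopes


# Assumed "true" model for the simulation: Y = 2 + 3X + noise.
slopes = simulate_slopes(2.0, 3.0, n_obs=30, n_datasets=2000, noise_sd=2.0)
print(f"smallest slope estimate: {min(slopes):.2f}")
print(f"largest slope estimate:  {max(slopes):.2f}")
print(f"average slope estimate:  {mean(slopes):.3f}")
```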

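19 Standard error
To find out, we can borrow the idea of standard error (SE):
$\mathrm{SE}(\hat{\mu}) = \frac{\sigma}{\sqrt{n}}$
where $\hat{\mu}$ is the sample mean, $\sigma$ is the standard deviation of the population, and $n$ is the number of observations in the sample.
Note: the error gets smaller as the sample size increases.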

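20 Standard error
In adapting this to our coefficients, we’ll use the standard deviation of $\epsilon$ as our value for $\sigma$ (why?)
Let’s start with the slope:
$\mathrm{SE}(\hat{\beta}_1)^2 = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
And now the intercept:
$\mathrm{SE}(\hat{\beta}_0)^2 = \sigma^2 \left[ \frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right]$
What happens as x spreads out? What happens when the mean of x is 0?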

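21 Residual standard error
In practice, we don’t know the standard deviation of $\epsilon$.
We can estimate it using the RSS to get the residual standard error:
$\mathrm{RSE} = \sqrt{\mathrm{RSS} / (n - 2)}$
Standard error can be used to compute confidence intervals: ranges of values that we believe, with some level of confidence, contain the true value.
In linear regression, the 95% confidence intervals are approximately:
$\hat{\beta}_1 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_1)$ and $\hat{\beta}_0 \pm 2 \cdot \mathrm{SE}(\hat{\beta}_0)$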
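A minimal sketch that puts these pieces together for one predictor: it estimates the noise standard deviation with the RSE (RSS divided by n - 2, then square-rooted), plugs it into the standard error formulas for the slope and intercept, and forms rough 95% intervals as estimate ± 2·SE. The function name `slr_summary` and the toy data are hypothetical.

```python
from math import sqrt
from statistics import mean


def slr_summary(x, y):
    """Coefficients, RSE, standard errors and rough 95% CIs for one predictor."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    rse = sqrt(rss / (n - 2))                        # estimate of sd(epsilon)
    se_slope = rse / sqrt(sxx)                       # shrinks as x spreads out
    se_intercept = rse * sqrt(1 / n + x_bar ** 2 / sxx)
    return {
        "intercept": intercept,
        "slope": slope,
        "rss": rss,
        "rse": rse,
        "se_slope": se_slope,
        "se_intercept": se_intercept,
        # "estimate +/- 2 SE" is the usual rule-of-thumb 95% interval.
        "ci_slope": (slope - 2 * se_slope, slope + 2 * se_slope),
        "ci_intercept": (intercept - 2 * se_intercept,
                         intercept + 2 * se_intercept),
    }


# Hypothetical toy data (NOT the Advertising data set).
summary = slr_summary([1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 3.9, 6.2, 7.8, 10.1])
for name, value in summary.items():
    print(name, value)
```

Centering x so that its mean is 0 makes the second term in the intercept's standard error vanish, leaving RSE / sqrt(n).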

22 Using SE for hypothesis testing
Recall: we want to determine whether there is a relationship between advertising budget and sales.
If there is NO relationship, what is the true value of $\beta_1$? NO relationship = NO slope, so $\beta_1 = 0$.
We can then compute the probability that we would have observed our $\hat{\beta}_1$ by chance, assuming a true value of 0.
If this probability is small, we reject the assumption of no relationship and conclude that a relationship exists.
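As a rough sketch of how that probability can be computed: divide the estimated slope by its standard error to see how many standard errors it sits from 0, then ask how likely a value at least that extreme would be. The exact test uses a t-distribution with n - 2 degrees of freedom; the standard library has no t-distribution, so this sketch substitutes a normal approximation, which is only reasonable for larger samples. The function names and the example numbers are hypothetical.

```python
from statistics import NormalDist


def t_statistic(estimate, standard_error, null_value=0.0):
    """Number of standard errors between the estimate and the null value."""
    return (estimate - null_value) / standard_error


def approx_two_sided_p_value(t):
    """Two-sided p-value using a NORMAL approximation to the t-distribution.

    Assumption: n is large enough that t(n - 2) is close to N(0, 1).
    """
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


# Hypothetical slope estimate and standard error, for illustration only.
t = t_statistic(estimate=1.99, standard_error=0.06)
print(f"t = {t:.2f}, approximate p-value = {approx_two_sided_p_value(t):.3g}")
```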

23 Advertising Example

24 How good is this model?
Now that we know there is a relationship, we probably want to know how well the model captures it.
We’ve seen one measure: the RSE
- RSE is (roughly) the average amount the response will deviate from the true regression line
- RSE is an absolute measure, given in the same units as the response variable
Question: how do you know what a “good” RSE is?

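25 How good is this model?
$R^2$ is an alternative approach: it captures the proportion of variance explained by the model.
We calculate it as follows:
$R^2 = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$
where $\mathrm{TSS} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total variance in the response and RSS is the variance not explained after regression.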
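A short sketch of that calculation, given observed responses and a model's fitted values: compute the total sum of squares around the mean, the residual sum of squares around the fitted values, and take one minus their ratio. The function name and the numbers are made up for illustration.

```python
from statistics import mean


def r_squared(y, y_hat):
    """Proportion of the variance in y explained by the fitted values y_hat."""
    y_bar = mean(y)
    tss = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # left unexplained
    return 1 - rss / tss


observed = [1.0, 2.0, 3.0]
print(r_squared(observed, [1.0, 2.0, 3.0]))  # fitted values match exactly
print(r_squared(observed, [2.0, 2.0, 2.0]))  # always predicting the mean
print(r_squared(observed, [1.2, 1.9, 2.9]))  # somewhere in between
```

Unlike the RSE, the result is unit-free: a perfect fit gives 1 and a model that always predicts the mean gives 0.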

26 TV advertising and sales Let’s look at these measures for our advertising data: What does the RSE tell us? What does $R^2$ tell us?

27 Discussion: multiple predictors So far, we’ve only talked about the effect of one variable. Question: how do we handle multiple predictors?

28 Option 1: SLR for each predictor What problems do you see with this approach?

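29 Option 2: extend the linear model
We can accommodate multiple predictors by giving each one its own slope coefficient:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon$
Each slope captures the average effect on Y of a one-unit increase in one predictor, holding all others constant.
For example:
$\text{sales} = \beta_0 + \beta_1 \cdot \text{TV} + \beta_2 \cdot \text{radio} + \beta_3 \cdot \text{newspaper} + \epsilon$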

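30 Estimating coefficients in MLR
We can estimate the parameters in MLR using the same least squares approach we used in SLR.
Choose $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_p$ to minimize RSS:
$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_p x_{ip})^2$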
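With more than one predictor the minimizers no longer have a tidy one-line formula, but they can be found by solving the normal equations (X'X) b = X'y, where X is the data with a leading column of ones for the intercept. The sketch below does this with a small hand-written Gaussian elimination so that it needs no external libraries; in practice you would use a statistics package. It assumes the predictors are not perfectly collinear, and all names and data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is non-singular (predictors not perfectly collinear).
    """
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_mlr(rows, y):
    """Least squares coefficients [b0, b1, ..., bp] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 = intercept
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical data generated exactly from y = 1 + 2*x1 - 0.5*x2.
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [2.0, 4.5, 4.5, 7.5, 7.0]
print(fit_mlr(rows, y))
```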

31 Advertising example When we fit an MLR model to the advertising data, we get the following least-squares coefficients: What does this tell us? Do you notice anything unexpected?

32 What happened to newspaper ads? Does it make sense for the MLR to indicate no effect of newspaper advertising on sales? Let’s look at the correlation between all the dimensions. In SLR, newspaper spending was “getting credit” for radio spending’s work!

33 Questions we ask in MLR
1. Is at least one of the predictors useful in predicting the response?
2. Do all the predictors help to explain the response, or is only a subset of the predictors useful?
3. How well does the model fit the data?
4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction?

34 Q1: Is at least one predictor useful?
In SLR, we tested to see if the slope was 0 (no effect).
In MLR, we need to test whether ALL of the slopes are 0:
$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$
versus the alternative that at least one $\beta_j$ is non-zero.

35 Is at least one predictor useful?
To do this, we compute the F-statistic:
$F = \frac{(\mathrm{TSS} - \mathrm{RSS}) / p}{\mathrm{RSS} / (n - p - 1)}$
where p is the # of predictors and n is the sample size.
Value close to 1 → no effect
Question: why look at the F-statistic and not just at the p-values for each predictor in turn? (hint: lots of predictors?)
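The F-statistic compares the variation explained per predictor, (TSS - RSS) / p, with the unexplained variation per remaining degree of freedom, RSS / (n - p - 1). A minimal sketch of the arithmetic, with a hypothetical function name and made-up sums of squares:

```python
def f_statistic(tss, rss, n, p):
    """F-statistic for H0: all p slope coefficients are zero."""
    explained_per_predictor = (tss - rss) / p
    unexplained_per_df = rss / (n - p - 1)
    return explained_per_predictor / unexplained_per_df


# Hypothetical sums of squares from a fit with n = 23 observations.
print(f_statistic(tss=100.0, rss=20.0, n=23, p=2))   # model explains a lot
print(f_statistic(tss=100.0, rss=91.0, n=23, p=2))   # model explains little
```

When the predictors explain almost nothing, the numerator and denominator are of similar size and F stays near 1; a strong relationship pushes F well above 1.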

36 Q2: Do we need them all?
Now we know that at least one predictor has an effect: which one(s) is it?
Determining which predictors are associated with the response is referred to as variable selection.
I’ll hint at some classic approaches today, and go into more detail when we get to Ch. 6.

37 Method 1: Exhaustive Search
1. Construct all possible models, each containing a different subset of predictors
2. Evaluate them against one another, and select the one that performs best
Problem: how many possible models are there on a dataset with p predictors?
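To get a feel for the size of the search, the sketch below enumerates every subset of a list of predictor names (including the empty subset, i.e. the intercept-only model). Each predictor is either in or out, so the count doubles with every predictor added. The helper name `all_subsets` is hypothetical.

```python
from itertools import combinations


def all_subsets(predictors):
    """Every subset of the predictors, from the empty set to the full set."""
    subsets = []
    for size in range(len(predictors) + 1):
        subsets.extend(combinations(predictors, size))
    return subsets


models = all_subsets(["TV", "radio", "newspaper"])
for model in models:
    print(model)
print("number of candidate models:", len(models))

# The count grows as 2**p, which becomes unmanageable quickly.
for p in (3, 10, 20, 30):
    print(p, "predictors ->", 2 ** p, "models")
```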

38 Method 2: Forward selection
1. Begin with the null model: an intercept with no predictors.
2. Fit p simple linear regressions, and add the variable that results in the lowest RSS.
3. Add to that model the variable that results in the lowest RSS for the new two-variable model.
4. Iterate.
Question: when do we stop?
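A minimal sketch of the forward selection loop: at each step, try adding each remaining predictor, keep the one that gives the lowest RSS, and repeat. The stopping rule here is simply a caller-supplied `max_predictors`, which is an assumption made for the sketch and not an answer to the question on the slide. The least squares fit uses a small hand-written solver for the normal equations and assumes no perfectly collinear predictors; all names and data are hypothetical.

```python
def solve_linear_system(a, b):
    """Gaussian elimination with partial pivoting (assumes non-singular a)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def rss_for_subset(rows, y, subset):
    """RSS of the least squares fit using only the columns in `subset`."""
    design = [[1.0] + [row[j] for j in subset] for row in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)]
           for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, d)) for d in design]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def forward_selection(rows, y, max_predictors):
    """Greedily add the predictor (column index) that lowers RSS the most."""
    selected = []
    remaining = list(range(len(rows[0])))
    while remaining and len(selected) < max_predictors:
        best = min(remaining,
                   key=lambda j: rss_for_subset(rows, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical data: y depends strongly on column 2, weakly on column 0,
# and not at all on column 1.
x0 = [1, 2, 3, 4, 5, 6, 7, 8]
x1 = [3, 1, 4, 1, 5, 9, 2, 6]
x2 = [2, 7, 1, 8, 2, 8, 1, 8]
rows = list(zip(x0, x1, x2))
y = [10 * c + a for a, c in zip(x0, x2)]
print(forward_selection(rows, y, max_predictors=2))
```

Because each step only looks one variable ahead, the procedure never revisits an earlier choice, which is what makes it greedy.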

39 Method 3: Backward selection
1. Start with an MLR model containing all predictors.
2. Remove the predictor with the largest p-value (the least statistically significant).
3. Fit the new (p−1)-variable model, and remove the predictor with the largest p-value.
4. Iterate.
Question: when do we stop?

40 Method 4: Mixed selection
1. Start with the null model.
2. As with forward selection, iteratively add the variable that provides the best fit.
3. If the p-value for one of the variables in the model rises above a certain threshold, remove that variable.
4. Iterate until all variables in the model have a sufficiently low p-value, and all variables outside the model would have a large p-value if added.

41 Comparing selection methods
Backward selection cannot be used if p > n (why?)
Does forward selection have the same issue?
Forward selection is a greedy approach. Problems?
Mixed selection can remedy this, but at what cost?
In Ch. 6, we’ll dig deeper into variable selection

42 Q3: How well does the model fit the data?
Just like in SLR, we can use RSE and $R^2$ to measure how well our model fits the data.
Using the MLR model we created on the advertising data using all predictors:
Question: what happens to the $R^2$ value if we remove newspaper from the model?

43 Q4: How confident are we?
Now that we have a model, making a prediction is a piece of cake (just plug and chug!)
We need to consider 3 kinds of uncertainty:
1. How far off are the coefficients? → confidence intervals
2. How far from linear is the true relationship? → ignore this for now
3. How much will any specific prediction vary from the true value, even if we had perfect coefficients? → prediction intervals

44 Dealing with Qualitative Predictors
So far, we’ve assumed all our predictors were quantitative.
What happens when we have a qualitative predictor, such as gender or race?
No problem! We’ll just trick the model.

45 Two-level predictors
Consider a qualitative predictor which only has two possible values (e.g. enrolled and auditing).
Imagine we want to predict number of assignments turned in using only this predictor.
We can incorporate this into a regression model using a dummy variable:
$x_i = 1$ if the i-th person is enrolled
$x_i = 0$ if the i-th person is auditing

46 Two-level predictors ({P1:“enrolled”}, {P2:“enrolled”}, {P3:“auditing”}, …) → ({P1:1}, {P2:1}, {P3:0}, …)

47 Two-level predictors [Figure: 0/1 dummy coding of the students]

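48 The model:
$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, which equals
$\beta_0 + \beta_1 + \epsilon_i$ if the i-th person is enrolled
$\beta_0 + \epsilon_i$ if the i-th person is auditing
- $\beta_0$ is the average number of assignments turned in by students that are auditing
- $\beta_0 + \beta_1$ is the average number of assignments turned in by students enrolled in the course
- $\beta_1$ is the average difference in number of assignments turned in between the two groups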
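A small sketch that checks this interpretation numerically: code enrolled as 1 and auditing as 0, fit an ordinary simple linear regression on the dummy, and compare the coefficients with the two group means. The student data and helper names are made up for illustration.

```python
from statistics import mean


def encode_two_level(values, level_coded_as_one):
    """Dummy variable: 1 for the chosen level, 0 for the other level."""
    return [1 if v == level_coded_as_one else 0 for v in values]


def fit_slr(x, y):
    """Least squares (intercept, slope) for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical students and number of assignments turned in.
status = ["enrolled", "enrolled", "auditing",
          "enrolled", "auditing", "auditing"]
assignments = [9, 10, 4, 8, 6, 5]

x = encode_two_level(status, "enrolled")
b0, b1 = fit_slr(x, assignments)

auditing_mean = mean(a for s, a in zip(status, assignments) if s == "auditing")
enrolled_mean = mean(a for s, a in zip(status, assignments) if s == "enrolled")

print("intercept:", b0, "| auditing mean:", auditing_mean)
print("intercept + slope:", b0 + b1, "| enrolled mean:", enrolled_mean)
```

Swapping which level is coded as 1 flips the sign of the slope and moves the intercept to the other group's mean, while the fitted values stay the same.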

49 A note on dummy variables
The decision to code enrolled students as 1 and auditing students as 0 is arbitrary.
It has no effect on model fit, or on the predicted values.
It does alter the interpretation of the coefficients
- If we swapped them, what would happen?
- If we used (-1,1), what would happen?

50 Multi-level predictors
When a qualitative predictor has more than two levels, a single dummy variable won’t cut it.
We’ll have to create a dummy variable for all but one level. For example:
$x_{i1} = 1$ if the i-th person is from Amherst, $0$ if the i-th person is not from Amherst
$x_{i2} = 1$ if the i-th person is from Mt. Holyoke, $0$ if the i-th person is not from Mt. Holyoke

51 Multi-level predictors
Then both of these variables can be used to obtain the model:
$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i$, which equals
$\beta_0 + \beta_1 + \epsilon_i$ if the i-th person is from Amherst
$\beta_0 + \beta_2 + \epsilon_i$ if the i-th person is from Mt. Holyoke
$\beta_0 + \epsilon_i$ if the i-th person is from Smith
In this model:
- $\beta_0$ is the average number of assignments turned in by students from Smith (the “baseline”)
- $\beta_1$ is the difference in average number of assignments turned in between students from Amherst and students from Smith
- $\beta_2$ is the difference in average number of assignments turned in between students from Mt. Holyoke and students from Smith
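A sketch of the encoding step for a multi-level predictor: pick a baseline level, then create one 0/1 column for every other level. For a model that contains only these dummies, the least squares intercept equals the baseline group's mean and each slope equals that group's mean minus the baseline mean, so the second helper computes the coefficients directly from group means. Function names and data are hypothetical.

```python
from statistics import mean


def dummy_encode(values, baseline):
    """One 0/1 column per non-baseline level; the baseline is all zeros."""
    levels = sorted(set(values) - {baseline})
    rows = [tuple(1 if v == level else 0 for level in levels) for v in values]
    return levels, rows


def coefficients_from_group_means(values, y, baseline):
    """Intercept = baseline mean; each slope = group mean - baseline mean.

    Valid only when the dummies are the model's sole predictors.
    """
    def group_mean(level):
        return mean([yi for v, yi in zip(values, y) if v == level])

    base = group_mean(baseline)
    levels = sorted(set(values) - {baseline})
    return base, {level: group_mean(level) - base for level in levels}


# Hypothetical students and number of assignments turned in.
school = ["Smith", "Amherst", "Mt. Holyoke", "Smith", "Amherst", "Smith"]
assignments = [9, 6, 8, 10, 8, 8]

levels, rows = dummy_encode(school, baseline="Smith")
print(levels)
for s, row in zip(school, rows):
    print(s, row)
print(coefficients_from_group_means(school, assignments, baseline="Smith"))
```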

52 Lab: Linear Regression
To do this week’s lab in R, you’ll need the car, ISLR and MASS packages
- All are already installed on the Smith RStudio server
- If you need to install them on your local version of R, run:
> install.packages('MASS')
> install.packages('ISLR')
> install.packages('car')
Instructions and code can be found at:
Full version can be found beginning on p. 109 of ISLR

53 Assignment 1
PDF posted on course website
Problems from ISLR 3.7 (p. 120-123)
- Conceptual: 3.1, 3.4, and 3.6
- Applied: 3.8, 3.10
Due Wednesday Feb. 10 by 11:59pm

