1
Regression, Prediction and Classification
Microsoft Research 2013. Jacob LaRiviere. Some content taken from Justin Rao.
2
Terminology: X = features, y = outcomes
The goal is to model outcomes as a function of features. The width of X (the number of features) is p; the length of X and y (the number of observations) is n.
3
Terminology cont. X: features, y: outcomes. [Figure: the feature matrix X and outcome vector y, with rows labeled "Obs 1, …" and columns labeled "Feature 1, …"]
5
Estimating equations: Y = f(X) + ε;  y_i = f(x_i) + ε_i
ε_i is called the error term.
6
Example: coin flipping
Y = f(X) + ε. Here flip_i = p + ε_i: if the outcome is 1 then ε_i = 1 − p; if the outcome is 0 then ε_i = −p. p is a constant, called the "success rate".
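A minimal R sketch of this example (the numbers are made up for illustration): simulate coin flips with success rate p and recover p as the sample mean, which is exactly what an intercept-only regression estimates.

  set.seed(1)
  p <- 0.6                          # assumed true success rate
  flips <- rbinom(1000, size = 1, prob = p)
  mean(flips)                       # sample estimate of p
  coef(lm(flips ~ 1))               # intercept-only regression returns the same estimate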
7
Linear regression: y_i = α + β_1 x_i + ε_i
α gives the intercept and β gives the slope. Ordinary least squares finds the α̂ and β̂ that minimize the squared distance between the fitted line ŷ_i = α̂ + β̂ x_i and the observations, i.e. it minimizes (ŷ − y)².
8
How does this work intuitively?
Assume α = 0.
min_β (y − ŷ)² = (y − β̂x)² = y² − 2β̂xy + (β̂x)²
FOC → −2xy + 2β̂x² = 0 → β̂x² = xy → β̂ = xy / x²
In matrix form: β̂ = (X′X)⁻¹(X′Y)
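A hedged R sketch on simulated data (all names and numbers are illustrative): the matrix formula β̂ = (X′X)⁻¹(X′Y) reproduces what lm() returns.

  set.seed(2)
  n <- 200
  x <- rnorm(n)
  y <- 1 + 2 * x + rnorm(n)                         # true alpha = 1, beta = 2
  X <- cbind(1, x)                                  # design matrix with an intercept column
  beta_hat <- solve(t(X) %*% X) %*% (t(X) %*% y)    # (X'X)^{-1} (X'Y)
  beta_hat
  coef(lm(y ~ x))                                   # same estimates from lm()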
9
How does this work for inference?
What about hypothesis testing? (Now assume α ≠ 0.)
se(β̂_1) = sqrt( [ (1/n)·(1/(n−2))·Σ_{i=1..n} (x_i − x̄)² ε̂_i² ] / [ (1/n)·Σ_{i=1..n} (x_i − x̄)² ]² )
→ t = (β̂_1 − β_{1,0}) / se(β̂_1)
Both of these are themselves sample statistics computed from the data y and X. As a result, they each have their own distribution (e.g., normal and chi-squared). The resulting ratio has the Student t distribution, which is used for hypothesis testing in a given sample.
10
Height as function of age
Growth is most rapid in ages 12-16. [Figure: height as a function of AGE]
11
Height as function of age
If we estimated h = α + β·age, the constant growth rate β would overstate growth in the early and later years and understate it during puberty. [Figure: height as a function of AGE]
12
How would we estimate this relationship?
General formula: h_i = f(age_i) + ε_i. The error ε_i allows people of the same age to have different heights: even if we have the correct growth model using age alone, we expect some variation in height conditional on age. To use a linear model, polynomial features allow for a non-linear relationship between the feature (age) and the outcome (height): h_i = α + β_1 age_i + β_2 age_i² + … + β_p age_i^p + ε_i
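For instance, a hedged R sketch (the data frame kids and its columns height and age are hypothetical names): polynomial features can be created by hand with I() or via poly().

  m1 <- lm(height ~ age + I(age^2) + I(age^3), data = kids)   # cubic terms created by hand
  m2 <- lm(height ~ poly(age, 3), data = kids)                # orthogonal polynomial; same fitted values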
13
Linear regression: y_i = α + β_1 g_1(x_{i,1}) + … + β_p g_p(x_{i,p}) + ε_i
Linear regression allows you to use non-linear functions of the features, provided each enters as an additive term with weight β: y_i = α + β_1 g_1(x_{i,1}) + … + β_p g_p(x_{i,p}) + ε_i. The simplest example is where all features enter without any transformation: y_i = α + β_1 x_{i,1} + … + β_p x_{i,p} + ε_i
14
Linear regression: y_i = α + β_1 x_i + ε_i
A linear fit tends to underestimate income for medium-to-high education and overestimate income for low education.
15
Linear regression A cubic polynomial fits the data much better
ŷ_i = α̂ + β̂_1 x_i + β̂_2 x_i² + β̂_3 x_i³. The residual sum of squares (ŷ − y)² is now much smaller.
16
Visualizing in two dimensions
A higher-order polynomial in educ and seniority could approximate this function. y_i = α + β_1 educ_i + β_2 seniority_i + ε_i
17
Steps for linear regression
1. Define the relationship we are trying to model. What is the outcome? What raw features do we have? E.g. h_i = f(age_i) + ε_i
2. Define a linear model to approximate this relationship, e.g. h_i = α + β_1 age_i + β_2 age_i² + … + β_p age_i^p + ε_i
3. Create the necessary features, e.g. age2 = age^2 (or use poly)
4. Estimate the model: mymodel = lm(y ~ age + age2 + … + ageP); summary(mymodel)
5. Evaluate the model with an evaluation metric.
6. Repeat steps 2-5 to improve model fit. (A code sketch of steps 3-5 follows below.)
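A sketch of steps 3-5 in R, assuming the data sit in a hypothetical data frame df with columns height and age:

  df$age2 <- df$age^2                                     # step 3: create polynomial features
  df$age3 <- df$age^3
  mymodel <- lm(height ~ age + age2 + age3, data = df)    # step 4: estimate the model
  summary(mymodel)
  mean((df$height - fitted(mymodel))^2)                   # step 5: evaluate (in-sample mean squared error)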
18
Model prediction
After we have fit a model, we can predict the outcome for any input: ĥ_i = α̂ + β̂_1 age_i + β̂_2 age_i² + … + β̂_p age_i^p. For any given observation, h_i − ĥ_i is the prediction error; (h_i − ĥ_i)² is the squared loss; |h_i − ĥ_i| is the absolute loss.
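Continuing the earlier sketch (mymodel and df are the same hypothetical objects), the prediction error and the two losses look like:

  h_hat <- predict(mymodel, newdata = df)    # predicted heights
  err <- df$height - h_hat                   # prediction error per observation
  sq_loss <- err^2                           # squared loss
  abs_loss <- abs(err)                       # absolute loss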
19
Model Fit. Fit refers to how well our model, an approximation of reality, matches what we actually observe in the data. Mean squared error is the average of the squared loss over the data points (rows) we evaluate.
20
Loss versus Prediction
Penalty \ Sample | In sample (inference & research design) | Out of sample (prediction)
L1 norm | Down-weights outliers | LASSO, cross-validation
L2 norm | OLS estimation | Ridge, cross-validation
21
Training vs. Test. We typically want to "train" the model on the data "we have" and test the model on "new data." This is a realistic test of how well the model predicts outcomes. In practice, we will subset our data: X% training, Y% test; 80% is commonly used for training.
22
Why do we need training vs. test?
[Figure: MSE on the test set vs. MSE on the training set]
23
Why do we need training vs. test?
The green curve "overfits" the data: it fits the noise. Using an "out-of-sample" test set guards against overfitting. The blue curve gets closest to the black (truth) curve.
24
Training vs. test in practice
Randomly select observations (rows) to be in the test set; the remainder are the training set. Why random? We want complete coverage, e.g. we don't want to train on February and test on July. Subset to the training set for all model estimation, e.g. any data.frame you pass to lm. Use the predict command to make predictions (ŷ) for the test set. Compute the evaluation metric on the test set (you have both the observed value and the predicted, or modeled, value). A sketch of this recipe follows below.
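A minimal sketch of that recipe in R (df, height, and age are the same hypothetical placeholders as before):

  set.seed(3)
  test_idx <- sample(nrow(df), size = round(0.2 * nrow(df)))   # random 20% test set
  train <- df[-test_idx, ]
  test <- df[test_idx, ]
  fit <- lm(height ~ poly(age, 3), data = train)               # estimate on training data only
  pred <- predict(fit, newdata = test)                         # predictions for the test set
  test_mse <- mean((test$height - pred)^2)                     # out-of-sample evaluation metric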
25
R-squared
R² = 1 − SSR/TSS, where SSR is the residual sum of squares and TSS is the total variation in the outcome variable. R² is the fraction of the variation in the outcome variable captured by the model.
26
Adjusted R-squared. R-squared is reported for the training data; that is, it is "in sample." The in-sample model fit weakly improves with more explanatory variables.
Adjusted R² = 1 − [(n−1)/(n−k−1)] · (SSR/TSS)
where n is the number of observations and k is the number of explanatory variables. This penalizes adjusted R² when you add variables without explanatory power.
27
A “Fair” R-squared Measure
R-squared is reported for the training data; that is, it is "in sample." We learned that we typically want out-of-sample evaluation metrics, so what do we do? It turns out that R² = corr(y_i, ŷ_i)². This means corr(y_i, ŷ_i)² computed on the test set can be interpreted just like R-squared (% of variation explained) and is a "fair test." This is related to an important concept called over-fitting.
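Following that identity, an out-of-sample R² can be computed on the test set from the earlier sketch (fit, test, and pred are the same hypothetical objects):

  r2_test <- cor(test$height, pred)^2   # squared correlation of observed and predicted outcomes
  r2_test                               # interpretable as % of variation explained, out of sample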
28
Feature types. Binary features: equal to 1 or 0.
Continuous features: can take any real value. Categorical features: can take one of N values, e.g. Saturday, Sunday, Monday, … If you declare the feature with as.factor, R will treat it as a set of binary indicator variables, one per category (e.g. = 1 if Saturday, = 0 otherwise).
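A hedged sketch with a hypothetical data frame df containing an outcome y, a continuous feature x, and a day-of-week column:

  df$day <- as.factor(df$day)          # categorical feature with levels such as Sat, Sun, Mon, ...
  fit <- lm(y ~ x + day, data = df)    # R expands `day` into indicator variables (one level becomes the baseline)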
29
Understand binary features
β_1 allows for a different y-intercept for students: y_i = α + β_1·1{student} + β_2·income_i + ε_i
30
Understand binary features
β_1 allows for a different y-intercept for students, and β_3 allows for a different slope for students: y_i = α + β_1·1{student} + β_2·income_i + β_3·1{student}·income_i + ε_i
31
Interaction terms. An interaction term multiplies two features by each other. When done as {continuous feature}*{binary feature}, it allows a different slope for the group represented by the binary feature. Note: be sure to also include the continuous feature without the interaction. Example: suppose temperature has a different impact on the number of bikeshare trips taken on weekdays vs. weekends. We can create a binary variable is_weekend. If we add is_weekend to the model, we allow a different baseline ridership on weekends. If we add is_weekend*temp, we allow temperature to have a different effect on weekends. We can also "interact" is_weekend with a polynomial in temp; this effectively gives us a separate temperature model for weekends vs. weekdays (see the sketch below).
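A sketch of the bikeshare example in R (the data frame bikes and its columns trips, temp, and is_weekend are assumed names):

  fit1 <- lm(trips ~ temp + is_weekend, data = bikes)             # different baseline for weekends
  fit2 <- lm(trips ~ temp * is_weekend, data = bikes)             # temperature effect also differs on weekends
  fit3 <- lm(trips ~ poly(temp, 2) * is_weekend, data = bikes)    # separate temperature curve by day type

In R, temp * is_weekend expands to temp + is_weekend + temp:is_weekend, so the main effects are kept automatically.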
32
Feature explosion. If we have two raw features, A and B, how many models can we make? Without interactions (4): none, just A, just B, A + B. With interactions (8): the previous four plus A*B, A + A*B, B + A*B, A + B + A*B. If we have p features then we have 2^(p+1) possible models! p = 29 → 1,073,741,824. We cannot possibly run all these models… so we'll learn methods to guide us. Human intelligence is often a great guide as well.
33
Interpreting regression output
summary(my_model) gives the coefficient estimates (betas), standard errors, t-statistics, and p-values. A t-statistic evaluates the hypothesis that the coefficient is equal to zero; thus |t| greater than 1.96 allows us to reject this hypothesis at the 95% level. p-values simply convert t-stats into the probability that we would get an estimate this far from zero due to sampling chance alone. Note that t-statistics evaluate features one by one. Overall model fit is a better guide to explanatory power (e.g. we might be better off leaving insignificant features in sometimes), but t-stats can be a good guide for throwing out irrelevant features.
34
Parametric vs. non-parametric regression
Y = f(X) + ε. Parametric: f is defined by a model we write down, with parameters that we have to estimate (e.g. α, β_1, etc.). Non-parametric: we fit f directly from the data.
35
Local averaging
36
Local averaging vs. bins
37
Kernels. A method of assigning higher weight to the points closer to the target point we are trying to fit. The choice of kernel usually does not matter much.
38
Bandwidths Bandwidths tell us how wide to make our kernel (window)
Larger bandwidths give smoother functions because more data is used to make each average. The bandwidth is sometimes called the smoothing parameter. Bandwidths matter!
39
K-nearest neighbors. An alternative to bandwidths: instead of specifying a fixed window, we can say "use the K nearest neighbors." This ensures each bin has the same amount of data and can be a useful tool. Functionally it will often be quite similar to using a bandwidth, except when there are "sparse data" issues (then K-NN is preferred). It is often expressed as a fraction of the data, e.g. NN = 0.1 says "each bin is 10% of the data."
40
Local polynomial regression
Think of local averaging as fitting a constant for a given window of data. Instead of fitting a constant, we can fit a line (y = a*x + b) or a polynomial (y = a*x + b*x^2 + c), etc. It is thus more general than local averaging, but very similar.
41
Some examples
42
Some examples
43
Local polynomial regression in R: the locfit package. install.packages("locfit"); library(locfit). The modeler needs to specify a bandwidth (or fraction of nearest neighbors) and the degree of the local polynomial. Contrast: in parametric regression, we had to write down a model. Aside: we'll learn that non-parametric methods struggle in higher dimensions. (A sketch follows below.)
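A hedged locfit sketch (x and y are placeholder vectors; the lp() arguments below set the nearest-neighbor fraction and polynomial degree):

  install.packages("locfit")
  library(locfit)
  fit <- locfit(y ~ lp(x, nn = 0.3, deg = 2))   # 30% nearest-neighbor fraction, local quadratic
  plot(fit)                                     # fitted curve
  points(x, y)                                  # overlay the raw data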
44
When to use each model? If there are a relatively small number of features (e.g. < 4), then non-parametric makes a lot of sense. With many features, the "curse of dimensionality" sets in and non-parametric methods fall apart (the "neighborhood" is generally empty). Parametric models using interaction terms and polynomials are the preferred method with many features. "Semi-parametric" methods combine elements of both.
45
Simple versus complex demand models. Motivating with a different form of elasticities, which is the basis for "sales."
46
Intertemporal Substitution Cash for Clunkers: Trade in an old car for well above market value if you purchase a new car. Mian and Sufi (2013) showed that although there was a tremendous increase in new car sales in the months after the offering, this was simply a product of “pulling forward demand” which eroded net gains over time. Ignoring this leads to upwardly biased estimates of program effectiveness.
47
Intertemporal Substitution (cont.). How does this apply to orange juice? What if discounts lead to larger purchases because people are stocking up on the product? Q: Does this matter? A: It depends: if the price-sensitive people would have bought anyway then yes, and if not then no. This is where consumer-level data would be useful.
48
Demand models approximate how customers make choices
Model definition: distilling the relevant aspects of a real-world situation for systematic and quantitative study. Demand models approximate how customers make choices.
49
Model 1: The Classic Demand Curve
How many units of product X will be sold at each price point, all else equal? How many consumers think buying X at price p is better than buying any other product available at current prices? (Assuming unit demand.) [Figure: demand curve with Price and Quantity axes]
50
A common pitfall is to confuse a demand curve with some other graph that happens to have price and quantity on the axes. More expensive models tended to have higher sales; this does not mean consumers like price increases!
51
What does a demand curve not model?
52
What does a demand curve not model? Any form of competition
(it's held fixed)
53
What does a demand curve not model? Any form of consumer segmentation
(there's just one curve)
54
Competition All products have to compete to get bought…
Consumers don't have to buy, and won't if the price is too high. Consumers could buy something else, so the prices of substitute products are relevant. Consumers can wait until tomorrow to buy, so future prices of all products are relevant.
55
Segmentation. Buyers vary in their price sensitivity and in how they value different product features (e.g. enterprise vs. consumer markets). We essentially get a different curve for each segment, and can maybe set different prices for each. For measurement: can we find buyer groups g1, g2, etc. and estimate demand separately by group?
56
Isolating important aspects of demand
Which features of the product drive consumer valuations? Do they differ across well-defined segments? Empirical methods will help us learn which features matter to consumers. For product versioning, the most important features are those that can easily be varied based on engineering requirements (ex. dual SIM in phones). How sensitive are customers to price? Elasticity of demand: if price drops X%, quantity demanded goes up Y%. Are there various segments with different price sensitivity? Ex. customers by country, income level, student status, etc.; ex. "mission critical" database needs vs. the average job. How does price sensitivity vary along the demand curve? Ex. when generic drugs are released, the brand-name price goes up: price-sensitive customers will always opt for the cheaper alternative, so those that remain have higher valuations and less sensitivity to price.
57
Model 2: The Logit Demand Model
It models demand in markets rather than demand for a particular product. There are J products that compete in a market (e.g. smartphones). Each product has an attractiveness a_j that depends on its features x_j and price p_j.
58
Model 2: The Logit Demand Model
A product's market share s_j depends on how attractive it is relative to its competitors: s_j = exp(a_j) / Σ_k exp(a_k). So with 3 products, product 1 gets s_1 = exp(a_1) / (exp(a_1) + exp(a_2) + exp(a_3)).
59
An Example
60
An Example. Model assumptions: utility from owning a smartphone = 4; additional utility from an iPhone = 1; change in attractiveness from a $1 increase in price = -0.008.
Product | Price | Baseline Utility | Attractiveness (net of price) | Market Share
iPhone | 650 | 5 | -0.2 | 0.35
Rest | 450 | 4 | 0.4 | 0.65
61
An Example. Model assumptions: utility from owning a smartphone = 4; additional utility from an iPhone = 1; change in attractiveness from a $1 increase in price = -0.008.
Product | Price | Baseline Utility | Attractiveness (net of price) | Market Share
iPhone | 700 | 5 | -0.6 | 0.27
Rest | 450 | 4 | 0.4 | 0.73
62
An Example. Model assumptions: utility from owning a smartphone = 4; additional utility from an iPhone = 1; change in attractiveness from a $1 increase in price = -0.008.
Product | Price | Baseline Utility | Attractiveness (net of price) | Market Share
iPhone | 650 | 5 | -0.2 | 0.14
Rest | 350 | 4 | 1.6 | 0.86
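These shares can be reproduced from the logit formula; a small R sketch using the attractiveness numbers in the tables above (e.g. for the iPhone, 5 − 0.008·650 = −0.2):

  logit_share <- function(a) exp(a) / sum(exp(a))   # a = vector of attractiveness, one entry per product
  logit_share(c(-0.2, 0.4))    # baseline scenario:      ~0.35 / 0.65
  logit_share(c(-0.6, 0.4))    # iPhone priced at $700:  ~0.27 / 0.73
  logit_share(c(-0.2, 1.6))    # more attractive rival:  ~0.14 / 0.86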
63
Model 2: Logit Demand Model
Advantage: enables richer scenario planning, for price changes on one or multiple products simultaneously Can simulate impact of price changes, changing price sensitivity and raising product attractiveness
64
Which model to pick? Two basic models of demand.
The demand curve is simple and in principle requires only own-price and sales data to estimate. A logit model (and more complex approaches) needs market-level data (i.e. sales and prices for all products).
65
Which model to pick? We’re doing all this to optimize prices, so….
If we have a single product to price, and if there are competitor products whose prices are unlikely to change, then estimating a demand curve is just fine.
66
Which model to pick? We’re doing all this to optimize prices, so….
If we have multiple competing products to price, then we need to model cannibalization, so we need a logit. Or, if competitors are likely to respond and we want to model likely future scenarios, we need a logit.
67
Bigger picture. Does your demand model match the proposed pricing strategy (competing product set, segments, revenue model)? Do you have a way of learning the parameters of this demand model?
68
More complicated models
Models with multiple latent (not directly observable) consumer types. Models of dynamic demand (forward-looking consumers). Models of consumer search.
69
Estimating a demand model
70
How do you estimate a demand curve?
[Figure: demand curve with Price and Quantity axes]
71
How do you estimate a demand curve?
What you'd like is some data on prices and sales while everything else is "held constant" (e.g. experiments…). The problem is that price changes are often correlated with other changes in the environment, so everything else is not held constant.
72
Christmas is bad for econometrics
[Figure: prices and sales over time around Christmas]
75
Spurious correlations
In general, price and sales are related in so many ways other than through the demand channel that these sorts of regressions are literally the textbook example of an endogenous regression.
76
Good Data not “Big Data”
In the age of big data, everyone is scrambling to collect all the data they can. But it's really data quality rather than quantity that matters… more data doesn't magically make the spurious correlations go away; you just measure the spurious correlations very precisely!
77
Getting good data Hypothetical surveys…how can we make them realistic?
Field experiments… when do they work? Detailed historical logs… what can we do with them? How to remove spurious correlation?
78
How can I know something about demand for a new product?
Learn something about features/aspects of new products that are present in existing products. Surveys and conjoint analysis methods. Pilots, limited releases, and initial experimentation.
79
Historical Data Working with historical data has obvious advantages
- the data may already be available
- it reflects the choices of actual market participants
- no survey / field experiment design is needed
But the problem of spurious correlations remains.
80
Historical data is very useful, but the insights from correlations in it can be random and even misleading.
83
Misleading….
Call: lm(formula = logmove.minute.maid ~ log(price.dominicks) * feat.dominicks + log(price.minute.maid) * feat.minute.maid + log(price.tropicana) * feat.tropicana, data = temp)
[R summary() output; the numeric estimates did not survive extraction. The intercept, all log-price and feature main effects, and the minute.maid and tropicana price-feature interactions are significant at p < 2e-16; feat.dominicks is significant at the 0.01 level; the dominicks interaction shows no significance stars. Residual standard error on 9639 degrees of freedom; F-statistic on 9 and 9639 DF, p-value < 2.2e-16.]
84
Misleading…. log(price.minute.maid):feat.minute.maid -1.14745
The interaction log(price.minute.maid):feat.minute.maid is significant at p < 2e-16. Do we really believe this? It says that demand is more elastic when a product is featured. Alternative explanation: when a product is featured it is more likely to be discounted. If so, this could conflate the feature effect with the elasticity effect (e.g., the product might be undiscounted but featured only on particularly odd dates). Another example: using Christmas advertising to estimate the effect of advertising on sales.
87
Random events can come from lots of sources: experimental variation, nature (“quasi-experimental” variation), arbitrary cutoffs in eligibility for a program, etc…
88
This is really important for policy: policy changes often occur in isolation of other changes.
- As a result, knowing causal relationships rather than correlations is extra important.
89
y_i = α + x_i β + ε_i
plim β̂_OLS = β + cov(ε_i, x_i) / Var(x_i)
cov(ε_i, x_i) > 0 → biased up, positive selection (e.g. the innately talented get more schooling)
cov(ε_i, x_i) = 0 → unbiased (this could be a fluke though!)
cov(ε_i, x_i) < 0 → biased down, negative selection (the innately untalented get more schooling)
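A small simulation sketch of this point (all numbers are purely illustrative): when the regressor is positively correlated with the error term, OLS is biased upward.

  set.seed(4)
  n <- 5000
  ability <- rnorm(n)                          # unobserved; ends up in the error term
  x <- 2 + 0.5 * ability + rnorm(n)            # e.g. schooling chosen partly on ability, so cov(x, eps) > 0
  y <- 1 + 1.0 * x + 2 * ability + rnorm(n)    # true beta = 1
  coef(lm(y ~ x))                              # estimated slope lands well above 1 (positive selection)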
90
y_i = α + x_i β + ε_i. An alternative view, following the Angrist and Pischke textbook. Note 1: "treated" could be an extra year of education, more advertising impressions, etc. Note 2: we can think of x_i as an indicator variable.
Outcome for treated − Outcome for untreated
= [Outcome for treated − Outcome for treated if not treated] + [Outcome for treated if not treated − Outcome for untreated]
= Impact of Treatment on the Treated (TOT) + Selection Bias
91
plim β̂_OLS = β + cov(ε_i, x_i) / Var(x_i)
y_i = α + x_i β + ε_i;  plim β̂_OLS = β + cov(ε_i, x_i) / Var(x_i). We'd like to enforce that the random component unobserved by the econometrician has no correlation/covariance with the independent variables. This isolates the treatment effect from the selection bias. The cleanest way to do this is with an experimental design, because it provides the right counterfactual to compare the treated group to.
92
What is a counterfactual?
What would have happened had something else not happened. Valuable in economics: 1) What would profits/welfare have been if my price had been 10% higher? 2) What would profits/welfare be if I shut down a division? In law, it is the "but for" principle. Twin studies and lab experimental protocols are used to get good counterfactuals.
93
What’s in a counterfactual?
What was the impact of these events on the firms affected by them? We saw what happened. What would have happened had these events not occurred and everything else remained constant? What is the counterfactual? BP's share price, 1994-current. Walmart's share price, March 2015-current.
94
What’s in a counterfactual?
Walmart's share price, March 2015-current. In this simple example, here are three research designs that can be used to construct a counterfactual when treatment is non-random:
1) Benchmarking (differences-in-differences): r_t^W = α + β r_t^{S&P} + ε_t = f({r_{t−s}^{S&P}}, s = 1,…,n). Traditionally used in economics.
2) Time series ("simulated" counterfactual): r_t^W = ν + φ r_{t−1}^W + ε_t = g({r_{t−s}^W}, s = 1,…,n). Estimate, then compare. Traditionally used in economics.
3) Kitchen sink ("machine learning"): r_t^W = h({r_{t−s}^W}, {r_{t−s}^{S&P}}, Z_t; Θ, L). Train, test, treat, compare.
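A sketch of the benchmarking design in R (the data frame returns, its columns wmt, sp500, and date, and the cutoff event_date are all hypothetical): fit the pre-event relationship, then use it to predict a no-event counterfactual for the post-event window.

  pre <- subset(returns, date < event_date)             # pre-event window (event_date is an assumed cutoff)
  post <- subset(returns, date >= event_date)           # post-event window
  bench <- lm(wmt ~ sp500, data = pre)                  # benchmark relationship estimated pre-event
  counterfactual <- predict(bench, newdata = post)      # predicted Walmart returns absent the event
  effect <- mean(post$wmt - counterfactual)             # average abnormal return after the event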
95
What’s in a counterfactual?
Each method relies on assumptions about what can be inferred. Examples: benchmarking assumes that, in the absence of the event, the historical relationships are maintained; all designs assume nothing else occurred at the time of the event (note: not true for the Walmart example, which had an EPS release the same day). What is inference? A precise characterization of what can be learned. How much do dolphins love humans? Perhaps not at all…
96
Example: What is causal impact of a sale
Consider the following experimental design regarding the effectiveness of a sale. Assume store neighborhoods are the "experimental unit." 1{treated} = 1 → receive a flyer with information about the sale; 1{treated} = 0 → receive nothing. y_i = α + 1{treated_i} β + ε_i, where β is the causal treatment effect; but what is it? Claim: this is a joint effect of advertising and price, and therefore less informative.
97
Example: What is causal impact of a sale
(receives flyer) vs. (doesn't receive flyer). Within the treatment arm, a household can get a flyer with information about the sale or a flyer without that information. All treated stores have their products priced identically, though. Now there are three arms and each one is randomly assigned (e.g., 33% of stores in C, 33% in T-NoInfo, and 33% in T-Info). In this case γ is the causal impact of a sale given a particular level of advertisement: y_i = α + 1{treated_i} β + 1{SaleInfo_i} γ + ε_i
98
Misleading…. The slide shows R output from: lm(formula = logmove.minute.maid ~ log(price.dominicks) * feat.dominicks + log(price.minute.maid) * feat.minute.maid + log(price.tropicana) * feat.tropicana, data = temp). The numeric estimates did not survive the transcript; what remains is that the intercept, all three log-price terms, all three feature indicators, and the Minute Maid and Tropicana price-by-feature interactions are flagged as significant (feat.dominicks at the 1% level, the others at p < 2e-16), while the Dominick's price-by-feature interaction is not, with 9 model terms on 9639 residual degrees of freedom.
99
Quasi-experiments/Natural experiments
Experiments don’t always have to be designed from the top down. Can use: - Changes in internal goals, metrics, budgets - Changes in policy (e.g., who is eligible for what prices) - Changes in production costs All of these affect prices without (hopefully) being systematically related to demand
100
EX: price “experiments”
Price experiments can be run over time (otherwise identical time periods, different prices) or over space (otherwise identical markets, different prices). The time periods or markets don't actually need to be identical, just identical apart from the things that you have data on and control for (i.e., you can't have lots of other things moving around).
101
ML != Econometrics. Machine learning focuses on training models for predictive accuracy. Good for: "How much will I sell next week?" It seems very similar to economics/econometrics, but econometrics focuses on parameter estimation and strives for "causal" estimates of interpretable effects. Good for: "How much more will I sell if I drop price? How certain am I about that impact?"
102
Consequence of different focus
ML models may not have individual feature "estimates". Estimates may be systematically "too low" or "too high" (bias). They don't quantify uncertainty the same way: no classical standard errors or confidence intervals. But neither approach is always better → learn when to use each.
103
Example ML concerns. We will look at linear models (LASSO and Ridge Regression), over-fitting, and cross-validation.
104
Linear Models: LASSO and Ridge
Standard OLS regression: $\min_{\beta} \sum_i (y_i - x_i \beta)^2$. LASSO: $\min_{\beta} \sum_i (y_i - x_i \beta)^2 + \lambda \sum_p |\beta_p|$. Ridge regression: $\min_{\beta} \sum_i (y_i - x_i \beta)^2 + \lambda \sum_p \beta_p^2$. LASSO and Ridge both penalize coefficients towards 0. $\lambda$ controls the penalization: a big $\lambda$ pushes most coefficients towards 0.
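For concreteness, a minimal sketch of the two penalties using the glmnet package (introduced formally later in the deck); x and y are placeholder inputs, and the single lambda value is arbitrary. alpha = 1 gives LASSO, alpha = 0 gives Ridge.

library(glmnet)
# x: numeric feature matrix, y: outcome vector (placeholders here)
lasso <- glmnet(x, y, alpha = 1, lambda = 0.1)  # L1 penalty: some coefficients exactly 0
ridge <- glmnet(x, y, alpha = 0, lambda = 0.1)  # L2 penalty: coefficients shrunk, not zeroed
cbind(lasso = as.numeric(coef(lasso)),
      ridge = as.numeric(coef(ridge)))          # compare the two sets of coefficients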
105
Coefficient Budget and Penalization
𝜆 (inversely) determines the "budget for coefficients": "small 𝜆" = "big budget" (close to OLS). The form of the penalty affects the shape of the trade-off between parameters. (The slide's figure contrasts the LASSO and Ridge penalty regions.)
106
Penalization and Bias Ridge will smoothly make coefficients small
LASSO will push many coefficients all the way to 0 ("model selection"). (The slide's figure contrasts LASSO and Ridge.)
107
Overfitting: ML models are very flexible and try to fit all patterns.
Great when learning the main relationships, but they can go too far and fit minor quirks in the data. There is always some "noise" in a particular sample, and the world changes, so many small things in the past won't generalize to the future. Overfitting makes a model perform worse on new data (out-of-sample). We want just the "right" amount of flexibility; for linear models this is controlled by 𝜆.
108
Estimating Prediction Error
Quantify overfitting by estimating the prediction error $\widehat{Error}(\lambda)$. k-fold cross-validation: split the data randomly into k portions (e.g., 5 or 10). For each s in {1,…,k}: train the model on all but portion s (most of the data), then use the trained model to calculate the error on portion s; call that $Error(\lambda, s)$. Then $\widehat{Error}(\lambda) = \frac{1}{k} \sum_s Error(\lambda, s)$.
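A minimal "by hand" sketch of this procedure for a single λ, assuming a numeric feature matrix x, an outcome vector y, and a LASSO fit from glmnet; the mean squared error plays the role of Error(λ, s).

library(glmnet)
cv_error <- function(x, y, lambda, k = 5) {
  folds <- sample(rep(1:k, length.out = nrow(x)))   # random fold assignment
  errs <- sapply(1:k, function(s) {
    fit  <- glmnet(x[folds != s, ], y[folds != s], alpha = 1, lambda = lambda)
    pred <- predict(fit, newx = x[folds == s, ])
    mean((y[folds == s] - pred)^2)                  # Error(lambda, s) on the held-out fold
  })
  mean(errs)                                        # Error(lambda) = average over the k folds
}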
109
Overfitting Solution: pick $\lambda$ to minimize the out-of-sample error $\widehat{Error}(\lambda)$!
110
Choosing ML Models Lots of different methods
LASSO, artificial neural nets, support vector machines, trees. Which does a good job (predictively) in my domain? Do I need interpretable features? More interpretable: LASSO, Ridge regression. Less interpretable ("black box"): neural nets. The Pricing Engine has parts where both can be used.
111
Pricing – Combining ML and Econometrics
112
Motivation Everyone’s excited about ML now.
Estimate a LASSO of Quantity on Price and Controls. Will this work (give unbiased estimates of price sensitivity)? No! It gives biased coefficients.
113
Recall that LASSO and Ridge give biased estimates; bias is common in ML models.
114
Plan for the section. Statistical problem: ML models give biased coefficients. Walk through incorrect solutions. Correct solution: split the prediction part out from the parameter estimation (the double-residual procedure). Judging model results. Using model results.
115
Bad Method 2: estimate a LASSO of Q on P and Controls, but don't penalize the coefficient on P. Will it work? No! We need to control for all variables correlated with both P and Q; otherwise an omitted variable will bias our estimate. It is "OK" to miss variables weakly correlated with both, but if either correlation is strong, we need to catch it. LASSO only keeps variables highly correlated with Q, not those highly correlated with P, so it doesn't control omitted variable bias.
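To make the critiqued method concrete, here is roughly what "don't penalize the coefficient on P" looks like with glmnet's penalty.factor argument; the point of the slide stands: this still does not control omitted variable bias. Price is assumed (for illustration) to be the first column of the feature matrix x.

library(glmnet)
# "Bad Method 2" made concrete; this is the approach the slide argues against.
pf  <- c(0, rep(1, ncol(x) - 1))                 # 0 = no penalty on the price column
bad <- glmnet(x, y, alpha = 1, penalty.factor = pf)
# The price coefficient is still unreliable: the LASSO keeps controls that
# predict Q, but may drop controls that are strongly correlated with P.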
116
Intuition from OLS: alternative OLS steps.
Single-step method: estimate $Q = \beta P + \gamma X + \epsilon$.
Multi-step method: 1) remove the portion of Price predictable by $X$, leaving the residual $\tilde{P}$; 2) remove the portion of Quantity predictable by $X$, leaving $\tilde{Q}$; 3) estimate $\tilde{Q} = \beta \tilde{P} + \epsilon$.
Both give equivalent results for $\beta$ when using OLS. LASSO "didn't do" step 1, and we need both chances to catch a confounder such as "Christmas". But steps 1 and 2 can be done with ML!
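A minimal sketch of that equivalence (the Frisch-Waugh-Lovell result), assuming a hypothetical data frame df with quantity Q, price P, and two controls X1 and X2.

one_step <- coef(lm(Q ~ P + X1 + X2, data = df))["P"]     # single-step estimate
P_tilde  <- resid(lm(P ~ X1 + X2, data = df))             # step 1: purge P of X
Q_tilde  <- resid(lm(Q ~ X1 + X2, data = df))             # step 2: purge Q of X
two_step <- coef(lm(Q_tilde ~ P_tilde))["P_tilde"]        # step 3: residual on residual
all.equal(unname(one_step), unname(two_step))             # TRUE up to numerical precision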
117
Approach: Double-ML Can use lots of ML procedures
Lots of causality problems can be subdivided this way. Used by the Pricing Engine.
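A minimal double-residual sketch, using LASSO for both first-stage predictions; the numeric control matrix X, price vector P, and quantity vector Q are placeholder inputs, and refinements such as cross-fitting and valid standard errors are omitted.

library(glmnet)
P_hat <- predict(cv.glmnet(X, P, alpha = 1), newx = X, s = "lambda.min")  # predict P from controls
Q_hat <- predict(cv.glmnet(X, Q, alpha = 1), newx = X, s = "lambda.min")  # predict Q from controls
P_tilde <- as.numeric(P - P_hat)            # price variation not explained by controls
Q_tilde <- as.numeric(Q - Q_hat)            # quantity variation not explained by controls
coef(lm(Q_tilde ~ P_tilde))["P_tilde"]      # estimated price effect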
118
Validation set methodology
Train the model with a subset of the data; test the model on the remaining data (the validation set). What data should we choose for training vs. test? With a time-series dimension, it is natural to hold out the last year (or time period) of the data, to simulate predicting the future from all past data. In most settings, however, we'll randomly select our training/test sets.
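A minimal sketch of a random training/test split, assuming a hypothetical data frame df with an outcome column y; an 80/20 split is used for illustration.

set.seed(42)
n_obs     <- nrow(df)
train_idx <- sample(n_obs, size = floor(0.8 * n_obs))     # 80% of rows for training
train <- df[train_idx, ]
test  <- df[-train_idx, ]
fit <- lm(y ~ ., data = train)                            # train on the subset
mean((test$y - predict(fit, newdata = test))^2)           # MSE on the held-out validation set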
119
Cross-validation: a class of methods that does many training/test splits and averages over all the runs. Here is a simple example of 5-fold cross-validation: it gives 5 test sets and 5 estimates of MSE, and the 5-fold CV estimate is obtained by averaging these values.
120
K-fold cross-validation
Split the data up into K "folds". Iteratively leave fold k out of the training data and use it to test. The more folds, the smaller each test set is (and the more training data), but the more times we need to run the estimation procedure. Rules of thumb like 5-10 folds are often used in practice. This can be done with a simple for loop in R. For generalized linear models, the cv.glm() function can be used to perform k-fold cross-validation. For example, the slide's code loops over 10 possible polynomial orders and computes the 10-fold cross-validated error at each step.
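The slide's own code is not reproduced in this transcript; a sketch in the same spirit, assuming a data frame df with outcome y and predictor x, is:

library(boot)   # provides cv.glm()
cv_error <- rep(NA, 10)
for (i in 1:10) {
  fit <- glm(y ~ poly(x, i), data = df)             # polynomial of order i
  cv_error[i] <- cv.glm(df, fit, K = 10)$delta[1]   # 10-fold CV estimate of the error
}
cv_error                                            # pick the order with the lowest CV error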
121
Ridge and Lasso in R We will use the glmnet package
We will use the glmnet package. glmnet() works for generalized linear models (OLS, logit, etc.). By default, the function does the estimation for a range of 𝜆's (you can specify this range if you like); the fitted values are stored in my_model$lambda. We can then use k-fold cross-validation on each run of the model in order to select the best one. In the first lab, we'll do this "by hand"; in the second, we'll get the package to do it for us.
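A minimal sketch of that workflow, with x and y as placeholder inputs: glmnet() fits a whole path of λ's, and cv.glmnet() chooses among them by k-fold cross-validation.

library(glmnet)
my_model <- glmnet(x, y, alpha = 1)       # LASSO over an automatically chosen lambda grid
my_model$lambda                           # the grid of lambdas that was fit
cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 10)
cv_fit$lambda.min                         # lambda with the lowest cross-validated error
coef(cv_fit, s = "lambda.min")            # coefficients at that lambda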