LECTURE 13: LINEAR MODEL SELECTION PT. 3 March 9, 2016 SDS 293 Machine Learning

Announcements A3 feedback is out; let me know if you didn't receive it. A4 solution set is up. By popular demand, I'm going to start holding additional lab-style office hours. 3 proposed times/locations: Mondays from 4pm to 6pm, Wednesdays from 4pm to 6pm, or Thursdays from 5pm to 7pm. Vote on Piazza!

Outline Model selection: alternatives to least-squares. Subset selection: best subset; stepwise selection (forward and backward); estimating error using cross-validation. Shrinkage methods: ridge regression and the lasso. Dimension reduction (B. Miller). Labs for each part.

Flashback: subset selection Big idea: if having too many predictors is the problem, maybe we can get rid of some. Three methods: - Best subset: try all possible combinations of predictors - Forward: start with no predictors, greedily add one at a time - Backward: start with all predictors, greedily remove one at a time. Common theme of subset selection: ultimately, individual predictors are either IN or OUT.
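As a refresher, here is a minimal sketch of the forward variant. It is illustrative only (not the course's lab code); it assumes scikit-learn's LinearRegression and simulated data, and it returns one candidate model per size, leaving the choice among them to cross-validation or an adjusted error criterion.

```python
# Illustrative forward stepwise selection: greedily add the predictor that
# most reduces training RSS, yielding one candidate model per model size.
import numpy as np
from sklearn.linear_model import LinearRegression

def forward_stepwise(X, y):
    n, p = X.shape
    selected, remaining, path = [], list(range(p)), []
    while remaining:
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = selected + [j]
            pred = LinearRegression().fit(X[:, cols], y).predict(X[:, cols])
            rss = np.sum((y - pred) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
        path.append(list(selected))
    return path  # candidate models of size 1, 2, ..., p

# Tiny simulated example: only the first two predictors actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=50)
print(forward_stepwise(X, y))
```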

Discussion Question: what potential problems do you see? Answer: we’re exploring the space of possible models as if there were only finitely many of them, but there are actually infinitely many (why?)

New approach: “regularization” (constrain the coefficients). Another way to phrase it: reward models that shrink the coefficient estimates toward zero (and still perform well, of course)

Approach 1: ridge regression Big idea: minimize RSS plus an additional penalty that rewards small (sum of) coefficient values: $\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = \mathrm{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2$. The first term sums, over all observations, the squared difference between the observed value $y_i$ and the predicted value $\beta_0 + \sum_j \beta_j x_{ij}$ (this is the RSS); $\lambda$ is the tuning parameter, and the shrinkage penalty $\lambda\sum_{j=1}^{p}\beta_j^2$ (a sum over all predictors) rewards coefficients close to zero. * In statistical / linear algebraic parlance, this is an ℓ2 penalty
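To make the objective concrete, here is a minimal sketch of fitting ridge regression for a single value of λ. It uses scikit-learn (an assumption for illustration; today's lab uses glmnet in R), which calls the tuning parameter alpha, and the data are simulated purely for demonstration.

```python
# Ridge regression for one value of the tuning parameter (lambda).
# scikit-learn calls the tuning parameter `alpha`; the objective it minimizes
# is RSS + alpha * sum(beta_j^2), i.e. the shrinkage penalty above
# (the intercept is not penalized).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))          # simulated predictors
y = X @ rng.normal(size=10) + rng.normal(size=100)

ridge = Ridge(alpha=10.0)               # alpha plays the role of lambda
ridge.fit(X, y)
print(ridge.coef_)                      # shrunken (but nonzero) coefficient estimates
```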

Approach 1: ridge regression For each value of λ, we only have to fit one model: substantial computational savings over best subset!

Approach 1: ridge regression Question: what happens when the tuning parameter is small? Answer: turns into simple least-squares; just minimizing RSS!

Approach 1: ridge regression Question: what happens when the tuning parameter is large? Answer: all coefficients go to zero; turns into the null model!
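The two limits can be checked numerically. The sketch below (again scikit-learn on simulated data, both assumptions for illustration) fits ridge over a range of λ values: for tiny λ the coefficients approach the least-squares fit, and for huge λ they are shrunk toward zero.

```python
# Illustration of the two limits discussed above: as lambda -> 0 ridge
# approaches least squares, and as lambda grows the coefficients shrink to zero.
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(size=100)

ols = LinearRegression().fit(X, y)
for lam in [0.001, 1.0, 100.0, 10000.0]:
    coefs = Ridge(alpha=lam).fit(X, y).coef_
    print(f"lambda={lam:>8}: max |beta_j| = {np.abs(coefs).max():.3f}")
print(f"least squares: max |beta_j| = {np.abs(ols.coef_).max():.3f}")
```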

Ridge regression: caveat RSS is scale-invariant* Question: is this true of the shrinkage penalty? Answer: no! This means having predictors at different scales would influence our estimate… we need to first standardize the predictors by dividing by the standard deviation. (* multiplying any predictor by a constant doesn’t matter)
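A minimal sketch of that preprocessing step, assuming scikit-learn's StandardScaler in a pipeline (the package choice and the simulated data are illustrative):

```python
# Because the shrinkage penalty is not scale-invariant, standardize predictors
# (divide each by its standard deviation) before fitting ridge regression.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3)) * np.array([1.0, 100.0, 0.01])  # wildly different scales
y = X @ np.array([2.0, 0.02, 200.0]) + rng.normal(size=100)

model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
model.fit(X, y)
print(model.named_steps["ridge"].coef_)  # coefficients on the standardized scale
```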

Discussion Question: why would ridge regression improve the fit over least-squares regression? Answer: as usual, it comes down to the bias-variance tradeoff. - As λ increases, flexibility decreases: ↓ variance, ↑ bias - As λ decreases, flexibility increases: ↑ variance, ↓ bias - Takeaway: ridge regression works best in situations where the least squares estimates have high variance: it trades a small increase in bias for a large reduction in variance

Discussion Question: so what’s the catch? Answer: ridge regression doesn’t actually perform variable selection – our final model will include all predictors - If all we care about is prediction accuracy, this isn’t a problem - It does, however, pose a challenge for model interpretation

Approach 2: the lasso (same) Big idea: minimize RSS plus an additional penalty that rewards small (sum of) coefficient values: $\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = \mathrm{RSS} + \lambda\sum_{j=1}^{p}|\beta_j|$, where $\lambda$ is the tuning parameter and the shrinkage penalty $\lambda\sum_{j=1}^{p}|\beta_j|$ rewards coefficients close to zero. * In statistical / linear algebraic parlance, this is an ℓ1 penalty
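Unlike ridge, the ℓ1 penalty can drive some coefficients exactly to zero. A minimal sketch, assuming scikit-learn's Lasso (whose alpha corresponds to λ up to a 1/(2n) scaling of the RSS term) and a simulated sparse dataset:

```python
# The lasso's l1 penalty can set some coefficients exactly to zero,
# performing variable selection as well as shrinkage.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 10))
beta = np.zeros(10)
beta[:3] = [4.0, -3.0, 2.0]             # only 3 predictors matter
y = X @ beta + rng.normal(size=100)

# scikit-learn scales the RSS term by 1/(2n), so alpha is lambda up to that factor
lasso = Lasso(alpha=0.5)
lasso.fit(X, y)
print(lasso.coef_)                      # most irrelevant entries are exactly 0.0
print("nonzero predictors:", np.flatnonzero(lasso.coef_))
```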

Discussion Question: wait a second: why would that enable us to get coefficients exactly equal to zero?

Answer: let’s reformulate a bit For each value of λ, there exists a value of s such that the penalized problems above can be written as constrained problems. Ridge regression: minimize RSS subject to $\sum_{j=1}^{p}\beta_j^2 \le s$. Lasso: minimize RSS subject to $\sum_{j=1}^{p}|\beta_j| \le s$.

Answer: let’s reformulate a bit Now let’s compare the constraint functions. [Figure: ridge regression vs. the lasso, showing each method’s constraint region together with the coefficient estimates and the common RSS contours. The lasso’s region $\sum_j|\beta_j| \le s$ is a diamond with corners on the axes, while ridge’s region $\sum_j\beta_j^2 \le s$ is a circle; the RSS contours often first touch the diamond at a corner, where some coefficients are exactly zero, but touch the circle at a point where none are.]

Comparing ridge regression and the lasso Efficient implementations exist for both (in R and python!) Both significantly reduce variance at the expense of a small increase in bias. Question: when would one outperform the other? Answer: - When there are relatively many equally-important predictors, ridge regression will dominate - When there is a small number of important predictors and many others that are not useful, the lasso will win
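A rough, purely illustrative comparison of the two settings is sketched below, again with scikit-learn and simulated data; the λ values are fixed arbitrarily here (in practice they would be tuned by cross-validation, as discussed next), so the printed numbers are only suggestive.

```python
# Rough comparison of ridge vs. lasso in the two settings described above,
# using a held-out test set; the data-generating choices are arbitrary.
import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.model_selection import train_test_split

def test_mse(model, X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    pred = model.fit(X_tr, y_tr).predict(X_te)
    return np.mean((y_te - pred) ** 2)

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 20))

# Setting 1: many equally important predictors (ridge tends to do better here)
y_dense = X @ np.full(20, 1.0) + rng.normal(size=200)
# Setting 2: a few important predictors, the rest useless (lasso tends to win here)
beta_sparse = np.zeros(20)
beta_sparse[:3] = 5.0
y_sparse = X @ beta_sparse + rng.normal(size=200)

for name, y in [("dense", y_dense), ("sparse", y_sparse)]:
    print(name,
          "ridge:", round(test_mse(Ridge(alpha=1.0), X, y), 2),
          "lasso:", round(test_mse(Lasso(alpha=0.5), X, y), 2))
```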

Lingering concern… Question: how do we choose the right value of λ? Answer: sweep and cross-validate! - Because we are only fitting a single model for each λ, we can afford to try lots of possible values to find the best (“sweeping”) - For each λ we test, we’ll want to calculate the cross-validation error to make sure the performance is consistent
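A minimal sketch of that sweep, assuming scikit-learn's RidgeCV and LassoCV (which handle the grid search and cross-validation internally); the grid and data are arbitrary choices for illustration:

```python
# Sweep a grid of lambda values and pick the one with the lowest
# cross-validation error (RidgeCV / LassoCV do the sweep internally).
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 10))
y = X @ rng.normal(size=10) + rng.normal(size=100)

lambdas = np.logspace(-3, 3, 100)          # the grid to "sweep"
ridge_cv = RidgeCV(alphas=lambdas).fit(X, y)
lasso_cv = LassoCV(alphas=lambdas, cv=10).fit(X, y)
print("best lambda (ridge):", ridge_cv.alpha_)
print("best lambda (lasso):", lasso_cv.alpha_)
```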

Lab: ridge regression & the lasso To do today’s lab in R: glmnet To do today’s lab in python: Instructions and code: Full version can be found beginning on p. 251 of ISLR

Coming up When we return: - Real-world applications of these techniques - Dimension reduction methods Have a wonderful break!