Machine Learning & Data Mining
CS/CNS/EE 155
Lecture 3: Regularization, Sparsity & Lasso
Homework 1
Check the course website!
– Some coding required
– Some plotting required (I recommend Matlab)
– Has supplementary datasets
– Submit via Moodle (due Jan 20)
Recap: Complete Pipeline
Training Data → Model Class(es) → Loss Function → Cross Validation & Model Selection → Profit!
Different Model Classes?
Where do the model classes for Cross Validation & Model Selection come from?
– Option 1: SVMs vs ANNs vs LR vs LS
– Option 2: Regularization
Notation
L0 Norm: $\|w\|_0$ = # of non-zero entries
L1 Norm: $\|w\|_1 = \sum_d |w_d|$ (sum of absolute values)
Squared L2 Norm: $\|w\|_2^2 = \sum_d w_d^2$ (sum of squares)
L2 Norm: $\|w\|_2 = \sqrt{\sum_d w_d^2}$ (sqrt of sum of squares)
L-infinity Norm: $\|w\|_\infty = \max_d |w_d|$ (max absolute value)
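A quick numpy illustration of these norms (the example vector is made up):

```python
import numpy as np

w = np.array([0.0, 3.0, -4.0, 0.0])

l0   = np.count_nonzero(w)   # L0 "norm": # of non-zero entries -> 2
l1   = np.abs(w).sum()       # L1 norm: sum of absolute values  -> 7.0
l2sq = np.dot(w, w)          # squared L2: sum of squares       -> 25.0
l2   = np.sqrt(l2sq)         # L2 norm: sqrt of sum of squares  -> 5.0
linf = np.abs(w).max()       # L-infinity: max absolute value   -> 4.0
```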
Notation Part 2
Minimizing Squared Loss
– Regression
– Least-Squares
– (Unless otherwise stated)
E.g., Logistic Regression = Log Loss
Ridge Regression
aka L2-Regularized Regression:
$\operatorname{argmin}_w \; \underbrace{\lambda \|w\|_2^2}_{\text{Regularization}} + \underbrace{\textstyle\sum_i (y_i - w^T x_i - b)^2}_{\text{Training Loss}}$
Trades off model complexity vs training loss
Each choice of λ is a "model class"
– Will discuss this further later
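A minimal sketch of the ridge solution via the normal equations; the toy data and λ values are invented for illustration, and b is omitted as in the slides' simplifications:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed form for argmin_w lam*||w||_2^2 + ||y - Xw||^2."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=20)

for lam in [0.0, 1.0, 100.0]:
    print(lam, np.linalg.norm(ridge(X, y, lam)))  # ||w|| shrinks as lam grows
```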
Person   Age>10  Male?  Height>55"
Alice      1       0       1
Bob        0       1       0
Carol      0       0       0
Dave       1       1       1
Erin       1       0       1
Frank      0       1       1
Gena       0       0       0
Harold     1       1       1
Irene      1       0       0
John       0       1       1
Kelly      1       0       1
Larry      1       1       1
(The rows are split into train and test sets.)
(Figure: the learned w and b shrink toward 0 as λ grows larger.)
Updated Pipeline
Training Data → Model Class → Loss Function → Cross Validation & Model Selection → Profit!
Model selection now includes choosing λ!
(Figure: the same train/test split of the Person table, with model score plotted for increasing λ; the best test error occurs at an intermediate λ.)
Choice of λ Depends on Training Size
(Experiment: 25-dimensional space, randomly generated linear response function plus noise.)
Recap: Ridge Regularization
Ridge Regression:
– L2-Regularized Least-Squares
Large λ → more stable predictions
– Less likely to overfit to training data
– Too large λ → underfit
Works with other losses
– Hinge Loss, Log Loss, etc.
Model Class Interpretation
This is not a model class!
– At least not what we've discussed...
An optimization procedure
– Is there a connection?
Norm Constrained Model Class
$\operatorname{argmin}_w \textstyle\sum_i (y_i - w^T x_i)^2 \quad \text{s.t.} \quad \|w\|_2^2 \le c$
(Visualization: nested L2 balls for c=1, c=2, c=3.)
Seems to correspond to λ…
Lagrange Multipliers
(Omitting b & using 1 training data point for simplicity.)
Norm-constrained training:
$\operatorname{argmin}_w (y - w^T x)^2 \quad \text{s.t.} \quad \|w\|_2^2 \le c$
Lagrangian:
$\Lambda(w, \lambda) = (y - w^T x)^2 + \lambda (\|w\|_2^2 - c)$
Claim: Solving the Lagrangian solves the norm-constrained training problem. Two conditions must be satisfied at optimality:
– Stationarity: $\partial_w \Lambda = 0$, i.e., the loss gradient and the constraint gradient balance. Satisfies the first condition!
– Complementary slackness: $\lambda (\|w\|_2^2 - c) = 0$ with $\lambda \ge 0$, i.e., either the constraint is tight or λ = 0. Satisfies the 2nd condition!
Lagrangian = L2-Regularized Training:
– Hold λ fixed; the $-\lambda c$ term is a constant, so minimizing the Lagrangian over w is exactly L2-regularized training:
$\operatorname{argmin}_w \lambda \|w\|_2^2 + (y - w^T x)^2$
– Equivalent to solving the norm-constrained problem, for some c!
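A numerical sanity check of this equivalence on made-up data and an arbitrary λ: solve the regularized problem, read off c = ‖w‖₂², and confirm the norm-constrained problem has the same minimizer:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 4))
y = X @ np.array([2.0, -1.0, 0.5, 1.5]) + 0.1 * rng.normal(size=30)

lam = 5.0
# L2-regularized training (the Lagrangian with lambda held fixed):
w_reg = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)

# Set c to the norm this solution attains, then solve the constrained problem:
c = w_reg @ w_reg
res = minimize(lambda w: np.sum((y - X @ w) ** 2), x0=np.zeros(4),
               method="SLSQP",
               constraints=[{"type": "ineq", "fun": lambda w: c - w @ w}])

print(np.allclose(w_reg, res.x, atol=1e-3))  # same minimizer, as claimed
```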
Recap #2: Ridge Regularization
Ridge Regression:
– L2-Regularized Least-Squares
– = Norm Constrained Model (for some c)
Large λ → more stable predictions
– Less likely to overfit to training data
– Too large λ → underfit
Works with other losses
– Hinge Loss, Log Loss, etc.
Hallucinating Data Points
(Omitting b for simplicity.)
What if we instead hallucinate D extra data points, one per dimension: $x = \sqrt{\lambda}\, e_d$ with label $y = 0$, where $e_d$ is the unit vector along the d-th dimension?
Their squared loss is $\sum_d (0 - \sqrt{\lambda}\, w^T e_d)^2 = \lambda \sum_d w_d^2 = \lambda \|w\|_2^2$
Identical to regularization!
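A sketch verifying the claim on made-up data: appending the D hallucinated points and running ordinary least-squares reproduces the ridge solution exactly:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(15, 3))
y = rng.normal(size=15)
lam = 2.0

# Ridge via the normal equations:
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Ridge via hallucination: D fake points (sqrt(lam) * e_d, target 0).
X_aug = np.vstack([X, np.sqrt(lam) * np.eye(3)])
y_aug = np.concatenate([y, np.zeros(3)])
w_halluc, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

print(np.allclose(w_ridge, w_halluc))  # True: identical to regularization
```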
Extension: Multi-task Learning
2 prediction tasks:
– Spam filter for Alice
– Spam filter for Bob
Limited training data for both…
– …but Alice is similar to Bob
Extension: Multi-task Learning
Two training sets $S_1$ and $S_2$
– N relatively small
Option 1: Train Separately
$\operatorname{argmin}_w \textstyle\sum_{(x,y) \in S_1} (y - w^T x)^2 \qquad \operatorname{argmin}_v \textstyle\sum_{(x,y) \in S_2} (y - v^T x)^2$
(Omitting b for simplicity.)
Both models have high error.
Extension: Multi-task Learning
Two training sets $S_1$ and $S_2$
– N relatively small
Option 2: Train Jointly
$\operatorname{argmin}_{w,v} \textstyle\sum_{(x,y) \in S_1} (y - w^T x)^2 + \textstyle\sum_{(x,y) \in S_2} (y - v^T x)^2$
(Omitting b for simplicity.)
Doesn't accomplish anything! (The objective decomposes: w & v don't depend on each other.)
Multi-task Regularization
$\operatorname{argmin}_{w,v} \underbrace{\lambda (\|w\|_2^2 + \|v\|_2^2)}_{\text{Standard Regularization}} + \underbrace{\gamma \|w - v\|_2^2}_{\text{Multi-task Regularization}} + \underbrace{\textstyle\sum_{S_1} (y - w^T x)^2 + \textstyle\sum_{S_2} (y - v^T x)^2}_{\text{Training Loss}}$
Prefer w & v to be "close"
– Controlled by γ
– If tasks are similar, larger γ helps!
– If tasks are not identical, γ should not be too large
(Figure: test loss on Task 2 as a function of γ.)
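A sketch of solving the joint objective above in closed form; the stationarity conditions couple w and v linearly, so one linear solve suffices (the function name and toy data are mine, not from the lecture):

```python
import numpy as np

def multitask_ridge(X1, y1, X2, y2, lam, gamma):
    """Minimize ||y1 - X1 w||^2 + ||y2 - X2 v||^2
                + lam*(||w||^2 + ||v||^2) + gamma*||w - v||^2."""
    D = X1.shape[1]
    I = np.eye(D)
    # Stationarity in z = [w; v]: the gamma term couples the two blocks.
    A = np.block([[X1.T @ X1 + (lam + gamma) * I, -gamma * I],
                  [-gamma * I, X2.T @ X2 + (lam + gamma) * I]])
    b = np.concatenate([X1.T @ y1, X2.T @ y2])
    z = np.linalg.solve(A, b)
    return z[:D], z[D:]

rng = np.random.default_rng(3)
w_true = rng.normal(size=4)
v_true = w_true + 0.1 * rng.normal(size=4)   # similar, not identical, tasks
X1, X2 = rng.normal(size=(10, 4)), rng.normal(size=(10, 4))
y1, y2 = X1 @ w_true, X2 @ v_true
w, v = multitask_ridge(X1, y1, X2, y2, lam=0.1, gamma=10.0)
```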
Lasso
L1-Regularized Least-Squares:
$\operatorname{argmin}_w \lambda \|w\|_1 + \textstyle\sum_i (y_i - w^T x_i)^2$
L1 Regularized Least Squares
(Omitting b for simplicity.)
L2: $\partial_w \lambda \|w\|_2^2 = 2 \lambda w$
vs
L1: $\partial_w \lambda \|w\|_1 = \lambda\, \mathrm{sign}(w)$ (elementwise; undefined at 0)
Subgradient (sub-differential)
(Omitting b for simplicity.)
Where the function is differentiable, the subgradient is just the gradient.
For the L1 term $|w|$ at w = 0, the subdifferential is the continuous range $[-1, +1]$!
L1 Regularized Least Squares
(Omitting b for simplicity.)
Optimality conditions:
L2: $2 \lambda w + \partial_w L(w) = 0$
L1: $0 \in \lambda\, \partial \|w\|_1 + \partial_w L(w)$
For L1, any coordinate where the loss gradient has magnitude at most λ can be exactly 0 and still satisfy optimality.
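The slides don't prescribe an optimizer; one standard choice consistent with the subgradient picture above is proximal gradient (ISTA), whose soft-thresholding step snaps coordinates exactly to 0. A minimal sketch:

```python
import numpy as np

def lasso_ista(X, y, lam, n_iters=1000):
    """Minimize lam*||w||_1 + ||y - Xw||^2 by proximal gradient (ISTA)."""
    eta = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz const. of grad
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        w = w - eta * 2 * X.T @ (X @ w - y)      # gradient step on squared loss
        # Soft-thresholding (prox of the L1 term): coordinates within
        # eta*lam of 0 are set exactly to 0 -- the source of sparsity.
        w = np.sign(w) * np.maximum(np.abs(w) - eta * lam, 0.0)
    return w
```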
Lagrange Multipliers
(Omitting b & using 1 training data point for simplicity.)
The L1 norm-constrained region $\|w\|_1 \le c$ is a diamond; solutions tend to be at corners, where some coordinates are exactly 0!
Sparsity
w is sparse if mostly 0's:
– Small L0 norm
Why not L0 regularization?
– Not continuous!
L1 induces sparsity
– And is continuous!
Why is Sparsity Important?
Computational / memory efficiency (see the sketch below):
– Storing 1M numbers in a dense array vs storing 2 numbers per non-zero
– (Index, Value) pairs, e.g., [ (50,1), (51,1) ]
– Dot product more efficient: cost scales with the # of non-zeros rather than the dimension D
Sometimes the true w is sparse
– Want to recover the non-zero dimensions
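A minimal sketch of the (index, value) representation and its cheap dot product:

```python
import numpy as np

w_sparse = [(50, 1.0), (51, 1.0)]  # only the non-zero entries of a huge w

def sparse_dot(w_sparse, x):
    """Dot product in O(# non-zeros) instead of O(D)."""
    return sum(value * x[index] for index, value in w_sparse)

x = np.arange(1_000_000, dtype=float)
print(sparse_dot(w_sparse, x))     # touches 2 entries, not 1,000,000
```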
Lasso Guarantee
Suppose data generated as: $y = w^{*T} x + \text{noise}$, with $w^*$ sparse.
Then if N is sufficiently large and λ is chosen appropriately, with high probability (increasing with N):
– High precision parameter recovery: every non-zero of the Lasso solution is a true non-zero of $w^*$
– Sometimes high recall: under stronger conditions, all true non-zeros are recovered
See also:
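A hedged demo of support recovery on synthetic data; note sklearn's Lasso scales the squared loss by 1/(2N), so its alpha is not the same λ as above, and all values here are invented:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
D, N = 25, 200
w_true = np.zeros(D)
w_true[[3, 7]] = [2.0, -1.5]           # sparse ground-truth weights
X = rng.normal(size=(N, D))
y = X @ w_true + 0.1 * rng.normal(size=N)

w_hat = Lasso(alpha=0.05).fit(X, y).coef_
print(np.flatnonzero(w_hat))           # often recovers exactly [3 7]
```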
(Example: the Person / Age>10 / Male? / Height>55" table from earlier, revisited.)
Recap: Lasso vs Ridge
Model assumptions
– Lasso learns a sparse weight vector
Predictive accuracy
– Lasso often not as accurate
– Fix: re-run Least Squares on the dimensions selected by Lasso
Ease of inspection
– Sparse w's are easier to inspect
Ease of optimization
– Lasso somewhat trickier to optimize
Recap: Regularization
(Omitting b for simplicity.)
L2: $\lambda \|w\|_2^2$
L1 (Lasso): $\lambda \|w\|_1$
Multi-task: $\gamma \|w - v\|_2^2$
[Insert Yours Here!]
Next Lecture: Recent Applications of Lasso
– Cancer Detection
– Personalization via Twitter
Recitation on Wednesday: Probability & Statistics
Image Sources: