Machine Learning – Regression
David Fenyő
Contact: David@FenyoLab.org
Linear Regression – One Independent Variable

Relationship: $y = w_1 x_1 + w_0 + \epsilon$

Data: $(y_j, x_{1j})$ for $j = 1, \dots, n$

Loss function (sum of squared errors): $L = \sum_j \epsilon_j^2 = \sum_j \left(y_j - (w_1 x_{1j} + w_0)\right)^2$
Linear Regression – One Independent Variable

Minimizing the loss function, L (sum of squared errors), by setting its partial derivatives to zero:

$\frac{\partial L}{\partial w_1} = \frac{\partial}{\partial w_1} \sum_j \epsilon_j^2 = \frac{\partial}{\partial w_1} \sum_j \left(y_j - (w_1 x_{1j} + w_0)\right)^2 = 0$

$\frac{\partial L}{\partial w_0} = \frac{\partial}{\partial w_0} \sum_j \epsilon_j^2 = \frac{\partial}{\partial w_0} \sum_j \left(y_j - (w_1 x_{1j} + w_0)\right)^2 = 0$
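Solving these two equations gives the familiar closed-form estimates $w_1 = \frac{\sum_j (x_{1j} - \bar{x})(y_j - \bar{y})}{\sum_j (x_{1j} - \bar{x})^2}$ and $w_0 = \bar{y} - w_1 \bar{x}$. A minimal numpy sketch of this closed-form fit (the function name `fit_line` is illustrative, not from the slides):

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least-squares fit of y = w1*x + w0."""
    x_mean, y_mean = x.mean(), y.mean()
    w1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    w0 = y_mean - w1 * x_mean
    return w0, w1

# Example: noisy line y = 2x + 1
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 1, 50)
print(fit_line(x, y))  # approximately (1, 2)
```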
Linear Regression – One Independent Variable

Relationship: $y = w_1 x_1 + w_0 + \epsilon$

In vector notation, with $\boldsymbol{x} = (1, x_1)$ and $\boldsymbol{w} = (w_0, w_1)$:

$y = \boldsymbol{x} \cdot \boldsymbol{w} + \epsilon$
Linear Regression – Sum of Functions

$y = \boldsymbol{x} \cdot \boldsymbol{w} + \epsilon$ with
$\boldsymbol{x} = (1, f_1(x_1), f_2(x_1), f_3(x_1), \dots, f_l(x_1))$ and
$\boldsymbol{w} = (w_0, w_1, w_2, w_3, \dots, w_l)$
Linear Regression – Polynomial

$y = \boldsymbol{x} \cdot \boldsymbol{w} + \epsilon$ with
$\boldsymbol{x} = (1, x_1, x_1^2, x_1^3, \dots, x_1^k)$ and
$\boldsymbol{w} = (w_0, w_1, w_2, w_3, \dots, w_k)$
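A hedged sketch of fitting such a polynomial by building the design matrix explicitly and solving the least-squares problem (numpy only; the helper name `fit_polynomial` is illustrative):

```python
import numpy as np

def fit_polynomial(x, y, degree):
    """Least-squares fit of y = w0 + w1*x + ... + wk*x^k via the design matrix."""
    X = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2, ..., x^k
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w  # (w0, w1, ..., wk)

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 100)
y = 0.5 - 2.0 * x + 3.0 * x**2 + rng.normal(0, 0.1, 100)
print(fit_polynomial(x, y, degree=2))  # roughly (0.5, -2.0, 3.0)
```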
Linear Regression – Multiple Independent Variables

$y = \boldsymbol{x} \cdot \boldsymbol{w} + \epsilon$ with
$\boldsymbol{x} = (1, x_1, x_2, x_3, \dots, x_k)$ and
$\boldsymbol{w} = (w_0, w_1, w_2, w_3, \dots, w_k)$
Linear Regression – Multiple Independent Variables

$y = \boldsymbol{x} \cdot \boldsymbol{w} + \epsilon$ with
$\boldsymbol{x} = (1, f_1(x_1, x_2, \dots, x_k), f_2(x_1, x_2, \dots, x_k), \dots, f_l(x_1, x_2, \dots, x_k))$ and
$\boldsymbol{w} = (w_0, w_1, w_2, w_3, \dots, w_l)$
Gradient Descent

$\min_{\boldsymbol{w}} L(\boldsymbol{w})$
Gradient Descent

$\boldsymbol{w}_{n+1} = \boldsymbol{w}_n - \eta \nabla L(\boldsymbol{w}_n)$
Gradient Descent

In one dimension, the derivative can be approximated by a finite difference:

$w_1 = w_0 - \eta \frac{d}{dw} L(w_0) \;\Longrightarrow\; w_1 = w_0 - \eta \frac{L(w_0 + \Delta w) - L(w_0)}{\Delta w}$

Subsequent steps repeat the same update:

$w_2 = w_1 - \eta \frac{L(w_1 + \Delta w) - L(w_1)}{\Delta w}$

$w_3 = w_2 - \eta \frac{L(w_2 + \Delta w) - L(w_2)}{\Delta w}$

$w_4 = w_3 - \eta \frac{L(w_3 + \Delta w) - L(w_3)}{\Delta w}$

We want to use a large learning rate when we are far from the minimum and decrease it as we get closer.
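This update can be coded directly using the finite-difference estimate of the derivative. A minimal sketch, assuming a scalar loss function L(w) (illustrative, not the exact code behind the slides):

```python
def gradient_descent_1d(L, w, eta=0.1, dw=1e-6, n_steps=100):
    """Minimize a scalar loss L(w) using a finite-difference gradient estimate."""
    for _ in range(n_steps):
        grad = (L(w + dw) - L(w)) / dw   # numerical derivative
        w = w - eta * grad               # gradient descent update
    return w

# Example: L(w) = (w - 3)^2 has its minimum at w = 3
print(gradient_descent_1d(lambda w: (w - 3.0) ** 2, w=0.0))  # approximately 3.0
```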
Training: Gradient Descent

If the gradient is small in an extended region, gradient descent becomes very slow.
Training: Gradient Descent

Gradient descent can get stuck in local minima. To improve the behavior for shallow local minima, we can modify gradient descent to average the gradient over the last few steps (analogous to momentum with friction).
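One way to realize this "partial memory" of earlier gradients is an exponential moving average, i.e. gradient descent with momentum. A hedged sketch (the momentum coefficient `beta` and the example values are illustrative assumptions):

```python
import numpy as np

def gradient_descent_momentum(grad, w, eta=0.01, beta=0.9, n_steps=1000):
    """Gradient descent where each step uses a running average of past gradients."""
    v = np.zeros_like(w)
    for _ in range(n_steps):
        v = beta * v + (1 - beta) * grad(w)  # average of recent gradients
        w = w - eta * v
    return w

# Example: minimize L(w) = w1^2 + 10*w2^2
grad = lambda w: np.array([2 * w[0], 20 * w[1]])
print(gradient_descent_momentum(grad, np.array([5.0, 5.0])))  # approximately (0, 0)
```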
Linear Regression – Error Landscape

[Figures: sum of squared errors and sum of absolute errors as a function of slope and intercept]
Gradient Descent [figures]

Linear Regression – Gradient Descent [figures]
Gradient Descent

Batch gradient descent: uses the whole training set to calculate the gradient for each step.

Stochastic gradient descent: uses a single randomly chosen example for each step; mini-batch gradient descent uses a small random subset of the training set.
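A minimal sketch of mini-batch gradient descent for linear regression with the squared-error loss (numpy only; the batch size, learning rate, and example data are illustrative choices):

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.01, batch_size=32, n_epochs=100, seed=0):
    """Mini-batch gradient descent for y ≈ X·w with squared-error loss.
    X is assumed to already contain a leading column of ones for the intercept."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_epochs):
        order = rng.permutation(n)
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)  # gradient of mean squared error
            w -= eta * grad
    return w

# Example: y = 0.5 + 3x with noise
rng = np.random.default_rng(3)
x = rng.uniform(0, 1, 200)
y = 0.5 + 3 * x + rng.normal(0, 0.1, 200)
X = np.column_stack([np.ones_like(x), x])
print(minibatch_sgd(X, y, eta=0.05, n_epochs=200))  # approaches (0.5, 3)
```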
Gradient Descent – Learning Rate

[Figures: learning rate too small vs. too large]
Gradient Descent – Learning Rate Decay

[Figures: constant learning rate vs. decaying learning rate]
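A common decay schedule divides the initial learning rate by a term that grows with the step count; the exact form below is an illustrative assumption, not necessarily the one used to produce the figures:

```python
def decayed_learning_rate(eta0, step, decay=0.01):
    """Inverse-time decay: large steps early, smaller steps as we approach the minimum."""
    return eta0 / (1.0 + decay * step)
```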
Gradient Descent – Unequal Gradients

[Figures: constant learning rate, decaying learning rate, partially remembering previous gradients]
Gradient Descent

[Figures: sum of squared errors vs. sum of absolute errors]
Outliers

[Figures: fits using sum of squared errors vs. sum of absolute errors]
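To see the effect of an outlier, the same data can be fit with both losses; the squared-error fit is pulled toward the outlier much more strongly. A hedged sketch, using a plain subgradient-descent fit for the absolute-error loss (the step size, step count, and example data are illustrative assumptions):

```python
import numpy as np

def fit_l2(X, y):
    """Least-squares (sum of squared errors) fit."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w

def fit_l1(X, y, eta=0.01, n_steps=5000):
    """Sum of absolute errors fit via subgradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w -= eta * X.T @ np.sign(X @ w - y) / len(y)
    return w

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.5, 50)
y[0] = 100.0  # a single outlier
X = np.column_stack([np.ones_like(x), x])
print("L2 fit:", fit_l2(X, y))  # noticeably distorted by the outlier
print("L1 fit:", fit_l1(X, y))  # much closer to (1, 2)
```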
Variable Variance
Model Capacity: Overfitting and Underfitting

[Figures: error on the training set vs. degree of polynomial]
Training and Testing

[Figure: the data set is split into a training set and a test set]
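A minimal numpy-only sketch of splitting the data into training and test sets (the split fraction is an illustrative choice):

```python
import numpy as np

def train_test_split(X, y, test_fraction=0.2, seed=0):
    """Randomly assign each example to the training or test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_test = int(test_fraction * len(y))
    test, train = idx[:n_test], idx[n_test:]
    return X[train], y[train], X[test], y[test]
```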
Training and Testing – Linear Relationship

[Figure: training and testing error vs. degree of polynomial]
Testing Error and Training Set Size

[Figures: testing error vs. degree of polynomial for training set sizes 10–3000, with low-variance and high-variance noise]
Coefficients and Training Set Size

[Figures: absolute value of the coefficients of a degree-9 polynomial fit for training set sizes 10, 100, and 1000]
Training and Testing – Non-Linear Relationship

[Figure: training and testing error vs. degree of polynomial]
Testing Error and Training Set Size

[Figures: log(error) vs. degree of polynomial for training set sizes 10–3000, with low-variance and high-variance noise]
Regularization

No regularization: $\min_{\boldsymbol{w}} L(\boldsymbol{w})$

Regularization: $\min_{\boldsymbol{w}} L(\boldsymbol{w}) + \lambda g(\boldsymbol{w})$

Ridge regression: $\min_{\boldsymbol{w}} \|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{w}\|^2 + \lambda \|\boldsymbol{w}\|^2$
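Ridge regression still has a closed-form solution, $\boldsymbol{w} = (\boldsymbol{X}^T\boldsymbol{X} + \lambda \boldsymbol{I})^{-1}\boldsymbol{X}^T\boldsymbol{y}$. A minimal numpy sketch (whether the intercept column should also be penalized is a modeling choice; here it is penalized for simplicity):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge regression: minimize ||y - Xw||^2 + lam*||w||^2."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)
```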
Regularization: Coefficients and Training Set Size

[Figures: coefficients of a regularized degree-9 polynomial fit for training set sizes 10, 100, and 1000]
Nearest Neighbor Regression – Fixed Distance
Nearest Neighbor Regression – Fixed Number
Nearest Neighbor Regression

[Figures: error (linear data) and log(error) (non-linear data) vs. number of neighbors for training set sizes 10–3000]
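A hedged sketch of nearest-neighbor regression with a fixed number of neighbors; a fixed-distance variant would instead average over all training points within a chosen radius (function and variable names are illustrative):

```python
import numpy as np

def knn_regress(x_train, y_train, x_query, k=5):
    """Predict y at each query point as the mean of the k nearest training targets."""
    preds = []
    for xq in np.atleast_1d(x_query):
        dist = np.abs(x_train - xq)      # 1-D case; use a norm for vector inputs
        nearest = np.argsort(dist)[:k]   # indices of the k closest training points
        preds.append(y_train[nearest].mean())
    return np.array(preds)
```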
Validation: Choosing Hyperparameters

[Figure: the data set is split into training, validation, and test sets]

Examples of hyperparameters: learning rate schedule, regularization parameter, number of nearest neighbors.
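A minimal sketch of using a validation set to choose a hyperparameter, here the ridge penalty λ (the candidate grid is an illustrative assumption):

```python
import numpy as np

def choose_lambda(X_train, y_train, X_val, y_val, lambdas=(0.0, 0.01, 0.1, 1.0, 10.0)):
    """Pick the ridge penalty with the lowest squared error on the validation set."""
    best_lam, best_err = None, np.inf
    for lam in lambdas:
        w = np.linalg.solve(X_train.T @ X_train + lam * np.eye(X_train.shape[1]),
                            X_train.T @ y_train)
        err = np.mean((X_val @ w - y_val) ** 2)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```

The hyperparameter chosen on the validation set is then evaluated once on the held-out test set.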