ISCG8025 Machine Learning for Intelligent Data and Information Processing Week 3 Practical Notes Regularisation *Courtesy of Associate Professor Andrew Ng’s Notes, Stanford University

3.2 Objectives
- Overfitting and Underfitting
- Linear Regression
- Logistic Regression
- Regularisation
- Optimised Cost Function
- Regularised Linear Regression
- Regularised Logistic Regression

3.3 Underfitting with Linear Regression
Using our house pricing example again, fit a linear function to the data (plot of price against size).
A straight line is not a great model here. This is underfitting - also known as high bias.

3.4 Overfitting with Linear Regression
Fit a 4th order polynomial to the same price-vs-size data.
Now the curve passes through all five examples. But, despite fitting the training data very well, this is actually not a good model.
This is overfitting - also known as high variance.

3.5 Overfitting with Linear Regression
Fit a quadratic function to the price-vs-size data.
This works well - it is the best fit of the three, sitting between the underfit straight line and the overfit 4th order polynomial.

3.6 Overfitting with Logistic Regression
A simple sigmoid of a linear function of the features may underfit the data.
But a high order polynomial gives an overfit (a high variance hypothesis).

3.7 Overfitting
To recap: if we have too many features, the learned hypothesis may fit the training set so well that the cost function is almost exactly zero.
But such a hypothesis tries too hard to fit the training set and fails to provide a general solution - it is unable to generalise (apply to new examples).

3.8 Addressing Overfitting
Plotting the hypothesis is one way to spot overfitting, but it doesn't always work.
We often have lots of features - then it's not just a case of selecting the degree of a polynomial; it is also harder to plot the data and visualise it to decide which features to keep and which to drop.
If you have lots of features and little data, overfitting can be a problem. How do we deal with it?
- Reduce the number of features. But in reducing the number of features we lose some information.
- Regularisation. Keep all the features, but reduce the magnitude of the parameters θ. This works well when we have a lot of features, each of which contributes a bit to predicting y.

3.9 Cost Function Optimisation
The idea is to penalise some of the θ parameters and make them really small - for example θ3 and θ4 in a 4th order fit of price against house size.
We modify our cost function by adding penalty terms for θ3 and θ4.
Because the penalty constants are massive, minimising the modified cost drives θ3 and θ4 close to zero, so we are basically left with a quadratic function.
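For instance, the modified objective for the house pricing example might look like this (the constants 1000 are just illustrative large penalties):

\min_\theta \; \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + 1000\,\theta_3^2 + 1000\,\theta_4^2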

3.10 Regularisation
Small values for the parameters lead to a simpler hypothesis (effectively, we get rid of some of the terms).
A simpler hypothesis is less prone to overfitting.
In general we don't know in advance which high order terms matter, or how to pick the ones to shrink.
Regularisation: take the cost function and modify it to shrink all the parameters, by adding a regularisation term at the end.
The regularisation term shrinks every parameter. By convention we don't penalise θ0 - the minimisation runs from θ1 onwards.
(Housing example: many features x1, ..., xn and correspondingly many parameters θ0, θ1, ..., θn.)
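Written out, the regularised linear regression cost function described above (regularisation term at the end, sum starting from j = 1) is:

J(\theta) = \frac{1}{2m} \left[ \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2 \right]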

3.11 Regularisation
λ is the regularisation parameter. It controls a trade-off between our two goals:
- fitting the training set well
- keeping the parameters small
Using the regularised objective (i.e. the cost function with the regularisation term) you get a much smoother curve which still fits the data, and therefore a much better hypothesis.
If λ is very large we end up penalising ALL the parameters (θ1, θ2, etc.) so heavily that they all end up close to zero.
- This leads to underfitting
- Gradient descent may even fail to converge
So λ should be chosen carefully - not too big.
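To make the extreme case concrete: if λ is enormous, the penalty forces θ1 ≈ θ2 ≈ ... ≈ θn ≈ 0, leaving only

h_\theta(x) \approx \theta_0

i.e. a flat line through the data - the underfitting case described above.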

3.12 Regularised Linear Regression
Linear regression with regularisation is shown below.
Previously, gradient descent would repeatedly update the parameters θj, for j = 0, 1, 2, ..., n, simultaneously.
Here the θ0 update is written out explicitly, because with regularisation we don't penalise θ0 and so treat it slightly differently.
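The two update rules the slide refers to (a reconstruction from the standard gradient descent form for linear regression, with θ0 written separately) are:

\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_0^{(i)}

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}, \quad j = 1, \dots, n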

3.13 Regularised Linear Regression
How do we regularise these two rules? Take the gradient term for θj and add (λ/m)·θj to it, for every penalised parameter (i.e. j = 1 to n).
This gives regularised gradient descent. Using calculus, we can show that the resulting expression is the partial derivative of the regularised J(θ).
θj gets updated to: θj - α·[a term which also depends on θj]. So we can group the θj terms together.
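In symbols, the regularised update for j = 1, ..., n (reconstructed from the description above) is:

\theta_j := \theta_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} + \frac{\lambda}{m} \theta_j \right]

and grouping the θj terms gives:

\theta_j := \theta_j \left( 1 - \alpha \frac{\lambda}{m} \right) - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}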

3.14 Regularised Linear Regression
The term (1 - αλ/m) is usually a number slightly less than 1: the learning rate α is small and m is large, so it typically evaluates to (1 - a small number), often somewhere around 0.99 to 0.95.
In effect θj gets multiplied by roughly 0.99 at each update, which makes the squared norm of θ a little smaller.
The second term is exactly the same as in the original gradient descent update.
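A minimal NumPy sketch of the regularised gradient descent update described on this and the previous slides is given below. The function name, the toy data and the default values for α and λ are my own illustrative choices, not part of the course notes.

    import numpy as np

    def regularised_linear_gd(X, y, alpha=0.01, lam=1.0, iterations=1000):
        """Gradient descent for linear regression with L2 regularisation.

        X is assumed to already contain a leading column of ones (x0 = 1),
        so theta[0] is the intercept and is not penalised.
        """
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(iterations):
            errors = X @ theta - y                    # h_theta(x) - y for every example
            grad = (X.T @ errors) / m                 # unregularised gradient
            grad[1:] += (lam / m) * theta[1:]         # add (lambda/m) * theta_j for j >= 1 only
            theta -= alpha * grad                     # simultaneous update of all theta_j
        return theta

    # Toy usage: fit y ~ theta0 + theta1 * x on a few made-up points
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])
    X = np.column_stack([np.ones_like(x), x])
    print(regularised_linear_gd(X, y, alpha=0.05, lam=0.1, iterations=5000))

Subtracting alpha * grad with the (lambda/m) * theta_j term included is algebraically the same as multiplying θj by (1 - αλ/m) and then taking the ordinary gradient step - exactly the shrinking effect described above.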

3.15 Regularised Logistic Regression
We saw earlier that logistic regression can be prone to overfitting when there are lots of features.
The logistic regression cost function is as follows; to regularise it we add an extra term which penalises the parameters θ1, θ2, up to θn.
This means that, as with linear regression, we get what is in effect a better fitting, lower order hypothesis.
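Written out, the regularised logistic regression cost function referred to above is:

J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta_j^2

where h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} is the sigmoid hypothesis.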

3.16 Regularised Logistic Regression
How do we implement this? The original logistic regression gradient descent update was as follows.
Again, to regularise the algorithm we simply modify the update rule for θ1 onwards by adding the (λ/m)·θj term.
This looks cosmetically the same as regularised linear regression, except that the hypothesis hθ(x) is of course very different (the sigmoid of θᵀx rather than θᵀx itself).
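A matching sketch of the regularised logistic update is below, assuming the same conventions as the linear regression sketch (x0 = 1, θ0 unpenalised); only the hypothesis changes to the sigmoid. The names and toy data are again illustrative, not from the notes.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def regularised_logistic_gd(X, y, alpha=0.1, lam=1.0, iterations=1000):
        """Gradient descent for logistic regression with L2 regularisation.

        Structurally the same as the linear regression version; only the
        hypothesis h_theta(x) = sigmoid(theta^T x) differs.
        """
        m, n = X.shape
        theta = np.zeros(n)
        for _ in range(iterations):
            errors = sigmoid(X @ theta) - y           # h_theta(x) - y
            grad = (X.T @ errors) / m
            grad[1:] += (lam / m) * theta[1:]         # penalise theta_1 ... theta_n only
            theta -= alpha * grad
        return theta

    # Toy usage: a tiny 1-D classification problem (label 1 when x is large)
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([0.0, 0.0, 0.0, 1.0, 1.0])
    X = np.column_stack([np.ones_like(x), x])
    print(regularised_logistic_gd(X, y, alpha=0.5, lam=0.1, iterations=5000))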

Any questions?