
ISCG8025 Machine Learning for Intelligent Data and Information Processing Week 2B *Courtesy of Associate Professor Andrew Ng’s Notes, Stanford University

Review: Cost Function; Gradient Descent; Linear Regression with Multiple Variables; Polynomial Regression & Feature Selection

Objectives: Classification; Why Not Linear Regression; Logistic Regression; Sigmoid Function; Decision Boundary; Logistic Regression Cost Function; Convex vs Non-Convex; Multiclass Classification

Classification Classification problems: Email -> spam / not spam? Online transactions -> fraudulent? Tumour -> malignant / benign? The target variable y in these problems is either 0 or 1: 0 = negative class (absence of something), 1 = positive class (presence of something). Binary class problems: 0: "Negative Class" (e.g., benign tumour); 1: "Positive Class" (e.g., malignant tumour)

Binary Class Problems We could use linear regression and then threshold the classifier output (i.e. anything over some value is yes, else no). [Figure: Malignant? (1 = Yes, 0 = No) plotted against Tumour Size, with a fitted line.] Threshold the classifier output at 0.5: If hθ(x) >= 0.5, predict "y = 1"; if hθ(x) < 0.5, predict "y = 0"

Why Not Linear Regression? In the tumour classification example above, linear regression does a reasonable job of assigning the data points to one of the two classes. But what if we had a single Yes example with a very large tumour? The regression line fitted to the data shifts, and with the same 0.5 threshold this can lead to classifying existing Yes examples as No. [Figure: Malignant? (1 = Yes, 0 = No) against Tumour Size, showing how the outlier changes the fitted line.]

Logistic Regression Another issue with linear regression: we know y is either 0 or 1, but the hypothesis can give values larger than 1 or less than 0. Classification: y = 0 or 1, yet hθ(x) can be > 1 or < 0. Logistic regression instead produces a hypothesis whose output always lies between 0 and 1: 0 <= hθ(x) <= 1

Hypothesis Representation We want our classifier to output values between 0 and 1. With linear regression we used hθ(x) = θT x. For the classification hypothesis we use hθ(x) = g(θT x), where we define g(z) = 1 / (1 + e^(-z)). This is called the sigmoid function, or the logistic function. Combining these equations, we can write the hypothesis as hθ(x) = 1 / (1 + e^(-θT x))
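As a rough illustration (not from the slides), here is a minimal Python sketch of the sigmoid and the logistic hypothesis, assuming NumPy and illustrative function names:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid / logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(theta, x):
    """Logistic hypothesis h_theta(x) = g(theta^T x); x includes x0 = 1."""
    return sigmoid(np.dot(theta, x))
```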

Sigmoid Function What does the sigmoid function look like? It passes through 0.5 at z = 0 and then flattens out, approaching 0 as z becomes very negative and 1 as z becomes very positive

Sigmoid Function The sigmoid squashes the real line into the interval (0, 1). This is appropriate for classification because we never want to make a prediction greater than 1 or less than 0. It is also differentiable, with a simple derivative (g'(z) = g(z)(1 - g(z))), which makes it well suited to gradient descent. Given this function, we need to fit θ to our data

Interpretation of Hypothesis Output When our logistic hypothesis hθ(x) outputs a number, that value is the estimated probability that y = 1 for input x. Example: if x is a feature vector with x0 = 1 (as always) and x1 = tumourSize, then hθ(x) = 0.7 tells the patient that the chance of the tumour being malignant is 70%. This can be written using the following notation: hθ(x) = P(y=1|x; θ), the probability that y = 1, given x, parameterised by θ. Since this is a binary classification task, y is either 0 or 1, so the following must be true: P(y=1|x; θ) + P(y=0|x; θ) = 1, i.e. P(y=0|x; θ) = 1 - P(y=1|x; θ)

Decision Boundary One way of using the sigmoid function: when the probability of y being 1 is greater than or equal to 0.5, we predict y = 1; otherwise we predict y = 0. When exactly is hθ(x) greater than or equal to 0.5? g(z) >= 0.5 when z >= 0. Since z = θT x, hθ(x) >= 0.5 whenever θT x >= 0. So the hypothesis predicts y = 1 when θT x >= 0
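A hedged sketch of this 0.5 threshold rule in Python (the predict name and NumPy usage are assumptions, continuing the sigmoid sketch above):

```python
import numpy as np

def predict(theta, x):
    """Predict y = 1 exactly when theta^T x >= 0, i.e. h_theta(x) >= 0.5."""
    return 1 if np.dot(theta, x) >= 0 else 0
```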

Decision Boundary Suppose we predict "y = 1" if hθ(x) >= 0.5, i.e. θT x >= 0, and predict "y = 0" if hθ(x) < 0.5, i.e. θT x < 0. So the hypothesis predicts y = 1 when θT x >= 0 and y = 0 when θT x < 0

Example 1 Assume we have hθ(x) = g(θ0 + θ1x1 + θ2x2) with, for example, θ0 = -3, θ1 = 1, θ2 = 1. Our parameter vector θ is a column vector, so θT is the row vector [-3, 1, 1]. Here z = θT x, and we predict "y = 1" if -3x0 + 1x1 + 1x2 >= 0, i.e. -3 + x1 + x2 >= 0. We can rewrite this as: if x1 + x2 >= 3, then predict y = 1. Plotting the line x1 + x2 = 3 visualises the decision boundary. [Figure: the line x1 + x2 = 3 in the (x1, x2) plane, labelled Decision Boundary.]
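As an illustrative check of this example (the function name and the small test inputs are my own, not from the slides), a Python sketch:

```python
import numpy as np

theta = np.array([-3.0, 1.0, 1.0])   # theta0 = -3, theta1 = 1, theta2 = 1

def predict(x1, x2):
    # Prepend x0 = 1 and apply the rule: predict y = 1 when theta^T x >= 0,
    # which here reduces to x1 + x2 >= 3.
    x = np.array([1.0, x1, x2])
    return int(theta @ x >= 0)

print(predict(1, 1))   # x1 + x2 = 2 < 3   -> 0
print(predict(2, 2))   # x1 + x2 = 4 >= 3  -> 1
```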

Example 1 [Figure: the (x1, x2) plane split by the line x1 + x2 = 3, labelled Decision Boundary.] This gives two regions on the graph: blue = false (y = 0), magenta = true (y = 1), and the line is the decision boundary. The straight line is the set of points where hθ(x) = 0.5 exactly. The decision boundary is a property of the hypothesis: we can draw the boundary from the hypothesis and its parameters without any data. Later, we use the data to determine the parameter values

Example 2 Non-Linear Decision Boundaries As with polynomial regression, we can add higher-order terms. Assume we have hθ(x) = g(θ0 + θ1x1 + θ2x2 + θ3x1^2 + θ4x2^2), and say θT is [-1, 0, 0, 1, 1]. Then we predict "y = 1" if -1 + x1^2 + x2^2 >= 0, i.e. x1^2 + x2^2 >= 1. Plotting x1^2 + x2^2 = 1 gives a circle of radius 1 around the origin as the decision boundary. [Figure: the unit circle in the (x1, x2) plane, labelled Decision Boundary.]
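A similar sketch for the circular boundary (again assuming NumPy; the test points are illustrative):

```python
import numpy as np

theta = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])   # [theta0, theta1, theta2, theta3, theta4]

def predict(x1, x2):
    # Feature vector [1, x1, x2, x1^2, x2^2]; predict y = 1 when theta^T x >= 0,
    # i.e. when x1^2 + x2^2 >= 1 (on or outside the unit circle).
    x = np.array([1.0, x1, x2, x1**2, x2**2])
    return int(theta @ x >= 0)

print(predict(0.5, 0.5))   # inside the unit circle   -> 0
print(predict(1.0, 1.0))   # outside the unit circle  -> 1
```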

More Complex Boundaries We can build more complex decision boundaries by fitting more complex parameter vectors to this (relatively) simple hypothesis. By using higher-order polynomial terms, we can get even more complex decision boundaries. [Figures: examples of more elaborate decision boundaries in the (x1, x2) plane.]

Cost Function for Logistic Regression Training set of m training examples: {(x(1), y(1)), (x(2), y(2)), ..., (x(m), y(m))}. Each example x is an (n+1)-length column vector (with x0 = 1), and y is 0 or 1. How do we choose the parameters, i.e. how do we fit θ?

Linear Regression Cost Linear regression uses the following function to determine θ: J(θ) = (1/m) Σi=1..m (1/2)(hθ(x(i)) - y(i))^2. If we define cost(hθ(x(i)), y(i)) = (1/2)(hθ(x(i)) - y(i))^2, which evaluates the cost for an individual example using the same measure as linear regression, we can redefine J(θ) as J(θ) = (1/m) Σi=1..m cost(hθ(x(i)), y(i)), which, appropriately, is the average of the individual costs over the training data. This is the cost we want the learning algorithm to pay if the outcome is hθ(x) and the actual outcome is y
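For concreteness, a minimal NumPy sketch of this squared-error cost (variable and function names are assumptions):

```python
import numpy as np

def squared_error_cost(theta, X, y):
    """Linear regression cost J(theta) = (1 / 2m) * sum_i (h_theta(x_i) - y_i)^2.
    X is an m x (n+1) design matrix whose first column is all ones."""
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)
```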

Why Do We Need a Special Cost Function? If we use the linear regression cost function for logistic regression, it becomes a non-convex function of the parameters. We have some function J(θ) for determining the parameters, but our hypothesis now contains a non-linearity (the sigmoid inside hθ(x)). If we take hθ(x), plug it into the cost() function, plug cost() into J(θ) and plot J(θ), we find many local optima: a non-convex function. Lots of local minima mean gradient descent may not find the global optimum; it may get stuck in a local minimum. We would like a convex function, so that running gradient descent is guaranteed to converge to the global minimum

Convex vs Non-Convex [Figure: a convex, bowl-shaped J(θ) versus a non-convex J(θ) with many local minima.] To get around this we need a different, convex cost() function, to which we can reliably apply gradient descent

Logistic Regression Cost Function The logistic regression cost function is defined as follows: cost(hθ(x), y) = -log(hθ(x)) if y = 1, and -log(1 - hθ(x)) if y = 0. This is the penalty the algorithm pays. [Figure: plot of -log(hθ(x)) for the y = 1 case, falling from infinity at hθ(x) = 0 to 0 at hθ(x) = 1.]

For y = 1 So when we are right, the cost is 0; otherwise the cost increases as we become "more" wrong. The x axis is what we predict, the y axis is the cost associated with that prediction. If y = 1 and hθ(x) = 1: the hypothesis predicts exactly 1 and that is exactly correct, which corresponds to a cost of 0. If y = 1 and hθ(x) goes to 0: the cost goes to infinity

For y = 0 If y = 0, the cost is evaluated as -log(1 - hθ(x)). This is 0 when hθ(x) = 0 and goes to plus infinity as hθ(x) goes to 1. [Figure: plot of -log(1 - hθ(x)) for the y = 0 case.]
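To make the two cases concrete, a small Python sketch evaluating the per-example cost at a few illustrative predictions (the function name and the example values are my own, not from the slides):

```python
import numpy as np

def cost_per_example(h, y):
    """Per-example logistic cost: -log(h) when y = 1, -log(1 - h) when y = 0."""
    return -np.log(h) if y == 1 else -np.log(1.0 - h)

print(cost_per_example(0.99, 1))   # right and confident  -> cost near 0
print(cost_per_example(0.01, 1))   # wrong and confident  -> large cost
print(cost_per_example(0.01, 0))   # right and confident  -> cost near 0
print(cost_per_example(0.99, 0))   # wrong and confident  -> large cost
```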

Simplified Cost Function For binary classification problems y is always 0 or 1, so there is a simpler way to write the cost function. Rather than writing it as two lines/two cases, we can compress them into one equation, which is more efficient: cost(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x))

Simplified Cost Function cost(hθ(x), y) = -y log(hθ(x)) - (1 - y) log(1 - hθ(x)). We know there are only two possible cases. When y = 1 the equation simplifies to -log(hθ(x)) - (0) log(1 - hθ(x)) = -log(hθ(x)), which is what we had before for y = 1. When y = 0 it simplifies to -log(1 - hθ(x)), which is what we had before for y = 0

Simplified Cost Function So the logistic cost function over the parameters θ can be defined as follows: J(θ) = -(1/m) Σi=1..m [ y(i) log(hθ(x(i))) + (1 - y(i)) log(1 - hθ(x(i))) ]. Why do we choose this function when other cost functions exist? It is convex, and it can be derived from statistics using the principle of maximum likelihood estimation. To fit the parameters θ: find the θ which minimises J(θ)
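A minimal vectorised sketch of J(θ) in Python, assuming NumPy and a design matrix X with a leading column of ones (names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost(theta, X, y):
    """J(theta) = -(1/m) * sum_i [ y_i*log(h_i) + (1 - y_i)*log(1 - h_i) ].
    X is m x (n+1) with a leading column of ones; y is a vector of 0s and 1s."""
    m = len(y)
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
```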

Gradient Descent for Logistic Cost Now we need to figure out how to minimise J(θ). Use gradient descent as before: repeatedly update each parameter using a learning rate α. Working out the derivative term gives the same form as we had for linear regression: Repeat { θj := θj - α (1/m) Σi=1..m (hθ(x(i)) - y(i)) xj(i) } (simultaneously update all θj)

GD for Logistic Cost Function This update equation is the same as the linear regression rule; the only difference is that the definition of the hypothesis hθ(x) has changed. When implementing logistic regression with gradient descent, we have to update all the θ values (θ0 to θn) simultaneously. We could use a for loop, but a vectorised implementation is better (see the sketch below). The algorithm looks identical to linear regression!
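A possible vectorised implementation in Python (assuming NumPy; the learning rate and iteration count are arbitrary placeholders, not values from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Batch gradient descent for logistic regression.
    Each step updates every theta_j simultaneously via the vectorised rule
        theta := theta - (alpha / m) * X^T (h_theta(X) - y)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iterations):
        h = sigmoid(X @ theta)
        theta -= (alpha / m) * (X.T @ (h - y))
    return theta
```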

Advanced Optimisation Alternatively, instead of gradient descent, we could minimise the cost function with other (much more sophisticated) optimisation techniques such as: Conjugate gradient, BFGS (Broyden-Fletcher-Goldfarb-Shanno), L-BFGS (Limited-memory BFGS). Advantages: no need to manually pick α (the learning rate); they have a clever inner loop (a line search algorithm) that tries a range of step sizes and picks a good one; often faster than gradient descent. Disadvantages: can make debugging more difficult; different libraries may use different implementations, which may affect performance
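As one illustration (not part of the lecture), SciPy's minimize routine exposes BFGS-family methods; a hedged sketch using L-BFGS-B with an explicit cost and gradient (function names are my own):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / len(y)

def gradient(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def fit(X, y):
    # L-BFGS chooses its own step sizes, so no learning rate is needed.
    result = minimize(cost, np.zeros(X.shape[1]), args=(X, y),
                      method='L-BFGS-B', jac=gradient)
    return result.x   # the fitted theta
```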

Multiclass Classification Email foldering/tagging: Work, Friends, Family, Hobby. Medical diagnosis: Not ill, Cold, Flu. Weather: Sunny, Cloudy, Rain, Snow

One-vs-All Given a dataset with three classes, how do we get a learning algorithm to work? Use one-vs-all classification to make binary classification work for multiclass classification: split the training set into three separate binary classification problems. Triangles (1) vs crosses and squares (0): hθ(1)(x) = P(y=1 | x; θ). Crosses (1) vs triangles and squares (0): hθ(2)(x) = P(y=2 | x; θ). Squares (1) vs triangles and crosses (0): hθ(3)(x) = P(y=3 | x; θ)

One-vs-All [Figure: the training data in the (x1, x2) plane split into three binary problems: Class 1 vs rest, Class 2 vs rest, Class 3 vs rest.]

One-vs-All So the overall steps in the one-vs-all approach are summarised as follows: train a logistic regression classifier hθ(i)(x) for each class i to predict the probability that y = i. On a new input x, to make a prediction, pick the class i that maximises hθ(i)(x)
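A sketch of these steps in Python (the fit_binary helper and other names are assumptions, not a prescribed API; any of the fitting sketches above could be passed in):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_vs_all(X, y, classes, fit_binary):
    """Train one binary logistic regression classifier per class.
    fit_binary(X, y01) returns a theta vector for a 0/1 problem."""
    thetas = {}
    for c in classes:
        y01 = (y == c).astype(float)   # current class -> 1, every other class -> 0
        thetas[c] = fit_binary(X, y01)
    return thetas

def predict_one_vs_all(thetas, x):
    """Pick the class whose classifier reports the highest probability h_theta(x)."""
    return max(thetas, key=lambda c: sigmoid(x @ thetas[c]))
```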

Any questions?