Multivariate linear models for regression and classification
Outline: 1) multivariate linear regression; 2) linear classification (perceptron); 3) logistic regression
Logistic Regression (lecture 9 on amlbook.com)
Neuron analogy: the dot product $w^T x$ combines the attributes into a scalar signal $s$. How the signal is used defines the hypothesis set.
In logistic regression, the signal becomes the argument of the logistic function $\theta(s) = e^s / (1 + e^s)$, whose output lies between 0 and 1 and can be read as a probability.
Application: risk of heart attack. Objective: find $w$ such that the risk score $s = w^T x \gg 0$ for patients who have had a heart attack ($\theta(s) \approx 1$) and $s \ll 0$ for those who have not ($\theta(s) \approx 0$).
More specifically (see text p. 91):
The dataset is drawn from a distribution $P(y \mid x)$, which is related to the hypothesis $h(x)$ by
$P(y_n \mid x_n) = h(x_n)$ if $y_n = +1$; $P(y_n \mid x_n) = 1 - h(x_n)$ if $y_n = -1$.
With $h(x) = \theta(w^T x)$, the logistic function has the property $\theta(-s) = 1 - \theta(s)$. Hence both relationships are satisfied by
$P(y_n \mid x_n) = \theta(y_n w^T x_n)$.
Now use maximum likelihood estimation (MLE) to derive an error function that we minimize to find the optimum $w$.
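The symmetry property and the combined form of $P(y_n \mid x_n)$ can be checked numerically. The following is a minimal sketch; the weight and attribute values are made up for illustration, and the helper names `theta` and `prob` are not from the text.

```python
import math

def theta(s):
    """Logistic function theta(s) = e^s / (1 + e^s), split by sign to avoid overflow."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)

def prob(y, w, x):
    """P(y | x) = theta(y * w^T x) for a label y in {+1, -1}."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return theta(y * s)

# Hypothetical values: w[0] is the bias weight, x[0] = 1 is the constant attribute.
w = [-1.0, 2.0, 0.5]
x = [1.0, 0.8, -0.4]

p_pos = prob(+1, w, x)   # plays the role of h(x)
p_neg = prob(-1, w, x)   # plays the role of 1 - h(x)
print(p_pos, p_neg, p_pos + p_neg)
```

Because $\theta(-s) = 1 - \theta(s)$, the two probabilities add up to 1 for any $w$ and $x$, which is what lets a single expression cover both label values.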
Recall that MLE is used to estimate the parameters of a probability distribution given a sample $X$ drawn from that distribution. In logistic regression, the parameters are the weights $w$.
Likelihood of $w$ given the sample $X$: $l(w \mid X) = p(X \mid w) = \prod_t p(x_t \mid w)$
Log likelihood: $L(w \mid X) = \log l(w \mid X) = \sum_t \log p(x_t \mid w)$
In logistic regression, each factor is $P(y_n \mid x_n) = \theta(y_n w^T x_n)$, so $L(w \mid X) = \sum_{n=1}^{N} \ln \theta(y_n w^T x_n)$.
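Since log is a monotonically increasing function, maximizing the log likelihood is equivalent to minimizing the negative log likelihood. The text also normalizes by dividing by $N$; hence the error function becomes
$E_{in}(w) = -\frac{1}{N} \sum_{n=1}^{N} \ln \theta(y_n w^T x_n) = \frac{1}{N} \sum_{n=1}^{N} \ln\left(1 + e^{-y_n w^T x_n}\right)$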
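As a rough illustration, the error function $E_{in}(w) = \frac{1}{N} \sum_n \ln(1 + e^{-y_n w^T x_n})$ can be coded directly. This is a sketch on a made-up two-point dataset; the function name `cross_entropy_error` and the data are assumptions for the example.

```python
import math

def cross_entropy_error(w, X, y):
    """E_in(w) = (1/N) * sum_n ln(1 + exp(-y_n * w^T x_n))."""
    total = 0.0
    for x_n, y_n in zip(X, y):
        s = sum(wi * xi for wi, xi in zip(w, x_n))
        m = -y_n * s
        if m > 0:
            # ln(1 + e^m) = m + ln(1 + e^-m), which avoids overflow for large m
            total += m + math.log1p(math.exp(-m))
        else:
            total += math.log1p(math.exp(m))
    return total / len(X)

# Hypothetical data: first column is the constant 1, second is one attribute.
X = [[1.0, 2.0], [1.0, -1.5]]
y = [+1, -1]

err_zero = cross_entropy_error([0.0, 0.0], X, y)   # every term equals ln 2
err_good = cross_entropy_error([0.0, 3.0], X, y)   # signals agree with labels
print(err_zero, err_good)
```

With all weights at zero every point contributes $\ln 2$; weights whose signal agrees in sign with the labels give a smaller error.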
The error function of logistic regression (called cross entropy) has the desired properties. If $x_n$ are the attributes of a person who has had a heart attack, then $w^T x_n \gg 0$ and $y_n = +1$, so $y_n w^T x_n \gg 0$ and the contribution to $E_{in}(w)$ is small. If $x_n$ are the attributes of a person who has not had a heart attack, then $w^T x_n \ll 0$ and $y_n = -1$, so again $y_n w^T x_n \gg 0$ and the contribution to $E_{in}(w)$ is small.
The error function of linear regression allows “1-step” optimization. This is not true for the error function of logistic regression: optimization is iterative, and the method is “steepest descent”.
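For contrast, here is what "1-step" means for linear regression: the squared-error minimizer is available in closed form, $w = (X^T X)^{-1} X^T y$. The sketch below assumes the simplest case of a bias plus one attribute, where that solution reduces to the familiar slope and intercept formulas; the function name `fit_line` and the data are invented for the example.

```python
def fit_line(xs, ys):
    """Closed-form least-squares fit of y = w0 + w1 * x (no iteration needed)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w1 = sxy / sxx               # slope
    w0 = mean_y - w1 * mean_x    # intercept (bias weight)
    return w0, w1

# Hypothetical noise-free data generated from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w0, w1 = fit_line(xs, ys)
print(w0, w1)
```

No comparable formula exists for the cross-entropy error, because setting its gradient to zero gives an equation in $w$ that cannot be solved algebraically.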
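Method of steepest (gradient) descent: fixed step size
$w(1) = w(0) + \eta \hat{v}$
$\hat{v}$ is the unit vector in the direction of the negative gradient: $\hat{v} = -\nabla E_{in}(w(0)) / \lVert \nabla E_{in}(w(0)) \rVert$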
Method of steepest (gradient) descent: fixed learning rate
$w(1) = w(0) + \Delta w$, with $\Delta w = -\eta \nabla E_{in}(w(0))$
The weights change fastest where the gradient is largest.
For $E_{in}$ = cross entropy, the gradient has an analytical form.
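The difference between the two update rules is easiest to see side by side. In this sketch the gradient vector is a made-up value rather than one computed from data, and the function names are illustrative only.

```python
import math

def step_fixed_size(w, grad, eta):
    """w(1) = w(0) + eta * v_hat, where v_hat = -grad / ||grad||."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(w)           # already at a stationary point
    return [wi - eta * g / norm for wi, g in zip(w, grad)]

def step_fixed_learning_rate(w, grad, eta):
    """w(1) = w(0) - eta * grad, so the move scales with the gradient's size."""
    return [wi - eta * g for wi, g in zip(w, grad)]

def distance(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

w_start = [1.0, -2.0]
grad = [3.0, 4.0]    # hypothetical gradient with norm 5
eta = 0.1

w_a = step_fixed_size(w_start, grad, eta)
w_b = step_fixed_learning_rate(w_start, grad, eta)
print(distance(w_start, w_a), distance(w_start, w_b))
```

The fixed step size always moves a distance `eta`, while the fixed learning rate moves `eta` times the gradient norm, so its steps shrink automatically as the gradient flattens near a minimum.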
Logistic regression algorithm
How to compute the gradient of $E_{in}$
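Differentiating the cross-entropy error term by term gives $\nabla E_{in}(w) = -\frac{1}{N} \sum_{n=1}^{N} \frac{y_n x_n}{1 + e^{y_n w^T x_n}}$. The sketch below implements that expression and compares it with a finite-difference estimate as a sanity check; the dataset, the weights, and all function names are assumptions made for the example.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def e_in(w, X, y):
    """Cross-entropy error (1/N) * sum_n ln(1 + exp(-y_n * w^T x_n))."""
    return sum(math.log1p(math.exp(-yn * dot(w, xn))) for xn, yn in zip(X, y)) / len(X)

def gradient(w, X, y):
    """Analytical gradient: -(1/N) * sum_n y_n * x_n / (1 + exp(y_n * w^T x_n))."""
    g = [0.0] * len(w)
    for xn, yn in zip(X, y):
        coeff = -yn / (1.0 + math.exp(yn * dot(w, xn)))
        for i, xi in enumerate(xn):
            g[i] += coeff * xi
    return [gi / len(X) for gi in g]

def numerical_gradient(w, X, y, h=1e-5):
    """Central-difference estimate, used only to check the analytical form."""
    g = []
    for i in range(len(w)):
        up = list(w); up[i] += h
        dn = list(w); dn[i] -= h
        g.append((e_in(up, X, y) - e_in(dn, X, y)) / (2.0 * h))
    return g

# Hypothetical data: constant 1 followed by two attributes.
X = [[1.0, 0.5, -1.2], [1.0, -0.3, 0.8], [1.0, 1.5, 0.4]]
y = [+1, -1, +1]
w = [0.1, -0.2, 0.3]

g = gradient(w, X, y)
g_num = numerical_gradient(w, X, y)
print(g, g_num)
```

Points that are already classified with a large margin ($y_n w^T x_n \gg 0$) contribute almost nothing to the sum, so the gradient is dominated by misclassified and borderline points.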
How to know when to stop
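One way to put the pieces together is sketched below: fixed-learning-rate gradient descent on the cross-entropy error, stopping when the gradient norm falls below a tolerance or an iteration cap is reached. These two stopping rules are common choices assumed here, not necessarily the ones on the slide, and the values of `eta`, `tol`, `max_iters`, and the toy data are all made up.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def e_in(w, X, y):
    """Cross-entropy error (1/N) * sum_n ln(1 + exp(-y_n * w^T x_n))."""
    return sum(math.log1p(math.exp(-yn * dot(w, xn))) for xn, yn in zip(X, y)) / len(X)

def gradient(w, X, y):
    """Analytical gradient of the cross-entropy error."""
    g = [0.0] * len(w)
    for xn, yn in zip(X, y):
        coeff = -yn / (1.0 + math.exp(yn * dot(w, xn)))
        for i, xi in enumerate(xn):
            g[i] += coeff * xi
    return [gi / len(X) for gi in g]

def logistic_regression(X, y, eta=0.5, tol=1e-6, max_iters=10000):
    """Gradient descent from w = 0; returns the weights and iterations used."""
    w = [0.0] * len(X[0])
    for t in range(max_iters):
        g = gradient(w, X, y)
        # Stopping rule 1 (assumed): gradient is essentially zero.
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            return w, t
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    # Stopping rule 2 (assumed): iteration cap reached.
    return w, max_iters

# Hypothetical, deliberately non-separable data: constant 1 plus one attribute.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, -0.5], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0]]
y = [-1, -1, +1, -1, +1, +1]

w, iters = logistic_regression(X, y)
print(w, iters, e_in(w, X, y))
```

The data are non-separable on purpose: with perfectly separable data the error can always be lowered by scaling $w$ up, so the gradient-norm test is reached only slowly and the iteration cap matters more.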
Assignment 6: Due