Logistic Regression
Linear regression fits a line to a set of points. Given x, you can use the line to predict y.
Logistic regression fits a logistic function (sigmoid) to a set of points and binary labels. Given a new point, the sigmoid gives the predicted probability that the class is positive. https://en.wikipedia.org/wiki/Logistic_regression
Logistic Regression
For ease of notation, let x = (x0, x1, …, xn), where x0 = 1. Let w = (w0, w1, …, wn), where w0 is the bias weight. Class y ∈ {0, 1}. The model is:

    P(y = 1 | x, w) = 1 / (1 + exp(-w · x))
    P(y = 0 | x, w) = 1 - P(y = 1 | x, w)
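For concreteness, here is a minimal NumPy sketch of this model; sigmoid and predict_proba are illustrative names, not from the slides:

    import numpy as np

    def sigmoid(z):
        # Logistic (sigmoid) function: maps any real z into (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    def predict_proba(w, x):
        # P(y = 1 | x, w) for one example x, where x[0] = 1 is the bias term.
        return sigmoid(np.dot(w, x))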
Learning: Use training data to determine the weights. To classify a new x, assign the class y that maximizes P(y | x).
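The decision rule as code, reusing the helpers sketched above; the 0.5 threshold is just the point where P(y = 1 | x) = P(y = 0 | x):

    def classify(w, x):
        # Assign the class y in {0, 1} that maximizes P(y | x, w).
        return 1 if predict_proba(w, x) >= 0.5 else 0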
Logistic Regression: Learning Weights
Goal is to learn weights w. Let (x^j, y^j) be the jth training example and its label. We want:

    w = argmax_w Π_j P(y^j | x^j, w)

This is equivalent to:

    w = argmax_w Σ_j ln P(y^j | x^j, w)

This is called the "log of conditional likelihood".
We can write the log conditional likelihood this way:

    l(w) = Σ_j [ y^j ln P(y^j = 1 | x^j, w) + (1 - y^j) ln P(y^j = 0 | x^j, w) ]

Since y^j is either 0 or 1, only one of the two terms is nonzero for each example. This is what we want to maximize.
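As a sanity check during learning, l(w) can be computed directly. A minimal sketch, reusing sigmoid from above and assuming X is an array whose rows are the examples x^j (bias column x0 = 1 included) and y is a 0/1 label vector:

    def log_conditional_likelihood(w, X, y):
        # l(w) = sum_j [ y^j ln P(y^j=1|x^j,w) + (1 - y^j) ln P(y^j=0|x^j,w) ]
        p = sigmoid(X @ w)          # P(y = 1 | x^j, w) for every example j
        eps = 1e-12                 # guard against ln(0)
        return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))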
Use gradient ascent to maximize l(w). This is called "maximum likelihood estimation" (or MLE). Recall:

    P(y = 1 | x, w) = 1 / (1 + exp(-w · x))

We have:

    l(w) = Σ_j [ y^j ln P(y^j = 1 | x^j, w) + (1 - y^j) ln P(y^j = 0 | x^j, w) ]

Let's find the gradient with respect to wi. Using the chain rule and algebra:

    ∂l(w)/∂wi = Σ_j x_i^j ( y^j - P(y^j = 1 | x^j, w) )
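The gradient is just as compact in code; a vectorized sketch under the same assumptions about X and y as above:

    def gradient(w, X, y):
        # ∂l/∂wi = sum_j x_i^j (y^j - P(y^j = 1 | x^j, w)), computed for all i at once.
        p = sigmoid(X @ w)
        return X.T @ (y - p)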
Stochastic Gradient Ascent for Logistic Regression
Start with small random initial weights, both positive and negative: w = (w0, w1, …, wn).
Repeat until convergence, or for some max number of epochs:
    For each training example (x^j, y^j), update every weight wi:

        wi ← wi + η x_i^j ( y^j - P(y^j = 1 | x^j, w) )

where η is the learning rate. Note again that w includes the bias weight w0, and x includes the bias term x0 = 1.
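Putting the pieces together, a minimal sketch of the training loop; eta, max_epochs, and the random-weight range are illustrative choices, not prescribed by the slides:

    import numpy as np

    def train_logistic_sga(X, y, eta=0.1, max_epochs=100, seed=0):
        # X: shape (m, n+1) with bias column x0 = 1; y: shape (m,) of 0/1 labels.
        rng = np.random.default_rng(seed)
        w = rng.uniform(-0.05, 0.05, size=X.shape[1])  # small random weights, + and -
        for _ in range(max_epochs):
            for j in rng.permutation(len(y)):          # visit examples in random order
                p = 1.0 / (1.0 + np.exp(-np.dot(w, X[j])))  # P(y = 1 | x^j, w)
                w += eta * (y[j] - p) * X[j]                # wi += η x_i^j (y^j - p)
        return w

Convergence can be monitored by tracking log_conditional_likelihood(w, X, y) across epochs, which should be roughly nondecreasing for a small enough eta.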
Homework 4, Part 2