Pattern Recognition and Image Analysis Dr. Manal Helal – Fall 2014 Lecture 7 Linear Classifiers
Linear Models: Linear models, Perceptron, Naïve Bayes, Logistic regression
Linear Models: Linear models for classification separate input vectors into classes using linear (hyper-plane) decision boundaries. Example: a 2D input vector x and two discrete classes C1 and C2.
An Example: Positive examples are blank, negative examples are filled. Which line describes the decision boundary? What about higher dimensions? Think of training examples as points in d-dimensional space, where each dimension corresponds to one feature. A linear binary classifier defines a hyper-plane in this space which separates positive from negative examples.
Linear Decision Boundary: A hyper-plane is a generalization of a straight line to more than 2 dimensions. A hyper-plane contains all the points in a d-dimensional space satisfying the equation $w_1 x_1 + w_2 x_2 + \dots + w_d x_d + w_0 = 0$. Each coefficient $w_i$ can be thought of as a weight on the corresponding feature. The vector containing all the weights, $w = (w_0, \dots, w_d)$, is the parameter vector or weight vector.
Normal Vector: Geometrically, the weight vector $(w_1, \dots, w_d)$, without the bias $w_0$, is a normal vector of the separating hyper-plane. A normal vector of a surface is any vector which is perpendicular to it.
Hyper-plane as a classifier: Let $g(x) = w_1 x_1 + w_2 x_2 + \dots + w_d x_d + w_0$. Then x is assigned to the positive class if $g(x) \ge 0$ and to the negative class otherwise.
Bias: The slope of the hyper-plane is determined by $w_1, \dots, w_d$. The location (intercept) is determined by the bias $w_0$. Include the bias in the weight vector and add a dummy component $x_0 = 1$ to the feature vector. Then $g(x) = w \cdot x$.
Separating hyper-planes in 2 dimensions
$y(x) = w^T x + w_0$. If $y(x) \ge 0$, x is assigned to $C_1$; if $y(x) < 0$, x is assigned to $C_2$. Thus $y(x) = 0$ defines the decision boundary.
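As a rough illustration of this decision rule, here is a minimal Python sketch. The function names and the example boundary $x_1 + x_2 - 1 = 0$ are assumptions made for illustration, not part of the lecture.

```python
def classify(w, x, w0=0.0):
    """Return 'C1' if y(x) = w.x + w0 >= 0, else 'C2'."""
    y = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return "C1" if y >= 0 else "C2"


def classify_augmented(w_aug, x):
    """Same rule with the bias folded in: w_aug = (w0, w1, ..., wd) and x0 = 1."""
    x_aug = [1.0] + list(x)
    y = sum(wi * xi for wi, xi in zip(w_aug, x_aug))
    return "C1" if y >= 0 else "C2"


# Hypothetical 2D boundary: x1 + x2 - 1 = 0.
w, w0 = [1.0, 1.0], -1.0
print(classify(w, [2.0, 2.0], w0))   # y(x) = 3
print(classify(w, [0.0, 0.0], w0))   # y(x) = -1
```

Both functions give the same answer for the same point; the second only shows that the dummy component $x_0 = 1$ lets the bias be treated as an ordinary weight.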
Learning: The goal of the learning process is to come up with a “good” weight vector w. The learning process uses examples to guide the search for a “good” w. Different notions of “goodness” exist, which yield different learning algorithms. We will describe some of these algorithms in the following slides.
Perceptron
Perceptron Training: How do we find a set of weights that separates our classes? The perceptron is a simple mistake-driven online algorithm:
1. Start with a zero weight vector and process each training example in turn.
2. If the current weight vector classifies the current example incorrectly, move the weight vector in the right direction.
3. If the weights stop changing, stop.
If the examples are linearly separable, this algorithm is guaranteed to converge to a solution vector.
Fixed increment online perceptron algorithm: Binary classification, with classes +1 and −1. Decision function: y′ = sign(w · x).
Perceptron(x_1:N, y_1:N, I):
  w ← 0
  for i = 1...I do
    for n = 1...N do
      if y^(n) (w · x^(n)) ≤ 0 then
        w ← w + y^(n) x^(n)
  return w
Or more explicitly:
  w ← 0
  for i = 1...I do
    for n = 1...N do
      if y^(n) = sign(w · x^(n)) then
        pass
      else if y^(n) = +1 ∧ sign(w · x^(n)) = −1 then
        w ← w + x^(n)
      else if y^(n) = −1 ∧ sign(w · x^(n)) = +1 then
        w ← w − x^(n)
  return w
Tracing an example for a NAND function is found in:
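The fixed-increment algorithm can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the early stop when no weight changes, the choice of 100 epochs, and the NAND truth table with a dummy feature $x_0 = 1$ are assumptions made here.

```python
def sign(v):
    # Map a score to a label; a score of exactly 0 is mapped to -1 (arbitrary choice).
    return 1 if v > 0 else -1


def perceptron(xs, ys, epochs):
    """Fixed-increment perceptron; every x is expected to include x0 = 1."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        changed = False
        for x, y in zip(xs, ys):
            activation = sum(wi * xi for wi, xi in zip(w, x))
            if y * activation <= 0:          # misclassified or on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)]
                changed = True
        if not changed:                      # weights stopped changing
            break
    return w


# NAND truth table with a dummy feature x0 = 1; labels are +1 / -1.
xs = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
ys = [+1, +1, +1, -1]
w = perceptron(xs, ys, epochs=100)
predictions = [sign(sum(wi * xi for wi, xi in zip(w, x))) for x in xs]
print(w, predictions)
```

NAND is linearly separable, so the loop stops once a full pass makes no update, and the learned weights then reproduce all four labels.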
Weight averaging: Although the algorithm is guaranteed to converge, the solution is not unique! It is sensitive to the order in which examples are processed. Separating the training sample does not guarantee good accuracy on unseen data. Empirically, weight averaging gives better generalization performance; it is a method of avoiding overfitting. As the final weight vector, use the mean of all the weight vector values over the steps of the algorithm (cf. regularization in a following session).
Efficient averaged perceptron algorithm:
Perceptron(x_1:N, y_1:N, I):
  w ← 0; wa ← 0
  b ← 0; ba ← 0
  c ← 1
  for i = 1...I do
    for n = 1...N do
      if y^(n) (w · x^(n) + b) ≤ 0 then
        w ← w + y^(n) x^(n); b ← b + y^(n)
        wa ← wa + c y^(n) x^(n); ba ← ba + c y^(n)
      c ← c + 1
  return (w − wa/c, b − ba/c)
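A minimal Python sketch of the averaged algorithm follows. It assumes that the counter c advances once per example whether or not an update happens; the two-point dataset is made up for illustration.

```python
def averaged_perceptron(xs, ys, epochs):
    d = len(xs[0])
    w, b = [0.0] * d, 0.0      # current weights and bias
    wa, ba = [0.0] * d, 0.0    # counter-weighted sums of the updates
    c = 1                      # step counter
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                wa = [wai + c * y * xi for wai, xi in zip(wa, x)]
                ba += c * y
            c += 1             # assumed: incremented for every example
    return [wi - wai / c for wi, wai in zip(w, wa)], b - ba / c


# Made-up, trivially separable 1D data.
xs = [[2.0], [-2.0]]
ys = [+1, -1]
w_avg, b_avg = averaged_perceptron(xs, ys, epochs=2)
print(w_avg, b_avg)
```

Keeping the running sums wa and ba avoids storing every intermediate weight vector: the average is recovered at the end from w, wa and c alone.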
Naïve Bayes
Probabilistic Model: Instead of thinking in terms of multidimensional space, classification can be approached as a probability estimation problem. We will try to find a probability distribution which describes our training data well and allows us to make accurate predictions. We’ll look at Naive Bayes as the simplest example of a probabilistic classifier.
Representation of Examples: We are trying to classify documents. Let’s represent a document as the sequence of terms (words) it contains, $t = (t_1, \dots, t_n)$. For (binary) classification we want to find the most probable class: $\hat{y} = \operatorname{argmax}_{y \in \{-1, +1\}} P(Y = y \mid t)$. But documents are close to unique, so we cannot reliably estimate $P(Y \mid t)$ directly. Bayes’ rule to the rescue.
Bayes Rule: Bayes’ rule determines how joint and conditional probabilities are related.
Prior and likelihood: With Bayes’ rule we can invert the direction of conditioning. This decomposes the task into estimating the prior $P(Y)$ (easy) and the likelihood $P(t \mid Y = y)$.
Conditional Independence: How do we estimate $P(t \mid Y = y)$? Naively assume that the occurrence of each word in the document is independent of the others, when conditioned on the class.
Naive Bayes: Putting it all together.
Decision Function: For binary classification:
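For reference, the standard Naive Bayes derivation that these steps describe can be written as follows. The ratio $g$ with a threshold of 1 is one common way to express the binary decision function and is assumed here.

```latex
\begin{align*}
P(Y = y \mid t) &= \frac{P(t \mid Y = y)\, P(Y = y)}{P(t)}
  && \text{(Bayes' rule)} \\
P(t \mid Y = y) &\approx \prod_{i=1}^{n} P(t_i \mid Y = y)
  && \text{(conditional independence)} \\
P(Y = y \mid t) &\propto P(Y = y) \prod_{i=1}^{n} P(t_i \mid Y = y)
  && \text{(Naive Bayes)} \\
g(t) &= \frac{P(Y = +1) \prod_{i=1}^{n} P(t_i \mid Y = +1)}
             {P(Y = -1) \prod_{i=1}^{n} P(t_i \mid Y = -1)},
\qquad
\hat{y} =
\begin{cases}
  +1 & \text{if } g(t) \ge 1 \\
  -1 & \text{otherwise}
\end{cases}
\end{align*}
```

The denominator $P(t)$ is the same for both classes, which is why it drops out of the comparison.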
Documents in Vector Notation: Let’s represent documents as vocabulary-size-dimensional count vectors. Dimension i indicates how many times the i-th vocabulary item appears in document x.
Naive Bayes in Vector Notation: Counts appear as exponents. If we take the logarithm of both the threshold (ln 1 = 0) and g, we get the same decision function.
Linear Classifier: Remember the linear classifier? The log prior ratio corresponds to the bias term. The log likelihood ratios correspond to the feature weights.
What is the difference? The training criterion and procedure. Perceptron: zero-one loss function; error-driven algorithm.
Naive Bayes: Maximum likelihood criterion. Find the parameters which maximize the log likelihood. Parameters reduce to relative counts, plus ad-hoc smoothing. Alternatives exist (e.g. maximum a posteriori).
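To make the link between Naive Bayes and the linear classifier concrete, here is a small Python sketch. It estimates the parameters as relative counts and reads off the bias and weights as log ratios. Add-one smoothing is an assumed choice for the "ad-hoc smoothing", and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def train_nb_linear(docs, labels, vocab):
    """Return (weights, bias) of Naive Bayes written as a linear classifier.

    docs: list of token lists; labels: +1 / -1; vocab: list of terms.
    """
    counts = {+1: Counter(), -1: Counter()}
    n_docs = Counter(labels)
    for tokens, y in zip(docs, labels):
        counts[y].update(t for t in tokens if t in vocab)

    def likelihood(term, y):
        # Relative count with add-one smoothing (assumed smoothing scheme).
        return (counts[y][term] + 1) / (sum(counts[y].values()) + len(vocab))

    # Log likelihood ratios play the role of feature weights.
    weights = {t: math.log(likelihood(t, +1) / likelihood(t, -1)) for t in vocab}
    # Log prior ratio plays the role of the bias term.
    bias = math.log(n_docs[+1] / n_docs[-1])
    return weights, bias


def predict(weights, bias, tokens):
    # A term that occurs k times contributes its weight k times (count as multiplier).
    score = bias + sum(weights.get(t, 0.0) for t in tokens)
    return +1 if score >= 0 else -1


# Made-up toy corpus.
docs = [["good", "great"], ["good", "fun"], ["bad", "boring"], ["bad", "awful"]]
labels = [+1, +1, -1, -1]
vocab = sorted({t for d in docs for t in d})
weights, bias = train_nb_linear(docs, labels, vocab)
print(predict(weights, bias, ["good", "fun"]), predict(weights, bias, ["bad"]))
```

The score computed in `predict` is the logarithm of the ratio g, so comparing it with 0 is the same test as comparing g with 1.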
Comparison
Logistic Regression
Probabilistic Conditional Model: Let’s try to come up with a probabilistic model which has some of the advantages of the perceptron. Model $P(y \mid x)$ directly, and not via $P(x \mid y)$ and Bayes’ rule as in Naïve Bayes. This avoids the issue of dependencies between the features of x. We’ll take linear regression as a starting point. The goal is to adapt regression to model class-conditional probability.
Linear Regression: Training data: observations paired with outcomes ($y \in \mathbb{R}$). Observations have features (predictors, typically also real numbers). The model is a regression line $y = ax + b$ which best fits the observations, where a is the slope and b is the intercept. This model has two parameters (or weights) and one feature, x. Example: x = number of vague adjectives in property descriptions; y = amount the house sold for over the asking price.
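Multiple Linear Regression: More generally, $y = w_0 + \sum_{i=1}^{d} w_i x_i$, where y is the outcome, $w_0$ is the intercept, $x_1, \dots, x_d$ is the feature vector and $w_1, \dots, w_d$ is the weight vector. Get rid of the bias by setting $x_0 = 1$: $g(x) = \sum_{i=0}^{d} w_i x_i = w \cdot x$. Linear regression uses $g(x)$ directly; a linear classifier uses $\operatorname{sign}(g(x))$.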
Learning Linear Regression: Minimize the sum squared error over the N training examples, $Err(w) = \sum_{n=1}^{N} (g(x^{(n)}) - y^{(n)})^2$. There is a closed-form formula for choosing the best weights w: $w = (X^T X)^{-1} X^T y$, where the matrix X contains the training example features and y is the vector of outcomes.
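For the one-feature model y = ax + b, minimizing the sum squared error has a well-known closed form: the slope is the covariance of x and y divided by the variance of x. The Python sketch below implements only that special case, not the general matrix formula. The function name and the data points are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b for the line y = a*x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx               # slope
    b = mean_y - a * mean_x     # intercept
    return a, b


# Made-up points lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
print(a, b)
```

With more than one feature the same minimization is usually solved with a linear algebra library rather than by hand.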
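Logistic Regression: In logistic regression we use the linear model to assign probabilities to class labels. For binary classification, predict $P(Y = 1 \mid x)$. But the predictions of a linear regression model are in $\mathbb{R}$, whereas $P(Y = 1 \mid x) \in [0, 1]$. Instead, predict the logit function of the probability: $\ln \frac{P(Y = 1 \mid x)}{1 - P(Y = 1 \mid x)} = w \cdot x$.
Solving for $P(Y = 1 \mid x)$ we obtain: $P(Y = 1 \mid x) = \frac{e^{w \cdot x}}{1 + e^{w \cdot x}} = \frac{1}{1 + e^{-w \cdot x}}$.
Logistic Regression (Classification): Example x belongs to class 1 if $P(Y = 1 \mid x) > 0.5$, which is equivalent to $w \cdot x > 0$. The equation $w \cdot x = 0$ defines a hyper-plane, with points above it belonging to class 1.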
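A short Python sketch of the logistic model and its classification rule follows. The weight vector is a hypothetical example, and x is assumed to carry the dummy component $x_0 = 1$.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z)), the inverse of the logit."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, x):
    """P(Y = 1 | x) under logistic regression; x includes x0 = 1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def predict_class(w, x):
    """Class 1 if P(Y = 1 | x) > 0.5, i.e. if w.x > 0; otherwise class 0."""
    return 1 if predict_proba(w, x) > 0.5 else 0


w = [-1.0, 2.0]                      # hypothetical weights (bias, one feature)
print(predict_proba(w, [1.0, 1.0]))  # w.x = 1, probability above 0.5
print(predict_proba(w, [1.0, 0.0]))  # w.x = -1, probability below 0.5
```

A point with w · x = 0 lies exactly on the hyper-plane and gets probability 0.5.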
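Multinomial Logistic Regression: Logistic regression generalized to more than two classes: $P(Y = y \mid x, W) = \frac{\exp(W_y \cdot x)}{\sum_{y'} \exp(W_{y'} \cdot x)}$, where $W_y$ is the weight vector for class y.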
Learning Parameters: Conditional likelihood estimation: choose the weights which make the probability of the observed values y the highest, given the observations $x^{(n)}$. For the training set with N examples: $\hat{W} = \operatorname{argmax}_W \prod_{n=1}^{N} P(Y = y^{(n)} \mid x^{(n)}, W)$.
Error Function: Equivalently, we seek the value of the parameters which minimizes the error function: $E(W) = -\sum_{n=1}^{N} \ln P(Y = y^{(n)} \mid x^{(n)}, W)$.
A problem in convex optimization. Applicable methods: L-BFGS (limited-memory Broyden-Fletcher-Goldfarb-Shanno method), gradient descent, conjugate gradient, iterative scaling algorithms.
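Gradient Descent Learning Rule: Weight update rule: $w_j \leftarrow w_j + \eta\,(t(i) - \sigma[f(i)])\,\sigma'[f(i)]\,x_j(i)$, where $\eta$ is the learning rate, $t(i)$ is the target for example i, and $\sigma$ is the activation function applied to $f(i)$. We can rewrite this as $w_j \leftarrow w_j + \eta \cdot \text{error} \cdot c \cdot x_j(i)$, with error $= t(i) - \sigma[f(i)]$ and $c = \sigma'[f(i)]$. The basic idea: a gradient is the slope of a function, that is, a set of partial derivatives, one for each dimension (parameter). By following the negative gradient of a convex function we can descend to the bottom (minimum).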
What is the gradient? Partial derivatives of $E(w_0, w_1, w_2)$: e.g., $\partial E(w_0, w_1, w_2) / \partial w_1$. The gradient is defined as the vector of partial derivatives: $\nabla E(w) = [\partial E(w)/\partial w_0,\ \partial E(w)/\partial w_1,\ \partial E(w)/\partial w_2]$ (defined here on 3 dimensions).
1. $E(w)$ and $\nabla E(w)$ can be evaluated at any particular point w.
2. The components of the gradient $\nabla E(w)$ tell us how fast $E(w)$ is changing in each direction.
3. When interpreted as a vector, $\nabla E(w)$ is the direction of steepest increase, so $-\nabla E(w)$ is the direction of steepest decrease.
Gradient Descent Rule in Multiple Dimensions: Gradient descent rule: $w_{new} = w_{old} - \eta \nabla E(w)$, where $\nabla E(w)$ is the gradient and $\eta$ is the learning rate (small, positive). Notes:
1. This moves us downhill in the direction $-\nabla E(w)$ (the steepest downhill direction).
2. How far we go is determined by the value of $\eta$.
3. The perceptron learning rule is a special case of this general method.
Illustration of Gradient Descent [Figure: error surface E(w) plotted over the weights w0 and w1.]
Illustration of Gradient Descent [Figure: the same surface; the direction of steepest descent is the direction of the negative gradient.]
Illustration of Gradient Descent [Figure: the same surface, showing the original point in weight space and the new point in weight space.]
Comments on the Gradient Descent Algorithm: Equivalent to hill-climbing heuristic search. Works on any objective function $E(w)$ as long as we can evaluate the gradient $\nabla E(w)$; this can be very useful for minimizing complex functions E. Local minima: $E(w)$ can have multiple local minima, and gradient descent goes to the closest local minimum. Solution: random restarts from multiple places in weight space. (Note: for the perceptron, $E(w)$ has only a single global minimum, so local minima are not a problem.)
General Gradient Descent Algorithm: Define an objective function $E(w)$ (the function to be minimized). We want to find the vector of values w that minimizes $E(w)$. Algorithm:
1. Pick an initial set of weights w, e.g. randomly.
2. Evaluate $\nabla E(w)$ at w (this can be done numerically or in closed form).
3. Update all the weights: $w_{new} = w_{old} - \eta \nabla E(w)$.
4. Check whether $\nabla E(w)$ is approximately 0. If so, we have converged to a “flat minimum”; if not, we move again in weight space.
For perceptron learning, the j-th component of $-\nabla E(w)$ is, up to a constant factor, $(t(i) - \sigma[f(i)])\,\sigma'[f(i)]\,x_j(i)$, where $\sigma$ is the activation function applied to $f(i)$.
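The algorithm above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the gradient is supplied in closed form, the tolerance and step limit are arbitrary, and the quadratic objective is a made-up convex example.

```python
def gradient_descent(grad, w_init, eta=0.1, tol=1e-6, max_steps=10000):
    """Repeat w <- w - eta * grad(w) until the gradient is approximately 0."""
    w = list(w_init)
    for _ in range(max_steps):
        g = grad(w)
        if max(abs(gj) for gj in g) < tol:   # converged: gradient close to 0
            break
        w = [wj - eta * gj for wj, gj in zip(w, g)]
    return w


# Hypothetical convex objective E(w) = (w0 - 3)^2 + (w1 + 1)^2.
def grad_E(w):
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]


w_min = gradient_descent(grad_E, [0.0, 0.0])
print(w_min)   # close to the minimizer (3, -1)
```

For a non-convex E the same loop only finds a nearby local minimum, which is why random restarts from several initial points are suggested.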
Minimization of Mean Squared Error, E(w) [Figure: E(w) plotted against $w_1$, with the minimum of the function marked.]
Minimization of Mean Squared Error, E(w) [Figure: E(w) plotted against $w_1$, with the derivative $dE(w)/dw_1$ shown at a point.]
Moving Downhill: Move in the direction of the negative derivative. Where $dE(w)/dw_1 > 0$, the update $w_1 \leftarrow w_1 - \eta\, dE(w)/dw_1$ decreases $w_1$, moving toward decreasing E(w).
Moving Downhill: Move in the direction of the negative derivative. Where $dE(w)/dw_1 < 0$, the update $w_1 \leftarrow w_1 - \eta\, dE(w)/dw_1$ increases $w_1$, moving toward decreasing E(w).
Gradient Descent Example: Find $\operatorname{argmin}_\theta f(\theta)$ where $f(\theta) = \theta^2$. Initial value: $\theta^{(1)} = -1$. Gradient function: $\nabla f(\theta) = 2\theta$. Update: $\theta^{(n+1)} = \theta^{(n)} - \eta \nabla f(\theta^{(n)})$. The learning rate $\eta$ (= 0.2) controls the speed of the descent. After the first iteration: $\theta^{(2)} = -1 - 0.2 \cdot (-2) = -0.6$.
Five Iterations of Gradient Descent
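The five iterations can be reproduced with a few lines of Python, using the same function, starting point and learning rate as the example. The variable names are chosen here for illustration.

```python
def grad_f(theta):
    return 2 * theta              # gradient of f(theta) = theta**2


eta = 0.2                         # learning rate from the example
theta = -1.0                      # initial value theta^(1)
trajectory = [theta]
for _ in range(5):
    theta = theta - eta * grad_f(theta)
    trajectory.append(theta)
print(trajectory)
```

Each step multiplies $\theta$ by $1 - 2\eta = 0.6$, so the sequence is approximately −1, −0.6, −0.36, −0.216, −0.1296, −0.07776 and approaches the minimum at 0.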
Stochastic Gradient Descent (SGD): We could compute the gradient of the error for the full dataset before each update. Instead: compute the gradient of the error for a single example, update the weights, and move on to the next example. On average, we’ll move in the right direction. This gives an efficient, online algorithm. However, it is stochastic: each update uses a single sample, so individual steps are noisy, which can help when the error surface has many local maxima/minima.
Error gradient: The gradient of the error function is the set of partial derivatives of the error function with respect to the parameters $W_{yi}$.
Single training example
Update: Stochastic gradient update step.
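Update (Explicit): For the correct class ($y = y^{(n)}$): $W_y \leftarrow W_y + \eta\,(1 - P(Y = y \mid x^{(n)}, W))\,x^{(n)}$, where $1 - P(Y = y \mid x^{(n)}, W)$ is the residual. For all other classes ($y \ne y^{(n)}$): $W_y \leftarrow W_y - \eta\,P(Y = y \mid x^{(n)}, W)\,x^{(n)}$.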
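A minimal Python sketch of stochastic gradient training for multinomial logistic regression follows. It assumes the usual softmax form for $P(Y = y \mid x, W)$; the function names, learning rate, epoch count and toy data are all illustrative choices.

```python
import math


def softmax_probs(W, x):
    """P(Y = y | x, W) for every class y; W maps class -> weight list."""
    scores = {y: sum(wi * xi for wi, xi in zip(w, x)) for y, w in W.items()}
    m = max(scores.values())                    # subtract max for numerical stability
    exps = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


def sgd_step(W, x, y_true, eta):
    probs = softmax_probs(W, x)
    for y, w in W.items():
        # Residual: 1 - P for the correct class, -P for every other class.
        residual = (1.0 if y == y_true else 0.0) - probs[y]
        W[y] = [wi + eta * residual * xi for wi, xi in zip(w, x)]


def train(data, classes, dim, eta=0.5, epochs=50):
    W = {y: [0.0] * dim for y in classes}
    for _ in range(epochs):
        for x, y in data:                       # one update per training example
            sgd_step(W, x, y, eta)
    return W


def predict(W, x):
    probs = softmax_probs(W, x)
    return max(probs, key=probs.get)


# Made-up data; the first component of x is the dummy feature x0 = 1.
data = [([1.0, 2.0], "a"), ([1.0, -2.0], "b"), ([1.0, 1.5], "a"), ([1.0, -1.0], "b")]
W = train(data, classes=["a", "b"], dim=2)
print([predict(W, x) for x, _ in data])
```

Unlike the perceptron, every example changes the weights a little, even a correctly classified one, because the residual is only zero when the predicted probability is exactly 1.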
Logistic Regression SGD vs Perceptron: Very similar update! The perceptron is simply an instantiation of SGD for a particular error function. The perceptron criterion: zero error for a correctly classified example; $-y^{(n)}\, w \cdot x^{(n)}$ for a misclassified example.
Comparison
Exercise: Do Examples 2.2.1:2 and 2.3.1, using your project’s dataset rather than the randomly generated data shown.