1
Classification: Linear Models
Oliver Schulte, Deep Learning 726
2
Parent Node / Child Node
Discrete parent, discrete child: Maximum Likelihood; Decision Trees.
Continuous parent, discrete child (classifiers): logit distribution (logistic regression); linear discriminant (perceptron); support vector machine.
Discrete parent, continuous child: conditional Gaussian (not discussed).
Continuous parent, continuous child: linear Gaussian (linear regression).
3
Linear Classification Models
General idea: learn a linear, continuous function y of the continuous features x. Classify as positive if y exceeds a threshold, typically 0. As in linear regression, we can use more complicated features defined by basis functions ϕ.
4
Example: Classifying Digits
Classify an input vector as “4” vs. “not 4”. Represent the input image as a vector x with 28x28 = 784 numbers. Target t = 1 for “positive”, -1 for “negative” (we could choose other values, like 1 vs. 0; this encoding will turn out to be convenient). Given a training set (x1,t1,...,xN,tN), the problem is to find a good linear function y(x), with y: R^784 → R. Classify x as positive if y(x) > 0, negative otherwise.
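As a hedged illustration of this setup (not the course's actual code; the image and weight vector below are random stand-ins, purely hypothetical):

```python
import numpy as np

# Hypothetical 28x28 grayscale image, flattened into a 784-dimensional vector x.
image = np.random.rand(28, 28)   # stand-in for a real digit image
x = image.reshape(-1)            # x lives in R^784

# Hypothetical linear function y(x) = w.x + w0 (real weights would come from training).
w = np.random.randn(784)
w0 = 0.0
y = w @ x + w0

# Classify as "4" (t = +1) if y > 0, "not 4" (t = -1) otherwise.
t_predicted = 1 if y > 0 else -1
print(y, t_predicted)
```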
5
Other Examples Will the person vote conservative, given age, income, and previous votes? Is the patient at risk of diabetes, given body mass, age, and blood test measurements? Predict earthquake vs. nuclear explosion, given body wave magnitude and surface wave magnitude. [Diagrams: feature nodes (age, income, votes) pointing to "conservative"; feature nodes (surface wave magnitude, body wave magnitude) pointing to "disaster type".]
6
Linear Separation
[Figure (Russell and Norvig, Figure 18.15): seismic events in Asia and the Middle East between 1982 and 1990. White = earthquake, black = nuclear explosion; x1 = surface wave magnitude, x2 = body wave magnitude.]
7
Linear Discriminants Simple linear model: y(x) = w^T x + w0.
Can drop the explicit w0 if we add a fixed dummy bias feature. The decision surface is orthogonal to w; in 2-D, it is just a line between the classes. The weight vector points towards the positive class.
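A minimal sketch of the dummy-bias trick, with made-up numbers (the specific weights and input are assumptions for illustration):

```python
import numpy as np

w0 = -1.0                      # bias
w = np.array([2.0, 3.0])       # weight vector, points towards the positive class

x = np.array([0.5, 0.2])
y_explicit = w @ x + w0        # explicit bias form

# Dummy-bias trick: absorb w0 as the weight of a constant feature 1.
w_tilde = np.concatenate(([w0], w))     # (w0, w1, w2)
x_tilde = np.concatenate(([1.0], x))    # (1, x1, x2)
y_dummy = w_tilde @ x_tilde

assert np.isclose(y_explicit, y_dummy)  # the two forms agree
```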
8
Convexity and Linear Separability
A set of points C is convex if for any two points x, y in C and any fraction 0 ≤ α ≤ 1, the point αx+(1-α)y is also in C. If two classes are linearly separable, their convex hulls are disjoint. Separating Hyperplane Theorem: if two disjoint sets (classes) are each convex, there exists a linear separator for them.
9
Strengths of Linear Classifiers
Efficient to learn. Interpretable: in many applications, the “effects” are most important, i.e., which features receive the biggest weights. Can quantify predictive uncertainty: derive confidence bounds on the accuracy of predictions. In machine learning, interpretability and quantifying uncertainty often go together, but trade off against accuracy. Science manages to combine all three.
10
Gradient Descent Learning
11
Learning a Linear Classifier
Most neural net learning follows a decision-theoretic (Bayesian) approach. For a given data set D, define an error function E(w, D) that measures how well the weights w fit the data D. Find the w that minimizes the error for the given data set D. In neural net learning, the basic minimization algorithm is gradient descent. (One can also use a frequentist approach, and minimization methods other than gradient descent.)
12
Gradient Descent: Choosing a direction.
Intuition: think about trying to find a particular street number on a block. You stop and see that you are at number 100. Which direction should you go, left or right? Initially you might check where you are every 50 houses or so; what happens when you get closer to the goal, say number 1000? The fly and the window: the fly sees that the wall is darker, so the light gradient goes down in that direction: a bad direction to move. Real life: a bad relationship or job can be a local maximum. Difference with learning: we don’t have a stopping criterion, so we don’t know when we’ve reached a global minimum.
13
Gradient Descent In Multiple Dimensions
(This should have been covered in Math 150.) Active Math Applet [link].
14
Gradient Descent Scheme
Initialize the weight vector somehow (typically randomly). Update w ← w − η ∇E(w). Repeat until some convergence criterion is true.
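A minimal sketch of this scheme in Python, using a toy quadratic error whose gradient is available in closed form (the error function, learning rate, and convergence threshold are assumptions for illustration):

```python
import numpy as np

def error(w):
    # toy error: E(w) = ||w - w_star||^2 for some fixed target point w_star
    return np.sum((w - w_star) ** 2)

def gradient(w):
    return 2 * (w - w_star)

w_star = np.array([3.0, -1.0])
w = np.random.randn(2)            # initialize weights randomly
eta = 0.1                         # learning rate (step size)

for step in range(1000):
    g = gradient(w)
    w = w - eta * g               # move against the gradient
    if np.linalg.norm(g) < 1e-6:  # convergence criterion
        break

print(w)   # close to w_star
```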
15
Perceptron Learning
16
Defining an Error Function
General idea: encode the class label using a real number t, e.g., “positive” = 1 and “negative” = 0 or “negative” = -1. This is the first example we see of an embedding. Measure error or loss by comparing the continuous linear output y with the class label code t. The obvious loss function is 0-1 loss: 0 if the prediction is correct, 1 otherwise. In practice, use convex upper bounds on the 0-1 loss: loss(y,t) ≥ 0 if the prediction is correct, loss(y,t) ≥ 1 if the prediction is false.
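As one concrete example of such a bound (the hinge loss, used by support vector machines, is chosen here for illustration; the slides themselves do not fix a particular convex upper bound):

```python
import numpy as np

def zero_one_loss(y, t):
    # 0 if the sign of y matches the label t in {-1, +1}, else 1
    return 0.0 if y * t > 0 else 1.0

def hinge_loss(y, t):
    # convex upper bound on the 0-1 loss:
    # 0 when the prediction is correct with margin >= 1, at least 1 when it is wrong
    return max(0.0, 1.0 - y * t)

for y, t in [(2.0, 1), (0.3, 1), (-0.5, 1), (1.2, -1)]:
    assert hinge_loss(y, t) >= zero_one_loss(y, t)
    print(y, t, zero_one_loss(y, t), hinge_loss(y, t))
```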
17
The Error Function for linear discriminants
We could use squared error as in linear regression, but this has various problems, basically because 1 and -1 are not real target values. A different criterion was developed for learning perceptrons. Perceptrons are a precursor to neural nets; Rosenblatt built an analog implementation in the 1950s (see Figure 4.8).
18
The Perceptron Criterion
An example is misclassified if w^T x_n t_n ≤ 0. (Take a moment to verify this.) Perceptron error: E_P(w) = −Σ_{n∈M} w^T x_n t_n, where M is the set of misclassified inputs, the mistakes. Exercise: find the gradient of the error function with respect to a single input x_n. Solution: 0 if x_n is correctly classified, −x_n t_n otherwise (input vector times target scalar). Proof: fix a single weight w_j and multiply t_n into the dot product. If the output is exactly 0, the algorithm fails; or assume this does not happen.
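The criterion and the gradient from the exercise, written out in standard notation (this reconstruction follows the usual textbook form of the perceptron criterion):

```latex
% Perceptron criterion over the set M of misclassified examples
E_P(\mathbf{w}) \;=\; -\sum_{n \in M} \mathbf{w}^{\top}\mathbf{x}_n \, t_n

% Contribution of a single example x_n to the gradient:
%   0            if x_n is correctly classified,
%   -x_n t_n     if x_n is misclassified,
% so the full gradient is
\nabla_{\mathbf{w}} E_P(\mathbf{w}) \;=\; -\sum_{n \in M} \mathbf{x}_n \, t_n
```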
19
Perceptron Learning Algorithm
Use stochastic gradient descent: gradient descent for one example at a time, cycling through the data. Update equation: for a misclassified example, w ← w + η x_n t_n, where we set η = 1 (without loss of generality in this case). Excel demo. Legend: the arrow shows the negated gradient, indicating the direction that produces the steepest descent along the error surface.
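A minimal sketch of this stochastic update in Python, on hypothetical linearly separable data (the data, the number of passes, and η = 1 are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linearly separable 2D data with labels in {-1, +1}.
X = rng.normal(size=(100, 2))
t = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)

# Dummy-bias trick: prepend a constant 1 feature.
X = np.hstack([np.ones((X.shape[0], 1)), X])

w = np.zeros(X.shape[1])
eta = 1.0                                   # step size (WLOG for the perceptron)

for epoch in range(100):                    # cycle through the examples
    mistakes = 0
    for x_n, t_n in zip(X, t):
        if (w @ x_n) * t_n <= 0:            # misclassified (or on the boundary)
            w = w + eta * x_n * t_n         # perceptron update
            mistakes += 1
    if mistakes == 0:                       # converged: all examples classified correctly
        break

print(w, mistakes)
```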
20
Perceptron Demo. Weight vector = black; it points in the direction of the red class. Add the misclassified feature vector to the weight vector to get the new weight vector.
21
Perceptron Learning Analysis
Theorem: if the classes are linearly separable, the perceptron learning algorithm converges to a weight vector that separates them. Convergence can be slow and is sensitive to initialization.
22
Nonseparability Linear discriminants can solve a problem only if the classes can be separated by a line (hyperplane). The canonical example of a non-separable problem is XOR (using 1 for true, 0 for false). The perceptron typically does not converge on such problems.
23
Nonseparability: real world example
[Figure (Russell and Norvig, Figure 18.15(b)): more actual data points added; the classes are no longer linearly separable. White = earthquake, black = nuclear explosion; x1 = surface wave magnitude, x2 = body wave magnitude.]
24
Responses to Nonseparability
When the classes cannot be separated by a linear discriminant, the main responses are: use a non-linear activation function; find an approximate solution that separates the classes not completely but “well”; add hidden features. Methods along these lines: Fisher discriminant (not covered), logistic regression, neural network, support vector machine.
25
Logistic Regression
26
From Values to Probabilities
Key idea: instead of predicting a class label, predict the probability of a class label. E.g., p+ = P(class is positive | features), p- = P(class is negative | features). This is naturally a continuous quantity. How do we turn a real number y into a probability p+?
27
The Logistic Sigmoid Function
Definition: σ(y) = 1 / (1 + exp(-y)). It squeezes the real line into [0,1]. Differentiable: dσ/dy = σ(y)(1 - σ(y)) (nice exercise).
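The definition and the “nice exercise” written out (standard properties of the logistic sigmoid):

```latex
\sigma(y) = \frac{1}{1 + e^{-y}}, \qquad
\frac{d\sigma}{dy}
  = \frac{e^{-y}}{(1 + e^{-y})^{2}}
  = \frac{1}{1 + e^{-y}} \cdot \frac{e^{-y}}{1 + e^{-y}}
  = \sigma(y)\,\bigl(1 - \sigma(y)\bigr)
```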
28
Soft threshold interpretation
If y > 0, σ(y) goes to 1 very quickly; if y < 0, σ(y) goes to 0 very quickly. Figure: Russell and Norvig 18.17.
29
Probabilistic Interpretation
The sigmoid can be interpreted in terms of the class odds p+/(1-p+). Exercise: show that if p+ = σ(y), then p+/p- = exp(y), and therefore y is the log class odds. Solution: p- = exp(-y) / (1 + exp(-y)). Therefore p+/p- = 1/exp(-y) = exp(y). So the net input y is the logarithm of the class odds, i.e., the difference of the log class probabilities.
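The same derivation in one chain (standard algebra on the sigmoid):

```latex
p_{+} = \sigma(y) = \frac{1}{1 + e^{-y}}, \qquad
p_{-} = 1 - p_{+} = \frac{e^{-y}}{1 + e^{-y}}
\;\;\Longrightarrow\;\;
\frac{p_{+}}{p_{-}} = \frac{1}{e^{-y}} = e^{y}
\;\;\Longrightarrow\;\;
y = \ln \frac{p_{+}}{1 - p_{+}}
```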
30
Logistic Regression In logistic regression, the log class odds are a linear function of the input features: ln(p+/(1-p+)) = w^T x. Recall that we obtained the same kind of expression for the naive Bayes classifier. Learning logistic regression is conceptually similar to linear regression.
31
Logistic Regression: Maximum Likelihood
Notation: the probability that the n-th input example is positive is y_n = σ(w^T x_n), which depends on the weight vector w. A positive example has t_n = 1, a negative one t_n = 0. Then the likelihood assigned to N independent training examples is p(t|w) = Π_n y_n^{t_n} (1 - y_n)^{1-t_n}, and its negative logarithm is the cross-entropy error E(w) = −Σ_n [t_n ln y_n + (1 - t_n) ln(1 - y_n)]. Minimizing it is equivalent to minimizing the KL divergence between the predicted class probabilities and the observed class frequencies. (The notation is changed slightly from the book.) This turns out to be a good objective function with an easy gradient; there is some freedom in choosing the objective function. Cross-entropy can also be seen as doing data compression (the story of classifying spam via compression).
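The likelihood and the cross-entropy error written out; maximizing the likelihood is the same as minimizing the cross-entropy (a standard maximum-likelihood step, stated here in the notation above):

```latex
p(\mathbf{t} \mid \mathbf{w}) = \prod_{n=1}^{N} y_n^{\,t_n} (1 - y_n)^{1 - t_n},
\qquad y_n = \sigma(\mathbf{w}^{\top}\mathbf{x}_n)

E(\mathbf{w}) = -\ln p(\mathbf{t} \mid \mathbf{w})
 = -\sum_{n=1}^{N} \bigl[\, t_n \ln y_n + (1 - t_n) \ln (1 - y_n) \,\bigr]
```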
32
Gradient Search Exercise (on assignment): using the cross-entropy error, show that ∇E(w) = Σ_n (y_n - t_n) x_n. Hint: recall that dσ/dy = σ(y)(1 - σ(y)). There is no closed-form minimum, since y_n = σ(w^T x_n) is a non-linear function of the input features. Can use gradient descent. A better approach: use Iterative Reweighted Least Squares (IRLS); see the assignment. Show the example from Russell and Norvig (Figure 18).
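A minimal gradient-descent sketch that uses the gradient from the exercise (the data, learning rate, and stopping rule are assumptions for illustration; IRLS, the better approach mentioned above, is not shown):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)

# Hypothetical data: 2 features plus a dummy bias feature, labels t in {0, 1}.
X = rng.normal(size=(200, 2))
t = (X[:, 0] - X[:, 1] + 0.3 * rng.normal(size=200) > 0).astype(float)
X = np.hstack([np.ones((200, 1)), X])

w = np.zeros(3)
eta = 0.1

for step in range(2000):
    y = sigmoid(X @ w)                 # predicted P(t_n = 1 | x_n, w)
    grad = X.T @ (y - t)               # gradient of the cross-entropy error
    w = w - eta * grad / len(t)        # averaged gradient for a stable step size
    if np.linalg.norm(grad) / len(t) < 1e-4:
        break

print(w)
```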
33
Example logistic regression model learned on non-separable data
Vertical axis = probability of earthquake. Figure: Russell and Norvig 18.17.
34
Logistic Regression With Basis Functions
Legend: Left: data points in 2D; red and blue are class labels (ignore). Two Gaussian basis functions: their centers are shown as crosses and their contours as green circles. Right: each point is now mapped to another pair of numbers (roughly, its closeness to φ1 and to φ2). On the right, the classes are linearly separable; see the black line. The black line on the right corresponds to the black circle on the left. Intuitive example: think of the Gaussian centers as indicating parts of a picture, or parts of a body. Figure: Bishop 4.12.
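A minimal sketch of the feature map the figure describes: two hypothetical Gaussian basis functions turn each 2D point into a new feature pair (φ1(x), φ2(x)); the centers and width below are made up for illustration:

```python
import numpy as np

def gaussian_basis(X, center, s=0.5):
    # phi(x) = exp(-||x - center||^2 / (2 s^2))
    d2 = np.sum((X - center) ** 2, axis=1)
    return np.exp(-d2 / (2 * s ** 2))

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(100, 2))   # data points in the original 2D space

mu1 = np.array([-0.5, 0.0])             # hypothetical basis-function centers
mu2 = np.array([0.5, 0.0])

# Each point is mapped to (phi1(x), phi2(x)); logistic regression (or a perceptron)
# can then look for a linear separator in this new feature space.
Phi = np.column_stack([gaussian_basis(X, mu1), gaussian_basis(X, mu2)])
print(Phi[:5])
```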
35
Multi-Class Example Logistic regression can be extended to multiple classes. Here’s a picture of what decision boundaries can look like.