INTRODUCTION TO ARTIFICIAL INTELLIGENCE Massimo Poesio LECTURE: Support Vector Machines
Recall: A SPATIAL WAY OF LOOKING AT LEARNING Learning a function can also be viewed as learning how to discriminate between different types of objects in a space
A SPATIAL VIEW OF LEARNING (figure: documents plotted as points in space, labeled SPAM vs NON-SPAM)
Vector Space Representation Each document is a vector, one component for each term (= word). Normalize to unit length. Properties of the vector space: – terms are axes – n docs live in this space – even with stemming, may have 10,000+ dimensions, or even 1,000,000+
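As a minimal sketch of this representation (not from the original slides; the toy vocabulary, whitespace tokenization, and raw term frequencies are simplifying assumptions):

```python
from collections import Counter
import math

def doc_to_unit_vector(text, vocabulary):
    """Map a document to a term-frequency vector over a fixed vocabulary,
    then normalize it to unit (Euclidean) length."""
    counts = Counter(text.lower().split())
    vec = [float(counts[term]) for term in vocabulary]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec

vocab = ["free", "money", "meeting", "project"]   # toy vocabulary
print(doc_to_unit_vector("free money free offer", vocab))
```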
A SPATIAL VIEW OF LEARNING The task of the learner is to learn a function that divides the space of examples into the black region and the red region (the two classes)
A SPATIAL VIEW OF LEARNING
A MORE DIFFICULT EXAMPLE
ONE SOLUTION
ANOTHER SOLUTION
Multi-class problems (figure: documents grouped into three classes: Government, Science, Arts)
Support Vector Machines This lecture is an overview of: – Linear SVMs (separable problems) – Linear SVMs (non-separable problems) – Kernels
Separation by Hyperplanes Assume linear separability for now: – in 2 dimensions, we can separate by a line – in higher dimensions, we need hyperplanes A separating hyperplane can be found by linear programming or by simple iterative algorithms such as the perceptron: – in 2 dimensions the separator can be expressed as ax + by = c
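A minimal sketch of finding such a separator with the perceptron (my own toy data, using the 2D notation ax + by = c above; an illustration, not the lecture's implementation):

```python
# Minimal 2D perceptron: learns weights (a, b) and a bias so that
# sign(a*x + b*y + bias) matches the labels. Assumes separable toy data.
data = [((2.0, 3.0), +1), ((3.0, 4.0), +1), ((-1.0, -2.0), -1), ((-2.0, -1.0), -1)]

a = b = bias = 0.0
for _ in range(100):                      # epochs
    errors = 0
    for (x, y), label in data:
        pred = 1 if a * x + b * y + bias > 0 else -1
        if pred != label:                 # mistake-driven update
            a += label * x
            b += label * y
            bias += label
            errors += 1
    if errors == 0:                       # all points on the correct side
        break

print(f"separator: {a:.1f}*x + {b:.1f}*y = {-bias:.1f}")
```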
Linear separability (figure: one data set that is not linearly separable, one that is linearly separable)
Linear Classifiers A linear classifier is f(x, w, b) = sign(w · x + b): the hyperplane w · x + b = 0 separates the region w · x + b > 0 (predicted +1) from the region w · x + b < 0 (predicted -1). How would you classify this data?
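A one-line version of this decision rule, as a hedged sketch (the weights and the input point are arbitrary illustrative values):

```python
def linear_classify(x, w, b):
    """Return +1 or -1 according to sign(w · x + b)."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else -1

print(linear_classify([2.0, 3.0], w=[0.5, -0.25], b=0.1))   # -> 1
```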
Linear Classifiers f(x, w, b) = sign(w · x + b) Any of these separators would be fine... but which is best?
Linear Classifiers f(x, w, b) = sign(w · x + b) How would you classify this data? (figure: a separator that misclassifies a point into the +1 class)
Linear Classifiers: summary Many common text classifiers are linear classifiers. Despite this similarity, there are large performance differences. – For separable problems, there are infinitely many separating hyperplanes. Which one do you choose? – What do you do for non-separable problems?
Which Hyperplane? In general, there are many possible solutions for a, b, c. The Support Vector Machine (SVM) finds an optimal solution.
Maximum Margin f(x, w, b) = sign(w · x + b) The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM). The support vectors are the data points that the margin pushes up against. 1. Maximizing the margin is good according to intuition and PAC theory. 2. It implies that only the support vectors are important; the other training examples are ignorable. 3. Empirically it works very, very well.
Support Vector Machine (SVM) SVMs maximize the margin around the separating hyperplane. The decision function is fully specified by a subset of the training samples, the support vectors. Training is a quadratic programming problem. SVMs have been the text classification method du jour. (figure: separating hyperplane with maximized margin and support vectors marked)
Maximum Margin: Formalization w: hyperplane normal; x_i: data point i; y_i: class of data point i (+1 or -1). Constrained optimization formalization: (1) w · x_i + b ≥ +1 for y_i = +1 (2) w · x_i + b ≤ -1 for y_i = -1 maximize the margin: 2/||w||
Quadratic Programming One can show that the hyperplane w with maximum margin is w = Σ_i alpha_i y_i x_i, where the alpha_i are Lagrange multipliers, x_i is data point i, and y_i is the class of data point i (+1 or -1). The alpha_i are the solution to maximizing Σ_i alpha_i - 1/2 Σ_i Σ_j alpha_i alpha_j y_i y_j (x_i · x_j) subject to alpha_i ≥ 0 and Σ_i alpha_i y_i = 0. Most alpha_i will be zero.
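As a hedged illustration of this quadratic program in practice (using scikit-learn's SVC as one off-the-shelf solver; the toy data and the large C used to approximate the hard-margin case are my own assumptions):

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2D data; a large C approximates the hard-margin (separable) case.
X = np.array([[2.0, 3.0], [3.0, 4.0], [3.0, 2.5],
              [-1.0, -2.0], [-2.0, -1.0], [-1.5, -1.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("support vectors:\n", clf.support_vectors_)   # only a few training points
print("alpha_i * y_i:", clf.dual_coef_)             # the nonzero multipliers (signed)
print("w:", clf.coef_, "b:", clf.intercept_)        # w = sum_i alpha_i y_i x_i
```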
Not Linearly Separable Find a line that penalizes points on “the wrong side”.
Soft-Margin SVMs Define a signed distance for each point with respect to the separator ax + by = c: – (ax + by) - c for red points – c - (ax + by) for green points This value is negative for points on the wrong side.
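A minimal sketch of the soft-margin penalty in its usual slack form, ξ_i = max(0, 1 - y_i (w · x_i + b)) (this uses the w, b notation of the earlier slides rather than ax + by = c; the example weights are my own):

```python
def slack(x, label, w, b):
    """Soft-margin slack xi = max(0, 1 - y * (w·x + b)).
    Zero for points comfortably on the correct side, positive otherwise."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return max(0.0, 1.0 - label * score)

w, b = [1.0, 1.0], -3.0
print(slack([3.0, 2.0], +1, w, b))   # 0.0  (correct side, outside the margin)
print(slack([1.5, 1.0], +1, w, b))   # 1.5  (wrong side: penalized)
```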
Solve the Quadratic Program The solution gives the “separator” between the two classes: a choice of a, b. Given a new point (x, y), we can score its proximity to each class: – evaluate ax + by – set a confidence threshold
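A hedged sketch of that scoring step (the coefficients and the confidence threshold are illustrative assumptions):

```python
def score(point, a, b):
    """Decision value a*x + b*y for a new point (x, y)."""
    x, y = point
    return a * x + b * y

def classify_with_threshold(point, a, b, threshold=5.0):
    """Assign a class only when the score clears the confidence threshold."""
    s = score(point, a, b)
    if s > threshold:
        return "+1"
    if s < -threshold:
        return "-1"
    return "uncertain"

print(classify_with_threshold((4.0, 3.0), a=1.0, b=2.0))   # score 10 -> "+1"
print(classify_with_threshold((1.0, 1.0), a=1.0, b=2.0))   # score 3  -> "uncertain"
```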
Predicting Generalization for SVMs We want the classifier with the best generalization (best accuracy on new data). What are clues for good generalization? – Large training set – Low error on training set – Low capacity/variance (≈ model with few parameters) SVMs give you an explicit bound based on these.
Capacity/Variance: VC Dimension Theoretical risk bound: R ≤ R_emp + sqrt( (h (ln(2l/h) + 1) - ln(η/4)) / l ), where R_emp is the empirical risk, l is the number of observations, and h is the VC dimension; the bound holds with probability (1 - η). The VC dimension (capacity) is the maximum number of points that can be shattered. A set can be shattered if the classifier can learn every possible labeling of it.
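As a hedged, brute-force illustration of shattering (my own toy configurations; scikit-learn's SVC with a large C is used as an approximate hard-margin linear separator, so this is a sketch rather than an exact separability test):

```python
from itertools import product
import numpy as np
from sklearn.svm import SVC

def shattered(points):
    """Check whether a linear classifier can realize every +1/-1 labeling."""
    X = np.array(points, dtype=float)
    for labels in product([-1, 1], repeat=len(points)):
        if len(set(labels)) == 1:
            continue                          # one-class labelings are trivially realizable
        y = np.array(labels)
        clf = SVC(kernel="linear", C=1e6).fit(X, y)
        if clf.score(X, y) < 1.0:             # this labeling cannot be separated
            return False
    return True

print(shattered([(0, 0), (1, 0), (0, 1)]))           # 3 points in general position: True
print(shattered([(0, 0), (1, 0), (0, 1), (1, 1)]))   # 4 square corners: False (XOR labeling)
```

This matches the fact that lines in the plane have VC dimension 3.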
Non-linear SVMs: Feature spaces General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x)
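A hedged toy example of such a mapping Φ (my own: points on a line whose classes alternate are not separable by any threshold on x, but become linearly separable after φ(x) = (x, x²)):

```python
# Points in 1D: the negatives sit between the positives, so no threshold on x
# separates them. After phi(x) = (x, x^2) a line in 2D does.
xs     = [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0]
labels = [  +1,   +1,   -1,  -1,  +1,  +1]

def phi(x):
    return (x, x * x)

# In the feature space, the rule "x^2 > 2.5" is a linear separator (w = (0, 1), b = -2.5).
for x, label in zip(xs, labels):
    _, x2 = phi(x)
    pred = 1 if x2 > 2.5 else -1
    print(x, label, pred)
```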
Kernels Recall: we are maximizing the dual objective Σ_i alpha_i - 1/2 Σ_i Σ_j alpha_i alpha_j y_i y_j (x_i · x_j). Observation: the data only occur in dot products. We can therefore map the data into a very high dimensional space (even infinite!) as long as the kernel is computable: for a mapping function Φ, compute the kernel K(i, j) = Φ(x_i) · Φ(x_j).
The Kernel Trick The linear classifier relies on the dot product between vectors: K(x_i, x_j) = x_i^T x_j. If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes K(x_i, x_j) = φ(x_i)^T φ(x_j). A kernel function is a function that corresponds to an inner product in some expanded feature space. We don't have to compute φ(x) explicitly: K(x_i, x_j) is enough for SVM learning.
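A hedged numerical check of this identity for one classic case (my own example: the quadratic kernel K(x, z) = (x · z)² in 2D corresponds to the explicit map φ(x) = (x₁², √2·x₁x₂, x₂²)):

```python
import math

def poly2_kernel(x, z):
    """K(x, z) = (x · z)^2, computed without any explicit feature map."""
    dot = x[0] * z[0] + x[1] * z[1]
    return dot ** 2

def phi(x):
    """Explicit feature map for the same kernel: 2D -> 3D."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

x, z = (1.0, 2.0), (3.0, -1.0)
implicit = poly2_kernel(x, z)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
print(implicit, explicit)   # both 1.0: (1*3 + 2*(-1))^2 = 1
```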
What Functions are Kernels? For some functions K(x_i, x_j), checking that K(x_i, x_j) = φ(x_i)^T φ(x_j) can be cumbersome. Mercer's theorem: every positive semi-definite symmetric function is a kernel. Positive semi-definite symmetric functions correspond to a positive semi-definite symmetric Gram matrix:
K = [ K(x_1, x_1)  K(x_1, x_2)  K(x_1, x_3)  …  K(x_1, x_N)
      K(x_2, x_1)  K(x_2, x_2)  K(x_2, x_3)  …  K(x_2, x_N)
      …
      K(x_N, x_1)  K(x_N, x_2)  K(x_N, x_3)  …  K(x_N, x_N) ]
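A minimal sketch of checking this condition numerically (my own example: build the Gram matrix of an RBF kernel on a few points and confirm that its eigenvalues are non-negative):

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    """K(x, z) = exp(-gamma * ||x - z||^2)."""
    diff = np.asarray(x) - np.asarray(z)
    return np.exp(-gamma * np.dot(diff, diff))

points = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5), (-1.0, 2.0)]
K = np.array([[rbf_kernel(p, q) for q in points] for p in points])

eigenvalues = np.linalg.eigvalsh(K)   # symmetric matrix -> real eigenvalues
print(eigenvalues)                    # all >= 0 (up to rounding): K is PSD
```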
Kernels Why use kernels? – Make a non-separable problem separable – Map data into a better representational space Common kernels: – Linear – Polynomial – Radial basis function (RBF)
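As a hedged illustration of the first point (my own toy XOR data, using scikit-learn; the C value is an arbitrary choice): a linear kernel cannot fit the XOR pattern, while an RBF kernel can.

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style data: no single line separates the two classes.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], dtype=float)
y = np.array([+1, +1, -1, -1])

for kernel in ["linear", "rbf"]:
    clf = SVC(kernel=kernel, C=10.0).fit(X, y)
    print(kernel, "training accuracy:", clf.score(X, y))
# The linear kernel stays below accuracy 1.0 on XOR; the RBF kernel reaches it.
```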
Performance of SVMs SVMs are seen as the best-performing method by many. The statistical significance of most results is not clear. There are many methods that perform about as well as SVMs, for example regularized regression (Zhang & Oles). An example of a comparison study: Yang & Liu.
Yang & Liu: SVM vs. Other Methods
Yang & Liu: Statistical Significance
Yang & Liu: Small Classes
SVM: Summary SVMs have optimal or close to optimal performance. Kernels are an elegant and efficient way to map data into a better representation. SVMs can be expensive to train (quadratic programming). If efficient training is important and slightly suboptimal performance is acceptable, SVMs may not be the best choice. For text, a linear kernel is common, so most SVMs are linear classifiers (like many others), but they find a (close to) optimal separating hyperplane.
SVM: Summary (cont.) – Model parameters are based on a small subset of the training data (the support vectors) – Based on structural risk minimization – Supports kernels
Resources – Manning and Schuetze, Foundations of Statistical Natural Language Processing, Chapter 16, MIT Press. – Trevor Hastie, Robert Tibshirani and Jerome Friedman, "Elements of Statistical Learning: Data Mining, Inference and Prediction", Springer-Verlag, New York. – Christopher J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition" (1998). – ML lectures at DISI
THANKS I used material from – Mingyue Tan's course at UBC – Chris Manning's course at Stanford