Kernel Technique Based on Mercer’s Condition (1909)

Kernel Technique Based on Mercer's Condition (1909)
The value of a kernel function represents the inner product of two training points in feature space. Kernel functions merge two steps: 1. map the input data from input space to feature space (which might be infinite-dimensional); 2. compute the inner product in the feature space.
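
As an illustration (not from the original slides), here is a minimal Python/NumPy sketch of this idea for the degree-2 polynomial kernel K(x, z) = (x'z + 1)^2 on R^2: computing the kernel directly in input space gives the same number as mapping both points into a hand-written feature space and taking the inner product there.

    import numpy as np

    def poly2_kernel(x, z):
        # K(x, z) = (x.z + 1)^2 -- computed directly in input space
        return (np.dot(x, z) + 1.0) ** 2

    def phi(x):
        # explicit feature map for the degree-2 polynomial kernel on R^2
        x1, x2 = x
        return np.array([x1 * x1, x2 * x2,
                         np.sqrt(2) * x1 * x2,
                         np.sqrt(2) * x1, np.sqrt(2) * x2,
                         1.0])

    x = np.array([1.0, 2.0])
    z = np.array([3.0, -1.0])
    # both routes give the same value: the kernel is the inner product in feature space
    print(poly2_kernel(x, z))        # 4.0, since ((1*3 + 2*(-1)) + 1)^2 = 2^2 = 4
    print(np.dot(phi(x), phi(z)))    # 4.0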

More Examples of Kernels
Polynomial kernel: K(x, z) = (x'z + b)^d, where the degree d is a positive integer (taking d = 1 and b = 0 gives the linear kernel). Gaussian (radial basis) kernel: K(x, z) = exp(-mu*||x - z||^2). The (i, j)-entry of the resulting kernel (Gram) matrix represents the "similarity" of data points x_i and x_j.
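
A small NumPy sketch of the Gaussian kernel (Gram) matrix; the width parameter mu below is an illustrative choice, not a value from the lecture.

    import numpy as np

    def gaussian_gram(A, mu=0.1):
        # kernel matrix K with K[i, j] = exp(-mu * ||A_i - A_j||^2);
        # each entry measures the "similarity" of data points A_i and A_j
        sq = np.sum(A**2, axis=1)
        sq_dists = sq[:, None] + sq[None, :] - 2.0 * (A @ A.T)
        return np.exp(-mu * sq_dists)

    A = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
    print(gaussian_gram(A))   # diagonal entries are 1; nearby points score close to 1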

Nonlinear 1-Norm Soft Margin SVM in Dual Form
Linear SVM: maximize sum_i alpha_i - (1/2) sum_{i,j} alpha_i alpha_j y_i y_j x_i'x_j subject to 0 <= alpha_i <= C and sum_i alpha_i y_i = 0. Nonlinear SVM: the same dual program with the inner product x_i'x_j replaced by a kernel value K(x_i, x_j); this is the only change needed.

1-norm Support Vector Machines: Good for Feature Selection
Solve, for some nu > 0: min over (w, gamma, xi) of nu*e'xi + ||w||_1 subject to D(Aw - e*gamma) + xi >= e and xi >= 0, where the diagonal matrix D records the +1 or -1 class membership of each training point. Because the 1-norm of w can be linearized with auxiliary variables, this is equivalent to solving a linear program. The 1-norm penalty on w drives many of its components to exactly zero, which is what makes this formulation useful for feature selection.
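
A hedged sketch of the equivalent linear program using SciPy's linprog. The variable ordering and the auxiliary variables s_j used to linearize ||w||_1 (via -s_j <= w_j <= s_j) are my own encoding, not the slides' notation.

    import numpy as np
    from scipy.optimize import linprog

    def one_norm_svm(A, d, nu=1.0):
        # 1-norm SVM as an LP: min nu*e'xi + ||w||_1
        #   s.t. D(Aw - e*gamma) + xi >= e, xi >= 0
        m, n = A.shape
        D = np.diag(d)                     # d holds the +1/-1 class labels
        e = np.ones(m)
        # variable order: [w (n), gamma (1), s (n), xi (m)]
        c = np.concatenate([np.zeros(n), [0.0], np.ones(n), nu * np.ones(m)])
        # margin constraints rewritten as: -D A w + d*gamma - xi <= -e
        A1 = np.hstack([-D @ A, d.reshape(-1, 1), np.zeros((m, n)), -np.eye(m)])
        # |w_j| <= s_j rewritten as: w - s <= 0 and -w - s <= 0
        A2 = np.hstack([ np.eye(n), np.zeros((n, 1)), -np.eye(n), np.zeros((n, m))])
        A3 = np.hstack([-np.eye(n), np.zeros((n, 1)), -np.eye(n), np.zeros((n, m))])
        A_ub = np.vstack([A1, A2, A3])
        b_ub = np.concatenate([-e, np.zeros(2 * n)])
        bounds = [(None, None)] * (n + 1) + [(0, None)] * (n + m)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        return res.x[:n], res.x[n]         # w, gamma

    A = np.array([[2.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [-2.0, -1.0]])
    d = np.array([1.0, 1.0, -1.0, -1.0])
    print(one_norm_svm(A, d))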

SVM as an Unconstrained Minimization Problem
At the solution of the constrained problem (QP), the slack variables satisfy xi = (e - D(Aw - e*gamma))_+, where (.)_+ denotes the plus function (componentwise maximum with zero). Hence (QP) is equivalent to the nonsmooth SVM: min over (w, gamma) of (nu/2)*||(e - D(Aw - e*gamma))_+||^2 + (1/2)*(w'w + gamma^2). This changes (QP) into an unconstrained minimization problem and reduces the (n+1+m) variables to (n+1) variables.

Smooth the Plus Function: Integrate
Step function: the (generalized) derivative of the plus function x_+ = max(x, 0). Sigmoid function: s(x, alpha) = 1/(1 + exp(-alpha*x)), a smooth approximation of the step function. p-function: integrating the sigmoid gives p(x, alpha) = x + (1/alpha)*log(1 + exp(-alpha*x)), a smooth approximation of the plus function. Plus function: x_+ = max(x, 0), recovered from p(x, alpha) as alpha goes to infinity.
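
A short sketch of the three functions; the sigmoid and its integral (the p-function) are the standard SSVM choices, and the alpha values below are illustrative.

    import numpy as np

    def plus(x):
        # plus function x_+ = max(x, 0), nonsmooth at 0
        return np.maximum(x, 0.0)

    def sigmoid(x, alpha):
        # smooth approximation of the step function
        return 1.0 / (1.0 + np.exp(-alpha * x))

    def p_func(x, alpha):
        # integral of the sigmoid: smooth approximation of the plus function
        return x + (1.0 / alpha) * np.log(1.0 + np.exp(-alpha * x))

    x = np.linspace(-2.0, 2.0, 5)
    for alpha in (1.0, 5.0, 25.0):
        # the gap between p(., alpha) and the plus function shrinks as alpha grows
        print(alpha, np.max(np.abs(p_func(x, alpha) - plus(x))))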

SSVM: Smooth Support Vector Machine
Replacing the plus function in the nonsmooth SVM by the smooth p-function p(x, alpha) gives our SSVM: min over (w, gamma) of (nu/2)*||p(e - D(Aw - e*gamma), alpha)||^2 + (1/2)*(w'w + gamma^2). The solution of SSVM converges to the solution of the nonsmooth SVM as alpha goes to infinity. (Typically, a fixed value of alpha is used.)
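
A minimal sketch of the smoothed objective, assuming the formulation written above (A is the m x n data matrix and d the vector of +1/-1 labels); the default nu and alpha are placeholders, not values from the lecture.

    import numpy as np

    def ssvm_objective(w, gamma, A, d, nu=1.0, alpha=5.0):
        # (nu/2)*||p(e - D(Aw - e*gamma), alpha)||^2 + (1/2)*(w'w + gamma^2),
        # with the p-function written out inline
        r = np.ones(A.shape[0]) - d * (A @ w - gamma)
        p = r + (1.0 / alpha) * np.log(1.0 + np.exp(-alpha * r))
        return 0.5 * nu * (p @ p) + 0.5 * (w @ w + gamma ** 2)

    A = np.array([[2.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [-2.0, -1.0]])
    d = np.array([1.0, 1.0, -1.0, -1.0])
    print(ssvm_objective(np.array([0.5, 0.5]), 0.0, A, d))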

Newton-Armijo Method: Quadratic Approximation of SSVM
The sequence of iterates generated by solving a quadratic approximation of SSVM converges to the unique solution of SSVM at a quadratic rate. In practice it converges in 6 to 8 iterations. At each iteration we solve a linear system of n+1 equations in n+1 variables, so the complexity depends on the dimension of the input space. A stepsize might need to be selected.

Newton-Armijo Algorithm
Start with any initial point (w^0, gamma^0). Having (w^i, gamma^i), stop if the gradient of the SSVM objective is zero; otherwise compute the next iterate by: (i) Newton direction: solve the linear system defined by the Hessian for the direction d^i (this iteration converges globally and quadratically to the unique solution in a finite number of steps); (ii) Armijo stepsize: choose a stepsize lambda_i (e.g., by halving, lambda_i in {1, 1/2, 1/4, ...}) such that Armijo's rule of sufficient decrease is satisfied, and set (w^{i+1}, gamma^{i+1}) = (w^i, gamma^i) + lambda_i*d^i.
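
A generic Newton iteration with Armijo backtracking, sketched on a toy convex quadratic rather than on the SSVM objective itself; the tolerance and the sufficient-decrease constant delta are illustrative choices.

    import numpy as np

    def newton_armijo(f, grad, hess, x0, tol=1e-8, max_iter=50, delta=1e-4):
        # f, grad, hess evaluate a smooth objective, its gradient and its Hessian
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad(x)
            if np.linalg.norm(g) < tol:            # stop when the gradient vanishes
                break
            d = np.linalg.solve(hess(x), -g)       # (i) Newton direction
            lam = 1.0                              # (ii) Armijo stepsize: halve until
            while f(x + lam * d) > f(x) + delta * lam * g.dot(d):   # sufficient decrease
                lam *= 0.5
            x = x + lam * d
        return x

    # toy usage on a strictly convex quadratic whose minimizer is [1, -1]
    Q = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([-2.0, 1.0])
    f = lambda x: 0.5 * x @ Q @ x + b @ x
    print(newton_armijo(f, lambda x: Q @ x + b, lambda x: Q, np.zeros(2)))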

Nonlinear Smooth SVM
Nonlinear classifier: replace the linear term Aw by a nonlinear kernel term K(A, A')u in the SSVM and minimize over (u, gamma). Use the Newton-Armijo algorithm to solve the problem; each iteration now solves m+1 linear equations in m+1 variables, where m is the number of training points. The nonlinear classifier depends only on the data points with nonzero coefficients u_i.

Conclusion
An overview of SVMs for classification. SSVM is a new formulation of the support vector machine as a smooth unconstrained minimization problem; it can be solved by a fast Newton-Armijo algorithm, and no optimization (LP, QP) package is needed. There are many important issues this lecture did not address, such as: How to solve the conventional SVM? How to select the parameters? How to deal with massive datasets?

Perceptron: Linear Threshold Unit (LTU)
Inputs x_1, ..., x_n, together with a constant input x_0 = 1, are weighted by w_0, w_1, ..., w_n and summed to give sum_{i=0..n} w_i*x_i; the sum is passed through a threshold function g, so that o(x) = 1 if sum_{i=0..n} w_i*x_i > 0 and -1 otherwise.

Possibilities for the function g
Sign function: sign(x) = +1 if x > 0, -1 if x <= 0. Step function: step(x) = 1 if x > threshold, 0 if x <= threshold (in the unit above, threshold = 0). Sigmoid (logistic) function: sigmoid(x) = 1/(1 + e^(-x)). Adding an extra input with activation x_0 = 1 and weight w_0 = -T (called the bias weight) is equivalent to having a threshold at T. This way we can always assume a 0 threshold.
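
The three candidate functions written out as a small sketch:

    import numpy as np

    def sign_g(x):
        # sign function: +1 if x > 0, otherwise -1
        return 1.0 if x > 0 else -1.0

    def step_g(x, threshold=0.0):
        # step function with an explicit threshold
        return 1.0 if x > threshold else 0.0

    def sigmoid_g(x):
        # logistic function: a smooth, differentiable squashing of the activation
        return 1.0 / (1.0 + np.exp(-x))

    for x in (-2.0, 0.0, 2.0):
        print(x, sign_g(x), step_g(x), round(sigmoid_g(x), 3))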

Using a Bias Weight to Standardize the Threshold
With an extra constant input 1 carrying weight -T, the test w1*x1 + w2*x2 < T becomes w1*x1 + w2*x2 - T < 0, so the threshold can always be taken to be zero.

Perceptron Learning Rule
[Figure: a worked 2-D example tracing the perceptron updates. Starting from w = [0.25, -0.1, 0.5] (decision boundary x2 = 0.2*x1 - 0.5), each misclassified training example, e.g. (x, t) = ([2, 1], -1), (x, t) = ([-1, -1], 1), (x, t) = ([1, 1], 1), triggers a weight update and a corresponding shift of the decision boundary in the (x1, x2) plane.]

The Perceptron Algorithm (Rosenblatt, 1956)
Given a linearly separable training set S and a learning rate eta > 0; start from the initial weight vector and bias w_0 = 0, b_0 = 0, and let R = max_i ||x_i||.

The Perceptron Algorithm (Primal Form)
Repeat: for i = 1 to m, if y_i*(<w_k, x_i> + b_k) <= 0 (a mistake on example i), update w_{k+1} = w_k + eta*y_i*x_i, b_{k+1} = b_k + eta*y_i*R^2, and k = k+1; until no mistakes are made within the for loop. Return (w_k, b_k). What is k? (k is the total number of updates, i.e., of mistakes made.)
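
A sketch of the primal-form perceptron as reconstructed above; the bias update using R^2 follows the Cristianini and Shawe-Taylor presentation, and the toy data set is made up.

    import numpy as np

    def perceptron_primal(X, y, eta=0.1, max_epochs=100):
        # X: m x n data matrix, y: vector of +1/-1 labels
        m, n = X.shape
        w = np.zeros(n)
        b = 0.0
        R = np.max(np.linalg.norm(X, axis=1))   # radius of the data, used in the bias update
        k = 0                                    # number of updates (mistakes) made so far
        for _ in range(max_epochs):
            mistakes = 0
            for i in range(m):
                if y[i] * (X[i] @ w + b) <= 0:   # mistake on example i
                    w += eta * y[i] * X[i]
                    b += eta * y[i] * R ** 2
                    k += 1
                    mistakes += 1
            if mistakes == 0:                    # stop after a clean pass over the data
                break
        return w, b, k

    X = np.array([[2.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [-2.0, -1.0]])
    y = np.array([1, 1, -1, -1])
    print(perceptron_primal(X, y))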

The Perceptron Algorithm (Stops in Finitely Many Steps)
Theorem (Novikoff): Let S be a non-trivial training set, and let R = max_i ||x_i||. Suppose that there exists a vector w_opt with ||w_opt|| = 1 and a margin gamma > 0 such that y_i*(<w_opt, x_i> + b_opt) >= gamma for all i. Then the number of mistakes made by the on-line perceptron algorithm on S is at most (2R/gamma)^2.

Proof of Finite Termination
Proof: Let w_hat = (w, b/R) denote the augmented weight vector and x_hat_i = (x_i, R) the augmented training point. The algorithm starts with the augmented weight vector w_hat_0 = 0 and updates it at each mistake. Let w_hat_{k-1} be the augmented weight vector prior to the k-th mistake. The k-th update is performed when y_i*<w_hat_{k-1}, x_hat_i> <= 0, where (x_i, y_i) is the point incorrectly classified by w_hat_{k-1}.

Update Rule of Perceptron
The update w_hat_k = w_hat_{k-1} + eta*y_i*x_hat_i steadily increases the inner product with the optimal augmented weight vector: <w_hat_opt, w_hat_k> >= <w_hat_opt, w_hat_{k-1}> + eta*gamma, so after k mistakes <w_hat_opt, w_hat_k> >= k*eta*gamma. Similarly, the norm grows slowly: ||w_hat_k||^2 <= ||w_hat_{k-1}||^2 + 2*eta^2*R^2, so ||w_hat_k||^2 <= 2*k*eta^2*R^2.

Update Rule of Perceptron (cont.)
Combining the two bounds via the Cauchy-Schwarz inequality, k*eta*gamma <= <w_hat_opt, w_hat_k> <= ||w_hat_opt||*||w_hat_k|| <= ||w_hat_opt||*sqrt(2k)*eta*R, so k <= 2*(R/gamma)^2*||w_hat_opt||^2 <= (2R/gamma)^2, which is the bound in Novikoff's theorem.

The Perceptron Algorithm (Dual Form)
Given a linearly separable training set S, set alpha = 0, b = 0, and R = max_i ||x_i||. Repeat: for i = 1 to m, if y_i*(sum_j alpha_j*y_j*<x_j, x_i> + b) <= 0 then alpha_i = alpha_i + 1 and b = b + y_i*R^2; until no mistakes are made within the for loop. Return (alpha, b).

What Do We Get from the Dual Form of the Perceptron Algorithm?
The number of updates equals sum_i alpha_i. alpha_i > 0 implies that the training point x_i has been misclassified at least once during the training process; alpha_i = 0 implies that removing the training point x_i will not affect the final result. The training data only appear in the algorithm through the entries of the Gram matrix G, defined by G_ij = <x_i, x_j>.
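
A matching sketch of the dual-form perceptron, showing that the data enter only through the Gram matrix; the toy data are the same made-up set as before.

    import numpy as np

    def perceptron_dual(X, y, max_epochs=100):
        m = X.shape[0]
        G = X @ X.T                              # Gram matrix G[i, j] = <x_i, x_j>
        R2 = np.max(np.diag(G))                  # R^2 = max_i ||x_i||^2
        alpha = np.zeros(m)                      # alpha_i counts the mistakes made on x_i
        b = 0.0
        for _ in range(max_epochs):
            mistakes = 0
            for i in range(m):
                if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                    alpha[i] += 1.0
                    b += y[i] * R2
                    mistakes += 1
            if mistakes == 0:
                break
        return alpha, b

    X = np.array([[2.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [-2.0, -1.0]])
    y = np.array([1, 1, -1, -1])
    alpha, b = perceptron_dual(X, y)
    print(alpha, b, "updates =", alpha.sum())    # points never misclassified keep alpha_i = 0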

Reuters-21578
21,578 documents, about 27,000 terms, and 135 classes. Documents 1-14818 belong to the training set and documents 14819-21578 to the testing set. Using the ApteMod version of the TOPICS category set results in 90 categories, with 7,770 training documents and 3,019 testing documents.

Preprocessing Procedures (cont.)
After stopword elimination; after applying the Porter stemming algorithm.

Binary Text Classification: earn (+) vs. acq (-)
Select the top 500 terms using mutual information. Evaluate each classifier using the F-measure. Compare the two classifiers using a 10-fold paired t-test (sketched below).
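
A sketch of the comparison step using SciPy's paired t-test; the per-fold scores below are made-up placeholders, not the experimental results.

    import numpy as np
    from scipy import stats

    def compare_classifiers(f_a, f_b, alpha=0.05):
        # paired t-test on fold-wise F-measures; the null hypothesis is that
        # the two classifiers perform the same on average
        t_stat, p_value = stats.ttest_rel(f_a, f_b)
        decision = "reject" if p_value < alpha else "fail to reject"
        return t_stat, p_value, decision

    # illustrative per-fold F-measures for two classifiers over 10 folds
    f_a = np.array([0.97, 0.96, 0.98, 0.99, 0.97, 0.94, 0.98, 0.96, 0.97, 0.95])
    f_b = np.array([0.96, 0.94, 0.96, 0.95, 0.96, 0.93, 0.97, 0.95, 0.96, 0.94])
    print(compare_classifiers(f_a, f_b))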

10-fold Testing Results: RSVM vs. Naive Bayes
[Table of per-fold F-measures: RSVM 0.965, 0.975, 0.99, 0.984, 0.974, 0.936, 0.98; NB 0.969, 0.941, 0.964, 0.953, 0.958; fold-wise differences -0.004, -0.009, 0.021, 0.01, 0.033, 0.02, -0.038, 0.006, 0.016.]
Null hypothesis: there is no difference between RSVM and NB. Conclusion: reject with 95% confidence level.

Multi-Class SVMs: Combining Binary SVMs into a Multi-Class Classifier
One-vs-Rest: each binary problem is "in this class or not in this class"; the positive training samples are the data in this class and the negative training samples are the rest. This needs K binary SVMs (K is the number of classes). One-vs-One: each binary problem is "in class one or in class two"; the positive training samples are the data in one class and the negative training samples are the data in the other class. This needs K(K-1)/2 binary SVMs.
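
A one-vs-rest sketch; the stand-in binary trainer (regularized least squares on +1/-1 labels) and the toy data are my own, and any of the SVM formulations above could be dropped in instead.

    import numpy as np

    def train_binary(X, y):
        # stand-in binary trainer: regularized least squares on +1/-1 labels,
        # returning a scoring function f(x)
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])
        w = np.linalg.solve(Xb.T @ Xb + 1e-3 * np.eye(Xb.shape[1]), Xb.T @ y)
        return lambda x: np.append(x, 1.0) @ w

    def one_vs_rest(X, y, classes):
        # one classifier per class: that class is positive, every other class negative
        return {c: train_binary(X, np.where(y == c, 1.0, -1.0)) for c in classes}

    def predict(models, x):
        # pick the class whose classifier assigns the largest score
        return max(models, key=lambda c: models[c](x))

    X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 4.9], [0.0, 5.0], [0.3, 5.2]])
    y = np.array(["earn", "earn", "acq", "acq", "crude", "crude"])
    models = one_vs_rest(X, y, ["earn", "acq", "crude"])
    print(predict(models, np.array([5.1, 5.0])))   # expected: acq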

Performance Measures
Precision P = TP/(TP + FP), recall R = TP/(TP + FN), and F-measure F = 2PR/(P + R), where TP is the number of true positives, FP the number of false positives, and FN the number of false negatives.
                   prediction y = 1    prediction y = -1
    label y = 1    True Positive       False Negative
    label y = -1   False Positive      True Negative
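
A small sketch computing these measures from +1/-1 labels and predictions:

    import numpy as np

    def f_measure(y_true, y_pred):
        # precision, recall and F-measure from +1/-1 labels and predictions
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fp = np.sum((y_true == -1) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == -1))
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        return precision, recall, 2 * precision * recall / (precision + recall)

    y_true = np.array([1, 1, 1, -1, -1, -1])
    y_pred = np.array([1, 1, -1, 1, -1, -1])
    print(f_measure(y_true, y_pred))   # precision 2/3, recall 2/3, F 2/3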

Measures for Multi-class Classification (one-vs-rest)
Macro-averaging: take the arithmetic average of the per-class measures. Micro-averaging: average (pool) the per-class contingency (confusion) tables and compute the measure from the pooled counts.
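
A sketch contrasting the two averages on made-up per-class counts:

    import numpy as np

    def macro_micro_f(counts):
        # counts: list of (TP, FP, FN) tuples, one per class (one-vs-rest)
        def f(tp, fp, fn):
            p, r = tp / (tp + fp), tp / (tp + fn)
            return 2 * p * r / (p + r)
        macro = np.mean([f(tp, fp, fn) for tp, fp, fn in counts])   # average of per-class F
        tp, fp, fn = np.sum(counts, axis=0)                         # pool the contingency tables
        micro = f(tp, fp, fn)
        return macro, micro

    # illustrative counts for three classes; the small, poorly classified third
    # class drags the macro-average down more than the micro-average
    print(macro_micro_f([(90, 10, 10), (40, 20, 10), (5, 5, 15)]))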

Summary of Top 10 Categories
category (pos, neg, test):
acq (1648, 6075, 718)
corn (180, 7543, 56)
crude (385, 7338, 186)
earn (2861, 4862, 1080)
grain (428, 7295, 148)
interest (348, 7375, 131)
money-fx (534, 7189, 179)
ship (191, 7532, 87)
trade (367, 7356, 116)
wheat (211, 7512, 81)

F-measure of Top 10 Categories
Category: F-measure (%)
acq: 97.03
corn: 81.63
crude: 88.58
earn: 98.84
grain: 90.51
interest: 76.52
money-fx: 78.26
ship: 83.03
trade: 75.83
wheat: 80.00
macro avg.: 85.05
micro avg.: 92.87