Support Vector Machine
2013.04.15, PNU Artificial Intelligence Lab., Kim, Minho

Overview
- Support Vector Machines (SVM)
  - Linear SVMs
  - Non-Linear SVMs
- Properties of SVM
- Applications of SVMs
  - Gene Expression Data Classification
  - Text Categorization

Machine Learning

Type of Learning
- Supervised (inductive) learning: training data includes desired outputs
- Unsupervised learning: training data does not include desired outputs
- Semi-supervised learning: training data includes a few desired outputs
- Reinforcement learning: rewards from a sequence of actions

Linear Classifiers
- A linear classifier has the form f(x, w, b) = sign(w · x + b), where +1 and -1 denote the two classes.
- The decision boundary is the hyperplane w · x + b = 0; points with w · x + b > 0 are labeled +1 and points with w · x + b < 0 are labeled -1.
- How would you classify this data? Many different separating hyperplanes classify the training data correctly. Any of these would be fine... but which is best?
- A poorly chosen boundary can leave points misclassified into the +1 class.
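As a minimal illustration of this decision rule (an added sketch, not from the original slides), the following Python snippet classifies points with sign(w · x + b); the weight vector and bias are arbitrary example values.

```python
import numpy as np

def linear_classify(x, w, b):
    """Return +1 or -1 according to sign(w . x + b)."""
    return 1 if np.dot(w, x) + b > 0 else -1

# Hypothetical example: a 2-D weight vector and bias
w = np.array([2.0, -1.0])
b = 0.5
print(linear_classify(np.array([1.0, 0.0]), w, b))   # w.x + b = 2.5 > 0  -> +1
print(linear_classify(np.array([-1.0, 1.0]), w, b))  # w.x + b = -2.5 < 0 -> -1
```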

Classifier Margin
- Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a data point.

Maximum Margin
- The maximum margin linear classifier is the linear classifier with the maximum margin. This is the simplest kind of SVM, called a linear SVM (LSVM).
- Support vectors are the data points that the margin pushes up against.
  1. Maximizing the margin is good according to intuition and PAC theory.
  2. It implies that only the support vectors matter; the other training examples can be ignored.
  3. Empirically it works very well.
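To make the notion of support vectors concrete, here is a small sketch (an added illustration, assuming scikit-learn is available): it fits a linear-kernel SVM on a made-up two-class dataset and prints the support vectors that define the hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

# Toy two-class dataset (hypothetical values, for illustration only)
X = np.array([[1.0, 1.0], [2.0, 1.5], [2.5, 2.0],
              [6.0, 5.0], [7.0, 6.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

# Only the support vectors define the separating hyperplane
print("support vectors:\n", clf.support_vectors_)
print("w =", clf.coef_, "b =", clf.intercept_)
```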

Dataset With Noise
- Hard margin: so far we have required that all data points be classified correctly (no training error).
- What if the training set is noisy?
  - Solution 1: use very powerful kernels. Risk: OVERFITTING!

Soft Margin Classification
- Slack variables ξᵢ can be added to allow misclassification of difficult or noisy examples (the figure shows the hyperplanes w·x + b = 1, w·x + b = 0, and w·x + b = -1, with slack variables such as ξ₁, ξ₂ measuring how far points fall on the wrong side of the margin).
- What should our quadratic optimization criterion be? Minimize ½ wᵀw + C Σ ξᵢ.

Hard Margin versus Soft Margin
- The old (hard-margin) formulation:
  Find w and b such that Φ(w) = ½ wᵀw is minimized,
  and for all (xᵢ, yᵢ): yᵢ(wᵀxᵢ + b) ≥ 1.
- The new formulation incorporating slack variables:
  Find w and b such that Φ(w) = ½ wᵀw + C Σ ξᵢ is minimized,
  and for all (xᵢ, yᵢ): yᵢ(wᵀxᵢ + b) ≥ 1 − ξᵢ, with ξᵢ ≥ 0 for all i.
- The parameter C can be viewed as a way to control overfitting.
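As an illustrative sketch (scikit-learn again, not mentioned in the slides), the parameter C trades margin width against training errors: a small C tolerates more margin violations, a large C fits the training data more tightly. The data values below are synthetic.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
# Two noisy, partially overlapping classes (synthetic data)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [2.0, 2.0]])
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # A softer margin (small C) typically keeps more support vectors
    print(f"C={C:>6}: {len(clf.support_)} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```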

Linear SVMs: Overview
- The classifier is a separating hyperplane.
- The most "important" training points are the support vectors; they define the hyperplane.
- Quadratic optimization algorithms can identify which training points xᵢ are support vectors, i.e. those with non-zero Lagrange multipliers αᵢ.

Non-Linear SVMs: Example

Non-Linear SVMs: After Modification

Non-Linear SVMs
- Datasets that are linearly separable (with some noise) work out great.
- But what are we going to do if the dataset is just too hard?
- How about mapping the data to a higher-dimensional space? For example, 1-D data on the x axis that is not linearly separable can become separable after the mapping x → (x, x²).

Non-Linear SVMs: Feature Spaces
- General idea: the original input space can always be mapped to some higher-dimensional feature space where the training set is separable: Φ: x → φ(x).

The "Kernel Trick"
- The linear classifier relies on the dot product between vectors: K(xᵢ, xⱼ) = xᵢᵀxⱼ.
- If every data point is mapped into a high-dimensional space via some transformation Φ: x → φ(x), the dot product becomes K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ).
- A kernel function is a function that corresponds to an inner product in some expanded feature space.
- Example: for 2-dimensional vectors x = [x₁ x₂], let K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)². We need to show that K(xᵢ, xⱼ) = φ(xᵢ)ᵀφ(xⱼ):
  K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)²
            = 1 + xᵢ₁²xⱼ₁² + 2xᵢ₁xⱼ₁xᵢ₂xⱼ₂ + xᵢ₂²xⱼ₂² + 2xᵢ₁xⱼ₁ + 2xᵢ₂xⱼ₂
            = [1  xᵢ₁²  √2·xᵢ₁xᵢ₂  xᵢ₂²  √2·xᵢ₁  √2·xᵢ₂]ᵀ [1  xⱼ₁²  √2·xⱼ₁xⱼ₂  xⱼ₂²  √2·xⱼ₁  √2·xⱼ₂]
            = φ(xᵢ)ᵀφ(xⱼ),
  where φ(x) = [1  x₁²  √2·x₁x₂  x₂²  √2·x₁  √2·x₂].
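A quick numerical check of this identity (an added sketch, not from the slides), in Python with NumPy:

```python
import numpy as np

def K(xi, xj):
    """Polynomial kernel (1 + xi . xj)^2 for 2-D inputs."""
    return (1.0 + np.dot(xi, xj)) ** 2

def phi(x):
    """Explicit feature map whose inner product equals K."""
    x1, x2 = x
    return np.array([1.0, x1**2, np.sqrt(2) * x1 * x2, x2**2,
                     np.sqrt(2) * x1, np.sqrt(2) * x2])

xi = np.array([1.0, 2.0])   # arbitrary test vectors
xj = np.array([0.5, -1.0])
print(K(xi, xj), np.dot(phi(xi), phi(xj)))  # both print 0.25: the two values agree
```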

Examples of Kernel Functions
- Linear: K(xᵢ, xⱼ) = xᵢᵀxⱼ
- Polynomial of power p: K(xᵢ, xⱼ) = (1 + xᵢᵀxⱼ)ᵖ
- Gaussian (radial-basis function network): K(xᵢ, xⱼ) = exp(−‖xᵢ − xⱼ‖² / (2σ²))
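The three kernels above can be written directly in Python (an added sketch; σ and p are free parameters):

```python
import numpy as np

def linear_kernel(xi, xj):
    return np.dot(xi, xj)

def polynomial_kernel(xi, xj, p=2):
    return (1.0 + np.dot(xi, xj)) ** p

def gaussian_kernel(xi, xj, sigma=1.0):
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))
```

In scikit-learn's SVC these correspond to kernel="linear", kernel="poly" (with coef0=1 and gamma=1), and kernel="rbf" (where gamma = 1/(2σ²)).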

Example: Properties of Kernel Functions (figure comparing decision boundaries obtained with a linear kernel, a polynomial kernel, and a radial basis function kernel)

Example: Kernel Selection

Non-Linear SVM: Overview
- The SVM locates a separating hyperplane in the feature space and classifies points in that space.
- It does not need to represent the feature space explicitly; it only needs a kernel function.
- The kernel function plays the role of the dot product in the feature space.

Properties of SVM
- Flexibility in choosing a similarity function
- Sparseness of the solution when dealing with large data sets: only the support vectors are used to specify the separating hyperplane
- Ability to handle large feature spaces: complexity does not depend on the dimensionality of the feature space
- Overfitting can be controlled by the soft margin approach
- Nice math property: a simple convex optimization problem which is guaranteed to converge to a single global solution

Weakness of SVM
- It is sensitive to noise: a relatively small number of mislabeled examples can dramatically decrease performance.
- It only considers two classes. How do we do multi-class classification with SVM?
  1. With output arity m, learn m SVMs (a sketch follows below):
     - SVM 1 learns "Output == 1" vs "Output != 1"
     - SVM 2 learns "Output == 2" vs "Output != 2"
     - ...
     - SVM m learns "Output == m" vs "Output != m"
  2. To predict the output for a new input, predict with each SVM and find out which one puts the prediction the furthest into the positive region.
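A minimal one-vs-rest sketch of that scheme (an added illustration, not from the slides; the helper names and data are hypothetical), using scikit-learn's SVC decision function as the per-class score:

```python
import numpy as np
from sklearn.svm import SVC

def train_one_vs_rest(X, y, classes):
    """Train one binary SVM per class: class c vs. everything else."""
    return {c: SVC(kernel="linear").fit(X, np.where(y == c, 1, -1))
            for c in classes}

def predict_one_vs_rest(models, x):
    """Pick the class whose SVM pushes x furthest into the positive region."""
    scores = {c: m.decision_function(x.reshape(1, -1))[0] for c, m in models.items()}
    return max(scores, key=scores.get)

# Tiny synthetic 3-class example
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]], dtype=float)
y = np.array([1, 1, 2, 2, 3, 3])
models = train_one_vs_rest(X, y, classes=[1, 2, 3])
print(predict_one_vs_rest(models, np.array([9.0, 0.5])))  # near the class-3 cluster
```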

SVM Applications
- SVM has been used successfully in many real-world problems:
  - text (and hypertext) categorization
  - image classification
  - bioinformatics (protein classification, cancer classification)
  - hand-written character recognition

Text Classification
- Goal: to classify documents (news articles, e-mails, Web pages, etc.) into predefined categories
- Examples:
  - To classify news articles into "business" and "sports"
  - To classify Web pages into personal home pages and others
  - To classify product reviews into positive reviews and negative reviews
- Approach: supervised machine learning
  - For each pre-defined category, we need a set of training documents known to belong to that category.
  - From the training documents, we train a classifier.

Overview
- Step 1: text pre-processing. Pre-process the text and represent each document as a feature vector.
- Step 2: training. Train a classifier using a classification tool (e.g. LIBSVM, SVMlight).
- Step 3: classification. Apply the classifier to new documents. (A Python sketch of the full pipeline follows below.)
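The slides use command-line tools (LIBSVM, SVMlight); purely as an alternative illustration, here is how the same three steps might look in Python with scikit-learn (assumed installed), using TF-IDF features and a linear SVM. The example documents are made up.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Step 1 + 2: pre-processing (TF-IDF feature vectors) and training
train_docs = ["stocks fell sharply on Monday",        # business
              "the bank raised interest rates",        # business
              "the team won the championship game",    # sports
              "a late goal decided the match"]         # sports
train_labels = ["business", "business", "sports", "sports"]

clf = Pipeline([("tfidf", TfidfVectorizer(stop_words="english")),
                ("svm", LinearSVC())])
clf.fit(train_docs, train_labels)

# Step 3: classification of new documents
print(clf.predict(["the team lost the match"]))  # likely 'sports'
```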

Pre-processing: Tokenization
- Goal: to separate text into individual words
- Example: "We're attending a tutorial now." → we 're attending a tutorial now
- Tool: Word Splitter

Pre-processing: Stop Word Removal (optional)
- Goal: to remove common words that are usually not useful for text classification
- Example: remove words such as "a", "the", "I", "he", "she", "is", "are", etc.
- Stop word list: /stop_words

Pre-processing: Stemming (optional)
- Goal: to normalize words derived from the same root
- Examples: attending → attend, teacher → teach
- Tool: Porter stemmer
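A compact sketch of these three pre-processing steps (an added illustration; the regex tokenizer and the tiny stop word list are stand-ins for the Word Splitter and a real stop word list, and NLTK's Porter stemmer is assumed to be installed):

```python
import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "the", "i", "he", "she", "is", "are", "now"}
stemmer = PorterStemmer()

def preprocess(text):
    # Tokenization: lowercase and keep alphanumeric (and apostrophe) runs
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Stop word removal (optional)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Stemming (optional), e.g. "attending" becomes "attend"
    return [stemmer.stem(t) for t in tokens]

print(preprocess("We're attending a tutorial now."))
```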

Pre-processing: Feature Extraction
- Unigram features: use each word as a feature
  - Use TF (term frequency) as the feature value
  - Or use TF*IDF as the feature value, where IDF (inverse document frequency) = log(total-number-of-documents / number-of-documents-containing-t)
- Bigram features: use two consecutive words as a feature
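A small worked example of the TF*IDF weighting described above (an added sketch; the three toy documents are hypothetical):

```python
import math
from collections import Counter

docs = [["svm", "margin", "kernel"],
        ["svm", "text", "classification"],
        ["kernel", "trick", "feature", "space"]]

def tf_idf(term, doc, docs):
    tf = Counter(doc)[term]                  # term frequency in this document
    df = sum(1 for d in docs if term in d)   # number of documents containing the term
    idf = math.log(len(docs) / df)           # IDF as defined on the slide
    return tf * idf

print(tf_idf("svm", docs[0], docs))     # appears in 2 of 3 docs -> lower weight
print(tf_idf("margin", docs[0], docs))  # appears in 1 of 3 docs -> higher weight
```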

SVMlight
- SVM-light: a command-line C program that implements the SVM learning algorithm
  - Classification, regression, ranking
  - Download from the SVM-light project page; documentation is on the same page
- Two programs:
  - svm_learn for training
  - svm_classify for classification

SVMlight Examples
- Input format: one example per line, written as <label> <feature#>:<value> <feature#>:<value> ..., e.g.
  1 1:0.5 3:1 ...
- To train a classifier from train.data:
  svm_learn train.data train.model
- To classify new documents in test.data:
  svm_classify test.data train.model test.result
- Output format:
  - Positive score → positive class
  - Negative score → negative class
  - The absolute value of the score indicates confidence
- Command-line options:
  - -c: a tradeoff parameter (use cross-validation to tune)
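A small helper (an added sketch; the function name and example feature dictionaries are hypothetical) that writes feature vectors in this sparse <label> <feature#>:<value> format:

```python
def to_svmlight_line(label, features):
    """Format one example: label followed by feature#:value pairs, sorted by index."""
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label} {pairs}"

examples = [(1, {1: 0.5, 3: 1}),       # positive example
            (-1, {3: 0.1, 4: 2})]      # negative example

with open("train.data", "w") as f:
    for label, feats in examples:
        f.write(to_svmlight_line(label, feats) + "\n")
# The resulting train.data can then be passed to: svm_learn train.data train.model
```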

More on SVM-light
- Kernel:
  - Use the "-t" option
  - Polynomial kernel
  - User-defined kernel
- Semi-supervised learning (transductive SVM):
  - Use "0" as the label for unlabeled examples
  - Very slow

Some Issues
- Choice of kernel
  - A Gaussian or polynomial kernel is the default
  - If these are ineffective, more elaborate kernels are needed
  - Domain experts can give assistance in formulating appropriate similarity measures
- Choice of kernel parameters
  - e.g. σ in the Gaussian kernel
  - σ is the distance between the closest points with different classifications
  - In the absence of reliable criteria, applications rely on the use of a validation set or cross-validation to set such parameters (see the sketch below)
- Optimization criterion: hard margin vs. soft margin
  - a lengthy series of experiments in which various parameters are tested
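For instance, cross-validated parameter selection might look like this in scikit-learn (an added sketch only; the parameter grids and synthetic data are illustrative, and sklearn's RBF kernel uses gamma = 1/(2σ²) rather than σ directly):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(40, 2), rng.randn(40, 2) + [2.0, 2.0]])  # synthetic data
y = np.array([0] * 40 + [1] * 40)

param_grid = {"C": [0.1, 1, 10, 100],        # soft-margin tradeoff
              "gamma": [0.01, 0.1, 1, 10]}   # RBF kernel width
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```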

References
- Michael P. S. Brown, William Noble Grundy, David Lin, Nello Cristianini, Charles Sugnet, Manuel Ares, Jr., David Haussler. Support Vector Machine Classification of Microarray Gene Expression Data.
- T. Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. ECML-98.