
1 Support Vector Machines
CSC 576: Data Mining

2 Today…
Support Vector Machines
Multiclass Problems
Feature Selection

3 Support Vector Machines (SVM)
What are they? Developed in the 1990s in the computer science community. Very popular.
Performance: often considered one of the best "out of the box" classifiers.
Applications: handwritten digit recognition, text categorization.

4 Support Vector Machines (SVM)
Compared to other statistical learning methods, SVMs work well with high-dimensional data.
Unique: the "decision boundary" is represented using a subset of the training examples.

5 Terminology
Maximal Margin Classifier
Support Vector Classifier
Support Vector Machine
Often all three are referred to as "Support Vector Machine".

6 The Path Ahead
Maximal Margin Classifier
Support Vector Classifier: a generalization of the Maximal Margin Classifier
Support Vector Machine: a generalization of the Support Vector Classifier

7 Maximal Margin Classifier
First, we need to define a hyperplane. What is a hyperplane?
A hyperplane has p-1 dimensions in a p-dimensional space.
Example: in a two-dimensional space, a hyperplane has one dimension (and thus is a line).

8 Hyperplane Mathematical Definition
For two dimensions, the hyperplane is defined as B0 + B1X1 + B2X2 = 0, where B0, B1, B2 are parameters and X1, X2 are variables.
Note that this equation is a line: the hyperplane is one-dimensional.

9 Hyperplane Mathematical Definition
We're going to "find" values for B0, B1, B2. Then, for any values X1 and X2:
If B0 + B1X1 + B2X2 = 0, the point is on the line.

10 Hyperplane Mathematical Definition
We're going to "find" values for B0, B1, B2. Then, for any values X1 and X2:
If B0 + B1X1 + B2X2 > 0, the point is not on the line; it is on one side of the line.
If B0 + B1X1 + B2X2 < 0, the point is on the other side of the line.

11 Hyperplane
A hyperplane divides the two-dimensional space into two halves by a line.

12 Separating Hyperplane
Note: a separating hyperplane means zero training errors.
Dataset with two classes: squares and circles.
We can find a separating hyperplane with all squares on one side and all circles on the other.
Infinitely many such hyperplanes are possible.

13 Classification Using a Separating Hyperplane
For a new test instance, which side of the line is it on?
B0 + B1X1 + B2X2 > 0, or
B0 + B1X1 + B2X2 < 0?

14 Classification Using a Separating Hyperplane
Standard SVM approach: label the class data as either +1 or -1, depending on which class an instance belongs to.
Prediction: take the sign of B0 + B1X1 + B2X2; a positive value predicts the +1 class and a negative value predicts the -1 class.

15 Classification Using a Separating Hyperplane
For a new test instance, which side of the line is it on? B0 + B1X1 + B2X2 > 0, or B0 + B1X1 + B2X2 < 0?
We can also look at the magnitude: how far from zero is it? A greater magnitude means a more confident prediction, as in the sketch below.
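A minimal sketch of this decision rule in Python; the coefficient values are made up purely for illustration:

```python
# Hypothetical coefficients of an already-fitted hyperplane B0 + B1*X1 + B2*X2 = 0
b0, b1, b2 = -0.5, 1.0, 2.0

def decision_value(x1, x2):
    """Sign gives the predicted class; magnitude gives the confidence."""
    return b0 + b1 * x1 + b2 * x2

for x1, x2 in [(3.0, 1.0), (0.1, 0.2), (-2.0, -1.0)]:
    score = decision_value(x1, x2)
    label = +1 if score > 0 else -1
    print((x1, x2), "score:", round(score, 2), "predicted class:", label)
```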

16 Some Concerns with this Approach:
Datasets with more than 2 target classes
What if a "separating hyperplane" can't be formed?
Data with more than two dimensions
Regression instead of classification
SVMs can deal with each of these.

17 What if Data is more than 2-Dimensions?
The mathematical definition of a hyperplane generalizes to n dimensions: B0 + B1X1 + B2X2 + … + BnXn = 0.

18 Maximum Margin Hyperplane
What’s the best separating hyperplane? Intuition: the one that is farthest from the training observations. Called the maximum margin hyperplane.

19 The Margin
B1 and B2 (the labeled hyperplanes in the figure) are each separating hyperplanes; B1 is better.
Margin: the smallest distance from the hyperplane to the training data.

20 Maximal Margin Hyperplane
The maximal margin hyperplane represents the mid-line of the widest "slab" that can be inserted between the two classes.
We want the hyperplane that has the greatest margin: that is, B1 instead of B2 or any of the other infinitely many separating hyperplanes.

21 Maximal Margin Hyperplane
Support vectors: the points in the data that, if moved, would cause the maximal margin hyperplane to move as well.
Moving any of the other data points would not affect the model.

22 Figuring Out the Maximal Margin Classifier
Don't worry, data mining toolkits do it automatically. It is an optimization problem and involves calculus.
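For example, here is a minimal sketch using scikit-learn's SVC on a toy linearly separable dataset; the library, the generated data, and the very large C used to approximate a hard margin are all illustrative choices, not something the slides prescribe:

```python
import numpy as np
from sklearn.svm import SVC

# Toy, well-separated 2-D clusters (almost certainly linearly separable)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-2.0, -2.0], size=(20, 2)),
               rng.normal(loc=[+2.0, +2.0], size=(20, 2))])
y = np.array([-1] * 20 + [+1] * 20)

# A very large C approximates the hard-margin (maximal margin) classifier
clf = SVC(kernel="linear", C=1e6).fit(X, y)

print("support vectors:\n", clf.support_vectors_)   # the points that pin down the margin
print("coefficients:", clf.coef_, "intercept:", clf.intercept_)
```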

23 Support Vector Classifier
The Maximal Margin Classifier is the natural way to perform classification if a separating hyperplane exists, i.e., there is perfect segmentation between the two classes.
In many cases, no separating hyperplane will exist.
Instead, find a hyperplane that almost perfectly segments the classes.
This generalization is called the support vector classifier.

24 Support Vector Classifier
Maximal Margin Classifier: no training errors allowed.
Support Vector Classifier: tolerates training errors.
Approach: a "soft margin" allows construction of a linear decision boundary even when the classes are not linearly separable.

25 Support Vector Classifier
Additional motivation: the maximum margin classifier perfectly segments the training data, but adding a single new data point can cause a dramatic shift in the maximal margin hyperplane.
The model has high variance when trying to maintain perfect segmentation.

26 Support Vector Classifier
So, we are interested in:
Greater robustness to individual data instances
Better classification of most of the training data
Some misclassifications are permitted: the margin is "soft" because it can be violated by some of the instances.

27 Red instances: on the correct side of the margin; instance 2 is on the margin; instance 1 is on the wrong side of the margin; instance 11 is on the wrong side of the hyperplane.

28 Using Support Vector Classifier for Classification
Same as before. Which side of the line is the test instance on?

29 Constructing the Support Vector Classifier
This is the more interesting part: how much "softness" (how many misclassifications) in the soft margin is ideal?
The math is complicated, but Python figures it out.
It requires specifying a nonnegative tuning parameter C, generally chosen by the analyst following cross-validation (see the sketch below).
Large C: wider margin; more instances violate the margin.
Small C: narrower margin; less tolerance for instances that violate the margin.
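A sketch of choosing C by cross-validation with scikit-learn's GridSearchCV; the dataset and the C grid are arbitrary. One caution: scikit-learn's C parameter is a penalty on margin violations, so its effect runs in the opposite direction from the budget-style C described on this slide, where a large C means a wider margin.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Toy two-class data
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Try a grid of C values with 5-fold cross-validation
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

print("best C:", search.best_params_["C"])
print("cross-validated accuracy:", round(search.best_score_, 3))
```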

30 Same data points, moving from a larger C to a smaller C: lower variance to higher variance.

31 Support Vector Machines
What if a non-linear decision boundary is needed? A linear decision boundary performs poorly on such data.

32 Support Vector Machines
Idea: transform the data from its original coordinate space X into a new space Φ(X) so that a linear decision boundary can separate the two classes. Φ is a nonlinear transformation.
Huh? Instead of fitting a support vector classifier using n features X1, X2, …, Xn, use 2n features: X1, X1², X2, X2², …, Xn, Xn² (sketched below).
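A minimal sketch of that enlargement: append the squared features by hand and fit a linear support vector classifier in the enlarged space. The concentric-circles toy data and scikit-learn are choices made for illustration only:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Toy data no line can separate: one class inside, one class outside a ring
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

# Enlarge the feature space: [X1, X2] -> [X1, X2, X1^2, X2^2]
X_aug = np.hstack([X, X ** 2])

linear_orig = SVC(kernel="linear").fit(X, y)       # linear boundary in the original space
linear_aug = SVC(kernel="linear").fit(X_aug, y)    # linear boundary in the enlarged space

print("training accuracy, original features:", round(linear_orig.score(X, y), 3))
print("training accuracy, with squared features:", round(linear_aug.score(X_aug, y), 3))
```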

33 Support Vector Machines
The "feature space" is enlarged compared to the original feature space.
We can even extend to higher-order polynomial terms.
Downside: we can easily end up with a huge number of features, and overfitting.

34 Attribute Transformation

35 Learning a Nonlinear SVM Model
Once again: complicated, but Python does it for us.
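In scikit-learn this mostly amounts to choosing a kernel; a short sketch comparing linear, polynomial, and RBF kernels on a toy non-linear dataset (the data and kernels here are illustrative choices):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Two interleaving half-moons: a non-linear boundary is needed
X, y = make_moons(n_samples=300, noise=0.25, random_state=0)

for kernel in ["linear", "poly", "rbf"]:
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(f"{kernel:>6} kernel: mean CV accuracy = {scores.mean():.3f}")
```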

36 Other extensions to SVMs:
Regression instead of classification
Categorical variables instead of continuous
Multiclass problems instead of binary
Radial "kernels": a circle instead of a hyperplane

37 Today…
Support Vector Machines
Multiclass Problems
Feature Selection

38 Multiclass Problems
Scenario: the target class has more than 2 categories. Ideas?
Motivation: some machine learning algorithms are designed for binary classification. Example: Support Vector Machines (SVM).
How do we extend binary classifiers to handle multiclass problems?

39 #1 - Multiclass: One-Against-Rest
Assume a multiclass dataset with K target classes; decompose it into K binary problems.
Idea: for each target class yi, create a single binary problem with classifier Ci: class yi is positive, all other classes are negative.
Training: use all instances; each instance is used in training each of the Ci classifiers.
Testing: run the test instance through each classifier and record votes for each class yi (a negative prediction is a vote for all other classes); the class with the most votes is the predicted class.
A sketch follows.
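A sketch of one-against-rest using scikit-learn's OneVsRestClassifier wrapped around a binary SVM; the iris data is just a convenient 3-class example:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # K = 3 target classes

# K binary classifiers, each trained as "class k vs. everything else"
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

print("number of binary classifiers:", len(ovr.estimators_))
print("predictions for the first 5 instances:", ovr.predict(X[:5]))
```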

40 #2 - Multiclass: One-Against-One
Assume a multiclass dataset with K target classes; train K(K-1)/2 binary classifiers (many more than One-Against-Rest).
Idea: each classifier distinguishes between a pair of classes (yi, yj) and ignores records that don't belong to yi or yj.
Training: use all instances; each instance is only used in training its "relevant classifiers" (the K-1 of them that involve its class).
Testing: run the test instance through each classifier and record votes for each class yi; the class with the most votes is the predicted class.
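And the one-against-one counterpart, sketched with scikit-learn's OneVsOneClassifier on the same 3-class example:

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # K = 3 classes -> K(K-1)/2 = 3 pairwise classifiers

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

print("number of pairwise classifiers:", len(ovo.estimators_))
print("predictions for the first 5 instances:", ovo.predict(X[:5]))
```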

41 Today…
Support Vector Machines
Multiclass Problems
Feature Selection: High Dimensionality, Feature Subset Selection, Adjusted R² Statistic

42 High Dimensionality … can be bad
Datasets can have a large number of features.
Example: big data.
Example: stock prices (time series). Each stock is an individual instance, and the features/variables are the closing prices on given days. Imagine 30 years' worth of closing prices (30 x 365).

43 Why is it a Problem?
BAD: p > n, where p = # of features and n = # of instances.
Often data mining algorithms work better if there is not an overwhelming number of attributes, i.e., the "dimensionality" is lower.
"The Curse of Dimensionality": as dimensionality increases (more features), the data becomes increasingly sparse in the "feature space" that it occupies.
There are not enough data objects for the number of features present, and classification model accuracy is reduced.

44 Other Benefits to Dimensionality Reduction
More understandable models: the learned model may involve fewer attributes.
Better visualizations: fewer attributes means fewer variables to plot.
Computational time: fewer attributes may mean quicker model learning.
Elimination of irrelevant features.

45 Techniques for Dimensionality Reduction
Advanced: linear algebra techniques; automatic approaches that project data from a high-dimensional space into a lower-dimensional space.
Principal Components Analysis (PCA)
Singular Value Decomposition (SVD)
We are not necessarily interested in "losing information"; rather, we want to eliminate some of the sparsity.

46 Techniques for Dimensionality Reduction
Feature Construction
Example: combining two separate features (# of full baths, # of half baths) into one feature ("total baths").
Example: combining the features (mass) and (volume) into one feature (density), where density = mass / volume.
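A small pandas sketch of this kind of feature construction; the column names and values are invented, and treating a half bath as 0.5 of a bath is just one possible choice:

```python
import pandas as pd

df = pd.DataFrame({
    "full_baths": [2, 1, 3],
    "half_baths": [1, 0, 2],
    "mass":       [10.0, 4.0, 9.0],
    "volume":     [2.0, 2.0, 3.0],
})

# Construct the combined features, then drop the originals
df["total_baths"] = df["full_baths"] + 0.5 * df["half_baths"]   # assumption: a half bath counts as 0.5
df["density"] = df["mass"] / df["volume"]
df = df.drop(columns=["full_baths", "half_baths", "mass", "volume"])
print(df)
```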

47 Techniques for Dimensionality Reduction
Feature Subset Selection: reducing the number of features by only using a subset of them.
How many features should be in the subset? Are we losing information if we only consider a subset of features?
Redundant features. Example: (1) purchase price and (2) sales tax.
Irrelevant features. Example: student ID numbers.
By eliminating unnecessary features, we hope for a better model.

48 Eliminating Redundant and Irrelevant Features
Manually, via the data analyst's intuition about the problem domain.
Systematically: try all possible combinations of feature subsets and see which combination results in the best model?
For n features there are 2^n possible subsets, so it is infeasible to try each of them.

49 Three Systematic Approaches
Embedded Approaches
Filter Approaches
Wrapper Approaches

50 Embedded Approaches
Algorithm specific: feature selection occurs naturally as part of the data mining algorithm.
Example: present in decision tree induction, where only a certain subset of features is used in the final decision tree.
Example: not present in linear regression, where the fitted model contains a coefficient for each predictor variable.

51 Filter Approaches
Features are selected before the data mining algorithm is run; the filter approach is independent of the data mining task.
Example (trying to eliminate redundant features): look at the pairwise correlation between variables, pick a subset of variables that each have low pairwise correlation, and then use only that subset in the linear regression model (see the sketch below).
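A sketch of that correlation filter using pandas; the 0.9 threshold and the toy data are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df, threshold=0.9):
    """Drop one feature from every pair whose absolute correlation exceeds the threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy data: x3 is nearly a copy of x1, so it should be dropped
rng = np.random.default_rng(0)
x1, x2 = rng.normal(size=100), rng.normal(size=100)
df = pd.DataFrame({"x1": x1, "x2": x2, "x3": x1 + 0.01 * rng.normal(size=100)})

print(drop_highly_correlated(df).columns.tolist())   # expect ['x1', 'x2']
```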

52 Wrapper Approaches
The data mining algorithm is used as a "black box" for finding the best subset of features.
It tries different combinations of subsets, but typically never enumerates all 2^n possible combinations; it searches a much smaller feature space.
The final model uses the specific subset that evaluates the best.

53 Top-Down Wrapper
Assuming n features…
Start with no attributes. Train the classifier n times, each time with a different single feature, and see which of the n classifiers performs the best.
Add to that best classifier: recursively use the remaining attributes to find the attribute that improves performance the most, and keep including the best attribute.
Stopping criterion: stop if there is no improvement in classifier performance, or the increase in performance is less than some threshold.
A sketch of this procedure follows.
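A minimal sketch of this top-down wrapper using cross-validated accuracy as the evaluation; the toy data, the linear SVM, and the improvement threshold are all assumptions made for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)

def forward_select(X, y, estimator, min_improvement=0.005):
    selected, best_score = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        # Score each candidate feature added to the current subset
        scores = {f: cross_val_score(estimator, X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        best_f = max(scores, key=scores.get)
        if scores[best_f] - best_score < min_improvement:
            break                         # stopping criterion: no meaningful improvement
        selected.append(best_f)
        remaining.remove(best_f)
        best_score = scores[best_f]
    return selected, best_score

features, score = forward_select(X, y, SVC(kernel="linear"))
print("selected features:", features, "CV accuracy:", round(score, 3))
```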

54 Bottom-Up Wrapper
Assuming n features…
Start with all n attributes in the model. Create n models, each with a different predictor omitted (each classifier has n-1 predictors), and see which of the n classifiers affects performance the least. Throw that attribute out.
Recursively find the attribute that affects performance the least.
Stopping criterion: stop if classifier performance begins to degrade.
A sketch follows.
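The bottom-up wrapper reverses the loop: start with everything and repeatedly drop the feature whose removal hurts least. Again only a sketch, with the same arbitrary toy data and classifier as above:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)
estimator = SVC(kernel="linear")

kept = list(range(X.shape[1]))
best = cross_val_score(estimator, X, y, cv=5).mean()
while len(kept) > 1:
    # Score each candidate subset with one feature omitted
    scores = {f: cross_val_score(estimator, X[:, [k for k in kept if k != f]], y, cv=5).mean()
              for f in kept}
    f_drop = max(scores, key=scores.get)
    if scores[f_drop] < best:             # stopping criterion: performance begins to degrade
        break
    kept.remove(f_drop)
    best = scores[f_drop]

print("kept features:", kept, "CV accuracy:", round(best, 3))
```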

55 Other Wrappers
Bi-Directional: combining Top-Down and Bottom-Up
Greedy Search with Backtracking (if you're familiar with AI)
…

56 Adjusted R² Statistic
Recall the R² statistic that we saw in linear regression: it measures the proportion of variance explained by the model, is always a value between 0 and 1, and higher is better.
However, R² always increases as more variables are added to the model.

57 Adjusted R² Statistic
In contrast to R², Adjusted R² penalizes for unnecessary variables in the model:
Adjusted R² = 1 - (RSS / (n - d - 1)) / (TSS / (n - 1))
where d = number of predictors and n = number of instances.
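A small worked computation of the statistic; the RSS, TSS, n, and d values below are made up just to show the arithmetic:

```python
def adjusted_r2(rss, tss, n, d):
    """Adjusted R^2 = 1 - (RSS / (n - d - 1)) / (TSS / (n - 1))."""
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))

# Hypothetical fit: 100 instances, 3 predictors
rss, tss, n, d = 40.0, 100.0, 100, 3

print("R^2:", round(1 - rss / tss, 3))            # 0.6
print("Adjusted R^2:", round(adjusted_r2(rss, tss, n, d), 3))
```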

58 References
Data Science from Scratch, 1st edition, Grus.
Introduction to Data Mining, 1st edition, Tan et al.
Data Mining and Business Analytics with R, 1st edition, Ledolter.

