Data Mining and Machine Learning via Support Vector Machines

Name: Data Mining and Machine Learning via Support Vector Machines
Uploaded: 2017-07-04T12:19:19+00:00
Duration: PTM20S47
Channel: Sophie Montgomery
Description: Data Mining and Machine Learning via Support Vector Machines

Data Mining and Machine Learning via Support Vector Machines
Dave Musicant
Graphic generated with Lucent Technologies Demonstration 2-D Pattern Recognition Applet at

Outline
The Supervised Learning Classification Problem
The Support Vector Machine for Classification (linear approaches)
Nonlinear SVM approaches
Active learning techniques for SVMs
Iterative algorithms for solving SVMs
SVM Regression
Wrapup

Basic Definitions
Data Mining: “non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.” -- Usama Fayyad
Utilizes techniques from machine learning, databases, and statistics.
Machine Learning: “concerned with the question of how to construct computer programs that automatically improve with experience.” -- Tom Mitchell
Fits under the Artificial Intelligence umbrella.

Supervised Learning Classification
Example: Cancer diagnosis
Training Set: patients for whom both the input data and the classification are known.
Test Set: patients for whom only the input data is known.
Use the training set to learn how to classify patients where the diagnosis is not known.
The input data is often easily obtained, whereas the classification is not.

Classification Problem
Goal: Use training set + some learning method to produce a predictive model. Use this predictive model to classify new data. Sample applications:

Application: Breast Cancer Diagnosis
Research by Mangasarian, Street, Wolberg

Breast Cancer Diagnosis Separation
Research by Mangasarian, Street, Wolberg

Application: Document Classification
The Federalist Papers
Written by Alexander Hamilton, John Jay, and James Madison to persuade residents of the State of New York to ratify the U.S. Constitution.
All written under the pseudonym “Publius”.
Who wrote which of them?
Hamilton wrote 56 papers.
Madison wrote 50 papers.
12 disputed papers, generally understood to be written by Hamilton or Madison, but not known which.
Research by Bosch, Smith

Federalist Papers Classification
Graphic by Fung
Research by Bosch, Smith

Application: Face Detection
Training data is a collection of Faces and NonFaces.
Rotation and Mirroring added in to provide robustness.
Image obtained from work by Osuna, Freund, and Girosi at

Face Detection Results
Image obtained from "Support Vector Machines: Training and Applications" by Osuna, Freund, and Girosi.

Face Detection Results
Image obtained from work by Osuna, Freund, and Girosi at

Simple Linear Perceptron
Goal: Find the best line (or hyperplane) to separate the training data (Class -1 from Class 1).
How to formalize?
In two dimensions, the equation of the line is given by:
Better notation for n dimensions: treat each data point and the coefficients as vectors. Then the equation is given by:
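A common way to write the two equations referred to above is sketched here. The symbols w_1, w_2, b and the sign convention (plane at w . x = b) are assumptions chosen for illustration, not taken from the slide itself.

```latex
% Two dimensions: coefficients w_1, w_2 and offset b
w_1 x_1 + w_2 x_2 = b

% n dimensions: w and x are vectors in R^n
w \cdot x = b
```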

Simple Linear Perceptron (cont.)
The Simple Linear Perceptron is a classifier as shown in the picture.
Points that fall on the right are classified as “1”.
Points that fall on the left are classified as “-1”.
Therefore: using the training set, find a hyperplane (line) so that the points of class 1 fall on the right and the points of class -1 fall on the left.
This is a good starting point. But we can do better!
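A minimal sketch of how such a separating line can be found, using the classic perceptron update rule. The toy data, the function names, and the convention that a point is labeled 1 when w . x - b > 0 are all assumptions made for illustration.

```python
def train_perceptron(points, labels, max_epochs=1000):
    """Classic perceptron rule: nudge the plane toward each misclassified point."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) - b
            if y * activation <= 0:  # wrong side of the plane, or exactly on it
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b -= y
                mistakes += 1
        if mistakes == 0:  # every training point is on its correct side
            break
    return w, b


def classify(w, b, x):
    """Return 1 for points on the positive side of the plane, -1 otherwise."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - b > 0 else -1


# Hypothetical, linearly separable toy data.
points = [(2.0, 3.0), (3.0, 3.5), (4.0, 2.0), (-1.0, -2.0), (-2.0, 0.0), (0.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_perceptron(points, labels)
predictions = [classify(w, b, x) for x in points]
```

The perceptron stops at the first plane that separates the training set, which is why it gives no preference among the many planes that do so.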

Finding the Best Plane
Not all planes are equal. Which of the two following planes shown is better?
Both planes accurately classify the training set.
The solid green plane is the better choice, since it is more likely to do well on future test data.
The solid green plane is further away from the data.

Separating the planes
Construct the bounding planes:
Draw two parallel planes to the classification plane.
Push them as far apart as possible, until they hit data points.
The classification plane with bounding planes furthest apart is the best one.

Recap: Finding the Best Plane
Details
All points in class 1 should be to the right of bounding plane 1.
All points in class -1 should be to the left of bounding plane -1.
Pick yi to be +1 or -1 depending on the classification. Then the above two inequalities can be written as one:
The distance between bounding planes should be maximized.
The distance between bounding planes is given by:
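In the standard formulation, the inequalities and the distance described above take the following form. This sketch assumes the classification plane is w . x = b and the bounding planes are w . x = b + 1 and w . x = b - 1; the slide's own notation may differ.

```latex
\begin{align*}
  w \cdot x_i &\ge b + 1 && \text{for all points in class 1} \\
  w \cdot x_i &\le b - 1 && \text{for all points in class -1} \\
  y_i \,(w \cdot x_i - b) &\ge 1 && \text{both inequalities combined, } y_i \in \{+1, -1\} \\
  \text{distance between bounding planes} &= \frac{2}{\|w\|_2}
\end{align*}
```

Maximizing 2 / ||w|| is the same as minimizing ||w||, which is why the norm of w appears as the objective in the optimization problem.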

The Optimization Problem
The previous slide can be rewritten as:
This is a mathematical program: an optimization problem subject to constraints.
More specifically, this is a quadratic program.
There are high-powered software tools for solving this kind of problem (both commercial and academic).
These general purpose tools are slow for this particular problem.

Data Which is Not Linearly Separable
What if a separating plane does not exist?
Find the plane that maximizes the margin and minimizes the errors on the training points.
Take the original inequality and add a slack variable to measure error:

The Support Vector Machine
Push the planes apart and minimize the error at the same time:
C is a positive number that is chosen to balance these two goals.
This problem is called a Support Vector Machine, or SVM.
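A minimal sketch of this trade-off, assuming the standard soft-margin objective 0.5 * ||w||^2 + C * sum of slacks, where the slack of point i is max(0, 1 - yi * (w . xi - b)). It uses plain subgradient descent rather than a quadratic programming solver, purely for illustration; the toy data, step size, and function names are hypothetical.

```python
def train_soft_margin(points, labels, C=1.0, epochs=500, lr=0.01):
    """Subgradient descent on 0.5*||w||^2 + C * sum(max(0, 1 - y*(w.x - b)))."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the margin term 0.5*||w||^2
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) - b)
            if margin < 1:  # point has positive slack, so it adds to the error term
                grad_w = [g - C * y * xi for g, xi in zip(grad_w, x)]
                grad_b += C * y
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def classify(w, b, x):
    """Return 1 for points on the positive side of the plane, -1 otherwise."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) - b > 0 else -1


# Hypothetical toy data.
points = [(2.0, 3.0), (3.0, 3.5), (4.0, 2.0), (-1.0, -2.0), (-2.0, 0.0), (0.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_soft_margin(points, labels, C=1.0)
predictions = [classify(w, b, x) for x in points]
```

A larger C penalizes training errors more heavily; a smaller C favors a wider margin.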

Terminology
Those points that touch the bounding plane, or lie on the wrong side, are called support vectors.
If all the data points except the support vectors were removed, the solution would turn out the same.
The SVM is mathematically equivalent to force and torque equilibrium (hence the name support vectors).

Example from Carleton College
1850 students
4-year undergraduate liberal arts college
Ranked 5th in the nation by US News and World Report
15-20 computer science majors per year
All research assistants are full-time undergraduates

Student Research Example
Goal: automatically generate “frequently asked questions” list from discussion groups
Subgoal #1: Given a corpus of discussion group postings, identify those messages that contain questions
Recruit student volunteers to identify questions
Learn classification
Work by students Sarah Allen, Janet Campbell, Ester Gubbrud, Rachel Kirby, Lillie Kittredge

Building A Training Set
Which sentences are questions in the following text?
From: (Wonko the Sane)
I was recently talking to a possible employer ( mine! :-) ) and he made a reference to a 48-bit graphics computer/image processing system. I seem to remember it being called IMAGE or something akin to that. Anyway, he claimed it had 48-bit color + a 12-bit alpha channel. That's 60 bits of info--what could that possibly be for? Specifically the 48-bit color? That's 280 trillion colors, many more than the human eye can resolve. Is this an anti-aliasing thing? Or is this just some magic number to make it work better with a certain processor.

Representing the training set
Each document is a point
Each potential word is a column (bag of words)
Other pre-processing tricks:
Remove punctuation
Remove "stop words" such as "is", "a", etc.
Use stemming to remove "ing" and "ed", etc. from similar words
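The representation above can be sketched as follows. The stop-word list and the suffix-stripping "stemmer" are deliberately crude stand-ins chosen for illustration; they are not the ones the students used.

```python
import re

# Illustrative stop-word list only.
STOP_WORDS = {"is", "a", "the", "to", "it", "this", "an"}


def crude_stem(word):
    """Strip a trailing 'ing' or 'ed'; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, drop punctuation and digits, remove stop words, then stem."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def bag_of_words(documents):
    """Return the vocabulary (columns) and one row of word counts per document."""
    token_lists = [tokenize(d) for d in documents]
    vocabulary = sorted({t for tokens in token_lists for t in tokens})
    column = {word: j for j, word in enumerate(vocabulary)}
    rows = []
    for tokens in token_lists:
        row = [0] * len(vocabulary)
        for t in tokens:
            row[column[t]] += 1
        rows.append(row)
    return vocabulary, rows


docs = ["Is this an anti-aliasing thing?", "He claimed it had 48-bit color."]
vocabulary, rows = bag_of_words(docs)
```

Each row is then a data point that can be handed to a linear SVM, with one coordinate per vocabulary word.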

Results
If you just guess brain-dead, "every message contains a question", you get 55% right.
If you use a Support Vector Machine, you get 66.5% of them right.
What words do you think were strong indicators of questions?
anyone, does, any, what, thanks, how, help, know, there, do, question
What words do you think were strong contra-indicators of questions?
re, sale, m, references, not, your

Beyond lines
Some datasets may not be best separated by a plane.
SVMs can be extended to nonlinear surfaces also.
Generated with Lucent Technologies Demonstration 2-D Pattern Recognition Applet at

Finding nonlinear surfaces
How to modify the algorithm to find nonlinear surfaces?
First idea (simple and effective): map each data point into a higher dimensional space, and find a linear fit there.
Example: Find a quadratic surface for
Use the new coordinates in a regular linear SVM.
A plane in this quadratic space is equivalent to a quadratic surface in our original space.
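A minimal sketch of the mapping idea for two-dimensional points. The particular quadratic coordinates and the plane chosen in the mapped space are assumptions for illustration; the slide's own example is not reproduced here.

```python
def quadratic_features(x):
    """Map (x1, x2) to the quadratic coordinates (x1^2, x2^2, x1*x2, x1, x2)."""
    x1, x2 = x
    return (x1 * x1, x2 * x2, x1 * x2, x1, x2)


# A hypothetical plane w . z = b in the mapped space. With these coefficients it
# reads z1 + z2 = 1, i.e. the circle x1^2 + x2^2 = 1 in the original space.
w = (1.0, 1.0, 0.0, 0.0, 0.0)
b = 1.0


def classify(x):
    """Linear classification in the mapped space, nonlinear in the original one."""
    z = quadratic_features(x)
    return 1 if sum(wi * zi for wi, zi in zip(w, z)) - b > 0 else -1


inside = classify((0.2, 0.3))   # a point inside the unit circle
outside = classify((1.5, 0.0))  # a point outside the unit circle
```

The classifier is still linear in z, so any linear SVM solver can be used unchanged on the mapped points.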

Problems with this method
If the dimensionality of the space is high, there are lots of calculations.
For a high polynomial space, the number of coordinate combinations explodes.
All these calculations need to be done for all training points, and for each testing point.
Infinite dimensional spaces are impossible.
Nonlinear surfaces can be used without these problems through the use of a kernel function.

The Dual Problem
The dual SVM is an alternative approach.
Wrap a “string” around all the data points of each class.
Find the two points, one on each “string”, which are closest together.
Connect the dots.
The perpendicular bisector to this connection is the best classification plane.

The Dual Variable, or “Importance”
Every point on the “string” is a linear combination of the points inside the string (x1, x2, x3 in the figure).
In general:
The α's are referred to as dual variables, and represent the “importance” of each data point.
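One way to write this combination is as a convex combination, which is what "wrapping a string" (taking the convex hull) suggests. The non-negativity and sum-to-one conditions are the usual convex hull conditions and are assumed here rather than quoted from the slide.

```latex
% Three points:
x = \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3

% In general, for the points x_i of one class:
x = \sum_i \alpha_i x_i, \qquad \alpha_i \ge 0, \qquad \sum_i \alpha_i = 1
```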

Two Equivalent Approaches
Primal Problem: Find best separating plane. Variables: w, b
Dual Problem: Find closest points on “strings”. Variables: α
Both problems yield the same classification plane.
w, b can be expressed in terms of α.
α can be expressed in terms of w, b.

How to generalize nonlinear fits
Traditional SVM:
Dual formulation:
Can find w and b in terms of α.
But note: don't need any xi individually, just scalar products between points.
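For reference, the standard textbook dual of the soft-margin SVM is sketched below. It is given under the assumption that the slide uses this common form; the exact scaling and notation on the original slide may differ.

```latex
\begin{align*}
  \max_{\alpha} \quad & \sum_{i} \alpha_i
      - \frac{1}{2} \sum_{i} \sum_{j} \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j) \\
  \text{subject to} \quad & 0 \le \alpha_i \le C, \qquad \sum_{i} \alpha_i y_i = 0 \\
  \text{with} \quad & w = \sum_{i} \alpha_i y_i x_i
\end{align*}
```

The data points enter only through the scalar products x_i . x_j, which is the property the kernel substitution relies on.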

Kernel function
Dual formulation again:
Substitute scalar product with kernel function:
Using a kernel corresponds to having mapped the data into some high dimensional space, possibly an infinite one.

Traditional kernels
Linear
Polynomial
Gaussian
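The three kernels named above are sketched here in their common textbook forms. The parameter names (`degree`, `c`, `gamma`) and default values are assumptions; the slide's own parameterization is not shown. The last lines check, for one pair of points, that a degree-2 polynomial kernel equals a scalar product in an explicitly mapped space.

```python
import math


def dot(x, y):
    """Ordinary scalar product."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def polynomial_kernel(x, y, degree=2, c=1.0):
    return (dot(x, y) + c) ** degree


def gaussian_kernel(x, y, gamma=0.5):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def phi(x):
    """Explicit feature map matching polynomial_kernel with degree=2, c=0 in 2-D."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


x, y = (1.0, 2.0), (3.0, -1.0)
kernel_value = polynomial_kernel(x, y, degree=2, c=0.0)  # computed in the original space
explicit_value = dot(phi(x), phi(y))                     # computed in the mapped space
```

The kernel gives the same number as the explicit mapping without ever building the higher dimensional coordinates.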

Another interpretation
Kernels can be thought of as a distance metric.
Linear SVM: determine class by sign of
Nonlinear SVM: determine class by sign of
Those support vectors that x is "closest to" influence its class selection.
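A minimal sketch of the nonlinear decision rule, assuming the usual form sign(sum of ai * yi * K(xi, x) - b) over the support vectors. The support vectors, dual variables, and offset below are made-up values for illustration, not the result of training.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def decide(x, support_vectors, labels, alphas, b, kernel):
    """Class of x from the sign of the kernel-weighted vote of the support vectors."""
    score = sum(
        a * y * kernel(sv, x) for sv, y, a in zip(support_vectors, labels, alphas)
    ) - b
    return 1 if score > 0 else -1


# Hypothetical support vectors with made-up dual variables.
support_vectors = [(0.0, 0.0), (4.0, 4.0)]
labels = [1, -1]
alphas = [1.0, 1.0]
b = 0.0

near_first = decide((0.5, 0.5), support_vectors, labels, alphas, b, gaussian_kernel)
near_second = decide((3.5, 3.5), support_vectors, labels, alphas, b, gaussian_kernel)
```

With a Gaussian kernel, a support vector's vote decays with its distance from x, so nearby support vectors dominate the decision.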

Example: Checkerboard

k-Nearest Neighbor Algorithm

SVM on Checkerboard

Active Learning with SVMs
Given a set of unlabeled points that I can label at will, how do I choose which one to label next?
Common answer: choose a point that is on or close to the current separating hyperplane (Campbell, Cristianini, Smola; Tong & Koller; Schohn & Cohn)
Why?
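The "common answer" can be sketched for a linear SVM as follows. The plane w . x = b, the candidate points, and the function names are hypothetical.

```python
import math


def distance_to_hyperplane(w, b, x):
    """Euclidean distance from x to the plane w . x = b."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)) - b) / math.sqrt(
        sum(wi * wi for wi in w)
    )


def next_point_to_label(w, b, unlabeled):
    """Index of the unlabeled point closest to the current separating hyperplane."""
    return min(
        range(len(unlabeled)),
        key=lambda i: distance_to_hyperplane(w, b, unlabeled[i]),
    )


# Hypothetical current plane and pool of unlabeled points.
w, b = (1.0, 1.0), 0.0
unlabeled = [(3.0, 3.0), (0.5, -0.4), (-2.0, -2.5)]

chosen = next_point_to_label(w, b, unlabeled)
```

After the chosen point is labeled, the SVM is retrained and the selection is repeated against the new plane.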

On the hyperplane: Spin 1
Assume the data is linearly separable.
A point which is on the hyperplane (or at least in the margin) is guaranteed to change the results. (Schohn & Cohn)

On the hyperplane: Spin 2
Intuition suggests that one should grab the point that is most wrong.
Problem: don't know the class of the point yet.
If you grab a point that is far from the hyperplane, and it is classified wrong, this would be wonderful.
But: points which are far from the hyperplane are the ones which are most likely to be correctly classified. (Campbell, Cristianini, Smola)

Active Learning in Batches
What if you want to choose a number of points to label at once? (Brinker)
Could choose the n closest points to the hyperplane, but this is not optimal.

Heuristic approach instead
Assumption: all hyperplanes go through the origin.
The authors claim that this can be compensated for with an appropriate choice of kernel.
To have maximal effect on the direction of the hyperplane, choose points with the largest angle.

Defining angle
Let Φ = mapping to feature space
Angle between points x and y:
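The angle in feature space can be expressed through the kernel alone. The following is the standard identity, assuming K(x, y) denotes the scalar product of the mapped points; it is offered as a sketch of what the missing formula most likely states.

```latex
\cos\big(\angle(\Phi(x), \Phi(y))\big)
  = \frac{\Phi(x) \cdot \Phi(y)}{\|\Phi(x)\| \, \|\Phi(y)\|}
  = \frac{K(x, y)}{\sqrt{K(x, x) \, K(y, y)}}
```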

Approach for maximizing angle
Introduce an artificial point normal to the existing hyperplane.
Choose the next point to be the one that maximizes the angle with this one.
Choose each successive point to be the one that maximizes the minimum angle to the previous points (i.e., minimizes the maximum cosine value).

What happened to distance?
In practice, use both measures:
want points closest to the plane
want points with largest angular separation from others
Iterative greedy algorithm:
value = λ * distance to hyperplane + (1-λ) * (largest cosine measure to an already existing point)
Choose the next point to be the one that minimizes this value.
Paper has results: fairly robust to varying λ
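A minimal sketch of the greedy selection for a linear kernel and a hyperplane through the origin. Several choices here are assumptions made for illustration: the weighting parameter is called `lam`, the cosine is taken in absolute value, and the cosine term is set to 0 while the batch is still empty (so the first pick is simply the point closest to the plane).

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    return math.sqrt(dot(x, x))


def cosine(x, y):
    return dot(x, y) / (norm(x) * norm(y))


def select_batch(w, candidates, batch_size, lam=0.5):
    """Greedily pick indices, trading closeness to the plane against diversity."""
    chosen = []
    remaining = list(range(len(candidates)))
    while remaining and len(chosen) < batch_size:

        def value(i):
            x = candidates[i]
            distance = abs(dot(w, x)) / norm(w)  # plane through the origin
            largest_cosine = max(
                (abs(cosine(x, candidates[j])) for j in chosen), default=0.0
            )
            return lam * distance + (1 - lam) * largest_cosine

        best = min(remaining, key=value)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical plane normal and candidate points.
w = (1.0, 0.0)
candidates = [(0.1, 1.0), (0.2, 2.0), (0.3, -0.1), (2.0, 0.5)]

batch = select_batch(w, candidates, batch_size=2)
```

In this example the two points closest to the plane are indices 0 and 1, but they point in the same direction, so the second pick goes to index 2 instead.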

Iterative Algorithms
Maintain the “importance,” or dual variable, associated with all data points.
This is small, since it is a single dimensional array of size m.
Algorithm:
Look at each point sequentially.
Update its importance. (How?)
Repeat until no further improvements in goal.

Iterative Framework
LSVM, ASVM, SOR, etc. are iterative algorithms on the dual variables.
Algorithm: (Assume that we have m data points.)
    for (i = 0; i < m; i++)
        ai = 0;  // Initialize dual variables
    while (distance between strings continues to shorten)
        for (i = 0; i < m; i++) {
            Update ai according to the update rule (not shown here).
        }
Bottleneck: Repeated scans through the dataset.
Many of these data points are unimportant.
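To make the framework concrete, here is a sketch that plugs in one specific update rule: a standard dual coordinate ascent step for a linear SVM without a bias term (plane through the origin). This rule is swapped in for illustration only; it is not claimed to be the update used by LSVM, ASVM, or SOR, and the stopping test (largest change in any dual variable) stands in for "strings continue to shorten".

```python
def dual_coordinate_ascent(points, labels, C=1.0, max_sweeps=100, tol=1e-6):
    """Sweep over the points, updating one dual variable at a time."""
    m = len(points)
    alphas = [0.0] * m  # initialize dual variables
    w = [0.0] * len(points[0])  # kept equal to sum_i alphas[i] * y_i * x_i
    for _ in range(max_sweeps):
        largest_change = 0.0
        for i in range(m):
            x, y = points[i], labels[i]
            q = sum(xi * xi for xi in x)
            if q == 0.0:
                continue  # a zero vector cannot influence w
            gradient = y * sum(wi * xi for wi, xi in zip(w, x)) - 1.0
            new_alpha = min(max(alphas[i] - gradient / q, 0.0), C)  # clip to [0, C]
            delta = new_alpha - alphas[i]
            if delta != 0.0:
                w = [wi + delta * y * xi for wi, xi in zip(w, x)]
                alphas[i] = new_alpha
                largest_change = max(largest_change, abs(delta))
        if largest_change < tol:  # no further improvement in this sweep
            break
    return alphas, w


# Hypothetical toy data, separable by a plane through the origin.
points = [(2.0, 3.0), (3.0, 3.5), (4.0, 2.0), (-1.0, -2.0), (-2.0, 0.0), (0.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]

alphas, w = dual_coordinate_ascent(points, labels, C=1.0)
predictions = [1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1 for x in points]
```

Only the array of dual variables and the current w are kept in memory between updates, while every sweep still touches every data point, which is the bottleneck noted above.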

Iterative Framework (Optimized)
Optimization: Apply algorithm only to active points, i.e. those points that appear to be support vectors, as long as progress is being made.
Optimized Algorithm:
    while (strings continue to shorten) {
        run the unoptimized algorithm for one iteration
        while (strings continue to shorten)
            for (all i corresponding to active points) {
                Update ai.
                If ai > 0, keep this data point active. Otherwise, remove it.
            }
    }
This results in more loops, but the inner loops are so much faster that it pays off significantly.

Regression
Support vector machines can also be used to solve regression problems.

The Regression Problem
“Close points” may be wrong due to noise only.
Line should be influenced by “real” data, not noise.
Ignore errors from those points which are close!

Support Vector Regression
Traditional support vector regression:
Minimize the error made outside of the tube.
Regularize the fitted plane by minimizing the norm of w.
The parameter C balances two competing goals.
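A minimal sketch of the quantity being minimized, assuming the common epsilon-insensitive form: errors smaller than the tube half-width `epsilon` cost nothing, and larger errors cost only the part outside the tube. The prediction convention w . x - b, the parameter names, and the data are assumptions for illustration.

```python
def epsilon_insensitive_loss(prediction, target, epsilon):
    """Zero inside the tube; grows linearly with the error outside it."""
    return max(0.0, abs(prediction - target) - epsilon)


def svr_objective(w, b, points, targets, C=1.0, epsilon=0.1):
    """0.5*||w||^2 (regularizer) plus C times the total error outside the tube."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    error = sum(
        epsilon_insensitive_loss(
            sum(wi * xi for wi, xi in zip(w, x)) - b, t, epsilon
        )
        for x, t in zip(points, targets)
    )
    return regularizer + C * error


# Hypothetical one-dimensional data and candidate fit y = 1.0 * x.
points = [(0.0,), (1.0,), (2.0,)]
targets = [0.05, 1.0, 2.5]

objective = svr_objective((1.0,), 0.0, points, targets, C=1.0, epsilon=0.1)
```

Here the first two points fall inside the tube and contribute nothing; only the third point, which misses by 0.5, adds 0.4 to the error term.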

My current research
Collaborating with:
Deborah Gross, Carleton College (chemistry)
Raghu Ramakrishnan, UW-Madison (computer sciences)
Jamie Schauer, UW-Madison (atmospheric sciences)
Analyzing data from Aerosol Time-of-Flight Mass Spectrometer (ATOFMS)
Aerosol: "small particle of gunk in air"
Questions we want to answer:
How can we classify safe vs. dangerous?
Can we determine when a sudden change in the air stream has happened?
Can we identify what substances are present in a particular particle?

Questions?
