Data Mining and Machine Learning via Support Vector Machines


Data Mining and Machine Learning via Support Vector Machines Dave Musicant Graphic generated with Lucent Technologies Demonstration 2-D Pattern Recognition Applet at http://svm.research.bell-labs.com/SVT/SVMsvt.html

Outline The Supervised Learning Classification Problem The Support Vector Machine for Classification (linear approaches) Nonlinear SVM approaches Active learning techniques for SVMs Iterative algorithms for solving SVMs SVM Regression Wrapup

Basic Definitions Data Mining: the "non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." -- Usama Fayyad Utilizes techniques from machine learning, databases, and statistics Machine Learning: "concerned with the question of how to construct computer programs that automatically improve with experience." -- Tom Mitchell Fits under the Artificial Intelligence umbrella

Supervised Learning Classification Example: cancer diagnosis. A training set of patients with known diagnoses is used to learn how to classify patients in a test set, where the diagnosis is not known. Each record consists of input data plus a classification; the input data is often easily obtained, whereas the classification is not.

Classification Problem Goal: Use training set + some learning method to produce a predictive model. Use this predictive model to classify new data. Sample applications:

Application: Breast Cancer Diagnosis Research by Mangasarian, Street, and Wolberg

Breast Cancer Diagnosis Separation Research by Mangasarian, Street, and Wolberg

Application: Document Classification The Federalist Papers Written in 1787-1788 by Alexander Hamilton, John Jay, and James Madison to persuade residents of the State of New York to ratify the U.S. Constitution All written under the pseudonym “Publius” Who wrote which of them? Hamilton wrote 56 papers Madison wrote 50 papers 12 disputed papers, generally understood to be written by Hamilton or Madison, but not known which Research by Bosch, Smith

Federalist Papers Classification Graphic by Fung Research by Bosch, Smith

Application: Face Detection Training data is a collection of Faces and NonFaces Rotation and Mirroring added in to provide robustness Image obtained from work by Osuna, Freund, and Girosi at http://www.ai.mit.edu/projects/cbcl/res-area/object-detection/face-detection.html

Face Detection Results Image obtained from "Support Vector Machines: Training and Applications" by Osuna, Freund, and Girosi.

Face Detection Results Image obtained from work by Osuna, Freund, and Girosi at http://www.ai.mit.edu/projects/cbcl/res-area/object-detection/face-detection.html

Simple Linear Perceptron Goal: Find the best line (or hyperplane) to separate the training data. How to formalize? In two dimensions, the equation of the line is given by w1 x1 + w2 x2 + b = 0. Better notation for n dimensions: treat each data point and the coefficients as vectors. Then the equation is given by w'x + b = 0.

Simple Linear Perceptron (cont.) The Simple Linear Perceptron is a classifier as shown in the picture. Points that fall on the right are classified as “1”; points that fall on the left are classified as “-1”. Therefore: using the training set, find a hyperplane (line) so that w'x_i + b > 0 for points in class 1 and w'x_i + b < 0 for points in class -1. This is a good starting point. But we can do better!
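
To make the decision rule concrete, here is a minimal Python sketch; the weight vector w, the offset b, and the test points are invented for illustration and are not from the talk.

import numpy as np

# Hypothetical hyperplane parameters (illustrative only): w'x + b = 0
w = np.array([2.0, -1.0])
b = 0.5

def classify(x):
    # Return +1 if the point falls on the positive side of the hyperplane, else -1.
    return 1 if np.dot(w, x) + b > 0 else -1

print(classify(np.array([1.0, 0.0])))   # lands on the +1 side
print(classify(np.array([-1.0, 3.0])))  # lands on the -1 side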

Finding the Best Plane Not all planes are equal. Which of the two planes shown is better? Both planes accurately classify the training set. The solid green plane is the better choice, since it is further away from the data and is therefore more likely to do well on future test data.

Separating the planes Construct the bounding planes: draw two planes parallel to the classification plane and push them as far apart as possible, until they hit data points. The classification plane whose bounding planes are furthest apart is the best one.

Recap: Finding the Best Plane Details: All points in class 1 should be to the right of bounding plane 1 (w'x_i + b >= 1). All points in class -1 should be to the left of bounding plane -1 (w'x_i + b <= -1). Pick y_i to be +1 or -1 depending on the classification. Then the above two inequalities can be written as one: y_i (w'x_i + b) >= 1. The distance between the bounding planes should be maximized; that distance is given by 2 / ||w||.

The Optimization Problem The previous slide can be rewritten as: minimize (1/2) ||w||^2 subject to y_i (w'x_i + b) >= 1 for all i. This is a mathematical program: an optimization problem subject to constraints. More specifically, it is a quadratic program. There are high-powered software tools for solving this kind of problem (both commercial and academic), but these general-purpose tools are slow for this particular problem.

Data Which is Not Linearly Separable What if a separating plane does not exist? Find the plane that maximizes the margin and minimizes the errors on the training points. Take the original inequality and add a slack variable ξ_i to measure the error: y_i (w'x_i + b) >= 1 - ξ_i, with ξ_i >= 0.

The Support Vector Machine Push the planes apart and minimize the error at the same time: minimize (1/2) ||w||^2 + C Σ_i ξ_i subject to y_i (w'x_i + b) >= 1 - ξ_i and ξ_i >= 0. C is a positive number that is chosen to balance these two goals. This problem is called a Support Vector Machine, or SVM.
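
To show how the trade-off parameter C appears in practice, here is a minimal sketch using scikit-learn's SVC with a linear kernel; the tiny dataset and the choice C = 1.0 are invented for illustration.

import numpy as np
from sklearn.svm import SVC

# Tiny synthetic two-class problem (made up for illustration).
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1],
              [3.0, 3.0], [3.5, 2.8], [4.0, 3.2]])
y = np.array([-1, -1, -1, 1, 1, 1])

# C balances "push the bounding planes apart" against "minimize the slack errors".
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.coef_, clf.intercept_)   # the learned w and b
print(clf.support_vectors_)        # the support vectors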

Terminology Those points that touch the bounding plane, or lie on the wrong side, are called support vectors. If all the data points except the support vectors were removed, the solution would turn out the same. The SVM solution is mathematically equivalent to a force and torque equilibrium (hence the name support vectors).

Example from Carleton College 1850 students 4 year undergraduate liberal arts college Ranked 5th in the nation by US News and World Report 15-20 computer science majors per year All research assistants are full-time undergraduates

Student Research Example Goal: automatically generate “frequently asked questions” list from discussion groups Subgoal #1: Given a corpus of discussion group postings, identify those messages that contain questions Recruit student volunteers to identify questions Learn classification Work by students Sarah Allen, Janet Campbell, Ester Gubbrud, Rachel Kirby, Lillie Kittredge

Building A Training Set

Building A Training Set Which sentences are questions in the following text? From: oehler@yar.cs.wisc.edu (Wonko the Sane) I was recently talking to a possible employer ( mine! :-) ) and he made a reference to a 48-bit graphics computer/image processing system. I seem to remember it being called IMAGE or something akin to that. Anyway, he claimed it had 48-bit color + a 12-bit alpha channel. That's 60 bits of info--what could that possibly be for? Specifically the 48-bit color? That's 280 trillion colors, many more than the human eye can resolve. Is this an anti-aliasing thing? Or is this just some magic number to make it work better with a certain processor.

Representing the training set Each document is a point Each potential word is a column (bag of words) Other pre-processing tricks Remove punctuation Remove "stop words" such as "is", "a", etc. Use stemming to remove "ing" and "ed", etc. from similar words
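
A minimal sketch of this kind of preprocessing with scikit-learn's CountVectorizer; the example sentences are invented, and since the talk does not name its tools, the stemming step is only indicated in a comment.

from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "Is this an anti-aliasing thing?",      # contains a question
    "Thanks for the helpful reference.",    # does not
]

# Bag of words: each document becomes a row, each distinct word a column.
# Punctuation is dropped by the default tokenizer, and English stop words
# ("is", "a", ...) are removed. A stemmer (e.g. from NLTK) could be plugged
# in through the analyzer argument to collapse "asked"/"asking", etc.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())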

Results If you just guess brain-dead: "every message contains a question", get 55% right If you use a Support Vector Machine, get 66.5% of them right What words do you think were strong indicators of questions? anyone, does, any, what, thanks, how, help, know, there, do, question What words do you think were strong contra-indicators of questions? re, sale, m, references, not, your

Beyond lines Some datasets may not be best separated by a plane. SVMs can be extended to nonlinear surfaces also. Generated with Lucent Technologies Demonstration 2-D Pattern Recognition Applet at http://svm.research.bell-labs.com/SVT/SVMsvt.html

Finding nonlinear surfaces How do we modify the algorithm to find nonlinear surfaces? First idea (simple and effective): map each data point into a higher-dimensional space, and find a linear fit there. Example: to find a quadratic surface for x = (x1, x2), map it to the new coordinates (x1^2, x2^2, x1 x2, x1, x2) and use these in the regular linear SVM. A plane in this quadratic space is equivalent to a quadratic surface in our original space.
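
A minimal sketch of this explicit-mapping idea; the quadratic map, the toy data, and the use of scikit-learn's LinearSVC are illustrative choices, not the talk's own implementation.

import numpy as np
from sklearn.svm import LinearSVC

def quadratic_map(X):
    # Map each point (x1, x2) to the new coordinates (x1^2, x2^2, x1*x2, x1, x2).
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1 ** 2, x2 ** 2, x1 * x2, x1, x2])

# Toy data whose classes are separated by a quadratic surface (made up for illustration).
X = np.array([[0.0, 0.0], [0.1, -0.1], [2.0, 2.0], [-2.0, 2.0], [2.0, -2.0], [-2.0, -2.0]])
y = np.array([-1, -1, 1, 1, 1, 1])

# A plane found by a linear SVM in the mapped space corresponds to a
# quadratic surface in the original two-dimensional space.
clf = LinearSVC(C=1.0).fit(quadratic_map(X), y)
print(clf.predict(quadratic_map(np.array([[0.05, 0.0], [3.0, 3.0]]))))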

Problems with this method If the dimensionality of the space is high, there are lots of calculations. For a high-degree polynomial space, the number of coordinate combinations explodes. All these calculations must be done for every training point, and again for each testing point. Infinite-dimensional spaces are impossible to handle explicitly. Nonlinear surfaces can be used without these problems through the use of a kernel function.

The Dual Problem The dual SVM is an alternative approach. Wrap a “string” around the data points of each class. Find the two points, one on each “string”, which are closest together. Connect the dots. The perpendicular bisector of this connection is the best classification plane.

The Dual Variable, or “Importance” Every point on the “string” is a linear combination of the points inside the string; in general, such a point can be written as Σ_i α_i x_i. The α's are referred to as dual variables, and represent the “importance” of each data point.

Two Equivalent Approaches Primal Problem: find the best separating plane. Variables: w, b. Dual Problem: find the closest points on the “strings”. Variables: α. Both problems yield the same classification plane: w, b can be expressed in terms of α, and α can be expressed in terms of w, b.

How to generalize nonlinear fits Traditional SVM: minimize (1/2) ||w||^2 + C Σ_i ξ_i subject to y_i (w'x_i + b) >= 1 - ξ_i, ξ_i >= 0. Dual formulation: maximize Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i'x_j) subject to Σ_i α_i y_i = 0 and 0 <= α_i <= C. We can find w and b in terms of α (in particular, w = Σ_i α_i y_i x_i). But note: we don't need any x_i individually, just scalar products between points.

Kernel function Dual formulation again: maximize Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i'x_j) subject to Σ_i α_i y_i = 0 and 0 <= α_i <= C. Substitute the scalar product x_i'x_j with a kernel function K(x_i, x_j). Using a kernel corresponds to having mapped the data into some high-dimensional space, possibly an infinite one.

Traditional kernels Linear: K(x, y) = x'y. Polynomial: K(x, y) = (x'y + 1)^d. Gaussian: K(x, y) = exp(-||x - y||^2 / (2σ^2)).
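
A minimal numpy sketch of these three kernels; the degree d and the width σ used below are arbitrary illustrative values.

import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def polynomial_kernel(x, y, d=3):
    return (np.dot(x, y) + 1.0) ** d

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
print(linear_kernel(x, y), polynomial_kernel(x, y), gaussian_kernel(x, y))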

Another interpretation Kernels can be thought of as a distance metric. Linear SVM: determine the class by the sign of w'x + b. Nonlinear SVM: determine the class by the sign of Σ_i α_i y_i K(x_i, x) + b. Those support vectors that x is "closest to" influence its class selection.
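
A minimal sketch that evaluates this nonlinear decision rule by hand from a fitted scikit-learn SVC, whose dual_coef_ attribute stores the products α_i y_i for the support vectors; the toy data and the Gaussian-kernel width are invented for illustration.

import numpy as np
from sklearn.svm import SVC

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

def decision(x):
    # sign( sum_i alpha_i * y_i * K(x_i, x) + b ), summing over the support vectors.
    k = np.exp(-gamma * np.sum((clf.support_vectors_ - x) ** 2, axis=1))
    return np.sign(clf.dual_coef_[0] @ k + clf.intercept_[0])

x_new = np.array([3.5, 3.5])
print(decision(x_new), clf.predict([x_new]))   # the two should agree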

Example: Checkerboard

k-Nearest Neighbor Algorithm

SVM on Checkerboard

Active Learning with SVMs Given a set of unlabeled points that I can label at will, how do I choose which one to label next? Common answer: choose a point that is on or close to the current separating hyperplane (Campbell, Cristianini, Smola; Tong & Koller; Schohn & Cohn) Why?

On the hyperplane: Spin 1 Assume data is linearly separable. A point which is on the hyperplane (or at least in the margin) is guaranteed to change the results. (Schohn & Cohn)

On the hyperplane: Spin 2 Intuition suggests that one should grab the point that is most wrong. Problem: we don't know the class of the point yet. If you grab a point that is far from the hyperplane and it turns out to be classified wrong, this would be wonderful. But: points which are far from the hyperplane are the ones which are most likely to be correctly classified. (Campbell, Cristianini, Smola)

Active Learning in Batches What if you want to choose a number of points to label at once? (Brinker) Could choose the n closest points to the hyperplane, but this is not optimal

Heuristic approach instead Assumption: all hyperplanes go through the origin; the authors claim that this can be compensated for with an appropriate choice of kernel. To have maximal effect on the direction of the hyperplane, choose points with the largest angles between them.

Defining angle Let Φ = the mapping to feature space. The cosine of the angle between points x and y in feature space is: cos ∠(Φ(x), Φ(y)) = K(x, y) / sqrt(K(x, x) K(y, y)).

Approach for maximizing angle Introduce an artificial point normal to the existing hyperplane. Choose the next point to be the one that maximizes the angle with this one. Choose each successive point to be the one that maximizes the minimum angle to the previously chosen points (i.e., minimizes the maximum cosine value).

What happened to distance? In practice, use both measures: we want points closest to the plane, and we want points with the largest angular separation from the others. Iterative greedy algorithm: value = λ * (distance to hyperplane) + (1 - λ) * (largest cosine measure to an already selected point). Choose the next point to be the one that minimizes this value. The paper reports results showing the method is fairly robust to varying λ.
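
A minimal sketch of this greedy batch-selection rule; the Gaussian kernel, the value of λ, the candidate pool, and the batch size are all invented for illustration, and the distance term simply uses the SVM decision value.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(20, 2)) + np.array([[2, 2]] * 10 + [[-2, -2]] * 10)
y_labeled = np.array([1] * 10 + [-1] * 10)
X_pool = rng.normal(size=(50, 2)) * 2               # unlabeled candidate points

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X_labeled, y_labeled)

def rbf(a, b):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def cosine(a, b):
    # Cosine of the angle between Phi(a) and Phi(b) in feature space.
    return rbf(a, b) / np.sqrt(rbf(a, a) * rbf(b, b))

lam = 0.5                                           # the balance parameter lambda
selected = []
for _ in range(5):                                  # choose a batch of 5 points
    best_i, best_val = None, np.inf
    for i, x in enumerate(X_pool):
        if i in selected:
            continue
        dist = abs(clf.decision_function([x])[0])   # closeness to the hyperplane
        max_cos = max((cosine(x, X_pool[j]) for j in selected), default=0.0)
        val = lam * dist + (1 - lam) * max_cos      # combined criterion; smaller is better
        if val < best_val:
            best_i, best_val = i, val
    selected.append(best_i)

print("points chosen for labeling:", selected)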

Iterative Algorithms Maintain the “importance,” or dual variable, associated with each data point. This is small, since it is a one-dimensional array of size m. Algorithm: look at each point sequentially and update its importance. (How?) Repeat until there are no further improvements in the goal.

Iterative Framework LSVM, ASVM, SOR, etc. are iterative algorithms on the dual variables. Algorithm (assume that we have m data points):

for (i = 0; i < m; i++)
    a[i] = 0;                                  // Initialize dual variables
while (distance between strings continues to shorten)
    for (i = 0; i < m; i++) {
        Update a[i] according to the update rule (not shown here).
    }

Bottleneck: repeated scans through the dataset. Many of these data points are unimportant.

Iterative Framework (Optimized) Optimization: apply the algorithm only to the active points, i.e. those points that appear to be support vectors, as long as progress is being made. Optimized algorithm:

while (strings continue to shorten) {
    run the unoptimized algorithm for one iteration;
    while (strings continue to shorten)
        for (all i corresponding to active points) {
            Update a[i]. If a[i] > 0, keep this data point active. Otherwise, remove it.
        }
}

This results in more loops, but the inner loops are so much faster that it pays off significantly.
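
A minimal Python sketch in this spirit, using a simple clipped gradient-style update on the dual variables as a stand-in for the LSVM/ASVM/SOR update rules (which the talk does not show); the data, learning rate, stopping rule, and rescan schedule are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(size=(10, 2)) + 2, rng.normal(size=(10, 2)) - 2])
y = np.array([1] * 10 + [-1] * 10)
m = len(y)

K = X @ X.T                       # linear-kernel (Gram) matrix
C, eta, n_sweeps = 1.0, 0.1, 50   # illustrative choices

a = np.zeros(m)                   # dual variables: the "importance" of each point
active = set(range(m))

for sweep in range(n_sweeps):
    for i in list(active):
        # Simplified update: increase a[i] if point i violates its margin,
        # decrease it otherwise, then clip into [0, C].
        margin = y[i] * np.sum(a * y * K[:, i])
        a[i] = np.clip(a[i] + eta * (1.0 - margin), 0.0, C)
        if a[i] == 0.0:
            active.discard(i)     # point looks unimportant: drop it from the active set
    if sweep % 10 == 0:
        active = set(range(m))    # periodically rescan all points, as in the outer loop above

w = (a * y) @ X                   # recover the primal normal vector from the duals
print("nonzero dual variables (support vectors):", np.flatnonzero(a))
print("w =", w)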

Regression Support vector machines can also be used to solve regression problems.

The Regression Problem “Close points” may be wrong due to noise only Line should be influenced by “real” data, not noise Ignore errors from those points which are close!

Support Vector Regression Traditional support vector regression: minimize the error made outside of the ε-tube around the fitted function, and regularize the fitted plane by minimizing the norm of w. The parameter C balances these two competing goals.
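
A minimal sketch using scikit-learn's SVR, whose epsilon parameter is the half-width of the error-free tube and whose C plays the balancing role described above; the data and the parameter values are invented for illustration.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(2)
X = np.linspace(0, 5, 40).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.1, size=40)   # a noisy line

# Points whose residual is smaller than epsilon (inside the tube) contribute no loss;
# C trades off tube violations against the norm of w (the flatness of the fit).
reg = SVR(kernel="linear", C=1.0, epsilon=0.2).fit(X, y)
print(reg.coef_, reg.intercept_)   # recovered slope and offset
print(reg.predict([[2.5]]))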

My current research Collaborating with: Deborah Gross, Carleton College (chemistry) Raghu Ramakrishnan, UW-Madison (computer sciences) Jamie Schauer, UW-Madison (atmospheric sciences) Analyzing data from Aerosol Time-of-Flight Mass Spectrometer (ATOFMS) Aerosol: "small particle of gunk in air" Questions we want to answer: How can we classify safe vs. dangerous? Can we determine when a sudden change in the air stream has happened? Can we identify what substances are present in a particular particle?

Questions?