Sequential Minimal Optimization Advanced Machine Learning Course 2012 Fall Semester Tsinghua University.

Goal Implement the SMO algorithm to classify the given set of documents into one of two classes, +1 or -1. Each document x_i is an N-dimensional feature vector. [Slide figure: the two classes (+1 and -1) separated by a hyperplane with margin m.]

SMO implementation The main parts of the implementation, in order: SMO overview; data preprocessing; KKT conditions; learned function; kernel; heuristics to find alpha2; updating w and b (for the new Lagrange values); testing; submission.

Data Preprocessing The dataset contains two newsgroups, one is baseball and the other is hockey. For feature selection, for each document you can simply select the top 50 words (50 features) with the highest tf-idf values (see Ref 4). Of course, you can do more advanced preprocessing such as stop-word removal and word stemming, or define specific features yourself. In the feature file, each line "[label] [keyword1]:[value1] [keyword2]:[value2] ..." represents one document: label (+1 or -1) is the classification of the document; keyword_i is the global sequence number of a word in the whole dataset; value_i is the tf-idf of the word appearing in the document. The goal of data preprocessing: generate a feature file for the whole dataset, then split it into five equal sets of files, say s1 to s5, for 5-fold cross-validation.
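A minimal sketch of this step, assuming the raw documents are available as strings and using scikit-learn's TfidfVectorizer; the file layout, helper names and the +1/-1 label convention here are illustrative assumptions, not part of the assignment:

    import random
    from sklearn.feature_extraction.text import TfidfVectorizer

    def build_feature_file(docs, labels, out_path, top_k=50):
        # docs: list of document strings; labels: +1 (hockey) / -1 (baseball), an assumed convention
        vec = TfidfVectorizer(stop_words='english')      # optional stop-word removal
        X = vec.fit_transform(docs)                      # rows = documents, columns = words
        with open(out_path, 'w') as f:
            for i, label in enumerate(labels):
                row = X[i].tocoo()
                # keep only this document's top-k tf-idf features
                top = sorted(zip(row.col, row.data), key=lambda p: -p[1])[:top_k]
                feats = ' '.join('%d:%.6f' % (col + 1, val) for col, val in sorted(top))
                f.write('%+d %s\n' % (label, feats))

    def split_five_folds(feature_path):
        # shuffle the labeled lines and write five equal-sized files s1 .. s5
        lines = open(feature_path).read().splitlines()
        random.shuffle(lines)
        for i in range(5):
            with open('s%d' % (i + 1), 'w') as f:
                f.write('\n'.join(lines[i::5]) + '\n')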

Learning & Test Processing Input: feature files s1 to s5. Steps: 1) Set i = 1; take s_i as the test set and the other four files as the training set. 2) Run SMO on the training set (learn w and b for the data points) and store the learnt weights. 3) Using the weights learnt in Step 2, classify the test documents as hockey or baseball. 4) Calculate Precision, Recall and F1 score. 5) Increment i and repeat Steps 1-4 until i > 5.
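A sketch of this cross-validation driver under the same assumptions; smo_train, smo_predict and precision_recall_f1 are hypothetical names standing in for the routines built in the rest of the assignment:

    def load_fold(path):
        # each line: "[label] [keyword]:[tfidf] ..." -> sparse dict plus label
        X, y = [], []
        for line in open(path):
            parts = line.split()
            y.append(int(parts[0]))
            X.append({int(k): float(v) for k, v in (p.split(':') for p in parts[1:])})
        return X, y

    results = []
    for i in range(1, 6):
        test_X, test_y = load_fold('s%d' % i)
        train_X, train_y = [], []
        for j in range(1, 6):
            if j != i:
                Xj, yj = load_fold('s%d' % j)
                train_X += Xj
                train_y += yj
        w, b = smo_train(train_X, train_y, C=0.5, eps=1e-3)   # hypothetical trainer
        preds = [smo_predict(w, b, x) for x in test_X]        # hypothetical predictor
        results.append(precision_recall_f1(preds, test_y))    # hypothetical metric helper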

Training

Objective Function As shown in the lecture, we can: solve the dual more efficiently (fewer unknowns); add a parameter C to allow some misclassifications; and replace x_i^T x_j by a more general kernel term.
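The resulting dual problem (the standard soft-margin form the rest of the slides assume) is:

    \max_{\alpha}\ \sum_{i=1}^{m} \alpha_i \;-\; \tfrac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
    \quad\text{subject to}\quad 0 \le \alpha_i \le C\ \ \forall i, \qquad \sum_{i=1}^{m} \alpha_i y_i = 0.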

Intuitive Introduction to SMO SMO is essentially doing the same thing as the perceptron learning algorithm: finding a linear separator by adjusting weights on misclassified examples. Unlike the perceptron, SMO has to maintain the constraint sum_i alpha_i y_i = 0, so when SMO adjusts the alpha of one example, it must also adjust the alpha of another. The number of alphas equals the number of training examples.

SMO Algorithm Input: C (say 0.5), kernel (linear), error cache, epsilon (tolerance). Initialize b, w and all alphas to 0. Repeat until KKT is satisfied (to within epsilon):
- Find an example ex1 that violates KKT (prefer unbound examples, 0 < alpha_i < C, and choose randomly among those).
- Choose a second example ex2. Prefer one that maximizes the step size (in practice, it is faster to just maximize |E1 - E2|). If that fails to produce a change, randomly choose an unbound example; if that fails, randomly choose any example; if that fails, re-choose ex1.
- Update alpha1 and alpha2 in one step.
- Compute the new threshold b.
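A compact sketch of the outer loop, following the pseudocode in Reference 2; examine_example and unbound_indices are passed in as placeholders for the routines described on the later slides:

    def smo_outer_loop(n_examples, examine_example, unbound_indices):
        """Alternate full passes and passes over unbound examples until no alpha changes."""
        num_changed = 0
        examine_all = True
        while num_changed > 0 or examine_all:
            num_changed = 0
            indices = range(n_examples) if examine_all else unbound_indices()
            for i1 in indices:
                num_changed += examine_example(i1)   # returns 1 if a joint step succeeded
            if examine_all:
                examine_all = False
            elif num_changed == 0:
                examine_all = True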

Karush-Kuhn-Tucker (KKT) Conditions It is necessary and sufficient for a solution to our objective that all alphas satisfy the following: an alpha is 0 iff that example is correctly labeled with room to spare; an alpha is C iff that example is incorrectly labeled or lies inside the margin; an alpha strictly between 0 and C ("unbound") iff that example is "barely" correctly labeled (it is a support vector on the margin). We just check the KKT conditions to within some small epsilon.
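Written out with u_i = w \cdot x_i - b denoting the learned output, the conditions are:

    \alpha_i = 0 \;\Rightarrow\; y_i u_i \ge 1, \qquad
    0 < \alpha_i < C \;\Rightarrow\; y_i u_i = 1, \qquad
    \alpha_i = C \;\Rightarrow\; y_i u_i \le 1.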

Constraints on the pair of alphas being optimized: the equality constraint (alpha_1 y_1 + alpha_2 y_2 = constant) causes them to lie on a diagonal line, while the inequality constraint (0 <= alpha_i <= C) causes the Lagrange multipliers to lie in the box [0, C] x [0, C].

Notations SMO spends most of the time adjusting the alphas of the non-boundary examples, so an error cache is maintained for them. error_cache: collection holding the error of each training example. i1 and i2 are the indexes corresponding to ex1 and ex2. Variables associated with i1 end with 1 (say alpha1 or alph1) and those associated with i2 end with 2 (say alpha2 or alph2). w: global weight vector, whose size is the number of unique words in the given dataset. x_i is the i-th training example. eta: the second derivative (curvature) of the objective function along the constraint line. a2: the new alpha2 value. L & H: the lower and upper ends of the feasibility range of the new alpha a2 (needed because the new alpha, found from the derivative of the objective function, must be clipped to the constraints).

Learned function: learned_func(x_k) = w·x_k - b. The error for example k is E_k = learned_func(x_k) - y_k, the difference between the predicted and actual output.

Kernel For the linear kernel, K(x_i1, x_i2) is the dot product of the two documents' feature vectors, computed over the features (words) common to i1 and i2, since the vectors are sparse.
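A sketch of that computation over the sparse {word_id: tfidf} dictionaries assumed in the earlier sketches:

    def linear_kernel(x1, x2):
        """Dot product over the word ids that appear in both sparse vectors."""
        if len(x2) < len(x1):
            x1, x2 = x2, x1                      # iterate over the shorter vector
        return sum(v * x2[k] for k, v in x1.items() if k in x2)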

Main Function Given the first alpha, examineExample(i1) first checks whether it violates the KKT conditions by more than the tolerance; if it does, the two alphas are jointly optimized by calling takeStep(i1, i2).

Heuristics to choose alpha2 Choose alpha2 such that |E1 - E2| is maximized (an approximation to maximizing the step size).
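A sketch of that heuristic over the error cache (error_cache and the set of unbound indexes are the structures from the Notations slide; their concrete form here is assumed):

    def choose_second(i1, E1, error_cache, unbound_indices):
        """Pick i2 among unbound examples so that |E1 - E2| is maximal."""
        best_i2, best_gap = -1, -1.0
        for i2 in unbound_indices:
            if i2 == i1:
                continue
            gap = abs(E1 - error_cache[i2])
            if gap > best_gap:
                best_i2, best_gap = i2, gap
        return best_i2                           # -1 means fall back to the random choices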

Feasibility range of alph2 Even before computing the new alpha2 value, we have to find its feasibility range (L and H). Refer to page 11 of Reference 3 for the derivation.
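The resulting bounds from Reference 2 are:

    y_1 \ne y_2:\quad L = \max(0,\ \alpha_2 - \alpha_1), \qquad H = \min(C,\ C + \alpha_2 - \alpha_1)
    y_1 = y_2:\quad L = \max(0,\ \alpha_1 + \alpha_2 - C), \qquad H = \min(C,\ \alpha_1 + \alpha_2)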

New a2 (new alpha2) The new alpha2 comes from optimizing the objective function along the constraint line (see pages 9-11 of Reference 3 for the derivation). Definite case: the curvature eta is positive, so the unconstrained optimum is taken and then clipped to [L, H]. Indefinite case: SMO moves the Lagrange multipliers to the end point that has the lowest value of the objective function, evaluated at each end (L and H). Read page 8 of Ref 2; Lobj and Hobj: page 22 of Ref 3.
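In the definite (usual) case the update from Reference 2 is:

    \eta = K(x_1,x_1) + K(x_2,x_2) - 2K(x_1,x_2), \qquad
    \alpha_2^{new} = \alpha_2 + \frac{y_2 (E_1 - E_2)}{\eta}

clipped to the feasibility range, with alpha1 then adjusted to keep the equality constraint:

    \alpha_2^{new,clipped} = \min(H,\ \max(L,\ \alpha_2^{new})), \qquad
    \alpha_1^{new} = \alpha_1 + y_1 y_2\,(\alpha_2 - \alpha_2^{new,clipped})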

Updating threshold b A new b is calculated after every step so that the KKT conditions are fulfilled for both optimized alphas. Use b1 when the new alph1 is non-boundary (b1 is valid because it forces the output on x_1 to be y_1), use b2 when the new alph2 is non-boundary, and when both new alphas are at the bounds take the average of b1 and b2. Refer to page 9 of Reference 2.
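The two candidate thresholds from Reference 2 (with u = w \cdot x - b) are:

    b_1 = E_1 + y_1(\alpha_1^{new} - \alpha_1)K(x_1,x_1) + y_2(\alpha_2^{new,clipped} - \alpha_2)K(x_1,x_2) + b
    b_2 = E_2 + y_1(\alpha_1^{new} - \alpha_1)K(x_1,x_2) + y_2(\alpha_2^{new,clipped} - \alpha_2)K(x_2,x_2) + b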

Updating the error cache using the new Lagrange multipliers 1) The error cache entries of all other non-bound training examples are updated using the change in alph1, alph2 and b. 2) The error cache entries of alph1 and alph2 are set to zero (since the two examples have just been optimized together).
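Concretely, writing \Delta\alpha_1, \Delta\alpha_2 and \Delta b for the changes just computed, each cached error of a non-bound example k (other than i1, i2) moves by

    \Delta E_k = y_1\,\Delta\alpha_1\,K(x_1, x_k) + y_2\,\Delta\alpha_2\,K(x_2, x_k) - \Delta b,

which follows from u_k = \sum_j y_j \alpha_j K(x_j, x_k) - b.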

Updating w t1 and t2 are the changes in alpha1 and alpha2, respectively. The global w should be updated with respect to ex1 and ex2.
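For the linear kernel this is the explicit weight-vector update from Reference 2:

    w^{new} = w + y_1 t_1 x_1 + y_2 t_2 x_2, \qquad t_1 = \alpha_1^{new} - \alpha_1, \quad t_2 = \alpha_2^{new,clipped} - \alpha_2.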

Testing

SVM prediction with w and b Now we have w and b calculated. For a new x_i, svm_prediction = w·x_i - b, and the predicted class (+1 or -1) is its sign.

Precision & Recall Precision is the probability that a (randomly selected) retrieved document is relevant. Recall is the probability that a (randomly selected) relevant document is retrieved by the search. The F1 score is the harmonic mean of precision and recall.
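In terms of true/false positives and negatives on the test fold:

    \text{Precision} = \frac{TP}{TP + FP}, \qquad
    \text{Recall} = \frac{TP}{TP + FN}, \qquad
    F_1 = \frac{2\,\text{Precision}\cdot\text{Recall}}{\text{Precision} + \text{Recall}}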

Points to note A too-small eps value (chosen for more accuracy) may sometimes trap SMO in a loop that never exits. After each joint optimization of alpha1 and alpha2, the error cache entries of these two examples should be set to zero, and those of all other non-bound examples should be updated. At each optimization step, w and b should be updated.

Reference Code Pseudo-code – Reference 2 C++ code – Reference 3

Submission Implementation: refer to the complete algorithm above for the implementation. Report Precision, Recall and the F1 measure; report the time cost. Submit a .tar file including: 1) source code with necessary comments, covering (a) data preprocessing and (b) SMO training, testing and evaluation; 2) a ReadMe.txt explaining (a) how you extract features for the dataset, the different C values you tried and the results, (b) the main functions, e.g., selecting alpha1 and alpha2, updating alpha1 and alpha2, updating w and b, (c) how to run your code, from data preprocessing to training/testing/evaluation, step by step, and (d) the final average precision, recall, F1 score and time cost.

References 1. SVM by Sequential Minimal Optimization (SMO). Algorithm by John Platt. Lecture by David Page. pages.cs.wisc.edu/~dpage/cs760/MLlectureSMO.ppt 2. Platt (1998), Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines. 3. Sequential Minimal Optimization for SVM (derivation notes and C++ code). 4. Java code for SMO.

Thank You!