LARGE MARGIN CLASSIFIERS David Kauchak CS 451 – Fall 2013

Admin Assignment 5

Midterm
- Download from the course web page when you're ready to take it
- 2 hours to complete
- Must be handed in by 11:59pm Friday, Oct. 18
- Can use: class notes, your notes, the book, your assignments, and Wikipedia
- You may not use: your neighbor, anything else on the web, etc.

What can be covered
- Anything we've talked about in class
- Anything in the reading (note: these are not necessarily the same things)
- Anything we've covered in the assignments

Midterm topics
Machine learning basics
- different types of learning problems
- feature-based machine learning
- data assumptions/data generating distribution
Classification problem setup
Proper experimentation
- train/dev/test
- evaluation/accuracy/training error
- optimizing hyperparameters

Midterm topics
Learning algorithms
- Decision trees
- k-NN
- Perceptron
- Gradient descent
Algorithm properties
- training/learning
- rationale/why it works
- classifying
- hyperparameters
- avoiding overfitting
- algorithm variants/improvements

Midterm topics
Geometric view of data
- distances between examples
- decision boundaries
Features
- example features
- removing erroneous features/picking good features
- challenges with high-dimensional data
- feature normalization
Other pre-processing
- outlier detection

Midterm topics
Comparing algorithms
- n-fold cross validation
- leave-one-out validation
- bootstrap resampling
- t-test
Imbalanced data
- evaluation
- precision/recall, F1, AUC
- subsampling
- oversampling
- weighted binary classifiers

Midterm topics
Multiclass classification
- modifying existing approaches
- using binary classifiers
  - OVA
  - AVA
  - tree-based
- micro- vs. macro-averaging
Ranking
- using a binary classifier
- using a weighted binary classifier
- evaluation

Midterm topics
Gradient descent
- 0/1 loss
- surrogate loss functions
- convexity
- minimization algorithm
- regularization
- different regularizers
- p-norms
Misc
- good coding habits
- JavaDoc

Midterm general advice
2 hours goes by fast!
- Don't plan on looking everything up
- Look up equations, algorithms, random details
- Make sure you understand the key concepts
- Don't spend too much time on any one question
- Skip questions you're stuck on and come back to them
- Watch the time as you go
Be careful on the T/F questions
For written questions
- think before you write
- make your argument/analysis clear and concise

Midterm topics
General ML concepts
- avoiding overfitting
- algorithm comparison
- algorithm/model bias
- model-based machine learning
- online vs. offline learning

Which hyperplane?
Two main variations in linear classifiers:
- which hyperplane they choose when the data is linearly separable
- how they handle data that is not linearly separable

Linear approaches so far
Perceptron:
- separable:
- non-separable:
Gradient descent:
- separable:
- non-separable:

Linear approaches so far
Perceptron:
- separable: finds some hyperplane that separates the data
- non-separable: will continue to adjust as it iterates through the examples; the final hyperplane will depend on which examples it saw recently
Gradient descent:
- separable and non-separable: finds the hyperplane that minimizes the objective function (loss + regularization)
Which hyperplane is this?
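
To make the perceptron point above concrete, here is a minimal sketch (not course code; the toy data and the train_perceptron helper are made up for illustration): on separable data it stops at some separating hyperplane, and which one depends on the order in which it sees the examples.

import numpy as np

def train_perceptron(X, y, epochs=100):
    """Basic perceptron: update w whenever an example is misclassified."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:   # misclassified (or on the boundary)
                w += y_i * x_i              # nudge the hyperplane toward x_i
                mistakes += 1
        if mistakes == 0:                   # separable data: stop at *some* separator
            break
    return w

# toy separable data; the last feature is a constant 1 so w's last entry acts as the bias b
X = np.array([[1.0, 2.0, 1.0], [2.0, 1.5, 1.0], [-1.0, -1.0, 1.0], [-2.0, -0.5, 1.0]])
y = np.array([1, 1, -1, -1])

print(train_perceptron(X, y))              # one separating hyperplane...
print(train_perceptron(X[::-1], y[::-1]))  # ...a different one when the examples are seen in another order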

Which hyperplane would you choose?

Large margin classifiers Choose the line where the distance to the nearest point(s) is as large as possible. This distance is the margin.

Large margin classifiers The margin of a classifier is the distance to the closest points of either class Large margin classifiers attempt to maximize this margin

Large margin classifier setup
Select the hyperplane with the largest margin where the points are classified correctly!
Set up as a constrained optimization problem:
maximize margin(w,b)
subject to: y_i(wx_i + b) > 0 for all i
What does this say?

Large margin classifier setup
maximize margin(w,b)
subject to: y_i(wx_i + b) > 0 for all i
vs.
maximize margin(w,b)
subject to: y_i(wx_i + b) ≥ c for all i (for some c > 0)
Are these equivalent?

Large margin classifier setup
maximize margin(w,b)
subject to: y_i(wx_i + b) ≥ c for all i
w=(0.5,1), w=(1,2), w=(2,4), … all define the same hyperplane; only the scale of w (and with it the value of wx + b) changes, so w and b can always be rescaled to satisfy the constraint for any c > 0.

Large margin classifier setup
maximize margin(w,b)
subject to: y_i(wx_i + b) ≥ c for all i
We'll assume c = 1; however, any c > 0 works.
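
A quick numeric check of the rescaling argument (a sketch with made-up points, not from the slides): scaling w multiplies y_i(wx_i + b) by the same factor without changing its sign, so on separable data the constraint can be met for any c > 0 just by rescaling w and b.

import numpy as np

X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])  # toy points
y = np.array([1, 1, -1, -1])                                        # labels
b = 0.0                                                             # assumed bias; in general b is rescaled along with w

for w in [np.array([0.5, 1.0]), np.array([1.0, 2.0]), np.array([2.0, 4.0])]:
    functional = y * (X @ w + b)          # y_i(wx_i + b) for every example
    print(w, functional)                  # the values scale with w; the signs (the classifications) never change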

Measuring the margin How do we calculate the margin?

Support vectors For any separating hyperplane, there exists some set of “closest points”. These are called the support vectors.

Measuring the margin The margin is the distance to the support vectors, i.e. the “closest points”, on either side of the hyperplane

Distance from the hyperplane In the (f1, f2) plane with w = (1,2), how far away is the point (-1,-2) from the hyperplane?

Distance from the hyperplane And with w = (1,2), how far away is the point (1,1) from the hyperplane?

Distance from the hyperplane For the point (1,1) with w = (1,2): the (signed) distance from a point x to the hyperplane is (wx + b) / ||w||, i.e. wx + b computed with length normalized weight vectors.

Distance from the hyperplane Why length normalized? Consider the point (1,1) with w = (1,2), w = (2,4), and w = (0.5,1): these all describe the same hyperplane, and wx + b grows or shrinks with the scale of w, but the length normalized distance (wx + b) / ||w|| does not.
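
A small sanity check of the length normalization (a sketch; the bias b = 0 is an assumption, and only the point and w values echo the slides):

import numpy as np

x = np.array([1.0, 1.0])   # the point (1,1) from the slides
b = 0.0                    # assumed bias; the slides don't give one

for w in [np.array([0.5, 1.0]), np.array([1.0, 2.0]), np.array([2.0, 4.0])]:
    raw = np.dot(w, x) + b                       # wx + b: depends on the scale of w
    normalized = raw / np.linalg.norm(w)         # (wx + b) / ||w||: the actual distance
    print(w, raw, round(normalized, 3))          # raw changes, normalized stays ~1.342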

Measuring the margin Thought experiment: someone gives you the optimal support vectors. Where is the max margin hyperplane?

Measuring the margin Margin = (d+ - d-)/2 The max margin hyperplane is halfway in between the positive support vectors and the negative support vectors. Why?

Measuring the margin Margin = (d+ - d-)/2 The max margin hyperplane is halfway in between the positive support vectors and the negative support vectors:
- all support vectors are the same distance from it
- to maximize the margin, the hyperplane should be directly in between

Measuring the margin Margin = (d+ - d-)/2 What is wx + b for the support vectors? Hint: the constraints were y_i(wx_i + b) ≥ 1 for all i.

Measuring the margin With the constraints y_i(wx_i + b) ≥ 1 for all i, the support vectors have y_i(wx_i + b) = 1 (i.e. wx + b = 1 for positive support vectors and wx + b = -1 for negative ones). Otherwise, we could make the margin larger!

Measuring the margin Margin = (d+ - d-)/2, where d+ is the signed distance to the closest positive example and d- is the (negative) signed distance to the closest negative example.
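
A small worked check of this definition (a sketch; the data and the hyperplane w, b are made up, chosen so that y_i(wx_i + b) ≥ 1 holds):

import numpy as np

# toy separable data and a separating hyperplane satisfying y_i(wx_i + b) >= 1
X = np.array([[2.0, 2.0], [4.0, 1.0], [-2.0, -2.0], [-1.0, -4.0]])
y = np.array([1, 1, -1, -1])
w = np.array([0.25, 0.25])
b = 0.0

signed = (X @ w + b) / np.linalg.norm(w)   # signed distance of every point from the hyperplane
d_plus = signed[y == 1].min()              # closest positive example
d_minus = signed[y == -1].max()            # closest negative example (a negative number)
print((d_plus - d_minus) / 2)              # the margin of this hyperplane, about 2.83

The printed margin is exactly 1/||w|| for this w, which is where the next slides are headed.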

Maximizing the margin
maximize margin(w,b)
subject to: y_i(wx_i + b) ≥ 1 for all i
Maximizing the margin is equivalent to minimizing ||w||! (subject to the separating constraints)

Maximizing the margin
minimize ||w||
subject to: y_i(wx_i + b) ≥ 1 for all i
The constraints:
1. make sure the data is separable
2. encourage w to be larger (once the data is separable)
The minimization criterion wants w to be as small as possible.
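
Filling in the step that connects the margin formula to ||w|| (a sketch of the standard argument, using the c = 1 constraint from earlier):
For a positive support vector x+: wx+ + b = 1, so d+ = (wx+ + b)/||w|| = 1/||w||.
For a negative support vector x-: wx- + b = -1, so d- = (wx- + b)/||w|| = -1/||w||.
Margin = (d+ - d-)/2 = (1/||w|| + 1/||w||)/2 = 1/||w||.
So maximizing the margin means maximizing 1/||w||, i.e. minimizing ||w||.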

Maximizing the margin: the real problem
minimize ||w||^2
subject to: y_i(wx_i + b) ≥ 1 for all i
Why the squared?

Maximizing the margin: the real problem
minimize ||w||^2
subject to: y_i(wx_i + b) ≥ 1 for all i
Minimizing ||w|| is equivalent to minimizing ||w||^2, and the sum of the squared weights is a convex function!

Support vector machine problem
minimize ||w||^2
subject to: y_i(wx_i + b) ≥ 1 for all i
This is a version of a quadratic optimization problem:
- maximize/minimize a quadratic function
- subject to a set of linear constraints
There are many, many variants of solving this problem (we'll see one in a bit).
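
One generic way to solve this quadratic program directly, as a sketch (cvxpy is not part of the course materials, the toy data is made up, and the course will look at a different solution approach):

import numpy as np
import cvxpy as cp  # a general-purpose convex solver; any QP solver would do

# toy linearly separable data (made up for illustration)
X = np.array([[2.0, 2.0], [4.0, 1.0], [-2.0, -2.0], [-1.0, -4.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(2)
b = cp.Variable()

# minimize ||w||^2 subject to y_i(wx_i + b) >= 1 for all i
constraints = [cp.multiply(y, X @ w + b) >= 1]
problem = cp.Problem(cp.Minimize(cp.sum_squares(w)), constraints)
problem.solve()

print("w =", w.value, "b =", b.value)
print("margin =", 1 / np.linalg.norm(w.value))   # about 2.83 for this toy data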