MULTICLASS CONTINUED AND RANKING David Kauchak CS 451 – Fall 2013.



Admin: Assignment 4, course feedback, midterm

Java tip for the day: private vs. public vs. protected

Debugging tips

Multiclass classification
Same setup: we have a set of features for each example, but rather than just two labels we now have 3 or more (here: apple, orange, banana, pineapple)

Black box approach to multiclass
Abstraction: we have a generic binary classifier that outputs +1 or -1 (optionally also a confidence/score). Can we solve our multiclass problem with this?

Approach 1: One vs. all (OVA)
Training: for each label L, pose as a binary problem
 all examples with label L are positive
 all other examples are negative
This yields one classifier per label, e.g. apple vs. not, orange vs. not, banana vs. not
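The OVA training recipe can be sketched in a few lines; `train_ova` and the pluggable `train_binary` learner are hypothetical names, standing in for whatever binary classifier (e.g. a perceptron) you have:

```python
def train_ova(examples, labels, train_binary):
    """One-vs-all training: one binary classifier per label.

    examples: list of feature vectors; labels: parallel list of class labels.
    train_binary: any function mapping a list of (features, +1/-1) pairs
    to a trained binary classifier.
    """
    classifiers = {}
    for L in set(labels):
        # examples with label L are positive, all others negative
        binary_data = [(x, +1 if y == L else -1)
                       for x, y in zip(examples, labels)]
        classifiers[L] = train_binary(binary_data)
    return classifiers
```

Each classifier sees the full training set, just relabeled; this is why OVA subproblems become unbalanced when there are many labels.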

OVA: linear classifiers (e.g. perceptron)
Train three classifiers: pineapple vs. not, apple vs. not, banana vs. not
How do we classify a new example? In some regions more than one classifier claims it (banana OR pineapple?); in others, none do

OVA: classify
 If the classifier doesn’t provide a confidence (this is rare) and there is ambiguity, pick one of the labels in conflict
 Otherwise: pick the most confident positive; if none vote positive, pick the least confident negative
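The confidence-based rule can be sketched as below; `ova_classify` is a made-up helper that takes each label's signed score from its label-vs-not classifier:

```python
def ova_classify(confidences):
    """confidences: dict mapping each label to the signed score of that
    label-vs-not classifier (positive = voted for the label).
    Pick the most confident positive; if none vote positive,
    pick the least confident negative."""
    positives = {L: s for L, s in confidences.items() if s > 0}
    if positives:
        return max(positives, key=positives.get)
    # all scores negative: the least confident negative is the one
    # closest to zero, i.e. the maximum of the (negative) scores
    return max(confidences, key=confidences.get)
```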

OVA: linear classifiers (e.g. perceptron)
pineapple vs. not, apple vs. not, banana vs. not
What does the decision boundary look like?

OVA: linear classifiers (e.g. perceptron)
The three hyperplanes carve the feature space into regions, one per label: BANANA, APPLE, PINEAPPLE

OVA: classify, perceptron
 If the classifier doesn’t provide a confidence (this is rare) and there is ambiguity, pick the majority label in conflict
 Otherwise: pick the most confident positive; if none vote positive, pick the least confident negative
How do we calculate confidence for the perceptron? Distance from the hyperplane
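For a perceptron, a natural confidence is the raw score w·x + b, whose magnitude is proportional to the distance from the hyperplane (and equal to it when ||w|| = 1). A minimal sketch, with a hypothetical helper name:

```python
def perceptron_confidence(w, b, x):
    """Signed score w·x + b: the sign is the prediction, the magnitude
    is proportional to the distance from the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b
```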

Approach 2: All vs. all (AVA)
Training: for each pair of labels, train a classifier to distinguish between them
for j = 1 to number of labels:
  for k = j+1 to number of labels:
    train a classifier to distinguish between label j and label k:
    - create a dataset with all examples with label j labeled positive and all examples with label k labeled negative
    - train the classifier on this subset of the data
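The double loop above can be sketched as follows; `train_ava` is a hypothetical name and `train_binary` is any binary learner:

```python
from itertools import combinations

def train_ava(examples, labels, train_binary):
    """All-vs-all training: one classifier per pair of labels,
    with label j positive and label k negative."""
    classifiers = {}
    for j, k in combinations(sorted(set(labels)), 2):
        # only the examples with label j or label k participate
        pair_data = [(x, +1 if y == j else -1)
                     for x, y in zip(examples, labels) if y in (j, k)]
        classifiers[(j, k)] = train_binary(pair_data)
    return classifiers
```

With K labels this trains K(K-1)/2 classifiers, each on only the two labels' worth of data.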

AVA training visualized
From the apple/banana/orange data, train three pairwise classifiers: apple vs orange, apple vs banana, orange vs banana

AVA classify
Run each pairwise classifier on the example: apple vs orange, apple vs banana, orange vs banana. What class?

AVA classify
apple vs orange → orange; apple vs banana → apple; orange vs banana → orange
Orange wins with two of the three votes. In general?

AVA classify
To classify example e, classify with each classifier f_jk. We have a few options to choose the final class:
- Take a majority vote
- Take a weighted vote based on confidence: y = f_jk(e); score_j += y; score_k -= y
How does this work? Here we’re assuming that y encompasses both the prediction (+1, -1) and the confidence, i.e. y = prediction * confidence.

AVA classify
Take a weighted vote based on confidence: y = f_jk(e); score_j += y; score_k -= y
If y is positive, the classifier thought the example was of type j:
- raise the score for j
- lower the score for k
If y is negative, the classifier thought it was of type k:
- lower the score for j
- raise the score for k
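The weighted vote can be sketched as below; `ava_classify` is a made-up helper whose input maps each pair (j, k) to y = f_jk(e):

```python
from collections import defaultdict

def ava_classify(pair_scores):
    """pair_scores: {(j, k): y} where y = prediction * confidence,
    with positive y meaning label j and negative y meaning label k."""
    scores = defaultdict(float)
    for (j, k), y in pair_scores.items():
        scores[j] += y   # positive y raises j, negative y lowers it
        scores[k] -= y   # and the opposite for k
    return max(scores, key=scores.get)
```

On the earlier fruit example, with hypothetical confidences (apple vs orange = -1.0, i.e. orange; apple vs banana = +0.5, i.e. apple; orange vs banana = +2.0, i.e. orange), the weighted vote picks orange.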

OVA vs. AVA Train/classify runtime? Error? Assume each binary classifier makes an error with probability ε

OVA vs. AVA
Train time: AVA learns more classifiers; however, they’re trained on much smaller data sets, which tends to make it faster if the labels are equally balanced
Test time: AVA has more classifiers
Error (see the book for more justification):
- AVA trains on more balanced data sets
- AVA tests with more classifiers and therefore has more chances for errors
- Theoretically: OVA: ε(number of labels - 1); AVA: 2ε(number of labels - 1)
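Writing K for the number of labels and ε for each binary classifier's error rate, the stated bounds are:

```latex
\epsilon_{\text{OVA}} \le (K - 1)\,\epsilon,
\qquad
\epsilon_{\text{AVA}} \le 2(K - 1)\,\epsilon
```

For example, with K = 4 and ε = 0.05 these give at most 0.15 for OVA and 0.30 for AVA.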

Approach 3: Divide and conquer
Repeatedly split the set of labels in two and train a binary classifier for each split, classifying with a sequence of binary decisions
Pros/cons vs. AVA?

Multiclass summary
If using a binary classifier, the most common approach is OVA
Otherwise, use a classifier that allows for multiple labels:
 Decision trees and k-NN work reasonably well
 We’ll see a few more in the coming weeks that will often work better

Multiclass evaluation
label:      apple, orange, apple, banana, pineapple
prediction: orange, apple, pineapple, banana, pineapple
How should we evaluate?

Multiclass evaluation
label:      apple, orange, apple, banana, pineapple
prediction: orange, apple, pineapple, banana, pineapple
Accuracy: 4/6

Multiclass evaluation: imbalanced data
label:      apple, banana, pineapple
prediction: orange, apple, pineapple, banana, pineapple
Any problems? … Data imbalance!

Macroaveraging vs. microaveraging
microaveraging: average over examples (this is the “normal” way of calculating)
macroaveraging: calculate the evaluation score (e.g. accuracy) for each label, then average over labels
What effect does this have? Why include it?

Macroaveraging vs. microaveraging
microaveraging: average over examples (this is the “normal” way of calculating)
macroaveraging: calculate the evaluation score (e.g. accuracy) for each label, then average over labels
- Puts more weight/emphasis on rarer labels
- Allows another dimension of analysis

Macroaveraging vs. microaveraging
microaveraging: average over examples
macroaveraging: calculate the evaluation score (e.g. accuracy) for each label, then average over labels
label:      apple, orange, apple, banana, pineapple
prediction: orange, apple, pineapple, banana, pineapple

Macroaveraging vs. microaveraging
microaveraging: 4/6
macroaveraging: apple = 1/2, orange = 1/1, banana = 1/2, pineapple = 1/1
total = (1/2 + 1/1 + 1/2 + 1/1)/4 = 3/4
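Both averages take only a few lines to compute. A sketch: the helper name `micro_macro_accuracy` is made up, and the label/prediction pairing below is one assignment consistent with the numbers above (micro 4/6, macro 3/4), not necessarily the slide's exact table:

```python
from collections import defaultdict

def micro_macro_accuracy(labels, predictions):
    """Micro: fraction correct over all examples.
    Macro: accuracy per label, then averaged over labels."""
    per_label = defaultdict(lambda: [0, 0])  # label -> [correct, total]
    for y, p in zip(labels, predictions):
        per_label[y][0] += int(p == y)
        per_label[y][1] += 1
    micro = sum(c for c, _ in per_label.values()) / len(labels)
    macro = sum(c / n for c, n in per_label.values()) / len(per_label)
    return micro, macro

labels = ["apple", "orange", "apple", "banana", "banana", "pineapple"]
preds  = ["orange", "orange", "apple", "pineapple", "banana", "pineapple"]
micro, macro = micro_macro_accuracy(labels, preds)  # micro = 4/6, macro = 3/4
```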

Confusion matrix (genres: Classic, Country, Disco, Hiphop, Jazz, Rock)
Entry (i, j) represents the number of examples with label i that were predicted to have label j
Another way to understand both the data and the classifier
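Building such a matrix is nearly a one-liner with `collections.Counter`; a sketch with hypothetical toy data:

```python
from collections import Counter

def confusion_matrix(labels, predictions):
    """counts[(i, j)] = number of examples with true label i
    predicted to have label j; diagonal entries are the correct ones."""
    return Counter(zip(labels, predictions))

# hypothetical toy data
cm = confusion_matrix(["apple", "orange", "apple", "banana"],
                      ["orange", "orange", "apple", "banana"])
```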

Confusion matrix BLAST classification of proteins in 850 superfamilies

Multilabel vs. multiclass classification. Three sets of labels/categories:
(1) Is it edible? Is it sweet? Is it a fruit?
(2) Is it a banana? Is it an apple? Is it an orange? Is it a pineapple?
(3) Is it a banana? Is it yellow? Is it sweet? Is it round?
Any difference in these labels/categories?

Multilabel vs. multiclass classification. Different structures:
(1) Nested/Hierarchical
(2) Exclusive/Multiclass
(3) General/Structured

Multiclass vs. multilabel
Multiclass: each example has exactly one label
Multilabel: each example has zero or more labels; also called annotation
Multilabel applications?

Multilabel
Image annotation
Document topics
Labeling people in a picture
Medical diagnosis

Multiclass vs. multilabel
Multiclass: each example has exactly one label
Multilabel: each example has zero or more labels; also called annotation
Which of our approaches work for multilabel?