Fast and accurate text classification via multiple linear discriminant projections. Soumen Chakrabarti, Shourya Roy, Mahesh Soundalgekar, IIT Bombay. www.cse.iitb.ac.in/~soumen

Fast and accurate text classification via multiple linear discriminant projections
Soumen Chakrabarti, Shourya Roy, Mahesh Soundalgekar
IIT Bombay

Introduction
- Supervised learning of labels from high-dimensional data has many applications
  - Text topic and genre classification
- Many classification algorithms are known
  - Support vector machines (SVM): most accurate
  - Maximum entropy classifiers
  - Naïve Bayes classifiers: fastest and simplest
- Problem: SVMs
  - are difficult to understand and implement
  - take time almost quadratic in #instances

Our contributions
- Simple Iterated Multiple Projection on Lines (SIMPL)
  - Trivial to understand and code (600 lines)
  - O(#dimensions) or less memory
  - Only sequential scans of the training data
  - Almost as fast as naïve Bayes (NB)
  - As accurate as SVMs, sometimes better
- Insights into the best choice of linear discriminants for text classification
  - How do the discriminants chosen by NB, SIMPL and SVM differ?

Naïve Bayes classifiers
- For simplicity assume two classes {-1, 1}
- Notation: t = term, d = document, c = class, |d| = length of document d, n(d,t) = number of times t occurs in d
- Model parameters
  - Priors Pr(c = -1) and Pr(c = 1)
  - θ_{c,t} = fractional rate at which t occurs in documents labeled with class c
- The probability of a given d being generated from c is the multinomial
  Pr(d | c) = ( |d| choose {n(d,t)} ) · ∏_t θ_{c,t}^{n(d,t)}

Naïve Bayes is a linear discriminant
- When choosing between the two labels, the terms involving document length cancel out
- Taking logs, we compare
  ∑_t n(d,t) log θ_{1,t} + log Pr(c=1)   against   ∑_t n(d,t) log θ_{-1,t} + log Pr(c=-1)
- The first part is a dot-product and the second part is a fixed offset, so we compare α·d + b against 0, where α_t = log(θ_{1,t}/θ_{-1,t}) and b = log(Pr(c=1)/Pr(c=-1))
- Simple join-aggregate, very fast
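A minimal sketch of this reduction (assuming a dense NumPy term-count matrix and Laplace smoothing, neither of which the slides specify): the NB parameters collapse into a single weight vector α and offset b, and prediction is just sign(α·d + b).

```python
import numpy as np

def train_nb(X, y, smooth=1.0):
    """Estimate naive Bayes parameters from a term-count matrix X (docs x terms)
    and labels y in {-1, +1}; return the equivalent linear discriminant (alpha, b)."""
    pos, neg = X[y == 1], X[y == -1]
    # theta_{c,t}: smoothed fraction of term occurrences in class c due to term t
    theta_pos = (pos.sum(axis=0) + smooth) / (pos.sum() + smooth * X.shape[1])
    theta_neg = (neg.sum(axis=0) + smooth) / (neg.sum() + smooth * X.shape[1])
    alpha = np.log(theta_pos) - np.log(theta_neg)            # per-term weights
    b = np.log((y == 1).mean()) - np.log((y == -1).mean())   # prior log-odds
    return alpha, b

def nb_predict(X, alpha, b):
    # The NB decision is just the sign of a dot-product plus a fixed offset.
    return np.where(X @ alpha + b >= 0, 1, -1)
```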

Many features, each fairly noisy
- Sort features in order of decreasing correlation with the class labels
- Build separate classifiers on features ranked 1-100, 101-200, etc. (see the sketch after this list)
- Even features ranked 5000 to … provide lift beyond picking a random class
- Most features carry tiny amounts of useful, noisy and possibly redundant information; how should we combine them?
- Naïve Bayes, LSVM and maximum entropy all take linear combinations of term frequencies
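A hedged sketch of the ranking-and-blocking experiment above; Pearson correlation and a block size of 100 are assumptions, since the slide does not name the exact measure.

```python
import numpy as np

def rank_features_by_correlation(X, y):
    """Rank columns of a term-count matrix by |correlation| with labels y in {-1, +1}."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    corr = np.abs(Xc.T @ yc) / denom
    return np.argsort(-corr)                 # best features first

def feature_blocks(order, block=100):
    """Yield blocks of 100 consecutive ranked features (ranks 1-100, 101-200, ...);
    a separate classifier can be trained on each block to measure its lift."""
    for start in range(0, len(order), block):
        yield order[start:start + block]
```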

Linear support vector machine (LSVM)
- Want a vector α and a constant b such that for each document d_i:
  - if c_i = 1 then α·d_i + b ≥ 1
  - if c_i = -1 then α·d_i + b ≤ -1
  - i.e., c_i (α·d_i + b) ≥ 1
- If points d_1 and d_2 touch the slab from opposite sides, the distance between their projections on α is 2/||α||
- Find α to maximize this
[Figure: separating hyperplane α·d + b = 0 with margin planes α·d + b = 1 (all training instances on that side have c = 1) and α·d + b = -1 (all training instances on that side have c = -1); d_1 and d_2 are support vectors touching the slab.]
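For orientation only, a tiny sketch fitting such a linear separator with scikit-learn's LinearSVC (not the SVM-light code benchmarked in this talk) and reading off α and b:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy term-count matrix (4 documents x 3 terms) and labels in {-1, +1}.
X = np.array([[3, 0, 1],
              [2, 1, 0],
              [0, 2, 3],
              [1, 3, 2]])
y = np.array([1, 1, -1, -1])

clf = LinearSVC(C=1.0).fit(X, y)
alpha, b = clf.coef_.ravel(), clf.intercept_[0]   # the learned (alpha, b)
print(np.sign(X @ alpha + b))                     # same decision rule as above
```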

SVM implementations
- The separator α is a linear sum of support vectors
- Complex, non-linear optimization; about 6000 lines of C code (SVM-light)
- Approximately n^1.7 to n^1.9 time with n training vectors
- Footprint can be large
  - Usually holds all training vectors in memory
  - Also a cache of dot-products of vector pairs
- No I/O-optimized implementation known; we measured 40% of the time in seek+transfer

Fisher's linear discriminant (FLD)
- Used in pattern recognition for ages
- Two point sets X (c = 1) and Y (c = -1); x ∈ X, y ∈ Y are points in m dimensions
- Projection on a unit vector α is x·α, y·α
- Goal is to find a direction α that maximizes the square of the distance between the projected means, relative to the variances of the projected X-points and Y-points:
  J(α) = (μ_X·α − μ_Y·α)² / (σ²_X(α) + σ²_Y(α))
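A direct transcription of J(α) as a NumPy sketch (dense arrays assumed; the talk's one-pass streaming computation is not shown here):

```python
import numpy as np

def fisher_criterion(alpha, X, Y):
    """Evaluate J(alpha) for point sets X (class +1) and Y (class -1), given as
    (n x m) arrays; alpha need not be unit length, since J is scale-invariant."""
    px, py = X @ alpha, Y @ alpha            # projected points
    num = (px.mean() - py.mean()) ** 2       # squared distance between projected means
    den = px.var() + py.var()                # variances of projected X- and Y-points
    return num / den
```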

Some observations
- Hyperplanes can often completely separate training labels for text; more complex separators do not help (Joachims)
- NB is biased: α_t depends only on term t; SVM and Fisher do not make this assumption
- If you find Fisher's discriminant over only the support vectors, you get the SVM separator (Shashua)
- Even random projections preserve inter-point distances with high probability (Frankl+Maehara 1988, Kleinberg 1997)

Hill-climbing
- Iteratively update α_new ← α_old + η ∇J(α), where η is a "learning rate"
- ∇J(α) = (∂J/∂α_1, …, ∂J/∂α_m)^T, where α = (α_1, …, α_m)^T
- Need only 5m + O(1) accumulators for a simple, one-pass update
- Can also be written as a sort-merge-accumulate

Convergence
- Initialize α to the vector joining the positive and negative centroids
- Stop if J(α) cannot be increased in three successive iterations
- J(α) converges in 10-20 iterations; not sensitive to problem size
- On … documents, LSVM takes … seconds, while hill-climbing converges in 200 seconds
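Continuing the sketch above (it reuses fisher_criterion), a hedged rendering of the hill-climbing loop with the centroid-difference start and the "no improvement for three iterations" stopping rule; the gradient is a dense restatement, not the 5m-accumulator streaming update described on the slide.

```python
import numpy as np

def fisher_gradient(alpha, X, Y):
    """Analytic gradient of J(alpha) = num/den with num = (gap of projected means)^2
    and den = sum of projected variances."""
    px, py = X @ alpha, Y @ alpha
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    D = px.mean() - py.mean()
    V = px.var() + py.var()
    d_num = 2.0 * D * (mu_x - mu_y)                            # gradient of the numerator
    d_den = (2.0 / len(px)) * (px - px.mean()) @ (X - mu_x) \
          + (2.0 / len(py)) * (py - py.mean()) @ (Y - mu_y)    # gradient of the denominator
    return (d_num * V - D * D * d_den) / (V * V)

def hill_climb_fld(X, Y, eta=0.1, max_iter=200):
    """Gradient-ascent FLD: start from the centroid-difference vector and stop
    when J has not improved for three successive iterations."""
    alpha = X.mean(axis=0) - Y.mean(axis=0)
    best, stale = fisher_criterion(alpha, X, Y), 0
    for _ in range(max_iter):
        alpha = alpha + eta * fisher_gradient(alpha, X, Y)
        j = fisher_criterion(alpha, X, Y)
        stale = 0 if j > best else stale + 1
        best = max(best, j)
        if stale >= 3:
            break
    return alpha / np.linalg.norm(alpha)
```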

Multiple discriminants
- Separable data points: SVM succeeds, but FLD may fail to separate them completely
- Idea
  - Remove training points outside the gray confusion zone
  - Find another FLD for the surviving points only
- 2-3 FLDs suffice for almost complete separation!
[Figure: successive FLDs shrink the confusion zone, e.g. 7074 → 230 → 2 unresolved points.]

SIMPL (only 600 lines of C++)
- Repeat for k = 0, 1, …
  - Find α^(k), the Fisher discriminant for the current set of training instances
  - Project the training instances onto α^(k)
  - Remove points well separated by α^(k)
- …while ≥ 1 point from each class survives
- Orthogonalize the vectors α^(0), α^(1), α^(2), …
- Project all training points onto the space spanned by the orthogonal α's
- Induce a decision tree on the projected points
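A compact sketch of this loop as I read the slide (it reuses hill_climb_fld from above); the "well separated" removal test, the Gram-Schmidt orthogonalization and the scikit-learn decision tree are stand-in assumptions, not the paper's exact choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def simpl_fit(X, y, max_dirs=5):
    """SIMPL sketch: repeatedly fit an FLD on the still-confused points,
    orthogonalize the directions, project, and grow a decision tree."""
    dirs, keep = [], np.ones(len(y), dtype=bool)
    for _ in range(max_dirs):
        Xk, yk = X[keep], y[keep]
        if (yk == 1).sum() == 0 or (yk == -1).sum() == 0:
            break                                    # one class exhausted: stop
        a = hill_climb_fld(Xk[yk == 1], Xk[yk == -1])
        dirs.append(a)
        p = X @ a
        # Assumed confusion-zone test: keep only points projecting between the class means.
        lo, hi = sorted((p[keep & (y == 1)].mean(), p[keep & (y == -1)].mean()))
        keep = keep & (p > lo) & (p < hi)
    # Gram-Schmidt orthogonalization of the collected directions
    basis = []
    for a in dirs:
        for b in basis:
            a = a - (a @ b) * b
        n = np.linalg.norm(a)
        if n > 1e-12:
            basis.append(a / n)
    Z = X @ np.array(basis).T                        # project onto the spanned subspace
    return np.array(basis), DecisionTreeClassifier().fit(Z, y)
```

To classify a new document one would project it onto the returned basis and run it through the tree.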

Robustness of the stopping decision
- Compute α^(0) to convergence
- Versus: run only half the iterations required for convergence, then find α^(1), … as usual
- Later α's can cover for slop in the earlier α's, while saving time in the costly early-α updates
- Later α's take negligible time

Accuracy
- Large improvement beyond naïve Bayes
- We tuned parameters in SVM to give "SVM-best"
- SIMPL often beats SVM with default parameters
- SIMPL is almost always within 5% of SVM-best
- SIMPL even beats SVM-best in some cases, especially when the problem is not linearly separable

Performance
- SIMPL is linear-time and CPU-bound
- LSVM spends 35-60% of its time in I/O and cache management
- LSVM takes 2 orders of magnitude more time for … documents

Summary and future work
- SIMPL: a new classifier for high-dimensional data
  - Low memory footprint, sequential scans
  - Orders of magnitude faster than LSVM
  - Often as accurate as LSVM, sometimes better
  - An efficient "feature space transformer"
- How will SIMPL behave for non-textual, high-dimensional data?
- Can we analyze SIMPL?
  - LSVM is theoretically sound and more general
  - When will SIMPL match LSVM/SVM?