Using Error-Correcting Codes For Text Classification
Rayid Ghani

Outline
- Introduction to ECOC
- Intuition & Motivation
- Some Questions
- Experimental Results
- Semi-Theoretical Model
- Types of Codes
- Drawbacks
- Conclusions

Introduction
- Decompose a multiclass classification problem into multiple binary problems
- One-Per-Class approach (moderately expensive)
- All-Pairs (very expensive)
- Distributed Output Code (efficient, but what about performance?)
- Error-Correcting Output Codes (?)
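To make that cost comparison concrete, here is a minimal Python sketch (the 105 classes and 63-bit code are taken from the experiments later in the talk; the helper name is hypothetical) of how many binary problems each decomposition creates:

```python
# Hypothetical helper: count the binary problems each decomposition trains for m classes.
def num_binary_problems(m, ecoc_code_length):
    return {
        "one-per-class": m,              # one binary problem per class
        "all-pairs": m * (m - 1) // 2,   # one problem per pair of classes
        "ecoc": ecoc_code_length,        # one problem per column of the code matrix
    }

print(num_binary_problems(m=105, ecoc_code_length=63))
# {'one-per-class': 105, 'all-pairs': 5460, 'ecoc': 63}
```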

Is it a good idea?
- Larger margin for error, since errors can now be "corrected"
- One-per-class is a code with minimum Hamming distance (HD) = 2
- Distributed codes have low HD
- The individual binary problems can be harder than before
- Useless unless the number of classes > 5

Training ECOC
- Given m distinct classes
- Create an m x n binary matrix M
- Each class is assigned ONE row of M
- Each column of the matrix divides the classes into TWO groups
- Train the base classifiers to learn the n binary problems
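A minimal training sketch of this procedure, assuming scikit-learn's MultinomialNB as the base learner and a randomly generated code matrix (both illustrative choices, not necessarily the exact setup used in the talk):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def train_ecoc(X, y, classes, n_bits, seed=0):
    """Train one binary classifier per column of an m x n code matrix."""
    rng = np.random.default_rng(seed)
    m = len(classes)
    # Random 0/1 code matrix; algebraic (e.g. BCH) codes can be substituted here.
    M = rng.integers(0, 2, size=(m, n_bits))
    class_to_row = {c: i for i, c in enumerate(classes)}
    classifiers = []
    for j in range(n_bits):
        # Relabel every training example with the bit its class has in column j.
        bit_labels = np.array([M[class_to_row[label], j] for label in y])
        classifiers.append(MultinomialNB().fit(X, bit_labels))
    return M, classifiers
```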

Testing ECOC
- To test a new instance, apply each of the n classifiers to it
- Combine the predictions to obtain a binary string (codeword) for the new point
- Classify to the class with the nearest codeword (Hamming distance is usually used as the distance measure)
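A matching decoding sketch, continuing the hypothetical train_ecoc helper above (decode by Hamming distance to the nearest class codeword):

```python
import numpy as np

def predict_ecoc(X_new, M, classifiers, classes):
    """Assign each test instance to the class whose codeword is nearest in Hamming distance."""
    # Stack the n bit predictions into one codeword per test instance.
    bits = np.column_stack([clf.predict(X_new) for clf in classifiers])
    predictions = []
    for codeword in bits:
        hamming = (M != codeword).sum(axis=1)  # distance to every class row
        predictions.append(classes[int(hamming.argmin())])
    return predictions
```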

ECOC - Picture
[Figure: a 4 x 5 code matrix over classes A, B, C, D with one binary classifier f1-f5 per column; the slides step through training each column and then decoding a test point X.]

Questions
- How well does it work?
- How long should the code be?
- Do we need a lot of training data?
- What kind of codes can we use?
- Are there intelligent ways of creating the code?

Previous Work
- Combined with boosting: ADABOOST.OC (Schapire, 1997), (Guruswami & Sahai, 1999)
- Local learners (Ricci & Aha, 1997)
- Text classification (Berger, 1999)

Experimental Setup
- Generate the code: BCH codes
- Choose a base learner: the Naive Bayes classifier, as used in text classification tasks (McCallum & Nigam, 1998)

Dataset
- Industry Sector dataset: company web pages classified into 105 economic sectors
- Standard stoplist, no stemming
- Skip all MIME headers and HTML tags
- Experimental approach similar to McCallum et al. (1998) for comparison purposes

Results
Classification accuracies on five random train-test splits of the Industry Sector dataset. ECOC is 88% accurate!

Results
Industry Sector dataset, classification accuracy:

  Naive Bayes       66.1%
  Shrinkage (1)     76%
  ME (2)            79%
  ME w/ Prior (3)   81.1%
  ECOC 63-bit       88.5%

ECOC reduces the error of the Naive Bayes classifier by 66%.
(1) McCallum et al. 1998. (2, 3) Nigam et al. 1999.

The Longer the Better!
Table 2: Average classification accuracy on 5 random train-test splits of the Industry Sector dataset, with the vocabulary selected using Information Gain.
- Longer codes mean larger codeword separation
- The minimum Hamming distance of a code C is the smallest distance between any pair of distinct codewords in C
- If the minimum Hamming distance is h, then the code can correct ⌊(h-1)/2⌋ errors
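A small sketch (helper names are hypothetical) that computes a code's minimum Hamming distance and the number of bit errors it can therefore correct; it confirms, for example, that the one-per-class (identity) code has h = 2 and corrects nothing:

```python
import numpy as np
from itertools import combinations

def min_hamming_distance(M):
    """Smallest Hamming distance between any pair of distinct codewords (rows of M)."""
    return min(int((M[i] != M[j]).sum()) for i, j in combinations(range(len(M)), 2))

def correctable_errors(M):
    # A code with minimum Hamming distance h corrects floor((h - 1) / 2) bit errors.
    return (min_hamming_distance(M) - 1) // 2

one_per_class = np.eye(4, dtype=int)
print(min_hamming_distance(one_per_class), correctable_errors(one_per_class))  # 2 0
```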

Size Matters?

Size does NOT matter!

Semi-Theoretical Model
- Model ECOC by a binomial distribution B(n, p)
- n = length of the code
- p = probability of each bit being classified incorrectly
[Table columns: # of Bits, H_min, E_max, P_ave, Accuracy]
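Written out, the model estimates accuracy as the probability that at most E_max = ⌊(H_min - 1)/2⌋ of the n bits are misclassified, i.e. the binomial tail P(X <= E_max) with X ~ B(n, p). A minimal sketch, assuming SciPy is available and using illustrative parameter values rather than the ones from the table above:

```python
from scipy.stats import binom

def ecoc_accuracy_estimate(n_bits, h_min, p_bit_error):
    """Binomial-model accuracy: correct whenever at most floor((h_min - 1) / 2) bits flip."""
    e_max = (h_min - 1) // 2
    return binom.cdf(e_max, n_bits, p_bit_error)

# Illustrative values only (not taken from the talk's table).
print(ecoc_accuracy_estimate(n_bits=63, h_min=31, p_bit_error=0.2))
```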

Types of Codes
- Data-Independent: Algebraic, Random, Hand-Constructed
- Data-Dependent: Adaptive

What is a Good Code?
- Row separation
- Column separation (independence of errors for each binary classifier)
- Efficiency (for long codes)
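A sketch (hypothetical function name) that measures both separation criteria for a candidate code matrix; following the usual ECOC convention, columns are compared up to complementation, since flipping a column's bits defines the same binary problem with the labels swapped:

```python
import numpy as np
from itertools import combinations

def code_separation(M):
    """Minimum pairwise Hamming distance between rows, and between columns
    (each column is compared against other columns and their complements)."""
    M = np.asarray(M)
    row_sep = min(int((M[i] != M[j]).sum())
                  for i, j in combinations(range(M.shape[0]), 2))
    cols = M.T
    col_sep = min(min(int((cols[i] != cols[j]).sum()),   # distance to the column
                      int((cols[i] == cols[j]).sum()))   # distance to its complement
                  for i, j in combinations(range(cols.shape[0]), 2))
    return row_sep, col_sep

# A random 10 x 15 code typically shows modest row separation and some weak columns.
rng = np.random.default_rng(0)
print(code_separation(rng.integers(0, 2, size=(10, 15))))
```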

Choosing Codes

              Random                        Algebraic
  Row Sep     On average (for long codes)   Guaranteed
  Col Sep     On average (for long codes)   Can be guaranteed
  Efficiency  No                            Yes

Experimental Results

  Code            Min Row HD   Max Row HD   Min Col HD   Max Col HD   Error Rate
  15-bit BCH      ...          ...          ...          ...          ...%
  19-bit Hybrid   ...          ...          ...          ...          ...%
  15-bit Random   2 (1.5)      ...          ...          ...          ...%

Interesting Observations
The Naive Bayes classifier does not give good probability estimates; using ECOC results in better estimates.

Drawbacks
- Can be computationally expensive
- Random codes throw away the real-world nature of the data by picking random partitions to create artificial binary problems

Future Work
- Combine ECOC with Co-Training
- Automatically construct optimal / adaptive codes

Conclusion
- Improves classification accuracy considerably!
- Can be used when training data is sparse
- Algebraic codes perform better than random codes for a given code length
- Hand-constructed codes are not the answer