Using Error-Correcting Codes for Efficient Text Categorization with a Large Number of Categories Rayid Ghani Center for Automated Learning & Discovery.

Presentation transcript:

Using Error-Correcting Codes for Efficient Text Categorization with a Large Number of Categories Rayid Ghani Center for Automated Learning & Discovery Carnegie Mellon University

Some Recent Work
- Learning from Sequences of fMRI Brain Images (with Tom Mitchell)
- Learning to automatically build language-specific corpora from the web (with Rosie Jones & Dunja Mladenic)
- Effect of Smoothing on Naive Bayes for Text Classification (with Tong, IBM Research)
- Hypertext Categorization using links and extracted information (with Sean Slattery & Yiming Yang)
- Hybrids of EM & Co-Training for semi-supervised learning (with Kamal Nigam)
- Error-Correcting Output Codes for Text Classification

Text Categorization
- Numerous applications: search engines/portals, customer service routing, ...
- Domains: topics, genres, languages
- $$$ making

Problems  Practical applications such as web portal deal with a large number of categories  A lot of labeled examples are needed for training the system

How do people deal with a large number of classes?
- Use fast multiclass algorithms (Naïve Bayes): builds one model per class
- Use binary classification algorithms (SVMs) and break an n-class problem into n binary problems
- What happens with a 1000-class problem?
- Can we do better?

ECOC to the Rescue!
- An n-class problem can be solved by solving ⌈log₂ n⌉ binary problems
- More efficient than one-per-class
- Does it actually perform better?
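
To make the efficiency claim concrete, here is a minimal illustration (the 105-class count anticipates the Industry Sector dataset used later; the one-per-class baseline trains one binary model per class):

```python
import math

n_classes = 105                                   # e.g. the Industry Sector dataset
one_per_class = n_classes                         # one binary model per class
ecoc_min_bits = math.ceil(math.log2(n_classes))   # shortest code giving unique codewords

print(one_per_class, ecoc_min_bits)               # 105 vs 7 binary problems
```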

What is ECOC?
- Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995)
- Use a learner to learn the binary problems

Training ECOC / Testing ECOC
[figure: a code matrix with classes A-D as rows and binary classifiers f1-f4 as columns; at test time an instance X receives one bit from each classifier]
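
As a concrete sketch of the training step pictured above, the following Python trains one binary Naive Bayes classifier per code column; the random code generator and helper names are illustrative assumptions (the experiments later use BCH codes instead of a random code):

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def make_random_code(n_classes, n_bits, seed=0):
    """Random binary code matrix: one row (codeword) per class, one column per binary problem."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, size=(n_classes, n_bits))

def train_ecoc(X, y, code):
    """Train one binary classifier per column of the code matrix.

    X: document-term count matrix; y: integer class labels indexing the rows of `code`.
    """
    classifiers = []
    for bit in range(code.shape[1]):
        relabeled = code[y, bit]                  # each document gets the bit of its class's codeword
        classifiers.append(MultinomialNB().fit(X, relabeled))
    return classifiers
```

In the Industry Sector setting described later, `code` would be a 105 x 63 BCH code matrix rather than the random one sketched here.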

ECOC - Picture
[figure, built up over several slides: classes A-D and their codewords over the binary classifiers f1-f4, ending with a test instance X and its predicted bits]

ECOC works, but…
- Increased code length = increased accuracy
- Increased code length = increased computational cost

[figure: classification performance vs. efficiency, with Naïve Bayes and ECOC (as used in Berger 99) plotted and the GOAL being high performance at high efficiency]

Choosing the codewords
- Random? [Berger 1999, James 1999]: asymptotically good (the longer the better), but the computational cost is very high
- Use coding theory for good error-correcting codes? [Dietterich & Bakiri 1995]: guaranteed properties for a fixed-length code

Experimental Setup
- Generate the code: BCH codes
- Choose a base learner: Naive Bayes classifier as used in text classification tasks (McCallum & Nigam 1998)

Text Classification with Naïve Bayes
- "Bag of Words" document representation
- Estimate parameters of the generative model (see the formulas below)
- Naïve Bayes classification (see the formulas below)
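
The equations on this slide were images and did not survive into the transcript. Below is the standard multinomial Naive Bayes formulation from McCallum & Nigam (1998), which is presumably what the slide shows; the notation is my own.

```latex
% Smoothed parameter estimates, where N(w_t, d_i) is the count of word w_t in
% document d_i, V is the vocabulary, and D is the set of training documents:
\hat{P}(w_t \mid c_j) =
  \frac{1 + \sum_{d_i \in c_j} N(w_t, d_i)}
       {|V| + \sum_{s=1}^{|V|} \sum_{d_i \in c_j} N(w_s, d_i)}
\qquad
\hat{P}(c_j) = \frac{|\{ d_i \in c_j \}|}{|D|}

% Classification of a new document d containing words w_{d,1}, \dots, w_{d,|d|}:
c^{*} = \arg\max_{c_j} \; \hat{P}(c_j) \prod_{k=1}^{|d|} \hat{P}(w_{d,k} \mid c_j)
```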

Industry Sector Dataset: company web pages classified into 105 economic sectors [McCallum et al. 1998, Ghani 2000]

Results - Industry Sector Data Set
Naïve Bayes: 66.1% | Shrinkage (1): 76% | MaxEnt (2): 79% | MaxEnt w/ Prior (3): 81.1% | ECOC 63-bit: 88.5%
ECOC reduces the error of the Naïve Bayes classifier by 66% with no increase in computational cost.
1. (McCallum et al. 1998)  2, 3. (Nigam et al. 1999)

ECOC for better Precision

[figure: classification performance vs. efficiency, with NB and ECOC (as used in Berger 99) plotted; the original GOAL and a New Goal are marked]

Solutions  Design codewords that minimize cost and maximize “performance”  Investigate the assignment of codewords to classes  Learn the decoding function  Incorporate unlabeled data into ECOC

Size Matters?

What happens with sparse data?

Use unlabeled data with a large number of classes
- How? Use EM: mixed results
- Think again! Use Co-Training: disastrous results
- Think one more time

How to use unlabeled data?
- Current learning algorithms that use unlabeled data (EM, Co-Training) don't work well with a large number of categories
- ECOC works great with a large number of classes, but there is no framework for using unlabeled data

ECOC + Co-Training = ECoTrain
- ECOC decomposes multiclass problems into binary problems
- Co-Training works great with binary problems
- ECOC + Co-Training = learn each binary problem in ECOC with Co-Training (see the sketch below)
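
A rough sketch of what co-training a single ECOC bit could look like, assuming two feature views and scikit-learn's multinomial Naive Bayes; the function name, the choice of views, and the confidence heuristic are illustrative assumptions, not the exact ECoTrain procedure:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain_bit(X1_l, X2_l, y_l, X1_u, X2_u, rounds=10, per_view=5):
    """Co-train one ECOC bit (a binary problem) using two feature views.

    X1_*/X2_* are dense count matrices for the two views (e.g. page text vs.
    anchor text); y_l holds the 0/1 bit labels of the labeled documents.
    A Blum & Mitchell-style sketch: each view labels its most confident
    unlabeled documents and adds them to the shared labeled pool.
    """
    pool = list(range(X1_u.shape[0]))              # indices of still-unlabeled docs
    for _ in range(rounds):
        if not pool:
            break
        clf1 = MultinomialNB().fit(X1_l, y_l)
        clf2 = MultinomialNB().fit(X2_l, y_l)
        picked, labels = [], []
        for clf, X_view in ((clf1, X1_u), (clf2, X2_u)):
            proba = clf.predict_proba(X_view[pool])
            for pos in np.argsort(proba.max(axis=1))[-per_view:]:
                idx = pool[pos]
                if idx not in picked:              # each view pseudo-labels its top picks
                    picked.append(idx)
                    labels.append(clf.classes_[proba[pos].argmax()])
        X1_l = np.vstack([X1_l, X1_u[picked]])
        X2_l = np.vstack([X2_l, X2_u[picked]])
        y_l = np.concatenate([y_l, labels])
        pool = [i for i in pool if i not in set(picked)]
    return MultinomialNB().fit(X1_l, y_l), MultinomialNB().fit(X2_l, y_l)

# ECoTrain would run this loop once per column of the ECOC code matrix,
# relabeling y_l with that column's bit for each class (as in train_ecoc above).
```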

ECOC + Co-Training - Results (105-class problem)
Columns: 300L + 0U per class | 50L + 250U per class | 5L + 295U per class
Rows: Naïve Bayes and ECOC 15-bit (use no unlabeled data); EM, Co-Training, and ECoTrain (ECOC + Co-Training) (use unlabeled data)
[the accuracy figures were not preserved in the transcript]

What Next?
- Use an improved version of co-training (gradient descent): less prone to random fluctuations, uses all unlabeled data at every iteration
- Use Co-EM (Nigam & Ghani 2000) - a hybrid of EM and Co-Training

Potential Drawbacks
- Random codes throw away the real-world nature of the data by picking random partitions to create artificial binary problems

Summary  Use ECOC for efficient text classification with a large number of categories  Increase Accuracy & Efficiency  Use Unlabeled data by combining ECOC and Co-Training  Generalize to domain-independent classification tasks involving a large number of categories

Semi-Theoretical Model
Model ECOC by a binomial distribution B(n, p):
- n = length of the code
- p = probability of each bit being classified incorrectly

[table accompanying the semi-theoretical model: columns # of Bits, H_min, E_max, P_ave, Accuracy; the entries were not preserved in the transcript]

The Longer the Better!
[Table 2: average classification accuracy on 5 random train-test splits of the Industry Sector dataset, with the vocabulary selected using Information Gain; entries not preserved in the transcript]
- Longer codes mean larger codeword separation
- The minimum Hamming distance of a code C is the smallest distance between any pair of distinct codewords in C
- If the minimum Hamming distance is h, the code can correct ⌊(h-1)/2⌋ errors (see the sketch below)
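
A small sketch of how the semi-theoretical model from the preceding slides combines with this error-correction bound (my own illustration; the numbers in the example call are made up, not taken from the table):

```python
from math import comb, floor

def predicted_accuracy(n, p, h):
    """P(correct) under B(n, p): at most floor((h-1)/2) of the n bits may be wrong."""
    t = floor((h - 1) / 2)           # number of bit errors the code can correct
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(t + 1))

# Illustrative numbers only: a 63-bit code with minimum Hamming distance 31
# and a 20% per-bit error rate.
print(predicted_accuracy(63, 0.2, 31))
```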

Types of Codes
- Data-independent: algebraic, random, hand-constructed
- Data-dependent: adaptive

ECOC + Co-Training = ECoTrain
- ECOC decomposes multiclass problems into binary problems
- Co-Training works great with binary problems
- ECOC + Co-Training = learn each binary problem in ECOC with Co-Training
- Preliminary results: not so great! (very sensitive to the initial labeled documents)

What is a Good Code?
- Row separation
- Column separation (independence of errors for each binary classifier)
- Efficiency (for long codes)

Choosing Codes
                     Random                        Algebraic
Row separation       on average, for long codes    guaranteed
Column separation    on average, for long codes    can be guaranteed
Efficiency           no                            yes

Experimental Results
[table: columns Code, Min Row HD, Max Row HD, Min Col HD, Max Col HD, Error Rate; rows 15-bit BCH, 19-bit Hybrid, 15-bit Random; the numeric entries were not preserved in the transcript]

Interesting Observations
- The Naïve Bayes classifier does not give good probability estimates; using ECOC results in better estimates

Testing ECOC  To test a new instance Apply each of the n classifiers to the new instance Combine the predictions to obtain a binary string(codeword) for the new point Classify to the class with the nearest codeword (usually hamming distance is used as the distance measure)

The Decoding Step  Standard: Map to the nearest codeword according to hamming distance  Can we do better?

The Real Question?
- Tradeoff between the "learnability" of the binary problems and the error-correcting power of the code

Codeword assignment  Standard Procedure: Assign codewords to classes randomly  Can we do better?

Goal of Current Research
Improve classification performance without increasing cost:
- Design short codes that perform well
- Develop algorithms that increase performance without affecting code length

Previous Results  Performance increases with length of code  Gives the same percentage increase in performance over NB regardless of training set size  BCH Codes > Random Codes > Hand- constructed Codes

Others have shown that ECOC
- Works great with arbitrarily long codes
- Longer codes = more error-correcting power = better performance
- Longer codes = more computational cost

ECOC to the Rescue!
- An n-class problem can be solved by solving ⌈log₂ n⌉ binary problems
- More efficient than one-per-class
- Does it actually perform better?

Previous Results - Industry Sector Data Set
Naïve Bayes: 66.1% | Shrinkage (1): 76% | MaxEnt (2): 79% | MaxEnt w/ Prior (3): 81.1% | ECOC 63-bit: 88.5%
ECOC reduces the error of the Naïve Bayes classifier by 66% with no increase in computational cost.
1. (McCallum et al. 1998)  2, 3. (Nigam et al. 1999)  (Ghani 2000)

Design codewords  Maximize Performance (Accuracy, Precision, Recall, F1?)  Minimize length of codes  Search in the space of codewords through gradient descent G=Error + Code_Length

Codeword Assignment  Generate the confusion matrix and use that to assign the most confusable classes the codewords that are farthest apart  Pros Focusing on confusable classes more can help  Cons Individual binary problems can be very hard

The Decoding Step  Weight the individual classifiers according to their training accuracies and do weighted majority decoding.  Pose the decoding as a separate learning problem and use regression/Neural Network