Efficient Text Categorization with a Large Number of Categories
Rayid Ghani
KDD Project Proposal

Text Categorization
- Numerous applications: search engines/portals, customer service routing, ...
- Domains: topics, genres, languages
- Making $$$

How do people deal with a large number of classes?
- Use fast multiclass algorithms (Naïve Bayes), which build one model per class
- Use binary classification algorithms (SVMs) and break an n-class problem into n binary problems (a one-vs-rest sketch follows below)
- What happens with a 1000-class problem?
- Can we do better?
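
For concreteness, here is what the one-vs-rest decomposition looks like in code. This is only an illustrative sketch using scikit-learn and toy data; the classifier, features, and documents are assumptions, not anything from the original proposal.

```python
# One-vs-rest baseline: an n-class problem becomes n binary "class vs. rest" problems.
# Toy data and scikit-learn usage are assumptions for illustration only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

docs = ["cheap flights to boston", "quarterly earnings report", "soccer match highlights"]
labels = ["travel", "finance", "sports"]

# With 1000 classes this would train 1000 LinearSVC models, one per class.
clf = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
clf.fit(docs, labels)
print(clf.predict(["earnings report for the quarter"]))
```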

ECOC to the Rescue!
- An n-class problem can be solved by solving log2(n) binary problems
- More efficient than one-per-class
- Does it actually perform better?
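
A back-of-the-envelope way to see the claimed savings, using the 1000-class scenario from the previous slide:

```python
import math

n_classes = 1000
print("one-per-class classifiers:", n_classes)                       # 1000 binary models
print("minimal ECOC code length:", math.ceil(math.log2(n_classes)))  # ceil(log2(1000)) = 10 bits
```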

What is ECOC?
- Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995)
- Use a learner to learn the binary problems

Training and Testing ECOC
[Diagram: classes A-D are assigned rows of a code matrix with columns f1-f5; each column defines one binary training problem, and a test instance X is classified by evaluating f1-f5 and comparing the resulting bit string to the class codewords]
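
Below is a minimal sketch of the training step the diagram describes: each class gets a row of a binary code matrix, and one binary classifier is trained per column on the relabeled data. The 4-class, 5-bit code, the Naïve Bayes base learner, and the fake data are assumptions chosen to mirror the picture, not the proposal's actual setup.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB  # Naive Bayes base learner, as in the slides

# One codeword (row) per class A-D, one binary problem (column) per f1..f5.
# The minimum pairwise Hamming distance of this toy code is 3, so it can correct one bit error.
code_matrix = np.array([
    [0, 0, 1, 1, 0],   # A
    [0, 1, 0, 1, 1],   # B
    [1, 0, 0, 0, 1],   # C
    [1, 1, 1, 0, 0],   # D
])

def train_ecoc(X, y, code_matrix):
    """X: (n_docs, n_features) word counts; y: class indices 0..k-1."""
    classifiers = []
    for bit in range(code_matrix.shape[1]):
        # Relabel each document with the bit its class's codeword has in this column.
        binary_labels = code_matrix[y, bit]
        classifiers.append(MultinomialNB().fit(X, binary_labels))
    return classifiers

# Fake bag-of-words data, just to make the sketch runnable.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(40, 30))
y = rng.integers(0, 4, size=40)
classifiers = train_ecoc(X, y, code_matrix)
```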

ECOC - Picture
[Diagram, built up across four slides: classes A, B, C, D each get a row of the code matrix; columns f1-f5 are the binary classifiers; the final slide adds a test instance X to be decoded against the rows]

Classification Performance vs. Efficiency
[Chart: classification performance plotted against efficiency, with points for NB, ECOC (preliminary results), and this proposal]
Preliminary results: ECOC reduces the error of the Naïve Bayes classifier by 66% with NO increase in computational cost (as used in Berger 99)

Proposed Solutions
- Design codewords that minimize cost and maximize “performance”
- Investigate the assignment of codewords to classes
- Learn the decoding function
- Incorporate unlabeled data into ECOC

Use Unlabeled Data with a Large Number of Classes
- How? Use EM: mixed results
- Think again! Use Co-Training: disastrous results
- Think one more time

Use Unlabeled Data
- Current learning algorithms that use unlabeled data (EM, Co-Training) don't work well with a large number of categories
- ECOC works great with a large number of classes, but there is no framework for using unlabeled data

Use Unlabeled Data
- ECOC decomposes multiclass problems into binary problems
- Co-Training works great with binary problems
- ECOC + Co-Train = learn each binary problem in ECOC with Co-Training (see the sketch below)
- Preliminary results: not so great! (very sensitive to the initial labeled documents)
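
A rough sketch of what "learn each binary problem in ECOC with Co-Training" could look like for a single bit. The two feature views, the growth schedule, and the Naïve Bayes base learners are assumptions for illustration; the proposal's exact procedure is not specified here.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain_bit(X_view1, X_view2, y_bits, labeled_idx, unlabeled_idx,
                rounds=10, grow_per_round=2):
    """Co-train two classifiers on one binary ECOC problem.
    y_bits holds the 0/1 bit labels, known only for labeled_idx at the start."""
    labeled, unlabeled = list(labeled_idx), list(unlabeled_idx)
    labels = {i: int(y_bits[i]) for i in labeled}

    for _ in range(rounds):
        c1 = MultinomialNB().fit(X_view1[labeled], [labels[i] for i in labeled])
        c2 = MultinomialNB().fit(X_view2[labeled], [labels[i] for i in labeled])
        if not unlabeled:
            break
        # Each view labels its most confident unlabeled examples for the other.
        for clf, X in ((c1, X_view1), (c2, X_view2)):
            if not unlabeled:
                break
            probs = clf.predict_proba(X[unlabeled])
            best = np.argsort(probs.max(axis=1))[-grow_per_round:]
            for j in sorted(best, reverse=True):
                idx = unlabeled[j]
                labels[idx] = int(clf.classes_[probs[j].argmax()])
                labeled.append(idx)
                del unlabeled[j]
    return c1, c2
```

In the full setup this would be repeated once per column of the code matrix, which is exactly where the reported sensitivity to the initial labeled documents would show up.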

What Next?
- Use an improved version of Co-Training (gradient descent)
  - Less prone to random fluctuations
  - Uses all unlabeled data at every iteration
- Use Co-EM (Nigam & Ghani 2000), a hybrid of EM and Co-Training

Work Plan
- Collect datasets
- Codeword assignment: 2 weeks
- Learning the decoding: 1-2 weeks
- Using unlabeled data: 2 weeks
- Design codes: 2 weeks
- Project write-up: 1 week

Summary
- Use ECOC for efficient text classification with a large number of categories
- Reduce code length without sacrificing performance
- Fix code length and increase performance
- Generalize to domain-independent classification tasks involving a large number of categories

Testing ECOC
- To test a new instance:
  - Apply each of the trained binary classifiers to the new instance
  - Combine the predictions to obtain a binary string (codeword) for the new point
  - Classify to the class with the nearest codeword (usually Hamming distance is used as the distance measure); see the sketch below
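
A minimal sketch of this decoding rule. The `classifiers` and `code_matrix` names are carried over from the training sketch earlier in the document; they are illustrative, not the proposal's code.

```python
import numpy as np

def ecoc_predict(X, classifiers, code_matrix):
    """Predict the class whose codeword is nearest (in Hamming distance)
    to the bit string produced by the binary classifiers."""
    # One predicted bit per classifier -> an (n_docs, L) matrix of 0/1 predictions.
    bits = np.column_stack([clf.predict(X) for clf in classifiers])
    # Hamming distance from each predicted bit string to each class codeword.
    distances = (bits[:, None, :] != code_matrix[None, :, :]).sum(axis=2)
    return distances.argmin(axis=1)   # index of the nearest codeword = predicted class
```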

The Decoding Step
- Standard: map to the nearest codeword according to Hamming distance
- Can we do better?

The Real Question?
- The tradeoff between the “learnability” of the binary problems and the error-correcting power of the code

Codeword Assignment
- Standard procedure: assign codewords to classes randomly
- Can we do better?

Goal of Current Research
- Improve classification performance without increasing cost
  - Design short codes that perform well
  - Develop algorithms that increase performance without affecting code length

Previous Results
- Performance increases with the length of the code
- Gives the same percentage increase in performance over NB regardless of training set size
- BCH codes > random codes > hand-constructed codes

Others Have Shown That ECOC
- Works great with arbitrarily long codes
- Longer codes = more error-correcting power = better performance
- Longer codes = more computational cost

Previous Results: Industry Sector Data Set
  Naïve Bayes: 66.1%
  Shrinkage [1]: 76%
  ME [2]: 79%
  ME w/ Prior [3]: 81.1%
  ECOC 63-bit: 88.5%
ECOC reduces the error of the Naïve Bayes classifier by 66% with no increase in computational cost (the error drops from 33.9% to 11.5%, roughly a 66% relative reduction).
[1] (McCallum et al. 1998)  [2], [3] (Nigam et al. 1999)  (Ghani 2000)

Design Codewords
- Maximize performance (accuracy, precision, recall, F1?)
- Minimize the length of the codes
- Search the space of codewords through gradient descent on G = Error + Code_Length (an illustrative search sketch follows below)
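
The slide's objective combines an error term and a length term, but it does not spell out how gradient descent would be applied to a discrete code matrix. The sketch below therefore substitutes a simple single-bit-flip local search over a surrogate objective (negative minimum pairwise Hamming distance plus a length penalty); it only illustrates the flavor of such a search, not the proposal's actual algorithm.

```python
import numpy as np

def min_pairwise_hamming(code):
    k = code.shape[0]
    return min((code[i] != code[j]).sum() for i in range(k) for j in range(i + 1, k))

def objective(code, length_weight=0.1):
    # Surrogate for G = Error + Code_Length: -min_distance stands in for the error term.
    return -min_pairwise_hamming(code) + length_weight * code.shape[1]

def search_code(n_classes, length, iters=2000, seed=0):
    """Single-bit-flip local search; with a fixed length this simply maximizes
    the minimum pairwise Hamming distance of the code."""
    rng = np.random.default_rng(seed)
    code = rng.integers(0, 2, size=(n_classes, length))
    best = objective(code)
    for _ in range(iters):
        i, j = rng.integers(n_classes), rng.integers(length)
        code[i, j] ^= 1              # propose flipping one bit
        g = objective(code)
        if g <= best:
            best = g                 # keep the flip
        else:
            code[i, j] ^= 1          # revert the flip
    return code

code = search_code(n_classes=10, length=15)
print(min_pairwise_hamming(code))
```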

Codeword Assignment
- Generate the confusion matrix and use it to assign the most confusable classes the codewords that are farthest apart (a sketch of one such heuristic follows below)
- Pros: focusing more on the confusable classes can help
- Cons: the individual binary problems can become very hard
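
One way to realize this heuristic (an illustrative sketch; the greedy pairing strategy is my own assumption): sort class pairs by how often they are confused, and greedily hand the most confused pair the two remaining codewords that are farthest apart.

```python
import numpy as np
from itertools import combinations

def assign_codewords(confusion, codewords):
    """confusion: (k, k) counts from a preliminary classifier;
    codewords: (k, L) 0/1 matrix. Returns class index -> codeword row index."""
    k = confusion.shape[0]
    # Class pairs ordered from most to least confused (symmetric score).
    pairs = sorted(combinations(range(k), 2),
                   key=lambda p: confusion[p[0], p[1]] + confusion[p[1], p[0]],
                   reverse=True)
    free, assignment = list(range(len(codewords))), {}
    for a, b in pairs:
        if a in assignment or b in assignment:
            continue
        # Give this confusable pair the farthest-apart remaining codewords.
        ci, cj = max(combinations(free, 2),
                     key=lambda q: (codewords[q[0]] != codewords[q[1]]).sum())
        assignment[a], assignment[b] = ci, cj
        free.remove(ci)
        free.remove(cj)
    for c in range(k):                 # leftover class when k is odd
        if c not in assignment:
            assignment[c] = free.pop()
    return assignment
```

Here `confusion` would come from a cheap preliminary classifier (e.g., plain Naïve Bayes), and `codewords` would be the rows of an already-designed code matrix.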

The Decoding Step
- Weight the individual classifiers according to their training accuracies and do weighted majority decoding (see the sketch below)
- Pose the decoding as a separate learning problem and use regression / a neural network
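
A minimal sketch of the first idea, weighted majority (weighted Hamming) decoding. It reuses the `classifiers` and `code_matrix` names from the earlier sketches; using training-set accuracy as the weight is an assumption.

```python
import numpy as np

def ecoc_predict_weighted(X, classifiers, code_matrix, weights):
    """Like plain Hamming decoding, but a disagreement on a reliable bit
    (high weight) costs more than a disagreement on a noisy one."""
    bits = np.column_stack([clf.predict(X) for clf in classifiers])
    mismatches = bits[:, None, :] != code_matrix[None, :, :]      # (n_docs, k, L)
    distances = (mismatches * np.asarray(weights)).sum(axis=2)    # weighted Hamming distance
    return distances.argmin(axis=1)

# The weights could be, e.g., each binary classifier's training accuracy:
# weights = [clf.score(X_train, code_matrix[y_train, b]) for b, clf in enumerate(classifiers)]
```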