1
Using Error-Correcting Codes for Efficient Text Categorization with a Large Number of Categories Rayid Ghani Center for Automated Learning & Discovery Carnegie Mellon University
2
Some Recent Work
Learning from Sequences of fMRI Brain Images (with Tom Mitchell)
Learning to automatically build language-specific corpora from the web (with Rosie Jones & Dunja Mladenic)
Effect of Smoothing on Naive Bayes for Text Classification (with Tong Zhang @ IBM Research)
Hypertext Categorization using links and extracted information (with Sean Slattery & Yiming Yang)
Hybrids of EM & Co-Training for semi-supervised learning (with Kamal Nigam)
Error-Correcting Output Codes for Text Classification
3
Text Categorization
Numerous applications: search engines/portals, customer service, email routing, …
Domains: topics, genres, languages
Making $$$
4
Problems
Practical applications such as web portals deal with a large number of categories.
A lot of labeled examples are needed to train the system.
5
How do people deal with a large number of classes?
Use fast multiclass algorithms (Naïve Bayes): builds one model per class.
Use binary classification algorithms (SVMs) and break an n-class problem into n binary problems.
What happens with a 1000-class problem? Can we do better?
6
ECOC to the Rescue!
An n-class problem can be solved by solving log2(n) binary problems.
More efficient than one-per-class.
Does it actually perform better?
7
What is ECOC?
Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995).
Use a learner to learn the binary problems.
8
Training and Testing ECOC
[Figure: a 4-class code matrix with classes A-D as rows and binary functions f1-f4 as columns; each column defines a binary training problem, and a test document X is assigned the class whose codeword is nearest to its predicted bit string.]
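To make the training step concrete, here is a minimal sketch assuming a scikit-learn-style Naïve Bayes base learner; the 4-class, 4-bit code matrix and the function name train_ecoc are illustrative, not taken from the talk.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Code matrix: one row (codeword) per class, one column per binary problem.
# A 4-class, 4-bit matrix stands in for the A-D example; the bit values are illustrative.
code_matrix = np.array([[0, 0, 1, 1],   # class A
                        [1, 0, 0, 1],   # class B
                        [1, 1, 0, 1],   # class C
                        [0, 1, 1, 0]])  # class D

def train_ecoc(X, y, code_matrix):
    """Train one binary classifier per column (bit) of the code matrix.
    X: document-term count matrix, y: integer class labels indexing the rows."""
    classifiers = []
    for bit in range(code_matrix.shape[1]):
        # Relabel each training document with the bit its class receives in this column.
        binary_labels = code_matrix[np.asarray(y), bit]
        classifiers.append(MultinomialNB().fit(X, binary_labels))
    return classifiers
```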
9
ECOC - Picture
[Figure, built up over four slides: the same 4-class code matrix (classes A-D vs. binary functions f1-f4), ending with a test point X and its predicted bits.]
13
ECOC works but…
Increased code length = increased accuracy.
Increased code length = increased computational cost.
14
[Figure: classification performance vs. efficiency. Naïve Bayes is efficient but less accurate; ECOC (as used in Berger 99) is accurate but less efficient; the GOAL is both.]
15
Choosing the codewords
Random? [Berger 1999, James 1999]: asymptotically good (the longer the better), but the computational cost is very high.
Use coding theory for good error-correcting codes? [Dietterich & Bakiri 1995]: guaranteed properties for a fixed-length code.
16
Experimental Setup
Generate the code: BCH codes.
Choose a base learner: Naive Bayes classifier, as used in text classification tasks (McCallum & Nigam 1998).
17
Text Classification with Naïve Bayes
“Bag of Words” document representation.
Estimate parameters of the generative model.
Naïve Bayes classification.
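For reference, the standard multinomial Naïve Bayes estimates used in this setting (McCallum & Nigam 1998) can be written as follows; N(w_t, d_i) is the count of word w_t in document d_i and |V| is the vocabulary size.

```latex
% Parameter estimation with Laplace smoothing:
\hat{P}(w_t \mid c_j) =
  \frac{1 + \sum_{i} N(w_t, d_i)\, P(c_j \mid d_i)}
       {|V| + \sum_{s=1}^{|V|} \sum_{i} N(w_s, d_i)\, P(c_j \mid d_i)}

% Classification: pick the class with the highest posterior under the
% bag-of-words (word independence) assumption:
c^{*} = \arg\max_{c_j}\; \hat{P}(c_j) \prod_{k=1}^{|d|} \hat{P}(w_{d,k} \mid c_j)
```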
18
Industry Sector Dataset
Consists of company web pages classified into 105 economic sectors [McCallum et al. 1998, Ghani 2000].
19
Results: Industry Sector Data Set
Naïve Bayes: 66.1%   Shrinkage¹: 76%   MaxEnt²: 79%   MaxEnt w/ Prior³: 81.1%   ECOC 63-bit: 88.5%
ECOC reduces the error of the Naïve Bayes classifier by 66% with no increase in computational cost.
1. (McCallum et al. 1998)  2, 3. (Nigam et al. 1999)
21
ECOC for better Precision
23
[Figure: the same classification-performance vs. efficiency picture (NB, ECOC as used in Berger 99, GOAL), now with a New Goal marked.]
24
Solutions
Design codewords that minimize cost and maximize “performance”.
Investigate the assignment of codewords to classes.
Learn the decoding function.
Incorporate unlabeled data into ECOC.
25
Size Matters?
26
What happens with sparse data?
27
Use unlabeled data with a large number of classes
How? Use EM: mixed results. Think again!
Use Co-Training: disastrous results. Think one more time.
28
How to use unlabeled data?
Current learning algorithms using unlabeled data (EM, Co-Training) don't work well with a large number of categories.
ECOC works great with a large number of classes, but there is no framework for using unlabeled data.
29
ECOC + Co-Training = ECoTrain
ECOC decomposes multiclass problems into binary problems.
Co-Training works great with binary problems.
ECOC + Co-Training = learn each binary problem in ECOC with Co-Training.
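As an illustration of the combination, here is a minimal sketch of learning a single ECOC bit with Co-Training, assuming two feature views of each document (e.g., two halves of the vocabulary); the view split, pool sizes, growth increment, and all names are illustrative assumptions, not settings from the talk.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def cotrain_one_bit(X1_l, X2_l, y_l, X1_u, X2_u, iters=10, grow=5):
    """Co-training sketch for one ECOC binary problem.
    (X1_l, X2_l): two views of the labeled docs, y_l: 0/1 bit labels,
    (X1_u, X2_u): the same two views of the unlabeled docs (dense arrays)."""
    for _ in range(iters):
        if len(X1_u) == 0:
            break
        clf1 = MultinomialNB().fit(X1_l, y_l)
        clf2 = MultinomialNB().fit(X2_l, y_l)
        # Each view's classifier picks the unlabeled docs it is most confident about
        # and labels them itself; those docs move into the labeled pool.
        picks1 = np.argsort(-clf1.predict_proba(X1_u).max(axis=1))[:grow]
        picks2 = np.argsort(-clf2.predict_proba(X2_u).max(axis=1))[:grow]
        picks = np.concatenate([picks1, picks2])   # duplicates ignored in this sketch
        new_y = np.concatenate([clf1.predict(X1_u[picks1]), clf2.predict(X2_u[picks2])])
        X1_l = np.vstack([X1_l, X1_u[picks]])
        X2_l = np.vstack([X2_l, X2_u[picks]])
        y_l = np.concatenate([y_l, new_y])
        keep = np.setdiff1d(np.arange(len(X1_u)), picks)
        X1_u, X2_u = X1_u[keep], X2_u[keep]
    # Final per-view classifiers for this bit; their predictions can be combined by voting.
    return MultinomialNB().fit(X1_l, y_l), MultinomialNB().fit(X2_l, y_l)
```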
30
ECOC + CoTrain: Results
Algorithm                          300L + 0U per class   50L + 250U per class   5L + 295U per class
Uses no unlabeled data:
  Naïve Bayes                      76                    67                     40.3
  ECOC 15-bit                      76.5                  68.5                   49.2
Uses unlabeled data (105-class problem):
  EM                               -                     68.2                   51.4
  Co-Training                      -                     67.6                   50.1
  ECoTrain (ECOC + Co-Training)    -                     72.0                   56.1
31
What Next?
Use an improved version of Co-Training (gradient descent): less prone to random fluctuations, uses all unlabeled data at every iteration.
Use Co-EM (Nigam & Ghani 2000): a hybrid of EM and Co-Training.
32
Potential Drawbacks
Random codes throw away the real-world nature of the data by picking random partitions to create artificial binary problems.
33
Summary
Use ECOC for efficient text classification with a large number of categories.
Increase accuracy & efficiency.
Use unlabeled data by combining ECOC and Co-Training.
Generalize to domain-independent classification tasks involving a large number of categories.
34
Semi-Theoretical Model
Model ECOC by a binomial distribution B(n, p):
n = length of the code
p = probability of each bit being classified incorrectly
35
Semi-Theoretical Model
Model ECOC by a binomial distribution B(n, p): n = length of the code, p = probability of each bit being classified incorrectly.

# of Bits   H_min   E_max   P_ave   Accuracy
15          5       2       .85     .59
15          5       2       .89     .80
15          5       2       .91     .84
31          11      5       .85     .67
31          11      5       .89     .91
31          11      5       .91     .94
63          31      15      .89     .99
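One way to make the model explicit (reading P_ave as the average per-bit accuracy, so the per-bit error rate is 1 - P_ave): a test document is classified correctly whenever at most E_max = (H_min - 1)/2 of the n bits are wrong, which gives

```latex
% Binomial model of ECOC accuracy: X = number of bit errors
\text{Accuracy} \;\approx\; \Pr\!\left[X \le E_{\max}\right]
  \;=\; \sum_{k=0}^{E_{\max}} \binom{n}{k}\,(1 - P_{\text{ave}})^{k}\,P_{\text{ave}}^{\,n-k},
  \qquad X \sim B\!\left(n,\; 1 - P_{\text{ave}}\right)
% e.g. n = 15, E_max = 2, P_ave = 0.85 gives about 0.60,
% close to the .59 entry in the first row of the table.
```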
37
The Longer the Better!
Table 2: Average classification accuracy on 5 random 50-50 train-test splits of the Industry Sector dataset with a vocabulary size of 10,000 words selected using Information Gain.
Longer codes mean larger codeword separation.
The minimum Hamming distance of a code C is the smallest distance between any pair of distinct codewords in C.
If the minimum Hamming distance is h, then the code can correct (h-1)/2 errors.
38
Types of Codes
Data-Independent: Algebraic, Random, Hand-Constructed
Data-Dependent: Adaptive
39
ECOC + Co-Training = ECoTrain
ECOC decomposes multiclass problems into binary problems.
Co-Training works great with binary problems.
ECOC + Co-Training = learn each binary problem in ECOC with Co-Training.
Preliminary results: not so great! (Very sensitive to the initial labeled documents.)
40
What is a Good Code?
Row separation.
Column separation (independence of errors for each binary classifier).
Efficiency (for long codes).
41
Choosing Codes
             Random                        Algebraic
Row Sep      On average, for long codes    Guaranteed
Col Sep      On average, for long codes    Can be guaranteed
Efficiency   No                            Yes
42
Experimental Results
Code             Min Row HD   Max Row HD   Min Col HD   Max Col HD   Error Rate
15-bit BCH       5            15           49           64           20.6%
19-bit Hybrid    5            18           15           69           22.3%
15-bit Random    2 (1.5)      13           42           60           24.1%
43
Interesting Observations
NBC does not give good probability estimates; using ECOC results in better estimates.
44
Testing ECOC
To test a new instance:
Apply each of the n classifiers to the new instance.
Combine the predictions to obtain a binary string (codeword) for the new point.
Classify to the class with the nearest codeword (usually Hamming distance is used as the distance measure).
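A minimal sketch of this decoding step, reusing the classifiers and code_matrix names from the earlier training sketch (illustrative, not the talk's code):

```python
import numpy as np

def ecoc_predict(classifiers, code_matrix, X):
    """Decode by nearest codeword under Hamming distance."""
    # One predicted bit per classifier -> a bit string per test document.
    bits = np.column_stack([clf.predict(X) for clf in classifiers])
    # Hamming distance from each predicted string to every class codeword.
    dists = (bits[:, None, :] != code_matrix[None, :, :]).sum(axis=2)
    # Classify to the class whose codeword is nearest.
    return dists.argmin(axis=1)
```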
45
The Decoding Step
Standard: map to the nearest codeword according to Hamming distance.
Can we do better?
46
The Real Question? Tradeoff between “learnability” of binary problems and the error-correcting power of the code
47
Codeword Assignment
Standard procedure: assign codewords to classes randomly.
Can we do better?
48
Goal of Current Research
Improve classification performance without increasing cost:
Design short codes that perform well.
Develop algorithms that increase performance without affecting code length.
49
Previous Results
Performance increases with the length of the code.
Gives the same percentage increase in performance over NB regardless of training set size.
BCH codes > random codes > hand-constructed codes.
50
Others have shown that ECOC works great with arbitrarily long codes.
Longer codes = more error-correcting power = better performance.
Longer codes = more computational cost.
51
ECOC to the Rescue!
An n-class problem can be solved by solving log2(n) problems.
More efficient than one-per-class.
Does it actually perform better?
52
Previous Results: Industry Sector Data Set
Naïve Bayes: 66.1%   Shrinkage¹: 76%   ME²: 79%   ME w/ Prior³: 81.1%   ECOC 63-bit: 88.5%
ECOC reduces the error of the Naïve Bayes classifier by 66% with no increase in computational cost. (Ghani 2000)
1. (McCallum et al. 1998)  2, 3. (Nigam et al. 1999)
53
Design codewords
Maximize performance (accuracy, precision, recall, F1?).
Minimize the length of the codes.
Search in the space of codewords through gradient descent: G = Error + Code_Length.
54
Codeword Assignment
Generate the confusion matrix and use it to assign the most confusable classes the codewords that are farthest apart.
Pros: focusing more on confusable classes can help.
Cons: individual binary problems can be very hard.
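One possible greedy realization of this idea, for illustration only; the pairing heuristic, names, and fallback behavior are assumptions, not the talk's algorithm (confusion and code_matrix are NumPy arrays):

```python
from itertools import combinations

def assign_codewords(confusion, code_matrix):
    """Greedy sketch: the most confusable class pairs get the codewords
    that are farthest apart in Hamming distance."""
    hd = lambda r, s: int((code_matrix[r] != code_matrix[s]).sum())
    n_classes = confusion.shape[0]
    # Class pairs, most confusable first (confusion counts in both directions).
    pairs = sorted(combinations(range(n_classes), 2),
                   key=lambda p: -(confusion[p[0], p[1]] + confusion[p[1], p[0]]))
    free = set(range(code_matrix.shape[0]))   # unassigned codeword rows
    assignment = {}
    for a, b in pairs:
        if not free:
            break
        if a not in assignment and b not in assignment and len(free) >= 2:
            row_a, row_b = max(combinations(free, 2), key=lambda rs: hd(*rs))
            assignment[a], assignment[b] = row_a, row_b
            free -= {row_a, row_b}
        elif a in assignment and b not in assignment:
            assignment[b] = max(free, key=lambda r: hd(r, assignment[a]))
            free.discard(assignment[b])
        elif b in assignment and a not in assignment:
            assignment[a] = max(free, key=lambda r: hd(r, assignment[b]))
            free.discard(assignment[a])
    for c in range(n_classes):                # any leftover classes take remaining rows
        if c not in assignment and free:
            assignment[c] = free.pop()
    return assignment                          # class index -> codeword row index
```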
55
The Decoding Step
Weight the individual classifiers according to their training accuracies and do weighted-majority decoding.
Or pose the decoding as a separate learning problem and use regression / a neural network.
(Slide illustration: example bit strings 10011001 and 11011101.)
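A minimal sketch of the first idea (accuracy-weighted decoding); bit_accuracies would hold each binary classifier's training accuracy, and all names are illustrative:

```python
import numpy as np

def weighted_decode(bit_preds, code_matrix, bit_accuracies):
    """Weighted-majority decoding: each bit votes for the classes whose codeword
    agrees with it, weighted by that classifier's training accuracy."""
    agree = (bit_preds[:, None, :] == code_matrix[None, :, :])            # docs x classes x bits
    scores = (agree * np.asarray(bit_accuracies)[None, None, :]).sum(axis=2)
    return scores.argmax(axis=1)                                          # highest-scoring class
```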