2
Using Error-Correcting Codes for Efficient Text Categorization with a Large Number of Categories
Rayid Ghani
Advisor: Tom Mitchell
3
Text Categorization
Numerous applications: search engines/portals, customer service, email routing, ...
Domains: topics, genres, languages
4
Problems
Practical applications such as web portals deal with a large number of categories, and current systems are not specifically designed to handle that.
A lot of labeled examples are needed for training the system.
5
How do people deal with a large number of classes?
Use algorithms that scale up linearly:
- Naïve Bayes: fast, handles multiclass problems, builds one probability distribution per class
- SVM: can only handle binary problems, so an n-class problem is solved by solving n binary problems
Can we do better than scaling up linearly with the number of classes?
6
ECOC to the Rescue!
An n-class problem can be solved by solving fewer than n binary problems.
More efficient than one-per-class.
Is it better in classification performance?
7
ECOC to the Rescue!
Motivated by error-correcting codes: add redundancy to correct errors.
Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995).
Can be more efficient than the one-per-class approach.
8
Related Work Using ECOC
Focus is on accuracy, mostly at the cost of computational efficiency:
- Applied to text classification with Naïve Bayes (Berger 1999)
- Applied to several non-text problems using decision trees, neural networks, kNN (Dietterich & Bakiri 1995, Aha 1997)
- Combined with boosting (Schapire 1999, Guruswami & Sahai 1999)
9
Training and Testing ECOC
[Figure: a 4-class ECOC code matrix with classes A-D as rows and binary functions f1-f4 as columns; training learns one binary classifier per column, and testing classifies an example X with each f_i and decodes the resulting bit string.]
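The following is a minimal sketch of the training and testing procedure described above, assuming scikit-learn's MultinomialNB as the binary base learner and a {0,1} code matrix; the helper names are illustrative, not from the talk.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def train_ecoc(X, y, code):
    """Train one binary classifier per column of the code matrix.

    code[c, j] is the j-th bit of the codeword for class c;
    y contains class indices (row indices of `code`)."""
    classifiers = []
    for j in range(code.shape[1]):
        # Relabel each document with the j-th bit of its class codeword.
        binary_labels = code[y, j]
        classifiers.append(MultinomialNB().fit(X, binary_labels))
    return classifiers

def predict_ecoc(X, classifiers, code):
    """Predict the codeword bits, then decode to the nearest codeword."""
    bits = np.column_stack([clf.predict(X) for clf in classifiers])
    # Hamming distance from each predicted bit string to every codeword.
    distances = (bits[:, None, :] != code[None, :, :]).sum(axis=2)
    return distances.argmin(axis=1)
```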
10
ECOC - Picture
[Figure, built up over three slides: each class A-D is assigned a codeword from the 4-bit code matrix, each column f1-f4 defines a binary partition of the classes, and a test example X is then classified by each binary function.]
13
[Figure: the 4-class code matrix with a misclassified test example X, alongside the 4x4 identity matrix of the one-per-class code.]
ECOC can recover from the errors of the individual classifiers.
Longer codes mean larger codeword separation.
The minimum Hamming distance of a code C is the smallest distance between any pair of codewords in C.
If the minimum Hamming distance is h, then the code can correct at most (h-1)/2 errors.
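As a quick illustration of the error-correcting property above, here is a small sketch (not from the talk) that computes the minimum Hamming distance of a code matrix and the number of bit errors it can correct.

```python
import numpy as np
from itertools import combinations

def min_hamming_distance(code):
    """Smallest Hamming distance between any pair of codewords (rows)."""
    return min((r1 != r2).sum() for r1, r2 in combinations(code, 2))

# The one-per-class code for 4 classes is the identity matrix:
# its minimum distance is 2, so it can correct (2 - 1) // 2 = 0 errors.
one_per_class = np.eye(4, dtype=int)
h = min_hamming_distance(one_per_class)
print(h, (h - 1) // 2)   # 2 0
```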
14
ECOC
Longer codes give better error-correcting properties and better classification performance, but at a higher computational cost.
15
[Figure: classification performance vs. efficiency. Naïve Bayes is efficient but less accurate; ECOC (as used in Berger 99) is accurate but less efficient; the goal is to achieve both.]
16
Efficient (short) codes
Maximize accuracy
Reduce labeled data
17
Roadmap
- Efficiency: shorter and better codes
- Accuracy: maximize classification performance
- Data efficiency: use unlabeled data
[Figure: the classification performance vs. efficiency plot, with NB, ECOC, and the goal marked.]
18
What is a Good Code?
- Row separation
- Column separation (correlation of errors for each binary classifier)
- Efficiency
19
Choosing the Codewords
- Use coding theory for good error-correcting codes? [Dietterich & Bakiri 1995] Guaranteed properties for a fixed-length code.
- Random codes [Berger 1999, James 1999]: asymptotically good (the longer the better), but the computational cost is very high.
- Construct codes that are data-dependent.
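To make the trade-off concrete, here is a hedged sketch (my own, not from the talk) of generating random candidate codes and keeping the one with the best row and column separation; the BCH construction itself is not shown.

```python
import numpy as np

def separation(code):
    """Return (min row Hamming distance, min column Hamming distance)."""
    def min_pairwise(rows):
        n = len(rows)
        return min((rows[i] != rows[j]).sum()
                   for i in range(n) for j in range(i + 1, n))
    return min_pairwise(code), min_pairwise(code.T)

def best_random_code(n_classes, n_bits, n_trials=1000, seed=0):
    """Pick the random code with the best worst-case separation (a crude criterion)."""
    rng = np.random.default_rng(seed)
    best, best_score = None, -1
    for _ in range(n_trials):
        code = rng.integers(0, 2, size=(n_classes, n_bits))
        row_sep, col_sep = separation(code)
        score = min(row_sep, col_sep)
        if score > best_score:
            best, best_score = code, score
    return best
```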
20
Text Classification with Naïve Bayes
"Bag of words" document representation.
Estimate the parameters of the generative model, then classify with Naïve Bayes; the standard formulas are sketched below.
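The slide's formulas did not survive extraction; below is a sketch of the standard multinomial Naïve Bayes equations that this kind of slide typically shows (with Laplace smoothing), where $w_t$ ranges over vocabulary words, $d_i$ over training documents, and $N(w_t, d_i)$ is the count of word $w_t$ in document $d_i$. These are the textbook formulas, not necessarily the exact notation used in the talk.

```latex
% Parameter estimation (with Laplace smoothing):
\hat{P}(w_t \mid c_j) =
  \frac{1 + \sum_i N(w_t, d_i)\, P(c_j \mid d_i)}
       {|V| + \sum_s \sum_i N(w_s, d_i)\, P(c_j \mid d_i)}
\qquad
\hat{P}(c_j) = \frac{\sum_i P(c_j \mid d_i)}{|D|}

% Classification of a new document d:
c^* = \arg\max_{c_j} \hat{P}(c_j) \prod_{w_t \in d} \hat{P}(w_t \mid c_j)^{N(w_t, d)}
```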
21
Industry Sector Dataset
Consists of company web pages classified into 105 economic sectors [McCallum et al. 1998, Ghani 2000].
22
Results: Industry Sector Dataset

Naïve Bayes: 66.1%
Shrinkage [1]: 76%
MaxEnt [2]: 79%
MaxEnt w/ Prior [3]: 81.1%
ECOC 63-bit: 88.5%

ECOC reduces the error of the Naïve Bayes classifier by 66% with HALF the computational cost.
[1] McCallum et al. 1998; [2, 3] Nigam et al. 1999
23
Experimental Setup
Generate the code: BCH codes
Choose a base learner: Naïve Bayes classifier
24
Datasets
- Industry Sector: 10,000 corporate web pages (105 classes)
- Hoovers-28 and Hoovers-255: 4,285 websites
- Jobs-15 and Jobs-65: 100,000 job titles and descriptions (from WhizBang! Labs)
25
Results
26
Comparison with Naïve Bayes
28
A Closer Look…
29
ECOC for better Precision Industry Sector Dataset
30
ECOC for better Precision Hoovers-28 dataset
31
Summary: Short Codes
- Use algebraic coding theory to obtain short codes
- Highly efficient and accurate at the same time
- High-precision classification
32
Efficient (short) codes
Maximize accuracy
Reduce labeled data
33
Roadmap
- Efficiency: shorter and better codes
- Accuracy: maximize classification performance
- Data efficiency: use unlabeled data
34
Maximize classification performance for fixed code lengths:
- Investigate the assignment of codewords to classes
- Learn the decoding function
35
Codeword Assignment
Standard procedure: assign codewords to classes randomly.
Can we do better?
36
Codeword Assignment
Generate the confusion matrix and use it to assign the codewords that are farthest apart to the most confusable classes.
Pros: since most errors are between the most confusable classes, giving those classes the largest codeword separation lets the binary learners make more mistakes between them without hurting the final decoding.
Cons: the new individual binary problems can be very hard.
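A minimal sketch of one way to implement this idea (my own greedy heuristic, not necessarily the assignment procedure used in the thesis): order class pairs by confusion counts and greedily give the most confusable pairs the codeword pairs that are farthest apart.

```python
import numpy as np

def assign_codewords(confusion, code):
    """Greedy assignment: the most confusable class pair gets the pair of
    codewords with the largest Hamming distance, and so on.

    confusion: (C, C) confusion matrix from a held-out run
    code:      (C, n_bits) matrix of available codewords
    Returns a dict mapping class index -> codeword row index."""
    C = len(confusion)
    conf = confusion + confusion.T                              # symmetric confusability
    dist = (code[:, None, :] != code[None, :, :]).sum(axis=2)   # pairwise codeword distances

    assignment = {}
    free_classes, free_codes = set(range(C)), set(range(C))
    # Class pairs, most confusable first.
    pairs = sorted(((conf[i, j], i, j) for i in range(C) for j in range(i + 1, C)),
                   reverse=True)
    for _, i, j in pairs:
        if i in free_classes and j in free_classes and len(free_codes) >= 2:
            # The two remaining codewords that are farthest apart.
            a, b = max((dist[a, b], a, b) for a in free_codes
                       for b in free_codes if a < b)[1:]
            assignment[i], assignment[j] = a, b
            free_classes -= {i, j}
            free_codes -= {a, b}
    # Any leftover class (odd C) gets an arbitrary remaining codeword.
    for c, cw in zip(sorted(free_classes), sorted(free_codes)):
        assignment[c] = cw
    return assignment
```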
37
Codeword Assignment Example
[Figure: four Industry Sector classes (Computer.Software, Metal Industry, Retail.Technology, Jewelery Industry) shown with a random codeword assignment versus a confusion-based assignment in which the most confusable classes receive the codewords that are farthest apart.]
38
Results
Standard ECOC: 75.4%
ECOC with intelligent codeword assignment: 56.3%
Results on 10 classes chosen from the Industry Sector dataset.
Why? Some of the bits are really hard to learn.
39
The Decoding Step
- Standard: use Hamming distance to map to the class with the nearest codeword
- Or: weight the individual classifiers according to their training accuracies and do weighted majority decoding
Can we do better?
40
The Decoding Step
Pose the decoding as a separate learning problem and use regression or a neural network.
[Figure: the predicted codeword is mapped to the actual codewords either by Hamming distance or by a learned neural-network decoder.]
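Here is a hedged sketch of the learned-decoding idea, assuming scikit-learn's MLPRegressor stands in for the neural network on the slide: fit a regressor from noisy predicted codewords (on held-out data) to the true codewords, then decode a new prediction to the class whose codeword is closest to the regressor's output.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_decoder(predicted_bits, true_classes, code):
    """Learn a mapping from noisy predicted codewords to clean codewords."""
    targets = code[true_classes]               # the correct codeword per example
    decoder = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000)
    decoder.fit(predicted_bits, targets)
    return decoder

def decode(decoder, predicted_bits, code):
    """Map the decoder's continuous output to the nearest codeword's class."""
    smoothed = decoder.predict(predicted_bits)             # (n_examples, n_bits)
    dist = ((smoothed[:, None, :] - code[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)
```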
41
Results
Standard ECOC: 75.4%
ECOC with learned decoding: 76.1%
Results on 10 classes chosen from the Industry Sector dataset.
42
What if we combine the two?
- Assign codewords to classes by confusability
- Learn the decoding
[Figure: pipeline of codeword assignment, then learning the binary classifiers, then learning the decoding function.]
43
Results

Decoding            Random assignment   Intelligent assignment
Hamming distance    75.4%               56.3%
Neural network      76.1%               83.3%

Results on 10 classes chosen from the Industry Sector dataset.
44
Picture of Why the Combination Works
[Figure: an input code matrix and the resulting output matrix illustrating column inversion.]
Since each column is a binary classification task, a learner with less than 50% accuracy can be made more than 50% accurate by just inverting its classification.
45
Summary: Maximize Accuracy
Assigning codewords to classes intelligently and then learning the decoding function performs well.
46
Efficient (short) codes
Maximize accuracy
Reduce labeled data
47
Roadmap
- Efficiency: shorter and better codes
- Accuracy: maximize classification performance
- Data efficiency: use unlabeled data
48
Supervised Learning with Labeled Data
Labeled data is required in large quantities and can be very expensive to collect.
49
Why Use Unlabeled Data?
Very cheap in the case of text: web pages, newsgroups, email messages.
May not be as useful as labeled data, but is available in enormous quantities.
50
Goal
Make learning more efficient by reducing the amount of labeled data required for text classification with a large number of categories.
51
Related Research with Unlabeled Data
- Using EM in a generative model (Nigam et al. 1999)
- Transductive SVMs (Joachims 1999)
- Co-Training-type algorithms (Blum & Mitchell 1998, Collins & Singer 1999, Nigam & Ghani 2000)
52
Plan
- ECOC: very accurate and efficient for text categorization with a large number of classes
- Co-Training: useful for combining labeled and unlabeled data with a small number of classes
53
The Feature Split Setting
[Figure: an example job posting split into two feature sets, the title ("Junior Vermiculturist") and the description ("Looking to get involved in worm farming and composting with worms? Look no more!").]
54
The Co-Training Algorithm [Blum & Mitchell 1998]
Loop (while unlabeled documents remain):
- Build classifiers A and B (using Naïve Bayes)
- Classify unlabeled documents with A and B
- Add the most confident A predictions and the most confident B predictions as labeled training examples
A sketch of this loop appears below.
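A minimal sketch of the loop, assuming scikit-learn Naïve Bayes classifiers and two feature views `X_a`/`X_b` of the same documents; the number of examples added per round (`k`) is an illustrative parameter, not a value from the talk.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X_a, X_b, y, labeled, unlabeled, rounds=20, k=5):
    """Blum & Mitchell style co-training with two Naive Bayes views.

    y holds true labels for `labeled` indices; entries for `unlabeled`
    indices are placeholders and get overwritten with pseudo-labels."""
    labeled, unlabeled = set(labeled), set(unlabeled)
    y = np.array(y)
    clf_a = clf_b = None
    for _ in range(rounds):
        idx = sorted(labeled)
        clf_a = MultinomialNB().fit(X_a[idx], y[idx])
        clf_b = MultinomialNB().fit(X_b[idx], y[idx])
        if not unlabeled:
            break
        pool = sorted(unlabeled)
        for clf, X in ((clf_a, X_a), (clf_b, X_b)):
            probs = clf.predict_proba(X[pool])
            # Most confident predictions on the unlabeled pool.
            for i in np.argsort(probs.max(axis=1))[-k:]:
                doc = pool[i]
                if doc in unlabeled:
                    y[doc] = clf.classes_[probs[i].argmax()]   # pseudo-label
                    labeled.add(doc)
                    unlabeled.discard(doc)
    return clf_a, clf_b
```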
55
The Co-Training Algorithm [Blum & Mitchell 1998]
[Figure: Naïve Bayes classifiers on feature sets A and B each learn from the labeled data, estimate labels for the unlabeled data, and add their most confident predictions to the labeled pool.]
56
One Intuition Behind Co-Training
A and B are redundant (A features are independent of B features).
Co-training is like learning with random classification noise:
- the most confident A predictions give effectively random B examples
- with a small misclassification error from A
57
ECOC + Co-Training
- ECOC decomposes multiclass problems into binary problems
- Co-Training works great with binary problems
- ECOC + Co-Training = learn each binary problem in ECOC with Co-Training (sketched below)
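A short sketch of the combination, reusing the hypothetical `co_train` helper from the earlier sketch: each column of the ECOC code matrix defines a binary problem, and each binary problem is learned with co-training over the two feature views.

```python
def ecoc_co_train(X_a, X_b, y, labeled, unlabeled, code, **kwargs):
    """Learn each ECOC bit with co-training.

    Returns one (clf_a, clf_b) pair per bit; y must contain valid class
    indices (placeholders for unlabeled documents are ignored by co_train)."""
    bit_classifiers = []
    for j in range(code.shape[1]):
        bit_labels = code[y, j]   # binary relabeling for bit j
        bit_classifiers.append(
            co_train(X_a, X_b, bit_labels, labeled, unlabeled, **kwargs))
    return bit_classifiers
```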
58
[Figure: example categories: Sports, Science, Arts, Health, Politics, Law.]
59
Datasets
- Jobs-65 (from WhizBang): job postings with two feature sets (title, description), 65 categories, baseline 11%
- Hoovers-255: collection of 4,285 corporate websites, each company classified into one of 255 categories, baseline 2%
60
Results (classification accuracy, %)

                 Naïve Bayes (no unlabeled)   ECOC (no unlabeled)      EM           Co-Training   ECOC + Co-Training
                 10% lab.    100% lab.        10% lab.   100% lab.     10% lab.     10% lab.      10% lab.
Jobs-65          50.1        68.2             59.3       71.2          58.2         54.1          64.5
Hoovers-255      15.2        32.0             24.8       36.5           9.1         10.2          27.6
61
Results Dataset Naïve BayesECOCEMCo- Training ECOC + Co-Training 10% Labeled 100 % Labeled 10% Labeled 100 % Labeled 10% Labeled Jobs-65 50.168.259.371.258.254.164.5 Hoovers -255 15.232.024.836.59.110.227.6 Dataset Naïve BayesECOCEMCo- Training ECOC + Co-Training 10% Labeled 100 % Labeled 10% Labeled 100 % Labeled 10% Labeled Jobs-65 50.168.259.371.258.254.164.5 Hoovers -255 15.232.024.836.59.110.227.6
62
Results Hoovers-255 dataset
63
Summary: Using Unlabeled Data
- The combination of ECOC and Co-Training performs extremely well for learning with labeled and unlabeled data
- Results in high-precision classifications
- Randomly partitioning the vocabulary works well for Co-Training
64
Conclusions
- Using short codes with good error-correcting properties results in high accuracy and precision
- Combining confusability-based codeword assignment with a learned decoding is promising
- Using unlabeled data for a large number of categories by combining ECOC with Co-Training leads to good results
65
Next Steps
- Use active learning techniques to pick the initial labeled examples
- Develop techniques to partition a standard feature set into two redundantly sufficient feature sets
- Create codes specifically for unlabeled data
66
What happens with sparse data?
67
ECOC + Co-Training: Results (105-class problem)

Algorithm                                        300L + 0U    50L + 250U   5L + 295U
                                                 per class    per class    per class
Naïve Bayes (no unlabeled data)                  76           67           40.3
ECOC 15-bit (no unlabeled data)                  76.5         68.5         49.2
EM (uses unlabeled data)                         -            68.2         51.4
Co-Training (uses unlabeled data)                -            67.6         50.1
ECoTrain (ECOC + Co-Training, unlabeled data)    -            72.0         56.1
68
What Next?
- Use an improved version of co-training (gradient descent): less prone to random fluctuations, uses all unlabeled data at every iteration
- Use Co-EM (Nigam & Ghani 2000), a hybrid of EM and Co-Training
69
Potential Drawbacks
Random codes throw away the real-world nature of the data by picking random partitions to create artificial binary problems.
70
Summary
- Use ECOC for efficient text classification with a large number of categories
- Increase accuracy and efficiency
- Use unlabeled data by combining ECOC and Co-Training
- Generalize to domain-independent classification tasks involving a large number of categories
71
Semi-Theoretical Model
Model ECOC by a binomial distribution B(n, p):
- n = length of the code
- p = probability of each bit being classified incorrectly
72
Semi-Theoretical Model
Model ECOC by a binomial distribution B(n, p), where n is the length of the code and p is the probability of each bit being classified incorrectly.

# of bits   H_min   E_max   P_ave   Accuracy
15          5       2       .85     .59
15          5       2       .89     .80
15          5       2       .91     .84
31          11      5       .85     .67
31          11      5       .89     .91
31          11      5       .91     .94
63          31      15      .89     .99
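As a rough illustration (my own, not from the thesis) of how this model relates per-bit accuracy to multiclass accuracy: assuming independent bit errors, the probability that at most E_max = (H_min - 1)/2 bits are wrong is a binomial tail. This is a conservative estimate, since nearest-codeword decoding can also succeed with more than E_max errors.

```python
from scipy.stats import binom

def predicted_accuracy(n_bits, h_min, p_bit_accuracy):
    """P(at most (h_min - 1) // 2 of the n bits are wrong)."""
    e_max = (h_min - 1) // 2
    return binom.cdf(e_max, n_bits, 1 - p_bit_accuracy)

print(round(predicted_accuracy(15, 5, 0.91), 2))   # about 0.85
print(round(predicted_accuracy(31, 11, 0.89), 2))  # about 0.88
```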
74
The Longer the Better!
Table 2: average classification accuracy on 5 random 50-50 train-test splits of the Industry Sector dataset with a vocabulary size of 10,000 words selected using information gain.
Longer codes mean larger codeword separation.
The minimum Hamming distance of a code C is the smallest distance between any pair of codewords in C.
If the minimum Hamming distance is h, then the code can correct (h-1)/2 errors.
75
Types of Codes
- Data-independent: algebraic, random, hand-constructed
- Data-dependent: adaptive
76
What is a Good Code?
- Row separation
- Column separation (independence of errors for each binary classifier)
- Efficiency (for long codes)
77
Choosing Codes

             Random                        Algebraic
Row sep      On average (for long codes)   Guaranteed
Col sep      On average (for long codes)   Can be guaranteed
Efficiency   No                            Yes
78
Experimental Results

Code            Min Row HD   Max Row HD   Min Col HD   Max Col HD   Error Rate
15-bit BCH      5            15           49           64           20.6%
19-bit Hybrid   5            18           15           69           22.3%
15-bit Random   2 (1.5)      13           42           60           24.1%
79
Design Codewords
- Maximize performance (accuracy, precision, recall, F1?)
- Minimize the length of the codes
- Search in the space of codewords through gradient descent: G = Error + Code_Length
80
[Figure: classification performance vs. efficiency, with Naïve Bayes, ECOC (as used in Berger 99), and the goal marked.]
81
The one-per-class code for 10 classes (identity matrix):
A 1 0 0 0 0 0 0 0 0 0
B 0 1 0 0 0 0 0 0 0 0
C 0 0 1 0 0 0 0 0 0 0
D 0 0 0 1 0 0 0 0 0 0
E 0 0 0 0 1 0 0 0 0 0
F 0 0 0 0 0 1 0 0 0 0
G 0 0 0 0 0 0 1 0 0 0
H 0 0 0 0 0 0 0 1 0 0
I 0 0 0 0 0 0 0 0 1 0
J 0 0 0 0 0 0 0 0 0 1