Using Error-Correcting Codes For Text Classification Rayid Ghani Center for Automated Learning & Discovery, Carnegie Mellon University This presentation can be accessed at
Outline Review of ECOC Previous Work Types of Codes Experimental Results Semi-Theoretical Model Drawbacks Conclusions & Work in Progress
Overview of ECOC Decompose a multiclass problem into multiple binary problems. The conversion can be independent of the data or dependent on it (it always depends on the number of classes). Any learner that can learn binary functions can then be used to learn the original multivalued function.
Training ECOC Given m distinct classes, create an m x n binary matrix M. Each class is assigned ONE row of M. Each column of the matrix divides the classes into TWO groups. Train the base classifiers to learn the n binary problems.
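The relabeling step above can be sketched as follows. This is a minimal illustration, assuming a hypothetical 4-class, 7-bit code matrix (not the BCH codes used in the experiments). The class names and the helper `binary_problems` are invented for the example; any binary learner would then be trained on each relabeled problem.

```python
# Hypothetical code matrix: one row (codeword) per class,
# one column per binary problem.
CODE_MATRIX = {
    "A": [1, 1, 1, 1, 1, 1, 1],
    "B": [0, 0, 0, 0, 1, 1, 1],
    "C": [0, 0, 1, 1, 0, 0, 1],
    "D": [0, 1, 0, 1, 0, 1, 0],
}


def binary_problems(code_matrix, class_labels):
    """Turn multiclass labels into n lists of 0/1 targets, one per column."""
    n_bits = len(next(iter(code_matrix.values())))
    return [
        [code_matrix[label][bit] for label in class_labels]
        for bit in range(n_bits)
    ]


# Multiclass labels of five hypothetical training documents.
labels = ["A", "C", "B", "D", "A"]
targets = binary_problems(CODE_MATRIX, labels)

# targets[j] holds the 0/1 training targets for the j-th binary classifier.
print(len(targets), targets[0])
```

Each of the n target lists defines one two-group problem; the documents themselves are unchanged, only their labels are.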
Testing ECOC To test a new instance: apply each of the n classifiers to the new instance. Combine the predictions to obtain a binary string (codeword) for the new point. Classify to the class with the nearest codeword (Hamming distance is the usual distance measure).
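Nearest-codeword decoding can be sketched in a few lines. The code matrix and the predicted bit string below are hypothetical, and ties are broken by dictionary order, which is an arbitrary choice made for this example.

```python
# Hypothetical 4-class, 7-bit code matrix (one codeword per class).
CODE_MATRIX = {
    "A": [1, 1, 1, 1, 1, 1, 1],
    "B": [0, 0, 0, 0, 1, 1, 1],
    "C": [0, 0, 1, 1, 0, 0, 1],
    "D": [0, 1, 0, 1, 0, 1, 0],
}


def hamming_distance(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def decode(code_matrix, predicted_bits):
    """Return the class whose codeword is nearest to the predicted bits."""
    return min(
        code_matrix,
        key=lambda cls: hamming_distance(code_matrix[cls], predicted_bits),
    )


# Outputs of the 7 binary classifiers for one new instance:
# the codeword of class "C" with its sixth bit flipped.
predicted = [0, 0, 1, 1, 0, 1, 1]
print(decode(CODE_MATRIX, predicted))
```

Even though one binary classifier is wrong here, the predicted string is still closer to the codeword of "C" than to any other row.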
ECOC-Picture [Figure: ECOC illustration; labels A, B, C]
Previous Work Combine with Boosting – ADABOOST.OC (Schapire, 1997), (Guruswami & Sahai, 1999) Local Learners Text Classification (Berger, 1999)
Experimental Setup Generate the code BCH Codes Choose a Base Learner Naive Bayes Classifier as used in text classification tasks (McCallum & Nigam 1998)
Dataset Industry Sector Dataset Consists of company web pages classified into 105 economic sectors Standard stoplist No Stemming Skip all MIME headers and HTML tags Experimental approach similar to McCallum et al. (1998) for comparison purposes.
Results Classification Accuracies on five random train-test splits of the Industry Sector dataset with a vocabulary size of ECOC is 88% accurate!
Results Industry Sector Data Set

Classifier          Accuracy
Naïve Bayes         66.1%
Shrinkage (1)       76%
ME (2)              79%
ME w/ Prior (3)     81.1%
ECOC 63-bit         88.5%

ECOC reduces the error of the Naïve Bayes Classifier by 66%
1. (McCallum et al. 1998) 2, 3. (Nigam et al. 1999)
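The 66% figure is a relative error reduction. A short check of the arithmetic, using the Naïve Bayes (66.1%) and 63-bit ECOC (88.5%) accuracies reported above; the function name is made up for this sketch.

```python
def relative_error_reduction(baseline_accuracy, new_accuracy):
    """Fraction of the baseline's error that the new method removes."""
    baseline_error = 1.0 - baseline_accuracy
    new_error = 1.0 - new_accuracy
    return (baseline_error - new_error) / baseline_error


# Naive Bayes: 66.1% accurate, ECOC 63-bit: 88.5% accurate.
reduction = relative_error_reduction(0.661, 0.885)
print(f"{reduction:.1%}")
```

The error drops from 33.9% to 11.5%, so about two thirds of the baseline error is removed.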
The Longer the Better! Table 2: Average Classification Accuracy on 5 random train-test splits of the Industry Sector dataset with a vocabulary size of words selected using Information Gain. Longer codes mean larger codeword separation. The minimum Hamming distance of a code C is the smallest distance between any pair of distinct codewords in C. If the minimum Hamming distance is h, then the code can correct ⌊(h-1)/2⌋ errors
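The relation between minimum Hamming distance and correctable errors can be checked on a small code. The 7-bit code below is a hypothetical example, not one of the BCH codes from the experiments.

```python
from itertools import combinations


def min_hamming_distance(codewords):
    """Smallest Hamming distance between any pair of distinct codewords."""
    return min(
        sum(x != y for x, y in zip(a, b))
        for a, b in combinations(codewords, 2)
    )


def correctable_errors(h):
    """Number of bit errors a code with minimum distance h can correct."""
    return (h - 1) // 2


# Hypothetical 4-class, 7-bit code.
code = [
    [1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 1],
    [0, 1, 0, 1, 0, 1, 0],
]

h = min_hamming_distance(code)
print(h, correctable_errors(h))
```

Integer division implements the floor, so an even minimum distance such as 4 corrects only one error, the same as a distance of 3.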
Size Matters?
Size does NOT matter!
Semi-Theoretical Model Model ECOC by a Binomial Distribution B(n, p), where n = length of the code and p = probability of each bit being classified incorrectly. Table columns: # of Bits | H_min | E_max | P_ave | Accuracy
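One way to read this model, sketched under stated assumptions: bit errors are independent with the same probability p, H_min is the minimum Hamming distance of the code, E_max = ⌊(H_min - 1)/2⌋ is the number of errors it can correct, and an instance is classified correctly whenever at most E_max bits are wrong. The numbers in the example call are hypothetical, not values from the slide's table.

```python
from math import comb


def prob_at_most_errors(n, p, e_max):
    """P(X <= e_max) for X ~ Binomial(n, p): at most e_max of n bits wrong."""
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(e_max + 1)
    )


def model_accuracy(n, h_min, p):
    """Estimated accuracy of an n-bit code with minimum distance h_min."""
    e_max = (h_min - 1) // 2  # errors the code is guaranteed to correct
    return prob_at_most_errors(n, p, e_max)


# Hypothetical example: 15-bit code, minimum distance 5,
# each bit misclassified with probability 0.1.
print(round(model_accuracy(15, 5, 0.1), 4))
```

Under these assumptions the estimate is conservative: nearest-codeword decoding can still succeed when more than E_max bits are wrong.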
Types of Codes Data-Independent Data-Dependent Algebraic Random Hand-Constructed Adaptive
What is a Good Code? Row Separation Column Separation (Independence of errors for each binary classifier) Efficiency (for long codes)
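Column separation can be measured in the same way as row separation, with one extra consideration: a column and its complement define the same binary problem, so the distance to the complement matters too. The sketch below takes that view, which is an assumption of this example; the code matrix is hypothetical.

```python
from itertools import combinations


def column_separation(code):
    """Smallest distance between any two columns or their complements."""
    n_classes = len(code)
    columns = list(zip(*code))  # transpose: one tuple per binary problem
    best = n_classes
    for a, b in combinations(columns, 2):
        d = sum(x != y for x, y in zip(a, b))
        # Distance from a to the complement of b is n_classes - d.
        best = min(best, d, n_classes - d)
    return best


# Hypothetical 4-class, 7-bit code (rows are class codewords).
code = [
    [1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0, 0, 1],
    [0, 1, 0, 1, 0, 1, 0],
]

print(column_separation(code))
```

A result of 0 would mean two binary classifiers are learning the same split, so their errors would be fully correlated.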
Choosing Codes

              Random                        Algebraic
Row Sep       On average, for long codes    Guaranteed
Col Sep       On average, for long codes    Can be guaranteed
Efficiency    No                            Yes
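A random code can be sketched as below, assuming each bit is an independent fair coin flip and that columns placing every class in the same group are redrawn. The slides do not say how the random codes were generated, so this is only one plausible construction; the function names and the seed are invented for the example.

```python
import random
from itertools import combinations


def random_code(n_classes, n_bits, seed=0):
    """Draw an n_classes x n_bits random binary code matrix."""
    rng = random.Random(seed)
    columns = []
    while len(columns) < n_bits:
        column = [rng.randint(0, 1) for _ in range(n_classes)]
        # Skip columns that do not split the classes into two groups.
        if 0 < sum(column) < n_classes:
            columns.append(column)
    return [list(row) for row in zip(*columns)]


def min_row_distance(code):
    """Smallest Hamming distance between any two rows (class codewords)."""
    return min(
        sum(x != y for x, y in zip(a, b))
        for a, b in combinations(code, 2)
    )


# 105 classes as in the Industry Sector dataset, 15-bit code.
code = random_code(105, 15)
print(min_row_distance(code))
```

Nothing in this construction guarantees a minimum row distance, which is the contrast with algebraic codes drawn in the comparison above.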
Experimental Results
Code | Min Row HD | Max Row HD | Min Col HD | Max Col HD | Error Rate
15-Bit BCH %
19-Bit Hybrid %
15-bit Random 2 (1.5) %
Interesting Observations NBC does not give good probability estimates; using ECOC results in better estimates.
Drawbacks Can be computationally expensive. Random codes throw away the real-world nature of the data by picking random partitions to create artificial binary problems
Conclusion Improves Classification Accuracy considerably! Can be used when training data is sparse Algebraic codes perform better than random codes for a given code length Hand-constructed codes are not the answer
Future Work Combine ECOC with Co-Training Automatically construct optimal / adaptive codes Sufficient and Necessary conditions for optimal behavior
Future Work Combine ECOC with Co-Training or Shrinkage Methods Sufficient and Necessary conditions for optimal behavior