Combining Labeled and Unlabeled Data for Text Categorization with a Large Number of Categories
Rayid Ghani, KDD Lab Project
Supervised Learning with Labeled Data
- Labeled data is required in large quantities and can be very expensive to collect.
Why Use Unlabeled Data?
- Very cheap in the case of text: web pages, newsgroup messages
- May not be as useful as labeled data, but is available in enormous quantities
Goal
- Make learning easier and more efficient by reducing the amount of labeled data required for text classification with a large number of categories
- ECOC: very accurate and efficient for text categorization with a large number of classes
- Co-Training: useful for combining labeled and unlabeled data with a small number of classes
Related Research with Unlabeled Data
- EM in a generative model (Nigam et al. 1999)
- Transductive SVMs (Joachims 1999)
- Co-Training-style algorithms (Blum & Mitchell 1998; Collins & Singer 1999; Nigam & Ghani 2000)
What is ECOC?
- Solves multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995)
- Uses a binary learner to learn each of the binary problems
Training and Testing ECOC
[diagram: code matrix with rows for classes A-D and columns for binary functions f1-f5; training assigns each class its codeword, testing matches a new example X against the codewords]
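As a rough illustration of the decomposition (not the talk's own code), here is a minimal Python sketch: it uses a random code matrix and scikit-learn's MultinomialNB as the binary learner, whereas the experiments later in the deck use 15-bit error-correcting codes with better row/column separation; the function names are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def train_ecoc(X, y, n_classes, code_len=15, seed=0):
    """Train one binary naive Bayes per column of an n_classes x code_len code matrix."""
    rng = np.random.RandomState(seed)
    code = rng.randint(0, 2, size=(n_classes, code_len))  # row c = codeword for class c
    learners = []
    for bit in range(code_len):
        relabeled = code[y, bit]                 # map each multiclass label to its bit value
        learners.append(MultinomialNB().fit(X, relabeled))
    return code, learners

def predict_ecoc(X, code, learners):
    """Classify by Hamming-decoding the concatenated binary predictions."""
    bits = np.column_stack([clf.predict(X) for clf in learners])   # (n_docs, code_len)
    dists = (bits[:, None, :] != code[None, :, :]).sum(axis=2)     # Hamming distance to each codeword
    return dists.argmin(axis=1)                                    # nearest codeword wins
```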
The Co-Training Algorithm [Blum & Mitchell 1998]
Loop while unlabeled documents remain:
- Build classifiers A and B (naïve Bayes) from the labeled data
- Classify the unlabeled documents with A and B
- Add the most confident A predictions and the most confident B predictions as labeled training examples
[diagram: co-training loop — naïve Bayes on view A and naïve Bayes on view B learn from the labeled data, each estimates labels for the unlabeled documents, and the most confident predictions from each view are added to the labeled data (Blum & Mitchell, 1998)]
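A compressed sketch of this loop in Python (not the authors' code): scikit-learn's MultinomialNB stands in for the naïve Bayes classifiers, and the original algorithm's separate quotas of confident positive and negative examples are collapsed into a single top-k per view; all names below are illustrative.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(XA, XB, y, labeled, unlabeled, rounds=30, k=1):
    """XA/XB: the two feature views; y: labels (placeholder, e.g. -1, for unlabeled docs);
    labeled/unlabeled: index lists into the full dataset; k: docs each view adds per round."""
    y, labeled, unlabeled = np.array(y), list(labeled), list(unlabeled)
    clf_a = clf_b = None
    for _ in range(rounds):
        clf_a = MultinomialNB().fit(XA[labeled], y[labeled])
        clf_b = MultinomialNB().fit(XB[labeled], y[labeled])
        if not unlabeled:
            break
        for clf, X in ((clf_a, XA), (clf_b, XB)):
            proba = clf.predict_proba(X[unlabeled])
            best = np.argsort(proba.max(axis=1))[-k:]          # this view's most confident docs
            for i in best:
                y[unlabeled[i]] = clf.classes_[proba[i].argmax()]  # trust this view's label
                labeled.append(unlabeled[i])
            unlabeled = [d for j, d in enumerate(unlabeled) if j not in set(best)]
    return clf_a, clf_b
```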
One Intuition Behind Co-Training
- A and B are redundant views: the A features are independent of the B features
- Co-training is then like learning with random classification noise: the most confident A predictions look like random examples to B, with a small misclassification error determined by A
ECOC + Co-Training = ECoTrain
- ECOC decomposes multiclass problems into binary problems
- Co-Training works well on binary problems
- ECoTrain: learn each binary problem in ECOC with Co-Training
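A hypothetical sketch of how the two pieces compose, reusing the `co_train` function and the code matrix from the sketches above; `train_ecotrain` and the -1 "still unlabeled" convention are assumptions, not the paper's implementation.

```python
import numpy as np

# Assumes `code` (the n_classes x code_len matrix) and co_train() from the sketches above.
def train_ecotrain(XA, XB, y, labeled, unlabeled, code):
    """For every ECOC column, relabel the data to that binary problem and co-train on it."""
    learners = []
    for bit in range(code.shape[1]):
        # labeled docs get this column's bit; unlabeled docs keep -1 so co_train ignores them
        y_bit = np.where(np.array(y) >= 0, code[np.maximum(y, 0), bit], -1)
        learners.append(co_train(XA, XB, y_bit, labeled, unlabeled))   # (clf_a, clf_b) per bit
    return learners
```

At test time each bit's prediction (from either view, or a vote of the two) can be Hamming-decoded against the code matrix exactly as in `predict_ecoc` above.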
[diagram: example categories — SPORTS, SCIENCE, ARTS, HEALTH, POLITICS, LAW]
What happens with sparse data?
ECOC + Co-Training (ECoTrain) — Results, 105-class problem
[table: accuracy under three conditions — 300 labeled + 0 unlabeled per class, 50 labeled + 250 unlabeled per class, 5 labeled + 295 unlabeled per class — for Naïve Bayes and 15-bit ECOC (no unlabeled data) versus EM, Co-Training, and ECoTrain (ECOC + Co-Training), which use unlabeled data; numeric results not preserved here]
Datasets
- Hoovers-255: collection of 4285 corporate websites; each company is classified into one of 255 categories; baseline 2%
- Jobs-65 (from WhizBang): job postings with two feature sets (title, description); 65 categories; baseline 11%
Results
[table: accuracy on Jobs and Hoovers for Naïve Bayes and ECOC (no unlabeled data, at 10% and 100% labeled) versus EM, Co-Training, and ECOC + Co-Training (10% labeled); numeric results not preserved here]
Results
What Next?
- Use an improved version of co-training (gradient descent): less prone to random fluctuations, uses all unlabeled data at every iteration
- Use Co-EM (Nigam & Ghani 2000), a hybrid of EM and Co-Training
Summary
- Use ECOC for efficient text classification with a large number of categories
- Reduce code length without sacrificing performance
- Fix code length and increase performance
- Generalize to domain-independent classification tasks involving a large number of categories
The Feature Split Setting
[diagram: each web page has two views — the page's own text ("Fredkin Professor of AI…", "My research interests are…", "I like horsies!") and the anchor text of links pointing to it ("…My research advisor…", "…Professor Blum…", "…My grandson…"); classifier A uses one view, classifier B the other, and new pages (Tom Mitchell, Avrim Blum, Johnny) are to be labeled]
The Co-Training Setting
[diagram: the same two-view web-page example as above, illustrating the co-training setting]
Learning from Labeled and Unlabeled Data: Using Feature Splits
- Co-training [Blum & Mitchell 98]
- Meta-bootstrapping [Riloff & Jones 99]
- coBoost [Collins & Singer 99]
- Unsupervised WSD [Yarowsky 95]
Consider this the co-training setting.
Learning from Labeled and Unlabeled Data: Extending Supervised Learning
- MaxEnt Discrimination [Jaakkola et al. 99]
- Expectation-Maximization [Nigam et al. 98]
- Transductive SVMs [Joachims 99]
Using Unlabeled Data with EM
- Estimate labels of the unlabeled documents
- Use all documents to build a new naïve Bayes classifier
Co-Training vs. EM
- Co-training: uses the feature split, incremental labeling, hard labels
- EM: ignores the feature split, iterative labeling, probabilistic labels
Which differences matter?
Hybrids of Co-Training and EM
Labeling \ Uses feature split?   Yes           No
Incremental                      co-training   self-training
Iterative                        co-EM         EM
[diagram: each hybrid sketched as a loop over naïve Bayes classifiers on views A and B, alternating "estimate labels" with "learn from all" or "add only best"]
Text Classification with Naïve Bayes
- "Bag of words" document representation
- Naïve Bayes classification
- Estimate parameters of the generative model
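The formulas on this slide did not survive extraction; they are presumably the usual multinomial naïve Bayes equations (shown here with Laplace smoothing), where N(w, d) is the count of word w in document d, V is the vocabulary, and D_c the labeled documents of class c.

```latex
% Classify a document d = (w_1, ..., w_{|d|}) by the most probable class:
\[
  c^{*}(d) \;=\; \arg\max_{c}\; \hat{P}(c) \prod_{i=1}^{|d|} \hat{P}(w_i \mid c)
\]
% Laplace-smoothed parameter estimates from the labeled documents:
\[
  \hat{P}(w \mid c) \;=\;
    \frac{1 + \sum_{d \in D_c} N(w, d)}{|V| + \sum_{w' \in V} \sum_{d \in D_c} N(w', d)},
  \qquad
  \hat{P}(c) \;=\; \frac{|D_c|}{|D|}
\]
```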
The Feature Split Setting
[diagram: example documents ("Meanwhile the black fish swam far away.", "Experience the great thrill of our roller coaster.", "The speaker, Dr. Mary Rosen, will discuss effects…") split into two feature views, one seen by classifier A and one by classifier B; new examples to be labeled are marked ??]
Learning from Unlabeled Data Using Feature Splits
- coBoost [Collins & Singer 99]
- Meta-bootstrapping [Riloff & Jones 99]
- Unsupervised WSD [Yarowsky 95]
- Co-training [Blum & Mitchell 98]
Intuition Behind Co-Training
- A and B are redundant views: the A features are independent of the B features
- Co-training is then like learning with random classification noise: the most confident A predictions look like random examples to B, with a small misclassification error determined by A
Extending Supervised Learning with Unlabeled Data
- Transductive SVMs [Joachims 99]
- MaxEnt Discrimination [Jaakkola et al. 99]
- Expectation-Maximization [Nigam et al. 98]
Using Unlabeled Data with EM [Nigam, McCallum, Thrun & Mitchell, 1998]
- Initially learn from labeled data only
- Estimate labels of the unlabeled documents
- Use all documents to rebuild the naïve Bayes classifier
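A minimal sketch of this E/M loop (not the original code), assuming scikit-learn's MultinomialNB, scipy sparse term-count matrices, and that every class appears among the labeled documents; the probabilistic labels are emulated by replicating the unlabeled documents with sample weights, and details such as the convergence test are omitted.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_lab, y_lab, X_unlab, n_iter=10):
    clf = MultinomialNB().fit(X_lab, y_lab)        # initially learn from labeled data only
    classes = clf.classes_
    for _ in range(n_iter):
        # E-step: probabilistically label the unlabeled documents
        proba = clf.predict_proba(X_unlab)         # shape (n_unlab, n_classes)
        # M-step: rebuild naive Bayes from all documents, weighting unlabeled copies by P(c|d)
        X_all = sp.vstack([X_lab] + [X_unlab] * len(classes))
        y_all = np.concatenate([y_lab] + [np.full(X_unlab.shape[0], c) for c in classes])
        w_all = np.concatenate([np.ones(X_lab.shape[0])] +
                               [proba[:, j] for j in range(len(classes))])
        clf = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return clf
```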
Co-EM
[diagram: initialize with labeled data; naïve Bayes on view A estimates labels for all documents, naïve Bayes on view B is rebuilt with all the data, then the roles swap]
                 Uses feature split?
                 No      Yes
Label all        EM      co-EM
Label few                co-training
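A rough Python sketch of co-EM under the same assumptions as the EM sketch above (MultinomialNB, sparse count matrices, every class present in the labeled data, illustrative function names): each view probabilistically labels every unlabeled document and the other view retrains on all of the data, alternating.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.naive_bayes import MultinomialNB

def fit_weighted(X_lab, y_lab, X_unlab, proba, classes):
    """Fit naive Bayes on labeled docs plus unlabeled docs soft-labeled with P(c|d)."""
    X_all = sp.vstack([X_lab] + [X_unlab] * len(classes))
    y_all = np.concatenate([y_lab] + [np.full(X_unlab.shape[0], c) for c in classes])
    w_all = np.concatenate([np.ones(X_lab.shape[0])] +
                           [proba[:, j] for j in range(len(classes))])
    return MultinomialNB().fit(X_all, y_all, sample_weight=w_all)

def co_em(XA_lab, XB_lab, y_lab, XA_unlab, XB_unlab, n_iter=10):
    clf_a = MultinomialNB().fit(XA_lab, y_lab)     # initialize with labeled data (view A)
    classes = clf_a.classes_
    clf_b = None
    for _ in range(n_iter):
        proba_a = clf_a.predict_proba(XA_unlab)    # A estimates labels for all unlabeled docs
        clf_b = fit_weighted(XB_lab, y_lab, XB_unlab, proba_a, classes)  # B learns from all data
        proba_b = clf_b.predict_proba(XB_unlab)    # B estimates labels
        clf_a = fit_weighted(XA_lab, y_lab, XA_unlab, proba_b, classes)  # A learns from all data
    return clf_a, clf_b
```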