Combining Labeled and Unlabeled Data for Multiclass Text Categorization
Rayid Ghani, Accenture Technology Labs
Supervised Learning with Labeled Data
Labeled data is required in large quantities and can be very expensive to collect.
Why Use Unlabeled Data?
- Very cheap in the case of text: web pages, newsgroup messages
- May not be as useful as labeled data, but is available in enormous quantities
Recent Work with Text and Unlabeled Data
- Expectation-Maximization (Nigam et al., 1999)
- Co-Training (Blum & Mitchell, 1998)
- Transductive SVMs (Joachims, 1999)
- Co-EM (Nigam & Ghani, 2000)
BUT…
- Most empirical studies have focused only on binary classification tasks
- ALL Co-Training papers have used two-class datasets
- The largest dataset used with EM was 20 Newsgroups (Nigam et al., 1999)
Do current semi-supervised approaches work for “real” and “multiclass” problems?
Empirical Evaluation
Apply EM and Co-Training to multiclass, "real-world" data sets:
- Job descriptions
- Web pages
The EM Algorithm
- Learn a Naïve Bayes classifier from the labeled data
- Estimate labels for the unlabeled data
- Add the unlabeled documents, weighted probabilistically, to the labeled data and repeat (see the sketch below)
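A rough sketch of this loop (mine, not the talk's code), using scikit-learn's multinomial Naïve Bayes. For simplicity it uses hard labels in the E-step, whereas the original approach weights each unlabeled document by its posterior class probabilities; all function and variable names are illustrative.

```python
# Minimal sketch of semi-supervised EM with Naive Bayes (after Nigam et al.).
# X_labeled / X_unlabeled are assumed to be sparse bag-of-words matrices.
import numpy as np
from scipy.sparse import vstack
from sklearn.naive_bayes import MultinomialNB

def em_naive_bayes(X_labeled, y_labeled, X_unlabeled, n_iter=10):
    clf = MultinomialNB()
    clf.fit(X_labeled, y_labeled)              # initialize from labeled data
    for _ in range(n_iter):
        # E-step: estimate labels for the unlabeled documents
        # (hard-label simplification; EM proper uses posterior weights)
        y_unlabeled = clf.predict(X_unlabeled)
        # M-step: retrain on labeled + provisionally labeled data
        clf.fit(vstack([X_labeled, X_unlabeled]),
                np.concatenate([y_labeled, y_unlabeled]))
    return clf
```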
The Co-Training Algorithm (Blum & Mitchell, 1998)
- Learn a Naïve Bayes classifier from the labeled data on each of the two feature sets A and B
- Each classifier estimates labels for the unlabeled data
- Select each classifier's most confident predictions, add them to the labeled data, and repeat (see the sketch below)
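Likewise, a minimal sketch of the co-training loop under assumed names: X_a and X_b are the two views of the same documents, and the per-round growth size k is illustrative (the original adds a fixed number of positive and negative examples per round rather than a flat top-k).

```python
# Minimal sketch of Co-Training (after Blum & Mitchell) with two feature views.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X_a, X_b, y, labeled_idx, unlabeled_idx, rounds=20, k=5):
    labeled, pool, y = list(labeled_idx), list(unlabeled_idx), y.copy()
    clf_a, clf_b = MultinomialNB(), MultinomialNB()
    for _ in range(rounds):
        clf_a.fit(X_a[labeled], y[labeled])
        clf_b.fit(X_b[labeled], y[labeled])
        for clf, X in ((clf_a, X_a), (clf_b, X_b)):
            if not pool:
                break
            probs = clf.predict_proba(X[pool])
            # move this view's k most confident predictions to the labeled set
            top = np.argsort(-probs.max(axis=1))[:k]
            new = [pool[i] for i in top]
            y[new] = clf.classes_[probs[top].argmax(axis=1)]
            labeled.extend(new)
            pool = [p for p in pool if p not in set(new)]
    return clf_a, clf_b
```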
Data Sets
- Jobs-65 (from WhizBang!): job postings with two feature sets (title and description); 65 categories; baseline 11%
- Hoovers-255 (used in Ghani et al. and Yang et al., 2002): web pages collected from 4,285 corporate websites, each company classified into one of 255 categories; no natural feature split, so the vocabulary is partitioned randomly; baseline 2%
Mixed Results
[Table: Naïve Bayes vs. EM vs. Co-Training on Jobs and Hoovers; numbers not recoverable]
Unlabeled data helps on one dataset but hurts on the other!
How to Extend the Unlabeled-Data Framework to Multiclass Problems?
- One option: decompose the N-class problem into N binary problems
- With Error-Correcting Output Codes (ECOC), the N-class problem can be decomposed into fewer than N binary problems (see the size comparison below)
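As a back-of-the-envelope illustration (mine, not from the talk): distinguishing N classes needs only about ceil(log2 N) bits, so an output code can be far shorter than the one-binary-problem-per-class decomposition; in practice longer codes are used to buy error-correcting separation.

```python
# Lower bound on code length: ceil(log2(N)) bits suffice to give every
# class a distinct codeword. Sizes match the Jobs-65 and Hoovers-255 data.
import math
for n_classes in (65, 255):
    print(n_classes, "->", math.ceil(math.log2(n_classes)), "bits")
    # 65 -> 7 bits, 255 -> 8 bits, versus 65 and 255 one-vs-rest problems
```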
- ECOC (Error-Correcting Output Coding) is very accurate and efficient for text categorization with a large number of classes (Berger '99, Ghani '00)
- Co-Training is useful for combining labeled and unlabeled data with a small number of classes
What is ECOC?
- Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri, 1995)
- Learn the binary problems
Training and Testing ECOC
[Figure: a code matrix over classes A-D with binary functions f1-f4; each class row is a codeword. Training learns each fj; testing evaluates f1-f4 on a document x and assigns the class whose codeword is closest.]
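To make the picture concrete, here is a minimal sketch (my reconstruction, with an illustrative 4-bit code and Naïve Bayes as the binary learner) of ECOC training and Hamming decoding:

```python
# Minimal sketch of ECOC training and decoding (after Dietterich & Bakiri).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# One row (codeword) per class A-D, one column per binary function f1-f4.
CODE = np.array([[1, 1, 0, 0],   # class A
                 [1, 0, 1, 0],   # class B
                 [0, 1, 0, 1],   # class C
                 [0, 0, 1, 1]])  # class D

def train_ecoc(X, y):
    """Train one binary classifier per column of the code matrix."""
    classifiers = []
    for j in range(CODE.shape[1]):
        y_binary = CODE[y, j]            # each example's bit in column j
        classifiers.append(MultinomialNB().fit(X, y_binary))
    return classifiers

def predict_ecoc(classifiers, X):
    """Assign each document the class whose codeword is nearest in Hamming distance."""
    bits = np.column_stack([clf.predict(X) for clf in classifiers])
    distances = (bits[:, None, :] != CODE[None, :, :]).sum(axis=2)
    return distances.argmin(axis=1)
```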
- ECOC can recover from the errors of the individual classifiers
- Longer codes mean larger codeword separation
- The minimum Hamming distance of a code C is the smallest distance between any pair of codewords in C
- If the minimum Hamming distance is h, then the code can correct up to floor((h-1)/2) errors
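The bound is easy to check mechanically; this snippet (my example) computes the minimum Hamming distance of the illustrative code above:

```python
# Minimum Hamming distance h of a code, and the floor((h - 1) / 2) bound.
from itertools import combinations
import numpy as np

def min_hamming_distance(code):
    return min(int((a != b).sum()) for a, b in combinations(code, 2))

code = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 1]])
h = min_hamming_distance(code)
print(h, (h - 1) // 2)   # h = 2: this short code corrects 0 errors;
                         # longer codes raise h and hence the bound
```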
ECOC + Co-Training
- ECOC decomposes multiclass problems into binary problems
- Co-Training works well on binary problems
- ECOC + Co-Training: learn each binary problem in the ECOC decomposition with Co-Training
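Putting the two sketches together, a minimal version of the combination (assumed, reusing CODE and co_train from the sketches above; not the authors' code) simply swaps the supervised binary learner for a co-trained pair:

```python
# Minimal sketch: learn each ECOC bit with Co-Training instead of a purely
# supervised binary learner. Reuses CODE and co_train() defined above.
import numpy as np

def train_ecoc_cotrain(X_a, X_b, y, labeled_idx, unlabeled_idx):
    classifiers = []
    for j in range(CODE.shape[1]):
        # relabel the labeled examples by their class's bit in column j
        y_bits = np.zeros(len(y), dtype=int)
        y_bits[labeled_idx] = CODE[y[labeled_idx], j]
        # one co-trained pair of view classifiers per code column
        classifiers.append(co_train(X_a, X_b, y_bits,
                                    labeled_idx, unlabeled_idx))
    # decoding can follow predict_ecoc, e.g. with the two views'
    # bit predictions combined by averaging predict_proba outputs
    return classifiers
```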
Important Caveat
[Figure contrasting "normal" binary problems with "ECOC-induced" binary problems, whose positive and negative regions each pool many unrelated classes]
Results
[Table: Naïve Bayes, ECOC, EM, Co-Training, and ECOC + Co-Training on Jobs and Hoovers, with 10% and 100% of the data labeled, and with unlabeled data added to the 10%-labeled condition; numbers not recoverable]
Results - Summary
- More labeled data helps
- ECOC outperforms Naïve Bayes (more pronounced with less labeled data)
- Co-Training and EM do not always use unlabeled data effectively
- The combination of ECOC and Co-Training shows strong results
Important Side-Effects
- Extremely efficient (sublinear in the number of categories)
- High-precision results
A Closer Look…
Results
Summary
- Combination of ECOC and Co-Training for learning with labeled and unlabeled data
- Effective in terms of both accuracy and efficiency
- High-precision classification
Teaser…
Does this only work with Co-Training?
Next Steps
- Use active learning techniques to pick the initial labeled examples
- Develop techniques to partition a standard feature set into two redundantly sufficient feature sets
- Create codes specifically for unlabeled data