Co-training LING 572 Fei Xia 02/21/06

Overview Proposed by Blum and Mitchell (1998) Important work: –(Nigam and Ghani, 2000) –(Goldman and Zhou, 2000) –(Abney, 2002) –(Sarkar, 2002) –… Used in document classification, parsing, etc.

Outline Basic concept: (Blum and Mitchell, 1998) Relation with other SSL algorithms: (Nigam and Ghani, 2000)

An example Web-page classification: e.g., find homepages of faculty members. –Page text: words occurring on that page e.g., “research interest”, “teaching” –Hyperlink text: words occurring in hyperlinks that point to that page: e.g., “my advisor”

Two views Features can be split into two sets: –The instance space: X = X1 × X2 –Each example: x = (x1, x2) D: the distribution over X. C1: the set of target functions over X1. C2: the set of target functions over X2.

Assumption #1: compatibility The instance distribution D is compatible with the target function f = (f1, f2) if for any x = (x1, x2) with non-zero probability, f(x) = f1(x1) = f2(x2). The compatibility of f with D: the probability mass of examples on which f1 and f2 agree. ⇒ Each set of features is sufficient for classification
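One plausible formalization of this definition, following the Blum and Mitchell (1998) setup (the exact form the slide used is an assumption):

\text{compatibility of } f = (f_1, f_2) \text{ with } D \;=\; \Pr_{(x_1, x_2) \sim D}\big[\, f_1(x_1) = f_2(x_2) \,\big], \qquad f \text{ fully compatible with } D \iff \Pr_{(x_1, x_2) \sim D}\big[\, f_1(x_1) \neq f_2(x_2) \,\big] = 0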

Assumption #2: conditional independence The two views x1 and x2 of an example are conditionally independent given the class label.
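In symbols, conditional independence of the two views given the label y means:

\Pr(x_1, x_2 \mid y) \;=\; \Pr(x_1 \mid y)\,\Pr(x_2 \mid y) \quad \text{for every class label } y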

Co-training algorithm Given: a set L of labeled examples and a set U of unlabeled examples. Create a pool U' by choosing u examples at random from U. Loop for k iterations: –Use L to train a classifier h1 that considers only the x1 view –Use L to train a classifier h2 that considers only the x2 view –Allow h1 to label the p positive and n negative examples from U' about which it is most confident; allow h2 to do the same –Add these self-labeled examples to L –Replenish U' by choosing 2p + 2n examples from U at random

Co-training algorithm (cont) Why use U', in addition to U? –Using U' yields better results. –Possible explanation: this forces h1 and h2 to select examples that are more representative of the underlying distribution D that generates U. Choosing p and n: the ratio p/n should match the ratio of positive examples to negative examples in D. Other choices: the number of iterations and the size of U'.

Intuition behind the co-training algorithm h1 adds examples to the labeled set that h2 will be able to use for learning, and vice versa. If the conditional independence assumption holds, then on average each added document will be as informative as a random document, and the learning will progress.
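A minimal Python sketch of this loop, assuming a Naive Bayes learner per view (as in the experiments below); the function signature, the use of scikit-learn's MultinomialNB, and the pool bookkeeping are illustrative choices, not part of the original algorithm statement:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1, X2, labeled_idx, y_labeled, unlabeled_idx,
             p=1, n=3, pool_size=75, iters=30, seed=0):
    """Sketch of Blum & Mitchell (1998) co-training with one Naive Bayes
    classifier per view. X1, X2: word-count matrices for the two views
    (rows are the same examples); y_labeled: 0/1 labels (1 = positive)."""
    rng = np.random.default_rng(seed)
    label_of = dict(zip(labeled_idx, y_labeled))          # example index -> label
    unlabeled = list(unlabeled_idx)
    rng.shuffle(unlabeled)
    pool = [unlabeled.pop() for _ in range(min(pool_size, len(unlabeled)))]

    h1, h2 = MultinomialNB(), MultinomialNB()
    for _ in range(iters):
        idx = list(label_of)
        lab = [label_of[i] for i in idx]
        h1.fit(X1[idx], lab)                              # view-1 classifier
        h2.fit(X2[idx], lab)                              # view-2 classifier

        for h, X in ((h1, X1), (h2, X2)):
            if not pool:
                break
            # confidence that each pool example is positive
            pos = h.predict_proba(X[pool])[:, list(h.classes_).index(1)]
            ranked = np.argsort(pos)
            top_pos, top_neg = list(ranked[-p:]), list(ranked[:n])
            new = {pool[j]: 1 for j in top_pos}
            new.update({pool[j]: 0 for j in top_neg if pool[j] not in new})
            for j in sorted(set(top_pos + top_neg), reverse=True):
                pool.pop(j)                               # remove chosen examples from U'
            label_of.update(new)

        while unlabeled and len(pool) < pool_size:        # replenish U' from U
            pool.append(unlabeled.pop())
    return h1, h2, label_of

With p = 1, n = 3, pool_size = 75, and iters = 30 this matches the settings reported for the web-page experiment later in the deck.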

Experiments: setting 1051 web pages from 4 CS departments, manually labeled into a number of categories, e.g., "course home page". –263 pages (25%) as test data –The remaining 75% of pages: labeled data: 3 positive and 9 negative examples; unlabeled data: the rest (776 pages) Two views: –View #1 (page-based): words in the page –View #2 (hyperlink-based): words in the hyperlinks that point to the page Learner: Naïve Bayes

Naïve Bayes classifier (Nigam and Ghani, 2000)
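A compact sketch of a multinomial Naive Bayes text classifier in the spirit of the one used by Nigam and Ghani (2000); the Laplace smoothing constant and the dense count-matrix representation are assumptions made for brevity:

import numpy as np

def train_nb(X, y, alpha=1.0):
    """Multinomial Naive Bayes for text (sketch). X: dense (docs x vocab)
    word-count matrix; y: class labels; alpha: Laplace smoothing."""
    y = np.asarray(y)
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    counts = np.vstack([X[y == c].sum(axis=0) for c in classes]) + alpha
    log_word = np.log(counts / counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_word

def predict_nb(X, classes, log_prior, log_word):
    """argmax_c  log P(c) + sum_w count(w, d) * log P(w | c)"""
    return classes[np.argmax(X @ log_word.T + log_prior, axis=1)]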

Experiment: results [table] Performance of the page-based, hyperlink-based, and combined classifiers compared under supervised training and co-training; settings: p = 1, n = 3, number of iterations: 30, |U'| = 75.

Questions Can co-training algorithms be applied to datasets without natural feature divisions? How sensitive are the co-training algorithms to the correctness of the assumptions? What is the relation between co-training and other SSL methods (e.g., self-training)?

(Nigam and Ghani, 2000)

EM Pool the features together. Use the initial labeled data to get initial parameter estimates. In each iteration, use all the data (labeled and unlabeled) to re-estimate the parameters. Repeat until convergence.
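A rough sketch of that procedure with the Naive Bayes model sketched above; the fixed iteration count (in place of a convergence test) and the soft, probabilistic labels for the unlabeled documents are assumptions about details the slide leaves implicit:

import numpy as np

def em_nb(X_l, y_l, X_u, iters=10, alpha=1.0):
    """Semi-supervised EM with a single pooled-feature Naive Bayes model (sketch).
    X_l, X_u: dense word-count matrices for labeled / unlabeled docs; y_l: labels."""
    y_l = np.asarray(y_l)
    classes = np.unique(y_l)
    resp_l = (y_l[:, None] == classes[None, :]).astype(float)   # one-hot labels, kept fixed
    X_all = np.vstack([X_l, X_u])

    def m_step(resp, X):                                        # re-estimate priors and word probs
        prior = resp.sum(axis=0) / resp.sum()
        counts = resp.T @ X + alpha
        return np.log(prior), np.log(counts / counts.sum(axis=1, keepdims=True))

    log_prior, log_word = m_step(resp_l, X_l)                   # initialize from labeled data only
    for _ in range(iters):                                      # (slide: repeat until convergence)
        log_post = X_u @ log_word.T + log_prior                 # E-step: soft-label unlabeled docs
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        resp_u = post / post.sum(axis=1, keepdims=True)
        log_prior, log_word = m_step(np.vstack([resp_l, resp_u]), X_all)  # M-step: all data
    return classes, log_prior, log_word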

Experimental results: WebKB course database EM performs better than co-training. Both come close to the supervised method when more labeled data is available.

Another experiment: The News 2*2 dataset A semi-artificial dataset Conditional independence assumption holds. Co-training outperforms EM and the “oracle” result.

Co-training vs. EM Co-training splits the features; EM does not. Co-training incrementally uses the unlabeled data, labeling a few examples per round; EM probabilistically labels all the data at each round and uses the unlabeled data iteratively.

Co-EM: EM with feature split Repeat until convergence: –Train the A-feature-set classifier using the labeled data and the unlabeled data with B's labels. –Use classifier A to probabilistically label all the unlabeled data. –Train the B-feature-set classifier using the labeled data and the unlabeled data with A's labels. –B re-labels the data for use by A.
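A short sketch of this loop, reusing the Naive Bayes estimates from the earlier sketches; the helper names and the fixed number of rounds in place of a convergence check are assumptions:

import numpy as np

def co_em(X1_l, X2_l, y_l, X1_u, X2_u, iters=10, alpha=1.0):
    """Co-EM sketch: split the features like co-training, but have each view's
    classifier probabilistically label ALL the unlabeled data for the other view."""
    y_l = np.asarray(y_l)
    classes = np.unique(y_l)
    resp_l = (y_l[:, None] == classes[None, :]).astype(float)   # one-hot labels for L

    def fit(X, resp):                                           # weighted Naive Bayes estimates
        prior = resp.sum(axis=0) / resp.sum()
        counts = resp.T @ X + alpha
        return np.log(prior), np.log(counts / counts.sum(axis=1, keepdims=True))

    def soft_labels(X, log_prior, log_word):                    # probabilistic labels from one view
        log_post = X @ log_word.T + log_prior
        post = np.exp(log_post - log_post.max(axis=1, keepdims=True))
        return post / post.sum(axis=1, keepdims=True)

    a = fit(X1_l, resp_l)                                       # classifier A: labeled data only
    for _ in range(iters):                                      # (slide: repeat until convergence)
        resp_a = soft_labels(X1_u, *a)                          # A probabilistically labels U
        b = fit(np.vstack([X2_l, X2_u]), np.vstack([resp_l, resp_a]))  # B: L plus U with A's labels
        resp_b = soft_labels(X2_u, *b)                          # B re-labels U for A
        a = fit(np.vstack([X1_l, X1_u]), np.vstack([resp_l, resp_b]))  # A: L plus U with B's labels
    return a, b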

Four SSL methods Results on the News 2*2 dataset

Random feature split Co-training: 3.7% → 5.5% Co-EM: 3.3% → 5.1% ⇒ When the conditional independence assumption does not hold, but there is sufficient redundancy among the features, co-training still works well.

Assumptions Assumptions made by the underlying classifier (supervised learner): –Naïve Bayes: words occur independently of each other, given the class of the document. –Co-training uses the classifier to rank the unlabeled examples by confidence. –EM uses the classifier to assign probabilities to each unlabeled example. Assumptions made by the SSL method: –Co-training: the conditional independence assumption. –EM: maximizing likelihood correlates with reducing classification errors.

Summary of (Nigam and Ghani, 2000) Comparison of four SSL methods: self-training, co-training, EM, co-EM. The performance of the SSL methods depends on how well the underlying assumptions are met. Randomly splitting the features is not as good as a natural split, but it still works if there is sufficient redundancy among the features.

Variations of co-training Goldman and Zhou (2000) use two learners of different types, but both take the whole feature set. Zhou and Li (2005) use three learners: if two agree, the data is used to teach the third learner. Balcan et al. (2005) relax the conditional independence assumption to a much weaker expansion condition.

An alternative? L → L1, L → L2; U → U1, U → U2 Repeat: –Train h1 using L1 on Feature Set 1 –Train h2 using L2 on Feature Set 2 –Classify U2 with h1 and let U2' be the subset with the most confident scores; L2 + U2' → L2, U2 − U2' → U2 –Classify U1 with h2 and let U1' be the subset with the most confident scores; L1 + U1' → L1, U1 − U1' → U1

Yarowsky's algorithm One-sense-per-discourse ⇒ View #1: the ID of the document that a word is in. One-sense-per-collocation ⇒ View #2: the local context of the word in the document. "Yarowsky's algorithm is a special case of co-training (Blum & Mitchell, 1998)": is this correct? No, according to (Abney, 2002).

Summary of co-training The original paper (Blum and Mitchell, 1998): –Two "independent" views: split the features into two sets. –Train a classifier on each view. –Each classifier labels data that can then be used to train the other classifier. Extensions: –Relax the conditional independence assumption. –Instead of using two views, use two or more classifiers trained on the whole feature set.

Summary of SSL Goal: use both labeled and unlabeled data. Many algorithms: EM, co-EM, self-training, co-training, … Each algorithm is based on some assumptions. SSL works well when the assumptions are satisfied.

Additional slides

Rule independence H1 (H2) consists of rules that are functions of X1 (X2, resp) only.

EM: the data is generated according to some simple known parametric model. –Ex: the positive examples are generated according to an n-dimensional Gaussian D+ centered around a fixed point.