Semi-supervised learning and self-training LING 572 Fei Xia 02/14/06
Outline Overview of Semi-supervised learning (SSL) Self-training (a.k.a. Bootstrapping)
Additional References Xiaojin Zhu (2006): Semi-supervised learning literature survey. Olivier Chapelle et al. (2005): Semi-Supervised Learning. The MIT Press.
Overview of SSL
What is SSL? Labeled data: –Ex: POS tagging: tagged sentences –Creating labeled data is difficult, expensive, and/or time-consuming. Unlabeled data: –Ex: POS tagging: untagged sentences. –Obtaining unlabeled data is easier. Goal: use both labeled and unlabeled data to improve performance.
Learning: –Supervised (labeled data only) –Semi-supervised (both labeled and unlabeled data) –Unsupervised (unlabeled data only) Problems: –Classification –Regression –Clustering –… We focus on the semi-supervised classification problem.
A brief history of SSL The idea of self-training appeared in the 1960s. SSL took off in the 1970s. Interest in SSL increased in the 1990s, mostly due to applications in NLP.
Does SSL work? Yes, under certain conditions. –Problem itself: knowledge of p(x) carries information that is useful for inferring p(y | x). –Algorithm: the modeling assumptions fit well with the problem structure. SSL will be most useful when there is far more unlabeled data than labeled data. SSL can degrade performance when mistakes reinforce themselves.
Illustration (Zhu, 2006)
Illustration (cont)
Assumptions Smoothness (continuity) assumption: if two points x1 and x2 in a high-density region are close, then so should be the corresponding outputs y1 and y2. Cluster assumption: if points are in the same cluster, they are likely to be of the same class. Low-density separation: the decision boundary should lie in a low-density region. …
SSL algorithms Self-training Co-training Generative models: –Ex: EM with generative mixture models Low-density separation: –Ex: Transductive SVM Graph-based models
Which SSL method should we use? It depends. Semi-supervised methods make strong model assumptions. Choose the ones whose assumptions fit the problem structure.
Semi-supervised and active learning They address the same issue: labeled data are hard to get. Semi-supervised: choose the unlabeled data to be added to the labeled data. Active learning: choose the unlabeled data to be annotated.
Self-training
Basics of self-training Probably the earliest SSL idea. Also called self-teaching or bootstrapping. Appeared in the 1960s and 1970s. First well-known NLP paper: (Yarowsky, 1995)
Self-training algorithm Let L be the set of labeled data, U be the set of unlabeled data. Repeat –Train a classifier h with training data L –Classify data in U with h –Find a subset U' of U with the most confident scores. –L + U' → L –U – U' → U
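A minimal sketch of this loop (not from the original slides): the train and classify callables, the confidence threshold, and the iteration cap are all placeholder assumptions.

def self_train(L, U, train, classify, threshold=0.95, max_iter=10):
    # L: list of (x, y) pairs; U: list of unlabeled x.
    # train(L) -> classifier h; classify(h, x) -> (label, confidence).
    for _ in range(max_iter):
        h = train(L)
        newly_labeled, remaining = [], []
        for x in U:
            y, conf = classify(h, x)
            if conf >= threshold:          # the "most confident" subset U'
                newly_labeled.append((x, y))
            else:
                remaining.append(x)
        if not newly_labeled:              # nothing confident enough: stop
            break
        L = L + newly_labeled              # L + U' -> L
        U = remaining                      # U - U' -> U
    return train(L)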
Case study: (Yarowsky, 1995)
Setting Task: WSD –Ex: plant: living / factory "Unsupervised": only needs a few seed collocations for each sense. Learner: decision list (DL)
Assumption #1: One sense per collocation Nearby words provide strong and consistent clues to the sense of a target word. The effect varies depending on the type of collocation –It is strongest for immediately adjacent collocations. Assumption #1 → use collocations in the decision rules.
Assumption #2: One sense per discourse The sense of a target word is highly consistent within any given document. The assumption holds most of the time (99.8% in their experiment). Assumption #2 → filter and augment the addition of unlabeled data.
Step 1: identify all examples of the given word ("plant"). This is our sample set S.
Step 2: Create initial labeled data using a small number of seed collocations Sense A: "life" Sense B: "manufacturing" This gives our L(0); U(0) = S – L(0) is the residual data set.
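A small sketch of this seeding step, under assumed data structures (each example is just its list of context tokens; the seed words and the sense labels "A"/"B" mirror the slide):

SEEDS = {"life": "A", "manufacturing": "B"}   # seed collocation -> sense label

def seed_label(S):
    # S: list of contexts, each a list of tokens surrounding "plant".
    L0, U0 = [], []
    for context in S:
        senses = {SEEDS[w] for w in context if w in SEEDS}
        if len(senses) == 1:               # exactly one seed sense matched
            L0.append((context, senses.pop()))
        else:                              # no seed, or conflicting seeds
            U0.append(context)
    return L0, U0                          # L(0) and the residual U(0) = S - L(0)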
Initial “labeled” data
Step 3a: Train a DL classifier
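A rough sketch of a decision-list trainer in the spirit of Yarowsky (1995); the bag-of-words features, the smoothing constant alpha, and the exact scoring are simplifying assumptions rather than the paper's full feature set and formula.

import math
from collections import Counter

def train_decision_list(L, alpha=0.1):
    # L: list of (context_tokens, sense) pairs with senses "A" / "B".
    # Features here are simply the words co-occurring with the target word.
    counts = Counter()
    for context, sense in L:
        for w in set(context):
            counts[(w, sense)] += 1
    rules = []
    for w in {w for (w, _) in counts}:
        a = counts[(w, "A")] + alpha       # smoothed counts for each sense
        b = counts[(w, "B")] + alpha
        sense = "A" if a > b else "B"
        strength = abs(math.log(a / b))    # smoothed log-likelihood ratio
        rules.append((strength, w, sense))
    rules.sort(reverse=True)               # strongest evidence first
    return rules

def classify_dl(rules, context):
    # Apply the single highest-ranked rule whose collocation appears.
    for strength, w, sense in rules:
        if w in context:
            return sense, strength
    return None, 0.0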
Step 3b: Apply the classifier to the entire set Add to L the members of U that are tagged with probability above a threshold.
Step 3c: filter and augment this addition with assumption #2
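A sketch of the filtering part of this step, assuming each labeled example carries a document (discourse) id; relabeling to the per-document majority sense is one simple reading of assumption #2, not the paper's exact procedure.

from collections import Counter

def one_sense_per_discourse(labeled):
    # labeled: list of (doc_id, context, sense) triples.
    # Within each document, relabel every occurrence with the majority sense.
    by_doc = {}
    for doc_id, context, sense in labeled:
        by_doc.setdefault(doc_id, []).append((context, sense))
    out = []
    for doc_id, items in by_doc.items():
        majority = Counter(s for _, s in items).most_common(1)[0][0]
        out.extend((doc_id, context, majority) for context, _ in items)
    return out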
Repeat step 3 until convergence
The final DL classifier
The original algorithm: keep the initial labeling unchanged.
Options for obtaining the seeds Use words in dictionary definitions Use a single defining collocate for each sense: –Ex: “bird” and “machine” for the word “crane” Label salient corpus collocates: –Collect frequent collocates –Manually check the collocates Getting the seeds is not hard.
Performance Baseline: 63.9% Supervised learning: 96.1% Unsupervised learning: 90.6%
Why does it work? (Steven Abney, 2004): “Understanding the Yarowsky Algorithm”
Summary of self-training The algorithm is straightforward and intuitive. It produces outstanding results. Potential problem: the added unlabeled data can pollute the original labeled data.
Additional slides
Papers on self-training Yarowsky (1995): WSD Riloff et al. (2003): identify subjective nouns Maeireizo et al. (2004): classify dialogues as “emotional” or “non-emotional”.