Report on Semi-supervised Training for Statistical Parsing
Zhang Hao
Brief Introduction
–Why semi-supervised training?
–Co-training framework and applications
–Can parsing fit in this framework? How?
–Conclusion
Why Semi-supervised Training?
A compromise between supervised and unsupervised learning
Pay-offs:
–Minimize the need for labeled data
–Maximize the value of unlabeled data
–Easy portability
Co-training Scenario
Idea: two different students learn from each other, incrementally and mutually improving ("when two walk together, each can be the other's teacher")
Difference (motive) -> mutual learning (optimization) -> agreement (objective)
Task: optimize the objective function of agreement
Heuristic selection is important: what to learn?
[Blum & Mitchell, 98] Co-training Assumptions
Classification problem
Feature redundancy:
–Allows different views of the data
–Each view is sufficient for classification on its own
Conditional independence of the views, given the class
[Blum & Mitchell, 98] Co-training Example
"Course home page" classification (yes/no)
Two views: page content text / anchor text of links pointing to the page (an idealized example: two sides of a coin)
Two naïve Bayes classifiers, one per view: they should agree
[Blum & Mitchell, 98] Co-training Algorithm
Given:
–A set L of labeled training examples
–A set U of unlabeled examples
Create a pool U' of examples by choosing u examples at random from U
Loop for k iterations:
–Use L to train a classifier h1 that considers only the x1 portion of x
–Use L to train a classifier h2 that considers only the x2 portion of x
–Allow h1 to label p positive and n negative examples from U'
–Allow h2 to label p positive and n negative examples from U'
–Add these self-labeled examples to L
–Randomly choose 2p+2n examples from U to replenish U'
Notes: n:p matches the ratio of negative to positive examples; the selected examples are the "most confidently" labeled ones, i.e. heuristic selection
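The loop above can be sketched as a toy Python implementation. The one-view frequency classifier, the binary 1/0 labels, and the confidence measure here are illustrative assumptions, not the naïve Bayes models of the original paper:

```python
import random
from collections import Counter, defaultdict

def train_view(examples, view):
    """Toy one-view classifier: per feature value, store label counts;
    confidence is the empirical frequency of the majority label."""
    counts = defaultdict(Counter)
    for x, y in examples:
        counts[x[view]][y] += 1
    def classify(x):
        c = counts.get(x[view])
        if not c:
            return 1, 0.5            # no evidence: default guess
        label, freq = c.most_common(1)[0]
        return label, freq / sum(c.values())
    return classify

def cotrain(labeled, unlabeled, k=5, u=20, p=1, n=1):
    """Blum & Mitchell-style co-training loop (sketch).  Each example x is a
    pair of views (x1, x2); labels are 1 (positive) / 0 (negative)."""
    L = list(labeled)
    U = list(unlabeled)
    random.shuffle(U)
    pool = [U.pop() for _ in range(min(u, len(U)))]   # the pool U'
    for _ in range(k):
        h1 = train_view(L, view=0)   # classifier over the x1 view
        h2 = train_view(L, view=1)   # classifier over the x2 view
        for h in (h1, h2):
            # each classifier moves its p most confident positives and
            # n most confident negatives from the pool into L
            pos = sorted((x for x in pool if h(x)[0] == 1),
                         key=lambda x: h(x)[1], reverse=True)[:p]
            neg = sorted((x for x in pool if h(x)[0] == 0),
                         key=lambda x: h(x)[1], reverse=True)[:n]
            for x in pos + neg:
                L.append((x, h(x)[0]))
                pool.remove(x)
        # replenish the pool from U
        while len(pool) < u and U:
            pool.append(U.pop())
    return train_view(L, 0), train_view(L, 1)
```

With two perfectly redundant views the two classifiers bootstrap each other from just one labeled example per class.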
Family of Algorithms Related to Co-training [Nigam & Ghani 2000]

Method      | Feature Split (Yes) | Feature Split (No)
Incremental | Co-training         | Self-training
Iterative   | Co-EM               | EM
Parsing as Supertagging and Attaching [Sarkar 2001]
How parsing differs from other NLP applications (WSD, WBPC, TC, NEI):
–A tree vs. a label
–Composite vs. monolithic
–Large vs. small parameter space
LTAG:
–Each word is tagged with a lexicalized elementary tree (supertagging)
–Parsing is a process of substitution and adjoining of elementary trees
–The supertagger finishes a very large part of the job a traditional parser must do
A Glimpse of Supertags
Two Models to Co-train
H1: selects elementary trees based on the previous context (tagging probability model)
H2: computes attachments between trees and returns the best parse (parsing probability model)
[Sarkar 2001] Co-training Algorithm
1. Input: labeled and unlabeled data
2. Update the cache: randomly select sentences from the unlabeled data and refill the cache; if the cache is empty, exit
3. Train models H1 and H2 on the labeled data
4. Apply H1 and H2 to the cache
5. Pick the n most probable outputs from H1 (run through H2) and add them to the labeled data
6. Pick the n most probable outputs from H2 and add them to the labeled data
7. n = n + k; go to step 2
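A minimal sketch of this control flow, assuming hypothetical trainer functions that return scoring models of the form score(sentence) -> (parse, probability); the real H1/H2 are a supertagger and an LTAG parser, and random cache refilling is simplified here:

```python
def sarkar_cotrain(labeled, unlabeled, train_h1, train_h2,
                   n=5, k=5, cache_size=50):
    """Sketch of the iterative co-training control flow above.
    train_h1/train_h2 are hypothetical trainers: labeled data -> scorer."""
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    cache = []
    while True:
        # step 2: refill the cache from the unlabeled data
        # (random selection simplified to taking from the end)
        while len(cache) < cache_size and unlabeled:
            cache.append(unlabeled.pop())
        if not cache:
            break                      # cache empty: exit
        # step 3: retrain both models on the current labeled set
        h1, h2 = train_h1(labeled), train_h2(labeled)
        # steps 4-6: each model labels the cache; its n most probable
        # outputs move into the labeled set
        for h in (h1, h2):
            best = sorted(cache, key=lambda s: h(s)[1], reverse=True)[:n]
            for s in best:
                labeled.append((s, h(s)[0]))
                cache.remove(s)
        # step 7: grow the batch size
        n += k
    return labeled
```

The growing batch size n means each pass commits more self-labeled material than the last, so the unlabeled pool is consumed at an accelerating rate.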
JHU Summer Workshop 2002 (SW2002) Tasks
–Co-train the Collins CFG parser with Sarkar's LTAG parser
–Co-train re-rankers
–Co-train CCG supertaggers and parsers
Co-training: The Algorithm
Requires:
–Two learners with different views of the task
–A Cache Manager (CM) to interface with the disparate learners
–A small set of labeled seed data and a larger pool of unlabeled data
Pseudo-code:
–Init: train both learners on the labeled seed data
–Loop: the CM picks unlabeled data to add to the cache; both learners label the cache; the CM selects newly labeled data to add to each learner's training set; the learners re-train
Novel Methods: Parse Selection
Want to select training examples for one parser (the student), labeled by the other (the teacher), so as to minimize noise and maximize training utility:
–Top-n: choose the n examples to which the teacher assigned the highest scores
–Difference: choose the examples to which the teacher assigned a higher score than the student did, by some threshold
–Intersection: choose the examples that received high scores from the teacher but low scores from the student
–Disagreement: choose the examples for which the two parsers produced different analyses and the teacher assigned a higher score than the student
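Assuming each parser's output is summarized as a sentence -> confidence score mapping (and, for Disagreement, a sentence -> best parse mapping), which is a simplification of the workshop's actual setup, the four selection methods might look like:

```python
def top_n(teacher, n):
    """Top-n: the n sentences the teacher scored highest."""
    return sorted(teacher, key=teacher.get, reverse=True)[:n]

def difference(teacher, student, threshold):
    """Difference: teacher's score exceeds the student's by a threshold."""
    return [s for s in teacher if teacher[s] - student[s] > threshold]

def intersection(teacher, student, hi, lo):
    """Intersection: high teacher score but low student score
    (hi/lo cutoffs are illustrative parameters)."""
    return [s for s in teacher if teacher[s] >= hi and student[s] <= lo]

def disagreement(teacher, student, teacher_parse, student_parse):
    """Disagreement: the parses differ and the teacher is more confident."""
    return [s for s in teacher
            if teacher_parse[s] != student_parse[s]
            and teacher[s] > student[s]]
```

Note the common intuition: Top-n trusts teacher confidence alone, while the other three target sentences where the student stands to learn something it does not already know.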
Effect of Parse Selection
CFG-LTAG Co-training
Re-rankers Co-training
What is re-ranking?
–A re-ranker reorders the output of an n-best (probabilistic) parser based on features of the parse
–While parsers use local features to make decisions, re-rankers can use features that span the entire tree
–Instead of co-training parsers, co-train different re-rankers
Re-rankers Co-training
Motivation: why re-rankers?
–Speed: parse the data once, re-rank it many times
–Objective function: the lower runtime of re-rankers allows us to explicitly maximize agreement between parses
Re-rankers Co-training
Motivation: why re-rankers?
–Accuracy: re-rankers can improve the performance of existing parsers; Collins '00 reports a 13 percent reduction in error rate from re-ranking
–Task closer to classification: a re-ranker can be seen as a binary classifier (either a parse is the best one for a sentence or it isn't); this is the original domain co-training was intended for
Re-rankers Co-training
Still experimental, with much to be explored; remember that a re-ranker is easier to develop
–Re-ranker 1: log-linear model
–Re-ranker 2: linear perceptron model
Room for improvement:
–Current best parser: 89.7
–Oracle that picks the best parse from the top 50: 95+
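A minimal sketch of the second model type, a linear perceptron re-ranker over n-best lists; the sparse feature-dict format and training interface are illustrative assumptions, not the workshop's actual models:

```python
def perceptron_rerank_train(nbest_lists, epochs=10):
    """Train a sparse perceptron to rank the gold parse first.
    nbest_lists: list of (candidates, gold_index) pairs, where each
    candidate parse is a {feature_name: value} dict (hypothetical format)."""
    w = {}
    def score(feats):
        # dot product of the weight vector with a sparse feature vector
        return sum(w.get(f, 0.0) * v for f, v in feats.items())
    for _ in range(epochs):
        for candidates, gold in nbest_lists:
            best = max(range(len(candidates)),
                       key=lambda i: score(candidates[i]))
            if best != gold:
                # perceptron update: promote the gold parse's features,
                # demote the wrongly top-ranked parse's features
                for f, v in candidates[gold].items():
                    w[f] = w.get(f, 0.0) + v
                for f, v in candidates[best].items():
                    w[f] = w.get(f, 0.0) - v
    return w, score
```

At re-ranking time one simply returns the candidate with the highest score, which is what makes the model cheap to re-run many times over the same parsed data.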
JHU SW2002 Conclusions
–The largest experimental study to date on the use of unlabeled data for improving parser performance
–Co-training enhances performance for parsers and taggers trained on small amounts (500–10,000 sentences) of labeled data
–Co-training can port parsers trained on one genre to another without any new human-labeled data at all, improving on the state of the art for this task
–Even tiny amounts of human-labeled data for the target genre enhance porting via co-training
–New methods for parse selection have been developed and play a crucial role
How to Improve Our Parser?
Similar setting: limited labeled data (Penn CTB) and a large amount of unlabeled data from a somewhat different domain (PKU People's Daily)
To try:
–Re-rankers' development cycle is much shorter, so they are worth trying; many ML techniques may be utilized
–Re-rankers' agreement is still an open question