Bootstrapping Feature-Rich Dependency Parsers with Entropic Priors. David A. Smith, Jason Eisner. Johns Hopkins University.

Presentation transcript:

Bootstrapping Feature-Rich Dependency Parsers with Entropic Priors. David A. Smith, Jason Eisner. Johns Hopkins University.

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 2: Only Connect…
[Diagram: a trained (dependency) parser sits between learning resources (training trees, raw text, parallel & comparable corpora, out-of-domain text) and downstream applications (textual entailment, LM, IE, lexical semantics, MT); pointers: Weischedel 2004, Quirk et al., Pantel & Lin 2002.]

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 3: Outline: Bootstrapping Parsers
- What kind of parser should we train?
- How should we train it semi-supervised?
- Does it work? (initial experiments)
- How can we incorporate other knowledge?

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 4: Re-estimation: EM or Viterbi EM
[Diagram: the trained parser parses raw text, and its own output parses are fed back as training data.]

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 5: Re-estimation: EM or Viterbi EM (iterate the process)
Oops! Not much supervised training, so most of these parses were bad. Retraining on all of them overwhelms the good supervised data.

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 6: Simple Bootstrapping: Self-Training
So only retrain on "good" parses...
[Diagram: the trained-parser retraining loop, now keeping only the "good" parses.]

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 7: Simple Bootstrapping: Self-Training
So only retrain on "good" parses... at least, those the parser itself thinks are good. (Can we trust it? We'll see...)
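A minimal sketch of this self-training loop, for concreteness. It is illustrative only: the Parser interface, its confidence score, and the 0.9 threshold are hypothetical stand-ins, not the system described in the talk.

```python
# Minimal self-training sketch. Parser, its methods, and the confidence
# threshold are hypothetical stand-ins for illustration.

def self_train(parser, seed_trees, raw_sentences, rounds=5, threshold=0.9):
    """Retrain on the seed trees plus the parser's own confident parses."""
    parser.train(seed_trees)                            # supervised start
    for _ in range(rounds):
        confident_trees = []
        for sentence in raw_sentences:
            tree, confidence = parser.parse(sentence)   # 1-best parse + model confidence
            if confidence >= threshold:                 # keep only "good" parses
                confident_trees.append(tree)
        parser.train(seed_trees + confident_trees)      # never drop the supervised data
    return parser
```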

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 8: Why Might This Work?
Sure, now we avoid harming the parser with bad training. But why do we learn anything new from the unsupervised data?
[Diagram: a parse of an unsupervised sentence, with features colored by weight: green = positive, red = negative, gray = unknown.]
After training, the training parses have
- many features with positive weights
- few features with negative weights
But unsupervised parses have
- few positive or negative features
- mostly unknown features (words or situations not seen in training data)
Still, sometimes there are enough positive features to be sure it's the right parse.

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 9: Why Might This Work?
Sure, we avoid bad guesses that harm the parser. But why do we learn anything new from the unsupervised data?
Still, sometimes there are enough positive features to be sure it's the right parse.
Now, retraining the weights makes the gray (and red) features greener.

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 10: Why Might This Work?
Sure, we avoid bad guesses that harm the parser. But why do we learn anything new from the unsupervised data?
Still, sometimes there are enough positive features to be sure it's the right parse.
Now, retraining the weights makes the gray (and red) features greener... and makes features redder for the "losing" parses of this sentence (not shown). Learning!

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 11: This Story Requires Many Redundant Features!
- Bootstrapping for WSD (Yarowsky 1995): lots of contextual features → success
- Co-training for parsing (Steedman et al. 2003): feature-poor parsers → disappointment
- Self-training for parsing (McClosky et al. 2006): feature-poor parsers → disappointment; reranker with more features → success
More features → more chances to identify the correct parse even when we're undertrained.

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 12: This Story Requires Many Redundant Features!
So, let's bootstrap a feature-rich parser! In our experiments so far, we follow McDonald et al. (2005):
- our model has 450 million features (on Czech)
- pruned down to 90 million frequent features
- about 200 are considered per possible edge
Note: even more features are proposed at the end of the talk.
More features → more chances to identify the correct parse even when we're undertrained.

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 13: Edge-Factored Parsers (McDonald et al. 2005)
- No global features of a parse
- Each feature is attached to some edge
- Simple; allows fast O(n²) or O(n³) parsing
Example sentence: Byl jasný studený dubnový den a hodiny odbíjely třináctou ("It was a bright cold day in April and the clocks were striking thirteen")

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 14: Edge-Factored Parsers (McDonald et al. 2005)
Byl jasný studený dubnový den a hodiny odbíjely třináctou ("It was a bright cold day in April and the clocks were striking thirteen")
Is this a good edge? Yes, lots of green...

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 15: Edge-Factored Parsers (McDonald et al. 2005)
Is this a good edge? jasný → den ("bright day")

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 16: Edge-Factored Parsers (McDonald et al. 2005)
POS tags: V A A A N J N V C
Is this a good edge? jasný → den ("bright day"); jasný → N ("bright NOUN")

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 17: Edge-Factored Parsers (McDonald et al. 2005)
Is this a good edge? jasný → den ("bright day"); jasný → N ("bright NOUN"); A → N

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 18: Edge-Factored Parsers (McDonald et al. 2005)
Is this a good edge? jasný → den ("bright day"); jasný → N ("bright NOUN"); A → N; A → N preceding a conjunction

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 19: Edge-Factored Parsers (McDonald et al. 2005)
How about this competing edge? Not as good, lots of red...

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 20: Edge-Factored Parsers (McDonald et al. 2005)
How about this competing edge? jasný → hodiny ("bright clocks")... undertrained...

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 21: Edge-Factored Parsers (McDonald et al. 2005)
Stems: být- jasn- stud- dubn- den- a- hodi- odbí- třin-
How about this competing edge? jasný → hodiny ("bright clocks")... undertrained... jasn- → hodi- ("bright clock," stems only)

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 22: Edge-Factored Parsers (McDonald et al. 2005)
How about this competing edge? jasný → hodiny ("bright clocks")... undertrained... jasn- → hodi- ("bright clock," stems only); A plural → N singular

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 23: Edge-Factored Parsers (McDonald et al. 2005)
How about this competing edge? jasný → hodiny ("bright clocks")... undertrained... jasn- → hodi- ("bright clock," stems only); A plural → N singular; A → N where N follows a conjunction
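A sketch of the kinds of overlapping edge features these slides walk through (word pair, word-tag, tag pair, stem pair, agreement, and conjunction context). The token fields and template names below are hypothetical illustrations; the actual feature set follows McDonald et al. (2005).

```python
# Illustrative edge-feature templates. The token fields ('form', 'stem', 'tag',
# 'num') and the template names are hypothetical, not the real feature set.

def edge_features(sent, head, child):
    """sent: list of token dicts; head, child: token indices. Returns feature names."""
    h, c = sent[head], sent[child]
    feats = [
        f"word:{h['form']}>{c['form']}",   # e.g. jasny > den ("bright day")
        f"word-tag:{h['form']}>{c['tag']}",  # e.g. jasny > N ("bright NOUN")
        f"tag:{h['tag']}>{c['tag']}",        # e.g. A > N
        f"stem:{h['stem']}>{c['stem']}",     # e.g. jasn- > hodi- (stems only)
        f"agree:{h['tag']}.{h['num']}>{c['tag']}.{c['num']}",  # e.g. A.pl > N.sg
    ]
    if child > 0 and sent[child - 1]["tag"] == "J":  # child immediately follows a conjunction
        feats.append(f"tag:{h['tag']}>{c['tag']}|after-conj")
    return feats
```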

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 24: Edge-Factored Parsers (McDonald et al. 2005)
Which edge is better, "bright day" or "bright clocks"?

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 25: Edge-Factored Parsers (McDonald et al. 2005)
Lemmas: být jasný studený dubnový den a hodiny odbít třináct
Which edge is better?
Score of an edge e = θ · features(e), where θ is our current weight vector.
Standard algorithms → the valid parse with maximum total score.

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 26: Edge-Factored Parsers (McDonald et al. 2005)
Which edge is better?
Score of an edge e = θ · features(e), where θ is our current weight vector.
Standard algorithms → the valid parse with maximum total score.
[In the figure, candidate edges conflict: two edges can't both be chosen if a word would get two parents, or if the links would cross; three edges can't all be chosen if they would form a cycle.]
Thus, an edge may lose (or win) because of a consensus of other edges. Retraining then learns to reduce (or increase) its score.
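A sketch of edge-factored decoding under these constraints: score every possible head-child edge with θ · features(e), then take the highest-scoring tree. The sketch uses networkx's Chu-Liu/Edmonds routine for the non-projective (MST) setting used in the experiments; score_edge is a hypothetical stand-in for the real model.

```python
# Sketch of edge-factored MST decoding (non-projective). score_edge(head, child)
# stands in for theta . features(e); node 0 is an artificial ROOT.
import networkx as nx

def decode(n_words, score_edge):
    """Return {child_index: head_index} for tokens 1..n_words."""
    G = nx.DiGraph()
    for child in range(1, n_words + 1):
        for head in range(n_words + 1):          # 0 = ROOT
            if head != child:
                G.add_edge(head, child, weight=score_edge(head, child))
    # A maximum spanning arborescence gives each word exactly one parent and
    # forbids cycles; crossing links are allowed in the non-projective setting.
    tree = nx.maximum_spanning_arborescence(G, attr="weight")
    return {child: head for head, child in tree.edges()}
```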

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 27: Only Connect…
[Diagram repeated from slide 2: the trained dependency parser between learning resources (training trees, raw text, parallel & comparable corpora, out-of-domain text) and applications (textual entailment, LM, IE, lexical semantics, MT).]

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 28: Only retrain on "good" parses... at least, those the parser itself thinks are good. Can we recast this declaratively?

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 29: Can we recast this declaratively?
[Diagram of the bootstrapping loop: seed set → train classifier → label examples → select examples with high confidence → new labeled set → retrain.]

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 30: Bootstrapping as Optimization
Maximize a function on supervised and unsupervised data:
- try to predict the supervised parses
- try to be confident on the unsupervised parses (entropy regularization: Brand 1999; Grandvalet & Bengio; Jiao et al.)
Yesterday's talk: how to compute these quantities for non-projective models. See Hwa '01 for projective tree entropy.
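A sketch of the kind of objective described here: conditional log-likelihood on the labeled seed set minus an entropy penalty on the unlabeled sentences. The trade-off weight γ and the set names L and U are illustrative notation, not taken from the paper.

```latex
% Illustrative semi-supervised objective (the notation \gamma, \mathcal{L}, \mathcal{U} is ours):
\max_{\theta} \;
  \sum_{(x,y) \in \mathcal{L}} \log p_\theta(y \mid x)
  \;-\; \gamma \sum_{x \in \mathcal{U}} H\bigl(p_\theta(\cdot \mid x)\bigr)
```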

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 31: Claim: gradient descent on this objective function works like bootstrapping.
[Plot: entropy H against p, the probability of parse A; H ≈ 0 when we are sure of parse A or of parse B, H ≈ 1 when we are not sure; the gradient ∂H/∂p pushes confident examples toward certainty and is flat where we are unsure.]
- When we're pretty sure the true parse is A or B, we reduce entropy H by becoming even surer (≈ "retraining" on the example).
- When we're not sure, the example doesn't affect θ (≈ not retraining on the example).

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 32: Claim: gradient descent on this objective function works like bootstrapping.
In the paper, we generalize: replace Shannon entropy H(·) with Rényi entropy H_α(·). This gives us a tunable parameter α:
- connect to Abney's view of bootstrapping (α = 0)
- obtain a Viterbi variant (limit as α → ∞)
- obtain a Gini variant (α = 2)
- still get Shannon entropy (limit as α → 1)
Rényi entropy is also easier to compute in some circumstances.
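For reference, the standard Rényi entropy of order α over the parse distribution, with the limiting cases the slide lists; this is the textbook definition, sketched here rather than copied from the paper.

```latex
% Renyi entropy of order \alpha of the parse distribution p_\theta(\cdot \mid x):
H_\alpha\bigl(p_\theta(\cdot \mid x)\bigr)
  \;=\; \frac{1}{1-\alpha} \log \sum_{y} p_\theta(y \mid x)^{\alpha}
% \alpha \to 1: Shannon entropy  -\sum_y p(y \mid x) \log p(y \mid x)
% \alpha = 2:   Gini variant     -\log \sum_y p(y \mid x)^2   ("-log expected 0/1 gain")
% \alpha \to \infty: Viterbi variant  -\log \max_y p(y \mid x)
% \alpha = 0:   log of the number of parses with nonzero probability (Abney's setting)
```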

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 33: Experimental Questions
- Are confident parses (or edges) actually good for retraining?
- Does bootstrapping help accuracy?
- What is being learned?

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 34: Experimental Design
- Czech, German, and Spanish (some Bulgarian): CoNLL-X dependency trees
- Non-projective (MST) parsing
- Hundreds of millions of features
- Supervised training sets of 100 & 1000 trees (ridiculously small pilot experiments, sorry)
- Unparsed but tagged sets of 2k to 70k sentences
- Stochastic gradient descent: first optimize just likelihood on the seed set, then optimize likelihood + confidence criterion on all data; stop when accuracy peaks on development data
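A rough sketch of that two-stage schedule. The gradient routines, step size, confidence weight gamma, and early-stopping rule are hypothetical stand-ins, not the talk's actual settings.

```python
import numpy as np

# Rough sketch of the two-stage training schedule: likelihood-only on the seed
# trees, then likelihood plus a confidence (negative-entropy) term on all data,
# stopping when dev accuracy peaks. All hyperparameters are illustrative.

def train(theta, seed, unlabeled, dev, grad_loglik, grad_entropy, accuracy,
          lr=0.01, gamma=0.1, max_epochs=50):
    theta = np.asarray(theta, dtype=float)
    for _ in range(max_epochs):                      # stage 1: seed likelihood only
        for x, y in seed:
            theta = theta + lr * grad_loglik(theta, x, y)
    best_theta, best_acc = theta.copy(), accuracy(theta, dev)
    for _ in range(max_epochs):                      # stage 2: add the confidence term
        for x, y in seed:
            theta = theta + lr * grad_loglik(theta, x, y)
        for x in unlabeled:
            theta = theta - lr * gamma * grad_entropy(theta, x)  # prefer confident parses
        acc = accuracy(theta, dev)
        if acc <= best_acc:                          # stop when dev accuracy peaks
            break
        best_theta, best_acc = theta.copy(), acc
    return best_theta
```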

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 35: Are confident parses accurate?
[Figure: correlation of each confidence criterion with parse accuracy.]
Criteria compared:
- Shannon entropy
- "Viterbi" self-training
- Gini = -log(expected 0/1 gain)
- log(# of parses): favors short sentences; Abney's Yarowsky algorithm

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 36: How Accurate Is Bootstrapping?
[Chart: accuracy with the 100-tree supervised set alone (baseline) vs. adding +2K, +37K, and +71K unparsed sentences.]
Significant on a paired permutation test.

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 37: How Does Bootstrapping Learn?
[Figure: precision and recall of the confidently predicted edges; precision around 90%.]
Maybe enough precision so retraining doesn't hurt; maybe enough recall so retraining will learn new things.

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 38: Bootstrapping vs. EM
Compare on a feature-poor model that EM can handle (the DMV); 100 training trees, plus 100 dev trees for model selection; two ways to add unsupervised data.
[Table of accuracies by language (Bulgarian, German, Spanish) for EM (joint), the supervised baselines MLE (joint) and MLE (cond.), and Boot. (cond.).]

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 39: There's No Data Like More Data
[Diagram repeated from slide 2: the trained parser between learning resources (training trees, raw text, parallel & comparable corpora, out-of-domain text) and applications (textual entailment, LM, IE, lexical semantics, MT).]

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 40: "Token" Projection
What if some sentences have parallel text?
Project 1-best English dependencies (Hwa et al. '04)???
- Imperfect or free translation
- Imperfect parse
- Imperfect alignment
No. Just use them to get further noisy features.
[Figure: the Czech sentence Byl jasný studený dubnový den a hodiny odbíjely třináctou aligned to its English translation "It was a bright cold day in April and the clocks were striking thirteen."]

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 41: "Token" Projection
What if some sentences have parallel text?
[Figure: the same aligned Czech-English sentence pair.]
The candidate A → N edge probably aligns to some English link.

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 42: "Token" Projection
What if some sentences have parallel text?
[Figure: the same aligned Czech-English sentence pair.]
Another candidate edge probably aligns to some English path N → in → N. Cf. "quasi-synchronous grammars" (Smith & Eisner, 2006).
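A sketch of the noisy-feature idea from these slides: rather than projecting a 1-best English parse, record how the two aligned English words relate in the English parse and let the learner weight that evidence. The data structures below are hypothetical illustrations, not the talk's actual feature set.

```python
# Hypothetical sketch of "token projection" features for one candidate Czech edge.
# align maps Czech token index -> English token index (or None);
# en_head maps English child index -> its head index in the English parse.

def projection_features(cz_head, cz_child, align, en_head):
    feats = []
    eh, ec = align.get(cz_head), align.get(cz_child)
    if eh is None or ec is None:
        return feats                                  # unaligned: no evidence either way
    if en_head.get(ec) == eh:
        feats.append("aligned-to-English-link")       # the aligned English words are directly linked
    elif en_head.get(en_head.get(ec)) == eh:
        feats.append("aligned-to-English-path-2")     # linked by a two-step path, e.g. N -> in -> N
    return feats
```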

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 43: "Type" Projection
Can we use world knowledge, e.g., from comparable corpora?
The Czech words probably translate as English words that usually have an N-V link when cosentential.
Evidence from a parsed Gigaword corpus for the pair clock-strike:
- …will no longer be royal when the clock strikes midnight.
- But when the clock strikes 11 a.m. and the race cars rocket…
- …vehicles and pedestrians after the clock struck eight.
- …when the clock of a no-passenger Airbus A-320 struck…
- …born right after the clock struck 12:00 p.m. of December…
- …as the clock in Madrid’s Plaza del Sol strikes 12 times.

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 44: "Type" Projection
Can we use world knowledge, e.g., from comparable corpora?
The Czech words probably translate as English words that usually have an N-V link when cosentential.
[Figure: candidate English translations for each Czech word, e.g. byl → be, exist, subsist; jasný → bright, broad, cheerful, pellucid, straight, …; studený → cold, fresh, hyperborean, stone-cold; dubnový → April; den → day, daytime; a → and, plus; hodiny → clock, meter, metre; odbíjely → strike; třináctou → thirteen.]

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 45: Conclusions
- A declarative view of bootstrapping as entropy minimization
- Improvements in parser accuracy with feature-rich models
- Easily added features from alternative data sources, e.g., comparable text
- In future: consider also the WSD decision-list learner: is it important for learning robust feature weights?

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 46: Thanks
Noah Smith, Keith Hall, the anonymous reviewers, and Ryan McDonald for making his code available.

Extra slides …

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 48: Dependency Treebanks

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 49: A Supervised CoNLL-X System. What system was this?

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 50: How Does Bootstrapping Learn?
[Figure panels: supervised, iteration 1; supervised, iteration 10; bootstrapping with R_2; bootstrapping with R_∞.]

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 51: How Does Bootstrapping Learn?
[Table: accuracy (%) and number of features updated (millions), by which weights are retrained: all; none (0 features, 60.9%); seed vs. non-seed; lexical vs. non-lexical; bilexical vs. non-bilexical.]

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 52: Review: Yarowsky's bootstrapping algorithm
Target word: plant. [Table taken from Yarowsky (1995): a few seed examples are labeled with the "life" sense (1%) and the "manufacturing" sense (1%); the remaining 98% are unlabeled.]

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 53: Review: Yarowsky's bootstrapping algorithm
[Figure taken from Yarowsky (1995).]
Should be a good classifier, unless we accidentally learned some bad cues along the way that polluted the original sense distinction.

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 54: Review: Yarowsky's bootstrapping algorithm
[Figure taken from Yarowsky (1995).]
Learn a classifier that distinguishes A from B. It will notice features like "animal" → A, "automate" → B.

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 55: Review: Yarowsky's bootstrapping algorithm
[Figure taken from Yarowsky (1995).]
That confidently classifies some of the remaining examples. Now learn a new classifier and repeat… & repeat…

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 56: Bootstrapping: Pivot Features
Lots of overlapping features, vs. a PCFG (McClosky et al.)
[Examples: for WSD, "Sat beside the river bank", "Sat on the bank", "Run on the bank"; for parsing, overlapping phrases such as "quick and sly fox", "sly and crafty fox", "quick of sly fox gait the".]

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 57: Bootstrapping as Optimization (Abney 2004)
Given a "labeling" distribution p̃, the log likelihood to maximize is the expected log-likelihood under p̃. On labeled data, p̃ is 1 at the label and 0 elsewhere, so this reduces to ordinary supervised training.
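A sketch of the objective this slide describes, following Abney (2004); the exact notation on the slide's (missing) formulas may differ.

```latex
% Expected log-likelihood under the labeling distribution \tilde{p}:
\ell(\theta, \tilde{p}) \;=\; \sum_{x} \sum_{y} \tilde{p}(y \mid x) \,\log p_\theta(y \mid x)
% On labeled data, \tilde{p}(\cdot \mid x) is a point mass at the observed label y_x,
% so the objective reduces to ordinary supervised training:
%   \ell(\theta) = \sum_{x} \log p_\theta(y_x \mid x)
```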

EMNLP-CoNLL, 29 June 2007, David A. Smith & Jason Eisner, slide 58: Triangular Trade
[Diagram relating Data (???), Features (words, tags, translations, …), Models (globally normalized; projective/non-projective; parent prediction; inside/outside; matrix-tree), and Objectives (LL; EM; Abney's K; entropy regularization; derivational (Rényi) entropy).]