Unsupervised Structure Prediction with Non-Parallel Multilingual Guidance. Shay B. Cohen, Dipanjan Das, Noah A. Smith, Carnegie Mellon University. EMNLP 2011, July 27.

Goal: learn linguistic structure for a language without any labeled data in that language. Two tasks: part-of-speech tagging (e.g., tagging "The Skibo Castle is close by." with coarse tags such as DET, NOUN, VERB, ADJ, ADP, .) and dependency parsing.

This work: Multilingual Unsupervised Learning with no parallel data. Prior work varies along two axes: whether parallel data is used (no parallel data is the harder setting), and whether it relies on supervision in source language(s) or on joint learning for multiple languages. With parallel data and source-language supervision: Yarowsky and Ngai (2001), Xi and Hwa (2005), Smith and Eisner (2009), Das and Petrov (2011), McDonald et al. (2011). Joint learning for multiple languages: Snyder et al. (2009), Naseem et al. (2010), and, without parallel data, Cohen and Smith (2009) and Berg-Kirkpatrick and Klein (2010). This work fills the remaining cell: no parallel data, with supervision in source language(s).

In a Nutshell (target: Portuguese; helpers: Spanish and Italian). Annotated data in the helper languages yields coarse, universal parameters. Interpolating these parameters via unsupervised training on unlabeled Portuguese data yields coarse parameters for Portuguese. Coarse-to-fine expansion and initialization, followed by monolingual unsupervised training in Portuguese, yields the final Portuguese parameters.

Assumptions for a given problem: 1. The underlying model is generative, e.g., an HMM for POS tagging ("The Skibo Castle is close by"), as in Merialdo (1994).

Assumptions for a given problem: 1. The underlying model is generative, e.g., the DMV for dependency parsing (Klein and Manning, 2004), which generates a tree over coarse tags (DET, NOUN, VERB, ADJ, ADP) rooted at ROOT.

Assumptions for a given problem: 1. The underlying model is generative and composed of multinomial distributions, e.g., the HMM (Merialdo, 1994).

Assumptions for a given problem: 1. The underlying model is generative and composed of multinomial distributions, e.g., the DMV (Klein and Manning, 2004).

Assumptions for a given problem: 1. The underlying model is generative. In general, an unlexicalized parameter is written $\theta_{k,i}$, the probability of the $i$-th event in the $k$-th multinomial of the model (e.g., the transition from ADJ to NOUN in an HMM).

Assumptions for a given problem: 1. The underlying model is generative. The lexicalized parameters take a similar form (the DMV has no lexicalized parameters).

Assumptions for a given problem: 1. The underlying model is generative. The probability of a derivation factors into unlexicalized and lexicalized parts, $p(x, y) = \prod_{k} \prod_{i} \theta_{k,i}^{\,f_{k,i}(x,y)}$, where $k$ ranges over both unlexicalized and lexicalized multinomials and $f_{k,i}(x,y)$ is the number of times event $i$ of multinomial $k$ fires in the derivation.
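
To make the factorization concrete, here is a minimal Python sketch (my own illustration, not the authors' code; the multinomial name and all numbers are made up) that computes a derivation's probability from parameters theta[k][i] and event counts f[k][i]:

```python
import math

def derivation_log_prob(theta, counts):
    """Log-probability of a derivation that factors into multinomial events.

    theta[k][i]  -- probability of event i in multinomial k
    counts[k][i] -- number of times event i of multinomial k fires
                    in the derivation
    """
    logp = 0.0
    for k, events in counts.items():
        for i, f in events.items():
            logp += f * math.log(theta[k][i])
    return logp

# Toy example: one transition multinomial out of ADJ (made-up numbers).
theta = {"trans:ADJ": {"NOUN": 0.7, "ADJ": 0.2, "VERB": 0.1}}
counts = {"trans:ADJ": {"NOUN": 2, "ADJ": 1}}  # ADJ->NOUN twice, ADJ->ADJ once
print(math.exp(derivation_log_prob(theta, counts)))  # 0.7**2 * 0.2 = 0.098
```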

Assumptions for a given problem: 2. Coarse, universal part-of-speech tags: VERB, NOUN, PRON, ADJ, ADV, ADP, DET, CONJ, NUM, PRT, ., X.

Assumptions for a given problem: 2. Coarse, universal part-of-speech tags. For each language, there is a mapping from its treebank tagset to the coarse tags.
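
Purely for illustration, such a mapping can be written as a Python dictionary; the entries below are a small assumed fragment of a Penn-Treebank-to-coarse mapping, not the full mapping used in the paper:

```python
# A small, illustrative fragment of a fine-to-coarse tag mapping
# (Penn Treebank tags -> the 12 coarse tags above); not the full mapping.
PTB_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "VBP": "VERB", "VBZ": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "DT": "DET", "IN": "ADP", "CC": "CONJ", "CD": "NUM",
    "RB": "ADV", "PRP": "PRON", "RP": "PRT", ".": ".",
}

def to_coarse(fine_tags):
    """Map a sequence of treebank tags to coarse tags ('X' if unmapped)."""
    return [PTB_TO_COARSE.get(t, "X") for t in fine_tags]

print(to_coarse(["DT", "NNP", "NNP", "VBZ", "JJ", "IN", "."]))
```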

Assumptions for a given problem: 3. Helper languages. For each helper language: convert its treebank to the coarse tagset, then estimate its unlexicalized parameters by maximum likelihood (MLE) on the coarse treebank.
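
As a concrete illustration (a minimal sketch under my own assumptions, not the authors' implementation), MLE of, say, the HMM transition multinomials from a coarse-converted treebank is count-and-normalize:

```python
from collections import Counter, defaultdict

def mle_transitions(tagged_sentences):
    """Maximum-likelihood HMM transition multinomials p(next tag | tag)
    estimated from sentences given as lists of coarse tags."""
    counts = defaultdict(Counter)
    for tags in tagged_sentences:
        for prev, nxt in zip(tags, tags[1:]):
            counts[prev][nxt] += 1
    return {prev: {t: c / sum(cs.values()) for t, c in cs.items()}
            for prev, cs in counts.items()}

# Made-up "coarse treebank" fragment for a helper language.
helper_treebank = [["DET", "NOUN", "VERB", "ADJ", "ADP", "."],
                   ["DET", "ADJ", "NOUN", "VERB", "."]]
print(mle_transitions(helper_treebank)["DET"])  # {'NOUN': 0.5, 'ADJ': 0.5}
```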

Multilingual Modeling

Multilingual Modeling. For a target language, each unlexicalized multinomial is an interpolation of the helper languages' multinomials: $\theta_{k,i} = \sum_{\ell=1}^{L} \lambda_{k,\ell}\, \theta^{(\ell)}_{k,i}$, where $k$ indexes the multinomial (say, the transitions from the ADJ tag in an HMM) and $\lambda_{k,\ell}$ is the mixture weight of the $k$-th multinomial for the $\ell$-th helper language.
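
A minimal Python sketch of this interpolation (illustrative only; the distributions and weights below are made up):

```python
def interpolate(helper_thetas, lam):
    """Target multinomial as a mixture of fixed helper multinomials.

    helper_thetas -- list over helper languages; each element maps
                     event -> probability (one multinomial, e.g. ADJ -> .)
    lam           -- mixture weights over helpers, summing to 1
    """
    events = set().union(*helper_thetas)
    return {e: sum(w * th.get(e, 0.0) for w, th in zip(lam, helper_thetas))
            for e in events}

spanish_adj = {"NOUN": 0.8, "ADJ": 0.1, "VERB": 0.1}  # made-up numbers
italian_adj = {"NOUN": 0.6, "ADJ": 0.3, "VERB": 0.1}
print(interpolate([spanish_adj, italian_adj], [0.5, 0.5]))
```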

Multilingual Modeling, e.g., two helper languages, Spanish and Italian: each helper contributes its own ADJ → · transition distribution over the coarse tags (NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ., X).

Multilingual Modeling, e.g., two helper languages, Spanish and Italian: the target language's ADJ → · distribution over the coarse tags is unknown.

Learning and Inference

Learning and Inference: normal (monolingual) learning fits the model parameters to the unlabeled data, e.g., by maximizing the marginal likelihood $\sum_{j} \log p(x^{(j)}; \theta)$.

Learning and Inference: in multilingual learning, the helper languages' multinomials $\theta^{(\ell)}$ are fixed; the mixture weights $\lambda$ are learned.

Learning and Inference, multilingual learning with EM: treat the choice of helper language as latent for each event; the E-step computes the expected number of times each helper's component $\theta^{(\ell)}_{k}$ is used in a derivation, and the M-step re-estimates the weights by normalizing these expected counts, $\lambda_{k,\ell} \propto \mathbb{E}[\text{number of times } \theta^{(\ell)}_{k} \text{ is used in a derivation}]$.
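
A minimal sketch of one such update for a single multinomial (my own illustration of EM for mixture weights with fixed components, not the authors' code; the function name and all numbers are invented):

```python
def em_update_mixture(lam, helper_thetas, expected_event_counts):
    """One M-step for the mixture weights of a single multinomial.

    lam                    -- current weights, one per helper language
    helper_thetas          -- per-helper dict event -> probability (fixed)
    expected_event_counts  -- dict event -> expected count in the target
                              data (from the E-step of the outer model)
    Returns weights proportional to the expected number of times each
    helper's component generated an event.
    """
    usage = [0.0] * len(lam)
    for event, count in expected_event_counts.items():
        mix = sum(w * th.get(event, 0.0) for w, th in zip(lam, helper_thetas))
        if mix == 0.0:
            continue
        for ell, (w, th) in enumerate(zip(lam, helper_thetas)):
            # responsibility of helper `ell` for this event, times its count
            usage[ell] += count * w * th.get(event, 0.0) / mix
    total = sum(usage)
    return [u / total for u in usage] if total > 0 else lam

spanish_adj = {"NOUN": 0.8, "ADJ": 0.1, "VERB": 0.1}  # made-up numbers
italian_adj = {"NOUN": 0.6, "ADJ": 0.3, "VERB": 0.1}
counts = {"NOUN": 10.0, "ADJ": 6.0, "VERB": 2.0}      # expected counts
print(em_update_mixture([0.5, 0.5], [spanish_adj, italian_adj], counts))
```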

Learning and Inference, multilingual learning: what about feature-rich generative models? Berg-Kirkpatrick et al. (2010) parameterize each multinomial as a locally normalized log-linear model.
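
For reference, the locally normalized log-linear form writes each multinomial cell as a softmax over features of the event; the notation below is mine (w a weight vector, f(k, i) a feature vector for event i of multinomial k), not copied from the slides:

$$ \theta_{k,i} \;=\; \frac{\exp\!\big(\mathbf{w}^{\top}\mathbf{f}(k,i)\big)}{\sum_{i'} \exp\!\big(\mathbf{w}^{\top}\mathbf{f}(k,i')\big)} $$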

Multilingual Modeling (recap), e.g., two helper languages, Spanish and Italian: the target language's ADJ → · distribution over the coarse tags is unknown.

Multilingual Modeling, e.g., two helper languages, Spanish and Italian: after learning, the target language's ADJ → · distribution is a learned mixture of the Spanish and Italian distributions.

Learning and Inference (for English), coarse-to-fine expansion, Step 1: the coarse ADJ → · distribution over the coarse tags is copied identically to each fine adjective tag, giving JJ → ·, JJR → ·, and JJS → ·.

Learning and Inference (for English), coarse-to-fine expansion: each copied distribution, e.g., JJ → ·, is still a distribution over the coarse tags.

Learning and Inference (for English), coarse-to-fine expansion, Step 2: each coarse target tag's probability mass in JJ → · is divided equally among its fine tags (e.g., VERB among VB, VBD, VBG, VBN, VBP, VBZ; NOUN among NN, NNS, NNP, NNPS), producing a new, fine JJ → · distribution that serves as the initializer for monolingual unsupervised training.
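
A hedged sketch of the two expansion steps for one coarse distribution (my reading of the slides, not the authors' code; function and variable names are mine, and the probabilities are made up):

```python
def coarse_to_fine(coarse_dist, coarse_to_fine_tags, fine_source_tags):
    """Expand one coarse transition distribution to fine-tag distributions.

    coarse_dist         -- dict coarse target tag -> probability (e.g. ADJ -> .)
    coarse_to_fine_tags -- dict coarse tag -> list of its fine tags
    fine_source_tags    -- fine tags sharing this coarse source (JJ, JJR, JJS)
    Step 1: identical copies for each fine source tag.
    Step 2: divide each coarse target's mass equally among its fine tags.
    """
    fine_dist = {}
    for coarse_tgt, p in coarse_dist.items():
        fine_tgts = coarse_to_fine_tags.get(coarse_tgt, [coarse_tgt])
        for t in fine_tgts:
            fine_dist[t] = p / len(fine_tgts)
    return {src: dict(fine_dist) for src in fine_source_tags}

mapping = {"NOUN": ["NN", "NNS", "NNP", "NNPS"],
           "VERB": ["VB", "VBD", "VBG", "VBN", "VBP", "VBZ"]}
adj_dist = {"NOUN": 0.6, "VERB": 0.4}          # made-up coarse ADJ -> . values
init = coarse_to_fine(adj_dist, mapping, ["JJ", "JJR", "JJS"])
print(init["JJ"]["NN"])                        # 0.6 / 4 = 0.15
```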

Experiments

Two Problems. Unsupervised part-of-speech tagging: model, a feature-based HMM (Berg-Kirkpatrick et al., 2010); learning, L-BFGS. Unsupervised dependency parsing: model, the DMV (Klein and Manning, 2004); learning, EM.

Languages. Target languages: Bulgarian, Danish, Dutch, Greek, Japanese, Portuguese, Slovene, Spanish, Swedish, and Turkish. Helper languages: English, German, Italian, and Czech (CoNLL 2006 and 2007 treebanks).

Results: POS tagging (without tag dictionary). Systems compared: Direct Gradient (DG), the monolingual baseline of Berg-Kirkpatrick et al. (2010); Uniform + DG, uniform mixture parameters (no learning of the mixture); and Mixture + DG, the full model. The table reports the number of languages with best results and average accuracy.

Results: POS tagging (without tag dictionary), number of languages with best results: Direct Gradient (DG) 2 (Portuguese, Danish); Uniform + DG 2 (Turkish, Bulgarian); Mixture + DG 6.

Results: dependency parsing. Baselines: EM, monolingual EM (Klein and Manning, 2004); PR, posterior regularization (Gillenwater et al., 2010); PGI, phylogenetic grammar induction (Berg-Kirkpatrick and Klein, 2010). The table reports the number of languages with best results and average accuracy.

Results: dependency parsing. Proposed variants: Uniform, uniform mixture parameters and no coarse-to-fine expansion (no learning); Mixture, learned mixture parameters and no coarse-to-fine expansion; Uniform + EM, uniform mixture parameters with coarse-to-fine expansion followed by monolingual learning; Mixture + EM, learned mixture parameters with coarse-to-fine expansion followed by monolingual learning.

Results: dependency parsing, number of languages with best results among the baselines: EM 0; PR 2 (Turkish, Slovene); PGI 0.

Results: dependency parsing, number of languages with best results: Uniform 3 (Bulgarian, Swedish, Dutch); Mixture 1 (Danish); Uniform + EM 1 (Greek); Mixture + EM 3 (Portuguese, Japanese, Spanish); EM 0; PR 2 (Turkish, Slovene); PGI 0.

Analyzing with principal component analysis (plot of the two principal components).

From Words to Dependencies

From Words to Dependencies: use the induced tags to induce dependencies, either (1) in a pipeline, or (2) using the posteriors over tags in a sausage lattice (Cohen and Smith, 2007).

From Words to Dependencies, joint decoding: the DMV parses a sausage lattice of tag posteriors, e.g., for "The Skibo Castle": The (DET 0.95, ADJ 0.03, NOUN 0.02), Skibo (DET 0.0, ADJ 0.3, NOUN 0.7), Castle (DET 0.01, ADJ 0.1, NOUN 0.89).
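
To make the lattice concrete, a small illustrative sketch (assumptions mine): the lattice is one tag-posterior distribution per token, and any tag sequence gets the product of its posteriors as a weight; in joint decoding these weights are folded into the DMV's parse score rather than used to pick a single tag sequence up front.

```python
from itertools import product

# Sausage lattice for "The Skibo Castle": per-token tag posteriors (from the slide).
lattice = [
    {"DET": 0.95, "ADJ": 0.03, "NOUN": 0.02},  # The
    {"DET": 0.0,  "ADJ": 0.3,  "NOUN": 0.7},   # Skibo
    {"DET": 0.01, "ADJ": 0.1,  "NOUN": 0.89},  # Castle
]

def lattice_weight(tag_sequence, lattice):
    """Weight the lattice assigns to one tag sequence: product of posteriors."""
    w = 1.0
    for tag, posteriors in zip(tag_sequence, lattice):
        w *= posteriors[tag]
    return w

# Exhaustively score tag sequences; a real joint decoder would instead fold
# these weights into the DMV's dynamic program rather than enumerate.
best = max(product(*(d.keys() for d in lattice)),
           key=lambda seq: lattice_weight(seq, lattice))
print(best, lattice_weight(best, lattice))  # ('DET', 'NOUN', 'NOUN') ~0.592
```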

Results: words to dependencies. Systems: pipeline decoding with DG or Mixture + DG tags, and joint decoding with DG or Mixture + DG tag posteriors. The table reports the number of languages with best results and average accuracy.

Results: words to dependencies, number of languages with best results: Pipeline with DG tags 1 (Greek); Pipeline with Mixture + DG tags 0; Joint with DG posteriors 5 (Portuguese, Turkish, Swedish, Slovene, Danish); Joint with Mixture + DG posteriors 4 (Bulgarian, Japanese, Spanish, Dutch).

Results: words to dependencies (same table as above). Best average result with gold tags: 62.2. Interesting result: automatically induced tags perform better than gold tags for Turkish and Slovene.

Conclusions

Conclusions: improvements for two major tasks using non-parallel multilingual guidance; in general, results are better for grammar induction than for POS tagging; joint POS tagging and dependency parsing performs surprisingly well, and for a few languages the results are better than with gold tags; joint decoding performs better than a pipeline.

Questions?

Results: POS tagging (without tag dictionary), per-language accuracy table for Direct Gradient (DG), Uniform + DG, and Mixture + DG over Bulgarian, Danish, Dutch, Greek, Japanese, Portuguese, Slovene, Spanish, Swedish, Turkish, and the average.

Results: POS tagging (with tag dictionary), per-language accuracy table for Direct Gradient (DG), Uniform + DG, and Mixture + DG over Bulgarian, Danish, Dutch, Greek, Japanese, Portuguese, Slovene, Spanish, Swedish, Turkish, and the average.

Results: dependency parsing, per-language accuracy table for Uniform, Mixture, Uniform + EM, Mixture + EM, EM, PR, and PGI over Bulgarian, Danish, Dutch, Greek, Japanese, Portuguese, Slovene, Spanish, Swedish, Turkish, and the average.

Results: words to dependencies, per-language accuracy table for joint and pipeline decoding (each with DG and Mixture + DG tags) and for gold tags, over Bulgarian, Danish, Dutch, Greek, Japanese, Portuguese, Slovene, Spanish, Swedish, Turkish, and the average.