Real-World Semi-Supervised Learning of POS-Taggers for Low-Resource Languages Dan Garrette, Jason Mielens, and Jason Baldridge Proceedings of ACL 2013.


Semi-Supervised Training
Train an HMM with Expectation-Maximization (EM).
Need: a large raw corpus and a tag dictionary.
[Kupiec, 1992] [Merialdo, 1994]
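The classic Kupiec/Merialdo setup can be sketched in a few lines: run forward-backward EM on raw text, with the tag dictionary masking each word's emissions to its licensed tags. This is a minimal illustrative sketch, not the paper's implementation; the tags, tag dictionary, and toy corpus below are all hypothetical.

```python
import numpy as np

TAGS = ["DT", "NN", "VB"]
TAG_DICT = {"the": {"DT"}, "dog": {"NN"}, "walks": {"NN", "VB"}}
CORPUS = [["the", "dog", "walks"], ["the", "dog"]]
VOCAB = sorted({w for s in CORPUS for w in s})
T, V = len(TAGS), len(VOCAB)
wi = {w: i for i, w in enumerate(VOCAB)}
ti = {t: i for i, t in enumerate(TAGS)}

def allowed(word):
    # The tag dictionary restricts each word to its licensed tags;
    # words not in the dictionary may take any tag.
    mask = np.zeros(T)
    for t in TAG_DICT.get(word, set(TAGS)):
        mask[ti[t]] = 1.0
    return mask

# Uniform initialization of the HMM parameters
pi = np.full(T, 1.0 / T)        # initial tag probabilities
A = np.full((T, T), 1.0 / T)    # tag-transition probabilities
B = np.full((T, V), 1.0 / V)    # emission probabilities

for _ in range(20):             # EM iterations
    exp_pi, exp_A, exp_B = np.zeros(T), np.zeros((T, T)), np.zeros((T, V))
    for sent in CORPUS:
        n = len(sent)
        masks = [allowed(w) for w in sent]
        # Forward pass (dictionary mask zeroes out unlicensed tags)
        alpha = np.zeros((n, T))
        alpha[0] = pi * B[:, wi[sent[0]]] * masks[0]
        for k in range(1, n):
            alpha[k] = (alpha[k - 1] @ A) * B[:, wi[sent[k]]] * masks[k]
        # Backward pass
        beta = np.zeros((n, T))
        beta[-1] = 1.0
        for k in range(n - 2, -1, -1):
            beta[k] = A @ (B[:, wi[sent[k + 1]]] * masks[k + 1] * beta[k + 1])
        Z = alpha[-1].sum()
        gamma = alpha * beta / Z          # P(tag at position k | sentence)
        exp_pi += gamma[0]
        for k in range(n):
            exp_B[:, wi[sent[k]]] += gamma[k]
        for k in range(n - 1):            # expected transition counts
            exp_A += (alpha[k][:, None] * A *
                      (B[:, wi[sent[k + 1]]] * masks[k + 1] *
                       beta[k + 1])[None, :]) / Z
    # M-step with a little smoothing so unseen rows stay well-defined
    pi = (exp_pi + 1e-8) / (exp_pi + 1e-8).sum()
    A = (exp_A + 1e-8) / (exp_A + 1e-8).sum(axis=1, keepdims=True)
    B = (exp_B + 1e-8) / (exp_B + 1e-8).sum(axis=1, keepdims=True)

best = TAGS[int(np.argmax(B[:, wi["the"]]))]
print(best)  # -> DT: emission mass for "the" accrues only to its licensed tag
```

With a real raw corpus the E-step would be vectorized and run in log space, but the masking idea is the same: the dictionary carries all of the supervision.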

Previous Work:
 Supervised learning
 Provides high accuracy for POS tagging (Manning, 2011).
 Performs poorly when little supervision is available.
 Semi-supervised learning
 Trains sequence models such as HMMs using the EM algorithm.
 Work in this area has still relied on relatively large amounts of data (Kupiec, 1992; Merialdo, 1994).

Previous Work:
 Goldberg et al. (2008)
 Manually constructed a lexicon for Hebrew to train an HMM tagger.
 The lexicon was developed over a long period of time by expert lexicographers.
 Täckström et al. (2013)
 Evaluated mixed type and token constraints generated by projecting information from a high-resource language to low-resource languages.
 Requires large parallel corpora.

Low-Resource Languages
6,900 languages in the world; only ~30 have non-negligible quantities of data.
No million-word corpus exists for any endangered language.
[Maxwell and Hughes, 2006] [Abney and Bird, 2010]

Low-Resource Languages
Kinyarwanda (KIN): Niger-Congo; morphologically rich.
Malagasy (MLG): Austronesian; spoken in Madagascar.
Also, English.

Collecting Annotations
Supervised training is not an option.
Semi-supervised training: annotate some data by hand in 4 hours (in 30-minute intervals) for two tasks:
Type supervision.
Token supervision.

Tag Dict Generalization
These annotations are too sparse!
Generalize them to the entire vocabulary.

Tag Dict Generalization
Haghighi and Klein (2006) do this with a vector space, but we don't have enough raw data.
Das and Petrov (2011) do this with a parallel corpus, but we don't have a parallel corpus.

Tag Dict Generalization
Strategy: label propagation.
Connect annotations to raw corpus tokens.
Push tag labels to the entire corpus.
[Talukdar and Crammer, 2009]
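The propagation step can be sketched with a plain iterative-averaging variant (the paper uses the Modified Adsorption algorithm of Talukdar and Crammer, which adds abandonment and regularization terms; this simplified version only illustrates the idea). The toy graph below is hypothetical: token nodes link to type and feature nodes, a single type annotation seeds the graph, and labels flow outward.

```python
# Undirected LP graph: corpus tokens, word types, and an affix feature.
edges = {
    "TOK_the_1":  ["TYPE_the"],
    "TOK_the_4":  ["TYPE_the"],
    "TYPE_the":   ["TOK_the_1", "TOK_the_4", "PRE1_t"],
    "TOK_thug_5": ["TYPE_thug"],
    "TYPE_thug":  ["TOK_thug_5", "PRE1_t"],
    "PRE1_t":     ["TYPE_the", "TYPE_thug"],
}
seeds = {"TYPE_the": {"DT": 1.0}}   # type annotation: "the" is a DT

labels = {n: dict(seeds.get(n, {})) for n in edges}
for _ in range(10):
    new = {}
    for node, nbrs in edges.items():
        if node in seeds:                 # seed nodes keep their labels
            new[node] = dict(seeds[node])
            continue
        dist = {}
        for nb in nbrs:                   # average neighbours' distributions
            for tag, p in labels[nb].items():
                dist[tag] = dist.get(tag, 0.0) + p / len(nbrs)
        new[node] = dist
    labels = new

# Tokens of "the" receive the DT label through the TYPE_the node,
# and the PRE1_t affix node passes weaker DT mass on to "thug".
print(round(labels["TOK_the_1"].get("DT", 0.0), 2))  # -> 1.0
```

Shared feature nodes (prefixes, suffixes, context words) are what let a handful of annotated types generalize to unannotated words.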

Morphological Transducers
Finite-state transducers (FSTs) are used for morphological analysis: an FST accepts a word type and produces a set of morphological features.
Power of FSTs: they can analyze out-of-vocabulary items by looking for known affixes and guessing the stem of the word.
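The OOV behaviour can be illustrated without a full transducer: strip known affixes and guess the remainder as a stem. This is only a sketch of the idea; the paper's analyzers are real FSTs built from morphological grammars, and the suffix table and stem list here are invented.

```python
# Hypothetical suffix-to-feature table and known-stem list.
SUFFIXES = {"s": {"+Pl", "+3Sg"}, "ed": {"+Past"}, "ing": {"+Prog"}}
STEMS = {"walk", "dog"}

def analyze(word):
    """Return a set of (stem, feature) analyses. For an OOV word,
    fall back to stripping a known suffix and guessing the stem."""
    analyses = set()
    for suf, feats in SUFFIXES.items():
        if word.endswith(suf) and len(word) > len(suf):
            stem = word[: -len(suf)]          # guessed stem for OOV items
            for f in feats:
                analyses.add((stem, f))
    if word in STEMS:
        analyses.add((word, "+Bare"))
    return analyses

print(sorted(analyze("walks")))   # known stem: walk +3Sg / +Pl
print(sorted(analyze("blicks")))  # OOV: stem "blick" guessed from the "s" suffix
```

In the paper's pipeline these analyses become features on LP graph nodes, so morphologically similar words share label mass.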

Tag Dict Generalization
[Figure series: the label-propagation graph. Token nodes (TOK_the_1, TOK_the_4, TOK_the_9, TOK_thug_5, TOK_dog_2) connect to type nodes (TYPE_the, TYPE_thug, TYPE_dog), context-feature nodes (PREV_the, NEXT_walks), and affix-feature nodes (PRE1_t, PRE2_th, PRE1_d, PRE2_do, SUF1_e, SUF1_g). Type annotations (the/DT, dog/NN) and token annotations ("the dog walks" tagged DT NN VBZ) seed the graph, and the labels then propagate to the unannotated tokens.]

Model Minimization [Ravi et al., 2010; Garrette and Baldridge, 2012]
The LP graph has a node for each corpus token, and each node is labelled with a distribution over POS tags.
The graph therefore provides a corpus of sentences labelled with noisy tag distributions.
Greedily seek the minimal set of tag bigrams that describes the raw corpus.
The result is then used to train an HMM with EM.
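The greedy selection has a set-cover flavour: every adjacent token pair must be explainable by some chosen tag bigram, given the tag sets the LP graph assigned to each token. The toy version below (hypothetical tag sets, and a much simpler objective than Ravi et al.'s actual procedure) repeatedly adds the bigram that covers the most still-uncovered positions.

```python
# Per-token candidate tag sets, as might come out of the LP graph.
corpus_tagsets = [
    [{"DT"}, {"NN", "JJ"}, {"NN", "VB"}],   # e.g. "the dog walks"
    [{"DT"}, {"NN"}],                        # e.g. "the dog"
]

# Enumerate adjacent positions and the tag bigrams that could cover each.
positions = []
for s, sent in enumerate(corpus_tagsets):
    for k in range(len(sent) - 1):
        cands = {(a, b) for a in sent[k] for b in sent[k + 1]}
        positions.append(((s, k), cands))

chosen, uncovered = set(), {p for p, _ in positions}
while uncovered:
    # Count how many uncovered positions each candidate bigram covers...
    counts = {}
    for p, cands in positions:
        if p in uncovered:
            for bg in cands:
                counts[bg] = counts.get(bg, 0) + 1
    # ...and greedily take the best one.
    best = max(counts, key=counts.get)
    chosen.add(best)
    uncovered -= {p for p, cands in positions if best in cands}

# (DT, NN) covers two positions at once, so it is picked first.
print(len(chosen))  # -> 2
```

Restricting the HMM to transitions licensed by this small bigram set keeps EM from wandering into implausible tag sequences on so little data.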

Overall Accuracy All of these values were achieved using both FST and affix LP features.

Results

Types versus Tokens

Mixing Type and Token Annotations

Morphological Analysis

Annotator Experience

Conclusion
Type annotations are the most useful input from a linguist.
We can train effective POS-taggers for low-resource languages given only a small amount of unlabeled text and a few hours of annotation by a linguist who is not a native speaker.