LDMT MURI Data Collection and Linguistic Annotation November 2, 2012 Jason Baldridge, UT Austin Lori Levin, CMU.

Purpose
Collect and build data:
– Monolingual text
– Bilingual text
– Linguistic annotations
to support work on machine translation for:
– Kinyarwanda-English
– Malagasy-English

Kinyarwanda Data Resources (diagram; word counts)
Categories: BILINGUAL (285k); ENGLISH monolingual (huge); KINYARWANDA monolingual (7m); ENG treebank; ENG text; KIN text; KIN treebank
Sources shown: KGMC (270k), KGMC (225k), Pbook (0.9k), Pbook (0.7k), GWord (8b), PTB (1m), News (7m), KGMC (5.8k), KGMC (4.8k), BBC (0.3k), IGT (0.1k), IGT (0.06k), Dict (9k), Dict (8k), KGMC (2.9k), KGMC (3.8k), BBC (0.3k), IGT (0.06k), IGT (0.1k)
Timeline: 1.0 Release 02/ Release 10/11

Kinyarwanda Data Resources (diagram; word counts; updated)
Categories: BILINGUAL (285k); ENGLISH monolingual (huge); KINYARWANDA monolingual (7m); ENG treebank; ENG text; KIN text; KIN treebank
Sources shown: KGMC (270k), KGMC (225k), Pbook (0.9k), Pbook (0.7k), GWord (8b), PTB (1m), News (7m), KGMC (5.8k), KGMC (4.8k), BBC (0.3k), IGT (0.1k), IGT (0.06k), Dict (9k), Dict (8k), KGMC (2.9k), Part-of-speech (2k), GFL (4.7k), KGMC (3.8k), BBC (0.3k), IGT (0.06k), IGT (0.1k)
Status label: Reviewed & improved
Timeline: 1.0 Release 02/ Release 10/ Release 11/12

Malagasy Data Resources (diagram)
Categories: BILINGUAL (732k); ENGLISH monolingual (huge); MALAGASY monolingual; ENG treebank; ENG text; MLG text; MLG treebank
Sources shown: Bible (730k), Bible (725k), News (2.1k), News (2.3k), GWord (8b), PTB (1m), News (2.1k), News (2.3k)
Timeline: 1.0 Release 02/ Release 10/11

Malagasy Data Resources (diagram; updated)
Categories: BILINGUAL (732k); ENGLISH monolingual (huge); MALAGASY monolingual; ENG treebank; ENG text; MLG text; MLG treebank
Sources shown: Bible (730k), Bible (725k), News (2.1k), News (2.3k), GWord (8b), PTB (1m), News (2.1k, reviewed & improved), News (2.3k, reviewed & improved), Part-of-speech (2k), Global Voices (1.8m), Global Voices (1.9m), Leipzig (600k), Global Voices GFL (3.7k), Dictionary (77.5k)
Timeline: 1.0 Release 02/ Release 10/ Release 11/12

Malagasy Data Resources
Year 1: 19th-century Malagasy Bible
Year 2:
– Univ. of Leipzig Web Corpus: monolingual Malagasy, very clean
– CMU Global Voices Archive

Malagasy Resources
                                Tokens      Types     Hapax
Bible (Year 1)                 579,578     19,460     8,401
Leipzig corpus (Year 2)        618,282     41,462    23,659
CMU Global Voices (Year 2)   2,148,976     84,744    46,627
Total                        3,346,836    115,172    62,517

Malagasy - English Resources
                             eng-Tokens  eng-Types  mlg-Tokens  mlg-Types
Bible (Year 1)                  584,872     13,084     579,578     19,460
CMU Global Voices (Year 2)    1,785,472     63,357   2,148,976     84,744
Total                         2,370,344     67,790   3,346,836    115,172
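The token, type, and hapax columns above can be illustrated with a minimal Python sketch. It assumes whitespace tokenization and lowercasing, which the slides do not specify; the function name `corpus_stats` and the toy English sentence are hypothetical and only stand in for the real corpora.

```python
from collections import Counter


def corpus_stats(text):
    """Return (tokens, types, hapax) for a text.

    Assumption: tokens are whitespace-separated and lowercased.
    The slides do not say how the reported counts were computed.
    """
    counts = Counter(text.lower().split())
    tokens = sum(counts.values())  # total running words
    types = len(counts)            # distinct word forms
    hapax = sum(1 for c in counts.values() if c == 1)  # forms seen exactly once
    return tokens, types, hapax


# Toy input, not taken from any of the corpora listed above.
tokens, types, hapax = corpus_stats("the cat saw the dog and the bird")
print(tokens, types, hapax)
```

Type and hapax totals are counted over the combined vocabulary, so they are not the sums of the per-corpus rows, whereas token totals are.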

CMU Global Voices Corpus
Domains include Twitter, blogs, and news about popular democracy movements
Actively published by volunteer translators
– We are gathering ~500k words per language per year of high-quality parallel data

                           eng-Tokens  eng-Types  mlg-Tokens  mlg-Types
Global Voices < Jun 2011    1,318,780     56,414   1,569,343     72,906
Global Voices < Jun 2012    1,732,674     59,750   2,066,419     79,269

Morphological analysis
We decided against creating morphological gold-standard annotations from the output of finite-state transducers.
Initially tried to use the XFST analyzer created by Dalrymple, Liakata and Mackie
– Quality of the Dalrymple transducer's output was poor (ambiguous, many incorrect analyses).
No existing Kinyarwanda transducer
– Any annotations would be subject to changing analyses during transducer development.

Morphological analysis
Developed new transducers for both Kinyarwanda and Malagasy.
– Less ambiguity
– Cautious guessing for unknown stems => better precision
Improvements are driven by measuring ambiguity and coverage on data, and by the effect on performance in other tasks.
We may produce annotations once transducer development is deemed sufficient.
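One way to measure ambiguity and coverage on data is sketched below in Python. The definitions are assumptions, since the slides do not give them: coverage is taken as the fraction of word tokens that receive at least one analysis, and ambiguity as the mean number of analyses per covered token. The `analyze` callable, the toy lexicon, and the tag strings are all hypothetical stand-ins for a real transducer lookup.

```python
def ambiguity_and_coverage(words, analyze):
    """Return (coverage, ambiguity) for a sequence of word tokens.

    coverage  = covered tokens / all tokens
    ambiguity = total analyses / covered tokens
    `analyze` is any callable mapping a word to a list of analyses.
    """
    n_words = 0
    n_covered = 0
    n_analyses = 0
    for word in words:
        n_words += 1
        analyses = analyze(word)
        if analyses:
            n_covered += 1
            n_analyses += len(analyses)
    coverage = n_covered / n_words if n_words else 0.0
    ambiguity = n_analyses / n_covered if n_covered else 0.0
    return coverage, ambiguity


# Hypothetical toy analyzer standing in for a transducer lookup.
TOY_LEXICON = {
    "walks": ["walk+V+3sg", "walk+N+pl"],
    "dog": ["dog+N+sg"],
}


def toy_analyze(word):
    return TOY_LEXICON.get(word, [])


coverage, ambiguity = ambiguity_and_coverage(
    ["walks", "dog", "zzz", "dog"], toy_analyze
)
print(coverage, ambiguity)
```

Tracking both numbers together matters because a guesser for unknown stems raises coverage but can also raise ambiguity, which is the trade-off that cautious guessing is meant to control.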

Syntactic annotations
During the past year, we reviewed and revised the phrase structures annotated for kin and mlg texts.
– Analyses and labels made more consistent across languages
– Head annotations added to enable training and evaluation of dependency parsers
– All tokenization standardized
GFL annotations: 4k tokens each, kin and mlg
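Head annotations make it possible to read dependency arcs off a phrase-structure tree by head percolation. The Python sketch below illustrates the general idea only. The tree encoding, a `(label, head_child_index, children)` tuple with plain strings as leaves, is a hypothetical format and not the project's actual treebank format.

```python
def extract_dependencies(tree):
    """Convert a head-annotated phrase-structure tree to dependency arcs.

    Returns (tokens, root_index, arcs), where each arc is a
    (dependent_index, head_index) pair over token positions.
    """
    tokens = []
    arcs = []

    def visit(node):
        # A leaf is a plain string token and is its own lexical head.
        if isinstance(node, str):
            tokens.append(node)
            return len(tokens) - 1
        _label, head_child, children = node
        child_heads = [visit(child) for child in children]
        head = child_heads[head_child]
        # Every non-head child attaches to the lexical head of the head child.
        for i, dependent in enumerate(child_heads):
            if i != head_child:
                arcs.append((dependent, head))
        return head

    root = visit(tree)
    return tokens, root, sorted(arcs)


# Toy English tree in the hypothetical encoding.
tree = ("S", 1, [("NP", 1, ["the", "dog"]), ("VP", 0, ["barks"])])
tokens, root, arcs = extract_dependencies(tree)
print(tokens, root, arcs)
```

The token left without a head is the root of the dependency tree; every other token gets exactly one head, which is what a dependency parser needs for training and evaluation.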

Data accomplishments
Fieldwork on Kinyarwanda that informs theoretical linguistic work and transducers.
New morphological transducers for kin and mlg.
V 3.0 of monolingual, bilingual, and treebanked data for Kinyarwanda and Malagasy to be released this coming week.
– An order of magnitude more parallel data (mlg)
– Better and more syntactic data (kin/mlg)

Data accomplishments
Evaluation
– Pilot annotations for linguistically targeted test suites
Formal linguistic advances
– GFL specification and tools for annotation and visualization
– Abstract Meaning Representation (AMR): leverage ideas, data, and tools from ISI as part of other synergistic projects.