LDMT MURI Data Collection and Linguistic Annotations November 4, 2011 Jason Baldridge, UT Austin Ulf Hermjakob, USC/ISI.

Presentation transcript:

Purpose
Collect and build data
– Monolingual text
– Bilingual text
– Linguistic annotations
to support work on machine translation for
– Kinyarwanda-English
– Malagasy-English

Overview
– Source, type and size of data
– Language consultants
– Kinyarwanda data
– Malagasy data
– Annotation
– An idea
– Accomplishments, challenges, future releases

Text sources
– Bible (highly multilingual parallel corpus)
– Dictionaries, phrasebooks
– Interview transcripts
– Newspapers

Kinyarwanda Data Resources (diagram; figures are word counts, paired figures in the order shown on the slide)
1.0 Release
– BILINGUAL (16k): Pbook (0.9k / 0.7k), KGMC (5.8k / 4.8k), Dict (9k / 8k)
– ENGLISH monolingual (huge): GWord (8b); ENG treebank: PTB (1m)
– KINYARWANDA monolingual (7m): News (7m)
– word align: NONE
2.0 Release
– BILINGUAL (285k): KGMC (270k / 225k), Pbook (0.9k / 0.7k), KGMC (5.8k / 4.8k), BBC (0.3k), IGT (0.1k / 0.06k), Dict (9k / 8k)
– ENGLISH monolingual (huge): GWord (8b); ENG treebank: PTB (1m)
– KINYARWANDA monolingual (7m): News (7m)
– Treebanks: KGMC (2.9k / 3.8k), BBC (0.3k), IGT (0.06k / 0.1k)
– word align: NONE
– NOTE: no gold morph-split text

Malagasy Data Resources (diagram; figures are word counts, paired figures in the order shown on the slide)
1.0 Release
– BILINGUAL (730k): Bible (730k / 725k)
– ENGLISH monolingual (huge): Gword (8b); ENG treebank: PTB (1m)
– MALAGASY monolingual (zero): NONE
– word align: none
2.0 Release
– BILINGUAL (732k): Bible (730k / 725k), News (2.1k / 2.3k)
– ENGLISH monolingual (huge): Gword (8b); ENG treebank: PTB (1m)
– MALAGASY monolingual (zero): NONE
– Treebanks: News (2.1k / 2.3k)
– word align: none
– NOTE: no gold morph-split text

Quality of Original Texts
– Perfectly clean: English Bible
– Reasonably edited: Newspapers (kin/mlg)
– Uneven editing: Genocide protocols
  – Spelling errors
  – Missing/sloppy punctuation
  – Untranslated text (missing or still in source language)
Kinyarwanda word ikaragiro (which means dairy) repeatedly translated as diary: “... over there, the houses that belong to the diary.”

Native speaker consultants
UT reached out to speakers of both languages
– Kinyarwanda speakers
  – Several speakers near Austin
  – Most would like some payment
  – One has helped with translation and consultation
– Malagasy speakers
  – Many speakers from around the US and Canada
  – Most would like some payment
  – Two have helped with translations

Native speaker consultants
At this point, UT does need to have access to paid informants.
– Need texts from other genres translated
– Need to ask questions about meanings of some sentences for linguistic analysis
The CMU-Rwanda initiative may provide us with a further avenue for obtaining consultants for Kinyarwanda.
– Also a potential source of data

KGMC Transcripts
– Collaboration between Kigali Genocide Memorial Center and the Human Rights Documentation Initiative at UT Austin Library
– Transcriptions of survivor testimonies filmed for the Genocide Archive Rwanda

KGMC Data
– 48 translated transcripts
  – all translated into English
  – 33 into French
– 41 untranslated transcripts (only Kinyarwanda)

KGMC Data
– Original format: Microsoft Word, in tables

KGMC Data normalization
– Converted to XML using a semi-automatic process
– Each language represented side-by-side
– Script to process the MS Word format
  – Iteratively modified based on output and error detection
  – Needed to handle missing data and misalignments between time spans across translations
– Final manual verification and correction of each file
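
The side-by-side conversion described above could look roughly like the following sketch. It is not the project's script: the element names (`transcript`, `segment`, `text`), the `time` attribute, the language codes, and the row layout are all assumptions made for illustration, and the sample rows are placeholders rather than real transcript text. It only shows how rows taken from a table might be written out per language while missing cells are recorded for later manual checking.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows, languages=("kin", "eng", "fra")):
    """Build side-by-side XML from table rows (hypothetical schema).

    Each row is a dict with a "time" key and one optional key per language.
    Returns the XML root and a list of (row index, language) gaps.
    """
    root = ET.Element("transcript")
    gaps = []
    for i, row in enumerate(rows):
        seg = ET.SubElement(root, "segment", id=str(i), time=row.get("time", ""))
        for lang in languages:
            text = (row.get(lang) or "").strip()
            if not text:
                # Record the missing cell for manual review instead of failing.
                gaps.append((i, lang))
                continue
            ET.SubElement(seg, "text", lang=lang).text = text
    return root, gaps


# Placeholder rows standing in for cells read from the Word tables.
rows = [
    {"time": "00:00:05", "kin": "KIN-1", "eng": "ENG-1"},
    {"time": "00:00:09", "kin": "KIN-2", "eng": "", "fra": "FRA-2"},
]
root, gaps = rows_to_xml(rows)
print(ET.tostring(root, encoding="unicode"))
print(gaps)
```

The list of gaps is what a final manual pass would work through file by file.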

Example XML

Malagasy Bible
– Online version of 1865 Malagasy Bible
– Preparation:
  – Convert HTML to text
  – Align with the NET Bible (New English Translation) using verses
  – Currently have 686 chapters aligned
– Obvious problem: 150-year-old Malagasy text
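
Aligning by verse can be sketched as a lookup on shared verse references. The sketch below assumes each Bible has already been reduced to a mapping from a (book, chapter, verse) key to verse text; the key format, the function name, and the placeholder strings are illustrative assumptions, not the procedure actually used.

```python
def align_by_verse(mlg_verses, eng_verses):
    """Pair verses that share a (book, chapter, verse) reference.

    Returns the aligned pairs and the references found on one side only.
    """
    shared = sorted(mlg_verses.keys() & eng_verses.keys())
    pairs = [(ref, mlg_verses[ref], eng_verses[ref]) for ref in shared]
    unmatched = sorted(mlg_verses.keys() ^ eng_verses.keys())
    return pairs, unmatched


# Placeholder data; real input would come from the HTML-to-text conversion.
mlg = {("GEN", 1, 1): "<mlg verse text>", ("GEN", 1, 2): "<mlg verse text>"}
eng = {("GEN", 1, 1): "<eng verse text>", ("GEN", 1, 3): "<eng verse text>"}
pairs, unmatched = align_by_verse(mlg, eng)
print(len(pairs), unmatched)
```

References that appear on only one side are returned separately, since differences in verse numbering between editions would surface there.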

Malagasy Dictionary
– Online dictionary of Malagasy
  – k words
  – English definitions for 8,000 words
  – French definitions for 10,000 words
– Includes parts-of-speech, mostly coarse-grained (noun, verb, adjective, etc.)

Malagasy Dictionary
– Scraped and processed to produce clean XML

Malagasy texts
– Texts from six webpages
  – 3 from Lakroa
  – 3 from Lagazette
– Translated by native speakers to English to create small parallel corpus for initial analysis and annotation

Morphological analysis
– UT Austin obtained and adapted the XFST analyzer created by Dalrymple, Liakata and Mackie
– Applied it to the Malagasy website texts from Lakroa and Lagazette, hand-selecting the correct analysis for each word
– These analyses still need to be integrated with the standard tokenization and data organization
– Kinyarwanda morph analyzer in development

Syntactic annotations
– Did initial pilot annotations with example sentences from the linguistics literature
– Annotated KGMC (kin) and Lagazette and Lakroa (mlg) texts with phrase structures
  – Used a fairly standard set of labels and structures
  – Trees created for both the source language sentences and their English translations

Example KGMC tree

Syntactic annotations
– Phrase structures were created before standardizing the tokenization; had to be grafted back onto correct tokens
– Current trees are still pilot annotations!
– Need to do many things, including:
  – reconsider the choice of node labels
  – add head markers (enable easy conversion to dependency analyses)
  – review and incorporate feedback from others
  – graft some existing trees to standard tokenization
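
One way to think about grafting trees onto a new tokenization is to map each old token to the new tokens that cover the same characters. The sketch below is an illustration under a strong assumption: both tokenizations must spell out exactly the same character sequence once whitespace is ignored. The function names and the English example are hypothetical and do not describe the procedure actually used.

```python
def char_spans(tokens):
    """Return (start, end) character offsets of each token, ignoring spaces."""
    spans, start = [], 0
    for tok in tokens:
        spans.append((start, start + len(tok)))
        start += len(tok)
    return spans


def map_tokens(old_tokens, new_tokens):
    """For each old token, list the indices of the new tokens it overlaps."""
    if "".join(old_tokens) != "".join(new_tokens):
        raise ValueError("tokenizations do not cover the same characters")
    new_spans = char_spans(new_tokens)
    mapping = []
    for o_start, o_end in char_spans(old_tokens):
        overlap = [j for j, (s, e) in enumerate(new_spans)
                   if s < o_end and e > o_start]
        mapping.append(overlap)
    return mapping


old = ["don't", "go", "."]
new = ["do", "n't", "go", "."]
mapping = map_tokens(old, new)
print(mapping)
```

A tree leaf attached to an old token could then be re-attached to the new token indices listed for it; cases where the two tokenizations differ in their characters would still need manual handling.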

Overview
– Source, type and size of data
– Language consultants
– Kinyarwanda data
– Malagasy data
– Annotation
– An idea: data-driven dictionary development
– Accomplishments, challenges, future releases

Data-driven Dictionary Development
– Current dictionary size is moderate
  – 6,632 entries with 3,890 distinct Kin. words/phrases
  – many relatively common words not covered
– Idea: increase dictionary size using translators
  – based on data analysis of monolingual corpora
  – using NLP techniques to support the process
– Goals
  – Additional bitext for direct use in MT training
  – Improved resource for morphological analyzers

Data-driven Dict. Dev. (Example)
– Monolingual Kinyarwanda corpus contains
  – ikinini (43 occ.), ibinini (96 occ.); not in dictionary
– Automatically predict lexical form(s), POS
  – ikinini (noun, plural: ibinini)
– Elicit English translation: pill, tablet
  – providing examples from corpus in context
– Generate dictionary entry as well as MT bitext
  – ikinini=pill, ikinini=tablet, ibinini=pills, ibinini=tablets
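
The example above can be walked through in a small sketch. It hard-codes a single pattern taken from the slide's own example (an iki- form paired with an ibi- form) and uses a naive English plural formed by adding "s"; both are simplifying assumptions, and the function names are hypothetical. A real predictor would need to cover far more of the morphology.

```python
def propose_noun_pairs(counts, known):
    """Pair out-of-dictionary iki-/ibi- forms as singular/plural candidates."""
    pairs = []
    for word in sorted(counts):
        if word.startswith("iki") and word not in known:
            plural = "ibi" + word[3:]
            if plural in counts and plural not in known:
                pairs.append((word, plural))
    return pairs


def make_entries(singular, plural, translations):
    """Turn elicited translations into dictionary/bitext pairs."""
    entries = [(singular, eng) for eng in translations]
    # Naive English plural: append "s" (an assumption; irregulars need care).
    entries += [(plural, eng + "s") for eng in translations]
    return entries


counts = {"ikinini": 43, "ibinini": 96}        # corpus counts from the slide
pairs = propose_noun_pairs(counts, known=set())
entries = make_entries(*pairs[0], ["pill", "tablet"])  # translator's answers
for kin, eng in entries:
    print(f"{kin}={eng}")
```

The translation step stays with a human translator; only the candidate pairing and the entry generation are automated here.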

Accomplishments
– Released monolingual, bilingual, and tree-banked data for Kinyarwanda and Malagasy
  – Data release v1.0 in February 2011
  – Data release v2.0 in October 2011
– Tools that can be shared
  – Tokenizer for Kinyarwanda and Malagasy
  – Diagnostic tools to check encoding, character sets, tokenization, tree well-formedness etc.
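
As an illustration of what a tree well-formedness diagnostic might check, the sketch below tests bracket balance only. It assumes trees are stored as parenthesized strings in the style of the Penn Treebank, which the slides do not confirm, and the function name is hypothetical; it is not the released diagnostic tool.

```python
def check_brackets(tree):
    """Return None if brackets balance, else a short problem description."""
    depth = 0
    for pos, ch in enumerate(tree):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return f"unexpected ')' at offset {pos}"
        elif depth == 0 and not ch.isspace():
            # Non-space text outside any bracket is not part of a tree.
            return f"material outside the tree at offset {pos}"
    if depth != 0:
        return f"{depth} unclosed '('"
    return None


for candidate in ["(S (NP we) (VP go))", "(S (NP we)", "(S a))"]:
    print(repr(candidate), "->", check_brackets(candidate))
```

A fuller check would also validate node labels and verify that the leaves match the standard tokenization.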

Challenges
– Need for more and better annotation tools to annotate faster and assure consistency
  – sentence segmentation, treebanking,...
– Need guidelines, workflow for data acquisition and annotation process
– Need reliable language experts for Kinyarwanda and Malagasy
– Need more data
  – Wikipedia, LDS, mlg/fre

Data release v2.1 (target: Dec. 2011)
– Full sentence-level segmentation on Kinyarwanda-English text
– Release tokenizers, morph analyzers, diagnostic tools

Data release v3.0 (target: May 2012)
Highest priority
– In-domain bilingual test sets
  – 500 sentences (300 newswire, 200 conversation)
  – Naturally occurring, source texts on both sides
  – Multiple translations if possible
– Large, modern Malagasy monolingual corpora
– Head markings (syntactic)
– Word alignment

Data release v3.0 (target: May 2012)
Next priority
– Increase size of Kinyarwanda-English dictionary
– More Malagasy-English news bitext
– Typo correction
– Bible in Kinyarwanda (?)
– Malagasy-English dictionary
– Morphological gold standard