LDMT MURI Data Collection and Linguistic Annotation November 2, 2012 Jason Baldridge, UT Austin Lori Levin, CMU.


1 LDMT MURI Data Collection and Linguistic Annotation November 2, 2012 Jason Baldridge, UT Austin Lori Levin, CMU

2 Purpose
Collect and build data
– Monolingual text
– Bilingual text
– Linguistic annotations
to support work on machine translation for
– Kinyarwanda-English
– Malagasy-English

3 Kinyarwanda Data Resources
[Diagram of word counts by release (1.0 Release 02/11, 2.0 Release 10/11). Panels: BILINGUAL (285k), ENGLISH monolingual (huge), KINYARWANDA monolingual (7m), ENG treebank, ENG text, KIN text, KIN treebank. Sources shown: KGMC (270k / 225k; 5.8k / 4.8k; 2.9k; 3.8k), Pbook (0.9k / 0.7k), Dict (9k / 8k), BBC (0.3k), IGT (0.1k / 0.06k), News (7m), GWord (8b), PTB (1m).]

4 Kinyarwanda Data Resources
[Diagram of word counts by release (1.0 Release 02/11, 2.0 Release 10/11, 3.0 Release 11/12). Panels: BILINGUAL (285k), ENGLISH monolingual (huge), KINYARWANDA monolingual (7m), ENG treebank, ENG text, KIN text, KIN treebank. Sources shown: KGMC (270k / 225k; 5.8k / 4.8k; 2.9k; 3.8k), Pbook (0.9k / 0.7k), Dict (9k / 8k), BBC (0.3k), IGT (0.1k / 0.06k), News (7m), GWord (8b), PTB (1m). Added for 3.0: Part-of-speech (2k), GFL (4.7k); existing annotations marked "Reviewed & improved".]

5 Malagasy Data Resources
[Diagram of word counts by release (1.0 Release 02/11, 2.0 Release 10/11). Panels: BILINGUAL (732k), ENGLISH monolingual (huge), MALAGASY monolingual, ENG treebank, ENG text, MLG text, MLG treebank. Sources shown: Bible (730k / 725k), News (2.1k / 2.3k), Gword (8b), PTB (1m).]

6 Malagasy Data Resources
[Diagram of word counts by release (1.0 Release 02/11, 2.0 Release 10/11, 3.0 Release 11/12). Panels: BILINGUAL (732k), ENGLISH monolingual (huge), MALAGASY monolingual, ENG treebank, ENG text, MLG text, MLG treebank. Sources shown: Bible (730k / 725k), News (2.1k / 2.3k, both marked "Reviewed & improved"), Gword (8b), PTB (1m). Added for 3.0: Part-of-speech (2k), Global voices (1.8m / 1.9m), Leipzig (600k), Global voices GFL (3.7k), Dictionary (77.5k).]

7 Malagasy Data Resources
Year 1: 19th-century Malagasy Bible
Year 2:
– Univ. of Leipzig Web Corpus: monolingual Malagasy, very clean
– CMU Global Voices Archive

8 Malagasy Resources

                                Tokens     Types    Hapax
Bible (Year 1)                 579,578    19,460    8,401
Leipzig corpus (Year 2)        618,282    41,462   23,659
CMU Global Voices (Year 2)   2,148,976    84,744   46,627
Total                        3,346,836   115,172   62,517

Malagasy - English Resources

                             eng-Tokens  eng-Types  mlg-Tokens  mlg-Types
Bible (Year 1)                  584,872     13,084     579,578     19,460
CMU Global Voices (Year 2)    1,785,472     63,357   2,148,976     84,744
Total                         2,370,344     67,790   3,346,836    115,172
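The token, type, and hapax columns on this slide are standard corpus statistics: tokens are running words, types are distinct word forms, and hapax legomena are types that occur exactly once. A minimal sketch of how such counts can be computed follows. The whitespace tokenization, the lowercasing, and the function name `corpus_stats` are assumptions made for illustration; the slides do not say how the project tokenized its corpora.

```python
from collections import Counter


def corpus_stats(text):
    """Count tokens, types, and hapax legomena in a string.

    Assumption: tokens are whitespace-separated and lowercased. The
    tokenizer actually used for the corpora is not described in the slides.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)
    # A hapax is a type whose frequency is exactly one.
    hapax = sum(1 for freq in counts.values() if freq == 1)
    return {"tokens": len(tokens), "types": len(counts), "hapax": hapax}


# Placeholder tokens, not real corpus data.
print(corpus_stats("a b a c d a e"))
```

Note that type and hapax counts are not additive across corpora, since a word form can appear in more than one source, so combined figures have to be computed over the merged text.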

9 CMU Global Voices Corpus
Domains include Twitter, blogs, news about popular democracy movements
Actively published by volunteer translators
– We are gathering ~500k words / language / year of high-quality parallel data

                          eng-Tokens  eng-Types  mlg-Tokens  mlg-Types
Global Voices <Jun 2011    1,318,780     56,414   1,569,343     72,906
Global Voices <Jun 2012    1,732,674     59,750   2,066,419     79,269

http://www.ark.cs.cmu.edu/global-voices/
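The "~500k words / language / year" figure can be checked against the two snapshot rows on this slide by subtracting the June 2011 token counts from the June 2012 ones. The sketch below does only that arithmetic; the dictionary layout and the name `yearly_growth` are illustrative choices, not part of the project's tooling.

```python
# Token counts copied from the two Global Voices snapshot rows on the slide.
snapshots = {
    "eng": {"<Jun 2011": 1_318_780, "<Jun 2012": 1_732_674},
    "mlg": {"<Jun 2011": 1_569_343, "<Jun 2012": 2_066_419},
}


def yearly_growth(counts):
    """Tokens added between the June 2011 and June 2012 snapshots."""
    return counts["<Jun 2012"] - counts["<Jun 2011"]


growth = {lang: yearly_growth(counts) for lang, counts in snapshots.items()}
print(growth)  # {'eng': 413894, 'mlg': 497076}
```

On these figures the Malagasy side grew by roughly 497k tokens and the English side by roughly 414k over the year.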

10 Morphological analysis
We decided against creating morphological gold-standard annotations from the output of finite-state transducers.
Initially tried to use the XFST analyzer created by Dalrymple, Liakata and Mackie (2006).
– Quality of the output of the Dalrymple transducer was poor (ambiguous, many incorrect analyses).
No existing Kinyarwanda transducer
– Any annotations would be subject to changing analyses during transducer development.

11 Morphological analysis
Developed new transducers for both Kinyarwanda and Malagasy.
– Less ambiguity
– Cautious guessing for unknown stems => better precision
Improvements driven by measuring ambiguity/coverage on data and effect on performance in other tasks.
We may produce annotations once transducer development is deemed sufficient.
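The slide names ambiguity and coverage as the measurements driving transducer improvements but does not define them. One common reading, assumed here, is that coverage is the fraction of tokens receiving at least one analysis and ambiguity is the mean number of analyses per analyzed token. The sketch below computes both from analyzer output that has already been collected into lists; the function name and the tag strings in the example are hypothetical, and no real transducer is called.

```python
def ambiguity_and_coverage(analyses_per_token):
    """Return (coverage, average ambiguity) for analyzer output.

    analyses_per_token holds one list per token, containing every analysis
    the analyzer returned for it (an empty list means no analysis).
    Assumed definitions, not taken from the slides:
      coverage  = analyzed tokens / all tokens
      ambiguity = mean number of analyses per analyzed token
    """
    total = len(analyses_per_token)
    analyzed = [a for a in analyses_per_token if a]
    coverage = len(analyzed) / total if total else 0.0
    ambiguity = (
        sum(len(a) for a in analyzed) / len(analyzed) if analyzed else 0.0
    )
    return coverage, ambiguity


# Hypothetical output for four tokens; the tag strings are placeholders.
output = [["N+sg"], ["V+pres", "N+pl"], [], ["V+past"]]
coverage, ambiguity = ambiguity_and_coverage(output)
print(f"coverage={coverage:.2f} ambiguity={ambiguity:.2f}")
```

Tracking the two numbers together matters because they trade off: guessing analyses for unknown stems raises coverage but tends to raise ambiguity as well, which is why cautious guessing is paired with a precision gain on the slide.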

12 Syntactic annotations
During the past year, we reviewed and revised the phrase structures annotated for kin and mlg texts.
– Analyses and labels made more consistent across languages
– Head annotations added to enable dependency parsing training/evaluation.
– All tokenization standardized.
GFL annotations: about 4k tokens each for kin and mlg
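Head annotations make dependency training data derivable from phrase structures: once each phrase marks which child is its head, every non-head child's lexical head can be attached to the head child's lexical head. The sketch below illustrates that general technique only. The tuple-based tree encoding, the function name `to_dependencies`, and the English example are assumptions for illustration and do not reflect the project's actual annotation format.

```python
def to_dependencies(tree, arcs=None):
    """Return (lexical_head, arcs) for a head-annotated phrase structure.

    A tree is either a token string (a leaf) or a tuple
    (label, head_index, children), where head_index picks the head child.
    Each arc is a (head_word, dependent_word) pair. Words stand in for
    token positions here, so a real converter would use indices to keep
    repeated words apart.
    """
    if arcs is None:
        arcs = []
    if isinstance(tree, str):
        return tree, arcs
    _label, head_index, children = tree
    # Resolve the lexical head of every child first (bottom-up).
    child_heads = [to_dependencies(child, arcs)[0] for child in children]
    head = child_heads[head_index]
    # Attach each non-head child's lexical head to the phrase's head.
    for i, dependent in enumerate(child_heads):
        if i != head_index:
            arcs.append((head, dependent))
    return head, arcs


# Illustrative English tree: the VP heads S, and the noun heads NP.
tree = ("S", 1, [("NP", 1, ["the", "dog"]), ("VP", 0, ["barks"])])
root, arcs = to_dependencies(tree)
print(root, arcs)  # barks [('dog', 'the'), ('barks', 'dog')]
```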

13 Data accomplishments
Fieldwork on Kinyarwanda that informs theoretical linguistic work and transducers.
New morphological transducers for kin and mlg.
V 3.0 of monolingual, bilingual, and treebanked data for Kinyarwanda and Malagasy to be released this coming week.
– An order of magnitude more parallel data (mlg)
– Better & more syntactic data (kin/mlg)

14 Data accomplishments
Evaluation
– Pilot annotations for linguistically targeted test suites
Formal linguistic advances
– GFL specification and tools for annotation and visualization
– Abstract Meaning Representation (AMR): leverage ideas, data and tools from ISI as part of other synergistic projects.

