Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing Matt Post, Chris Callison-Burch, and Miles Osborne June 8, 2012.

Slides:



Advertisements
Similar presentations
Yansong Feng and Mirella Lapata
Advertisements

Statistical Machine Translation Part II: Word Alignments and EM Alexander Fraser ICL, U. Heidelberg CIS, LMU München Statistical Machine Translation.
Linking Entities in #Microposts ROMIL BANSAL, SANDEEP PANEM, PRIYA RADHAKRISHNAN, MANISH GUPTA, VASUDEVA VARMA INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY,
Statistical Machine Translation Part II – Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart
MT Evaluation: Human Measures and Assessment Methods : Machine Translation Alon Lavie February 23, 2011.
Improving Machine Translation Quality via Hybrid Systems and Refined Evaluation Methods Andreas Eisele DFKI GmbH and Saarland University Helsinki, November.
CALTS, UNIV. OF HYDERABAD. SAP, LANGUAGE TECHNOLOGY CALTS has been in NLP for over a decade. It has participated in the following major projects: 1. NLP-TTP,
1 Developing Statistic-based and Rule-based Grammar Checkers for Chinese ESL Learners Howard Chen Department of English National Taiwan Normal University.
Vamshi Ambati | Stephan Vogel | Jaime Carbonell Language Technologies Institute Carnegie Mellon University A ctive Learning and C rowd-Sourcing for Machine.
The Unreasonable Effectiveness of Data Alon Halevy, Peter Norvig, and Fernando Pereira Kristine Monteith May 1, 2009 CS 652.
Predicting Text Quality for Scientific Articles Annie Louis University of Pennsylvania Advisor: Ani Nenkova.
Machine Translation Prof. Alexandros Potamianos Dept. of Electrical & Computer Engineering Technical University of Crete, Greece May 2003.
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
Course Summary LING 575 Fei Xia 03/06/07. Outline Introduction to MT: 1 Major approaches –SMT: 3 –Transfer-based MT: 2 –Hybrid systems: 2 Other topics.
Enhance legal retrieval applications with an automatically induced knowledge base Ka Kan Lo.
Languages of Asia Part 2: South Asia
ÓC-DAC Noida’2004 Efforts in Language & Speech Technology Natural Language Processing Lab Centre for Development of Advanced Computing (Ministry of Communications.
Does Syntactic Knowledge help English- Hindi SMT ? Avinesh. PVS. K. Taraka Rama, Karthik Gali.
1 Problems and Prospects in Collecting Spoken Language Data Kishore Prahallad Suryakanth V Gangashetty B. Yegnanarayana Raj Reddy IIIT Hyderabad, India.
AU-KBC FIRE2008 Submission - Cross Lingual Information Retrieval Track: Tamil- English Pattabhi R.K Rao and Sobha. L AU-KBC Research Centre, MIT Campus,
McEnery, T., Xiao, R. and Y.Tono Corpus-based language studies. Routledge. Unit A 2. Representativeness, balance and sampling (pp13-21)
Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Institute for System Programming of RAS.
English-Persian SMT Reza Saeedi 1 WTLAB Wednesday, May 25, 2011.
Processing of large document collections Part 3 (Evaluation of text classifiers, applications of text categorization) Helena Ahonen-Myka Spring 2005.
Achieving Domain Specificity in SMT without Over Siloing William Lewis, Chris Wendt, David Bullock Microsoft Research Machine Translation.
NERIL: Named Entity Recognition for Indian FIRE 2013.
Automatic Detection of Tags for Political Blogs Khairun-nisa Hassanali and Vasileios Hatzivassiloglou Human Language Technology Research Institute The.
AnswerBus Question Answering System Zhiping Zheng School of Information, University of Michigan HLT 2002.
Scott Duvall, Brett South, Stéphane Meystre A Hands-on Introduction to Natural Language Processing in Healthcare Annotation as a Central Task for Development.
2012: Monolingual and Crosslingual SMS-based FAQ Retrieval Johannes Leveling CNGL, School of Computing, Dublin City University, Ireland.
Active Learning for Statistical Phrase-based Machine Translation Gholamreza Haffari Joint work with: Maxim Roy, Anoop Sarkar Simon Fraser University NAACL.
2010 Failures in Czech-English Phrase-Based MT 2010 Failures in Czech-English Phrase-Based MT Full text, acknowledgement and the list of references in.
Morpho Challenge competition Evaluations and results Authors Mikko Kurimo Sami Virpioja Ville Turunen Krista Lagus.
Use of Hierarchical Keywords for Easy Data Management on HUBzero HUBbub Conference 2013 September 6 th, 2013 Gaurav Nanda, Jonathan Tan, Peter Auyeung,
Mining the Web to Create Minority Language Corpora Rayid Ghani Accenture Technology Labs - Research Rosie Jones Carnegie Mellon University Dunja Mladenic.
Coping with Surprise: Multiple CMU MT Approaches Alon Lavie Lori Levin, Jaime Carbonell, Alex Waibel, Stephan Vogel, Ralf Brown, Robert Frederking Language.
Part-Of-Speech Tagging using Neural Networks Ankur Parikh LTRC IIIT Hyderabad
NUDT Machine Translation System for IWSLT2007 Presenter: Boxing Chen Authors: Wen-Han Chao & Zhou-Jun Li National University of Defense Technology, China.
IIIT Hyderabad’s CLIR experiments for FIRE-2008 Sethuramalingam S & Vasudeva Varma IIIT Hyderabad, India 1.
14/12/2009ICON Dipankar Das and Sivaji Bandyopadhyay Department of Computer Science & Engineering Jadavpur University, Kolkata , India ICON.
Why Not Grab a Free Lunch? Mining Large Corpora for Parallel Sentences to Improve Translation Modeling Ferhan Ture and Jimmy Lin University of Maryland,
Ibrahim Badr, Rabih Zbib, James Glass. Introduction Experiment on English-to-Arabic SMT. Two domains: text news,spoken travel conv. Explore the effect.
What you have learned and how you can use it : Grammars and Lexicons Parts I-III.
Chinese Word Segmentation Adaptation for Statistical Machine Translation Hailong Cao, Masao Utiyama and Eiichiro Sumita Language Translation Group NICT&ATR.
An Iterative Approach to Extract Dictionaries from Wikipedia for Under-resourced Languages G. Rohit Bharadwaj Niket Tandon Vasudeva Varma Search and Information.
© 2015 albert-learning.com Indian languages Indian Languages.
2015/12/121 Extracting Key Terms From Noisy and Multi-theme Documents Maria Grineva, Maxim Grinev and Dmitry Lizorkin Proceeding of the 18th International.
Number Sense Disambiguation Stuart Moore Supervised by: Anna Korhonen (Computer Lab)‏ Sabine Buchholz (Toshiba CRL)‏
POS Tagger and Chunker for Tamil
Distant Course Master English English Language Course For Masters of Mathematical and Mechanical Faculty Saint Petersburg State University The Faculty.
Shallow Parsing for South Asian Languages -Himanshu Agrawal.
Coping with Surprise: Multiple CMU MT Approaches Alon Lavie Lori Levin, Jaime Carbonell, Alex Waibel, Stephan Vogel, Ralf Brown, Robert Frederking Language.
Crowdsourcing Ling 240. What is crowdfunding? Crowdsourcing—definition “the practice of obtaining information or services by soliciting input from a.
Statistical Machine Translation Part II: Word Alignments and EM Alex Fraser Institute for Natural Language Processing University of Stuttgart
October 10, 2003BLTS Kickoff Meeting1 Transfer with Strong Decoding Learning Module Transfer Rules {PP,4894} ;;Score: PP::PP [NP POSTP] -> [PREP.
LING 200 Introduction to Linguistics Prof. Sharon Hargus Winter 2009 Jan. 5, 2009.
Review: Review: Translating without in-domain corpus: Machine translation post-editing with online learning techniques Antonio L. Lagarda, Daniel Ortiz-Martínez,
Arnar Thor Jensson Koji Iwano Sadaoki Furui Tokyo Institute of Technology Development of a Speech Recognition System For Icelandic Using Machine Translated.
A CASE STUDY OF GERMAN INTO ENGLISH BY MACHINE TRANSLATION: MOSES EVALUATED USING MOSES FOR MERE MORTALS. Roger Haycock 
Neural Machine Translation
Statistical Machine Translation Part II: Word Alignments and EM
RECENT TRENDS IN SMT By M.Balamurugan, Phd Research Scholar,
Joint Training for Pivot-based Neural Machine Translation
REPORT WRITING Many types but two main kinds:
--Mengxue Zhang, Qingyang Li
Deep Learning based Machine Translation
Topics in Linguistics ENG 331
Introduction to Machine Translation
Statistical n-gram David ling.
Extracting Why Text Segment from Web Based on Grammar-gram
Presentation transcript:

Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing Matt Post, Chris Callison-Burch, and Miles Osborne June 8, 2012

2 Introduction Machine translation favors simple models with lots of data

3 Languages Indo-Aryan languages Dravidian languages Bengali বাংলা Hindi मानकहिनद्ी Malayalam മലയാളം Tamil தமிழ் Telugu తెలుగు Urdu اردو

4 Languages Indo-Aryan languages Dravidian languages BengaliHindiMalayalamTamilTeluguUrdu languagescriptfamily L1 (millions) Bengali বাংলা Indo-Aryan110 Hindi मानकहिनद् ी Indo-Aryan180 Malayalam മലയാളം Dravidian35 Tamil தமிழ் Dravidian65 Telugu తెలుగు Dravidian69 UrduاردوIndo-Aryan60

5 Characteristics Head-final Subject-object-verb (SOV) word order Agglutinative morphology inflectional: tense, person, number, gender, mood, voice e.g., ஈரிநீன் / eeRineen (“climbed”) “The senator prepared her remarks” செனட்டர் அவளை கருத்துக்கள் தயார். senator her remarksprepared. ஈறு + இன் + ஈன eeRu in een climb past 1p-sing-neuter

6 Introduction Data is often hard to come by, leading to understudied languages source: wals.info source: informal survey of ACL 2009 proceedings

7 Data collection We took the 100 most- popular Wikipedia articles in each language and translated them using Amazon’s Mechanical Turk in a 3-step process 1. Dictionary construction 2. Page translation 3. Vote gathering

8 1. Dictionary construction We built a source-language vocabulary, and solicited translations for each word from Turkers Each word was presented along with four sentences it occurred in As controls, we used titles, which link pages in Wikipedia across languages

9 2. Translation We took the 100 most-popular Wikipedia articles in each language and translated them using Amazon’s Mechanical Turk அண்ணாதுரை மிகச் சிறந்த தமிழ் சொற்பொழிவாளரும் மேடைப் பேச்சாளரும் ஆவார். Annadurai best Tamil lecturer stage spokesman is.Annadurai four non-expert translations annadurai was an excellent orator and a public speaker.annadurai is a very good speakerannadurai is one of the best tamil speeches and also stage speeches.annathurai was the great reader and also the stage speaker.

10 3. Votes In a final task, we collected five votes on which of the four translations was the best அண்ணாதுரை மிகச் சிறந்த தமிழ் சொற்பொழிவாளரும் மேடைப் பேச்சாளரும் ஆவார். Annadurai best Tamil lecturer stage spokesman is.Annadurai four non-expert translations annadurai was an excellent orator and a public speaker.annadurai is a very good speakerannadurai is one of the best tamil speeches and also stage speeches.annathurai was the great reader and also the stage speaker. 5 training sentences

11 Obtained about 500K English words for training, another 35K for tuning and testing Data collection Indic languages 0.5m English words Europarl ES-EN 50m English words

12 Data splits We produce four datasets: train, dev, devtest, test Steps: We manually assigned documents to one of seven categories We assigned categories to datasets in round-robin fashion

13 PLACESPEOPLETHINGSSEXRELIGION AgraGautama BuddhaAir pollutionAnal sexBhagavad Gita Bihar Harivansh Rai Bachchan EarthKama SutraDiwali ChinaIndira GandhiEssayMasturbationHanuman DelhiJaishankar Prasad GangesPenisHinduism HimalayasJawaharlal Nehru General knowledge Sex positionsHoli IndiaKabir Global warming Sexual intercourse Islam MumbaiKalpana ChawlaPollutionVaginaMahabharata NepalMahadevi VarmaSolar energyLANGUAGE &Puranas PakistanMeeraTerrorismCULTUREQuran RajasthanMohammed RafiTECHAyurvedaRamayana Red Fort Mahatma Gandhi BlogConstitution of IndiaShiva Taj MahalMother TeresaGoogleCricket Taj Majal: Shiva Temple? United StatesNavbharat TimesHindi Web Resources English languageVedas Uttar PradeshPremchandInternetHindi Cable NewsVishnu PEOPLERabindranath Tagore Mobile phoneHindi literaturePEOPLE A. P. J. Abdul Kalam Rani Lakshmibai News aggregator Hindi-Urdu grammar Subhas Chandra Bose Aishwarya RaiSachin TendulkarRSSHoroscopeSurdas AkbarSarojini NaiduWikipediaIndian cuisineSwami Vivekananda Amitabh Bachchan EVENTSYouTubeSanskritTulsidas Barack ObamaHistory of IndiaStandard Hindi Bhagat SinghWorld War II Dainik Jagran

14 Split at the document level into training, dev, devtest, and test Data collection languagewordssentences Bengali40,909439,153 Hindi53,666897,337 Malayalam192,672612,618 Tamil124,630579,474 Telugu97,700600,733 Urdu158,299886,007

15 Data splits languagetraindevdevtesttest Bengali5, Hindi12, Malayalam6, Tamil7, Telugu9, Urdu11, in thousands of sentences

16 Data splits

17 Translation quality Translations aasai was the first successfull movie for ajith kumar. first film by ajith kumar was ' asai ' ajith kumar first victory is aasai ajithkumar first success movie is aasai Data quality issues அஜித் குமாரின் முதல் வெற்றிப் படம் ஆசை. ajith kumar first successful movie assai. training sentence 17

18 Inconsistent orthography Translations in srilanka solar government chola rule in sri lanka. in srilanka chozhas ruled chola reign in sri lanka Data quality issues இலங்கையில் சோழர் ஆட்சி In Sri Lanka Chola ruled

19 false positive true positive false negative Legend Data quality issues Poor alignments

20 Research Questions 1. How well does SMT work on these languages? 2. Do linguistic annotations help? 3. How important is translation quality?

21 Q1: How well can we do? Hiero Linguistically un-informed grammars that define lexicalized (re)orderings, extracted from aligned text X → X (1) உறுதி செய்கிறது X (2), X (1) confirmed X (2)

22 Q1: How well can we do? LanguageBLEU-4 scoreGoogle Bengali Hindi Malayalam Tamil Telugu Urdu scores are the mean of three MERT runs

23 Q1: How well can we do? scores are the mean of three MERT runs

24 Q2: Do linguistic annotations help? Syntax-augmented machine translation (SAMT) Linguistically informed grammars extracted with the aid of a target-side parse tree S+. → PRP+VBZ (1) உறுதி செய்கிறது. (2), PRP+VBZ (1) confirmed. (2) SAMT grammars are particularly well-motivated Syntax should help describe high-level SOV → SVO reordering Previously well-attested for Urdu (Baker et al., SCALE 2009)

25 Q2: Do linguistic annotations help? LanguageHieroSAMTDifference Bengali Hindi Malayalam Tamil Telugu Urdu scores are the mean of three MERT runs — BLEU-4 scores —

26 Q2: Do linguistic annotations help? scores are the mean of three MERT runs

27 Q3: Does translation quality matter? Recall that we have four redundant translations for each Indian language sentence, along with independently- obtained votes about which is best We trained models on a quarter of the data 1. selected randomly 2. selected by plurality (breaking ties randomly) And tested on the same test sets

28 Q3: Does text quality matter? HieroSAMT Languagerandombestrandombest Bengali Hindi Malayalam---- Tamil Telugu Urdu scores are the mean of three MERT runs

29 Q3: Does text quality matter? scores are the mean of three MERT runs

30 Future directions Morphology We took the word segmentations as given, yet we know these languages to be highly agglutinative Better segmentation should help at all stages, from alignment to decoding

31 Future directions Text normalization: standardizing orthography would help immensely இலங்கையில் சோழர் ஆட்சி in srilanka solar governmentchola rule in sri lanka.in srilanka chozhas ruledchola reign in sri lanka misspellin g count japenese91 japans40 japenes9 japenies3 japaenese s 3 japeneese1 japense1 Tamil-English dataset Urdu-English dataset

32 Summary A suite of six low-resource, head-final, morphologically rich languages from the Indian subcontinent Provided data splits for comparisons Ideas can be tested in an afternoon on a variety of languages We suggest future work in the areas of morphology, normalization, and domain adaptation The website will track uses of the data as well as the best test-set scores joshua-decoder.org/indian-parallel-corpora

33 joshua-decoder.org/indian-parallel-corpora

34 Thanks Support: Google, Microsoft, EuroMatrixPlus, DARPA Lexi Birch Ghouse Ismail