Published by Reece Crozier. Modified over 9 years ago.
1
Morphological Analysis for Phrase-Based Statistical Machine Translation
Luong Minh Thang
WING group meeting – 15 Aug, 2008
HYP update - part 1
4/30/2015
2
Agenda
– Introduction: what does my project title mean?
– Language pair
– English-Finnish challenges
– Related work
– Project direction
3
Introduction I: phrase-based SMT
– Statistical: derive statistical information from large data
– Phrase-based: capture local constraints
[Figure: word alignment between "Maria no daba una bofetada a la bruja verde" (positions 1-9) and "NULL Mary did not slap the green witch" (positions 0-7); the two sentences are labelled Source and Target.]
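As a rough illustration of what a word alignment for this sentence pair can look like in code, here is a minimal sketch. The alignment links are an assumption: they follow the usual textbook alignment for this example and are not recovered from the slide's figure. The names `spanish`, `english`, `alignment`, and `group_by_english` are hypothetical.

```python
# Spanish words are numbered 1-9, English words 0-7 (0 is the NULL word).
spanish = "Maria no daba una bofetada a la bruja verde".split()
english = ["NULL"] + "Mary did not slap the green witch".split()

# ASSUMED alignment: Spanish position -> English position.
# These links are the common textbook ones, not taken from the slide.
alignment = {1: 1, 2: 3, 3: 4, 4: 4, 5: 4, 6: 0, 7: 5, 8: 7, 9: 6}

def group_by_english(spanish, english, alignment):
    """Collect the Spanish words aligned to each English word."""
    groups = {}
    for es_pos, en_pos in sorted(alignment.items()):
        groups.setdefault(english[en_pos], []).append(spanish[es_pos - 1])
    return {en_word: " ".join(words) for en_word, words in groups.items()}

groups = group_by_english(spanish, english, alignment)
```

Under these assumed links, three adjacent Spanish words map to the single English word "slap", which is the kind of local unit a phrase-based model treats as one phrase.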
4
Introduction II - Morphology
Morpheme: minimal meaning-bearing unit
– machines = machine + s
– translation = translate + ion
– goalkeeper = goal + keeper
English is a low-inflected language with a simple morphological structure.
Highly inflected languages are much more complicated!
5
Introduction III – highly inflected languages
Concatenate a chain of morphemes to form a word
– Finnish: oppositio + kansa + n + edusta + ja (opposition + people + of + represent + -ative) = opposition member of parliament
– Turkish: uygarlaştıramadıklarımızdanmışsınızcasına (uygar + laş + tır + ama + dık + lar + ımız + dan + mış + sınız + casına) = (behaving) as if you are among those whom we could not cause to become civilized
This is a single word!
6
Introduction IV – Why morphology-aware SMT?
– Tackle the data sparseness problem (statistics from 1.021.180 sentence pairs)

           Type count   Token count
English    105.144      121.442.173
Finnish    516.102      130.128.883

– Capture the relations among words

English    machine    machines
Spanish    máquina    máquinas
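The type and token counts behind the sparseness argument can be computed as in the sketch below. The three-sentence corpora and the hand-written morpheme segmentation are toy assumptions for illustration only; they are not the data behind the slide's figures, and `type_token_counts` is a hypothetical helper name.

```python
from collections import Counter

def type_token_counts(sentences):
    """Return (number of distinct word types, total number of tokens)."""
    counts = Counter(tok for sent in sentences for tok in sent.lower().split())
    return len(counts), sum(counts.values())

# Toy data (assumed): the same three phrases in English and Finnish.
english = ["in the house", "from the house", "of the house"]
finnish = ["talossa", "talosta", "talon"]
# Hand-segmented Finnish (assumed segmentation): stem + case suffix.
finnish_segmented = ["talo ssa", "talo sta", "talo n"]

en_counts = type_token_counts(english)
fi_counts = type_token_counts(finnish)
seg_counts = type_token_counts(finnish_segmented)
```

In this toy case every Finnish token is a new type, whereas after segmentation the stem "talo" is shared across all three forms, so the statistics for it are no longer split three ways.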
7
Language pair I – our choice?
We chose English-Finnish as our main translation task.
[Figure: languages placed on a scale from low-inflected to highly inflected (Dyer, 2007); Vietnamese is marked on the scale.]
8
Language pair II – why Finnish?
Honestly, I don't know Finnish … but we chose it because:
– Corpora are available
– Finnish is an agglutinative, morphologically complex language, suitable for our project scope
– We can investigate translation from low-inflected to highly inflected languages -> an area to explore, yet hard!
9
English-Finnish challenges I – many-to-one word relationship
Finnish uses suffixes to express grammatical relations and also to derive new words (about 14-15 cases for nouns).

Case          Suffix   English prep.     Sample word form   Translation of the sample
nominatiivi   -                          talo               house
genetiivi     -n       of                talon              of (a) house
essiivi       -na      as                talona             as a house
inessiivi     -ssa     in                talossa            in (a) house
elatiivi      -sta     from (inside)     talosta            from (a) house
komitatiivi   -ne-     together (with)   taloineni          with my house(s)

– Many-to-one English-Finnish word relationship: need word-morpheme correspondence
– Not merely concatenation
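A minimal sketch of the suffix table above as a lookup, using only the rows whose sample form is the stem plus the listed suffix. This is naive concatenation for the single stem "talo" under stated assumptions: it ignores vowel harmony and stem changes, and it deliberately leaves out the comitative, whose sample form (taloineni) does not fit the stem-plus-suffix pattern. The names `CASE_SUFFIXES` and `inflect` are hypothetical.

```python
# Case name -> suffix, copied from the table (comitative omitted on purpose).
CASE_SUFFIXES = {
    "nominatiivi": "",
    "genetiivi": "n",
    "essiivi": "na",
    "inessiivi": "ssa",
    "elatiivi": "sta",
}

def inflect(stem, case):
    """Naively attach the case suffix to the stem (no phonological rules)."""
    return stem + CASE_SUFFIXES[case]

# One Finnish word form per case; each corresponds to several English words.
forms = {case: inflect("talo", case) for case in CASE_SUFFIXES}
```

Each generated form stands for an English phrase of two or three words (for example "in (a) house"), which is the many-to-one relationship the slide describes.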
10
English-Finnish challenges II – word order
Word order is "free" in Finnish
– Pete rakastaa Annaa = Pete loves Anna (normal)
– Annaa Pete rakastaa: emphasizes Annaa
– Rakastaa Pete Annaa: emphasizes rakastaa = Pete does love Anna
– Pete Annaa rakastaa: stress on Pete
– Rakastaa Annaa Pete: does not sound like a normal sentence, but is quite understandable
11
English-Finnish challenges III – surface form generation
– After translating from English words to Finnish morphemes, we need a surface generation step:
  oppositio + kansa + n + edusta + ja -> oppositiokansanedustaja
– What if morphemes are missing or the morpheme order changes?
– Need a more error-tolerant surface recovery algorithm
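To make the surface generation step concrete, here is a minimal sketch: concatenate the morphemes and, if the result is not a known word, fall back to the closest entry in a word list. The fallback by string similarity (Python's standard `difflib`) is a stand-in chosen for illustration, not the recovery algorithm the project proposes; the tiny `vocabulary`, the `cutoff` value, and the name `recover_surface` are all assumptions.

```python
import difflib

def recover_surface(morphemes, vocabulary, cutoff=0.8):
    """Join morphemes; if the result is unknown, try the closest known word."""
    candidate = "".join(morphemes)
    if candidate in vocabulary:
        return candidate
    # Stand-in for an error-tolerant step: nearest word by string similarity.
    matches = difflib.get_close_matches(candidate, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else candidate

# Assumed toy vocabulary of known Finnish surface forms.
vocabulary = ["oppositiokansanedustaja", "kansanedustaja", "talossa"]

complete = recover_surface(["oppositio", "kansa", "n", "edusta", "ja"], vocabulary)
missing = recover_surface(["oppositio", "kansa", "edusta", "ja"], vocabulary)
unknown = recover_surface(["talo", "na"], vocabulary)
```

With the morpheme "n" dropped, plain concatenation yields a non-word, and the similarity fallback maps it back to the known form. When nothing in the list is close enough, the sketch returns the plain concatenation unchanged.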
12
Related work I – low-to-high inflected languages
Much work goes from highly inflected to low-inflected languages, but very little addresses the opposite direction, which is considered hard in (Koehn, 2005)
– (Yang & Kirchhoff, 2006): Finnish-English, backoff
– (Oflazer & Durgar El-Kahlout, 2006, 2007): English-Turkish, word-morpheme translation, then simply concatenating morphemes
All use language-dependent tools & syntactic knowledge: TreeTagger, Snowball stemmer …
13
Related work II – surface form recovery
– (Toutanova et al., 2007, 2008): English-Russian, English-Arabic; translate stem-to-stem; predict inflection from stems using many different features (lexical, morphological, and syntactic)
– (Avramidis & Koehn, 2008): English-Greek; use syntax to get the "missing" morphology, depending on the syntactic position; noun case agreement and verb person conjugation
Both rely mostly on manually annotated data
14
Project direction
– Use a language-independent tool (Morfessor), based on unannotated data only (i.e. no feature data or syntactic information)
– Work on general surface-form recovery
– We would like to have a unified view of the translation process, separating low-low, low-high, high-low, and high-high
[Figure callout: "We are here"]
15
Reference I
– Chris Dyer, 2007: http://www.ling.umd.edu/~redpony/edinburgh.pdf
– Jurafsky, D., & Martin, J. H. (2007): Speech and Language Processing (book)
– The Finnish language: http://www.cs.tut.fi/~jkorpela/Finnish.html
– Yang & Kirchhoff, 2006: Phrase-based backoff models for machine translation of highly inflected languages
– Oflazer & Durgar El-Kahlout, 2006: Initial Explorations in English to Turkish Statistical Machine Translation
16
Reference II
– Oflazer & Durgar El-Kahlout, 2007: Exploring different representational units in English-to-Turkish statistical machine translation
– Toutanova et al., 2007: Generating complex morphology for machine translation
– Toutanova et al., 2008: Applying morphology generation models to machine translation
– Avramidis & Koehn, 2008: Enriching morphologically poor languages for statistical machine translation
17
Q & A?
18
To be continued … Thank you!