1
Machine Translation: Challenges and Approaches
Nizar Habash
Post-doctoral Fellow, Center for Computational Learning Systems, Columbia University
Invited Lecture, CS 4705: Introduction to Natural Language Processing, Fall 2004
2
Sounds like Faulkner?
http://www.ee.ucla.edu/~simkin/sounds_like_faulkner.html
Faulkner: "It lay on the table a candle burning at each corner upon the envelope tied in a soiled pink garter two artificial flowers. Not hit a man in glasses."
  – William Faulkner, "The Sound and the Fury"
Machine Translation: "It was once a shade, which was in all beautiful weather under a tree and varied like the branches in the wind."
  – Source text: "Es war einmal ein Schatten, der lag bei jedem schönen Wetter unter einem Baum und schwankte wie die Zweige im Wind." Helmut Wördemann, "Der unzufriedene Schatten" (translated by Systran)
3
Progress in MT
Statistical MT example, from a talk by Charles Wayne, DARPA
2002: insistent Wednesday may recurred her trips to Libya tomorrow for flying Cairo 6-4 ( AFP ) - an official announced today in the Egyptian lines company for flying Tuesday is a company " insistent for flying " may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment.
2003: Egyptair Has Tomorrow to Resume Its Flights to Libya Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.
Human Translation: Egypt Air May Resume its Flights to Libya Tomorrow Cairo, April 6 (AFP) - An Egypt Air official announced, on Tuesday, that Egypt Air will resume its flights to Libya as of tomorrow, Wednesday, after the UN Security Council had announced the suspension of the embargo imposed on Libya.
4
Road Map
Why Machine Translation (MT)?
Multilingual Challenges for MT
MT Approaches
MT Evaluation
5
Why (Machine) Translation?
Languages in the world
  – 6,800 living languages
  – 600 with written tradition
  – 95% of world population speaks 100 languages
Translation Market
  – $8 Billion Global Market
  – Doubling every five years (Donald Barabé, invited talk, MT Summit 2003)
6
Why Machine Translation?
Full Translation
  – Domain specific: weather reports
Machine-aided Translation
  – Translation dictionaries
  – Translation memories
  – Requires post-editing
Cross-lingual NLP applications
  – Cross-language IR
  – Cross-language Summarization
7
Road Map
Why Machine Translation (MT)?
Multilingual Challenges for MT
  – Orthographic variations
  – Lexical ambiguity
  – Morphological variations
  – Translation divergences
MT Paradigms
MT Evaluation
8
Multilingual Challenges
Orthographic Variations
  – Ambiguous spelling: كتب الاولاد اشعارا vs. كَتَبَ الأوْلادُ اشعَاراً
  – Ambiguous word boundaries
Lexical Ambiguity
  – Bank → بنك (financial) vs. ضفة (river)
  – Eat → essen (human) vs. fressen (animal)
9
Multilingual Challenges
Morphological Variations
Affixation vs. Root+Pattern
  – write → written; كتب → مكتوب
  – kill → killed; قتل → مقتول
  – do → done; فعل → مفعول
Tokenization (conj, article, noun, plural)
  – And the cars → and the cars
  – والسيارات → w Al SyArAt
  – Et les voitures → et le voitures
10
Multilingual Challenges
Translation Divergences
How languages map semantics to syntax
35% of sentences in TREC El Norte Corpus (Dorr et al 2002)
Divergence Types
  – Categorial (X tener hambre → X be hungry) [98%]
  – Conflational (X dar puñaladas a Z → X stab Z) [83%]
  – Structural (X entrar en Y → X enter Y) [35%]
  – Head Swapping (X cruzar Y nadando → X swim across Y) [8%]
  – Thematic (X gustar a Y → Y like X) [6%]
11
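Translation Divergences: conflation
  – English: I am not here (dependency nodes: be; I, here, not)
  – Arabic: لست هنا "I-am-not here" (dependency nodes: ليس; انا, هنا)
  – French: Je ne suis pas ici "I not be not here" (dependency nodes: etre; Je, ici, ne, pas)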
12
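Translation Divergences: categorial, thematic and structural
  – English: I am cold (dependency nodes: be; I, cold)
  – Arabic: انا بردان "I cold" (dependency nodes: *; انا, بردان)
  – Hebrew: קר לי "cold for-me" (dependency nodes: *; קר, ל, אני)
  – Spanish: tengo frio "I-have cold" (dependency nodes: tener; Yo, frio)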
13
Translation Divergences: head swap and categorial
  – English: I swam across the river quickly (dependency nodes: swim; I, quickly, across, river)
  – Arabic: اسرعت عبور النهر سباحة "I-sped crossing the-river swimming" (dependency nodes: اسرع; انا, سباحة, عبور, نهر)
14
Translation Divergences: head swap and categorial
  – English: I swam across the river quickly (dependency nodes: swim; I, quickly, across, river)
  – Hebrew: חציתי את הנהר בשחיה במהירות "I-crossed obj river in-swim speedily" (dependency nodes: חצה; אני, ב, את, נהר, ב, שחיה, מהירות)
15
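Translation Divergences: head swap and categorial
  – Hebrew nodes: חצה; אני, ב, את, נהר, ב, שחיה, מהירות
  – Arabic nodes: اسرع; انا, سباحة, عبور, نهر
  – English nodes: swim; I, quickly, across, river
  – Part-of-speech labels on the diagram: noun, prep, verb, noun, adverb, verb, noun, verb, noun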
16
Translation Divergences: Orthography+Morphology+Syntax
  – English: mom’s car
  – Chinese: 妈妈的车 mama de che "mom possessed-by car"
  – Arabic: سيارة ماما sayyArat mama
  – French: la voiture de maman
17
Road Map
Why Machine Translation (MT)?
Multilingual Challenges for MT
MT Approaches
  – Gisting / Transfer / Interlingua
  – Statistical / Symbolic / Hybrid
  – Practical Considerations
MT Evaluation
18
MT Approaches
MT Pyramid
Analysis: Source word, Source syntax, Source meaning
Generation: Target meaning, Target syntax, Target word
Levels shown: Gisting
19
MT Approaches
Gisting Example
Source: Sobre la base de dichas experiencias se estableció en 1988 una metodología.
Gisting output: Envelope her basis out speak experiences them settle at 1988 one methodology.
Human translation: On the basis of these experiences, a methodology was arrived at in 1988.
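A minimal sketch of how gisting produces output like the example above: each source word is replaced by a single dictionary entry, with no analysis of syntax or meaning. The `GLOSSARY` entries and the `gist` function are hypothetical; the entries were chosen only to mirror the slide's output and do not come from any real system.

```python
# Toy word-for-word glossary (hypothetical; one fixed sense per source word).
GLOSSARY = {
    "sobre": "envelope",      # also "on/about", but gisting picks one sense blindly
    "la": "her",
    "base": "basis",
    "de": "out",
    "dichas": "speak",
    "experiencias": "experiences",
    "se": "them",
    "estableció": "settle",
    "en": "at",
    "una": "one",
    "metodología": "methodology",
}


def gist(sentence: str) -> str:
    """Replace each source word with its glossary entry, keeping source word order."""
    tokens = sentence.rstrip(".").split()
    # Words missing from the glossary (here: "1988") pass through unchanged.
    glossed = [GLOSSARY.get(token.lower(), token) for token in tokens]
    return " ".join(glossed).capitalize() + "."


print(gist("Sobre la base de dichas experiencias se estableció en 1988 una metodología."))
```

The wrong sense choices ("sobre" as a noun, "dichas" as a verb form) and the untouched word order are exactly what the transfer and interlingua levels of the pyramid are meant to address.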
20
MT Approaches
MT Pyramid
Analysis: Source word, Source syntax, Source meaning
Generation: Target meaning, Target syntax, Target word
Levels shown: Gisting, Transfer
21
MT Approaches
Transfer Example
Transfer Lexicon
  – Map SL structure to TL structure
  – poner (:subj X, :obj mantequilla, :mod en (:obj Y)) → butter (:subj X, :obj Y)
  – X puso mantequilla en Y → X buttered Y
22
MT Approaches
MT Pyramid
Analysis: Source word, Source syntax, Source meaning
Generation: Target meaning, Target syntax, Target word
Levels shown: Gisting, Transfer, Interlingua
23
MT Approaches Interlingua Example: Lexical Conceptual Structure (Dorr, 1993)
24
MT Approaches
MT Pyramid
Analysis: Source word, Source syntax, Source meaning
Generation: Target meaning, Target syntax, Target word
Levels shown: Gisting, Transfer, Interlingua
25
MT Approaches
MT Pyramid
Analysis: Source word, Source syntax, Source meaning
Generation: Target meaning, Target syntax, Target word
Resources shown: Dictionaries/Parallel Corpora, Transfer Lexicons, Interlingual Lexicons
26
MT Approaches
Statistical vs. Symbolic
Analysis: Source word, Source syntax, Source meaning
Generation: Target meaning, Target syntax, Target word
27
MT Approaches Noisy Channel Model Portions from http://www.clsp.jhu.edu/ws03/preworkshop/lecture_yamada.pdf
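The slide's figure did not survive in this transcript. As a hedged reminder, the noisy channel model is usually written as follows, where the symbols are assumptions not taken from the slide: $f$ is the foreign (source) sentence, $e$ a candidate English (target) sentence, $P(e)$ the language model and $P(f \mid e)$ the translation model.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)}
        \;=\; \arg\max_{e} P(f \mid e)\, P(e)
```

The denominator $P(f)$ is dropped in the last step because it is the same for every candidate $e$.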
28
MT Approaches IBM Model (Word-based Model) http://www.clsp.jhu.edu/ws03/preworkshop/lecture_yamada.pdf
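The slide's figure is not in this transcript. As an illustration only, the sketch below trains word translation probabilities t(f | e) with EM in the style of IBM Model 1. It is simplified (no NULL word, a three-sentence toy corpus), and the function name `train_ibm_model1` and the data are assumptions, not material from the lecture.

```python
from collections import defaultdict


def train_ibm_model1(pairs, iterations=10):
    """Estimate t(f | e) with EM, IBM Model 1 style (no NULL word, for brevity)."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    # Start from a uniform distribution over foreign words.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) alignment links
        total = defaultdict(float)  # expected count of e being aligned at all
        # E-step: spread each foreign word's unit of mass over the English words.
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    p = t[(f, e)] / norm
                    count[(f, e)] += p
                    total[e] += p
        # M-step: renormalise the expected counts into probabilities.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


# Toy parallel corpus (hypothetical): (foreign tokens, English tokens).
pairs = [
    ("la casa".split(), "the house".split()),
    ("la flor".split(), "the flower".split()),
    ("casa verde".split(), "green house".split()),
]
t = train_ibm_model1(pairs)
for f, e in [("la", "the"), ("casa", "house"), ("la", "house")]:
    print(f"t({f} | {e}) = {t[(f, e)]:.3f}")
```

Even with three sentence pairs, co-occurrence alone pushes probability mass toward "la"/"the" and "casa"/"house", which is the intuition behind word-based models.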
29
MT Approaches
Statistical vs. Symbolic vs. Hybrid
Analysis: Source word, Source syntax, Source meaning
Generation: Target meaning, Target syntax, Target word
31
MT Approaches
Hybrid Example: GHMT
Generation-Heavy Hybrid Machine Translation
Lexical transfer but NO structural transfer
Source sentence: Maria puso la mantequilla en el pan.
Source dependency: poner (:subj Maria, :obj mantequilla, :mod en (:obj pan))
Lexical options, same structure:
  – poner → lay, locate, place, put, render, set, stand
  – Maria → Maria
  – mantequilla → butter, bilberry
  – en → on, in, into, at
  – pan → bread, loaf
32
MT Approaches
Hybrid Example: GHMT
LCS-driven Expansion
Conflation Example
  – PUT_V (Agent: MARIA, Theme: BUTTER_N, Goal: BREAD)
  – [CAUSE GO], Categorial Variation
  – BUTTER_V (Agent: MARIA, Goal: BREAD)
33
MT Approaches
Hybrid Example: GHMT
Structural Overgeneration
  – put (Maria, butter, on bread)
  – lay (Maria, butter, at loaf)
  – render (Maria, butter, into loaf)
  – butter (Maria, bread)
  – Maria butter …
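A rough sketch of the lexical side of overgeneration: every combination of the target-word options listed on the lexical-transfer slide becomes a candidate, and pruning is left to the statistical ranking that follows. This is an assumption-laden illustration, not the GHMT implementation; it does not produce the conflated structures (such as "butter Maria bread") that come from LCS-driven expansion, and `OPTIONS` and `overgenerate` are hypothetical names.

```python
from itertools import product

# Target-word options per source node, copied from the lexical-transfer slide.
OPTIONS = {
    "poner": ["lay", "locate", "place", "put", "render", "set", "stand"],
    "Maria": ["Maria"],
    "mantequilla": ["butter", "bilberry"],
    "en": ["on", "in", "into", "at"],
    "pan": ["bread", "loaf"],
}


def overgenerate(options):
    """Return every combination of target words, one choice per source node."""
    source_nodes = list(options)
    return [
        dict(zip(source_nodes, choice))
        for choice in product(*options.values())
    ]


candidates = overgenerate(OPTIONS)
print(len(candidates))  # 7 * 1 * 2 * 4 * 2 combinations
print(candidates[0])
```

The candidate count grows multiplicatively with the number of options per node, which is why a cheap target-side scoring model is needed to rank the output.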
34
MT Approaches
Hybrid Example: GHMT
Target Statistical Resources
Structural N-gram Model
  – Long-distance
  – Lexemes
  – Example (dependency nodes): buy; John, car; car; a, red
Surface N-gram Model
  – Local
  – Surface-forms
  – Example: John bought a red car
35
MT Approaches
Hybrid Example: GHMT
Linearization & Ranking
  – Maria buttered the bread: -47.0841
  – Maria butters the bread: -47.2994
  – Maria breaded the butter: -48.7334
  – Maria breads the butter: -48.835
  – Maria buttered the loaf: -51.3784
  – Maria butters the loaf: -51.5937
  – Maria put the butter on bread: -54.128
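To make the ranking step concrete, here is a minimal sketch that scores candidate sentences with an add-one smoothed bigram model and sorts them by log-probability. The four-sentence `CORPUS` is a made-up stand-in for real target-language data, so the scores will not match the slide's numbers; only the mechanism (higher log-probability ranks first) is illustrated.

```python
import math
from collections import Counter

# Tiny stand-in for a target-language corpus (hypothetical data).
CORPUS = [
    "maria buttered the bread",
    "john buttered the toast",
    "maria put the butter on the bread",
    "john ate the bread",
]


def bigrams(tokens):
    """Bigrams over the sentence padded with start and end markers."""
    padded = ["<s>"] + tokens + ["</s>"]
    return list(zip(padded, padded[1:]))


history_counts = Counter()  # how often each word occurs as a bigram history
bigram_counts = Counter()
for line in CORPUS:
    tokens = line.split()
    history_counts.update(["<s>"] + tokens)
    bigram_counts.update(bigrams(tokens))

# Words that can be predicted: every corpus word plus the end marker.
VOCAB = {word for line in CORPUS for word in line.split()} | {"</s>"}


def log_prob(sentence):
    """Add-one smoothed bigram log-probability of a candidate sentence."""
    v = len(VOCAB)
    return sum(
        math.log((bigram_counts[(a, b)] + 1) / (history_counts[a] + v))
        for a, b in bigrams(sentence.lower().split())
    )


candidates = [
    "Maria put the butter on bread",
    "Maria breaded the butter",
    "Maria buttered the bread",
]
ranked = sorted(candidates, key=log_prob, reverse=True)
for sentence in ranked:
    print(f"{sentence}\t{log_prob(sentence):.4f}")
```

A plain sum of log-probabilities favors shorter candidates, since each extra word adds a negative term.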
36
MT Approaches
Practical Considerations
Resources Availability
  – Parsers and Generators: input/output compatibility
  – Translation Lexicons: word-based vs. transfer/interlingua
  – Parallel Corpora: domain of interest; bigger is better
Time Availability
  – Statistical training, resource building
37
MT Approaches
Resource Poverty
No Parser? No Translation Dictionary?
Parallel Corpus
  – Align with rich language
  – Extract dictionary
  – Parse rich side
  – Infer parses
  – Build a statistical parser
38
Road Map
Why Machine Translation (MT)?
Multilingual Challenges for MT
MT Approaches
MT Evaluation
39
More art than science
Wide range of Metrics/Techniques
  – interface, …, scalability, …, faithfulness, ... space/time complexity, … etc.
Automatic vs. Human-based
  – Dumb Machines vs. Slow Humans
40
MT Evaluation Metrics
System-based Metrics
Count internal resources: size of lexicon, number of grammar rules, etc.
  – easy to measure
  – not comparable across systems
  – not necessarily related to utility
(Church and Hovy 1993)
41
MT Evaluation Metrics
Text-based Metrics
  – Sentence-based Metrics
    Quality: Accuracy, Fluency, Coherence, etc.
    3-point scale to 100-point scale
  – Comprehensibility Metrics
    Comprehension, Informativeness, x-point scales, questionnaires
    most related to utility
    hard to measure
42
MT Evaluation Metrics
Text-based Metrics (cont’d)
  – Amount of Post-Editing
    number of keystrokes per page
    not necessarily related to utility
Cost-based Metrics
  – Cost per page
  – Time per page
43
Human-based Evaluation Example Accuracy Criteria
44
Human-based Evaluation Example Fluency Criteria
45
Fluency vs. Accuracy
[Figure: plot with axes Accuracy and Fluency; points labeled conMT, FAHQ MT, Prof. MT, Info. MT]
46
Automatic Evaluation Example
Bleu Metric
Bleu
  – BiLingual Evaluation Understudy (Papineni et al 2001)
  – Modified n-gram precision with length penalty
  – Quick, inexpensive and language independent
  – Correlates highly with human evaluation
  – Bias against synonyms and inflectional variations
47
Automatic Evaluation Example
Bleu Metric
Test Sentence
  – colorless green ideas sleep furiously
Gold Standard References
  – all dull jade ideas sleep irately
  – drab emerald concepts sleep furiously
  – colorless immature thoughts nap angrily
48
Automatic Evaluation Example
Bleu Metric
Test Sentence
  – colorless green ideas sleep furiously
Gold Standard References
  – all dull jade ideas sleep irately
  – drab emerald concepts sleep furiously
  – colorless immature thoughts nap angrily
Unigram precision = 4/5
49
Automatic Evaluation Example
Bleu Metric
Test Sentence
  – colorless green ideas sleep furiously
Gold Standard References
  – all dull jade ideas sleep irately
  – drab emerald concepts sleep furiously
  – colorless immature thoughts nap angrily
Unigram precision = 4 / 5 = 0.8
Bigram precision = 2 / 4 = 0.5
$\text{Bleu Score} = (a_1 a_2 \cdots a_n)^{1/n} = (0.8 \times 0.5)^{1/2} = 0.6325$, reported as 63.25, where $a_i$ is the $i$-gram precision.
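The arithmetic on this slide can be checked with a short sketch. It computes clipped ("modified") n-gram precision against the three references and takes the geometric mean over unigrams and bigrams, exactly as in the worked example. The function names are illustrative, and the length penalty mentioned on the Bleu overview slide is left out because the slide's calculation does not apply one.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: an n-gram is credited at most as many
    times as it appears in the single reference that contains it most."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for gram, c in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[gram]) for gram, c in cand_counts.items())
    return clipped / sum(cand_counts.values())


def slide_bleu(candidate, references, max_n=2):
    """Geometric mean of the 1..max_n precisions (no length penalty)."""
    product = 1.0
    for n in range(1, max_n + 1):
        product *= modified_precision(candidate, references, n)
    return product ** (1.0 / max_n)


test = "colorless green ideas sleep furiously".split()
refs = [
    "all dull jade ideas sleep irately".split(),
    "drab emerald concepts sleep furiously".split(),
    "colorless immature thoughts nap angrily".split(),
]
print(modified_precision(test, refs, 1))   # unigram precision, 4 of 5 matched
print(modified_precision(test, refs, 2))   # bigram precision, 2 of 4 matched
print(round(slide_bleu(test, refs), 4))    # 0.6325
```

Only "green" fails to match at the unigram level, and only "ideas sleep" and "sleep furiously" match at the bigram level, which is the synonym bias the overview slide warns about.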