Course Summary LING 575 Fei Xia 03/06/07
Outline
Introduction to MT: 1
Major approaches:
– SMT: 3
– Transfer-based MT: 2
– Hybrid systems: 2
Other topics
Introduction to MT
Major challenges
Translation is hard.
Getting the right words:
– Choosing the correct root form
– Getting the correct inflected form
– Inserting “spontaneous” words
Putting the words in the correct order:
– Word order: SVO vs. SOV, …
– Unique constructions
– Divergence
Lexical choice
Homonymy/polysemy: bank, run
Concept gap: no corresponding concept in the other language: go Greek, go Dutch, fen sui, lame duck, …
Coding (concept → lexeme mapping) differences:
– More distinctions in one language: e.g., kinship vocabulary
– Different division of conceptual space
Major approaches
– Transfer-based
– Interlingua
– Example-based (EBMT)
– Statistical MT (SMT)
– Hybrid approach
The MT triangle
[Figure: the Vauquois triangle. Analysis climbs from words toward meaning on the source side; synthesis descends on the target side. Word-based SMT and EBMT operate at the word level; phrase-based SMT, EBMT, and transfer-based MT at intermediate levels; interlingua systems at the meaning level.]
Comparison of resource requirements

                    Transfer-based   Interlingua   EBMT     SMT
dictionary                +               +          +
transfer rules            +
parser                    +               +        + (?)
semantic analyzer                         +
parallel data                                        +        +

Others: interlingua also needs a universal representation and a generator; EBMT also needs a thesaurus.
Evaluation
Unlike many NLP tasks (e.g., tagging, chunking, parsing, IE, pronoun resolution), there is no single gold standard for MT.
Human evaluation: accuracy, fluency, …
– Problem: expensive, slow, subjective, non-reusable
Automatic measures:
– Edit distance
– Word error rate (WER), position-independent WER (PER)
– Simple string accuracy (SSA), generation string accuracy (GSA)
– BLEU
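The edit-distance family of metrics above can be sketched in a few lines. This is a minimal illustration, not any toolkit's reference implementation: WER here is the word-level Levenshtein distance divided by the reference length (a single reference is assumed).

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two token sequences."""
    d = [[0] * (len(ref) + 1) for _ in range(len(hyp) + 1)]
    for i in range(len(hyp) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(hyp) + 1):
        for j in range(1, len(ref) + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(hyp)][len(ref)]

def wer(hyp, ref):
    """Word error rate: edit distance normalized by reference length."""
    return edit_distance(hyp, ref) / len(ref)
```

PER is computed the same way but ignores word order; BLEU instead rewards matching n-grams against one or more references.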
Major approaches
Word-based SMT
IBM Models 1–5
Main concepts:
– Source-channel model
– Hidden word alignment
– EM training
Source-channel model for MT
Eng sent E → noisy channel → Fr sent F
Decoding: E* = argmax_E P(E) × P(F | E)
Two types of parameters:
– Language model: P(E)
– Translation model: P(F | E)
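The source-channel decision rule just picks the English sentence maximizing P(E)·P(F|E). A toy sketch over an explicit candidate list — every number and phrase in the tables below is invented for illustration:

```python
# Toy noisy-channel decoder.  lm plays the role of P(E), tm of P(F | E);
# both tables are made up for this example.
lm = {"the house": 0.6, "house the": 0.1}          # language model P(E)
tm = {("la maison", "the house"): 0.5,             # translation model P(F | E)
      ("la maison", "house the"): 0.5}

def decode(f, candidates):
    """Return the candidate E maximizing P(E) * P(F | E)."""
    return max(candidates, key=lambda e: lm[e] * tm[(f, e)])

best = decode("la maison", ["the house", "house the"])
```

Note how the translation model alone cannot distinguish the two word orders; the language model breaks the tie, which is exactly the division of labor the source-channel factorization is after.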
Modeling P(F | E) with alignment
P(F | E) = Σ_A P(F, A | E), where A is a hidden word alignment
Modeling
Parameters:
– Length prob: P(m | l)
– Translation prob: t(f_j | e_i)
– Distortion prob (for Model 2): d(i | j, m, l)
Model 1: P(F | E) = P(m | l) / (l+1)^m × Π_{j=1}^{m} Σ_{i=0}^{l} t(f_j | e_i)
Model 2: P(F | E) = P(m | l) × Π_{j=1}^{m} Σ_{i=0}^{l} t(f_j | e_i) d(i | j, m, l)
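The Model 1 formula can be computed directly. A minimal sketch, assuming t is a dict of translation probabilities, e_sent includes the NULL word at position 0, and the length probability P(m | l) is passed in as a constant:

```python
def model1_prob(f_sent, e_sent, t, p_len=1.0):
    """IBM Model 1: P(F|E) = P(m|l) / (l+1)^m * prod_j sum_i t(f_j | e_i).

    e_sent must include the NULL word at position 0; p_len stands in
    for the length probability P(m | l)."""
    l = len(e_sent) - 1          # English length, excluding NULL
    m = len(f_sent)
    prob = p_len / (l + 1) ** m
    for f in f_sent:
        prob *= sum(t.get((f, e), 0.0) for e in e_sent)
    return prob
```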
Training
Model 1 (EM):
– E-step: given the current t, collect expected counts over each sentence pair (F, E):
  c(f | e; F, E) = [t(f | e) / (t(f | e_0) + … + t(f | e_l))] × count(f in F) × count(e in E)
– M-step: renormalize:
  t(f | e) = Σ_s c(f | e; F_s, E_s) / Σ_f Σ_s c(f | e; F_s, E_s)
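The EM loop for Model 1 can be sketched as follows. This is an illustrative sketch, not a production aligner: the NULL word is omitted for brevity and t is initialized uniformly.

```python
from collections import defaultdict

def train_model1(bitext, n_iter=10):
    """EM training of IBM Model 1 translation probs t(f | e).

    bitext: list of (f_tokens, e_tokens) pairs.  NULL word omitted
    for brevity; t starts uniform over the French vocabulary."""
    f_vocab = {f for fs, _ in bitext for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialization
    for _ in range(n_iter):
        count = defaultdict(float)   # expected counts c(f, e)
        total = defaultdict(float)   # expected counts c(e)
        for fs, es in bitext:        # E-step
            for f in fs:
                z = sum(t[(f, e)] for e in es)    # normalizer over alignments
                for e in es:
                    c = t[(f, e)] / z             # posterior P(a_j = i | F, E)
                    count[(f, e)] += c
                    total[e] += c
        # M-step: t(f | e) = c(f, e) / c(e)
        t = defaultdict(float, {(f, e): count[(f, e)] / total[e]
                                for (f, e) in count})
    return t
```

On a tiny bitext such as ("la maison" / "the house") plus ("la" / "the"), the counts quickly concentrate on la↔the and maison↔house, which is the usual classroom demonstration of how EM resolves the hidden alignments.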
Finding the best alignment
Given E and F, we are looking for A* = argmax_A P(A | F, E)
Model 1: a_j* = argmax_i t(f_j | e_i), independently for each position j
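Because Model 1 has no distortion term, the best alignment decomposes position by position, so no search is needed. A sketch, assuming t is a dict of translation probabilities as above:

```python
def best_alignment(f_sent, e_sent, t):
    """Viterbi alignment under Model 1: a_j = argmax_i t(f_j | e_i),
    chosen independently for each French position j."""
    return [max(range(len(e_sent)), key=lambda i: t.get((f, e_sent[i]), 0.0))
            for f in f_sent]
```

For Model 2 the same per-position argmax works, with d(i | j, m, l) multiplied in; for the fertility-based Models 3–5 the argmax no longer decomposes and heuristic search is required.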
Clump-based SMT
The unit of translation is a clump.
Training stage:
– Word alignment
– Extracting clump pairs
Decoding stage:
– Try all segmentations of the source sentence and all allowed permutations
– For each source clump, try the top-N target clumps
– Prune the hypotheses
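The decoding loop above can be illustrated with a toy monotone version: no permutations, no language model, and exhaustive search instead of pruning (a sentence of n words already has 2^(n-1) segmentations, which is why real decoders prune). The clump table below is invented for illustration.

```python
def segmentations(words):
    """Enumerate all ways to split a sentence into contiguous clumps."""
    if not words:
        yield []
        return
    for k in range(1, len(words) + 1):
        head = tuple(words[:k])
        for rest in segmentations(words[k:]):
            yield [head] + rest

def monotone_decode(src, table):
    """Toy monotone clump decoder: try every segmentation, translate each
    source clump with its best-scoring target clump, keep the best product."""
    best, best_score = None, 0.0
    for seg in segmentations(src):
        if not all(c in table for c in seg):
            continue
        score, out = 1.0, []
        for c in seg:
            tgt, p = max(table[c], key=lambda x: x[1])
            score *= p
            out.append(tgt)
        if score > best_score:
            best, best_score = " ".join(out), score
    return best
```

Note that the multi-word clump ("la maison" → "the house", 0.6) beats translating word by word (0.4 × 0.5 = 0.2), which is the basic argument for clump-based over word-based translation.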
Transfer-based MT
Analysis, transfer, generation:
– Example: (Quirk et al., 2005)
  1. Parse the source sentence
  2. Transform the parse tree with transfer rules
  3. Translate source words
  4. Get the target sentence from the tree
Translation as parsing:
– Example: (Wu, 1995)
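The transfer step (step 2) can be illustrated with a toy rule; the tree encoding and the single SVO→SOV reordering rule here are invented for illustration and are not from any cited system.

```python
# Trees are (label, children) where children is either a list of subtrees
# or, at a leaf, the word string itself: ("V", "ate").
def transfer(tree):
    """Toy transfer rule for an SVO -> SOV language pair:
    inside a VP, move the object NP in front of the verb."""
    label, kids = tree
    if isinstance(kids, str):                     # leaf: (POS, word)
        return tree
    kids = [transfer(k) for k in kids]            # transform bottom-up
    if label == "VP" and len(kids) == 2 and kids[0][0] == "V" and kids[1][0] == "NP":
        kids = [kids[1], kids[0]]                 # V NP -> NP V
    return (label, kids)

def yield_words(tree):
    """Read the word sequence off a tree (step 4, before lexical choice)."""
    label, kids = tree
    if isinstance(kids, str):
        return [kids]
    return [w for k in kids for w in yield_words(k)]
```

Steps 3 and 4 would then replace each source word with a target word and read off the reordered string.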
Hybrid approaches
– Preprocessing with transfer rules: (Xia and McCord, 2004), (Collins et al., 2005)
– Postprocessing with taggers, parsers, etc.: JHU 2003 workshop
– Hierarchical phrase-based model: (Chiang, 2005)
– …
Other topics
Other issues
Resources:
– MT for low-density languages
– Using comparable corpora and Wikipedia
Special translation modules:
– Identifying and translating named entities and abbreviations
– …
To build an MT system (1)
Gather resources:
– Parallel corpora, comparable corpora
– Grammars, dictionaries, …
Process data:
– Document alignment, sentence alignment
– Tokenization, parsing, …
To build an MT system (2)
Modeling
Training:
– Word alignment and extracting clump pairs
– Learning transfer rules
Decoding:
– Identifying entities and translating them with special modules (optional)
– Translation as parsing, or parse + transfer + translation
– Segmenting the source sentence, replacing source clumps with target clumps, …
To build an MT system (3)
Post-processing:
– System combination
– Reranking
Using the system for other applications:
– Cross-lingual IR
– Computer-assisted translation
– …
Misc
Grades:
– Assignments (hw1–hw3): 30%
– Class participation: 20%
– Project:
  – Presentation: 25%
  – Final paper: 25%