AMTEXT: Extraction-based MT for Arabic
Faculty: Alon Lavie, Jaime Carbonell
Students and Staff: Laura Kieras, Peter Jansen
Informant: Loubna El Abadi
DoD KDL Visit, February 23, 2005
Goals and Approach

Analysts are often looking for limited, concrete information within the text, so full MT may not be necessary. Alternative: rather than doing full MT followed by extraction, first extract and then translate only the extracted information. But how do we extract just the relevant parts in the source language?

The AMTEXT approach:
–learn extraction patterns and their translations from small amounts of human-translated and aligned data
–combine these with broad-coverage Named-Entity translation lexicons
–system output: a translation of the extracted information plus a structured representation
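A minimal sketch of this extract-then-translate idea in Python; the pattern, the tiny lexicons, and the function name are invented here purely for illustration and are not AMTEXT's actual resources:

import re

# Toy resources, assumed for illustration only.
NE_LEXICON = {"ErfAt": "Arafat", "byryz": "Peres"}      # NE translations
MEET_LEXICON = {"yltqy": "will meet with"}              # action translations

# Toy source-side extraction pattern: PERSON MEET-V PERSON
PATTERN = re.compile(r"(\w+)\s+(yltqy)\s+(\w+)")

def extract_then_translate(source):
    m = PATTERN.search(source)
    if m is None:
        return None                      # nothing of interest: skip translation entirely
    p1, verb, p2 = m.groups()
    if p1 not in NE_LEXICON or p2 not in NE_LEXICON:
        return None
    translation = f"{NE_LEXICON[p1]} {MEET_LEXICON[verb]} {NE_LEXICON[p2]}"
    template = {"P1": NE_LEXICON[p1], "ACTION": "meet", "P2": NE_LEXICON[p2]}
    return translation, template         # translation + structured representation

print(extract_then_translate("ErfAt yltqy byryz"))
# ('Arafat will meet with Peres', {'P1': 'Arafat', 'ACTION': 'meet', 'P2': 'Peres'})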
AMTEXT Extraction-based MT

[Architecture diagram] Word-aligned elicited data feeds a Learning Module, which produces transfer rules such as:

S::S [NE-P pagash et NE-P TE] -> [NE-P met with NE-P TE]
((X1::Y1) (X4::Y4) (X5::Y5))

together with a word translation lexicon. At run time, the Extract Transfer System (a partial parser and transfer engine, backed by an NE translation lexicon) matches the source text; a post-processor with an extractor produces the extracted target text and a filled template.
Elicitation Example
Learning Extraction Translation Patterns

Elicited example:
Sharon nifgash hayom im bush
Sharon met with Bush today

After generalization (keeping the literal correspondence im = with):
[PERSON MEET-V TE im PERSON]
[PERSON MEET-V with PERSON TE]

Resulting learned pattern rule:
S::S : [PERSON MEET-V TE im PERSON] -> [PERSON MEET-V with PERSON TE]
( (X1::Y1) (X2::Y2) (X3::Y5) (X5::Y4))
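A small sketch of how such a pattern could be generalized from a single word-aligned example; the category table and list representation are assumptions made for illustration:

# Elicited example with 1-based word alignments (src, tgt).
SOURCE = ["Sharon", "nifgash", "hayom", "im", "bush"]
TARGET = ["Sharon", "met", "with", "Bush", "today"]
ALIGNMENT = [(1, 1), (2, 2), (3, 5), (4, 3), (5, 4)]

# Assumed word -> category table used for generalization.
CATEGORIES = {"sharon": "PERSON", "bush": "PERSON",
              "nifgash": "MEET-V", "met": "MEET-V",
              "hayom": "TE", "today": "TE"}

def generalize(tokens):
    # Replace known content words with categories; keep function words
    # such as 'im'/'with' as literals in the pattern.
    return [CATEGORIES.get(t.lower(), t) for t in tokens]

src_pattern = generalize(SOURCE)   # ['PERSON', 'MEET-V', 'TE', 'im', 'PERSON']
tgt_pattern = generalize(TARGET)   # ['PERSON', 'MEET-V', 'with', 'PERSON', 'TE']

# Keep alignments only for the generalized (non-literal) positions.
aligns = [(x, y) for (x, y) in ALIGNMENT if src_pattern[x - 1] != SOURCE[x - 1]]
print(src_pattern, "->", tgt_pattern)
print(aligns)   # [(1, 1), (2, 2), (3, 5), (5, 4)] -- matches the rule above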
Transfer Rule Formalism

–Type information
–Part-of-speech/constituent information
–Alignments
–x-side constraints
–y-side constraints
–xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))

; SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
  (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
  ((X1 AGR) = *3-SING)
  ((X1 DEF) = *DEF)
  ((X3 AGR) = *3-SING)
  ((X3 COUNT) = +)
  ((Y1 DEF) = *DEF)
  ((Y3 DEF) = *DEF)
  ((Y2 AGR) = *3-SING)
  ((Y2 GENDER) = (Y4 GENDER))
)
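One plausible way to represent such a rule and verify its x-side constraints against analyzed source tokens; the dictionary representation below is an assumption for illustration, not the engine's actual internal format:

# The NP rule above, as plain data.
RULE = {
    "type": ("NP", "NP"),
    "x_side": ["DET", "ADJ", "N"],
    "y_side": ["DET", "N", "DET", "ADJ"],
    "alignments": [(1, 1), (1, 3), (2, 4), (3, 2)],
    "x_constraints": [("X1", "AGR", "*3-SING"), ("X1", "DEF", "*DEF"),
                      ("X3", "AGR", "*3-SING"), ("X3", "COUNT", "+")],
}

def x_constraints_hold(rule, x_features):
    # x_features[i] is the feature dict of the i-th matched source token.
    for var, feat, value in rule["x_constraints"]:
        idx = int(var[1:]) - 1           # "X3" -> index 2
        if x_features[idx].get(feat) != value:
            return False
    return True

# "the old man": DET ADJ N with the required features.
tokens = [{"AGR": "*3-SING", "DEF": "*DEF"},   # the
          {},                                  # old
          {"AGR": "*3-SING", "COUNT": "+"}]    # man
print(x_constraints_hold(RULE, tokens))        # True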
The Transfer Engine

Analysis: source text is parsed into its grammatical structure, which determines the ordering of transfer rule application.
Example: 他 看 书。 (he read book)
(S (NP (N 他)) (VP (V 看) (NP (N 书))))

Transfer: a target-language tree is created by reordering, insertion, and deletion; source words are translated with the transfer lexicon. Here the article "a" is inserted into the object NP:
(S (NP (N he)) (VP (V read) (NP (DET a) (N book))))

Generation: target-language constraints are checked and the final translation is produced, e.g. "reads" is chosen over "read" to agree with "he".
Final translation: "He reads a book"
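The three stages can be mimicked in a few lines for this one example; the hard-coded lexicon and agreement rule are illustrative stand-ins for the engine's grammar:

LEXICON = {"他": "he", "看": "read", "书": "book"}

def analyze(source):
    # Analysis: assign the flat S -> NP(subj) V NP(obj) structure.
    subj, verb, obj = source.split()
    return {"subj": subj, "verb": verb, "obj": obj}

def transfer(tree):
    # Transfer: translate words; insert the article "a" into the object NP.
    return {"subj": LEXICON[tree["subj"]],
            "verb": LEXICON[tree["verb"]],
            "obj": ["a", LEXICON[tree["obj"]]]}

def generate(tree):
    # Generation: enforce subject-verb agreement ("he" -> "reads").
    verb = tree["verb"] + "s" if tree["subj"] == "he" else tree["verb"]
    return " ".join([tree["subj"].capitalize(), verb] + tree["obj"])

print(generate(transfer(analyze("他 看 书"))))   # He reads a book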
Partial Parsing

Input: full text in the foreign language
Output: translation of the extracted/matched text
Goal: extract by effectively matching transfer rules against the full text
–Identify/parse NEs and words in a restricted vocabulary
–Identify transfer-rule (source-side) patterns
–The Transfer Engine produces a complete lattice of transfer translations

Example (NE-P and TE elements anchor the match; the intervening apposition "meluve b-sar ha-xuc shalom", roughly "accompanied by Foreign Minister Shalom", is skipped):
Sharon, meluve b-sar ha-xuc shalom, yipagesh im bush hayom
=> Sharon will meet with Bush today
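A sketch of pattern matching with skipping; encoding the skip marker as a lazy regex wildcard is an assumption made for illustration:

import re

NE = {"sharon": "Sharon", "bush": "Bush"}
MEET_V = {"yipagesh": "will meet"}
TE = {"hayom": "today"}

# NE-P <skip> MEET-V 'im' NE-P TE -- '.*?' plays the skipping marker.
PATTERN = re.compile(r"(\w+)\b.*?\b(yipagesh) im (\w+) (\w+)")

text = "Sharon, meluve b-sar ha-xuc shalom, yipagesh im bush hayom"
m = PATTERN.search(text)
if m:
    p1, v, p2, te = (g.lower() for g in m.groups())
    print(f"{NE[p1]} {MEET_V[v]} with {NE[p2]} {TE[te]}")
# Sharon will meet with Bush today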
Post-Processing

Translation Selection Module:
–select the most complete and coherent translation from the lattice, based on scoring heuristics
Structure Extraction:
–extract the translated entities from the pattern and display them in a structured table format
Output Display:
–Perl scripts construct an HTML page for displaying the complete translation results
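The structure-extraction step might look like this regex sketch over the transfer parse string from the lattice; the regexes and the role names in the returned template are assumptions for illustration:

import re

FRAME = ('((S,7 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") '
         '(PERSON,1 (PNAME,1 "Peres") ) (LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) '
         '(TE,4 (LITERAL "on")(DAY,0 "Monday") ) ) )')

def fill_template(frame):
    persons = re.findall(r'PNAME,\d+ "([^"]+)"', frame)
    verb = re.search(r'MEET-V,\d+ "([^"]+)"', frame)
    place = re.search(r'PLACE,\d+ "([^"]+)"', frame)
    day = re.search(r'DAY,\d+ "([^"]+)"', frame)
    return {"P1": persons[0] if persons else None,
            "P2": persons[1] if len(persons) > 1 else None,
            "V": verb.group(1) if verb else None,
            "LOC": place.group(1) if place else None,
            "TE": day.group(1) if day else None}

print(fill_template(FRAME))
# {'P1': 'Arafat', 'P2': 'Peres', 'V': 'will meet with', 'LOC': 'Brussels', 'TE': 'Monday'}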
Translation Selection Module: Features

Goal: a scoring function that can identify the most likely best match.
Lattice-arc features from the transfer engine:
–matched range of the source
–matched parts of the target
–transfer score
–partial parse
Lattice Example

Arafat to meet Peres in Brussels on Monday
ErfAt yltqy byryz msAA AlAvnyn fy brwksl

(1 1 "Arafat" 3 "ErfAt" "(PNAME,0 "Arafat")")
(2 2 "will meet with" 3 "yltqy" "(MEET-V,5 "will meet with")")
(3 3 "Peres" 3 "byryz" "(PNAME,1 "Peres")")
(1 3 "Arafat will meet with Peres" 3 "ErfAt yltqy byryz" "((S,11 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) ) )")
(4 4 "msAA" 3 "msAA" "(UNK,0 "msAA")")
(5 5 "Monday" 3 "AlAvnyn" "(DAY,0 "Monday")")
(4 5 "on Monday" 2.9 "msAA AlAvnyn" "((TE,4 (LITERAL "on")(DAY,0 "Monday") ) )")
(1 5 "Arafat will meet with Peres on Monday" 3.2 "ErfAt yltqy byryz msAA AlAvnyn" "((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (TE,4 (LITERAL "on")(DAY,0 "Monday") ) ) )")
(1 5 "Arafat will meet with Peres Monday" 3.1 "ErfAt yltqy byryz msAA AlAvnyn" "((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (TE,5 (DAY,0 "Monday") ) ) )")
(6 6 "fy" 3 "fy" "(UNK,2 "fy")")
(7 7 "Brussels" 3 "brwksl" "(PLACE,0 "Brussels")")
(6 7 "in Brussels" 2.9 "fy brwksl" "((LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) )")
(1 7 "Arafat will meet with Peres in Brussels on Monday" 3.4 "ErfAt yltqy byryz msAA AlAvnyn fy brwksl" "((S,7 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) (TE,4 (LITERAL "on")(DAY,0 "Monday") ) ) )")
(1 7 "Arafat will meet with Peres in Brussels Monday" 3.3 "ErfAt yltqy byryz msAA AlAvnyn fy brwksl" "((S,7 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) (TE,5 (DAY,0 "Monday") ) ) )")
(1 7 "Arafat will meet with Peres in Brussels" 3.2 "ErfAt yltqy byryz msAA AlAvnyn fy brwksl" "((S,8 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) ) )")
Example: Extracting Features

Arc: (1 5 "Arafat will meet with Peres Monday" 3.1 "ErfAt yltqy byryz msAA AlAvnyn" "((S,9 ...))")

(1) Length (tokens) of the matched source range (ar): 1 to 5
(2) Length of the translation segment: "Arafat will meet with Peres Monday"
(3) Transfer engine score: 3.1
(4) Length of the matched source segment: "ErfAt yltqy byryz msAA AlAvnyn"
(5) Transfer structure: full frame (S) or not? "((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (TE,5 (DAY,0 "Monday") ) ) )"
(6) Secondary feature: relative length of (2) over (4); the smaller, the more concise the source-language match (less extraneous material, i.e. less chance of mistranslation)
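Computing the six features from one arc, with the arc stored here as a plain tuple (the real arcs are the parenthesized records on the previous slide):

arc = (1, 5,                                        # matched source range
       "Arafat will meet with Peres Monday",        # translation segment
       3.1,                                         # transfer engine score
       "ErfAt yltqy byryz msAA AlAvnyn",            # matched source segment
       '((S,9 (PERSON,1 (PNAME,0 "Arafat") ) ... (TE,5 (DAY,0 "Monday") ) ) )')

def features(arc):
    start, end, trans, score, source, structure = arc
    f1 = end - start + 1                 # (1) matched source range, in tokens
    f2 = len(trans.split())              # (2) translation length, in tokens
    f3 = score                           # (3) transfer engine score
    f4 = len(source.split())             # (4) source segment length, in tokens
    f5 = 1 if structure.startswith("((S") else 0   # (5) full S frame or not
    f6 = f2 / f4                         # (6) ratio of (2) over (4): smaller = more concise
    return [f1, f2, f3, f4, f5, f6]

print(features(arc))   # [5, 6, 3.1, 5, 1, 1.2]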
Selecting Best Translation

For each parse P_j in the lattice, calculate a score S_j based on features f_i with weight coefficients w_i, as follows:

S_j = sum_i w_i * f_i(P_j)

The weights w_i are trained by hill climbing against a training set with manual reference parses.
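A sketch of the scoring function and the hill-climbing loop; the training-set format, step size, and acceptance rule below are assumptions, not the system's documented settings:

import random

def score(weights, feats):
    # S_j = sum_i w_i * f_i(P_j)
    return sum(w * f for w, f in zip(weights, feats))

def accuracy(weights, training_set):
    # Fraction of lattices whose top-scoring arc is the manual reference arc.
    correct = 0
    for arcs, reference in training_set:       # arcs: list of feature vectors
        best = max(range(len(arcs)), key=lambda j: score(weights, arcs[j]))
        correct += (best == reference)
    return correct / len(training_set)

def hill_climb(training_set, n_features, steps=1000, delta=0.1):
    weights = [1.0] * n_features
    best = accuracy(weights, training_set)
    for _ in range(steps):
        candidate = list(weights)
        candidate[random.randrange(n_features)] += random.choice([-delta, delta])
        acc = accuracy(candidate, training_set)
        if acc >= best:                         # keep moves that do not hurt
            weights, best = candidate, acc
    return weights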
"Proof-of-Concept" System

–Arabic-to-English newswire text (available from TIDES)
–Very limited set of actions: (X meet Y)
–Limited collection of translation patterns
–Limited vocabulary and NE lexicon
System Development

–Training corpus of 535 short sentences translated and aligned by a bilingual informant:
  –258 simple meeting sentences
  –120 temporal expressions
  –105 location expressions
  –52 title expressions
–Translation lexicon of Named Entities (person names, organizations, and locations), converted from Fei Huang's NE translation/transliteration work
–Pattern generalizations semi-automatically "learned" from the training data
–Patterns manually enhanced with "skipping markers"
–Initial system integrated
–Development with the informant on 74-sentence dev data
Resulting System

Transfer grammar contains:
–21 transfer pattern rules
–12 meet-verb rules
–4/17/11/17 Person/TE/LOC/PTitle "high-level" rules
Transfer lexicon contains 3,070 entries (mostly names and locations)
Estimated development effort/time:
–~20 hours with the informant
–~50 hours of lexical and rule development
Evaluation

–Development set of 74 sentences
–Test set of 76 unseen sentences with meeting information
–Identified the subset of each set on which meeting patterns could potentially apply ("Good"):
  –53 development sentences
  –44 test sentences
Evaluation

Translation-based:
–Unigram token-based retrieval metrics: precision / recall / F1
Entity-based:
–Recall for each role in the meeting frame (V, P1, P2, LOC, and TE)
–Partial recall credit for partial matches
–Partial credit (50%) for P1/P2 role interchange
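Both metrics are simple to state in code. The clipping of repeated tokens and the handling of empty roles below are assumed conventions where the slide leaves them open, and partial-match credit within a single role is omitted for brevity:

from collections import Counter

def unigram_prf(hypothesis, reference):
    hyp, ref = Counter(hypothesis.split()), Counter(reference.split())
    overlap = sum((hyp & ref).values())          # clipped token matches
    p = overlap / max(sum(hyp.values()), 1)
    r = overlap / max(sum(ref.values()), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def role_recall(hyp_frame, ref_frame):
    # Per-role recall over the meeting frame; 50% credit for a P1/P2 swap.
    scores = {}
    for role in ("V", "P1", "P2", "LOC", "TE"):
        hyp, ref = hyp_frame.get(role), ref_frame.get(role)
        swapped = ref_frame.get("P2" if role == "P1" else "P1")
        if ref is not None and hyp == ref:
            scores[role] = 1.0
        elif role in ("P1", "P2") and hyp is not None and hyp == swapped:
            scores[role] = 0.5
        else:
            scores[role] = 0.0
    return scores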
Evaluation Results
Demonstration
Conclusions

–An attractive methodology for joint extraction and translation of Essential Elements of Information (EEIs) from full foreign-language texts
–Rapid development: circumvents the need to develop high-quality full MT or high-quality IE technology for the foreign source language
–Effective use of bilingual informants
–Main open question: scalability
  –Can this methodology be effective with much broader and more complex types of extracted EEIs?
  –Is automatic learning of generalized patterns feasible and effective in such more complex scenarios?
  –Can the selection heuristics effectively cope with the vast amounts of ambiguity expected in a large-scale system?