AMTEXT: Extraction-based MT for Arabic
Faculty: Alon Lavie, Jaime Carbonell
Students and Staff: Laura Kieras, Peter Jansen
Informant: Loubna El Abadi
Sep 21, 2004, CACI Visit
Goals and Approach
Analysts are often looking for limited, concrete information within the text, so full MT may not be necessary
Alternative: rather than full MT followed by extraction, first extract and then translate only the extracted information
AMTEXT approach:
–Learn extraction patterns and their translations from small amounts of human-translated and aligned data
–Combine with broad-coverage Named-Entity translation lexicons
–System output: translation of extracted information + a structured representation
AMTEXT Extraction-based MT
[Figure: system architecture. Components: Word-aligned elicited data; Learning Module; Transfer Rules; Word Translation Lexicon; Run Time Extract Transfer System; Partial Parser & Transfer Engine; NE Translation Lexicon; Source Text; Extracted Target Text; Post-processor Extractor; Filled Template]
Example transfer rule:
S::S [NE-P pagash et NE-P TE] -> [NE-P met with NE-P TE]
((X1::Y1) (X4::Y4) (X5::Y5))
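To make the rule notation concrete, here is a minimal Python sketch of how a rule of this shape could be read and applied. It assumes that `Xi::Yj` aligns source position i with target position j, and that unaligned target positions are emitted literally. The function names, the toy lexicon, and the sample words are hypothetical; this is not the AMTEXT transfer engine.

```python
import re

def parse_rule(rule_text):
    """Split 'S::S [source] -> [target] ((Xi::Yj) ...)' into its parts."""
    m = re.match(
        r"\s*(\w+)::(\w+)\s*\[(.*?)\]\s*->\s*\[(.*?)\]\s*\((.*)\)\s*$",
        rule_text,
    )
    if m is None:
        raise ValueError("unrecognized rule format")
    _src_type, _tgt_type, source, target, align = m.groups()
    # Map each aligned target position to its source position (both 1-based).
    alignments = {int(y): int(x) for x, y in re.findall(r"X(\d+)::Y(\d+)", align)}
    return {"source": source.split(), "target": target.split(), "alignments": alignments}

def apply_rule(rule, matched, translate):
    """Build the target string from the source constituents matched by the rule.

    `matched` holds one source string per source-side position.
    `translate` is any callable that translates a single constituent.
    """
    if len(matched) != len(rule["source"]):
        raise ValueError("need one matched constituent per source position")
    out = []
    for j, token in enumerate(rule["target"], start=1):
        if j in rule["alignments"]:
            # Aligned slot: translate the corresponding source constituent.
            out.append(translate(matched[rule["alignments"][j] - 1]))
        else:
            # Unaligned slot: the target side supplies the literal word.
            out.append(token)
    return " ".join(out)

RULE = "S::S [NE-P pagash et NE-P TE] -> [NE-P met with NE-P TE] ((X1::Y1) (X4::Y4) (X5::Y5))"

# Hypothetical toy lexicon covering named entities and one temporal expression.
TOY_LEXICON = {"sharon": "Sharon", "bush": "Bush", "etmol": "yesterday"}

rule = parse_rule(RULE)
result = apply_rule(rule, ["sharon", "pagash", "et", "bush", "etmol"], TOY_LEXICON.get)
print(result)
```

In this reading, only the two NE-P slots and the TE slot go through a lexicon; "met with" comes from the rule itself, which is why the rule needs no alignment for positions 2 and 3.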
Elicitation Example
Partial Parsing
Input: Full text in the foreign language
Output: Translation of extracted/matched text
Goal: Extract by effectively matching transfer rules with the full text
–Identify/parse NEs and words in restricted vocabulary
–Identify transfer-rule (source-side) patterns
–Handle expected high levels of ambiguity
Example:
Sharon, meluve b-sar ha-xuc shalom, yipagesh im bush hayom
Sharon will meet with Bush today
Labeled constituents: NE-P, TE
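The example above can be sketched as matching a source-side pattern against the full sentence while skipping material the pattern does not cover. The sketch below assumes unrestricted gaps between pattern elements and returns only the leftmost match; it does not rank alternatives, so it does not address the ambiguity handling listed above. The lexicons, the pattern, and the English template are hypothetical stand-ins, not the system's actual data.

```python
import re

# Hypothetical toy lexicons, for illustration only.
NE_LEXICON = {"sharon": "Sharon", "bush": "Bush"}
TE_LEXICON = {"hayom": "today"}

def tag(token):
    """Label a token as NE-P or TE if a lexicon knows it, else return the word itself."""
    if token in NE_LEXICON:
        return "NE-P"
    if token in TE_LEXICON:
        return "TE"
    return token

def match_with_gaps(pattern, tokens):
    """Return the token indices matching `pattern` in order, or None.

    Tokens that match no pattern element are skipped, so the pattern
    is matched as a subsequence of the sentence (leftmost match only).
    """
    positions = []
    start = 0
    for element in pattern:
        for i in range(start, len(tokens)):
            if tag(tokens[i]) == element:
                positions.append(i)
                start = i + 1
                break
        else:
            return None  # this pattern element was not found
    return positions

sentence = "Sharon, meluve b-sar ha-xuc shalom, yipagesh im bush hayom"
tokens = re.findall(r"[\w-]+", sentence.lower())

# Hypothetical source-side pattern for the "will meet with" construction.
pattern = ["NE-P", "yipagesh", "im", "NE-P", "TE"]
positions = match_with_gaps(pattern, tokens)
extracted = [tokens[i] for i in positions]

# Translate only the extracted material, using a hypothetical target template.
subj, _, _, obj, when = extracted
translation = f"{NE_LEXICON[subj]} will meet with {NE_LEXICON[obj]} {TE_LEXICON[when]}"
print(extracted)
print(translation)
```

The appositive "meluve b-sar ha-xuc shalom" is never translated: it falls in a gap between the first NE-P and the verb, which is the point of extracting before translating.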
“Proof-of-Concept” System
[funded by small year-0 ITIC/REFLEX]
Arabic-to-English
Newswire text (available from TIDES)
Limited set of actions: (X meet Y)
Limited translation patterns:
– * *
Limited vocabulary and NE lexicon
Demonstration
Integration Technical Issues
Components:
–Converter of Arabic to “Darwish” representation and pre-processor (scripts)
–Transfer Engine (C/C++)
–Post-processor extractor (Perl scripts)
Input: Arabic text in UTF-8
Output: formatted HTML page
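As a rough illustration of the output stage, the following Python sketch renders a translated extract and a filled template as an HTML page. The actual post-processor is a set of Perl scripts whose output layout is not described here, so the function name, the slot names, and the page structure are all assumptions.

```python
from html import escape

def render_html(translation, template):
    """Render the extracted translation and its filled template as an HTML page.

    `template` maps slot names to filler strings (hypothetical slot names).
    All text is escaped so that lexicon output cannot break the markup.
    """
    rows = "\n".join(
        f"<tr><th>{escape(slot)}</th><td>{escape(value)}</td></tr>"
        for slot, value in template.items()
    )
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8"><title>Extraction result</title></head>\n'
        "<body>\n"
        f"<p>{escape(translation)}</p>\n"
        f"<table>\n{rows}\n</table>\n"
        "</body></html>\n"
    )

page = render_html(
    "Sharon will meet with Bush today",
    {"action": "meet", "X": "Sharon", "Y": "Bush", "time": "today"},
)
print(page)
```

Declaring `charset="utf-8"` in the page keeps it consistent with the UTF-8 input if source-language text is displayed next to the translation.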