AMTEXT: Extraction-based MT for Arabic
Faculty: Alon Lavie, Jaime Carbonell
Students and Staff: Laura Kieras, Peter Jansen
Informant: Loubna El Abadi
DoD KDL Visit, February 23, 2005

Goals and Approach
Analysts are often looking for limited, concrete information within the text → full MT may not be necessary
Alternative: rather than full MT followed by extraction, first extract and then translate only the extracted information
But: how do we extract just the relevant parts in the source language?
AMTEXT approach:
–learn extraction patterns and their translations from small amounts of human-translated and aligned data
–combine with broad-coverage Named-Entity translation lexicons
–system output: translation of the extracted information + a structured representation

AMTEXT Extraction-based MT
(System architecture diagram.) Offline, a Learning Module learns Transfer Rules from word-aligned elicited data. At run time, an Extract-Transfer System (Partial Parser & Transfer Engine, Extractor, Post-processor) applies the learned rules together with a Word Translation Lexicon and an NE Translation Lexicon to the Source Text, producing the Extracted Target Text and a Filled Template.
Example transfer rule from the diagram:
S::S [NE-P pagash et NE-P TE] -> [NE-P met with NE-P TE]
((X1::Y1) (X4::Y4) (X5::Y5))
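As a rough sketch of these components (hypothetical Python names and types, not the CMU implementation), the main data objects in the diagram could be represented like this:

# Sketch of the AMTEXT resources and outputs; names are assumptions.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class TransferRule:
    # e.g. S::S [NE-P pagash et NE-P TE] -> [NE-P met with NE-P TE], ((X1::Y1)(X4::Y4)(X5::Y5))
    lhs: str
    source: List[str]
    target: List[str]
    alignments: List[Tuple[int, int]]          # 1-based source/target positions

@dataclass
class Resources:                               # produced offline / loaded at run time
    rules: List[TransferRule] = field(default_factory=list)      # from the Learning Module
    word_lexicon: Dict[str, str] = field(default_factory=dict)   # word translation lexicon
    ne_lexicon: Dict[str, str] = field(default_factory=dict)     # NE translation lexicon

@dataclass
class Output:                                  # what the run-time system returns
    extracted_target_text: str
    filled_template: Dict[str, str]            # e.g. {"P1": ..., "P2": ..., "TE": ...}

The later slides sketch how the run-time side could use these resources.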

Elicitation Example

Learning Extraction Translation Patterns
Elicited example:
Sharon nifgash hayom im bush
Sharon met with Bush today
After generalization (content words abstracted to PERSON, MEET-V, TE; only im / with remain as literals)
Resulting learned pattern rule:
S::S : [PERSON MEET-V TE im PERSON] -> [PERSON MEET-V with PERSON TE]
( (X1::Y1) (X2::Y2) (X3::Y5) (X5::Y4))
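A minimal sketch of this generalization step, assuming a toy category lexicon and an alignment supplied by the elicitation tool (both illustrative, not the project's resources):

# Sketch of turning one word-aligned elicited pair into a generalized pattern rule.
CATEGORY = {"sharon": "PERSON", "bush": "PERSON",
            "nifgash": "MEET-V", "met": "MEET-V",
            "hayom": "TE", "today": "TE"}          # everything else stays literal (im / with)

def learn_pattern(src_tokens, tgt_tokens, word_alignment):
    """word_alignment: 1-based (source_pos, target_pos) pairs for the content words."""
    src_pat = [CATEGORY.get(w.lower(), w) for w in src_tokens]
    tgt_pat = [CATEGORY.get(w.lower(), w) for w in tgt_tokens]
    # variable alignments are kept only for the abstracted (non-literal) positions
    var_align = [(s, t) for (s, t) in word_alignment
                 if src_pat[s - 1] != src_tokens[s - 1]]
    return src_pat, tgt_pat, var_align

src = "Sharon nifgash hayom im bush".split()
tgt = "Sharon met with Bush today".split()
align = [(1, 1), (2, 2), (3, 5), (4, 3), (5, 4)]    # assumed output of the elicitation tool
print(learn_pattern(src, tgt, align))
# (['PERSON', 'MEET-V', 'TE', 'im', 'PERSON'],
#  ['PERSON', 'MEET-V', 'with', 'PERSON', 'TE'],
#  [(1, 1), (2, 2), (3, 5), (5, 4)])

The printed pattern and alignments match the rule shown on this slide.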

Transfer Rule Formalism
Type information
Part-of-speech/constituent information
Alignments
x-side constraints
y-side constraints
xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))

; SL: the old man, TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
(X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
((X1 AGR) = *3-SING)
((X1 DEF) = *DEF)
((X3 AGR) = *3-SING)
((X3 COUNT) = +)
((Y1 DEF) = *DEF)
((Y3 DEF) = *DEF)
((Y2 AGR) = *3-SING)
((Y2 GENDER) = (Y4 GENDER))
)
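One way to hold these rule components in code (field names are mine; the rule content is the NP example above):

# Sketch of a transfer-rule container; constraints kept as opaque strings here.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class XferRule:
    rule_type: str                      # type information, e.g. "NP::NP"
    x_side: List[str]                   # source constituent sequence
    y_side: List[str]                   # target constituent sequence
    alignments: List[Tuple[int, int]]   # 1-based x/y index pairs
    x_constraints: List[str]            # source-side-only constraints
    y_constraints: List[str]            # target-side-only constraints
    xy_constraints: List[str]           # cross-side constraints, e.g. ((Y1 AGR) = (X1 AGR))

old_man = XferRule(
    rule_type="NP::NP",
    x_side=["DET", "ADJ", "N"],                 # SL: the old man
    y_side=["DET", "N", "DET", "ADJ"],          # TL: ha-ish ha-zaqen
    alignments=[(1, 1), (1, 3), (2, 4), (3, 2)],
    x_constraints=["((X1 AGR) = *3-SING)", "((X1 DEF) = *DEF)",
                   "((X3 AGR) = *3-SING)", "((X3 COUNT) = +)"],
    y_constraints=["((Y1 DEF) = *DEF)", "((Y3 DEF) = *DEF)",
                   "((Y2 AGR) = *3-SING)", "((Y2 GENDER) = (Y4 GENDER))"],
    xy_constraints=[],                          # this rule happens to have none
)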

The Transfer Engine
Analysis: source text is parsed into its grammatical structure, which determines transfer application ordering. Example: 他 看 书。 (he read book), parsed as [S [NP [N 他]] [VP [V 看] [NP 书]]].
Transfer: a target-language tree is created by reordering, insertion, and deletion: [S [NP [N he]] [VP [V read] [NP [DET a] [N book]]]]. The article "a" is inserted into the object NP; source words are translated with the transfer lexicon.
Generation: target-language constraints are checked and the final translation is produced, e.g. "reads" is chosen over "read" to agree with "he". Final translation: "He reads a book"
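A sketch of just the transfer (reordering/insertion) step, using the NP::NP rule from the previous slide with a toy lexicon; the helper and lexicon are assumptions, not the engine's API:

# Sketch: build the target side by following rule alignments through a lexicon.
LEXICON = {"the": "ha", "old": "zaqen", "man": "ish"}   # toy transfer lexicon

def apply_transfer(x_tokens, y_length, alignments, lexicon):
    """alignments: 1-based (x_index, y_index) pairs; unaligned y slots would be
    filled by insertion (e.g. the English article 'a' in the Chinese example)."""
    y_tokens = [None] * y_length
    for x_i, y_i in alignments:
        y_tokens[y_i - 1] = lexicon[x_tokens[x_i - 1]]
    return y_tokens

# SL: "the old man"  ->  TL: "ha-ish ha-zaqen" via the NP::NP rule above
print(apply_transfer(["the", "old", "man"], 4,
                     [(1, 1), (1, 3), (2, 4), (3, 2)], LEXICON))
# ['ha', 'ish', 'ha', 'zaqen']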

Partial Parsing
Input: full text in the foreign language
Output: translation of the extracted/matched text
Goal: extract by effectively matching transfer rules against the full text
–identify/parse NEs and words in the restricted vocabulary
–identify transfer-rule (source-side) patterns
–the Transfer Engine produces a complete lattice of transfer translations
Example (NE-P and TE mark matched elements; the unmatched middle is skipped):
Sharon, meluve b-sar ha-xuc shalom, yipagesh im bush hayom
Sharon will meet with Bush today
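A sketch of the matching idea with a crude skip allowance (cf. the "skipping markers" mentioned on the System Development slide); the tag dictionary, skip limit, and greedy strategy are illustrative assumptions:

# Sketch: match a source-side pattern against tagged tokens, skipping unmatched material.
TAGS = {"sharon": "NE-P", "bush": "NE-P", "hayom": "TE", "yipagesh": "MEET-V"}

def match_pattern(pattern, tokens, tags, max_skip=6):
    """Greedy left-to-right match; up to max_skip unmatched tokens may be
    skipped between consecutive pattern elements."""
    matched, p, skipped = [], 0, 0
    for i, tok in enumerate(tokens):
        if p == len(pattern):
            break
        label = tags.get(tok.lower(), tok.lower())
        if label == pattern[p]:
            matched.append((i, tok, pattern[p]))
            p, skipped = p + 1, 0
        else:
            skipped += 1
            if skipped > max_skip:
                return None
    return matched if p == len(pattern) else None

sent = "Sharon , meluve b-sar ha-xuc shalom , yipagesh im bush hayom".split()
print(match_pattern(["NE-P", "MEET-V", "im", "NE-P", "TE"], sent, TAGS))
# [(0, 'Sharon', 'NE-P'), (7, 'yipagesh', 'MEET-V'), (8, 'im', 'im'),
#  (9, 'bush', 'NE-P'), (10, 'hayom', 'TE')]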

Post Processing
Translation Selection Module:
–select the most complete and coherent translation from the lattice, based on scoring heuristics
Structure Extraction:
–extract the translated entities from the pattern and display them in a structured table format
Output Display:
–Perl scripts construct an HTML page for displaying the complete translation results
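A sketch of the structure-extraction step, pulling role fillers out of a transfer structure string (format as in the Lattice Example slide below) into a flat template; the regexes and the role names V/P1/P2/LOC/TE follow the evaluation slide, but the code itself is my reconstruction:

# Sketch: extract role fillers from a transfer structure into a filled template.
import re

def fill_template(structure: str) -> dict:
    persons = re.findall(r'\(PNAME,\d+ "([^"]+)"\)', structure)
    return {
        "P1":  persons[0] if len(persons) > 0 else None,
        "P2":  persons[1] if len(persons) > 1 else None,
        "V":   next(iter(re.findall(r'\(MEET-V,\d+ "([^"]+)"\)', structure)), None),
        "LOC": next(iter(re.findall(r'\(PLACE,\d+ "([^"]+)"\)', structure)), None),
        "TE":  next(iter(re.findall(r'\(DAY,\d+ "([^"]+)"\)', structure)), None),
    }

s = ('((S,7 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") '
     '(PERSON,1 (PNAME,1 "Peres") ) (LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) '
     '(TE,4 (LITERAL "on")(DAY,0 "Monday") ) ) )')
print(fill_template(s))
# {'P1': 'Arafat', 'P2': 'Peres', 'V': 'will meet with', 'LOC': 'Brussels', 'TE': 'Monday'}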

Translation Selection Module: Features
Goal: a scoring function that can identify the most likely best match
Lattice arc features from the transfer engine:
–matched range of the source
–matched parts of the target
–transfer score
–partial parse

Lattice Example
Arafat to meet Peres in Brussels on Monday
ErfAt yltqy byryz msAA AlAvnyn fy brwksl

(1 1 "Arafat" 3 "ErfAt" "(PNAME,0 "Arafat")")
(2 2 "will meet with" 3 "yltqy" "(MEET-V,5 "will meet with")")
(3 3 "Peres" 3 "byryz" "(PNAME,1 "Peres")")
(1 3 "Arafat will meet with Peres" 3 "ErfAt yltqy byryz" "((S,11 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) ) )")
(4 4 "msAA" 3 "msAA" "(UNK,0 "msAA")")
(5 5 "Monday" 3 "AlAvnyn" "(DAY,0 "Monday")")
(4 5 "on Monday" 2.9 "msAA AlAvnyn" "((TE,4 (LITERAL "on")(DAY,0 "Monday") ) )")
(1 5 "Arafat will meet with Peres on Monday" 3.2 "ErfAt yltqy byryz msAA AlAvnyn" "((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (TE,4 (LITERAL "on")(DAY,0 "Monday") ) ) )")
(1 5 "Arafat will meet with Peres Monday" 3.1 "ErfAt yltqy byryz msAA AlAvnyn" "((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (TE,5 (DAY,0 "Monday") ) ) )")
(6 6 "fy" 3 "fy" "(UNK,2 "fy")")
(7 7 "Brussels" 3 "brwksl" "(PLACE,0 "Brussels")")
(6 7 "in Brussels" 2.9 "fy brwksl" "((LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) )")
(1 7 "Arafat will meet with Peres in Brussels on Monday" 3.4 "ErfAt yltqy byryz msAA AlAvnyn fy brwksl" "((S,7 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) (TE,4 (LITERAL "on")(DAY,0 "Monday") ) ) )")
(1 7 "Arafat will meet with Peres in Brussels Monday" 3.3 "ErfAt yltqy byryz msAA AlAvnyn fy brwksl" "((S,7 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) (TE,5 (DAY,0 "Monday") ) ) )")
(1 7 "Arafat will meet with Peres in Brussels" 3.2 "ErfAt yltqy byryz msAA AlAvnyn fy brwksl" "((S,8 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (LOC,1 (LITERAL "in")(PLACE,0 "Brussels") ) ) )")
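A small parser for arcs in this format (assuming one arc per line, as laid out above); reading the fields as start/end span, target text, score, source text, and transfer structure is my interpretation of the dump:

# Sketch: parse one lattice arc line into a dictionary of its fields.
import re

ARC = re.compile(r'^\((\d+) (\d+) "(.*?)" ([\d.]+) "(.*?)" "(.*)"\)$')

def parse_arc(line):
    m = ARC.match(line.strip())
    if not m:
        return None
    start, end, target, score, source, structure = m.groups()
    return {"start": int(start), "end": int(end), "target": target,
            "score": float(score), "source": source, "structure": structure}

line = '(4 5 "on Monday" 2.9 "msAA AlAvnyn" "((TE,4 (LITERAL "on")(DAY,0 "Monday") ) )")'
print(parse_arc(line))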

Example: Extracting Features
1 5 → length (tokens) of the source segment (Arabic) (feature 1)
"Arafat will meet with Peres Monday" → length of the translated segment (feature 2)
3.1 → transfer engine score (feature 3)
"ErfAt yltqy byryz msAA AlAvnyn" → length of the source segment (feature 4)
"((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") (PERSON,1 (PNAME,1 "Peres") ) (TE,5 (DAY,0 "Monday") ) ) )" → transfer structure: full frame (S) or not? (feature 5)
Secondary feature (6): relative length of (2) over (4); the smaller, the more concise the source-language match (less extraneous material, i.e. less chance of mistranslation).
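A sketch of computing the six features from one parsed arc (dictionary fields as in the arc parser above); the exact definitions are my reading of this slide:

# Sketch: turn one lattice arc into the feature vector described on this slide.
def extract_features(arc):
    f1 = arc["end"] - arc["start"] + 1              # (1) matched source range, in tokens
    f2 = len(arc["target"].split())                 # (2) length of the translated segment
    f3 = arc["score"]                               # (3) transfer engine score
    f4 = len(arc["source"].split())                 # (4) length of the matched source segment
    f5 = 1.0 if arc["structure"].lstrip("(").startswith("S,") else 0.0   # (5) full frame (S)?
    f6 = f2 / f4                                    # (6) relative length of (2) over (4)
    return [f1, f2, f3, f4, f5, f6]

arc = {"start": 1, "end": 5, "target": "Arafat will meet with Peres Monday",
       "score": 3.1, "source": "ErfAt yltqy byryz msAA AlAvnyn",
       "structure": '((S,9 (PERSON,1 (PNAME,0 "Arafat") ) (MEET-V,5 "will meet with") '
                    '(PERSON,1 (PNAME,1 "Peres") ) (TE,5 (DAY,0 "Monday") ) ) )'}
print(extract_features(arc))    # [5, 6, 3.1, 5, 1.0, 1.2]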

Selecting Best Translation
For each parse P_j in the lattice, calculate a score S_j based on features f_i with weight coefficients w_i.
Weights w_i trained by hill climbing (training set / manual reference parse).
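Read as a weighted linear combination, S_j = sum_i w_i * f_i(P_j). The following sketch of the scoring plus a simple hill-climbing weight search is my reconstruction of this slide, not the original training code:

# Sketch: linear scoring of lattice parses and hill-climbing weight tuning.
import random

def score(features, weights):
    return sum(w * f for w, f in zip(weights, features))

def select_best(parses, weights):
    """parses: list of feature vectors, one per lattice parse; returns index of the best."""
    return max(range(len(parses)), key=lambda j: score(parses[j], weights))

def hill_climb(training_set, weights, steps=1000, delta=0.1):
    """training_set: list of (parses, reference_index) pairs; tune weights so the
    manually chosen reference parse is selected as often as possible."""
    def accuracy(w):
        return sum(select_best(parses, w) == ref for parses, ref in training_set)
    best_acc = accuracy(weights)
    for _ in range(steps):
        candidate = list(weights)
        candidate[random.randrange(len(weights))] += random.choice([-delta, delta])
        acc = accuracy(candidate)
        if acc >= best_acc:                 # accept uphill and sideways moves
            weights, best_acc = candidate, acc
    return weights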

“Proof-of-Concept” System
Arabic-to-English newswire text (available from TIDES)
Very limited set of actions: (X meet Y)
Limited collection of translation patterns:
– * *
Limited vocabulary and NE lexicon

System Development
Training corpus of 535 short sentences translated and aligned by a bilingual informant:
–258 simple meeting sentences
–120 Temporal Expressions
–105 Location Expressions
–52 Title Expressions
Translation Lexicon of Named Entities (person names, organizations, and locations) converted from Fei Huang’s NE translation/transliteration work
Pattern generalizations semi-automatically “learned” from the training data
Patterns manually enhanced with “skipping markers”
Initial system integrated; development with the informant on 74-sentence dev data

Resulting System
Transfer Grammar contains:
–21 transfer pattern rules
–12 Meet Verb rules
–4/17/11/17 Person/TE/LOC/PTitle “high-level” rules
Transfer Lexicon contains 3070 entries (mostly names and locations)
Estimated development effort/time:
–~20 hours with the informant
–~50 hours of lexical and rule development

Evaluation
Development set of 74 sentences
Test set of 76 unseen sentences with meeting information
Identified the subset of each set on which meeting patterns could potentially apply (“Good”):
–53 development sentences
–44 test sentences

Evaluation
Translation-based:
–unigram token-based retrieval metrics: precision / recall / F1
Entity-based:
–recall for each role in the meeting frame (V, P1, P2, LOC and TE)
–partial recall credit for partial matches
–partial credit (50%) for P1/P2 role interchange
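A sketch of both metric families; the exact partial-credit details (what counts as a partial match) are my reading of the slide:

# Sketch: unigram P/R/F1 over translations and per-role recall over meeting frames.
def unigram_prf(hypothesis: str, reference: str):
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    matches = sum(min(hyp.count(w), ref.count(w)) for w in set(hyp))
    p = matches / len(hyp) if hyp else 0.0
    r = matches / len(ref) if ref else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def role_recall(hyp_frame: dict, ref_frame: dict):
    """Frames: role -> filler string for the roles V, P1, P2, LOC, TE."""
    scores = {}
    for role in ("V", "P1", "P2", "LOC", "TE"):
        hyp, ref = hyp_frame.get(role), ref_frame.get(role)
        if ref is None:
            continue                    # role not present in the reference
        if hyp == ref:
            scores[role] = 1.0
        elif role in ("P1", "P2") and hyp == ref_frame.get("P2" if role == "P1" else "P1"):
            scores[role] = 0.5          # partial credit for P1/P2 interchange
        elif hyp and set(hyp.split()) & set(ref.split()):
            scores[role] = 0.5          # partial credit for a partial match
        else:
            scores[role] = 0.0
    return scores

print(unigram_prf("Arafat will meet with Peres Monday",
                  "Arafat will meet with Peres on Monday"))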

Evaluation Results

Demonstration

Conclusions
Attractive methodology for joint extraction + translation of Essential Elements of Information (EEIs) from full foreign-language texts
Rapid development: circumvents the need to develop high-quality full MT or high-quality IE technology for the foreign source language
Effective use of bilingual informants
Main open question: scalability
–Can this methodology be effective with much broader and more complex types of extracted EEIs?
–Is automatic learning of generalized patterns feasible and effective in such more complex scenarios?
–Can the selection heuristics effectively cope with the vast amounts of ambiguity expected in a large-scale system?