AMTEXT: Extraction-based MT for Arabic

Faculty: Alon Lavie, Jaime Carbonell
Students and Staff: Laura Kieras, Peter Jansen
Informant: Loubna El Abadi

Background and Objectives
- Full MT of text is problematic:
  - Requires large amounts of resources, long development time
  - Quality of output varies
- Analysts often are looking for limited concrete information within the text -> full MT may not be necessary
- Alternative: rather than full MT followed by extraction, first extract and then translate only extracted information
- Text Extraction technology has made much progress in past decade [TIPSTER, TREC, EELD]
- Research Question: Can Extraction-based MT result in improved accuracy and utility of information for analysts?
Nov 14, 2003 ITIC Site Visit

Extraction-based MT
- “Traditional” Approach:
  - Develop information extraction capability for the source language
  - Runtime Extractor produces a template of extracted feature-value information
  - If desired, English Generator can render the information in the form of text
- Drawback: Adapting extraction technology to a new foreign language is difficult
  - Requires significant expertise in the foreign language
  - Significant amounts of human development time
  - Not clear that it is an attractive solution

AMTEXT Approach
- Attempt to leverage our work on automatic learning of MT transfer rules
- Develop an elicitation corpus specifically designed for targeted extraction patterns
- Learn generalized transfer rules for targeted extraction patterns from the elicitation corpus
- Acquire a high-accuracy Named-Entity translation lexicon + a limited translation lexicon for targeted vocabulary
- Runtime: use partial parser + transfer rules to translate only the matched portions of SL text

AMTEXT Extraction-based MT
[Architecture diagram. Box labels: Word-aligned elicited data, Learning Module, Transfer Rules, Source Text, Run Time Transfer System, Partial Parser, Transfer Engine, NE Translation Lexicon, Word Translation Lexicon, Extracted Target Text]
Example transfer rule shown in the diagram:
  S::S [NE-P pagash et NE-P TE] -> [NE-P met with NE-P TE]
  ((X1::Y1) (X4::Y4) (X5::Y5))

Elicitation Example
[Four slides with this title; their content is not preserved in this transcript]

Learning Transfer Rules
- Different notion of rule generalization than in our full XFER approach
- Generalize from examples to NEs that play specific roles in target extraction pattern
- Verbs and function words may not be generalized
- Example:
  Sharon will meet with Bush today
  sharon yipagesh &im bush hayom
- Goal Rule:
  S::S [NE-P yipagesh &im NE-P TE] -> [NE-P will meet with NE-P TE]
  ((X1::Y1) (X4::Y5) (X5::Y6))
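
The goal rule above can be read mechanically: aligned positions (X1, X4, X5) are slots whose contents are translated and placed at the paired target position, while the remaining source symbols are literal words that must match exactly. The following is a minimal Python sketch of that reading, not the project's Transfer Engine. The names `TransferRule`, `parse_rule`, `apply_rule`, the `(label, text)` input format and the toy lexicon are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class TransferRule:
    source: list      # source-side symbols, e.g. ["NE-P", "yipagesh", "&im", "NE-P", "TE"]
    target: list      # target-side symbols
    alignments: dict  # 1-based source position -> 1-based target position


def parse_rule(text):
    """Parse 'S::S [src ...] -> [tgt ...] ((Xi::Yj) ...)' into a TransferRule."""
    match = re.match(
        r"\s*\S+::\S+\s*\[(.*?)\]\s*->\s*\[(.*?)\]\s*\((.*)\)\s*$", text, re.S
    )
    if match is None:
        raise ValueError("not a transfer rule: " + text)
    source, target, constraints = match.groups()
    alignments = {int(x): int(y) for x, y in re.findall(r"X(\d+)::Y(\d+)", constraints)}
    return TransferRule(source.split(), target.split(), alignments)


def apply_rule(rule, constituents, translate):
    """Apply a rule to a list of (label, text) pairs; return English text or None.

    Aligned positions must carry the label named in the rule and are translated;
    unaligned positions are literals and must equal the source word itself.
    """
    if len(constituents) != len(rule.source):
        return None
    output = list(rule.target)
    for position, (symbol, (label, text)) in enumerate(
        zip(rule.source, constituents), start=1
    ):
        if position in rule.alignments:
            if label != symbol:
                return None
            output[rule.alignments[position] - 1] = translate(text)
        elif text != symbol:
            return None
    return " ".join(output)


rule = parse_rule(
    "S::S [NE-P yipagesh &im NE-P TE] -> [NE-P will meet with NE-P TE] "
    "((X1::Y1) (X4::Y5) (X5::Y6))"
)
# Toy lexicon (hypothetical): stands in for the NE and word translation lexicons.
toy_lexicon = {"sharon": "Sharon", "bush": "Bush", "hayom": "today"}
sentence = [("NE-P", "sharon"), ("O", "yipagesh"), ("O", "&im"),
            ("NE-P", "bush"), ("TE", "hayom")]
print(apply_rule(rule, sentence, toy_lexicon.get))  # Sharon will meet with Bush today
```

Because the verb and function word stay literal in the rule, a sentence with a different verb simply fails to match, which is consistent with the slide's point that verbs and function words may not be generalized.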

Acquisition of Named Entity Translation Lexicon
- Utilize Fei Huang’s work on building Named Entity Translation Lexicons based on transliteration models
- NE Lexicon will be split into meaningful sub-categories: PNs, Organizations, Locations, etc.
- NE translation lexicon augmented with NEs from elicited data
- Goal: High coverage and high accuracy identification of NEs that play a part in the transfer rules

Named Entity Translation Lexicon
- English-Arabic lexicon from Fei:
  - Trained on TIDES Newswire Data
  - 7522 entries sorted by transliteration score
- Example:
  4.51948528108464 # XXX # # Israel # AsrAAyl
  4.05498190544419 # XXX # # Kabul # kAbwl
  3.66368346525326 # XXX # # Paris # bArys
  3.65527347080481 # XXX # # Afghanistan # AfgAnstAn
  3.47030997281853 # XXX # # Pakistan # bAkstAn
  3.23199522148251 # XXX # # Moscow # mwskw
  3.20392400497002 # XXX # # Arafat # ErfAt
  3.13060360328543 # XXX # # Beirut # byrwt
  3.06872591580516 # XXX # # Russia # rwsyA
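
The example entries above follow a regular `#`-separated layout, so they can be loaded with a few lines of Python. This is only a sketch under assumptions: the slide does not say what the `XXX` field or the empty third field mean, so they are read and ignored here, and the function names `parse_lexicon_line` and `load_lexicon` are invented for illustration.

```python
def parse_lexicon_line(line):
    """Split one 'score # XXX # # English # Arabic' line into (score, english, arabic)."""
    fields = [field.strip() for field in line.split("#")]
    if len(fields) != 5:
        raise ValueError("expected 5 '#'-separated fields: " + line)
    # The second and third fields are not explained on the slide; they are ignored.
    score, _second, _third, english, arabic = fields
    return float(score), english, arabic


def load_lexicon(lines):
    """Map each transliterated Arabic form to (english, score), keeping the best score."""
    lexicon = {}
    for line in lines:
        if not line.strip():
            continue
        score, english, arabic = parse_lexicon_line(line)
        if arabic not in lexicon or score > lexicon[arabic][1]:
            lexicon[arabic] = (english, score)
    return lexicon


sample = [
    "4.51948528108464 # XXX # # Israel # AsrAAyl",
    "4.05498190544419 # XXX # # Kabul # kAbwl",
    "3.66368346525326 # XXX # # Paris # bArys",
]
lexicon = load_lexicon(sample)
print(lexicon["kAbwl"][0])  # Kabul
```

Keying the dictionary on the Arabic side reflects the runtime direction of the pilot system (Arabic source text, English output).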

Named Entity Identification
- NE Identifinder for English:
  - Available from BBN
  - Will be used for identifying English NEs within elicited data -> Arabic NEs from word alignments
- NE Identifinder for Arabic:
  - Requested from BBN, so far no response
  - Will use if available, can manage without it (naïve identification based on NE translation lexicon)
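
The fallback mentioned above, naïve identification based on the NE translation lexicon, could look roughly like the sketch below: a token is labeled as a named entity when it appears on the Arabic side of the lexicon. The function name `tag_named_entities`, the `NE-P`/`O` labels for the output and the `w1`/`w2` placeholder tokens are assumptions; multi-word names and ambiguous tokens are not handled.

```python
def tag_named_entities(tokens, ne_lexicon):
    """Label each token NE-P if it is a key of the NE lexicon, otherwise O."""
    return [(token, "NE-P" if token in ne_lexicon else "O") for token in tokens]


# Arabic-side forms taken from the lexicon example; w1 and w2 are placeholder
# tokens standing in for ordinary words, not real Arabic.
ne_lexicon = {"ErfAt": "Arafat", "bArys": "Paris", "mwskw": "Moscow"}
tokens = ["ErfAt", "w1", "w2", "bArys"]
print(tag_named_entities(tokens, ne_lexicon))
```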

Acquisition of Limited Word Translation Lexicon
- Vocabulary of interest is limited based on specific actions and objects that are of interest -> scopeable on the English side
- Elicitation corpus serves as a high-quality initial source for extracting this translation lexicon
- Statistical word-to-word translation dictionary from SMT or EBMT can be used as a source for expanding coverage on the foreign language side
- Experiment if time/resources permit with incorporating expanded vocabulary into transfer rules

Partial Parsing
- Input: Full text in the foreign language
- Output: Translation of extracted/matched text
- Goal: Extract by effectively matching transfer rules with the full text
  - Identify/parse NEs and words in restricted vocabulary
  - Identify transfer-rule (source-side) patterns
  - Handle expected high levels of ambiguity
- Example:
  Sharon, meluve b-sar ha-xuc shalom, yipagesh im bush hayom
  NE-P NE-P NE-P TE
  Sharon will meet with Bush today
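
One simple way to picture the matching step is a sliding window over a tagged token sequence: every position is tried, and only spans that fit a rule's source-side pattern are kept for translation. The sketch below assumes contiguous matches only, so it would not by itself skip intervening material such as the apposition in the example above; how the actual partial parser handles that is not described on the slide. `CATEGORIES`, `matches_at` and `find_matches` are hypothetical names.

```python
CATEGORIES = {"NE-P", "TE"}  # pattern symbols that match by label, not by word


def matches_at(pattern, tagged, start):
    """True if pattern matches the (word, label) pairs beginning at index start."""
    if start + len(pattern) > len(tagged):
        return False
    for symbol, (word, label) in zip(pattern, tagged[start:]):
        if symbol in CATEGORIES:
            if label != symbol:
                return False
        elif word != symbol:
            return False
    return True


def find_matches(pattern, tagged):
    """Return (start, end) index pairs of every contiguous match in the text."""
    last_start = len(tagged) - len(pattern)
    return [
        (start, start + len(pattern))
        for start in range(last_start + 1)
        if matches_at(pattern, tagged, start)
    ]


pattern = ["NE-P", "yipagesh", "im", "NE-P", "TE"]
# w0 and w9 are placeholder tokens for surrounding text outside the pattern.
tagged = [("w0", "O"), ("sharon", "NE-P"), ("yipagesh", "O"), ("im", "O"),
          ("bush", "NE-P"), ("hayom", "TE"), ("w9", "O")]
print(find_matches(pattern, tagged))  # [(1, 6)]
```

Returning spans rather than translated text keeps matching separate from transfer, so overlapping or competing matches can be compared before anything is translated.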

Scope of Pilot System
- Arabic-to-English
- Newswire text (available from TIDES)
- Limited set of actions: (X meet Y) (X attend Y) (X hold Y) (X kill Y) (X announce Y)…
- Limited translation patterns: <subj-NE> <verb> <obj> <LOC>* <TE>*
- Limited vocabulary
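
The translation pattern above reads like a regular expression over constituent labels: a subject NE, a verb, an object, then any number of locations followed by any number of time expressions. A small Python sketch of that reading follows; the label spellings (`SUBJ-NE`, `VERB`, `OBJ`, `LOC`, `TE`) and the function name `in_scope` are assumptions chosen for the example.

```python
import re

# <subj-NE> <verb> <obj> <LOC>* <TE>* written as a regex over space-joined labels.
PATTERN = re.compile(r"SUBJ-NE VERB OBJ( LOC)*( TE)*")


def in_scope(labels):
    """True if a sequence of constituent labels fits the pilot-system pattern."""
    return PATTERN.fullmatch(" ".join(labels)) is not None


print(in_scope(["SUBJ-NE", "VERB", "OBJ", "TE"]))         # True
print(in_scope(["SUBJ-NE", "VERB", "OBJ", "TE", "LOC"]))  # False
```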

Evaluation Plan
- Compare AMTEXT approach to full-text Arabic-to-English SMT, on a limited task of translation of relations within the scope of coverage
- Establish a test set for evaluation
- Define an appropriate metric: Precision/Recall/F1 of relations and entities
- Compare performance
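
For the Precision/Recall/F1 metric named above, a minimal sketch is given below. It assumes relations are represented as (subject, action, object) tuples and scored by exact match against a gold set; the slide leaves the metric to be defined, so both the representation and the exact-match criterion are assumptions, and the sample relations are invented placeholders.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 between two collections of relations."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Placeholder relations (hypothetical data, not results from the project).
gold = {("X1", "meet", "Y1"), ("X2", "attend", "Y2")}
predicted = {("X1", "meet", "Y1"), ("X2", "hold", "Y2")}
print(precision_recall_f1(predicted, gold))  # (0.5, 0.5, 0.5)
```

The same function can score entities by passing sets of entity strings instead of relation tuples.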

Current Status
- Initial small elicitation corpus translated and aligned
- Extraction of elicitation phrases from Penn-TB in advanced stages
- Identifying scope of coverage: relations, actions, translation patterns
- Preliminary NE translation lexicon available

Work Plan
- Creation of full elicitation corpus: Nov-03
- Translation/alignment of elicitation corpus: Nov/Dec-03
- Install and integrate BBN English Identifinder: Dec-03
- Acquire initial NE translation lexicon: Dec-03
- Acquire initial word translation lexicon: Dec-03
- Develop and integrate partial parser: Dec-03/Feb-04
- Modify Transfer Engine for AMTEXT configuration: Dec-03/Jan-04
- Integration of preliminary complete system: Feb-04
- Design of evaluation: Feb-04
- System testing and modifications: Feb/Apr-04
- Test-set evaluation: Apr-04