1 Interlingual Annotation of Multilingual Text Corpora (IAMTC) Project Overview for ITIC November 13, 2003 Carnegie Mellon University Lori Levin, Teruko.

Presentation transcript:

1 Interlingual Annotation of Multilingual Text Corpora (IAMTC) Project Overview for ITIC November 13, 2003 Carnegie Mellon University Lori Levin, Teruko Mitamura, Simon Fung

2 Principal investigators and senior personnel
Bonnie Dorr, University of Maryland
Nizar Habash, University of Maryland and Columbia
Stephen Helmreich, NMSU
Eduard Hovy, USC
David Farwell, NMSU
Lori Levin, CMU
Keith Miller, MITRE
Teruko Mitamura, CMU
Owen Rambow, Columbia University
Florence Reeder, MITRE

3 Cooperative Website:
Wiki
Corpora
Documents and manuals
Discussion

4 Goals of IAMTC
A practical interlingua for unrestricted text
  – Based on mismatch resolution between languages and between multiple English translations
  – Goal: Feasible human coding
    Speed
    Inter-coder agreement

5 Benefits of IAMTC
Usable by many research communities, and by researchers using different approaches, working at different levels:
MT, information extraction, summarization, question-answering, etc.
Corpus-based, rule-based, machine learning-based, statistical approaches, etc. (note: heterogeneous list, not mutually exclusive)
Multiple levels of representation:
  – Syntactic dependency structure
  – Language-specific predicate-argument structure
  – Interlingua (with resolution of some mismatches)

6 Products of IAMTC
A coding manual for the interlingua
A multilingual tagged corpus
  – 25 original texts in each of: French, Spanish, Japanese, Korean, Arabic, Hindi
  – Three English translations of each text
An evaluation metric for the interlingua

7 Representations
IL0: Language-specific dependency syntax
IL1: Language-specific semantic structure with:
  – Labeling of nodes using ontology
  – Labeling of arcs with semantic role names
IL2: Interlingua
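As an illustration only, the step from IL0 to IL1 can be sketched as a small Python data structure: the same dependency tree, with ontology symbols added to the nodes and the arc labels replaced by semantic role names. The class `Node`, the function `relabel`, the concept symbols (`LIKE`, `APPLE`) and the role names (`experiencer`, `theme`) are assumptions made for this sketch; they are not the project's markup format or actual Omega symbols. A single global role map is also a simplification, since roles would normally depend on the verb.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """One node of a dependency tree (hypothetical structure)."""
    word: str
    concept: Optional[str] = None  # ontology symbol; None at the IL0 level
    arcs: List[Tuple[str, "Node"]] = field(default_factory=list)  # (label, child)


def relabel(node: Node, concepts: Dict[str, str], roles: Dict[str, str]) -> Node:
    """Return an IL1-style copy of an IL0-style tree.

    Nodes receive a concept when one is listed; arc labels are replaced by a
    semantic role name when one is listed, and kept unchanged otherwise.
    """
    return Node(
        word=node.word,
        concept=concepts.get(node.word),
        arcs=[
            (roles.get(label, label), relabel(child, concepts, roles))
            for label, child in node.arcs
        ],
    )


# IL0-style tree for "John likes apples": syntactic arc labels only.
il0 = Node("likes", arcs=[("subj", Node("John")), ("obj", Node("apples"))])

# IL1-style tree: hypothetical ontology symbols and role names.
il1 = relabel(
    il0,
    concepts={"likes": "LIKE", "apples": "APPLE"},
    roles={"subj": "experiencer", "obj": "theme"},
)
```

The original tree is left untouched, so both levels of representation can be kept side by side for the same sentence.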

8 Neutralize:
support verbs;
some multi-word expressions and non-literal language;
some lexical converses (buy-sell);
some sentence planning differences
  – “John who is blond likes apples”
  – “John is blond and likes apples”
conflational mismatches
  – English verb “tape” vs. Japanese “teepu de tomeru” (tape with attach)
head-switching mismatches, etc.
  – “I tend to go to school.” vs. “I usually go to school.”
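A minimal sketch of what neutralizing a lexical converse such as buy-sell could look like: both verbs map to one event symbol, and their syntactic slots map to the same participant roles. The table `CONVERSES`, the event name `TRANSFER-POSSESSION`, the slot names and the role names are all assumptions for illustration, not entries from the project's lexicon or ontology.

```python
# Hypothetical table: verb -> (shared event symbol, slot-to-role mapping).
CONVERSES = {
    "buy": ("TRANSFER-POSSESSION",
            {"subj": "recipient", "obj": "theme", "from": "source"}),
    "sell": ("TRANSFER-POSSESSION",
             {"subj": "source", "obj": "theme", "to": "recipient"}),
}


def neutralize(verb, args):
    """Map a verb and its slot fillers to an event symbol and role fillers."""
    event, role_map = CONVERSES[verb]
    return event, {role_map[slot]: filler for slot, filler in args.items()}


# "John bought a book from Mary" and "Mary sold a book to John"
bought = neutralize("buy", {"subj": "John", "obj": "book", "from": "Mary"})
sold = neutralize("sell", {"subj": "Mary", "obj": "book", "to": "John"})
print(bought == sold)  # True
```

Under these assumptions the two sentences receive the same representation, which is the intended effect of neutralizing the converse pair.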

9 Examples (from Nizar Habash)
  – The minister, who has his own website, also said: "I want Dubai to be the best place in the world for state-of-the-art technology companies."
  – The minister who has a personal website on the internet, further said that he wanted Dubai to become the best place in the world for the advanced (hitech) technological companies.

10 Example
Original English:
  – In its first five years of operation, PRODEM financed loans to over 13,300 microentrepreneurs, 77 per cent of whom were women, disbursing over $27 million in loans averaging $273.
Original French:
  – Au bout de cinq ans, le programme avait consenti plus de 27 millions de dollars de prêts d'un montant moyen de 273 dollars, à plus de entrepreneurs, dont 77% de femmes....
English Translation from French:
  – At the end of five years, the program had granted more than 27 million dollars in loans with an average amount of 273 dollars, to more than entrepreneurs, of which 77% were women,....

11 Example 1
Original English:
  – financed loans to over 13,300 microentrepreneurs,
  – disbursing over $27 million
  – in loans
Original French:
  – consenti plus de 27 millions de dollars
  – de prêts à plus de entrepreneurs,
English Translation from French:
  – granted more than 27 million dollars
  – in loans to more than entrepreneurs

12 Example 2
Original English:
  – Its network of eighteen independent organizations in Latin America has lent …..
Original French:
  – le réseau regroupe dix-huit organisations indépendantes qui ont déboursé …..
English Translation from French:
  – the network comprises eighteen independent organizations which have disbursed …..

13 Example 2
Original English:
  – has lent Its network
  – of eighteen independent organizations …..
Original French:
  – regroupe le réseau
  – dix-huit organisations indépendantes
    » ont déboursé ……
English Translation from French:
  – comprises the network eighteen independent organizations
  – have disbursed ……

14 Interlingua Merging
Language-faithful interlinguas
Original English:
  – financed loans to over 13,300 microentrepreneurs
  – disbursing over $27 million
  – in loans
Original French:
  – consenti plus de 27 millions de dollars
  – de prêts à plus de entrepreneurs
English Translation from French:
  – granted more than 27 million dollars
  – in loans to more than entrepreneurs
Merged Interlingua
  – TRANSFER-MONEY over $27 million to over 13,300 microentrepreneurs
  – SOME-RELATION over $27 million loans

15 Interlingua Merging
Original English:
  – has lent Its network
  – of eighteen independent organizations
Original French:
  – regroupe le réseau
  – dix-huit organisations indépendantes
    » ont déboursé
English Translation from French:
  – comprises the network eighteen independent organizations
  – have disbursed
Merged Interlingua
  – HAS-AS-PART the network eighteen independent organizations
  – TRANSFER-MONEY the network …..

16 Example 3
Original English:
  – Three of the most advanced institutions in the ACCION network started their programmes as non-profit organizations and have, in the last five years, converted into
Original French:
  – Trois des institutions les plus performantes rattachées à ACCION International qui étaient au départ des organisations à but non lucratif sont devenues ces cinq dernières années
English Translation from French:
  – Three of the most successful institutions connected to ACCION International, which were non-profit organizations in the beginning, have become, in these last five years,

17 Example 3
Original English:
  – Started their programmes Institutions
  – as non-profit organizations
  – Converted Institutions …..
Original French:
  – sont devenues Institutions
  – relative-clause: étaient au départ
    » institutions ……
English Translation from French:
  – Have become Institutions
  – Relative-clause: Were … in the beginning
    » institutions ……

18 Meetings and Workshops
Meetings:
  – September 2003: New Orleans during MT Summit
  – November 8 and 9, 2003: CMU
  – January 18 and 19, 2004: ISI
Workshops:
  – September 2003: MT Summit
  – May 2004: Plan for a panel in the workshop organized by Adam Meyers at NAACL/HLT 2004
  – July 2004: Plan to propose ACL workshop

19 Timeline
November 10 to December 1:
  – Assembly of ENGLISH tools and knowledge sources
    Tools committee: Hovy, Rambow, Miller
    Omega ontology, ISI
    LCS verb lexicon (connect to Omega via PropBank)
    LDA (Lightweight Dependency Analyzer, Srinivas Bangalore)
    Graph tool from Prague
    New annotation tool (Dependency tree, Omega, Lexicon)
  – Draft of coding manual for IL1:
    Annotation Committee: Rambow, Mitamura, Levin, Dorr, Habash, Helmreich
    Ontology symbols – Hovy
    IL0 – dependency structure – Rambow
    IL1 markup format – Rambow and Habash
    Semantic roles – Dorr, Habash, Mitamura, Levin
    Nouns and compounds – Mitamura
    Adverbs and adjectives – Helmreich
    Prepositions – Miller
    Named entities – Reeder
    Modification vs Predication – Habash
  – Annotator training
    Phase 1: All annotators will tag the same English text
  – Assembly of corpora:
    Data committee: Mitamura, Hovy, Miller, Farwell
    Five foreign language original texts in each language
    Three English translations of each text

20 Annotation Procedure (English)
Run LDA parser
Use tree editing tool to convert syntactic dependency parse into IL1
  – Correct parsing errors
  – Choose symbols from the ontology as node labels
  – For verbs: look the verb up in the lexicon to get a list of semantic role names
    Match phrases to roles
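The last step of the procedure (look the verb up in the lexicon, then match phrases to roles) can be sketched as a simple consistency check on one verb's annotation. The lexicon entry for `finance`, its role names, and the function `check_verb_annotation` are hypothetical; the real LCS verb lexicon and annotation tool are not described in enough detail here to reproduce them.

```python
# Hypothetical lexicon: verb -> semantic role names it licenses.
LEXICON = {
    "finance": ["agent", "theme", "recipient"],
}


def check_verb_annotation(verb, assigned, lexicon=LEXICON):
    """Return a list of problems found in one verb's role assignment.

    `assigned` maps role names chosen by the annotator to the phrases
    filling them. An empty list means no problem was detected.
    """
    allowed = lexicon.get(verb)
    if allowed is None:
        return [f"verb not in lexicon: {verb}"]
    return [f"unknown role: {role}" for role in assigned if role not in allowed]


ok = check_verb_annotation(
    "finance", {"agent": "PRODEM", "theme": "loans"}
)
bad = check_verb_annotation(
    "finance", {"agent": "PRODEM", "goal": "microentrepreneurs"}
)
```

A check of this kind only flags role names the lexicon does not list for the verb; whether the right phrase was matched to each role remains a judgment for the annotator.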

21 Timeline
December 1 to January 19:
Annotation development cycle:
  – Procedure committee: Hovy, Farwell, Mitamura
  – For each week, for each language:
    Pick a text and two English translations of the text and one English translation from another site.
  – Each week:
    Conference call on Friday at 1:00 pm Eastern Time
    Revise annotation manuals as necessary
Development of inter-coder agreement metric
  – Evaluation committee: Reeder and Habash, leaders
Proposal for IL2 based on comparison of IL1s for different translations of the same text
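The inter-coder agreement metric is still to be developed at this point in the timeline, so the following is not the project's metric. It is a sketch of one standard measure, Cohen's kappa, applied to two annotators' labels for the same items; the role labels in the example are made up.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical role labels from two annotators for the same four arcs.
coder_1 = ["agent", "theme", "agent", "theme"]
coder_2 = ["agent", "theme", "theme", "theme"]
print(cohen_kappa(coder_1, coder_2))  # 0.5
```

Here the coders agree on 3 of 4 items (p_o = 0.75) and chance agreement is 0.5, giving kappa = 0.5. Comparing whole trees rather than flat label lists would need a more elaborate alignment step that this sketch does not attempt.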

22 Timeline
January 19 to February 23
  – Development of foreign language analysis tools
  – Large inter-coder agreement evaluation (IL1)
  – Continue working on the IL2 design
March 1: Mid-year report
March to September 2004
  – Annotation of full corpus:
    25 original texts in each of the six languages (French, Spanish, Hindi, Korean, Arabic, Japanese)
    3 translations of each text into English
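The size of the full corpus follows from the figures on this slide. The short calculation below only multiplies them out; the variable names are chosen for this sketch.

```python
LANGUAGES = ["French", "Spanish", "Hindi", "Korean", "Arabic", "Japanese"]
TEXTS_PER_LANGUAGE = 25      # original texts in each language
TRANSLATIONS_PER_TEXT = 3    # English translations of each original

originals = TEXTS_PER_LANGUAGE * len(LANGUAGES)
translations = originals * TRANSLATIONS_PER_TEXT
total_texts = originals + translations

print(originals, translations, total_texts)  # 150 450 600
```

So the plan amounts to 150 original texts and 450 English translations, 600 texts in all, assuming every original receives all three translations.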

23 Plans for year 2
Argument-taking predicates other than verbs
Additional tools for automatic construction of IL1 and IL2
More comprehensive set of divergences resolved in IL2
Additional annotation topics:
  – Coreference
  – Scope
  – Tense and aspect
  – Etc.
Larger annotated corpus
  – Suitable for corpus-based methods and machine learning