

EMPOWER 2: Empirical Methods for Multilingual Processing, ‘Onoring Words, Enabling Rapid Ramp-up
Martha Palmer, Aravind Joshi, Mitch Marcus, Mark Liberman, Tony Kroch, Lyle Ungar
University of Pennsylvania
March 23, 2000
TIDES KICKOFF

Penn approach
Relies on lexically based linguistic analysis
– Humans annotate naturally occurring text (hand-correcting the output of automatic parsers, e.g. Fidditch, XTAG)
– Train statistical POS taggers, parsers, etc.
– Common thread is predicate-argument structure
Hypothesis: more linguistically sophisticated analyzers → more accurate output
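
The step from hand-annotated text to a trained statistical tagger can be sketched with the simplest possible model, a most-frequent-tag (unigram) tagger. This is only an illustration of the idea and not the taggers used at Penn; the function names and the three-sentence toy corpus are assumptions made up for this example.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Learn each word's most frequent tag from hand-annotated sentences."""
    counts = defaultdict(Counter)   # word -> Counter of tags seen for it
    all_tags = Counter()            # overall tag frequencies
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
            all_tags[tag] += 1
    if not all_tags:
        raise ValueError("training data is empty")
    # Unknown words fall back to the most frequent tag overall.
    default_tag = all_tags.most_common(1)[0][0]
    model = {word: c.most_common(1)[0][0] for word, c in counts.items()}
    return model, default_tag


def tag(words, model, default_tag):
    """Tag a list of words with the learned model."""
    return [(w, model.get(w.lower(), default_tag)) for w in words]


# Toy annotated corpus (hypothetical, Penn TreeBank style POS tags).
corpus = [
    [("The", "DT"), ("walls", "NNS"), ("shook", "VBD")],
    [("The", "DT"), ("building", "NN"), ("rocked", "VBD")],
]
model, default_tag = train_unigram_tagger(corpus)
tagged = tag(["The", "building", "shook"], model, default_tag)
```

More annotated text gives the model more words and more reliable counts, which is the sense in which annotation feeds tool development here.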

EMPOWER 2 Approach
Annotations enriched with semantics and pragmatics
Provide companion lexicons for annotated corpora
Extend our coverage to other languages
Goal – Parallel annotated corpora/lexicons will enable rapid ramp-up of MT

TOOLS/RESOURCES
– Morphological Analyzers
– Stochastic parsers
– Lexicalized grammars
– Lexical classifications for cross-lingual mappings
LANGUAGES
– English
– Chinese
– Korean
– Hindi/Tamil
Faster development of Tools/annotation:

Current Status
English Q&A using coreference
English annotation
– adding semantics to Penn TreeBank
– creating companion lexicon
Korean/English annotation
– syntactic annotation and some semantics
– companion transfer lexicon
Chinese annotation
– syntactic annotation (Chinese TreeBank)

English Q&A – Tom Morton
TREC-8 Approach
Extract sentences based on:
– Words in the sentence
– Category of the answer
– Words in co-reference relationships (pronouns, common nouns, dates)
Results
– Placed 4th out of 20 participants.

Examples (pronouns)
Who killed Lee Harvey Oswald? - demo
– ..., and the hat. There was the suit he wore on the day he (JACK RUBY) killed Oswald, a diamond-studded watch, a silver and diamond ring, two pairs of swim trunks, a shower cap, an athletic supporter and a letter written to a woman. Other than th …
Future Plans
– use WordNet and syntactic constructions to determine semantic categories of noun phrases
– Cross-document co-reference
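
One way the three extraction cues from the TREC-8 approach (words in the sentence, category of the answer, words in co-reference relationships) could be combined is a weighted sentence score. The sketch below is an assumption for illustration only: the weights, the `Sentence` class, the hand-filled entity types and the crude tokenizer are all hypothetical and do not describe the actual TREC-8 system.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    text: str
    # Answer categories present in the sentence (assumed to come from a tagger).
    entity_types: set = field(default_factory=set)
    # Words linked to the sentence through coreference, e.g. a pronoun's antecedent.
    coref_words: set = field(default_factory=set)


def tokens(text):
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    return {w.strip(".,?!;:()").lower() for w in text.split()} - {""}


def score(question_words, answer_type, sent, w_word=1.0, w_type=2.0, w_coref=0.5):
    """Hypothetical weighted sum of the three cues."""
    words = tokens(sent.text)
    direct = len(question_words & words)
    # Question words absent from the surface text but reachable via coreference.
    coref = {w.lower() for w in sent.coref_words}
    via_coref = len((question_words - words) & coref)
    type_match = 1 if answer_type in sent.entity_types else 0
    return w_word * direct + w_type * type_match + w_coref * via_coref


def rank(question, answer_type, sentences):
    """Order candidate sentences from best to worst score."""
    q = tokens(question)
    return sorted(sentences, key=lambda s: score(q, answer_type, s), reverse=True)


# Toy candidates; entity types and coreference links are filled in by hand.
s1 = Sentence("There was the suit he wore on the day he killed Oswald",
              {"PERSON"}, {"Jack", "Ruby"})
s2 = Sentence("Oswald was born in New Orleans", {"LOCATION"})
best = rank("Who killed Lee Harvey Oswald?", "PERSON", [s2, s1])[0]
```

In this toy setup the coreference link is what lets a sentence that only says "he" be connected to the name Jack Ruby.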

Semantic Annotation – Hoa Dang, Joseph Rosenzweig, John Duda
Current syntactic annotation
– POS, phrase structure bracketing
– Logical Subject, locative, temporal adjuncts
New semantic augmentations
– Sense tag verbs and noun arguments/adjuncts
– Predicate-argument relations for verbs, label arguments (arg0, arg1, arg2)

First Experiment (SIGLEX99)
WSJ 5K word corpus
– running text
– WordNet words sense tagged twice (10 days)
– 89% inter-annotator agreement
– 700 verb tokens: 81% agreement (disagreement in 90/350 verb tokens)
Automatic predicate-argument labeling
– 81% precision on 162 structures
– Hand corrected 2100 words in one day
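
The agreement and precision figures in this experiment are simple ratios. A minimal sketch of how such numbers can be computed, assuming two annotators tag the same tokens and automatic labels are checked against hand-corrected ones; the sense tags and structure IDs below are invented toy data, not values from the experiment.

```python
def observed_agreement(tags_a, tags_b):
    """Fraction of tokens on which two annotators chose the same tag."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both annotators must tag the same tokens")
    if not tags_a:
        raise ValueError("no tokens to compare")
    matches = sum(a == b for a, b in zip(tags_a, tags_b))
    return matches / len(tags_a)


def precision(predicted, gold):
    """Fraction of automatically assigned labels that match the gold labels.

    Both arguments map a structure ID to its label.
    """
    if not predicted:
        raise ValueError("no predictions to evaluate")
    correct = sum(1 for key, label in predicted.items() if gold.get(key) == label)
    return correct / len(predicted)


# Hypothetical sense tags from two annotators for four verb tokens.
annotator_a = ["shake#1", "shake#2", "rock#1", "shake#1"]
annotator_b = ["shake#1", "shake#2", "rock#2", "shake#1"]
agreement = observed_agreement(annotator_a, annotator_b)

# Hypothetical automatic labels versus hand-corrected labels.
auto_labels = {"s1": "arg0", "s2": "arg1"}
gold_labels = {"s1": "arg0", "s2": "arg2"}
auto_precision = precision(auto_labels, gold_labels)
```

This is raw (observed) agreement; it does not correct for agreement expected by chance.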

Example
I was shaking the whole time. The walls shook; the building rocked.

Second Experiment: Methodology (150K target – Penn TreeBank II, with Christiane Fellbaum)
Sense tagging
– Two human annotators (replace one with automatic WSD if possible)
– WordNet senses, but allow for revision of entries
Predicate argument labels
– Rosenzweig’s converter
– Uses TreeBank “cues”
– Consults lexical semantic KB: verb subcategorization frames and alternations, ontology of noun-phrase referents, multi-word lexical items
XML annotation in external file referencing IDs
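
Keeping the XML annotation in an external file that references IDs is often called standoff annotation: the source text is left untouched and each annotation points at token IDs. The sketch below shows one possible shape for such a file. The element names, attribute names, the sense ID `shake.01` and the `arg1` label are placeholders chosen for illustration, not the project's actual schema or label assignments.

```python
import xml.etree.ElementTree as ET


def build_standoff(tokens, predicates):
    """Build a standoff annotation tree that refers to tokens only by ID.

    tokens: list of (token_id, word) pairs from the annotated text.
    predicates: list of dicts with keys 'verb' (token ID), 'sense' (sense ID)
    and 'args' (mapping of argument label to a list of token IDs).
    """
    known_ids = {token_id for token_id, _ in tokens}
    root = ET.Element("annotations")
    for p in predicates:
        if p["verb"] not in known_ids:
            raise KeyError(f"unknown token ID: {p['verb']}")
        pred = ET.SubElement(root, "pred", {"ref": p["verb"], "sense": p["sense"]})
        for label, ids in p["args"].items():
            missing = [i for i in ids if i not in known_ids]
            if missing:
                raise KeyError(f"unknown token IDs: {missing}")
            ET.SubElement(pred, "arg", {"label": label, "refs": " ".join(ids)})
    return root


# Token IDs for "The walls shook" (the IDs themselves are made up).
tokens = [("t1", "The"), ("t2", "walls"), ("t3", "shook")]
predicates = [{"verb": "t3", "sense": "shake.01", "args": {"arg1": ["t1", "t2"]}}]
root = build_standoff(tokens, predicates)
xml_text = ET.tostring(root, encoding="unicode")
```

Because the annotation only holds references, sense tags and argument labels can be revised or layered without editing the TreeBank files themselves.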

Predicate-Argument Labeling: one raid tree – Rosenzweig’s converter

New language/English MT Components
New language
– Morphological Analyzer (POS tags)
– Parser/Generator
– TreeBank
– Companion pred-arg lexicon
English
– POS tagger
– Parser/Generator
– TreeBank
– Companion pred-arg lexicon
Transfer Lexicon

Korean/English MT
Chunghye Han, Juntae Yoon, Meesook Kim, Eonsuk Ko (CoGenTex/Penn/Systran: ARL)
Parallel TreeBanks for Korean/English enable
– Training of domain-specific Korean parsers: Collins parser and SuperTagger (also English)
– Alignment of Korean/English structures: attempt automatic and semi-automatic testing and generation of transfer lexicon (with CoGenTex); apply statistical MT techniques
Lexical semantics (Systran, mapped to EuroWordNet-IL) should improve
– Accuracy of parsers
– Recovery of dropped arguments

Additional Korean/English parallel data?
Current parallel corpus not public domain
Can use tools trained on this corpus to quickly annotate additional corpora
– Translate sections of Penn TreeBank into Korean?
– Use existing Korean newswire text – translate into English?
– Both?

Example translation

Transfer lexicon entries: Mapping predicate argument structures across languages
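
A transfer lexicon entry of this kind can be pictured as a source predicate, a target predicate, and a mapping between their argument labels. The sketch below is a hypothetical data structure, not the format used in the project; the predicate names and the swapped argument mapping are invented to show that labels need not line up one to one across languages.

```python
from dataclasses import dataclass


@dataclass
class TransferEntry:
    source_pred: str   # predicate in the source language
    target_pred: str   # corresponding predicate in the target language
    arg_map: dict      # source argument label -> target argument label


def transfer(structure, lexicon):
    """Map a source predicate-argument structure to the target language.

    structure: {'pred': str, 'args': {label: filler}}
    lexicon: dict from source predicate to its TransferEntry
    """
    entry = lexicon[structure["pred"]]
    target_args = {}
    for label, filler in structure["args"].items():
        if label not in entry.arg_map:
            raise KeyError(f"no mapping for argument {label!r}")
        target_args[entry.arg_map[label]] = filler
    return {"pred": entry.target_pred, "args": target_args}


# Hypothetical entry whose arguments swap roles in the target language.
lexicon = {
    "src_verb": TransferEntry("src_verb", "tgt_verb", {"arg0": "arg1", "arg1": "arg0"}),
}
source = {"pred": "src_verb", "args": {"arg0": "X", "arg1": "Y"}}
target = transfer(source, lexicon)
```

The fillers are passed through unchanged here; in a full system they would be transferred recursively with their own lexicon entries.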

Chinese TreeBank – DOD
Fei Xia, Nianwen Xue, Fu-dong Chiou
Workshop of interested members of Chinese community, June ’98
Guidelines and sample files posted on web
– Segmentation, March ’99
– POS tagging, March ’99
– Bracketing, first pass, October ’99
– Bracketing, second pass, May ’00
95%+ inter-annotator consistency
Release of 100K annotated data, July ’00
Follow-up workshop, Hong Kong, ACL ’00

Goal for Chinese
Parallel, annotated corpora – Hong Kong news?
Parse English with WSJ-trained parsers, correct
Extend English TreeBank lexicon as needed
Parse Chinese with CTB-trained parsers, correct
Start with lexicon extracted from CTB, extend
Experiment with using semi-automated techniques wherever possible to speed up process

Past results
XTAG project
Penn TreeBank
Enabled the development of tools: POS taggers, parsers, co-reference, etc.