Multilingual PennTools that capture parses and predicate-argument structures, for use in Applications Martha Palmer, Aravind Joshi, Mitch Marcus, Mark.

Presentation transcript:

Multilingual PennTools that capture parses and predicate-argument structures, for use in Applications
Martha Palmer, Aravind Joshi, Mitch Marcus, Mark Liberman, Fernando Pereira
University of Pennsylvania
March 25, 2003
TIDES SITE VISIT

Translation Issues: Chinese to English
- Word order
- Dropped arguments
- Lexical ambiguities
- Structure vs morphology
CH: ta zai wen-jian shang qian-zi
EN: he signed the document

Abstracting away from surface structure
sign: NP1 [case:nom], NP2 [case:acc]
qian-zi: NP1, NP2 [prep:zai]
CH: ta zai wen-jian shang qian-zi
EN: he signed the document
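
The idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PennTools implementation: the `PredArg` class, the two frame tables, the `abstract` helper, the `"unmarked"` label for the Chinese NP1 slot, and the shared relation label `"sign"` are all hypothetical names introduced here. It shows two language-specific surface frames (case marking in English, the preposition zai in Chinese) filling the same NP1/NP2 slots of one relation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredArg:
    """Language-neutral predicate-argument structure (hypothetical type)."""
    relation: str
    np1: str
    np2: str


# Hypothetical frame tables: surface marker -> argument slot.
# The English frame uses case marking, the Chinese frame uses the preposition zai.
ENGLISH_SIGN = {"relation": "sign", "slots": {"case:nom": "np1", "case:acc": "np2"}}
CHINESE_QIAN_ZI = {"relation": "sign", "slots": {"unmarked": "np1", "prep:zai": "np2"}}


def abstract(frame, constituents):
    """Map (surface_marker, text) pairs onto the frame's argument slots."""
    filled = {frame["slots"][marker]: text for marker, text in constituents}
    return PredArg(frame["relation"], filled["np1"], filled["np2"])


# EN: he signed the document
english = abstract(ENGLISH_SIGN, [("case:nom", "he"), ("case:acc", "the document")])
# CH: ta zai wen-jian shang qian-zi
chinese = abstract(CHINESE_QIAN_ZI, [("unmarked", "ta"), ("prep:zai", "wen-jian")])

print(english)
print(chinese)
```

Both sentences end up with the same relation and the same slot assignment, even though word order and marking differ on the surface.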

Common Thread
Predicate-argument structure
– Basic constituents of the sentence and how they are related to each other
Constituents
– he, the document
Relations
– sign

Penn approach
Annotation + machine learning = IP tools
George Washington signed the Constitution.
PERSON COMMUNICATION
PROPER NOUN VERB DET PROPER NOUN
[ [NP1 ] [ [ ] NP2 ]]
Arg0-Agent REL Arg1-Theme
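
One way to picture these stacked annotations is as independent layers of labeled spans over the same tokens. The sketch below is an assumption-laden illustration, not the Penn annotation format: the `layers` dictionary, the `text_of` and `labels_at` helpers, and the exact span boundaries (for example, NP2 and Arg1-Theme covering "the Constitution") are hypothetical. The PERSON / COMMUNICATION layer is left out because its alignment to the words is not recoverable from the slide text.

```python
# "George Washington" is treated as one unit, matching the four part-of-speech labels.
tokens = ["George Washington", "signed", "the", "Constitution"]

# Each layer is a list of (start, end, label) spans over token indices, end exclusive.
# Span boundaries are assumptions made for illustration.
layers = {
    "pos": [(0, 1, "PROPER NOUN"), (1, 2, "VERB"), (2, 3, "DET"), (3, 4, "PROPER NOUN")],
    "chunk": [(0, 1, "NP1"), (2, 4, "NP2")],
    "role": [(0, 1, "Arg0-Agent"), (1, 2, "REL"), (2, 4, "Arg1-Theme")],
}


def text_of(span):
    """Return the words covered by a (start, end, label) span."""
    start, end, _label = span
    return " ".join(tokens[start:end])


def labels_at(index):
    """Collect the label that each layer assigns to the token at `index`."""
    return {
        name: label
        for name, spans in layers.items()
        for start, end, label in spans
        if start <= index < end
    }


for span in layers["role"]:
    print(span[2], "->", text_of(span))
```

Keeping the layers separate means a chunker, a sense tagger and an argument tagger can each be trained and replaced on their own, as long as they agree on the token indices.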

Predicate-argument structure
George Washington signed the Constitution.
sign
Agent: George W.
Theme: Constitution
NP1 [case:nom] NP2 [case:acc]

Outline for Today
Introduction
Overview
– Objectives, Chinese TreeBank, Agenda
PennTools: Training of Individual Components
– noun phrase chunkers, parsers, word sense taggers, semantic argument taggers
– Training with labeled and unlabeled data
  Active learning (annotation tools)
  Unsupervised learning
  Combining labeled and unlabeled data
Information Extraction
Machine Translation

Objectives – Resources: TreeBanks
Fu-dong Chiou, Tsan Kauang Lee, Chingyi Chia, Meiyu Chang
Prior releases
– Chinese TreeBanks 1.0 and 2.0 (100K and revisions)
– Korean/English Parallel TreeBanks
Recent releases
– Chinese TreeBank 3.0 (250K)
– Chinese TreeBank 2.0 and English translation as parallel corpora
Future releases
– Chinese TreeBank 4.0 (400K, Dec, ‘03), 5.0 (500K, ‘04)
– CTB English Translation Treebank 1.0

Sighan’03, Sapporo, Japan
Second SIGHAN Workshop on Chinese Language Processing, ACL’03, Sapporo, Japan
AND THE
First International Chinese Word Segmentation Bakeoff
Four sources for training and test corpora:
– The Academia Sinica (Taiwan) Treebank: Taiwan Big Five encoding
– The Beijing University Institute of Computational Linguistics Corpus: GB encoding
– The Penn Chinese Treebank: GB encoding
– Hong Kong City University corpus: HK Big Five encoding

Summary of Chinese TreeBanks
Resource                            Genre                            Data, Cost      Completion Date
Chinese Treebank 1.0                Xinhua Newswire                  100K            June, ‘00
Chinese Treebank 2.0                Xinhua Newswire                  100K, $270K     Dec, ‘00
Proposed Chinese TreeBank Release   Xinhua Newswire                  250K, $100K     Feb, 03
Chinese TreeBank 3.0 (+CTB 2.0)     Xinhua Newswire                  150K, $70       March, ’03*
Chinese TreeBank 4.0                Sinorama (Taiwanese Magazine)    100K, $80K**    July, ‘03
* Delay caused by poor quality of English Translation.
** Increased cost due to difficulty w/ automatic parsing of new genre.

Parallel TreeBanks
Lessons learned
– good quality translation is slow, expensive and hard to come by
– switching genres (Xinhua to Sinorama) can really slow down treebanking
– Start with good quality parallel corpora, similar genre if possible – AFP

Parallel TreeBanks
To Do
– Finish double pass of Sinorama (100K + additional 50K, Oct, ‘03)
– AFP – 100K words, Summer, ‘04
– English treebanking, first 100K, and then?

Richer CTB Annotations
Coreference Tagging (Susan Converse)
– Guidelines presented at Sighan’02, Coling-02, Taiwan
– 100K words tagged, double annotated, adjudication is ongoing, additional tagging
– Two preliminary tools for recovering dropped arguments under development
  Hobbs algorithm modified for Chinese
  MaxEnt system

Summary of Resources
Resource                            Genre                            Data, Cost      Completion Date
Chinese Treebank 4.0                Sinorama (Taiwanese Magazine)    150K            Oct, 03
Chinese Treebank 5.0                AFP                              100K            2004
CTB English Translation TreeBank    Translation of Xinhua Newswire   100K, $70K      Aug, 03
Chinese/English Parallel TreeBank   Chinese/English Sinorama         150K            ??
                                    Chinese/English AFP              100K
English PropBank                    Financial subcorpus, WSJ         300K            June ‘02
                                    Penn TreeBank II, WSJ            1M, $625K       Dec ‘03
Chinese PropBank                    Xinhua Newswire                  250K, $500K     Summer, ‘04

Resource Development
Chinese PropBank – Nianwen Xue
English PropBank – Olga Babko-Malaya

Objectives (cont)
PennTools ($200K) – faster training of multilingual components with less annotation
– Noun phrase chunking with SuperTags (Libin Shen)
– Parsing in Multiple Languages (Dan Bikel)
– (Unsupervised) Coarse-grained Word Sense Disambiguation (Jinying Chen)
– Automatic Predicate Argument Tagging (using labeled and unlabeled data) (Szuting Yi)

Objectives, (cont.)
Applications: Putting it all together
– Semantic Relations for Passage Retrieval (Tom Morton)
– Information Extraction – ACE
  Participated in ’02 English Entity and Relation evaluation
  Future directions of ACE (Seth Kulick and Edward Loper)
  Recent Improvements in English Named Entity Tagging (Ryan McDonald)
  Preliminary work on Chinese (Yuan Ding, John Blitzer)
– Machine Translation
  Flexible Tree-to-string Alignment (Dan Gildea)
  Johns Hopkins Summer Workshop plans