Multilingual PennTools that capture parses and predicate-argument structures, for use in Applications. Martha Palmer, Aravind Joshi, Mitch Marcus, Mark Liberman, Fernando Pereira. University of Pennsylvania. March 25, 2003. TIDES SITE VISIT
Translation Issues: Chinese to English - Word order - Dropped arguments - Lexical ambiguities - Structure vs morphology CH: ta zai wen-jian shang qian-zi EN: he signed the document
Abstracting away from surface structure. sign: NP1 [case:nom], NP2 [case:acc]. qian-zi: NP1, NP2 [prep:zai]. CH: ta zai wen-jian shang qian-zi EN: he signed the document
Common Thread: Predicate-argument structure – Basic constituents of the sentence and how they are related to each other. Constituents – he, the document. Relations – sign
Penn approach: Annotation + machine learning = IP tools. George Washington signed the Constitution. PERSON COMMUNICATION. PROPER NOUN, VERB, DET, PROPER NOUN. [ [NP1 ] [ [ ] NP2 ]]. Arg0-Agent, REL, Arg1-Theme
Predicate-argument structure George Washington signed the Constitution. sign Agent: George W. Theme: Constitution NP1[case:nom] NP2[case:acc]
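The mapping from language-specific surface slots to one shared predicate-argument structure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `PredArg`, `FRAMES`, and `to_pred_arg` are hypothetical and are not part of PennTools; the slot labels are taken from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredArg:
    """Language-neutral predicate-argument structure (hypothetical type)."""
    rel: str   # relation, e.g. "sign"
    arg0: str  # Arg0 (agent)
    arg1: str  # Arg1 (theme)


# Hypothetical per-verb frames: which surface slot fills which argument.
# The slot labels follow the slides; the table itself is an assumption.
FRAMES = {
    "sign":    {"rel": "sign", "arg0": "NP1[case:nom]", "arg1": "NP2[case:acc]"},
    "qian-zi": {"rel": "sign", "arg0": "NP1",           "arg1": "NP2[prep:zai]"},
}


def to_pred_arg(verb, slots):
    """Map the surface slots of one clause onto a PredArg via the verb's frame."""
    frame = FRAMES[verb]
    return PredArg(rel=frame["rel"],
                   arg0=slots[frame["arg0"]],
                   arg1=slots[frame["arg1"]])


english = to_pred_arg("sign", {"NP1[case:nom]": "he",
                               "NP2[case:acc]": "the document"})
chinese = to_pred_arg("qian-zi", {"NP1": "ta",
                                  "NP2[prep:zai]": "wen-jian"})
```

Both clauses end up with the same relation and the same argument roles, even though English marks the theme by case and position while the Chinese example marks it with the preposition zai.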
Outline for Today Introduction Overview –Objectives, Chinese TreeBank, Agenda PennTools: Training of Individual Components –noun phrase chunkers, parsers, word sense taggers, semantic argument taggers –Training with labeled and unlabeled data Active learning (annotation tools) Unsupervised learning Combining labeled and unlabeled data Information Extraction Machine Translation
Objectives – Resources: TreeBanks Fu-dong Chiou, Tsan Kauang Lee, Chingyi Chia, Meiyu Chang Prior releases –Chinese TreeBanks 1.0 and 2.0 (100K and revisions) –Korean/English Parallel TreeBanks Recent releases –Chinese TreeBank 3.0 (250K) –Chinese TreeBank 2.0 and English translation as parallel corpora Future releases –Chinese TreeBank 4.0 (400K, Dec, ‘03), 5.0 (500K, ‘04) –CTB English Translation Treebank 1.0
SIGHAN’03, Sapporo, Japan: Second SIGHAN Workshop on Chinese Language Processing, ACL’03, Sapporo, Japan, and the First International Chinese Word Segmentation Bakeoff. Four sources for training and test corpora: the Academia Sinica (Taiwan) Treebank, Taiwan Big Five encoding; the Beijing University Institute of Computational Linguistics Corpus, GB encoding; the Penn Chinese Treebank, GB encoding; the Hong Kong City University corpus, HK Big Five encoding
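Because the four corpora use different encodings, a tool that trains on more than one of them would first have to normalize the text to Unicode. A minimal Python sketch follows. The mapping from each corpus to a specific Python codec name is an assumption: the slide says only "Big Five" or "GB", and the real files may need a different variant such as `gbk`.

```python
# Assumed codec for each bakeoff corpus; the slide names only the encoding family.
CORPUS_CODECS = {
    "Academia Sinica": "big5",
    "Beijing University": "gb2312",
    "Penn Chinese Treebank": "gb2312",
    "Hong Kong City University": "big5hkscs",
}


def to_unicode(raw: bytes, corpus: str) -> str:
    """Decode raw corpus bytes into a Unicode string using the assumed codec."""
    return raw.decode(CORPUS_CODECS[corpus])


# "zhong wen" (Chinese language), written with escapes to keep the source ASCII.
sample = "\u4e2d\u6587"

# The same two characters have different byte sequences in Big5 and GB.
big5_bytes = sample.encode("big5")
gb_bytes = sample.encode("gb2312")

decoded = {corpus: to_unicode(sample.encode(codec), corpus)
           for corpus, codec in CORPUS_CODECS.items()}
```

After decoding, text from all four sources can be compared and segmented with the same code, regardless of the original byte encoding.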
Summary of Chinese TreeBanks
Resource | Genre | Data, Cost | Completion Date
Chinese Treebank 1.0 | Xinhua Newswire | 100K | June, ‘00
Chinese Treebank 2.0 | Xinhua Newswire | 100K, $270K | Dec, ‘00
Proposed Chinese TreeBank Release | Xinhua Newswire | 250K, $100K | Feb, 03
Chinese TreeBank 3.0 (+CTB 2.0) | Xinhua Newswire | 150K, $70 | March, ’03*
Chinese TreeBank 4.0 | Sinorama (Taiwanese Magazine) | 100K, $80K** | July, ‘03
* Delay caused by poor quality of English Translation.
** Increased cost due to difficulty w/ automatic parsing of new genre.
Parallel TreeBanks Lessons learned –good quality translation is slow, expensive and hard to come by –switching genres (Xinhua to Sinorama) can really slow down treebanking –Start with good quality parallel corpora, similar genre if possible – AFP
Parallel TreeBanks To Do –Finish double pass of Sinorama (100K + additional 50K, Oct, ‘03) –AFP – 100K words, Summer, ‘04 –English treebanking, first 100K, and then?
Richer CTB Annotations: Coreference Tagging (Susan Converse) –Guidelines presented at Sighan’02, Coling-02, Taiwan –100K words tagged, double annotated, adjudication is ongoing, additional tagging –Two preliminary tools for recovering dropped arguments under development: Hobbs algorithm modified for Chinese; MaxEnt system
Summary of Resources
Resource | Genre | Data, Cost | Completion Date
Chinese Treebank 4.0 | Sinorama (Taiwanese Magazine) | 150K | Oct, 03
Chinese Treebank 5.0 | AFP | 100K | 2004
CTB English Translation TreeBank | Translation of Xinhua Newswire | 100K, $70K | Aug, 03
Chinese/English Parallel TreeBank | Chinese/English Sinorama; Chinese/English AFP | 150K; 100K | ??
English PropBank | Financial subcorpus, WSJ; Penn TreeBank II, WSJ | 300K; 1M, $625K | June ‘02; Dec ‘03
Chinese PropBank | Xinhua Newswire | 250K, $500K | Summer, ‘04
Resource Development Chinese PropBank – Nianwen Xue English PropBank – Olga Babko-Malaya
Objectives (cont) PennTools ($200K) – faster training of multilingual components with less annotation –Noun phrase chunking with SuperTags (Libin Shen) –Parsing in Multiple Languages (Dan Bikel) –(Unsupervised) Coarse-grained Word Sense Disambiguation (Jinying Chen) –Automatic Predicate Argument Tagging (using labeled and unlabeled data) (Szuting Yi)
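To make the output of a noun phrase chunker concrete, here is a small Python sketch that turns per-token chunk tags into noun phrase strings. It assumes the common B-NP / I-NP / O tag encoding, which the slides do not mention; it does not implement SuperTag-based chunking, and the function name `np_chunks` is hypothetical.

```python
def np_chunks(tokens, tags):
    """Collect noun phrases from tokens labeled with B-NP / I-NP / O tags."""
    chunks, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-NP":
            # A new chunk starts; close any chunk that is still open.
            if current:
                chunks.append(" ".join(current))
            current = [token]
        elif tag == "I-NP" and current:
            # Continue the open chunk.
            current.append(token)
        else:
            # "O", or an I-NP with no open chunk (dropped): close the open chunk.
            if current:
                chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


tokens = ["George", "Washington", "signed", "the", "Constitution", "."]
tags = ["B-NP", "I-NP", "O", "B-NP", "I-NP", "O"]
phrases = np_chunks(tokens, tags)
```

On the example sentence used earlier in the deck, the two recovered chunks correspond to NP1 and NP2, the constituents that later receive the Arg0 and Arg1 labels.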
Objectives, (cont.) Applications: Putting it all together Semantic Relations for Passage Retrieval (Tom Morton) –Information Extraction – ACE Participated in ’02 English Entity and Relation evaluation Future directions of ACE (Seth Kulick and Edward Loper) Recent Improvements in English Named Entity Tagging (Ryan McDonald) Preliminary work on Chinese (Yuan Ding, John Blitzer) –Machine Translation Flexible Tree-to-string Alignment (Dan Gildea) Johns Hopkins Summer Workshop plans