Parsing in Multiple Languages


Parsing in Multiple Languages
Dan Bikel, University of Pennsylvania
August 22nd, 2003

Problems with many parser designs
- Many parsing models are largely language-independent, but their implementations lack a layer of abstraction for language-specific information
- Most generative parsers offer no easy way to experiment with different parameterizations and back-off levels (a minimal sketch of a pluggable back-off chain follows below)
- Most parsers need a large database of smoothed probability estimates, but this database is often tightly coupled with the decoder
- Parallel and/or distributed computing cannot easily be exploited
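
The back-off issue is easiest to see in code. Below is a minimal, hypothetical sketch of how a back-off chain could be declared separately from the decoder, so that trying a new parameterization only means supplying a different implementation of one small interface; the names BackoffLevels, SmoothedModel, and estimate, and the simple count-based interpolation weights, are illustrative assumptions, not the engine's actual API.

```java
// Hypothetical sketch of a pluggable back-off parameterization; names and
// signatures are illustrative, not the parsing engine's actual classes.
import java.util.Map;

/** Declares the back-off chain for one model component (e.g., head generation). */
interface BackoffLevels {
    int numLevels();                                    // most- to least-specific
    String historyAt(String[] fullContext, int level);  // reduced context key at a level
}

/** Deleted-interpolation-style smoothing over the declared levels. */
class SmoothedModel {
    private final BackoffLevels levels;
    private final Map<String, Double> historyCounts;    // history -> count
    private final Map<String, Double> jointCounts;      // history + " " + future -> count

    SmoothedModel(BackoffLevels levels,
                  Map<String, Double> historyCounts,
                  Map<String, Double> jointCounts) {
        this.levels = levels;
        this.historyCounts = historyCounts;
        this.jointCounts = jointCounts;
    }

    /** p(future | context), backing off from specific to general histories. */
    double estimate(String future, String[] fullContext) {
        double p = 0.0;
        double remainingMass = 1.0;
        for (int i = 0; i < levels.numLevels(); i++) {
            String h = levels.historyAt(fullContext, i);
            double c = historyCounts.getOrDefault(h, 0.0);
            double cJoint = jointCounts.getOrDefault(h + " " + future, 0.0);
            double ml = c > 0.0 ? cJoint / c : 0.0;     // maximum-likelihood estimate
            double lambda = c / (c + 5.0);              // placeholder confidence weight
            p += remainingMass * lambda * ml;
            remainingMass *= (1.0 - lambda);
        }
        return p;
    }
}
```

Swapping in a different BackoffLevels implementation changes the parameterization without touching the decoder or the probability database.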

Parsing Architecture for Speed and Flexibility
[Architecture diagram: a Switchboard (object server) coordinates N DecoderServers, each holding a ModelCollection built from a Language package, and the CKY clients (1..N) that they serve.]

Architecture for Parsing II
- Highly parallel, multi-threaded
- Can take advantage of, e.g., clustered computing environment
- Fully fault-tolerant
- Significant flexibility: layers of abstraction
- Optimized for speed
- Highly portable for new domains, including new languages
(the remote interfaces implied by the diagram above are sketched below)
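
Under the stated assumptions (the component names from the diagram, plus a Java RMI-style remote-object design, which the slides do not actually specify), the distributed pieces might look like the sketch below; every interface and method name here (Switchboard, DecoderServer, registerClient, nextSentenceBatch, returnParses, logProb) is hypothetical.

```java
// Hypothetical RMI-style sketch of the components named in the architecture
// diagram; interface and method names are assumptions, not the engine's code.
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.List;

/** Central object server: hands out work and tracks clients and servers. */
interface Switchboard extends Remote {
    DecoderServer registerClient(String clientId) throws RemoteException;    // assign a decoder server
    List<String> nextSentenceBatch(String clientId) throws RemoteException;  // sentences to parse
    void returnParses(String clientId, List<String> parses) throws RemoteException;
}

/** Serves smoothed probability estimates from its ModelCollection to CKY clients. */
interface DecoderServer extends Remote {
    double logProb(String modelName, String event) throws RemoteException;
}
```

On this reading, fault tolerance falls out of the switchboard being the only coordinator: if a CKY client or decoder server disappears, its unfinished sentences can simply be handed to another client.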

Layer of Abstraction: Language Package
Four easy-to-build components (sketched as interfaces below):
- Treebank: encapsulates linguistic, treebank-specific information
  - isConjunction(tag), isConjunction(word)
  - isNP(nonterminal)
  - isSentence(nonterminal)
  - isVerb(preterminal subtree)
- Training: preprocessing, tree augmentations
- Head finder: reads in a small head-rules file (35 lines for English)
- Word features
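
A rough sketch of what the four components might look like as Java interfaces follows; the predicate names come from the list above, but the exact signatures (and the small Tree and Word types) are assumptions made for illustration, not the engine's published API.

```java
// Hypothetical sketch of the four language-package components; the predicate
// names follow the slide, but the signatures are illustrative only.
import java.io.Reader;
import java.util.List;

/** Minimal tree and word types assumed by the interfaces below. */
interface Tree {
    String label();
    List<Tree> children();
}

interface Word {
    String text();
    String tag();
}

/** Encapsulates linguistic, treebank-specific information. */
interface Treebank {
    boolean isConjunction(String tag);
    boolean isConjunction(Word word);
    boolean isNP(String nonterminal);
    boolean isSentence(String nonterminal);
    boolean isVerb(Tree preterminalSubtree);
}

/** Language-specific preprocessing and tree augmentations applied before training. */
interface Training {
    Tree preprocess(Tree tree);
}

/** Finds the head child of a constituent, driven by a small head-rules file. */
interface HeadFinder {
    void readHeadRules(Reader headRulesFile);   // e.g., roughly 35 lines for English
    int headChildIndex(String parentLabel, List<String> childLabels);
}

/** Maps rare or unknown words to feature signatures (capitalization, digits, affixes, ...). */
interface WordFeatures {
    String signature(String word, boolean sentenceInitial);
}
```

Porting to a new language then amounts to implementing these four small pieces plus writing the head-rules file.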

Tree augmentation
- Parsers operate over augmented tree space, T+
- Examples of tree augmentations
  - Head lexicalization: NP → NP(thing,NN) (a small sketch follows below)
  - Argument identification: NP → NP-A
- Chiang & Bikel (2002) provided treep
  - New, portable syntax for augmenting tree nodes
  - Method for reestimating parser models in the augmented space such that P(S,T) is maximized
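
As a concrete illustration of the first augmentation, head lexicalization, here is a small self-contained sketch that rewrites each nonterminal label as label(headWord,headTag) bottom-up. The Node class and the rightmost-child head rule are simplifying assumptions for illustration; this is not the treep program or the engine's real tree machinery, where the head choice would come from the language package's head finder.

```java
// Hypothetical head-lexicalization pass (e.g., NP becomes NP(thing,NN));
// the Node class and the rightmost-child head rule are illustrative only.
import java.util.ArrayList;
import java.util.List;

class Node {
    String label;                          // e.g., "NP", or "NN" for a preterminal
    String word;                           // non-null only for preterminals
    List<Node> children = new ArrayList<>();

    Node(String label, String word) { this.label = label; this.word = word; }
    Node(String label, List<Node> children) { this.label = label; this.children = children; }
}

class HeadLexicalizer {
    /** Augments every nonterminal label in place; returns {headWord, headTag}.
        Assumes every nonterminal has at least one child. */
    static String[] lexicalize(Node n) {
        if (n.word != null) {                         // preterminal: already word/tag
            return new String[] { n.word, n.label };
        }
        String[] head = null;
        for (Node child : n.children) {               // recurse so every node is augmented
            head = lexicalize(child);                 // rightmost child wins as head
        }
        n.label = n.label + "(" + head[0] + "," + head[1] + ")";
        return head;
    }
}
```

For example, running lexicalize on the root of (NP (DT the) (NN thing)) rewrites the NP label to NP(thing,NN), matching the example above; the engine would additionally re-estimate its models in the augmented space, as the slide notes.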

Rapid Portability to New Languages and New Data Sets
- Bikel & Chiang (2000) described porting two parsing models developed for English to Chinese
  - BBN: LR 69.0, LP 74.8 (≤ 40 words)
  - Chiang: LR 76.8, LP 77.8 (≤ 40 words)
- Original design goal for parsing engine: develop new language packages in 1–2 weeks
- Developed Chinese language package for new engine in one and a half days
- Results with new parsing engine using CTB (100k words): LR 77.0, LP 81.6 (≤ 40 words, hand segmented)
- Latest results with new engine using CTB 3.0 (250k words): LR 78.8, LP 82.4 (≤ 40 words, hand segmented)

Tools in Use
- English parser
  - Gets state-of-the-art results
  - Training data: Penn Treebank, §§02-21, ~1 million words
  - Results: LR 89.90, LP 90.15 on §00 (≤40 words)
  - Fast: trains in minutes; can use a cluster to parse an entire WSJ section in about 5 minutes
- Chinese parser
  - Dan Jurafsky and one of his grad students, Honglin Sun, at the University of Colorado (predicate-argument labeling for IE)
  - Kevin Knight and Philipp Koehn at ISI (MT)
- In-house users
  - Dan Gildea (MT)
  - Yuan Ding (MT)
  - Chinese Treebank annotators (bootstrapping treebank annotation)

Tools in Use II
- Arabic parser
  - Once representation/format issues were ironed out, porting to Arabic also proceeded rapidly
  - Encouraging preliminary results
    - Training data: 150k words
    - LR 75.6, LP 77.4 (≤40 words, mapped gold-standard tags)
  - Currently being used by Rebecca Hwa at the University of Maryland (MT)
- We believe
  - Much room for improvement after analysis of initial results, but…
  - performance is already more than good enough for bootstrapping treebank annotation

Tools in Use III
- Classical Portuguese parser
  - Rapid development of the language package (with the help of Tony Kroch)
  - Successfully used for bootstrapping of the treebank for the Statistical Physics, Pattern Recognition and Language Change project at the State University of Campinas, São Paulo, Brazil

Obtaining Tools and Data
- For my parsing engine (English, Arabic, Chinese and, soon, Korean), please contact me (dbikel@cis.upenn.edu)
- For Penn's Chinese segmenter, contact Bert Xue (xueniwen@linc.cis.upenn.edu)
- Data: Treebanks and PropBanks (mpalmer@cis.upenn.edu)

Future Work
- Develop Java version of David Chiang's and my tree augmentation program, treep, and incorporate into parsing engine
- Develop user-level documentation for developing a new language package (in progress)
- Provide layer of abstraction for language-specific lexical resources, such as semantic hierarchy (in progress)
- Explore language-independent parsing model changes, to find those that yield positive effect across all languages
- Employ the layers of abstraction for exploring parameter space
- Perform “regression tests” across treebanks/languages

fin