Overview of the Hindi-Urdu Treebank Fei Xia University of Washington 7/23/2011.

Presentation transcript:

Overview of the Hindi-Urdu Treebank Fei Xia University of Washington 7/23/2011

(Syntactic) Treebank
Sentences annotated with syntactic structure (dependency structure or phrase structure)
– 1960s: Brown Corpus
– Early 1990s: The English Penn Treebank
– Late 1990s: Prague Dependency Treebank
– 1990s to now: Arabic, Chinese, Dutch, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Italian, Japanese, Korean, Latin, Norwegian, Polish, Spanish, Turkish, etc.

PS and DS
Example: John loves Mary.
Phrase structure (PS): (S (NP John/NNP) (VP loves/VBP (NP Mary/NNP)) ./.)
Dependency structure (DS): loves/VBP is the root; its dependents are John/NNP, Mary/NNP, and ./.
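As a rough illustration of the two representations, the sketch below encodes the slide's example sentence both ways. The bracketed string is a reconstruction of the slide's tree, and the helper names (`parse_brackets`, `leaves`) and the (dependent, head) pair encoding are assumptions made for this example, not part of the treebank's file format.

```python
from typing import List, Union

Tree = Union[str, list]  # a leaf "word/TAG" or [label, child, ...]


def parse_brackets(text: str) -> Tree:
    """Parse a Penn-style bracketed string into nested lists."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Tree:
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token != "(":
            return token  # leaf such as "John/NNP"
        node = [tokens[pos]]  # phrase label such as "S"
        pos += 1
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # skip the closing ")"
        return node

    return read()


def leaves(tree: Tree) -> List[str]:
    """Return the word/TAG leaves of a phrase-structure tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree[1:] for leaf in leaves(child)]


# Phrase structure: nested constituents.
PS = parse_brackets("(S (NP John/NNP) (VP loves/VBP (NP Mary/NNP)) ./.)")

# Dependency structure: (dependent, head) pairs; the root has head None.
DS = [
    ("John/NNP", "loves/VBP"),
    ("loves/VBP", None),
    ("Mary/NNP", "loves/VBP"),
    ("./.", "loves/VBP"),
]

print(PS)
print(leaves(PS))
```

Both encodings cover the same four tokens; they differ only in whether the relations hold between phrases (PS) or directly between words (DS).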

Proposition Bank (PropBank)
Sentences annotated with predicate-argument structure
Ex: John loves Mary
– “loves” is the predicate
– “John” is Arg0 (“Agent”)
– “Mary” is Arg1 (“Theme”)
2000s: The English PropBank, followed by the PropBanks for Chinese, Arabic, Hindi/Urdu, etc.
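The predicate-argument annotation in this example can be pictured as a small record. The sketch below is only an illustration: the `Proposition` class and the `render` helper are hypothetical names, and real PropBank annotations point at tree nodes rather than holding plain strings.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Proposition:
    predicate: str
    args: Dict[str, str]  # numbered argument label -> text span


def render(p: Proposition) -> str:
    """Format a proposition as predicate(Arg0=..., Arg1=...)."""
    parts = [f"{label}={span}" for label, span in sorted(p.args.items())]
    return f"{p.predicate}({', '.join(parts)})"


# "John loves Mary": John is Arg0, Mary is Arg1.
prop = Proposition(predicate="loves", args={"Arg0": "John", "Arg1": "Mary"})
print(render(prop))
```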

Why do we need treebanks?
Computational linguistics:
– To build and evaluate NLP tools (e.g., word segmenters, part-of-speech taggers, parsers, semantic role labelers)
– This has led to significant progress in the CL field
Theoretical linguistics:
– Annotation guidelines are like a grammar book, with more detail and coverage
– As a discovery tool
– One can test linguistic theories and collect statistics by searching treebanks.

The Hindi-Urdu Treebank (HUTB)
Traditional approach:
– Syntactic treebank: PS or DS, but not both
– Layers are added one by one
Our approach:
– Syntactic treebank: both DS and PS
– DS, PS, and PB are developed at the same time
– Automatic conversion from DS+PB to PS
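To give a feel for what a DS-to-PS conversion has to do, here is a deliberately naive sketch that projects each head word and its dependents into one flat phrase. It is not the HUTB conversion procedure, which also draws on PropBank information and is not described on this slide; the function name `project`, the placeholder label `XP`, and the index-based head encoding are all assumptions made for the example.

```python
# Tokens in surface order; heads[i] is the index of token i's head,
# or None for the root. Encoding chosen for this example only.
tokens = ["John", "loves", "Mary"]
heads = [1, None, 1]


def project(index, tokens, heads):
    """Build a phrase for tokens[index]: the word itself plus the
    projected phrases of its dependents, kept in surface order."""
    dependents = [i for i, h in enumerate(heads) if h == index]
    if not dependents:
        return tokens[index]  # a word with no dependents stays a bare leaf
    children = sorted(dependents + [index])
    return ["XP"] + [
        tokens[i] if i == index else project(i, tokens, heads)
        for i in children
    ]


root = heads.index(None)
tree = project(root, tokens, heads)
print(tree)
```

A flat projection like this yields no VP node, whereas the phrase structure for the same sentence groups the verb and its object together. Deciding where such intermediate nodes belong is the kind of choice a real conversion has to make explicitly.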

Motivation 1: Two Representations
Both phrase-structure treebanks and dependency treebanks are used in NLP
– Collins/Charniak/Bikel parsers for PS
– CoNLL task on dependency parsing
Problem: currently few (no?) treebanks have PS and DS that are both independently motivated
→ Our project: build a treebank for Hindi/Urdu in which PS and DS are linguistically motivated from the outset
– Dependency: Paninian grammar (Panini, 400 BC)
– Phrase structure: variant of Minimalism (Chomsky 1995)

Motivation 2: Two Content Levels
Everyone (?) wants syntax
Recent popularity of PropBank (Palmer et al 2002): lexical predicate-argument structure; “semantics as surfacy as it gets”
Recent experience: PropBank may inform some treebanking decisions
→ Build a treebank with all levels from the outset
→ Annotating them together allows us to study the relation between DS, PB, and PS and to reduce annotation time

Goals
Hindi/Urdu Treebank:
– DS, PB, and PS for 400K words of Hindi and 150K words of Urdu
– Unified annotation guidelines
– Frame files for PropBank
Better understanding of the relation between DS, PB, and PS.

Where we are now
Guidelines are almost complete.
Annotation:
– DS annotation: 354K-word Hindi, 60K-word Urdu
– PB annotation: 40K-word Hindi
Automatic conversion from DS + PropBank to PS is in progress.
Preliminary releases in 2009 and 2010

The HUTB team
IIIT, India (DS team): Dipti Sharma, Samar Husain, Rahul Aggarwal, etc.
Univ. of Colorado at Boulder (PB team): Martha Palmer, Bhuvana Narasimhan, Ashwini Vaidya, Archna Bhatia, etc.
UMass (PS team): Rajesh Bhatt, Annahita Farudi
Columbia Univ. (PS team): Owen Rambow
Univ. of Washington (Conversion): Fei Xia, Michael Tepper

Some Sample Structures
Guideline Sentences
– transitive (25)
– causatives (4)
– AP predicate (10)
– clausal extraposition + unaccusative (21)
– participial adjunct (35)
– complex predicate (1)
Corpus Sentences