Published by Opal Marshall. Modified over 9 years ago.
1
Overview of the Hindi-Urdu Treebank
Fei Xia
University of Washington
7/23/2011
2
(Syntactic) Treebank
Sentences annotated with syntactic structure (dependency structure or phrase structure)
1960s: Brown Corpus
Early 1990s: The English Penn Treebank
Late 1990s: Prague Dependency Treebank
1990s – now: Arabic, Chinese, Dutch, Finnish, French, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Italian, Japanese, Korean, Latin, Norwegian, Polish, Spanish, Turkish, etc.
3
PS and DS
Example: John loves Mary.
Phrase structure (PS):
(S (NP John/NNP) (VP loves/VBP (NP Mary/NNP)) ./.)
Dependency structure (DS):
loves/VBP is the root; John/NNP, Mary/NNP, and ./. are its dependents.
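One way to see how the two representations of "John loves Mary." relate is to derive the dependency structure from the phrase structure using head rules. The sketch below is only an illustration and is not the HUTB conversion procedure: the nested-tuple encoding, the `HEAD_CHILD` table, and the function names are assumptions introduced here, and the tags are copied from the slide.

```python
# Phrase structure as nested tuples: (label, child, ...); leaves are "word/TAG" strings.
PS = ("S", ("NP", "John/NNP"), ("VP", "loves/VBP", ("NP", "Mary/NNP")), "./.")

# Hypothetical head rules: for each phrase label, the label of the child that is its head.
HEAD_CHILD = {"S": "VP", "VP": "VBP", "NP": "NNP"}


def label_of(node):
    """Return the phrase label of a subtree, or the tag of a "word/TAG" leaf."""
    return node[0] if isinstance(node, tuple) else node.rsplit("/", 1)[1]


def to_dependencies(node, deps):
    """Return the head word of `node`, appending (dependent, head) pairs to `deps`."""
    if isinstance(node, str):
        return node  # a leaf is its own head
    children = node[1:]
    heads = [to_dependencies(child, deps) for child in children]
    wanted = HEAD_CHILD[node[0]]
    head_index = next(i for i, c in enumerate(children) if label_of(c) == wanted)
    # Every non-head child's head word becomes a dependent of the phrase's head word.
    for i, head_word in enumerate(heads):
        if i != head_index:
            deps.append((head_word, heads[head_index]))
    return heads[head_index]


deps = []
root = to_dependencies(PS, deps)
print("root:", root)
for dependent, head in deps:
    print(f"{dependent} <- {head}")
```

With these toy head rules, the verb comes out as the root and the two nouns and the period attach to it, matching the dependency structure on the slide.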
4
Proposition Bank (PropBank)
Sentences annotated with predicate-argument structure
Ex: John loves Mary
– “loves” is the predicate
– “John” is Arg0 (“Agent”)
– “Mary” is Arg1 (“Theme”)
2000s: The English PropBank, followed by the PropBanks for Chinese, Arabic, Hindi/Urdu, etc.
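The predicate-argument annotation for "John loves Mary" can be pictured as a small record. This is a minimal sketch: the `Proposition` class and its `describe` method are hypothetical names introduced for illustration and do not reflect the actual PropBank file format.

```python
from dataclasses import dataclass, field


@dataclass
class Proposition:
    """One predicate with its labeled arguments (illustrative structure only)."""
    predicate: str
    args: dict = field(default_factory=dict)  # argument label -> text span

    def describe(self):
        parts = [f"{label}={span}" for label, span in sorted(self.args.items())]
        return f"{self.predicate}({', '.join(parts)})"


# The example from the slide: "John" is Arg0 (Agent), "Mary" is Arg1 (Theme).
prop = Proposition("loves", {"Arg0": "John", "Arg1": "Mary"})
print(prop.describe())
```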
5
Why do we need treebanks?
Computational linguistics:
– To build and evaluate NLP tools (e.g., word segmenters, part-of-speech taggers, parsers, semantic role labelers)
– This has led to significant progress in the CL field
Theoretical linguistics:
– Annotation guidelines are like a grammar book, with more detail and coverage
– Treebanks serve as a discovery tool
– One can test linguistic theories and collect statistics by searching treebanks
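As a small illustration of collecting statistics by searching a treebank, the sketch below counts on which side of its head each dependent appears. The one-sentence `TREEBANK`, its tuple layout, and the function name are assumptions made for this example; a real treebank would be read from its own file format.

```python
from collections import Counter

# Toy dependency treebank: each token is (index, word, tag, head_index); head 0 marks the root.
TREEBANK = [
    [(1, "John", "NNP", 2), (2, "loves", "VBP", 0), (3, "Mary", "NNP", 2), (4, ".", ".", 2)],
]


def dependent_directions(treebank):
    """Count (head tag, dependent tag, side) triples over all sentences."""
    counts = Counter()
    for sentence in treebank:
        tags = {idx: tag for idx, _, tag, _ in sentence}
        for idx, _, tag, head in sentence:
            if head == 0:
                continue  # the root has no head to compare against
            side = "left" if idx < head else "right"
            counts[(tags[head], tag, side)] += 1
    return counts


counts = dependent_directions(TREEBANK)
for (head_tag, dep_tag, side), n in sorted(counts.items()):
    print(f"{dep_tag} to the {side} of {head_tag}: {n}")
```

Run over a full treebank, counts like these give a quick empirical check on word-order claims.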
6
The Hindi-Urdu Treebank (HUTB)
Traditional approach:
– Syntactic treebank: PS or DS, but not both
– Layers are added one by one
Our approach:
– Syntactic treebank: both DS and PS
– DS, PS, and PB are developed at the same time
– Automatic conversion from DS+PB to PS
7
Motivation 1: Two Representations
Both phrase-structure treebanks and dependency treebanks are used in NLP
– Collins/Charniak/Bikel parsers for PS
– CoNLL task on dependency parsing
Problem: currently few (if any) treebanks have PS and DS that are each independently motivated
Our project: build a treebank for Hindi/Urdu in which PS and DS are linguistically motivated from the outset
– Dependency: Paninian grammar (Panini 400 BC)
– Phrase structure: variant of Minimalism (Chomsky 1995)
8
Motivation 2: Two Content Levels
Everyone (?) wants syntax
Recent popularity of PropBank (Palmer et al 2002): lexical predicate-argument structure; “semantics as surfacy as it gets”
Recent experience: PropBank may inform some treebanking decisions
Build the treebank with all levels from the outset
Annotating them together allows us to study the relation between DS, PB, and PS, and to reduce annotation time
9
Goals
Hindi/Urdu Treebank:
– DS, PB, and PS for 400K words of Hindi and 150K words of Urdu
– Unified annotation guidelines
– Frame files for PropBank
Better understanding of the relation between DS, PB, and PS
10
Where we are now
Guidelines are almost complete.
Annotation:
– DS annotation: 354K words of Hindi, 60K words of Urdu
– PB annotation: 40K words of Hindi
Automatic conversion from DS + PropBank to PS is in progress.
Preliminary releases in 2009 and 2010
11
The HUTB team
IIIT, India (DS team): Dipti Sharma, Samar Husain, Rahul Aggarwal, etc.
Univ. of Colorado at Boulder (PB team): Martha Palmer, Bhuvana Narasimhan, Ashwini Vaidya, Archna Bhatia, etc.
UMass (PS team): Rajesh Bhatt, Annahita Farudi
Columbia Univ. (PS team): Owen Rambow
Univ. of Washington (Conversion): Fei Xia, Michael Tepper
12
Some Sample Structures
Guideline Sentences
– transitive (25), causatives (4), AP predicate (10), clausal extraposition + unaccusative (21), participial adjunct (35), complex predicate (1)
Corpus Sentences