Intra-Chunk Dependency Annotation : Expanding Hindi Inter-Chunk Annotated Treebank Prudhvi Kosaraju, Bharat Ram Ambati, Samar Husain Dipti Misra Sharma,


Intra-Chunk Dependency Annotation: Expanding Hindi Inter-Chunk Annotated Treebank
Prudhvi Kosaraju, Bharat Ram Ambati, Samar Husain, Dipti Misra Sharma, Rajeev Sangal
Language Technologies Research Centre, IIIT Hyderabad

Treebank
Linguistic resources in which each sentence has
– a parse tree
– morphological, syntactic and lexical information marked explicitly
Some treebanks
– Penn Treebank (Marcus et al., 1993) for English
– Prague Dependency Treebank (Hajicova, 1998) for Czech
For Indian languages
– The lack of such treebanks has been a major bottleneck for advanced research and development of NLP tools and applications

Treebank creation
Annotated manually or semi-automatically
Manual creation
– Annotators have to follow prescribed guidelines
– Costly process in terms of both money and time
Semi-automatic creation
– Running tools or parsers
– Manual correction of errors
Note: An accurate annotating parser/tool saves cost and time for both the annotation and the validation task

Hindi Treebank
Multi-layered and multi-representational treebank having
– Dependency relations
– Verb arguments (PropBank, Palmer et al., 2005)
– Phrase structure
The dependency treebank has information at
– the morpho-syntactic (morphological, part-of-speech (POS) and chunk) level
– the syntactico-semantic (dependency) level

Hindi Dependency Treebank
Manual annotation has been done at
– Part-of-speech level
– Chunk level
– Morph level
– Inter-chunk dependency level

Inter-chunk annotated sentence
Sentence 1: नीली किताब गिर गई
niilii kitaab gir gaii
‘blue’ ‘book’ ‘fall’ ‘go-perf’
‘The blue book fell down’
Figure 1: SSF representation
Figure 2: Inter-chunk dependency tree of Sentence 1

Intra-chunk dependencies
Intra-chunk dependencies were left unannotated since
– Identification of intra-chunk dependencies is quite deterministic
– They can be annotated automatically with a high degree of accuracy
Marking intra-chunk dependencies on inter-chunk dependency annotated trees results in the expansion of the latter
Automatic conversion to phrase structure depends on the expanded version of the treebank
Hence, a high-quality intra-chunk dependency annotator/parser is required

Fig 3: SSF representation of the complete dependency tree
Fig 4: Complete dependency tree of Sentence 1

Intra-chunk dependency annotation guidelines
Tags can be classified into
– Normal dependencies: nmod__adj, jjmod__intf etc.
– Local word group dependencies (lwg): lwg__psp, lwg__vaux, lwg__neg etc.
– Linking local word group dependencies: lwg__cont etc.
A total of 12 tags were used in the experiments

nmod__adj
Various types of adjectival modification are shown using this label. An adjective modifying a head noun is one such instance. The label also covers other modifications, such as a demonstrative or a quantifier modifying a noun.
Chunk: नीली किताब
NP((niilii_JJ kitaab_NN))
‘blue’ ‘book’
niilii nmod__adj kitaab

lwg__psp
Used to attach post-positions/auxiliaries associated with a noun or a verb. ‘lwg’ in the label name stands for local word grouping; the label associates all the postpositions with the head noun.
Chunk: अभिषेक ने
NP((abhishek_NNP ne_PSP))
‘abhishek’ ‘ERG’
abhishek lwg__psp ne

lwg__cont
Shows that a group of lexical items inside a chunk together perform a certain function
In such cases, we do not commit to the dependencies between these elements
We see this with complex post-positions associated with a noun/verb or with the auxiliaries of a verb
‘cont’ stands for continue
Chunk: जा सकता है
VGF((jaa_VM sakataa_VAUX hai_VAUX))
‘go’ ‘can’ ‘be-pres’
jaa lwg__vaux sakataa lwg__cont hai

Intra-chunk dependency annotator/parser
Built a robust intra-chunk dependency parser for Hindi
– Rule-based approach
– Statistical approach
– Hybrid approach (a heuristic-based post-processing component on top of the statistical approach)
The rule-based tool can easily be adapted to other languages as well

Rule-based intra-chunk dependency annotator
Identifies modifier-modified (child-parent) relationships inside a chunk
Rules are provided in a fixed rule template
The head of each chunk is determined by a head computation module
All information present in the SSF can be captured through the rule template

Rule template
We capture the rules in the form of constraints applicable at
– Chunk label
– Parent constraints
– Child constraints
– Contextual constraints
Table 1: Rule template
Chunk name | Parent constraints | Child constraints | Contextual constraints | Dep. relation
NP | POS == NN | POS == JJ | posn(parent) > posn(child) | nmod__adj
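A minimal sketch of how one row of the rule template could be applied to a chunk. The class and function names (Token, Rule, annotate) are hypothetical, the head index is assumed to come from the head computation module, and every non-head token is assumed to attach directly to the chunk head; none of this is the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Token:
    form: str
    pos: str


@dataclass
class Rule:
    chunk_label: str                     # chunk name, e.g. "NP"
    parent_pos: str                      # parent constraint: POS of the parent
    child_pos: str                       # child constraint: POS of the child
    context: Callable[[int, int], bool]  # contextual constraint on (parent posn, child posn)
    relation: str                        # dependency relation to assign


# The single rule shown in Table 1: inside an NP, a JJ that precedes
# its NN parent is attached with nmod__adj.
RULES = [Rule("NP", "NN", "JJ", lambda parent, child: parent > child, "nmod__adj")]


def annotate(chunk_label: str, tokens: List[Token], head_index: int,
             rules: List[Rule] = RULES) -> List[Tuple[str, str, str]]:
    """Return (child, relation, parent) triples for the non-head tokens of one chunk.

    Assumption: head_index is supplied by a separate head computation step.
    """
    parent = tokens[head_index]
    edges = []
    for i, child in enumerate(tokens):
        if i == head_index:
            continue
        for rule in rules:
            if (rule.chunk_label == chunk_label
                    and rule.parent_pos == parent.pos
                    and rule.child_pos == child.pos
                    and rule.context(head_index, i)):
                edges.append((child.form, rule.relation, parent.form))
                break  # first matching rule wins (an assumption)
    return edges


edges = annotate("NP", [Token("niilii", "JJ"), Token("kitaab", "NN")], head_index=1)
print(edges)
```

Tokens for which no rule fires are simply left unattached in this sketch; a real tool would need a policy for them.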

Statistical approach: Sub-tree parsing using MaltParser
MaltParser (Nivre et al., 2007), a transition-based dependency parser, is best suited for identifying short-range dependencies (Nivre, 2003)
Each chunk is separated out and called a sub-tree
Data is divided into training (192 sentences), development (64) and testing (64)
We followed the strategies used in Kosaraju et al., 2010
– Feature pool
– Pruning features using forward selection
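A rough sketch of the sub-tree idea: each chunk is written out as its own small tree in the 10-column CoNLL-X layout that MaltParser reads by default. The function names, the "ROOT" label on the chunk head, and the reuse of the POS tag for both tag columns are assumptions for illustration, not details from the slides.

```python
from typing import List, Tuple

Chunk = Tuple[List[Tuple[str, str]], List[int], List[str]]


def chunk_to_conllx(tokens: List[Tuple[str, str]], heads: List[int],
                    rels: List[str]) -> str:
    """Serialise one chunk as a stand-alone CoNLL-X tree.

    tokens: (form, POS) pairs; heads: 1-based index of each token's parent
    inside the chunk, 0 for the chunk head; rels: dependency labels.
    """
    rows = []
    for idx, ((form, pos), head, rel) in enumerate(zip(tokens, heads, rels), start=1):
        # Columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL
        rows.append("\t".join([str(idx), form, "_", pos, pos, "_",
                               str(head), rel, "_", "_"]))
    return "\n".join(rows)


def chunks_to_subtrees(chunks: List[Chunk]) -> str:
    """Write every chunk as a separate sub-tree, separated by blank lines."""
    return "\n\n".join(chunk_to_conllx(*chunk) for chunk in chunks) + "\n"


# Two example chunks from the guideline slides.
data = chunks_to_subtrees([
    ([("niilii", "JJ"), ("kitaab", "NN")], [2, 0], ["nmod__adj", "ROOT"]),
    ([("abhishek", "NNP"), ("ne", "PSP")], [0, 1], ["ROOT", "lwg__psp"]),
])
print(data)
```

Because every sub-tree contains only the tokens of one chunk, all head indices stay chunk-internal, which matches the short-range setting described above.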

Results (on gold data)
Table 2: Data statistics
Data set | No. of sentences
Training | 192
Development | 64
Testing | 64
Table 3: Rule-based accuracies
LAS | UAS | LS
97.89 | 98.50 | 98.38
Table 4: Statistical approach showing baseline, POS and best templates
Metric | Baseline | POS template | Best template
LAS | | |
UAS | | |
LS | | |
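The three measures reported here can be sketched as follows, using their usual definitions: UAS counts tokens with the correct head, LS counts tokens with the correct label, and LAS counts tokens with both correct. The function name and the toy data are hypothetical and do not reproduce the scores in the tables.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, str]  # (head index, dependency label) for one token


def attachment_scores(gold: List[Arc], pred: List[Arc]) -> Dict[str, float]:
    """Compute LAS, UAS and LS as percentages over aligned token lists."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred))  # head correct
    ls = sum(g[1] == p[1] for g, p in zip(gold, pred))   # label correct
    las = sum(g == p for g, p in zip(gold, pred))        # head and label correct
    return {"LAS": 100 * las / n, "UAS": 100 * uas / n, "LS": 100 * ls / n}


# Toy example with four tokens (made-up predictions).
gold = [(2, "nmod__adj"), (0, "ROOT"), (2, "lwg__psp"), (3, "lwg__cont")]
pred = [(2, "nmod__adj"), (0, "ROOT"), (2, "lwg__cont"), (2, "lwg__cont")]
scores = attachment_scores(gold, pred)
print(scores)
```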

Hybrid approach
Post-processed the output of the statistical approach using the rules as heuristics
Only rules for tags whose recall is higher in the rule-based approach than in the statistical approach are considered
– pof__cn, nmod__adj, rsym
Table 5: Parsing accuracies of all methods
Approach | LAS | UAS | LS
Rule-based | | |
Statistical | | |
Hybrid | | |
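One way the post-processing step could look, as a sketch under assumptions: the statistical prediction is kept unless the rule-based annotator assigns one of the tags listed on this slide, in which case the rule-based arc replaces it. The function name and the override policy are hypothetical; the slides do not spell out the exact procedure.

```python
from typing import List, Optional, Set, Tuple

Arc = Tuple[int, str]  # (head index, dependency label)

# Tags named on the slide as having higher recall in the rule-based approach.
TRUSTED_TAGS: Set[str] = {"pof__cn", "nmod__adj", "rsym"}


def hybrid(statistical: List[Arc], rule_based: List[Optional[Arc]],
           trusted: Set[str] = TRUSTED_TAGS) -> List[Arc]:
    """Override statistical arcs with rule-based arcs for trusted tags only.

    rule_based holds None where no rule fired for a token (an assumption).
    """
    merged = []
    for stat_arc, rule_arc in zip(statistical, rule_based):
        if rule_arc is not None and rule_arc[1] in trusted:
            merged.append(rule_arc)
        else:
            merged.append(stat_arc)
    return merged


# Made-up outputs for a three-token chunk.
statistical = [(2, "jjmod__intf"), (0, "ROOT"), (2, "lwg__psp")]
rule_based = [(2, "nmod__adj"), None, (2, "lwg__cont")]
merged = hybrid(statistical, rule_based)
print(merged)
```

Only the first token changes here: nmod__adj is a trusted tag, while the rule-based lwg__cont is ignored in favour of the statistical output.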

Special cases
‘Chunks are self-contained units. Intra-chunk dependencies are chunk-internal and do not span outside a chunk.’
The above is the basis for the neat division between inter-chunk and intra-chunk parsing
However, there are two cases in which this constraint does not hold
– In these two cases, a chunk-internal element that is not the head of the chunk has a relation with a lexical item outside its chunk
Hence, these relations have to be handled separately

Special cases
rsym__EOS (end-of-sentence marker):
– Occurs in the last chunk and is attached to the head of the sentence
lwg__psp:
– According to the guidelines, a PSP attaches to the head of its chunk with lwg__psp
– However, if the right-most child of a CCP (conjunction chunk) is a nominal (NP or VGNN), the PSP of this nominal child has to be attached to the head of the CCP during expansion
– If there are multiple PSPs, the first PSP gets lwg__psp and the second lwg__cont

Special case (lwg__psp)
NP(raama_NNP) CCP(aur_CC) NP(siitaa_NNP ne_PSP)
‘ram’ ‘and’ ‘sita’ ‘ERG’
Fig 5: Expanded sub-tree with PSP connected with CC
aur (head of the sub-tree), with children raama (ccof), siitaa (ccof) and ne (lwg__psp)
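The attachment decision for a PSP during expansion can be sketched as below. The function name and its arguments are hypothetical; the sketch covers only the CCP exception described on the previous slide, not the handling of multiple PSPs or of rsym__EOS.

```python
from typing import Optional

NOMINAL_CHUNKS = {"NP", "VGNN"}


def psp_parent(chunk_label: str, chunk_head: str,
               parent_chunk_label: Optional[str], parent_chunk_head: Optional[str],
               is_rightmost_child: bool) -> str:
    """Return the word a PSP should attach to with lwg__psp.

    Default: the head of the PSP's own chunk. Exception: if the chunk is a
    nominal and the right-most child of a CCP, attach to the CCP head instead.
    """
    if (parent_chunk_label == "CCP"
            and parent_chunk_head is not None
            and chunk_label in NOMINAL_CHUNKS
            and is_rightmost_child):
        return parent_chunk_head
    return chunk_head


# 'ne' in NP(siitaa_NNP ne_PSP), the right-most child of CCP(aur_CC).
print(psp_parent("NP", "siitaa", "CCP", "aur", True))
# 'ne' in NP(abhishek_NNP ne_PSP), which is not a child of a CCP.
print(psp_parent("NP", "abhishek", None, None, False))
```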

Conclusion
Described annotation guidelines for marking intra-chunk dependency relations
Approaches:
1. Rule-based
2. Statistical
3. Hybrid (using 1 and 2)
Error analysis of the outputs shows that only certain tags are marked incorrectly. This is good news, because one can then make very targeted manual corrections after the automatic tool has been run

THANK YOU