Published by Gregory Glenn. Modified over 9 years ago.
1
Shallow Parsing for South Asian Languages
Himanshu Agrawal
2
Shallow Parsing
- Part-of-Speech Tagging: assigning grammatical classes to the words of a natural-language sentence.
- Text Chunking: dividing the text into syntactically correlated groups of words.
- Example: [NP He ] [VP reckons ] [NP the current account deficit ] [VP will narrow ] [PP to ] [NP only # 1.8 billion ] [PP in ] [NP September ].
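A chunked sentence in bracket notation can also be written as one tag per token, which is the form a sequence classifier works with. The sketch below uses the common B-/I- prefix convention as an assumption; it is not necessarily the boundary-label scheme used in this work, and the function name is hypothetical.

```python
import re

def brackets_to_iob(text):
    """Convert '[LABEL w1 w2 ]' chunks into (word, tag) pairs.

    The first word of a chunk gets 'B-LABEL', later words get 'I-LABEL'.
    Tokens outside any bracket are ignored in this simplified sketch.
    """
    tagged = []
    for label, body in re.findall(r"\[(\w+)\s+([^\[\]]+?)\s*\]", text):
        for k, word in enumerate(body.split()):
            prefix = "B-" if k == 0 else "I-"
            tagged.append((word, prefix + label))
    return tagged

# Shortened version of the example sentence above.
SENTENCE = "[NP He ] [VP reckons ] [NP the current account deficit ]"
IOB = brackets_to_iob(SENTENCE)
for word, tag in IOB:
    print(word, tag)
```

Each chunk boundary is then visible as a B- tag, so chunking reduces to the same token-classification setting as POS tagging.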
3
Applications
- Direct applications:
  - Automatic spell-checking software
  - Grammar suggestions (MS Word pop-ups)
  - Full parsing
- Indirect applications:
  - Machine translation systems
  - Web search
4
Nature of the Problem of Shallow Parsing
- A classic problem of classifying input tokens into given classes.
- The sequence aspect: the sequence of best classes versus the best sequence of classes.
- Typically, the classifying information is the linguistic context of the word under consideration.
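The difference between "the sequence of best classes" and "the best sequence of classes" can be seen on a toy example. The sketch below is illustrative only: the two tags, the probabilities, and the function names are made up, and it uses plain per-token argmax against Viterbi decoding over log-scores rather than any particular tagger.

```python
import math

def greedy_decode(emissions):
    """Sequence of best classes: pick the top tag for each token on its own."""
    return [max(scores, key=scores.get) for scores in emissions]

def viterbi_decode(emissions, transitions, tags):
    """Best sequence of classes: maximize the summed log-score of the whole path."""
    # best[t][tag] = (score of the best path ending in `tag` at position t, previous tag)
    best = [{tag: (emissions[0][tag], None) for tag in tags}]
    for scores in emissions[1:]:
        column = {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p][0] + transitions[(p, tag)])
            column[tag] = (best[-1][prev][0] + transitions[(prev, tag)] + scores[tag], prev)
        best.append(column)
    # Backtrack from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

# Made-up numbers for a two-token input.
TAGS = ["JJ", "NN"]
EMISSIONS = [
    {"JJ": math.log(0.60), "NN": math.log(0.40)},
    {"JJ": math.log(0.55), "NN": math.log(0.45)},
]
TRANSITIONS = {
    ("JJ", "JJ"): math.log(0.1), ("JJ", "NN"): math.log(0.9),
    ("NN", "JJ"): math.log(0.5), ("NN", "NN"): math.log(0.5),
}

GREEDY = greedy_decode(EMISSIONS)
VITERBI = viterbi_decode(EMISSIONS, TRANSITIONS, TAGS)
print(GREEDY)   # ['JJ', 'JJ']
print(VITERBI)  # ['JJ', 'NN']
```

The per-token choice prefers JJ twice, but once the unlikely JJ-to-JJ transition is taken into account the jointly best path is JJ followed by NN.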
5
Shallow Parsing for English
- The problem has been well studied for English.
- Very efficient systems exist. Examples:
  - Brill’s Tagger (’95): Transformation-Based Learning.
  - Adwait Ratnaparkhi (’99): Parsing with Maximum Entropy.
- Significant effect on the development of MT systems for European languages.
6
Shallow Parsing for South Asian Languages
- Portability of shallow parsing systems across languages? Not good.
- Inflectional richness of the languages.

POS tagging accuracy*
System                                          English   Hindi
Brill’s Transformation-Based Learning           87%       79%
Ratnaparkhi’s Maximum Entropy Based Learning    89%       81%
* Training on 22,000 words and testing on 5,000 words. POS tagging only.
7
Challenges with Indian Languages
- Poor disambiguation between certain POS categories, for example:
  - NNP and NNC (Error Type 1)
  - JJ and NN (Error Type 2)
- Inflectional richness of the language.
- Absence of markers such as the capitalization of proper nouns. Example: "Is that Raj?"
8
On Improving the Performance for Hindi and Other South Asian Languages
There are two ways:
- Improving the classifying information, by using better features, language-specific information, or both.
- Improving the learning, through better training and better inference.
9
A. POS Tagging
For better training and inference:
- Approach 1: Training on a hierarchical structure of tags.
- Approach 2: Building a knowledge database from raw (un-annotated) text to use as a look-up.
10
Approach 1: Training on Hierarchical Tagset
- Training in steps, on a hierarchical structure of classes.
[Figure: training levels 1 and 2]
11
Approach 1: Training on Hierarchical Tagset
- The approach was devised to minimize the number of errors made within a family class.
- Result: 73.33%
- Reasons:
  - No mechanism to correct errors made in part 1 of training.
  - Jittered language constructs while training in part 2.
12
Approach 2: Building a Knowledge Database for Look-up
- The knowledge database consists of words and the POS tags each word is known to have occurred with.
- Why is it important? Inflectional richness vs. per-class ambiguity.
13
Building the Knowledge Database
- Add words and their POS tags from the training data.
- Train on 22,000 words with gold-standard POS tags, creating a trained model ‘A’.
- Use model ‘A’ to annotate the raw text, which consists of 2 lakh (200,000) words.
- Extract the word/POS-tag pairs that were tagged with a very high confidence measure, and add them to the database.
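The construction steps above can be sketched as follows. This is a minimal illustration, not the original implementation: the function name, the romanized example words, the confidence values, and the 0.95 threshold are all assumptions (the slides say only "very high confidence").

```python
from collections import defaultdict

def build_knowledge_db(gold_tagged, auto_tagged, threshold=0.95):
    """Map each word to the set of POS tags it has been seen with.

    gold_tagged: (word, tag) pairs from the gold-standard training data.
    auto_tagged: (word, tag, confidence) triples produced by model 'A' on raw text.
    threshold:   hypothetical cut-off for "very high confidence".
    """
    db = defaultdict(set)
    for word, tag in gold_tagged:
        db[word].add(tag)
    for word, tag, confidence in auto_tagged:
        if confidence >= threshold:
            db[word].add(tag)
    return dict(db)

# Made-up toy data.
GOLD = [("raj", "NNP"), ("accha", "JJ")]
AUTO = [("raj", "NN", 0.98), ("ghar", "NN", 0.99), ("accha", "NN", 0.60)]

DB = build_knowledge_db(GOLD, AUTO)
print(DB)
```

Here the low-confidence NN reading of "accha" is discarded, while "ghar" enters the database purely from the automatically annotated text.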
14
Using the Knowledge Database
For the final tagging:
- We use model ‘A’ to get the probability of each tag being associated with a word, i.e. P(tag_i | word), for every tag and for every word in the test data.
- If a word is found in the database, we choose, among the tags in its entry, the one with the highest probability.
- If it is not found, the tag predicted in the first run remains unchanged.
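The look-up rule above can be sketched as below. The function name, the example words, and the probabilities are hypothetical; the per-tag probabilities are assumed to come from model 'A' as a dictionary per token.

```python
def retag_with_lookup(words, tag_probs, first_run_tags, db):
    """Restrict each known word to the tags listed in its database entry.

    words:          tokens of the test sentence.
    tag_probs:      one dict per token mapping tag -> P(tag | word) from model 'A'.
    first_run_tags: the tags model 'A' predicted in the first run.
    db:             word -> set of tags the word is known to occur with.
    """
    final = []
    for word, probs, first in zip(words, tag_probs, first_run_tags):
        allowed = db.get(word)
        if allowed:
            # Known word: best-scoring tag among those in its entry.
            final.append(max(sorted(allowed), key=lambda t: probs.get(t, 0.0)))
        else:
            # Unknown word: keep the first-run prediction.
            final.append(first)
    return final

# Made-up toy data.
WORDS = ["accha", "naya"]
PROBS = [
    {"NN": 0.5, "JJ": 0.4, "RB": 0.1},
    {"JJ": 0.7, "NN": 0.3},
]
FIRST_RUN = ["NN", "JJ"]
DB = {"accha": {"JJ", "RB"}}

FINAL = retag_with_lookup(WORDS, PROBS, FIRST_RUN, DB)
print(FINAL)  # ['JJ', 'JJ']
```

The first word was tagged NN in the first run, but its database entry allows only JJ and RB, so the higher-scoring JJ is chosen; the second word is not in the database and keeps its first-run tag.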
15
Approach 2
Result: 84.90%
16
Training for Model ‘A’
- We use a linear-chain implementation of Conditional Random Fields (Taku Kudo et al., 2005).
- We use simple, language-independent features:
  - Word window [-2, 2].
  - Suffix information: the last 2, 3, and 4 characters.
  - Presence of special characters.
  - Word length.
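The feature set listed above could be extracted per token roughly as follows. This is a sketch under assumptions: the feature names, the "<PAD>" filler, and the definition of a special character (anything `str.isalnum` rejects) are illustrative choices, not taken from the original system.

```python
def token_features(words, i):
    """Language-independent features for the token at position i."""
    word = words[i]
    feats = {}
    # Word window [-2, 2]; positions outside the sentence get a padding symbol.
    for offset in range(-2, 3):
        j = i + offset
        feats[f"w[{offset}]"] = words[j] if 0 <= j < len(words) else "<PAD>"
    # Suffixes of length 2, 3, and 4 (the whole word if it is shorter).
    for n in (2, 3, 4):
        feats[f"suffix{n}"] = word[-n:]
    # Placeholder test: any character that is neither a letter nor a digit.
    feats["has_special"] = any(not ch.isalnum() for ch in word)
    feats["length"] = len(word)
    return feats

SENTENCE = ["is", "that", "Raj", "?"]
FEATS = token_features(SENTENCE, 2)
print(FEATS)
```

For Indic scripts the special-character test would need a script-aware definition, since combining vowel signs are not classified as letters by `str.isalnum`.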
17
B. Chunking
- We have followed the approach used by Anirudh, Himanshu ’06 NWAI.
- Two-step training:
  1. Training on a boundary-label scheme to extract chunk labels.
  2. Training on boundaries, with the added information of chunk labels.
18
Chunking (cont.)
- Training for identifying chunk tags is also done using a linear-chain implementation of CRF.
- Features:
  - Word window of [-2, 2].
  - POS tag window of [-2, 2].
  - Chunk labels, for chunk boundary identification, over [-2, 0].
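The chunking features above might be assembled per token as in the sketch below. The function name, feature names, padding symbol, and example tags are assumptions made for illustration; the chunk labels are taken to be the output of the first training step.

```python
def chunk_features(words, pos_tags, chunk_labels, i):
    """Features for deciding the chunk boundary at position i."""
    feats = {}
    n = len(words)
    # Word and POS tag windows of [-2, 2].
    for offset in range(-2, 3):
        j = i + offset
        inside = 0 <= j < n
        feats[f"w[{offset}]"] = words[j] if inside else "<PAD>"
        feats[f"pos[{offset}]"] = pos_tags[j] if inside else "<PAD>"
    # Chunk labels over [-2, 0]: the current token and the two before it.
    for offset in range(-2, 1):
        j = i + offset
        feats[f"label[{offset}]"] = chunk_labels[j] if j >= 0 else "<PAD>"
    return feats

# Made-up toy sentence with hypothetical tags and step-1 chunk labels.
WORDS = ["He", "reckons", "the", "deficit"]
POS = ["PRP", "VBZ", "DT", "NN"]
LABELS = ["NP", "VP", "NP", "NP"]

CHUNK_FEATS = chunk_features(WORDS, POS, LABELS, 1)
print(CHUNK_FEATS)
```

Only labels at or before the current position are used, matching the [-2, 0] window, so no chunk label to the right of the token is consulted.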
19
Chunking
Result: 92.69%
20
Consolidated Results
** The results below were calculated on the development data.
               Hindi     Telugu    Bengali
POS Tagging    84.90%    71.22%    81.09%
Chunking       92.69%    91.77%    94.90%
21
Conclusions:
- Train on a tagset that is optimal for capturing the language patterns.
- If training is done in more than one step, especially when the tags in a subsequent step depend directly on the tags from the present step, it is important that there be a way to re-tag the mis-tagged tokens.
22
References:
- Charles Sutton. An Introduction to Conditional Random Fields for Relational Learning.
- Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. Dissertation in Computer and Information Science, University of Pennsylvania.
- Akshay Singh, Sushma Bendre, Rajeev Sangal. 2005. HMM Based Chunker for Hindi. IIIT Hyderabad.
- Thorsten Brants. 2000. TnT - A Statistical Part-of-Speech Tagger. Proceedings of the Sixth Conference on Applied Natural Language Processing, 224–231.
- Himanshu Agrawal, Anirudh Mani. 2006. Part Of Speech Tagging and Chunking Using Conditional Random Fields. Proceedings of the NLPAI ML contest workshop, National Workshop on Artificial Intelligence.