POS tagging and Chunking for Indian Languages
Rajeev Sangal and V. Sriram, International Institute of Information Technology, Hyderabad

Contents
- NLP: Introduction
- Language Analysis - Representation
- Part-of-speech tags in Indian Languages (e.g. Hindi)
- Corpus-based methods: An introduction
- POS tagging using HMMs
- Introduction to TnT
- Chunking for Indian languages - A few experiments
- Shared task - Introduction

Language: A unique ability of humans
- Animals have signs (e.g. a sign for danger), but cannot combine the signs.
- Higher animals (apes) can combine symbols (noun & verb), but can talk only about the here and now.

Language: Means of Communication
(Figure: a CONCEPT is coded into language by the speaker and decoded by the listener.) The concept gets transferred through language.

Language: Means of thinking
"What should I wear today?" Can we think without language?

What is NLP?
The process of computer analysis of input provided in a human language is known as Natural Language Processing. (Figure: concept → language → intermediate representation → used for processing by the computer.)

Applications
- Machine translation
- Document clustering
- Information extraction / retrieval
- Text classification

MT system: Shakti
A machine translation system being developed at IIIT Hyderabad. It is a hybrid system that combines the strengths of linguistic, statistical and machine learning techniques, and integrates the best available NLP technologies.

Shakti architecture
English sentence → English sentence analysis (morphology, POS tagging, chunking, parsing) → transfer from English to Hindi (word reordering, Hindi word substitution) → Hindi sentence generation (agreement, word generation) → Hindi sentence
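The staged design above can be pictured as a composition of passes over a sentence. Below is a minimal, purely illustrative Python sketch of such a pipeline; the function names and data shapes are assumptions for exposition, not Shakti's actual modules or API.

```python
# Illustrative sketch of a staged analysis-transfer-generation MT pipeline.
# All names and structures here are hypothetical placeholders.

def analyze_english(sentence: str) -> dict:
    """Analysis stage: morphology, POS tagging, chunking, parsing."""
    return {"tokens": sentence.split(), "pos": [], "chunks": [], "parse": None}

def transfer_to_hindi(analysis: dict) -> dict:
    """Transfer stage: word reordering and Hindi word substitution."""
    return {"hindi_words": analysis["tokens"]}  # substitution omitted

def generate_hindi(transferred: dict) -> str:
    """Generation stage: agreement and word generation."""
    return " ".join(transferred["hindi_words"])

def translate(sentence: str) -> str:
    # Each stage consumes the previous stage's output, as in the slide.
    return generate_hindi(transfer_to_hindi(analyze_english(sentence)))
```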

Levels of Language Analysis
- Morphological analysis
- Lexical analysis (POS tagging)
- Syntactic analysis (chunking, parsing)
- Semantic analysis (word sense disambiguation)
- Discourse processing (anaphora resolution)
Let's take an example sentence: "Children are watching some programmes on television in the house"

Chunking
What are chunks?
[[ Children ]] (( are watching )) [[ some programmes ]] [[ on television ]] [[ in the house ]]
- Noun chunks (NP, PP) are in square brackets; verb chunks (VG) are in parentheses.
- Noun chunks represent objects/concepts; verb chunks represent actions.

Chunking: Representation in SSF (Shakti Standard Format)

Part-of-Speech tagging

Morphological analysis
Deals with the word form and its analysis. The analysis consists of characteristic properties such as:
- Root/stem
- Lexical category
- Gender, number, person, …
Example: watching → root = watch, lexical category = verb, …

Morphological analysis

POS Tags in Hindi
The broad categories are noun, verb, adjective & adverb. Words are classified depending on their role, both individually as well as in the sentence. Example:
vaha aama khaa rahaa hei
pron noun verb verb verb

POS Tagging
The simplest method of POS tagging is looking the word up in a dictionary:
khaanaa → (dictionary lookup) → verb

Problems with POS Tagging
- The size of the dictionary limits the scope of the POS tagger.
- Ambiguity: the same word can be used both as a noun and as a verb, e.g. khaanaa → noun or verb.

Problems with POS Tagging
Ambiguity: sentences in which the word "khaanaa" occurs:
- tum bahuta achhaa khaanaa banatii ho. (noun)
- mein jilebii khaanaa chaahataa hun. (verb)
Hence, the complete sentence has to be looked at before determining a word's role and thus its POS tag.
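To make the limitation concrete, here is a minimal Python sketch of a lookup tagger (the toy dictionary is assumed for illustration): it can only report the set of possible tags, not choose between them.

```python
# Toy dictionary-lookup tagger; words and tags are illustrative assumptions.
DICTIONARY = {
    "khaanaa": {"noun", "verb"},   # ambiguous: "food" (noun) or "to eat" (verb)
    "jilebii": {"noun"},
    "tum": {"pron"},
}

def lookup_tags(word):
    # Returns every tag the dictionary allows; an empty set means the word
    # is outside the dictionary -- the coverage problem noted above.
    return DICTIONARY.get(word, set())

print(lookup_tags("khaanaa"))   # {'noun', 'verb'} -- lookup alone cannot decide
```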

Problems with POS Tagging
Many applications need more specific POS tags. For example:
- … seba khaa rahaa … → Verb Finite Main
- … khaate huE … → Verb Non-Finite Adjective
- … khaakara … → Verb Non-Finite Adverb
- sharaaba piinaa sehata … → Verb Non-Finite Nominal
Hence the need for defining a tagset.

Defining the tagset for Hindi (IIIT Tagset)
Issues:
1. Fineness vs. coarseness in linguistic analysis
2. Syntactic function vs. lexical category
3. New tags vs. tags from a standard tagset

Fineness vs. Coarseness
A decision has to be taken whether tags will account for finer distinctions of various features of the parts of speech. A balance must be struck:
- not so fine as to hamper machine learning
- not so coarse as to lose information

Fineness vs. Coarseness
- Nouns: plurality information is not taken into account (singular and plural nouns get the same tag); case information is not marked (direct and oblique nouns get the same tag).
- Adjectives and adverbs: no distinction between comparative and superlative forms.
- Verbs: finer distinctions are made (e.g. VJJ, VRB, VNN), which helps us understand the arguments that a verb form can take.

Fineness in Verb tags
Useful for tasks like dependency parsing, as we have better information about the arguments of a verb form. Non-finite verb forms used as nouns, adjectives or adverbs still retain their verbal property (VNN = noun formed from a verb). Example:
aasamaana/NN mein/PREP udhane/VNN vaalaa/PREP ghodhaa/NN
"sky" "in" "flying" "horse"
niiche/NLOC utara/VFM aayaa/VAUX
"down" "climb" "came"

Syntactic vs. Lexical
Whether to tag a word based on its lexical or its syntactic category. For example, should "uttar" in "uttar bhaarata" ("north India") be tagged as a noun or an adjective? The lexical category is given more importance than the syntactic category while marking text manually, which leads to consistency in tagging.

New tags vs. tags from a standard tagset
An entirely new tagset for Indian languages is not desirable, as people are familiar with standard tagsets like the Penn tagset. The Penn tagset has been used as a benchmark while deciding tags for Hindi; wherever it was found inadequate, new tags were introduced:
- NVB: new tag for kriyamuls, or light verbs
- QW: modified tag for question words

IIIT Tagset
Tags are grouped into three types:
1. Group 1: adopted from the Penn tagset with minor changes.
2. Group 2: modifications over the Penn tagset.
3. Group 3: tags not present in the Penn tagset.
Examples of tags in Group 3:
1. INTF (intensifier): words like 'baHuta', 'kama' etc.
2. NVB, JVB, RBVB: light verbs.
Detailed guidelines will be put online.

Corpus-based approach
(Figure: a POS-tagged corpus is used to learn a POS tagger; the tagger is then applied to an untagged new corpus to produce a tagged new corpus.)

POS tagging: A simple method
Pick the most likely tag for each word. The probabilities can be estimated from a tagged corpus. This assumes independence between tags. Accuracy < 90%.

POS tagging: A simple method
Example: Brown corpus, tagged words (training section), 26 tags.
Example sentence: mujhe xo kitabein xijiye
The word xo occurs 267 times: 227 times tagged as QFN and 29 times as VAUX.
P(QFN | W=xo) = 227/267 ≈ 0.85
P(VAUX | W=xo) = 29/267 ≈ 0.11
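A minimal sketch of this baseline in Python (the corpus format and fallback tag are assumptions for illustration): count (word, tag) pairs in a tagged corpus, then always emit each word's most frequent tag.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """tagged_sentences: list of [(word, tag), ...] lists."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    # For each word keep only its single most frequent tag.
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag_baseline(model, words, default="NN"):
    # Unknown words fall back to a default tag (an assumption here).
    return [(w, model.get(w, default)) for w in words]

corpus = [[("mujhe", "PRP"), ("xo", "QFN"), ("kitabein", "NN"), ("xijiye", "VFM")]]
model = train_baseline(corpus)
print(tag_baseline(model, ["xo", "kitabein"]))
```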

Corpus-based approaches to learning
Rule-based:
- Transformation-based error-driven learning (Brill)
- Inductive logic programming (Cussens)
Statistical:
- Hidden Markov models (TnT; Brants 2000)
- Maximum entropy (Ratnaparkhi 1996)

POS tagging using HMMs
Let W be a sequence of words, W = w_1, w_2, ..., w_n, and let T be the corresponding tag sequence, T = t_1, t_2, ..., t_n.
Task: find the T which maximizes P(T | W):
T' = argmax_T P(T | W)

POS tagging using HMM
By Bayes' rule, P(T | W) = P(W | T) * P(T) / P(W), so
T' = argmax_T P(W | T) * P(T)
By the chain rule,
P(T) = P(t_1) * P(t_2 | t_1) * P(t_3 | t_1 t_2) * ... * P(t_n | t_1 ... t_{n-1})
Applying the bigram approximation,
P(T) = P(t_1) * P(t_2 | t_1) * P(t_3 | t_2) * ... * P(t_n | t_{n-1})

POS tagging using HMM
P(W | T) = P(w_1 | T) * P(w_2 | w_1, T) * P(w_3 | w_1 w_2, T) * ... * P(w_n | w_1 ... w_{n-1}, T)
         = Π_{i=1..n} P(w_i | w_1 ... w_{i-1}, T)
Assume P(w_i | w_1 ... w_{i-1}, T) = P(w_i | t_i).
Now T' is the sequence which maximizes
P(t_1) * P(t_2 | t_1) * ... * P(t_n | t_{n-1}) * P(w_1 | t_1) * P(w_2 | t_2) * ... * P(w_n | t_n)
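Collecting the Bayes-rule factorization and the two approximations, the bigram HMM objective can be restated compactly (this is only a restatement of the slide's formulas, with the convention that P(t_1 | t_0) denotes the start probability P_start(t_1)):

$$ T' = \operatorname*{argmax}_{t_1 \ldots t_n} \; \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i) $$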

POS tagging using HMM
If we instead use a trigram model for the tag sequence,
P(T) = P(t_1) * P(t_2 | t_1) * P(t_3 | t_1 t_2) * ... * P(t_n | t_{n-2} t_{n-1})
Which model to choose? It depends on the amount of data available: richer models (trigrams, 4-grams) require lots of data.

Chain rule with approximations
P(W = "vaha ladakaa gayaa", T = "det noun verb")
= P(det) * P(vaha | det) * P(noun | det) * P(ladakaa | noun) * P(verb | noun) * P(gayaa | verb)

Chain rule with approximations: Example
P(vaha | det) = (number of times 'vaha' appeared as 'det' in the corpus) / (total occurrences of 'det' in the corpus)
P(verb | noun) = (number of times 'verb' followed 'noun' in the corpus) / (total occurrences of 'noun' in the corpus)
If we obtained the following estimates from the corpus:
P(W, T) = 0.5 * 0.4 * 0.99 * 0.5 * 0.4 * 0.02 = 0.000792

POS tagging using HMM
We need to estimate three types of parameters from the corpus:
- P_start(t_i) = (no. of sentences which begin with t_i) / (no. of sentences)
- P(t_i | t_{i-1}) = count(t_{i-1} t_i) / count(t_{i-1})
- P(w_i | t_i) = count(w_i with t_i) / count(t_i)
These parameters can be represented directly in a Hidden Markov Model (HMM), and the best tag sequence can be computed by applying the Viterbi algorithm to it. A small estimation sketch follows.
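The counting above takes only a few lines. Below is a minimal Python sketch, assuming sentences arrive as lists of (word, tag) pairs; it returns the start, transition and emission tables (called PI, A and B on the HMM slides that follow) and omits the smoothing for unseen events that a real tagger such as TnT adds.

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """Relative-frequency estimates from sentences of (word, tag) pairs."""
    start, trans, emit, tag_count = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        start[sent[0][1]] += 1                     # tag beginning the sentence
        for i, (word, tag) in enumerate(sent):
            tag_count[tag] += 1
            emit[(tag, word)] += 1                 # count(w_i with t_i)
            if i > 0:
                trans[(sent[i - 1][1], tag)] += 1  # count(t_{i-1} t_i)
    n = len(tagged_sentences)
    PI = {t: c / n for t, c in start.items()}                      # P_start
    A = {(p, t): c / tag_count[p] for (p, t), c in trans.items()}  # P(t_i | t_{i-1})
    B = {(t, w): c / tag_count[t] for (t, w), c in emit.items()}   # P(w_i | t_i)
    return PI, A, B
```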

Markov models
Markov chain: an event depends on the previous events. Consider the word sequence "usane kahaa ki". Here, each word is taken to depend on the previous one word; hence it is said to form a Markov chain of order 1.

Hidden Markov models
The hidden states follow the Markov property; hence this model is known as a Hidden Markov Model. (Figure: observation sequence O = o_1 o_2 o_3 o_4 emitted by hidden state sequence X = x_1 x_2 x_3 x_4, indexed by t = 1, 2, 3, 4.)

Hidden Markov models
Representation of parameters in HMMs. Define O(t) = the t-th observation and X(t) = the hidden state at position t.
- A = [a_ab], a_ab = P(X(t+1) = X_b | X(t) = X_a) (transition matrix)
- B = [b_ak], b_ak = P(O(t) = O_k | X(t) = X_a) (emission matrix)
- PI = [pi_a], pi_a = probability of starting with hidden state X_a
The model is μ = {A, PI, B}.

HMM for POS tagging
- Observation sequence = word sequence
- Hidden state sequence = tag sequence
- Model: A = P(current tag | previous tag), B = P(current word | current tag), PI = P_start(tag)
Tag sequences are mapped to hidden state sequences because they are not directly observable in natural language text.

Example
A: a 3x3 transition matrix over {det, noun, verb}; B: an emission matrix with rows {det, noun, verb} and columns {vaha, ladakaa, gayaa} (values as on the slide).
PI: det 0.5, noun 0.4, verb 0.01

POS tagging using HMM
The problem can be formulated as: given the observation sequence O and the model μ = (A, B, PI), how do we choose the best state sequence X that explains the observations? Naively, consider all possible tag sequences and choose the one with the maximum joint probability with the observation sequence:
X_max = argmax_X P(O, X)
The complexity of this enumeration is high, on the order of N^T for N tags and T words, so the Viterbi algorithm is used for computational efficiency. A brute-force sketch of the enumeration follows.
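To make the cost concrete, here is a minimal brute-force sketch in Python over the dict-based PI, A, B estimated earlier (a toy illustration; missing entries default to probability 0). It literally enumerates all N^T sequences, which is exactly what Viterbi avoids.

```python
from itertools import product

def joint_prob(words, tags, PI, A, B):
    """P(O, X) under the bigram HMM: start * transitions * emissions."""
    p = PI.get(tags[0], 0.0) * B.get((tags[0], words[0]), 0.0)
    for i in range(1, len(words)):
        p *= A.get((tags[i - 1], tags[i]), 0.0) * B.get((tags[i], words[i]), 0.0)
    return p

def brute_force_tag(words, tagset, PI, A, B):
    # |tagset| ** len(words) candidate sequences: 27 paths for 3 tags and
    # 3 words, but hopeless for real tagsets and sentence lengths.
    return max(product(tagset, repeat=len(words)),
               key=lambda tags: joint_prob(words, tags, PI, A, B))
```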

POS tagging using HMM
(Figure: a trellis over the words "vaha ladakaa hansaa" with states det, noun, verb at each position.) With 3 tags and 3 words, 3^3 = 27 tag sequences are possible, i.e. 27 paths.

Viterbi algorithm
Let α_noun(ladakaa) represent the probability of reaching the state 'noun' by the best possible path while generating the observation 'ladakaa'.

Viterbi algorithm
Initialization: the best probability of reaching a state at the first word, e.g.
α_det(vaha) = PI(det) * B[det, vaha]

Viterbi algorithm
Recursion: the probability of reaching a state elsewhere in the best possible way:
α_noun(ladakaa) = MAX { α_det(vaha) * A[det, noun] * B[noun, ladakaa],
                        α_noun(vaha) * A[noun, noun] * B[noun, ladakaa],
                        α_verb(vaha) * A[verb, noun] * B[noun, ladakaa] }

Viterbi algorithm
Backpointers: what is the best way to reach a particular state?
phi_noun(ladakaa) = ARGMAX { α_det(vaha) * A[det, noun] * B[noun, ladakaa],
                             α_noun(vaha) * A[noun, noun] * B[noun, ladakaa],
                             α_verb(vaha) * A[verb, noun] * B[noun, ladakaa] }

Viterbi algorithm
Termination: the last tag of the most likely sequence:
phi(T+1) = ARGMAX { α_det(hansaa), α_noun(hansaa), α_verb(hansaa) }

Viterbi algorithm
The most likely sequence is then obtained by backtracking through the phi pointers. A compact implementation sketch follows.
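Putting initialization, recursion, backpointers and backtracking together: a minimal Viterbi sketch in Python over the same dict-based PI, A, B (a toy illustration; real implementations such as TnT add smoothing and work in log space to avoid underflow).

```python
def viterbi(words, tagset, PI, A, B):
    # alpha[i][t]: probability of the best path ending in tag t at word i.
    # phi[i][t]:   previous tag on that best path (backpointer).
    alpha = [{t: PI.get(t, 0.0) * B.get((t, words[0]), 0.0) for t in tagset}]
    phi = [{}]
    for i in range(1, len(words)):
        alpha.append({})
        phi.append({})
        for t in tagset:
            prev = max(tagset, key=lambda p: alpha[i - 1][p] * A.get((p, t), 0.0))
            alpha[i][t] = (alpha[i - 1][prev] * A.get((prev, t), 0.0)
                           * B.get((t, words[i]), 0.0))
            phi[i][t] = prev
    # Termination: best final tag, then backtrack through the backpointers.
    best = max(tagset, key=lambda t: alpha[-1][t])
    path = [best]
    for i in range(len(words) - 1, 0, -1):
        path.append(phi[i][path[-1]])
    return list(reversed(path))
```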

Preliminary Results
POS tagging for Indian languages:
- Training set = tokens; testing set = tokens (sizes as on the slide); tags = 26
- Most frequent tag labelling = %
- Hidden Markov Models = %
Needs improvement, by experimenting with a variety of tags and tokens (some experiments on the chunking task are shown in the following slides).

Preliminary Results
Most common error seen: NNP and NNC mis-tagged as NN. There is an opportunity to carry out experiments to eliminate such errors as part of the NLPAI shared task, 2006 (introduced at the end).

Introduction to TnT
- An efficient implementation of the Viterbi algorithm for 2nd-order Markov chains (trigram approximation).
- Language independent: can be trained on any corpus.
- Easy to use.

Introduction to TnT
Four main programs:
- tnt-para: trains the model (parameter generation). Usage: tnt-para [options]
- tnt: tagging. Usage: tnt [options]
- tnt-diff: compares two files to get precision/recall figures. Usage: tnt-diff [options]
- tnt-wc: counts tokens (words) and types (POS tag/chunk tag) in different files. Usage: tnt-wc [options]

Introduction to TnT
Training file format: token and tag separated by white space, one token per line. Example:
%
nirAlA NNP
kI PREP
sAhiwya NN
%
(blank line: a new sentence starts)
yahAz PRP
yaha PRP
aXikAMRa JJ

Introduction to TnT
- Testing file: consists of only the first column (the tokens).
- Other files, used to store the model: the .lex file, the .123 file and the .map file.
Demo 1.

An Example (Chunk boundary identification)

Chunking with TnT
Chunk tags:
- STRT: a chunk starts at this token
- CNT: this token lies in the middle of a chunk
- STP: this token lies at the end of a chunk
- STRT_STP: this token lies in a chunk of its own
Chunk tag schemes:
- 2-tag scheme: {STRT, CNT}
- 3-tag scheme: {STRT, CNT, STP}
- 4-tag scheme: {STRT, CNT, STP, STRT_STP}
A small sketch of deriving these tags from bracketed chunks follows.
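A minimal Python sketch (assuming chunks come as lists of token lists) of how the 4-tag scheme can be read off chunk boundaries, and how a 4-tag sequence reduces to the 2-tag scheme; the reduction mirrors the "train on a larger tagset, reduce to a smaller one" trick used in the experiments below.

```python
def chunk_tags_4(chunks):
    """chunks: list of chunks, each a list of tokens, e.g. [["are", "watching"], ...]."""
    tags = []
    for chunk in chunks:
        if len(chunk) == 1:
            tags.append("STRT_STP")          # single-token chunk
        else:
            tags += ["STRT"] + ["CNT"] * (len(chunk) - 2) + ["STP"]
    return tags

def reduce_to_2(tags4):
    # The 2-tag scheme only marks where chunks start; everything else is CNT.
    return ["STRT" if t in ("STRT", "STRT_STP") else "CNT" for t in tags4]

chunks = [["Children"], ["are", "watching"], ["some", "programmes"]]
print(chunk_tags_4(chunks))               # ['STRT_STP', 'STRT', 'STP', 'STRT', 'STP']
print(reduce_to_2(chunk_tags_4(chunks)))  # ['STRT', 'STRT', 'CNT', 'STRT', 'CNT']
```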

Input Tokens
What kinds of input tokens can we use?
- Word only: the simplest.
- POS tag only: use only the part-of-speech tag of the word.
- Combinations of the above: Word_POStag (word followed by POS tag) or POStag_Word (POS tag followed by word). A sketch of building these token forms follows.
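For concreteness, a small helper (illustrative Python, assuming parallel word and POS lists) that produces each input-token form:

```python
def make_tokens(words, pos_tags, form="Word_POStag"):
    """Build TnT input tokens in one of the four forms discussed above."""
    if form == "Word":
        return list(words)
    if form == "POStag":
        return list(pos_tags)
    if form == "Word_POStag":
        return [f"{w}_{p}" for w, p in zip(words, pos_tags)]
    if form == "POStag_Word":
        return [f"{p}_{w}" for w, p in zip(words, pos_tags)]
    raise ValueError(f"unknown form: {form}")

print(make_tokens(["vaha", "ladakaa"], ["PRP", "NN"], "Word_POStag"))
# ['vaha_PRP', 'ladakaa_NN']
```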

Chunking with TnT: Experiments
Training corpus = tokens; testing corpus = tokens (sizes as on the slide).
A trick to improve learning is to train on the larger tagset and then reduce it to the smaller one; there is no loss of information, as all the tag schemes convey the same boundary information. Best results (precision = 85.6%) were obtained with:
- input tokens of the form 'Word_POS'
- the learning trick: 4 tags reduced to 2

Chunking with TnT: Improvement
85.6% is not good enough. The model improves (precision = 88.63%) when contextual information (POS tags of the surrounding words) is added. (Example on the slide.)

Chunking with TnT: Improvements
For experiments which lead to further improvements in chunk boundary identification, see: Akshay Singh, Sushama Bendre and Rajeev Sangal, "HMM Based Chunker for Hindi", in Second International Joint Conference on Natural Language Processing (IJCNLP): Companion Volume including Posters/Demos and Tutorial Abstracts.

Chunk labelling & Results
Chunk labelling: chunks which have been identified have to be labelled as noun chunks, verb chunks etc. Rule-based chunk labelling performed best.
Results:
- Final chunk boundary identification accuracy = 92.6%
- Chunk boundary identification + chunk labelling = 91.5%

Shared task
For information on the shared task, refer to the flyer on the NLPAI Shared Task 2006.

Thank you