Hindi Parts-of-Speech Tagging & Chunking
Baskaran S, MSRI



4 July 2006, NWAI

What's in?
Why POS tagging & chunking?
Approach
Challenges
  Unseen tag sequences
  Unknown words
Results
Future work
Conclusion

Intro & Motivation

POS: Parts-of-Speech
Dionysius Thrax (ca. 100 BC)
8 types: noun, verb, pronoun, preposition, adverb, conjunction, participle and article

"I get my thing in action. (Verb, that's what's happenin')
To work, (Verb!) To play, (Verb!) To live, (Verb!) To love... (Verb!...)"
- Schoolhouse Rock

Tagging
Assigning the appropriate POS or lexical class marker to each word in a given text
Symbols, punctuation marks etc. are also assigned specific tag(s)

Why POS tagging?
Gives significant information about a word and its neighbours
  Adjective near noun
  Adverb near verb
Gives a clue to how a word is pronounced
  OBject as noun
  obJECT as verb
Useful for speech synthesis, full parsing of sentences, IR, word sense disambiguation etc.

Chunking
Identifying simple phrases
  Noun phrase, verb phrase, adjectival phrase…
Useful as a first step to
  Parsing
  Named entity recognition

POS tagging & Chunking

Stochastic approaches
Availability of large tagged corpora
Most are based on HMMs
  Weischedel ’93
  DeRose ’88
  Skut and Brants ’98 (extending HMM to chunking)
  Zhou and Su ’00
  and lots more…

HMM
Tag-sequence probability
Word-emit probability
Both are estimated from an annotated corpus
Assumptions
  The probability of a word depends only on its tag
  The tag history is approximated by the two most recent tags
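Under the two assumptions above, the tagging objective can be written out explicitly. What follows is the standard trigram HMM formulation, not a formula recovered from the slides; the symbols $w_i$, $t_i$, $n$ and the count function $C(\cdot)$ are introduced here for illustration only.

```latex
\hat{t}_{1..n} = \arg\max_{t_1 \dots t_n} \prod_{i=1}^{n}
  \underbrace{P(t_i \mid t_{i-2}, t_{i-1})}_{\text{tag-sequence probability}}
  \;\underbrace{P(w_i \mid t_i)}_{\text{word-emit probability}}

% Relative-frequency estimates from an annotated corpus (before any smoothing):
P(t_i \mid t_{i-2}, t_{i-1}) \approx \frac{C(t_{i-2}, t_{i-1}, t_i)}{C(t_{i-2}, t_{i-1})},
\qquad
P(w_i \mid t_i) \approx \frac{C(w_i, t_i)}{C(t_i)}
```

Any trigram or word/tag pair with a zero count makes the whole product zero, which is why unseen tag sequences and unknown words need separate treatment.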

Structural tags
A triple: POS tag, structural relation and chunk tag
Originally proposed by Skut & Brants ’98
Seven relations
Enables embedded and overlapping chunks

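Structural relations
[Figure: structural relation codes (00, 90, 09, 99) shown on NP and VG chunks of the example sentence below, with Beg, End and SSF markers]
परीक्षा में भी प्रथम श्रेणी प्राप्त की और विद्यालय में कुलपति द्वारा विशेष पुरस्कार भी उन्हीं को प्राप्त हुआ ।
(Approximate gloss: "... also obtained first class in the examination, and the special prize from the vice-chancellor at the school also went to the same person.")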

Decoding
Viterbi is mostly used (also A* or stack decoding)
Aims at finding the best path (tag sequence) for a given observation sequence
Possible tags are identified for each transition, with associated probabilities
The best path is the one that maximizes the product of these transition probabilities
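The decoding step can be sketched in a few lines of Python. This is a minimal illustration, not the system described in the talk: it uses a bigram (one-tag history) model for brevity where the slides assume a two-tag history, and the function name `viterbi`, the probability floor and all toy probabilities are assumptions made up for the example.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence under a bigram HMM (log space)."""
    if not words:
        return []
    floor = 1e-12  # crude stand-in for smoothing, avoids log(0) on unseen events

    def lp(table, key):
        return math.log(table.get(key, floor))

    # best[i][t] = (log-probability of the best path ending in tag t at word i,
    #               previous tag on that path)
    best = [{t: (lp(start_p, t) + lp(emit_p, (t, words[0])), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p][0] + lp(trans_p, (p, t))) for p in tags),
                key=lambda item: item[1],
            )
            column[t] = (score + lp(emit_p, (t, words[i])), prev)
        best.append(column)

    # Pick the best final tag, then follow the back-pointers.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Toy model: every number below is invented for illustration.
TAGS = ["NN", "VFM", "SYM"]
START = {"NN": 0.8, "VFM": 0.1, "SYM": 0.1}
TRANS = {("NN", "NN"): 0.3, ("NN", "VFM"): 0.6, ("NN", "SYM"): 0.1,
         ("VFM", "SYM"): 0.8, ("VFM", "NN"): 0.2}
EMIT = {("NN", "रूप"): 0.5, ("VFM", "आया"): 0.6, ("SYM", "।"): 1.0}

print(viterbi(["रूप", "आया", "।"], TAGS, START, TRANS, EMIT))
```

Working in log space turns the product of probabilities into a sum, which avoids numeric underflow on long sentences. Extending the sketch to a two-tag history means keeping one lattice cell per tag pair instead of per tag.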

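[Figure: decoding example for the sentence below, with the candidate tags JJ, NLOC, NN, PREP, PRP, QFN, RB, VFM and SYM]
अब जीवन का एक अन्य रूप उनके सामने आया ।
(Approximate gloss: "Now another form of life appeared before them.")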


Issues

1. Unseen tag sequences
Smoothing (Add-One, Good-Turing) and/or backoff (deleted interpolation)
The idea is to redistribute a fraction of the probability mass of seen events to unseen ones
Good-Turing
  Re-estimates the probability mass of lower-count N-grams from that of higher counts
  $c^* = (c + 1)\,N_{c+1} / N_c$, where $N_c$ is the number of N-grams occurring $c$ times
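The Good-Turing re-estimation can be sketched as follows. This is a minimal sketch of the textbook count adjustment, not the smoothing code used in the reported experiments; the function names, the fallback to the raw count when no N-gram occurs c + 1 times, and the toy tag-bigram counts are all assumptions for illustration.

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """Map each observed count c to the adjusted count (c + 1) * N_{c+1} / N_c."""
    freq_of_freq = Counter(ngram_counts.values())  # N_c for every observed c
    adjusted = {}
    for c, n_c in freq_of_freq.items():
        n_next = freq_of_freq.get(c + 1, 0)
        # Assumption: keep the raw count when N_{c+1} is zero, a common patch
        # for sparse high counts.
        adjusted[c] = (c + 1) * n_next / n_c if n_next else float(c)
    return adjusted

def unseen_mass(ngram_counts):
    """Probability mass reserved for unseen N-grams: N_1 / N."""
    total = sum(ngram_counts.values())
    n_1 = sum(1 for c in ngram_counts.values() if c == 1)
    return n_1 / total

# Toy tag-bigram counts, invented for illustration.
toy_counts = {
    ("NN", "PREP"): 3,
    ("PREP", "NN"): 2, ("NN", "VFM"): 2,
    ("JJ", "NN"): 1, ("QFN", "JJ"): 1, ("VFM", "SYM"): 1,
}
print(good_turing_counts(toy_counts))
print(unseen_mass(toy_counts))
```

In the toy data three bigrams occur once and two occur twice, so a count of 1 is discounted to 2 × 2 / 3, and the mass freed this way is what becomes available to unseen tag sequences.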

2. Unseen words
Insufficient corpus (even after 10 million words)
Not all of them are proper names
Treat them as rare words that occur once in the corpus (Baayen and Sproat ’96, Dermatas and Kokkinakis ’95)
Known Hindi corpus of 25 K words and unseen corpus of 6 K words
All words vs. hapax vs. unknown
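The hapax idea can be illustrated with a short sketch: collect the tags of words that occur exactly once in the training data and use their relative frequencies as a guess at the tag distribution of unknown words. The function name `hapax_tag_distribution` and the toy corpus are hypothetical, and how the talk's system plugs such an estimate into the HMM is not stated in the slides.

```python
from collections import Counter

def hapax_tag_distribution(tagged_corpus):
    """Estimate P(tag | unknown word) from words that occur exactly once."""
    word_freq = Counter(word for word, _ in tagged_corpus)
    hapax_tags = Counter(tag for word, tag in tagged_corpus if word_freq[word] == 1)
    total = sum(hapax_tags.values())
    if total == 0:
        return {}
    return {tag: n / total for tag, n in hapax_tags.items()}

# Toy tagged corpus, invented for illustration.
toy_corpus = [
    ("जीवन", "NN"), ("का", "PREP"), ("रूप", "NN"), ("का", "PREP"),
    ("आया", "VFM"), ("।", "SYM"), ("।", "SYM"),
]
print(hapax_tag_distribution(toy_corpus))
```

Frequent closed-class items (here the postposition and the sentence marker) drop out, so the estimate is dominated by open classes such as nouns and verbs. The "All words vs. hapax vs. unknown" comparison on the slide is a way to check whether that assumption holds for the data.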

Tag distribution analysis

3. Features
Can we use other features?
  Capitalization
  Word endings and hyphenation
Weischedel ’93 reports about 66% reduction in error rate with word endings and hyphenation
Capitalization, though useful for proper nouns, is not very effective

3. Features (contd.)
String length
Prefix & suffix (fixed character width)
Character encoding range
Complete analysis remains to be done
Expected to be very effective for morphologically rich languages
To be tried out on Tamil
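The candidate features listed above are all cheap to compute from the surface form. The sketch below is one possible reading of them, not the feature extractor from the talk: the function name `word_features`, the default width of 2 and the interpretation of "character encoding range" as a Unicode block test (Devanagari, U+0900 to U+097F) are assumptions.

```python
def word_features(word, width=2):
    """Surface features for one token: length, fixed-width affixes, script range."""
    return {
        # Length in code points; Devanagari vowel signs count separately.
        "length": len(word),
        "prefix": word[:width],
        "suffix": word[-width:],
        # True when every character lies in the Devanagari Unicode block.
        "devanagari": all("\u0900" <= ch <= "\u097f" for ch in word),
    }

print(word_features("running", width=3))
print(word_features("रूप"))
```

For a morphologically rich language the suffix feature is the interesting one, since inflectional endings often signal the word class even when the stem has never been seen.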

4. Multi-part words
Examples
  In/ terms/ of/
  United/ States/ of/ America/
More problematic in Hindi
  United/NNPC States/NNPC of/NNPC America/NNP
  Central/NNC government/NN
  NNPC: compound proper noun, NNP: proper noun, NNC: compound noun, NN: noun
How does the system identify the last word in a multi-part word?
10% of errors are due to this in Hindi (6 K words tested)

Results

Evaluation metrics
Tag precision
Unseen word accuracy
  % of unseen words that are correctly tagged
  Estimates how well unseen words are handled
% reduction in error
  Reduction in error after the application of a particular feature
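The three metrics can be made concrete with a small sketch. The definitions follow the slide wording as closely as possible, but the function names, the use of a training vocabulary to decide what counts as unseen, and the toy data are assumptions for illustration.

```python
def tag_precision(gold, predicted):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

def unseen_word_accuracy(words, gold, predicted, vocabulary):
    """Fraction of out-of-vocabulary tokens that are tagged correctly."""
    unseen = [(g, p) for w, g, p in zip(words, gold, predicted) if w not in vocabulary]
    if not unseen:
        return None  # nothing to measure
    return sum(g == p for g, p in unseen) / len(unseen)

def error_reduction(baseline_precision, new_precision):
    """Percentage reduction in error rate after adding a feature."""
    baseline_error = 1 - baseline_precision
    new_error = 1 - new_precision
    return 100 * (baseline_error - new_error) / baseline_error

# Toy data, invented for illustration.
words = ["w1", "w2", "w3", "w4", "w5"]
gold = ["NN", "PREP", "NN", "VFM", "SYM"]
pred = ["NN", "PREP", "JJ", "VFM", "SYM"]
vocab = {"w1", "w2", "w4", "w5"}

print(tag_precision(gold, pred))
print(unseen_word_accuracy(words, gold, pred, vocab))
print(error_reduction(0.80, 0.90))
```

Error reduction is relative: moving from 80% to 90% precision halves the error rate even though precision rose by only ten points.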

Results - Tagger
No structural tags → better smoothing
Unseen data: significantly more unknowns
[Table, values not preserved: columns Dev, S-1, S-2, S-3, S-4, Test; rows # words, Correctly tagged, Precision, # Unseen, Correctly tagged (unseen), Unseen precision]

Results – Chunk tagger
Training: 22 K, development data: 8 K
4-fold cross-validation
Test data: 5 K
[Table, values not preserved: columns POS tagging precision, Chunk identification (Pre, Rec), Labelling (Pre, Rec); rows Dev data, Average, Test data]
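Chunk-level precision and recall are usually computed over whole chunks, not over tokens. The sketch below assumes one plausible reading of the two column groups: "identification" scores chunk boundaries only, while "labelling" also requires the chunk label to match. That reading, the `(start, end, label)` span representation, the function names and the toy chunks are assumptions, not details confirmed by the slides.

```python
def precision_recall(gold, predicted):
    """Precision and recall of predicted items against gold items (as sets)."""
    gold, predicted = set(gold), set(predicted)
    hits = len(gold & predicted)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

def chunk_scores(gold_chunks, predicted_chunks):
    """Score chunks given as (start, end, label) spans."""
    def spans(chunks):
        return [(start, end) for start, end, _ in chunks]
    return {
        # Boundaries only: the label is ignored.
        "identification": precision_recall(spans(gold_chunks), spans(predicted_chunks)),
        # Boundaries and label must both match.
        "labelling": precision_recall(gold_chunks, predicted_chunks),
    }

# Toy chunks over a five-token sentence, invented for illustration.
gold_chunks = [(0, 2, "NP"), (2, 3, "VG"), (3, 5, "NP")]
pred_chunks = [(0, 2, "NP"), (2, 3, "NP"), (3, 4, "NP"), (4, 5, "NP")]
print(chunk_scores(gold_chunks, pred_chunks))
```

Under this reading labelling scores can never exceed identification scores, because every correctly labelled chunk is also a correctly identified one.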

Results – Tagging error analysis
Significant issues with nouns/multi-part words
  NNP → NN
  NNC → NN
Also, VAUX → VFM; VFM → VAUX and NVB → NN; NN → NVB

HMM performance (English)
> 96% reported accuracies
About 85% for unknown words
Advantage
  Simple and most suitable when annotated data is available

Conclusion

Future work
Handling unseen words
Smoothing
Can we exploit other features?
  Especially morphological ones
Multi-part words

Summary
Statistical approaches now include linguistic features for higher accuracies
Improvement required
  Tagging
    Precision: 79.22%
    Unknown words: 41.6%
  Chunking
    Precision: 60%
    Recall: 62%