7 November 2006 University of Tehran Persian POS Tagging Hadi Amiri Database Research Group (DBRG) ECE Department, University of Tehran.

Presentation transcript:

7 November 2006 University of Tehran Persian POS Tagging Hadi Amiri Database Research Group (DBRG) ECE Department, University of Tehran

Outline
- What is POS tagging
- How is data tagged for POS?
- Tagged Corpora
- POS Tagging Approaches
- Corpus Training
- How to Evaluate a tagger?
- Bijankhan Corpus
- Memory Based POS
- MLE Based POS
- Neural Network POS Tagger

What is POS tagging
Annotating each word for its part of speech (grammatical type) in a given sentence.
e.g. I/PRP would/MD prefer/VB to/TO study/VB at/IN a/DT traditional/JJ school/NN
Properties:
- It helps parsing
- It resolves pronunciation ambiguities
  As the water grew colder, their hands grew number. (number=ADJ, not N)
- It resolves semantic ambiguities
  Patients can bear pain.
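A minimal sketch, assuming the word/TAG notation of the example sentence with one slash-separated tag per token, of how such an annotated sentence can be read into (word, tag) pairs. The function name `parse_tagged` is illustrative and not part of any tool mentioned in these slides.

```python
def parse_tagged(sentence: str) -> list[tuple[str, str]]:
    """Split a 'word/TAG word/TAG ...' string into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        # rpartition splits on the LAST slash, so a word that itself
        # contains '/' (e.g. '1/2/CD') keeps its slash.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


example = "I/PRP would/MD prefer/VB to/TO study/VB at/IN a/DT traditional/JJ school/NN"
for word, tag in parse_tagged(example):
    print(f"{word:12s} {tag}")
```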

POS Application
Part-of-speech (POS) tagging is important for many applications:
- Word sense disambiguation
- Parsing
- Language modeling
- Q&A and Information extraction
- Text-to-speech
Tagging techniques can be used for a variety of tasks:
- Semantic tagging
- Dialogue tagging
- Information Retrieval
- ...

POS Tags
N       noun         baby, toy
V       verb         see, kiss
ADJ     adjective    tall, grateful, alleged
ADV     adverb       quickly, frankly, ...
P       preposition  in, on, near
DET     determiner   the, a, that
WhPron  wh-pronoun   who, what, which, ...
COORD   coordinator  and, or
Open class: N, V, ADJ, ADV

POS Tags
There is no standard set of POS tags
- Some use coarse classes: e.g., N, V, A, Aux, ...
- Others prefer finer distinctions (e.g., Penn Treebank):
  PRP: personal pronouns (you, me, she, he, them, him, ...)
  PRP$: possessive pronouns (my, our, her, his, ...)
  NN: singular common nouns (sky, door, theorem, ...)
  NNS: plural common nouns (doors, theorems, women, ...)
  NNP: singular proper names (Fifi, IBM, Canada, ...)
  NNPS: plural proper names (Americas, Carolinas, ...)

How is data tagged for POS?
We are trying to model human performance, so we have humans tag a corpus and try to match their performance.
To create a model:
- A corpus is hand-tagged for POS by more than one annotator
- The annotations are then checked for reliability

History
- Penn Treebank Corpus (WSJ, 4.5M)
- Brown Corpus created (EN-US), 1 million words
- Brown Corpus tagged
- HMM tagging (CLAWS): 93%-95%
- Greene and Rubin, rule based: 70%
- LOB Corpus created (EN-UK), 1 million words
- DeRose/Church, efficient HMM, sparse data: 95%+
- British National Corpus (tagged by CLAWS)
- POS tagging separated from other NLP
- Transformation-based tagging (Eric Brill), rule based: 95%+
- Tree-based statistics (Helmut Schmid), rule based: 96%+
- Neural network: 96%+
- Trigram tagger (Kempe): 96%+
- Combined methods: 98%+
- LOB Corpus tagged

Tagged Corpora
Corpus              # Tags  # Tokens
Brown               87      1 million
British Natl        61      100 million
Penn Treebank       45      4.8 million
Original Bijankhan  550     ?
Bijankhan           40      2.6 million

POS Tagging Approaches
POS Tagging
- Supervised: Rule-Based, Stochastic, Neural
- Unsupervised: Rule-Based, Stochastic, Neural

Rule-Based POS Tagger
Lexicon with tags identified for each word, e.g. the entry for "that":
- ADV
- PRON DEM SG
- DET CENTRAL DEM SG
- CS
Constraints to eliminate tags:
If
- the next word is adj, adv, or quant
- and the word following it is a sentence boundary
- and the previous word is not a consider-type verb
Then
- eliminate non-ADV tags
Example: He was that drunk.
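The constraint on this slide can be sketched as a small function. This is only an illustration under stated assumptions: the tag names `ADJ`, `ADV`, `QUANT`, the tiny `CONSIDER_TYPE_VERBS` list, and the function signature are hypothetical and do not come from any particular rule-based tagger.

```python
# Hypothetical, deliberately tiny lexicon of "consider-type" verbs.
CONSIDER_TYPE_VERBS = {"consider"}
# Coarse tags that license the adverbial reading of "that" (assumed names).
ADV_CONTEXT_TAGS = {"ADJ", "ADV", "QUANT"}


def apply_that_constraint(prev_word, candidate_tags, next_tag, next_is_sentence_final):
    """Return the tags that survive the constraint for one occurrence of 'that'.

    prev_word              -- the word before 'that'
    candidate_tags         -- set of tags listed for 'that' in the lexicon
    next_tag               -- coarse tag of the word after 'that'
    next_is_sentence_final -- True if a sentence boundary follows that word
    """
    if (
        next_tag in ADV_CONTEXT_TAGS
        and next_is_sentence_final
        and prev_word.lower() not in CONSIDER_TYPE_VERBS
    ):
        # All three conditions hold: eliminate every non-ADV tag.
        return {tag for tag in candidate_tags if tag == "ADV"}
    # Otherwise the constraint does not fire and nothing is eliminated.
    return set(candidate_tags)


that_tags = {"ADV", "PRON DEM SG", "DET CENTRAL DEM SG", "CS"}
# "He was that drunk." : next word 'drunk' is an adjective, then a boundary.
print(apply_that_constraint("was", that_tags, "ADJ", True))
```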

Probabilistic POS Tagging
Provides the possibility of automatic training rather than painstaking rule revision.
Automatic training means that a tagger can be easily adapted to new text domains.
E.g.
- A moving/VBG house
- A moving/JJ ceremony

Probabilistic POS Tagging
- Needs a large tagged corpus for training
- Unigram statistics (most common part-of-speech for each word) get us to about 90% accuracy
- For greater accuracy, we need some information on adjacent words

Corpus Training
The probabilities in a statistical model come from the corpus it is trained on.
- If the corpus is too domain-specific, the model may not be portable to other domains.
- If the corpus is too general, it will not capitalize on the advantages of domain-specific probabilities.

Tagger Evaluation
Once a tagging model has been built, how is it tested?
- Typically, a corpus is split into a training set (usually ~90% of the data) and a test set (10%).
- The test set is held out from the training.
- The tagger learns the tag sequences that maximize the probabilities for that model.
- The tagger is tested on the test set.
The tagger is not trained on the test data, but the test data is highly similar to the training data.
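The evaluation procedure above can be sketched as follows. This is a minimal illustration, assuming the corpus is a list of sentences and that accuracy is measured per token; the helper names `split_corpus` and `accuracy`, the shuffling, and the fixed seed are choices made for this sketch only.

```python
import random


def split_corpus(sentences, test_fraction=0.1, seed=0):
    """Shuffle the sentences and hold out a fraction of them as the test set."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    # The first n_test sentences are held out; the rest are used for training.
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted_tags, gold_tags):
    """Fraction of tokens whose predicted tag equals the hand-assigned tag."""
    if len(predicted_tags) != len(gold_tags):
        raise ValueError("predicted and gold sequences differ in length")
    if not gold_tags:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted_tags, gold_tags))
    return correct / len(gold_tags)


train, test = split_corpus([f"sentence {i}" for i in range(100)])
print(len(train), len(test))
print(accuracy(["N", "V", "N"], ["N", "V", "ADJ"]))
```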

Current Performance
How many tags are correct?
- About 98% currently
- But baseline is already 90%
- Baseline algorithm: tag every word with its most frequent tag; tag unknown words as nouns
How well do people do?

Memory Based Part Of Speech Tagging Experiments With Persian Text

Corpus Study
At first the corpus had 550 tags.
The content is gathered from daily news and common texts.
Each document is assigned a subject such as political, cultural and so on.
- In total, there are 4300 different subjects.
- This subject categorization provides an ideal experimental environment for clustering, filtering, and categorization research.
In this research, we simply ignored the subject categories of the documents and concentrated on POS tags.

Selecting Suitable Tags
First, the frequency of each tag was gathered.
Then many of the tags were grouped together and a smaller tag set was produced.
Each tag in the tag set is placed in a hierarchical structure.
- As an example, consider the tag “N_PL_LOC”:
  N stands for a noun
  PL describes the plurality of the tag
  LOC defines the tag as about locations
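A small sketch of how such hierarchical tags might be handled in code, assuming the levels are always separated by underscores as in “N_PL_LOC”. The function names `tag_levels` and `coarsen` are hypothetical.

```python
def tag_levels(tag: str) -> list[str]:
    """Split a hierarchical tag into its levels, most general first."""
    return tag.split("_")


def coarsen(tag: str, depth: int = 1) -> str:
    """Keep only the first `depth` levels of a hierarchical tag."""
    return "_".join(tag_levels(tag)[:depth])


print(tag_levels("N_PL_LOC"))
print(coarsen("N_PL_LOC"))
print(coarsen("N_PL_LOC", depth=2))
```

Truncating tags this way is one simple route to a smaller tag set; whether it matches the grouping actually used for the corpus is not stated in the slides.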

The Tags Distribution

Max, Min, AVG, Total # of Tags in The Training Set

Number of Different Tags
For instance, the word “آسمان” which means “the sky” in English is always tagged with "N_SING" in the whole corpus; but a word like “بالا” which means “high or above” has been tagged by several tags ("ADJ_SIM", "ADV", "ADV_NI", "N_SING", "P", and "PRO").
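Counting the distinct tags per word can be sketched as below. The toy data reuses the two words from the slide, but the individual occurrences are made up for illustration; the function names are hypothetical.

```python
from collections import defaultdict


def tags_per_word(tagged_words):
    """Map each word to the set of distinct tags it carries in the corpus."""
    seen = defaultdict(set)
    for word, tag in tagged_words:
        seen[word].add(tag)
    return dict(seen)


def ambiguous_words(tagged_words):
    """Keep only the words that occur with more than one tag."""
    return {w: t for w, t in tags_per_word(tagged_words).items() if len(t) > 1}


# Toy (word, tag) occurrences, invented for this example.
toy_corpus = [
    ("آسمان", "N_SING"),
    ("بالا", "ADV"),
    ("بالا", "P"),
    ("آسمان", "N_SING"),
    ("بالا", "ADJ_SIM"),
]
for word, tags in tags_per_word(toy_corpus).items():
    print(word, len(tags), sorted(tags))
```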

Classifying the Rare Words
Tags that occur fewer than 5000 times in the corpus are gathered into the “ETC” group.
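A minimal sketch of this grouping step, assuming the corpus tags are available as a flat list. The function name `group_rare_tags` and its parameters are illustrative; only the threshold of 5000 and the “ETC” label come from the slide.

```python
from collections import Counter


def group_rare_tags(tags, min_count=5000, rare_label="ETC"):
    """Replace every tag that occurs fewer than min_count times by rare_label."""
    counts = Counter(tags)
    return [tag if counts[tag] >= min_count else rare_label for tag in tags]


# Tiny demonstration with a threshold of 2 instead of 5000.
print(group_rare_tags(["N", "N", "V", "N", "INT"], min_count=2))
```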

Bijankhan Corpus

Implemented Methods
- MLE Based POS Tagger
- Neural Network POS Tagger
- Memory Based POS Tagger

Memory-Based POS Tagging
Memory-based POS tagging is also called Lazy Learning, Example Based Learning or Case Based Learning.
MBT uses some specifications of each word, such as its possible tags, and a fixed-width context as features.
We used MBT, a tool for memory based tagger generation and tagging. (available at:

Memory-Based POS Tagging
The MBT tool generates a tagger by working through the annotated corpus and creating three data structures:
- a lexicon, associating words to tags as evident in the training corpus
- a case base for known words (words occurring in the lexicon)
- a case base for unknown words.
Selecting appropriate feature sets for known and unknown words has an important impact on the accuracy of the results.

Memory-Based POS Tagging
After different experiments, we chose “ddfa” as the feature set for known words. With “ddfa”, the appropriate tag for each known word is chosen based on the tags of the two words before it and the possible tags of the word after it.
- d stands for a disambiguated tag
- f means the focus (current) word
- a is the ambiguous word after the current word
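To make the “ddfa” pattern concrete, here is a rough sketch of building one feature tuple for a known word. It is not MBT's own code or file format. It assumes that the focus word and the following word are each represented by their set of possible tags joined into a single symbol, and that positions outside the sentence get a padding symbol; `PAD`, `ambitag` and `ddfa_features` are hypothetical names.

```python
PAD = "_"  # hypothetical padding symbol for positions outside the sentence


def ambitag(word, lexicon):
    """Join a word's possible tags into one symbol, e.g. 'N-V'."""
    return "-".join(sorted(lexicon.get(word, {"UNKNOWN"})))


def ddfa_features(words, assigned_tags, i, lexicon):
    """Build the (d, d, f, a) features for the word at position i.

    assigned_tags holds the tags already chosen for words[0:i].
    """
    d2 = assigned_tags[i - 2] if i >= 2 else PAD   # tag two words back
    d1 = assigned_tags[i - 1] if i >= 1 else PAD   # tag of previous word
    f = ambitag(words[i], lexicon)                 # focus word
    a = ambitag(words[i + 1], lexicon) if i + 1 < len(words) else PAD
    return (d2, d1, f, a)


lexicon = {"patients": {"N"}, "can": {"MD", "N"}, "bear": {"V", "N"}, "pain": {"N"}}
words = ["patients", "can", "bear", "pain"]
print(ddfa_features(words, ["N", "MD"], 2, lexicon))
```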

Memory-Based POS Tagging
The feature set chosen for unknown words is “dFass”.
- d is the disambiguated tag of the word before the current word
- a stands for the ambiguous tags of the word after the current word
- ss are the two suffix letters of the current word
The F in the unknown-word features indicates the position of the focus word; it is not included in the actual feature set for tagging.
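The “dFass” pattern can be sketched in the same spirit. Again this is only an illustration under assumptions, not MBT's implementation: the following word is represented by its possible tags joined into one symbol, missing context or letters are padded, and the names `PAD`, `ambitag` and `dfass_features` are hypothetical.

```python
PAD = "_"  # hypothetical padding symbol for missing context or letters


def ambitag(word, lexicon):
    """Join a word's possible tags into one symbol, e.g. 'N-V'."""
    return "-".join(sorted(lexicon.get(word, {"UNKNOWN"})))


def dfass_features(words, assigned_tags, i, lexicon):
    """Build the (d, a, s, s) features for the unknown word at position i.

    The focus word itself (the F in the pattern) contributes only its last
    two letters; it is not a feature on its own.
    """
    word = words[i]
    d = assigned_tags[i - 1] if i >= 1 else PAD
    a = ambitag(words[i + 1], lexicon) if i + 1 < len(words) else PAD
    s1 = word[-2] if len(word) >= 2 else PAD  # second-to-last letter
    s2 = word[-1] if word else PAD            # last letter
    return (d, a, s1, s2)


lexicon = {"the": {"DET"}, "cat": {"N"}}
print(dfass_features(["the", "glorping", "cat"], ["DET"], 1, lexicon))
```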

MBT Results- Known Words “ddfa”

MBT Results- Unknown Words “dFass”

MBT Results- Overall

Implemented Methods
- Neural Network POS Tagger
- MLE Based POS Tagger
- Memory Based POS Tagger

Maximum Likelihood Estimation
As a benchmark of POS tagging accuracy, we chose the Maximum Likelihood Estimation (MLE) approach.
- Calculate the maximum likelihood probability of each tag assigned to any word in the training set.
- Choose the tag with the greatest maximum likelihood probability (the designated tag) for each word and make it the only tag assignable to that word.
To evaluate this method, we analyze the words in the test set and assign them their designated tags.
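The two steps above can be sketched as follows, assuming the maximum likelihood probability of a tag for a word is its relative frequency, count(word, tag) / count(word), in the training set. The function names and the toy data are illustrative; `unknown_tag` stands for whatever fallback tag is assigned to words not seen in training.

```python
from collections import Counter, defaultdict


def mle_table(tagged_words):
    """Estimate P(tag | word) as count(word, tag) / count(word)."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word][tag] += 1
    table = {}
    for word, tag_counts in counts.items():
        total = sum(tag_counts.values())
        table[word] = {tag: n / total for tag, n in tag_counts.items()}
    return table


def designated_tags(table):
    """Pick, for each word, the tag with the highest estimated probability."""
    return {word: max(probs, key=probs.get) for word, probs in table.items()}


def tag_words(words, designated, unknown_tag="DEFAULT"):
    """Assign each word its designated tag, or unknown_tag if never seen."""
    return [designated.get(word, unknown_tag) for word in words]


train = [("bear", "V"), ("bear", "V"), ("bear", "N"), ("pain", "N")]
table = mle_table(train)
designated = designated_tags(table)
print(table["bear"])
print(tag_words(["bear", "pain", "unseen"], designated, unknown_tag="N_SING"))
```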

Maximum Likelihood Estimation
Occurrence  Word          Tag       MLE
1           پدرانه        ADV_NI
            پدرانه        ADJ_SIM
            پديدار        ADJ_SIM
            پديدار        N_SING
            پذيرفته       N_SING
            پذيرفته       ADJ_SIM
            پذيرفته       V_PA
            پذيرفته       ADJ_INO
            پراكنده اند   V_PRE
            پراكنده اند   V_PA      0.5000

MLE Results-Known Words

MLE Results- Unknown Words, “DEFAULT”
For each unknown word we assign the “DEFAULT” tag.

MLE Results- Overall, “DEFAULT”
For each unknown word we assign the “DEFAULT” tag.

MLE Results- Unknown Words, “N_SING”
For each unknown word we assign the “N_SING” tag.

MLE Results- Overall, “N_SING”
For each unknown word we assign the “N_SING” tag, the most frequently assigned tag.

Comparison With Other Languages

Implemented Methods
- MLE Based POS Tagger
- Neural Network POS Tagger
- Memory Based POS Tagger

Neural Network
Each unit corresponds to one of the tags in the tag set.
[Figure: network diagram with inputs for the preceding words and the following words]

Neural Network
For each POS tag pos_j and each of the p+1+f words in the context, there is an input unit whose activation in_{i,j} represents the probability that word_i has part of speech pos_j.
Input representation for the currently tagged word and the following words:
The activation value for the preceding words:
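A rough sketch of how such an input vector could be assembled, under the assumption (not confirmed by the text above) that the current and following words are encoded by their lexical tag probabilities and the preceding words by the network's earlier output activations. All names here are illustrative.

```python
def build_input(previous_outputs, lexical_probs, tagset):
    """Concatenate one activation per (context word, tag) pair.

    previous_outputs -- one dict tag -> activation per preceding word (p items),
                        assumed to be the network outputs from earlier steps
    lexical_probs    -- one dict tag -> probability for the current word and
                        each following word (1 + f items)
    tagset           -- ordered list of tags; fixes the unit order per word
    """
    vector = []
    for activations in previous_outputs:
        vector.extend(activations.get(tag, 0.0) for tag in tagset)
    for probs in lexical_probs:
        vector.extend(probs.get(tag, 0.0) for tag in tagset)
    return vector


tagset = ["N", "V"]
x = build_input(
    previous_outputs=[{"N": 0.9, "V": 0.1}],              # p = 1
    lexical_probs=[{"N": 0.5, "V": 0.5}, {"N": 1.0}],     # current word, f = 1
    tagset=tagset,
)
print(len(x), x)
```

With p preceding words, f following words and a tag set of size T, the vector has (p+1+f)·T entries, one per input unit described on the slide.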

Neural Network Results on Bijankhan Corpus
Training Algorithm        No. of Hidden Layers  No. of Input for Train  Training Duration (Hour)  No. of Input for Test  Accuracy
MLP                       2                     1mil                    120:00:                                          Too Low
MLP                       3                     1mil                    ?                         1000                   Too Low
Generalized Feed Forward  1                     1mil                    95:30:57                  1000                   Too Low
Generalized Feed Forward  2                     1mil                    ?                         1000                   Too Low
Generalized Feed Forward                                                :53:35                    1000                   58%

Neural Network on Other Languages
English

Neural Network on Other Languages
Chinese

Future Work
- Using more than one level of POS tags
- Unsupervised POS tagging using the Hamshahri Collection
- Investigation of other methods for Persian POS tagging, such as Support Vector Machine (SVM) based tagging
- KASRE YE EZAFE in Persian!

Thank You
Space for Question?