Part-of-Speech Tagging


Part-of-Speech Tagging Updated 22/12/2005

Part-of-Speech Tagging Tagging is the task of labeling (or tagging) each word in a sentence with its appropriate part of speech. The[AT] representative[NN] put[VBD] chairs[NNS] on[IN] the[AT] table[NN]. Or: The[AT] representative[JJ] put[NN] chairs[VBZ] on[IN] the[AT] table[NN]. Tagging is a case of limited syntactic disambiguation: many words have more than one syntactic category. Tagging has limited scope: we just fix the syntactic categories of words and do not do a complete parse. In the second tagging, 'put' is read in the sense of 'sell' (as in 'put an option') and 'chairs' in the sense of chairing (being chairman).

Performance of POS taggers Most successful algorithms disambiguate about 96%-97% of the tokens! The output of taggers is quite useful for information extraction, question answering and shallow parsing.

Some POS tags used in English
AT article
BEZ the word "is"
IN preposition
JJ adjective
JJR comparative adjective
MD modal (may, can, …)
NN singular or mass noun
NNP singular proper noun
NNS plural noun
PERIOD . : ? !
PN personal pronoun
RB adverb
RBR comparative adverb
TO the word "to"
VB verb, base form
VBD verb, past tense
VBG verb, present participle
VBN verb, past participle
VBP verb, non-3rd person singular present
VBZ verb, 3rd person singular present
WDT wh-determiner (what, which, …)
Notes: VBG (present participle) is used with the verb "to be", e.g. "I am reading" / "I was reading". VBN (past participle) is used for all perfect forms of the verb, e.g. (present perfect) "I have taken", (past perfect) "I had taken".

Information Sources in Tagging How do we decide the correct POS for a word? Syntagmatic information: look at the tags of the other words in the context of the word we are interested in. Lexical information: predict the tag based on the word itself. Words with several possible parts of speech usually occur predominantly as one particular POS.

Baselines Using syntagmatic information alone is not very successful. For example, the rule-based tagger of Greene and Rubin (1971) correctly tagged only 77% of the words. Many content words in English can have several parts of speech; for example, a productive process allows almost every noun to be turned into a verb: "I flour the pan." A dumb tagger that simply assigns each word its most common tag performs at about 90% accuracy (Charniak 1993). The distribution of POS per word is typically extremely uneven, hence the strong performance of Charniak's baseline. Note also that the dictionary used by Charniak is based on the Brown corpus, which is also the corpus of application.
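
To make this baseline concrete, here is a minimal sketch in Python (not from the original slides); the corpus format, the tag names, and the default tag for unknown words are assumptions.

```python
from collections import Counter, defaultdict

def train_baseline(tagged_sentences):
    """Learn each word's most frequent tag from sentences given as lists of (word, tag) pairs."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_baseline(words, word_to_tag, default_tag="NN"):
    """Assign every word its most frequent tag; fall back to a default for unknown words."""
    return [(w, word_to_tag.get(w, default_tag)) for w in words]

# Hypothetical toy corpus, just to show the interface.
train = [[("the", "AT"), ("representative", "NN"), ("put", "VBD"), ("chairs", "NNS")]]
print(tag_baseline(["the", "chairs", "table"], train_baseline(train)))
```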

Markov Model Taggers We look at the sequence of tags in a text as a Markov chain. Limited horizon: P(X_{i+1} = t^j | X_1, …, X_i) = P(X_{i+1} = t^j | X_i). Time invariant (stationary): P(X_{i+1} = t^j | X_i) = P(X_2 = t^j | X_1).

The Visible Markov Model Tagging Algorithm The MLE of tag t^k following tag t^j is obtained from the training corpus: a_{jk} = P(t^k | t^j) = C(t^j, t^k) / C(t^j). The probability of a word being emitted by a tag: b_{jl} = P(w^l | t^j) = C(w^l, t^j) / C(t^j). The best tag sequence t_{1,n} for a sentence w_{1,n}:
argmax_{t_{1,n}} P(t_{1,n} | w_{1,n})
= argmax_{t_{1,n}} P(w_{1,n} | t_{1,n}) P(t_{1,n}) / P(w_{1,n})
= argmax_{t_{1,n}} P(w_{1,n} | t_{1,n}) P(t_{1,n})
= argmax_{t_{1,n}} ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})
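
As a hedged illustration of these MLE formulas (the sentence-boundary handling with a PERIOD pseudo-tag and the data format are assumptions), the counts can be collected as follows:

```python
from collections import Counter

def estimate_vmm(tagged_sentences):
    """MLE estimates a[(j, k)] = C(t_j, t_k)/C(t_j) and b[(w, j)] = C(w, t_j)/C(t_j),
    with a PERIOD pseudo-tag marking sentence boundaries."""
    tag_count, tag_bigram, word_tag = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        tags = ["PERIOD"] + [t for _, t in sentence]
        tag_count["PERIOD"] += 1
        for prev, cur in zip(tags, tags[1:]):
            tag_bigram[(prev, cur)] += 1
        for word, tag in sentence:
            word_tag[(word, tag)] += 1
            tag_count[tag] += 1
    transition = {(j, k): c / tag_count[j] for (j, k), c in tag_bigram.items()}
    emission = {(w, t): c / tag_count[t] for (w, t), c in word_tag.items()}
    return transition, emission
```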

The Viterbi Algorithm Comment: given a sentence of length n.
d_1(PERIOD) = 1.0
d_1(t) = 0.0 for t ≠ PERIOD
for i := 1 to n step 1 do
  for all tags t^j do
    d_{i+1}(t^j) := max_{1≤k≤T} [ d_i(t^k) P(w_{i+1} | t^j) P(t^j | t^k) ]
    Y_{i+1}(t^j) := argmax_{1≤k≤T} [ d_i(t^k) P(w_{i+1} | t^j) P(t^j | t^k) ]
  end
X_{n+1} := argmax_{1≤k≤T} [ d_{n+1}(t^k) ]
for j := n to 1 step -1 do
  X_j := Y_{j+1}(X_{j+1})
end
P(X_1, …, X_n) = max_{1≤j≤T} [ d_{n+1}(t^j) ]
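
A runnable sketch of the same algorithm, using the transition and emission dictionaries from the sketch above (treating unseen pairs as probability zero is an assumption; a real tagger would smooth):

```python
def viterbi(words, tags, transition, emission, start_tag="PERIOD"):
    """Return the tag sequence maximizing prod_i P(w_i | t_i) * P(t_i | t_{i-1})."""
    delta = {t: (1.0 if t == start_tag else 0.0) for t in tags}
    backptr = []
    for word in words:
        new_delta, pointers = {}, {}
        for t in tags:
            # best previous tag for landing in tag t after emitting `word`
            best_prev, best_score = start_tag, 0.0
            for prev in tags:
                score = delta[prev] * transition.get((prev, t), 0.0) * emission.get((word, t), 0.0)
                if score > best_score:
                    best_prev, best_score = prev, score
            new_delta[t], pointers[t] = best_score, best_prev
        delta = new_delta
        backptr.append(pointers)
    best = max(delta, key=delta.get)          # best final tag
    path = [best]
    for pointers in reversed(backptr[1:]):    # follow back-pointers to the start
        path.append(pointers[path[-1]])
    return list(reversed(path))
```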

Terminological note For the purpose of tagging, the Markov models are treated as Hidden Markov Models. In other words, a mixed formalism is used: in training, Visible Markov Models are constructed, but they are used as Hidden Markov Models in testing.

Unknown Words Simplest method: assume an unknown word could belong to any tag; unknown words are assigned the distribution of POS over the whole lexicon. Some tags are more common than others (for example, a new word is most likely to be a verb or a noun, but not a preposition or an article). Use features of the word (morphological and other cues; for example, words ending in -ed are likely to be past-tense forms or past participles). Use context. Words may not be in the training corpus, and maybe not even in the dictionary. This can be a major problem for taggers, and often the difference in accuracy between taggers is mainly determined by how well unknown words are treated.
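
A small, purely illustrative sketch of such morphological cues (the suffix rules and fallback tags are assumptions, not a complete treatment of unknown words):

```python
def guess_unknown_tag(word):
    """Very crude heuristics for words not seen in training."""
    if word[:1].isupper():
        return "NNP"   # capitalized: likely a proper noun
    if word.endswith("ed"):
        return "VBD"   # could equally be VBN (past participle)
    if word.endswith("ing"):
        return "VBG"
    if word.endswith("s"):
        return "NNS"
    return "NN"        # default to the most common open class
```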

Hidden Markov Model Taggers Often a large tagged training set is not available. In this case we can use an HMM to learn the regularities of tag sequences. The states of the HMM correspond to the tags, and the output alphabet consists of the words in the dictionary or of classes of words. Dictionary information is typically used for constraining model parameters: for POS tagging, the emission probability for the transition i -> j depends solely on i, namely b_{i,j,k} = b_{i,k}. Constraining model parameters means, for example, that we set a word-generation probability to zero if the word is not listed in the dictionary with the corresponding tag (e.g. JJ is not listed as a possible part of speech for 'table'). Grouping classes of words: all words with the same set of dictionary tags are grouped; for example, 'top' and 'bottom' are both JJ and NN.

Initialization of HMM Tagger (Jelinek, 1985) The output alphabet consists of words. Emission probabilities are given by:
b_{j.l} = b*_{j.l} C(w^l) / Σ_{w^m} b*_{j.m} C(w^m)
where the sum is over all words w^m in the dictionary, and
b*_{j.l} = 0 if t^j is not a part of speech allowed for w^l,
b*_{j.l} = 1 / T(w^l) otherwise, where T(w^l) is the number of tags allowed for w^l.
Jelinek's method amounts to initializing with the MLE for P(w^k | t^j), assuming that words occur equally likely with each of their possible tags. It is more like P(t^j | w^k) P(w^k) / P(t^j).
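
A sketch of this initialization (the dictionary format and variable names are assumptions): b*_{j.l} is 1/T(w^l) when tag t^j is allowed for w^l and 0 otherwise, and each tag's row is then normalized using the word counts.

```python
def jelinek_init(word_counts, allowed_tags, tags):
    """word_counts: {word: corpus frequency}; allowed_tags: {word: set of permitted tags}."""
    b_star = {(t, w): (1.0 / len(allowed_tags[w]) if t in allowed_tags[w] else 0.0)
              for w in word_counts for t in tags}
    emission = {}
    for t in tags:
        denom = sum(b_star[(t, w)] * word_counts[w] for w in word_counts)
        for w in word_counts:
            emission[(t, w)] = b_star[(t, w)] * word_counts[w] / denom if denom else 0.0
    return emission
```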

Initialization (Cont.) (Kupiec, 1992) The output alphabet consists of word equivalence classes, i.e., metawords u_L, where L is a subset of the integers from 1 to T (T is the number of different tags in the tag set).
b_{j.L} = b*_{j.L} C(u_L) / Σ_{u_{L'}} b*_{j.L'} C(u_{L'})
where the sum in the denominator is over all metawords u_{L'}, and
b*_{j.L} = 0 if j is not in L,
b*_{j.L} = 1 / |L| otherwise, where |L| is the number of indices in L.
Advantage here: no fine tuning for every single word. Disadvantage if there is enough training data to tune per-word parameters (even Kupiec did not include the 100 most frequent words in classes).

Training and Tagging with the HMM Training the HMM: once the initialization is completed, the HMM is trained using the Forward-Backward algorithm. Tagging using the HMM: use the Viterbi algorithm, just as in the VMM.

Transformation-Based Tagging Exploits a wider range of lexical and syntactic regularities: conditions the tags on preceding words, not just preceding tags, and uses more context than a bigram or trigram model.

Key components Transformation-based tagging has two key components: a specification of which 'error-correcting' transformations are admissible, and the learning algorithm. Input data: a dictionary and a tagged corpus. Basic idea: tag each word with its most frequent tag using the dictionary, then use the ranked list of transformations to correct the initial tagging. The ranked list is produced by the learning algorithm.

Transformations A transformation consists of two parts, a triggering environment and a rewrite rule. Transformations may be triggered either by a tag or by a word. Examples of transformations learned in transformation-based tagging (source tag, target tag, triggering environment):
NN -> VB if the previous tag is TO
VBP -> VB if one of the previous three tags is MD
JJR -> RBR if the next tag is JJ
VBP -> VB if one of the previous two words is n't
The example is the Brill tagger (1995). Notes on the transformations: after TO, nouns should be retagged as verbs, e.g. "He used to train horses." (further transformations may be more specific and switch some words back to NN, as in "I went to school"). The MD rule refers to verbs with identical base and past-tense forms, like 'cut' and 'put': a preceding modal (can, may, should, …) makes a past-tense reading unlikely. The JJR rule handles cases like "more valuable player". The n't rule (as in "shouldn't") retags the verb form to the base form, e.g. "You shouldn't cut the grass."
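
A minimal sketch of applying one such transformation (encoding the triggering environment as a predicate over the tag sequence is an assumption about the representation, not Brill's own code):

```python
def apply_transformation(tags, source, target, trigger):
    """Rewrite source -> target at every position where the triggering environment holds."""
    return [target if tags[i] == source and trigger(tags, i) else tags[i]
            for i in range(len(tags))]

# The first rule from the table: NN -> VB when the previous tag is TO.
prev_is_to = lambda tags, i: i > 0 and tags[i - 1] == "TO"
print(apply_transformation(["PN", "VBD", "TO", "NN", "NNS"], "NN", "VB", prev_is_to))
# ['PN', 'VBD', 'TO', 'VB', 'NNS']   e.g. "He used to train horses."
```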

Morphology-triggered transformations Morphology-triggered transformations are an elegant way of incorporating morphological knowledge into the general tagging formalism. They are very helpful for unknown words. Initially, unknown words are tagged as NNP (proper noun) if capitalized, and as common nouns otherwise. Morphology-triggered transformations then apply, such as: replace NN by NNS if the unknown word's suffix is '-s'.

Learning Algorithm The learning algorithm selects the best transformations and determines their order of application. Initially, tag each word with its most frequent tag. Iteratively choose the transformation that reduces the error rate most. Stop when there is no transformation left that reduces the error rate by more than a prespecified threshold.
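
A hedged sketch of this greedy loop (the rule representation and the apply_rule function are assumptions carried over from the earlier sketch; error counting against gold tags is the selection criterion named on the slide):

```python
def count_errors(predicted, gold):
    return sum(p != g for p, g in zip(predicted, gold))

def learn_transformations(initial_tags, gold_tags, candidate_rules, apply_rule, threshold=1):
    """Repeatedly keep the candidate rule that reduces the error count most,
    stopping when no rule improves the tagging by at least `threshold` errors."""
    tags, learned = list(initial_tags), []
    while True:
        current = count_errors(tags, gold_tags)
        best_rule, best_tags, best_gain = None, None, 0
        for rule in candidate_rules:
            new_tags = apply_rule(tags, rule)
            gain = current - count_errors(new_tags, gold_tags)
            if gain > best_gain:
                best_rule, best_tags, best_gain = rule, new_tags, gain
        if best_rule is None or best_gain < threshold:
            return learned
        learned.append(best_rule)
        tags = best_tags
```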

Automata Once trained, it is possible to convert the transformation-based tagger into an equivalent finite-state transducer: a finite-state automaton that has a pair of symbols on each arc, one input symbol and one output symbol. A finite-state transducer passes over a chain of input symbols and converts it to a chain of output symbols, consuming the input symbols on the arcs it traverses and emitting the output symbols. The great advantage of the deterministic finite-state transducer is speed (hundreds of thousands of words per second). In contrast, a transformation-based tagger can take RKn steps to tag a text, where R is the number of transformations, K is the length of the triggering environment, and n is the length of the input text.

Comparison to probabilistic models Transformation-based tagging does not have the wealth of standard methods available to probabilistic methods: it cannot assign a probability to its predictions, and it cannot reliably output 'k-best' taggings, i.e., a set of the k most probable hypotheses. However, transformations encode prior knowledge, and they are biased towards good generalization.

Tagging Accuracy Accuracy ranges from 96% to 97%. It depends on: the amount of training data available; the tag set; the difference between the training corpus and dictionary and the corpus of application; and the unknown words in the corpus of application. A change in any of these factors can have a dramatic effect on tagging accuracy, often much stronger than the choice of tagging method. Some authors give accuracies only for unknown words, in which case accuracy is lower. Amount of data: the more the better. Tag set: the larger, the more difficult; for example, some tag sets distinguish TO as a preposition from TO as the infinitive marker, whereas with a single TO tag one cannot tag TO wrongly. Difference in corpus: in an academic setting, training and test data are from the same source; in real life, if the test data come from a different source or a different genre, results can be poor (cf. the work of Charniak on the Brown corpus). Unknown words: coverage of the dictionary matters; the occurrence of many unknown words greatly degrades performance, e.g. in a technical domain.

Applications of Tagging Partial parsing: syntactic analysis. Information extraction: tagging and partial parsing help identify useful terms and the relationships between them. Question answering: analyzing a query to understand what type of entity the user is looking for and how it is related to other noun phrases mentioned in the question. Partial parsing, e.g. identification of noun phrases in a sentence: there are various methods to do this based on tagged text. Information extraction: find values in natural language text that fit into templates; for example, a template for a weather report has slots for the type of weather condition (rain, snow, hurricane), the location of the event, and the effect of the event (power breakdown, traffic problems, …); tagging and partial parsing help to identify the entities that fill the slots, although this is semantic tagging. Question answering: "Who killed President Kennedy?" may be answered with the noun phrase "Oswald" instead of returning a list of documents; again, analyzing the query to determine what type of entity is sought and how it relates to other entities requires tagging and partial parsing.

Examples of frequent tagging errors (correct tag, tag assigned, example):
NNS, JJ: "an executive order"
RB: "more important issues"
VBD, VBG: "loan needed to meet"
Notes: (a) "executive" can be a noun or an adjective. (b) "more issues" or "more important"? (c) "The loan needed to meet rising costs of health care." (d) "They cannot now handle the loan needed to meet rising costs of health care."