Part of Speech Tagging
September 9, 2005
CAP6640 Natural Language Systems
Part of Speech
Each word belongs to a word class; the word class of a word is known as its part of speech (POS). Most POS tags implicitly encode fine-grained specializations of eight basic parts of speech: noun, verb, pronoun, preposition, adjective, adverb, conjunction, and article. These categories are based on morphological and distributional similarities, not semantic similarities. Parts of speech are also known as word classes, morphological classes, or lexical tags.
Part of Speech (cont.)
The POS tag of a word describes the major and minor word classes of that word, and it gives a significant amount of information about the word and its neighbours. For example, a possessive pronoun (my, your, her, its) will most likely be followed by a noun, while a personal pronoun (I, you, he, she) will most likely be followed by a verb. Most words have a single POS tag, but some have more than one (2, 3, 4, ...). For example, book can be a noun or a verb:
I bought a book. (book/noun)
Please book that flight. (book/verb)
Tag Sets
There are various tag sets to choose from; the choice depends on the nature of the application. We may use a small tag set (more general tags) or a large tag set (finer tags). Some widely used part-of-speech tag sets:
- Penn Treebank: 45 tags
- Brown Corpus: 87 tags
- C7 tag set: 146 tags
In a tagged corpus, each word is associated with a tag from the tag set used.
English Word Classes
Parts of speech can be divided into two broad categories:
- closed class types -- such as prepositions
- open class types -- such as nouns and verbs
Closed class words are generally also function words (of, it, and, you), which play an important role in grammar; they tend to be very short and to occur frequently. There are four major open classes: noun, verb, adjective, adverb. A new word may easily enter an open class. Word classes vary across natural languages, but all natural languages have at least two word classes: noun and verb.
Nouns
Nouns can be divided into:
- proper nouns -- names for specific entities, such as India, John, Ali
- common nouns
Proper nouns do not take an article, but common nouns may. Common nouns can be divided into:
- count nouns -- they can be singular or plural: chair/chairs
- mass nouns -- used when something is conceptualized as a homogeneous group: snow, salt
Mass nouns cannot take the articles a and an, and they cannot be plural.
Verbs
The verb class includes words referring to actions and processes. Verbs can be divided into:
- main verbs -- open class: draw, bake
- auxiliary verbs -- closed class: can, should
Auxiliary verbs can be divided into:
- copula -- be, have
- modal verbs -- may, can, must, should
Verbs have different morphological forms:
- non-3rd-person-singular: eat
- 3rd-person-singular: eats
- progressive: eating
- past: ate
- past participle: eaten
Adjectives
Adjectives describe properties or qualities:
- for color -- black, white
- for age -- young, old
In Turkish, all adjectives can also be used as nouns:
kırmızı kitap -- red book
kırmızıyı -- the red one (ACC)
Adverbs
Adverbs normally modify verbs. Adverb categories:
- locative adverbs -- home, here, downhill
- degree adverbs -- very, extremely
- manner adverbs -- slowly, delicately
- temporal adverbs -- yesterday, Friday
Because of the heterogeneous nature of adverbs, some adverbs such as Friday may be tagged as nouns.
Major Closed Classes
- Prepositions -- on, under, over, near, at, from, to, with
- Determiners -- a, an, the
- Pronouns -- I, you, he, she, who, others
- Conjunctions -- and, but, if, when
- Particles -- up, down, on, off, in, out
- Numerals -- one, two, first, second
Prepositions
Prepositions occur before noun phrases and indicate spatial or temporal relations. Examples: on the table, under the chair. They are frequent; some frequency counts from a 16-million-word corpus (COBUILD):
of 540,085
in 331,235
for 142,421
to 125,691
with 124,965
on 109,129
at 100,169
Particles
A particle combines with a verb to form a larger unit called a phrasal verb:
go on, turn on, turn off, shut down
Articles
A small closed class with only three members: a, an, the. Articles mark definiteness or indefiniteness. They occur very often; frequency counts from the same 16-million-word corpus (COBUILD):
the 1,071,676
a 413,887
an 59,359
Almost 10% of the words in this corpus are articles.
Conjunctions
Conjunctions are used to combine or join two phrases, clauses, or sentences.
- Coordinating conjunctions (and, or, but) join two elements of equal status. Example: you and me
- Subordinating conjunctions (that, who) combine a main clause with a subordinate clause. Example: I thought that you might like milk
Pronouns
Pronouns are shorthand for referring to some entity or event. Pronouns can be divided into:
- personal -- you, she, I
- possessive -- my, your, his
- wh-pronouns -- who, what
Tag Sets for English
Several tag sets are in actual use for part-of-speech tagging. The Penn Treebank tag set has 45 tags, including:
IN preposition/subordinating conjunction
DT determiner
JJ adjective
NN noun, singular or mass
NNS noun, plural
VB verb, base form
VBD verb, past tense
A sentence from the Brown corpus tagged with the Penn Treebank tag set:
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
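To see Penn Treebank tags in practice, here is a minimal sketch using the NLTK library (an aside to the slides, assuming NLTK is installed; its bundled perceptron tagger is not one of the taggers discussed here):

import nltk

# Resource names vary somewhat across NLTK versions; these are the classic ones.
nltk.download("punkt", quiet=True)                       # tokenizer model
nltk.download("averaged_perceptron_tagger", quiet=True)  # default POS tagger

tokens = nltk.word_tokenize("The grand jury commented on a number of other topics.")
print(nltk.pos_tag(tokens))
# [('The', 'DT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ...]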
Part of Speech Tagging
Part-of-speech tagging is simply assigning the correct part of speech to each word in an input sentence. We assume that we have the following:
- A set of tags (our tag set).
- A dictionary that tells us the possible tags for each word (including all morphological variants).
- A text to be tagged.
There are different algorithms for tagging:
- Rule-Based Tagging -- uses hand-written rules
- Stochastic Tagging -- uses probabilities computed from a training corpus
- Transformation-Based Tagging -- uses rules learned automatically
How hard is tagging?
Most words in English are unambiguous: they have only a single tag. But many of the most common words are ambiguous: can/verb, can/auxiliary, can/noun. The number of word types in the Brown Corpus:
unambiguous (one tag): 35,340
ambiguous (2-7 tags): 4,100
  2 tags: 3,760
  3 tags: 264
  4 tags: 61
  5 tags: 12
  6 tags: 2
  7 tags: 1
While only 11.5% of word types are ambiguous, over 40% of Brown corpus tokens are ambiguous.
Problem Setup
There are M types of POS tags; the tag set is {t1, ..., tM}. The word vocabulary size is V; the vocabulary set is {w1, ..., wV}. We have a word sequence of length n: W = w1, w2, ..., wn. We want to find the best sequence of POS tags: T = t1, t2, ..., tn.
Information Sources for Tagging
All techniques are based on the same observations:
- Syntagmatic information: some tag sequences are more probable than others. ART+ADJ+N is more probable than ART+ADJ+VB.
- Lexical information: knowing the word to be tagged gives a lot of information about the correct tag. "table" can be {noun, verb} but not {adj, prep, ...}; "rose" can be {noun, adj, verb} but not {prep, ...}.
Rule-Based Part-of-Speech Tagging
- First stage: uses a dictionary to assign each word a list of potential parts of speech.
- Second stage: uses a large list of hand-crafted rules to winnow this list down to a single part of speech for each word.
ENGTWOL is a rule-based tagger: in the first stage it uses a two-level lexicon transducer, and in the second stage it uses hand-crafted rules (about 1,100 rules).
Sample Rules
N-IP rule: a tag N (noun) cannot be followed by a tag IP (interrogative pronoun).
  ... man who ...
  man: {N}
  who: {RP, IP} --> {RP} (relative pronoun)
ART-V rule: a tag ART (article) cannot be followed by a tag V (verb).
  ... the book ...
  the: {ART}
  book: {N, V} --> {N}
After the First Stage
Example: He had a book. After the first stage:
he    he/pronoun
had   have/verb-past, have/auxiliary-past
a     a/article
book  book/noun, book/verb
Tagging Rules
Rule-1: if (the previous tag is an article) then eliminate all verb tags
Rule-2: if (the next tag is a verb) ...
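A minimal Python sketch of the two-stage idea (the mini-lexicon and tag names are illustrative, and only Rule-1 is implemented):

LEXICON = {
    "he": {"pronoun"},
    "had": {"verb-past", "auxiliary-past"},
    "a": {"article"},
    "book": {"noun", "verb"},
}

def rule_based_tag(words):
    # Stage 1: assign every candidate tag from the lexicon.
    candidates = [set(LEXICON[w.lower()]) for w in words]
    # Stage 2: apply elimination rules.
    for i in range(1, len(words)):
        # Rule-1: if the previous tag is an article, eliminate all verb tags.
        if candidates[i - 1] == {"article"}:
            non_verbs = {t for t in candidates[i] if "verb" not in t}
            if non_verbs:                      # never eliminate every tag
                candidates[i] = non_verbs
    return list(zip(words, candidates))

print(rule_based_tag(["He", "had", "a", "book"]))
# book ends up with {'noun'} only, because it follows an article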
Transformation-Based Tagging
Due to Eric Brill (1995). Basic idea: take a non-optimal sequence of tags and improve it successively by applying a series of well-ordered rewrite rules. It is rule-based, but the rules are learned automatically by training on a pre-tagged corpus.
An Example
1. Assign to words their most likely tag: P(NN|race) = .98, P(VB|race) = .02
2. Change some tags by applying transformation rules.
Types of Context
There is lots of latitude; the triggering context can be:
- tag-triggered transformation: the preceding/following word is tagged this way; the word two before/after is tagged this way; ...
- word-triggered transformation: the preceding/following word is this word; ...
- morphology-triggered transformation: the preceding/following word ends with an s; ...
- a combination of the above: the preceding word is tagged this way AND the following word is this word.
Learning the Transformation Rules
Input: a corpus with each word (a) correctly tagged (for reference) and (b) tagged with its most frequent tag (C0).
Output: a bag of transformation rules.
Algorithm: instantiate a small set of hand-written templates (generic rules) by comparing the reference corpus to C0. Change tag a to tag b when:
- the preceding/following word is tagged z
- the word two before/after is tagged z
- one of the 2 preceding/following words is tagged z
- one of the 2 preceding words is z
- ...
Learning the Transformation Rules (cont.)
1. Run the initial tagger and compile the types of errors as triples: <incorrect tag, desired tag, number of occurrences>.
2. For each error type, instantiate all templates to generate candidate transformations.
3. Apply each candidate transformation to the corpus and count the number of corrections and errors that it produces.
4. Save the transformation that yields the greatest improvement.
5. Stop when no transformation can reduce the error rate by a predetermined threshold.
Example
If the initial tagger mistags 159 words as verbs instead of nouns, create the error triple <verb, noun, 159>. Suppose template #3 is instantiated as the rule: change the tag from <verb> to <noun> if one of the two preceding words is tagged as a determiner. When this rule is applied to the corpus, it corrects 98 of the 159 errors but also creates 18 new errors, so the error reduction is 98 - 18 = 80.
Learning the Best Transformations
Input: a corpus with each word (a) correctly tagged (for reference) and (b) tagged with its most frequent tag (C0), plus a bag of unordered transformation rules.
Output: an ordering of the best transformation rules.
Learning the Best Transformations (cont.)
Let:
  E(Ck) = number of words incorrectly tagged in the corpus at iteration k
  v(C)  = the corpus obtained after applying rule v to corpus C
  ε     = minimum reduction in the number of errors desired

for k := 0 step 1 do
  bt := argmin_t E(t(Ck))              // find the transformation t that minimizes the error rate
  if (E(Ck) - E(bt(Ck))) < ε then      // if bt does not improve the tagging significantly
    goto finished
  Ck+1 := bt(Ck)                       // apply rule bt to the current corpus
  Tk+1 := bt                           // keep bt as the current transformation rule
end
finished: the sequence T1 T2 ... Tk is the ordered list of transformation rules
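The same greedy loop as a Python sketch (the rule representation is simplified to plain functions; a real implementation would instantiate Brill's templates to produce the candidate set):

def learn_transformations(candidate_rules, corpus, gold_tags, epsilon=1):
    # corpus: list of (word, tag) pairs tagged with most-frequent tags (C0)
    # gold_tags: the correct tag for each position (the reference corpus)
    # candidate_rules: functions mapping a tagged corpus to a re-tagged corpus
    def errors(tagged):
        return sum(1 for (_, tag), gold in zip(tagged, gold_tags) if tag != gold)

    ordered_rules = []
    while candidate_rules:
        best = min(candidate_rules, key=lambda rule: errors(rule(corpus)))
        if errors(corpus) - errors(best(corpus)) < epsilon:
            break                      # no rule improves the tagging enough
        corpus = best(corpus)          # apply the winning rule to the corpus
        ordered_rules.append(best)     # rules are kept in the order learned
    return ordered_rules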
Strengths of Transformation-Based Tagging
- Exploits a wider range of lexical and syntactic regularities: it can look at a wider context, conditioning tags on preceding/following words and not just preceding tags, so it can use more context than a bigram or trigram model.
- Transformation rules are easier to understand than matrices of probabilities.
How TBL Rules Are Applied
Before the rules are applied, the tagger labels every word with its most likely tag; these most likely tags come from a tagged corpus. Example: He is expected to race tomorrow
he/PRN is/VBZ expected/VBN to/TO race/NN tomorrow/NN
After selecting the most likely tags, we apply the transformation rules, e.g.: change NN to VB when the previous tag is TO. This rule converts race/NN into race/VB. It may not work in every case, however; compare "according to race", where race really is a noun after to.
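A Python sketch of this application phase, using the slide's example and its single TO -> VB rule (the most-likely-tag dictionary is illustrative):

MOST_LIKELY_TAG = {"he": "PRN", "is": "VBZ", "expected": "VBN",
                   "to": "TO", "race": "NN", "tomorrow": "NN"}

def tbl_tag(words):
    # Step 1: label every word with its most likely tag.
    tags = [MOST_LIKELY_TAG[w.lower()] for w in words]
    # Step 2: apply the learned rule "change NN to VB after TO".
    for i in range(1, len(tags)):
        if tags[i] == "NN" and tags[i - 1] == "TO":
            tags[i] = "VB"
    return list(zip(words, tags))

print(tbl_tag("He is expected to race tomorrow".split()))
# race/NN becomes race/VB because it follows to/TO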
How TBL Rules Are Learned
We assume that we have a tagged corpus. Brill's TBL algorithm has three major steps:
1. Tag the corpus with the most likely tag for each word (unigram model).
2. Choose a transformation that deterministically replaces an existing tag with a new tag such that the resulting tagged training corpus has the lowest error rate of all transformations.
3. Apply the transformation to the training corpus.
These steps are repeated until a stopping criterion is reached. The resulting tagger first tags new text using the most likely tags, then applies the learned transformations in order.
Transformations
A transformation is selected from a small set of templates: change tag a to tag b when
- the preceding (following) word is tagged z
- the word two before (after) is tagged z
- one of the two preceding (following) words is tagged z
- one of the three preceding (following) words is tagged z
- the preceding word is tagged z and the following word is tagged w
- the preceding (following) word is tagged z and the word two before (after) is tagged w
Stochastic POS Tagging
Assume that a word's tag depends only on the previous tags (not the following ones). Use a training set (a manually tagged corpus) to:
- learn the regularities of tag sequences
- learn the possible tags for a word
- model this information through a language model (n-gram)
Hidden Markov Model (HMM) Taggers
Goal: maximize P(word|tag) x P(tag|previous n tags)
- P(word|tag): the word/lexical likelihood -- the probability that, given this tag, we have this word (NOT the probability that this word has this tag). Modeled through a word-tag matrix. This is the lexical information.
- P(tag|previous n tags): the tag sequence likelihood -- the probability that this tag follows these previous tags. Modeled through a tag-tag matrix. This is the syntagmatic information.
Tag Sequence Probability
P(tag|previous n tags): if we look at the (n-1) preceding tags to find the current tag, we have an n-gram model.
- trigram model: chooses the most probable tag ti for word wi given the previous two tags ti-2 and ti-1 and the current word wi
- bigram model: given the previous tag ti-1 and the current word wi
- unigram model: just the most likely tag for wi
Example
"race" can be VB or NN:
"Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/ADV"
"People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN"
Let's tag the word "race" in the first sentence with a bigram model.
Example (cont.)
Assuming the previous words have been tagged, we have: "Secretariat/NNP is/VBZ expected/VBN to/TO race/?? tomorrow". Compare:
- P(race|VB) x P(VB|TO): given that we have a VB, how likely is the current word to be race; given that the previous tag is TO, how likely is the current tag to be VB?
- P(race|NN) x P(NN|TO): given that we have an NN, how likely is the current word to be race; given that the previous tag is TO, how likely is the current tag to be NN?
Example (cont.)
From the training corpus, we found that:
P(NN|TO) = .021     // given that the previous tag is TO, 2.1% chance that the current tag is NN
P(VB|TO) = .34      // given that the previous tag is TO, 34% chance that the current tag is VB
P(race|NN) = .00041 // given that we have an NN, 0.041% chance that this word is "race"
P(race|VB) = .00003 // given that we have a VB, 0.003% chance that this word is "race"
so:
P(VB|TO) x P(race|VB) = .34 x .00003 = .0000102
P(NN|TO) x P(race|NN) = .021 x .00041 = .0000086
VB is more probable!
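The same comparison as a two-line check in Python:

# Bigram disambiguation of "race" after to/TO, with the probabilities above.
p_vb = 0.34 * 0.00003    # P(VB|TO) * P(race|VB) = 0.0000102
p_nn = 0.021 * 0.00041   # P(NN|TO) * P(race|NN) = 0.0000086
print("VB" if p_vb > p_nn else "NN")   # -> VB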
Example (cont.)
And by the way, race is 98% of the time an NN: P(VB|race) = 0.02, P(NN|race) = 0.98. Still, the bigram model picks VB here.
How are the probabilities found? Using a training corpus of hand-tagged text -- long and meticulous work done by linguists.
HMM Tagging
HMM tagging tries to find the best sequence of tags for a sentence, not just the best tag for a single word. Goal: maximize the probability of a tag sequence given a word sequence, i.e., choose the sequence of tags that maximizes P(tag sequence|word sequence).
HMM Tagging (cont.)
By Bayes' law:
P(tagSeq|wordSeq) = P(tagSeq) x P(wordSeq|tagSeq) / P(wordSeq)
wordSeq is given, so P(wordSeq) is the same for all tagSeq and we can drop it from the equation:
bestTagSeq = argmax P(tagSeq) x P(wordSeq|tagSeq)
Assumptions in HMM Tagging
- Words are independent of each other.
- A word's identity depends only on its tag (the emission probability).
- Markov assumption: only a short history matters (the state transition probability); e.g., with a bigram approximation, a tag depends only on the previous tag.
The Derivation
bestTagSeq = argmax P(tagSeq) x P(wordSeq|tagSeq)
(t1...tn)* = argmax P(t1, ..., tn) x P(w1, ..., wn | t1, ..., tn)

Assumption 1 (independence) + chain rule:
P(t1, ..., tn) x P(w1, ..., wn | t1, ..., tn)
  = P(tn|t1, ..., tn-1) x P(tn-1|t1, ..., tn-2) x P(tn-2|t1, ..., tn-3) x ... x P(t1)
    x P(w1|t1, ..., tn) x P(w2|t1, ..., tn) x P(w3|t1, ..., tn) x ... x P(wn|t1, ..., tn)

Assumption 2 (Markov assumption: only look at a short history, e.g. bigram):
  = P(tn|tn-1) x P(tn-1|tn-2) x P(tn-2|tn-3) x ... x P(t1)
    x P(w1|t1, ..., tn) x ... x P(wn|t1, ..., tn)

Assumption 3 (a word's identity depends only on its tag):
  = P(tn|tn-1) x P(tn-1|tn-2) x ... x P(t1) x P(w1|t1) x P(w2|t2) x P(w3|t3) x ... x P(wn|tn)
Emission and Transition Probabilities
Let N be the number of possible tags (the size of the tag set) and V the number of word types (the vocabulary size). From a tagged training corpus, we compute the frequencies of:
- Emission probabilities P(wi|ti), stored in an N x V matrix: emission[i,j] = probability that tag i is the correct tag for word j
- Transition probabilities P(ti|ti-1), stored in an N x N matrix: transition[i,j] = probability that tag i follows tag j
In practice, these matrices are very sparse, so the models are smoothed to avoid zero probabilities. A sketch of the estimation follows.
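A Python sketch of that estimation, using maximum-likelihood counts with add-one smoothing as one simple smoothing choice (function and variable names are illustrative):

from collections import Counter

def estimate_hmm(tagged_sentences, tagset, vocabulary):
    # tagged_sentences: list of sentences, each a list of (word, tag) pairs
    emit, trans = Counter(), Counter()
    tag_count, prev_count = Counter(), Counter()
    for sentence in tagged_sentences:
        prev = "<s>"                            # sentence-start pseudo-tag
        for word, tag in sentence:
            emit[tag, word] += 1
            tag_count[tag] += 1
            trans[prev, tag] += 1
            prev_count[prev] += 1
            prev = tag
    V, N = len(vocabulary), len(tagset)
    def p_emit(word, tag):                      # P(word|tag), add-one smoothed
        return (emit[tag, word] + 1) / (tag_count[tag] + V)
    def p_trans(tag, prev):                     # P(tag|prev), add-one smoothed
        return (trans[prev, tag] + 1) / (prev_count[prev] + N)
    return p_emit, p_trans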
Emission Probabilities P(wi|ti)
Stored in an N x V matrix: emission[i,j] = the probability/frequency that tag i is the correct tag for word j.
Transition Probabilities P(ti|ti-1)
Stored in an N x N matrix: transition[i,j] = the probability/frequency that tag i follows tag j.
Efficiency Issues
Finding the best probability of a sequence by enumerating all paths is exponential in time. For efficiency we usually use the Viterbi algorithm, which performs the global maximisation, i.e. finds the best tag sequence.
An Example
[Slide shows tables of emission probabilities P(word|tag) and transition probabilities P(tag|tag) for the toy sentence "John likes to fish in the sea" over the tag set {PN, VB, TO, IN, AT, NN}; the table values did not survive extraction.]
State Transition Diagram (VMM)
[Diagram of the visible Markov model: states PN, VB, TO, IN, AT, NN plus start and end, with arcs labeled by the transition probabilities from the previous slide.]
State Transition Diagram (HMM)
[The same diagram, but the states are "invisible" -- we only see the words. Each state additionally emits words with its emission probabilities (e.g. the residue shows labels such as John: 0.3 on PN and fish: 0.3 on VB).]
The Viterbi Algorithm
What is the best tag sequence for "John likes to fish in the sea"? The Viterbi algorithm efficiently computes the most likely state sequence given a particular output sequence, based on dynamic programming.
A Smaller Example
[Diagram of a two-state HMM: start goes to state q with probability 1; transitions q->q = 0.3, q->r = 0.7, r->q = 0.5, r->r = 0.5; q emits a with 0.4 and b with 0.6, r emits a with 0.2 and b with 0.8.]
What is the best sequence of states for the input string "bbba"? Computing all possible paths and finding the one with the maximum probability is exponential.
A Smaller Example (cont.)
For each state, store the most likely sequence that could lead to it (and its probability). The path probability matrix is an array of states versus time (tags versus words) that stores the probability of being in each state at each time, in terms of the probabilities of being in each state at the preceding time.

Input sequence / time   ε -> b               b -> b                    bb -> b                        bbb -> a
to q, from q            ε->q 0.6 (1.0x0.6)   q->q 0.108 (0.6x0.3x0.6)  qq->q 0.01944 (0.108x0.3x0.6)  qrq->q 0.012096 (0.1008x0.3x0.4)
to q, from r                                 r->q 0 (0x0.5x0.6)        qr->q 0.1008 (0.336x0.5x0.6)   qrr->q 0.02688 (0.1344x0.5x0.4)
to r, from q            ε->r 0 (0x0.8)       q->r 0.336 (0.6x0.7x0.8)  qq->r 0.06048 (0.108x0.7x0.8)  qrq->r 0.014112 (0.1008x0.7x0.2)
to r, from r                                 r->r 0 (0x0.5x0.8)        qr->r 0.1344 (0.336x0.5x0.8)   qrr->r 0.01344 (0.1344x0.5x0.2)

The best final entry is qrr->q = 0.02688, so the best state sequence for "bbba" is q r r q. A runnable version of this computation follows.
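A Python sketch of the Viterbi computation, assuming the transition and emission probabilities recovered above (it reproduces the best path q r r q with probability 0.02688):

EMIT = {"q": {"a": 0.4, "b": 0.6}, "r": {"a": 0.2, "b": 0.8}}
TRANS = {("start", "q"): 1.0, ("start", "r"): 0.0,
         ("q", "q"): 0.3, ("q", "r"): 0.7,
         ("r", "q"): 0.5, ("r", "r"): 0.5}

def viterbi(obs, states=("q", "r")):
    # First column of the path probability matrix.
    vit = [{s: TRANS["start", s] * EMIT[s][obs[0]] for s in states}]
    back = [{}]
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            # Best predecessor state for s at this time step.
            prev = max(states, key=lambda p: vit[-1][p] * TRANS[p, s])
            row[s] = vit[-1][prev] * TRANS[prev, s] * EMIT[s][o]
            ptr[s] = prev
        vit.append(row)
        back.append(ptr)
    # Read out the best path backwards from the best final state.
    last = max(states, key=lambda s: vit[-1][s])
    path = [last]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return list(reversed(path)), vit[-1][last]

print(viterbi("bbba"))   # (['q', 'r', 'r', 'q'], 0.02688)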
Viterbi for POS Tagging
Let:
  n = number of words in the sentence to tag (number of input tokens)
  T = number of tags in the tag set (number of states)
  vit = the path probability matrix (Viterbi): vit[i,j] = probability of being at state (tag) j at word i
  state = matrix to recover the nodes of the best path (best tag sequence): state[i+1,j] = the state (tag) of the incoming arc that led to this most probable state j at word i+1

// Initialization
vit[1,PERIOD] := 1.0   // pretend that there is a period before our sentence (start tag = PERIOD)
vit[1,t] := 0.0 for t ≠ PERIOD
Viterbi for POS Tagging (cont.)
// Induction (build the path probability matrix)
for i := 1 to n step 1 do                // for all words in the sentence
  for all tags tj do                     // for all possible tags
    // store the max probability of the path:
    //   vit[i,tk] is the probability of the best path leading to state tk at word i,
    //   P(wi+1|tj) is the emission probability, P(tj|tk) the state transition probability
    vit[i+1,tj]  := max(1≤k≤T) (vit[i,tk] x P(wi+1|tj) x P(tj|tk))
    // store the actual state
    path[i+1,tj] := argmax(1≤k≤T) (vit[i,tk] x P(wi+1|tj) x P(tj|tk))
end

// Termination and path-readout
bestState(n+1) := argmax(1≤j≤T) vit[n+1,j]
for j := n to 1 step -1 do               // for all the words in the sentence
  bestState(j) := path[j+1, bestState(j+1)]
P(bestState(1), ..., bestState(n)) := max(1≤j≤T) vit[n+1,j]
Possible Improvements
In bigram POS tagging, we condition a tag only on the preceding tag. Why not:
- use more context (e.g. a trigram model)? More precise: "is clearly marked" --> verb, past participle; "he clearly marked" --> verb, past tense. We can also combine trigram, bigram, and unigram models.
- condition on the words too? With an n-gram approach this is too costly (too many parameters to model); transformation-based tagging does this instead.
Evaluation of POS Taggers
Performance is compared with a gold standard produced by humans. Metric: accuracy = the percentage of tags that are identical to the gold standard. Most taggers reach ~96-97% accuracy. Accuracy must be compared to:
- a ceiling (best possible result): how do human annotators score compared to each other? (96-97%) -- so systems are not bad at all!
- a baseline (worst sensible result): what if we take the most likely tag (unigram model) regardless of previous tags? (90-91%) -- so anything less is really bad.
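Computing this metric is just token-level agreement with the gold standard; a minimal Python helper:

def accuracy(predicted_tags, gold_tags):
    # fraction of positions where the tagger agrees with the gold standard
    assert len(predicted_tags) == len(gold_tags)
    return sum(p == g for p, g in zip(predicted_tags, gold_tags)) / len(gold_tags)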
More on Tagger Accuracy
Is 95% good? That is 5 mistakes every 100 words; if a sentence is 20 words on average, that is 1 mistake per sentence. When comparing tagger accuracies, beware of:
- the size of the training corpus: the bigger, the better the results
- the difference between training and testing corpora (genre, domain, ...): the closer, the better the results
- the size of the tag set
- prediction versus classification
- unknown words: the more unknown words (not in the dictionary), the worse the results
Error Analysis of POS Taggers
Where did the tagger go wrong? Use a confusion matrix / contingency table. The most confused tags:
- NN (noun) vs. NNP (proper noun) vs. JJ (adjective)
- VBD (verb, past tense) vs. VBN (past participle) vs. JJ (adjective): he chopped carrots, the carrots were chopped, the chopped carrots
Major Difficulties in POS Tagging
- Unknown words (e.g. proper names): we do not know the set of tags they can take, and knowing this takes you a long way (cf. the baseline POS tagger). Possible solutions: assign all possible tags with a probability distribution identical to the lexicon as a whole, or use morphological cues to infer possible tags -- e.g., words ending in -ed are likely to be past tense verbs or past participles (see the sketch below).
- Frequently confused tag pairs: preposition vs. particle (<running> <up> a hill (preposition) / <running up> a bill (particle)); verb past tense vs. past participle vs. adjective.
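A Python sketch of the morphological-cue heuristic (the suffix list and tag guesses are illustrative, not a complete system):

SUFFIX_GUESSES = [
    ("ed", {"VBD", "VBN"}),   # past tense or past participle
    ("ing", {"VBG"}),         # gerund / present participle
    ("ly", {"RB"}),           # adverb
    ("s", {"NNS", "VBZ"}),    # plural noun or 3rd-person-singular verb
]

def guess_tags(word):
    if word[0].isupper():
        return {"NNP"}        # capitalized: likely a proper noun
    for suffix, tags in SUFFIX_GUESSES:
        if word.endswith(suffix):
            return tags
    return {"NN"}             # fall back to common noun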