1
Lecture 9 NLTK POS Tagging Part 2
CSCE Natural Language Processing
Topics: Taggers, Rule-Based Taggers, Probabilistic Taggers, Transformation-Based Taggers (Brill), Supervised learning
Readings: Chapter 5.4-?
February 3, 2011
2
Overview
Last Time: Overview of POS Tags
Today: Part of Speech Tagging (Parts of Speech, Rule-Based taggers, Stochastic taggers, Transformational taggers)
Readings: Chapter ?
3
Table 5.1: Simplified Part-of-Speech Tagset
Tag   Meaning             Examples
ADJ   adjective           new, good, high, special, big, local
ADV   adverb              really, already, still, early, now
CNJ   conjunction         and, or, but, if, while, although
DET   determiner          the, a, some, most, every, no
EX    existential         there, there's
FW    foreign word        dolce, ersatz, esprit, quo, maitre
4
MOD   modal verb          will, can, would, may, must, should
N     noun                year, home, costs, time, education
NP    proper noun         Alison, Africa, April, Washington
NUM   number              twenty-four, fourth, 1991, 14:24
PRO   pronoun             he, their, her, its, my, I, us
P     preposition         on, of, at, with, by, into, under
TO    the word to         to
UH    interjection        ah, bang, ha, whee, hmpf, oops
V     verb                is, has, get, do, make, see, run
VD    past tense          said, took, told, made, asked
VG    present participle  making, going, playing, working
VN    past participle     given, taken, begun, sung
WH    wh determiner       who, which, when, what, where, how
5
Rank tags from most to least common
>>> import nltk
>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> print tag_fd.keys()
['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', '.', 'CNJ', 'PRO', 'ADV', 'VD', ...]
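The ranking step can be tried without downloading a corpus. The following is a minimal sketch under stated assumptions: it is written for Python 3, it uses `collections.Counter` as a stand-in for `nltk.FreqDist`, and `toy_tagged` is a hypothetical hand-tagged sample, not Brown data.

```python
from collections import Counter

# Hypothetical hand-tagged sample (an assumption, not Brown corpus data).
toy_tagged = [
    ('the', 'DET'), ('dog', 'N'), ('saw', 'VD'), ('the', 'DET'),
    ('old', 'ADJ'), ('cat', 'N'), ('in', 'P'), ('town', 'N'), ('park', 'N'),
]

# Count every tag, ignoring the words.
tag_counts = Counter(tag for (word, tag) in toy_tagged)

# most_common() orders tags from most to least frequent.
ranked_tags = [tag for (tag, count) in tag_counts.most_common()]
print(ranked_tags)
```

Tags with equal counts keep their first-seen order here, so only the relative position of tags with different counts is meaningful.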
6
What Tags Precede Nouns?
>>> word_tag_pairs = nltk.bigrams(brown_news_tagged)
>>> list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'N'))
['DET', 'ADJ', 'N', 'P', 'NP', 'NUM', 'V', 'PRO', 'CNJ', '.', ',', 'VG', 'VN', ...]
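The same question can be asked of a small sample. This sketch assumes Python 3, pairs adjacent tokens with `zip` in place of `nltk.bigrams`, and uses a hypothetical `toy_tagged` list rather than Brown data.

```python
from collections import Counter

# Hypothetical hand-tagged sample (an assumption, not Brown corpus data).
toy_tagged = [
    ('the', 'DET'), ('dog', 'N'), ('chased', 'VD'), ('a', 'DET'),
    ('cat', 'N'), ('near', 'P'), ('old', 'ADJ'), ('houses', 'N'),
]

# Each pair is ((word1, tag1), (word2, tag2)) for adjacent tokens.
pairs = zip(toy_tagged, toy_tagged[1:])

# Keep the tag of the first token whenever the second token is a noun.
preceding = Counter(a[1] for (a, b) in pairs if b[1] == 'N')
print(preceding.most_common())
```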
7
Most common Verbs
>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
>>> word_tag_fd = nltk.FreqDist(wsj)
>>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')]
['is/V', 'said/VD', 'was/VD', 'are/V', 'be/V', 'has/V', 'have/V', 'says/V', 'were/VD', 'had/VD', 'been/VN', "'s/V", 'do/V', 'say/V', 'make/V', 'did/VD', 'rose/VD', 'does/V', 'expected/VN', 'buy/V', 'take/V', 'get/V', 'sell/V', 'help/V', 'added/VD', 'including/VG', 'according/VG', 'made/VN', 'pay/V', ...]
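A small-scale version of this listing, as a sketch: it assumes Python 3, counts whole (word, tag) pairs with `collections.Counter` instead of `nltk.FreqDist`, and uses a hypothetical `toy_tagged` sample rather than the treebank.

```python
from collections import Counter

# Hypothetical hand-tagged sample (an assumption, not treebank data).
toy_tagged = [
    ('is', 'V'), ('said', 'VD'), ('is', 'V'), ('market', 'N'),
    ('said', 'VD'), ('is', 'V'), ('rose', 'VD'),
]

# The counted items are (word, tag) pairs, so 'cut'/V and 'cut'/N would count separately.
word_tag_counts = Counter(toy_tagged)

# Walk the pairs from most to least frequent and keep only verb tags.
verbs = [word + "/" + tag
         for ((word, tag), count) in word_tag_counts.most_common()
         if tag.startswith('V')]
print(verbs)
```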
8
Rank Tags for words using CFDs
Treat each word as a condition and each tag as an event:
>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
>>> cfd1 = nltk.ConditionalFreqDist(wsj)
>>> print cfd1['yield'].keys()
['V', 'N']
>>> print cfd1['cut'].keys()
['V', 'VD', 'N', 'VN']
9
Tags and counts for the word cut
print "ranked tags for the word cut"
cut_tags = cfd1['cut'].keys()
print "Counts for cut"
for c in cut_tags:
    print c, cfd1['cut'][c]

Output:
ranked tags for the word cut
Counts for cut
V 12
VD 10
N 3
VN 3
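The word-as-condition, tag-as-event structure can be sketched with the standard library. This assumes Python 3; `defaultdict(Counter)` is a stand-in for `nltk.ConditionalFreqDist`, and the `toy_tagged` counts are made up for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sample (an assumption, not treebank data).
toy_tagged = [
    ('cut', 'V'), ('cut', 'VD'), ('cut', 'V'), ('cut', 'N'),
    ('yield', 'N'), ('yield', 'V'), ('yield', 'V'),
]

# condition (word) -> Counter of events (tags)
cfd = defaultdict(Counter)
for word, tag in toy_tagged:
    cfd[word][tag] += 1

# Print the tags of 'cut' with their counts, most frequent first.
for tag, count in cfd['cut'].most_common():
    print(tag, count)
```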
10
P(W | T) – Flipping it around
>>> cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
>>> print cfd2['VN'].keys()
['been', 'expected', 'made', 'compared', 'based', 'priced', 'used', 'sold', 'named', 'designed', 'held', 'fined', 'taken', 'paid', 'traded', 'said', ...]
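One way to read the slide title is as a relative-frequency estimate, count(tag, word) / count(tag). The slide itself only lists the words per tag, so the estimate below is an added assumption, as are the Python 3 syntax, the helper name `p_word_given_tag`, and the made-up `toy_tagged` sample.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sample (an assumption, not treebank data).
toy_tagged = [
    ('been', 'VN'), ('made', 'VN'), ('been', 'VN'), ('made', 'VD'),
    ('said', 'VD'), ('been', 'VN'),
]

# Flip the pairs: condition is now the tag, event is the word.
by_tag = defaultdict(Counter)
for word, tag in toy_tagged:
    by_tag[tag][word] += 1

def p_word_given_tag(word, tag):
    """Relative-frequency estimate of P(word | tag); 0.0 for an unseen tag."""
    total = sum(by_tag[tag].values())
    return by_tag[tag][word] / total if total else 0.0

print(p_word_given_tag('been', 'VN'))
```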
11
List of words for which VD and VN are both events
list1 = [w for w in cfd1.conditions() if 'VD' in cfd1[w] and 'VN' in cfd1[w]]
print list1
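The membership test on each word's tag distribution can be sketched as follows, assuming Python 3, a `defaultdict(Counter)` stand-in for the conditional frequency distribution, and a hypothetical `toy_tagged` sample.

```python
from collections import Counter, defaultdict

# Hypothetical hand-tagged sample (an assumption, not treebank data).
toy_tagged = [
    ('asked', 'VD'), ('asked', 'VN'), ('took', 'VD'),
    ('taken', 'VN'), ('said', 'VD'), ('said', 'VN'),
]

cfd = defaultdict(Counter)
for word, tag in toy_tagged:
    cfd[word][tag] += 1

# A word qualifies only if it was seen with both tags.
both = sorted(w for w in cfd if 'VD' in cfd[w] and 'VN' in cfd[w])
print(both)
```

Regular verbs such as "asked" share one form for past tense and past participle, which is why they show up in this list while "took" and "taken" do not.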
12
Print the 4 word/tag pairs before kicked/VD
idx1 = wsj.index(('kicked', 'VD'))
print wsj[idx1-4:idx1+1]
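The context-window slice has an edge case worth a sketch: if the match sits fewer than four tokens from the start, `idx - 4` is negative and the slice is taken relative to the end of the sequence. The version below assumes Python 3 and a plain list named `toy_tagged` (hypothetical data); the clamp with `max` is an addition, not part of the slide.

```python
# Hypothetical hand-tagged sample (an assumption, not treebank data).
toy_tagged = [
    ('he', 'PRO'), ('kicked', 'VD'), ('the', 'DET'), ('ball', 'N'),
]

# Position of the first occurrence of the pair; raises ValueError if absent.
idx = toy_tagged.index(('kicked', 'VD'))

# Clamp the start so a negative index cannot wrap around to the end of the list.
start = max(0, idx - 4)
context = toy_tagged[start:idx + 1]
print(context)
```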
14
Table 2.4
Example                                 Description
cfdist = ConditionalFreqDist(pairs)     create a conditional frequency distribution from a list of pairs
cfdist.conditions()                     alphabetically sorted list of conditions
cfdist[condition]                       the frequency distribution for this condition
cfdist[condition][sample]               frequency for the given sample for this condition
cfdist.tabulate()                       tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)    tabulation limited to the specified samples and conditions
cfdist.plot()                           graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions)        graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2                       test if samples in cfdist1 occur less frequently than in cfdist2
15
Example 5.2 (code_findtags.py)
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())

>>> tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
>>> for tag in sorted(tagdict):
...     print tag, tagdict[tag]
...
16
NN ['year', 'time', 'state', 'week', 'home']
NN$ ["year's", "world's", "state's", "city's", "company's"]
NN$-HL ["Golf's", "Navy's"]
NN$-TL ["President's", "Administration's", "Army's", "Gallery's", "League's"]
NN-HL ['Question', 'Salary', 'business', 'condition', 'cut']
NN-NC ['aya', 'eva', 'ova']
NN-TL ['President', 'House', 'State', 'University', 'City']
NN-TL-HL ['Fort', 'Basin', 'Beat', 'City', 'Commissioner']
NNS ['years', 'members', 'people', 'sales', 'men']
NNS$ ["children's", "women's", "janitors'", "men's", "builders'"]
NNS$-HL ["Dealers'", "Idols'"]
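Example 5.2 relies on `keys()` returning samples in frequency order, which was the NLTK behavior when these slides were written. A hedged Python 3 sketch of the same idea, using only the standard library, is shown here; the parameter `n` and the `toy_tagged` sample are hypothetical additions, and `most_common` makes the ordering explicit.

```python
from collections import Counter, defaultdict

def findtags(tag_prefix, tagged_text, n=5):
    """Map each tag starting with tag_prefix to its n most frequent words."""
    cfd = defaultdict(Counter)          # tag -> Counter of words
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            cfd[tag][word] += 1
    return {tag: [w for (w, _) in cfd[tag].most_common(n)] for tag in cfd}

# Hypothetical hand-tagged sample (an assumption, not Brown corpus data).
toy_tagged = [
    ('year', 'NN'), ('time', 'NN'), ('year', 'NN'), ("year's", 'NN$'),
    ('years', 'NNS'), ('said', 'VBD'),
]

tagdict = findtags('NN', toy_tagged)
for tag in sorted(tagdict):
    print(tag, tagdict[tag])
```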
17
words following often
import nltk
from nltk.corpus import brown
print "For the Brown Tagged Corpus category=learned"
brown_learned_text = brown.words(categories='learned')
print "sorted words following often"
print sorted(set(b for (a, b) in nltk.ibigrams(brown_learned_text) if a == 'often'))
18
Tags of the words following often
brown_lrnd_tagged = brown.tagged_words(categories='learned', simplify_tags=True)
tags = [b[1] for (a, b) in nltk.ibigrams(brown_lrnd_tagged) if a[0] == 'often']
fd = nltk.FreqDist(tags)
fd.tabulate()

Output (tag row only):
VN V VD ADJ DET ADV P , CNJ TO VBZ VG WH
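Counting the tags that follow a given word can be sketched on a toy sample. Assumptions: Python 3, `zip` in place of `nltk.ibigrams`, `collections.Counter` in place of `nltk.FreqDist`, and a hypothetical `toy_tagged` list.

```python
from collections import Counter

# Hypothetical hand-tagged sample (an assumption, not Brown corpus data).
toy_tagged = [
    ('they', 'PRO'), ('often', 'ADV'), ('said', 'VD'), ('it', 'PRO'),
    ('is', 'V'), ('often', 'ADV'), ('used', 'VN'), ('and', 'CNJ'),
    ('often', 'ADV'), ('used', 'VN'),
]

# For each adjacent pair, keep the second token's tag when the first word is 'often'.
following = Counter(
    b[1] for (a, b) in zip(toy_tagged, toy_tagged[1:]) if a[0] == 'often'
)

for tag, count in following.most_common():
    print(tag, count)
```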
19
highly ambiguous words
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> data = nltk.ConditionalFreqDist((word.lower(), tag)
...                                 for (word, tag) in brown_news_tagged)
>>> for word in data.conditions():
...     if len(data[word]) > 3:
...         tags = data[word].keys()
...         print word, ' '.join(tags)
...
best ADJ ADV NP V
better ADJ ADV V DET
…
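The ambiguity filter only needs the set of distinct tags per lowercased word, so a sketch can use `defaultdict(set)`. This assumes Python 3 and a hypothetical `toy_tagged` sample; the threshold of more than three tags is the one used on the slide.

```python
from collections import defaultdict

# Hypothetical hand-tagged sample (an assumption, not Brown corpus data).
toy_tagged = [
    ('Best', 'ADJ'), ('best', 'ADV'), ('best', 'NP'), ('best', 'V'),
    ('close', 'ADJ'), ('close', 'V'), ('the', 'DET'),
]

# Lowercase the word so 'Best' and 'best' share one entry.
tags_by_word = defaultdict(set)
for word, tag in toy_tagged:
    tags_by_word[word.lower()].add(tag)

# Keep words seen with more than three distinct tags.
ambiguous = {w: sorted(t) for (w, t) in tags_by_word.items() if len(t) > 3}
print(ambiguous)
```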
20
Tag Package