Published by Ann Lindsey. Modified over 9 years ago.
1
Lecture 9: NLTK POS Tagging, Part 2
Topics
  Taggers
  Rule Based Taggers
  Probabilistic Taggers
  Transformation Based Taggers - Brill
  Supervised learning
Readings: Chapter 5.4-?
February 3, 2011
CSCE 771 Natural Language Processing
2
– 2 – CSCE 771 Spring 2011
Overview
Last Time
  Overview of POS Tags
Today
  Part of Speech Tagging
  Parts of Speech
  Rule Based taggers
  Stochastic taggers
  Transformational taggers
Readings
  Chapter 5.4-5.?
3
Table 5.1: Simplified Part-of-Speech Tagset
Tag   Meaning        Examples
ADJ   adjective      new, good, high, special, big, local
ADV   adverb         really, already, still, early, now
CNJ   conjunction    and, or, but, if, while, although
DET   determiner     the, a, some, most, every, no
EX    existential    there, there's
FW    foreign word   dolce, ersatz, esprit, quo, maitre
4
Table 5.1: Simplified Part-of-Speech Tagset (continued)
Tag   Meaning              Examples
MOD   modal verb           will, can, would, may, must, should
N     noun                 year, home, costs, time, education
NP    proper noun          Alison, Africa, April, Washington
NUM   number               twenty-four, fourth, 1991, 14:24
PRO   pronoun              he, their, her, its, my, I, us
P     preposition          on, of, at, with, by, into, under
TO    the word to          to
UH    interjection         ah, bang, ha, whee, hmpf, oops
V     verb                 is, has, get, do, make, see, run
VD    past tense           said, took, told, made, asked
VG    present participle   making, going, playing, working
VN    past participle      given, taken, begun, sung
WH    wh determiner        who, which, when, what, where, how
5
Rank tags from most to least common
>>> import nltk
>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> print tag_fd.keys()
['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', '.', 'CNJ', 'PRO', 'ADV', 'VD',...]
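The ranking above depends on the older NLTK behaviour in which FreqDist.keys() returned samples sorted by decreasing frequency; in NLTK 3 the simplify_tags=True argument was also replaced by tagset='universal'. As a minimal sketch of the same idea in plain Python 3, assuming a small hand-made list (toy_tagged is hypothetical, not Brown data), collections.Counter.most_common() makes the ordering explicit:

```python
from collections import Counter

# Hypothetical toy data: (word, tag) pairs in the simplified tagset.
toy_tagged = [
    ("the", "DET"), ("dog", "N"), ("chased", "VD"), ("a", "DET"),
    ("cat", "N"), ("near", "P"), ("home", "N"),
]

# Count how often each tag occurs, ignoring the words.
tag_counts = Counter(tag for (word, tag) in toy_tagged)

# most_common() returns (tag, count) pairs from most to least frequent.
ranked_tags = [tag for tag, count in tag_counts.most_common()]
print(ranked_tags)
```

Tags with equal counts keep their first-seen order, so only the relative position of tags with different counts is meaningful.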
6
What Tags Precede Nouns?
>>> word_tag_pairs = nltk.bigrams(brown_news_tagged)
>>> list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'N'))
['DET', 'ADJ', 'N', 'P', 'NP', 'NUM', 'V', 'PRO', 'CNJ', '.', ',', 'VG', 'VN',...]
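The bigram step can be sketched without NLTK by zipping a sequence with itself shifted by one. This is an illustration on hypothetical toy data (toy_tagged), not the Brown corpus:

```python
from collections import Counter

# Hypothetical toy data: (word, tag) pairs in the simplified tagset.
toy_tagged = [
    ("the", "DET"), ("old", "ADJ"), ("dog", "N"), ("chased", "VD"),
    ("the", "DET"), ("cat", "N"), ("near", "P"), ("the", "DET"), ("park", "N"),
]

# Adjacent pairs: each element is ((word1, tag1), (word2, tag2)).
tagged_bigrams = zip(toy_tagged, toy_tagged[1:])

# Count the tag of the first item whenever the second item is a noun.
preceding_noun = Counter(a[1] for (a, b) in tagged_bigrams if b[1] == "N")
print(preceding_noun.most_common())
```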
7
Most common Verbs
>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
>>> word_tag_fd = nltk.FreqDist(wsj)
>>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')]
['is/V', 'said/VD', 'was/VD', 'are/V', 'be/V', 'has/V', 'have/V', 'says/V',
 'were/VD', 'had/VD', 'been/VN', "'s/V", 'do/V', 'say/V', 'make/V', 'did/VD',
 'rose/VD', 'does/V', 'expected/VN', 'buy/V', 'take/V', 'get/V', 'sell/V',
 'help/V', 'added/VD', 'including/VG', 'according/VG', 'made/VN', 'pay/V',...]
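Here the frequency distribution is over whole (word, tag) pairs, and the tag prefix 'V' selects every verb tag (V, VD, VG, VN). A small Python 3 sketch of that filtering, assuming hypothetical toy data in place of the treebank sample:

```python
from collections import Counter

# Hypothetical toy data standing in for the tagged treebank words.
toy_wsj = (
    [("stock", "N")] * 4
    + [("is", "V")] * 3
    + [("said", "VD")] * 2
    + [("rose", "VD")]
)

# Count each (word, tag) pair as a single sample.
pair_counts = Counter(toy_wsj)

# Walk the pairs from most to least frequent, keeping only verb tags.
common_verbs = [
    word + "/" + tag
    for (word, tag), count in pair_counts.most_common()
    if tag.startswith("V")
]
print(common_verbs)
```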
8
Rank Tags for words using CFDs
word as a condition and the tag as an event
>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
>>> cfd1 = nltk.ConditionalFreqDist(wsj)
>>> print cfd1['yield'].keys()
['V', 'N']
>>> print cfd1['cut'].keys()
['V', 'VD', 'N', 'VN']
9
Tags and counts for the word cut
print "ranked tags for the word cut"
cut_tags = cfd1['cut'].keys()
print "Counts for cut"
for c in cut_tags:
    print c, cfd1['cut'][c]

Output:
ranked tags for the word cut
Counts for cut
V 12
VD 10
N 3
VN 3
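A conditional frequency distribution can be thought of as one counter per condition. The following is a rough Python 3 sketch of that structure, not NLTK's implementation; the data (toy_wsj) and its counts are hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical toy data: the word is the condition, the tag is the event.
toy_wsj = (
    [("cut", "V")] * 3
    + [("cut", "VD")] * 2
    + [("cut", "N")]
    + [("yield", "V"), ("yield", "N"), ("yield", "V")]
)

# One Counter of tags for every distinct word.
cfd = defaultdict(Counter)
for word, tag in toy_wsj:
    cfd[word][tag] += 1

# Ranked tags and their counts for a single condition.
for tag, count in cfd["cut"].most_common():
    print(tag, count)
```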
10
P(W | T) – Flipping it around
>>> cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
>>> print cfd2['VN'].keys()
['been', 'expected', 'made', 'compared', 'based', 'priced', 'used', 'sold',
 'named', 'designed', 'held', 'fined', 'taken', 'paid', 'traded', 'said',...]
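With the tag as the condition, the counts can be turned into an estimate of P(W | T). One common choice, assumed here rather than stated on the slide, is the relative frequency count(t, w) / count(t). A sketch on hypothetical toy data:

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (word, tag) pairs.
toy_wsj = [
    ("been", "VN"), ("made", "VN"), ("been", "VN"), ("taken", "VN"),
    ("made", "VD"), ("said", "VD"),
]

# Flip each pair so that the tag is the condition and the word is the event.
words_by_tag = defaultdict(Counter)
for word, tag in toy_wsj:
    words_by_tag[tag][word] += 1

def p_word_given_tag(word, tag):
    """Relative-frequency estimate of P(word | tag); 0.0 for an unseen tag."""
    total = sum(words_by_tag[tag].values())
    return words_by_tag[tag][word] / total if total else 0.0

print(p_word_given_tag("been", "VN"))
```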
11
List of words for which VD and VN are both events
list1 = [w for w in cfd1.conditions() if 'VD' in cfd1[w] and 'VN' in cfd1[w]]
print list1
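The membership test works because each condition maps to its own set of observed tags. A small Python 3 sketch, using hypothetical toy data and a plain dictionary of sets instead of a ConditionalFreqDist:

```python
from collections import defaultdict

# Hypothetical toy data: (word, tag) pairs.
toy_wsj = [
    ("cut", "VD"), ("cut", "VN"), ("said", "VD"),
    ("taken", "VN"), ("made", "VN"), ("made", "VD"),
]

# Collect the set of tags seen for each word.
tags_by_word = defaultdict(set)
for word, tag in toy_wsj:
    tags_by_word[word].add(tag)

# Keep the words seen both as past tense (VD) and as past participle (VN).
vd_and_vn = sorted(w for w, tags in tags_by_word.items() if {"VD", "VN"} <= tags)
print(vd_and_vn)
```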
12
Print kicked/VD and the 4 word/tag pairs before it
idx1 = wsj.index(('kicked', 'VD'))
print wsj[idx1-4:idx1+1]
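Two edge cases are worth keeping in mind with this slice: index() raises ValueError when the pair is absent, and a start index below zero counts from the end of the sequence. A hedged Python 3 sketch of a guard, where context_before is a hypothetical helper and the data is a toy list:

```python
def context_before(tagged_words, target, n=4):
    """Return up to n pairs before the first occurrence of target, plus target."""
    try:
        idx = tagged_words.index(target)
    except ValueError:
        return []  # the target pair does not occur
    start = max(0, idx - n)  # clamp so the slice never wraps around
    return tagged_words[start:idx + 1]

# Hypothetical toy data in which the target occurs near the start.
toy_tagged = [("he", "PRO"), ("kicked", "VD"), ("the", "DET"), ("ball", "N")]

print(context_before(toy_tagged, ("kicked", "VD")))
```

Without the clamp, toy_tagged[1-4:1+1] would start three items from the end of the list and drop the pair for "he".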
14
Table 2.4
Example                                Description
cfdist = ConditionalFreqDist(pairs)    create a conditional frequency distribution from a list of pairs
cfdist.conditions()                    alphabetically sorted list of conditions
cfdist[condition]                      the frequency distribution for this condition
cfdist[condition][sample]              frequency for the given sample for this condition
cfdist.tabulate()                      tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)   tabulation limited to the specified samples and conditions
cfdist.plot()                          graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions)       graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2                      test if samples in cfdist1 occur less frequently than in cfdist2
15
Example 5.2 (code_findtags.py)
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())

>>> tagdict = findtags('NN', nltk.corpus.brown.tagged_words(categories='news'))
>>> for tag in sorted(tagdict):
...     print tag, tagdict[tag]
...
16
NN ['year', 'time', 'state', 'week', 'home']
NN$ ["year's", "world's", "state's", "city's", "company's"]
NN$-HL ["Golf's", "Navy's"]
NN$-TL ["President's", "Administration's", "Army's", "Gallery's", "League's"]
NN-HL ['Question', 'Salary', 'business', 'condition', 'cut']
NN-NC ['aya', 'eva', 'ova']
NN-TL ['President', 'House', 'State', 'University', 'City']
NN-TL-HL ['Fort', 'Basin', 'Beat', 'City', 'Commissioner']
NNS ['years', 'members', 'people', 'sales', 'men']
NNS$ ["children's", "women's", "janitors'", "men's", "builders'"]
NNS$-HL ["Dealers'", "Idols'"]
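The findtags idea (group words by tag, keep only tags with a given prefix, report the most frequent words per tag) can be sketched in plain Python 3. The function find_tags and its top_n parameter are hypothetical names for this illustration, and the data is a toy list rather than the Brown corpus:

```python
from collections import Counter, defaultdict

def find_tags(tag_prefix, tagged_text, top_n=5):
    """Map each tag starting with tag_prefix to its top_n most frequent words."""
    by_tag = defaultdict(Counter)
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            by_tag[tag][word] += 1
    return {
        tag: [word for word, count in counts.most_common(top_n)]
        for tag, counts in by_tag.items()
    }

# Hypothetical toy data using Brown-style noun tags.
toy_tagged = [
    ("year", "NN"), ("time", "NN"), ("year", "NN"), ("years", "NNS"),
    ("year's", "NN$"), ("ran", "VBD"),
]

tagdict = find_tags("NN", toy_tagged)
for tag in sorted(tagdict):
    print(tag, tagdict[tag])
```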
17
words following often
import nltk
from nltk.corpus import brown
print "For the Brown Tagged Corpus category=learned"
brown_learned_text = brown.words(categories='learned')
print "sorted words following often"
print sorted(set(b for (a, b) in nltk.ibigrams(brown_learned_text) if a == 'often'))
18
Tags of the words following often
brown_lrnd_tagged = brown.tagged_words(categories='learned', simplify_tags=True)
tags = [b[1] for (a, b) in nltk.ibigrams(brown_lrnd_tagged) if a[0] == 'often']
fd = nltk.FreqDist(tags)
fd.tabulate()   # tabulate() prints the table itself and returns None

 VN    V   VD  ADJ  DET  ADV    P    ,  CNJ    .   TO  VBZ   VG   WH
 15   12    8    5    5    4    4    3    3    1    1    1    1    1
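This combines the two earlier steps: the condition is on the previous word, while the counted event is the following tag. A Python 3 sketch on hypothetical toy data, using zip in place of nltk.ibigrams (which is not available in NLTK 3):

```python
from collections import Counter

# Hypothetical toy data: (word, tag) pairs from three short clauses.
toy_tagged = [
    ("often", "ADV"), ("used", "VN"),
    ("is", "V"), ("often", "ADV"), ("seen", "VN"),
    ("they", "PRO"), ("often", "ADV"), ("go", "V"),
]

# For each adjacent pair, test the first word and record the second tag.
following_tags = Counter(
    b[1] for (a, b) in zip(toy_tagged, toy_tagged[1:]) if a[0] == "often"
)
print(following_tags.most_common())
```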
19
highly ambiguous words
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> data = nltk.ConditionalFreqDist((word.lower(), tag)
...                                 for (word, tag) in brown_news_tagged)
>>> for word in data.conditions():
...     if len(data[word]) > 3:
...         tags = data[word].keys()
...         print word, ' '.join(tags)
...
best ADJ ADV NP V
better ADJ ADV V DET
….
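The ambiguity test is simply the number of distinct tags observed per lowercased word. A Python 3 sketch of that check, assuming hypothetical toy data and the same threshold of more than three tags:

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (word, tag) pairs with mixed case.
toy_tagged = [
    ("Best", "NP"), ("best", "ADJ"), ("best", "ADV"), ("best", "V"),
    ("best", "ADJ"), ("dog", "N"), ("run", "V"), ("run", "N"),
]

# Lowercase the word so that "Best" and "best" share one condition.
data = defaultdict(Counter)
for word, tag in toy_tagged:
    data[word.lower()][tag] += 1

# Keep the words observed with more than three distinct tags.
ambiguous = {
    word: [tag for tag, count in tags.most_common()]
    for word, tags in data.items()
    if len(tags) > 3
}
for word in sorted(ambiguous):
    print(word, " ".join(ambiguous[word]))
```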
20
Tag Package
http://nltk.org/api/nltk.tag.html#module-nltk.tag