
1 Lecture 7 NLTK POS Tagging
Topics: Taggers; Rule Based Taggers; Probabilistic Taggers; Transformation Based Taggers - Brill; Supervised learning
Readings: Chapter 5.4-?
February 3, 2011
CSCE 771 Natural Language Processing

2 – 2 – CSCE 771 Spring 2011 Overview
Last Time: Overview of POS Tags
Today: Part of Speech Tagging; Parts of Speech; Rule Based taggers; Stochastic taggers; Transformational taggers
Readings: Chapter 5.4-5.?

3 NLTK tagging
>>> import nltk
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

4
>>> text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]

5
>>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
>>> text.similar('woman')
Building word-context index...
man time day year car moment world family house country child boy state job way war girl place room word
>>> text.similar('bought')
made said put done seen had found left given heard brought got been was set told took in felt that
>>> text.similar('over')
in on to of and for with from at by that into as up out down through is all about
>>> text.similar('the')
a his this their its her an that our any all one these my in your no some other and

6 Tagged Corpora
By convention in NLTK, a tagged token is represented as a (word, tag) tuple. The function str2tuple() builds one from the usual word/TAG string form.
>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'
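The conversion on this slide can be sketched in plain Python 3 without NLTK. The helper name `split_tagged_token` is hypothetical and is not NLTK's code; it assumes the tag follows the last separator and is upper-cased, which only approximates the behaviour of `str2tuple()`.

```python
def split_tagged_token(token, sep="/"):
    """Split 'word/TAG' into a (word, tag) tuple at the last separator."""
    word, found, tag = token.rpartition(sep)
    if not found:
        # No separator present: treat the whole string as an untagged word.
        return (token, None)
    return (word, tag.upper())


tagged_token = split_tagged_token("fly/NN")
print(tagged_token)
print(tagged_token[0], tagged_token[1])
```

Splitting at the last separator rather than the first keeps tokens such as `1/2/CD` intact as the word `1/2` with tag `CD`.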

7 Specifying Tags with Strings
>>> sent = '''
... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
... [...]
... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
... interest/NN of/IN both/ABX governments/NNS ''/'' ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'), ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]

8 Reading Tagged Corpora
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]
>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'N'), ('County', 'N'), ...]

9 tagged_words() method
>>> print nltk.corpus.nps_chat.tagged_words()
[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]
>>> nltk.corpus.conll2000.tagged_words()
[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]
>>> nltk.corpus.treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]

10
>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...]
>>> nltk.corpus.treebank.tagged_words(simplify_tags=True)
[('Pierre', 'NP'), ('Vinken', 'NP'), (',', ','), ...]

11 readme() methods

12 Table 5.1: Simplified Part-of-Speech Tagset
Tag   Meaning        Examples
ADJ   adjective      new, good, high, special, big, local
ADV   adverb         really, already, still, early, now
CNJ   conjunction    and, or, but, if, while, although
DET   determiner     the, a, some, most, every, no
EX    existential    there, there's
FW    foreign word   dolce, ersatz, esprit, quo, maitre

13 Table 5.1: Simplified Part-of-Speech Tagset (continued)
Tag   Meaning              Examples
MOD   modal verb           will, can, would, may, must, should
N     noun                 year, home, costs, time, education
NP    proper noun          Alison, Africa, April, Washington
NUM   number               twenty-four, fourth, 1991, 14:24
PRO   pronoun              he, their, her, its, my, I, us
P     preposition          on, of, at, with, by, into, under
TO    the word to          to
UH    interjection         ah, bang, ha, whee, hmpf, oops
V     verb                 is, has, get, do, make, see, run
VD    past tense           said, took, told, made, asked
VG    present participle   making, going, playing, working
VN    past participle      given, taken, begun, sung
WH    wh determiner        who, which, when, what, where, how

14
>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> tag_fd.keys()
['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', '.', 'CNJ', 'PRO', 'ADV', 'VD', ...]
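As a rough illustration of what the tag frequency distribution computes, here is a Python 3 sketch using `collections.Counter` in place of `nltk.FreqDist`. The `sample_tagged` list is invented for the example and stands in for the Brown news data.

```python
from collections import Counter

# Hand-made (word, tag) pairs; real input would come from a tagged corpus.
sample_tagged = [
    ("The", "DET"), ("jury", "N"), ("said", "VD"), ("the", "DET"),
    ("election", "N"), ("was", "VD"), ("fair", "ADJ"), (".", "."),
]

# Count how often each tag occurs, ignoring the words themselves.
tag_counts = Counter(tag for (word, tag) in sample_tagged)

# most_common() lists tags from most to least frequent.
for tag, count in tag_counts.most_common():
    print(tag, count)
```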

15 Nouns
>>> word_tag_pairs = nltk.bigrams(brown_news_tagged)
>>> list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'N'))
['DET', 'ADJ', 'N', 'P', 'NP', 'NUM', 'V', 'PRO', 'CNJ', '.', ',', 'VG', 'VN', ...]
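The question asked on this slide is which tags occur immediately before a noun. A small Python 3 sketch of the same idea, with `zip` standing in for `nltk.bigrams` and an invented sample sentence:

```python
from collections import Counter

sample_tagged = [
    ("The", "DET"), ("grand", "ADJ"), ("jury", "N"), ("praised", "VD"),
    ("the", "DET"), ("city", "N"), ("officials", "N"), (".", "."),
]

# Pair every tagged token with its successor (adjacent pairs, i.e. bigrams).
pairs = zip(sample_tagged, sample_tagged[1:])

# Keep the tag of the first element whenever the second element is a noun.
before_noun = Counter(a[1] for (a, b) in pairs if b[1] == "N")
print(before_noun.most_common())
```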

16 Verbs
>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
>>> word_tag_fd = nltk.FreqDist(wsj)
>>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')]
['is/V', 'said/VD', 'was/VD', 'are/V', 'be/V', 'has/V', 'have/V', 'says/V', 'were/VD', 'had/VD', 'been/VN', "'s/V", 'do/V', 'say/V', 'make/V', 'did/VD', 'rose/VD', 'does/V', 'expected/VN', 'buy/V', 'take/V', 'get/V', 'sell/V', 'help/V', 'added/VD', 'including/VG', 'according/VG', 'made/VN', 'pay/V', ...]

17
>>> cfd1 = nltk.ConditionalFreqDist(wsj)
>>> cfd1['yield'].keys()
['V', 'N']
>>> cfd1['cut'].keys()
['V', 'VD', 'N', 'VN']
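A conditional frequency distribution is essentially one frequency table per condition. The following Python 3 sketch models that with a `defaultdict` of `Counter` objects. It is an approximation of the idea, not of NLTK's class, and the sample pairs are invented.

```python
from collections import Counter, defaultdict

sample_tagged = [
    ("cut", "V"), ("cut", "N"), ("cut", "VD"), ("cut", "V"),
    ("yield", "N"), ("yield", "V"), ("yield", "N"),
]

# Outer key is the condition (the word); the inner Counter tallies its tags.
word_to_tags = defaultdict(Counter)
for word, tag in sample_tagged:
    word_to_tags[word][tag] += 1

print(sorted(word_to_tags["cut"]))           # every tag seen with 'cut'
print(word_to_tags["yield"].most_common(1))  # most frequent tag for 'yield'
```

Swapping the pair order to (tag, word) gives the inverse view, with tags as conditions and words as samples.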

18
>>> cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
>>> cfd2['VN'].keys()
['been', 'expected', 'made', 'compared', 'based', 'priced', 'used', 'sold', 'named', 'designed', 'held', 'fined', 'taken', 'paid', 'traded', 'said', ...]

19
>>> [w for w in cfd1.conditions() if 'VD' in cfd1[w] and 'VN' in cfd1[w]]
['Asked', 'accelerated', 'accepted', 'accused', 'acquired', 'added', 'adopted', ...]
>>> idx1 = wsj.index(('kicked', 'VD'))
>>> wsj[idx1-4:idx1+1]
[('While', 'P'), ('program', 'N'), ('trades', 'N'), ('swiftly', 'ADV'), ('kicked', 'VD')]
>>> idx2 = wsj.index(('kicked', 'VN'))
>>> wsj[idx2-4:idx2+1]
[('head', 'N'), ('of', 'P'), ('state', 'N'), ('has', 'V'), ('kicked', 'VN')]

20
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())
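The slide's `findtags()` slices the result of `keys()`, which relies on Python 2 lists; dictionary views in Python 3 cannot be sliced. Below is a hedged Python 3 sketch of the same idea using only the standard library. The name `find_tags`, the `top_n` parameter, and the sample data are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def find_tags(tag_prefix, tagged_text, top_n=5):
    """Map each tag starting with tag_prefix to its top_n most frequent words."""
    by_tag = defaultdict(Counter)
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            by_tag[tag][word] += 1
    # most_common() makes the frequency ordering explicit.
    return {tag: [w for w, _ in counts.most_common(top_n)]
            for tag, counts in by_tag.items()}


sample = [("year", "NN"), ("year", "NN"), ("time", "NN"), ("years", "NNS"),
          ("said", "VBD"), ("Fulton", "NP")]
for tag, words in sorted(find_tags("NN", sample).items()):
    print(tag, words)
```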

21 Reading URLs (NLTK book 3.1)
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
1176831
>>> raw[:75]
See also: http://docs.python.org/2/library/urllib2.html
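The session above is Python 2. In Python 3, `urlopen` lives in `urllib.request` and `read()` returns bytes that need decoding. A minimal sketch follows; the helper name `fetch_text` and the UTF-8 default are assumptions, and a `data:` URL is used only so the example runs without network access.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Download a URL and decode the response bytes into a str."""
    with urlopen(url) as response:
        return response.read().decode(encoding)


# A data: URL keeps the example offline; an http(s) URL is fetched the same way.
raw = fetch_text("data:text/plain;charset=utf-8,The%20Project%20Gutenberg%20EBook")
print(type(raw), len(raw))
print(raw[:11])
```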

22
>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
<type 'list'>
>>> len(tokens)
255809
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']

23 Dealing with HTML
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urlopen(url).read()
>>> html[:60]
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'
>>> raw = nltk.clean_html(html)
>>> tokens = nltk.word_tokenize(raw)
>>> tokens
['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'", 'to', 'die', 'out', ...]
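Later NLTK releases no longer implement `clean_html()` and point users to a dedicated HTML library instead. As a rough stand-in, the sketch below strips markup with the standard library's `html.parser`. The class `TextExtractor`, the function `strip_html`, and the sample HTML are all invented for this example, and the output will not match `clean_html()` exactly.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def strip_html(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


html = ("<html><head><title>BBC NEWS | Health</title>"
        "<style>p {color: red}</style></head>"
        "<body><p>Blondes 'to die out'</p></body></html>")
print(strip_html(html))
```

The resulting string can then be passed to a tokenizer in the same way as `raw` on the slide.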

25 Chap 2 Brown corpus
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County'...], ['The', 'jury', 'further'...], ...]

26 Freq Dist
>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print m + ':', fdist[m],
...
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389

27
>>> fdist1 = FreqDist(text1)
>>> fdist1
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']
>>> fdist1['whale']
906

28
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)
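To show what tabulating by condition and sample amounts to, here is a standard-library Python 3 sketch. The function `tabulate` below is a hypothetical stand-in with its own layout, not NLTK's `tabulate()` method, and the (genre, word) pairs are toy data rather than Brown corpus counts.

```python
from collections import Counter, defaultdict

# Toy (genre, word) pairs standing in for the Brown corpus.
pairs = [
    ("news", "will"), ("news", "will"), ("news", "can"),
    ("romance", "could"), ("romance", "could"), ("romance", "will"),
]

cfd = defaultdict(Counter)
for genre, word in pairs:
    cfd[genre][word] += 1


def tabulate(cfd, conditions, samples):
    """Format counts as text rows: one row per condition, one column per sample."""
    rows = [" " * 10 + "".join(f"{s:>7}" for s in samples)]
    for cond in conditions:
        counts = cfd.get(cond, Counter())  # unseen conditions count as all zeros
        rows.append(f"{cond:<10}" + "".join(f"{counts[s]:>7}" for s in samples))
    return rows


for row in tabulate(cfd, ["news", "romance"], ["can", "could", "will"]):
    print(row)
```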

29 Table 2.2: Some of the Corpora and Corpus Samples Distributed with NLTK

30 Table 2.3: Basic Corpus Functionality
fileids()                    the files of the corpus
fileids([categories])        the files of the corpus corresponding to these categories
categories()                 the categories of the corpus
categories([fileids])        the categories of the corpus corresponding to these files
raw()                        the raw content of the corpus
raw(fileids=[f1,f2,f3])      the raw content of the specified files
raw(categories=[c1,c2])      the raw content of the specified categories
words()                      the words of the whole corpus
words(fileids=[f1,f2,f3])    the words of the specified fileids
words(categories=[c1,c2])    the words of the specified categories
sents()                      the sentences of the whole corpus
sents(fileids=[f1,f2,f3])    the sentences of the specified fileids
sents(categories=[c1,c2])    the sentences of the specified categories
abspath(fileid)              the location of the given file on disk
...

31
def generate_model(cfdist, word, num=15):
    for i in range(num):
        print word,
        word = cfdist[word].max()

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
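The generation loop on this slide repeatedly emits a word and then moves to its most frequent successor. A self-contained Python 3 sketch of that greedy procedure is given below. The function names, the short sample text, and the early exit on a dead end are assumptions added for the example.

```python
from collections import Counter, defaultdict


def build_bigram_cfd(words):
    """Map each word to a Counter of the words that follow it."""
    cfd = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        cfd[first][second] += 1
    return cfd


def generate(cfd, word, num=15):
    """Greedily follow the most frequent successor, starting from word."""
    output = []
    for _ in range(num):
        output.append(word)
        if not cfd.get(word):
            break  # dead end: this word was never followed by anything
        word = cfd[word].most_common(1)[0][0]
    return output


words = "in the beginning god created the heaven and the earth and the earth was".split()
cfd = build_bigram_cfd(words)
print(" ".join(generate(cfd, "god", num=4)))
```

Because the choice is always the single most likely successor, longer runs of this kind of model tend to fall into repeating loops.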

32 Example 2.5 (code_random_text.py)

33 Table 2.4
Example                                 Description
cfdist = ConditionalFreqDist(pairs)     create a conditional frequency distribution from a list of pairs
cfdist.conditions()                     alphabetically sorted list of conditions
cfdist[condition]                       the frequency distribution for this condition
cfdist[condition][sample]               frequency for the given sample for this condition
cfdist.tabulate()                       tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)    tabulation limited to the specified samples and conditions
cfdist.plot()                           graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions)        graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2                       test if samples in cfdist1 occur less frequently than in cfdist2

34
>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
>>> word_tag_fd = nltk.FreqDist(wsj)
>>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')]
['is/V', 'said/VD', 'was/VD', 'are/V', 'be/V', 'has/V', 'have/V', 'says/V', 'were/VD', 'had/VD', 'been/VN', "'s/V", 'do/V', 'say/V', 'make/V', 'did/VD', 'rose/VD', 'does/V', 'expected/VN', 'buy/V', 'take/V', 'get/V', 'sell/V', 'help/V', 'added/VD', 'including/VG', 'according/VG', 'made/VN', 'pay/V', ...]

35 Example 5.2 (code_findtags.py)

36 Highly ambiguous words
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> data = nltk.ConditionalFreqDist((word.lower(), tag)
...                                 for (word, tag) in brown_news_tagged)
>>> for word in data.conditions():
...     if len(data[word]) > 3:
...         tags = data[word].keys()
...         print word, ' '.join(tags)
...
best ADJ ADV NP V
better ADJ ADV V DET
...
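The test `len(data[word]) > 3` selects words that appear with at least four distinct tags. A small Python 3 sketch of that filter using plain sets follows. The function name `ambiguous_words`, the `min_tags` parameter, and the sample pairs are invented for illustration.

```python
from collections import defaultdict


def ambiguous_words(tagged_words, min_tags=4):
    """Return {word: sorted tags} for words seen with at least min_tags tags."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_words:
        # Lower-case so 'Best' and 'best' count as the same word.
        tags_by_word[word.lower()].add(tag)
    return {w: sorted(t) for w, t in tags_by_word.items() if len(t) >= min_tags}


sample = [("Best", "ADJ"), ("best", "ADV"), ("best", "NP"), ("best", "V"),
          ("the", "DET"), ("The", "DET"), ("run", "N"), ("run", "V")]
for word, tags in ambiguous_words(sample).items():
    print(word, " ".join(tags))
```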

