

Lecture 7 NLTK POS Tagging
Topics: Taggers, Rule Based Taggers, Probabilistic Taggers, Transformation Based Taggers - Brill, Supervised learning
Readings: Chapter 5.4-?
February 3, 2011
CSCE 771 Natural Language Processing

– 2 – CSCE 771 Spring 2011
Overview
Last Time: Overview of POS Tags
Today: Part of Speech Tagging, Parts of Speech, Rule Based taggers, Stochastic taggers, Transformational taggers
Readings: Chapter ?

NLTK tagging
>>> import nltk
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
 ('completely', 'RB'), ('different', 'JJ')]

>>> text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
 ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]

>>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
>>> text.similar('woman')
Building word-context index...
man time day year car moment world family house country child boy state job way war girl place room word
>>> text.similar('bought')
made said put done seen had found left given heard brought got been was set told took in felt that
>>> text.similar('over')
in on to of and for with from at by that into as up out down through is all about
>>> text.similar('the')
a his this their its her an that our any all one these my in your no some other and

Tagged Corpora
By convention in NLTK, a tagged token is a tuple of the form (word, tag). The function str2tuple() builds one from a word/tag string:
>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'
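The word/tag convention can be tried without NLTK installed. The sketch below is a stdlib-only stand-in for nltk.tag.str2tuple(), not the library's code; the helper name split_tagged_token is hypothetical, and it assumes the tag is whatever follows the last separator.

```python
def split_tagged_token(text, sep="/"):
    """Split 'word/TAG' at the last separator into a (word, tag) tuple."""
    word, found, tag = text.rpartition(sep)
    if not found:
        # No separator present: treat the whole string as an untagged word.
        return (text, None)
    return (word, tag.upper())


tagged_token = split_tagged_token("fly/NN")
print(tagged_token)
print(tagged_token[0], tagged_token[1])
```

Splitting at the last separator rather than the first keeps tokens that contain a slash themselves in one piece.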

Specifying Tags with Strings
>>> sent = '''
... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
... interest/NN of/IN both/ABX governments/NNS ''/'' ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented', 'VBD'),
 ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]

Reading Tagged Corpora
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]
>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'N'), ('County', 'N'), ...]

tagged_words() method
>>> print nltk.corpus.nps_chat.tagged_words()
[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]
>>> nltk.corpus.conll2000.tagged_words()
[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]
>>> nltk.corpus.treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]

>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...]
>>> nltk.corpus.treebank.tagged_words(simplify_tags=True)
[('Pierre', 'NP'), ('Vinken', 'NP'), (',', ','), ...]

readme() methods

Table 5.1: Simplified Part-of-Speech Tagset

Tag   Meaning              Examples
ADJ   adjective            new, good, high, special, big, local
ADV   adverb               really, already, still, early, now
CNJ   conjunction          and, or, but, if, while, although
DET   determiner           the, a, some, most, every, no
EX    existential          there, there's
FW    foreign word         dolce, ersatz, esprit, quo, maitre
MOD   modal verb           will, can, would, may, must, should
N     noun                 year, home, costs, time, education
NP    proper noun          Alison, Africa, April, Washington
NUM   number               twenty-four, fourth, 1991, 14:24
PRO   pronoun              he, their, her, its, my, I, us
P     preposition          on, of, at, with, by, into, under
TO    the word to          to
UH    interjection         ah, bang, ha, whee, hmpf, oops
V     verb                 is, has, get, do, make, see, run
VD    past tense           said, took, told, made, asked
VG    present participle   making, going, playing, working
VN    past participle      given, taken, begun, sung
WH    wh determiner        who, which, when, what, where, how

>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> tag_fd.keys()
['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', '.', 'CNJ', 'PRO', 'ADV', 'VD', ...]
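Counting tags is ordinary frequency counting, so the idea can be sketched with collections.Counter alone. The sample below is hand-made for illustration, not corpus data. It also assumes Python 3; in NLTK 3, FreqDist is built on Counter and keys() is no longer sorted by frequency, so most_common() is the call that gives the ranked order shown on the slide.

```python
from collections import Counter

# Hand-made sample in the (word, tag) format used above; not real corpus data.
tagged_words = [
    ("The", "DET"), ("jury", "N"), ("said", "VD"), ("the", "DET"),
    ("election", "N"), ("was", "VD"), ("a", "DET"), ("fair", "ADJ"),
    ("election", "N"), (".", "."),
]

# Count how often each tag occurs, ignoring the words.
tag_counts = Counter(tag for (word, tag) in tagged_words)

# most_common() lists the tags from most to least frequent.
for tag, count in tag_counts.most_common():
    print(tag, count)
```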

Nouns
>>> word_tag_pairs = nltk.bigrams(brown_news_tagged)
>>> list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'N'))
['DET', 'ADJ', 'N', 'P', 'NP', 'NUM', 'V', 'PRO', 'CNJ', '.', ',', 'VG', 'VN', ...]
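The "what comes before a noun" query works by pairing each tagged token with its successor and keeping the first tag whenever the second is N. A minimal stdlib sketch follows, with zip() standing in for nltk.bigrams and a hand-made sample in place of the Brown news text:

```python
from collections import Counter

# Hand-made sample; the slide uses the simplified-tag Brown news corpus.
tagged_words = [
    ("The", "DET"), ("jury", "N"), ("said", "VD"), ("the", "DET"),
    ("election", "N"), ("was", "VD"), ("a", "DET"), ("fair", "ADJ"),
    ("election", "N"), (".", "."),
]

# Pair each token with the one after it, as nltk.bigrams would.
pairs = zip(tagged_words, tagged_words[1:])

# Keep the tag of the first token whenever the second token is a noun.
before_noun = Counter(a[1] for (a, b) in pairs if b[1] == "N")

print(before_noun.most_common())
```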

Verbs
>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
>>> word_tag_fd = nltk.FreqDist(wsj)
>>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')]
['is/V', 'said/VD', 'was/VD', 'are/V', 'be/V', 'has/V', 'have/V', 'says/V',
 'were/VD', 'had/VD', 'been/VN', "'s/V", 'do/V', 'say/V', 'make/V', 'did/VD',
 'rose/VD', 'does/V', 'expected/VN', 'buy/V', 'take/V', 'get/V', 'sell/V',
 'help/V', 'added/VD', 'including/VG', 'according/VG', 'made/VN', 'pay/V', ...]

>>> cfd1 = nltk.ConditionalFreqDist(wsj)
>>> cfd1['yield'].keys()
['V', 'N']
>>> cfd1['cut'].keys()
['V', 'VD', 'N', 'VN']

>>> cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
>>> cfd2['VN'].keys()
['been', 'expected', 'made', 'compared', 'based', 'priced', 'used', 'sold',
 'named', 'designed', 'held', 'fined', 'taken', 'paid', 'traded', 'said', ...]
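A conditional frequency distribution is one frequency distribution per condition. Using the word as the condition answers "which tags does this word take?", and reversing the pairs answers "which words carry this tag?". The sketch below models both with a defaultdict of Counter objects; it is a stdlib stand-in for nltk.ConditionalFreqDist, and the sample pairs are invented for illustration.

```python
from collections import Counter, defaultdict

# Invented (word, tag) pairs; the slides use the tagged WSJ sample.
tagged_words = [
    ("yield", "V"), ("yield", "N"), ("cut", "V"), ("cut", "VD"),
    ("cut", "N"), ("cut", "VN"), ("been", "VN"), ("cut", "V"),
]

tags_by_word = defaultdict(Counter)  # condition = word, sample = tag
words_by_tag = defaultdict(Counter)  # condition = tag, sample = word
for word, tag in tagged_words:
    tags_by_word[word][tag] += 1
    words_by_tag[tag][word] += 1

print(sorted(tags_by_word["yield"]))  # tags seen for one word
print(sorted(words_by_tag["VN"]))     # words seen with one tag
```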

>>> [w for w in cfd1.conditions() if 'VD' in cfd1[w] and 'VN' in cfd1[w]]
['Asked', 'accelerated', 'accepted', 'accused', 'acquired', 'added', 'adopted', ...]
>>> idx1 = wsj.index(('kicked', 'VD'))
>>> wsj[idx1-4:idx1+1]
[('While', 'P'), ('program', 'N'), ('trades', 'N'), ('swiftly', 'ADV'), ('kicked', 'VD')]
>>> idx2 = wsj.index(('kicked', 'VN'))
>>> wsj[idx2-4:idx2+1]
[('head', 'N'), ('of', 'P'), ('state', 'N'), ('has', 'V'), ('kicked', 'VN')]

def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())
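The findtags() function above relies on Python 2 and NLTK 2 behaviour: keys() returns a frequency-sorted list that can be sliced. A Python 3 sketch of the same idea is given below, using only the standard library. The top parameter and the sample data are assumptions added for illustration.

```python
from collections import Counter, defaultdict


def findtags(tag_prefix, tagged_text, top=5):
    """Map each tag starting with tag_prefix to its most frequent words."""
    counts = defaultdict(Counter)
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            counts[tag][word] += 1
    # most_common(top) replaces the Python 2 idiom keys()[:5].
    return {tag: [w for w, _ in counts[tag].most_common(top)] for tag in counts}


# Invented sample in Brown-style tags.
sample = [
    ("year", "NN"), ("time", "NN"), ("year", "NN"),
    ("years", "NNS"), ("said", "VBD"), ("Atlanta", "NP"),
]
tagdict = findtags("NN", sample)
for tag in sorted(tagdict):
    print(tag, tagdict[tag])
```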

Reading URLs (NLTK book 3.1)
>>> from urllib import urlopen
>>> url = "
>>> raw = urlopen(url).read()
>>> type(raw)
>>> len(raw)
>>> raw[:75]
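The session above is Python 2. In Python 3, urlopen lives in urllib.request and read() returns bytes that need decoding before tokenizing. A minimal sketch follows; the helper name fetch_text and the UTF-8 default are assumptions, and a data: URL stands in for a real address so the example runs offline.

```python
from urllib.request import urlopen  # Python 3 location of urlopen


def fetch_text(url, encoding="utf-8"):
    """Download url and decode the response bytes to str."""
    with urlopen(url) as response:
        return response.read().decode(encoding)


# Placeholder data: URL so the sketch needs no network access.
raw = fetch_text("data:text/plain,The%20Project%20Gutenberg%20EBook")
print(type(raw), len(raw))
print(raw[:11])
```

Replacing the placeholder with an http address gives the same flow as on the slide; pass a different encoding if the page is not UTF-8.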

>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
>>> len(tokens)
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']

Dealing with HTML
>>> url = "
>>> html = urlopen(url).read()
>>> html[:60]
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'
>>> raw = nltk.clean_html(html)
>>> tokens = nltk.word_tokenize(raw)
>>> tokens
['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'", 'to', 'die', 'out', ...]
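nltk.clean_html() belongs to the NLTK 2 era of these slides; NLTK 3 no longer implements it and points users to an HTML library instead. As a rough illustration of what the cleaning step does, here is a stdlib-only sketch built on html.parser. The class and function names are hypothetical, the input is a made-up fragment, and real pages usually need a more robust parser.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


html = ("<html><head><title>BBC NEWS | Health</title>"
        "<style>p {color: red}</style></head>"
        "<body><p>Blondes 'to die out'</p></body></html>")
raw = strip_html(html)
print(raw.split()[:4])
```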


Chap 2 Brown corpus
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County'...], ['The', 'jury', 'further'...], ...]

Freq Dist
>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print m + ':', fdist[m],
...
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
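The modal-verb count above uses the Python 2 print statement with a trailing comma to stay on one line. A Python 3 sketch of the same loop is shown below, with collections.Counter standing in for nltk.FreqDist and a few invented sentences in place of the Brown news text:

```python
from collections import Counter

# Invented, pre-tokenized text; the slide counts over the Brown news category.
news_text = ("He can go . She could stay . Can they ? "
             "They will , and we will too .").split()

# Lowercase before counting so 'Can' and 'can' are counted together.
fdist = Counter(w.lower() for w in news_text)

modals = ["can", "could", "may", "might", "must", "will"]
for m in modals:
    # end=" " keeps the output on one line, like the trailing comma in Python 2.
    print(m + ":", fdist[m], end=" ")
print()
```

A Counter returns 0 for words it has never seen, so modals missing from the text need no special handling.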

>>> fdist1 = FreqDist(text1)
>>> fdist1
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his', 'it',
 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this', '!', 'at',
 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so', 'whale', 'one', 'you',
 'had', 'have', 'there', 'But', 'or', 'were', 'now', 'which', '?', 'me', 'like']
>>> fdist1['whale']
906

>>> cfd = nltk.ConditionalFreqDist(
...           (genre, word)
...           for genre in brown.categories()
...           for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)

Table 2.2: Some of the Corpora and Corpus Samples Distributed with NLTK

Table 2.3: Basic Corpus Functionality

Example                      Description
fileids()                    the files of the corpus
fileids([categories])        the files of the corpus corresponding to these categories
categories()                 the categories of the corpus
categories([fileids])        the categories of the corpus corresponding to these files
raw()                        the raw content of the corpus
raw(fileids=[f1,f2,f3])      the raw content of the specified files
raw(categories=[c1,c2])      the raw content of the specified categories
words()                      the words of the whole corpus
words(fileids=[f1,f2,f3])    the words of the specified fileids
words(categories=[c1,c2])    the words of the specified categories
sents()                      the sentences of the whole corpus
sents(fileids=[f1,f2,f3])    the sentences of the specified fileids
sents(categories=[c1,c2])    the sentences of the specified categories
abspath(fileid)              the location of the given file on disk
...

def generate_model(cfdist, word, num=15):
    for i in range(num):
        print word,
        word = cfdist[word].max()

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
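generate_model() walks the bigram table greedily: it emits the current word, then moves to the word that most often followed it. The Python 3 sketch below shows the same loop using only the standard library. The helper build_bigram_cfd, the returned list (instead of printing) and the short sample text are assumptions made so the example runs without the Genesis corpus.

```python
from collections import Counter, defaultdict


def build_bigram_cfd(words):
    """Count, for each word, which words follow it and how often."""
    cfd = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        cfd[first][second] += 1
    return cfd


def generate_model(cfd, word, num=15):
    """Emit up to num words, always following the most frequent successor."""
    output = []
    for _ in range(num):
        output.append(word)
        if not cfd[word]:
            break  # this word was never followed by anything
        word = cfd[word].most_common(1)[0][0]
    return output


# Short made-up text standing in for the Genesis corpus used on the slide.
text = ("in the beginning god created the heaven and the earth "
        "and the earth was without form").split()
cfd = build_bigram_cfd(text)
print(" ".join(generate_model(cfd, "god", num=4)))
```

Because the most likely successor is always chosen, longer runs tend to fall into a loop, which is the behaviour this example is meant to expose.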

Example 2.5 (code_random_text.py)

Table 2.4

Example                                 Description
cfdist = ConditionalFreqDist(pairs)     create a conditional frequency distribution from a list of pairs
cfdist.conditions()                     alphabetically sorted list of conditions
cfdist[condition]                       the frequency distribution for this condition
cfdist[condition][sample]               frequency for the given sample for this condition
cfdist.tabulate()                       tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)    tabulation limited to the specified samples and conditions
cfdist.plot()                           graphical plot of the conditional frequency distribution
cfdist.plot(samples, conditions)        graphical plot limited to the specified samples and conditions
cfdist1 < cfdist2                       test if samples in cfdist1 occur less frequently than in cfdist2


Example 5.2 (code_findtags.py)

highly ambiguous words
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> data = nltk.ConditionalFreqDist((word.lower(), tag)
...                                 for (word, tag) in brown_news_tagged)
>>> for word in data.conditions():
...     if len(data[word]) > 3:
...         tags = data[word].keys()
...         print word, ' '.join(tags)
...
best ADJ ADV NP V
better ADJ ADV V DET
...
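The ambiguity check groups the tags seen for each lowercased word and reports words with more than three distinct tags. A Python 3 sketch with a defaultdict of sets is given below; the tagged sample is invented, and sets are used because only the number of distinct tags matters here, not their counts.

```python
from collections import defaultdict

# Invented sample; the slide uses the simplified-tag Brown news corpus.
tagged = [
    ("best", "ADJ"), ("Best", "NP"), ("best", "ADV"), ("best", "V"),
    ("better", "ADJ"), ("better", "ADV"), ("the", "DET"),
]

# Collect the distinct tags seen for each lowercased word.
tags_for = defaultdict(set)
for word, tag in tagged:
    tags_for[word.lower()].add(tag)

# A word counts as highly ambiguous when it has more than three tags.
ambiguous = sorted(w for w in tags_for if len(tags_for[w]) > 3)
for word in ambiguous:
    print(word, " ".join(sorted(tags_for[word])))
```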