Part of Speech (POS) Tagging Lab. CSC 9010: Special Topics, Natural Language Processing. Matuszek & Papalaskari.


1 Part of Speech (POS) Tagging Lab
CSC 9010: Special Topics, Natural Language Processing
Paula Matuszek, Mary-Angela Papalaskari
Spring, 2005
Examples taken from Bird, Klein and Loper: NLTK Tutorial, Tagging, nltk.sourceforge.net/tutorial/tagging/index.html

2 Simple Taggers
Three simple taggers in NLTK:
–Default tagger
–Regular expression tagger
–Unigram tagger
All start with tokenized text.
>>> from nltk.tokenizer import *
>>> text_token = Token(TEXT="John saw 3 polar bears.")
>>> WhitespaceTokenizer().tokenize(text_token)
>>> print text_token
[output not preserved]
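As a rough illustration of the tokenization step, here is a minimal sketch in plain Python 3. It does not use the NLTK Token or WhitespaceTokenizer classes from the slide; the helper name whitespace_tokenize is hypothetical and only mimics the idea of splitting on whitespace.

```python
def whitespace_tokenize(text):
    """Split text on runs of whitespace; punctuation stays attached to words."""
    return text.split()


tokens = whitespace_tokenize("John saw 3 polar bears.")
print(tokens)  # ['John', 'saw', '3', 'polar', 'bears.']
```

Note that the final period stays attached to "bears." because only whitespace separates tokens.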

3 Default Tagger
Assigns the same tag to every token. We create an instance of the tagger and give it the desired tag.
>>> from nltk.tagger import *
>>> my_tagger = DefaultTagger('nn')
>>> my_tagger.tag(text_token)
>>> print text_token
[output not preserved]
We've just labeled everything as a noun. Accuracy is only 20-30% (terrible), but the tagger is useful as an adjunct to other taggers.
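The idea behind a default tagger can be sketched in a few lines of plain Python 3. This is an assumption-laden stand-in, not the NLTK DefaultTagger class: the function default_tag is hypothetical and represents tagged tokens as (word, tag) pairs.

```python
def default_tag(tokens, tag="nn"):
    """Pair every token with the same tag, regardless of the token."""
    return [(token, tag) for token in tokens]


tagged = default_tag("John saw 3 polar bears.".split())
print(tagged)
```

Every pair carries the tag nn, including the number 3, which is why the accuracy of this approach is so low on its own.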

4 Regular Expression Tagger
Takes a list of regular expressions and the tags to assign when they match. The expressions are tried in order, so the catch-all pattern goes last.
>>> NN_CD_tagger = RegexpTagger([(r'^[0-9]+(\.[0-9]+)?$', 'cd'), (r'.*', 'nn')])
>>> NN_CD_tagger.tag(text_token)
>>> print text_token
[output not preserved]
This tags cardinal numbers as CD and everything else as nouns. Still pretty poor, but may be a useful step in conjunction with other taggers.
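The same first-match-wins behaviour can be sketched with the standard re module. This is only an illustration of the technique, not the NLTK RegexpTagger implementation; regexp_tag and PATTERNS are hypothetical names. The decimal point is written as `\.` so that it matches only a literal period.

```python
import re

# Ordered (pattern, tag) pairs: the first pattern that matches decides the tag.
PATTERNS = [
    (re.compile(r"^[0-9]+(\.[0-9]+)?$"), "cd"),  # cardinal numbers such as 3 or 3.14
    (re.compile(r".*"), "nn"),                   # catch-all: everything else is a noun
]


def regexp_tag(tokens, patterns=PATTERNS):
    """Tag each token with the tag of the first pattern that matches it."""
    tagged = []
    for token in tokens:
        for pattern, tag in patterns:
            if pattern.match(token):
                tagged.append((token, tag))
                break
    return tagged


tagged = regexp_tag(["John", "saw", "3", "polar", "bears."])
print(tagged)
```

Only the token 3 receives cd here; the remaining tokens fall through to the catch-all pattern and receive nn.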

5 Unigram Tagger
Assigns each word its most frequent tag.
Must be trained to determine frequency.
Assigns None as the tag of any word not seen in the training set.
About 90% accurate.
Example training case (from the Brown corpus):
The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
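To make the word/tag training format concrete, here is a small sketch that splits such a line into (word, tag) pairs. It is plain Python 3 rather than the NLTK corpus reader; parse_tagged is a hypothetical helper, and the sample string is a shortened piece of the Brown sentence shown on the slide.

```python
def parse_tagged(text):
    """Turn 'word/tag word/tag ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in text.split():
        # Split on the LAST slash so a word containing '/' keeps its slashes.
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


sample = "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd ./."
pairs = parse_tagged(sample)
print(pairs[:2])  # [('The', 'at'), ('Fulton', 'np-tl')]
```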

6 Train the Unigram Tagger
>>> from nltk.tagger import *
>>> from nltk.corpus import brown
>>> # Tokenize ten texts from the Brown Corpus
>>> train_tokens = [ ]
>>> for item in brown.items()[:10]:
...     train_tokens.append(brown.read(item))
>>> # Initialise and train a unigram tagger
>>> mytagger = UnigramTagger(SUBTOKENS='WORDS')
>>> for tok in train_tokens: mytagger.train(tok)

7 And Then Tag New Text
>>> text_token = Token(TEXT="John saw the book on the table")
>>> WhitespaceTokenizer(SUBTOKENS='WORDS').tokenize(text_token)
>>> mytagger.tag(text_token)
>>> print text_token
[output not preserved]
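The train-then-tag cycle of a unigram tagger can be sketched without NLTK as follows. This is a minimal illustration under stated assumptions: train_unigram and unigram_tag are hypothetical names, and the three training sentences are invented toy data, not Brown corpus text.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def unigram_tag(model, tokens):
    """Look up each token; words never seen in training get the tag None."""
    return [(token, model.get(token)) for token in tokens]


# Invented toy training data (hypothetical, for illustration only).
training = [
    [("the", "at"), ("book", "nn"), ("fell", "vbd")],
    [("John", "np"), ("saw", "vbd"), ("the", "at"), ("saw", "nn")],
    [("they", "ppss"), ("saw", "vbd"), ("John", "np")],
]

model = train_unigram(training)
tagged = unigram_tag(model, "John saw the book on the table".split())
print(tagged)
```

In this toy data "saw" occurs twice as vbd and once as nn, so it is tagged vbd; "on" and "table" never appear in training and are tagged None.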

8 Testing a Tagger
So how well does the tagger do? Split up the inputs into training and testing sets.
>>> train_tokens = [ ]
>>> for item in brown.items()[:10]: # texts 0-9
...     train_tokens.append(brown.read(item))
>>> unseen_tokens = [ ]
>>> for item in brown.items()[10:12]: # texts 10-11
...     unseen_tokens.append(brown.read(item))

9 Train And Test
>>> for tok in train_tokens: mytagger.train(tok)
>>> acc = tagger_accuracy(mytagger, unseen_tokens)
>>> print 'Accuracy = %4.1f%%' % (100 * acc)
Accuracy = 64.6%
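The split-train-score procedure can be sketched in plain Python 3. This is not the NLTK tagger_accuracy function; accuracy and train_majority_tagger are hypothetical names, the corpus is invented toy data, and a simple majority-tag baseline stands in for the unigram tagger to keep the example short.

```python
from collections import Counter


def train_majority_tagger(train_sentences):
    """Learn the single most common tag and return a tagger that always uses it."""
    tag_counts = Counter(tag for sentence in train_sentences for _, tag in sentence)
    majority = tag_counts.most_common(1)[0][0]
    return lambda words: [(word, majority) for word in words]


def accuracy(tagger, gold_sentences):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sentence in gold_sentences:
        words = [word for word, _ in sentence]
        predicted = tagger(words)
        for (_, gold_tag), (_, predicted_tag) in zip(sentence, predicted):
            total += 1
            correct += gold_tag == predicted_tag
    return correct / total if total else 0.0


# Invented toy corpus (hypothetical data, for illustration only).
corpus = [
    [("the", "at"), ("dog", "nn"), ("barked", "vbd")],
    [("John", "np"), ("saw", "vbd"), ("the", "at"), ("book", "nn")],
    [("time", "nn"), ("flies", "vbz")],
    [("the", "at"), ("jury", "nn"), ("met", "vbd")],
]
train_sentences, unseen_sentences = corpus[:3], corpus[3:]

tagger = train_majority_tagger(train_sentences)
acc = accuracy(tagger, unseen_sentences)
print("Accuracy = %4.1f%%" % (100 * acc))
```

Scoring only on sentences held out from training is the point of the split: a tagger scored on its own training data would look better than it really is.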

10 More in NLTK
Error analysis
Higher order taggers
–Bigram
–Nth-order
Combining taggers
Brill tagger
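One common way to combine taggers is a backoff chain: try the most specific tagger first and fall through to the next whenever it returns no tag. The sketch below is plain Python 3 and makes no claim about how NLTK combines taggers; backoff_tag, lexicon_tagger, number_tagger and default_tagger are hypothetical names, and the lexicon is a hand-written stand-in for a trained unigram model.

```python
import re


def lexicon_tagger(lexicon):
    """Build a tagger that looks words up in a dict and returns None when unknown."""
    return lambda word: lexicon.get(word)


def number_tagger(word):
    """Tag cardinal numbers as cd; return None for anything else."""
    return "cd" if re.fullmatch(r"[0-9]+(\.[0-9]+)?", word) else None


def default_tagger(word):
    """Last resort: call everything a noun."""
    return "nn"


def backoff_tag(words, taggers):
    """Give each word the first non-None tag produced by the taggers, in order."""
    tagged = []
    for word in words:
        tag = None
        for tagger in taggers:
            tag = tagger(word)
            if tag is not None:
                break
        tagged.append((word, tag))
    return tagged


# Hand-written stand-in for a trained unigram model (hypothetical).
lexicon = {"John": "np", "saw": "vbd", "the": "at"}
chain = [lexicon_tagger(lexicon), number_tagger, default_tagger]
tagged = backoff_tag(["John", "saw", "3", "polar", "bears"], chain)
print(tagged)
```

Here the lexicon handles "John" and "saw", the number pattern catches "3", and the default tagger covers the two unknown words, so no token is left with the tag None.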

11 For Lab/Homework
Complete the tagger tutorial from the NLTK tutorial page.
Tutorial exercises 1, 3, 4, 5 and 10.
8.2 (we will compare next time)
8.9 (using the NLTK and any higher-order tagger)

