NLTK Tagging CS1573: AI Application Development, Spring 2003 (modified from Steven Bird’s notes)

Today’s Outline
- Administration
- Final words on regular expressions: regular expressions in NLTK
- New topic: tagging, with motivation and linguistic background
- NLTK tutorial on tagging: part-of-speech tagging, the nltk.tagger module, a few tagging algorithms, some gory details

Regular Expressions, again
- Python regular expression syntax
- NLTK uses: the regular expression tokenizer, a simple regular expression tagging algorithm

Regular Expression Tokenizers
Mimicking the WSTokenizer:
>>> tokenizer = RETokenizer(r'[^\s]+')
>>> tokenizer.tokenize(example_text)
['Hello.'@[0w], "Isn't"@[1w], 'this'@[2w], 'fun?'@[3w]]

RE Tokenization, continued
>>> regexp = r'\w+|[^\w\s]+'
>>> tokenizer = RETokenizer(regexp)
>>> tokenizer.tokenize(example_text)
['Hello'@[0w], '.'@[1w], 'Isn'@[2w], "'"@[3w], 't'@[4w], 'this'@[5w], 'fun'@[6w], '?'@[7w]]
Why is this version better?

RE Tokenization, continued
>>> regexp = r'\w+|[^\w\s]+'
Why is this version better?
- includes punctuation as separate tokens
- matches either a sequence of word characters (letters, digits, underscore) or a sequence of punctuation characters
But it still has problems, for example … ?

Improved Example
>>> example_text = 'That poster costs $22.40.'
>>> regexp = r'(\w+)|(\$\d+\.\d+)|([^\w\s]+)'
>>> tokenizer = RETokenizer(regexp)
>>> tokenizer.tokenize(example_text)
['That'@[0w], 'poster'@[1w], 'costs'@[2w], '$22.40'@[3w], '.'@[4w]]
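
For comparison, the same tokenization can be reproduced today with Python's standard re module alone; a minimal sketch (the tokenize helper below is our own, not an NLTK API, and the @[0w]-style location annotations are dropped):

import re

# Alternation is tried left to right at each position, so the currency
# pattern must be tried before the bare-punctuation pattern, or '$22.40'
# would be split apart at the '$'.
TOKEN_RE = re.compile(r'\w+|\$\d+\.\d+|[^\w\s]+')

def tokenize(text):
    """Return all non-overlapping matches, scanning left to right."""
    return TOKEN_RE.findall(text)

print(tokenize('That poster costs $22.40.'))
# ['That', 'poster', 'costs', '$22.40', '.']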

Regular Expression Limitations
While regular languages can model many things, there are still limitations: a regular expression offers no diagnostic feedback when it rejects an input, and when the accept condition is ambiguous it yields either all matches or a single arbitrary one.

New Topic
Now we’re going to start looking at tagging, especially approaches that depend on looking at words in context. We’ll start with what looks like an artificial task: predicting the next word in a sequence. We’ll then move to tagging proper: the process of associating auxiliary information with each token, often for use in later stages of text processing.

Word Prediction Example From NY Times: Stocks plunged this…

Word Prediction Example From NY Times: Stocks plunged this morning, despite a cut in interest …

Word Prediction Example From NY Times: Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall …

Word Prediction Example From NY Times: Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began …

Word Prediction Example From NY Times: Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last …

Word Prediction Example From NY Times: Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last Tuesday’s terrorist attacks.
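
The intuition behind these examples can be made concrete in a few lines of Python: count which words follow which, then predict the most frequent continuation. A toy sketch (our own code, not NLTK; real models use far larger n-gram tables plus smoothing):

from collections import Counter, defaultdict

def train_bigrams(words):
    # Map each word to a Counter of the words observed to follow it.
    following = defaultdict(Counter)
    for w1, w2 in zip(words, words[1:]):
        following[w1][w2] += 1
    return following

text = 'stocks plunged this morning as trading began this morning'.split()
model = train_bigrams(text)
print(model['this'].most_common(1))   # [('morning', 2)]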

Format Change Move to pdf slides (highlights of Jurafsky and Martin Chapters 6 and 8)

Tagging: Overview/Review
- Motivation: What is tagging? What does tagging do? Kinds of tagging? Significance of part of speech
- Basics: features and context; Brown and Penn Treebank tagsets
- Tagging in NLTK (nltk.tagger module)
- Tagging algorithms: statistical and rule-based tagging
- Evaluation

Terminology
- Tagging: the process of associating labels with each token in a text
- Tags: the labels
- Tag set: the collection of tags used for a particular task

Example
Typically a tagged text is a sequence of whitespace-separated base/tag tokens:
The/at Pantheon’s/np interior/nn ,/, still/rb in/in its/pp original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at architectural/jj triumph/nn ./. Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.
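
Parsing this base/tag format needs nothing more than string splitting; a small sketch (our own helper, splitting on the last '/' so a base containing a slash survives intact):

def parse_tagged(text):
    # Whitespace-separated base/tag tokens -> list of (base, tag) pairs.
    return [tuple(tok.rsplit('/', 1)) for tok in text.split()]

print(parse_tagged('The/at interior/nn ,/, is/bez truly/ql majestic/jj ./.'))
# [('The', 'at'), ('interior', 'nn'), (',', ','), ('is', 'bez'),
#  ('truly', 'ql'), ('majestic', 'jj'), ('.', '.')]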

What does Tagging do?
- Collapses distinctions: lexical identity may be discarded, e.g. all personal pronouns tagged with PRP
- Introduces distinctions: ambiguities may be removed, e.g. deal tagged with NN or VB, or deal tagged with DEAL1 or DEAL2
- Helps classification and prediction

Kinds of Tagging
- Part-of-speech (grammatical) tagging: divides words into categories based on how they can be combined to form sentences (e.g., articles can combine with nouns but not verbs)
- Semantic sense tagging: sense disambiguation, homonym disambiguation
- Discourse tagging: speech acts (request, inform, greet, etc.)

Significance of Parts of Speech
A word’s POS tells us a lot about the word and its neighbors:
- Limits the range of meanings (deal), pronunciation (OBject the noun vs. obJECT the verb), or both (wind)
- Helps in stemming
- Limits the range of following words for ASR
- Helps select nouns from a document for IR
- Basis for partial parsing
- Basis for searching for linguistic constructions
- Parsers can build trees directly on the POS tags instead of maintaining a lexicon

Features and Contexts
[diagram: the words w_n-2, w_n-1, w_n, w_n+1 with the tags t_n-1, t_n, t_n+1 beneath them; the surrounding words and neighboring tags form the CONTEXT, and the tag t_n being predicted is the FEATURE]

Why there are many tag sets
- Definition of a POS tag: semantic, syntactic, morphological; tagsets differ both in how they define the tags and at what level of granularity
- Balancing classification and prediction:
- Introducing more distinctions: better information about context, but harder to classify the current token
- Introducing fewer distinctions: less information about context, but less work to classify the current token

The Brown Corpus
- The first digital corpus (1961), Francis and Kucera, Brown University
- Contents: 500 texts, each 2000 words long, from American books, newspapers, and magazines
- Representing genres: science fiction, romance fiction, press reportage, scientific writing, popular lore

Penn Treebank
- The first syntactically annotated corpus
- 1 million words from the Wall Street Journal
- Part-of-speech tags and syntax trees

Representing Tags in NLTK
The TaggedType class:
>>> ttype1 = TaggedType('dog', 'NN')
'dog'/'NN'
>>> ttype1.base()
'dog'
>>> ttype1.tag()
'NN'
Tagged tokens:
>>> ttoken = Token(ttype1, Location(5))
'dog'/'NN'@[5]
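
For readers on a current NLTK release: the TaggedType/Token/Location classes are long gone, and a tagged token is now just a (base, tag) tuple. A sketch assuming NLTK 3.x is installed:

import nltk

# str2tuple splits a 'base/tag' string into a plain tuple.
ttype = nltk.tag.str2tuple('dog/NN')
print(ttype)      # ('dog', 'NN')
print(ttype[0])   # dog   (the base)
print(ttype[1])   # NN    (the tag)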

Reading Tagged Corpora
>>> tagged_text_str = open('corpus.txt').read()
'John/NN saw/VB the/AT book/NN on/IN the/AT table/NN ./END He/NN sighed/VB ./END'
>>> tokens = TaggedTokenizer().tokenize(tagged_text_str)
['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], 'table'/'NN'@[6w], '.'/'END'@[7w], 'He'/'NN'@[8w], 'sighed'/'VB'@[9w], '.'/'END'@[10w]]
If TaggedTokenizer encounters a word without a tag, it will assign it the default tag None.

The TaggerI Interface
>>> tokens = WSTokenizer().tokenize(untagged_text_str)
['John'@[0w], 'saw'@[1w], 'the'@[2w], 'book'@[3w], 'on'@[4w], 'the'@[5w], 'table'@[6w], '.'@[7w], 'He'@[8w], 'sighed'@[9w], '.'@[10w]]
>>> my_tagger.tag(tokens)
['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], 'table'/'NN'@[6w], '.'/'END'@[7w], 'He'/'NN'@[8w], 'sighed'/'VB'@[9w], '.'/'END'@[10w]]
The interface defines a single method, tag, which assigns a tag to each token in a list and returns the resulting list of tagged tokens.

Tagging Algorithms
- Default tagger: inspect the word and guess a tag
- Unigram tagger: assign the tag which is most probable for the word in question, based on raw frequency; uses training data
- Bigram tagger, n-gram tagger
- Rule-based taggers, HMM taggers (outside the scope of this class)

Default Tagger
- We need something to use for unseen words, e.g., guess NNP for a word with an initial capital
- Do regular-expression processing of the words: a sequence of regular expression tests, then assignment of the word to a suitable tag
- If there are no matches, assign the most frequent tag, NN

Finding the most frequent tag
Using the nltk.probability module:
freq_dist = FreqDist()
for ttoken in ttext:
    freq_dist.inc(ttoken.type().tag())
def_tag = freq_dist.max()
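
With the modern standard library the same computation is a couple of lines; a sketch assuming tagged_text is a list of (word, tag) pairs rather than the old Token objects:

from collections import Counter

tagged_text = [('the', 'AT'), ('dog', 'NN'), ('barked', 'VBD'), ('cat', 'NN')]
tag_counts = Counter(tag for word, tag in tagged_text)
def_tag = tag_counts.most_common(1)[0][0]
print(def_tag)   # NN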

A Default Tagger
>>> tokens = WSTokenizer().tokenize(untagged_text_str)
['John'@[0w], 'saw'@[1w], '3'@[2w], 'polar'@[3w], 'bears'@[4w], '.'@[5w]]
>>> my_tagger.tag(tokens)
['John'/'NN'@[0w], 'saw'/'NN'@[1w], '3'/'CD'@[2w], 'polar'/'NN'@[3w], 'bears'/'NN'@[4w], '.'/'NN'@[5w]]
NN_CD_Tagger assigns CD to numbers and NN to everything else. It has poor performance (20-30% accuracy) in isolation, but when used with other taggers it can significantly improve performance.
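
Current NLTK versions no longer ship an NN_CD_Tagger, but the same behavior can be assembled from nltk.RegexpTagger; a sketch assuming NLTK 3.x is installed:

import nltk

# Patterns are tried in order; the catch-all '.*' plays the role of
# the NN default for anything the number pattern misses.
tagger = nltk.RegexpTagger([
    (r'^[0-9]+(\.[0-9]+)?$', 'CD'),   # numbers -> CD
    (r'.*', 'NN'),                    # everything else -> NN
])
print(tagger.tag(['John', 'saw', '3', 'polar', 'bears', '.']))
# [('John', 'NN'), ('saw', 'NN'), ('3', 'CD'), ('polar', 'NN'),
#  ('bears', 'NN'), ('.', 'NN')]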

Unigram Tagger
- Unigram = a table of frequencies, e.g. in a tagged WSJ sample, “deal” is tagged with NN 11 times, with VB 1 time, and with VBP 1 time
- About 90% accuracy
Counting events:
freq_dist = CFFreqDist()
for ttoken in ttext:
    context = ttoken.type().base()
    feature = ttoken.type().tag()
    freq_dist.inc(CFSample(context, feature))
Finding the most likely tag for a context:
context_event = ContextEvent(token.type())
sample = freq_dist.cond_max(context_event)
tag = sample.feature()
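
The CFFreqDist/CFSample machinery above belongs to the 2003 API; later NLTK releases keep the same counts in a ConditionalFreqDist, conditioned on the word. A sketch assuming NLTK 3.x:

import nltk

tagged = [('deal', 'NN'), ('deal', 'NN'), ('deal', 'VB'), ('bank', 'NN')]
# The constructor takes (condition, sample) pairs: here the condition is
# the word (context) and the sample is its tag (feature).
cfd = nltk.ConditionalFreqDist(tagged)
print(cfd['deal'].max())   # NN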

Unigram Tagger (continued)
Before being used, UnigramTaggers are trained using the train method, which uses a tagged corpus to determine which tags are most common for each word:
# 'train.txt' is a tagged training corpus
>>> tagged_text_str = open('train.txt').read()
>>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
>>> tagger = UnigramTagger()
>>> tagger.train(train_toks)

Unigram Tagger (continued)
Once a UnigramTagger has been trained, the tag method can be used to tag untagged corpora:
>>> tokens = WSTokenizer().tokenize(untagged_text_str)
>>> tagger.tag(tokens)
['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], ...]

Unigram Tagger (continued)
Performance is highly dependent on the quality of the training set:
- It can’t be too small
- It can’t be too different from the texts we actually want to tag
How is this related to the homework that we just did?
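
The train-then-tag workflow survives in current NLTK almost unchanged, except that training happens in the constructor. A sketch assuming NLTK 3.x and that the Brown corpus has been fetched with nltk.download('brown'):

import nltk
from nltk.corpus import brown

train_sents = brown.tagged_sents(categories='news')   # lists of (word, tag) pairs
tagger = nltk.UnigramTagger(train_sents)              # trained on construction
print(tagger.tag('John saw the book on the table .'.split()))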

Nth-Order Tagging
- Bigram table: frequencies of pairs (not necessarily adjacent or of the same category)
- What is the most likely tag for w_n, given w_n-1 and t_n-1? What is the context for NLTK?
- N-gram tagger: consider the n-1 previous tags
- Sparse data problem: accuracy versus coverage tradeoff; backoff
- Throwing away order: put the context into a set

Nth-Order Tagging (continued)
In addition to considering the token’s type, the context also considers the tags of the n preceding tokens. The tagger then picks the tag which is most likely for that context. Different values of n are possible:
- 0th order = unigram tagger
- 1st order = bigrams
- 2nd order = trigrams

Nth-Order Tagging (continued)
A tagged training corpus determines the most likely tag for each context:
>>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
>>> tagger = NthOrderTagger(3)  # 3rd-order tagger
>>> tagger.train(train_toks)

Nth-Order Tagging (continued)
Once trained, it can tag untagged corpora:
>>> tokens = WSTokenizer().tokenize(untagged_text_str)
>>> tagger.tag(tokens)
['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], ...]

Combining Taggers
Use more accurate algorithms when we can, and back off to wider-coverage algorithms when needed:
- Try tagging the token with the 1st-order tagger.
- If the 1st-order tagger is unable to find a tag for the token, try finding a tag with the 0th-order tagger.
- If the 0th-order tagger is also unable to find a tag, use the NN_CD_Tagger to find a tag.

BackoffTagger class
>>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
# Construct the taggers
>>> tagger1 = NthOrderTagger(1)  # 1st order
>>> tagger2 = UnigramTagger()    # 0th order
>>> tagger3 = NN_CD_Tagger()
# Train the taggers
>>> tagger1.train(train_toks)
>>> tagger2.train(train_toks)

Backoff (continued)
# Combine the taggers (in order, by specificity)
>>> tagger = BackoffTagger([tagger1, tagger2, tagger3])
# Use the combined tagger
>>> tokens = WSTokenizer().tokenize(untagged_text_str)
>>> tagger.tag(tokens)
['John'/'NN'@[0w], 'saw'/'VB'@[1w], 'the'/'AT'@[2w], 'book'/'NN'@[3w], 'on'/'IN'@[4w], 'the'/'AT'@[5w], ...]
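
The same backoff chain is expressed in current NLTK by passing each tagger as the backoff argument of the next one up; a sketch assuming NLTK 3.x and the Brown corpus as above:

import nltk
from nltk.corpus import brown

train_sents = brown.tagged_sents(categories='news')
t0 = nltk.DefaultTagger('NN')                      # last resort
t1 = nltk.UnigramTagger(train_sents, backoff=t0)   # 0th order
t2 = nltk.BigramTagger(train_sents, backoff=t1)    # 1st order
print(t2.tag('John saw the book on the table .'.split()))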

Rule-Based Tagger
The linguistic complaint:
- Where is the linguistic knowledge of a tagger? It is just a massive table of numbers.
- Aren’t there any linguistic insights that could emerge from the data?
- We could instead use handcrafted sets of rules to tag input sentences; for example, if the input follows a determiner, tag it as a noun.

Evaluating a Tagger
- Start from tagged tokens, the original data
- Untag the data
- Tag the data with your own tagger
- Compare the original and new tags: iterate over the two lists, checking for identity and counting
- Accuracy = fraction correct
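
The whole procedure fits in a few lines; a sketch assuming the gold data is a list of (word, tag) pairs and tagger has the tag interface described above (recent NLTK taggers also bundle an equivalent evaluation method):

def accuracy(tagger, gold):
    # gold: list of (word, tag) pairs; returns the fraction tagged correctly.
    words = [word for word, tag in gold]   # untag the data
    predicted = tagger.tag(words)          # retag it with our tagger
    hits = sum(1 for (w, g), (w2, p) in zip(gold, predicted) if g == p)
    return hits / len(gold)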

A Look at Tagging Implementations
- It demonstrates how to write classes implementing the interfaces defined by NLTK.
- It provides you with a better understanding of the algorithms and data structures underlying each approach to tagging.
- It gives you a chance to see some of the code used to implement NLTK. The developers have tried hard to ensure that the implementation of every class in NLTK is easy to understand.

A Sequential Tagger
The taggers in this tutorial are implemented as sequential taggers:
- A sequential tagger assigns tags to one token at a time, starting with the first token of the text and proceeding in sequential order.
- It decides which tag to assign a token on the basis of that token, the tokens that precede it, and the predicted tags for the tokens that precede it.
To capture this commonality, we define a common base class, SequentialTagger (class SequentialTagger(TaggerI)). Its next_tag method (written "next.tag" in the tutorial, a typo) returns the appropriate tag for the next token; each tagger subclass provides its own implementation.

SequentialTagger.next_tag
Decides which tag to assign a token, given the list of tagged tokens that precedes it. It takes two arguments, the list of tagged tokens preceding the token to be tagged and the token to be tagged itself, and it returns the appropriate tag for that token.
def next_tag(self, tagged_tokens, next_token):
    assert 0, "next_tag not defined by SequentialTagger subclass"

SequentialTagger.tag
def tag(self, text):
    tagged_text = []
    # Tag each token, in sequential order.
    for token in text:
        # Get the tag for the next token.
        tag = self.next_tag(tagged_text, token)
        # Use the tag to build a tagged token; add it to tagged_text.
        tagged_token = Token(TaggedType(token.type(), tag), token.loc())
        tagged_text.append(tagged_token)
    return tagged_text

Example Subclass: NN_CD_Tagger
class NN_CD_Tagger(SequentialTagger):
    def __init__(self):
        pass  # empty constructor
    def next_tag(self, tagged_tokens, next_token):
        # Assign 'CD' to numbers, 'NN' to anything else.
        if re.match(r'^[0-9]+(\.[0-9]+)?$', next_token.type()):
            return 'CD'
        else:
            return 'NN'
We just define this method; when the tag method is called, the definition given by SequentialTagger is used.

Another Example: UnigramTagger
class UnigramTagger(SequentialTagger):
(The tutorial declares class UnigramTagger(TaggerI); it should subclass SequentialTagger instead.)

Unigram Tagger: Training
def train(self, tagged_tokens):
    for token in tagged_tokens:
        outcome = token.type().tag()
        context = token.type().base()
        self._freqdist[context].inc(outcome)

Unigram Tagger: Tagging
def next_tag(self, tagged_tokens, next_token):
    context = next_token.type()
    return self._freqdist[context].max()
E.g., look up the context and find its most likely outcome:
>>> freqdist['bank'].max()
'NN'

Unigram Tagger: Initialization
The constructor for UnigramTagger simply initializes self._freqdist with a new conditional frequency distribution:
def __init__(self):
    self._freqdist = probability.ConditionalFreqDist()

For Self-Study
- NthOrderTagger implementation
- BackoffTagger implementation

For Next Time Chunk Parsing