
1 NLTK Tagging CS1573: AI Application Development, Spring 2003
(modified from Steven Bird’s notes)

2 Today's Outline
Administration
Final Words on Regular Expressions
Regular Expressions in NLTK
New Topic: Tagging
Motivation and Linguistic Background
NLTK Tutorial: Tagging
Part-of-Speech Tagging
The nltk.tagger Module
A Few Tagging Algorithms
Some Gory Details

3 Regular Expressions, again
The Python regular expression syntax NLTK uses
The regular expression tokenizer
A simple regular expression tagging algorithm

4 Regular Expression Tokenizers
Mimicking the WSTokenizer:
>>> tokenizer = RETokenizer(r'[^\s]+')
>>> tokenizer.tokenize(example_text)

5 RE Tokenization, continued
> regexp = r'\w+|[^\w\s]+'
> tokenizer = RETokenizer(regexp)
> tokenizer.tokenize(example_text)
Why is this version better?

6 RE Tokenization, continued
> regexp = r'\w+|[^\w\s]+'
Why is this version better?
- It includes punctuation as separate tokens
- It matches either a sequence of word characters (letters, digits, and underscore) or a sequence of punctuation characters
But it still has problems, for example … ?

7 Improved Example
> example_text = 'That poster costs $22.40.'
> regexp = r'(\w+)|(\$\d+\.\d+)|([^\w\s]+)'
> tokenizer = RETokenizer(regexp)
> tokenizer.tokenize(example_text)
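The same idea can be tried with nothing but the Python standard library. This is a minimal sketch using re.findall rather than NLTK's RETokenizer (the group-free regex and the print call are my own, not the course API):

import re

example_text = 'That poster costs $22.40.'
# money pattern listed first; groups dropped so findall returns strings
regexp = r'\$\d+\.\d+|\w+|[^\w\s]+'
print(re.findall(regexp, example_text))
# ['That', 'poster', 'costs', '$22.40', '.']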

8 Regular Expression Limitations
While regular languages can model many things, there are still limitations: a regular expression gives no diagnostic information when it rejects a string, and when the accepting condition is ambiguous it yields either a single solution or all solutions, with no way to rank them.

9 New Topic
Now we're going to start looking at tagging, and especially at approaches that depend on looking at words in context. We'll start with what looks like an artificial task: predicting the next word in a sequence. We'll then move to tagging: the process of associating auxiliary information with each token, often for use in later stages of text processing.

10 Word Prediction Example
From NY Times: Stocks plunged this…

11 Word Prediction Example
From NY Times: Stocks plunged this morning, despite a cut in interest …

12 Word Prediction Example
From NY Times: Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall …

13 Word Prediction Example
From NY Times: Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began …

14 Word Prediction Example
From NY Times: Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last …

15 Word Prediction Example
From NY Times: Stocks plunged this morning, despite a cut in interest rates by the Federal Reserve, as Wall Street began trading for the first time since last Tuesday’s terrorist attacks.
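A model that supports this kind of prediction can be sketched with plain bigram counts. The corpus and names below are illustrative stand-ins, not the actual NY Times data or any NLTK class:

from collections import Counter, defaultdict

def train_bigrams(words):
    # Map each word to a Counter of the words that follow it.
    model = defaultdict(Counter)
    for w1, w2 in zip(words, words[1:]):
        model[w1][w2] += 1
    return model

corpus = 'stocks plunged this morning stocks plunged this week'.split()
model = train_bigrams(corpus)
print(model['plunged'].most_common(1))   # [('this', 2)]

To predict the word after 'plunged', we simply take the most frequent follower seen in training.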

16 Format Change
Move to PDF slides (highlights of Jurafsky and Martin, Chapters 6 and 8)

17 Tagging: Overview/Review
Motivation: What is tagging? What does tagging do? Kinds of tagging? Significance of part of speech
Basics: features and context
Brown and Penn Treebank tagsets
Tagging in NLTK (the nltk.tagger module)
Tagging algorithms: statistical and rule-based tagging
Evaluation

18 Terminology
Tagging: the process of associating labels with each token in a text
Tags: the labels
Tag set: the collection of tags used for a particular task

19 Example
Typically a tagged text is a sequence of white-space separated base/tag tokens:
The/at Pantheon's/np interior/nn ,/, still/rb in/in its/pp original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at architectural/jj triumph/nn ./. Its/pp rotunda/nn forms/vbz a/at perfect/jj circle/nn whose/wp diameter/nn is/bez equal/jj to/in the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.
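Splitting such text back into (base, tag) pairs can be sketched with plain string methods; this is not NLTK's TaggedTokenizer itself, just an illustration of the format:

tagged = 'The/at interior/nn ,/, is/bez truly/ql majestic/jj ./.'
# split from the right, so a base containing '/' still parses correctly
pairs = [tok.rsplit('/', 1) for tok in tagged.split()]
print(pairs[:3])   # [['The', 'at'], ['interior', 'nn'], [',', ',']]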

20 What does Tagging do?
Collapses distinctions: lexical identity may be discarded, e.g., all personal pronouns tagged with PRP
Introduces distinctions: ambiguities may be removed, e.g., deal tagged with NN or VB, or deal tagged with DEAL1 or DEAL2
Helps classification and prediction

21 Kinds of Tagging
Part-of-speech (grammatical) tagging: divides words into categories based on how they can be combined to form sentences (e.g., articles can combine with nouns but not verbs)
Semantic sense tagging: sense disambiguation, homonym disambiguation
Discourse tagging: speech acts (request, inform, greet, etc.)

22 Significance of Parts of Speech
A word's POS tells us a lot about the word and its neighbors:
Limits the range of meanings (deal), pronunciation (OBject the noun vs. obJECT the verb), or both (wind)
Helps in stemming
Limits the range of following words for ASR
Helps select nouns from a document for IR
Basis for partial parsing
Basis for searching for linguistic constructions
Parsers can build trees directly on the POS tags instead of maintaining a lexicon

23 Features and Contexts
[Diagram: the surrounding words w_{n-2}, w_{n-1}, w_n, w_{n+1} and the preceding tags up to t_{n-1} form the CONTEXT; the tag t_n to be predicted is the FEATURE; t_{n+1} follows.]

24 Why there are many tag sets
The definition of a POS tag can be semantic, syntactic, or morphological
Tagsets differ both in how they define the tags and at what level of granularity
Balancing classification and prediction:
Introducing more distinctions gives better information about context, but makes the current token harder to classify
Introducing fewer distinctions gives less information about context, but less work to do in classifying the current token

25 The Brown Corpus
The first digital corpus (1961); Francis and Kucera, Brown University
Contents: 500 texts, each 2000 words long, from American books, newspapers, and magazines
Representing genres: science fiction, romance fiction, press reportage, scientific writing, popular lore

26 Penn Treebank
The first syntactically annotated corpus
1 million words from the Wall Street Journal
Part-of-speech tags and syntax trees

27 Representing Tags in NLTK
The TaggedType class:
>>> ttype1 = TaggedType('dog', 'NN')
>>> ttype1
'dog'/'NN'
>>> ttype1.base()
'dog'
>>> ttype1.tag()
'NN'
Tagged tokens:
>>> ttoken = Token(ttype1, Location(5))

28 Reading Tagged Corpora
>>> tagged_text_str = open('corpus.txt').read()
>>> tagged_text_str
'John/NN saw/VB the/AT book/NN on/IN the/AT table/NN ./END He/NN sighed/VB ./END'
>>> tokens = TaggedTokenizer().tokenize(tagged_text_str)
If the TaggedTokenizer encounters a word without a tag, it will assign it the default tag None.

29 The TaggerI Interface
> tokens = WSTokenizer().tokenize(untagged_text_str)
> my_tagger.tag(tokens)
The interface defines a single method, tag, which assigns a tag to each token in a list and returns the resulting list of tagged tokens.
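In plain Python the interface amounts to something like the following stub (the real TaggerI lives in the course's nltk.tagger module; this sketch is only for illustration):

class TaggerI:
    def tag(self, tokens):
        # Return a list of tagged tokens, one per input token.
        raise NotImplementedError('tag must be defined by a subclass')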

30 Tagging Algorithms
Default tagger: inspect the word and guess a tag
Unigram tagger: assign the tag which is most probable for the word in question, based on raw frequency; uses training data
Bigram tagger, n-gram tagger
Rule-based taggers, HMM taggers (outside the scope of this class)

31 Default Tagger
We need something to use for unseen words
E.g., guess NNP for a word with an initial capital
Do regular-expression processing of the words: a sequence of regular expression tests, each assigning the word to a suitable tag
If there are no matches, assign the most frequent tag, NN

32 Finding the most frequent tag
Using the nltk.probability module:
freq_dist = FreqDist()   # missing from the slide: a FreqDist must be created first
for ttoken in ttext:
    freq_dist.inc(ttoken.tag())
def_tag = freq_dist.max()
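The same computation with only the standard library, assuming ttext is a list of (word, tag) pairs rather than NLTK tagged tokens:

from collections import Counter

ttext = [('the', 'AT'), ('dog', 'NN'), ('barked', 'VBD'),
         ('a', 'AT'), ('cat', 'NN'), ('slept', 'VBD'), ('the', 'AT')]
tag_counts = Counter(tag for _, tag in ttext)
def_tag = tag_counts.most_common(1)[0][0]
print(def_tag)   # 'AT' in this toy sample; on real corpora typically 'NN'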

33 A Default Tagger
> tokens = WSTokenizer().tokenize(untagged_text_str)
> my_tagger.tag(tokens)
NN_CD_Tagger assigns CD to numbers and NN to everything else
Poor performance (20-30%) in isolation, but when used with other taggers it can significantly improve performance

34 Unigram Tagger
Unigram = table of frequencies
E.g., in a tagged WSJ sample, 'deal' is tagged with NN 11 times, with VB 1 time, and with VBP 1 time
About 90% accuracy
Counting events:
# training: count each (context, feature) = (word, tag) event
freq_dist = CFFreqDist()
for ttoken in ttext:
    context = ttoken.type().base()
    feature = ttoken.type().tag()
    freq_dist.inc(CFSample(context, feature))
# tagging: find the most likely tag for an untagged token
context_event = ContextEvent(token.type())
sample = freq_dist.cond_max(context_event)
tag = sample.feature()
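A plain-Python stand-in for this conditional counting, with an illustrative list of (word, tag) pairs in place of NLTK's tagged text and CFFreqDist:

from collections import Counter, defaultdict

ttext = [('the', 'AT'), ('deal', 'NN'), ('fell', 'VBD'),
         ('a', 'AT'), ('deal', 'NN'), ('to', 'TO'), ('deal', 'VB')]
cond_freq = defaultdict(Counter)   # word -> Counter of its tags
for word, tag in ttext:
    cond_freq[word][tag] += 1
print(cond_freq['deal'].most_common(1)[0][0])   # 'NN' (2 vs. 1 for VB)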

35 Unigram Tagger (continued)
Before being used, UnigramTaggers are trained using the train method, which uses a tagged corpus to determine which tags are most common for each word:
# 'train.txt' is a tagged training corpus
>>> tagged_text_str = open('train.txt').read()
>>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
>>> tagger = UnigramTagger()
>>> tagger.train(train_toks)

36 Unigram Tagger (continued)
Once a UnigramTagger has been trained, the tag method can be used to tag untagged corpora:
> tokens = WSTokenizer().tokenize(untagged_text_str)
> tagger.tag(tokens)
[...]

37 Unigram Tagger (continued)
Performance is highly dependent on the quality of the training set:
It can't be too small
It can't be too different from the texts we actually want to tag
How is this related to the homework that we just did?

38 Nth Order Tagging
Bigram table: frequencies of pairs (not necessarily adjacent or of the same category)
What is the most likely tag for w_n, given w_{n-1} and t_{n-1}?
What is the context for NLTK?
N-gram tagger: consider the n-1 previous tags
Sparse data problem: accuracy versus coverage tradeoff
Backoff
Throwing away order: put the context into a set

39 Nth-Order Tagging (continued)
In addition to considering the token's type, the context also considers the tags of the n preceding tokens
The tagger then picks the tag which is most likely for that context
Different values of n are possible:
0th order = unigram tagger
1st order = bigrams
2nd order = trigrams
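One way such a context can be built is sketched below in plain Python; the exact context representation in NLTK's NthOrderTagger may differ, so treat this as an illustration of the idea:

from collections import Counter, defaultdict

def train_ngram(tagged, n=2):
    # Context = the n preceding tags plus the current word;
    # n=2 gives a 2nd-order (trigram) tagger.
    model = defaultdict(Counter)
    tags = [t for _, t in tagged]
    for i, (word, tag) in enumerate(tagged):
        context = (tuple(tags[max(0, i - n):i]), word)
        model[context][tag] += 1
    return model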

40 Nth-Order Tagging (continued)
A tagged training corpus determines the most likely tag for each context:
> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
> tagger = NthOrderTagger(3)   # 3rd order tagger
> tagger.train(train_toks)

41 Nth-Order Tagging (continued)
Once trained, it can tag untagged corpora:
> tokens = WSTokenizer().tokenize(untagged_text_str)
> tagger.tag(tokens)
[...]

42 Combining Taggers
Use more accurate algorithms when we can, and back off to wider coverage when needed:
Try tagging the token with the 1st order tagger.
If the 1st order tagger is unable to find a tag for the token, try finding a tag with the 0th order tagger.
If the 0th order tagger is also unable to find a tag, use the NN_CD_Tagger to find a tag.

43 BackoffTagger class
>>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
# Construct the taggers
>>> tagger1 = NthOrderTagger(1)   # 1st order
>>> tagger2 = UnigramTagger()     # 0th order
>>> tagger3 = NN_CD_Tagger()
# Train the taggers
>>> tagger1.train(train_toks)
>>> tagger2.train(train_toks)

44 Backoff (continued)
# Combine the taggers (in order, by specificity)
>>> tagger = BackoffTagger([tagger1, tagger2, tagger3])
# Use the combined tagger
>>> tokens = TaggedTokenizer().tokenize(untagged_text_str)
>>> tagger.tag(tokens)
[...]
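The control flow inside such a combiner is simple enough to sketch in a few lines of plain Python; here each tagger is assumed to be a callable returning a tag or None, which is not the actual BackoffTagger API:

def backoff_tag(taggers, tagged_so_far, token):
    # Try each tagger in turn, most specific first.
    for tagger in taggers:
        tag = tagger(tagged_so_far, token)
        if tag is not None:
            return tag
    return 'NN'   # final fallback, like NN_CD_Tagger's default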

45 Rule-Based Tagger: The Linguistic Complaint
Where is the linguistic knowledge of a tagger? Just a massive table of numbers
Aren't there any linguistic insights that could emerge from the data?
We could instead use handcrafted sets of rules to tag input sentences; for example, if a word follows a determiner, tag it as a noun. (A toy rule of this kind is sketched below.)
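A toy version of one such handcrafted rule (purely illustrative, not an actual rule-based tagger's rule set):

def rule_tag(prev_tag, word):
    # If the previous tag is a determiner ('AT' in the Brown tagset),
    # tag this word as a noun.
    if prev_tag == 'AT':
        return 'NN'
    return None   # no rule fired; fall through to other rules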

46 Evaluating a Tagger
Tagged tokens: the original data
Untag the data
Tag the data with your own tagger
Compare the original and new tags: iterate over the two lists, checking for identity and counting
Accuracy = fraction correct
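The comparison step fits in a few lines, assuming gold and predicted are parallel lists of (word, tag) pairs:

def accuracy(gold, predicted):
    correct = sum(1 for g, p in zip(gold, predicted) if g[1] == p[1])
    return correct / len(gold)

print(accuracy([('dog', 'NN'), ('ran', 'VBD')],
               [('dog', 'NN'), ('ran', 'NN')]))   # 0.5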

47 A Look at Tagging Implementations
Studying the implementations is worthwhile for several reasons:
It demonstrates how to write classes implementing the interfaces defined by NLTK.
It provides a better understanding of the algorithms and data structures underlying each approach to tagging.
It gives you a chance to see some of the code used to implement NLTK. The developers have tried hard to ensure that the implementation of every class in NLTK is easy to understand.

48 A Sequential Tagger
The taggers in this tutorial are implemented as sequential taggers:
Assigns tags to one token at a time, starting with the first token of the text, and proceeding in sequential order.
Decides which tag to assign a token on the basis of that token, the tokens that precede it, and the predicted tags for the tokens that precede it.
To capture this commonality, we define a common base class, SequentialTagger (class SequentialTagger(TaggerI))
The next_tag method (written next.tag in the tutorial, a typo) returns the appropriate tag for the next token; each tagger subclass provides its own implementation

49 SequentialTagger.next_tag
Decides which tag to assign a token, given the list of tagged tokens that precedes it.
Two arguments: a list of tagged tokens preceding the token to be tagged, and the token to be tagged; it returns the appropriate tag for that token.
def next_tag(self, tagged_tokens, next_token):
    assert 0, 'next_tag not defined by SequentialTagger subclass'

50 SequentialTagger.tag
def tag(self, text):
    tagged_text = []
    # Tag each token, in sequential order.
    for token in text:
        # Get the tag for the next token.
        tag = self.next_tag(tagged_text, token)
        # Use tag to build a tagged token, and add it to tagged_text.
        tagged_token = Token(TaggedType(token.type(), tag), token.loc())
        tagged_text.append(tagged_token)
    return tagged_text

51 Example Subclass: NN_CD_Tagger
class NN_CD_Tagger(SequentialTagger):
    def __init__(self):
        pass   # empty constructor
    def next_tag(self, tagged_tokens, next_token):
        # Assign 'CD' for numbers, 'NN' for anything else.
        # (The slide's regex had an unescaped dot; it should be \.)
        if re.match(r'^[0-9]+(\.[0-9]+)?$', next_token.type()):
            return 'CD'
        else:
            return 'NN'
We just define this method; when the tag method is called, the definition given by SequentialTagger will be used.

52 Another Example: UnigramTagger
The tutorial declares class UnigramTagger(TaggerI):, but it should read class UnigramTagger(SequentialTagger): so that the tag method defined by SequentialTagger is inherited.

53 Unigram Tagger: Training
def train(self, tagged_tokens):
    for token in tagged_tokens:
        outcome = token.type().tag()
        context = token.type().base()
        self._freqdist[context].inc(outcome)

54 Unigram Tagger: Tagging
def next_tag(self, tagged_tokens, next_token):
    context = next_token.type()
    return self._freqdist[context].max()
E.g., access the context and find the most likely outcome:
>>> freqdist['bank'].max()
'NN'

55 Unigram Tagger: Initialization
The constructor for UnigramTagger simply initializes self._freqdist with a new conditional frequency distribution:
def __init__(self):
    self._freqdist = probability.ConditionalFreqDist()
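Putting the three pieces together as one runnable sketch, with plain dicts standing in for NLTK's ConditionalFreqDist and token classes (the class name and data here are illustrative):

from collections import Counter, defaultdict

class SimpleUnigramTagger:
    def __init__(self):
        self._freqdist = defaultdict(Counter)   # word -> tag counts

    def train(self, tagged_tokens):
        # tagged_tokens: a list of (word, tag) pairs
        for word, tag in tagged_tokens:
            self._freqdist[word][tag] += 1

    def next_tag(self, tagged_tokens, next_token):
        dist = self._freqdist.get(next_token)
        return dist.most_common(1)[0][0] if dist else None

tagger = SimpleUnigramTagger()
tagger.train([('the', 'AT'), ('bank', 'NN'), ('bank', 'NN'), ('bank', 'VB')])
print(tagger.next_tag([], 'bank'))   # NN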

56 For Self-Study
NthOrderTagger implementation
BackoffTagger implementation

57 For Next Time Chunk Parsing

