1
CSCE 771 Natural Language Processing
CSCE 771 Natural Language Processing
Lecture 3: N-grams
Topics: Python NLTK, N-grams, Smoothing
Readings: Chapter 4, Jurafsky and Martin
January 23, 2013
2
Last Time (slides from Lecture 1, 30-): regular expressions in Python (grep, vi, emacs, Word), Eliza, morphology
Today: N-gram models for prediction
3
Eliza.py https://github.com/nltk/nltk/blob/master/nltk/chat/eliza.py
List of (regular expression, responses) pairs: if the regular expression matches the input, respond with one of the paired responses.
pairs = (
  (r'I need (.*)',
    ("Why do you need %1?",
     "Would it really help you to get %1?",
     "Are you sure you need %1?")),
  (r'Why don\'t you (.*)',
    ("Do you really think I don't %1?",
     "Perhaps eventually I will %1.",
     "Do you really want me to %1?")),
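These pairs drive NLTK's generic Chat class, the same machinery eliza.py uses. A minimal sketch of a round trip, with a made-up input; %1 is filled from the regex group after passing through the reflections table (e.g. "my" becomes "your"):

>>> from nltk.chat.util import Chat, reflections
>>> chatbot = Chat(pairs, reflections)
>>> chatbot.respond("I need a vacation")
'Why do you need a vacation?'  # one of the three responses, chosen at random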
4
Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit
Steven Bird, Ewan Klein, and Edward Loper
Preface (extras)
1. Language Processing and Python (extras)
2. Accessing Text Corpora and Lexical Resources (extras)
3. Processing Raw Text
4. Writing Structured Programs (extras)
5. Categorizing and Tagging Words
6. Learning to Classify Text (extras)
7. Extracting Information from Text
8. Analyzing Sentence Structure (extras)
9. Building Feature Based Grammars
10. Analyzing the Meaning of Sentences (extras)
11. Managing Linguistic Data
12. Afterword: Facing the Language Challenge
nltk.org/book
5
Language Processing and Python
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
…
nltk.org/book
6
Simple text processing with NLTK
>>> text1.concordance("monstrous")
>>> text1.similar("monstrous")
>>> text2.common_contexts(["monstrous", "very"])
>>> text4.dispersion_plot(["citizens", "democracy", "freedom", "duties", "America"])
>>> text3.generate()
>>> text5[16715:16735]
nltk.org/book
7
Counting Vocabulary
>>> len(text3)
>>> sorted(set(text3))
>>> from __future__ import division
>>> len(text3) / len(set(text3))
>>> text3.count("smote")
nltk.org/book
8
lexical_diversity
>>> def lexical_diversity(text):
...     return len(text) / len(set(text))
...
>>> def percentage(count, total):
...     return 100 * count / total
nltk.org/book
9
1.3 Computing with Language: Simple Statistics
Frequency Distributions
>>> fdist1 = FreqDist(text1)
>>> fdist1
<FreqDist with outcomes>
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:50]
>>> fdist1['whale']
>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)
nltk.org/book
10
List constructors in Python
>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
>>> sorted(long_words)
>>> fdist5 = FreqDist(text5)
>>> sorted([w for w in set(text5) if len(w) > 7 and fdist5[w] > 7])
nltk.org/book
11
Collocations and Bigrams
>>> bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>> text4.collocations()
Building collocations list
United States; fellow citizens; years ago; Federal Government; General Government; American people; Vice President; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice; God bless; Indian tribes; public debt; foreign nations; political parties; State governments; National Government; United Nations; public money
nltk.org/book
12
Table 1.2: Frequency distribution methods (Example : Description)
fdist = FreqDist(samples) : create a frequency distribution containing the given samples
fdist.inc(sample) : increment the count for this sample
fdist['monstrous'] : count of the number of times a given sample occurred
fdist.freq('monstrous') : frequency of a given sample
fdist.N() : total number of samples
fdist.keys() : the samples sorted in order of decreasing frequency
for sample in fdist : iterate over the samples, in order of decreasing frequency
fdist.max() : sample with the greatest count
fdist.tabulate() : tabulate the frequency distribution
fdist.plot() : graphical plot of the frequency distribution
fdist.plot(cumulative=True) : cumulative plot of the frequency distribution
fdist1 < fdist2 : test if samples in fdist1 occur less frequently than in fdist2
nltk.org/book
13
SLP (Jurafsky and Martin) for the rest of the day
Quotes from Chapter 4
"But it must be recognized that the notion 'probability of a sentence' is an entirely useless one, under any known interpretation of this term." Noam Chomsky, 1969
"Anytime a linguist leaves the group the recognition rate goes up." Fred Jelinek (then of the IBM speech group)
14
Predicting Words
"Please turn your homework …" What is the next word?
Language models: N-gram models
15
Word/Character prediction Uses
Spelling correction at the character level
Spelling correction at a higher level, when the corrector corrects to the wrong (real) word
Augmentative communication: a person with a disability chooses words from a menu predicted by the system
16
Real-Word Spelling Errors
Mental confusions: their/they're/there, to/too/two, weather/whether, peace/piece, you're/your
Typos that result in real words
17
Spelling Errors that are Words
Typos can produce real words, so detecting them must rely on context: both the left context and the right context of the word.
18
Real Word Spelling Errors
Collect a set of common pairs of confusions.
Whenever a member of this set is encountered, compute the probability of the sentence in which it appears.
Substitute the other possibilities and compute the probability of each resulting sentence.
Choose the sentence with the higher probability (sketched below).
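A minimal sketch of this procedure in Python; the confusion sets and the sentence_logprob scoring function are illustrative stand-ins for a real n-gram model, like the one built later in these slides:

# Hypothetical confusion sets; a real system would use a larger list.
CONFUSION_SETS = [{"their", "there", "they're"}, {"to", "too", "two"}]

def correct_real_word_errors(words, sentence_logprob):
    # Try each confusable alternative and keep whichever variant
    # gives the whole sentence the highest (log) probability.
    best = list(words)
    for i, w in enumerate(words):
        for conf in CONFUSION_SETS:
            if w in conf:
                for alt in conf:
                    candidate = best[:i] + [alt] + best[i+1:]
                    if sentence_logprob(candidate) > sentence_logprob(best):
                        best = candidate
    return best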
19
Word Counting
Probability based on counting.
"He stepped out into the hall, was delighted to encounter a water brother." (from the Brown corpus)
What counts as a word? What about bigrams? Frequencies of words, but of which words?
Which corpora? The Web (everything on it), Shakespeare, the Bible/Koran, spoken transcripts (Switchboard)
Problems with spoken speech: fillers such as "uh" and "um"
20
6.2 Bigrams from Berkeley Restaurant Proj.
Berkeley Restaurant Project: a speech-based restaurant consultant.
Handling requests such as:
I'm looking for Cantonese food.
I'm looking for a good place to eat breakfast.
21
Chain Rule
Recall the definition of conditional probability:
$P(A \mid B) = \frac{P(A, B)}{P(B)}$
Rewriting:
$P(A, B) = P(A \mid B)\, P(B)$
Or, chaining over a sequence:
$P(w_1, w_2, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 \cdots w_{n-1})$
22
Example: "The big red dog"
P(the) P(big | the) P(red | the big) P(dog | the big red)
Better: condition the first word on the beginning of the sentence, P(the | <beginning of sentence>), written P(the | <s>).
23
General Case
Write the word sequence from position 1 to n as $w_1^n = w_1 w_2 \ldots w_n$.
So the probability of a sequence is
$P(w_1^n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1})$
24
Unfortunately
That doesn't help, since it's unlikely we'll ever gather the right statistics for the long prefixes.
25
Markov Assumption
Assume that the entire prefix history isn't necessary. In other words, an event doesn't depend on all of its history, just on a fixed-length window of recent history.
26
Markov Assumption
So, for each component in the product, replace it with its approximation (assuming a prefix of length N):
$P(w_n \mid w_1^{n-1}) \approx P(w_n \mid w_{n-N+1}^{n-1})$
27
Maximum Likelihood Estimation
Maximum Likelihood Estimation (MLE): a method to estimate the probabilities for the n-gram models by normalizing counts from a corpus. For bigrams:
$P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}$
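A minimal sketch of these counts and the MLE estimate in Python; the toy corpus is illustrative (real counts would come from a corpus such as BERP):

from collections import Counter

corpus = [["<s>", "i", "want", "english", "food", "</s>"],
          ["<s>", "i", "want", "chinese", "food", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((a, b) for sent in corpus for a, b in zip(sent, sent[1:]))

def p_mle(prev, word):
    # MLE bigram probability: C(prev word) / C(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(p_mle("i", "want"))  # 1.0 in this toy corpus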
28
N-Grams: The big red dog
Unigrams: P(dog)
Bigrams: P(dog | red)
Trigrams: P(dog | big red)
Four-grams: P(dog | the big red)
In general, we'll be dealing with P(word | some fixed prefix).
29
Caveat: the formulation P(word | some fixed prefix) is not really appropriate in many applications. It is if we're dealing with real-time speech, where we only have access to prefixes. But if we're dealing with text, we already have both the right and left contexts; there's no a priori reason to stick to left contexts only.
30
BERP Table: Counts (fig 4.1)
Then we can normalize by dividing each row by the unigram counts.
31
BERP Table: Bigram Probabilities
32
Example
For this example:
P(i | <s>) = .25
P(food | english) = .5
P(english | want) (value from the bigram table)
P(</s> | food) = .68
Now consider "<s> i want english food </s>":
P(<s> i want english food </s>) = P(i | <s>) P(want | i) P(english | want) P(food | english) P(</s> | food)
33
An Aside on Logs
You don't really do all those multiplies: the numbers are too small and lead to underflow. Convert the probabilities to logs and then do additions. To get the real probability back (if you need it), take the antilog.
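A quick sketch of the log-space trick; the five bigram probabilities are illustrative values for a sentence like the one above:

import math

probs = [0.25, 0.33, 0.0011, 0.5, 0.68]  # illustrative bigram probabilities

logprob = sum(math.log(p) for p in probs)  # add logs instead of multiplying
print(logprob)            # about -10.4
print(math.exp(logprob))  # antilog: back to a tiny probability, about 3.1e-05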
34
Some Observations
The following numbers are very informative. Think about what they capture.
P(want | I) = .32
P(to | want) = .65
P(eat | to) = .26
P(food | Chinese) = .56
P(lunch | eat) = .055
35
Some More Observations
P(I | I), P(want | I), P(I | food)
These license word sequences such as:
I I I want
I want I want to
The food I want is
36
Generation Choose N-Grams according to their probabilities and string them together
37
BERP: I want, want to, to eat, eat Chinese, Chinese food, food .
38
Some Useful Observations
A small number of events occur with high frequency; you can collect reliable statistics on these events with relatively small samples. A large number of events occur with low frequency; you might have to wait a long time to gather statistics on the low-frequency events.
39
Some Useful Observations
Some zeroes are really zeroes, meaning that they represent events that can't or shouldn't occur. On the other hand, some zeroes aren't really zeroes: they represent low-frequency events that simply didn't occur in the corpus.
40
Shannon's Method
Sentences randomly generated from the probability models (n-gram models):
Sample a random bigram (<s>, w) according to its probability.
Now sample a random bigram (w, x) according to its probability, where the prefix w matches the suffix of the previous bigram.
And so on, until we randomly choose a (y, </s>).
Then string the words together:
<s> I, I want, want to, to eat, eat Chinese, Chinese food, food </s>
Slide from: Speech and Language Processing, Jurafsky and Martin
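A minimal sketch of Shannon's method over a bigram model, reusing the bigrams counter from the MLE sketch above (random.choices is from the standard library):

import random

def shannon_generate(bigrams):
    # Walk from <s> to </s>, sampling each next word in proportion
    # to the count of the bigram (current word, next word).
    word, sentence = "<s>", []
    while word != "</s>":
        nexts = {b: c for (a, b), c in bigrams.items() if a == word}
        word = random.choices(list(nexts), weights=list(nexts.values()))[0]
        sentence.append(word)
    return " ".join(sentence[:-1])  # drop the final </s>

print(shannon_generate(bigrams))  # e.g. "i want chinese food"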
41
Shannon’s method applied to Shakespeare
42
Shannon applied to Wall Street Journal
43
Evaluating N-grams: Perplexity
Train on a training set; evaluate on a test set $W = w_1 w_2 \ldots w_N$.
Perplexity (PP) is a measure of how good a model is:
$PP(W) = P(w_1 w_2 \ldots w_N)^{-1/N}$
Higher probability means lower perplexity.
[Table: Wall Street Journal perplexities of n-gram models]
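A small sketch of the perplexity computation, done in log space for numerical stability; the per-word probabilities are illustrative:

import math

word_probs = [0.1, 0.2, 0.05, 0.1]  # illustrative model probabilities for a test set
N = len(word_probs)

log_p = sum(math.log(p) for p in word_probs)
perplexity = math.exp(-log_p / N)   # PP(W) = P(w1..wN)^(-1/N)
print(perplexity)                   # 10.0 for these numbers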
44
Unknown words: Open versus Closed Vocabularies
<UNK>: a special token for unrecognized (out-of-vocabulary) words. In an open vocabulary, unknown words are mapped to <UNK>, whose probability is estimated like any other word's.
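One common recipe for training <UNK>, sketched here with an illustrative count threshold: replace rare words with <UNK> before estimating the model, so <UNK> gets counts of its own:

from collections import Counter

def mark_unknowns(sentences, min_count=2):
    # Replace words seen fewer than min_count times with <UNK>.
    counts = Counter(w for sent in sentences for w in sent)
    return [[w if counts[w] >= min_count else "<UNK>" for w in sent]
            for sent in sentences]

sents = [["i", "want", "food"], ["i", "want", "sushi"]]
print(mark_unknowns(sents))  # food and sushi each occur once -> <UNK>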
45
Google words visualization
46
Problem
Let's assume we're using N-grams. How can we assign a probability to a sequence when one of the component n-grams has a count of zero? Assume all the words are known and have been seen. Two options (a minimal back-off sketch follows below):
Go to a lower-order n-gram: back off from bigrams to unigrams
Replace the zero with something else (smoothing)
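The back-off option, as a minimal sketch reusing the unigrams and bigrams counters from the MLE sketch; this is the simple "fall back to a discounted unigram" version, and the alpha weight is illustrative (proper back-off methods redistribute and renormalize mass more carefully):

def p_backoff(prev, word, alpha=0.4):
    # Use the bigram MLE when we have counts; otherwise back off
    # to a discounted unigram probability.
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    total = sum(unigrams.values())
    return alpha * unigrams[word] / total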
47
Smoothing
Smoothing: re-evaluating some of the zero- and low-probability N-grams and assigning them non-zero values.
Add-One (Laplace): make the zero counts 1. Rationale: they're just events you haven't seen yet. If you had seen them, chances are you would only have seen them once… so make the count equal to 1.
48
Add-One Smoothing Terminology
N: total number of word tokens
V: vocabulary size, the number of distinct words
Maximum likelihood estimate: $P(w_i) = \frac{c_i}{N}$
Add-one estimate: $P_{\text{Laplace}}(w_i) = \frac{c_i + 1}{N + V}$
49
Adjusted Counts "C*"
N: total number of word tokens
V: vocabulary size, the number of distinct words
Adjusted count: $c_i^* = (c_i + 1)\,\frac{N}{N + V}$
Adjusted probability: $p_i^* = \frac{c_i^*}{N} = \frac{c_i + 1}{N + V}$
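A sketch of the add-one estimate in code, reusing the unigrams and bigrams counters from the MLE sketch (V is the vocabulary size):

V = len(unigrams)  # number of distinct word types

def p_laplace(prev, word):
    # Add-one smoothed bigram probability: (C(prev word) + 1) / (C(prev) + V)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_laplace("english", "chinese"))  # a zero-count bigram now gets mass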
50
Discounting
Discounting: lowering some of the larger non-zero counts to obtain the probability mass to assign to the zero entries.
$d_c = \frac{c^*}{c}$: the relative discount, the ratio of the adjusted counts to the original counts.
The discounted probabilities can then be calculated directly.
51
Original BERP Counts (fig 6.4 again)
Berkeley Restaurant Project data V = 1616
52
Figure 6.6: add-one counts and probabilities (tables not reproduced)
53
Figure 6.6 (continued): add-one probabilities (table not reproduced)
54
Add-One Smoothed bigram counts
55
Witten-Bell
Think about the occurrence of an unseen item (word, bigram, etc.) as an event in itself. The probability of such an event can be measured in a corpus by just looking at how often it happens. Take the single-word case first: assume a corpus of N tokens and T types. How many times was an as-yet-unseen type encountered? Every time one of the T types was seen for the first time, i.e., T times.
56
Witten-Bell
First compute the probability of an unseen event:
$P(\text{unseen}) = \frac{T}{N + T}$
Then distribute that probability mass equally among the as-yet-unseen events. That should strike you as odd for a number of reasons, both in the case of words and in the case of bigrams.
57
Witten-Bell
In the case of bigrams, not all conditioning events are equally promiscuous: compare P(x | the) with P(x | going). So distribute the mass assigned to the zero-count bigrams according to the promiscuity of their context.
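In symbols, a sketch of the standard Witten-Bell bigram formulation, where T(w) is the number of distinct types seen after w, N(w) is the token count of w, and Z(w) is the number of words never seen after w:

Total mass reserved for unseen successors of $w_{i-1}$:
$\sum_{w_i : C(w_{i-1} w_i) = 0} P^*(w_i \mid w_{i-1}) = \frac{T(w_{i-1})}{N(w_{i-1}) + T(w_{i-1})}$
Shared among the $Z(w_{i-1})$ unseen successors:
$P^*(w_i \mid w_{i-1}) = \frac{T(w_{i-1})}{Z(w_{i-1})\,\bigl(N(w_{i-1}) + T(w_{i-1})\bigr)} \quad \text{if } C(w_{i-1} w_i) = 0$
Seen bigrams are discounted to compensate:
$P^*(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i)}{N(w_{i-1}) + T(w_{i-1})} \quad \text{if } C(w_{i-1} w_i) > 0$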
58
Witten-Bell Finally, renormalize the whole table so that you still have a valid probability
59
Original BERP Counts; Now the Add 1 counts
60
Witten-Bell Smoothed and Reconstituted
61
Add-One Smoothed BERP Reconstituted