Word prediction What are likely completions of the following sentences? –“Oh, that must be a syntax …” –“I have to go to the …” –“I’d also like a Coke and some …” Word prediction (guessing what might come next) has many applications: –speech recognition –handwriting recognition –augmentative communication systems –spelling error detection/correction
N-grams A simple model for word prediction is the n-gram model. An n-gram model “uses the previous N-1 words to predict the next one”.
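To make the "previous N-1 words" idea concrete, here is a minimal Python sketch that slices a token list into n-grams and splits each one into its context and the word to be predicted. The function names `ngrams` and `split_context` and the whitespace tokenization are assumptions made for illustration; they do not come from the slides.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def split_context(ngram):
    """Split an n-gram into its (n-1)-word context and the word to predict."""
    return ngram[:-1], ngram[-1]


# Whitespace splitting is a simplification; real tokenization is more involved.
tokens = "i would also like a coke".split()
trigrams = ngrams(tokens, 3)
context, target = split_context(trigrams[0])
```

With n = 3, each trigram uses its first two words as the context for predicting the third.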
Using corpora in word prediction A corpus is a collection of text (or speech). To do word prediction using n-grams we must compute conditional word probabilities. These probabilities are estimated from counts of word occurrences in a corpus or in several corpora.
Counting types and tokens in corpora Prior to computing conditional probabilities, counts are needed. Most counts are based on word form (i.e. cat and cats are two distinct word forms). We distinguish types from tokens: –“We use types to mean the number of distinct words in a corpus” –We use “tokens to mean the total number of running words.” [pg. 195]
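The type/token distinction can be sketched in a few lines of Python. This is one possible counting scheme, not the only one: the `tokenize` helper, the regular expression, and the decision to lowercase the text and drop punctuation are assumptions made for this example.

```python
import re


def tokenize(text):
    """Lowercase the text and return its word forms in running order."""
    # Keeps internal apostrophes and hyphens (reptile's, cold-blooded).
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())


text = ("Some reptiles are adored, or loved, by people "
        "while other reptiles are feared by people!")

tokens = tokenize(text)   # every running word
types = set(tokens)       # distinct word forms only

num_tokens = len(tokens)
num_types = len(types)
```

For this sentence the sketch finds 15 tokens but only 11 types, because "reptiles", "are", "by", and "people" each occur twice.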
Word (unigram) probabilities Simple model: all words have equal probability. For example, assuming a lexicon of 100,000 words we have P(the) = 1/100,000 = 0.00001, P(rabbit) = 0.00001. Consider the difference between “the” and “rabbit” in the following contexts: –I bought … –I read …
Unigrams Somewhat better model: consider frequency of occurrence. For example, we might have P(the) = 0.07, P(rabbit)= Consider difference between –Anyhow, … –Just then, the white …
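A frequency-based unigram model can be sketched as relative frequencies: the count of a word divided by the total number of tokens. The toy sentence below is invented for illustration and is far too small to give realistic probabilities; `unigram_probs` is a hypothetical helper name.

```python
from collections import Counter


def unigram_probs(tokens):
    """Estimate P(w) as count(w) / total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


# Invented toy corpus: 10 tokens, 7 types.
tokens = "the white rabbit saw the other rabbit and the cat".split()

probs = unigram_probs(tokens)

# The equal-probability model gives every type the same value.
uniform = 1 / len(set(tokens))
```

Here "the" receives a higher probability than "rabbit", whereas the equal-probability model cannot tell them apart. Neither model looks at the context in which the word occurs.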
Bigrams Bigrams consider two-word sequences. –P(rabbit | white) = …high… –P(the | white) = …low… –P(rabbit | anyhow) = …low… –P(the | anyhow) = …high… The probability of occurrence of a word is dependent on the context of occurrence: the probability is conditional.
Probabilities We want to know the probability of a word given a preceding context. We can compute the probability of a word sequence: $P(w_1 w_2 \ldots w_n)$. How do we compute the probability that $w_n$ occurs after the sequence $(w_1 w_2 \ldots w_{n-1})$? By the chain rule: $P(w_1 w_2 \ldots w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2) \cdots P(w_n \mid w_1 w_2 \ldots w_{n-1})$ How do we find probabilities like $P(w_n \mid w_1 w_2 \ldots w_{n-1})$? We approximate! In a bigram model we condition only on the one preceding word, approximating $P(w_n \mid w_1 w_2 \ldots w_{n-1})$ by $P(w_n \mid w_{n-1})$.
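Written out in full, the chain rule and the bigram approximation look as follows. The three-word phrase in the last line is only an illustration, and the first word is left as an unconditioned $P(w_1)$ here; a sentence-start marker is another common choice.

```latex
\begin{align*}
P(w_1 w_2 \ldots w_n)
  &= P(w_1) \prod_{k=2}^{n} P(w_k \mid w_1 \ldots w_{k-1})
  && \text{(chain rule)} \\
  &\approx P(w_1) \prod_{k=2}^{n} P(w_k \mid w_{k-1})
  && \text{(bigram approximation)} \\[4pt]
P(\text{the white rabbit})
  &\approx P(\text{the})\, P(\text{white} \mid \text{the})\, P(\text{rabbit} \mid \text{white})
\end{align*}
```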
Computing conditional probabilities How do we compute $P(w_n \mid w_{n-1})$? $P(w_n \mid w_{n-1}) = C(w_{n-1} w_n) / \sum_{w} C(w_{n-1} w)$ Since the sum of all bigram counts that start with $w_{n-1}$ must be the same as the unigram count for $w_{n-1}$, we can simplify: $P(w_n \mid w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})$
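The simplified formula can be sketched in Python as a ratio of two counts. The helper names `bigram_model` and `prob`, the toy corpus, and the choice to return 0.0 for an unseen context are assumptions made for this example. One edge case is worth noting: the very last token of a corpus starts no bigram, so for that word the bigram counts sum to one less than its unigram count unless an end marker is added.

```python
from collections import Counter


def bigram_model(tokens):
    """Build unigram counts, bigram counts, and a P(word | prev) estimator."""
    unigram_counts = Counter(tokens)
    # zip pairs each token with its successor: (w1, w2), (w2, w3), ...
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    def prob(word, prev):
        """P(word | prev) = C(prev word) / C(prev); 0.0 if prev was never seen."""
        if unigram_counts[prev] == 0:
            return 0.0
        return bigram_counts[(prev, word)] / unigram_counts[prev]

    return unigram_counts, bigram_counts, prob


# Invented toy corpus for illustration.
tokens = "the white rabbit saw the white cat".split()
unigram_counts, bigram_counts, prob = bigram_model(tokens)

p_white_given_the = prob("white", "the")       # C(the white) / C(the) = 2 / 2
p_rabbit_given_white = prob("rabbit", "white") # C(white rabbit) / C(white) = 1 / 2
```

In this toy corpus "white" always follows "the", while "rabbit" and "cat" each follow "white" half of the time. Bigrams that never occur get probability zero under this estimate.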
Exercise Calculate word counts, bigram counts, and bigram probabilities for the following text:
The world of reptiles is an interesting one! Some reptiles are adored, or loved, by people while other reptiles are feared by people! Turtles, lizards, and snakes are often kept as pets, but alligators, crocodiles, and some very large iguanas are best left in the wild! Reptiles are vertebrates and have some things in common. They have dry, scaly skin. Reptiles are cold-blooded, which means that their body temperature stays about the same as the temperature of their surroundings. As the temperature outdoors changes, the reptile's body temperature changes, too! Reptiles live on the land and in the water, but even the reptiles that live in the water most of the time breathe with lungs. Sometimes lazy alligators and crocodiles will lie in the water with only their nostrils, or noses, sticking out. Turtles stick their heads out of the water, too, to breathe in oxygen. All reptiles hatch from eggs. Some eggs are laid in a nest by the mother. All turtles, crocodiles, and some lizards and snakes lay eggs with shells. Other snake and lizard mothers, however, protect the eggs inside their bodies until they hatch. These babies are born alive! A mother alligator will carry her babies to the water, but not many reptiles care for their eggs or their babies. So, do you love snakes or are you afraid of them?