
1 Outline Applications: Spelling correction Formal Representation: Weighted FSTs Algorithms: Bayesian Inference (Noisy channel model) Methods to determine weights – Hand-coded – Corpus-based estimation Dynamic Programming – Shortest path

2 Detecting and Correcting Spelling Errors Sources of lexical/spelling errors Speech: lexical access and recognition errors (more later) Text: typing and cognitive OCR: recognition errors Applications: Spell checking Hand-writing recognition of zip codes, signatures, Graffiti Issues: Correct non-words in isolation (dg for dog, why not dig?) Correcting non-words could lead to valid words – Homophone substitution: “parents love there children”; “Lets order a desert after dinner” – Correcting words in context

3 Patterns of Error Human typists make different types of errors from OCR systems -- why? Error classification I: performance-based (examples for “cat”): Insertion: catt Deletion: ct Substitution: car Transposition: cta Error classification II: cognitive People don’t know how to spell (nucular/nuclear; potatoe/potato) Homonymous errors (their/there)

4 Probability: Refresher Population: 10 Princeton students (4 vegetarians, 3 CS majors) – What is the probability that a randomly chosen student (rcs) is a vegetarian? p(v) = 0.4 – That a rcs is a CS major? p(c) = 0.3 – That a rcs is a vegetarian and CS major? p(c,v) = 0.2 – That a vegetarian is a CS major? p(c|v) = 0.5 – That a CS major is a vegetarian? p(v|c) = 0.67 – That a non-CS major is a vegetarian? p(v|c’) = ??
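The conditional probabilities on this slide can be checked mechanically. The sketch below only uses the counts given on the slide (10 students, 4 vegetarians, 3 CS majors, 2 who are both); the variable names are made up for illustration. It also works out the open question p(v|c’) by counting vegetarians among the non-CS majors.

```python
from fractions import Fraction

# Counts from the slide: 10 students, 4 vegetarians, 3 CS majors,
# and p(c, v) = 0.2, i.e. 2 students who are both.
N = 10
p_v = Fraction(4, N)    # p(v)
p_c = Fraction(3, N)    # p(c)
p_cv = Fraction(2, N)   # p(c, v)

# Conditional probability = joint probability / probability of the condition.
p_c_given_v = p_cv / p_v                      # 1/2
p_v_given_c = p_cv / p_c                      # 2/3
# Vegetarians who are NOT CS majors, divided by all non-CS majors.
p_v_given_not_c = (p_v - p_cv) / (1 - p_c)    # 2/7

print(p_c_given_v, p_v_given_c, p_v_given_not_c)
```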

5 Bayes Rule and Noisy Channel model We know the joint probabilities – p(c,v) = p(c) p(v|c) (chain rule) – p(v,c) = p(c,v) = p(v) p(c|v) So, we can define the conditional probability p(c|v) in terms of the prior probabilities p(c) and p(v) and the likelihood p(v|c): p(c|v) = p(v|c) p(c) / p(v). “Noisy channel” metaphor: the channel corrupts the input; the task is to recover the original. – Think cell-phone conversations! – Hearer’s challenge: decode what the speaker said (w), given a channel-corrupted observation (O). Source model: P(w). Channel model: P(O|w).
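Applied to the hearer's decoding problem, the same rule gives the usual noisy-channel decision rule. The derivation below is a standard reading of the slide, not a formula copied from it; the symbol \hat{w} for the decoded word is notation introduced here.

```latex
\begin{align*}
\hat{w} &= \arg\max_{w} P(w \mid O) \\
        &= \arg\max_{w} \frac{P(O \mid w)\, P(w)}{P(O)} \\
        &= \arg\max_{w} \underbrace{P(O \mid w)}_{\text{channel model}}\;
                        \underbrace{P(w)}_{\text{source model}}
\end{align*}
```

The denominator P(O) is dropped in the last step because it is the same for every candidate w.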

6 How do we use this model to correct spelling errors? Simplifying assumptions – We only have to correct non-word errors – Each non-word (O) differs from its correct word (w) by one step (insertion, deletion, substitution, transposition) Generate and Test Method: (Kernighan et al 1990) – Generate a word using one of the insertion, deletion, substitution or transposition operations – Test if the resulting word is in the dictionary. Example:
Observation   Correct   Correct letter   Error letter   Position   Type of error
caat          cat       -                a              2          insertion
caat          carat     r                -              3          deletion
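The generate-and-test step can be sketched in a few lines. This is a minimal illustration under the slide's one-error assumption, not Kernighan et al.'s implementation; the function names and the toy dictionary are invented for the example. Note the direction: undoing an insertion error means deleting a letter from the observed string, and undoing a deletion error means inserting one.

```python
import string

def one_edit_candidates(observed, alphabet=string.ascii_lowercase):
    """All strings exactly one insertion, deletion, substitution or
    transposition away from the observed string."""
    splits = [(observed[:i], observed[i:]) for i in range(len(observed) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    substitutes = {a + ch + b[1:] for a, b in splits if b for ch in alphabet}
    inserts = {a + ch + b for a, b in splits for ch in alphabet}
    return (deletes | transposes | substitutes | inserts) - {observed}

def generate_and_test(observed, dictionary):
    """Generate every one-edit string, keep those found in the dictionary."""
    return sorted(one_edit_candidates(observed) & set(dictionary))

# Hypothetical toy dictionary, for illustration only.
TOY_DICTIONARY = {"cat", "carat", "cart", "coat", "dog"}
print(generate_and_test("caat", TOY_DICTIONARY))
# ['carat', 'cart', 'cat', 'coat']
```

Several valid words usually survive the test, which is why a ranking step is needed afterwards.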

7 How do we decide which correction is most likely? Validate the generated word in a dictionary. But there may be multiple valid words; how do we rank them? Rank them based on a scoring function – P(w | typo) ∝ P(typo | w) * P(w) (P(typo) is the same for every candidate, so it can be dropped) – Note there could be other scoring functions Propose n-best solutions Estimate the likelihood P(typo|w) and the prior P(w): count events from a corpus to estimate these probabilities Labeled versus unlabeled corpus For spelling correction, what do we need? – Word occurrence information (unlabeled corpus) – A corpus of labeled spelling errors – Approximate word replacement by local letter replacement probabilities: confusion matrix on letters

8 Cat vs Carat Estimating the prior: Suppose we look at the occurrence of cat and carat in a large (50M word) AP news corpus. cat occurs 6500 times, so p(cat) = .00013; carat occurs 3000 times, so p(carat) = .00006. Estimating the likelihood: Now we need to find out if inserting an ‘a’ after an ‘a’ is more likely than deleting an ‘r’ after an ‘a’, using a corrections corpus of 50K corrections (p(typo|word)). Suppose ‘a’ insertion after ‘a’ occurs 5000 times (p(+a) = .1) and ‘r’ deletion occurs 7500 times (p(-r) = .15). Scoring function: p(word|typo) ∝ p(typo|word) * p(word) p(cat|caat) ∝ p(+a) * p(cat) = .1 * .00013 = .000013 p(carat|caat) ∝ p(-r) * p(carat) = .15 * .00006 = .000009 So cat is the higher-scoring correction.
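The arithmetic on this slide can be reproduced directly from the counts. The sketch below uses only the slide's numbers; the dictionary and function names are illustrative assumptions.

```python
CORPUS_SIZE = 50_000_000   # words in the news corpus
CORRECTIONS = 50_000       # entries in the corrections corpus

# How often each candidate word occurs in the corpus.
word_counts = {"cat": 6500, "carat": 3000}
# How often the error that turns each word into "caat" was observed:
# 'a' inserted after 'a' (cat), 'r' deleted after 'a' (carat).
error_counts = {"cat": 5000, "carat": 7500}

def score(word):
    """Unnormalised posterior: p(typo | word) * p(word)."""
    prior = word_counts[word] / CORPUS_SIZE
    likelihood = error_counts[word] / CORRECTIONS
    return likelihood * prior

ranked = sorted(word_counts, key=score, reverse=True)
print(ranked)  # ['cat', 'carat']
```

Although the ‘r’ deletion is the more frequent error, the higher prior of cat outweighs it.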

9 Encoding One-Error Correction as WFSTs Let Σ = {c,a,r,t}. One-edit model (figure): identity arcs c:c, a:a, r:r, t:t; insertion (Ins) arcs ε:c, ε:a, ε:r, ε:t; deletion (Del) arcs c:ε, a:ε, r:ε, t:ε; substitution (Sub) arcs c:a, c:r, c:t, a:c, a:t, … Dictionary model (figure): automaton over the letters of the dictionary words. One-Error spelling correction: Input ● Edit ● Dictionary

10 Issues What if there are no instances of carat in corpus? Smoothing algorithms Estimate of P(typo|word) may not be accurate Training probabilities on typo/word pairs What if there is more than one error per word?

11 Minimum Edit Distance How can we measure how different one word is from another word? How many operations will it take to transform one word into another? caat --> cat, fplc --> fireplace (*treat abbreviations as typos??) Levenshtein distance: smallest number of insertion, deletion, or substitution operations that transform one string into another (ins=del=subst=1) Alternative: weight each operation by training on a corpus of spelling errors to see which is most frequent

12 Computing Levenshtein Distance Dynamic Programming algorithm – The solution to a problem is a function of the solutions of its subproblems – d[i,j] contains the distance between the first i characters of s and the first j characters of t – d[i,j] is computed by combining the distances of shorter substrings using insertion, deletion and substitution operations. – The optimal sequence of edit operations is recovered by storing back-pointers.
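A minimal sketch of the dynamic program, assuming unit costs by default. The function name and the cost parameters are illustrative; back-pointers are left out, so only the distance is returned, not the edit sequence.

```python
def edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=1):
    """d[i][j] = cheapest way to turn source[:i] into target[:j]."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    # First column: delete every character of the source prefix.
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost
    # First row: insert every character of the target prefix.
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            d[i][j] = min(
                d[i - 1][j] + del_cost,                       # deletion
                d[i][j - 1] + ins_cost,                       # insertion
                d[i - 1][j - 1] + (0 if same else sub_cost),  # match / substitution
            )
    return d[n][m]

print(edit_distance("caat", "cat"))        # 1
print(edit_distance("fplc", "fireplace"))  # 5
```

Passing `sub_cost=2` gives the variant in which a substitution costs as much as a deletion plus an insertion.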

13 Edit Distance Matrix NB: errors Cost=1 for insertions and deletions; Cost=2 for substitutions Recompute the matrix: insertions=deletions=substitutions=1

14 Levenshtein Distance with WFSTs Let Σ = {c,a,r,t}. Edit model (figure): identity arcs c:c, a:a, r:r, t:t; insertion (Ins) arcs ε:c, ε:a, ε:r, ε:t; deletion (Del) arcs c:ε, a:ε, r:ε, t:ε; substitution (Sub) arcs c:a, c:r, c:t, a:c, a:t, … The two sentences to be compared are encoded as FSTs. Levenshtein distance between two sentences: Dist(s1,s2) = s1 ● Edit ● s2

15 Spelling Correction with WFSTs Dictionary: FST representation of words Isolated word spelling correction: AllCorrections(w) = w ● Edit ● Dictionary BestCorrection(w) = Bestpath (w ● Edit ● Dictionary) Spelling correction in context: “parents love there children” S = w_1, w_2, …, w_n Spelling correction of w_i: Generate possible edits for w_i Pick the edit that fits best in context Use an n-gram language model (LM) to rank the alternatives. “love there” vs “love their”; “there children” vs “their children” SentenceCorrection (S) = F(S) ● Edit ● LM
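Ranking alternatives in context can be sketched with a bigram model. The probabilities below are invented for illustration (they do not come from any corpus), and the sketch scores only the language-model part: the edit cost from the Edit transducer is ignored, and the word being corrected is assumed to have a neighbour on both sides.

```python
import math

# Hypothetical bigram probabilities, invented for illustration only.
BIGRAM = {
    ("love", "their"): 0.010,
    ("their", "children"): 0.020,
    ("love", "there"): 0.001,
    ("there", "children"): 0.0001,
}
FLOOR = 1e-6  # crude stand-in for smoothing of unseen bigrams

def context_score(left, candidate, right):
    """Log-probability of the two bigrams that involve the candidate."""
    return (math.log(BIGRAM.get((left, candidate), FLOOR))
            + math.log(BIGRAM.get((candidate, right), FLOOR)))

def best_in_context(sentence, position, candidates):
    """Pick the candidate for sentence[position] that fits its neighbours best."""
    left, right = sentence[position - 1], sentence[position + 1]
    return max(candidates, key=lambda w: context_score(left, w, right))

sentence = "parents love there children".split()
print(best_in_context(sentence, 2, ["there", "their"]))  # their
```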

17 Summary We can apply probabilistic modeling to NL problems like spell-checking Noisy channel model, Bayesian method Training priors and likelihoods on a corpus Dynamic programming approaches allow us to solve large problems that can be decomposed into sub problems e.g. Minimum Edit Distance algorithm A number of Speech and Language tasks can be cast in this framework. Generate alternatives using a generator Select best/ Rank the alternatives using a model If the generator and the model are encodable as FST – Decoding becomes composition followed by search for best path.

18 Word Classes and Tagging

19 Words can be grouped into classes based on a number of criteria. Application independent criterion – Syntactic class (Nouns, Verbs, Adjectives…) – Proper names (People names, country names…) – Dates, currencies Application specific criterion – Product names (Ajax, Slurpee, Lexmark 3100) – Service names (7-cents plan, GoldPass) Tagging: Categorizing words of a sentence into one of the classes.

20 Syntactic Classes in English: Open Class Words Nouns: Defined semantically: words for people, places, things Defined syntactically: words that take determiners Count nouns: nouns that can be counted – One book, two computers, hundred men Mass nouns: nouns that represent homogenous groups, can occur without articles. – snow, salt, milk, water, hair Proper nouns; common nouns Verbs: words for actions and processes Hit, love, run, fly, differ, go Adjectives: words for describing qualities and properties (modifiers) of objects White, black, old, young, good, bad Adverbs: words for describing modifiers of actions Unfortunately, John walked home extremely slowly yesterday Subclasses: locative (home), degree (very), manner (slowly), temporal (yesterday)

21 Syntactic Classes in English: Closed Class Words Closed Class words: fixed set for a language Typically high frequency words Prepositions: relational words for describing relations among objects and events In, on, before, by Particles: looked up, throw out Articles/Determiners: definite versus indefinite Indefinite: a, an Definite: the Conjunctions: used to join two phrases, clauses, sentences. Coordinating conjunctions: and, or, but Subordinating conjunctions: that, since, because Pronouns: shorthand to refer to objects and events. Personal pronouns: he, she, it, they, us Possessive pronouns: my, your, ours, theirs, his, hers, its, one’s Wh-pronouns: whose, what, who, whom, whomever Auxiliary verbs: used to mark tense, aspect, polarity, mood, of an action Tense: past, present, future Aspect: completed or on-going Polarity: negation Mood: possible, suggested, necessary, desired; depicted by modal verbs (can, do, have, may, might) Copula: “be” connects a subject to a predicate (John is a teacher) Other word classes: Interjections (ah, oh, alas); negatives (not, no); politeness (please, sorry), greetings (hello, goodbye).

22 Tagset Tagset: set of tags to use; depends on the application. Basic tags; tags with some morphology Composition of a number of subtags – Agglutinative languages Popular tagsets for English Penn Treebank Tagset: 45 tags CLAWS tagset: 61 tags C7 tagset: 146 tags How do we decide how many tags to use? Application utility Ease of disambiguation Annotation consistency – The “IN” tag in the Penn Treebank tagset covers both subordinating conjunctions and prepositions – The “TO” tag represents both the preposition “to” and the infinitival marker (“to read”) Supertags: fold syntactic information into the tagset; of the order of 1000 tags

23 Tagging: Disambiguating Words Three different models ENGTWOL model (Karlsson 1995) Transformation-based model (Brill 1995) Hidden Markov Model tagger ENGTWOL tagger Constraint-based tagger 1,100 hand-written constraints to rule out invalid combinations of tags. – Use of probabilistic constraints and syntactic information Transformation-based model Start with the most likely assignment Make note of the context when the most likely assignment is wrong. Induce a transformation rule that corrects the most likely assignment to the correct tag in that context. Rules can be seen as α → β | δ _ γ (rewrite tag α as β when it occurs between δ and γ) Compilable into an FST
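Applying one transformation rule can be sketched as follows. This is an illustration of the rule format only, not Brill's learner: the `Rule` type, the function name and the example rule are assumptions, and contexts are matched against the original tag sequence rather than updated left to right.

```python
from typing import NamedTuple, Optional

class Rule(NamedTuple):
    """Rewrite from_tag as to_tag when the neighbouring tags match.
    A context of None means 'any tag' on that side."""
    from_tag: str
    to_tag: str
    prev_tag: Optional[str] = None
    next_tag: Optional[str] = None

def apply_rule(tags, rule):
    """Return a new tag sequence with the rule applied wherever it matches."""
    out = list(tags)
    for i, tag in enumerate(tags):
        if tag != rule.from_tag:
            continue
        prev_ok = rule.prev_tag is None or (i > 0 and tags[i - 1] == rule.prev_tag)
        next_ok = rule.next_tag is None or (i + 1 < len(tags) and tags[i + 1] == rule.next_tag)
        if prev_ok and next_ok:
            out[i] = rule.to_tag
    return out

# Illustrative rule: a word tagged NN right after TO is retagged as VB
# (e.g. "to race", where the most likely tag for "race" is NN).
nn_to_vb = Rule(from_tag="NN", to_tag="VB", prev_tag="TO")
print(apply_rule(["TO", "NN"], nn_to_vb))  # ['TO', 'VB']
```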

24 Again, the Noisy Channel Model Input to channel: part-of-speech sequence T Output from channel: a word sequence W Decoding task: find $T' = \arg\max_T P(T \mid W)$ Using Bayes Rule: $T' = \arg\max_T P(W \mid T)\, P(T) / P(W)$ And since P(W) doesn’t change for any hypothetical T: $T' = \arg\max_T P(W \mid T)\, P(T)$ P(W|T) is the Emit Probability, and P(T) is the prior, or Contextual Probability [Figure: Source → Noisy Channel → Decoder]

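25 Stochastic Tagging: Markov Assumption The tagging model is approximated using Markov assumptions. – $T' = \arg\max_T P(T)\, P(W \mid T)$ – Markov (first-order) assumption: $P(T) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})$ – Independence assumption: $P(W \mid T) \approx \prod_{i=1}^{n} P(w_i \mid t_i)$ – Thus: $T' = \arg\max_T \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$ The probability distributions are estimated from an annotated corpus. – Maximum Likelihood Estimate: $P(w \mid t) = \mathrm{count}(w, t) / \mathrm{count}(t)$ and $P(t_i \mid t_{i-1}) = \mathrm{count}(t_i, t_{i-1}) / \mathrm{count}(t_{i-1})$ Don’t forget to smooth the counts! – There are other means of estimating these probabilities.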
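The count-based estimates can be sketched as below. The slide does not say which smoothing method to use; add-alpha (Laplace) smoothing is assumed here as one simple option, and the function name, the BOS/EOS boundary markers and the two-sentence corpus are illustrative.

```python
from collections import Counter

def estimate(tagged_sentences, alpha=1.0):
    """Return (p_emit, p_trans) estimated by counting, with add-alpha smoothing.
    alpha=0 gives the plain maximum likelihood estimate."""
    tag_count, emit_count, trans_count = Counter(), Counter(), Counter()
    vocab = set()
    for sentence in tagged_sentences:
        prev = "BOS"
        tag_count["BOS"] += 1
        for word, tag in sentence:
            tag_count[tag] += 1
            emit_count[(word, tag)] += 1
            trans_count[(prev, tag)] += 1
            vocab.add(word)
            prev = tag
        trans_count[(prev, "EOS")] += 1
    # Possible successors of a tag: every real tag plus EOS.
    n_successors = len([t for t in tag_count if t != "BOS"]) + 1

    def p_emit(word, tag):
        return (emit_count[(word, tag)] + alpha) / (tag_count[tag] + alpha * len(vocab))

    def p_trans(tag, prev):
        return (trans_count[(prev, tag)] + alpha) / (tag_count[prev] + alpha * n_successors)

    return p_emit, p_trans

# Hypothetical two-sentence annotated corpus.
corpus = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
]
mle_emit, mle_trans = estimate(corpus, alpha=0.0)
print(mle_emit("runs", "VBZ"), mle_trans("NN", "DT"))  # 0.5 1.0
```

With `alpha=0` an unseen word/tag pair gets probability zero, which is the reason for the slide's reminder to smooth.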

26 Best Path Search Search for the best path pervades many Speech and NLP problems. ASR: best path through a composition of acoustic, pronunciation and language models Tagging: best path through a composition of lexicon and contextual model Edit distance: best path through a search space set up by insertion, deletion and substitution operations. In general: Decisions/operations create a weighted search space Search for the best sequence of decisions Dynamic programming solution Sometimes only the score is relevant. Most often the path (sequence of states; derivation) is relevant.

27 Multi-stage decision problems
[Figure: tag lattice for “The dog runs.” with states BOS, DT, NN, VB, NNS, VBZ, sentence-final punctuation (written “.”), EOS]
Emission probabilities: P(the|DT) = 0.999; P(dog|NN) = 0.99; P(dog|VB) = 0.01; P(runs|NNS) = 0.63; P(runs|VBZ) = 0.37; P(.|.) = 0.999
Transition probabilities: P(DT|BOS) = 1; P(NN|DT) = 0.9; P(VB|DT) = 0.1; P(NNS|NN) = 0.3; P(VBZ|NN) = 0.7; P(NNS|VB) = 0.7; P(VBZ|VB) = 0.3; P(.|NNS) = 0.3; P(.|VBZ) = 0.7; P(EOS|.) = 1

28 Multi-stage decision problems Find the state sequence through this space that maximizes the product of P(w_i|t_i) * P(t_i|t_{i-1}) over all positions i. cost(BOS, EOS) = P(DT|BOS) * cost(DT, EOS) = 1 * cost(DT, EOS) cost(DT, EOS) = max{P(the|DT)*P(NN|DT)*cost(NN,EOS), P(the|DT)*P(VB|DT)*cost(VB,EOS)} [Figure: tag lattice for “The dog runs.” with states BOS, DT, NN, VB, NNS, VBZ, EOS]
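The search over this lattice can be sketched as a small Viterbi-style dynamic program, using the probabilities from the previous slide. Two assumptions are made here: the tag whose name is blank in the extracted slide text is taken to be the sentence-final punctuation tag ".", and the search runs from BOS forward rather than from each state to EOS as in the recursion above (both orders reach the same optimum). There is no handling of unknown words.

```python
# Emission probabilities P(word | tag), from the slide.
# ASSUMPTION: the blank tag in the slide text is the punctuation tag ".".
EMIT = {
    ("the", "DT"): 0.999,
    ("dog", "NN"): 0.99, ("dog", "VB"): 0.01,
    ("runs", "NNS"): 0.63, ("runs", "VBZ"): 0.37,
    (".", "."): 0.999,
}
# Transition probabilities P(tag | previous tag), keyed as (previous, tag).
TRANS = {
    ("BOS", "DT"): 1.0,
    ("DT", "NN"): 0.9, ("DT", "VB"): 0.1,
    ("NN", "NNS"): 0.3, ("NN", "VBZ"): 0.7,
    ("VB", "NNS"): 0.7, ("VB", "VBZ"): 0.3,
    ("NNS", "."): 0.3, ("VBZ", "."): 0.7,
    (".", "EOS"): 1.0,
}
TAGS = ["DT", "NN", "VB", "NNS", "VBZ", "."]

def viterbi(words):
    """Return (probability, tag sequence) of the best path from BOS to EOS."""
    # best[tag] = (probability of the best path ending in tag, that path)
    best = {"BOS": (1.0, [])}
    for word in words:
        step = {}
        for tag in TAGS:
            emit = EMIT.get((word, tag), 0.0)
            if emit == 0.0:
                continue
            candidates = [
                (prob * TRANS.get((prev, tag), 0.0) * emit, path + [tag])
                for prev, (prob, path) in best.items()
            ]
            score, path = max(candidates, key=lambda c: c[0])
            if score > 0.0:
                step[tag] = (score, path)
        best = step
    finals = [
        (prob * TRANS.get((tag, "EOS"), 0.0), path)
        for tag, (prob, path) in best.items()
    ]
    return max(finals, key=lambda c: c[0])

prob, tags = viterbi(["the", "dog", "runs", "."])
print(tags)  # ['DT', 'NN', 'VBZ', '.']
```

Only the best path into each tag is kept at every position, so the work grows linearly with sentence length instead of with the number of complete tag sequences.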

29 Two ways of reasoning Forward approach (Backward reasoning) Compute the best way to get from a state to the goal state. Backward approach (Forward reasoning) Compute the best way from the source state to get to a state. A combination of these two approaches is used in unsupervised training of HMMs. Forward-backward algorithm (Appendix D)

