CSA3180: Natural Language Processing

Slides:



Advertisements
Similar presentations
Spelling Correction for Search Engine Queries Bruno Martins, Mario J. Silva In Proceedings of EsTAL-04, España for Natural Language Processing Presenter:
Advertisements

Naïve Bayes. Bayesian Reasoning Bayesian reasoning provides a probabilistic approach to inference. It is based on the assumption that the quantities of.
LING 438/538 Computational Linguistics Sandiway Fong Lecture 17: 10/25.
Longest Common Subsequence
Spelling Correction as an iterative process that exploits the collective knowledge of web users Silviu Cucerzan & Eric Bill Microsoft Research Proceedings.
Bar Ilan University And Georgia Tech Artistic Consultant: Aviya Amir.
Hidden Markov Models Bonnie Dorr Christof Monz CMSC 723: Introduction to Computational Linguistics Lecture 5 October 6, 2004.
1 Hidden Markov Models (HMMs) Probabilistic Automata Ubiquitous in Speech/Speaker Recognition/Verification Suitable for modelling phenomena which are dynamic.
A BAYESIAN APPROACH TO SPELLING CORRECTION. ‘Noisy channels’ In a number of tasks involving natural language, the problem can be viewed as recovering.
Slide 1 EE3J2 Data Mining EE3J2 Data Mining Lecture 12: Sequence Analysis Martin Russell.
Gobalisation Week 8 Text processes part 2 Spelling dictionaries Noisy channel model Candidate strings Prior probability and likelihood Lab session: practising.
Computational Language Andrew Hippisley. Computational Language Computational language and AI Language engineering: applied computational language Case.
Linear Discriminant Functions Chapter 5 (Duda et al.)
Metodi statistici nella linguistica computazionale The Bayesian approach to spelling correction.
CS Bayesian Learning1 Bayesian Learning. CS Bayesian Learning2 States, causes, hypotheses. Observations, effect, data. We need to reconcile.
1 Introduction to Computational Natural Language Learning Linguistics (Under: Topics in Natural Language Processing ) Computer Science (Under:
Statistical Natural Language Processing. What is NLP?  Natural Language Processing (NLP), or Computational Linguistics, is concerned with theoretical.
Lecture 21: Languages and Grammars. Natural Language vs. Formal Language.
November 2005CSA3180: Statistics III1 CSA3202: Natural Language Processing Statistics 3 – Spelling Models Typing Errors Error Models Spellchecking Noisy.
Principles of Pattern Recognition
Chapter 5. Probabilistic Models of Pronunciation and Spelling 2007 년 05 월 04 일 부산대학교 인공지능연구실 김민호 Text : Speech and Language Processing Page. 141 ~ 189.
Prof. Pushpak Bhattacharyya, IIT Bombay.1 Application of Noisy Channel, Channel Entropy CS 621 Artificial Intelligence Lecture /09/05.
1 CSA4050: Advanced Topics in NLP Spelling Models.
1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester
ECE 8443 – Pattern Recognition LECTURE 07: MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION Objectives: Class-Conditional Density The Multivariate Case General.
22CS 338: Graphical User Interfaces. Dario Salvucci, Drexel University. Lecture 10: Advanced Input.
Maximum Entropy (ME) Maximum Entropy Markov Model (MEMM) Conditional Random Field (CRF)
Regular Expressions Chapter 6 1. Regular Languages Regular Language Regular Expression Finite State Machine L Accepts 2.
ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition Objectives: Reestimation Equations Continuous Distributions.
1 CSE 552/652 Hidden Markov Models for Speech Recognition Spring, 2005 Oregon Health & Science University OGI School of Science & Engineering John-Paul.
Protein motif extraction with neuro-fuzzy optimization Bill C. H. Chang and Author : Bill C. H. Chang and Saman K. Halgamuge Saman K. Halgamuge Adviser.
CSA3202 Human Language Technology HMMs for POS Tagging.
1 CONTEXT DEPENDENT CLASSIFICATION  Remember: Bayes rule  Here: The class to which a feature vector belongs depends on:  Its own value  The values.
1 CS 552/652 Speech Recognition with Hidden Markov Models Winter 2011 Oregon Health & Science University Center for Spoken Language Understanding John-Paul.
1 CSE 552/652 Hidden Markov Models for Speech Recognition Spring, 2006 Oregon Health & Science University OGI School of Science & Engineering John-Paul.
Learning Sequence Motifs Using Expectation Maximization (EM) and Gibbs Sampling BMI/CS 776 Mark Craven
ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition LECTURE 04: GAUSSIAN CLASSIFIERS Objectives: Whitening.
Lecture 3: MLE, Bayes Learning, and Maximum Entropy
Autumn Web Information retrieval (Web IR) Handout #3:Dictionaries and tolerant retrieval Mohammad Sadegh Taherzadeh ECE Department, Yazd University.
Discriminative Training and Machine Learning Approaches Machine Learning Lab, Dept. of CSIE, NCKU Chih-Pin Liao.
Using the Web for Language Independent Spellchecking and Auto correction Authors: C. Whitelaw, B. Hutchinson, G. Chung, and G. Ellis Google Inc. Published.
January 2012Spelling Models1 Human Language Technology Spelling Models.
LING/C SC/PSYC 438/538 Lecture 24 Sandiway Fong 1.
ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition Objectives: Bayes Rule Mutual Information Conditional.
Linear Discriminant Functions Chapter 5 (Duda et al.) CS479/679 Pattern Recognition Dr. George Bebis.
Bayes Risk Minimization using Metric Loss Functions R. Schlüter, T. Scharrenbach, V. Steinbiss, H. Ney Present by Fang-Hui, Chu.
Hidden Markov Models BMI/CS 576
CSC 594 Topics in AI – Natural Language Processing
Lecture Slides Elementary Statistics Twelfth Edition
Syntax Analysis Chapter 4.
LECTURE 15: HMMS – EVALUATION AND DECODING
Lecture 2 Introduction to Programming
Do-Gil Lee1*, Ilhwan Kim1 and Seok Kee Lee2
Learning Sequence Motif Models Using Expectation Maximization (EM)
CSC 594 Topics in AI – Natural Language Processing
Lecture Slides Elementary Statistics Twelfth Edition
Statistical NLP: Lecture 9
Hidden Markov Models Part 2: Algorithms
Discrete Event Simulation - 4
Where did we stop? The Bayes decision rule guarantees an optimal classification… … But it requires the knowledge of P(ci|x) (or p(x|ci) and P(ci)) We.
Honors Statistics From Randomness to Probability
CS621/CS449 Artificial Intelligence Lecture Notes
LECTURE 14: HMMS – EVALUATION AND DECODING
CONTEXT DEPENDENT CLASSIFICATION
CPSC 503 Computational Linguistics
LECTURE 15: REESTIMATION, EM AND MIXTURES
Machine Learning: Lecture 6
CS621/CS449 Artificial Intelligence Lecture Notes
CSCI 5582 Artificial Intelligence
Statistical NLP : Lecture 9 Word Sense Disambiguation
Presentation transcript:

CSA3180: Natural Language Processing Statistics 3 – Spelling Models Typing Errors Error Models Spellchecking Noisy Channel Methods Probabilistic Methods Bayesian Methods November 2005 CSA3180: Statistics III

Introduction Slides based on Lectures by Mike Rosner (2003/2004) Chapter 5 of Jurafsky & Martin, Speech and Language Processing http://www.cs.colorado.edu/~martin/slp.html November 2005 CSA3180: Statistics III

Speech Recognition and Text Correction Commonalities: Both concerned with problem of accepting a string of symbols and mapping them to a sequence of progressively less likely words Spelling correction: characters Speech recognition: phones November 2005 CSA3180: Statistics III

Noisy Channel Model Guess at original word Original Word Noisy Channel Source Noisy Channel Receiver Noisy Word Language generated and passed through a noisy channel Resulting noisy data received Goal is to recover original data from noisy data Same model can be used in diverse areas of language processing e.g. spelling correction, morphological analysis, pronunciation modelling, machine translation Metaphor comes from Jelinek’s (1976) work on speech recognition, but algorithm is a special case of Bayesian inference (1763) November 2005 CSA3180: Statistics III

Spelling Correction Kukich(1992) breaks the field down into three increasingly broader problems: Detection of non-words (e.g. graffe) Isolated word error correction (e.g. graffe  giraffe) Context dependent error detection and correction where the error may result in a valid word (e.g. there  three) November 2005 CSA3180: Statistics III

Spelling Error Patterns According to Damereau (1964) 80% of all misspelled words are caused by single-error misspellings which fall into the following categories (for the word the): Insertion (ther) Deletion (th) Substitution (thw) Transposition (teh) Because of this study, much subsequent research focused on the correction of single error misspellings November 2005 CSA3180: Statistics III

Causes of Spelling Errors Keyboard Based 83% novice and 51% overall were keyboard related errors Immediately adjacent keys in the same row of the keyboard (50% of the novice substitutions, 31% of all substitutions) Cognitive Phonetic seperate – separate Homonym there – their November 2005 CSA3180: Statistics III

Bayesian Classification Given an observation, determine which set of classes it belongs to Spelling Correction: Observation: String of characters Classification: Word that was intended Speech Recognition: Observation: String of phones Classification: Word that was said November 2005 CSA3180: Statistics III

Bayesian Classification Given string O = (o1, o2, …, on) of observations Bayesian interpretation starts by considering all possible classes, i.e. set of all possible words Out of this universe, we want to choose that word w in the vocabulary V which is most probable given the observation that we have O, i.e.: ŵ = argmax w  V P(w | O) where argmax x f(x) means “the x such that f(x) is maximised” Problem: Whilst this is guaranteed to give us the optimal word, it is not obvious how to make the equation operational: for a given word w and a given O we don’t know how to compute P(w | O) November 2005 CSA3180: Statistics III

Bayes’ Rule Intuition behind Bayesian classification is use Bayes’ rule to transform P(w | O) into a product of two probabilities, each of which is easier to compute than P(w | O) November 2005 CSA3180: Statistics III

Bayes’ Rule Bayes’ Rule we obtain P(w) can be estimated from word frequency P(O | w) can also be easily estimated P(O), the probability of the observation sequence, is harder to estimate, but we can ignore it November 2005 CSA3180: Statistics III

PDF Slides See PDF Slides on spell.pdf Pages 11-20 November 2005 CSA3180: Statistics III

Confusion Set The confusion set of a word w includes w along with all words in the dictionary D such that O can be derived from w by a single application of one of the four edit operations: Add a single letter. Delete a single letter. Replace one letter with another. Transpose two adjacent letters. November 2005 CSA3180: Statistics III

Error Model 1 Mayes, Damerau et al. 1991 Let C be the number of words in the confusion set of w. The error model, for all s in the confusion set of d, is: P(O|w) = α if O=w, (1- α)/(C-1) otherwise α is the prior probability of a given typed word being correct. Key Idea: The remaining probability mass is distributed evenly among all other words in the confusion set. November 2005 CSA3180: Statistics III

Error Model 2: Church & Gale 1991 Church & Gale (1991) propose a more sophisticated error model based on same confusion set (one edit operation away from w). Two improvements: Unequal weightings attached to different editing operations. Insertion and deletion probabilities are conditioned on context. The probability of inserting or deleting a character is conditioned on the letter appearing immediately to the left of that character. November 2005 CSA3180: Statistics III

Obtaining Error Probabilities The error probabilities are derived by first assuming all edits are equiprobable. They use as a training corpus a set of space-delimited strings that were found in a large collection of text, and that (a) do not appear in their dictionary and (b) are no more than one edit away from a word that does appear in the dictionary. They iteratively run the spell checker over the training corpus to find corrections, then use these corrections to update the edit probabilities. November 2005 CSA3180: Statistics III

Error Model 3 Brill and Moore (2000) Let Σ be an alphabet Model allows all operations of the form α  β, where α,β in Σ*. P(α  β) is the probability that when users intends to type the string α they type β instead. N.B. model considers substitutions of arbitrary substrings not just single characters. November 2005 CSA3180: Statistics III

Model 3 Brill and Moore (2000) Model also tries to account for the fact that in general, positional information is a powerful conditioning feature, e.g. p(entler|antler) < p(reluctent|reluctant) i.e. Probability is partially conditioned by the position in the string in which the edit occurs. artifact/artefact; correspondance/correspondence November 2005 CSA3180: Statistics III

Three Stage Model Person picks a word. physical Person picks a partition of characters within word. ph y s i c al Person types each partition, perhaps erroneously. f i s i k le p(fisikle|physical) = p(f|ph) * p(i|y) * p(s|s) * p(i|i) * p(k|c) * p(le|al) November 2005 CSA3180: Statistics III

Formal Presentation Let Part(w) be the set of all possible ways to partition string w into substrings. For particular R in Part(w) containing j continuous segments, let Ri be the ith segment. Then P(s|w) = November 2005 CSA3180: Statistics III

Simplification this simplifies to P(s | w) = max R P(R|w) P(Ti|Ri) By considering only the best partitioning of s and w this simplifies to P(s | w) = max R P(R|w) P(Ti|Ri) November 2005 CSA3180: Statistics III

Training the Model To train model, need a series of (s,w) word pairs. begin by aligning the letters in (si,wi) based on MED. For instance, given the training pair (akgsual, actual), this could be aligned as: a c t u a l a k g s u a l November 2005 CSA3180: Statistics III

Training the Model This corresponds to the sequence of editing operations aa ck εg ts uu aa ll To allow for richer contextual information, each nonmatch substitution is expanded to incorporate up to N additional adjacent edits. For example, for the first nonmatch edit in the example above, with N=2, we would generate the following substitutions: November 2005 CSA3180: Statistics III

Training the Model a c t u a l a k g s u a l c  k ac  ak c  kg ac  akg ct  kgs We would do similarly for the other nonmatch edits, and give each of these substitutions a fractional count. November 2005 CSA3180: Statistics III

Training the Model We can then calculate the probability of each substitution α  β as count(α  β)/count(α). count(α  β) is simply the sum of the counts derived from our training data as explained above Estimating count(α) is harder, since we are not training from a text corpus, but from a a set of (s,w) tuples (without an associated corpus) November 2005 CSA3180: Statistics III

Training the Model From a large collection of representative text, count the number of occurrences of α. Adjust the count based on an estimate of the rate with which people make typing errors. November 2005 CSA3180: Statistics III