Language Modeling
"Anytime a linguist leaves the group the recognition rate goes up." (Fred Jelinek)


Word Prediction in Application Domains
Guessing the next word/letter:
- Once upon a time there was ...
- C'era una volta ... ("Once upon a time ...")
Domains: speech modeling, augmentative communication systems (for disabled persons), T9 text entry.

Word Prediction for Spelling Correction
Example sentences containing real-word, context-dependent errors (the errors are intentional):
- Andranno a trovarlo alla sua cassa domani. ("cassa" should be "casa")
- Se andrei al mare sarei abbronzato. ("Se andrei" should be "Se andassi")
- Vado a spiaggia. (should be "in spiaggia")
- Hopefully, all with continue smoothly in my absence. ("with" should be "will")
- Can they lave him my message? ("lave" should be "leave")
- I need to notified the bank of this problem. ("to notified" should be "to notify")

Probabilities
- P(D): prior probability that the training data D will be observed.
- P(h): prior probability of hypothesis h; may include any prior knowledge that h is the correct hypothesis.
- P(D|h): probability of observing data D in a world where hypothesis h holds.
- P(h|D): probability that h holds given the data D, i.e. the posterior probability of h; it reflects our confidence that h holds after we have seen the data D.

The Bayes Rule (Theorem)
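The formula itself was a slide graphic that did not survive the transcript; in the notation of the previous slide, Bayes' theorem is:

P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}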

Maximum A Posteriori (MAP) Hypothesis and Maximum Likelihood
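The defining formulas (presumably what the slide showed) are the standard ones:

h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\, P(h)
h_{ML} = \arg\max_{h \in H} P(D \mid h)

so the ML hypothesis is the MAP hypothesis under a uniform prior over H.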

Bayes Optimal Classifier
Motivation: suppose 3 hypotheses with posterior probabilities 0.4, 0.3 and 0.3; the first one is the MAP hypothesis.
BUT (a problem): suppose a new instance is classified positive by the first hypothesis but negative by the other two. Then the probability that the new instance is positive is 0.4, as opposed to 0.6 for the negative classification, yet the MAP hypothesis is the 0.4 one!
Solution: the most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.

Bayes Optimal Classifier: classification of a new instance.
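The classification rule was a slide graphic; the standard Bayes optimal classification is:

v_{BO} = \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\, P(h_i \mid D)

In the motivating example this gives 0.4 for the positive class and 0.3 + 0.3 = 0.6 for the negative class, so the Bayes optimal classification is negative.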

Naïve Bayes Classifier: a naïve (conditional-independence) version of the Bayes Optimal Classifier.
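The formula was lost with the slide graphics; the standard naïve Bayes rule for an instance with attribute values a_1, ..., a_n is:

v_{NB} = \arg\max_{v_j \in V} P(v_j) \prod_{i} P(a_i \mid v_j)

i.e. the MAP rule under the naïve assumption that the attributes are conditionally independent given the class.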

m-estimate of probability
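The formula itself was not in the transcript; the usual m-estimate of a conditional probability is:

P(a_i \mid v_j) \approx \frac{n_c + m\,p}{n + m}

where n is the number of training examples of class v_j, n_c the number of those that also have attribute value a_i, p a prior estimate of the probability (e.g. uniform), and m the equivalent sample size that controls how heavily the prior is weighted.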

Tagging P (tag = Noun | word = saw) = ?
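One standard way to answer, from a tagged corpus (not necessarily the answer given on the following slides), is the relative-frequency estimate:

P(\text{Noun} \mid \text{saw}) \approx \frac{C(\text{saw tagged as Noun})}{C(\text{saw})}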

Language Model: use a corpus to estimate these probabilities.

N-gram Model: the N-th word is predicted from the previous N-1 words. What is a word? A token, word form, lemma, m-tag (morphological tag), ...

N-gram approximation models
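The approximation formulas were slide graphics; by the chain rule and the N-gram (Markov) assumption:

P(w_1 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1} \dots w_{i-1})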

Bi-gram and tri-gram models: N = 2 (bi-gram) and N = 3 (tri-gram).
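Concretely, the conditioning histories are:

N = 2 (bigram): P(w_i \mid w_{i-1})
N = 3 (trigram): P(w_i \mid w_{i-2}, w_{i-1})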

Counting n-grams
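The counts give the maximum likelihood estimates; e.g. for a bigram model:

P(w_i \mid w_{i-1}) = \frac{C(w_{i-1} w_i)}{C(w_{i-1})}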

The Language Model Allows Us to Calculate Sentence Probabilities
P(Today is a beautiful day .) = P(Today | <s>) * P(is | Today) * P(a | is) * P(beautiful | a) * P(day | beautiful) * P(. | day) * P(</s> | .)
(<s> and </s> mark the sentence boundaries.) Work in log space!
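A minimal sketch of such a bigram model: it counts n-grams in a toy corpus and scores a sentence in log space. The corpus, the <s>/</s> markers, and the add-one smoothing used to avoid zero probabilities are illustrative assumptions, not taken from the slides.

```python
from collections import Counter
from math import log

# Toy corpus; each sentence is padded with <s> and </s> boundary markers.
corpus = [
    "<s> today is a beautiful day . </s>",
    "<s> today is a good day . </s>",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

vocab_size = len(unigrams)

def bigram_logprob(prev, word):
    # Add-one (Laplace) smoothed estimate: (C(prev word) + 1) / (C(prev) + V)
    return log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))

def sentence_logprob(sentence):
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    # Work in log space: a sum of log-probabilities instead of a product of tiny numbers.
    return sum(bigram_logprob(p, w) for p, w in zip(tokens, tokens[1:]))

print(sentence_logprob("today is a beautiful day ."))
```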

Unseen N-grams and Smoothing
- Discounting (several types)
- Backoff
- Deleted interpolation
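For reference (the formula on the slide did not survive), the deleted-interpolation option combines the higher- and lower-order estimates linearly:

\hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_3 P(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_1 P(w_i) + \lambda_0 \frac{1}{|V|}, \quad \sum_j \lambda_j = 1

where the weights λ_j are estimated on held-out data, e.g. with the EM procedure discussed below.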

Searching for the Best Tagging
A lattice with one column per word and one row per candidate tag (t_j_i = j-th candidate tag of word i); different words have different numbers of candidate tags:

W_1    W_2    W_3    W_4    W_5    W_6    W_7    W_8
t_1_1  t_1_2  t_1_3  t_1_4  t_1_5  t_1_6  t_1_7  t_1_8
t_2_1  t_2_2  t_2_3         t_2_5                t_2_8
t_3_1         t_3_3
t_4_1

Use Viterbi search to find the best path through the lattice.
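A minimal sketch of Viterbi search over such a lattice, assuming a bigram tag model with transition probabilities P(t_i | t_{i-1}) and emission probabilities P(w_i | t_i); the toy probability tables at the bottom are made up for illustration, not from the slides.

```python
from math import log

def viterbi(words, tagset, trans, emit, start="<s>"):
    """Find the best tag sequence for `words` under a bigram HMM.

    trans[(prev_tag, tag)] and emit[(tag, word)] are probabilities;
    missing entries are treated as a small floor value.
    """
    floor = 1e-8
    lp = lambda p: log(p if p > 0 else floor)

    # delta[tag] = best log-score of a path ending in `tag`; pointers record predecessors.
    delta = {t: lp(trans.get((start, t), 0)) + lp(emit.get((t, words[0]), 0)) for t in tagset}
    back = []
    for w in words[1:]:
        new_delta, pointers = {}, {}
        for t in tagset:
            best_prev = max(tagset, key=lambda p: delta[p] + lp(trans.get((p, t), 0)))
            new_delta[t] = delta[best_prev] + lp(trans.get((best_prev, t), 0)) + lp(emit.get((t, w), 0))
            pointers[t] = best_prev
        delta, back = new_delta, back + [pointers]

    # Follow back-pointers from the best final tag.
    tag = max(delta, key=delta.get)
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return list(reversed(path))

# Toy example (hypothetical probabilities):
tagset = {"N", "V"}
trans = {("<s>", "N"): 0.7, ("<s>", "V"): 0.3, ("N", "V"): 0.6, ("N", "N"): 0.4, ("V", "N"): 0.8, ("V", "V"): 0.2}
emit = {("N", "saw"): 0.2, ("V", "saw"): 0.5, ("N", "she"): 0.6, ("V", "she"): 0.01}
print(viterbi(["she", "saw"], tagset, trans, emit))
```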

Cross Entropy: the entropy from the point of view of a user who has misinterpreted the source distribution to be q rather than p. [Cross entropy is an upper bound on the entropy.]
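The definition did not make it into the transcript; for the true distribution p and the model distribution q:

H(p, q) = -\sum_x p(x) \log q(x) \;\ge\; H(p) = -\sum_x p(x) \log p(x)

with equality exactly when q = p, which is why cross entropy upper-bounds the entropy.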

Cross Entropy as a Quality Measure: two models give two upper bounds on the entropy; the more accurate model is the one with the lower cross entropy.

Imagine that y was generated by either model A or model B. Then:

Continued: proof of convergence of the EM algorithm.

Expectation-Maximization (EM) Algorithm
Consider a problem in which the data D is a set of instances generated by a probability distribution that is a mixture of k distinct Normal distributions (assumed to have the same variance). The hypothesis is therefore defined by the vector of the means of the distributions.

Expectation-Maximization Algorithm
Step 1 (E-step): Calculate the expected value of each hidden variable (which of the k distributions generated each instance), assuming that the current hypothesis holds.
Step 2 (M-step): Calculate a new maximum likelihood hypothesis, treating these expected values as if they were the observed true values, and make the new hypothesis the current one.
Step 3: Go to Step 1 (repeat until convergence).
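A minimal sketch of this EM procedure for a 1-D mixture of k equal-weight Gaussians with a shared, known variance; the data, k, sigma, and the fixed iteration count are illustrative assumptions.

```python
import random
from math import exp

def em_gaussian_means(data, k, sigma=1.0, iters=50):
    """EM for the means of a mixture of k equal-variance, equal-weight Gaussians."""
    means = random.sample(data, k)          # initial guess for the k means
    for _ in range(iters):
        # E-step: expected membership E[z_ij] of instance i in component j.
        # (The common normalizing constant of the Gaussian density cancels out.)
        resp = []
        for x in data:
            w = [exp(-(x - m) ** 2 / (2 * sigma ** 2)) for m in means]
            total = sum(w)
            resp.append([wj / total for wj in w])
        # M-step: new ML estimate of each mean, using the expected memberships as weights.
        means = [
            sum(r[j] * x for r, x in zip(resp, data)) / sum(r[j] for r in resp)
            for j in range(k)
        ]
    return means

# Toy data drawn around two true means (hypothetical values).
data = [random.gauss(0.0, 1.0) for _ in range(200)] + [random.gauss(5.0, 1.0) for _ in range(200)]
print(em_gaussian_means(data, k=2))
```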

If we can find a λ′ that increases the objective A, the model improves; so we need to maximize A with respect to λ′, under the constraint that all the λs sum up to one → use Lagrange multipliers.

The EM Algorithm: can be generalized analogously to more λs.
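A minimal sketch of this EM procedure for estimating the interpolation weights λ of a linearly interpolated model on held-out data; the two component models and the held-out numbers below are illustrative assumptions, not from the slides.

```python
def em_lambdas(component_probs, iters=100):
    """Estimate interpolation weights for a linear mixture of models.

    component_probs[i] holds the probability each component model assigns
    to the i-th held-out event.  Returns lambdas summing to 1.
    """
    k = len(component_probs[0])
    lambdas = [1.0 / k] * k                       # start from uniform weights
    for _ in range(iters):
        # E-step: expected count of "component j was responsible" over the held-out data.
        counts = [0.0] * k
        for probs in component_probs:
            mix = sum(l * p for l, p in zip(lambdas, probs))
            for j in range(k):
                counts[j] += lambdas[j] * probs[j] / mix
        # M-step: new lambdas are the normalized expected counts.
        total = sum(counts)
        lambdas = [c / total for c in counts]
    return lambdas

# Hypothetical held-out data: for each token, the probability given by
# a bigram model and by a unigram model (numbers are made up).
heldout = [(0.20, 0.05), (0.01, 0.04), (0.30, 0.10), (0.00, 0.02)]
print(em_lambdas(heldout))
```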

Measuring Success Rates
- Recall = (# correct answers) / (# total possible answers)
- Precision = (# correct answers) / (# answers given)
- Fallout = (# incorrect answers) / (# of spurious facts in the text)
- F-measure = [(b^2 + 1) * P * R] / (b^2 * P + R); if b > 1, R is favored, and if b < 1, P is favored.
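A small worked example under these definitions (the numbers are hypothetical): a system returns 10 answers, 8 of them correct, out of 16 possible answers. Then P = 8/10 = 0.8, R = 8/16 = 0.5, and the balanced F-measure (b = 1) is F = 2PR / (P + R) = 0.8 / 1.3 ≈ 0.62.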

Chunking as Tagging: even certain parsing problems can be solved via tagging. E.g. the bracketing ((A B) C ((D F) G)) can be encoded with BIA tags: A/B B/A C/I D/B F/A G/A