Probabilistic Pronunciation + N-gram Models CSPP 56553 Artificial Intelligence February 25, 2004.

The ASR Pronunciation Problem
Given a series of phones, what is the most probable word?
–Simplification: assume the phone sequence and word boundaries are known
–Approach: noisy channel model
 The surface form is an instance of a lexical form that has passed through a noisy communication channel
 Model the channel to remove the noise and recover the original form

Bayesian Model
Pr(w|O) = Pr(O|w)Pr(w)/Pr(O)
Goal: the most probable word
–The observations O are held constant
–So find the w that maximizes Pr(O|w)*Pr(w)
Where do we get the likelihoods Pr(O|w)?
–Probabilistic rules (Labov): add probabilities to pronunciation-variation rules
–Or count over a large corpus of surface forms with respect to the lexicon
Where do we get Pr(w)?
–Similarly: count over the words in a large corpus
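As a concrete illustration of this decision rule, here is a minimal Python sketch that picks the word maximizing Pr(O|w)*Pr(w); the candidate words, likelihoods, and priors are invented for illustration and are not from the lecture.

    # Noisy-channel sketch: choose the word w maximizing Pr(O|w) * Pr(w).
    # All probabilities below are invented for illustration.

    # Pr(O|w): likelihood of the observed phone string given each candidate word
    likelihood = {
        "about":  0.30,   # e.g. Pr([ax b aw t] | "about")
        "abut":   0.05,
        "a bout": 0.20,
    }

    # Pr(w): unigram prior estimated from a corpus
    prior = {
        "about":  0.001,
        "abut":   0.00001,
        "a bout": 0.00005,
    }

    def best_word(candidates):
        """Return the candidate maximizing Pr(O|w) * Pr(w); Pr(O) is constant."""
        return max(candidates, key=lambda w: likelihood[w] * prior[w])

    print(best_word(likelihood))   # -> "about"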

Weighted Automata
Associate a weight (probability) with each arc
–Determine weights by decision-tree compilation or by counting over a large corpus
[Figure: weighted pronunciation automaton for "about", with states start, ax/ix, b, aw, ae/dx/t, end; arc weights computed from the Switchboard corpus]
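Such an automaton can be represented simply as a table of arc probabilities. The sketch below uses invented weights, not the actual Switchboard-derived values, and omits some of the arcs in the figure.

    # Sketch of a weighted pronunciation automaton for "about".
    # Arc probabilities are invented for illustration (the real ones come
    # from counts over the Switchboard corpus).
    arcs = {
        "start": {"ax": 0.68, "ix": 0.32},  # reduced first vowel: ax or ix
        "ax":    {"b": 1.0},
        "ix":    {"b": 1.0},
        "b":     {"aw": 1.0},
        "aw":    {"t": 0.88, "dx": 0.12},   # final stop may surface as a flap
        "t":     {"end": 1.0},
        "dx":    {"end": 1.0},
    }

    def path_probability(path):
        """Probability of one pronunciation = product of its arc weights."""
        p = 1.0
        for prev, nxt in zip(path, path[1:]):
            p *= arcs[prev][nxt]
        return p

    print(path_probability(["start", "ax", "b", "aw", "t", "end"]))  # ~0.60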

Forward Computation
For a weighted automaton and a phone sequence, what is its likelihood?
–Automaton: a tuple
 A set of states Q: q0,…,qn
 A set of transition probabilities aij, where aij is the probability of transitioning from state i to state j
 Special start and end states
–Input: an observation sequence O = o1,o2,…,ok
–Computed as: forward[t,j] = P(o1,o2,…,ot, qt=j | λ) = Σi forward[t-1,i]*aij*bj(ot)
–Sums over all paths into state j at time t
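Here is a minimal Python sketch of the forward recursion for a small HMM, assuming the usual representation of transition probabilities a[i][j] and emission probabilities b[j][o]; the toy model and its numbers are invented for illustration.

    # Forward algorithm sketch: likelihood of an observation sequence
    # under a toy HMM with transitions a[i][j] and emissions b[j][o].
    states = ["q1", "q2"]
    start  = {"q1": 0.8, "q2": 0.2}                 # initial probabilities
    a = {"q1": {"q1": 0.6, "q2": 0.4},              # transition probabilities
         "q2": {"q1": 0.3, "q2": 0.7}}
    b = {"q1": {"x": 0.7, "y": 0.3},                # emission probabilities
         "q2": {"x": 0.1, "y": 0.9}}

    def forward(obs):
        """Return P(obs | model) by summing over all state paths."""
        # Initialization: forward[0][j] = start[j] * b[j][o1]
        f = [{j: start[j] * b[j][obs[0]] for j in states}]
        # Recursion: forward[t][j] = sum_i forward[t-1][i] * a[i][j] * b[j][ot]
        for o in obs[1:]:
            f.append({j: sum(f[-1][i] * a[i][j] for i in states) * b[j][o]
                      for j in states})
        # Termination: sum over the final column
        return sum(f[-1].values())

    print(forward(["x", "y", "y"]))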

Viterbi Decoding
Given an observation sequence O and a weighted automaton, what is the most likely state sequence?
–Used to identify words by merging multiple word-pronunciation automata in parallel
–Comparable to the forward computation: replace the sum with a max
–Dynamic programming approach: store the best score through each state/time pair

Viterbi Algorithm
function Viterbi(observations of length T, state-graph) returns best-path
  num-states <- num-of-states(state-graph)
  create a path probability matrix viterbi[num-states+2, T+2]
  viterbi[0,0] <- 1.0
  for each time step t from 0 to T do
    for each state s from 0 to num-states do
      for each transition s' from s in state-graph
        new-score <- viterbi[s,t] * a[s,s'] * b_s'(o_t)
        if ((viterbi[s',t+1] = 0) || (viterbi[s',t+1] < new-score)) then
          viterbi[s',t+1] <- new-score
          back-pointer[s',t+1] <- s
  backtrace from the highest-probability state in the final column of viterbi[] and return that path
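A runnable Python version of this pseudocode, using the same toy HMM layout as the forward-algorithm sketch above; the model parameters are again invented for illustration.

    # Viterbi sketch: most likely state path for a toy HMM.
    # All numbers are invented for illustration.
    states = ["q1", "q2"]
    start  = {"q1": 0.8, "q2": 0.2}
    a = {"q1": {"q1": 0.6, "q2": 0.4}, "q2": {"q1": 0.3, "q2": 0.7}}
    b = {"q1": {"x": 0.7, "y": 0.3}, "q2": {"x": 0.1, "y": 0.9}}

    def viterbi(obs):
        """Return (best path probability, best state sequence)."""
        # Initialization
        v = [{j: start[j] * b[j][obs[0]] for j in states}]
        back = [{}]
        # Recursion: like the forward algorithm, but max instead of sum
        for o in obs[1:]:
            scores, ptrs = {}, {}
            for j in states:
                prev, score = max(((i, v[-1][i] * a[i][j]) for i in states),
                                  key=lambda pair: pair[1])
                scores[j] = score * b[j][o]
                ptrs[j] = prev
            v.append(scores)
            back.append(ptrs)
        # Termination and backtrace
        last = max(states, key=lambda j: v[-1][j])
        path = [last]
        for ptrs in reversed(back[1:]):
            path.append(ptrs[path[-1]])
        return v[-1][last], list(reversed(path))

    print(viterbi(["x", "y", "y"]))   # -> (0.127..., ['q1', 'q2', 'q2'])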

Segmentation
Breaking a sequence into chunks
–Sentence segmentation: break long sequences into sentences
–Word segmentation: break character/phonetic sequences into words
 Chinese: typically written without whitespace
  »Pronunciation is affected by the units chosen
 Language acquisition
  »How does a child learn language from a stream of phones?

Models of Segmentation
Many approaches:
–Rule-based: heuristic longest match
–Probabilistic (a simple dynamic-programming sketch follows below):
 Each word is associated with its probability
 Find the segmentation with the highest probability
 Typically computed as log probabilities and summed
–Implementation: weighted FST cascade
 Each word = its characters + a probability
 Self-loop on the dictionary
 Compose the input with dict*
 Compute the most likely path
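The following minimal Python sketch implements the probabilistic idea with dynamic programming over log probabilities rather than an actual weighted FST cascade; the tiny dictionary and its probabilities are invented for illustration.

    import math

    # Toy word probabilities, invented for illustration.
    word_prob = {"the": 0.05, "table": 0.002, "tab": 0.001, "le": 0.0005, "a": 0.03}

    def segment(text):
        """Return the segmentation of text with the highest total log probability."""
        n = len(text)
        best = [(-math.inf, None)] * (n + 1)   # best[i] = (score, split point) for text[:i]
        best[0] = (0.0, None)
        for i in range(1, n + 1):
            for j in range(max(0, i - 10), i):           # consider words up to 10 chars
                w = text[j:i]
                if w in word_prob and best[j][0] > -math.inf:
                    score = best[j][0] + math.log(word_prob[w])
                    if score > best[i][0]:
                        best[i] = (score, j)
        # Backtrace the chosen split points (assumes text can be segmented).
        words, i = [], n
        while i > 0:
            j = best[i][1]
            words.append(text[j:i])
            i = j
        return list(reversed(words))

    print(segment("thetable"))   # -> ['the', 'table'], not ['the', 'tab', 'le']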

N-grams
Perspective:
–Some sequences (of words or characters) are more likely than others
–Given a sequence, we can guess the most likely next item
Used in:
–Speech recognition
–Spelling correction
–Augmentative communication
–Other NL applications

Corpus Counts
Estimate probabilities by counting over large collections of text/speech
Issues:
–Wordforms (surface) vs. lemmas (root)
–Case? Punctuation? Disfluencies?
–Types (distinct words) vs. tokens (total occurrences)
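A small Python illustration of counting and of the type/token distinction; the tiny "corpus" is invented, where a real system would count over millions of words.

    from collections import Counter

    corpus = "the rabbit saw the dog and the dog saw the rabbit".split()

    counts = Counter(corpus)
    print("tokens:", len(corpus))                    # total occurrences -> 11
    print("types:", len(counts))                     # distinct words -> 5
    print("P(the) ~", counts["the"] / len(corpus))   # relative-frequency estimate ~0.36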

Basic N-grams
Most trivial: 1/#tokens. Too simple!
Standard unigram: frequency
–# of word occurrences / total corpus size
–E.g. the ≈ 0.07, while rabbit is orders of magnitude rarer
–Still too simple: no context!
Better: conditional probabilities of word sequences

Markov Assumptions
Exact computation requires too much data
Approximate the probability of a word given all prior words
–Assume a finite history
–Bigram: probability of a word given 1 previous word (first-order Markov)
–Trigram: probability of a word given 2 previous words
N-gram approximation: P(wn | w1…wn-1) ≈ P(wn | wn-N+1…wn-1)
Bigram sequence: P(w1…wn) ≈ Πk P(wk | wk-1)
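A minimal sketch of estimating bigram probabilities by relative frequency (the count of the bigram divided by the count of its one-word prefix), again over an invented toy corpus.

    from collections import Counter

    # Toy corpus with sentence-start markers <s>, invented for illustration.
    sents = [["<s>", "the", "dog", "saw", "the", "rabbit"],
             ["<s>", "the", "rabbit", "saw", "the", "dog"]]

    unigrams = Counter(w for s in sents for w in s)
    bigrams  = Counter((w1, w2) for s in sents for w1, w2 in zip(s, s[1:]))

    def p_bigram(w2, w1):
        """P(w2 | w1) = count(w1 w2) / count(w1)."""
        return bigrams[(w1, w2)] / unigrams[w1]

    def p_sentence(words):
        """Bigram approximation: P(w1..wn) ~ product of P(wk | wk-1)."""
        p = 1.0
        for w1, w2 in zip(words, words[1:]):
            p *= p_bigram(w2, w1)
        return p

    print(p_bigram("dog", "the"))                                     # 2/4 = 0.5
    print(p_sentence(["<s>", "the", "dog", "saw", "the", "rabbit"]))  # 0.125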

Issues
Relative frequency
–Typically compute the count of the sequence and divide by the count of its prefix
Corpus sensitivity
–Shakespeare vs. the Wall Street Journal give very different models
Generated n-grams can look very unnatural
–Unigrams: little structure; bigrams: collocations; trigrams: phrase-like

Evaluating N-gram Models
Entropy & perplexity
–Information-theoretic measures
–Measure the information in a grammar, or its fit to data
–Conceptually, a lower bound on the number of bits needed to encode
Entropy H(X), where X is a random variable and p its probability function: H(X) = -Σx p(x) log2 p(x)
–E.g. 8 equally likely things: number them as a code => 3 bits per transmission
–Alternatively, use short codes for high-probability items and longer codes for low-probability ones; this can reduce the average
Perplexity: 2^H(X)
–A weighted average of the number of choices
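A small Python sketch of these definitions, computing the entropy of a distribution and the corresponding perplexity; the skewed example distribution is invented.

    import math

    def entropy(p):
        """H(X) = -sum_x p(x) * log2 p(x)."""
        return -sum(px * math.log2(px) for px in p.values() if px > 0)

    def perplexity(p):
        """Perplexity = 2 ** H(X): a weighted average number of choices."""
        return 2 ** entropy(p)

    # 8 equally likely outcomes: entropy = 3 bits, perplexity = 8.
    uniform8 = {i: 1 / 8 for i in range(8)}
    print(entropy(uniform8), perplexity(uniform8))   # 3.0 8.0

    # A skewed distribution has lower entropy: short codes for frequent
    # items pay off.
    skewed = {"a": 0.7, "b": 0.2, "c": 0.05, "d": 0.05}
    print(entropy(skewed), perplexity(skewed))       # about 1.26 bits, perplexity about 2.4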

Entropy of a Sequence
Basic sequence entropy: the per-word entropy of a finite sequence
Entropy of a language: take the limit over infinite lengths
–Assume the language is stationary and ergodic
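The formulas on this slide appeared as figures in the original; the standard formulations (as in Jurafsky & Martin), written in the plain notation used elsewhere in these slides, are:

    Per-word entropy of sequences of length n:
      (1/n) H(w1,w2,…,wn) = -(1/n) Σ p(w1…wn) log2 p(w1…wn)   (summing over all length-n sequences in L)

    Entropy rate of the language L:
      H(L) = lim n→∞ (1/n) H(w1,w2,…,wn)

    For a stationary, ergodic process (Shannon–McMillan–Breiman):
      H(L) = lim n→∞ -(1/n) log2 p(w1…wn)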

Cross-Entropy
Comparing models
–The actual distribution p is unknown
–Use a simplified model m to estimate it
–The closer the model's match to p, the lower the cross-entropy
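The cross-entropy formula itself was also a figure on the original slide; the standard formulation, consistent with the entropy-rate definition above, is:

    Cross-entropy of a model m on data from the true distribution p:
      H(p, m) = lim n→∞ -(1/n) Σ p(w1…wn) log2 m(w1…wn)

    For a stationary, ergodic process it can be estimated from a single long sample:
      H(p, m) ≈ -(1/n) log2 m(w1…wn)

    H(p) ≤ H(p, m): the better the model, the closer its cross-entropy comes to the true entropy.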