Speech Recognition  Observe: Acoustic signal (A=a 1,…,a n )  Challenge: Find the likely word sequence  But we also have to consider the context Starting.

Slides:



Advertisements
Similar presentations
Debugging ACL Scripts.
Advertisements

Chapter 6: Statistical Inference: n-gram Models over Sparse Data
Information Retrieval and Organisation Chapter 12 Language Models for Information Retrieval Dell Zhang Birkbeck, University of London.
1 CS 388: Natural Language Processing: N-Gram Language Models Raymond J. Mooney University of Texas at Austin.
Spelling Correction for Search Engine Queries Bruno Martins, Mario J. Silva In Proceedings of EsTAL-04, España for Natural Language Processing Presenter:
LING 438/538 Computational Linguistics Sandiway Fong Lecture 17: 10/25.
Language Models Naama Kraus (Modified by Amit Gross) Slides are based on Introduction to Information Retrieval Book by Manning, Raghavan and Schütze.
N-Grams and Corpus Linguistics 6 July Linguistics vs. Engineering “But it must be recognized that the notion of “probability of a sentence” is an.
N-gram model limitations Important question was asked in class: what do we do about N-grams which were not in our training corpus? Answer given: we distribute.
Speech Recognition Part 3 Back end processing. Speech recognition simplified block diagram Speech Capture Speech Capture Feature Extraction Feature Extraction.
Albert Gatt Corpora and Statistical Methods – Lecture 7.
SI485i : NLP Set 4 Smoothing Language Models Fall 2012 : Chambers.
Programming Types of Testing.
Programming in Visual Basic
CSC 9010: Special Topics, Natural Language Processing. Spring, Matuszek & Papalaskari 1 N-Grams CSC 9010: Special Topics. Natural Language Processing.
NATURAL LANGUAGE PROCESSING. Applications  Classification ( spam )  Clustering ( news stories, twitter )  Input correction ( spell checking )  Sentiment.
Hidden Markov Models Bonnie Dorr Christof Monz CMSC 723: Introduction to Computational Linguistics Lecture 5 October 6, 2004.
Tirgul 10 Rehearsal about Universal Hashing Solving two problems from theoretical exercises: –T2 q. 1 –T3 q. 2.
CS 4705 Lecture 13 Corpus Linguistics I. From Knowledge-Based to Corpus-Based Linguistics A Paradigm Shift begins in the 1980s –Seeds planted in the 1950s.
N-Gram Language Models CMSC 723: Computational Linguistics I ― Session #9 Jimmy Lin The iSchool University of Maryland Wednesday, October 28, 2009.
Ngram models and the Sparsity problem John Goldsmith November 2002.
Smoothing Bonnie Dorr Christof Monz CMSC 723: Introduction to Computational Linguistics Lecture 5 October 6, 2004.
N-gram model limitations Q: What do we do about N-grams which were not in our training corpus? A: We distribute some probability mass from seen N-grams.
1 Language Model (LM) LING 570 Fei Xia Week 4: 10/21/2009 TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAAAA A A.
CS 4705 Lecture 15 Corpus Linguistics III. Training and Testing Probabilities come from a training corpus, which is used to design the model. –overly.
Language Model. Major role: Language Models help a speech recognizer figure out how likely a word sequence is, independent of the acoustics. A lot of.
Spelling Checkers Daniel Jurafsky and James H. Martin, Prentice Hall, 2000.
LING 438/538 Computational Linguistics Sandiway Fong Lecture 19: 10/31.
LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.
Metodi statistici nella linguistica computazionale The Bayesian approach to spelling correction.
CS 4705 Lecture 14 Corpus Linguistics II. Relating Conditionals and Priors P(A | B) = P(A ^ B) / P(B) –Or, P(A ^ B) = P(A | B) P(B) Bayes Theorem lets.
SI485i : NLP Set 3 Language Models Fall 2012 : Chambers.
Automatic Continuous Speech Recognition Database speech text Scoring.
1 Advanced Smoothing, Evaluation of Language Models.
BİL711 Natural Language Processing1 Statistical Language Processing In the solution of some problems in the natural language processing, statistical techniques.
Lecture 1, 7/21/2005Natural Language Processing1 CS60057 Speech &Natural Language Processing Autumn 2007 Lecture 7 8 August 2007.
6. N-GRAMs 부산대학교 인공지능연구실 최성자. 2 Word prediction “I’d like to make a collect …” Call, telephone, or person-to-person -Spelling error detection -Augmentative.
Chapter 5. Probabilistic Models of Pronunciation and Spelling 2007 년 05 월 04 일 부산대학교 인공지능연구실 김민호 Text : Speech and Language Processing Page. 141 ~ 189.
NLP Language Models1 Language Models, LM Noisy Channel model Simple Markov Models Smoothing Statistical Language Models.
A brief overview of Speech Recognition and Spoken Language Processing Advanced NLP Guest Lecture August 31 Andrew Rosenberg.
Chapter 6: Statistical Inference: n-gram Models over Sparse Data
Chapter 6: N-GRAMS Heshaam Faili University of Tehran.
Chapter6. Statistical Inference : n-gram Model over Sparse Data 이 동 훈 Foundations of Statistic Natural Language Processing.
Modeling Speech using POMDPs In this work we apply a new model, POMPD, in place of the traditional HMM to acoustically model the speech signal. We use.
LML Speech Recognition Speech Recognition Introduction I E.M. Bakker.
Resolving Word Ambiguities Description: After determining word boundaries, the speech recognition process matches an array of possible word sequences from.
Lecture 4 Ngrams Smoothing
LING/C SC/PSYC 438/538 Lecture 22 Sandiway Fong. Last Time Gentle introduction to probability Important notions: –sample space –events –rule of counting.
Ngram models and the Sparcity problem. The task Find a probability distribution for the current word in a text (utterance, etc.), given what the last.
1 Introduction to Natural Language Processing ( ) Language Modeling (and the Noisy Channel) AI-lab
Estimating N-gram Probabilities Language Modeling.
Number Sense Disambiguation Stuart Moore Supervised by: Anna Korhonen (Computer Lab)‏ Sabine Buchholz (Toshiba CRL)‏
A COMPARISON OF HAND-CRAFTED SEMANTIC GRAMMARS VERSUS STATISTICAL NATURAL LANGUAGE PARSING IN DOMAIN-SPECIFIC VOICE TRANSCRIPTION Curry Guinn Dave Crist.
Speech Recognition with CMU Sphinx Srikar Nadipally Hareesh Lingareddy.
Improved Video Categorization from Text Metadata and User Comments ACM SIGIR 2011:Research and development in Information Retrieval - Katja Filippova -
Natural Language Processing Statistical Inference: n-grams
2/29/2016CPSC503 Winter CPSC 503 Computational Linguistics Lecture 5 Giuseppe Carenini.
N-Gram Model Formulas Word sequences Chain rule of probability Bigram approximation N-gram approximation.
LING/C SC/PSYC 438/538 Lecture 24 Sandiway Fong 1.
Language Modeling Part II: Smoothing Techniques Niranjan Balasubramanian Slide Credits: Chris Manning, Dan Jurafsky, Mausam.
Language Model for Machine Translation Jang, HaYoung.
N-Grams Chapter 4 Part 2.
N-Grams and Corpus Linguistics
CPSC 503 Computational Linguistics
N-Gram Model Formulas Word sequences Chain rule of probability
CSCE 771 Natural Language Processing
Presented by Wen-Hung Tsai Speech Lab, CSIE, NTNU 2005/07/13
Chapter 6: Statistical Inference: n-gram Models over Sparse Data
Research on the Modeling of Chinese Continuous Speech Recognition
Presentation transcript:

Speech Recognition  Observe: Acoustic signal (A=a 1,…,a n )  Challenge: Find the likely word sequence  But we also have to consider the context Starting at this point, we need to be able to model the target language

LML Speech Recognition: Language Modeling

Resolving Word Ambiguities
- Description: after determining word boundaries, the speech recognition process matches an array of possible word sequences from the spoken audio
- Issues to consider:
  - determine the intended word sequence
  - resolve grammatical and pronunciation errors
- Implementation: establish word sequence probabilities
  - use existing corpora
  - train the program with run-time data

- Problem: our recognizer translates the audio into a possible string of text. How do we know the translation is correct?
- Problem: how do we handle a string of text containing words that are not in the dictionary?
- Problem: how do we handle strings of valid words that do not form sentences whose semantics make sense?

- Problem: resolving words not in the dictionary
- Question: how different is a recognized word from the words that are in the dictionary?
- Solution: count the single-step transformations necessary to convert one word into another (minimum edit distance)
- Example: caat -> cat with the removal of one letter
- Example: flpc -> fireplace requires adding the letters ire after the f, an a before the c, and an e at the end
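These single-step transformations are the usual edit operations (insertion, deletion, substitution). As an illustration only, a minimal Levenshtein edit-distance function in Python (the function name and printed examples are mine, not from the slides) might look like this:

```python
def edit_distance(source, target):
    """Minimum number of single-letter insertions, deletions,
    and substitutions needed to turn source into target."""
    m, n = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # delete all of source[:i]
    for j in range(n + 1):
        dist[0][j] = j          # insert all of target[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1,        # insertion
                             dist[i - 1][j - 1] + cost) # substitution
    return dist[m][n]

print(edit_distance("caat", "cat"))        # 1 (remove one letter)
print(edit_distance("flpc", "fireplace"))  # 6 single-step transformations
```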

Simple vs. Smart
- Simple: every word follows every other word with equal probability (0-gram)
  - Assume |V| is the size of the vocabulary
  - Likelihood of a sentence S of length n is 1/|V| × 1/|V| × … × 1/|V| (n times)
  - If English has 100,000 words, the probability of each next word is 1/100,000 = 0.00001
- Smarter: the probability of each next word is related to its word frequency
  - Likelihood of sentence S = P(w1) × P(w2) × … × P(wn)
  - Assumes the probability of each word is independent of the probabilities of the other words
- Even smarter: look at the probability given the previous words
  - Likelihood of sentence S = P(w1) × P(w2|w1) × … × P(wn|wn-1)
  - Assumes the probability of each word depends on the probabilities of the preceding words
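To make the three schemes concrete, here is a small illustrative Python sketch; the toy corpus and the helper names (uniform_logprob, unigram_logprob, bigram_logprob) are invented for the example:

```python
import math
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()  # toy corpus, assumed
V = set(corpus)
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def uniform_logprob(sentence):
    # 0-gram: every word equally likely, 1/|V| per word
    return sum(math.log(1.0 / len(V)) for _ in sentence)

def unigram_logprob(sentence):
    # words independent, weighted by relative frequency
    return sum(math.log(unigram_counts[w] / N) for w in sentence)

def bigram_logprob(sentence):
    # each word conditioned on the previous word
    logp = math.log(unigram_counts[sentence[0]] / N)
    for prev, w in zip(sentence, sentence[1:]):
        logp += math.log(bigram_counts[(prev, w)] / unigram_counts[prev])
    return logp

s = "the cat sat".split()
print(uniform_logprob(s), unigram_logprob(s), bigram_logprob(s))
```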

Detecting likely word sequences using counts / table look-up
- What's the probability of "canine"?
- What's the probability of "canine tooth", i.e., tooth | canine?
- What's the probability of "canine companion"?
- P(tooth|canine) = P(canine & tooth) / P(canine)
- Sometimes we can use counts to deduce probabilities.
- Example: according to Google,
  - "canine" occurs 1,750,000 times
  - "canine tooth" occurs 6,280 times
  - P(tooth | canine) = 6,280 / 1,750,000 ≈ .0036
  - P(companion | canine) ≈ .01
  - So companion is the more likely next word after canine

Single word probability: compute the likelihood P([ni]|w), then multiply by P(w)

  Word    P([ni]|w)       P(w)       P([ni]|w) × P(w)
  new     P([ni]|new)     P(new)     P([ni]|new) P(new)
  neat    P([ni]|neat)    P(neat)    P([ni]|neat) P(neat)
  need    P([ni]|need)    P(need)    P([ni]|need) P(need)
  knee    P([ni]|knee)    P(knee)    P([ni]|knee) P(knee)

- Limitation: this ignores context
- We might need to factor in the surrounding words
  - Use P(need|I) instead of just P(need)
  - Note: P(new|I) < P(need|I)

- What is the most likely word sequence?

  'bot      ik-'spen-siv    'pre-z&ns
  boat      excessive       presidents
  bald      expensive       presence
  bold      expressive      presents
  bought    inactive        press

Detecting likely word sequences using probabilities
- Conditional probability: P(A1, A2) = P(A1) · P(A2|A1)
- The chain rule generalizes to multiple events:
  P(A1, …, An) = P(A1) P(A2|A1) P(A3|A1,A2) … P(An|A1…An-1)
- Examples:
  - P(the dog) = P(the) P(dog | the)
  - P(the dog bites) = P(the) P(dog | the) P(bites | the dog)
- Conditional probabilities work better than individual relative word frequencies because they consider the context
  - Dog may be a relatively rare word in a corpus
  - But if we see barking, P(dog|barking) is much more likely
- In general, the probability of a complete string of words w1 … wn is:
  P(w1…wn) = P(w1) P(w2|w1) P(w3|w1w2) … P(wn|w1…wn-1)

How many previous words should we consider?
- 0-gram: every word's likelihood is equal
  - Each word of a 300,000-word vocabulary gets the same probability (1/300,000)
- Unigram: a word's likelihood depends on its frequency count
  - The word 'the' occurs 69,971 times in the Brown corpus of 1,000,000 words
- Bigram: word likelihood is determined by the previous word
  - P(w|a) = P(w) × P(w|wi-1)
  - 'the' appears with frequency .07; 'rabbit' appears with a much lower frequency
  - But 'rabbit' is a more likely word to follow 'white' than 'the' is
- Trigram: word likelihood is determined by the previous two words
  - P(w|a) = P(w) × P(w|wi-1, wi-2)
- N-gram: a model of word or phoneme prediction that uses the previous N-1 words or phonemes to predict the next
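The progression from unigram to trigram can be sketched with one generic maximum-likelihood estimator; the corpus and function name below are invented for illustration:

```python
from collections import Counter

def ngram_probs(tokens, n):
    """Maximum-likelihood estimate of P(w | previous n-1 words) from raw counts."""
    ngrams = Counter(zip(*[tokens[i:] for i in range(n)]))
    contexts = Counter(zip(*[tokens[i:] for i in range(n - 1)])) if n > 1 else None
    probs = {}
    for gram, count in ngrams.items():
        denom = contexts[gram[:-1]] if n > 1 else len(tokens)
        probs[gram] = count / denom
    return probs

tokens = "the white rabbit saw the white cat".split()  # toy corpus, assumed
bigrams = ngram_probs(tokens, 2)
trigrams = ngram_probs(tokens, 3)
print(bigrams[("white", "rabbit")])          # P(rabbit | white) = 0.5
print(trigrams[("the", "white", "rabbit")])  # P(rabbit | the white) = 0.5
```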

- Generating sentences with random unigrams...
  - Every enter now severally so, let
  - Hill he late speaks; or! a more to leg less first you enter
- With bigrams...
  - What means, sir. I confess she? then all sorts, he is trim, captain.
  - Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry.
- Trigrams
  - Sweet prince, Falstaff shall die.
  - This shall forbid it should be branded, if renown made it empty.
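Samples like these come from drawing each next word from the model; a minimal bigram generator in Python (the toy corpus and function names are my own, not from the slides) could look like this:

```python
import random
from collections import Counter, defaultdict

def build_bigram_table(tokens):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev][nxt] += 1
    return table

def generate(table, start, length=10):
    """Sample a word sequence by repeatedly drawing from P(next | current)."""
    words = [start]
    for _ in range(length - 1):
        followers = table[words[-1]]
        if not followers:
            break
        nxt = random.choices(list(followers), weights=list(followers.values()))[0]
        words.append(nxt)
    return " ".join(words)

tokens = "sweet prince falstaff shall die sweet prince shall live".split()  # toy corpus
table = build_bigram_table(tokens)
print(generate(table, "sweet"))
```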

- Quadrigrams (the output now is Shakespeare)
  - What! I will go seek the traitor Gloucester.
  - Will you not tell me who I am?
- Comments
  - The accuracy of an n-gram model increases with increasing n because word combinations become more and more constrained
  - Higher-order n-gram models are more and more sparse: Shakespeare produced only 0.04% of the 844 million possible bigrams
  - There is a tradeoff between accuracy on one hand and computational overhead and memory requirements on the other

Unigrams (SWB):
  Most common: "I", "and", "the", "you", "a"
  Rank 100: "she", "an", "going"
  Least common: "Abraham", "Alastair", "Acura"
Bigrams (SWB):
  Most common: "you know", "yeah SENT!", "!SENT um-hum", "I think"
  Rank 100: "do it", "that we", "don't think"
  Least common: "raw fish", "moisture content", "Reagan Bush"
Trigrams (SWB):
  Most common: "!SENT um-hum SENT!", "a lot of", "I don't know"
  Rank 100: "it was a", "you know that"
  Least common: "you have parents", "you seen Brooklyn"

- Non-word error detection (easiest)
  - Example: graffe => giraffe
- Isolated-word (context-free) error correction
  - A correction is not possible when the error word is itself in the dictionary
- Context-dependent error correction (hardest)
  - Example: your an idiot => you're an idiot (the mistyped word happens to be a real word)

Misspelled word: acress
Candidates are compared by probability of use, P(c), and by probability of use within context, context × P(c).

Misspelled word: acress
- Word frequency percentage is not enough
- We need P(typo|candidate) × P(candidate)
- How likely is the particular error?
  - Deletion of a t after a c and before an r
  - Insertion of an a at the beginning
  - Transposition of a c and an a
  - Substitution of a c for an r
  - Substitution of an o for an e
  - Insertion of an s before the last s, or after the last s
- Also consider the context of the word within the sentence or paragraph
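The product P(typo|candidate) × P(candidate) is a noisy-channel score. Below is an illustrative Python sketch; the candidate words follow from the edit list above, but the probabilities are invented placeholders, not real corpus estimates:

```python
# Noisy-channel scoring for the misspelling "acress".
# The probabilities below are toy placeholders; real systems estimate
# P(typo | candidate) from confusion matrices over error corpora and
# P(candidate) from word frequencies in a corpus.
candidates = {
    # candidate: (P(candidate), P("acress" | candidate))
    "actress": (2e-5, 1e-4),   # deletion of t after c and before r
    "cress":   (2e-7, 1e-6),   # insertion of a at the beginning
    "caress":  (2e-6, 2e-6),   # transposition of c and a
    "access":  (9e-5, 2e-7),   # substitution of c for r
    "across":  (3e-4, 9e-6),   # substitution of o for e
    "acres":   (3e-5, 3e-5),   # insertion of an extra s
}

def best_correction(cands):
    """Rank candidates by the noisy-channel score P(typo | c) * P(c)."""
    scored = {c: p_c * p_typo for c, (p_c, p_typo) in cands.items()}
    return max(scored, key=scored.get), scored

best, scores = best_correction(candidates)
print(best, scores[best])  # highest-scoring correction under these toy numbers
```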

Difficulty: detecting grammatical errors or nonsensical expressions. Spell check without considering context will fail on examples like these:
- They are leaving in about fifteen minuets
- The study was conducted manly be John Black.
- The design an construction of the system will take more than a year.
- Hopefully, all with continue smoothly in my absence.
- Can they lave him my messages?
- I need to notified the bank of….
- He is trying to fine out.

 Definitions  Maximum likelihood: Finding the most probable sequence of tokens based on the context of the input  N-gram sequence: A sequence of n words whose context speech algorithms consider  Training data: A group of probabilities computed from a corpora of text data  Sparse data problem: How should algorithms handle n-grams that have very low probabilities?  Data sparseness is a frequently occurring problem  Algorithms will make incorrect decisions if it is not handled  Problem 1: Low frequency n-grams  Assume n-gram x occurs twice and n-gram y occurs once  Is x really twice as likely to occur as y?  Problem 2: Zero counts  Probabilities compute to zero for n-grams not seen in the corpora  If n-gram y does not occur, should its probability is zero?

Smoothing: an algorithm that redistributes the probability mass.
Discounting: reduces the probabilities of n-grams with non-zero counts to accommodate the n-grams with zero counts (those unseen in the corpus).
Definition: a corpus is a collection of written or spoken material in machine-readable form.

Add-one smoothing
- The naïve smoothing technique
- Add one to the count of all seen and unseen n-grams, and add the total increased count to the probability mass
- Example: unigrams
  - Un-smoothed probability for word w: P(w) = c(w) / N
  - Add-one revised probability for word w: P+1(w) = (c(w) + 1) / (N + V)
  - N = number of words encountered, V = vocabulary size, c(w) = number of times word w was encountered
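A short illustrative Python sketch of add-one smoothing for unigrams, using the N, V, and c(w) quantities defined above (the toy corpus and the extra vocabulary word are assumptions of the example):

```python
from collections import Counter

corpus = "the cat sat on the mat".split()   # toy corpus, assumed
counts = Counter(corpus)
N = len(corpus)                  # number of word tokens encountered
vocab = set(corpus) | {"dog"}    # pretend the vocabulary contains one unseen word
V = len(vocab)

def p_mle(w):
    """Un-smoothed (maximum likelihood) unigram probability."""
    return counts[w] / N

def p_add_one(w):
    """Add-one (Laplace) smoothed unigram probability."""
    return (counts[w] + 1) / (N + V)

print(p_mle("dog"), p_add_one("dog"))    # 0.0 vs. a small non-zero probability
print(sum(p_add_one(w) for w in vocab))  # smoothed probabilities still sum to 1
```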

P(wn|wn-1) = C(wn-1 wn) / C(wn-1)
P+1(wn|wn-1) = [C(wn-1 wn) + 1] / [C(wn-1) + V]

Note: this example assumes bigram counts and a vocabulary of V = 1616 words.
Note: each row counts how many times the word in the column precedes the word on the left, or starts a sentence.
Note: C(I) = 3437, C(want) = 1215, C(to) = 3256, C(eat) = 938, C(Chinese) = 213, C(food) = 1506, C(lunch) = 459

Original unigram counts:

  Word      C(wi)
  I         3437
  want      1215
  to        3256
  eat        938
  Chinese    213
  food      1506
  lunch      459

Revised (add-one adjusted) counts: c'(wi-1, wi) = (c(wi-1, wi) + 1) × N / (N + V)

Note: high counts reduce by approximately a third for this example.
Note: low counts get larger.
Note: N = c(wi-1), V = vocabulary size = 1616.
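As a quick check of the "reduce by approximately a third" note, take bigrams starting with I: N = C(I) = 3437 and V = 1616, so every add-one adjusted count is scaled by N/(N+V) = 3437/5053 ≈ 0.68 after the +1 is added. A hypothetical bigram seen 1,000 times would therefore get an adjusted count of about (1000 + 1) × 0.68 ≈ 681, roughly a third lower, while an unseen bigram moves from 0 to about 0.68.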

 Advantage:  Simple technique to implement and understand  Disadvantages:  Too much probability mass moves to the unseen n-grams  Underestimates the probabilities of the common n-grams  Overestimates probabilities of rare (or unseen) n-grams  Relative smoothing of all unseen n-grams is the same  Relative smoothing of rare n-grams still incorrect  Alternative:  Use a smaller add value  Disadvantage: Does not fully solve this problem

Witten-Bell (unigrams): add probability mass to un-encountered words; discount the rest.
O = observed words, U = words never seen, V = corpus vocabulary words.
- Compute the probability of a first-time encounter of a new word
  - Note: every one of the O observed words had a first encounter
  - How many unseen words? U = V – O
- What is the probability of encountering a new word?
  - Answer: P(any newly encountered word) = O / (V + O)
- Spread this probability equally across all unobserved words
  - P(any specific newly encountered word) = 1/U × O / (V + O)
  - Adjusted count = V × 1/U × O / (V + O)
- Discount each encountered word i to preserve probability space
  - Probability: from counti / V to counti / (V + O)
  - Discounted counts: from counti to counti × V / (V + O)

Witten-Bell (bigrams): add probability mass to un-encountered bigrams; discount the rest.
O = observed bigrams, U = bigrams never seen, V = corpus bigrams.
- Consider the bigram wn-1 wn
  - O(wn-1) = number of uniquely observed bigrams starting with wn-1
  - V(wn-1) = count of bigrams starting with wn-1
  - U(wn-1) = number of un-observed bigrams starting with wn-1
- Compute the probability of a new bigram starting with wn-1
  - Answer: P(any newly encountered bigram) = O(wn-1) / (V(wn-1) + O(wn-1))
  - Note: we observed O(wn-1) bigram types in V(wn-1) + O(wn-1) events
  - Note: an event is either a bigram or a first-time encounter
- Divide this probability among all unseen bigrams starting with wn-1
  - Adjusted P(new bigram starting with wn-1) = 1/U(wn-1) × O(wn-1) / (V(wn-1) + O(wn-1))
  - Adjusted count = V(wn-1) × 1/U(wn-1) × O(wn-1) / (V(wn-1) + O(wn-1))
- Discount observed bigrams starting with wn-1 to preserve probability space
  - Probability: from c(wn-1 wn) / V(wn-1) to c(wn-1 wn) / (V(wn-1) + O(wn-1))
  - Counts: from c(wn-1 wn) to c(wn-1 wn) × V(wn-1) / (V(wn-1) + O(wn-1))
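A compact Python sketch of this discounting scheme, using the slide's O, U, and V quantities (the toy corpus and vocabulary are assumptions of the example):

```python
from collections import Counter, defaultdict

tokens = "i want to eat i want chinese food".split()   # toy corpus, assumed
vocab = set(tokens)
following = defaultdict(Counter)
for w1, w2 in zip(tokens, tokens[1:]):
    following[w1][w2] += 1

def witten_bell_bigram(w_prev, w):
    """Witten-Bell smoothed P(w | w_prev) using the O/U/V quantities above."""
    V = sum(following[w_prev].values())   # count of bigrams starting with w_prev
    O = len(following[w_prev])            # uniquely observed bigrams starting with w_prev
    U = len(vocab) - O                    # un-observed bigrams starting with w_prev
    if w in following[w_prev]:
        return following[w_prev][w] / (V + O)   # discounted seen bigram
    return (O / (V + O)) / U                    # equal share of the new-bigram mass

print(witten_bell_bigram("i", "want"))   # seen bigram, slightly discounted
print(witten_bell_bigram("i", "food"))   # unseen bigram, small non-zero probability
```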

Adjusted counts:
- Add-one (as on the earlier slide): c'(wn-1, wn) = (c(wn-1, wn) + 1) × N / (N + V), with N = c(wn-1) and V = 1616
- Witten-Bell: c'(wn-1, wn) = (O/U) × V / (V + O)          if c(wn-1, wn) = 0
               c'(wn-1, wn) = c(wn-1, wn) × V / (V + O)    otherwise
  (Note: here V, O, and U refer to the wn-1 quantities V(wn-1), O(wn-1), U(wn-1); their values are on the next slide.)
The slide compares three tables: Original Counts, Adjusted Add-One Counts, and Adjusted Witten-Bell Counts.

  wn-1      O(wn-1)   U(wn-1)   V(wn-1)
  I            95      1,521      3437
  want         76      1,540      1215
  to          130      1,486      3256
  eat         124      1,492       938
  Chinese      20      1,596       213
  food         82      1,534      1506
  lunch        45      1,571       459

O(wn-1) = number of observed bigram types starting with wn-1
V(wn-1) = count of bigrams starting with wn-1
U(wn-1) = number of un-observed bigram types starting with wn-1 (here U = 1616 – O)
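Using the row for I as a rough worked example: O(I)/(V(I)+O(I)) = 95/3532 ≈ 0.027 of the probability mass is reserved for the roughly 1,521 unseen bigrams starting with I, so each unseen bigram receives about 0.027/1521 ≈ 0.0000177, while every seen bigram starting with I keeps V(I)/(V(I)+O(I)) = 3437/3532 ≈ 97.3% of its original probability mass.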

- Uses the probabilities of already-encountered grams to compute probabilities for unseen grams
- Has a smaller impact on the probabilities of already-encountered grams
- Generally computes reasonable probabilities

Backoff smoothing
- The general concept
  - Consider the trigram (wn, wn-1, wn-2)
  - If c(wn-1, wn-2) = 0, consider the 'back-off' bigram (wn, wn-1)
  - If c(wn-1) = 0, consider the 'back-off' unigram wn
- The goal is to use a hierarchy of approximations
  - trigram > bigram > unigram
  - Degrade gracefully when higher-order grams don't exist
- Given a word sequence fragment wn-2 wn-1 wn …, use the following preference rule (sketched in code below):
  1. p(wn | wn-2 wn-1)    if c(wn-2 wn-1 wn) > 0
  2. α1 × p(wn | wn-1)    if c(wn-1 wn) > 0
  3. α2 × p(wn)
- Note: α1 and α2 are values carefully computed to preserve probability mass
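A minimal Python sketch of that preference rule; the count tables come from an invented toy corpus, and the α weights are fixed placeholders rather than properly normalized backoff weights:

```python
from collections import Counter

tokens = "we eat chinese food we eat lunch we want chinese food".split()  # toy corpus
uni = Counter(tokens)
bi = Counter(zip(tokens, tokens[1:]))
tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
N = len(tokens)

ALPHA1, ALPHA2 = 0.4, 0.4   # placeholder weights; real ones are computed so the
                            # backed-off distribution still sums to one

def backoff_prob(w2, w1, w):
    """p(w | w2 w1) with backoff: trigram if seen, else bigram, else unigram."""
    if tri[(w2, w1, w)] > 0:
        return tri[(w2, w1, w)] / bi[(w2, w1)]
    if bi[(w1, w)] > 0:
        return ALPHA1 * bi[(w1, w)] / uni[w1]
    return ALPHA2 * uni[w] / N

print(backoff_prob("eat", "chinese", "food"))    # trigram seen in the toy corpus
print(backoff_prob("want", "chinese", "lunch"))  # unseen trigram and bigram: backs off to the unigram
```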

Senones
Definition: a senone is a cluster of similar Markov states.
- Goal: reduce the number of trainable units that the recognizer needs to process
- Approach:
  - HMMs represent sub-phonetic units
  - A tree structure combines the sub-phonetic units
  - The phoneme recognizer searches the tree to find HMMs
  - Nodes are partitioned with questions about neighboring phones, e.g.: Is the left phone sonorant or nasal? Is the left phone s, z, sh, or zh? Is the left phone a back L? Is the right phone a back R? Is the right phone voiced?
- Performance:
  - Triphones reduce the error rate by 15%
  - Senones reduce the error rate by 24%