Pronunciation Modeling Lecture 11 Spoken Language Processing Prof. Andrew Rosenberg.



What is a pronunciation model? In the recognition pipeline, audio features feed the acoustic model, which produces phone hypotheses; the pronunciation model maps phone hypotheses to word hypotheses, which the language model then scores.

Why do we need one? The pronunciation model defines the mapping between sequences of phones and words. The acoustic model can deliver a one-best hypothesis, its "best guess." From this single guess, conversion to words can be done with dynamic programming alignment, or the process can be viewed as a finite state automaton.

Simplest pronunciation "model": a dictionary. Associate a word (lexical item, orthographic form) with a pronunciation.
ACHE: EY K
ACHES: EY K S
ADJUNCT: AE JH AH NG K T
ADJUNCTS: AE JH AH NG K T S
ADVANTAGE: AH D V AE N T IH JH
ADVANTAGE: AH D V AE N IH JH
ADVANTAGE: AH D V AE N T AH JH

Example of a pronunciation dictionary.
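A dictionary like this is straightforward to sketch in code. A minimal Python version (entries and ARPAbet symbols taken from the slide; the function name is my own) might look like:

```python
# A minimal pronunciation dictionary: each word maps to a list of
# pronunciation variants, each a list of ARPAbet phone symbols.
PRON_DICT = {
    "ACHE":      [["EY", "K"]],
    "ACHES":     [["EY", "K", "S"]],
    "ADVANTAGE": [["AH", "D", "V", "AE", "N", "T", "IH", "JH"],
                  ["AH", "D", "V", "AE", "N", "IH", "JH"],
                  ["AH", "D", "V", "AE", "N", "T", "AH", "JH"]],
}

def pronunciations(word):
    """All listed pronunciations for a word; empty list if out of vocabulary."""
    return PRON_DICT.get(word.upper(), [])
```

An unseen word simply returns nothing, which is exactly the out-of-vocabulary problem the later slides discuss.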

Finite State Automata view. Each word is an automaton over phones. [Figure: per-word automata with one phone per arc, e.g. EY → K for ACHE.]

Size of whole-word models. These models get very big, very quickly. [Figure: the per-word automata combined between shared START and END states.]

Potential problems. Every word in the training material and the test vocabulary must be in the dictionary. The dictionary is generally written by hand, and is prone to errors and inconsistencies.

Baseforms represented by graphs.

Composition. From the word graph, we can replace each phone by its Markov model.

Automating the construction. Do we need to write a rule for every word? Pluralizing: where is it +[Z]? +[IH Z]? Prefixes ("unhappy", etc.): +[UH N]. How can you tell the difference between "unhappy", "unintelligent", and "under"?

Is every pronunciation equally likely? Different phonetic realizations can be weighted. The FSA view of the pronunciation model makes this easy.
ACAPULCO: AE K AX P AH L K OW
ACAPULCO: AA K AX P UH K OW
THE: TH IY
THE: TH AX
PROBABLY: P R AA B AX B L IY
PROBABLY: P R AA B L IY
PROBABLY: P R AA L IY

Is every pronunciation equally likely? Different phonetic realizations can be weighted. The FSA view of the pronunciation model makes this easy.
ACAPULCO: AE K AX P AH L K OW (0.75)
ACAPULCO: AA K AX P UH K OW (0.25)
THE: TH IY (0.15)
THE: TH AX (0.85)
PROBABLY: P R AA B AX B L IY (0.5)
PROBABLY: P R AA B L IY (0.4)
PROBABLY: P R AA L IY (0.1)
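The weighted table can be represented directly. A small sketch (weights copied from the slide's example; the function name is my own):

```python
# Pronunciation variants paired with probabilities, as in the weighted
# dictionary above.
WEIGHTED = {
    "THE":      [(("TH", "IY"), 0.15), (("TH", "AX"), 0.85)],
    "PROBABLY": [(("P", "R", "AA", "B", "AX", "B", "L", "IY"), 0.5),
                 (("P", "R", "AA", "B", "L", "IY"), 0.4),
                 (("P", "R", "AA", "L", "IY"), 0.1)],
}

def most_likely_pronunciation(word):
    """Return the highest-weight phone sequence for a word."""
    phones, _weight = max(WEIGHTED[word.upper()], key=lambda v: v[1])
    return list(phones)
```

In a real decoder these weights become arc costs on the word's FSA rather than a table lookup.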

Collecting pronunciations. Collect a lot of data. Ask a phonetician to phonetically transcribe it. Count how many times each production is observed. This is very expensive: it is time consuming, and trained phoneticians are hard to find.

Collecting pronunciations, more cheaply. Start with equal likelihoods for all pronunciations. Run the recognizer on transcribed speech (forced alignment). Count how many times the recognizer uses each pronunciation. Much cheaper, but less reliable.
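Counting how often forced alignment selects each variant amounts to relative-frequency estimation. A sketch (a real system would smooth these counts; the function name and data shape are my assumptions):

```python
from collections import Counter

def estimate_weights(chosen_variants):
    """chosen_variants: one phone tuple per occurrence of a word in the
    forced alignments, recording which variant the recognizer used.
    Returns each variant's relative frequency as its weight."""
    counts = Counter(chosen_variants)
    total = sum(counts.values())
    return {variant: count / total for variant, count in counts.items()}
```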

Out of Vocabulary Words. A major problem for dictionary-based pronunciation modeling is out-of-vocabulary terms. If you’ve never seen a name or a new word, how do you know how to pronounce it? –Person names –Organization and company names –New words: “truthiness”, “hypermiling”, “woot”, “app” –Medical, scientific, and technical terms

Collecting pronunciations from the web. Newspapers, blog posts, etc. often use new names and unknown terms. For example: –“Flickeur (pronounced like Voyeur) randomly retrieves images from Flickr.com and creates an infinite film with a style that can vary between stream-of-consciousness, documentary or video clip.” –“Our group traveled to Peterborough (pronounced like ‘Pita-borough’)...” The web can be mined for such pronunciations [Riley, Jansche, Ramabhadran 2009].

Grapheme to Phoneme Conversion. Given a new word, how do you pronounce it? “Grapheme” is a language-independent term for things like “letters”, “characters”, “kanji”, etc. With a grapheme-to-phoneme converter, dictionaries can be augmented with any word. Some languages are more ambiguous than others.

Grapheme to Phoneme conversion. Goal: learn an alignment between graphemes (letters) and phonemes (sounds). Find the lowest cost alignment. Weight rules, and learn contextual variants. Example: TEXT → T EH K S T, aligned T→T, E→EH, X→K S, T→T.
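Finding the lowest-cost alignment is a dynamic-programming problem, directly analogous to edit distance. A toy sketch with uniform costs (a trained G2P aligner would instead learn per letter/phone pair costs, e.g. making X ↔ K S cheap; the uniform costs here are an assumption for illustration):

```python
def align_cost(letters, phones, sub_cost=1, gap_cost=1):
    """Edit-distance-style DP: cost of aligning a letter string to a
    phone sequence. Mismatched letter/phone pairs cost sub_cost;
    unaligned letters or phones cost gap_cost each."""
    n, m = len(letters), len(phones)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap_cost          # delete all letters so far
    for j in range(1, m + 1):
        D[0][j] = j * gap_cost          # insert all phones so far
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0 if letters[i - 1].upper() == phones[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j - 1] + match,   # align letter to phone
                          D[i - 1][j] + gap_cost,    # letter unaligned
                          D[i][j - 1] + gap_cost)    # phone unaligned
    return D[n][m]
```

For "text" vs. T EH K S T this gives cost 3 (substitutions E→EH and X→K plus an inserted S); learned pair costs would make the linguistically right alignment the cheap one.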

Grapheme to Phoneme difficulties. How to deal with abbreviations? –US CENSUS –NASA, scuba vs. AT&T, ASR –LOL –IEEE What about misspellings? –Should “teh” have an entry in the dictionary? –If we’re collecting new terms from the web, or other unreliable sources, how do we know what is a new word?

Application of Grapheme to Phoneme Conversion. This pronunciation model is used much more often in speech synthesis than in speech recognition. In speech recognition we’re trying to do phoneme-to-grapheme conversion. –This is a very tricky problem. –“ghoti” -> F IH SH –“ghoti” -> silence

Approaches to Grapheme to Phoneme conversion. “Instance Based Learning” –Lookup based on a sliding window of 3 letters –Helps with sounds like “ch” and “sh” Hidden Markov Model –Observations are phones –States are letters

Machine Learning for Grapheme to Phoneme Conversion. Input: –A letter and its surrounding context, e.g. the 2 previous and 2 following letters. Output: –A phoneme.
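Extracting this input is a simple windowing operation. A sketch (the '#' padding symbol for word edges and the feature names are my assumptions):

```python
def context_features(word, i, width=2):
    """Features for the letter at position i: the letter itself plus
    `width` letters of left and right context, padded with '#'."""
    padded = "#" * width + word.lower() + "#" * width
    j = i + width  # position of letter i inside the padded string
    return {
        "letter": padded[j],
        "left":   padded[j - width:j],
        "right":  padded[j + 1:j + 1 + width],
    }
```

Each (features, phoneme) pair from the aligned dictionary then becomes one training instance for the classifier.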

Decision Trees. Decision trees are intuitive classifiers. –Classifier: supervised machine learning, generating categorical predictions. [Figure: a node asks “feature > threshold?”; the branches lead to Class A and Class B.]

Decision Trees example.

Decision Tree Training. How does the letter “p” sound? Training data: –P: loophole, peanuts, pay, apple –F: physics, telephone, graph, photo –ø (silent): apple, psycho, pterodactyl, pneumonia Pronunciation depends on context.

Decision Trees example. Context: L1, L2, p, R1, R2. First split: R1 = “h”?
–Yes: P loophole; F physics, telephone, graph, photo
–No: P peanut, pay, apple; ø apple, psycho, pterodactyl, pneumonia

Decision Trees example. Context: L1, L2, p, R1, R2. First split: R1 = “h”?
–Yes, then split on L1 = “o”? Yes: P loophole. No: F physics, telephone, graph, photo.
–No, then split on R1 = consonant? Yes: ø psycho, pterodactyl, pneumonia. No: P peanut, pay, apple.

Decision Trees example. Using the same tree, try “PARIS”.

Decision Trees example. Now try “GOPHER”.
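The toy tree from these slides can be written out as nested conditionals. A sketch (the class labels and the use of a vowel test for the “R1 = consonant” split are my assumptions):

```python
def p_sound(left1, right1):
    """Toy decision tree for how the letter 'p' sounds, using one letter
    of left and right context ('' at a word edge)."""
    vowels = set("aeiou")
    if right1 == "h":
        return "P" if left1 == "o" else "F"  # loophole vs. physics, photo
    if right1 in vowels:
        return "P"                           # peanut, pay
    return "SILENT"                          # psycho, pterodactyl, pneumonia
```

On “PARIS” (left '', right 'a') it predicts P, which is correct; on “GOPHER” (left 'o', right 'h') it also predicts P, though the correct sound is F, showing how shallow letter context can fail.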

Training a Decision Tree. At each node, decide what the most useful split is. –Consider all features. –Select the one that improves performance the most. There are a few ways to measure improvement: –Information Gain is typically used. –Accuracy is less common. Training can require many evaluations.
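Information Gain compares the label entropy before a split with the size-weighted entropy of the resulting groups. A minimal sketch:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(labels, groups):
    """Information gain of a split: parent entropy minus the
    size-weighted average entropy of the child groups."""
    total = len(labels)
    child = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - child
```

A split that separates the classes perfectly gains the full parent entropy; a split that leaves each group as mixed as the parent gains nothing.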

Pronunciation Models in TTS and ASR. In ASR, we have phone hypotheses from the acoustic model and need word hypotheses. In TTS, we have the desired word but need a corresponding phone sequence to synthesize.

Next Class: Language Modeling. Reading: J&M Chapter 4.