CS 551/651: Structure of Spoken Language Lecture 13: Text-to-Speech (TTS) Technology and Automatic Speech Recognition (ASR) John-Paul Hosom Fall 2008.


Text-to-Speech (TTS) Synthesis
Having looked at theories of human speech production and speech perception, we'll now look at the structures and algorithms currently used to implement these technologies.
Text-to-Speech (TTS) has three main approaches:
(1) formant-based
(2) concatenative
(3) articulatory
All TTS approaches must address:
(a) text analysis: from text, predict phonemes, stress, and phrase boundaries
(b) prosody: from the text-analysis output, predict the pitch contour, energy contour, and duration of each phoneme
(c) signal processing: given the phoneme symbols and timing, generate the speech waveform
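
As a rough, hypothetical illustration of the three stages above (not the architecture of any particular system), here is a minimal pipeline sketch in Python; all names and types are invented placeholders:

```python
# Minimal sketch of the three TTS stages described above; the data types and
# function names are illustrative, not from any particular system.
from dataclasses import dataclass
from typing import List

@dataclass
class PhonemeSpec:
    symbol: str         # e.g. "aa"
    duration_ms: float  # from the prosody module
    f0_hz: List[float]  # pitch targets across the phoneme
    energy: float       # relative energy

def text_analysis(text: str) -> List[str]:
    """Predict phonemes, stress, and phrase boundaries from raw text."""
    raise NotImplementedError  # letter-to-sound rules, dictionary lookup, ...

def prosody(phonemes: List[str]) -> List[PhonemeSpec]:
    """Predict pitch contour, energy contour, and duration for each phoneme."""
    raise NotImplementedError  # rules or statistical models

def signal_processing(specs: List[PhonemeSpec]) -> "np.ndarray":
    """Generate a waveform (formant, concatenative, or articulatory back end)."""
    raise NotImplementedError

def tts(text: str):
    return signal_processing(prosody(text_analysis(text)))
```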

Text-to-Speech (TTS) Synthesis
From a linguistic perspective, there may be many more things to consider… (from Klatt 1987)

Text-to-Speech (TTS) Synthesis
Generating a Waveform: Articulatory Synthesis
The vocal tract is divided into a large number of short tubes, as in the electrical transmission-line analog (Lecture 11); the tubes are then combined and the resonant frequencies calculated.
(from Sinder, 1999; thesis work with Flanagan, Rutgers)
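
As a sanity check on the tube idea, a single uniform tube closed at the glottis and open at the lips has resonances at odd multiples of c/4L; the concatenated-tube (transmission-line) model generalizes this to arbitrary area functions. A small sketch of the uniform-tube case (my own illustration, not the model from the figure):

```python
# Resonances of a uniform tube closed at one end (glottis) and open at the
# other (lips): f_n = (2n - 1) * c / (4 * L).  A concatenated-tube
# (transmission-line) model generalizes this to arbitrary area functions.
def uniform_tube_formants(length_m=0.175, c=350.0, n=3):
    return [(2 * k - 1) * c / (4.0 * length_m) for k in range(1, n + 1)]

print(uniform_tube_formants())   # -> [500.0, 1500.0, 2500.0]  (schwa-like vowel)
```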

Text-to-Speech (TTS) Synthesis
Generating a Waveform: Articulatory Synthesis
Vocal-tract sources include a noise source and a "buzz" source for voiced sounds.
Articulatory synthesis is important for validating the Motor Theory of Speech Perception.
Demos from 1976 and circa 1992 (Haskins Labs).

Text-to-Speech (TTS) Synthesis
Generating a Waveform: Formant Synthesis
Instead of specifying mouth shapes, formant synthesis specifies the frequencies and bandwidths of resonators, which are used to filter a source waveform.
Formant-frequency analysis is difficult; bandwidth estimation is even more difficult. But the biggest perceptual problem in formant synthesis is not in the resonances but in a "buzzy" quality, most likely due to the glottal source model.
Formant synthesis can sound identical to a natural utterance if the details of the glottal source and formants are well modeled.
[Audio demo: natural speech vs. synthetic speech (John Holmes, 1973)]

Text-to-Speech (TTS) Synthesis
Formant TTS Synthesis: Architecture
Formant-synthesis systems contain a number of sound sources, which are passed to filters connected either in parallel or in cascade (series). Each filter corresponds to one formant (resonance) or anti-resonance.
(From Yamaguchi, 1993)
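
Each formant filter in such an architecture can be realized as a second-order digital resonator set by a center frequency and a bandwidth (the form used in Klatt-style synthesizers). A hedged sketch of one resonator and a cascade connection, with illustrative parameter values:

```python
import numpy as np

def resonator(x, f_hz, bw_hz, fs=10000.0):
    """Second-order digital resonator: y[n] = A*x[n] + B*y[n-1] + C*y[n-2],
    with coefficients derived from the formant frequency and bandwidth."""
    T = 1.0 / fs
    C = -np.exp(-2.0 * np.pi * bw_hz * T)
    B = 2.0 * np.exp(-np.pi * bw_hz * T) * np.cos(2.0 * np.pi * f_hz * T)
    A = 1.0 - B - C                      # unity gain at DC
    y = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        y[n] = A * x[n]
        if n >= 1:
            y[n] += B * y[n - 1]
        if n >= 2:
            y[n] += C * y[n - 2]
    return y

def cascade(source, formants, bandwidths, fs=10000.0):
    """Cascade (series) connection: the source passes through each resonator in turn."""
    y = source
    for f, bw in zip(formants, bandwidths):
        y = resonator(y, f, bw, fs)
    return y

# e.g. an /aa/-like vowel from an impulse-train "buzz" source (illustrative values)
fs = 10000.0
source = np.zeros(int(0.3 * fs))
source[::int(fs / 100)] = 1.0   # 100 Hz pulse train
vowel = cascade(source, formants=[730, 1090, 2440], bandwidths=[80, 90, 120], fs=fs)
```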

Text-to-Speech (TTS) Synthesis
Formant Systems: Rule-Based Synthesis
For synthesis of arbitrary text, the formants and bandwidths for each phoneme are determined by analyzing the speech of a single person. The model of each phoneme may be a single set of formant frequencies and bandwidths for a canonical phoneme at a single point in time, or a trajectory of frequencies, bandwidths, and source parameters over time.
The formant frequencies of neighboring phonemes are combined over time using a model of coarticulation, such as Klatt's modified locus theory. Duration, pitch, and energy rules are then applied.
Result: something like this (audio demo).
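
A minimal sketch of the simplest version of this idea: place each phoneme's formant target at the phoneme's temporal midpoint and interpolate between successive targets. Real rule-based systems (e.g. Klatt's modified locus theory) use more elaborate transition rules; the targets and durations below are illustrative only:

```python
import numpy as np

def formant_track(targets_hz, durations_ms, frame_ms=5.0):
    """Place each phoneme's formant target at its temporal midpoint and
    linearly interpolate between successive targets (a crude stand-in for
    rule-based transitions such as Klatt's modified locus theory)."""
    durations = np.asarray(durations_ms, dtype=float)
    ends = np.cumsum(durations)
    midpoints = ends - durations / 2.0          # target times (ms)
    t = np.arange(0.0, ends[-1], frame_ms)      # frame times (ms)
    return np.interp(t, midpoints, targets_hz)  # one formant value per frame

# F2 trajectory for /y eh s/ with illustrative targets (Hz) and durations (ms)
f2 = formant_track(targets_hz=[2100, 1800, 1700], durations_ms=[80, 120, 150])
```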

Text-to-Speech (TTS) Synthesis
Despite great success in copy synthesis, synthesis by rule using formants has severely degraded quality. It's not clear why: a problem with the glottal source? with coarticulation and formant transitions? with prosody?
Formant synthesis was the main TTS technique until the early-to-mid 1990s, when increasing memory size and CPU speed allowed concatenative synthesis to become a viable approach.
Concatenative synthesis uses recordings of small units of speech (typically the region from the middle of one phoneme to the middle of the next, a "diphone" unit) and glues these units together to form words and sentences. With concatenation you don't have to worry about glottal-source models or coarticulation, since the synthesis is just a concatenation of waveforms containing "natural" glottal source and coarticulation.

Text-to-Speech (TTS) Synthesis
Concatenative Synthesis: Units
The basic unit for concatenative synthesis is the diphone, e.g. for "John": sil-jh, jh-aa, aa-n, n-sil.
More recent TTS research uses larger units. Issues include:
(a) how to decide what units will be used
(b) how to select the best unit from a very large database
With increasing size and variety of units, there is an exponential growth in database size. Yet, despite massive databases that may take months to record, coverage is nowhere near complete: there is a very large number of infrequent events in speech.
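
A small sketch of how a phoneme string maps onto diphone units like those above, padding with silence at the edges (the phoneme symbols are the same illustrative ones used on this slide):

```python
def to_diphones(phonemes):
    """Map a phoneme sequence to diphone unit names, padding with silence.
    Each diphone covers the (relatively stable) middle of one phoneme through
    the middle of the next, so concatenation points fall inside phonemes."""
    padded = ["sil"] + list(phonemes) + ["sil"]
    return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

print(to_diphones(["jh", "aa", "n"]))   # ['sil-jh', 'jh-aa', 'aa-n', 'n-sil']  ("John")
```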

Text-to-Speech (TTS) Synthesis
Concatenative Synthesis: Signal Processing
Waveform-based: Pitch-Synchronous Overlap-Add (PSOLA) performs pitch modification by changing the spacing of pitch-synchronous units.
Or, use Line Spectral Frequencies (LSFs), which are computed from Linear Predictive Coding (LPC) coefficients.
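
A heavily simplified sketch of the TD-PSOLA idea, assuming the pitch-mark (epoch) locations are already known. Note that this toy version uses each analysis grain exactly once, so raising the pitch also shortens the signal; a real implementation repeats or skips grains to preserve duration and handles unvoiced regions separately:

```python
import numpy as np

def td_psola_pitch(x, epochs, pitch_scale):
    """Toy TD-PSOLA pitch modification.
    x: waveform; epochs: sample indices of pitch marks (assumed given);
    pitch_scale > 1 raises F0 by spacing the synthesis pitch marks closer."""
    epochs = np.asarray(epochs)
    out = np.zeros(len(x) + 1024)
    t_out = float(epochs[1])                       # current synthesis pitch mark
    for i in range(1, len(epochs) - 1):
        # analysis grain: two pitch periods, Hann-windowed, centred on epoch i
        left, right = epochs[i - 1], epochs[i + 1]
        grain = x[left:right] * np.hanning(right - left)
        centre = epochs[i] - left
        # overlap-add the grain at the current synthesis mark
        start = int(round(t_out)) - centre
        if 0 <= start and start + len(grain) <= len(out):
            out[start:start + len(grain)] += grain
        # advance by the (scaled) local pitch period
        t_out += (epochs[i + 1] - epochs[i]) / pitch_scale
    return out[:len(x)]
```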

DEMOS
Klatt's DECtalk (formant synthesis, early 1990s): sample 1
AT&T (large-unit selection): sample 1a (2003), sample 2a (2003), sample 1b (2005), sample 2b (2005)
Bell Labs (large-unit selection): sample 1a (2003), sample 2a (2003), sample 1b (2005), sample 2b (2005)
OGI (diphone units): sample 1a (2003), sample 2a (2003), sample 1b (2005), sample 2b (2005)

ASR Technology: Frame-Based Approaches
The stochastic approach includes HMMs and HMM/ANN hybrids.

HMM-Based System Characteristics
- The system is in only one state at each time t; at time t+1, the system transfers to one of the states indicated by the arcs.
- At each time t, the likelihood of each phoneme is estimated using a Gaussian mixture model (GMM) or an ANN. The classifier uses a fixed time window, usually extending no more than 60 msec. Each frame is typically classified into each phoneme in a particular left and right context, e.g. /y−eh+s/, and into the left, middle, or right region of that context-dependent phoneme (3 states per phoneme).
- The probability of transferring from one state to the next is independent of the observed (test) speech utterance; it is computed over the entire training corpus.
- The Viterbi search determines the most likely word sequence given the phoneme and state-transition probabilities and the list of possible vocabulary words.
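
A hedged sketch of the Viterbi search for a single HMM, working in log probabilities (a real recognizer composes phoneme models, a pronunciation lexicon, and a language model into a much larger search network):

```python
import numpy as np

def viterbi(log_obs, log_trans, log_init):
    """log_obs[t, s]   : log P(observation at frame t | state s)  (from GMM or ANN)
       log_trans[s, s']: log P(state s -> state s')
       log_init[s]     : log P(starting in state s)
       Returns the most likely state sequence and its log probability."""
    T, S = log_obs.shape
    delta = np.full((T, S), -np.inf)        # best log prob of any path ending in state s at t
    back = np.zeros((T, S), dtype=int)      # backpointers
    delta[0] = log_init + log_obs[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans     # scores[prev, cur]
        back[t] = np.argmax(scores, axis=0)
        delta[t] = scores[back[t], np.arange(S)] + log_obs[t]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(np.max(delta[-1]))
```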

ASR Technology: Frame-Based Approaches
Issues with HMMs:
- Independence is assumed between frames.
- The implicit duration model for phonemes is geometric, whereas phoneme durations actually follow Gamma-like distributions (see the sketch after this list).
- Independence is required between features within one frame for GMM classification (not so for ANN classification).
- All frames of speech contribute equally to the final result.
- Duration is not used in phoneme classification.
- Duration is modeled using a priori averages over the entire training set.
- The language model uses the probability of word N given words N−1, N−2, etc. (bigram, trigram, etc.); infrequently occurring word combinations are poorly recognized (e.g. "black Monday", a stock-market crash in 1987).
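
To make the duration issue concrete: a state with self-loop probability a stays exactly d frames with probability a^(d−1)(1 − a), a geometric distribution whose mode is always d = 1 frame, whereas measured phoneme durations peak well above one frame. A small illustration (not tied to any particular system):

```python
# Implicit HMM state-duration model: geometric, with its mode at d = 1 frame.
# Measured phoneme durations are better fit by Gamma-like distributions.
def geometric_duration(a_self, d):
    """P(staying in a state for exactly d frames | self-loop probability a_self)."""
    return (a_self ** (d - 1)) * (1.0 - a_self)

a = 0.9   # expected duration = 1 / (1 - a) = 10 frames
print([round(geometric_duration(a, d), 3) for d in range(1, 6)])
# -> [0.1, 0.09, 0.081, 0.073, 0.066]  monotonically decreasing from d = 1
```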

ASR Technology: Frame-Based Approaches
Why is the HMM the Dominant Technique for ASR?
- well-defined mathematical structure
- does not require expert knowledge about the speech signal (more people study statistics than study speech)
- errors in analysis don't propagate and accumulate
- does not require prior segmentation
- does not require a large number of templates
- results are usually the best or among the best

Issues in Developing ASR Systems
Type of Channel
- A microphone signal is different from a telephone signal; a "land-line" telephone signal is different from a cellular signal.
- Channel characteristics: pick-up pattern (omni-directional, unidirectional, etc.), frequency response, sensitivity, noise, etc.
- Typical channels:
    desktop boom mic: unidirectional, 100 to Hz
    hand-held mic: super-cardioid, 40 to Hz
    telephone: unidirectional, 300 to 8000 Hz
- Training on data from one type of channel automatically "learns" that channel's characteristics; switching channels degrades performance.

Issues in Developing ASR Systems
Speaking Rate
- Even the same speaker may vary the rate of speech.
- Most ASR systems require a fixed window of input speech.
- Formant dynamics change with different speaking rates and speaking styles (e.g. "frustrated speech").
- ASR performance is best when the test speech has the same rate as the training data.
- Training on a wide variation in speaking rate results in lower overall performance.

Issues in Developing ASR Systems
Noise
- Two types of noise: additive and convolutional.
- Additive: e.g. white noise (random values added to the waveform).
- Convolutional: a filter (additive values in the log spectrum).
- Techniques for removing noise: RASTA, Cepstral Mean Subtraction (CMS; sketched below).
- It is (nearly) impossible to remove all noise while preserving all speech.
- Stochastic training "learns" the noise as well as the speech; if the noise changes, performance degrades.
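
Cepstral Mean Subtraction is simple enough to sketch directly: a fixed (convolutional) channel adds an approximately constant offset to each cepstral coefficient, so subtracting the per-utterance mean removes much of the channel. A minimal sketch (real front ends may also normalize variance or apply this per speaker or per session):

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """cepstra: array of shape (num_frames, num_coeffs), e.g. MFCCs.
    A time-invariant channel (convolutional noise) appears as a roughly constant
    additive offset in the cepstral domain, so removing the utterance mean of
    each coefficient largely removes the channel."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)
```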

Issues in Developing ASR Systems
Vocabulary
- The vocabulary must be specified in advance (new words can't be recognized).
- The pronunciation of each word must be specified exactly (phonetic substitutions may degrade performance).
- Grammar: either very simple or very structured.
- Reasons: phonetic recognition is so poor that confidence in each recognized phoneme is usually very low; humans often speak ungrammatically or disfluently.
A toy vocabulary specification is sketched below.
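
For illustration only (the words, pronunciations, and grammar below are hypothetical), a vocabulary specification of the kind described above might look like this; anything outside the lexicon or the grammar simply cannot be recognized:

```python
# Hypothetical ASR task specification: a fixed pronunciation lexicon and a
# very simple grammar.  Words or pronunciations not listed here cannot be
# recognized at all.
LEXICON = {
    "yes":  ["y", "eh", "s"],
    "no":   ["n", "ow"],
    "call": ["k", "ao", "l"],
    "john": ["jh", "aa", "n"],
}

# sentence -> "yes" | "no" | "call john"
GRAMMAR = [
    ["yes"],
    ["no"],
    ["call", "john"],
]
```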

Issues in Developing ASR Systems
How Well Does ASR Do?
[Figure: "Error Rates on Increasingly Difficult Problems" — word error rate (1% to 100%) plotted against task: read speech (1k, 5k, 20k vocabulary), varied microphones, noisy speech, broadcast speech, conversational speech, spontaneous speech (2-3k vocabulary); human speech recognition of broadcast speech is about 0.9% WER; other plotted values include 2.5% and 19%.]
Current best performance on conversational telephone speech is around 10% word error rate.
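
Word error rate is the minimum number of substitutions, deletions, and insertions needed to turn the recognized word string into the reference, divided by the number of reference words. A minimal sketch using edit distance (the example strings are invented):

```python
import numpy as np

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference),
    computed with standard Levenshtein (edit) distance over words."""
    R, H = len(reference), len(hypothesis)
    d = np.zeros((R + 1, H + 1), dtype=int)
    d[:, 0] = np.arange(R + 1)      # all deletions
    d[0, :] = np.arange(H + 1)      # all insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i, j] = min(d[i - 1, j - 1] + sub,   # substitution / match
                          d[i - 1, j] + 1,         # deletion
                          d[i, j - 1] + 1)         # insertion
    return d[R, H] / float(R)

print(word_error_rate("call john now please".split(),
                      "call jim now".split()))   # -> 0.5  (1 sub + 1 del over 4 words)
```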

ASR Technology vs. Spectrogram Reading
HMM-Based ASR: frame-based
- no identification of landmarks in the speech signal
- duration of phonemes not identified until the end of processing
- all frames are equally important
- "cues" are completely unspecified, learned by training
- coarticulation model = context-dependent phoneme models
Spectrogram Reading:
- first identify landmarks in the signal (Where's the vowel? Is that change in energy a plosive?)
- identify change over the duration of a phoneme, and relative durations (Is that formant movement a diphthong or coarticulation?)
- identify activity at phoneme boundaries (F2 goes to 1800 Hz at the onset of voicing; voicing continues into frication, so it's a voiced fricative)
- specific cues to phoneme identity (F2 ≈ 1800 Hz implies alveolar, F3 ≈ 2000 Hz implies retroflex)
- coarticulation model = tends toward locus theory

ASR Technology vs. Spectrogram Reading
HMM-Based ASR: frame-based
- no identification of landmarks in the speech signal
- duration of phonemes not identified until the end of processing
- all frames are equally important
- "cues" are completely unspecified, learned by training
- coarticulation model = context-dependent phoneme models
Spectrogram Reading and Human Speech Recognition:
- first identify landmarks in the signal (humans are thought to have landmark detectors, e.g. for plosives)
- identify change over the duration of a phoneme, and relative durations (humans are very sensitive to small changes, especially at vowel/consonant boundaries)
- identify activity at phoneme boundaries (the transition into the vowel is the most important region for human speech perception)
- specific cues to phoneme identity (humans use a large set of specific cues, e.g. VOT)

The Structure of Spoken Language
Final Points:
- Speech is complex! It is not as simple as a "sequence of phonemes."
- There is structure in speech, related to broad phonetic categories.
- Identifying formant locations and movement is important.
- Duration is important, even for phoneme identity.
- Phoneme boundaries are important.
- There are numerous cues to phoneme identity.
- Little is understood about how humans process speech.
- Current ASR technology is incapable of accounting for all the information that humans use in reading spectrograms, and what is known about human speech processing is often not used; this implies (but does not prove) that current technology may be incapable of reaching human levels of performance.
- Speech is complex!