1 Components of Spoken Dialogue Systems Julia Hirschberg LSA 353
2 Recreating the Speech Chain DIALOG SEMANTICS SYNTAX LEXICON MORPHOLOG Y PHONETICS VOCAL-TRACT ARTICULATORS INNER EAR ACOUSTIC NERVE SPEECH RECOGNITION DIALOG MANAGEMENT SPOKEN LANGUAGE UNDERSTANDING SPEECH SYNTHESIS
3 The Illusion of Segmentation... or... Why Speech Recognition is so Difficult m I n & m b & r i s e v & n th r E n I n z E r o t ü s e v & n f O r MY NUMBER IS SEVEN THREE NINE ZERO TWO SEVEN FOUR NP VP (user:Roberto (attribute:telephone-num value: ))
4 The Illusion of Segmentation... or... Why Speech Recognition is so Difficult m I n & m b & r i s e v & n th r E n I n z E r o t ü s e v & n f O r MY NUMBER IS SEVEN THREE NINE ZERO TWO SEVEN FOUR NP VP (user:Roberto (attribute:telephone-num value: )) Intra-speaker variability Noise/reverberation Coarticulation Context-dependency Word confusability Word variations Speaker Dependency Multiple Interpretations Limited vocabulary Ellipses and Anaphors
5 1980s -- The Statistical Approach Based on work on Hidden Markov Models done by Leonard Baum at IDA, Princeton in the late 1960s Purely statistical approach pursued by Fred Jelinek and Jim Baker, IBM T.J.Watson Research Foundations of modern speech recognition engines Fred Jelinek S1S1 S2S2 S3S3 a 11 a 12 a 22 a 23 a 33 Acoustic HMMs Word Tri-grams No Data Like More Data Whenever I fire a linguist, our system performance improves (1988) Some of my best friends are linguists (2004) Jim Baker
– Statistical approach becomes ubiquitous Lawrence Rabiner, A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition, Proceeding of the IEEE, Vol. 77, No. 2, February 1989.
7 1980s-Today – The Power of Evaluation Pros and Cons of DARPA programs + Continuous incremental improvement - Loss of “bio-diversity” SPOKEN DIALOG INDUSTRY SPEECHWORKS NUANCE MIT SRI TECHNOLOGY VENDORS PLATFORM INTEGRATORS APPLICATION DEVELOPERS HOSTING TOOLS STANDARDS … Nuance
8 Today’s State of the Art Low noise conditions Large vocabulary –~20,000-64,000 words (or more…) Speaker independent (vs. speaker-dependent) Continuous speech (vs isolated-word) World’s best research systems: Human-human speech: ~13-20% Word Error Rate (WER) Human-machine or monologue speech: ~3-5% WER
9 Components of an ASR System Corpora for training and testing of components Representation for input and method of extracting Pronunciation Model Acoustic Model Language Model Feature extraction component Algorithms to search hypothesis space efficiently
10 Training and Test Corpora Collect corpora appropriate for recognition task at hand –Small speech + phonetic transcription to associate sounds with symbols (Initial Acoustic Model) –Large (>= 60 hrs) speech + orthographic transcription to associate words with sounds (Acoustic Model) –Very large text corpus to identify unigram and bigram probabilities (Language Model)
11 Building the Acoustic Model Model likelihood of phones or subphones given spectral features, pronunciation models, and prior context Usually represented as HMM –Set of states representing phones or other subword units –Transition probabilities on states: how likely is it to see one phone after seeing another? –Observation/output likelihoods: how likely is spectral feature vector to be observed from phone state i, given phone state i-1?
12 Initial estimates for Transition probabilities between phone states Observation probabilities associating phone states with acoustic examples Re-estimate both probabilities by feeding the HMM the transcribed speech training corpus (forced alignment) I.e., we tell the HMM the ‘right’ answers -- which phones to associate with which sequences of sounds Iteratively retrain transition and observation probabilities by running training data through model and scoring output (we know the right answers) until no improvement
13 HMMs for speech
14 Building the Pronunciation Model Models likelihood of word given network of candidate phone hypotheses –Multiple pronunciations for each word –May be weighted automaton or simple dictionary Words come from all corpora; pronunciations from pronouncing dictionary or TTS system
15 ASR Lexicon: Markov Models for Pronunciation
16 Language Model Models likelihood of word given previous word(s) Ngram models: –Build the LM by calculating bigram or trigram probabilities from (very large) text training corpus –Smoothing issues Grammars –Finite state grammar or Context Free Grammar (CFG) or semantic grammar Out of Vocabulary (OOV) problem
17 Search/Decoding Find the best hypothesis given –Lattice of phone units (AM) –Lattice of words (segmentation of phone lattice into all possible words via Pronunciation Model) –Probabilities of word sequences (LM) How to reduce this huge search space? –Lattice minimization and determinization –Pruning: beam search –Calculating most likely paths
18 Evaluating Success Transcription –Low WER (S+I+D)/N * 100 Thesis test vs. This is a test. 75% WER Or That was the dentist calling. 125% WER Understanding –High concept accuracy How many domain concepts were correctly recognized? I want to go from Boston to Baltimore on September 29
19 Domain conceptsValues –source cityBoston –target cityBaltimore –travel dateSeptember 29 –Score recognized string “Go from Boston to Washington on December 29” vs. “Go to Boston from Baltimore on September 29” –(1/3 = 33% CA)
20 Summary ASR today –Combines many probabilistic phenomena: varying acoustic features of phones, likely pronunciations of words, likely sequences of words –Relies upon many approximate techniques to ‘translate’ a signal ASR future –Can we include more language phenomena in the model?
21 Synthesizers Then and Now The task: produce human-sounding speech from some orthographic or semantic representation of an input –An online text –A semantic representation of a response to a query What are the obstacles? What is the state of the art?
22 Possibly the First ‘Speaking Machine’ Wolfgang von Kempelen, Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine, 1791 (in Deutsches Museum still and playable) First to produce whole words, phrases – in many languages
23 Joseph Faber’s Euphonia, 1846
24 Constructed 1835 w/pedal and keyboard control –Whispered and ordinary speech –Model of tongue, pharyngeal cavity with changeable shape –Singing too “God Save the Queen” Modern Articulatory Synthesis: Dennis Klatt (1987)
25
26 World’s Fair in NY, 1939 Requires much training to ‘play’ Purpose: reduce bandwidth needed to transmit speech, so many phone calls can be sent over single line
27
28 Answers: –These days a chicken leg is a rare dish. –It’s easy to tell the depth of a well. –Four hours of steady work faced us. ‘Automatic’ synthesis from spectrogram – but can also use hand-painted spectrograms as input Purpose: understand perceptual effect of spectral details
29 Formant/Resonance/Acoustic Synthesis Parametric or resonance synthesis –Specify minimal parameters, e.g. f0 and first 3 formants –Pass electronic source signal thru filter Harmonic tone for voiced sounds Aperiodic noise for unvoiced Filter simulates the different resonances of the vocal tract E.g. –Walter Lawrence’s Parametric Artificial Talker (1953) for vowels and consonants –Gunnar Fant’s Orator Verbis Electris (1953) for vowels –Formant synthesis download (demo)Formant synthesis download demo
30 Concatenative Synthesis Most common type today First practical application in 1936: British Phone company’s Talking Clock –Optical storage for words, part-words, phrases –Concatenated to tell time E.g. And a ‘similar’ example Bell Labs TTS (1977) (1985)
31 Variants of Concatenative Synthesis Inventory units –Diphone synthesis (e.g. Festival) –Microsegment synthesis –“Unit Selection” – large, variable units Issues –How well do units fit together? –What is the perceived acoustic quality of the concatenated units? –Is post-processing on the output possible, to improve quality?
32 TTS Production Levels: Back End and Front End Orthographic input: The children read to Dr. Smith World Knowledgetext normalization Semantics Syntaxword pronunciation Lexical Intonation assignment Phonologyintonation realization –F0, amplitude, duration Acousticssynthesis
33 Text Normalization Reading is what W. hates most. Reading is what Wilde hated most. Have the students read the questions. In 1996 she sold 1995 shares and deposited $42 in her 401(k). The duck dove supply.
34 Pronunciation in Context Homograph disambiguation –E.g. bass/bass, desert/desert Dictionaries vs letter-to-sound rules –Frequent or exceptional words: dictionary Bellcore business name pronunciation –‘New’ words: rules Inferring language origin, e.g. Infiniti, Gomez, Paris Pronunciation by analogy, e.g. universary Learning rules automatically from dictionaries or small seed-sets + active learning
35 Intonation Assignment: Phrasing Traditional: hand-built rules –Punctuation –Context/function word: no breaks after function word He went to dinner –Parsing? She favors the nuts and bolts approach Current: statistical analysis of large labeled corpus –Punctuation, pos window, utt length,… –~94% `accuracy’
36 Intonation Assignment: Accent Hand-built rules –Function/content distinction Accent content words; deaccent function words But many exceptions –‘Given’ items, focus, contrast –There are presidents and there are good presidents –Complex nominals: Main Street/Park Avenue city hall parking lot Today: Statistical procedures trained on large labeled corpora
37 Intonation Assignment: Contours Simple rules –‘.’ = declarative contour –‘?’ = yes-no-question contour unless wh-word present at/near front of sentence Well, how did he do it? And what do you know? Open problem –But realization even of simple variation is quite difficult in current TTS systems
38 Phonological and Acoustic Realization Task: –Produce a phonological representation from phonemes and intonational assignment Pitch contour aligned with text Durations, intensity –Select best concatenative units from inventory –Post-process if needed/possible to smooth joins, modify pitch, duration, intensity, rate from original units –Produce acoustic waveform as output
39 TTS: Where are we now? Natural sounding speech for some utterances –Where good match between input and database Still…hard to vary prosodic features and retain naturalness –Yes-no questions: Do you want to fly first class? Context-dependent variation still hard to infer from text and hard to realize naturally:
40 –Appropriate contours from text –Emphasis, de-emphasis to convey focus, given/new distinction: I own a cat. Or, rather, my cat owns me. –Variation in pitch range, rate, pausal duration to convey topic structure Characteristics of ‘emotional speech’ little understood, so hard to convey: …a voice that sounds friendly, sympathetic, authoritative…. How to mimic real voices? ScanSoft/Nuance demo
41 Next Class J&M 22.4,22.8 Walker et al ‘97