1 Components of Spoken Dialogue Systems Julia Hirschberg LSA 353.

2 Recreating the Speech Chain DIALOG SEMANTICS SYNTAX LEXICON MORPHOLOG Y PHONETICS VOCAL-TRACT ARTICULATORS INNER EAR ACOUSTIC NERVE SPEECH RECOGNITION DIALOG MANAGEMENT SPOKEN LANGUAGE UNDERSTANDING SPEECH SYNTHESIS

3 The Illusion of Segmentation... or... Why Speech Recognition is so Difficult
m I n & m b & r i s e v & n th r E n I n z E r o t ü s e v & n f O r
MY NUMBER IS SEVEN THREE NINE ZERO TWO SEVEN FOUR
NP VP
(user:Roberto (attribute:telephone-num value: ))

4 The Illusion of Segmentation... or... Why Speech Recognition is so Difficult
m I n & m b & r i s e v & n th r E n I n z E r o t ü s e v & n f O r
MY NUMBER IS SEVEN THREE NINE ZERO TWO SEVEN FOUR
NP VP
(user:Roberto (attribute:telephone-num value: ))
–Intra-speaker variability
–Noise/reverberation
–Coarticulation
–Context-dependency
–Word confusability
–Word variations
–Speaker Dependency
–Multiple Interpretations
–Limited vocabulary
–Ellipses and Anaphors

5 1980s -- The Statistical Approach
Based on work on Hidden Markov Models done by Leonard Baum at IDA, Princeton in the late 1960s
Purely statistical approach pursued by Fred Jelinek and Jim Baker, IBM T.J. Watson Research
Foundations of modern speech recognition engines
[Diagram: three-state HMM (S1, S2, S3) with transition probabilities a11, a12, a22, a23, a33; Acoustic HMMs + Word Tri-grams]
Fred Jelinek: "No Data Like More Data"; "Whenever I fire a linguist, our system performance improves" (1988); "Some of my best friends are linguists" (2004)

6 – Statistical approach becomes ubiquitous
Lawrence Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, Vol. 77, No. 2, February 1989.

7 1980s-Today – The Power of Evaluation
Pros and Cons of DARPA programs
+ Continuous incremental improvement
- Loss of "bio-diversity"
[Diagram: the spoken dialog industry – technology vendors (SpeechWorks, Nuance, MIT, SRI), platform integrators, application developers, hosting, tools, standards, …]

8 Today's State of the Art
Low noise conditions
Large vocabulary –~20,000-64,000 words (or more…)
Speaker-independent (vs. speaker-dependent)
Continuous speech (vs. isolated-word)
World's best research systems:
–Human-human speech: ~13-20% Word Error Rate (WER)
–Human-machine or monologue speech: ~3-5% WER

9 Components of an ASR System Corpora for training and testing of components Representation for input and method of extracting it Pronunciation Model Acoustic Model Language Model Feature extraction component Algorithms to search hypothesis space efficiently

10 Training and Test Corpora Collect corpora appropriate for recognition task at hand –Small speech + phonetic transcription to associate sounds with symbols (Initial Acoustic Model) –Large (>= 60 hrs) speech + orthographic transcription to associate words with sounds (Acoustic Model) –Very large text corpus to identify unigram and bigram probabilities (Language Model)

11 Building the Acoustic Model Model likelihood of phones or subphones given spectral features, pronunciation models, and prior context Usually represented as HMM –Set of states representing phones or other subword units –Transition probabilities on states: how likely is it to see one phone after seeing another? –Observation/output likelihoods: how likely is a given spectral feature vector to be observed in phone state i?
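The HMM pieces named on this slide can be sketched concretely. Below is a toy forward-algorithm pass over a two-state phone HMM; the states, the discretized "observation" alphabet, and every probability are invented for illustration (real systems score continuous feature vectors with Gaussian mixtures or neural networks):

```python
# Toy HMM sketch (not a real recognizer): two phone-like states, a tiny
# discrete observation alphabet, and invented probabilities throughout.

trans = {  # transition probabilities P(next_state | state)
    "ih": {"ih": 0.6, "z": 0.4},
    "z":  {"ih": 0.1, "z": 0.9},
}
emit = {   # observation likelihoods P(symbol | state)
    "ih": {"lo": 0.7, "hi": 0.3},
    "z":  {"lo": 0.2, "hi": 0.8},
}
start = {"ih": 0.8, "z": 0.2}

def forward(obs):
    """Total likelihood of the observation sequence under the HMM."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in start}
    for o in obs[1:]:
        alpha = {
            s2: sum(alpha[s1] * trans[s1][s2] for s1 in alpha) * emit[s2][o]
            for s2 in start
        }
    return sum(alpha.values())

print(forward(["lo", "lo", "hi"]))  # ≈ 0.158
```

The same tables, run through the Viterbi recursion instead (max in place of sum), would give the single best state path rather than the total likelihood.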

12 Initial estimates for Transition probabilities between phone states Observation probabilities associating phone states with acoustic examples Re-estimate both probabilities by feeding the HMM the transcribed speech training corpus (forced alignment) I.e., we tell the HMM the ‘right’ answers -- which phones to associate with which sequences of sounds Iteratively retrain transition and observation probabilities by running training data through model and scoring output (we know the right answers) until no improvement

13 HMMs for speech

14 Building the Pronunciation Model Models likelihood of word given network of candidate phone hypotheses –Multiple pronunciations for each word –May be weighted automaton or simple dictionary Words come from all corpora; pronunciations from pronouncing dictionary or TTS system
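A weighted pronunciation dictionary of the kind this slide describes can be sketched minimally; the words, phone strings, and weights below are all invented examples:

```python
# Sketch of a pronunciation model as a weighted dictionary: each word maps
# to alternative phone strings with probabilities (values are illustrative).

lexicon = {
    "the":    [(("dh", "ah"), 0.7), (("dh", "iy"), 0.3)],
    "tomato": [(("t", "ah", "m", "ey", "t", "ow"), 0.8),
               (("t", "ah", "m", "aa", "t", "ow"), 0.2)],
}

def pron_prob(word, phones):
    """P(phones | word) under the lexicon; 0.0 for unlisted pronunciations."""
    for variant, p in lexicon.get(word, []):
        if variant == tuple(phones):
            return p
    return 0.0

print(pron_prob("the", ["dh", "iy"]))  # 0.3
```

A weighted automaton generalizes this table by sharing prefixes between variants, but the lookup semantics are the same.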

15 ASR Lexicon: Markov Models for Pronunciation

16 Language Model Models likelihood of word given previous word(s) N-gram models: –Build the LM by calculating bigram or trigram probabilities from (very large) text training corpus –Smoothing issues Grammars –Finite state grammar or Context Free Grammar (CFG) or semantic grammar Out of Vocabulary (OOV) problem
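A bigram model as described can be built in a few lines. The corpus here is a toy stand-in for the "very large text training corpus", and add-one smoothing is shown only because it is the simplest scheme; production systems use better ones (e.g. Kneser-Ney):

```python
from collections import Counter

# Minimal bigram language model with add-one (Laplace) smoothing,
# trained on a toy corpus.

corpus = "i want to go to boston i want to fly to baltimore".split()
vocab = set(corpus)
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(w1, w2):
    """P(w2 | w1) with add-one smoothing over the vocabulary."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))

print(bigram_prob("i", "want"))   # seen bigram: boosted probability
print(bigram_prob("boston", "fly"))  # unseen bigram: small but nonzero
```

Smoothing is what keeps unseen-but-possible word pairs from being assigned probability zero, which would otherwise rule out any hypothesis containing them.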

17 Search/Decoding Find the best hypothesis given –Lattice of phone units (AM) –Lattice of words (segmentation of phone lattice into all possible words via Pronunciation Model) –Probabilities of word sequences (LM) How to reduce this huge search space? –Lattice minimization and determinization –Pruning: beam search –Calculating most likely paths
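Beam-search pruning, the key space-reduction idea on this slide, can be shown schematically. The per-step "acoustic" scores and the bigram LM scores below are invented log-probabilities; a real decoder works over lattices of phone and word hypotheses rather than this hand-made table:

```python
# Schematic beam search: combine per-step "acoustic" word scores with a
# bigram LM score, pruning to the top `beam` hypotheses at each step.
# All scores are invented log-probabilities, just to show the pruning.

acoustic = [  # candidate words per step with log P(audio | word)
    {"seven": -0.3, "eleven": -1.2},
    {"three": -0.4, "tree": -0.9},
]
lm = {("<s>", "seven"): -0.5, ("<s>", "eleven"): -2.0,
      ("seven", "three"): -0.4, ("seven", "tree"): -3.0,
      ("eleven", "three"): -0.6, ("eleven", "tree"): -3.0}

def decode(beam=2):
    hyps = [(("<s>",), 0.0)]  # (word sequence, total log score)
    for step in acoustic:
        expanded = [
            (seq + (w,), score + am + lm[(seq[-1], w)])
            for seq, score in hyps
            for w, am in step.items()
        ]
        hyps = sorted(expanded, key=lambda h: h[1], reverse=True)[:beam]
    return hyps[0][0][1:]

print(decode())  # ('seven', 'three')
```

Narrowing the beam cuts the search cost but risks pruning away a hypothesis that would have won after later evidence; the beam width trades speed against accuracy.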

18 Evaluating Success
Transcription –Low WER: (S+I+D)/N * 100
"Thesis test" vs. "This is a test": 75% WER
Or "That was the dentist calling": 125% WER
Understanding –High concept accuracy
How many domain concepts were correctly recognized?
I want to go from Boston to Baltimore on September 29
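The WER formula and both examples on the slide can be checked with a standard edit-distance alignment between reference and hypothesis:

```python
# Word error rate as on the slide: (S + I + D) / N * 100, where N is the
# number of reference words, computed via minimum-edit-distance alignment.

def wer(ref, hyp):
    ref, hyp = ref.split(), hyp.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return 100.0 * d[-1][-1] / len(ref)

print(wer("this is a test", "thesis test"))                   # 75.0
print(wer("this is a test", "that was the dentist calling"))  # 125.0
```

The second example shows why WER can exceed 100%: insertions are counted too, so a five-word hypothesis against a four-word reference can cost five errors.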

19 Domain concepts – Values
–source city: Boston
–target city: Baltimore
–travel date: September 29
Score recognized string "Go from Boston to Washington on December 29" vs. "Go to Boston from Baltimore on September 29"
–(1/3 = 33% CA)
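Concept accuracy as scored on this slide is just the fraction of domain slots whose recognized value matches the reference; a sketch, with slot names taken from the slide:

```python
# Concept accuracy: fraction of reference domain slots whose recognized
# value matches. Slot names and values follow the travel example above.

def concept_accuracy(reference, recognized):
    correct = sum(recognized.get(slot) == val
                  for slot, val in reference.items())
    return 100.0 * correct / len(reference)

ref = {"source city": "Boston", "target city": "Baltimore",
       "travel date": "September 29"}
hyp = {"source city": "Boston", "target city": "Washington",
       "travel date": "December 29"}
print(concept_accuracy(ref, hyp))  # ≈ 33.3 (1 of 3 slots correct)
```

Note the contrast with WER: a transcript can be badly wrong word-for-word yet still score well here, as long as the slot values survive.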

20 Summary ASR today –Combines many probabilistic phenomena: varying acoustic features of phones, likely pronunciations of words, likely sequences of words –Relies upon many approximate techniques to ‘translate’ a signal ASR future –Can we include more language phenomena in the model?

21 Synthesizers Then and Now The task: produce human-sounding speech from some orthographic or semantic representation of an input –An online text –A semantic representation of a response to a query What are the obstacles? What is the state of the art?

22 Possibly the First ‘Speaking Machine’ Wolfgang von Kempelen, Mechanismus der menschlichen Sprache nebst Beschreibung einer sprechenden Maschine, 1791 (still in the Deutsches Museum, and playable) First to produce whole words and phrases – in many languages

23 Joseph Faber’s Euphonia, 1846

24 Constructed 1835 with pedal and keyboard control –Whispered and ordinary speech –Model of tongue, pharyngeal cavity with changeable shape –Singing too: “God Save the Queen” Modern Articulatory Synthesis: Dennis Klatt (1987)

25

26 World’s Fair in NY, 1939 Requires much training to ‘play’ Purpose: reduce bandwidth needed to transmit speech, so many phone calls can be sent over single line

27

28 Answers: –These days a chicken leg is a rare dish. –It’s easy to tell the depth of a well. –Four hours of steady work faced us. ‘Automatic’ synthesis from spectrogram – but can also use hand-painted spectrograms as input Purpose: understand perceptual effect of spectral details

29 Formant/Resonance/Acoustic Synthesis Parametric or resonance synthesis –Specify minimal parameters, e.g. f0 and first 3 formants –Pass electronic source signal through filter Harmonic tone for voiced sounds Aperiodic noise for unvoiced Filter simulates the different resonances of the vocal tract E.g. –Walter Lawrence’s Parametric Artificial Talker (1953) for vowels and consonants –Gunnar Fant’s Orator Verbis Electris (1953) for vowels –Formant synthesis demo
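The source-filter idea behind formant synthesis can be sketched digitally: an impulse-train source (the harmonic tone for voiced sounds) passed through cascaded two-pole resonators that stand in for vocal-tract resonances. The formant frequencies and bandwidths below are rough illustrative values, not the parameters of any real synthesizer:

```python
import math

# Highly simplified formant synthesis sketch: a 100 Hz impulse-train
# source filtered by cascaded second-order digital resonators.

RATE = 8000  # sample rate in Hz (illustrative)

def resonator(signal, freq, bandwidth):
    """Two-pole resonator: y[n] = b*x[n] + c1*y[n-1] + c2*y[n-2]."""
    r = math.exp(-math.pi * bandwidth / RATE)
    c1 = 2 * r * math.cos(2 * math.pi * freq / RATE)
    c2 = -r * r
    b = 1 - c1 - c2  # crude gain normalization, fine for a sketch
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = b * x + c1 * y1 + c2 * y2
        out.append(y)
        y1, y2 = y, y1
    return out

# Voiced source: 0.1 s impulse train at 100 Hz
source = [1.0 if n % (RATE // 100) == 0 else 0.0 for n in range(RATE // 10)]
wave = source
for f, bw in [(700, 80), (1200, 90), (2600, 120)]:  # rough /a/-like formants
    wave = resonator(wave, f, bw)
```

Swapping the impulse train for white noise gives the aperiodic source used for unvoiced sounds; moving the formant frequencies over time is what turns a static buzz into changing vowels.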

30 Concatenative Synthesis Most common type today First practical application in 1936: British phone company’s Talking Clock –Optical storage for words, part-words, phrases –Concatenated to tell time Similar later examples: Bell Labs TTS (1977, 1985)

31 Variants of Concatenative Synthesis Inventory units –Diphone synthesis (e.g. Festival) –Microsegment synthesis –“Unit Selection” – large, variable units Issues –How well do units fit together? –What is the perceived acoustic quality of the concatenated units? –Is post-processing on the output possible, to improve quality?

32 TTS Production Levels: Back End and Front End
Orthographic input: The children read to Dr. Smith
World Knowledge – text normalization
Semantics
Syntax – word pronunciation
Lexical – intonation assignment
Phonology – intonation realization –F0, amplitude, duration
Acoustics – synthesis

33 Text Normalization Reading is what W. hates most. Reading is what Wilde hated most. Have the students read the questions. In 1996 she sold 1995 shares and deposited $42 in her 401(k). The duck dove supply.
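A toy normalization pass in the spirit of these examples (the abbreviation table and digit-by-digit number reading are deliberate oversimplifications; a real front end would expand "1996" as "nineteen ninety-six", handle currency and "401(k)", and disambiguate cases like "read" and "supply" from context):

```python
import re

# Toy text normalization: expand a few abbreviations and read digit
# strings aloud digit by digit, so the synthesizer has words to say.

ABBREV = {"Dr.": "Doctor", "St.": "Street"}
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

def say_digits(match):
    return " ".join(ONES[int(d)] for d in match.group())

def normalize(text):
    for abbr, full in ABBREV.items():
        text = text.replace(abbr, full)
    return re.sub(r"\d+", say_digits, text)

print(normalize("Dr. Smith deposited $42."))
# Doctor Smith deposited $four two.
```

Even this tiny sketch shows why normalization is hard: "Dr." should become "Drive" after a street name, and whether "1995" is a year or a share count changes how it is read.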

34 Pronunciation in Context Homograph disambiguation –E.g. bass/bass, desert/desert Dictionaries vs letter-to-sound rules –Frequent or exceptional words: dictionary Bellcore business name pronunciation –‘New’ words: rules Inferring language origin, e.g. Infiniti, Gomez, Paris Pronunciation by analogy, e.g. universary Learning rules automatically from dictionaries or small seed-sets + active learning

35 Intonation Assignment: Phrasing Traditional: hand-built rules –Punctuation –Context/function word: no breaks after function word He went to dinner –Parsing? She favors the nuts and bolts approach Current: statistical analysis of large labeled corpus –Punctuation, part-of-speech window, utterance length,… –~94% ‘accuracy’

36 Intonation Assignment: Accent Hand-built rules –Function/content distinction Accent content words; deaccent function words But many exceptions –‘Given’ items, focus, contrast –There are presidents and there are good presidents –Complex nominals: Main Street/Park Avenue city hall parking lot Today: Statistical procedures trained on large labeled corpora

37 Intonation Assignment: Contours Simple rules –‘.’ = declarative contour –‘?’ = yes-no-question contour unless wh-word present at/near front of sentence Well, how did he do it? And what do you know? Open problem –But realization even of simple variation is quite difficult in current TTS systems

38 Phonological and Acoustic Realization Task: –Produce a phonological representation from phonemes and intonational assignment Pitch contour aligned with text Durations, intensity –Select best concatenative units from inventory –Post-process if needed/possible to smooth joins, modify pitch, duration, intensity, rate from original units –Produce acoustic waveform as output

39 TTS: Where are we now? Natural sounding speech for some utterances –Where good match between input and database Still…hard to vary prosodic features and retain naturalness –Yes-no questions: Do you want to fly first class? Context-dependent variation still hard to infer from text and hard to realize naturally:

40 –Appropriate contours from text –Emphasis, de-emphasis to convey focus, given/new distinction: I own a cat. Or, rather, my cat owns me. –Variation in pitch range, rate, pausal duration to convey topic structure Characteristics of ‘emotional speech’ little understood, so hard to convey: …a voice that sounds friendly, sympathetic, authoritative…. How to mimic real voices? ScanSoft/Nuance demo

41 Next Class J&M 22.4, 22.8; Walker et al. ’97