Spoken Dialogue Systems
Bob Carpenter and Jennifer Chu-Carroll
June 20, 1999
Tutorial Overview: Data Flow

Signal Processing -> Speech Recognition -> Semantic Interpretation -> Discourse Interpretation -> Dialogue Management -> Response Generation -> Speech Synthesis
Part I covers the speech components (signal processing through semantic interpretation, plus synthesis); Part II covers discourse interpretation, dialogue management, and response generation.
Speech and Audio Processing

Signal processing:
–convert the audio wave into a sequence of feature vectors
Speech recognition:
–decode the sequence of feature vectors into a sequence of words
Semantic interpretation:
–determine the meaning of the recognized words
Speech synthesis:
–generate synthetic speech from a marked-up word string
Tutorial Overview: Outline

Part I
Signal processing
Speech recognition
–acoustic modeling
–language modeling
–decoding
Semantic interpretation
Speech synthesis

Part II
Discourse and dialogue
–discourse interpretation
–dialogue management
–response generation
Dialogue evaluation
Data collection
Acoustic Waves

Human speech generates a wave
–like a loudspeaker moving
A wave for the words “speech lab” looks like:
[waveform figure: “s p ee ch l a b”, with a detail of the “l” to “a” transition]
Graphs from Simon Arnfield’s web tutorial on speech, Sheffield: http://lethe.leeds.ac.uk/research/cogn/speech/tutorial/
Acoustic Sampling

10 ms frame (ms = millisecond = 1/1000 second)
~25 ms window around each frame to smooth signal processing
Result: a sequence of acoustic feature vectors a_1, a_2, a_3, …
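For concreteness, the frame/window bookkeeping can be sketched in a few lines of Python (a minimal illustration, not from the original slides; 16 kHz audio and a Hamming window are assumptions):

import numpy as np

def frame_signal(samples, rate=16000, frame_ms=10, window_ms=25):
    """Slice audio into overlapping analysis windows:
    one ~25 ms window every 10 ms, so adjacent windows overlap."""
    hop = int(rate * frame_ms / 1000)     # 160 samples at 16 kHz
    width = int(rate * window_ms / 1000)  # 400 samples at 16 kHz
    frames = []
    for start in range(0, len(samples) - width + 1, hop):
        window = samples[start:start + width] * np.hamming(width)
        frames.append(window)             # one window per 10 ms frame
    return np.array(frames)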
Spectral Analysis

Frequency gives pitch; amplitude gives volume
–sampling at ~8 kHz for telephone, ~16 kHz for microphone (kHz = 1000 cycles/sec)
Fourier transform of the wave yields a spectrogram
–darkness indicates energy at each frequency
–hundreds to thousands of frequency samples
[spectrogram figure for “s p ee ch l a b”: frequency vs. time, energy shown as darkness]
Acoustic Features: Mel Scale Filterbank

Derive Mel Scale Filterbank coefficients
Mel scale:
–models the non-linearity of human audio perception
–mel(f) = 2595 log_10(1 + f / 700)
–roughly linear to 1000 Hz and then logarithmic
Filterbank:
–collapses the large number of FFT parameters by filtering with ~20 triangular filters spaced on the mel scale
[figure: triangular filters m_1 … m_6 spaced along the frequency axis, yielding filterbank coefficients]
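The mel mapping above translates directly into code (a small sketch; the 2595 and 700 constants are the ones quoted on the slide, and the 8 kHz upper edge is an assumption for 16 kHz audio):

import numpy as np

def hz_to_mel(f):
    """Mel scale from the slide: mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse mapping, used to place triangular filter edges."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# ~20 triangular filters get centers evenly spaced in mel, which makes
# them roughly linear below 1000 Hz and logarithmic above.
centers_hz = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(8000), 22))[1:-1]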
Cepstral Coefficients

The cepstral transform is a discrete cosine transform of the log filterbank amplitudes
Result is ~12 Mel Frequency Cepstral Coefficients (MFCC)
–almost independent of one another (unlike the mel filterbank outputs)
Use Delta (velocity / first derivative) and Delta² (acceleration / second derivative) of the MFCCs (+ ~24 features)
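A minimal sketch of this step, assuming SciPy is available (13 coefficients and simple gradient-based deltas are illustrative choices):

import numpy as np
from scipy.fftpack import dct

def mfcc_from_filterbank(fbank_energies, n_ceps=13):
    """DCT of log mel-filterbank energies -> cepstral coefficients.
    fbank_energies: (frames, ~20) array of filterbank outputs."""
    log_e = np.log(fbank_energies + 1e-10)        # avoid log(0)
    return dct(log_e, type=2, axis=1, norm='ortho')[:, :n_ceps]

def add_deltas(ceps):
    """Append velocity and acceleration features (simple differences)."""
    delta = np.gradient(ceps, axis=0)             # first derivative
    delta2 = np.gradient(delta, axis=0)           # second derivative
    return np.hstack([ceps, delta, delta2])       # ~39 features per frame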
Additional Signal Processing

Pre-emphasis prior to the Fourier transform to boost high-frequency energy
Liftering to re-scale the cepstral coefficients
Channel adaptation to deal with line and microphone characteristics (example: cepstral mean normalization)
Echo cancellation to remove background noise (including speech generated by the synthesizer)
Adding a total (log) energy feature (+/- normalization)
End-pointing to detect signal start and stop
Properties of Recognizers

Speaker independent vs. speaker dependent
Large vocabulary (2K-200K words) vs. limited vocabulary (2-200)
Continuous vs. discrete speech
Speech recognition vs. speech verification
Real time vs. multiples of real time
Spontaneous speech vs. read speech
Noisy environment vs. quiet environment
High-resolution microphone vs. telephone vs. cellphone
Adapts to the speaker vs. non-adaptive
Low vs. high latency
Online incremental results vs. final results only
The Speech Recognition Problem

Bayes’ law:
–P(a,b) = P(a|b) P(b) = P(b|a) P(a)
–the joint probability of a and b = the probability of b times the probability of a given b
The recognition problem:
–find the most likely sequence w of “words” given the sequence of acoustic observation vectors a
–use Bayes’ law to create a generative model:
ArgMax_w P(w|a) = ArgMax_w P(a|w) P(w) / P(a) = ArgMax_w P(a|w) P(w)
Acoustic model: P(a|w)
Language model: P(w)
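In log space the decision rule becomes a sum, which is how decoders actually compare hypotheses (a toy sketch; acoustic_logprob and lm_logprob are hypothetical stand-ins for the two models):

def best_hypothesis(candidates, acoustic_logprob, lm_logprob):
    """ArgMax_w P(a|w) P(w), computed as a sum of log probabilities.
    `acoustic_logprob(w)` and `lm_logprob(w)` are placeholders for
    real acoustic and language model scores."""
    return max(candidates,
               key=lambda w: acoustic_logprob(w) + lm_logprob(w))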
Hidden Markov Models (HMMs)

HMMs provide generative acoustic models P(a|w)
Probabilistic, non-deterministic finite-state automaton:
–state n generates feature vectors with density P_n(.)
–transitions from state j to n have probability P_j,n
[figure: three-state left-to-right HMM with output densities P_1(.), P_2(.), P_3(.) and transitions P_1,1 P_1,2 P_2,2 P_2,3 P_3,3 P_3,4]
HMMs: Single Gaussian Distribution

Outgoing likelihoods: feature vector a is generated by a normal (Gaussian) density with a per-state mean vector and covariance matrix:
P_n(a) = N(a; μ_n, Σ_n)
[figure: the same three-state HMM, each state emitting from a single Gaussian]
HMMs: Gaussian Mixtures

To account for variable pronunciations, each state generates acoustic vectors according to a linear combination of m Gaussian models, weighted by mixture weights c_1,…,c_m:
P_n(a) = Σ_i c_i N(a; μ_n,i, Σ_n,i)
[figure: three-component mixture model in two dimensions]
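A sketch of evaluating such a state density in log space (diagonal covariances and the logsumexp trick are standard practice, though not specified on the slide):

import numpy as np
from scipy.special import logsumexp

def gaussian_log_density(a, mean, var):
    """log N(a; mean, diag(var)) for one diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (a - mean) ** 2 / var)

def mixture_log_likelihood(a, weights, means, variances):
    """log P_n(a) for a state whose output density is a Gaussian mixture:
    a weighted linear combination of component Gaussians."""
    comp = [np.log(w) + gaussian_log_density(a, m, v)
            for w, m, v in zip(weights, means, variances)]
    return logsumexp(comp)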
Acoustic Modeling with HMMs

Train HMMs to represent subword units
Units are typically segmental and may vary in granularity:
–phonological (~40 for English)
–phonetic (~60 for English)
–context-dependent triphones (~14,000 for English): model temporal and spectral transitions between phones
–silence and noise are usually additional symbols
Standard architecture is three successive states per phone
[figure: three-state phone HMM as above]
Pronunciation Modeling

Needed for speech recognition and synthesis
Maps the orthographic representation of words to sequence(s) of phones
The dictionary doesn’t cover the language due to:
–open classes
–names
–inflectional and derivational morphology
Pronunciation variation can be modeled with multiple pronunciations and/or acoustic mixtures
–if multiple pronunciations are given, estimate their likelihoods
Use rules (e.g. assimilation, devoicing, flapping) or statistical transducers
Lexical HMMs

Create a compound HMM for each lexical entry by concatenating the phones making up the pronunciation
–example: HMM for ‘lab’ (following ‘speech’, for cross-word triphones)
  phones: l a b
  triphones: ch-l+a  l-a+b  a-b+#
Multiple pronunciations can be weighted by likelihood into a compound HMM for a word
(Tri)phone models are independent parts of word models
[figure: three phone HMMs concatenated into a single word HMM]
HMM Training: Baum-Welch Re-estimation

Determines the probabilities for the acoustic HMM models
Bootstraps from an initial model
–hand-aligned data, previous models, or a flat start
Allows embedded training of whole utterances:
–transcribe the utterance as words w_1,…,w_k and generate a compound HMM by concatenating the compound HMMs for the words: m_1,…,m_k
–calculate acoustic vectors a_1,…,a_n
Iteratively converges to a new estimate
Re-estimates over all paths because the states are hidden
Provides a maximum likelihood estimate
–the model that assigns the training data the highest likelihood
Probabilistic Language Modeling: History

Assigns probability P(w) to a word sequence w = w_1,w_2,…,w_k
The chain rule (iterated Bayes’ law) provides a history-based model:
P(w_1,w_2,…,w_k) = P(w_1) P(w_2|w_1) P(w_3|w_1,w_2) … P(w_k|w_1,…,w_k-1)
Cluster histories to reduce the number of parameters
N-gram Language Modeling

The n-gram assumption clusters histories based on the last n-1 words:
–P(w_j | w_1,…,w_j-1) ≈ P(w_j | w_j-n+1,…,w_j-1)
–unigrams: P(w_j)
–bigrams: P(w_j | w_j-1)
–trigrams: P(w_j | w_j-2, w_j-1)
Trigrams are often interpolated with bigram and unigram estimates:
P(w_j | w_j-2, w_j-1) = λ_3 F(w_j | w_j-2, w_j-1) + λ_2 F(w_j | w_j-1) + λ_1 F(w_j)
–the λ_i are typically estimated by maximum likelihood on held-out data (the F(.|.) are relative frequencies)
–many other interpolations exist (another standard is a non-linear backoff)
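A minimal sketch of the interpolated estimate (fixed λ weights for brevity; the slide has them tuned on held-out data):

from collections import Counter

def train_counts(tokens):
    """Relative-frequency building blocks F(.|.) from a token list."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri, len(tokens)

def interp_trigram(w1, w2, w3, uni, bi, tri, n, lambdas=(0.1, 0.3, 0.6)):
    """P(w3 | w1, w2) as a linear interpolation of relative frequencies."""
    l1, l2, l3 = lambdas
    f_uni = uni[w3] / n
    f_bi = bi[(w2, w3)] / uni[w2] if uni[w2] else 0.0
    f_tri = tri[(w1, w2, w3)] / bi[(w1, w2)] if bi[(w1, w2)] else 0.0
    return l1 * f_uni + l2 * f_bi + l3 * f_tri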
Extended Probabilistic Language Modeling

Histories can include some indication of semantic topic:
–latent semantic indexing (vector-based information retrieval model)
–topic spotting and blending of topic-specific models
–dialogue-state-specific language models
Language models can adapt over time:
–recent history updates the model through re-estimation or blending
–often done by boosting estimates for seen words (triggers)
–new words and/or pronunciations can be added
Can estimate category tags (syntactic and/or semantic):
–joint word/category model: P(word_1:tag_1,…,word_k:tag_k)
–example: P(word:tag|History) ≈ P(word|tag) P(tag|History)
Finite State Language Modeling

Write a finite-state task grammar (with a non-recursive CFG)
Simple Java Speech API example (from the user’s guide):
public <command> = [<polite>] <action> <object> (and <object>)*;
<action> = open | close | delete;
<object> = the window | the file;
<polite> = please;
Typically assume that all transitions are equiprobable
Technology used in most current applications
Can put semantic actions in the grammar
Information Theory: Perplexity

Perplexity is the standard measure of recognition complexity given a language model
Perplexity measures the conditional likelihood of a corpus w_1,…,w_N given a language model P(.):
Perplexity = P(w_1,…,w_N)^(-1/N)
Roughly the number of equiprobable choices per word
Typically computed by taking logs and applying the history-based Bayesian decomposition:
log_2 Perplexity = -(1/N) Σ_j log_2 P(w_j | w_1,…,w_j-1)
But lower perplexity doesn’t guarantee better recognition
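The definition translates directly into code (a small sketch over per-word conditional probabilities):

import math

def perplexity(cond_probs):
    """Perplexity from per-word conditional probabilities
    P(w_j | w_1,...,w_j-1); lower means fewer effective choices."""
    n = len(cond_probs)
    log2_sum = sum(math.log2(p) for p in cond_probs)
    return 2 ** (-log2_sum / n)

# A model that always assigns 1/10 per word has perplexity 10:
assert abs(perplexity([0.1] * 5) - 10.0) < 1e-9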
Zipf’s Law

Lexical frequency is inversely proportional to rank
–Frequency(n) = frequency of the n-th most frequent word
–Zipf’s law: Frequency(Rank) = Frequency(1) / Rank
–thus: log Frequency(Rank) = log Frequency(1) - log Rank
[figure: log-log plot of frequency against rank]
From G.R. Turner’s web site on Zipf’s law: http://www.btinternet.com/~g.r.turner/ZipfDoc.htm
Vocabulary Acquisition

IBM personal e-mail corpus of PDB (by R.L. Mercer)
Static coverage is given by the most frequent n words
Dynamic coverage is given by the most recent n words
Language HMMs

Can take the HMMs for each word and combine them into a single HMM for the whole language (allows cross-word models)
Result is usually too large to expand statically in memory
[figure: two-word example, with the word1 and word2 HMMs entered with priors P(word1), P(word2) and connected by bigram transitions P(word1|word1), P(word2|word1), P(word1|word2), P(word2|word2)]
HMM Decoding

Problem is finding the best word sequence:
–ArgMax_{w_1,…,w_m} P(w_1,…,w_m | a_1,…,a_n)
Words w_1…w_m are fully determined by sequences of states
Many state sequences produce the same words
The Viterbi assumption:
–the word sequence derived from the most likely path will be the most likely word sequence (as would be computed over all paths)
The Viterbi recursion:
δ_i(s) = P_s(a_i) · max_r δ_i-1(r) P_r,s
–P_s(a_i): acoustics for state s for input a_i
–max over previous states r of the likelihood that the previous state is r, times the transition probability from r to s
Visualizing Viterbi Decoding: The Trellis

[figure: trellis of states s_1,…,s_k (vertical) against time t_i-1, t_i, t_i+1 (horizontal) with inputs a_i-1, a_i, a_i+1; each node stores δ_i(s), arcs carry transition probabilities P_j,k, and the best path threads left to right through the trellis]
Viterbi Search: Dynamic Programming

Token Passing Algorithm:
–initialize all states with a token with a null history and the likelihood that it’s a start state
–for each frame a_k:
  for each token t in state s with probability P(t) and history H:
    for each state r: add a new token to r with probability P(t) P_s,r P_r(a_k) and history s.H
Time synchronous, from left to right
Allows incremental results to be evaluated
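The same computation in compact dynamic-programming form (a sketch over a generic HMM with dense matrices; real decoders use token passing over sparse networks as described above):

import numpy as np

def viterbi(log_trans, log_emit, log_start):
    """Most likely state path.
    log_trans[r, s]: log transition probability r -> s
    log_emit[i, s]:  log P_s(a_i) for frame i
    log_start[s]:    log probability of starting in s"""
    n_frames, n_states = log_emit.shape
    delta = log_start + log_emit[0]          # best score ending in s at t=0
    back = np.zeros((n_frames, n_states), dtype=int)
    for i in range(1, n_frames):
        scores = delta[:, None] + log_trans  # scores[r, s]
        back[i] = scores.argmax(axis=0)      # best predecessor of each s
        delta = scores.max(axis=0) + log_emit[i]
    path = [int(delta.argmax())]
    for i in range(n_frames - 1, 0, -1):     # trace back the best path
        path.append(int(back[i, path[-1]]))
    return path[::-1]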
Pruning the Search Space

The entire search space for Viterbi search is much too large
Solution is to prune tokens for paths whose score is too low
Typical methods:
–histogram: only keep at most n total hypotheses
–beam: only keep hypotheses whose score is within a fraction of the best score
Need to balance a small n and a tight beam (to limit search) against search error (good hypotheses falling off the beam)
HMM densities are usually scaled differently than the discrete likelihoods from the language model
–typical solution: boost the language model’s dynamic range, using P(w)^n P(a|w), usually with n ~ 15
Often include a penalty for each word to favor hypotheses with fewer words
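In log space the beam, the histogram limit, the language-model scale, and the word insertion penalty combine as below (a sketch; the beam and penalty values are placeholders):

def prune(hyps, beam=200.0, max_hyps=1000):
    """hyps: list of (log_score, history). Keep scores within `beam` of
    the best (log-domain beam) and at most `max_hyps` total (histogram)."""
    best = max(score for score, _ in hyps)
    kept = [h for h in hyps if h[0] >= best - beam]
    kept.sort(key=lambda h: h[0], reverse=True)
    return kept[:max_hyps]

def combined_score(log_acoustic, log_lm, n_words,
                   lm_scale=15.0, word_penalty=-0.5):
    """log of P(a|w) P(w)^scale with a per-word insertion penalty."""
    return log_acoustic + lm_scale * log_lm + word_penalty * n_words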
N-best Hypotheses and Word Graphs

Keep multiple tokens and return the n-best paths/scores:
–p1: flights from Boston today
–p2: flights from Austin today
–p3: flights for Boston to pay
–p4: lights for Boston to pay
Can produce a packed word graph (a.k.a. lattice)
–likelihoods of paths in the lattice should equal the likelihoods of the n-best list
[figure: word lattice with alternative arcs flights/lights, from/for, Boston/Austin, today/to pay]
Search-based Decoding

A* search:
–compute all initial hypotheses and place them in a priority queue
–extend the best hypothesis in the queue by one observation, compute the next state score(s), and place the results back into the queue
Scoring now compares derivations of different lengths
–would like to, but can’t, compute the cost to complete until all the data is seen
–instead, estimate with a simple normalization for length
–usually prune with beam and/or histogram constraints
Easy to include unbounded amounts of history, because there is no collapsing of histories as in the dynamic-programming n-gram
Also known as a stack decoder (the priority queue is the “stack”)
Multiple Pass Decoding

Perform multiple passes, applying successively more fine-grained language models
Can much more easily go beyond finite-state or n-gram models
Can be used with Viterbi or stack decoding
Can use a word graph as an efficient interface between passes
Can compute the likelihood to complete hypotheses after each pass and use it in the next round to tighten the beam search
The first pass can even be a free phone decoder without a word-based language model
Measuring Recognition Accuracy

Word Error Rate = (insertions + substitutions + deletions) / words in the actual utterance
Example scoring:
–actual utterance: four six seven nine three three seven
–recognizer: four oh six seven five three seven (one insertion “oh”, one substitution “five” for “nine”, one deletion of “three”)
–WER: (1 + 1 + 1) / 7 = 43%
Would like to study concept accuracy:
–typically count only errors on content words [application dependent]
–ignore case marking (singular, plural, etc.)
For word/concept spotting applications:
–recall: percentage of target words (concepts) found
–precision: percentage of hypothesized words (concepts) in the target
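The metric is a word-level edit distance (a minimal sketch reproducing the slide’s example):

def word_error_rate(reference, hypothesis):
    """WER = (substitutions + insertions + deletions) / len(reference),
    computed by Levenshtein alignment over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i-1][j-1] + (ref[i-1] != hyp[j-1])
            d[i][j] = min(sub, d[i-1][j] + 1, d[i][j-1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# The slide's example: 3 errors over 7 reference words, ~43%
print(word_error_rate("four six seven nine three three seven",
                      "four oh six seven five three seven"))  # 0.428...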
Empirical Recognition Accuracies

Cambridge HTK, 1997; multipass HMM with lattice rescoring
Top performer in ARPA’s HUB-4 Broadcast News task
65,000-word vocabulary; out-of-vocabulary rate: 0.5%
Perplexities:
–word bigram: 240 (6.9 million bigrams)
–backoff trigram of 1000 categories: 238 (803K bigrams, 7.1G trigrams)
–word trigram: 159 (8.4 million trigrams)
–word 4-gram: 147 (8.6 million 4-grams)
–word 4-gram + category trigram: 137
Word error rates:
–clean, read speech: 9.4%
–clean, spontaneous speech: 15.2%
–low-fidelity speech: 19.5%
Empirical Recognition Accuracies (cont’d)

Lucent 1998, single-pass HMM
Typical of real-time telephony performance (low fidelity)
3,000-word vocabulary; out-of-vocabulary rate: 1.5%
Blended models from customer/operator and customer/system dialogues
Perplexities (customer/operator vs. customer/system):
–bigram: 105.8 (27,200 bigrams) vs. 32.1 (12,808 bigrams)
–trigram: 99.5 (68,500 trigrams) vs. 24.4 (25,700 trigrams)
Word error rate: 23%
Content term (single, pair, triple of words) precision/recall:
–one-word terms: 93.7 / 88.4
–two-word terms: 96.9 / 85.4
–three-word terms: 98.5 / 84.3
Confidence Scoring and Rejection

Alternative to standard acoustic density scoring:
–compute the HMM acoustic score for the word(s) in the usual way
–compute a baseline score for an anti-model
–compute the hypothesis ratio (word score / baseline score)
–test the hypothesis ratio against a threshold
Can be applied to:
–free word spotting (given pronunciations)
–(word-by-word) acoustic confidence scoring for later processing
–verbal information verification
  existing info: name, address, social security number
  password
Semantic Interpretation: Word Strings

Content is just the words:
–System: What is your address?
–User: fourteen eleven main street
Can also do concept extraction / keyword(s) spotting:
–User: My address is fourteen eleven main street
Applications:
–template filling
–directory services
–information retrieval
Semantic Interpretation: Pattern-Based

Simple (typically regular) patterns specify content
ATIS (Air Travel Information System) task:
–System: What are your travel plans?
–User: [On Monday], I’m going [from Boston] [to San Francisco].
–Content: [DATE=Monday, ORIGIN=Boston, DESTINATION=SFO]
Can combine content extraction and language modeling
–but can be too restrictive as a language model
Java Speech API example (curly brackets show semantic ‘actions’):
public <command> = [<polite>] <action> <object> [<polite>];
<action> = open {OP} | close {CL} | move {MV};
<object> = [<determiner>] window | door;
<determiner> = a | the | this | that | the current;
<polite> = please | kindly;
Can be generated and updated on the fly (e.g. web apps)
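A minimal pattern-based extractor in the ATIS style (the regular expressions are invented for illustration; a real system would derive them from the grammar):

import re

PATTERNS = {
    "DATE":        re.compile(r"\bon (monday|tuesday|wednesday|thursday|"
                              r"friday|saturday|sunday)\b"),
    "ORIGIN":      re.compile(r"\bfrom ([a-z ]+?)(?= to\b|[,.]|$)"),
    "DESTINATION": re.compile(r"\bto ([a-z ]+?)(?=[,.]|$)"),
}

def extract_content(utterance):
    """Fill a flat frame from regular-expression matches."""
    text = utterance.lower()
    frame = {}
    for slot, pattern in PATTERNS.items():
        m = pattern.search(text)
        if m:
            frame[slot] = m.group(1).strip()
    return frame

print(extract_content("On Monday, I'm going from Boston to San Francisco."))
# {'DATE': 'monday', 'ORIGIN': 'boston', 'DESTINATION': 'san francisco'}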
Semantic Interpretation: Parsing

In the general case, have to uncover who did what to whom:
–System: What would you like me to do next?
–User: Put the block in the box on Platform 1. [ambiguous]
–System: How can I help you?
–User: Where is A Bug’s Life playing in Summit?
Requires some kind of parsing to produce relations:
–who did what to whom: ?(where(present(in(Summit, play(BugsLife)))))
–this kind of representation is often used for machine translation
Often transferred to a flatter frame-based representation:
–Utterance type: where-question
–Movie: A Bug’s Life
–Town: Summit
Robustness and Partiality

Controlled speech:
–limited task vocabulary; limited task grammar
Spontaneous speech:
–can have a high out-of-vocabulary (OOV) rate
–includes restarts, word fragments, omissions, phrase fragments, disagreements, and other disfluencies
–contains much grammatical variation
–causes a high word error rate in the recognizer
Parsing is often partial, allowing:
–omission
–parsing of fragments
Recorded Prompts

The simplest (and most common) solution is to record prompts spoken by a (trained) human
Produces human-quality voice
Limited by the number of prompts that can be recorded
Can be extended by limited cut-and-paste or template filling
Speech Synthesis

Rule-based synthesis:
–uses linguistic rules (+/- training) to generate features
–example: DECtalk
Concatenative synthesis:
–record a basic inventory of sounds
–retrieve the appropriate sequence of units at run time
–concatenate and adjust durations and pitch
–waveform synthesis
Diphone and Polyphone Synthesis

Phone sequences capture co-articulation
Cut speech at positions that minimize context contamination
Need single phones, diphones, and sometimes triphones
Reduce the number collected by:
–phonotactic constraints
–collapsing in cases of no co-articulation
Data collection methods:
–collect data from a single (professional) speaker
–select text with maximal coverage (typically with a greedy algorithm), or
–record minimal pairs in desired contexts (real words or nonsense)
Duration Modeling

Must generate segments with the appropriate duration
Segmental identity:
–/ai/ in “like” is about twice as long as /I/ in “lick”
Surrounding segments:
–vowels are longer following voiced fricatives than voiceless stops
Syllable stress:
–onsets and nuclei of stressed syllables are longer than in unstressed syllables
Word “importance”:
–a word accent with a major pitch movement lengthens the word
Location of the syllable in the word:
–word-final longer than word-initial longer than word-internal
Location of the syllable in the phrase:
–phrase-final syllables are longer than the same syllable in other positions
Intonation: Tone Sequence Models

Functional information can be encoded via tones:
–given/new information (information status)
–contrastive stress
–phrasal boundaries (clause structure)
–dialogue act (statement/question/command)
Tone sequence models:
–F0 contours generated from phonologically distinctive tones/pitch accents which are locally independent
–generate a sequence of tonal targets and fit them with signal processing
Intonation for Function

The ToBI (Tone and Break Index) system is one example:
–pitch accents * (H*, L*, H*+L, H+L*, L*+H, L+H*)
–phrase accents - (H-, L-)
–boundary tones % (H%, L%)
–intonational phrases
[figure: statement vs. question contour example]
Source: Multilingual Text-to-Speech Synthesis, R. Sproat, ed., Kluwer, 1998
Text Markup for Synthesis

Bell Labs TTS markup:
–r(0.9) L*+H(0.8) Humpty L*+H(0.8) Dumpty r(0.85) L*(0.5) sat on a H*(1.2) wall.
–tones: Tone(Prominence)
–speaking rate: r(Rate), plus pauses
–top line (highest pitch); reference line (reference pitch); base line (lowest pitch)
SABLE is an emerging standard extending SGML: http://www.cstr.ed.ac.uk/projects/sable.html
–marks: emphasis(#), break(#), pitch(base/mid/range,#), rate(#), volume(#), semanticMode(date/time/email/URL/...), speaker(age,sex)
–implemented in the Festival synthesizer (free for research, etc.): http://www.cstr.ed.ac.uk/projects/festival.html
Intonation in Bell Labs TTS

Generate a sequence of F0 targets for synthesis
Example:
–We were away a year ago.
–phones: w E w R & w A & y E r & g O
–default declarative intonation: (H%) H* L- L% [question: L* H- H%]
Source: Multilingual Text-to-Speech Synthesis, R. Sproat, ed., Kluwer, 1998
Signal Processing for Speech Synthesis

Diphones recorded in one context must be generated in other contexts
Features are extracted from the recorded units
Signal processing manipulates features to smooth boundaries where units are concatenated
Signal processing modifies the signal via ‘interpolation’ of:
–intonation
–duration
The Source-Filter Model of Synthesis

Model of the features to be extracted and fitted
Excitation or voicing source(s) model the sound source:
–standard wave of glottal pulses for voiced sounds
–randomly varying noise for unvoiced sounds
–modification of airflow due to the lips, etc.
–high frequency (F0 rate), quasi-periodic, choppy
–modeled with a vector of glottal waveform patterns in voiced regions
Acoustic filter(s):
–shape the frequency character of the vocal tract and the radiation character at the lips
–relatively slow (samples around 5 ms suffice) and stationary
–modeled with LPC (linear predictive coding)
Barge-in

Technique to allow the speaker to interrupt the system’s speech
Combined processing of the input signal and output signal
A signal detector runs looking for speech start and end points:
–tests a generic speech model against a noise model
–typically cancels echoes created by the outgoing speech
If speech is detected:
–any synthesized or recorded speech is cancelled
–recognition begins and continues until an end point is detected
Speech Application Programming Interfaces

Abstract away from the recognition/synthesis engines
Recognizer and synthesizer loading
Acoustic and grammar model loading (dynamic updates)
Recognition:
–online
–n-best or lattice
Synthesis:
–markup
–barge-in
Acoustic control:
–telephony interface
–microphone/speaker interface
Speech API Examples

SAPI: Microsoft Speech API (recognition & synthesis)
–communicates through COM objects
–instances: most systems implement all or some of this (Dragon, IBM, Lucent, L&H, etc.)
JSAPI: Java Speech API (recognition & synthesis)
–communicates through Java events (like the GUI)
–concurrency through threads
–instances: IBM ViaVoice (recognition), L&H (synthesis)
(J)HAPI: (Java) HTK API (recognition)
–communicates through C, or a Java port of the C interface
–e.g. Entropic Cambridge Research Lab’s HMM Tool Kit (HTK)
Galaxy (recognition & synthesis)
–communicates through a production-system scripting language
–MIT system, ported by MITRE for DARPA Communicator
Discourse & Dialogue Processing

Discourse interpretation:
–understand what the user really intends by interpreting utterances in context
Dialogue management:
–determine system goals in response to user utterances, based on user intention
Response generation:
–generate natural language utterances to achieve the selected goals
Discourse Interpretation

Goal: understand what the user really intends
–interpret user utterances in context
Example: Can you move it?
–What does “it” refer to?
–Is the utterance intended as a simple yes-no query or a request to perform an action?
Issues addressed:
–reference resolution
–intention recognition
Reference Resolution

U: Where is A Bug’s Life playing in Summit?
S: A Bug’s Life is playing at the Summit theater.
U: When is it playing there?
S: It’s playing at 2pm, 5pm, and 8pm.
U: I’d like 1 adult and 2 children for the first show. How much would that cost?
Knowledge sources:
–domain knowledge
–discourse knowledge
–world knowledge
Reference Resolution: In Theory

Focus stacks:
–maintain recent objects in a stack
–select objects that satisfy semantic/pragmatic constraints, starting from the top of the stack
–may take into account discourse structure
Centering:
–backward-looking center (Cb): the object connecting the current sentence with the previous sentence
–forward-looking centers (Cf): potential Cbs of the next sentence
–rule-based filtering and ranking of objects for pronoun resolution
Reference Resolution: In Practice

Non-existent: does not allow the use of anaphoric references
Or allows only simple references:
–utilizes the focus-stack reference resolution mechanism
–does not take into account discourse structure information
Example:
U: Where is A Bug’s Life playing in Summit?
[focus stack: A Bug’s Life, Summit]
S: A Bug’s Life is playing at the Summit theater.
[focus stack: A Bug’s Life, the Summit theater]
U: When is it playing there?
[“it” and “there” resolve to the stack entries: A Bug’s Life and the Summit theater]
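A toy focus-stack resolver matching the behavior above (a sketch; the type tags are hypothetical):

class FocusStack:
    """Most-recently-mentioned entities, searched top-down."""
    def __init__(self):
        self.stack = []  # list of (entity, type) pairs, newest last

    def mention(self, entity, etype):
        self.stack = [e for e in self.stack if e[0] != entity]
        self.stack.append((entity, etype))

    def resolve(self, constraint):
        """Return the most recent entity satisfying a type constraint."""
        for entity, etype in reversed(self.stack):
            if etype == constraint:
                return entity
        return None

focus = FocusStack()
focus.mention("A Bug's Life", "movie")          # U: Where is A Bug's Life ...
focus.mention("the Summit theater", "theater")  # S: ... at the Summit theater.
print(focus.resolve("movie"))    # "it"    -> A Bug's Life
print(focus.resolve("theater"))  # "there" -> the Summit theater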
Intention Recognition

B: I have to wash my hair.
How B’s utterance should be interpreted depends on A’s preceding utterance:
A: Would you like to go to the hairdresser?
–B’s utterance should be interpreted as an acceptance of A’s proposal.
A: What’s that smell around here?
–B’s utterance should be interpreted as an answer to A’s question.
A: Would you be interested in going out to dinner tonight?
–B’s utterance should be interpreted as a rejection of A’s proposal.
Intention Recognition (Cont’d)

Goal: recognize the intent of each user utterance as one (or more) of a set of dialogue acts, based on context
Sample dialogue act inventories:
–Switchboard DAMSL: Conventional-closing, Statement-(non-)opinion, Agree/Accept, Acknowledgment, Yes-No-Question/Yes-Answer, Non-verbal, Abandoned
–Verbmobil: Greet/Thank/Bye, Suggest, Accept/Reject, Confirm, Clarify-Query/Answer, Give-Reason, Deliberate
Ongoing standardization efforts (Discourse Resource Initiative)
Intention Recognition: In Theory

Knowledge sources:
–overall dialogue goals
–orthographic features, e.g.:
  punctuation
  cue words/phrases: “but”, “furthermore”, “so”
  transcribed words: “would you please”, “I want to”
–dialogue history, i.e., previous dialogue act types
–dialogue structure, e.g.:
  subdialogue boundaries
  dialogue topic changes
–prosodic features of the utterance: duration, pause, F0, speaking rate
Intention Recognition: In Theory (Cont’d)

Finite-state dialogue grammar, e.g.:
–states linked by arcs labeled with dialogue acts: S/Greet -> U/Question -> S/Answer, looping on further U/Question acts, ending with U/Bye
Plan-based discourse understanding:
–recipes: templates for performing actions
–inference rules: to construct plausible plans
Empirical methods:
–probabilistic dialogue act classifiers: HMMs
–rule-based dialogue act recognition: CART, transformation-based learning
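The dialogue grammar above can be encoded as a transition table (a minimal sketch whose states and act labels just mirror the example arcs):

# States of a toy finite-state dialogue grammar; each arc is labeled
# with who speaks (S = system, U = user) and the dialogue act.
DIALOGUE_GRAMMAR = {
    "start":    {("S", "Greet"):    "greeted"},
    "greeted":  {("U", "Question"): "asked"},
    "asked":    {("S", "Answer"):   "answered"},
    "answered": {("U", "Question"): "asked",   # loop: follow-up question
                 ("U", "Bye"):      "done"},
}

def accepts(acts, grammar=DIALOGUE_GRAMMAR, state="start"):
    """Check whether a sequence of (speaker, act) pairs is a legal dialogue."""
    for act in acts:
        if act not in grammar.get(state, {}):
            return False
        state = grammar[state][act]
    return state == "done"

print(accepts([("S", "Greet"), ("U", "Question"), ("S", "Answer"),
               ("U", "Question"), ("S", "Answer"), ("U", "Bye")]))  # True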
Intention Recognition: In Practice

Makes assumptions about (high-level) task-specific intentions, e.g.:
–call routing: giving destination information
–ATIS: requesting flight information
–movie information system: movie showtime or theater playlist information
Does not allow user-initiated complex dialogue acts, e.g. confirmation, clarification, or indirect responses:
S1: What’s your account number?
U1: Is that the number on my ATM card?
S2: Would you like to transfer $1,500 from savings to checking?
U2: If I have enough in savings.
Intention Recognition: In Practice (Cont’d)

User utterances can play one of two roles:
–identify one of a set of possible task intentions
–provide necessary information for performing a task
Based on either keywords in an utterance or its syntactic/semantic representation
Maps keywords or representations to intentions using:
–template matching
–a probabilistic model
–vector-based similarity measures
Intention Recognition: Example

U: What time is A Bug’s Life playing at the Summit theater?
Using keyword extraction and vector-based similarity measures:
–Intention: Ask-Reference: _time
–Movie: A Bug’s Life
–Theater: the Summit quadplex
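A sketch of the vector-based route applied to this example (bag-of-words cosine similarity; the per-intention keyword profiles are invented for illustration):

import math
from collections import Counter

INTENT_PROFILES = {  # hypothetical keyword profiles per intention
    "showtime": Counter(["when", "time", "playing", "showtime"]),
    "playlist": Counter(["movies", "list", "theater", "showing"]),
}

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def classify_intent(utterance):
    words = Counter(utterance.lower().replace("?", "").split())
    return max(INTENT_PROFILES,
               key=lambda i: cosine(words, INTENT_PROFILES[i]))

print(classify_intent("What time is A Bug's Life playing at the Summit theater?"))
# -> "showtime"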
Dialogue Management: Motivating Examples

Dialogue 1:
S: Would you like movie showtime or theater playlist information?
U: Movie showtime.
S: What movie do you want showtime information about?
U: Saving Private Ryan.
S: At what theater do you want to see Saving Private Ryan?
U: Paramount theater.
S: Saving Private Ryan is not playing at the Paramount theater.
DM: Motivating Examples (Cont’d)

Dialogue 2:
S: How may I help you?
U: When is Saving Private Ryan playing?
S: For what theater?
U: The Paramount theater.
S: Saving Private Ryan is not playing at the Paramount theater, but it’s playing at the Madison theater at 3:00, 5:30, 8:00, and 10:30.
DM: Motivating Examples (Cont’d)

Dialogue 3:
S: How may I help you?
U: When is Saving Private Ryan playing?
S: For what theater?
U: The Paramount theater.
S: Saving Private Ryan is playing at the Fairmont theater at 6:00 and 8:30.
U: I wanted to know about the Paramount theater, not the Fairmont theater.
S: Saving Private Ryan is not playing at the Paramount theater, but it’s playing at the Madison theater at 3:00, 5:30, 8:00, and 10:30.
Comparison of Sample Dialogues

Dialogue 1:
–system-initiative
–implicit confirmation
–merely informs user of failed query
–mechanical
–least efficient
Dialogue 2:
–mixed-initiative
–no confirmation
–suggests alternative when query fails
–more natural
–most efficient
Dialogue 3:
–mixed-initiative
–no confirmation
–suggests alternative when query fails
–more natural
–moderately efficient
Dialogue Management

Goal: determine what to accomplish in response to user utterances, e.g.:
–answer the user question
–solicit further information
–confirm/clarify the user utterance
–notify of an invalid query
–notify of an invalid query and suggest an alternative
Interface between the user/language processing components and the system knowledge base
Dialogue Management (Cont’d)

Main design issues:
–functionality: how much should the system do?
–process: how should the system do it?
Affected by:
–task complexity: how hard the task is
–dialogue complexity: what dialogue phenomena are allowed
Affects:
–robustness
–naturalness
–perceived intelligence
Task Complexity

Application dependent
Examples (simple vs. complex):
–Call Routing vs. Travel Planning
–Weather Information vs. ATIS
–Automatic Banking vs. University Course Advisement
Directly affects:
–types and quantity of system knowledge
–complexity of the system’s reasoning abilities
Dialogue Complexity

Determines what can be talked about:
–the task only
–subdialogues: e.g., clarification, confirmation
–the dialogue itself: meta-dialogues
  Could you hold on for a minute?
  What was that click? Did you hear it?
Determines who can talk about them:
–system only
–user only
–both participants
Dialogue Management: Functionality

Determines the set of possible goals that the system may select at each turn
At the task level, dictated by task complexity
At the dialogue level, determined by the system designer in terms of dialogue complexity:
–are subdialogues allowed?
–are meta-dialogues allowed?
–only by the system, by the user, or by both agents?
DM Functionality: In Theory

Task complexity: moderate to complex
–travel planning
–university course advisement
Dialogue complexity:
–system/user-initiated complex subdialogues
  embedded negotiation subdialogues
  expressions of doubt
–meta-dialogues
–multiple dialogue threads
DM Functionality: In Practice

Task complexity: simple to moderate
–call routing
–weather information query
–train schedule inquiry
Dialogue complexity:
–about the task only
–limited system-initiated subdialogues
Dialogue Management: Process

Determines how the system will go about selecting among the possible goals
At the dialogue level, determined by the system designer in terms of initiative strategies:
–system-initiative: the system always has control; the user only responds to system questions
–user-initiative: the user always has control; the system passively answers user questions
–mixed-initiative: control switches between system and user using fixed rules
–variable-initiative: control switches between system and user dynamically, based on participant roles, dialogue history, etc.
DM Process: In Theory

Initiative strategies:
–mixed-initiative
–variable-initiative
Mechanisms for modeling initiative:
–planning and reasoning
–theorem proving
–belief modeling
Knowledge sources for modeling initiative:
–system beliefs, user beliefs, and mutual beliefs
–system domain knowledge
–dialogue history
–user preferences
DM Process: In Practice

Initiative strategies:
–user-initiative
–system-initiative
–mixed-initiative
–variable-initiative
Mechanisms for modeling initiative:
–system- and mixed-initiative: finite-state machines
–variable-initiative: evidential model for computing initiative
Knowledge sources:
–dialogue history: e.g., the user fails to make progress in the task
–participant roles: advisor/advisee vs. collaborators
–features of the current utterance: e.g., ambiguous utterance, underspecified utterance
Dialogue Management: Summary

In theory:
–focuses on developing algorithms for addressing difficult dialogue phenomena in complex tasks
In practice:
–focuses on developing robust, real-time dialogue strategies for handling common dialogue phenomena in simpler tasks
Dialogue Management: Example

System for providing movie showtime and theater playlist information
Functionality:
–task: provide simple movie information
–dialogue: allow system-initiated subdialogues
Process:
–variable-initiative
Dialogue Management Example (Cont’d)

U: When is Antz playing at the Fairmont theater?
S: Did you say the Paramount theater?
U: No, the Fairmont theater.
S: Antz is playing at the Fairmont theater at 7pm.
U: When is it playing at the Summit theater?
S: Antz is not playing at the Summit theater, but it’s playing at the AMC theater in Summit at 8pm.
Response Generation

U: Is Saving Private Ryan playing at the Chatham cinema?
Possible system responses:
–S: No, it’s not. (elliptical response)
–S: No, Saving Private Ryan is not playing at the Chatham cinema. (full response, which provides grounding information)
–S: No, Saving Private Ryan is not playing at the Chatham cinema; the theater’s under renovation. (full response and supporting evidence)
Response Generation (Cont’d)

Goal: generate natural language utterances to achieve the goal(s) selected by the dialogue manager
Issues:
–content selection: determining what to say
–surface realization: determining how to say it
Generation gap: the discrepancy between the actual output of the content selection process and the expected input of the surface realization process
Content Selection

Goal: determine the propositional content of utterances to achieve the goal(s)
Examples (nucleus paired with optional satellite):
–Antz is not playing at the Maplewood theater. [Nucleus]
  the theater’s under renovation. (evidence) [Satellite]
–Would you like the suite? [Nucleus]
  It’s the same price as the regular room. (motivation) [Satellite]
–Can you get the groceries from the car? [Nucleus]
  The key is on the dryer. (enablement) [Satellite]
Content Selection: In Theory

Knowledge sources:
–domain knowledge base
–user beliefs
–user model: user characteristics, preferences, etc.
–dialogue history
Content selection mechanisms:
–schemas: patterns of predicates
–rule-based generation
–plan-based generation:
  recipes: templates for performing actions
  planner: constructs plans for a given goal
–case-based reasoning
Content Selection: In Practice

Knowledge sources:
–domain knowledge base
–dialogue history
Pre-determined content selection strategies:
–nucleus only, no satellite information
–nucleus + fixed satellite
Surface Realization

Goal: determine how the selected content will be conveyed by natural language utterances
Examples:
–Antz is showing (shown) at the Maplewood theater.
–The Maplewood theater is showing Antz.
–It is at the Maplewood theater that Antz is shown.
–Antz, that’s what’s being shown at the Maplewood theater.
Issues:
–clausal structure construction
–lexical selection
Surface Realization: In Theory

A typical surface generator requires as input:
–the semantic representation to be realized
–the clausal structure for the generated utterance
The surface realization component utilizes a grammar to generate an utterance that conveys the given semantic representation
Surface Realization: In Practice

Canned utterances:
–pre-determined utterances for goals, e.g.:
  greetings: Hello, this is the ABC bank’s operator.
  repeat: Could you please repeat your request?
–facilitates pre-recorded prompts for speech output
Template-based generation:
–templates for goals, e.g.:
  notification: Your call is being transferred to X.
  inform: A, B, C, D, and E are playing at the F theater.
  clarify: Did you say X or Y?
–needs cut-and-paste of pre-recorded segments or a full TTS system
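A minimal template-based realizer in this style (the goal names and slots are illustrative):

TEMPLATES = {  # hypothetical goal -> template mapping
    "notify":  "Your call is being transferred to {dest}.",
    "inform":  "{movies} are playing at the {theater} theater.",
    "clarify": "Did you say {option1} or {option2}?",
}

def join_list(items):
    """Render a list as “A, B, C, D, and E”, as in the inform template."""
    if len(items) == 1:
        return items[0]
    return ", ".join(items[:-1]) + ", and " + items[-1]

def realize(goal, **slots):
    """Fill the canned template for a dialogue goal with slot values."""
    if "movies" in slots:
        slots["movies"] = join_list(slots["movies"])
    return TEMPLATES[goal].format(**slots)

print(realize("inform", movies=["Antz", "A Bug's Life", "Mulan"],
              theater="Summit"))
# Antz, A Bug's Life, and Mulan are playing at the Summit theater.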
Dialogue Evaluation

Goal: determine how “well” a dialogue system performs
Main difficulties:
–no strict right or wrong answers
–difficult to determine what features make one dialogue system better than another
–difficult to select metrics that contribute to the overall “goodness” of the system
–difficult to determine how the metrics compensate for one another
–expensive to collect new data for evaluating incremental improvements of systems
Dialogue Evaluation (Cont’d)

System-initiative, explicit confirmation:
–better task success rate
–lower WER
–longer dialogues
–fewer recovery subdialogues
–less natural
Mixed-initiative, no confirmation:
–lower task success rate
–higher WER
–shorter dialogues
–more recovery subdialogues
–more natural
Dialogue Evaluation Paradigms

Evaluating the end result only:
–reference answers
Evaluating both the end result and the process toward it:
–evaluation metrics
–performance functions
Evaluation Paradigms: Reference Answers

Evaluates the task success rate only
Suitable for query-answering systems for which a correct answer can be defined for each query
For each query:
–obtain the answer from the dialogue system
–compare with the reference answer
–score system performance
Advantage: simple
Disadvantage: ignores many other important factors that contribute to the quality of dialogue systems
Evaluation Paradigms: Evaluation Metrics

Different metrics for evaluating different components of a dialogue system:
–speech recognizer: word error rate / word accuracy
–understanding component: attribute-value matrix
–dialogue manager:
  appropriateness of system responses
  error recovery abilities
–overall system:
  task success
  average number of turns
  elapsed time
  turn correction ratio
Paradigms: Evaluation Metrics (Cont’d)

Advantage:
–takes into account the process toward completing the task
Limitations:
–difficult to determine how different metrics compensate for one another
–metrics may not be independent of one another
–does not generalize across domains and tasks
Paradigms: Performance Functions

PARADISE [Walker et al.]: derives performance functions using both task-based and dialogue-based metrics
User satisfaction:
–maximize task success
–minimize costs:
  efficiency measures: e.g., number of utterances, elapsed time
  qualitative measures: e.g., repair ratio, inappropriate utterance ratio
Performance function derivation:
–obtain user satisfaction ratings (questionnaire)
–obtain values for the various metrics (automatic or manual)
–apply multiple linear regression to derive a function relating user satisfaction to task success and the various cost factors, e.g.:
Performance = α · N(task success) − Σ_i w_i · N(cost_i), with N a normalization
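A sketch of the regression step on synthetic data (in PARADISE the normalization N(.) is a Z-score over each metric's values; the numbers below are invented for illustration):

import numpy as np

def zscore(x):
    """PARADISE's N(.): normalize a metric to zero mean, unit variance."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# One row per dialogue: [task success, #utterances, repair ratio]
metrics = np.array([[1, 12, 0.10],
                    [1, 20, 0.25],
                    [0, 30, 0.40],
                    [1,  8, 0.05],
                    [0, 25, 0.30]])
satisfaction = np.array([4.5, 3.5, 2.0, 5.0, 2.5])  # questionnaire ratings

# Multiple linear regression: satisfaction ~ weighted normalized metrics
X = np.column_stack([zscore(col) for col in metrics.T] + [np.ones(5)])
weights, *_ = np.linalg.lstsq(X, satisfaction, rcond=None)
print(weights)  # contributions of task success and each cost factor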
Paradigms: Performance Functions (Cont’d)

Advantages:
–allows for comparison of dialogue systems performing different tasks
–specifies the relative contributions of cost factors to overall performance
–can be used to make predictions about future versions of the dialogue system
Disadvantages:
–data collection cost for deriving the performance function is high
–cost of deriving performance functions for multiple systems to draw general conclusions is high
Data Collection: Wizard of Oz Paradigm

Setup for initial data collection:
–the user communicates with the “system” through telephone (speech) or keyboard (text)
–the “system” is actually a human, typically given instructions on how to behave like a system
–users are typically given tasks to perform in the target domain
–subjects are the users; the “system” can be played by one person
–dialogues between “system” and user are recorded and transcribed
Setup for intermediate system evaluation:
–use the actual running system, with wizard supervision
–the wizard can override undesirable system behavior, e.g., correct ASR errors
Data Collection: Wizard of Oz (Cont’d)

Features of collected data:
–typically much less complex than actual human-human dialogues performing the same tasks
–captures how humans behave when they talk to computers
–captures variations among different subjects in both language and approach when performing the same tasks
Use of collected data:
–particularly useful for designing the interpretation component of the dialogue system
–useful for training ASR systems
–may also be helpful for designing the dialogue management and response generation components
Summary

Signal Processing
Speech Recognition
Semantic Interpretation
Discourse Interpretation
Dialogue Management
Response Generation
Speech Synthesis
Publicly Available Telephone Demos

Nuance: http://www.nuance.com/demo/index.html
–Banking: 1-650-847-7438
–Travel Planning: 1-650-847-7427
–Stock Quotes: 1-650-847-7423
SpeechWorks: http://www.speechworks.com/demos/demos.htm
–Banking: 1-888-729-3366
–Stock Trading: 1-800-786-2571
MIT Spoken Language Systems Laboratory: http://www.sls.lcs.mit.edu/sls/whatwedo/applications.html
–Travel Plans (Pegasus): 1-877-648-8255
–Weather (Jupiter): 1-888-573-8255
IBM: http://www.software.ibm.com/speech/overview/business/demo.html
–Mutual Funds, Name Dialing: 1-877-VIA-VOICE