Flexible Turn-Taking for Spoken Dialog Systems
PhD Thesis Defense, Antoine Raux
Language Technologies Institute, CMU, December 12, 2008
Thesis Committee: Maxine Eskenazi (chair), Alan W Black, Reid Simmons, Diane J. Litman
Spoken Dialog Systems
Spoken dialog systems have long promised to improve human-machine interaction: speech is a natural means of communication, and recent improvements in the underlying technologies have made such systems a reality.
Sometimes they work…
S: What can I do for you?
U: I'd like to go to the Waterfront.
S: Going to Waterfront. Is this correct?
U: Yes.
S: Alright. Where do you want to leave from?
U: Oakland.
S: Leaving from Oakland. When are you going to take that bus?
U: Now.
S: The next bus. Hold on. Let me check that for you. The next 61C leaves Forbes Avenue at Atwood Children's Hospital at 5:16 PM.
…but not always…
S: What can I do for you?
U: 'kay. 51C Carrick from Century Square to Downtown Pittsburgh, to Oakland.
S: The 61… If you want… Leaving from Oakland. Is this correct?
U: 51C leaving Century Square going to Oakland, I mean go to South Side.
S: Leaving… Leaving from McKeesport. Is…
U: No. Leaving from Century Square.
S: Leaving from McKeesport. Did I get that right?
Key Definitions
(Conversational) Floor: "The right to address an assembly" (Merriam-Webster). The interactional state that describes which participant in a dialog has the right to provide or request information at any point.
Turn-Taking: the process by which participants in a conversation alternately own the conversational floor.
Thesis Statement
Incorporating different levels of knowledge using a data-driven decision model will improve the turn-taking behavior of spoken dialog systems. Specifically, turn-taking can be modeled as a finite-state decision process operating under uncertainty.
Floor, Intentions and Beliefs
The floor is not an observable state. Rather, participants have:
– intentions to claim the floor or not
– beliefs over whether others are claiming it
Participants negotiate the floor to limit gaps and overlaps. [Sacks et al 1974, Clark 1996]
Uncertainty over the Floor
Uncertainty over the floor leads to breakdowns in turn-taking:
– Cut-ins
– Latency
– Barge-in latency
– Self interruptions
Turn-Taking Errors by System
Cut-ins: the system grabs the floor before the user releases it.
U: 'kay. 51C Carrick from Century Square (…)
S: The 61…
Latency: the system waits after the user has released the floor.
S: (…) Is this correct?
U: Yeah.
S: Alright (…)
Turn-Taking Errors by System
Barge-in latency: the system keeps the floor while the user is claiming it.
S: For example, you can say "When is the next 28X from downtown to the airport?" or "I'd like to go from McKee…
U: When is the next 54…
S: Leaving from Atwood. Is this correct?
Self interruptions: the system releases the floor while the user is not claiming it.
S: What can I do for you?
U: 61A.
S: For example, you can say when is… Where would you li… Let's proceed step by step. Which neighb… Leaving from North Side. Is this correct?
Outline
– Introduction
– An architecture for dialog and interaction management
– Using dialog features to inform turn-taking
– A domain-independent data-driven turn-taking model
– Conclusion
Pipeline Architectures
Speech Recognition → Natural Language Understanding → Dialog Management (+ Backend) → Natural Language Generation → Speech Synthesis
Turn-taking is imposed by full-utterance-based processing:
– sequential processing, lack of reactivity
– no sharing of information across modules
– hard to extend to multimodal/asynchronous events
Multi-layer Architectures
Separate reactive from deliberative behavior (turn-taking vs dialog act planning); different layers work asynchronously. [Thorisson 1996, Allen et al 2001, Lemon et al 2003]
But no previous work:
– addressed how the conversational floor interacts with dialog management
– successfully deployed a multi-layer architecture in a broadly used system
Proposed Architecture: Olympus 2
[Architecture diagram: Speech Recognition and other Sensors feed an Interaction Management layer, which sits between Dialog Management (with its Backend) and the output side: Natural Language Generation, Speech Synthesis, and other Actuators.]
Olympus 2 Architecture
– Explicitly models turn-taking
– Integrates dialog features from both low and high levels
– Operates on generalized events and actions
– Uses floor state to control planning of conversational acts
Olympus 2 Deployment
Ported Let's Go to Olympus 2:
– publicly deployed telephone bus information system
– originally built using Olympus 1
The new version has processed about 30,000 dialogs since deployment, with no performance degradation. This allows research on turn-taking models to be guided by real users' behavior.
Outline
– Introduction
– An event-driven architecture for spoken dialog systems
– Using dialog features to inform turn-taking
  – End-of-turn detection
  – Decision tree-based thresholds
  – Batch evaluation
  – Live evaluation
End-of-Turn Detection
S: What can I do for you?
U: I'd like to go to the airport.
Detecting when the user releases the floor. Potential problems: cut-ins, latency.
End-of-Turn Detection
S: What can I do for you?
U: I'd like to go to the airport. ← end of turn
Latency / Cut-in Tradeoff
S: What can I do for you?
U: I'd like to go to the airport.
Long threshold → few cut-ins, long latency
Latency / Cut-in Tradeoff
S: What can I do for you?
U: I'd like to go to the airport.
Long threshold → few cut-ins, long latency
Short threshold → many cut-ins, short latency
Can we exploit dialog information to get the best of both worlds?
End-of-Turn Detection as Classification
Classify pauses as internal/final based on words, syntax, prosody [Sato et al 2002]. Repeat the classification every n milliseconds until the pause ends or end-of-turn is detected [Ferrer et al 2003, Takeuchi et al 2004].
But no previous work:
– successfully combined a wide range of features
– tested the model in a real dialog system
Outline
– Introduction
– An architecture for dialog and interaction management
– Using dialog features to inform turn-taking
  – End-of-turn detection
  – Decision tree-based thresholds
  – Batch evaluation
  – Live evaluation
Using Variable Thresholds
S: What can I do for you?
U: I'd like to go to the airport.
Features:
– Discourse (dialog state): open question, specific question, confirmation
– Semantics (partial ASR): does the partial hypothesis match current expectations?
– Prosody (F0, duration)
– Timing (pause start)
– Speaker (average number of pauses)
Example Decision Tree
Trained on 1326 dialogs with the Let's Go public dialog system.
[Tree diagram: the root splits on utterance duration < 2000 ms; further splits test whether the partial ASR matches expectations, whether it contains "YES", its length in words, the average pause duration, whether the dialog state is an open question, the average non-understanding ratio, and consecutive user turns without a system prompt. Leaf thresholds range from 200 ms to 1440 ms.]
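The tree itself is an image on the original slide. As a rough sketch of how such a threshold tree could be built (not the thesis code: the features, the synthetic data, and the per-leaf 95th-percentile rule below are all illustrative assumptions), one can train a classifier tree over pause features and attach one endpointing threshold to each leaf:

```python
# Sketch: tree over pause features, one endpointing threshold per leaf.
# Features, data, and the percentile rule are illustrative, not the thesis's.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.uniform(0, 5000, n),   # utterance duration (ms)
    rng.integers(0, 2, n),     # partial ASR matches expectations (0/1)
    rng.uniform(0, 600, n),    # speaker's average pause duration (ms)
])
is_final = rng.integers(0, 2, n)     # 1 = pause ends the turn (toy labels)
pause_len = rng.exponential(300, n)  # observed pause durations (ms)

tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50).fit(X, is_final)

# Per-leaf threshold: e.g. the 95th percentile of internal-pause durations
# in that leaf, so only ~5% of internal pauses would be cut into.
leaves = tree.apply(X)
thresholds = {
    leaf: float(np.percentile(pause_len[(leaves == leaf) & (is_final == 0)], 95))
    for leaf in np.unique(leaves)
    if ((leaves == leaf) & (is_final == 0)).any()
}

def endpoint_threshold(features):
    """Look up the endpointing threshold for one pause's feature vector."""
    leaf = tree.apply(np.asarray(features, float).reshape(1, -1))[0]
    return thresholds.get(leaf, 700.0)  # fall back to a fixed 700 ms

print(endpoint_threshold([1500, 1, 250]))
```

At runtime, the lookup is re-run at each pause with the current dialog state and partial hypothesis, which is what makes the threshold "variable".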
Example Decision Tree
Same tree as the previous slide, applied as the user speaks; partial hypothesis so far: "I'd like to go to"
Example Decision Tree
Same tree as the previous slide; partial hypothesis: "I'd like to go to the airport."
Outline
– Introduction
– An architecture for dialog and interaction management
– Using dialog features to inform turn-taking
  – End-of-turn detection
  – Decision tree-based thresholds
  – Batch evaluation
  – Live evaluation
Performance per Feature Set
22% latency reduction, 38% cut-in rate reduction
Performance per Feature Set
Semantics is the most useful feature type (compared against all features combined).
Outline
– Introduction
– An architecture for dialog and interaction management
– Using dialog features to inform turn-taking
  – End-of-turn detection
  – Decision tree-based thresholds
  – Batch evaluation
  – Live evaluation
Live Evaluation
Implemented the decision tree in the Let's Go Interaction Manager.
Operating point: 3% cut-in rate, 635 ms average latency.
1061 dialogs collected in May '08:
– 548 control dialogs (fixed threshold = 700 ms)
– 513 treatment dialogs (decision tree)
Cut-in Rate per Dialog State
Fewer cut-ins overall (p<0.05); largest improvement after open requests.
Average Latency per State
Faster on confirmations, slower on answers to open questions.
Non-Understanding Rate per State
Significant reduction after confirmations (p < 0.01).
Outline
– Introduction
– An architecture for dialog and interaction management
– Using dialog features to inform turn-taking
– A domain-independent data-driven turn-taking model
  – The Finite-State Turn-Taking Machine
  – Application to end-of-turn detection
    – In pauses
    – Anytime
  – Application to barge-in detection
– Conclusion
The Finite-State Turn-Taking Machine
[State diagram, built up over four slides: the floor states User and System, the free states Free_S and Free_U, and the overlap states Both_S and Both_U.]
Similar models were proposed by Brady (1969) and Jaffe and Feldstein (1970) for the analysis of human conversations.
Uncertainty in the FSTTM
The system:
– knows whether it is claiming the floor or not
– holds probabilistic beliefs over whether the user is
This yields a probability distribution over the state. In some (useful) cases, approximations allow us to reduce the uncertainty to two states:
– User vs Free_U during user utterances
– System vs Both_S during system prompts
Making Decisions with the FSTTM
Actions:
– YIELD, KEEP if the system is currently holding the floor
– GRAB, WAIT if it is not
– Different costs in different states
Decision-theoretic action selection: pick the action with the lowest expected cost given the belief distribution over the states.
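A minimal sketch of this selection rule (the cost values and belief below are illustrative, not the thesis's trained numbers):

```python
# Decision-theoretic action selection in the FSTTM (sketch).
# Costs and belief values are illustrative.

def expected_cost(action, belief, costs):
    """belief: {state: probability}; costs: {(action, state): cost}."""
    return sum(p * costs[(action, state)] for state, p in belief.items())

def select_action(actions, belief, costs):
    """Pick the action with the lowest expected cost under the belief."""
    return min(actions, key=lambda a: expected_cost(a, belief, costs))

# System is silent during a user pause: choose between WAIT and GRAB.
belief = {"USER": 0.3, "FREE_U": 0.7}
costs = {
    ("WAIT", "USER"): 0.0,    ("WAIT", "FREE_U"): 400.0,   # latency cost
    ("GRAB", "USER"): 5000.0, ("GRAB", "FREE_U"): 0.0,     # cut-in cost
}
print(select_action(["WAIT", "GRAB"], belief, costs))  # → WAIT
```

Here E[WAIT] = 0.7 × 400 = 280 while E[GRAB] = 0.3 × 5000 = 1500, so the system keeps waiting despite believing the floor is probably free.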
Outline
– Introduction
– An architecture for dialog and interaction management
– Using dialog features to inform turn-taking
– A domain-independent data-driven turn-taking model
  – The Finite-State Turn-Taking Machine
  – Application to end-of-turn detection
    – In pauses
    – Anytime
  – Application to barge-in detection
End-of-Turn Detection in the FSTTM
[State diagram: while the user holds the floor, the system WAITs; once the floor becomes free (Free_U), the system GRABs.]
Action/State Cost Matrix in Pauses
– Latency cost increases linearly with time
– Constant cut-in cost

System action \ Floor state | User           | Free_U
WAIT                        | 0              | C_G · t (time in pause)
GRAB                        | C_U (constant) | 0
Action Selection
At time t in a pause, take the action with minimal expected cost: GRAB once the expected cut-in cost C_U · P(User) drops below the expected latency cost C_G · t · P(Free_U), WAIT otherwise.
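A hedged sketch of that rule with the linear latency cost from the cost matrix (the constants C_G and C_U below are illustrative, not the tuned values):

```python
# In-pause decision (sketch): WAIT while the expected latency cost is below
# the expected cut-in cost, GRAB afterwards. Constants are illustrative.
C_G = 1.0    # latency cost per ms spent waiting while the floor is free
C_U = 800.0  # constant cut-in cost

def action_at(t_ms, p_free_u):
    """Choose WAIT or GRAB t_ms into a pause, given P(Free_U)."""
    wait_cost = C_G * t_ms * p_free_u          # expected latency cost
    grab_cost = C_U * (1.0 - p_free_u)         # expected cut-in cost
    return "GRAB" if grab_cost <= wait_cost else "WAIT"

# With P(Free_U)=0.8, GRAB once 0.8·t >= 800·0.2 = 160, i.e. t >= 200 ms.
print(action_at(100, 0.8), action_at(250, 0.8))  # → WAIT GRAB
```

Because the WAIT cost grows with t while the GRAB cost does not, this amounts to a belief-dependent endpointing threshold rather than a fixed one.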
Estimating State Probabilities
Exponential decay: starting from the probabilities that the user keeps or releases the floor, both estimated at the beginning of the pause, the belief that the user still holds the floor decays exponentially as the silence lengthens.
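The slide's equations are images and not recoverable verbatim; one natural reading of the exponential-decay update, assuming internal silences are exponentially distributed with mean duration μ (an assumption consistent with the extra slides), is:

```python
import math

# Sketch: belief that the user has released the floor, t ms into a pause.
# p0_free is the pause-start estimate of P(Free_U); mu is illustrative.
def p_free_u(t_ms, p0_free, mu=300.0):
    p0_user = 1.0 - p0_free
    # P(silent for t | User keeps floor) decays as exp(-t/mu);
    # P(silent for t | Free_U) = 1 (the user stays silent after releasing).
    return p0_free / (p0_free + p0_user * math.exp(-t_ms / mu))

# Starting from 50/50, the belief rises toward 1 as the silence lengthens.
print(round(p_free_u(0, 0.5), 2), round(p_free_u(600, 0.5), 2))
```

The longer the silence, the less likely it is an internal pause, so P(Free_U) grows monotonically toward 1 within each pause.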
Estimating P(Free_U)
Step-wise logistic regression. Selected features:
– boundary LM score, "YES" in ASR hypothesis
– energy, F0 before the pause
– barge-in

                     | Baseline | Logistic Regression
Classification Error | 21.9%    | 21.7%
Log Likelihood       | -0.52    | -0.44
In-Pause Detection Results
28% latency reduction
Outline
– Introduction
– An event-driven architecture for spoken dialog systems
– Using dialog features to inform turn-taking
– A domain-independent data-driven turn-taking model
  – The Finite-State Turn-Taking Machine
  – Application to end-of-turn detection
    – At pauses
    – Anytime
  – Application to barge-in detection
Delays in Pause Detection
U: I'd like to go to the airport.
About 200 ms elapse between the start of a pause and the VAD change of state. In some cases, we can make the decision before VAD detection:
– from partial hypotheses during speech
– with the previous model once a pause is detected
→ Anytime End-of-Turn Detection
End-of-turn Detection in Speech
Cost matrix with constant costs, which leads to a fixed threshold on P(Free_U):

System action \ Floor state | User           | Free_U
WAIT                        | 0              | C_W (constant)
GRAB                        | C_U (constant) | 0
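A tiny check of that equivalence (constants illustrative): with constant costs, GRAB has lower expected cost exactly when P(Free_U) exceeds C_U / (C_U + C_W), so the decision rule is a fixed threshold:

```python
# With constant costs, expected-cost selection during speech reduces to a
# fixed threshold on P(Free_U). C_W and C_U are illustrative constants.
C_W, C_U = 200.0, 800.0
threshold = C_U / (C_U + C_W)  # 0.8

def grab(p_free_u):
    """GRAB iff E[cost(GRAB)] <= E[cost(WAIT)]."""
    return C_U * (1.0 - p_free_u) <= C_W * p_free_u

# The expected-cost rule and the threshold rule agree everywhere.
assert all(grab(p) == (p >= threshold) for p in [0.0, 0.5, 0.79, 0.8, 0.9, 1.0])
print(threshold)  # → 0.8
```

This is why the in-speech detector can be run as a simple comparison against one tuned number instead of a full cost computation at every partial hypothesis.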
Estimating P(Free_U) in Speech
Step-wise logistic regression. Features:
– boundary LM score, "YES"/"NO" in hypothesis
– number of words
– barge-in

                     | Baseline | Logistic Regression
Classification Error | 38.9%    | 19.2%
Log Likelihood       | -0.67    | -0.45
Anytime Detection Results
35% latency reduction
Histogram of Turn Latencies
[Histogram: highly predictable ends of turns vs less predictable ends of turns.]
Histogram of Turn Latencies
– 10% of turn ends detected during speech
– 40% of highly predictable cases get predicted during speech
– no change to less predictable cases
Outline
– Introduction
– An architecture for dialog and interaction management
– Using dialog features to inform turn-taking
– A generic, trainable turn-taking model
  – The Finite-State Turn-Taking Machine
  – Application to end-of-turn detection
    – At pauses
    – Anytime
  – Application to barge-in detection
Barge-in Detection in the FSTTM
[State diagram: while the system holds the floor, it chooses between KEEP and YIELD; if the user starts speaking (Both_S), the system should YIELD.]
Cost Matrix during System Prompts
Constant costs, equivalent to setting a threshold on P(Both_S):

System action \ Floor state | System         | Both_S
KEEP                        | 0              | C_O (constant)
YIELD                       | C_S (constant) | 0
Estimating P(Both_S)
Estimated at each new partial ASR hypothesis. Logistic regression with features:
– partial hypothesis matches expectations
– cue words in the hypothesis, selected using mutual information on a previous corpus (e.g., "When" in a state where "When is the next/previous bus" is expected)
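A toy sketch of cue-word selection by mutual information (the corpus, labels, and whitespace tokenization below are illustrative, not the thesis data):

```python
import math
from collections import Counter

# Rank candidate cue words by mutual information between a word's presence in
# a partial hypothesis and whether the utterance was a genuine barge-in (1).
corpus = [
    ("when is the next bus", 1), ("when is the previous bus", 1),
    ("yes", 0), ("no", 0), ("uh the next one", 1), ("yes please", 0),
]

def mutual_information(word):
    n = len(corpus)
    joint = Counter((word in text.split(), label) for text, label in corpus)
    px = Counter(word in text.split() for text, _ in corpus)
    py = Counter(label for _, label in corpus)
    mi = 0.0
    for (x, y), c in joint.items():
        pxy = c / n
        mi += pxy * math.log2(pxy / ((px[x] / n) * (py[y] / n)))
    return mi

vocab = {w for text, _ in corpus for w in text.split()}
ranked = sorted(vocab, key=mutual_information, reverse=True)
print(ranked[:3])
```

Words that appear almost exclusively in barge-ins (or exclusively in non-barge-ins) score high and become features for the P(Both_S) regression.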
Barge-in Detection Results
Outline
– Introduction
– An architecture for dialog and interaction management
– Using dialog features to inform turn-taking
– A generic, trainable turn-taking model
– Conclusion
Thesis Statement
Incorporating different levels of knowledge using a data-driven decision model will improve the turn-taking behavior of spoken dialog systems.
→ latency and/or cut-in rate reduced by both the decision tree and the FSTTM; semantic features most useful
Specifically, turn-taking can be modeled as a finite-state decision process operating under uncertainty.
→ the FSTTM
Contributions
– An architecture for spoken dialog systems that incorporates dialog and interaction management
– Analysis of dialog features underlying turn-taking
– The Finite-State Turn-Taking Machine: a domain-independent, data-driven turn-taking model that improves end-of-turn and barge-in detection
Extending the FSTTM
A framework to organize turn-taking. Extensions:
– generalized FSTTM topology (multi-party conversation)
– richer cost functions (non-linear latency cost, non-uniform cut-in cost, etc.)
– better tracking of uncertainty (priors, Partially Observable Markov Decision Processes)
FSTTM Dialog
S: What can I do for you?
U: Next bus from Fifth and Negley to Fifth and Craig.
S: Leaving from Fifth and Negley. Is this correct?
U: Yes.
S: Alright. Going to Fifth and Craig. Is this correct?
U: Yes.
S: Alright. I think you want the next bus. Am I…
U: Yes.
S: Right. Just a minute. I'll look that up. The next 71D leaves Fifth Avenue at Negley at 10:54 AM.
Thank you! Questions?
Extra Slides
Building Threshold Decision Trees
1. Cluster pauses using automatically extracted features from discourse, semantics, prosody, timing and speaker.
2. Set one threshold for each cluster so as to minimize overall latency.
Learning Curve
Estimating Parameters
μ (mean pause duration) can be:
– the overall mean pause duration
– a state-specific mean pause duration
– predicted using dialog features through step-wise generalized linear regression (correlation: 0.42; features: barge-in, dialog state, LM score, "YES")
Endpointing Threshold
The threshold is the solution of: [equation lost from slide image]
User Barge-in Time Distribution
[Histogram over the prompt "Going to. Is this correct?"]
The First Step
Two versions of the same dialog, differing only in confirmation prompts:
Version 1:
S: What can I do for you?
U: When is the next 54C coming to 18th street?
S: The 54C, right?
U: Yes.
S: Okay. Where do you wanna go?
U: Carson.
S: Carson, correct?
U: Yes.
S: Okay. Let me check that for you.
Version 2: the same dialog with "The 54C. Did I get that right?" and "Going to Carson. Is this correct?" as the confirmation prompts.
The First Step
S: What can I do for you?
U: When is the next 54C coming to 18th street?
S: The 54C, right?
U: Yes.
S: Okay. Where do you wanna go?
U: Carson.
S: Carson, correct?
U: Yes.
S: Okay. Let me check that for you.
[Annotations: Prosody, Prompt design, Turn-Taking, Incremental Processing]
Spoken Dialog Systems
S: What can I do for you?
U: I'd like to go to the Waterfront.
S: Going to Waterfront. Is this correct?
U: Yes.
S: Alright. Where do you want to leave from?
U: Oakland.
S: Leaving from Oakland. When are you going to take that bus?
U: Now.
S: The next bus. Hold on. Let me check that for you. The next 61C leaves Forbes Avenue at Atwood Children's Hospital at 5:16 PM.
Turn Endpointing
S: What can I do for you?
U: I'd like to go to the airport.
[Diagram: VAD output alternates between Speech Detected and Silence Detected; the endpoint fires once detected silence exceeds the threshold.]
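The fixed-threshold mechanism in the diagram can be sketched over VAD frames (the 10 ms frame size and 700 ms threshold below are illustrative):

```python
# Sketch of fixed-threshold endpointing over VAD frames.
def endpoint_frame(vad_frames, threshold_ms=700, frame_ms=10):
    """Return the frame index where the endpoint fires, or None.

    vad_frames: sequence of booleans, True = speech detected in that frame.
    """
    silence_ms = 0
    seen_speech = False
    for i, is_speech in enumerate(vad_frames):
        if is_speech:
            seen_speech, silence_ms = True, 0   # reset on any speech
        elif seen_speech:                        # ignore leading silence
            silence_ms += frame_ms
            if silence_ms >= threshold_ms:
                return i
    return None

# 500 ms of speech then 800 ms of silence: fires 700 ms into the silence.
frames = [True] * 50 + [False] * 80
print(endpoint_frame(frames))  # → 119
```

Every internal pause longer than the threshold produces a cut-in, and every final pause costs at least the full threshold in latency, which is the tradeoff the following slides illustrate.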
Endpointing Issues
S: What can I do for you?
U: I'd like to go to the airport.
[Diagram: the threshold fires inside an internal pause → cut-in.]
End-of-Turn Detection Issues
S: What can I do for you?
U: I'd like to go to the airport.
[Diagram: the threshold delays the endpoint after the final pause → latency.]
The Endpointing Trade Off
S: What can I do for you?
U: I'd like to go to the airport.
Long threshold → few cut-ins, long latency
The Endpointing Trade Off
S: What can I do for you?
U: I'd like to go to the airport.
Long threshold → few cut-ins, long latency
Short threshold → many cut-ins, short latency
Using Variable Thresholds
S: What can I do for you?
U: I'd like to go to the airport.
Features: discourse (dialog state), semantics (partial ASR), prosody (F0, duration), timing (pause start), speaker (average number of pauses)
Standard Approach to Turn-Taking in Spoken Dialog Systems
Typically not explicitly modeled. Rules based on low-level features:
– threshold-based end-of-utterance detection
– (optionally) barge-in detection
Fixed behavior, not integrated in the overall dialog model.
The Finite-State Turn-Taking Machine: Transitions
[State diagram builds, one transition per slide:]
– USER YIELDS: smooth transition
– SYSTEM GRABS: smooth transition
– SYSTEM WAITS / USER WAITS: latency
– SYSTEM GRABS (while the user holds the floor): cut-in
– SYSTEM GRABS: time out
– USER GRABS: barge-in
– SYSTEM YIELDS: barge-in
Optimal C_W
C_U is set to maintain an overall cut-in rate of 5%.
Estimating State Probabilities
– The user remains silent indefinitely at the end of a turn (no transition Free_S → User).
– Without knowledge of silence duration, P_t(Free_S) = P_0(Free_S).
– Probability that the user is still silent at time t, given that they haven't finished their turn: assuming an exponential distribution on internal silence durations, with μ the mean pause duration, this is e^(−t/μ).
Reducing Uncertainty
Different levels of information can help reduce uncertainty over the floor:
– Immediate information: syntax, semantics, prosody of the current turn…
– Discourse information: dialog state, task structure, expectations…
– Environment information: acoustic conditions, user characteristics…
Endpointing Threshold
The endpointing threshold is the solution of: [equation lost from slide image]
Parameter K is set empirically (typically 10,000 ms).
Pause Endpointing Results