2 Flexible Turn-Taking for Spoken Dialog Systems
PhD Thesis Defense
Antoine Raux, Language Technologies Institute, CMU
December 12, 2008
Thesis Committee: Maxine Eskenazi (chair), Alan W Black, Reid Simmons, Diane J. Litman

3 Spoken Dialog Systems
Spoken dialog systems have long promised to improve human-machine interaction.
Speech is a natural means of communication.
Recent improvements in underlying technologies have made such systems a reality.

4 Sometimes they work…
S: What can I do for you?
U: I’d like to go to the Waterfront.
S: Going to Waterfront. Is this correct?
U: Yes.
S: Alright. Where do you want to leave from?
U: Oakland.
S: Leaving from Oakland. When are you going to take that bus?
U: Now.
S: The next bus. Hold on. Let me check that for you. The next 61C leaves Forbes Avenue at Atwood Children’s Hospital at 5:16 PM.

5 …but not always…
S: What can I do for you?
U: ‘kay. 51C Carrick from Century Square to Downtown Pittsburgh, to Oakland.
S: The 61… If you want… Leaving from Oakland. Is this correct?
U: 51C leaving Century Square going to Oakland, I mean go to South Side.
S: Leaving… Leaving from McKeesport. Is…
U: No. Leaving from Century Square.
S: Leaving from McKeesport. Did I get that right?

6 Key Definitions
(Conversational) Floor: “The right to address an assembly” (Merriam-Webster); the interactional state that describes which participant in a dialog has the right to provide or request information at any point.
Turn-Taking: The process by which participants in a conversation alternately own the conversational floor.

7 Thesis Statement
Incorporating different levels of knowledge using a data-driven decision model will improve the turn-taking behavior of spoken dialog systems.
Specifically, turn-taking can be modeled as a finite-state decision process operating under uncertainty.

8 Floor, Intentions and Beliefs
The floor is not an observable state. Rather, participants have:
– intentions to claim the floor or not
– beliefs over whether others are claiming it
Participants negotiate the floor to limit gaps and overlaps. [Sacks et al 1974, Clark 1996]

9 Uncertainty over the Floor
Uncertainty over the floor leads to breakdowns in turn-taking:
– Cut-ins
– Latency
– Barge-in latency
– Self interruptions

10 Turn-Taking Errors by System
Cut-ins: the system grabs the floor before the user releases it.
U: ‘kay. 51C Carrick from Century Square (…)
S: The 61…
Latency: the system waits after the user has released the floor.
S: (…) Is this correct?
U: Yeah.
S: Alright (…)

11 Turn-Taking Errors by System
Barge-in latency: the system keeps the floor while the user is claiming it.
S: For example, you can say “When is the next 28X from downtown to the airport?” or “I’d like to go from McKee…
U: When is the next 54…
S: Leaving from Atwood. Is this correct?
Self interruptions: the system releases the floor while the user is not claiming it.
S: What can I do for you?
U: 61A.
S: For example, you can say when is… Where would you li… Let’s proceed step by step. Which neighb… Leaving from North Side. Is this correct?

12 Outline
– Introduction
– An architecture for dialog and interaction management
– Using dialog features to inform turn-taking
– A domain-independent data-driven turn-taking model
– Conclusion

13 Pipeline Architectures
[Diagram: Speech Recognition → Natural Language Understanding → Dialog Management (+ Backend) → Natural Language Generation → Speech Synthesis]
– Turn-taking imposed by full-utterance-based processing
– Sequential processing → lack of reactivity
– No sharing of information across modules
– Hard to extend to multimodal/asynchronous events

14 Multi-layer Architectures
– Separate reactive from deliberative behavior (turn-taking vs dialog act planning)
– Different layers work asynchronously [Thorisson 1996, Allen et al 2001, Lemon et al 2003]
But no previous work:
– addressed how the conversational floor interacts with dialog management
– successfully deployed a multi-layer architecture in a broadly used system

15 Proposed Architecture: Olympus 2
[Architecture diagram with components: Speech Recognition, Sensors, Natural Language Understanding, Interaction Management, Dialog Management, Backend, Natural Language Generation, Speech Synthesis, Actuators.]

16 Olympus 2 Architecture
[Same architecture diagram.]
– Explicitly models turn-taking
– Integrates dialog features from both low and high levels
– Operates on generalized events and actions
– Uses floor state to control planning of conversational acts
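As an illustration of the event-driven control this slide describes, below is a minimal, hypothetical Python sketch of an interaction-manager loop. All class, event, and state names are invented for illustration; they are not the actual Olympus 2 / Let’s Go API.

```python
# Hypothetical sketch of an event-driven interaction manager.
# Names and event kinds are illustrative, not the Olympus 2 API.
from dataclasses import dataclass
from queue import Queue
from typing import Any

@dataclass
class Event:
    source: str          # e.g. "asr", "tts", "dialog_manager"
    kind: str            # e.g. "speech_start", "end_of_turn", "prompt_done"
    payload: Any = None

class InteractionManager:
    """Reactive layer: consumes events, tracks the floor, releases planned prompts."""
    def __init__(self) -> None:
        self.events: Queue = Queue()
        self.floor = "FREE"          # simplified floor state: USER / SYSTEM / FREE
        self.pending_prompts = []    # conversational acts planned by the dialog manager

    def post(self, event: Event) -> None:
        self.events.put(event)

    def run_once(self) -> None:
        event = self.events.get()
        if event.kind == "speech_start":
            self.floor = "USER"                  # the user is claiming the floor
        elif event.kind in ("end_of_turn", "prompt_done"):
            self.floor = "FREE"                  # the floor has been released
            self._maybe_speak()

    def _maybe_speak(self) -> None:
        # Planned prompts only go out when the floor state allows it.
        if self.floor == "FREE" and self.pending_prompts:
            prompt = self.pending_prompts.pop(0)
            self.floor = "SYSTEM"
            print(f"TTS: {prompt}")

im = InteractionManager()
im.pending_prompts.append("Where are you leaving from?")
im.post(Event("asr", "end_of_turn"))
im.run_once()   # floor becomes FREE, the pending prompt is sent to synthesis
```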

17 Olympus 2 Deployment
– Ported Let’s Go to Olympus 2: a publicly deployed telephone bus information system, originally built using Olympus 1
– The new version has processed about 30,000 dialogs since deployment, with no performance degradation
– Allows research on turn-taking models to be guided by real users’ behavior

18 Outline
– Introduction
– An event-driven architecture for spoken dialog systems
– Using dialog features to inform turn-taking
  – End-of-turn detection
  – Decision tree-based thresholds
  – Batch evaluation
  – Live evaluation

19 End-of-Turn Detection
[Timeline: S: What can I do for you? / U: I’d like to go to the airport.]
Detecting when the user releases the floor. Potential problems: cut-ins, latency.

20 End-of-Turn Detection
[Same timeline: S: What can I do for you? / U: I’d like to go to the airport. The end of the user’s turn is marked.]

21 Latency / Cut-in Tradeoff
[Same timeline figure]
Long threshold → few cut-ins, long latency

22 Latency / Cut-in Tradeoff
[Same timeline figure]
Long threshold → few cut-ins, long latency
Short threshold → many cut-ins, short latency
Can we exploit dialog information to get the best of both worlds?

23 End-of-Turn Detection as Classification
– Classify pauses as internal/final based on words, syntax, prosody [Sato et al, 2002]
– Repeat classification every n milliseconds until the pause ends or end-of-turn is detected [Ferrer et al, 2003; Takeuchi et al, 2004]
But no previous work:
– successfully combined a wide range of features
– tested the model in a real dialog system

24 Outline
– Introduction
– An architecture for dialog and interaction management
– Using dialog features to inform turn-taking
  – End-of-turn detection
  – Decision tree-based thresholds
  – Batch evaluation
  – Live evaluation

25 Using Variable Thresholds
[Same timeline figure]
– Discourse (dialog state: open question, specific question, confirmation)
– Semantics (partial ASR: does the partial hypothesis match current expectations?)
– Prosody (F0, duration)
– Timing (pause start)
– Speaker (avg # pauses)

26 Example Decision Tree
[Decision tree figure. Internal nodes test features such as: utterance duration < 2000 ms, whether the partial ASR hypothesis matches expectations, average pause duration (< 200 ms, < 300 ms), whether the partial ASR contains “YES”, whether the dialog state is an open question, whether the partial ASR has fewer than 3 words, whether a partial ASR hypothesis is available, average non-understanding ratio < 15%, and consecutive user turns without a system prompt. Leaves assign thresholds ranging from 200 ms to 1440 ms.]
Trained on 1326 dialogs with the Let’s Go public dialog system.

27 Example Decision Tree
[Same decision tree, traced for the partial hypothesis “I’d like to go to”.]
Trained on 1326 dialogs with the Let’s Go public dialog system.

28 Example Decision Tree
[Same decision tree, traced for the complete hypothesis “I’d like to go to the airport.”]
Trained on 1326 dialogs with the Let’s Go public dialog system.
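Read operationally, the learned tree is a lookup from pause features to an endpointing threshold. A minimal Python sketch of such a lookup is below; the feature names and millisecond values loosely echo the figure but are illustrative only, not the actual learned tree.

```python
# Hypothetical sketch of applying a learned threshold tree at runtime.
# Feature names and millisecond values are illustrative only.
from dataclasses import dataclass

@dataclass
class PauseFeatures:
    utterance_duration_ms: float        # speech duration before this pause
    partial_matches_expectations: bool  # partial ASR fits the current dialog state
    partial_has_yes: bool
    avg_pause_duration_ms: float        # speaker-specific average so far
    dialog_state_is_open_question: bool

def endpointing_threshold_ms(f: PauseFeatures) -> float:
    """How long to wait in this pause before declaring end-of-turn."""
    if f.utterance_duration_ms < 2000:
        if f.partial_matches_expectations:
            # Expected answers can be endpointed aggressively.
            return 200 if f.partial_has_yes else 205
        return 693
    # Longer utterances: be more conservative, especially after open questions.
    if f.dialog_state_is_open_question:
        return 1005
    return 789 if f.avg_pause_duration_ms < 300 else 922

# Example: a short confirmation answer gets a short threshold.
print(endpointing_threshold_ms(PauseFeatures(800, True, True, 250, False)))  # -> 200
```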

29 Outline
– Introduction
– An architecture for dialog and interaction management
– Using dialog features to inform turn-taking
  – End-of-turn detection
  – Decision tree-based thresholds
  – Batch evaluation
  – Live evaluation

30 Performance per Feature Set
[Plot: 22% latency reduction, 38% cut-in rate reduction]

31 Performance per Feature Set
[Plot comparing individual feature sets against all features: semantics is the most useful feature type]

32 Outline
– Introduction
– An architecture for dialog and interaction management
– Using dialog features to inform turn-taking
  – End-of-turn detection
  – Decision tree-based thresholds
  – Batch evaluation
  – Live evaluation

33 Live Evaluation
– Implemented the decision tree in the Let’s Go interaction manager
– Operating point: 3% cut-in, 635 ms average
– 1061 dialogs collected in May ’08
  – 548 control dialogs (fixed threshold = 700 ms)
  – 513 treatment dialogs (decision tree)

34 Cut-in Rate per Dialog State
[Plot: fewer cut-ins overall (p < 0.05); largest improvement after open requests]

35 Average Latency per State
[Plot: faster on confirmations; slower on answers to open questions]

36 Non-Understanding Rate per State
[Plot: significant reduction after confirmations (p < 0.01)]

37 Outline
– Introduction
– An architecture for dialog and interaction management
– Using dialog features to inform turn-taking
– A domain-independent data-driven turn-taking model
  – The Finite-State Turn-Taking Machine
  – Application to end-of-turn detection
    – In pauses
    – Anytime
  – Application to barge-in detection

38 The Finite-State Turn-Taking Machine
[State diagram build-up: the User and System states.]

39 The Finite-State Turn-Taking Machine
[Adds the free-floor states Free_S and Free_U.]

40 The Finite-State Turn-Taking Machine
[Adds the overlap states Both_S and Both_U.]

41 The Finite-State Turn-Taking Machine
[Complete state diagram: User, System, Free_S, Free_U, Both_S, Both_U.]
Similar models were proposed by Brady (1969) and Jaffe and Feldstein (1970) for the analysis of human conversations.

42 Uncertainty in the FSTTM
The system:
– knows whether it is claiming the floor or not
– holds probabilistic beliefs over whether the user is
This yields a probability distribution over the state. In some (useful) cases, approximations reduce the uncertainty to two states:
– User vs Free_U during user utterances
– System vs Both_S during system prompts

43 Making Decisions with the FSTTM
Actions:
– YIELD, KEEP if the system is currently holding the floor
– GRAB, WAIT if it is not
– Different costs in different states
Decision-theoretic action selection: pick the action with the lowest expected cost given the belief distribution over the states.
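A minimal Python sketch of this selection rule is below; the state names follow the slides, but the belief and cost numbers are placeholders rather than the thesis’s trained parameters.

```python
# Hypothetical sketch of decision-theoretic action selection in the FSTTM.
# Belief and cost values are placeholders; only the selection rule is from the slides.
from typing import Dict

def expected_costs(belief: Dict[str, float],
                   cost: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Expected cost of each action under a belief over floor states."""
    return {action: sum(belief[state] * c for state, c in per_state.items())
            for action, per_state in cost.items()}

def select_action(belief: Dict[str, float],
                  cost: Dict[str, Dict[str, float]]) -> str:
    costs = expected_costs(belief, cost)
    return min(costs, key=costs.get)

# During a pause in a user utterance, the relevant states are USER and FREE_U.
belief = {"USER": 0.05, "FREE_U": 0.95}
cost = {
    "WAIT": {"USER": 0.0, "FREE_U": 500.0},   # waiting on a free floor costs latency
    "GRAB": {"USER": 5000.0, "FREE_U": 0.0},  # grabbing a held floor is a cut-in
}
print(select_action(belief, cost))  # -> "GRAB" under these placeholder numbers
```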

44 Outline
– Introduction
– An architecture for dialog and interaction management
– Using dialog features to inform turn-taking
– A domain-independent data-driven turn-taking model
  – The Finite-State Turn-Taking Machine
  – Application to end-of-turn detection
    – In pauses
    – Anytime
  – Application to barge-in detection

45 End-of-Turn Detection in the FSTTM
[FSTTM state diagram: User, System, Free_S, Free_U, Both_S, Both_U.]

46 End-of-Turn Detection in the FSTTM
[Same diagram, highlighting the GRAB and WAIT actions.]

47 Action/State Cost Matrix in Pauses
Latency cost increases linearly with time; constant cut-in cost.
System action \ Floor state     User              Free_U
WAIT                            0                 C_G · t (time in pause)
GRAB                            C_U (constant)    0

48 Action Selection
At time t in a pause, take the action with the minimal expected cost.
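The expected-cost expressions follow from the cost matrix above (a reconstruction; the slide’s own formula is shown as a figure): E[C(WAIT)] = P_t(Free_U) · C_G · t and E[C(GRAB)] = (1 - P_t(Free_U)) · C_U, so the system GRABs the floor as soon as (1 - P_t(Free_U)) · C_U ≤ P_t(Free_U) · C_G · t.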

49 Estimating State Probabilities
[Formula, annotated: exponential decay; probability that the user releases the floor, estimated at the beginning of the pause; probability that the user keeps the floor, estimated at the beginning of the pause.]
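Based on these annotations and on the exponential pause-duration assumption from the extra slides, a plausible reconstruction of the estimate via Bayes’ rule is P_t(Free_U) = P_0(Free_U) / (P_0(Free_U) + (1 - P_0(Free_U)) · e^(-t/μ)), where P_0(Free_U) is the probability, estimated at the start of the pause, that the user has released the floor, and e^(-t/μ) is the probability that a user who intends to keep the floor is still pausing after t.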

50 Estimating P(Free_U)
Step-wise logistic regression. Selected features:
– boundary LM score, “YES” in the ASR hypothesis
– energy, F0 before the pause
– barge-in

                        Baseline    Logistic Regression
Classification Error    21.9%       21.7%
Log Likelihood          -0.52       -0.44
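A minimal sketch of this kind of estimator with scikit-learn; the feature values and training data are invented, and the step-wise feature selection used in the thesis is not reproduced.

```python
# Hypothetical sketch of estimating P(Free_U) at the start of a pause
# with logistic regression. Feature values and training data are invented.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [boundary LM score, "YES" in partial hyp, energy before pause,
#            F0 before pause, turn started as a barge-in]
X_train = np.array([
    [0.9, 1, 0.2, 110.0, 0],
    [0.1, 0, 0.8, 180.0, 1],
    [0.7, 0, 0.3, 120.0, 0],
    [0.2, 0, 0.9, 190.0, 1],
])
y_train = np.array([1, 0, 1, 0])   # 1 = user released the floor (Free_U)

model = LogisticRegression().fit(X_train, y_train)

# Probability that the floor is free at the start of a new pause:
p_free_u = model.predict_proba([[0.8, 1, 0.25, 115.0, 0]])[0, 1]
print(f"P(Free_U) at pause start: {p_free_u:.2f}")
```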

51 In-Pause Detection Results
[Plot: 28% latency reduction]

52 Outline
– Introduction
– An event-driven architecture for spoken dialog systems
– Using dialog features to inform turn-taking
– A domain-independent data-driven turn-taking model
  – The Finite-State Turn-Taking Machine
  – Application to end-of-turn detection
    – At pauses
    – Anytime
  – Application to barge-in detection

53 Delays in Pause Detection
[Waveform: “I’d like to go to the airport.”]
About 200 ms between pause start and the VAD change of state.
In some cases, we can make the decision before VAD detection:
– partial hypotheses during speech
– the previous model once a pause is detected
→ Anytime End-of-Turn Detection

54 End-of-Turn Detection in Speech
Cost matrix:
System action \ Floor state     User              Free_U
WAIT                            0                 C_W (constant)
GRAB                            C_U (constant)    0
Leads to a fixed threshold on P(Free_U).
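Working out the threshold from this cost matrix (a reconstruction of the argument, not the slide’s own derivation): E[C(WAIT)] = P(Free_U) · C_W and E[C(GRAB)] = (1 - P(Free_U)) · C_U, so GRAB has the lower expected cost exactly when P(Free_U) > C_U / (C_U + C_W), which is a fixed threshold on P(Free_U).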

55 Estimating P(Free_U) in Speech
Step-wise logistic regression. Features:
– boundary LM score, “YES”/“NO” in the hypothesis
– number of words
– barge-in

                        Baseline    Logistic Regression
Classification Error    38.9%       19.2%
Log Likelihood          -0.67       -0.45

56 Anytime Detection Results
[Plot: 35% latency reduction]

57 Histogram of Turn Latencies
[Histogram distinguishing highly predictable ends of turns from less predictable ends of turns.]

58 Histogram of Turn Latencies
[Same histogram.]
– 10% of turn ends detected during speech
– 40% of highly predictable cases are predicted during speech
– No change to less predictable cases

59 Outline
– Introduction
– An architecture for dialog and interaction management
– Using dialog features to inform turn-taking
– A generic, trainable turn-taking model
  – The Finite-State Turn-Taking Machine
  – Application to end-of-turn detection
    – At pauses
    – Anytime
  – Application to barge-in detection

60 Barge-in Detection in the FSTTM
[FSTTM state diagram: User, System, Free_S, Free_U, Both_S, Both_U.]

61 Barge-in Detection in the FSTTM
[Same diagram, highlighting the KEEP and YIELD actions.]

62 Cost Matrix during System Prompts
Constant costs:
System action \ Floor state     System            Both_S
KEEP                            0                 C_O (constant)
YIELD                           C_S (constant)    0
Equivalent to setting a threshold on P(Both_S).
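By the same reasoning as for end-of-turn detection (again reconstructed from the cost matrix): E[C(KEEP)] = P(Both_S) · C_O and E[C(YIELD)] = (1 - P(Both_S)) · C_S, so the system YIELDs when P(Both_S) > C_S / (C_S + C_O).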

63 Estimating P(Both_S)
Estimated at each new partial ASR hypothesis, using logistic regression. Features:
– partial hypothesis matches expectations
– cue words in the hypothesis, selected using mutual information on a previous corpus (e.g. “When” in a state where “When is the next/previous bus” is expected)
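A small Python sketch of mutual-information-based cue word selection; the toy corpus, labels, and cutoff are invented for illustration.

```python
# Hypothetical sketch: rank candidate cue words by mutual information with the
# "genuine user barge-in" label. Corpus and labels are invented.
import math
from collections import Counter

corpus = [
    ("when is the next bus", 1),      # 1 = genuine user barge-in
    ("i'd like to go downtown", 1),
    ("uh the next", 0),               # 0 = false barge-in (noise, background speech)
    ("when is the previous bus", 1),
    ("hmm", 0),
]

def mutual_information(word: str, data) -> float:
    """MI between presence of `word` in a hypothesis and the barge-in label."""
    n = len(data)
    joint = Counter((word in hyp.split(), label) for hyp, label in data)
    p_word = Counter(word in hyp.split() for hyp, _ in data)
    p_label = Counter(label for _, label in data)
    mi = 0.0
    for (w, l), count in joint.items():
        p_joint = count / n
        mi += p_joint * math.log2(p_joint / ((p_word[w] / n) * (p_label[l] / n)))
    return mi

vocab = {w for hyp, _ in corpus for w in hyp.split()}
ranked = sorted(vocab, key=lambda w: mutual_information(w, corpus), reverse=True)
print(ranked[:3])   # highest-MI candidate cue words
```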

64 Barge-in Detection Results
[Plot]

65 Outline
– Introduction
– An architecture for dialog and interaction management
– Using dialog features to inform turn-taking
– A generic, trainable turn-taking model
– Conclusion

66 Thesis Statement
Incorporating different levels of knowledge using a data-driven decision model will improve the turn-taking behavior of spoken dialog systems.
→ latency and/or cut-in rate reduced by both the decision tree and the FSTTM; semantic features most useful
Specifically, turn-taking can be modeled as a finite-state decision process operating under uncertainty.
→ the FSTTM

67 Contributions
– An architecture for spoken dialog systems that incorporates dialog and interaction management
– Analysis of dialog features underlying turn-taking
– The Finite-State Turn-Taking Machine: a domain-independent, data-driven turn-taking model that improves end-of-turn and barge-in detection

68 Extending the FSTTM
A framework to organize turn-taking. Extensions:
– generalized FSTTM topology (multi-party conversation)
– richer cost functions (non-linear latency cost, non-uniform cut-in cost, etc.)
– better tracking of uncertainty (priors, Partially Observable Markov Decision Processes)

69 FSTTM Dialog
S: What can I do for you?
U: Next bus from Fifth and Negley to Fifth and Craig.
S: Leaving from Fifth and Negley. Is this correct?
U: Yes.
S: Alright. Going to Fifth and Craig. Is this correct?
U: Yes.
S: Alright. I think you want the next bus. Am I…
U: Yes.
S: Right. Just a minute. I’ll look that up. The next 71D leaves Fifth Avenue at Negley at 10:54 AM.

70 Thank you! Questions?

71 Extra Slides

72 Building Threshold Decision Trees
1. Cluster pauses using automatically extracted features from discourse, semantics, prosody, timing and speaker.
2. Set one threshold for each cluster so as to minimize overall latency.
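A minimal Python sketch of step 2 under one plausible reading: for each cluster, choose the smallest threshold whose cut-in rate stays under a cap, which then minimizes latency for that cluster. The cut-in-rate constraint and the search grid are assumptions; the slide only states that thresholds are set to minimize overall latency.

```python
# Hypothetical sketch: choose one endpointing threshold per cluster of pauses.
# The cut-in-rate cap and the search grid are assumptions for illustration.
from typing import List

def best_threshold(final_pause_count: int,
                   internal_pause_durations_ms: List[float],
                   max_cut_in_rate: float = 0.05,
                   grid_ms: range = range(200, 2001, 10)) -> int:
    """Smallest threshold whose cut-in rate stays under the cap.

    A cut-in happens when an internal pause outlasts the threshold; latency on
    turn-final pauses is roughly the threshold itself, so the smallest
    admissible threshold minimizes latency for this cluster.
    """
    total_pauses = final_pause_count + len(internal_pause_durations_ms)
    for threshold in grid_ms:
        cut_ins = sum(d > threshold for d in internal_pause_durations_ms)
        if cut_ins / total_pauses <= max_cut_in_rate:
            return threshold
    return grid_ms[-1]

# Example: a cluster with 90 turn-final pauses and 10 internal pauses.
print(best_threshold(90, [150, 220, 300, 450, 800, 1200, 900, 640, 510, 330]))  # -> 450
```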

73 Learning Curve
[Plot]

74 Estimating Parameters
Estimating μ (mean pause duration):
– overall mean pause duration
– state-specific mean pause duration
– predicted using dialog features through step-wise generalized linear regression (correlation: 0.42; features: barge-in, dialog state, LM score, “YES”)

75 Endpointing Threshold
Threshold is the solution of: [formula]

76 User Barge-in Time Distribution
[Histogram of user barge-in times over the prompt “Going to. Is this correct?”]

77 The First Step
S: What can I do for you?
U: When is the next 54C coming to 18th street?
S: The 54C, right?
U: Yes.
S: Okay. Where do you wanna go?
U: Carson.
S: Carson, correct?
U: Yes.
S: Okay. Let me check that for you.

S: What can I do for you?
U: When is the next 54C coming to 18th street?
S: The 54C. Did I get that right?
U: Yes.
S: Okay. Where do you wanna go?
U: Carson.
S: Going to Carson. Is this correct?
U: Yes.
S: Okay. Let me check that for you.

78 The First Step
S: What can I do for you?
U: When is the next 54C coming to 18th street?
S: The 54C, right?
U: Yes.
S: Okay. Where do you wanna go?
U: Carson.
S: Carson, correct?
U: Yes.
S: Okay. Let me check that for you.
[Annotations: Prosody, Prompt design, Turn-Taking, Incremental Processing]


81 Spoken Dialog Systems
S: What can I do for you?
U: I’d like to go to the Waterfront.
S: Going to Waterfront. Is this correct?
U: Yes.
S: Alright. Where do you want to leave from?
U: Oakland.
S: Leaving from Oakland. When are you going to take that bus?
U: Now.
S: The next bus. Hold on. Let me check that for you. The next 61C leaves Forbes Avenue at Atwood Children’s Hospital at 5:16pm.

82 Turn Endpointing
[Timeline: S: What can I do for you? / U: I’d like to go to the airport. Tracks for speech/silence, VAD output, and the endpointing threshold; markers: Silence Detected, Speech Detected, Silence Detected, Endpoint.]

83 Endpointing Issues
[Same timeline, illustrating a cut-in.]

84 End-of-Turn Detection Issues
[Same timeline, illustrating latency.]

85 The Endpointing Trade Off
[Same timeline]
Long threshold → few cut-ins, long latency

86 The Endpointing Trade Off
[Same timeline]
Long threshold → few cut-ins, long latency
Short threshold → many cut-ins, short latency

87 Using Variable Thresholds
[Same timeline]
– Discourse (dialog state)
– Semantics (partial ASR)
– Prosody (F0, duration)
– Timing (pause start)
– Speaker (avg # pauses)

88 Standard Approach to Turn-Taking in Spoken Dialog Systems
– Typically not explicitly modeled
– Rules based on low-level features: threshold-based end-of-utterance detection, (optionally) barge-in detection
– Fixed behavior
– Not integrated in the overall dialog model

89 The Finite-State Turn-Taking Machine
[State diagram: USER YIELDS (smooth transition).]

90 The Finite-State Turn-Taking Machine
[State diagram: SYSTEM GRABS (smooth transition).]

91 The Finite-State Turn-Taking Machine
[State diagram: SYSTEM WAITS / USER WAITS (latency).]

92 The Finite-State Turn-Taking Machine
[State diagram: SYSTEM GRABS (cut-in).]

93 The Finite-State Turn-Taking Machine
[State diagram: SYSTEM GRABS (time out).]

94 The Finite-State Turn-Taking Machine
[State diagram: USER GRABS (barge-in).]

95 The Finite-State Turn-Taking Machine
[State diagram: SYSTEM YIELDS (barge-in).]

96 Optimal C_W
[Plot. C_U is set to maintain an overall cut-in rate of 5%.]

97-104 Estimating State Probabilities
[Step-by-step derivation; the formulas are shown as figures. Annotations:]
– The user remains silent indefinitely at the end of the turn (no transition Free_S → User).
– Without knowledge of silence duration, P_t(Free_S) = P_0(Free_S).
– Probability that the user is still silent at time t, given that they haven’t finished their turn.
– Assuming an exponential distribution on internal silence durations; μ is the mean pause duration.

105 Reducing Uncertainty
Different levels of information can help reduce uncertainty over the floor:
– Immediate information: syntax, semantics, prosody of the current turn…
– Discourse information: dialog state, task structure, expectations…
– Environment information: acoustic conditions, user characteristics…

106 Endpointing Threshold
Threshold is the solution of: [formula]
Parameter K is set empirically (typically 10,000 ms).

107 Pause Endpointing Results
[Plot]

