1
Backchannel-Inviting Cues in Task-Oriented Dialogue
Agustín Gravano (1,2), Julia Hirschberg (1)
(1) Columbia University, New York, USA; (2) Universidad de Buenos Aires, Argentina
2
Introduction: Interactive Voice Response Systems
- Quickly spreading.
- Mostly simple functionality.
- “Uncomfortable”, “awkward”.
- ASR+TTS account for most IVR problems.
- As ASR and TTS improve, other problems are revealed:
  - Coordination of system-user exchanges.
  - Backchannels.
Agustín Gravano, Interspeech 2009
3
Introduction: Backchannels
Short expressions uttered by listeners to:
- Convey that they are paying attention.
- Encourage the speaker to continue.
Examples: okay, uh-huh, mm-hm, alright.
Very frequent in task-oriented dialogue.
Thus, modeling the human usage of backchannels (BC) should lead to improved system-user coordination.
4
Introduction: Goal
Learn when backchannels are likely to occur.
Find “backchannel-inviting” cues:
- Cues displayed by the speaker “inviting” the listener to produce a backchannel response.
This could improve the coordination of IVRs:
- Speech understanding: Detect points in the user’s turn where a backchannel would be welcome.
- Speech generation: Display cues inviting the user to produce a backchannel.
5
Talk Outline
- Previous work
- Material
- Method
- Results
- Conclusions
6
Backchannel-Inviting Cues: Previous Work
Duncan 1972, 1973, 1974, inter alia.
- Hypothesized six turn-yielding cues in face-to-face dialogue.
- Several studies continued this line of research, but always excluded backchannels.
Ward & Tsukahara 2000.
- Region of low pitch lasting 110ms or more.
Cathcart et al.
- Language model based on pause duration and part-of-speech tags to predict the location of BC.
7
Material: Columbia Games Corpus
- 12 task-oriented spontaneous dialogues.
- Standard American English.
- 13 subjects: 6 female, 7 male.
- Series of collaborative computer games.
- No eye contact. No speech restrictions.
- 9 hours of dialogue.
- Manual orthographic transcription, alignment.
- Manual prosodic annotations (ToBI).
8
Material: Columbia Games Corpus
Player 1: Describer. Player 2: Follower.
In an Objects game, each player saw a board with 5-7 objects. The boards were almost identical, with one object misplaced. One of the players had to describe the position of the target object to the other player, who had to move it to the correct position.
9
Backchannel-Inviting Cues
Cues displayed by the speaker “inviting” the listener to produce a backchannel response.
10
Method: Backchannel-Inviting Cues
IPU (Inter-Pausal Unit): Maximal sequence of words from the same speaker surrounded by silence ≥ 50ms.
[Figure: Speaker A and Speaker B timelines showing IPU1 to IPU4, with Hold and Backchannel transitions labeled.]
3 trained annotators identified backchannels using a labeling scheme described in [Gravano et al. 2007].
To find BC-inviting cues, we compare:
- IPUs preceding Holds,
- IPUs preceding Backchannels.
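The IPU definition above can be sketched in code. This is a minimal illustration, assuming time-aligned words are available for one speaker; the `Word` class and `segment_ipus` function are hypothetical names, not part of the authors' tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """A time-aligned word (hypothetical structure for this sketch)."""
    text: str
    start: float  # seconds
    end: float    # seconds


def segment_ipus(words: List[Word], min_silence: float = 0.050) -> List[List[Word]]:
    """Group one speaker's words into inter-pausal units (IPUs).

    A new IPU starts whenever the silence since the previous word
    is at least `min_silence` seconds (50 ms on the slide).
    """
    ipus: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.start):
        if ipus and word.start - ipus[-1][-1].end < min_silence:
            ipus[-1].append(word)   # gap too short: same IPU
        else:
            ipus.append([word])     # first word, or long enough silence
    return ipus


words = [
    Word("the", 0.00, 0.40),
    Word("lemon", 0.42, 0.80),   # 20 ms gap: same IPU
    Word("okay", 1.20, 1.50),    # 400 ms gap: new IPU
]
ipus = segment_ipus(words)
```

Each resulting IPU can then be labeled by what follows it (a Hold or a Backchannel) before the two groups are compared.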
11
Backchannel-Inviting Cues: Individual Cues
- Final rising intonation: 81% of IPUs before BC end in H-H% or L-H%.
- Higher pitch level.
- Higher intensity level.
- Lower NHR (noise-to-harmonics ratio; voice quality).
- Longer IPU duration (seconds, #words).
- Final POS bigram: 72% of IPUs before BC end in DT NN, JJ NN, or NN NN.
Acoustic cues (pitch, intensity, NHR) measured over: entire IPU, final 1.0 sec, final 0.5 sec.
12
Backchannel-Inviting Cues: Defining Presence of a Cue
2 representative features for each cue:
- Final intonation: Pitch slope over final 200ms, 300ms.
- Intensity level: Mean intensity over final 500ms, 1000ms.
- Pitch level: Mean pitch over final 500ms, 1000ms.
- Voice quality: NHR over final 500ms, 1000ms.
- IPU duration: Duration in ms, and in number of words.
- Final POS bigram: {‘DT NN’, ‘JJ NN’, ‘NN NN’} vs. Rest (binary).
Define presence/absence based on whether the value is closer to the mean before BC or to the mean before H (Hold).
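The nearest-mean rule can be illustrated for a single numeric feature. This is a sketch under assumptions: the function name and the toy values are hypothetical, and the slide does not say how the two representative features of a cue are combined, so only one feature is handled here.

```python
from statistics import mean
from typing import Sequence


def cue_present(value: float,
                values_before_bc: Sequence[float],
                values_before_hold: Sequence[float]) -> bool:
    """Return True if `value` is closer to the mean observed before
    backchannels than to the mean observed before holds."""
    mean_bc = mean(values_before_bc)
    mean_hold = mean(values_before_hold)
    return abs(value - mean_bc) < abs(value - mean_hold)


# Toy, made-up feature values (e.g. a pitch slope), NOT corpus data.
before_bc = [30.0, 50.0]      # mean 40.0
before_hold = [-10.0, 10.0]   # mean 0.0

rising = cue_present(35.0, before_bc, before_hold)   # closer to 40.0
flat = cue_present(5.0, before_bc, before_hold)      # closer to 0.0
```

The binary POS-bigram cue needs no such rule, since it is already a present/absent value.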
13
Top Frequencies of Complex Cues
Notation: digit = cue present; dot = cue absent.
BC-inviting cues:
1: Final intonation
2: Intensity level
3: Pitch level
4: IPU duration
5: Voice quality
6: Final POS bigram
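The digit/dot notation can be reproduced with a small helper. This is an illustrative sketch: `encode_cues` and `top_patterns` are hypothetical names, and the example IPUs are made up rather than taken from the corpus.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Cue numbering as given in the slide legend.
CUE_ORDER = [
    "final intonation",   # 1
    "intensity level",    # 2
    "pitch level",        # 3
    "IPU duration",       # 4
    "voice quality",      # 5
    "final POS bigram",   # 6
]


def encode_cues(present: Set[str]) -> str:
    """Encode a set of present cues: digit = present, dot = absent."""
    return "".join(
        str(i) if name in present else "."
        for i, name in enumerate(CUE_ORDER, start=1)
    )


def top_patterns(ipus: Iterable[Set[str]], n: int = 3) -> List[Tuple[str, int]]:
    """Count encoded cue patterns and return the `n` most frequent."""
    return Counter(encode_cues(p) for p in ipus).most_common(n)


# Made-up example IPUs, each described by the cues it displays.
example = [
    {"final intonation", "pitch level", "final POS bigram"},
    {"final intonation", "pitch level", "final POS bigram"},
    set(CUE_ORDER),
]
pattern = encode_cues(example[0])
ranking = top_patterns(example)
```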
14
Backchannel-Inviting Cues: Combined Cues
[Plot: percentage of IPUs followed by a BC (y-axis) vs. number of cues conjointly displayed (x-axis); r² = 0.993.]
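The quantity plotted on this slide could be tabulated roughly as follows. This is a sketch with made-up input; `bc_rate_by_cue_count` is a hypothetical name, and the curve fit behind the reported r² is not specified on the slide, so no fit is attempted here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def bc_rate_by_cue_count(ipus: Iterable[Tuple[int, bool]]) -> Dict[int, float]:
    """Percentage of IPUs followed by a backchannel, per number of cues.

    Each item is (number of cues displayed, followed by a BC?).
    """
    totals: Dict[int, int] = defaultdict(int)
    hits: Dict[int, int] = defaultdict(int)
    for num_cues, followed_by_bc in ipus:
        totals[num_cues] += 1
        hits[num_cues] += int(followed_by_bc)
    return {n: 100.0 * hits[n] / totals[n] for n in sorted(totals)}


# Toy observations, NOT corpus data.
toy = [(0, False), (0, False), (1, True), (1, False), (2, True)]
rates = bc_rate_by_cue_count(toy)
```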
15
Backchannel-Inviting Cues: IVR Systems
After each IPU from the user:
  if estimated likelihood > threshold
  then produce a backchannel
To elicit a backchannel from the user, if desired:
- Include as many cues as possible in the system’s final IPU.
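The threshold rule for the understanding side might look like the sketch below. Everything here is an assumption for illustration: the slide does not say how the likelihood is estimated, so a lookup by number of cues is used, and `PLACEHOLDER_RATES` holds invented values, not results from the study.

```python
from typing import Mapping

# Invented placeholder values (fraction of IPUs followed by a BC,
# keyed by number of cues displayed). NOT results from the study.
PLACEHOLDER_RATES = {0: 0.01, 1: 0.02, 2: 0.05, 3: 0.10, 4: 0.18, 5: 0.28, 6: 0.40}


def estimate_likelihood(num_cues: int, rate_by_count: Mapping[int, float]) -> float:
    """Look up an estimated backchannel likelihood for a cue count.

    Unknown cue counts fall back to 0.0 (never backchannel).
    """
    return rate_by_count.get(num_cues, 0.0)


def should_backchannel(num_cues: int,
                       rate_by_count: Mapping[int, float],
                       threshold: float) -> bool:
    """Apply the rule: backchannel if estimated likelihood > threshold."""
    return estimate_likelihood(num_cues, rate_by_count) > threshold


many_cues = should_backchannel(6, PLACEHOLDER_RATES, threshold=0.25)
few_cues = should_backchannel(1, PLACEHOLDER_RATES, threshold=0.25)
```

The threshold trades off missed backchannel opportunities against interrupting the user, and would need tuning for a given system.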
16
Summary
Study of backchannel-inviting cues.
- Objective, automatically computable.
- Combined cues.
- Improve turn-taking decisions of IVR systems.
Results drawn from task-oriented dialogues.
- Not necessarily generalizable.
- Suitable for most IVR domains.
SIGdial 2009: Study of turn-yielding cues.
17
Special thanks to…
- My advisor: Julia Hirschberg.
- Thesis Committee Members: Maxine Eskenazi, Kathy McKeown, Becky Passonneau, Amanda Stent.
- Speech Lab at Columbia University: Stefan Benus, Fadi Biadsy, Sasha Caskey, Bob Coyne, Frank Enos, Martin Jansche, Jackson Liscombe, Sameer Maskey, Andrew Rosenberg.
- Collaborators: Gregory Ward and Elisa Sneed German (Northwestern U); Ani Nenkova (UPenn); Héctor Chávez, David Elson, Michel Galley, Enrique Henestroza, Hanae Koiso, Shira Mitchell, Michael Mulley, Kristen Parton, Ilia Vovsha, Lauren Wilcox.