SDS Future
Julia Hirschberg
LSA07 353
Today: Whither Spoken Dialogue Systems?
- Technology issues
- Human factors issues
- Taking automated dialogue to the next level:
  - Modeling users' emotional state
  - Entrainment/adaptation/… to users
  - System personality
  - Cultural sensitivity
Technology Issues
- Better ASR:
  - Fast and accurate
  - Better rejection capabilities
  - Trained on real dialogue phenomena (hyperarticulation, self-repairs and other disfluencies) and on a broader subject pool
- More sophisticated semantic representations
- Automated call routing
- Tools to automate the creation of new systems:
  - Recognizers that are more accurate and easier to train
  - Dialogue flow schemes
  - TTS voices
Human Factors Issues
- Improved modeling of human/human interaction:
  - Better models of turn-taking (e.g. backchanneling behavior, timing issues)
  - Incorporation of findings on dialogue act disambiguation
- Error detection/correction capabilities
- Support for mixed initiative: moving away from the voice-menu model
- Better ways to evaluate system performance
- Adaptation/customization for frequent/power users
Today: Whither Spoken Dialogue Systems?
- Technology issues
- Human factors issues
- Taking automated dialogue to the next level:
  - Modeling users' emotional state
  - Entrainment/adaptation/… to users
  - System personality
  - Cultural sensitivity
Emotion and Speaker State
A speaker's emotional state represents important and useful information:
- To recognize: anger/frustration in a call center SDS; confidence/uncertainty in a tutoring domain
- To generate: e.g. any emotion for games
Prosodic information has proven quite useful in detecting different emotions automatically.
Studies of Emotional Speech in Human/Human Corpora
- Anger/frustration:
  - Travel scenarios (Batliner et al. 2003; Ang et al. 2002)
  - Call centers (Liscombe et al. 2005; Vidrascu & Devillers 2005; Lee & Narayanan 2005)
- Other emotions:
  - Meetings (Wrede & Shriberg 2003)
  - Unconstrained speech (Roach 2000; Cowie et al. 2001; Campbell 2003, …)
Issues in Emotional Speech Studies
- Data debate: acted speech vs. natural (hand-labeled) corpora
- Classification tasks: distinguishing specific 'classic' emotions; distinguishing negative emotions; distinguishing valence and activation
- Representations of prosodic features: direct modeling via acoustic correlates, or a symbolic representation (e.g. ToBI)
Acted Speech: The LDC Emotional Speech Corpus
8 speakers, 15 emotions, semantically neutral utterances. The emotions include: happy, sad, angry, confident, frustrated, friendly, interested, anxious, bored, and encouraging.
Can We Distinguish Classic Emotions in Acted Speech?
- User study to classify tokens from the LDC corpus
- 10 emotions, chosen as the most convincing:
  - Positive: confident, encouraging, friendly, happy, interested
  - Negative: angry, anxious, bored, frustrated, sad
- Machine-learning classification of tokens by majority label, framed as binary classification (Liscombe, Hirschberg & Venditti 2003); the setup is sketched below
At Columbia, we have used this corpus to collect subject judgments on emotional speech, to test the hypothesis that utterances often convey mixed emotions. We also examined the feasibility of classifying the 'majority label' of each utterance from its prosodic features.
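The one-vs-rest setup above can be sketched in a few lines. This is a minimal illustration, assuming a precomputed feature matrix X and a list of majority labels y; the classifier and cross-validation scheme are illustrative choices, not those of the original study.

```python
# Minimal sketch of one-vs-rest emotion classification over
# acoustic/prosodic features. X: (n_utterances, n_features) array;
# y: list of majority emotion labels, one per utterance.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def one_vs_rest_accuracy(X, y, emotion):
    # Relabel each utterance: 1 if its majority label is the target
    # emotion, 0 otherwise, then cross-validate a binary classifier.
    y_bin = np.array([int(label == emotion) for label in y])
    clf = LogisticRegression(max_iter=1000)
    return cross_val_score(clf, X, y_bin, cv=5).mean()

# e.g. one_vs_rest_accuracy(X, y, "angry"), repeated for each of the 10 emotions
```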
What Features Are Useful in Emotion Classification?
- Automatically extracted pitch, intensity, and rate (a sketch of the extraction follows this list)
- Spectral tilt from hand-segmented vowels
- Hand-labeled ToBI contours
Results:
- Direct modeling of acoustic/prosodic features: 75% average accuracy against a 62% average baseline
- Acoustic-prosodic features identify activation
- Higher-level ToBI features distinguish valence: H-L% correlates with negative emotions, L-L% with positive
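A sketch of how the automatically extracted features might be computed. The librosa toolkit and the onset-based rate proxy are assumptions for illustration; the hand-labeled features (spectral tilt from segmented vowels, ToBI contours) cannot be extracted this way.

```python
# Sketch of automatic acoustic-prosodic feature extraction for one
# utterance, using librosa (an illustrative choice of toolkit).
import numpy as np
import librosa

def prosodic_features(wav_path):
    y, sr = librosa.load(wav_path, sr=None)
    duration = len(y) / sr
    # Pitch track via pYIN; drop unvoiced (NaN) frames.
    f0, _, _ = librosa.pyin(y, fmin=75, fmax=500, sr=sr)
    f0 = f0[~np.isnan(f0)]
    # Intensity proxy: per-frame root-mean-square energy.
    rms = librosa.feature.rms(y=y)[0]
    # Crude speaking-rate proxy: acoustic onsets per second.
    onsets = librosa.onset.onset_detect(y=y, sr=sr)
    return {
        "f0_mean": float(np.mean(f0)) if f0.size else 0.0,
        "f0_range": float(np.ptp(f0)) if f0.size else 0.0,
        "rms_mean": float(np.mean(rms)),
        "rms_max": float(np.max(rms)),
        "onset_rate": len(onsets) / duration,
    }
```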
Accuracy Distinguishing One Emotion from the Rest: Direct Modeling

Emotion       Baseline   Accuracy
angry         69.32%     77.27%
confident     -          75.00%
happy         57.39%     80.11%
interested    69.89%     74.43%
encouraging   52.27%     72.73%
sad           61.93%     -
anxious       55.68%     71.59%
bored         66.48%     78.98%
friendly      59.09%     73.86%
frustrated    -          -

Directly modeled acoustic/prosodic features classify emotions like happy and sad with high accuracy, and anxiety with lower success; states such as anger, confidence, and frustration fall somewhere in the middle. We will compare this performance on acted speech with similar experiments on natural speech.
Different Valence/Different Activation
It is useful to note where direct-modeling features fall short: mean f0 successfully differentiates between emotions of different valence only so long as they also differ in degree of activation…
Different Valence/Same Activation
When emotions of different valence, such as happy and angry, have the same activation, simple pitch features are less successful.
Can We Identify Emotions in Natural Speech?
- AT&T's "How May I Help You?" (HMIHY) system (Liscombe, Riccardi & Hakkani-Tür '05)
- Customers are sometimes angry or frustrated
- Data: 5,690 operator/caller dialogues with 20,013 caller turns
- Turns labeled for degrees of anger, frustration, and negativity, then collapsed to positive vs. frustrated vs. angry
HMIHY Examples
(Audio examples: a very frustrated caller and a somewhat frustrated caller.)
Features
- Automatic acoustic-prosodic features
- Lexical features
- Pragmatic features (hand-labeled dialogue acts)
- Contextual features: all of the above, computed for the preceding one or two turns (see the sketch below)
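A minimal sketch of the contextual-feature idea: each turn's feature vector is augmented with the same features from the preceding one or two turns, plus their deltas. The feature names and the choice to include deltas are illustrative assumptions.

```python
# Sketch: augment per-turn features with lagged context.
def add_context(turn_features, n_context=2):
    """turn_features: list of dicts of numeric features, one per caller turn, in order."""
    contextualized = []
    for i, feats in enumerate(turn_features):
        row = dict(feats)
        for lag in range(1, n_context + 1):
            prev = turn_features[i - lag] if i - lag >= 0 else None
            for name, value in feats.items():
                if prev is None:
                    # No earlier turn: pad with zeros.
                    row[f"{name}_lag{lag}"] = 0.0
                    row[f"{name}_delta{lag}"] = 0.0
                else:
                    row[f"{name}_lag{lag}"] = prev[name]
                    row[f"{name}_delta{lag}"] = value - prev[name]
        contextualized.append(row)
    return contextualized
```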
Direct Modeling of Prosodic Features in Context
We also notice certain important effects of context which can be captured through direct modeling. For utterances labeled positive, there is little turn-to-turn change in features such as median pitch, mean energy, and speaking rate.
However, for utterances labeled angry, it is rare that anger emerges suddenly, without some earlier change in these features. So acoustic/prosodic context, as well as contour type, may indeed be useful in identifying a speaker's emotional state. Machine-learning classification of this data performs quite similarly to the acted-speech experiments in identifying anger and frustration, with accuracy in the mid-70% range.
Results

Feature Set      Accuracy   Rel. Improv.
Majority class   73.1%      -----
pros+lex         76.1%      -----
pros+lex+da      77.0%      1.2%
all              79.0%      3.8%

Relative improvements are with respect to the pros+lex system: e.g. (79.0 - 76.1) / 76.1 = 3.8%.
Implications for SDS
- SDS should be able to take advantage of current, imperfect emotion-prediction capabilities, even if they miss some angry people.
- E.g., some call center software monitors conversations for regions of high intensity (a toy sketch follows).
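A toy sketch of that monitoring idea: flag stretches of a call whose RMS energy stays well above the call's average. The thresholds are arbitrary illustrations, not any particular vendor's method.

```python
# Sketch: flag sustained high-intensity regions in one call.
import numpy as np

def high_intensity_regions(rms, factor=1.5, min_frames=20):
    """rms: 1-D array of per-frame RMS energy for one call."""
    threshold = factor * np.mean(rms)
    hot = rms > threshold
    regions, start = [], None
    for i, flag in enumerate(hot):
        if flag and start is None:
            start = i                      # region opens
        elif not flag and start is not None:
            if i - start >= min_frames:    # keep only sustained regions
                regions.append((start, i))
            start = None
    if start is not None and len(hot) - start >= min_frames:
        regions.append((start, len(hot)))  # region runs to end of call
    return regions
```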
Today: Whither Spoken Dialogue Systems?
- Technology issues
- Human factors issues
- Taking automated dialogue to the next level:
  - Modeling users' emotional state
  - Entrainment/adaptation/… to users
  - System personality
  - Cultural sensitivity
Entrainment/Adaptation/Accommodation/Alignment
Hypothesis: over time, people tend to adapt their communicative behavior to that of their conversational partner.
Issues:
- What are the dimensions of entrainment?
- How rapidly do people adapt?
- Does entrainment occur (on the human side) in human/computer conversations?
Varieties of Entrainment…
- Lexical: speaker and hearer tend over time to adopt the same way of referring to items in a discourse (one way to quantify this is sketched below), e.g.:
  A: It's that thing that looks like a harpsichord.
  B: So the harpsichord-looking thing…
  …
  B: The harpsichord…
- Phonological: word pronunciation, e.g. voiced vs. voiceless /t/ in "better"
- Acoustic/prosodic: speaking rate, pitch range, choice of contour
- Discourse/dialogue/social: marking of topic shift, turn-taking
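One simple way to quantify the lexical dimension is to compare how similarly two speakers distribute a set of target words. The sketch below uses a negative summed frequency difference (closer to zero means more entrained); the measure and the choice of target words are illustrative assumptions, not anything proposed on this slide.

```python
# Sketch: a simple lexical entrainment score for one speaker pair.
from collections import Counter

def lexical_entrainment(turns_a, turns_b, target_words):
    """turns_a, turns_b: lists of token lists, one per turn per speaker."""
    counts_a = Counter(w for turn in turns_a for w in turn)
    counts_b = Counter(w for turn in turns_b for w in turn)
    n_a = sum(counts_a.values()) or 1
    n_b = sum(counts_b.values()) or 1
    # Negative total difference in relative frequency over target words.
    return -sum(abs(counts_a[w] / n_a - counts_b[w] / n_b)
                for w in target_words)
```

Computing this separately over early and late portions of a conversation would show whether the pair's usage converges over time.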
The Vocabulary Problem
- Furnas et al. '87: two subjects rarely produce the same name for the same common computer operation. For "remove a file" alone, subjects offered: remove, delete, erase, kill, omit, destroy, lose, change, trash.
- Even accepting 20 synonyms for a single command, the likelihood that a given person's chosen term is among them was only 80%.
- With 25 commands, the likelihood that two people who chose the same term meant the same command was 15%.
- How can people possibly communicate? They collaborate on the choice of referring expressions.
Early Studies of Priming Effects
- Hypothesis: users will tend to use the vocabulary and syntax that the system uses.
- Evidence from data collections in the field.
- Systems should take advantage of this proclivity to prime users to speak in ways that the system can recognize well.
User Responses to Vaxholm
Answers to the question "What weekday do you want to go?" ("Vilken veckodag vill du åka?"):
- 22%: Friday ("fredag")
- 11%: I want to go on Friday ("jag vill åka på fredag")
- 11%: I want to go today ("jag vill åka idag")
- 7%: on Friday ("på fredag")
- 6%: I want to go on a Friday ("jag vill åka en fredag")
- -: are there any hotels in Vaxholm? ("finns det några hotell i Vaxholm?")
Verb Priming: "How often do you go abroad on holiday?"
Two question variants, differing only in the verb:
- "Hur ofta åker du utomlands på semestern?" (How often do you go [åka] abroad on holiday?)
- "Hur ofta reser du utomlands på semestern?" (How often do you travel [resa] abroad on holiday?)
Responses reusing åka:
- "jag åker en gång om året kanske" (I go maybe once a year)
- "jag åker ganska sällan utomlands på semester" (I quite seldom go abroad on holiday)
- "jag åker nästan alltid utomlands under min semester" (I almost always go abroad during my holiday)
- "jag åker ungefär 2 gånger per år utomlands på semester" (I go abroad on holiday about twice a year)
- "jag åker utomlands nästan varje år" (I go abroad almost every year)
- "jag åker utomlands på semestern varje år" (I go abroad on holiday every year)
- "jag åker utomlands ungefär en gång om året" (I go abroad about once a year)
- "varje år brukar jag åka utomlands" (every year I usually go abroad)
Responses reusing resa:
- "jag reser en gång om året utomlands" (I travel abroad once a year)
- "jag reser inte ofta utomlands på semester, det blir mera i arbetet" (I don't often travel abroad on holiday; it's more for work)
- "jag reser reser utomlands på semestern vartannat år" (I travel, travel abroad on holiday every other year)
- "jag reser utomlands en gång per semester" (I travel abroad once per holiday)
- "jag reser utomlands på semester ungefär en gång per år" (I travel abroad on holiday about once a year)
- "jag brukar resa utomlands på semestern åtminstone en gång i året" (I usually travel abroad on holiday at least once a year)
Responses with another verb, or no verb (ellipsis):
- "jag är nästan aldrig utomlands" (I am almost never abroad)
- "en eller två gånger om året" (once or twice a year)
- "en gång per semester" (once per holiday)
- "kanske en gång per år" (maybe once a year)
- "ungefär en gång per år" (about once a year)
- "åtminstone en gång om året" (at least once a year)
- "nästan aldrig" (almost never)
- "en gång per år" (once a year)
- "kanske en gång vart annat år" (maybe once every other year)
- "varje år" (every year)
- "vart tredje år ungefär" (about every third year)
- "nu för tiden inte så ofta" (not so often these days)
Results
- Reuse of the question's verb: 52%
- Ellipsis (no verb): 18%
- Other: 24%
- No reuse: 4%
- No answer: 2%
Lexical Entrainment in Referring Expressions
- Choice of referring expressions: informativeness vs. availability (basic level or not) vs. saliency vs. recency
- Gricean prediction: people use descriptions that minimally but effectively distinguish among items in the discourse (see the sketch after this list)
- Garrod & Anderson '87, the Output/Input Principle: conversational partners formulate their current utterance according to the model used to interpret their partner's most recent utterance
- Clark, Brennan et al.: people form conceptual pacts with particular conversational partners about appropriate referring expressions, and are loath to abandon these even when shorter expressions become possible
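The Gricean prediction can be made concrete with a simple referring-expression generator that adds attributes only while they rule out distractors: a minimal sketch in the spirit of Dale & Reiter's Incremental Algorithm (a technique named here for illustration, not a system from these slides). The example objects are invented, echoing the card-matching transcripts on the next slides.

```python
# Sketch: build a minimally distinguishing description of a target
# object by adding attributes until all distractors are ruled out.
def distinguishing_description(target, distractors, preference_order):
    description = {}
    remaining = list(distractors)
    for attr in preference_order:
        value = target.get(attr)
        if value is None:
            continue
        # Keep this attribute only if it rules out at least one distractor.
        if any(d.get(attr) != value for d in remaining):
            description[attr] = value
            remaining = [d for d in remaining if d.get(attr) == value]
        if not remaining:
            break
    return description if not remaining else None

# Hypothetical domain: distinguish one object among three.
target = {"type": "M&M", "color": "orange", "expression": "scared"}
distractors = [
    {"type": "M&M", "color": "blue", "expression": "happy"},
    {"type": "lion", "color": "blue", "expression": "scared"},
]
print(distinguishing_description(target, distractors,
                                 ["type", "color", "expression"]))
# -> {'type': 'M&M', 'color': 'orange'}
```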
Entrainment in Spontaneous Speech
S13: the orange M&M looking kind of scared and then a one on the bottom left and a nine on the bottom right
S12: alright I have the exact same thing I just had it's an M&M looking scared that's orange
S13: yeah the scared M&M guy yeah
S12: framed mirror and the scared M&M on the lower right
S13: and it's to the right of the scared M&M guy
S13: yeah and the iron should be on the same line as the frightened M&M kind of like an L
S12: to the left of the scared M&M to the right of the onion and above the iron
Neither S12 nor S13 repeats this description in their second session.
Extraterrestrial vs. Alien I
S11: okay in the middle of the card I have an extraterrestrial figure
S11: okay middle of the card I have the extraterrestrial
…
S10: I've got the blue lion with the extraterrestrial on the lower right
S11: okay I have the extraterrestrial now and then I have the eye at the bottom right corner
S10: my extraterrestrial's gone
Extraterrestrial vs. Alien II
S03: okay I have a blue lion and then the extraterrestrial at the lower right corner
S11: mm I'll pass I have the alien with an eye in the lower right corner
S03: um I have just the alien so I guess I'll match that
S10: yes now I've got that extraterrestrial with the yellow lion and the money
…
S12: oh now I have the blue lion in the center with our little alien buddy in the right hand corner
S10: with the alien buddy so I'm gonna match him with the single blue lion okay I've got our alien with the eye in the corner
Timing and Voice Quality
Guitar & Marchinkoski '01: How early do we start to adapt to others' speech? Do children adapt their speaking rate to their mother's speech?
Study: 6 mothers spoke with their own, normally speaking 3-year-olds (3M, 3F); mothers' rates were significantly reduced (B) or not (A) in an A-B-A-B design.
Results: 5 of 6 children reduced their rates when their mothers spoke more slowly.
Timing and Voice Quality with an Unfamiliar Partner
Sherblom & La Riviere '87: How are speech timing and voice quality affected by a non-familiar conversational partner?
Study: 65 pairs of undergraduates were asked to discuss a 'problem situation' together and to utter a single sentence before and after the conversation; the sentences were compared for speaking rate, utterance length, and vocal jitter.
Results: substantial influence of the partner on all three measures. Interpersonal uncertainty and differences in arousal influenced the degree of adaptation. Investigations of gender found a significant effect on adaptation only when the partner was male.
Amplitude and Response Latency
Coulston et al. '02: Do humans adapt to the behavior of non-human partners? Do children speak more loudly to a loud animated character?
Study: children interacted with an extroverted, loud animated character and with an introverted, soft-spoken one (TTS voices), across multiple tasks using different amplitude ranges; human and TTS amplitudes and latencies were compared.
Results: 79-94% of children adapted their amplitude, bidirectionally; they also adapted their response latencies (mean 18.4%), bidirectionally.
Social Status and Entrainment
Azuma '97: Do speakers adapt to the style of other social classes?
Study: a corpus-based study of the speech style of Japanese Emperor Hirohito during chihoo jyunkoo ('visits to the countryside'), based on published transcripts of his speeches.
Findings: Hirohito converged his speech style toward that of listeners lower in social status. His choices of verb forms and pronouns were no longer those of the person of highest authority, and were perceived as like those of a (low-status) mother. Previously he had used pronouns reserved for the highest authority (though only one such speech is recorded).
Socio-Cultural Influences and Entrainment
Co-teachers adapt their teaching styles (Roth '05).
Social context: a high school in the northeastern US with a predominantly African-American student body; Cristobal, a Cuban-African-American teacher, and Chris, a new Italian-American teacher.
Adaptation of Chris to Cristobal:
- Catch phrases (e.g. "right!", "really really hot") and their production: pitch and intensity contours. "Right" rises sharply; "really really" is downstepped, with higher pitch on the modifier.
- Pitch 'matching' across speakers
- Mimesis vs. entrainment
Conclusions for SDS
- Systems can make use of users' tendency to entrain to system vocabulary.
- Should systems also entrain to their users? CMU's Let's Go system adapts confirmation prompts to non-native speech, finding the closest match to user input in its own vocabulary (sketched below).
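A minimal sketch of that confirmation strategy, using Python's standard difflib as an illustrative stand-in for whatever matcher Let's Go actually uses; the vocabulary and prompt wording are hypothetical.

```python
# Sketch: map a possibly misrecognized user word to the closest entry
# in the system's own vocabulary before echoing it back.
import difflib

SYSTEM_VOCAB = ["forbes avenue", "fifth avenue", "murray avenue"]  # hypothetical

def confirmation_prompt(asr_hypothesis):
    match = difflib.get_close_matches(asr_hypothesis, SYSTEM_VOCAB,
                                      n=1, cutoff=0.6)
    if match:
        return f"Leaving from {match[0]}. Did I get that right?"
    return "Sorry, where are you leaving from?"

print(confirmation_prompt("forbs avenu"))
# -> Leaving from forbes avenue. Did I get that right?
```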
Today: Whither Spoken Dialogue Systems?
- Technology issues
- Human factors issues
- Taking automated dialogue to the next level:
  - Modeling users' emotional state
  - Entrainment/adaptation/… to users
  - System personality
  - Cultural sensitivity
Personality and Computer Systems
- Early-PC-era reports had significant others jealous of the time their partners spent with their computers.
- Reeves & Nass, The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places (1996).
- Evolution explains the anthropomorphization of the PC:
  - Humans evolved over millions of years without media.
  - The proper response to any stimulus was critical to survival.
  - Human psychological and physiological responses were well developed before media were invented.
  - Ergo, our bodies and minds react to media, immediately and fundamentally, as if they were real.
(Clifford Nass and Byron Reeves were then both in the Communications Department at Stanford.)
People See 'Personality' Everywhere
- Humans assess the personality of another (human or otherwise) quickly, with minimal clues.
- Perceived computer personality strongly affects how we evaluate the computer and the information it provides.
- Experiments: "dominant" and "submissive" computer interfaces were created, and subjects asked to use them to solve hypothetical problems:
  - Max (dominant) used assertive language, showed higher confidence in the information displayed (via a numeric scale), and always presented its own analysis of the problem first.
  - Linus (submissive) phrased information more tentatively, rated its own information at lower confidence levels, and allowed the human to discuss the problem first.
  - Each was used alternately by people whose own personalities had previously been identified as either dominant or submissive.
User Reactions
- Users described Max and Linus in human terms: aggressive, assertive, authoritative vs. shy, timid, submissive.
- Users correctly identified which machine was more like themselves.
- Users rated machines more like themselves as better computers, even though the content they received was exactly the same.
- Users rated their own performance better when the machine's personality matched theirs.
- People were more frank when rating a computer if the questionnaire was presented on another machine.
- Subjects thought highly of computers that praised them, even when the praise was clearly undeserved.
Personality in SDS
- Mairesse & Walker '07: PERSONAGE (PERSONAlity GEnerator), "PERSONAGE: Personality Generation for Dialogue"
- Based on the 'Big Five' personality trait model: extroversion, neuroticism, agreeableness, conscientiousness, openness to experience
- Attempts to generate "extroverted" language based on traits associated with extroversion in the psychology literature (a toy sketch of the idea follows)
- Demo: find your personality type
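A toy sketch of trait-conditioned generation in the spirit of PERSONAGE: an extroversion score is mapped to surface-realization parameters. The parameter names, coefficients, and thresholds are invented for illustration; the real system derives many more generation parameters from the psychology literature.

```python
# Sketch: map a personality trait score to generation parameters.
def realization_params(extroversion):
    """extroversion: score in [0, 1]; all mappings are illustrative."""
    return {
        "verbosity": 0.3 + 0.7 * extroversion,   # extroverts say more...
        "exclamation_prob": 0.5 * extroversion,  # ...and more emphatically
        "hedge_prob": 0.4 * (1 - extroversion),  # introverts hedge more
        "in_group_markers": extroversion > 0.6,  # "pal", "buddy", ...
    }

print(realization_params(0.9))  # parameters for a strongly extroverted persona
```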
Conclusions for SDS
- Systems can be designed to convey different personalities.
- Can they recognize users' personalities and entrain to them? Should they?
Goodbye! Final Paper