Emotional Speech
Julia Hirschberg
CS 6998
2/16/2019

Today
Defining emotional speech
Emotional categories
Eliciting judgments
Producing emotional speech
Detecting emotional speech
A subclass: deceptive speech

Cowie ’00
Is there a good theoretical or practical definition of emotional speech?
“Full-blown” emotion vs. emotional state
Cause-and-effect descriptions
Primary and secondary (second-order) emotions
Everyday descriptions
Biological representations

Dimensions in continuous space (see the sketch after this slide), e.g.:
Valence: positive or negative
Activation level: how disposed the agent is to take action
Structural models: different ways of appraising the situation that evokes the emotion, e.g. Is it positive or negative? Does it help the agent achieve his/her goals?
Timing as a key variable: sadness vs. grief vs. depression vs. gloominess
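A minimal Python sketch of the dimensional view: discrete labels placed in a continuous valence/activation space, with a nearest-neighbor mapping back to a label. The coordinates and label set are illustrative assumptions, not values from the lecture.

```python
# Toy valence/activation emotion space; coordinates are illustrative assumptions.
import math

# (valence, activation), each in [-1, 1]
EMOTION_SPACE = {
    "happy":  ( 0.8,  0.5),
    "angry":  (-0.6,  0.8),
    "sad":    (-0.7, -0.5),
    "calm":   ( 0.4, -0.6),
    "afraid": (-0.5,  0.6),
}

def nearest_emotion(valence: float, activation: float) -> str:
    """Map a point in the continuous space to the closest discrete label."""
    return min(EMOTION_SPACE,
               key=lambda e: math.dist((valence, activation), EMOTION_SPACE[e]))

print(nearest_emotion(0.7, 0.4))  # -> "happy"
```

One appeal of this representation is that blends and weak emotional states fall naturally between the category anchors rather than being forced into a single label.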

How are emotions expressed?
Display rules?
In speech?
Mixing
Simulation

Schroeder ’01: Emotion in Synthesis
How is a given emotion expressed in speech?
What are the properties of the emotion to be expressed? How are they related to those of other emotions?
What kind of synthesizer works best? Formant, diphone, unit selection

Prosody rules: what to modify?
How do we evaluate the results?
Forced choice
Free response
Recognition rate
Perceived naturalness

Ten Bosch ’00: Emotion Recognition
How hard is the problem? Is ‘standard’ ASR technology well-suited to it?
Acoustic and language models target short, local events
Feature extraction normalizes away or excludes e.g. pitch, rate, amplitude -- why?
Interaction: emotional speech and ASR performance
Synthesis needs one good example, but...

Ang et al.
Challenges:
Use output from an ASR system
Use automatic prosodic features
Find good speaker normalization
Combine with lexical features
Pioneered the “direct modeling” approach: no use of intermediate phonological units
Applications: detecting frustration, disappointment/tiredness, amusement/surprise
Results: prediction comparable to human accuracy (70-75%)

Method: Prosodic Models
Extract pitch from the signal
Speech recognizer outputs word and phone alignments (duration features)
Utterance-level features extracted (e.g., max speaker-normalized pitch in the longest phone-normalized vowel)
Decision trees created to provide posterior probabilities of emotion classes given the features (see the sketch after this slide)
Feature selection from a development test set
Separate test set used for evaluation
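A minimal sketch of the decision-tree step: given utterance-level prosodic features, a tree yields posterior probabilities over emotion classes. The feature names and training data here are hypothetical placeholders, not the Ang et al. setup.

```python
# Decision tree producing P(emotion class | prosodic features); toy data only.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Rows: utterances; columns: stand-ins for e.g. normalized max pitch,
# longest-vowel duration, and pause ratio (hypothetical features).
X_train = np.array([[1.2, 0.8, 0.1],
                    [0.4, 1.5, 0.6],
                    [1.0, 0.9, 0.2],
                    [0.3, 1.6, 0.7]])
y_train = np.array(["neutral", "frustrated", "neutral", "frustrated"])

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# predict_proba returns posterior probabilities for each class.
posteriors = tree.predict_proba([[0.5, 1.4, 0.5]])
print(dict(zip(tree.classes_, posteriors[0])))
```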

Prosodic Features
Duration features: phone/vowel/syllable durations, normalized by phone/vowel means and by speaker
Speaking-rate features (vowels/time)
Pause features: speech-to-pause ratio, number of long pauses, maximum pause length
Energy features (RMS energy)
Pitch features: pitch stylization algorithm (Sonmez et al.); LTM model of F0 to estimate speaker range; pitch ranges, slopes, locations of interest
Spectral tilt features
Other (non-prosodic) features: position of utterance in dialog; repeat or correction
(A sketch of extracting a few such features appears below.)
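A minimal sketch of extracting a few of these utterance-level features (F0 statistics, RMS energy, a crude pause ratio) with librosa. This is an illustrative approximation, not the stylization/LTM pipeline the slide cites; the filename and the silence threshold are assumptions.

```python
# Toy prosodic feature extraction for one utterance.
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical audio file

# Pitch track via the pYIN algorithm; unvoiced frames come back as NaN.
f0, voiced_flag, voiced_probs = librosa.pyin(y, fmin=75, fmax=400, sr=sr)
f0 = f0[~np.isnan(f0)]

rms = librosa.feature.rms(y=y)[0]       # frame-level RMS energy
silent = rms < 0.5 * np.median(rms)     # crude pause detector (assumed threshold)

features = {
    "f0_mean": float(np.mean(f0)),
    "f0_range": float(np.max(f0) - np.min(f0)),
    "rms_mean": float(np.mean(rms)),
    "pause_ratio": float(np.mean(silent)),
}
print(features)
```

For speaker normalization as described above, these raw statistics would then be z-scored against each speaker's own means and ranges.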

Emotion in Deception
Motivation: why might such cues exist? Deception evokes emotion in deceivers (e.g. Ekman ’85-92)
Fear of discovery: higher pitch, faster, louder, pauses, disfluencies, indirect speech
Elation at successful deceiving: higher pitch, faster, louder, greater elaboration

Acoustic/Prosodic/Lexical Cues
Are deceivers less forthcoming? Shorter speech with fewer details
Are lies less compelling than truths?
Less plausible and logical, more discrepancies
Less verbal and vocal ‘involvement’
Less verbal ‘immediacy’: more passives, negations, indirect speech
More uncertainty (subjective)
More repetitions
Are liars less positive, less pleasant?

More negative statements, complaints
Are liars more tense? Nervous overall, vocal tension, high pitch
Do lies contain fewer ‘imperfections’?
Fewer self-repairs
Fewer admissions of forgetfulness
Fewer scene descriptions and details
More mention of peripheral events or relationships

Current State of the Art
No single cue to deceptive speech: the most studied cues are visual
Other acoustic/prosodic features proposed, but evidence mixed so far: loudness/intensity, speaking rate, response latency, disfluencies
No attested method to detect deception automatically using acoustic/prosodic/lexical cues
All current findings are descriptive, suggestive
All proposed methods require human intervention

Our Approach
Elicit a deceptive and non-deceptive corpus
Motivation: identity-relevant (self-image) and instrumental (monetary) incentives
“Real” deception vs. acted
Good recording conditions
Tasks/interview paradigm
Transcription/annotation
Acoustic/prosodic/lexical analysis to identify features of interest and test the validity of the paradigm
Automatic feature extraction and analysis to train models of deceptive and non-deceptive speech

Corpus Collection
Subjects asked to perform tasks for comparison with a target profile of 25 top entrepreneurs
Performance manipulated to be the same as or different from the target
Monetary incentive to convince an interviewer they matched the target
Recorded interview/interrogation
Biographical information (t/f)
“Big lie” on task performance
“Local lie”: pedal presses indicating t/f for each answer

Collection
To date: 15 subjects, totaling ~3h of subject speech
Planned: 7-8 hours of subject speech

Results of Prosodic/Acoustic Analysis
On the Arizona Mock Theft data subset: 32 interviews/72m, which required segmentation and had recording issues (50/160m more being segmented)
Significant pitch feature differences between deceptive and non-deceptive speech, but... (a sketch of this kind of comparison follows below)
Highly motivated speakers lower pitch when lying
Low-motivation speakers raise pitch when lying
Males lower pitch when lying
Females raise pitch when lying
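A minimal sketch of the kind of comparison behind these results: speaker-normalized mean pitch per utterance in the deceptive vs. truthful conditions, tested with a t-test. The numbers are fabricated placeholders purely to make the example run, not data from the study.

```python
# Toy comparison of speaker-normalized mean F0 across conditions.
import numpy as np
from scipy import stats

# z-scored mean F0 per utterance, one array per condition (placeholder values,
# illustrating e.g. a highly motivated speaker who lowers pitch when lying).
truthful  = np.array([0.10, -0.05, 0.20, 0.00, 0.15])
deceptive = np.array([-0.30, -0.45, -0.20, -0.35, -0.40])

t, p = stats.ttest_ind(deceptive, truthful, equal_var=False)  # Welch's t-test
print(f"t = {t:.2f}, p = {p:.3f}")
```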

On the Columbia corpus:
Preliminary analyses of 8 speakers for ‘local’ t/f
Significant differences in pitch range for six subjects, but these differ from Mock Theft with respect to gender
Lexical findings: preliminary analyses on the Columbia data using LIWC show negative words more prevalent in deceptive speech (a toy version of such a count follows below)
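A toy sketch of a LIWC-style lexical count: the proportion of negative words per utterance. The tiny lexicon here is an illustrative assumption; LIWC itself is a proprietary, much larger, hand-curated dictionary.

```python
# Toy LIWC-style count of negative words; the lexicon is a stand-in.
NEGATIVE_WORDS = {"no", "never", "not", "bad", "wrong", "hate", "problem"}

def negative_word_rate(utterance: str) -> float:
    """Fraction of tokens that appear in the negative-word lexicon."""
    tokens = utterance.lower().split()
    if not tokens:
        return 0.0
    return sum(t.strip(".,!?") in NEGATIVE_WORDS for t in tokens) / len(tokens)

print(negative_word_rate("No, I never took it, that would be wrong."))
```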