Emotional Speech Guest Lecturer: Jackson Liscombe CS 4706 Julia Hirschberg 4/20/05.

Assumptions (1)
Prosody is
– pitch ≈ fundamental frequency (f0)
– loudness ≈ energy (rms)
– duration ≈ speaking rate, hesitation
Prosody carries meaning
– given/new
– focus
– discourse structure
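The two acoustic correlates named above can be sketched in a few lines. This is a minimal illustration, not the lecture's own tooling: it assumes a single frame of samples, estimates f0 by picking the autocorrelation peak, and uses a synthetic 200 Hz sine as input. The function names, the 75–400 Hz search range, and the test signal are all assumptions made for the example.

```python
import math

def rms_energy(frame):
    """Root-mean-square amplitude of one frame of samples (loudness correlate)."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def estimate_f0(frame, sample_rate, f0_min=75.0, f0_max=400.0):
    """Estimate f0 as the lag with the largest autocorrelation.

    Only lags corresponding to f0_min..f0_max (assumed search range) are tried.
    """
    lag_min = int(sample_rate / f0_max)
    lag_max = int(sample_rate / f0_min)
    best_lag, best_score = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag

sample_rate = 8000
# Synthetic frame (illustrative only): 50 ms of a 200 Hz sine wave.
frame = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(400)]
f0 = estimate_f0(frame, sample_rate)
energy = rms_energy(frame)
print(f"f0 = {f0:.1f} Hz, rms = {energy:.3f}")
```

Real speech needs voicing detection and octave-error handling on top of this; the sketch only shows what "pitch ≈ f0" and "loudness ≈ rms" mean computationally.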

Assumptions (2)
Text-to-Speech Synthesis (TTS)
– formant-based
– concatenative / unit selection
– articulatory
Machine learning techniques
– predefined set of features
– learn rules on a training corpus
– apply rules to unseen data

Outline
Why do we care about emotional speech?
Emotional Speech Defined
Perception Studies
Production Studies
Lauren Wilcox on voice quality

Emotion. What is it Good For?
Spoken Dialogue Systems
– customer-care centers
– task planning
– tutorial systems
– automated agents
Approaching Artificial Intelligence

Emotion. Why is it ‘hard’?
Colloquial def. ≠ Technical def.
Emotions are non-exclusive
Human consensus low

Study I: Consensus
Liscombe et al.
User study to classify emotional speech tokens
Semantically neutral (dates and numbers)
10 emotions:
– confident, encouraging, friendly, happy, interested
– angry, anxious, bored, frustrated, sad
Example

Study I: Consensus
[Table: pairwise correlations between emotion labels. Row and column labels: sad, angry, bored, frust, anxs, friend, conf, happy, inter, encrg. The only surviving cell value is 0.62, in the inter row; p < 0.001.]

Study I: Consensus
Emotions are heavily correlated
Emotions are non-exclusive
Are emotion labels appropriate?
– activation
– valency
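To make "heavily correlated" concrete, here is a small sketch of how a correlation between two emotion labels could be computed. The slides do not say which statistic the study used; Pearson's r is assumed here, and the ratings are hypothetical values invented for illustration, not data from the study.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical listener ratings (0-4) of six speech tokens on three labels.
ratings = {
    "frustrated": [4, 3, 0, 1, 4, 0],
    "angry":      [3, 4, 0, 0, 3, 1],
    "happy":      [0, 0, 4, 3, 1, 4],
}
r_frust_angry = pearson(ratings["frustrated"], ratings["angry"])
r_frust_happy = pearson(ratings["frustrated"], ratings["happy"])
print(f"frustrated vs angry: {r_frust_angry:+.2f}")
print(f"frustrated vs happy: {r_frust_happy:+.2f}")
```

Labels that rise and fall together across tokens give a strongly positive r, which is the sense in which emotion labels can be non-exclusive.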

Perception of Emotional Speech
Machine learning to predict emotional states in human speech
Common features
– prosody
– lexical items
– voice quality

Acted Speech
1990s - present
Aubergé, Campbell, Cowie, Douglas-Cowie, Hirschberg, Liscombe, Mozziconacci, Oudeyer, Pereira, Roach, Scherer, Schröder, Tato, Yuan, Zetterholm, …

Study II: Acted Speech
4 actors
10 emotions
Binary decision trees (RIPPER)
Accuracy ranged from 70% to 80%
Prosody indicative of angry, happy, sad
Voice quality indicative of anxious, bored

Emotional Speech in Spoken Dialogue Systems
Batliner, Huber, Fischer, Spilker, Nöth (2003)
– Verbmobil (Wizard of Oz scenarios)
Ang, Dhillon, Krupski, Shriberg, Stolcke (2002)
– DARPA Communicator
Lee, Narayanan (2004)
– Speechworks call-center
Prosodic, lexical, and discourse-level features

Study III: Call-center
AT&T’s “How May I Help You” system
Predict anger and frustration

Study III: Call-center
“That amount is incorrect.”

Study III: Call-center
Feature sets
– Prosodic (f0, rms, speaking rate)
– Discourse (turn number, dialog act)
– Lexical (words)
– Contextual (dialogue history)

Study III: Call-center
Feature Set      Accuracy   Rel. Improv. over Baseline
Majority Class   73.1%      -----
pros+lex         76.1%      -----
pros+lex+da      77.0%      1.2%
all              79.0%      3.8%
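The relative-improvement column can be checked with simple arithmetic. The slide does not name the baseline explicitly; the figures are consistent with taking pros+lex (76.1%) as the baseline rather than the majority class, and that is what this sketch assumes. The function name is ours.

```python
def relative_improvement(accuracy, baseline):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (accuracy - baseline) / baseline

# Assumption: the baseline is the pros+lex feature set (76.1% accuracy).
baseline = 76.1
accuracies = {"pros+lex+da": 77.0, "all": 79.0}
improvements = {
    name: round(relative_improvement(acc, baseline), 1)
    for name, acc in accuracies.items()
}
print(improvements)  # {'pros+lex+da': 1.2, 'all': 3.8}
```

Measured against the majority class (73.1%) instead, the same accuracies would give larger relative gains, so the choice of baseline matters when reading the table.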

Study IV: Tutorial
Physics tutorial system
Detect student uncertainty
Examples

Production of Emotional Speech

TTS: Where are we now?
Natural-sounding speech for some utterances
– where there is a good match between input and database
Still… hard to vary prosodic features and retain naturalness
– yes-no questions: Do you want to fly first class?
Context-dependent variation still hard to infer from text and hard to realize naturally:

– appropriate contours from text
– emphasis, de-emphasis to convey focus, given/new distinction: I own a cat. Or, rather, my cat owns me.
– variation in pitch range, rate, pausal duration to convey topic structure
Characteristics of ‘emotional speech’ little understood, so hard to convey: …a voice that sounds friendly, sympathetic, authoritative….
How to mimic real voices?

Examples of Emotional Synthesis

The Role of Voice Quality in Communicating Emotion, Mood, and Attitude
Christer Gobl, Ailbhe Ni Chasaide
Some slide content borrowed from an online voice quality tutorial by K. Marasek, Experimental Phonetics Group at the Institute of Natural Language Processing, University of Stuttgart, Germany
L. Wilcox: Overview of Speech Communication paper for COMS4706

Voice Quality:
The characteristic auditory “coloring” of one’s voice
Derived from a variety of laryngeal and supralaryngeal features
Present throughout one’s speech
The natural and distinctive tone of speech sounds produced by a particular person yields a particular voice (Trask 1996).
This paper focuses on harsh voice, tense voice, modal voice, breathy voice, whispery voice, creaky voice, and lax-creaky voice, and on the role of these voice qualities in affective expression.
The larynx is used to transform an airstream into audible sounds. This process is central to perceived voice quality.
Most people in linguistics view voice qualities in terms of one quality in contrast with another.
Phonemic voice quality has a contrastive function in the phonological system of a language.

Experiment:
- Subjects were asked to listen to synthesized utterances.
- Utterances were synthesized with seven different voice qualities.
- Subjects were asked to rate the utterances on pairs of opposing affective attributes.

Motivation for Experiment
Many vocal expressions signal affect: pitch variables, speech rate, pausing structure, duration of accented/unaccented syllables. These are easier to measure than voice quality.
Voice quality is said to play a fundamental role in affective communication, but few empirical studies seek to understand voice source correlates.
Some natural voice qualities are said to map to affect and therefore assist in characterizing emotion in speech (based on phonetic observations).

Motivation for Experiment
- Different researchers have found varied mappings in their own empirical studies. Further study could confirm some previous findings.
Laver ’80, Scherer ’86, Laukkanen ’96:
– Breathy: intimacy
– Whispery: confidentiality, secrecy
– Harsh voice: anger
– Tense voice: anger, joy, fear
– Lax voice: sadness
But not all agree: Murray, Arnott (’93)
– Breathy: anger, happiness
– Modal to tense: sadness

Motivation for Experiment
- Some findings conclude that the glottal source contributes to the perception of valence as well as vocal effort (Laukkanen ’97).
- Synthesis might be an ideal tool for examining how individual features of a signal contribute to the perception of affect.
- Previous work has generated emotive synthetic speech through manipulation of voice quality parameters (Cahn ’90; Murray, Arnott ’95), but the synthesizers used (DECtalk) didn’t offer full control of these parameters.
- Voice quality might signal strong as well as milder emotional states and speaker attitude.

Different speech source behaviors generate different voice qualities.
The larynx adjusts in different ways to create different phonatory gestures and features.
Laver (’80) defines three, which are considered in this paper:
– Adductive tension (interarytenoid muscles adduct the arytenoid cartilages)
– Medial compression (adductive force on vocal processes; adjustment of ligamental glottis)
– Longitudinal tension (tension of vocal folds)
Recall scary glottis animation
Diagram: online voice quality tutorial by K. Marasek, Experimental Phonetics Group at the Institute of Natural Language Processing, University of Stuttgart, Germany

Modal voice
Neutral mode
Muscular adjustments are moderate
Vibration of the vocal folds is periodic with full closing of the glottis, so no audible friction noises are produced when air flows through the glottis
Frequency of vibration and loudness are in the low to mid range for conversational speech

Tense voice – voiced phonation
Very strong tension of the vocal folds, very high tension in the vocal tract leads to harsh voice quality.

Whispery voice – voiceless phonation
Very low adductive tension
Medial compression moderately high
Longitudinal tension moderately high
Little or no vocal fold vibration (sound is produced through turbulence generated by the friction of the air in and above the larynx, which produces frication)

Creaky voice – voiced phonation
Vocal folds vibrate at a very low frequency
– vibration is somewhat irregular; the vibrating mass is “heavier” because of low tension (only the ligamental part of the glottis vibrates)
The vocal folds are strongly adducted
Longitudinal tension is weak
Moderately high medial compression
Vocal folds “thicken” and create an unusually thick and slack structure.

Lax-creaky
Despite the definition of creaky voice quality, creaky voice is found to have high glottal tension at times and low tension at others.
A different creaky quality, lax-creaky, was created in the experiment as separate from creaky.
Lax-creaky = breathy voice settings + reduced aspiration noise and added “creakiness” for the experiment.

Breathy voice – voiced phonation
Tension is low
– minimal adductive tension
– weak medial compression
– medium longitudinal tension of the vocal folds
– folds do not come together completely, leading to frication

Voice quality estimation is difficult
If estimated with respect to a controlled neutral quality, how is that controlled quality known to be truly neutral? One must match the natural laryngeal behavior to the neutral model of behavior.
How adequate are the models of vocal fold movements for the description of real phonation?
The established relationships between a produced acoustical signal and the voice source are complex and, since we are only able to observe the behavior of voicing indirectly, prone to error.
Otherwise a direct source signal is needed: it is obtained by invasive techniques (ouch), and the invasion might interfere with the signal.

Voice quality estimation
Inverse filtering approach:
Speech production = source signal + vocal tract filter response
Inverse filtering cancels the effects of the vocal tract; the resulting signal is an estimate of the source
– an ill-posed problem (popular approaches are automatic, based on linear predictive analysis, but do worse for non-modal (colorful) qualities)
Still need to measure the inversely filtered signal
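A toy version of LPC-based inverse filtering is sketched below. It is not the procedure used in the paper: it assumes the vocal tract is a single two-pole resonator, the source is a bare impulse train, and the LPC order is 2, all chosen so the example stays small and checkable. Real analyses use higher orders and must also deal with lip radiation and the glottal pulse shape. All function and variable names are ours.

```python
import math

def synthesize(excitation, a):
    """All-pole filter 1/A(z): y[n] = x[n] - a[1]*y[n-1] - a[2]*y[n-2] - ..."""
    y = []
    for n, x in enumerate(excitation):
        feedback = sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(x - feedback)
    return y

def autocorrelation(signal, max_lag):
    """Unnormalized autocorrelation for lags 0..max_lag."""
    return [sum(signal[i] * signal[i + lag] for i in range(len(signal) - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns [1, a1, ..., a_order]."""
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        error *= 1.0 - k * k
    return a

def inverse_filter(signal, a):
    """Apply the FIR filter A(z) to cancel the estimated vocal tract."""
    return [sum(a[k] * signal[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(signal))]

# Assumed toy "vocal tract": one resonance at 500 Hz, 8 kHz sampling rate.
radius, theta = 0.9, 2 * math.pi * 500 / 8000
a_true = [1.0, -2 * radius * math.cos(theta), radius ** 2]
# Assumed toy "source": one impulse every 80 samples (100 Hz pitch).
excitation = [1.0 if n % 80 == 0 else 0.0 for n in range(400)]
speech = synthesize(excitation, a_true)

a_est = levinson_durbin(autocorrelation(speech, 2), 2)
residual = inverse_filter(speech, a_est)   # estimate of the source
```

In this idealized setting the residual comes back as an impulse train close to the excitation. With non-modal voice qualities the all-pole assumption fits less well, which is one reason the slide calls the problem ill-posed.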

Example:

Experiment:
- Subjects were asked to listen to synthesized utterances.
- Utterances were synthesized with seven different voice qualities.
- Subjects were asked to rate the utterances on pairs of opposing affective attributes.

Experiment - details
Natural utterances recorded in an anechoic chamber (“anechoic” = “without echo”)
High-quality recording of the Swedish utterance “ja adjo” (semantically neutral)
Statement heard by non-Swedish-speaking native speakers of Irish English
The recording was digitized at a high sampling frequency and high resolution (16 bit) and prepared for analysis

Experiment - details
Recorded utterance analyzed and parameterized.
The popular LF (Liljencrants-Fant) model of differentiated glottal flow (Fant et al., 1995) was used to match the measured glottal waveform with a theoretical model of the voice source.
Using LF: a waveform is described by a set of mathematical functions that model a given segment of the waveform.
The following parameters were used in the experiment:
– EE: excitation strength
– RA: normalized value of TA, the time constant of the exponential curve that describes the “rounding of the corner” of the waveform between t4 and t3, divided by t0 (amount of residual airflow after the main excitation, prior to maximum glottal closure)
– RG: measure of glottal frequency as determined by the opening branch of the glottal pulse (normalized to fundamental frequency)
– RK: measure of glottal pulse skew, defined by the relative durations of the opening and closing branches of the glottal pulse
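The normalized LF parameters can be computed from pulse timing values as sketched below. The formulas are the commonly cited conventions (RA = TA/T0, RG = T0/(2·TP), RK = (TE − TP)/TP) and are assumed here, since the slide gives the definitions only in words. TP is the time of peak flow and TE the time of main excitation, both measured from glottal opening. The timing values are hypothetical.

```python
def lf_r_parameters(t0, tp, te, ta):
    """Normalized LF parameters from timing values in seconds.

    t0: pitch period; tp: time of peak glottal flow; te: time of main
    excitation; ta: return-phase time constant (all measured from opening).
    Formulas follow the usual convention and are an assumption here.
    """
    return {
        "RA": ta / t0,           # return phase relative to the period
        "RG": t0 / (2.0 * tp),   # glottal frequency 1/(2*tp) over f0 = 1/t0
        "RK": (te - tp) / tp,    # closing branch relative to opening branch
    }

# Hypothetical pulse: 100 Hz voice with a 4 ms opening branch.
params = lf_r_parameters(t0=0.010, tp=0.004, te=0.0055, ta=0.0003)
for name, value in params.items():
    print(f"{name} = {value:.3f}")
```

A larger RA means a slower, more gradual closure (more residual airflow), while a smaller RK means a more skewed pulse with a shorter closing branch.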

Experiment - details
Utterance resynthesized with modal voice quality (moderate tension) using a formant synthesizer (KLSYN88a, Sensimetrics Corp., Boston) allowing control of source and filter parameters and different variations of each.
Once synthesized with modal voice, the modal stimulus is reproduced six times, each time with a different non-modal voice quality (tense, breathy, whispery, creaky, harsh, lax-creaky).
This is done by adjusting parameters such as
- fundamental frequency
- Open Quotient (OQ): ratio of the time in which the vocal folds are open to the whole pitch period duration
- Speed Quotient (also called skewness or rk): ratio of rise time to fall time of the glottal flow
- and more, adjusted differently to create different voice qualities
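The two quotients defined above can be read off one period of a glottal flow waveform, as in this sketch. It assumes the period starts and ends with the glottis closed (flow exactly zero) and uses a hypothetical triangular pulse. Neither the function name nor the pulse shape comes from the paper.

```python
def glottal_quotients(flow):
    """Open quotient and speed quotient from one pitch period of glottal flow.

    Assumes the period begins and ends with zero flow (glottis closed).
    """
    open_samples = [i for i, v in enumerate(flow) if v > 0.0]
    opening = open_samples[0] - 1     # last closed sample before the pulse
    closing = open_samples[-1] + 1    # first closed sample after the pulse
    peak = max(open_samples, key=lambda i: flow[i])
    open_quotient = (closing - opening) / len(flow)
    speed_quotient = (peak - opening) / (closing - peak)
    return open_quotient, speed_quotient

# Hypothetical triangular pulse: 100-sample period, flow rises for 40 samples,
# falls for 20 samples, then stays at zero for the remaining 40.
period = 100
flow = [i / 40 if i <= 40 else max((60 - i) / 20, 0.0) for i in range(period)]
oq, sq = glottal_quotients(flow)
print(f"OQ = {oq:.2f}, SQ = {sq:.2f}")  # OQ = 0.60, SQ = 2.00
```

Breathy settings would typically correspond to a larger open quotient, while tenser settings shorten the open phase. The sketch only shows how the ratios are measured.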

Experiment - details
Perception tests were constructed with each of the stimuli and given to subjects:
8 short subtests with 10 randomly chosen stimuli were given to subjects.
Interval between sets: 7 sec; within each set of stimuli: 4 sec
Subjects respond to the affective content of the stimuli on a scale of 1 to 7 (opposite terms on either side): responses are elicited for one particular pair of opposite affective attributes (bored vs. interested, friendly vs. hostile, sad vs. happy, intimate vs. formal, timid vs. confident, afraid vs. unafraid)
12 subjects participated: 6 male, 6 female

Results

Results
Voice quality and subject variables were statistically highly significant
Differences between individual qualities were statistically significant
Most readily perceived: relaxation and stress
Highly perceived: anger, boredom, intimacy, content, formal (aside from anger, these could be categorized as states, moods, attitudes, so consistent with experiment goal)
Least well perceived: unafraid, afraid, friendly, happy, sad
Milder states better signaled than strong emotion

Results
Notice that the modal stimulus is not perceived as totally neutral
Similar response patterns occurred with breathy/whispery and tense/harsh
Lax-creaky vs. creaky does show significant differences
Results and their comparison to previous findings:
– Lax-creaky: lower arousal, activation
– Whispery: timid, afraid
– Tense: high arousal/activation (confident, interested, happy, angry)
– Breathy, whispery, creaky, and more so lax-creaky: relaxed, content, intimate, friendly, sad, bored
– Lax-creaky, more so than whispery, effectively signaled intimacy
– And lax-creaky, more so than breathy, signaled sadness
– Linking of breathy voice to anger and happiness was not supported
– A shift from modal to tense elicited happy affect (rather than sad as proposed by Murray/Arnott ’93)
– Anger is shown to link to tense voice and joy (Scherer ’86)
– As one moves from the high- to the low-activation stimulus set, cross-subject variability increases

Some pros and cons of this study
+ Showed that voice quality alone can evoke differences in speaker affect
- But when comparing only synthesized voices, isn’t it a question of which is relatively more colorful?
+ Voice qualities are multi-colored and each maps to a variety of affective expressions (expressions are in some cases related, in others unrelated)
+ Traditional view that voice quality conveys valence of emotion but not activation is challenged (for affective states with negative valence, activation still differentiates them and is detected with voice quality alone)
- Hard to know to what degree naturally occurring phenomena match the model and the model matches the synthesis, and which level to look at to improve or criticize when hearing the final synthesis
- Aside from a phonetic system, subjects might associate voice qualities depending on personal situations, events, etc. (could whispery sound sinister?)
- When only deciding between 2 extremes, subjects might have difficulty trying “not” to listen for the purpose of choosing one or another (?)
- But the same data reduction occurred, so the beginning natural utterance is not an exact “copy”