August 15, 2008, presented by Rio Akasaka. On The Robustness Of Overall F0-only Modifications To The Perception Of Emotions In Speech. Murtaza Bulut and Shrikanth Narayanan.
General ideas: Examines the effects of changing F0. Notes how F0 can be changed without changing the perceived emotion or the sound quality of a particular utterance. Introduces the concept of emotional regions. Performs statistical analyses on the various modifications.
What? F0 (pitch) is a good descriptor of emotion. However, its usefulness is limited because it isn’t so descriptive in natural speech. Let’s introduce a new model called ‘emotional regions’ to represent utterances.
Why? Useful in making automated judgments on emotion in speech: MoodSwings (arousal content in speech), Timbre Game (F0 contours). Can complement facial recognition research.
Emotion Perception. Analytic: F0 contour, range, voice quality. Contextual: sentence content, speaker.
Neutral
Sad
Joy
Anger
How? Changing the F0 mean: shifting the entire contour up or down. Changing the F0 range: multiplying the contour by a constant and shifting it so as to retain the original mean. Stylizing: representing the F0 contour with linear segments of differing resolutions.
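The three modification types can be sketched in a few lines of NumPy. This is only an illustration, not the authors' implementation: it assumes the contour is an array of F0 values in Hz for voiced frames only, and the function names, the `factor` parameter, and the choice of equally spaced anchor frames for stylization are all hypothetical.

```python
import numpy as np

def shift_mean(f0, factor):
    """Shift the entire contour up or down so that its mean becomes
    factor * (original mean). The range of the contour is unchanged."""
    f0 = np.asarray(f0, dtype=float)
    return f0 + (factor - 1.0) * f0.mean()

def scale_range(f0, factor):
    """Multiply the contour by a constant, then shift it back so that
    the original mean is retained: factor * f0 + (1 - factor) * mean."""
    f0 = np.asarray(f0, dtype=float)
    mean = f0.mean()
    return mean + factor * (f0 - mean)

def stylize(f0, n_segments):
    """Approximate the contour with straight lines drawn between
    equally spaced anchor frames (an assumed stylization scheme).
    Fewer segments means a coarser resolution."""
    f0 = np.asarray(f0, dtype=float)
    anchors = np.linspace(0, len(f0) - 1, n_segments + 1).round().astype(int)
    anchors = np.unique(anchors)  # drop repeats when n_segments is large
    return np.interp(np.arange(len(f0)), anchors, f0[anchors])
```

For example, `shift_mean(f0, 1.5)` raises the mean by 50% while leaving the range alone, and `scale_range(f0, 0.5)` halves the range while leaving the mean alone.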
Data Collection: 2 speakers x 2 sentences x 4 emotions x (29 modifications + 1 original) = 480 files. Speakers: male, female. Sentences: “She told me what you did.” and “This hat makes me look like an aardvark.” Emotions: happy, angry, neutral, sad.
Analysis. Listening test: 14 people rate emotion and naturalness (quality).
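The file count can be checked by enumerating every combination. The version labels (`mod_01` and so on) are placeholders invented for this sketch; the slides do not name the 29 individual modifications.

```python
from itertools import product

speakers = ["male", "female"]
sentences = [
    "She told me what you did.",
    "This hat makes me look like an aardvark.",
]
emotions = ["happy", "angry", "neutral", "sad"]
# One original plus 29 modified versions (hypothetical labels).
versions = ["original"] + [f"mod_{i:02d}" for i in range(1, 30)]

stimuli = list(product(speakers, sentences, emotions, versions))
print(len(stimuli))  # 2 * 2 * 4 * 30 = 480
```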
Emotional regions: 2D (F0 mean and range), not 3D. Mahalanobis distance. All resynthesized utterances are assigned an emotional label using majority voting. These are then grouped with their original utterances if they have been identified as expressing the same emotion. For each group the mean vector and covariance matrix are calculated; these determine the center and the shape of the region's contours, respectively. The Mahalanobis distance takes the correlation of the data set into account and is scale-invariant, which makes it useful for determining similarity. The region within the circles represents the possible F0 values to which a given original utterance can be modified and still elicit the same emotion. Gaussian vs. Euclidean.
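The steps above (majority vote, per-group mean and covariance, Mahalanobis distance) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the paper's procedure: each utterance is assumed to be a 2D point (F0 mean, F0 range), and the function names are hypothetical.

```python
from collections import Counter
import numpy as np

def majority_label(votes):
    """Return the most frequent label among the listeners' votes.
    Ties go to the label that was seen first (an arbitrary choice)."""
    return Counter(votes).most_common(1)[0][0]

def fit_region(points):
    """Estimate the center (mean vector) and shape (covariance matrix)
    of a region from the (F0 mean, F0 range) points of one group."""
    points = np.asarray(points, dtype=float)
    return points.mean(axis=0), np.cov(points, rowvar=False)

def mahalanobis(x, mean, cov):
    """Distance from x to the region center, measured in units of the
    group's own spread: sqrt((x - mean)^T cov^{-1} (x - mean))."""
    diff = np.asarray(x, dtype=float) - mean
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))
```

Because the covariance normalizes each axis, rescaling one feature (say, expressing the range in different units) leaves the distance unchanged, which a plain Euclidean distance would not. A region could then be taken as all points whose distance falls below some threshold; the slides do not state the threshold value, so none is assumed here.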
In-Depth. It is important to realize that the emotional regions do not define how new emotions can be synthesized. Perception of emotions is based not only on F0, but on the combined effects of prosodic (rhythm, stress and intonation), spectral (speaker) and linguistic (sentence) factors. 4-way ANOVA with H0: emotion is perceived equally across all modifications; H0: speech quality is perceived equally.
Observations. Increasing the F0 mean (+/- 50%): sad and neutral emotion perception increased, angry and happy decreased. Changing the F0 range caused more variation in emotion recognition than changing F0 contours. In some cases changing the F0 range did not change the sound quality. Decreasing the F0 range caused an increase in sad. Listeners were able to recognize emotion even with changes in F0 and distortion in sound quality. The drop in perceived speech quality is less severe for F0 range modifications than for mean modifications. Changes in contour shape do not necessarily cause significant changes in emotion recognition.
Things to retain from this presentation: Emotional regions can be used to parametrize emotions, but linguistic content also needs to be taken into account as a factor. Changing F0 did not necessarily change the perception of emotions. Changing the F0 range affected emotion perception more than changing the F0 mean. Also, the drop in speech quality was significantly smaller when modifying the F0 range.
Bibliography http://emosamples.syntheticspeech.de/