August 15, 2008, presented by Rio Akasaka

Presentation transcript:

On The Robustness Of Overall F0-only Modifications To The Perception Of Emotions In Speech Murtaza Bulut and Shrikanth Narayanan

General ideas Examines the effects of changing F0. Notes how F0 can be changed without altering the perceived emotion or sound quality of a particular utterance. Introduces the concept of emotional regions. Performs statistical analyses on the various modifications.

What? F0 (pitch) is a good descriptor of emotion. However, its usefulness is limited because F0 alone is not very discriminative in natural speech. Let’s introduce a new model called ‘emotional regions’ to represent utterances.

Why? Useful for making automated judgments about emotion in speech, e.g. MoodSwings (arousal content in speech) and the Timbre Game (F0 contours). Can complement facial recognition research.

Emotion Perception Analytic: F0 contour, range, voice quality. Contextual: sentence content, speaker.

Neutral · Sad · Joy · Anger (example slides for each emotion; the figures/audio did not survive in the transcript)

How? Changing the F0 mean: shifting the entire contour up or down. Changing the F0 range: multiplying the contour by a constant and shifting it so as to retain the original mean. Stylizing: representing the F0 contour with linear segments of differing resolutions.
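The three modifications above can be sketched in a few lines of numpy. This is a minimal illustration on a toy contour, not the authors' resynthesis pipeline (which operated on real utterances):

```python
import numpy as np

def shift_mean(f0, delta):
    """Shift the whole contour up or down: changes the F0 mean, keeps the range."""
    return f0 + delta

def scale_range(f0, factor):
    """Multiply the contour by a constant, then shift it back to the original mean."""
    mean = f0.mean()
    return (f0 - mean) * factor + mean

def stylize(f0, n_segments):
    """Represent the contour with n_segments linear pieces (lower resolution)."""
    knots = np.linspace(0, len(f0) - 1, n_segments + 1).astype(int)
    return np.interp(np.arange(len(f0)), knots, f0[knots])

# Toy F0 contour in Hz (a single rise-fall), purely illustrative
f0 = 120.0 + 30.0 * np.sin(np.linspace(0.0, np.pi, 100))
```

Note that `scale_range` deliberately restores the original mean, so range and mean can be manipulated independently, which is exactly what lets the paper study their effects separately.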

Data Collection 2 speakers × 2 sentences × 4 emotions × 29 modifications = 464 resynthesized files, plus the 16 originals = 480 files. Speakers: one male, one female. Sentences: “She told me what you did” and “This hat makes me look like an aardvark.” Emotions: happy, angry, neutral, sad.
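The corpus-size arithmetic works out as follows (the “+ original” term covers one unmodified recording per speaker/sentence/emotion combination):

```python
# Sanity check on the corpus size reported in the paper
speakers, sentences, emotions, modifications = 2, 2, 4, 29
modified = speakers * sentences * emotions * modifications  # 464 resynthesized files
originals = speakers * sentences * emotions                 # 16 original recordings
total = modified + originals                                # 480 files in total
```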

Analysis Listening test with 14 participants, who rated the emotion and the naturalness (quality) of each file.

Emotional regions 2D (F0 mean and range), not 3D. All resynthesized utterances are assigned an emotional label by majority voting, and are then grouped with their original utterance if both are identified as the same emotion. For each group the mean vector and covariance matrix are calculated, which determine the center and shape of the region’s contours. The Mahalanobis distance takes the correlation of the data set into consideration and is scale-invariant, which makes it useful for determining similarity (a Gaussian rather than Euclidean view of the data). The region within the contours represents the possible F0 values with which a given original utterance can be modified and still elicit the same emotion.

In-Depth Important to realize that the emotional regions do not define how new emotions can be synthesized. Perception of emotions is based not only on F0, but on the combined effects of prosodic (rhythm, stress, and intonation), spectral (speaker), and linguistic (sentence) factors. 4-way ANOVA with H0: emotion is perceived equally across all modifications; likewise H0: speech quality is perceived equally.
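The ANOVA logic can be illustrated with a hand-rolled F statistic. The ratings below are invented, and this is a one-way test rather than the paper's 4-way design, so treat it only as a sketch of the null hypothesis being tested:

```python
import numpy as np

def oneway_f(groups):
    """One-way ANOVA F statistic: between-group vs. within-group variance."""
    all_data = np.concatenate(groups)
    grand_mean = all_data.mean()
    k, n = len(groups), len(all_data)
    ssb = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
    return (ssb / (k - 1)) / (ssw / (n - k))

# Hypothetical emotion-recognition ratings under three F0 conditions
groups = [np.array([4.0, 5, 4, 5, 4]),   # original F0
          np.array([3.0, 4, 3, 4, 3]),   # after a mean shift
          np.array([2.0, 3, 2, 3, 2])]   # after range scaling
F = oneway_f(groups)  # large F → reject H0 (emotion not equally perceived)
```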

Observations Increasing the F0 mean (±50%): sad and neutral emotion perception increased, angry and happy decreased. Changing the F0 range caused more variation in emotion recognition than changing F0 contours. In some cases changing the F0 range did not change the sound quality. Decreasing the F0 range caused an increase in perceived sadness. Listeners were able to recognize emotion even with changes in F0 and distortion in sound quality. The drop in perceived speech quality is less severe for F0 range modifications than for F0 mean modifications. Changes in contour shape do not necessarily cause significant changes in emotion recognition.

Things to retain from this presentation Emotional regions can be used to parametrize emotions, but linguistic content must also be taken into account as a factor. Changing F0 did not necessarily change the perception of emotion. Changing the F0 range affected emotion perception more than changing the F0 mean; the drop in speech quality was also significantly less when manipulating the F0 range.

Bibliography http://emosamples.syntheticspeech.de/