Emotions and Voice Quality: Experiments with Sinusoidal Modeling
Authors: Carlo Drioli, Graziano Tisato, Piero Cosi, Fabio Tesser
Institute of Cognitive Sciences and Technologies - CNR, Department of Phonetics and Dialectology - Padova
VOQUAL'03: Voice Quality: Functions, Analysis and Synthesis, Geneva, August 27-29, 2003

Outline
- Objectives and motivations
- Voice material
- Acoustic indexes and statistical analysis
- Neutral-to-emotive utterance mapping
- Experimental results

Objectives and motivations
- Long-term goals: emotive speech analysis/synthesis; improvement of ASR/TTS systems
- Short-term goal: preliminary evaluation of processing tools for the reproduction of different voice qualities
- Focus of the talk: analysis/synthesis of different voice qualities corresponding to different emotive intentions
- Method: analysis of the acoustic correlates of voice quality, and definition of a sinusoidal modeling framework to control voice timbre and phonation quality (sketched below)
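To make the sinusoidal framework concrete, here is a minimal, hypothetical analysis sketch in Python (not the authors' implementation; frame sizes and peak counts are arbitrary assumptions). Each STFT frame is reduced to a set of spectral peaks (frequency, amplitude, phase), which is the representation that pitch shift, time stretch, and spectral conversion can then operate on.

```python
import numpy as np
from scipy.signal import stft, find_peaks

def sinusoidal_analysis(x, fs, n_fft=1024, hop=256, n_peaks=40):
    """Reduce each STFT frame to its strongest spectral peaks
    (frequency, amplitude, phase), McAulay-Quatieri style."""
    freqs, times, Z = stft(x, fs, nperseg=n_fft, noverlap=n_fft - hop)
    frames = []
    for k in range(Z.shape[1]):
        mag = np.abs(Z[:, k])
        peaks, _ = find_peaks(mag)                      # local maxima
        top = peaks[np.argsort(mag[peaks])[-n_peaks:]]  # keep the strongest
        frames.append((freqs[top], mag[top], np.angle(Z[top, k])))
    return times, frames

# On this representation, pitch shift amounts to scaling the peak
# frequencies of every frame (resampling the envelope to preserve
# formants), and spectral conversion reshapes the peak amplitudes.
```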

Voice material
An emotive voice corpus was recorded with the following characteristics:
- two phonological structures ’VCV: /’aba/ and /’ava/
- neutral (N) plus six emotional states: anger (A), joy (J), fear (F), sadness (SA), disgust (D), surprise (SU)
- 1 speaker, 7 recordings for each emotive intention, for each word

Analysis of emotive speech: acoustic correlates
Cue extraction and analysis: intensity, duration, pitch, pitch range, formants.
Prosodic indexes: F0 mean (global and for the stressed vowel), F0 "mid", and F0 range. The F0 stressed-vowel mean and the F0 mid values are strongly correlated.
[Figure: distribution of the F0 indexes for anger (A), joy (J), fear (F), sadness (SA), disgust (D), surprise (SU), and neutral (N)]
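As an illustration of this cue-extraction step, a minimal numpy-only sketch (not the authors' tooling; frame sizes, F0 search range, and the voicing threshold are arbitrary assumptions):

```python
import numpy as np

def frame_f0(frame, fs, fmin=60.0, fmax=400.0):
    """Rough per-frame F0 from the autocorrelation peak; 0.0 if unvoiced."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    if hi >= len(ac) or ac[0] <= 0:
        return 0.0
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag if ac[lag] > 0.3 * ac[0] else 0.0  # crude voicing check

def prosodic_indexes(x, fs, frame_s=0.04, hop_s=0.01):
    """F0 mean and range over voiced frames, plus overall intensity."""
    n, h = int(frame_s * fs), int(hop_s * fs)
    f0 = np.array([frame_f0(x[i:i + n], fs) for i in range(0, len(x) - n, h)])
    voiced = f0[f0 > 0]                      # assumes some voiced frames exist
    return {"F0_mean_Hz": float(voiced.mean()),
            "F0_range_Hz": float(voiced.max() - voiced.min()),
            "intensity_dB": float(20 * np.log10(np.sqrt(np.mean(x**2)) + 1e-12))}
```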

Analysis of emotive speech: acoustic correlates
Cue extraction and analysis (acoustic correlates of voice quality):
- Shimmer, jitter
- Harmonics-to-noise ratio (HNR)
- Hammarberg's index (HammI): difference between the energy maxima in the 0-2000 Hz and 2000-5000 Hz frequency bands
- Spectral flatness (SFM): ratio of the geometric to the arithmetic mean of the spectrum
- Drop-off of spectral energy above 1000 Hz (Do1000): LS approximation of the spectral tilt above 1000 Hz
- High- versus low-frequency relative energy (Pe1000)
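A minimal sketch of how these spectral indexes can be computed from one windowed frame. The HammI band limits follow the standard Hammarberg definition (the slide's exact values were lost in extraction), and the code assumes fs of at least 10 kHz; treat all constants as assumptions.

```python
import numpy as np

def voice_quality_indexes(frame, fs):
    """SFM, HammI, Pe1000 and Do1000 from a single analysis frame."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    power = spec ** 2 + 1e-12

    # Spectral flatness: geometric / arithmetic mean of the power spectrum
    sfm = np.exp(np.mean(np.log(power))) / np.mean(power)

    # Hammarberg index: dB difference between the energy maxima in the
    # 0-2000 Hz and 2000-5000 Hz bands
    low = power[(freqs >= 0) & (freqs < 2000)].max()
    high = power[(freqs >= 2000) & (freqs < 5000)].max()
    hamm = 10 * np.log10(low / high)

    # Pe1000: energy above 1000 Hz relative to energy below 1000 Hz
    pe1000 = power[freqs >= 1000].sum() / power[freqs < 1000].sum()

    # Do1000: least-squares slope of the dB spectrum above 1000 Hz
    m = freqs >= 1000
    slope, _ = np.polyfit(freqs[m], 10 * np.log10(power[m]), 1)
    return {"SFM": sfm, "HammI_dB": hamm,
            "Pe1000": pe1000, "Do1000_dB_per_Hz": slope}
```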

Analysis of emotive speech: voice quality
Voice quality patterns (distance from Neutral).
Discriminant analysis: classification scores of 60/70% for the stressed and unstressed vowel; best scores for Fear and Anger.
Voice quality characterization:
- Anger: harsh voice (/’a/)
- Disgust: creaky voice (/a/)
- Joy, Fear, Surprise: breathy voice
[Table: classification matrix for the stressed vowel]
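A sketch of how such a discriminant analysis could be run. The data here are synthetic placeholders (in practice one feature vector of the voice-quality indexes above per vowel token, labeled by emotion); only the pipeline, LDA with cross-validated accuracy and a confusion matrix like the slide's classification matrix, mirrors the slide.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict, cross_val_score

rng = np.random.default_rng(0)
# Placeholder data: 7 classes x 14 tokens, 7 voice-quality features each
# (jitter, shimmer, HNR, HammI, SFM, Do1000, Pe1000 in the real case).
X = rng.normal(size=(98, 7)) + np.repeat(rng.normal(size=(7, 7)), 14, axis=0)
y = np.repeat(["N", "A", "J", "F", "SA", "D", "SU"], 14)

lda = LinearDiscriminantAnalysis()
print("accuracy:", cross_val_score(lda, X, y, cv=7).mean())
print(confusion_matrix(y, cross_val_predict(lda, X, y, cv=7)))
```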

Processing of emotive speech: method
Neutral-to-emotive transformation based on sinusoidal modeling and spectral processing: the neutral sinusoidal spectral envelope, after pitch shift (Ps) and time stretch (Ts), is passed through a spectral envelope conversion function (operating on MFCCs computed from the envelope) to obtain the emotive sinusoidal spectral envelope.
Spectral conversion function design: a Gaussian mixture model conversion function (Stylianou et al., 1998), whose conversion parameters are trained on pairs of neutral and emotion-j spectral envelopes, one model per emotion.
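A minimal joint-density variant of the Stylianou-style GMM conversion, as a sketch. It assumes time-aligned pairs of neutral and emotive MFCC frames (alignment, e.g. by DTW, plus the component count and feature dimensions are placeholder assumptions, not the paper's settings).

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_conversion_gmm(X_neutral, Y_emotive, n_components=8):
    """Fit one GMM per emotion on joint [neutral; emotive] MFCC frames."""
    Z = np.hstack([X_neutral, Y_emotive])       # frames x 2d
    return GaussianMixture(n_components=n_components,
                           covariance_type="full", random_state=0).fit(Z)

def convert(gmm, X):
    """Regression form of the GMM mapping (Stylianou et al., 1998):
    y_hat = sum_i p(c_i|x) [mu_y_i + S_yx_i S_xx_i^-1 (x - mu_x_i)]."""
    d = X.shape[1]
    mu_x, mu_y = gmm.means_[:, :d], gmm.means_[:, d:]
    S_xx = gmm.covariances_[:, :d, :d]
    S_yx = gmm.covariances_[:, d:, :d]
    # Posterior p(c_i | x) from the source marginal of each component
    lik = np.stack([w * multivariate_normal.pdf(X, m, S)
                    for w, m, S in zip(gmm.weights_, mu_x, S_xx)], axis=1)
    post = lik / lik.sum(axis=1, keepdims=True)
    Y_hat = np.zeros((X.shape[0], mu_y.shape[1]))
    for i in range(gmm.n_components):
        A = S_yx[i] @ np.linalg.inv(S_xx[i])
        Y_hat += post[:, i:i + 1] * (mu_y[i] + (X - mu_x[i]) @ A.T)
    return Y_hat
```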

Processing of emotive speech: method
Neutral-to-emotive transformation based on the trained model.
[Audio examples: Neutral; Target Disgust; Target Sadness; Disgust; Sadness; Disgust (Ps+Ts); Sadness (Ps+Ts)]

Processing of emotive speech: results
Neutral-to-emotive transformation based on sinusoidal modeling:
[Audio examples: Neutral, plus Anger, Disgust, Joy, Fear, Surprise, and Sadness in three versions each: Ps+Ts, Ps+Ts+Sc, and Target]

Processing of emotive speech: results
Results:
- Time stretch and (formant-preserving) pitch shift alone cannot account for the principal emotion-related cues
- Spectral conversion can account for some of the emotion cues
- In general, the method cannot account for cues related to period-to-period variability (i.e., shimmer and jitter)
- The inclusion of a noise model is required to evaluate the effect on HNR
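For reference, HNR itself is commonly approximated from the normalized autocorrelation peak of a voiced frame (Boersma-style harmonicity); a minimal sketch, reusing the framing assumptions above, with the same caveat that it is an illustration rather than the paper's measure:

```python
import numpy as np

def frame_hnr_db(frame, fs, fmin=60.0, fmax=400.0):
    """Approximate HNR of a voiced frame from the normalized
    autocorrelation peak r: HNR = 10 log10(r / (1 - r))."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / ac[0]
    lo, hi = int(fs / fmax), int(fs / fmin)
    r = ac[lo:hi].max()
    r = min(max(r, 1e-6), 1 - 1e-6)   # keep the log argument finite
    return 10 * np.log10(r / (1 - r))
```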

Conclusions
- The sinusoidal framework was found adequate for processing emotive information
- Refinements (e.g., a noise model, a harshness model) are needed to account for all the acoustic correlates of emotions
- The results of the processing are perceptually good

Future work
- Refinements of the model (i.e., a noise model)
- Adaptation to a TTS system
- Search for the existence of speaker-independent transformation patterns (using multi-speaker corpora)