EMOTIONAL SPACE IMPROVES EMOTION RECOGNITION
Raquel Tato, Rocio Santos, Ralf Kompe
Man Machine Interface Lab, Advanced Technology Center Stuttgart
Sony International (Europe) GmbH

Content
State of the Art
Motivation
Goal
Approach
Results
Conclusions
Future Research

State of the Art
Database: professional actors
– Not really spontaneous speech.
– Exaggerated emotions following stereotypes.
Features: prosody features
– Easy to calculate.
– Represent only one dimension of the emotional space: arousal.
– The pleasure dimension is related to voice quality features.

Activation-Evaluation Theory
[Figure: two-dimensional emotional space. Vertical axis: activation (very passive to very active), associated with prosody. Horizontal axis: evaluation (very negative to very positive), associated with voice quality. Emotions such as happy, excited, delighted, exhilarated, blissful, pleased, interested, content, relaxed, serene, neutral, bored, sad, depressed, despairing, disgusted, afraid, terrified, furious and angry are placed as points in this space.]
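To make the representation concrete, the sketch below encodes emotions as points in this two-dimensional space. The coordinates are illustrative guesses for the five emotions used later in this work, not values taken from the slides; both axes are assumed to be scaled to [-1, 1].

```python
# Emotions as points in the activation-evaluation space.
# Activation relates to prosody, evaluation (pleasure) to voice quality.
# Coordinates are illustrative assumptions, not measured values.
EMOTION_SPACE = {
    #            (activation, evaluation)
    "happy":     ( 0.6,  0.7),
    "angry":     ( 0.8, -0.7),
    "neutral":   ( 0.0,  0.0),
    "bored":     (-0.6, -0.3),
    "sad":       (-0.7, -0.6),
}

def arousal_group(emotion):
    """Map an emotion to the high / medium / low arousal groups used below."""
    activation, _ = EMOTION_SPACE[emotion]
    if activation > 0.3:
        return "high"
    if activation < -0.3:
        return "low"
    return "medium"
```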

Prosody Features
Acoustic: based on the speech signal. Ex.: rising or falling intonation, accents, stress.
Linguistic (lexical, syntactic, semantic). Ex.: syllable accent, sentence structure, etc.
Example utterance: "Komm wir spielen" ("Come, let's play"), spoken bored vs. happy.
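As a rough illustration of acoustic prosody features, the sketch below extracts an F0 contour and short-time energy from an utterance and reduces them to utterance-level statistics. It uses the librosa library; the file name, pitch range and the particular statistics are assumptions for illustration, and it covers only part of the feature set listed later in the approach (log F0 and its derivative, energy).

```python
# A minimal sketch of prosody feature extraction with librosa.
# "utterance.wav" is a hypothetical mono recording; pitch range and the
# summary statistics below are illustrative choices, not the original setup.
import numpy as np
import librosa

def prosody_features(path, sr=16000):
    y, _ = librosa.load(path, sr=sr)
    # F0 contour via the pYIN tracker; unvoiced frames come back as NaN.
    f0, voiced_flag, _ = librosa.pyin(y, fmin=75, fmax=500, sr=sr)
    log_f0 = np.log(f0[~np.isnan(f0)])      # log F0 over voiced frames
    delta_log_f0 = np.diff(log_f0)          # frame-to-frame derivative
    energy = librosa.feature.rms(y=y)[0]    # short-time RMS energy
    # Utterance-level statistics, as typically fed to a classifier.
    return {
        "log_f0_mean": log_f0.mean(),
        "log_f0_std": log_f0.std(),
        "log_f0_slope": delta_log_f0.mean(),
        "energy_mean": energy.mean(),
        "energy_std": energy.std(),
        "voiced_ratio": float(voiced_flag.mean()),
    }

# Example: features = prosody_features("utterance.wav")
```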

Voice Quality Features
– Phonatory quality: auditory qualities that arise from variation in the source signal. Ex.: glottal spectrum.
– Articulatory precision: vocal tract properties. Ex.: formant structure.
– Speech
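One common way to approximate the articulatory (formant) side of voice quality is LPC analysis of a short voiced frame. The sketch below is an illustration under stated assumptions, not the method used in the original work: the pre-emphasis coefficient, the LPC-order rule of thumb and the candidate thresholds are all illustrative choices.

```python
# A rough sketch of formant estimation via LPC pole analysis, one way to
# approximate the vocal-tract (formant) properties mentioned above.
# Pre-emphasis, LPC order, and the candidate thresholds are illustrative.
import numpy as np
import librosa

def estimate_formants(frame, sr, order=None):
    order = order or int(2 + sr / 1000)               # common rule of thumb
    # Pre-emphasis and windowing before fitting the LPC polynomial.
    emphasized = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    windowed = emphasized * np.hamming(len(emphasized))
    a = librosa.lpc(windowed, order=order)            # LPC coefficients
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]                 # one root per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)        # pole angles -> Hz
    bandwidths = -np.log(np.abs(roots)) * sr / np.pi  # 3 dB bandwidth estimate
    # Keep plausible formant candidates, sorted by frequency (F1, F2, F3).
    candidates = sorted(f for f, bw in zip(freqs, bandwidths) if f > 90 and bw < 400)
    return candidates[:3]
```

In practice such an estimate would be computed on voiced frames (typically vowels) and aggregated over the utterance.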

Goal
Spontaneous emotion recognizer:
– Language- and speaker-independent.
– Only acoustic information.
– No stereotyped speech.
New view of automatic emotion recognition:
– Need to take into account at least the second emotion dimension.
– Relation of the emotional dimensions to different types of features.
– Application: recognition of regions in the emotional space.

Approach
Database:
– Target scenario: Sony entertainment robot AIBO, "One day with AIBO".
– How to provoke emotions? Context action; automatic labeling: happy, bored, sad, angry and neutral.
– Data: 14 speakers, ~40 commands per emotion.
Feature calculation:
– Prosody features: logarithmic F0 & derivative, energy, durational aspects, jitter & tremor.
– Quality features: formants, harmonic-to-noise ratio, spectral energy distribution, voiced-to-unvoiced energy ratio, glottal flow.
Classification: sequential classifiers (see the sketch below).
– First classifier: arousal dimension, prosody features. High (= happy/angry), medium (= neutral), low (= sad/bored).
– Second classifier: pleasure dimension, quality features.
– Final decision.
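Below is a minimal sketch of this sequential scheme, assuming per-utterance feature matrices and using scikit-learn MLP classifiers as stand-ins (the slides do not specify the classifier type): the first stage predicts the arousal group from prosody features, and for the high and low groups a second, quality-feature classifier separates the emotions along the pleasure axis.

```python
# Sketch of the two-stage (sequential) classification described above.
# X_prosody / X_quality are per-utterance feature matrices (numpy arrays),
# y_emotion holds labels in {"happy", "angry", "neutral", "bored", "sad"}.
# The MLP classifiers and their sizes are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPClassifier

AROUSAL = {"happy": "high", "angry": "high", "neutral": "medium",
           "bored": "low", "sad": "low"}

def train(X_prosody, X_quality, y_emotion):
    y_emotion = np.asarray(y_emotion)
    y_arousal = np.array([AROUSAL[e] for e in y_emotion])
    # Stage 1: arousal group from prosody features.
    stage1 = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000)
    stage1.fit(X_prosody, y_arousal)
    # Stage 2: one pleasure classifier per non-neutral arousal group.
    stage2 = {}
    for level in ("high", "low"):
        mask = y_arousal == level
        clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000)
        clf.fit(X_quality[mask], y_emotion[mask])
        stage2[level] = clf
    return stage1, stage2

def predict(stage1, stage2, x_prosody, x_quality):
    level = stage1.predict([x_prosody])[0]
    if level == "medium":
        return "neutral"                     # medium arousal maps to neutral
    return stage2[level].predict([x_quality])[0]
```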

SPEAKER DEPENDENT - AROUSAL
Speaker-dependent discrimination along the arousal axis using prosody features. Emotions are grouped according to their position on the axis:
– High level: happy + angry
– Medium level: neutral
– Low level: bored + sad

SPEAKER DEPENDENT - AROUSAL
Average recognition rate: 84%.
No confusion along the arousal dimension; confusion occurs only with the neutral emotion, due to:
– Its intermediate position.
– Database properties.

SPEAKER DEPENDENT - PLEASURE
Speaker-dependent happy-angry and sad-bored classification (pleasure dimension).
– Discrimination between happy and angry: 74%
– Discrimination between bored and sad: 66%
Happy and angry lie further apart on the pleasure axis than sad and bored.

SPEAKER INDEPENDENT - AROUSAL
Average recognition rate: 59.3%.
Neutral recognition rate close to chance; "real" neutral samples are needed.

SPEAKER INDEPENDENT - AROUSAL
Training with new neutrals:
– Original test set (emotional neutrals): 61%
– New test set (new neutrals): 77%

SPEAKER INDEPENDENT - PLEASURE
Average recognition rate: ~60%.
Quality features are very speaker-dependent.
Discrimination between happy and angry is better than between bored and sad.

Conclusions
Prosody features capture the arousal dimension, but are not enough on their own.
Quality features relate to the pleasure dimension; further research is needed.
Application: finding a place in the emotional space + additional information = emotional state.
"Pure" neutral is very ambiguous; in general, emotional expression is very contingent upon the environment.
An appropriate emotional database is crucial.

Future Research
Speaker-independent voice quality features:
– Improvement of the estimation reliability.
– Different features in different vowels.
Pleasure dimension:
– Quality features, but also some prosody features.
– Classification design: speaker dependencies ("speaker identification"), specific models (age, gender, …), feature selection.
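As a possible starting point for the feature-selection direction, one simple option is to rank the pooled prosody and quality features by how well they separate the emotion classes. The sketch below uses a univariate ANOVA F-test from scikit-learn; the variable names and the choice of k are illustrative assumptions, not part of the original plan.

```python
# Possible feature-selection sketch: rank pooled prosody + quality features
# by a univariate ANOVA F-test and keep the k best. Names and k are
# illustrative assumptions.
from sklearn.feature_selection import SelectKBest, f_classif

def select_features(X, y, feature_names, k=10):
    selector = SelectKBest(score_func=f_classif, k=k)
    selector.fit(X, y)
    kept = selector.get_support()            # boolean mask of retained columns
    return [name for name, keep in zip(feature_names, kept) if keep]
```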