Comparing feature sets for acted and spontaneous speech in view of automatic emotion recognition. Thurid Vogt, Elisabeth André. ICME 2005.

Presentation transcript:

Comparing feature sets for acted and spontaneous speech in view of automatic emotion recognition
Thurid Vogt, Elisabeth André
ICME 2005
Multimedia Concepts and Applications, Augsburg University, Germany / Applied Computer Science, Bielefeld University, Germany

Emotion Recognition System
[Diagram: Input → Feature extraction → Classification → Result; training data feeds the classifier]
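Read as a processing chain, the diagram amounts to a few lines of code. The following is a minimal sketch of that chain in Python; all function and parameter names are hypothetical (not from the authors' system), and a scikit-learn-style classifier with fit/predict is assumed.

```python
import numpy as np

def recognize_emotion(signal, sample_rate, extract_features, classifier):
    """The diagram as code: input -> feature extraction -> classification -> result."""
    features = extract_features(signal, sample_rate)  # e.g. pitch/energy/MFCC statistics
    return classifier.predict(features.reshape(1, -1))[0]

def train(segments, labels, sample_rate, extract_features, classifier):
    """Training data takes the same path: extract features, then fit the classifier."""
    X = np.vstack([extract_features(s, sample_rate) for s in segments])
    classifier.fit(X, labels)
    return classifier
```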

Research questions
1. Does a large number of features provided to the selection algorithm enable the selection of a better feature set?
2. Which analysis units can be calculated automatically in an online system and still give good results?
3. How do feature sets for acted and realistic data differ?

Overview
Feature extraction:
–Segment length
–Feature calculation
–Feature selection
Databases
Results
Conclusions

Feature extraction

Segment length
Features are computed over signal segments.
Difficulty:
–Features can be computed more accurately for long segments
–Emotions can be short and change quickly
Possible segments:
–Whole utterances
–Stretches delimited by larger pauses (pauses as segment borders)
–Words, syllables, words in context (1 or 2 words to the left and right)
–Fixed length, e.g. 0.5, 1 or 2 seconds (see the sketch below)
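Of the listed units, fixed-length segments are the only ones that need neither a transcription nor pause detection, which is what makes them attractive for an online system. A minimal segmentation sketch, assuming the signal is a 1-D NumPy array (names are illustrative):

```python
import numpy as np

def fixed_length_segments(signal, sample_rate, seconds=1.0):
    """Cut a signal into non-overlapping fixed-length segments; the tail is dropped."""
    step = int(seconds * sample_rate)
    n_segments = len(signal) // step
    return [signal[i * step:(i + 1) * step] for i in range(n_segments)]

# e.g. 1-second units of a 16 kHz recording:
# segments = fixed_length_segments(audio, 16000, seconds=1.0)
```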

Feature calculation
Features based on pitch and energy (+ 1st and 2nd derivatives) and on 12 MFCCs (+ 1st and 2nd derivatives)
Series considered: the basic values, only the minima or maxima, as well as distances, differences and slopes between adjacent extrema
Mean, minimum, maximum, ... computed over the time segments
Some others, such as normalised pitch and pauses
Following Oudeyer (2003)
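To make the inventory concrete, the sketch below computes a small subset of such statistics (mean/min/max/std over pitch, energy, 12 MFCCs and their derivatives) for one segment. It assumes librosa stands in for the authors' own extraction tools; the extrema-distance, slope, normalised-pitch and pause features of the full set are omitted.

```python
import numpy as np
import librosa

def segment_features(y, sr):
    """Statistics over pitch, energy and 12 MFCCs (+ derivatives) for one segment."""
    pitch = librosa.yin(y, fmin=60, fmax=400, sr=sr)        # F0 contour
    energy = librosa.feature.rms(y=y)[0]                    # frame energy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)      # shape (12, frames)
    series = [pitch, energy,
              np.gradient(pitch), np.gradient(energy)]      # 1st derivatives of pitch/energy
    series += list(mfcc)
    series += list(librosa.feature.delta(mfcc))             # 1st derivatives of MFCCs
    series += list(librosa.feature.delta(mfcc, order=2))    # 2nd derivatives of MFCCs
    # Four statistics per series; the paper's full set is far larger (1280 features).
    return np.array([f(s) for s in series
                     for f in (np.mean, np.min, np.max, np.std)])
```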

Feature selection
Correlation-based feature selection (CFS) from the Weka data mining software (University of Waikato, New Zealand; Witten & Frank, 2000)
Reduction from 1280 to ~ features
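Weka's CFS scores a feature subset by how strongly its members correlate with the class and how little they correlate with each other. A simplified stand-in under stated assumptions: Pearson correlation against a numeric label vector replaces Weka's symmetrical uncertainty, plain greedy forward search replaces Weka's best-first search, and non-constant feature columns are assumed.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit k*r_cf / sqrt(k + k(k-1)*r_ff) for a feature subset (Hall, 1999)."""
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def greedy_cfs(X, y, max_features=20):
    """Greedy forward search maximising the CFS merit."""
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < max_features:
        best_merit, best_j = max((cfs_merit(X, y, selected + [j]), j)
                                 for j in remaining)
        if selected and best_merit <= cfs_merit(X, y, selected):
            break  # no remaining feature improves the subset's merit
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

Run on a 1280-dimensional feature matrix, such a filter keeps only features that add class correlation without redundancy, which is the kind of reduction reported on this slide.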

Databases

Acted speech database
Database from TU Berlin for emotional speech synthesis (Sendlmeier, 2001):
–Recorded from actors
–High quality
–10 speakers; 20 min
–7 emotions

Spontaneous speech database
SmartKom database from U. of Munich (Steininger et al., 2002):
–Wizard-of-Oz scenario
–Mid quality
–~80 speakers; 3 h 20 min net; few emotions exhibited
–11 user states

Results

Which analysis units can be computed automatically in an online system and still give good results?

Does a large number of features provided to the selection algorithm yield a better feature set?

Does a large number of features provided to the selection algorithm yield a better feature set? (cont.)
–The reduced feature set is almost always better than the full feature set
–Our features perform comparably to Batliner et al. (2003) on the SmartKom data, but ours are computed completely automatically, while some of theirs were determined manually
–The selected features are not necessarily those one would expect

How do feature sets for acted and realistic data differ?
Important features for acted emotions:
–Basic pitch
–Pauses (for sadness)
Important features for WOZ emotions:
–MFCCs (mainly low coefficients and 1st derivatives)
–Extrema of pitch and energy

Conclusions
–Automatic segment extraction proved not to be a disadvantage
–A large feature set provided to the selection algorithm might compensate for the disadvantages of completely automatically computed features
–Feature sets for acted and WOZ emotions overlap little → looking at acted data when building an emotion recognizer for spontaneous emotions may not make sense