Occasion: HUMAINE / WP4 / Workshop "From Signals to Signs of Emotion and Vice Versa", Santorin / Fira, 18th – 22nd September, 2004
Talk: Ronald Müller, "Speech Emotion Recognition Combining Acoustic and Semantic Analyses"
Institute for Human-Machine Communication, Technische Universität München
Slide -2- Outline
  System Overview
  Emotional Speech Corpus
  Acoustic Analysis
  Semantic Analysis
  Stream Fusion
  Results
Slide -3- System Overview
  Acoustic stream: Speech signal -> Prosodic features -> Classifier (SVM)
  Semantic stream: Speech signal -> ASR unit -> Semantic interpretation (Bayesian Networks)
  Both streams -> Stream fusion (MLP) -> Emotion
Slide -4- Emotional Speech Corpus
  Emotion set: anger, disgust, fear, joy, neutrality, sadness, surprise
  Corpus 1: Practical course
    404 acted samples per emotion
    13 speakers (1 female)
    Recorded within one year
  Corpus 2: Driving simulator
    500 spontaneous emotion samples
    200 acted samples (disgust, sadness)
Slide -6- Acoustic Analysis
  Low-level features
    Pitch contour (AMDF, low-pass filtering)
    Energy contour
    Spectrum
    Signal
  High-level features
    Statistical analysis of contours
    Elimination of mean, normalization to standard deviation
    Duration of one utterance (1-5 seconds)
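The pitch and statistics steps named on this slide can be sketched as follows. This is a minimal illustration, not the system's implementation: the search range `f_min`/`f_max`, the function names, and the choice of statistics in `contour_statistics` are assumptions, and the low-pass filtering mentioned on the slide is omitted.

```python
import math


def amdf(frame, lag):
    """Average magnitude difference between the frame and itself shifted by `lag`."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n


def estimate_pitch(frame, sample_rate, f_min=80.0, f_max=400.0):
    """Pitch in Hz at the lag where the AMDF is smallest.

    The frame must be longer than sample_rate / f_min samples.
    The search range is an assumption, not taken from the slides.
    """
    lag_min = int(sample_rate / f_max)
    lag_max = int(sample_rate / f_min)
    best_lag = min(range(lag_min, lag_max + 1), key=lambda lag: amdf(frame, lag))
    return sample_rate / best_lag


def normalize(contour):
    """Remove the mean and scale to unit standard deviation."""
    mean = sum(contour) / len(contour)
    std = math.sqrt(sum((x - mean) ** 2 for x in contour) / len(contour))
    if std == 0:
        return [0.0] * len(contour)
    return [(x - mean) / std for x in contour]


def contour_statistics(contour):
    """A few utterance-level statistics of a contour (illustrative selection)."""
    mean = sum(contour) / len(contour)
    std = math.sqrt(sum((x - mean) ** 2 for x in contour) / len(contour))
    gradients = [b - a for a, b in zip(contour, contour[1:])]
    return {"mean": mean, "std": std, "max_gradient": max(gradients)}
```

A pitch contour would be obtained by calling `estimate_pitch` on successive frames of the utterance; `contour_statistics` then turns that contour into fixed-length features. A plain AMDF minimum search like this can pick a multiple of the true period, which is one reason to restrict the lag range.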
Slide -7- Acoustic Analysis: Feature selection (1/2)
  Initial set of 200 statistical features
  Ranking 1: Single performance of each feature (nearest-mean classifier)
  Ranking 2: Sequential Forward Floating Search wrapping by nearest-mean classifier
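The two rankings can be sketched as below, assuming a Euclidean nearest-mean classifier scored on the training samples themselves. The function names and the evaluation protocol are illustrative assumptions; the slides do not state how accuracy was measured. `sffs` expects `1 <= target_size <= number of features`.

```python
def nearest_mean_accuracy(samples, labels, feature_idx):
    """Accuracy of a nearest-mean classifier using only the given features.

    Scored on the training samples themselves to keep the sketch short.
    """
    classes = sorted(set(labels))
    means = {}
    for c in classes:
        rows = [s for s, l in zip(samples, labels) if l == c]
        means[c] = [sum(r[i] for r in rows) / len(rows) for i in feature_idx]
    correct = 0
    for s, l in zip(samples, labels):
        point = [s[i] for i in feature_idx]
        predicted = min(
            classes,
            key=lambda c: sum((p - m) ** 2 for p, m in zip(point, means[c])),
        )
        correct += predicted == l
    return correct / len(samples)


def rank_single(samples, labels):
    """Ranking 1: features ordered by their stand-alone accuracy, best first."""
    n_features = len(samples[0])
    return sorted(
        range(n_features),
        key=lambda i: nearest_mean_accuracy(samples, labels, [i]),
        reverse=True,
    )


def sffs(samples, labels, target_size):
    """Ranking 2: sequential forward floating search wrapped around the classifier."""
    def score(subset):
        return nearest_mean_accuracy(samples, labels, subset)

    n_features = len(samples[0])
    selected = []
    best = {}  # subset size -> (score, subset) of the best subset seen so far
    while len(selected) < target_size:
        # Forward step: add the feature that helps most.
        candidates = [i for i in range(n_features) if i not in selected]
        selected = selected + [max(candidates, key=lambda i: score(selected + [i]))]
        if score(selected) > best.get(len(selected), (-1.0, None))[0]:
            best[len(selected)] = (score(selected), selected)
        # Floating step: drop features while that beats the best smaller subset.
        while len(selected) > 2:
            reduced = max(
                ([j for j in selected if j != i] for i in selected),
                key=score,
            )
            if score(reduced) <= best[len(reduced)][0]:
                break
            best[len(reduced)] = (score(reduced), reduced)
            selected = reduced
    return best[target_size][1]
```

The floating step is what distinguishes SFFS from plain forward selection: a feature added early can be removed again once a better smaller subset is found.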
Slide -8- Acoustic Analysis: Feature selection (2/2)
  Top 10 features

  Acoustic Feature                                                SFFS-Rank  Single Perf.
  Pitch, maximum gradient                                         1          31.5
  Pitch, standard deviation of distance between reversal points   2          (missing)
  Pitch, mean value                                               3          25.6
  Signal, number of zero-crossings                                4          16.9
  Pitch, standard deviation                                       5          27.6
  Duration of silences, mean value                                6          17.5
  Duration of voiced sounds, mean value                           7          18.5
  Energy, median of fall-time                                     8          17.8
  Energy, mean distance between reversal points                   9          (missing)
  Energy, mean of rise-time                                       10         17.6
Slide -9- Acoustic Analysis: Classification
  Evaluation of various classification methods on 33 features
  Table: Error, % per classifier, speaker-independent vs. speaker-dependent (values missing)
  Classifiers: kMeans, kNN, GMM, MLP, SVM, ML-SVM
  Output: Vector of (pseudo-) recognition confidences
Slide -10- Acoustic Analysis: Classification with Multi-Layer Support Vector Machines
  Input: acoustic feature vector
  Layer 1: ang, ntl, fea, joy / dis, sur, sad
  Layer 2: ang, ntl / fea, joy; dis, sur / sad
  Layer 3: ang / ntl; fea / joy; dis / sur
  Leaves: ang, ntl, fea, joy, sad, dis, sur
  No confidence vector to forward to fusion
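The layered decision scheme can be sketched as a walk through a binary tree. The tree shape follows the slide; the `decide` callback is a hypothetical stand-in for the trained binary SVM at each node, and all names are illustrative.

```python
# Binary decision tree from the slide: each inner node is a (left, right) pair,
# each leaf is an emotion label.
TREE = (
    (("ang", "ntl"), ("fea", "joy")),
    (("dis", "sur"), "sad"),
)


def leaves(node):
    """All emotion labels below a node, left to right."""
    if isinstance(node, str):
        return [node]
    return leaves(node[0]) + leaves(node[1])


def classify(features, decide, node=TREE):
    """Follow binary decisions from the root down to a single emotion label.

    `decide(node, features)` stands in for the binary SVM at that node and
    returns True to take the right branch, False to take the left one.
    """
    while not isinstance(node, str):
        node = node[1] if decide(node, features) else node[0]
    return node
```

The walk ends in exactly one label and never scores the other six, which matches the slide's remark that this classifier yields no confidence vector for the fusion stage.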
Slide -12- Semantic Analysis: ASR-Unit
  HMM-based
  German vocabulary of 1300 words
  No language model
  5-best phrase hypotheses
  Recognition confidences per word
  Example output (first hypothesis): I can't stand this every tray traffic-jam
Slide -13- Semantic Analysis: Conditions
  Natural language
  Erroneous speech recognition
  Uncertain knowledge
  Incomplete knowledge
  Superfluous knowledge
  Probabilistic spotting approach
  Bayesian Belief Networks
Slide -14- Semantic Analysis: Bayesian Belief Networks
  Acyclic graph of nodes and directed edges
  One state variable per node (here [...] states, [...])
  Setting node dependencies via conditional probability matrices
  Setting initial probabilities in root nodes
  Observation A causes evidence in a child node (i.e. [...] is known)
  Inference to direct parent nodes and finally to root nodes
  Bayes' rule: [...]
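The inference step from an observed child node to its direct parent can be sketched with Bayes' rule in its general form, P(X | A) = P(A | X) P(X) / P(A). The formula on the slide did not survive, so the symbols, the two-state example, and all probabilities below are assumptions for illustration only.

```python
def posterior(prior, likelihood, observed):
    """Posterior over a parent node's states after observing its child.

    prior:      {parent_state: P(parent_state)}
    likelihood: {parent_state: {child_state: P(child_state | parent_state)}}
    observed:   the child state that is known
    """
    joint = {s: prior[s] * likelihood[s][observed] for s in prior}
    evidence = sum(joint.values())  # P(observed), the normalizing constant
    return {s: joint[s] / evidence for s in joint}


# Hypothetical numbers: a parent concept and a word the recognizer may report.
prior = {"present": 0.2, "absent": 0.8}
likelihood = {
    "present": {"seen": 0.9, "unseen": 0.1},
    "absent": {"seen": 0.3, "unseen": 0.7},
}
belief = posterior(prior, likelihood, "seen")
```

Applying the same update layer by layer moves the evidence from observed words up to the root nodes.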
Slide -15- Semantic Analysis: Emotion modelling
  [Diagram: layered network from input to interpretation]
  Levels: Input level, Words, Superwords, Phrases, Superphrases, Interpretation
  Processing steps shown: Spotting, Clustering, Sequence Handling
  Example input: I can't stand this nasty every tray traffic-jam
  Node labels shown: I, can't, stand, nasty, first_person, cannot, bad, disgusting, I_hate, I_like, ..., Bad, Good, Abhorrence, Negative, Positive, Disgust, Anger, Joy
  Output: Vector of "real" recognition confidences
Slide -17- Stream Fusion
  Pairwise mean
  Discriminative fusion applying MLP
    Input layer: 2 x 7 confidences
    Hidden layer: 100 nodes
    Output layer: 7 recognition confidences
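Both fusion variants can be sketched as follows. The layer sizes (14 inputs, 100 hidden nodes, 7 outputs) come from the slide; the tanh hidden activation, the softmax output, the random untrained weights, and all names are assumptions, since the slides do not specify them.

```python
import math
import random

EMOTIONS = ["ang", "dis", "fea", "joy", "ntl", "sad", "sur"]


def pairwise_mean(acoustic, semantic):
    """Baseline fusion: average the two confidences for each emotion."""
    return [(a + s) / 2 for a, s in zip(acoustic, semantic)]


def init_mlp(n_in=14, n_hidden=100, n_out=7, seed=0):
    """Random, untrained weights with the layer sizes given on the slide."""
    rng = random.Random(seed)

    def layer(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "w1": layer(n_hidden, n_in),
        "b1": [0.0] * n_hidden,
        "w2": layer(n_out, n_hidden),
        "b2": [0.0] * n_out,
    }


def mlp_forward(params, acoustic, semantic):
    """Forward pass: concatenated 2 x 7 confidences in, 7 confidences out."""
    x = list(acoustic) + list(semantic)
    hidden = [
        math.tanh(sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(params["w1"], params["b1"])
    ]
    logits = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(params["w2"], params["b2"])
    ]
    peak = max(logits)  # subtract the maximum for a numerically stable softmax
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The mean treats both streams as equally reliable for every emotion, whereas a trained MLP can learn to weight each stream differently per emotion. Training is omitted from this sketch.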
Slide -18- Results
  Acoustic recognition rates (SVM), in %: table over ang, dis, fea, joy, ntl, sad, sur, Mean (values missing)
  Semantic recognition rates, in %: table over ang, dis, fea, joy, ntl, sad, sur, Mean (values missing)
Slide -19- Results
  Recognition rates after discriminative fusion, in %: table over ang, dis, fea, joy, ntl, sad, sur, Mean (values missing)
  Overview, in %: Acoustic Information, Language Information, Fusion by means, Fusion by MLP (values missing)
Slide -20- Summary
  Acted emotions
  7 discrete emotion categories
  Prosodic feature selection via
    Single feature performance
    Sequential forward floating search
  Evaluative comparison of different classifiers
    SVMs perform best
  Semantic analysis applying Bayesian Networks
  Significant gain by discriminative stream fusion