Occasion: HUMAINE / WP4 / Workshop "From Signals to Signs of Emotion and Vice Versa", Santorini / Fira, 18th–22nd September 2004
Talk: Ronald Müller
Speech Emotion Recognition Combining Acoustic and Semantic Analyses
Institute for Human-Machine Communication, Technische Universität München
Slide -2- Outline
- System Overview
- Emotional Speech Corpus
- Acoustic Analysis
- Semantic Analysis
- Stream Fusion
- Results
Slide -3- System Overview
Speech signal → Prosodic features → Classifier (SVM)
Speech signal → ASR unit → Semantic interpretation (Bayesian Networks)
Both confidence streams → Stream fusion (MLP) → Emotion
Slide -4- Emotional Speech Corpus
Emotion set: anger, disgust, fear, joy, neutrality, sadness, surprise
Corpus 1: Practical course
- 404 acted samples per emotion
- 13 speakers (1 female)
- Recorded within one year
Corpus 2: Driving simulator
- 500 spontaneous emotion samples
- 200 acted samples (disgust, sadness)
Slide -5- System Overview
Speech signal → Prosodic features → Classifier (SVM)
Speech signal → ASR unit → Semantic interpretation (Bayesian Networks)
Both confidence streams → Stream fusion (MLP) → Emotion
Slide -6- Acoustic Analysis
Low-level features (from the signal):
- Pitch contour (AMDF, low-pass filtering)
- Energy contour
- Spectrum
High-level features:
- Statistical analysis of the contours
- Elimination of mean, normalization to standard deviation
- Duration of one utterance: 1–5 seconds
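The pitch contour is the backbone of these features. Below is a minimal Python sketch of AMDF-based pitch extraction followed by the mean elimination and standard-deviation normalization named on the slide; the low-pass pre-filtering and voiced/unvoiced handling are omitted, and the function names and parameters (fmin, fmax) are illustrative assumptions, not the original implementation.

```python
import numpy as np

def amdf_pitch(frame, sr, fmin=60.0, fmax=400.0):
    # Average Magnitude Difference Function: D(k) dips at lags matching
    # the pitch period; the deepest dip in the plausible range wins.
    # Frames are assumed longer than the longest lag considered.
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    n = len(frame)
    d = np.array([np.abs(frame[:n - k] - frame[k:]).mean()
                  for k in range(lag_min, lag_max)])
    return sr / (lag_min + int(np.argmin(d)))    # best lag -> Hz

def normalized_pitch_contour(frames, sr):
    # Frame-wise pitch contour, then elimination of the mean and
    # normalization to standard deviation, as named on the slide.
    contour = np.array([amdf_pitch(f, sr) for f in frames])
    return (contour - contour.mean()) / contour.std()
```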
Slide -7- Acoustic Analysis: Feature Selection (1/2)
Initial set of 200 statistical features
Ranking 1: Single performance of each feature (nearest-mean classifier)
Ranking 2: Sequential Forward Floating Search, wrapped by a nearest-mean classifier
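As a concrete illustration of Ranking 2, here is a compact sketch of Sequential Forward Floating Search wrapped around a nearest-mean classifier. The held-out split used as criterion, the synthetic demo data, and all names are assumptions made for illustration, not the authors' code.

```python
import numpy as np

def nearest_mean_accuracy(X_tr, y_tr, X_te, y_te):
    # Wrapper criterion: accuracy of a nearest-class-mean classifier.
    classes = np.unique(y_tr)
    means = np.stack([X_tr[y_tr == c].mean(axis=0) for c in classes])
    dist = ((X_te[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return float(np.mean(classes[dist.argmin(axis=1)] == y_te))

def sffs(X_tr, y_tr, X_te, y_te, target_size):
    def score(feats):
        return nearest_mean_accuracy(X_tr[:, feats], y_tr, X_te[:, feats], y_te)

    selected, best = [], {}          # best score seen per subset size
    while len(selected) < target_size:
        # Forward step: include the single most helpful remaining feature.
        pool = [f for f in range(X_tr.shape[1]) if f not in selected]
        gains = [score(selected + [f]) for f in pool]
        selected.append(pool[int(np.argmax(gains))])
        k = len(selected)
        best[k] = max(best.get(k, -1.0), max(gains))
        # Floating step: drop a feature again while that strictly
        # improves the best score known for the smaller subset size.
        while len(selected) > 2:
            trials = [score([f for f in selected if f != g]) for g in selected]
            i, s = int(np.argmax(trials)), max(trials)
            if s > best.get(len(selected) - 1, -1.0):
                selected.pop(i)
                best[len(selected)] = s
            else:
                break
    return selected

# Toy demo: feature 3 is made informative, so SFFS should pick it up.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(300, 20)), rng.integers(0, 7, size=300)
X[:, 3] += y
print(sffs(X[:200], y[:200], X[200:], y[200:], target_size=5))
```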
Slide -8- Acoustic Analysis: Feature Selection (2/2)
Top 10 features:

Acoustic Feature                                                SFFS Rank   Single Perf.
Pitch, maximum gradient                                             1          31.5
Pitch, standard deviation of distance between reversal points      2          23.0
Pitch, mean value                                                   3          25.6
Signal, number of zero-crossings                                    4          16.9
Pitch, standard deviation                                           5          27.6
Duration of silences, mean value                                    6          17.5
Duration of voiced sounds, mean value                               7          18.5
Energy, median of fall-time                                         8          17.8
Energy, mean distance between reversal points                       9          19.0
Energy, mean of rise-time                                          10          17.6
Slide -9- Acoustic Analysis: Classification
Evaluation of various classification methods, 33 features

Classifier   Error % (speaker indep.)   Error % (speaker dep.)
kMeans               57.05                     27.38
kNN                  30.41                     17.39
GMM                  25.17                     10.88
MLP                  26.86                      9.36
SVM                  23.88                      7.05
ML-SVM               18.71                      9.05

Output: vector of (pseudo-)recognition confidences
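A modern equivalent of the single-layer SVM setup might look as follows. This is scikit-learn, not the original toolchain: the synthetic data stands in for the 33 selected prosodic features, and Platt-scaled probabilities are one possible realization of the pseudo-confidences the slide mentions.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

EMOTIONS = ["ang", "dis", "fea", "joy", "ntl", "sad", "sur"]

# Stand-in data: 33 selected prosodic features per utterance.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(280, 33)), rng.integers(0, 7, size=280)
X_test = rng.normal(size=(20, 33))

# probability=True makes the SVM emit Platt-scaled class probabilities,
# one way to obtain a (pseudo-)recognition-confidence vector.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
clf.fit(X_train, y_train)
conf = clf.predict_proba(X_test)      # shape (20, 7), forwarded to fusion
print(dict(zip(EMOTIONS, conf[0].round(3))))
```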
Slide -10- Acoustic Analysis: Classification
Multi-Layer Support Vector Machines: a cascade of binary SVMs on the acoustic feature vector
- Root split: ang, ntl, fea, joy / dis, sur, sad
- Second level: ang, ntl / fea, joy and dis, sur / sad
- Third level: ang / ntl, fea / joy, dis / sur
- Leaves: ang, ntl, fea, joy, dis, sur, sad
Drawback: yields a hard decision only, so there is no confidence vector to forward to fusion
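The cascade can be sketched as a recursive tree of binary SVMs. The split hierarchy below is taken from the slide, while the code itself (names, kernel choice, synthetic data) is an illustrative reconstruction, not the original implementation.

```python
import numpy as np
from sklearn.svm import SVC

# The slide's split hierarchy as nested (left, right) pairs; a string is a leaf.
TREE = ((("ang", "ntl"), ("fea", "joy")),
        (("dis", "sur"), "sad"))

def leaves(node):
    return (node,) if isinstance(node, str) else leaves(node[0]) + leaves(node[1])

def train_tree(node, X, y):
    # One binary SVM per inner node, trained on the samples whose
    # labels fall into either of the node's two groups.
    if isinstance(node, str):
        return node
    left, right = node
    la = leaves(left)
    mask = np.isin(y, la + leaves(right))
    svm = SVC(kernel="rbf").fit(X[mask], np.isin(y[mask], la))
    return (svm, train_tree(left, X, y), train_tree(right, X, y))

def classify(trained, x):
    # Each SVM halves the candidate set until a single emotion remains,
    # so the cascade emits a hard label only: the reason the slide notes
    # that no confidence vector can be forwarded to fusion.
    if isinstance(trained, str):
        return trained
    svm, left, right = trained
    return classify(left if svm.predict(x[None, :])[0] else right, x)

# Toy demo on synthetic 33-dimensional feature vectors.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(140, 33)), rng.choice(np.array(leaves(TREE)), size=140)
print(classify(train_tree(TREE, X, y), X[0]))
```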
Slide -11- System Overview
Speech signal → Prosodic features → Classifier (SVM)
Speech signal → ASR unit → Semantic interpretation (Bayesian Networks)
Both confidence streams → Stream fusion (MLP) → Emotion
Slide -12- Semantic Analysis: ASR Unit
- HMM-based
- 1,300-word German vocabulary
- No language model
- 5-best phrase hypotheses
- Recognition confidences per word
Example output (first hypothesis):
  I      can't   stand   this    every   tray    traffic-jam
  69.3   34.6    72.1    20.0    36.1    15.9    55.8
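A simple container for this per-word output could look like the following sketch; the class name is invented for illustration, the confidences are the ones shown on the slide, and "every tray" is the recognizer's error for "nasty" (slide 15), kept deliberately as an example of erroneous ASR.

```python
from dataclasses import dataclass

@dataclass
class WordHyp:
    word: str
    confidence: float    # per-word recognition confidence, in percent

# The slide's first-best hypothesis as a list of word hypotheses.
hyp1 = [WordHyp(w, c) for w, c in zip(
    ["I", "can't", "stand", "this", "every", "tray", "traffic-jam"],
    [69.3, 34.6, 72.1, 20.0, 36.1, 15.9, 55.8])]
print(hyp1)
```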
Slide -13- Semantic Analysis: Conditions
- Natural language
- Erroneous speech recognition
- Uncertain knowledge
- Incomplete knowledge
- Superfluous knowledge
→ Probabilistic spotting approach using Bayesian Belief Networks
Slide -14- Semantic Analysis: Bayesian Belief Networks
- Acyclic graph of nodes and directed edges
- One state variable per node
- Node dependencies set via conditional probability matrices
- Initial probabilities set in the root nodes
- An observation causes evidence in a child node (i.e. its state is known)
- Inference propagates to the direct parent nodes and finally to the root nodes via Bayes' rule:
  P(A | B) = P(B | A) · P(A) / P(B)
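In the simplest two-node case the propagation step is just Bayes' rule applied once. The sketch below uses invented numbers (prior and conditional probabilities) purely to make the mechanics concrete; it is not the network from the talk.

```python
# Two-node network: root R = "utterance expresses disgust", child W =
# "a disgust keyword was spotted". Evidence on W is propagated back to
# R with Bayes' rule, as on the slide. All numbers are invented.
p_r = 1.0 / 7.0                 # initial root probability (7 emotions)
p_w_given_r = 0.60              # CPT entry: keyword spotted | disgust
p_w_given_not_r = 0.05          # CPT entry: keyword spotted | no disgust

p_w = p_w_given_r * p_r + p_w_given_not_r * (1.0 - p_r)
p_r_given_w = p_w_given_r * p_r / p_w            # Bayes' rule
print(f"P(disgust | keyword spotted) = {p_r_given_w:.2f}")   # -> 0.67
```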
Slide -15- Semantic Analysis: Emotion Modelling
[Diagram: a spotting hierarchy over four input levels (words, superwords, phrases, superphrases) connected by spotting, clustering and sequence-handling steps. Spotted words such as "I", "can't", "stand", "nasty" are clustered into superwords ("first_person", "cannot", "stand", "bad", "disgusting"); sequence handling forms phrases such as "I_hate" and "I_like"; these cluster into superphrases ("Abhorrence", "Bad", "Good"), which map to the interpretation level (Disgust, Anger, Joy, Negative, Positive).]
Example: "I can't stand this nasty traffic-jam", recognized as "I can't stand this every tray traffic-jam"
Output: vector of "real" recognition confidences
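The lowest level, spotting words into superwords, might be sketched as follows. The lexicon and the confidence handling here are invented for illustration and are far simpler than the Bayesian-network clustering the slide describes.

```python
# Hypothetical word -> superword lexicon, mirroring the slide's first
# clustering level; the real system uses Bayesian networks instead.
SUPERWORDS = {
    "I": "first_person",
    "cannot": "negation", "can't": "negation",
    "stand": "stand",
    "bad": "bad", "nasty": "bad", "disgusting": "bad",
}

def spot(hypothesis):
    # Probabilistic spotting: lexicon words survive with their ASR
    # confidences, superfluous words are simply ignored.
    return [(SUPERWORDS[w], c) for w, c in hypothesis if w in SUPERWORDS]

hyp = [("I", 69.3), ("can't", 34.6), ("stand", 72.1), ("this", 20.0),
       ("every", 36.1), ("tray", 15.9), ("traffic-jam", 55.8)]
print(spot(hyp))    # -> first_person, negation and stand are kept
```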
Slide -16- System Overview
Speech signal → Prosodic features → Classifier (SVM)
Speech signal → ASR unit → Semantic interpretation (Bayesian Networks)
Both confidence streams → Stream fusion (MLP) → Emotion
Slide -17- Stream Fusion
Variant 1: Pairwise mean of the two confidence vectors
Variant 2: Discriminative fusion applying an MLP
- Input layer: 2 x 7 confidences
- Hidden layer: 100 nodes
- Output layer: 7 recognition confidences
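Both fusion variants are easy to sketch. Below, scikit-learn stands in for the original MLP implementation, random vectors stand in for the two 7-dimensional confidence streams, and only the layer sizes are taken from the slide.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 300
acoustic = rng.random((n, 7))    # stand-in: SVM pseudo-confidences
semantic = rng.random((n, 7))    # stand-in: Bayesian-network confidences
y = rng.integers(0, 7, size=n)   # stand-in emotion labels

fused_in = np.hstack([acoustic, semantic])    # input layer: 2 x 7 confidences
mean_fused = (acoustic + semantic) / 2.0      # variant 1: fusion by means

# Variant 2, discriminative fusion: one hidden layer of 100 nodes,
# 7 recognition confidences at the output.
mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=1000, random_state=0)
mlp.fit(fused_in, y)
fused_conf = mlp.predict_proba(fused_in)
```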
Slide -18- Results
Acoustic recognition rates (SVM):
Emotion   ang    dis    fea    joy    ntl    sad    sur    Mean
%         95.5   61.3   78.7   75.1   78.5   62.1   68.3   74.2

Semantic recognition rates:
Emotion   ang    dis    fea    joy    ntl    sad    sur    Mean
%         78.4   71.2   53.4   57.7   56.0   35.0   65.5   59.6
Slide -19- Results
Recognition rates after discriminative fusion:
Emotion   ang    dis    fea    joy    ntl    sad    sur    Mean
%         98.0   78.7   88.3   95.9   98.2   91.7   95.8   92.0

Overview (mean recognition rates):
          Acoustic info.   Language info.   Fusion by means   Fusion by MLP
%              74.2             59.6             83.1              92.0
Slide -20- Summary
- Acted emotions, 7 discrete emotion categories
- Prosodic feature selection via single-feature performance and Sequential Forward Floating Search
- Evaluative comparison of different classifiers; SVMs outperform the alternatives
- Semantic analysis applying Bayesian Networks
- Significant gain by discriminative stream fusion
Slide -21-