1
Predicting Voice Elicited Emotions
Nishant Pandey
2
Synopsis
- Problem statement and motivation
- Previous work and background
- System: intuition and overview
- Pre-processing of audio signals
- Building feature space
- Finding patterns in unlabelled data and labelling of samples
- Regression
- Results
- Deployed system
- Market research
3
Problem Statement and Motivation
Problem statement: to analyse a voice and predict the listener emotions elicited by its paralinguistic elements.
Motivation:
- Automate the screening process in service-based industries for hourly job workers (two-thirds of the U.S. labour force, or ~50 million job seekers every year).
- Paralinguistic feature, tone: "how things are said is just as important as what is being said".
- This can be used in service-based industries, where a customer's emotional response to a worker's voice may affect the service outcome.
4
Previous work
Previous work pursues two sets of goals, recognizing:
- the type of personality traits intrinsically possessed by the speaker, e.g. speaker trait and speaker state
- the types of emotions carried within the speech clip, e.g. acoustic affect (cheerful, trustworthy, deceitful, etc.)
Examples:
- Traits possessed by the speaker: age, gender, pronunciation, fluency, personality
- Speaker state: affection, interest, stress
- Emotions carried within the speech: sounds pleasant, cheerful, trustworthy
The current work focuses on predicting the emotions elicited by voice clips.
5
Background – Emotion Taxonomy
The framework articulated by "FEELTRACE":
- Includes all the emotion responses we want to predict.
- Describes emotions by a finite number of quantifiable dimensions.
- Finite dimensions such as active-passive and positive-negative help us map emotion characteristics to known paralinguistic voice features and measure them with those features.
6
Features - Paralinguistic features of Voices
Each concept is listed with its definition and its data representation:
- Amplitude: measurement of the variations over time of the acoustic signal. Data representation: quantified values of a sound wave's oscillation.
- Energy: acoustic signal energy representation in decibels. Data representation: 20*log10(abs(FFT)).
- Formants: the resonance frequencies of the vocal tract. Data representation: maxima detected using Linear Prediction on audio windows with high tonal content.
- Perceived pitch: perceived fundamental frequency and harmonics.
- Fundamental frequency: the reciprocal of the time duration of one glottal cycle, a strict definition of "pitch". Data representation: first formant.
Generally, frequency and energy variation are the major cues used to analyse emotions in voice samples.
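The decibel representation 20*log10(abs(FFT)) given for energy can be sketched in a few lines. This is a minimal illustration rather than the presenter's implementation: the function name `energy_db`, the small `eps` offset, and the 440 Hz test tone are all assumptions made for the example.

```python
import numpy as np

def energy_db(frame, eps=1e-12):
    """Return 20*log10(abs(FFT)) for one real-valued audio frame."""
    spectrum = np.fft.rfft(frame)      # one-sided FFT of a real signal
    magnitude = np.abs(spectrum)
    # eps avoids log10(0) in empty bins; it is an assumption of this sketch
    return 20.0 * np.log10(magnitude + eps)

# Hypothetical test signal: a 440 Hz tone sampled at 8 kHz
sample_rate = 8000
t = np.arange(800) / sample_rate
frame = np.sin(2 * np.pi * 440 * t)

db = energy_db(frame)
freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
peak_hz = freqs[np.argmax(db)]         # frequency bin with the most energy
```

For the synthetic tone, the bin with the highest energy is the tone's own frequency, which is a quick sanity check on the representation.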
7
System – Intuition
Everyday experience is that we can listen to speech and tell the emotions it elicits, as the following example shows. The energy level differs between the two clips: one clip would make a listener less engaged, the other more engaged.
Figure: Spectrograms of two job applicants responding to "Greet me as if I am a customer".
8
System – Overview
- Record and sample raw voice clips.
- Extract audio features that represent voice cues.
- Construct a data feature space suitable for data mining and machine learning algorithms.
- Build models using supervised and unsupervised learning.
- Engineer scalable data processing pipelines that process the data and generate prediction scores.
9
System – Pre-Processing of Audio Signals
Pre-processing tasks involve:
- Removing voice clips shorter than 2 seconds and clips containing noise.
- Converting the audio signal to data in the time and frequency domains:
  - Short-term Fast Fourier Transform per frame
  - Energy measures in the frequency domain per frame
  - Linear prediction coefficients in the frequency domain per frame
Read more about FFT, energy measures, and LPC.
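The length filter and the per-frame short-term FFT can be sketched as below. The frame length, hop size, and Hann window are assumptions of this example (the slides do not state them), the helper names are hypothetical, and noise removal and linear prediction are not covered.

```python
import numpy as np

MIN_SECONDS = 2.0  # clips shorter than this are discarded

def is_long_enough(signal, sample_rate):
    """True if the clip lasts at least MIN_SECONDS."""
    return len(signal) / sample_rate >= MIN_SECONDS

def frame_signal(signal, frame_len, hop):
    """Split a 1-D NumPy signal into overlapping frames, one per row."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    starts = hop * np.arange(n_frames)
    index = starts[:, None] + np.arange(frame_len)[None, :]
    return signal[index]

def short_term_energy_db(signal, frame_len=256, hop=128, eps=1e-12):
    """Per-frame spectrum in decibels: 20*log10(abs(FFT))."""
    # Hann window is an assumed choice to reduce spectral leakage
    frames = frame_signal(signal, frame_len, hop) * np.hanning(frame_len)
    spectrum = np.fft.rfft(frames, axis=1)
    return 20.0 * np.log10(np.abs(spectrum) + eps)

# Hypothetical 3-second clip: a 220 Hz tone sampled at 8 kHz
sample_rate = 8000
clip = np.sin(2 * np.pi * 220 * np.arange(3 * sample_rate) / sample_rate)
if is_long_enough(clip, sample_rate):
    spec_db = short_term_energy_db(clip)  # rows: frames, columns: frequency bins
```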
10
System - Feature Space Construction
We experimented with feature construction based on the following dimensions and their combinations:
- Signal measurements such as energy and amplitude.
- Statistics such as min, max, mean, and standard deviation of the signal measurements.
- Measurement window in the time domain: different window sizes and the entire time window.
- Measurement window in the frequency domain: all frequencies, optimal audible frequencies, and selected frequency ranges.
Based on existing research, we focused on voice energy features and constructed the feature space using statistical measures of energy attributes.
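One way to turn per-frame energy into statistical features over a chosen frequency window is sketched here. The function name, the tiny input matrix, and the 300-3400 Hz band are illustrative assumptions; the slides do not say which ranges or frame-level summaries were used.

```python
import numpy as np

def energy_features(spec_db, freqs, band=None):
    """Min, max, mean and standard deviation of per-frame energy.

    spec_db: array of shape (frames, bins) with energy in decibels.
    freqs:   centre frequency of each bin in Hz.
    band:    optional (low_hz, high_hz) range; None means all frequencies.
    """
    if band is not None:
        low_hz, high_hz = band
        mask = (freqs >= low_hz) & (freqs <= high_hz)
        spec_db = spec_db[:, mask]
    per_frame = spec_db.mean(axis=1)   # one energy value per frame
    return {
        "min": float(per_frame.min()),
        "max": float(per_frame.max()),
        "mean": float(per_frame.mean()),
        "std": float(per_frame.std()),
    }

# Hypothetical 2-frame, 3-bin spectrum used only to show the call
spec_db = np.array([[0.0, 10.0, 50.0],
                    [10.0, 20.0, 60.0]])
freqs = np.array([100.0, 1000.0, 5000.0])

all_freqs = energy_features(spec_db, freqs)
selected = energy_features(spec_db, freqs, band=(300.0, 3400.0))
```

Concatenating such dictionaries for several bands and time windows would give one feature vector per clip.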
11
System – Labels and Right set of Features?
- Conventional approach: getting voice samples rated by experts.
- Unsupervised learning: analyse features and their effectiveness.
Process:
- Unsupervised learning is used to find patterns in unlabelled data.
- Training data sets are then constructed from the clustering results and manual labelling.
- We analysed feature selection against clustering algorithms to determine the effectiveness of the features.
12
System – How do we get the labels? Contd.
Clustering on paralinguistic features of voice:
- Experimented with clustering algorithms and distance metrics.
- Results evaluated on cluster quality measurements (compactness, good separation, connectedness, and stability).
- Manual validation (aural inspection) of the samples within a cluster: are they meaningful or not?
Parameters:
- Cost functions: Connectivity, Dunn Index, Silhouette
Clustering results:
- Technique: Hierarchical Clustering
- Number of clusters: 5
- Manual validation of the clusters was also done.
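Of the cluster quality measures named above, the Dunn index is the simplest to write out: the smallest distance between points in different clusters divided by the largest distance between points in the same cluster. The sketch below assumes Euclidean distance and at least one cluster with two or more points; the function name and the four sample points are made up for illustration.

```python
import numpy as np

def dunn_index(points, labels):
    """Dunn index: min between-cluster distance / max within-cluster diameter.

    Higher values indicate compact, well-separated clusters.
    Assumes Euclidean distance and a non-zero largest diameter.
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    same_cluster = labels[:, None] == labels[None, :]
    max_diameter = dists[same_cluster].max()
    min_separation = dists[~same_cluster].min()
    return min_separation / max_diameter

# Hypothetical points: two tight groups that are far apart
points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = dunn_index(points, [0, 0, 1, 1])   # groups match the geometry
bad = dunn_index(points, [0, 1, 0, 1])    # groups cut across the geometry
```

A labelling that follows the geometry scores much higher than one that cuts across it, which is the property used when comparing clustering algorithms and distance metrics.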
13
System – Visualization of clusters
14
System – Modelling
- Supervised learning algorithms: Logistic Regression, Support Vector Machine, Random Forest
- Semi-supervised learning algorithm: KODAMA
- Output: binary outcome (positive or negative); numerical scores
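To show how one model can yield both a numerical score and a binary outcome, here is a minimal logistic regression fitted by gradient descent. It is a sketch under stated assumptions, not the deployed model: the learning rate, iteration count, 0.5 threshold, and the one-feature toy data are all hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=1000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)
        grad_w = X.T @ (p - y) / len(y)
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict_scores(X, w, b):
    """Numerical score in [0, 1]: estimated probability of a positive response."""
    return sigmoid(np.asarray(X, dtype=float) @ w + b)

# Hypothetical one-feature data: larger values mean a positive response
X = [[-3.0], [-2.0], [-1.0], [1.0], [2.0], [3.0]]
y = [0, 0, 0, 1, 1, 1]

w, b = fit_logistic(X, y)
scores = predict_scores(X, w, b)          # numerical scores
outcomes = (scores >= 0.5).astype(int)    # binary outcome, assumed 0.5 threshold
```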
15
Case Study – Modelling Prediction – Positive vs Negative Response
A positive response could be one or more of the perceptions "pleasant voice", "makes me feel good", "cares about me", "makes me feel comfortable", or "makes me feel engaged".
- System V1 uses SVM; V2 uses Random Forest.
- Interview prompt: "Greet me as if I am a customer".
- Given a voice clip, our model predicts the degree to which a listener will find the voice "engaging".
16
System - Prediction Results
Accuracy: 0.86
95% CI: (0.76, 0.92)
P-Value [Acc > NIR]: 5.76e-07 (NIR is the no-information rate)
Sensitivity: 0.81
Specificity: 0.88
Pos Pred Value: 0.81
Neg Pred Value: 0.88
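For readers unfamiliar with these metrics, the sketch below shows how each one follows from the four cells of a binary confusion matrix. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the counts behind the results reported above.

```python
def classification_metrics(tp, fn, fp, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),      # true positive rate
        "specificity": tn / (tn + fp),      # true negative rate
        "pos_pred_value": tp / (tp + fp),   # precision
        "neg_pred_value": tn / (tn + fn),
    }

# Hypothetical counts, not taken from the presentation
metrics = classification_metrics(tp=40, fn=10, fp=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```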
17
System - Prediction Results (KODAMA)
KODAMA performs feature extraction from noisy and high-dimensional data. Its output includes a dissimilarity matrix, from which we can perform clustering and classification.
18
Deployed System
19
Market Research: Demographics Matter
- Young listeners (18-29 years old) and listeners with an income below $29,000/year apply stricter criteria for what they perceive as engaging.
- No correlation between the emotion elicited and age, ethnicity, or education level.
- Bias towards female voices.
20
Thanks
21
Time and Frequency Domain
Time Domain: [figure]
Frequency Domain: [figure: transform_time_and_frequency_domains_(small).gif]
22
Learnings – Difference in Voice Characteristics
Results improve by 10% when a decision tree built on features related to voice characteristics is layered on top of the Random Forest.
23
Prediction Results – SVM vs Random Forest