A Generative Audio-Visual Prosodic Model for Virtual Actors. Modelling Virtual Humans. Cas Laugs (4140613), Max Meijers (6493769)
Authors: Adela Barbulescu and Rémi Ronfard from Inria, the French national research institute for digital sciences, and Gérard Bailly from GIPSA-lab, University of Grenoble-Alpes.
Overview and Motivation. Published in IEEE Computer Graphics and Applications, December 2017. Cited only once so far.
Overview and Motivation. Expressing complex mental states during animated conversation is hard: it involves voice prosody (rhythm, tone), facial animation, and gaze. Note that this is NOT a speech-to-motion method!
Problem statement. Create a method for generating natural speech and facial animation that expresses various attitudes, based on a neutral input performance (speech and animation). Note the distinction between attitude and emotion: roughly, voluntary versus involuntary expression.
Previous work. Three types of expressive facial animation: text-driven, speech-driven, and expressive conversion; plus the SFC model.
Previous work - Types. Text-driven: uses Text-To-Speech to obtain the speech (audio) and phoneme durations that drive the joints, with a rule-based approach for attitude. Speech-driven: generates face motion from the speech, again combined with TTS, rule-based attitude, and semantic information.
Previous work - Types. Expressive conversion: a mapping function into an emotional space for speech, and a statistical mapping of motion-capture data for animation.
Previous work - SFC. Superposition of Functional Contours (SFC), by Gérard Bailly (2 citations). Generates the prosody of an utterance based on linguistic and non-linguistic information. SFC generates intonation, tone, and rhythm.
Approach. Train the model on a variety of dramatic performances. Input: a neutral performance of a sentence (video + audio) plus an attitude label. Output: an animated 3D character whose speech and animation match the attitude label. The method has three main steps.
A Visual Example. In this video, the attitude labels are shown above the characters.
Method. The model covers 10 attitudes from Simon Baron-Cohen's Mind Reading project, captured by 'semi-professional' actors; the full corpus contains 412 attitudes. More detail on their method follows.
Method. Extend the existing SFC model to cover speech and facial animation: neural networks operating on prosody features, both acoustic and visual, with supporting virtual syllables.
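As a rough illustration of the kind of per-syllable regressor this implies, here is a minimal sketch, assuming a small feed-forward network that maps a one-hot attitude label plus neutral per-syllable features to expressive features; the framework (PyTorch), layer sizes, and feature count are our assumptions, not the paper's reported architecture.

```python
# Minimal sketch (assumed architecture, not the paper's exact network):
# map a one-hot attitude label + neutral per-syllable prosody features
# to the corresponding expressive per-syllable features.
import torch
import torch.nn as nn

N_ATTITUDES = 10   # attitudes modelled in the paper
N_FEATURES = 12    # hypothetical per-syllable feature count (pitch/rhythm/energy/motion samples)

class ProsodyRegressor(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_ATTITUDES + N_FEATURES, hidden),
            nn.Tanh(),
            nn.Linear(hidden, N_FEATURES),
        )

    def forward(self, attitude_onehot, neutral_feats):
        x = torch.cat([attitude_onehot, neutral_feats], dim=-1)
        return self.net(x)  # expressive per-syllable features

model = ProsodyRegressor()
attitude = torch.zeros(1, N_ATTITUDES)
attitude[0, 3] = 1.0                      # e.g. the 'fascinated' attitude
neutral = torch.randn(1, N_FEATURES)      # stand-in neutral features
expressive = model(attitude, neutral)
```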
Method - Acoustic. Acoustic features: voice pitch contours, rhythm, and energy values, extracted with Praat.
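For intuition, a minimal sketch of extracting comparable pitch and energy (intensity) tracks with the Parselmouth Python wrapper around Praat; the paper uses Praat itself, so the wrapper and the file name here are our assumptions.

```python
# Illustrative feature extraction with Parselmouth (a Python wrapper around Praat).
# The paper uses Praat directly; the wrapper and file name are assumptions.
import parselmouth

snd = parselmouth.Sound("utterance.wav")   # hypothetical input recording

pitch = snd.to_pitch()                     # F0 contour (Hz), 0 where unvoiced
f0 = pitch.selected_array["frequency"]

intensity = snd.to_intensity()             # energy contour (dB)
energy = intensity.values[0]

print(f0.shape, energy.shape)
```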
Method - Visual. ...and visual features: verbal motion and non-verbal motion, combined linearly by adding them to the 'neutral' motion. The face is split into upper and lower parts, with 19 and 29 blendshapes respectively. Verbal motion: the mouth shapes made to articulate words. Non-verbal motion: gestures, facial expressions, etc.
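A minimal sketch of that linear combination, assuming blendshape weights stored as NumPy vectors; the 19/29 split follows the slide, while the function name, clamping, and random stand-in values are purely illustrative.

```python
# Illustrative linear combination of blendshape weights (assumed representation):
# expressive = neutral + verbal delta + non-verbal delta, per face half.
import numpy as np

N_UPPER, N_LOWER = 19, 29   # blendshape counts per face half, as on the slide

def combine(neutral, verbal_delta, nonverbal_delta):
    """Add verbal and non-verbal offsets to the neutral pose, clamped to [0, 1]."""
    return np.clip(neutral + verbal_delta + nonverbal_delta, 0.0, 1.0)

neutral_lower = np.zeros(N_LOWER)
lower = combine(neutral_lower,
                verbal_delta=np.random.uniform(0, 0.3, N_LOWER),      # articulation
                nonverbal_delta=np.random.uniform(0, 0.2, N_LOWER))   # expression
```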
Method - Virtual syllables. Supporting virtual syllables of 250 ms are added before and after each utterance; they indicate 'talking turns' through non-verbal gestures.
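A minimal sketch of what that padding could look like, assuming syllables are represented as (start, end) times in seconds; the data layout is our assumption, only the 250 ms duration comes from the paper.

```python
# Illustrative padding of an utterance with 250 ms virtual syllables (assumed data layout).
VIRTUAL_MS = 250

def pad_with_virtual_syllables(syllables):
    """syllables: list of (start, end) times in seconds for one utterance."""
    first_start, last_end = syllables[0][0], syllables[-1][1]
    pre = (first_start - VIRTUAL_MS / 1000.0, first_start)    # virtual syllable before
    post = (last_end, last_end + VIRTUAL_MS / 1000.0)         # virtual syllable after
    return [pre] + syllables + [post]

print(pad_with_virtual_syllables([(0.0, 0.18), (0.18, 0.41), (0.41, 0.72)]))
```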
Method. These features (pitch contour, motion, rhythm, energy) are sampled and reconstructed at 20, 50, and 80% of each syllable.
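A minimal sketch of that per-syllable sampling, assuming a contour given as (times, values) arrays; the choice of linear interpolation is ours, the paper only states the 20/50/80% sampling points.

```python
# Illustrative sampling of a contour at 20%, 50% and 80% of one syllable.
import numpy as np

def sample_syllable(times, values, start, end, fractions=(0.2, 0.5, 0.8)):
    """Sample a contour (e.g. F0 or a blendshape track) inside one syllable."""
    targets = [start + f * (end - start) for f in fractions]
    return np.interp(targets, times, values)

times = np.linspace(0.0, 1.0, 100)              # stand-in contour over 1 s
values = 120 + 20 * np.sin(2 * np.pi * times)   # toy F0-like curve
print(sample_syllable(times, values, start=0.18, end=0.41))
```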
Experiment. A perceptual test: participants watch a short dialogue video in which only one actor is shown and are asked to identify the attitude, to assess the perceived expressiveness of the results.
Experiment. Participants were shown 3 classes of video: a) the original video, b) a motion-captured animation, c) an animation generated by the method.
Experiment. 26 different short exchanges; for each, participants chose 1 of 6 attitudes, with some attitudes exclusive to one actor. Each participant made 36 evaluations over the short dialogue exchanges, about 20 minutes in total; there were 51 (French) participants.
Results. Compare attitude evaluations between video types, per participant; consistency across types is good. Hypothesis: no significant differences in results between video classes.
Results. Male actor: p = 0.43, fail to reject the null hypothesis, i.e. no statistical difference between video types. Female actor: p < 1e-3, reject the null hypothesis; after removing the 'tender' and 'fascinated' attitudes, p = 0.32, fail to reject. (Null hypothesis: no significant differences between classes.)
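For readers unfamiliar with this kind of comparison, a minimal sketch of testing whether attitude-recognition counts depend on the video class; the use of a chi-square independence test and the toy counts are our assumptions, not the paper's reported procedure or data.

```python
# Illustrative chi-square test of independence: do recognition counts depend
# on the video class? (Test choice and counts are assumptions, not the paper's data.)
from scipy.stats import chi2_contingency
import numpy as np

# Rows: video class (original, mocap, generated); columns: chosen attitude (toy counts).
counts = np.array([
    [30, 5, 7, 4],
    [28, 6, 8, 4],
    [25, 9, 7, 5],
])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")   # p > 0.05 -> fail to reject independence
```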
Critical review
Positives
Results look promising: participants recognise most attitudes correctly, and the animation and speech feel 'expressive', though a bit 'stiff'.
Large expansion. A huge jump from SFC to a fully animated character, with multiple innovations: blinking, gaze, facial animation, and rhythm. The authors could have been content with furthering research in one area, but instead worked on many fronts at once.
Negatives
Mediocre structuring. The topic structure and naming are sometimes confusing; the focus changes within topics without the writing signalling the change, which makes the paper hard to piece together on a first read.
'High-level' explanation lacking. The paper lacks a solid explanation of how everything comes together: it is encapsulated in a single image that is referenced only once. It is a good figure, but the text does not follow its structure.
Lacking self-reflection. There is no analysis of the quality of the animation/speech: how does the neural-network output differ from the ground truth? Is it perfect? Are there flaws or unexpected behaviour? They paint everything too rosily.
Lacking integrity. Is 'trial and error' a valid way to structure a neural network? What about reproducibility? The authors disregard certain attitudes to obtain the desired statistical result, when the underlying cause could easily have been corrected instead. They paint everything too rosily.
Future work. An innovative project and a huge jump from SFC to a fully animated character, but perhaps too ambitious: the training set is quite small (time constraints?). Suggested directions: focus on improving and studying the subsystems (blinking, gaze, etc.) and invest in professional actors, also for the analysis. (Ask about the analysis.)
Questions / Discussion
Discussion #1: Is the reason for excluding the badly performing female-actor attitudes valid? If not, what should they have done?
Discussion #2: Could making certain attitudes actor-exclusive influence the results?