A Generative Audio-Visual Prosodic Model for Virtual Actors


1 A Generative Audio-Visual Prosodic Model for Virtual Actors
Modelling virtual humans
Cas Laugs ( )
Max Meijers ( )

2 Authors
Adela Barbulescu and Rémi Ronfard, from Inria, the French national research institute for digital sciences
Gérard Bailly, from GIPSA-lab, Université Grenoble Alpes

3 Overview and Motivation
Published in IEEE Computer Graphics and Applications, December 2017
Only a single citation so far

4 Overview and Motivation
Expressing complex mental states during conversation in animation is hard
Prosody of voice (rhythm, tone)
Facial animation
Gaze
NOT speech-to-motion!

5 Problem statement
Create a method for generating natural speech and facial animation
Expressing various attitudes
Based on neutral input (speech + animation)
Note the distinction between attitude and emotion: attitudes are voluntary, emotions involuntary

6 Previous work
Three types of expressive facial animation:
Text-driven
Speech-driven
Expressive conversion
SFC model

7 Previous work - Types
Text-driven:
Use Text-To-Speech to obtain speech (audio) and phoneme durations
Joint-driven animation
Rule-based approach for attitude
Speech-driven:
Generate face motion from speech
Also TTS and rule-based, using semantics

8 Previous work - Types
Expressive conversion:
Mapping function to an emotional space (speech)
Statistical mapping of motion capture data (animation)

9 Previous work - SFC
Superposition of Functional Contours (SFC)
By Gérard Bailly (2 citations)
Generates the prosody of an utterance
Based on linguistic and non-linguistic information
SFC generates intonation, tone, and rhythm

10 Approach
Train the model on various dramatic performances
Input: a neutral sentence performance (video + audio) and an attitude label
Generate an animated 3D character (speech and animation) to match the attitude label
Three main steps

11 A Visual Example
In this video, the attitude labels are shown above the characters

12 Method
Model 10 attitudes
From Simon Baron-Cohen's Mind Reading project (412 attitudes in the whole corpus)
Captured by 'semi-professional' actors
More detail on their method follows

13 Method
Extend the existing SFC model with speech and facial animation
Neural networks using prosody features:
Acoustic
Visual
Virtual syllables (supportive)
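The slides do not give the network configuration; as a rough illustration of the kind of mapping described (per-syllable context features to acoustic/visual prosody targets), here is a minimal scikit-learn sketch with made-up dimensions and data, not the architecture used in the paper:

```python
# Sketch: a small feedforward regressor mapping per-syllable context features
# to acoustic/visual prosody targets. Sizes and data are placeholders,
# not the configuration used in the paper.
import numpy as np
from sklearn.neural_network import MLPRegressor

n_syllables, n_context, n_targets = 500, 12, 9   # made-up sizes
X = np.random.rand(n_syllables, n_context)       # e.g. position in phrase, attitude label, ...
y = np.random.rand(n_syllables, n_targets)       # e.g. pitch/energy/motion samples per syllable

model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
model.fit(X, y)
predicted_prosody = model.predict(X[:5])         # prosody targets for the first 5 syllables
```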

14 Method - Acoustic
Acoustic features:
Voice pitch contours
Rhythm
Energy value
Acoustic features extracted with Praat
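As a hedged sketch of this step, the following uses the praat-parselmouth Python bindings to pull pitch and intensity contours from a recording; the paper itself uses Praat, and the file name and analysis settings here are assumptions:

```python
# Sketch: extracting pitch and energy contours with praat-parselmouth.
# Illustrative only; the paper's exact Praat settings are not given on the slide.
import parselmouth

snd = parselmouth.Sound("utterance.wav")  # hypothetical input file

# Fundamental frequency (pitch) contour, sampled every 10 ms
pitch = snd.to_pitch(time_step=0.01)
f0 = pitch.selected_array["frequency"]    # 0 where a frame is unvoiced

# Energy contour (intensity in dB)
intensity = snd.to_intensity(time_step=0.01)
energy = intensity.values[0]

print(f"{len(f0)} pitch frames, {len(energy)} energy frames")
```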

15 Method - Visual
...and visual features:
Verbal motion: shapes made by the mouth to articulate words
Non-verbal motion: gestures, facial expressions, etc.
Combined linearly by adding to a 'neutral' motion
Split into upper and lower face (19 and 29 blendshapes)
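A minimal numpy sketch of the linear combination idea above; the 19/29 upper/lower split follows the slide, but the weight arrays are purely illustrative:

```python
# Sketch: combining verbal and non-verbal motion additively on top of a
# neutral pose, per frame of blendshape weights. Array contents are illustrative.
import numpy as np

n_frames = 100
n_upper, n_lower = 19, 29                      # upper/lower-face blendshapes per the slide

neutral    = np.zeros((n_frames, n_upper + n_lower))
verbal     = np.random.rand(n_frames, n_upper + n_lower) * 0.1   # articulation (mouth shapes)
non_verbal = np.random.rand(n_frames, n_upper + n_lower) * 0.1   # expressions, gestures

# Linear combination: expressive motion = neutral + verbal + non-verbal offsets
combined = neutral + verbal + non_verbal

# Upper and lower face can then be handled separately
upper, lower = combined[:, :n_upper], combined[:, n_upper:]
```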

16 Method - Virtual syllables
Supporting virtual syllables
Inserted before and after each utterance (250 ms)
Indicate 'talking turns' using non-verbal gestures
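A small sketch of how 250 ms virtual syllables could be padded around an utterance's syllable timeline; the tuple-based data structure and timings are assumptions for illustration:

```python
# Sketch: padding an utterance with 250 ms "virtual syllables" before and after,
# giving turn-taking gestures a slot to attach to. Timings are illustrative.
VIRTUAL_SYLLABLE_MS = 250

def add_virtual_syllables(syllables):
    """syllables: list of (label, start_ms, end_ms) tuples for one utterance."""
    start = syllables[0][1]
    end = syllables[-1][2]
    pre  = ("<virtual>", start - VIRTUAL_SYLLABLE_MS, start)
    post = ("<virtual>", end, end + VIRTUAL_SYLLABLE_MS)
    return [pre] + syllables + [post]

print(add_virtual_syllables([("bon", 0, 180), ("jour", 180, 400)]))
```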

17 Method
These features (pitch contour, motion, rhythm, energy) are sampled and reconstructed at 20, 50, and 80% of each syllable
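A sketch of sampling a contour at 20, 50, and 80% of each syllable's duration; the use of linear interpolation is an assumption, and the real feature extraction may differ:

```python
# Sketch: sampling a contour (e.g. pitch) at 20/50/80% of each syllable's duration.
# Uses linear interpolation; illustrative only.
import numpy as np

def sample_contour(times, values, syllables, points=(0.2, 0.5, 0.8)):
    """times/values: contour samples; syllables: list of (start, end) in seconds."""
    samples = []
    for start, end in syllables:
        at = [start + p * (end - start) for p in points]
        samples.append(np.interp(at, times, values))
    return np.array(samples)   # shape: (n_syllables, 3)

times = np.linspace(0.0, 1.0, 101)
f0 = 120 + 20 * np.sin(2 * np.pi * times)          # toy pitch contour
print(sample_contour(times, f0, [(0.0, 0.4), (0.4, 1.0)]))
```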

18 Experiment
Perceptual test to assess the perceived expressiveness of the results
Short dialogue videos, with only one actor shown
Participants asked to identify the attitude

19 Experiment
Participants were shown 3 classes of video:
a) Original video
b) Motion-captured animation
c) Animation generated by the method

20 Experiment
26 different short exchanges
Choose 1 of 6 attitudes per exchange
Some attitudes are actor-exclusive
36 evaluations per participant
51 (French) participants
36 short dialogue exchanges, total 20 minutes

21 Results
Compare attitude evaluations between video types per participant; consistency is good
Hypothesis: no significant differences in results between video classes

22 Results
Hypothesis: no significant differences between classes

23 Results
Male actor: p = 0.43 → fail to reject the null hypothesis (no statistical difference between video types)
Female actor: p < 1e-3 → reject the null hypothesis
After removing 'tender' and 'fascinated': p = 0.32 → fail to reject
Hypothesis: no significant differences between classes
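The slides do not name the statistical test used; as an illustration of comparing attitude-recognition counts across the three video classes, here is a chi-square test of independence with scipy, using made-up counts:

```python
# Sketch: testing whether attitude-recognition counts differ between video classes.
# The choice of test and the counts below are illustrative, not taken from the paper.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: video class (original, mocap, generated); columns: chosen attitude (6 options)
counts = np.array([
    [30, 5, 4, 3, 5, 4],
    [28, 6, 5, 4, 4, 4],
    [25, 7, 6, 5, 4, 4],
])
chi2, p, dof, _ = chi2_contingency(counts)
print(f"chi2 = {chi2:.2f}, p = {p:.3f}")   # p > 0.05: fail to reject "no difference"
```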

24 Critical review

25 Positives

26 Results look promising
Participants recognise most attitudes correctly
Animation and speech feel 'expressive', but a bit 'stiff'

27 Large expansion
Huge jump from SFC to fully animated characters
Multiple innovations: blinking, gaze, facial animation, rhythm
The authors could have been content with furthering research in just one area, but they worked on many fronts at once

28 Negatives

29 Mediocre structuring
Topic structure and naming are sometimes confusing
Focus changes within topics without the writing signalling the change
Makes the paper hard to piece together on a first read

30 'High-level' explanation lacking
The paper lacks a solid explanation of how everything comes together
Encapsulated in a single figure, referenced only once
It is a good figure, but the text does not follow its structure

31 Lacking self-reflection
No analysis of the quality of the animation/speech
How does the NN output differ from the ground truth? Perfect? Flawed? Unexpected behaviour?
They present everything too rosily

32 Lacking integrity
Is 'trial and error' a valid way to design a neural network? What about reproducibility?
The authors disregard certain attitudes to obtain significant results
The 'cause' could easily have been corrected
They present everything too rosily

33 Future work
Innovative project: a huge jump from SFC to fully animated characters
Too ambitious? The training set is quite small; time constraints?
Focus on improving/studying subsystems (blinking, gaze, etc.)
Invest in professional actors, also for the analysis

34 Questions / Discussion

35 Discussion #1
Is the reason for excluding the badly performing female-actor attitudes valid? If not, what should they have done?

36 Discussion #2
Could making certain attitudes actor-exclusive influence the results?

