Audio-Driven Facial Animation by Joint End-to-End Learning of Pose and Emotion
Presented by: Iwan Boksebeld and Marijn Suijten

Authors
- Tero Karras (NVIDIA)
- Timo Aila (NVIDIA)
- Samuli Laine (NVIDIA)
- Antti Herva (Remedy Entertainment)
- Jaakko Lehtinen (NVIDIA & Aalto University)

Goals of the paper
- Create a 3D face mesh from just audio
- With the use of CNNs
- While keeping latency low
- And factoring in emotions
Notes: Quickly mention the drawbacks of vision-based performance capture (PC). Audio-based animation is used in games to fill the gaps, but its quality leaves much to be desired.

The Problem
- Create a full face mesh for added realism
- Use emotions in the animation
- Deal with the ambiguity of audio
- Create a CNN and train it

Related Work

Linguistics-based animation
- Input is audio, often with a transcript
- Animation results from language-based rules
- Strength: a high level of control
- Weakness: the complexity of the system
- An example of such a model is the Dominance model

Machine learning techniques
- Mostly in 2D
- Learn the rules that are given explicitly in linguistics-based animation
- Blend and/or concatenate images to produce results
- Not really useful for the application here

Capturing emotions
- Mostly based on user-supplied parameters
- Some work with neural networks
- Creates a mapping from emotion parameters to facial expressions

The technical side

Audio processing
- 16 kHz mono; normalized volume
- 260 ms of past and future samples, 520 ms in total (value empirically chosen)
- Split into 64 audio frames of 16 ms each
- 2x overlap: frames are spaced 8 ms apart, so every 8 ms of audio is used twice
- Hann window: removes temporal aliasing effects
Notes: The window size was chosen empirically: enough to capture phoneme articulation without providing so much data that the network overfits. 64 frames of 16 ms spaced 8 ms apart span 63 × 8 ms + 16 ms = 520 ms, i.e. 260 ms of past and future samples.
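A minimal sketch of this framing step, assuming NumPy; the frame length, hop size and frame count are taken from the slide, everything else (function name, shapes) is illustrative:

```python
import numpy as np

SAMPLE_RATE = 16_000                      # 16 kHz mono input
FRAME_LEN = SAMPLE_RATE * 16 // 1000      # 16 ms -> 256 samples
HOP = FRAME_LEN // 2                      # 8 ms -> 2x overlap
NUM_FRAMES = 64                           # 63 * 8 ms + 16 ms = 520 ms window

def frame_window(window: np.ndarray) -> np.ndarray:
    """Split one 520 ms audio window (8320 samples) into 64 Hann-windowed frames."""
    assert window.shape[0] == (NUM_FRAMES - 1) * HOP + FRAME_LEN  # 8320 samples
    hann = np.hanning(FRAME_LEN)
    frames = np.stack([
        window[i * HOP : i * HOP + FRAME_LEN] * hann
        for i in range(NUM_FRAMES)
    ])
    return frames  # shape (64, 256)
```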

Autocorrelation
- Calculate K = 32 autocorrelation coefficients per frame
- 12 would be enough for identifying individual phonemes; more are needed to also identify pitch
- No special techniques for linear separation of phonemes
Notes: The resonance frequencies are entirely determined by the autocorrelation coefficients, and early tests indicate that using the autocorrelation coefficients directly is clearly superior.
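A sketch of computing the K = 32 coefficients per frame, building on the frame_window() sketch above; the normalization of the coefficients is an assumption, not taken from the paper:

```python
import numpy as np

K = 32  # autocorrelation coefficients per 16 ms frame

def autocorrelation_features(frames: np.ndarray, k: int = K) -> np.ndarray:
    """Compute the first k autocorrelation coefficients of each windowed frame.

    frames: (num_frames, frame_len) array from frame_window() above.
    Returns: (num_frames, k) array that forms the network's 64 x 32 input.
    """
    num_frames, frame_len = frames.shape
    feats = np.empty((num_frames, k))
    for f in range(num_frames):
        x = frames[f]
        for lag in range(k):
            # Biased autocorrelation estimate at this lag; the exact
            # normalization used in the paper is an assumption here.
            feats[f, lag] = np.dot(x[: frame_len - lag], x[lag:]) / frame_len
    return feats
```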

CNN Layout: Formant analysis network
- First layer is the audio processing and autocorrelation described above
- Time axis of 64 frames, 32 autocorrelation coefficients each
- Followed by 5 convolution layers
- Converts the formant-level audio features into 256 abstract feature maps
Notes: Formant = resonance frequency of the linear filters, which carries the information of a phoneme. These layers learn to extract short-term features that are relevant for facial animation, such as intonation, emphasis, and specific phonemes.

CNN Layout: Articulation network
- Analyzes the temporal evolution of the features
- 5 convolution layers as well
- Emotion vector concatenated to each layer
(See the sketch below for how the two networks fit together.)
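A hedged PyTorch-style sketch of the two-stage layout described on these two slides: formant analysis convolving over the autocorrelation axis, articulation convolving over the time axis, with the emotion vector concatenated to every articulation layer. Layer counts follow the slides; kernel sizes, strides, channel widths and the output head are assumptions, not the paper's exact values.

```python
import torch
import torch.nn as nn

class AudioToFaceNet(nn.Module):
    """Sketch of the formant-analysis + articulation networks (sizes assumed)."""

    def __init__(self, emotion_dim: int = 16, num_vertices: int = 5000):
        super().__init__()
        # Formant analysis: 5 conv layers over the 32 autocorrelation coefficients,
        # keeping the 64-frame time axis intact, ending in 256 feature maps.
        chans = [1, 72, 108, 162, 243, 256]
        self.formant = nn.ModuleList([
            nn.Conv2d(chans[i], chans[i + 1],
                      kernel_size=(1, 3), stride=(1, 2), padding=(0, 1))
            for i in range(5)
        ])
        # Articulation: 5 conv layers over the time axis; the emotion vector is
        # concatenated to the input of every layer.
        self.articulation = nn.ModuleList([
            nn.Conv2d(256 + emotion_dim, 256,
                      kernel_size=(3, 1), stride=(2, 1), padding=(1, 0))
            for _ in range(5)
        ])
        # Output head: maps pooled features to per-vertex 3D positions (deltas).
        self.output = nn.Linear(256, 3 * num_vertices)

    def forward(self, autocorr: torch.Tensor, emotion: torch.Tensor) -> torch.Tensor:
        # autocorr: (batch, 1, 64 frames, 32 coefficients); emotion: (batch, E)
        x = autocorr
        for conv in self.formant:
            x = torch.relu(conv(x))          # ends at (batch, 256, 64, 1)
        for conv in self.articulation:
            e = emotion[:, :, None, None].expand(-1, -1, x.shape[2], x.shape[3])
            x = torch.relu(conv(torch.cat([x, e], dim=1)))
        x = x.mean(dim=(2, 3))               # pool the remaining time steps
        return self.output(x)                # (batch, 3 * num_vertices)
```

The single linear output layer here is a simplification of the paper's output network; it only illustrates where the vertex positions come out.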

Working with emotions
- Speech is highly ambiguous
- Consider silence: what should it look like?

Representing Emotions
- Emotional state stored as a "meaningless" E-dimensional vector
- Learned along with the network
- Vector concatenated to the convolution layers in the articulation network
- Concatenating it to every layer gives a significantly better result
- It supports the early layers with nuanced control over details such as coarticulation
- The later layers gain more control over the output pose
Notes: Explain "meaningless": these values are updated during the learning phase (just like a regular NN layer) and carry no semantic meaning otherwise; the vector tries to "learn" all information that is not inferable from the audio. Don't forget to mention that the last two bullets are assumptions by the authors.
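One way to read "learned along with the network": each training sample owns a trainable E-dimensional vector that is optimized jointly with the convolution weights. A minimal PyTorch sketch of that idea; the embedding size, initialization and class name are assumptions:

```python
import torch
import torch.nn as nn

class EmotionDatabase(nn.Module):
    """One trainable E-dimensional emotion vector per training sample."""

    def __init__(self, num_training_samples: int, emotion_dim: int = 16):
        super().__init__()
        # Initialized near zero; the vectors only acquire meaning through training.
        self.vectors = nn.Embedding(num_training_samples, emotion_dim)
        nn.init.normal_(self.vectors.weight, std=0.01)

    def forward(self, sample_indices: torch.Tensor) -> torch.Tensor:
        # Looked up per minibatch sample and fed to the articulation network.
        return self.vectors(sample_indices)
```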

Training

Training Target
- Use 9 cameras to obtain an unstructured mesh and optical flow
- Project a template mesh onto the unstructured mesh
- Link the optical flow to the template
- The template mesh is then tracked across the performance
- Use a subset of vertices to stabilize the head
- Limitation: no tongue tracking

Training Data
- Pangrams and in-character material
- 3-5 minutes per actor (trade-off between quality and time/cost)
- Pangrams: sentences designed to cover as many sounds of a language as possible
- In-character material: captures emotions driven by the character's narrative
- Time-shifting data augmentation (see the sketch below)
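A minimal sketch of what time-shifting augmentation could look like: jitter the position of the 520 ms audio window relative to the target video frame. The shift magnitude, sampling and boundary handling are assumptions; the slide does not state them.

```python
import numpy as np

def time_shift_augment(audio: np.ndarray, center: int, window_len: int = 8320,
                       max_shift: int = 160,
                       rng: np.random.RandomState = None) -> np.ndarray:
    """Extract the 520 ms (8320-sample) window around `center`, jittered in time.

    max_shift is in samples (160 samples = 10 ms at 16 kHz); the magnitude used
    in the paper is an assumption here. Boundary handling is omitted.
    """
    rng = rng or np.random.RandomState()
    shift = rng.randint(-max_shift, max_shift + 1)
    start = center + shift - window_len // 2
    return audio[start : start + window_len]
```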

Loss Function
The loss function consists of three terms:
- Position term
- Motion term
- Regularization term
A normalization scheme is used to balance these terms.

Position Term
- Ensures correct vertex locations
- V: number of vertices
- y: desired position
- ŷ: actual (predicted) position
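From the symbols on this slide, the position term is presumably a mean-squared error over the vertex coordinates; a hedged reconstruction (treating each vertex as three scalar coordinates, so the 3V normalization is an assumption):

\[
P(x) = \frac{1}{3V} \sum_{i=1}^{3V} \left[ y_i(x) - \hat{y}_i(x) \right]^2
\]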

Motion Term
- Ensures correct motion by comparing paired frames
- m(·): difference between the paired frames
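With m(·) denoting the difference between the two paired frames, the motion term plausibly applies the same squared error to those finite differences (again a hedged reconstruction, not copied from the paper):

\[
M(x) = \frac{1}{3V} \sum_{i=1}^{3V} \left[ m\big(y_i(x)\big) - m\big(\hat{y}_i(x)\big) \right]^2,
\qquad m(y_i) = y_i^{(t)} - y_i^{(t-1)}
\]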

Regularization Term
- Ensures the emotion vector does not change erratically
- Normalized to prevent the term from becoming ineffective
- E: number of emotion components
- e_i(x): i-th component of the emotion vector for sample x
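Using the slide's symbols, a hedged form of the regularization term, penalizing frame-to-frame changes of the emotion vector; the normalization scheme mentioned on the loss-function slide then rescales it so it does not fade away during training:

\[
R(x) = \frac{1}{E} \sum_{i=1}^{E} \Big[ m\big(e_i(x)\big) \Big]^2
\]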

Inference
Notes: Quickly mention: audio in, vertex deltas out.

Inferring emotion
- Step 1: Cull "wrong" emotion vectors (bilabials -> closed mouth, vowels -> open mouth)
- Step 2: Visually inspect the animation; remove vectors with short-term effects
- Step 3: Test with a voice from a different actor; unnatural results indicate a lack of generalization
- Manually assign semantic meaning to the remaining vectors
- Interpolate emotion vectors for transitions or complex emotions
Notes: Step 1 leaves ~100 vectors; Step 2 leaves ~86 (removes subdued, spurious or unnatural motion); Step 3 leaves ~33. A lack of generalization means the vector will go wrong at some point. It is possible to blend/interpolate emotion vectors.

Results

Results (video timestamps)
- 2:19: general results & emotions
- 0:06: vs. PC
- 0:53: vs. DM
- 3:20: different languages (columns represent emotional states)

User Study Setup
- Blind user study with 20 participants
- Users were asked to choose which of two clips was more realistic
- Two sets of experiments:
- Comparison against other methods (DM vs. PC vs. ours): audio from the validation set, not used in training; 13 clips of 3-8 seconds
- Generalization over language and gender: 14 clips from several languages, taken from an online database without checking the output

User Study Results

User Study Results

User Study Results
- Clearly better than DM
- Still quite a bit worse than PC
- Generalizes quite well over languages, even compared to the linguistics-based method

Critical Review

Drawbacks of the solution
- No residual motion: no blinking, no head movement; assumes a higher-level system handles these
- Problems with similar-looking sounds, e.g. confusing B and G
- Fast languages are a problem
- Novel data needs to be somewhat similar to the training data
- Misses detail compared to PC
- Emotions have no defined meaning
Notes: According to NVIDIA, the penultimate point (missing detail compared to PC) is the main drawback.

Questions?

Discussion

Discussion
- Is it useful to gather emotions the way this paper describes? Why not simply tell the network what the emotion is?

Discussion
- Should they have used more participants?

Discussion
- Why do you think blinking and eye/head motion are not covered by the network?