1
Deep Learning for Expression Recognition in Image Sequences
Daniel Natanael García Zapata
Tutors: Dr. Sergio Escalera, Dr. Gholamreza Anbarjafari
April
2
Introduction and Goals
3
Introduction
Facial expressions convey information for transmitting emotions. Emotion recognition is a complex task, even for some humans. Deep learning algorithms have achieved strong results in the area of computer vision.
D. Hamester et al., "Face Expression Recognition with a 2-Channel Convolutional Neural Network", International Joint Conference on Neural Networks (IJCNN), 2015. Zhang, Y., & Ji, Q. (2003, October). Facial expression understanding in image sequences using dynamic and active visual information fusion. In Proceedings of the Ninth IEEE International Conference on Computer Vision (pp. ). IEEE.
4
Goals
Identify the pros and cons of the different deep learning models that are tested. Compare computer vision techniques that recognize emotions from facial expressions; the comparison covers both still-image models and image-sequence models on the datasets.
5
Index Basics Background Evaluated Deep Models Results Conclusions
6
Basics: CNN basics
7
Convolutional Neural Network
Example of a Convolutional Neural Network Yang, S., Luo, P., Loy, C. C., Shum, K. W., & Tang, X. (2015, January). Deep Representation Learning with Target Coding. In AAAI (pp ).
8
Convolutional Layer: Convolution, Neurons, Kernel
Convolution connects each neuron to a local region of the input. Neurons are arranged in three dimensions: width, height, depth; the depth of the activation volume corresponds to the number of kernels. A kernel describes a set of learnt weights.
A. Karpathy. CS231n Convolutional Neural Networks for Visual Recognition, 2018.
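As an illustration (a minimal sketch assuming TensorFlow/Keras, which the slides do not name), a convolutional layer whose 32 kernels each connect output neurons to a local 3×3 region of the input:

```python
import tensorflow as tf

# A single convolutional layer: 32 kernels (the learnt sets of weights),
# each connecting every output neuron to a local 3x3 region of the input.
# The output volume has width x height x depth, where depth = 32 kernels.
conv = tf.keras.layers.Conv2D(filters=32, kernel_size=(3, 3),
                              activation='relu', padding='same')

x = tf.random.normal((1, 224, 224, 3))   # one RGB image (an assumed size)
y = conv(x)
print(y.shape)                           # (1, 224, 224, 32)
```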
9
Pooling Layer Partitions the input into non-overlapping rectangles.
For each sub-region, it outputs the maximum value of the features in that region. A. Karpathy. CS231n Convolutional Neural Networks for Visual Recognition, 2018.
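Continuing the same hedged Keras sketch, max pooling over non-overlapping 2×2 rectangles halves the spatial size while leaving the depth unchanged:

```python
import tensorflow as tf

# Max pooling: partitions each feature map into non-overlapping 2x2
# rectangles and keeps the maximum value of each region.
pool = tf.keras.layers.MaxPooling2D(pool_size=(2, 2), strides=(2, 2))

x = tf.random.normal((1, 224, 224, 32))
y = pool(x)
print(y.shape)  # (1, 112, 112, 32) -- spatial size halved, depth unchanged
```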
10
Modalities and Features
A. Karpathy. CS231n Convolutional Neural Networks for Visual Recognition, 2018. I. Ofodile, K. Kulkarni, C. A. Corneanu, S. Escalera, X. Baro, S. Hyniewska, J. Allik, and G. Anbarjafari. Automatic Recognition of Deceptive Facial Expressions of Emotion.
11
Background
12
Deep Learning Based Emotion Recognition from Still Images
A pre-trained deep CNN as a Stacked Convolutional AutoEncoder (SCAE). The model is trained using the Karolinska Directed Emotional Faces (KDEF) dataset. M. G. Calvo and D. Lundqvist. Facial expressions of emotion (KDEF): Identification under different display-duration conditions. Behavior Research Methods, 2008.
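The slide does not give the architecture, so the following is only a rough sketch of the SCAE pre-training idea (all layer sizes, the 64×64 grayscale input, and the training call are assumptions, not details from the cited work):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Rough sketch of convolutional autoencoder pre-training: the encoder is
# trained to reconstruct its input, then reused as a feature extractor.
encoder = models.Sequential([
    layers.Input(shape=(64, 64, 1)),
    layers.Conv2D(32, 3, activation='relu', padding='same'),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, activation='relu', padding='same'),
    layers.MaxPooling2D(2),
])
decoder = models.Sequential([
    layers.Conv2DTranspose(64, 3, strides=2, activation='relu', padding='same'),
    layers.Conv2DTranspose(32, 3, strides=2, activation='relu', padding='same'),
    layers.Conv2D(1, 3, activation='sigmoid', padding='same'),
])
autoencoder = models.Sequential([encoder, decoder])
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(faces, faces, ...)  # reconstruct faces; afterwards the
# encoder would be fine-tuned with a classification head on emotion labels.
```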
13
Deep Learning Based Emotion Recognition from Image Sequences
CNN-RNN architecture for emotion transition analysis. The model is trained with two datasets: CASIA-WebFace and Emotion Recognition in the Wild (EmotiW). A. Dhall, O. V. Ramana Murthy, R. Goecke, J. Joshi, and T. Gedeon. Video and image based emotion recognition challenges in the wild: EmotiW 2015. In Proceedings of the 2015 ACM International Conference on Multimodal Interaction, 2015. N. Ronghe, S. Nakashe, A. Pawar, and S. Bobde. Emotion recognition and reaction prediction in videos, 2017.
14
Evaluated Deep Models
15
Basic CNN, Two-Stream CNN, Middle Fusion CNN
All three models build on VGG-Face. O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep Face Recognition. In Proceedings of the British Machine Vision Conference, 2015.
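The slides present these models as diagrams only; below is a hypothetical sketch of the middle-fusion idea, where two input streams (e.g. a face image and landmark geometry) are processed separately and concatenated mid-network. All layer sizes, input shapes, and names are illustrative assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Hypothetical middle-fusion sketch: two streams are processed separately,
# then their intermediate representations are concatenated ("middle" fusion)
# before the shared classification layers.
image_in = layers.Input(shape=(224, 224, 3), name='face_image')
x = layers.Conv2D(64, 3, activation='relu', padding='same')(image_in)
x = layers.MaxPooling2D(2)(x)
x = layers.Flatten()(x)
x = layers.Dense(128, activation='relu')(x)

geom_in = layers.Input(shape=(136,), name='landmark_geometry')  # 68 (x, y) landmarks
g = layers.Dense(128, activation='relu')(geom_in)

fused = layers.Concatenate()([x, g])                  # middle fusion point
out = layers.Dense(12, activation='softmax')(fused)   # e.g. 12 SASE-FE classes

model = models.Model([image_in, geom_in], out)
```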
16
C3D: Frames Input
A. Balu et al. "Learning Localized Geometric Features Using 3D-CNN: An Application to Manufacturability Analysis of Drilled Holes", 2016.
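A hedged sketch of the 3D convolution underlying C3D, where the kernel slides over time as well as space, so the input is a stack of frames rather than a single image (the clip and kernel sizes are assumptions):

```python
import tensorflow as tf

# 3D convolution: the kernel slides over time as well as space, so the
# input is a stack of frames (depth = time) rather than a single image.
conv3d = tf.keras.layers.Conv3D(filters=64, kernel_size=(3, 3, 3),
                                activation='relu', padding='same')

clip = tf.random.normal((1, 16, 112, 112, 3))  # 16 RGB frames
features = conv3d(clip)
print(features.shape)  # (1, 16, 112, 112, 64)
```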
17
Recurrent Models: LSTM and GRU
C. Olah. Understanding LSTM Networks, 2015.
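A hedged sketch of the recurrent models over per-frame CNN features; the two-layer depth follows the parameters slide below, while the unit count, feature dimension, and sequence length are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def sequence_model(cell=layers.GRU, timesteps=16, feat_dim=512, classes=12):
    # Two stacked recurrent layers over a sequence of per-frame feature
    # vectors (e.g. extracted by a CNN); swap `cell` for layers.LSTM.
    return models.Sequential([
        layers.Input(shape=(timesteps, feat_dim)),
        cell(128, return_sequences=True),  # layer 1 returns the full sequence
        cell(128),                         # layer 2 returns the final state
        layers.Dense(classes, activation='softmax'),
    ])

gru_model = sequence_model(layers.GRU)    # fewer parameters than the LSTM
lstm_model = sequence_model(layers.LSTM)
```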
18
Results
19
SASE-FE
Examples: 29,798. Classes: 12 emotions. Data split: Training 75%, Validation 15%, Test 10%.
I. Ofodile, K. Kulkarni, C. A. Corneanu, S. Escalera, X. Baro, S. Hyniewska, J. Allik, and G. Anbarjafari. Automatic Recognition of Deceptive Facial Expressions of Emotion.
20
OULU-CASIA
Examples: 8,384. Classes: 6 emotions. Data split: Training 75%, Validation 15%, Test 10%.
G. Zhao, X. Huang, M. Taini, S. Z. Li, and M. Pietikäinen. Facial expression recognition from near-infrared videos. Image and Vision Computing, 2011.
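A minimal sketch of the 75/15/10 split used for both datasets (scikit-learn and the index-based approach are assumptions, not taken from the slides):

```python
from sklearn.model_selection import train_test_split

# 75% train, 15% validation, 10% test, as on the dataset slides.
idx = list(range(29798))  # e.g. SASE-FE example indices
train, rest = train_test_split(idx, test_size=0.25, random_state=0)
val, test = train_test_split(rest, test_size=0.4, random_state=0)  # 0.4 * 25% = 10%
print(len(train), len(val), len(test))
```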
21
Preprocessing: No Symmetry, Soft Symmetry, Geometry
Extract each frame as an image; only frames from 50% to 90% of the video duration are considered (a sketch of this step follows below). A frontalization process transforms the faces into forward-facing faces, and face landmark geometry is then obtained.
T. Hassner, S. Harel, E. Paz, and R. Enbar. Effective Face Frontalization in Unconstrained Images. I. Ofodile, K. Kulkarni, C. A. Corneanu, S. Escalera, X. Baro, S. Hyniewska, J. Allik, and G. Anbarjafari. Automatic Recognition of Deceptive Facial Expressions of Emotion.
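A hedged OpenCV sketch of the frame-selection step described above, keeping only frames between 50% and 90% of the video duration (the function name is illustrative):

```python
import cv2

def extract_middle_frames(video_path):
    """Keep only frames between 50% and 90% of the video duration."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    start, end = int(total * 0.5), int(total * 0.9)
    frames = []
    for i in range(total):
        ok, frame = cap.read()
        if not ok:
            break
        if start <= i < end:
            frames.append(frame)
    cap.release()
    return frames
```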
22
Model Parameters
Loss function: Sparse Categorical Cross-Entropy
Optimizer: Adam (Learning rate: 0.001, Beta 1: 0.9, Beta 2: )
RNN layers: 2
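A hedged Keras sketch of these training settings; the slide's Beta 2 value is missing, so the Adam default of 0.999 is used here as an assumption:

```python
import tensorflow as tf

# Training settings from the parameters slide; beta_2 is an assumption
# (the slide's value is missing; 0.999 is the Keras/Adam default).
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001,
                                     beta_1=0.9, beta_2=0.999)
loss = tf.keras.losses.SparseCategoricalCrossentropy()

# model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])
```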
23
Quantitative Results: Best Models for Each Dataset and Model
Multi-modality improves the models overall by about 5% on average. Image input has, on average, 4% higher accuracy than feature input on OULU; feature input has, on average, 3% higher accuracy than image input on SASE. OULU receives a boost of 17%, compared to 1% for SASE.
24
Qualitative Results: Still Image vs. Image Sequence
Frontalization outperforms raw face input. All multi-modality models improve over the vanilla CNN. Most models are not able to distinguish Disgust at all, and Fear has only a handful of correctly classified images, whereas Anger, Sadness and Surprise are reliably classified correctly. LSTM and GRU both have higher accuracy than the 3D CNN. SASE achieves higher accuracy with feature input, whereas OULU does better with image input. GRU obtains the highest test accuracy on both datasets; the evidence suggests that the GRU's smaller number of parameters is more effective on considerably smaller datasets. Although the evidence suggests that temporal features are not as important, given the small size of both datasets there is not sufficient evidence to conclude that temporal models are ineffective.
25
Conclusions
26
The evaluation clearly demonstrated the superiority of face frontalization as a pre-processing step over no pre-processing at all. Although both the multi-modal and middle-fusion models improve over the base model, there is no clear consensus on which improves to a greater extent. The superiority of the GRU models over the 3D CNN and LSTM on image-sequence inputs was demonstrated. It was not possible to concretely conclude whether using feature vectors extracted from the CNN as input to the RNNs is better than using image vectors as input. Future work should include more databases and try other pre-trained models; it would also be interesting to incorporate hand-crafted spatio-temporal features.
27
Thank You! Daniel Natanael García Zapata