1
Automatic Analysis of Facial Expressions: The State of the Art
By Maja Pantic, Leon Rothkrantz
2
Presentation Outline
- Motivation
- Desired functionality and evaluation criteria
- Face Detection
- Expression data extraction
- Classification
- Conclusions and future research
3
Motivation
- HCI
  - Hope to achieve robust communication by recovering from failure of one communication channel using information from another channel
  - According to some estimates, the facial expression of the speaker counts for 55% of the effect of the spoken message (with the voice intonation contributing 38%, and the verbal part just 7%)
- Behavioral science research
  - Automation of objective measurement of facial activity
4
Desired Functionality
- Human visual system = good reference point
- Desired properties:
  - Works on images of people of any sex, age, and ethnicity
  - Robust to variation in lighting
  - Insensitive to hair style changes, presence of glasses, facial hair, partial occlusions
  - Can deal with rigid head motions
  - Is real-time
  - Capable of classifying expressions into multiple emotion categories
  - Able to learn the range of emotional expression by a particular person
  - Able to distinguish all possible facial expressions (probably impossible)
5
Overview
- Three basic problems need to be solved:
  - Face detection
  - Facial expression data extraction
  - Facial expression classification
- Both static images and image sequences have been used in studies surveyed in the paper
6
Face Detection
- In arbitrary images
  - A. Pentland et al.
    - Detection in a single image
      - Principal Component Analysis is used to generate a face space from a set of sample images
      - A face map is created by calculating the distance between the local subimage and the face space at every location in the image
      - If the distance is smaller than a certain threshold, the presence of a face is declared
    - Detection in an image sequence
      - Frame differencing is used
      - The difference image is thresholded to obtain motion blobs
      - Blobs are tracked and analyzed over time to determine if motion is caused by a person and to determine the head position
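The single-image test above (distance to the face space compared against a threshold) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the PCA step has already produced a mean face and an orthonormal basis, and the names `distance_from_face_space` and `is_face` are hypothetical.

```python
def distance_from_face_space(patch, mean_face, basis):
    """Squared residual after projecting (patch - mean_face) onto the face space.

    `patch` and `mean_face` are flat lists of pixel values.
    `basis` is assumed to be a list of orthonormal vectors (e.g. from PCA).
    """
    centered = [p - m for p, m in zip(patch, mean_face)]
    total_energy = sum(c * c for c in centered)
    energy_in_space = 0.0
    for u in basis:
        coeff = sum(c * ui for c, ui in zip(centered, u))
        energy_in_space += coeff * coeff
    # Clamp tiny negative values caused by floating-point rounding.
    return max(total_energy - energy_in_space, 0.0)


def is_face(patch, mean_face, basis, threshold):
    """Declare a face when the patch lies close enough to the face space."""
    return distance_from_face_space(patch, mean_face, basis) < threshold
```

Evaluating `distance_from_face_space` for the subimage at every location would give the "face map" mentioned on the slide; the threshold value is application-specific and is not given in the source.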
7
Face Detection (Continued)
- In face images
  - Holistic approaches (the face is detected as a whole unit)
    - M. Pantic, L. Rothkrantz
      - Use a frontal and a profile face image
      - Outer head boundaries are determined by analyzing the horizontal and vertical histograms of the frontal face image
      - The face contour is obtained by using an HSV color model based algorithm (the face is extracted as the biggest object in the scene having the Hue parameter in the defined range)
      - The profile contour is determined by following the procedure below:
        - The value component of the HSV color model is used to threshold the input image
        - The number of background pixels between the right edge of the image and the first “On” pixel is counted (this gives a vector that represents a discrete approximation of the contour curve)
        - Noise is removed by averaging
        - Local extrema correspond to points of interest (found by determining zero crossings of the 1st derivative)
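The profile-contour procedure above can be sketched roughly as below. It assumes the image has already been thresholded into rows of 0 (background) and 1 ("On") pixels; the function names, the moving-average window, and the use of finite differences for the derivative are illustrative assumptions rather than details from the paper.

```python
def profile_contour(binary_image):
    """For each row, count background pixels from the right edge to the first 'On' pixel."""
    contour = []
    for row in binary_image:
        count = 0
        for pixel in reversed(row):
            if pixel:
                break
            count += 1
        contour.append(count)
    return contour


def smooth(values, radius=1):
    """Moving average; the window is truncated at the ends of the vector."""
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - radius):min(len(values), i + radius + 1)]
        smoothed.append(sum(window) / len(window))
    return smoothed


def local_extrema(values):
    """Indices where the first difference changes sign (plateaus are ignored)."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    return [i for i in range(1, len(diffs)) if diffs[i - 1] * diffs[i] < 0]
```

For example, `local_extrema(smooth(profile_contour(image)))` returns the row indices of candidate points of interest along the profile.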
8
Face Detection (Continued)
- Analytic approaches (the face is detected by detecting some important facial features first)
  - H. Kobayashi, F. Hara
    - Brightness distribution data of the human face is obtained with a camera in monochrome mode
    - An average of brightness distribution data obtained from 10 subjects is calculated
    - Irises are identified by computing the cross-correlation between the average image and the novel image
    - The locations of other features are determined using relative locations of the facial features in the face
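As an illustration of locating a feature by cross-correlation, the sketch below slides a small template over a grayscale image and keeps the best-scoring position. It uses normalized cross-correlation, which is an assumption here (the slide only says "cross-correlation"), and `ncc` and `best_match` are hypothetical names.

```python
import math


def ncc(patch, template):
    """Normalized cross-correlation of two equally sized flat pixel lists."""
    n = len(patch)
    mean_p = sum(patch) / n
    mean_t = sum(template) / n
    num = sum((p - mean_p) * (t - mean_t) for p, t in zip(patch, template))
    den = math.sqrt(
        sum((p - mean_p) ** 2 for p in patch)
        * sum((t - mean_t) ** 2 for t in template)
    )
    return num / den if den else 0.0


def best_match(image, template):
    """Return ((row, col), score) of the window that correlates best with the template."""
    t_rows, t_cols = len(template), len(template[0])
    flat_template = [v for row in template for v in row]
    best_score, best_pos = -2.0, None
    for r in range(len(image) - t_rows + 1):
        for c in range(len(image[0]) - t_cols + 1):
            patch = [image[r + i][c + j] for i in range(t_rows) for j in range(t_cols)]
            score = ncc(patch, flat_template)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score
```

In the setting described above, the template would play the role of the averaged brightness data and the returned position would be the iris candidate.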
9
Template-based facial expression data extraction using static images
- Edwards et al.
  - Use Active Appearance Models (AAMs)
    - Combined model of shape and gray-level appearance
    - A training set of hand-labeled images with landmark points marked at key positions to outline the main features
    - PCA is applied to shape and gray level data separately, then applied again to a vector of concatenated shape and gray level parameters
    - The result is a description in terms of “appearance” parameters
    - 80 appearance parameters sufficient to explain 98% of the variation in the 400 training images labeled with 122 points
  - Given a new face image, they find appearance parameter values that minimize the error between the new image and the synthesized AAM image
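The statement that a fixed number of appearance parameters explains 98% of the variation corresponds to a cumulative-variance cut-off on the PCA eigenvalues. A minimal sketch, assuming the eigenvalues are already available (the function name and the sample values are made up for illustration):

```python
def components_needed(eigenvalues, target=0.98):
    """Smallest number of leading components whose variance share reaches `target`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for k, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return k
    return len(ordered)
```

With the training set described above, the same rule would be what yields the count of 80 parameters, although the actual eigenvalues are not given in the source.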
10
Feature-based facial expression data extraction using static images
- M. Pantic, L. Rothkrantz
  - A point-based face model is used
    - 19 points selected in the frontal-view image, and 10 in the side-view image
  - Face model features are defined as some geometric relationship between facial points or the image intensity in a small region defined relative to facial points (e.g. Feature 17 = Distance KL)
  - Neutral facial expression analyzed first
  - The positions of facial points are determined by using information from feature detectors
  - Multiple feature detectors are used for each facial feature localization and model feature extraction
    - The result obtained from each detector is stored in a separate file
    - The detector output is checked for accuracy
    - After “inaccurate” results are discarded, those that were obtained by the highest priority detector are selected for use in the classification stage
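Two of the ideas above lend themselves to a short sketch: a geometric model feature such as "Feature 17 = Distance KL", and the selection of the highest-priority detector result once inaccurate ones are discarded. The record layout (`priority`, `accurate`, `value`) and the convention that a lower number means higher priority are assumptions made for this example.

```python
import math


def distance_feature(point_a, point_b):
    """Euclidean distance between two facial points given as (x, y)."""
    return math.hypot(point_a[0] - point_b[0], point_a[1] - point_b[1])


def select_by_priority(detector_results):
    """Drop inaccurate results, then return the value from the highest-priority detector.

    Each result is a dict with keys 'priority' (lower = higher priority, an
    assumption), 'accurate' (bool) and 'value'. Returns None if nothing survives.
    """
    accurate = [r for r in detector_results if r["accurate"]]
    if not accurate:
        return None
    return min(accurate, key=lambda r: r["priority"])["value"]
```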
11
Template-based facial expression data extraction using image sequences
- M. Black, Y. Yacoob
  - Do not address the problem of initially locating the various facial features
  - The motion of various face regions is estimated using parameterized optical flow
  - Estimates of deformation and motion parameters (e.g. horizontal and vertical translation, divergence, curl) are derived
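For an affine flow model, the motion parameters named above follow directly from the model coefficients. The sketch below assumes the common parameterization u(x, y) = a0 + a1·x + a2·y and v(x, y) = a3 + a4·x + a5·y; the coefficient ordering and the function name are assumptions, and estimating the coefficients from image data is not shown.

```python
def motion_descriptors(affine_params):
    """Derive translation, divergence, curl and deformation from affine flow coefficients.

    affine_params = (a0, a1, a2, a3, a4, a5) with
        u(x, y) = a0 + a1*x + a2*y   (horizontal flow)
        v(x, y) = a3 + a4*x + a5*y   (vertical flow)
    """
    a0, a1, a2, a3, a4, a5 = affine_params
    return {
        "horizontal_translation": a0,
        "vertical_translation": a3,
        "divergence": a1 + a5,   # du/dx + dv/dy: isotropic expansion
        "curl": a4 - a2,         # dv/dx - du/dy: rotation about the view axis
        "deformation": a1 - a5,  # du/dx - dv/dy: stretching or squashing
    }
```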
12
Feature-based facial expression data extraction using image sequences
- Cohn et al. (the only surveyed method)
  - Feature points in the first frame manually marked with a mouse around facial landmarks
  - A 13x13 flow window is centered around each point
  - Hierarchical optical flow method of Lucas and Kanade used to track feature points in the image sequence
  - Displacement of each point calculated relative to the first frame
  - The displacement of feature points between the initial and peak frames used for classification
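Once the points have been tracked, the classification input described above is just the per-point displacement between the first and the peak frame. A minimal sketch, assuming the tracker output is a list of (x, y) positions per frame; the tracking itself is not shown and the function names are hypothetical.

```python
def point_displacements(first_frame_points, peak_frame_points):
    """Per-point (dx, dy) displacement relative to the first frame."""
    return [
        (x1 - x0, y1 - y0)
        for (x0, y0), (x1, y1) in zip(first_frame_points, peak_frame_points)
    ]


def displacement_vector(first_frame_points, peak_frame_points):
    """Flatten the displacements into one feature vector for a classifier."""
    flat = []
    for dx, dy in point_displacements(first_frame_points, peak_frame_points):
        flat.extend([dx, dy])
    return flat
```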
13
Classification
- Two basic problems:
  - Defining a set of categories/classes
  - Choosing a classification mechanism
- People are not very good at it either
  - In one study, a trained observer could classify only 87% of the faces correctly
- Expressions can be classified in terms of facial actions that cause an expression or “typical” emotions
- Facial muscle activity can be described by a set of codes
  - The codes are called Action Units (AUs). All possible, visually detectable facial changes can be described by a set of 44 AUs. These codes form the basis of the Facial Action Coding System (FACS), which provides a linguistic description for each code.
14
Classification (continued)
- Most of the studies perform an emotion classification and use the following 6 basic categories: happiness, sadness, surprise, fear, anger, and disgust
- No agreement among psychologists whether these are the right categories
- People rarely produce “pure” expressions (e.g. 100% happiness), blends are much more common
15
Template-based classification using static images
- Edwards et al.
  - The Mahalanobis distance measure can be used for classification:
    $d_i = (c - \bar{c}_i)^T C^{-1} (c - \bar{c}_i)$
    where $c$ is the vector of appearance parameters for the new image, $\bar{c}_i$ is the centroid of the multivariate distribution for class $i$, and $C^{-1}$ is the inverse of the within-class covariance matrix for all the training images
  - Classification into 6 basic + neutral categories
  - Correct recognition of 74% reported
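Classification by Mahalanobis distance amounts to assigning the new appearance vector to the class whose centroid is nearest under the within-class covariance. The sketch below assumes the inverse covariance matrix is supplied directly (computing it is out of scope here) and uses the squared distance, which gives the same ranking; the function names and the toy numbers are illustrative only.

```python
def mahalanobis_sq(c, centroid, inv_cov):
    """Squared Mahalanobis distance (c - centroid)^T * inv_cov * (c - centroid)."""
    diff = [a - b for a, b in zip(c, centroid)]
    n = len(diff)
    weighted = [sum(inv_cov[i][j] * diff[j] for j in range(n)) for i in range(n)]
    return sum(diff[i] * weighted[i] for i in range(n))


def classify(c, centroids, inv_cov):
    """Return the label of the class centroid closest to c."""
    return min(centroids, key=lambda label: mahalanobis_sq(c, centroids[label], inv_cov))
```

For instance, with `centroids = {"happiness": [0.0, 0.0], "surprise": [4.0, 0.0]}` and `inv_cov = [[1.0, 0.0], [0.0, 0.25]]`, the vector `[1.0, 2.0]` is assigned to "happiness".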
16
Neural network-based classification using static images
- H. Kobayashi, F. Hara
  - Used 234x50x6 neural network trained off-line using backpropagation
  - The input layer units correspond to intensity values extracted from the input image along the 13 vertical lines
  - The output units correspond to the 6 basic emotion categories
  - Average correct recognition rate 85%
17
Neural network-based classification using static images (Continued)
- Zhang et al.
  - Used 680x7x7 neural network
  - Output units represent six basic emotion categories plus the neutral category
  - Output units give a probability of the analyzed expression belonging to the corresponding emotion category
  - Cross-validation used for testing
- J. Zhao, G. Kearney
  - Used 10x10x3 neural network
  - Neural network trained and tested on the whole set of data with 100% recognition rate
18
Rule-based classification using static images
- M. Pantic, L. Rothkrantz (the only surveyed method)
  - Two-stage classification:
    1. Facial actions (corresponding to one of the Action Units) are deduced from changes in face geometry
       - Action Units are described in terms of face model feature values (e.g. AU 28 = (Both) lips sucked in = feature 17 is 0, where feature 17 = Distance KL)
    2. The stage 1 classification results are used to classify the expression into one of the emotion categories
       - E.g. AU6 + AU12 + AU16 + AU25 => Happiness
  - The two-stage classification process allows “weighted emotion labels”
    - Assumption: each AU that is part of the AU-coded description of a “pure” emotional expression has the same influence on the intensity of that emotional expression
    - E.g. If the analysis of some image results in the activation of AU6, AU12, and AU16, then the expression is classified as 75% happiness
  - The system can distinguish 29 AUs
  - Recognition rate 92% for upper face AUs, and 86% for lower face AUs
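The two stages and the "weighted emotion label" can be sketched as follows. Only the rules quoted on the slide are encoded (AU28 from feature 17, and the happiness description AU6 + AU12 + AU16 + AU25); the function names, the tolerance used to test "feature 17 is 0", and the dictionary layout are assumptions made for this example.

```python
# AU-coded description of "pure" happiness, taken from the slide example.
EMOTION_RULES = {
    "happiness": {"AU6", "AU12", "AU16", "AU25"},
}


def detect_au28(feature_17, tolerance=1e-6):
    """Stage 1 example: AU28 (both lips sucked in) when Distance KL is zero.

    The tolerance is an assumption; the slide only states 'feature 17 is 0'.
    """
    return abs(feature_17) <= tolerance


def weighted_emotion_labels(active_aus, rules=EMOTION_RULES):
    """Stage 2: share of each emotion's AUs that are active, assuming equal AU influence."""
    active = set(active_aus)
    return {
        emotion: len(active & required) / len(required)
        for emotion, required in rules.items()
    }
```

Calling `weighted_emotion_labels({"AU6", "AU12", "AU16"})` gives `{"happiness": 0.75}`, matching the 75% happiness example above.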
19
Template-based classification using image sequences
- Cohn et al.
  - Classification in terms of Action Units
  - Uses Discriminant Function Analysis
    - Deals with each face region separately
    - Used for classification only (i.e. all facial point displacements are used as input)
  - Does not deal with image sequences containing several consecutive facial actions
  - Recognition rate: 92% in the brow region, 88% in the eye region, 83% in the nose and mouth region
20
Rule-based classification using image sequences
- M. Black, Y. Yacoob (the only surveyed method)
  - Mid- and high-level descriptions of facial actions are used
  - The parameter values (e.g. translation, divergence) derived from optical flow are thresholded
    - E.g. Div > 0.02 => expansion; Div below a negative threshold => contraction. This is what the authors would call a mid-level predicate for the mouth.
  - High-level predicates are rules for classifying facial expressions
    - Rules for detecting the beginning and the end of an expression
    - Use the results of applying mid-level rules as input
    - E.g. Beginning of surprise = Raising brows and vertical expansion of mouth, End of Surprise = Lowering brows and vertical contraction of mouth
  - The rules used for classification are not designed to deal with blends of emotional expressions (Anger + Fear recognized as disgust)
  - Recognition rate: 88%
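The mid-level and high-level predicates can be illustrated with a small sketch. The expansion threshold of 0.02 comes from the slide; the contraction threshold is not legible in the source, so the default of -0.02 below is purely an assumption, as are the function names and the string labels.

```python
def mouth_predicate(div, expand_above=0.02, contract_below=-0.02):
    """Mid-level predicate for the mouth from the divergence parameter.

    `contract_below` is an assumed value; only the 0.02 expansion threshold
    is stated in the source.
    """
    if div > expand_above:
        return "expansion"
    if div < contract_below:
        return "contraction"
    return "none"


def surprise_event(brow_motion, mouth_motion):
    """High-level rule for surprise, built on mid-level predicate outputs."""
    if brow_motion == "raising" and mouth_motion == "expansion":
        return "begin surprise"
    if brow_motion == "lowering" and mouth_motion == "contraction":
        return "end surprise"
    return None
```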
21
Conclusions and Possible Directions for Future Research
- Active research area
- Most surveyed systems rely on the frontal view of the face and assume no facial hair or glasses
- None of the surveyed systems can distinguish all 44 AUs defined in FACS
- Classification into basic emotion categories in most surveyed studies
- Some reported results are of little practical value
- The ability of the human visual system to “fill in” missing parts of the observed face (i.e. deal with partial occlusions) has not been investigated
22
Conclusions and Possible Directions for Future Research (Continued)
- Not clear at all whether the 6 “basic” emotion categories are universal
- Each person has his/her own range of expression intensity, so systems that start with a generic classification and then adapt may be of interest
- Assignment of a higher priority to upper face features by the human visual system (when interpreting facial expressions) has not been the subject of much research
- Hard or impossible to compare reported results objectively without a well-defined, commonly used database of face images
23
References
- M. Pantic, L. Rothkrantz, “Automatic Analysis of Facial Expressions: The State of the Art”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 12, December 2000
- M. Pantic, L. Rothkrantz, “Expert System for Automatic Analysis of Facial Expressions”, Image and Vision Computing, Vol. 18, No. 11, pp. 881-905, 2000
- M. J. Black, Y. Yacoob, “Recognizing Facial Expressions in Image Sequences Using Local Parameterized Models of Image Motion”, Int’l J. Computer Vision, Vol. 25, No. 1, pp. 23-48, 1997
- J. F. Cohn, A. J. Zlochower, J. J. Lien, T. Kanade, “Feature-Point Tracking by Optical Flow Discriminates Subtle Differences in Facial Expression”, Proc. Int’l Conf. Automatic Face and Gesture Recognition, pp. 396-401, 1998
- G. J. Edwards, T. F. Cootes, C. J. Taylor, “Face Recognition Using Active Appearance Models”, Proc. European Conf. Computer Vision, Vol. 2, pp. 581-595, 1998
- G. J. Edwards, T. F. Cootes, C. J. Taylor, “Active Appearance Models”, Proc. European Conf. Computer Vision, Vol. 2, pp. 484-498, 1998
- H. Kobayashi, F. Hara, “Facial Interaction between Animated 3D Face Robot and Human Beings”, Proc. Int’l Conf. Systems, Man, Cybernetics, pp. 3732-3737, 1997
24
Some YouTube Videos
- Real-time facial expression recognition
- Take 2
- Facial expression recognition
- Facial expression mirroring
- Facial expression animation