Slide 1: FaceTrack: Tracking and Summarizing Faces from Compressed Video
Hualu Wang, Harold S. Stone*, Shih-Fu Chang
Dept. of Electrical Engineering, Columbia University
*NEC Research Institute
Presentation by Andy Rova, School of Computing Science, Simon Fraser University
March 15, 2005, Andy Rova, SFU CMPT 820

Slide 2: Introduction
- FaceTrack is a system for both tracking and summarizing faces in compressed video data.
- Tracking: detect faces and trace them through time within video shots.
- Summarizing: cluster the faces across video shots and associate them with different people.
- Compressed video: avoids the costly overhead of decoding prior to face detection.
Slide 3: System Overview
- The FaceTrack system's goals are related to ideas discussed in previous presentations.
- A face-based video summary can help users decide whether they want to download the whole video.
- The summary provides good visual indexing information for a database search engine.
Slide 4: Problem Definition
- The goal of the FaceTrack system is to take an input video sequence, generate a list of the prominent faces that appear in it, and determine the time periods during which each face appears.
Slide 5: General Approach
- Track faces within shots; once tracking is done, group faces across video shots into faces of different people.
- Output a list of faces for each sequence; for each face, list the shots where it appears, and when.
- Face recognition is not performed: it is very difficult in unconstrained video due to the broad range of face sizes, numbers, orientations, and lighting conditions.
Slide 6: General Approach
- Work in the compressed domain as much as possible.
- MPEG-1 and MPEG-2 video, as used in applications such as digital TV and DVD.
- Macroblocks and motion vectors can be used directly in tracking, giving greater computational speed than decoding.
- Select frames can always be decoded down to the pixel level for further analysis, e.g. grouping faces across shots.
Slide 7: MPEG Review
- Three types of frame data: intra-frames (I-frames), forward predictive frames (P-frames), and bidirectional predictive frames (B-frames).
- Macroblocks are coding units that combine pixel information via the DCT; luminance and chrominance are separated.
- P-frames and B-frames undergo motion compensation: motion vectors are found and their differences are encoded.
Slide 8: System Diagram
- (Figure: overall FaceTrack system diagram.)
Slide 9: Face Tracking Challenges
- Locations of detected faces may be inaccurate, since the face detection algorithm works on 16x16 macroblocks.
- False alarms and misses.
- Multiple faces cause ambiguities when they move close to each other.
- The motion approximated by the MPEG motion vectors may be inaccurate.
- A tracking framework that can handle these issues in the compressed domain is needed.
Slide 10: The Kalman Filter
- A linear, discrete-time dynamic system is defined by a pair of difference equations (shown on the slide as images): a state-transition equation and an observation equation.
- We only have access to a sequence of measurements.
- Given this noisy observation data, the problem is to find the optimal estimate of the unknown system state variables.
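The equations themselves survive only as images on the slide; in the standard textbook form (an assumption about what the slide showed), the system is:

```latex
\mathbf{x}_{k+1} = \boldsymbol{\Phi}_k \mathbf{x}_k + \mathbf{w}_k
    \qquad \text{(state transition, with process noise } \mathbf{w}_k\text{)}
\mathbf{z}_k = \mathbf{H}_k \mathbf{x}_k + \mathbf{v}_k
    \qquad \text{(observation, with measurement noise } \mathbf{v}_k\text{)}
```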
Slide 11: The Kalman Filter
- The "filter" is actually an iterative algorithm that keeps taking in new observations; the new states are successively estimated.
- The error in the predicted observation is called the innovation.
- The innovation is amplified by a gain matrix and used as a correction to the state prediction; the corrected prediction is the new state estimate.
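In the usual notation (a reconstruction, since the slide's equations are lost), one iteration of the predict/correct cycle reads:

```latex
\hat{\mathbf{x}}_{k|k-1} = \boldsymbol{\Phi}\,\hat{\mathbf{x}}_{k-1|k-1}
    \qquad \text{(prediction)}
\tilde{\mathbf{y}}_k = \mathbf{z}_k - \mathbf{H}\,\hat{\mathbf{x}}_{k|k-1}
    \qquad \text{(innovation)}
\hat{\mathbf{x}}_{k|k} = \hat{\mathbf{x}}_{k|k-1} + \mathbf{K}_k\,\tilde{\mathbf{y}}_k
    \qquad \text{(correction, with gain matrix } \mathbf{K}_k\text{)}
```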
Slide 12: The Kalman Filter
- In the FaceTrack system, the state vector of the Kalman filter is the kinematic information of the face: position, velocity (and sometimes acceleration).
- The observation vector is the position of the detected face, which may be inaccurate.
- The Kalman filter lets the system predict and update the position and parameters of the faces.
Slide 13: The Kalman Filter
- The FaceTrack system uses a 0.1 second time interval for state updates.
- This corresponds to every I-frame and P-frame in a typical MPEG GOP ("Group Of Pictures") frame structure, for example IBBPBBP...
Slide 14: The Kalman Filter
- For I-frames, the face detector results are used directly.
- For P-frames, the face detector results are more prone to false alarms; instead, P-frame face locations are predicted (approximately) from the MPEG motion vectors.
- These locations are then fed into the Kalman filter as observations (in contrast with previous trackers, which assumed that the motion-vector-derived locations were correct on their own).
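The P-frame step above can be sketched as follows; the data layout, function name, and the sign convention for the motion vectors are illustrative assumptions, not the paper's implementation:

```python
# Shift each tracked face box by the average MPEG motion vector of the
# 16x16 macroblocks it covers, and use the shifted position as the
# Kalman observation for that P-frame.

def predict_face_box(box, motion_vectors, mb_size=16):
    """box = (x, y, w, h) in pixels; motion_vectors[(col, row)] = (dx, dy)
    per macroblock, as decoded from a P-frame's bitstream (assumed layout)."""
    x, y, w, h = box
    cols = range(x // mb_size, (x + w - 1) // mb_size + 1)
    rows = range(y // mb_size, (y + h - 1) // mb_size + 1)
    vecs = [motion_vectors.get((c, r), (0, 0)) for c in cols for r in rows]
    dx = sum(v[0] for v in vecs) / len(vecs)   # average x displacement
    dy = sum(v[1] for v in vecs) / len(vecs)   # average y displacement
    return (round(x + dx), round(y + dy), w, h)

# A face box whose macroblocks all moved by roughly (+8, -4) pixels:
mvs = {(c, r): (8, -4) for c in range(10) for r in range(10)}
print(predict_face_box((32, 48, 32, 32), mvs))   # box shifts by (8, -4)
```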
Slide 15: The Face Tracking Framework
- How can new faces be discriminated from previous ones during tracking?
- The Mahalanobis distance is a quantitative indicator of how close a new observation is to a prediction.
- This helps separate new faces from existing tracks: if the Mahalanobis distance is greater than a certain threshold, the newly detected face is unlikely to belong to that existing track.
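The gating test can be sketched as below; the chi-square threshold and the covariance values are assumptions for illustration, since the slides do not give the actual threshold:

```python
import numpy as np

# A detection is associated with an existing track only if its Mahalanobis
# distance to the track's predicted position falls below a threshold.

def mahalanobis2(z, z_pred, S):
    """Squared Mahalanobis distance of measurement z from prediction z_pred,
    where S is the innovation covariance from the track's Kalman filter."""
    d = np.asarray(z, float) - np.asarray(z_pred, float)
    return float(d @ np.linalg.inv(S) @ d)

GATE = 9.21                       # chi-square 99% threshold, 2 degrees of freedom
S = np.diag([4.0, 4.0])           # assumed innovation covariance (pixels^2)

near = mahalanobis2([102, 99], [100, 100], S)    # detection close to prediction
far = mahalanobis2([130, 100], [100, 100], S)    # detection far from prediction
print(near < GATE, far < GATE)                   # near passes the gate, far fails
```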
Slide 16: The Face Tracking Framework
- When two faces move close together, the Mahalanobis distance alone cannot keep track of multiple faces.
- Face missed or occluded: hypothesize the continuation of the face track.
- False alarm, or faces close together: hypothesize the creation of a new track.
- The idea is to wait for new observation data before making the final decision about a track.
Slide 17: Intra-shot Tracking Challenges
- Multiple hypothesis method (illustrated in a figure).
Slide 18: Kalman Motion Models
- The Kalman filter is a framework that can model different types of motion, depending on the system matrices used.
- Several models were tested for the paper, with varying results.
- Intuition: who pays for object-tracking research? The military! Hence many tracking models are based on trajectories unlike those that faces in video are likely to exhibit.
- For example, in most commercial video, a human face will not maneuver like a jet or missile.
Slide 19: Kalman Motion Models
- Four motion models were tested for FaceTrack: Constant Velocity (CV), Constant Acceleration (CA), Correlated Acceleration (AA), and the Variable Dimension Filter (VDF).
- The testing was done against ground truth consisting of manually identified face centers in each frame.
Slide 20: Kalman Motion Models
- Rather than go through the whole process in exact detail, the next several slides illustrate the differences between the CV and CA models.
- The matrices are expanded to show how the states are updated.
Slides 21-23: Constant Velocity (CV) Model
- (Equation slides: the CV state-update matrices are expanded and then simplified step by step; the equations themselves appear only as images.)
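The slides' matrices are lost, but the standard per-axis CV formulation (an assumption about what was shown) is:

```latex
\mathbf{x}_k = \begin{bmatrix} x_k \\ \dot{x}_k \end{bmatrix}, \qquad
\mathbf{x}_{k+1} = \begin{bmatrix} 1 & T \\ 0 & 1 \end{bmatrix}\mathbf{x}_k + \mathbf{w}_k
\;\Longrightarrow\;
\begin{aligned}
x_{k+1}       &= x_k + T\,\dot{x}_k + w_k^{(x)} \\
\dot{x}_{k+1} &= \dot{x}_k + w_k^{(\dot{x})}
\end{aligned}
```

Velocity changes only through the process noise, which is what makes the model "constant velocity".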
Slides 24-25: Constant Acceleration (CA) Model
- Acceleration is now added to the state vector, and is explicitly modeled as constants disturbed by random noise.
- (The CA matrices are expanded and then simplified step by step; the equations themselves appear only as images.)
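Again reconstructing the lost equations in the standard per-axis form (an assumption about what the slides showed):

```latex
\mathbf{x}_k = \begin{bmatrix} x_k \\ \dot{x}_k \\ \ddot{x}_k \end{bmatrix}, \qquad
\boldsymbol{\Phi} = \begin{bmatrix} 1 & T & T^2/2 \\ 0 & 1 & T \\ 0 & 0 & 1 \end{bmatrix}
\;\Longrightarrow\;
\begin{aligned}
x_{k+1}        &= x_k + T\,\dot{x}_k + \tfrac{T^2}{2}\,\ddot{x}_k + w_k \\
\dot{x}_{k+1}  &= \dot{x}_k + T\,\ddot{x}_k + w_k' \\
\ddot{x}_{k+1} &= \ddot{x}_k + w_k''
\end{aligned}
```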
Slide 26: The Correlated Acceleration Model
- Replaces the constant accelerations with an AR(1) (first-order autoregressive) model: a stochastic process in which the immediately previous value affects the current value, plus some random noise.
- Why? There is a strong negative autocorrelation between the accelerations of consecutive frames.
- Positive accelerations tend to be followed by negative accelerations, implying that faces tend to "stabilize".
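An AR(1) acceleration model (a generic form; the slides do not give the coefficient) replaces the CA model's constant-acceleration row with:

```latex
\ddot{x}_{k+1} = \rho\,\ddot{x}_k + w_k, \qquad -1 < \rho < 0
```

A negative $\rho$ captures the observed negative autocorrelation: an acceleration in one direction tends to be followed by one in the opposite direction.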
Slide 27: The Variable Dimension Filter
- A system that switches between CV (constant velocity) and CA (constant acceleration) modes.
- The dimension of the state vector changes when a maneuver is detected, hence "VDF".
- Developed for tracking highly maneuverable targets (probably military jets).
Slide 28: Comparison of Motion Models
- (Chart: average tracking error for each motion model over the first 16 tracking runs.)
Slide 29: Comparison of Motion Models
- Why does CV perform best? The small sampling interval justifies viewing face motion as piecewise-linear movement, and a face cannot achieve very high accelerations (as opposed to a jet fighter).
- AA also performs well because it fits the nature of face motion: faces in commercial video exhibit few persistent accelerations (negative autocorrelation).
Slide 30: Summarization Across Shots
- Select representative frames for tracked faces; large, frontal-view faces are best.
- Decode the representative frames into the pixel domain.
- Use clustering algorithms to group the faces into different persons.
- Make use of domain knowledge: for example, people do not usually change clothes within a news segment, but often do change outfits within a sitcom episode.
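The grouping step might be sketched as a simple greedy clustering of representative face feature vectors; the feature representation, distance metric, and threshold are all assumptions here, since the slides leave the clustering algorithm unspecified:

```python
import numpy as np

# Assign each face to the nearest existing cluster if it is within a distance
# threshold of the cluster centroid, otherwise start a new cluster (person).

def group_faces(features, threshold=1.0):
    clusters = []                      # list of (centroid, [member indices])
    for i, f in enumerate(features):
        f = np.asarray(f, float)
        if clusters:
            dists = [np.linalg.norm(f - c) for c, _ in clusters]
            j = int(np.argmin(dists))
            if dists[j] < threshold:
                c, members = clusters[j]
                members.append(i)
                n = len(members)
                clusters[j] = (c + (f - c) / n, members)   # running centroid
                continue
        clusters.append((f, [i]))
    return [members for _, members in clusters]

# Two people: faces 0, 1, 3 have similar features, as do faces 2 and 4.
feats = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.2], [5.1, 4.9]]
print(group_faces(feats))
```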
Slide 31: Simulation Results
- (Result figures.)
Slide 32: Conclusions & Future Research
- FaceTrack is an effective face tracking (and summarization) architecture, within which different detection and tracking methods can be used.
- It could be updated to use new face detection algorithms or improved motion models.
- Based on the results, the CV and AA motion models are sufficient for face motion in commercial video.
- Summarization techniques need the most development, followed by optimizing tracking for adverse situations.