Major Cast Detection in Video Using Both Speaker and Face Information Zhu Liu, Senior Member, IEEE, and Yao Wang, Fellow, IEEE IEEE Transactions On Multimedia VOL.9, NO.1, January 2007
Outline Introduction Major Cast Detection System Integration Clean Speech Extraction Speaker Segmentation and Clustering Face Detection in Still Image Face Tracking and Clustering Integration Summary and Conclusion
Introduction (1/4) Providing efficient content description has become very important. huge amount of video data browse and retrieve data of interest Major casts and their occurrences provide good indices for organizing video content. Anchors or Reporters in news programs Principal characters in movies
Introduction (2/4) Most of the previous works focused on utilizing single modality. audio or visual information alone New approach for generating the list of major casts in a video. based on both audio and visual information should analyze all available media
Information Integration Introduction (3/4) Speaker Information (Audio Track) Face Information (Visual Track) Information Integration correspondence Major Cast Detection Major Cast Finder Browsing System
Introduction (4/4) Our goal is similar to Name-it. Name-it Name-it: Name and Face This Paper: Sound and Face Name-it For news video… Faces detected from video frames Names extracted from closed captions [24] “Name-it: Naming and detecting faces in news videos”, S. Satoh, Y. Nakamura, T. Kanade, IEEE Multimedia Mag., 1999.
System Overview Each cast is characterized by two intrinsic attributes. Face and Speech Using two-level hierarchical procedure. Level 1: Audio and Visual information is utilized independently. Level 2: Determine the major casts by associating faces and speakers belonging to the same cast.
Major Cast Detection Algorithm In Level 1, Audio and Visual information is utilized independently. In Level 2, cues from different modalities are combined.
Clean Speech Extraction Audio Feature Description Silence Gap longer than 300 ms 2 sec 32 ms 16 ms
Clean Speech Extraction Audio Feature Description 2D Projection of the audio features: clean speech versus others 14 features are extracted for each audio clip! using Karhunen-Loeve transform
Clean Speech Extraction Clean Speech Classification Two types of classifiers for identifying clean speech from other types of audio data. GMM Classifier SVM Classifier The GMM classifier consists of a set of weighted Gaussians.
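For reference, the standard form of a Gaussian mixture density is written out below. The symbols (K components, weights w_k, means and covariances) are generic notation assumed here and may differ from the notation on the original slide.

```latex
p(\mathbf{x}) = \sum_{k=1}^{K} w_k \, \mathcal{N}\!\left(\mathbf{x};\, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right),
\qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1
```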
Clean Speech Extraction Clean Speech Classification Support Vector Machine (SVM) Classifier a learning algorithm based on statistical learning theory We experimented with three types of kernel functions. Dot Product Polynomial Radial Basis Function d: order of polynomial kernel, γ: parameter of RBF
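The three kernel families named above can be sketched in a few lines. The exact kernel expressions did not survive on the slide, so the forms below are common textbook versions and should be read as assumptions: the polynomial kernel is taken as (x·y + 1)^d, and the RBF parameter is called `gamma` here, with a default of 0.5 only to mirror the coefficient quoted in the SVM results.

```python
import math


def dot_kernel(x, y):
    """Plain dot-product (linear) kernel."""
    return sum(a * b for a, b in zip(x, y))


def poly_kernel(x, y, d=2):
    """Polynomial kernel of order d; the (x.y + 1)^d form is an assumption."""
    return (dot_kernel(x, y) + 1.0) ** d


def rbf_kernel(x, y, gamma=0.5):
    """Radial basis function kernel exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

In use, each function takes two feature vectors of equal length, for example two 14-dimensional audio clip features.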
Clean Speech Extraction Simulation Results The experimental data consists of eight half-hour news broadcast videos. NBC Nightly News in year 2000 Four videos used for training the models. Audio Track is sampled at 16 kHz. with resolution of 16 bits per sample Visual Track is digitized at 10 frames per second. Size: 240x180
Clean Speech Extraction Simulation Results Error Rates of Clean Speech Classifier Using GMM Unit: % Worse performance with more mixtures may be due to the limited size of the training data!
Clean Speech Extraction Simulation Results Error Rates of Clean Speech Classifier Using SVM Unit: % coefficient of RBF is set to 0.5
Speaker Segmentation and Clustering Speaker Segmentation We segment speakers at the frame level. Speaker segmentation schemes Feature Computation Splitting Merging Use Mel-frequency cepstral coefficients. 13 MFCCs and their temporal delta values for each frame Compute the divergence between N previous frames and N future frames.
Speaker Segmentation and Clustering Speaker Segmentation N frames N frames local minimum! Compare the difference of the two audio blocks.
Speaker Segmentation and Clustering Speaker Segmentation Assumptions: the 26 features are independent. Each feature follows a Gaussian distribution. The divergence between B1 and B2 is simplified as: If the distance is higher than a certain threshold, and is a local maximum in surrounding range, it is a candidate speaker boundary.
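A minimal sketch of the splitting step follows. The slide's divergence formula was lost, so this uses the symmetric Kullback-Leibler divergence between two diagonal Gaussians, which is what the independence and Gaussian assumptions lead to; treat that choice as an assumption. The function names, the `radius` parameter for the "surrounding range", and the variance floor are all hypothetical.

```python
def block_stats(frames):
    """Per-feature mean and variance of a block of feature vectors."""
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[i] for f in frames) / n for i in range(dim)]
    # Floor the variance so constant features do not cause division by zero.
    variances = [
        max(sum((f[i] - means[i]) ** 2 for f in frames) / n, 1e-8)
        for i in range(dim)
    ]
    return means, variances


def divergence(b1, b2):
    """Symmetric KL divergence between diagonal Gaussians fitted to b1 and b2."""
    m1, v1 = block_stats(b1)
    m2, v2 = block_stats(b2)
    total = 0.0
    for mu1, s1, mu2, s2 in zip(m1, v1, m2, v2):
        total += 0.5 * (s1 / s2 + s2 / s1 - 2.0)
        total += 0.5 * (mu1 - mu2) ** 2 * (1.0 / s1 + 1.0 / s2)
    return total


def candidate_boundaries(frames, n, threshold, radius):
    """Frames whose divergence exceeds the threshold and is a local maximum."""
    dist = {}
    for t in range(n, len(frames) - n + 1):
        dist[t] = divergence(frames[t - n:t], frames[t:t + n])
    result = []
    for t, d in dist.items():
        neighbors = [dist[u] for u in range(t - radius, t + radius + 1) if u in dist]
        if d > threshold and d >= max(neighbors):
            result.append(t)
    return result
```

With 13 MFCCs plus deltas, each frame would be a 26-element list and `n` would be the block length in frames.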
Speaker Segmentation and Clustering Speaker Segment Clustering Build a GMM for each speaker segment. Compute the distance between GMMs as the difference of two speaker segments. Suppose: A(x) is a mixture of N elements Gaussians B(x) is a mixture of K elements Gaussians
Speaker Segmentation and Clustering Speaker Segment Clustering Distance between two Gaussian Mixture Models subject to , , [12] “A new distance measure for probability distribution function of mixture type”, Z. Liu, Y. Wang and T. Chen
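The distance formula itself did not survive on this slide. One plausible form for a mixture distance with three constraints is a transportation-style linear program, sketched below. The component weights \mu_i and \nu_k, the component densities a_i and b_k, the element distance d(a_i, b_k), and the flow variables w_{ik} are all notation assumed for this sketch; see [12] for the authoritative definition.

```latex
A(x) = \sum_{i=1}^{N} \mu_i \, a_i(x), \qquad B(x) = \sum_{k=1}^{K} \nu_k \, b_k(x)

D(A, B) = \min_{w_{ik}} \sum_{i=1}^{N} \sum_{k=1}^{K} w_{ik} \, d(a_i, b_k)

\text{subject to} \quad w_{ik} \ge 0, \qquad \sum_{k=1}^{K} w_{ik} = \mu_i, \qquad \sum_{i=1}^{N} w_{ik} = \nu_k
```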
Speaker Segmentation and Clustering Speaker Segment Clustering
Speaker Segmentation and Clustering Simulation Results To evaluate the performance, we manually annotated the speaker boundaries and labeled the speaker identities. Dominant Speaker: the speaker whose speech lasts longest Set the block length to 188 frames. equal to 3 sec A detected boundary within 2 sec of a real boundary is counted as correct. Otherwise, it is a false detection.
Speaker Segmentation and Clustering Simulation Results Speaker Segmentation Results relatively high! the falsely separated segments may be grouped together in the next clustering step!
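The scoring rule above can be sketched as follows. The slide gives only the 2-second tolerance; matching each real boundary to at most one detection is an assumption made here, and `score_boundaries` is a hypothetical name.

```python
def score_boundaries(detected, truth, tolerance=2.0):
    """Count correct, false, and missed boundaries (times in seconds)."""
    matched = set()
    false_detections = 0
    for d in detected:
        hit = None
        for i, t in enumerate(truth):
            # Assumption: each real boundary can be claimed only once.
            if i not in matched and abs(d - t) <= tolerance:
                hit = i
                break
        if hit is None:
            false_detections += 1
        else:
            matched.add(hit)
    missed = len(truth) - len(matched)
    return len(matched), false_detections, missed
```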
Speaker Segmentation and Clustering Simulation Results Speaker Clustering Results Not serious! split speaker segments (3-5 s) vs. average duration (>20 s)
Face Detection in Still Image Basic Procedure Illustration of Template Matching Impose a constraint for the warping functions! F(m, n) and T(i, j) are the intensity values of corresponding pixels.
Face Detection in Still Image Basic Procedure Example of Row Mapping Function
Face Detection in Still Image Face Detection in Multiple Resolutions We apply the basic procedure over multiple resolutions. to find faces of various sizes Two successive resolutions should differ in size by a factor of two. since the basic procedure can handle faces from the same size to twice the size of the template face Among all detected faces, we eliminate those that overlap with other faces whose matching values are higher.
Face Detection in Still Image Generation of Average Face Template The face template should capture as many of the common features of human faces as possible. but not be vulnerable to the background and individual characteristics Use a rectangle that encloses the eyebrows and the upper lip as face template. Face Template Region
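The elimination step can be sketched as a greedy pass over detections sorted by matching value. This is one simple reading of the rule, not necessarily the authors' procedure. Boxes are assumed to be `(x, y, width, height)` tuples, and a higher matching value is assumed to mean a better match.

```python
def overlaps(a, b):
    """True if two (x, y, w, h) rectangles intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def eliminate_overlaps(detections):
    """Keep a detection only if no already-kept, higher-valued one overlaps it.

    detections: list of (box, matching_value) pairs gathered from all resolutions.
    """
    kept = []
    for box, mv in sorted(detections, key=lambda d: d[1], reverse=True):
        if not any(overlaps(box, other) for other, _ in kept):
            kept.append((box, mv))
    return kept
```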
Face Detection in Still Image Generation of Average Face Template Size of Face Template: 24x24 The training data is from the AR face database. Purdue University [17] We choose 70 faces as training data. neutral expression Without eyeglasses [17] The AR Face Database [Online] http://rvl1.ecn.purdue.edu/~aleix/aleix_face_DB.html
Face Detection in Still Image Generation of Average Face Template Face Template Region Face Template Partial Training Faces from AR Database
Face Detection in Still Image Improvement of the Performance Use skin-tone to reduce the search area. Based on a skin-tone model in Hue and Saturation space. We can obtain the candidate face regions. Search in a hierarchical way. not pixel by pixel Partition the image into a coarse mesh with patch size of S0 x S0 pixels. Pick the nodes with higher matching values (MVs), and partition the image into a finer mesh around them.
Face Detection in Still Image Simulation Results Face Detection Results of Still Images
Face Detection in Still Image Simulation Results It takes 0.3 sec to detect faces in a still image. Pentium 4 2.8-GHz machine Image Size: 180x240 Test the algorithm on 100 front view images from the AR database. different from the faces used to generate the average face template Among 100 images, 91 faces are successfully detected with no false alarms. All the missed faces have eye glasses.
Face Tracking and Clustering Face Tracking in Video Segment the video sequence into shots, and track faces in each shot independently. A real shot cut typically leads to continuous large distances that last K frames. Through experiments, K=6 gives reliable results. for video digitized at 10 frames/sec Video Shot Segmentation:
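The K-frame persistence rule can be sketched as below. The slide does not say how the frame distance is computed, so `distances` is assumed to be a precomputed list with one value per frame, and `threshold` is a hypothetical parameter. Only the default K=6 comes from the slide.

```python
def detect_shot_cuts(distances, threshold, k=6):
    """Return the start index of every run of at least k large distances."""
    cuts = []
    run_start = 0
    run_len = 0
    for i, d in enumerate(distances):
        if d > threshold:
            if run_len == 0:
                run_start = i
            run_len += 1
            # Report the cut once, at the moment the run reaches k frames.
            if run_len == k:
                cuts.append(run_start)
        else:
            run_len = 0
    return cuts
```

Short bursts of large distance, such as a camera flash lasting a frame or two, never reach length K and are therefore ignored.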
Face Tracking and Clustering Face Tracking in Video Face Tracking Within Each Shot: Stage 1 Detect frontal faces in all frames and expand face tracks in surrounding frames. If a detected face overlaps in spatial location with the face in the previous frame, they belong to the same track. Average face template is used to detect faces. Stage 2 Use detected face as new face template. Search faces in neighboring frames bidirectionally.
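The Stage 1 linking rule (same track if the face overlaps the face in the previous frame) can be sketched as follows. Boxes are assumed to be `(x, y, width, height)` tuples, and the helper names are hypothetical. Stage 2, which re-searches neighboring frames with the detected face as a template, is not covered.

```python
def overlaps(a, b):
    """True if two (x, y, w, h) rectangles intersect."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def link_tracks(frame_detections):
    """Group per-frame face boxes of one shot into tracks.

    frame_detections: one list of boxes per frame.
    Returns a list of tracks, each a list of (frame_index, box).
    """
    tracks = []
    for t, boxes in enumerate(frame_detections):
        for box in boxes:
            for track in tracks:
                last_t, last_box = track[-1]
                # Extend a track only from the immediately preceding frame.
                if last_t == t - 1 and overlaps(box, last_box):
                    track.append((t, box))
                    break
            else:
                tracks.append([(t, box)])
    return tracks
```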
Face Tracking and Clustering Face Tracking in Video Two cases for face track expansion Note: if there is no skin-color occurrence in the first frame of a shot, we simply skip the whole shot!
Face Tracking and Clustering Face Track Clustering Group the trajectories of the same face in different shots. The similarity is measured by comparing the representative faces of the two tracks. the faces detected with the maximum MVs Set up a similarity matrix. by computing the MV between every two face tracks Use the same clustering algorithm described in the audio section.
Face Tracking and Clustering Simulation Results Visual Shot Segmentation Results False Detection Rate and Missing Rate are around 1 % Face Tracking within a shot by Face Tracking Algorithm!
Face Tracking and Clustering Simulation Results Face Tracking Results due to lighting effect, eye glass reflection, …etc
Face Tracking and Clustering Simulation Results Face Track Clustering Results manually counted! after clustering…
Integrated System Speaker Face Correlation Matrix Suppose M speakers: N faces: Assume: Speaker Sm has Lm discontinuous segments Two attributes: starting time (ST) and ending time (ET). Face Fn has lm discontinuous segments Three attributes: ST, ET and face size (FS). representative face
Integrated System Speaker Face Correlation Matrix FS: face size OL: overlapping duration Illustration of speaker-face correlation
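A minimal sketch of how such a matrix could be filled is given below. The slide names only the two ingredients, OL and FS, and the formula itself was lost. Summing OL × FS over all pairs of speaker and face segments is therefore an assumption, as are the function names and the tuple layouts.

```python
def overlap_duration(st1, et1, st2, et2):
    """Length of the time overlap between two [ST, ET] intervals."""
    return max(0.0, min(et1, et2) - max(st1, st2))


def correlation_matrix(speakers, faces):
    """Build an M x N speaker-face correlation matrix.

    speakers: M lists of (ST, ET) speech segments.
    faces:    N lists of (ST, ET, FS) face segments.
    """
    matrix = []
    for speech_segments in speakers:
        row = []
        for face_segments in faces:
            c = 0.0
            for s_st, s_et in speech_segments:
                for f_st, f_et, fs in face_segments:
                    # Assumed weighting: overlap duration times face size.
                    c += overlap_duration(s_st, s_et, f_st, f_et) * fs
            row.append(c)
        matrix.append(row)
    return matrix
```

Under this weighting, a large face that stays on screen while a speaker talks contributes far more than a small face that appears briefly, which is how the score can reflect both temporal and spatial importance.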
Integrated System Major Cast Generation Association of faces to speakers entirely depends on the correlation matrix. reflects both the temporal and spatial importance of a major cast. The algorithm produces a list of major casts with corresponding values. used as temporal-spatial importance score Suppose: M different speakers, N different faces M x N speaker-face-correlation-matrix,
Integrated System Major Cast Generation The algorithm:
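The algorithm listing did not survive on this slide. The sketch below shows one simple way to derive a cast list from the correlation matrix alone: repeatedly pick the largest remaining entry, pair that speaker with that face, and retire both. This greedy scheme and the `min_score` stopping parameter are assumptions for illustration, not a reconstruction of the authors' algorithm.

```python
def major_casts(corr, min_score=0.0):
    """Greedily pair speakers with faces by descending correlation value.

    corr: M x N matrix (list of rows) of speaker-face correlation values.
    Returns (speaker_index, face_index, score) tuples, most important first.
    """
    free_speakers = set(range(len(corr)))
    free_faces = set(range(len(corr[0]))) if corr else set()
    casts = []
    while free_speakers and free_faces:
        score, s, f = max(
            (corr[i][j], i, j) for i in free_speakers for j in free_faces
        )
        if score <= min_score:
            break  # remaining pairs are too weakly correlated
        casts.append((s, f, score))
        free_speakers.remove(s)
        free_faces.remove(f)
    return casts
```

The score attached to each pair can then serve as the temporal-spatial importance value used to order the list.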
Integrated System Major Cast Presentation System Major Cast Based Video Presentation System occurrences of speech importance score video List of Major Casts
Integrated System Simulation Results Using four test video sequences. Detect 8, 9, 6, 8 major casts, respectively. The most important ones are the anchor persons. consistent in four test sequences Followed by different reporters and interviewees.
Summary and Conclusion (1/2) This paper proposes a video browsing system using both audio and visual information. based on the major casts appearing in the video Independent components in the major cast detection framework. Each can be improved separately.
Summary and Conclusion (2/2) Future research includes: Improve the accuracy of speaker-face association by detecting talking faces. Utilize the text stream. Apply different speaker face association methods according to the category of the video data.