Major Cast Detection in Video Using Both Speaker and Face Information
Zhu Liu, Senior Member, IEEE, and Yao Wang, Fellow, IEEE
IEEE Transactions on Multimedia, Vol. 9, No. 1, January 2007
Outline
Introduction
Major Cast Detection System
Clean Speech Extraction
Speaker Segmentation and Clustering
Face Detection in Still Image
Face Tracking and Clustering
Integration
Summary and Conclusion
Introduction (1/4)
Providing efficient content description has become very important: there is a huge amount of video data, and users need to browse and retrieve the data of interest. Major casts and their occurrences provide good indices for organizing video content, e.g., anchors and reporters in news programs, or principal characters in movies.
Introduction (2/4)
Most previous work focused on utilizing a single modality: audio or visual information alone. This paper presents a new approach for generating the list of major casts in a video, based on both audio and visual information; a complete system should analyze all available media.
Introduction (3/4)
[Block diagram: speaker information from the audio track and face information from the visual track are integrated through their correspondence; the major cast finder performs major cast detection and feeds a browsing system.]
Introduction (4/4)
Our goal is similar to Name-it, which works on news video by associating names and faces: faces are detected from video frames, and names are extracted from closed captions [24]. This paper instead associates sounds (speakers) and faces.
[24] S. Satoh, Y. Nakamura, and T. Kanade, "Name-it: Naming and detecting faces in news videos," IEEE Multimedia Mag., 1999.
System Overview
Each cast member is characterized by two intrinsic attributes: face and speech. The system uses a two-level hierarchical procedure. Level 1: audio and visual information are utilized independently. Level 2: the major casts are determined by associating faces and speakers belonging to the same cast.
Major Cast Detection Algorithm
[Flowchart: the two-level major cast detection algorithm.] In Level 1, audio and visual information are utilized independently. In Level 2, cues from the different modalities are combined.
Clean Speech Extraction: Audio Feature Description
[Diagram: the audio stream is divided into 2-sec clips; each clip is divided into 32-ms frames with a 16-ms shift; a silence gap longer than 300 ms marks a segment boundary.]
Clean Speech Extraction: Audio Feature Description
14 features are extracted for each audio clip. [Figure: 2-D projection of the audio features using the Karhunen-Loeve transform, clean speech versus other audio.]
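To make the projection concrete, here is a minimal sketch of a 2-D Karhunen-Loeve (i.e., PCA) projection of clip-level features. This is not the authors' code; the 14-dimensional feature array is assumed as input.

```python
import numpy as np

def klt_project_2d(features):
    """Project clip-level feature vectors onto the top-2 KLT (PCA) axes.

    features: (num_clips, 14) array of audio features per clip.
    Returns (num_clips, 2) coordinates suitable for a scatter plot.
    """
    centered = features - features.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    basis = eigvecs[:, ::-1][:, :2]          # top-2 principal directions
    return centered @ basis
```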
Clean Speech Extraction: Clean Speech Classification
Two types of classifiers are used to identify clean speech among other types of audio data: a GMM classifier and an SVM classifier. The GMM classifier models each class as a set of weighted Gaussian components.
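As a rough sketch of this kind of classifier (not the paper's implementation), one GMM can be fit per class and a clip assigned to the class with the higher likelihood; the mixture count of 4 here is a placeholder.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_classifier(speech_feats, other_feats, n_components=4):
    """Fit one GMM per class ('clean speech' vs. 'other audio').

    speech_feats, other_feats: (num_clips, 14) training feature arrays.
    n_components is illustrative; the paper sweeps the mixture count.
    """
    gmm_speech = GaussianMixture(n_components=n_components).fit(speech_feats)
    gmm_other = GaussianMixture(n_components=n_components).fit(other_feats)

    def is_clean_speech(clip_feats):
        # Classify by comparing average log-likelihoods under each model.
        clip_feats = np.atleast_2d(clip_feats)
        return gmm_speech.score(clip_feats) > gmm_other.score(clip_feats)

    return is_clean_speech
```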
Clean Speech Extraction: Clean Speech Classification
The Support Vector Machine (SVM) classifier is a learning algorithm based on statistical learning theory. We experimented with three types of kernel functions: dot product, polynomial, and radial basis function (RBF), where d is the order of the polynomial kernel and the RBF kernel has a width parameter.
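A minimal sketch of trying the three kernel types with scikit-learn; the degree and RBF parameter values are placeholders rather than the paper's tuned settings (the slides later mention an RBF coefficient of 0.5).

```python
from sklearn.svm import SVC

def train_svm_classifiers(train_feats, train_labels):
    """Train one SVM per kernel type on the 14-dim clip features.

    train_feats: (num_clips, 14) array; train_labels in {0, 1}.
    """
    classifiers = {
        "dot_product": SVC(kernel="linear"),
        "polynomial": SVC(kernel="poly", degree=3),  # d: polynomial order
        "rbf": SVC(kernel="rbf", gamma=0.5),         # RBF width parameter
    }
    for clf in classifiers.values():
        clf.fit(train_feats, train_labels)
    return classifiers
```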
Clean Speech Extraction: Simulation Results
The experimental data consist of eight half-hour news broadcast videos (NBC Nightly News, year 2000); four videos are used for training the models. The audio track is sampled at 16 kHz with a resolution of 16 bits per sample. The visual track is digitized at 10 frames per second with a frame size of 240x180.
Clean Speech Extraction: Simulation Results
[Table: error rates (%) of the clean speech classifier using GMM.] The worse performance with more mixtures may be due to the limited size of the training data.
Clean Speech Extraction: Simulation Results
[Table: error rates (%) of the clean speech classifier using SVM; the coefficient of the RBF kernel is set to 0.5.]
Speaker Segmentation and Clustering: Speaker Segmentation
Speakers are segmented at the frame level. The speaker segmentation scheme consists of three steps: feature computation, splitting, and merging. Mel-frequency cepstral coefficients are used as features: 13 MFCCs and their temporal delta values for each frame, i.e., 26 features in total. The divergence is then computed between the N previous frames and the N future frames.
Speaker Segmentation and Clustering: Speaker Segmentation
[Diagram: two adjacent blocks of N frames each slide along the audio stream, and the difference between the two audio blocks is computed at each position, producing a distance curve.]
Speaker Segmentation and Clustering: Speaker Segmentation
Assumptions: the 26 features are independent, and each feature follows a Gaussian distribution. Under these assumptions, the divergence between blocks B1 and B2 simplifies to the symmetric Kullback-Leibler form
$$D(B_1,B_2)=\frac{1}{2}\sum_{i=1}^{26}\left[\frac{\sigma_{1,i}^{2}}{\sigma_{2,i}^{2}}+\frac{\sigma_{2,i}^{2}}{\sigma_{1,i}^{2}}-2+\left(\mu_{1,i}-\mu_{2,i}\right)^{2}\left(\frac{1}{\sigma_{1,i}^{2}}+\frac{1}{\sigma_{2,i}^{2}}\right)\right]$$
where $\mu_{j,i}$ and $\sigma_{j,i}^{2}$ are the mean and variance of feature $i$ in block $B_j$. If the distance is higher than a certain threshold and is a local maximum within the surrounding range, it is marked as a candidate speaker boundary; see the sketch below.
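A minimal sketch of the splitting step under the stated assumptions; the threshold value is illustrative, and this is not the authors' code (the block length of 188 frames matches the value reported in the simulation results).

```python
import numpy as np

def block_divergence(b1, b2, eps=1e-8):
    """Symmetric KL divergence between two blocks of frames, assuming
    independent Gaussian features (per the slide's simplification).

    b1, b2: (N, 26) arrays of 13 MFCCs + 13 deltas per frame.
    """
    m1, m2 = b1.mean(axis=0), b2.mean(axis=0)
    v1, v2 = b1.var(axis=0) + eps, b2.var(axis=0) + eps
    return 0.5 * np.sum(v1 / v2 + v2 / v1 - 2
                        + (m1 - m2) ** 2 * (1 / v1 + 1 / v2))

def candidate_boundaries(frames, n=188, threshold=50.0):
    """Slide two adjacent N-frame blocks along the feature sequence and
    flag local maxima above a threshold (placeholder value) as candidate
    speaker boundaries."""
    d = np.array([block_divergence(frames[t - n:t], frames[t:t + n])
                  for t in range(n, len(frames) - n)])
    peaks = [i for i in range(1, len(d) - 1)
             if d[i] > threshold and d[i] >= d[i - 1] and d[i] >= d[i + 1]]
    return [i + n for i in peaks]   # convert back to frame indices
```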
Speaker Segmentation and Clustering: Speaker Segment Clustering
A GMM is built for each speaker segment, and the distance between two GMMs is used as the difference between the two speaker segments. Suppose A(x) is a mixture of N Gaussian components and B(x) is a mixture of K Gaussian components.
Speaker Segmentation and Clustering: Speaker Segment Clustering
The distance between two Gaussian mixture models is defined as the minimum-cost pairing of their components [12]:
$$D(A,B)=\min_{w_{nk}} \sum_{n=1}^{N}\sum_{k=1}^{K} w_{nk}\, d(A_n,B_k)$$
subject to $w_{nk} \ge 0$, $\sum_{k=1}^{K} w_{nk} = a_n$, and $\sum_{n=1}^{N} w_{nk} = b_k$, where $a_n$ and $b_k$ are the mixture weights and $d(A_n,B_k)$ is a distance between individual Gaussian components.
[12] Z. Liu, Y. Wang, and T. Chen, "A new distance measure for probability distribution function of mixture type."
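A sketch of this mixture distance as a small linear program, assuming the Euclidean distance between component means as the per-component distance d(A_n, B_k) (the paper's choice of component distance may differ):

```python
import numpy as np
from scipy.optimize import linprog

def gmm_distance(weights_a, means_a, weights_b, means_b):
    """Minimum-cost matching distance between two GMMs.

    weights_a: (N,) mixture weights; means_a: (N, D) component means;
    likewise for B. Mixture weights must each sum to 1.
    """
    n, k = len(weights_a), len(weights_b)
    # Pairwise component distances, flattened row-major into a cost vector.
    cost = np.linalg.norm(means_a[:, None, :] - means_b[None, :, :],
                          axis=2).ravel()
    # Equality constraints: row sums equal a_n, column sums equal b_k.
    a_eq = np.zeros((n + k, n * k))
    for i in range(n):
        a_eq[i, i * k:(i + 1) * k] = 1.0
    for j in range(k):
        a_eq[n + j, j::k] = 1.0
    b_eq = np.concatenate([weights_a, weights_b])
    res = linprog(cost, A_eq=a_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun
```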
Speaker Segmentation and Clustering: Speaker Segment Clustering
[Figure: illustration of the speaker segment clustering procedure.]
Speaker Segmentation and Clustering: Simulation Results
To evaluate the performance, speaker boundaries were manually annotated and speaker identities labeled. The dominant speaker is the one whose speech lasts longest. The block length N is set to 188 frames, which equals 3 sec (188 frames x 16 ms per frame). A detected boundary within 2 sec of a real boundary is counted as correct; otherwise, it is a false detection.
Speaker Segmentation and Clustering: Simulation Results
[Table: speaker segmentation results.] The false detection rate is relatively high, but falsely separated segments may be grouped together in the subsequent clustering step.
Speaker Segmentation and Clustering: Simulation Results
[Table: speaker clustering results.] Errors are not serious: split speaker segments last only 3-5 s, versus an average segment duration of more than 20 s.
Face Detection in Still Image: Basic Procedure
[Figure: illustration of template matching.] A constraint is imposed on the warping functions; F(m, n) and T(i, j) are the intensity values of corresponding pixels in the image and the template.
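A minimal sketch of computing a matching value as plain normalized cross-correlation; the paper's procedure additionally warps rows and columns under a constraint, which is omitted here.

```python
import numpy as np

def matching_value(block, template):
    """Normalized cross-correlation between an image block F(m, n) and the
    face template T(i, j); both must have the same shape.

    Returns a value in [-1, 1]; higher means a better match.
    """
    f = block - block.mean()
    t = template - template.mean()
    denom = np.sqrt((f ** 2).sum() * (t ** 2).sum())
    return float((f * t).sum() / denom) if denom > 0 else 0.0
```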
Face Detection in Still Image: Basic Procedure
[Figure: example of a row mapping function.]
Face Detection in Still Image: Face Detection in Multiple Resolutions
The basic procedure is applied over multiple resolutions to find faces of various sizes. Two successive resolutions differ in size by a factor of two, since the basic procedure can handle faces from the same size to twice the size of the template face. Among all detected faces, those that overlap with other faces having higher matching values are eliminated; a sketch of this pyramid search follows below.
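A sketch of the multi-resolution search and the overlap elimination; detect_at_scale is a hypothetical single-resolution detector standing in for the basic procedure.

```python
def detect_multiscale(image, detect_at_scale, num_levels=4):
    """Run a single-scale detector over an image pyramid where successive
    levels differ by a factor of two.

    detect_at_scale(img) -> iterable of (x, y, size, mv) detections.
    Returns detections mapped back to full-resolution coordinates.
    """
    detections = []
    for level in range(num_levels):
        scale = 2 ** level
        small = image[::scale, ::scale]  # simple stand-in for downsampling
        for (x, y, size, mv) in detect_at_scale(small):
            detections.append((x * scale, y * scale, size * scale, mv))
    # Keep the higher-matching-value face whenever two detections overlap.
    detections.sort(key=lambda d: d[3], reverse=True)
    kept = []
    for d in detections:
        if all(not _overlap(d, k) for k in kept):
            kept.append(d)
    return kept

def _overlap(a, b):
    """True if two square detections (x, y, size, mv) intersect."""
    ax, ay, asz, _ = a
    bx, by, bsz, _ = b
    return ax < bx + bsz and bx < ax + asz and ay < by + bsz and by < ay + asz
```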
Face Detection in Still Image: Generation of Average Face Template
The face template should capture as much as possible of the features common to human faces, while not being vulnerable to the background or to individual characteristics. A rectangle that encloses the eyebrows and the upper lip is used as the face template region.
Face Detection in Still Image: Generation of Average Face Template
The face template size is 24x24. The training data come from the AR face database (Purdue University) [17]: 70 faces with neutral expression and without eyeglasses are chosen.
[17] The AR Face Database [Online].
Face Detection in Still Image: Generation of Average Face Template
[Figure: the face template region, the average face template, and some of the training faces from the AR database.]
Face Detection in Still Image: Improvement of the Performance
Skin tone is used to reduce the search area: based on a skin-tone model in hue-saturation space, candidate face regions are obtained. The search then proceeds hierarchically rather than pixel by pixel: the image is partitioned into a coarse mesh with patch size S0 x S0 pixels, the nodes with higher matching values are picked, and the mesh is refined around them; a sketch of this coarse-to-fine search follows below.
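A sketch of the coarse-to-fine search; mv_at is a hypothetical function returning the matching value at a pixel, and the initial mesh size and number of retained nodes are illustrative.

```python
def coarse_to_fine_search(mv_at, width, height, s0=16, keep=8):
    """Hierarchical template search: evaluate on a coarse grid, keep the
    best nodes, and refine around them at successively finer steps.

    mv_at(x, y) -> matching value at pixel (x, y).
    Returns the (x, y) location with the highest matching value found.
    """
    step = s0
    candidates = [(x, y) for y in range(0, height, step)
                  for x in range(0, width, step)]
    while step > 1:
        best = sorted(candidates, key=lambda p: mv_at(*p), reverse=True)[:keep]
        step //= 2
        # Re-sample a half-step neighborhood around each retained node.
        candidates = [(min(max(x + dx, 0), width - 1),
                       min(max(y + dy, 0), height - 1))
                      for (x, y) in best
                      for dy in (-step, 0, step)
                      for dx in (-step, 0, step)]
    return max(candidates, key=lambda p: mv_at(*p))
```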
Face Detection in Still Image: Simulation Results
[Figure: face detection results on still images.]
Face Detection in Still Image: Simulation Results
Detecting faces in one still image (size 180x240) takes 0.3 sec on a Pentium-class machine. The algorithm is tested on 100 front-view images from the AR database, different from the faces used to generate the average face template. Among the 100 images, 91 faces are successfully detected with no false alarms; all the missed faces wear eyeglasses.
Face Tracking and Clustering: Face Tracking in Video
Video shot segmentation: the video sequence is segmented into shots, and faces are tracked in each shot independently. A real shot cut typically leads to consecutive large inter-frame distances lasting K frames; through experiments, K = 6 gives reliable results for video digitized at 10 frames/sec. A sketch of this cut detector follows below.
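A sketch of this cut detector; the frame-distance measure and threshold are left abstract, since the slide does not specify them.

```python
import numpy as np

def detect_shot_cuts(frame_dists, threshold, k=6):
    """Flag a shot cut when the inter-frame distance stays above a
    threshold for K consecutive frames.

    frame_dists: 1-D array, frame_dists[t] = distance(frame t, frame t+1).
    Returns the starting frame indices of detected cuts.
    """
    cuts, run = [], 0
    for t, large in enumerate(frame_dists > threshold):
        run = run + 1 if large else 0
        if run == k:
            cuts.append(t - k + 1)  # the run of K large distances starts here
    return cuts
```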
Face Tracking and Clustering: Face Tracking in Video
Face tracking within each shot: detect frontal faces in all frames and expand the face tracks into surrounding frames. Stage 1: the average face template is used to detect faces; if a detected face overlaps in spatial location with a face in the previous frame, the two belong to the same track. Stage 2: each detected face is used as a new face template, and faces are searched in neighboring frames bidirectionally.
Face Tracking and Clustering: Face Tracking in Video
[Figure: two cases for face track expansion.] Note: if there is no skin-color occurrence in the first frame of a shot, the whole shot is simply skipped.
Face Tracking and Clustering: Face Track Clustering
Trajectories of the same face in different shots are grouped. Similarity is measured by comparing the representative faces of the two tracks, i.e., the faces detected with the maximum matching values. A similarity matrix is set up by computing the matching value between every pair of face tracks, and the same clustering algorithm described in the audio section is used.
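A sketch of greedy agglomerative clustering driven by such a similarity matrix; the single-link merging rule and the stopping threshold are assumptions, since the slide only says the audio-side algorithm is reused.

```python
import numpy as np

def cluster_tracks(rep_faces, match_value, stop_threshold):
    """Greedily merge face tracks by the matching value between their
    representative faces until no pair exceeds the threshold.

    rep_faces: list of representative face images (2-D arrays).
    match_value(face_a, face_b) -> similarity score.
    Returns a list of clusters, each a list of track indices.
    """
    sim = np.array([[match_value(a, b) for b in rep_faces] for a in rep_faces])
    np.fill_diagonal(sim, -np.inf)
    clusters = [[i] for i in range(len(rep_faces))]
    while len(clusters) > 1:
        # Most similar pair of clusters (single link over their members).
        best, pair = -np.inf, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = max(sim[a, b] for a in clusters[i] for b in clusters[j])
                if s > best:
                    best, pair = s, (i, j)
        if best < stop_threshold:
            break
        i, j = pair
        clusters[i] += clusters.pop(j)
    return clusters
```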
Face Tracking and Clustering: Simulation Results
[Table: visual shot segmentation results.] The false detection rate and missing rate are both around 1%. [Figure: face tracking within a shot by the face tracking algorithm.]
Face Tracking and Clustering: Simulation Results
[Table: face tracking results.] Errors are due to lighting effects, eyeglass reflections, etc.
Face Tracking and Clustering: Simulation Results
[Table: face track clustering results; ground-truth face counts are manually counted, and track numbers are reported after clustering.]
Integrated System: Speaker Face Correlation Matrix
Suppose there are M speakers and N faces. Speaker Sm has Lm discontinuous segments, each with two attributes: starting time (ST) and ending time (ET). Face Fn has ln discontinuous segments, each with three attributes: ST, ET, and face size (FS), plus a representative face.
Integrated System: Speaker Face Correlation Matrix
[Figure: illustration of the speaker-face correlation, where FS is the face size and OL is the overlapping duration between a speaker segment and a face segment.]
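A sketch of building the correlation matrix, assuming each entry accumulates overlap duration weighted by face size (the slide names OL and FS; their exact combination is an assumption, not the paper's stated formula).

```python
def correlation_matrix(speakers, faces):
    """Build the M x N speaker-face correlation matrix.

    speakers: list of M lists of (start, end) speech segments.
    faces: list of N lists of (start, end, face_size) face segments.
    """
    c = [[0.0] * len(faces) for _ in speakers]
    for m, s_segs in enumerate(speakers):
        for n, f_segs in enumerate(faces):
            for (s_st, s_et) in s_segs:
                for (f_st, f_et, fs) in f_segs:
                    ol = min(s_et, f_et) - max(s_st, f_st)  # overlap duration
                    if ol > 0:
                        c[m][n] += ol * fs  # assumed weighting: OL x FS
    return c
```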
Integrated System: Major Cast Generation
The association of faces to speakers depends entirely on the correlation matrix, which reflects both the temporal and spatial importance of a major cast. The algorithm produces a list of major casts with corresponding values, used as temporal-spatial importance scores. Suppose there are M different speakers and N different faces, giving an M x N speaker-face correlation matrix.
Integrated System: Major Cast Generation
The algorithm: [the slide's pseudocode was not preserved; a plausible greedy reconstruction follows below].
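A plausible greedy reconstruction, not the authors' published pseudocode: repeatedly pick the largest remaining entry of the correlation matrix, emit the speaker-face pair as a major cast with that entry as its importance score, and retire the corresponding row and column.

```python
import numpy as np

def major_casts(c):
    """Greedy speaker-face association from the correlation matrix.

    c: (M, N) speaker-face correlation matrix with nonnegative entries.
    Returns (speaker_index, face_index, score) tuples in decreasing score.
    """
    c = np.array(c, dtype=float)
    casts = []
    while c.size and c.max() > 0:
        m, n = np.unravel_index(np.argmax(c), c.shape)
        casts.append((m, n, float(c[m, n])))  # score = matrix entry
        c[m, :] = 0.0                         # retire this speaker...
        c[:, n] = 0.0                         # ...and this face
    return casts
```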
Integrated System: Major Cast Presentation System
[Screenshot: the major-cast-based video presentation system, showing the list of major casts with their importance scores, the occurrences of each cast's speech, and the video playback.]
Integrated System: Simulation Results
On the four test video sequences, 8, 9, 6, and 8 major casts are detected, respectively. The most important ones are the anchor persons, consistently across the four test sequences, followed by the various reporters and interviewees.
Summary and Conclusion (1/2)
This paper proposes a video browsing system that uses both audio and visual information, based on the major casts appearing in the video. The components of the major cast detection framework are independent, so each can be improved separately.
Summary and Conclusion (2/2)
Future research includes:
Improving the accuracy of speaker-face association by detecting talking faces.
Utilizing the text stream.
Applying different speaker-face association methods according to the category of the video data.