Major Cast Detection in Video Using Both Speaker and Face Information


Major Cast Detection in Video Using Both Speaker and Face Information
Zhu Liu, Senior Member, IEEE, and Yao Wang, Fellow, IEEE
IEEE Transactions on Multimedia, vol. 9, no. 1, January 2007

Outline
- Introduction
- Major Cast Detection System
- Clean Speech Extraction
- Speaker Segmentation and Clustering
- Face Detection in Still Image
- Face Tracking and Clustering
- Integration
- Summary and Conclusion

Introduction (1/4)
- Providing efficient content description has become very important:
  - huge amounts of video data
  - need to browse and retrieve data of interest
- Major casts and their occurrences provide good indices for organizing video content:
  - anchors and reporters in news programs
  - principal characters in movies

Introduction (2/4)
- Most previous work focused on a single modality: audio or visual information alone.
- This paper takes a new approach to generating the list of major casts in a video:
  - based on both audio and visual information
  - all available media should be analyzed

Introduction (3/4)
(Figure: system diagram. Speaker information from the audio track and face information from the visual track are combined by information integration, which establishes speaker-face correspondence for major cast detection; the results feed a major cast finder and browsing system.)

Introduction (4/4)
- Our goal is similar to Name-it.
  - Name-it associates names with faces; this paper associates sounds (speakers) with faces.
- Name-it, for news video:
  - faces detected from video frames
  - names extracted from closed captions
[24] S. Satoh, Y. Nakamura, and T. Kanade, "Name-it: Naming and detecting faces in news videos," IEEE Multimedia Mag., 1999.

System Overview
- Each cast is characterized by two intrinsic attributes: face and speech.
- A two-level hierarchical procedure is used:
  - Level 1: audio and visual information are utilized independently.
  - Level 2: major casts are determined by associating faces and speakers belonging to the same cast.

Major Cast Detection Algorithm
- In Level 1, audio and visual information are utilized independently.
- In Level 2, cues from the different modalities are combined.

Clean Speech Extraction: Audio Feature Description
- The audio stream is cut into clips of about 2 s at silence gaps longer than 300 ms.
- Each clip is divided into 32-ms frames with a 16-ms shift.
(Figure: clip/frame structure.)
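A minimal sketch of this clip/frame layout in Python (NumPy only; the silence-gap detector is omitted, and all names are illustrative, not from the paper):

```python
import numpy as np

SR = 16000                    # audio sampled at 16 kHz (from the slides)
FRAME = int(0.032 * SR)       # 32-ms frame  -> 512 samples
HOP = int(0.016 * SR)         # 16-ms shift  -> 256 samples
CLIP = 2 * SR                 # 2-s clip

def clips_of_signal(x):
    """Chop a signal into consecutive 2-s clips.
    (The paper cuts at silence gaps > 300 ms; fixed-length cuts stand in here.)"""
    return [x[i:i + CLIP] for i in range(0, len(x) - CLIP + 1, CLIP)]

def frames_of_clip(clip):
    """Split one clip into overlapping 32-ms frames with a 16-ms hop."""
    n = 1 + (len(clip) - FRAME) // HOP
    return np.stack([clip[i * HOP:i * HOP + FRAME] for i in range(n)])
```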

Clean Speech Extraction: Audio Feature Description
- 14 features are extracted for each audio clip.
(Figure: 2-D projection of the audio features via the Karhunen-Loeve transform, clean speech versus others.)

Clean Speech Extraction: Clean Speech Classification
- Two types of classifiers identify clean speech among other types of audio data:
  - GMM classifier
  - SVM classifier
- A GMM classifier consists of a set of weighted Gaussians.
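One way to realize such a classifier with scikit-learn is to fit one GMM per class and label a clip by the higher log-likelihood. A hedged sketch: the mixture counts, covariance type, and the feature matrices `X_speech` / `X_other` are assumptions, not the paper's settings:

```python
from sklearn.mixture import GaussianMixture

gmm_speech = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)
gmm_other = GaussianMixture(n_components=4, covariance_type="diag", random_state=0)

def fit(X_speech, X_other):
    """Train one mixture model per class on 14-D clip feature vectors."""
    gmm_speech.fit(X_speech)
    gmm_other.fit(X_other)

def is_clean_speech(X):
    """Label each row of X as clean speech if its log-likelihood under the
    speech model exceeds that under the non-speech model."""
    return gmm_speech.score_samples(X) > gmm_other.score_samples(X)
```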

Clean Speech Extraction: Clean Speech Classification
- Support vector machine (SVM) classifier: a learning algorithm based on statistical learning theory.
- We experimented with three types of kernel functions:
  - dot product
  - polynomial (d: order of the polynomial kernel)
  - radial basis function (γ: parameter of the RBF kernel)
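The three kernels map directly onto scikit-learn's SVC; a sketch under assumed hyperparameters (only gamma = 0.5 is taken from the results slide; d = 3 is illustrative):

```python
from sklearn.svm import SVC

classifiers = {
    "dot product": SVC(kernel="linear"),
    "polynomial": SVC(kernel="poly", degree=3),   # d = 3, illustrative
    "rbf": SVC(kernel="rbf", gamma=0.5),          # gamma = 0.5 per the results slide
}

def evaluate(X_train, y_train, X_test, y_test):
    """Fit each kernel variant and report its test accuracy."""
    return {name: clf.fit(X_train, y_train).score(X_test, y_test)
            for name, clf in classifiers.items()}
```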

Clean Speech Extraction: Simulation Results
- The experimental data consist of eight half-hour news broadcast videos (NBC Nightly News, year 2000).
- Four videos are used for training the models.
- Audio track: sampled at 16 kHz with a resolution of 16 bits per sample.
- Visual track: digitized at 10 frames per second, size 240x180.

Clean Speech Extraction: Simulation Results
(Table: error rates (%) of the clean speech classifier using GMMs.)
- The worse performance with more mixtures may be due to the limited size of the training data.

Clean Speech Extraction: Simulation Results
(Table: error rates (%) of the clean speech classifier using SVMs; the RBF coefficient γ is set to 0.5.)

Speaker Segmentation and Clustering: Speaker Segmentation
- Speakers are segmented at the frame level.
- Segmentation scheme: feature computation, splitting, merging.
- Features: Mel-frequency cepstral coefficients, 13 MFCCs plus their temporal deltas for each frame.
- Compute the divergence between the N previous frames and the N future frames.
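These frame features can be computed, for example, with librosa (a sketch; the paper's exact MFCC implementation may differ):

```python
import numpy as np
import librosa

def mfcc_features(y, sr=16000):
    """13 MFCCs plus their temporal deltas -> 26 features per frame,
    using 32-ms windows with a 16-ms hop at 16 kHz."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=512, hop_length=256)
    delta = librosa.feature.delta(mfcc)
    return np.vstack([mfcc, delta]).T   # shape: (n_frames, 26)
```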

Speaker Segmentation and Clustering: Speaker Segmentation
(Figure: two adjacent blocks of N frames each slide along the stream; the divergence between the two audio blocks is compared at each position.)

Speaker Segmentation and Clustering: Speaker Segmentation
- Assumptions: the 26 features are independent, and each feature follows a Gaussian distribution.
- Under these assumptions the divergence between blocks B1 and B2 simplifies to a sum of per-feature terms; one standard form (the symmetric KL divergence for independent Gaussians, reconstructed here rather than copied from the paper) is:

  D(B1, B2) = Σ_{i=1..26} [ (σ1,i² + (μ1,i − μ2,i)²) / (2 σ2,i²) + (σ2,i² + (μ1,i − μ2,i)²) / (2 σ1,i²) − 1 ]

- If the distance is higher than a certain threshold and is a local maximum within its surrounding range, it is a candidate speaker boundary.
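In code, the simplified divergence and the boundary scan might look as follows (a sketch built on the reconstructed formula above; the threshold and N are left to the caller):

```python
import numpy as np

def block_divergence(B1, B2, eps=1e-8):
    """Symmetric KL divergence between two blocks of frames, each modeled
    as 26 independent Gaussians (rows = frames, columns = features)."""
    m1, v1 = B1.mean(axis=0), B1.var(axis=0) + eps
    m2, v2 = B2.mean(axis=0), B2.var(axis=0) + eps
    d = ((v1 + (m1 - m2) ** 2) / (2 * v2)
         + (v2 + (m1 - m2) ** 2) / (2 * v1) - 1)
    return d.sum()

def boundary_candidates(feats, N, thresh):
    """Slide two adjacent N-frame blocks along the stream; keep positions
    where the divergence exceeds the threshold and is a local maximum."""
    D = np.array([block_divergence(feats[t - N:t], feats[t:t + N])
                  for t in range(N, len(feats) - N)])
    return [t + N for t in range(1, len(D) - 1)
            if D[t] > thresh and D[t] > D[t - 1] and D[t] > D[t + 1]]
```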

Speaker Segmentation and Clustering: Speaker Segment Clustering
- Build a GMM for each speaker segment.
- Use the distance between GMMs as the difference between two speaker segments.
- Suppose A(x) is a mixture of N Gaussians and B(x) is a mixture of K Gaussians.

Speaker Segmentation and Clustering: Speaker Segment Clustering
- Distance between two Gaussian mixture models (reconstructed from the formulation in [12]; p_i and q_j are the mixture weights of A and B, and the coupling weights w_ij are optimized):

  d(A, B) = min over {w_ij} of Σ_{i=1..N} Σ_{j=1..K} w_ij · d(a_i, b_j)
  subject to w_ij ≥ 0, Σ_j w_ij = p_i, Σ_i w_ij = q_j

[12] Z. Liu, Y. Wang, and T. Chen, "A new distance measure for probability distribution function of mixture type."
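This is a small transportation problem, solvable with an off-the-shelf linear-program solver. A sketch assuming diagonal Gaussian components and using the symmetric KL divergence as the pairwise component distance (the exact base distance in [12] may differ):

```python
import numpy as np
from scipy.optimize import linprog

def gauss_dist(m1, v1, m2, v2):
    """Symmetric KL divergence between two diagonal Gaussians."""
    return float(((v1 + (m1 - m2) ** 2) / (2 * v2)
                  + (v2 + (m1 - m2) ** 2) / (2 * v1) - 1).sum())

def gmm_distance(p, means_a, vars_a, q, means_b, vars_b):
    """Minimum coupling cost between mixtures A (N components, weights p)
    and B (K components, weights q) under the marginal constraints."""
    N, K = len(p), len(q)
    cost = np.array([[gauss_dist(means_a[i], vars_a[i], means_b[j], vars_b[j])
                      for j in range(K)] for i in range(N)]).ravel()
    A_eq = np.zeros((N + K, N * K))
    for i in range(N):
        A_eq[i, i * K:(i + 1) * K] = 1      # row sums equal p_i
    for j in range(K):
        A_eq[N + j, j::K] = 1               # column sums equal q_j
    b_eq = np.concatenate([p, q])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun
```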

Speaker Segmentation and Clustering: Speaker Segment Clustering
(Figure: clustering procedure.)

Speaker Segmentation and Clustering: Simulation Results
- For evaluation, speaker boundaries were manually annotated and speaker identities labeled.
- Dominant speaker: the one whose speech lasts longest.
- The block length N is set to 188 frames, which equals about 3 s.
- A detected boundary within 2 s of the real boundary is counted as correct; otherwise it is a false detection.

Speaker Segmentation and Clustering: Simulation Results
(Table: speaker segmentation results.)
- The false detection rate is relatively high, but falsely separated segments may be regrouped in the subsequent clustering step.

Speaker Segmentation and Clustering: Simulation Results
(Table: speaker clustering results.)
- Splitting errors are not serious: split speaker segments (3-5 s) are short compared with the average segment duration (>20 s).

Face Detection in Still Image: Basic Procedure
(Figure: illustration of template matching.)
- A constraint is imposed on the warping functions.
- F(m, n) and T(i, j) are the intensity values of corresponding pixels in the image and the template.

Face Detection in Still Image: Basic Procedure
(Figure: example of a row mapping function.)

Face Detection in Still Image: Face Detection in Multiple Resolutions
- The basic procedure is applied over multiple resolutions to find faces of various sizes.
- Two successive resolutions differ in size by a factor of two, since the basic procedure can handle faces from the template size up to twice that size.
- Among all detected faces, those that overlap with other faces having higher matching values are eliminated.
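A simplified sketch of the multi-resolution search using an OpenCV image pyramid; note that plain normalized cross-correlation stands in for the paper's warping-based matching:

```python
import cv2
import numpy as np

def detect_multiscale(image_gray, template, levels=4, thresh=0.6):
    """Match the face template against a pyramid whose levels differ by a
    factor of two, so faces from 1x to 2x the template size are caught
    at some level; coordinates are mapped back to full resolution."""
    hits, img = [], image_gray.astype(np.float32)
    tmpl = template.astype(np.float32)
    for level in range(levels):
        if img.shape[0] < tmpl.shape[0] or img.shape[1] < tmpl.shape[1]:
            break
        res = cv2.matchTemplate(img, tmpl, cv2.TM_CCOEFF_NORMED)
        for y, x in zip(*np.where(res >= thresh)):
            f = 2 ** level
            hits.append((x * f, y * f, tmpl.shape[1] * f, float(res[y, x])))
        img = cv2.pyrDown(img)   # halve the resolution for the next level
    # eliminating overlapping hits with lower matching values is omitted
    return sorted(hits, key=lambda h: -h[3])
```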

Face Detection in Still Image: Generation of Average Face Template
- The face template should capture as many of the features common to human faces as possible, while being insensitive to background and individual characteristics.
- A rectangle enclosing the eyebrows and the upper lip is used as the face template.
(Figure: face template region.)

Face Detection in Still Image: Generation of Average Face Template
- Size of the face template: 24x24.
- The training data come from the AR face database (Purdue University) [17].
- We choose 70 faces as training data: neutral expression, without eyeglasses.
[17] The AR face database [Online]: http://rvl1.ecn.purdue.edu/~aleix/aleix_face_DB.html
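Averaging aligned, intensity-normalized crops yields such a template; a sketch (alignment and the eyebrow-to-lip cropping are assumed done upstream):

```python
import numpy as np
import cv2

def average_face_template(face_crops, size=(24, 24)):
    """Average 24x24 grayscale face crops (e.g., the 70 neutral,
    glasses-free AR-database faces) into one template."""
    acc = np.zeros(size, dtype=np.float64)
    for face in face_crops:
        patch = cv2.resize(face, size).astype(np.float64)
        patch = (patch - patch.mean()) / (patch.std() + 1e-8)  # normalize lighting
        acc += patch
    return acc / len(face_crops)
```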

Face Detection in Still Image: Generation of Average Face Template
(Figure: the face template region, the face template, and part of the training faces from the AR database.)

Face Detection in Still Image: Improvement of the Performance
- Use skin tone to reduce the search area:
  - based on a skin-tone model in hue and saturation space
  - yields the candidate face regions
- Search hierarchically rather than pixel by pixel:
  - partition the image into a gross mesh with patches of S0 x S0 pixels
  - pick the nodes whose matching values are highest and refine them with a finer mesh
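A sketch of the skin-tone gate in hue/saturation space (the threshold ranges here are common illustrative values, not the paper's trained model):

```python
import cv2
import numpy as np

def skin_candidate_mask(image_bgr, h_range=(0, 25), s_range=(40, 180)):
    """Return a binary mask of candidate face regions from hue/saturation."""
    hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV)
    lower = np.array([h_range[0], s_range[0], 0], dtype=np.uint8)
    upper = np.array([h_range[1], s_range[1], 255], dtype=np.uint8)
    mask = cv2.inRange(hsv, lower, upper)
    # open the mask so isolated skin-colored pixels do not spawn candidates
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
```

The template search is then restricted to mesh nodes where the mask is nonzero.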

Face Detection in Still Image: Simulation Results
(Figure: face detection results on still images.)

Face Detection in Still Image: Simulation Results
- Detecting faces in one still image takes 0.3 s on a Pentium 4 2.8-GHz machine (image size 180x240).
- The algorithm is tested on 100 front-view images from the AR database, different from the faces used to generate the average face template.
- Among the 100 images, 91 faces are successfully detected with no false alarms; all of the missed faces wear eyeglasses.

Face Tracking and Clustering: Face Tracking in Video
- Video shot segmentation: segment the video sequence into shots, and track faces in each shot independently.
- A real shot cut typically leads to continuously large frame distances lasting K frames; through experiments, K = 6 gives reliable results for video digitized at 10 frames/s.
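One plausible reading of the K-frame rule, sketched with histogram distances: compare each frame with the frame K frames earlier, so a true cut sustains roughly K large distances while a one-frame flash does not. This interpretation and all thresholds are assumptions:

```python
import cv2

K = 6  # at 10 frames/s, per the slide

def color_hist(frame):
    h = cv2.calcHist([frame], [0, 1, 2], None, [8, 8, 8],
                     [0, 256, 0, 256, 0, 256])
    return cv2.normalize(h, h).flatten()

def shot_cuts(frames, thresh=0.4):
    """Report shot boundaries where the lag-K histogram distance stays
    above the threshold for K consecutive frames."""
    hs = [color_hist(f) for f in frames]
    cuts, run = [], 0
    for t in range(K, len(hs)):
        d = cv2.compareHist(hs[t - K], hs[t], cv2.HISTCMP_BHATTACHARYYA)
        run = run + 1 if d > thresh else 0
        if run == K:
            cuts.append(t - K + 1)   # first frame of the new shot
            run = 0
    return cuts
```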

Face Tracking and Clustering: Face Tracking in Video
Face tracking within each shot:
- Stage 1: detect frontal faces in all frames using the average face template, and expand face tracks across frames; if a detected face overlaps in spatial location with a face in the previous frame, the two belong to the same track.
- Stage 2: use each detected face as a new face template and search for faces in neighboring frames bidirectionally.
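The stage-1 overlap rule can be sketched as a simple greedy linker over per-frame detections (boxes as (x, y, w, h); the IoU threshold is an assumption, since the slide says only that the faces "overlap in spatial location"):

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union else 0.0

def link_detections(dets_per_frame, min_iou=0.3):
    """A detection joins the track whose last face overlaps it in the
    previous frame; otherwise it starts a new track."""
    tracks = []
    for t, dets in enumerate(dets_per_frame):
        for box in dets:
            for tr in tracks:
                if tr["end"] == t - 1 and iou(tr["boxes"][-1], box) >= min_iou:
                    tr["boxes"].append(box)
                    tr["end"] = t
                    break
            else:
                tracks.append({"start": t, "end": t, "boxes": [box]})
    return tracks
```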

Face Tracking and Clustering: Face Tracking in Video
(Figure: two cases of face track expansion.)
- Note: if there is no skin color in the first frame of a shot, the whole shot is simply skipped.

Face Tracking and Clustering: Face Track Clustering
- Group the trajectories of the same face across different shots.
- Similarity is measured by comparing the representative faces of the two tracks (the faces detected with the maximum matching values).
- A similarity matrix is set up by computing the matching value between every pair of face tracks.
- The same clustering algorithm described in the audio section is then used.

Face Tracking and Clustering: Simulation Results
- Visual shot segmentation results: the false detection rate and the miss rate are both around 1%.
(Figure: face tracking within a shot by the face tracking algorithm.)

Face Tracking and Clustering: Simulation Results
(Table: face tracking results.)
- Misses are mainly due to lighting effects, eyeglass reflections, etc.

Face Tracking and Clustering: Simulation Results
(Table: face track clustering results; ground-truth counts are obtained manually, and the second column shows the counts after clustering.)

Integrated System: Speaker Face Correlation Matrix
- Suppose there are M speakers and N faces.
- Speaker S_m has L_m discontinuous segments, each with two attributes: starting time (ST) and ending time (ET).
- Face F_n has l_n discontinuous segments, each with three attributes: ST, ET, and face size (FS), plus a representative face.

Integrated System: Speaker Face Correlation Matrix
(Figure: illustration of speaker-face correlation; FS: face size, OL: overlapping duration.)

Integrated System: Major Cast Generation
- The association of faces to speakers depends entirely on the correlation matrix, which reflects both the temporal and spatial importance of a major cast.
- Given M different speakers and N different faces, an M x N speaker-face correlation matrix is built.
- The algorithm produces a list of major casts with corresponding values, used as temporal-spatial importance scores.

Integrated System: Major Cast Generation
(Figure: the algorithm.)
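Since the algorithm itself appears only as a figure, here is a hedged sketch of how the matrix could be built and greedily consumed. The overlap-times-face-size weighting and the greedy row/column elimination are plausible reconstructions, not the paper's exact procedure:

```python
import numpy as np

def correlation_matrix(speakers, faces):
    """C[m, n] accumulates overlap duration (OL) weighted by face size (FS)
    over every co-occurring pair of speaker and face segments."""
    C = np.zeros((len(speakers), len(faces)))
    for m, segs in enumerate(speakers):            # segs: list of (st, et)
        for n, fsegs in enumerate(faces):          # fsegs: list of (st, et, fs)
            for st, et in segs:
                for fst, fet, fs in fsegs:
                    ol = max(0.0, min(et, fet) - max(st, fst))
                    C[m, n] += ol * fs
    return C

def major_casts(C):
    """Repeatedly take the strongest speaker-face pair, emit it with its
    correlation as importance score, then remove that row and column."""
    C, casts = C.copy(), []
    while C.size and C.max() > 0:
        m, n = np.unravel_index(C.argmax(), C.shape)
        casts.append((int(m), int(n), float(C[m, n])))
        C[m, :] = 0
        C[:, n] = 0
    return casts
```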

Integrated System: Major Cast Presentation System
(Figure: major-cast-based video presentation system, showing the list of major casts, their importance scores, the occurrences of speech, and the video.)

Integrated System: Simulation Results
- Four test video sequences are used; 8, 9, 6, and 8 major casts are detected, respectively.
- The most important casts are the anchor persons, consistently across the four test sequences, followed by the various reporters and interviewees.

Summary and Conclusion (1/2)
- This paper proposes a video browsing system using both audio and visual information, based on the major casts appearing in the video.
- The components of the major cast detection framework are independent, so each can be improved separately.

Summary and Conclusion (2/2)
Future research includes:
- improving the accuracy of speaker-face association by detecting talking faces
- utilizing the text stream
- applying different speaker-face association methods according to the category of the video data