Character retrieval and annotation in multimedia

Character retrieval and annotation in multimedia Elena Tojtovska, Dalibor Mladenovski, Shima Sajjad

Outline
1 Introduction
2 Subtitles and Script Processing
3 Video Processing
3.1 Face Detection and Tracking
3.2 Facial Feature Localization
3.3 Representing Face Appearance
3.4 Representing Clothing Appearance
3.5 Speaker Detection
4 Classification by Exemplar Sets
5 Experimental Results
6 Conclusions

Introduction The goal is to automatically label the appearances of characters in TV or film material, in each frame of the video. Doing so requires additional information beyond the video itself.

Introduction High precision can be achieved by combining multiple sources of information, both visual and textual. The principal novelties are:
(i) automatic generation of time-stamped character annotation by aligning subtitles and transcripts;
(ii) strengthening the supervisory information by identifying when characters are speaking;
(iii) using complementary cues of face matching and clothing matching to propose common annotations for face tracks.

Introduction Are these faces of the same person?

Introduction Problems: scale, pose, lighting, partial occlusion, and expression.

Subtitle and Script Processing The DVD format stores subtitles as bitmaps. The "SubRip" program is applied to convert them to text, together with a simple OCR correction algorithm. Subtitles give what is said, and when, but not who says it.

Subtitle and Script Processing Many fan websites publish transcripts; the text is automatically extracted from the HTML. Transcripts give what is said, and who says it, but not when.

Subtitle and Script Processing The script has no timing information, but the order of the dialogue is preserved. Efficient alignment with the subtitles is therefore possible by dynamic programming.

Subtitle and Script Processing By automatic alignment of the two sources, it is possible to extract who says what and when.
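
Below is a minimal sketch of such a dynamic-programming alignment, assuming word-level matching between the timed subtitle text and the speaker-attributed script text; the function and data layout are illustrative, not the authors' implementation.

```python
# Illustrative Needleman-Wunsch-style alignment of subtitle words with script words.
def align_words(sub_words, script_words, match=1, gap=-1, mismatch=-1):
    """Return a list of (subtitle_index, script_index) pairs of matched words."""
    n, m = len(sub_words), len(script_words)
    # score[i][j]: best score aligning the first i subtitle words with the first j script words
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if sub_words[i - 1].lower() == script_words[j - 1].lower() else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # align the two words
                              score[i - 1][j] + gap,    # skip a subtitle word
                              score[i][j - 1] + gap)    # skip a script word
    # Trace back to recover the matched word pairs
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        s = match if sub_words[i - 1].lower() == script_words[j - 1].lower() else mismatch
        if score[i][j] == score[i - 1][j - 1] + s:
            if s == match:
                pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```

Each matched pair carries a subtitle timestamp onto a script word whose speaker is known, yielding the "who says what, and when" combination described above.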

Face Detection and Tracking Figure: (a) face detections in original frames; (b) localized facial features. Note the low resolution, non-frontal pose and challenging lighting in the example on the right.

Face Detection and Tracking Point tracks are used to establish correspondence between pairs of faces within a shot: for a given pair of faces in different frames, the number of point tracks which pass through both faces is counted, and if this number is large relative to the number of point tracks not common to both faces, a match is declared.
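
A minimal sketch of this matching rule, assuming each point track is stored as a mapping from frame index to point position and each face as a (frame index, box) pair; the ratio threshold is an assumption.

```python
def point_in_box(pt, box):
    """box = (x0, y0, x1, y1); pt = (x, y)."""
    x, y = pt
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1

def faces_match(face_a, face_b, tracks, min_ratio=0.5):
    """face_* = (frame_index, box); tracks = iterable of {frame_index: (x, y)} dicts.
    Declare a match when tracks through both faces dominate tracks through only one."""
    common, disjoint = 0, 0
    for tr in tracks:
        in_a = face_a[0] in tr and point_in_box(tr[face_a[0]], face_a[1])
        in_b = face_b[0] in tr and point_in_box(tr[face_b[0]], face_b[1])
        if in_a and in_b:
            common += 1
        elif in_a or in_b:
            disjoint += 1
    return common > min_ratio * max(disjoint, 1)  # min_ratio is an assumed threshold
```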

Face Detection and Tracking Faces are tracked through the spatio-temporal video volume, and facial exemplars are automatically associated with each track.

Facial Feature Localization Nine facial features are located: the left and right corners of each eye, the two nostrils, the tip of the nose, and the left and right corners of the mouth. A generative model of the feature positions is combined with a discriminative model of the feature appearance, which improves the ability of the model to capture pose variation. Figure: (a) face detections in original frames; (b) localized facial features. The facial features are detected with high reliability despite variation in pose, lighting, and facial expression.
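
As a rough illustration of combining the two models, the sketch below scores a candidate configuration of the nine features with a single joint Gaussian prior on positions plus per-feature appearance log-scores. This is a simplification (the paper's generative model is richer); all names and the single-Gaussian assumption are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def score_configuration(positions, prior_mean, prior_cov, appearance_logscores):
    """positions: (9, 2) candidate feature locations, normalized to the face box.
    prior_mean (18,), prior_cov (18, 18): Gaussian prior fitted to labelled faces.
    appearance_logscores: per-feature discriminative log-scores at those locations."""
    shape_logp = multivariate_normal.logpdf(positions.ravel(), mean=prior_mean, cov=prior_cov)
    return shape_logp + float(np.sum(appearance_logscores))  # higher is better
```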

Representing Face Appearance A representation of the face appearance is extracted by computing descriptors of the local appearance of the face around each located facial feature. Compared to a global face descriptor, this gives robustness to pose variation, lighting, and partial occlusion. The SIFT descriptor computes a histogram of gradient orientations on a coarse spatial grid, aiming to emphasize strong edge features and give some robustness to image deformation. A simple pixel-wise descriptor takes the vector of pixels in an elliptical region around the feature and normalizes it to obtain local photometric invariance.
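
A sketch of computing the SIFT part of this representation with OpenCV, placing one keypoint per located facial feature; tying the keypoint size to the face size is an assumption, and the paper's exact descriptor parameters are not reproduced here.

```python
import cv2

def face_sift_descriptors(gray_image, feature_points, face_size):
    """gray_image: uint8 grayscale frame; feature_points: nine (x, y) locations."""
    sift = cv2.SIFT_create()
    # One keypoint per facial feature; diameter proportional to the detected face size
    keypoints = [cv2.KeyPoint(float(x), float(y), 0.3 * face_size) for (x, y) in feature_points]
    keypoints, descriptors = sift.compute(gray_image, keypoints)
    return descriptors  # one 128-dimensional row per facial feature
```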

Representing Clothing Appearance Matching the appearance of the face alone can be extremely challenging because of differences in expression, pose, lighting, or motion blur. For each face detection, a bounding box expected to contain the clothing of the corresponding character is predicted relative to the position and scale of the face detection.
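
A sketch of one plausible geometry for that prediction; the scale factors below are illustrative assumptions, not the paper's values.

```python
def predict_clothing_box(face_box, width_scale=2.0, height_scale=3.0):
    """Place a box below the face, scaled to the face size."""
    x0, y0, x1, y1 = face_box
    w, h = x1 - x0, y1 - y0
    cx = (x0 + x1) / 2.0
    top = y1 + 0.2 * h  # start a little below the chin
    return (cx - width_scale * w / 2.0, top,
            cx + width_scale * w / 2.0, top + height_scale * h)
```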

Representing Clothing Appearance Within the predicted clothing box, a colour histogram is computed as a descriptor of the clothing. Similar clothing appearance suggests the same character; different clothing does not necessarily imply a different character. A straightforward weighting of the clothing appearance relative to the face appearance proved effective.
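
A sketch of the histogram descriptor and the weighted face/clothing combination; the bin count and the weight alpha are assumptions.

```python
import numpy as np

def clothing_histogram(image_rgb, box, bins=8):
    """Normalized 3-D colour histogram over the predicted clothing box."""
    x0, y0, x1, y1 = (int(v) for v in box)
    pixels = image_rgb[max(y0, 0):y1, max(x0, 0):x1].reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=((0, 256),) * 3)
    hist = hist.ravel()
    return hist / max(hist.sum(), 1.0)  # normalize so box size does not matter

def combined_distance(d_face, d_clothing, alpha=0.8):
    # Face appearance weighted more heavily than clothing (alpha is assumed)
    return alpha * d_face + (1.0 - alpha) * d_clothing
```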

Speaker Detection This annotation is still extremely ambiguous: there might be several detected faces present in the frame, and we do not know which one is speaking (Figure a).

Speaker Detection Even when there is a single face detection in the frame, the actual speaker might be missed by the frontal face detector (Figure b), or the frame might be part of a 'reaction shot' in which the speaker is not present at all (Figure c).

Speaker Detection To resolve these ambiguities, only face detections with significant lip motion are used. A rectangular mouth region within each face detection is identified, and the mean squared difference of the pixel values within the region is computed between the current and previous frames. The difference is computed over a search region around the mouth region in the current frame, and the minimum is taken.
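
A sketch of this measurement, assuming grayscale frames; the search window size is an assumption.

```python
import numpy as np

def mouth_motion(prev_frame, cur_frame, mouth_box, search=3):
    """Minimum mean squared difference of mouth pixels over small displacements,
    which absorbs slight face-alignment jitter between frames."""
    x0, y0, x1, y1 = mouth_box
    cur = cur_frame[y0:y1, x0:x1].astype(np.float64)
    height, width = prev_frame.shape
    best = np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            ys, ye, xs, xe = y0 + dy, y1 + dy, x0 + dx, x1 + dx
            if ys < 0 or xs < 0 or ye > height or xe > width:
                continue  # displaced window falls outside the frame
            prev = prev_frame[ys:ye, xs:xe].astype(np.float64)
            best = min(best, float(np.mean((cur - prev) ** 2)))
    return best
```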

Speaker Detection Two thresholds on the difference are set, classifying face detections into: 'speaking' (difference above the high threshold), 'non-speaking' (difference below the low threshold), and 'refuse to predict' (difference between the thresholds). This simple lip-motion detection algorithm works well in practice, as illustrated in the next figure.
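
The two-threshold rule itself is then a few lines; the numeric thresholds here are placeholders, not the paper's values.

```python
def classify_lip_motion(msd, t_low=50.0, t_high=200.0):
    """Map the minimum mean squared difference to a speaking decision."""
    if msd > t_high:
        return "speaking"
    if msd < t_low:
        return "non-speaking"
    return "refuse to predict"
```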

Speaker Detection Figure: inter-frame differences for a face track of 101 face detections. The character is speaking between frames 1–70 and remains silent for the rest of the track. The two horizontal lines indicate the 'speaking' (top) and 'non-speaking' (bottom) thresholds respectively.

Speaker Detection Figure: top row: extracted face detections with facial feature points overlaid, for frames 47–54; bottom row: the corresponding extracted mouth regions.

Classification by Exemplar Sets The combination of subtitle/script alignment and speaker detection gives labels for a number of face tracks. Tracks with a single proposed identity are treated as exemplars and used to label the other tracks, which have no, or uncertain, proposed identity. Each unlabelled face track F is represented as a set of face descriptors and clothing descriptors {f, c}, and the exemplar set for each name is denoted λi. For a given track F, a quasi-likelihood that the face corresponds to a particular name λi is then defined.
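
The equation on this slide is an image and does not survive in the transcript. A plausible Gaussian-kernel form, consistent with the face and clothing distances defined on the next slide, would be (an assumption, not necessarily the paper's exact expression):

$$p(F \mid \lambda_i) \propto \exp\!\left(-\frac{d_f(F,\lambda_i)^2}{2\sigma_f^2}\right)\exp\!\left(-\frac{d_c(F,\lambda_i)^2}{2\sigma_c^2}\right)$$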

Classification by Exemplar Sets The face distance df(F, λi) is defined as the minimum distance between the descriptors in F and those in the exemplar tracks for λi; the clothing distance dc(F, λi) is defined similarly. The quasi-likelihoods for each name λi are combined into a posterior probability of the name by assuming equal priors on the names and applying Bayes' rule. The track is assigned the name λi for which the posterior P(λi|F) is maximal.
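
With equal priors on the names, Bayes' rule reduces to normalizing the quasi-likelihoods:

$$P(\lambda_i \mid F) = \frac{p(F \mid \lambda_i)}{\sum_j p(F \mid \lambda_j)}$$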

Classification by Exemplar Sets By thresholding the posterior, a "refusal to predict" mechanism is implemented: faces for which the certainty of naming does not reach the threshold are left unlabelled. This decreases the recall of the method but improves the accuracy of the labelled tracks.
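
A compact sketch of the whole naming step, using the minimum-distance rule, a Gaussian quasi-likelihood (an assumption, as noted above), and posterior thresholding; sigma and the confidence threshold are placeholders.

```python
import numpy as np

def name_track(track_descriptors, exemplars, sigma=1.0, min_posterior=0.9):
    """track_descriptors: descriptor vectors of an unlabelled track.
    exemplars: {name: [descriptor vectors from that name's exemplar tracks]}."""
    names = list(exemplars)
    # Minimum descriptor distance between the track and each name's exemplars
    dists = np.array([
        min(np.linalg.norm(f - e) for f in track_descriptors for e in exemplars[n])
        for n in names
    ])
    quasi = np.exp(-dists ** 2 / (2.0 * sigma ** 2))  # quasi-likelihoods
    posterior = quasi / quasi.sum()                   # Bayes' rule with equal priors
    best = int(np.argmax(posterior))
    if posterior[best] < min_posterior:
        return None  # refuse to predict: leave the track unlabelled
    return names[best]
```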

Experimental Results The proposed method was applied to two episodes of "Buffy the Vampire Slayer". Episode 05-02 contains 62,157 frames in which 25,277 faces were detected, forming 516 face tracks; episode 05-05 contains 64,083 frames, 24,170 faces, and 477 face tracks. Two terms are used to evaluate "refusal to predict": "Recall" is the proportion of face tracks which are assigned labels by the proposed method at a given confidence level; "Precision" is the proportion of labelled tracks which are correct.
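
In symbols:

$$\mathrm{Recall} = \frac{\#\{\text{tracks assigned a label}\}}{\#\{\text{all tracks}\}}, \qquad \mathrm{Precision} = \frac{\#\{\text{correctly labelled tracks}\}}{\#\{\text{labelled tracks}\}}$$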

Experimental Results The graphs show the performance of the proposed method and two baseline methods: "Prior" labels all tracks with the name which occurs most often in the script (Buffy); "Subtitles only" labels tracks with names proposed from the script alone, without using speaker detection.

Conclusions We have presented methods for combining textual and visual information to automatically name characters in TV or film material. The detection method and appearance models used here could be improved: by further use of tracking, for example using a specific body tracker rather than a generic point tracker, which could propagate detections to frames in which detection based on the face is difficult; and by introducing a mechanism for error correction.

References
[1] M. Everingham, J. Sivic and A. Zisserman. "Hello! My name is... Buffy" – Automatic Naming of Characters in TV Video. Department of Engineering Science, University of Oxford.
[2] A. Zisserman, with J. Sivic and M. Everingham. Character retrieval and annotation in multimedia, or "How to find Buffy". Department of Engineering Science, University of Oxford, UK. http://www.robots.ox.ac.uk/~vgg/

Thank you!  Questions?