CVPR 2013 Poster: Detecting and Naming Actors in Movies using Generative Appearance Models

Outline
1. Introduction
2. Generative model
3. Actor detection
4. Experiments
5. Conclusions

1. Introduction
- Detecting and naming actors in movies is important for content-based indexing and retrieval of movie scenes, and can also support statistical analysis of film style.
- Detecting and naming actors in unedited footage can be useful for post-production.
- Methods for learning such features are needed to improve the recall and precision of actor detection in long movie scenes.

1. Introduction: Contributions
1. We propose a complete framework to learn view-independent actor models, using maximally stable color regions (MSCR) [10].
2. The actor's head and shoulders are represented as constellations of color blobs (9-dimensional: color, size, shape, and position relative to the actor's coordinate system).
3. A detection framework with two stages:
   - Stage 1: search-space reduction using kNN search.
   - Stage 2: sliding-window search for the best localization of the actor in position and scale.
4. We obtain detection windows and actor names that maximize the posterior likelihood of each video frame.

2. Generative model
Our appearance models of actors:
- are learned from a small number of individual keyframes or short video tracks;
- incorporate the costume of the actor and are robust to changes in viewpoint and pose.
- Important assumption: the actor is in an upright position, and both the head and the shoulders are visible.
Normalized coordinate system:
- Origin: at the actor's neck.
- Unit size: twice the height of the actor's eyes relative to the origin.
- The head region extends from (−1, −1) to (1, 0); the shoulder region extends from (−1, 0) to (1, 3).
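The normalized coordinate system above can be sketched in a few lines. This is an illustrative helper (function names and the assumption that image y grows downward are ours, not from the poster): it maps a screen point into the actor-centric frame and classifies it into the head or shoulder box.

```python
def to_actor_coords(px, py, neck_x, neck_y, eye_height):
    """Map a screen point into the actor-centric frame: origin at the
    neck, unit length = twice the eye height (eye height measured
    relative to the neck). Assumes eye_height > 0."""
    unit = 2.0 * eye_height
    return (px - neck_x) / unit, (py - neck_y) / unit

def region_of(x, y):
    """Classify a normalized point into the head box (-1,-1)-(1,0) or
    the shoulder box (-1,0)-(1,3); returns None outside both.
    Assumes y increases downward, so the head is at negative y."""
    if -1.0 <= x <= 1.0:
        if -1.0 <= y <= 0.0:
            return "head"
        if 0.0 < y <= 3.0:
            return "shoulder"
    return None
```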

2. Generative model
- Color blobs C_i: normalized coordinates (x_i, y_i), sizes s_i, colors c_i, shapes m_i, frequencies H_i.
- The generative model for each actor consists of the following three steps:
1. Choose a screen location and window size for the actor, using the detections in the previous frame as a prior.
2. Choose visible features C_i in the "head" and "shoulder" regions independently, each with probability H_i.
3. For each visible feature C_i, generate a color blob B_i from a Gaussian distribution with mean C_i and covariance Σ_i, then translate and scale it to the chosen screen location and size.
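The three sampling steps above can be sketched as follows. This is a simplified, illustrative version (names are ours; a diagonal Gaussian over the two spatial coordinates stands in for the full 9-D covariance Σ_i used in the poster):

```python
import random

def sample_actor_blobs(model, win_x, win_y, win_scale, seed=0):
    """Toy sketch of the three-step generative process. `model` is a
    list of clusters, each with an occurrence frequency "H", a mean
    position "C" and per-dimension std "sigma"."""
    rng = random.Random(seed)
    blobs = []
    for cluster in model:
        # Step 2: each feature is visible independently with probability H_i.
        if rng.random() > cluster["H"]:
            continue
        # Step 3: draw the blob around the cluster mean...
        nx = rng.gauss(cluster["C"][0], cluster["sigma"][0])
        ny = rng.gauss(cluster["C"][1], cluster["sigma"][1])
        # ...then translate and scale into the chosen window (Step 1).
        blobs.append((win_x + nx * win_scale, win_y + ny * win_scale))
    return blobs
```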

2. Generative model
Maximally stable color regions (MSCR)
- MSCR [10] is a color extension of the maximally stable extremal region (MSER) detector.
- It is an affine-covariant region detector that uses successive time steps of an agglomerative clustering of pixels to define a region.

2. Generative model
Clustering
- We try to keep the training set equally sampled across different views.
- We manually draw the upper-body bounding boxes for all training examples and label them with the actors' names.

2. Generative model
- We cluster the blobs of each actor using a constrained agglomerative clustering.
- For every actor with n training images we get n sets of blobs (f_1, f_2, ..., f_n), with a varying number of blobs in each set.
- Each blob in the first set f_1 is initialized as a singleton cluster. We then compute pairwise matches between those clusters and the blobs in the next set f_2. At each step, for each cluster, we assign at most one blob (its nearest neighbor) if it is closer than a threshold.
- Each cluster is represented by the mean value of its blobs.
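A minimal sketch of this constrained agglomerative clustering, under simplifying assumptions of our own (1-D blob features and a greedy nearest-neighbor pass instead of the authors' full pairwise matching):

```python
def grow_clusters(blob_sets, threshold):
    """Clusters start as singletons from the first set f_1; each later
    set contributes at most one blob per cluster (its nearest neighbor)
    when within `threshold`, and each blob joins at most one cluster.
    Each cluster is summarized by the mean of its members."""
    clusters = [[b] for b in blob_sets[0]]
    for blobs in blob_sets[1:]:
        taken = set()  # each blob may be assigned to at most one cluster
        for cl in clusters:
            mean = sum(cl) / len(cl)
            best, best_d = None, threshold
            for j, b in enumerate(blobs):
                d = abs(b - mean)
                if j not in taken and d < best_d:
                    best, best_d = j, d
            if best is not None:
                cl.append(blobs[best])
                taken.add(best)
    return [sum(cl) / len(cl) for cl in clusters]
```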

2. Generative model

3. Actor detection
1. For each actor we first perform a search-space reduction using kNN search.
2. We scan a sliding window and search for the most likely location of the actor.
3. We perform a multi-actor non-maxima suppression over all the detected actors.
4. We report the best position and scale for each detected actor.
- Our framework searches for actors over a variety of scales, from foreground (larger scales) to background (smaller scales).
- (e) A binary map can be used to reject some of the windows without processing; in practice we use separate binary maps for head and torso features.
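Step 3 above, the multi-actor non-maxima suppression, can be sketched as a standard greedy NMS over the pooled detections (the poster does not spell out its exact variant; this is one common formulation, with our own detection-tuple layout):

```python
def multi_actor_nms(detections, iou_thresh=0.3):
    """Greedy NMS across all actors: keep the highest-scoring window,
    drop any window overlapping a kept one, repeat.
    Each detection is (score, actor, x, y, w, h)."""
    def iou(a, b):
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
        iy = max(0, min(ay + ah, by + bh) - max(ay, by))
        inter = ix * iy
        union = aw * ah + bw * bh - inter
        return inter / union if union else 0.0
    keep = []
    for det in sorted(detections, reverse=True):  # best score first
        if all(iou(det[2:], k[2:]) <= iou_thresh for k in keep):
            keep.append(det)
    return keep
```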

3. Actor detection
Search-space reduction
- This pre-refinement is based only on the color, size, and shape parameters.
- IDX and D_7 are N × k matrices, where N is the number of clusters in the given actor model. Each row of IDX contains the indices of the k closest neighbours in B, corresponding to the k smallest distances in D_7.
- Building inverted indices: for each unique blob B in the kNN-refined set IDX, we store the corresponding clusters in C_a and the respective distances, ensuring that each distance is less than a threshold τ_1.
- Then, for each blob within the sliding window, we only need to compare it against its entries in the inverted index table instead of doing an exhaustive search.
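The inverted-index construction can be sketched directly from the IDX/D_7 description (an illustrative sketch; the dict-of-lists layout is our choice):

```python
def build_inverted_index(idx, d7, tau1):
    """`idx[n][k]` is the k-th nearest image blob for model cluster n,
    `d7[n][k]` the matching distance in the 7-D color/size/shape
    subspace. Returns blob -> list of (cluster, distance) pairs,
    keeping only distances below tau1."""
    inv = {}
    for n, (nbrs, dists) in enumerate(zip(idx, d7)):
        for blob, dist in zip(nbrs, dists):
            if dist < tau1:
                inv.setdefault(blob, []).append((n, dist))
    return inv
```

At query time, a window blob is compared only against `inv[blob]` rather than against every cluster of every actor model.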

3. Actor detection
Sliding-window search
- Many windows can be rejected without any processing, using an importance term based on a binary map.
- B: the set of all blobs detected in the image; C_a: the set of cluster centers in the model for a given actor a; the sliding window is at position (x, y) and scale s.
- In practice we compute MSCR features at the best available scale and then shift and scale the blobs accordingly while searching at different scales.
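The binary-map rejection test reduces to checking whether any matched blob falls inside the window. A minimal sketch (our own formulation; the poster keeps separate maps for head and torso features, collapsed into one map here for brevity):

```python
def window_has_support(binary_map, x, y, w, h):
    """`binary_map[r][c]` is True where some image blob matched the
    actor model. A window with no support inside it can be skipped
    without scoring."""
    return any(binary_map[r][c]
               for r in range(y, min(y + h, len(binary_map)))
               for c in range(x, min(x + w, len(binary_map[0]))))
```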

3. Actor detection
- The similarity function between model cluster C^a_i and the corresponding matched blob B_j is defined in the nine-dimensional feature space, where C^a_i is the center of cluster i in the actor model and Σ^a_i is its covariance matrix.
- We compute the assignment m_ij such that each blob in the sliding window is assigned to at most one cluster, and each cluster is assigned to at most one blob.
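The one-to-one assignment m_ij can be sketched with a greedy best-first matching (an illustrative simplification: an isotropic Gaussian similarity stands in for the poster's per-cluster covariance Σ^a_i, and greedy matching stands in for whatever solver the authors use):

```python
import math

def match_blobs(clusters, blobs, sigma=1.0):
    """Greedy one-to-one matching: each window blob matches at most one
    model cluster and vice versa, taking the highest-similarity pairs
    first. Similarity is an isotropic Gaussian in feature space."""
    def sim(c, b):
        d2 = sum((ci - bi) ** 2 for ci, bi in zip(c, b))
        return math.exp(-0.5 * d2 / sigma ** 2)
    pairs = sorted(((sim(c, b), i, j)
                    for i, c in enumerate(clusters)
                    for j, b in enumerate(blobs)), reverse=True)
    used_c, used_b, m = set(), set(), {}
    for s, i, j in pairs:
        if i not in used_c and j not in used_b:
            m[(i, j)] = s
            used_c.add(i)
            used_b.add(j)
    return m
```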

3. Actor detection
- D_{t−1}: the detections from the previous frame; Σ_pos: the positional covariance term. (We have found that the terms l_{1,1}, l_{1,0}, and Σ_pos are in fact independent of the choice of actor or movie.)
- t_0 is used to reject all detections with a score below a threshold value.

4. Experiments
Dataset: ROPE (443 frames)

4. Experiments
Results

4. Experiments

5. Conclusions
- We have presented a generative appearance model for detecting and naming actors in movies that can be learned from a small number of training examples.
- Results show a significant increase in coverage (recall) for actor detection while maintaining high precision.
- We also plan to investigate weakly supervised methods by extracting actor labels.