Human Action Recognition across Datasets by Foreground-weighted Histogram Decomposition Waqas Sultani, Imran Saleemi CVPR 2014.

Slides:



Advertisements
Similar presentations
Image Retrieval With Relevant Feedback Hayati Cam & Ozge Cavus IMAGE RETRIEVAL WITH RELEVANCE FEEDBACK Hayati CAM Ozge CAVUS.
Advertisements

Complex Networks for Representation and Characterization of Images For CS790g Project Bingdong Li 9/23/2009.
Recognizing Human Actions by Attributes CVPR2011 Jingen Liu, Benjamin Kuipers, Silvio Savarese Dept. of Electrical Engineering and Computer Science University.
Foreground Focus: Finding Meaningful Features in Unlabeled Images Yong Jae Lee and Kristen Grauman University of Texas at Austin.
Limin Wang, Yu Qiao, and Xiaoou Tang
Human Identity Recognition in Aerial Images Omar Oreifej Ramin Mehran Mubarak Shah CVPR 2010, June Computer Vision Lab of UCF.
Carolina Galleguillos, Brian McFee, Serge Belongie, Gert Lanckriet Computer Science and Engineering Department Electrical and Computer Engineering Department.
Multi-layer Orthogonal Codebook for Image Classification Presented by Xia Li.
Patch to the Future: Unsupervised Visual Prediction
Activity Recognition Aneeq Zia. Agenda What is activity recognition Typical methods used for action recognition “Evaluation of local spatio-temporal features.
Transferable Dictionary Pair based Cross-view Action Recognition Lin Hong.
Addressing the Medical Image Annotation Task using visual words representation Uri Avni, Tel Aviv University, Israel Hayit GreenspanTel Aviv University,
MPEG-4 Objective Standardize algorithms for audiovisual coding in multimedia applications allowing for Interactivity High compression Scalability of audio.
Foreground Modeling The Shape of Things that Came Nathan Jacobs Advisor: Robert Pless Computer Science Washington University in St. Louis.
Robust Object Tracking via Sparsity-based Collaborative Model
Texture Segmentation Based on Voting of Blocks, Bayesian Flooding and Region Merging C. Panagiotakis (1), I. Grinias (2) and G. Tziritas (3)
Human-Computer Interaction Human-Computer Interaction Segmentation Hanyang University Jong-Il Park.
Ghunhui Gu, Joseph J. Lim, Pablo Arbeláez, Jitendra Malik University of California at Berkeley Berkeley, CA
More sliding window detection: Discriminative part-based models Many slides based on P. FelzenszwalbP. Felzenszwalb.
Local Descriptors for Spatio-Temporal Recognition
Learning to Detect A Salient Object Reporter: 鄭綱 (3/2)
Recognition using Regions CVPR Outline Introduction Overview of the Approach Experimental Results Conclusion.
EE 7730 Image Segmentation.
Region Segmentation. Find sets of pixels, such that All pixels in region i satisfy some constraint of similarity.
Generic Object Detection using Feature Maps Oscar Danielsson Stefan Carlsson
Ensemble Tracking Shai Avidan IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE February 2007.
Video Google: Text Retrieval Approach to Object Matching in Videos Authors: Josef Sivic and Andrew Zisserman University of Oxford ICCV 2003.
Highlights Lecture on the image part (10) Automatic Perception 16
Multi-camera Video Surveillance: Detection, Occlusion Handling, Tracking and Event Recognition Oytun Akman.
Multiple Object Class Detection with a Generative Model K. Mikolajczyk, B. Leibe and B. Schiele Carolina Galleguillos.
Computer Vision - A Modern Approach Set: Segmentation Slides by D.A. Forsyth Segmentation and Grouping Motivation: not information is evidence Obtain a.
Review: Intro to recognition Recognition tasks Machine learning approach: training, testing, generalization Example classifiers Nearest neighbor Linear.
Exercise Session 10 – Image Categorization
Image Segmentation Rob Atlas Nick Bridle Evan Radkoff.
Bag-of-Words based Image Classification Joost van de Weijer.
Bag of Video-Words Video Representation
What’s Making That Sound ?
Action recognition with improved trajectories
Watch, Listen and Learn Sonal Gupta, Joohyun Kim, Kristen Grauman and Raymond Mooney -Pratiksha Shah.
Nonparametric Part Transfer for Fine-grained Recognition Presenter Byungju Kim.
Chapter 14: SEGMENTATION BY CLUSTERING 1. 2 Outline Introduction Human Vision & Gestalt Properties Applications – Background Subtraction – Shot Boundary.
Svetlana Lazebnik, Cordelia Schmid, Jean Ponce
Object Detection with Discriminatively Trained Part Based Models
Semantic Embedding Space for Zero ­ Shot Action Recognition Xun XuTimothy HospedalesShaogang GongAuthors: Computer Vision Group Queen Mary University of.
Beyond Sliding Windows: Object Localization by Efficient Subwindow Search The best paper prize at CVPR 2008.
Chao-Yeh Chen and Kristen Grauman University of Texas at Austin Efficient Activity Detection with Max- Subgraph Search.
A DISTRIBUTION BASED VIDEO REPRESENTATION FOR HUMAN ACTION RECOGNITION Yan Song, Sheng Tang, Yan-Tao Zheng, Tat-Seng Chua, Yongdong Zhang, Shouxun Lin.
CVPR2013 Poster Detecting and Naming Actors in Movies using Generative Appearance Models.
Segmentation of Vehicles in Traffic Video Tun-Yu Chiang Wilson Lau.
Recognition Using Visual Phrases
Dense Color Moment: A New Discriminative Color Descriptor Kylie Gorman, Mentor: Yang Zhang University of Central Florida I.Problem:  Create Robust Discriminative.
Colour and Texture. Extract 3-D information Using Vision Extract 3-D information for performing certain tasks such as manipulation, navigation, and recognition.
Object Recognition as Ranking Holistic Figure-Ground Hypotheses Fuxin Li and Joao Carreira and Cristian Sminchisescu 1.
Line Matching Jonghee Park GIST CV-Lab..  Lines –Fundamental feature in many computer vision fields 3D reconstruction, SLAM, motion estimation –Useful.
Video Google: Text Retrieval Approach to Object Matching in Videos Authors: Josef Sivic and Andrew Zisserman University of Oxford ICCV 2003.
1 Bilinear Classifiers for Visual Recognition Computational Vision Lab. University of California Irvine To be presented in NIPS 2009 Hamed Pirsiavash Deva.
Object detection with deformable part-based models
Data Driven Attributes for Action Detection
Nonparametric Semantic Segmentation
Saliency detection Donghun Yeo CV Lab..
ICCV Hierarchical Part Matching for Fine-Grained Image Classification
R-CNN region By Ilia Iofedov 11/11/2018 BGU, DNN course 2016.
Cheng-Ming Huang, Wen-Hung Liao Department of Computer Science
Vehicle Segmentation and Tracking in the Presence of Occlusions
CS 1674: Intro to Computer Vision Scene Recognition
The Open World of Micro-Videos
Xiaodan Liang Sun Yat-Sen University
Human Action Recognition Week 8
Discriminative Probabilistic Models for Relational Data
EM Algorithm and its Applications
Presentation transcript:

Human Action Recognition across Datasets by Foreground-weighted Histogram Decomposition Waqas Sultani, Imran Saleemi CVPR 2014

Motivation

Dense STIP for cross dataset recognition Training Testing Accuracy (avg) UCF50 70 % UCF50 HMDB51 55.7 % Olympic Sports 71.8 % UCF50 Olympic Sports 16.67 % Recognition Accuracy drops across the datasets! Training and Testing is done on similar actions across the datasets

Is the background responsible for this drop in accuracy? UCF50 HMDB51 Is the background responsible for this drop in accuracy?

Do action classifiers learn background?

Experiment Two recent challenging datasets: UCF YouTube, 1100 Videos, 11 Actions UCF Sports, 150 Videos, 10 Actions Actor Bounding Boxes are available for these datasets Extract STIP (HOG, HOF) 50% Spatial-Temporal overlaps. Single scale Foreground Features: 50% overlap with bounding box Background Features: Less than 50% overlap with bounding box Dense Features: All features

Experiment Experimental Setup: Leave one group out for UCF YouTube Leave one actor out for UCF Sports

STIP Features UCF YouTube UCF Sports UCF YouTube Biking Video UCF Sports Running Video Foreground Features STIP Features UCF YouTube UCF Sports Foreground 59.80 % 71.92 % Features with more than 50% overlap with actor bounding box

STIP Features UCF YouTube UCF Sports UCF YouTube Biking Video UCF Sports Running Video Dense Features STIP Features UCF YouTube UCF Sports Dense 60.6% 75.34 % Foreground 59.80 % 71.92 % All features

Background 55.27 % 73.97 % UCF YouTube Biking Video UCF Sports Running Video Background Features Comparable performance with only background features without even observing the actor STIP Features UCF YouTube UCF Sports 59.80 % 71.92 % Dense 60.6 % 75.34 % Background 55.27 % 73.97 % Foreground Features with less than 50% overlap with actor bounding box

Background should be diverse but not discriminative !

As action datasets are becoming large and more complex, their background may become more discriminative!

Action class discriminatively using GIST Experiment 1 Compute GIST descriptor for KTH, UCF50, HMDB51 datasets Cluster GIST descriptor in ‘k’ clusters using K-means clustering Estimate point-wise mutual information between each cluster and action class 𝑃𝑀𝐼 𝑥;𝑦 =𝑙𝑜𝑔 𝑝 𝑥,𝑦 𝑝 𝑥 𝑝(𝑦) =𝑙𝑜𝑔 𝑝(𝑥|𝑦) 𝑝(𝑥) =𝑙𝑜𝑔 𝑝(𝑦|𝑥) 𝑝(𝑦)

Action class discriminatively using GIST Experiment 1 (Continue) PMI Distance Matrices KTH UCF50 HMDB51 Clusters 100 200 300 KTH 5.12 8.25 10.4 HMDB51 7.15 11.07 14.06 UCF50 7.97 11.79 14.38 Small value means more interclass confusion Based on scene information alone, KTH is harder to classify than HMDB51 and UCF50

Action class discriminatively using GIST Experiment 2 Compute GIST descriptor for every 50th frame in KTH, UCF50, HMDB51 datasets Construct a Graph, 𝐺=(𝑉,𝐸) V= 𝑣 𝑖 , i ∈ 1, . . .,𝑛 is the set of descriptors E= 𝑒 𝑖𝑗 , i ∈ 1, . . .,𝑛 ,j ∈ 1, . . .,𝑛 𝑒 𝑖𝑗 = 𝑣 𝑖 − 𝑣 𝑗 2 Graph Connected Component Analysis is performed by threshold ‘E’

Action class discriminatively using GIST Experiment 2 (Continue)

Foreground Focused Representation Our Approach Foreground Focused Representation

Foreground Focused Representation Action localization Binary foreground/Background Segmentation Very challenging and difficult, akin to introducing a new problem to solve the first. Instead Estimate the confidence in each pixel being a part of the foreground, and use it obtain video representation

Motion Gradients Color Gradients

Visual Saliency Fully connected graph is built, where Edge between two pixel is given as By computing stationary distribution of Markov chain, new graph is build Equilibrium distribution of chain is used as per pixel saliency

Visual Saliency Due to camera motion, video saliency is noisy Graph based Image Saliency ( NIPS 2006)

Coherence of Foreground Confidence Initial aggregate of confidence map 𝑓 𝑎 = 𝑙𝑜𝑔 ( 𝑓 𝑚 𝑓 𝑐 + 𝑓 𝑠 +1) The score is max-normalized for each frame of a video The quality of labeling is given by: 𝐸 𝑤 = 𝜑 𝑝 𝜖 𝑉 𝐷 𝑝 𝑤 𝑝 + (𝑝,𝑞)𝜖 𝑉 𝑉 𝑤 𝑝 − 𝑤 𝑞 𝐷 𝑝 𝑤 𝑝 = ( 𝑓 𝑎 − 𝑤 𝑝 ) 2 𝑉 𝑤 𝑝 − 𝑤 𝑞 =min⁡( ( 𝑤 𝑝 − 𝑤 𝑞 ) 2 ) ,𝛾)

Coherence of Foreground Confidence (continue) Inference The message the node p send to q is given by 𝐷 𝑝 𝑤 𝑝 +𝑉 𝑤 𝑝 − 𝑤 𝑞 + 𝑠∈ 𝑁 𝑝 \q 𝑚 𝑠→𝑝 𝑡−1 ( 𝑤 𝑝 ) 𝑤 𝑝 𝑚𝑖𝑛 The belief vector of node q is given by 𝑏 𝑞 𝑡 ( 𝑤 𝑞 )= 𝐷 𝑞 𝑤 𝑞 + 𝑠∈ 𝑁 𝑞 𝑚 𝑠→𝑡 𝑡 ( 𝑤 𝑞 )

Spatial-Temporal Coherence using 3D-MRF Obtain probability of each pixel being the foreground using Spatial-Temporal Coherence using 3D-MRF Motion Gradients Color Gradients Video Saliency Final weights

Examples: Final Foreground weights UCF50 Pull up UCF50 Basketball Olympic Sports Diving HMDB51 ride-horse UCF50 Golf swing HMDB51 ride-bike HMDB51 ride-bike Olympic Sports Pole vault

UCF YouTube Biking Video Traditional Bag of words UCF YouTube Biking Video Histogram Foreground words Background words

Weighted k-means To make codebook biased towards foreground features, use weights of features during clustering The confidence of each descriptors as being on foreground in given by: 𝑤 𝑖 = (𝑥,𝑦)∈ 𝑃 𝑖 𝑓 𝑎 (𝑥,𝑦) 𝑃 𝑖 The goal of clustering is to minimize the following energy function: 𝑗=1 𝐾 𝐶(𝑖,𝑗) 𝑤 𝑖 𝑥 𝑖 − 𝑧 𝑗 2 𝐶 𝑖𝑠 𝑋×𝑘 𝑖𝑠 𝑢𝑛𝑘𝑛𝑜𝑤𝑛 𝑚𝑒𝑚𝑠ℎ𝑖𝑝 𝑚𝑎𝑡𝑟𝑖𝑥

Example

Weighted Histogram To reduce the contribution of background features, use weights for each features of being foreground during quantization

UCF YouTube Biking Video Weighted bag of words UCF YouTube Biking Video Weighted Histogram Background words Foreground words Weighted-kmeans

Problem No separate foreground and background words or vocabulary The large number of background features can sum up to be significant.

UCF YouTube Biking Video Weighted bag of words UCF YouTube Biking Video Weighted Histogram Background words Foreground words Weighted-kmeans

UCF YouTube Biking Video Weighted bag of words UCF YouTube Biking Video Weighted Histogram Background words Foreground words Weighted-kmeans

UCF YouTube Biking Video Weighted bag of words UCF YouTube Biking Video Weighted Histogram Background words Foreground words Weighted-kmeans

Foreground confidence based Histogram decomposition Categorize spatiotemporal regions corresponding to different weights in ‘𝑅’ classes Compute Histograms for each region separately The regions of two videos that has same foreground confidence are compared only with each other The kernel function becomes ∆ 𝐻 𝑖 , 𝐻 𝑗 = 𝑟=1 𝑅 𝛼 𝑟 𝜃( ℎ 𝑟 𝑖 , ℎ 𝑟 𝑗 ) 𝛼 𝑖 ={1, 0.8, 0.6, 0.4, 0.2} 𝑅=5

Partition based weighted histograms 1 Features partitions based on weights

UCF50 Video Weighted Histograms Weighted Histograms HMDB51 Video Histogram Intersection Weights Partitions Weights Partitions Weighted Summations Final Similarity

Experimental Results Datasets used: UCF50, HMDB51, Olympic Sports Features used: STIP UCF50 Vs. HMDB51 10 common actions We choose actions which are visually similar: Biking, Golf Swing, Pull Ups, Horse Riding, Basketball UCF50 Vs. Olympic Sport 6 common actions: Basketball, Pole Vault, Tennis serve, Diving, Clean and Jerk, Throw Discus

Qualitative Results Biking UCF 50 HMDB51 Histogram Intersection= 0.1035 Weighted Histogram Intersection=0.1142 Weighted Histogram Decomposition =0.1295

Qualitative Results Golf Swing UCF 50 HMDB51 Histogram Intersection= 0.1684 Weighted Histogram Intersection=0.2740 Weighted Histogram Decomposition =0.3089

Quantitative Results Pull Ups UCF 50 HMDB51 Histogram Intersection= 0.2744 Weighted Histogram Intersection=0.5454 Weighted Histogram Decomposition =0.5586

Qualitative Results Training Testing Unweighted Weighted UCF50 70.00 74.20 77.85 HMDB51 55.70 60.00 68.70 65.30 69.30 68.00 63.3 64.00 68.67 Olympic Sports 71.80 73.95 69.79 31.25 33.33 16.67 32.29 47.91 Histogram Decomposition

Quantitative Results Confusion Matrix UCF50 classifiers on HMDB51 Unweighted Histogram Decomposition

Thank you