PANDA: Pose Aligned Networks for Deep Attribute Modeling Ning Zhang 1,2 Manohar Paluri 1 Marć Aurelio Ranzato 1 Trevor Darrell 2 Lumbomir Boudev 1 1 Facebook.

Slides:

Advertisements

Similar presentations

Semantic Contours from Inverse Detectors Bharath Hariharan et.al. (ICCV-11)

Advertisements

Attributes for Classifier Feedback Amar Parkash and Devi Parikh.

Rich feature Hierarchies for Accurate object detection and semantic segmentation Ross Girshick, Jeff Donahue, Trevor Darrell, Jitandra Malik (UC Berkeley)

Classification spotlights

Limin Wang, Yu Qiao, and Xiaoou Tang

3 Small Comments Alex Berg Stony Brook University I work on recognition: features – action recognition – alignment – detection – attributes – hierarchical.

Face detection Behold a state-of-the-art face detector! (Courtesy Boris Babenko)Boris Babenko.

Performance Evaluation Measures for Face Detection Algorithms Prag Sharma, Richard B. Reilly DSP Research Group, Department of Electronic and Electrical.

Structural Human Action Recognition from Still Images Moin Nabi Computer Vision Lab. ©IPM - Oct

Intelligent Systems Lab. Recognizing Human actions from Still Images with Latent Poses Authors: Weilong Yang, Yang Wang, and Greg Mori Simon Fraser University,

Face Alignment at 3000 FPS via Regressing Local Binary Features

Human Action Recognition by Learning Bases of Action Attributes and Parts.

Large-Scale Object Recognition with Weak Supervision

Student: Yao-Sheng Wang Advisor: Prof. Sheng-Jyh Wang ARTICULATED HUMAN DETECTION 1 Department of Electronics Engineering National Chiao Tung University.

DISCRIMINATIVE DECORELATION FOR CLUSTERING AND CLASSIFICATION ECCV 12 Bharath Hariharan, Jitandra Malik, and Deva Ramanan.

Poselets Michael Krainin CSE 590V Oct 18, Person Detection Dalal and Triggs ‘05 – Learn to classify pedestrians vs. background – HOG + linear SVM.

Object Recognition with Informative Features and Linear Classification Authors: Vidal-Naquet & Ullman Presenter: David Bradley.

Robust Real-time Object Detection by Paul Viola and Michael Jones ICCV 2001 Workshop on Statistical and Computation Theories of Vision Presentation by.

PANDA: Pose Aligned Networks for Deep Attribute Modeling Ning Zhang1;2, Manohar Paluri1, Marc’Aurelio Ranzato1, Trevor Darrell2, Lubomir Bourdev1 1: Facebook.

Methods in Leading Face Verification Algorithms

Spatial Pyramid Pooling in Deep Convolutional

Generic object detection with deformable part-based models

Wang, Z., et al. Presented by: Kayla Henneman October 27, 2014 WHO IS HERE: LOCATION AWARE FACE RECOGNITION.

Describing People: A Poselet-Based Approach to Attribute Classification Lubomir Bourdev 1,2 Subhransu Maji 1 Jitendra Malik 1 1 EECS U.C. Berkeley 2 Adobe.

School of Electronic Information Engineering, Tianjin University Human Action Recognition by Learning Bases of Action Attributes and Parts Jia pingping.

The Three R’s of Vision Jitendra Malik.

Detection, Segmentation and Fine-grained Localization

A Codebook-Free and Annotation-free Approach for Fine-Grained Image Categorization Authors Bangpeng Yao et al. Presenter Hyung-seok Lee ( 이형석 ) CVPR 2012.

Real-Time Detection, Alignment and Recognition of Human Faces Rogerio Schmidt Feris Changbo Hu Matthew Turk Pattern Recognition Project June 12, 2003.

VIP: Finding Important People in Images Clint Solomon Mathialagan Andrew C. Gallagher Dhruv Batra CVPR

The Viola/Jones Face Detector A “paradigmatic” method for real-time object detection Training is slow, but detection is very fast Key ideas Integral images.

Learning Features and Parts for Fine-Grained Recognition Authors: Jonathan Krause, Timnit Gebru, Jia Deng, Li-Jia Li, Li Fei-Fei ICPR, 2014 Presented by:

Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation.

Describing People: A Poselet-Based Approach to Attribute Classification.

Convolutional Restricted Boltzmann Machines for Feature Learning Mohammad Norouzi Advisor: Dr. Greg Mori Simon Fraser University 27 Nov

Unsupervised Visual Representation Learning by Context Prediction

Object Recognizing. Deep Learning Success in 2012 DeepNet and speech processing.

Fine-grained Fine-grained Recognition( 细粒度分类 ) 沈志强.

Face detection Many slides adapted from P. Viola.

Rich feature hierarchies for accurate object detection and semantic segmentation 2014 IEEE Conference on Computer Vision and Pattern Recognition Ross Girshick,

Spatial Localization and Detection

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition arXiv: v4 [cs.CV(CVPR)] 23 Apr 2015 Kaiming He, Xiangyu Zhang, Shaoqing.

Things iPhoto thinks are faces

FINGERTEC FACE ID FACE RECOGNITION Technology Overview.

Recent developments in object detection

Bangpeng Yao1, Xiaoye Jiang2, Aditya Khosla1,

Guillaume-Alexandre Bilodeau

Object Detection based on Segment Masks

Compact Bilinear Pooling

Object detection with deformable part-based models

Krishna Kumar Singh, Yong Jae Lee University of California, Davis

Regularizing Face Verification Nets To Discrete-Valued Pain Regression

Combining CNN with RNN for scene labeling (segmentation)

Part-Based Room Categorization for Household Service Robots

Summary Presentation.

Li Fei-Fei, UIUC Rob Fergus, MIT Antonio Torralba, MIT

CS6890 Deep Learning Weizhen Cai

R-CNN region By Ilia Iofedov 11/11/2018 BGU, DNN course 2016.

Above and below the object level

Adversarially Tuned Scene Generation

Object detection.

A Convolutional Neural Network Cascade For Face Detection

Computer Vision James Hays

Introduction to Neural Networks

Rob Fergus Computer Vision

Deep Learning Tutorial

Outline Background Motivation Proposed Model Experimental Results

Heterogeneous convolutional neural networks for visual recognition

Automatic Handwriting Generation

Presentation transcript:

PANDA: Pose Aligned Networks for Deep Attribute Modeling Ning Zhang 1,2 Manohar Paluri 1 Marć Aurelio Ranzato 1 Trevor Darrell 2 Lumbomir Boudev 1 1 Facebook AI Research 2 EECS, UC Berkeley Presented by: Sidharth Sahdev

Some of the material has been taken from the original CVPR’14 talk by Ning Zhang

Every Object has some Attributes associated with it. For instance, A chair has 4 legs, a seat, may have a backrest A monitor screen has attributes like its size (eg 27 inches ), resolution Animals have attributes like 4 legs usually (strips for zebra, black and white patches for panda, etc) This research work focuses on inferring human attributes like gender, hair style, clothes style, expressions, etc.

 Why is it cool? You maybe able to recognize a person, chair, table, backpacks, etc. But can you tell what are the features of the detected person which is apparent in the image but the conventional recognition tasks cannot quite do it.  This is where Attribute classification of an object comes in handy. Facial verification (current techniques require a frontal pose more or less and face challenges against low image quality, occlusion and pose variations) Visual search Tagging suggestions

Low resolution

Occlusion

Pose Variations

Before that – some technical jargon we need to know Transfer learning Poselets Attribute Classification Decaf

Facial attribute [ICCV’09 Kumar et al.]

Describing Objects by their Attributes [CVPR’09 Ali et al.]

Human pose estimation [Toshev et al. CVPR 14] Face verification [Taigman et al. CVPR 14]

Can we train a CNN from scratch? Lack of training data!

What if we finetune from ImageNet? Can CNNs do better than this? [Donahue et al. ICML 2014]

Problem Revisited Suppose I wear glasses Occluded by hair Affected by head pose Glasses signal is weak at the scale of a full person  So we need to localize objects by parts to start predicting

Part based approach [ICCV’11 Bourdev et al.] [ICCV’13 Zhang et al. & Joo et al.]

Gender Long hair Wear long pants Wear jeans Wear glasses Is sitting Is playing tennis Is dancing Decompose the image into parts

Part based models CNNs

Use CNNs to learn discriminative features Use poselets for their ability to simplify the learning task by decomposing the object into their canonical poses.

AlexNet type Architecture Each poselet patch is resized to 64x64, randomly jittered and flipped horizontally to improve generalization. The fc_attr is 576x150 dimension. After this, the network branches out one fc with 128 hidden units for each attribute. Each of the branch outputs a binary classifier of the attribute.

Solution – need to also incorporate the whole person bounding box region. It turns out that a more complex net is needed to do that. Therefore the authors used a pre- trained ImageNet model as the deep representation of the full image patch.

Holistic Concatenate to give the final representation Linear SVM Gender Short Sleeves Wear hat Wear shorts Long hair

Dataset What is the ground truth What is the attribute How big is it Is it big enough?

On an average there are 2061 training examples for each poselet Dataset: Attribute 25k

Average Precision on Attribute 25k

Wear jeans Wear hat Wear shorts

female Wear glasses Short hair

Incorrect prediction for short hair Incorrect predictions as wearing tshirts

[Kumar et al, ICCV 2009]

Combine part based model and deep learning Use of pose-normalized CNNs Testing and beating the previous state-of-the-art on 3 datasets (training on their own Facebook dataset of 25K images)