Adrià Recasens, Aditya Khosla, Carl Vondrick, Antonio Torralba


Where are they looking? Adrià Recasens, Aditya Khosla, Carl Vondrick, Antonio Torralba -- Presented by Yinan Zhao

Outline Motivation Approach Dataset Experiments Extension

Motivation Humans have a remarkable ability to follow gaze Crucial for interacting with other people and the environment

Motivation Humans have a remarkable ability to follow gaze Crucial for interacting with other people and the environment Joint attention is an important part of early language learning for children.

Motivation No-look pass by Magic Johnson

Motivation Is it possible for machines to perform gaze-following in natural settings without restrictive assumptions when only a single view is available?

Outline Motivation Approach Dataset Experiments Extension

Approach How do humans tend to follow gaze?

Approach How do humans tend to follow gaze? First, look at the person’s head and eyes to estimate their field of view (head detection)

Approach How do humans tend to follow gaze? First, look at the person’s head and eyes to estimate their field of view

Approach How do humans tend to follow gaze? First, look at the person’s head and eyes to estimate their field of view Subsequently, reason about salient objects from their perspective

Approach

Approach Convolutional layers combined with FC layers?

Approach Saliency Pathway Sees the full image but not the person’s location Produces a spatial map of size D×D (D=13) We hope it learns to find salient objects, independent of the person’s viewpoint

Approach Gaze Pathway Only has access to the close-up image of the person’s head and their location Produces a spatial map of the same size D×D (D=13) Expected to learn to predict the direction of gaze Head orientation is modeled implicitly
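To make the interplay of the two pathways concrete, here is a minimal NumPy sketch of the element-wise fusion of the two D×D maps (toy hand-made maps rather than learned ones, and `combine_pathways` is my name for the step): a cell scores highly only if it is both salient and inside the person's estimated field of view.

```python
import numpy as np

D = 13  # spatial map resolution from the slides

def combine_pathways(saliency_map, gaze_mask):
    """Fuse the two D x D pathway outputs with an element-wise product,
    so a cell scores highly only if it is salient *and* in the gaze cone."""
    assert saliency_map.shape == gaze_mask.shape == (D, D)
    return saliency_map * gaze_mask

# Toy inputs: one salient object at cell (3, 4), a crude vertical gaze "cone"
saliency = np.zeros((D, D)); saliency[3, 4] = 1.0
gaze = np.zeros((D, D));     gaze[:, 3:6] = 1.0
combined = combine_pathways(saliency, gaze)
# The only surviving peak is the salient cell that lies inside the gaze cone.
```

The multiplicative fusion is what lets the pathway roles emerge from gaze supervision alone: neither map is supervised directly, but only their product is scored.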

Approach Shifted Grids Formulate the problem as classification, which supports multimodal outputs naturally Quantize the fixation location y into an N×N grid Large N: harder learning, since there is no gradual penalty across spatial classes Small N: poor precision Shifted grids increase resolution while keeping learning easy.

Approach Shifted Grids Formulate the problem as classification, which supports multimodal outputs naturally Quantize the fixation location y into an N×N grid Large N: harder learning, since there is no gradual penalty across spatial classes Small N: poor precision Shifted grids increase resolution while keeping learning easy. Other approaches for multimodal distributions?
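The quantization step can be written down directly; a minimal sketch, assuming fixation coordinates normalized to [0, 1] (`fixation_to_class` is my helper name, not from the paper):

```python
N = 5  # grid cells per side, as in the paper's shifted grids

def fixation_to_class(x, y, n=N):
    """Quantize a normalized fixation (x, y) in [0, 1] into one of n*n
    class labels, row-major, clamping x == 1.0 / y == 1.0 into the last cell."""
    col = min(int(x * n), n - 1)
    row = min(int(y * n), n - 1)
    return row * n + col
```

Training then minimizes a softmax loss over these labels, one classification problem per shifted grid.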

Approach Shifted Grids Solve several overlapping classification problems Average shifted outputs to produce the final prediction

Approach Shifted Grids Solve several overlapping classification problems Average shifted outputs to produce the final prediction The shifted cells now have different values, so resolution is increased!
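The averaging step can be made concrete with a small NumPy sketch (toy sizes; `paint` and `fuse` are my names, and the pixel-shift scheme here is a simplification of the paper's shifted grids): each shifted grid paints its cell probabilities onto a dense map, and the maps are averaged, so cell boundaries from different shifts interleave and sharpen the result.

```python
import numpy as np

N = 5           # grid cells per side in each shifted classification
H = W = 15      # toy output resolution

def paint(cell_probs, dy, dx):
    """Expand one shifted N x N classification output to an H x W map:
    every pixel receives the probability of the grid cell it falls in,
    with the whole grid shifted by (dy, dx) pixels."""
    out = np.zeros((H, W))
    ch, cw = H // N, W // N
    for y in range(H):
        for x in range(W):
            gy = np.clip((y - dy) // ch, 0, N - 1)
            gx = np.clip((x - dx) // cw, 0, N - 1)
            out[y, x] = cell_probs[gy, gx]
    return out

def fuse(shifted_outputs):
    """Average the dense maps produced by all shifted grids."""
    return np.mean([paint(p, dy, dx) for p, (dy, dx) in shifted_outputs], axis=0)

# Two grids, both confident about cell (2, 2), shifted by 0 and by 1 pixel:
p = np.zeros((N, N)); p[2, 2] = 1.0
heat = fuse([(p, (0, 0)), (p, (1, 1))])
# Only pixels inside *both* shifted cells keep probability 1.0, so the
# fused map localizes more finely than any single N x N grid could.
```

Each individual problem stays an easy N×N classification, yet the averaged output has sub-cell resolution.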

Approach Training Differentiable. End-to-end using backpropagation.

Approach Training Differentiable. End-to-end using backpropagation. softmax loss

Approach Training Differentiable. End-to-end using backpropagation. Supervision on gaze fixations only The roles of the saliency and gaze pathways emerge automatically (softmax loss)
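The "softmax loss" on the slide is ordinary cross-entropy over the N×N grid classes, with one such loss per shifted grid; a minimal numerically stable sketch:

```python
import numpy as np

def softmax_loss(logits, target_class):
    """Cross-entropy over flattened grid-cell classes, computed stably."""
    z = logits - logits.max()                   # avoid exp overflow
    log_probs = z - np.log(np.exp(z).sum())     # log-softmax
    return -log_probs[target_class]

# Uniform logits over 25 cells: loss is log(25), i.e. maximum uncertainty.
loss = softmax_loss(np.zeros(25), target_class=12)
```

Since every operation is differentiable, this loss backpropagates through both pathways jointly, which is why their division of labor emerges without pathway-specific supervision.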

Outline Motivation Approach Dataset Experiments Extension

Dataset GazeFollow Large-scale dataset annotated with the location where people are looking 1,548 images from SUN, 33,790 from MS COCO, 9,135 from Actions 40, 7,791 from PASCAL, 508 from ImageNet, 198,097 from Places Annotated with AMT (Amazon Mechanical Turk): mark the center of the eyes and where the person is looking In total: 130,339 people in 122,143 images with the gaze location inside the image

Dataset GazeFollow

Outline Motivation Approach Dataset Experiments Extension

Experiments Implementation AlexNet architecture Initialization Saliency pathway: Places-CNN Gaze pathway: ImageNet-CNN Augment training data Flips Random crops Train/Test 4,782 people for testing, the rest for training Uniform fixation locations in the test set 10 gaze annotations per person in the test set 1×1×256 kernel, N=5 shifted grids
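One subtlety of the flip augmentation is that mirroring the image must also mirror the eye and gaze annotations; a minimal sketch, assuming (x, y) coordinates normalized to [0, 1] (`hflip_example` is my helper name, not from the paper):

```python
import numpy as np

def hflip_example(image, eye_xy, gaze_xy):
    """Horizontally flip an image together with its (x, y) annotations."""
    flipped = image[:, ::-1]                    # mirror the width axis
    mirror = lambda p: (1.0 - p[0], p[1])       # x -> 1 - x, y unchanged
    return flipped, mirror(eye_xy), mirror(gaze_xy)

img = np.arange(6.0).reshape(2, 3)              # toy 2x3 "image"
f_img, f_eye, f_gaze = hflip_example(img, (0.2, 0.5), (0.9, 0.1))
```

Random crops need the analogous correction: the fixation and eye positions must be re-expressed in the cropped frame (and examples whose fixation falls outside the crop discarded).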

Experiments Result Effectiveness of shifted grids

Experiments Result Effectiveness of shifted grids Outperforms all baselines by a clear margin

Experiments Result Effectiveness of shifted grids Outperforms all baselines by a clear margin Still far from human performance

Outline Motivation Approach Dataset Experiments Extension

Extension Gaze out of frame Extension to videos? Motion? Co-gaze following 2.5D

Reference
[1] Recasens, A., Khosla, A., Vondrick, C., Torralba, A. "Where are they looking?" Advances in Neural Information Processing Systems (NIPS), 2015.
[2] Pusiol, G., et al. "Discovering the signatures of joint attention in child-caregiver interaction." Proceedings of the 36th Annual Meeting of the Cognitive Science Society, Quebec City, Canada, 2014.
[3] Krizhevsky, A., Sutskever, I., Hinton, G. E. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems (NIPS), 2012.
[4] http://giphy.com/search/history-of-magic

Thanks!