Where are they looking? Adrià Recasens, Aditya Khosla, Carl Vondrick, Antonio Torralba -- Presented by Yinan Zhao
Outline Motivation Approach Dataset Experiments Extension
Motivation Humans have a remarkable ability to follow gaze Crucial in interaction with other people and the environment Joint attention is an important part of early language learning for children.
Motivation No-look pass by Magic Johnson
Motivation Is it possible for machines to perform gaze-following in natural settings without restrictive assumptions when only a single view is available?
Outline Motivation Approach Dataset Experiments Extension
Approach How do humans tend to follow gaze? First, look at the person's head and eyes to estimate their field of view (head detection) Subsequently, reason about salient objects in their perspective
Approach Convolutional layers combined with fully connected (FC) layers?
Approach Saliency Pathway Sees the full image but not the person's location Produces a spatial map of size D×D (D=13) We hope it learns to find salient objects, independent of the person's viewpoint
Approach Gaze Pathway Only has access to a close-up image of the person's head and their location Produces a spatial map of the same size D×D (D=13) Expected to learn to predict the direction of gaze; head orientation is modeled implicitly
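The two pathway outputs are combined by element-wise multiplication, so a location is favored only if it is both salient and along the person's line of sight. A minimal NumPy sketch (the two maps here are random stand-ins for the real convolutional outputs):

```python
import numpy as np

D = 13  # spatial map size used in the paper

rng = np.random.default_rng(0)
# Stand-ins for the two pathway outputs; in the real model these come
# from the saliency and gaze convolutional networks.
saliency_map = rng.random((D, D))   # salient regions of the full image
gaze_mask = rng.random((D, D))      # likely gaze direction of the person

# Element-wise product: a cell must score high in BOTH maps to survive.
combined = saliency_map * gaze_mask
```

In the full model, this D×D product then feeds the fully connected layers that produce the shifted-grid outputs.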
Approach Shifted Grids Formulate the problem as classification, which supports multimodal outputs naturally Quantize the fixation location y into an N×N grid Large N: harder learning, since there is no gradual penalty across spatial categories Small N: poor precision Shifted grids increase resolution while keeping learning easy. Other approaches for multimodal distributions?
Approach Shifted Grids Solve several overlapping classification problems Average the shifted outputs to produce the final prediction Overlapping cells now take different values, so the effective resolution is increased!
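The averaging step can be sketched in NumPy. The grid offsets and per-head probability maps below are illustrative stand-ins, not the paper's exact values; the point is that averaging several coarse, shifted 5×5 predictions yields a finer combined heatmap:

```python
import numpy as np

N = 5          # grid size per classification head (the paper uses N=5)
R = 15         # fine output resolution after averaging (illustrative)
cell = R // N  # each coarse grid cell covers a cell x cell block

# Each head predicts over the same N x N grid, shifted by a few fine
# pixels; offsets and probabilities here are illustrative.
offsets = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2)]
rng = np.random.default_rng(1)
heatmap = np.zeros((R, R))
for dy, dx in offsets:
    probs = rng.random((N, N))
    probs /= probs.sum()              # normalize like a softmax output
    # Paint each coarse cell onto the fine map at its shifted position.
    for i in range(N):
        for j in range(N):
            y0 = min(i * cell + dy, R - cell)
            x0 = min(j * cell + dx, R - cell)
            heatmap[y0:y0 + cell, x0:x0 + cell] += probs[i, j]
heatmap /= len(offsets)               # average the shifted outputs
y, x = np.unravel_index(heatmap.argmax(), heatmap.shape)
```

Because the shifted heads disagree only at cell boundaries, their average varies within each coarse cell, giving a prediction finer than any single N×N grid.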
Approach Training The whole model is differentiable and trained end-to-end with backpropagation, using a softmax loss Supervision is on gaze fixations only; the roles of the saliency and gaze pathways emerge automatically
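A minimal sketch of the loss for one grid head, with hypothetical logits: the ground-truth fixation is quantized into its grid cell, and that cell index is the target of a softmax cross-entropy (during training this is summed over the shifted heads):

```python
import numpy as np

N = 5
# Hypothetical logits from one shifted-grid head (N*N spatial classes).
rng = np.random.default_rng(2)
logits = rng.standard_normal(N * N)

# Quantize a ground-truth fixation (normalized coordinates in [0, 1])
# into its grid cell; that flat cell index is the classification target.
gt_y, gt_x = 0.62, 0.31
target = min(int(gt_y * N), N - 1) * N + min(int(gt_x * N), N - 1)

# Numerically stable softmax followed by cross-entropy.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
loss = -np.log(probs[target])
```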
Outline Motivation Approach Dataset Experiments Extension
Dataset GazeFollow A large-scale dataset annotated with the locations where people are looking Images: 1,548 from SUN, 33,790 from MS COCO, 9,135 from Actions 40, 7,791 from PASCAL, 508 from ImageNet, 198,097 from Places Annotated with AMT: workers mark the center of the eyes and where the person is looking In total: 130,339 people in 122,143 images with the gaze location inside the image
Outline Motivation Approach Dataset Experiments Extension
Experiments Implementation AlexNet architecture Initialization: saliency pathway from Places-CNN, gaze pathway from ImageNet-CNN Training data augmentation: flips and random crops Train/Test: 4,782 people for testing, the rest for training; uniform fixation locations in the test set; 10 gaze annotations per person in the test set Architecture details: 1×1×256 kernel, N=5 shifted grids, ground truth (GT)
Experiments Results Shifted grids are effective Outperforms all baselines by a clear margin Still far from human performance
Outline Motivation Approach Dataset Experiments Extension
Extension Gaze out of frame Extension to videos? Motion? Co-gaze following 2.5D
References
[1] Recasens, Adrià, et al. "Where are they looking?" Advances in Neural Information Processing Systems. 2015.
[2] Pusiol, Guido, et al. "Discovering the signatures of joint attention in child-caregiver interaction." Proceedings of the 36th Annual Meeting of the Cognitive Science Society, Quebec City, Canada. 2014.
[3] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.
[4] http://giphy.com/search/history-of-magic
Thanks!