Where are they looking? Adrià Recasens, Aditya Khosla, Carl Vondrick, Antonio Torralba -- Presented by Yinan Zhao
Outline Motivation Approach Dataset Experiments Extension
Motivation Humans have a remarkable ability to follow gaze Crucial in interaction with other people and the environment Joint attention is an important part of early language learning for children.
Motivation No-look pass by Magic Johnson
Motivation Is it possible for machines to perform gaze-following in natural settings without restrictive assumptions when only a single view is available?
Outline Motivation Approach Dataset Experiments Extension
Approach How do humans tend to follow gaze? First, look at the person's head and eyes to estimate their field of view (head detection) Subsequently, reason about salient objects in their perspective
Approach Convolutional layers combined with fully connected (FC) layers?
Approach Saliency Pathway Sees the full image but not the person's location Produces a spatial map of size D×D (D=13) We hope it learns to find salient objects, independent of the person's viewpoint
Approach Gaze Pathway Only has access to a close-up image of the person's head and their location Produces a spatial map of the same size D×D (D=13) Expected to learn to predict the direction of gaze; head orientation is modeled implicitly
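The two pathway outputs are combined by element-wise multiplication, so a location is favored only if it is both salient and along the person's line of sight. A minimal NumPy sketch (the two maps here are random stand-ins for the real convolutional outputs):

```python
import numpy as np

D = 13  # spatial map size used in the paper

rng = np.random.default_rng(0)
# Stand-ins for the two pathway outputs; in the real model these come
# from the saliency and gaze convolutional networks.
saliency_map = rng.random((D, D))   # salient regions of the full image
gaze_mask = rng.random((D, D))      # likely gaze direction of the person

# Element-wise product: a cell must score high in BOTH maps to survive.
combined = saliency_map * gaze_mask
```

In the full model, this D×D product then feeds the fully connected layers that produce the shifted-grid outputs.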
Approach Shifted Grids Formulate the problem as classification, which supports multimodal outputs naturally Quantize the fixation location y into an N×N grid Large N: harder learning, since there is no gradual penalty across spatial categories Small N: poor precision Shifted grids increase resolution while keeping learning easy. Other approaches for multimodal distributions?
Approach Shifted Grids Solve several overlapping classification problems Average the shifted outputs to produce the final prediction Overlapping cells now take different values, so the effective resolution is increased!
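The averaging step can be sketched in NumPy. The grid offsets and per-head probability maps below are illustrative stand-ins, not the paper's exact values; the point is that averaging several coarse, shifted 5×5 predictions yields a finer combined heatmap:

```python
import numpy as np

N = 5          # grid size per classification head (the paper uses N=5)
R = 15         # fine output resolution after averaging (illustrative)
cell = R // N  # each coarse grid cell covers a cell x cell block

# Each head predicts over the same N x N grid, shifted by a few fine
# pixels; offsets and probabilities here are illustrative.
offsets = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 2)]
rng = np.random.default_rng(1)
heatmap = np.zeros((R, R))
for dy, dx in offsets:
    probs = rng.random((N, N))
    probs /= probs.sum()              # normalize like a softmax output
    # Paint each coarse cell onto the fine map at its shifted position.
    for i in range(N):
        for j in range(N):
            y0 = min(i * cell + dy, R - cell)
            x0 = min(j * cell + dx, R - cell)
            heatmap[y0:y0 + cell, x0:x0 + cell] += probs[i, j]
heatmap /= len(offsets)               # average the shifted outputs
y, x = np.unravel_index(heatmap.argmax(), heatmap.shape)
```

Because the shifted heads disagree only at cell boundaries, their average varies within each coarse cell, giving a prediction finer than any single N×N grid.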
Approach Training The whole model is differentiable and trained end-to-end with backpropagation, using a softmax loss Supervision is on gaze fixations only; the roles of the saliency and gaze pathways emerge automatically
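A minimal sketch of the loss for one grid head, with hypothetical logits: the ground-truth fixation is quantized into its grid cell, and that cell index is the target of a softmax cross-entropy (during training this is summed over the shifted heads):

```python
import numpy as np

N = 5
# Hypothetical logits from one shifted-grid head (N*N spatial classes).
rng = np.random.default_rng(2)
logits = rng.standard_normal(N * N)

# Quantize a ground-truth fixation (normalized coordinates in [0, 1])
# into its grid cell; that flat cell index is the classification target.
gt_y, gt_x = 0.62, 0.31
target = min(int(gt_y * N), N - 1) * N + min(int(gt_x * N), N - 1)

# Numerically stable softmax followed by cross-entropy.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
loss = -np.log(probs[target])
```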
Outline Motivation Approach Dataset Experiments Extension
Dataset GazeFollow A large-scale dataset annotated with the locations where people are looking Images: 1,548 from SUN, 33,790 from MS COCO, 9,135 from Actions 40, 7,791 from PASCAL, 508 from ImageNet, 198,097 from Places Annotated with AMT: workers mark the center of the eyes and where the person is looking In total: 130,339 people in 122,143 images with the gaze location inside the image
Outline Motivation Approach Dataset Experiments Extension
Experiments Implementation AlexNet architecture Initialization: saliency pathway from Places-CNN, gaze pathway from ImageNet-CNN Training data augmentation: flips and random crops Train/Test: 4,782 people for testing, the rest for training; uniform fixation locations in the test set; 10 gaze annotations per person in the test set Architecture details: 1×1×256 kernel, N=5 shifted grids, ground truth (GT)
Experiments Results Shifted grids are effective Outperforms all baselines by a clear margin Still far from human performance
Outline Motivation Approach Dataset Experiments Extension
Extension Gaze out of frame Extension to videos? Motion? Co-gaze following 2.5D
References
[1] Recasens, Adrià, et al. "Where are they looking?" Advances in Neural Information Processing Systems. 2015.
[2] Pusiol, Guido, et al. "Discovering the signatures of joint attention in child-caregiver interaction." Proceedings of the 36th Annual Meeting of the Cognitive Science Society, Quebec City, Canada. 2014.
[3] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "ImageNet classification with deep convolutional neural networks." Advances in Neural Information Processing Systems. 2012.
[4] http://giphy.com/search/history-of-magic
Thanks!