CSE 190 Neural Networks: How to train a network to look and see
Gary Cottrell
Week 9, Lecture 2
6/6/2018
Walker L. Cisler Memorial Science Lecture
Introduction
How do we deal with the high dimensionality of visual input?
Our field of view is about 200° horizontally and 130° vertically: a HUGE image. Compare that to the size of MNIST images! How do we deal with the high dimensionality of visual input? Sampling!
We have a foveated retina: we only have high resolution for about 2° of visual angle. We move our eyes about 3 times a second. That pencils out to about 172,000 times a day! So we sample 2° of visual angle at the highest resolution 172k times per day. Perhaps we could apply this idea to computer vision.
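The 172,000 figure can be reproduced with a quick calculation. The slide gives only the rate and the total, so the 16 waking hours used here are an assumption chosen because they match the stated number.

```python
# Saccades per day at 3 per second.
# WAKING_HOURS is an assumption; the slide does not state it.
SACCADES_PER_SECOND = 3
WAKING_HOURS = 16

saccades_per_day = SACCADES_PER_SECOND * 60 * 60 * WAKING_HOURS
print(saccades_per_day)  # 172800
```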
And we have (Kanan & Cottrell, 2010). We used a salience map to decide where to sample from an image, and stored fragments of the image. For a new image, we took new samples and figured out who or what it was by a kind of nearest-neighbor voting.
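The sample-and-vote idea can be sketched roughly as follows. This is an illustration under stated assumptions, not the published model: the function names, the 3x3 patch size, the zero padding, and the squared-distance nearest-neighbor rule are all hypothetical choices. The only things taken from the slides are that fixations are drawn from the salience map treated as a probability distribution, and that stored fragments vote on the label.

```python
import random
from collections import Counter

def sample_fixations(salience, n, rng):
    """Draw n (row, col) fixations with probability proportional to salience."""
    cells = [(r, c) for r, row in enumerate(salience) for c in range(len(row))]
    weights = [salience[r][c] for r, c in cells]
    return rng.choices(cells, weights=weights, k=n)

def extract_patch(image, center, half=1):
    """Flatten the (2*half+1)^2 patch around center; off-image pixels are 0."""
    r0, c0 = center
    h, w = len(image), len(image[0])
    return [image[r][c] if 0 <= r < h and 0 <= c < w else 0.0
            for r in range(r0 - half, r0 + half + 1)
            for c in range(c0 - half, c0 + half + 1)]

def sq_dist(a, b):
    """Squared Euclidean distance between two flat fragments."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def classify(image, salience, memory, n_fixations, rng):
    """Each sampled fragment votes for the label of its nearest stored fragment."""
    votes = Counter()
    for fixation in sample_fixations(salience, n_fixations, rng):
        patch = extract_patch(image, fixation)
        _, label = min(memory, key=lambda item: sq_dist(item[0], patch))
        votes[label] += 1
    return votes.most_common(1)[0][0]

# Toy data (made up): a uniformly bright 4x4 image, salient only in its interior.
image = [[1.0] * 4 for _ in range(4)]
salience = [[1.0 if r in (1, 2) and c in (1, 2) else 0.0 for c in range(4)]
            for r in range(4)]
memory = [([0.0] * 9, "dark"), ([1.0] * 9, "bright")]  # (fragment, label) pairs
result = classify(image, salience, memory, n_fixations=5, rng=random.Random(0))
print(result)  # bright
```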
Humans make ~170,000 saccades each day
What’s wrong with this picture?
The model sampled randomly from the image according to the probability distribution of the salience map. Clearly, we (humans, other animals) don’t do this. We can recognize a face in two fixations (Hsiao & Cottrell, 2008). Can we learn a policy for sampling from an image efficiently?
The Recurrent Attention Model
Researchers at DeepMind (purchased by Google for $400,000,000 in 2014) have developed a network that can “move its eyes” and recognize multiple objects in an image (Ba, Mnih, & Kavukcuoglu, ICLR 2015). It is trained end-to-end to sample from an image, decide the next location to look at, and output a classification. It was initially used to read street addresses.
[Diagram label: Start Here]
The little arrow from the little picture is actually a 3-layer convnet with no pooling. It learns, from a coarse version of the image, to create the initial state of the recurrent network that decides where to look next.
The controller network
This is half of the recurrent network, the part that I’ll call the controller network, because it keeps the state of where we’ve looked and is the input to the emission network, which produces where to look next. It is an LSTM network.
From the controller network, the little arrow marked “emission” is really just a feedforward network with one hidden layer that learns to produce an (x, y) location of where to look next, based on the current state of the r(2) network. Here n is the time step.
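A one-hidden-layer emission network of this kind might look like the sketch below. The slide specifies only the shape (controller state in, one hidden layer, an (x, y) out). The ReLU hidden units, the tanh squashing to [-1, 1], the layer sizes, and the names `dense` and `emission` are assumptions made for illustration.

```python
import math
import random

def dense(x, weights, bias):
    """Affine layer; weights holds one row of input weights per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def emission(state, W1, b1, W2, b2):
    """Map the controller state to an (x, y) location in [-1, 1] x [-1, 1]."""
    hidden = [max(0.0, h) for h in dense(state, W1, b1)]  # ReLU (assumed)
    x, y = (math.tanh(v) for v in dense(hidden, W2, b2))  # tanh range (assumed)
    return x, y

# Random weights stand in for learned ones: 4 state units, 3 hidden, 2 outputs.
rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
b1 = [0.0] * 3
W2 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
b2 = [0.0] * 2
state = [0.2, -0.4, 0.1, 0.9]  # made-up controller state

x, y = emission(state, W1, b1, W2, b2)
print(x, y)
```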
So, the first computation is to take a coarse version of the image and run it through a convnet, which sets the initial state of the r(2) network, which feeds into the emission network, which produces a first fixation.
This (x, y) location decides which patch of the image is input to the “glimpse network”.
So, after training, it focuses on the first digit in the address.
[Diagram labels: glimpse network, glimpse, input image]
This little arrow is a feed-forward convnet with three convolutional layers followed by a fully connected hidden layer. It is gated by the hidden layer of the location network, one to one (element-wise).
[Diagram labels: glimpse, input image]
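The two steps just described, cropping a patch at the emitted location and gating the glimpse features element-wise, can be sketched as below. The coordinate convention (x and y in [-1, 1], x indexing columns and y indexing rows), the zero padding, and the function names are assumptions. The convnet itself is omitted, so `gate` simply takes two equal-length feature vectors.

```python
def crop_glimpse(image, loc, size):
    """Crop a size x size patch centred at loc = (x, y), with x, y in [-1, 1]."""
    h, w = len(image), len(image[0])
    cx = round((loc[0] + 1) / 2 * (w - 1))  # x -> column index (assumed)
    cy = round((loc[1] + 1) / 2 * (h - 1))  # y -> row index (assumed)
    top, left = cy - size // 2, cx - size // 2
    return [[image[r][c] if 0 <= r < h and 0 <= c < w else 0.0
             for c in range(left, left + size)]
            for r in range(top, top + size)]

def gate(glimpse_features, location_hidden):
    """Element-wise (one-to-one) multiplicative gating of two feature vectors."""
    if len(glimpse_features) != len(location_hidden):
        raise ValueError("gating needs vectors of the same length")
    return [g * h for g, h in zip(glimpse_features, location_hidden)]

# Toy 5x5 image whose pixel value encodes its position.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
patch = crop_glimpse(image, (0.0, 0.0), size=3)
print(patch)  # [[6, 7, 8], [11, 12, 13], [16, 17, 18]]

gated = gate([1.0, 2.0, 3.0], [0.5, 0.0, 2.0])
print(gated)  # [0.5, 0.0, 6.0]
```

A unit whose gate value is 0 is switched off entirely, which is what makes the connection multiplicative rather than additive.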
The Story So Far…
This section of the network, in more detail, is this:
[Figure: the network section unrolled over time steps T=0, T=1, T=2, with X and Y labels at each step]
Again, the hidden units of the emission network, which has exactly the same number of hidden units as the first recurrent network, are connected one-to-one with multiplicative connections; that is, the hidden layer of the lower recurrent net is gated by the location network.
Note that this also gives a pathway for the error to propagate from the actual target network (which is fed by the lower recurrent net) all the way back to the hidden nodes of the emission network, but not to the output of the emission network: the location.
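One way to see why the error reaches the gating hidden units but not the location is to write the gate out. The symbols here are not from the slides: g stands for the features being gated, u for the hidden units that gate them, h for the gated result, and E for the classification error. The sketch also assumes the location is used only to choose which patch gets cropped, a step treated as non-differentiable.

```latex
% g: gated features, u: gating hidden units, h: gated output (all the same length)
h = g \odot u,
\qquad
\frac{\partial E}{\partial u_i} = \frac{\partial E}{\partial h_i}\, g_i,
\qquad
\frac{\partial E}{\partial g_i} = \frac{\partial E}{\partial h_i}\, u_i
```

Both factors of the product receive a gradient through the multiplication, while the (x, y) output sits off this path and needs a different training signal.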
The Recurrent Attention Model
[Diagram labels: End Here (if done); Start Here]
How do we train this? The output y is compared to the target (presumably read out when the LSTM units are good and ready), and then we backprop. They actually stop the gradient calculation after the first mislabeled target, so shorter sequences are learned first. This is sometimes called curriculum learning.
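The stop-after-the-first-mistake rule can be sketched as a loss that simply ignores every target after the first one the network gets wrong. This is one reading of the slide, not a confirmed detail of the original implementation: it assumes a cross-entropy loss, and it assumes the mislabeled target itself still contributes. The name `sequence_loss` is hypothetical.

```python
import math

def sequence_loss(step_probs, targets):
    """Sum cross-entropy over targets, stopping after the first mislabeled one.

    step_probs[i] is the predicted class distribution for the i-th target.
    Targets after the first mistake add no loss, and therefore no gradient.
    """
    loss = 0.0
    for probs, target in zip(step_probs, targets):
        loss += -math.log(probs[target])
        predicted = max(range(len(probs)), key=probs.__getitem__)
        if predicted != target:
            break  # assumed: the mislabeled target counts, later ones do not
    return loss

# Made-up predictions: the first target is right, the second is wrong,
# so the third never enters the loss.
step_probs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
targets = [0, 0, 1]
loss = sequence_loss(step_probs, targets)
print(loss)
```

Early in training most sequences fail on an early target, so the effective training sequences start short and lengthen as accuracy improves.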
That takes care of the classification part, but what about the location part? Here, we can use reinforcement learning to reward the network when it picks a location that works well. The reinforcement signal is based on the fraction it gets right.
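A policy-gradient update of the REINFORCE kind is one common way to turn such a reward into a learning signal for the location. The sketch below assumes the emitted (x, y) is the mean of a Gaussian from which the actual fixation is sampled. That policy form, the optional baseline, and both function names are assumptions, since the slide says only that reinforcement learning is used and that the reward is the fraction correct.

```python
def fraction_correct(predictions, targets):
    """Reward: the fraction of targets that were labeled correctly."""
    return sum(p == t for p, t in zip(predictions, targets)) / len(targets)

def reinforce_mean_gradient(sampled, mean, sigma, reward, baseline=0.0):
    """Gradient of (reward - baseline) * log N(sampled; mean, sigma^2) w.r.t. mean.

    Ascending this gradient moves the emitted mean toward sampled locations
    that earned more than the baseline, and away from those that earned less.
    """
    return [(reward - baseline) * (s - m) / sigma ** 2
            for s, m in zip(sampled, mean)]

# Made-up episode: one of two digits read correctly after fixating (0.5, -0.5).
reward = fraction_correct([1, 2], [1, 3])
grad = reinforce_mean_gradient(sampled=[0.5, -0.5], mean=[0.0, 0.0],
                               sigma=1.0, reward=reward)
print(reward, grad)  # 0.5 [0.25, -0.25]
```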
But how do we even get started? We let the network choose random locations at first, to encourage it to explore. Then later we exploit what it has learned, and explore less.
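One simple way to get this explore-then-exploit behaviour is to add noise to the emitted location and shrink that noise over training. The linear schedule, its start and end values, and the clipping to [-1, 1] are illustrative assumptions, not details given in the lecture.

```python
import random

def exploration_sigma(step, total_steps, start=1.0, end=0.05):
    """Linearly anneal the location noise from start down to end."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)

def noisy_location(mean, sigma, rng):
    """Sample a fixation around the emitted mean, clipped to [-1, 1]."""
    return tuple(max(-1.0, min(1.0, rng.gauss(m, sigma))) for m in mean)

rng = random.Random(0)
early = noisy_location((0.2, -0.3), exploration_sigma(0, 1000), rng)
late = noisy_location((0.2, -0.3), exploration_sigma(1000, 1000), rng)
print(early, late)  # early samples scatter widely; late ones stay near the mean
```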
[Diagram label: Start Here]
So, what can we do with all this machinery?
We can find pairs of digits in images! (Woohoo!) (Really? We did all this to do that?) OK, yeah, well, we can do it better than anyone else! (OK, better than we did it last year…)
How the network behaves
But wait! There’s more!
We can add those two digits (we couldn’t do that last year).
But wait! There’s more!
We can read street numbers!
But wait! There’s more!
We can read street numbers backwards!
Was all that really necessary?