CSE 190 Neural Networks: The Neural Turing Machine

Presentation transcript:

CSE 190 Neural Networks: The Neural Turing Machine
Gary Cottrell, Week 10, Lecture 2

Introduction
Neural nets have been shown to be Turing-equivalent (Siegelmann & Sontag). But can they learn programs? Yes!

Introduction
One can explicitly teach a neural net to implement a program – e.g., Tsung & Cottrell (1993) taught a network to perform sequential addition.

Introduction
One can explicitly teach a neural net to implement a program. That's nice, but:
- The network had to learn to remember things in its internal state space – when it had to remember one extra bit, it took 5,000 more epochs!
- We told the model exactly what action to take on every step.
Can a network learn what to do just by seeing input-output examples?

The Neural Turing Machine
A Turing Machine has a finite-state controller and a tape that it can read from and write to.

The Neural Turing Machine
Can we build one of these out of a neural net? The idea seems like it might be similar to this one: a Turing Machine made from Tinkertoys that plays tic-tac-toe.

The Neural Turing Machine
It's a nice party trick, but totally impractical. Does an NTM have to be the same way? No!

The Neural Turing Machine
The main idea: add a structured memory to a neural controller that it can write to and read from.

The Neural Turing Machine
This is completely differentiable from end to end – so it can be trained by backpropagation through time (BPTT)!

The Neural Turing Machine
The controller can be a feed-forward network or a recurrent one with LSTM units. The recurrent one works better.

The Neural Turing Machine
The read and write heads are reasonably simple – the addressing mechanism, not so much. The addressing mechanism allows both content-based addressing and location-based addressing (the normal kind). It is a highly structured system, but differentiable.

The Neural Turing Machine
The memory M is simply a matrix of linear neurons. Think of each row as a "word" of memory; we will read and write a row at a time. So this example memory has 3 words of length 4 (a 3×4 matrix).

The Neural Turing Machine
Indexing the rows as i = 1, 2, 3, we can read from this memory using
r_t = Σ_i w_t(i) · M_t(i)
where r_t is the vector read out from the memory at time t, and w_t is a softmax column vector of length 3 – w_t is the address!

Read Example
Suppose we want the second row. If w = (0, 1, 0)ᵀ, then the read yields r_t = (5, 9, 9, 7). Again, note how w is the address.

Read Example
If w = (½, ½, 0)ᵀ, then the read yields (4, 5, 6.5, 4) – the average of rows 1 and 2. So you can even do a simple computation with the addressing mechanism!
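A minimal numpy sketch of this read operation. The slides' full memory matrix is not reproduced in this transcript, so rows 1 and 2 below are reconstructed from the two read examples and the third row is an arbitrary placeholder.

```python
import numpy as np

# 3 x 4 memory. Rows 1 and 2 are reconstructed from the read examples above;
# the third row is an arbitrary placeholder (its value on the slide is not
# recoverable from this transcript).
M = np.array([[3.0, 1.0, 4.0, 1.0],
              [5.0, 9.0, 9.0, 7.0],
              [2.0, 6.0, 5.0, 3.0]])

def read(M, w):
    """r_t = sum_i w[i] * M[i]: a blend of rows, weighted by the address w."""
    return M.T @ w

print(read(M, np.array([0.0, 1.0, 0.0])))  # [5. 9. 9. 7.]   (row 2)
print(read(M, np.array([0.5, 0.5, 0.0])))  # [4. 5. 6.5 4.]  (average of rows 1 and 2)
```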

Read Example
Note: this read expression is clearly differentiable, both with respect to M and with respect to w.

Writing to the memory
Inspired by the gating in LSTM networks, they use gates to write to memory. A write operation happens in two steps: erase memories, then add to them.

Erasing the memory
Recall that a write happens in two steps: erase memories, then add to them. Erasing memory element (row) i:
M̃_t(i) = M_{t-1}(i) ∘ (1 − w_t(i) e_t)
Again, w is the address, so if w_t(i) is 1 at time t and e_t is an erase vector, then row i of the memory is modified (e_t is multiplied point-wise with the memory row).

Erasing the memory
Erasing memory element i: if w = (0, 1, 0) and e_t = (1, 1, 1, 1), then the erase zeroes out row 2 of the memory and leaves the other rows untouched. Obviously, more complex "erases" are possible.
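A small sketch of the erase step with the slide's values (the same placeholder memory as in the read sketch above):

```python
import numpy as np

M = np.array([[3.0, 1.0, 4.0, 1.0],
              [5.0, 9.0, 9.0, 7.0],
              [2.0, 6.0, 5.0, 3.0]])   # third row is an arbitrary placeholder

def erase(M, w, e):
    """Row i is scaled point-wise by (1 - w[i] * e)."""
    return M * (1.0 - np.outer(w, e))

w = np.array([0.0, 1.0, 0.0])
e = np.array([1.0, 1.0, 1.0, 1.0])
M_erased = erase(M, w, e)
print(M_erased)   # row 2 is now all zeros; the other rows are untouched
```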

Adding to the memory
Adding to memory element i:
M_t(i) = M̃_t(i) + w_t(i) a_t
If w = (0, 1, 0) and a_t = (1, 2, 3, 4), then this adds (1, 2, 3, 4) to row 2 – which, after the erase above, leaves row 2 equal to (1, 2, 3, 4). Question for you: why do we have to do a write in two steps?
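A sketch of the add step, continuing from the erased memory above (the w and a values are the slide's; the memory rows not shown on the slides remain placeholders):

```python
import numpy as np

# Memory after the erase step above (row 2 has been zeroed out).
M_erased = np.array([[3.0, 1.0, 4.0, 1.0],
                     [0.0, 0.0, 0.0, 0.0],
                     [2.0, 6.0, 5.0, 3.0]])

def add(M, w, a):
    """Row i gets w[i] * a added to it."""
    return M + np.outer(w, a)

w = np.array([0.0, 1.0, 0.0])
a = np.array([1.0, 2.0, 3.0, 4.0])
M_new = add(M_erased, w, a)
print(M_new)   # row 2 is now (1, 2, 3, 4): erase-then-add has "overwritten" it
```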

Adding to the memory
Question for you: why do we have to do a write in two steps? Because the memory is just a bunch of linear neurons: there is no "overwrite" operation in arithmetic. We could use an add vector that is the new vector minus the old vector, but that requires three steps! (What are they? We would have to do a read, compute the modification vector, and then add it.)

Computing w, the address
This is a complicated mechanism, designed to allow both content-addressability and location-based addressing.

Computing w, the address
These parameters (k_t, β_t, g_t, s_t, γ_t, listed on the next slide) must be computed by the controller.

Computing w, the address
- k_t is a key vector for content-addressable memory – we want the memory element most similar to this vector (for our 3×4 memory, this would be a length-4 vector).
- β_t is a gain parameter on the content match (sharpening it).
- g_t is a switch (gate) between content-based and location-based addressing.
- s_t is a shift vector that can increment or decrement the address, or leave it alone.
- γ_t is a gain parameter on the softmax address, making it more binary.
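As a rough illustration of how a controller might emit these five parameters, here is a hedged sketch that splits one controller output vector into (k, β, g, s, γ). The slicing and activation choices (softplus for β, sigmoid for g, softmax for s, 1 + softplus for γ) are illustrative assumptions, not something these slides specify.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def split_head_params(h, key_len=4, shift_range=3):
    """Split a controller output vector h into the five addressing parameters.

    The activations below are one reasonable convention (an assumption),
    chosen so each parameter lands in a sensible range.
    """
    k     = h[:key_len]                                           # key vector
    beta  = np.log1p(np.exp(h[key_len]))                          # softplus: beta >= 0
    g     = 1.0 / (1.0 + np.exp(-h[key_len + 1]))                 # sigmoid: gate in (0, 1)
    s     = softmax(h[key_len + 2 : key_len + 2 + shift_range])   # shift distribution
    gamma = 1.0 + np.log1p(np.exp(h[key_len + 2 + shift_range]))  # gamma >= 1
    return k, beta, g, s, gamma

k, beta, g, s, gamma = split_head_params(np.random.randn(4 + 1 + 1 + 3 + 1))
```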

Content-based addressing
- k_t is a key vector for content-addressable memory – we want the memory element most similar to this vector.
- β_t is a gain parameter on the content match (sharpening it).
- So, we create an address vector where each entry is based on how similar that memory element is to the key:
w_c(i) = exp(β_t K(k_t, M_t(i))) / Σ_j exp(β_t K(k_t, M_t(j)))
where K is cosine similarity – so this is a softmax on the (scaled) similarity.

Content-based addressing
So, suppose k = (3, 1, 0, 0), and suppose the match scores (cosines) against the three rows are 0.8, 0.1, 0.1. With β = 1, this operation produces an address vector of w_c = (0.5, 0.25, 0.25); but with β = 10, w_c = (0.998, 0.001, 0.001). So a large β sharpens the address.
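A sketch that reproduces these numbers; the cosine match scores (0.8, 0.1, 0.1) are taken as given, as on the slide, rather than computed from a particular memory.

```python
import numpy as np

def content_address(scores, beta):
    """Softmax over the cosine-similarity scores, scaled by beta."""
    e = np.exp(beta * np.asarray(scores))
    return e / e.sum()

scores = [0.8, 0.1, 0.1]                     # K(k, M(i)) for the three rows, as on the slide
print(content_address(scores, beta=1.0))     # approx. [0.50, 0.25, 0.25]
print(content_address(scores, beta=10.0))    # approx. [0.998, 0.001, 0.001]
```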

Location-based addressing
This part of the model takes in the content-based address and a gate variable g_t, which switches between the content-based address and the previous memory address vector (so we can increment or decrement it).

Location-based addressing
What the "Interpolation" box does:
w_g = g_t · w_c + (1 − g_t) · w_{t−1}
I.e., if g_t = 1, we use the content-based address; if g_t = 0, we pass through the previous address, w_{t−1}. A mixture (e.g., g_t = 0.5) doesn't really make sense…
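A one-line sketch of the interpolation box (the previous address w_prev below is a made-up value for illustration):

```python
import numpy as np

def interpolate(w_content, w_prev, g):
    """Gate between the content-based address and the previous address."""
    return g * w_content + (1.0 - g) * w_prev

w_c    = np.array([0.998, 0.001, 0.001])   # content-based address
w_prev = np.array([0.0, 0.0, 1.0])         # hypothetical previous address
print(interpolate(w_c, w_prev, g=1.0))     # -> the content-based address
print(interpolate(w_c, w_prev, g=0.0))     # -> the previous address
```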

Location-based addressing
The "Convolutional shift" box takes in the gated address w_g (content-based or carried over from the previous step) and can then shift it by −1, 0, or +1, based on the s_t vector (larger shift ranges are possible).

Location-based addressing
The "Convolutional shift" box applies a circular convolution:
w̃_t(i) = Σ_j w_g(j) · s_t(i − j)   (indices taken modulo the number of rows)
E.g., if s_t = (0, 0, 1), this increments the address by 1: (0, 1, 0) → (0, 0, 1).
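A sketch of the circular-convolution shift for a length-3 address and shifts of −1, 0, +1; it reproduces the slide's example.

```python
import numpy as np

def shift(w, s):
    """Circular convolution of the address w with the shift weights s.
    Here s is indexed over shifts (-1, 0, +1), so s = (0, 0, 1) means 'shift by +1'."""
    n = len(w)
    out = np.zeros(n)
    for i in range(n):
        for s_j, offset in zip(s, (-1, 0, +1)):
            out[i] += w[(i - offset) % n] * s_j
    return out

print(shift(np.array([0.0, 1.0, 0.0]), np.array([0.0, 0.0, 1.0])))  # -> [0. 0. 1.]
```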

Location-based addressing
The "Sharpening" box raises each entry of the address to the power γ_t and renormalizes (a softmax-like sharpening):
w_t(i) = w̃_t(i)^γ_t / Σ_j w̃_t(j)^γ_t
E.g., if γ_t = 2, (0.9, 0.1, 0) → (0.99, 0.01, 0).
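A sketch of the sharpening step, reproducing the slide's example:

```python
import numpy as np

def sharpen(w, gamma):
    """Raise each entry of the address to the power gamma and renormalize."""
    w = np.asarray(w, dtype=float) ** gamma
    return w / w.sum()

print(sharpen([0.9, 0.1, 0.0], gamma=2.0))   # approx. [0.99, 0.01, 0.00]
```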

Computing the address:
1. Creating a vector address based on similarity to existing memories
2. Switching between content-based and location-based addressing
3. Incrementing or decrementing the address
4. Sharpening the address

Computing the address:
All of these operations are differentiable with respect to all of their parameters, which means we can backprop through them to change the parameters!

OK, what can we do with this?
- Copy
- Repeated copy
- Associative recall
- Dynamic N-grams
- Priority sort
They compare a feed-forward controller, an LSTM controller, and a vanilla LSTM net on these tasks.

Copy
Input: some sequence, e.g., 1, 2, 3, 4. Target: after the end of the sequence, output 1, 2, 3, 4.
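For concreteness, a hedged sketch of generating one copy-task example. The random binary vectors and the extra delimiter channel follow the NTM paper's setup rather than anything stated on this slide, so treat the encoding details as assumptions.

```python
import numpy as np

def copy_example(seq_len=4, width=8, rng=np.random):
    """One copy-task example: a random binary sequence, then a delimiter flag;
    the target is the same sequence repeated after the delimiter.
    The delimiter channel is an assumed detail of the encoding."""
    seq = rng.randint(0, 2, size=(seq_len, width)).astype(float)
    inputs  = np.zeros((2 * seq_len + 1, width + 1))
    targets = np.zeros((2 * seq_len + 1, width))
    inputs[:seq_len, :width] = seq       # present the sequence
    inputs[seq_len, width]   = 1.0       # delimiter: "now reproduce it"
    targets[seq_len + 1:, :] = seq       # desired output after the delimiter
    return inputs, targets

x, y = copy_example()
print(x.shape, y.shape)   # (9, 9) (9, 8)
```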

Copy: Results (LSTM controller)
Note it was only trained on length-20 sequences.

Copy: Results (LSTM network)
Note it was only trained on length-20 sequences.

Copy: Results (LSTM NTM)

Copy: The program it learned

Priority Sort: the training

Priority Sort: the results
"We hypothesize that it uses the priorities to determine the relative location of each write. To test this hypothesis we fitted a linear function of the priority to the observed write locations. Figure 17 shows that the locations returned by the linear function closely match the observed write locations. It also shows that the network reads from the memory locations in increasing order, thereby traversing the sorted sequence."