RNNs: An example applied to the prediction task


RNNs: An example applied to the prediction task
Psychology 209, February 7, 2017

Going Beyond the SRN
- Back-propagation through time
- The vanishing gradient problem
- Solving the vanishing gradient problem with LSTMs
- The problem of generalization and overfitting
- Solutions to the overfitting problem
- Applying LSTMs with dropout to a full-scale version of Elman's prediction task

An RNN for character prediction
We can see this as several copies of an Elman net placed next to each other. Note that there are only three actual weight arrays, just as in the Elman network. But now we can do 'back-propagation through time': gradients are propagated through all arrows, and many different paths affect the same weights. We simply add all these gradient paths together before we change the weights. What happens when we want to process the next characters in our sequence? We keep the last hidden state, but don't back-propagate through it ('truncated backpropagation'). You can think of this as just an Elman net unrolled for several time steps, as in the sketch below.
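
As a concrete illustration, here is a minimal NumPy sketch of truncated back-propagation through time for a character-predicting Elman-style RNN. This is not the code used in the course; the sizes, learning rate, and function name are illustrative assumptions. Note how there are only three weight arrays, how the gradients from every unrolled path are added into them before a single update, and how the final hidden state is carried to the next window without back-propagating through it.

```python
# Minimal sketch of truncated BPTT for an Elman-style character predictor.
import numpy as np

rng = np.random.default_rng(0)
V, H, T = 27, 50, 25          # vocabulary size, hidden units, rollout length

# Only three weight arrays, shared across every unrolled time step.
W_xh = rng.normal(0, 0.01, (V, H))
W_hh = rng.normal(0, 0.01, (H, H))
W_hy = rng.normal(0, 0.01, (H, V))

def truncated_bptt_step(inputs, targets, h_prev, lr=0.1):
    """One forward/backward pass over a T-step window.
    inputs, targets: lists of T character indices; h_prev: carried hidden state."""
    xs, hs, ps = {}, {-1: h_prev}, {}
    loss = 0.0
    for t in range(len(inputs)):                        # forward, unrolled in time
        xs[t] = np.zeros(V); xs[t][inputs[t]] = 1.0     # one-hot character input
        hs[t] = np.tanh(xs[t] @ W_xh + hs[t-1] @ W_hh)  # Elman hidden state
        y = hs[t] @ W_hy
        ps[t] = np.exp(y - y.max()); ps[t] /= ps[t].sum()   # softmax over characters
        loss -= np.log(ps[t][targets[t]])
    # Backward: gradients from every path are *added* into the same three arrays.
    dW_xh, dW_hh, dW_hy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
    dh_next = np.zeros(H)
    for t in reversed(range(len(inputs))):
        dy = ps[t].copy(); dy[targets[t]] -= 1.0
        dW_hy += np.outer(hs[t], dy)
        dh = W_hy @ dy + dh_next              # path through output + path through time
        draw = (1.0 - hs[t] ** 2) * dh        # tanh derivative
        dW_xh += np.outer(xs[t], draw)
        dW_hh += np.outer(hs[t-1], draw)
        dh_next = W_hh @ draw
    for W, dW in ((W_xh, dW_xh), (W_hh, dW_hh), (W_hy, dW_hy)):
        W -= lr * dW                          # one update from the summed gradients
    # Carry the final hidden state forward, but do not back-propagate through it.
    return loss, hs[len(inputs) - 1]

# Example: one window of a toy sequence (indices into a 27-character alphabet).
h = np.zeros(H)
seq = rng.integers(0, V, size=T + 1)
loss, h = truncated_bptt_step(list(seq[:-1]), list(seq[1:]), h)
```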

Parallelizing the computation
- Create 20 copies of the whole thing; call each one a stream.
- Process your text starting at 20 different points in the data.
- Add up the gradient across all the streams at the end of processing a batch, then take one gradient step!
- The forward and backward computations are farmed out to a GPU, so they actually occur in parallel, using the weights from the last update.
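
A hypothetical sketch of the stream bookkeeping (the function name and window length are made up for illustration): the text is cut into 20 contiguous pieces, and the network sees one window from each stream at a time, so the GPU can run all 20 forward and backward passes in parallel with the shared weights.

```python
import numpy as np

def make_streams(token_ids, n_streams=20, window=25):
    """Reshape a 1-D sequence into (n_streams, steps) so that each row is a
    contiguous stretch of text starting at a different point in the data."""
    data = np.asarray(token_ids)
    steps = len(data) // n_streams
    data = data[: n_streams * steps].reshape(n_streams, steps)
    # Yield successive (inputs, targets) windows; the gradient for each window
    # is summed over all streams before a single weight update is taken.
    for start in range(0, steps - 1, window):
        end = min(start + window, steps - 1)
        yield data[:, start:end], data[:, start + 1:end + 1]

# Example: 10,000 fake token ids split into 20 streams of 25-step windows.
for inputs, targets in make_streams(np.arange(10000) % 50):
    assert inputs.shape[0] == 20   # one row per stream, processed in parallel
```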

Some problems and solutions
Classic RNN at right; note the superscript l for the layer.
- The vanishing gradient problem. Solution: the LSTM, which has its own internal state 'c' and has weights to gate input and output and to allow it to forget. Note the elementwise ('dot') product notation in its equations.
- Overfitting. Solution: dropout (applied on the dotted paths only).
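
To make the gating concrete, here is a minimal NumPy sketch of a single LSTM step with dropout applied only on the non-recurrent input path (the 'dotted paths'). The packed weight layout, layer sizes, and keep probability are illustrative assumptions, not the exact formulation shown in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b, keep_prob=0.5, train=True):
    """x: input from the layer below; (h_prev, c_prev): this layer's previous
    hidden state and internal state c; W, b: one packed weight matrix and bias."""
    if train:                                   # dropout only on the input path,
        mask = (rng.random(x.shape) < keep_prob) / keep_prob
        x = x * mask                            # never on the recurrent h path
    z = np.concatenate([x, h_prev]) @ W + b     # all four gates in one matmul
    H = h_prev.size
    i = sigmoid(z[0*H:1*H])                     # input gate
    f = sigmoid(z[1*H:2*H])                     # forget gate ("allow it to forget")
    o = sigmoid(z[2*H:3*H])                     # output gate
    g = np.tanh(z[3*H:4*H])                     # candidate cell update
    c = f * c_prev + i * g                      # internal state c: elementwise products
    h = o * np.tanh(c)
    return h, c

# Example with a 32-dim input feeding a 64-unit LSTM layer.
D, H = 32, 64
W = rng.normal(0, 0.1, (D + H, 4 * H)); b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
```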

Word Embeddings
Uses a learned word vector instead of one unit per word. This is similar to the Rumelhart model of semantic cognition and to the first hidden layer of 10 units from Elman (199). We use backprop to (conceptually) change the word vectors, rather than the input-to-word-vector weights, but it is essentially the same computation, as the sketch below illustrates.
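
A short sketch of that equivalence, with a made-up vocabulary size and embedding dimension: looking up row i of an embedding matrix is the same computation as multiplying a one-hot input by the input-to-word-vector weights, and the gradient is nonzero only for the rows of the words that actually occurred.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10_000, 200
E = rng.normal(0, 0.05, (vocab_size, embed_dim))   # one learned vector per word

word_id = 42
one_hot = np.zeros(vocab_size); one_hot[word_id] = 1.0
assert np.allclose(one_hot @ E, E[word_id])        # lookup == one-hot matmul

# Backprop: the gradient of the loss w.r.t. E is nonzero only in the rows of
# the words that actually appeared, so only those word vectors change.
grad_wrt_vector = rng.normal(size=embed_dim)       # stand-in upstream gradient
dE = np.outer(one_hot, grad_wrt_vector)
assert np.count_nonzero(dE.any(axis=1)) == 1       # only row 42 is updated
```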

Zaremba et al. (2016) https://www.tensorflow.org/tutorials/recurrent/
- Uses stacked LSTMs with dropout
- Learns its own word vectors as it goes
- Shows performance gains compared with other network variants
- Was used in the TensorFlow RNN tutorial
- Could be used to study lots of interesting cognitive science questions

Zaremba et al. (2016): details of their prediction experiments
https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py
- Corpus of ~1.1M words, 10,000-word vocabulary (boy and boys count as different words); rare words replaced by <unk>.
- Output is a softmax over 10,000 alternatives.
- Input uses learned embeddings over N units.
- All networks rolled out for 35 time steps, using a batch size of 20.
- As is standard, training was done with separate training, validation, and test data: randomly divide the data into three parts (e.g. ~85% for training, ~7% for validation, ~8% for test), run a validation test at many points during training, and stop training when validation accuracy stops improving.
- Varied hidden layer size:
  - Non-regularized: 200 units in the word embedding and in each LSTM layer
  - Medium regularized: 650 units in embedding and LSTM layers, 50% dropout
  - Large regularized: 1500 units in embedding and LSTM layers, 65% dropout
- loss = -(1/N) Σ_i log p(target_i); perplexity = e^loss (see the sketch below)
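
The loss and perplexity bookkeeping can be sketched as follows, assuming we already have the model's probability for each target word; the early-stopping helper is a simplified stand-in for the validation procedure described above, not the code from the tutorial.

```python
import numpy as np

def average_loss(target_probs):
    """target_probs: p(target) for each of the N predicted words in a data set."""
    return -np.mean(np.log(target_probs))

def perplexity(target_probs):
    return float(np.exp(average_loss(target_probs)))

# A model that assigned each target word probability 1/10,000 (chance on this
# vocabulary) would have perplexity 10,000; a perfect model has perplexity 1.
assert round(perplexity(np.full(35 * 20, 1 / 10_000))) == 10_000
assert perplexity(np.ones(100)) == 1.0

# Early stopping on the validation split: keep training while validation
# perplexity keeps improving; report test perplexity of the best model.
def should_stop(validation_history, patience=3):
    best = min(validation_history)
    return len(validation_history) > patience and \
        all(v > best for v in validation_history[-patience:])
```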

Optimization algorithms
We have discussed 'stochastic gradient descent' and 'momentum descent'. There are many other variants; see http://sebastianruder.com/optimizing-gradient-descent/. Most common:
- RMSprop
- Adagrad
- Adam
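
For reference, here are the update rules written in plain NumPy so the differences between the variants are visible; the hyperparameter values are common defaults, not recommendations from the slides.

```python
import numpy as np

def sgd(w, g, lr=0.1):
    return w - lr * g

def momentum(w, g, v, lr=0.1, beta=0.9):
    v = beta * v - lr * g                   # velocity accumulates past gradients
    return w + v, v

def rmsprop(w, g, s, lr=0.001, rho=0.9, eps=1e-8):
    s = rho * s + (1 - rho) * g**2          # running mean of squared gradients
    return w - lr * g / (np.sqrt(s) + eps), s

def adagrad(w, g, s, lr=0.01, eps=1e-8):
    s = s + g**2                            # accumulated squared gradients
    return w - lr * g / (np.sqrt(s) + eps), s

def adam(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g               # first moment (momentum-like)
    v = b2 * v + (1 - b2) * g**2            # second moment (RMSprop-like)
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)   # bias correction
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Example: a few Adam steps on a toy 3-parameter weight vector.
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 6):
    w, m, v = adam(w, np.array([0.5, -0.2, 0.1]), m, v, t)
```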

Weight initialization and activity normalization
- Weight initialization depends on the number of incoming connections: we don't want the total input to a unit to be too large.
- Clever initialization can speed learning. E.g., perform SVD on random weights, then reconstruct the weights giving each dimension equal strength (i.e. ignore the 's' term).
- Normalizing activations can speed learning too:
  - Batch normalization: for each unit in each layer, make its activations have mean = 0, sd = 1 across the batch.
  - Layer normalization: normalize the inputs to the units in a layer within each training case (Ba, Kiros, & Hinton, 2016, https://arxiv.org/abs/1607.06450).
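
A sketch of both ideas, with illustrative sizes: SVD-based initialization that keeps the random directions but gives every dimension equal strength, and layer normalization of one layer's summed inputs for a single training case.

```python
import numpy as np

rng = np.random.default_rng(0)

def svd_init(fan_in, fan_out):
    """Random weights reconstructed with every dimension given equal strength
    (the 's' term from the SVD is ignored)."""
    W = rng.normal(size=(fan_in, fan_out))
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt            # same directions as the random matrix, all strengths 1

def layer_norm(a, gain=1.0, bias=0.0, eps=1e-5):
    """Normalize the summed inputs to the units in one layer for one case."""
    mu, sigma = a.mean(), a.std()
    return gain * (a - mu) / (sigma + eps) + bias

W = svd_init(200, 650)
print(np.allclose(np.linalg.svd(W, compute_uv=False), 1.0))   # all singular values = 1
print(layer_norm(rng.normal(3.0, 2.0, size=650)).std())       # std ≈ 1 after normalizing
```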