RNNs: An example applied to the prediction task


RNNs: An example applied to the prediction task
Psychology 209, February 7, 2017

Going Beyond the SRN
- Back-propagation through time
- The vanishing gradient problem
- Solving the vanishing gradient problem with LSTMs
- The problem of generalization and overfitting
- Solutions to the overfitting problem
- Applying LSTMs with dropout to a full-scale version of Elman's prediction task

An RNN for character prediction
We can see this as several copies of an Elman net placed next to each other. Note that there are only three actual weight arrays, just as in the Elman network. But now we can do 'back-propagation through time': gradients are propagated through all arrows, and many different paths affect the same weights. We simply add all these gradient paths together before we change the weights. What happens when we want to process the next characters in our sequence? We keep the last hidden state, but don't back-propagate through it ('truncated backpropagation'). You can think of this as just an Elman net unrolled for several time steps, as in the sketch below.
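
As a concrete illustration, here is a minimal NumPy sketch of truncated back-propagation through time for a character-predicting Elman-style RNN. This is not the code used in the course; the sizes, learning rate, and function name are illustrative assumptions. Note how there are only three weight arrays, how the gradients from every unrolled path are added into them before a single update, and how the final hidden state is carried to the next window without back-propagating through it.

```python
# Minimal sketch of truncated BPTT for an Elman-style character predictor.
import numpy as np

rng = np.random.default_rng(0)
V, H, T = 27, 50, 25          # vocabulary size, hidden units, rollout length

# Only three weight arrays, shared across every unrolled time step.
W_xh = rng.normal(0, 0.01, (V, H))
W_hh = rng.normal(0, 0.01, (H, H))
W_hy = rng.normal(0, 0.01, (H, V))

def truncated_bptt_step(inputs, targets, h_prev, lr=0.1):
    """One forward/backward pass over a T-step window.
    inputs, targets: lists of T character indices; h_prev: carried hidden state."""
    xs, hs, ps = {}, {-1: h_prev}, {}
    loss = 0.0
    for t in range(len(inputs)):                        # forward, unrolled in time
        xs[t] = np.zeros(V); xs[t][inputs[t]] = 1.0     # one-hot character input
        hs[t] = np.tanh(xs[t] @ W_xh + hs[t-1] @ W_hh)  # Elman hidden state
        y = hs[t] @ W_hy
        ps[t] = np.exp(y - y.max()); ps[t] /= ps[t].sum()   # softmax over characters
        loss -= np.log(ps[t][targets[t]])
    # Backward: gradients from every path are *added* into the same three arrays.
    dW_xh, dW_hh, dW_hy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
    dh_next = np.zeros(H)
    for t in reversed(range(len(inputs))):
        dy = ps[t].copy(); dy[targets[t]] -= 1.0
        dW_hy += np.outer(hs[t], dy)
        dh = W_hy @ dy + dh_next              # path through output + path through time
        draw = (1.0 - hs[t] ** 2) * dh        # tanh derivative
        dW_xh += np.outer(xs[t], draw)
        dW_hh += np.outer(hs[t-1], draw)
        dh_next = W_hh @ draw
    for W, dW in ((W_xh, dW_xh), (W_hh, dW_hh), (W_hy, dW_hy)):
        W -= lr * dW                          # one update from the summed gradients
    # Carry the final hidden state forward, but do not back-propagate through it.
    return loss, hs[len(inputs) - 1]

# Example: one window of a toy sequence (indices into a 27-character alphabet).
h = np.zeros(H)
seq = rng.integers(0, V, size=T + 1)
loss, h = truncated_bptt_step(list(seq[:-1]), list(seq[1:]), h)
```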

Parallelizing the computation
- Create 20 copies of the whole thing; call each one a stream.
- Process your text starting at 20 different points in the data.
- Add up the gradient across all the streams at the end of processing a batch, then take one gradient step!
- The forward and backward computations are farmed out to a GPU, so they actually occur in parallel, using the weights from the last update.
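
A hypothetical sketch of the stream bookkeeping (the function name and window length are made up for illustration): the text is cut into 20 contiguous pieces, and the network sees one window from each stream at a time, so the GPU can run all 20 forward and backward passes in parallel with the shared weights.

```python
import numpy as np

def make_streams(token_ids, n_streams=20, window=25):
    """Reshape a 1-D sequence into (n_streams, steps) so that each row is a
    contiguous stretch of text starting at a different point in the data."""
    data = np.asarray(token_ids)
    steps = len(data) // n_streams
    data = data[: n_streams * steps].reshape(n_streams, steps)
    # Yield successive (inputs, targets) windows; the gradient for each window
    # is summed over all streams before a single weight update is taken.
    for start in range(0, steps - 1, window):
        end = min(start + window, steps - 1)
        yield data[:, start:end], data[:, start + 1:end + 1]

# Example: 10,000 fake token ids split into 20 streams of 25-step windows.
for inputs, targets in make_streams(np.arange(10000) % 50):
    assert inputs.shape[0] == 20   # one row per stream, processed in parallel
```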

Some problems and solutions
Classic RNN at right; note the superscript l for the layer.
- The vanishing gradient problem. Solution: the LSTM, which has its own internal state 'c' and has weights to gate input and output and to allow it to forget. Note the elementwise ('dot') product notation in its equations.
- Overfitting. Solution: dropout (applied on the dotted paths only).
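
To make the gating concrete, here is a minimal NumPy sketch of a single LSTM step with dropout applied only on the non-recurrent input path (the 'dotted paths'). The packed weight layout, layer sizes, and keep probability are illustrative assumptions, not the exact formulation shown in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b, keep_prob=0.5, train=True):
    """x: input from the layer below; (h_prev, c_prev): this layer's previous
    hidden state and internal state c; W, b: one packed weight matrix and bias."""
    if train:                                   # dropout only on the input path,
        mask = (rng.random(x.shape) < keep_prob) / keep_prob
        x = x * mask                            # never on the recurrent h path
    z = np.concatenate([x, h_prev]) @ W + b     # all four gates in one matmul
    H = h_prev.size
    i = sigmoid(z[0*H:1*H])                     # input gate
    f = sigmoid(z[1*H:2*H])                     # forget gate ("allow it to forget")
    o = sigmoid(z[2*H:3*H])                     # output gate
    g = np.tanh(z[3*H:4*H])                     # candidate cell update
    c = f * c_prev + i * g                      # internal state c: elementwise products
    h = o * np.tanh(c)
    return h, c

# Example with a 32-dim input feeding a 64-unit LSTM layer.
D, H = 32, 64
W = rng.normal(0, 0.1, (D + H, 4 * H)); b = np.zeros(4 * H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
```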

Word Embeddings
Uses a learned word vector instead of one unit per word. This is similar to the Rumelhart model of semantic cognition and to the first hidden layer of 10 units from Elman (199). We use backprop to (conceptually) change the word vectors, rather than the input-to-word-vector weights, but it is essentially the same computation, as the sketch below illustrates.
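
A short sketch of that equivalence, with a made-up vocabulary size and embedding dimension: looking up row i of an embedding matrix is the same computation as multiplying a one-hot input by the input-to-word-vector weights, and the gradient is nonzero only for the rows of the words that actually occurred.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10_000, 200
E = rng.normal(0, 0.05, (vocab_size, embed_dim))   # one learned vector per word

word_id = 42
one_hot = np.zeros(vocab_size); one_hot[word_id] = 1.0
assert np.allclose(one_hot @ E, E[word_id])        # lookup == one-hot matmul

# Backprop: the gradient of the loss w.r.t. E is nonzero only in the rows of
# the words that actually appeared, so only those word vectors change.
grad_wrt_vector = rng.normal(size=embed_dim)       # stand-in upstream gradient
dE = np.outer(one_hot, grad_wrt_vector)
assert np.count_nonzero(dE.any(axis=1)) == 1       # only row 42 is updated
```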

Zaremba et al. (2016) https://www.tensorflow.org/tutorials/recurrent/
- Uses stacked LSTMs with dropout
- Learns its own word vectors as it goes
- Shows performance gains compared with other network variants
- Was used in the TensorFlow RNN tutorial
- Could be used to study lots of interesting cognitive science questions

Zaremba et al. (2016): details of their prediction experiments
https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py
- Corpus of ~1.1M words, 10,000-word vocabulary (boy and boys count as different words); rare words replaced by <unk>.
- Output is a softmax over 10,000 alternatives.
- Input uses learned embeddings over N units.
- All networks rolled out for 35 time steps, using a batch size of 20.
- As is standard, training was done with separate training, validation, and test data: randomly divide the data into three parts (e.g. ~85% for training, ~7% for validation, ~8% for test), run a validation test at many points during training, and stop training when validation accuracy stops improving.
- Varied hidden layer size:
  - Non-regularized: 200 units in the word embedding and in each LSTM layer
  - Medium regularized: 650 units in embedding and LSTM layers, 50% dropout
  - Large regularized: 1500 units in embedding and LSTM layers, 65% dropout
- loss = -(1/N) Σ_i log p(target_i); perplexity = e^loss (see the sketch below)
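
The loss and perplexity bookkeeping can be sketched as follows, assuming we already have the model's probability for each target word; the early-stopping helper is a simplified stand-in for the validation procedure described above, not the code from the tutorial.

```python
import numpy as np

def average_loss(target_probs):
    """target_probs: p(target) for each of the N predicted words in a data set."""
    return -np.mean(np.log(target_probs))

def perplexity(target_probs):
    return float(np.exp(average_loss(target_probs)))

# A model that assigned each target word probability 1/10,000 (chance on this
# vocabulary) would have perplexity 10,000; a perfect model has perplexity 1.
assert round(perplexity(np.full(35 * 20, 1 / 10_000))) == 10_000
assert perplexity(np.ones(100)) == 1.0

# Early stopping on the validation split: keep training while validation
# perplexity keeps improving; report test perplexity of the best model.
def should_stop(validation_history, patience=3):
    best = min(validation_history)
    return len(validation_history) > patience and \
        all(v > best for v in validation_history[-patience:])
```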

Optimization algorithms
We have discussed 'stochastic gradient descent' and 'momentum descent'. There are many other variants; see http://sebastianruder.com/optimizing-gradient-descent/. Most common:
- RMSprop
- Adagrad
- Adam
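
For reference, here are the update rules written in plain NumPy so the differences between the variants are visible; the hyperparameter values are common defaults, not recommendations from the slides.

```python
import numpy as np

def sgd(w, g, lr=0.1):
    return w - lr * g

def momentum(w, g, v, lr=0.1, beta=0.9):
    v = beta * v - lr * g                   # velocity accumulates past gradients
    return w + v, v

def rmsprop(w, g, s, lr=0.001, rho=0.9, eps=1e-8):
    s = rho * s + (1 - rho) * g**2          # running mean of squared gradients
    return w - lr * g / (np.sqrt(s) + eps), s

def adagrad(w, g, s, lr=0.01, eps=1e-8):
    s = s + g**2                            # accumulated squared gradients
    return w - lr * g / (np.sqrt(s) + eps), s

def adam(w, g, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g               # first moment (momentum-like)
    v = b2 * v + (1 - b2) * g**2            # second moment (RMSprop-like)
    m_hat, v_hat = m / (1 - b1**t), v / (1 - b2**t)   # bias correction
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Example: a few Adam steps on a toy 3-parameter weight vector.
w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 6):
    w, m, v = adam(w, np.array([0.5, -0.2, 0.1]), m, v, t)
```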

Weight initialization and activity normalization
- Weight initialization depends on the number of incoming connections: we don't want the total input to a unit to be too large.
- Clever initialization can speed learning. E.g., perform SVD on random weights, then reconstruct the weights giving each dimension equal strength (i.e. ignore the 's' term).
- Normalizing activations can speed learning too:
  - Batch normalization: for each unit in each layer, make its activations have mean = 0, sd = 1 across the batch.
  - Layer normalization: normalize the inputs to the units in a layer within each training case (Ba, Kiros, & Hinton, 2016, https://arxiv.org/abs/1607.06450).
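
A sketch of both ideas, with illustrative sizes: SVD-based initialization that keeps the random directions but gives every dimension equal strength, and layer normalization of one layer's summed inputs for a single training case.

```python
import numpy as np

rng = np.random.default_rng(0)

def svd_init(fan_in, fan_out):
    """Random weights reconstructed with every dimension given equal strength
    (the 's' term from the SVD is ignored)."""
    W = rng.normal(size=(fan_in, fan_out))
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt            # same directions as the random matrix, all strengths 1

def layer_norm(a, gain=1.0, bias=0.0, eps=1e-5):
    """Normalize the summed inputs to the units in one layer for one case."""
    mu, sigma = a.mean(), a.std()
    return gain * (a - mu) / (sigma + eps) + bias

W = svd_init(200, 650)
print(np.allclose(np.linalg.svd(W, compute_uv=False), 1.0))   # all singular values = 1
print(layer_norm(rng.normal(3.0, 2.0, size=650)).std())       # std ≈ 1 after normalizing
```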