RNNs: Going Beyond the SRN in Language Prediction

Presentation transcript:

RNNs: Going Beyond the SRN in Language Prediction. Psychology 209, February 7, 2019.

XOR performance and factors affecting the results: a table of worst-case training results by number of hidden units and activation function (Tanh vs. ReLU), with selected rows credited to Noah, Hermawan, Shaw, and Ciric.

An RNN for character prediction
We can see this as 'nsteps' copies of an Elman net placed next to each other. Note that there are only three actual weight arrays, just as in the Elman network. But now we can do 'backpropagation through time': gradients are propagated through all arrows, and many different paths affect the same weights. We simply add all of these gradient paths together before we change the weights.
What happens when we want to process the next characters in our sequence? We keep the last hidden state, but don't back-propagate through it. This is 'truncated backpropagation'; the Elman net truncates after one step.
How many total backward paths (from the output o_t back through the earlier hidden states) affect W_hh? …
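To make the truncation concrete, here is a minimal sketch in PyTorch (not the course code); the vocabulary size, chunk length, and random stand-in text are illustrative assumptions. The hidden state is carried from one chunk to the next but detached, so gradients stop at the chunk boundary.

```python
import torch
import torch.nn as nn

# Illustrative sizes for a character-prediction RNN (assumptions, not the slide's values).
vocab_size, hidden_size, nsteps, n_chunks = 50, 128, 35, 10

rnn = nn.RNN(input_size=vocab_size, hidden_size=hidden_size, batch_first=True)
readout = nn.Linear(hidden_size, vocab_size)
params = list(rnn.parameters()) + list(readout.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# Stand-in data: random character ids, processed in consecutive chunks of nsteps.
text_ids = torch.randint(0, vocab_size, (n_chunks * nsteps + 1,))

h = torch.zeros(1, 1, hidden_size)                            # initial hidden state
for c in range(n_chunks):
    ids = text_ids[c * nsteps : (c + 1) * nsteps + 1]
    inputs = torch.eye(vocab_size)[ids[:-1]].unsqueeze(0)     # (1, nsteps, vocab) one-hot
    targets = ids[1:]                                         # predict the next character
    optimizer.zero_grad()
    out, h = rnn(inputs, h)               # forward through nsteps copies of the Elman net
    loss = loss_fn(readout(out).squeeze(0), targets)
    loss.backward()                       # gradient paths through all arrows are summed into the shared weights
    optimizer.step()
    h = h.detach()                        # keep the last hidden state, but don't back-propagate through it
```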

Parallelizing the computation
Create 20 copies of the whole thing and call each one a stream. Process your text starting at 20 different points in the data. Add up the gradient across all the streams at the end of processing a batch, then take one gradient step!
The forward and backward computations are farmed out to a GPU, so they actually occur in parallel, using the weights from the last update. …
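A sketch of the streams idea, under the same illustrative assumptions as the previous snippet: the 20 streams are just a batch dimension, so a single backward pass sums the gradients over all streams before the one weight update.

```python
import torch
import torch.nn as nn

# The 20 streams become a batch dimension; backward() sums gradients across streams.
vocab_size, hidden_size, nsteps, n_streams = 50, 128, 35, 20

rnn = nn.RNN(vocab_size, hidden_size, batch_first=True)
readout = nn.Linear(hidden_size, vocab_size)
loss_fn = nn.CrossEntropyLoss()

h = torch.zeros(1, n_streams, hidden_size)     # one hidden state per stream
ids = torch.randint(0, vocab_size, (n_streams, nsteps))        # stand-in text for each stream
inputs = torch.eye(vocab_size)[ids]                            # (n_streams, nsteps, vocab) one-hot
targets = torch.randint(0, vocab_size, (n_streams, nsteps))    # stand-in next-character targets

out, h = rnn(inputs, h)                        # all streams processed in parallel
loss = loss_fn(readout(out).reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                # one summed gradient over all streams and steps: now take one step
```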

Some problems and solutions
Classic RNN at right (note the superscript l for the layer).
The vanishing gradient problem. Solution: the LSTM. It has its own internal state 'c', and it has weights to gate its input and output and to allow it to forget. Note the elementwise multiplication notation.
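For reference, one standard way of writing the LSTM step (the slide's figure may use different symbols); here $\odot$ is elementwise multiplication, $c_t$ is the internal state, and $i_t$, $f_t$, $o_t$ are the input, forget, and output gates:

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```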

Variants
'Peephole' connections. Tying the input and forget gates. The gated recurrent unit (the Wikipedia version has biases).
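One common way of writing the gated recurrent unit with biases (conventions differ on whether $z_t$ or $1-z_t$ multiplies the previous state); $r_t$ and $z_t$ are the reset and update gates:

```latex
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```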

Some problems and solutions
Classic RNN at right (note the superscript l for the layer).
The vanishing gradient problem. Solution: the LSTM. It has its own internal state 'c', and it has weights to gate its input and output and to allow it to forget. Note the dot product notation.
Overfitting. Solution: dropout (applied on the dotted paths only).
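A minimal sketch of the dropout arrangement, assuming a PyTorch-style stacked LSTM; the 650 units and 50% rate follow the medium model described below, everything else is illustrative. Dropout is applied on the vertical (non-recurrent) paths only; the recurrent state is left alone.

```python
import torch
import torch.nn as nn

# Dropout between stacked LSTM layers and on the input/output paths,
# but not on the recurrent h -> h connections.
vocab_size, embed_size, hidden_size = 10000, 650, 650

embed = nn.Embedding(vocab_size, embed_size)
lstm = nn.LSTM(embed_size, hidden_size, num_layers=2, dropout=0.5, batch_first=True)
drop = nn.Dropout(0.5)
readout = nn.Linear(hidden_size, vocab_size)

ids = torch.randint(0, vocab_size, (20, 35))   # batch of 20 streams, 35 steps (stand-in data)
x = drop(embed(ids))                           # dropout on the input (vertical) path
out, _ = lstm(x)                               # internal dropout only between the two layers
logits = readout(drop(out))                    # dropout again before the softmax readout
```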

Word Embeddings
Uses a learned word vector instead of one unit per word. This is similar to the Rumelhart model of semantic cognition and to the first hidden layer of 10 units from Elman (1990).
We use backprop to (conceptually) change the word vectors, rather than the input-to-word-vector weights, but it is essentially the same computation.
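A small sketch of why the two views are the same computation, using a hypothetical PyTorch embedding (word id 42 and the sizes are arbitrary): looking up a word vector gives the same result as multiplying a one-hot input by the input-to-word-vector weight matrix.

```python
import torch
import torch.nn as nn

vocab_size, embed_size = 10000, 200
embed = nn.Embedding(vocab_size, embed_size)

word_id = torch.tensor([42])
one_hot = torch.zeros(vocab_size)
one_hot[42] = 1.0

via_lookup = embed(word_id)                    # "change the word vectors" view
via_matmul = one_hot @ embed.weight            # "input-to-word-vector weights" view
print(torch.allclose(via_lookup.squeeze(0), via_matmul))   # True
```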

Zaremba et al. (2016) https://www.tensorflow.org/tutorials/recurrent/
Uses stacked LSTMs with dropout. Learns its own word vectors as it goes. Shows performance gains compared with other network variants. Was used in the TensorFlow RNN tutorial. Could be used to study lots of interesting cognitive science questions.

Zaremba et al. (2016): details of their prediction experiments
https://github.com/tensorflow/models/blob/master/tutorials/rnn/ptb/ptb_word_lm.py
Corpus of ~1.1M words, 10,000-word vocabulary (boy and boys are different words). Rare words are replaced by <unk>. Output is a softmax over 10,000 alternatives. Input uses learned embeddings over N units. All networks are rolled out for 35 steps, using a batch size of 20.
As is standard, training was done with separate training, validation, and test data: randomly divide the data into three parts (e.g., ~85% for training, ~7% for validation, ~8% for test); run a validation test at many time points during training; stop training when validation accuracy stops improving.
Varied hidden layer size:
Non-regularized: 200 units in the word embedding and each LSTM layer.
Medium regularized: 650 units in the embedding and LSTM layers, 50% dropout.
Large regularized: 1500 units in the embedding and LSTM layers, 65% dropout.
$\text{loss} = -\frac{1}{N}\sum_i \log p(\text{target}_i)$, $\text{perplexity} = e^{\text{loss}}$
How good are these results? Roughly, p(correct) ≅ 1/perplexity.
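A tiny worked example of the loss/perplexity relation on made-up probabilities (nothing here comes from the paper):

```python
import math

# Model probability assigned to each actual next word (made-up numbers).
p_targets = [0.2, 0.05, 0.5, 0.1]
loss = -sum(math.log(p) for p in p_targets) / len(p_targets)   # average negative log likelihood
perplexity = math.exp(loss)
print(loss, perplexity, 1 / perplexity)   # the slide's rule of thumb: p(correct) ≅ 1/perplexity
```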

Optimization algorithms
We have discussed 'batch gradient descent', 'stochastic gradient descent', and 'momentum descent'. There are many other variants; see http://sebastianruder.com/optimizing-gradient-descent/
Most common: RMSprop, Adagrad, Adam.
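For reference, one way to write the Adam update (following Kingma & Ba, 2015), where $g_t$ is the gradient at step $t$, $\alpha$ is the learning rate, and $\beta_1$, $\beta_2$, $\epsilon$ are hyperparameters. Adam keeps running averages of the gradient and of its square, and divides one by the square root of the other:

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2 \\
\hat{m}_t &= m_t / (1-\beta_1^t), \qquad \hat{v}_t = v_t / (1-\beta_2^t) \\
\theta_t &= \theta_{t-1} - \alpha\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)
\end{aligned}
```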

Weight initialization and activity normalization
Weight initialization depends on the number of incoming connections: we don't want the total input to a unit to be too large. Clever initialization can speed learning. E.g., perform SVD on random weights, then reconstruct the weights giving each dimension equal strength (i.e., ignore the 's' term).
Normalizing activations can speed learning too.
Batch normalization: for each unit in each layer, make its activations have mean = 0, sd = 1 across the cases in a batch.
Layer normalization: normalize the inputs to the units in a layer within each training case. Ba, Kiros, & Hinton (2016) https://arxiv.org/abs/1607.06450
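A short sketch of both ideas in NumPy; the layer size of 200 and the epsilon are illustrative assumptions, not values from the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

# SVD trick: start from random weights scaled by fan-in, then rebuild them with all
# singular values set to 1 (i.e., drop the 's' term), giving each dimension equal strength.
n_in, n_out = 200, 200
W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))
U, s, Vt = np.linalg.svd(W)
W_init = U @ Vt                     # reconstruction with equal strength in every dimension

# Layer normalization (Ba, Kiros, & Hinton, 2016): normalize the summed inputs to the
# units in a layer to mean 0, sd 1, within a single training case.
def layer_norm(a, eps=1e-5):
    return (a - a.mean()) / (a.std() + eps)

summed_inputs = rng.normal(size=n_out)          # summed inputs to one layer for one case
print(layer_norm(summed_inputs).mean(), layer_norm(summed_inputs).std())
```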