Dougal Sutherland, 9/25/13

Problem with regular RNNs
- The standard learning algorithms for RNNs don't allow for long time lags.
- Problem: error signals going "back in time" in BPTT, RTRL, etc. either exponentially blow up or (usually) exponentially vanish.
- The toy problems solved in other papers are often solved faster by randomly guessing weights.

Exponential decay of error signals
- The error from unit u at step t, propagated back to unit v, q steps prior, is scaled along each path $l_q = v,\, l_{q-1},\, \ldots,\, l_0 = u$ by $\prod_{m=1}^{q} f'_{l_m}\!\big(\mathrm{net}_{l_m}(t-m)\big)\, w_{l_m l_{m-1}}$, summed over all such paths.
- If the per-step factor is always > 1: exponential blow-up in q.
- If it is always < 1: exponential decay in q.

Exponential decay of error signals
- For logistic activation, f'(.) has maximum 1/4.
- For constant activation $y^{l_{m-1}}$, the per-step term is maximized where $w_{l_m l_{m-1}} = \frac{1}{y^{l_{m-1}}}\coth\!\big(\tfrac{1}{2}\,\mathrm{net}_{l_m}\big)$.
- If the weights have magnitude < 4, we get exponential decay.
- If the weights are big, the derivative shrinks faster than the weight grows, so the error still vanishes.
- A larger learning rate doesn't help either.
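
To make the decay concrete, here is a small numerical sketch (not from the slides or the paper; the chain length, weight values, and constant input are my own choices). It multiplies the per-step factor f'(net)·w along a chain of logistic units sharing a fixed weight, and shows that the product shrinks whether the weight is below 4 or far above it, since a saturated sigmoid's derivative collapses faster than the weight grows.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def error_scaling(w, y_prev=0.5, q=50):
    """Product of the per-step factors f'(net) * w along a chain of q
    logistic units, assuming each unit sees a constant input y_prev
    through a single weight w (a deliberately simplified setting)."""
    factor = 1.0
    for _ in range(q):
        net = w * y_prev
        fprime = sigmoid(net) * (1.0 - sigmoid(net))  # logistic derivative, max 1/4
        factor *= fprime * w
    return factor

for w in [0.5, 2.0, 3.9, 4.5, 10.0, 100.0]:
    print(f"w = {w:6.1f}  ->  scaling after 50 steps: {error_scaling(w):.3e}")
```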

Global error
- If we sum the per-path decay across all units, we will still get a vanishing global error.
- Can also derive a (very) loose upper bound: the per-step scaling is at most $n\, f'_{\max}\, w_{\max}$, so if the max weight is < 4/n (with logistic units, $f'_{\max} = 1/4$), we get exponential decay.
- In practice, decay should happen much more often than this bound suggests.

Constant error flow
- How can we avoid exponential decay? For a self-connected unit j we need $f'_j\!\big(\mathrm{net}_j(t)\big)\, w_{jj} = 1$.
- So we need a linear activation function, with constant activation over time.
- Here, use the identity function with a unit self-weight. Call it the Constant Error Carrousel (CEC).
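
A minimal sketch of the CEC idea (my own illustration, not code from the paper): a single self-connected linear unit with self-weight 1.0 passes an error signal back through time unchanged, while a logistic unit with the same self-weight shrinks it exponentially.

```python
import numpy as np

def backprop_through_self_loop(fprime_fn, w_self, nets, delta=1.0):
    """Scale an error signal back through the time steps of a single
    self-connected unit: delta_{t-1} = delta_t * f'(net_t) * w_self."""
    for net in nets:
        delta *= fprime_fn(net) * w_self
    return delta

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
logistic_fprime = lambda net: sigmoid(net) * (1.0 - sigmoid(net))
identity_fprime = lambda net: 1.0                     # CEC: linear activation

nets = np.full(100, 0.8)                              # 100 steps with constant net input
print("logistic unit, w_self = 1:", backprop_through_self_loop(logistic_fprime, 1.0, nets))
print("CEC (identity), w_self = 1:", backprop_through_self_loop(identity_fprime, 1.0, nets))
```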

CEC issues: input weight conflict
- Say we have a single outside input i.
- If it helps to turn on unit j and keep it active, w_ji will want to both:
  - store the input (switching on j)
  - protect the input (preventing j from being switched off)
- The conflict makes learning hard.
- Solution: add an "input gate" to protect the CEC from irrelevant inputs.

CEC issues: output weight conflict
- Say we have a single outside output unit k.
- w_kj needs to both:
  - sometimes fetch the output from the CEC j
  - sometimes prevent j from messing with k
- The conflict makes learning hard.
- Solution: add an "output gate" to control when the stored data is read.

LSTM memory cell

LSTM
- Later on, a "forget gate" was added to the cells.
- Cells can be connected into a memory block: the same input/output gates serve multiple memory cells.
- Here, one fully-connected hidden layer is used, consisting of memory blocks only; regular hidden units could be used as well.
- Learning uses a variant of RTRL that only propagates errors back in time within the memory cells.
- O(# weights) update cost.
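
A minimal numpy sketch of one memory cell's forward step, following the structure described above (a CEC guarded by input and output gates, plus the later forget gate). The class/variable names and the logistic/tanh nonlinearities are my own choices; peephole connections and the full block/layer wiring are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class MemoryCell:
    """One LSTM memory cell: a CEC guarded by input, forget, and output gates.
    Each weight vector maps [external input, previous cell output] to a scalar
    pre-activation."""

    def __init__(self, n_in, rng=np.random.default_rng(0)):
        self.W_in, self.W_forget, self.W_out, self.W_cell = (
            rng.normal(0.0, 0.1, size=n_in + 1) for _ in range(4))
        self.s = 0.0   # internal state (the CEC)
        self.y = 0.0   # cell output

    def step(self, x):
        z = np.concatenate([x, [self.y]])      # external input + recurrent cell output
        y_in = sigmoid(self.W_in @ z)          # input gate: shields the CEC from irrelevant inputs
        y_forget = sigmoid(self.W_forget @ z)  # forget gate (added later): can reset the CEC
        y_out = sigmoid(self.W_out @ z)        # output gate: controls when the state is read
        g = np.tanh(self.W_cell @ z)           # candidate input to the cell
        self.s = y_forget * self.s + y_in * g  # CEC update around the (near-)unit self-connection
        self.y = y_out * np.tanh(self.s)       # gated output
        return self.y

cell = MemoryCell(n_in=3)
for x in np.random.default_rng(1).normal(size=(5, 3)):
    print(cell.step(x))
```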

LSTM topology example

Issues
- Sometimes memory cells get abused to work as constant biases.
- One solution: sequential construction (train until the error stops decreasing, then add a memory cell).
- Another: output gate bias (initially suppress the outputs to push the abuse state farther away).

Issues: internal state drift
- If the inputs are mostly of the same sign, the memory contents will tend to drift over time.
- This causes the gradient to vanish.
- Simple solution: initially bias the input gates toward zero.

Experiment 1

Experiment 2a
- One-hot coding of p+1 symbols; task: predict the next symbol.
- Training sequences: (p+1, 1, 2, …, p-1, p+1) and (p, 1, 2, …, p-1, p).
- 5 million examples, equal probability; test on the same set.
- RTRL usually works with p=4, never with p=10.
- Even with p=100, LSTM always works.
- The hierarchical chunker also works.
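
A small sketch of how the two training sequences above might be generated and one-hot coded (my own reading of the slide's description; the symbol numbering and coding layout are assumptions):

```python
import numpy as np

def make_sequence(marker, p):
    """Sequence (marker, 1, 2, ..., p-1, marker): the first symbol must be
    remembered across the whole sequence to predict the last one."""
    return [marker] + list(range(1, p)) + [marker]

def one_hot(symbol, n_symbols):
    v = np.zeros(n_symbols)
    v[symbol - 1] = 1.0
    return v

p = 10
n_symbols = p + 1
for seq in (make_sequence(p, p), make_sequence(p + 1, p)):
    coded = np.stack([one_hot(s, n_symbols) for s in seq])
    print(seq, coded.shape)
```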

Experiments 2b and 2c
- 2b: replace (1, 2, …, p-1) with a random sequence: {x, *, *, …, *, x} vs. {y, *, *, …, *, y}.
- LSTM still works; the chunker breaks.
- Experiment 2c: ?????????

Experiment 3a
- So far, the noise has been on a separate channel from the signal. Now mix it:
  - {1, N(0, .2), N(0, .2), …, 1} vs. {-1, N(0, .2), …, 0}
  - {1, 1, 1, N(0, .2), …, 1} vs. {-1, -1, -1, N(0, .2), …, 0}
- LSTM works okay, but random guessing is better.

Experiments 3b and 3c
- 3b: same as before, but add Gaussian noise to the initial informative elements.
- LSTM still works (as does random guessing).
- 3c: replace the informative elements with .2, .8 and add Gaussian noise to the targets.
- LSTM works; random guessing doesn't.

Experiment 4
- Each sequence element is a pair of inputs: a number in [-1, 1] and a marker from {-1, 0, 1}.
- Two elements are marked with a second entry of 1; the first and last elements get a second entry of -1; the rest get 0.
- The target at the last element of the sequence is the sum of the two marked values (scaled to [0, 1]).
- LSTM always learns it with 2 memory cells.
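
A sketch of a generator for this task as described above (my own implementation; the sequence length, the exact placement of the markers, and the 0.5 + sum/4 scaling to [0, 1] are assumptions):

```python
import numpy as np

def adding_problem_example(length=20, rng=np.random.default_rng()):
    """Each element is (value in [-1, 1], marker in {-1, 0, 1}).
    Two interior elements are marked with 1; the first and last get -1;
    the target is the scaled sum of the two marked values."""
    values = rng.uniform(-1.0, 1.0, size=length)
    markers = np.zeros(length)
    markers[0] = markers[-1] = -1.0
    i, j = rng.choice(np.arange(1, length - 1), size=2, replace=False)
    markers[i] = markers[j] = 1.0
    target = 0.5 + (values[i] + values[j]) / 4.0   # sum in [-2, 2], rescaled to [0, 1]
    return np.stack([values, markers], axis=1), target

x, y = adding_problem_example()
print(x.shape, y)
```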

Experiment 5
- Same as experiment 4, but multiplies instead of adds.
- LSTM always learns it with 2 memory cells.

Experiment 6
- 6a: one-hot encoding of sequence classes: E, noise, X1, noise, X2, noise, B, output.
- X1 and X2 each have two possible values; the noise is of random length, with separation ~40.
- The output is a different symbol for each (X1, X2) pair.
- 6b: same thing but with X1, X2, X3 (8 classes).
- LSTM almost always learns both tasks, with 2 or 3 cell blocks of size 2.

Problems
- Truncated backprop doesn't work for e.g. delayed XOR.
- Adds gate units (not a big deal).
- Hard to do exact counting. Except Gers (2001) figured it out: it can make sharp spikes every n steps.
- Basically acts like a feedforward net that sees the whole sequence at once.

Good things
- Basically acts like a feedforward net that sees the whole sequence at once.
- Handles long time lags.
- Handles noise, continuous representations, and discrete representations.
- Works well without much parameter tuning (on these toy problems).
- Fast learning algorithm.

Cool results since the initial paper
- Combined with Kalman filters, can learn a^n b^n c^n for n up to 22 million (Perez 2002).
- Learns blues form (Eck 2002).
- Metalearning: learns to learn quadratic functions very quickly (Hochreiter 2001).
- Use in reinforcement learning (Bakker 2002).
- Bidirectional extension (Graves 2005).
- Best results in handwriting recognition as of 2009.