Learning linguistic structure with simple and more complex recurrent neural networks Psychology 209 - 2018 February 8, 2018
Elman’s Simple Recurrent Network (Elman, 1990) What is the best way to represent time? Slots? Or time itself? What is the best way to represent language? Units and rules? Or connectionist learning? Is grammar learnable? If so, are there any necessary constraints?
The Simple Recurrent Network The network is trained on a stream of elements with sequential structure. At step n, the target for the output is the next element. The pattern on the hidden units is copied back to the context units. After learning, the network comes to retain information about preceding elements of the string, allowing expectations to be conditioned by prior context.
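The slide above describes the forward computation; below is a minimal NumPy sketch of one SRN step, assuming one-hot inputs, a tanh hidden layer, and a softmax prediction of the next element. The sizes, initialization, and names are illustrative, not Elman’s actual settings.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hid = 26, 20                          # illustrative sizes, not Elman's settings
W_xh = rng.normal(0, 0.1, (n_hid, n_in))      # input -> hidden
W_hh = rng.normal(0, 0.1, (n_hid, n_hid))     # context (copied hidden state) -> hidden
W_hy = rng.normal(0, 0.1, (n_in, n_hid))      # hidden -> output (next-element prediction)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def srn_step(x, context):
    """One step: combine the current input with the copied-back context,
    predict the next element, and return the new context."""
    hidden = np.tanh(W_xh @ x + W_hh @ context)
    prediction = softmax(W_hy @ hidden)
    return prediction, hidden                 # the hidden pattern becomes the next context

# Process a short one-hot encoded sequence, carrying context forward.
context = np.zeros(n_hid)
for t in [0, 3, 7]:                           # arbitrary element indices
    x = np.eye(n_in)[t]
    prediction, context = srn_step(x, context)
```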
Learning about words from streams of letters (200 sentences of 4-9 words) Similarly, SRNs have also been used to model learning to segment words in speech (e.g., Christiansen, Allen and Seidenberg, 1998)
Learning about sentence structure from streams of words
Learned and imputed hidden-layer representations (average vectors over all contexts) The ‘Zog’ representation is derived by averaging the vectors obtained by inserting the novel item in place of each occurrence of ‘man’.
Within-item variation by context
Analysis of SRNs using Simpler Sequential Structures (Servan-Schreiber, Cleeremans, & McClelland) The Grammar The Network
Hidden unit representations with 3 hidden units: True Finite State Machine vs. Graded State Machine
Training with Restricted Set of Strings 21 of the 43 valid strings of length 3-8
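As an illustration of the kind of finite-state grammar involved, here is a sketch of a string generator for a standard Reber-type grammar. The transition table below is the commonly cited version and may differ in detail from the exact grammar used by Servan-Schreiber et al.; the code simply enumerates the grammatical strings in the stated length range rather than asserting the paper’s counts.

```python
import random

# Transition table for a standard Reber-type grammar (illustrative; the exact
# grammar in Servan-Schreiber et al. may differ in detail). Each state maps to
# a list of (letter, next_state) choices; None marks acceptance.
GRAMMAR = {
    0: [('T', 1), ('P', 2)],
    1: [('S', 1), ('X', 3)],
    2: [('T', 2), ('V', 4)],
    3: [('X', 2), ('S', None)],
    4: [('P', 3), ('V', None)],
}

def generate_string():
    """Random walk through the grammar from the start state to acceptance."""
    state, letters = 0, []
    while state is not None:
        letter, state = random.choice(GRAMMAR[state])
        letters.append(letter)
    return ''.join(letters)

# Collect the grammatical strings of length 3-8, analogous to the restricted
# training set described above (the exact count depends on the grammar used).
valid = set()
for _ in range(100000):
    s = generate_string()
    if 3 <= len(s) <= 8:
        valid.add(s)
print(len(valid), sorted(valid)[:5])
```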
Progressive Deepening of the Network’s Sensitivity to Prior Context Note: prior context is only maintained if it is prediction-relevant at intermediate points.
Relating the Model to Human Data Experiment: implicit sequence learning. Input is a screen position (corresponding to a letter in the grammar). Response measure is RT: time from stimulus to button press (very few errors ever occur). Assumption: anticipatory activation of the output unit reduces RT. Fit to data: compare the model’s predictions at different time points to human RTs at these time points. A pretty good fit was obtained after adding two additional assumptions (sketched below): Activation carries over (with decay) from the previous time step. Connection weight adjustments have both a fast and a slow component.
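A rough sketch of how those two added assumptions might be expressed in code. The constants and update rules here are purely illustrative assumptions, not the paper’s parameter values or exact equations.

```python
import numpy as np

# Illustrative constants; not the paper's actual parameter values.
DECAY = 0.5                    # carry-over of anticipatory activation between steps
FAST_LR, SLOW_LR = 0.2, 0.01   # fast component learns quickly...
FAST_DECAY = 0.9               # ...but also decays quickly back toward zero

def carry_over(prev_activation, new_activation, decay=DECAY):
    """Assumption 1: output activation carries over (with decay) from the previous step."""
    return decay * prev_activation + new_activation

class FastSlowWeight:
    """Assumption 2: each connection weight has a fast and a slow component."""
    def __init__(self):
        self.fast = 0.0
        self.slow = 0.0

    @property
    def value(self):
        return self.fast + self.slow

    def update(self, gradient):
        # The fast component changes rapidly but decays; the slow one accumulates.
        self.fast = FAST_DECAY * self.fast - FAST_LR * gradient
        self.slow = self.slow - SLOW_LR * gradient
```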
Results and Model Fit [Figure: human behavior compared with fits of the basic and extended models]
Elman (1991)
NV Agreement and Verb Successor Prediction Histograms show summed activation for classes of words: W = who; S = period; N1/N2 = singular/plural nouns; V1/V2 = singular/plural verbs; PN = proper noun. For verbs: N = no direct object (DO); O = optional DO; R = required DO.
Prediction with an embedded clause
Rules or Connections? How is it that we can process sentences we’ve never seen before? ‘Colorless green ideas sleep furiously’ Chomsky, Fodor, Pinker, … Abstract, symbolic rules: S -> NP VP; NP -> (Adj)* N; VP -> V (Adv) The connectionist alternative: function approximation using distributed representations and knowledge in connection weights
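To make the symbolic-rule view concrete, here is a small sketch that expands the rewrite rules on the slide into sentences using a toy vocabulary. The word lists and branching probabilities are illustrative assumptions, not part of the original.

```python
import random

# Toy vocabulary (illustrative); the rewrite rules follow the slide:
# S -> NP VP ; NP -> (Adj)* N ; VP -> V (Adv)
ADJ = ['colorless', 'green']
N = ['ideas', 'dogs']
V = ['sleep', 'bark']
ADV = ['furiously', 'quietly']

def NP():
    # (Adj)*: zero or more adjectives, then a noun
    adjs = random.sample(ADJ, k=random.randint(0, len(ADJ)))
    return adjs + [random.choice(N)]

def VP():
    # A verb, optionally followed by an adverb
    phrase = [random.choice(V)]
    if random.random() < 0.5:
        phrase.append(random.choice(ADV))
    return phrase

def S():
    return NP() + VP()

print(' '.join(S()))   # e.g. "green ideas sleep furiously"
```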
Going Beyond the SRN Back-propagation through time The vanishing gradient problem Solving the vanishing gradient problem with LSTMs The problem of generalization and overfitting Solutions to the overfitting problem Applying LSTMs with dropout to a full-scale version of Elman’s prediction task
An RNN for character prediction We can see this as several copies of an Elman net placed next to each other. Note that there are only three actual weight arrays, just as in the Elman network. But now we can do ‘back propagation through time’. Gradients are propagated through all arrows, and many different paths affect the same weights. We simply add all these gradient paths together before we change the weights. What happens when we want to process the next characters in our sequence? We keep the last hidden state, but don’t backpropagate through it. You can think of this as just an Elman net unrolled for several time steps!
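Below is a manual NumPy sketch of truncated back propagation through time for a character-prediction SRN, building on the earlier SRN sketch: gradient contributions from every unrolled step are summed into the same three weight arrays, and the last hidden state is carried into the next chunk without backpropagating through it. Sizes, learning rate, and the toy sequence are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_char, n_hid = 26, 20                     # illustrative sizes
W_xh = rng.normal(0, 0.1, (n_hid, n_char))
W_hh = rng.normal(0, 0.1, (n_hid, n_hid))
W_hy = rng.normal(0, 0.1, (n_char, n_hid))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def bptt_chunk(inputs, targets, h_prev, lr=0.1):
    """Unroll over one chunk, sum the gradient paths from all time steps into
    the same three weight arrays, take one update, and return the last hidden
    state so the next chunk can use it without backpropagating through it."""
    xs, hs, ps = {}, {-1: h_prev}, {}
    loss = 0.0
    for t, (c_in, c_out) in enumerate(zip(inputs, targets)):   # forward unroll
        xs[t] = np.eye(n_char)[c_in]
        hs[t] = np.tanh(W_xh @ xs[t] + W_hh @ hs[t - 1])
        ps[t] = softmax(W_hy @ hs[t])
        loss -= np.log(ps[t][c_out])
    dW_xh, dW_hh, dW_hy = np.zeros_like(W_xh), np.zeros_like(W_hh), np.zeros_like(W_hy)
    dh_next = np.zeros(n_hid)
    for t in reversed(range(len(inputs))):                     # backward unroll
        dy = ps[t].copy()
        dy[targets[t]] -= 1.0              # softmax + cross-entropy gradient
        dW_hy += np.outer(dy, hs[t])
        dh = W_hy.T @ dy + dh_next         # paths from the output and from later steps
        dz = (1.0 - hs[t] ** 2) * dh       # back through tanh
        dW_xh += np.outer(dz, xs[t])
        dW_hh += np.outer(dz, hs[t - 1])
        dh_next = W_hh.T @ dz
    for W, dW in ((W_xh, dW_xh), (W_hh, dW_hh), (W_hy, dW_hy)):
        W -= lr * dW                       # one update from the summed gradient paths
    return loss, hs[len(inputs) - 1]       # last hidden state seeds the next chunk

# Two consecutive chunks of a toy character sequence (indices into the alphabet).
h = np.zeros(n_hid)
seq = [0, 1, 2, 3, 4, 5, 6, 7]
loss, h = bptt_chunk(seq[:4], seq[1:5], h)   # target is always the next character
loss, h = bptt_chunk(seq[4:7], seq[5:8], h)  # context carried over, gradients truncated
```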
Parallelizing the computation Create 20 copies of the whole thing – call each one a stream. Process your text starting at 20 different points in the data. Add up the gradient across all the streams at the end of processing a batch. Then take one gradient step! The forward and backward computations are farmed out to a GPU, so they actually occur in parallel using the weights from the last update.
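Here is a small sketch of one way to lay out the 20 streams as rows of a batch: the token sequence is cut into 20 contiguous pieces that start at different points, each chunk-sized slice across the rows is processed in parallel, and the gradients from all rows are summed before a single update. The stream count, chunk length, and stand-in data are assumptions for illustration.

```python
import numpy as np

n_streams, chunk_len = 20, 35            # illustrative values
text_ids = np.arange(20 * 35 * 10)       # stand-in for a long sequence of token ids

# Cut the text into n_streams contiguous pieces and stack them as rows,
# so row i starts reading the data at a different point.
usable = (len(text_ids) // n_streams) * n_streams
streams = text_ids[:usable].reshape(n_streams, -1)

# Each batch is one chunk-length slice from every stream, processed in parallel;
# the gradients from all rows are summed and a single step is taken.
n_chunks = streams.shape[1] // chunk_len
for b in range(n_chunks):
    batch = streams[:, b * chunk_len:(b + 1) * chunk_len]   # shape (20, 35)
    # Forward/backward over `batch` (e.g. with something like the bptt_chunk
    # sketch above applied per row), sum the gradients across rows, then take
    # one gradient step with the summed gradient.
```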
Some problems and solutions Classic RNN at right (note superscript l for layer) The vanishing gradient problem Solution: the LSTM Has its own internal state ‘c’ Has weights to gate input and output and to allow it to forget Note the dot-product notation Overfitting Solution: dropout The TensorFlow RNN tutorial does the prediction task using stacked LSTMs with dropout (dotted lines)
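Below is a minimal NumPy sketch of a single LSTM cell step, using the common input/forget/output gate formulation, with dropout applied only to the copy of the hidden state passed up to the next layer (the non-recurrent, dotted-line connections). Sizes, initialization, and the exact gate parameterization are illustrative and may differ from the tutorial’s implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 10, 8                        # illustrative sizes

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate, acting on the concatenated [input, previous hidden].
Wi, Wf, Wo, Wg = (rng.normal(0, 0.1, (n_hid, n_in + n_hid)) for _ in range(4))

def lstm_step(x, h_prev, c_prev, drop_p=0.0):
    """One LSTM step: gates decide what to write, what to forget, and what to
    expose; `c` is the cell's own internal state, carried additively across
    steps, which is what keeps gradients from vanishing."""
    z = np.concatenate([x, h_prev])
    i = sigmoid(Wi @ z)                    # input gate
    f = sigmoid(Wf @ z)                    # forget gate
    o = sigmoid(Wo @ z)                    # output gate
    g = np.tanh(Wg @ z)                    # candidate values
    c = f * c_prev + i * g                 # elementwise gating of the cell state
    h = o * np.tanh(c)
    # Dropout (if any) is applied only to the copy of h passed upward to the
    # next layer, not to the recurrent h -> h connection.
    h_up = h
    if drop_p > 0.0:
        mask = (rng.random(n_hid) > drop_p) / (1.0 - drop_p)
        h_up = h * mask
    return h, c, h_up

h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c, h_up = lstm_step(np.ones(n_in), h, c, drop_p=0.5)
```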
Word Embeddings Use a learned word vector instead of one unit per word Similar to the Rumelhart model from Tuesday and the first hidden layer of 10 units from Elman (1991) We use backprop to (conceptually) change the word vectors, rather than the input-to-word-vector weights, but it is essentially the same computation.
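A small sketch of why those two descriptions are the same computation: looking up a row of an embedding matrix gives exactly the result of multiplying a one-hot input into a weight matrix, so training the word vectors and training the input-to-hidden weights coincide. Vocabulary size and embedding dimension are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 1000, 50                  # illustrative sizes
E = rng.normal(0, 0.1, (vocab_size, embed_dim))   # embedding matrix / input weights

word_id = 42
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

# Multiplying a one-hot vector into the weight matrix selects one row...
via_weights = one_hot @ E
# ...which is exactly what an embedding lookup does directly.
via_lookup = E[word_id]

assert np.allclose(via_weights, via_lookup)
```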
Zaremba et al. (2016) Uses stacked LSTMs with dropout Learns its own word vectors as it goes Shows performance gains compared with other network variants Was used in the TensorFlow RNN tutorial Could be used to study lots of interesting cognitive science questions Will be available on the lab server soon!