Advanced Recurrent Architectures


1 Advanced Recurrent Architectures
Davide Bacciu, Dipartimento di Informatica, Università di Pisa. Intelligent Systems for Pattern Recognition (ISPR).

2 Lecture Outline
- Dealing with structured/compound data
  - Sequence-to-sequence
  - Attention models
- Dealing with very long-term dependencies
  - Multiscale networks
  - Adding memory components
- Neural reasoning
  - Neural Turing machines

3 Basic Gated RNN (GRNN) Limitations
- GRNNs are excellent at handling size/topology-varying data in input
- How can we handle size/topology-varying outputs? Sequence-to-sequence
- Structured data is compound information: efficient processing needs the ability to focus on certain parts of it, hence attention mechanisms
- GRNNs have trouble dealing with very long-range dependencies: introduce multiscale representations explicitly in the architecture, or add external memory components

4 Learning to Output Variable Length Sequences
- The idea of an unfolded RNN with blank inputs/outputs does not really work well
- The approach is based on an encoder-decoder scheme
(Figure: input sequence x_1, …, x_n mapped to output sequence y_1, …, y_m)

5 Encoder
- Produces a compressed, fixed-length representation c of the whole input sequence x_1, …, x_n
- Originally c = h_n, the activations of an LSTM/GRU layer of K cells after reading the last input
(Figure: recurrent encoder; inputs x_1, …, x_n fed through W_in, hidden states h_1, …, h_n chained by W_h, with c = h_n)
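A minimal NumPy sketch of the encoder idea (not the course's reference code): a plain tanh cell stands in for the LSTM/GRU layer on the slide, and the weight names (W_in, W_h, b) are illustrative assumptions.

```python
import numpy as np

def encode(xs, W_in, W_h, b):
    """Run a simple recurrent encoder over the input sequence.

    xs   : list of input vectors x_1 ... x_n
    W_in : input-to-hidden weights,  shape (K, input_dim)
    W_h  : hidden-to-hidden weights, shape (K, K)
    b    : hidden bias,              shape (K,)
    Returns the fixed-length context c = h_n (and all states for later use).
    """
    K = W_h.shape[0]
    h = np.zeros(K)                              # h_0
    states = []
    for x in xs:                                 # unfold over the sequence
        h = np.tanh(W_in @ x + W_h @ h + b)      # tanh cell in place of LSTM/GRU
        states.append(h)
    c = states[-1]                               # originally c = h_n
    return c, states
```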

6 Decoder
- An LSTM/GRU layer of K cells, seeded by the context vector c
- Different approaches to realize this in practice
(Figure: decoder states s_1, …, s_m chained by W'_h, each emitting y_i through W_out; c feeds the decoder)

7 Decoder
- We risk losing memory of c soon
- If we share the parameters between encoder and decoder, we can take s_1 = c
- Or, at least, assume c and s_1 have compatible sizes
(Figure: decoder as before, with c injected only as the initial state s_1)

8 Decoder
- Alternatively, c is contextual information kept throughout output generation, connected to every decoder state through weights W_c
(Figure: c feeds each state s_1, …, s_m via W_c)

9 Decoder
- It is better to work in a one-step-ahead scheme: s_i = f(c, s_{i-1}, y_{i-1}), sketched below
- The previous output y_{i-1} is fed back as input to the next decoder step through W_in
- Remember teacher forcing (only) at training time
(Figure: decoder with previous outputs y_1, …, y_{m-1} fed back through W_in and context c through W_c)
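A sketch of the one-step-ahead decoder under simplifying assumptions: a plain tanh cell instead of LSTM/GRU, hypothetical weight names (W_c, W_h, W_in, W_out, b), and teacher forcing applied only when target outputs are supplied.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decode(c, m, params, targets=None):
    """One-step-ahead decoding: s_i = f(c, s_{i-1}, y_{i-1}).

    c       : context vector from the encoder
    m       : number of output steps to generate
    params  : dict with W_c, W_h, W_in, W_out, b (illustrative names)
    targets : if given (training), teacher forcing feeds the true y_{i-1}
    """
    W_c, W_h, W_in, W_out, b = (params[k] for k in ("W_c", "W_h", "W_in", "W_out", "b"))
    s = np.tanh(W_c @ c + b)              # seed the first state with c
    y_prev = np.zeros(W_in.shape[1])      # conventional "start" symbol
    outputs = []
    for i in range(m):
        s = np.tanh(W_c @ c + W_h @ s + W_in @ y_prev + b)
        y = softmax(W_out @ s)            # distribution over output symbols
        outputs.append(y)
        # teacher forcing: feed the ground-truth previous output at training time only
        y_prev = targets[i] if targets is not None else y
    return outputs
```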

10 Sequence-To-Sequence Learning
- Encoder and decoder can share parameters (but it is uncommon)
- Encoder and decoder can be trained end-to-end or independently
- Reversing the input sequence during encoding can increase performance
(Figure: full encoder-decoder unrolled; encoder states h_1, …, h_n produce c, decoder states s_1, …, s_m produce y_1, …, y_m)

11 On the Need of Paying Attention
- The encoder-decoder scheme assumes the hidden activation of the last input element summarizes enough information to generate the whole output
- This biases the context toward the most recent past
- Other parts of the input sequence might be very informative for the task, possibly elements appearing very far from the sequence end
(Figure: c taken only from the last encoder state h_n)

12 On the Need of Paying Attention
- An attention mechanism selects which parts of the sequence to focus on in order to obtain a good c
(Figure: an attention module sits between the encoder states h_1, …, h_n and the context c)

13 Attention Mechanisms – Blackbox View
- The attention module takes the encodings h_1, …, h_n and some context information s, and outputs the aggregated seed c
- What is inside of the box?
(Figure: black-box view of the attention module)

14 What’s inside of the box?
The Revenge of the Gates!

15 Opening the Box
(Figure: each encoding h_i is combined with the context s to produce a relevance score e_i; the scores pass through a softmax to give weights α_1, …, α_n, which multiply the encodings and are summed into the aggregated seed c)

16 Opening the Box – Relevance
- A tanh layer fuses each encoding with the current context s to produce a relevance score: e_i = a(s, h_i)
(Figure: the relevance scores e_1, …, e_n feed the softmax as in the previous slide)

17 Opening the Box – Softmax
- The softmax is a differentiable max-selector operator: α_i = exp(e_i) / Σ_j exp(e_j)
(Figure: as before, with the softmax stage highlighted)

18 Opening the Box – Voting
- The aggregated seed is obtained by (soft) attention voting: c = Σ_i α_i h_i (see the sketch below)
(Figure: the weighted encodings are summed into c)
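Putting the three steps together, a minimal NumPy sketch of the soft-attention module: relevance scores e_i = a(s, h_i) from a small tanh layer, softmax weights α, and the voted seed c = Σ_i α_i h_i. The weight names (W_s, W_e, v) are assumptions, not the lecture's notation.

```python
import numpy as np

def soft_attention(s, H, W_s, W_e, v):
    """s: context vector; H: encoder states stacked as rows, shape (n, K)."""
    # 1) relevance: e_i = a(s, h_i), here a tanh layer followed by a projection
    e = np.array([v @ np.tanh(W_s @ s + W_e @ h) for h in H])
    # 2) softmax: differentiable max selector
    e = e - e.max()
    alpha = np.exp(e) / np.exp(e).sum()
    # 3) voting: aggregated seed c = sum_i alpha_i h_i
    c = alpha @ H
    return c, alpha
```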

19 Attention in Seq2Seq
- The context is the past output (decoder) state
- The seed takes into account a (subset of the) input states
(Figure: at each decoding step, attention over the encoder states h_1, …, h_n, driven by the decoder state, produces the context used to emit y_i)

20 Learning to Translate with Attention
Bahdanau et al., Neural machine translation by jointly learning to align and translate, ICLR 2015

21 Advanced Attention – Generalize Relevance
- The relevance component determines how much each encoding h_i is correlated/associated with the current context s
- It can be a dot product or a full neural network (both sketched below)
(Figure: each e_i is produced by a small network net_i taking h_i and s)
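Two common choices for the relevance function, sketched with hypothetical weight names: a plain dot product (which requires s and h_i to have the same size) and a small feed-forward net over the concatenation.

```python
import numpy as np

def relevance_dot(s, H):
    """Dot-product scoring: e_i = s . h_i (s and each h_i must have the same size)."""
    return H @ s

def relevance_mlp(s, H, W1, w2):
    """A small network per position: e_i = w2 . tanh(W1 [s; h_i])."""
    return np.array([w2 @ np.tanh(W1 @ np.concatenate([s, h])) for h in H])
```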

22 Advanced Attention – Hard Attention
- Hard attention samples a single encoding, using the α_i as a probability distribution
- Not differentiable: the gradient is approximated by sampling (MCMC)
(Figure: a random sample replaces the soft weighted sum)
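A sketch of hard attention at inference time: sample one index from the α distribution and return that single encoding. Training would need a gradient estimator (the slide mentions sampling-based approximations), which is not shown here.

```python
import numpy as np

def hard_attention(alpha, H, rng=np.random.default_rng()):
    """Sample a single encoding h_i with probability alpha_i."""
    i = rng.choice(len(alpha), p=alpha)   # stochastic, hence not differentiable
    return H[i], i
```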

23 Attention-Based Captioning – Focus Shifting
(Figure: attention maps over the image as the caption is generated, comparing soft attention and hard attention)
Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015

24 Attention-Based Captioning - Generation
- The model learns to correlate textual and visual concepts
- The attention maps help in understanding why the model fails
Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015

25 Attention-Based Captioning – The Model
- The encodings are associated with n image regions
- They are taken from convolutional layers rather than from fully connected ones
Xu et al., Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015

26 RNN and Memory – Issue 1
- Gated RNNs claim to solve the problem of learning long-range dependencies
- In practice, it is still difficult to learn over longer ranges
- Architectures trying to optimize dynamic memory usage: Clockwork RNN, Skip RNN, Multiscale RNN, Zoneout
- All are based on selective update of the state

27 Clockwork RNN
- Modular recurrent layer where each module is updated at a different clock rate
- Modules are interconnected only when the destination clock time is larger
- Problems with "slow" (low-frequency) units:
  - Inactive over long periods of time
  - As a result, undertrained
  - Barely contribute to prediction
- Shift-variant: the network responds differently to the same input stimulus applied at different moments in time
Koutnik et al., A Clockwork RNN, ICML 2014
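A toy sketch of the clockwork idea: the hidden state is partitioned into modules, each with its own period, and at step t only the modules whose period divides t update their block of the state; the rest is copied forward. The block-structured connectivity of the original paper is only hinted at in a comment.

```python
import numpy as np

def clockwork_step(t, h, x, modules, W_in, W_h, b):
    """One step of a simplified Clockwork RNN.

    modules : list of (slice, period) pairs partitioning the state vector h;
              a module updates only when t is a multiple of its period.
    The full model also restricts W_h to a block structure so that modules
    only read from appropriately clocked modules; omitted here for brevity.
    """
    pre = np.tanh(W_in @ x + W_h @ h + b)   # candidate update for the whole state
    h_new = h.copy()                         # default: copy (module inactive)
    for sl, period in modules:
        if t % period == 0:                  # module is active at this clock tick
            h_new[sl] = pre[sl]
    return h_new
```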

28 Skip RNN
- A binary state-update gate determines whether the RNN state is updated or simply copied from the previous step
- Replacing the gated update with copying increases the network's memory (the LSTM has an exponential fading effect due to the multiplicative gate)
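A sketch of the Skip RNN state transition: a binary gate u_t decides whether to run the recurrent cell (update) or copy the previous state unchanged. In the actual model the gate is produced by the network itself and trained with a gradient estimator; here it is simply passed in to keep the sketch minimal.

```python
import numpy as np

def skip_rnn_step(u_t, h_prev, x_t, cell):
    """u_t in {0, 1}: 1 = update the state with the cell, 0 = copy it forward."""
    if u_t == 1:
        return cell(h_prev, x_t)     # ordinary gated update (e.g. an LSTM/GRU cell)
    return h_prev                    # state copied: no fading, no extra computation

# Illustrative usage with a trivial tanh cell:
cell = lambda h, x: np.tanh(0.5 * h + 0.5 * x)
h = np.zeros(4)
for u, x in zip([1, 0, 0, 1], np.ones((4, 4))):
    h = skip_rnn_step(u, h, x, cell)
```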

29 Skip RNN and Attention
(Figure: attended vs ignored pixels when a Skip RNN processes image input)
Campos et al., Skip RNN: Skipping State Updates in Recurrent Neural Networks, ICLR 2018

30 RNN and Memory – Issue 2
- A motivating example: long-term memory is required to read a story (or watch a movie) and then answer questions about it
- To solve such a task, the model needs to memorize facts, the question and the answers
- This is a bit too much for the dynamical memory of an RNN
- Try to address it through an external memory

31 Memory Networks - General Idea
- A neural network coupled with an external memory module
- The network writes to and reads from the memory
(Figure: neural network with read/write access to a memory module)

32 Memory Network Components
- (I) Input feature map: encodes the input into a feature vector
- (G) Generalization: decides what input (or function of it) to write to memory
- (O) Output feature map: reads the relevant memory slots
- (R) Response: returns the prediction given the retrieved memories
- Extremely general framework, which can be instantiated in several ways (a skeleton of the interfaces follows)
Weston et al., Memory Networks, ICLR 2015
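The four components written as a skeleton class, just to make the interfaces explicit; every method is a placeholder to be instantiated (with embeddings, attention, classifiers, …) as in the paper. The names and the trivial G are illustrative assumptions.

```python
class MemoryNetwork:
    """Skeleton of the (I, G, O, R) framework of Weston et al."""

    def __init__(self):
        self.memory = []                 # list of stored memory vectors

    def I(self, x):
        """Input feature map: encode raw input into a feature vector."""
        raise NotImplementedError

    def G(self, m):
        """Generalization: decide what (function of the) input to write to memory."""
        self.memory.append(m)            # simplest choice: store the encoding as-is

    def O(self, q):
        """Output feature map: read the memory slots relevant to query q."""
        raise NotImplementedError

    def R(self, o, q):
        """Response: produce the prediction from the retrieved memories."""
        raise NotImplementedError

    def answer(self, facts, question):
        for f in facts:
            self.G(self.I(f))
        q = self.I(question)
        return self.R(self.O(q), q)
```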

33 End-to-End Memory Networks
- Facts x_i are stored as memories
- Query-driven soft attention searches for the memories matching the query
- The output memories are combined through the soft attention weights
- The combined reading is used to generate the prediction
Sukhbaatar et al., End-to-end Memory Networks, NIPS 2015
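A single-hop sketch of the end-to-end memory network in NumPy, under simplifying assumptions: facts and the query are already encoded as vectors, A and C are the input/output memory embeddings, the query is embedded with A for brevity (the paper uses a separate matrix B), and W maps the combined reading to answer scores.

```python
import numpy as np

def memn2n_hop(facts, q, A, C, W):
    """One hop of an end-to-end memory network (simplified).

    facts : array (n_facts, d) of encoded facts x_i
    q     : encoded query, shape (d,)
    A, C  : input / output memory embeddings, shape (d_mem, d)
    W     : answer projection, shape (n_answers, d_mem)
    """
    m = facts @ A.T                      # input memories  m_i = A x_i
    c = facts @ C.T                      # output memories c_i = C x_i
    u = A @ q                            # embedded query
    scores = m @ u                       # match each memory against the query
    p = np.exp(scores - scores.max())
    p = p / p.sum()                      # query-driven soft attention
    o = p @ c                            # combine the output memories
    return W @ (o + u)                   # answer scores (final softmax omitted)
```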

34 Memory Network Extensions
- Use more complex output components, e.g. an RNN to generate response sequences
- Stack multiple memory network layers, often with tied weights: several iterations of reasoning produce a better answer
- The write component is still somewhat naive

35 Memory Nets for Visual QA with Attention
- Stacked attention layers: first locate objects and concepts, then focus on the most interesting regions (iterative reasoning again)
- The question is processed by a CNN or an LSTM
- A VGG CNN is used for encoding the image
Yang et al., Stacked Attention Networks for Image Question Answering, CVPR 2016

36 Memory Nets for Visual QA with Attention

37 Neural Turing Machines
- Memory networks that can read and write memories at both training and test time
- End-to-end differentiable
- Always reading and writing the whole memory is the key to differentiability
(Figure: NTM architecture; a controller mediates between input/output and the memory through read and write heads)

38 Neural Controller
- Typically an RNN emitting vectors that control reading from and writing to the memory
- The key to differentiability is to always read and write the whole memory
- Soft attention determines how much to read/write at each memory location
(Image credit: colah.github.io)

39 Memory Read
- The controller RNN emits an attention distribution vector a, where a_i is the interest in the i-th memory location M_i
- The retrieved memory is a weighted mean of all memories: r = Σ_i a_i M_i
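The read operation is a one-liner in NumPy: with M storing one memory vector per row and a the attention distribution from the controller, the read vector is the weighted mean over all locations.

```python
import numpy as np

def ntm_read(M, a):
    """M: memory matrix (n_locations, width); a: attention weights summing to 1."""
    return a @ M          # r = sum_i a_i M_i, a weighted mean over all locations
```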

40 Memory Write
- An attention distribution vector a describes how much we change each memory location
- Given a value w to write, the update is M_i^new = a_i w + (1 − a_i) M_i^old
- The write operation is actually performed by composing erase and add operations
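A sketch of the write in the erase-then-add form used by the NTM: each location is partially erased according to a_i and an erase vector, then the scaled add vector is accumulated. Setting the erase vector to all ones and the add vector to w recovers the interpolation on the slide.

```python
import numpy as np

def ntm_write(M, a, erase, add):
    """Erase-then-add write over the whole memory (keeps everything differentiable).

    M     : memory matrix (n_locations, width)
    a     : attention weights over locations, shape (n_locations,)
    erase : erase vector in [0, 1]^width
    add   : add vector, shape (width,)
    With erase = ones and add = w this reduces to M_i <- a_i w + (1 - a_i) M_i.
    """
    M = M * (1.0 - np.outer(a, erase))   # partially erase each location
    M = M + np.outer(a, add)             # then accumulate the new content
    return M
```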

41 NTM Attention Focusing
- Step 1: generate a content-based memory indexing by comparing a query vector emitted by the controller against the memory

42 NTM Attention Focusing
- Step 2: interpolate with the attention vector from the previous time step, using an interpolation amount emitted by the controller

43 NTM Attention Focusing
- Step 3: generate a location-based indexing with a shift distribution filter, which determines how we move between locations in memory
- Step 4: sharpen the distribution for the final memory access
- The neural controller generates the queries, interpolation amounts and shift values
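A sketch of the whole addressing pipeline for one head, with hypothetical parameter names: cosine-similarity content addressing scaled by a key strength β, interpolation with the previous attention via a gate g, a circular convolution with the shift distribution, and a final sharpening exponent γ.

```python
import numpy as np

def ntm_address(M, key, beta, g, shift, gamma, a_prev):
    """Content + location based addressing (one head), following the NTM recipe.

    M      : memory (n, width);  key: query vector from the controller
    beta   : key strength;       g: interpolation gate in [0, 1]
    shift  : shift distribution (e.g. over {-1, 0, +1}), summing to 1
    gamma  : sharpening exponent >= 1;  a_prev: attention from the previous step
    """
    # 1) content-based indexing: cosine similarity with the key, softmax scaled by beta
    sim = (M @ key) / (np.linalg.norm(M, axis=1) * np.linalg.norm(key) + 1e-8)
    w = np.exp(beta * sim)
    w = w / w.sum()
    # 2) interpolate with the previous attention
    w = g * w + (1.0 - g) * a_prev
    # 3) location-based indexing: circular convolution with the shift distribution
    shifted = np.zeros_like(w)
    offsets = range(-(len(shift) // 2), len(shift) // 2 + 1)
    for s, p in zip(offsets, shift):
        shifted += p * np.roll(w, s)
    w = shifted
    # 4) sharpen the distribution for the final memory access
    w = w ** gamma
    return w / w.sum()
```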

44 Practical Use?
- Still missing the killer application
- Not straightforward to train
- Superior to GRNNs when it comes to learning to program: copy task, repeat copy, associative recall, sorting
- Foundations for neural reasoning and pondering networks

45 Software
- Complete sequence-to-sequence tutorial (including attention) in TensorFlow
- A shorter version in Keras
- GitHub project collecting several memory-augmented networks
- PyTorch implementation of stacked attention for visual QA (originally Theano-based)
- Many implementations of the NTM (Keras, PyTorch, Lasagne, …): none seemingly official

46 Take Home Messages
- Attention, attention and, again, attention
  - Soft attention is nice because it makes everything fully differentiable
  - Hard attention is stochastic, hence it cannot be trained by plain backpropagation
  - There is empirical evidence that the two are sensitive to different things
- Encoder-decoder scheme
  - A general architecture to compose heterogeneous models and data
  - Decoding allows sampling complex predictions from a distribution conditioned on the encoding
- Memory and RNNs
  - Efficient use of dynamic memory
  - External memory for search and recall tasks
  - Read/write memory for neural reasoning

