
1 Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
S. Bengio, O. Vinyals, N. Jaitly, N. Shazeer (arXiv). Presented by Hanyi Zhang

2 Contents
Sequence Prediction
Recurrent Neural Network
Problem Description and Proposed Model
Training using Scheduled Sampling
Inference Process
Application Areas: Image Captioning, Constituency Parsing, Speech Recognition

3 Sequence Learning
Sequence learning is composed of three problems:
Sequence Prediction
Sequence Recognition
Sequence Decision Making

4 What is Sequence Prediction?
Attempts to predict elements of a sequence on the basis of the preceding elements.
Popular areas:
Speech Recognition
Machine Translation
Conversation Modeling
Image/Video Captioning
Question Answering
Image Generation
Turing Machines
Robotics Demo

5 In Karpathy's tutorial, Sequence Prediction includes:
Generate literature in the style of Shakespeare
Generate Wikipedia articles, XML, Markdown
Generate LaTeX structure
Generate Linux source code
A glance at the results:

6 Shakespeare: 3-layer RNN, 512 hidden nodes

7 Markdown: LSTM trained on the 100MB Hutter Prize dataset

8 Code Generation
The RNN learns to generate the GNU license disclaimer at the top of the file.
Training set: 474MB of C code from the Linux kernel source
Model: 3-layer LSTM (~10 million parameters), trained for a few days

9 Recurrent Neural Network
Generally trained to maximize the likelihood of each target token given the current state of the model, including past states. Modes of operation:
Vanilla processing without an RNN: fixed-size input to fixed-size output
Sequence output (e.g. image captioning: an image in, a sentence of words out)
Sequence input (e.g. sentiment analysis: a given sentence is classified as expressing positive or negative sentiment)
Sequence input and sequence output (e.g. machine translation: an RNN reads a sentence in English and outputs a sentence in French)
Synced sequence input and output (e.g. video classification: label each frame of the video)
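To make the recurrence concrete, here is a minimal sketch of a single vanilla RNN step in NumPy. It is not the model from the paper; the names (VanillaRNN, Wxh, Whh, Why, step) are illustrative. The hidden state summarizes all previous tokens, and the softmax over the output scores the next token.

```python
import numpy as np

class VanillaRNN:
    """Minimal token-level RNN cell (illustrative only, not the paper's model)."""
    def __init__(self, vocab_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        self.vocab_size = vocab_size
        self.Wxh = rng.normal(0, 0.01, (hidden_size, vocab_size))   # input  -> hidden
        self.Whh = rng.normal(0, 0.01, (hidden_size, hidden_size))  # hidden -> hidden
        self.Why = rng.normal(0, 0.01, (vocab_size, hidden_size))   # hidden -> output
        self.bh = np.zeros(hidden_size)
        self.by = np.zeros(vocab_size)

    def step(self, x_onehot, h_prev):
        """One time step: consume one token, update the state, score the next token."""
        h = np.tanh(self.Wxh @ x_onehot + self.Whh @ h_prev + self.bh)
        logits = self.Why @ h + self.by
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                 # softmax over the dictionary
        return h, probs
```

Unrolling `step` over a token sequence and summing the per-step negative log-likelihoods gives the usual training loss.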

10 Proposed Model

11 Explained by Diagram

12 Features of this model
This model focuses on learning to output the next token given the current state of the model AND the previous token.
A special token <EOS> marks the end of each sequence.

13 Problem Description
Training process: always feed the true output y_{t-1} to the next step as input_t.
Inference process: always feed the sampled output y'_{t-1} to the next step as input'_t.
So at inference time, the model does not know how to correct a mistake caused by a wrong previous state vector.
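A sketch of this mismatch, assuming the hypothetical model.step interface from the RNN sketch above: during training the input at each step is the true previous token (teacher forcing), while at inference it must be the model's own previous prediction.

```python
import numpy as np

def onehot(idx, size):
    v = np.zeros(size)
    v[idx] = 1.0
    return v

def train_step_teacher_forced(model, tokens, h0):
    """Training: the input at step t is always the TRUE previous token y_{t-1}."""
    h, loss = h0, 0.0
    for t in range(1, len(tokens)):
        h, probs = model.step(onehot(tokens[t - 1], model.vocab_size), h)
        loss -= np.log(probs[tokens[t]] + 1e-12)      # NLL of the true token y_t
    return loss

def generate_free_running(model, start_token, h0, max_len, eos_token):
    """Inference: the input at step t is the model's OWN previous output y'_{t-1}."""
    h, tok, out = h0, start_token, []
    for _ in range(max_len):
        h, probs = model.step(onehot(tok, model.vocab_size), h)
        tok = int(np.argmax(probs))                   # greedy pick (could also sample)
        out.append(tok)
        if tok == eos_token:                          # <EOS> ends the sequence
            break
    return out
```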

14 Problem Description (Cont.)
At inference time, the true previous target tokens are unavailable and are replaced by tokens generated by the model itself, which yields a discrepancy between how the model is used at training time and at inference time.
Mistakes made early in the sequence generation are fed back as input to the model and can be quickly amplified, because the model may end up in a part of the state space it has never seen at training time.

15 Solution: Scheduled Sampling in Training
With probability epsilon, use the true previous token as the next input; with probability (1 - epsilon), use a token sampled from the model itself.
epsilon = 1: training as before (teacher forcing)
epsilon = 0: training in the same regime as inference
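A minimal scheduled-sampling training step, reusing the hypothetical model.step / onehot helpers from the sketches above: a per-step coin flip decides whether the next input is the ground-truth token or a sample from the model's own output distribution. The loss is still computed against the true tokens.

```python
import numpy as np

def train_step_scheduled_sampling(model, tokens, h0, epsilon, rng=None):
    """epsilon = 1.0 -> plain teacher forcing; epsilon = 0.0 -> always feed model samples."""
    rng = rng or np.random.default_rng()
    h, loss = h0, 0.0
    prev = tokens[0]                                   # the first input is always given
    for t in range(1, len(tokens)):
        h, probs = model.step(onehot(prev, model.vocab_size), h)
        loss -= np.log(probs[tokens[t]] + 1e-12)       # loss is still w.r.t. the TRUE token
        if rng.random() < epsilon:
            prev = tokens[t]                           # feed the ground truth next
        else:
            prev = int(rng.choice(model.vocab_size, p=probs))  # feed the model's own sample
    return loss
```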

16 Another paper also mentions randomly selecting the true value y or the sampled y (with a figure that makes this easier to understand).
Reference: Bing Liu, Ian Lane. Recurrent Neural Network Structured Output Prediction for Spoken Language Understanding

17 They call it a curriculum learning strategy
Curriculum strategy: start small, learn easier aspects of the task or easier subtasks, and then gradually increase the difficulty level.
Objective: improved generalization and faster convergence.
Reference: Y. Bengio et al. Curriculum Learning

18 Curriculum Learning as a Continuation Method
Continuation methods were introduced by E. L. Allgower & K. Georg.
For a family of cost functions Cλ(θ), C0 is a smooth version of the real objective C1.
C0 reveals the global picture, while C1 is the criterion we actually wish to minimize.
Gradually increase λ while keeping θ at a local minimum of Cλ(θ).
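In the paper, the sampling probability epsilon_i is decayed as a function of the training step i, mirroring this continuation idea of moving gradually from the easy objective to the hard one. A sketch of the three schedules the paper proposes (linear, exponential, inverse sigmoid); the constants used here are illustrative, not tuned values.

```python
import math

def epsilon_linear(i, k=1.0, c=1e-4, eps_min=0.0):
    """Linear decay: eps_i = max(eps_min, k - c*i)."""
    return max(eps_min, k - c * i)

def epsilon_exponential(i, k=0.999):
    """Exponential decay: eps_i = k**i, with k < 1."""
    return k ** i

def epsilon_inverse_sigmoid(i, k=1000.0):
    """Inverse sigmoid decay: eps_i = k / (k + exp(i / k)), with k >= 1."""
    return k / (k + math.exp(i / k))
```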

19 Inference Process
Beam search: maintain a heap of the m best candidate sequences.
If a wrong decision is taken at time t-1, the model can end up in a part of the state space that is very different from those visited under the training distribution, and for which it does not know what to do.
Denote the dictionary size by D and the total number of time steps by T; beam search maintains a list of the m best candidates at each step.

Method          Total work
Brute force     D^T
Greedy search   DT
Beam search     mDT

20 Experiment 1: Image Captioning
Pre-trained ConvNet as the image encoder
LSTM: one layer of 512 hidden units
Reference: Show and Tell: A Neural Image Caption Generator

21 ConvNet: Image Representation; LSTM: Sentence Generator
Given an image I, maximize the probability of the correct description S: θ* = argmax_θ Σ_(I,S) log p(S | I; θ).
Given the previously generated tokens, predict the next token: log p(S | I) = Σ_t log p(S_t | I, S_0, ..., S_(t-1)).
Use a CNN as the image encoder: first pre-train it for an image classification task, then use the last hidden layer as the input to the RNN decoder that generates sentences.
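A minimal PyTorch sketch of the encoder-decoder wiring described above, assuming pre-extracted CNN features; the class and parameter names (CaptionDecoder, feat_dim, etc.) are illustrative, not from the paper. The image vector is fed once at the first step, each word then conditions the prediction of the next one, and training minimizes the per-token cross-entropy over these logits.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """Hypothetical sketch: a CNN feature vector conditions an LSTM that emits word scores."""
    def __init__(self, feat_dim, vocab_size, hidden_size=512):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, hidden_size)     # image -> first "word" embedding
        self.embed = nn.Embedding(vocab_size, hidden_size)   # word embeddings
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers=1, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)        # hidden state -> word scores

    def forward(self, cnn_features, captions):
        """cnn_features: (B, feat_dim); captions: (B, T) word indices."""
        img = self.img_proj(cnn_features).unsqueeze(1)        # (B, 1, H), fed at the first step
        words = self.embed(captions)                          # (B, T, H)
        inputs = torch.cat([img, words], dim=1)               # image first, then the words
        hidden, _ = self.lstm(inputs)                         # (B, T+1, H)
        return self.out(hidden)                               # logits over the dictionary
```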

22 CNN Encoding
Karpathy's approach: represent images using an RCNN (Region Convolutional Neural Network); detect the top 19 locations in an image across 200 classes, so instead of using the whole image, each image is represented as a set of h-dimensional vectors.
In this paper, each image in the corpus has 5 different captions; the training procedure picks one at random, creates a mini-batch of examples, and optimizes the objective function.
Reference: Karpathy & Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions

23 LSTM-based Sentence Generator

24 Image Captioning Spec & Results
Training set: MSCOCO, 75k images
Test set: 5k images
Dictionary: 8857 words
LSTM: one layer of 512 hidden units

25 Experiment 1: Image Captioning (Cont.)
Results:
Scheduled sampling successfully trained a model that is more resilient to failures caused by the training/inference mismatch.
Always sampling yields very poor performance, since convergence takes much more time.
The scheduled sampling approach ranked first on the final leaderboard of the 2015 MSCOCO image captioning challenge.

26 Experiment 2: Constituency Parsing
Reference: O. Vinyals et al. Grammar as a Foreign Language

27 Experiment 2: Constituency Parsing (Cont.)
RNN encoder and decoder
One layer of 512 LSTM cells
Target dictionary: 128 symbols
WSJ dataset, 40k training instances (a small training set)
The mapping is largely deterministic, so perplexities are low
Reference: Neural Machine Translation by Jointly Learning to Align and Translate

28 Experiment 3: Speech Recognition
An RNN alternative to the GMM-HMM pipeline
Two layers of 250 LSTM cells and a softmax layer
Training labels come from a GMM-HMM alignment
Always sampling yields a better result than linear-decay sampling

29 GMM-HMM intuition

30 RNN for Speech Recognition

31 Conclusion
The current approach to training these models, predicting one token at a time conditioned on the state and the previous correct token, is different from how we actually use them at inference, and is thus prone to the accumulation of errors along the decision path.
The paper proposes a curriculum learning approach that slowly changes the training objective from an easy task, where the previous token is known, to a realistic one, where it is provided by the model itself.
