1 RNNs and Sequence to sequence models
Rohit Apte

2 Why do machine learning models learn?
Logistic regression comes from a family of models called Generalized Linear Models: $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$. The loss function is convex; e.g. for binary classification, $J(\theta) = -\sum_{i=1}^{n} \left[ y_i \log h_\theta(x_i) + (1 - y_i) \log\left(1 - h_\theta(x_i)\right) \right]$. The function is differentiable and has a global minimum, so we can find the optimal solution (lowest cost) through Gradient Descent or Stochastic Gradient Descent.
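As an illustration, here is a minimal NumPy sketch of logistic regression trained with batch gradient descent; the toy data, learning rate, and iteration count are made up for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """X: (n, d) features, y: (n,) labels in {0, 1}."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        h = sigmoid(X @ theta)          # model predictions h_theta(x)
        grad = X.T @ (h - y) / n        # gradient of the cross-entropy loss
        theta -= lr * grad              # gradient descent step
    return theta

# Tiny illustrative dataset: a bias column plus one feature that separates the classes.
X = np.array([[1.0, 0.2], [1.0, 0.4], [1.0, 1.6], [1.0, 1.8]])
y = np.array([0, 0, 1, 1])
theta = train_logistic_regression(X, y)
print(sigmoid(X @ theta))               # probabilities close to the labels
```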

3 Problem with Generalized Linear models?
It’s a linear classifier, and the dataset may not be linearly separable. SVMs can solve this using kernels. Neural networks can also do this by learning complex decision boundaries.

4 Vanilla neural network

5 Vanilla neural network (Cont.)
Activation functions introduce non-linearity – sigmoid, tanh, ReLU, leaky ReLU. Why do we need it? Because multiple linear transformations without a non-linearity in between can be collapsed into a single matrix, which is no more expressive than logistic regression. The loss function is not necessarily convex (or concave!), so training can get stuck in local minima. Initialization is important to avoid this – random, Xavier, etc.
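A small NumPy sketch of that collapse (the shapes are arbitrary): two linear layers with no activation between them are exactly one combined matrix, while adding a tanh breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # first layer weights
W2 = rng.standard_normal((2, 4))   # second layer weights
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x)         # "deep" network without non-linearity
one_layer = (W2 @ W1) @ x          # equivalent single matrix
print(np.allclose(two_layers, one_layer))   # True

# With a non-linearity in between, no single matrix reproduces the mapping.
with_activation = W2 @ np.tanh(W1 @ x)
```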

6 How do neural networks learn?
Forward propagation to calculate the model output. A loss function to calculate how far off we are. Back propagation to calculate derivatives. Gradient descent to adjust the weights. Repeat until convergence. Frameworks such as TensorFlow and PyTorch implement and parallelize these calculations.
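A minimal PyTorch sketch of that loop; the model, toy data, and hyperparameters below are placeholders, not from the slides.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

X = torch.randn(64, 10)                 # toy inputs
y = torch.randint(0, 2, (64,))          # toy labels

for epoch in range(100):                # "repeat until convergence"
    logits = model(X)                   # forward propagation
    loss = criterion(logits, y)         # loss: how far off we are
    optimizer.zero_grad()
    loss.backward()                     # back propagation computes gradients
    optimizer.step()                    # gradient descent adjusts the weights
```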

7 Challenges with vanilla neural nets
A vanilla network takes its input as a fixed-size vector to produce an output. Data may be sequential – for example text, speech, or time series – and the order of the sequence may hold clues to its interpretation. Mappings may not be one to one: one to many (text generation), many to one (sentiment analysis), many to many (language translation).

8 Recurrent neural networks
A basic RNN cell takes the previous hidden state and an input vector and returns an output vector and a new hidden state.
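A minimal sketch of such a cell in PyTorch; the sizes are illustrative, and torch.nn.RNNCell provides the same behaviour out of the box.

```python
import torch
import torch.nn as nn

class BasicRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.in2hid = nn.Linear(input_size, hidden_size)
        self.hid2hid = nn.Linear(hidden_size, hidden_size)

    def forward(self, x, h_prev):
        # The new hidden state mixes the input with the previous hidden state.
        h_new = torch.tanh(self.in2hid(x) + self.hid2hid(h_prev))
        return h_new, h_new          # output vector and new hidden state

cell = BasicRNNCell(input_size=8, hidden_size=16)
h = torch.zeros(1, 16)
for x_t in torch.randn(5, 1, 8):     # step through a length-5 sequence
    out, h = cell(x_t, h)
```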

9 Example of RNNs – movie sentiment

10 Challenge with vanilla RNNs
RNNs can't retain memory beyond the last few steps, and therefore perform poorly on long sequences. Back-propagating the gradient can lead to vanishing or exploding gradients. LSTMs and GRUs address this by holding additional information and regulating the flow of information. RNNs (all varieties) can be stateful or stateless; stateful means the state is remembered between batches. We will use the term RNN from this point on to refer to LSTMs and GRUs.
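A brief PyTorch sketch of the stateful vs. stateless distinction; the sizes and data are illustrative.

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(4, 20, 8)               # batch of 4 sequences, 20 steps each

# Stateless: every batch starts from a zero state (the default).
out, (h, c) = lstm(x)

# Stateful: detach and reuse the final state as the initial state of the next
# batch, so memory persists across batch boundaries.
next_batch = torch.randn(4, 20, 8)
out2, (h2, c2) = lstm(next_batch, (h.detach(), c.detach()))
```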

11 Example of LSTM on text generation
Using a character-level model, the (stateful) LSTM learns to predict the next character. Trained on a large corpus, it can learn the writing style.

12 Sequence to sequence models
Many-to-many mappings – language translation, speech recognition, etc. Can't use a regular RNN. The model consists of 3 parts: an encoder, an intermediate (thought/sense) vector – the final state of the encoder – and a decoder.
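A skeleton of those three parts in PyTorch; the vocabulary sizes, embedding dimensions, and the choice of a GRU are illustrative placeholders, not the slides' exact model.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_size, batch_first=True)

    def forward(self, src_ids):
        _, final_state = self.rnn(self.embed(src_ids))
        return final_state                    # the intermediate "thought" vector

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, tgt_ids, init_state):
        outputs, state = self.rnn(self.embed(tgt_ids), init_state)
        return self.out(outputs), state       # logits over the target vocabulary
```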

13

14 How to do this practically?
For TRAINING: Take a translation corpus (e.g. French to English sentences). Build a vocabulary for each language (create word2id and id2word dictionaries). Get the longest sentence in each language (for padding). Encode each sentence using word2id (add start and end tokens!). Pad the shorter sentences. We end up with a list of input sentences (French) and target sentences (English).
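A hedged preprocessing sketch of those steps; the special tokens, helper names, and toy corpus are assumptions made for the example.

```python
PAD, START, END = "<pad>", "<s>", "</s>"

def build_vocab(sentences):
    words = {w for s in sentences for w in s.split()}
    word2id = {w: i for i, w in enumerate([PAD, START, END] + sorted(words))}
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word

def encode(sentence, word2id, max_len):
    # Add start/end tokens, then pad up to the longest sentence length.
    ids = [word2id[START]] + [word2id[w] for w in sentence.split()] + [word2id[END]]
    return ids + [word2id[PAD]] * (max_len + 2 - len(ids))

french = ["il fait froid en janvier", "il fait chaud en juillet"]
english = ["it is cold in january", "it is hot in july"]

fr_w2id, fr_id2w = build_vocab(french)
en_w2id, en_id2w = build_vocab(english)
max_fr = max(len(s.split()) for s in french)
max_en = max(len(s.split()) for s in english)

inputs = [encode(s, fr_w2id, max_fr) for s in french]     # French id sequences
targets = [encode(s, en_w2id, max_en) for s in english]   # English id sequences
```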

15 How to do this practically? (cont.)
Convert words to higher-dimensional vectors: one-hot encoding (if the vocabulary is small) or dense embeddings – pretrained (Word2vec, GloVe, fastText) if the corpus is small, or learned as part of the model training process if the corpus is large enough. Run the input sentences through the encoder and get its final state. Run the target sentences as inputs through the decoder (using the final state of the encoder as its initial state), predicting the next word in the sentence.
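A training-step sketch with teacher forcing, reusing the Encoder and Decoder classes from the slide 12 sketch above; the batch contents, sequence lengths, and padding id are illustrative (the vocabulary sizes are the ones reported on slide 22).

```python
import torch
import torch.nn as nn

encoder = Encoder(vocab_size=334)           # French vocabulary
decoder = Decoder(vocab_size=228)           # English vocabulary
criterion = nn.CrossEntropyLoss(ignore_index=0)            # assume id 0 = padding
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

src = torch.randint(1, 334, (32, 12))       # padded French id sequences
tgt = torch.randint(1, 228, (32, 14))       # padded English id sequences with <s> ... </s>

thought = encoder(src)                                     # final encoder state
logits, _ = decoder(tgt[:, :-1], thought)                  # feed <s> w1 ... w_{n-1}
loss = criterion(logits.reshape(-1, 228), tgt[:, 1:].reshape(-1))   # predict w1 ... </s>
optimizer.zero_grad()
loss.backward()
optimizer.step()
```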

16 How to do this practically? (cont.)
For INFERENCE: Given an input sentence (French), encode it to word ids. The output sentence (English) starts with only the start token. Convert to word/one-hot embeddings. Run the input sentence through the encoder and get its final state. Run the output sentence (the start token) through the decoder, predicting the next word in the sentence (the highest-probability softmax output), and feed each prediction back in until the end token is produced.
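A greedy-decoding sketch for inference, again assuming the Encoder and Decoder classes defined above; start_id, end_id, and max_len are illustrative assumptions.

```python
import torch

def greedy_translate(encoder, decoder, src_ids, start_id, end_id, max_len=20):
    with torch.no_grad():
        state = encoder(src_ids)                       # final encoder state
        token = torch.tensor([[start_id]])             # decoder starts with <s>
        result = []
        for _ in range(max_len):
            logits, state = decoder(token, state)
            token = logits[:, -1].argmax(dim=-1, keepdim=True)   # highest softmax probability
            if token.item() == end_id:
                break
            result.append(token.item())
        return result                                  # predicted target word ids
```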

17

18

19 Attention There is a problem with the sequence-to-sequence approach – the last state has to encapsulate ALL the information. This is an information bottleneck. Attention solves this problem – the core idea is that at each step of the decoder, we focus on a particular part of the source sentence. In that way it's similar to human visual attention (imagine watching a tennis match: you are focused on the ball going back and forth while still looking at each player's form, and the game in general).
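A hedged sketch of one common formulation (dot-product, Luong-style attention); the slides do not specify the exact scoring function, so this is just one reasonable instance. At each decoder step the decoder state is scored against every encoder output, and the softmax-weighted sum of encoder outputs becomes the context vector.

```python
import torch
import torch.nn.functional as F

def attention(decoder_state, encoder_outputs):
    # decoder_state: (batch, hidden), encoder_outputs: (batch, src_len, hidden)
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(2)).squeeze(2)   # (batch, src_len)
    weights = F.softmax(scores, dim=1)             # how much to focus on each source word
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)        # (batch, hidden)
    return context, weights                        # the weights are what gets visualized later

# Illustrative shapes: batch of 2, source length 7, hidden size 128.
ctx, w = attention(torch.randn(2, 128), torch.randn(2, 7, 128))
```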

20

21 Attention significantly improves NMT
Attention allows the decoder to focus on parts of the input sentence. It solves the issue of the final state having to encapsulate all the information, and it can look back at far-away states in long sentences. Attention does come at a computational cost, but gives us better overall results.

22 Experiment NMT requires a LOT of data and can take time to train.
French – English dataset from Udacity: roughly 140k sentences of a certain structure, split into train and test sets. Model parameters: hidden size 128, optimizer Adam, loss cross-entropy, French vocab 334, English vocab 228, num_epochs 10.

23 Results Encoder-decoder model achieved log loss of 0.05.
The encoder-decoder attention model achieved a log loss of 0.01. A more informative score is BLEU – Bilingual Evaluation Understudy – a score between 0 and 1 (1 indicates the translation is more similar to the reference text). For each sentence in the test set, translate it and compute the BLEU score, then take the average BLEU score for both the standard model and the attention-based one. The standard model had a BLEU score of 0.144; the attention-based model had a BLEU score of 0.539.
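A sketch of that evaluation loop using NLTK's sentence-level BLEU; the translate callable and the smoothing choice are assumptions (corpus_bleu is another common option).

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def average_bleu(test_pairs, translate):
    smooth = SmoothingFunction().method1          # avoids zero scores on short sentences
    scores = []
    for french, english in test_pairs:
        hypothesis = translate(french)            # list of predicted English tokens
        reference = [english.split()]             # BLEU expects a list of references
        scores.append(sentence_bleu(reference, hypothesis, smoothing_function=smooth))
    return sum(scores) / len(scores)
```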

24 Test set examples
French (original): les états-unis est parfois calme en février , et il est parfois chaud à l' automne .
English (original): the united states is sometimes quiet during february , and it is sometimes hot in autumn .
Encoder-Decoder: the united states is sometimes freezing during february and it is sometimes dry in november .
Attention based: the united states is sometimes quiet during february and it is sometimes warm in autumn .

French (original): il pourrait aller en france l' été prochain .
English (original): he might go to france next summer .
Encoder-Decoder: paris is usually warm during summer and it is never chilly in january .
Attention based: did he might go to california .

French (original): notre moins aimé fruit est la pomme , mais mon moins aimé est l'orange .
English (original): our least liked fruit is the apple , but my least liked is the orange .
Encoder-Decoder: her least liked fruit fruit the apple but our most loved is the grapefruit .
Attention based: our least liked fruit is the apple but my least liked is the orange .

French (original): le raisin est mon fruit préféré , mais la pomme est son favori .
English (original): the grape is my favorite fruit , but the apple is his .
Encoder-Decoder: the grape is my least favorite fruit but the pear is your least .
Attention based: my favorite fruit is my my favorite but the apple is his favorite .

25 Attention visualization

26 Beam Search Taking the highest-probability softmax output at each step is greedy, and may not give us optimal results.
For the prediction, instead of taking the highest probability of the softmax for the output, we can use beam search. The idea is to keep a limited set of candidates (the top m probabilities) at each step, expand these to a certain depth (n), and take the sequence with the highest combined probability (multiplying the probabilities of each step). This is still a heuristic (i.e. not guaranteed to find the optimal solution), but it is better than just taking the best probability at each step.
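A compact beam-search sketch over per-step log-probabilities; step_fn is a hypothetical callable that, given a prefix of token ids, returns a 1-D torch tensor of log-probabilities for the next token.

```python
def beam_search(step_fn, start_id, end_id, beam_width=3, max_len=20):
    beams = [(0.0, [start_id])]                   # (accumulated log-prob, token ids)
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == end_id:                 # finished beams carry over unchanged
                candidates.append((logp, seq))
                continue
            top_lp, top_id = step_fn(seq).topk(beam_width)
            for lp, tok in zip(top_lp.tolist(), top_id.tolist()):
                # Summing log-probabilities multiplies the per-step probabilities.
                candidates.append((logp + lp, seq + [tok]))
        beams = sorted(candidates, key=lambda b: b[0], reverse=True)[:beam_width]
    return max(beams, key=lambda b: b[0])[1]      # best-scoring candidate sequence
```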

27

28 Practical challenges of NMT
For a large corpus, we need a deep network, which can take time to train. The output of the final decoder RNN layer is a vector whose dimension is the vocabulary size of the output text. We apply softmax to this vector to convert it to probabilities: $p_i = \frac{e^{u_i}}{\sum_j e^{u_j}}$. For a large vocabulary this is expensive! A Wikipedia corpus has over 2M words. There are ways to reduce the computational complexity (hierarchical softmax, differentiated softmax).
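A small, numerically stable softmax sketch in NumPy; the point is that the cost is linear in the vocabulary size, which is why very large vocabularies become expensive.

```python
import numpy as np

def softmax(u):
    u = u - u.max()                   # subtract the max for numerical stability
    exp_u = np.exp(u)
    return exp_u / exp_u.sum()

logits = np.random.randn(2_000_000)   # e.g. a ~2M-word vocabulary
probs = softmax(logits)               # one exponential plus a sum over every word
```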

