Intro to NLP and Deep Learning


Intro to NLP and Deep Learning - 236605
Tutorial 8: Neural Machine Translation, Attention, Dropout Layer
Tomer Golany – tomer.golany@gmail.com

Attention - Introduction

Attention - Introduction A recent trend in deep learning is attention mechanisms (circa 2016). Attention mechanisms in neural networks are (very) loosely based on the visual attention mechanism found in humans: human visual attention essentially comes down to focusing on a certain region of an image in “high resolution” while perceiving the surrounding image in “low resolution”, and then adjusting the focal point over time.

What problem does Attention solve? To understand what attention can do for us, we will use Neural Machine Translation (NMT) as an example. In NMT, we map the meaning of a sentence into a fixed-length vector representation and then generate a translation based on that vector. Most NMT systems work by encoding the source sentence (e.g. a French sentence) into a vector using a Recurrent Neural Network, and then decoding an English sentence from that vector, also using an RNN. Implementation example in TensorFlow: https://www.tensorflow.org/versions/master/tutorials/seq2seq#sequence-to-sequence-models

Neural machine translation

Neural machine translation: see the encoder–decoder architecture drawn on the board.

Encoder: An encoder reads the input sentence x = (x_1, x_2, …, x_{T_x}) (a sequence of vectors) into a vector c (the context). The most common approach is to use an RNN, whose hidden state is updated as h_t = f(x_t, h_{t−1}) and whose context is c = q(h_1, …, h_{T_x}); in the simplest case, c is just the last hidden state h_{T_x}.
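A minimal sketch of such an encoder in Keras (all names and sizes here are illustrative assumptions, not from the tutorial):

from tensorflow.keras import Input, Model, layers

# Hypothetical sizes: source vocabulary of 5000 words, 64-dim embeddings, 128-dim state.
src = Input(shape=(None,))                    # variable-length source sentence (word ids)
emb = layers.Embedding(5000, 64)(src)         # x_1 .. x_Tx as dense vectors
# The GRU's final hidden state plays the role of the context c = q(h_1, ..., h_Tx).
_, context = layers.GRU(128, return_state=True)(emb)
encoder = Model(src, context)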

Decoder: The decoder is trained to predict the next word y_t given the context vector c and all the previously predicted words {y_1, …, y_{t−1}}. It defines a probability over the translation y by decomposing the joint probability into the ordered conditionals:

p(y) = ∏_{t=1}^{T_y} p(y_t | {y_1, …, y_{t−1}}, c)

Each conditional probability is modeled as:

p(y_t | {y_1, …, y_{t−1}}, c) = g(y_{t−1}, s_t, c)

where g is a nonlinear function producing a distribution over the target vocabulary and s_t is the hidden state of the decoder RNN.
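A matching decoder sketch (again illustrative; a real system would add teacher forcing at training time and greedy or beam-search decoding at test time):

from tensorflow.keras import Input, Model, layers

tgt = Input(shape=(None,))                    # previously predicted words y_1 .. y_{t-1}
ctx = Input(shape=(128,))                     # context vector c from the encoder
emb = layers.Embedding(5000, 64)(tgt)
# The decoder RNN state s_t is initialized from c; g(y_{t-1}, s_t, c) is realized
# by the GRU step followed by a softmax over the target vocabulary.
s = layers.GRU(128, return_sequences=True)(emb, initial_state=ctx)
probs = layers.Dense(5000, activation="softmax")(s)   # p(y_t | y_1..y_{t-1}, c)
decoder = Model([tgt, ctx], probs)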

Neural machine translation: We can see that the decoder is supposed to generate a translation solely from the last hidden state of the encoder. This vector must encode everything we need to know about the source sentence; h(T_x) must fully capture its meaning. In more technical terms, that vector is a sentence embedding. In fact, if you plot the embeddings of different sentences in a low-dimensional space, using PCA or t-SNE for dimensionality reduction, you can see that semantically similar phrases end up close to each other.
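For example, with scikit-learn (here embeddings is a hypothetical array of encoder context vectors, one per sentence):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.random.randn(100, 128)            # stand-in for 100 sentence embeddings
points = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1])           # similar sentences should cluster
plt.show()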

Neural machine translation: Still, it seems somewhat unreasonable to assume that we can encode all the information about a potentially very long sentence into a single vector and then have the decoder produce a good translation based on only that. Say your source sentence is 50 words long. The first word of the English translation is probably highly correlated with the first word of the source sentence, which means the decoder has to consider information from 50 steps ago, and that information needs to be somehow encoded in the vector. Vanishing gradients :(

Long-range dependencies: Different approaches to the problem of long-range dependencies: In theory, architectures like the LSTM should be able to deal with this, but in practice long-range dependencies are still problematic. It turns out that reversing the source sequence (feeding it backwards into the encoder) produces significantly better results, because it shortens the path from the decoder to the relevant parts of the encoder (see the snippet after this slide). Feeding a sequence twice also seems to help a network memorize things better. And, of course: attention mechanisms!
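Source reversal is a one-line preprocessing step, e.g.:

# Reverse each source sentence before encoding (Sutskever et al., 2014), so that
# the first source word ends up closest to the start of the target sentence.
src_tokens = ["je", "suis", "étudiant"]
reversed_src = src_tokens[::-1]               # ["étudiant", "suis", "je"]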

Attention mechanism

Attention mechanism: With an attention mechanism we no longer try to encode the full source sentence into a fixed-length vector. Instead, we allow the decoder to “attend” to different parts of the source sentence at each step of the output generation. Importantly, we let the model learn what to attend to based on the input sentence and what it has produced so far. The decoder would probably choose to attend to things roughly sequentially: attending to the first word when producing the first English word, and so on.

Attention mechanism: Unlike the existing encoder–decoder approach, here the probability is conditioned on a distinct context vector c_i for each target word y_i. The context vector c_i depends on a sequence of annotations (h_1, …, h_{T_x}) to which the encoder maps the input sentence; each annotation h_i contains information about the whole input sequence, with a strong focus on the parts surrounding the i-th word. Concretely:

c_i = Σ_{j=1}^{T_x} α_{ij} h_j,   α_{ij} = exp(e_{ij}) / Σ_{k=1}^{T_x} exp(e_{ik}),   e_{ij} = a(s_{i−1}, h_j)

The weight α_{ij} can be read as the probability that the target word y_i is translated from (aligned to) the source word x_j; the score e_{ij} measures how well the inputs around position j and the output at position i match.

Attention mechanism: The illustration (on the board) uses a bidirectional recurrent network as the encoder. The important part is that each decoder output word y_t now depends on a weighted combination of all the input states, not just the last one. The α's are weights that define how much of each input state should be considered for each output. So, if α_{3,2} is a large number, the decoder pays a lot of attention to the second state of the source sentence while producing the third word of the target sentence. (A small numpy sketch of one attention step follows.)
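A numpy sketch of one decoder step of this (Bahdanau-style) attention; the shapes and the small scoring network a(·) are illustrative assumptions:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(s_prev, H, Wa, Ua, va):
    """s_prev: previous decoder state, shape (d,); H: encoder annotations, shape (Tx, d)."""
    scores = np.tanh(s_prev @ Wa + H @ Ua) @ va   # e_ij = a(s_{i-1}, h_j), shape (Tx,)
    alpha = softmax(scores)                       # attention weights alpha_ij
    c_i = alpha @ H                               # context c_i = sum_j alpha_ij h_j
    return c_i, alpha

d, Tx = 128, 10
c_i, alpha = attention_step(np.zeros(d), np.random.randn(Tx, d),
                            np.random.randn(d, d), np.random.randn(d, d),
                            np.random.randn(d))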

Dropout layer

Dropout layer What is dropout in neural networks? “Dropout” refers to dropping out units (both hidden and visible) in a neural network, i.e. ignoring a randomly chosen set of neurons during the training phase. At each training stage, individual neurons are either dropped out of the net with probability 1−p or kept with probability p, leaving a reduced network; incoming and outgoing edges of a dropped-out node are also removed.

Why do we need dropout? To prevent overfitting. The fully connected layers occupy most of the parameters, so neurons develop co-dependencies amongst each other during training; this reduces the individual power of each neuron and leads to overfitting of the training data.

Dropout layer – Technical Details Dropout is an approach to regularization in neural networks which helps reduce interdependent learning amongst the neurons. Training phase: for each hidden layer, for each training sample, for each iteration, ignore (zero out) a random fraction 1−p of the nodes (and the corresponding activations).
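In numpy, a single training-phase pass through dropout might look like this (a sketch; p is the keep probability, as above):

import numpy as np

p = 0.8                                    # keep probability
h = np.random.randn(512)                   # activations of one hidden layer
mask = np.random.rand(h.shape[0]) < p      # True = keep, False = drop
h_train = h * mask                         # dropped units output exactly zero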

Dropout layer – Technical Details Testing phase: use all activations and all neurons, but reduce each activation by a factor p (to account for the missing activations during training). Some observations: Dropout forces a neural network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. Dropout roughly doubles the number of iterations required to converge; however, the training time per epoch is shorter. With H hidden units, each of which can be dropped, we have 2^H possible models; at test time the entire network is used, which approximates averaging over all of them.
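At test time there is no mask; every activation is simply scaled (continuing the sketch above):

h_test = h * p    # all units active, each scaled by the keep probability p

Note that modern frameworks such as Keras implement “inverted dropout” instead: activations are divided by p during training, so no scaling is needed at test time.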

Dropout layer – Experiments We built a deep network in Keras and validated it on the CIFAR-10 dataset. The network has three convolutional layers of size 64, 128 and 256, followed by two fully connected layers of size 512 and an output layer of size 10. We took ReLU as the activation function for the hidden layers and sigmoid for the output layer, and used the standard categorical cross-entropy loss. Finally, we applied dropout to all layers, increased the dropout fraction (the fraction of dropped units, i.e. 1−p) from 0.0 (no dropout at all) to 0.9 in steps of 0.1, and ran each configuration for 20 epochs.
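A sketch of such a network in Keras (kernel sizes, the optimizer, and the exact dropout placement are our assumptions; note that Keras's Dropout rate argument is the fraction of dropped units, i.e. 1−p in the earlier notation):

from tensorflow.keras import layers, models

def build_model(drop):
    m = models.Sequential([
        layers.Conv2D(64, 3, activation="relu", input_shape=(32, 32, 3)),
        layers.Dropout(drop),
        layers.Conv2D(128, 3, activation="relu"),
        layers.Dropout(drop),
        layers.Conv2D(256, 3, activation="relu"),
        layers.Dropout(drop),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dropout(drop),
        layers.Dense(512, activation="relu"),
        layers.Dropout(drop),
        layers.Dense(10, activation="sigmoid"),   # sigmoid output, as in the tutorial
    ])
    m.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
    return m

# Sweep the dropout fraction from 0.0 to 0.9 in steps of 0.1, 20 epochs each:
for drop in [i / 10 for i in range(10)]:
    model = build_model(drop)
    # model.fit(x_train, y_train, epochs=20, validation_data=(x_val, y_val))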

Dropout layer – Experiments Results: [plot of validation accuracy and loss as a function of the dropout fraction; see the conclusions below]

Dropout layer – Experiments Conclusions: as the dropout fraction increases, validation accuracy first rises and loss falls, before the trend reverses. There could be two reasons for the trend reversing beyond a dropout fraction of 0.2: 0.2 is the actual optimum for this dataset, network, and parameter settings; or more epochs are needed to train the networks with heavier dropout.