Intro to NLP and Deep Learning


Intro to NLP and Deep Learning - 236605
Tutorial 8: Neural Machine Translation, Attention, Dropout Layer
Tomer Golany – tomer.golany@gmail.com

Attention - Introduction

Attention - Introduction A recent trend in deep learning is attention mechanisms (circa 2016). Attention mechanisms in neural networks are (very) loosely based on the visual attention mechanism found in humans: human visual attention essentially comes down to focusing on a certain region of an image in “high resolution” while perceiving the surrounding image in “low resolution”, and then adjusting the focal point over time.

What problem does Attention solve? To understand what attention can do for us, we will use Neural Machine Translation (NMT) as an example. In NMT, we map the meaning of a sentence into a fixed-length vector representation and then generate a translation based on that vector. Most NMT systems work by encoding the source sentence (e.g. a French sentence) into a vector using a Recurrent Neural Network, and then decoding an English sentence from that vector, also using an RNN. Implementation example in TensorFlow: https://www.tensorflow.org/versions/master/tutorials/seq2seq#sequence-to-sequence-models

Neural machine translation

Neural machine translation: see the encoder–decoder architecture drawn on the board.

Encoder: An encoder reads the input sentence x = (x_1, x_2, …, x_{T_x}) (a sequence of vectors) into a vector c (the context). The most common approach is to use an RNN, whose hidden state is updated as h_t = f(x_t, h_{t−1}) and whose context is c = q(h_1, …, h_{T_x}); in the simplest case, c is just the last hidden state h_{T_x}.
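A minimal sketch of such an encoder in Keras (all names and sizes here are illustrative assumptions, not from the tutorial):

from tensorflow.keras import Input, Model, layers

# Hypothetical sizes: source vocabulary of 5000 words, 64-dim embeddings, 128-dim state.
src = Input(shape=(None,))                    # variable-length source sentence (word ids)
emb = layers.Embedding(5000, 64)(src)         # x_1 .. x_Tx as dense vectors
# The GRU's final hidden state plays the role of the context c = q(h_1, ..., h_Tx).
_, context = layers.GRU(128, return_state=True)(emb)
encoder = Model(src, context)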

Decoder: The decoder is trained to predict the next word y_t given the context vector c and all the previously predicted words {y_1, …, y_{t−1}}. It defines a probability over the translation y by decomposing the joint probability into the ordered conditionals:

p(y) = ∏_{t=1}^{T_y} p(y_t | {y_1, …, y_{t−1}}, c)

Each conditional probability is modeled as:

p(y_t | {y_1, …, y_{t−1}}, c) = g(y_{t−1}, s_t, c)

where g is a nonlinear function producing a distribution over the target vocabulary and s_t is the hidden state of the decoder RNN.
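A matching decoder sketch (again illustrative; a real system would add teacher forcing at training time and greedy or beam-search decoding at test time):

from tensorflow.keras import Input, Model, layers

tgt = Input(shape=(None,))                    # previously predicted words y_1 .. y_{t-1}
ctx = Input(shape=(128,))                     # context vector c from the encoder
emb = layers.Embedding(5000, 64)(tgt)
# The decoder RNN state s_t is initialized from c; g(y_{t-1}, s_t, c) is realized
# by the GRU step followed by a softmax over the target vocabulary.
s = layers.GRU(128, return_sequences=True)(emb, initial_state=ctx)
probs = layers.Dense(5000, activation="softmax")(s)   # p(y_t | y_1..y_{t-1}, c)
decoder = Model([tgt, ctx], probs)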

Neural machine translation: We can see that the decoder is supposed to generate a translation solely from the last hidden state of the encoder. This vector must encode everything we need to know about the source sentence; h(T_x) must fully capture its meaning. In more technical terms, that vector is a sentence embedding. In fact, if you plot the embeddings of different sentences in a low-dimensional space, using PCA or t-SNE for dimensionality reduction, you can see that semantically similar phrases end up close to each other.
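For example, with scikit-learn (here embeddings is a hypothetical array of encoder context vectors, one per sentence):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.random.randn(100, 128)            # stand-in for 100 sentence embeddings
points = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
plt.scatter(points[:, 0], points[:, 1])           # similar sentences should cluster
plt.show()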

Neural machine translation: Still, it seems somewhat unreasonable to assume that we can encode all the information about a potentially very long sentence into a single vector and then have the decoder produce a good translation based on only that. Say your source sentence is 50 words long. The first word of the English translation is probably highly correlated with the first word of the source sentence, which means the decoder has to consider information from 50 steps ago, and that information needs to be somehow encoded in the vector. Vanishing gradients :(

Long-range dependencies: Different approaches to the problem of long-range dependencies: In theory, architectures like the LSTM should be able to deal with this, but in practice long-range dependencies are still problematic. It turns out that reversing the source sequence (feeding it backwards into the encoder) produces significantly better results, because it shortens the path from the decoder to the relevant parts of the encoder (see the snippet after this slide). Feeding a sequence twice also seems to help a network memorize things better. And, of course: attention mechanisms!
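Source reversal is a one-line preprocessing step, e.g.:

# Reverse each source sentence before encoding (Sutskever et al., 2014), so that
# the first source word ends up closest to the start of the target sentence.
src_tokens = ["je", "suis", "étudiant"]
reversed_src = src_tokens[::-1]               # ["étudiant", "suis", "je"]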

Attention mechanism

Attention mechanism: With an attention mechanism we no longer try to encode the full source sentence into a fixed-length vector. Instead, we allow the decoder to “attend” to different parts of the source sentence at each step of the output generation. Importantly, we let the model learn what to attend to based on the input sentence and what it has produced so far. The decoder would probably choose to attend to things roughly sequentially: attending to the first word when producing the first English word, and so on.

Attention mechanism: Unlike the existing encoder–decoder approach, here the probability is conditioned on a distinct context vector c_i for each target word y_i. The context vector c_i depends on a sequence of annotations (h_1, …, h_{T_x}) to which the encoder maps the input sentence; each annotation h_i contains information about the whole input sequence, with a strong focus on the parts surrounding the i-th word. Concretely:

c_i = Σ_{j=1}^{T_x} α_{ij} h_j,   α_{ij} = exp(e_{ij}) / Σ_{k=1}^{T_x} exp(e_{ik}),   e_{ij} = a(s_{i−1}, h_j)

The weight α_{ij} can be read as the probability that the target word y_i is translated from (aligned to) the source word x_j; the score e_{ij} measures how well the inputs around position j and the output at position i match.

Attention mechanism: The illustration (on the board) uses a bidirectional recurrent network as the encoder. The important part is that each decoder output word y_t now depends on a weighted combination of all the input states, not just the last one. The α's are weights that define how much of each input state should be considered for each output. So, if α_{3,2} is a large number, the decoder pays a lot of attention to the second state of the source sentence while producing the third word of the target sentence. (A small numpy sketch of one attention step follows.)
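A numpy sketch of one decoder step of this (Bahdanau-style) attention; the shapes and the small scoring network a(·) are illustrative assumptions:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(s_prev, H, Wa, Ua, va):
    """s_prev: previous decoder state, shape (d,); H: encoder annotations, shape (Tx, d)."""
    scores = np.tanh(s_prev @ Wa + H @ Ua) @ va   # e_ij = a(s_{i-1}, h_j), shape (Tx,)
    alpha = softmax(scores)                       # attention weights alpha_ij
    c_i = alpha @ H                               # context c_i = sum_j alpha_ij h_j
    return c_i, alpha

d, Tx = 128, 10
c_i, alpha = attention_step(np.zeros(d), np.random.randn(Tx, d),
                            np.random.randn(d, d), np.random.randn(d, d),
                            np.random.randn(d))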

Dropout layer

Dropout layer What is dropout in neural networks? “Dropout” refers to dropping out units (both hidden and visible) in a neural network, i.e. ignoring a randomly chosen set of neurons during the training phase. At each training stage, individual neurons are either dropped out of the net with probability 1−p or kept with probability p, leaving a reduced network; incoming and outgoing edges of a dropped-out node are also removed.

Why do we need dropout? To prevent overfitting. The fully connected layers occupy most of the parameters, so neurons develop co-dependencies amongst each other during training; this reduces the individual power of each neuron and leads to overfitting of the training data.

Dropout layer – Technical Details Dropout is an approach to regularization in neural networks which helps reduce interdependent learning amongst the neurons. Training phase: for each hidden layer, for each training sample, for each iteration, ignore (zero out) a random fraction 1−p of the nodes (and the corresponding activations).
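In numpy, a single training-phase pass through dropout might look like this (a sketch; p is the keep probability, as above):

import numpy as np

p = 0.8                                    # keep probability
h = np.random.randn(512)                   # activations of one hidden layer
mask = np.random.rand(h.shape[0]) < p      # True = keep, False = drop
h_train = h * mask                         # dropped units output exactly zero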

Dropout layer – Technical Details Testing phase: use all activations and all neurons, but reduce each activation by a factor p (to account for the missing activations during training). Some observations: Dropout forces a neural network to learn more robust features that are useful in conjunction with many different random subsets of the other neurons. Dropout roughly doubles the number of iterations required to converge; however, the training time per epoch is shorter. With H hidden units, each of which can be dropped, we have 2^H possible models; at test time the entire network is used, which approximates averaging over all of them.
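At test time there is no mask; every activation is simply scaled (continuing the sketch above):

h_test = h * p    # all units active, each scaled by the keep probability p

Note that modern frameworks such as Keras implement “inverted dropout” instead: activations are divided by p during training, so no scaling is needed at test time.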

Dropout layer – Experiments We built a deep network in Keras and validated it on the CIFAR-10 dataset. The network has three convolutional layers of size 64, 128 and 256, followed by two fully connected layers of size 512 and an output layer of size 10. We took ReLU as the activation function for the hidden layers and sigmoid for the output layer, and used the standard categorical cross-entropy loss. Finally, we applied dropout to all layers, increased the dropout fraction (the fraction of dropped units, i.e. 1−p) from 0.0 (no dropout at all) to 0.9 in steps of 0.1, and ran each configuration for 20 epochs.
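A sketch of such a network in Keras (kernel sizes, the optimizer, and the exact dropout placement are our assumptions; note that Keras's Dropout rate argument is the fraction of dropped units, i.e. 1−p in the earlier notation):

from tensorflow.keras import layers, models

def build_model(drop):
    m = models.Sequential([
        layers.Conv2D(64, 3, activation="relu", input_shape=(32, 32, 3)),
        layers.Dropout(drop),
        layers.Conv2D(128, 3, activation="relu"),
        layers.Dropout(drop),
        layers.Conv2D(256, 3, activation="relu"),
        layers.Dropout(drop),
        layers.Flatten(),
        layers.Dense(512, activation="relu"),
        layers.Dropout(drop),
        layers.Dense(512, activation="relu"),
        layers.Dropout(drop),
        layers.Dense(10, activation="sigmoid"),   # sigmoid output, as in the tutorial
    ])
    m.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
    return m

# Sweep the dropout fraction from 0.0 to 0.9 in steps of 0.1, 20 epochs each:
for drop in [i / 10 for i in range(10)]:
    model = build_model(drop)
    # model.fit(x_train, y_train, epochs=20, validation_data=(x_val, y_val))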

Dropout layer – Experiments Results: [plot of validation accuracy and loss as a function of the dropout fraction; see the conclusions below]

Dropout layer – Experiments Conclusions: as the dropout fraction increases, validation accuracy first rises and loss falls, before the trend reverses. There could be two reasons for the trend reversing beyond a dropout fraction of 0.2: 0.2 is the actual optimum for this dataset, network, and parameter settings; or more epochs are needed to train the networks with heavier dropout.