Code Completion with Neural Attention and Pointer Networks


Code Completion with Neural Attention and Pointer Networks Jian Li, Yue Wang, Irwin King, and Michael R. Lyu The Chinese University of Hong Kong Presented by Ondrej Skopek

Goal: Predict out-of-vocabulary words using local context (illustrative image) Credits: van Kooten, P. neural_complete. https://github.com/kootenpv/neural_complete. (2017).

Overview: pointer mixture networks (architecture diagram: a joint RNN with attention, and a pointer network). Credits: Li, J., Wang, Y., King, I. & Lyu, M. R. Code Completion with Neural Attention and Pointer Networks. (2017).

Outline Recurrent neural networks Attention Pointer networks Data representation Pointer mixture network Experimental evaluation Summary

Recurrent neural networks Credits: Olah, C. Understanding LSTM Networks. colah’s blog (2015).

Recurrent neural networks – unrolling Credits: Olah, C. Understanding LSTM Networks. colah’s blog (2015).

Long Short-term Memory (LSTM)
IDEA: overcome the vanishing/exploding gradient problem. Components: cell state, hidden state, and gates (forget gate, new memory generation / input gate, output gate).
Forget gate: the first step in an LSTM is to decide what information to throw away from the cell state. This decision is made by a sigmoid layer, the "forget gate layer": it looks at h_{t-1} and x_t and outputs a number between 0 and 1 for each entry of the cell state C_{t-1}; 1 means "completely keep this", 0 means "completely get rid of this".
New memory generation: the next step is to decide what new information to store in the cell state. This has two parts: a sigmoid layer, the "input gate layer", decides which values to update, and a tanh layer creates a vector of new candidate values, \tilde{C}_t, that could be added to the state. These two are then combined to produce the update of the state.
Output gate: finally, decide what to output. The output is a filtered version of the cell state: a sigmoid layer decides which parts of the cell state to output, and the cell state is passed through tanh (to push the values to between −1 and 1) and multiplied by the output of the sigmoid gate, so that only the selected parts are output.
Credits: Hochreiter, S. & Schmidhuber, J. Long Short-term Memory. Neural Computation 9, 1735–1780 (1997). Olah, C. Understanding LSTM Networks. colah’s blog (2015).
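As a concrete reference, here is a minimal NumPy sketch of these LSTM cell equations (the variable names and the stacked weight layout are illustrative, not taken from the paper or the cited slides):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_cell(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W maps [h_prev; x_t] to the four stacked gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    k = h_prev.shape[0]
    f = sigmoid(z[0:k])              # forget gate
    i = sigmoid(z[k:2*k])            # input gate
    c_tilde = np.tanh(z[2*k:3*k])    # candidate cell values (tanh layer)
    o = sigmoid(z[3*k:4*k])          # output gate
    c_t = f * c_prev + i * c_tilde   # updated cell state
    h_t = o * np.tanh(c_t)           # updated hidden state
    return h_t, c_t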

Recurrent neural networks – long-term dependencies Credits: Olah, C. Understanding LSTM Networks. colah’s blog (2015).

Attention
IDEA: choose which parts of the context to look at when predicting; overcome the hidden-state bottleneck.
A_j^i = v^T \tanh(W_h h_j + W_s s_i)
\alpha^i = \textrm{softmax}(A^i)
c_i = \sum_{j=1}^n \alpha_j^i h_j
Credits: Bahdanau, D., Cho, K. & Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. (2014).
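A minimal NumPy sketch of these three equations, where H holds the hidden states h_1..h_n row-wise and s_i is the current decoder state (the weight shapes are illustrative assumptions):

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(H, s_i, W_h, W_s, v):
    """H: (n, k) hidden states, s_i: (k,) decoder state; returns context c_i and weights alpha^i."""
    scores = np.tanh(H @ W_h.T + s_i @ W_s.T) @ v   # A_j^i for j = 1..n
    alpha = softmax(scores)                          # attention weights alpha^i
    c_i = alpha @ H                                  # context vector c_i
    return c_i, alpha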

Attention (cont.) Credits: QI, X. Seq2seq. https://xiandong79.github.io/seq2seq-基础知识. (2017). Bahdanau, D., Cho, K. & Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. (2014).

Pointer networks Credits: Vinyals, O., Fortunato, M. & Jaitly, N. Pointer Networks. (2015).

Pointer networks (cont.) Based on attention: a softmax over a dictionary of inputs. The output models a conditional distribution of the next output token; still a sequence-to-sequence model. C denotes the output sequence, \mathcal{P} the input sequence.
A_j^i = v^T \tanh(W_h h_j + W_s s_i)
p(C_i \mid C_1, \ldots, C_{i-1}, \mathcal{P}) = \textrm{softmax}(A^i)
Credits: Vinyals, O., Fortunato, M. & Jaitly, N. Pointer Networks. (2015). Bahdanau, D., Cho, K. & Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. (2014).
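The key difference from plain attention is that the softmaxed scores are themselves the output distribution, defined over input positions rather than over a fixed output vocabulary. A minimal self-contained sketch (names are illustrative):

import numpy as np

def pointer_distribution(H, s_i, W_h, W_s, v):
    """p(C_i | C_1..C_{i-1}, P) = softmax(A^i): a distribution over the input positions."""
    scores = np.tanh(H @ W_h.T + s_i @ W_s.T) @ v   # A^i, one score per input token
    e = np.exp(scores - scores.max())
    return e / e.sum()                               # softmax over input positions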

Outline Recurrent neural networks Attention Pointer networks Data representation Pointer mixture network Experimental evaluation Summary

Data representation Corpus of Abstract Syntax Trees (ASTs), parsed using a context-free grammar. Each node has a type and a value (type:value); non-leaf value: EMPTY, unknown value: UNK, end of program: EOF. Task: code completion, i.e. predict the "next" node, as two separate tasks (type prediction and value prediction). The trees are serialized so that sequential models can be used: in-order depth-first search + 2 bits of information on children/siblings (a sketch of such a traversal follows below). Task after serialization: given a sequence of words, predict the next one. Credits: Li, J., Wang, Y., King, I. & Lyu, M. R. Code Completion with Neural Attention and Pointer Networks. (2017).
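A minimal sketch of such a serialization, assuming a simple node class with a type, a value, and children (the exact flag encoding used in the paper may differ):

from dataclasses import dataclass, field

@dataclass
class Node:
    type: str
    value: str = "EMPTY"                          # non-leaf nodes carry the placeholder value
    children: list = field(default_factory=list)

def serialize(root):
    """Depth-first traversal producing (type, value, has_child, has_sibling) tokens."""
    tokens = []
    def visit(node, has_sibling):
        tokens.append((node.type, node.value, bool(node.children), has_sibling))
        for i, child in enumerate(node.children):
            visit(child, i < len(node.children) - 1)
    visit(root, False)
    return tokens

# Example: a tiny AST for `x = 1`
tree = Node("Assign", children=[Node("Name", "x"), Node("Num", "1")])
print(serialize(tree))
# [('Assign', 'EMPTY', True, False), ('Name', 'x', False, True), ('Num', '1', False, False)]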

Pointer mixture networks (overview diagram): recap of the already described parts, the joint RNN with attention and the pointer network; the missing piece is the component that combines them. Credits: Li, J., Wang, Y., King, I. & Lyu, M. R. Code Completion with Neural Attention and Pointer Networks. (2017).

RNN with adapted attention. Intermediate goal: produce two distributions at time t. RNN with attention (fixed unrolling); attention is adapted to a fixed window over the previous hidden states. L: input window size (L = 50), V: vocabulary size (differs per task), k: size of hidden state (k = 1500).
M_t = [h_{t-L}, \ldots, h_{t-1}] \in \mathbb{R}^{k\times L}
A_t = v^T \tanh\left(W_m M_t + (W_h h_t) 1_L^T\right)
\alpha_t = \textrm{softmax}(A_t)
c_t = M_t \alpha_t^T
Credits: Vinyals, O., Fortunato, M. & Jaitly, N. Pointer Networks. (2015). Bahdanau, D., Cho, K. & Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. (2014).
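A minimal NumPy sketch of this fixed-window attention (shapes follow the formulas above; the weights W_m, W_h and v are illustrative):

import numpy as np

def window_attention(M_t, h_t, W_m, W_h, v):
    """M_t: (k, L) matrix of the L previous hidden states, h_t: (k,) current hidden state."""
    k, L = M_t.shape
    scores = v @ np.tanh(W_m @ M_t + np.outer(W_h @ h_t, np.ones(L)))   # A_t, shape (L,)
    e = np.exp(scores - scores.max())
    alpha_t = e / e.sum()            # attention weights over the L previous positions
    c_t = M_t @ alpha_t              # context vector c_t, shape (k,)
    return c_t, alpha_t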

Attention & pointer components. Attention for the "decoder": condition on both the hidden state and the context vector. Pointer network: reuses the attention outputs, adapting attention without the use of a decoder. Credits: Vinyals, O., Fortunato, M. & Jaitly, N. Pointer Networks. (2015). Bahdanau, D., Cho, K. & Bengio, Y. Neural Machine Translation by Jointly Learning to Align and Translate. (2014).

Mixture component. Combine the two distributions into one: use [h_t; c_t] as the input to a single dense layer with a sigmoid output, giving a switching coefficient between 0 and 1. The output distribution is the concatenation of the two distributions, weighted by this coefficient and its complement, respectively. Credits: Li, J., Wang, Y., King, I. & Lyu, M. R. Code Completion with Neural Attention and Pointer Networks. (2017).
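A minimal sketch of this gating, assuming p_rnn is the RNN's softmax over the vocabulary (size V) and p_ptr is the pointer distribution over the L window positions (the weight names are illustrative):

import numpy as np

def mixture(h_t, c_t, p_rnn, p_ptr, w, b):
    """Gate between vocabulary (RNN) and window (pointer) predictions."""
    s_t = 1.0 / (1.0 + np.exp(-(w @ np.concatenate([h_t, c_t]) + b)))   # scalar gate in (0, 1)
    # Final distribution over V vocabulary words followed by L window positions.
    return np.concatenate([s_t * p_rnn, (1.0 - s_t) * p_ptr])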

Outline Recurrent neural networks Attention Pointer networks Data representation Pointer mixture network Experimental evaluation Summary

Experimental evaluation
Data: JavaScript and Python datasets (http://plml.ethz.ch). Each program is divided into segments of 50 consecutive tokens; the last segment is padded with EOF. AST data as described beforehand; type embedding of 300 dimensions, value embedding of 1200 dimensions. No unknown-word problem for types!
Model & training parameters: single-layer LSTM, unrolling length 50; hidden unit size 1500; forget gate biases initialized to 1; cross-entropy loss; Adam optimizer (learning rate 0.001 + decay); gradient clipping (L2 norm clipped to 5); batch size 128; 8 epochs; trainable initial states, initialized to 0; all other parameters ~ Unif([-0.05, 0.05]). During training, whenever the ground truth of a training query is UNK, the loss for that query is set to zero, so the model does not learn to predict UNK tokens. The pointer network is not used for predicting types (the type vocabulary is small, so it is not needed). A hedged sketch of this training setup follows below.
Credits: Li, J., Wang, Y., King, I. & Lyu, M. R. Code Completion with Neural Attention and Pointer Networks. (2017).
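A hedged sketch of how this training setup could look in PyTorch, using the slide's hyperparameters; the tiny model, the random stand-in data and the UNK index are placeholders, and the paper's actual implementation may well differ:

import torch
import torch.nn as nn
import torch.nn.functional as F

UNK_ID = 0                                   # assumed index of the UNK label
VOCAB, EMBED, HIDDEN = 1000, 1200, 1500      # illustrative vocabulary size; embedding/hidden sizes from the slide

class TinyLSTMLM(nn.Module):
    """Illustrative single-layer LSTM next-token model (not the paper's full pointer mixture network)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, EMBED)
        self.lstm = nn.LSTM(EMBED, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)
        for name, p in self.lstm.named_parameters():
            if "bias" in name:
                n = p.size(0) // 4
                p.data[n:2 * n].fill_(1.0)   # forget-gate biases initialized to 1

    def forward(self, x):
        h, _ = self.lstm(self.embed(x))
        return self.out(h[:, -1])            # predict the next token from the last hidden state

model = TinyLSTMLM()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)   # learning-rate decay omitted for brevity

x = torch.randint(0, VOCAB, (128, 50))       # a batch of 128 segments, 50 tokens each (random stand-in data)
y = torch.randint(0, VOCAB, (128,))          # next-token targets

for epoch in range(8):
    logits = model(x)
    loss = F.cross_entropy(logits, y, ignore_index=UNK_ID)   # loss is zeroed whenever the target is UNK
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)   # clip the gradient L2 norm to 5
    optimizer.step()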

Experimental evaluation (cont.)
Training conditions: the hidden state is reset to the trainable initial state only if the segment comes from a different program; otherwise the last hidden state is reused. If the label is UNK, the loss is set to 0 during training; during both training and test, an UNK prediction is counted as incorrect.
Labels: the vocabulary consists of the K most frequent words. If the target is in the vocabulary, it is labeled with its word ID; if it appears in the attention window, it is labeled with its last position in the window; otherwise it is labeled as UNK (a sketch of this labeling follows below).
Results (table on the slide): JS and PY are differently hard, consistent with other papers.
Credits: Li, J., Wang, Y., King, I. & Lyu, M. R. Code Completion with Neural Attention and Pointer Networks. (2017).
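A minimal sketch of that labeling scheme, assuming a word-to-ID vocabulary dict and the current attention window as a list of the last L tokens (names are illustrative):

def make_label(target, vocab, window):
    """vocab: dict word -> ID; window: the last L input tokens, oldest first."""
    if target in vocab:
        return ("word", vocab[target])                         # in vocabulary: use the word ID
    if target in window:
        last_pos = len(window) - 1 - window[::-1].index(target)
        return ("pointer", last_pos)                           # OoV but in window: point to its last occurrence
    return ("word", vocab["UNK"])                              # otherwise: UNK

# Example
vocab = {"UNK": 0, "def": 1, "return": 2}
window = ["def", "foo", "x", "foo", "return"]
print(make_label("foo", vocab, window))   # ('pointer', 3)
print(make_label("bar", vocab, window))   # ('word', 0)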

Comparison to other results. The vocabulary size is set to 50 000 to be comparable to Liu et al. (2016). Results (table on the slide); a big drawback of the comparison is that the authors reused the reported numbers rather than rerunning the baselines. The paper's vanilla LSTM considerably outperforms the work of Liu et al. (2016), who also apply a simple LSTM; the authors attribute this mainly to the different formulation of the loss function: Liu et al. define one loss for both type and value prediction and train a single model for both, whereas here one loss is defined per task and the two models are trained separately, which is much easier to train. Why does attention help? The average AST size is 1000 in JS and 600 in PY, so the dependencies are too long for a plain LSTM. The second table shows that the pointer network actually learns something meaningful. Credits: Li, J., Wang, Y., King, I. & Lyu, M. R. Code Completion with Neural Attention and Pointer Networks. (2017).

Example result: top-5 predictions for an OoV value. The vanilla LSTM just produces EMPTY, which is the most frequent token in the corpus. The attention-enhanced LSTM learns from the context that the target is likely to be UNK, but fails to produce the real value. The pointer mixture network successfully points to the OoV value in the context, as it observes that the value appears in the previous code. Credits: Li, J., Wang, Y., King, I. & Lyu, M. R. Code Completion with Neural Attention and Pointer Networks. (2017).

Summary: applied neural language models to code completion, demonstrated the effectiveness of the attention mechanism, and proposed a pointer mixture network to deal with out-of-vocabulary values.
Future work: incremental improvements (different cell types, more cell layers, etc.); encode more static type information; combine the two distributions in a different way (concatenating and normalizing would probably give too much weight to the local context, because L << K; an alternative is to concatenate and apply a dense layer with a softmax on top); use both backward and forward context to predict the given node; attempt to learn longer dependencies for out-of-vocabulary values (L > 50), since the current construction fails for tokens that do not appear in the past L values.
Credits: Li, J., Wang, Y., King, I. & Lyu, M. R. Code Completion with Neural Attention and Pointer Networks. (2017).

Thank you for your attention!