Final Presentation: Neural Network Doc Summarization

Final Presentation: Neural Network Doc Summarization CS4624 Multimedia, Hypertext, and Information Access Team: Junjie Cheng Instructor: Dr. Edward A. Fox Virginia Tech, Blacksburg VA 24061, Apr 30th, 2018

Outline Project Overview Data Preprocessing Model Architecture Training Model Performance References and Acknowledgements

Project Overview Purpose: generate summaries of long documents through deep learning. Model: sequence-to-sequence model with RNNs. Dataset: CNN/Daily Mail news.

Data Preprocessing Vocab size: 50,000 Input sequence max length: 400 Target sequence max length: 100 The vocabulary size is 50,000. The input sequence max length is 400 and the target sequence max length is 100. Processing long sequences is still challenging for deep learning, so I picked only short articles from the dataset. After preprocessing, the articles and the abstracts are converted to sequences of tokens, as sketched below.
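The exact preprocessing code is not shown in the slides; the sketch below is only a minimal illustration of these settings, assuming a simple frequency-based vocabulary with <PAD> and <UNK> tokens (names are mine).

from collections import Counter

VOCAB_SIZE = 50000      # keep the 50,000 most frequent words
MAX_SRC_LEN = 400       # input (article) max length
MAX_TGT_LEN = 100       # target (abstract) max length

def build_vocab(tokenized_articles):
    # count word frequencies over the whole corpus
    counts = Counter(tok for article in tokenized_articles for tok in article)
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for word, _ in counts.most_common(VOCAB_SIZE - len(vocab)):
        vocab[word] = len(vocab)
    return vocab

def encode(tokens, vocab, max_len):
    # truncate to max_len and map out-of-vocabulary words to <UNK>, then pad
    ids = [vocab.get(tok, vocab["<UNK>"]) for tok in tokens[:max_len]]
    return ids + [vocab["<PAD>"]] * (max_len - len(ids))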

Model Architecture Sequence to Sequence Model The sequence-to-sequence model is used for solving sequence problems. It takes a sequence as input and returns another sequence as output. The model contains an encoder and a decoder: the encoder converts the input sequence into a context vector, and the decoder takes the context vector as input and generates the output sequence. In this project, I used recurrent neural networks as both the encoder and the decoder. Next, I will introduce the technical details of the model.
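As a rough sketch of how the two parts interact (the function below and the special token ids are my own illustration, not the project's actual code), the decoder consumes the encoder's context one step at a time:

import torch

def greedy_summarize(encoder, decoder, article_ids, sos_id=2, eos_id=3, max_len=100):
    # Encode the article once, then decode greedily token by token.
    # In practice the encoder's final states may need to be projected
    # to the decoder's hidden size before being reused.
    context, state = encoder(article_ids)                 # article_ids: (1, src_len)
    prev = torch.tensor([[sos_id]])                       # start-of-sequence token
    summary = []
    for _ in range(max_len):
        log_probs, state = decoder(prev, state, context)  # (1, 1, vocab_size)
        prev = log_probs.argmax(dim=-1)                   # most likely next token
        if prev.item() == eos_id:
            break
        summary.append(prev.item())
    return summary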

Encoder Architecture Encoder Shared embedding layer Bidirectional LSTM layer The encoder has only two layers: a shared embedding layer and a bidirectional long short-term memory (LSTM) layer. The embedding layer converts each input token to a vector that represents the semantics of the token. The transformation is poor at the beginning of training, but the embedding layer learns from data; as more data are seen, the transformation becomes more accurate. Once the embedding layer is well trained, the relationship between two tokens can be computed as the distance between their vectors. Tokens with similar semantics should have similar distances; for example, the distance between "human" and "man" should be similar to the distance between "human" and "woman". The other layer is a bidirectional LSTM layer. LSTM is a kind of RNN that uses gates to control how much long-term and short-term memory to keep from the context. It is bidirectional because it processes the input sequence both forward and backward. In natural language, a word usually depends not only on the preceding context but also on the following one. A unidirectional RNN can only predict the next token from the preceding context, while a bidirectional LSTM can predict a token from the whole context.
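A minimal PyTorch sketch of this two-layer encoder (class and argument names are mine, sized with the numbers given later in the talk: 50,000-word vocabulary, 128-dimensional embeddings, 256 hidden units):

import torch.nn as nn

class Encoder(nn.Module):
    # Sketch only: shared embedding layer followed by a bidirectional LSTM layer.
    def __init__(self, vocab_size=50000, embed_size=128, hidden_size=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)   # shared with the decoder
        self.lstm = nn.LSTM(embed_size, hidden_size,
                            bidirectional=True, batch_first=True)

    def forward(self, src):                 # src: (batch, src_len) token ids
        embedded = self.embedding(src)      # (batch, src_len, embed_size)
        context, (h_n, c_n) = self.lstm(embedded)
        # context: hidden vector of every timestep (both directions);
        # h_n, c_n: last hidden vector and last LSTM cell state
        return context, (h_n, c_n)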

Encoder Workflow Embedding layer Embedded input sequence LSTM layer Context Last hidden vector Last LSTM cell state During training, the input sequence is first embedded, and the embedded sequence is then fed into the LSTM layer. The output of the LSTM layer contains a context, which includes the hidden vector of each timestep, as well as the last hidden vector and the last LSTM cell state. All of them are used by the decoder.
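Running the encoder sketched above on a dummy batch shows the three outputs this slide mentions (the batch of 3 articles of 400 tokens is an illustrative assumption):

import torch

encoder = Encoder()                          # Encoder from the sketch above
src = torch.randint(0, 50000, (3, 400))      # a batch of token ids
context, (h_n, c_n) = encoder(src)
print(context.shape)   # torch.Size([3, 400, 512]): hidden vectors of all timesteps
print(h_n.shape)       # torch.Size([2, 3, 256]): last hidden vector of each direction
print(c_n.shape)       # torch.Size([2, 3, 256]): last LSTM cell state of each direction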

Decoder Architecture Decoder Shared embedding layer LSTM layer MLP attention layer Dropout layer Out layer The decoder has five layers. The first one is the shared embedding layer. This layer is shared with the encoder because I want the tokens' semantics to be the same in the encoder and the decoder. The next layer is a unidirectional LSTM layer. Since the summary is generated from the start of a sentence, I do not use a bidirectional LSTM here, because it would also generate from the end of the sequence. The third layer is an MLP attention layer, which takes the context from the encoder LSTM and the context from the decoder LSTM and produces an attention-applied context. The fourth layer is a dropout layer, which drops part of the data in the context to prevent overfitting; an overfitting model performs well on the training dataset but produces poor output on other inputs. The last layer is the out layer, a linear transformation layer that transforms the hidden vector to the size of the vocabulary.
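A hypothetical sketch of this five-layer decoder (the exact attention formulation used in the project is not shown in the slides; the version below is a simple MLP/additive attention over the encoder context, and the dropout rate is assumed):

import torch
import torch.nn as nn
import torch.nn.functional as F

class Decoder(nn.Module):
    # Sketch only: shared embedding, unidirectional LSTM, MLP attention,
    # dropout, and an out layer projecting to the vocabulary size.
    def __init__(self, shared_embedding, vocab_size=50000,
                 embed_size=128, hidden_size=256, dropout=0.3):
        super().__init__()
        self.embedding = shared_embedding                    # same module as the encoder's
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        # the bidirectional encoder produces 2 * hidden_size features per position
        self.attn = nn.Linear(hidden_size + 2 * hidden_size, hidden_size)
        self.attn_score = nn.Linear(hidden_size, 1, bias=False)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(hidden_size + 2 * hidden_size, vocab_size)

    def forward(self, prev_tokens, state, enc_context):
        # prev_tokens: (batch, 1); enc_context: (batch, src_len, 2 * hidden_size)
        embedded = self.embedding(prev_tokens)               # (batch, 1, embed_size)
        dec_out, state = self.lstm(embedded, state)          # (batch, 1, hidden_size)
        # score every encoder position against the current decoder state
        query = dec_out.expand(-1, enc_context.size(1), -1)
        energy = torch.tanh(self.attn(torch.cat([query, enc_context], dim=-1)))
        weights = F.softmax(self.attn_score(energy), dim=1)  # (batch, src_len, 1)
        attended = (weights * enc_context).sum(dim=1, keepdim=True)
        combined = self.dropout(torch.cat([dec_out, attended], dim=-1))
        return F.log_softmax(self.out(combined), dim=-1), state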

Decoder Workflow Embedding layer Embedded input sequence LSTM layer Context Attention layer Attention-applied context Dropout layer Attention-applied context Out layer Context with vocab size Log softmax function Probability of each token in the vocab
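One decoding step through this pipeline, reusing the Decoder sketched above (the shapes and the <SOS> id are illustrative assumptions):

import torch
from torch import nn

decoder = Decoder(shared_embedding=nn.Embedding(50000, 128))
enc_context = torch.randn(3, 400, 512)                   # bidirectional encoder output
prev_tokens = torch.full((3, 1), 2, dtype=torch.long)    # assumed <SOS> id
log_probs, state = decoder(prev_tokens, None, enc_context)
print(log_probs.shape)  # torch.Size([3, 1, 50000]): log-probability of each token in the vocab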

Training Workflow Load data Train model Compute loss Back propagation Training includes four steps. First, the data loader loads a batch of sequences from the dataset. They are converted to a matrix and fed into the model. The output of the model is the generated summary, which is compared with the real summary by the criterion to compute the loss. Based on the loss value, the optimizer performs back propagation to improve the accuracy of the model. When the loop has iterated through the whole dataset, one epoch is completed. A minimal sketch of this loop follows.
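This sketch assumes a `model` that wraps the encoder and decoder and a `data_loader` that yields padded (article, summary) batches; both names are assumptions, not the project's code.

import torch
from torch import nn

criterion = nn.NLLLoss(ignore_index=0)          # 0 is assumed to be the <PAD> id
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)

for epoch in range(100):
    for src, tgt in data_loader:                # 1. load a batch of sequences
        optimizer.zero_grad()
        log_probs = model(src, tgt)             # 2. train model: (batch, tgt_len, vocab)
        loss = criterion(log_probs.reshape(-1, log_probs.size(-1)),
                         tgt.reshape(-1))       # 3. compute loss against the real summary
        loss.backward()                         # 4. back propagation
        optimizer.step()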

Training Architecture Optimizer: SGD Criterion: NLLLoss Batch size: 3 Epoch number: 100 Loss: 6.7 → 1.4 Learning rate: 1 Hidden size: 256 Word embedding size: 128 These are the parameters I used for the training phase. The optimizer is SGD and the criterion is NLLLoss. Due to the limited memory size, the batch size I used is 3. The dataset is trained for 100 epochs. By the end, the loss decreased from 6.7 to 1.4, which is low enough to generate reasonable sentences. The learning rate is 1. I also applied a dynamic learning rate: after every 20 epochs, the learning rate shrinks to one tenth of its previous value.
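The dynamic learning-rate schedule could be implemented roughly as below, assuming the `optimizer` from the training-loop sketch on the previous slide:

import torch

# shrink the learning rate to one tenth of its previous value every 20 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.1)

for epoch in range(100):
    # ... run one training epoch (see the loop on the previous slide) ...
    scheduler.step()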

Model Performance Generated summary: “have beaten three of their last three league games . the <UNK> scored in the second half of the last minute . the win takes all three points to move ahead of champions league place” Human-produced summary: “two goals from lionel messi help barcelona to a 3-1 win over almeria . kaka bags brace as real madrid coast to 3-0 victory at athletic bilbao . inter milan move up to second place in serie a with 2-0 win over chievo .” Because the Hackberry server has a really long queue, I have waited for 4 days to evaluate the model, and I am still waiting now. Therefore, I used my laptop to train on a very small dataset. After 100 epochs, the loss decreased to an extremely small value. When I gave the model the sentence, it returned the expected result.

Acknowledgements Client: Yufeng Ma Mr. Ma is a PhD student at Virginia Tech. He served as the client of this project and guided it through all project phases. My client, Yufeng Ma, is a great tutor; he taught me everything in this project, and it would have been impossible to complete it without his help.

Reference Gokumohandas. “Recurrent Neural Networks (RNN) – Part 3: Encoder-Decoder.” https://theneuralperspective.com/2016/11/20/recurrent-neural-networks-rnn-part-3-encoder-decoder/. Web. Accessed: March 26, 2018.