Visualizing and Understanding Recurrent Neural Networks

Presentation transcript:

Visualizing and Understanding Recurrent Neural Networks Presented By: Collin Watts Written By: Andrej Karpathy, Justin Johnson, Li Fei-Fei

Plan Of Attack What we’re going to cover: Overview Some Definitions Experimental Analysis Lots of Results The Implications of the Results Case Studies Meta-Analysis

So, what would you say you do here... This paper set out both to identify the most effective implementation of an RANN (we’ll get there) and to explain what internal mechanisms produce its results. Chose 3 different variants of RANNs: basic RANNs, LSTM RANNs, and GRU RANNs. Used character-level language modelling as the test problem, as it is apparently strongly representative of other analyses.

Definitions RECURRENT NEURAL NETWORK Subset of Artificial Neural Networks Still uses feedforward computation and backpropagation Allows nodes to form cycles, creating the potential for storage of information within the network Used in applications such as handwriting analysis, video analysis, translation, and other interpretation of various human tasks Difficult to train

Definitions RECURRENT NEURAL NETWORK (Cont.) Uses a two-dimensional node setup, with time as one axis and depth of the nodes as the other. Nodes are referred to as h_t^l, with l = 0 being the input nodes and l = L being the output nodes. Intermediate vectors are calculated as a function of both the previous time step and the previous layer. This results in the following recurrence:
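The recurrence image on the original slide did not survive the transcript; the version below is reconstructed from the Karpathy et al. paper's notation, where W^l is the weight matrix of layer l applied to the concatenation of the layer-below and previous-time-step hidden vectors:

h_t^l = \tanh\!\left( W^l \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^l \end{pmatrix} \right)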

More DEFINITIONS! LONG SHORT-TERM MEMORY VARIANT Variant of the RANN designed to mitigate problems with backpropagation within a RANN. Adds a memory vector to each node. Every time step, an LSTM can choose to read, write to, or reset the memory vector, following a series of gating mechanisms. Has the effect of preserving gradients across memory cells for long periods. i, f, and o are the gates controlling whether the memory cell is written to, reset, or read, respectively, while g provides the additive candidate values written into the cell. The gate equations are sketched below.
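A reconstruction of the LSTM update in the paper's notation (sigm is the sigmoid, \odot is elementwise multiplication); the slide's equation image is not reproduced here:

\begin{pmatrix} i \\ f \\ o \\ g \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} W^l \begin{pmatrix} h_t^{l-1} \\ h_{t-1}^l \end{pmatrix}, \qquad c_t^l = f \odot c_{t-1}^l + i \odot g, \qquad h_t^l = o \odot \tanh(c_t^l)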

HALF A DEFINITION... GATED RECURRENT UNIT Not well elaborated on in the paper... The given explanation is that “The GRU has the interpretation of computing a candidate hidden vector and then smoothly interpolating towards it, as gated by z.” My interpretation: rather than having explicit access-and-control gates, this follows a more analog approach. The update is sketched below.
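A reconstruction of the GRU update from the paper's notation (r is the reset gate, z the interpolation gate mentioned in the quote above):

r = \mathrm{sigm}(W_r^l h_t^{l-1} + U_r^l h_{t-1}^l), \qquad z = \mathrm{sigm}(W_z^l h_t^{l-1} + U_z^l h_{t-1}^l)
\tilde{h}_t^l = \tanh(W_x^l h_t^{l-1} + U_g^l (r \odot h_{t-1}^l)), \qquad h_t^l = (1 - z) \odot h_{t-1}^l + z \odot \tilde{h}_t^l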

Experimental analysis (Science!) As previously stated, the researchers used character-level language modelling as the basis of comparison. Trained each network to predict the next character in a sequence, using a Softmax classifier at each time step. Each character is encoded as a vector over the set of possible characters and fed to the network; the hidden vector in the last layer is mapped to one output per possible next character, representing the log probability of that character being the next one in the sequence. A minimal sketch of this setup follows.
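A minimal sketch of the character-level setup, assuming a toy vocabulary and a random hidden state standing in for whichever RANN variant is being tested (illustrative only, not the authors' code):

import numpy as np

text = "some training text"
vocab = sorted(set(text))                        # character vocabulary
char_to_ix = {ch: i for i, ch in enumerate(vocab)}
hidden_size = 8                                  # toy size for illustration

def one_hot(ch):
    # Encode one input character as a one-hot column vector.
    x = np.zeros((len(vocab), 1))
    x[char_to_ix[ch]] = 1.0
    return x

def next_char_log_probs(h_last, W_out):
    # Project the last layer's hidden vector to log probabilities
    # over every character in the vocabulary (log-softmax).
    logits = W_out @ h_last
    return logits - np.log(np.sum(np.exp(logits)))

# Toy usage: score the possible next characters from a random hidden state.
W_out = np.random.randn(len(vocab), hidden_size) * 0.01
log_probs = next_char_log_probs(np.random.randn(hidden_size, 1), W_out)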

Experimental analysis (Science!) Rejected the use of two other datasets (the Penn Treebank dataset and the Hutter Prize 100MB Wikipedia dataset) on the basis that they contain both standard English language and markup. Their stated reason was to use a controlled setting for all types of neural networks, rather than compete for best results on those datasets. Decided on Leo Tolstoy’s War and Peace, consisting of 3,258,246 characters, and the source code of the Linux kernel (files randomized and then concatenated into a single 6,206,996-character file).

Experimental analysis (Science!) War and Peace was split 80/10/10 for training/validation/testing. The Linux kernel was split 90/5/5 for training/validation/testing. Tested the following properties for each of the 3 RANNs: Number of layers (1, 2, or 3) Number of parameters (cell counts of 64, 128, 256, or 512)

Results (and the winner is...) Test set cross entropy loss:
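For reference (not stated on the slide): test-set cross-entropy loss is the average negative log probability the model assigns to the true next character, so lower is better:

\mathcal{L} = -\frac{1}{N} \sum_{t=1}^{N} \log p_t\!\left(x_{t+1}\right)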

Results (and the winner is...)

Results (and the winner is...)

Implications of results (But why...) The researchers paid attention to several characteristics beyond just the raw results of their findings. One of their stated goals was to arrive at why these emergent properties exist. Interpretable, long-range LSTM cells had been theorized to exist, but never proven. They proved them. Truncated backpropagation (used for performance gains as well as for combating overfitting) prevents learning dependencies more than X characters away, where X is the truncation depth of the backpropagation. These LSTM cells were able to overcome that limit while retaining performance and fitting characteristics. A sketch of the truncation scheme follows.
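A minimal sketch of what truncated backpropagation means in practice (illustrative only; the chunk length and carried hidden state are assumptions, not the authors' training code):

def truncated_bptt_chunks(sequence, trunc_len):
    # Split a long training sequence into chunks of at most trunc_len
    # characters; gradients are propagated only within a chunk, so
    # dependencies longer than trunc_len cannot be learned directly.
    for start in range(0, len(sequence) - 1, trunc_len):
        inputs = sequence[start:start + trunc_len]
        targets = sequence[start + 1:start + trunc_len + 1]
        yield inputs, targets

# Example: 100-character chunks; the hidden state would be carried across
# chunk boundaries but treated as a constant (no gradient) at each boundary.
for inputs, targets in truncated_bptt_chunks("the quick brown fox jumps over the lazy dog", 100):
    pass  # forward pass, loss, and backward pass are confined to this chunk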

Visualizations of results (But why...) Text color is a visualization of tanh(c) where -1 is red and +1 is blue.

Visualizations of results (But why...)

Visualizations of results (But why...)

Visualizations of results (But why...)

Implications of results (But why...) Also paid attention to gate activations (remember, the gates are what cause interactions with the memory cell) in LSTMs. Defined the ideas of “left saturated” and “right saturated”: a gate activation is left saturated if it is below 0.1 (gate essentially closed) and right saturated if it is above 0.9 (gate essentially open); the researchers then looked at the fraction of time each gate spends in each regime. Of particular note: there are forget gates that are right saturated nearly all the time (cells remembering their values for long periods), and no forget gates that are consistently left saturated (no cells acting purely feed-forward). Found that activations in the first layer are diffuse (unexplained by the researchers, but noted as very strange). A sketch of the saturation statistic follows.
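A minimal sketch of the saturation statistic, assuming gate_values holds one gate's activations over the time steps of the test text (thresholds of 0.1 and 0.9 as defined above):

import numpy as np

def saturation_fractions(gate_values, low=0.1, high=0.9):
    # Fraction of time steps on which this gate is left saturated
    # (activation < 0.1, essentially closed) or right saturated
    # (activation > 0.9, essentially open).
    gate_values = np.asarray(gate_values)
    return np.mean(gate_values < low), np.mean(gate_values > high)

# Example: a forget gate that is right saturated most of the time
# corresponds to a cell that remembers its value over long spans.
left_frac, right_frac = saturation_fractions([0.95, 0.98, 0.20, 0.97, 0.99])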

Visualizations of results (But why... LSTMs)

Visualizations of results (But why...GRUs)

Error Analysis of Results Compared against two standard n-gram models to gauge the LSTM’s effectiveness. An error was defined as the model assigning a probability of less than 0.5 to the character that actually came next. Found that while the models shared many of the same errors, there were distinct segments that each one failed on differently. A sketch of the error criterion follows.
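A minimal sketch of that error criterion (illustrative; probs is assumed to be the model's distribution over the vocabulary at a single time step):

import numpy as np

def is_error(probs, true_char_index, threshold=0.5):
    # The model errs at this position if it assigns the character that
    # actually came next a probability below 0.5.
    return np.asarray(probs)[true_char_index] < threshold

# Example: putting only 0.3 on the true next character counts as an error.
print(is_error([0.3, 0.6, 0.1], true_char_index=0))   # True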

Error analysis of results Linux Kernel War and Peace

Error analysis of results Found that the LSTM has significant advantages over standard n-gram models when computing the probability of special characters. In the Linux kernel model, brackets and whitespace are predicted significantly better than in the n-gram model, because of the LSTM’s ability to keep track of the relationship between opening and closing brackets. Similarly, in War and Peace, the LSTM was able to predict carriage returns more correctly, because that relationship lies outside the n-gram models’ effective range of relationship prediction.

Case study { Look, braces! } When it comes specifically to closing brackets (“}”) in the Linux kernel, the researchers were able to analyze the performance of the LSTM versus the n-gram models. Found that the LSTM did better than the n-gram models for distances of up to 60 characters. After that, the performance gains levelled off.

Meta-analysis (The good) The researchers were able to capture and elucidate their point very effectively via their visualizations and the implications they drew. They appear to have demonstrated several ideas about how RANNs work in data analysis that had, until now, only been theorized.

Meta-analysis (THE BAD) I would have appreciated a more in-depth explanation of why they rejected the standard ANN competition datasets. It would seem to follow that those would be a truer measure of capability, which is why they were chosen in the first place. There wasn’t a lot of explanation as to why the particular parameters were chosen for each RANN, or what the parameters for evaluation were. (What is test set cross-entropy loss?) Data was split differently across the two texts, so that the total counts for validation and testing were the same. I don’t see what this offers; if anything, you would want the training counts to be the same.

META-ANALYSIS (The ugly) This paper does not ease the reader into understanding the ideas involved. Required reading several additional papers to get the implications of things they assumed the reader knew. Some ideas were not clearly explained even after researching the related works.

Final slide Questions? Comments? Concerns? Corrections?