Deep Learning Basics and Intuition
Cyrus M. Vahid, Principal Solutions Architect @ AWS Deep Learning
cyrusmv@amazon.com
June 2017
Deductive Logic: "How often have I said to you that when you have eliminated the impossible, whatever remains, however improbable, must be the truth?" (https://en.wikiquote.org/wiki/Sherlock_Holmes)
Deductive Logic

P  Q | P ∧ Q  P ∨ Q
T  T |   T      T
T  F |   F      T
F  T |   F      T
F  F |   F      F

Inference examples:
P = T ∧ Q = T ∴ P ∧ Q = T
P ∧ Q ∴ P → Q;  ∼P ∴ P → Q

Modus ponens:
P → Q
P
_________
∴ Q
Complexities of Deductive Logic

Predicates (n)    # of rows in truth table (2^n)
1                 2
2                 4
3                 8
…                 …
10                1,024
20                1,048,576
30                1,073,741,824

Example: [Ω → (Ε ∧ Η) ∨ (Γ ∨ ~Γ)] ∧ (Φ ∧ ~Φ) ∴ F

- Useless for dealing with partial information.
- Useless for dealing with non-deterministic inference.
- We perform very poorly when computing complex logical statements intuitively.
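A small sketch of why the truth-table approach explodes: enumerating every assignment of n predicates yields 2^n rows. The use of itertools.product and the example tautology check are illustrative, not from the deck.

```python
from itertools import product

def truth_table(n):
    """All 2**n assignments of True/False to n predicates."""
    return list(product([True, False], repeat=n))

for n in (1, 2, 3, 10, 20):
    print(n, len(truth_table(n)))              # 2, 4, 8, 1024, 1048576 rows

# Checking a simple tautology, (P and Q) -> (P or Q), row by row:
for P, Q in truth_table(2):
    print(P, Q, (not (P and Q)) or (P or Q))   # True on every row
```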
http://bit.ly/2rwtb0o
Search Trees and Graphs http://aima.cs.berkeley.edu/
Neurobiology
Perceptron

f(xᵢ, wᵢ) = Φ(b + Σᵢ wᵢ·xᵢ)

Φ(x) = 1 if x ≥ 0.5; 0 if x < 0.5
Example of Perceptron

I1 = I2 = B1 = 1:        O1 = (1×1) + (1×1) + (−1.5) = 0.5   ∴ Φ(O1) = 1
I2 = 0; I1 = B1 = 1:     O1 = (1×1) + (0×1) + (−1.5) = −0.5  ∴ Φ(O1) = 0
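A minimal sketch of the perceptron above in plain Python; the weights (1, 1) and bias (−1.5) are taken from the worked example, and with the 0.5 threshold the unit behaves like a logical AND.

```python
def step(x, threshold=0.5):
    """Step activation: fire (1) when the weighted sum reaches the threshold."""
    return 1 if x >= threshold else 0

def perceptron(inputs, weights, bias):
    """f(x, w) = Phi(b + sum_i(w_i * x_i))"""
    return step(bias + sum(w * x for w, x in zip(weights, inputs)))

# Weights and bias from the worked example: the unit computes logical AND.
weights, bias = [1, 1], -1.5
for i1 in (0, 1):
    for i2 in (0, 1):
        print(i1, i2, "->", perceptron([i1, i2], weights, bias))
# 0 0 -> 0, 0 1 -> 0, 1 0 -> 0, 1 1 -> 1
```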
Linearity and Non-Linear Solutions

P  Q | P ∧ Q  P ⊕ Q
T  T |   T      F
T  F |   F      T
F  T |   F      T
F  F |   F      F

P ∧ Q is linearly separable: in the P-Q plane a single straight line separates the one true case from the three false cases, so a single perceptron can compute it. P ⊕ Q (XOR) is not linearly separable: no single straight line splits {(T, F), (F, T)} from {(T, T), (F, F)}, so a single-layer perceptron cannot compute it (see the sketch below).

[Figure: decision regions for P ∧ Q and P ⊕ Q plotted in the P-Q plane.]
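A minimal sketch of how stacking the same step units overcomes the XOR limitation: a hidden OR unit and a hidden NAND unit feed the AND unit from the earlier example. The hidden-layer weights are hand-picked for illustration, not learned.

```python
def unit(inputs, weights, bias, threshold=0.5):
    """Single perceptron unit with the same step activation as above."""
    s = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if s >= threshold else 0

def xor(p, q):
    # Layer 1: OR and NAND of the two inputs.
    h_or = unit([p, q], [1, 1], 0.0)
    h_nand = unit([p, q], [-1, -1], 2.0)
    # Layer 2: AND of the two hidden units (same weights/bias as the earlier example).
    return unit([h_or, h_nand], [1, 1], -1.5)

for p in (0, 1):
    for q in (0, 1):
        print(p, q, "->", xor(p, q))   # 0 0 -> 0, 0 1 -> 1, 1 0 -> 1, 1 1 -> 0
```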
Multi-Layer Perceptron

A feedforward neural network is a biologically inspired classification algorithm. It consists of a (possibly large) number of simple neuron-like processing units, organized in layers. Every unit in a layer is connected to all the units in the previous layer. Each connection has a weight that encodes the knowledge of the network. Data enters at the input layer and arrives at the output through the hidden layers. There is no feedback between the layers.

Fully-Connected Multi-Layer Feed-Forward Neural Network Topology
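A minimal NumPy sketch of a forward pass through such a fully connected feedforward network, assuming an illustrative 4-3-2 topology with sigmoid activations (the sizes and random weights are assumptions, not from the deck).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Illustrative 4-3-2 topology: weights and biases for each fully connected layer.
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)   # input (4) -> hidden (3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)   # hidden (3) -> output (2)

def forward(x):
    """Each layer computes Phi(W.x + b); the output of one layer feeds the next."""
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    return y

print(forward(np.array([1.0, 0.0, 0.5, -0.5])))
```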
Activation Function (Φ)
Training an ANN

A well-trained ANN has weights that amplify the signal and dampen the noise. A bigger weight signifies a tighter correlation between a signal and the network's outcome. For any learning algorithm that uses weights, learning is the process of re-adjusting the weights and biases. Back-propagation is a popular training algorithm; it distributes the blame for the error among the contributing weights.
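A minimal sketch of the weight re-adjustment idea, assuming a single linear neuron trained by gradient descent on a squared-error loss (a deliberately simplified stand-in for full back-propagation; the data and learning rate are illustrative assumptions).

```python
import numpy as np

# Toy data: targets generated by y = 2*x1 - 1*x2, the weights we hope to recover.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0])

w = np.zeros(2)   # initial weights
lr = 0.1          # learning rate: how far to move down the gradient

for epoch in range(50):
    y_hat = X @ w                 # forward pass of a single linear neuron
    error = y_hat - y
    grad = X.T @ error / len(X)   # gradient of the squared error w.r.t. the weights
    w -= lr * grad                # step against the gradient
print(w)                          # approaches [2.0, -1.0]
```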
Error Surface https://commons.wikimedia.org/wiki/File:Error_surface_of_a_linear_neuron_with_two_input_weights.png
Gradient Descent
SGD Visualization http://sebastianruder.com/optimizing-gradient-descent/
Learning Parameters

Learning Rate: a real number that decides how far to move in the direction of steepest descent.
Online Learning: weights are updated at each step. Often slow to learn.
Batch Learning: weights are updated after the whole of the training data. Often makes it hard to optimize.
Mini-Batch: a combination of both, where we break the training set into smaller batches and update the weights after each mini-batch (see the sketch below).
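A minimal sketch contrasting the three update schedules on the same toy linear neuron; the data, batch sizes, and learning rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(120, 2))
y = X @ np.array([2.0, -1.0])

def grad(w, Xb, yb):
    """Squared-error gradient for a single linear neuron on one (mini-)batch."""
    return Xb.T @ (Xb @ w - yb) / len(Xb)

def train(batch_size, lr=0.1, epochs=20):
    w = np.zeros(2)
    for _ in range(epochs):
        for start in range(0, len(X), batch_size):
            Xb, yb = X[start:start + batch_size], y[start:start + batch_size]
            w -= lr * grad(w, Xb, yb)   # one update per (mini-)batch
    return w

print(train(batch_size=1))        # online learning: update after every example
print(train(batch_size=len(X)))   # batch learning: one update per pass over the data
print(train(batch_size=32))       # mini-batch: the usual compromise
```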
Overview of Learning Rate
Data Encoding (Source: Alec Radford; https://www.tensorflow.org/get_started/mnist/beginners)
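The referenced MNIST tutorial one-hot encodes the digit labels; a minimal NumPy sketch of that encoding (the example labels are made up, the 10 classes match MNIST's digits).

```python
import numpy as np

def one_hot(labels, num_classes=10):
    """Encode integer class labels as one-hot row vectors."""
    encoded = np.zeros((len(labels), num_classes))
    encoded[np.arange(len(labels)), labels] = 1.0
    return encoded

print(one_hot(np.array([3, 0, 7])))
# Row 0 has a 1 in column 3, row 1 in column 0, row 2 in column 7.
```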
Why Apache MXNet?
- Most open: accepted into the Apache Incubator.
- Best on AWS: optimized for deep learning on AWS (integration with AWS).
Building a Fully Connected Network in Apache MXNet

Notes on convolutional layers:
A convolution is a mathematical operation describing a rule for how to merge two sets of information.
Filter Size: every filter is small spatially (along width and height). For example, a first convolutional layer may have a 5×5×3 filter: 5 pixels wide, 5 pixels tall, with a depth of 3 for the color channels of a 3-channel RGB input image.
Output Depth: we can manually pick the depth of the output volume. The depth hyperparameter controls the number of neurons in the convolutional layer that are connected to the same region of the input volume.
Edges and Activation: different neurons along the depth dimension learn to activate when stimulated with specific input data (e.g., color or edges). We call a set of neurons that all look at the same region of the input volume a depth column.
Stride: stride configures how far the sliding filter window moves per application of the filter function. Each time we apply the filter function to an input column we create a new depth column in the output volume. Lower stride values (e.g., 1, a single-unit step) allocate more depth columns in the output volume, yield more heavily overlapping receptive fields between the columns, and lead to larger output volumes. Higher stride values give less overlap and spatially smaller output volumes.
Zero-Padding: zero-padding allows us to control the spatial size of the output volume, for example when we want the output volume to keep the spatial size of the input volume.
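One way to tie the filter size, stride, and zero-padding notes together: the spatial output size of a convolution follows the standard formula (W − F + 2P) / S + 1. A small sketch, assuming a 32×32 input purely for illustration.

```python
def conv_output_size(input_size, filter_size, padding, stride):
    """Spatial output size of a convolution: (W - F + 2P) / S + 1."""
    return (input_size - filter_size + 2 * padding) // stride + 1

# 32x32 input (assumed for illustration), 5x5 filter, stride 1:
print(conv_output_size(32, 5, padding=0, stride=1))  # 28: the volume shrinks
print(conv_output_size(32, 5, padding=2, stride=1))  # 32: zero-padding preserves the size
```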
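Coming back to the fully connected network itself, here is a minimal sketch of how such a network might be declared with the 2017-era Apache MXNet Symbol API. The layer sizes (128, 64, 10), ReLU activations, and layer names are assumptions modelled on the standard MXNet MNIST tutorial rather than the exact code from the deck.

```python
import mxnet as mx

# Fully connected (multi-layer perceptron) network for 10-class classification.
data = mx.sym.Variable('data')
fc1 = mx.sym.FullyConnected(data=data, num_hidden=128, name='fc1')
act1 = mx.sym.Activation(data=fc1, act_type='relu', name='relu1')
fc2 = mx.sym.FullyConnected(data=act1, num_hidden=64, name='fc2')
act2 = mx.sym.Activation(data=fc2, act_type='relu', name='relu2')
fc3 = mx.sym.FullyConnected(data=act2, num_hidden=10, name='fc3')
mlp = mx.sym.SoftmaxOutput(data=fc3, name='softmax')
```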
Training a Fully Connected Network in Apache MXNet
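A minimal training sketch continuing from the symbol above, using the MXNet Module API. The random placeholder data, iterator setup, batch size, and hyperparameters are assumptions; the notebook in the demo below shows the real MNIST pipeline.

```python
import mxnet as mx
import numpy as np

# Placeholder data standing in for MNIST: 784-dim inputs, 10 classes (assumption).
X = np.random.rand(1000, 784).astype('float32')
y = np.random.randint(0, 10, size=1000)
train_iter = mx.io.NDArrayIter(X, y, batch_size=100, shuffle=True)

# 'mlp' is the SoftmaxOutput symbol defined in the previous sketch.
mod = mx.mod.Module(symbol=mlp, context=mx.cpu())
mod.fit(train_iter,
        optimizer='sgd',
        optimizer_params={'learning_rate': 0.1},
        eval_metric='acc',
        num_epoch=10)
```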
Demo Time http://localhost:9999/notebooks/mxnet-notebooks/python/tutorials/mnist.ipynb
References
- How ANN Predicts Stock Market Indices?, by Vidya Sagar Reddy Gopala and Deepak Ramu
- Deep Learning, O'Reilly Media Inc., ISBN 9781491914250, by Josh Patterson and Adam Gibson
- Artificial Intelligence for Humans, Volume 3: Deep Learning and Neural Networks, CreateSpace Independent Publishing Platform, ISBN 1505714346, by Jeff Heaton
- Coursera: Neural Networks for Machine Learning, by Geoffrey Hinton, University of Toronto
- https://www.researchgate.net/figure/267214531_fig1_Figure-1-RNN-unfolded-in-time-Adapted-from-Sutskever-2013-with-permission
- www.yann-lecun.com
- http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
- https://github.com/cyrusmvahid/mxnet-notebooks/tree/master/python
Cyrus M. Vahid cyrusmv@amazon.com