Deep Learning: A Tutorial for Dummies
Param Vir Singh, Shunyuan Zhang, Nikhil Malik
Dokyun Lee: The Deep Learner http://leedokyun.com/deep-learning-reading-list.html
“Deep Learning doesn’t do different things, it does things differently.”
Performance vs. Sample Size (plot: Performance on the y-axis against Size of Data on the x-axis, comparing deep learning algorithms with traditional ML algorithms).
Outline
- Supervised Learning: Convolutional Neural Networks; Sequence Modelling: RNN and its extensions
- Unsupervised Learning: Autoencoder; Stacked Denoising Autoencoder
- Unsupervised (+ Supervised) Learning: Generative Adversarial Networks
- Reinforcement Learning: Deep Reinforcement Learning
GAN example: producing poetry like Shakespeare. A generator network outputs generated Shakespeare poetry; a discriminator network takes generated (fake) and real Shakespeare poetry as input and classifies each as fake or real.
Supervised Learning
Traditional pattern recognition models work with hand-crafted features and relatively simple trainable classifiers: Input -> Extract Hand-Crafted Features -> Trainable Classifier (e.g., SVM, Random Forest) -> Output (e.g., Outdoor: Yes or No).
Limitations:
- Very tedious and costly to develop hand-crafted features.
- The hand-crafted features are usually highly dependent on one application.
Deep Learning
Deep learning has a built-in, automatic, multi-stage feature learning process that learns rich hierarchical representations (i.e., features): Input -> Low-Level Features -> Mid-Level Features -> High-Level Features -> Trainable Classifier -> Output (e.g., outdoor, indoor).
Deep Learning
Each module in deep learning transforms its input representation into a higher-level one, in a way similar to the human cortex.
- Image: Pixel -> Edge -> Texture -> Motif -> Part -> Object
- Text: Character -> Word -> Word-group -> Clause -> Sentence -> Story
Let us see how it all works!
A Simple Neural Network
An Artificial Neural Network is an information-processing paradigm inspired by biological nervous systems, such as the human brain’s information-processing mechanism.
(Diagram: inputs x1–x4 feed hidden units a1(1)–a4(1), which feed a1(2), which produces the output Y.)
A Simple Neural Network
Each hidden unit applies an activation function $f(\cdot)$ to a weighted sum of its inputs:
$a_1^{(1)} = f(w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4)$
Common activation functions are ReLU and the sigmoid:
- ReLU: $f(x) = \max(0, x)$, so $a_1^{(1)} = \max(0,\; w_1 x_1 + w_2 x_2 + w_3 x_3 + w_4 x_4)$
- Sigmoid: $f(x) = \frac{1}{1 + e^{-x}}$, used for example at the output, $Y = \frac{1}{1 + e^{-w \cdot a_1^{(2)}}}$ (a softmax for multiple classes).
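As a concrete illustration, here is a minimal NumPy sketch of this forward pass; the input values and weights below are made up for illustration, and a real network would also include bias terms:

```python
import numpy as np

def relu(x):
    # ReLU activation: max(0, x), applied element-wise
    return np.maximum(0, x)

def sigmoid(x):
    # Sigmoid activation: 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

# Made-up example: 4 inputs, one hidden layer of 4 units, one output
x = np.array([0.5, -1.0, 2.0, 0.1])   # inputs x1..x4
W1 = np.random.randn(4, 4) * 0.1      # input-to-hidden weights
w2 = np.random.randn(4) * 0.1         # hidden-to-output weights

a1 = relu(W1 @ x)       # hidden activations a1(1)..a4(1)
y = sigmoid(w2 @ a1)    # output Y in (0, 1)
print(a1, y)
```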
Number of Parameters: for the network above, 4*4 + 4 + 1 = 21 (Input -> Hidden Layer -> Softmax -> Output Y).
What if the input is an image? A 400 x 400 x 3 image has 480,000 input values.
Number of parameters:
- With a hidden layer as wide as the input: 480,000 * 480,000 + 480,000 + 1 = approximately 230 billion!
- Even with only 1,000 hidden units: 480,000 * 1,000 + 1,000 + 1 = approximately 480 million!
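A quick back-of-the-envelope check of these counts, using the same formula as the slide (the 1,000-unit hidden layer is the slide's illustrative choice):

```python
# Parameter count for one fully connected hidden layer:
# (inputs * hidden units) + (hidden-to-output weights) + 1
inputs = 400 * 400 * 3                  # 480,000 input values
print(inputs * inputs + inputs + 1)     # hidden layer as wide as the input: ~230 billion
print(inputs * 1000 + 1000 + 1)         # only 1,000 hidden units: ~480 million
```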
Let us see how convolutional layers help.
Convolutional Layers
A small filter (e.g., a 3x3 edge-detection filter) is slid over the input image to produce the convolved image. Inspired by the neurophysiological experiments conducted by Hubel and Wiesel (1962).
Convolutional Layers: What is Convolution?
A 2x2 filter with weights $w_1, w_2, w_3, w_4$ slides over a 4x4 input image with pixels a–p to produce the convolved image (feature map):
$h_1 = f(a\,w_1 + b\,w_2 + e\,w_3 + f\,w_4)$
$h_2 = f(b\,w_1 + c\,w_2 + f\,w_3 + g\,w_4)$
Number of parameters for one feature map = 4; for 100 feature maps = 4 * 100 = 400.
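A minimal NumPy sketch of this sliding-window computation; the image and filter values below are made up, and in a real network the filter weights are learned:

```python
import numpy as np

def conv2d(image, kernel, activation=lambda x: np.maximum(0, x)):
    """Valid 2-D convolution (cross-correlation, as in most deep learning libraries):
    slide the filter over the image and apply the activation f()."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return activation(out)

# 4x4 input (the pixels a..p on the slide) and a 2x2 filter (w1..w4)
image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0], [0.0, -1.0]])   # made-up weights for illustration
feature_map = conv2d(image, kernel)            # 3x3 feature map; only 4 parameters
print(feature_map)
```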
Lower-Level to More Complex Features
Filter 1 maps the input image to the Layer 1 feature map; Filter 2 maps the Layer 1 feature map to the Layer 2 feature map. In convolutional neural networks, hidden units are connected only to a local receptive field.
Pooling
- Max pooling: reports the maximum output within a rectangular neighborhood.
- Average pooling: reports the average output of a rectangular neighborhood.
Example: MaxPool with a 2x2 filter and stride of 2 applied to the input matrix [[1, 3, 5, 4], [2, 4, 5, 3]] gives the output matrix [[4, 5]].
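A small NumPy sketch of max pooling over the example above:

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """Max pooling: report the maximum within each size x size neighborhood."""
    H, W = x.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = patch.max()
    return out

x = np.array([[1., 3., 5., 4.],
              [2., 4., 5., 3.]])
print(max_pool(x))   # [[4., 5.]] with a 2x2 filter and stride 2
```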
Convolutional Neural Network
Feature extraction architecture: stacked convolution (filter) and max-pool layers with 64, 128, 256, 512, and 512 feature maps, followed by fully connected layers that produce the output vector over classes such as Living Room, Bedroom, Kitchen, Bathroom, and Outdoor.
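A sketch of such a stack in PyTorch, assuming 64x64 RGB inputs, 3x3 filters, and five scene classes; the depths, kernel sizes, and input size here are illustrative choices, not the slide's exact architecture:

```python
import torch
import torch.nn as nn

class SmallCNN(nn.Module):
    """Conv -> ReLU -> MaxPool blocks for feature extraction, then fully connected layers."""
    def __init__(self, num_classes=5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 256, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, 512), nn.ReLU(),
            nn.Linear(512, num_classes),   # scores for Living Room, Bedroom, Kitchen, ...
        )

    def forward(self, x):
        return self.classifier(self.features(x))

scores = SmallCNN()(torch.randn(1, 3, 64, 64))   # one fake 64x64 RGB image
print(scores.shape)                              # torch.Size([1, 5])
```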
Convolutional Neural Networks: practical notes
- Output: binary, multinomial, continuous, or count.
- Input: fixed size; padding can be used to make all images the same size.
- Architecture: the choice is ad hoc and requires experimentation.
- Optimization: backward propagation; the parameters of a very deep model can be estimated properly only if you have billions of images, so reuse an architecture and trained parameters from other work (ImageNet models, Microsoft/Google APIs, etc.), as sketched below.
- Computing power: buy a GPU!!
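For the "reuse an architecture and trained parameters" point, a minimal transfer-learning sketch with PyTorch/torchvision; the choice of ResNet-18 and the five-class head are assumptions for illustration:

```python
import torch.nn as nn
import torchvision.models as models

# Load an ImageNet-pretrained backbone instead of training from scratch.
# (Newer torchvision versions use the weights= argument instead of pretrained=True.)
model = models.resnet18(pretrained=True)
for param in model.parameters():
    param.requires_grad = False                 # freeze the pretrained feature extractor
model.fc = nn.Linear(model.fc.in_features, 5)   # new head for our 5 scene classes
```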
Automatic Colorization of Black and White Images
Optimizing Images
Post-processing feature optimization: color curves and details, illumination, and color tone (warmness).
Recurrent Neural Networks
Why RNN? Limitations of convolutional neural networks:
- They take fixed-length vectors as input and produce fixed-length vectors as output.
- They allow a fixed number of computational steps.
We need to model data with temporal or sequential structure and varying-length inputs and outputs, e.g.:
- "This movie is ridiculously good."
- "This movie is very slow in the beginning but picks up pace later on and has some great action sequences and comedy scenes."
Modeling Sequences
- Sentiment analysis: "Awesome tutorial." -> Positive
- Image captioning: image -> "A person riding a motorbike on a dirt road"
- Machine translation: "Happy Diwali" -> "शुभ दीपावली"
What is an RNN?
Recurrent neural networks are connectionist models with the ability to selectively pass information across sequence steps while processing sequential data one element at a time. They allow a memory of the previous inputs to persist in the model’s internal state and influence the outcome.
(Diagram: the input x(t) and the delayed hidden state h(t-1) feed the hidden layer, which produces h(t) and the output.)
RNN (unrolled over time): "RNN is awesome"
At each step, the hidden state is updated from the previous hidden state and the current input:
$h_t = f_h(w_h \cdot h_{t-1} + w_x \cdot x_t)$
and an output is produced from the hidden state, $y_t = f_y(w_y \cdot h_t)$, e.g., predicting "is awesome" after reading "RNN".
(Diagram: h0 -> h1 -> h2 -> h3, with inputs x1, x2, x3 and shared weights $w_h$, $w_x$, $w_y$.)
RNN (unrolled over time): "RNN is so cool"
The same structure, with the same weights $w_h$, $w_x$, $w_y$ and functions $f_h$, $f_y$, is applied to this longer sequence: each input x_t and the previous hidden state produce h_t, from which the output is computed via $f_y(w_y \cdot h_t)$.
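A minimal NumPy sketch of this recurrence with scalar states; real RNNs use weight matrices and vector-valued hidden states, and the weights below are made up:

```python
import numpy as np

def rnn_forward(xs, w_h, w_x, w_y, h0=0.0):
    """Vanilla RNN mirroring h_t = f_h(w_h*h_{t-1} + w_x*x_t), with f_h = tanh."""
    h = h0
    outputs = []
    for x in xs:                        # process one element of the sequence at a time
        h = np.tanh(w_h * h + w_x * x)  # hidden state carries memory forward
        outputs.append(w_y * h)         # y_t = f_y(w_y * h_t); identity output here
    return outputs, h

# Made-up scalar weights and a toy input sequence of arbitrary length
ys, h_last = rnn_forward(xs=[0.5, -1.0, 2.0, 0.3], w_h=0.8, w_x=1.2, w_y=0.5)
print(ys)
```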
The Vanishing Gradient Problem
- RNNs use backpropagation, and backpropagation uses the chain rule.
- The chain rule multiplies derivatives: if these derivatives are between 0 and 1, the product vanishes as the chain gets longer; if they are greater than 1, the product explodes.
- The sigmoid activation function in an RNN leads to this problem. ReLU, in theory, avoids it, but not in practice.
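A two-line numeric illustration of why repeated multiplication over many time steps is a problem:

```python
# The chain rule over many time steps multiplies many derivatives together:
# values below 1 shrink toward 0 (vanishing), values above 1 blow up (exploding).
steps = 50
print(0.9 ** steps)   # ~0.005 -> the gradient effectively vanishes
print(1.1 ** steps)   # ~117   -> the gradient explodes
```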
Problem with Vanishing or Exploding Gradients
They don't allow us to learn long-term dependencies.
"Param is a hard worker." vs. "Param, student of Yong, is a hard worker." — with the longer gap, a plain RNN fails to connect "Param" to "is a hard worker": bad, misguided, unacceptable!
Long Short-Term Memory (LSTM)
LSTMs provide a solution to the vanishing/exploding gradient problem.
Solution: a memory cell, which is updated at each step in the sequence. Three gates control the flow of information to and from the memory cell:
- Input gate: protects the current step from irrelevant inputs.
- Output gate: prevents the current step from passing irrelevant information to later steps.
- Forget gate: limits the information passed from one cell to the next.
LSTM: the memory cell
A candidate update $u_t = f_h(w_h \cdot h_{t-1} + w_x \cdot x_t)$ is combined with the previous cell state using the forget gate $f_t$ and the input gate $i_t$:
$c_t = f_t \cdot c_{t-1} + i_t \cdot u_t$ (e.g., $c_1 = f_1 \cdot c_0 + i_1 \cdot u_1$).
LSTM: the forget gate
$f_t = f_f(W_{hf} \cdot h_{t-1} + W_{xf} \cdot x_t)$ (e.g., $f_1 = f_f(W_{hf} \cdot h_0 + W_{xf} \cdot x_1)$).
LSTM: the input gate
$i_t = f_i(W_{hi} \cdot h_{t-1} + W_{xi} \cdot x_t)$ (e.g., $i_1 = f_i(W_{hi} \cdot h_0 + W_{xi} \cdot x_1)$).
LSTM: the output gate and hidden state
$o_t = f_o(W_{ho} \cdot h_{t-1} + W_{xo} \cdot x_t)$, and the new hidden state is $h_t = o_t \cdot \tanh(c_t)$.
LSTM (unrolled over two steps)
The same cell is applied at every step: at step 1, the gates $f_1$, $i_1$, $o_1$ and candidate $u_1$ update $c_0$ to $c_1$ and produce $h_1$; at step 2, $f_2$, $i_2$, $o_2$ and $u_2$ update $c_1$ to $c_2$ and produce $h_2$.
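Putting the gate equations together, a minimal NumPy sketch of one LSTM step with scalar states; real LSTMs use weight matrices, vectors, and bias terms, and the weights below are made up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W):
    """One LSTM step following the slide's equations; W holds the (W_h*, W_x*) weight
    pairs for each gate, and biases are omitted for brevity."""
    f_t = sigmoid(W['hf'] * h_prev + W['xf'] * x_t)   # forget gate
    i_t = sigmoid(W['hi'] * h_prev + W['xi'] * x_t)   # input gate
    o_t = sigmoid(W['ho'] * h_prev + W['xo'] * x_t)   # output gate
    u_t = np.tanh(W['hu'] * h_prev + W['xu'] * x_t)   # candidate update
    c_t = f_t * c_prev + i_t * u_t                    # new memory cell
    h_t = o_t * np.tanh(c_t)                          # new hidden state
    return h_t, c_t

# Made-up scalar weights for illustration
W = dict(hf=0.5, xf=0.8, hi=0.4, xi=0.9, ho=0.6, xo=0.7, hu=0.3, xu=1.0)
h, c = 0.0, 0.0
for x in [0.5, -1.0, 2.0]:        # roll the same cell over the sequence
    h, c = lstm_step(x, h, c, W)
print(h, c)
```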
Combining CNN and LSTM
Visual Question Answering