CS 4803 / 7643: Deep Learning Topics: Automatic Differentiation, Patterns in backprop, Jacobians in FC+ReLU NNs Dhruv Batra Georgia Tech

Administrativia HW1 Reminder. Project Teams Google Doc (due 09/26, 11:55pm): https://docs.google.com/spreadsheets/d/1ouD6ctaemV_3nb2MQHs7rUOAaW9DFLu8I5Zd3yOFs7E/edit?usp=sharing. List your project title, a 1-3 sentence project summary (TL;DR), and team member names. (C) Dhruv Batra

Recap from last time (C) Dhruv Batra

Deep Learning = Differentiable Programming. Computation = Graph; Input = Data + Parameters; Output = Loss; Scheduling = Topological ordering. Auto-Diff: a family of algorithms for implementing the chain rule on computation graphs. (C) Dhruv Batra

Forward mode AD. What we're doing in backprop is: we have all these nodes in our computational graph, each node only aware of its immediate surroundings, and at each node we have ...

Reverse mode AD. What we're doing in backprop is: we have all these nodes in our computational graph, each node only aware of its immediate surroundings, and at each node we have ...

Example: Forward mode AD on a small computation graph with inputs x1, x2 and nodes *, sin(·), + (C) Dhruv Batra

Example: Reverse mode AD on the same computation graph (inputs x1, x2; nodes *, sin(·), +) (C) Dhruv Batra
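
The slide's actual figure is not recoverable from the transcript; as an illustration, here is a minimal sketch of forward- and reverse-mode AD on a function built from the same ingredients, assuming f(x1, x2) = x1*x2 + sin(x1) (the specific function is my assumption, not the slide's).

```python
import math

def f_forward_mode(x1, x2, dx1, dx2):
    """Forward mode: propagate (value, derivative) pairs through the graph.
    The seed (dx1, dx2) selects which input we differentiate with respect to."""
    a, da = x1 * x2, dx1 * x2 + x1 * dx2          # product rule
    b, db = math.sin(x1), math.cos(x1) * dx1      # chain rule
    return a + b, da + db                         # sum rule

def f_reverse_mode(x1, x2):
    """Reverse mode: one forward pass caches intermediates, then the gradient
    of the output flows backward to every input in a single sweep."""
    a = x1 * x2
    b = math.sin(x1)
    y = a + b
    dy = 1.0                          # dy/dy
    da, db = dy, dy                   # the + node distributes the gradient
    dx1 = da * x2 + db * math.cos(x1)
    dx2 = da * x1
    return y, (dx1, dx2)

# Forward mode needs one pass per input (seed (1,0), then (0,1));
# reverse mode returns both partial derivatives in one backward pass.
y, dy_dx1 = f_forward_mode(0.5, 2.0, 1.0, 0.0)
y, grads = f_reverse_mode(0.5, 2.0)
```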

Forward mode vs Reverse mode: x → Graph → L. Intuition via the Jacobian. (C) Dhruv Batra

Forward mode vs Reverse Mode What are the differences? Which one is faster to compute? Forward or backward? Which one is more memory efficient (less storage)? (C) Dhruv Batra

Computational Graph: Any DAG of differentiable modules is allowed! (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato

Key Computation: Forward-Prop (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Key Computation: Back-Prop (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Neural Network Training Step 1: Compute Loss on mini-batch [F-Pass] (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Neural Network Training Step 1: Compute Loss on mini-batch [F-Pass] Step 2: Compute gradients wrt parameters [B-Pass] (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato, Yann LeCun

Neural Network Training Step 1: Compute Loss on mini-batch [F-Pass] Step 2: Compute gradients wrt parameters [B-Pass] Step 3: Use gradient to update parameters (C) Dhruv Batra Slide Credit: Marc'Aurelio Ranzato, Yann LeCun
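
A tiny self-contained illustration of these three steps on one mini-batch (a toy linear model with squared loss and plain SGD; all names here are made up for the example, this is not the course's code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 10))            # one mini-batch of 32 examples
y = rng.normal(size=(32,))
W = np.zeros(10)                         # parameters
learning_rate = 1e-2

# Step 1: F-Pass, compute the loss on the mini-batch
scores = X @ W
loss = 0.5 * np.mean((scores - y) ** 2)

# Step 2: B-Pass, compute gradients wrt parameters by backprop
dscores = (scores - y) / len(y)          # d(loss)/d(scores)
dW = X.T @ dscores                       # chain rule through scores = X W

# Step 3: use the gradient to update the parameters
W -= learning_rate * dW
```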

Plan for Today Automatic Differentiation Patterns in backprop Jacobians in FC+ReLU NNs (C) Dhruv Batra

Neural Network Computation Graph (C) Dhruv Batra Figure Credit: Andrea Vedaldi

Backprop (C) Dhruv Batra Figure Credit: Andrea Vedaldi

Modularized implementation: forward / backward API. Graph (or Net) object (rough pseudo-code). During the forward pass, cache intermediate computations. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
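
A schematic sketch of what such a Graph/Net object could look like (my illustration, not the slide's actual pseudo-code; the node interface and ordering are assumed):

```python
class Net:
    """Schematic Graph/Net object: hypothetical node interface for illustration."""
    def __init__(self, nodes):
        # assume nodes are given in topological order (scheduling = topological ordering)
        self.nodes = nodes

    def forward(self):
        for node in self.nodes:            # inputs run before the nodes that consume them
            node.forward()                 # each node caches intermediates for its backward pass
        return self.nodes[-1].output       # the final node holds the loss

    def backward(self):
        for node in reversed(self.nodes):  # reverse topological order
            node.backward()                # combine upstream gradients with local gradients
```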

Modularized implementation: forward / backward API. A single * gate with inputs x, y and output z (x, y, z are scalars). The implementation exactly mirrors the algorithm in that it's completely local. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Modularized implementation: forward / backward API. A single * gate with inputs x, y and output z (x, y, z are scalars). Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
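
For this single * gate, a forward/backward module might look like the following sketch (consistent with the slide's description, not its exact code):

```python
class MultiplyGate:
    """z = x * y with scalar inputs; everything is local to the gate."""
    def forward(self, x, y):
        self.x, self.y = x, y        # cache the inputs for the backward pass
        return x * y

    def backward(self, dz):
        # local gradients dz/dx = y and dz/dy = x, chained with the upstream dz
        dx = self.y * dz
        dy = self.x * dz
        return dx, dy
```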

Example: Caffe layers. Each layer is a higher-level node; the beautiful thing is that you only need to implement forward and backward. Caffe is licensed under BSD 2-Clause. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Caffe Sigmoid Layer: the backward pass multiplies by top_diff (chain rule). Caffe is licensed under BSD 2-Clause. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
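
In the same spirit, a sketch of a sigmoid layer (not Caffe's actual C++ code): the backward pass multiplies the upstream gradient top_diff by the local derivative sigma * (1 - sigma).

```python
import numpy as np

class SigmoidLayer:
    """Sketch of a sigmoid layer mirroring the idea, not Caffe's implementation."""
    def forward(self, bottom):
        self.top = 1.0 / (1.0 + np.exp(-bottom))   # cache the output for backward
        return self.top

    def backward(self, top_diff):
        # chain rule: d(loss)/d(bottom) = top_diff * sigma * (1 - sigma)
        return top_diff * self.top * (1.0 - self.top)
```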

Plan for Today Automatic Differentiation Patterns in backprop Jacobians in FC+ReLU NNs (C) Dhruv Batra

Backpropagation: a simple example Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

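The slide's concrete numbers are not recoverable from the transcript; as a stand-in, here is the standard simple example used in CS 231n, f(x, y, z) = (x + y) * z, with backprop written out by hand (the specific values are my assumption):

```python
# Assumed example: f(x, y, z) = (x + y) * z with x = -2, y = 5, z = -4.
x, y, z = -2.0, 5.0, -4.0

# forward pass
q = x + y          # q = 3
f = q * z          # f = -12

# backward pass, in reverse order
df = 1.0
dq = z * df        # mul gate: gradient is the other input  -> -4
dz = q * df        #                                         ->  3
dx = 1.0 * dq      # add gate: distributes the gradient      -> -4
dy = 1.0 * dq      #                                         -> -4
```
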
Patterns in backward flow Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Patterns in backward flow Q: What is an add gate? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Patterns in backward flow add gate: gradient distributor Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Patterns in backward flow add gate: gradient distributor Q: What is a max gate? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Patterns in backward flow add gate: gradient distributor max gate: gradient router Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Patterns in backward flow add gate: gradient distributor max gate: gradient router Q: What is a mul gate? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Patterns in backward flow add gate: gradient distributor max gate: gradient router mul gate: gradient switcher and scaler (it scales the upstream gradient by the value of the other branch) Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
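
The three patterns written as local backward rules, in a short sketch (dout is the upstream gradient; these are illustrations, not the slide's code):

```python
def add_backward(dout):
    # distributor: both inputs receive the upstream gradient unchanged
    return dout, dout

def max_backward(x, y, dout):
    # router: the whole gradient goes to whichever input won the max
    return (dout, 0.0) if x > y else (0.0, dout)

def mul_backward(x, y, dout):
    # switcher/scaler: each input gets dout scaled by the other input's value
    return y * dout, x * dout
```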

Gradients add at branches (+), which follows from the multivariate chain rule. Since wiggling this node affects both connected nodes in the forward pass, at backprop the gradient coming into this node is the sum of the gradients flowing back from both nodes. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
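
Written out (the standard statement of the multivariate chain rule, not copied from the slide): if x feeds into nodes q_1, ..., q_n in the forward pass, then

```latex
\frac{\partial L}{\partial x} = \sum_{i=1}^{n} \frac{\partial L}{\partial q_i}\,\frac{\partial q_i}{\partial x}
```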

Duality in Fprop and Bprop: a SUM (+) in the forward pass becomes a COPY in the backward pass, and a COPY (branching) in the forward pass becomes a SUM (+) in the backward pass. (C) Dhruv Batra

Plan for Today Automatic Differentiation Patterns in backprop Jacobians in FC+ReLU NNs (C) Dhruv Batra

Backprop (C) Dhruv Batra Figure Credit: Andrea Vedaldi

Jacobian of ReLU: g(x) = max(0, x) (elementwise). 4096-d input vector, 4096-d output vector. Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Jacobian of ReLU: g(x) = max(0, x) (elementwise). 4096-d input vector, 4096-d output vector. Q: what is the size of the Jacobian matrix? (It holds the partial derivative of each dimension of the output with respect to each dimension of the input.) Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Jacobian of ReLU: g(x) = max(0, x) (elementwise). 4096-d input vector, 4096-d output vector. Q: what is the size of the Jacobian matrix? [4096 x 4096!] Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n

Jacobian of ReLU: g(x) = max(0, x) (elementwise). 4096-d input vector, 4096-d output vector. Q: what is the size of the Jacobian matrix? [4096 x 4096!] Q2: what does it look like (one entry for each element of the output with respect to each element of the input)? Slide Credit: Fei-Fei Li, Justin Johnson, Serena Yeung, CS 231n
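
A sketch of the answer (a standard fact, not taken verbatim from the slide): the Jacobian of an elementwise ReLU is diagonal, with 1 where the input is positive and 0 elsewhere, so in practice it is applied as a mask rather than materialized.

```python
import numpy as np

x = np.random.randn(4096)                 # 4096-d input
dout = np.random.randn(4096)              # upstream gradient, 4096-d

# Full Jacobian: 4096 x 4096, but diagonal (built here only for illustration)
J = np.diag((x > 0).astype(x.dtype))
dx_full = J @ dout

# What implementations actually do: an elementwise mask
dx_fast = dout * (x > 0)

assert np.allclose(dx_full, dx_fast)
```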

Jacobians of FC-Layer (C) Dhruv Batra

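A sketch of the FC-layer case (assumed shapes, not the slide's figures): for y = Wx + b, the Jacobian dy/dx is simply W, and backprop only ever needs the vector-Jacobian products below rather than explicit Jacobians.

```python
import numpy as np

m, n = 3, 4                      # output / input sizes (illustrative)
W = np.random.randn(m, n)
x = np.random.randn(n)
b = np.random.randn(m)

y = W @ x + b                    # forward: y = Wx + b
dy = np.random.randn(m)          # upstream gradient dL/dy

dx = W.T @ dy                    # dL/dx = W^T (dL/dy); the Jacobian dy/dx is W
dW = np.outer(dy, x)             # dL/dW = (dL/dy) x^T
db = dy                          # dL/db = dL/dy
```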