1 Signal Processing and Networking for Big Data Applications, Lectures 11-12: Deep Learning Basics
Zhu Han, University of Houston. Thanks to Dr. Hien Nguyen for his slides and to Xunshen Du and Kevin Tsai for their help.

2 Outline
Motivation and overview: why deep learning, basic concepts, CNN and RNN vs. DBN
Idea of deep feedforward networks: example of learning XOR, output units

3 Which One Is Better? Answer: "it depends"
Classic methods such as random forests might be good if you don't have a lot of data, your training data have categorical features, you want a more explainable model, or you want high run-time speed. Deep learning is good when you have a lot of training data from the same or a related domain. Deep networks can be much slower than random forests, and appropriate scaling and normalization must be applied to the data to improve the convergence speed.

4 Why Deep Learning? "Beat me on ImageNet!" "How can I convince you?" (Hinton, Malik, Alex)

5 Why Deep Learning Complex engineered systems AlexNet

6 Why Deep Learning

7 Why Deep Learning

8 Why Deep Learning Carotid artery bifurcation landmark detection

9 Why Deep Learning: "Deep" in deep learning
Some functions computable with a polynomial-size logic-gate circuit of depth k require exponential size when restricted to depth k−1 (Håstad, 1986).
Complex patterns are composed of repeated simple patterns ("everything in the universe is within you").
Curse of dimensionality: the convergence rate decreases exponentially with the dimension of the data.
We say that the expression of a function is compact when it has few computational elements, i.e., few degrees of freedom that need to be tuned by learning; compact means more generalizable in machine learning. The depth of an architecture is the maximum length of a path from any input of the graph to any output of the graph.
Reference: J. Håstad, "Almost optimal lower bounds for small depth circuits," in Proceedings of the 18th Annual ACM Symposium on Theory of Computing, pp. 6-20, Berkeley, California: ACM Press, 1986.

10 Why Now?
Big annotated datasets have become available: ImageNet, Google Video, Mechanical Turk crowdsourcing.
GPU processing power.
Better stochastic gradient descent variants: AdaGrad, Adam, RMSProp.
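As a concrete illustration of these optimizer variants, here is a minimal NumPy sketch of single AdaGrad and RMSProp update steps (not the lecture's code; names and hyperparameter values are illustrative).

```python
import numpy as np

def adagrad_step(w, grad, cache, lr=0.01, eps=1e-8):
    """One AdaGrad update: the per-parameter step size shrinks as squared gradients accumulate."""
    cache += grad ** 2
    w -= lr * grad / (np.sqrt(cache) + eps)
    return w, cache

def rmsprop_step(w, grad, cache, lr=0.001, decay=0.9, eps=1e-8):
    """One RMSProp update: an exponential moving average of squared gradients
    keeps the effective step size from decaying to zero."""
    cache = decay * cache + (1 - decay) * grad ** 2
    w -= lr * grad / (np.sqrt(cache) + eps)
    return w, cache
```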

11 Biological Neuron and Modeling

12 Hierarchical Structure and Deep Learning
1981, David Hubel & Torsten Wiesel: the mammalian visual cortex is hierarchical.

13 Deep Learning Structure
Hinton

14 Fully Connected Feed-Forward Network
A linear transformation followed by a non-linear rectification at each layer. Training of the network: back-propagation.
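A minimal NumPy sketch of one such network (one hidden layer with ReLU rectification) and its back-propagation step, assuming a squared-error loss; all shapes and names are illustrative, not the lecture's code.

```python
import numpy as np

# Toy one-hidden-layer network: linear transform -> ReLU -> linear output.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))            # batch of 4 inputs, 3 features
y = rng.normal(size=(4, 1))            # regression targets
W1, b1 = rng.normal(size=(3, 5)) * 0.1, np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)) * 0.1, np.zeros(1)

# Forward pass: linear transformation followed by non-linear rectification.
z1 = x @ W1 + b1
h = np.maximum(0, z1)                  # ReLU
y_hat = h @ W2 + b2
loss = 0.5 * np.mean((y_hat - y) ** 2)

# Back-propagation of the squared-error loss.
d_yhat = (y_hat - y) / len(x)
dW2 = h.T @ d_yhat
db2 = d_yhat.sum(axis=0)
dh = d_yhat @ W2.T
dz1 = dh * (z1 > 0)                    # ReLU gradient
dW1 = x.T @ dz1
db1 = dz1.sum(axis=0)
```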

15 Rectifier Functions Activation function from biological data

16 Rectified Linear Unit (ReLU)
Pros: faster to compute; helps reduce the vanishing gradient problem; sparse activation; more biologically plausible; more robust to noise; information disentanglement, etc.
Cons: a strongly negative neuron cannot recover because its gradient is zero (the "dying ReLU" problem); Parametric ReLU comes to the rescue.
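A minimal sketch of ReLU and its gradient, illustrating why a unit whose pre-activation stays negative stops receiving gradient (the con noted above).

```python
import numpy as np

def relu(z):
    """max(0, z): cheap to compute and gives sparse activations."""
    return np.maximum(0, z)

def relu_grad(z):
    """The gradient is 1 for positive inputs and exactly 0 otherwise,
    which is why a strongly negative unit can stop learning."""
    return (z > 0).astype(float)

z = np.array([-3.0, -0.1, 0.5, 2.0])
print(relu(z), relu_grad(z))   # gradient is zero for the two negative entries
```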

17 Dropout (Hinton et al., 2012): randomly pick some neurons and replace their values with 0 during training. Reduces overfitting (pediatric ADHD example) and co-adaptation.
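A minimal sketch of (inverted) dropout as described on this slide; the drop probability and names are illustrative.

```python
import numpy as np

def dropout(h, p_drop=0.5, training=True, rng=None):
    """Randomly zero a fraction p_drop of the activations during training.
    Inverted dropout rescales the survivors so nothing changes at test time."""
    if not training:
        return h
    if rng is None:
        rng = np.random.default_rng()
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)
```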

18 DropConnect (Wang et al., ICML 2013)
Randomly removes connections (weights) rather than activations; claimed to work better than dropout.

19 Other Tips
Data augmentation / perturbation; batch normalization.
L1/L2 regularization: a big network with a regularizer often outperforms a small network.
Hyperparameter search using Bayesian optimization.
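A minimal sketch of how L1/L2 penalties enter the training loss; the coefficient values are illustrative.

```python
import numpy as np

def regularized_loss(data_loss, weights, l1=0.0, l2=1e-4):
    """Add L1/L2 penalties so a big network is pulled toward simpler solutions."""
    penalty = sum(l1 * np.abs(W).sum() + l2 * (W ** 2).sum() for W in weights)
    return data_loss + penalty
```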

20 Type 1: Convolutional Neural Networks (CNNs)
LeCun et al., 1989: a neural network with a specialized connectivity structure.

21 AlexNet
Same model as LeCun '98, but: a bigger model (8 layers), more data (10^6 vs. 10^3 images), a GPU implementation (50x speedup over CPU), and better regularization (dropout).
7 hidden layers; 650,000 neurons; 60,000,000 parameters; trained on 2 GPUs for a week.

22 AlexNet
14 million labeled images, 20k classes; images gathered from the Internet; human labels via Amazon Mechanical Turk.

23 Pre-Training Effect
Pretrain the network on ImageNet, then fine-tune on a small set of samples. Dataset: Caltech-256 (Zeiler & Fergus, arXiv, 2013).

24 CNN Structure
Feed-forward: convolve input, non-linearity (rectified linear), pooling (local max).
Supervised: train convolutional filters by back-propagating classification error.
[Diagram: INPUT IMAGE → CONVOLUTION (LEARNED) → NON-LINEAR RECTIFIER → POOLING → FEATURE MAPS]
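A minimal NumPy sketch of this feed-forward path (convolve, rectify, pool) for a single channel and a single learned filter; it only shows the forward computation, not training.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution: the same learned filter slides over every location."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

def max_pool(fmap, size=2):
    """Non-overlapping max pooling over size x size regions."""
    H, W = fmap.shape
    fmap = fmap[:H - H % size, :W - W % size]
    return fmap.reshape(H // size, size, W // size, size).max(axis=(1, 3))

image = np.random.rand(8, 8)
kernel = np.random.randn(3, 3)          # learned during training in a real CNN
feature_map = max_pool(np.maximum(0, conv2d(image, kernel)))
print(feature_map.shape)                # (3, 3)
```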

25 Filtering: Convolution
The filter is learned during training; the same filter is applied at each location; each filter looks at a local region only.

26 Filtering Multiple filters at each layer
More complex patterns at deeper layers

27 Non-Linearity Rectified linear function Applied per-pixel
Output = max(0,input)

28 Pooling
Spatial pooling over non-overlapping or overlapping regions; sum or max (see Boureau et al., ICML'10 for a theoretical analysis). Max pooling makes the representation more invariant to local transformations such as translation.

29 Pooling
Spatial pooling over non-overlapping or overlapping regions; sum or max (Boureau et al., ICML'10). [Figure: max vs. sum pooling examples]

30 Role of Pooling
Spatial pooling provides invariance to small transformations and larger receptive fields.
Variants: pooling across feature maps, tiled pooling.

31 Going Deeper
Deeper architectures: the current best models are much deeper than AlexNet.

32 Going Deeper
CaffeNet: 8 layers; Google Inception Network: 22 layers; Residual Network: 152 layers.

33 Detection with CNNs
Sometimes we also want to locate specific objects in the scene. Traditional approaches: scan the entire image at different scales, construct a feature for each location, and apply a classifier to that feature; sophisticated tricks (integral images, subwindow search) speed up the scanning, and multiple hand-engineered features are combined. Difficult to scale up to multiple classes.

34 Detection with CNNs
Faster R-CNN (NIPS 2015) performs regression of the bounding-box coordinates and the object class at the same time, jointly learning features and regressors. Currently the best performer on PASCAL detection.

35 Segmentation with CNNs
Noh et al., ICCV 2015: add unpooling and upsampling layers; competitive performance on the PASCAL dataset.

36 Reinforcement Learning with CNNs
The robot takes an action, gets a reward, and enters a new state. Goal: find the optimal policy for the robot. Q-learning:
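The Q-learning equation on this slide is an image in the original deck; the standard tabular update it presumably refers to is, with learning rate α and discount factor γ:

```latex
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
```

In deep Q-learning the table Q(s, a) is replaced by a CNN that maps raw sensor input to action values, which is the setting of the next slides.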

37 Reinforcement Learning with CNNs
From sensor data to robot actions: the action-value function is jointly learned with the image features. Outperforms human experts (and linear methods).

38 Reinforcement Learning with CNNs
AlphaGo (the game of Go): model the probability of the next move with a CNN, predict the distribution of the opponent's next move and feed it into policy learning, and play against itself to improve the policy.

39 Self-Driving Car with CNNs
NVIDIA

40 Why CNNs Work Well
Small number of variables (parameters); features and classifier are learned jointly; hierarchically learning complex patterns mitigates the curse of dimensionality; designed to be invariant to shift/translation.

41 Type 2: Recurrent Neural Network
A vast number of tasks involve time-series data. Need long-term dependencies: "I grew up in France… I speak fluent French."
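A minimal sketch of a vanilla recurrent cell stepping through a sequence, showing how the hidden state carries information across time; shapes and names are illustrative.

```python
import numpy as np

def rnn_forward(xs, Wxh, Whh, bh, h0):
    """Vanilla RNN: each new hidden state depends on the input and the previous state."""
    h = h0
    states = []
    for x in xs:                                  # xs: list of input vectors
        h = np.tanh(Wxh @ x + Whh @ h + bh)
        states.append(h)
    return states

rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(5)]       # a length-5 sequence
Wxh, Whh, bh = rng.normal(size=(4, 3)), rng.normal(size=(4, 4)), np.zeros(4)
states = rnn_forward(xs, Wxh, Whh, bh, h0=np.zeros(4))
```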

42 Recurrent Network

43 Recurrent Network

44 Recurrent Network

45 Long Short-Term Memory Network

46 Long Short-Term Memory Network

47 Long Short-Term Memory Network
Rafal Jozefowicz et al., JMLR 2015: the forget gate is the most important, the output gate the least important. Adding a positive bias (1 or 2) to the forget gate helps reduce vanishing gradients and greatly improves LSTM performance.
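A minimal sketch of a single LSTM step, with the forget-gate bias initialized to a positive value as the slide suggests; the gate ordering and names are illustrative, not a specific library's API.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM step. W maps [x, h] to the stacked gate pre-activations
    (input, forget, output, candidate)."""
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c = f * c + i * g                 # the forget gate decides what memory to keep
    h = o * np.tanh(c)
    return h, c

n_in, n_hid = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * n_hid, n_in + n_hid)) * 0.1
b = np.zeros(4 * n_hid)
b[n_hid:2 * n_hid] = 1.0              # positive forget-gate bias, as recommended above
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
```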

48 Sequence to Sequence: could be applied to sequences of inputs and outputs

49 Example: Video to Text with LSTM
Connect a CNN to a stack of LSTMs.

50 Network Visualization
What was learned in the deep network? The raw coefficients of learned filters in higher layers are difficult to interpret. Two main approaches: project activations back to pixel space, or optimize the input image to maximize the response of a particular neuron / feature map.

51 Network Visualization
Projection from higher layers back to the input:
Visualizing and Understanding Convolutional Networks, Matt Zeiler & Rob Fergus, ECCV 2014
Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman, arXiv, 2013
Object Detectors Emerge in Deep Scene CNNs, Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, Antonio Torralba, ICLR 2015

52 Network Visualization
Divide the image into regions and iteratively remove the regions whose removal causes the least change in the output score (Zhou & Torralba).

53 Network Visualization
Find the K images producing the highest response of a neuron, then occlude random regions to find the most critical one, i.e., the one causing the largest change in the output. Object detectors emerge!

54 Network Visualization
Optimize the input to maximize a particular output (Erhan et al. 2009; Le et al., NIPS 2010). The result depends on the initialization. Example: Google Deep Dream.
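A toy sketch of the "optimize the input" idea: start from small random noise and take gradient-ascent steps on the image to increase one neuron's response. Here the neuron is just a fixed linear filter so the gradient is analytic; in a real network this gradient would come from back-propagating to the input.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(32, 32))            # the neuron's weights over the input
img = rng.normal(size=(32, 32)) * 0.01   # the final result depends on this initialization

for _ in range(100):
    grad = w                             # gradient of the neuron's response w.r.t. the image
    img = img + 0.1 * grad               # ascend the response
    img *= 0.99                          # mild decay acts as a crude image prior

print(np.sum(w * img))                   # the neuron's (now much larger) response
```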

55 Network Visualization
Google Deep Dream: ask the network to enhance what it sees.

56 Type 3: Deep Belief Network: Restricted Boltzmann Machine
Each link is associated with a probability; the model is parametric. [Figure: men / women example]

57 Deep Belief Network: Layered Structure
After learning an RBM, treat the activation probabilities of its hidden units as the data for training the RBM one layer up. Stacking a number of RBMs learned layer by layer from the bottom up gives a trained deep network, e.g. for better clustering.
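A rough NumPy sketch of this greedy layer-wise procedure: each RBM is trained with one-step contrastive divergence (CD-1), and its hidden activation probabilities become the training data for the next RBM up. Biases are omitted for brevity; all sizes and names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.1, epochs=10, seed=0):
    """CD-1 training of a binary RBM (biases omitted); a rough sketch."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(data.shape[1], n_hidden))
    for _ in range(epochs):
        v0 = data
        p_h0 = sigmoid(v0 @ W)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)   # sample hidden units
        v1 = sigmoid(h0 @ W.T)                                # reconstruction
        p_h1 = sigmoid(v1 @ W)
        W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / len(data)     # CD-1 gradient estimate
    return W

# Greedy layer-wise stacking: hidden activation probabilities of one RBM
# become the training data for the RBM one layer up.
data = (np.random.rand(100, 20) > 0.5).astype(float)
W1 = train_rbm(data, n_hidden=10)
h1 = sigmoid(data @ W1)
W2 = train_rbm(h1, n_hidden=5)
```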

58 Comparisons

59 Deep Learning Software
Caffe, Theano, Torch, MATLAB, TensorFlow

60 Deep Learning Software

61 Outline
Motivation and overview; idea of deep feedforward networks; example: learning XOR; output units.
Book: Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning.

62 Concept
Also known as feedforward neural networks or multi-layer perceptrons (MLPs).
They approximate some function f*, e.g. y = f*(x); a feedforward network learns the parameters θ of a function, e.g. y = f(x; θ).

63 Deep? Feedforward? Networks?
NETWORKS: composed of multiple functions.
FEEDFORWARD: information (the input x) flows through the functions to produce the output y, e.g. f(x) = f^(3)(f^(2)(f^(1)(x))). Note: recurrent neural networks are the extension of feedforward networks with feedback connections.
DEEP: the overall length of the chain gives the "depth" of the model (hence "deep learning"). f^(1) is called the first layer, f^(2) the second layer, and so on; the final layer of the network is called the output layer.

64 Linear → Nonlinear
Linear models usually fit efficiently and reliably, in closed form or with convex optimization. Nonlinear functions can overcome the limitations of linear models: apply a linear model to a transformed input φ(x), where φ is a nonlinear transformation.
Choices of φ: a very generic (e.g. infinite-dimensional) φ; a manually engineered φ (the dominant approach before deep learning); or a φ generated by deep learning.

65 Hidden Layers
Recall: f(x) = f^(3)(f^(2)(f^(1)(x))), where each f^(n) might be a linear or a nonlinear function.
The training data do not determine what each individual layer should do; the learning algorithm must decide how to use the layers. How the layers are connected (the architecture) is NOT itself part of "learning".

66 Hidden Layers

67 Cost Functions
The choice of cost function is important when designing a deep neural network. In most cases we use the cross-entropy between the training data and the model's predictions, since the parametric model usually defines a distribution p(y | x; θ).

68 Training a Feedforward Network
Design choices: the form of the output units, the cost function, and the optimizer.

69 Example: Learning XOR

70 Objective
Target function: y = f*(x). Our model: y = f(x; θ).
Objective: adapt θ to make f as similar as possible to f*.

71 Strategy
Treat it as a regression problem and use a mean squared error (MSE) loss, J(θ) = (1/4) Σ_{x∈X} (f*(x) − f(x; θ))², just to simplify the math here.
Our model: f(x; θ) = f(x; w, b) = xᵀw + b.

72 Problem
Minimizing the MSE gives w = 0 and b = 0.5, so the linear model outputs 0.5 everywhere. Useless!
When x1 = 0 the output must increase as x2 increases, but when x1 = 1 the output must decrease as x2 increases; a linear model cannot do both because the coefficient on x2 is fixed.
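A quick check of this claim: the least-squares fit of a linear model to the four XOR cases indeed gives w ≈ 0 and b = 0.5.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # the four XOR inputs
y = np.array([0, 1, 1, 0], dtype=float)                       # XOR targets

A = np.hstack([X, np.ones((4, 1))])          # append a column of ones for the bias b
w1, w2, b = np.linalg.lstsq(A, y, rcond=None)[0]
print(w1, w2, b)                             # ~0, ~0, 0.5: the model outputs 0.5 everywhere
```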

73 Solution
Add a hidden layer h containing two hidden units and rewrite the model as a composition of two functions: one mapping x to h, and one mapping h to y.

74 Solution
Use a nonlinear function to describe the features. Note: most neural networks do so using an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called an "activation function": h = f^(1)(x; W, c) = g(Wᵀx + c), where W are the weights and c are the biases.
Complete network: f(x; W, c, w, b) = wᵀ max(0, Wᵀx + c) + b.
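A quick check of this complete network with the hand-constructed parameters given in the Deep Learning book (W all ones, c = [0, −1], w = [1, −2], b = 0), verifying that two ReLU hidden units solve XOR exactly.

```python
import numpy as np

W = np.array([[1., 1.], [1., 1.]])   # hidden-layer weights
c = np.array([0., -1.])              # hidden-layer biases
w = np.array([1., -2.])              # output weights
b = 0.0                              # output bias

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
h = np.maximum(0, X @ W + c)         # h = g(W^T x + c) with g = ReLU
y_hat = h @ w + b
print(y_hat)                         # [0. 1. 1. 0.] -- exactly XOR
```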

75 Side Note: Common Activation Functions

76 Gradient-based learning

77 Remark
Linear models: linearity gives linear-equation solvers with convergence guarantees, but limited capacity.
Neural networks (which include nonlinear functions): nonlinearity causes most loss functions to become non-convex, so we instead use iterative, gradient-based optimizers that drive the cost function to a small value.

78 Stochastic Gradient Descent
No convergence guarantees; sensitive to the initial parameter values. The initial parameters should be small random values (discussed in Ch. 8 of the book).

79 Output units

80 Output Units
An output unit can be any kind of neural network unit (e.g., a hidden unit); note that the same kinds of units can also be used internally in the network. The cost function is tightly coupled with the choice of output unit.

81 Output Units: Linear Units for Gaussian Output Distributions
Based on an affine transformation with no nonlinearity. Output of a linear output unit: ŷ = Wᵀh + b, where h are the given features.
Often used to produce the mean of a conditional Gaussian distribution, p(y | x) = N(y; ŷ, I); modeling the covariance as well is harder, because it must be a positive definite matrix for all inputs.
Linear units do not saturate, so they pose little difficulty for gradient-based optimization algorithms.

82 Output Units: Sigmoid Units for Bernoulli Output Distributions
Useful for classification problems with two classes: the neural net only needs to predict P(y = 1 | x), which must be a valid probability in [0, 1].
A naive choice, P(y = 1 | x) = max{0, min{1, wᵀh + b}}, has zero gradient whenever wᵀh + b is outside [0, 1], so the model is unable to learn there. What now?

83 Output Units: Sigmoid Units for Bernoulli Output Distributions
To ensure a strong gradient whenever the model has the wrong answer, combine the sigmoid output unit with maximum likelihood.
Two components: a linear layer computing z = wᵀh + b, and a sigmoid activation function converting z into a probability.
Sigmoid output unit: ŷ = σ(z), where σ(x) = 1 / (1 + exp(−x)).
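A minimal sketch of the sigmoid output unit paired with the maximum-likelihood (cross-entropy) loss; writing the negative log-likelihood in softplus form is one common, numerically stable way to do this, not necessarily the deck's exact formulation.

```python
import numpy as np

def softplus(x):
    """log(1 + exp(x)), computed stably."""
    return np.logaddexp(0.0, x)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bernoulli_nll(z, y):
    """Negative log-likelihood of the sigmoid output unit.
    Equals -log sigmoid(z) when y = 1 and -log(1 - sigmoid(z)) when y = 0;
    this form keeps the gradient strong whenever the model is badly wrong."""
    return softplus((1.0 - 2.0 * y) * z)

z = np.array([-3.0, 0.0, 3.0])     # pre-activations w^T h + b
y = np.array([1.0, 1.0, 0.0])      # labels
print(sigmoid(z), bernoulli_nll(z, y))
```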

84 Output Units: Softmax Units for Multinoulli Output Distributions
Represent a probability distribution over a discrete variable with n possible values; often used as the output of a classifier. A generalization of the sigmoid function; the output is a vector ŷ.

85 Output Units: Softmax Units for Multinoulli Output Distributions
ŷᵢ = P(y = i | x), with softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ).
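A minimal, numerically stable sketch of this softmax formula (subtracting the maximum does not change the result and avoids overflow):

```python
import numpy as np

def softmax(z):
    """softmax(z)_i = exp(z_i) / sum_j exp(z_j), with the max subtracted for stability."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))   # probabilities over 3 classes, sums to 1
```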

86 Intro to Hidden Units
Hidden units are not yet fully understood theoretically; the design process consists of trial and error. They are distinguished by the choice of activation function.

87 Outline
Hidden Units & ReLU; Architecture Design; Visualization Design; Brief Intro to Propagation

88 Hidden Units
The design process consists of trial and error. We do not expect training to reach a point where the gradient is exactly 0. Hidden units are distinguished by the choice of activation function.

89 Rectified Linear Units
Derivatives remain large whenever the unit is active: the derivative is 1 when active (and the second derivative is 0 almost everywhere), so gradients are large and consistent.
Typically used on top of an affine transformation, h = g(Wᵀx + b). Initializing b to a small positive value makes the ReLUs initially active for most inputs, allowing derivatives to pass through.

90 ReLU Generalizations
Based on a non-zero slope for zᵢ < 0: hᵢ = g(z, α)ᵢ = max(0, zᵢ) + αᵢ min(0, zᵢ).
Absolute value rectification: αᵢ = −1, giving g(z) = |z|.
Leaky ReLU: αᵢ = 0.01.
Parametric ReLU (PReLU): treats αᵢ as a learnable parameter.

91 ReLU Generalizations
Maxout units: g(z)ᵢ = max_{j ∈ G(i)} zⱼ, where G(i) is the set of inputs for group i, {(i−1)k+1, …, ik}.
Maxout serves as the activation function and is usually combined with dropout.
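A minimal sketch of the generalized rectifier family from these two slides, plus a maxout over groups of k pre-activations; the test values are illustrative.

```python
import numpy as np

def general_relu(z, alpha):
    """g(z, alpha)_i = max(0, z_i) + alpha_i * min(0, z_i)."""
    return np.maximum(0, z) + alpha * np.minimum(0, z)

z = np.array([-2.0, -0.5, 1.5])
abs_rect = general_relu(z, -1.0)                       # absolute value rectification: |z|
leaky    = general_relu(z, 0.01)                       # leaky ReLU
prelu    = general_relu(z, np.array([0.3, 0.3, 0.3]))  # PReLU: alpha would be learned

def maxout(z, k=2):
    """Maxout: take the max within each consecutive group of k pre-activations."""
    return z.reshape(-1, k).max(axis=1)

print(abs_rect, leaky, prelu)
print(maxout(np.array([1.0, -2.0, 0.5, 3.0])))         # [1. 3.]
```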

92 Architecture Design

93 Architecture Design
How many units, and how should they be connected?
Units are organized into groups called layers, and the layers are arranged in a chain structure. Example: first layer h^(1) = g^(1)(W^(1)ᵀ x + b^(1)); second layer h^(2) = g^(2)(W^(2)ᵀ h^(1) + b^(2)).
Deeper networks are able to use fewer units per layer and fewer parameters, but are often harder to optimize. The architecture is chosen via experimentation guided by monitoring the validation-set error.

94 Universal Approximation Theorem
A feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function can approximate any Borel measurable function from one finite-dimensional space to another to any desired non-zero amount of error, provided the network is given enough hidden units.
So a large MLP will be able to represent any function we are trying to learn, but it is not guaranteed that the training algorithm will be able to learn that function. Two reasons learning can fail: the optimization algorithm may not find the correct parameters, and the training algorithm might choose the wrong function due to overfitting.

95 Visualization Design

96 Computational Graph

97 Symbol-to-Symbol Derivatives

98 Brief Intro to Propagation & Visualization Design

99 Forward Propagation & Back-Propagation
Forward propagation: the input x propagates up through the hidden units at each layer and produces the output ŷ; during training, forward propagation continues onward until it produces a scalar cost J(θ).
Back-propagation ("backprop") allows information from the cost function to flow backwards through the network in order to compute the gradient.

100 Thank you

