Lecture 8: Why deep? We explain why deep learning works from two aspects:
1. Experimental evidence
2. Theoretical proof
1. Experiments show deeper is better
Layer x Size    Word Error Rate (%)
1 x 2k          24.2
2 x 2k          20.4
3 x 2k          18.4
4 x 2k          17.8
5 x 2k          17.2
7 x 2k          17.1

1 x 3772        22.5
1 x 4634        22.6
1 x 16k         22.1

Not surprising? Going deeper adds parameters, and more parameters could simply mean better performance. Seide, Frank, Gang Li, and Dong Yu, "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks," Interspeech 2011.
Fat + Short vs. Thin + Tall
With the same number of parameters, which one is better: a shallow (fat + short) network or a deep (thin + tall) network?
Fat + Short vs. Thin + Tall
The table answers it: with comparable numbers of parameters, deep wins. For example, 5 x 2k reaches 17.2% word error rate versus 22.5% for 1 x 3772, and 7 x 2k reaches 17.1% versus 22.6% for 1 x 4634 (Seide, Li, and Yu, Interspeech 2011). Why?
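As a quick sanity check on "the same number of parameters", here is a minimal sketch (mine, not from the slides) that counts weights and biases for a deep versus a wide shallow fully connected network. The input and output sizes are illustrative assumptions, not the actual dimensions used by Seide et al.

```python
def n_params(layer_sizes):
    """Weights plus biases of a fully connected net with these layer sizes."""
    return sum(a * b + b for a, b in zip(layer_sizes, layer_sizes[1:]))

n_in, n_out = 429, 9304                    # assumed input/output sizes, for illustration only

deep    = [n_in] + [2000] * 7 + [n_out]    # 7 x 2k hidden layers
shallow = [n_in, 4634, n_out]              # 1 x 4634 hidden layer

print(f"deep:    {n_params(deep):,}")      # about 43 million
print(f"shallow: {n_params(shallow):,}")   # about 45 million
```

With comparable parameter budgets, the deep network is the one with the much lower error rate, so parameter count alone does not explain the gap.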
A common explanation is called “modularization.”
Consider training one classifier per class directly from the image: Classifier 1 (girls with long hair), Classifier 2 (boys with long hair), Classifier 3 (girls with short hair), Classifier 4 (boys with short hair). Classes that are lacking data, such as boys with long hair, end up with weak classifiers.
Modularization: an intuitive example

Instead of learning the four classes directly, first train basic classifiers for the attributes: boy or girl? long or short hair? Each basic classifier can have sufficient training examples, because all of the images can be used to learn each attribute.
Modularization or deeper reasons?
The basic classifiers (boy or girl? long or short hair?) are shared by the following classifiers as modules. Because the shared modules do most of the work, each final classifier (girls with long hair, boys with long hair, girls with short hair, boys with short hair) can be trained with little data and still do fine.
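A minimal PyTorch sketch of this sharing (my own illustration, not code from the lecture; the module names and sizes are made up): two attribute modules are shared, and each of the four final classes is served by a tiny head on top of them.

```python
import torch
import torch.nn as nn

class SharedAttributes(nn.Module):
    """Two basic classifiers shared as modules: boy/girl and long/short hair."""
    def __init__(self, in_dim=256, hidden=64):
        super().__init__()
        self.gender = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))
        self.hair   = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, x):
        # Every image has a gender and a hair length, so both attribute
        # modules see plenty of training data.
        return torch.cat([self.gender(x), self.hair(x)], dim=-1)

class FourClassModel(nn.Module):
    """Final classifiers for {girl, boy} x {long, short} hair, built on the shared modules."""
    def __init__(self):
        super().__init__()
        self.shared = SharedAttributes()
        self.heads = nn.Linear(4, 4)    # tiny heads: even a rare class needs little data

    def forward(self, x):
        return self.heads(self.shared(x))

model = FourClassModel()
print(model(torch.randn(8, 256)).shape)   # torch.Size([8, 4]): logits for the four classes
```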
Hidden nodes, modularization, features
→ Less training data? The modularization is learned automatically: the hidden nodes are the modules. The 1st layer learns the most basic classifiers; the 2nd layer uses the 1st layer as modules to build more complex classifiers; later layers build on those in turn. People say AI = deep learning + big data, as if big data is what calls for deep models. But deep is not only for big data: because the learned modules are shared and reused, deep learning also works on small data sets. It needs less data.
Image understanding: levels of features

In a network trained on images, the 1st layer learns the most basic classifiers (simple local features); the 2nd layer uses the 1st layer as modules; higher layers act as modules for objects. Because the modules are reused, less training data is needed. Reference: Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014.
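In the same spirit, a minimal sketch (my illustration, with made-up layer sizes) of reusing early convolutional layers as fixed modules and training only a small head, which is how a new task can get by with less data:

```python
import torch
import torch.nn as nn

backbone = nn.Sequential(                          # stands in for an already-trained network
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),     # 1st layer: basic local features
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),    # 2nd layer: combinations of them
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
)
for p in backbone.parameters():
    p.requires_grad = False                        # keep the shared modules fixed

head = nn.Linear(32, 5)                            # only this small part is trained on the new task
model = nn.Sequential(backbone, head)

x = torch.randn(4, 3, 32, 32)                      # a tiny fake batch of images
print(model(x).shape)                              # torch.Size([4, 5])
```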
2. Theoretical proof: Deeper is better

M. Telgarsky, "Benefits of depth in neural networks," 2016. We give an informal argument of Telgarsky's proof.

Claim 1. A function with few oscillations cannot fit a function with many oscillations. Proof by picture: draw a function with few oscillations on top of one with many oscillations; stars mark the regions where they disagree, and there are many of them.
Claim 2. ReLU networks can make exponentially many oscillations.
Let ReLU(x) := max{0, x}, and let h(x) := ReLU(ReLU(2x) − ReLU(4x − 2)). Then

    h(x) = 2x         for x in [0, 1/2]
    h(x) = 2(1 − x)   for x in [1/2, 1]
    h(x) = 0          otherwise

so the graph of h is a triangle through (0,0), (1/2,1), (1,0). Write h^k for the k-fold composition h(h(…h(x)…)). Then h has 1 peak and h^k has 2^(k−1) peaks.
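A quick numerical check of Claim 2 (a sketch of mine, not part of the slides): sample h on a fine grid over [0, 1], compose it k times, and count the local maxima.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def h(x):
    # h(x) = ReLU(ReLU(2x) - ReLU(4x - 2)): a triangle peaking at (1/2, 1)
    return relu(relu(2 * x) - relu(4 * x - 2))

def count_peaks(y):
    """Count strict local maxima of a sampled function."""
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

x = np.linspace(0.0, 1.0, 200001)
y = x
for k in range(1, 6):
    y = h(y)                                        # y now holds h^k(x), evaluated pointwise
    print(f"h^{k} has {count_peaks(y)} peaks")      # 1, 2, 4, 8, 16
```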
Claim 3. Few layers implies few oscillations.
Say a function is s-affine if it is piecewise linear with s pieces, and suppose f is s-affine and g is t-affine. Then

    f + g is at most (s + t − 1)-affine   (adding nodes in the same layer)
    g ∘ f is at most (s·t)-affine         (composition, i.e. the next layer)

ReLU is 2-affine, so composing k layers can produce an exp(k)-affine function, while adding nodes within a fixed number of layers only grows the number of pieces additively. Hence with O(1) layers, one needs exponentially many nodes to approximate what k layers can do.
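For a concrete count using these two rules: a single hidden layer of m ReLU units outputs a weighted sum of m 2-affine functions, which is at most (m + 1)-affine. The function h^k from Claim 2 has 2^(k−1) peaks and hence at least 2^k linear pieces, so a one-hidden-layer network matching it needs m ≥ 2^k − 1 units, exponential in k, whereas the deep construction uses only about 3k ReLU units.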
What does this mean? There exists a function that can be represented by a “deep” neural network with a polynomial number of nodes but that needs exponentially many nodes for any “shallow” neural network. Open question: this only says that such a function exists; it does not tell us what the function is, and it might be a function we do not care about.