
Slides adapted from Geoffrey Hinton, Yann Le Cun, Yoshua Bengio


1 Slides adapted from Geoffrey Hinton, Yann Le Cun, Yoshua Bengio
Deep Learning. Sen Song. Slides adapted from Geoffrey Hinton, Yann Le Cun, Yoshua Bengio, and G. A. Tagliarini.

2 Part 0 Success of Deep Learning

3 Lecture 10: Deep Learning. Sen Song, 2013/12/14.
Adapted from slides by Geoffrey Hinton and Yann LeCun

4

5

6

7

8 The success of deep learning, in hindsight: brain-inspired algorithms discovered long ago + big data + big computers. But the real history was full of twists and turns...

9 The object recognition problem: selectivity vs. invariance
A key computational issue in the ventral ("what") pathway, which supports object recognition, is the selectivity-invariance trade-off: recognition must finely discriminate between different objects or object classes (such as faces) while being tolerant of object transformations (such as scaling, translation, illumination, changes in viewpoint, and clutter) as well as non-rigid transformations (such as variations in shape within a class), as in the change of facial expression when recognizing faces. Serre and Poggio, 2010

10

11 Hierarchy of visual areas
The "what" versus "where" pathways. The "what" pathway alone spans 32 cortical areas with 187 linkages. Felleman and Van Essen, 1991

12 Orientation selectivity of V1 Simple Cell
Hubel and Wiesel, 1968

13 Selectivity emerging: Simple Cell
Hubel and Wiesel, 1962

14 Invariance emerging: Complex Cell
Hubel and Wiesel, 1962

15 The architecture of LeNet-5
Handwritten digit recognition was solved with LeNet-5 quite early on, so why did general object recognition take so much longer? (A minimal architecture sketch follows.)
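The architecture itself appears on the slide only as a figure. Below is a minimal, hedged sketch of a LeNet-5-style network written in PyTorch (an assumption of this transcript, not LeCun's original implementation), following the commonly cited layer sizes: two convolution/subsampling stages followed by three fully connected layers.

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """A LeNet-5-style network: conv/pool, conv/pool, then three dense layers.
    This is a sketch of the commonly cited architecture, not the original code."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 1x32x32 -> 6x28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                  # -> 6x14x14
            nn.Conv2d(6, 16, kernel_size=5),  # -> 16x10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                  # -> 16x5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# Usage: a batch of eight 32x32 grayscale images.
logits = LeNet5()(torch.randn(8, 1, 32, 32))
print(logits.shape)  # torch.Size([8, 10])
```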

16 David Marr

17 Marr's approach
1. Computational theory: what is the goal of the computation, why is it appropriate, and what is the logic of the strategy by which it can be carried out?
2. Representation and algorithm: how can this computational theory be implemented? In particular, what is the representation for the input and output, and what is the algorithm for the transformation?
3. Hardware implementation: how can the representation and algorithm be realized physically?
Marr placed great importance on the first level: "To phrase the matter in another way, an algorithm is likely to be understood more readily by understanding the nature of the problem being solved than by examining the mechanism (and hardware) in which it is embodied."

18 Part I The ups and downs of backpropagation

19 A brief history of backpropagation
The backpropagation algorithm for learning multiple layers of features was invented several times in the 1970s and 1980s: Bryson & Ho (1969, linear), Werbos (1974), Rumelhart et al. in 1981, Parker (1985), LeCun (1985), and Rumelhart et al. (1985). Backpropagation clearly had great promise for learning multiple layers of non-linear feature detectors, but by the late 1990s most serious researchers in machine learning had given up on it. It was still widely used in psychological models and in practical applications such as credit card fraud detection.

20 Why backpropagation failed
The popular explanation of why backpropagation failed in the 1990s: it could not make good use of multiple hidden layers (except in convolutional nets); it did not work well in recurrent networks or deep auto-encoders; and Support Vector Machines worked better, required less expertise, produced repeatable results, and had much fancier theory.
The real reasons it failed: computers were thousands of times too slow; labeled datasets were hundreds of times too small; and deep networks were too small and not initialized sensibly. These issues prevented it from succeeding on the tasks where it would eventually be a big win.

21 A spectrum of machine learning tasks
Typical statistics: low-dimensional data (e.g. fewer than 100 dimensions); lots of noise in the data; not much structure, and what structure there is can be captured by a fairly simple model. The main problem is separating true structure from noise. This regime is not ideal for non-Bayesian neural nets; try an SVM or a Gaussian process.
Artificial intelligence: high-dimensional data (e.g. more than 100 dimensions); the noise is not the main problem; there is a huge amount of structure in the data, but it is too complicated to be represented by a simple model. The main problem is figuring out a way to represent the complicated structure so that it can be learned. Let backpropagation figure it out.

22 Why Support Vector Machines were never a good bet for Artificial Intelligence tasks that need good representations
View 1: SVMs are just a clever reincarnation of perceptrons. They expand the input into a (very large) layer of non-linear, non-adaptive features; they have only one layer of adaptive weights; and they have a very efficient way of fitting those weights that controls overfitting.
View 2: SVMs are just a clever reincarnation of perceptrons. They use each input vector in the training set to define a non-adaptive "pheature": the global match between a test input and that training input. They have a clever way of simultaneously doing feature selection and finding weights on the remaining features. (The decision-function form below makes this view concrete.)
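For clarity (this is not from the slide), View 2 corresponds to the standard kernel SVM decision function, in which each training point x_i contributes one fixed "match" feature K(x, x_i) with a learned weight:

```latex
% Kernel SVM decision function: one non-adaptive "pheature" per training point,
% weighted by the learned coefficients \alpha_i (most of which are zero;
% the non-zero ones mark the support vectors).
f(\mathbf{x}) = \operatorname{sign}\Big( \sum_{i=1}^{N} \alpha_i\, y_i\, K(\mathbf{x}, \mathbf{x}_i) + b \Big),
\qquad \alpha_i \ge 0 .
```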

23 Historical document from AT&T Adaptive Systems Research Dept., Bell Labs

24 Why we need backpropagation
Networks without hidden units are very limited in the input-output mappings they can model. More layers of linear units do not help: the result is still linear (see the one-line check below). Fixed output non-linearities are not enough either. We need multiple layers of adaptive non-linear hidden units; that gives us a universal approximator. But how can we train such nets? We need an efficient way of adapting all the weights, not just the last layer, and this is hard: learning the weights going into hidden units is equivalent to learning features, and nobody tells us directly what the hidden units should do.
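A one-line check of the claim that extra linear layers add nothing (added here for completeness, not from the slide):

```latex
% Two linear layers compose into a single linear map, so depth adds no expressive
% power without non-linearities in between.
\mathbf{y} = W_2 (W_1 \mathbf{x}) = (W_2 W_1)\,\mathbf{x} = W\,\mathbf{x},
\qquad W \equiv W_2 W_1 .
```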

25 What perceptrons cannot do
A binary threshold output unit cannot even tell whether two single-bit numbers are the same. Same: (1,1) → 1; (0,0) → 1. Different: (1,0) → 0; (0,1) → 0. The corresponding set of inequalities is impossible to satisfy (reconstructed below). [Figure: the four points of the data space; the positive cases (0,0) and (1,1) cannot be separated from the negative cases (1,0) and (0,1) by a plane.]
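The inequalities appeared only in the slide figure; for a threshold unit with weights w1, w2 and threshold θ they are:

```latex
% The four training cases impose contradictory constraints on a single threshold unit.
(1,1)\to 1:\; w_1 + w_2 \ge \theta, \qquad
(0,0)\to 1:\; 0 \ge \theta, \qquad
(1,0)\to 0:\; w_1 < \theta, \qquad
(0,1)\to 0:\; w_2 < \theta .
```

Adding the first two gives w1 + w2 ≥ 2θ, while adding the last two gives w1 + w2 < 2θ, a contradiction, so no single threshold unit can solve the task.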

26 XOR Problem

27 Learning by perturbing weights
Randomly perturb one weight and see if doing so improves performance; if so, keep the change. This is very inefficient: we need to do multiple forward passes on a representative set of training data just to change one weight, and towards the end of learning large weight perturbations will nearly always make things worse. We could instead randomly perturb all the weights in parallel and correlate the performance gain with the weight changes, but this is no better, because we need lots of trials to "see" the effect of changing one weight through the noise created by all the others. [Diagram: a network with input units, hidden units, and output units.] Learning the hidden-to-output weights is easy; learning the input-to-hidden weights is hard. (A minimal sketch of the perturbation idea follows.)
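A minimal NumPy sketch of single-weight perturbation on a tiny network, included only to illustrate why the method is so inefficient; the network, data, and step size are all made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny made-up regression problem: 2 inputs -> 3 hidden units -> 1 output.
X = rng.normal(size=(50, 2))
y = np.sin(X[:, :1]) + 0.5 * X[:, 1:]

W1 = rng.normal(scale=0.1, size=(2, 3))
W2 = rng.normal(scale=0.1, size=(3, 1))

def loss():
    hidden = np.tanh(X @ W1)                 # forward pass through the hidden layer
    return np.mean((hidden @ W2 - y) ** 2)

for trial in range(5000):
    # Perturb ONE randomly chosen weight; keep the change only if the loss drops.
    W = W1 if rng.integers(2) == 0 else W2
    idx = tuple(rng.integers(s) for s in W.shape)
    before = loss()
    delta = rng.normal(scale=0.01)
    W[idx] += delta
    if loss() >= before:                     # a full extra pass just to test one weight
        W[idx] -= delta                      # revert: the perturbation made things worse

print(f"loss after 5000 single-weight trials: {loss():.4f}")
```

Each accepted or rejected change costs two full passes over the data, which is exactly the inefficiency the slide points out; backpropagation gets the corresponding information for every weight from one forward and one backward pass.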

28 The idea behind backpropagation
We don't know what the hidden units ought to do, but we can compute how fast the error changes as we change a hidden activity. So instead of using desired activities to train the hidden units, we use error derivatives with respect to the hidden activities. Each hidden activity can affect many output units and can therefore have many separate effects on the error; these effects must be combined. We can compute the error derivatives for all the hidden units efficiently, and once we have them, it is easy to get the error derivatives for the weights going into a hidden unit. (The key recurrences are written out below.)
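The derivation itself is not in the transcript (slides 31-40 appear to have been images); the standard recurrences for logistic units are reconstructed here for reference:

```latex
% Backpropagating error derivatives through one layer of logistic units.
% E is the error, y_j the output of unit j, z_j its total input, w_{ij} the weight from i to j.
\frac{\partial E}{\partial z_j} = \frac{\partial E}{\partial y_j}\,\frac{dy_j}{dz_j}
                               = \frac{\partial E}{\partial y_j}\, y_j (1 - y_j), \qquad
\frac{\partial E}{\partial y_i} = \sum_{j} w_{ij}\, \frac{\partial E}{\partial z_j}, \qquad
\frac{\partial E}{\partial w_{ij}} = y_i\, \frac{\partial E}{\partial z_j}.
```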

29 Non-linear neurons with smooth derivatives
For backpropagation, we need neurons whose output is a smooth, well-behaved function of their inputs and weights; typically the logistic function is used. [Figure: the logistic curve, rising from 0 through 0.5 to 1.]
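The formula appeared only as an image on the slide; the standard logistic unit and the derivative backpropagation needs are:

```latex
% Logistic (sigmoid) unit: total input z, output y, and the derivative used by backprop.
z = b + \sum_i x_i w_i, \qquad
y = \frac{1}{1 + e^{-z}}, \qquad
\frac{dy}{dz} = y\,(1 - y).
```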

30 Basic Neuron Model In A Feedforward Network
Inputs x_i arrive through pre-synaptic connections; synaptic efficacy is modeled by real-valued weights w_i; and the response of the neuron is a nonlinear function f of its weighted inputs. Copyright G. A. Tagliarini, PhD
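A direct translation of this neuron model into NumPy (illustrative only; the logistic squashing function is chosen to match slide 29):

```python
import numpy as np

def neuron(x, w, b):
    """Basic feedforward neuron: a weighted sum of inputs passed through a nonlinearity f."""
    z = b + np.dot(w, x)             # total input: bias plus weighted pre-synaptic inputs
    return 1.0 / (1.0 + np.exp(-z))  # f = logistic squashing function

# Example: three inputs with arbitrary weights and bias.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8, 0.2, -0.4])
print(neuron(x, w, b=0.1))
```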

31

32

33

34

35

36

37

38

39

40

41 Feedforward Network Training by Backpropagation: Process Summary
Select an architecture. Randomly initialize the weights. While the error is too large: (1) select a training pattern and feed it forward to find the actual network output; (2) calculate the errors and backpropagate the error signals; (3) adjust the weights. Finally, evaluate performance using the test set. Copyright G. A. Tagliarini, PhD. (A compact end-to-end sketch of this loop follows.)
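A compact NumPy sketch of the whole process on the XOR problem from slide 26. The architecture, learning rate, and iteration count are illustrative choices, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

# XOR: the task a single threshold unit cannot solve (slides 25-26).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Select an architecture (2 inputs, 4 hidden units, 1 output); randomly initialize weights.
W1, b1 = rng.normal(scale=1.0, size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(scale=1.0, size=(4, 1)), np.zeros(1)
lr = 1.0

for epoch in range(10000):
    # Feed forward to find the actual network output.
    h = sigmoid(X @ W1 + b1)
    y = sigmoid(h @ W2 + b2)

    # Calculate errors and backpropagate the error signals (squared-error loss).
    dE_dz2 = (y - T) * y * (1 - y)   # error derivative at the output units
    dE_dh = dE_dz2 @ W2.T            # combine the effects on all output units
    dE_dz1 = dE_dh * h * (1 - h)     # error derivative at the hidden units

    # Adjust the weights.
    W2 -= lr * h.T @ dE_dz2
    b2 -= lr * dE_dz2.sum(axis=0)
    W1 -= lr * X.T @ dE_dz1
    b1 -= lr * dE_dz1.sum(axis=0)

# Evaluate: outputs typically approach [0, 1, 1, 0] (exact values depend on the seed).
print(np.round(sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2), 2))
```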

