
Slide 1: Today's Topics
CS 540 - Fall 2015 (Shavlik©), Lecture 20, Week 10 (11/10/15)
  - Read: Chapters 7, 8, and 9 on Logical Representation and Reasoning
  - HW3 due at 11:55pm Thursday (ditto for your Nannon Tourney entry)
  - Recipe for using Backprop to train an ANN
  - Adjusting the learning rate (η)
  - The momentum term (β)
  - Reducing overfitting in ANNs: early stopping, weight decay
  - Understanding hidden units
  - Choosing the number of hidden units
  - ANNs as universal approximators
  - Learning what an ANN has learned

Slide 2: Using BP to Train ANNs
  1. Initialize weights and biases to small random values (e.g., in [-0.3, 0.3])
  2. Randomize the order of the training examples; for each example, do:
     a) Propagate activity forward to the output units:
        out_i = F( Σ_j w_i,j × out_j )
[Diagram: units in successive layers labeled k, j, i, with activity propagating forward from k to j to i]

Slide 3: Using BP to Train ANNs (continued)
     b) Compute the 'deviation' for the output units:
        δ_i = F′(net_i) × (Teacher_i − out_i)
     c) Compute the 'deviation' for the hidden units:
        δ_j = F′(net_j) × ( Σ_i w_i,j × δ_i )
     d) Update the weights:
        Δw_i,j = η × δ_i × out_j
        Δw_j,k = η × δ_j × out_k
     where F′(net_i) ≡ ∂F(net_i) / ∂net_i
  Aside: the book (Fig 18.24) uses Δ instead of δ, g instead of F, and α instead of η.

Slide 4: Using BP to Train ANNs (concluded)
  3. Repeat until the training-set error rate is small enough
     (actually, one should use early stopping, i.e., minimize error on the tuning set; more details later)
     Some jargon: each cycle through all of the training examples is called an epoch
  4. Measure accuracy on the test set to estimate generalization (future accuracy)
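The recipe on slides 2-4 fits in a short sketch. This is a minimal NumPy version for one hidden layer of sigmoid units; the array shapes (X as examples-by-features, y as examples-by-outputs), layer sizes, learning rate, and epoch count are illustrative assumptions, and biases are omitted for brevity.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_bp(X, y, n_hidden=5, eta=0.1, epochs=100, seed=0):
    """Minimal backprop sketch for a one-hidden-layer network of sigmoid units."""
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], y.shape[1]
    # Step 1: small random initial weights (this also breaks symmetry; see slide 5)
    W1 = rng.uniform(-0.3, 0.3, (n_hidden, n_in))
    W2 = rng.uniform(-0.3, 0.3, (n_out, n_hidden))
    for epoch in range(epochs):               # one pass over all examples = one epoch
        for idx in rng.permutation(len(X)):   # Step 2: randomize example order
            x, t = X[idx], y[idx]
            # (a) forward pass
            hid = sigmoid(W1 @ x)
            out = sigmoid(W2 @ hid)
            # (b) 'deviation' for output units: delta_i = F'(net_i) * (Teacher_i - out_i)
            delta_out = out * (1 - out) * (t - out)
            # (c) 'deviation' for hidden units: delta_j = F'(net_j) * sum_i w_i,j * delta_i
            delta_hid = hid * (1 - hid) * (W2.T @ delta_out)
            # (d) weight updates: delta_w = eta * delta * upstream activation
            W2 += eta * np.outer(delta_out, hid)
            W1 += eta * np.outer(delta_hid, x)
    return W1, W2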

Slide 5: The Need for Symmetry Breaking (if there are HUs)
  - Assume all weights are initially the same (the drawing below is a bit more general)
  - Can the corresponding (mirror-image) weights ever differ? WHY?
    NO, by symmetry: the two HUs sit in identical environments
  - Solution: randomize the initial weights (in, say, [-0.3, 0.3])
[Diagram: a network with two hidden units whose incoming and outgoing weights are all initialized identically]
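One way to see the symmetry argument concretely is to run a single backprop step with every weight set to the same constant: both hidden units compute identical activations, receive identical deltas, and therefore get identical updates, so their weights can never diverge. A small demonstration, using the same delta formulas as the backprop sketch above (the specific numbers are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, 0.7])
t = np.array([1.0])
W1 = np.full((2, 2), 0.1)        # two hidden units, identical incoming weights
W2 = np.full((1, 2), 0.1)        # and identical outgoing weights

hid = sigmoid(W1 @ x)            # both hidden activations are equal
out = sigmoid(W2 @ hid)
delta_out = out * (1 - out) * (t - out)
delta_hid = hid * (1 - hid) * (W2.T @ delta_out)

# The two rows below (the updates for the two hidden units) are identical,
# so the weights stay equal forever: the network behaves as if it had one HU.
print(np.outer(delta_hid, x))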

Slide 6: Choosing η (the 'learning rate')
  - Gradient descent moves along −∂E/∂w_i,j in weight space
  - η too large: the error can overshoot the minimum or oscillate
  - η too small: the error decreases very slowly
[Plot: error as a function of weight space, showing the gradient −∂E/∂w_i,j and the effect of steps that are too large vs. too small]

Slide 7: Adjusting η On-the-Fly
  0. Let η = 0.25
  1. Measure the average error over k examples; call this E_before
  2. Adjust the weights according to the learning algorithm being used
  3. Measure the average error on the same k examples; call this E_after

Slide 8: Adjusting η (continued)
  4. If E_after > E_before then η ← η × 0.99, else η ← η × 1.01
  5. Go to step 1
  Note: k can be all of the training examples, but it could be a subset
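A sketch of the adapt-η loop from slides 7-8. The helpers measure_error and apply_learning_step are placeholders (assumptions, not from the slides) for whatever error measure and update rule, e.g. the backprop step above, are actually in use.

def adapt_eta(examples, measure_error, apply_learning_step,
              eta=0.25, k=32, steps=1000):
    """Shrink eta by 1% when the error got worse, grow it by 1% when it improved."""
    for _ in range(steps):
        batch = examples[:k]                 # k examples (could be all of them, or a random subset)
        e_before = measure_error(batch)      # step 1
        apply_learning_step(batch, eta)      # step 2: one update with the current eta
        e_after = measure_error(batch)       # step 3
        if e_after > e_before:               # step 4
            eta *= 0.99
        else:
            eta *= 1.01
    return eta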

Slide 9: Including a 'Momentum' Term in Backprop
  - To speed up convergence, another term is often added to the weight-update rule:
    the previous change in that weight, scaled by β
  - Typically 0 < β < 1; 0.9 is a common choice
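The update rule itself appeared as an image in the original deck; the standard form it describes is Δw(t) = η × δ × out + β × Δw(t−1). A minimal sketch, assuming the same δ-times-activation gradient terms as the backprop sketch above:

import numpy as np

def momentum_update(W, grad, prev_dW, eta=0.1, beta=0.9):
    """delta_w(t) = eta * grad + beta * delta_w(t-1); beta of about 0.9 is common."""
    dW = eta * grad + beta * prev_dW    # add a fraction of the previous weight change
    return W + dW, dW                   # return the new weights and the change to remember

# Usage (grad would be, e.g., np.outer(delta_out, hid) from the earlier sketch):
# W2, prev_dW2 = momentum_update(W2, np.outer(delta_out, hid), prev_dW2)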

Slide 10: Overfitting Reduction Approach #1: Using Tuning Sets (Known as 'Early Stopping')
[Plot: error (train, tune, and test curves) vs. training epochs; the chosen ANN is the one at the epoch where tuning-set error bottoms out, close to the ideal ANN to choose]
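A minimal early-stopping sketch matching the plot on slide 10: train for many epochs, measure tuning-set error after each, and return the weights from the epoch with the lowest tuning error. The helpers train_one_epoch and error_on are placeholder names, not part of the original slides.

import copy

def train_with_early_stopping(net, train_set, tune_set,
                              train_one_epoch, error_on, max_epochs=500):
    """Return the network that minimized tuning-set (not training-set) error."""
    best_net, best_err = copy.deepcopy(net), error_on(net, tune_set)
    for epoch in range(max_epochs):
        train_one_epoch(net, train_set)      # one backprop pass over the training set
        err = error_on(net, tune_set)        # error on the held-out tuning set
        if err < best_err:                   # remember the best model seen so far
            best_net, best_err = copy.deepcopy(net), err
    return best_net                          # the 'chosen ANN' from the plot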

Slide 11: Overfitting Reduction Approach #2: Minimizing a Cost Function
  - Cost = Error Rate + λ × Network Complexity
  - Essentially what SVMs do (covered later)
  - Need to tune the parameter λ (so a tuning set is still used)

Slide 12: Overfitting Reduction: Weight Decay (Hinton '86)
  - Weights decay toward zero on every update
  - Empirically improves generalization
  - So ...
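Slides 11 and 12 are two views of the same idea. The concrete formulas were images in the original deck, so the choice below is an assumption: take Cost(w) = Error(w) + λ × Σ w², whose penalty gradient pulls every weight a little toward zero on each step, which is exactly weight decay.

def update_with_weight_decay(W, grad, eta=0.1, lam=1e-4):
    """One gradient step on the penalized cost.
    grad is the usual error-reducing term (e.g., delta * activation from backprop);
    lam is the lambda of slide 11 and must be tuned on a tuning set."""
    return W + eta * grad - eta * lam * W    # the decay term shrinks W toward zero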

Slide 13: Four Views of What Hidden Units Do (Not Necessarily Disjoint)
  1. Transform the input space into a new space where perceptrons suffice (relates to SVMs)
  2. Probabilistically represent 'hidden features': constructive induction, predicate invention, learning representations, etc. (construct new features out of those given; 'kernels' do this in SVMs)
[Diagram: a perceptron]

Slide 14: Four Views of What Hidden Units Do (Not Necessarily Disjoint)
  3. Divide feature space into many subregions
  4. Provide a set of basis functions that can be linearly combined (relates to SVMs)
[Plot: positive and negative examples in feature space, carved into subregions]

Slide 15: How Many Hidden Units?
Traditional approach (the 'conventional view', but no longer recommended):
  - Historically, one hidden layer is used; how many units should it contain?
    Too few: can't learn. Too many: poor generalization.
  - Use a tuning set or cross-validation to select the number of hidden units (a sketch follows below)
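The selection procedure on slide 15 is just a loop over candidate sizes, scored on a tuning set. A sketch with hypothetical helpers train_ann and error_on (the candidate sizes are arbitrary):

def choose_num_hidden_units(train_set, tune_set, train_ann, error_on,
                            candidates=(1, 2, 5, 10, 20, 50)):
    """Pick the hidden-layer size whose trained network has the lowest tuning-set error."""
    scored = []
    for h in candidates:
        net = train_ann(train_set, n_hidden=h)      # train one ANN per candidate size
        scored.append((error_on(net, tune_set), h)) # score it on the tuning set
    return min(scored)[1]                           # size with the smallest tuning error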

Slide 16: Can One Ever Have Too Many Hidden Units?
  - Evidence (Weigend, Caruana) suggests that if 'early stopping' is used, generalization does not degrade as the number of hidden units → ∞
  - I.e., use a tuning set to detect overfitting (recall the 'early stopping' slide)
  - Weigend gives an explanation in terms of the 'effective number' of HUs (an analysis based on principal components and eigenvectors)

Slide 17: ANNs as 'Universal Approximators'
  - Boolean functions: need one layer of hidden units to represent exactly
  - Continuous functions: approximation to arbitrarily small error with one (possibly quite 'wide') layer of hidden units
  - Arbitrary functions: any function can be approximated to arbitrary precision with two layers of hidden units
  But note: what can be REPRESENTED is different from what can be 'easily' LEARNED

Slide 18: Looking for Specific Boolean Inputs (e.g., Memorize the POS Examples)
  - Give each hidden unit a step activation function and weights/bias chosen so that it fires only on one specific positive example (the slide's picture uses −∞ weights and a bias of 0.99); the output unit then becomes an 'or' of all the positive examples
  - Hence, with enough hidden units an ANN can 'memorize' the training data
  - But what about generalization?
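The construction on slide 18 can be sketched directly: one step-function hidden unit per positive example that fires only on that exact bit pattern, and an output unit that ORs the hidden units. The exact weights and bias on the slide were in an image, so the values below (+1/−1 weights and a threshold just under the number of matching bits) are one standard way to realize the same idea, not necessarily the slide's.

import numpy as np

def memorizing_net(positive_examples):
    """Build one step-function 'detector' unit per positive boolean example."""
    pos = np.asarray(positive_examples, dtype=float)      # shape (num_pos, n_bits)
    W_hid = np.where(pos == 1, 1.0, -1.0)                 # +1 on 1-bits, -1 on 0-bits
    b_hid = -(pos.sum(axis=1) - 0.5)                      # each unit fires only on an exact match
    def predict(x):
        hid = (W_hid @ np.asarray(x, dtype=float) + b_hid) > 0   # step functions
        return float(hid.any())                           # output unit = 'or' of the hidden units
    return predict

net = memorizing_net([[1, 0, 1], [0, 1, 1]])
print(net([1, 0, 1]), net([1, 1, 1]))   # 1.0 (a memorized positive) vs 0.0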

Slide 19: Understanding What a Trained ANN Has Learned
  - Human 'readability' of trained ANNs is challenging
  - Rule extraction (Craven & Shavlik, 1996): roughly speaking, train ID3 to learn the I/O behavior of the neural network
  - Note: we can generate as many labeled training examples as desired by forward-propagating inputs through the trained ANN (which could be an ENSEMBLE of models)!
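A minimal sketch of the core idea only (not the actual algorithm of Craven & Shavlik): sample as many inputs as desired, label them with the trained network or ensemble, and fit a decision tree to that synthetic dataset. Here scikit-learn's DecisionTreeClassifier stands in for ID3, and ann_predict is a placeholder for the trained model's labeling function; both substitutions are assumptions for illustration.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def extract_tree(ann_predict, n_features, n_samples=10000, seed=0):
    """Approximate a trained ANN with a small, human-readable decision tree."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0, 1, (n_samples, n_features))   # as many synthetic inputs as we want
    y = ann_predict(X)                               # class labels come from the ANN itself
    tree = DecisionTreeClassifier(max_depth=5)       # shallow tree -> readable rules
    tree.fit(X, y)
    return tree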

