1 CS540 - Fall 2016 (Shavlik©), Lecture 17, Week 10
Today's Topics
HW4 out (due in two weeks, some Java)
Artificial Neural Networks (ANNs)
  Perceptrons (1950s)
  Hidden Units and Backpropagation (1980s)
  Deep Neural Networks (2010s)
  ??? (2040s [note the pattern])
This Lecture: The Big Picture & Forward Propagation
Next Lecture: Learning Network Weights

2 Should you? (Slide I used in CS 760 for 20+ years)
‘Fenwick here is biding his time waiting for neural networks’

3 Recall: Supervised ML Systems Differ in How They Represent Concepts
[Figure: training examples feed into learners with different concept representations - Backpropagation (a neural network), ID3/CART (a decision tree), and FOIL/ILP (logical rules such as X ∧ Y → Z)]

4 Advantages of Artificial Neural Networks
Provide best predictive accuracy for many problems
Can represent a rich class of concepts ('universal approximators')
[Figure: time-series data with regions labeled Positive, Negative, Positive]

5 A Brief Overview of ANNs
[Network diagram with labels: input units, hidden units, output units, weights, error, and a recurrent link]

6 Recurrent ANNs (Advanced topic: LSTM models, Schmidhuber group)
[Network diagram: state units (ie, memory) feed back into the network]

7 Representing Features in ANNs (and SVMs) - we need NUMERIC values
Nominal feature, f = {a, b, c}: use a '1 of N' rep - one input unit each for f=a, f=b, f=c
Hierarchical feature (values a, b, c, d, e, g arranged in a tree): again one input unit per value, eg the unit for f=e is set to 1
Linear/ordered feature, f in [a, b] - typical approaches (others possible):
  Approach I (use 1 input unit): f = (value - a) / (b - a)
  Approach II: Thermometer rep (next slide)

8 More on Encoding Datasets
Thermometer representation: f is an element of {a, b, c}, ie f is ordered
  f = a → 1 0 0
  f = b → 1 1 0
  f = c → 1 1 1
(could also discretize continuous features this way)
Output representation
  For N categories use a 1-of-N representation
    Category 1 → 1 0 0
    Category 2 → 0 1 0
    Category 3 → 0 0 1
  For Boolean functions use either 1 or 2 output units
  Normalize real-valued outputs to [0, 1]
  Could also use an error-correcting code (but we won't cover that)
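To make these encodings concrete, here is a minimal Java sketch (not from the lecture; the class name, method names, and example values are mine) of the 1-of-N rep, the thermometer rep, and Approach I's rescaling from the previous slide:

// A minimal sketch of the encodings described above; value lists and
// method names are illustrative, not from the lecture.
import java.util.Arrays;
import java.util.List;

public class FeatureEncoding {

    // 1-of-N: one input unit per possible value; only the observed value is 1.
    static double[] oneOfN(String value, List<String> possibleValues) {
        double[] units = new double[possibleValues.size()];
        units[possibleValues.indexOf(value)] = 1.0;
        return units;
    }

    // Thermometer: for an ORDERED feature, turn on the unit for the observed
    // value and every unit below it (a -> 100, b -> 110, c -> 111).
    static double[] thermometer(String value, List<String> orderedValues) {
        double[] units = new double[orderedValues.size()];
        int position = orderedValues.indexOf(value);
        for (int i = 0; i <= position; i++) units[i] = 1.0;
        return units;
    }

    // Approach I from the previous slide: scale an ordered value in [a, b] to [0, 1].
    static double scaled(double value, double a, double b) {
        return (value - a) / (b - a);
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("a", "b", "c");
        System.out.println(Arrays.toString(oneOfN("b", values)));      // [0.0, 1.0, 0.0]
        System.out.println(Arrays.toString(thermometer("b", values))); // [1.0, 1.0, 0.0]
        System.out.println(scaled(7.0, 5.0, 10.0));                    // 0.4
    }
}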

9 Connectionism History
PERCEPTRONS (Rosenblatt 1957)
  no hidden units
  earliest work in machine learning; died out in the 1960s (due to the Minsky & Papert book)
Output_i = F(W_ij × output_j + W_ik × output_k + W_il × output_l)
[Figure: unit I receives links with weights w_ij, w_ik, w_il from units J, K, L]
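A minimal Java sketch of the computation on this slide, output_i = F(sum over j of w_ij × output_j), using a simple step (threshold) activation; the weights, inputs, and threshold here are made-up illustrative values:

// Perceptron output: weighted sum of the inputs, passed through a threshold.
public class Perceptron {

    // Step activation: fire (1) if the weighted sum clears the threshold.
    static double stepActivation(double weightedSum, double threshold) {
        return weightedSum > threshold ? 1.0 : 0.0;
    }

    static double perceptronOutput(double[] weights, double[] inputs, double threshold) {
        double weightedSum = 0.0;
        for (int j = 0; j < weights.length; j++) {
            weightedSum += weights[j] * inputs[j];
        }
        return stepActivation(weightedSum, threshold);
    }

    public static void main(String[] args) {
        double[] weights = {0.5, -0.5, 1.0};   // w_ij, w_ik, w_il (hypothetical)
        double[] inputs  = {1.0,  1.0, 1.0};   // outputs of units J, K, L
        System.out.println(perceptronOutput(weights, inputs, 0.75)); // 1.0
    }
}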

10 Connectionism (cont.)
Backpropagation algorithm overcame the perceptron's weakness
  Major reason for renewed excitement in the 1980s
'Hidden units' important
  Fundamental extension to perceptrons
  Can generate new features ('constructive induction', 'predicate invention', 'learning representations', 'derived features')

11 Deep Neural Networks
Old: backprop algo does not work well for more than one layer of hidden units ('gradient gets too diffuse')
New: with a lot of training data, deep (several layers of hidden units) neural networks exceed prior state-of-the-art results
Unassigned, but FYI:

12 Sample Deep Neural Network
[Figure: a network with several layers of hidden units]

13 A Deeper Network
Old design: fully connect each input node to each HU (only one HU layer), then fully connect each HU to each output node
We'll cover CONVOLUTION and POOLING later

14 Digit Recognition: Influential ANN Testbed
Deep Networks (Schmidhuber, 2012)
  One YEAR of training on a single CPU
  One WEEK of training on a single GPU that performed 10^9 wgt updates/sec
  0.2% error rate (old record was 0.4%)
More info on datasets and results at
Comparison on this testbed:
  Perceptron: % error (7.6% with feature engineering)
  k-NN: % (0.63%)
  Ensemble of d-trees: 1.5%
  SVMs: % (0.56%)
  One layer of HUs: % (0.4%; feature engr + ensemble of 25 ANNs)

15 Activation Units: Map Weighted Sum to Scalar
Individual unit's computation: output_i = F(Σ_j weight_i,j × output_j)
Typically F(input_i) = 1 / (1 + e^-(input_i - bias_i))
Called the 'sigmoid' or 'logistic' (hyperbolic tangent also used)
Piecewise-linear (and Gaussian) nodes can also be used
[Figure: sigmoid curve of output vs input, shifted by the bias]
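A minimal Java sketch of this sigmoid unit: a weighted sum of the incoming activations, squashed by the logistic function and shifted by a per-unit bias. The weights, inputs, and bias are made-up values for illustration:

// One sigmoid unit: weighted sum, then logistic squashing shifted by a bias.
public class SigmoidUnit {

    static double sigmoid(double input, double bias) {
        return 1.0 / (1.0 + Math.exp(-(input - bias)));
    }

    static double unitOutput(double[] weights, double[] inputsFromBelow, double bias) {
        double weightedSum = 0.0;
        for (int j = 0; j < weights.length; j++) {
            weightedSum += weights[j] * inputsFromBelow[j];
        }
        return sigmoid(weightedSum, bias);
    }

    public static void main(String[] args) {
        double[] weights = {2.0, -1.0};
        double[] inputs  = {1.0,  0.5};
        System.out.println(unitOutput(weights, inputs, 0.0)); // about 0.82
    }
}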

16 Rectified Linear Units (ReLUs) (Nair & Hinton, 2010)
Used for HUs: F(wgt'edSum) = max(0, wgt'edSum)
Use 'pure' linear units for outputs, ie F(wgt'edSum) = wgt'edSum
Argued to be more biologically plausible
Used in 'deep networks'
[Figure: ReLU activation - output is 0 below the bias, then rises linearly with the input]

17 Sample ANN Calculation
('Forward propagation', ie, reasoning with weights learned by backprop)
Assume bias = 0 for all nodes for simplicity, and use ReLUs
[Figure: a small worked example - INPUT values 3 and 2 feed a network whose link weights include 3, 4, -2, -1, 1, -8, -7, 9, and 5; trace the weighted sums forward to the OUTPUT]
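The slide's exact network did not survive the transcript, so here is a hypothetical stand-in in Java that shows the same forward-propagation procedure: the two inputs 3 and 2 (as on the slide), one layer of ReLU hidden units with zero bias, and a 'pure' linear output unit as slide 16 recommends. All weight values below are invented for illustration:

// Forward propagation through a tiny ReLU network with invented weights.
public class ForwardPropagation {

    static double relu(double weightedSum) {
        return Math.max(0.0, weightedSum);
    }

    // Weighted sum of one unit's inputs (bias assumed 0, as on the slide).
    static double weightedSum(double[] weights, double[] inputs) {
        double sum = 0.0;
        for (int j = 0; j < weights.length; j++) sum += weights[j] * inputs[j];
        return sum;
    }

    public static void main(String[] args) {
        double[] inputs = {3.0, 2.0};

        // Hidden layer: two ReLU units, each with its own weight vector (invented values).
        double[][] hiddenWeights = {{ 3.0, -2.0},
                                    {-1.0,  4.0}};
        double[] hiddenOutputs = new double[hiddenWeights.length];
        for (int h = 0; h < hiddenWeights.length; h++) {
            hiddenOutputs[h] = relu(weightedSum(hiddenWeights[h], inputs));
        }

        // Output layer: a single linear unit (no squashing), again invented weights.
        double[] outputWeights = {1.0, -0.5};
        double output = weightedSum(outputWeights, hiddenOutputs);

        // Hidden: relu(3*3 + -2*2) = 5 and relu(-1*3 + 4*2) = 5; output = 1*5 + -0.5*5 = 2.5
        System.out.println(output);
    }
}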

18 Perceptron Convergence Theorem (Rosenblatt, 1957)
Perceptron ≡ no hidden units
If a set of examples is learnable, the DELTA rule will eventually find the necessary weights
However, a perceptron can only learn/represent linearly separable datasets
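As a sketch of what the DELTA rule does, here is the standard perceptron-style weight update, w_j += learningRate × (target − output) × x_j, applied in Java to a linearly separable dataset (Boolean AND). The dataset, learning rate, and epoch count are my own illustrative choices, not from the lecture:

// Delta-rule training of a perceptron (no hidden units) on Boolean AND.
public class DeltaRule {

    // Threshold unit with a bias weight on a constant input of 1.
    static double output(double[] weights, double[] inputs) {
        double sum = weights[0];               // bias weight
        for (int j = 0; j < inputs.length; j++) sum += weights[j + 1] * inputs[j];
        return sum > 0 ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        double[][] examples = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
        double[] targets    = { 0,      0,      0,      1    };   // Boolean AND
        double[] weights    = new double[3];                      // bias, w1, w2
        double learningRate = 0.1;

        for (int epoch = 0; epoch < 50; epoch++) {
            for (int i = 0; i < examples.length; i++) {
                double error = targets[i] - output(weights, examples[i]);
                weights[0] += learningRate * error;                // bias update
                for (int j = 0; j < examples[i].length; j++) {
                    weights[j + 1] += learningRate * error * examples[i][j];
                }
            }
        }
        for (double[] ex : examples) System.out.println(output(weights, ex)); // 0.0 0.0 0.0 1.0
    }
}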

19 Linear Separability
Consider a perceptron: its output is
  1 if W1·X1 + W2·X2 + … + Wn·Xn > Θ
  0 otherwise
In terms of feature space (2 features only), the decision boundary is
  W1·X1 + W2·X2 = Θ
  X2 = (Θ - W1·X1) / W2 = (-W1/W2)·X1 + Θ/W2   (same form as y = mx + b)
Hence, a perceptron can only classify examples if a 'line' (hyperplane) can separate them
[Figure: X1-X2 feature space with the separating line drawn]

20 The (Infamous) XOR Problem
Exclusive OR (XOR) is not linearly separable
  X1  X2  |  Output
  a) 0  0 |  0
  b) 0  1 |  1
  c) 1  0 |  1
  d) 1  1 |  0
[Figure: the four points a, b, c, d plotted in X1-X2 space - no single line separates the 1s from the 0s]
A solution with (sigmoidal) hidden units:
[Figure: a two-hidden-unit network whose link weights are 10 and -10, with Θ = 5 for all nodes]
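The slide's exact wiring did not survive the transcript, so this Java sketch uses the standard two-hidden-unit construction consistent with its ±10 weights and Θ = 5: one hidden unit fires for "X1 and not X2", the other for "X2 and not X1", and the output unit ORs them. Threshold units are used for readability; with weights this large, sigmoidal units behave almost identically.

// A hand-built XOR network (standard construction, weights assumed as above).
public class XorNetwork {

    static double threshold(double weightedSum) {
        return weightedSum > 5.0 ? 1.0 : 0.0;   // Θ = 5 for all nodes
    }

    static double xorNet(double x1, double x2) {
        double h1 = threshold( 10.0 * x1 - 10.0 * x2);   // fires only for (1, 0)
        double h2 = threshold(-10.0 * x1 + 10.0 * x2);   // fires only for (0, 1)
        return threshold(10.0 * h1 + 10.0 * h2);         // fires if either hidden unit fires
    }

    public static void main(String[] args) {
        for (double x1 = 0; x1 <= 1; x1++) {
            for (double x2 = 0; x2 <= 1; x2++) {
                System.out.println(x1 + " XOR " + x2 + " = " + xorNet(x1, x2));
            }
        }
        // Prints 0, 1, 1, 0 for inputs (0,0), (0,1), (1,0), (1,1).
    }
}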

21 The Need for Hidden Units
If there is one layer of enough hidden units (possibly 2^N for Boolean functions), the input can be recoded (N = number of input units)
This recoding allows any mapping to be represented (known by Minsky & Papert)
Question: How to provide an error signal to the interior units? (backprop is the answer from the 1980s)

22 Hidden Units: One View
Allow a system to create its own internal representation - one for which problem solving is easy
[Figure: a perceptron sitting on top of the hidden units' learned internal representation]

23 Reformulating XOR
Create a new feature X3 = X1 ∧ X2
Alternatively, treat X3 as a third input alongside X1 and X2
So, if a hidden unit can learn to represent X1 ∧ X2, the solution is easy (with X3 available, XOR becomes linearly separable - eg, output 1 when X1 + X2 - 2·X3 > 0.5)
[Figure: two sketches - one where a hidden unit computes X3 from X1 and X2, one where X3 is given as a third input]

24 The Need for Non-Linear Activation Functions
Claim: for every ANN of depth k that uses only linear activation functions, there is an equivalent perceptron (ie, a neural network with no hidden units)
So if using only linear activation units, a 'deep' ANN can still only learn a separating 'line'
Note that ReLUs are non-linear ('piecewise' linear)
Can show using linear algebra (but won't in cs540)
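A one-line sketch of the linear-algebra argument the slide alludes to (my wording, not the course's own derivation): if every layer is linear, layer k computes x_k = W_k x_{k-1}, so the whole stack collapses to a single matrix,

\[
  y \;=\; W_k W_{k-1} \cdots W_1 \, x
    \;=\; \underbrace{\left( W_k W_{k-1} \cdots W_1 \right)}_{\text{one weight matrix } W} \, x
    \;=\; W x ,
\]

which is exactly what a perceptron (no hidden units) with weight matrix W computes.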

25 A Famous Early Application (http://cnl.salk.edu/Media/nettalk.mp3)
NETtalk (Sejnowski & Rosenberg, 1987): mapping character strings into phonemes (like the phonemes in a dictionary, eg Ă, Ō)
'Sliding window' approach: the network reads a window of characters (eg ... A T C _ ...) and outputs the phoneme for the letter at the center of the window
Train: 1,000 most common English words - 88.5% correct
Test: 20,000-word dictionary - 72% / 63% correct

26 An Empirical Comparison of Symbolic and Neural Learning
[Shavlik, Mooney, & Towell, IJCAI 1989 & ML journal 1991]
Perceptron works quite well!
[Figure: accuracy comparison of symbolic and neural learners across datasets]

27 ANN Wrapup on Non-Learning Aspects
(Pictured: Geoff Hinton, 1947-, great-great-grandson of George Boole!)
Perceptrons can do well, but can only create linear separators in feature space
Backprop algo (next lecture) can successfully train hidden units; historically only one HU layer was used
Deep Neural Networks (several HU layers) are highly successful given large amounts of training data, especially for images & text

