Today's Topics
HW4 out (due in two weeks, some Java)
Artificial Neural Networks (ANNs)
- Perceptrons (1950s)
- Hidden Units and Backpropagation (1980s)
- Deep Neural Networks (2010s)
- ??? (2040s [note the pattern])
This Lecture: The Big Picture & Forward Propagation
Next Lecture: Learning Network Weights
Should you? (Slide I used in CS 760 for 20+ years)
'Fenwick here is biding his time waiting for neural networks'
Recall: Supervised ML Systems Differ in How They Represent Concepts
[Figure: the same training examples fed to different learners - Backpropagation (a neural network), ID3/CART (decision trees), and FOIL/ILP (first-order rules over variables X, Y, Z) - each yielding a different representation of the learned concept.]
Advantages of Artificial Neural Networks
Provide best predictive accuracy for many problems
Can represent a rich class of concepts ('universal approximators')
[Figure: time-series data with intervals labeled Positive, Negative, Positive.]
A Brief Overview of ANNs
[Figure: a layered network with input units at the bottom, hidden units in the middle, and output units at the top; annotations mark a weight, an error signal, and a recurrent link.]
Recurrent ANNs (Advanced topic: LSTM models, Schmidhuber group)
[Figure: a recurrent ANN whose state units (ie, memory) feed activations from earlier time steps back into the network.]
Representing Features in ANNs (and SVMs) - we need NUMERIC values
Input Units:
- Nominal, f ∈ {a, b, c}: '1 of N' rep - one input unit per value (f=a, f=b, f=c)
- Hierarchical, f ∈ {a, b, c, d, e, g}: one input unit per value, with a 1 on the matching unit (eg, the f=e unit)
- Linear/Ordered, f ∈ [a, b] - typical approaches (others possible):
  Approach I (use 1 input unit): input = (value - a) / (b - a)
  Approach II: Thermometer rep (next slide)
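A minimal Python sketch of Approach I (rescaling an ordered feature onto [0, 1]); the feature range used in the example is made up for illustration.

```python
def scale_to_unit_interval(value, a, b):
    """Approach I: map an ordered feature with legal range [a, b] onto [0, 1]."""
    return (value - a) / (b - a)

# Illustrative only: an ordered feature whose range is assumed to be [1, 5].
print(scale_to_unit_interval(4, 1, 5))  # 0.75
```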
More on Encoding Datasets
Thermometer Representation
f is an element of {a, b, c}, ie f is ordered:
  f = a → 1 0 0
  f = b → 1 1 0
  f = c → 1 1 1
(could also discretize continuous functions this way)
For N categories use a 1-of-N representation
Output Representation:
  Category 1 → 1 0 0
  Category 2 → 0 1 0
  Category 3 → 0 0 1
For Boolean functions use either 1 or 2 output units
Normalize real-valued functions to [0, 1]
Could also use an error-correcting code (but we won't cover that)
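A sketch (not from the slides) of the 1-of-N and thermometer encodings described above, using the ordered values a < b < c:

```python
def one_of_n(value, categories):
    """1-of-N encoding: a single 1 in the position of the matching category."""
    return [1 if value == c else 0 for c in categories]

def thermometer(value, ordered_categories):
    """Thermometer encoding: 1s up to and including the value's position."""
    index = ordered_categories.index(value)
    return [1 if i <= index else 0 for i in range(len(ordered_categories))]

print(one_of_n('b', ['a', 'b', 'c']))     # [0, 1, 0]
print(thermometer('b', ['a', 'b', 'c']))  # [1, 1, 0], ie the '110' row above
```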
Connectionism History
PERCEPTRONS (Rosenblatt 1957)
- no hidden units
- earliest work in machine learning; died out in 1960's (due to Minsky & Papert book)
[Figure: unit I receives weighted links from units J, K, and L, with weights w_ij, w_ik, w_il.]
output_i = F( w_ij · output_j + w_ik · output_k + w_il · output_l )
Connectionism (cont.)
Backpropagation Algorithm
- Overcame Perceptron's Weakness
- Major reason for renewed excitement in 1980's
'Hidden Units' Important
- Fundamental extension to perceptrons
- Can generate new features ('constructive induction', 'predicate invention', 'learning representations', 'derived features')
Deep Neural Networks
Old: backprop algo does not work well for more than one layer of hidden units ('gradient gets too diffuse')
New: with a lot of training data, deep (several layers of hidden units) neural networks exceed prior state-of-the-art results
Unassigned, but FYI:
Sample Deep Neural Network
A Deeper Network
Old Design: fully connect each input node to each HU (only one HU layer), then fully connect each HU to each output node
We'll cover CONVOLUTION and POOLING later
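A matrix-form sketch of the 'old design' (every input connected to every hidden unit, every hidden unit connected to every output); the layer sizes and random weights below are assumptions for illustration, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 4 inputs, 3 hidden units, 2 outputs.
n_in, n_hidden, n_out = 4, 3, 2

# Full connectivity means one weight for every pair of units in adjacent
# layers, so each layer is just a dense weight matrix.
W1 = rng.normal(size=(n_hidden, n_in))   # input  -> hidden
W2 = rng.normal(size=(n_out, n_hidden))  # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, 0.7, 0.1, 0.9])       # one example's input-unit values
hidden = sigmoid(W1 @ x)                  # single hidden layer
output = sigmoid(W2 @ hidden)             # output layer
print(output)
```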
Digit Recognition: Influential ANN Testbed
Deep Networks (Schmidhuber, 2012)
- One YEAR of training on a single CPU
- One WEEK of training on a single GPU that performed 10^9 wgt updates/sec
- 0.2% error rate (old record was 0.4%)
More info on datasets and results at
Error rates of other methods:
- Perceptron: % error (7.6% with feature engineering)
- k-NN: % (0.63%)
- Ensemble of d-trees: 1.5%
- SVMs: % (0.56%)
- One layer of HUs: % (0.4%; feature engr + ensemble of 25 ANNs)
Activation Units: Map Weighted Sum to Scalar
Individual Units' Computation:
  output_i = F( Σ_j weight_i,j × output_j )
Typically
  F(input_i) = 1 / (1 + e^-(input_i - bias_i))
[Figure: plot of the unit's output vs its input, an S-shaped curve shifted by the bias.]
Called the 'sigmoid' and 'logistic' (hyperbolic tangent also used)
Piecewise Linear (and Gaussian) nodes can also be used
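The per-unit computation above, written out as a short sketch; the weights, inputs, and bias are made-up values.

```python
import math

def sigmoid_unit(weights, inputs, bias):
    """output_i = F(sum_j weight_ij * output_j), with the logistic F and a bias."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-(weighted_sum - bias)))

# Illustrative values only.
print(sigmoid_unit([0.5, -1.0, 2.0], [1.0, 0.0, 0.5], bias=0.2))  # ~0.786
```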
Rectified Linear Units (ReLUs) (Nair & Hinton, 2010) - used for HUs:
  F(wgt'edSum) = max(0, wgt'edSum)
Use 'pure' linear for output units, ie F(wgt'edSum) = wgt'edSum
Argued to be more biologically plausible
Used in 'deep networks'
[Figure: plot of the ReLU unit's output vs its input: flat at 0, then linear above the bias.]
Sample ANN Calculation ('Forward Propagation', ie, reasoning with weights learned by backprop)
Assume bias = 0 for all nodes for simplicity, and use ReLUs
[Figure: a small network worked through on the slide, with INPUT values 3 and 2 and edge weights including 3, 4, -2, -1, 1, -8, -7, 9, and 5 leading to the OUTPUT.]
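The slide's network diagram and weights are not recoverable from this transcript, so the sketch below uses an assumed 2-2-1 network to illustrate the same forward-propagation procedure: ReLU hidden units, bias 0 everywhere, and a 'pure' linear output unit.

```python
def relu(z):
    return max(0.0, z)

# Assumed weights (NOT the slide's): rows are the two hidden units' incoming
# weights from inputs (x1, x2); w_out holds the hidden-to-output weights.
w_hidden = [[3.0, -2.0],
            [4.0,  1.0]]
w_out = [1.0, -1.0]

def forward(x1, x2):
    h = [relu(w[0] * x1 + w[1] * x2) for w in w_hidden]  # hidden layer (bias 0)
    return sum(wo * hi for wo, hi in zip(w_out, h))       # linear output unit

print(forward(3.0, 2.0))  # relu(5)=5, relu(14)=14, output = 5 - 14 = -9.0
```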
Perceptron Convergence Theorem (Rosenblatt, 1957)
Perceptron = no hidden units
If a set of examples is learnable, the DELTA rule will eventually find the necessary weights
However, a perceptron can only learn/represent linearly separable datasets
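A sketch of delta-rule training for a perceptron on a linearly separable concept (Boolean AND); the learning rate and epoch count are arbitrary illustrative choices.

```python
# Delta rule: after each example, nudge each weight by eta * error * input.
examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # Boolean AND
weights = [0.0, 0.0]
bias = 0.0          # plays the role of the threshold, learned like a weight
eta = 0.1           # assumed learning rate

def predict(x):
    s = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if s > 0 else 0

for _ in range(25):                        # plenty of epochs for this tiny problem
    for x, target in examples:
        error = target - predict(x)
        weights = [w + eta * error * xi for w, xi in zip(weights, x)]
        bias += eta * error

print([predict(x) for x, _ in examples])   # [0, 0, 0, 1] once converged
```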
Linear Separability
Consider a perceptron; its output is
  1 if W1·X1 + W2·X2 + … + Wn·Xn > Θ
  0 otherwise
In terms of feature space (2 features only), the decision boundary is
  W1·X1 + W2·X2 = Θ
  X2 = (Θ - W1·X1) / W2 = (-W1 / W2)·X1 + Θ / W2     (compare y = mx + b)
Hence, a perceptron can only classify examples if a 'line' (hyperplane) can separate them
The (Infamous) XOR Problem
Not linearly separable: Exclusive OR (XOR)
  Input (X1, X2)   Output
  a) 0 0           0
  b) 0 1           1
  c) 1 0           1
  d) 1 1           0
[Figure: the four examples a-d plotted in the X1-X2 plane; no single line separates the 1s from the 0s.]
A Solution with (Sigmoidal) Hidden Units
[Figure: a network with two hidden units whose link weights are +10 and -10; let Θ = 5 for all nodes.]
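A worked XOR network with two threshold hidden units. The exact weights on the slide are not recoverable from this transcript, so the ones below are chosen so that hidden unit 1 computes OR and hidden unit 2 computes AND, with Θ = 5 for all nodes as on the slide.

```python
def step(weighted_sum, theta=5):
    """Threshold unit: output 1 iff the weighted sum exceeds theta."""
    return 1 if weighted_sum > theta else 0

def xor_net(x1, x2):
    h1 = step(10 * x1 + 10 * x2)    # hidden unit 1: OR(x1, x2)
    h2 = step(5 * x1 + 5 * x2)      # hidden unit 2: AND(x1, x2)
    return step(10 * h1 - 10 * h2)  # fires when OR but not AND, ie XOR

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, '->', xor_net(x1, x2))  # outputs 0, 1, 1, 0
```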
The Need for Hidden Units
If there is one layer of enough hidden units (possibly 2^N for Boolean functions), the input can be recoded (N = number of input units)
This recoding allows any mapping to be represented (known by Minsky & Papert)
Question: How to provide an error signal to the interior units? (backprop is the answer from the 1980's)
Hidden Units - One View
Allow a system to create its own internal representation - for which problem solving is easy
[Figure: the output layer sitting above the hidden units is labeled 'A perceptron'.]
Reformulating XOR
Let X3 = X1 ∧ X2
Then X1 ⊕ X2 = (X1 ∨ X2) ∧ ¬X3
So, if a hidden unit can learn to represent X1 ∧ X2, the solution is easy
The Need for Non-Linear Activation Functions
Claim: For every ANN using only linear activation functions with depth k, there is an equivalent perceptron, ie a neural network with no hidden units
So if using only linear activation units, a 'deep' ANN can still only learn a separating 'line'
Note that ReLUs are non-linear ('piecewise' linear)
Can show using linear algebra (but won't in cs540)
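The linear-algebra argument behind the claim, sketched with biases omitted for brevity:

```latex
% If every layer applies only a linear map:
%   h_1 = W_1 x,   h_2 = W_2 h_1,   ...,   y = W_k h_{k-1},
% then the whole network computes a single linear map,
\[
  y = W_k W_{k-1} \cdots W_1 \, x = \widetilde{W} x,
  \qquad \widetilde{W} = W_k W_{k-1} \cdots W_1 ,
\]
% which is exactly what a perceptron (no hidden units) with
% weight matrix \widetilde{W} computes.
```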
A Famous Early Application (http://cnl.salk.edu/Media/nettalk.mp3)
NETtalk (Sejnowski & Rosenberg, 1987)
Mapping character strings into phonemes ('sliding window' approach)
Train: 1,000 most common English words - 88.5% correct
Test: 20,000-word dictionary - 72% / 63% correct
[Figure: a window of letters (eg, '... A T C _ ...') slides across the text; the network outputs the phoneme (eg, Ă, Ō) for the center letter, like the phonemes in a dictionary.]
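A sketch of the 'sliding window' idea; the window width and padding character are illustrative assumptions, not details taken from the NETtalk paper.

```python
def sliding_windows(text, width=7, pad='_'):
    """Yield one fixed-width character window centered on each position of text.

    Each window would be the network's input; the training target is the
    phoneme for the window's center character.
    """
    half = width // 2
    padded = pad * half + text + pad * half
    for i in range(len(text)):
        yield padded[i:i + width]

for window in sliding_windows("a cat"):
    print(window)   # '___a ca', '__a cat', ... one window per character
```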
An Empirical Comparison of Symbolic and Neural Learning
[Shavlik, Mooney, & Towell, IJCAI 1989 & ML journal 1991]
Perceptron works quite well!
ANN Wrapup on Non-Learning Aspects
Geoff Hinton, 1947- (great-great-grandson of George Boole!)
Perceptrons can do well, but can only create linear separators in feature space
Backprop Algo (next lecture) can successfully train hidden units
- Historically only one HU layer used
Deep Neural Networks (several HU layers) highly successful given large amounts of training data, especially for images & text