Learning: Nearest Neighbor, Perceptrons & Neural Nets


Learning: Nearest Neighbor, Perceptrons & Neural Nets Artificial Intelligence CSPP 56553 February 4, 2004

Nearest Neighbor Example II
Credit Rating Classifier: Good / Poor
Features: L = # late payments/yr; R = Income/Expenses
Name   L    R     G/P
A       0   1.20  G
B      25   0.40  P
C       5   0.70  G
D      20   0.80  P
E      30   0.85  P
F      11   1.20  G
G       7   1.15  G
H      15   0.80  P

Nearest Neighbor Example II (plot): the same instances plotted with L (late payments, 0-30) on the horizontal axis and R (income/expenses) on the vertical axis; the Good and Poor cases fall in distinct regions.

Nearest Neighbor Example II (new instances)
Name   L    R     G/P
I       6   1.15  G
J      22   0.45  P
K      15   1.20  ??
Distance measure (scaled distance): sqrt((L1-L2)^2 + [sqrt(10)*(R1-R2)]^2)
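To make the computation concrete, here is a minimal nearest-neighbor classifier in Python using the scaled distance above; the data are the eight training instances from the table, and the function and variable names are my own (the slides give no code).

```python
import math

# Training data from the table above: (name, L, R, label)
train = [
    ("A", 0, 1.20, "G"), ("B", 25, 0.40, "P"), ("C", 5, 0.70, "G"),
    ("D", 20, 0.80, "P"), ("E", 30, 0.85, "P"), ("F", 11, 1.20, "G"),
    ("G", 7, 1.15, "G"), ("H", 15, 0.80, "P"),
]

def scaled_distance(l1, r1, l2, r2):
    # The slide's scaled distance: R is stretched by sqrt(10) so that both
    # features contribute on comparable scales.
    return math.sqrt((l1 - l2) ** 2 + (math.sqrt(10) * (r1 - r2)) ** 2)

def classify(l, r):
    # Predict the label of the single closest training instance.
    name, tl, tr, label = min(train, key=lambda t: scaled_distance(l, r, t[1], t[2]))
    return name, label

print(classify(15, 1.20))   # query instance K = (15, 1.20) from the slide
```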

Nearest Neighbor: Issues
Prediction can be expensive if there are many features
Affected by classification and feature noise: one entry can change the prediction
Definition of the distance metric: how to combine features with different types and ranges of values
Sensitive to feature selection

Efficient Implementations Classification cost: Find nearest neighbor: O(n) Compute distance between unknown and all instances Compare distances Problematic for large data sets Alternative: Use binary search to reduce to O(log n)

Efficient Implementation: K-D Trees
Divide instances into sets based on features
Binary branching: e.g. feature > value
2^d leaves after d splits; with 2^d = n, the depth d = O(log n)
To split cases into sets:
If there is one element in the set, stop
Otherwise pick a feature to split on
Find the average position of the two middle objects on that dimension
Split the remaining objects based on that average position
Recursively split the subsets
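A sketch of this construction recipe (assuming "average position of the two middle objects" means the midpoint of their values on the chosen dimension, and picking split features in round-robin order; names are illustrative, not from the slides):

```python
def build_kd_tree(points, depth=0):
    """points: list of (features, label); features is a tuple of numbers.
    Splits on features in round-robin order (one simple way to 'pick a feature')."""
    if len(points) <= 1:                       # one element: stop (leaf)
        return points[0] if points else None
    dim = depth % len(points[0][0])            # pick a feature to split on
    points = sorted(points, key=lambda p: p[0][dim])
    mid = len(points) // 2
    # average position of the two middle objects on that dimension
    split = (points[mid - 1][0][dim] + points[mid][0][dim]) / 2.0
    left = [p for p in points if p[0][dim] <= split]   # ties go left; fully duplicate
    right = [p for p in points if p[0][dim] > split]   # values are not handled in this sketch
    return {"dim": dim, "split": split,
            "left": build_kd_tree(left, depth + 1),
            "right": build_kd_tree(right, depth + 1)}

def classify_by_leaf(tree, features):
    """Descend to a leaf and return its label (an approximate nearest neighbor)."""
    while isinstance(tree, dict):
        branch = "left" if features[tree["dim"]] <= tree["split"] else "right"
        tree = tree[branch]
    return tree[1]

# Credit-rating data from the earlier slide: ((L, R), label)
data = [((0, 1.20), "G"), ((25, 0.40), "P"), ((5, 0.70), "G"), ((20, 0.80), "P"),
        ((30, 0.85), "P"), ((11, 1.20), "G"), ((7, 1.15), "G"), ((15, 0.80), "P")]
tree = build_kd_tree(data)
print(classify_by_leaf(tree, (15, 1.20)))
```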

K-D Trees: Classification (decision-tree-style descent over the credit data): the root tests L > 17.5?, a further split tests L > 9?, and the remaining splits test R against thresholds (0.6, 0.75, 1.025, 1.175); each yes/no path ends at a leaf labeled Good or Poor.

Efficient Implementation: Parallel Hardware
Classification cost: # distance computations; constant time with O(n) processors
Cost of finding the closest: compute pairwise minimums successively, O(log n) time

Nearest Neighbor: Analysis
Issue: What features should we use?
E.g. credit rating: many possible features (tax bracket, debt burden, retirement savings, etc.)
Nearest neighbor uses ALL of them; irrelevant feature(s) could mislead
A fundamental problem with nearest neighbor

Nearest Neighbor: Advantages Fast training: Just record feature vector - output value set Can model wide variety of functions Complex decision boundaries Weak inductive bias Very generally applicable

Summary: Nearest Neighbor Training: record input vectors + output value Prediction: closest training instance to new data Efficient implementations Pros: fast training, very general, little bias Cons: distance metric (scaling), sensitivity to noise & extraneous features

Learning: Perceptrons Artificial Intelligence CSPP 56553 February 4, 2004

Agenda
Neural Networks: Biological analogy
Perceptrons: Single layer networks
Perceptron training
Perceptron convergence theorem
Perceptron limitations
Conclusions

Neurons: The Concept
Cell structures: dendrites, axon, nucleus, cell body
Neurons: receive inputs from other neurons (via synapses); when input exceeds a threshold, the neuron "fires" and sends output along its axon to other neurons
Brain: 10^11 neurons, 10^16 synapses

Artificial Neural Nets
Simulated neuron: node connected to other nodes via links
Links = axon + synapse + dendrite
Links associated with a weight (like a synapse), multiplied by the output of the node
Node combines input via an activation function, e.g. sum of weighted inputs passed through a threshold
Simpler than real neuronal processes

Artificial Neural Net (diagram): inputs x, each multiplied by a weight w, summed, then passed through a threshold.

Perceptrons Single neuron-like element Binary inputs Binary outputs Weighted sum of inputs > threshold

Perceptron Structure: inputs x0 = 1, x1, ..., xn with weights w0, w1, ..., wn feed a single output y; the fixed input x0 = 1 with weight w0 compensates for the threshold.
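A perceptron as described here takes only a few lines; a generic sketch with the threshold folded into w0 via the fixed input x0 = 1 (names are illustrative):

```python
def perceptron_output(w, x):
    """w = (w0, w1, ..., wn); x = (x1, ..., xn).
    The fixed input x0 = 1 folds the threshold into weight w0."""
    total = w[0] * 1 + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if total > 0 else 0

# Example weights that compute logical OR on binary inputs (w0 acts as the negated threshold):
print(perceptron_output((-0.5, 1, 1), (0, 1)))   # 1
print(perceptron_output((-0.5, 1, 1), (0, 0)))   # 0
```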

Perceptron Example
Logical-OR is linearly separable: 00: 0; 01: 1; 10: 1; 11: 1
(In the (x1, x2) plane, a single line separates the positive cases from the negative one.)

Perceptron Convergence Procedure
Straightforward training procedure; learns linearly separable functions
Until the perceptron yields the correct output for all examples:
If the perceptron is correct, do nothing
If the perceptron is wrong:
If it incorrectly says "yes", subtract the input vector from the weight vector
Otherwise, add the input vector to the weight vector

Perceptron Convergence Example (LOGICAL-OR)
Sample  x0  x1  x2  Desired Output
1       1   0   0   0
2       1   0   1   1
3       1   1   0   1
4       1   1   1   1
Initial: w = (0 0 0); Pass 1: after S2, w = w + s2 = (1 0 1)
Pass 2: S1: w = w - s1 = (0 0 1); S3: w = w + s3 = (1 1 1)
Pass 3: S1: w = w - s1 = (0 1 1)
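The procedure and the worked example can be checked with a short script. This sketch assumes the output rule "fire if the weighted sum is greater than 0"; under that assumption it reproduces the slide's weight trace:

```python
def output(w, x):
    # Fire if the weighted sum exceeds 0 (x includes the fixed x0 = 1 input).
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else 0

def train(samples, w=(0, 0, 0)):
    w = list(w)
    while True:
        errors = 0
        for x, desired in samples:
            got = output(w, x)
            if got == desired:
                continue                       # correct: do nothing
            errors += 1
            sign = -1 if got == 1 else 1       # wrongly said "yes": subtract; wrongly said "no": add
            w = [wi + sign * xi for wi, xi in zip(w, x)]
            print("update:", w)
        if errors == 0:
            return w

# LOGICAL-OR samples from the table above: ((x0, x1, x2), desired output)
or_samples = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
print("final:", train(or_samples))   # updates follow the trace: (1 0 1), (0 0 1), (1 1 1), (0 1 1)
```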

Perceptron Convergence Theorem: If there exists a weight vector w* that correctly classifies all of the (linearly separable) training examples, perceptron training will find such a vector. Sketch: each update adds a misclassified example x to w, and ||w + x||^2 <= ||w||^2 + ||x||^2, so after k updates ||w||^2 grows by at most k * max ||x||^2; meanwhile the projection w* . w / ||w|| can never exceed ||w*||, yet w* . w grows by at least the separation margin per update. Together these bounds force k to be finite: the procedure converges in k <= O(max ||x||^2 / margin^2) steps.

Perceptron Learning
Perceptrons learn linear decision boundaries: e.g. a line in the (x1, x2) plane separating + from -
But not XOR:
X1  X2   requirement
-1  -1   w1x1 + w2x2 < 0
 1  -1   w1x1 + w2x2 > 0  => implies w1 > 0
 1   1   w1x1 + w2x2 > 0  => but should be false
-1   1   w1x1 + w2x2 > 0  => implies w2 > 0
No single set of weights can satisfy all four constraints.

Perceptron Example Digit recognition Assume display= 8 lightable bars Inputs – on/off + threshold 65 steps to recognize “8”

Perceptron Summary Motivated by neuron activation Simple training procedure Guaranteed to converge IF linearly separable

Neural Nets
Multi-layer perceptrons
Inputs: real-valued
Intermediate "hidden" nodes
Output(s): one (or more) discrete-valued
(Diagram: inputs X1-X4 feed hidden layers, which feed outputs Y1, Y2.)

Neural Nets Pro: More general than perceptrons Not restricted to linear discriminants Multiple outputs: one classification each Con: No simple, guaranteed training procedure Use greedy, hill-climbing procedure to train “Gradient descent”, “Backpropagation”

Solving the XOR Problem
Network topology: 2 hidden nodes (o1, o2) and 1 output (y); each node also has a -1 threshold input
Weights: w11 = w12 = 1; w21 = w22 = 1; w01 = 3/2; w02 = 1/2; w03 = 1/2; w13 = -1; w23 = 1
Desired behavior:
x1  x2  o1  o2  y
0   0   0   0   0
1   0   0   1   1
0   1   0   1   1
1   1   1   1   0
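The weight assignment above can be verified directly. A small sketch treating each node as a hard-threshold unit with its -1 threshold input (the mapping of weight names to connections follows my reading of the figure; since the paired input weights are equal, the result is the same either way):

```python
def step(z):
    # Hard threshold unit: fire if the weighted sum (including the -1 threshold input) is positive.
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    o1 = step(x1 + x2 - 3 / 2)                 # hidden node with threshold weight 3/2: acts as AND
    o2 = step(x1 + x2 - 1 / 2)                 # hidden node with threshold weight 1/2: acts as OR
    y = step(-1 * o1 + 1 * o2 - 1 / 2)         # output: "OR but not AND", i.e. XOR
    return o1, o2, y

for x1, x2 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))             # matches the desired-behavior table above
```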

Neural Net Applications Speech recognition Handwriting recognition NETtalk: Letter-to-sound rules ALVINN: Autonomous driving

ALVINN
Driving as a neural network
Inputs: image pixel intensities (i.e. lane lines)
5 hidden nodes
Outputs: steering actions (e.g. turn left/right; how far)
Training: observe human behavior: sample images + steering

Backpropagation Greedy, Hill-climbing procedure Weights are parameters to change Original hill-climb changes one parameter/step Slow If smooth function, change all parameters/step Gradient descent Backpropagation: Computes current output, works backward to correct error

Producing a Smooth Function Key problem: Pure step threshold is discontinuous Not differentiable Solution: Sigmoid (squashed ‘s’ function): Logistic fn
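The logistic function meant here is s(z) = 1 / (1 + e^(-z)). A minimal sketch showing the squashed 's' shape and the slope that makes it usable with gradient methods:

```python
import math

def sigmoid(z):
    # Smooth, differentiable replacement for the hard step threshold.
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_slope(z):
    # ds(z)/dz = s(z) * (1 - s(z)), used later in backpropagation.
    s = sigmoid(z)
    return s * (1 - s)

print([round(sigmoid(z), 3) for z in (-5, -1, 0, 1, 5)])   # squashed 's': 0.007, 0.269, 0.5, 0.731, 0.993
```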

Neural Net Training
Goal: determine how to change the weights to get correct output; a large change in a weight should produce a large reduction in error
Approach:
Compute actual output: o
Compare to desired output: d
Determine the effect of each weight w on the error = d - o
Adjust the weights

Neural Net Example (from MIT 6.034 notes, Lozano-Perez)
A two-input network: inputs x1, x2 feed hidden nodes z1, z2 (outputs y1, y2), which feed output node z3 (output y3); each node has a -1 threshold input with weights w01, w02, w03.
Notation: xi = ith sample input vector; w = weight vector; yi* = desired output for the ith sample
Error = sum of squares error over the training samples; the output can be written as a full expression in terms of the inputs and the weights.

Gradient Descent Error: Sum of squares error of inputs with current weights Compute rate of change of error wrt each weight Which weights have greatest effect on error? Effectively, partial derivatives of error wrt weights In turn, depend on other weights => chain rule

Gradient Descent
Error as a function of weights: E = G(w)
Find the rate of change of the error, dG/dw
Follow the steepest rate of change; change the weights so that the error is minimized
(Plot: E = G(w) against w, showing local minima near w0 and w1.)
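A generic sketch of this update rule: step each weight against the error gradient, scaled by a rate parameter. The quadratic error function below is only an illustration of mine, not one from the slides:

```python
def gradient_descent(grad, w, rate=0.1, steps=100):
    # Repeatedly step each weight against its partial derivative of the error.
    for _ in range(steps):
        g = grad(w)
        w = [wi - rate * gi for wi, gi in zip(w, g)]
    return w

# Illustration: E(w) = (w0 - 3)^2 + (w1 + 1)^2 has its minimum at w = (3, -1).
grad_E = lambda w: [2 * (w[0] - 3), 2 * (w[1] + 1)]
print(gradient_descent(grad_E, [0.0, 0.0]))    # approaches [3.0, -1.0]
```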

Gradient of Error (MIT AI lecture notes, Lozano-Perez 2000; from the 6.034 notes)
Differentiate the error for the example network (inputs x1, x2; hidden nodes z1, z2; output node z3) with respect to each weight.
Note: derivative of the sigmoid: ds(z1)/dz1 = s(z1)(1 - s(z1))

From Effect to Update
Gradient computation: how each weight contributes to performance
To train:
Need to determine how to CHANGE each weight based on its contribution to performance
Need to determine how MUCH change to make per iteration: the rate parameter 'r'
Large enough to learn quickly; small enough to reach but not overshoot the target values

Backpropagation Procedure (for a node j feeding a node k)
Pick the rate parameter 'r'
Until performance is good enough:
Do a forward computation to calculate the output
Compute Beta at the output node
Compute Beta at all other nodes by propagating backward
Compute the change for all weights from the Betas (see the sketch below)
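The slide's Beta and weight-change formulas appear only as images in this transcript, so the sketch below uses the standard sigmoid/sum-of-squares rules as an assumption: the output Beta is d - y, each hidden Beta is the output Beta passed back through the outgoing weight and the output slope, and each weight change is r times the node's input times the node's slope times its Beta. It trains the 2-2-1 XOR topology from earlier; the seed, rate, and epoch count are illustrative:

```python
import math, random

def s(z):
    return 1.0 / (1.0 + math.exp(-z))           # logistic sigmoid

random.seed(1)
# Weight layout: hidden[j] = [w_from_x1, w_from_x2, w_threshold]; out = [w_from_h1, w_from_h2, w_threshold]
hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
out = [random.uniform(-1, 1) for _ in range(3)]

def forward(x1, x2):
    h = [s(w[0] * x1 + w[1] * x2 + w[2] * -1.0) for w in hidden]
    y = s(out[0] * h[0] + out[1] * h[1] + out[2] * -1.0)
    return h, y

def backprop_step(x1, x2, d, r=0.5):
    h, y = forward(x1, x2)
    beta_out = d - y                             # Beta at the output node: desired minus actual
    slope_out = y * (1 - y)                      # sigmoid slope at the output
    for j in range(2):
        beta_h = out[j] * slope_out * beta_out   # Beta propagated back through the outgoing weight
        slope_h = h[j] * (1 - h[j])
        for i, inp in enumerate((x1, x2, -1.0)):
            hidden[j][i] += r * inp * slope_h * beta_h   # change = r * input * slope * Beta
    for j, inp in enumerate((h[0], h[1], -1.0)):
        out[j] += r * inp * slope_out * beta_out
    return beta_out ** 2                         # squared error before this update

samples = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]     # XOR
for epoch in range(5000):
    err = sum(backprop_step(x1, x2, d) for x1, x2, d in samples)
print(round(err, 4))   # usually falls toward 0, though gradient descent can stall in a local minimum
```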

Backprop Example (the same network: inputs x1, x2; hidden nodes z1, z2; output node z3; -1 threshold inputs with weights w01, w02, w03)
Forward prop: compute zi and yi given xk, wl

Backpropagation Observations Procedure is (relatively) efficient All computations are local Use inputs and outputs of current node What is “good enough”? Rarely reach target (0 or 1) outputs Typically, train until within 0.1 of target

Neural Net Summary Training: Prediction: Backpropagation procedure Gradient descent strategy (usual problems) Prediction: Compute outputs based on input vector & weights Pros: Very general, Fast prediction Cons: Training can be VERY slow (1000’s of epochs), Overfitting

Training Strategies
Online training: update weights after each sample
Offline (batch) training: compute the error over all samples, then update the weights
Online training is "noisy": sensitive to individual instances; however, it may escape local minima

Training Strategy
To avoid overfitting:
Split the data into training, validation, & test sets
Also, avoid excess weights (fewer weights than samples)
Initialize with small random weights, so small changes have a noticeable effect
Use offline training until the validation-set error reaches its minimum
Then make no more weight changes and evaluate on the test set
(A sketch of this early-stopping loop follows.)
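A sketch of this early-stopping loop; train_epoch and validation_error are hypothetical placeholders standing in for one offline (batch) training pass and an error measurement on the validation split:

```python
def train_with_early_stopping(train_epoch, validation_error, patience=10, max_epochs=1000):
    """Run offline (batch) training until the validation error stops improving."""
    best_error, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        train_epoch()                            # one batch update over the training split
        err = validation_error()                 # error on the held-out validation split
        if err < best_error:
            best_error, best_epoch = err, epoch  # (a fuller version would snapshot the weights here)
        elif epoch - best_epoch > patience:
            break                                # no recent improvement: stop, keep the best weights
    return best_error

# Toy usage with a fake validation-error curve that bottoms out and then rises (overfitting):
errs = iter([0.9, 0.5, 0.3, 0.25, 0.26, 0.3, 0.4] + [0.5] * 100)
print(train_with_early_stopping(lambda: None, lambda: next(errs)))   # 0.25
```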

Classification Neural networks best for classification task Single output -> Binary classifier Multiple outputs -> Multiway classification Applied successfully to learning pronunciation Sigmoid pushes to binary classification Not good for regression

Neural Net Example NETtalk: Letter-to-sound by net Inputs: Need context to pronounce 7-letter window: predict sound of middle letter 29 possible characters – alphabet+space+,+. 7*29=203 inputs 80 Hidden nodes Output: Generate 60 phones Nodes map to 26 units: 21 articulatory, 5 stress/sil Vector quantization of acoustic space

Neural Net Example: NETtalk Learning to talk: 5 iterations/1024 training words: bound/stress 10 iterations: intelligible 400 new test words: 80% correct Not as good as DecTalk, but automatic

Neural Net Conclusions
Simulation based on neurons in the brain
Perceptrons (single neuron): guaranteed to find a linear discriminant IF one exists (problem: XOR)
Neural nets (multi-layer perceptrons): very general; backpropagation training procedure; gradient descent brings local-minimum and overfitting issues