Artificial Neural Network

Presentation transcript:

Artificial Neural Network. Edward Tsang (Copyright), Wednesday, 26 April 2017. [Diagram: input units feed hidden units, which feed output units; the network output is compared with the target and the weights are updated.]

Artificial Neural Networks (ANN): Perceptrons, Gradient Descent, Backpropagation.

Biological Inspiration. The human brain has an estimated 10^11 neurons, each connected to about 10^4 others on average. Neuron switching times are on the order of 10^-3 sec, compared with about 10^-10 sec for computer systems. Yet it takes a human only about 0.1 second to visually recognize his or her mother. How much time would a computer system need?

ANNs Are Simplifications! Let's be realistic: biological neural systems are much more complex in their connection structure. Biological neurons communicate in parallel, but most ANNs are run on sequential machines. Biological neurons output complex series of spikes, but an ANN unit outputs a single value.

Motivation For Studying ANNs. To model biological learning: can we do that without building much more complex ANNs? To invent effective machine learning algorithms: if this is the goal, does it matter whether the ANNs resemble biological systems?

Example: ANN for Car Driving. ALVINN (1993) travelled at 70 mph for 90 miles, with other vehicles present.

When To Use ANNs. Input: attribute-value pairs. Output: a vector of discrete or real values (quite flexible). The training data may be noisy. Long training time is acceptable. Fast response is required when the learned network is applied. A black-box solution is acceptable.

Perceptron. One type of ANN is based on perceptrons. Input: a vector of real values x1 … xn. Connections with weights w1 … wn. Linear combination: w1x1 + … + wnxn, compared with a threshold w0: output 1 if w1x1 + … + wnxn > w0, output –1 otherwise.

Structure of a Perceptron. [Diagram: inputs x0 = 1, x1, …, xn enter the unit through weights w0 (threshold), w1, …, wn; the weights are to be trained.]

Linear Separability. The input represents an n-dimensional space. A perceptron represents a hyperplane that cuts the space into two classes: 1 and –1. Σi wixi can only represent linear hyperplanes. Examples that can be separated by a linear hyperplane are called linearly separable. Obviously not all example sets are linearly separable.

Example on Linear Separability. [Two plots of labelled points in the (x1, x2) plane: one set is linearly separable, the other (XOR) is not linearly separable.]

Expressive Power of Perceptrons. Many Boolean functions can be represented. Represent true by 1 and false by –1. E.g. x1 AND x2: make w0 = 0.8, w1 = w2 = 0.5. AND, OR, NAND and NOR can each be represented by one perceptron, but not XOR, which is not linearly separable. However, every Boolean function can be represented by a network of perceptrons two levels deep.
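As a quick check of the AND weights above (a minimal sketch; true = +1 and false = –1 as on the slide):

```python
# Sketch: check that w0 = 0.8, w1 = w2 = 0.5 implement AND under the slide's
# threshold rule: output 1 if w1*x1 + w2*x2 > w0, otherwise -1.
def perceptron_and(x1, x2, w0=0.8, w1=0.5, w2=0.5):
    return 1 if w1 * x1 + w2 * x2 > w0 else -1

for x1 in (-1, 1):
    for x2 in (-1, 1):
        print(x1, x2, "->", perceptron_and(x1, x2))
# Only the input (1, 1) gives 0.5 + 0.5 = 1.0 > 0.8, so only it outputs +1.
```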

Training Perceptrons. The hope is to adjust the weights so that the actual output matches the target. Initialize the wi with random weights. Repeat as many times as necessary: feed the perceptron one example; if it is misclassified, adjust the weights. Stop once all classifications are correct.

The Perceptron Training Rule. wi ← wi + Δwi, where Δwi = η (t – o) xi. Learning rate η: small and positive, e.g. 0.1. Suppose xi = 0.8, target t = +1, output o = –1; then Δwi = 0.1 × (1 – (–1)) × 0.8 = 0.16. This will bring o closer to the target +1. Proven: the rule will converge within a finite number of iterations to classify all instances correctly, provided they are linearly separable.
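A minimal runnable sketch of this rule (the toy data, variable names and stopping criterion are illustrative assumptions, not from the slides):

```python
import random

def train_perceptron(examples, n_weights, eta=0.1, max_epochs=100):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i on mistakes.
    Each example is (x, t), where x[0] = 1 acts as the threshold input and
    t is +1 or -1."""
    w = [random.uniform(-0.05, 0.05) for _ in range(n_weights)]
    for _ in range(max_epochs):
        mistakes = 0
        for x, t in examples:
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            if o != t:
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:          # all examples classified correctly
            return w
    return w

# Learning OR (linearly separable), with true = +1 and false = -1.
data = [([1, -1, -1], -1), ([1, -1, 1], 1), ([1, 1, -1], 1), ([1, 1, 1], 1)]
print(train_perceptron(data, n_weights=3))
```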

Gradient Descent, Principle. Can handle data that are not linearly separable; will converge to a best-fit approximation. Principle: gradient descent. Define the error, then move the weights so as to maximally reduce that error (how?). Linear unit: output with no threshold, o(x) = w · x.

Gradient Descent: Training Error. One common way to measure training error: the squared difference between target and actual output, summed over all examples: E(w) = ½ Σd∈D (td – od)², where D is the set of training examples, d is one example, td is the target output for d, and od is the actual output for d.

Gradient Descent Calculation. The gradient is obtained by computing the derivative of E with respect to each wi: ∇E(w)i = ∂E(w)/∂wi. To maximally reduce E(w), choose Δwi = –η ∂E(w)/∂wi, where η is the learning rate (small and positive), and update wi ← wi + Δwi.

Algorithm for Gradient Descent. We need to compute ∇E(w) efficiently, as we compute it iteratively. ∂E(w)/∂wi = Σd∈D (td – od)(–xid), where D is the set of all examples, td and od are the target and actual output for example d, and xid is the value of input xi for example d. Hence Δwi = η Σd∈D (td – od) xid.

Procedure Gradient-Descent.
Initialize each wi with a small random value.
Until the termination condition is met, do:
  For all i, Δwi ← 0.
  For each example ⟨x, t⟩, do:
    Input x to compute o.
    For each i, Δwi ← Δwi + η (t – o) xi.
  For each linear unit weight wi, do: wi ← wi + Δwi.
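A minimal runnable sketch of this procedure for a linear unit (the toy data, learning rate and epoch count are illustrative assumptions):

```python
import random

def gradient_descent(examples, n_weights, eta=0.05, epochs=200):
    """Batch gradient descent for a linear unit o(x) = w . x, minimizing
    E(w) = 1/2 * sum over d of (t_d - o_d)^2."""
    w = [random.uniform(-0.05, 0.05) for _ in range(n_weights)]
    for _ in range(epochs):
        delta_w = [0.0] * n_weights                   # For all i, Delta w_i <- 0
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))  # linear output, no threshold
            for i in range(n_weights):
                delta_w[i] += eta * (t - o) * x[i]    # accumulate over all examples
        w = [wi + dwi for wi, dwi in zip(w, delta_w)] # one update per pass
    return w

# Toy data generated by t = 2*x - 1; x[0] = 1 is the constant input.
data = [([1, x], 2 * x - 1) for x in (-2, -1, 0, 1, 2)]
print(gradient_descent(data, n_weights=2))            # approaches [-1, 2]
```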

Remarks on Gradient-Descent. If η is too large, the search may over-step the minimum (variation: gradually reduce η over the iterations). The weights are changed only once each time all the examples have been considered, so training can be slow (thousands of steps). The search can be trapped in a local minimum (variation: use incremental/stochastic gradient descent).

Stochastic Gradient Descent.
Initialize each wi with a small random value.
Until the termination condition is met, do:
  For each example ⟨x, t⟩, do:
    Input x to compute o.
    For each i, wi ← wi + η (t – o) xi.  (Delta Rule)
Unlike the perceptron rule, the delta rule involves no threshold.
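A corresponding sketch of the delta rule with per-example updates (same illustrative toy data as the batch sketch above):

```python
import random

def stochastic_gradient_descent(examples, n_weights, eta=0.05, epochs=100):
    """Delta rule: after every single example, w_i <- w_i + eta * (t - o) * x_i,
    where o = w . x is the unthresholded linear output."""
    w = [random.uniform(-0.05, 0.05) for _ in range(n_weights)]
    for _ in range(epochs):
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))  # no threshold, unlike the perceptron rule
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

data = [([1, x], 2 * x - 1) for x in (-2, -1, 0, 1, 2)]
print(stochastic_gradient_descent(data, n_weights=2))  # approaches [-1, 2]
```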

Standard vs Stochastic GD. Stochastic GD descends the gradient of the per-example error Ed(w), as opposed to the whole-training-set error ED(w) used in standard GD. Standard GD takes bigger steps, but each step takes longer to compute. Stochastic GD has a better chance of not falling into a local minimum.

Perceptron Training vs Delta Rule. Perceptron training uses a threshold; delta rule training does not. Perceptron training converges to perfectly classify linearly separable data. The delta rule does not assume linear separability, but it only converges asymptotically. Remark: linear programming can handle linearly separable data too, brilliantly!

Multilayer Networks, Introduction. Multiple inputs and outputs, plus hidden units, allow us to represent more complex functions. Will linear units work? We want a unit whose output function is continuous, nonlinear and monotonically increasing.

Sigmoid Function. [Plot of the sigmoid σ(y) = 1 / (1 + e^(–y)): an S-shaped curve that is continuous, nonlinear and monotonically increasing, with output limited to (0, 1).]
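As a sketch, the standard sigmoid and its derivative; the derivative's product form is what produces the o(1 – o) factors in the backpropagation deltas later:

```python
import math

def sigmoid(y):
    """sigma(y) = 1 / (1 + e^(-y)); output limited to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_derivative(y):
    """d sigma / dy = sigma(y) * (1 - sigma(y))."""
    s = sigmoid(y)
    return s * (1.0 - s)

print(sigmoid(0.0), sigmoid_derivative(0.0))  # 0.5 0.25
```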

Sigmoid Unit. [Diagram: same structure as the perceptron, with inputs x0 = 1, x1, …, xn and trainable weights w0 (threshold), w1, …, wn, but the weighted sum is passed through the sigmoid function.]

Training Error Redefined. Training error is now summed over all output units: E(w) = ½ Σd∈D Σk∈outputs (tkd – okd)², since there are multiple outputs; tkd and okd are the k-th target and actual outputs for example d. The aim is to minimize E (as before), and (stochastic) gradient descent can be used.

Multilayer Networks, Details. Normally the network structure is fixed. Feedforward vs recurrent networks. wji: the weight on the input from node i to unit j. δn: the error term associated with unit n, analogous to (t – o) in the delta rule. Details of the derivation of the weight-tuning rule are omitted.

Backpropagation Algorithm. Parameters: nin inputs, nhidden hidden units, nout outputs. Initialize each weight with a small random value. Until the termination condition is met, do: for each training example ⟨x, t⟩, input x to compute the output of every node, then propagate the errors backward (elaborated on the next slide).

Propagating Errors Backward.
For each output unit k, compute its error term δk: δk ← ok (1 – ok) (tk – ok).
For each hidden unit h, compute its error term δh: δh ← oh (1 – oh) Σk∈outputs wkh δk.
Update each network weight wji: wji ← wji + Δwji, where Δwji = η δj xji.
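A minimal runnable sketch of one backpropagation update for a network with a single hidden layer of sigmoid units, following the formulas above (the network size, XOR data, learning rate and variable names are illustrative assumptions, not from the slides):

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

def backprop_step(x, t, w_hidden, w_out, eta=0.5):
    """One forward pass plus one backward weight update.
    x includes the constant input x[0] = 1; w_hidden[h][i] is the weight from
    input i to hidden unit h; w_out[k][j] is the weight from hidden unit j
    (index 0 being the hidden layer's constant input) to output unit k."""
    # Forward pass.
    h = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    o = [sigmoid(sum(w * hj for w, hj in zip(ws, h))) for ws in w_out]

    # delta_k = o_k (1 - o_k) (t_k - o_k) for each output unit k.
    delta_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
    # delta_h = o_h (1 - o_h) * sum_k w_kh * delta_k for each hidden unit h.
    delta_h = [h[j + 1] * (1 - h[j + 1]) *
               sum(w_out[k][j + 1] * delta_o[k] for k in range(len(w_out)))
               for j in range(len(w_hidden))]

    # w_ji <- w_ji + eta * delta_j * x_ji for every weight.
    for k, ws in enumerate(w_out):
        for i in range(len(ws)):
            ws[i] += eta * delta_o[k] * h[i]
    for j, ws in enumerate(w_hidden):
        for i in range(len(ws)):
            ws[i] += eta * delta_h[j] * x[i]

# 2 inputs, 2 hidden units, 1 output, trained on XOR (targets in {0, 1}).
random.seed(0)
w_hidden = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
w_out = [[random.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(1)]
examples = [([1, 0, 0], [0]), ([1, 0, 1], [1]), ([1, 1, 0], [1]), ([1, 1, 1], [0])]
for _ in range(5000):
    for x, t in examples:
        backprop_step(x, t, w_hidden, w_out)
# Outputs should now be near 0, 1, 1, 0; convergence to a good minimum is
# not guaranteed for every random start, as the later slides discuss.
```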

Momentum. A variation of the Backpropagation weight update. Let Δwji(n) be the update to the weight from i to j in iteration n: Δwji(n) = η δj xji + α Δwji(n – 1), where 0 ≤ α < 1 is the momentum.
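As a sketch, momentum only changes the weight-update step from the previous slide by blending in the Δw from the previous iteration (the names and default values are illustrative assumptions):

```python
def momentum_update(w, prev_delta_w, delta_j, x_ji, eta=0.3, alpha=0.9):
    """Delta w_ji(n) = eta * delta_j * x_ji + alpha * Delta w_ji(n - 1),
    with 0 <= alpha < 1 the momentum constant. Returns the new weight and
    the new Delta w to remember for the next iteration."""
    delta_w = eta * delta_j * x_ji + alpha * prev_delta_w
    return w + delta_w, delta_w

# Example: a repeated gradient in the same direction builds up speed.
w, dw = 0.0, 0.0
for _ in range(3):
    w, dw = momentum_update(w, dw, delta_j=0.1, x_ji=1.0)
print(w, dw)
```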

Convergence & Local Minima. Backpropagation will converge, since it uses gradient descent, but not necessarily to the global minimum. Why does it work effectively in practice? Perhaps the many dimensions provide "escape routes"? Gradient descent over complex error surfaces is still poorly understood.

Expressive Power, Feedforward ANN. All Boolean functions can be represented by two layers, though this may need an exponential number of hidden units. Bounded continuous functions can be approximated by two layers. Arbitrary functions can be approximated by three layers.

Hypothesis Space & Inductive Bias. The hypothesis space is the space of possible network weights: it is n-dimensional, where n is the number of weights in the network, and it is continuous (in contrast with decision trees and version spaces). This is what enables a well-defined gradient descent. The inductive bias is difficult to characterize: roughly, smooth interpolation between data points.

Hidden Layer Representation. Normally few hidden units are used. Hidden units have lots of freedom: only the input and output units are governed by the data. Backpropagation defines hidden-layer features that are not explicit in the input. E.g. given the 8 inputs 10000000, 01000000, …, 3 hidden units discover the representation 000, 001, …

Overfitting. When should Backpropagation stop? When the error falls below some threshold? Over-training could lead to overfitting. Measuring overfitting: use validation data to measure generalization accuracy. Terminate learning once the validation error starts to rise significantly, then restore the earlier weights.
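A minimal sketch of that stopping strategy (the helpers train_one_epoch and validation_error are hypothetical placeholders standing in for a backpropagation pass and a held-out-set evaluation):

```python
import copy

def train_with_early_stopping(weights, train_one_epoch, validation_error,
                              max_epochs=1000, patience=10):
    """Stop when the validation error has not improved for `patience` epochs,
    then restore the weights that gave the lowest validation error."""
    best_weights = copy.deepcopy(weights)
    best_error = validation_error(weights)
    epochs_without_improvement = 0
    for _ in range(max_epochs):
        train_one_epoch(weights)            # e.g. one backpropagation pass over the training data
        error = validation_error(weights)   # generalization accuracy on held-out data
        if error < best_error:
            best_error, best_weights = error, copy.deepcopy(weights)
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                       # over-training detected: stop
    return best_weights                     # earlier (best) weights restored
```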