Classification Part 3: Artificial Neural Networks


Classification Part 3: Artificial Neural Networks BMTRY 726 4/15/14

Last Class
Last class we discussed:
(1) Problems with linear classification methods
- As with regression, it is hard to include a large number of covariates, especially if n is small
- A linear boundary may not be an appropriate choice for separating our classes
(2) The concept of Artificial Neural Networks
- Extract linear combinations of the inputs as derived features, then model the outcome (classes) as a nonlinear function of these features
- They are really nonlinear statistical models, but with pieces that are already familiar to us

Artificial Neural Networks (ANNs)
ANNs are modeled after the brain, so features and outputs are often referred to as neurons.
ANNs consist of:
(1) A set of observed input features
(2) A set of derived features
(3) A set of outcomes we want to explain/predict
(4) Weights on the connections between inputs, derived features, and outcomes
The simplest (and perhaps most common) type of ANN is a feed-forward ANN.
This means data feed forward through the network with no cycles or loops.

ANNs
Recall our generic example of an ANN from last class:
(1) Xi, i = 1, 2, …, p, are the observed features/inputs
(2) Zm, m = 1, 2, …, M, are the derived features
- Referred to as the “hidden” layer
(3) Yk, k = 1, 2, …, K, are the outputs
- Classification: the classes we want to model using the observed features X
- Regression: Y could be continuous
[Figure: network diagram with inputs X1, …, Xp, hidden layer Z1, …, ZM, and outputs Y1, …, YK]

ANNs
Hidden Layer
The Zm represent hidden features, derived by applying an activation function to linear combinations of the observed features.
Common activation functions include …

ANNs
Output
Outputs (i.e. predicted Y’s) come from applying a non-linear function to linear combinations of the derived features Zm.
Some examples of gk(T) …

ANNs
Consider the expression for the derived features Zm.
Parameters a0m represent a “bias”, like the one we described for LDA
- Recall that the “bias” defined the location of a decision boundary
Parameters am define the linear combinations of the X’s for the derived features Zm and can be thought of as weights
- i.e. how much influence a particular input variable Xi has on the derived feature Zm

ANNs
Now consider the expression for the output values Yk.
Parameters b0k represent another “bias” parameter
- These also help define the locations of decision boundaries
Parameters bk define the linear combinations of the derived features Zm and also represent weights
- i.e. how much influence a particular derived feature Zm has on the output
We can add these “weights” to the graphic representation of our ANN.
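For reference, the two expressions discussed on these slides can be written out as follows. This is a sketch that assumes the standard single-hidden-layer formulation: σ denotes the activation function, a_m and b_k are the weight vectors for derived feature Zm and output Yk, and T_k (a name taken from the gk(T) notation) is the linear combination of derived features that feeds output k.

```latex
\begin{align*}
Z_m &= \sigma\left(a_{0m} + a_m^{\top} X\right), & m &= 1, \dots, M \\
T_k &= b_{0k} + b_k^{\top} Z, & k &= 1, \dots, K \\
Y_k &= g_k(T), & k &= 1, \dots, K
\end{align*}
```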

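[Figure: the network diagram with weights a11, …, aMp labeling the connections from the inputs to the derived features, and weights b11, …, bKM labeling the connections from the derived features to the outputs]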

Simple Example of Feed-Forward ANN
Consider a simple example:
- 4 input variables (i.e. our Xi’s)
- 3 derived features (i.e. our Zm’s)
- 2 outcomes (i.e. our Yk’s)
Let’s look at the graphic representation of this ANN…

Simple Example of Feed-Forward ANN
[Figure: network diagram for the example]
- Four inputs: X1, X2, X3, and X4 (i.e. observed features in the data)
- Three derived features in the hidden layer: Z1, Z2, and Z3
- Two outputs: Y1 and Y2 (i.e. possible classes in the data)

Simple Example of Feed-Forward ANN
First consider the connections between the observed features X and the derived features in the hidden layer, Z1, Z2, and Z3.
We can add the “weights” for each of the X’s for the derived features to our graphical representation.
[Figure: the example network with a weight label on each connection from an input Xi to a derived feature Zm]

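Simple Example of Feed-Forward ANN
Consider the first derived feature, Z1.
It is created by applying our activation function, σ, to a linear combination of our observed features.
If our activation function is sigmoid, it takes the form
$\sigma(v) = \frac{1}{1 + e^{-v}}$
Thus we can see that our derived feature Z1 takes the form:
$Z_1 = \sigma(a_{01} + a_{11}X_1 + a_{12}X_2 + a_{13}X_3 + a_{14}X_4) = \frac{1}{1 + \exp\{-(a_{01} + a_{11}X_1 + a_{12}X_2 + a_{13}X_3 + a_{14}X_4)\}}$
[Figure: the example network with weights a11, a12, a13, and a14 on the connections from X1, X2, X3, and X4 to Z1]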

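Simple Example of Feed-Forward ANN
Given the form of the activation function, it is easy to write out the form of each of our three derived features Z1, Z2, and Z3:
$Z_m = \frac{1}{1 + \exp\{-(a_{0m} + a_{m1}X_1 + a_{m2}X_2 + a_{m3}X_3 + a_{m4}X_4)\}}, \quad m = 1, 2, 3$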
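As a rough illustration of this step, the sketch below computes the three derived features of the example from four inputs. The function names (`sigmoid`, `derived_features`) and all numeric values are hypothetical and chosen only for illustration; the indexing `a[m][i]` (derived feature first, input second) is an assumption about the weight layout.

```python
import math

def sigmoid(v):
    """Logistic sigmoid: 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + math.exp(-v))

def derived_features(x, a0, a):
    """Return [Z1, ..., ZM], where Zm = sigmoid(a0[m] + sum_i a[m][i] * x[i])."""
    return [sigmoid(a0_m + sum(w * xi for w, xi in zip(a_m, x)))
            for a0_m, a_m in zip(a0, a)]

# Hypothetical numbers, chosen only for illustration.
x = [1.0, 0.5, -1.0, 2.0]       # four inputs X1..X4
a0 = [0.0, 0.1, -0.1]           # one bias per derived feature
a = [[0.2, -0.4, 0.1, 0.3],     # weights into Z1
     [0.0, 0.0, 0.0, 0.0],      # weights into Z2 (all zero, so only the bias acts)
     [-0.5, 0.5, 0.5, -0.5]]    # weights into Z3
z = derived_features(x, a0, a)
```

Every entry of `z` lies strictly between 0 and 1, because the sigmoid squashes any linear combination into that interval.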

Simple Example of Feed-Forward ANN
Now that we have the form of our derived features, Z1, Z2, and Z3, we can consider the connections between our derived features and our outputs Yk.
Again, we can add the “weights” to the graphical representation of our ANN.
[Figure: the example network with a weight label on each connection from a derived feature Zm to an output Yk]

Simple Example of Feed-Forward ANN
Consider the first output class, Y1.
It is created by applying an output function, gk(T), to a linear combination of the derived features.
Since the activation function is sigmoid, it makes sense for our output function to be the softmax function
$g_k(T) = \frac{e^{T_k}}{\sum_{l=1}^{K} e^{T_l}}$
Thus we can see that our first output Y1 takes the form:
$Y_1 = \frac{e^{T_1}}{e^{T_1} + e^{T_2}}, \quad \text{where } T_k = b_{0k} + b_k^{\top} Z$
[Figure: the example network with weight labels on the connections from Z1, Z2, and Z3 to the outputs]

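Simple Example of Feed-Forward ANN
Given the form of the output function, it is easy to write out the form of the two outputs Y1 and Y2:
$Y_k = \frac{e^{T_k}}{e^{T_1} + e^{T_2}}, \quad k = 1, 2, \quad \text{where } T_k = b_{0k} + b_k^{\top} Z$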
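The output step of the example can be sketched the same way. The names `softmax` and `outputs`, the indexing `b[k][m]` (output first, derived feature second), and the numbers are all assumptions made for illustration. Subtracting the largest value before exponentiating is a common numerical safeguard and is not something the slides discuss.

```python
import math

def softmax(t):
    """Softmax: g_k(T) = exp(T_k) / sum_l exp(T_l)."""
    shift = max(t)                          # subtracting the max avoids overflow
    exps = [math.exp(tk - shift) for tk in t]
    total = sum(exps)
    return [e / total for e in exps]

def outputs(z, b0, b):
    """Return [Y1, ..., YK], where T_k = b0[k] + sum_m b[k][m] * z[m]."""
    t = [b0_k + sum(w * zm for w, zm in zip(b_k, z))
         for b0_k, b_k in zip(b0, b)]
    return softmax(t)

# Hypothetical numbers, chosen only for illustration.
z = [0.6, 0.5, 0.3]            # three derived features Z1..Z3
b0 = [0.0, 0.0]                # one bias per output
b = [[1.0, -1.0, 0.5],         # weights into Y1
     [-1.0, 1.0, 0.5]]         # weights into Y2
y = outputs(z, b0, b)
```

The two outputs are positive and sum to 1, so they can be read as estimated class probabilities.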

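Feed-Forward ANN
Denote the complete set of weights for the ANN by θ:
$\theta = \{a_{0m}, a_m;\ m = 1, \dots, M\} \cup \{b_{0k}, b_k;\ k = 1, \dots, K\}$
Goal: estimate the weights such that the model fits well.
Fitting well means minimizing a loss function, or error, R(θ).
For regression, we can use the sum-of-squared errors loss, where $f_k(x_i)$ is the predicted value of Yk for observation i:
$R(\theta) = \sum_{i=1}^{n} \sum_{k=1}^{K} \left(y_{ik} - f_k(x_i)\right)^2$
For classification, we can use either the sum-of-squared errors or the deviance (also known as cross-entropy):
$R(\theta) = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log f_k(x_i)$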
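The two error functions named on this slide are simple to compute. The sketch below is illustrative only: the function names and the toy values are hypothetical, `y` holds observed outcomes (0/1 class indicators here) and `f` holds the corresponding predictions, one row per observation.

```python
import math

def sse(y, f):
    """Sum-of-squared errors over observations i and outputs k."""
    return sum((yik - fik) ** 2
               for yi, fi in zip(y, f) for yik, fik in zip(yi, fi))

def cross_entropy(y, f):
    """Deviance / cross-entropy: -sum_i sum_k y_ik * log f_k(x_i)."""
    return -sum(yik * math.log(fik)
                for yi, fi in zip(y, f) for yik, fik in zip(yi, fi)
                if yik > 0)                 # terms with y_ik = 0 contribute nothing

# Two observations, two classes (toy values).
y = [[1, 0], [0, 1]]
f = [[0.8, 0.2], [0.4, 0.6]]
sse_value = sse(y, f)
ce_value = cross_entropy(y, f)
```

Both quantities shrink toward 0 as the predicted probability for the observed class approaches 1.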

Fitting a Feed-Forward ANN
The purpose of learning is to estimate the parameters/weights for the connections in the model (i.e. am and bk) that allow the model to reproduce the provided patterns of inputs and outputs.
An ANN learns a function of arbitrary complexity from examples (i.e. the training data).
The complexity depends on the number of hidden neurons.
Once the network is trained, we can use it to get the expected outputs for incomplete or slightly different data.

Fitting a Feed-Forward ANN
Basic idea of the learning phase: back propagation, one method for learning the parameters/weights in a feed-forward ANN
- Provide the observed inputs and outputs to the network
- Calculate the estimated outputs
- Back propagate the calculated error
- Repeat the process iteratively for a specified number of iterations
Under back propagation, weights are updated using the gradient descent method
- Follow the steepest path down the error function in order to minimize it

Illustration of Gradient Descent
[Figure: error surface R(θ) plotted over two weights, w0 and w1. The direction of steepest descent is the direction of the negative gradient; a step in that direction moves from the original point in weight space to a new point in weight space.]
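The idea in the figure can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the error surface `error`, its gradient, the starting point, and the step size `rate` are all hypothetical choices, picked so that the minimum (at w0 = 1, w1 = -2) is known in advance.

```python
def error(w):
    """A hypothetical bowl-shaped error surface over two weights."""
    w0, w1 = w
    return (w0 - 1.0) ** 2 + 2.0 * (w1 + 2.0) ** 2

def gradient(w):
    """Partial derivatives of `error` with respect to w0 and w1."""
    w0, w1 = w
    return [2.0 * (w0 - 1.0), 4.0 * (w1 + 2.0)]

def descend(w, rate, steps):
    """Repeatedly step in the direction of the negative gradient."""
    path = [list(w)]
    for _ in range(steps):
        g = gradient(w)
        w = [wj - rate * gj for wj, gj in zip(w, g)]   # original point -> new point
        path.append(list(w))
    return w, path

w_final, path = descend([4.0, 3.0], rate=0.1, steps=200)
```

Each entry of `path` is one point in weight space; on this surface the error decreases at every step and the final point ends up very close to the minimum.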

Back Propagation
(1) Initialize the weights with random values (generally in (-1, 1))
(2) For a specified number of training iterations, do:
For each input and ideal (expected) output pattern:
i. Calculate the actual output from the input
ii. Calculate the output neurons’ error
iii. Calculate the hidden neurons’ error
iv. Calculate the weight variations (delta)
v. Adjust the current weights using the accumulated deltas
(3) Iterate until some chosen stopping point

Back-Propagation Using Gradient Descent
i. Calculate the actual output from the input (rth iteration)

Back Propagation Using Gradient Descent
ii. Calculate the output neurons’ error
iii. Calculate the hidden neurons’ error
Both are based on our choice of model fit/error function (e.g. SSE).
Write the error in terms of the weights…

Back-Propagation Using Gradient Descent
The goal is to minimize the error term, so take the partial derivative with respect to the weights.
This must be done for each weight in the ANN.
Start with the weights on our hidden layer variables.

Back-Propagation Using Gradient Descent
For SSE…
Use the chain rule and write in terms of the predicted y, Tk, and then bkm.

Back-Propagation Using Gradient Descent
For SSE…

Back-Propagation Using Gradient Descent
Repeat this idea for the input weights…
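For SSE, the chain-rule calculation described on these slides can be written out as follows. This is a sketch of the standard textbook result rather than a transcription of the slides, and it assumes that each output function g_k depends only on its own T_k (true for an identity or elementwise sigmoid output; for softmax the derivative picks up extra cross terms). Here i indexes observations, $z_{mi} = \sigma(a_{0m} + a_m^{\top} x_i)$, and $R(\theta) = \sum_i R_i$.

```latex
\begin{align*}
R_i &= \sum_{k=1}^{K} \left(y_{ik} - f_k(x_i)\right)^2,
  \qquad f_k(x_i) = g_k(T_{ki}), \quad T_{ki} = b_{0k} + b_k^{\top} z_i \\
\frac{\partial R_i}{\partial b_{km}}
  &= -2\left(y_{ik} - f_k(x_i)\right) g_k'(T_{ki})\, z_{mi} \\
\frac{\partial R_i}{\partial a_{ml}}
  &= -\sum_{k=1}^{K} 2\left(y_{ik} - f_k(x_i)\right) g_k'(T_{ki})\, b_{km}\,
     \sigma'\!\left(a_{0m} + a_m^{\top} x_i\right) x_{il}
\end{align*}
```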

Back Propagation Using Gradient Descent
ii. Calculate the output neurons’ error
- This comes from the derivative with respect to the hidden layer weights
iii. Calculate the hidden neurons’ error
- This comes from the derivative with respect to the input weights

Back Propagation
iv. Calculate the weight variations (delta)
- These are just the derivatives of our error function with respect to the weights

Learning Rate
We also want to scale the step sizes the algorithm takes.
This “scale” value is also known as the learning rate, and it controls how far we descend along the gradient.
In general it is a constant selected by the user.
This learning rate, γr, is multiplied by the derivatives.

Update at the (r+1)th Iteration
v. Add the weight variations to the accumulated delta
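Steps i to v can be put together in a short sketch. It is illustrative only and rests on stated assumptions: a sigmoid hidden layer, an identity output function (so the SSE derivatives are exact), batch updates that accumulate the deltas over all patterns before adjusting the weights, and a fixed learning rate. All function names, the toy data, and the settings are hypothetical.

```python
import math
import random

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, a0, a, b0, b):
    """Forward pass: sigmoid hidden layer, identity output (an assumption)."""
    z = [sigmoid(a0[m] + sum(a[m][l] * x[l] for l in range(len(x))))
         for m in range(len(a))]
    f = [b0[k] + sum(b[k][m] * z[m] for m in range(len(z)))
         for k in range(len(b))]
    return z, f

def total_sse(data, a0, a, b0, b):
    return sum((yk - fk) ** 2
               for x, y in data
               for yk, fk in zip(y, forward(x, a0, a, b0, b)[1]))

def backprop_step(data, a0, a, b0, b, rate):
    """One batch gradient-descent update of all weights, in place."""
    M, K, p = len(a), len(b), len(a[0])
    g_a0 = [0.0] * M
    g_a = [[0.0] * p for _ in range(M)]
    g_b0 = [0.0] * K
    g_b = [[0.0] * M for _ in range(K)]
    for x, y in data:
        z, f = forward(x, a0, a, b0, b)                    # i. actual output
        delta = [-2.0 * (y[k] - f[k]) for k in range(K)]   # ii. output errors
        s = [z[m] * (1.0 - z[m]) * sum(b[k][m] * delta[k] for k in range(K))
             for m in range(M)]                            # iii. hidden errors
        for k in range(K):                                 # iv. accumulate deltas
            g_b0[k] += delta[k]
            for m in range(M):
                g_b[k][m] += delta[k] * z[m]
        for m in range(M):
            g_a0[m] += s[m]
            for l in range(p):
                g_a[m][l] += s[m] * x[l]
    for k in range(K):                                     # v. adjust the weights
        b0[k] -= rate * g_b0[k]
        for m in range(M):
            b[k][m] -= rate * g_b[k][m]
    for m in range(M):
        a0[m] -= rate * g_a0[m]
        for l in range(p):
            a[m][l] -= rate * g_a[m][l]

random.seed(0)
p, M, K = 2, 3, 1
a0 = [random.uniform(-1, 1) for _ in range(M)]
a = [[random.uniform(-1, 1) for _ in range(p)] for _ in range(M)]
b0 = [random.uniform(-1, 1) for _ in range(K)]
b = [[random.uniform(-1, 1) for _ in range(M)] for _ in range(K)]
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]   # toy XOR-style targets

before = total_sse(data, a0, a, b0, b)
for _ in range(3000):
    backprop_step(data, a0, a, b0, b, rate=0.01)
after = total_sse(data, a0, a, b0, b)
```

Comparing `before` and `after` shows the training error falling as the weights are updated. How low it gets depends on the starting weights, the learning rate, and the number of iterations.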

Back Propagation
In the forward pass, the current weights are fixed and the predicted values come from these weights.
In the backward pass, the errors are estimated and used to calculate the gradient that updates the weights.
The learning rate γr is often taken to be fixed, though it can be optimized to minimize the error at each iteration.
One important note: since the gradient descent algorithm requires taking derivatives, the activation, output, and error functions must be differentiable with respect to the weights.

Considerations When Fitting ANNs
Training ANNs is a bit of an art form, and there are things that must be taken into consideration.
Considerations when training a network:
(1) Choice of starting weights
(2) Over-fitting
(3) Scaling inputs
(4) Number of hidden units and layers
(5) Multiple minima

Considerations When Fitting ANNs
(1) Choice of starting weights
- If the weights are near 0, the operative part of the sigmoid function is approximately linear
- Initial weights are generally chosen to be near 0, so that the model starts out nearly linear
- The model becomes progressively more non-linear as the weights increase
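The near-linearity of the sigmoid around 0 is easy to check numerically. The sketch below compares the sigmoid with its first-order Taylor expansion at 0, which is 0.5 + v/4; the function names and the evaluation points are hypothetical choices for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def linear_approx(v):
    """First-order Taylor expansion of the sigmoid around 0."""
    return 0.5 + v / 4.0

# Absolute gap between the sigmoid and the straight line at a few points.
gaps = {v: abs(sigmoid(v) - linear_approx(v)) for v in (0.1, 0.5, 1.0, 3.0)}
```

The gap is tiny for small inputs and grows as the input moves away from 0, which is why small starting weights give a nearly linear model and larger weights give a more non-linear one.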

Considerations When Fitting ANNs
(2) Over-fitting
- NNs often have too many weights and will over-fit the data at the global minimum of the error function R(θ)
- One solution is weight decay, which is analogous to ridge regression
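Weight decay can be sketched as adding a ridge-style penalty to the error. In the hypothetical code below, the penalty is the sum of squared weights, multiplied by a tuning parameter `lam`; leaving the bias terms out of the penalty is an assumption made here, and the function names and numbers are illustrative only.

```python
def decay_penalty(a, b):
    """Sum of squared weights (biases are left unpenalized, by assumption)."""
    return (sum(w * w for row in a for w in row)
            + sum(w * w for row in b for w in row))

def penalized_error(error, a, b, lam):
    """Error plus lam times the weight-decay penalty."""
    return error + lam * decay_penalty(a, b)

# Hypothetical weights: two derived features, two inputs, one output.
a = [[1.0, -2.0], [0.0, 0.5]]
b = [[3.0, 0.0]]
penalty = decay_penalty(a, b)
total = penalized_error(2.0, a, b, lam=0.1)
```

Larger values of `lam` penalize large weights more heavily and so pull the fitted weights toward 0; `lam = 0` recovers the unpenalized error.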

Considerations When Fitting ANNs
(3) Scaling inputs
- As with the other methods we’ve discussed, inputs on very different scales can greatly impact the quality of the model
- Best to standardize the inputs prior to training the model
(4) Number of hidden units and layers
- Typically have between 5 and 100 hidden units
- Generally better to have too many hidden units than too few
- Too few results in less model flexibility
- If there are too many, some weights can be shrunk toward 0 with appropriate regularization
(5) Multiple minima
- R(θ) is non-convex and has many local minima
- Thus the final solution depends on the choice of starting weights
- Try a number of random starting points and pick the one with the lowest penalized error
- Or average the predictions over a collection of networks (more on that later)
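Standardizing the inputs, as recommended under scaling, can be sketched as follows. The helper name `standardize` and the toy data are hypothetical, and using the sample standard deviation (rather than the population version) is an assumption.

```python
import statistics

def standardize(rows):
    """Center each input column to mean 0 and scale it to standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.stdev(c) for c in cols]      # sample standard deviation
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

# Two inputs on very different scales (toy values).
rows = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
scaled = standardize(rows)
```

After standardizing, both columns take comparable values, so neither input dominates the linear combinations simply because of its units.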

A Couple of Extra Points
Models do not have to have a hidden layer
- A model with no hidden layers is called a perceptron
- If we are using the sigmoid activation function, this is VERY similar to multinomial logistic regression
- If using the identity link, this is VERY similar to linear regression
By the same token, models can have more than one hidden layer
- We may decide to have 5 hidden layers, each with a different number of derived features
Not all features must be connected
- Leaving out a connection is equivalent to placing zero weight on it, whether it is a connection between an input and a derived feature or a connection between a derived feature and an output

A Classic Problem
The US post office needs to be able to sort mail using handwritten zip codes on letters.
There are too many letters to sort by hand… can we develop a NN to recognize the numbers in a zip code?