1
Neural Networks Part 2
Dan Simon, Cleveland State University
2
Outline:
- Preprocessing
  - Input normalization
  - Feature selection
  - Principal component analysis (PCA)
- Cascade Correlation
3
Preprocessing: first things first.
Preprocessing pipeline: input normalization → feature selection → neural network training. If you use a poor preprocessing algorithm, then it doesn't matter what type of neural network you use. And if you use a good preprocessing algorithm, then it doesn't matter what type of neural network you use.
4
Input normalization for independent variables
Training data: x_ni, where n ∈ [1, N] (N = number of training samples) and i ∈ [1, d] (d = input dimension).
Mean and variance of each input dimension: \bar{x}_i = \frac{1}{N} \sum_{n=1}^{N} x_{ni}, \qquad \sigma_i^2 = \frac{1}{N-1} \sum_{n=1}^{N} (x_{ni} - \bar{x}_i)^2
Normalized inputs: \tilde{x}_{ni} = (x_{ni} - \bar{x}_i) / \sigma_i
Sometimes weight updating takes care of normalization, but what about weight initialization? Also, recall that RBF activation is determined by the Euclidean distance between the input and the center, which we don't want to be dominated by a single dimension.
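For concreteness, here is a minimal NumPy sketch of this per-dimension normalization (not from the slides); the function name, the (N, d) array layout, and the N − 1 variance denominator are assumptions.

```python
import numpy as np

def normalize_inputs(X):
    """Normalize each input dimension of an (N, d) training matrix to zero mean, unit variance."""
    mean = X.mean(axis=0)               # per-dimension mean
    std = X.std(axis=0, ddof=1)         # per-dimension standard deviation (N - 1 denominator)
    return (X - mean) / std, mean, std  # keep mean/std so new data can be normalized the same way

# Example: 100 samples, 3 input dimensions with wildly different scales
X = np.random.randn(100, 3) * np.array([1.0, 50.0, 0.01])
X_norm, mean, std = normalize_inputs(X)
print(X_norm.mean(axis=0), X_norm.std(axis=0, ddof=1))  # approximately 0 and 1
```

Returning the training mean and standard deviation matters because test inputs (and RBF centers) must be normalized with the same constants.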
5
Input normalization for correlated variables (whitening)
Training data: x_n, n ∈ [1, N] (N = number of training samples), with each x_n ∈ R^d (d = input dimension).
Sample mean and covariance: \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \Sigma = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T
The eigendecomposition \Sigma = U \Lambda U^T gives the whitening transform: \tilde{x}_n = \Lambda^{-1/2} U^T (x_n - \bar{x})
What is the mean and covariance of the normalized inputs? Zero mean and identity covariance, so no single dimension or direction dominates.
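A minimal NumPy sketch of this whitening transform; the function and array names are illustrative, and a small eps guards against near-zero eigenvalues.

```python
import numpy as np

def whiten(X, eps=1e-12):
    """Whiten an (N, d) data matrix so it has zero mean and (approximately) identity covariance."""
    Xc = X - X.mean(axis=0)                   # center the data
    cov = np.cov(Xc, rowvar=False)            # d x d sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)    # cov = U diag(eigvals) U^T
    return Xc @ (eigvecs / np.sqrt(eigvals + eps))   # x_tilde = Lambda^{-1/2} U^T (x - xbar), row-wise

# Correlated 2-D data
X = np.random.randn(500, 2) @ np.array([[2.0, 1.5], [0.0, 0.5]])
print(np.cov(whiten(X), rowvar=False))        # approximately the 2 x 2 identity matrix
```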
6
[Figure: scatter plots of the original distribution and the whitened distribution.]
7
Feature Selection: What features of the input data should we use as inputs to the neural network?
Example: a 256 × 256 bitmap for character recognition gives 65,536 input neurons! One solution: problem-specific clustering of features, for example, "super pixels" in image processing. In the illustrated example we went from 256 to 64 features.
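One simple way to form "super pixels" is block averaging; the sketch below assumes a 16 × 16 bitmap and 2 × 2 blocks purely to match the 256-to-64 reduction.

```python
import numpy as np

def super_pixels(bitmap, block=2):
    """Average block x block regions of an image to form a reduced set of 'super pixel' features."""
    h, w = bitmap.shape
    return bitmap.reshape(h // block, block, w // block, block).mean(axis=(1, 3))

bitmap = (np.random.rand(16, 16) > 0.5).astype(float)   # stand-in for a 16 x 16 binary character
features = super_pixels(bitmap).ravel()
print(features.shape)                                    # (64,) -- 256 pixels reduced to 64 features
```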
8
Feature Selection: What features of the input data should we use as inputs to the neural network?
Expert knowledge: use clever problem-dependent approaches. [Figure: problem-dependent features extracted from the character bitmap.] We went from 64 to 16 features. The old features were binary, and the new features are not.
9
Clever problem-dependent approach to feature selection
Example: use ECG data to diagnose heart disease. We have 24 hours of data sampled at 500 Hz. Cardiologists tell us that primary indicators include:
- P wave duration
- P wave amplitude
- P wave energy
- P wave inflection point
This gives us a neural network with four inputs.
10
Feature Selection: If you reduce the number of features, make sure you don't lose too much information!
11
Feature Selection in the case of no expert knowledge
Brute-force search: if we want to reduce the number of features from M to N, we find all possible N-element subsets of the M features, and check neural network performance for each subset. How many N-element subsets can be taken from an M-element set? The binomial coefficient:
\binom{M}{N} = \frac{M!}{N! \, (M-N)!}
Example: M = 64, N = 8 gives \binom{64}{8} ≈ 4.4 billion. Hmm, I wonder if there are better ways…
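The subset count is easy to check directly in Python; the train_and_score routine in the comment is hypothetical, standing in for training and evaluating a network on one feature subset.

```python
import math
from itertools import combinations

M, N = 64, 8
print(math.comb(M, N))   # 4426165368 -- about 4.4 billion subsets

# The brute-force search itself would look like this (clearly infeasible for M = 64, N = 8):
# best = max(combinations(range(M), N), key=lambda subset: train_and_score(subset))
```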
12
Feature Selection: Branch and Bound Method. This is based on the idea that deleting features cannot improve performance. Suppose we want to reduce the number of features from 5 to 2:
- First, pick a "reasonable" pair of features (say, 1 and 2). Train an ANN with those features. This gives a performance threshold J.
- Create a tree of eliminated features. Move down the tree to accumulate deleted features.
- Evaluate ANN performance P at each node of the tree. If P < J, there is no need to consider that branch any further.
13
Branch and bound: we want a reduction from 5 to 2 features. This is an optimal method, but it may require a lot of effort. [Figure: tree of deleted features, with nodes A, B, and C marked.] Use features 1 and 2 to obtain the performance at node A. We find that B is worse than A, so there is no need to evaluate below B. We also find that C is worse than A, so there is no need to evaluate below C.
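A minimal sketch of this branch-and-bound search under the stated monotonicity assumption; evaluate is a hypothetical callback that trains an ANN on the given feature subset and returns its performance (higher is better), and the toy score at the end merely stands in for it.

```python
def branch_and_bound(features, keep, evaluate):
    """Find the best `keep`-sized subset, assuming deleting features never improves evaluate()."""
    features = list(features)
    best_subset = features[:keep]        # a "reasonable" starting subset
    best_score = evaluate(best_subset)   # performance threshold J

    def recurse(remaining, start):
        nonlocal best_subset, best_score
        score = evaluate(remaining)
        if score <= best_score:          # bound: deleting more features cannot help, so prune
            return
        if len(remaining) == keep:       # leaf: a better candidate subset
            best_subset, best_score = list(remaining), score
            return
        for i in range(start, keep + 1): # branch: delete one more feature (indices chosen to avoid duplicates)
            recurse(remaining[:i] + remaining[i + 1:], i)

    recurse(features, 0)
    return best_subset, best_score

# Toy monotone score standing in for trained-ANN performance
importance = [0.9, 0.1, 0.7, 0.3, 0.5]
print(branch_and_bound(range(5), keep=2, evaluate=lambda s: sum(importance[f] for f in s)))  # best subset is [0, 2]
```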
14
Feature Selection: Sequential Forward Selection. Find the feature f1 that gives the best performance. Then find the feature f2 such that (f1, f2) gives the best performance. Repeat for as many features as desired. Example: find the best 3 out of 5 available features. Evaluating features {1, 2, 3, 4, 5} individually, feature 2 is best; among the pairs that contain 2, {2, 4} is best; among the triples that contain {2, 4}, {2, 4, 1} is best.
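A minimal sketch of sequential forward selection; evaluate is again a hypothetical callback standing in for training a network on the given feature subset and returning its performance.

```python
def forward_selection(num_features, num_keep, evaluate):
    """Greedily add the feature that most improves evaluate(subset) until num_keep features are chosen."""
    selected, remaining = [], list(range(num_features))
    while len(selected) < num_keep:
        best = max(remaining, key=lambda f: evaluate(selected + [f]))  # best single feature to add
        selected.append(best)
        remaining.remove(best)
    return selected

importance = [0.9, 0.1, 0.7, 0.3, 0.5]                                     # toy stand-in score
print(forward_selection(5, 3, lambda s: sum(importance[f] for f in s)))    # [0, 2, 4]
```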
15
Problem with sequential forward selection:
There may be two features such that either one alone provides little information, but the combination provides a lot of information. [Figure: two classes arranged in the (x1, x2) plane so that neither feature separates them on its own.] Neither feature, if used alone, provides information about the class. But both features in combination provide a lot of information.
16
Feature Selection: Sequential Backward Elimination. Start with all features. Eliminate the one that provides the least information. Repeat until the desired number of features is obtained. Example: find the best 3 out of 5 available features. Use all features {1, 2, 3, 4, 5} for the best performance. Among the four-feature subsets, eliminating feature 4 results in the least loss of performance, leaving {1, 2, 3, 5}. Among the three-feature subsets of that, eliminating feature 1 results in the least loss of performance, leaving {2, 3, 5}.
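The mirror-image sketch for sequential backward elimination, with the same kind of hypothetical evaluate callback.

```python
def backward_elimination(num_features, num_keep, evaluate):
    """Greedily eliminate the feature whose removal loses the least performance."""
    selected = list(range(num_features))
    while len(selected) > num_keep:
        # Drop the feature whose removal leaves the best-performing subset
        drop = max(selected, key=lambda f: evaluate([g for g in selected if g != f]))
        selected.remove(drop)
    return selected
```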
17
Principal Component Analysis (PCA)
This is not a feature selection method, but a feature reduction method. [Figure: two classes in the (x1, x2) plane.] This is a reduced-dimension problem, but no single feature gives us enough information about the classes.
18
Principal Component Analysis (PCA)
We are given input vectors x_n (n = 1, …, N), and each vector contains d elements. Goal: map the x_n vectors to z_n vectors, where each z_n vector has M elements, and M < d.
Let {u_i} be orthonormal basis vectors, so that x_n = \sum_{i=1}^{d} z_{ni} u_i. Since u_i^T u_k = \delta_{ik}, we see that z_{ni} = u_i^T x_n.
Keep only the first M coefficients and replace the rest with constants b_i: \tilde{x}_n = \sum_{i=1}^{M} z_{ni} u_i + \sum_{i=M+1}^{d} b_i u_i.
We want to minimize E_M = \frac{1}{2} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2 = \frac{1}{2} \sum_{n=1}^{N} \sum_{i=M+1}^{d} (z_{ni} - b_i)^2.
19
Minimizing E_M with respect to each b_i gives b_i = u_i^T \bar{x}, where \bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n. Substituting back in gives E_M = \frac{1}{2} \sum_{i=M+1}^{d} u_i^T P u_i, where P = \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T. This defines P (a d × d matrix).
We found the best {b_i} values. What are the best {u_i} vectors?
20
We want to minimize E_M with respect to {u_i}. {u_i} = 0 would work… so we need to constrain {u_i} to be a set of orthonormal basis vectors; that is, u_i^T u_k = \delta_{ik}. Constrained optimization with Lagrange multipliers \mu_{ik} (taking M = M^T without loss of generality):
\hat{E}_M = \frac{1}{2} \mathrm{tr}(U^T P U) - \frac{1}{2} \mathrm{tr}\big( M (U^T U - I) \big)
where U = [u_{M+1} \cdots u_d], M = [\mu_{ik}], and I is the identity matrix. U is a d × (d − M) matrix and M is a (d − M) × (d − M) matrix.
21
Setting the derivative of \hat{E}_M with respect to U to zero gives P U = U M; since U^T U = I, this means U^T P U = M, with dimensions [(d − M) × d] [d × d] [d × (d − M)] = [(d − M) × (d − M)]. Add columns to U, and expand M. We now have an eigenvector equation while still satisfying the original U^T P U = M equation. PCA solution: {u_i} = eigenvectors of P.
22
The error is half of the sum of the (d − M) smallest eigenvalues of P: E_M = \frac{1}{2} \sum_{i=M+1}^{d} \lambda_i, where \lambda_{M+1}, …, \lambda_d are the smallest eigenvalues. The {u_i} are the principal components. PCA is also known as the Karhunen-Loève transformation.
23
[Figure] PCA illustration in two dimensions: all data points are projected onto the u_1 direction, and any variation in the u_2 direction is ignored.
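A minimal NumPy sketch of PCA as derived above, assuming P = Σ_n (x_n − x̄)(x_n − x̄)^T and projecting mean-centered data onto the top-M eigenvectors; the names are illustrative.

```python
import numpy as np

def pca(X, M):
    """Project (N, d) data onto its first M principal components and report the residual error E_M."""
    Xc = X - X.mean(axis=0)                     # center the data
    P = Xc.T @ Xc                               # the d x d matrix P from the derivation
    eigvals, eigvecs = np.linalg.eigh(P)        # eigenvalues returned in ascending order
    U = eigvecs[:, ::-1][:, :M]                 # u_1..u_M = eigenvectors with the largest eigenvalues
    Z = Xc @ U                                  # projections z_ni = u_i^T (x_n - xbar)
    E_M = 0.5 * eigvals[:X.shape[1] - M].sum()  # half the sum of the (d - M) smallest eigenvalues
    return Z, U, E_M

X = np.random.randn(200, 5) @ np.random.randn(5, 5)   # correlated 5-D data
Z, U, E_M = pca(X, M=2)
print(Z.shape, E_M)                                   # (200, 2) and the minimized error
```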
24
Cascade Correlation Scott Fahlman, 1988
This gives a way to automatically adjust the network size. Also, it uses gradient-based weight optimization without the complexity of backpropagation. Begin with no hidden-layer neurons. Add hidden neurons one at a time. After adding hidden neuron Hi, optimize the weights from “upstream” neurons to Hi to maximize the effect of Hi on the outputs. Optimize the output weights of Hi to minimize training error.
25
Cascade Correlation Example
Two inputs and two outputs. Step 1: start with a two-layer network (no hidden layers). [Figure: inputs x1 and x2 plus a bias input of 1 connected directly to outputs y1 and y2.] Here y1 = f(x · w1), where w1 is the vector of weights from the inputs to y1; w1 can be trained with a gradient method. Similarly, w2 can be trained with a gradient method.
26
Cascade Correlation Example
Two inputs and two outputs. Step 2: add a hidden neuron H1 and maximize the correlation between the H1 output and the training error. [Figure: H1 receives the inputs and the bias; these weights are updated, while the weights into the output neurons are not.] Let z_n be the H1 output for training sample n, and define
S = \sum_{o=1}^{n_o} \sum_{n=1}^{N} (z_n - \bar{z})(e_{no} - \bar{e}_o)
where n_o = number of outputs, N = number of training samples, e = training error before H1 is added, and {w_i} = weights from the inputs to H1. Use a gradient method to maximize |S| with respect to {w_i}.
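A minimal NumPy sketch of the correlation measure S for a candidate hidden unit, written to match the form above (Fahlman's original paper places the absolute value inside the sum over outputs); the array names are illustrative.

```python
import numpy as np

def candidate_score(z, E):
    """S = sum over outputs o and samples n of (z_n - zbar)(e_no - ebar_o)."""
    zc = z - z.mean()             # candidate unit's output with its mean removed
    Ec = E - E.mean(axis=0)       # training errors with per-output means removed
    return (zc @ Ec).sum()        # a gradient method then maximizes |S| w.r.t. the candidate's input weights

# Toy check: 50 training samples, 2 outputs
rng = np.random.default_rng(0)
z, E = rng.standard_normal(50), rng.standard_normal((50, 2))
print(abs(candidate_score(z, E)))   # the quantity |S| to be maximized
```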
27
Cascade Correlation Example
Two inputs and two outputs. Step 3: optimize the output weights. [Figure: the input weights of H1 are frozen (not updated); the weights into the output neurons are updated.] Use a gradient method to minimize the training error with respect to the weights that are connected to the output neurons.
28
Cascade Correlation Example
Two inputs and two outputs. Step 4: add another hidden neuron H2 and repeat Step 2: maximize the correlation S between the H2 output z2 and the training error. [Figure: H2 receives the inputs, the bias, and the H1 output; only these weights into H2 are updated.] Here n_o = number of outputs, N = number of training samples, e = training error before H2 is added, and {w_i} = weights "upstream" from H2. Use a gradient method to maximize |S| with respect to {w_i}.
29
Cascade Correlation Example Two inputs and two outputs
Step 5: optimize the output weights. [Figure: all hidden-unit input weights are frozen; only the weights into the output neurons are updated.] Use a gradient method to minimize the training error with respect to the weights that are connected to the output neurons. Repeat the previous two steps until the desired performance is obtained (a simplified sketch of the whole loop follows below):
- Add a hidden neuron Hi.
- Maximize the correlation between the Hi output and the training error with respect to the Hi input weights.
- Freeze the input weights.
- Minimize the training error with respect to the output weights.
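A much-simplified, runnable sketch of the whole loop. It takes two labeled liberties: the output weights are fit by linear least squares rather than a gradient method, and the candidate units use tanh activations; the toy data and all names are illustrative rather than taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_outputs(A, Y):
    """Fit linear output weights by least squares (a shortcut for the gradient step on the slides)."""
    W, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return W

def train_candidate(A, E, steps=500, lr=0.01):
    """Gradient ascent on |S|, the correlation between a tanh candidate unit and the errors E."""
    w = 0.1 * rng.standard_normal(A.shape[1])
    c = (E - E.mean(axis=0)).sum(axis=1)            # mean-removed errors, summed over outputs
    for _ in range(steps):
        z = np.tanh(A @ w)
        S = (z - z.mean()) @ c
        grad = ((1.0 - z**2) * c) @ A               # dS/dw (the mean term vanishes because c sums to zero)
        w += lr * np.sign(S) * grad                 # ascend |S|
    return w                                        # these input weights are then frozen

# Toy problem: 2 inputs, 2 outputs
X = rng.standard_normal((200, 2))
Y = np.column_stack([np.sin(X[:, 0] + X[:, 1]), X[:, 0] * X[:, 1]])

A = np.column_stack([X, np.ones(len(X))])           # inputs plus bias feed every unit
for _ in range(3):                                  # add three hidden neurons, one at a time
    W_out = fit_outputs(A, Y)                       # minimize training error w.r.t. output weights
    E = A @ W_out - Y                               # residual training errors
    w_hid = train_candidate(A, E)                   # maximize |S| w.r.t. the new unit's input weights
    A = np.column_stack([A, np.tanh(A @ w_hid)])    # cascade: the new unit feeds all later units and outputs

W_out = fit_outputs(A, Y)
print("final training MSE:", np.mean((A @ W_out - Y) ** 2))
```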
30
References:
- C. Bishop, Neural Networks for Pattern Recognition
- D. Simon, Optimal State Estimation (Chapter 1)