1
Fundamentals of Artificial Neural Networks
Rosenblatt’s perceptron: a linear combination of inputs connected to an output. Linear regression and classification by linear discriminants. An ANN finds the optimum linear model by iterative improvement of the weights. More efficient methods (linear least squares) find the optimum model by solving a linear system of equations.
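As a concrete illustration of the last point, here is a minimal sketch (Python with NumPy, using a made-up 1-D dataset) of finding the optimal linear model in one step by solving the least-squares system rather than by iterative weight updates:

```python
import numpy as np

# Hypothetical 1-D dataset: y is roughly 2*x + 1 plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=50)

# Augment the inputs with a constant column so the bias w0 is just another weight.
X = np.column_stack([x, np.ones_like(x)])

# One-step optimum: solve the linear least-squares system X w ~= y.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print("slope w =", round(w[0], 3), " intercept w0 =", round(w[1], 3))
```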
2
What can a perceptron do? Fit a line to data: y = wx + w0
Use y = wx + w0 as a discriminant; S = sigmoid(y) models P(C1|x). [Figures: the perceptron as a line fit and as a discriminant, with bias input x0 = +1.] S(y) = 1/(1 + e^-y): S = 0.5 when y = 0, S → 0 for large negative y, and S → 1 for large positive y. (Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e, © 2010 The MIT Press.)
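A minimal sketch (Python with NumPy; the weights here are arbitrary illustrative values, not fitted) of the two uses of the perceptron output described above: the raw linear output y = wx + w0 for regression, and sigmoid(y) as an estimate of P(C1|x) for classification.

```python
import numpy as np

def sigmoid(a):
    """Logistic function S(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

w, w0 = 1.0, -0.5                 # illustrative weights, not fitted to anything
x = np.array([-4.0, 0.5, 4.0])    # a few input values

y = w * x + w0                    # regression view: the line itself
p = sigmoid(y)                    # classification view: P(C1 | x)

print(sigmoid(0.0))               # 0.5 at the decision boundary y = 0
print(p)                          # near 0 for very negative y, near 1 for very positive y
```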
3
Multivariate Linear Regression
x is the input vector and w is the weight vector. The scalar output y is the inner product of w and x: y = w^T x. This fits a hyperplane to data in d dimensions.
4
Multiclass classification by multivariate linear discriminant
w_i is the weight vector connecting the input vector x to component y_i of the output vector y. Transform y_i by the sigmoid. Choose C_i if y_i is the largest output. There are K(d+1) weights in total.
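A small sketch (Python with NumPy; the random weight matrix is a stand-in for trained weights) of the K-class linear discriminant just described: K weight vectors of length d + 1, sigmoid-transformed outputs, and the class chosen by the largest output.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

K, d = 3, 4                           # K classes, d attributes
rng = np.random.default_rng(1)
W = rng.normal(size=(K, d + 1))       # K(d+1) weights: one row of length d+1 per class (untrained)

x = rng.normal(size=d)                # one example
x_aug = np.append(x, 1.0)             # x0 = +1 carries the bias term

y = sigmoid(W @ x_aug)                # y_i = sigmoid(w_i^T x)
print(y, "-> choose class", int(np.argmax(y)))
```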
5
Boolean AND: simple classification problem with linear discriminant
Truth table for Boolean AND:
x1  x2  r
0   0   0
0   1   0
1   0   0
1   1   1
How did we find this linear discriminant?
6
System of inequalities “suggest” weight values
Linear discriminant: w^T x < 0 → r = 0; w^T x > 0 → r = 1.
x1  x2  r   required sign of w^T x   choice
0   0   0   w0 < 0                   w0 = -1.5
0   1   0   w2 + w0 < 0              w1 = 1
1   0   0   w1 + w0 < 0              w2 = 1
1   1   1   w1 + w2 + w0 > 0
The linear discriminant w^T x = 0 gives x1 + x2 - 1.5 = 0, the boundary in the (x1, x2) plane; sigmoid(w^T x) > 0.5 on the r = 1 side.
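The chosen weights can be checked directly. This sketch (plain Python) evaluates w^T x = x1 + x2 - 1.5 on all four inputs and confirms that sigmoid(w^T x) exceeds 0.5 only for the r = 1 case.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

w0, w1, w2 = -1.5, 1.0, 1.0            # weights chosen from the inequalities above

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    a = w1 * x1 + w2 * x2 + w0         # w^T x with x0 = +1
    r_hat = 1 if sigmoid(a) > 0.5 else 0
    print((x1, x2), "->", r_hat)       # 1 only for (1, 1): Boolean AND
```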
7
Assignment 3: Derive a linear discriminant for Boolean OR. Show the following: the truth table, the system of inequalities for the weights, the linear discriminant, and a graphical representation of the problem with its solution.
8
Boolean XOR: linearly inseparable 2D binary classification problem
[Data table and graphical representation of the XOR problem.] Transform to a linearly separable feature space.
9
Solution of XOR in Gaussian feature space
f1 = exp(-||X - [1,1]||^2), f2 = exp(-||X - [0,0]||^2)
Feature values for the XOR data:
X = (1,1): f1 = 1,    f2 ≈ 0.14   (r = 0)
X = (0,1): f1 ≈ 0.37, f2 ≈ 0.37   (r = 1)
X = (1,0): f1 ≈ 0.37, f2 ≈ 0.37   (r = 1)
X = (0,0): f1 ≈ 0.14, f2 = 1      (r = 0)
This transformation puts examples (0,1) and (1,0) at the same point in feature space. Three points in a 2D space are always linearly separable.
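A sketch (Python with NumPy) of the Gaussian feature transform above. It shows that (0,1) and (1,0) map to the same point and that a single line in the (f1, f2) plane now separates the two classes.

```python
import numpy as np

c1, c2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])   # Gaussian centers

def features(x):
    x = np.asarray(x, dtype=float)
    f1 = np.exp(-np.sum((x - c1) ** 2))
    f2 = np.exp(-np.sum((x - c2) ** 2))
    return f1, f2

xor_data = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
for x, r in xor_data.items():
    f1, f2 = features(x)
    print(x, "-> f =", (round(f1, 2), round(f2, 2)), " r =", r)
# (0,1) and (1,0) both map to (0.37, 0.37).  The line f1 + f2 = 0.9 separates
# the r = 1 point (f1 + f2 ~ 0.74) from the r = 0 points (f1 + f2 ~ 1.14).
```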
10
Weight optimization by backpropagation. Initialize the weights randomly.
Find rules that relate changes in weights to the difference between output and target.
11
If the expression for the in-sample error is simple (e.g., squared residuals) and the network is not too complex (e.g., fewer than 3 hidden layers), then an analytical expression for the rate of change of the error with respect to the weights can be derived with calculus.
12
Simplest case: multivariate linear regression by perceptron
The in-sample error is the sum of squared residuals. This example is instructive but not essential in practice, because the weights could be determined by "one-step" optimization (discussed later) rather than by iterative backpropagation.
13
Approaches to Training
Online: weights are updated from training-set examples seen one by one in random order. Batch: weights are updated from the whole training set, after summing the contributions of the individual examples. The weight-update formulas are simpler for the online approach; the batch formulas can be derived from the online formulas by summing over examples.
14
Weight-update rule: multivariate linear regression
Contribution to the sum of squared residuals from a single example: E^t = (1/2)(r^t - y^t)^2. w_j is the jth component of the weight vector w connecting the attribute vector x to the scalar output y = w^T x. E^t depends on w_j only through y^t = w^T x^t; hence use the chain rule, which gives dE^t/dw_j = -(r^t - y^t) x_j^t and the update Δw_j = η (r^t - y^t) x_j^t.
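A minimal sketch (Python with NumPy, on synthetic data) of the resulting online update Δw_j = η (r^t - y^t) x_j^t for multivariate linear regression:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, eta = 3, 200, 0.05                      # attributes, examples, learning rate
w_true = np.array([1.5, -2.0, 0.5, 0.3])      # last entry plays the role of the bias w0

X = np.column_stack([rng.normal(size=(N, d)), np.ones(N)])   # append x0 = +1
r = X @ w_true + rng.normal(0.0, 0.1, size=N)

w = rng.normal(scale=0.01, size=d + 1)        # initialize the weights randomly
for epoch in range(50):
    for t in rng.permutation(N):              # examples one by one in random order
        y_t = w @ X[t]                        # forward pass: y^t = w^T x^t
        w += eta * (r[t] - y_t) * X[t]        # online update from the chain rule
print(np.round(w, 2))                         # close to w_true
```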
16
The weight-update formula is called "stochastic gradient descent".
Why the negative sign? The step moves the weights down the error gradient, in the direction that decreases E. The proportionality constant η is called the "learning rate". Since Δw_j is proportional to x_j, all attributes should be roughly the same size; normalization to achieve this may be helpful.
17
Momentum parameter: keep part of the previous update.
How do the learning rate and momentum affect training? As the learning rate → 1, backpropagation becomes deterministic: each example determines a set of weights optimal for itself only. As the learning rate → 0, the probability of being trapped in a local minimum approaches 1, because the step size of each weight change is so small. A large momentum parameter reduces trapping at small learning rates, but increases the likelihood that a single outlier will dramatically affect the weight optimization. Opinions differ on the best choice of learning rate and momentum.
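A generic sketch (plain Python; not the lecture's specific implementation) of adding a momentum term: a fraction α of the previous update is carried into the current step.

```python
def sgd_momentum_step(w, grad, velocity, eta=0.1, alpha=0.9):
    """One update: keep a fraction alpha of the previous update (the momentum term)."""
    velocity = alpha * velocity - eta * grad   # part of the previous step plus a new gradient step
    return w + velocity, velocity

# Illustrative use on a 1-D quadratic error E(w) = (w - 3)^2, whose gradient is 2(w - 3).
w, v = 0.0, 0.0
for _ in range(200):
    w, v = sgd_momentum_step(w, 2 * (w - 3.0), v)
print(round(w, 3))   # approaches the minimum at w = 3
```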
18
Binary classification with a perceptron
y = sigmoid(w^T x), with labels r^t ∈ {0, 1}. Derive the in-sample error by maximum-likelihood estimation, with the weights as parameters of the distribution from which the dataset was drawn. Derive the weight-update formulas by calculus. This is equivalent to linear logistic regression.
19
Derive in-sample error function (cross entropy)
Assume that r^t is drawn from a Bernoulli distribution with parameter p0, the probability that r^t = 1: p(r) = p0^r (1 - p0)^(1-r), so p(r = 0) = 1 - p0. Let p0 = y = sigmoid(w^T x); then p(r) = y^r (1 - y)^(1-r). Use MLE to find the best w.
20
Let L(w|X) be the log-likelihood that the weight vector w results from the training set X: L(w|X) = Σ_t [r^t log y^t + (1 - r^t) log(1 - y^t)]. Since y^t = sigmoid(w^T x^t) is between 0 and 1, L(w|X) < 0. Therefore, to maximize L(w|X), minimize E = -Σ_t [r^t log y^t + (1 - r^t) log(1 - y^t)], which we take as our in-sample error, called cross entropy.
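A small numerical sketch (Python with NumPy) of the cross-entropy error E = -Σ_t [r^t log y^t + (1 - r^t) log(1 - y^t)] that falls out of this maximum-likelihood argument; the predictions y here are illustrative numbers standing in for sigmoid(w^T x^t).

```python
import numpy as np

def cross_entropy(r, y, eps=1e-12):
    """Negative Bernoulli log-likelihood of labels r under predictions y."""
    y = np.clip(y, eps, 1.0 - eps)        # guard the logarithms
    return -np.sum(r * np.log(y) + (1 - r) * np.log(1 - y))

r = np.array([1, 0, 1, 1])
print(cross_entropy(r, np.array([0.9, 0.1, 0.8, 0.7])))   # small error: confident and correct
print(cross_entropy(r, np.array([0.2, 0.8, 0.3, 0.4])))   # large error: confident and wrong
```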
21
As with the in-sample error defined by squared residuals, we get the stochastic weight update from Δw_j = -η dE^t/dw_j. With y^t = sigmoid(w^T x^t), the derivative of the sigmoid is y^t(1 - y^t); after some algebra we get the same result as for regression: Δw_j = η (r^t - y^t) x_j^t.
22
Result generalizes to k-way classification
The weight vector w_i connects the input vector x to output node y_i. Assign the example with attributes x to the class with the largest y_i.
23
Cross entropy and weight update rule
w_ij is the jth component of w_i (the ith column of the weight matrix W). This approach to multivariate multiclass linear classification requires iterative refinement of K(d+1) weights. Multivariate linear regression followed by binning determines the weights by one-step optimization; we cover that topic after finishing ANNs.
24
Multilayer Perceptrons (MLP)
Review: a perceptron has only input and output nodes. It is equivalent to multivariate linear and logistic regression. Some problems cannot be solved by linear models. A multilayer perceptron solves such problems by inserting "hidden" layers between input and output.
25
XOR: non-linear classification problem
[Truth table and graphical representation.] The classes are not linearly separable in attribute space.
26
XOR in a linearly separable feature space (transformed attribute vectors)
f1 = exp(-||X - [1,1]^T||^2), f2 = exp(-||X - [0,0]^T||^2)
X = (1,1): f1 = 1,    f2 ≈ 0.14
X = (0,1): f1 ≈ 0.37, f2 ≈ 0.37
X = (1,0): f1 ≈ 0.37, f2 ≈ 0.37
X = (0,0): f1 ≈ 0.14, f2 = 1
This transformation puts examples (0,1) and (1,0) at the same point in feature space. Three points in a 2D space are always linearly separable.
27
(0,0) and (1,1) are at the same point
Consider the hidden units z_h as features. Choose w_h so that, in feature space, (0,0) and (1,1) are at the same point. [Figure: the ideal (z1, z2) feature space next to the original attribute space.]
28
Design criteria for hidden layer
x1  x2  r   z1   z2
0   0   0   ~0   ~0
0   1   1   ~0   ~1
1   0   1   ~1   ~0
1   1   0   ~0   ~0
w_h^T x < 0 → z_h ≈ 0;  w_h^T x > 0 → z_h ≈ 1
29
Find weights for design criteria
Hidden unit z1:
x1  x2  z1   sign of w1^T x   required           choice
0   0   ~0   < 0              w0 < 0             w0 = -0.5
0   1   ~0   < 0              w2 + w0 < 0        w2 = -1
1   0   ~1   > 0              w1 + w0 > 0        w1 = 1
1   1   ~0   < 0              w1 + w2 + w0 < 0
Hidden unit z2:
x1  x2  z2   sign of w2^T x   required           choice
0   0   ~0   < 0              w0 < 0             w0 = -0.5
0   1   ~1   > 0              w2 + w0 > 0        w2 = 1
1   0   ~0   < 0              w1 + w0 < 0        w1 = -1
1   1   ~0   < 0              w1 + w2 + w0 < 0
30
Transformation of the input by the hidden layer: z1 = sigmoid(x1 - x2 - 0.5), z2 = sigmoid(-x1 + x2 - 0.5)
x1  x2  arg1 = x1 - x2 - 0.5   z1     arg2 = -x1 + x2 - 0.5   z2     r
0   0   -0.5                   0.38   -0.5                    0.38   0
0   1   -1.5                   0.18    0.5                    0.62   1
1   0    0.5                   0.62   -1.5                    0.18   1
1   1   -0.5                   0.38   -0.5                    0.38   0
[Figures: XOR in the (x1, x2) attribute space and the transformed XOR in the (z1, z2) feature space, where the problem resembles Boolean OR and is linearly separable.]
31
Find weights connecting hidden layer to output
y = sigmoid(v^T z)
z1    z2    r   sign of v^T z   required                      choice
0.38  0.38  0   < 0             0.38 v1 + 0.38 v2 + v0 < 0    v0 = -0.78
0.18  0.62  1   > 0             0.18 v1 + 0.62 v2 + v0 > 0    v2 = 1
0.62  0.18  1   > 0             0.62 v1 + 0.18 v2 + v0 > 0    v1 = 1
0.38  0.38  0   < 0             0.38 v1 + 0.38 v2 + v0 < 0
32
A solution of XOR by multilayer perceptron
[Network diagram: inputs x1, x2 and bias feeding hidden units z1, z2 with weights (1, -1, -0.5) and (-1, 1, -0.5); hidden units and bias feeding output y with weights (1, 1) and bias -0.78.]
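The solution can be checked end to end. This sketch (Python with NumPy) runs the four XOR inputs through the hidden weights derived above and the output weights v1 = v2 = 1 with bias v0 = -0.78.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

W_hidden = np.array([[ 1.0, -1.0, -0.5],    # z1 = sigmoid( x1 - x2 - 0.5)
                     [-1.0,  1.0, -0.5]])   # z2 = sigmoid(-x1 + x2 - 0.5)
v = np.array([1.0, 1.0, -0.78])             # output weights v1, v2 and bias v0 = -0.78

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = np.array([x1, x2, 1.0])             # append the bias input
    z = sigmoid(W_hidden @ x)
    y = sigmoid(v @ np.append(z, 1.0))
    print((x1, x2), "-> y =", round(float(y), 3), " class", int(y > 0.5))
# The output exceeds 0.5 only for (0,1) and (1,0): the XOR function.
```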
33
Recall multivariate linear regression with perceptron
Weight update to minimize the sum of squared residuals.
34
Multivariate nonlinear regression with one hidden layer
Forward: z_h = sigmoid(w_h^T x), y = v^T z. Backward: Δv_h = η (r^t - y^t) z_h^t and Δw_hj = η (r^t - y^t) v_h z_h^t (1 - z_h^t) x_j^t. The z_h(1 - z_h) factor is specific to the sigmoid as the nonlinear transform of the input.
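A sketch (Python with NumPy; a generic illustration under the formulas above, with sigmoid hidden units and a linear output) of one online forward and backward pass for regression with a single hidden layer.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_step(x, r, W, v, eta=0.1):
    """One online update for 1-hidden-layer regression: sigmoid hidden units, linear output."""
    x = np.append(x, 1.0)                     # bias input x0 = +1
    z = sigmoid(W @ x)                        # forward: hidden activations z_h
    z_aug = np.append(z, 1.0)                 # bias unit for the output layer
    y = v @ z_aug                             # forward: scalar output

    delta_v = eta * (r - y) * z_aug           # backward: output weights
    delta_W = eta * (r - y) * np.outer(v[:-1] * z * (1 - z), x)   # backward: uses sigmoid derivative z(1 - z)
    return W + delta_W, v + delta_v, y

# Illustrative call with random weights: H = 3 hidden units, d = 2 inputs.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 3))        # H x (d + 1)
v = rng.normal(scale=0.1, size=4)             # H + 1
W, v, y = backprop_step(np.array([0.2, -0.4]), r=0.7, W=W, v=v)
```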
35
Note that the hidden-layer update uses the old (not yet updated) second-layer weights. (Is this update online or batch?)
36
Example using in silico data
x^t ~ U(-0.5, 0.5); y^t = sin(6 x^t) + N(0, 0.1). An epoch is a complete pass through the training data. The validation error is calculated after each epoch; the fit gets better with an increasing number of epochs.
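A sketch (Python with NumPy) of generating this in silico dataset and holding out a validation set on which the error would be computed after every epoch.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 200
x = rng.uniform(-0.5, 0.5, size=N)               # x^t ~ U(-0.5, 0.5)
y = np.sin(6 * x) + rng.normal(0.0, 0.1, size=N) # y^t = sin(6 x^t) + N(0, 0.1)

# Hold out part of the data; the validation error would be computed on it after every epoch.
n_val = N // 4
x_val, y_val = x[:n_val], y[:n_val]
x_train, y_train = x[n_val:], y[n_val:]
```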
37
Overtraining: there is an elbow in the error curves. Beyond the elbow, the training and validation errors track each other for roughly 200 epochs. Above roughly 500 epochs there is clear evidence of overtraining.
38
Too many hidden units: multivariate regression with one hidden layer containing H units has H(d+1) + (H+1) weights. Up to H ≈ 5 we see significant improvement with more hidden units. Above H ≈ 15 the validation error increases while the training error is flat: evidence of a more complex ANN than needed.
39
Tuning the Network Size by weight decay
A zero weight effectively removes a connection. Weight decay creates a tendency for unnecessary weights to approach zero: add a term to the weight-update rule, Δw_i = -η ∂E/∂w_i - λ w_i. The best magnitude of λ will depend on η; use a validation set to get the right balance.
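A generic sketch (Python with NumPy; not the lecture's code) of the decay term -λw added to each update, showing how a weight whose error gradient stays at zero, i.e. an unnecessary connection, drifts toward zero.

```python
import numpy as np

def update_with_decay(w, grad_E, eta=0.1, lam=0.01):
    """Gradient step plus weight decay: delta_w = -eta * dE/dw - lam * w."""
    return w + (-eta * grad_E - lam * w)

# A weight whose error gradient is always zero (an unnecessary connection) decays toward zero.
w = np.array([2.0])
for _ in range(500):
    w = update_with_decay(w, grad_E=np.array([0.0]))
print(w)   # roughly 0.01 and still shrinking
```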
40
Structural adaptation: systematic architectural changes without starting over. Destructive: start large and remove unnecessary connections. Constructive: start small and add what is needed.
41
Tuning the Network Size by construction
Dynamic node creation (Ash, 1989): start with one unit in one hidden layer; train and test. If the validation error is too large, add another hidden unit and train without reinitializing the weights on the previous connections.
42
Tuning the Network Size by construction
Cascade correlation (Fahlman and Lebiere, 1989): start with one unit in one hidden layer; add each new hidden unit as another hidden layer, so there is one node per hidden layer. Freeze the previously trained weights and train only the newly added connections. Training a single layer at each step is faster than training multiple hidden layers; this makes sense because each new unit is added to learn what has not yet been learned by the previous layer(s).
44
How nonlinearity works
Input to the hidden units: w_h^T x + w_h0. Hidden units: z_h = sigmoid(w_h^T x). Output: hidden units times output weights, y = Σ_h v_h z_h.
45
Tall and lean usually better than short and fat
Multiple layers may lead to a simpler network. Regression with 2 hidden layers: feed-forward pass.
46
Weight-update equations for regression with 2 hidden layers
(Batch mode.) New notation designed to show the pattern: update = learning factor · output error · input. y = v^T z2, for the connection between layer-2 unit 2l and the output; z_2l = sigmoid(w_2l^T z1), for the connection between layer-1 unit 1h and layer-2 unit 2l; z_1h = sigmoid(w_1h^T x), for the connection between the input and layer-1 unit 1h. The error depends on w_1hj through the weights in the top two layers.
47
Two-Class Discrimination with one hidden layer
The output y^t = sigmoid(v^T z^t) = 1/(1 + exp(-v^T z^t)) models P(C1|x). Minimize the cross entropy with batch updates; the weight-update formulas are the same as for regression. After convergence, assign the example to C1 if the output is > 0.5.
48
K>2 Classes: one hidden layer
Derived from the multinomial distribution. v_i is the weight vector connecting the nodes of the hidden layer to the output for class i; note the sum over i in the normalization of the outputs (softmax). Minimize the cross entropy by batch updates. After convergence, assign the example to the class with the largest output.
49
Classification problem with small dataset
Assignment 4: Build a system that can classify a piece of glass from a beer bottle. Train an ANN to classify which brewery the bottle came from using the chemical content of the glass. glassdata.csv contains 214 samples of bottle glass from 6 breweries. Since the dataset is small, do not use a validation set.
50
First 3 rows of the data:
1 1.521 13.64 4.49 1.1 71.78 0.06 8.75
2 1.517 13.89 3.6 1.36 72.73 0.48 7.83
3 1.516 13.53 3.55 1.54 72.99 0.39 7.78
I have not been able to make Lars' ANN code perform the way I expected it to from my experience with it in 2008. For now, I am suspending all parts of Assignment 4 that involve Lars' code.
51
New Objectives For Assignment 4
Objective 1: Create an input data file that will let you run WEKA’s multilayer perceptron for classification. Describe the change you made to glassdata.csv to achieve this. Objective 2: Capture WEKA’s results using the default settings. Include the confusion matrix. Can you find settings that give better results? If so, describe those settings in your report along with WEKA output including the improved confusion matrix.
52
First 3 rows of the data:
1 1.521 13.64 4.49 1.1 71.78 0.06 8.75
2 1.517 13.89 3.6 1.36 72.73 0.48 7.83
3 1.516 13.53 3.55 1.54 72.99 0.39 7.78
The data are not ready for use by either WEKA or Lars' code. What is needed, irrespective of which software is used? Columns:
1. Sample index
2. RI: refractive index
3. Na: sodium
4. Mg: magnesium
5. Al: aluminum
6. Si: silicon
7. K: potassium
8. Ca: calcium
9. Ba: barium
10. Fe: iron
11. Type of bottle (class): 1 = Anheuser-Busch, Inc.; 2 = Miller Brewing Co.; 3 = Blitz-Weinhard Brewing Co.; 4 = Pete's Brewing Co.; 5 = Samuel Adams Brew House; 6 = Plank Road Brewery
53
3. Background for assignment 5 due 11-6
Example of multivariate nonlinear regression: prediction of the charge on peptides after electrospray ionization.
54
5. Background for assignment 5 due 11-6-12
Example of multivariate nonlinear regression: prediction of peptide charge in electrospray ionization. Construct the input from the amino-acid sequence of the peptide. Goal: the best result with the smallest input dimension.
55
First 4 of ~23,000 data pairs:
Sequence                              Charge
AAAAAAPDDVAAQLVVADLDLVGGHVEDAFAR      2.8
AAAAADLANR                            2
AAAAAQASASAAAK
AAAAAVAQGGPIEDAER
How can we encode the peptide sequence? Can an encoded peptide sequence be an input? Can we encode a subset of the peptide sequence? What inputs can we calculate from the sequence?
56
Properties of amino acids
code  pI     pK1   pK2    charge  hydrophobic?  polar?
A     6.01   2.35  9.87           T             F
R     10.76  1.82  8.99   +
N     5.41   2.14  8.72
D     2.85   1.99  9.9    -
C     5.05   1.92  10.7
E     3.15   2.1   9.47
Q     5.65   2.17  9.13
G     6.06         9.78
H     7.6    1.8   9.33
I     6.05   2.32  9.76
L            2.33  9.74
K     9.6    2.16  9.06
M     5.74   2.13  9.28
F     5.49   2.2   9.31
P     6.3    1.95  10.64
S     5.68   2.19  9.21
T     5.6    2.09  9.1
W     5.89   2.46  9.41
Y     5.64
V     6.0    2.39
57
Some suggestions for inputs from properties of amino acids
Length of the peptide
Mass of the peptide
Sequence of the first 2 residues
Sequence of the last 2 residues
Fractions of amino acids of each type
Fractions of hydrophobic, polar, and charged residues
Net formal charge
Average isoelectric point
Average dissociation constant
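A sketch (plain Python) of turning a peptide sequence into a few of the suggested inputs. The charge and hydrophobicity groupings hard-coded here are illustrative assumptions; a real encoding should use the full property table from the earlier slide.

```python
# Illustrative groupings only; a full encoding would use the complete property table above.
CHARGE = {"R": +1, "K": +1, "H": +1, "D": -1, "E": -1}   # residues carrying a formal charge
HYDROPHOBIC = set("AVLIMFWP")                             # one common grouping (assumption)

def encode(peptide):
    n = len(peptide)
    return {
        "length": n,
        "first_two": peptide[:2],     # still categorical; would need numeric encoding for an ANN
        "last_two": peptide[-2:],
        "net_formal_charge": sum(CHARGE.get(aa, 0) for aa in peptide),
        "frac_hydrophobic": sum(aa in HYDROPHOBIC for aa in peptide) / n,
        "frac_charged": sum(aa in CHARGE for aa in peptide) / n,
    }

print(encode("AAAAADLANR"))   # the second peptide listed above (charge 2)
```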
58
Review of protein biology
Central dogma of biology
59
Dogma on protein function
Proteins are polymers of amino acids. The sequence of amino acids determines a protein's shape (folding pattern). The shape of a protein determines its function.
60
Amino acids with different side chains have different names
Glycine gly G; alanine ala A; valine val V; leucine leu L; isoleucine ile I; methionine met M; proline pro P; phenylalanine phe F; tryptophan trp W; serine ser S; cysteine cys C; threonine thr T; glutamine gln Q; asparagine asn N; histidine his H; tyrosine tyr Y; glutamic acid glu E; aspartic acid asp D; lysine lys K; arginine arg R
What are amino acids? [Figure: general amino-acid structure with N-terminus, C-terminus, and side chain R.] Amino acids with different side chains have different names.
61
Chemical properties of amino acids
62
Sequence of Protein Dictates its Folding Pattern
63
Amino Acids Polymerize to Form Proteins
Formation of the peptide bond. [Figure: polypeptide backbone -N-C-C-N-C-C-N- with hydrogen (H) and side-chain (R) groups.]
64
Proteases cut proteins into peptides
[Figure: polypeptide backbone -N-C-C-N-C-C-N- with H and R groups.] Proteases catalyze cleavage of peptide bonds. Most proteases have cleavage specificity: trypsin cleaves at arginines and lysines. Digestion of a protein with trypsin produces peptides of various lengths.
65
Liquid chromatography coupled to mass spectrometry
[Diagram: digested protein mixture → LC column → electrospray ionization → mass spectrometer.] Peptides are retained for differing times on the LC column. Peptides may carry multiple charges. The charges in Assignment 5 are averages from several runs.
66
Hints: known properties independent of training
Virtual examples: additions to the training set with the same label. Example 1: invariance to translation, rotation, etc. Example 2: "0101" and "1010" have the same parity, so virtual examples can be added by shifting.
67
Hints: known properties independent of training
Add penalties to the error: E' = E + λ_h E_h, where E is the usual sum of squared residuals. Example 1: we know f(x) = f(x'); let E_h = [g(x|θ) - g(x'|θ)]^2, where λ_h determines the significance of the penalty. Example 2: we know that f(x) lies between a_x and b_x.
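A sketch (plain Python, with g as a hypothetical model function g(x, θ)) of the penalized error E' = E + λ_h E_h for the invariance hint f(x) = f(x').

```python
def hint_penalized_error(g, theta, data, equiv_pairs, lam_h=0.1):
    """E' = E + lam_h * E_h for a hypothetical model g(x, theta).

    data: list of (x, r) training pairs.
    equiv_pairs: list of (x, x_prime) pairs known to satisfy f(x) = f(x_prime)."""
    E = sum((r - g(x, theta)) ** 2 for x, r in data)                       # usual squared residuals
    E_h = sum((g(x, theta) - g(xp, theta)) ** 2 for x, xp in equiv_pairs)  # invariance violation
    return E + lam_h * E_h
```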