Fundamentals of Artificial Neural Networks


Fundamentals of Artificial Neural Networks
Rosenblatt's perceptron: a linear combination of inputs connected to an output.
It performs linear regression and classification by linear discriminants.
The ANN finds the optimum linear model by iterative improvement of the weights.
More efficient methods (linear least squares) find the optimum model by solving a linear system of equations.

What can a perceptron do?
Fit a line to data: y = wx + w0.
Use y = wx + w0 as a discriminant.
S = sigmoid(y) models P(C1|x).
sigmoid(y) = 1/(1 + e^(-y)); it equals 0.5 when y = 0, approaches 0 for large negative y, and approaches 1 for large positive y.
(Several of the following slides are adapted from Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e © The MIT Press.)
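To make the sigmoid behavior concrete, here is a minimal sketch (the weight, bias, and inputs are illustrative, not from the slides) that evaluates a one-input perceptron both as a linear output and as a probabilistic classifier:

```python
import numpy as np

def sigmoid(y):
    # sigmoid(y) = 1 / (1 + e^(-y)): 0.5 at y = 0, -> 0 for large negative y, -> 1 for large positive y
    return 1.0 / (1.0 + np.exp(-y))

w, w0 = 2.0, -1.0                    # illustrative weight and bias
x = np.array([-2.0, 0.5, 3.0])       # a few input values

y = w * x + w0                       # linear output: regression fit / discriminant value
p_c1 = sigmoid(y)                    # sigmoid of the output: models P(C1 | x)
print(y)                             # [-5.  0.  5.]
print(p_c1.round(3))                 # [0.007 0.5   0.993]
```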

Multivariate Linear Regression
x is the input vector and w is the weight vector.
The scalar output y is the inner product of w and x: y = w^T x.
This fits a hyperplane to the data in d dimensions.

Multiclass classification by multivariate linear discriminants
wi is the weight vector connecting the input vector x to component yi of the output vector y.
Transform yi by the sigmoid.
Choose Ci if yi is the largest output.
There are K(d+1) weights.

Boolean AND: a simple classification problem with a linear discriminant
(Figure: AND truth table and the discriminant line in the (x1, x2) plane.)
How did we find this linear discriminant?

A system of inequalities "suggests" weight values
Linear discriminant: w^T x < 0 → r = 0; w^T x > 0 → r = 1

x1  x2  r   required sign of w^T x   choice
0   0   0   w0 < 0                   w0 = -1.5
0   1   0   w2 + w0 < 0              w1 = 1
1   0   0   w1 + w0 < 0              w2 = 1
1   1   1   w1 + w2 + w0 > 0

The linear discriminant w^T x = 0 gives x1 + x2 - 1.5 = 0, the boundary in the (x1, x2) plane where sigmoid(w^T x) > 0.5.
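As a quick sanity check, a sketch that plugs the weights from the table above into the discriminant and verifies it reproduces the AND truth table:

```python
import numpy as np

w0, w1, w2 = -1.5, 1.0, 1.0                 # weights suggested by the inequalities above
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
r = np.array([0, 0, 0, 1])                  # Boolean AND targets

scores = w0 + X @ np.array([w1, w2])        # w^T x for each example
pred = (scores > 0).astype(int)             # sigmoid(w^T x) > 0.5  <=>  w^T x > 0
print(scores)                               # [-1.5 -0.5 -0.5  0.5]
print(np.array_equal(pred, r))              # True: the discriminant reproduces AND
```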

Assignment 3: due 10-2-12
Derive a linear discriminant for Boolean OR. Show the following:
- Truth table
- System of inequalities for the weights
- Linear discriminant
- Graphical representation of the problem with its solution

Boolean XOR: a linearly inseparable 2D binary classification problem
(Figure: data table and graphical representation.)
Transform to a linearly separable feature space.

Solution of XOR in a Gaussian feature space
f1 = exp(-||X - [1,1]||^2)
f2 = exp(-||X - [0,0]||^2)

XOR data in feature space:
X      f1      f2
(1,1)  1       0.1353
(0,1)  0.3678  0.3678
(0,0)  0.1353  1
(1,0)  0.3678  0.3678

This transformation puts examples (0,1) and (1,0) at the same point in feature space. Three points in a 2D space are always linearly separable (class r = 0 vs. r = 1).
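A minimal sketch (feature names f1, f2 as on the slide) that computes the Gaussian features for the four XOR points:

```python
import numpy as np

X = np.array([[1, 1], [0, 1], [0, 0], [1, 0]], dtype=float)
c1, c2 = np.array([1.0, 1.0]), np.array([0.0, 0.0])   # Gaussian centers

# f_i = exp(-||x - c_i||^2): distance-based features centered on (1,1) and (0,0)
f1 = np.exp(-np.sum((X - c1) ** 2, axis=1))
f2 = np.exp(-np.sum((X - c2) ** 2, axis=1))
print(np.column_stack([f1, f2]).round(4))
# (0,1) and (1,0) both map to (0.3679, 0.3679), so only 3 distinct points remain
# in feature space and the two classes become linearly separable.
```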

Weight optimization by back-propagation
Initialize the weights randomly.
Find rules that relate changes in the weights to the difference between output and target.

If the expression for the in-sample error is simple (e.g. squared residuals) and the network is not too complex (e.g. fewer than 3 hidden layers), then an analytical expression for the rate of change of the error with respect to the weights can be derived from calculus.

Simplest case: multivariate linear regression by a perceptron
The in-sample error is the sum of squared residuals.
This example is instructive but not strictly necessary, because the weights could be determined by "one-step" optimization (discussed later) rather than by iterative back-propagation.
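For comparison, a sketch of the "one-step" alternative using NumPy's least-squares solver (the synthetic data and weights here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 3))        # 100 examples, d = 3 attributes
r = X @ np.array([2.0, -1.0, 0.5]) + 0.3 + rng.normal(0, 0.1, 100)

X1 = np.column_stack([np.ones(len(X)), X])   # prepend x0 = 1 for the bias weight w0
w, *_ = np.linalg.lstsq(X1, r, rcond=None)   # solves the least-squares problem in one step
print(w)                                     # approximately [0.3, 2.0, -1.0, 0.5]
```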

Approaches to training
Online: weights are updated from training-set examples seen one by one in random order.
Batch: weights are updated from the whole training set after summing the deviations of the individual examples.
The weight-update formulas are simpler for the "online" approach; the "batch" formulas can be derived from the "online" formulas by summing.

Weight-update rule: multivariate linear regression
The contribution to the sum of squared residuals from a single example t is E^t = (r^t - y^t)^2, with y^t = w^T x^t.
w_j is the jth component of the weight vector w connecting the attribute vector x to the scalar output y = w^T x.
E^t depends on w_j through y^t = w^T x^t; hence use the chain rule:
dE^t/dw_j = (dE^t/dy^t)(dy^t/dw_j) = -2 (r^t - y^t) x_j^t

The resulting weight-update formula is called "stochastic gradient descent":
Δw_j^t = -η dE^t/dw_j = η (r^t - y^t) x_j^t
Why the negative sign? The update moves the weights in the direction that decreases the error (down the gradient).
The proportionality constant η is called the "learning rate".
Since Δw_j is proportional to x_j, all attributes should be roughly the same size; normalization to achieve this may be helpful.
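A compact sketch of the online (stochastic gradient descent) loop for multivariate linear regression, using the squared-error update above; the learning rate, epochs, and demo data are illustrative:

```python
import numpy as np

def sgd_linear_regression(X, r, eta=0.01, epochs=50, seed=0):
    """Online weight updates: delta_w_j = eta * (r_t - y_t) * x_j_t."""
    rng = np.random.default_rng(seed)
    X1 = np.column_stack([np.ones(len(X)), X])     # x0 = 1 handles the bias w0
    w = rng.uniform(-0.01, 0.01, X1.shape[1])      # small random initial weights
    for _ in range(epochs):
        for t in rng.permutation(len(X1)):         # examples in random order
            y_t = w @ X1[t]                        # current prediction
            w += eta * (r[t] - y_t) * X1[t]        # stochastic gradient descent step
    return w

rng = np.random.default_rng(1)
X_demo = rng.uniform(-1, 1, (200, 2))
r_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + 0.5 + rng.normal(0, 0.05, 200)
print(sgd_linear_regression(X_demo, r_demo))       # approaches [0.5, 3.0, -2.0]
```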

Momentum parameter: keep part of the previous update, Δw_j^t = -η dE^t/dw_j + α Δw_j^(t-1).
How do the learning rate and momentum affect training?
As the learning rate → 1, back-propagation becomes deterministic: each example determines a set of weights optimal for itself only.
As the learning rate → 0, the probability of being trapped in a local minimum → 1, because the step size of each weight change is so small.
A large momentum parameter reduces trapping at small learning rates but increases the likelihood that a single outlier will dramatically affect the weight optimization.
Opinions differ on the best choice of learning rate and momentum.
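A sketch of the same online update with a momentum term; the momentum factor alpha is an illustrative choice:

```python
import numpy as np

def sgd_with_momentum(X, r, eta=0.01, alpha=0.9, epochs=50, seed=0):
    """Online update with momentum: delta_w = eta*(r_t - y_t)*x_t + alpha*previous_delta_w."""
    rng = np.random.default_rng(seed)
    X1 = np.column_stack([np.ones(len(X)), X])
    w = rng.uniform(-0.01, 0.01, X1.shape[1])
    delta_prev = np.zeros_like(w)                  # previous update, reused at each step
    for _ in range(epochs):
        for t in rng.permutation(len(X1)):
            y_t = w @ X1[t]
            delta = eta * (r[t] - y_t) * X1[t] + alpha * delta_prev
            w += delta
            delta_prev = delta
    return w
```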

Binary classification with a perceptron
y = sigmoid(w^T x); targets r^t ∈ {0, 1}.
Derive the in-sample error by maximum likelihood estimation, with the weights as the parameters of the distribution from which the dataset was drawn.
Derive the weight-update formulas by calculus.
This is equivalent to linear logistic regression.

Derive the in-sample error function (cross entropy)
Assume that r^t is drawn from a Bernoulli distribution with parameter p0 for the probability that r^t = 1:
p(r) = p0^r (1 - p0)^(1-r), so p(r = 1) = p0 and p(r = 0) = 1 - p0.
Let p0 = y = sigmoid(w^T x); then p(r) = y^r (1 - y)^(1-r).
Use MLE to find the best w.

Let L(w|X) be the log-likelihood of weight vector w given training set X:
L(w|X) = Σ_t [ r^t log y^t + (1 - r^t) log(1 - y^t) ]
y^t = sigmoid(w^T x^t) is between 0 and 1; hence L(w|X) < 0.
Therefore, to maximize L(w|X), minimize
E(w|X) = -Σ_t [ r^t log y^t + (1 - r^t) log(1 - y^t) ],
which we take as our in-sample error, called cross entropy.

As with the in-sample error defined by squared residuals, we get a stochastic weight update
Δw_j^t = -η dE^t/dw_j,
where, with y^t = sigmoid(w^T x^t), the derivative of the sigmoid is y(1 - y).
After some algebra we get the same result as for regression: Δw_j^t = η (r^t - y^t) x_j^t.
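A sketch of the resulting online logistic-regression (cross-entropy) training loop; as the slide notes, the update has the same form as for regression but with a sigmoid output:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic(X, r, eta=0.1, epochs=100, seed=0):
    """Online cross-entropy minimization: delta_w_j = eta * (r_t - y_t) * x_j_t."""
    rng = np.random.default_rng(seed)
    X1 = np.column_stack([np.ones(len(X)), X])   # bias via x0 = 1
    w = rng.uniform(-0.01, 0.01, X1.shape[1])
    for _ in range(epochs):
        for t in rng.permutation(len(X1)):
            y_t = sigmoid(w @ X1[t])             # models P(C1 | x_t)
            w += eta * (r[t] - y_t) * X1[t]      # same form as the regression update
    return w
```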

The result generalizes to K-way classification.
Weight vector wi connects the input vector x to output node yi.
Assign the example with attributes x to the class with the largest yi.

Cross entropy and weight-update rule
w_ij is the jth component of wi (the ith column of the weight matrix W).
This approach to multivariate multiclass linear classification requires iterative refinement of K(d+1) weights.
Multivariate linear regression followed by binning determines the weights by one-step optimization; we cover that topic after finishing ANNs.
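For reference, a sketch of the K-way (softmax) version of the same online update, with Δw_ij = η (r_i^t − y_i^t) x_j^t; the data layout and hyperparameters are illustrative:

```python
import numpy as np

def softmax(a):
    a = a - a.max()                      # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

def train_softmax(X, labels, K, eta=0.1, epochs=100, seed=0):
    """Online K-class cross-entropy training with a (d+1) x K weight matrix."""
    rng = np.random.default_rng(seed)
    X1 = np.column_stack([np.ones(len(X)), X])
    W = rng.uniform(-0.01, 0.01, (X1.shape[1], K))
    R = np.eye(K)[labels]                # one-of-K (one-hot) targets r_i^t
    for _ in range(epochs):
        for t in rng.permutation(len(X1)):
            y = softmax(W.T @ X1[t])                     # y_i approximates P(C_i | x_t)
            W += eta * np.outer(X1[t], R[t] - y)         # delta_w_ij = eta*(r_i - y_i)*x_j
    return W
```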

Multilayer Perceptrons (MLP)
Review: a perceptron has only input and output nodes; it is equivalent to multivariate linear and logistic regression.
Some problems cannot be solved by linear models.
A multilayer perceptron solves such problems by inserting "hidden" layers between input and output.

XOR: a non-linear classification problem
(Figure: truth table and graphical representation.)
The classes are not linearly separable in attribute space.

XOR in a linearly separable feature space (transformed attribute vectors)
f1 = exp(-||X - [1,1]^T||^2)
f2 = exp(-||X - [0,0]^T||^2)

X      f1      f2
(1,1)  1       0.1353
(0,1)  0.3678  0.3678
(0,0)  0.1353  1
(1,0)  0.3678  0.3678

This transformation puts examples (0,1) and (1,0) at the same point in feature space. Three points in a 2D space are always linearly separable.

Consider the hidden units zh as features.
Choose wh so that in feature space (0,0) and (1,1) are at the same point.
(Figure: attribute space and the ideal feature space with axes z1, z2.)

Design criteria for the hidden layer
wh^T x < 0 → zh ~ 0; wh^T x > 0 → zh ~ 1

x1  x2  r   z1   z2
0   0   0   ~0   ~0
0   1   1   ~0   ~1
1   0   1   ~1   ~0
1   1   0   ~0   ~0

Find weights that satisfy the design criteria

For z1 (w1^T x):
x1  x2  z1   required            choice
0   0   ~0   w0 < 0              w0 = -0.5
0   1   ~0   w2 + w0 < 0         w2 = -1
1   0   ~1   w1 + w0 > 0         w1 = 1
1   1   ~0   w1 + w2 + w0 < 0

For z2 (w2^T x):
x1  x2  z2   required            choice
0   0   ~0   w0 < 0              w0 = -0.5
0   1   ~1   w2 + w0 > 0         w2 = 1
1   0   ~0   w1 + w0 < 0         w1 = -1
1   1   ~0   w1 + w2 + w0 < 0

Transformation of the input by the hidden layer
z1 = sigmoid(x1 - x2 - 0.5), z2 = sigmoid(-x1 + x2 - 0.5)

x1  x2  arg1   z1    arg2   z2    r
0   0   -0.5   0.38  -0.5   0.38  0
0   1   -1.5   0.18   0.5   0.62  1
1   0    0.5   0.62  -1.5   0.18  1
1   1   -0.5   0.38  -0.5   0.38  0

In the (z1, z2) space the transformed XOR becomes a Boolean OR-like, linearly separable problem.

Find weights connecting the hidden layer to the output: y = sigmoid(v^T z)

z1    z2    r   required                    choice
0.38  0.38  0   .38v1 + .38v2 + v0 < 0      v0 = -0.78
0.18  0.62  1   .18v1 + .62v2 + v0 > 0      v1 = 1
0.62  0.18  1   .62v1 + .18v2 + v0 > 0      v2 = 1
0.38  0.38  0   .38v1 + .38v2 + v0 < 0

A solution of XOR by a multilayer perceptron
(Figure: network diagram with hidden weights w1 = (-0.5, 1, -1), w2 = (-0.5, -1, 1) and output weights v = (-0.78, 1, 1).)
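A sketch that assembles the weights derived above into a two-layer network and checks that it reproduces XOR:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Weights derived on the previous slides (bias first: w0, w1, w2)
W_hidden = np.array([[-0.5,  1.0, -1.0],    # z1 = sigmoid(x1 - x2 - 0.5)
                     [-0.5, -1.0,  1.0]])   # z2 = sigmoid(-x1 + x2 - 0.5)
v = np.array([-0.78, 1.0, 1.0])             # output: y = sigmoid(v0 + v1*z1 + v2*z2)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
for x in X:
    z = sigmoid(W_hidden @ np.concatenate(([1.0], x)))      # hidden-layer features
    y = sigmoid(v @ np.concatenate(([1.0], z)))             # network output
    print(x, round(y, 3), int(y > 0.5))   # predictions 0, 1, 1, 0: XOR reproduced
```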

Recall multivariate linear regression with a perceptron: the weight update that minimizes the sum of squared residuals is Δw_j^t = η (r^t - y^t) x_j^t.

Multivariate nonlinear regression with one hidden layer
Forward: z_h = sigmoid(w_h^T x), y = v^T z (with z_0 = 1).
Backward: Δv_h = η Σ_t (r^t - y^t) z_h^t and Δw_hj = η Σ_t (r^t - y^t) v_h z_h^t (1 - z_h^t) x_j^t.
The factor z_h(1 - z_h) is specific to the sigmoid as the nonlinear transform of the input.

Note: the update for the first-layer weights uses the old second-layer weights v_h. (Are these updates online or batch?)

Example using in silico data: x^t ~ U(-0.5, 0.5), y^t = sin(6x^t) + N(0, 0.1).
An epoch is a complete pass through the training data.
The validation error is calculated after each epoch; the fit gets better with an increasing number of epochs.
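A self-contained sketch that generates the in silico data and trains the one-hidden-layer regression network with the online updates above; the number of hidden units, learning rate, and epoch count are illustrative choices:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
x = rng.uniform(-0.5, 0.5, 200)
r = np.sin(6 * x) + rng.normal(0, 0.1, 200)        # in silico data from the slide
x_train, r_train, x_val, r_val = x[:100], r[:100], x[100:], r[100:]

H, eta, epochs = 10, 0.05, 500                     # hidden units, learning rate, epochs
W = rng.uniform(-0.3, 0.3, (H, 2))                 # hidden weights (bias in column 0)
v = rng.uniform(-0.3, 0.3, H + 1)                  # output weights (bias in v[0])

def forward(xt, W, v):
    z = sigmoid(W @ np.array([1.0, xt]))           # hidden-layer outputs z_h
    return z, v[0] + v[1:] @ z                     # network output y = v^T z

for epoch in range(epochs):
    for t in rng.permutation(len(x_train)):
        z, y = forward(x_train[t], W, v)
        err = r_train[t] - y
        v_old = v.copy()                           # keep the old second-layer weights
        v[0] += eta * err                          # output-bias update
        v[1:] += eta * err * z                     # delta_v_h
        W += eta * err * (v_old[1:] * z * (1 - z))[:, None] \
             * np.array([1.0, x_train[t]])         # delta_w_hj uses the old v_h
    if (epoch + 1) % 100 == 0:
        val_err = np.mean([(r_val[i] - forward(x_val[i], W, v)[1]) ** 2
                           for i in range(len(x_val))])
        print(epoch + 1, round(val_err, 4))        # validation error every 100 epochs
```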

Overtraining
The validation-error curve shows an elbow. Beyond the elbow, training and validation errors track each other for roughly 200 epochs. Above roughly 500 epochs there is clear evidence of overtraining.

Too many hidden units
Multivariate regression with one hidden layer containing H units has H(d+1) + (H+1) weights.
Up to H ~ 5, more hidden units give significant improvement.
Above H ~ 15, the validation error increases while the training error is flat: evidence that the ANN is more complex than needed.

Tuning the network size by weight decay
A zero weight effectively removes a connection. Weight decay creates a tendency for unnecessary weights to approach zero.
Add a term to the weight-update rule: Δw_i = -η dE/dw_i - λ w_i.
The best magnitude of λ will depend on η; use a validation set to get the right balance.
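In code, weight decay is a one-line change to any of the update loops above; the decay strength lam is an illustrative value to be tuned on a validation set:

```python
import numpy as np

def sgd_with_weight_decay(X, r, eta=0.01, lam=1e-3, epochs=50, seed=0):
    """Online squared-error update plus a decay term: delta_w = eta*(r_t - y_t)*x_t - lam*w."""
    rng = np.random.default_rng(seed)
    X1 = np.column_stack([np.ones(len(X)), X])
    w = rng.uniform(-0.01, 0.01, X1.shape[1])
    for _ in range(epochs):
        for t in rng.permutation(len(X1)):
            y_t = w @ X1[t]
            w += eta * (r[t] - y_t) * X1[t] - lam * w   # decay pulls unneeded weights toward zero
    return w
```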

Structural adaptation: systematic architectural changes without starting over.
Destructive: start large and remove unnecessary connections.
Constructive: start small and add what is needed.

Tuning the network size by construction
Dynamic node creation (Ash, 1989): start with one unit in one hidden layer; train and test. If the validation error is too large, add another hidden unit and train without reinitializing the weights of the previous connections.

Tuning the network size by construction
Cascade correlation (Fahlman and Lebiere, 1989): start with one unit in one hidden layer; add each new hidden unit as another hidden layer, with one node per hidden layer.
Freeze previously trained weights and train only the newly added connections.
Training a single layer at each step is faster than training multiple hidden layers. This makes sense because each new unit is added to learn what has not yet been learned by the previous layer(s).

How nonlinearity works
Input to hidden unit h: w_h^T x + w_h0.
Hidden-unit output: z_h = sigmoid(w_h^T x).
Output is the hidden units weighted by the output weights: y = Σ_h v_h z_h.

Tall and lean is usually better than short and fat: multiple layers may lead to a simpler network.
Regression with two hidden layers: feed-forward structure.

Weight-update equations for regression with two hidden layers (batch mode)
New notation designed to show the pattern: update = learning factor ∙ output error ∙ input.
y = v^T z2, for the connections between layer 2 and the output;
z_2l = sigmoid(w_2l^T z1), for the connections between layer 1 and unit 2l;
z_1h = sigmoid(w_1h^T x), for the connections between the input and unit 1h.
The error depends on w_1hj through the weights in the top two layers.
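A sketch of the feed-forward pass in this layered notation (shapes and random weights are illustrative); the backward pass applies the same "learning factor ∙ error ∙ input" pattern layer by layer:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward_two_hidden(x, W1, W2, v):
    """Feed-forward pass matching the slide's notation (biases handled via a leading 1)."""
    z1 = sigmoid(W1 @ np.concatenate(([1.0], x)))    # z_1h = sigmoid(w_1h^T x)
    z2 = sigmoid(W2 @ np.concatenate(([1.0], z1)))   # z_2l = sigmoid(w_2l^T z1)
    y = v @ np.concatenate(([1.0], z2))              # y = v^T z2
    return z1, z2, y

# Illustrative sizes: d = 3 inputs, H1 = 4 and H2 = 3 hidden units
rng = np.random.default_rng(0)
W1 = rng.uniform(-0.3, 0.3, (4, 3 + 1))
W2 = rng.uniform(-0.3, 0.3, (3, 4 + 1))
v = rng.uniform(-0.3, 0.3, 3 + 1)
print(forward_two_hidden(rng.uniform(-1, 1, 3), W1, W2, v)[2])
```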

Two-class discrimination with one hidden layer
The sigmoid output y = sigmoid(v^T z^t) models P(C1|x).
Minimize the cross entropy with batch updates; the weight-update formulas are the same as for regression.
After convergence, assign the example to C1 if the output > 0.5.

K > 2 classes with one hidden layer
Derived from the multinomial distribution: y_i = exp(v_i^T z) / Σ_k exp(v_k^T z) (note the sum over classes).
v_i is the weight vector connecting the nodes of the hidden layer to the output for class i.
Minimize the cross entropy by batch update.
After convergence, assign the example to the class with the largest output.

Assignment 4: due 10-23-12. Classification problem with a small dataset
Build a system that can classify a piece of glass from a beer bottle: train an ANN to classify which brewery the bottle came from using the chemical content of the glass.
glassdata.csv contains 214 samples of bottle glass from 6 breweries.
Since the dataset is small, do not use a validation set.

I have not been able to make Lars' ANN code perform the way I expected it to from my experience with it in 2008. For now, I am suspending all parts of assignment 4 that involve Lars' code.

New objectives for Assignment 4
Objective 1: Create an input data file that will let you run WEKA's multilayer perceptron for classification. Describe the changes you made to glassdata.csv to achieve this.
Objective 2: Capture WEKA's results using the default settings, including the confusion matrix. Can you find settings that give better results? If so, describe those settings in your report, along with the WEKA output including the improved confusion matrix.

First 3 rows of data:
1  1.521  13.64  4.49  1.10  71.78  0.06  8.75
2  1.517  13.89  3.60  1.36  72.73  0.48  7.83
3  1.516  13.53  3.55  1.54  72.99  0.39  7.78

The data are not ready for use by either WEKA or Lars' code. What is needed, irrespective of which software is used?

Columns:
1. Sample index
2. RI: refractive index
3. Na: sodium
4. Mg: magnesium
5. Al: aluminum
6. Si: silicon
7. K: potassium
8. Ca: calcium
9. Ba: barium
10. Fe: iron
11. Type of bottle (class): 1 = Anheuser-Busch, Inc.; 2 = Miller Brewing Co.; 3 = Blitz-Weinhard Brewing Co.; 4 = Pete's Brewing Co.; 5 = Samuel Adams Brew House; 6 = Plank Road Brewery
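One possible data-preparation sketch (the column names and output file name are hypothetical): WEKA generally needs a header row and a nominal class attribute, so a common fix is to add headers, drop the sample index, and store the class as a labeled string.

```python
import pandas as pd

cols = ["index", "RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe", "brewery"]
# Assumes comma-separated values; adjust sep= (or use delim_whitespace=True) if needed.
df = pd.read_csv("glassdata.csv", header=None, names=cols)

breweries = {1: "Anheuser-Busch", 2: "Miller", 3: "Blitz-Weinhard",
             4: "Petes", 5: "Samuel_Adams", 6: "Plank_Road"}
df["brewery"] = df["brewery"].map(breweries)       # nominal class label instead of a number
df = df.drop(columns=["index"])                    # the sample index is not an attribute

df.to_csv("glassdata_weka.csv", index=False)       # CSV with header row for import into WEKA
```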

Background for Assignment 5 (due 11-6): example of multivariate nonlinear regression. Prediction of the charge on peptides after electro-spray ionization.

Background for Assignment 5 (due 11-6-12): example of multivariate nonlinear regression, prediction of peptide charge in electro-spray ionization.
Construct the input from the amino-acid sequence of the peptide.
Goal: the best result with the smallest input dimension.

The first 4 of ~23,000 data pairs are:
Sequence                            Charge
AAAAAAPDDVAAQLVVADLDLVGGHVEDAFAR    2.8
AAAAADLANR                          2
AAAAAQASASAAAK                      1.714286
AAAAAVAQGGPIEDAER

How can we encode the peptide sequence? Can an encoded peptide sequence be an input? Can we encode a subset of the peptide sequence? What inputs can we calculate from the sequence?

Properties of amino acids

code  mass       pI     pK1   pK2    charge  hydrophobic?  polar?
A     89.09404   6.01   2.35  9.87           T             F
R     174.20274  10.76  1.82  8.99   +
N     132.1190   5.41   2.14  8.72
D     133.10384  2.85   1.99  9.9    -
C     121.15404  5.05   1.92  10.7
E     146.14594  3.15   2.1   9.47
Q                5.65   2.17  9.13
G     75.06714   6.06         9.78
H     155.15634  7.6    1.8   9.33
I     131.17464  6.05   2.32  9.76
L                       2.33  9.74
K     146.18934  9.6    2.16  9.06
M     149.20784  5.74   2.13  9.28
F     165.1918   5.49   2.2   9.31
P     115.13194  6.3    1.95  10.64
S     105.09344  5.68   2.19  9.21
T     119.12034  5.6    2.09  9.1
W     204.22844  5.89   2.46  9.41
Y     181.19124  5.64
V     117.14784  6.0    2.39

Some suggestions for inputs from the properties of amino acids (a sketch of computing a few of these follows the list):
- Length of the peptide
- Mass of the peptide
- Sequence of the first 2 residues
- Sequence of the last 2 residues
- Fractions of amino acids of each type
- Fractions of hydrophobic, polar, and charged residues
- Net formal charge
- Average isoelectric point
- Average dissociation constant
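A minimal sketch of computing a few of these inputs from a peptide string. The property dictionaries below are small illustrative subsets (A, R, N, D values from the table above; the leucine entries and the hydrophobic set are filled in as assumptions), not a complete or authoritative table:

```python
# Illustrative subset of amino-acid properties; extend with the full table before real use.
MASS = {"A": 89.09, "R": 174.20, "N": 132.12, "D": 133.10, "L": 131.17}
PI = {"A": 6.01, "R": 10.76, "N": 5.41, "D": 2.85, "L": 5.98}
HYDROPHOBIC = set("AVLIMFWP")      # a commonly used hydrophobic set (assumption)
CHARGED = set("DEKR")              # acidic and basic side chains

def peptide_features(seq):
    """Compute a few of the suggested inputs for one peptide sequence."""
    n = len(seq)
    return {
        "length": n,
        "mass": sum(MASS.get(aa, 0.0) for aa in seq),              # crude residue-mass sum
        "frac_hydrophobic": sum(aa in HYDROPHOBIC for aa in seq) / n,
        "frac_charged": sum(aa in CHARGED for aa in seq) / n,
        "avg_pI": sum(PI.get(aa, 0.0) for aa in seq) / n,
        "first2": seq[:2],
        "last2": seq[-2:],
    }

print(peptide_features("AAAAADLANR"))   # the second example peptide from the data
```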

Review of protein biology Central dogma of biology

Dogma on protein function:
Proteins are polymers of amino acids.
The sequence of amino acids determines a protein's shape (folding pattern).
The shape of a protein determines its function.

What are amino acids? Each has an N-terminus, a C-terminus, and a side chain; amino acids with different side chains have different names:
glycine gly G, alanine ala A, valine val V, leucine leu L, isoleucine ile I, methionine met M, proline pro P, phenylalanine phe F, tryptophan trp W, serine ser S, cysteine cys C, threonine thr T, glutamine gln Q, asparagine asn N, histidine his H, tyrosine tyr Y, glutamic acid glu E, aspartic acid asp D, lysine lys K, arginine arg R

chemical properties of amino acids

Sequence of Protein Dictates its Folding Pattern

Amino acids polymerize to form proteins via formation of the peptide bond.
(Figure: backbone -N-C-C-N-C-C-N-, with an H and a side chain R on each alpha carbon.)

Proteases cut proteins into peptides: they catalyze cleavage of peptide bonds.
Most proteases have cleavage specificity; trypsin cleaves at arginines and lysines.
Digestion of a protein with trypsin produces peptides of various lengths.

Liquid chromatography coupled to mass spectrometry
(Figure: digested protein mixture → LC column → electro-spray ionization → mass spectrometer.)
Peptides are retained for differing times on the LC column. Peptides may carry multiple charges; the charges in assignment 5 are averages over several runs.

Hints: known properties independent of training
Virtual examples: additions to the training set with the same label.
Example 1: invariance to translation, rotation, etc.
Example 2: "0101" and "1010" have the same parity, so virtual examples can be added by shifting.

Hints: known properties independent of training
Add penalties to the error: E' = E + λ_h E_h, where E is the usual sum of squared residuals.
Example 1: if we know f(x) = f(x'), let E_h = [g(x|θ) - g(x'|θ)]^2; λ_h determines the significance of the penalty.
Example 2: we know that f(x) is between a_x and b_x.