Review for Test #2: Fundamentals of ANN, dimensionality reduction, genetic algorithms
HW #3: Boolean OR
Linear discriminant: w^T x = x1 + x2 - 0.5 = 0. The classes are linearly separable.

x1  x2  r   constraint          choice
0   0   0   w0 < 0              w0 = -0.5
0   1   1   w2 + w0 > 0         w2 = 1
1   0   1   w1 + w0 > 0         w1 = 1
1   1   1   w1 + w2 + w0 > 0
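A quick check of this truth table in Python (a minimal sketch, not part of the original homework):

```python
# Boolean OR with a single linear threshold unit: y = 1 if w1*x1 + w2*x2 + w0 > 0
w0, w1, w2 = -0.5, 1.0, 1.0

for x1, x2, r in [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]:
    y = int(w1 * x1 + w2 * x2 + w0 > 0)
    print((x1, x2), "target:", r, "output:", y)
```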
XOR in feature space with Gaussian kernels
f1 = exp(-||x - [1,1]^T||^2), f2 = exp(-||x - [0,0]^T||^2)

x       f1      f2
(1,1)   1       0.1353
(0,1)   0.3678  0.3678
(0,0)   0.1353  1
(1,0)   0.3678  0.3678

This transformation puts examples (0,1) and (1,0) at the same point in feature space.
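A short script reproducing the table (numpy assumed; the center of f2 at (0,0) is inferred from the listed values):

```python
import numpy as np

# Gaussian kernel features: f1(x) = exp(-||x - [1,1]||^2), f2(x) = exp(-||x - [0,0]||^2)
centers = np.array([[1.0, 1.0], [0.0, 0.0]])

for point in [(1, 1), (0, 1), (0, 0), (1, 0)]:
    x = np.array(point, dtype=float)
    f1, f2 = np.exp(-np.sum((x - centers) ** 2, axis=1))
    print(point, round(float(f1), 4), round(float(f2), 4))
```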
Consider hidden units zh as features. Choose wh so that in feature space, (0,0) and (1,1) are at the same point.
[Figure: attribute space (x1, x2) mapped to feature space (z1, z2)]
Design criteria for hidden layer

x1  x2  r   z1  z2
0   0   0   ~0  ~0
0   1   1   ~0  ~1
1   0   1   ~1  ~0
1   1   0   ~0  ~0

wh^T x < 0 → zh ~ 0;  wh^T x > 0 → zh ~ 1
Find weights for the design criteria

Hidden unit z1:
x1  x2  z1   w1^T x   constraint           choice
0   0   ~0   < 0      w0 < 0               w0 = -0.5
0   1   ~0   < 0      w2 + w0 < 0          w2 = -1
1   0   ~1   > 0      w1 + w0 > 0          w1 = 1
1   1   ~0   < 0      w1 + w2 + w0 < 0

Hidden unit z2:
x1  x2  z2   w2^T x   constraint           choice
0   0   ~0   < 0      w0 < 0               w0 = -0.5
0   1   ~1   > 0      w2 + w0 > 0          w2 = 1
1   0   ~0   < 0      w1 + w0 < 0          w1 = -1
1   1   ~0   < 0      w1 + w2 + w0 < 0
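A sketch that checks these choices and shows why they solve XOR (numpy assumed; weights written as [w1, w2, w0] against the input [x1, x2, 1]):

```python
import numpy as np

w_z1 = np.array([1.0, -1.0, -0.5])   # z1 fires only for (1, 0)
w_z2 = np.array([-1.0, 1.0, -0.5])   # z2 fires only for (0, 1)

for x1, x2, r in [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]:
    x = np.array([x1, x2, 1.0])
    z1 = int(w_z1 @ x > 0)
    z2 = int(w_z2 @ x > 0)
    # In (z1, z2) space XOR is linearly separable: r = 1 exactly when z1 + z2 > 0,
    # and (0,0) and (1,1) both land at (z1, z2) = (0, 0).
    print((x1, x2), "->", (z1, z2), "target:", r)
```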
Training a neural network by back-propagation
Initialize weights randomly.
How is the adjustment of the weights related to the difference between output and target?
Approaches to Training
Online: weights are updated from training-set examples seen one by one, in random order.
Batch: weights are updated from the whole training set, after summing the deviations from the individual examples.
How can we tell the difference?
Online or batch?
Update = learning factor ∙ output error ∙ input
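A minimal sketch of the two regimes for a linear unit trained with the delta rule (numpy assumed; the linear unit and names are illustrative, not taken from the slide):

```python
import numpy as np

def online_epoch(w, X, r, eta):
    """Online: update after every example, examples visited in random order."""
    for t in np.random.permutation(len(X)):
        y = X[t] @ w
        w = w + eta * (r[t] - y) * X[t]     # learning factor * output error * input
    return w

def batch_epoch(w, X, r, eta):
    """Batch: sum the per-example deviations over the whole set, then update once."""
    y = X @ w
    return w + eta * (r - y) @ X            # sum over t of (r^t - y^t) x^t
```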
Multivariate nonlinear regression with a multilayer perceptron
[Figure: forward and backward computations through the network]
Can you express E^t as an explicit function of w_hj?
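One way to write it out (a sketch assuming the usual regression MLP: sigmoid hidden units z_h, linear outputs y_i, squared error, and x_0^t = 1 as the bias input):

\[
E^t = \frac{1}{2}\sum_i \big(r_i^t - y_i^t\big)^2,
\qquad
y_i^t = v_{i0} + \sum_{h=1}^{H} v_{ih}\,\operatorname{sigmoid}\!\Big(\sum_{j=0}^{d} w_{hj}\,x_j^t\Big),
\]

so E^t depends on w_hj only through the hidden unit z_h^t.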
Batch mode
[Figure: forward and backward computations in batch mode]
Why do some sums go and others stay?
[Figure: network with input x_j, hidden unit z_h, output y_i, and weights w_hj, v_ih]
Total error in the connection from z_h to the output.
Update = learning factor ∙ output error ∙ input
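A sketch of one online back-propagation step for this network (numpy assumed; W holds the w_hj, V the v_ih), making explicit which sum remains in the w_hj update:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def online_backprop_step(W, V, x, r, eta):
    """W: (H, d+1) input-to-hidden weights, V: (K, H+1) hidden-to-output weights,
    x: (d+1,) input with x[0] = 1, r: (K,) targets."""
    z = sigmoid(W @ x)                  # hidden units z_h
    z1 = np.concatenate(([1.0], z))     # prepend the bias unit z_0 = 1
    y = V @ z1                          # linear outputs y_i
    err = r - y                         # output errors (r_i - y_i)

    dV = eta * np.outer(err, z1)        # Delta v_ih = eta * (r_i - y_i) * z_h
    # The sum over examples t is gone (online mode), but the sum over outputs i
    # stays: z_h feeds every output, so its error signal is sum_i (r_i - y_i) v_ih.
    delta_h = (err @ V[:, 1:]) * z * (1.0 - z)
    dW = eta * np.outer(delta_h, x)     # Delta w_hj = eta * delta_h * x_j
    return W + dW, V + dV
```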
Back-propagation for a perceptron dichotomizer
Like the sum of squared residuals, the cross-entropy depends on the weights w through y^t.
The dependence on y^t is more complex, and y^t = sigmoid(w^T x).
The same principles apply.
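A sketch of the cross-entropy case (single sigmoid output, assuming the standard two-class cross-entropy):

\[
E(\mathbf{w}) = -\sum_t \Big[r^t \log y^t + (1 - r^t)\log\big(1 - y^t\big)\Big],
\qquad y^t = \operatorname{sigmoid}(\mathbf{w}^T \mathbf{x}^t),
\]
\[
\frac{\partial E}{\partial w_j} = -\sum_t \big(r^t - y^t\big)\,x_j^t
\;\Rightarrow\;
\Delta w_j = \eta \sum_t \big(r^t - y^t\big)\,x_j^t,
\]

again learning factor ∙ output error ∙ input: the sigmoid's derivative cancels against the derivative of the cross-entropy.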
Review of protein biology
The central dogma of biology
Dogma on protein function
Proteins are polymers of amino acids.
The sequence of amino acids determines a protein's shape (folding pattern).
The shape of a protein determines its function.
In natural selection, which changes faster: protein sequence or protein shape?
Chemical properties of amino acids
Which of the amino acids V and S is more likely to be found in the core of a protein structure?
Dimensionality Reduction by Auto-Association
The hidden layer is smaller than the input, and the output is required to reproduce the input ("reconstruction error").
On the validation and test sets, the reconstruction error will not be zero. How do we make use of this?
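A minimal sketch of auto-association, using scikit-learn's MLPRegressor as the network and made-up toy data (both are assumptions, not part of the original slides):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Toy data: 5-dimensional points that mostly live on a 2-D subspace, plus noise
latent = rng.normal(size=(300, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.05 * rng.normal(size=(300, 5))
X_train, X_val = X[:200], X[200:]

# Auto-association: the network is trained to reproduce its own input
# through a bottleneck of 2 hidden units (fewer than the 5 inputs).
net = MLPRegressor(hidden_layer_sizes=(2,), max_iter=5000, random_state=0)
net.fit(X_train, X_train)

def reconstruction_error(model, data):
    return float(np.mean((model.predict(data) - data) ** 2))

print("train reconstruction error:     ", reconstruction_error(net, X_train))
print("validation reconstruction error:", reconstruction_error(net, X_val))
```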
Linear and Non-linear Data Smoothing
[Figure: examples; blue: original, red: smoothed]
PCA brings back an old friend
Find w1 such that w1^T S w1 is maximal, subject to the constraint ||w1||^2 = w1^T w1 = 1.
Maximize L = w1^T S w1 + c (w1^T w1 - 1).
Setting the gradient to zero: 2 S w1 + 2 c w1 = 0, so S w1 = -c w1.
Hence w1 is an eigenvector of the covariance matrix S; writing c = -λ1, λ1 is the eigenvalue associated with w1.
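The same result in code: the principal direction comes from an eigendecomposition of the covariance matrix (numpy assumed; the data are a made-up example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0], [[3, 1], [1, 3]], size=500)   # toy data

S = np.cov(X, rowvar=False)             # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)    # solves S w = lambda w (S is symmetric)

w1 = eigvecs[:, -1]                     # eigh sorts ascending, so w1 is last
print("largest eigenvalue:", eigvals[-1])
print("w1:", w1, " ||w1|| =", np.linalg.norm(w1))
```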
A simple example of constrained optimization using Lagrange multipliers
Find the stationary points of f(x1, x2) = 1 - x1^2 - x2^2 subject to the constraint g(x1, x2) = x1 + x2 - 1 = 0.
Form the Lagrangian L(x, λ) = 1 - x1^2 - x2^2 + λ(x1 + x2 - 1).
Set the partial derivatives of L with respect to x1, x2, and λ equal to zero:
-2x1 + λ = 0
-2x2 + λ = 0
x1 + x2 - 1 = 0
Solve for x1 and x2. How?
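A worked step, following directly from the three equations above: the first two give x1 = x2 = λ/2, and substituting into x1 + x2 - 1 = 0 gives λ = 1, hence x1 = x2 = 1/2.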
x1* = x2* = 1/2
[Figure: contours of f(x1, x2)]
In this case it is not necessary to find λ; λ is sometimes called an "undetermined multiplier".
Application of the characteristic polynomial
Calculate the eigenvalues of A from det(A - λI) = 0:
(3 - λ)(3 - λ) - 1 = λ^2 - 6λ + 8 = 0
By the quadratic formula, λ1 = 4 and λ2 = 2.
This is not a practical way to calculate the eigenvalues of S.
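For comparison, the practical route (numpy assumed; the matrix A = [[3, 1], [1, 3]] is an assumption inferred from the characteristic polynomial above):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

# A numerical eigensolver, rather than the characteristic polynomial
print(np.linalg.eigvalsh(A))   # -> [2. 4.]
```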
In PCA, don't confuse eigenvalues and principal components
Are these eigenvalues principal components?
The d = 9 eigenvalues, in decreasing order (k = 1, 2, …): 73.6809, 18.7491, 2.8856, 1.9068, 0.7278, 0.5444, 0.4238, 0.3501, 0.1631
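A common use of these eigenvalues (a sketch, not the slide's own calculation): the proportion of variance explained by the first k components.

```python
import numpy as np

eigvals = np.array([73.6809, 18.7491, 2.8856, 1.9068, 0.7278,
                    0.5444, 0.4238, 0.3501, 0.1631])   # the d = 9 eigenvalues above

pov = np.cumsum(eigvals) / np.sum(eigvals)   # proportion of variance for k = 1..9
for k, p in enumerate(pov, start=1):
    print(f"k = {k}: {p:.3f}")
```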
Data projected onto the 1st and 2nd principal components
[Figure: scatter plot of the projected data]
How was this figure constructed?
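One plausible construction (a sketch with made-up placeholder data; numpy and matplotlib assumed): center the data, take the top-2 eigenvectors of the covariance matrix, and project.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9)) @ rng.normal(size=(9, 9))   # placeholder data, d = 9

m = X.mean(axis=0)
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
W2 = eigvecs[:, ::-1][:, :2]        # eigenvectors sorted by decreasing eigenvalue
Z = (X - m) @ W2                    # z^t = W2^T (x^t - m)

plt.scatter(Z[:, 0], Z[:, 1])
plt.xlabel("1st principal component")
plt.ylabel("2nd principal component")
plt.show()
```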
Principal Components Analysis (PCA)
If w is a unit vector, then z = w^T x is the projection of x in the direction of w. Note that z = w^T x = x^T w = w1 x1 + w2 x2 + … is a scalar.
Use the projection to find a low-dimensional feature space where the essential information in the data is preserved.
Accomplish this by finding features z such that Var(z) is maximal (i.e. spread the data out).
Method to select chromosomes for refinement
Calculate the fitness f(xi) for each chromosome in the population.
Assign each chromosome a discrete probability pi = f(xi) / Σj f(xj).
Use pi to design a roulette wheel.
How do we spin the wheel?
Spinning the roulette wheel
Divide the number line between 0 and 1 into segments of length pi, in a specified order.
Get r, a random number uniformly distributed between 0 and 1.
Choose the chromosome of the line segment containing r.
Decisions about crossover and mutation are made similarly (crossover probability = 0.75, mutation probability = 0.002).
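A sketch of the wheel in code (numpy assumed; the fitness values are illustrative):

```python
import numpy as np

def roulette_select(fitness, rng):
    """Pick one chromosome index with probability p_i = f(x_i) / sum_j f(x_j)."""
    p = np.asarray(fitness, dtype=float)
    p = p / p.sum()
    cum = np.cumsum(p)                   # segment boundaries on [0, 1]
    r = rng.random()                     # spin: uniform random number in [0, 1)
    # Index of the segment containing r (guarded against float round-off in cum[-1])
    return int(min(np.searchsorted(cum, r, side="right"), len(p) - 1))

rng = np.random.default_rng(0)
fitness = [4.0, 1.0, 3.0, 2.0]
picks = [roulette_select(fitness, rng) for _ in range(10000)]
print(np.bincount(picks) / 10000)        # roughly [0.4, 0.1, 0.3, 0.2]
```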
Sigma scaling allows variable selection pressure
Sigma scaling of the fitness f(x): μ and σ are the mean and standard deviation of the fitness in the population.
In early generations, selection pressure should be low to enable wider coverage of the search space (large σ).
In later generations, selection pressure should be higher to encourage convergence to the optimum solution (small σ).
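The slide's own scaling formula is not reproduced above; as a hedged sketch, one common form of sigma scaling looks like this (the exact formula and constants are assumptions):

```python
import numpy as np

def sigma_scaled(f, floor=0.1):
    """One common form of sigma scaling (an assumed formula, not necessarily the
    slide's): f'(x) = 1 + (f(x) - mu) / (2 * sigma), clipped below at `floor`,
    with f'(x) = 1 for every x when sigma = 0."""
    f = np.asarray(f, dtype=float)
    mu, sigma = f.mean(), f.std()
    if sigma == 0.0:
        return np.ones_like(f)             # no variation: select uniformly
    return np.maximum(1.0 + (f - mu) / (2.0 * sigma), floor)

# Large sigma damps fitness differences (low selection pressure, early generations);
# small sigma amplifies them (high pressure, later generations).
print(sigma_scaled([4.0, 1.0, 3.0, 2.0]))
```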