Review for test #2 Fundamentals of ANN Dimensionality reduction


Review for test #2: Fundamentals of ANN, Dimensionality reduction, Genetic algorithms

HW #3: Boolean OR
Linear discriminant: wᵀx = x1 + x2 - 0.5 = 0. The classes are linearly separable.

x1  x2  r    constraint           choice
0   0   0    w0 < 0               w0 = -0.5
0   1   1    w2 + w0 > 0          w2 = 1
1   0   1    w1 + w0 > 0          w1 = 1
1   1   1    w1 + w2 + w0 > 0     (satisfied by the choices above)

(Lecture Notes for E. Alpaydın 2010, Introduction to Machine Learning 2e, © The MIT Press)
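
A quick check of this table (a sketch, not from the original homework; it assumes a step activation, the usual convention for a hard linear discriminant):

```python
# Verify that w = (w0, w1, w2) = (-0.5, 1, 1) implements Boolean OR with a step activation.
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    y = int(-0.5 + 1.0 * x1 + 1.0 * x2 > 0)   # step(w^T x)
    print((x1, x2), "->", y)                   # prints 0, 1, 1, 1
```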

XOR in feature space with Gaussian kernels
f1 = exp(-||X - [1,1]ᵀ||²),  f2 = exp(-||X - [0,0]ᵀ||²)

X      f1      f2
(1,1)  1       0.1353
(0,1)  0.3678  0.3678
(0,0)  0.1353  1
(1,0)  0.3678  0.3678

This transformation puts examples (0,1) and (1,0) at the same point in feature space.
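
A minimal NumPy sketch that reproduces the table (the two Gaussian centers [1,1] and [0,0] are taken from the feature definitions above):

```python
# Map the four XOR inputs into the 2-D Gaussian-kernel feature space (f1, f2).
import numpy as np

centers = np.array([[1.0, 1.0], [0.0, 0.0]])             # centers of f1 and f2
inputs = np.array([[1, 1], [0, 1], [0, 0], [1, 0]], float)

for x in inputs:
    f = np.exp(-np.sum((x - centers) ** 2, axis=1))       # [f1, f2] for this input
    print(x, np.round(f, 4))
# (0,1) and (1,0) both land at (e^-1, e^-1) ~ (0.368, 0.368), so XOR becomes separable here.
```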

(0,0) and (1,1) at the same point: consider the hidden units zh as features and choose wh so that, in feature space, (0,0) and (1,1) map to the same point.
[Figure: XOR examples in attribute space (x1, x2) and in feature space (z1, z2)]

Design criteria for the hidden layer

x1  x2  r   z1   z2
0   0   0   ~0   ~0
0   1   1   ~0   ~1
1   0   1   ~1   ~0
1   1   0   ~0   ~0

whᵀx < 0 → zh ~ 0;  whᵀx > 0 → zh ~ 1

Find weights for the design criteria

Hidden unit z1 (weight vector w1):
x1  x2  z1   w1ᵀx required   constraint           choice
0   0   ~0   < 0             w0 < 0               w0 = -0.5
0   1   ~0   < 0             w2 + w0 < 0          w2 = -1
1   0   ~1   > 0             w1 + w0 > 0          w1 = 1
1   1   ~0   < 0             w1 + w2 + w0 < 0     (satisfied by the choices above)

Hidden unit z2 (weight vector w2):
x1  x2  z2   w2ᵀx required   constraint           choice
0   0   ~0   < 0             w0 < 0               w0 = -0.5
0   1   ~1   > 0             w2 + w0 > 0          w2 = 1
1   0   ~0   < 0             w1 + w0 < 0          w1 = -1
1   1   ~0   < 0             w1 + w2 + w0 < 0     (satisfied by the choices above)
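
The sketch below wires these hidden weights into a 2-2-1 network and checks that it computes XOR. The output-layer weights (-0.5, 1, 1) are a hypothetical choice (not given in the slide), and a hard threshold stands in for a steep sigmoid:

```python
# Two-layer network for XOR using the hand-chosen hidden weights from the tables above.
import numpy as np

step = lambda a: (a > 0).astype(float)

W = np.array([[-0.5,  1.0, -1.0],     # weights for z1: [w0, w1, w2]
              [-0.5, -1.0,  1.0]])    # weights for z2: [w0, w1, w2]
v = np.array([-0.5, 1.0, 1.0])        # hypothetical output weights [v0, v1, v2]

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    z = step(W @ np.array([1.0, x1, x2]))     # hidden features (z1, z2)
    y = int(v[0] + v[1:] @ z > 0)             # output unit: step(v0 + v1*z1 + v2*z2)
    print((x1, x2), "z =", z, "y =", y)       # y reproduces XOR: 0, 1, 1, 0
```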

Training a neural network by back-propagation: initialize the weights randomly. How is the adjustment of the weights related to the difference between output and target?

Approaches to training
Online: the weights are updated from training-set examples seen one by one, in random order.
Batch: the weights are updated from the whole training set, after summing the deviations from the individual examples.
How can we tell the difference?

Online or batch? Update = learning factor ∙ output error ∙ input
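
A small sketch contrasting the two schemes for a single linear unit with squared error (the data, learning rate, and number of epochs are made-up illustration values):

```python
# Online vs. batch application of the delta rule: dw = eta * (target - output) * input.
import numpy as np

rng = np.random.default_rng(0)
X = np.c_[np.ones(20), rng.uniform(-1, 1, 20)]   # inputs with a bias column
r = 2.0 * X[:, 1] + 0.5                          # targets from a made-up linear rule
eta, w_online, w_batch = 0.1, np.zeros(2), np.zeros(2)

for epoch in range(200):
    for t in rng.permutation(len(X)):            # online: update after each example
        w_online += eta * (r[t] - w_online @ X[t]) * X[t]
    # batch: sum the per-example deviations over the whole set, then update once
    w_batch += eta * (r - X @ w_batch) @ X / len(X)

print("online:", np.round(w_online, 3), " batch:", np.round(w_batch, 3))  # both approach (0.5, 2.0)
```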

Multivariate nonlinear regression with a multilayer perceptron
[Figure: forward and backward passes through the network]
Can you express Eᵗ as an explicit function of whj?
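
One way to write it out (a sketch in Alpaydın's notation, assuming sigmoid hidden units and a linear output, the usual setup for regression with an MLP):

```latex
\[
E^t = \tfrac{1}{2}\bigl(r^t - y^t\bigr)^2, \qquad
y^t = v_0 + \sum_{h=1}^{H} v_h z_h^t, \qquad
z_h^t = \operatorname{sigmoid}\!\Bigl(w_{h0} + \sum_{j=1}^{d} w_{hj} x_j^t\Bigr),
\]
\[
E^t = \tfrac{1}{2}\Bigl(r^t - v_0 - \sum_{h=1}^{H} v_h \,
      \operatorname{sigmoid}\!\bigl(w_{h0} + \textstyle\sum_{j=1}^{d} w_{hj} x_j^t\bigr)\Bigr)^{2}.
\]
```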

Batch mode
[Figure: forward pass and batch-mode weight-update equations]

Why do some sums go and others stay?
[Figure: network diagram with input xj, hidden unit zh, output yi, and weights whj and vih]
Total error in the connection from zh to the output.
Update = learning factor ∙ output error ∙ input
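
For reference, the batch updates in this notation (a sketch assuming sigmoid hidden units and linear outputs, as in Alpaydın's treatment). The sum over outputs i "stays" in the hidden-weight update because every output receives error through the connection from zh:

```latex
\[
\Delta v_{ih} = \eta \sum_{t} \bigl(r_i^t - y_i^t\bigr)\, z_h^t,
\qquad
\Delta w_{hj} = \eta \sum_{t}
  \underbrace{\Bigl[\sum_{i} \bigl(r_i^t - y_i^t\bigr) v_{ih}\Bigr]}_{\text{total error reaching } z_h}\,
  z_h^t \bigl(1 - z_h^t\bigr)\, x_j^t .
\]
```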

Back-propagation for a perceptron dichotomizer: like the sum of squared residuals, the cross entropy depends on the weights w through yᵗ. The dependence on yᵗ is more complex, with yᵗ = sigmoid(wᵀx), but the same principles apply.
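
Written out for the two-class case with rᵗ ∈ {0, 1} (the standard form, not copied from the slide):

```latex
\[
E(\mathbf{w} \mid \mathcal{X})
  = -\sum_{t} \Bigl[ r^t \log y^t + \bigl(1 - r^t\bigr) \log\bigl(1 - y^t\bigr) \Bigr],
\qquad
y^t = \operatorname{sigmoid}\!\bigl(\mathbf{w}^{\mathsf T}\mathbf{x}^t\bigr),
\qquad
\Delta w_j = \eta \sum_{t} \bigl(r^t - y^t\bigr)\, x_j^t .
\]
```

Because the derivative of the sigmoid cancels against the cross-entropy term, the update keeps the familiar "learning factor ∙ output error ∙ input" form.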

Review of protein biology Central dogma of biology

Dogma on protein function
Proteins are polymers of amino acids.
The sequence of amino acids determines a protein's shape (folding pattern).
The shape of a protein determines its function.
In natural selection, which changes faster: protein sequence or protein shape?

Chemical properties of amino acids

Which of the amino acids V and S is more likely to be found in the core of a protein structure?

Dimensionality Reduction by Auto-Association: the hidden layer is smaller than the input, and the output is required to reproduce the input. The difference between output and input is the "reconstruction error".
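
A minimal sketch of the idea (a 3-1-3 linear auto-associator trained by gradient descent on made-up data; the sizes, data, and learning rate are illustrative assumptions, not from the course):

```python
# Auto-association: hidden layer (1 unit) smaller than the input (3-D); output must reproduce the input.
import numpy as np

rng = np.random.default_rng(0)
direction = np.array([1.0, 2.0, -1.0])
make = lambda n: rng.normal(size=(n, 1)) * direction + 0.1 * rng.normal(size=(n, 3))
X_train, X_val = make(200), make(50)                      # made-up data lying near one direction

W1 = rng.normal(scale=0.1, size=(1, 3))                   # encoder weights
W2 = rng.normal(scale=0.1, size=(3, 1))                   # decoder weights
eta = 0.01
for _ in range(2000):
    Z = X_train @ W1.T                                    # hidden representation
    err = Z @ W2.T - X_train                              # reconstruction minus input
    W2 -= eta * err.T @ Z / len(X_train)
    W1 -= eta * (err @ W2).T @ X_train / len(X_train)

recon = lambda X: np.mean(np.sum((X @ W1.T @ W2.T - X) ** 2, axis=1))
print("train reconstruction error:     ", round(recon(X_train), 4))
print("validation reconstruction error:", round(recon(X_val), 4))   # small but not zero
```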

On validation and test sets the reconstruction error will not be zero. How do we make use of this?

Linear and non-linear data smoothing examples [Figure: blue = original, red = smoothed]

PCA brings back an old friend
Find w1 such that w1ᵀSw1 is maximal subject to the constraint ||w1||² = w1ᵀw1 = 1.
Maximize L = w1ᵀSw1 + c(w1ᵀw1 - 1).
Setting the gradient to zero: ∇L = 2Sw1 + 2cw1 = 0, so Sw1 = -cw1.
Therefore w1 is an eigenvector of the covariance matrix S. Let c = -λ1; then λ1 is the eigenvalue associated with w1, and since w1ᵀSw1 = λ1 under the constraint, the maximum is achieved by the eigenvector with the largest eigenvalue.

A simple example of constrained optimization using Lagrange multipliers: find the stationary points of f(x1, x2) = 1 - x1² - x2² subject to the constraint g(x1, x2) = x1 + x2 - 1 = 0.

Form the Lagrangian: L(x, λ) = 1 - x1² - x2² + λ(x1 + x2 - 1)

Set the partial derivatives of L with respect to x1, x2, and λ equal to zero:
-2x1 + λ = 0
-2x2 + λ = 0
x1 + x2 - 1 = 0
Solve for x1 and x2. How?

The first two equations give x1 = x2, and the constraint then gives x1* = x2* = ½.
[Figure: contours of f(x1, x2) with the constraint line]
In this case it is not necessary to find λ; λ is sometimes called an "undetermined multiplier".

Application of the characteristic polynomial: calculate the eigenvalues of A from det(A - λI) = 0. Here (3 - λ)(3 - λ) - 1 = λ² - 6λ + 8 = 0, and the quadratic formula gives λ1 = 4 and λ2 = 2. This is not a practical way to calculate the eigenvalues of S.
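
A quick numerical check (the 2×2 matrix in the slide's figure is not in the transcript; A = [[3, 1], [1, 3]] below is an assumption consistent with the characteristic polynomial (3 - λ)² - 1):

```python
# The practical way: ask LAPACK for the eigenvalues instead of solving the characteristic polynomial.
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 3.0]])         # assumed symmetric matrix; det(A - lam*I) = lam^2 - 6*lam + 8
print(np.linalg.eigvalsh(A))       # -> [2. 4.]
```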

In PCA, don’t confuse eigenvalues and principal components Are these eigenvalues principal components? 73.6809 18.7491 2.8856 1.9068 0.7278 0.5444 0.4238 0.3501 0.1631 d = 9 k = 1,2,…

Data projected onto the 1st and 2nd principal components. How was this figure constructed?
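
One way such a figure is typically constructed (a sketch with made-up data, not the original code): center the data, take the two eigenvectors of the covariance matrix with the largest eigenvalues, and project every example onto them.

```python
# Project data onto the first two principal components: z = W2^T (x - mean).
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0, 0],
                            [[9, 4, 2], [4, 6, 1], [2, 1, 1]], size=300)   # made-up 3-D data

Xc = X - X.mean(axis=0)                    # center
S = np.cov(Xc, rowvar=False)               # covariance matrix
lam, W = np.linalg.eigh(S)                 # eigenvalues ascending; columns of W are eigenvectors
W2 = W[:, ::-1][:, :2]                     # the two eigenvectors with the largest eigenvalues
Z = Xc @ W2                                # N x 2: the (z1, z2) coordinates shown in the scatter plot
print(Z[:5])
```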

Principal Components Analysis (PCA)
If w is a unit vector, then z = wᵀx is the projection of x in the direction of w. Note that z = wᵀx = xᵀw = w1x1 + w2x2 + … is a scalar.
Use projections to find a low-dimensional feature space where the essential information in the data is preserved. Accomplish this by finding features z such that Var(z) is maximal (i.e. spread the data out).

Method to select chromosomes for refinement
Calculate the fitness f(xi) for each chromosome in the population.
Assign each chromosome a discrete probability pi = f(xi) / Σj f(xj).
Use the pi to design a roulette wheel.
How do we spin the wheel?

Spinning the roulette wheel
Divide the number line between 0 and 1 into segments of length pi, in a specified order.
Draw r, a random number uniformly distributed between 0 and 1.
Choose the chromosome whose segment contains r.
Decisions about crossover and mutation are made similarly; e.g. crossover probability = 0.75, mutation probability = 0.002.
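
A minimal sketch of the wheel spin described above (the fitness values are made up; np.searchsorted finds which segment of the cumulative line contains each r):

```python
# Roulette-wheel (fitness-proportionate) selection.
import numpy as np

rng = np.random.default_rng(0)
fitness = np.array([4.0, 1.0, 2.5, 0.5])        # made-up fitnesses for four chromosomes
p = fitness / fitness.sum()                      # p_i: segment lengths on the line [0, 1]
edges = np.cumsum(p)                             # right edges of the segments

r = rng.random(10_000)                           # many spins: uniform random numbers in [0, 1)
picks = np.searchsorted(edges, r)                # index of the segment containing each r
print(np.bincount(picks, minlength=4) / len(r))  # selection frequencies approach p

# The same trick with a single probability decides crossover and mutation,
# e.g. apply crossover when rng.random() < 0.75 and mutate when rng.random() < 0.002.
```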

Sigma scaling allows variable selection pressure
Sigma scaling rescales the fitness f(x) using μ and σ, the mean and standard deviation of the fitness in the population.
In early generations the selection pressure should be low, to enable wider coverage of the search space (large σ).
In later generations the selection pressure should be higher, to encourage convergence to the optimum solution (small σ).
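
The scaling formula itself is not in the transcript; one common form (e.g. the one given by Mitchell) is expected value = 1 + (f(x) - μ) / (2σ), which is what this sketch assumes:

```python
# Sigma scaling (one common variant), floored at a small positive value.
import numpy as np

def sigma_scale(fitness, floor=0.1):
    mu, sigma = fitness.mean(), fitness.std()
    if sigma == 0:                                    # uniform population: no selection pressure
        return np.ones_like(fitness)
    return np.maximum(floor, 1 + (fitness - mu) / (2 * sigma))

f = np.array([10.0, 11.0, 12.0, 30.0])                # made-up fitnesses with one outlier
scaled = sigma_scale(f)
print(scaled / scaled.sum())                          # selection probabilities after scaling
print(f / f.sum())                                    # raw fitness-proportionate probabilities
# When sigma is large the scaled values are compressed toward 1 (low selection pressure);
# when sigma is small, small fitness differences are amplified (higher selection pressure).
```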