
1 Review for final exam 2015 Fundamentals of ANN RBF-ANN using clustering Bayesian decision theory Genetic algorithm SOM SVM

2 Fundamentals of ANN

3 Boolean AND: linearly separable 2D binary classification problem (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)). Data table shown on the slide; x1 + x2 = 1.5 is an acceptable linear discriminant.

4 Linear discriminant w^T x = 0 (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)).

x1 | x2 | r | required          | choice
0  | 0  | 0 | w0 < 0            | w0 = -1.5
0  | 1  | 0 | w2 + w0 < 0       | w1 = 1
1  | 0  | 0 | w1 + w0 < 0       | w2 = 1
1  | 1  | 1 | w1 + w2 + w0 > 0  |

Output is sigmoid(w^T x); if w^T x > 0 → r = 1. Derive the linear discriminant x1 + x2 - 1.5 = 0.
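
As a quick check (my own sketch, not part of the slides), the weight choices in the table can be verified on all four AND cases:

```python
import numpy as np

# Check that w1 = w2 = 1, w0 = -1.5 implements Boolean AND.
w = np.array([1.0, 1.0])   # w1, w2 from the table above
w0 = -1.5                  # bias chosen so that only (1, 1) is positive

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    g = w @ np.array([x1, x2]) + w0   # linear discriminant w^T x + w0
    r = 1 if g > 0 else 0             # threshold at zero
    print(x1, x2, "->", r)            # prints 0, 0, 0, 1: Boolean AND
```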

5 Weight-update rule: perceptron regression (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)). The contribution to the sum of squared residuals from a single example is E^t = (r^t - y^t)^2 / 2. w_j is the j-th component of the weight vector w connecting attribute vector x to the scalar output y. E^t depends on w_j through y^t = w^T x^t; hence use the chain rule.

6 The weight-update formula is called "stochastic gradient descent". The proportionality constant η is called the "learning rate". Since Δw_j is proportional to x_j, all attributes should be roughly the same size; normalization to achieve this may be helpful.
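
A minimal sketch of this stochastic update for linear regression (my own code and learning-rate value, not from the slides):

```python
import numpy as np

def sgd_epoch(X, r, w, eta=0.01):
    """One epoch of stochastic gradient descent for y = w^T x.

    Assumes X already includes a bias column and (ideally) normalized
    attributes, since each update is proportional to x_j.
    """
    for x_t, r_t in zip(X, r):
        y_t = w @ x_t                 # prediction for this example
        w += eta * (r_t - y_t) * x_t  # delta rule: gradient of (r - y)^2 / 2
    return w
```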

7 Review: weight-update rules for nonlinear regression, forward and backward passes (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)). Each pass through all training data is called an "epoch". Given weights w_h and v: the hidden layer is transformed by a sigmoid; H weight vectors connect the input to the hidden layer, and one weight vector connects the hidden layer to the output.
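
One way these forward/backward rules might look as code, as a sketch under my own naming (W for the H hidden weight vectors, v for the output weight vector, eta made up) with bias terms omitted for brevity:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_update(x, r, W, v, eta=0.1):
    """One stochastic update for a 1-hidden-layer regression MLP.

    W : (H, d) weight vectors connecting the input to the H hidden units
    v : (H,)   weight vector connecting the hidden layer to the scalar output
    """
    # Forward pass
    z = sigmoid(W @ x)          # hidden-layer activations
    y = v @ z                   # linear output

    # Backward pass: squared error (r - y)^2 / 2, chain rule through y and z
    delta = r - y
    dv = eta * delta * z                              # output-layer update
    dW = eta * delta * np.outer(v * z * (1 - z), x)   # hidden-layer update
    v += dv
    W += dW
    return W, v
```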

8 RBF-ANN and clustering

9 How is the output related to the hidden layer? How is the input related to the hidden layer in a Radial Basis Function (RBF) network?

10 How is the output related to the hidden layer? How is the input related to the hidden layer in a Radial Basis Function (RBF) network? The input data are represented as K Gaussian clusters; the output is obtained by linear least squares with the Gaussian features.

11 Gaussians φ_j(x) = exp(-½ (|x - μ_j| / σ_j)^2), j = 1, 2. Set up the linear system of equations that determines the weights connecting the hidden layer to the output. Clustering the data set with N = 5 by K-means with K = 2 produced the cluster centers shown on the slide.

12 Solve the normal equations D^T D w = D^T r for the vector w. What are the dimensions of this linear system of equations?
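
A sketch of how D could be assembled and the normal equations solved; the toy data, centers, and spreads are made up, and whether D carries a bias column is my assumption:

```python
import numpy as np

def rbf_design_matrix(X, centers, sigmas):
    """N x (K+1) design matrix: K Gaussian features plus a bias column.
    (Whether the course's D includes a bias column is my assumption.)"""
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)  # (N, K) distances
    Phi = np.exp(-0.5 * (d / sigmas) ** 2)                           # Gaussian features
    return np.hstack([Phi, np.ones((len(X), 1))])                    # append bias

# Toy data: N = 5 points in 1D, K = 2 made-up centers and spreads
X = np.array([[0.0], [0.2], [0.9], [1.1], [1.0]])
r = np.array([0.0, 0.1, 1.0, 1.2, 1.1])
centers = np.array([[0.1], [1.0]])
sigmas = np.array([0.3, 0.3])

D = rbf_design_matrix(X, centers, sigmas)   # D is 5 x 3 here
w = np.linalg.solve(D.T @ D, D.T @ r)       # normal equations D^T D w = D^T r
print(w)
```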

13 K-means has converged. m_i can be used for μ_i in the Gaussian basis functions. How do we get σ_i?

14 Given converged centers, a common variance for all K Gaussians can be calculated by σ^2 = d_max^2 / (2K), where d_max is the largest distance between clusters. What if I want the Gaussian RBFs to have different variances?

15 Given converged centers, a common variance for all K Gaussians can be calculated by σ^2 = d_max^2 / (2K), where d_max is the largest distance between clusters. What if I want the Gaussian RBFs to have different variances? σ_i^2 = average of ||x^t - m_i||^2 over the instances in cluster i. How does this approximation differ from the application of Gaussian mixtures?

16 Given converged centers, a common variance for all K Gaussians can be calculated by σ^2 = d_max^2 / (2K), where d_max is the largest distance between clusters. What if I want the Gaussian RBFs to have different variances? σ_i^2 = average of ||x^t - m_i||^2 over the instances in cluster i. How does this approximation differ from the application of Gaussian mixtures? In a Gaussian mixture, h_i^t is the probability that instance x^t belongs to cluster i.
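
A small sketch of the hard (K-means) version of this variance estimate; the function name and arguments are my own:

```python
import numpy as np

def cluster_variances(X, b, m):
    """sigma_i^2 = mean of ||x^t - m_i||^2 over the instances assigned to cluster i.
    b holds hard cluster labels from K-means; m holds the converged means.
    Assumes every cluster has at least one instance."""
    return np.array([np.mean(np.sum((X[b == i] - m[i]) ** 2, axis=1))
                     for i in range(len(m))])
```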

17 K-means is an example of the Expectation-Maximization (EM) approach to maximum likelihood estimation (MLE). Define the E-step and M-step of K-means. How is the iterative EM process started? (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0))

18 K-means is an example of the Expectation-Maximization (EM) approach to maximum likelihood estimation (MLE). Define the E-step and M-step of K-means. How is the iterative EM process started? (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)) Use a 2-step iterative method. E-step: estimate the labels of x^t given current knowledge of the mixture components. M-step: update the component knowledge using the labels from the E-step. To start, choose K instances at random to be the cluster means.

19 K-means clustering pseudocode with the E-step and M-step labeled (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)).
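
Since the pseudocode itself is an image on the slide, here is a minimal K-means sketch of my own with the two steps labeled, initialized by choosing K instances at random as described above:

```python
import numpy as np

def kmeans(X, K, n_iter=100, rng=np.random.default_rng(0)):
    # Initialization: choose K instances at random to be the cluster means
    m = X[rng.choice(len(X), K, replace=False)].copy()
    for _ in range(n_iter):
        # E-step: estimate labels b given the current means
        d = np.linalg.norm(X[:, None, :] - m[None, :, :], axis=2)  # (N, K) distances
        b = np.argmin(d, axis=1)                                   # hard assignments
        # M-step: update the means using the labels from the E-step
        new_m = np.array([X[b == i].mean(axis=0) if np.any(b == i) else m[i]
                          for i in range(K)])
        if np.allclose(new_m, m):   # converged
            break
        m = new_m
    return m, b
```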

20 Define single-linkage, complete-linkage, and average-linkage in agglomerative hierarchical clustering.

21 Agglomerative clustering: start with N groups, each with one instance, and merge the two closest groups at each iteration (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)). Options for the distance between groups G_i and G_j: Single-link: smallest distance between all possible pairs. Average-link: distance between centroids (the average of the inputs in each cluster, recomputed at each iteration). Complete-link: largest distance between all possible pairs.
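
For illustration (my own toy data, not the slide's figure), SciPy's agglomerative clustering exposes these linkages directly; note that SciPy's 'average' method is the mean pairwise distance, while its 'centroid' method matches the centroid-based definition given above:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points on a unit grid (made up)
X = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5]])

for method in ("single", "complete", "average", "centroid"):
    Z = linkage(X, method=method)                      # agglomerative merge tree
    labels = fcluster(Z, t=2.0, criterion="distance")  # cut at merge distance 2
    print(method, labels)
```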

22 Dendrogram, example of single-linked clusters (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)). Grid spacing is 1. h is the maximum separation between members of a cluster at a given level of agglomeration. How many clusters are present, and what is their composition, at the values of h given on the slide?

23 Bayesian decision theory

24 Review: Bayes' rule, K > 2 classes (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)). In Bayes' rule, which functions are the priors, which are the class likelihoods, and which are the posteriors? How do I use Bayes' rule to assign an instance to a class?

25 Review: Bayes' rule, K > 2 classes (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)). P(C_i) = prior; p(x|C_i) = class likelihood; P(C_i|x) = posterior. Bayes' rule gives P(C_i|x) = p(x|C_i) P(C_i) / Σ_k p(x|C_k) P(C_k); assign x to the class with the largest posterior.

26 Estimating priors and class likelihoods from data (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)). Assuming the training data are Gaussian distributed, how do we estimate the priors and the class likelihoods?

27 Estimating priors and class likelihoods from data (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)). With class labels r_i^t, the estimators are the per-class sample proportions (priors), sample means, and sample covariances (formulas shown on the slide).

28 Naïve Bayes classification (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)). What assumption is made to get a simpler form of the class likelihoods called "naïve Bayes"? What is the form of this approximation to the class likelihoods? What parameters characterize a class in this approximation?

29 Review: naïve Bayes classification (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)). Assume the components of x are independent random variables. The covariance matrix is then diagonal and p(x|C) is the product of the probabilities of each component of x. Each class is characterized by a set of means and variances for the components of the attribute vector in that class.
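
A sketch of Gaussian naïve Bayes along these lines (my own function names, not the course's code):

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Per-class priors, attribute means, and attribute variances.
    Assumes Gaussian, independent attributes (diagonal covariance)."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means  = np.array([X[y == c].mean(axis=0) for c in classes])
    vars_  = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, priors, means, vars_

def predict(x, classes, priors, means, vars_):
    # log posterior up to a constant: log prior + sum of per-attribute log Gaussians
    logp = np.log(priors) - 0.5 * np.sum(np.log(2 * np.pi * vars_)
                                         + (x - means) ** 2 / vars_, axis=1)
    return classes[np.argmax(logp)]
```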

30 Minimizing the risk of classification given attributes x (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)). Actions: α_i = assigning x to C_i out of K classes. A loss λ_ik occurs if we take action α_i when x belongs to C_k. The risk of action α_i given x is R(α_i|x) = Σ_k λ_ik P(C_k|x). Calculate R(α_i|x) for the "0/1 loss function" (correct decisions incur no loss and all errors have equal cost) when the posteriors are normalized.

31 For the 0/1 loss, R(α_i|x) = Σ_{k≠i} P(C_k|x) = 1 - P(C_i|x), so for minimum risk, choose the most probable class (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)).

32 Example of risk minimization with λ_11 = λ_22 = 0, λ_12 = 10, and λ_21 = 1 (a loss λ_ik occurs if we take α_i when x belongs to C_k). R(α_1|x) = λ_11 P(C_1|x) + λ_12 P(C_2|x) = 10 P(C_2|x). R(α_2|x) = λ_21 P(C_1|x) + λ_22 P(C_2|x) = P(C_1|x). Choose C_1 if R(α_1|x) < R(α_2|x), which is true if 10 P(C_2|x) < P(C_1|x), which becomes P(C_1|x) > 10/11 using normalization of the posteriors. The consequence of erroneously assigning an instance to C_1 is so bad that we choose C_1 only when we are virtually certain it is correct.
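
A tiny sketch of this risk calculation with the loss matrix from the example (the posterior values are made up for illustration):

```python
import numpy as np

# Loss matrix from the example: lambda_11 = lambda_22 = 0, lambda_12 = 10, lambda_21 = 1
L = np.array([[0.0, 10.0],
              [1.0,  0.0]])

def choose_action(posteriors, L):
    """Pick the action with minimum conditional risk R(a_i|x) = sum_k L[i, k] P(C_k|x)."""
    risks = L @ posteriors
    return np.argmin(risks), risks

# P(C1|x) = 0.85 < 10/11, so C2 is chosen even though C1 is more probable
print(choose_action(np.array([0.85, 0.15]), L))   # (1, array([1.5 , 0.85]))
```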

33 Genetic algorithm

34 Given the fitness of chromosomes in a population, how do we choose a pair of chromosomes to update the population by crossover?

35 Given the fitness f(x_i) of each chromosome in the population, assign each chromosome a discrete probability p_i = f(x_i) / Σ_j f(x_j). Use the p_i to design a roulette wheel: divide the number line between 0 and 1 into segments of length p_i in a specified order; get r, a random number uniformly distributed between 0 and 1; choose the chromosome of the line segment containing r. Repeat for a second chromosome.
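
A minimal sketch of that roulette-wheel procedure (my own code; the fitness values are taken from the next slide):

```python
import numpy as np

def roulette_select(fitness, rng=np.random.default_rng()):
    """Pick one chromosome index with probability p_i = f(x_i) / sum_j f(x_j)."""
    p = np.asarray(fitness) / np.sum(fitness)   # segment lengths on [0, 1]
    edges = np.cumsum(p)                        # right edges of the segments
    r = rng.random()                            # uniform random number in [0, 1)
    return int(np.searchsorted(edges, r))       # index of the segment containing r

fitness = [0.0011, 0.0216, 0.0023, 0.0001]      # values from the next slide
parents = [roulette_select(fitness) for _ in range(2)]
print(parents)
```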

36 5-bit chromosomes have the fitness values given below. Design a "roulette wheel" for random selection of chromosomes to replicate.
00100: fitness = 0.0011
01001: fitness = 0.0216
11011: fitness = 0.0023
11111: fitness = 0.0001

37 5-bit chromosomes have the fitness values given below, with Σ_i f(x_i) = 0.0251.
00100: fitness = 0.0011, p_i = 0.044
01001: fitness = 0.0216, p_i = 0.861
11011: fitness = 0.0023, p_i = 0.091
11111: fitness = 0.0001, p_i = 0.004
Assume the pair with the two largest probabilities is selected for replication by crossover at the locus between the 1st and 2nd bits. How does the population change?

38 Assume a mixing point (locus) is chosen between the first and second bit. Crossover is selected to induce change; mutation is rejected as the method to induce change.
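
A small sketch of that one-point crossover applied to the two fittest chromosomes from the previous slide (my own helper function):

```python
def one_point_crossover(a, b, locus=1):
    """Swap the tails of two bit-string chromosomes after the given locus."""
    return a[:locus] + b[locus:], b[:locus] + a[locus:]

# Parents with the two largest selection probabilities from slide 37
print(one_point_crossover("01001", "11011"))   # -> ('01011', '11001')
```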

39 Self-organizing maps

40 How many prototype vectors will be generated in the SOM application illustrated?

41 How many prototype vectors will be generated in the SOM application illustrated? One for each node of the output array.
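
As an illustration (my own minimal sketch, not the course's code), a SOM training loop that keeps one prototype vector per node of a rows x cols output array; the fixed learning rate and neighborhood width (no decay schedule) are simplifications:

```python
import numpy as np

def train_som(X, rows, cols, n_iter=1000, eta=0.5, sigma=1.0,
              rng=np.random.default_rng(0)):
    """One prototype vector per output node of a rows x cols array."""
    W = rng.random((rows, cols, X.shape[1]))                 # prototype vectors
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)     # node coordinates
    for _ in range(n_iter):
        x = X[rng.integers(len(X))]                          # random training instance
        d = np.linalg.norm(W - x, axis=2)                    # distance to every prototype
        winner = np.unravel_index(np.argmin(d), d.shape)     # best-matching unit
        # Gaussian neighborhood in grid coordinates around the winner
        h = np.exp(-np.sum((grid - np.array(winner)) ** 2, axis=2) / (2 * sigma ** 2))
        W += eta * h[..., None] * (x - W)                    # pull neighbors toward x
    return W
```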

42 Describe the following types of SOM output: elastic net, context map, semantic map, unified distance matrix (UMAT), UMAT with connectedness.

43 Describe the following types of SOM output:
Elastic net = a deformable grid connecting the positions of the prototype vectors in attribute space.
Context map = mark the output nodes with the greatest activation by test patterns.
Semantic map = label all output nodes by the test pattern that generates the greatest activation.
Unified distance matrix (UMAT) = heat map of the average difference between an output node's prototype and the prototypes of its neighbors.
UMAT with connectedness = add "stars" that connect maxima on the UMAT with one and only one output node.

44 Are the bars that illustrate the convergence of this SOM elastic nets, semantic maps, or UMATs?

45 Is this elastic net covering input space or the lattice of output nodes? What are the dimensions of the output-node array in the SOM that produced this elastic net?

46 Use the stars to draw a boundary around the cluster that contains horse and cow.

47 Support vector machines

48 Constrained optimization: review of constrained optimization by Lagrange multipliers. Find the stationary point of f(x1, x2) = 1 - x1^2 - x2^2 subject to the constraint g(x1, x2) = x1 + x2 = 1.

49 Form the Lagrangian L(x, λ) = f(x1, x2) + λ (g(x1, x2) - c): L(x, λ) = 1 - x1^2 - x2^2 + λ (x1 + x2 - 1).

50 Set the partial derivatives of L(x, λ) = 1 - x1^2 - x2^2 + λ (x1 + x2 - 1) with respect to x1, x2, and λ equal to zero: -2 x1 + λ = 0, -2 x2 + λ = 0, x1 + x2 - 1 = 0. Solve for x1 and x2.

51 In this case it is not necessary to find λ, sometimes called the "undetermined multiplier". The solution is x1* = x2* = 1/2.
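
The same stationary point can be checked symbolically; a short sketch using SymPy (my own, not from the slides):

```python
import sympy as sp

# Stationary point of f = 1 - x1^2 - x2^2 subject to x1 + x2 = 1
x1, x2, lam = sp.symbols("x1 x2 lambda")
L = 1 - x1**2 - x2**2 + lam * (x1 + x2 - 1)          # the Lagrangian from slide 49
stationary = sp.solve([sp.diff(L, v) for v in (x1, x2, lam)], (x1, x2, lam))
print(stationary)   # {x1: 1/2, x2: 1/2, lambda: 1}
```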

52 Why is maximizing margins a good strategy for classification? (Figure: separating hyperplane, margins, support vectors.) (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0))

53 For linearly separable classes, SVM achieves this goal by maximizing the distance of all points from the separating hyperplane. If the graph below represents achievement of this goal, how do I draw the margins? (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0))

54 If the distance d^t of instance x^t from a separating hyperplane is |g(x^t)| / ||w||, write this relationship in terms of x^t, w, and r^t, the label of x^t. Explain why r^t must be +1. (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0))

55 The distance of x^t to the separating hyperplane is r^t (w^T x^t + w0) / ||w||. The slide shows the weights that define a separating hyperplane that has maximum distance from all instances in the training set. Use the definition of margins for binary classification to explain why this separating hyperplane has maximum margins. Why are the margins the same width on both sides of the separating hyperplane? (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0))

56 Duality in constrained optimization. Primal variables: variables, like the weights, that we want to optimize. Dual variables: coefficients of the constraints added to the original objective function to achieve constrained optimization of the primal variables; also called "Lagrange multipliers". In L_p, which variables are primal and which are dual? How is L_p changed to L_d? L_d has dual variables only.

57 The active set in constrained optimization. What is the "active set" in constrained optimization? How does maximization of the dual make the constraints on the support vectors the active set?

58 Maximize L_d (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)). Set α^t = 0 for data points sufficiently far from the discriminant to be ignored in the search for the hyperplane with maximum margins. Find the remaining α^t > 0 by quadratic programming. Given the α^t > 0 that maximize L_d, calculate w = Σ_t α^t r^t x^t. This is an iterative process. Suggest a way to obtain an initial guess of which instances have attributes that are near the separating hyperplane.

59 Perceptron learning algorithm (PLA). (Figure: regions where sign(w^T x) is negative and where it is positive.) Each iteration pulls the discriminant in a direction that tends to correct the misclassified data point.
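
A minimal sketch of the PLA update loop (my own code; it assumes labels in {-1, +1} and a bias column already appended to X):

```python
import numpy as np

def perceptron(X, r, n_epochs=100):
    """Perceptron learning algorithm for labels r in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        errors = 0
        for x_t, r_t in zip(X, r):
            if r_t * np.sign(w @ x_t) <= 0:   # misclassified (or on the boundary)
                w += r_t * x_t                # pull discriminant toward correcting x_t
                errors += 1
        if errors == 0:                       # converged: all points on the correct side
            break
    return w
```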

60 Maximize L_d (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)). Set α^t = 0 for data points sufficiently far from the discriminant to be ignored in the search for the hyperplane with maximum margins. Find the remaining α^t > 0 by quadratic programming. Given the α^t > 0 that maximize L_d, calculate w = Σ_t α^t r^t x^t. What is the dimension of w? How do you find a value for w_0?

61 Given the expression for w in terms of the support vectors (slide formula), show that the resulting g(x) (slide formula) is a discriminant for the classes on the margins of the hyperplane (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)).

62 Given a binary classification problem where the number of attributes of an instance exceeds the size of the training set, which of the following methods cannot be used to find a solution?
1) Perceptron training algorithm
2) Logistic regression
3) Classification by least-squares regression
4) Classification by ANN
5) SVM
Which method is preferred?

63 In the equation for L_p below, which variables are primal and which are dual? Does L_p contain any parameters? If so, what is their purpose?

64 In SVM, constraints are placed on the dual variables when maximizing L_d, for example the constraints shown on the slide (such as 0 ≤ α^t ≤ C). What are the origins of these constraints?

65 In the equation for L_p below, which variables are primal and which are dual? Does L_p contain any parameters? If so, what is their purpose?

66 The components of the soft error Σ_t ξ^t have the form of a "hinge loss". x^t with r^t = 1 is correctly classified but inside the margins: what are the bounds on the hinge loss? x^t with r^t = 1 is misclassified: (a) what are the bounds on the hinge loss if x^t is inside the margins? (b) what are the bounds on the hinge loss if x^t is outside of the margins? Would the answers to these questions be different if r^t = -1? What would the graph look like if r^t = -1?
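
To make the cases concrete, a small sketch of the hinge loss (my own example values; it assumes the usual convention that the margins sit at g(x) = ±1):

```python
def hinge_loss(r, g):
    """Hinge loss max(0, 1 - r*g) for label r in {-1, +1} and discriminant value g."""
    return max(0.0, 1.0 - r * g)

# r = +1, correctly classified but inside the margin (0 < g < 1): loss between 0 and 1
print(hinge_loss(+1, 0.4))   # 0.6
# r = +1, misclassified inside the margin (-1 < g <= 0): loss between 1 and 2
print(hinge_loss(+1, -0.3))  # 1.3
# r = +1, misclassified outside the margin (g < -1): loss greater than 2
print(hinge_loss(+1, -2.0))  # 3.0
```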

67 Weight decay: how does the augmented error achieve weight decay, and what is its effect?

68 Based on your understanding of weight decay by augmented error, explain how the value of the regularization parameter C affects binary classification by SVM with slack variables.

69 What are the primal variables in the L_p for ν-SVM shown below? Does it contain any parameters? If so, what is their purpose? Add the Lagrange multiplier to L_p.

70 By comparison with C-SVM, where the corresponding expression is shown on the slide.

71 Kernel machines What 2 equations in feature-space SVM enabled the “kernel trick”?

72 Kernel machines (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)). What 2 equations in feature space enable the "kernel trick"? The dual and the discriminant. Explicit use of the weights, which cannot be written as a dot product of features φ(x^t)^T φ(x), is not needed.
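
A sketch of what this looks like in code (my own function names): the discriminant is evaluated from kernel values alone, without ever forming w or φ(x) explicitly:

```python
import numpy as np

def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    return np.exp(-gamma * np.sum((x - z) ** 2))

def kernel_discriminant(x, sv_X, sv_r, sv_alpha, w0, kernel=rbf_kernel):
    """g(x) = sum_t alpha^t r^t K(x^t, x) + w0, summed over the support vectors only.
    The weight vector w = sum_t alpha^t r^t phi(x^t) is never formed explicitly."""
    return sum(a * r * kernel(x_t, x)
               for a, r, x_t in zip(sv_alpha, sv_r, sv_X)) + w0
```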

73 Kernel machines Do kernel machines include regularization? If so, how?

74 Kernel machines: do kernel machines include regularization? If so, how? Kernel machines still contain the regularization parameter C through the constraint 0 < α^t < C for support vectors on the margins and α^t = C for instances inside the margins and/or misclassified. C must be optimized with a validation set for each choice of kernel.

75 One-class SVM: set up the Lagrangian. What are the primal variables? Add the constraints to the Lagrangian. What relations develop from the condition that the Lagrangian be at a minimum? Does the Lagrangian contain any parameters? If so, what is their purpose?

76 Add Lagrange multipliers α^t > 0 and γ^t > 0 for the constraints (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)). Set the derivatives with respect to the primal variables R, a, and ξ^t to 0, which gives 0 < α^t < C. Substituting back into L_p we get the dual to be maximized. Given the region center a, how do we find the optimum radius R?

77 One-class SVM: R will be determined by the instances that are support vectors on the surface of the hypersphere (Lecture Notes for E Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press (V1.0)).

78 What is wrong with these equations as a start to the development of an SVM for soft-margin hyperplanes?

79 What is wrong with these equations as a start to the development of an SVM for soft-margin hyperplanes? C is not a primal variable.

