Radial Basis Functions: Alternative to Back Propagation


1 Radial Basis Functions: Alternative to Back Propagation

2 Example of Radial Basis Function (RBF) network
Input vector of d dimensions, K radial basis functions, and a single output. This structure is commonly used for multivariate regression.

3 RBF network provides alternative to back propagation
The input vector is connected to the hidden layer by K-means clustering. The hidden layer is connected to the output by linear least squares. Gaussians are the most frequently used radial basis function: φj(x) = exp(−½ (‖x − mj‖ / sj)²). Each hidden node is associated with a cluster of input instances parameterized by a mean and variance.
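A minimal sketch of how this hidden layer might be built, assuming numpy and scikit-learn are available; the function names, data shapes, and the simple per-cluster width estimate are illustrative, not taken from the slides:

```python
import numpy as np
from sklearn.cluster import KMeans

def rbf_design_matrix(X, centers, widths):
    """N x K matrix of Gaussian activations phi_j(x) = exp(-0.5 * (||x - m_j|| / s_j)**2)."""
    # pairwise squared distances between the N inputs and the K centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # shape (N, K)
    return np.exp(-0.5 * d2 / widths ** 2)

X = np.random.rand(200, 3)                   # N=200 inputs with d=3 attributes (illustrative)
K = 5
km = KMeans(n_clusters=K, n_init=10).fit(X)  # K-means clustering of the inputs
centers = km.cluster_centers_                # cluster means m_j
# a simple per-cluster width s_j: the spread of the points assigned to cluster j
widths = np.array([max(X[km.labels_ == j].std(), 1e-6) for j in range(K)])
D = rbf_design_matrix(X, centers, widths)    # hidden-layer representation of the data
```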

4 Linear least squares with basis functions
Given the training set and the mean and variance of K clusters of input data, construct the N×K matrix D and the column vector r. Solve the normal equations DᵀD w = Dᵀr for a vector w of K weights connecting the hidden nodes to the output node.
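A brief sketch of this solve with numpy, assuming D and r have been built as on the previous slide:

```python
import numpy as np

def fit_output_weights(D, r):
    """Solve the normal equations D^T D w = D^T r for the K hidden-to-output weights."""
    return np.linalg.solve(D.T @ D, D.T @ r)

# np.linalg.lstsq(D, r, rcond=None) is the numerically safer equivalent
# when D^T D is close to singular.
```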

5 RBF networks perform best with large datasets
With large datasets, expect redundancy (i.e., multiple examples expressing the same general pattern). In an RBF network, the hidden layer is a feature-space representation of the data in which redundancy has been used to reduce noise. A validation set may be helpful to determine K, the best number of clusters of input data.

6 K-Means Clustering
Given K, find group labels using the geometric interpretation of a cluster. Define cluster centers by reference vectors mj, j = 1…K. Define group labels based on the nearest center. These are called "hard" labels; each input belongs to one and only one cluster.

7 Next K-means iteration
Get new trial centers. Get new group labels based on the nearest center. Judge convergence by the tightness of the clusters.

8 Example: Convergence
Given centers, a line separates attribute space between the 2 clusters. Starting from initial centers, the figure iterates assign → new centers → assign until convergence.

9 K-means clustering pseudo code
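The pseudocode appears only as a figure in the original slides; the following is a from-scratch sketch of the algorithm as described above (random initial reference vectors, hard assignment, mean update, convergence check), with illustrative function and parameter names:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=None):
    """Plain K-means: alternate hard assignment and mean update until the centers stop moving."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # random initial reference vectors
    for _ in range(n_iter):
        # assign each input to its nearest center (hard labels)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # recompute each center as the mean of its assigned inputs
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(K)])
        if np.allclose(new_centers, centers):   # judge convergence by center movement
            break
        centers = new_centers
    return centers, labels
```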

10 K-means is an example of the Expectation-Maximization (EM) approach
E step: estimate the labels. M step: update the means from the new labels.

11 Given converged centers, a common variance for RBFs
can be calculated by σ² = d²max / (2K), where dmax is the largest distance between cluster centers. Gaussian mixture theory is another example of Expectation-Maximization that gives a variance estimator for each cluster.
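A small sketch of this width heuristic, assuming the converged centers are stored as a K×d numpy array:

```python
import numpy as np

def shared_rbf_variance(centers):
    """Common variance for all RBFs: s^2 = d_max^2 / (2K),
    with d_max the largest distance between converged cluster centers."""
    K = len(centers)
    d2 = ((centers[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.max() / (2 * K)
```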

12 Hard labels of K-means divide attribute space into distinct parts
Soft labels are the probability that a given input belongs to a particular component of the Gaussian mixture derived by EM. Contours show the μ ± σ regions for each component of the mixture.

13 EM for Gaussian mixture with K components
Initialize by K-means. Use the centers mi and the examples in the i-th cluster to estimate covariance matrices Σi and mixture proportions πi. From these K-means results, calculate soft labels h_i^t for each example in the dataset. This is the end of the initial E step.

14 M step for Gaussian mixtures
From the soft labels of the E step, calculate new proportions, centers, and covariances (update formulas sketched below). Use these results to calculate new soft labels.
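The update formulas themselves appear as a figure in the original slides; the following is a sketch of the standard Gaussian-mixture M-step updates, assuming the soft labels are stored in an N×K array H whose rows sum to 1:

```python
import numpy as np

def m_step(X, H):
    """Gaussian-mixture M step: given soft labels H (N x K),
    recompute mixture proportions, means, and covariances."""
    N, d = X.shape
    Nk = H.sum(axis=0)                         # effective number of points per component
    pi = Nk / N                                # new mixture proportions P(G_i)
    mu = (H.T @ X) / Nk[:, None]               # new centers (weighted means)
    cov = []
    for i in range(H.shape[1]):
        Xc = X - mu[i]
        cov.append((H[:, i, None] * Xc).T @ Xc / Nk[i])   # weighted covariance of component i
    return pi, mu, np.array(cov)
```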

15 Application of Gaussian mixtures
In the figure, × marks each cluster mean; data points are color-coded by their greater soft label; contours show μ ± σ of the Gaussian densities; the dashed contour (h1 = 0.5) is the "separating" curve; a few points are flagged "Outliers?".

16 How can we use hard and soft labels to make K-means and Gaussian mixtures robust to outliers?

17 Set b_i^t = 0 if the closest distance to a center exceeds a limit
Ignore data whose soft label is less than a threshold.
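One possible way to code these two ideas, assuming numpy arrays for the inputs, centers, and soft labels; the distance limit and the threshold are left to the user:

```python
import numpy as np

def robust_hard_labels(X, centers, dist_limit):
    """Hard labels, but an input whose nearest center is farther than
    dist_limit gets no label (treated as an outlier)."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    labels[np.sqrt(d2.min(axis=1)) > dist_limit] = -1   # -1 marks "no cluster"
    return labels

def robust_soft_labels(H, threshold):
    """Zero out soft labels below threshold so outliers do not pull the
    M-step estimates; renormalize the remaining responsibilities."""
    H = np.where(H < threshold, 0.0, H)
    sums = H.sum(axis=1, keepdims=True)
    return np.divide(H, sums, out=np.zeros_like(H), where=sums > 0)
```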

18 In applications of Gaussian mixtures to RBFs,
correlation of attributes is ignored and the diagonal elements of the covariance matrix are equal. In this approximation, the Mahalanobis distance reduces to the Euclidean distance, and the variance parameter of each radial basis function becomes a scalar.


22 Hierarchical Clustering
Clustering is based on similarities (distances). The Minkowski distance between input vectors x^r and x^s is d(x^r, x^s) = (Σj |x^r_j − x^s_j|^p)^(1/p); p = 2 gives the Euclidean distance and p = 1 gives the city-block distance.
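A short illustration of the Minkowski and city-block distances in numpy:

```python
import numpy as np

def minkowski(xr, xs, p=2):
    """Minkowski distance between two input vectors; p=2 is Euclidean,
    p=1 is the city-block (Manhattan) distance."""
    return (np.abs(xr - xs) ** p).sum() ** (1.0 / p)

xr, xs = np.array([0.0, 0.0]), np.array([3.0, 4.0])
print(minkowski(xr, xs, p=2))   # 5.0  (Euclidean)
print(minkowski(xr, xs, p=1))   # 7.0  (city-block)
```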

23 Agglomerative Clustering
Start with N groups, each with one instance, and merge the two closest groups at each iteration. Options for the distance between groups Gi and Gj: single-link, the smallest distance between all possible pairs; complete-link, the largest distance between all possible pairs; average-link, the distance between centroids (the average of the inputs in each cluster at each iteration).
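These linkage options are available in SciPy's hierarchical-clustering module; a brief sketch on illustrative data (note that SciPy's 'average' method averages pairwise distances, while 'centroid' uses the distance between centroids described above):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(20, 2)                  # 20 instances with 2 attributes (illustrative)

# each method corresponds to one of the group-distance options above
Z_single   = linkage(X, method='single')   # smallest pairwise distance
Z_complete = linkage(X, method='complete') # largest pairwise distance
Z_centroid = linkage(X, method='centroid') # distance between group centroids

# cut the single-link tree at a chosen height h to get flat cluster labels
labels = fcluster(Z_single, t=0.3, criterion='distance')
```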

24 Example: single-linked clusters
At √2 < h < 2, the dendrogram has the 3 clusters shown on the data graph. At h > 2 the dendrogram shows 2 clusters; c, d, and f form one cluster at this distance.

25 How are K-means and hierarchical clustering with
average linkage similar and different? Both measure similarity by looking at the average of the instances that fall in a cluster. K-means has the same number of clusters at every iteration, while hierarchical clustering has clusters at different resolutions (starting with one cluster per instance in the dataset and ending with only one).



28 Introduction
Supervised learning: mapping input to output. Unsupervised learning: find regularities in the input; we assume the regularities reflect some p(x), and discovering p(x) is called "density estimation". The parametric method assumes p(x|θ) and uses MLE. In clustering we look for regularities as group membership and assume we know the number of clusters, K. Given K and a sample X, we want to find the size of each group, P(Gi), and the component densities p(x|Gi).

29 Gaussian Mixture Densities
Component densities are Gaussian: p(x|Gi) ~ N_d(μi, Σi), with parameters Φ = {P(Gi), μi, Σi}, i = 1…K. The unlabeled sample X = {x^t}_t is a collection of vectors specifying the values of d attributes; X is made up of K groups (clusters). The parameter set Φ contains P(Gi), the proportion of X in group i; μi, the mean of the x^t in group i; and Σi, the covariance matrix of the x^t in group i.

30 K-means is an example of the Expectation-Maximization (EM) approach to MLE
The log likelihood of the mixture model cannot be solved analytically for Φ, so a 2-step iterative method is used. E-step: estimate the labels of each x^t given current knowledge of the mixture components. M-step: update the component knowledge using the labels from the E-step.

31 Φ = {P(Gi), μi, Σi}, i = 1…K, where p(x|Gi) ~ N(μi, Σi)
Given a group label r^t for each data point, MLE provides estimates of the parameters of the Gaussian mixture. Estimators: P(Gi) = Σt r_i^t / N; μi = Σt r_i^t x^t / Σt r_i^t; Σi = Σt r_i^t (x^t − μi)(x^t − μi)ᵀ / Σt r_i^t.

32 Example of pseudo code application

33 Mixture model is example of EM applied to MLE using hidden variables
Analytical optimization of the log likelihood is not possible. Assume hidden variables z which, when known, make optimization much simpler. The complete likelihood, Lc(Φ|X,Z), is in terms of x and z; the incomplete likelihood, L(Φ|X), is in terms of x, the observed variables. In mixtures (clustering), the hidden variables are the sources of the observations (which observation belongs to which component).

34 General E- and M-steps
E-step: estimate z given X and the current Φ, forming Q(Φ'|Φ) = E[log Lc(Φ'|X,Z) | X, Φ], the expectation of the complete log likelihood as a function of the mixture variables parameterized by their values in the previous iteration. M-step: find the new Φ' that maximizes this function given z, X, and the old Φ; the maximizing values become the mixture variables in the next iteration. Dempster, Laird, and Rubin (1977) showed that an increase in Q increases the incomplete log likelihood.

35 EM in Gaussian Mixtures
Define indicator variables z_i^t = 1 if x^t belongs to Gi and 0 otherwise (the analogue of the labels r_i^t of supervised learning), and assume p(x|Gi) ~ N(μi, Σi). The indicator random variable method (text p151) gives a general solution to the E-step, which has the form of Bayes' rule for K classes. The posteriors are "soft" labels expressing the probability that x^t belongs to cluster i in the current iteration; the M-step then uses these estimated labels in place of the unknown labels. Soft labels effectively increase the size of the training set because every instance has some probability of belonging to every cluster.
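A sketch of this E-step as Bayes' rule, assuming SciPy for the Gaussian densities; pi, mu, and cov hold the current mixture proportions, means, and covariance matrices (it pairs with the M-step sketch shown earlier):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pi, mu, cov):
    """E step: soft labels h_i^t = P(G_i | x^t) by Bayes' rule over the K components."""
    K = len(pi)
    # component likelihoods p(x^t | G_i) weighted by the priors P(G_i)
    joint = np.column_stack([pi[i] * multivariate_normal.pdf(X, mean=mu[i], cov=cov[i])
                             for i in range(K)])          # shape (N, K)
    return joint / joint.sum(axis=1, keepdims=True)       # normalize to posteriors
```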

36 EM in Gaussian Mixtures 2
If we assume that the cluster densities p(x^t|Gi, Φ^l) are Gaussian, then the mixture proportions, means, and variances are estimated from the soft labels h_i^t of the E-step. These are the same formulas as in multivariate Gaussian classification, with h_i^t replacing the exemplar labels r_i^t of supervised learning (text p94).

37 Supervised Learning After Clustering
Clustering methods find similarities between instances and group similar instances. An application expert can name the clusters and define their attributes. This allows knowledge extraction through the number of clusters, the prior probabilities, and the cluster parameters, i.e., center and range of features. Example: customer segmentation.

38 Clustering as Preprocessing
Estimated group labels hj (soft) or bj (hard) may be seen as the dimensions of a new K-dimensional space, where we can then learn our discriminant or regressor. Labeling data by an application expert is costly: use a large amount of unlabeled data to get the cluster parameters, and the application expert then works on a much lower-dimensional supervised learning problem. First learn what usually happens, then learn what it means.
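A hedged sketch of this preprocessing idea with scikit-learn; the dataset shapes, the number of components, and the downstream classifier are all illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LogisticRegression

# hypothetical data: many unlabeled rows, few expert-labeled ones
X_unlabeled = np.random.rand(1000, 10)
X_labeled   = np.random.rand(50, 10)
y_labeled   = np.random.randint(0, 2, size=50)

# learn cluster parameters from the cheap unlabeled data
gmm = GaussianMixture(n_components=8).fit(X_unlabeled)

# soft labels h_j become a new 8-dimensional representation
H_labeled = gmm.predict_proba(X_labeled)

# the expensive labeled data is then used in the lower-dimensional space
clf = LogisticRegression().fit(H_labeled, y_labeled)
```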

39 Choosing K (how many clusters)
K may be defined by the application, e.g., color quantization: 24 bits/pixel → 8 bits/pixel = 256 clusters (text pgs ). Plot the data (after PCA) and check for clusters. Incremental (leader-cluster) algorithm: add one cluster at a time until an "elbow" appears (in reconstruction error, log likelihood, or intergroup distances). Manually check the clusters for meaning.
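A minimal sketch of the elbow check using the reconstruction error (inertia) reported by scikit-learn's KMeans; the range of K and the data are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 4)

# reconstruction error versus K; look for the "elbow" where the
# improvement from adding one more cluster levels off
for K in range(1, 11):
    km = KMeans(n_clusters=K, n_init=10).fit(X)
    print(K, km.inertia_)
```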

