Hierarchical Mixture of Experts Presented by Qi An Machine learning reading group Duke University 07/15/2005
Outline Background Hierarchical tree structure Gating networks Expert networks E-M algorithm Experimental results Conclusions
Background The idea of mixture of experts First presented by Jacobs and Hinton in 1988 Hierarchical mixture of experts Proposed by Jordan and Jacobs in 1994 Difference from previous mixture models Mixing weights depend on both the input and the output
Example (ME)
One-layer structure [Figure: a one-level mixture of experts. Expert networks map the input x to outputs μ_1, μ_2, μ_3; a gating network with an ellipsoidal gating function produces weights g_1, g_2, g_3 that blend the expert outputs into the overall output μ.]
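A minimal sketch (not from the slides) of the one-level mixture-of-experts forward pass, assuming linear experts and a softmax (linear) gating function rather than the ellipsoidal one pictured; all function and variable names are illustrative.

```python
import numpy as np

def me_forward(x, expert_weights, gating_weights):
    """One-level mixture of experts: output mu = sum_i g_i * mu_i.

    x: input vector of shape (d,)
    expert_weights: list of (output_dim, d) matrices U_i, one per expert
    gating_weights: (n_experts, d) matrix whose rows score each expert
    """
    # Each expert produces its own prediction mu_i = U_i x (identity link).
    mus = np.stack([U @ x for U in expert_weights])   # (n_experts, output_dim)

    # Softmax gating: g_i = exp(v_i . x) / sum_k exp(v_k . x).
    scores = gating_weights @ x
    g = np.exp(scores - scores.max())
    g /= g.sum()

    # Blend the expert outputs with the gating weights.
    mu = g @ mus                                      # (output_dim,)
    return mu, g, mus
```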
Example (HME)
Hierarchical tree structure [Figure: a two-level tree with gating networks at the nonterminal nodes (linear gating function) and expert networks at the leaves.]
Expert network At the leaves of the tree. For each expert (i,j):
linear predictor: ξ_ij = U_ij x
output of the expert: μ_ij = f(ξ_ij), where f is the link function
For example: the logistic function for binary classification (the identity for regression)
Gating network At the nonterminal nodes of the tree:
top layer: g_i = exp(ξ_i) / Σ_k exp(ξ_k), with ξ_i = v_i^T x
other layers: g_j|i = exp(ξ_ij) / Σ_k exp(ξ_ik), with ξ_ij = v_ij^T x
Output At the non-leaf nodes:
top node: μ = Σ_i g_i μ_i
other nodes: μ_i = Σ_j g_j|i μ_ij
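A rough sketch of how these outputs propagate up a two-level tree, assuming softmax gating at both levels; the names and shapes are illustrative, not taken from the slides.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max())
    return e / e.sum()

def hme_forward(x, expert_U, gate_top_V, gate_low_V):
    """Two-level HME: mu = sum_i g_i * sum_j g_{j|i} * mu_ij.

    expert_U[i][j]: weight matrix of expert (i, j)
    gate_top_V:     (n_branches, d) weights of the top gating network
    gate_low_V[i]:  (n_experts_i, d) weights of the gating network at node i
    """
    g_top = softmax(gate_top_V @ x)                    # g_i
    mu = 0.0
    for i, U_row in enumerate(expert_U):
        g_low = softmax(gate_low_V[i] @ x)             # g_{j|i}
        mu_i = sum(g_low[j] * (U_row[j] @ x) for j in range(len(U_row)))
        mu = mu + g_top[i] * mu_i                      # blend at the top node
    return mu
```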
Probability model For each expert, assume the true output y is drawn from a distribution P_ij(y) = P(y | x, θ_ij) with mean μ_ij. Therefore, the total probability of generating y from x is the mixture
P(y | x, θ) = Σ_i g_i Σ_j g_j|i P_ij(y)
Posterior probabilities Since g_i and g_j|i are computed based only on the input x, we refer to them as prior probabilities. With knowledge of both the input x and the output y, we can define posterior probabilities using Bayes' rule:
h_i = g_i Σ_j g_j|i P_ij(y) / Σ_k g_k Σ_l g_l|k P_kl(y)
h_j|i = g_j|i P_ij(y) / Σ_l g_l|i P_il(y)
and the joint posterior for expert (i,j) is h_ij = h_i h_j|i
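A hedged sketch of the posterior computation for a single input/output pair. The unit-variance Gaussian expert densities and the rectangular (same number of experts per branch) layout are assumptions for illustration, not statements from the slides.

```python
import numpy as np

def posteriors(y, mus, g_top, g_low, sigma=1.0):
    """Posterior responsibilities h_i and h_{j|i} via Bayes' rule.

    mus[i][j]: scalar prediction of expert (i, j) for this input
    g_top[i]:  prior gate g_i,  g_low[i][j]: prior gate g_{j|i}
    Assumes P_ij(y) is Gaussian with mean mu_ij and fixed variance sigma^2.
    """
    # Likelihood of y under each expert.
    p = np.array([[np.exp(-0.5 * ((y - mus[i][j]) / sigma) ** 2)
                   for j in range(len(mus[i]))] for i in range(len(mus))])

    # Joint weight of branch i: g_i * sum_j g_{j|i} * P_ij(y).
    branch = np.array([g_top[i] * np.dot(g_low[i], p[i]) for i in range(len(mus))])

    h_top = branch / branch.sum()                       # h_i
    h_low = [g_low[i] * p[i] / np.dot(g_low[i], p[i])   # h_{j|i}
             for i in range(len(mus))]
    return h_top, h_low
```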
E-M algorithm Introduce auxiliary indicator variables z_ij, which have an interpretation as labels that identify which expert generated each data point. With knowledge of the auxiliary variables, the probability model simplifies: the coupled mixture problem decouples into separate fitting problems for each expert and each gating network.
E-M algorithm Complete-data log likelihood:
l_c(θ) = Σ_t Σ_i Σ_j z_ij^(t) [ ln g_i^(t) + ln g_j|i^(t) + ln P_ij(y^(t)) ]
The E-step replaces each z_ij^(t) by its expectation, the posterior probability h_ij^(t), giving
Q(θ, θ^(k)) = Σ_t Σ_i Σ_j h_ij^(t) [ ln g_i^(t) + ln g_j|i^(t) + ln P_ij(y^(t)) ]
E-M algorithm The M-step maximizes Q and decouples into separate weighted maximum-likelihood problems:
experts: U_ij ← argmax Σ_t h_ij^(t) ln P_ij(y^(t))
top gating network: v_i ← argmax Σ_t Σ_k h_k^(t) ln g_k^(t)
lower gating networks: v_ij ← argmax Σ_t Σ_k h_i^(t) h_k|i^(t) ln g_k|i^(t)
The gating-network problems are weighted GLM fits, solved with IRLS (see below); a sketch of the expert update follows.
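An illustrative sketch of the expert-network M-step under the assumption of Gaussian (least-squares) experts: each expert solves a least-squares problem in which every data point is weighted by its posterior h_ij. The function name and shapes are assumptions.

```python
import numpy as np

def m_step_expert(X, Y, h_ij):
    """Weighted least squares for one expert (i, j).

    X: (N, d) inputs, Y: (N,) targets, h_ij: (N,) posterior weights.
    Solves argmin_u sum_t h_ij[t] * (Y[t] - u . X[t])^2.
    """
    W = np.diag(h_ij)
    # Normal equations of the weighted problem: (X^T W X) u = X^T W Y.
    u = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)
    return u
```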
IRLS Iteratively reweighted least squares algorithm An iterative algorithm for computing the maximum likelihood estimates of the parameters of a generalized linear model A special case of the Fisher scoring method
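A minimal IRLS sketch for a binary logistic model, the GLM case the slide mentions. The optional sample_weight argument is an assumption, included only to show how the posterior weights from the E-step can be folded into the gating-network fits.

```python
import numpy as np

def irls_logistic(X, y, sample_weight=None, n_iter=20):
    """IRLS / Fisher scoring for logistic regression.

    X: (N, d) inputs, y: (N,) targets in {0, 1},
    sample_weight: optional (N,) weights (e.g. posteriors from the E-step).
    """
    N, d = X.shape
    w = np.zeros(d)
    s = np.ones(N) if sample_weight is None else sample_weight
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ w))                # current predictions
        r = s * p * (1 - p)                             # IRLS working weights
        grad = X.T @ (s * (y - p))                      # weighted score vector
        H = X.T @ (r[:, None] * X) + 1e-8 * np.eye(d)   # Fisher information
        w = w + np.linalg.solve(H, grad)                # Fisher-scoring step
    return w
```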
Algorithm E-step: compute the posterior probabilities h_i, h_j|i, and h_ij for every data point using the current parameter values. M-step: solve the weighted maximum-likelihood problem for each expert network and the weighted IRLS problem for each gating network; repeat until convergence.
Online algorithm This algorithm can be used for on-line regression. For the expert networks, the parameters U_ij are updated with a recursive least squares rule weighted by the posterior h_ij, where R_ij is the inverse covariance matrix for EN(i,j).
Online algorithm For the gating networks, the parameters are updated with analogous recursive rules, where S_i is the inverse covariance matrix for the top-level gating network and S_ij is the inverse covariance matrix for the gating network at node (i,j).
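A hedged sketch of an on-line update for one expert, following the standard weighted recursive-least-squares recursion with a forgetting factor; this is an illustration of the general idea, not necessarily the exact formula from the slides, and all names are assumptions.

```python
import numpy as np

def online_expert_update(u, P, x, y, h, lam=0.99):
    """One weighted recursive-least-squares step for an expert network.

    u:   current parameter vector of the expert
    P:   current inverse-covariance matrix (cf. R_ij on the slide)
    x,y: new input/target pair, h: posterior weight of this expert
    lam: forgetting factor slightly below 1
    """
    Px = P @ x
    K = (h * Px) / (lam + h * (x @ Px))   # gain vector
    u = u + K * (y - u @ x)               # correct towards the new target
    P = (P - np.outer(K, Px)) / lam       # update inverse covariance
    return u, P
```

The gating-network parameters can be updated with analogous recursions, with S_i and S_ij playing the role of P.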
Results Simulated data of a four-joint robot arm moving in three-dimensional space
Results
Conclusions Introduces a tree-structured architecture for supervised learning Much faster than the traditional back-propagation algorithm Can be used for on-line learning
Thank you Questions?