Hierarchical Mixture of Experts
Presented by Qi An
Machine learning reading group, Duke University
07/15/2005

Outline
- Background
- Hierarchical tree structure
- Gating networks
- Expert networks
- EM algorithm
- Experimental results
- Conclusions

Background
- The idea of a mixture of experts was first presented by Jacobs and Hinton in 1988.
- The hierarchical mixture of experts (HME) was proposed by Jordan and Jacobs in 1994.
- Difference from previous mixture models: the mixing weights are not fixed constants; the priors depend on the input, and the posteriors depend on both the input and the output.

Example (ME)

One-layer structure (ME): figure of a gating network and several expert networks, all fed the input x; the experts produce outputs μ_1, μ_2, μ_3 and the gating network produces mixing weights g_1, g_2, g_3 (ellipsoidal gating function).
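To make the diagram concrete, here is a minimal NumPy sketch of the one-layer forward pass, assuming linear experts and a softmax gating function in place of the ellipsoidal one pictured; all names and dimensions are illustrative.

```python
import numpy as np

def me_forward(x, expert_weights, gate_weights):
    """One-layer mixture of experts: mu = sum_i g_i(x) * mu_i(x).

    x              : (d,) input vector
    expert_weights : list of (out_dim, d) matrices U_i, one per expert
    gate_weights   : (n_experts, d) matrix whose rows are the gating vectors v_i
    """
    # Expert outputs mu_i = U_i x (identity link, i.e. regression)
    mus = np.stack([U @ x for U in expert_weights])      # (n_experts, out_dim)

    # Gating: softmax over linear scores xi_i = v_i^T x
    scores = gate_weights @ x                             # (n_experts,)
    g = np.exp(scores - scores.max())
    g /= g.sum()

    # Blended output
    mu = (g[:, None] * mus).sum(axis=0)
    return mu, mus, g

# Toy usage with random parameters
rng = np.random.default_rng(0)
x = rng.normal(size=4)
experts = [rng.normal(size=(1, 4)) for _ in range(3)]
V = rng.normal(size=(3, 4))
mu, mus, g = me_forward(x, experts, V)
```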

Example (HME)

Hierarchical tree structure: figure of a two-level tree with gating networks at the internal nodes (linear gating functions) and expert networks at the leaves.

Expert network
- Sits at the leaves of the tree.
- Each expert forms a linear predictor from the input; the output of the expert is the linear predictor passed through a link function.
- For example: the logistic function for binary classification.
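Written out for reference (standard HME notation, as in Jordan and Jacobs, 1994), expert (i, j) forms a linear predictor and passes it through the link function f:

```latex
\mu_{ij} = f(U_{ij}\,x)
\qquad\text{e.g. for binary classification:}\qquad
\mu_{ij} = \frac{1}{1 + e^{-U_{ij} x}}
```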

Gating network
- Sits at the nonterminal nodes of the tree.
- Top layer: mixing weights g_i computed from the input x.
- Lower layers: conditional mixing weights g_{j|i}, also computed from x (formulas below).
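The gating formulas the slide points to, in the usual softmax form (top-level gate g_i and lower-level gate g_{j|i}):

```latex
\xi_i = v_i^{T} x, \qquad g_i = \frac{e^{\xi_i}}{\sum_k e^{\xi_k}}
\qquad\qquad
\xi_{ij} = v_{ij}^{T} x, \qquad g_{j|i} = \frac{e^{\xi_{ij}}}{\sum_k e^{\xi_{ik}}}
```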

Output
- Defined at the non-leaf nodes.
- Top node: the overall output blends the branch outputs, weighted by the top-level gating probabilities.
- Other internal nodes: each node blends its children's expert outputs, weighted by its own gating probabilities (formulas below).
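The blending equations, using the gating weights defined above:

```latex
\mu_i = \sum_j g_{j|i}\,\mu_{ij}
\qquad\qquad
\mu = \sum_i g_i\,\mu_i
```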

Probability model
- For each expert, assume the true output y is drawn from a distribution P with mean μ_ij.
- The total probability of generating y from x is therefore the mixture shown below.
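In symbols, with θ collecting all expert and gating parameters:

```latex
P(y \mid x, \theta) \;=\; \sum_i g_i(x, v_i) \sum_j g_{j|i}(x, v_{ij})\, P_{ij}(y \mid x, \theta_{ij})
```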

Posterior probabilities
- Since the g_{j|i} and g_i are computed from the input x alone, we refer to them as prior probabilities.
- Using Bayes' rule, we can define posterior probabilities that use knowledge of both the input x and the output y (see below).
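The posterior (responsibility) formulas obtained from Bayes' rule:

```latex
h_i = \frac{g_i \sum_j g_{j|i}\, P_{ij}(y)}{\sum_k g_k \sum_l g_{l|k}\, P_{kl}(y)},
\qquad
h_{j|i} = \frac{g_{j|i}\, P_{ij}(y)}{\sum_l g_{l|i}\, P_{il}(y)},
\qquad
h_{ij} = h_i\, h_{j|i}
```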

EM algorithm
- Introduce auxiliary indicator variables z_ij, which can be interpreted as labels identifying which expert generated each observation.
- With these auxiliary variables, the probability model simplifies (see below).
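With the indicators z_ij (exactly one of which equals 1 for each data point), the simplified complete-data probability model is:

```latex
P(y, z \mid x, \theta) \;=\; \prod_i \prod_j \bigl[\, g_i\, g_{j|i}\, P_{ij}(y \mid x) \,\bigr]^{z_{ij}}
```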

EM algorithm
- Complete-data likelihood: see below.
- The E-step: take the expectation of the complete-data log-likelihood, which amounts to replacing each z_ij with its posterior expectation h_ij.
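The complete-data log-likelihood over the training set, and the E-step quantity obtained by replacing the indicators with their posterior expectations:

```latex
l_c(\theta; \mathcal{D}) = \sum_t \sum_i \sum_j z_{ij}^{(t)}
  \Bigl[ \ln g_i^{(t)} + \ln g_{j|i}^{(t)} + \ln P_{ij}\bigl(y^{(t)}\bigr) \Bigr]
\qquad
Q(\theta, \theta^{(k)}) = \sum_t \sum_i \sum_j h_{ij}^{(t)}
  \Bigl[ \ln g_i^{(t)} + \ln g_{j|i}^{(t)} + \ln P_{ij}\bigl(y^{(t)}\bigr) \Bigr]
```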

EM algorithm
- The M-step: maximize the expected complete-data log-likelihood, which decouples into separate weighted maximum-likelihood problems for the expert networks and the gating networks (below).
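The M-step splits into three weighted maximum-likelihood problems, each solvable by IRLS:

```latex
U_{ij}^{(k+1)} = \arg\max_{U_{ij}} \sum_t h_{ij}^{(t)} \ln P_{ij}\bigl(y^{(t)}\bigr)
\qquad
v_i^{(k+1)} = \arg\max_{v_i} \sum_t \sum_r h_r^{(t)} \ln g_r^{(t)}
\qquad
v_{ij}^{(k+1)} = \arg\max_{v_{ij}} \sum_t h_i^{(t)} \sum_l h_{l|i}^{(t)} \ln g_{l|i}^{(t)}
```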

IRLS
- Iteratively reweighted least squares algorithm.
- An iterative algorithm for computing the maximum-likelihood estimates of the parameters of a generalized linear model.
- A special case of the Fisher scoring method.
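A minimal NumPy sketch of IRLS for a weighted logistic regression of the kind the M-step produces (the per-sample weights playing the role of the posteriors h); the function name, the ridge term, and the fixed iteration count are illustrative choices, not details from the original algorithm.

```python
import numpy as np

def irls_logistic(X, y, sample_weight, n_iter=20, ridge=1e-6):
    """Weighted logistic regression by iteratively reweighted least squares.

    X : (n, d) design matrix, y : (n,) labels in {0, 1},
    sample_weight : (n,) nonnegative weights (e.g. EM posteriors h).
    """
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))              # current predictions
        W = sample_weight * p * (1.0 - p)                # IRLS weights
        grad = X.T @ (sample_weight * (y - p))           # weighted score
        H = X.T @ (W[:, None] * X) + ridge * np.eye(d)   # weighted Fisher information
        beta += np.linalg.solve(H, grad)                 # Newton / Fisher-scoring step
    return beta
```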

Algorithm: alternate the E-step (compute the posteriors) and the M-step (re-fit the expert and gating networks) until convergence.
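Putting the two steps together for the simpler one-level case, here is an illustrative EM loop for a mixture of linear-Gaussian experts with softmax gating; the closed-form weighted least-squares expert update and the gradient-based gating update are simplifications, not the paper's exact IRLS-based procedure.

```python
import numpy as np

def em_mixture_of_experts(X, y, n_experts=3, n_iter=50, seed=0):
    """EM for a one-level mixture of linear-Gaussian experts (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    U = rng.normal(scale=0.1, size=(n_experts, d))   # expert regression weights
    V = np.zeros((n_experts, d))                     # gating weights
    sigma2 = np.full(n_experts, y.var() + 1e-6)      # expert noise variances

    for _ in range(n_iter):
        # E-step: posterior responsibilities h[t, i]
        scores = X @ V.T
        g = np.exp(scores - scores.max(axis=1, keepdims=True))
        g /= g.sum(axis=1, keepdims=True)
        mu = X @ U.T                                  # expert means, (n, n_experts)
        lik = np.exp(-0.5 * (y[:, None] - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)
        h = g * lik
        h /= h.sum(axis=1, keepdims=True) + 1e-12

        # M-step (experts): weighted least squares for each expert
        for i in range(n_experts):
            W = h[:, i]
            A = X.T @ (W[:, None] * X) + 1e-6 * np.eye(d)
            U[i] = np.linalg.solve(A, X.T @ (W * y))
            resid = y - X @ U[i]
            sigma2[i] = (W * resid ** 2).sum() / (W.sum() + 1e-12)

        # M-step (gating): a few gradient steps on sum_t sum_i h_ti * log g_ti
        for _ in range(10):
            scores = X @ V.T
            g = np.exp(scores - scores.max(axis=1, keepdims=True))
            g /= g.sum(axis=1, keepdims=True)
            V += 0.1 / n * (h - g).T @ X
    return U, V, sigma2
```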

Online algorithm
- The algorithm can also be used for online regression.
- For the expert networks: a recursive update of the weights, where R_ij is the inverse covariance matrix for expert network EN(i,j).
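A hedged sketch of the recursive-least-squares style expert update the slide refers to, written for a single linear expert; the forgetting factor lam and the precise form of the inverse-covariance recursion are assumptions made here for illustration rather than details taken from the slide.

```python
import numpy as np

def online_expert_update(U, R, x, y, h, lam=0.99):
    """One online update of a linear expert's weights U (out_dim x d).

    R   : (d, d) current inverse-covariance-style matrix for this expert
    h   : posterior responsibility of this expert for the new pair (x, y)
    lam : forgetting factor (assumed; discounts old data)
    """
    # Update R via the matrix-inversion lemma (weighted recursive least squares)
    Rx = R @ x
    denom = lam / h + x @ Rx if h > 1e-12 else np.inf
    R_new = (R - np.outer(Rx, Rx) / denom) / lam
    # Correct the weights toward the new target, scaled by the responsibility
    err = y - U @ x                                   # prediction error
    U_new = U + h * np.outer(err, R_new @ x)
    return U_new, R_new
```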

Online algorithm
- For the gating networks: analogous recursive updates, where S_i and S_ij are the corresponding inverse covariance matrices for the top-level and lower-level gating networks.

Results
- Simulated data from a four-joint robot arm moving in three-dimensional space.

Results

Conclusions
- Introduces a tree-structured architecture for supervised learning.
- Much faster than the traditional back-propagation algorithm.
- Can be used for on-line learning.

Thank you Questions?