CSC321: Introduction to Neural Networks and Machine Learning Lecture 15: Mixtures of Experts Geoffrey Hinton

A spectrum of models
Very local models
–e.g. nearest neighbors
–Very fast to fit: just store the training cases
–Local smoothing obviously improves things
Fully global models
–e.g. a polynomial
–May be slow to fit: each parameter depends on all the data
[Figure: two small x–y plots, one for a very local model and one for a fully global model]

Multiple local models
Instead of using a single global model or lots of very local models, use several models of intermediate complexity.
–Good if the dataset contains several different regimes which have different relationships between input and output.
–But how do we partition the dataset into subsets for each expert?

Partitioning based on input alone versus partitioning based on the input-output relationship
We need to cluster the training cases into subsets, one for each local model.
–The aim of the clustering is NOT to find clusters of similar input vectors.
–We want each cluster to have a relationship between input and output that can be well-modeled by one local model.
Which partition is best: I = input alone, or I/O = the input→output mapping?
[Figure: the same dataset partitioned two ways, one panel labeled I and one labeled I/O]

Mixtures of Experts
Can we do better than just averaging predictors in a way that does not depend on the particular training case?
–Maybe we can look at the input data for a particular case to help us decide which model to rely on.
–This may allow particular models to specialize in a subset of the training cases. They do not learn on cases for which they are not picked, so they can ignore stuff they are not good at modeling.
The key idea is to make each expert focus on predicting the right answer for the cases where it is already doing better than the other experts.
–This causes specialization.
–If we always average all the predictors, each model is trying to compensate for the combined error made by all the other models.

A picture of why averaging is bad
[Figure: the target value and the average of all the other predictors, shown relative to the output of predictor i.]
Do we really want to move the output of predictor i away from the target value?

Making an error function that encourages specialization instead of cooperation
If we want to encourage cooperation, we compare the average of all the predictors with the target and train to reduce the discrepancy:
E = (d − <y>)², where <y> is the average of all the predictors.
–This can overfit badly. It makes the model much more powerful than training each predictor separately.
If we want to encourage specialization, we compare each predictor separately with the target and train to reduce the average of all these discrepancies:
E = Σ_i p_i (d − y_i)², where p_i is the probability of picking expert i for this case.
–It's best to use a weighted average, where the weights, p, are the probabilities of picking that "expert" for the particular training case.

The mixture of experts architecture
Combined predictor: y^c = Σ_i p_i^c y_i^c
Simple error function for training: E^c = Σ_i p_i^c (d^c − y_i^c)²   (there is a better error function)
[Architecture diagram: the input feeds Expert 1, Expert 2, Expert 3 and a softmax gating network; the gating network's outputs p_i weight the experts' predictions y_i.]
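As a concrete illustration (not from the lecture), here is a minimal sketch of one forward pass through this architecture for a single training case, assuming linear experts and a linear–softmax gating network; all parameter names and shapes are invented for the example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
D, K = 3, 3                         # input dimension, number of experts

# Illustrative parameters: one linear expert per row, plus a gating network.
W_experts = rng.normal(size=(K, D))   # expert i predicts y_i = W_experts[i] @ x
W_gate    = rng.normal(size=(K, D))   # gating logits x_i = W_gate[i] @ x

x = rng.normal(size=D)              # one training case
d = 1.5                             # its target value

p = softmax(W_gate @ x)             # gating probabilities p_i
y = W_experts @ x                   # each expert's prediction y_i

combined = p @ y                    # combined predictor: sum_i p_i y_i
simple_error = p @ (d - y) ** 2     # simple error: sum_i p_i (d - y_i)^2
print(combined, simple_error)
```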

The derivatives of the simple cost function
If we differentiate w.r.t. the outputs of the experts we get a signal for training each expert.
If we differentiate w.r.t. the outputs of the gating network we get a signal for training the gating net.
–We want to raise p for all experts that give less than the average squared error of all the experts (weighted by p).
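In symbols (a reconstruction from the simple error function above, writing x_i^c for the gating network's logit for expert i, so that p^c = softmax(x^c)):

E^c = Σ_i p_i^c (d^c − y_i^c)²

∂E^c/∂y_i^c = −2 p_i^c (d^c − y_i^c)

∂E^c/∂x_i^c = p_i^c [ (d^c − y_i^c)² − E^c ]

The gating derivative follows from ∂p_j/∂x_i = p_j(δ_ij − p_i); it is negative, so gradient descent raises p_i, exactly when expert i's squared error is below the p-weighted average E^c, and it lowers p_i when the expert does worse than that average.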

Another view of mixtures of experts
One way to combine the outputs of the experts is to take a weighted average, using the gating network to decide how much weight to place on each expert.
But there is another way to combine the experts:
–How many times does the earth rotate around its axis each year?
–What will be the exchange rate of the Canadian dollar the day after the Quebec referendum?
Averaging several reasonable answers works for the first question, but the second has two distinct regimes depending on the outcome, so an average of the regimes is a poor prediction for either one.

Giving a whole distribution as output
If there are several possible regimes and we are not sure which one we are in, it's better to output a whole distribution.
–The error is the negative log probability of the right answer.
[Figure: a predictive distribution over the answer (the exchange rate, in cents, e.g. 75c), with a separate peak for each regime.]

The probability distribution that is implicitly assumed when using squared error
Minimizing the squared residuals is equivalent to maximizing the log probability of the correct answers under a Gaussian centered at the model's guess.
–If we assume that the variance of the Gaussian is the same for all cases, its value does not matter.
(Notation: d = correct answer, y = model's prediction.)
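Spelled out (a reconstruction using the slide's notation, with a shared variance σ²):

−log N(d; y, σ²) = (d − y)² / (2σ²) + ½ log(2πσ²)

Summed over training cases, the ½ log(2πσ²) term is a constant and the factor 1/(2σ²) scales every case equally, so minimizing Σ_c (d^c − y^c)² and maximizing the Gaussian log-likelihood select the same model.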

The probability of the correct answer under a mixture of Gaussians
p(d^c | MoE) = Σ_i p_i^c · (1/√(2πσ²)) · exp(−(d^c − y_i^c)² / (2σ²))
–p(d^c | MoE): probability of the desired output on case c given the mixture
–p_i^c: mixing proportion assigned to expert i for case c by the gating network
–y_i^c: output of expert i
–1/√(2πσ²): normalization term for a Gaussian with variance σ²

A natural error measure for a Mixture of Experts
E^c = −log p(d^c | MoE) = −log Σ_i p_i^c · (1/√(2πσ²)) · exp(−(d^c − y_i^c)² / (2σ²))
Differentiating w.r.t. the output of expert i:
∂E^c/∂y_i^c = −[ p_i^c N(d^c; y_i^c, σ²) / Σ_j p_j^c N(d^c; y_j^c, σ²) ] · (d^c − y_i^c) / σ²
–This fraction is the posterior probability of expert i.
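A small numerical sketch of this error measure and of the posterior probabilities (the "responsibilities") for a single case; the σ value and the random gating and expert outputs are arbitrary stand-ins, not part of the lecture:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
K = 3
sigma = 1.0                       # assumed shared standard deviation

p = softmax(rng.normal(size=K))   # gating probabilities p_i for this case
y = rng.normal(size=K)            # each expert's prediction y_i
d = 0.7                           # desired output for this case

# Gaussian density of the desired output under each expert's prediction
gauss = np.exp(-(d - y) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

mixture_prob = p @ gauss                  # p(d | MoE) = sum_i p_i N(d; y_i, sigma^2)
E = -np.log(mixture_prob)                 # natural error: negative log probability
posterior = p * gauss / mixture_prob      # posterior probability of each expert
dE_dy = -posterior * (d - y) / sigma**2   # derivative w.r.t. each expert's output
print(E, posterior, dE_dy)
```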

What are vowels?
The vocal tract has about four resonant frequencies which are called formants.
–We can vary the frequencies of the four formants.
How do we hear the formants?
–The larynx makes clicks. We hear the dying resonances of each click.
–The click rate is the pitch of the voice. It is independent of the formants.
–The relative energies in each harmonic of the pitch define the envelope of a formant.
Each vowel corresponds to a different region in the plane defined by the first two formants, F1 and F2. (Diphthongs are different.)

A picture of two imaginary vowels and a mixture of two linear experts after learning
[Figure: the F1–F2 plane with the two vowel clusters, the decision boundary of expert 1, the decision boundary of expert 2, and the decision boundary of the gating net; the gating net's boundary decides on which side expert 1 is used and on which side expert 2 is used.]

Decision trees
In a decision tree, we start at the root node and perform a test on the input vector to determine whether to take the right or left branch.
At each internal node of the tree we have a different test, and the test is usually some very simple function of the input vector.
–Typical test: is the third component of the input vector bigger than 0.7?
Once we reach a leaf node, we apply a function to the input vector to compute the output.
–The function is specific to that leaf node, so the sequence of tests is used to pick an appropriate function to apply to the current input vector.
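A minimal sketch of this procedure in code, with hypothetical tests and leaf functions chosen purely for illustration:

```python
import numpy as np

# Hypothetical leaf functions: each leaf applies its own simple model to the input.
def leaf_a(x): return 2.0 * x[0] + 1.0
def leaf_b(x): return -x[1]
def leaf_c(x): return x.sum()

def decision_tree_predict(x):
    # Root test: is the third component of the input vector bigger than 0.7?
    if x[2] > 0.7:
        # A different, equally simple test at the next internal node.
        if x[0] > 0.0:
            return leaf_a(x)
        return leaf_b(x)
    return leaf_c(x)

print(decision_tree_predict(np.array([0.5, -1.2, 0.9])))   # takes the right branch, then leaf_a
```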

Decision Stumps
Consider a decision tree with one root node directly connected to N different leaf nodes.
–The test needs to have N possible outcomes.
Each leaf node is an "expert" that uses its own particular function to predict the output from the input.
Learning a decision stump is tricky if the test has discrete outcomes, because we do not have a continuous space in which to optimize parameters.

Creating a continuous search space for decision stumps
If the test at the root node uses a softmax to assign probabilities to the leaf nodes, we get a continuous search space:
–Small changes to the parameters of the softmax "manager" cause small changes to the expected log probability of predicting the correct answer.
A mixture of experts can be viewed as a probabilistic way of viewing a decision stump, so that the tests and leaf functions can be learned by maximum likelihood.
–It can be generalised to a full decision tree by having a softmax at each internal node of the tree.
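To make this concrete, here is a hedged sketch of a two-leaf soft decision stump (equivalently, a two-expert mixture) fitted by gradient descent on the negative log-likelihood from the earlier slides; the toy data, parameterization, and learning rate are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data with two regimes: the input-output relationship changes at x = 0.
N = 400
x = rng.uniform(-2, 2, size=N)
d = np.where(x < 0, 2 * x + 1, -x + 3) + 0.1 * rng.normal(size=N)

K = 2                                    # two leaves / experts (a soft stump)
w = rng.normal(size=K); b = np.zeros(K)  # linear experts: y_i = w_i * x + b_i
v = rng.normal(size=K); c = np.zeros(K)  # gating logits:  g_i = v_i * x + c_i
lr, sigma = 0.05, 1.0

def softmax(g):
    e = np.exp(g - g.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

for step in range(2000):
    g = np.outer(x, v) + c               # N x K gating logits
    p = softmax(g)                       # gating probabilities ("manager")
    y = np.outer(x, w) + b               # each expert's prediction
    gauss = np.exp(-(d[:, None] - y) ** 2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    mix = (p * gauss).sum(axis=1)        # p(d | MoE) for every case
    r = p * gauss / mix[:, None]         # posterior responsibility of each expert

    # Gradients of the mean negative log-likelihood (expert and gating signals).
    dy = -r * (d[:, None] - y) / sigma**2
    dg = p - r
    w -= lr * (dy * x[:, None]).mean(axis=0)
    b -= lr * dy.mean(axis=0)
    v -= lr * (dg * x[:, None]).mean(axis=0)
    c -= lr * dg.mean(axis=0)

# With luck the two experts recover the two regimes, roughly (2, 1) and (-1, 3).
print("experts (w_i, b_i):", list(zip(w.round(2), b.round(2))))
print("mean NLL:", (-np.log(mix)).mean().round(3))
```

The gating network here plays the role of the softmax "manager" at the root of the stump, and the two linear leaf functions are the experts it chooses between.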