4th IMPRS Astronomy Summer School
Drawing Astrophysical Inferences from Data Sets
William H. Press, The University of Texas at Austin
Lecture 6

Mixture Models
(the figure on this slide shows a 3-component mixture)
A general idea for dealing with events that can come from one of several components, when you don't know which. Instead of treating the general case, we'll illustrate with an astronomical example.
The Hubble constant H_0 was highly contentious as late as the late 1990s:
– Measurement by classical astronomy is very difficult: each measurement is a multi-year project, with calibration issues.
– Between 1930 and 2000, credible measurements ranged from 30 to 120 km/s/Mpc, many with small claimed errors.
– The consensus view was "we just don't know H_0."
– Or was it just a failure to apply an adequate statistical model to the existing data?

Grim observational situation: the problem is not the range of values, but the inconsistency of the claimed errors. This forbids any kind of "just average", because goodness-of-fit rejects the possibility that these experiments are measuring the same value.

Here, the mixture will be "some experiments are right, some are wrong, and we don't know which are which."
Let p be the probability that a "random" experiment is right, and let v be the bit vector recording which experiments are right or wrong, e.g. (1,0,0,1,1,0,1,...).
Now expand out the prior, using the law of total probability and Bayes' theorem, and make reasonable assumptions about conditional independence: the prior on v given p is p^{#(v=1)} (1-p)^{#(v=0)}.
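The slide's displayed equations do not survive in the transcript. The following is a hedged reconstruction of the marginalization they describe, in the notation just defined; the factorization shown is an assumption consistent with the slide's annotations "law of total probability" and "Bayes":

P(H_0 \mid D) = \sum_v \int_0^1 P(H_0, v, p \mid D)\,dp    (law of total probability)
             \propto \sum_v \int_0^1 P(D \mid H_0, v)\,P(v \mid p)\,P(p)\,P(H_0)\,dp    (Bayes, with the prior expanded by conditional independence)

with  P(v \mid p) = p^{\#(v=1)} (1-p)^{\#(v=0)}.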

Now you stare at this a while and realize that the sum over v is just a multinomial expansion. So it is as if each event were sampled from the linear mixture of probability distributions. We see that this is not just a heuristic, but the actual, exact marginalization over all possible assignments of events to components.
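The expansion itself is lost in the transcript. A hedged reconstruction, writing P_i for the likelihood of experiment i if it is right and Q_i for the broad, flat likelihood if it is wrong (P_i and Q_i are names introduced here; the slide's dangling annotation "any big enough value" presumably refers to the width of this flat distribution):

\sum_v p^{\#(v=1)} (1-p)^{\#(v=0)} \prod_i P_i^{\,v_i} Q_i^{\,1-v_i} \;=\; \prod_i \bigl[\, p\,P_i + (1-p)\,Q_i \,\bigr]

so the joint likelihood is a product over experiments of two-component mixtures, exactly as stated above.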

And the answer is... (the figure shows the resulting distribution for H_0, with comparison values from the Wilkinson Microwave Anisotropy Probe satellite 5-year results (2008), from WMAP and from WMAP + supernovae).
– This is not a Gaussian; it's just whatever shape came out of the data.
– It's not even necessarily unimodal (although it is for this data set).
– If you leave out some of the middle-value experiments, it splits and becomes bimodal, thus showing that this method is not "tail-trimming".

Similarly, we can get the probability that each experiment is correct (that is, the assignment of events to components); the expression simplifies further if P(p) is uniform in (0,1).
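The slide's formulas are lost; a hedged sketch of the quantities it most plausibly displays, in the notation introduced above. Conditional on H_0 and p, the probability that experiment i is correct is the standard mixture responsibility:

P(v_i = 1 \mid H_0, p, D) = \frac{p\,P_i}{p\,P_i + (1-p)\,Q_i}

and marginalizing a uniform P(p) over (0,1) brings in Beta-function integrals of the form

\int_0^1 p^{a} (1-p)^{b}\,dp = \frac{a!\,b!}{(a+b+1)!}.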

Or we can get the probability distribution of p, the a priori probability of a measurement being correct. (This is of course not universal, but depends on the field and its current state.) E.g., take a uniform prior P(p); the mixture form automatically marginalizes over v.
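Again the displayed formula is lost; a hedged reconstruction consistent with the mixture form above (the prior P(H_0) and the broad likelihoods Q_i are the same assumed ingredients as before):

P(p \mid D) \propto P(p) \int dH_0\; P(H_0) \prod_i \bigl[\, p\,P_i(H_0) + (1-p)\,Q_i \,\bigr]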

Gaussian Mixture Models (GMMs)
What if the components have unknown parameters, like location and shape in N-dimensional space?
– We can't assign the events to components until we know the components.
– But we can't define the components until we know which events are assigned to them.
– The trick is to find both, self-consistently.
EM (expectation-maximization) methods are a general iterative framework for dealing with problems like this. "Gaussian Mixture Models" are the simplest example: the components are Gaussians, each defined by a mean and a covariance.
Note that not all mixture models are EM methods, and not all EM methods are mixture models! EM methods can also deal with missing data, for example.

Specify the model as a mixture of Gaussians. Key to the notational thicket:
– M dimensions
– k = 1...K Gaussians
– n = 1...N data points
– P(k): population fraction in component k
– P(x_n): model probability at x_n
The slide's equations define the "probabilistic assignment" of a data point to a component, and the overall likelihood of the model. The goal is to find all of the above, starting with only the data points.
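The equations themselves are lost in the transcript; a hedged reconstruction of the standard Gaussian-mixture quantities the annotations refer to (the symbols \mu_k, \Sigma_k and p_{nk} are introduced here, not taken from the slide):

P(x_n) = \sum_{k=1}^{K} P(k)\, N(x_n \mid \mu_k, \Sigma_k)

p_{nk} \equiv P(k \mid x_n) = \frac{P(k)\, N(x_n \mid \mu_k, \Sigma_k)}{P(x_n)}    ("probabilistic assignment" of point n to component k)

\mathcal{L} = \prod_{n=1}^{N} P(x_n)    (overall likelihood of the model)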

Expectation, or E-step: suppose we know the model, but not the assignment of individual points. (So called because it is probabilistic assignment by expectation value.)
Maximization, or M-step: suppose we know the assignment of individual points, but not the model. (So called because [theorem!] the overall likelihood increases at each step.)
– It can be proved that alternating E and M steps converges to (at least a local) maximum of the overall likelihood.
– Convergence is sometimes slow, with long "plateaus".
– Often one starts with K randomly chosen data points as the starting means, and equal (usually spherical) covariance matrices, but then one had better try multiple re-starts.
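The update formulas were shown as equations on the slide and are lost here; the following are the standard GMM updates consistent with the notation above (a reconstruction, not a verbatim copy of the slide).

E-step (given the model, assign points probabilistically):

p_{nk} = \frac{P(k)\, N(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} P(j)\, N(x_n \mid \mu_j, \Sigma_j)}

M-step (given the assignments, re-estimate the model):

\hat{\mu}_k = \frac{\sum_n p_{nk}\, x_n}{\sum_n p_{nk}}, \qquad
\hat{\Sigma}_k = \frac{\sum_n p_{nk}\,(x_n - \hat{\mu}_k)(x_n - \hat{\mu}_k)^{T}}{\sum_n p_{nk}}, \qquad
\hat{P}(k) = \frac{1}{N}\sum_n p_{nk}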

Because Gaussians underflow so easily, a couple of tricks are important:
1) Use logarithms!
2) Do the sum by the "log-sum-exp" formula below.
(The code in NR3 implements these tricks.)
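The formula itself does not survive in the transcript; the standard log-sum-exp identity being referred to is

\log \sum_k e^{z_k} \;=\; z_{\max} + \log \sum_k e^{\,z_k - z_{\max}}, \qquad z_{\max} \equiv \max_k z_k,

so that every exponential actually evaluated is at most 1 (no overflow) and the largest term is exactly 1 (so the sum cannot underflow to zero).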

Let's look in 2 dimensions at an "ideal", and then a "non-ideal", example. Ideal: we generate data from Gaussians, then we fit Gaussians to it. The generating parameters (MATLAB):
mu1 = [.3 .3];  sig1 = [.04 .03; .03 .04];
mu2 = [.5 .5];  sig2 = [.5 0; 0 .5];
mu3 = [1 .5];   sig3 = [.05 0; 0 .5];
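The slide then shows the resulting scatter plot. A minimal MATLAB sketch of how such a data set could be generated from these parameters (this is not the slide's code; the sample size N per component is an assumption):

N = 1000;                                           % assumed number of points per component
x1 = repmat(mu1, N, 1) + randn(N,2) * chol(sig1);   % rows ~ N(mu1, sig1)
x2 = repmat(mu2, N, 1) + randn(N,2) * chol(sig2);
x3 = repmat(mu3, N, 1) + randn(N,2) * chol(sig3);
x  = [x1; x2; x3];                                  % combined data set for the GMM fit
plot(x(:,1), x(:,2), '.')                           % scatter plot like the one on the slide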

This "ideal" example converges rapidly to the right answer, using the GMM class in NR3.
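The NR3 listing shown on the slide is not part of the transcript and its interface is not reproduced here. Instead, a self-contained MATLAB sketch of the same EM iteration, using the data x from the sketch above (K, the iteration count, and the initial covariance scale are assumptions):

K = 3;  [N, M] = size(x);
mu  = x(randperm(N, K), :);                % start at K randomly chosen data points
sig = repmat(0.1*eye(M), [1 1 K]);         % equal spherical starting covariances
w   = ones(1, K) / K;                      % population fractions P(k)
for iter = 1:100
    % E-step: probabilistic assignment p(n,k) of each point to each component
    p = zeros(N, K);
    for k = 1:K
        d = x - repmat(mu(k,:), N, 1);
        S = sig(:,:,k);
        p(:,k) = w(k) * exp(-0.5*sum((d/S).*d, 2)) / sqrt((2*pi)^M * det(S));
    end
    p = p ./ repmat(sum(p, 2), 1, K);      % normalize over components
    % M-step: re-estimate means, covariances, and fractions from the assignments
    for k = 1:K
        nk = sum(p(:,k));
        mu(k,:) = sum(repmat(p(:,k),1,M) .* x, 1) / nk;
        d = x - repmat(mu(k,:), N, 1);
        sig(:,:,k) = (d' * (repmat(p(:,k),1,M) .* d)) / nk;
        w(k) = nk / N;
    end
end
% (For clarity this sketch skips the log-sum-exp trick from the previous slide;
% that is fine for well-scaled 2-D data like this, but not in general.)

In practice one would also monitor the log-likelihood for convergence and try multiple random re-starts, as noted earlier.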

For a non-ideal example, here is some biological data on the (log10) lengths of the 1st and 2nd introns in genes. We can see that something non-GMM is going on! For general problems in >2 dimensions, it's often hard to visualize whether this is the case or not, so GMMs get used "blindly".
(Figure annotations: the spliceosome can't deal with introns <100 in length, so there is a strong evolutionary constraint; except that, like everything else in biology, there are exceptions. These are not "experimental error" in the physics sense!)

A three-component model converges rapidly to something reasonable.

But we don't always land on the same local maximum, although there seem to be just a handful of them. (One of these presumably has the highest likelihood.)

Eight components: the ones with higher likelihood are pretty good as summaries of the data distribution (absent a predictive model). But the individual components are unstable and have little or no meaning. "Fit a lot of Gaussians for interpolation, but don't believe them."

Variations on the theme of GMMs:
– You can constrain the Σ matrices to be diagonal: use this when you have reason to believe that the components individually have no cross-correlations (they align with the axes).
– Or constrain them to be multiples of the unit matrix: this makes all components spherical.
– Or fix Σ = ε1 (an infinitesimal times the unit matrix): don't re-estimate Σ, only re-estimate the means. This assigns points 100% to the closest cluster (so you don't actually need to compute any Gaussians, just distances). It is called "K-means clustering": a kind of GMM for dummies, widely used (there are a lot of dummies!). It is probably always better to use a spherical GMM (the middle bullet above). A sketch of this K-means limit follows the list.
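A minimal MATLAB sketch of the K-means limit described in the last bullet (not from the slides; it reuses x and K from the earlier sketches):

mu = x(randperm(size(x,1), K), :);     % K random data points as starting means
for iter = 1:100
    % assignment step: hard-assign each point to its closest mean
    d2 = zeros(size(x,1), K);
    for k = 1:K
        dk = x - repmat(mu(k,:), size(x,1), 1);
        d2(:,k) = sum(dk.^2, 2);       % squared distance to mean k
    end
    [~, idx] = min(d2, [], 2);
    % update step: each mean becomes the centroid of the points assigned to it
    for k = 1:K
        if any(idx == k)
            mu(k,:) = mean(x(idx == k, :), 1);
        end
    end
end

No Gaussian densities are evaluated anywhere, only distances, which is exactly the Σ = ε1 limit described above.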