1 Three examples of the EM algorithm Week 12, Lecture 1 Statistics 246, Spring 2002

2 The estimation of linkage from the offspring of selfed heterozygotes R A Fisher and Bhai Balmukand, Journal of Genetics (1928) See also: Collected Papers of R A Fisher, Volume II, Paper 71, pp

3 The problem In modern terminology, we have two linked bi-allelic loci, A and B say, with alleles A and a, and B and b, respectively, where A is dominant over a and B is dominant over b. A double heterozygote AaBb will produce gametes of four types: AB, Ab, aB and ab. Since the loci are linked, the types AB and ab will appear with a frequency different from that of Ab and aB, say 1-r and r, respectively, in males, and 1-r' and r' respectively in females. Here we suppose that the parental origin of these heterozygotes is the mating AABB × aabb, so that r and r' are the male and female recombination rates between the two loci. The problem is to estimate r and r', if possible, from the offspring of selfed double heterozygotes.

4 Offspring genotypic and phenotypic frequencies Since gametes AB, Ab, aB and ab are produced in proportions (1-r)/2, r/2, r/2 and (1-r)/2 respectively by the male parent, and (1-r')/2, r'/2, r'/2 and (1-r')/2 respectively by the female parent, zygotes with genotypes AABB, AaBB, etc. are produced with frequencies (1-r)(1-r')/4, (1-r)r'/4, etc. Exercise: Complete the Punnett square of offspring genotypes and their associated frequencies. The problem here is this: although there are 16 distinct offspring genotypes, taking parental origin into account, the dominance relations imply that we observe only 4 distinct phenotypes, which we denote by A*B*, A*b*, a*B* and a*b*. Here A* (resp. B*) denotes the dominant, while a* (resp. b*) denotes the recessive, phenotype determined by the alleles at A (resp. B).

5 Offspring genotypic and phenotypic probabilities, cont. Thus individuals with genotypes AABB, AaBB, AABb or AaBb, which account for 9 of the 16 gametic combinations (check!), all exhibit the phenotype A*B*, i.e. the dominant alternative in both characters, while those with genotypes AAbb or Aabb (3/16) exhibit the phenotype A*b*, those with genotypes aaBB and aaBb (3/16) exhibit the phenotype a*B*, and finally the double recessives aabb (1/16) exhibit the phenotype a*b*. It is a slightly surprising fact that the probabilities of the four phenotypic classes are definable in terms of the single parameter ψ = (1-r)(1-r'), as follows: a*b* has probability ψ/4 (easy to see), a*B* and A*b* both have probability (1-ψ)/4, while A*B* has probability 1 minus the sum of the preceding, which is (2+ψ)/4. Exercise: Calculate these phenotypic probabilities.
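As a check on the two exercises above, here is a small Python sketch (ours, not part of the original lecture) that enumerates the 16 gametic combinations, collapses them to the four phenotypic classes, and confirms the probabilities (2+ψ)/4, (1-ψ)/4, (1-ψ)/4 and ψ/4 for arbitrary r and r'.

from itertools import product

def phenotype_probs(r, rp):
    """Punnett square for selfing a double heterozygote AaBb (phase AB/ab),
    with male recombination rate r and female recombination rate rp,
    collapsed to the four phenotypic classes."""
    male = {"AB": (1 - r) / 2, "ab": (1 - r) / 2, "Ab": r / 2, "aB": r / 2}
    female = {"AB": (1 - rp) / 2, "ab": (1 - rp) / 2, "Ab": rp / 2, "aB": rp / 2}
    probs = {"A*B*": 0.0, "A*b*": 0.0, "a*B*": 0.0, "a*b*": 0.0}
    for (gm, pm), (gf, pf) in product(male.items(), female.items()):
        # a locus shows the dominant phenotype iff at least one gamete
        # carries the upper-case allele
        pheno = ("A*" if "A" in gm + gf else "a*") + ("B*" if "B" in gm + gf else "b*")
        probs[pheno] += pm * pf
    return probs

r, rp = 0.1, 0.2                  # arbitrary illustration values
psi = (1 - r) * (1 - rp)
print(phenotype_probs(r, rp))
print((2 + psi) / 4, (1 - psi) / 4, (1 - psi) / 4, psi / 4)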

6 Estimation of ψ Now suppose we have a random sample of n offspring from the selfing of our double heterozygote. Thus the 4 phenotypic classes will be represented roughly in proportion to their theoretical probabilities, their joint distribution being multinomial Mult[n; (2+ψ)/4, (1-ψ)/4, (1-ψ)/4, ψ/4]. Note that neither r nor r' is separately estimable from these data, but only the product (1-r)(1-r'). Note too that since r ≤ 1/2 and r' ≤ 1/2, it follows that ψ ≥ 1/4. How do we estimate ψ? Fisher and Balmukand discuss a variety of methods that were in the literature at the time they wrote, and compare them with maximum likelihood, which is the method of choice in problems like this. We describe a variant of their approach to illustrate the EM algorithm.

7 The incomplete data formulation Let us denote (cf. p. 26 of Week 11b) the counts of the 4 phenotypic classes by y_1, y_2, y_3 and y_4, these having probabilities (2+ψ)/4, (1-ψ)/4, (1-ψ)/4 and ψ/4, respectively. Now the probability of observing the genotype AABB is ψ/4, just as it is for aabb, and although this genotype is phenotypically indistinguishable from the 8 other genotypes with phenotype A*B*, it is convenient to imagine that we can distinguish it. So let x_1 denote the count of those 8 other genotypes, and let x_2 denote the count of AABB, so that x_1 + x_2 = y_1. Note that x_1 has marginal probability 1/2 and x_2 has marginal probability ψ/4. In the jargon of the EM algorithm, x_1 and x_2 are missing data, as we observe only their sum y_1. Next, as on p. 26 of Week 11b, we let y_2 = x_3, y_3 = x_4 and y_4 = x_5. We now illustrate the approach of the EM algorithm, referring to material in Week 9b and Week 11b for generalities.

8 The EM algorithm for this problem The complete data log likelihood at ψ is (Ex: check): (x_2 + x_5) log ψ + (x_3 + x_4) log(1-ψ). The expected value of the complete data log likelihood given the observed data, taken at ψ' (E-step: think of ψ' as ψ-initial), is: (E_ψ'(x_2 | y_1) + y_4) log ψ + (y_2 + y_3) log(1-ψ). Now E_ψ'(x_2 | y_1) is just k y_1, where k = ψ'/(2+ψ'). (Ex: check.) The maximum over ψ of this expected complete data log likelihood (M-step) occurs at ψ'' = (k y_1 + y_4)/(k y_1 + y_2 + y_3 + y_4). (Ex: check.) Here we think of ψ'' as ψ-next. It should now be clear how the E-step (calculation of k) and the M-step (calculation of ψ'') can be iterated.
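To make the iteration concrete, here is a minimal Python sketch of the E- and M-steps just described; the function name and the counts in the example call are ours, chosen purely for illustration.

def em_linkage(y, psi0=0.5, tol=1e-10, max_iter=1000):
    """EM for the selfed double-heterozygote linkage example.

    y = (y1, y2, y3, y4): observed phenotype counts, with class
    probabilities ((2+psi)/4, (1-psi)/4, (1-psi)/4, psi/4).
    Returns the final iterate of psi.
    """
    y1, y2, y3, y4 = y
    psi = psi0
    for _ in range(max_iter):
        # E-step: expected count of the hidden psi/4 part of class 1
        k = psi / (2.0 + psi)
        x2 = k * y1
        # M-step: maximize (x2 + y4) log(psi) + (y2 + y3) log(1 - psi)
        psi_new = (x2 + y4) / (x2 + y2 + y3 + y4)
        if abs(psi_new - psi) < tol:
            return psi_new
        psi = psi_new
    return psi

print(em_linkage((125, 18, 20, 34)))   # hypothetical counts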

9 Comments on this example This completes our discussion of this example. It appeared in the famous EM paper (Dempster, Laird and Rubin, JRSSB 1977) without any explanation of its genetic origins. Of course it is an illustration of the EM only, for the likelihood equation generated by the observed data alone is a quadratic, and so easy to solve directly (see Fisher & Balmukand). Thus it is not necessary to use the EM in this case (some would say in any case, but that is for another time). We have omitted much of the fascinating detail provided in Fisher and Balmukand, and similarly in Dempster et al. Read these papers: both are classics, with much of interest to you. Rather than talk about details concerning the EM (most importantly, starting and stopping it, the issue of global maxima, and standard errors for parameter estimates), I turn to another important EM example: mixtures.
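To make the remark about the quadratic explicit (this small derivation is ours, filling in a step the slides leave to the reader): the observed-data log likelihood is, up to an additive constant, l(ψ) = y_1 log(2+ψ) + (y_2+y_3) log(1-ψ) + y_4 log ψ. Setting l'(ψ) = 0 and multiplying through by ψ(1-ψ)(2+ψ) gives the quadratic n ψ² - (y_1 - 2(y_2+y_3) - y_4) ψ - 2 y_4 = 0, where n = y_1 + y_2 + y_3 + y_4; its positive root is the maximum likelihood estimate of ψ.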

10 Fitting a mixture model by EM to discover motifs in biopolymers T L Bailey and C Elkan, UCSD Technical Report CS94-351; ISMB94. Here we outline some more EM theory, this being relevant to motif discovery. We follow the above report, as we will be discussing the program MEME, written by these authors, in a later lecture. This part of MEME is called MM. A finite mixture model supposes that the data X = (X_1, …, X_n) arise from two or more groups with distributions of known form but different, unknown parameters θ = (θ_1, …, θ_g), where g is the number of groups, and mixing parameters λ = (λ_1, …, λ_g), where the λ_j are non-negative and sum to 1. It is convenient to introduce indicator vectors Z = (Z_1, …, Z_n), where Z_i = (Z_i1, …, Z_ig), and Z_ij = 1 if X_i is from group j, and 0 otherwise. Thus Z_i gives the group membership of the ith sample. It follows that pr(Z_ij = 1 | θ, λ) = λ_j. For any given i, all the Z_ij are 0 apart from one.

11 Complete data log likelihood Under the assumption that the pairs (Z_i, X_i) are mutually independent, their joint density may be written (Exercise: Carry out this calculation in detail.) pr(Z, X | θ, λ) = ∏_ij [ λ_j pr(X_i | θ_j) ]^Z_ij. The complete-data log likelihood is thus log L(θ, λ | Z, X) = ∑_i ∑_j Z_ij log [ λ_j pr(X_i | θ_j) ]. The EM algorithm iteratively computes the expectation of this quantity given the observed data X and initial estimates θ' and λ' of θ and λ (the E-step), and then maximizes the result in the free variables θ and λ, leading to new estimates θ'' and λ'' (the M-step). Our interest here is in the particular calculations necessary to carry out these two steps.

12 Mixture models: the E-step Since the complete-data log likelihood is a sum over i and j of terms multiplying Z_ij, and the pairs (Z_i, X_i) are independent across i, we need only consider the expectation of one such term, given X_i. Using the initial parameter values θ' and λ', and the fact that the Z_ij are binary, we get E(Z_ij | X, θ', λ') = λ'_j pr(X_i | θ'_j) / ∑_k λ'_k pr(X_i | θ'_k) = Z'_ij, say. Exercise: Obtain this result.
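As a sketch only (function and argument names are ours, and the component densities are assumed to be supplied by the caller), this E-step can be written in Python as follows.

import numpy as np
from scipy.stats import norm

def e_step(X, lambdas, densities):
    """Posterior membership weights Z'_ij for a finite mixture.

    X        : sequence of n observations
    lambdas  : length-g array of mixing proportions (non-negative, sum to 1)
    densities: list of g functions, densities[j](x) = pr(x | theta_j)
    Returns an (n, g) array whose rows sum to 1.
    """
    weighted = np.column_stack([lam * np.array([f(x) for x in X])
                                for lam, f in zip(lambdas, densities)])
    return weighted / weighted.sum(axis=1, keepdims=True)

# Toy usage with two normal components (all values arbitrary)
Z = e_step(np.array([-1.0, 0.2, 3.1]),
           np.array([0.6, 0.4]),
           [norm(0, 1).pdf, norm(3, 1).pdf])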

13 Mixture models: the M-step Here our task is to maximize the result of the E-step: ∑_i ∑_j Z'_ij log λ_j + ∑_i ∑_j Z'_ij log pr(X_i | θ_j). The maximization over λ is clearly independent of the rest and is readily seen (Ex: check this) to be achieved by λ''_j = ∑_i Z'_ij / n. Maximizing over θ requires that we specify the model in more detail. The case of interest to us is where g = 2, and the distributions for class 1 (the motif) and class 2 (the background) are given by position-specific multinomial distributions and a single general multinomial distribution, respectively.
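Continuing the sketch above (Z being the (n, g) array returned by e_step), the update of the mixing proportions is a one-liner:

# M-step for the mixing proportions: lambda''_j = (1/n) * sum_i Z'_ij
lambdas_new = Z.mean(axis=0)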

14 Mixture models: the M-step, cont. Our initial observations are supposed to consist of N sequences over an L-letter alphabet. Unlike what we did in the last lecture, these sequences are now broken up into all n overlapping subsequences of length w, and these are our X_i. We need w+1 sets of probabilities, namely f_jk and f_0k, where j = 1, …, w (w being the length of the motif) and k runs over the symbol alphabet. With these parameters we can write pr(X_i | θ_1) = ∏_j ∏_k f_jk^I(k, X_ij) and pr(X_i | θ_2) = ∏_j ∏_k f_0k^I(k, X_ij), where X_ij is the letter in the jth position of sample i, and I(k, a) = 1 if a = a_k, and 0 otherwise. With this notation, for k = 1, …, L, write c_0k = ∑_i ∑_j Z'_i2 I(k, X_ij) and, for j = 1, …, w, c_jk = ∑_i Z'_i1 I(k, X_ij). Here c_0k is the expected number of times letter a_k appears in the background, and c_jk the expected number of times a_k appears at position j of occurrences of the motif in the data.

15 Mixture models: the M-step completed. With these preliminaries, it is straightforward to maximize the expected complete-data log likelihood, given the observations X, evaluated at the initial parameters θ' and λ'. Exercise: Fill in the missing details below. We obtain f''_jk = c_jk / ∑_k c_jk, for j = 0, 1, …, w and k = 1, …, L. In practice, care must be taken to avoid zero frequencies, so either one uses explicit Dirichlet prior distributions, or one adds small constants β_k with ∑_k β_k = β, giving f''_jk = (c_jk + β_k) / (∑_k c_jk + β), j = 0, 1, …, w; k = 1, …, L.
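The following Python sketch (ours; the variable layout is an assumption, not the lecture's notation) assembles the expected counts c_0k and c_jk from the motif membership weights of the E-step and applies the pseudocount-smoothed update above.

import numpy as np

def m_step_motif(subseqs, Z1, alphabet, w, beta=None):
    """M-step frequency updates for the two-class (motif/background) model.

    subseqs : list of n length-w strings (the overlapping subsequences X_i)
    Z1      : length-n array of motif membership weights Z'_i1
              (the background weights are 1 - Z1)
    alphabet: string listing the L letters
    w       : motif width
    beta    : optional length-L array of pseudocounts (default: none)
    Returns (f0, f): background frequencies of shape (L,) and motif
    frequencies of shape (w, L).
    """
    L = len(alphabet)
    index = {a: k for k, a in enumerate(alphabet)}
    beta = np.zeros(L) if beta is None else np.asarray(beta, dtype=float)

    c0 = np.zeros(L)         # expected background letter counts c_0k
    c = np.zeros((w, L))     # expected motif letter counts c_jk
    for x, z1 in zip(subseqs, Z1):
        for j, a in enumerate(x):
            k = index[a]
            c[j, k] += z1
            c0[k] += 1.0 - z1

    f0 = (c0 + beta) / (c0.sum() + beta.sum())
    f = (c + beta) / (c.sum(axis=1, keepdims=True) + beta.sum())
    return f0, f

# Hypothetical toy call: three DNA 3-mers with made-up membership weights
f0, f = m_step_motif(["ACG", "ACG", "TTT"], np.array([0.9, 0.8, 0.1]),
                     "ACGT", 3, beta=np.full(4, 0.25))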

16 Comment on the EM algorithm A common misconception concerning the EM algorithm is that we are estimating or predicting the missing data, plugging that estimate or prediction into the complete data log likelihood, "completing the data" you might say, and then maximizing this in the free parameters, as though we had complete data. This is emphatically NOT what is going on. A more accurate description might be this: we are using the inferred missing data to weight the various parts of the complete data log likelihood in such a way that the pieces combine into the maximum likelihood estimates.

17 An Expectation Maximization (EM) Algorithm for the Identification and Characterization of Common Sites in Unaligned Biopolymer Sequences C E Lawrence and A A Reilly, PROTEINS: Structure, Function and Genetics 7:41-51 (1990) This paper was the precursor of the Gibbs-Metropolis algorithm we described in Week 11b. We now briefly describe the algorithm along the lines of the discussion just given, but in the notation used in Week 11b. On p... of that lecture, the full data log likelihood was given, without the term for the marginal distribution of A being visible. We suppose that the a_k are independent, and uniformly distributed over {1, …, n_k - W + 1}. Because this term does not depend on either θ_0 or θ, it disappears into the likelihood proportionality constant. Thus the complete data log likelihood is log L(θ_0, θ | R, A) = h(R_{A^c}) log θ_0 + ∑_j h(R_{A+j-1}) log θ_j, where we use the notation h log θ_0 = ∑_i h_i log θ_0,i, cf. p. 12 of Week 11b.

18 The E-step in this case. The expected value of the complete data log likelihood given the observed data and initial parameters θ'_0 and θ' is just ∑_A pr(A | R, θ'_0, θ') log pr(R, A | θ_0, θ) (*), where the sum is over all A = {a_1, …, a_K}, and so our task is to calculate the first term. Now we are treating all the rows (sequences) as mutually independent, so pr(A | R, θ'_0, θ') factorizes over k, and we need only deal with a typical row, the kth, say. Letting a_k denote the random variable corresponding to the start of the motif in the kth row, we have pr(a_k = i) = 1/(n_k - W + 1), i = 1, …, n_k - W + 1. We can readily calculate pr(a_k = i | θ'_0, θ', R) by Bayes' theorem, and these posterior probabilities get multiplied together and inserted in (*) above. Exercise: Carry out this calculation.
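Here is a sketch of that Bayes' theorem calculation for a single sequence, in Python; the parameter layout (theta0 as a length-L background vector, theta as a (W, L) matrix of position-specific motif probabilities) is our assumption, not the lecture's notation.

import numpy as np

def start_posterior(row, theta0, theta, alphabet):
    """pr(a_k = i | R, theta0, theta) for one sequence, with a uniform
    prior over the n_k - W + 1 possible start positions."""
    index = {a: k for k, a in enumerate(alphabet)}
    x = np.array([index[a] for a in row])
    n, W = len(x), theta.shape[0]
    like = np.empty(n - W + 1)
    for i in range(n - W + 1):
        window = x[i:i + W]
        # log likelihood of the whole row: motif model inside the window,
        # background model everywhere else
        log_like = (np.log(theta[np.arange(W), window]).sum()
                    + np.log(theta0[x]).sum()
                    - np.log(theta0[window]).sum())
        like[i] = np.exp(log_like)
    return like / like.sum()       # the uniform prior cancels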

19 The M-step in this case. Once we have calculated the expected value of the complete data log likelihood given the observed data and initial parameters θ'_0 and θ', we maximize it in the free variables θ_0 and θ, leading to new parameters θ''_0 and θ''. How is this done? Without giving the gory details, just notice that (*) is a weighted combination of multinomial log likelihoods, just like the one we met in our previous example, the mixture model. There the weights were the Z'_ij's; here they are the pr(A | R, θ'_0, θ')'s. It follows (Exercise: Fill in the details) that the maximizing values of θ_0 and θ, which we denote by θ''_0 and θ'', are ratios of expected counts similar to the c_0 and c_j in the mixture discussion. As there, we will want to deal with small or zero counts by invoking Dirichlet priors.