1 Learning – EM in the ABO locus. Tutorial #9. © Ilan Gronau.

2 The EM algorithm
The setting for the algorithm:
- Our data is a series of observed outcomes of experiments.
- Each experiment is conducted identically and independently.
- An experiment can be modeled by a series of independent "die" tosses.
- The actual outcomes of the die tosses are hidden from us.
- The observed outcomes are a deterministic function of the hidden die-toss outcomes.
We wish to find the MLE of the parameters for each die.

3 The EM algorithm
We wish to find the MLE of the parameters for each die. For each of the observed outcomes we need to determine the number of times each die fell on each of its faces. Because the toss results are hidden, this number is unknown to us. However, every assignment of the die parameters allows us to compute the expected number of times each die fell on each of its faces. This is done using the following scheme:
1. Define the set of all the possible hidden outcomes.
2. For each hidden outcome determine: its probability, the observed outcome it produces, and the count of die-toss results it implies.
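
The scheme above can be sketched in a few lines of Python. The toy model below is my own illustration, not from the slides: a three-sided die with faces a/b/o is tossed twice, and the observer only learns whether the two tosses agree.

```python
# A minimal sketch of the expected-count scheme; the toy model is hypothetical.
from itertools import product

def expected_counts(observed_agree, params):
    """Expected number of times each face came up, given the observation."""
    faces = list(params)
    hidden = list(product(faces, repeat=2))            # 1. all hidden outcomes
    prob = {h: params[h[0]] * params[h[1]] for h in hidden}   # 2. probability
    obs_of = {h: h[0] == h[1] for h in hidden}         # 2. observed outcome
    consistent = [h for h in hidden if obs_of[h] == observed_agree]
    z = sum(prob[h] for h in consistent)               # Pr[observation | params]
    counts = dict.fromkeys(faces, 0.0)
    for h in consistent:
        w = prob[h] / z                                # posterior of this hidden outcome
        for face in h:                                 # 2. count of die-toss results
            counts[face] += w
    return counts

print(expected_counts(True, {"a": 0.2, "b": 0.2, "o": 0.6}))
# -> roughly {'a': 0.18, 'b': 0.18, 'o': 1.64}; the counts sum to 2 tosses
```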

4 The EM algorithm
Baum-Welch as an EM algorithm:
- An observed outcome: a series of signals X_1 ... X_L.
- Each observed outcome is a deterministic function of a series of die tosses.
- There are two dice for every state: one determining the next state and one determining the emitted signal.
- Each die may be tossed several times in each experiment.
- The objective of the E-step in Baum-Welch's algorithm is to determine the expected number of times each die fell on each of its faces.

5 Genotype statistics
Mendelian genetics:
- locus – a particular location on a chromosome (genome)
- Each locus has two copies, called alleles (one paternal and one maternal).
- Each copy has several relevant states – allele genotypes.
- The locus genotype is determined by the combined genotype of both copies.
- The locus genotype yields the phenotype (physical features).
We wish to estimate the distribution of all possible genotypes. Suppose we randomly sample N individuals and find that N_{s,t} of them have locus genotype s/t. The MLE is then the empirical frequency: θ_{s/t} = N_{s,t} / N.
However, sampling genotypes is costly, while sampling phenotypes is cheap.
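
As a minimal illustration of this MLE, the estimate is just the empirical frequency of each genotype. The genotype names and counts below are made up for illustration.

```python
# Hypothetical genotype counts N_{s,t} from a sample of N individuals.
counts = {"s/s": 120, "s/t": 230, "t/t": 50}

N = sum(counts.values())
mle = {genotype: n / N for genotype, n in counts.items()}   # MLE = N_{s,t} / N
print(mle)   # -> {'s/s': 0.3, 's/t': 0.575, 't/t': 0.125}
```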

6 The ABO locus
The ABO locus determines blood type. It has six possible genotypes {a/a, a/o, b/b, b/o, a/b, o/o}, which lead to four possible phenotypes: {A, B, AB, O}. We wish to estimate the proportions of the 6 genotypes in the population.
- Sampling a genotype means sequencing a genomic region (costly).
- Sampling a phenotype means checking the presence of antibodies (a simple blood test).
Problem: the phenotype does not reveal the genotype (in the case of A and B).
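
A small sketch of the genotype-to-phenotype map, as implied here and by the phenotype probabilities on the next slide; it makes the problem explicit: phenotypes A and B each hide two possible genotypes.

```python
# Deterministic genotype -> phenotype map for the ABO locus.
PHENOTYPE = {
    "a/a": "A", "a/o": "A",
    "b/b": "B", "b/o": "B",
    "a/b": "AB",
    "o/o": "O",
}

# Invert the map to see which genotypes are consistent with each phenotype.
consistent = {}
for genotype, phenotype in PHENOTYPE.items():
    consistent.setdefault(phenotype, []).append(genotype)
print(consistent)
# -> {'A': ['a/a', 'a/o'], 'B': ['b/b', 'b/o'], 'AB': ['a/b'], 'O': ['o/o']}
```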

7 The ABO locus
Problem: the phenotype doesn't reveal the genotype.
The probabilistic model (Hardy-Weinberg equilibrium), with model parameter set Θ = {θ_a, θ_b, θ_o}:
Allele genotypes are distributed i.i.d. w.p. θ_a, θ_b, θ_o, and determine the probabilities of the locus genotypes:
θ_{a/a} = θ_a², θ_{b/b} = θ_b², θ_{o/o} = θ_o²
θ_{a/b} = 2θ_aθ_b, θ_{a/o} = 2θ_aθ_o, θ_{b/o} = 2θ_bθ_o
This implies probabilities for the phenotypes:
Pr[P=A | Θ] = θ_{a/a} + θ_{a/o} = θ_a² + 2θ_aθ_o
Pr[P=B | Θ] = θ_{b/b} + θ_{b/o} = θ_b² + 2θ_bθ_o
Pr[P=AB | Θ] = θ_{a/b} = 2θ_aθ_b
Pr[P=O | Θ] = θ_{o/o} = θ_o²
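
A minimal sketch of these phenotype probabilities in Python; the function name is my own, and the parameters are assumed to be passed as a tuple (θ_a, θ_b, θ_o) summing to 1.

```python
# Phenotype probabilities under the Hardy-Weinberg model.
def phenotype_probs(theta):
    a, b, o = theta
    return {
        "A":  a * a + 2 * a * o,   # genotypes a/a and a/o
        "B":  b * b + 2 * b * o,   # genotypes b/b and b/o
        "AB": 2 * a * b,           # genotype a/b
        "O":  o * o,               # genotype o/o
    }

print(phenotype_probs((0.2, 0.2, 0.6)))
# -> approximately {'A': 0.28, 'B': 0.28, 'AB': 0.08, 'O': 0.36}; sums to 1
```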

8 Likelihood of phenotype data
Given a population phenotype sample, e.g. Data = {B, A, B, B, O, A, B, A, O, B, AB}, the sufficient statistics are the phenotype counts n_A, n_B, n_AB, n_O, and the likelihood of our parameter set Θ = {θ_a, θ_b, θ_o} is
L(Θ) = Pr[A|Θ]^n_A · Pr[B|Θ]^n_B · Pr[AB|Θ]^n_AB · Pr[O|Θ]^n_O
     = (θ_a² + 2θ_aθ_o)^n_A · (θ_b² + 2θ_bθ_o)^n_B · (2θ_aθ_b)^n_AB · (θ_o²)^n_O
The maximum of this function yields the MLE; we use EM to obtain it.
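
A sketch of the corresponding log-likelihood (working in log space to avoid underflow); the counts below are the ones used in the numeric example later in this tutorial.

```python
from math import log

counts = {"A": 100, "B": 200, "AB": 50, "O": 50}   # n_A, n_B, n_AB, n_O

def log_likelihood(theta, counts):
    a, b, o = theta
    probs = {"A": a * a + 2 * a * o, "B": b * b + 2 * b * o,
             "AB": 2 * a * b, "O": o * o}
    return sum(n * log(probs[p]) for p, n in counts.items())

print(log_likelihood((0.2, 0.2, 0.6), counts))   # log L(Theta) for one guess
```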

9 The EM algorithm
The EM setting:
- Four possible observed outcomes: {A, B, AB, O}.
- Each outcome is a function of two independent tosses (paternal and maternal) of the same three-sided die: {a/b/o}.
The EM algorithm:
- Choose some (arbitrary) assignment to the parameters θ_a, θ_b, θ_o.
- E-step: for each of the observed outcomes, determine the expected number of times (0-2) the die fell on each of {a, b, o}, as a function of the assumed parameters θ_a, θ_b, θ_o.
- M-step: use the sum of these expected counts to determine the new parameters θ_a', θ_b', θ_o'.

10 E-step calculations – gene counting
1. Define the set of all possible hidden outcomes: the results of the die tosses, i.e. the genotype.
2. For each hidden outcome determine: its probability, the observed outcome of the "experiment" (the phenotype), and the count of die-toss results (gene count) it implies.

phenotype (observed) | genotype (die tosses) | probability | gene count (a, b, o)
A  | a/a | θ_a²     | (2, 0, 0)
A  | a/o | 2θ_aθ_o  | (1, 0, 1)
B  | b/b | θ_b²     | (0, 2, 0)
B  | b/o | 2θ_bθ_o  | (0, 1, 1)
AB | a/b | 2θ_aθ_b  | (1, 1, 0)
O  | o/o | θ_o²     | (0, 0, 2)

Expected count of die-toss results, e.g. for an individual with phenotype A:
E[#a] = (2·θ_a² + 1·2θ_aθ_o) / (θ_a² + 2θ_aθ_o),  E[#o] = (1·2θ_aθ_o) / (θ_a² + 2θ_aθ_o)
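
A sketch of the gene-counting E-step driven directly by this table; the dictionary layout and function names are my own.

```python
# Each genotype has a probability (a function of theta) and fixed (a, b, o) gene counts.
GENOTYPES = {
    "a/a": ("A", (2, 0, 0)), "a/o": ("A", (1, 0, 1)),
    "b/b": ("B", (0, 2, 0)), "b/o": ("B", (0, 1, 1)),
    "a/b": ("AB", (1, 1, 0)), "o/o": ("O", (0, 0, 2)),
}

def genotype_prob(g, theta):
    a, b, o = theta
    p = {"a": a, "b": b, "o": o}
    x, y = g.split("/")
    return p[x] * p[y] if x == y else 2 * p[x] * p[y]

def expected_gene_counts(phenotype, theta):
    """Expected (a, b, o) die-toss counts for one individual with this phenotype."""
    consistent = [g for g, (ph, _) in GENOTYPES.items() if ph == phenotype]
    z = sum(genotype_prob(g, theta) for g in consistent)        # Pr[phenotype]
    counts = [0.0, 0.0, 0.0]
    for g in consistent:
        w = genotype_prob(g, theta) / z                         # posterior of the genotype
        for i, c in enumerate(GENOTYPES[g][1]):
            counts[i] += w * c
    return tuple(counts)

print(expected_gene_counts("A", (0.2, 0.2, 0.6)))
# -> approximately (1.14, 0.0, 0.86): an A individual carries on average
#    ~1.14 copies of allele a and ~0.86 copies of allele o
```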

11 EM algorithm for the ABO locus - summary
Sufficient statistics: n_A, n_B, n_AB, n_O (with N = n_A + n_B + n_AB + n_O sampled individuals)
E-step (expected allele counts under the current Θ):
E[#a] = n_A·(2θ_a² + 2θ_aθ_o)/(θ_a² + 2θ_aθ_o) + n_AB
E[#b] = n_B·(2θ_b² + 2θ_bθ_o)/(θ_b² + 2θ_bθ_o) + n_AB
E[#o] = n_A·2θ_aθ_o/(θ_a² + 2θ_aθ_o) + n_B·2θ_bθ_o/(θ_b² + 2θ_bθ_o) + 2n_O
M-step: θ_a' = E[#a]/2N, θ_b' = E[#b]/2N, θ_o' = E[#o]/2N
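
A sketch of one full EM iteration in this closed form (E-step followed by the M-step normalization); the function name and interface are my own.

```python
def em_iteration(theta, counts):
    a, b, o = theta
    pA = a * a + 2 * a * o                       # Pr[A]
    pB = b * b + 2 * b * o                       # Pr[B]
    # E-step: expected total number of a/b/o allele (die-toss) results
    exp_a = counts["A"] * (2 * a * a + 2 * a * o) / pA + counts["AB"]
    exp_b = counts["B"] * (2 * b * b + 2 * b * o) / pB + counts["AB"]
    exp_o = counts["A"] * 2 * a * o / pA + counts["B"] * 2 * b * o / pB + 2 * counts["O"]
    # M-step: renormalize (each of the N individuals contributes 2 allele copies)
    two_n = 2 * sum(counts.values())
    return exp_a / two_n, exp_b / two_n, exp_o / two_n

counts = {"A": 100, "B": 200, "AB": 50, "O": 50}
print(em_iteration((0.2, 0.2, 0.6), counts))
# -> roughly (0.205, 0.348, 0.446), matching the numeric example on the next slides
#    up to rounding
```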

12 A numeric example
Sufficient statistics n_A, n_B, n_AB, n_O (sampled phenotype counts):
phenotype | #people
A  | 100
B  | 200
AB | 50
O  | 50
We start with an initial guess for θ_a, θ_b, θ_o: Θ_0 = {0.2, 0.2, 0.6}

13 A numeric example - execution of EM
Data: n_A = 100, n_B = 200, n_AB = 50, n_O = 50
1st iteration, starting from Θ_0 = {0.2, 0.2, 0.6}:
E-step: E[#a] ≈ 164.3, E[#b] ≈ 278.6, E[#o] ≈ 357.1, summing to 800 = 2N
M-step: Θ_1 = {0.205, 0.348, 0.447}

14 A numeric example - execution of EM
Data: n_A = 100, n_B = 200, n_AB = 50, n_O = 50
2nd iteration, starting from Θ_1 = {0.205, 0.348, 0.447}:
E-step: E[#a] ≈ 168.7, E[#b] ≈ 306.0, E[#o] ≈ 325.3, summing to 800 = 2N
M-step: Θ_2 = {0.211, 0.383, 0.406}
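
A short, self-contained run that reproduces this numeric example, assuming the same closed-form update as in the summary sketch above.

```python
def em_iteration(theta, counts):
    a, b, o = theta
    pA, pB = a * a + 2 * a * o, b * b + 2 * b * o
    exp_a = counts["A"] * (2 * a * a + 2 * a * o) / pA + counts["AB"]
    exp_b = counts["B"] * (2 * b * b + 2 * b * o) / pB + counts["AB"]
    exp_o = counts["A"] * 2 * a * o / pA + counts["B"] * 2 * b * o / pB + 2 * counts["O"]
    two_n = 2 * sum(counts.values())
    return exp_a / two_n, exp_b / two_n, exp_o / two_n

counts = {"A": 100, "B": 200, "AB": 50, "O": 50}
theta = (0.2, 0.2, 0.6)                      # Theta_0
for i in range(1, 3):
    theta = em_iteration(theta, counts)
    print(f"theta_{i} =", tuple(round(t, 3) for t in theta))
# prints roughly theta_1 = (0.205, 0.348, 0.446) and theta_2 = (0.211, 0.383, 0.406),
# matching the slides up to rounding in the last digit
```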

15 EM algorithm for the ABO locus - summary
Sufficient statistics: n_A, n_B, n_AB, n_O
E-step: compute the expected allele counts E[#a], E[#b], E[#o] under the current parameters (as on slide 11).
M-step: set θ_a, θ_b, θ_o to the expected counts divided by 2N.

16 EM algorithm for the ABO locus - summary
Iteration update formula (E-step and M-step combined), with sufficient statistics n_A, n_B, n_AB, n_O and N = n_A + n_B + n_AB + n_O:
θ_a ← [n_A·(2θ_a² + 2θ_aθ_o)/(θ_a² + 2θ_aθ_o) + n_AB] / 2N
θ_b ← [n_B·(2θ_b² + 2θ_bθ_o)/(θ_b² + 2θ_bθ_o) + n_AB] / 2N
θ_o ← [n_A·2θ_aθ_o/(θ_a² + 2θ_aθ_o) + n_B·2θ_bθ_o/(θ_b² + 2θ_bθ_o) + 2n_O] / 2N

17 EM algorithm – ABO example
Data: A 100, B 200, AB 50, O 50
[Plot: the learned values of θ_a, θ_b, θ_o as a function of the learning iteration; the values 0.20, 0.38, 0.42 appear on the plot.]

18 EM algorithm – ABO example
Data: A 100, B 200, AB 50, O 50
[The same plot of θ_a, θ_b, θ_o vs. the learning iteration, annotated: good convergence (maybe).]

