Basic Model For Genetic Linkage Analysis, Lecture #5. Prepared by Dan Geiger.


Basic Model For Genetic Linkage Analysis, Lecture #5. Prepared by Dan Geiger.

2 Using the Maximum Likelihood Approach. The probability of the pedigree data, Pr(data | θ), is a function of the known and unknown recombination fractions, denoted collectively by θ. How can we construct this likelihood function? The maximum likelihood approach is to seek the value of θ which maximizes the likelihood function Pr(data | θ). This is the ML estimate.
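As an illustration of the ML step (not from the slides), here is a minimal sketch of estimating θ by a grid search over a supplied likelihood function; pedigree_likelihood is a hypothetical placeholder for the computation constructed in the following slides.

    import numpy as np

    def ml_estimate(pedigree_likelihood, grid=np.linspace(0.0, 0.5, 501)):
        """Grid-search ML estimate of the recombination fraction theta.

        pedigree_likelihood(theta) is assumed to return Pr(data | theta);
        how to compute it is the subject of the rest of the lecture.
        """
        values = np.array([pedigree_likelihood(t) for t in grid])
        return grid[np.argmax(values)], values.max()

    # Usage with a toy, made-up likelihood, just to show the interface:
    theta_hat, best = ml_estimate(lambda t: (1 - t) ** 8 * t ** 2)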

3 Constructing the Likelihood Function. First, we need to determine the variables that describe the problem. There are many possible choices; some variables we can observe and some we cannot. L_ijm = maternal allele at locus i of person j; the values of this variable are the possible alleles l_i at locus i. L_ijf = paternal allele at locus i of person j; its values are the possible alleles l_i at locus i (same as for L_ijm). X_ij = unordered allele pair at locus i of person j; its values are pairs of i-th-locus alleles (l_i, l'_i). As a starting point, we assume that the data consist of an assignment to a subset of the variables {X_ij}; in other words, some (or all) persons are genotyped at some (or all) loci.

4 What are the relationships among the variables for a specific individual? [Figure: a small network with L_11m (maternal allele at locus 1 of person 1) and L_11f (paternal allele at locus 1 of person 1) as parents of X_11 (unordered allele pair at locus 1 of person 1 = the data).] P(L_11m = a) is the frequency of allele a. We use lower-case letters for states, writing, in short, P(l_11m). P(x_11 | l_11m, l_11f) = 0 or 1, depending on consistency.

5 What are the relationships among the variables across individuals? [Figure: the mother's variables L_11m, L_11f, X_11 and the father's variables L_12m, L_12f, X_12, connected to the offspring's alleles L_13m and L_13f.] First attempt, correct but not efficient as we shall see: P(l_13m | l_11m, l_11f) = 1/2 if l_13m = l_11m or l_13m = l_11f, and P(l_13m | l_11m, l_11f) = 0 otherwise.

6 Probabilistic model for two loci. [Figure: two copies of the network, one for locus 1 (L_11m, L_11f, X_11, L_12m, L_12f, X_12, L_13m, L_13f, X_13) and one for locus 2 (the corresponding L_2jm, L_2jf, X_2j variables).] L_23m depends on whether L_13m got the value from L_11m or L_11f, whether a recombination occurred, and on the values of L_21m and L_21f. This is quite complex.

7 Adding a selector variable. [Figure: the node S_13m (selector of the maternal allele at locus 1 of person 3) is added as a parent of L_13m (maternal allele at locus 1 of person 3, the offspring), alongside L_11m and L_11f.] Selector variables S_ijm are 0 or 1, depending on which of the mother's two alleles is transmitted to person j at locus i. P(s_13m) = 1/2. P(l_13m | l_11m, l_11f, S_13m = 0) = 1 if l_13m = l_11m. P(l_13m | l_11m, l_11f, S_13m = 1) = 1 if l_13m = l_11f. P(l_13m | l_11m, l_11f, s_13m) = 0 otherwise.
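A minimal sketch (not from the slides) of the transmission table P(l_13m | l_11m, l_11f, s_13m) defined above; the function name and allele encoding are illustrative.

    def transmission_prob(l_child, l_m, l_f, s):
        """P(child's maternal allele | mother's alleles l_m, l_f and selector s).

        s = 0: the mother's own maternal allele (l_m) is transmitted;
        s = 1: her paternal allele (l_f) is transmitted.
        """
        transmitted = l_m if s == 0 else l_f
        return 1.0 if l_child == transmitted else 0.0

    # transmission_prob('a', 'a', 'A', s=0) -> 1.0 ; transmission_prob('a', 'a', 'A', s=1) -> 0.0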

8 Probabilistic model for two loci. [Figure: the two-locus network of slide 6, now with selector variables S_13m and S_13f added at locus 1 and S_23m and S_23f added at locus 2 as parents of the offspring's allele variables.]

9 Probabilistic Model for Recombination. [Figure: the two-locus network, with edges from S_13m to S_23m and from S_13f to S_23f, so the selectors at locus 2 depend on the selectors at locus 1.] θ2 is the recombination fraction between loci 1 and 2.

10 Constructing the likelihood function I. [Figure: the single-locus fragment L_11m, L_11f, X_11, S_13m, L_13m; X_11 is the observed variable, all other variables are not observed (hidden).] Joint probability: P(l_11m, l_11f, x_11, s_13m, l_13m) = P(l_11m) P(l_11f) P(x_11 | l_11m, l_11f) P(s_13m) P(l_13m | s_13m, l_11m, l_11f). Probability of the data (sum over all states of all hidden variables): Prob(data) = P(x_11) = Σ_{l_11m} Σ_{l_11f} Σ_{s_13m} Σ_{l_13m} P(l_11m, l_11f, x_11, s_13m, l_13m).
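A minimal brute-force sketch (not from the slides) of this summation for a single observed genotype x_11, assuming a biallelic locus with made-up allele frequencies; the helper names are illustrative.

    from itertools import product

    ALLELES = ['A', 'a']
    FREQ = {'A': 0.7, 'a': 0.3}                # made-up allele frequencies

    def consistent(x, l_m, l_f):
        """P(x | l_m, l_f): 1 if the unordered pair x matches the ordered alleles."""
        return 1.0 if sorted(x) == sorted((l_m, l_f)) else 0.0

    def transmitted(l_m, l_f, s):
        """Allele passed on: the mother's maternal allele if s = 0, her paternal allele if s = 1."""
        return l_m if s == 0 else l_f

    def prob_x11(x11):
        """P(x_11): sum the joint over the hidden variables l_11m, l_11f, s_13m, l_13m."""
        total = 0.0
        for l_11m, l_11f, l_13m in product(ALLELES, repeat=3):
            for s_13m in (0, 1):
                total += (FREQ[l_11m] * FREQ[l_11f]
                          * consistent(x11, l_11m, l_11f)
                          * 0.5                                             # P(s_13m)
                          * (1.0 if l_13m == transmitted(l_11m, l_11f, s_13m) else 0.0))
        return total

    # prob_x11(('A', 'a')) -> 0.42  (= 2 * 0.7 * 0.3, as expected for a heterozygote)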

11 Constructing the likelihood function II. Joint probability (product over all local probability tables): P(l_11m, l_11f, x_11, l_12m, l_12f, x_12, l_13m, l_13f, x_13, l_21m, l_21f, x_21, l_22m, l_22f, x_22, l_23m, l_23f, x_23, s_13m, s_13f, s_23m, s_23f | θ2) = P(l_11m) P(l_11f) P(x_11 | l_11m, l_11f) … P(s_13m) P(s_13f) P(s_23m | s_13m, θ2) P(s_23f | s_13f, θ2). Probability of the data (sum over all states of all hidden variables): Prob(data | θ2) = P(x_11, x_12, x_13, x_21, x_22, x_23) = Σ_{l_11m, l_11f, …, s_23f} [ P(l_11m) P(l_11f) P(x_11 | l_11m, l_11f) … P(s_13m) P(s_13f) P(s_23m | s_13m, θ2) P(s_23f | s_13f, θ2) ]. The result is a function of the recombination fraction. The ML estimate is the θ2 value that maximizes this function.

12 Modeling Phenotypes I. [Figure: a phenotype node Y_11 is added as a child of the genotype node X_11 in the network of slide 7.] Phenotype variables Y_ij are 0 or 1, depending on whether a phenotypic trait associated with locus i of person j is observed, e.g., sick versus healthy. For example, a model of a perfectly recessive disease yields the penetrance probabilities: P(y_11 = sick | X_11 = (a,a)) = 1, P(y_11 = sick | X_11 = (A,a)) = 0, P(y_11 = sick | X_11 = (A,A)) = 0.
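A minimal sketch (not from the slides) of the penetrance table above, for a fully penetrant recessive disease with disease allele 'a'; the function name is illustrative.

    def penetrance(phenotype, genotype):
        """P(phenotype | genotype) for a fully penetrant recessive disease.

        genotype is an unordered pair such as ('A', 'a'); 'a' is the disease allele.
        """
        p_sick = 1.0 if tuple(sorted(genotype)) == ('a', 'a') else 0.0
        return p_sick if phenotype == 'sick' else 1.0 - p_sick

    # penetrance('sick', ('A', 'a')) -> 0.0 ; penetrance('healthy', ('A', 'a')) -> 1.0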

13 Modeling Phenotypes II. [Figure: the same network fragment, with Y_11 attached to X_11.] Note that in this model we assume the phenotype depends only on the alleles of one locus. Also, we did not model levels of sickness, and we did not model continuous phenotypic observations either.

14 Introducing a tentative disease locus. [Figure: the two-locus network of slide 9, with one locus serving as the disease locus (assume sick means x_ij = (a,a)) and the other as the marker locus.] The recombination fraction θ2 is unknown. Finding it can help determine whether a gene causing the disease lies in the vicinity of the marker locus.

15 Locus-by-Locus Summation Order. Sum over the locus-i variables before summing over the locus-(i+1) variables. Within a locus, sum over the allele variables (L_ijt, shown in orange on the original slide) before summing over the selector variables (S_ijt). This order yields a Hidden Markov Model (HMM). [Figure: the network fragment for locus i: S_i3m, S_i3f, L_i1m, L_i1f, L_i2m, L_i2f, L_i3m, L_i3f, X_i1, X_i2, X_i3.]

16 Hidden Markov Models in General. Application in communication: the message sent is (s_1,…,s_m) but we receive (r_1,…,r_m); compute the most likely message sent. Application in speech recognition: the word said is (s_1,…,s_m) but we recorded (r_1,…,r_m); compute the most likely word said. Application in genetic linkage analysis: to be discussed now. [Figure: the HMM graph, a chain of hidden states S_1, S_2, …, S_i, … with an observation R_i attached to each S_i.] The graph depicts the factorization: P(s_1,…,s_m, r_1,…,r_m) = P(s_1) P(r_1 | s_1) ∏_{i=2..m} P(s_i | s_{i-1}) P(r_i | s_i).

17 Hidden Markov Model in Our Case. [Figure: the HMM chain with hidden states S_1, S_2, …, S_i, … and observations X_1, X_2, …, X_i, ….] The compound variable S_i = (S_i,1,m, …, S_i,2n,f) is called the inheritance vector. It has 2^{2n} states, where n is the number of persons that have parents in the pedigree (the non-founders). The compound variable X_i is the data regarding locus i. To specify the HMM we need to write down the transition matrices from S_{i-1} to S_i and the matrices P(x_i | s_i). Note that these quantities have already been implicitly defined.

18 The Transition Matrix. Recall that P(s_23m | s_13m, θ2) = 1 − θ2 if s_23m = s_13m, and θ2 otherwise (and similarly for the paternal selector). Therefore, in our example, where we have one non-founder (n = 1), the transition probability table has size 4 × 4 = 2^{2n} × 2^{2n}, encoding the four options of recombination/non-recombination for the two parental meioses. It is the Kronecker product T(θ2) ⊗ T(θ2), where

    T(θ) = [ 1−θ    θ  ]
           [  θ    1−θ ]

For n non-founders, the transition matrix is the n-fold Kronecker product of this 4 × 4 matrix (equivalently, the 2n-fold Kronecker product of the 2 × 2 single-meiosis matrix T(θ)).
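A minimal NumPy sketch (not from the slides) of assembling the transition matrix over inheritance vectors as a Kronecker product; the function names are illustrative.

    import numpy as np

    def single_meiosis(theta):
        """2x2 matrix for one meiosis: stay with probability 1-theta, flip with probability theta."""
        return np.array([[1 - theta, theta],
                         [theta, 1 - theta]])

    def inheritance_transition(theta, n_nonfounders):
        """Transition matrix over inheritance vectors: Kronecker product over all 2n meioses."""
        T = np.array([[1.0]])
        for _ in range(2 * n_nonfounders):
            T = np.kron(T, single_meiosis(theta))
        return T

    # For one non-founder (two meioses) this is the 4x4 matrix described above:
    # inheritance_transition(0.1, 1).shape == (4, 4)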

19 Probability of the data in one locus given an inheritance vector. [Figure: the model for locus 2, with selectors S_23m and S_23f.] P(x_21, x_22, x_23 | s_23m, s_23f) = Σ_{l_21m, l_21f, l_22m, l_22f, l_23m, l_23f} P(l_21m) P(l_21f) P(l_22m) P(l_22f) P(x_21 | l_21m, l_21f) P(x_22 | l_22m, l_22f) P(x_23 | l_23m, l_23f) P(l_23m | l_21m, l_21f, s_23m) P(l_23f | l_22m, l_22f, s_23f). The last five terms are always zero or one, namely, indicator functions.
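A minimal brute-force sketch (not from the slides) of this emission probability for the mother/father/child example, assuming a biallelic marker with made-up allele frequencies and fully observed genotypes; an unobserved genotype could be handled by letting its consistency term return 1.

    from itertools import product

    ALLELES = ['A', 'a']
    FREQ = {'A': 0.7, 'a': 0.3}                       # made-up allele frequencies

    def consistent(x, l_m, l_f):
        """Indicator P(x | l_m, l_f): the observed unordered pair matches the alleles."""
        return 1.0 if sorted(x) == sorted((l_m, l_f)) else 0.0

    def emission(x_mother, x_father, x_child, s_m, s_f):
        """P(x_21, x_22, x_23 | s_23m, s_23f): sum over the founders' alleles at this locus."""
        total = 0.0
        for l1m, l1f, l2m, l2f in product(ALLELES, repeat=4):
            # The child's alleles are determined by the selectors (the indicator terms).
            c_m = l1m if s_m == 0 else l1f            # maternal allele of the child
            c_f = l2m if s_f == 0 else l2f            # paternal allele of the child
            total += (FREQ[l1m] * FREQ[l1f] * FREQ[l2m] * FREQ[l2f]
                      * consistent(x_mother, l1m, l1f)
                      * consistent(x_father, l2m, l2f)
                      * consistent(x_child, c_m, c_f))
        return total

    # e.g. emission(('A', 'a'), ('A', 'A'), ('A', 'A'), s_m=0, s_f=0)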

20 Posterior decoding. [Figure: the generic HMM chain H_1, H_2, …, H_{L-1}, H_L with observations X_1, …, X_L.] The standard query for an HMM is belief update (also called posterior decoding): 1. Compute the posterior belief in H_i (for a specific i) given the evidence {x_1,…,x_L}, for each of H_i's values h_i; namely, compute p(h_i | x_1,…,x_L). 2. Do the same computation for every H_i, but without repeating the first task L times. The solution is called the forward-backward algorithm.

21 Likelihood of evidence. To compute the likelihood of the evidence, P(x_1,…,x_L), which in our case depends on the recombination fractions, we will use either the forward or the backward algorithm, to be described now. [Figure: the generic HMM chain.]

22 Decomposing the computation. [Figure: the generic HMM chain.] P(x_1,…,x_L, h_i) = P(x_1,…,x_i, h_i) P(x_{i+1},…,x_L | x_1,…,x_i, h_i) = P(x_1,…,x_i, h_i) P(x_{i+1},…,x_L | h_i) ≡ f(h_i) b(h_i), where the second equality is due to Ind({x_{i+1},…,x_L}, {x_1,…,x_i} | H_i). Answer: P(h_i | x_1,…,x_L) = (1/K) P(x_1,…,x_L, h_i), where K = Σ_{h_i} P(x_1,…,x_L, h_i).

23 The forward algorithm. The task: compute f(h_i) = P(x_1,…,x_i, h_i) for i = 1,…,L (namely, considering the evidence up to time slot i). Basis step: P(x_1, h_1) = P(h_1) P(x_1 | h_1). Second step: P(x_1, x_2, h_2) = Σ_{h_1} P(x_1, h_1, h_2, x_2) = Σ_{h_1} P(x_1, h_1) P(h_2 | x_1, h_1) P(x_2 | x_1, h_1, h_2) = Σ_{h_1} P(x_1, h_1) P(h_2 | h_1) P(x_2 | h_2), where the last equality is due to conditional independence. Step i: P(x_1,…,x_i, h_i) = Σ_{h_{i-1}} P(x_1,…,x_{i-1}, h_{i-1}) P(h_i | h_{i-1}) P(x_i | h_i).
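A minimal NumPy sketch (not from the slides) of the forward recursion; init, trans, and emit are assumed arrays holding P(h_1), P(h_i | h_{i-1}), and P(x_i | h_i), respectively.

    import numpy as np

    def forward(init, trans, emit):
        """f[i, h] = P(x_1..x_{i+1}, H_{i+1} = h) for a chain of length L.

        init : (K,) initial distribution P(h_1)
        trans: (K, K) transition matrix, trans[g, h] = P(h_{i+1} = h | h_i = g)
        emit : (L, K) emission probabilities, emit[i, h] = P(x_{i+1} | h)
        """
        L, K = emit.shape
        f = np.zeros((L, K))
        f[0] = init * emit[0]                         # basis step
        for i in range(1, L):
            f[i] = (f[i - 1] @ trans) * emit[i]       # step i
        return f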

24 The backward algorithm. The task: compute b(h_i) = P(x_{i+1},…,x_L | h_i) for i = L−1,…,1 (namely, considering the evidence after time slot i). First step: b(h_{L-1}) = P(x_L | h_{L-1}) = Σ_{h_L} P(x_L, h_L | h_{L-1}) = Σ_{h_L} P(h_L | h_{L-1}) P(x_L | h_{L-1}, h_L) = Σ_{h_L} P(h_L | h_{L-1}) P(x_L | h_L), where the last equality is due to conditional independence. Step i: b(h_i) = P(x_{i+1},…,x_L | h_i) = Σ_{h_{i+1}} P(h_{i+1} | h_i) P(x_{i+1} | h_{i+1}) P(x_{i+2},…,x_L | h_{i+1}) = Σ_{h_{i+1}} P(h_{i+1} | h_i) P(x_{i+1} | h_{i+1}) b(h_{i+1}).
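A matching minimal sketch (not from the slides) of the backward recursion, with the same assumed trans and emit arrays.

    import numpy as np

    def backward(trans, emit):
        """b[i, h] = P(x_{i+2}..x_L | H_{i+1} = h), with b[L-1, :] = 1.

        trans: (K, K) transition matrix, trans[g, h] = P(h_{i+1} = h | h_i = g)
        emit : (L, K) emission probabilities, emit[i, h] = P(x_{i+1} | h)
        """
        L, K = emit.shape
        b = np.ones((L, K))
        for i in range(L - 2, -1, -1):
            b[i] = trans @ (emit[i + 1] * b[i + 1])   # step i
        return b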

25 The combined answer. [Figure: the generic HMM chain.] 1. To compute the posterior belief in H_i (for a specific i) given the evidence {x_1,…,x_L}: run the forward algorithm to compute f(h_i) = P(x_1,…,x_i, h_i), run the backward algorithm to compute b(h_i) = P(x_{i+1},…,x_L | h_i); the product f(h_i) b(h_i) is the answer (for every possible value h_i). 2. To compute the posterior belief for every H_i, simply run the forward and backward algorithms once, storing f(h_i) and b(h_i) for every i (and value h_i), and compute f(h_i) b(h_i) for every i.
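A minimal sketch (not from the slides) of combining the two tables into posteriors; f and b are assumed to be the (L, K) arrays produced by the forward and backward sketches above.

    import numpy as np

    def posteriors(f, b):
        """P(H_i = h | x_1..x_L) from the forward and backward tables.

        f, b : (L, K) arrays; their elementwise product is P(x_1..x_L, H_i = h).
        """
        joint = f * b
        return joint / joint.sum(axis=1, keepdims=True)   # normalize each position i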

26 Likelihood of evidence revisited. 1. To compute the likelihood of the evidence P(x_1,…,x_L), do one more step in the forward algorithm, namely, Σ_{h_L} f(h_L) = Σ_{h_L} P(x_1,…,x_L, h_L). 2. Alternatively, do one more step in the backward algorithm, namely, Σ_{h_1} b(h_1) P(h_1) P(x_1 | h_1) = Σ_{h_1} P(x_2,…,x_L | h_1) P(h_1) P(x_1 | h_1). [Figure: the generic HMM chain.]

27 Time and Space Complexity of the forward/backward algorithms. [Figure: the generic HMM chain.] Time complexity is linear in the length of the chain, provided the number of states of each variable is a constant. More precisely, the time complexity is O(k^2 L), where k is the maximum domain size of each variable and L is the length of the chain. The space complexity is also O(k^2 L). In our case, O(k^2 L) is really O(2^{4n} L). Next class we will see how to save on these computations using the special matrices we have. These savings have been implemented in GeneHunter, a leading linkage-analysis software package.

28 The Maximum A Posteriori query. [Figure: the generic HMM chain.] 1. Recall that the likelihood-of-evidence query is to compute P(x_1,…,x_L) = Σ_{(h_1,…,h_L)} P(x_1,…,x_L, h_1,…,h_L). 2. Now we wish to compute a similar quantity: P*(x_1,…,x_L) = max_{(h_1,…,h_L)} P(x_1,…,x_L, h_1,…,h_L). And, of course, we wish to find a MAP assignment (h_1*,…,h_L*) that brought about this maximum.

29 Example: revisiting the likelihood of evidence. [Figure: a three-step chain H_1, H_2, H_3 with observations X_1, X_2, X_3.] P(x_1, x_2, x_3) = Σ_{h_1} P(h_1) P(x_1 | h_1) Σ_{h_2} P(h_2 | h_1) P(x_2 | h_2) Σ_{h_3} P(h_3 | h_2) P(x_3 | h_3) = Σ_{h_1} P(h_1) P(x_1 | h_1) Σ_{h_2} b(h_2) P(h_2 | h_1) P(x_2 | h_2) = Σ_{h_1} b(h_1) P(h_1) P(x_1 | h_1).

30 Example: computing the MAP assignment. [Figure: the same three-step chain.] Replace the sums with taking maxima: maximum = max_{h_1} P(h_1) P(x_1 | h_1) max_{h_2} P(h_2 | h_1) P(x_2 | h_2) max_{h_3} P(h_3 | h_2) P(x_3 | h_3) = max_{h_1} P(h_1) P(x_1 | h_1) max_{h_2} b_{h_3}(h_2) P(h_2 | h_1) P(x_2 | h_2) = max_{h_1} b_{h_2}(h_1) P(h_1) P(x_1 | h_1) {finding the maximum}. Finding the MAP assignment: h_1* = argmax_{h_1} b_{h_2}(h_1) P(h_1) P(x_1 | h_1); h_2* = x*_{h_2}(h_1*); h_3* = x*_{h_3}(h_2*).

31 Viterbi's algorithm. [Figure: the generic HMM chain.] Backward phase (storing the best value as a function of the parent's values): b_{h_{L+1}}(h_L) = 1; for i = L−1 down to 1: b_{h_{i+1}}(h_i) = max_{h_{i+1}} P(h_{i+1} | h_i) P(x_{i+1} | h_{i+1}) b_{h_{i+2}}(h_{i+1}) and x*_{h_{i+1}}(h_i) = argmax_{h_{i+1}} P(h_{i+1} | h_i) P(x_{i+1} | h_{i+1}) b_{h_{i+2}}(h_{i+1}). Forward phase (tracing the MAP assignment): h_1* = argmax_{h_1} P(h_1) P(x_1 | h_1) b_{h_2}(h_1); for i = 1 to L−1: h_{i+1}* = x*_{h_{i+1}}(h_i*).
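A minimal NumPy sketch (not from the slides) of Viterbi following the slide's two phases (a backward maximization phase that stores argmax pointers, then a forward tracing phase); init, trans, and emit are assumed arrays as in the earlier sketches.

    import numpy as np

    def viterbi(init, trans, emit):
        """MAP assignment (h_1*, ..., h_L*) for an HMM.

        init : (K,) P(h_1); trans: (K, K) trans[g, h] = P(h_{i+1} = h | h_i = g)
        emit : (L, K) emit[i, h] = P(x_{i+1} | h)
        """
        L, K = emit.shape
        b = np.ones((L, K))                    # b[i, g]: best value of the chain after position i+1
        best_next = np.zeros((L, K), dtype=int)
        # Backward phase: store the best value (and its argmax) as a function of the parent's value.
        for i in range(L - 2, -1, -1):
            scores = trans * (emit[i + 1] * b[i + 1])   # scores[g, h] = P(h|g) P(x_{i+2}|h) b[i+1, h]
            b[i] = scores.max(axis=1)
            best_next[i] = scores.argmax(axis=1)
        # Forward phase: trace the MAP assignment.
        path = np.zeros(L, dtype=int)
        path[0] = np.argmax(init * emit[0] * b[0])
        for i in range(L - 1):
            path[i + 1] = best_next[i, path[i]]
        return path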