1 HMM in crosses and small pedigrees Lecture 8, Statistics 246, February 17, 2004.

2 Discrete-time Markov chains

Consider a sequence of random variables X_1, X_2, X_3, … with common finite state space S. This sequence forms a Markov chain if for all t, {X_1, …, X_{t-1}} and {X_{t+1}, X_{t+2}, …} are conditionally independent given X_t; equivalently, pr(X_t | X_{t-1}, X_{t-2}, …) = pr(X_t | X_{t-1}). The matrix p(i,j;t) = pr(X_t = j | X_{t-1} = i) is the transition matrix at step t. When p(i,j;t) = p(i,j) for all i and j, independent of t, we say the Markov chain is time-homogeneous, or has stationary transition probabilities. Many of the chains we'll be meeting will be inhomogeneous, and t will be in space, not time. There are plenty of good books on elementary Markov chain theory, Feller vol. 1 being my favourite, but they mostly concentrate on asymptotic behaviour in the homogeneous case. For the time being we don't need this, or much else from the general theory, apart from the fact that multi-step transition matrices are products of 1-step transition matrices (Exercise).
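The product fact at the end of the slide is easy to check numerically; a minimal sketch for a two-state chain, with made-up switch probabilities:

```python
# Multi-step transition matrices are products of 1-step matrices.
# Two-state chain; the step probabilities 0.1 and 0.2 are illustrative values.
def R(r):
    """2x2 transition matrix: stay with probability 1 - r, switch with r."""
    return [[1 - r, r], [r, 1 - r]]

def matmul(P, Q):
    """Multiply two square matrices given as lists of lists."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Inhomogeneous chain: switch probability 0.1 at step 1, 0.2 at step 2.
two_step = matmul(R(0.1), R(0.2))
# pr(state change over the two steps) = 0.9*0.2 + 0.1*0.8 = 0.26
```

Summing over the intermediate state is exactly what the matrix product does, which is why the fact holds for any number of steps.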

3 Hidden Markov Models (HMM)

If (X_t) is a Markov chain, and f is an arbitrary function on the state space, then (f(X_t)) will not in general be a Markov chain. Exercise: construct an example to demonstrate the last assertion. It is sometimes the case that associated with a Markov chain (X_t) is another process, (Y_t) say, whose terms are conditionally independent given the chain (X_t). This happens with so-called semi-Markov chains. Both functions of Markov chains and this last situation are covered by the following useful definition, based on the work of L. E. Baum and colleagues around 1970. A bivariate Markov chain (X_t, Y_t) is called a hidden Markov model if a) (X_t) is a Markov chain, and b) the distribution of Y_t given X_t, X_{t-1}, X_{t-2}, … depends only on X_t and X_{t-1}. In many examples this dependence is only on X_t, but in some it can extend beyond X_{t-1}, and/or include Y_{t-1}. Once you see how the defining property is used in the calculations, you will get an idea of the possible extensions. Exercise: explain why functions of Markov chains are always HMMs.

4 HMM, cont.

There are many suitable references on HMM, but two good ones for our purposes are the books by Timo Koski (Hidden Markov Models for Bioinformatics, 2001) and Durbin et al. (Biological Sequence Analysis, 1998). The simplest specification of an HMM is via the transition probabilities p(i,j;t) for the underlying Markov chain (X_t), and the emission probabilities for the observations (Y_t), where these are given by q(i,j,k;t) = pr(Y_t = k | X_{t-1} = i, X_t = j). We also need an initial distribution π for the chain: π(i) = pr(X_0 = i). In general we are not going to observe (X_t), which accounts for the word "hidden" in the name, but if we did, the probability of observing the state sequence x_0, x_1, x_2, …, x_n, and associated observations y_1, y_2, …, y_n, is

π(x_0) p(x_0, x_1; 1) q(x_0, x_1, y_1; 1) … p(x_{n-1}, x_n; n) q(x_{n-1}, x_n, y_n; n).
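The displayed product can be computed term by term; a minimal sketch, where the particular pi, p and q below are illustrative values, not from the lecture:

```python
# Joint probability of a hidden path and its observations, following
# pi(x_0) p(x_0,x_1;1) q(x_0,x_1,y_1;1) ... p(x_{n-1},x_n;n) q(x_{n-1},x_n,y_n;n).
def joint_prob(pi, p, q, x, y):
    """x = (x_0, ..., x_n) hidden states; y = (y_1, ..., y_n) observations."""
    prob = pi[x[0]]
    for t in range(1, len(x)):
        # one transition factor and one emission factor per step
        prob *= p(x[t - 1], x[t], t) * q(x[t - 1], x[t], y[t - 1], t)
    return prob

# Made-up two-state example; emission here depends only on the current state.
pi = {0: 0.5, 1: 0.5}
p = lambda i, j, t: 0.9 if i == j else 0.1
q = lambda i, j, k, t: 0.8 if k == j else 0.2
prob = joint_prob(pi, p, q, x=(0, 0, 1), y=(0, 1))  # 0.5 * (0.9*0.8) * (0.1*0.8)
```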

5 HMM in experimental crosses

We are going to consider the chromosomes of offspring from crosses of inbred strains A and B of mice. Suppose that we have n markers along a chromosome, #1 say, in their correct order 1, 2, …, n, with r_t being the recombination fraction between markers t and t+1. Consider the genotypes at these markers along chromosome 1 of an A × H backcross mouse. Each genotype will be either A = aa or H = ab (Exercise: check this). Under the assumption of no interference, the sequence of genotypes at markers 1, 2, …, n is a Markov chain with state space {A, H}, initial distribution (1/2, 1/2), and 2 × 2 transition probability matrix R(r_t) having diagonal entries 1 - r_t for no change (A → A, H → H), and off-diagonal entries r_t for change (A → H, H → A). This Markov chain represents the crossover process along the chromosome passed by one parent, say the mother. In an F_2 intercross, H × H, there is a crossover process in both F_1 parents. If we consider the offspring's possible ordered genotypes, that is, genotypes with known parental origin (also called known phase), then the sequence of ordered genotypes at markers 1, 2, …, n is also a Markov chain, with state space {a,b} × {a,b} = {aa, ab, ba, bb}, initial distribution (1/2, 1/2) ⊗ (1/2, 1/2) = (1/4, 1/4, 1/4, 1/4), and transition probability matrix P(t) = R(r_t) ⊗ R(r_t) between markers t and t+1.
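The backcross chain just described is easy to simulate; a sketch under no interference, with made-up recombination fractions:

```python
import random

# Simulate backcross genotypes at successive markers: marker 1 is A or H with
# probability 1/2 each, and the genotype switches between markers t and t+1
# with probability r[t]. The r values used below are illustrative.
def simulate_backcross(r, rng):
    g = [rng.choice("AH")]                         # initial distribution (1/2, 1/2)
    for rt in r:
        if rng.random() < rt:                      # recombination: A <-> H
            g.append("H" if g[-1] == "A" else "A")
        else:                                      # no recombination
            g.append(g[-1])
    return "".join(g)

rng = random.Random(0)
sims = [simulate_backcross([0.1, 0.2], rng) for _ in range(20000)]
# The fraction of recombinants in the first interval should be close to 0.1.
frac = sum(s[0] != s[1] for s in sims) / len(sims)
```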

6 HMM in experimental crosses, cont.

Here you need to know the notion of the tensor product of matrices, also known as the direct product or Kronecker product. This is defined in most books on matrices; my favourite is Bellman's. Exercise: my notation suggests the idea of the product of two independent Markov chains; define this notion carefully, and show that we get a Markov chain. Now, unlike in the backcross, the observed F_2 genotypes do not always tell us which parental strand had a recombination across an interval, and which didn't, so we cannot always reconstruct the ordered genotypes. With this 4-state Markov chain we have just 3 possible observed states, and we include some ambiguity states, giving us an observation space {A, H, B, C, D, -}, where C = {H, B} = not A, D = {A, H} = not B, and - = missing. The resulting joint chain-observation process is an HMM, with emission probabilities which in a more general form can be written as the array on the next slide. There ε is the error rate, which may in fact be marker-specific, though for simplicity that is not indicated by the notation.
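The Kronecker-product construction of the F_2 ordered-genotype transition matrix can be sketched with NumPy's `kron`; r = 0.1 is an illustrative value:

```python
import numpy as np

# P(t) = R(r_t) (x) R(r_t): the transition matrix of the product of the two
# independent parental (maternal and paternal) backcross chains.
def R(r):
    return np.array([[1 - r, r], [r, 1 - r]])

r = 0.1                        # illustrative recombination fraction
P = np.kron(R(r), R(r))        # 4x4 matrix on ordered genotypes {aa, ab, ba, bb}

# Each entry factorizes into the two parental transition probabilities, e.g.
# pr(aa -> ab) = pr(no maternal recombination) * pr(paternal recombination).
```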

7 F_2 HMM emission probabilities

[The emission probability array appeared as a slide image and is not reproduced in this transcript.] Note that the row entries in this array do not simply sum to 1, but the entries for mutually exclusive and exhaustive cases should: A, H and B, or A and C, etc.
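Since the array itself does not survive in the transcript, the following is only a simple error model consistent with the slide's sum rule; its details are an assumption, and the lecture's actual array may differ:

```python
# A simple emission model with error rate eps, consistent with the slide's sum
# rule: the entries for C = {H,B}, D = {A,H} and '-' are sums over the cases
# they contain. The eps/2 split between wrong calls is an assumption.
def emission_probs(eps):
    q = {}
    for s in "AHB":                                # true genotype class
        for o in "AHB":                            # fully observed genotype
            q[(s, o)] = 1 - eps if s == o else eps / 2
        q[(s, "C")] = q[(s, "H")] + q[(s, "B")]    # C = not A
        q[(s, "D")] = q[(s, "A")] + q[(s, "H")]    # D = not B
        q[(s, "-")] = 1.0                          # missing is uninformative
    return q
```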

8 Calculations with our HMM

For our F_2 intercross there are certain calculations we would like to do which the HMM formalism makes straightforward. In fact, they are all instances of calculations generally of interest with HMMs, and if the number of steps is not too large, and the state spaces not too big, there are neat algorithms for carrying them out which are worth knowing. Major problems: given the combined observations O of the genotypes Y = (Y_1, Y_2, …, Y_n) from many different F_2 offspring, transition matrices P(r_t), and emission probabilities q(·;t), t = 1, 2, …, n,

a) calculate the log-likelihood l(r | O) = log pr(O | r) for different parameter values and different marker orders;

b) calculate MLEs for the recombination fractions r by maximizing l(r | O);

and, for an individual offspring with genotype data Y,

c) probabilistically reconstruct the individual's unobserved ordered genotypes X = (X_1, X_2, …, X_n) by calculating argmax_X pr(X | Y, r).

We'll see the algorithms in due course. We might also wish to simulate from pr(X | Y, r).
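The slide defers the algorithms; as a preview, pr(Y | r) for a single offspring can be evaluated by the standard forward recursion. A minimal sketch, with emissions simplified to depend on the current state only, and all numbers illustrative rather than from the lecture:

```python
from itertools import product

# Forward recursion: computes pr(y_1, ..., y_n) in O(n |S|^2) time instead of
# summing over all |S|^n hidden paths. Emissions depend only on the current
# state here, a simplification of the slide's q(i, j, k; t).
def forward_likelihood(pi, trans, emit, obs):
    states = list(pi)
    alpha = {s: pi[s] * emit[(s, obs[0])] for s in states}
    for y in obs[1:]:
        alpha = {j: emit[(j, y)] * sum(alpha[i] * trans[(i, j)] for i in states)
                 for j in states}
    return sum(alpha.values())

pi = {0: 0.5, 1: 0.5}
trans = {(i, j): 0.9 if i == j else 0.1 for i in (0, 1) for j in (0, 1)}
emit = {(s, o): 0.8 if s == o else 0.2 for s in (0, 1) for o in (0, 1)}
obs = (0, 1, 1, 0)
lik = forward_likelihood(pi, trans, emit, obs)

# Brute-force check: sum the path probabilities over all 2^4 hidden paths.
brute = 0.0
for x in product((0, 1), repeat=len(obs)):
    p = pi[x[0]] * emit[(x[0], obs[0])]
    for t in range(1, len(obs)):
        p *= trans[(x[t - 1], x[t])] * emit[(x[t], obs[t])]
    brute += p
```

Summing log-likelihoods of this kind over offspring gives l(r | O), and maximizing it over r addresses problem b).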

9 Another problem: reconstructing haplotypes

The problem here is to reconstruct the children's haplotypes, as in the figure (not reproduced in this transcript), from marker data on both the children and the parents.

10 Some basic notions from pedigree analysis

Founders: individuals without parents in the pedigree. Non-founders: individuals with one or more parents in the pedigree. Calculations on pedigrees typically involve evaluating probabilities of data under specific hypotheses, perhaps going on to form likelihood ratios, or inferring unobserved genetic states, e.g. of genotypes or haplotypes. To get started we need (in addition to the pedigree) phenotypic data, e.g. on markers or disease states, and a full genetic model. This means population genotype or allele frequencies, transmission probabilities (probabilities of offspring haplotypes given parental haplotypes), and penetrances (probabilities of phenotypes given genotypes). Generally, genotype frequencies in populations are assumed to satisfy the Hardy-Weinberg equilibrium formulae and to be in linkage equilibrium across loci, and no interference is assumed in recombination. All of these are simplifications which have been found to be generally satisfactory.

11 Inheritance vectors

We plan to view our (small) pedigree as one large HMM. The states of the Markov chain along each chromosome are what we term inheritance vectors, which have a component for every meiosis in the pedigree, and entries 0 or 1. So if we have n non-founders, our inheritance vectors are binary vectors of length 2n. Each non-founder has 2 components of the inheritance vector assigned to them: one for their maternal and one for their paternal meiosis. At any locus on a chromosome, the entry in the inheritance vector for a non-founder is 0 if the parental variant passed on at that locus was grandmaternal, and 1 otherwise. For individuals whose parents are founders, this definition is incomplete, but we will make a choice below. Before doing so, let me note that slightly different definitions of inheritance vector appear in the literature, and in Lange's excellent book they are replaced by what he terms descent graphs. Clearly the idea is the same: given an inheritance vector and allelic assignments for the founders, we can follow the descent of the alleles down the pedigree using our vector, and it is easy to envisage its graphical representation. Consider a two-parent, two-child nuclear family, and suppose that the parents are 1/2 and 3/4, while the first child is 1/3 and the second is 2/4. Then the inheritance vector is of length 4, v = (v_gm, v_gp, v_bm, v_bp), where gm represents the girl's maternal meiosis, gp her paternal meiosis, and so on.
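Following allele descent with an inheritance vector in this nuclear family can be sketched as below; the founder-phase convention (bit 0 transmits the first-listed allele) is one way of making the arbitrary choice the slide alludes to:

```python
# Follow allele descent for the slide's family: mother 1/2, father 3/4.
# Founder-phase convention (an assumption standing in for the "choice below"):
# meiosis bit 0 transmits the founder's first-listed allele, bit 1 the second.
mother, father = (1, 2), (3, 4)

def child_alleles(v_maternal, v_paternal):
    """Alleles a child receives, given its two meiosis bits."""
    return (mother[v_maternal], father[v_paternal])

# With v = (v_gm, v_gp, v_bm, v_bp) = (0, 0, 1, 1), the girl is 1/3 and the
# boy is 2/4, matching the genotypes in the slide's example.
girl = child_alleles(0, 0)
boy = child_alleles(1, 1)
```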

12 Exercises and references

Exercise 1. Figure out whether the sequence of observed states (i.e. A, H, or B) in an F_2 intercross is a Markov chain. (I asserted that collapsing the state space of a Markov chain will not, in general, lead to another Markov chain. But sometimes it does. Does it in this case?)

Exercise 2. Suppose that 1, 2 and 3 are markers along a backcross mouse's chromosome, in that order, and that X_t denotes its genotype at marker t. Prove that (contrary to my assertion in class) in general (X_1, X_2, X_3) will not be Markov in the incorrect order, e.g. that in general pr(X_2 = H | X_1 = A, X_3 = H) ≠ pr(X_2 = H | X_1 = A), which would be true if (X_3, X_1, X_2) were Markov in that order (3-1-2).

References. As already mentioned, Lange's book is a good reference for this topic applied to genetics. Closely related material can be found in my old class notes on my web site; see my Stat 260 for 1998, esp. weeks 3, 7 and 8, and weeks 3, 6 and 7 of 2000.
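A quick numerical check of the inequality in Exercise 2 (not a proof), using made-up recombination fractions for the backcross chain:

```python
# Exercise 2, numerically: r1 and r2 are illustrative recombination fractions
# for the intervals between markers 1-2 and 2-3 of the backcross chain.
r1, r2 = 0.1, 0.1

p_given_x1 = r1                         # pr(X2 = H | X1 = A), from R(r1)

# pr(X2 = H | X1 = A, X3 = H): condition on both flanking markers.
num = r1 * (1 - r2)                     # path A -> H -> H
den = r1 * (1 - r2) + (1 - r1) * r2     # plus path A -> A -> H
p_given_x1_x3 = num / den               # = 0.5 here, far from r1 = 0.1

# The two differ, so conditioning on X3 changes the answer and (X3, X1, X2)
# is not Markov in the order 3-1-2.
```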