CASE STUDY: Genetic Linkage Analysis via Bayesian Networks


CASE STUDY: Genetic Linkage Analysis via Bayesian Networks

[Pedigree figure: five individuals; a marker with alleles A1/A2 typed alongside a speculated disease locus with alleles H (healthy) / D (affected); the phase H | D, A2 | A2 is inferred and a recombinant is identified.]

We speculate a locus with alleles H (Healthy) / D (affected). If the expected number of recombinants is low (close to zero), then the speculated locus and the marker are tentatively physically close.

The Variables Involved
- Lijm = maternal allele at locus i of person j. The values of this variable are the possible alleles li at locus i.
- Lijf = paternal allele at locus i of person j. The values are the same as for Lijm.
- Xij = unordered allele pair at locus i of person j. The values are pairs of ith-locus alleles (li, l'i): "the genotype".
- Yj = person j is affected / not affected: "the phenotype".
- Sijm = a binary variable in {0,1} that determines which maternal allele is received from the mother. Similarly, Sijf = a binary variable in {0,1} that determines which paternal allele is received from the father.
It remains to specify the joint distribution that governs these variables. Bayesian networks turn out to be a perfect choice.
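
For concreteness, here is a minimal sketch of how these variables could be indexed in code. The pedigree, names, and dictionary scheme are made up for this illustration (not Superlink's actual data structures):

```python
# Index the network variables of a small pedigree by (locus, person).
n_loci = 3
persons = ["mother", "father", "child"]
founders = {"mother", "father"}          # persons without parents in the pedigree

variables = []
for i in range(n_loci):
    for j in persons:
        variables.append(("L", i, j, "m"))      # L_ijm: maternal allele
        variables.append(("L", i, j, "f"))      # L_ijf: paternal allele
        variables.append(("X", i, j))           # X_ij: unordered allele pair (genotype)
        if j not in founders:                   # selectors exist only for non-founders
            variables.append(("S", i, j, "m"))  # S_ijm: which maternal allele was passed
            variables.append(("S", i, j, "f"))  # S_ijf: which paternal allele was passed
variables += [("Y", j) for j in persons]        # Y_j: phenotype

print(len(variables))  # 3 loci * (3 persons * 3 vars + 2 selectors) + 3 phenotypes = 36
```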

The Bayesian network for Linkage

[Network figure: one slice per locus (loci 1-4; locus 2 is the disease locus). For each locus i: parental allele variables Li1m/Li1f and Li2m/Li2f, child allele variables Li3m/Li3f with selector variables Si3m/Si3f, and genotype variables Xi1, Xi2, Xi3; the phenotype variables Y1, Y2, Y3 attach at the disease locus.]

This network depicts the qualitative relations between the variables. We have already specified the local conditional probability tables.

Details regarding recombination

[Network figure: the slices for loci 1 and 2, with variables L11m/L11f, L12m/L12f, L13m/L13f, S13m/S13f, X11, X12, X13 at locus 1 and L21m/L21f, L22m/L22f, L23m/L23f, S23m/S23f, X21, X22, X23 at locus 2, plus phenotypes Y1, Y2, Y3. The selectors of locus 2 depend on the selectors of locus 1.]

θ is the recombination fraction between loci 1 and 2.
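
Concretely, θ enters the model through the conditional probability tables of the selector variables. A sketch of the standard parameterization consistent with this network (written for the maternal selector; the paternal case is analogous):

```latex
% At the first locus the selector is a fair coin:
P(S_{1jm} = 0) = P(S_{1jm} = 1) = \tfrac{1}{2}
% Between adjacent loci the selector flips with probability \theta_i:
P(S_{i+1,jm} = s' \mid S_{ijm} = s) =
  \begin{cases} 1 - \theta_i & \text{if } s' = s \text{ (no recombination)} \\
                \theta_i     & \text{if } s' \neq s \text{ (recombination)} \end{cases}
```

Here θi denotes the recombination fraction between loci i and i+1.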

Details regarding the Loci
P(L11m = a) is the frequency of allele a. X11 is an unordered allele pair at locus 1 of person 1: "the data". P(x11 | l11m, l11f) = 0 or 1, depending on consistency. The phenotype variables Yj are 0 or 1 (e.g., affected or not affected) and are connected to the Xij variables (only at the disease locus). For example, a model of a perfectly recessive disease yields the penetrance probabilities:
P(Y1 = sick | X11 = (a,a)) = 1
P(Y1 = sick | X11 = (A,a)) = 0
P(Y1 = sick | X11 = (A,A)) = 0
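
A minimal sketch of these local tables in code (the allele names and frequencies are hypothetical, for illustration only):

```python
# Founder prior: P(L11m = a) is the population frequency of allele a.
allele_freq = {"A": 0.7, "a": 0.3}   # hypothetical frequencies

def p_genotype(x, l_m, l_f):
    """P(X = x | L_m, L_f): 1 if the unordered pair x is consistent, else 0."""
    return 1.0 if sorted(x) == sorted((l_m, l_f)) else 0.0

def p_sick(genotype):
    """Penetrance P(Y = sick | X) for a perfectly recessive disease ('a' = disease allele)."""
    return 1.0 if sorted(genotype) == ["a", "a"] else 0.0

print(p_genotype(("A", "a"), "a", "A"))   # 1.0: consistent with the ordered alleles
print(p_sick(("a", "a")))                 # 1.0: homozygous carriers are affected
```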

SUPERLINK
Stage 1: each pedigree is translated into a Bayesian network.
Stage 2: value elimination is performed on each pedigree (i.e., some of the impossible values of the variables of the network are eliminated).
Stage 3: an elimination order for the variables is determined, according to some heuristic.
Stage 4: the likelihood of the pedigrees given the θ values is calculated using variable elimination, according to the elimination order determined in stage 3. Allele recoding and special matrix multiplication are used.

Comparing to the HMM model

[Figure: a hidden Markov chain S1 → S2 → … → Si → Si+1 with observations X1, …, Xi+1, and Yi at the disease locus.]

The compounded variable Si = (Si,1,m, …, Si,n,f) is called the inheritance vector. It has 2^(2n) states, where n is the number of persons that have parents in the pedigree (non-founders). The compounded variable Xi collects the genotype variables at locus i: the data regarding locus i. Similarly, for the disease locus we use Yi.

REMARK: The HMM approach is equivalent to the Bayesian network approach, provided we sum out the variables locus after locus, say from left to right.
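
Under this equivalence, summing out locus after locus is exactly the HMM forward algorithm. A sketch of the recursion in the slide's notation (si ranges over the 2^(2n) inheritance vectors, L is the number of loci):

```latex
\alpha_1(s_1) = P(s_1)\, P(x_1 \mid s_1)
\alpha_{i+1}(s_{i+1}) = P(x_{i+1} \mid s_{i+1}) \sum_{s_i} P(s_{i+1} \mid s_i)\, \alpha_i(s_i)
P(\text{data}) = \sum_{s_L} \alpha_L(s_L)
```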

Experiment A (V1.0)

[Runtime plot: same topology (57 people, no loops), increasing number of loci (each with 4-5 alleles); run time in seconds. Elimination orders compared: general, person-by-person, and locus-by-locus (HMM). The person-by-person and locus-by-locus orders exceed 100 hours or run out of memory; the pedigree is too big for Genehunter.]

Experiment C (V1.0)

[Runtime plot: same topology (5 people, no loops), increasing number of loci (each with 3-6 alleles); run time in seconds, by order type and software. Some runs end with out-of-memory or bus errors.]

Some options for improving efficiency:
- Multiplying special probability tables efficiently.
- Grouping alleles together and removing inconsistent alleles.
- Optimizing the elimination order of variables in a Bayesian network.
- Performing approximate calculations of the likelihood.

Standard usage of linkage
There are usually 5-15 markers. 20-30% of the persons in large pedigrees are genotyped (namely, their Xij is measured). For each genotyped person, about 90% of the loci are measured correctly. The recombination fraction between every two loci is known from previous studies (available genetic maps). The user adds a locus called the "disease locus" and places it between two markers i and i+1. The recombination fractions θ' between the disease locus and marker i, and θ'' between the disease locus and marker i+1, are the unknown parameters being estimated using the likelihood function. This computation is done for every gap between the given markers on the map. The MLE hints at the whereabouts of a single gene causing the disease (if a single one exists).
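
A sketch of this scan over one gap. The pedigree likelihood itself is a stand-in here (a toy function; in practice it would be the variable-elimination computation described above):

```python
import numpy as np

def scan_gap(likelihood_fn, n_grid=51):
    """Grid-search MLE of (theta', theta'') for a disease locus placed
    between two flanking markers."""
    best = (-np.inf, None, None)
    for t1 in np.linspace(0.0, 0.5, n_grid):      # theta' to marker i
        for t2 in np.linspace(0.0, 0.5, n_grid):  # theta'' to marker i+1
            ll = likelihood_fn(t1, t2)
            if ll > best[0]:
                best = (ll, t1, t2)
    return best

# Toy stand-in log-likelihood, peaked near theta' = 0.1, theta'' = 0.2:
toy_loglik = lambda t1, t2: -((t1 - 0.1) ** 2 + (t2 - 0.2) ** 2)
print(scan_gap(toy_loglik))   # approx. (0.0, 0.1, 0.2)
```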

Relation to Treewidth
The unconstrained elimination problem reduces to finding treewidth if:
- the weight of each vertex is constant, and
- the cost function is the maximum (rather than the sum) of the clique sizes created during elimination.
Finding the treewidth of a graph is known to be NP-complete (Arnborg et al., 1987). When no edges are added, the elimination sequence is perfect and the graph is chordal.
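
Because the problem is NP-complete, elimination orders are chosen greedily in practice. A minimal sketch of the common min-fill heuristic (an illustrative example; not necessarily the heuristic used by Superlink):

```python
def min_fill_order(adj):
    """Greedy elimination order: repeatedly eliminate the vertex whose
    elimination adds the fewest fill-in edges. adj maps vertex -> set of neighbors."""
    adj = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    order = []
    while adj:
        def fill_in(v):
            ns = list(adj[v])
            return sum(ns[j] not in adj[ns[i]]
                       for i in range(len(ns)) for j in range(i + 1, len(ns)))
        v = min(adj, key=fill_in)
        ns = list(adj[v])
        for i in range(len(ns)):                  # turn v's neighborhood into a clique
            for j in range(i + 1, len(ns)):
                adj[ns[i]].add(ns[j])
                adj[ns[j]].add(ns[i])
        for u in ns:                              # remove v from the graph
            adj[u].discard(v)
        del adj[v]
        order.append(v)
    return order

# A 4-cycle: eliminating any vertex adds exactly one fill-in edge.
print(min_fill_order({"a": {"b", "d"}, "b": {"a", "c"},
                      "c": {"b", "d"}, "d": {"a", "c"}}))
```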

Parameter Estimation Lecture #10
Acknowledgement: Some slides of this lecture are due to Nir Friedman.

Likelihood function for a die: Multinomial sampling
Let X be a random variable with 6 values x1,…,x6 denoting the six outcomes of a die. Suppose we observe a sequence of independent outcomes:
Data = (x6, x1, x1, x3, x2, x2, x3, x4, x5, x2, x6)
What is the probability of this data? If we knew the long-run frequencies θi of falling on side xi, then

P(Data | θ) = θ6 θ1 θ1 θ3 θ2 θ2 θ3 θ4 θ5 θ2 θ6 = θ1^2 θ2^3 θ3^2 θ4 θ5 θ6^2

where θ = {θ1, θ2, θ3, θ4, θ5} are called the parameters of the likelihood function (θ6 = 1 - θ1 - … - θ5 is determined by the others). We wish to estimate these parameters from the data we have seen.
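
A short sketch of this computation in code, using the counts Ni (the sufficient statistics introduced on the next slide):

```python
from collections import Counter
from math import prod

data = [6, 1, 1, 3, 2, 2, 3, 4, 5, 2, 6]      # the observed outcomes
theta = {i: 1 / 6 for i in range(1, 7)}       # e.g. a fair die

counts = Counter(data)                        # the sufficient statistics N_i
likelihood = prod(theta[i] ** n for i, n in counts.items())
print(counts)      # Counter({2: 3, 6: 2, 1: 2, 3: 2, 4: 1, 5: 1})
print(likelihood)  # (1/6)**11 for the fair die
```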

Sufficient Statistics
To compute the probability of data in the die example, we only need to record the number of times Ni the die fell on side i (namely, N1, N2, …, N6). We do not need to recall the entire sequence of outcomes. {Ni | i = 1…6} is called the sufficient statistics for multinomial sampling.

Sufficient Statistics
A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood. Formally, s(Data) is a sufficient statistic if for any two datasets Data and Data':
s(Data) = s(Data')  implies  P(Data | θ) = P(Data' | θ)

[Diagram: many datasets map to the same sufficient statistics.]

Maximum Likelihood Estimate
The maximum likelihood estimate is an assignment to the parameters that maximizes the probability of the data (i.e., the likelihood function). Usually one maximizes the log-likelihood function, which is easier to do and gives an identical answer:

log P(Data | θ) = log ∏i θi^Ni = Σi Ni log θi

A sufficient condition for a maximum (subject to Σi θi = 1, with Lagrange multiplier λ) is:

∂/∂θi [ Σj Nj log θj + λ(1 - Σj θj) ] = Ni/θi - λ = 0   for all i

Finding the Maximum
We have just found that Ni/θi = λ for every i. Divide the ith and jth equations: θj = θi Nj / Ni. Sum from j = 1 to 6: 1 = Σj θj = (θi/Ni) Σj Nj = (θi/Ni) N. Hence the MLE is given by:

θi = Ni / N

Adding Pseudo Counts
The MLE θi = Ni/N can be misleading for small data sets, because a small data set may not be typical. For example, we might know that the die is manufactured to be loaded, yet the small dataset we examined does not show this property. The MAP estimate can be justified as maximizing one's posterior (namely, after seeing the data) best estimate of the frequencies for each side. The theory formally justifying this formula is called Bayesian statistics (not covered in this course due to time constraints). The MAP estimate is given by

θi = (Ni + N'i) / (N + N')

The six pseudo counts N'i sum to N'. They express one's assessment of the frequencies for each side prior to seeing the data. A large N' indicates high confidence. Values smaller than 1 are possible.
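
A minimal sketch contrasting the two estimators (the pseudo counts below are an arbitrary uniform choice, for illustration):

```python
def mle(counts):
    """theta_i = N_i / N."""
    n = sum(counts)
    return [c / n for c in counts]

def map_estimate(counts, pseudo):
    """theta_i = (N_i + N'_i) / (N + N')."""
    n, n_prime = sum(counts), sum(pseudo)
    return [(c + p) / (n + n_prime) for c, p in zip(counts, pseudo)]

counts = [2, 3, 2, 1, 1, 2]          # N_i from the 11-toss die example
pseudo = [1, 1, 1, 1, 1, 1]          # illustrative pseudo counts, N' = 6
print(mle(counts))                   # approx. [0.18, 0.27, 0.18, 0.09, 0.09, 0.18]
print(map_estimate(counts, pseudo))  # pulled toward the uniform distribution
```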

Example: The ABO locus
Recall that a locus is a particular place on the chromosome. Each locus's state (called the genotype) consists of two alleles, one paternal and one maternal. Some loci (plural of locus) determine distinguished features. The ABO locus, for example, determines blood type. The ABO locus has six possible genotypes {a/a, a/o, b/o, b/b, a/b, o/o}. The first two genotypes determine blood type A, the next two determine blood type B, then blood type AB, and finally blood type O. We wish to estimate the proportions of the 6 genotypes in a population. Suppose we randomly sampled N individuals and found that Na/a have genotype a/a, Na/b have genotype a/b, etc. Then the MLE is given by

θg = Ng / N   for each genotype g in {a/a, a/o, b/o, b/b, a/b, o/o}

The ABO locus (Cont.)
However, testing individuals for their genotype is very expensive. Can we estimate the proportions of the genotypes using the common, cheap blood test, whose outcome is one of the four blood types (A, B, AB, O)? The problem is that among individuals measured to have blood type A, we don't know how many have genotype a/a and how many have genotype a/o. So what can we do? We use the Hardy-Weinberg equilibrium rule, which tells us that in equilibrium the frequencies of the three alleles a, b, o in the population determine the frequencies of the genotypes as follows:
θa/b = 2θa θb, θa/o = 2θa θo, θb/o = 2θb θo, θa/a = (θa)^2, θb/b = (θb)^2, θo/o = (θo)^2.
So now we have three parameters, θa, θb, θo, that we need to estimate.

The Likelihood Function
Let X be a random variable with 6 values xa/a, xa/o, xb/b, xb/o, xa/b, xo/o denoting the six genotypes. The parameters are θ = {θa, θb, θo}. The probability P(X = xa/b | θ) = 2θa θb. The probability P(X = xo/o | θ) = θo θo. And so on for the other four genotypes. What is the probability of Data = {B, A, B, B, O, A, B, A, O, B, AB}? The probability of each blood type is the sum of the probabilities of its genotypes:
P(A | θ) = θa^2 + 2θa θo
P(B | θ) = θb^2 + 2θb θo
P(AB | θ) = 2θa θb
P(O | θ) = θo^2
so P(Data | θ) = (θa^2 + 2θa θo)^3 (θb^2 + 2θb θo)^5 (2θa θb) (θo^2)^2. Obtaining the maximum of this function yields the MLE. This can be done by the multidimensional Newton's algorithm.
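
A sketch of this likelihood in code (log-likelihood, for numerical stability):

```python
from math import log

def abo_log_likelihood(data, ta, tb, to):
    """Log-likelihood of observed blood types given allele frequencies,
    via the Hardy-Weinberg genotype probabilities."""
    p = {"A": ta**2 + 2*ta*to, "B": tb**2 + 2*tb*to,
         "AB": 2*ta*tb, "O": to**2}
    return sum(log(p[x]) for x in data)

data = ["B", "A", "B", "B", "O", "A", "B", "A", "O", "B", "AB"]
print(abo_log_likelihood(data, 0.3, 0.3, 0.4))
```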

Computing MLE
Finding the MLE parameters is a nonlinear optimization problem. Gradient ascent (Newton-like methods): follow the gradient of the likelihood w.r.t. the parameters (as taught in your favorite numerical analysis course). Improve by adding line-search methods to determine the step size and get faster convergence. Start at several random locations.

[Figure: the likelihood surface P(Data | θ) over the parameter space.]
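
A toy numerical sketch of the gradient-ascent idea over the two free parameters (θo = 1 - θa - θb), with a fixed step size and a finite-difference gradient; a real implementation would add line search and multiple restarts, as noted above:

```python
from math import log

data = ["B", "A", "B", "B", "O", "A", "B", "A", "O", "B", "AB"]

def loglik(ta, tb):
    to = 1 - ta - tb
    p = {"A": ta**2 + 2*ta*to, "B": tb**2 + 2*tb*to,
         "AB": 2*ta*tb, "O": to**2}
    return sum(log(p[x]) for x in data)

ta, tb, lr, eps = 0.33, 0.33, 1e-4, 1e-7
for _ in range(30000):                        # plain fixed-step gradient ascent
    f = loglik(ta, tb)
    ga = (loglik(ta + eps, tb) - f) / eps     # finite-difference gradient
    gb = (loglik(ta, tb + eps) - f) / eps
    ta, tb = ta + lr * ga, tb + lr * gb
print(ta, tb, 1 - ta - tb)                    # approaches the MLE
```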

Gene Counting
Had we known the counts na/a and na/o (among blood type A individuals), we could have estimated θa from n individuals as follows (and similarly estimated θb and θo):

θa = (2 na/a + na/o + na/b) / (2n)

Can we compute what na/a and na/o are expected to be? Using the current estimates of θa and θo we can, as follows:

E[na/a] = nA θa^2 / (θa^2 + 2θa θo),   E[na/o] = nA 2θa θo / (θa^2 + 2θa θo)

We repeat these two steps until the parameters converge.

Gene Counting (an example of EM)
Input: counts of each blood type nA, nB, nO, nAB of n people.
Desired output: ML estimate of the allele frequencies θa, θb, θo.
Initialization: set θa, θb, and θo to arbitrary values (say, 1/3).
Repeat:
E-step (Expectation): compute the expected genotype counts given the current frequencies, e.g., na/a = nA θa^2 / (θa^2 + 2θa θo) and na/o = nA - na/a; similarly nb/b and nb/o from nB.
M-step (Maximization): re-estimate the allele frequencies by allele counting, e.g., θa = (2 na/a + na/o + nAB) / (2n).
Until θa, θb, and θo converge.
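
A compact, runnable sketch of this loop (a direct implementation of the gene-counting scheme above; the counts in the example are taken from the earlier Data):

```python
def abo_em(nA, nB, nO, nAB, iters=100):
    """EM (gene counting) for the ABO allele frequencies theta_a, theta_b, theta_o."""
    n = nA + nB + nO + nAB
    ta = tb = to = 1 / 3                              # initialization
    for _ in range(iters):
        # E-step: expected genotype counts under the current frequencies.
        naa = nA * ta**2 / (ta**2 + 2 * ta * to)      # expected n_a/a
        nao = nA - naa                                # expected n_a/o
        nbb = nB * tb**2 / (tb**2 + 2 * tb * to)      # expected n_b/b
        nbo = nB - nbb                                # expected n_b/o
        # (AB and O individuals have unambiguous genotypes a/b and o/o.)
        # M-step: allele counting over the 2n alleles.
        ta = (2 * naa + nao + nAB) / (2 * n)
        tb = (2 * nbb + nbo + nAB) / (2 * n)
        to = (2 * nO + nao + nbo) / (2 * n)
    return ta, tb, to

print(abo_em(nA=3, nB=5, nO=2, nAB=1))   # blood-type counts from the Data example
```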