BMI 731- Winter 2004 Haplotype reconstruction Catalin Barbacioru Department of Biomedical Informatics Ohio State University.

Slides:



Advertisements
Similar presentations
Introduction to Haplotype Estimation Stat/Biostat 550.
Advertisements

Bayesian inference “Very much lies in the posterior distribution” Bayesian definition of sufficiency: A statistic T (x 1, …, x n ) is sufficient for 
SNP Haplotype reconstruction Statistics 246, 2002, Week 14, Lecture 2 Not complete.
METHODS FOR HAPLOTYPE RECONSTRUCTION
1 12. Principles of Parameter Estimation The purpose of this lecture is to illustrate the usefulness of the various concepts introduced and studied in.
Sampling distributions of alleles under models of neutral evolution.
Basics of Linkage Analysis
Maximum Likelihood. Likelihood The likelihood is the probability of the data given the model.
Visual Recognition Tutorial
Bioinformatics Finding signals and motifs in DNA and proteins Expectation Maximization Algorithm MEME The Gibbs sampler Lecture 10.
Overview Full Bayesian Learning MAP learning
Estimation A major purpose of statistics is to estimate some characteristics of a population. Take a sample from the population under study and Compute.
1 How many genes? Mapping mouse traits, cont. Lecture 2B, Statistics 246 January 22, 2004.
Ronnie A. Sebro Haplotype reconstruction BMI /21/2004.
Maximum likelihood (ML) and likelihood ratio (LR) test
Maximum likelihood Conditional distribution and likelihood Maximum likelihood estimations Information in the data and likelihood Observed and Fisher’s.
Maximum likelihood (ML) and likelihood ratio (LR) test
Evaluating Hypotheses
Efficient Estimation of Emission Probabilities in profile HMM By Virpi Ahola et al Reviewed By Alok Datar.
Maximum likelihood (ML)
Binary Variables (1) Coin flipping: heads=1, tails=0 Bernoulli Distribution.
1 1.1 © 2012 Pearson Education, Inc. Linear Equations in Linear Algebra SYSTEMS OF LINEAR EQUATIONS.
Phylogeny Estimation: Traditional and Bayesian Approaches Molecular Evolution, 2003
Lecture 5: Segregation Analysis I Date: 9/10/02  Counting number of genotypes, mating types  Segregation analysis: dominant, codominant, estimating segregation.
Prof. Dr. S. K. Bhattacharjee Department of Statistics University of Rajshahi.
Slides are based on Negnevitsky, Pearson Education, Lecture 12 Hybrid intelligent systems: Evolutionary neural networks and fuzzy evolutionary systems.
Genetic Mapping Oregon Wolfe Barley Map (Szucs et al., The Plant Genome 2, )
Linear Reduction for Haplotype Inference Alex Zelikovsky joint work with Jingwu He WABI 2004.
Population Genetics is the study of the genetic
National Taiwan University Department of Computer Science and Information Engineering Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Advanced Algorithms and Models for Computational Biology -- a machine learning approach Population Genetics: SNPS Haplotype Inference Eric Xing Lecture.
ECE 8443 – Pattern Recognition LECTURE 07: MAXIMUM LIKELIHOOD AND BAYESIAN ESTIMATION Objectives: Class-Conditional Density The Multivariate Case General.
National Taiwan University Department of Computer Science and Information Engineering Pattern Identification in a Haplotype Block * Kun-Mao Chao Department.
Lecture 19: Association Studies II Date: 10/29/02  Finish case-control  TDT  Relative Risk.
Experimental Design and Data Structure Supplement to Lecture 8 Fall
1 Population Genetics Basics. 2 Terminology review Allele Locus Diploid SNP.
Lecture 4: Statistics Review II Date: 9/5/02  Hypothesis tests: power  Estimation: likelihood, moment estimation, least square  Statistical properties.
Lecture 13: Linkage Analysis VI Date: 10/08/02  Complex models  Pedigrees  Elston-Stewart Algorithm  Lander-Green Algorithm.
Lecture 12: Linkage Analysis V Date: 10/03/02  Least squares  An EM algorithm  Simulated distribution  Marker coverage and density.
Lecture 15: Linkage Analysis VII
ECE 8443 – Pattern Recognition ECE 8527 – Introduction to Machine Learning and Pattern Recognition LECTURE 07: BAYESIAN ESTIMATION (Cont.) Objectives:
Lecture 3: Statistics Review I Date: 9/3/02  Distributions  Likelihood  Hypothesis tests.
Allele Frequencies: Staying Constant Chapter 14. What is Allele Frequency? How frequent any allele is in a given population: –Within one race –Within.
Population genetics. coalesce 1.To grow together; fuse. 2.To come together so as to form one whole; unite: The rebel units coalesced into one army to.
California Pacific Medical Center
Practical With Merlin Gonçalo Abecasis. MERLIN Website Reference FAQ Source.
Linkage Disequilibrium and Recent Studies of Haplotypes and SNPs
Lecture 22: Quantitative Traits II
Review of statistical modeling and probability theory Alan Moses ML4bio.
Parameter Estimation. Statistics Probability specified inferred Steam engine pump “prediction” “estimation”
Computational Intelligence: Methods and Applications Lecture 26 Density estimation, Expectation Maximization. Włodzisław Duch Dept. of Informatics, UMK.
The inference and accuracy We learned how to estimate the probability that the percentage of some subjects in the sample would be in a given interval by.
The accuracy of averages We learned how to make inference from the sample to the population: Counting the percentages. Here we begin to learn how to make.
National Taiwan University Department of Computer Science and Information Engineering An Approximation Algorithm for Haplotype Inference by Maximum Parsimony.
International Workshop on Bioinformatics Research and Applications, May 2005 Phasing and Missing data recovery in Family Trios D. Brinza J. He W. Mao A.
The Haplotype Blocks Problems Wu Ling-Yun
 Negnevitsky, Pearson Education, Lecture 12 Hybrid intelligent systems: Evolutionary neural networks and fuzzy evolutionary systems n Introduction.
Hidden Markov Models BMI/CS 576
Gonçalo Abecasis and Janis Wigginton University of Michigan, Ann Arbor
Remember that our objective is for some density f(y|) for observations where y and  are vectors of data and parameters,  being sampled from a prior.
Haplotype Reconstruction
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Accuracy of Haplotype Frequency Estimation for Biallelic Loci, via the Expectation- Maximization Algorithm for Unphased Diploid Genotype Data  Daniele.
LECTURE 09: BAYESIAN LEARNING
Ho Kim School of Public Health Seoul National University
Parametric Methods Berlin Chen, 2005 References:
Haplotype Inference Yao-Ting Huang Kun-Mao Chao.
Bruce Rannala, Jeff P. Reeve  The American Journal of Human Genetics 
Presentation transcript:

BMI 731- Winter 2004 Haplotype reconstruction Catalin Barbacioru Department of Biomedical Informatics Ohio State University

The problem We start with a collection of genotypes in the form of allelic determinations at tightly linked single nucleotide polymorphisms (SNPs) for each of a set of n individuals. For example, we might describe 3 SNPs as follows: Name SNP alleles (major, minor) SNP1 T, A SNP2 A, G SNP3 C, G An individual might have genotype AT at SNP1, AA at SNP2, and CG at SNP3, which we will denote by AT//AA//CG. Possible haplotype pairs for this person are AAC/TAG and AAG/TAC, and without further information, we can’t distinguish between these two pairs.

The problem, cont. What can be done? With information on the individual’s parents, we can usually infer the haplotypes, the only problem being that the parents may not be fully informative. For example, if the maternal and paternal genotypes were TA//AA//CC and TT//AA//CG respectively, at SNPs 1, 2 and 3, and the individual is AT//AA//CG, then it would be clear that the haplotypes were AAC/TAG (why?). On the other hand, if the parents both had genotypes AT//AA//CG, then we wouldn’t be able to determine unique haplotypes for the individual. Even in the first case, we might have to make an assumption about the frequency of recombination: what is it? Our problem here is to determine haplotypes, or make good guesses at them without parental genotypes.

Origin of the problem Why do we want to determine haplotypes for individuals at tightly linked SNP loci? a)Haplotypes are more powerful discriminators between cases and controls in disease association studies. Why? b) With haplotypes we can conduct evolutionary studies. c) Use of haplotypes in disease association studies reduces the number of tests to be carried out, and hence the penalty for multiple testing. Is this the same point as a)?

Two aspects of the problem With a random sample of multilocus genotypes at a set of SNPs, we can attempt a) to estimate the frequencies of all possible haplotypes, and b) to infer the haplotypes of all individuals. The first step on this problem was taken by A Clark in He gave what we might call a parsimony solution to b) above. It goes like this. With a reasonable sample size, we might expect to have some individuals homozygous at every locus, e.g. TT//AA//CC, or heterozygous at just one locus, e.g. TT//AA//CG. With the individuals of former type, we have unambiguously identified one (TAC), and of the latter type two (TAC and TAG) haplotypes present in the population. The algorithm begins by finding all homozygotes and single SNP heterozygotes and tallying the resulting known haplotypes.

Now proceed as follows. For each known haplotype, look at all remaining unresolved cases, and ask whether the known haplotype can be made from some combination of ambiguous sites from an unresolved case. For example, if we have identified TAC as a known haplotype from a TT//AA//CC homozygote, and we have an individual AT//AA//CG still unresolved, then we infer that s/he is TAC/AAG, and we have have “resolved” this person’s haplotype and added a putative haplotype to our list. Similarly, a TT//AA//CG individual gives us both TAC and TAG as known haplotypes, and both of these go into the initial list. This chain of inferences is continued until either all haplotypes have been recovered, or until no more new haplotypes can be found in this way. A Clark’s algorithm, cont.

Clark’s algorithm with SNPs in practice This method should work in principle, but there are three problems that might arise in practice: a) there may be no homozygotes or single SNP heterozygotes in the sample, and so the chain might never get started; b) there may be many unresolved haplotypes left at the end; and c) haplotypes might be erroneously inferred if a crossover product of two actual haplotypes is identical to another true haplotype. The frequency of these problems will depend on average heterozygosity of the SNPs, number of loci, their recombination rates and the sample size. Clark (1990) did some calculations and simulations which led him to believe the algorithm would perform well, even with relatively small sample sizes. And it did.

The EM algorithm solution to problem a) (SNPem) We now describe an EM algorithm to infer haplotype frequencies in a population on the basis of a random sample and the assumption of random mating for haplotypes. Escoffier and Slatkin (1995) call the phase unknown multilocus genotypes, e.g. TT//AA//CG, phenotypes, and keep the term genotype for the corresponding haplotype pair TAC/TAG. Others use the term diplotype for a pair of haplotypes, but neither of these has caught on. The observed data in a random sample of n individuals will be multilocus genotype frequencies, and the natural model is multinomial. The number c of haplotype pairs leading to a given phenotype will depend on the number s of heterozygous SNPs, and will be 2 s-1. E.g if our genotype is TT//AA//CG, then we can recover the haplotypes unambiguously (c=1), but for our original case, AT//AA//CG, there were c = 2 possible haplotype pairs. Under the assumption of random mating, the probability of a given genotype is just the sum of 2 s-1 squares or products of haplotype probabilities, e.g. P=pr(AT//AA//CG) = pr(AAC/TAG) + pr(AAG/TAC) = 2pr(AAC)pr(TAG) + 2pr(AAG)pr(TAC).

The EM algorithm, cont. The probability of a sample of n individuals conditioned by the phenotype frequencies P 1, …, P m (i.e. the likelihood of the data given the parameters) is given by the multinomial probability, where m different phenotypes are observed with counts n 1,…,n m. If we h possible haplotypes with (unknown) frequencies p 1,…,p h, then the likelihood of the haplotype frequencies given the phenotypic counts is where a is a constant incorporating the multinomial coefficient, h p,r are the possible haplotypes and c j is the number of genotypes leading to the j th phenotype. In principle, we want to estimate the haplotype frequencies that maximize the likelihood function, by solving a set of h-1 equations involving first partial derivatives of the logarithm of the likelihood, ∂logL/∂p i.

The EM algorithm, cont. The EM algorithm is an iterative method to compute successive sets of haplotype frequencies p 1,…,p h, starting with initial values p 1 (0),…,p h (0). These initial values are used as if they are the true frequencies to estimate genotype frequencies P(h k h l ) (0) (the expectation step). These expected genotype frequencies are used in turn to estimate haplotype frequencies at the next generation p 1 (1),…,p h (1) (the maximization step) and so on, until convergence is reached. Possible initial haplotype frequencies can be evaluated assuming that all haplotypes are equally frequent in the sample, or that haplotype frequencies are equal to the product of single-locus allele frequencies, or that initial haplotypes frequencies are chosen random, or that for each phenotype, the possible genotypes are equally likely, so P(genotype h k h l in phenotype j) (0) = P j (h k h l ) (0) = 1/c j

The EM algorithm, cont. The expectation step at the g th iteration consists of using the haplotype frequencies in the previous iteration to calculate the probability of resolving each genotype given the phenotype: where the sum is taken over all genotypes (h k,h l ) leading to the j th phenotype, and P(h 1,h 2 ) (g) = 2p 1 (g) p 2 (g) if h 1 and h 2 are different and (p 1 (g) ) 2 otherwise. Haplotype frequencies are then computed in the maximization step: where δ it is equal to the number of times haplotype t is present in genotype i = (h k,h l ) (0,1, or 2).

The EM algorithm: performance with SNPs

Previous slide: Influence of sample size on haplotype frequency estimates. A and B, Haplotype frequencies at the three steps of the simulation procedure. Generating frequencies (Gk [line]), sample frequencies (Sk [triangles]), and resulting haplotype frequency estimates from the EM algorithm (Ek [unblackened circles]) for a five-locus system with equally frequent population haplotype frequencies, with sample size set to N = 50 (A) and N = 500 (B) are shown. C, Average MSE and 95% CI for batches of 500 data sets of each sample size for five-locus haplotypes generated under the N(1/k,2) model. Unbroken line denotes comparisons of EM estimates to sample values (SE); dotted line, EM estimates to generating parameters (GE).

A Gibbs sampler approach to haplotype reconstruction, Stephens et al (2001) (PHASE) Our notation is as follows: G = (G 1,…, G n ) denotes the observed multilocus genotype frequencies, H=(H 1,…,H n ) will be the corresponding unknown haplotype pairs, while F=(F 1, …, F M ) will denote the M unknown population haplotype frequencies. In these terms, the EM algorithm sought that F which maximized pr(G|F). The Gibbs sampler uses the following three steps, starting from some initial haplotype reconstruction H (0) : i) Choose an individual i, uniformly and at random from all ambiguous individuals; ii) Sample H i (t+1) from pr(H i | g, H -i (t) ), where H -i is the set of haplotypes excluding individual i; iii) Set H j (t+1) = H j (t) for j=1,…,i-1,i+1, …,n. General theory tells us that this produces a Markov chain with the desired stationary distribution. The question is: what is pr(H i | g, H -i (t) )?

The Stephens et al Gibbs sampler, some details For any haplotype pair H i = (h i1,h i2 ) consistent with genotype G i, write pr(H i | G, H -i )  pr(H i | H -i )  pr(h i1 |H -I )pr(h i2 | H -I, h i1 ) (*) where pr(.|H) is the conditional distribution of a future sampled haplotype, given a set H of previously sampled haplotypes. There are a couple of choices here, one assuming that the type of a mutant offspring is h with probability h, independent of the type of the parent. In that case, for a constant-size random mating population, pr(h|H) = (r h +  h )/(r +  ), where r h is the number of haplotypes of type h in H, r is the total number of haplotypes in H, and  is the scaled mutation rate.

The Stephens et al Gibbs sampler, more details In principle we can substitute this formula into step ii) of the generic Gibbs algorithm, and off we go. In practice, the number of possible values of Hi is too large: 2 s-1, where s is the number of loci at which individual i heterozygous. However, if we take h =1/M, where M is the total number of different possible haplotypes that could be observed in the population, we can exploit the fact

Comparison

Previous Figure Comparison of accuracy of PHASE method (solid line) versus EM (dotted line) and Clark's method (dashed line) for short sequence data (530 segregating sites). Top row, mean error rate (defined in text) for haplotype reconstruction. Bottom row, mean discrepancy (defined in text) for estimation of haplotype frequencies. We simulated data sets of 2n haplotypes, randomly paired to form n genotypes, under an infinite-sites model, with = 4 and different assumptions about the local recombination rate R (R and are defined in the note to table 1), using a coalescent-based program kindly provided by R. R. Hudson. For each combination of parameters considered, we generated 100 independent data sets and discarded those data sets for which the total number of possible haplotypes was >105 (the limit of our implementation of the EM algorithm), which typically left >90 data sets on which to compare the methods. Each point thus represents an average over simulated data sets. Horizontal lines above and below each point show approximate 95% confidence intervals for this average (±2 standard errors). The results for Clark's algorithm for R = 40 are omitted, as we had difficulty getting the algorithm to consistently provide a unique haplotype reconstruction for these data.

The Stephens et al Gibbs sampler, yet more details Stephens et al (2001) offer a second algorithm, based on slightly different population genetic modelling. It takes the form pr(h | H) = ∑ {  E} ∑ {s ≥0} [r  /r][  /(r+  )] s [r/(r+  )](P s )  h, where r  is the number of haplotypes of type  in the set H, r is the total number of haplotypes in H, and  is the scaled mutation rate. Here E is the set of types for a general mutation model, and P a reversible mutation matrix. Informally, this corresponds to the next sampled haplotype h being obtained by applying a random number of mutations s to a randomly chosen existing haplotype , where s is sampled from a geometric distribution. This related to sampling from what is known as the coalescent, see Stephens et al (2001) for details and references. The algorithm uses the above expression in (*) a few slides back. There are several issues that need to be dealt with here: estimation of , dealing with the fact that the dimension of P is M, the number of possible haplotypes, etc.

An alternative Gibbs sampler, Nui et al (2002) These authors are from the Bayesian school, who assume Dirichlet priors with parameters  = (  1, …,  M ), for the haplotype frequencies F = (f 1, …, f M ). As with the EM approach, our model is multinomial, being a product over individuals in the sample, of phase unknown multilocus genotype probabilities, which in turn are sums of products of pairs of haplotype probabilities. The problem, as always, is in the large number of haplotypes which are compatible with a given genotype. More fully, thinking of the haplotypes H as “missing”, we can write pr(G,H | F)  ∏ {i=1,..n} f hi1 f hi2 ∏ j f j  j - 1. The steps are familiar: conditional on F, sample h i1 and h i2 for individual i according to (**) below; then sample F given H updated. (**) pr(h i1 =g, h i2 = h|F, G i ) = f g f h /∑ {g’  h’=Gi} f g’ f h’. Here g’  h’=G i means that g’ and h’ combine to give G i.

Alternative Gibbs sampler, cont. The advantage of this approach is that it is independent of complex and potentially dangerous population genetic modelling assumptions. Furthermore, tricks we have met before and some new ones can be used to improve the efficiency of the algorithm. Predictive updating. Here, as elsewhere, we integrate out the parameters F to improve the Gibbs sampler. In this case pr(G,H)   [|  +N(H)|]/  [  +N(H)], where we are using the abbreviated gamma function notation once more, and where N(H) is the vector of haplotype counts. Now we can use a different, easier sampler: pick an individual i at random or in a certain order, and update his/her haplotype h i by sampling from pr(h i = (g,h)|H -i,G)  (n g +  g )(n h +  h ), where n g and n h are counts of g and h in H -i, respectively.

Alternative sampler, cont: Ligation. Handling a large number of loci and hence haplotypes still presents a challenge. It is natural to adopt a divide and conquer approach: solve the problem for small blocks of loci, and then piece together the solutions. Niu et al suggest partitioning L loci into blocks of about 8 loci. They offer two strategies: progressive ligation and hierarchical ligation. In both cases, the first step is to carry out block level haplotype reconstruction using the sampler just described. They then record the B most probable (partial) haplotypes for each block, between 40 and 50 in their examples, and proceed to join them. Progressive ligation begins at one end and pieces consecutive pairs together, using the Gibbs sampler restricted to longer haplotypes which are among the B 2 combinations from the two most probable sets of B partial haplotypes. This process is then continued until all the blocks are joined. Hierarchical ligation is analogous, but working across the whole length, see Figure on next slide. It is worth pointing out that these strategies not only deal with the large state space, they also help the Markov Chain to converge more rapidly.

Schematic depicting ligation. Here there are L loci, initially in segments of K, while  is the highest level of the pyramidal hierarchy.

Alternative sampler completed Prior annealing. A nice trick used in Niu et al to enable the Gibbs sampler to more move freely in haplotype space is to use high pseudo-counts at the beginning of the iterations, and progressively reduce them at a fixed rate as the sampler continues. Their formula for T iterations is  (t) =  (0) + t(  (T) -  (0) )/T. Missing marker data. The absence of both alleles of an SNP marker is common owing to PCR dropouts. Another concern is the “one-allele” problem, in which only one allele is unscored. The Gibbs sampler adapts nicely to having multiple categories of errors and dealing with these in the sampling.

References A G Clark. Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol. 7: L Escoffier and M Slatkin. Maximum likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12: , D Fallin and NJ Schork. Accuracy of haplotype frequency estimation for biallelic loci, via the expectation maximization algorithm, for unphased diploid genotype data. Am J Hum, Genet. 67: , M Stephens, NJ Smith and P Donnelly. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68: , T Niu, ZS Qin, X. Xu and J Liu. Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am. J. Hum. Genet. 70: , T Speed, Statistics 246 (notes), Berkeley 2002