SNP Haplotype reconstruction Statistics 246, 2002, Week 14, Lecture 2 Not complete.



The problem. We start with a collection of genotypes, in the form of allelic determinations at tightly linked single nucleotide polymorphisms (SNPs), for each of a set of n individuals. For example, we might describe 3 SNPs as follows:

Name    SNP alleles (major, minor)
SNP1    T, A
SNP2    A, G
SNP3    C, G

An individual might have genotype AT at SNP1, AA at SNP2, and CG at SNP3, which we will denote by AT//AA//CG. The possible haplotype pairs for this person are AAC/TAG and AAG/TAC, and without further information we can't distinguish between these two pairs.
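The enumeration of haplotype pairs can be checked mechanically. The sketch below is illustrative only: the function name `compatible_pairs` and the representation of a genotype as a list of two-letter strings (one per SNP) are assumptions made here, not part of the lecture.

```python
from itertools import product


def compatible_pairs(genotype):
    """Return all unordered haplotype pairs consistent with a genotype.

    `genotype` is a list of two-letter strings, one per SNP,
    e.g. ["AT", "AA", "CG"] for AT//AA//CG (hypothetical encoding).
    """
    pairs = set()
    # At each SNP, choose which of the two alleles sits on the first haplotype.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(s[c] for s, c in zip(genotype, choice))
        h2 = "".join(s[1 - c] for s, c in zip(genotype, choice))
        # Sort within the pair so that h1/h2 and h2/h1 count once.
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


print(compatible_pairs(["AT", "AA", "CG"]))
# [('AAC', 'TAG'), ('AAG', 'TAC')]
```

A genotype that is homozygous at every SNP, such as TT//AA//CC, yields the single pair TAC/TAC.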

The problem, cont. What can be done? With information on the individual's parents, we can usually infer the haplotypes, the only problem being that the parents may not be fully informative. For example, if the maternal and paternal genotypes were TA//AA//CC and TT//AA//CG respectively, at SNPs 1, 2 and 3, and the individual is AT//AA//CG, then it would be clear that the haplotypes were AAC/TAG (why?). On the other hand, if the parents both had genotype AT//AA//CG, then we wouldn't be able to determine unique haplotypes for the individual. Even in the first case, we might have to make an assumption about the frequency of recombination: what is it? Our problem here is to determine haplotypes, or to make good guesses at them, without parental genotypes.
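The two parental examples can be worked through with a small sketch. It only checks, SNP by SNP, that the allele assigned to each parent is one that parent carries; the function name `phase_with_parents` and the genotype encoding are assumptions for illustration, and no statement about recombination is built in.

```python
from itertools import product


def phase_with_parents(child, mother, father):
    """Return the (maternal, paternal) haplotype pairs for the child that are
    consistent with each parent transmitting one of their alleles per SNP.

    Genotypes are lists of two-letter strings, one per SNP (assumed encoding).
    """
    solutions = set()
    for choice in product((0, 1), repeat=len(child)):
        from_mother = "".join(s[c] for s, c in zip(child, choice))
        from_father = "".join(s[1 - c] for s, c in zip(child, choice))
        consistent = all(
            a in m and b in f
            for a, b, m, f in zip(from_mother, from_father, mother, father)
        )
        if consistent:
            solutions.add((from_mother, from_father))
    return solutions


child = ["AT", "AA", "CG"]
# Informative parents: a single solution remains.
print(phase_with_parents(child, ["TA", "AA", "CC"], ["TT", "AA", "CG"]))
# Both parents AT//AA//CG: both haplotype pairs remain possible.
print(phase_with_parents(child, child, child))
```

In the first call the mother can only supply C at SNP3 and the father can only supply T at SNP1, which is what pins down AAC/TAG.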

Origin of the problem. Why do we want to determine haplotypes for individuals at tightly linked SNP loci? From the class:
a) Haplotypes are more powerful discriminators between cases and controls in disease association studies. Why?
b) With haplotypes we can conduct evolutionary studies.
c) Use of haplotypes in disease association studies reduces the number of tests to be carried out, and hence the penalty for multiple testing. Is this the same point as a)?
d) (From me) Haplotypes are necessary in linkage analyses.
Other reasons?

Two aspects of the problem. With a random sample of multilocus genotypes at a set of SNPs, we can attempt a) to estimate the frequencies of all possible haplotypes, and b) to infer the haplotypes of all individuals. The first step on this problem was taken by A Clark in 1990. He gave what we might call a parsimony solution to b) above. It goes like this. With a reasonable sample size, we might expect to have some individuals homozygous at every locus, e.g. TT//AA//CC, or heterozygous at just one locus, e.g. TT//AA//CG. With individuals of the former type we have unambiguously identified one haplotype (TAC), and with individuals of the latter type two haplotypes (TAC and TAG), present in the population. The algorithm begins by finding all homozygotes and single-SNP heterozygotes and tallying the resulting known haplotypes.

A Clark's algorithm, cont. Now proceed as follows. For each known haplotype, look at all remaining unresolved cases, and ask whether the known haplotype can be made from some combination of ambiguous sites from an unresolved case, that is, whether its allele at every SNP is present in the unresolved genotype. For example, if we have identified TAC as a known haplotype from a TT//AA//CC homozygote, and we have an individual AT//AA//CG still unresolved, then we infer that s/he is TAC/AAG, and we have "resolved" this person's haplotypes and added a putative haplotype (AAG) to our list. Similarly, a TT//AA//CG individual gives us both TAC and TAG as known haplotypes, and both of these go into the initial list. This chain of inferences is continued until either all haplotypes have been recovered, or no more new haplotypes can be found in this way.
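The chain of inferences can be sketched as follows. This is a minimal illustration, not Clark's published implementation: the function name `clark`, the genotype encoding, and the rule of taking the first known haplotype that fits are assumptions made here. The result can depend on the order in which individuals and haplotypes are tried.

```python
def clark(genotypes):
    """Sketch of a Clark-style parsimony resolution.

    genotypes: list of genotypes, each a list of two-letter strings (one per SNP).
    Returns (known, resolved): the haplotypes found, in order of discovery, and
    a dict mapping each resolved individual's index to its haplotype pair.
    """
    known, resolved = [], {}

    def add(h):
        if h not in known:
            known.append(h)

    # Seed the list with homozygotes and single-SNP heterozygotes.
    for i, g in enumerate(genotypes):
        if sum(s[0] != s[1] for s in g) <= 1:
            h1 = "".join(s[0] for s in g)
            h2 = "".join(s[1] for s in g)
            resolved[i] = (h1, h2)
            add(h1)
            add(h2)

    # Repeatedly try to explain unresolved genotypes with a known haplotype.
    progress = True
    while progress:
        progress = False
        for i, g in enumerate(genotypes):
            if i in resolved:
                continue
            for h in list(known):
                if all(a in s for a, s in zip(h, g)):
                    # The partner haplotype carries the remaining allele at each SNP.
                    other = "".join(s[1] if s[0] == a else s[0] for a, s in zip(h, g))
                    resolved[i] = (h, other)
                    add(other)
                    progress = True
                    break
    return known, resolved


sample = [["TT", "AA", "CC"], ["AT", "AA", "CG"], ["TT", "AA", "CG"]]
known, resolved = clark(sample)
print(known)        # ['TAC', 'TAG', 'AAG']
print(resolved[1])  # ('TAC', 'AAG')
```

If the sample contains no homozygotes or single-SNP heterozygotes, the seed list is empty and nothing is resolved.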

Clark's algorithm with SNPs in practice. This method should work in principle, but there are three problems that might arise in practice: a) there may be no homozygotes or single-SNP heterozygotes in the sample, and so the chain might never get started; b) there may be many unresolved haplotypes left at the end; and c) haplotypes might be erroneously inferred if a crossover product of two actual haplotypes is identical to another true haplotype. The frequency of these problems will depend on the average heterozygosity of the SNPs, the number of loci, their recombination rates and the sample size. Clark (1990) did some calculations and simulations which led him to believe the algorithm would perform well, even with relatively small sample sizes. And it did.

The EM algorithm solution to problem a). We now describe an EM algorithm to infer haplotype frequencies in a population on the basis of a random sample and the assumption of random mating for haplotypes. Excoffier and Slatkin (1995) call the phase-unknown multilocus genotypes, e.g. TT//AA//CG, phenotypes, and keep the term genotype for the corresponding haplotype pair TAC/TAG. Others use the term diplotype for a pair of haplotypes, but neither of these has caught on. The observed data in a random sample of n individuals will be multilocus genotype frequencies, and the natural model is multinomial. The number $c$ of haplotype pairs leading to a given multilocus genotype depends on the number $s$ of heterozygous SNPs: $c = 2^{s-1}$ for $s \ge 1$, and $c = 1$ for $s = 0$. E.g. if our genotype is TT//AA//CG ($s = 1$), then we can recover the haplotypes unambiguously ($c = 1$), but for our original case, AT//AA//CG ($s = 2$), there were $c = 2$ possible haplotype pairs. Under the assumption of random mating, the probability of a given genotype is just the sum, over its $c$ compatible haplotype pairs, of squares (for a pair of identical haplotypes) or doubled products (for a pair of distinct haplotypes, which can arise in two parental orders) of haplotype probabilities, e.g. pr(AT//AA//CG) = pr(AAC/TAG) + pr(AAG/TAC) = 2 pr(AAC)pr(TAG) + 2 pr(AAG)pr(TAC).
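A minimal sketch of such an EM iteration is given below, under the random-mating model just described. The function names, the genotype encoding, the uniform starting frequencies and the fixed number of iterations are all assumptions for illustration. Each heterozygous pair is weighted by twice the product of its haplotype frequencies; this factor is common to all pairs compatible with a given genotype and so cancels in the E-step.

```python
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with a genotype
    (a list of two-letter strings, one per SNP)."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(s[c] for s, c in zip(genotype, choice))
        h2 = "".join(s[1 - c] for s, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def em_haplotype_frequencies(genotypes, n_iter=100):
    """Estimate haplotype frequencies from phase-unknown genotypes by EM."""
    options = [compatible_pairs(g) for g in genotypes]
    haplotypes = sorted({h for pairs in options for pair in pairs for h in pair})
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}  # uniform start
    for _ in range(n_iter):
        counts = {h: 0.0 for h in haplotypes}
        for pairs in options:
            # E-step: posterior probability of each compatible pair.
            w = [(1 if a == b else 2) * freq[a] * freq[b] for a, b in pairs]
            total = sum(w)
            for (a, b), wi in zip(pairs, w):
                counts[a] += wi / total
                counts[b] += wi / total
        # M-step: expected haplotype counts divided by 2n chromosomes.
        freq = {h: c / (2 * len(genotypes)) for h, c in counts.items()}
    return freq


sample = [
    ["TT", "AA", "CC"],
    ["TT", "AA", "CC"],
    ["TT", "AA", "CG"],
    ["AT", "AA", "CG"],
]
for h, f in sorted(em_haplotype_frequencies(sample).items()):
    print(h, round(f, 4))
```

In this toy sample the ambiguous individual AT//AA//CG is pulled toward AAG/TAC, because TAC is already common among the unambiguous individuals.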

The EM solution, cont. Straightforward, history, performance in recovering frequencies, and in recovering true haplotypes.

The EM algorithm: performance with SNPs. Summarize Fallin & Schork (2000).

A Gibbs sampler approach to haplotype reconstruction, Stephens et al (2001). Our notation is as follows: $G = (G_1, \ldots, G_n)$ denotes the observed multilocus genotypes, $H = (H_1, \ldots, H_n)$ the corresponding unknown haplotype pairs, and $F = (F_1, \ldots, F_M)$ the $M$ unknown population haplotype frequencies. In these terms, the EM algorithm sought the $F$ that maximizes $\mathrm{pr}(G \mid F)$. The Gibbs sampler repeats the following three steps, starting from some initial haplotype reconstruction $H^{(0)}$:
i) Choose an individual $i$ uniformly at random from all ambiguous individuals;
ii) Sample $H_i^{(t+1)}$ from $\mathrm{pr}(H_i \mid G, H_{-i}^{(t)})$, where $H_{-i}$ is the set of haplotypes excluding those of individual $i$;
iii) Set $H_j^{(t+1)} = H_j^{(t)}$ for $j = 1, \ldots, i-1, i+1, \ldots, n$.
General theory tells us that this produces a Markov chain with the desired stationary distribution. The question is: what is $\mathrm{pr}(H_i \mid G, H_{-i}^{(t)})$?

The Stephens et al Gibbs sampler, some details. For any haplotype pair $H_i = (h_{i1}, h_{i2})$ consistent with genotype $G_i$, write

$\mathrm{pr}(H_i \mid G, H_{-i}) \propto \mathrm{pr}(H_i \mid H_{-i}) \propto \mathrm{pr}(h_{i1} \mid H_{-i})\,\mathrm{pr}(h_{i2} \mid H_{-i}, h_{i1})$   (*)

where $\mathrm{pr}(\cdot \mid H)$ is the conditional distribution of a future sampled haplotype, given a set $H$ of previously sampled haplotypes. There are a couple of choices here, one assuming that the type of a mutant offspring is $h$ with probability $\nu_h$, independent of the type of the parent. In that case, for a constant-size random mating population,

$\mathrm{pr}(h \mid H) = (r_h + \theta \nu_h) / (r + \theta)$,

where $r_h$ is the number of haplotypes of type $h$ in $H$, $r$ is the total number of haplotypes in $H$, and $\theta$ is the scaled mutation rate.
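The generic three-step sampler with this simple conditional distribution can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the function names, the genotype encoding, the choice of the uniform mutation distribution (each of the M possible haplotypes equally likely, with M = 2^L for L biallelic SNPs), the default value of theta and the random initial reconstruction are all choices made here. It enumerates every compatible pair, so it is only practical for a handful of SNPs.

```python
import random
from itertools import product


def compatible_pairs(genotype):
    """All unordered haplotype pairs consistent with a genotype
    (a list of two-letter strings, one per SNP)."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(s[c] for s, c in zip(genotype, choice))
        h2 = "".join(s[1 - c] for s, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def pi(h, others, theta, M):
    """pr(h | H) = (r_h + theta * nu_h) / (r + theta), assuming nu_h = 1/M."""
    return (others.count(h) + theta / M) / (len(others) + theta)


def gibbs(genotypes, theta=1.0, n_steps=500, seed=0):
    rng = random.Random(seed)
    options = [compatible_pairs(g) for g in genotypes]
    M = 2 ** len(genotypes[0])  # number of possible haplotypes, biallelic SNPs
    H = [rng.choice(opts) for opts in options]  # initial reconstruction H(0)
    ambiguous = [i for i, opts in enumerate(options) if len(opts) > 1]
    for _ in range(n_steps):
        if not ambiguous:
            break
        i = rng.choice(ambiguous)  # step i)
        others = [h for j, pair in enumerate(H) if j != i for h in pair]
        # step ii): weights from (*), one haplotype at a time
        weights = [
            pi(h1, others, theta, M) * pi(h2, others + [h1], theta, M)
            for h1, h2 in options[i]
        ]
        H[i] = rng.choices(options[i], weights=weights)[0]
        # step iii): all other individuals keep their current pairs
    return H


sample = [["TT", "AA", "CC"], ["TT", "AA", "CG"], ["AT", "AA", "CG"]]
print(gibbs(sample))
```

The function returns only the final state of the chain; in practice one would record the sampled pairs over many iterations and summarize them.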

The Stephens et al Gibbs sampler, more details. In principle we can substitute this formula into step ii) of the generic Gibbs algorithm, and off we go. In practice, the number of possible values of $H_i$ is too large: $2^{s-1}$, where $s$ is the number of loci at which individual $i$ is heterozygous. However, if we take $\nu_h = 1/M$, where $M$ is the total number of different possible haplotypes that could be observed in the population, we can exploit the fact [Incomplete.]

The Stephens et al Gibbs sampler, yet more details. Stephens et al (2001) offer a second algorithm, based on slightly different population genetic modelling. It takes the form

$\mathrm{pr}(h \mid H) = \sum_{\alpha \in E} \sum_{s \ge 0} \frac{r_\alpha}{r} \left( \frac{\theta}{r + \theta} \right)^{s} \frac{r}{r + \theta} \, (P^{s})_{\alpha h}$,

where $r_\alpha$ is the number of haplotypes of type $\alpha$ in the set $H$, $r$ is the total number of haplotypes in $H$, and $\theta$ is the scaled mutation rate. Here $E$ is the set of types for a general mutation model, and $P$ is a reversible mutation matrix. Informally, this corresponds to the next sampled haplotype $h$ being obtained by applying a random number of mutations $s$ to a randomly chosen existing haplotype $\alpha$, where $s$ is sampled from a geometric distribution. This is related to sampling from what is known as the coalescent; see Stephens et al (2001) for details and references. The algorithm uses the above expression in (*) a few slides back. There are several issues that need to be dealt with here: estimation of $\theta$, dealing with the fact that the dimension of $P$ is $M$, the number of possible haplotypes, etc.
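One way to see why the dimension of the mutation matrix matters: the inner sum over the number of mutations is a geometric series in the matrix. The identity below is a standard one (it holds because the ratio lies strictly between 0 and 1 and the mutation matrix is stochastic); it is offered here as an aside and is not taken from the lecture. The symbol $\lambda$ is introduced only for this note.

```latex
\lambda = \frac{\theta}{r+\theta}, \qquad
\sum_{s \ge 0} \lambda^{s}\,(1-\lambda)\,P^{s} = (1-\lambda)\,(I-\lambda P)^{-1},
\qquad\text{so}\qquad
\mathrm{pr}(h \mid H) = \sum_{\alpha \in E} \frac{r_\alpha}{r}\,
\bigl[(1-\lambda)\,(I-\lambda P)^{-1}\bigr]_{\alpha h}.
```

Evaluated naively, this requires inverting an M by M matrix, where M is the number of possible haplotypes.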

An alternative Gibbs sampler, Niu et al (2002). These authors are from the Bayesian school, and assume a Dirichlet prior with parameters $\beta = (\beta_1, \ldots, \beta_M)$ for the haplotype frequencies $F = (f_1, \ldots, f_M)$. As with the EM approach, our model is multinomial, being a product over individuals in the sample of phase-unknown multilocus genotype probabilities, which in turn are sums of products of pairs of haplotype probabilities. The problem, as always, is in the large number of haplotypes which are compatible with a given genotype. More fully, thinking of the haplotypes $H$ as "missing", we can write the joint posterior as

$\mathrm{pr}(F, H \mid G) \propto \prod_{i=1}^{n} f_{h_{i1}} f_{h_{i2}} \prod_{j} f_j^{\beta_j - 1}$.

The steps are familiar: conditional on $F$, sample $h_{i1}$ and $h_{i2}$ for individual $i$ according to (**) below; then sample $F$ given the updated $H$, which is a draw from a Dirichlet distribution with parameters $\beta$ plus the haplotype counts in $H$.

(**)   $\mathrm{pr}(h_{i1} = g, h_{i2} = h \mid F, G_i) = f_g f_h \,/ \sum_{g' \oplus h' = G_i} f_{g'} f_{h'}$.

Here $g' \oplus h' = G_i$ means that $g'$ and $h'$ combine to give $G_i$.
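The two alternating steps can be sketched with the standard library alone. This is a hedged illustration, not the authors' program: the function names and genotype encoding are assumptions, a single common pseudo-count `beta` is used for every haplotype, and the haplotype space is restricted to haplotypes compatible with at least one observed genotype. The Dirichlet draw is obtained by normalizing independent gamma variates.

```python
import random
from itertools import product


def ordered_pairs(genotype):
    """All ordered haplotype pairs (g, h) that combine to give the genotype
    (a list of two-letter strings, one per SNP)."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        g = "".join(s[c] for s, c in zip(genotype, choice))
        h = "".join(s[1 - c] for s, c in zip(genotype, choice))
        pairs.add((g, h))
    return sorted(pairs)


def sample_pairs(genotypes, F, rng):
    """Sample each individual's pair with probability proportional to f_g * f_h."""
    H = []
    for G_i in genotypes:
        pairs = ordered_pairs(G_i)
        weights = [F[g] * F[h] for g, h in pairs]
        H.append(rng.choices(pairs, weights=weights)[0])
    return H


def sample_frequencies(H, haplotypes, beta, rng):
    """Draw F from a Dirichlet with parameters beta + haplotype counts in H."""
    counts = {h: 0 for h in haplotypes}
    for g, h in H:
        counts[g] += 1
        counts[h] += 1
    draws = {h: rng.gammavariate(counts[h] + beta, 1.0) for h in haplotypes}
    total = sum(draws.values())
    return {h: d / total for h, d in draws.items()}


def gibbs(genotypes, beta=1.0, n_iter=200, seed=0):
    rng = random.Random(seed)
    haplotypes = sorted(
        {h for G_i in genotypes for pair in ordered_pairs(G_i) for h in pair}
    )
    F = {h: 1.0 / len(haplotypes) for h in haplotypes}
    H = []
    for _ in range(n_iter):
        H = sample_pairs(genotypes, F, rng)
        F = sample_frequencies(H, haplotypes, beta, rng)
    return H, F


sample = [["TT", "AA", "CC"], ["TT", "AA", "CG"], ["AT", "AA", "CG"]]
H, F = gibbs(sample)
print(H)
```

Only the last draw is returned here; summaries of the posterior would average over the retained iterations.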

Alternative Gibbs sampler, cont. The advantage of this approach is that it is independent of complex and potentially dangerous population genetic modelling assumptions. Furthermore, tricks we have met before and some new ones can be used to improve the efficiency of the algorithm. Predictive updating. Here, as elsewhere, we integrate out the parameters $F$ to improve the Gibbs sampler. In this case

$\mathrm{pr}(G, H) \propto \Gamma(\beta + N(H)) \,/\, \Gamma(|\beta + N(H)|)$,

where we are using the abbreviated gamma function notation once more ($\Gamma(v) = \prod_j \Gamma(v_j)$ and $|v| = \sum_j v_j$ for a vector $v$), and where $N(H)$ is the vector of haplotype counts. Now we can use a different, easier sampler: pick an individual $i$ at random or in a certain order, and update his/her haplotype pair $h_i$ by sampling from

$\mathrm{pr}(h_i = (g, h) \mid H_{-i}, G) \propto (n_g + \beta_g)(n_h + \beta_h)$,

where $n_g$ and $n_h$ are the counts of $g$ and $h$ in $H_{-i}$, respectively.
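The predictive update needs only haplotype counts. The small sketch below computes the update weights for one ambiguous individual; the function names and the use of a single common pseudo-count `beta` for every haplotype are assumptions for illustration.

```python
from collections import Counter


def pair_weights(options, others, beta):
    """Unnormalized weights (n_g + beta) * (n_h + beta) for each candidate pair.

    options: candidate haplotype pairs for the individual being updated.
    others:  list of haplotypes currently assigned to everyone else (H_{-i}).
    """
    n = Counter(others)
    return [(n[g] + beta) * (n[h] + beta) for g, h in options]


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


# Individual AT//AA//CG, with TAC seen three times and TAG once among the others.
options = [("AAC", "TAG"), ("AAG", "TAC")]
others = ["TAC", "TAC", "TAC", "TAG"]
w = pair_weights(options, others, beta=0.5)
print(w)             # [0.75, 1.75]
print(normalize(w))  # probabilities close to 0.3 and 0.7
```

A draw from these probabilities (for example with `random.choices`) replaces the individual's current pair, and the counts are then updated.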

Alternative sampler, cont: Ligation. Handling a large number of loci, and hence haplotypes, still presents a challenge. It is natural to adopt a divide and conquer approach: solve the problem for small blocks of loci, and then piece together the solutions. Niu et al suggest partitioning $L$ loci into blocks of about 8 loci. They offer two strategies: progressive ligation and hierarchical ligation. In both cases, the first step is to carry out block-level haplotype reconstruction using the sampler just described. They then record the $B$ most probable (partial) haplotypes for each block, between 40 and 50 in their examples, and proceed to join them. Progressive ligation begins at one end and pieces consecutive pairs together, using the Gibbs sampler restricted to longer haplotypes which are among the $B^2$ combinations from the two most probable sets of $B$ partial haplotypes. This process is then continued until all the blocks are joined. Hierarchical ligation is analogous, but working across the whole length; see the figure on the next slide. It is worth pointing out that these strategies not only deal with the large state space, they also help the Markov chain to converge more rapidly.
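The joining step can be illustrated with a toy sketch: form the candidate longer haplotypes from two adjacent blocks, then keep those a given individual could carry. The function names and the genotype encoding are assumptions made here, and the restricted Gibbs sampler that would then run over these candidates is not shown.

```python
def ligate(left, right):
    """All concatenations of partial haplotypes from two adjacent blocks.

    With B haplotypes recorded per block this gives B * B candidates.
    """
    return [l + r for l in left for r in right]


def compatible(haplotype, genotype):
    """True if the haplotype's allele at every SNP is present in the genotype
    (a list of two-letter strings, one per SNP)."""
    return all(a in s for a, s in zip(haplotype, genotype))


left_block = ["TA", "AA"]   # most probable partial haplotypes for SNPs 1-2
right_block = ["C", "G"]    # most probable partial haplotypes for SNP 3
candidates = ligate(left_block, right_block)
print(candidates)  # ['TAC', 'TAG', 'AAC', 'AAG']

genotype = ["TT", "AA", "CG"]
print([h for h in candidates if compatible(h, genotype)])  # ['TAC', 'TAG']
```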

Schematic depicting ligation. Here there are L loci, initially in segments of K, while  is the highest level of the pyramidal hierarchy.

Alternative sampler completed. Prior annealing. A nice trick used in Niu et al to enable the Gibbs sampler to move more freely in haplotype space is to use high pseudo-counts at the beginning of the iterations, and progressively reduce them at a fixed rate as the sampler continues. Their formula for $T$ iterations is $\beta^{(t)} = \beta^{(0)} + t\,(\beta^{(T)} - \beta^{(0)})/T$. Missing marker data. The absence of both alleles of an SNP marker is common owing to PCR dropouts. Another concern is the "one-allele" problem, in which only one allele is unscored. The Gibbs sampler adapts nicely to having multiple categories of errors and dealing with these in the sampling.
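The annealing schedule is a linear interpolation between the starting and final pseudo-counts. A minimal sketch follows; the function name and the example values (10 down to 0.1 over 100 iterations) are made up for illustration.

```python
def annealed_pseudocount(t, T, beta_start, beta_end):
    """Pseudo-count at iteration t of T: beta_start + t * (beta_end - beta_start) / T."""
    return beta_start + t * (beta_end - beta_start) / T


# Hypothetical schedule: start at 10, finish at 0.1 after 100 iterations.
for t in (0, 25, 50, 75, 100):
    print(t, round(annealed_pseudocount(t, 100, 10.0, 0.1), 3))
```

Inside a sampler, the value for the current iteration would be used as the pseudo-count in the update weights.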

References
A G Clark. Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol. 7, 1990.
L Excoffier and M Slatkin. Maximum likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12, 1995.
D Fallin and N J Schork. Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am. J. Hum. Genet. 67, 2000.
M Stephens, N J Smith and P Donnelly. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68, 2001.
T Niu, Z S Qin, X Xu and J Liu. Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am. J. Hum. Genet. 70, 2002.