Haplotype Analysis based on Markov Chain Monte Carlo


By Konstantin Sinyuk

Overview
- Haplotypes and haplotype analysis
- Markov chain Monte Carlo (MCMC)
- An algorithm based on MCMC
- Comparison with results of other algorithms
- Discussion of the algorithm's accuracy

What is a haplotype?
- A haplotype is a particular pattern of sequential SNP alleles found on a single chromosome.
- Haplotypes have a block-wise structure separated by recombination hot spots. Within each block recombination is rare due to tight linkage, and only very few haplotypes actually occur.
- A SNP is a single base pair that exhibits variation: the variation is caused by point mutations during meiosis and is almost always biallelic.
- dbSNP contains ~4.3×10^6 SNPs: over 1 SNP per 1,000 base pairs, about half with minor allele frequency > 20%, and this number is still growing rapidly.
- Generally, only a few of the 2^L haplotypes possible over L loci cover >90% of a population, due to bottleneck effects and genetic drift.

Haplotype analysis: motivation
- Using haplotypes in disease association studies reduces the number of tests to be carried out, and hence the penalty for multiple testing.
- The genome can be partitioned into roughly 200,000 blocks.
- Haplotypes also enable evolutionary studies.
- The International HapMap Project started in October 2002 and was planned to run for 3 years.

Haplotype analysis algorithms
Given a random sample of multilocus genotypes at a set of SNPs, two tasks arise:
- Estimate the frequencies of all possible haplotypes.
- Infer the haplotypes of all individuals.
Haplotyping algorithms: Clark's parsimony algorithm, the EM algorithm.
Haplotyping programs:
- HAPINFEREX (Clark's parsimony algorithm)
- EM-DECODER (EM algorithm)
- PHASE (Gibbs sampler)
- HAPLOTYPER

Motivation for the MCMC method
- An MCMC algorithm considers the underlying haplotype configurations in proportion to their likelihood.
- It estimates the most probable haplotype configuration.
- Because it weights configurations by likelihood rather than enumerating them all, an MCMC algorithm is able to analyze large pedigrees.
- Prof. Donnelly: "If a statistician cannot solve a problem, s/he makes it more complicated."

Discrete-Time Markov Chain
- A discrete-time stochastic process {X_n : n = 0, 1, 2, ...} taking values in {0, 1, 2, ...}.
- Memoryless (Markov) property: P(X_{n+1} = j | X_n = i, X_{n-1} = i_{n-1}, ..., X_0 = i_0) = P(X_{n+1} = j | X_n = i).
- Transition probabilities P_ij = P(X_{n+1} = j | X_n = i), collected in the transition probability matrix P = [P_ij].
- Aperiodic: returns from a state to itself are not restricted to multiples of some period > 1.
- Irreducible: every state is accessible from every other state in a finite number of steps.
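
A minimal Python sketch of these definitions; the two-state transition matrix and state labels are illustrative assumptions, not from the slides. Each step is drawn using only the current state, which is exactly the memoryless property.

```python
import numpy as np

# Illustrative 2-state transition matrix: P[i, j] = P(X_{n+1} = j | X_n = i).
# Rows sum to 1; this chain is irreducible and aperiodic.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

def simulate_chain(P, x0, n_steps, seed=0):
    """Simulate X_0, ..., X_n: by the memoryless property, each step
    is drawn using only the row of P for the current state."""
    rng = np.random.default_rng(seed)
    xs = [x0]
    for _ in range(n_steps):
        xs.append(int(rng.choice(len(P), p=P[xs[-1]])))
    return xs

print(simulate_chain(P, x0=0, n_steps=20))
```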

Chapman-Kolmogorov Equations
- n-step transition probabilities: P_ij^(n) = P(X_{m+n} = j | X_m = i).
- Chapman-Kolmogorov equations: P_ij^(m+n) = Σ_k P_ik^(m) P_kj^(n).
- P_ij^(n) is element (i, j) of the matrix power P^n, which allows recursive computation of state probabilities.
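
In code, the Chapman-Kolmogorov equations say the n-step transition matrix is the n-th matrix power of P. A quick numerical check, reusing the illustrative P from the previous sketch:

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# n-step transition probabilities are the entries of P^n.
P2 = np.linalg.matrix_power(P, 2)
P4 = np.linalg.matrix_power(P, 4)

# Chapman-Kolmogorov: P^(m+n) = P^m P^n, here with m = n = 2.
assert np.allclose(P4, P2 @ P2)
print(P4[0, 1])  # probability of reaching state 1 from state 0 in 4 steps
```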

State Probabilities and the Stationary Distribution
- Time-dependent state probabilities: π_j(n) = P(X_n = j); in matrix form, π(n) = π(0) P^n.
- If the time-dependent distribution converges to a limit π = lim_{n→∞} π(n), then π is called the stationary distribution and satisfies π = πP.
- Existence depends on the structure of the Markov chain.
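
A sketch of computing the stationary distribution numerically: π solves π = πP with Σ_j π_j = 1, i.e. π is a left eigenvector of P for eigenvalue 1 (same illustrative P as above).

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Left eigenvectors of P are eigenvectors of P transpose.
eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1.0))  # pick the eigenvalue closest to 1
pi = np.real(eigvecs[:, k])
pi /= pi.sum()                        # normalize into a distribution

print(pi)                             # [5/6, 1/6] for this P
assert np.allclose(pi, pi @ P)        # pi = pi P: it is stationary
```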

Classification of Markov Chains
- States i and j communicate if each is reachable from the other with positive probability; a Markov chain is irreducible if all states communicate.
- State i is periodic if returns to i are possible only at multiples of some period d > 1; a Markov chain is aperiodic if none of its states is periodic.
(Example state diagrams from the slide are not reproduced in the transcript.)

Existence of a Stationary Distribution
Theorem 1: For an irreducible, aperiodic Markov chain there are two possibilities:
1. π_j = 0 for all states j, and no stationary distribution exists; or
2. π_j > 0 for all states j, and π is the unique stationary distribution.
Remark: If the number of states is finite, case 2 is the only possibility.

Ergodic Markov Chains
- In a Markov chain with a stationary distribution, the states are positive recurrent: the process returns to each state j "infinitely often".
- A positive recurrent, aperiodic Markov chain is called ergodic.
- Ergodic chains have a unique stationary distribution.
- Ergodicity implies that time averages equal stochastic (ensemble) averages.

Balanced Markov Chains
- Global balance equations (GBE): π_j Σ_i P_ji = Σ_i π_i P_ij for every state j.
- Detailed balance equations (DBE): π_j P_ji = π_i P_ij for every pair of states i, j, where π_j P_ji is the stationary frequency of transitions from j to i.
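
Detailed balance can be checked numerically: it holds iff the matrix of stationary flows π_i P_ij is symmetric. A sketch with the two-state chain used above (any two-state chain with a stationary distribution is reversible):

```python
import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
pi = np.array([5/6, 1/6])   # stationary distribution from the previous sketch

flows = pi[:, None] * P     # flows[i, j] = pi_i * P_ij
print(np.allclose(flows, flows.T))  # True: detailed balance holds
```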

Markov Chain Summary
- A Markov chain is a random process with stationary transition probabilities, given by the matrix P = [P_ij] of transition probabilities between states i and j.
- A Markov chain is ergodic if it is aperiodic (returns to a state are not restricted to multiples of some period > 1) and irreducible (every state is accessible from every other state in a finite number of steps).
- An ergodic Markov chain has a stationary distribution: lim_{n→∞} P_ij^(n) = π_j exists and is independent of i; the vector π is the stationary distribution of the chain.
- An ergodic Markov chain is detailed balanced if π_j P_ji = π_i P_ij for all states i, j.

Markov Chain Monte Carlo
- MCMC is used when we wish to simulate from a distribution known only up to a constant (normalization) factor: π_i = h_i / C, where the constant C is hard to calculate.
- Metropolis proposed constructing a Markov chain with stationary distribution π using only the ratios π_j / π_i = h_j / h_i.
- The transition matrix P is defined indirectly via a proposal matrix Q = [Q_ij] (the proposal probability) and an acceptance probability α_ij, selected so that the Markov chain is detailed balanced with respect to π.
- The Metropolis-Hastings algorithm was introduced in the 1950s, originally motivated by problems in combinatorics and physics. It has been used extensively since, and was placed in a survey among the top ten algorithms with the greatest influence on the development and practice of science and engineering in the 20th century [3].

Metropolis-Hastings Algorithm
Metropolis-Hastings (MH) algorithm steps:
1. Start with X_0 = any state.
2. Given X_{t-1} = i, choose a candidate j with proposal probability Q_ij.
3. Accept j (put X_t = j) with acceptance probability α_ij = min(1, (π_j Q_ji) / (π_i Q_ij)), the Hastings ratio; only the ratio π_j / π_i is needed, so the normalizing constant cancels.
4. Otherwise reject, and put X_t = i.
5. Repeat steps 2 through 4 as many times as needed.
With this choice of α, detailed balance is satisfied, and the rejection steps introduce self-transitions that keep the chain aperiodic (a generic sketch follows below).
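
A minimal generic sketch of the steps above, assuming an unnormalized target weight w(i) ∝ π_i and caller-supplied proposal machinery; the function and argument names are illustrative.

```python
import numpy as np

def metropolis_hastings(w, propose, q_ratio, x0, n_steps, seed=0):
    """w(x): unnormalized target weight; propose(x, rng): draw a candidate j
    from Q(x, .); q_ratio(x, j) = Q(j, x) / Q(x, j), the Hastings correction."""
    rng = np.random.default_rng(seed)
    x, chain = x0, [x0]
    for _ in range(n_steps):
        j = propose(x, rng)
        alpha = min(1.0, (w(j) / w(x)) * q_ratio(x, j))  # Hastings ratio
        if rng.random() < alpha:
            x = j                # accept: X_t = j
        chain.append(x)          # reject: X_t = i (chain stays put)
    return chain
```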

Metropolis-Hastings Graph (figure not reproduced in the transcript)

Example of Metropolis-Hastings
Suppose we want to simulate from a target density π known up to a constant. Metropolis algorithm steps:
1. Start with X_0 = 0.
2. Generate a candidate X* from the proposal distribution N(X_{t-1}, 1).
3. Compute the acceptance probability α = min(1, π(X*) / π(X_{t-1})); the proposal is symmetric, so the Hastings correction cancels.
4. Accept X* (put X_t = X*) with probability α; otherwise keep X_t = X_{t-1}.
5. Repeat steps 2 through 4 as many times as needed.
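
A runnable sketch of this example. The slide's specific target density did not survive transcription, so an unnormalized standard Cauchy stands in as an assumption; the proposal N(X_{t-1}, 1) is the one from the slide.

```python
import numpy as np

def target(x):
    # Assumed stand-in target, known only up to a constant.
    return 1.0 / (1.0 + x * x)

rng = np.random.default_rng(0)
x, samples = 0.0, []
for _ in range(10_000):
    cand = rng.normal(x, 1.0)      # proposal N(X_{t-1}, 1), symmetric
    if rng.random() < min(1.0, target(cand) / target(x)):
        x = cand                   # accept
    samples.append(x)              # otherwise keep the current state

print(np.median(samples))          # near 0 for the Cauchy stand-in
```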

Gibbs Sampler
- The Gibbs sampler is a special case of the MH algorithm for states that can be partitioned into coordinates, x = (x_1, ..., x_d).
- At each step a single coordinate of the state is updated: the step from x to x' replaces x_k by a draw from the full conditional distribution π(x_k | x_1, ..., x_{k-1}, x_{k+1}, ..., x_d).
- The Gibbs sampler is used where each variable depends on the other variables only within some neighborhood.
- With this proposal, the acceptance probabilities are all equal to 1.
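
A compact Gibbs-sampler sketch for a two-coordinate state, here (as an illustrative assumption) a standard bivariate normal with correlation ρ: each full conditional is a one-dimensional normal, and every coordinate update is accepted with probability 1.

```python
import numpy as np

rho = 0.8                      # illustrative correlation
rng = np.random.default_rng(0)
x, y = 0.0, 0.0
samples = []
for _ in range(5_000):
    # Full conditionals of the standard bivariate normal:
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))  # x | y ~ N(rho*y, 1-rho^2)
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))  # y | x ~ N(rho*x, 1-rho^2)
    samples.append((x, y))

print(np.corrcoef(np.array(samples).T)[0, 1])     # close to rho
```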

MCMC in Haplotyping
- The Gibbs sampler suits multilocus genotype reconstruction for n persons.
- Notation: the data d consist of the observed phenotype of each individual i at each locus j; g denotes the ordered (phased) genotypes of all persons.
- The conditional distribution P(g | d) can be estimated by sampling.
- However, the Markov chain obtained with the plain Gibbs sampler may not be ergodic: it can be trapped in one irreducible set of configurations.

The Proposed Algorithm
- Most algorithms search for the single configuration g that maximizes P(g | d); the proposed algorithm instead explores the whole conditional distribution of configurations.
- An ergodic Markov chain is constructed whose stationary distribution is P(f(c) | d).
- The sampling is done with the Gibbs sampler.
- Ergodicity of the Markov chain is ensured by adding Metropolis jumping kernels.
- The algorithm is therefore named Gibbs-Jump.

Gibbs Step of the Algorithm
For each individual i and locus j, the two alleles of i at locus j are sampled from their conditional distribution given the data and the current genotypes of i's neighbors in the pedigree: the parents, children, and spouses of i.
The following assumptions are commonly made in order to compute the transition probabilities:
- Hardy-Weinberg equilibrium
- Linkage equilibrium
- No interference

Jumping Step of the Algorithm
- After the Gibbs step, the algorithm attempts to jump from the current multilocus genotype configuration g to a configuration g* in a different irreducible set, using a Metropolis jumping kernel.
- Let one set (say, C_j) collect the non-communicating genotypic configurations at locus j, and another (say, J_j) the individuals who "characterize" the irreducible set at locus j.
- A new state g* is formed by replacing the allele pairs in g by those from another configuration in C_j for the individuals in J_j.
- g* is then accepted with the Metropolis acceptance probability (see the sketch below).
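
A heavily hedged sketch of the shape of this step: propose a configuration g* in a different irreducible set and accept it with the Metropolis ratio of posterior weights. `posterior` and `jump_to_other_set` are hypothetical stand-ins for the pedigree-specific machinery the slides describe, not the authors' actual code.

```python
def jumping_step(g, posterior, jump_to_other_set, rng):
    """One Metropolis jump. posterior(g): unnormalized weight proportional
    to P(g | data); jump_to_other_set(g, rng): propose g* by swapping the
    allele pairs of the characterizing individuals at some locus.
    rng: a random generator, e.g. numpy.random.default_rng()."""
    g_star = jump_to_other_set(g, rng)
    alpha = min(1.0, posterior(g_star) / posterior(g))  # Metropolis ratio
    return g_star if rng.random() < alpha else g
```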

Gibbs-Jump Trajectories (figure not reproduced in the transcript)

Results Comparison
- Gibbs-Jump is an estimating algorithm, so it should be tested on well-explored genetic diseases.
- Two such diseases are Krabbe disease (an autosomal recessive disorder) and episodic ataxia (an autosomal dominant disorder).
- The original analyses were done with the programs LINKAGE (Krabbe), which performs enumerative linkage analysis, and SIMWALK (ataxia), which uses simulated annealing (a Monte Carlo method).
- A comparison of various haplotyping methods was carried out by Sobel, so the proposed algorithm's results are compared against Sobel's work.

Krabbe Disease (Globoid-Cell Leukodystrophy)
- This autosomal, recessively inherited disease results from a deficiency of the lysosomal enzyme β-galactosylceramidase (GALC).
- The GALC enzyme plays a role in the normal turnover of lipids that form a significant part of myelin, the insulating material around certain neurons.
- Affected individuals show progressive mental and motor (movement) deterioration and usually die within a year or two of birth.

Krabbe Disease (cont.) (pedigree figure not reproduced in the transcript)

Krabbe Disease: Result Comparison
- The input data is a genetic map of 8 polymorphic genetic markers on chromosome 14.
- The Gibbs-Jump algorithm assigned the most likely haplotype configuration a probability of 0.69; it is the same configuration as obtained by Sobel's enumerative approach.
- Sobel's approach enumerated 262,144 haplotype variants, taking a couple of hours of CPU time, versus less than 1 minute for a run of 100 iterations of Gibbs-Jump.

Episodic Ataxia Disease
- Episodic ataxia is an autosomal, dominantly inherited disease affecting the cerebellum.
- It is caused by point mutations in the human voltage-gated potassium channel gene Kv1.1 on chromosome 12p13.
- Affected individuals are normal between attacks but become ataxic under stressful conditions.

Episodic Ataxia: Result Comparison
- The input data is a genetic map of 9 polymorphic genetic markers on chromosome 12.
- The Gibbs-Jump algorithm assigned the most likely haplotype configuration a probability of 0.41, very similar to that obtained by Sobel with SIMWALK.
- The second most probable haplotype configuration had probability 0.09 and is identical to the one picked by Sobel.

Simulation Data
- To evaluate the performance of Gibbs-Jump on large pedigrees (with loops), a haplotype configuration was simulated.
- A genetic map of 10 codominant markers (5 alleles per marker) with recombination fraction θ = 0.05 was used.
- The founders' haplotypes were sampled randomly from the population distribution of haplotypes; haplotypes for non-founders were then simulated conditional on their parents' haplotypes.
- Hardy-Weinberg equilibrium, linkage equilibrium, and Haldane's no-interference model for recombination were assumed.

Simulation Results
- 100 iterations of Gibbs-Jump were performed.
- The most probable configuration (probability 0.41) is identical to the true (simulated) one.
- Three configurations tied for the second-largest probability (0.07); all three differ from the true configuration at one person, with one extra recombination event each.
- The algorithm's execution took several minutes.

Simulation Accuracy
- Results of 10 runs of 100 realizations each.
- In runs 1 and 3-10, the most frequent configuration was the true one; in run 2, the most frequent configuration differed from the true one at one individual.

Simulation Run Length
- Results of 5 runs of 10,000 realizations each.
- Dot plot of the estimated frequency of the underlying true haplotype configuration, recorded every 100 iterations (figure not reproduced).
- The figure shows a fair amount of variability in the estimates, but very little correlation between consecutive estimates (autocorrelation = -0.02).

Simulation Run Length (cont.)
- Cumulative frequency of the most probable configuration, plotted every 100 iterations with a 95% confidence bound (figure not reproduced).
- The estimate converges to the true haplotype configuration after about 2,000 steps.
- Four other runs also inferred the true configuration, with probabilities 34.54%, 35.75%, 37.08%, and 35.27%, respectively.

Results of Sensitivity Analysis
- Computing P(g) requires assigning haplotype probabilities to the founders. How does an inaccurate specification of the founder probabilities affect the results?
- Four sets of gene frequencies (different from the simulated ones) were used for one of the 10 markers, leaving the other markers unchanged.
- For this simulation set, the resulting haplotype configuration was the same as the simulated one.

Conclusion
- This discussion presented a new method, Gibbs-Jump, for haplotype analysis, which explores the whole distribution of haplotypes conditional on the observed phenotypes.
- The method is very time-efficient.
- Its accuracy was compared to results obtained by other methods (as described by Sobel).
- The method demonstrated tolerance to misspecification of the founders' haplotype probabilities.

The End… Wake up!