Constrained Hidden Markov Models for Population-based Haplotyping


1 Constrained Hidden Markov Models for Population-based Haplotyping
Application of Probabilistic ILP II, FP
Constrained Hidden Markov Models for Population-based Haplotyping
Niels Landwehr
Joint work with Taneli Mielikäinen, Lauri Eronen, Hannu Toivonen, Heikki Mannila
University of Freiburg / University of Helsinki

2 Outline
Population-based haplotype reconstruction
  Infer haplotypes from genotypes: reconstruct hidden phase of genetic data
  Important problem in biology/medicine, e.g. disease association studies
An approach using constrained HMMs
  Sparse Markov chains to represent conserved haplotype fragments
  HMM model that can be learned directly from genotype data
Experimental results

3 Human Genome and SNPs
[Figure: DNA sequences of several individuals, with the SNP (marker) positions labeled]
...GATATTCGTACGGATGTTTCCA...
...GATATTCGTACGGATGTTTCCA...
...GATGTTCGTACTGATGTCTCCA...

4 Haplotypes
[Figure: DNA sequences of the individuals with three SNPs; the alleles at the SNP positions form the haplotypes AGT and GTC]

5 Haplotypes
[Figure: the same layout with alleles coded as 0/1; the haplotypes are 101 and 010]

6 Why Haplotypes?
Haplotypes
  define our genetic individuality
  contribute to risk factors of complex diseases (e.g., diabetes)
Disease association studies (gene mapping):
  find genetic difference between a case and a control population
  identifying SNPs responsible for disease might help find a cure
Also useful for
  linkage disequilibrium studies: summarize genetic variation
  understanding evolution of human populations

7 The problem: Haplotypes not directly observable
Wet lab: only genotype information (two alleles for each SNP, but chromosome origin is unknown)
[Figure: a paternal and a maternal haplotype and the observed genotype, with entries such as {0,1}, {0}, {1}]
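The loss of phase described on this slide can be illustrated with a small sketch. It assumes alleles coded as 0/1 and haplotypes given as tuples; the function name `genotype` is hypothetical and not taken from the slides.

```python
def genotype(h1, h2):
    """Return the genotype of a haplotype pair: one unordered allele set per SNP."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same markers")
    return tuple(frozenset((a, b)) for a, b in zip(h1, h2))


# Two different haplotype pairs (illustrative values) ...
pair_a = ((0, 1, 1), (1, 1, 0))
pair_b = ((0, 1, 0), (1, 1, 1))

# ... that produce the same genotype {0,1},{1},{0,1}: the phase is lost.
same = genotype(*pair_a) == genotype(*pair_b)
```

Because each SNP is stored as an unordered set, swapping the paternal and maternal haplotype does not change the result either.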

8 Population-based Haplotype Reconstruction
Given the genotypes of several individuals, infer for every individual the most likely underlying haplotype pair
Hidden data reconstruction problem using probabilistic model: exploit patterns in the haplotypes (linkage disequilibrium)
[Figure: for Individual 1, Individual 2, Individual 3, ..., a haplotype pair and the corresponding genotype, with genotype entries such as {0,1}, {1}, {0}]

9 Haplotype Reconstruction Problem (CS Perspective)
Input: A set G of genotypes
Output: A set H of corresponding haplotype pairs such that
[formula missing in transcript]

10 Population-based Haplotype Reconstruction
Given a model M for the distribution of haplotypes, can infer most likely resolution:
[formula missing in transcript; annotated "Hardy-Weinberg equilibrium"]
Need to estimate this model from available genotype data
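One way to read "most likely resolution" is sketched below, under stated assumptions: alleles are binary, the haplotype model M is given as a plain dictionary `hap_prob` (hypothetical), and Hardy-Weinberg equilibrium is taken to mean that the two haplotypes of an individual are drawn independently, so a pair is scored by P(h1) * P(h2). The brute-force enumeration is for illustration only and is exponential in the number of heterozygous markers.

```python
from itertools import product


def consistent_pairs(genotype):
    """Enumerate unordered haplotype pairs compatible with a genotype.

    `genotype` is a sequence of allele sets over {0, 1}, e.g. [{0, 1}, {1}, {0}].
    """
    het = [i for i, g in enumerate(genotype) if len(g) == 2]
    base = [min(g) for g in genotype]  # correct at homozygous markers
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b  # heterozygous: one allele on each haplotype
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return pairs


def most_likely_resolution(genotype, hap_prob):
    """Pick the consistent pair maximizing P(h1) * P(h2)."""
    return max(consistent_pairs(genotype),
               key=lambda p: hap_prob.get(p[0], 0.0) * hap_prob.get(p[1], 0.0))


# Hypothetical haplotype distribution, for illustration only.
hap_prob = {(0, 1, 0): 0.4, (1, 1, 1): 0.3, (0, 1, 1): 0.2, (1, 1, 0): 0.1}
best = most_likely_resolution([{0, 1}, {1}, {0, 1}], hap_prob)
```

Here the genotype has two consistent resolutions, and the pair (0,1,0), (1,1,1) scores 0.4 * 0.3 against 0.2 * 0.1 for the alternative.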

11 Prior Work on Haplotype Reconstruction
Competitive application domain for several years: many systems developed
  characterized by the statistical model and learning/reconstruction algorithms employed
Special-purpose statistical models
  Approximate Coalescent (PHASE 2001, 2003, 2005)
  Block-based (Gerbil 2004, 2005)
  Variable-length MC (HaploRec 2004, 2006)
  Founder-based (HIT 2005)
  Local clusters (fastPHASE 2006)

12 Prior Work on Haplotype Reconstruction
Special-purpose learning/reconstruction algorithms
  MCMC variant
  Approximate EM + partition ligation
Our approach:
  Model haplotypes using (sparse) Markov chains
  Natural extension to a Hidden Markov Model on genotypes
  Directly learnable from genotype data (standard Baum-Welch)

13 Constrained HMMs for haplotyping
Modeling haplotypes
  Standard Markov chain
  More general: order-k Markov chain
[Figure: path for haplotype 0,1,1,0]
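A haplotype corresponds to a path through the Markov chain, and its probability is the product of the transition probabilities along that path. The sketch below shows this for the standard (first-order) case, assuming binary alleles and marker-specific transition tables; all parameter values are made up for illustration.

```python
def haplotype_probability(hap, initial, transitions):
    """P(hap) under a first-order Markov chain with marker-specific transitions.

    initial[a]           : probability of allele a at the first marker
    transitions[t][a][b] : probability of allele b at marker t+1 given a at marker t
    """
    p = initial[hap[0]]
    for t in range(len(hap) - 1):
        p *= transitions[t][hap[t]][hap[t + 1]]
    return p


# Hypothetical parameters for a chain over four markers.
initial = [0.6, 0.4]
transitions = [
    [[0.3, 0.7], [0.5, 0.5]],
    [[0.5, 0.5], [0.2, 0.8]],
    [[0.5, 0.5], [0.9, 0.1]],
]
p = haplotype_probability((0, 1, 1, 0), initial, transitions)  # 0.6 * 0.7 * 0.8 * 0.9
```

An order-k chain would condition each allele on the previous k alleles instead of only the last one.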

14 Constrained HMMs for haplotyping
Modeling genotypes
  Hidden phase (order of pair): Hidden Markov Model
  States: pairs of states of the underlying Markov chain (state of the maternal/paternal sequence)
  Output symbol: unordered pair
  Path in the model: sample two haplotypes, output corresponding genotype
Have to enforce Hardy-Weinberg equilibrium
  Parameter tying constraints on transition probabilities
Algorithms
  Learning: standard Baum-Welch
  Reconstruction of most likely haplotype pair: Viterbi
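The reconstruction step can be sketched as Viterbi decoding over pair states. This is a minimal illustration, not the SpaMM implementation: it assumes binary alleles and a first-order chain, and it takes the parameter tying to mean that both haplotypes use the same `initial` and `transitions` tables (hypothetical names), so a pair transition is the product of two single-chain transitions.

```python
from itertools import product


def viterbi_phase(genotype, initial, transitions):
    """Most likely haplotype pair for a genotype under a first-order chain.

    Hidden state at marker t: an ordered allele pair (a, b).
    Emission: the unordered set {a, b}, so only states matching genotype[t]
    are allowed. Both haplotypes share the same chain parameters.
    """
    def states(t):
        return [(a, b) for a, b in product((0, 1), repeat=2)
                if {a, b} == set(genotype[t])]

    # score[s] = probability of the best path ending in state s
    score = {s: initial[s[0]] * initial[s[1]] for s in states(0)}
    back = []
    for t in range(1, len(genotype)):
        tr = transitions[t - 1]
        new_score, pointers = {}, {}
        for s in states(t):
            prev = max(score, key=lambda r: score[r] * tr[r[0]][s[0]] * tr[r[1]][s[1]])
            new_score[s] = score[prev] * tr[prev[0]][s[0]] * tr[prev[1]][s[1]]
            pointers[s] = prev
        score = new_score
        back.append(pointers)

    # Trace back from the best final state.
    s = max(score, key=score.get)
    path = [s]
    for pointers in reversed(back):
        s = pointers[s]
        path.append(s)
    path.reverse()
    return tuple(a for a, _ in path), tuple(b for _, b in path)


# Hypothetical parameters: alleles at markers 1 and 2 tend to stay the same.
pair = viterbi_phase(
    [{0, 1}, {0, 1}, {1}],
    initial=[0.5, 0.5],
    transitions=[[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]]],
)
```

The ordered pairs (h1, h2) and (h2, h1) always score the same, so the result should be compared as an unordered pair. Note also that in a first-order chain a homozygous marker decouples the phase on either side of it, which is one reason longer histories are useful.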

15 Constrained HMMs for haplotyping
Example: paths for genotype {0,1},{1},{0,1},{0}

16 Sparse Markov Modeling (SpaMM)
Higher-order models (long history) needed, but model size grows exponentially with the order
However, out of the possible history blocks, only few occur in data (conserved fragments)
Idea: sparse model, iterative structure learning algorithm to identify conserved fragments (Apriori-style)
  Initialize first-order-model()
  em-training()
  repeat
    regularize-and-extend()
    em-training()
  until [condition missing in transcript]
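The Apriori-style idea (extend only what survived the previous round) can be sketched as follows. This is a deliberate simplification and not SpaMM's regularize-and-extend step: it assumes fully known haplotypes and hard counts with a fixed threshold `min_count`, whereas the method on the slides learns from genotypes with EM. All names are hypothetical.

```python
from collections import Counter


def conserved_fragments(haplotypes, max_order, min_count):
    """Grow position-specific fragments one marker at a time, Apriori-style.

    A fragment is (start, alleles). A fragment of length k+1 is only counted
    if its length-k prefix survived pruning in the previous round.
    """
    n_markers = len(haplotypes[0])
    # Round 1: single-marker fragments.
    counts = Counter((s, (h[s],)) for h in haplotypes for s in range(n_markers))
    current = {f: c for f, c in counts.items() if c >= min_count}
    levels = [current]
    for k in range(1, max_order):
        counts = Counter()
        for h in haplotypes:
            for s in range(n_markers - k):
                prefix = (s, tuple(h[s:s + k]))
                if prefix in current:  # extend only surviving fragments
                    counts[(s, tuple(h[s:s + k + 1]))] += 1
        # Prune rare fragments before the next extension round.
        current = {f: c for f, c in counts.items() if c >= min_count}
        levels.append(current)
    return levels


# Illustrative data: two conserved haplotypes and one rare one.
haps = [(0, 1, 1, 0)] * 3 + [(1, 0, 0, 1)] * 3 + [(0, 0, 1, 1)]
levels = conserved_fragments(haps, max_order=3, min_count=2)
```

Because rare fragments are dropped after every round, the number of candidates stays bounded by what actually occurs in the data instead of growing with all possible allele combinations.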

17 SpaMM Model (order 1)
Initial model: standard Markov chain of order 1
Iteration: extend order of model by 1, prune unlikely parts
Avoids combinatorial explosion of model size

18-22 SpaMM Model (order 2, 3, 4, 5, 6)
Iteration: extend order of model by 1, prune unlikely paths
Avoids combinatorial explosion of model size
[Figures: one slide per iteration, showing the model at orders 2 through 6]

23 SpaMM Model (final)
Final model:
  Model structure encodes conserved fragments
  Concise representation of all haplotypes with non-zero probability

24 Experimental Evaluation
Real-world population data
  Correct haplotypes have been inferred from trios
  Daly dataset: 103 SNP markers for 174 individuals
  Yoruba population: 100 datasets, 500 SNP markers each, 60 individuals
Problem setting:
  Given the set of genotypes, algorithm outputs most likely haplotype pairs
  Difference to real haplotype pairs is measured in switch distance (number of recombinations needed to transform pairs, normalized)
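The switch distance can be computed by checking, at each heterozygous marker, which true haplotype the first predicted haplotype follows, and counting how often that changes. The sketch below returns the raw count; the slide does not state how the value is normalized, so no normalization is applied here. The function name is hypothetical.

```python
def switch_distance(predicted, truth):
    """Number of phase switches needed to turn `predicted` into `truth`.

    Both arguments are haplotype pairs (h1, h2) that resolve the same genotype.
    Only heterozygous markers carry phase information.
    """
    p1, _ = predicted
    t1, t2 = truth
    het = [i for i in range(len(t1)) if t1[i] != t2[i]]
    # At each heterozygous marker: does predicted h1 follow true h1 or true h2?
    follows_t1 = [p1[i] == t1[i] for i in het]
    return sum(a != b for a, b in zip(follows_t1, follows_t1[1:]))


truth = ((0, 0, 1, 0), (1, 1, 0, 0))
predicted = ((0, 1, 0, 0), (1, 0, 1, 0))
d = switch_distance(predicted, truth)  # one switch, between markers 1 and 2
```

Swapping the order of the two predicted haplotypes does not change the result, since only changes of phase along the sequence are counted.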

25 Results: Haplotype Reconstruction
Many well-engineered systems
  Smart priors, averaging over several random restarts of EM, ...
SpaMM: proof-of-concept implementation, not tuned

26 Results: Haplotype Reconstruction
PHASE most accurate, then fastPHASE, then SpaMM
  However, PHASE too slow for long maps
SpaMM beats fastPHASE without averaging
Overall, competitive accuracy

27 Results: Runtime
Runtime in seconds for phasing 100 markers (log scale)
SpaMM scales linearly in the number of markers
  like fastPHASE, HaploRec, HIT
  unlike PHASE, Gerbil

28 Results: Genotype imputation
Most haplotyping methods can also predict missing genotype values
For SpaMM, they can be read off the Viterbi path

29 Results: Genotype imputation
fastPHASE best known method
Again, SpaMM beats fastPHASE without averaging

30 Conclusions
SpaMM: new haplotyping method
  Sparse Markov chains to encode conserved haplotype fragments
  Constrained HMM for modeling genotypes
  Apriori-style structure learning algorithm
  Simple, accurate, interpretable output
Future work
  Accuracy can probably be improved using standard techniques (EM random restarts, averaging, ...)

31 Thanks!

