Constrained Hidden Markov Models for Population-based Haplotyping. Niels Landwehr. Joint work with Taneli Mielikäinen, Lauri Eronen, Hannu Toivonen, Heikki Mannila. University of Freiburg / University of Helsinki. Application of Probabilistic ILP II, FP6-508861, www.aprill.org
Outline. Population-based haplotype reconstruction: infer haplotypes from genotypes (reconstruct the hidden phase of genetic data); an important problem in biology/medicine, e.g. disease association studies. An approach using constrained HMMs: sparse Markov chains to represent conserved haplotype fragments; an HMM that can be learned directly from genotype data. Experimental results.
Human Genome and SNPs. [Figure: aligned DNA sequences of individuals 1 to 6, e.g. ...GATATTCGTACGGATGTTTCCA... and ...GATGTTCGTACTGATGTCTCCA...; three SNP (marker) positions are highlighted.]
Haplotypes. [Figure: the alleles of individuals 1 to 6 at the three SNPs form the haplotypes AGT and GTC.]
Haplotypes. [Figure: the same picture with alleles encoded as 0/1, giving the haplotypes 101 and 010.]
Why Haplotypes? Haplotypes define our genetic individuality and contribute to risk factors of complex diseases (e.g., diabetes). Disease association studies (gene mapping): find genetic differences between a case and a control population; identifying the SNPs responsible for a disease might help find a cure. Also useful for linkage disequilibrium studies (summarizing genetic variation) and for understanding the evolution of human populations.
The problem: haplotypes are not directly observable. WetLab: only genotype information (two alleles for each SNP, but the chromosome origin, paternal or maternal, is unknown). [Figure: a paternal and a maternal haplotype and the resulting genotype, e.g. {0,1} {0} {1}.]
Population-based Haplotype Reconstruction. Given the genotypes of several individuals, infer for every individual the most likely underlying haplotype pair. Hidden data reconstruction problem using a probabilistic model: exploit patterns in the haplotypes (linkage disequilibrium). [Figure: haplotype pair and genotype, e.g. {0,1} {1} {0}, for Individual 1, Individual 2, Individual 3, …]
Haplotype Reconstruction Problem (CS Perspective) Input: A set G of genotypes Output: A set H of corresponding haplotype pairs such that
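Population-based Haplotype Reconstruction. Given a model M for the distribution of haplotypes, the most likely resolution of each genotype can be inferred, assuming Hardy-Weinberg equilibrium (the two haplotypes of an individual are drawn independently from M). This model has to be estimated from the available genotype data.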
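The inference step can be illustrated with a small sketch. It assumes a genotype is a list of sets ({0}, {1} or {0, 1} per marker) and that the haplotype model is available as a function `hap_prob`; both the representation and the names `resolutions`, `most_likely_resolution` and `hap_prob` are illustrative assumptions, not taken from the slides. Under Hardy-Weinberg equilibrium the score of a pair is the product of the two haplotype probabilities.

```python
from itertools import product


def resolutions(genotype):
    """Yield every unordered haplotype pair consistent with a genotype."""
    het = [i for i, g in enumerate(genotype) if len(g) == 2]
    base = [min(g) for g in genotype]  # heterozygous entries are overwritten below
    seen = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        pair = tuple(sorted((tuple(h1), tuple(h2))))  # canonical order of the pair
        if pair not in seen:
            seen.add(pair)
            yield pair


def most_likely_resolution(genotype, hap_prob):
    """Pick the consistent pair maximizing P(h1) * P(h2) (Hardy-Weinberg).

    The factor 2 for unordered pairs is the same for all candidates of one
    genotype, so it does not affect the argmax.
    """
    return max(resolutions(genotype),
               key=lambda pair: hap_prob(pair[0]) * hap_prob(pair[1]))


# Hypothetical haplotype frequencies, used only for illustration.
freq = {(0, 1, 1, 0): 0.5, (1, 1, 0, 0): 0.3, (0, 1, 0, 0): 0.1, (1, 1, 1, 0): 0.1}
genotype = [{0, 1}, {1}, {0, 1}, {0}]
candidates = list(resolutions(genotype))
best = most_likely_resolution(genotype, lambda h: freq.get(h, 0.0))
print(best)
```

Enumerating all resolutions is exponential in the number of heterozygous markers, so this only serves to make the objective concrete.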
Prior Work on Haplotype Reconstruction. Competitive application domain for several years: many systems developed, characterized by the statistical model and the learning/reconstruction algorithms employed. Special-purpose statistical models: approximate coalescent (PHASE 2001, 2003, 2005); block-based (Gerbil 2004, 2005); variable-length MC (HaploRec 2004, 2006); founder-based (HIT 2005); local clusters (fastPHASE 2006).
Prior Work on Haplotype Reconstruction. Special-purpose learning/reconstruction algorithms: MCMC variant; approximate EM + partition ligation; … Our approach: model haplotypes using (sparse) Markov chains; natural extension to a Hidden Markov Model on genotypes; directly learnable from genotype data (standard Baum-Welch).
Constrained HMMs for haplotyping. Modeling haplotypes: standard Markov chain; more general: order-k Markov chain. [Figure: path through the chain for haplotype 0,1,1,0.]
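As a minimal sketch of the first-order case, the probability of a haplotype is the product of the transition probabilities along its path. The code assumes binary alleles and marker-specific transition tables; the parameter layout and all numbers are illustrative assumptions.

```python
def haplotype_probability(hap, initial, transitions):
    """P(hap) under a first-order Markov chain over markers.

    initial[a]           = P(h[0] = a)
    transitions[t][a][b] = P(h[t + 1] = b | h[t] = a)
    """
    p = initial[hap[0]]
    for t in range(len(hap) - 1):
        p *= transitions[t][hap[t]][hap[t + 1]]
    return p


# Hypothetical parameters for four markers (each row sums to 1).
initial = [0.6, 0.4]
transitions = [
    [[0.2, 0.8], [0.7, 0.3]],
    [[0.5, 0.5], [0.1, 0.9]],
    [[0.9, 0.1], [0.6, 0.4]],
]
p = haplotype_probability((0, 1, 1, 0), initial, transitions)
print(p)  # 0.6 * 0.8 * 0.9 * 0.6
```

An order-k chain would condition each allele on the previous k alleles instead of one, which is where the model size starts to grow.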
Constrained HMMs for haplotyping. Modeling genotypes: the phase (order of each pair) is hidden, which gives a Hidden Markov Model. States: pairs of states of the underlying Markov chain (state of the maternal/paternal sequence). Output symbol: unordered pair. Path in the model: sample two haplotypes, output the corresponding genotype. Hardy-Weinberg equilibrium has to be enforced: parameter tying constraints on transition probabilities. Algorithms: learning with standard Baum-Welch; reconstruction of the most likely haplotype pair with Viterbi.
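The reconstruction step can be sketched for the first-order case as follows. The sketch assumes binary alleles, pair states (a, b), a deterministic emission of the unordered pair {a, b}, and one shared set of chain parameters for both haplotypes (the parameter tying). The function name, the data layout and the numbers are illustrative assumptions; learning with Baum-Welch is not shown.

```python
from itertools import product


def viterbi_haplotype_pair(genotype, initial, transitions):
    """Most likely ordered haplotype pair for a genotype (list of sets)."""
    states = list(product((0, 1), repeat=2))

    def emits(state, g):
        # A pair state outputs the unordered pair of its two alleles.
        return set(state) == set(g)

    # best[s] = (probability of the best path ending in s, that path)
    best = {s: (initial[s[0]] * initial[s[1]], [s])
            for s in states if emits(s, genotype[0])}
    for t in range(1, len(genotype)):
        new_best = {}
        for s in states:
            if not emits(s, genotype[t]):
                continue
            # Both haplotypes use the same transition table (tied parameters).
            candidates = [
                (p * transitions[t - 1][r[0]][s[0]] * transitions[t - 1][r[1]][s[1]],
                 path + [s])
                for r, (p, path) in best.items()
            ]
            new_best[s] = max(candidates, key=lambda c: c[0])
        best = new_best
    prob, path = max(best.values(), key=lambda c: c[0])
    return tuple(s[0] for s in path), tuple(s[1] for s in path), prob


# Hypothetical parameters, for illustration only.
initial = [0.6, 0.4]
transitions = [
    [[0.2, 0.8], [0.7, 0.3]],
    [[0.5, 0.5], [0.1, 0.9]],
    [[0.9, 0.1], [0.6, 0.4]],
]
h1, h2, prob = viterbi_haplotype_pair([{0, 1}, {0, 1}, {0, 1}, {0}],
                                      initial, transitions)
print(h1, h2, prob)
```

With a first-order chain, the phase across a homozygous marker is not determined: swapping the two haplotype suffixes after such a marker leaves the product of probabilities unchanged. Longer histories are what allow the phase to be carried across such markers.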
Constrained HMMs for haplotyping Example: paths for genotype {0,1},{1},{0,1},{0}
Sparse Markov Modeling (SpaMM)
Higher-order models (long history) are needed, but the model size grows exponentially with the order. However, out of the possible history blocks, only few occur in the data (conserved fragments).
Idea: sparse model, with an iterative (Apriori-style) structure learning algorithm to identify conserved fragments:
  initialize first-order-model()
  em-training( )
  repeat
    regularize-and-extend( )
    em-training( )
  until
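The extend-and-prune pattern can be illustrated with a deliberately simplified sketch. It is not the SpaMM algorithm: it assumes already phased haplotypes and uses raw occurrence counts with a hypothetical `min_count` threshold, whereas SpaMM works on genotypes and re-estimates parameters with EM after each extension. The function name and data are illustrative assumptions.

```python
from collections import Counter


def conserved_fragments(haplotypes, max_order, min_count):
    """Level-wise (Apriori-style) search for frequent haplotype fragments.

    Returns {k: set of (start, fragment)}, where fragment is a tuple of k
    alleles starting at marker `start` that occurs in at least `min_count`
    haplotypes.
    """
    n = len(haplotypes[0])
    counts = Counter((i, (h[i],)) for h in haplotypes for i in range(n))
    kept = {1: {f for f, c in counts.items() if c >= min_count}}
    for k in range(2, max_order + 1):
        counts = Counter()
        for h in haplotypes:
            for i in range(n - k + 1):
                frag = tuple(h[i:i + k])
                # Extend: only fragments whose length-(k-1) prefix survived.
                if (i, frag[:-1]) in kept[k - 1]:
                    counts[(i, frag)] += 1
        # Prune: drop extensions that are too rare.
        kept[k] = {f for f, c in counts.items() if c >= min_count}
        if not kept[k]:
            break
    return kept


# Hypothetical phased sample, for illustration only.
haplotypes = [(0, 1, 1, 0)] * 3 + [(1, 0, 0, 1)] * 2 + [(0, 0, 1, 1)]
kept = conserved_fragments(haplotypes, max_order=3, min_count=2)
print(sorted(kept[3]))
```

In this toy sample only 4 of the 16 possible (start, fragment) combinations of length 3 survive, which is the effect the sparse model relies on.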
SpaMM Model (order 1) Initial model: standard markov chain of order 1 Iteration: extend order of model by 1, prune unlikely parts Avoids combinatorial explosion of model size
SpaMM Model (orders 2 to 6) Iteration: extend order of model by 1, prune unlikely paths Avoids combinatorial explosion of model size
SpaMM Model (final) Final model: Model structure encodes conserved fragments Concise representation of all haplotypes with non-zero probability
Experimental Evaluation. Real-world population data; the correct haplotypes have been inferred from trios. Daly dataset: 103 SNP markers for 174 individuals. Yoruba population: 100 datasets, 500 SNP markers each, 60 individuals. Problem setting: given the set of genotypes, the algorithm outputs the most likely haplotype pairs. The difference to the real haplotype pairs is measured in switch distance (number of recombinations needed to transform the pairs, normalized).
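A sketch of how a switch distance can be computed for one individual is given below. The slides do not state the normalization; dividing by the number of heterozygous markers minus one is an assumption made here, as is the function name. The predicted pair is assumed to be consistent with the same genotype as the true pair.

```python
def switch_distance(true_pair, pred_pair):
    """Normalized number of phase switches between predicted and true pair."""
    t1, t2 = true_pair
    p1, _ = pred_pair
    # Only heterozygous markers carry phase information.
    het = [i for i in range(len(t1)) if t1[i] != t2[i]]
    # phase[j] is True where p1 carries the allele of t1 at the j-th such marker.
    phase = [p1[i] == t1[i] for i in het]
    switches = sum(a != b for a, b in zip(phase, phase[1:]))
    # Assumed normalization: number of heterozygous markers minus one.
    return switches / (len(het) - 1) if len(het) > 1 else 0.0


true_pair = ((0, 1, 1, 0), (1, 0, 0, 1))
pred_pair = ((0, 1, 0, 1), (1, 0, 1, 0))
d = switch_distance(true_pair, pred_pair)
print(d)  # one switch between markers 2 and 3, out of three possible
```

The count does not depend on which haplotype of the predicted pair is listed first, since swapping them flips every phase indicator at once.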
Results: Haplotype Reconstruction Many well-engineered systems Smart priors, averaging over several random restarts of EM, ... SpaMM: proof-of-concept implementation, not tuned
Results: Haplotype Reconstruction. PHASE is most accurate, then fastPHASE, then SpaMM; however, PHASE is too slow for long maps. SpaMM beats fastPHASE without averaging. Overall, competitive accuracy.
Results: Runtime. Runtime in seconds for phasing 100 markers (log scale). SpaMM scales linearly in the number of markers, like fastPHASE, HaploRec and HIT, and unlike PHASE and Gerbil.
Results: Genotype imputation. Most haplotyping methods can also predict missing genotype values; for SpaMM, they can be read off the Viterbi path.
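Reading imputed values off a decoded path can be sketched as below. The sketch assumes missing genotype values are represented as `None` and that a haplotype pair (`h1`, `h2`) has already been decoded for the individual; the function name and representation are illustrative assumptions, and the decoding itself is not shown.

```python
def impute_genotype(genotype, h1, h2):
    """Fill missing genotype entries (None) from a decoded haplotype pair."""
    return [g if g is not None else {a, b}
            for g, a, b in zip(genotype, h1, h2)]


observed = [{0, 1}, None, {0}]
imputed = impute_genotype(observed, (0, 1, 0), (1, 1, 0))
print(imputed)
```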
Results: Genotype imputation fastPHASE best known method Again, SpaMM beats fastPHASE without averaging
Conclusions. SpaMM: new haplotyping method. Sparse Markov chains to encode conserved haplotype fragments; constrained HMM for modeling genotypes; Apriori-style structure learning algorithm; simple, accurate, interpretable output. Future work: accuracy can probably be improved using standard techniques (EM random restarts, averaging, ...).
Thanks!