Published by Bryan Hawkins. Modified over 6 years ago.
1
Constrained Hidden Markov Models for Population-based Haplotyping
Application of Probabilistic ILP II, FP
Niels Landwehr
Joint work with Taneli Mielikäinen, Lauri Eronen, Hannu Toivonen, Heikki Mannila
University of Freiburg / University of Helsinki
2
Outline
Population-based haplotype reconstruction
  Infer haplotypes from genotypes: reconstruct the hidden phase of genetic data
  Important problem in biology/medicine, e.g. disease association studies
An approach using constrained HMMs
  Sparse Markov chains to represent conserved haplotype fragments
  HMM that can be learned directly from genotype data
Experimental results
3
Human Genome and SNPs
[Figure: aligned DNA sequences of several individuals; the positions where the sequences differ are SNPs (markers)]
...GATATTCGTACGGATGTTTCCA...
...GATGTTCGTACTGATGTCTCCA...
4
Haplotypes
[Figure: for individuals 1 to 6, the alleles at the three SNP positions of each DNA sequence form a haplotype, e.g. AGT and GTC]
5
Haplotypes
[Figure: the same view for individuals 1 to 6 with alleles coded as 0/1, giving haplotypes such as 101 and 010]
6
Why Haplotypes?
Haplotypes
  define our genetic individuality
  contribute to risk factors of complex diseases (e.g., diabetes)
Disease Association Studies (Gene Mapping):
  find genetic differences between a case and a control population
  identifying SNPs responsible for a disease might help find a cure
Also useful for
  linkage disequilibrium studies: summarize genetic variation
  understanding the evolution of human populations
7
The problem: Haplotypes not directly observable
Wet lab: only genotype information (two alleles for each SNP, but chromosome origin is unknown)
[Figure: a paternal and a maternal haplotype and the resulting genotype, e.g. {0,1} {0} {1}]
8
Population-based Haplotype Reconstruction
Given the genotypes of several individuals, infer for every individual the most likely underlying haplotype pair
Hidden data reconstruction problem using a probabilistic model: exploit patterns in the haplotypes (linkage disequilibrium)
[Figure: for individuals 1, 2, 3, ..., an observed genotype such as {0,1} {1} {0} and the hidden haplotype pair underneath it]
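The relation between a genotype and its possible haplotype pairs can be sketched in a few lines of Python. This is only an illustration, not code from the talk: the function name `compatible_pairs` and the representation of a genotype as a list of sets are assumptions made here.

```python
from itertools import product

def compatible_pairs(genotype):
    """Enumerate the unordered haplotype pairs consistent with a genotype.

    genotype: list of sets, {0} or {1} at homozygous sites and {0, 1} at
    heterozygous sites (hypothetical representation chosen for this sketch).
    """
    het = [i for i, g in enumerate(genotype) if len(g) == 2]
    # Fix the phase of the first heterozygous site so that each unordered
    # pair is listed once; only the remaining heterozygous sites are free.
    free = het[1:]
    pairs = []
    for bits in product((0, 1), repeat=len(free)):
        h1 = [min(g) for g in genotype]  # allele 0 at heterozygous sites
        h2 = [max(g) for g in genotype]  # allele 1 at heterozygous sites
        for i, flip in zip(free, bits):
            if flip:
                h1[i], h2[i] = 1, 0
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

# Genotype with two heterozygous sites: two possible resolutions.
print(compatible_pairs([{0, 1}, {1}, {0, 1}, {0}]))
```

A genotype with k heterozygous sites has 2^(k-1) compatible pairs for k >= 1, which is why the reconstruction needs a probabilistic model to choose among them.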
9
Haplotype Reconstruction Problem (CS Perspective)
Input: A set G of genotypes Output: A set H of corresponding haplotype pairs such that
10
Population-based Haplotype Reconstruction
Given a model M for the distribution of haplotypes, can infer the most likely resolution under Hardy-Weinberg equilibrium
Need to estimate this model from available genotype data
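As a hedged illustration of what "most likely resolution" can mean: under Hardy-Weinberg equilibrium the two haplotypes of an individual are drawn independently, so a pair (h1, h2) can be scored by P(h1 | M) * P(h2 | M). The sketch below assumes the model M is simply a table of haplotype probabilities; the names `most_likely_resolution` and `hap_prob` and the example numbers are invented for illustration.

```python
def genotype_of(h1, h2):
    """Genotype produced by a haplotype pair: one unordered allele set per site."""
    return tuple(frozenset((a, b)) for a, b in zip(h1, h2))

def most_likely_resolution(genotype, hap_prob):
    """Return (pair, score) maximising P(h1) * P(h2) among compatible pairs.

    hap_prob: dict mapping haplotype tuples to probabilities (assumed model M).
    The factor 2 for the two orderings of a heterozygous pair is omitted,
    because it is the same for every candidate and does not change the argmax.
    """
    target = tuple(frozenset(g) for g in genotype)
    haps = sorted(hap_prob)
    best, best_score = None, 0.0
    for i, h1 in enumerate(haps):
        for h2 in haps[i:]:
            if genotype_of(h1, h2) != target:
                continue
            score = hap_prob[h1] * hap_prob[h2]
            if score > best_score:
                best, best_score = (h1, h2), score
    return best, best_score

# Made-up haplotype probabilities for a four-marker example.
hap_prob = {
    (0, 1, 1, 0): 0.4,
    (1, 1, 0, 0): 0.3,
    (0, 1, 0, 0): 0.2,
    (1, 1, 1, 0): 0.1,
}
pair, score = most_likely_resolution([{0, 1}, {1}, {0, 1}, {0}], hap_prob)
print(pair, score)
```

An explicit table like this only works for a handful of markers; the Markov chain models on the later slides are one way to represent the distribution compactly.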
11
Prior Work on Haplotype Reconstruction
Competitive application domain for several years: many systems developed, characterized by the statistical model and the learning/reconstruction algorithms employed
Special-purpose statistical models
  Approximate coalescent (PHASE 2001, 2003, 2005)
  Block-based (Gerbil 2004, 2005)
  Variable-length MC (HaploRec 2004, 2006)
  Founder-based (HIT 2005)
  Local clusters (fastPHASE 2006)
12
Prior Work on Haplotype Reconstruction
Special-purpose learning/reconstruction algorithms
  MCMC variant
  Approximate EM + partition ligation
  …
Our approach:
  Model haplotypes using (sparse) Markov chains
  Natural extension to a Hidden Markov Model on genotypes
  Directly learnable from genotype data (standard Baum-Welch)
13
Constrained HMMs for haplotyping
Modeling haplotypes
  Standard Markov chain
  More general: order-k Markov chain
[Figure: Markov chain over the marker positions, showing the path for haplotype 0,1,1,0]
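To make the "path for haplotype 0,1,1,0" concrete, here is a minimal sketch of a first-order Markov chain with position-specific transitions. The parameter layout (`init`, `trans`) and all numbers are assumptions for illustration; the order-k generalisation is not shown.

```python
def haplotype_probability(hap, init, trans):
    """Probability of a haplotype under a first-order Markov chain.

    init[a]        : probability of allele a at the first marker
    trans[t][a][b] : probability of allele b at marker t+1 given allele a
                     at marker t (one matrix per adjacent marker pair)
    """
    p = init[hap[0]]
    for t in range(1, len(hap)):
        p *= trans[t - 1][hap[t - 1]][hap[t]]
    return p

# Made-up parameters for four markers (three transition matrices).
init = [0.6, 0.4]
T = [[0.2, 0.8], [0.7, 0.3]]
trans = [T, T, T]

# Path 0 -> 1 -> 1 -> 0 multiplies 0.6 * 0.8 * 0.3 * 0.7.
print(haplotype_probability((0, 1, 1, 0), init, trans))
```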
14
Constrained HMMs for haplotyping
Modeling genotypes
  Hidden phase (order of pair): Hidden Markov Model
  States: pairs of states of the underlying Markov chain (state of the maternal/paternal sequence)
  Output symbol: unordered pair
  Path in the model: sample two haplotypes, output the corresponding genotype
  Have to enforce Hardy-Weinberg equilibrium
    Parameter tying constraints on transition probabilities
Algorithms
  Learning: standard Baum-Welch
  Reconstruction of most likely haplotype pair: Viterbi
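The construction above can be sketched as a small Viterbi decoder. This is a simplified illustration, not the authors' implementation: it assumes a first-order chain with hypothetical parameters `init` and `trans`, uses ordered allele pairs as states, and ties each pair transition to the product of two transitions of the same underlying chain.

```python
from itertools import product

def viterbi_phase(genotype, init, trans):
    """Most likely haplotype pair for a genotype under a pair-state HMM.

    States are ordered allele pairs (a, b); a state can emit genotype set g
    only if {a, b} == g. Pair transitions are products of two transitions of
    the same haplotype chain, which is the parameter tying in this sketch.
    Assumes the genotype has non-zero probability under the model.
    """
    states = list(product((0, 1), repeat=2))

    def emits(state, g):
        return set(state) == set(g)

    def step(prev, cur, t):
        return trans[t][prev[0]][cur[0]] * trans[t][prev[1]][cur[1]]

    delta = {s: (init[s[0]] * init[s[1]] if emits(s, genotype[0]) else 0.0)
             for s in states}
    backpointers = []
    for t in range(1, len(genotype)):
        new_delta, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda r: delta[r] * step(r, s, t - 1))
            ptr[s] = prev
            score = delta[prev] * step(prev, s, t - 1)
            new_delta[s] = score if emits(s, genotype[t]) else 0.0
        delta = new_delta
        backpointers.append(ptr)

    # Trace the best path back from the best final state.
    last = max(states, key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    h1 = tuple(s[0] for s in path)
    h2 = tuple(s[1] for s in path)
    return h1, h2, delta[last]

# Made-up parameters for three markers.
init = [0.5, 0.5]
trans = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.3, 0.7]]]
print(viterbi_phase([{0, 1}, {0, 1}, {1}], init, trans))
```

With a first-order chain, two heterozygous sites separated by a homozygous site receive the same score for both phasings, which is one way to see why longer histories matter.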
15
Constrained HMMs for haplotyping
Example: paths for genotype {0,1},{1},{0,1},{0}
16
Sparse Markov Modeling (SpaMM)
Higher-order models (long history) needed: exponential size of model
However, out of the possible history blocks, only few occur in data (conserved fragments)
Idea: sparse model, iterative structure learning algorithm to identify conserved fragments (Apriori-style)
  Initialize first-order-model()
  em-training( )
  repeat
    regularize-and-extend( )
    em-training( )
  until
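The Apriori-style idea can be illustrated on a toy example. The sketch below is a deliberate simplification and not the SpaMM procedure: it counts fragments in already known haplotypes, whereas the slides describe learning from genotypes with EM. The function name, the `min_count` threshold, and the data are hypothetical.

```python
from collections import Counter

def frequent_fragments(haplotypes, max_len, min_count):
    """Level-wise search for frequent haplotype fragments.

    haplotypes: equal-length strings over '0'/'1'.
    Returns {length: set of (start, fragment)} where each fragment occurs at
    that start position in at least min_count haplotypes.
    """
    n = len(haplotypes[0])
    counts = Counter((s, h[s]) for h in haplotypes for s in range(n))
    levels = {1: {f for f, c in counts.items() if c >= min_count}}
    for k in range(2, max_len + 1):
        prev = levels[k - 1]
        counts = Counter()
        for h in haplotypes:
            for s in range(n - k + 1):
                frag = h[s:s + k]
                # Apriori pruning: count a fragment only if both of its
                # shorter sub-fragments were frequent at the previous level.
                if (s, frag[:-1]) in prev and (s + 1, frag[1:]) in prev:
                    counts[(s, frag)] += 1
        levels[k] = {f for f, c in counts.items() if c >= min_count}
        if not levels[k]:
            break
    return levels

haplotypes = ["0110", "0110", "0111", "1001", "1001"]
levels = frequent_fragments(haplotypes, max_len=3, min_count=2)
# 4 of the 16 possible (start, fragment) combinations of length 3 survive.
print(sorted(levels[3]))
```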
17
SpaMM Model (order 1)
Initial model: standard Markov chain of order 1
Iteration: extend order of model by 1, prune unlikely paths
Avoids combinatorial explosion of model size
18
SpaMM Model (order 2)
Iteration: extend order of model by 1, prune unlikely paths
Avoids combinatorial explosion of model size
19
SpaMM Model (order 3)
Iteration: extend order of model by 1, prune unlikely paths
Avoids combinatorial explosion of model size
20
SpaMM Model (order 4)
Iteration: extend order of model by 1, prune unlikely paths
Avoids combinatorial explosion of model size
21
SpaMM Model (order 5)
Iteration: extend order of model by 1, prune unlikely paths
Avoids combinatorial explosion of model size
22
SpaMM Model (order 6)
Iteration: extend order of model by 1, prune unlikely paths
Avoids combinatorial explosion of model size
23
SpaMM Model (final)
Final model:
  Model structure encodes conserved fragments
  Concise representation of all haplotypes with non-zero probability
24
Experimental Evaluation
Real-world population data
  Correct haplotypes have been inferred from trios
  Daly dataset: 103 SNP markers for 174 individuals
  Yoruba population: 100 datasets, 500 SNP markers each, 60 individuals
Problem setting:
  Given the set of genotypes, the algorithm outputs the most likely haplotype pairs
  Difference to the real haplotype pairs is measured in switch distance (number of recombinations needed to transform the pairs, normalized)
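One common way to compute a switch distance is sketched below. The slide does not give the normalisation, so dividing by the number of heterozygous sites minus one is an assumption of this sketch, as are the function name and the string representation of haplotypes.

```python
def switch_distance(true_pair, pred_pair):
    """Normalised number of phase switches between predicted and true pairs.

    Both pairs must resolve the same genotype. Only heterozygous sites matter.
    A switch is counted whenever the predicted phase flips, relative to the
    truth, between two consecutive heterozygous sites.
    """
    t1, t2 = true_pair
    p1, _ = pred_pair
    het = [i for i in range(len(t1)) if t1[i] != t2[i]]
    if len(het) < 2:
        return 0.0  # nothing to phase
    # True where the first predicted haplotype follows the first true one.
    orient = [p1[i] == t1[i] for i in het]
    switches = sum(a != b for a, b in zip(orient, orient[1:]))
    return switches / (len(het) - 1)

# One switch among three adjacent heterozygous site pairs.
print(switch_distance(("0110", "1001"), ("0101", "1010")))
```

The measure does not depend on the order in which the two predicted haplotypes are listed, because swapping them flips every orientation at once.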
25
Results: Haplotype Reconstruction
Many well-engineered systems
  Smart priors, averaging over several random restarts of EM, ...
SpaMM: proof-of-concept implementation, not tuned
26
Results: Haplotype Reconstruction
PHASE most accurate, then fastPHASE, then SpaMM
  however, PHASE too slow for long maps
SpaMM beats fastPHASE without averaging
  overall, competitive accuracy
27
Results: Runtime
[Figure: runtime in seconds for phasing 100 markers (log scale)]
SpaMM scales linearly in the number of markers
  like fastPHASE, HaploRec, HIT
  unlike PHASE, Gerbil
28
Results: Genotype imputation
Most haplotyping methods can also predict missing genotype values
For SpaMM, these can be read off the Viterbi path
29
Results: Genotype imputation
fastPHASE best known method
Again, SpaMM beats fastPHASE without averaging
30
Conclusions
SpaMM: new haplotyping method
  Sparse Markov chains to encode conserved haplotype fragments
  Constrained HMM for modeling genotypes
  Apriori-style structure learning algorithm
  Simple, accurate, interpretable output
Future work
  Accuracy can probably be improved using standard techniques (EM random restarts, averaging, ...)
31
Thanks!