Constrained Hidden Markov Models for Population-based Haplotyping


Constrained Hidden Markov Models for Population-based Haplotyping
Niels Landwehr
Joint work with Taneli Mielikäinen, Lauri Eronen, Hannu Toivonen, Heikki Mannila
University of Freiburg / University of Helsinki
Application of Probabilistic ILP II, FP6-508861, www.aprill.org

Outline
- Population-based haplotype reconstruction
  - Infer haplotypes from genotypes: reconstruct the hidden phase of genetic data
  - Important problem in biology and medicine, e.g. disease association studies
- An approach using constrained HMMs
  - Sparse Markov chains to represent conserved haplotype fragments
  - HMM that can be learned directly from genotype data
- Experimental results

Human Genome and SNPs
[Figure: DNA sequences of individuals 1 to 6, e.g. ...GATATTCGTACGGATGTTTCCA... and ...GATGTTCGTACTGATGTCTCCA..., with three SNP (marker) positions highlighted.]

Haplotypes
[Figure: for individuals 1 to 6, the alleles at the three SNPs of each DNA sequence form a haplotype, e.g. AGT and GTC.]

Haplotypes
[Figure: the same haplotypes in binary encoding, e.g. 101 and 010, for individuals 1 to 6.]

Why Haplotypes?
- Haplotypes
  - define our genetic individuality
  - contribute to risk factors of complex diseases (e.g., diabetes)
- Disease association studies (gene mapping)
  - Find genetic differences between a case and a control population
  - Identifying the SNPs responsible for a disease might help find a cure
- Also useful for linkage disequilibrium studies
  - Summarize genetic variation
  - Understand the evolution of human populations

The problem: haplotypes are not directly observable
- Wet lab: only genotype information is available (two alleles for each SNP, but the chromosome of origin is unknown)
[Figure: paternal and maternal chromosomes; only the genotype, e.g. {0,1} {0} {1}, is observed.]

Population-based Haplotype Reconstruction
- Given the genotypes of several individuals, infer for every individual the most likely underlying haplotype pair
- Hidden data reconstruction problem using a probabilistic model: exploit patterns in the haplotypes (linkage disequilibrium)
[Figure: haplotype pair and corresponding genotype for Individual 1, Individual 2, Individual 3, ...]
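To make the relation between haplotype pairs and genotypes concrete, here is a minimal Python sketch. It assumes biallelic markers coded 0/1 and no missing data; the function names `genotype_of` and `consistent_pairs` are illustrative and do not come from the slides.

```python
from itertools import product

def genotype_of(h1, h2):
    """Genotype of a haplotype pair: one unordered allele pair per marker."""
    return tuple(frozenset((a, b)) for a, b in zip(h1, h2))

def consistent_pairs(genotype):
    """Enumerate all unordered haplotype pairs that yield the given genotype."""
    # Heterozygous markers are the only positions where the phase is ambiguous.
    het = [i for i, g in enumerate(genotype) if len(g) == 2]
    # Homozygous markers fix the allele; heterozygous ones are overwritten below.
    base = [min(g) for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for i, b in zip(het, bits):
            h1[i], h2[i] = b, 1 - b
        # frozenset drops the order of the pair (paternal/maternal is unknown).
        pairs.add(frozenset((tuple(h1), tuple(h2))))
    return pairs

g = genotype_of((0, 1, 1, 0), (1, 1, 0, 0))
for pair in consistent_pairs(g):
    print(sorted(pair))
```

A genotype with k heterozygous markers has 2^(k-1) consistent unordered pairs, which is why a probabilistic model is needed to choose among them.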

Haplotype Reconstruction Problem (CS Perspective)
- Input: a set G of genotypes
- Output: a set H of corresponding haplotype pairs such that [formula not preserved in the transcript]

Population-based Haplotype Reconstruction
- Given a model M for the distribution of haplotypes, one can infer the most likely resolution of a genotype under Hardy-Weinberg equilibrium [formula not preserved in the transcript]
- Need to estimate this model from the available genotype data
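The slide's formula did not survive extraction. A plausible reading, written in assumed notation (g for a genotype, h^1 and h^2 for the two haplotypes), is that Hardy-Weinberg equilibrium lets the pair probability factorize, and the most likely resolution maximizes that product over all pairs consistent with g:

```latex
% Assumed notation, not taken from the slide:
% g is a genotype, (h^1, h^2) a haplotype pair, M the haplotype model.
P(h^1, h^2 \mid M) = P(h^1 \mid M)\, P(h^2 \mid M)
\qquad \text{(Hardy-Weinberg equilibrium)}

(\hat{h}^1, \hat{h}^2)
  = \operatorname*{arg\,max}_{(h^1, h^2)\ \text{consistent with}\ g}
    P(h^1 \mid M)\, P(h^2 \mid M)
```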

Prior Work on Haplotype Reconstruction
- Competitive application domain for several years: many systems developed, characterized by the statistical model and the learning/reconstruction algorithms employed
- Special-purpose statistical models
  - Approximate coalescent (PHASE 2001, 2003, 2005)
  - Block-based (Gerbil 2004, 2005)
  - Variable-length Markov chain (HaploRec 2004, 2006)
  - Founder-based (HIT 2005)
  - Local clusters (fastPHASE 2006)

Prior Work on Haplotype Reconstruction
- Special-purpose learning/reconstruction algorithms
  - MCMC variant
  - Approximate EM + partition ligation
  - …
- Our approach
  - Model haplotypes using (sparse) Markov chains
  - Natural extension to a Hidden Markov Model on genotypes
  - Directly learnable from genotype data (standard Baum-Welch)

Constrained HMMs for haplotyping
- Modeling haplotypes
  - Standard Markov chain
  - More general: order-k Markov chain
[Figure: Markov chain with the path for haplotype 0,1,1,0 highlighted.]
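As an illustration of the first-order case, the following sketch scores a haplotype as a path through a position-specific Markov chain. The parameter layout (`init`, `trans`) and the numbers are assumptions made for the example, not values from the talk.

```python
from itertools import product

def haplotype_probability(h, init, trans):
    """P(h) under a first-order, position-specific Markov chain.

    init[a]        = P(allele a at the first marker)
    trans[t][a][b] = P(allele b at marker t+1 | allele a at marker t)
    """
    p = init[h[0]]
    for t in range(len(h) - 1):
        p *= trans[t][h[t]][h[t + 1]]
    return p

# Hypothetical parameters for four biallelic markers.
init = [0.5, 0.5]
trans = [
    [[0.2, 0.8], [0.6, 0.4]],
    [[0.5, 0.5], [0.1, 0.9]],
    [[0.7, 0.3], [0.4, 0.6]],
]
print(haplotype_probability((0, 1, 1, 0), init, trans))

# Sanity check: the probabilities of all 2^4 haplotypes sum to 1.
total = sum(haplotype_probability(h, init, trans)
            for h in product((0, 1), repeat=4))
print(total)
```

An order-k chain would condition each allele on the previous k alleles instead of one, at the cost of up to 2^k histories per marker.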

Constrained HMMs for haplotyping
- Modeling genotypes
  - Hidden phase (order of pair): Hidden Markov Model
  - States: pairs of states of the underlying Markov chain (state of the maternal/paternal sequence)
  - Output symbol: unordered pair
  - Path in the model: sample two haplotypes, output the corresponding genotype
  - Have to enforce Hardy-Weinberg equilibrium: parameter-tying constraints on transition probabilities
- Algorithms
  - Learning: standard Baum-Welch
  - Reconstruction of the most likely haplotype pair: Viterbi
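The pair-state construction can be sketched for the simplest case: a first-order chain over biallelic markers with no missing data. In this sketch (not the authors' implementation) a state is an ordered allele pair, it emits the unordered pair deterministically, and each pair transition is the product of two haplotype transitions that share the same parameters, which is one way to express the tying constraint. The name `viterbi_phase` and the parameter values are hypothetical.

```python
from itertools import product

def viterbi_phase(genotype, init, trans):
    """Most likely ordered haplotype pair for a genotype (first-order sketch).

    genotype: sequence of allele sets, e.g. [{0, 1}, {0, 1}, {1}]
    init, trans: Markov chain parameters shared by both haplotypes
    """
    def states(g):
        # Ordered allele pairs (a, b) whose unordered pair matches the genotype.
        return [(a, b) for a, b in product((0, 1), repeat=2) if {a, b} == set(g)]

    def step(t, r, s):
        # Tied parameters: both haplotypes use the same transition table.
        return trans[t][r[0]][s[0]] * trans[t][r[1]][s[1]]

    score = {s: init[s[0]] * init[s[1]] for s in states(genotype[0])}
    back = []
    for t in range(1, len(genotype)):
        new_score, ptr = {}, {}
        for s in states(genotype[t]):
            prev = max(score, key=lambda r: score[r] * step(t - 1, r, s))
            new_score[s] = score[prev] * step(t - 1, prev, s)
            ptr[s] = prev
        back.append(ptr)
        score = new_score

    # Trace back the best path and split it into the two haplotypes.
    last = max(score, key=score.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return tuple(s[0] for s in path), tuple(s[1] for s in path), score[last]

init = [0.5, 0.5]
trans = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.3, 0.7]]]
h1, h2, p = viterbi_phase([{0, 1}, {0, 1}, {1}], init, trans)
print(h1, h2, p)
```

With a first-order chain, two heterozygous markers separated by a homozygous one always give tied scores for both phasings, since the chain forgets the phase at the homozygous marker. This is one way to see why longer histories matter for this task.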

Constrained HMMs for haplotyping
- Example: paths for genotype {0,1},{1},{0,1},{0}

Sparse Markov Modeling (SpaMM)
- Higher-order models (long history) are needed, but the model size grows exponentially with the order
- However, out of the possible history blocks, only a few occur in the data (conserved fragments)
- Idea: sparse model with an iterative, Apriori-style structure learning algorithm to identify conserved fragments

    Initialize first-order-model()
    em-training( )
    repeat
        regularize-and-extend( )
        em-training( )
    until
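The level-wise "extend, then prune" idea can be illustrated with a much simpler stand-in than SpaMM itself. The sketch below works on already phased haplotypes and uses plain frequency counts, whereas SpaMM learns from genotypes with EM and its own regularization. The function name, the `min_count` threshold and the `max_order` cap are all assumptions for the example.

```python
from collections import Counter

def frequent_fragments(haplotypes, max_order, min_count):
    """Level-wise search for frequent haplotype fragments (Apriori-style).

    A fragment is (start_marker, alleles). Returns {length: set of fragments}.
    """
    n = len(haplotypes[0])
    levels = {}
    # Level 1: single alleles at each marker.
    counts = Counter((i, (h[i],)) for h in haplotypes for i in range(n))
    current = {f for f, c in counts.items() if c >= min_count}
    length = 1
    while current and length <= max_order:
        levels[length] = current
        # Extend every surviving fragment by one marker to the right.
        counts = Counter()
        for h in haplotypes:
            for start, alleles in current:
                end = start + length
                if end < n and tuple(h[start:end]) == alleles:
                    counts[(start, alleles + (h[end],))] += 1
        # Prune extensions that are too rare to keep in the model.
        current = {f for f, c in counts.items() if c >= min_count}
        length += 1
    return levels

haps = [(0, 1, 1, 0)] * 3 + [(1, 1, 0, 0)] * 3 + [(0, 1, 0, 0)]
levels = frequent_fragments(haps, max_order=3, min_count=2)
print(sorted(levels[3]))
```

Because only fragments that survived the previous level are extended, the number of candidates stays bounded by what occurs in the data rather than by 2^k.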

SpaMM Model (order 1)
- Initial model: standard Markov chain of order 1
- Iteration: extend the order of the model by 1, prune unlikely parts
- Avoids combinatorial explosion of model size

SpaMM Model (orders 2 to 6)
- Iteration: extend the order of the model by 1, prune unlikely paths
- Avoids combinatorial explosion of model size
[Figures: one slide per order, showing the model after each iteration from order 2 to order 6.]

SpaMM Model (final)
- Final model: the model structure encodes conserved fragments
- Concise representation of all haplotypes with non-zero probability

Experimental Evaluation
- Real-world population data
  - Correct haplotypes have been inferred from trios
  - Daly dataset: 103 SNP markers for 174 individuals
  - Yoruba population: 100 datasets, 500 SNP markers each, 60 individuals
- Problem setting
  - Given the set of genotypes, the algorithm outputs the most likely haplotype pairs
  - The difference to the real haplotype pairs is measured in switch distance (number of recombinations needed to transform one pair into the other, normalized)
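Switch distance can be computed by comparing the phase at consecutive heterozygous markers. The sketch below assumes the predicted pair is consistent with the true genotype. The slide does not state the normalization, so dividing by the number of heterozygous markers minus one is an assumption here.

```python
def switch_distance(pred, true):
    """Switch distance between a predicted and a true haplotype pair.

    pred, true: (h1, h2) tuples over the same markers.
    Returns (switches, normalized); the normalization by (#het markers - 1)
    is an assumed convention, not taken from the slides.
    """
    p1, _ = pred
    t1, t2 = true
    # Only heterozygous markers carry phase information.
    het = [i for i in range(len(t1)) if t1[i] != t2[i]]
    # At each such marker: does predicted h1 carry the allele of true h1?
    aligned = [p1[i] == t1[i] for i in het]
    # A switch is needed wherever the alignment flips between neighbours.
    switches = sum(a != b for a, b in zip(aligned, aligned[1:]))
    denom = len(het) - 1
    return switches, (switches / denom if denom > 0 else 0.0)

true = ((0, 1, 1, 0), (1, 0, 0, 1))
pred = ((0, 1, 0, 1), (1, 0, 1, 0))
print(switch_distance(pred, true))
```

Swapping the two haplotypes of a pair leaves the distance unchanged, which matches the fact that the order within a pair carries no meaning.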

Results: Haplotype Reconstruction
- Many well-engineered systems: smart priors, averaging over several random restarts of EM, ...
- SpaMM: proof-of-concept implementation, not tuned

Results: Haplotype Reconstruction
- PHASE is most accurate, then fastPHASE, then SpaMM; however, PHASE is too slow for long maps
- SpaMM beats fastPHASE without averaging
- Overall, competitive accuracy

Results: Runtime
- Runtime in seconds for phasing 100 markers (log. scale)
- SpaMM scales linearly in the number of markers, like fastPHASE, HaploRec and HIT, and unlike PHASE and Gerbil

Results: Genotype imputation
- Most haplotyping methods can also predict missing genotype values
- For SpaMM, these can be read off the Viterbi path

Results: Genotype imputation
- fastPHASE is the best known method
- Again, SpaMM beats fastPHASE without averaging

Conclusions
- SpaMM: new haplotyping method
  - Sparse Markov chains to encode conserved haplotype fragments
  - Constrained HMM for modeling genotypes
  - Apriori-style structure learning algorithm
  - Simple, accurate, interpretable output
- Future work
  - Accuracy can probably be improved using standard techniques (EM random restarts, averaging, ...)

Thanks!