Sharlee Climer, Alan R. Templeton, and Weixiong Zhang

Presentation transcript:

SplittingHeirs: Inferring Haplotypes by Optimizing Resultant Dense Graphs. Sharlee Climer, Alan R. Templeton, and Weixiong Zhang. ACM-BCB, Niagara Falls, August 2010.

Overview: Introduction; Definition of haplotype inference problem; Previous approaches; SplittingHeirs; Experimental results.

Introduction: Only 0.1% of human DNA has variation. Most of this variation is due to Single Nucleotide Polymorphisms (SNPs). Most SNPs have only two variants, or alleles, within a population. Broad definition of haplotype: a set of alleles for a given set of SNPs in relatively close proximity on a chromosome. Image source: http://www.dnabaser.com/articles/SNP/SNP-Single-nucleotide-polymorphism.png

Introduction: DNA is transcribed to produce RNA. RNA is translated, ultimately producing proteins. Variation in non-coding regions might have an effect on regulation. SNPs throughout the genome may be of interest. Image source: http://www.cytochemistry.net/cell-biology/ribosome.htm

Introduction: Humans are diploid: chromosomes come in pairs. Common sequencing produces a meld of the two haplotypes, referred to as a genotype. Computational methods are used to infer a pair of haplotypes from a genotype; this is called phasing the genotype. Example: the same genotype could arise from the haplotype pair GTAC + CTAG or from the pair CTAC + GTAG. [Figure: alleles observed at SNP1 and SNP2]

Importance of accuracy when inferring haplotypes from genotypes: Frequently an early step in expensive and vitally important studies. [Figure: alleles at SNP1 and SNP2]

Introduction: It is possible to identify the separate haplotypes directly, but this is only feasible for very small studies. It is useful for testing the accuracy of computational methods. Andres et al. [Genet. Epi. 2007] found that computational methods had poor accuracy and that confidence levels were error prone. Methods: PHASE [Stephens et al., AJHG 2001]; fastPHASE [Scheet and Stephens, AJHG 2006]; HAP [Halperin and Eskin, Bioinformatics 2004]; GERBIL [Kimmel and Shamir, PNAS 2005]. Errors in confidence levels suggest that the models might not fully capture biological properties.

Problem Definition: Let ‘0’ and ‘1’ represent the two possible alleles for a given SNP. A haplotype is represented by a string of binary values. The genotype for a pair of haplotypes is ‘0’ at a site if both alleles are ‘0’, ‘1’ if both alleles are ‘1’, and ‘2’ if the site is heterozygous. Example: haplotypes GTAC and CTAG are encoded as 1100 and 0101, giving the genotype 2102.
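The encoding rule on this slide can be sketched in a few lines of Python. This is only an illustration of the 0/1/2 convention; the function name genotype_from_haplotypes is an assumption made here and does not come from the presentation.

```python
def genotype_from_haplotypes(hap_a: str, hap_b: str) -> str:
    """Combine two binary haplotype strings into a 0/1/2 genotype string.

    Hypothetical helper for illustration: a site is '0' or '1' when both
    haplotypes agree, and '2' when they differ (heterozygous).
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same number of sites")
    return "".join(a if a == b else "2" for a, b in zip(hap_a, hap_b))


# The slide's example: 1100 (GTAC) and 0101 (CTAG) give genotype 2102.
print(genotype_from_haplotypes("1100", "0101"))
```

The result does not depend on the order of the two haplotypes, which is why a genotype alone cannot reveal the phase.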

Problem Definition: For k heterozygous sites, there are 2^(k-1) feasible solutions. It is not apparent which solution is more likely than another. Population-level characteristics: there tend to be relatively few unique haplotypes; there tend to be clusters of haplotypes that are similar to each other; some haplotypes are relatively common.
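The count of feasible solutions can be checked by brute force. The sketch below enumerates every unordered haplotype pair consistent with a genotype; the name feasible_phasings is hypothetical, and fixing the first heterozygous site to '0' in the first haplotype is simply one way to avoid listing each pair twice.

```python
from itertools import product


def feasible_phasings(genotype: str) -> list[tuple[str, str]]:
    """List all unordered haplotype pairs that could produce `genotype`."""
    het_sites = [i for i, c in enumerate(genotype) if c == "2"]
    if not het_sites:
        # Fully homozygous genotype: both haplotypes equal the genotype.
        return [(genotype, genotype)]
    pairs = []
    # Pin the first heterozygous site so (a, b) and (b, a) are not both listed.
    for rest in product("01", repeat=len(het_sites) - 1):
        first, second = list(genotype), list(genotype)
        for site, bit in zip(het_sites, ("0",) + rest):
            first[site] = bit
            second[site] = "1" if bit == "0" else "0"
        pairs.append(("".join(first), "".join(second)))
    return pairs


# Genotype 2102 has k = 2 heterozygous sites, so 2^(2-1) = 2 phasings.
for pair in feasible_phasings("2102"):
    print(pair)
```

For a genotype with six heterozygous sites, such as 2222 2121, the same function returns 32 pairs, which shows how quickly the search space grows.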

Problem Definition: Given a set of genotypes drawn from a population: 1) Find the set of haplotypes that exist in the set. 2) For each genotype, determine the pair of haplotypes that is most likely to exist in the given individual. Image source: http://www.samepoint.com/blog/wp-content/uploads/2009/04/blog_group_of_people_1.jpg

Example: Example problem with 5 individuals and 8 SNP sites:
g1: 1111 0001
g2: 2212 0202
g3: 2220 2102
g4: 2222 2121
g5: 2022 0222
Solutions are displayed as graphs. Each node represents a unique haplotype. Edge weight is a measure of difference between haplotypes, set equal to the number of sites that differ between the haplotypes. Edges with smallest distances are shown.
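The edge weight described here is the Hamming distance between two haplotype strings. A minimal sketch follows; the helper names are assumptions, and the second haplotype used below is just one feasible phasing of g2 paired with g1's haplotype, not necessarily the one any of the methods in the talk would choose.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of sites at which two equal-length haplotypes differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def edge_weights(haplotypes: list[str]) -> dict[tuple[str, str], int]:
    """Weight for every pair of unique haplotypes (one node per haplotype)."""
    nodes = sorted(set(haplotypes))
    return {(a, b): hamming(a, b) for a, b in combinations(nodes, 2)}


# g1 is homozygous, so its haplotype is 11110001. One feasible phasing of
# g2 (2212 0202) pairs that haplotype with 00100100.
print(edge_weights(["11110001", "00100100", "11110001"]))
```

Duplicate haplotypes collapse into a single node, which matches the slide's statement that each node represents a unique haplotype.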

Example: Solution found by Clark’s Subtraction Method [Mol. Biol. and Evol. 1990], Pure Parsimony [Gusfield, CPM’03], and EM [Excoffier and Slatkin, Mol. Biol. Evol. 1995]: 5 unique haplotypes. Haplotypes are not very similar to each other.

Example: No Perfect Phylogeny solution. Solution found by HAP: 6 unique haplotypes. Haplotypes are slightly more similar to each other.

Example: Solution found by PHASE: 9 unique haplotypes. Haplotypes are more similar to each other.

Example: PHASE favors pair-wise similarities, essentially evaluating a nearest-neighbor graph.

SplittingHeirs: SplittingHeirs favors cluster-wide similarities, as well as reduced cardinality. Cast as a Mixed Integer Linear Program (MIP). Minimize: [objective formula not transcribed], where d_i = the weight of edge i, h = the cardinality of the haplotype set, and u = a weighting factor.

SplittingHeirs: Enforce cluster-wide similarities by requiring a minimum density of edges in the graph. Additional constraint: [constraint formula not transcribed], where e = number of edges and a is a configurable parameter. a can be decreased for a highly diverse sample and increased for a sample with low diversity.
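The transcript does not include the constraint itself, so the sketch below only illustrates the general idea of edge density in a haplotype graph. It uses the standard graph-theoretic density (edges present divided by possible edges). The max_distance rule for deciding which edges exist and the function name graph_density are assumptions made for this illustration, not the formulation used in SplittingHeirs.

```python
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of differing sites between two equal-length haplotypes."""
    return sum(x != y for x, y in zip(a, b))


def graph_density(haplotypes: list[str], max_distance: int) -> float:
    """Fraction of possible node pairs joined by an edge.

    Assumption for illustration: an edge joins two unique haplotypes when
    their Hamming distance is at most `max_distance`.
    """
    nodes = sorted(set(haplotypes))
    h = len(nodes)  # cardinality of the haplotype set
    if h < 2:
        return 0.0
    e = sum(1 for a, b in combinations(nodes, 2) if hamming(a, b) <= max_distance)
    return e / (h * (h - 1) / 2)


# Toy haplotype set: only closely related haplotypes are connected.
print(graph_density(["0000", "0001", "0011", "1111"], max_distance=1))
```

Under these assumptions, a candidate solution could be compared against a chosen minimum density, with a lower minimum for diverse samples and a higher one for samples with low diversity, in the spirit of the parameter a described on the slide.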

Example: Solution found by SplittingHeirs: 8 unique haplotypes. Haplotypes are quite similar to each other.

Results: Tested on 7 sets of haplotype data for which the true phase is known. n is the number of individuals, m is the number of sites, and # Ambiguous is the number of genotypes that have more than one feasible solution.

Results

Results

Conclusions: Introduced a biologically intuitive model that optimizes cluster-wide similarities and reduced cardinality. Globally optimal solutions can be computed for small regions, such as candidate locus studies. Future work: speed up computation; use the model to guide an approximation method. Image source: http://farm3.static.flickr.com/2268/2255581637_a59a956bfe.jpg

Acknowledgments: Olin Fellowship. NIH grants P50-GM065509, R01-GM087194A2, U01-GM063340. NSF grants IIS-053557, DBI-0743797. Alzheimer’s Association grant. Thanks to: Taylor Maxwell, Gerold Jaeger.