Combinatorial Algorithms for Haplotype Inference Pure Parsimony Dan Gusfield.

SNP Data A SNP is a Single Nucleotide Polymorphism - a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more). SNP maps have been compiled with a density of about 1 site per [...]. SNP data is what is mostly collected in populations - it is much cheaper to collect than full sequence data, and focuses on variation in the population, which is what is of interest.

Genotypes and Haplotypes Each individual has two “copies” of each chromosome. At each site, each chromosome has one of two alleles (states) denoted by 0 and 1 (motivated by SNPs). Each individual therefore has two haplotypes; merging the two haplotypes gives the genotype for the individual.
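The merge step can be sketched in a few lines of Python. This is a minimal illustration, assuming the common 0/1/2 encoding in which a site where the two haplotypes differ is written as 2 (the later slides refer to the "2's" of a genotype, which suggests this encoding). The function name `merge` and the sample haplotypes are made up for the example.

```python
def merge(h1, h2):
    """Merge two 0/1 haplotype strings into a genotype string over 0/1/2."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same number of sites")
    # Sites where both copies agree keep that allele; sites where they
    # differ are ambiguous and are written as 2 (assumed encoding).
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


# Hypothetical haplotypes, not taken from the slides.
print(merge("01101", "00111"))  # 02121
```

The genotype records which sites differ but not which allele sits on which copy, and that lost information is what haplotype inference has to recover.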

Haplotype Map Project: HAPMAP NIH-led project ($100M) to find common haplotypes in the human population. Used to try to associate genetically influenced diseases with specific haplotypes, to either find causal haplotypes, or to find the region near causal mutations. Haplotyping individuals is expensive.

Haplotyping Problem Biological Problem: For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is hard to collect. Genotype data is easy to collect. Computational Problem: Given a set of n genotypes, determine the original set of n haplotype pairs that generated the n genotypes. This is hopeless without a genetic model.

The Pure Parsimony Objective For a set of genotypes, find a smallest set H of haplotypes, such that each genotype can be explained by a pair of haplotypes in H. For each genotype G in the input, assign a pair of haplotypes in H to explain G. The Pure Parsimony Objective reflects simple genetic models of how haplotypes evolve in a population.

Pure Parsimony Explain the N genotypes using the fewest distinct haplotypes. Why? Empirically, few haplotypes are seen in populations; coalescent theory; analogy to other parsimony criteria; common attempts to interpret Clark’s haplotype inferral method as a parsimony method; many methods are to a large extent hidden parsimony methods (e.g. PHASE).

Example of Parsimony [Figure: example genotypes explained by haplotype pairs] The set S of distinct haplotypes has size 3.

Pure Parsimony is NP-hard Earl Hubbell (Affymetrix) showed that Pure Parsimony is NP-hard. However, for a range of parameters of current interest (50 sites and 50 genotypes) a True Parsimony solution can be computed efficiently, using Integer Linear Programming and two speed-up tricks. For larger parameters (100 sites and 50 genotypes) a near-parsimony solution can be found efficiently.

Why I did this work I wanted to answer two questions: First, can a pure parsimony solution be computed efficiently for a range of problem sizes of current interest in biology? Second, how accurate is the pure parsimony solution, compared to the correct solution (in simulations and in the available real data), and compared to solutions given by other existing computational methods such as PHASE. Accuracy is measured by the number of genotypes whose originating pair of haplotypes are returned in the solution.

The Conceptual Integer Programming Formulation For each genotype (individual) j, create one integer programming variable Yij for each pair of haplotypes whose merge creates genotype j. If j has k 2’s, then this creates 2^(k-1) Y variables. Create one integer programming variable Xq for each distinct haplotype q that appears in one of the pairs for a Y variable.
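To make the 2^(k-1) count concrete, the sketch below enumerates every unordered haplotype pair that explains a genotype. It assumes genotypes are strings over 0/1/2 with 2 marking an ambiguous site. The function name `explanations` and the sample genotype are illustrative, not from the slides.

```python
from itertools import product


def explanations(genotype):
    """All unordered haplotype pairs whose merge gives `genotype` (0/1/2 string)."""
    het = [i for i, s in enumerate(genotype) if s == "2"]
    pairs = []
    # Pin the first ambiguous site to 0 in the first haplotype so that a pair
    # and its mirror image are not both generated: 2^(k-1) pairs for k >= 1.
    for bits in product("01", repeat=max(len(het) - 1, 0)):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, ("0",) + bits):
            h1[site] = b
            h2[site] = "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


# Hypothetical genotype with k = 2 ambiguous sites: 2^(2-1) = 2 pairs.
for pair in explanations("0212"):
    print(pair)
# ('0010', '0111')
# ('0011', '0110')
```

A genotype with no 2's has exactly one explanation, the haplotype paired with itself, which the sketch returns as a single pair. Each pair returned here would correspond to one Y variable.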

Conceptual IP For each genotype, create an equality that says that exactly one of its Y variables must be set to 1. For each variable Yij, whose two haplotypes are given variables Xq and Xq’, include an inequality that says that if variable Yij is set to 1, then both variables Xq and Xq’ must be set to 1. The objective function is then to minimize the sum of the X variables.
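The meaning of the formulation can be checked on tiny inputs without an IP solver. The sketch below is not the talk's method: it swaps integer programming for an exhaustive search over the same feasible set. It picks exactly one explaining pair per genotype (one Y set to 1), treats every haplotype in a chosen pair as an X forced to 1, and keeps the choice with the fewest distinct haplotypes. The function names and the three genotypes are made up, and the search is exponential, so it only illustrates the objective.

```python
from itertools import product


def explanations(genotype):
    """All unordered haplotype pairs whose merge gives `genotype` (0/1/2 string)."""
    het = [i for i, s in enumerate(genotype) if s == "2"]
    pairs = []
    for bits in product("01", repeat=max(len(het) - 1, 0)):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, ("0",) + bits):
            h1[site] = b
            h2[site] = "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


def pure_parsimony(genotypes):
    """Exhaustive stand-in for the IP: returns (haplotype set, chosen pairs)."""
    options = [explanations(g) for g in genotypes]
    best_set, best_choice = None, None
    # One pair per genotype mirrors "exactly one Y variable is set to 1".
    for choice in product(*options):
        # Haplotypes in the chosen pairs are the X variables forced to 1.
        used = {h for pair in choice for h in pair}
        if best_set is None or len(used) < len(best_set):
            best_set, best_choice = used, choice
    return best_set, best_choice


# Hypothetical input, not the example from the slides.
haps, assignment = pure_parsimony(["02", "20", "22"])
print(sorted(haps))   # ['00', '01', '10']
print(assignment[2])  # ('01', '10')
```

Here the genotype 22 could also be explained by 00 and 11, but that choice needs a fourth haplotype, so the parsimony objective prefers the pair that reuses 01 and 10.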

Example Creates a Y variable Y1 for pair X1, X2 and a Y variable Y2 for pair X3, X4. Include the following (in)equalities in the IP: Y1 + Y2 = 1; Y1 - X1 <= 0; Y1 - X2 <= 0; Y2 - X3 <= 0; Y2 - X4 <= 0. The objective function will include the subexpression X1 + X2 + X3 + X4, but any X variable is included exactly once no matter how many Y variables it is associated with.
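The (in)equalities of this example can be generated mechanically. The sketch below takes, for each genotype, its list of explaining haplotype pairs and writes out the constraints as text. The genotype on the original slide is not preserved, so the genotype 0212 and its pairs are a hypothetical stand-in, and the function name and the global Y numbering are choices made for this illustration only.

```python
def conceptual_ip(pairs_by_genotype):
    """Return (objective, constraints) of the conceptual IP as plain strings."""
    x_index = {}      # haplotype -> number of its X variable, shared by all genotypes
    constraints = []
    y_count = 0
    for pairs in pairs_by_genotype.values():
        y_names, links = [], []
        for h1, h2 in pairs:
            y_count += 1
            y_names.append(f"Y{y_count}")
            # dict.fromkeys drops the duplicate when a pair is (h, h).
            for h in dict.fromkeys((h1, h2)):
                x = x_index.setdefault(h, len(x_index) + 1)
                links.append(f"Y{y_count} - X{x} <= 0")
        # Exactly one explanation is selected for this genotype.
        constraints.append(" + ".join(y_names) + " = 1")
        constraints.extend(links)
    # Each X variable appears once in the objective, however many Ys use it.
    objective = " + ".join(f"X{x}" for x in x_index.values())
    return objective, constraints


# Hypothetical genotype 0212 with its two explaining pairs.
objective, constraints = conceptual_ip({"0212": [("0010", "0111"), ("0011", "0110")]})
print("minimize", objective)
for c in constraints:
    print(c)
```

For this input the printed constraints match the ones listed on the slide: Y1 + Y2 = 1 followed by the four inequalities linking Y1 to X1, X2 and Y2 to X3, X4.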

Efficiency Tricks Ignore any Y variable and its two X variables if those X variables are associated with no other Y variable. The resulting IP is much smaller, and can be used to find the optimal solution to the conceptual IP. Also, we need not enumerate all X pairs for a given genotype, but can efficiently recognize the pairs we need.
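The first trick can be sketched as a filter over the candidate pairs. This is a rough illustration under one reading of the slide: a pair is dropped when neither of its haplotypes occurs in any other pair. The slide does not spell out how genotypes left with no pairs are accounted for when recovering the optimum, and the sketch does not attempt that step. The function name and the input data are hypothetical.

```python
from collections import Counter


def prune_pairs(pairs_by_genotype):
    """Drop pairs (Y variables) whose haplotypes occur in no other pair."""
    # Count how many pairs each haplotype takes part in.
    count = Counter()
    for pairs in pairs_by_genotype.values():
        for h1, h2 in pairs:
            for h in {h1, h2}:
                count[h] += 1
    # Keep a pair only if at least one of its haplotypes is shared.
    return {
        g: [(h1, h2) for h1, h2 in pairs if count[h1] > 1 or count[h2] > 1]
        for g, pairs in pairs_by_genotype.items()
    }


# Hypothetical candidate pairs for two genotypes.
candidates = {
    "222": [("000", "111"), ("001", "110"), ("010", "101"), ("011", "100")],
    "002": [("000", "001")],
}
print(prune_pairs(candidates))
# {'222': [('000', '111'), ('001', '110')], '002': [('000', '001')]}
```

In this made-up input only 000 and 001 appear in more than one pair, so two of the four pairs for genotype 222 disappear along with their X variables.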

Avoiding Enumeration of unneeded haplotypes For each pair of genotypes G1, G2, it is easy to find all the haplotypes that appear in an explanation for G1 and in an explanation for G2. Example: V 0 2 V, and then generate all combinations of 0’s and 1’s over the V sites. So the time is O(m x number of haplotypes in both explanation sets).
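One way to read this slide is sketched below. A haplotype can take part in an explanation of a genotype exactly when it agrees with the genotype at every site that is not 2. So for two genotypes, a site that is 2 in both is free (marked V here), a site that is 0 or 1 in either genotype is fixed to that value, and two different fixed values at the same site mean no haplotype is shared. The function name and the sample genotypes are assumptions made for illustration; the example on the original slide is not fully preserved.

```python
from itertools import product


def shared_haplotypes(g1, g2):
    """Haplotypes that appear in some explanation of g1 and of g2."""
    template = []
    for a, b in zip(g1, g2):
        if a == "2" and b == "2":
            template.append("V")   # free site: either allele fits both genotypes
        elif a == "2":
            template.append(b)     # g2 fixes the allele
        elif b == "2" or a == b:
            template.append(a)     # g1 fixes the allele (or both agree)
        else:
            return []              # fixed sites conflict: nothing is shared
    free = [i for i, s in enumerate(template) if s == "V"]
    result = []
    # Fill the free sites with every 0/1 combination.
    for bits in product("01", repeat=len(free)):
        h = template[:]
        for i, bit in zip(free, bits):
            h[i] = bit
        result.append("".join(h))
    return result


# Hypothetical genotypes: the template is V001.
print(shared_haplotypes("2022", "2201"))  # ['0001', '1001']
```

The work done is proportional to the number of sites times the number of shared haplotypes produced, which matches the time bound stated on the slide.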

The APOE Data: A case where the haplotypes were molecularly determined There are 17 distinct haplotypes in the real data. The IP finds a True Parsimony Solution with 15 distinct haplotypes. PHASE and HAPLOTYPER each use 15 haplotypes also. Over 10,000 executions of Clark’s method, the fewest haplotypes it used in any solution was 20. This data has 9 sites, and 47 genotypes, each with at least two ambiguous sites.

Recombination Recombination is a process whereby a prefix of one sequence is concatenated to a suffix of another sequence to create a third sequence. Ex. ABCDEFG and TUVWXYZ could recombine to create ABCWXYZ. DNA sequences evolve by mutations of different types, but also by recombinations.
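The prefix/suffix definition translates directly into string slicing. The sketch below models a single crossover between two equal-length sequences. The function name `recombine` and the `breakpoint` parameter are names chosen for this illustration.

```python
def recombine(a, b, breakpoint):
    """Single crossover: prefix of `a` up to `breakpoint`, then suffix of `b`."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    if not 0 <= breakpoint <= len(a):
        raise ValueError("breakpoint must lie within the sequence")
    return a[:breakpoint] + b[breakpoint:]


# The example from the slide, with the breakpoint after the third position.
print(recombine("ABCDEFG", "TUVWXYZ", 3))  # ABCWXYZ
```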

Recombination Helps Efficiency As the level of recombination increases, the efficiency of the IP increases, because the variable elimination trick becomes more effective, reducing the size of the IP. The reason is that recombination makes the underlying haplotypes in the population more varied, and also increases the number of haplotypes in the population. Hence, each haplotype is less likely to be part of a potential explanation of any given genotype.

Recombination Hurts Accuracy For almost the same reason as recombination helps efficiency, it hurts accuracy. As recombination increases, the number of haplotypes that can be part of the explanation of more than one genotype in the data decreases. That helps efficiency, but it reduces the level of structure and dependency among the potential explanations, and hence the parsimony criterion is less effective.

How Fast? How Good? Depends on the level of recombination in the underlying data. Pure Parsimony can be computed in seconds to minutes for most cases with 50 genotypes and up to 60 sites, faster as the level of recombination increases. As the level of recombination increases, the accuracy of the Pure Parsimony Solution falls, but remains within 5% of the quality of PHASE (for comparison).

Accuracy For 10 sites and moderate recombination, the Pure Parsimony solutions have the same accuracy as PHASE and HAPLOTYPER solutions. As the number of sites and the level of recombination increase, PHASE and HAPLOTYPER tend to be more accurate than the Pure Parsimony solution, but the gap is moderate.

A Hybrid Approach for Large Data Sets We are interested in handling 100 genotypes and 150 sites. This is too large for the IP approach, but we can use a hybrid approach based on Clark’s Method and an IP version of it.

Generic Clark Method Clark (1990). Basic Step: Given a known haplotype H (original homozygote or single-site heterozygote, or previously inferred), and an unresolved genotype G, if G can be explained by H and another vector H’, then call H’ a known haplotype, available for additional inferrals; G is “resolved” by H and H’. [Figure: example of H, G and H’] In a single run, repeat the basic step until stuck - resolve as many genotypes as possible in the data. Randomize choices, and do the computations many times to find an execution (run) that explains the most genotypes.
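A compact sketch of the basic step and of a randomized run follows. It is an illustration under simplifying assumptions, not Clark's software or the implementation used in the talk: genotypes are distinct 0/1/2 strings, every applicable (H, G) choice is equally likely, and runs are compared only by the number of genotypes resolved. All function names are hypothetical.

```python
import random


def resolve(h, g):
    """Basic step: return H' such that H and H' explain G, or None if impossible."""
    partner = []
    for a, s in zip(h, g):
        if s == "2":
            partner.append("1" if a == "0" else "0")  # H' takes the other allele
        elif s == a:
            partner.append(a)                         # unambiguous site must match
        else:
            return None                               # H conflicts with G
    return "".join(partner)


def clark_run(genotypes, rng):
    """One run: repeat the basic step with random choices until stuck."""
    known, resolved, unresolved = set(), {}, []
    for g in genotypes:
        if g.count("2") <= 1:
            # Homozygotes and single-site heterozygotes are unambiguous.
            h1, h2 = g.replace("2", "0"), g.replace("2", "1")
            known.update((h1, h2))
            resolved[g] = (h1, h2)
        else:
            unresolved.append(g)
    while True:
        moves = [(h, g) for g in unresolved for h in sorted(known)
                 if resolve(h, g) is not None]
        if not moves:
            return resolved, unresolved   # stuck, or everything is resolved
        h, g = rng.choice(moves)
        h_new = resolve(h, g)
        known.add(h_new)                  # H' is now available for later steps
        resolved[g] = (h, h_new)
        unresolved.remove(g)


def best_run(genotypes, runs=100, seed=0):
    """Repeat the randomized run and keep one that resolves the most genotypes."""
    rng = random.Random(seed)
    return max((clark_run(genotypes, rng) for _ in range(runs)),
               key=lambda run: len(run[0]))


resolved, stuck = best_run(["00", "02", "22"], runs=10)
print(resolved["22"], stuck)
```

In this small made-up input, genotype 22 can be resolved either from 00 (giving 11) or from 01 (giving 10), which is the kind of choice that the randomization varies between runs. A data set with no homozygote or single-site heterozygote leaves the method stuck from the start.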

Many variations of Clark Variations based on which parts are randomized. We closely examine eight variations on a real data set. Variation 1 randomizes every decision - probably more than Clark originally intended. Truth in advertising - we implemented our own Clark versions - did not actually use Clark’s software.

Clark/Parsimony Hybrid For low recombination, but many sites (say > 100), the pure IP approach blows up. But by connecting to Clark’s approach, using the Digraph view of Clark (Gusfield ISMB 2000, JCB 2001), the size of the IP is dramatically smaller, and can run on a very large number of sites. On datasets where the pure IP can be run, the hybrid method is only a bit inferior to the pure IP method.

Clark/Parsimony Hybrid For low recombination and a large number of sites (>60): find an execution of Clark’s method that a) maximizes the number of genotypes resolved and b) minimizes the number of distinct haplotypes used. We can do this by mixing the Digraph View of Clark’s method (Gusfield 2001) with the parsimony criterion, and truly find an execution of Clark’s method that minimizes the number of distinct haplotypes used. On datasets where we can compute True Parsimony, this hybrid does only a bit worse than True Parsimony.

Other uses of IP On datasets where we know the solution, find the best that a Clark method can ever do. IP can find the best possible execution. On the APOE data, Clark’s method can get all 47 correct! In fact, in a huge number of ways. (But the best we found by actually running Clark’s method was 42 correct). This kind of test is not possible for statistical methods.