National Taiwan University Department of Computer Science and Information Engineering An Approximation Algorithm for Haplotype Inference by Maximum Parsimony.


An Approximation Algorithm for Haplotype Inference by Maximum Parsimony
Kun-Mao Chao, Ting Chen, Yao-Ting Huang
Department of Computer Science & Information Engineering, National Taiwan University, Taiwan
Department of Biological Sciences, University of Southern California, USA

SNPs and Haplotypes
A Single Nucleotide Polymorphism (SNP) is a single DNA base variation observed with a frequency of more than 1% in the population.
A haplotype is a set of linked SNPs on the same chromosome.
Example (three sequences; SNP 1, SNP 2, and SNP 3 are the positions where they differ):
  -A C T T A G C T T-   Haplotype 1: C A T
  -A A T T T G C T C-   Haplotype 2: A T C
  -A C T T T G C T C-   Haplotype 3: C T C

Genotype Data vs. Haplotype Data
The use of haplotype information has been limited because the human genome is diploid.
In large sequencing projects, genotype data are collected instead of haplotype data.
Example: a genotype with alleles A/C at SNP 1 and G/T at SNP 2 is consistent with the haplotype pair (AT, CG) or with the pair (AG, CT). We don't know which haplotype pair is real.
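
The ambiguity described on this slide can be made concrete with a short sketch. The representation (one unordered allele pair per SNP) and the function name `resolving_pairs` are assumptions made for illustration; they do not come from the slides.

```python
from itertools import product


def resolving_pairs(genotype):
    """Return all unordered haplotype pairs consistent with a genotype.

    `genotype` is a list of 2-tuples, one per SNP, holding the two
    (unordered) alleles observed at that site. This encoding is an
    assumption for the example.
    """
    pairs = set()
    # At every SNP, decide which of the two alleles sits on the first chromosome.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        # Sort so that (h1, h2) and (h2, h1) count as the same pair.
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


# Genotype from the slide: A/C at SNP 1, G/T at SNP 2.
g1 = [("A", "C"), ("G", "T")]
print(resolving_pairs(g1))  # [('AG', 'CT'), ('AT', 'CG')]
```

A genotype with k heterozygous sites has 2^(k-1) candidate pairs, which is why the genotype alone does not determine the haplotypes.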

Haplotype Inference
Inferring the haplotypes for a set of genotypes is called haplotype inference.
Many variations of this problem have been shown to be NP-hard.
Most combinatorial methods use the maximum parsimony model to solve this problem: find a minimum set of haplotypes that resolves all genotypes.
Example: the genotype with alleles A/C at SNP 1 and G/T at SNP 2 is resolved by (AT, CG) or by (AG, CT).

Maximum Parsimony
Find a minimum set of haplotypes to resolve all genotypes.
Example, with h1 = AT, h2 = CG, h3 = AG, h4 = CT:
  G1 (A/C at SNP 1, G/T at SNP 2) is resolved by (h1, h2) or by (h3, h4).
  G2 (A/A at SNP 1, T/T at SNP 2) is resolved by (h1, h1).
Resolving G1 with (h1, h2) uses two haplotypes in total, whereas (h3, h4) would use three.
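
For tiny inputs, the maximum parsimony objective can be checked by exhaustive search. The sketch below is not the algorithm from the talk; it is a brute-force baseline whose running time is exponential. The helper names and the genotype encoding are assumptions for illustration.

```python
from itertools import combinations, product


def resolving_pairs(genotype):
    """All unordered haplotype pairs consistent with one genotype (assumed encoding)."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)


def min_haplotype_set(genotypes):
    """Smallest set of haplotypes that resolves every genotype (brute force)."""
    candidate_pairs = [resolving_pairs(g) for g in genotypes]
    haplotypes = sorted({h for pairs in candidate_pairs for p in pairs for h in p})
    # Try subsets in order of increasing size; the first feasible one is minimum.
    for size in range(1, len(haplotypes) + 1):
        for subset in combinations(haplotypes, size):
            chosen = set(subset)
            if all(any(a in chosen and b in chosen for a, b in pairs)
                   for pairs in candidate_pairs):
                return subset
    return ()


g1 = [("A", "C"), ("G", "T")]  # resolved by (AT, CG) or (AG, CT)
g2 = [("A", "A"), ("T", "T")]  # resolved only by (AT, AT)
print(min_haplotype_set([g1, g2]))  # ('AT', 'CG')
```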

Our Results
We formulated this problem as an integer quadratic programming (IQP) problem.
We proposed an iterative semidefinite programming (SDP) relaxation algorithm to solve the IQP problem. This algorithm finds an O(log n)-approximate solution.
We implemented this algorithm in MATLAB and compared it with existing statistical and combinatorial methods.

Integer Quadratic Programming (IQP)
Given n genotypes and m possible haplotypes:
  Let x_i = 1 if the i-th haplotype is selected.
  Let x_i = -1 if the i-th haplotype is not selected.
Minimizing the number of selected haplotypes means minimizing the number of x_i equal to 1.
Each genotype must be resolved by at least one pair of selected haplotypes.
Example: G1 (A/C at SNP 1, G/T at SNP 2) with candidate haplotypes h1 = AT, h2 = CG, h3 = AG, h4 = CT.

Integer Quadratic Programming (IQP), continued
Maximum parsimony: find a minimum set of haplotypes (objective function) to resolve all genotypes (constraint functions).
Solving the IQP problem is NP-hard.
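
The slide's equations did not survive in the transcript. One formulation consistent with the ±1 encoding (x_i = 1 if haplotype h_i is selected, x_i = -1 otherwise) is sketched below; it is a reconstruction under that assumption, not a verbatim copy of the slide. The notation R(g), for the set of haplotype pairs that resolve genotype g, is introduced here for illustration.

```latex
\begin{align*}
\text{minimize}\quad & \sum_{i=1}^{m} \frac{1 + x_i}{2} \\
\text{subject to}\quad & \sum_{(h_j, h_k) \in R(g)} \frac{(1 + x_j)(1 + x_k)}{4} \ge 1
    \quad \text{for every genotype } g, \\
& x_i \in \{-1, 1\}, \quad i = 1, \dots, m.
\end{align*}
```

Each term (1 + x_i)/2 equals 1 exactly when h_i is selected, so the objective counts selected haplotypes. Each product (1 + x_j)(1 + x_k)/4 equals 1 exactly when both haplotypes of a resolving pair are selected, so the constraint requires at least one fully selected pair per genotype.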

The Flow of the Iterative SDP Relaxation Algorithm
1. Integer Quadratic Programming (NP-hard): relax the integer constraint to obtain the Vector Formulation.
2. Vector Formulation: reformulate as Semidefinite Programming (P).
3. Semidefinite Programming: run an existing SDP solver to obtain the SDP Solution.
4. SDP Solution: apply incomplete Cholesky decomposition to obtain the Vector Solution.
5. Vector Solution: apply randomized rounding to obtain the Integral Solution.
6. All genotypes resolved? No: repeat this algorithm. Yes: done.

Relaxation (Integer Quadratic Programming to Vector Formulation)
We relax each x_i into an (m+1)-dimensional unit vector y_i.
Replace the integer constant 1 with another unit vector y_0 = (1, 0, …, 0).
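
The relaxed program itself is missing from the transcript. Assuming the IQP counts selected haplotypes with terms (1 + x_i)/2 and tests a resolving pair (h_j, h_k) with (1 + x_j)(1 + x_k)/4 (a form implied by the ±1 encoding, but an assumption here), the substitution described on this slide would act term by term as follows.

```latex
\begin{align*}
1 \;&\longrightarrow\; y_0 \cdot y_0, \qquad
x_i \;\longrightarrow\; y_0 \cdot y_i, \qquad
x_j x_k \;\longrightarrow\; y_j \cdot y_k, \\
\frac{1 + x_i}{2} \;&\longrightarrow\; \frac{1 + y_0 \cdot y_i}{2}, \\
\frac{(1 + x_j)(1 + x_k)}{4} \;&\longrightarrow\;
    \frac{y_0 \cdot y_0 + y_0 \cdot y_j + y_0 \cdot y_k + y_j \cdot y_k}{4}, \\
x_i \in \{-1, 1\} \;&\longrightarrow\; \lVert y_i \rVert = 1, \quad i = 0, 1, \dots, m.
\end{align*}
```

Setting every y_i to either y_0 or -y_0 recovers an integral solution, so the vector program is a relaxation: its optimum is no larger than that of the IQP.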

Reformulation (Vector Formulation to Semidefinite Programming)
Let Y = (y_0 y_1 … y_m)^T (y_0 y_1 … y_m), the (m+1) x (m+1) matrix whose (i, j) entry is the inner product y_i · y_j.

Reformulation (Vector Formulation to Semidefinite Programming), continued
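
The reformulated program is not preserved in the transcript. A hedged sketch: if Y_{ij} denotes the inner product y_i · y_j, then every inner product in the vector program becomes a single matrix entry, unit length becomes Y_{ii} = 1, and being a matrix of inner products becomes positive semidefiniteness. The objective and constraint terms below assume a (1 + x_i)/2 and (1 + x_j)(1 + x_k)/4 form for the underlying IQP, and R(g) is illustrative notation for the haplotype pairs that resolve genotype g.

```latex
\begin{align*}
\text{minimize}\quad & \sum_{i=1}^{m} \frac{1 + Y_{0i}}{2} \\
\text{subject to}\quad & \sum_{(h_j, h_k) \in R(g)} \frac{1 + Y_{0j} + Y_{0k} + Y_{jk}}{4} \ge 1
    \quad \text{for every genotype } g, \\
& Y_{ii} = 1, \quad i = 0, 1, \dots, m, \\
& Y \succeq 0.
\end{align*}
```

The objective and the genotype constraints are linear in the entries of Y, which is what makes the problem a semidefinite program.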

Solving SDP (Semidefinite Programming to SDP Solution)
The SDP problem can be solved in polynomial time by algorithms such as the interior point method.
We can obtain the matrix solution Y.

Decomposition (SDP Solution to Vector Solution)
Recall that Y = (y_0 y_1 … y_m)^T (y_0 y_1 … y_m).
Use the incomplete Cholesky decomposition method to obtain the vector solutions y_0, y_1, …, y_m.
Example vector solution:
  y_0 = (1, 0, …, 0)
  y_1 = (0.12, 0.04, …, 0.1)
  …
  y_m = (0.09, 0.1, …, 0.14)
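
To illustrate how vectors can be recovered from a matrix of inner products, here is a minimal sketch. It uses a plain (complete) Cholesky factorization as a stand-in for the incomplete variant named on the slide, and the function name `gram_to_vectors` and the sample matrix are assumptions for the example. If Y = L L^T, the rows of L serve as the vectors y_0, …, y_m.

```python
import math


def gram_to_vectors(Y, tol=1e-12):
    """Factor a positive semidefinite matrix Y as L L^T; row i of L is y_i.

    Plain Cholesky, used here as a stand-in for the incomplete Cholesky
    decomposition mentioned in the talk.
    """
    n = len(Y)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        d = Y[j][j] - sum(L[j][k] ** 2 for k in range(j))
        # A (near-)zero pivot means Y is rank deficient; zero out the column.
        L[j][j] = math.sqrt(d) if d > tol else 0.0
        for i in range(j + 1, n):
            s = Y[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j] if L[j][j] > 0.0 else 0.0
    return L


# Hypothetical 3 x 3 SDP solution with unit diagonal.
Y = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.5],
     [0.5, 0.5, 1.0]]
vectors = gram_to_vectors(Y)
print(vectors[0])  # [1.0, 0.0, 0.0]
```

Because L is lower triangular and Y_00 = 1, the first recovered vector is (1, 0, …, 0), matching the y_0 shown on the slide.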

Randomized Rounding (Vector Solution to Integral Solution)
Randomly generate two unit vectors z_1 and z_2.
Set x_i = 1 if (z_1 · y_i)(z_1 · y_0) > 0 and (z_2 · y_i)(z_2 · y_0) > 0. Otherwise, set x_i = -1.
The integer solution obtained by this rounding method is close to the optimal solution.
Example: the vector solution y_0 = (1, 0, …, 0), y_1 = (0.12, 0.04, …, 0.1), …, y_m = (0.09, 0.1, …, 0.14) is rounded to the integral solution x_1 = 1, x_2 = -1, …, x_m = 1.
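
The rounding rule on this slide can be sketched directly. The function names and the way random unit vectors are drawn (normalized Gaussian samples) are assumptions for illustration; the slide only says that z_1 and z_2 are random unit vectors.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def random_unit_vector(dim, rng):
    """Draw a random direction by normalizing a Gaussian sample (an assumption)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(dot(v, v))
    return [c / norm for c in v]


def round_vectors(vectors, rng):
    """Round y_1..y_m to +1/-1 using two random hyperplanes.

    vectors[0] is y_0. x_i = 1 only if y_i falls on the same side as y_0
    for both random vectors z_1 and z_2.
    """
    y0 = vectors[0]
    z1 = random_unit_vector(len(y0), rng)
    z2 = random_unit_vector(len(y0), rng)
    xs = []
    for yi in vectors[1:]:
        same_side_1 = dot(z1, yi) * dot(z1, y0) > 0
        same_side_2 = dot(z2, yi) * dot(z2, y0) > 0
        xs.append(1 if same_side_1 and same_side_2 else -1)
    return xs


rng = random.Random(0)
# y_1 equals y_0 and y_2 equals -y_0, so the outcome does not depend on z_1, z_2.
print(round_vectors([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [-1.0, 0.0, 0.0]], rng))  # [1, -1]
```

Vectors close to y_0 are likely to land on its side of both hyperplanes and be selected, while vectors pointing away from y_0 are likely to be rejected.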

Iterative Process
Is any genotype still unresolved by the integral solution (for example, x_1 = 1, x_2 = -1, …, x_m = 1)?
  Yes: repeat this algorithm for the unresolved genotypes.
  No: we are done.
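
The outer loop can be sketched independently of the SDP machinery. In the sketch below, `select_round` is a hypothetical callable standing in for one pass of relaxation, solving, decomposition, and rounding; the toy `first_pair_round` is a placeholder and not the method from the talk.

```python
def iterate_until_resolved(candidate_pairs, select_round, max_rounds=100):
    """Repeat a selection step until every genotype is resolved.

    candidate_pairs: dict mapping a genotype name to its list of resolving
        haplotype pairs (an assumed input format).
    select_round: callable taking the still-unresolved sub-dict and returning
        a set of newly selected haplotypes.
    """
    selected = set()
    unresolved = dict(candidate_pairs)
    for _ in range(max_rounds):
        if not unresolved:
            break
        selected |= select_round(unresolved)
        # Keep only genotypes with no resolving pair fully inside `selected`.
        unresolved = {
            g: pairs for g, pairs in unresolved.items()
            if not any(a in selected and b in selected for a, b in pairs)
        }
    return selected, unresolved


def first_pair_round(unresolved):
    """Toy stand-in: select the first resolving pair of the first genotype."""
    first_genotype = next(iter(unresolved))
    return set(unresolved[first_genotype][0])


pairs = {
    "G1": [("AG", "CT"), ("AT", "CG")],
    "G2": [("AT", "AT")],
}
selected, unresolved = iterate_until_resolved(pairs, first_pair_round)
print(sorted(selected), unresolved)  # ['AG', 'AT', 'CT'] {}
```

The toy selection step ends with three haplotypes where two would suffice, which shows why the quality of each round matters for the final count.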

Experiments
The iterative SDP-relaxation algorithm has been implemented in MATLAB.
The program has been tested on a variety of simulated and biological data:
  Randomly generated haplotypes and genotypes.
  Haplotypes and genotypes generated by Hudson's program.
  β2-adrenergic receptors (β2AR).
  Cystic fibrosis.

Comparison of the Number of Haplotypes
m: number of haplotypes; k: number of SNPs; n: number of genotypes.
f: failed to find a solution within two hours.

Experiments on Simulated Data
Define e_a as the average error rate over 100 data sets.
The error rate is the proportion of genotypes whose original haplotype pairs are inferred incorrectly.
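
The error-rate definition above translates into a few lines of code. The function names and the data layout (one haplotype pair per genotype, in matching order) are assumptions for illustration.

```python
def error_rate(true_pairs, inferred_pairs):
    """Proportion of genotypes whose haplotype pair is inferred incorrectly.

    Pairs are compared as unordered, so ("AT", "CG") matches ("CG", "AT").
    """
    wrong = sum(
        1 for t, p in zip(true_pairs, inferred_pairs) if sorted(t) != sorted(p)
    )
    return wrong / len(true_pairs)


def average_error_rate(datasets):
    """Average the per-data-set error rates (e_a in the slide's notation)."""
    rates = [error_rate(t, p) for t, p in datasets]
    return sum(rates) / len(rates)


truth = [("AT", "CG"), ("AT", "AT")]
guess = [("CG", "AT"), ("AG", "CT")]
print(error_rate(truth, guess))  # 0.5
print(average_error_rate([(truth, guess), (truth, truth)]))  # 0.25
```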

Experiments on Biological Data
Define e_a as the average error rate over 100 data sets.
The error rate is the proportion of genotypes whose original haplotype pairs are inferred incorrectly.

Conclusion
We proposed an iterative SDP-relaxation algorithm that finds an O(log n)-approximate solution. To the best of our knowledge, this is the first paper to establish an approximation bound for this problem.
The error rates of our algorithm are similar to those of HAPLOTYPER and PHASE.
Our algorithm is more efficient than HAPAR.

Related Works
Statistical methods:
  Niu et al. (2002) developed a PL-EM algorithm called HAPLOTYPER.
  Stephens and Donnelly (2003) designed an MCMC algorithm based on Gibbs sampling called PHASE.
Combinatorial methods:
  Gusfield (2003) proposed an integer linear programming algorithm.
  Wang and Xu (2003) developed a branch-and-bound algorithm called HAPAR to find the optimal solution.
  Brown and Harrower (2004) proposed a new integer linear formulation of this problem.