A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007.

2 Recombination Recombination: one of the principal genetic forces shaping sequence variation within species. Two equal-length sequences generate a third, new equal-length sequence during meiosis. [Figure labels: Prefix, Suffix, Breakpoint]

3 Ancestral Recombination Graph (ARG) Network, not tree! Assumption: at most one mutation per site. [Figure labels: Mutations, Recombination]

4 A Min ARG for Kreitman's data. ARG created by SHRUB.

5 Minimizing Recombination Given enough recombinations, any set of sequences can be trivially derived on an ARG. Problem: Given a set of sequences M, construct an ARG that derives M using one mutation per site and the minimum number of recombinations (Rmin). NP-hard (Wang et al. 2001; Semple et al.). Efficiently computable lower bounds on Rmin exist.

History Bound (Myers & Griffiths 2003) Iterate the following operations:
1. Remove a column with a single 0 or 1
2. Remove a duplicate row
3. Remove any row
History bound: the minimum number of type-3 operations needed to reduce the matrix to empty. [Figure: matrix M reduced to empty with one type-3 operation]
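The three operations above can be sketched as a memoized search over the rows that remain. This is a minimal illustration, not the authors' implementation. It rests on three assumptions that the slide does not state: the free type-1 and type-2 operations are applied exhaustively before each type-3 choice, "a single 0 or 1" is read as "at most one 0 or at most one 1", and a matrix with at most one row left counts as empty. The names `cleanup` and `history_bound` are hypothetical.

```python
from functools import lru_cache

def cleanup(rows):
    """Apply type-1 and type-2 operations until neither applies."""
    rows = frozenset(rows)  # a set holds no duplicate rows (type 2)
    while rows:
        n = len(rows)
        m = len(next(iter(rows)))
        # Type 1: keep only columns where both 0 and 1 occur at least twice.
        keep = [j for j in range(m) if 2 <= sum(r[j] for r in rows) <= n - 2]
        reduced = frozenset(tuple(r[j] for j in keep) for r in rows)
        if len(keep) == m and len(reduced) == n:
            break  # nothing changed, so the matrix is fully cleaned up
        rows = reduced
    return rows

@lru_cache(maxsize=None)
def _history(rows):
    rows = cleanup(rows)
    if len(rows) <= 1:  # assumption: at most one row left counts as empty
        return 0
    # Type 3: try removing each remaining row and keep the cheapest outcome.
    return 1 + min(_history(rows - {r}) for r in rows)

def history_bound(matrix):
    """matrix: iterable of equal-length strings over '0' and '1'."""
    return _history(frozenset(tuple(int(c) for c in row) for row in matrix))

print(history_bound(["00", "01", "10", "11"]))  # all four gametes present
print(history_bound(["00", "01", "11"]))        # fits a perfect phylogeny
```

The search is exponential in the number of rows, so it is only meant for toy matrices.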

7 Graphical interpretation of history bound (HistB) Each operation corresponds to an operation that decomposes the optimal, but unknown ARG. Removing an exposed recombination node in the ARG corresponds to a single type-3 operation. So when decomposing the optimal ARG, the number of recombination nodes => number of corresponding type-3 operations. However, not all type-3 operations correspond to removing a recombination node. Since the optimal ARG is unknown, the history bound is the minimum number of type-3 operations needed to make the matrix empty.

[Figure: matrix M with rows a to g beside the optimal ARG; the ARG carries the labels p and s.] Operations on M correspond to operations on the optimal ARG.

[Figure: Type-2 operation. M and the ARG with rows a to f.]

[Figure: Type-1 operations. Legible rows: a: 001, b: 101, c: 010, d: 110, e: 010.]

[Figure: Type-2 operations. Legible rows: a: 001, b: 101, c: 010, d: 110.]

[Figure: Type-3 operation. Rows a: 001, b: 101, c: 010.] Then three more Type-1 operations fully reduce M and the ARG.

13 History bound Initially required trying all n! permutations of the rows to choose the type-3 operations. The bound can be computed by DP in O(2^n) time (Bafna, Bansal). On datasets where it can be computed, the history bound is observed to be higher than (or equal to) all studied lower bounds (about ten of them). There is no static definition for what the history bound is -- it is only defined by the algorithms that compute it! The work in this paper comes out of an attempt to find a simple static definition.

14 Why a static definition matters We want a definition of what is being computed, independent of how it is computed, so that we can reason about it and find alternative ways to compute or approximate it. For example, with no static definition of the history bound, we don't know how to formulate an integer linear program to compute it.

Perfect Phylogeny [Figure: a tree whose edges are labeled with site mutations.] The tree derives the set M, starting from a root sequence. Only one mutation per site allowed.

Intro. to Forest Bound: Decompose an Optimal ARG into a Forest of Trees by removing recombination edges. [Figure: an ARG with three recombinations.] After removing recombination edges, four trees result. The number of trees is precisely the number of recombinations plus one.
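The "recombinations plus one" count can be checked on a toy ARG. The sketch below assumes a hypothetical encoding in which `parents` maps every node to the list of its parents, with recombination nodes having two parents and the root having none. It drops the edges entering recombination nodes and counts the connected components that remain. The example graph is made up for illustration and is not the ARG from the slide.

```python
def count_trees(parents):
    """Count trees left after deleting edges into recombination nodes.

    parents maps each ARG node to the list of its parent nodes.
    """
    comp = {v: v for v in parents}

    def find(v):
        # Union-find lookup with path halving.
        while comp[v] != v:
            comp[v] = comp[comp[v]]
            v = comp[v]
        return v

    for child, ps in parents.items():
        if len(ps) == 1:  # tree edge; the two edges into a recombination node are dropped
            comp[find(child)] = find(ps[0])
    return len({find(v) for v in parents})

# Hypothetical ARG with three recombination nodes: x, y and z.
arg = {
    "root": [], "a": ["root"], "b": ["root"], "c": ["a"], "d": ["b"],
    "x": ["a", "b"], "e": ["x"], "y": ["c", "x"], "z": ["d", "y"],
}
recombinations = sum(1 for ps in arg.values() if len(ps) == 2)
print(count_trees(arg), recombinations + 1)
```

The count follows from edge counting: a graph with n nodes and r recombination nodes has n - 1 + r edges, and removing the 2r recombination edges leaves a forest with n - 1 - r edges, hence r + 1 trees.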

17 Idea behind the Forest Bound (FB) Each tree created in this way contains at most one occurrence of any site, and each site occurs in at most one of the trees. So the trees form a forest of related perfect phylogenies.

18 Forest Bound Given a set of sequences M, partition M into the fewest subsets so that each subset of sequences can be derived on a tree, where each site occurs at most once in the forest of trees. The number of trees, minus one, is a valid lower bound on Rmin.

Forest Bound Given sequences, we need to partition them into trees, with at most one edge label per column across all the trees. [Figure: edge mutations needed; two edges labeled s2,s4 and s3,s4.] Illegal! s4 appears twice.

Forest Bound Given sequences, we need to partition them into trees, with at most one edge label per column across all the trees. [Figure: one edge labeled s3,s4.] This leads to 4 trees (including 3 degenerate trees). But 4 is not the smallest number of trees!

Forest Bound Minimum number of trees = 3. FB: the minimum number of trees in any partition (where each site occurs at most once), minus one, is a lower bound on Rmin. [Figure: forest with edges labeled s1 to s5 and Steiner nodes.]
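The legality rule (each site labels at most one edge in the whole forest) can be sketched as a checker for one proposed forest. This is an illustration under assumptions: nodes are binary strings, an edge's label is taken to be the set of positions where its endpoints differ, sites are 0-indexed, and the function name `trees_minus_one` is hypothetical. It evaluates a single candidate forest only. The forest bound is the minimum of this value over all legal forests, so one forest yields an upper bound on FB and is not itself a lower bound on Rmin.

```python
def trees_minus_one(nodes, edges):
    """Return (#trees - 1) for a legal forest, or None if the forest is illegal.

    nodes: binary strings (input rows plus any Steiner nodes).
    edges: pairs of nodes; an edge is labeled by the sites where they differ.
    """
    comp = {v: v for v in nodes}

    def find(v):
        while comp[v] != v:
            comp[v] = comp[comp[v]]
            v = comp[v]
        return v

    used_sites = set()
    for u, v in edges:
        label = {i for i, (a, b) in enumerate(zip(u, v)) if a != b}
        if not label or label & used_sites:
            return None  # empty label, or a site would label two edges
        used_sites |= label
        ru, rv = find(u), find(v)
        if ru == rv:
            return None  # defensive check: the edge would close a cycle
        comp[ru] = rv
    return len({find(v) for v in nodes}) - 1

rows = ["0000", "0101", "0011"]
# Two edges that both change the last site: illegal, as in the s4 example.
print(trees_minus_one(rows, [("0000", "0101"), ("0000", "0011")]))
# One edge only: two trees.
print(trees_minus_one(rows, [("0000", "0101")]))
# Adding the Steiner node 0001 lets every site be used once: a single tree.
steiner = [("0000", "0001"), ("0001", "0101"), ("0001", "0011")]
print(trees_minus_one(rows + ["0001"], steiner))
```

The last call shows why Steiner nodes matter: three rows that cannot be joined directly without reusing a site fit on one tree once an intermediate sequence is allowed.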

22 Comparing the Forest Bound (FB) to the History Bound (HistB) and to the Optimal Haplotype Bound (OhapB), currently the best lower bound that can be computed in practice for biological data. Theorem: On any data, OhapB <= FB <= HistB. On some data, OhapB < FB < HistB. Thus the FB is the highest lower bound with a static definition.

23 First, define the Haplotype Lower Bound (S. Myers, 2003) Rh = number of distinct sequences (rows) - number of distinct sites (columns) - 1 <= minimum number of recombinations needed (folklore). Before computing Rh, remove any site that is compatible with all other sites. A valid lower bound results, and the removal generally increases the bound. Generally Rh is a really bad bound, often negative, when used on large intervals, but very good when used as local bounds in the Composite Method. Myers implemented the method in a program called RecMin, which was a huge advance, generally three times higher than the prior best lower bound method. The composite method can be used with any lower bound method, and the better the initial lower bounds, the better the composite result.
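The definition of Rh translates almost directly into code. The sketch below is illustrative and the function names are hypothetical. It assumes rows are equal-length strings over '0' and '1', and it decides compatibility with the standard four-gamete test (two sites are incompatible exactly when all four combinations 00, 01, 10, 11 occur), which the slide does not spell out.

```python
def incompatible(M, i, j):
    """Four-gamete test: sites i and j conflict if all four pairs occur."""
    return len({(r[i], r[j]) for r in M}) == 4

def haplotype_bound(M, drop_compatible=True):
    """Rh = #distinct rows - #distinct columns - 1 on the retained sites."""
    m = len(M[0]) if M else 0
    cols = list(range(m))
    if drop_compatible:
        # Keep only sites that are incompatible with at least one other site.
        cols = [i for i in cols
                if any(incompatible(M, i, j) for j in range(m) if j != i)]
    distinct_rows = {tuple(r[j] for j in cols) for r in M}
    distinct_cols = {tuple(r[j] for r in M) for j in cols}
    return len(distinct_rows) - len(distinct_cols) - 1

M = ["000", "010", "100", "111"]
print(haplotype_bound(M, drop_compatible=False))  # uses all three sites
print(haplotype_bound(M))                         # third site is dropped first
```

In this made-up matrix the third site is compatible with both others, so removing it raises the bound, which matches the remark that the removal generally increases Rh.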

24 Then, the Subset Bound (Myers) Let S be a subset of sites, and Rh(S) be the haplotype bound computed on the sequences restricted to S. Rh(S) is a valid lower bound on Rmin. Optimal Haplotype Bound (OhapB) is the maximum Rh(S) over all subsets of sites. Practical computation of OhapB via ILP was studied in (SWG 2005) and exploited in the program Hapbound. Hapbound gives provably better bounds than RecMin.

25 Now, the Optimal Haplotype Bound (OhapB) OhapB is the maximum haplotype bound over any subset of columns. NP-hard (Bafna & Bansal, 2005). Efficiently computed in practice (Song, Wu, Gusfield 2005). [Figure: on all columns, Rh = 4 - 3 - 1 = 0; on a subset of two columns, Rh = 4 - 2 - 1 = 1.]
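For very small inputs, the maximum over column subsets can be found by plain enumeration. The sketch below is a brute-force illustration only and is not the ILP approach of Song, Wu and Gusfield. The function names and the example matrix are hypothetical, chosen so that the numbers mirror the 4 - 3 - 1 and 4 - 2 - 1 computations on the slide.

```python
from itertools import combinations

def rh(M, cols):
    """Haplotype bound of M restricted to the given column indices."""
    distinct_rows = {tuple(r[j] for j in cols) for r in M}
    distinct_cols = {tuple(r[j] for r in M) for j in cols}
    return len(distinct_rows) - len(distinct_cols) - 1

def optimal_haplotype_bound(M):
    """Maximum of rh over all non-empty column subsets (exponential time)."""
    m = len(M[0])
    return max(rh(M, cols)
               for k in range(1, m + 1)
               for cols in combinations(range(m), k))

M = ["000", "010", "100", "111"]
print(rh(M, (0, 1, 2)))            # 4 rows, 3 columns
print(rh(M, (0, 1)))               # 4 rows, 2 columns
print(optimal_haplotype_bound(M))
```

Enumeration visits 2^m - 1 subsets, so it is usable only when the number of columns is small.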

Forest Bound (FB) is Higher than Haplotype Bound (Rh) [Figure: forest with edges labeled s1 to s5 and Steiner nodes.] Rh = number of distinct rows - number of distinct columns - 1 = -1, while FB = 2.

27 FB >= Rh FB obtained from all the data M is >= FB obtained from a subset of the columns, so assume all columns in M are distinct.
FB = # trees in the forest - 1
= # nodes - # edges - 1 (in the forest)
= # leaves + # Steiner nodes - # columns - 1
= # rows + # Steiner nodes - # columns - 1
>= # distinct rows - # columns - 1
= Rh

28 Forest Bound is Higher than Optimal Haplotype Bound FB(Ms) >= Rh(Ms) (i.e. the optimal haplotype bound). [Figure: input matrix M and the optimal subset of columns Ms = {s2, s3, s5}.]

Number of Trees after Taking Subset of Columns [Figure: a minimum forest with 3 trees for the entire data, with edges labeled s1 to s6; restricting it to the subset s2, s3, s5 and cleaning up gives a legal forest with 3 trees for the subset data.] Taking subsets cannot increase the number of trees, so FB(M) >= FB(Ms). So FB(M) >= FB(Ms) >= Rh(Ms), and hence FB >= OhapB.

30 FB <= HistB The decomposition of the optimal ARG, directed by the operations of computing the history bound, creates a forest of HistB + 1 trees, where each site occurs at most once, in at most one tree. So FB <= HistB.

31 Computing the Forest Bound is NP-Hard The optimal haplotype bound is quite good, but NP-hard to compute. If the forest bound could be computed efficiently, we would not need the optimal haplotype bound at all. Unfortunately, the forest bound is NP-hard to compute. Reduction from Exact Cover by 3-Sets.

NP-hardness Proof for the Minimum Perfect Phylogenetic Forest Problem [Figure: the 3-sets {1,3,5}, {1,2,4}, {2,4,6} mapped to binary sequences on a hypercube.] Sequences corresponding to the same set form a perfect phylogeny with a single novel sequence (not in the input). Two sequences from different sets are far apart and would need too many mutations to connect, so they cannot belong to the same tree. Sequences corresponding to the same element in two sets need the same mutation, so they cannot both be chosen.

33 Integer Programming Formulation for the Forest Bound For sequences with m sites, consider the hypercube of all 2^m possible sequences. Minimizing F is equivalent to reducing the number of Steiner nodes in the forest. We also need to ensure that an edge linking two nodes in a tree is labeled only with columns that do not appear in other trees. Missing data in the input can be incorporated easily. The IP formulation has exponential size, but is practical when the number of columns is relatively small.
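To see why the formulation has exponential size, it helps to enumerate the hypercube that the candidate nodes and edges are drawn from. The sketch below only builds that search space and is not the IP itself. It assumes, as a simplification not stated on the slide, that candidate edges join sequences differing in exactly one site. The function name `hypercube` is hypothetical.

```python
from itertools import product

def hypercube(m):
    """Return all 2^m binary sequences and the single-site edges between them."""
    nodes = ["".join(bits) for bits in product("01", repeat=m)]
    # Each edge flips one site from 0 to 1 and records which site changed.
    edges = [(u, u[:i] + "1" + u[i + 1:], i)
             for u in nodes for i in range(m) if u[i] == "0"]
    return nodes, edges

for m in (3, 7, 10):
    nodes, edges = hypercube(m)
    print(m, len(nodes), len(edges))  # 2^m nodes and m * 2^(m-1) edges
```

With 7 columns, as in the experiments reported in this talk, the hypercube has only 128 nodes, which fits the remark that the approach is practical when the number of columns is relatively small.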

34 Empirical Results On randomly generated datasets with 15 rows and 7 columns, FB > OhapB on 10% of the data. On more biologically meaningful data (generated with the simulation program ms), however, OhapB = FB more often. On datasets generated by ms with missing entries, FB more often outperforms an approximate optimal Rh bound: with 30 rows, 7 columns and 30% missing entries, FB was strictly larger in 8% of the data. When the level of missing entries is lower, the approx. OhapB matches the FB more often.