Optimal Phylogenetic Networks with Constrained and Unconstrained Recombination Dan Gusfield UC Davis Different parts of this work are joint with Satish.

Slides:

Advertisements

Similar presentations

A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem Yufeng Wu and Dan Gusfield UC Davis COCOON07 July 16, 2007.

Advertisements

CS 336 March 19, 2012 Tandy Warnow.

Efficient Computation of Close Upper and Lower Bounds on the Minimum Number of Recombinations in Biological Sequence Evolution Yun S. Song, Yufeng Wu,

1 Decomposing Hypergraphs with Hypertrees Raphael Yuster University of Haifa - Oranim.

An introduction to maximum parsimony and compatibility

Improved Algorithms for Inferring the Minimum Mosaic of a Set of Recombinants Yufeng Wu and Dan Gusfield UC Davis CPM 2007.

PHYLOGENETIC TREES Bulent Moller CSE March 2004.

Greedy Algorithms Greed is good. (Some of the time)

Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination Dan Gusfield UC Davis Different parts of this work are joint.

Erice - Structured Pattern Detection and Exploitation Deterministic Algorithms.

Combinatorial Algorithms and Optimization in Computational Biology and Bioinformatics Dan Gusfield occbio, June 30, 2006.

Graph Drawing and Information Visualization Laboratory Department of Computer Science and Engineering Bangladesh University of Engineering and Technology.

1 A Linear-Time Algorithm for the Perfect Phylogeny Haplotyping (PPH) Problem Zhihong Ding, Vladimir Filkov, Dan Gusfield Department of Computer Science.

D. Gusfield, V. Bansal (Recomb 2005) A Fundamental Decomposition Theory for Phylogenetic Networks and Incompatible Characters.

Exact Computation of Coalescent Likelihood under the Infinite Sites Model Yufeng Wu University of Connecticut ISBRA

Computing close bounds on the minimum number of recombinations Dan Gusfield UCD Y. Song, Y. F. Wu, D. Gusfield (ISMB2005) D. Gusfield, D. Hickerson (Dis.

WABI 2005 Algorithms for Imperfect Phylogeny Haplotyping (IPPH) with a Single Homoplasy or Recombnation Event Yun S. Song, Yufeng Wu and Dan Gusfield University.

Multi-State Perfect Phylogeny with Missing and Removable Data: Solutions via Chordal Graph Theory Dan Gusfield Recomb09, May 2009.

Close Lower and Upper Bounds for the Minimum Reticulate Network of Multiple Phylogenetic Trees Yufeng Wu Dept. of Computer Science & Engineering University.

Algorithms to Distinguish the Role of Gene-Conversion from Single-Crossover recombination in populations Y. Song, Z. Ding, D. Gusfield, C. Langley, Y.

Haplotyping via Perfect Phylogeny Conceptual Framework and Efficient (almost linear-time) Solutions Dan Gusfield U.C. Davis RECOMB 02, April 2002.

CSB Efficient Computation of Minimum Recombination With Genotypes (Not Haplotypes) Yufeng Wu and Dan Gusfield University of California, Davis.

Increasing graph connectivity from 1 to 2 Guy Kortsarz Joint work with Even and Nutov.

Haplotyping via Perfect Phylogeny: A Direct Approach

ReCombinatorics: Phylogenetic Networks with Recombination CPM, June 18, 2008 Pisa, Italy Two recent results and Two Open Questions.

Computational Complexity, Physical Mapping III + Perl CIS 667 March 4, 2004.

Inferring Evolutionary History with Network Models in Population Genomics: Challenges and Progress Yufeng Wu Dept. of Computer Science and Engineering.

Fast Computation of the Exact Hybridization Number of Two Phylogenetic Trees Yufeng Wu and Jiayin Wang Department of Computer Science and Engineering University.

Integer Programming for Phylogenetic and Population- Genetic Problems with Complex Data D. Gusfield, Y. Frid, D. Brown Cocoon’07, July 16, 2007.

Chapter 11: Limitations of Algorithmic Power

Optimal Phylogenetic Networks with Constrained and Unconstrained Recombination (The root-unknown case) Dan Gusfield UC Davis.

Estimating and Reconstructing Recombination in Populations: Problems in Population Genomics Dan Gusfield UC Davis Different parts of this work are joint.

Computing close bounds on the minimum number of recombinations Dan Gusfield UCD Y. Song, Y. F. Wu, D. Gusfield (ISMB2005) D. Gusfield, D. Hickerson (Dis.

Incorporating Mutations

Phylogenetic Networks of SNPs with Constrained Recombination D. Gusfield, S. Eddhu, C. Langley.

Perfect Phylogeny MLE for Phylogeny Lecture 14

Combinatorial Optimization and Combinatorial Structure in Computational Biology Dan Gusfield, Computer Science, UC Davis.

Algorithms for estimating and reconstructing recombination in populations Dan Gusfield UC Davis Different parts of this work are joint with Satish Eddhu,

Algorithms to Distinguish the Role of Gene-Conversion from Single-Crossover Recombination in Populations Y. Song, Z. Ding, D. Gusfield, C. Langley, Y.

Combinatorial Optimization in Computational Biology: three topics that use Perfect Phylogeny Dan Gusfield OSB 2008, Lijiang, China, November 1, 2008.

Physical Mapping of DNA Shanna Terry March 2, 2004.

Approximating Minimum Bounded Degree Spanning Tree (MBDST) Mohit Singh and Lap Chi Lau “Approximating Minimum Bounded DegreeApproximating Minimum Bounded.

Estimating and Reconstructing Recombination in Populations: Problems in Population Genomics Dan Gusfield UC Davis Different parts of this work are joint.

Ch.6 Phylogenetic Trees 2 Contents Phylogenetic Trees Character State Matrix Perfect Phylogeny Binary Character States Two Characters Distance Matrix.

Estimating Recombination Rates. LRH selection test, and recombination Recall that LRH/EHH tests for selection by looking at frequencies of specific haplotypes.

Graph Colouring L09: Oct 10. This Lecture Graph coloring is another important problem in graph theory. It also has many applications, including the famous.

CSE 589 Part VI. Reading Skiena, Sections 5.5 and 6.8 CLR, chapter 37.

1 Decomposition into bipartite graphs with minimum degree 1. Raphael Yuster.

Meiotic Recombination (single-crossover) PrefixSuffix  Recombination is one of the principal evolutionary forces responsible for shaping genetic variation.

1 Use graphs and not pure logic Variables represented by nodes and dependencies by edges. Common in our language: “threads of thoughts”, “lines of reasoning”,

Strings Basic data type in computational biology A string is an ordered succession of characters or symbols from a finite set called an alphabet Sequence.

Subtree Prune Regraft & Horizontal Gene Transfer or Recombination.

Estimating Recombination Rates. Daly et al., 2001 Daly and others were looking at a 500kb region in 5q31 (Crohn disease region) 103 SNPs were genotyped.

Today Graphical Models Representing conditional dependence graphically

. Perfect Phylogeny MLE for Phylogeny Lecture 14 Based on: Setubal&Meidanis 6.2, Durbin et. Al. 8.1.

An Algorithm for the Consecutive Ones Property Claudio Eccher.

Application of Phylogenetic Networks in Evolutionary Studies Daniel H. Huson and David Bryant Presented by Peggy Wang.

Sorting by placement and Shift Sergi Elizalde Peter Winkler By 資工四 B 周于荃.

by d. gusfield v. bansal v. bafna y. song presented by vikas taliwal

Yufeng Wu and Dan Gusfield University of California, Davis

Algorithms for estimating and reconstructing recombination in populations Dan Gusfield UC Davis Different parts of this work are joint with Satish Eddhu,

Erice - Structured Pattern Detection and Exploitation

The minimum cost flow problem

Chapter 5. Optimal Matchings

L4: Counting Recombination events

Algorithms for estimating and reconstructing recombination in populations Dan Gusfield UC Davis Different parts of this work are joint with Satish Eddhu,

Estimating Recombination Rates

ReCombinatorics The Algorithmics and Combinatorics of Phylogenetic Networks with Recombination Dan Gusfield U. Oregon , May 8, 2012.

3.5 Minimum Cuts in Undirected Graphs

Chapter 11 Limitations of Algorithm Power

Presentation transcript:

Optimal Phylogenetic Networks with Constrained and Unconstrained Recombination Dan Gusfield UC Davis Different parts of this work are joint with Satish Eddhu, Charles Langley, Dean Hickerson, Yufeng Wu.

Reconstructing the Evolution of Binary Bio-Sequences Perfect Phylogeny (tree) model Phylogenetic Networks (DAG) with recombination Phylogenetic Networks with disjoint cycles: Galled-Trees Phylogenetic Networks with unconstrained cycles: Blobbed-Trees Combinatorial Structure and Efficient Algorithms Efficiently Computed Lower Bounds on the number of recombinations needed

Geneological or Phylogenetic Networks The major biological motivation comes from genetics and attempts to reconstruct the history of recombination in populations. Also relates to phylogenetic-based haplotyping. Some of the algorithmic and mathematical results also have phylogenetic applications, for example in hybrid speciation, lateral gene transfer.

The Perfect Phylogeny Model for binary sequences sites Ancestral sequence Extant sequences at the leaves Site mutations on edges The tree derives the set M:

Why SNPs? SNPs imply that the sequences are binary, and that the order of the sites is fixed (on a chromosome). This is in contrast to a set of taxonomic characters, where the order is arbitrary.

The converse problem Given a set of sequences M we want to find, if possible, a perfect phylogeny that derives M. Remember that each site can change state from 0 to 1 only once. n will denote the number of sequences in M, and m will denote the length of each sequence in M. n m M

Classic NASC: Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs: 0,0 and 0,1 and 1,0 and 1,1 This is the 4-Gamete Test When can a set of sequences be derived on a perfect phylogeny with the all-0 root?

A richer model added pair 4, 5 fails the three gamete-test. The sites 4, 5 ``conflict”. Real sequence histories often involve recombination.

The first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix). P S Sequence Recombination A recombination of P and S at recombination point 5. Single crossover recombination

Network with Recombination new The previous tree with one recombination event now derives all the sequences. 5 P S

Multiple Crossover Recombination 4-crossovers 2-crossovers = ``gene conversion”

Elements of a Phylogenetic Network (single crossover recombination) Directed acyclic graph. Integers from 1 to m written on the edges. Each integer written only once. These represent mutations. A choice of ancestral sequence at the root. Every non-root node is labeled by a sequence obtained from its parent(s) and any edge label on the edge into it. A node with two edges into it is a ``recombination node”, with a recombination point r. One parent is P and one is S. The network derives the sequences that label the leaves.

A Phylogenetic Network S p P S 1 4 a:00010 b:10010 c: d:10100 e: f:01101 g:

Which Phylogenetic Networks are meaningful? Given M we want a phylogenetic network that derives M, but which one? A: A perfect phylogeny (tree) if possible. As little deviation from a tree, if a tree is not possible. Use as little recombination or gene-conversion as possible.

Minimizing recombinations Any set M of sequences can be generated by a phylogenetic network with enough recombinations, and one mutation per site. This is not interesting or useful. However, the number of (observable) recombinations is small in realistic sets of sequences. ``Observable” depends on n and m relative to the number of recombinations. Two algorithmic problems: given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations (Hein’s problem). Find a network generating M that has some biologically- motivated structural properties.

Minimization is NP-hard The problem of finding a phylogenetic network that creates a given set of sequences M, and minimizes the number of recombinations, is NP- hard. (Wang et al 2000) (Semple 2004) Wang et al. explored the problem of finding a phylogenetic network where the recombination cycles are required to be node disjoint, if possible. They gave a sufficient but not a necessary condition to recognize cases when this is possible. O(nm + n^4) time.

Recombination Cycles In a Phylogenetic Network, with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet. The cycle specified by those two paths is called a ``recombination cycle”.

Galled-Trees A recombination cycle in a phylogenetic network is called a “gall” if it shares no node with any other recombination cycle. A phylogenetic network is called a “galled- tree” if every recombination cycle is a gall.

a: b: d: c: e: f: g: A galled-tree generating the sequences generated by the prior network. 3 4 p s p s

Sales pitch for Galled-Trees Galled-trees represent a small deviation from true trees. There are sufficient applications where it is plausible that a galled tree exists that generates the sequences. Observable recombinations tend to be recent; block structure of human DNA; recombination is sparse, so the true history of observable recombinations may be a galled-tree. The number of recombinations is never more than m/2. Moreover, when M can be derived on a galled-tree, the number of recombinations used is the minimum number over any phylogenetic network, even if multiple cross-overs at a recombination event are counted as a single recombination. A galled-tree for M is ``almost unique” - implications for reconstructing the correct history.

Old (Aug. 2003) Results O(nm + n^3)-time algorithm to determine whether or not M can be derived on a galled-tree with all-0 ancestral sequence. Proof that the galled-tree produced by the algorithm is a “nearly-unique” solution. Proof that the galled-tree (if one exists) produced by the algorithm minimizes the number of recombinations used, over all phylogenetic- networks with all-0 ancestral sequence.

New work We derive the galled-tree results in a more general setting that addresses unconstrained recombination cycles and multiple crossover recombination. This also solves the problem of finding the ``most tree-like” network when a perfect phylogeny is not possible. In this algorithm, no ancestral sequence is known in advance.

Blobbed-trees: generalizing galled-trees In a phylogenetic network a maximal set of intersecting cycles is called a blob. Contracting each blob results in a directed, rooted tree, otherwise one of the “blobs” was not maximal. So every phylogenetic network can be viewed as a directed tree of blobs - a blobbed-tree. The blobs are the non-tree-like parts of the network.

Ugly tangled network inside the blob. Every network is a tree of blobs. How do the tree parts and the blobs relate? How can we exploit this relationship?

The start of technical stuff

Incompatible Sites A pair of sites (columns) of M that fail the 4-gametes test are said to be incompatible. A site that is not in such a pair is compatible.

abcdefgabcdefg Two nodes are connected iff the pair of sites are incompatible, i.e, fail the 4-gamete test. Incompatibility Graph M THE MAIN TOOL: We represent the pairwise incompatibilities in a incompatibility graph.

Simple Fact If sites two sites i and j are incompatible, then the sites must be together on some recombination cycle whose recombination point is between the two sites i and j. (This is a general fact for all phylogenetic networks.) Ex: In the prior example, sites 1, 3 are incompatible, as are 1, 4; as are 2, 5.

A Phylogenetic Network S p P S 1 4 a:00010 b:10010 c: d:10100 e: f:01101 g:

Simple Consequence of the simple fact All sites on the same (non-trivial) connected component of the incompatibility graph must be on the same blob in any blobbed-tree. Follows by transitivity. So we can’t subdivide a blob into a tree-like structure if it only contains sites from a single connected component of the incompatibility graph.

Key Result about Galls: For galls, the converse of the simple consequence is also true. Two sites that are in different (non-trivial) connected components cannot be placed on the same gall in any phylogenetic network for M. Hence, in a galled-tree T for M each gall contains all and only the sites of one (non-trivial) connected component of the incompatibility graph. All compatible sites can be put on edges outside of the galls. This was the key to the galled-tree solution.

a: b: d: c: e: f: g: A galled-tree generating the sequences generated by the prior network. 2 4 p s p s Incompatibility Graph

Reduced Galled-Trees A galled-tree is called reduced if every gall only contains conflicted sites. Theorem: If M can be derived on a galled-tree, it can be derived on a reduced galled-tree. The number of recombination nodes in a reduced galled-tree equals the number of connected components of the conflict graph, which is the minimum number of recombinations possible in any galled-tree.

A reduced galled-tree for M (if one exists) minimizes the number of recombinations used over any phylogenetic network for M. The initial proof is based on a lower bound method. Another proof (together with Dean Hickerson): The minimum number of recombinations needed in any phylogenetic network for M is at least the number of non-trivial connected components of the conflict graph for M.

The conflict graph G and a blobbed tree N A conflicted pair of sites must be together on some single cycle in N (as before). By transitivity, all the sites in a connected component of the conflict graph must be together on some single blob of N. Does the correspondence between connected components of G and galls in a galled-tree generalize to blobbed trees?

The main new result: The Partition Theorem For any set of sequences M, there is a blobbed-tree T(M) that derives M, where each blob contains all and only the sites in one non-trivial connected component of the incompatibility graph. The compatible sites can always be put on edges outside of any blob. Moreover, the tree part of T(M) is unique. And it is easy to find the tree part. This is weaker than the result for galled-trees: it replaces “must” with “can”.

My proof is direct and self-contained, but the result may also be derivable from well-established results about splits, for example, from facts known about Bunemann graphs. Thanks to Mike Steel for pointing this out. Can also be derived from earlier work on splits by Bandelt and Dress.

What does T(M) look like? How do we construct it? How do we exploit it? Let T’(M) be the tree where each blob in T(M) has been contracted to a single node. Finding T’(M) is easy from M. Then we expand each node in T’(M) to get T(M).

How to find T’(M) For a connected component C in the conflict graph, let M[C] be M restricted to the sites in C. Consider each distinct sequence S in M[C] as defining a binary character (split) partitioning those rows of M[C] that contain S, and those that do not. Let M’ be the binary matrix obtained from these characters, over all connected components in the conflict graph.

Fact: M’ satisfies the 4-gamete test and so has a unique undirected perfect phylogeny, which is T’(M). Then we need to expand each node in T’(M) that corresponds to a blob for connected component C to some network B so that the external nodes in B have the labels in M[C], when restricted to C.

sites on B from C v For any leaf in Tv, and only at those leaves, the states of the sites in C are the same as at v. w T(M) Key point: B’ Leaf set Tv B: blob C: component of the conflict graph whose sites are on B

a: b: d: c: e: f: g: p s p s Example: v

More Formally: Let M[C] be the matrix M restricted to the sites in C. Let S[C] be a sequence S restricted to the sites in C. Key point: Each distinct (non-zero) sequence in M[C] must be the sequence S[C] for some sequence S labeling a branching node v on B. So, although we don’t know much about the interior of B, or the arrangement of the exterior nodes (braching off of B), we know precisely their number and the sequences (restricted to sites of C) that label the exterior nodes on B. And we know that the states of the non-C sites are identical at each node in B.

abcdefgabcdefg 13 4 M a b c d e f g Matrix M[C] is Matrix M restricted to the columns in C a b d c, e, f, g 2 p s B C

Punch Line Each distinct non-zero sequence S[C] in M[C] corresponds to an edge e branching off of B, and exactly defines the set of sequences in M (and hence leaves in T(M)) that must be below e in T(M). If S[C] is the all-zero sequence, then the leaf labeled S must connect to B through the coalesenct node of B. These two facts allow the construction of the tree part of T(M) in O(nm) time, starting from M. It also defines the complete label of each exterior node.

abcdefgabcdefg 13 4 M a b c d e f g C1 25 C2 abcdefgabcdefg So the paths to every leaf pass through the blob B1, but only the paths to e, f, g pass through gall B2. The path from B1 to B2 exits B1 at the node whose C1-restricted label is 010. B1 B2 M[C1]M[C2]

abcdefgabcdefg 13 4 M a b c d e f g C1 25 C2 abcdefgabcdefg M[C1]M[C2] abcdefgabcdefg M’

Algorithmically Finding the tree part of the blobbed-tree is easy. Determining the sequences labeling the exterior nodes on any blob is easy. Determining a “good” structure inside a blob B is the problem of generating the sequences of the exterior nodes of B. It is easy to test whether the exterior sequences on B can be generated with only a single (possibly multiple-crossover) recombination. The original galled-tree problem is now just the problem of testing whether one single-crossover recombination is sufficient for each blob.

The main open question Is there always a blobbed-tree where each blob has all and only the sites of a single connected component of the conflict graph, which minimizes the number of recombinations over all possible phylogenetic networks for M? Proof attempt by splitting blob B into B’ and B”. The catch is the possibility of a recombination node in B that is needed in both B’ and B”.

Can prove that if B has at most three recombination nodes, then this cannot happen. Also, a recombination node cannot be in both B’ and B” if it has no recombination node below it in B.

If Yes, Then any method that computes a lower bound on the number of needed recombinations should be applied separately on sequences defined by the sites on each connected component, and the results added together. This may be true even if the desired result is not. It is worth trying to prove this.

How to arrange the sites on a blob Given a single connected component of the conflict graph with k sites, how do we arrange those k sites to generate the required sequences, using only one (multiple-crossover) recombination, or using a multiple-crossover recombination with the fewest cross-overs?

Arranging the sites We will describe an O(n^3) time method to arrange all of the blobs. O(n^2) time is possible with a more complex method when only single-crossover recombinations are allowed.

Let Q be a gall for the sites on connected-component C of the conflict graph. Let M[C] be the matrix M restricted to the sites on C. Let LQ[C] be the sequences labeling the nodes of Q, restricted to the sites on C. Claim: The two sets of sequences are identical, i.e., M[C] = LQ[C]. A needed fact in words

abcdefgabcdefg 13 4 M a b c d e f g Matrix M[C] is Matrix M restricted to the columns in C a b d c, e, f, g 2 p s Q C LQ[C] Fact: M[C] = LQ[C] LQ[C] are the node labels on Q restricted to the sites in C

The idea for arranging the sites of C on B: via a short movie

4 1 3 a b d c, e, f, g 2 p s Q

4 1 3 a b d c, e, f, g Q

4 1 3 a b: 101 d c, e, f, g: 010 B Blob B minus the recombination node is a perfect phylogeny for M[C] minus the recombinant sequence; all sites are on one or two paths from the root; and the two end sequences of those paths can recombine at point r to recreate the recombinant sequence.

The point If we remove the recombinant node from B, we have a phylogenetic tree (no cycles) for the remaining sequences and hence a perfect phylogenetic tree for the sequences in M[C] minus the recombinant sequence of B. The sites in this tree are on one or two paths. Moreover, the two end sequences on that perfect phylogeny can recombine to create the removed recombinant sequence.

The algorithm for arranging a blob B for C 1.Form the matrix M[C]. 2. For each row of M[C], remove the row, see if there is a perfect phylogeny for the remaining rows. If yes, see if the sites are in one or two paths, and the end sequences can generate the removed row by a recombination. Fact: Every row that works gives a permitted arrangement of the sites on B.

Optimality of Galled-Trees Theorem: (G,H,B,B) The minimum number of recombination nodes in any phylogenetic network for M is at least the number of non-trivial connected components of the incompatibility graph. Hence, when the sequences on each blob on T(M) can be generated with a single recombination node, the blobbed-tree minimizes the number of recombination nodes over all phylogenetic networks and all choices of ancestral sequence. This solves the root-unknown galled-tree problem in polynomial time. Code is on the web.

The number of arrangements on a gall (all-0 ancestral sequence) Following the algorithmic approach above, one can prove that the number of arrangements of any gall is at most three, and this happens only if the gall has two sites. If the gall has more than two sites, then the number of arrangements is at most two. If the gall has four or more sites, with at least two sites on each side of the recombination point (not the side of the gall) then the arrangement is forced and unique.

What is the most tree-like Network? Assume no node with only one child. Contracting each blob to a single node results in a tree. The number of edges in that tree is a measure of how tree-like the network is. The larger the number of edges, the more tree-like is the network. Then T(M) is the most tree-like network.

Practical and Accurate Lower Bound Computation Unconstrained recombination. Want efficient computation of lower bounds on the minimum number of recombinations needed. Many methods: HK, Haplotype, (history), Connected Comps, Composite Interval Composite, Gall-Interval Composite Method, Subsets, Recmin, Composite ILP. Some are specific for recombination, and some apply to any reticulation event.

The grandfather of all lower bounds - HK 1985 Arrange the nodes of the incompatibility graph on the line in order that the sites appear in the sequence. The HK bound is the minimum number of vertical lines needed to cut every edge in the incompatibility graph. Weak bound, but widely used - not only to bound the number of recombinations, but also to suggest their locations.

HK Lower Bound

HK Lower Bound = 1

More general view of HK Given a set of intervals on the line, and for each interval I, a number N(I), find the minimum number of vertical lines so that every interval I intersects at least N(I) of the vertical lines. In HK, each incompatibility defines an interval I where N(I) = 1. Other methods find different intervals and larger N(I) values. The general problem is easy to solve by a left-to-right myopic placement of vertical lines: sort the intervals by right end-point; Process the intervals left to right in that order; when the right endpoint of an interval I is reached, place there (if needed) additional vertical so that N(I) lines intersect I.

The general approach is called the Composite Interval Method (Myers).

New bound 2003 The number of non-trivial connected components in the incompatibility graph (Gusfield, Hickerson; Bafna, Bansal) - weak bound when used alone, but effective when used with the Composite Interval Method. CC Bound = 2 in the prior example.

Haplotype Bound (Simon Myers) Rh = Number of distinct sequences (rows) - Number of distinct sites (columns) -1 (folklore) Before computing Rh, remove any site that is compatible with all other sites. A valid lower bound results - generally increases the bound. Generally really bad bound, often negative, when used alone, but Very Good when used in the Composite Interval Method, and Amazing when used in the Composite Subset Method, and More than Amazing when used in the Composite ILP Method.

Composite Interval Method (Simon Myers) Compute Rh separately for each of the C(m,2) intervals of the m sites; Let Rh(I) be the bound for interval I. This is called a ``local bound”. Compute the Minimum number of vertical lines so that each interval I is cut at least Rh(I) times (a trivial algorithm). The HK bound is a specialization of this where Rh(I) is 0 or 1. Also can use the connected component bound for the local bounds, or take the best of both in each interval.

Composite Subset Method Let S be subset of sites, and Rh(S) be the haplotype bound for subset S. If the leftmost site in S is L and the rightmost site in S is R, then use Rh(S) as a local bound for interval [S,L]. Compute Rh(S) on many subsets, and then use the Composite method to find an overall bound.

RecMin (Myers) World Champion Lower Bound Program (available) Great Program (Myers). Often RecMin gives a bound three times as large as HK. Uses Composite Subsets with Rh, but limits the size and the span of the subsets. Default parameters are s = 6, w = 15 (s = size, w = span). Generally, impractical to set s = w = m, so generally one doesn’t know if increasing the parameters would increase the bound.

Optimal RecMin Bound (ORB) The Optimal RecMin Bound is the lower bound that RecMin would produce if both parameters were set to their maximum values (s = w= m). In general, RecMin cannot compute (in practical time) the ORB. Practical computation of the ORB is our first contribution.

Computing the ORB Gross Idea: For each interval I, use ILP to find a subset of sites that maximizes Rh over all subsets in interval I. Set N(I) to the found Rh value, and use these values in the Composite Method. The result is guaranteed to be the ORB. i.e, the bound one would get by using all 2^m subsets in RecMin.

Now instead of doing an exponential number of simple computations (computing Rh for each subset), we have a quadratic number of (possibly expensive) ILPs to solve. Is this a good trade-off in practice? Our experience - yes!

Two critical tricks 1)Formulate the ILP as a ``test set” problem: Find the minimum number of sites so that each pair of haplotypes differs at one site at least. Compared to other ILP formulations, this allows large eliminations of redundant inequalities, and greatly speeds up the ILP solution. 2) Reduce the number of intervals examined: If the optimal subset in the interval I only spans interval L, then there is no Need to examine any interval containing L. Each ILP calls at most 4 other ILPs. I L

ILP Compsite method efficiency Only C(m,2) intervals, but the time for each is more than for the Hap bound. With tricks we need to solve the ILP for only 0.5% of all the C(m,2) intervals (empirical result). Shockingly fast in practice.

Typical Result With n = 40, m = 100, r = 10 (ms recombination parameter): HK gives 4 Composite Intervals with Hap Bound or CC bound gives 6 Recmin Composite Subsets with Hap Bound and subsets of size up to 20 in windows of size 20: 9 (7 secs) Composite Subsets with Hap Bound and subsets of size up to 30 in windows of size 30: 10 (takes 10 mins on Powerbook G4) ILP Composite gets 10 and takes 1 second on Linux Box.

Papers and Software wwwcsif.cs.ucdavis.edu/~gusfield/