Erice - Structured Pattern Detection and Exploitation


Erice - Structured Pattern Detection and Exploitation Deterministic Algorithms

Outline Structured patterns - weird things happen - chance or necessity? Structured pattern detection - suffix trees, viruses and integer linear programming. More on suffix arrays - computing LCP in linear time; finding close repeats. Detecting and exploiting patterns of recombination in binary (SNP) sequences. Adding in the complication of gene conversion.

Kmer frequency Kmers in a string S over all K. Ex. S = abxabcxab. K = 2, the Kmers are: ab, bx, xa, ab, bc, cx, xa, ab; five distinct 2-mers, ab and xa repeat. K = 3: abx, bxa, xab, abc, bcx, cxa, xab; six distinct 3-mers, xab repeats. K = 4: abxa, bxab, xabc, abcx, bcxa, cxab; six distinct 4-mers, no repeats. K = 5,6,7,8,9 have 5,4,3,2,1 distinct Kmers, no repeats.

Weird (non-obvious) Patterns? K* = Maximum K such that some Kmer repeats in S. K’ = K where the number of distinct Kmers is maximum. D = number of distinct Kmers for K = K’ Observations from Data: K’ = K* + 1, and D + K* = |S| Chance or necessity?
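The quantities K*, K' and D can be computed directly from their definitions. The following is a minimal Python sketch; the function names are illustrative and not from the talk. For S = abxabcxab a direct count gives six distinct Kmers at both K = 3 and K = 4, so the sketch assumes that K' is the largest K attaining the maximum.

```python
def distinct_kmers(s, k):
    """Return the set of distinct substrings of length k in s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def kmer_stats(s):
    """Compute (K*, K', D) for string s.

    K* = largest k such that some k-mer occurs at least twice (0 if none).
    K' = k with the most distinct k-mers (assumption: largest k on ties).
    D  = number of distinct k-mers for k = K'.
    """
    n = len(s)
    counts = {k: len(distinct_kmers(s, k)) for k in range(1, n + 1)}
    # A k-mer repeats exactly when there are fewer distinct k-mers
    # than the n - k + 1 positions where a k-mer can start.
    k_star = max((k for k in counts if counts[k] < n - k + 1), default=0)
    d = max(counts.values())
    k_prime = max(k for k in counts if counts[k] == d)
    return k_star, k_prime, d


k_star, k_prime, d = kmer_stats("abxabcxab")
print(k_star, k_prime, d)  # 3 4 6
```

On this string the sketch agrees with both observations: K' = K* + 1 and D + K* = 9 = |S|.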

String Barcoding: Uncovering Optimal Virus Signatures Sam Rash, Dan Gusfield University of California, Davis.

Motivation Need for rapid virus detection. Given: an unknown virus and a database of known viruses. Problem: identify the unknown virus quickly based on a small set of substrings.

Motivation Real World: only have sequence for pathogens in the database; not possible to quickly sequence an unknown virus; can test for the presence of small (<= 50 bp) strings in the unknown virus (substring tests). Another Idea: String Barcoding. Use substring tests to uniquely identify each virus in the database; acquire a unique barcode for each virus in the database.

Problem Definition Formal Definition. Given: a set of strings S. Goal: find a set of strings S’, the testing set, such that for each pair s1, s2 in S there exists at least one u in S’ where u is a substring of exactly one of s1 and s2 (u is a signature substring); minimize |S’|. Result: a barcode for each element of S.

Example cagtgc {“tg”} cagttc {} catgga {“tg”, “atgga”} Figure 1.5 - signatures
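A testing set is valid exactly when the barcodes it induces are pairwise distinct. The check can be sketched in a few lines of Python; the helper names are hypothetical, and the testing set {tg, atgga} is the one used in the example.

```python
def barcode(s, tests):
    """0/1 vector: entry i is 1 iff tests[i] occurs as a substring of s."""
    return tuple(int(u in s) for u in tests)


def is_testing_set(strings, tests):
    """True iff every pair of strings receives different barcodes."""
    codes = [barcode(s, tests) for s in strings]
    return len(set(codes)) == len(codes)


strings = ["cagtgc", "cagttc", "catgga"]
tests = ["tg", "atgga"]
for s in strings:
    print(s, barcode(s, tests))
print(is_testing_set(strings, tests))
```

A single test cannot separate three strings, since one 0/1 answer gives only two possible barcodes.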

Problem Complexity Complexity: unknown if NP-hard when the size of any u in S’ is unbounded. Max-Length String Barcoding: additional parameter k, a maximum length of any u in S’; this variant is NP-hard by reduction from Minimum Testing Set (Garey, Johnson, 1979). This means all real world uses have to deal with an NP-hard problem.

Implementation Basic Idea: Formulate the problem as an ILP. Enumerate some “useful” set of substrings from S; one variable in the ILP for each substring. One constraint for each pair of strings in S, which means that at least one substring will be chosen to distinguish each pair. Objective Function: minimize the sum of the variables in the ILP.

Implementation Key point: the complexity of the ILP is primarily a function of the number of variables; reducing the number of candidate substring tests reduces the number of variables in the ILP. How to reduce? Key to our method: suffix trees. A suffix tree finds a minimum cardinality set of “useful” substrings for use as candidate signature substrings.

Implementation: Suffix Trees Key Properties of Suffix Tree: built for a set of strings S; tree with character sequences labeling edges; nodes labeled with a subset of the original string IDs; every substring of the original input set appears as a root-edge walk exactly once; a root-node walk is considered a root-edge walk into the node’s in-edge from its parent.

Implementation: Suffix Trees root-edge walk: creates a string that appears in exactly the strings that label the node at which it ends. 2 root-edge walks ending on the same edge: both strings created by the walks occur in exactly the same set of original strings, so either string can be used. [Figure: example of a root-edge walk.]

Implementation: Solving If two substrings occur in exactly the same set of original strings, only one need be considered. Use strings from the suffix tree for each uniquely labeled node. Build the ILP as discussed. Solve the ILP using CPLEX. Acquire a barcode and signatures for each original string; a signature is the set of substring tests occurring in a string.
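The reduction to one candidate per occurrence set can be illustrated without a suffix tree. The Python sketch below swaps in brute-force enumeration of all substrings (quadratic per string, so suitable only for toy inputs); a suffix tree yields the same grouping far more efficiently. The function name and the rule for picking a representative (shortest, then alphabetically first) are assumptions made for illustration.

```python
def occurrence_classes(strings):
    """Map each occurrence set (frozenset of 1-based string IDs) to one
    representative substring (shortest, then alphabetically first)."""
    occ = {}
    for idx, s in enumerate(strings, start=1):
        for i in range(len(s)):
            for j in range(i + 1, len(s) + 1):
                occ.setdefault(s[i:j], set()).add(idx)
    classes = {}
    for sub, ids in occ.items():
        key = frozenset(ids)
        best = classes.get(key)
        if best is None or (len(sub), sub) < (len(best), best):
            classes[key] = sub
    return classes


strings = ["cagtgc", "cagttc", "catgga"]
classes = occurrence_classes(strings)
all_ids = frozenset(range(1, len(strings) + 1))
# A substring that occurs in every string distinguishes no pair.
useful = {ids: sub for ids, sub in classes.items() if ids != all_ids}
for ids, sub in sorted(useful.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(ids), sub)
```

For the three example strings this leaves five useful occurrence sets, so the ILP needs only five variables regardless of how many distinct substrings the input contains.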

Implementation: Example
strings: 1. cagtgc 2. cagttc 3. catgga
Each node in the suffix tree has a corresponding set of string IDs below it.
Figure 1.1 - suffix tree for set of strings cagtgc, cagttc, and catgga
v1 - {1,2,3}   v2 - {1,2,3}   v3 - {3}       v4 - {1}     v5 - {3}
v6 - {1,2}     v7 - {2}       v8 - {1}       v9 - {1,2,3} v10 - {1,2,3}
v11 - {1,2}    v12 - {1}      v13 - {2}      v14 - {3}    v15 - {1,2,3}
v16 - {2}      v17 - {2}      v18 - {1,3}    v19 - {1}    v20 - {3}
v21 - {1,2,3}  v22 - {3}      v23 - {2}      v24 - {1,2}  v25 - {1}
Figure 1.2 - table of string labels for each node in suffix tree from figure 1.1

Implementation: Example
minimize
  V18 + V22 + V11 + V17 + V8        #objective function
st
  V18 + V22 + V11 + V17 + V8 >= 2   #this is the theoretical minimum
  V18 + V17 + V8 >= 1               #constraint to cover pair 1,2
  V22 + V11 + V8 >= 1               #constraint to cover pair 1,3
  V18 + V22 + V11 + V17 >= 1        #constraint to cover pair 2,3
binaries                            #all variables are 0/1
  V18 V22 V11 V17 V8
end
Figure 1.3 - ILP constructed for suffix tree in figure 1.1 using no additional constraints (length, etc)
         tg (V18)   atgga (V22)
cagtgc   1          0
cagttc   0          0
catgga   1          1
Figure 1.4 - barcodes
cagtgc {“tg”} cagttc {} catgga {“tg”, “atgga”}
Figure 1.5 - signatures
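With only five 0/1 variables, the ILP of Figure 1.3 can be checked by trying all 32 assignments. The Python sketch below does that in place of a real solver such as CPLEX; it is meant only to confirm the optimum on this toy instance, and the data layout is an assumption made for illustration.

```python
from itertools import product

VARS = ["V18", "V22", "V11", "V17", "V8"]
# Each constraint: (variables on the left-hand side, right-hand side).
CONSTRAINTS = [
    (VARS, 2),                           # theoretical minimum
    (["V18", "V17", "V8"], 1),           # cover pair 1,2
    (["V22", "V11", "V8"], 1),           # cover pair 1,3
    (["V18", "V22", "V11", "V17"], 1),   # cover pair 2,3
]


def feasible(x):
    """True iff assignment x (dict var -> 0/1) meets every constraint."""
    return all(sum(x[v] for v in lhs) >= rhs for lhs, rhs in CONSTRAINTS)


def solve_by_enumeration():
    """Return a feasible 0/1 assignment minimizing the sum of variables."""
    best = None
    for bits in product((0, 1), repeat=len(VARS)):
        x = dict(zip(VARS, bits))
        if feasible(x) and (best is None or sum(bits) < sum(best.values())):
            best = x
    return best


best = solve_by_enumeration()
print(sum(best.values()), [v for v in VARS if best[v]])
```

The optimum value is 2, and the choice V18 = V22 = 1 (tests tg and atgga) used in Figures 1.4 and 1.5 is one of several optimal solutions.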

Implementation: Extensions Minimum and maximum lengths on signature substrings. Acquire barcodes/signatures for only a subset of input strings (with respect to the whole set). Minimum string edit distance between chosen signature substrings. Redundancy: require r signature substrings to differentiate each pair; adds a higher level of confidence that signatures remain valid even with mutations.

Results: Summary Works quickly on most moderately sized datasets (especially when redundancy >= 2). Dataset properties: ~50k virus genomes taken from NCBI (Genbank); 50-150 virus genomes; average length of each genome ~1000 characters; total input size ranged from approximately 50,000 to 150,000 characters. Increasing dataset size scaled approximately linearly. Reach 25% gap (at most 1/3 more than optimum) in just a few minutes; reach small gap (often < 1%) in 4 hours.

Results: Summary Increasing redundancy greatly decreases run time and % gap at 4 hours in all cases tested. Figure 2.1 - effect of redundancy on avg 25% gap Figure 2.2 - effect of redundancy on avg gap at 4 hours

Conclusion Practical sized testing sets obtained on reasonable sized input datasets: testing sets consisting of 50 - 270 substring tests on input sets of ~100 genomes. Works well with reactions that have a high number of assays (substring tests) per reaction (GeneChip: 400 assays per reaction). Redundancy: good concept in theory; reduces solution space and hence computation time; GeneChip makes the higher number of assays needed cost-effective.

Recognizing Patterns of Historical Recombination Dan Gusfield UC Davis Different parts of this work are joint with Satish Eddhu, Charles Langley, Dean Hickerson, Yun Song, Yufeng Wu.

Sequence Recombination Single crossover recombination: P = 10100, S = 01011, recombination point 5, result 10101. A recombination of P and S at recombination point 5. The first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix).
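Single-crossover recombination is just a prefix of one parent joined to a suffix of the other. A minimal Python sketch, with a hypothetical function name and 1-based recombination points as on the slide:

```python
def recombine(p, s, point):
    """Single-crossover recombinant of prefix parent p and suffix parent s.

    Sites 1 .. point-1 are copied from p; sites point .. m are copied from s.
    """
    if len(p) != len(s):
        raise ValueError("parents must have the same number of sites")
    if not 1 <= point <= len(p):
        raise ValueError("recombination point out of range")
    return p[:point - 1] + s[point - 1:]


print(recombine("10100", "01011", 5))  # 10101
```

Swapping the roles of the two parents generally gives a different sequence: with 01011 as the prefix parent and 10100 as the suffix parent, point 5 yields 01010.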

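Network with Recombination Deriving a Set of Sequences. Given set: 10100, 10000, 01011, 01010, 00010, 10101. Only one mutation per position is allowed. [Figure: phylogenetic network over sites 12345 with ancestral sequence 00000, one mutation per site on the edges, and a recombination node at point 5 deriving 10101.]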

The biological Problem Given a set of binary sequences derived by one mutation per position and possibly many recombinations, find the positions where clusters of historical recombinations likely occurred. These are called recombination hotspots. Applications: 1) Insight into the mechanics of recombination Science article October 14, 2005, and Nature article this week: in humans and chimps most recombinations occur in hotspots, but in different places in humans compared to chimps. 2) Association mapping: A major strategy being developed for finding genes that influence disease - the whole strategy relies on the historical effects of recombination.

Two Approaches Stochastic models of recombination and mutation - maximum likelihood - very intensive computations. Deterministic approaches based on minimizing the number of needed recombinations, or bounding that number. Regions where the minimum number is large, or where close bounds on the minimum are large, indicate regions of hotspots. Science article.

Reconstructing the Evolution of Binary Bio-Sequences: Perfect Phylogeny (tree) model; Phylogenetic Networks (DAG) with recombination; Phylogenetic Networks with disjoint cycles: Galled-Trees; Phylogenetic Networks with unconstrained cycles: Blobbed-Trees; Combinatorial Structure and Efficient Algorithms; Efficiently Computed Lower and Upper bounds on the number of recombinations needed.

The Perfect Phylogeny Model for binary sequences Sites: 12345. Ancestral sequence: 00000. Site mutations on edges; extant sequences at the leaves. The tree derives the set M: 10100, 10000, 01011, 01010, 00010. [Figure: perfect phylogeny rooted at 00000 with mutations 1 through 5 on its edges.]

When can a set of sequences be derived on a perfect phylogeny? Classic necessary and sufficient condition (NASC): Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs: 0,0 and 0,1 and 1,0 and 1,1. This is the 4-Gamete Test.
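The 4-gamete test is a direct check over all pairs of columns. The Python sketch below uses hypothetical function names and reports conflicting site pairs with 1-based indices; it applies the test to the five sequences derived on the perfect phylogeny above, and again after the sequence 10101 is added.

```python
from itertools import combinations


def conflicting_pairs(rows):
    """Return all 1-based site pairs (i, j) that contain all four gametes."""
    m = len(rows[0])
    pairs = []
    for i, j in combinations(range(m), 2):
        gametes = {(r[i], r[j]) for r in rows}
        if len(gametes) == 4:
            pairs.append((i + 1, j + 1))
    return pairs


def has_perfect_phylogeny(rows):
    """True iff no pair of sites fails the 4-gamete test."""
    return not conflicting_pairs(rows)


M = ["10100", "10000", "01011", "01010", "00010"]
print(has_perfect_phylogeny(M))
print(conflicting_pairs(M + ["10101"]))
```

The check costs O(n m^2) time for n sequences and m sites, which is adequate for a sketch; faster perfect phylogeny algorithms exist.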

A richer model M: 10100, 10000, 01011, 01010, 00010, with 10101 added. The pair of sites 4, 5 fails the three-gamete test: the sites 4, 5 “conflict”. Real sequence histories often involve recombination. [Figure: the perfect phylogeny for the first five sequences, rooted at 00000 over sites 12345.]

Network with Recombination M: 10100, 10000, 01011, 01010, 00010, and the new sequence 10101. The previous tree with one recombination event now derives all the sequences. [Figure: the previous tree, rooted at 00000 over sites 12345, with a recombination node at point 5 deriving 10101.]

Elements of a Phylogenetic Network (single crossover recombination) Directed acyclic graph. Integers from 1 to m written on the edges. Each integer written only once. These represent mutations. A choice of ancestral sequence at the root. Every non-root node is labeled by a sequence obtained from its parent(s) and any edge label on the edge into it. A node with two edges into it is a ``recombination node”, with a recombination point r. One parent is P and one is S. The network derives the sequences that label the leaves.

A Phylogenetic Network [Figure: phylogenetic network rooted at 00000 with mutations at sites 1 through 5 and two recombination nodes, deriving the leaves a:00010, b:10010, c:00100, d:10100, e:01100, f:01101, g:00101.]

Minimizing recombinations Any set M of sequences can be generated by a phylogenetic network with enough recombinations, and one mutation per site. This is not interesting or useful. However, the number of (observable) recombinations is small in realistic sets of sequences. ``Observable” depends on n and m relative to the number of recombinations. problem: given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations (Hein’s problem).

Minimization is NP-hard The problem of finding a phylogenetic network that creates a given set of sequences M, and minimizes the number of recombinations, is NP-hard. (Wang et al 2000) (Semple 2004) Wang et al. explored the problem of finding a phylogenetic network where the recombination cycles are required to be node disjoint, if possible. They gave a sufficient but not a necessary condition to recognize cases when this is possible. O(nm + n^4) time. We can solve the minimization problem in polynomial time, when node disjoint recombination cycles are possible.

Recombination Cycles In a Phylogenetic Network, with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet. The cycle specified by those two paths is called a “recombination cycle”.

Galled-Trees A recombination cycle in a phylogenetic network is called a “gall” if it shares no node with any other recombination cycle. A phylogenetic network is called a “galled-tree” if every recombination cycle is a gall.

A galled-tree generating the sequences generated by the prior network. [Figure: galled-tree with mutations at sites 1 through 5 and two galls, deriving the leaves a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101.]

Sales pitch for Galled-Trees Galled-trees represent a small deviation from true trees. There are sufficient applications where it is plausible that a galled tree exists that generates the sequences. Observable recombinations tend to be recent; block structure of human DNA; recombination is sparse, so the true history of observable recombinations may be a galled-tree. The number of recombinations is never more than m/2. Moreover, when M can be derived on a galled-tree, the number of recombinations used is the minimum number over any phylogenetic network, even if multiple cross-overs at a recombination event are counted as a single recombination. A galled-tree for M is ``almost unique” - implications for reconstructing the correct history.

Old (Aug. 2003) Results O(nm + n^3)-time algorithm to determine whether or not M can be derived on a galled-tree with all-0 ancestral sequence. Proof that the galled-tree produced by the algorithm is a “nearly-unique” solution. Proof that the galled-tree (if one exists) produced by the algorithm minimizes the number of recombinations used, over all phylogenetic-networks with all-0 ancestral sequence.

New work We derive the galled-tree results in a more general setting that addresses unconstrained recombination cycles and multiple crossover recombination. This also solves the problem of finding the ``most tree-like” network when a perfect phylogeny is not possible. In this algorithm, no ancestral sequence is known in advance.

Blobbed-trees: generalizing galled-trees In a phylogenetic network a maximal set of intersecting cycles is called a blob. Contracting each blob results in a directed, rooted tree, otherwise one of the “blobs” was not maximal. So every phylogenetic network can be viewed as a directed tree of blobs - a blobbed-tree. The blobs are the non-tree-like parts of the network.

Every network is a tree of blobs. How do the tree parts and the blobs relate? How can we exploit this relationship? Ugly tangled network inside the blob.

Incompatible Sites A pair of sites (columns) of M that fails the 4-gamete test is said to be incompatible. A site that is not in such a pair is compatible.

Incompatibility Graph
M (rows a-g, sites 1-5):
   1 2 3 4 5
a  0 0 0 1 0
b  1 0 0 1 0
c  0 0 1 0 0
d  1 0 1 0 0
e  0 1 1 0 0
f  0 1 1 0 1
g  0 0 1 0 1
[Figure: incompatibility graph G(M) on nodes 1, 2, 3, 4, 5.]
Two nodes are connected iff the pair of sites are incompatible, i.e., fail the 4-gamete test. THE MAIN TOOL: We represent the pairwise incompatibilities in an incompatibility graph.

The connected components of G(M) are very informative The number of non-trivial connected components is a lower-bound on the number of recombinations needed in any network (Bafna, Bansal; Gusfield, Hickerson). When each blob is a single-cycle (galled-tree case) all the incompatible sites in a blob must come from a single connected component C, and that blob must contain all the sites from C. Compatible sites need not be inside any blob. (Gusfield et al 2003-5)
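The connected-component lower bound can be computed in two steps: build the incompatibility graph with the 4-gamete test, then count the components that contain at least one edge. A minimal Python sketch with hypothetical function names, applied to the seven sequences a through g of the running example:

```python
from itertools import combinations


def incompatibility_graph(rows):
    """Edges (i, j), 1-based, for every pair of sites failing the 4-gamete test."""
    m = len(rows[0])
    return {(i + 1, j + 1)
            for i, j in combinations(range(m), 2)
            if len({(r[i], r[j]) for r in rows}) == 4}


def nontrivial_components(m, edges):
    """Connected components of the graph on sites 1..m that contain an edge."""
    adj = {v: set() for v in range(1, m + 1)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in range(1, m + 1):
        if start in seen or not adj[start]:
            continue  # already visited, or an isolated (compatible) site
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps


M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
edges = incompatibility_graph(M)
comps = nontrivial_components(len(M[0]), edges)
print(sorted(edges), comps, len(comps))
```

For this data the graph has two non-trivial components, so at least two recombinations are needed in any network deriving these sequences.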

Simple Fact If two sites i and j are incompatible, then the sites must be together on some recombination cycle whose recombination point is between the two sites i and j. (This is a general fact for all phylogenetic networks.) Ex: In the prior example, sites 1, 3 are incompatible, as are 1, 4; as are 2, 5.

A Phylogenetic Network [Figure: phylogenetic network rooted at 00000 with mutations at sites 1 through 5 and two recombination nodes, deriving the leaves a:00010, b:10010, c:00100, d:10100, e:01100, f:01101, g:00101.]

Simple Consequence of the simple fact All sites on the same (non-trivial) connected component of the incompatibility graph must be on the same blob in any blobbed-tree. Follows by transitivity. So we can’t subdivide a blob into a tree-like structure if it only contains sites from a single connected component of the incompatibility graph.

Key Result about Galls: For galls, the converse of the simple consequence is also true. Two sites that are in different (non-trivial) connected components cannot be placed on the same gall in any phylogenetic network for M. Hence, in a galled-tree T for M each gall contains all and only the sites of one (non-trivial) connected component of the incompatibility graph. All compatible sites can be put on edges outside of the galls. This is the key to the galled-tree solution.

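Incompatibility Graph A galled-tree generating the sequences generated by the prior network. [Figure: the incompatibility graph on sites 1, 2, 3, 4, 5, shown next to the galled-tree with two galls that derives the leaves a: 00010, b: 10010, c: 00100, d: 10100, e: 01100, f: 01101, g: 00101.]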

Motivated by the one-one correspondence between galls and non-trivial connected components, we ask: To what extent does this one-one correspondence hold in general blobbed-trees, i.e. with no constraints on how recombination cycles interweave?

The Decomposition Theorem (Recomb 2005) For any set of sequences M, there is a blobbed-tree T(M) that derives M, where each blob contains all and only the sites in one non-trivial connected component of G(M). The compatible sites can always be put on edges outside of any blob. A blobbed-tree with this structure is called fully-decomposed.

Optimality of Galled-Trees Theorem: (G,H,B,B) The minimum number of recombination nodes in any phylogenetic network for M is at least the number of non-trivial connected components of the incompatibility graph. Hence, when the sequences on each blob on T(M) can be generated with a single recombination node, the blobbed-tree minimizes the number of recombination nodes over all phylogenetic networks and all choices of ancestral sequence. This solves the root-unknown galled-tree problem in polynomial time. Code is on the web.

The number of arrangements on a gall (all-0 ancestral sequence) By analysing the algorithm to layout the sites on a gall (not discussed here), one can prove that the number of arrangements of any gall is at most three, and this happens only if the gall has two sites. If the gall has more than two sites, then the number of arrangements is at most two. If the gall has four or more sites, with at least two sites on each side of the recombination point (not the side of the gall) then the arrangement is forced and unique.

Computing close bounds on the minimum number of recombinations Dan Gusfield UCD Y. Song, Y. F. Wu, D. Gusfield (ISMB2005) D. Gusfield, D. Hickerson (Dis. Appl. Math 2005) D. Gusfield, V. Bansal (Recomb 2005)

The grandfather of all lower bounds - HK 1985 Arrange the nodes of the incompatibility graph on the line in order that the sites appear in the sequence. This bound requires a linear order. The HK bound is the minimum number of vertical lines needed to cut every edge in the incompatibility graph. Weak bound, but widely used - not only to bound the number of recombinations, but also to suggest their locations.

Justification for HK If two sites are incompatible, there must have been some recombination where the crossover point is between the two sites.

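HK Lower Bound [Figure: nodes 1 2 3 4 5 of the incompatibility graph arranged on a line.]

HK Lower Bound = 1 [Figure: nodes 1 2 3 4 5 on a line, with one vertical line cutting every edge.]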

More general view of HK Given a set of intervals on the line, and for each interval I, a number N(I), define the composite problem: Find the minimum number of vertical lines so that every interval I intersects at least N(I) of the vertical lines. In HK, each incompatibility defines an interval I where N(I) = 1. The composite problem is easy to solve by a left-to-right myopic placement of vertical lines: Sort the intervals by right end-point; Process the intervals left to right in that order; when the right endpoint of an interval I is reached, place there (if needed) additional vertical lines so that N(I) lines intersect I.
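The myopic left-to-right placement can be sketched as follows in Python. The encoding is an assumption made for illustration: sites are numbered 1..m, a vertical line sits in a gap g between site g and site g + 1, several lines may share a gap, and an interval (left, right) is cut by the lines in gaps left .. right - 1.

```python
def composite_bound(intervals):
    """Greedy solution of the composite problem.

    intervals: iterable of (left_site, right_site, need) with left < right.
    Returns (total number of lines, {gap: number of lines in that gap}).
    """
    lines = {}
    for left, right, need in sorted(intervals, key=lambda t: t[1]):
        # Lines already placed that cut this interval.
        have = sum(c for g, c in lines.items() if left <= g < right)
        if have < need:
            # Place the missing lines as far right as the interval allows.
            lines[right - 1] = lines.get(right - 1, 0) + (need - have)
    return sum(lines.values()), lines


# HK bound: one interval with N(I) = 1 per incompatible pair of sites.
hk_intervals = [(1, 3, 1), (1, 4, 1), (2, 5, 1)]
print(composite_bound(hk_intervals))  # (1, {2: 1})
```

Placing new lines at the right end of the interval being processed keeps them available to as many later intervals as possible, which is why the greedy choice is safe.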

If each N(I) is a ``local” lower bound on the number of recombinations needed in interval I, then the solution to the composite problem is a valid lower bound for the full sequences. The resulting bound is called the composite bound given the local bounds. This general approach is called the Composite Method (Simon Myers 2002).

Haplotype Bound (Simon Myers) Rh = Number of distinct sequences (rows) - Number of distinct sites (columns) -1 (folklore) Before computing Rh, remove any site that is compatible with all other sites. A valid lower bound results - generally increases the bound. Generally really bad bound, often negative, when used on large intervals, but Very Good when used as local bounds in the Composite Interval Method, and other methods.
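The haplotype bound is a one-line count once the rows and columns are restricted to the chosen sites. A minimal Python sketch with a hypothetical function name; it does not perform the preprocessing step of removing fully compatible sites, and it is applied to the sequences a through g from the incompatibility graph example.

```python
def haplotype_bound(rows, sites=None):
    """Rh = (#distinct rows) - (#distinct columns) - 1 on the given sites.

    sites: 1-based site indices to keep (default: all sites).
    """
    if sites is None:
        sites = range(1, len(rows[0]) + 1)
    sites = list(sites)
    distinct_rows = {tuple(r[j - 1] for j in sites) for r in rows}
    distinct_cols = {tuple(r[j - 1] for r in rows) for j in sites}
    return len(distinct_rows) - len(distinct_cols) - 1


M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
print(haplotype_bound(M))          # all five sites
print(haplotype_bound(M, [1, 3]))  # an incompatible pair
print(haplotype_bound(M, [3, 4]))  # a compatible pair
```

On the compatible pair of sites 3, 4 the value is negative, which illustrates why Rh is of little use on its own but works well as a local bound.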

Composite Interval Method using Rh bounds Compute Rh separately for each of the C(m,2) intervals of the m sites; let N(I) = Rh(I) be the local lower bound for interval I. Then compute the composite bound using these local bounds. Polynomial time and gives bounds that often double HK in our simulations.

Composite Subset Method (Myers) Let S be a subset of sites, and Rh(S) be the haplotype bound for subset S. If the leftmost site in S is L and the rightmost site in S is R, then use Rh(S) as a local bound N(I) for interval I = [L,R]. Compute Rh(S) on many subsets, and then solve the composite problem to find a composite bound.

RecMin (Myers) World Champion Lower Bound Program (until now). Often RecMin gives a bound three times as large as HK. Computes Rh on subsets of sites, but limits the size and the span of the subsets. Default parameters are s = 6, w = 15 (s = size, w = span). Generally, impractical to set s = w = m, so generally one doesn’t know if increasing the parameters would increase the bound. (example: Myers bound of 70 on the LPL data).

Optimal RecMin Bound (ORB) The Optimal RecMin Bound is the lower bound that RecMin would produce if both parameters were set to their maximum values (s = w= m). In general, RecMin cannot compute (in practical time) the ORB. Practical computation of the ORB is our first contribution.

Computing the ORB Gross Idea: For each interval I, use ILP to find a subset of sites S that maximizes Rh(S) over all subsets in interval I. Call the result Opt(I). Set N(I) = Opt(I), and compute the composite bound using those local bounds. The composite bound is the ORB. -- the result one would get by using all 2^m subsets in RecMin, with s = w = m.

We have moved from doing an exponential number of simple computations (computing Rh for each subset), to solving a quadratic number of (possibly expensive) ILPs. Is this a good trade-off in practice? Our experience - very much so!

How to compute Opt(I) by ILP Create one variable Xi for each row i; one variable Yj for each column j in interval I. All variables are 0/1 variables. Define S(i,i’) as the set of columns where rows i and i’ have different values. Each column in S(i,i’) is a witness that rows i and i’ differ. For each pair of rows (i,i’), create the constraint: Xi + Xi’ <= 1 + ∑ [Yj: j in S(i,i’)] Objective Function: Maximize ∑ Xi - ∑ Yj -1

Alternate way to compute Opt(I) by ILP First remove any duplicate rows. Let N be the number of rows remaining. Create one variable Yj for each column j in interval I. All variables are 0/1 variables. S(i,i’) as before. For each pair of rows (i,i’) create the constraint: 1 <= ∑ [Yj: j in S(i,i’)] Objective Function: Maximize N - ∑(Yj) -1 Finds the smallest number of columns to distinguish the rows.
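The second formulation asks for the fewest columns that still distinguish every pair of distinct rows. The Python sketch below replaces the ILP with exhaustive search over column subsets in increasing size, so it is exponential in the interval width and only meant for small examples. The function name and the 1-based interval encoding are assumptions.

```python
from itertools import combinations


def opt_interval(rows, left, right):
    """Opt(I) for the interval of sites left..right (1-based, inclusive).

    Returns N - (minimum number of distinguishing columns) - 1, where N is
    the number of distinct rows restricted to the interval.
    """
    cols = range(left - 1, right)
    distinct = sorted({tuple(r[j] for j in cols) for r in rows})
    n = len(distinct)
    width = len(cols)
    for size in range(width + 1):
        for subset in combinations(range(width), size):
            projected = {tuple(r[j] for j in subset) for r in distinct}
            if len(projected) == n:  # every pair of rows is distinguished
                return n - size - 1
    # Not reached: the full column set always distinguishes distinct rows.


M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
print(opt_interval(M, 1, 5))
print(opt_interval(M, 1, 3))
```

On the seven example sequences the full interval needs four of the five columns, giving Opt = 7 - 4 - 1 = 2, which is larger than the value Rh = 1 obtained from using all five columns.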

Two critical tricks 1) Use the second ILP formulation to compute Opt(I). It solves much faster than the first (why?) 2) Reduce the number of intervals where Opt(I) is computed: If the solution to Opt(I) uses columns that only span interval L, then there is no need to find Opt(I’) in any interval I’ containing L. Each ILP computation directly spawns at most 4 other ILPs. Apply this idea recursively.

With the second trick we need to find Opt(I) for only 0.5% - 35% of all the C(m,2) intervals (empirical result). Surprisingly fast in practice (with either the GNU solver or CPLEX).

Bounds Higher Than the Optimal RecMin Bound Often the ORB underestimates the true minimum, e.g. Kreitman’s ADH data: 6 v. 7 How to derive sharper bound? Idea: In the composite method, check if each local bound N(I) = Opt(I) is tight, and if not, increase N(I) by one. Small increases in local bounds can increase the composite bound.

Bounds Sharper Than Optimal RecMin Bound A set of sequences M is called self-derivable if there is a network generating M where the sequence at every node (leaf and intermediate) is in M. Observation: The haplotype bound for a set of sequences is tight if and only if the sequences are self-derivable. So for each interval I where Opt(I) is computed, we check self-derivability of the sequences induced by columns S*, where S* are the columns specified by Opt(I). Increase N(I) by 1 if the sequences are not self-derivable.

Algorithm for Self-Derivability Solution is easy when there are no mutations: only recombinations are allowed and one initial pair of sequences is chosen as “reached”. Two reached sequences S1 and S2 can reach a third sequence S3, if S3 can be created by recombining S1 and S2. Do BFS to see if all sequences can be reached by successive application of the “reach operation”. Clearly polynomial time and can be optimized with suffix-trees etc. (Kececioglu, Gusfield)
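The mutation-free reachability test can be sketched in Python as follows. The sketch uses a simple fixed-point loop in place of an explicit BFS queue, assumes single-crossover recombination with either sequence as the prefix parent, and makes no attempt at the suffix-tree speedups mentioned above; the function names are hypothetical.

```python
def can_recombine(s1, s2, target):
    """True iff target is a single-crossover recombinant of s1 and s2."""
    n = len(target)
    return any(target == s1[:k] + s2[k:] or target == s2[:k] + s1[k:]
               for k in range(n + 1))


def all_reachable(sequences, start_pair):
    """True iff every sequence can be reached from start_pair by repeatedly
    recombining two already-reached sequences (no mutations)."""
    reached = set(start_pair)
    remaining = set(sequences) - reached
    changed = True
    while changed and remaining:
        changed = False
        for t in sorted(remaining):
            if any(can_recombine(a, b, t) for a in reached for b in reached):
                reached.add(t)
                remaining.discard(t)
                changed = True
    return not remaining


start = ("0000", "1111")
print(all_reachable(["0000", "1111", "0011", "0010"], start))
print(all_reachable(["0000", "1111", "0101"], start))
```

In the first call 0010 is not a recombinant of the starting pair, but it becomes reachable once 0011 has been reached, which is why the reach operation has to be applied repeatedly.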

Self-derivability Test with mutations allowed For each site i, construct the set MUT(i) of sequence pairs (S1, S2) in M where S1 and S2 differ at only site i. Try each sequence in M as root (which is the only reached sequence initially). For each root, enumerate all ways of choosing exactly one ordered pair of sequences (Sp, Sq) from each MUT(i). Sp is allowed to “reach” Sq. Run the prior self-derivability algorithm with these new permitted reach relations, to test if all sequences in M can be reached. If so, then M is self-derivable, otherwise it is not.

Checking if N(I) should be increased by two If the set of sequences is not self-derivable, we can test if adding one new sequence makes it self-derivable. The number of candidate sequences is polynomial, and for each one added to M we check self-derivability. N(I) should be increased by two if none of these sets of sequences is self-derivable.

Program HapBound HapBound -S: Checks each Opt(I) subset for self-derivability. Increases N(I) by 1 or 2 if possible. This often beats ORB and is still practical for most datasets. HapBound -M: Explicitly examines each interval directly for self-derivability. Increases the local bound if possible. Derives a lower bound of 7 for Kreitman’s ADH data in this mode.

HapBound vs. RecMin on LPL from Clark et al.
Program               Lower Bound   Time
RecMin (default)      59            3s
RecMin -s 25 -w 25    75            7944s
RecMin -s 48 -w 48    No result     5 days
HapBound                            31s
HapBound -S           78            1643s
(2 GHz PC)

Example where RecMin has difficulty in Finding Optimal Bound on a 25 by 376 Data Matrix
Program               Bound   Time
RecMin default        36      1s
RecMin -s 30 -w 30    42      3m 25s
RecMin -s 35 -w 35    43      24m 2s
RecMin -s 40 -w 40            2h 9m 4s
RecMin -s 45 -w 45            10h 20m 59s
HapBound              44      2m 59s
HapBound -S           48      39m 30s

Frequency of HapBound -S Bound Sharper Than Optimal RecMin Bound
ms param.   Rho=1   Rho=5   Rho=10   Rho=20
Theta=1     0.0%    0.4%    0.5%     1.5%
Theta=5     0.7%    4.0%    10.4%    27.0%
Theta=10    1.4%    9.2%    17.8%    40.4%
Theta=20    10.5% 27.8% 45.4% (one value missing)
For each ms parameter setting, 1000 data sets are used.

Computing Upper Bounds The method is an adaptation of the "history" lower bound (Myers). A non-informative column is one with fewer than two 0’s or fewer than two 1’s.

Single History Computation
1) Set W = 0.
2) Collapse identical rows together, and remove non-informative columns. Repeat until neither is possible.
3) Let A be the data at this point. If A is empty, stop; else remove some row r from A, and set W = W + 1. Go to step 2).
Note that the choice of r is arbitrary in Step 3), so the resulting W can vary.

History Lower Bound Theorem (Myers) Let W* be the minimum W obtained from all possible single history computations. Then W* is a valid lower bound on the number of recombinations needed. Naïve time: theta(n!) (RecMin), but can be reduced to theta(2^n) (Bafna, Bansal).
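The single history computation and the minimum W* from the theorem can be sketched as follows, assuming rows are equal-length strings over 0/1. The names reduce_matrix, single_history and history_bound are hypothetical. The exhaustive search is a simple memoized recursion over reduced row sets, written for illustration only; it takes exponential time and is not the RecMin or Bafna-Bansal implementation.

```python
import random
from functools import lru_cache


def reduce_matrix(rows):
    """Collapse identical rows and drop non-informative columns until stable."""
    rows = list(dict.fromkeys(rows))  # collapse duplicates, keep order
    while rows:
        m = len(rows[0])
        # informative column: at least two 1's and at least two 0's
        keep = [j for j in range(m)
                if 2 <= sum(r[j] == "1" for r in rows) <= len(rows) - 2]
        if not keep:
            return []  # no informative column left: A is empty
        reduced = list(dict.fromkeys("".join(r[j] for j in keep) for r in rows))
        if reduced == rows:
            return rows
        rows = reduced
    return rows


def single_history(rows, choose=random.choice):
    """One run of steps 1) to 3); the result depends on the rows chosen."""
    w = 0
    a = reduce_matrix(rows)
    while a:
        r = choose(a)
        a = reduce_matrix([x for x in a if x != r])
        w += 1
    return w


def history_bound(rows):
    """W*: the minimum W over all possible choices of removed rows."""
    @lru_cache(maxsize=None)
    def best(a):
        if not a:
            return 0
        return 1 + min(
            best(tuple(sorted(reduce_matrix([x for x in a if x != r]))))
            for r in a
        )
    return best(tuple(sorted(reduce_matrix(rows))))


if __name__ == "__main__":
    data = ["00", "01", "10", "11"]  # all four gametes are present
    print(single_history(data), history_bound(data))
```

Passing a different choose function (for example, one that always picks the first row) makes single_history deterministic, which is convenient when comparing removal orders.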

Converting the History Lower Bound to an Upper Bound Given a set of rows A and a single row r, define w(r | A - r) as the minimum number of recombinations needed to create r from A-r (well defined in our application). w(r | A-r) can be computed in linear time by a greedy-type algorithm.
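One way to realize the greedy idea for w(r | A-r) is sketched below, under the assumption that r is built left to right from segments, each copied from a single row of A-r at the same columns, so that the number of recombinations is the number of segments minus one. The function name recombinations_to_create is an assumption for illustration. Each segment is extended as far right as any one row allows, which is the usual greedy argument for a minimum number of segments.

```python
def recombinations_to_create(r, others):
    """Minimum number of crossovers needed to assemble r from rows in others."""
    m = len(r)
    pos = 0
    segments = 0
    while pos < m:
        # farthest column reachable by copying from one row, starting at pos
        best = pos
        for s in others:
            j = pos
            while j < m and s[j] == r[j]:
                j += 1
            best = max(best, j)
        if best == pos:
            raise ValueError(f"no row matches r at column {pos}")
        pos = best
        segments += 1
    return max(segments - 1, 0)


if __name__ == "__main__":
    print(recombinations_to_create("0011", ["0000", "1111"]))
    print(recombinations_to_create("0110", ["0000", "1111"]))
```

After non-informative columns have been removed, every column of A-r still contains both a 0 and a 1, so the ValueError branch should not trigger in that setting; it is kept only as a guard for arbitrary input.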

Upper Bound Computation
1) Set W = 0.
2) Collapse identical rows together, and remove non-informative columns. Repeat until neither is possible.
3) Let A be the data at this point. If A is empty, stop; else remove some row r from A, and set W = W + w(r | A-r). Go to step 2).
Note that the choice of r is arbitrary in Step 3), so the resulting W can vary. This is the Single History Computation, with a change in step 3).
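A rough sketch of one run of the upper bound computation, assuming rows are equal-length 0/1 strings and choosing in Step 3) the row that is currently cheapest to recreate. That local rule is only one possible choice of r, and the names upper_bound, cost and reduce_matrix are hypothetical, not SHRUB's. The two helpers are restated so that the block runs on its own.

```python
def reduce_matrix(rows):
    """Collapse identical rows and drop non-informative columns until stable."""
    rows = list(dict.fromkeys(rows))
    while rows:
        keep = [j for j in range(len(rows[0]))
                if 2 <= sum(r[j] == "1" for r in rows) <= len(rows) - 2]
        if not keep:
            return []
        reduced = list(dict.fromkeys("".join(r[j] for j in keep) for r in rows))
        if reduced == rows:
            return rows
        rows = reduced
    return rows


def cost(r, others):
    """Greedy w(r | A-r): number of segments needed to copy r, minus one."""
    pos, segments = 0, 0
    while pos < len(r):
        best = pos
        for s in others:
            j = pos
            while j < len(r) and s[j] == r[j]:
                j += 1
            best = max(best, j)
        if best == pos:
            raise ValueError("r cannot be assembled from the other rows")
        pos, segments = best, segments + 1
    return max(segments - 1, 0)


def upper_bound(rows):
    """One history: repeatedly remove the row that is cheapest to recreate."""
    w = 0
    a = reduce_matrix(rows)
    while a:
        c, r = min((cost(x, [y for y in a if y != x]), x) for x in a)
        w += c  # step 3): W = W + w(r | A-r)
        a = reduce_matrix([y for y in a if y != r])
    return w


if __name__ == "__main__":
    print(upper_bound(["00", "01", "10", "11"]))
```

Replacing the min(...) rule with a random choice, or exploring several choices and keeping the smallest W, gives other valid upper bounds from the same procedure.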

Note, even a single execution of the upper bound computation gives a valid upper bound, and a way to construct a network. This is in contrast to the History Bound which requires finding the minimum W over all histories. However, we also would like to find the lowest upper bound possible with this approach. This can be done in O(2^n) time by DP. In practice, we can use branch and bound to find the lowest upper bound, but we have also found that branching on the best local choice, or randomizing gives good results.

Branch and Bound (Branching) In Step 3) choose r to minimize w(r | A-r) + L(A-r), where L(A-r) is some fast lower bound on the number of recombinations needed for the set A-r. Even HK is good for this purpose. (Bounding) Let C be the minimum for a full solution found so far; if W + L(A) >= C, then backtrack.

Kreitman’s 1983 ADH Data 11 sequences, 43 segregating sites. Both HapBound and SHRUB took only a fraction of a second to analyze this data. Both produced 7 for the number of detected recombination events. Therefore, independently of all other methods, our lower and upper bound methods together imply that 7 is the minimum number of recombination events.

A minimal ARG for Kreitman’s data SHRUB produces code that can be input to an open source program to display the constructed ARG

The Human LPL Data (Nickerson et al. 1998) (88 Sequences, 88 sites) Our new lower and upper bounds Optimal RecMin Bounds (We ignored insertion/deletion, unphased sites, and sites with missing data.)

Study on simulated data: Exact-Match frequency for varying parameters θ = Scaled mutation rate, ρ = Scaled recombination rate, n = Number of sequences. Used Hudson’s MS to generate 1000 simulated datasets for each pair of θ and ρ. Panels: n = 25, n = 15. For < 5, our lower and upper bounds match more than 95% of the time.

Exact-Match frequency for varying number of sequences Match frequency does not depend on n as much as it does on θ or ρ.

A closer look at the deviation Average ratio of lower bound to upper bound when they do not match. For n = 25: The numerical difference between lower and upper bounds grows as θ or ρ increases, but their ratio is more stable.

Multiple Crossover Recombination 4-crossovers 2-crossovers = "gene conversion"

Extensions to Gene Conversion "Gene Conversion" is a short two-crossover recombination that occurs in meiosis. The extent of gene conversion is only now being understood, due to a prior lack of fine-scale molecular data and a lack of algorithmic tools. Gene Conversion is the Achilles heel of association mapping proposals.

New Results Both the lower bound and upper bound methods have been extended to incorporate gene conversion as well as single-crossover recombination. Allowing gene-conversion can reduce the total number of nodes in the network, and also the associated bounds. But …

Distinguishing gene conversion from recombination For a given set of sequences, let B be the bound (lower or upper) when only recombination is allowed, and let BC be the bound when gene-conversion is also allowed. Define D = B - BC. We expect that D will generally be larger when sequences are generated using gene-conversion compared to when they are generated with recombination only. In such studies, we have shown that we can use statistics like D to correctly determine the amount of gene-conversion used to generate the sequences.

Take-home message The upper and lower bound algorithms cannot "make up" gene conversions. The bounds reflect the extent of gene conversion in the true generation of the sequences.

Papers and Software wwwcsif.cs.ucdavis.edu/~gusfield/