Genome Rearrangement and Duplication Distance
Crystal L. Kahn
9/18/08
Genome Rearrangement
Over the course of evolution, genomes undergo large structural changes: chromosomal fissions, fusions, inversions, and transpositions.
Genome rearrangement is an area of computational biology that uses parsimony* methods to compute “distances” between pairs of genomes.
The similarity between two genomes is characterized by the number of operations required to transform one into the other.
Point mutations (SNPs) are not of interest here, so this differs from edit distance.
* Maximum likelihood methods can also be used.
Genome Rearrangements
Humans and mice have similar genomes, but their genes are ordered differently: ~245 rearrangements, ~300 large synteny blocks.
History of Chromosome X (Rat Consortium, Nature, 2004)
Rearrangement events: reversals, fusions, fissions, translocations.
Genome Rearrangement Models
Types of rearrangement operations that have been considered:
Reversal (inversion) [HP, STOC95], [Bader et al., WADS01]
Translocation [Hannenhalli, DAM95]
Duplication transposition [El-Mabrouk, JCSS02]
Ultimate goal: a generic genome rearrangement model that allows any type of rearrangement.
Duplications are common in cancer.
[Figure: genomes G1 and G2]
Duplication Distance: D_X(Z, Y)
Input: strings X, Y, Z (X non-ambiguous).
Def: the duplicate operation Z ∘ δ_{s,t,p}(X) copies the substring X_{s,t} of X and pastes it into Z at position p.
Problem: compute D_X(Z, Y) = the minimum number of duplicate operations needed to transform Z into Y.
Theorem: there is an O(n^4) algorithm, where n = |Y|.
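A minimal sketch of a single duplicate operation may make the definition concrete. The function name `duplicate` and the indexing convention (1-indexed, inclusive s..t, with the copy inserted so that it starts at position p of the result) are assumptions made for illustration; the slide itself only gives the figure.

```python
def duplicate(z: str, x: str, s: int, t: int, p: int) -> str:
    """Copy x[s..t] (1-indexed, inclusive) and paste it into z.

    Assumed convention: the pasted copy ends up between positions
    p-1 and p of the original z, so p ranges over 1..len(z)+1.
    """
    if not (1 <= s <= t <= len(x)):
        raise ValueError("need 1 <= s <= t <= len(x)")
    if not (1 <= p <= len(z) + 1):
        raise ValueError("need 1 <= p <= len(z) + 1")
    return z[:p - 1] + x[s - 1:t] + z[p - 1:]


X = "abcdefg"
step1 = duplicate("", X, 1, 4, 1)     # copy "abcd" into the empty string
step2 = duplicate(step1, X, 2, 3, 5)  # copy "bc" and append it
print(step1, step2)                   # abcd abcdbc
```

Under this convention, D_X(Z, Y) counts the fewest such calls, all copying from the same fixed X, that turn Z into Y.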
Definitions
String: a sequence of characters.
Substring: a contiguous sequence of characters.
Subsequence: a sequence of characters, not necessarily contiguous.
Note: every substring is a subsequence, but not necessarily vice versa.
Example: for T = abcdefg, bcd is a substring and ace is a subsequence.
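The two definitions can be checked mechanically. The helper names below are illustrative, not from the talk.

```python
def is_substring(s: str, t: str) -> bool:
    """True if s occurs contiguously in t."""
    return s in t


def is_subsequence(s: str, t: str) -> bool:
    """True if the characters of s occur in t in order, gaps allowed."""
    remaining = iter(t)
    # `ch in remaining` advances the iterator until ch is found,
    # so each later character must be found after the earlier ones.
    return all(ch in remaining for ch in s)


T = "abcdefg"
print(is_substring("bcd", T), is_subsequence("bcd", T))  # True True
print(is_substring("ace", T), is_subsequence("ace", T))  # False True
```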
Key Insight
W.L.O.G., let Z = Ø.
[Figure: X = abcdefghijklmnopqrs, Y = abcdjkcdeflopq, and a further row abcdcdjekflopq; a pair of subsequences is labeled “overlapping”]
Observation: overlapping subsequences interfere with each other.
Lemma: a set of subsequences of Y that are substrings of X and that together cover all the characters of Y can be converted into a sequence of duplicate operations iff they are mutually non-overlapping. Such a set is called a “feasible set”.
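The slide does not spell out what “overlapping” means. The sketch below assumes it means that the index sets of two subsequences interleave (some i < j < i' < j' with i, i' from one subsequence and j, j' from the other), so a subsequence that sits entirely inside a gap of another one does not count. Under that assumption, the two character rows from the figure behave differently; the index sets used here are one hand-picked reading of those rows.

```python
def overlaps(u: list[int], v: list[int]) -> bool:
    """Assumed definition: disjoint index sets u and v overlap if they interleave."""
    # Label every index by the set it came from, then read labels left to right.
    labels = [lab for _, lab in sorted([(i, "u") for i in u] + [(j, "v") for j in v])]
    # Collapse runs of equal labels; "u v u v" (length >= 4) means interleaving.
    runs = [lab for k, lab in enumerate(labels) if k == 0 or lab != labels[k - 1]]
    return len(runs) >= 4


# Row "abcdjkcdeflopq" (0-indexed): jkl at 4,5,10 and cdef at 6..9.
print(overlaps([4, 5, 10], [6, 7, 8, 9]))  # False: cdef is nested inside jkl
# Row "abcdcdjekflopq" (0-indexed): jkl at 6,8,10 and ef at 7,9.
print(overlaps([6, 8, 10], [7, 9]))        # True: jkl and ef interleave
```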
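Finding a min-cardinality feasible set for Y_{s,t}
Consider the element of the feasible set that includes index s. Two cases:
Case 1: it includes index t.
Case 2: it does not include index t.
[Figure: Y with the substring Y_{s,t} marked between s and t, once for each case]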
Let d(Y_{s,t}) = D_X(Ø, Y_{s,t}), where the minimum is taken over Case 1 and Case 2 for Y_{s,t}.
Assume, by induction, that the smaller subproblems are already computed.
The “internal substrings” of a placement of X_{s,t} in Y_{s,t} are what remains of Y_{s,t} once the placed characters are removed.
Example: X_{s,t} = abcd, Y_{s,t} = abcbccabcd.
[Figure: several different placements of abcd as a subsequence of abcbccabcd]
There is possibly an exponential number of “placements”.
a_{s,t} is computed with a second recurrence in O(n^2) time.
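To see why enumerating placements directly is unattractive, the number of ways a pattern occurs as a subsequence can be counted with a standard dynamic program. This is only a counting aid under the assumption that a “placement” is an occurrence of the pattern as a subsequence; it is not the second recurrence mentioned on the slide, and `count_placements` is a hypothetical name.

```python
def count_placements(pattern: str, text: str) -> int:
    """Count occurrences of pattern as a subsequence of text."""
    # ways[k] = number of ways to place the first k pattern characters so far.
    ways = [1] + [0] * len(pattern)
    for ch in text:
        # Go right to left so each text character extends only earlier states.
        for k in range(len(pattern), 0, -1):
            if pattern[k - 1] == ch:
                ways[k] += ways[k - 1]
    return ways[len(pattern)]


print(count_placements("abcd", "abcbccabcd"))  # 9
print(count_placements("abcd", "aabbccdd"))    # 16, i.e. 2**4
```

The second call hints at the exponential growth: doubling each of m distinct characters gives 2^m placements.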
Assume, by induction, that the smaller subproblems are already computed.
b_{s,t} is computed in O(n) time.
Running Time
n = |Y|
For a substring Y_{s,t}: computing a_{s,t} takes O(n^2) time and computing b_{s,t} takes O(n) time.
There are O(n^2) substrings of Y in total.
Total running time: O(n^2) substrings × O(n^2) time each = O(n^4).
Duplication Transposition vs. Duplication
Duplication transposition: “paste” into the same string, G ∘ δ_{s,t,p}(G), with s < t < p.
Duplication: “paste” into another string.
[Figure: G with positions 1, s, t, p, n marked; after the operation the copied segment sits between positions p-1 and p]
Duplication can be more complicated…
G ∘ δ_{s,t,p}(G) with s < p < t: the paste position falls inside the copied segment.
[Figure: G with positions 1, s, p-1, p, t, n marked]
Duplication Transposition Distance in Semi-Ambiguous Genomes
[El-Mabrouk, JCSS02] incorrectly computes duplication transposition distance.
The implication in the paper is that, given X non-ambiguous and Y semi-ambiguous, D_T(X, Y) = the number of maximal repeated segments of Y.
Counterexample: X = abcdefg, Y = abdecdbcefg
Y0 = abcdefg
Y1 = abcdbcefg
Y2 = abdcdbcefg
Y3 = abdecdbcefg
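The chain Y0 to Y3 can be replayed step by step. The slide lists only the intermediate strings; the specific (s, t, p) triples below are one choice that reproduces them, and both the helper name `transpose_duplicate` and the 1-indexed convention (copy G_{s,t}, insert it between positions p-1 and p of G) are assumptions.

```python
def transpose_duplicate(g: str, s: int, t: int, p: int) -> str:
    """Copy g[s..t] (1-indexed, inclusive) and paste it back into g before position p."""
    if not (1 <= s <= t <= len(g)) or not (1 <= p <= len(g) + 1):
        raise ValueError("indices out of range")
    return g[:p - 1] + g[s - 1:t] + g[p - 1:]


y0 = "abcdefg"
y1 = transpose_duplicate(y0, 2, 3, 5)  # copy "bc", paste before "e"
y2 = transpose_duplicate(y1, 4, 4, 3)  # copy "d", paste before the first "c"
y3 = transpose_duplicate(y2, 8, 8, 4)  # copy "e", paste before the first "c"
print(y1, y2, y3)  # abcdbcefg abdcdbcefg abdecdbcefg
```

Three operations therefore suffice to reach Y from X in this example.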
A Lower Bound for Duplication Transposition Distance
Lemma: if Y has at most 2 copies of every character, X is non-ambiguous, and X is a subsequence of Y, then D_X(X, Y) ≤ D_T(X, Y).
There is still no known algorithm for duplication transposition distance.
Conclusions
Duplication distance is a simple model for genome rearrangement and can be computed efficiently.
In a special case, it provides a lower bound on duplication transposition distance.
Thank you! Questions?
New Model for Cancer Mutation: Amplisomes
It can be shown that minimum amplisome distance can be reframed as min [D_G(A, Ø) + D_A(T, A)], where the min is taken over all possible choices of A.
Duplication distance is a subproblem.
Tumor Amplisomes (Maurer, et al. 1987; Wahl, 1989…)
Other terms: episome, amplicon, double-minute
D_X(X, Y) ≤ D_T(X, Y) when Y is semi-ambiguous
Why is semi-ambiguity necessary?
Semi-ambiguity ensures that all copied substrings are substrings of the original X (not of some intermediate string), so for every duplication transposition operation there exists a duplicate operation that produces the same result.
Example: X = A, Y = AAAAAAAA: D_T(X, Y) = 3, D_X(X, Y) = 7.
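The two numbers in the example can be reproduced with a small simulation. A duplication transposition may copy the whole current string, so the length can double at each step, while a duplicate operation can only copy from X = A and so adds one character at a time. The function names are illustrative, and the code only replays these two strategies; it does not search for minima. That the counts are optimal follows from the length argument: at most doubling per transposition, at most |X| = 1 new character per duplicate.

```python
def doubling_ops(x: str, target: str) -> int:
    """Count steps when each step pastes a copy of the whole current string at its end."""
    g, ops = x, 0
    while len(g) < len(target):
        g = g + g  # duplication transposition copying all of g
        ops += 1
    assert g == target  # holds here because len(target) is len(x) times a power of two
    return ops


def single_copy_ops(x: str, target: str) -> int:
    """Count steps when each step pastes a copy of the fixed source x."""
    z, ops = x, 0
    while len(z) < len(target):
        z = z + x  # duplicate operation copying from the original x only
        ops += 1
    assert z == target
    return ops


print(doubling_ops("A", "A" * 8))     # 3
print(single_copy_ops("A", "A" * 8))  # 7
```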