Genome Halving – work in progress Fulton Wang ACGT Group Meeting
Motivation: chromosomes have undergone whole duplication in the past, going from “pre duplicated genome” to “perfectly duplicated genome” “pre duplicated genome” A: single circular chr with exactly 1 copy of each gene “perfectly duplicated chr” A+A: genome that has 2 copies of every adjacency in A A+A could be either 1 or 2 chrs, ie A-A or A A depending on how you label the adjacencies in A+A Eg: > or
After the duplication event, rearrangements occur that alter the adjacencies in the perfectly duplicated chr to give us present day genome G Eg: > We will assume that G is a single circular chr Given G, want to infer A
Natural solution: want to find preduplicated chr A such that dist(A+A, G) is minimized Simple distance measure: breakpoint distance: N - # shared adjacencies between 2 chrs Eg: 1234 and 1243 has breakpoint distance of 1 But A+A and G each have 2 copies of every gene
What is breakpoint distance between 1212 and 1122? Not well defined. Need to match up the genes in the 2 chr to each other. Need to label every gene copy and gives 0 shared adjacencies and gives 1 shared adjacency
If G1, G2 are duplicated genomes(exactly 2 copies of every gene), then dist(G1,G2) is the minimum of bp(G1,G2) over all possible labellings of G1, G2 Halving problem: given a genome G with exactly 2 copies of each gene, we want to find the minimum over all pre duplicated genomes A of dist(A+A, G). Assume A and G are both single circular chr
Restriction is that the pre duplicated chr A be a single circular chr If you didn’t care about # pieces in A, then there is simple algorithm to solve halving problem – involves just maximum matching where the vertices are the gene ends, edges between each pair of gene ends weighed by how many times that adjacency appears in G
Restrictions on A make the problem harder Right now, I am considering the simpler case when I restrict all the genes to be positively oriented, ie no backwards facing genes
Graph representation of problem A chr can be represented as a directed graph, where each vertex is a specific copy of a gene becomes 1 1 ->2 1 ->1 2 ->2 2 Genome depicted is Grouped members of same gene family together. View each family as 1 “big” vertex, so that graph has 4 vertices
Picking A corresponds to picking a tour of the “big” vertices. Suppose A = Before labeling the adjacencies in A, you don’t know if A+A is A-A or A A Can you label the adjacencies in A+A such that the number of shared adjacencies between A+A and G is the number of shared adjacencies between A+A and G, if the genes are unlabeled?
Yes. Note that this is only true because we don’t care if A+A is 1 or 2 chr This means that if you fix A, then bp(A+A) is equal to the number of “small” edges parallel to A This means we can work with alternate graph, with n vertices, each representing a gene family, where the weight of an edge is the number of adjacencies between the 2 gene families in G
Now, genome halving is reduced to finding max weight directed ham cycle in the complete graph, with edge weights of 0,1,2 dist(A+A,G) = min_A min_labellings (A+A, G) Can simplify problem – if A-A/A A is the minimum possible distance from G, and some adjacency A->B appears twice in G, then A must contain that adjacency, and can merge the 2 vertices
So then, halving is reduced to finding max weight Ham cycle in complete directed graph, where all the edges out of a vertex have weight 0 except 2 of them, which have weight 1 Alternatively, want to find minimum path cover in a directed 2-regular graph. If you have a minimum path cover, you can link the paths with edges of weight 0 to get a max weight ham cycle.
Guided halving Same situation as before, but now we know an ancestor of the current organism, call it B. Then, we wish to minimize bp(A+A,G) + d(A,B) Note that B only contains 1 copy of each gene, same with A
Then, have similar problem to before – given complete graph where edges are of weight 0,1,2, find maximum weight ham cycle This is the exact same problem that is needed to solve in the breakpoint median problem. Likely means guided halving is NP-complete.