Outline
Today's topic: greedy algorithms
Genome rearrangements
Sorting by reversals
Breakpoints
Shotgun sequencing
Fragment assembly: the shortest superstring problem
Finishing problem
Next time: greedy algorithms for motif finding
Turnip vs Cabbage: Almost Identical mtDNA gene sequences
In the 1980s, Jeffrey Palmer studied evolutionary change in plant organelles by comparing the mitochondrial genomes of cabbage and turnip
99% similarity between genes
These nearly identical gene sequences differed in gene order
This helped pave the way to analyzing genome rearrangements in molecular evolution
Turnip vs Cabbage: Different mtDNA Gene Order
Gene order comparison:
Transforming Cabbage into Turnip
Genome rearrangements
Mouse (X chromosome) and Human (X chromosome), compared through an unknown common ancestor ~75 million years ago
What are the similarity blocks and how do we find them?
What is the architecture of the ancestral genome?
What is the evolutionary scenario for transforming one genome into the other?
Mouse vs Human Genome
Humans and mice have similar genomes, but their genes are ordered differently
~245 rearrangements (reversal, fusion, fission, translocation)
Reversal: flipping a block of genes within a genomic sequence
Reversals
We assume that the genes in genomic segment p do not have direction
p = p1 ... pi-1 pi pi+1 ... pj-1 pj pj+1 ... pn
p' = p * r(i,j) = p1 ... pi-1 pj pj-1 ... pi+1 pi pj+1 ... pn
r(i, j) is defined as the operation that reverses the order of the elements of p (the original genomic segment) between pi and pj
Reversals: Example
p = 1 2 3 4 5 6 7 8
p * r(3,5) = 1 2 5 4 3 6 7 8
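In code, a reversal is a single slice operation. The following is a minimal Python sketch (the function name is mine); it reproduces the example above.

    def reversal(p, i, j):
        # return p * r(i, j): reverse the elements of p between positions i and j (1-based, inclusive)
        return p[:i - 1] + p[i - 1:j][::-1] + p[j:]

    print(reversal([1, 2, 3, 4, 5, 6, 7, 8], 3, 5))   # [1, 2, 5, 4, 3, 6, 7, 8]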
Reversals: Example
5' ATG CCTGTA CTA 3'
3' TAC GGACAT GAT 5'
Break and invert:
5' ATG TACAGG CTA 3'
3' TAC ATGTCC GAT 5'
Reversal Distance Problem
Goal: Given two permutations, find the shortest series of reversals that transforms one into the other
Input: Permutations p and s
Output: A series of reversals r1, ..., rt transforming p into s, such that t is minimum
t is the reversal distance between p and s
d(p, s) denotes the smallest possible value of t for given p and s
Sorting By Reversals Problem
Goal: Given a permutation, find a shortest series of reversals that transforms it into the identity permutation (1 2 ... n)
Input: Permutation p
Output: A series of reversals r1, ..., rt transforming p into the identity permutation such that t is minimum
Sorting By Reversals: Example
The minimum value of t is denoted d(p) and is called the reversal distance of permutation p
Example: input p = ..., output: a series of 3 reversals, so d(p) <= 3
Is this the smallest number of reversals?
Sorting By Reversals: A Greedy Algorithm
If sorting permutation p = ..., the first three numbers are already in order, so it does not make sense to break them apart
These already-sorted leading numbers of p define prefix(p); here prefix(p) = 3
This suggests a greedy algorithm for sorting by reversals: increase prefix(p) at every step
Greedy Algorithm: An Example
Proceeding this way, p can be sorted; here d(p) = 2
The number of steps needed to sort a permutation of length n is at most n - 1
Greedy Algorithm: Pseudocode
SimpleReversalSort(p)
1  for i <- 1 to n - 1
2      j <- position of element i in p (i.e., pj = i)
3      if j != i
4          p <- p * r(i, j)
5          output p
6      if p is the identity permutation
7          return
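The pseudocode translates directly into Python. The sketch below is mine, and the input permutation is my own illustrative choice rather than the example used on the following slides.

    def simple_reversal_sort(p):
        # greedy prefix sorting: at step i, bring element i into position i with one reversal
        p = list(p)
        n = len(p)
        for i in range(1, n):                              # i = 1 .. n - 1
            j = p.index(i) + 1                             # 1-based position of element i in p
            if j != i:
                p = p[:i - 1] + p[i - 1:j][::-1] + p[j:]   # p <- p * r(i, j)
                print(p)
            if p == list(range(1, n + 1)):                 # identity permutation
                return

    simple_reversal_sort([6, 1, 2, 3, 4, 5])
    # prints five intermediate permutations, although this input can be sorted with two reversals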
Analyzing SimpleReversalSort
SimpleReversalSort is a greedy algorithm and does not guarantee the smallest number of reversals
For example, let p = ...
SimpleReversalSort(p) takes five steps: Step 1: ... Step 2: ... Step 3: ... Step 4: ... Step 5: ...
Analyzing SimpleReversalSort (cont’d)
But it can be done in two steps:
p = ...
Step 1: ...
Step 2: ...
So SimpleReversalSort(p) is a terrible algorithm: it is not optimal
Pancake Flipping Problem
Problem description: the chef is sloppy and prepares an unordered stack of pancakes of different sizes
The waiter wants to rearrange them so that the smallest winds up on top, and so on, down to the largest at the bottom
He does this by flipping over several pancakes from the top, repeating as many times as necessary
Christos Papadimitriou and Bill Gates Flip Pancakes
Pancake Flipping Problem: Formulation
Goal: Given n pancakes, what is the minimum number of flips required to rearrange them?
Input: An unordered stack of n pancakes of distinct sizes 1 < 2 < 3 < ... < n
Output: An ordered stack of pancakes, smallest on top, largest at the bottom
Pancake Flipping Problem: Greedy Algorithm
Greedy approach: at most 2 prefix reversals are needed to place each pancake in its correct position, so at most 2n - 2 flips in total
William Gates and Christos Papadimitriou showed in the mid-1970s that this problem can always be solved with at most 5/3 (n + 1) prefix reversals
The problem (the exact worst-case number of flips) is still open
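The 2-flips-per-pancake bound is easy to realize in code. The sketch below is my own simple greedy (not the Gates-Papadimitriou procedure): for the largest pancake that is out of place, one prefix flip brings it to the top and a second flips it down into its final position.

    def flip(stack, k):
        # prefix reversal: flip the top k pancakes (stack[0] is the top)
        return stack[:k][::-1] + stack[k:]

    def pancake_sort(stack):
        stack, flips = list(stack), 0
        for size in range(len(stack), 1, -1):   # place pancakes n, n-1, ..., 2
            pos = stack.index(size)
            if pos == size - 1:
                continue                        # already in its final position
            if pos != 0:
                stack = flip(stack, pos + 1)    # bring it to the top
                flips += 1
            stack = flip(stack, size)           # flip it down into place
            flips += 1
        return stack, flips                     # uses at most 2n - 2 flips

    print(pancake_sort([3, 1, 4, 2]))           # ([1, 2, 3, 4], 4)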
Approximation Algorithms
Optimal algorithms are unknown for many problems
Approximation algorithms find approximate solutions rather than optimal solutions
The approximation ratio of an algorithm A on input p is A(p) / OPT(p), where:
A(p) is the value of the solution produced by algorithm A on input p
OPT(p) is the value of an optimal solution for p
Approximation Ratio/Performance Guarantee
Approximation ratio (performance guarantee) of algorithm A: the worst approximation ratio over all inputs of size n
For a minimization problem:
approximation ratio = max over all p with |p| = n of A(p) / OPT(p)
Key to proving approximation ratios: a good lower bound on OPT
Breakpoints p = p1p2p3…pn-1pn
A pair of neighboring elements pi and pi+1 is adjacent if they differ by 1, i.e., |pi+1 - pi| = 1
For example, (3, 4), (7, 8), and (6, 5) are adjacent pairs
Breakpoints: An Example
There is a breakpoint between any pair of neighboring non-adjacent elements:
p = 1 9 3 4 7 8 2 5 6
The pairs (1,9), (9,3), (4,7), (8,2) and (2,5) form the breakpoints of permutation p
Extending Permutations
We put two elements p0 and pn+1 at the ends of p: p0 = 0 and pn+1 = n + 1
The goal is then to sort the elements between the end blocks into the identity permutation
Example: p = 1 9 3 4 7 8 2 5 6, extended with 0 and 10: p = 0 1 9 3 4 7 8 2 5 6 10
Note: a new breakpoint, (6, 10), was created by extending
Reversal Distance and Breakpoints
b(p) = number of breakpoints
Each reversal eliminates at most 2 breakpoints, which gives a lower bound on OPT: d(p) >= b(p) / 2
Example: p = ..., with b(p) = 5, then b(p) = 4, b(p) = 2, b(p) = 0 after successive reversals
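Counting breakpoints of the extended permutation takes only a few lines of Python (a sketch; the function name is mine). On the running example 1 9 3 4 7 8 2 5 6 it returns 6: the five internal breakpoints plus the (6, 10) breakpoint created by the extension.

    def breakpoints(p):
        # b(p): neighboring pairs in the extended permutation 0 p1 ... pn n+1 that differ by more than 1
        ext = [0] + list(p) + [len(p) + 1]
        return sum(1 for x, y in zip(ext, ext[1:]) if abs(x - y) != 1)

    print(breakpoints([1, 9, 3, 4, 7, 8, 2, 5, 6]))   # 6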
Sorting By Reversals: A Better Greedy Algorithm
BreakPointReversalSort(p)
1  while b(p) > 0
2      Among all possible reversals, choose the reversal r minimizing b(p * r)
3      p <- p * r(i, j)
4      output p
5  return
Strips
Strip: an interval between two consecutive breakpoints in p
Decreasing strips: strips that are in decreasing order (e.g. 6 5 and 3 2)
Increasing strips: strips that are in increasing order (e.g. 7 8)
A single-element strip is both increasing and decreasing
Strips: An Example
For the permutation 1 9 3 4 7 8 2 5 6, extended to 0 1 9 3 4 7 8 2 5 6 10, there are 7 strips:
0 1 | 9 | 3 4 | 7 8 | 2 | 5 6 | 10
Things To Consider Fact 1:
If permutation p contains at least one decreasing strip, then there exists a reversal r which decreases the number of breakpoints (i.e. b(p * r) < b(p) )
Things To Consider (cont’d)
For p = 1 9 3 4 7 8 2 5 6, b(p) = 5
Choose the decreasing strip with the smallest element k in p (k = 2 in this case)
Find k - 1 in the permutation and reverse the segment between k and k - 1: now b(p) = 4
Things To Consider (cont’d)
Fact 2: If there is no decreasing strip, the chosen reversal need not reduce the number of breakpoints (we may have b(p * r) = b(p))
In that case, reverse an increasing strip: the number of breakpoints stays unchanged, but a decreasing strip is created, so a breakpoint can be removed in the next step
Things To Consider (cont’d)
There are no decreasing strips in p, for example:
p = ..., with b(p) = 3
p * r(3,4) = ..., with b(p) = 3
r(3,4) does not change the number of breakpoints
r(3,4) creates a decreasing strip, guaranteeing that the next step will decrease the number of breakpoints
ImprovedBreakpointReversalSorting
ImprovedBreakpointReversalSort(p)
1  while b(p) > 0
2      if p has a decreasing strip
3          Among all possible reversals, choose the reversal r that minimizes b(p * r)
4      else
5          Choose a reversal r that flips an increasing strip in p
6      p <- p * r
7      output p
8  return
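A runnable sketch of this algorithm in Python (all names are mine). Two simplifications: the reversal minimizing b(p * r) is found by brute force over all reversals, and instead of testing for a decreasing strip the code checks directly whether the best reversal removes a breakpoint; if it does not, an internal increasing strip is flipped, as in line 5 above.

    def b(ext):
        # number of breakpoints of an extended permutation 0 p1 ... pn n+1
        return sum(1 for x, y in zip(ext, ext[1:]) if abs(x - y) != 1)

    def strips(ext):
        # maximal breakpoint-free runs, as (start, end) index pairs into ext
        runs, start = [], 0
        for k in range(len(ext) - 1):
            if abs(ext[k] - ext[k + 1]) != 1:
                runs.append((start, k))
                start = k + 1
        return runs + [(start, len(ext) - 1)]

    def rev(ext, i, j):
        return ext[:i] + ext[i:j + 1][::-1] + ext[j + 1:]

    def improved_breakpoint_reversal_sort(p):
        n = len(p)
        ext = [0] + list(p) + [n + 1]
        steps = []
        while b(ext) > 0:
            # among all reversals of positions 1..n, pick one minimizing the breakpoint count
            i, j = min(((i, j) for i in range(1, n + 1) for j in range(i + 1, n + 1)),
                       key=lambda ij: b(rev(ext, *ij)))
            if b(rev(ext, i, j)) == b(ext):
                # no reversal removes a breakpoint: flip an internal increasing strip instead;
                # b stays the same, but a decreasing strip is created for the next iteration
                i, j = next((s, e) for s, e in strips(ext) if 0 < s and e < n + 1 and s < e)
            ext = rev(ext, i, j)
            steps.append((i, j))
        return steps

    print(improved_breakpoint_reversal_sort([3, 4, 1, 2]))   # [(1, 3), (2, 4)]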
ImprovedBreakpointReversalSorting: Performance Guarantee
ImprovedBreakpointReversalSort is an approximation algorithm with a performance guarantee of at most 4:
It eliminates at least one breakpoint in every two steps, so it outputs at most 2b(p) reversals
An optimal algorithm eliminates at most 2 breakpoints in every step, so d(p) >= b(p) / 2
Performance guarantee: 2b(p) / d(p) <= 2b(p) / (b(p) / 2) = 4
Signed Permutations? p = 1 -2 3 -4
Up to this point, all the permutations to be sorted were unsigned
But genes have directions, so we should really consider signed permutations, e.g. p = 1 -2 3 -4
Outline
Today's topic: greedy algorithms
Genome rearrangements
Sorting by reversals
Breakpoints
Shotgun sequencing
Fragment assembly: the shortest superstring problem
Finishing problem
Next time: greedy algorithms for motif finding
DNA Sequencing
Chain termination can be used to sequence DNA strings a few hundred nucleotides long:
Start with a single-strand template
Complementary strand synthesis is terminated with small probability
The lengths of the resulting fragments are read by gel electrophoresis (one lane per base A, C, T, G), giving nested fragments such as ATACGGA, ATACGG, ATACG, ATAC, ATA, AT, A
How can longer DNA strings be sequenced?
Shotgun Method
Shortest Superstring Problem
Given: a set of strings s1, s2, ..., sn (we may assume that no si is a substring of another sj)
Find: the shortest string s containing each si as a substring
A simplified model for fragment assembly:
it ignores experimental errors
it may collapse repeated DNA sequences
Shortest Superstring Example
Given: s1 = ATAT, s2 = TATT, s3 = TTAT, s4 = TATA, s5 = TAAT, s6 = AATA
A superstring of s1, ..., s6: S = TTATTTAATATA (each si occurs in S as a substring)
Greedy Merging Algorithm
S = {s1, s2, ..., sn}
While |S| > 1 do
    Find s, t in S (s != t) with the longest overlap
    S = (S \ {s, t}) U {s overlapped with t to maximum extent}
Output the final string

The approximation factor is no better than 2:
s1 = a b^k, s2 = b^k c, s3 = b^(k+1)
Greedy output: a b^k c b^(k+1), of length 2k + 3
Optimum: a b^(k+1) c, of length k + 3
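A runnable sketch of the greedy merging loop (function names and the tie-breaking rule are mine). With k = 3 and this tie-breaking it reproduces the bad case above: greedy returns a string of length 2k + 3 = 9, while the optimal superstring has length k + 3 = 6.

    def overlap_len(s, t):
        # length of the longest suffix of s that is also a prefix of t
        for k in range(min(len(s), len(t)), 0, -1):
            if s.endswith(t[:k]):
                return k
        return 0

    def greedy_superstring(strings):
        # repeatedly merge the pair of strings with the longest overlap until one remains
        S = list(strings)
        while len(S) > 1:
            i, j = max(((a, c) for a in range(len(S)) for c in range(len(S)) if a != c),
                       key=lambda ac: overlap_len(S[ac[0]], S[ac[1]]))
            k = overlap_len(S[i], S[j])
            merged = S[i] + S[j][k:]             # S[i] overlapped with S[j] to maximum extent
            S = [S[x] for x in range(len(S)) if x not in (i, j)] + [merged]
        return S[0]

    k = 3
    print(greedy_superstring(["a" + "b" * k, "b" * k + "c", "b" * (k + 1)]))   # bbbbabbbc, length 9
    print("a" + "b" * (k + 1) + "c")                                           # abbbbc, the optimum, length 6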
Overlap & Prefix of 2 strings
overlap(s, t): the longest suffix of s that is a prefix of t
prefix(s, t): s after removing overlap(s, t)
s = a1 a2 a3 ... a|s|-k+1 ... a|s|
t =               b1 ... bk ... b|t|
where overlap(s, t) = a|s|-k+1 ... a|s| = b1 ... bk (of length k) and prefix(s, t) = a1 ... a|s|-k
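In code, the two notions look like this (a minimal sketch; the function names are mine), using the first two fragments from the earlier example:

    def overlap(s, t):
        # longest suffix of s that is also a prefix of t
        for k in range(min(len(s), len(t)), 0, -1):
            if s.endswith(t[:k]):
                return s[-k:]
        return ""

    def prefix(s, t):
        # s with overlap(s, t) removed from its end
        return s[:len(s) - len(overlap(s, t))]

    print(overlap("ATAT", "TATT"))   # TAT
    print(prefix("ATAT", "TATT"))    # A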
Prefix Graph (not all arcs shown)
Figure: the six fragments ATAT, TATT, TTAT, TATA, TAAT, AATA are the nodes; the arc from s to t is weighted by |prefix(s, t)|
Lower Bound on OPT
For the ordering s1, s2, ..., sn of the strings in an optimal superstring:
OPT = |prefix(s1,s2)| + ... + |prefix(sn-1,sn)| + |prefix(sn,s1)| + |overlap(sn,s1)|
    >= cost of the tour 1 -> 2 -> ... -> n -> 1 in the prefix graph
Modified Greedy Merging Algorithm
S = {s1, s2, ..., sn}; T = {}
While |S| > 0 do
    Find s, t in S with the longest overlap, allowing s = t
    If s != t, S = (S \ {s, t}) U {s overlapped with t to maximum extent}
    Else, S = S \ {s}; T = T U {s}
Output the concatenation of the strings in T

The modified greedy algorithm constructs a cycle cover of the prefix graph
Approximation Guarantee
Theorem: Modified greedy merging gives a factor 4 approximation
Length of output = wt(cycle cover) + sum of the lengths of the "representative" strings ri
wt(cycle cover) <= OPT
sum |ri| <= OPT + sum |overlap(ri, ri+1)| <= OPT + 2 wt(C)
Hence length of output <= 4 x OPT
Sequence Coverage Requirements
Target DNA is covered by unit-length sequenced fragments
~1% experimental error in sequencing short fragments by chain termination
The target sequencing error for the human genome project is 0.01% (1 error per 10,000 bases)
Each nucleotide position must be covered by 3 or more fragments (use a majority rule in case of conflicts)
At least 1 fragment from each of the two complementary strands
Meeting Coverage Requirements
Increase the number of shotgun fragments (fragments are distributed along the target DNA uniformly at random)
Use DNA "walking" to correct deficiencies:
A walk reads the sequence of a fragment of fixed length (~500b)
The positions of walked fragments along the target DNA can be specified
The cost of a walk (Cw) is much higher than the cost of sequencing a random fragment (Cs)
Issues:
For given Cs and Cw, what is the optimal balance between the number of shotgun fragments and the number of walks?
For a given set of shotgun fragments, find the minimum number of walks needed to meet the coverage requirements
"Finishing" Problem
For a given set of shotgun fragments, find a minimum-cost set of walks such that the coverage requirements are satisfied
Greedy Algorithm
Find the breakpoint = the leftmost point with a coverage deficit
Add a walk with its left end at the breakpoint
Update the coverage profile and repeat
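A toy sketch of this greedy rule in Python. Everything here is a simplifying assumption of mine (integer coordinates, fragments given as explicit intervals, a fixed walk length, and the 3x coverage requirement from earlier); it only illustrates the leftmost-deficit rule described above.

    def greedy_finishing(target_len, fragments, walk_len, required=3):
        # fragments: list of (start, end) intervals already sequenced (0-based, end exclusive)
        coverage = [0] * target_len
        for start, end in fragments:
            for x in range(start, min(end, target_len)):
                coverage[x] += 1
        walks = []
        pos = 0
        while pos < target_len:
            if coverage[pos] < required:
                walks.append(pos)                # start a walk at the leftmost deficit (the "breakpoint")
                for x in range(pos, min(pos + walk_len, target_len)):
                    coverage[x] += 1
                # stay at pos: it may still be deficient and need another walk
            else:
                pos += 1
        return walks

    # toy example: a 20-base target, four shotgun fragments, walks of length 5
    print(greedy_finishing(20, [(0, 8), (3, 12), (6, 15), (10, 20)], walk_len=5))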
Optimality of the Greedy Algorithm
Theorem: The greedy algorithm finds the minimum number of fragments that must be walked in order to meet the coverage requirements (ignoring the strand coverage requirement)
Proof idea: every optimal solution can be converted to the greedy one by inductively shifting its intervals to their rightmost feasible positions