Outline Today’s topic: greedy algorithms

Presentation transcript:

Outline
Today’s topic: greedy algorithms. Genome rearrangements; sorting by reversals; breakpoints; shotgun sequencing; fragment assembly (the shortest superstring problem); the finishing problem.
Next time: greedy algorithms for motif finding.

Turnip vs Cabbage: Almost Identical mtDNA Gene Sequences
In the 1980s, Jeffrey Palmer studied evolutionary change in plant organelles by comparing the mitochondrial genomes of cabbage and turnip. The genes were 99% similar in sequence, yet these surprisingly similar gene sequences differed in gene order. This helped pave the way to analyzing genome rearrangements in molecular evolution.

Turnip vs Cabbage: Different mtDNA Gene Order Gene order comparison:

Transforming Cabbage into Turnip

Genome rearrangements
Figure: mouse X chromosome and human X chromosome, both derived from an unknown ancestor ~75 million years ago.
What are the similarity blocks, and how do we find them? What is the architecture of the ancestral genome? What is the evolutionary scenario for transforming one genome into the other?

Mouse vs Human Genome
Humans and mice have similar genomes, but their genes are ordered differently (~245 rearrangements). Types of rearrangement: reversal, fusion, fission, translocation. Reversal: flipping a block of genes within a genomic sequence.

Reversals
We assume that the genes in genomic segment p do not have direction.
$p = p_1 \dots p_{i-1}\; p_i\; p_{i+1} \dots p_{j-1}\; p_j\; p_{j+1} \dots p_n$
$p' = p_1 \dots p_{i-1}\; p_j\; p_{j-1} \dots p_{i+1}\; p_i\; p_{j+1} \dots p_n$
r(i, j) is defined as the operation that reverses the gene sequence of p (the original genomic segment) between $p_i$ and $p_j$, so that p' = p * r(i, j).

Reversals: Example
p = 1 2 3 4 5 6 7 8
Applying r(3,5) reverses the block 3 4 5:
p * r(3,5) = 1 2 5 4 3 6 7 8
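
The reversal r(i, j) can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides; the function name `reversal` and the choice of 1-based inclusive positions (to match the slide notation) are assumptions.

```python
def reversal(p, i, j):
    """Return a copy of p with the block p_i .. p_j reversed (1-based, inclusive)."""
    if not 1 <= i <= j <= len(p):
        raise ValueError("need 1 <= i <= j <= len(p)")
    # Elements before position i and after position j are untouched.
    return p[:i - 1] + p[i - 1:j][::-1] + p[j:]


print(reversal([1, 2, 3, 4, 5, 6, 7, 8], 3, 5))  # [1, 2, 5, 4, 3, 6, 7, 8]
```

Applying the same reversal twice restores the original permutation, which is why a sorting scenario can be read in either direction.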

Reversals: Example
5’ ATGCCTGTACTA 3’
3’ TACGGACATGAT 5’
Break and invert:
5’ ATGTACAGGCTA 3’
3’ TACATGTCCGAT 5’
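
On double-stranded DNA, inverting a block replaces it on the top strand by its reverse complement. The sketch below reproduces the example under that reading; the helper names `invert_segment` and `bottom_strand` and the 0-based, end-exclusive indices are assumptions made for illustration.

```python
# Translation table mapping each base to its Watson-Crick complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def invert_segment(top, start, end):
    """Invert top[start:end] (0-based, end exclusive) on a double-stranded molecule.

    Returns the new top strand, read 5' -> 3'.
    """
    flipped = top[start:end].translate(COMPLEMENT)[::-1]  # reverse complement
    return top[:start] + flipped + top[end:]


def bottom_strand(top):
    """Complementary strand written 3' -> 5', aligned under the top strand."""
    return top.translate(COMPLEMENT)


new_top = invert_segment("ATGCCTGTACTA", 3, 9)
print(new_top)                 # ATGTACAGGCTA
print(bottom_strand(new_top))  # TACATGTCCGAT
```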

Reversal Distance Problem
Goal: Given two permutations, find the shortest series of reversals that transforms one into the other.
Input: Permutations p and s.
Output: A series of reversals $r_1, \dots, r_t$ transforming p into s, such that t is minimum.
The minimum value of t is the reversal distance between p and s, denoted d(p, s).

Sorting By Reversals Problem
Goal: Given a permutation, find a shortest series of reversals that transforms it into the identity permutation (1 2 … n).
Input: Permutation p.
Output: A series of reversals $r_1, \dots, r_t$ transforming p into the identity permutation, such that t is minimum.

Sorting By Reversals: Example
The minimal value of t is denoted d(p); this value is the reversal distance of permutation p.
Example:
input: p = 3 4 2 1 5 6 7 10 9 8
output:
4 3 2 1 5 6 7 10 9 8
4 3 2 1 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10
So d(p) <= 3. Is this the smallest number of reversals?
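
For small permutations the question can be settled by exhaustive search. The following breadth-first search is a sketch for checking small cases only (its running time grows factorially with n); it is not one of the algorithms in the slides, and the name `reversal_distance` is an assumption.

```python
from collections import deque


def reversal_distance(p):
    """Exact reversal distance d(p) by breadth-first search over permutations."""
    start = tuple(p)
    target = tuple(sorted(p))
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    n = len(start)
    while queue:
        perm, dist = queue.popleft()
        # Try every reversal of a block of length >= 2.
        for i in range(n - 1):
            for j in range(i + 2, n + 1):
                nxt = perm[:i] + perm[i:j][::-1] + perm[j:]
                if nxt == target:
                    return dist + 1
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))


print(reversal_distance([3, 4, 2, 1, 5, 6, 7, 10, 9, 8]))  # 3
```

For this input the search reports 3, so the three-reversal scenario shown in the example cannot be shortened.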

Sorting By Reversals: A Greedy Algorithm
When sorting the permutation p = 1 2 3 6 4 5, the first three elements are already in order, so it does not make sense to break them apart. The length of this already sorted prefix is denoted prefix(p); here prefix(p) = 3. This suggests a greedy algorithm for sorting by reversals: increase prefix(p) at every step.

Greedy Algorithm: An Example
Doing so, p can be sorted in two reversals:
1 2 3 6 4 5
1 2 3 4 6 5
1 2 3 4 5 6
d(p) = 2
The number of steps needed to sort a permutation of length n is at most n – 1.

Greedy Algorithm: Pseudocode
SimpleReversalSort(p)
  for i ← 1 to n – 1
    j ← position of element i in p (i.e., p_j = i)
    if j != i
      p ← p * r(i, j)
      output p
    if p is the identity permutation
      return
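
The pseudocode translates almost line for line into Python. The version below is a sketch under the assumption that p is a permutation of 1..n given as a list; it returns the intermediate permutations instead of printing them, and the name `simple_reversal_sort` is chosen here for illustration.

```python
def simple_reversal_sort(p):
    """Greedy prefix-extending sort; returns the permutation after each reversal."""
    p = list(p)
    n = len(p)
    identity = list(range(1, n + 1))
    steps = []
    for i in range(1, n):            # i = 1 .. n-1, as in the pseudocode
        j = p.index(i) + 1           # 1-based position of element i
        if j != i:
            # r(i, j): reverse the block between positions i and j.
            p[i - 1:j] = p[i - 1:j][::-1]
            steps.append(list(p))
        if p == identity:
            break
    return steps


for step in simple_reversal_sort([1, 2, 3, 6, 4, 5]):
    print(step)
# [1, 2, 3, 4, 6, 5]
# [1, 2, 3, 4, 5, 6]
```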

Analyzing SimpleReversalSort
SimpleReversalSort is a greedy algorithm; it does not guarantee the smallest number of reversals. For example, for p = 6 1 2 3 4 5, SimpleReversalSort(p) takes five steps:
Step 1: 1 6 2 3 4 5
Step 2: 1 2 6 3 4 5
Step 3: 1 2 3 6 4 5
Step 4: 1 2 3 4 6 5
Step 5: 1 2 3 4 5 6

Analyzing SimpleReversalSort (cont’d)
But p = 6 1 2 3 4 5 can be sorted in two steps:
Step 1: 5 4 3 2 1 6
Step 2: 1 2 3 4 5 6
So SimpleReversalSort is a poor algorithm: it always sorts the permutation, but not with the minimum number of reversals.

Pancake Flipping Problem
Problem description: The chef is sloppy; he prepares an unordered stack of pancakes of different sizes. The waiter wants to rearrange them so that the smallest winds up on top, and so on, down to the largest at the bottom. He does it by flipping over several pancakes from the top, repeating this as many times as necessary.

Christos Papadimitriou and Bill Gates Flip Pancakes

Pancake Flipping Problem: Formulation
Goal: Given a stack of n pancakes, what is the minimum number of flips required to rearrange them?
Input: An unordered stack of n pancakes of distinct sizes 1 < 2 < 3 < … < n.
Output: An ordered stack of pancakes, smallest on top, largest on the bottom.

Pancake Flipping Problem: Greedy Algorithm
Greedy approach: at most 2 prefix reversals place a pancake in its right position, so at most 2n – 2 steps are needed in total. William Gates and Christos Papadimitriou showed in the mid-1970s that this problem can be solved with at most 5/3 (n + 1) prefix reversals. The exact worst-case number of flips is still an open problem.
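
The greedy approach can be sketched as follows: bring the largest unsorted pancake to the top with one flip, then flip it down into place with a second. This is an illustrative sketch, not code from the slides; the convention that index 0 is the top of the stack and the name `pancake_sort` are assumptions.

```python
def pancake_sort(stack):
    """Greedy pancake sort. stack[0] is the top pancake.

    Returns (flips, sorted_stack), where each flip k reverses the top k pancakes.
    """
    stack = list(stack)
    flips = []

    def flip(k):
        stack[:k] = stack[:k][::-1]
        flips.append(k)

    # Place the largest unsorted pancake, then the next largest, and so on.
    for size in range(len(stack), 1, -1):
        pos = stack.index(max(stack[:size]))
        if pos == size - 1:
            continue              # already in its final position
        if pos != 0:
            flip(pos + 1)         # bring it to the top
        flip(size)                # flip it down to position size - 1
    return flips, stack


flips, result = pancake_sort([3, 1, 4, 2])
print(flips, result)  # [3, 4, 2, 3] [1, 2, 3, 4]
```

Each of the n – 1 largest pancakes costs at most two flips, which gives the 2n – 2 bound quoted above.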

Approximation Algorithms
Optimal algorithms are unknown for many problems. Approximation algorithms find approximate solutions rather than optimal solutions. The approximation ratio of an algorithm A on input p is A(p) / OPT(p), where A(p) is the solution produced by algorithm A and OPT(p) is the optimal solution of the problem.

Approximation Ratio/Performance Guarantee
The approximation ratio (performance guarantee) of algorithm A is the worst approximation ratio over all inputs of size n. For a minimization problem:
$\text{approximation ratio} = \max_{|p| = n} A(p) / OPT(p)$
The key to proving approximation ratios is a good lower bound on OPT.

Breakpoints
$p = p_1 p_2 p_3 \dots p_{n-1} p_n$
A pair of consecutive elements $p_i$ and $p_{i+1}$ is adjacent if $p_{i+1} = p_i \pm 1$.
For example, in p = 1 9 3 4 7 8 2 6 5 the pairs (3, 4), (7, 8) and (6, 5) are adjacent.

Breakpoints: An Example
There is a breakpoint between any pair of consecutive elements that are not adjacent:
p = 1 9 3 4 7 8 2 6 5
The pairs (1,9), (9,3), (4,7), (8,2) and (2,6) form the breakpoints of permutation p.

Extending Permutations
We put two elements $p_0 = 0$ and $p_{n+1} = n + 1$ at the ends of p. The goal is then to sort the elements between these two end blocks into the identity permutation.
Example: p = 1 9 3 4 7 8 2 6 5
Extending with 0 and 10: p = 0 1 9 3 4 7 8 2 6 5 10
Note: a new breakpoint, (5,10), was created by extending.

Reversal Distance and Breakpoints
b(p) = number of breakpoints
Each reversal eliminates at most 2 breakpoints, so d(p) >= b(p) / 2 (a lower bound on OPT).
Example: p = 2 3 1 4 6 5
0 2 3 1 4 6 5 7   b(p) = 5
0 1 3 2 4 6 5 7   b(p) = 4
0 1 2 3 4 6 5 7   b(p) = 2
0 1 2 3 4 5 6 7   b(p) = 0
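
Counting breakpoints is a one-pass scan over the extended permutation. The sketch below assumes p is a permutation of 1..n given without the end blocks; the names `breakpoints` and `lower_bound` are illustrative.

```python
import math


def breakpoints(p):
    """Number of breakpoints b(p) of the permutation extended with 0 and n+1."""
    ext = [0] + list(p) + [len(p) + 1]
    # A consecutive pair is a breakpoint unless the two values differ by exactly 1.
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) != 1)


def lower_bound(p):
    """d(p) >= b(p) / 2, rounded up because d(p) is an integer."""
    return math.ceil(breakpoints(p) / 2)


for perm in ([2, 3, 1, 4, 6, 5], [1, 3, 2, 4, 6, 5], [1, 2, 3, 4, 6, 5], [1, 2, 3, 4, 5, 6]):
    print(perm, breakpoints(perm))
# [2, 3, 1, 4, 6, 5] 5
# [1, 3, 2, 4, 6, 5] 4
# [1, 2, 3, 4, 6, 5] 2
# [1, 2, 3, 4, 5, 6] 0
```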

Sorting By Reversals: A Better Greedy Algorithm
BreakPointReversalSort(p)
  while b(p) > 0
    Among all possible reversals, choose r minimizing b(p * r)
    p ← p * r
    output p
  return
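
A direct Python sketch of this rule is shown below. Two details are assumptions made here rather than part of the pseudocode: ties are broken by taking the first best reversal found, and a `max_steps` cap is added because nothing in the basic rule guarantees that some reversal always lowers b(p).

```python
def count_breakpoints(p):
    """b(p) for the permutation extended with 0 and n+1."""
    ext = [0] + list(p) + [len(p) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) != 1)


def breakpoint_reversal_sort(p, max_steps=1000):
    """Repeatedly apply a reversal that minimizes the number of breakpoints."""
    p = list(p)
    n = len(p)
    steps = []
    while count_breakpoints(p) > 0:
        if len(steps) >= max_steps:
            # Safety cap (an assumption of this sketch, not in the pseudocode).
            raise RuntimeError("step cap reached without sorting")
        # All reversals of blocks of length >= 2 (0-based slice p[i:j]).
        candidates = [p[:i] + p[i:j][::-1] + p[j:]
                      for i in range(n) for j in range(i + 2, n + 1)]
        p = min(candidates, key=count_breakpoints)
        steps.append(p)
    return steps


for step in breakpoint_reversal_sort([6, 1, 2, 3, 4, 5]):
    print(step)
# [5, 4, 3, 2, 1, 6]
# [1, 2, 3, 4, 5, 6]
```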

Strips
Strip: an interval between two consecutive breakpoints in p.
Decreasing strips: strips that are in decreasing order (e.g. 6 5 and 3 2).
Increasing strips: strips that are in increasing order (e.g. 7 8).
A single-element strip is both increasing and decreasing.

Strips: An Example
For the permutation 1 9 3 4 7 8 2 5 6 there are 7 strips in the extended permutation:
0 1 | 9 | 3 4 | 7 8 | 2 | 5 6 | 10
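
Strips can be computed by cutting the extended permutation at every breakpoint. The sketch below is illustrative; the function name `strips` is an assumption.

```python
def strips(p):
    """Split the extended permutation 0, p, n+1 into its strips."""
    ext = [0] + list(p) + [len(p) + 1]
    result = [[ext[0]]]
    for a, b in zip(ext, ext[1:]):
        if abs(a - b) == 1:
            result[-1].append(b)   # adjacent pair: same strip continues
        else:
            result.append([b])     # breakpoint: a new strip starts
    return result


print(strips([1, 9, 3, 4, 7, 8, 2, 5, 6]))
# [[0, 1], [9], [3, 4], [7, 8], [2], [5, 6], [10]]
```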

Things To Consider Fact 1: If permutation p contains at least one decreasing strip, then there exists a reversal r which decreases the number of breakpoints (i.e. b(p * r) < b(p) )

Things To Consider (cont’d)
For p = 1 4 6 5 7 8 3 2:
0 1 4 6 5 7 8 3 2 9   b(p) = 5
Choose the decreasing strip with the smallest element k in p (k = 2 in this case). Find k – 1 in the permutation and reverse the segment between k – 1 and k:
0 1 2 3 8 7 5 6 4 9   b(p) = 4

Things To Consider (cont’d)
Fact 2: If there is no decreasing strip, no reversal r can reduce the number of breakpoints (i.e. b(p * r) >= b(p) for every reversal r). Reversing an increasing strip leaves the number of breakpoints unchanged but creates a decreasing strip, so the number of breakpoints can be reduced in the next step.

Things To Consider (cont’d)
There are no decreasing strips in p for:
p = 0 1 2 5 6 7 3 4 8   b(p) = 3
Reversing the increasing strip 3 4 (positions 6 and 7) gives:
p * r(6,7) = 0 1 2 5 6 7 4 3 8   b(p * r) = 3
r(6,7) does not change the number of breakpoints, but it creates a decreasing strip, guaranteeing that the next step will decrease the number of breakpoints.

ImprovedBreakpointReversalSorting
ImprovedBreakpointReversalSort(p)
  while b(p) > 0
    if p has a decreasing strip
      Among all possible reversals, choose r that minimizes b(p * r)
    else
      Choose a reversal r that flips an increasing strip in p
    p ← p * r
    output p
  return
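
One possible Python rendering of this pseudocode follows. It is a sketch under stated assumptions: strips containing the end blocks 0 or n+1 are treated as increasing, any other single-element strip counts as decreasing, ties are broken by the first best reversal, and the increasing strip to flip is simply the leftmost one. All helper names are chosen here for illustration.

```python
def count_breakpoints(ext):
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) != 1)


def strip_ranges(ext):
    """Index ranges (start, end), inclusive, of the strips of an extended permutation."""
    ranges, start = [], 0
    for k in range(1, len(ext)):
        if abs(ext[k] - ext[k - 1]) != 1:
            ranges.append((start, k - 1))
            start = k
    ranges.append((start, len(ext) - 1))
    return ranges


def is_decreasing(ext, start, end):
    if start == 0 or end == len(ext) - 1:
        return False  # assumption: strips holding 0 or n+1 count as increasing
    return start == end or ext[start] > ext[start + 1]


def reverse(ext, i, j):
    return ext[:i] + ext[i:j + 1][::-1] + ext[j + 1:]


def improved_breakpoint_reversal_sort(p):
    n = len(p)
    ext = [0] + list(p) + [n + 1]
    steps = []
    while count_breakpoints(ext) > 0:
        ranges = strip_ranges(ext)
        if any(is_decreasing(ext, s, e) for s, e in ranges):
            # Some reversal lowers b(p); take one that lowers it the most.
            candidates = [reverse(ext, i, j)
                          for i in range(1, n + 1) for j in range(i + 1, n + 1)]
            ext = min(candidates, key=count_breakpoints)
        else:
            # Flip the leftmost increasing strip that avoids the end blocks.
            s, e = next((s, e) for s, e in ranges if s > 0 and e < n + 1)
            ext = reverse(ext, s, e)
        steps.append(ext[1:-1])
    return steps


steps = improved_breakpoint_reversal_sort([1, 2, 5, 6, 7, 3, 4])
print(len(steps), steps[-1])
```

Because a flip of an increasing strip is always followed by a step that removes a breakpoint, the number of reversals stays within 2b(p).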

ImprovedBreakpointReversalSorting: Performance Guarantee
ImprovedBreakPointReversalSort is an approximation algorithm with a performance guarantee of at most 4.
It eliminates at least one breakpoint in every two steps, so it outputs at most 2b(p) reversals.
An optimal algorithm eliminates at most 2 breakpoints in every step: d(p) >= b(p) / 2.
Performance guarantee: $\frac{2b(p)}{d(p)} \le \frac{2b(p)}{b(p)/2} = 4$

Signed Permutations?
Up to this point, all permutations to sort were unsigned. But genes have directions (5’ to 3’), so we should consider signed permutations, e.g. p = 1 -2 3 -4.

DNA Sequencing
Chain termination can be used to sequence DNA strings a few hundred nucleotides long. Start with a single-strand template. Complementary strand synthesis is terminated with small probability. The lengths of the resulting fragments are read by gel electrophoresis.
Figure: gel with lanes A, C, T, G and terminated fragments ATACGGA, ATACGG, ATACG, ATAC, ATA, AT, A.
How to sequence longer DNA strings?

Shotgun Method

Shortest Superstring Problem
Given: a set of strings s1, s2, …, sn (we may assume that no si is a substring of another sj).
Find: the shortest string s containing each si as a substring.
This is a simplified model for fragment assembly: it ignores experimental errors and may collapse repeated DNA sequences.

Shortest Superstring Example
Given: s1 = ATAT, s2 = TATT, s3 = TTAT, s4 = TATA, s5 = TAAT, s6 = AATA
Superstring of s1, …, s6: S = TTATTTAATATA
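
A quick way to check that S really contains every fragment is a substring test. This small sketch only verifies the superstring property; it does not show that S is the shortest possible. The function name `is_superstring` is an assumption.

```python
def is_superstring(candidate, strings):
    """True if every string occurs as a contiguous substring of candidate."""
    return all(s in candidate for s in strings)


fragments = ["ATAT", "TATT", "TTAT", "TATA", "TAAT", "AATA"]
S = "TTATTTAATATA"
print(is_superstring(S, fragments))  # True
print(len(S))                        # 12
```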

Greedy Merging Algorithm
S = {s1, s2, …, sn}
While |S| > 1 do
  Find s, t in S with the longest overlap
  S = ( S \ {s,t} ) ∪ { s overlapped with t to the maximum extent }
Output the final string
The approximation factor is no better than 2. Example: $s_1 = ab^k$, $s_2 = b^k c$, $s_3 = b^{k+1}$. If greedy merges s1 and s2 first, its output is $ab^k c\, b^{k+1}$ with length 2k + 3, while the optimum is $ab^{k+1}c$ with length k + 3.
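
The merging loop can be sketched in Python as below. The tie-breaking rule (first pair with maximum overlap in list order) is an assumption of this sketch, and it matters for the bad example: with this rule s1 and s2 are merged first.

```python
def overlap(s, t):
    """Length of the longest suffix of s that is a prefix of t."""
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Repeatedly merge the pair of distinct strings with the longest overlap."""
    pool = list(strings)
    while len(pool) > 1:
        k, a, b = max(
            ((overlap(s, t), a, b)
             for a, s in enumerate(pool)
             for b, t in enumerate(pool) if a != b),
            key=lambda item: item[0],
        )
        merged = pool[a] + pool[b][k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (a, b)] + [merged]
    return pool[0]


k = 4
bad = ["a" + "b" * k, "b" * k + "c", "b" * (k + 1)]
result = greedy_superstring(bad)
print(result, len(result))  # length 2k + 3 = 11, versus the optimum k + 3 = 7
```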

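Overlap & Prefix of 2 strings
Overlap of s and t: the longest suffix of s that is a prefix of t.
Prefix of s and t: s after removing overlap(s,t).
With k = |overlap(s,t)|:
$s = a_1 a_2 \dots a_{|s|-k}\; a_{|s|-k+1} \dots a_{|s|}$, where prefix(s,t) $= a_1 \dots a_{|s|-k}$ and overlap(s,t) $= a_{|s|-k+1} \dots a_{|s|}$
$t = b_1 \dots b_k\; b_{k+1} \dots b_{|t|}$, where $b_1 \dots b_k$ = overlap(s,t)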
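
These two definitions are easy to state as code. The sketch below returns the strings themselves rather than their lengths; the function names mirror the slide notation but are otherwise assumptions.

```python
def overlap(s, t):
    """Longest suffix of s that is also a prefix of t (may be empty)."""
    for k in range(min(len(s), len(t)), 0, -1):
        if s.endswith(t[:k]):
            return t[:k]
    return ""


def prefix(s, t):
    """s with overlap(s, t) removed from its end."""
    return s[:len(s) - len(overlap(s, t))]


print(overlap("ATAT", "TATT"), prefix("ATAT", "TATT"))  # TAT A
print(prefix("ATAT", "TATT") + "TATT")                  # ATATT
```

Concatenating prefix(s, t) with t gives s and t merged to the maximum extent, which is the operation used by the greedy merging algorithm.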

Prefix Graph (not all arcs shown)
Figure: directed graph on the strings ATAT, TATT, TTAT, TATA, TAAT, AATA; each arc (s, t) is labeled with its weight |prefix(s,t)|.

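Lower Bound on OPT
Number the strings s1, …, sn in the order in which they first occur in an optimal superstring. Then
$OPT = |prefix(s_1,s_2)| + \dots + |prefix(s_{n-1},s_n)| + |prefix(s_n,s_1)| + |overlap(s_n,s_1)| \ge$ cost of the tour $1 \to 2 \to \dots \to n$ in the prefix graph.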

Modified Greedy Merging Algorithm
S = {s1, s2, …, sn}; T = {}
While |S| > 0 do
  Find s, t in S with the longest overlap, including the case s = t
  If s ≠ t, S = ( S \ {s,t} ) ∪ { s overlapped with t to the maximum extent }
  Else, S = S \ {s}, T = T ∪ {s}
Output the concatenation of the strings in T
The modified greedy algorithm constructs a cycle cover of the prefix graph.
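
A Python sketch of the modified algorithm is given below. Two points are assumptions of this sketch: the self-overlap of a string with itself is taken to be the longest proper suffix that is also a prefix, and ties are resolved in favor of the first pair found.

```python
def overlap_len(s, t, proper=False):
    """Length of the longest suffix of s that is a prefix of t.

    With proper=True the whole of s is excluded (used for self-overlap).
    """
    limit = min(len(s), len(t))
    if proper:
        limit = min(limit, len(s) - 1)
    for k in range(limit, 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0


def modified_greedy(strings):
    pool = list(strings)   # S in the pseudocode
    closed = []            # T in the pseudocode
    while pool:
        best = None
        for a, s in enumerate(pool):
            for b, t in enumerate(pool):
                k = overlap_len(s, t, proper=(a == b))
                if best is None or k > best[0]:
                    best = (k, a, b)
        k, a, b = best
        if a != b:
            merged = pool[a] + pool[b][k:]
            pool = [s for idx, s in enumerate(pool) if idx not in (a, b)] + [merged]
        else:
            # The best overlap is a self-overlap: close this cycle.
            closed.append(pool.pop(a))
    return "".join(closed)


fragments = ["ATAT", "TATT", "TTAT", "TATA", "TAAT", "AATA"]
out = modified_greedy(fragments)
print(out, len(out))
```

Every input string ends up inside one of the closed strings, so the concatenation is always a superstring of the input.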

Approximation Guarantee
Theorem: The modified greedy merging algorithm gives a factor 4 approximation.
Length of output $= wt(C) + \sum_i |r_i|$, where C is the cycle cover and the $r_i$ are the “representative” strings.
$wt(C) \le OPT$
$\sum_i |r_i| \le OPT + \sum_i |overlap(r_i, r_{i+1})| \le OPT + 2\, wt(C)$
$\Rightarrow$ Length of output $\le 4 \times OPT$

Sequence Coverage Requirements
Figure: target DNA covered by unit-length sequenced fragments.
There are ~1% experimental errors in sequencing short fragments by chain termination. The target sequencing error for the human project is 0.01% (1 error per 10000 bases). Each nucleotide position must be covered by 3 or more fragments (use majority rule in case of conflicts), with at least 1 fragment from each of the two complementary strands.

Meeting Coverage Requirements
Increase the number of shotgun fragments: fragments are distributed along the target DNA uniformly at random.
Use DNA “walking” to correct deficiencies: a walk reads the sequence of a fragment of fixed length (~500b); the position of walked fragments along the target DNA can be specified; the cost of a walk (Cw) is much higher than the cost of random segment sequencing (Cs).
Issues:
For given Cs and Cw, what is the optimal balance between the number of shotgun fragments and the number of walks?
For a given set of shotgun fragments, find the minimum number of walks needed to meet coverage requirements.

“Finishing” Problem
For a given set of shotgun fragments, find a min-cost set of walks such that coverage requirements are satisfied.

Greedy Algorithm
Find breakpoint = leftmost point with coverage deficit. Add a walk with its left end at the breakpoint. Update the coverage profile and repeat.
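
The greedy finishing rule can be sketched on a discretized coverage profile. The model below is an assumption made for illustration: the target is a list of per-position coverage counts, every walk covers `walk_len` consecutive positions, strand requirements are ignored, and a walk that runs past the end of the target is simply truncated.

```python
def greedy_walks(coverage, required, walk_len):
    """Return the left ends of the walks chosen by the greedy rule."""
    cov = list(coverage)
    walks = []
    pos = 0
    while pos < len(cov):
        if cov[pos] >= required:
            pos += 1               # no deficit here, move right
            continue
        # Leftmost deficit found: start a walk exactly at this position.
        walks.append(pos)
        for q in range(pos, min(pos + walk_len, len(cov))):
            cov[q] += 1            # update the coverage profile
    return walks


print(greedy_walks([3, 3, 1, 2, 3, 3, 0, 3], required=3, walk_len=3))
# [2, 2, 6, 6, 6]
```

A position with a deficit of d needs d walks covering it, and starting each of them as far right as possible (at the deficit itself) covers the most still-unseen positions to the right.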

Optimality of the Greedy Algorithm
Theorem: The greedy algorithm finds the minimum number of fragments that must be walked in order to meet coverage requirements (ignoring strand coverage requirements).
Proof: Every optimum solution can be converted inductively into the greedy one by shifting intervals to their rightmost feasible positions.