CSE 5290: Algorithms for Bioinformatics Fall 2011


CSE 5290: Algorithms for Bioinformatics Fall 2011 Suprakash Datta datta@cse.yorku.ca Office: CSEB 3043 Phone: 416-736-2100 ext 77875 Course page: http://www.cse.yorku.ca/course/5290 11/8/2018 CSE 5290, Fall 2011

Last time Finding Regulatory Motifs in DNA sequences (exhaustive search variants) Next: Greedy algorithms The following slides are based on slides by the authors of our text. 11/8/2018 CSE 5290, Fall 2011

Turnip vs Cabbage: Look and Taste Different Although cabbages and turnips share a recent common ancestor, they look and taste different 11/8/2018 CSE 5290, Fall 2011

Turnip vs Cabbage - 2 11/8/2018 CSE 5290, Fall 2011

Turnip vs Cabbage: Almost Identical mtDNA gene sequences In the 1980s Jeffrey Palmer studied the evolution of plant organelles by comparing the mitochondrial genomes of cabbage and turnip. He found 99% similarity between genes. These nearly identical gene sequences differed in gene order. This study helped pave the way to analyzing genome rearrangements in molecular evolution.

Turnip vs Cabbage: Different mtDNA Gene Order Gene order comparison: 11/8/2018 CSE 5290, Fall 2011

Turnip vs Cabbage: Different mtDNA Gene Order Gene order comparison: Before After Evolution is manifested as the divergence in gene order 11/8/2018 CSE 5290, Fall 2011

Transforming Cabbage into Turnip 11/8/2018 CSE 5290, Fall 2011

Genome rearrangements Mouse (X chrom.) Unknown ancestor ~ 75 million years ago Human (X chrom.) What are the similarity blocks and how to find them? What is the architecture of the ancestral genome? What is the evolutionary scenario for transforming one genome into the other? 11/8/2018 CSE 5290, Fall 2011

History of Chromosome X Rat Consortium, Nature, 2004 11/8/2018 CSE 5290, Fall 2011

Reversals [Figure: blocks 1 to 10 in their original order] 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 Blocks represent conserved genes.

Reversals [Figure: the same blocks after the segment 4 to 8 has been reversed] 1, 2, 3, -8, -7, -6, -5, -4, 9, 10 Blocks represent conserved genes. In the course of evolution or in a clinical context, blocks 1, …, 10 could be misread as 1, 2, 3, -8, -7, -6, -5, -4, 9, 10.

Reversals and Breakpoints [Figure: the reversed blocks with the two cut points marked] 1, 2, 3, -8, -7, -6, -5, -4, 9, 10 The reversal introduced two breakpoints (disruptions in order).

Reversals: Example
5’ ATGCCTGTACTA 3’
3’ TACGGACATGAT 5’
Break and invert the segment CCTGTA:
5’ ATGTACAGGCTA 3’
3’ TACATGTCCGAT 5’
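
The inversion on this slide can be reproduced with a few lines of Python. This is only a sketch: the helper names `reverse_complement` and `invert_segment` are illustrative, and the segment is addressed with 0-based, end-exclusive indices, which is an assumption of this example rather than the slide's notation.

```python
# Base pairing used to complement a strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Read the opposite strand of seq in its own 5'->3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def invert_segment(top, start, end):
    """Invert top[start:end] of a double-stranded molecule (hypothetical helper).

    After the segment is cut out and flipped, the top strand reads the
    reverse complement of the original segment.
    """
    return top[:start] + reverse_complement(top[start:end]) + top[end:]


top = "ATGCCTGTACTA"
new_top = invert_segment(top, 3, 9)
# Bottom strand written 3'->5' is the base-wise complement of the top strand.
new_bottom = "".join(COMPLEMENT[base] for base in new_top)
print("5'", new_top, "3'")
print("3'", new_bottom, "5'")
```

On unsigned gene orders the same event is just a reversal of a block of numbers; on DNA the strand also changes, which is why signed permutations appear later in the lecture.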

Types of Rearrangements
Reversal: 1 2 3 4 5 6 → 1 2 -5 -4 -3 6
Translocation: chromosomes 1 2 3 and 4 5 6 → 1 2 6 and 4 5 3
Fusion: two chromosomes carrying blocks 1 2 3 4 5 6 between them → one chromosome 1 2 3 4 5 6
Fission: the reverse of fusion

Comparative Genomic Architectures: Mouse vs Human Genome Humans and mice have similar genomes, but their genes are ordered differently ~245 rearrangements Reversals Fusions Fissions Translocation 11/8/2018 CSE 5290, Fall 2011

Waardenburg’s Syndrome: Mouse Provides Insight into Human Genetic Disorder Waardenburg’s syndrome is characterized by pigmentary dysplasia. The gene implicated in the disease was linked to human chromosome 2, but it was not clear where exactly on chromosome 2 it is located.

Waardenburg’s syndrome and splotch mice A breed of mice (with splotch gene) had similar symptoms caused by the same type of gene as in humans Scientists succeeded in identifying location of gene responsible for disorder in mice Finding the gene in mice gives clues to where the same gene is located in humans 11/8/2018 CSE 5290, Fall 2011

Reversals: Example
p = 1 2 3 4 5 6 7 8
r(3,5): 1 2 5 4 3 6 7 8
r(5,6): 1 2 5 4 6 3 7 8

Reversals and Gene Orders Gene order is represented by a permutation p:
p = p_1 … p_{i-1} p_i p_{i+1} … p_{j-1} p_j p_{j+1} … p_n
Reversal r(i, j) reverses (flips) the elements from i to j in p:
p • r(i, j) = p_1 … p_{i-1} p_j p_{j-1} … p_{i+1} p_i p_{j+1} … p_n
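
As a minimal sketch of the definition above, the function below applies r(i, j) to a permutation stored as a Python list. The name `reversal` and the choice to return a new list are assumptions of this example; positions are 1-based and inclusive, as on the slides.

```python
def reversal(perm, i, j):
    """Return a copy of perm with positions i..j (1-based, inclusive) reversed."""
    if not 1 <= i <= j <= len(perm):
        raise ValueError("need 1 <= i <= j <= len(perm)")
    # Prefix before i, flipped block i..j, suffix after j.
    return perm[:i - 1] + perm[i - 1:j][::-1] + perm[j:]


p = [1, 2, 3, 4, 5, 6, 7, 8]
p = reversal(p, 3, 5)
print(p)
p = reversal(p, 5, 6)
print(p)
```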

Reversal Distance Problem Goal: Given two permutations p, s, find the shortest series of reversals that transforms p into s Input: Permutations p and s Output: A series of reversals r1, …, rt transforming p into s, such that t is minimum Notation: d(p, s) - the reversal distance between p and s, i.e. the smallest possible value of t, given p and s

Sorting By Reversals Problem Goal: Given a permutation, find a shortest series of reversals that transforms it into the identity permutation (1 2 … n ) Input: Permutation p Output: A series of reversals r1, … rt transforming p into the identity permutation such that t is minimum 11/8/2018 CSE 5290, Fall 2011

Sorting By Reversals: Example t = d(p) - reversal distance of p
Example: p = 3 4 2 1 5 6 7 10 9 8
4 3 2 1 5 6 7 10 9 8
4 3 2 1 5 6 7 8 9 10
1 2 3 4 5 6 7 8 9 10
So d(p) = 3

Sorting by reversals: 5 steps [Figure: a permutation sorted by a series of five reversals]

Sorting by reversals: 4 steps What is the reversal distance for this permutation? Can it be sorted in 3 steps? 11/8/2018 CSE 5290, Fall 2011

Pancake Flipping Problem The chef is sloppy; he prepares an unordered stack of pancakes of different sizes. The waiter wants to rearrange them (so that the smallest winds up on top, and so on, down to the largest at the bottom). He does it by flipping over several from the top, repeating this as many times as necessary. Christos Papadimitriou and Bill Gates flip pancakes.

Pancake Flipping Problem: Formulation Goal: Given a stack of n pancakes, what is the minimum number of flips to rearrange them into perfect stack? Input: Permutation p Output: A series of prefix reversals r1, … rt transforming p into the identity permutation such that t is minimum 11/8/2018 CSE 5290, Fall 2011

Pancake Flipping Problem: Greedy Algorithm Greedy approach: 2 prefix reversals at most to place a pancake in its right position, 2n – 2 steps total at most William Gates and Christos Papadimitriou showed in the mid-1970s that this problem can be solved by at most 5/3 (n + 1) prefix reversals 11/8/2018 CSE 5290, Fall 2011
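
The greedy idea on this slide can be sketched as follows, assuming the pancakes are labelled 1..n by size and `stack[0]` is the top of the stack. The names `flip` and `greedy_pancake_sort` are illustrative. Each pancake, largest first, is brought to the top with one prefix reversal and then dropped into place with a second one.

```python
def flip(stack, k):
    """Prefix reversal: flip the top k pancakes (stack[0] is the top)."""
    return stack[:k][::-1] + stack[k:]


def greedy_pancake_sort(stack):
    """Sort with at most two prefix reversals per pancake; returns (stack, flips)."""
    stack = list(stack)
    flips = []
    for size in range(len(stack), 1, -1):
        pos = stack.index(size)          # 0-based position of the pancake
        if pos == size - 1:
            continue                     # already in its final place
        if pos != 0:
            stack = flip(stack, pos + 1)  # bring it to the top
            flips.append(pos + 1)
        stack = flip(stack, size)         # flip it down into position
        flips.append(size)
    return stack, flips


sorted_stack, flips = greedy_pancake_sort([3, 1, 4, 2])
print(sorted_stack, flips)
```

This sketch only illustrates the 2n – 2 bound; it does not attempt the 5/3 (n + 1) construction.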

Sorting By Reversals: A Greedy Algorithm If sorting permutation p = 1 2 3 6 4 5, the first three elements are already in order so it does not make any sense to break them. The length of the already sorted prefix of p is denoted prefix(p) prefix(p) = 3 This results in an idea for a greedy algorithm: increase prefix(p) at every step 11/8/2018 CSE 5290, Fall 2011

Greedy Algorithm: An Example Doing so, p can be sorted:
1 2 3 6 4 5
1 2 3 4 6 5
1 2 3 4 5 6
The number of steps to sort a permutation of length n is at most (n – 1)

Greedy Algorithm: Pseudocode
SimpleReversalSort(p)
for i ← 1 to n – 1
    j ← position of element i in p (i.e., p_j = i)
    if j ≠ i
        p ← p • r(i, j)
        output p
    if p is the identity permutation
        return
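
A runnable sketch of the pseudocode above, under the assumption that the permutation is a Python list of the numbers 1..n. Instead of printing, the hypothetical function `simple_reversal_sort` returns the list of intermediate permutations so the number of reversals can be counted.

```python
def simple_reversal_sort(perm):
    """Greedy prefix-extension sort; returns the permutation after each reversal."""
    perm = list(perm)
    n = len(perm)
    identity = list(range(1, n + 1))
    steps = []
    for i in range(1, n):                 # i = 1 .. n-1
        j = perm.index(i) + 1             # 1-based position of element i
        if j != i:
            # Apply r(i, j): reverse positions i..j so element i lands at position i.
            perm[i - 1:j] = perm[i - 1:j][::-1]
            steps.append(list(perm))
        if perm == identity:
            break
    return steps


for step in simple_reversal_sort([6, 1, 2, 3, 4, 5]):
    print(step)
```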

Analyzing SimpleReversalSort SimpleReversalSort does not guarantee the smallest number of reversals and takes five steps on p = 6 1 2 3 4 5 : Step 1: 1 6 2 3 4 5 Step 2: 1 2 6 3 4 5 Step 3: 1 2 3 6 4 5 Step 4: 1 2 3 4 6 5 Step 5: 1 2 3 4 5 6 11/8/2018 CSE 5290, Fall 2011

Analyzing SimpleReversalSort But it can be sorted in two steps: p = 6 1 2 3 4 5 Step 1: 5 4 3 2 1 6 Step 2: 1 2 3 4 5 6 So, SimpleReversalSort(p) is not optimal Optimal algorithms are unknown for many problems; approximation algorithms are used 11/8/2018 CSE 5290, Fall 2011

Approximation Algorithms These algorithms find approximate solutions rather than optimal solutions The approximation ratio of an algorithm A on input p is: A(p) / OPT(p) where A(p) -solution produced by algorithm A OPT(p) - optimal solution of the problem 11/8/2018 CSE 5290, Fall 2011

Approximation Ratio/Performance Guarantee Approximation ratio (performance guarantee) of algorithm A: the worst approximation ratio over all inputs of size n.
For an algorithm A that minimizes the objective function (minimization algorithm): max_{|p| = n} A(p) / OPT(p)
For a maximization algorithm: min_{|p| = n} A(p) / OPT(p)

Adjacencies and Breakpoints p = p_1 p_2 p_3 … p_{n-1} p_n A pair of elements p_i and p_{i+1} are adjacent if p_{i+1} = p_i ± 1. For example, in p = 1 9 3 4 7 8 2 6 5 the pairs (3, 4), (7, 8) and (6, 5) are adjacent pairs.

Breakpoints: An Example There is a breakpoint between any two neighbouring elements that are non-consecutive: p = 1 9 3 4 7 8 2 6 5 Pairs (1,9), (9,3), (4,7), (8,2) and (2,6) form breakpoints of permutation p. b(p) - # breakpoints in permutation p

Adjacency & Breakpoints An adjacency - a pair of adjacent elements that are consecutive. A breakpoint - a pair of adjacent elements that are not consecutive.
π = 5 6 2 1 3 4
Extend π with π0 = 0 and π7 = 7: 0 5 6 2 1 3 4 7
Adjacencies: (5,6), (2,1), (3,4)
Breakpoints: (0,5), (6,2), (1,3), (4,7)

Extending Permutations We put two elements p_0 = 0 and p_{n+1} = n+1 at the ends of p. Example: p = 1 9 3 4 7 8 2 6 5 Extending with 0 and 10: p = 0 1 9 3 4 7 8 2 6 5 10 Note: A new breakpoint, (5, 10), was created after extending.
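
To make the definitions concrete, here is a small sketch that extends a permutation with 0 and n+1 and lists its breakpoints. The function name `breakpoints` is an assumption of this example.

```python
def breakpoints(perm):
    """List the breakpoints of perm after extending it with 0 and n+1."""
    ext = [0] + list(perm) + [len(perm) + 1]
    # A neighbouring pair is a breakpoint when the two values are not consecutive.
    return [(a, b) for a, b in zip(ext, ext[1:]) if abs(a - b) != 1]


p = [1, 9, 3, 4, 7, 8, 2, 6, 5]
print(breakpoints(p))
print(len(breakpoints(p)))
```

For the example permutation this reports the five breakpoints named on the earlier slide plus the new pair (5, 10) created by the extension.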

Reversal Distance and Breakpoints Each reversal eliminates at most 2 breakpoints.
p = 2 3 1 4 6 5
0 2 3 1 4 6 5 7 b(p) = 5
0 1 3 2 4 6 5 7 b(p) = 4
0 1 2 3 4 6 5 7 b(p) = 2
0 1 2 3 4 5 6 7 b(p) = 0
This implies: reversal distance ≥ #breakpoints / 2

Sorting By Reversals: A Better Greedy Algorithm
BreakPointReversalSort(p)
while b(p) > 0
    Among all possible reversals, choose reversal r minimizing b(p • r)
    p ← p • r
    output p
return
Q: Does this algorithm terminate?

Strips Strip: an interval between two consecutive breakpoints in a permutation. Decreasing strip: strip of elements in decreasing order (e.g. 6 5 and 3 2). Increasing strip: strip of elements in increasing order (e.g. 7 8). Example: 0 1 9 4 3 7 8 2 5 6 10 splits into the strips 0 1 | 9 | 4 3 | 7 8 | 2 | 5 6 | 10. A single-element strip can be declared either increasing or decreasing. We will choose to declare them as decreasing, with the exception of the strips containing 0 and n+1.

Reducing the Number of Breakpoints Theorem 1: If permutation p contains at least one decreasing strip, then there exists a reversal r which decreases the number of breakpoints (i.e. b(p • r) < b(p) ) 11/8/2018 CSE 5290, Fall 2011

Things To Consider (cont’d) For p = 1 4 6 5 7 8 3 2: 0 1 4 6 5 7 8 3 2 9, b(p) = 5. Choose the decreasing strip with the smallest element k in p (k = 2 in this case). Find k – 1 in the permutation. Reverse the segment between k – 1 and k:
0 1 4 6 5 7 8 3 2 9 b(p) = 5
0 1 2 3 8 7 5 6 4 9 b(p) = 4

Reducing the Number of Breakpoints Again If there is no decreasing strip, there may be no reversal r that reduces the number of breakpoints (i.e. b(p • r) ≥ b(p) for any reversal r). By reversing an increasing strip ( # of breakpoints stay unchanged ), we will create a decreasing strip at the next step. Then the number of breakpoints will be reduced in the next step (theorem 1). 11/8/2018 CSE 5290, Fall 2011

Things To Consider (cont’d) There are no decreasing strips in p for:
p = 0 1 2 5 6 7 3 4 8, b(p) = 3
p • r(6,7) = 0 1 2 5 6 7 4 3 8, b(p • r(6,7)) = 3
r(6,7) does not change the # of breakpoints. r(6,7) creates a decreasing strip, thus guaranteeing that the next step will decrease the # of breakpoints.

ImprovedBreakpointReversalSort
ImprovedBreakpointReversalSort(p)
while b(p) > 0
    if p has a decreasing strip
        Among all possible reversals, choose reversal r that minimizes b(p • r)
    else
        Choose a reversal r that flips an increasing strip in p
    p ← p • r
    output p
return
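
One possible reading of this pseudocode in Python is sketched below. Several details are assumptions of this example rather than part of the slides: the function names, the brute-force search over all reversals (cubic work per step, fine for small inputs), and the choice to flip the first interior increasing strip when no decreasing strip exists. Single-element interior strips are treated as decreasing, following the convention stated earlier.

```python
def num_breakpoints(ext):
    """Count neighbouring pairs of an extended permutation that are not consecutive."""
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) != 1)


def has_decreasing_strip(ext):
    """True if some strip is decreasing (interior single elements count as decreasing)."""
    start = 0
    for k in range(1, len(ext) + 1):
        if k == len(ext) or abs(ext[k] - ext[k - 1]) != 1:
            end = k - 1
            single_interior = start == end and 0 < start < len(ext) - 1
            if single_interior or ext[start] > ext[end]:
                return True
            start = k
    return False


def first_interior_strip(ext):
    """Inclusive index range of the first strip that contains neither 0 nor n+1."""
    start = 0
    for k in range(1, len(ext)):
        if abs(ext[k] - ext[k - 1]) != 1:
            if start > 0:
                return start, k - 1
            start = k
    raise ValueError("no interior strip")


def flipped(ext, i, j):
    """Copy of ext with indices i..j (inclusive) reversed."""
    return ext[:i] + ext[i:j + 1][::-1] + ext[j + 1:]


def improved_breakpoint_reversal_sort(perm):
    """Return the permutation after each reversal until it is sorted."""
    n = len(perm)
    ext = [0] + list(perm) + [n + 1]
    steps = []
    while num_breakpoints(ext) > 0:
        if has_decreasing_strip(ext):
            # Try every reversal of the interior and keep one with fewest breakpoints.
            candidates = (flipped(ext, i, j)
                          for i in range(1, n + 1) for j in range(i + 1, n + 1))
            ext = min(candidates, key=num_breakpoints)
        else:
            # All strips are increasing: flip one to create a decreasing strip.
            i, j = first_interior_strip(ext)
            ext = flipped(ext, i, j)
        steps.append(ext[1:-1])
    return steps


for step in improved_breakpoint_reversal_sort([1, 4, 6, 5, 7, 8, 3, 2]):
    print(step)
```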

ImprovedBreakpointReversalSort: Performance Guarantee ImprovedBreakpointReversalSort is an approximation algorithm with a performance guarantee of at most 4. It eliminates at least one breakpoint in every two steps, so it takes at most 2b(p) steps. Approximation ratio: 2b(p) / d(p). An optimal algorithm eliminates at most 2 breakpoints in every step: d(p) ≥ b(p) / 2. Performance guarantee: 2b(p) / d(p) ≤ 2b(p) / (b(p) / 2) = 4

Signed Permutations Up to this point, all permutations to sort were unsigned. But genes have directions (5’ to 3’), so we should consider signed permutations: p = 1 -2 -3 4 -5

Signed Permutations Algorithms are a little more involved. Possible project topic 11/8/2018 CSE 5290, Fall 2011

GRIMM Web Server Real genome architectures are represented by signed permutations Efficient algorithms to sort signed permutations have been developed GRIMM web server computes the reversal distances between signed permutations: 11/8/2018 CSE 5290, Fall 2011

GRIMM Web Server http://www-cse.ucsd.edu/groups/bioinformatics/GRIMM 11/8/2018 CSE 5290, Fall 2011

Next Dynamic programming, sequence alignment Some of the following slides are based on slides by the authors of our text. 11/8/2018 CSE 5290, Fall 2011

Dynamic programming (DP) Typically used for optimization problems Often results in efficient algorithms Not applicable to all problems Caveats: Need not yield poly-time algorithms No unique formulations for most problems May not rule out greedy algorithms 11/8/2018 CSE 5290, Fall 2011

Example Counting the number of shortest paths in a grid Counting the number of shortest paths in a grid with blocked intersections Finding paths in a weighted grid Sequence alignment 11/8/2018 CSE 5290, Fall 2011

Setting up DP in practice The optimal solution should be computable as a (recursive) function of the solution to sub-problems Solve sub-problems systematically and store solutions (to avoid duplication of work). 11/8/2018 CSE 5290, Fall 2011

Number of paths in a grid Problem: Travel from the top-left to the bottom right of a rectangular grid using only right and down moves Combinatorial approach DP approach: how can we decompose the problem into sub-problems ? 11/8/2018 CSE 5290, Fall 2011

Number of paths in a grid with blocked intersections Problem: Same as before but some grid points are blocked and cannot be used Combinatorial approach? DP approach: how can we decompose the problem into sub-problems ? 11/8/2018 CSE 5290, Fall 2011
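
A minimal sketch of the DP decomposition for both counting problems: the number of paths reaching an intersection is the number reaching the intersection above it plus the number reaching the one to its left. The function name `count_paths` and the representation of blocked intersections as a set of (row, column) pairs are assumptions of this example. With no blocked points the result agrees with the combinatorial count C(rows + cols, rows).

```python
from math import comb


def count_paths(rows, cols, blocked=frozenset()):
    """Count right/down paths from (0, 0) to (rows, cols) that avoid blocked points."""
    if (0, 0) in blocked:
        return 0
    ways = [[0] * (cols + 1) for _ in range(rows + 1)]
    ways[0][0] = 1
    for i in range(rows + 1):
        for j in range(cols + 1):
            if (i, j) in blocked or (i == 0 and j == 0):
                continue  # blocked points keep 0 paths; the source keeps 1
            from_top = ways[i - 1][j] if i > 0 else 0
            from_left = ways[i][j - 1] if j > 0 else 0
            ways[i][j] = from_top + from_left
    return ways[rows][cols]


print(count_paths(3, 3), comb(6, 3))
print(count_paths(2, 2, blocked={(1, 1)}))
```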

Manhattan Tourist Problem (MTP) Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the largest number of attractions (*) in the Manhattan grid [Figure: street grid with attractions marked *, the source at the top left and the sink at the bottom right]

Manhattan Tourist Problem: Formulation Goal: Find the longest path in a weighted grid. Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink” Output: A longest path in G from “source” to “sink” 11/8/2018 CSE 5290, Fall 2011

MTP: An Example [Figure: weighted grid with i and j coordinates, a source and a sink; the values 13, 19, 9, 15, 23 and 20 are marked on the grid]

MTP: Greedy Algorithm Is Not Optimal [Figure: weighted grid from source to sink; the values 22 and 18 are marked at the sink] Promising start, but leads to bad choices!

MTP: Simple Recursive Program
MT(n, m)
    if n = 0 and m = 0
        return 0
    x ← MT(n-1, m) + length of the edge from (n-1, m) to (n, m), or -∞ if n = 0
    y ← MT(n, m-1) + length of the edge from (n, m-1) to (n, m), or -∞ if m = 0
    return max{x, y}
What’s wrong with this approach?
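
One way to see the problem with the plain recursion is to count how often it is called. The sketch below is an illustration under assumptions: the edge weights are stored in two hypothetical tables, `down[i][j]` for the edge from (i, j) to (i+1, j) and `right[i][j]` for the edge from (i, j) to (i, j+1), and the border cases are handled by skipping the missing edge. The memoized variant stores each sub-problem once.

```python
from functools import lru_cache


def make_solvers(down, right):
    """Build a naive and a memoized MT(i, j) over the same edge weights."""
    calls = {"naive": 0, "memo": 0}

    def naive(i, j):
        calls["naive"] += 1
        if i == 0 and j == 0:
            return 0
        best = float("-inf")
        if i > 0:
            best = max(best, naive(i - 1, j) + down[i - 1][j])
        if j > 0:
            best = max(best, naive(i, j - 1) + right[i][j - 1])
        return best

    @lru_cache(maxsize=None)
    def memo(i, j):
        calls["memo"] += 1  # counts cache misses only
        if i == 0 and j == 0:
            return 0
        best = float("-inf")
        if i > 0:
            best = max(best, memo(i - 1, j) + down[i - 1][j])
        if j > 0:
            best = max(best, memo(i, j - 1) + right[i][j - 1])
        return best

    return naive, memo, calls


n = m = 8
down = [[1] * (m + 1) for _ in range(n)]     # n rows of vertical edges
right = [[1] * m for _ in range(n + 1)]      # n+1 rows of horizontal edges
naive, memo, calls = make_solvers(down, right)
print(naive(n, m), memo(n, m), calls)
```

The naive version revisits the same grid points along every path that reaches them, so its call count grows exponentially with the grid size, while the memoized version evaluates each of the (n+1)(m+1) points once.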

MTP: Dynamic Programming
Calculate the optimal path score for each vertex in the graph. Each vertex’s score is the maximum, over its predecessor vertices, of the predecessor’s score plus the weight of the edge in between.
[Figure: a 4 × 4 weighted grid (i = row, j = column) whose scores are filled in one anti-diagonal at a time]
S0,1 = 1, S1,0 = 5
S0,2 = 3, S1,1 = 4, S2,0 = 8
S0,3 = 8, S1,2 = 13, S2,1 = 9, S3,0 = 8
S1,3 = 8, S2,2 = 12, S3,1 = 9 (greedy alg. fails!)
S2,3 = 15, S3,2 = 9
S3,3 = 16 Done! (the final figure shows all back-traces)

MTP: Recurrence Computing the score for a point (i, j) by the recurrence relation:
s_{i,j} = max {
    s_{i-1,j} + weight of the edge between (i-1, j) and (i, j),
    s_{i,j-1} + weight of the edge between (i, j-1) and (i, j)
}
The running time is O(nm) for an n by m grid (n = # of rows, m = # of columns)
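
The recurrence can be filled in row by row. In this sketch the weights are made up for illustration, and the layout (`down[i][j]` is the edge from (i, j) to (i+1, j), `right[i][j]` the edge from (i, j) to (i, j+1)) and the name `manhattan_tourist` are assumptions of the example.

```python
def manhattan_tourist(down, right):
    """Return the table s of longest-path scores from (0, 0) to every grid point."""
    n = len(down)          # number of rows of vertical edges
    m = len(right[0])      # number of columns of horizontal edges
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: only reachable from above
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):          # first row: only reachable from the left
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s


# Hypothetical 2 x 2 grid of blocks (3 x 3 intersections).
down = [[1, 0, 2],
        [4, 6, 5]]
right = [[3, 2],
         [0, 7],
         [3, 3]]
table = manhattan_tourist(down, right)
for row in table:
    print(row)
```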

Manhattan Is Not A Perfect Grid [Figure: vertex B with incoming edges from A1, A2 and A3] What about diagonals? The score at point B is given by:
s_B = max of {
    s_A1 + weight of the edge (A1, B),
    s_A2 + weight of the edge (A2, B),
    s_A3 + weight of the edge (A3, B)
}

Manhattan Is Not A Perfect Grid (contd) The score for point x is given by the recurrence relation: s_x = max of (s_y + weight of edge (y, x)) where y ∈ Predecessors(x). Predecessors(x) is the set of vertices that have edges leading to x. The running time for a graph G(V, E) (V is the set of all vertices and E is the set of all edges) is O(|E|), since each edge is evaluated once.

Traveling in the Grid The only hitch is that one must decide on the order in which to visit the vertices. By the time the vertex x is analyzed, the values s_y for all its predecessors y should be computed, otherwise we are in trouble. We need to traverse the vertices in some order. Try to find such an order for a directed cycle?

DAG: Directed Acyclic Graph Since Manhattan is not a perfect regular grid, we represent it as a DAG DAG for Dressing in the morning problem 11/8/2018 CSE 5290, Fall 2011

Topological Ordering A numbering of vertices of the graph is called topological ordering of the DAG if every edge of the DAG connects a vertex with a smaller label to a vertex with a larger label In other words, if vertices are positioned on a line in an increasing order of labels then all edges go from left to right. 11/8/2018 CSE 5290, Fall 2011

Topological ordering 2 different topological orderings of the DAG 11/8/2018 CSE 5290, Fall 2011

Longest Path in DAG Problem Goal: Find a longest path between two vertices in a weighted DAG Input: A weighted DAG G with source and sink vertices Output: A longest path in G from source to sink 11/8/2018 CSE 5290, Fall 2011

Longest Path in DAG: Dynamic Programming Suppose vertex v has indegree 3 and predecessors {u1, u2, u3}. The longest path to v from the source is:
s_v = max of {
    s_u1 + weight of edge from u1 to v,
    s_u2 + weight of edge from u2 to v,
    s_u3 + weight of edge from u3 to v
}
In general: s_v = max_u (s_u + weight of edge from u to v), where u ranges over the predecessors of v
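
Putting topological ordering and the recurrence together gives the following sketch. The edge-list input format, the function name `longest_path` and the use of Kahn's algorithm for the ordering are choices made for this example; the slides do not prescribe a particular ordering method.

```python
from collections import defaultdict, deque


def longest_path(edges, source, sink):
    """edges: iterable of (u, v, weight) in a DAG. Returns (score, path) or None."""
    succ = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        succ[u].append((v, w))
        indegree[v] += 1
        nodes.update((u, v))

    # Kahn's algorithm: repeatedly remove vertices with no remaining predecessors.
    queue = deque(x for x in nodes if indegree[x] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v, _ in succ[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph contains a directed cycle")

    # Relax edges in topological order, so s_u is final before it is used.
    score = {x: float("-inf") for x in nodes}
    score[source] = 0
    pred = {}
    for u in order:
        if score[u] == float("-inf"):
            continue  # not reachable from the source
        for v, w in succ[u]:
            if score[u] + w > score[v]:
                score[v] = score[u] + w
                pred[v] = u
    if score.get(sink, float("-inf")) == float("-inf"):
        return None
    path = [sink]
    while path[-1] != source:
        path.append(pred[path[-1]])
    return score[sink], path[::-1]


edges = [("s", "a", 2), ("s", "b", 5), ("a", "t", 10), ("b", "t", 1), ("a", "b", 4)]
print(longest_path(edges, "s", "t"))
```

The cycle check also answers the question raised on the "Traveling in the Grid" slide: a graph with a directed cycle has no valid visiting order.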

Traversing the Manhattan Grid 3 different strategies: a) Column by column b) Row by row c) Along diagonals

Sequence alignment Fundamental problem Many different versions 11/8/2018 CSE 5290, Fall 2011

Alignment: 2 row representation
Given 2 DNA sequences v and w:
v : ATGTTAT (m = 7)
w : ATCGTAC (n = 7)
Alignment: a 2 × k matrix (k ≥ max(m, n))
letters of v: A T - G T T A T -
letters of w: A T C G T - A - C
5 matches, 2 insertions, 2 deletions

Aligning DNA Sequences V = ATCTGATG (n = 8), W = TGCATAC (m = 7) [Figure: an alignment of V and W with 4 matches; the remaining columns are marked as mismatches, insertions and deletions (indels)]

Aligning DNA Sequences - 2 Brute force is infeasible. The number of alignments of X[1..n], Y[1..m], n < m, is the binomial coefficient C(m+n, n). For m = n, this is about 2^{2n} / sqrt(πn).

Longest Common Subsequence (LCS) – Alignment without Mismatches Given two sequences v = v_1 v_2 … v_m and w = w_1 w_2 … w_n, the LCS of v and w is a sequence of positions in v: 1 ≤ i_1 < i_2 < … < i_t ≤ m and a sequence of positions in w: 1 ≤ j_1 < j_2 < … < j_t ≤ n such that the i_k-th letter of v equals the j_k-th letter of w for every k, and t is maximal

LCS: Example
i coords:      0 1 2 2 3 3 4 5 6 7 8
elements of v:   A T - C - T G A T C
elements of w:   - T G C A T - A - C
j coords:      0 0 1 2 3 4 5 5 6 6 7
Path: (0,0) (1,0) (2,1) (2,2) (3,3) (3,4) (4,5) (5,5) (6,6) (7,6) (8,7)
Matches (shown in red on the slide): positions in v: 2 < 3 < 4 < 6 < 8, positions in w: 1 < 3 < 5 < 6 < 7
Every common subsequence is a path in the 2-D grid

LCS Problem as Manhattan Tourist Problem [Figure: 8 × 7 grid for v = ATCTGATC and w = TGCATAC, with one axis numbered 1 to 8 and the other, labelled T G C A T A C, numbered 1 to 7]

Edit Graph for LCS Problem [Figure: the same grid with a diagonal edge added wherever the two letters match] Every path is a common subsequence. Every diagonal edge adds an extra element to the common subsequence. LCS Problem: Find a path with the maximum number of diagonal edges

Computing LCS Let v_i = prefix of v of length i: v_1 … v_i and w_j = prefix of w of length j: w_1 … w_j. The length of LCS(v_i, w_j) is computed by:
s_{i,j} = max {
    s_{i-1,j},
    s_{i,j-1},
    s_{i-1,j-1} + 1 if v_i = w_j
}

Computing LCS (cont’d) [Figure: vertex (i, j) with incoming edges from (i-1, j) and (i, j-1) of weight 0 and a diagonal edge from (i-1, j-1) of weight 1]
s_{i,j} = MAX {
    s_{i-1,j} + 0,
    s_{i,j-1} + 0,
    s_{i-1,j-1} + 1, if v_i = w_j
}
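
The recurrence translates almost line for line into a table-filling loop. This is a sketch; the function name `lcs_table` is illustrative, and row 0 and column 0 are initialised to zero for the empty prefixes.

```python
def lcs_table(v, w):
    """s[i][j] = length of an LCS of the prefixes v[:i] and w[:j]."""
    s = [[0] * (len(w) + 1) for _ in range(len(v) + 1)]
    for i in range(1, len(v) + 1):
        for j in range(1, len(w) + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1])          # indel edges, weight 0
            if v[i - 1] == w[j - 1]:
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)  # diagonal edge, weight 1
    return s


table = lcs_table("ATCTGATC", "TGCATAC")
print(table[-1][-1])
```

For the two sequences of the LCS example this reports length 5, matching the five positions listed there.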

Every Path in the Grid Corresponds to an Alignment [Figure: grid for V = ATGT and W = ATCG with one path highlighted]
    0 1 2 2 3 4
V = A T - G T
    | |   |
W = A T C G -
    0 1 2 3 4 4

Aligning Sequences without Insertions and Deletions: Hamming Distance Given two DNA sequences v and w:
v : ATATATAT
w : TATATATA
The Hamming distance dH(v, w) = 8 is large, but the sequences are very similar

Aligning Sequences with Insertions and Deletions By shifting one sequence over one position:
v : ATATATAT-
w : -TATATATA
The edit distance: d(v, w) = 2. Hamming distance neglects insertions and deletions in DNA

Edit Distance Levenshtein (1966) introduced edit distance between two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) to transform one string into the other. d(v, w) = MIN number of elementary operations to transform v → w

Edit Distance vs Hamming Distance
Hamming distance always compares the i-th letter of v with the i-th letter of w:
V = ATATATAT
W = TATATATA
Hamming distance: d(v, w) = 8. Computing Hamming distance is a trivial task.
Edit distance may compare the i-th letter of v with the j-th letter of w. Just one shift makes it all line up:
V = -ATATATAT
W = TATATATA-
Edit distance: d(v, w) = 2 (one insertion and one deletion). Computing edit distance is a non-trivial task. How to find what j goes with what i?
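
Both distances can be sketched in a few lines. The function names are illustrative, and the edit distance below is the standard unit-cost dynamic program (insertion, deletion and substitution each cost 1), kept to two rows of the table at a time.

```python
def hamming_distance(v, w):
    """Number of positions at which two equal-length strings differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(v, w))


def edit_distance(v, w):
    """Minimum number of insertions, deletions and substitutions turning v into w."""
    prev = list(range(len(w) + 1))        # distances from the empty prefix of v
    for i, a in enumerate(v, 1):
        cur = [i]                         # deleting the first i letters of v
        for j, b in enumerate(w, 1):
            cur.append(min(prev[j] + 1,               # delete a
                           cur[j - 1] + 1,            # insert b
                           prev[j - 1] + (a != b)))   # match or substitute
        prev = cur
    return prev[-1]


print(hamming_distance("ATATATAT", "TATATATA"))
print(edit_distance("ATATATAT", "TATATATA"))
print(edit_distance("TGCATAT", "ATCCGAT"))
```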

Edit Distance: Example TGCATAT → ATCCGAT in 5 steps:
TGCATAT → (delete last T)
TGCATA → (delete last A)
TGCAT → (insert A at front)
ATGCAT → (substitute C for the G in position 3)
ATCCAT → (insert G before last A)
ATCCGAT (Done)
What is the edit distance? 5?

Edit Distance: Example (cont’d) TGCATAT → ATCCGAT in 4 steps:
TGCATAT → (insert A at front)
ATGCATAT → (delete the T in position 6)
ATGCAAT → (substitute G for the A in position 5)
ATGCGAT → (substitute C for the G in position 3)
ATCCGAT (Done)
Can it be done in 3 steps?

The Alignment Grid Every alignment path is from source to sink 11/8/2018 CSE 5290, Fall 2011

Alignment as a Path in the Edit Graph [Figure: edit graph for v = ATGTTAT and w = ATCGTAC]
    0 1 2 2 3 4 5 6 7 7
v = A T _ G T T A T _
w = A T C G T _ A _ C
    0 1 2 3 4 5 5 6 6 7
Corresponding path: (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)

Alignments in Edit Graph (cont’d) Horizontal and vertical edges represent indels in v and w, with score 0. Diagonal edges represent matches, with score 1. The score of the alignment path is 5.

Alignment as a Path in the Edit Graph 1 2 3 4 5 6 7 G A T C w v Every path in the edit graph corresponds to an alignment: 11/8/2018 CSE 5290, Fall 2011

Alignment as a Path in the Edit Graph
Old Alignment
     0 1 2 2 3 4 5 6 7 7
v =    A T _ G T T A T _
w =    A T C G T _ A _ C
     0 1 2 3 4 5 5 6 6 7
New Alignment
     0 1 2 2 3 4 5 6 7 7
v =    A T _ G T T A T _
w =    A T C G _ T A _ C
     0 1 2 3 4 4 5 6 6 7

Alignment: Dynamic Programming
$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\ s_{i-1,j} \\ s_{i,j-1} \end{cases}$$

Dynamic Programming Example
Initialize the 1st row and 1st column to be all zeroes.
Or, to be more precise, initialize the 0th row and 0th column to be all zeroes.
[Figure: empty DP grid, rows indexed by v and columns indexed by w]

Dynamic Programming Example
$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + 1 & \text{value from NW} + 1, \text{ if } v_i = w_j \\ S_{i-1,j} & \text{value from North (top)} \\ S_{i,j-1} & \text{value from West (left)} \end{cases}$$
[Figure: DP grid with row 1 and column 1 filled in; every entry is 1]

Alignment: Backtracking
Arrows show where the score originated from:
    ↑ if from the top
    ← if from the left
    ↖ if v_i = w_j

Backtracking Example
Find a match in row and column 2.
i = 2, j = 2, 5 is a match (T).
j = 2, i = 4, 5, 7 is a match (T).
Since $v_i = w_j$, $s_{i,j} = s_{i-1,j-1} + 1$:
    $s_{2,2} = [s_{1,1} = 1] + 1$
    $s_{2,5} = [s_{1,4} = 1] + 1$
    $s_{4,2} = [s_{3,1} = 1] + 1$
    $s_{5,2} = [s_{4,1} = 1] + 1$
    $s_{7,2} = [s_{6,1} = 1] + 1$
[Figure: DP grid with rows 1 and 2 and columns 1 and 2 filled in; row 1 and column 1 are all 1, the remaining entries of row 2 and column 2 are all 2]

Backtracking Example
Continuing with the dynamic programming algorithm gives this result.

            A  T  C  G  T  A  C    (w)
         0  0  0  0  0  0  0  0
      A  0  1  1  1  1  1  1  1
      T  0  1  2  2  2  2  2  2
      G  0  1  2  2  3  3  3  3
      T  0  1  2  2  3  4  4  4
      T  0  1  2  2  3  4  4  4
      A  0  1  2  2  3  4  5  5
      T  0  1  2  2  3  4  5  5
     (v)

Alignment: Dynamic Programming
$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\ s_{i-1,j} + 0 \\ s_{i,j-1} + 0 \end{cases}$$
This recurrence corresponds to the Manhattan Tourist problem (three incoming edges into a vertex) with all horizontal and vertical edges weighted by zero.

LCS Algorithm
LCS(v, w)
    for i ← 0 to n
        s_{i,0} ← 0
    for j ← 1 to m
        s_{0,j} ← 0
    for i ← 1 to n
        for j ← 1 to m
            s_{i,j} ← max { s_{i-1,j},
                            s_{i,j-1},
                            s_{i-1,j-1} + 1, if v_i = w_j }
            b_{i,j} ← “↑” if s_{i,j} = s_{i-1,j}
                      “←” if s_{i,j} = s_{i,j-1}
                      “↖” if s_{i,j} = s_{i-1,j-1} + 1
    return (s_{n,m}, b)

Now What?
LCS(v,w) created the alignment grid.
Now we need a way to read the best alignment of v and w.
Follow the arrows backwards from the sink.

Printing LCS: Backtracking
PrintLCS(b, v, i, j)
    if i = 0 or j = 0
        return
    if b_{i,j} = “↖”
        PrintLCS(b, v, i-1, j-1)
        print v_i
    else
        if b_{i,j} = “↑”
            PrintLCS(b, v, i-1, j)
        else
            PrintLCS(b, v, i, j-1)
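The fill and backtracking steps can be put together in a short Python sketch. It is an illustration rather than the course's reference code: the names `lcs` and `read_lcs` are chosen here, the arrows are stored as the strings "up", "left" and "diag", and the backtracking is written as a loop instead of the recursive PrintLCS.

```python
def lcs(v, w):
    """Fill the score table s and the backtracking table b for the LCS recurrence."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]      # 0th row and column stay 0
    b = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1])
            if v[i - 1] == w[j - 1]:
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)
            # Record where the value came from: top first, then left, then diagonal.
            if s[i][j] == s[i - 1][j]:
                b[i][j] = "up"
            elif s[i][j] == s[i][j - 1]:
                b[i][j] = "left"
            else:
                b[i][j] = "diag"  # only reachable through a match
    return s, b


def read_lcs(b, v, i, j):
    """Follow the arrows back from (i, j) and collect the matched letters."""
    letters = []
    while i > 0 and j > 0:
        if b[i][j] == "diag":
            letters.append(v[i - 1])
            i, j = i - 1, j - 1
        elif b[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))


v, w = "ATGTTAT", "ATCGTAC"
s, b = lcs(v, w)
print(s[len(v)][len(w)], read_lcs(b, v, len(v), len(w)))
```

For v = ATGTTAT and w = ATCGTAC this reports a longest common subsequence of length 5, in line with the alignment score of 5 quoted earlier.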

LCS Runtime
It takes O(nm) time to fill in the n x m dynamic programming matrix.
Why O(nm)? The pseudocode consists of one "for" loop nested inside another "for" loop, and together they visit each of the n x m entries once.

Why does DP work?
Avoids re-computing the same sub-problems
Limits the amount of work done in each step

When is DP applicable?
Optimal substructure: the optimal solution to a problem (instance) contains optimal solutions to sub-problems
Overlapping sub-problems: a limited number of distinct sub-problems, repeated many, many times
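The effect of overlapping sub-problems can be made visible by counting how often the LCS recurrence is evaluated with and without a cache. This is a small sketch under assumptions: the helper names and the one-element `calls` list used as a counter are conventions invented for the illustration.

```python
from functools import lru_cache


def lcs_length_naive(v, w, calls):
    """Plain recursion on the LCS recurrence; calls[0] counts sub-problem visits."""
    def rec(i, j):
        calls[0] += 1
        if i == 0 or j == 0:
            return 0
        best = max(rec(i - 1, j), rec(i, j - 1))
        if v[i - 1] == w[j - 1]:
            best = max(best, rec(i - 1, j - 1) + 1)
        return best
    return rec(len(v), len(w))


def lcs_length_memo(v, w, calls):
    """Same recursion, but each distinct (i, j) is evaluated once and cached."""
    @lru_cache(maxsize=None)
    def rec(i, j):
        calls[0] += 1
        if i == 0 or j == 0:
            return 0
        best = max(rec(i - 1, j), rec(i, j - 1))
        if v[i - 1] == w[j - 1]:
            best = max(best, rec(i - 1, j - 1) + 1)
        return best
    return rec(len(v), len(w))


naive_calls, memo_calls = [0], [0]
print(lcs_length_naive("ATGTTAT", "ATCGTAC", naive_calls), naive_calls[0])
print(lcs_length_memo("ATGTTAT", "ATCGTAC", memo_calls), memo_calls[0])
```

Both versions return the same length, but the cached one evaluates at most (n+1)(m+1) sub-problems, which is the same bound as the bottom-up table.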

Next: More realistic sequence alignment algorithms
Types:
    Global Alignment
    Scoring Matrices
    Local Alignment
    Alignment with Affine Gap Penalties

From LCS to Alignment: Change up the Scoring
The Longest Common Subsequence (LCS) problem, the simplest form of sequence alignment, allows only insertions and deletions (no mismatches).
In the LCS Problem, we scored 1 for matches and 0 for indels.
Consider penalizing indels and mismatches with negative scores.
Simplest scoring schema:
    +1 : match premium
    -μ : mismatch penalty
    -σ : indel penalty

Simple Scoring
When mismatches are penalized by -μ, indels are penalized by -σ, and matches are rewarded with +1, the resulting score is:
    #matches - μ(#mismatches) - σ(#indels)
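The scoring formula can be applied directly to a given alignment. The sketch below is illustrative: `score_alignment` is a hypothetical helper, and it assumes the alignment is passed as two equal-length rows with "-" as the gap character.

```python
def score_alignment(v_row, w_row, mu, sigma, gap="-"):
    """Score = #matches - mu * #mismatches - sigma * #indels for a given alignment."""
    if len(v_row) != len(w_row):
        raise ValueError("alignment rows must have the same number of columns")
    matches = mismatches = indels = 0
    for a, b in zip(v_row, w_row):
        if a == gap or b == gap:
            indels += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches - mu * mismatches - sigma * indels


# 5 matches, 0 mismatches, 4 indels: 5 - 1*0 - 2*4 = -3
print(score_alignment("AT-GTTAT-", "ATCGT-A-C", mu=1, sigma=2))  # -3
```

With σ = 0 the same alignment scores 5, which is its LCS score.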

The Global Alignment Problem
Find the best alignment between two strings under a given scoring schema.
Input: Strings v and w and a scoring schema
Output: Alignment of maximum score

Edge weights: ↑ and → edges = -σ; diagonal edges = 1 if match, -μ if mismatch

$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\ s_{i-1,j-1} - \mu & \text{if } v_i \neq w_j \\ s_{i-1,j} - \sigma \\ s_{i,j-1} - \sigma \end{cases}$$

μ : mismatch penalty
σ : indel penalty
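A minimal Python sketch of this recurrence follows. The function name `global_alignment_score` is made up for the illustration, and the boundary values (the 0th row and column filled with multiples of -σ) are an assumption, since the slide does not state the initialization.

```python
def global_alignment_score(v, w, mu, sigma):
    """Best global alignment score with +1 match, -mu mismatch, -sigma indel."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # Assumed boundary: a prefix aligned against nothing costs one indel per letter.
    for i in range(1, n + 1):
        s[i][0] = -sigma * i
    for j in range(1, m + 1):
        s[0][j] = -sigma * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = 1 if v[i - 1] == w[j - 1] else -mu
            s[i][j] = max(s[i - 1][j - 1] + diagonal,
                          s[i - 1][j] - sigma,
                          s[i][j - 1] - sigma)
    return s[n][m]


print(global_alignment_score("AC", "AG", mu=1, sigma=2))            # 0
print(global_alignment_score("ATGTTAT", "ATCGTAC", mu=10, sigma=0))  # 5
```

The second call shows how the LCS problem is a special case: with free indels and a mismatch penalty that is never worth paying, the score equals the LCS length.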

Scoring Matrices
To generalize scoring, consider a (4+1) x (4+1) scoring matrix δ.
In the case of an amino acid sequence alignment, the scoring matrix would be of size (20+1) x (20+1). The addition of 1 is to include the score for comparison with a gap character "-".
This simplifies the algorithm as follows:

$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$
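The generalized recurrence can be sketched with δ stored as a dictionary keyed by letter pairs. Everything named here (`ALPHABET`, `make_delta`, `align_score`) is a hypothetical choice for illustration, and the table simply re-encodes the +1 / -μ / -σ schema rather than any biologically derived matrix.

```python
from itertools import product

ALPHABET = "ACGT-"  # four nucleotides plus the gap character


def make_delta(mu, sigma):
    """Build the (4+1) x (4+1) table: +1 match, -mu mismatch, -sigma against a gap."""
    delta = {}
    for a, b in product(ALPHABET, repeat=2):
        if a == "-" and b == "-":
            continue  # a gap is never aligned against a gap
        if a == "-" or b == "-":
            delta[a, b] = -sigma
        elif a == b:
            delta[a, b] = 1
        else:
            delta[a, b] = -mu
    return delta


def align_score(v, w, delta):
    """Global alignment score where every edge weight is looked up in delta."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + delta[v[i - 1], "-"]
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + delta["-", w[j - 1]]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j - 1] + delta[v[i - 1], w[j - 1]],
                          s[i - 1][j] + delta[v[i - 1], "-"],
                          s[i][j - 1] + delta["-", w[j - 1]])
    return s[n][m]


print(align_score("AC", "AG", make_delta(mu=1, sigma=2)))  # 0
```

Swapping in a different dictionary changes the scoring scheme without touching the recurrence, which is the simplification the slide refers to.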

Measuring Similarity
Measuring the extent of similarity between two sequences:
    Based on percent sequence identity
    Based on conservation

Percent Sequence Identity
The extent to which two nucleotide or amino acid sequences are invariant

A C C T G A G - A G
A C G T G - G C A G

Column 3 is a mismatch; columns 6 and 8 are indels.
7 of the 10 columns are identical: 70% identical
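The 70% figure can be reproduced with a few lines of Python. The helper `percent_identity` is a sketch that assumes identity is measured over all alignment columns, gaps included, which is the convention that matches the example; other conventions for the denominator exist.

```python
def percent_identity(v_row, w_row, gap="-"):
    """Percentage of alignment columns holding the same letter in both rows."""
    if len(v_row) != len(w_row):
        raise ValueError("alignment rows must have the same number of columns")
    identical = sum(a == b and a != gap for a, b in zip(v_row, w_row))
    return 100.0 * identical / len(v_row)


print(percent_identity("ACCTGAG-AG", "ACGTG-GCAG"))  # 70.0
```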

Making a Scoring Matrix
Scoring matrices are created based on biological evidence.
Alignments can be thought of as two sequences that differ due to mutations.
Some of these mutations have little effect on the protein's function, so some penalties, δ(v_i, w_j), will be less harsh than others.

Scoring Matrix: Example
[Figure: partial amino acid scoring matrix; the entries used below are δ(A,A) = 5, δ(A,K) = -1, δ(A,R) = -2, δ(N,N) = 7, δ(R,K) = 3]
Notice that although R and K are different amino acids, they have a positive score. Why? They are both positively charged amino acids, so exchanging one for the other will not greatly change the function of the protein.

AKRANR
KAAANK
-1 + (-1) + (-2) + 5 + 7 + 3 = 11
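The sum above can be checked with a small lookup table. The dictionary `DELTA` below holds only the five entries that the example uses, read off from the arithmetic on the slide, and the helper names are hypothetical; it is not a complete scoring matrix.

```python
# Only the entries needed for the example; the matrix is symmetric.
DELTA = {
    ("A", "A"): 5,
    ("A", "K"): -1,
    ("A", "R"): -2,
    ("N", "N"): 7,
    ("K", "R"): 3,
}


def delta(a, b):
    """Symmetric lookup: delta(a, b) equals delta(b, a)."""
    return DELTA[(a, b)] if (a, b) in DELTA else DELTA[(b, a)]


def score_ungapped(v, w):
    """Sum the column scores of two equal-length sequences aligned without gaps."""
    if len(v) != len(w):
        raise ValueError("sequences must have the same length")
    return sum(delta(a, b) for a, b in zip(v, w))


print(score_ungapped("AKRANR", "KAAANK"))  # 11
```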

Conservation
Amino acid changes that tend to preserve the physico-chemical properties of the original residue:
    Polar to polar: aspartate → glutamate
    Nonpolar to nonpolar: alanine → valine
    Similarly behaving residues: leucine → isoleucine

Scoring matrices
Amino acid substitution matrices
    PAM
    BLOSUM
DNA substitution matrices
    DNA is less conserved than protein sequences
    Less effective to compare coding regions at nucleotide level

PAM
Point Accepted Mutation (Dayhoff et al.)
1 PAM = PAM1 = 1% average change of all amino acid positions
After 100 PAMs of evolution, not every residue will have changed:
    some residues may have mutated several times
    some residues may have returned to their original state
    some residues may not have changed at all

PAMX
$\mathrm{PAM}_x = (\mathrm{PAM}_1)^x$
$\mathrm{PAM}_{250} = (\mathrm{PAM}_1)^{250}$
PAM250 is a widely used scoring matrix:

         Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys ...
          A   R   N   D   C   Q   E   G   H   I   L   K  ...
Ala A    13   6   9   9   5   8   9  12   6   8   6   7  ...
Arg R     3  17   4   3   2   5   3   2   6   3   2   9
Asn N     4   4   6   7   2   5   6   4   6   3   2   5
Asp D     5   4   8  11   1   7  10   5   6   3   2   5
Cys C     2   1   1   1  52   1   1   2   2   2   1   1
Gln Q     3   5   5   6   1  10   7   3   7   2   3   5
...
Trp W     0   2   0   0   0   0   0   0   1   0   1   0
Tyr Y     1   1   2   1   3   1   1   1   3   2   2   1
Val V     7   4   4   4   4   4   4   4   5   4  15  10
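The relation PAMx = (PAM1)^x is a matrix power, which can be sketched without any libraries. The example below uses a made-up two-letter alphabet with a 1% change per step; the numbers are purely illustrative and are not real PAM data, and `mat_mul` and `mat_pow` are helper names invented here.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(m, x):
    """Raise a square matrix to the integer power x >= 0 by repeated squaring."""
    size = len(m)
    result = [[float(i == j) for j in range(size)] for i in range(size)]  # identity
    base = [row[:] for row in m]
    while x > 0:
        if x & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        x >>= 1
    return result


# Hypothetical toy matrix: each letter stays put with probability 0.99 per step.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

Even in this toy case the diagonal of the 250th power stays above one half, which mirrors the point that after many PAMs of evolution some residues have changed several times, some have changed back, and some have not changed at all.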

BLOSUM
Blocks Substitution Matrix
Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins
Matrix name indicates evolutionary distance
    BLOSUM62 was created using sequences sharing no more than 62% identity

The BLOSUM50 Scoring Matrix
[Figure: the BLOSUM50 scoring matrix]

Local vs. Global Alignment
The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph.
The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i', j') in the edit graph.
In the edit graph with negatively-scored edges, Local Alignment may score higher than Global Alignment.
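The definition above can be turned into code quite literally: try every start vertex, run the global recurrence from there, and keep the best score seen at any end vertex. This brute-force sketch is only meant to illustrate the definition on short strings; it is not an efficient algorithm, and the names `dp_table`, `global_score` and `local_score_bruteforce` as well as the +1 / -μ / -σ scoring are assumptions made for the example.

```python
def dp_table(v, w, mu, sigma):
    """Global alignment table for v and w with +1 match, -mu mismatch, -sigma indel."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = -sigma * i
    for j in range(1, m + 1):
        s[0][j] = -sigma * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = 1 if v[i - 1] == w[j - 1] else -mu
            s[i][j] = max(s[i - 1][j - 1] + diagonal,
                          s[i - 1][j] - sigma,
                          s[i][j - 1] - sigma)
    return s


def global_score(v, w, mu, sigma):
    """Longest path from (0, 0) to (n, m)."""
    return dp_table(v, w, mu, sigma)[-1][-1]


def local_score_bruteforce(v, w, mu, sigma):
    """Longest path between arbitrary vertices, found by trying every start vertex."""
    best = 0  # the empty path from a vertex to itself scores 0
    for a in range(len(v) + 1):
        for b in range(len(w) + 1):
            table = dp_table(v[a:], w[b:], mu, sigma)
            best = max(best, max(max(row) for row in table))
    return best


v, w = "AAACGT", "CGTTTT"
print(local_score_bruteforce(v, w, mu=1, sigma=1))  # 3, from the shared segment CGT
print(global_score(v, w, mu=1, sigma=1))            # lower than the local score
```

Each entry of the table for the suffixes v[a:] and w[b:] is the best score of a path from (a, b) to some later vertex, so taking the maximum over all tables covers every pair of vertices.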

Local vs. Global Alignment (cont'd)
Local Alignment: better alignment to find conserved segment

Global alignment:
--T---CC-C-AGT---TATGT-CAGGGGACACG--A-GCATGCAGA-GAC
| || | || | | | ||| || | | | | |||| |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C

Local alignment:
                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc

Local Alignment: Example
Compute a "mini" Global Alignment to get Local
[Figure: edit graph showing the local alignment path and the global alignment path]