CSE 5290: Algorithms for Bioinformatics Fall 2011

Suprakash Datta Office: CSEB 3043 Phone: ext 77875 Course page: 11/8/2018 CSE 5290, Fall 2011

Last time: Finding Regulatory Motifs in DNA sequences (exhaustive search variants)
Next: Greedy algorithms
The following slides are based on slides by the authors of our text.

Turnip vs Cabbage: Look and Taste Different
Although cabbages and turnips share a recent common ancestor, they look and taste different.

Turnip vs Cabbage - 2

Turnip vs Cabbage: Almost Identical mtDNA gene sequences
In the 1980s, Jeffrey Palmer studied the evolution of plant organelles by comparing the mitochondrial genomes of cabbage and turnip.
He found 99% similarity between genes.
These nearly identical gene sequences differed in gene order.
This study helped pave the way to analyzing genome rearrangements in molecular evolution.

Turnip vs Cabbage: Different mtDNA Gene Order
Gene order comparison:
[Figure: gene order before and after]
Evolution is manifested as the divergence in gene order.

Transforming Cabbage into Turnip

Genome rearrangements
Mouse (X chrom.)
Unknown ancestor, ~75 million years ago
Human (X chrom.)
What are the similarity blocks and how do we find them?
What is the architecture of the ancestral genome?
What is the evolutionary scenario for transforming one genome into the other?

History of Chromosome X
Rat Consortium, Nature, 2004

Reversals
1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Blocks represent conserved genes.

Reversals
1, 2, 3, -8, -7, -6, -5, -4, 9, 10
Blocks represent conserved genes.
In the course of evolution or in a clinical context, blocks 1, …, 10 could be misread as 1, 2, 3, -8, -7, -6, -5, -4, 9, 10.

Reversals and Breakpoints
1, 2, 3, -8, -7, -6, -5, -4, 9, 10
The reversal introduced two breakpoints (disruptions in order).

Reversals: Example
5’ ATGCCTGTACTA 3’
3’ TACGGACATGAT 5’
Break and Invert
5’ ATGTACAGGCTA 3’
3’ TACATGTCCGAT 5’

Types of Rearrangements
Reversal
Translocation
Fusion
Fission

Comparative Genomic Architectures: Mouse vs Human Genome
Humans and mice have similar genomes, but their genes are ordered differently
~245 rearrangements:
Reversals
Fusions
Fissions
Translocations

Waardenburg’s Syndrome: Mouse Provides Insight into Human Genetic Disorder
Waardenburg’s syndrome is characterized by pigmentary dysplasia
The gene implicated in the disease was linked to human chromosome 2, but it was not clear where exactly on chromosome 2 it is located

Waardenburg’s syndrome and splotch mice
A breed of mice (with the splotch gene) had similar symptoms, caused by the same type of gene as in humans
Scientists succeeded in identifying the location of the gene responsible for the disorder in mice
Finding the gene in mice gives clues to where the same gene is located in humans

Reversals: Example
p = 1 2 3 4 5 6 7 8
r(3,5): 1 2 5 4 3 6 7 8
r(5,6): 1 2 5 4 6 3 7 8

Reversals and Gene Orders
Gene order is represented by a permutation p:
$$p = p_1 \ldots p_{i-1} \; p_i \, p_{i+1} \ldots p_{j-1} \, p_j \; p_{j+1} \ldots p_n$$
Reversal r(i, j) reverses (flips) the elements from i to j in p:
$$p \cdot r(i,j) = p_1 \ldots p_{i-1} \; p_j \, p_{j-1} \ldots p_{i+1} \, p_i \; p_{j+1} \ldots p_n$$

Reversal Distance Problem
Goal: Given two permutations p and s, find the shortest series of reversals that transforms p into s
Input: Permutations p and s
Output: A series of reversals r1, …, rt transforming p into s, such that t is minimum
Notation: d(p, s), the reversal distance between p and s, is the smallest possible value of t, given p and s

Sorting By Reversals Problem
Goal: Given a permutation, find a shortest series of reversals that transforms it into the identity permutation (1 2 … n)
Input: Permutation p
Output: A series of reversals r1, …, rt transforming p into the identity permutation, such that t is minimum

Sorting By Reversals: Example
t = d(p): reversal distance of p
Example: [Figure: a permutation p sorted by three reversals]
So d(p) = 3

Sorting by reversals: 5 steps

Sorting by reversals: 4 steps
What is the reversal distance for this permutation? Can it be sorted in 3 steps?

Pancake Flipping Problem
The chef is sloppy; he prepares an unordered stack of pancakes of different sizes
The waiter wants to rearrange them (so that the smallest winds up on top, and so on, down to the largest at the bottom)
He does it by flipping over several from the top, repeating this as many times as necessary
Christos Papadimitriou and Bill Gates flip pancakes

Pancake Flipping Problem: Formulation
Goal: Given a stack of n pancakes, what is the minimum number of flips needed to rearrange them into a perfect stack?
Input: Permutation p
Output: A series of prefix reversals r1, …, rt transforming p into the identity permutation, such that t is minimum

Pancake Flipping Problem: Greedy Algorithm
Greedy approach: at most 2 prefix reversals to place a pancake in its right position, at most 2n – 2 steps in total
William Gates and Christos Papadimitriou showed in the mid-1970s that this problem can be solved by at most 5/3 (n + 1) prefix reversals
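The greedy approach can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `pancake_sort` and the convention that index 0 is the top of the stack are assumptions made for the example.

```python
def pancake_sort(stack):
    """Sort a stack (index 0 = top) using prefix reversals only.

    Returns (sorted_stack, flips), where each flip k means
    "reverse the top k pancakes". Hypothetical helper for illustration.
    """
    stack = list(stack)
    flips = []
    # Place the largest unsorted pancake at the bottom of the unsorted part.
    for size in range(len(stack), 1, -1):
        biggest = max(range(size), key=lambda idx: stack[idx])
        if biggest == size - 1:
            continue  # already in its final position
        if biggest != 0:
            # First flip: bring the largest pancake to the top.
            stack[:biggest + 1] = reversed(stack[:biggest + 1])
            flips.append(biggest + 1)
        # Second flip: send it from the top to position size - 1.
        stack[:size] = reversed(stack[:size])
        flips.append(size)
    return stack, flips
```

Each of the n – 1 iterations uses at most two flips, which gives the 2n – 2 bound quoted above.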

Sorting By Reversals: A Greedy Algorithm
When sorting a permutation p whose first three elements are already in order, it does not make any sense to break them.
The length of the already sorted prefix of p is denoted prefix(p); here prefix(p) = 3
This results in an idea for a greedy algorithm: increase prefix(p) at every step

Greedy Algorithm: An Example
Doing so, p can be sorted
The number of steps to sort a permutation of length n is at most (n – 1)

Greedy Algorithm: Pseudocode
SimpleReversalSort(p)
  for i ← 1 to n – 1
    j ← position of element i in p (i.e., pj = i)
    if j ≠ i
      p ← p • r(i, j)
      output p
    if p is the identity permutation
      return
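The pseudocode above translates almost line for line into Python. The sketch below uses 0-based list indices internally; the function name and the choice to return the list of intermediate permutations are assumptions for illustration.

```python
def simple_reversal_sort(perm):
    """Greedy sorting by reversals: extend the sorted prefix at every step.

    perm is a permutation of 1..n. Returns the permutation after each
    reversal, in order.
    """
    p = list(perm)
    identity = sorted(p)
    steps = []
    for i in range(len(p) - 1):      # i is 0-based; it stands for element i + 1
        j = p.index(i + 1)           # position of element i + 1
        if j != i:
            p[i:j + 1] = reversed(p[i:j + 1])   # apply r(i, j)
            steps.append(list(p))
        if p == identity:
            break
    return steps
```

For example, on the illustrative input 6 1 2 3 4 5 (chosen here, not taken from the slides) the function needs five reversals, even though two would suffice.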

Analyzing SimpleReversalSort
SimpleReversalSort does not guarantee the smallest number of reversals: on the permutation shown on the slide it takes five steps.
[Figure: Steps 1 to 5]

Analyzing SimpleReversalSort
But the same permutation can be sorted in two steps.
[Figure: Steps 1 and 2]
So SimpleReversalSort(p) is not optimal.
Optimal algorithms are unknown for many problems; approximation algorithms are used.

Approximation Algorithms
These algorithms find approximate solutions rather than optimal solutions
The approximation ratio of an algorithm A on input p is: A(p) / OPT(p)
where
A(p): solution produced by algorithm A
OPT(p): optimal solution of the problem

Approximation Ratio/Performance Guarantee
Approximation ratio (performance guarantee) of algorithm A: the worst-case approximation ratio over all inputs of size n
For an algorithm A that minimizes the objective function (minimization algorithm): $\max_{|p| = n} A(p) / OPT(p)$
For a maximization algorithm: $\min_{|p| = n} A(p) / OPT(p)$

Adjacencies and Breakpoints
p = p1 p2 p3 … pn-1 pn
A pair of elements pi and pi+1 are adjacent if pi+1 = pi ± 1
For example, in p = 1 9 3 4 7 8 2 6 5, the pairs (3,4), (7,8) and (6,5) are adjacent pairs

Breakpoints: An Example
There is a breakpoint between any two neighbouring elements that are not consecutive:
p = 1 9 3 4 7 8 2 6 5
Pairs (1,9), (9,3), (4,7), (8,2) and (2,6) form breakpoints of permutation p
b(p): # breakpoints in permutation p

Adjacency & Breakpoints
An adjacency: a pair of neighbouring elements that are consecutive
A breakpoint: a pair of neighbouring elements that are not consecutive
Extend π with π0 = 0 and π7 = 7
[Figure: a 6-element permutation π with its adjacencies and breakpoints marked]

Extending Permutations
We put two elements p0 = 0 and pn+1 = n + 1 at the ends of p
[Figure: a 9-element example permutation p, before and after extending with 0 and 10]
Note: A new breakpoint was created after extending

Reversal Distance and Breakpoints
Each reversal eliminates at most 2 breakpoints.
[Figure: a permutation sorted in three reversals, with b(p) = 5, then 4, then 2, then 0]
This implies: reversal distance ≥ #breakpoints / 2
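Counting breakpoints and the resulting lower bound takes only a few lines. In this sketch the function names are made up for the example, and the test permutation 1 9 3 4 7 8 2 6 5 is the one consistent with the breakpoint pairs listed earlier; extending it with 0 and 10 adds the breakpoint (5, 10).

```python
def count_breakpoints(perm):
    """Number of breakpoints b(p) of the permutation extended with 0 and n + 1."""
    extended = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(extended, extended[1:]) if abs(a - b) != 1)


def reversal_distance_lower_bound(perm):
    """Each reversal removes at most 2 breakpoints, so d(p) >= b(p) / 2."""
    return (count_breakpoints(perm) + 1) // 2   # ceiling of b(p) / 2


print(count_breakpoints([1, 9, 3, 4, 7, 8, 2, 6, 5]))              # 6
print(reversal_distance_lower_bound([1, 9, 3, 4, 7, 8, 2, 6, 5]))  # 3
```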

Sorting By Reversals: A Better Greedy Algorithm
BreakPointReversalSort(p)
  while b(p) > 0
    Among all possible reversals, choose reversal r minimizing b(p • r)
    p ← p • r
    output p
  return
Q: Does this algorithm terminate?

Strips
Strip: an interval between two consecutive breakpoints in a permutation
Decreasing strip: strip of elements in decreasing order (e.g. 6 5 and 3 2)
Increasing strip: strip of elements in increasing order (e.g. 7 8)
A single-element strip can be declared either increasing or decreasing. We will choose to declare them as decreasing, with the exception of the strips with 0 and n+1

Reducing the Number of Breakpoints
Theorem 1: If permutation p contains at least one decreasing strip, then there exists a reversal r which decreases the number of breakpoints (i.e. b(p • r) < b(p))

Things To Consider (cont’d)
For the permutation p shown on the slide, b(p) = 5
Choose the decreasing strip with the smallest element k in p (k = 2 in this case)
Find k – 1 in the permutation
Reverse the segment between k and k – 1: b(p) goes from 5 to 4

Reducing the Number of Breakpoints Again
If there is no decreasing strip, there may be no reversal r that reduces the number of breakpoints (i.e. b(p • r) ≥ b(p) for any reversal r).
By reversing an increasing strip (the # of breakpoints stays unchanged), we create a decreasing strip for the next step.
Then the number of breakpoints will be reduced in the next step (Theorem 1).

Things To Consider (cont’d)
There are no decreasing strips in p:
[Figure: a permutation p with b(p) = 3, and p • r(6,7), also with 3 breakpoints]
r(6,7) does not change the # of breakpoints
r(6,7) creates a decreasing strip, thus guaranteeing that the next step will decrease the # of breakpoints.

ImprovedBreakpointReversalSort
ImprovedBreakpointReversalSort(p)
  while b(p) > 0
    if p has a decreasing strip
      Among all possible reversals, choose reversal r that minimizes b(p • r)
    else
      Choose a reversal r that flips an increasing strip in p
    p ← p • r
    output p
  return

ImprovedBreakpointReversalSort: Performance Guarantee
ImprovedBreakpointReversalSort is an approximation algorithm with a performance guarantee of at most 4
It eliminates at least one breakpoint in every two steps, so it takes at most 2b(p) steps
Approximation ratio: 2b(p) / d(p)
An optimal algorithm eliminates at most 2 breakpoints in every step: d(p) ≥ b(p) / 2
Performance guarantee: 2b(p) / d(p) ≤ 2b(p) / (b(p) / 2) = 4

Signed Permutations
Up to this point, all permutations to sort were unsigned
But genes have directions, so we should consider signed permutations
[Figure: a signed permutation p drawn on a strand running from 5’ to 3’]

Signed Permutations
Algorithms are a little more involved.
Possible project topic

GRIMM Web Server
Real genome architectures are represented by signed permutations
Efficient algorithms to sort signed permutations have been developed
The GRIMM web server computes the reversal distances between signed permutations:
http://www-cse.ucsd.edu/groups/bioinformatics/GRIMM

Next Dynamic programming, sequence alignment
Some of the following slides are based on slides by the authors of our text.

Dynamic programming (DP)
Typically used for optimization problems
Often results in efficient algorithms
Not applicable to all problems
Caveats:
Need not yield poly-time algorithms
No unique formulations for most problems
May not rule out greedy algorithms

Examples
Counting the number of shortest paths in a grid
Counting the number of shortest paths in a grid with blocked intersections
Finding paths in a weighted grid
Sequence alignment

Setting up DP in practice
The optimal solution should be computable as a (recursive) function of the solutions to sub-problems
Solve sub-problems systematically and store the solutions (to avoid duplication of work).

Number of paths in a grid
Problem: Travel from the top-left to the bottom-right of a rectangular grid using only right and down moves
Combinatorial approach
DP approach: how can we decompose the problem into sub-problems?

Number of paths in a grid with blocked intersections
Problem: Same as before, but some grid points are blocked and cannot be used
Combinatorial approach?
DP approach: how can we decompose the problem into sub-problems?
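One way to decompose the problem: every path into a grid point arrives either from the point above or from the point to its left, so the number of paths to a point is the sum of the counts for those two neighbours. The sketch below assumes grid points are indexed (i, j) from (0, 0) to (rows, cols); the function name and the `blocked` parameter are illustrative.

```python
def count_paths(rows, cols, blocked=frozenset()):
    """Count right/down paths from (0, 0) to (rows, cols) avoiding blocked points."""
    ways = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows + 1):
        for j in range(cols + 1):
            if (i, j) in blocked:
                continue                      # no path may use this point
            if i == 0 and j == 0:
                ways[i][j] = 1                # the empty path
            else:
                from_above = ways[i - 1][j] if i > 0 else 0
                from_left = ways[i][j - 1] if j > 0 else 0
                ways[i][j] = from_above + from_left
    return ways[rows][cols]


print(count_paths(2, 2))                    # 6, the binomial coefficient C(4, 2)
print(count_paths(2, 2, blocked={(1, 1)}))  # 2
```

Without blocked points the answer matches the combinatorial count C(rows + cols, rows); with blocked points the same table still works, which is where the DP formulation pays off.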

Manhattan Tourist Problem (MTP)
Imagine seeking a path (from source to sink) that travels only eastward and southward and passes the largest number of attractions (*) in the Manhattan grid
[Figure: Manhattan grid from Source to Sink with attractions marked by *]

Manhattan Tourist Problem: Formulation
Goal: Find the longest path in a weighted grid.
Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink”
Output: A longest path in G from “source” to “sink”

MTP: An Example
[Figure: weighted grid with i and j coordinates, running from source to sink]

MTP: Greedy Algorithm Is Not Optimal
[Figure: weighted grid from source to sink comparing the greedy path with a better path]
A promising start, but it leads to bad choices!

MTP: Simple Recursive Program
MT(n,m)
  if n = 0 or m = 0
    return the length of the only path from the source to (n,m), which runs along the border of the grid
  x ← MT(n-1,m) + length of the edge from (n-1,m) to (n,m)
  y ← MT(n,m-1) + length of the edge from (n,m-1) to (n,m)
  return max{x,y}
What’s wrong with this approach?

MTP: Dynamic Programming
Calculate the optimal path score for each vertex in the graph.
Each vertex’s score is the maximum, over its prior vertices, of the prior vertex’s score plus the weight of the edge in between.
[Figure: first step of the grid computation, giving S0,1 = 1 and S1,0 = 5]

MTP: Dynamic Programming (cont’d)
[Figure: next step, giving S0,2 = 3, S1,1 = 4 and S2,0 = 8]

[Figure: next step, giving S3,0 = 8, S1,2 = 13 and S2,1 = 9]

[Figure: next step, giving S1,3 = 8, S2,2 = 12 and S3,1 = 9; the greedy algorithm fails!]

[Figure: next step, giving S2,3 = 15 and S3,2 = 9]

[Figure: final grid showing all back-traces, with S3,3 = 16. Done!]

MTP: Recurrence
Computing the score for a point (i, j) by the recurrence relation:
$$s_{i,j} = \max \begin{cases} s_{i-1,j} + \text{weight of the edge between } (i-1,j) \text{ and } (i,j) \\ s_{i,j-1} + \text{weight of the edge between } (i,j-1) \text{ and } (i,j) \end{cases}$$
The running time is O(nm) for an n by m grid (n = # of rows, m = # of columns)

Manhattan Is Not A Perfect Grid
What about diagonals?
The score at point B, with predecessors A1, A2 and A3, is given by:
$$s_B = \max \begin{cases} s_{A_1} + \text{weight of the edge } (A_1, B) \\ s_{A_2} + \text{weight of the edge } (A_2, B) \\ s_{A_3} + \text{weight of the edge } (A_3, B) \end{cases}$$

Manhattan Is Not A Perfect Grid (contd)
The score for point x is given by the recurrence relation:
$$s_x = \max_{y \in Predecessors(x)} \left( s_y + \text{weight of edge } (y, x) \right)$$
Predecessors(x): set of vertices that have edges leading to x
The running time for a graph G(V, E) (V is the set of all vertices and E is the set of all edges) is O(E), since each edge is evaluated once

Traveling in the Grid
The only hitch is that one must decide on the order in which to visit the vertices
By the time the vertex x is analyzed, the values sy for all its predecessors y should be computed; otherwise we are in trouble.
We need to traverse the vertices in some order
Try to find such an order for a directed cycle

DAG: Directed Acyclic Graph
Since Manhattan is not a perfect regular grid, we represent it as a DAG
[Figure: DAG for the “dressing in the morning” problem]

Topological Ordering
A numbering of the vertices of the graph is called a topological ordering of the DAG if every edge of the DAG connects a vertex with a smaller label to a vertex with a larger label
In other words, if vertices are positioned on a line in increasing order of labels, then all edges go from left to right.

Topological ordering
[Figure: 2 different topological orderings of the DAG]

Longest Path in DAG Problem
Goal: Find a longest path between two vertices in a weighted DAG
Input: A weighted DAG G with source and sink vertices
Output: A longest path in G from source to sink

Longest Path in DAG: Dynamic Programming
Suppose vertex v has indegree 3 and predecessors {u1, u2, u3}
The longest path to v from the source is:
$$s_v = \max \begin{cases} s_{u_1} + \text{weight of edge from } u_1 \text{ to } v \\ s_{u_2} + \text{weight of edge from } u_2 \text{ to } v \\ s_{u_3} + \text{weight of edge from } u_3 \text{ to } v \end{cases}$$
In general:
$$s_v = \max_u \left( s_u + \text{weight of edge from } u \text{ to } v \right)$$
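The general recurrence can be evaluated by visiting vertices in a topological order, so that every predecessor is scored before the vertex itself. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); the edge-dictionary input format and the function name are assumptions for the example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def longest_path(edges, source, sink):
    """edges maps (u, v) -> weight. Returns (score, path), or None if the
    sink cannot be reached from the source."""
    preds = {}
    for (u, v), weight in edges.items():
        preds.setdefault(u, [])
        preds.setdefault(v, []).append((u, weight))
    preds.setdefault(source, [])
    # TopologicalSorter takes node -> predecessors and yields predecessors first.
    order = TopologicalSorter(
        {v: [u for u, _ in ps] for v, ps in preds.items()}
    ).static_order()
    score, back = {source: 0}, {}
    for v in order:
        for u, weight in preds[v]:
            # Only extend paths that actually start at the source.
            if u in score and (v not in score or score[u] + weight > score[v]):
                score[v] = score[u] + weight
                back[v] = u
    if sink not in score:
        return None
    path = [sink]
    while path[-1] != source:
        path.append(back[path[-1]])
    return score[sink], path[::-1]


edges = {("a", "b"): 1, ("a", "c"): 5, ("b", "d"): 10, ("c", "d"): 3, ("b", "c"): 2}
print(longest_path(edges, "a", "d"))  # (11, ['a', 'b', 'd'])
```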

Traversing the Manhattan Grid
3 different strategies:
a) Column by column
b) Row by row
c) Along diagonals

Sequence alignment
Fundamental problem
Many different versions

Alignment: 2 row representation
Given 2 DNA sequences v and w:
v : ATGTTAT (m = 7)
w : ATCGTAC (n = 7)
Alignment: a 2 × k matrix (k ≥ max(m, n))
letters of v: A T - G T T A T -
letters of w: A T C G T - A - C
5 matches, 2 insertions, 2 deletions

Aligning DNA Sequences
V = ATCTGATG (n = 8)
W = TGCATAC (m = 7)
[Figure: an alignment of V and W with its matches, mismatches, insertions and deletions (indels) marked]

Aligning DNA Sequences - 2
Brute force is infeasible.
The number of alignments of X[1..n], Y[1..m], n < m, is $\binom{m+n}{n}$
For m = n, this is about $2^{2n} / \sqrt{\pi n}$
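To get a feel for how fast this count grows, the binomial coefficient and its approximation for m = n can be compared numerically. This is a small sketch; the function names are chosen for the example.

```python
import math


def number_of_alignments(n, m):
    """Binomial coefficient C(n + m, n), the count quoted above."""
    return math.comb(n + m, n)


def central_estimate(n):
    """Approximation 2^(2n) / sqrt(pi * n) for the case m = n."""
    return 4 ** n / math.sqrt(math.pi * n)


for n in (5, 10, 50):
    exact = number_of_alignments(n, n)
    print(n, exact, round(central_estimate(n) / exact, 4))
```

Already for two sequences of length 50 the count has 30 digits, which is why enumeration is out of the question.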

Longest Common Subsequence (LCS) – Alignment without Mismatches
Given two sequences v = v1 v2 … vm and w = w1 w2 … wn
The LCS of v and w is a sequence of positions in v: 1 ≤ i1 < i2 < … < it ≤ m
and a sequence of positions in w: 1 ≤ j1 < j2 < … < jt ≤ n
such that the ik-th letter of v equals the jk-th letter of w for every k, and t is maximal

LCS: Example
i coords:      0 1 2 2 3 3 4 5 6 7 8
elements of v:   A T - C - T G A T C
elements of w:   - T G C A T - A - C
j coords:      0 0 1 2 3 4 5 5 6 6 7
Path: (0,0) (1,0) (2,1) (2,2) (3,3) (3,4) (4,5) (5,5) (6,6) (7,6) (8,7)
Matches are at positions in v: 2 < 3 < 4 < 6 < 8
and positions in w: 1 < 3 < 5 < 6 < 7
Every common subsequence is a path in a 2-D grid

LCS Problem as Manhattan Tourist Problem
[Figure: grid with columns j = 1 to 8 and rows i = 1 to 7, the rows labelled T, G, C, A, T, A, C]

Edit Graph for LCS Problem
[Figure: the same grid with a diagonal edge wherever the letters match]
Every path is a common subsequence.
Every diagonal edge adds an extra element to the common subsequence.
LCS Problem: Find a path with the maximum number of diagonal edges.

Computing LCS
Let vi = prefix of v of length i: v1 … vi
and wj = prefix of w of length j: w1 … wj
The length of LCS(vi, wj) is computed by:
$$s_{i,j} = \max \begin{cases} s_{i-1,j} \\ s_{i,j-1} \\ s_{i-1,j-1} + 1 & \text{if } v_i = w_j \end{cases}$$

Computing LCS (cont’d)
[Figure: vertex (i,j) with incoming edges from (i-1,j), (i,j-1) and (i-1,j-1); the diagonal edge has weight 1 and is present only if vi = wj]

Every Path in the Grid Corresponds to an Alignment
[Figure: path through the grid for V = ATGT (positions 1 to 4) and W = ATCG]
V = A T - G T
    | |   |
W = A T C G -

Aligning Sequences without Insertions and Deletions: Hamming Distance
Given two DNA sequences v and w:
v : ATATATAT
w : TATATATA
The Hamming distance dH(v, w) = 8 is large, but the sequences are very similar

Aligning Sequences with Insertions and Deletions
By shifting one sequence over one position:
v : ATATATAT-
w : -TATATATA
The edit distance: d(v, w) = 2.
Hamming distance neglects insertions and deletions in DNA

Edit Distance
Levenshtein (1966) introduced the edit distance between two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) needed to transform one string into the other
d(v, w) = MIN number of elementary operations to transform v into w

Edit Distance vs Hamming Distance
Hamming distance always compares the i-th letter of v with the i-th letter of w:
V = ATATATAT
W = TATATATA
Hamming distance: d(v, w) = 8
Computing Hamming distance is a trivial task.

Edit distance may compare the i-th letter of v with the j-th letter of w. Just one shift makes it all line up:
V = -ATATATAT
W = TATATATA-
Edit distance: d(v, w) = 2 (one insertion and one deletion)
Computing edit distance is a non-trivial task: how do we find which j goes with which i?
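The contrast between the two distances shows up clearly in code: Hamming distance is a single pass over paired letters, while edit distance needs a dynamic programming table. The sketch below uses the standard unit-cost recurrence (insertion, deletion and substitution each cost 1) and keeps only two rows of the table; the function names are chosen for the example.

```python
def hamming_distance(v, w):
    """Number of positions at which two equal-length strings differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(v, w))


def edit_distance(v, w):
    """Minimum number of insertions, deletions and substitutions turning v into w."""
    prev = list(range(len(w) + 1))        # distances from the empty prefix of v
    for i, a in enumerate(v, 1):
        cur = [i]                         # deleting the first i letters of v
        for j, b in enumerate(w, 1):
            cur.append(min(prev[j] + 1,               # delete a
                           cur[j - 1] + 1,            # insert b
                           prev[j - 1] + (a != b)))   # match or substitute
        prev = cur
    return prev[-1]


print(hamming_distance("ATATATAT", "TATATATA"))  # 8
print(edit_distance("ATATATAT", "TATATATA"))     # 2
```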

Edit Distance: Example
TGCATAT → ATCCGAT in 5 steps
TGCATAT → (delete last T)
TGCATA → (delete last A)
TGCAT → (insert A at front)
ATGCAT → (substitute C for the 3rd letter, G)
ATCCAT → (insert G before last A)
ATCCGAT (Done)
What is the edit distance? 5?

Edit Distance: Example (cont’d)
TGCATAT → ATCCGAT in 4 steps
TGCATAT → (insert A at front)
ATGCATAT → (delete the 6th letter, T)
ATGCAAT → (substitute G for the 5th letter, A)
ATGCGAT → (substitute C for the 3rd letter, G)
ATCCGAT (Done)
Can it be done in 3 steps?

The Alignment Grid
Every alignment path is from source to sink

Alignment as a Path in the Edit Graph
v: A T _ G T T A T _
w: A T C G T _ A _ C
Corresponding path: (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)

Alignments in Edit Graph (cont’d)
Horizontal and vertical edges represent indels in v and w, with score 0.
Diagonal edges represent matches, with score 1.
The score of the alignment path is 5.

Every path in the edit graph corresponds to an alignment:
Old alignment
v = AT_GTTAT_
w = ATCGT_A_C
New alignment
v = AT_GTTAT_
w = ATCG_TA_C

Alignment: Dynamic Programming
$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\ s_{i-1,j} \\ s_{i,j-1} \end{cases}$$

Dynamic Programming Example
Initialize the 0th row and 0th column to be all zeroes.

Dynamic Programming Example
$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + 1 & \text{if } v_i = w_j \text{ (value from NW)} \\ S_{i-1,j} & \text{(value from North, top)} \\ S_{i,j-1} & \text{(value from West, left)} \end{cases}$$
[Figure: grid with row 1 and column 1 filled in with 1s]

Alignment: Backtracking
Arrows show where the score originated from:
↑ if from the top
← if from the left
↖ if vi = wj

Backtracking Example
Find the matches in row 2 and column 2.
i = 2, j = 2, 5 is a match (T).
j = 2, i = 4, 5, 7 is a match (T).
Since vi = wj, si,j = si-1,j-1 + 1:
s2,2 = [s1,1 = 1] + 1
s2,5 = [s1,4 = 1] + 1
s4,2 = [s3,1 = 1] + 1
s5,2 = [s4,1 = 1] + 1
s7,2 = [s6,1 = 1] + 1

Backtracking Example
Continuing with the dynamic programming algorithm gives this result (rows i = 1 … 7, columns j = 1 … 7; row 0 and column 0 are all zeroes):
1 1 1 1 1 1 1
1 2 2 2 2 2 2
1 2 2 3 3 3 3
1 2 2 3 4 4 4
1 2 2 3 4 4 4
1 2 2 3 4 5 5
1 2 2 3 4 5 5

$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\ s_{i-1,j} + 0 \\ s_{i,j-1} + 0 \end{cases}$$
This recurrence corresponds to the Manhattan Tourist problem (three incoming edges into a vertex) with all horizontal and vertical edges weighted by zero.

LCS Algorithm
LCS(v,w)
  for i ← 1 to n
    si,0 ← 0
  for j ← 1 to m
    s0,j ← 0
  for i ← 1 to n
    for j ← 1 to m
      si,j ← max { si-1,j ; si,j-1 ; si-1,j-1 + 1, if vi = wj }
      bi,j ← “↑” if si,j = si-1,j
             “←” if si,j = si,j-1
             “↖” if si,j = si-1,j-1 + 1
  return (sn,m, b)

Now What?
LCS(v,w) created the alignment grid
Now we need a way to read the best alignment of v and w
Follow the arrows backwards from the sink

Printing LCS: Backtracking
PrintLCS(b,v,i,j)
  if i = 0 or j = 0
    return
  if bi,j = “↖”
    PrintLCS(b,v,i-1,j-1)
    print vi
  else
    if bi,j = “↑”
      PrintLCS(b,v,i-1,j)
    else
      PrintLCS(b,v,i,j-1)
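The table-filling and backtracking steps can be combined in one short function. This sketch does not store a separate table of back-pointers; instead it re-derives each arrow from the scores while walking back from the sink, which is an implementation choice made for the example rather than the scheme in the pseudocode.

```python
def lcs(v, w):
    """Return one longest common subsequence of v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]   # row 0 and column 0 stay zero
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(s[i - 1][j], s[i][j - 1])
            if v[i - 1] == w[j - 1]:
                best = max(best, s[i - 1][j - 1] + 1)
            s[i][j] = best
    # Walk back from the sink (n, m), collecting matched letters.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if v[i - 1] == w[j - 1] and s[i][j] == s[i - 1][j - 1] + 1:
            out.append(v[i - 1])     # diagonal arrow: a match
            i -= 1
            j -= 1
        elif s[i][j] == s[i - 1][j]:
            i -= 1                   # arrow from the top
        else:
            j -= 1                   # arrow from the left
    return "".join(reversed(out))


print(len(lcs("ATGTTAT", "ATCGTAC")))  # 5
```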

LCS Runtime
It takes O(nm) time to fill in the n × m dynamic programming matrix.
Why O(nm)? The pseudocode consists of a “for” loop nested inside another “for” loop to fill in an n × m matrix.

Why does DP work?
Avoids re-computing the same sub-problems
Limits the amount of work done in each step

When is DP applicable?
Optimal substructure: the optimal solution to a problem (instance) contains optimal solutions to sub-problems
Overlapping sub-problems: a limited number of distinct sub-problems, repeated many times

Next: More realistic sequence alignment algorithms
Types:
Global Alignment
Scoring Matrices
Local Alignment
Alignment with Affine Gap Penalties

From LCS to Alignment: Change up the Scoring
The Longest Common Subsequence (LCS) problem, the simplest form of sequence alignment, allows only insertions and deletions (no mismatches).
In the LCS Problem, we scored 1 for matches and 0 for indels
Consider penalizing indels and mismatches with negative scores
Simplest scoring schema:
+1 : match premium
-μ : mismatch penalty
-σ : indel penalty

Simple Scoring
When mismatches are penalized by –μ, indels are penalized by –σ, and matches are rewarded with +1, the resulting score is:
#matches – μ(#mismatches) – σ(#indels)

The Global Alignment Problem
Find the best alignment between two strings under a given scoring schema
Input: Strings v and w and a scoring schema
Output: Alignment of maximum score
Edge weights: ↑ and → = -σ; diagonal = +1 if match, -μ if mismatch
$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\ s_{i-1,j-1} - \mu & \text{if } v_i \neq w_j \\ s_{i-1,j} - \sigma \\ s_{i,j-1} - \sigma \end{cases}$$
μ : mismatch penalty
σ : indel penalty
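The recurrence for global alignment under the simple scoring schema can be sketched as below. The default penalties `mu=1` and `sigma=1` are arbitrary values picked for the example, and the border initialisation (a prefix aligned against nothing costs σ per letter) is the usual convention, stated here as an assumption since the slide does not spell it out.

```python
def global_alignment_score(v, w, mu=1, sigma=1):
    """Best global alignment score: +1 per match, -mu per mismatch, -sigma per indel."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # Border: aligning a prefix against the empty string costs one indel per letter.
    for i in range(1, n + 1):
        s[i][0] = -sigma * i
    for j in range(1, m + 1):
        s[0][j] = -sigma * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = s[i - 1][j - 1] + (1 if v[i - 1] == w[j - 1] else -mu)
            s[i][j] = max(diagonal, s[i - 1][j] - sigma, s[i][j - 1] - sigma)
    return s[n][m]


print(global_alignment_score("ATGTTAT", "ATCGTAC", mu=0, sigma=0))  # 5
```

With μ = σ = 0 the score reduces to the LCS length, which is a handy sanity check on the implementation.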

Scoring Matrices
To generalize scoring, consider a (4+1) × (4+1) scoring matrix δ.
In the case of an amino acid sequence alignment, the scoring matrix would be of size (20+1) × (20+1). The addition of 1 is to include the score for comparison with a gap character “-”.
This simplifies the algorithm as follows:
$$s_{i,j} = \max \begin{cases} s_{i-1,j-1} + \delta(v_i, w_j) \\ s_{i-1,j} + \delta(v_i, -) \\ s_{i,j-1} + \delta(-, w_j) \end{cases}$$

Measuring Similarity
Measuring the extent of similarity between two sequences:
Based on percent sequence identity
Based on conservation

Percent Sequence Identity
The extent to which two nucleotide or amino acid sequences are invariant
A C C T G A G - A G
A C G T G - G C A G
The alignment contains a mismatch (column 3) and indels (columns 6 and 8): 70% identical
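Percent identity for an existing alignment is a simple count. The sketch below divides by the number of alignment columns, which reproduces the 70% figure above; other conventions for the denominator exist, so treat that choice, like the function name, as an assumption.

```python
def percent_identity(row_v, row_w, gap="-"):
    """Percentage of alignment columns holding the same (non-gap) letter."""
    if len(row_v) != len(row_w):
        raise ValueError("alignment rows must have the same length")
    same = sum(a == b and a != gap for a, b in zip(row_v, row_w))
    return 100.0 * same / len(row_v)


print(percent_identity("ACCTGAG-AG", "ACGTG-GCAG"))  # 70.0
```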

Making a Scoring Matrix
Scoring matrices are created based on biological evidence. Alignments can be thought of as two sequences that differ due to mutations. Some of these mutations have little effect on the protein’s function, therefore some penalties, δ(vi , wj), will be less harsh than others.

Scoring Matrix: Example
[Table: scoring matrix over the amino acids A, R, N and K]
Notice that although R and K are different amino acids, they have a positive score. Why? They are both positively charged amino acids, so substituting one for the other will not greatly change the function of the protein.
AKRANR
KAAANK
-1 + (-1) + (-2) + 5 + 7 + 3 = 11

Conservation
Amino acid changes that tend to preserve the physico-chemical properties of the original residue:
Polar to polar: aspartate → glutamate
Nonpolar to nonpolar: alanine → valine
Similarly behaving residues: leucine to isoleucine

Scoring matrices
Amino acid substitution matrices:
PAM
BLOSUM
DNA substitution matrices:
DNA is less conserved than protein sequences
It is less effective to compare coding regions at the nucleotide level

PAM
Point Accepted Mutation (Dayhoff et al.)
1 PAM = PAM1 = 1% average change of all amino acid positions
After 100 PAMs of evolution, not every residue will have changed:
some residues may have mutated several times
some residues may have returned to their original state
some residues may not have changed at all

PAMX
$PAM_x = (PAM_1)^x$
$PAM_{250} = (PAM_1)^{250}$
PAM250 is a widely used scoring matrix.
[Table: PAM250 matrix, with rows and columns labelled by the amino acids Ala (A), Arg (R), Asn (N), Asp (D), Cys (C), Gln (Q), Glu (E), Gly (G), His (H), Ile (I), Leu (L), Lys (K), ..., Trp (W), Tyr (Y), Val (V)]

BLOSUM
Blocks Substitution Matrix
Scores are derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins
The matrix name indicates evolutionary distance
BLOSUM62 was created using sequences sharing no more than 62% identity

The Blosum50 Scoring Matrix

Local vs. Global Alignment
The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph.
The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i’, j’) in the edit graph.
In an edit graph with negatively-scored edges, a Local Alignment may score higher than the Global Alignment

Local vs. Global Alignment (cont’d)
Local Alignment: better alignment to find conserved segment
Global alignment:
--T--CC-C-AGT--TATGT-CAGGGGACACG--A-GCATGCAGA-GAC
  |  || |  ||  | | | |||    || |  | |  | ||||   |
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG--T-CAGAT--C
Local alignment:
                  tccCAGTTATGTCAGgggacacgagcatgcagagac
                     ||||||||||||
aattgccgccgtcgttttcagCAGTTATGTCAGatc

Local Alignment: Example
Compute a “mini” Global Alignment to get the Local Alignment
[Figure: local alignment path inside the global alignment grid]
