CSE 5290: Algorithms for Bioinformatics Fall 2011

1 CSE 5290: Algorithms for Bioinformatics Fall 2011
Suprakash Datta Office: CSEB 3043 Phone: ext 77875 Course page: 11/8/2018 CSE 5290, Fall 2011

2 Last time Finding Regulatory Motifs in DNA sequences (exhaustive search variants) Next: Greedy algorithms The following slides are based on slides by the authors of our text. 11/8/2018 CSE 5290, Fall 2011

3 Turnip vs Cabbage: Look and Taste Different
Although cabbages and turnips share a recent common ancestor, they look and taste different 11/8/2018 CSE 5290, Fall 2011

4 Turnip vs Cabbage - 2 11/8/2018 CSE 5290, Fall 2011

5 Turnip vs Cabbage: Almost Identical mtDNA gene sequences
In 1980s Jeffrey Palmer studied evolution of plant organelles by comparing mitochondrial genomes of the cabbage and turnip 99% similarity between genes These surprisingly identical gene sequences differed in gene order This study helped pave the way to analyzing genome rearrangements in molecular evolution 11/8/2018 CSE 5290, Fall 2011

6 Turnip vs Cabbage: Different mtDNA Gene Order
Gene order comparison: 11/8/2018 CSE 5290, Fall 2011

7 Turnip vs Cabbage: Different mtDNA Gene Order
Gene order comparison: 11/8/2018 CSE 5290, Fall 2011

8 Turnip vs Cabbage: Different mtDNA Gene Order
Gene order comparison: 11/8/2018 CSE 5290, Fall 2011

9 Turnip vs Cabbage: Different mtDNA Gene Order
Gene order comparison: 11/8/2018 CSE 5290, Fall 2011

10 Turnip vs Cabbage: Different mtDNA Gene Order
Gene order comparison: Before After Evolution is manifested as the divergence in gene order 11/8/2018 CSE 5290, Fall 2011

11 Transforming Cabbage into Turnip
11/8/2018 CSE 5290, Fall 2011

12 Genome rearrangements
Mouse (X chrom.) Unknown ancestor ~ 75 million years ago Human (X chrom.) What are the similarity blocks and how to find them? What is the architecture of the ancestral genome? What is the evolutionary scenario for transforming one genome into the other? 11/8/2018 CSE 5290, Fall 2011

13 History of Chromosome X
Rat Consortium, Nature, 2004 11/8/2018 CSE 5290, Fall 2011

14 Reversals 1 3 2 4 10 5 6 8 9 7 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 Blocks represent conserved genes. 11/8/2018 CSE 5290, Fall 2011

15 Reversals 1 2 3 9 10 8 4 7 5 6 1, 2, 3, -8, -7, -6, -5, -4, 9, 10 Blocks represent conserved genes. In the course of evolution or in a clinical context, blocks 1,…,10 could be misread as 1, 2, 3, -8, -7, -6, -5, -4, 9, 10. 11/8/2018 CSE 5290, Fall 2011

16 Reversals and Breakpoints
1 2 3 9 10 8 4 7 5 6 1, 2, 3, -8, -7, -6, -5, -4, 9, 10 The reversal introduced two breakpoints (disruptions in order).

17 Reversals: Example 5’ ATGCCTGTACTA 3’ 3’ TACGGACATGAT 5’
Break and Invert 5’ ATGTACAGGCTA 3’ 3’ TACATGTCCGAT 5’ 11/8/2018 CSE 5290, Fall 2011
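The break-and-invert step on the double strand can be sketched in a few lines. This is an illustrative sketch, not part of the original slides: the function name `reverse_segment` and the 0-based, end-exclusive indices are assumptions. Reversing a segment of double-stranded DNA means that, read 5' to 3' on the top strand, the segment is replaced by its reverse complement.

```python
# Hypothetical helper: names and index convention are assumptions for illustration.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_segment(strand, start, end):
    """Replace strand[start:end] (0-based, end exclusive) by its reverse complement."""
    segment = strand[start:end]
    flipped = "".join(COMPLEMENT[base] for base in reversed(segment))
    return strand[:start] + flipped + strand[end:]

# Top strand from the slide; the inverted block is the six bases after "ATG".
print(reverse_segment("ATGCCTGTACTA", 3, 9))
```

With the slide's top strand as input, the segment CCTGTA becomes TACAGG, which reproduces the top strand shown after the inversion.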

18 Types of Rearrangements
Reversal Translocation Fusion 5 6 Fission 11/8/2018 CSE 5290, Fall 2011

19 Comparative Genomic Architectures: Mouse vs Human Genome
Humans and mice have similar genomes, but their genes are ordered differently ~245 rearrangements Reversals Fusions Fissions Translocation 11/8/2018 CSE 5290, Fall 2011

20 Waardenburg’s Syndrome: Mouse Provides Insight into Human Genetic Disorder
Waardenburg’s syndrome is characterized by pigmentary dysplasia. The gene implicated in the disease was linked to human chromosome 2, but it was not clear where exactly on chromosome 2 it is located

21 Waardenburg’s syndrome and splotch mice
A breed of mice (with splotch gene) had similar symptoms caused by the same type of gene as in humans Scientists succeeded in identifying location of gene responsible for disorder in mice Finding the gene in mice gives clues to where the same gene is located in humans 11/8/2018 CSE 5290, Fall 2011

22 Reversals: Example r(3,5) 1 2 5 4 3 6 7 8 r(5,6) 1 2 5 4 6 3 7 8
11/8/2018 CSE 5290, Fall 2011

23 Reversals and Gene Orders
Gene order is represented by a permutation p: $p = p_1 \ldots p_{i-1}\; p_i\, p_{i+1} \ldots p_{j-1}\, p_j\; p_{j+1} \ldots p_n$. Reversal r(i, j) reverses (flips) the elements from i to j in p: $p \cdot r(i,j) = p_1 \ldots p_{i-1}\; p_j\, p_{j-1} \ldots p_{i+1}\, p_i\; p_{j+1} \ldots p_n$
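A reversal r(i, j) on an unsigned permutation is a single slice operation. The sketch below assumes 1-based, inclusive positions as on the slides; the function name `reversal` is chosen here for illustration.

```python
def reversal(p, i, j):
    """Return p * r(i, j): flip the elements at 1-based positions i..j (inclusive)."""
    return p[:i - 1] + p[i - 1:j][::-1] + p[j:]

p = [1, 2, 3, 4, 5, 6, 7, 8]
p = reversal(p, 3, 5)   # the r(3,5) example from slide 22
print(p)
p = reversal(p, 5, 6)   # followed by r(5,6)
print(p)
```

Applied to the identity on eight elements, r(3,5) followed by r(5,6) reproduces the two rows of the "Reversals: Example" slide.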

24 Reversal Distance Problem
Goal: Given two permutations p, s, find the shortest series of reversals that transforms p into s Input: Permutations p and s Output: A series of reversals r1,…rt transforming p into s, such that t is minimum Notation: t - reversal distance between p and s d(p, s) - smallest possible value of t, given p and s 11/8/2018 CSE 5290, Fall 2011

25 Sorting By Reversals Problem
Goal: Given a permutation, find a shortest series of reversals that transforms it into the identity permutation (1 2 … n ) Input: Permutation p Output: A series of reversals r1, … rt transforming p into the identity permutation such that t is minimum 11/8/2018 CSE 5290, Fall 2011

26 Sorting By Reversals: Example
t =d(p ) - reversal distance of p Example : p = So d(p ) = 3 11/8/2018 CSE 5290, Fall 2011

27 Sorting by reversals: 5 steps
hour 11/8/2018 CSE 5290, Fall 2011

28 Sorting by reversals: 4 steps
What is the reversal distance for this permutation? Can it be sorted in 3 steps? 11/8/2018 CSE 5290, Fall 2011

29 Pancake Flipping Problem
The chef is sloppy; he prepares an unordered stack of pancakes of different sizes. The waiter wants to rearrange them (so that the smallest winds up on top, and so on, down to the largest at the bottom). He does it by flipping over several from the top, repeating this as many times as necessary. Christos Papadimitriou and Bill Gates flip pancakes

30 Pancake Flipping Problem: Formulation
Goal: Given a stack of n pancakes, what is the minimum number of flips to rearrange them into perfect stack? Input: Permutation p Output: A series of prefix reversals r1, … rt transforming p into the identity permutation such that t is minimum 11/8/2018 CSE 5290, Fall 2011

31 Pancake Flipping Problem: Greedy Algorithm
Greedy approach: 2 prefix reversals at most to place a pancake in its right position, 2n – 2 steps total at most William Gates and Christos Papadimitriou showed in the mid-1970s that this problem can be solved by at most 5/3 (n + 1) prefix reversals 11/8/2018 CSE 5290, Fall 2011
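The greedy bound of 2n – 2 flips can be seen in a short sketch. This is one possible rendering of the greedy idea, not the Gates-Papadimitriou algorithm: it places the largest unsorted pancake at the bottom of the unsorted part using at most two prefix reversals, then repeats on the rest. The function name and the convention that index 0 is the top of the stack are assumptions.

```python
def pancake_sort(stack):
    """Greedy prefix-reversal sort; stack[0] is the top pancake.

    Returns (sorted_stack, flips), where a flip k reverses the top k pancakes.
    """
    stack = list(stack)
    flips = []
    for size in range(len(stack), 1, -1):
        biggest = stack.index(max(stack[:size]))
        if biggest == size - 1:
            continue                                   # already in place
        if biggest != 0:
            stack[:biggest + 1] = stack[biggest::-1]   # bring it to the top
            flips.append(biggest + 1)
        stack[:size] = stack[size - 1::-1]             # flip it down into place
        flips.append(size)
    return stack, flips

print(pancake_sort([3, 1, 4, 2]))
```

Each pancake costs at most two flips and the last one needs none, which gives the 2n – 2 bound stated above.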

32 Sorting By Reversals: A Greedy Algorithm
If sorting permutation p = , the first three elements are already in order so it does not make any sense to break them. The length of the already sorted prefix of p is denoted prefix(p) prefix(p) = 3 This results in an idea for a greedy algorithm: increase prefix(p) at every step 11/8/2018 CSE 5290, Fall 2011

33 Greedy Algorithm: An Example
Doing so, p can be sorted Number of steps to sort permutation of length n is at most (n – 1) 11/8/2018 CSE 5290, Fall 2011

34 Greedy Algorithm: Pseudocode
SimpleReversalSort(p)
1 for i ← 1 to n – 1
2   j ← position of element i in p (i.e., pj = i)
3   if j ≠ i
4     p ← p * r(i, j)
5     output p
6   if p is the identity permutation
7     return
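A runnable version of SimpleReversalSort might look as follows. The input permutation 6 1 2 3 4 5 is an assumption chosen for illustration (the slide's own example did not survive extraction); it happens to need five greedy steps, yet two reversals suffice.

```python
def simple_reversal_sort(p):
    """Greedy sort by reversals: extend the sorted prefix by one element per step.

    Returns the list of intermediate permutations, one per reversal performed.
    """
    p = list(p)
    n = len(p)
    steps = []
    for i in range(1, n):                      # i = 1 .. n-1
        j = p.index(i) + 1                     # 1-based position of element i
        if j != i:
            p[i - 1:j] = p[i - 1:j][::-1]      # p <- p * r(i, j)
            steps.append(list(p))
        if p == list(range(1, n + 1)):
            break
    return steps

for step in simple_reversal_sort([6, 1, 2, 3, 4, 5]):
    print(step)
```

For this input the greedy procedure walks the 6 to the right one position at a time, whereas reversing the whole permutation and then its first five elements sorts it in two steps.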

35 Analyzing SimpleReversalSort
SimpleReversalSort does not guarantee the smallest number of reversals and takes five steps on p = : Step 1: Step 2: Step 3: Step 4: Step 5: 11/8/2018 CSE 5290, Fall 2011

36 Analyzing SimpleReversalSort
But it can be sorted in two steps: p = Step 1: Step 2: So, SimpleReversalSort(p) is not optimal Optimal algorithms are unknown for many problems; approximation algorithms are used 11/8/2018 CSE 5290, Fall 2011

37 Approximation Algorithms
These algorithms find approximate solutions rather than optimal solutions The approximation ratio of an algorithm A on input p is: A(p) / OPT(p) where A(p) -solution produced by algorithm A OPT(p) - optimal solution of the problem 11/8/2018 CSE 5290, Fall 2011

39 Approximation Ratio/Performance Guarantee
Approximation ratio (performance guarantee) of algorithm A: max approximation ratio over all inputs of size n. For an algorithm A that minimizes the objective function (minimization algorithm): $\max_{|p| = n} A(p) / OPT(p)$. For a maximization algorithm: $\min_{|p| = n} A(p) / OPT(p)$

40 Adjacencies and Breakpoints
p = p1p2p3…pn-1pn A pair of elements pi and pi+1 are adjacent if pi+1 = pi ± 1. For example p = (3, 4) or (7, 8) and (6,5) are adjacent pairs

41 Breakpoints: An Example
There is a breakpoint between any pair of neighboring elements that are non-consecutive: p = Pairs (1,9), (9,3), (4,7), (8,2) and (2,6) form breakpoints of permutation p. b(p) - # breakpoints in permutation p

42 Adjacency & Breakpoints
An adjacency - a pair of adjacent elements that are consecutive A breakpoint - a pair of adjacent elements that are not consecutive π = Extend π with π0 = 0 and π7 = 7 adjacencies breakpoints 11/8/2018 CSE 5290, Fall 2011

43 Extending Permutations
We put two elements p 0 =0 and p n + 1=n+1 at the ends of p Example: p = Extending with 0 and 10 p = Note: A new breakpoint was created after extending 11/8/2018 CSE 5290, Fall 2011
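Counting breakpoints, with or without the extension by 0 and n+1, is a one-line scan. The permutation used below, 1 9 3 4 7 8 2 6 5, is an assumption: it is one permutation consistent with the adjacent pairs and breakpoint pairs listed on the preceding slides, since the original example was lost in extraction.

```python
def breakpoints(p, extend=True):
    """Count neighboring pairs whose values are not consecutive.

    With extend=True the permutation is first framed by 0 and n+1.
    """
    seq = [0] + list(p) + [len(p) + 1] if extend else list(p)
    return sum(1 for a, b in zip(seq, seq[1:]) if abs(a - b) != 1)

p = [1, 9, 3, 4, 7, 8, 2, 6, 5]          # assumed example, see lead-in
print(breakpoints(p, extend=False))       # breakpoints inside p only
print(breakpoints(p))                     # after extending with 0 and 10
```

Extending this permutation adds the pair (5, 10), which is the new breakpoint the note on the slide refers to.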

44 Reversal Distance and Breakpoints
Each reversal eliminates at most 2 breakpoints. p = b(p) = 5 b(p) = 4 b(p) = 2 b(p) = 0 This implies: reversal distance ≥ #breakpoints / 2 11/8/2018 CSE 5290, Fall 2011

45 Sorting By Reversals: A Better Greedy Algorithm
BreakPointReversalSort(p)
1 while b(p) > 0
2   Among all possible reversals, choose reversal r minimizing b(p • r)
3   p ← p • r(i, j)
4   output p
5 return
Q: Does this algorithm terminate?

46 Strips Strip: an interval between two consecutive breakpoints in a permutation Decreasing strip: strip of elements in decreasing order (e.g. 6 5 and 3 2 ). Increasing strip: strip of elements in increasing order (e.g. 7 8) A single-element strip can be declared either increasing or decreasing. We will choose to declare them as decreasing with exception of the strips with 0 and n+1 11/8/2018 CSE 5290, Fall 2011

47 Reducing the Number of Breakpoints
Theorem 1: If permutation p contains at least one decreasing strip, then there exists a reversal r which decreases the number of breakpoints (i.e. b(p • r) < b(p) ) 11/8/2018 CSE 5290, Fall 2011

48 Find k – 1 in the permutation
Things To Consider For p = b(p) = 5 Choose decreasing strip with the smallest element k in p ( k = 2 in this case) Find k – 1 in the permutation 11/8/2018 CSE 5290, Fall 2011

49 Things To Consider (cont’d)
For p = b(p) = 5 Choose decreasing strip with the smallest element k in p ( k = 2 in this case) Find k – 1 in the permutation Reverse the segment between k and k-1: b(p) = 5 b(p) = 4 11/8/2018 CSE 5290, Fall 2011

50 Reducing the Number of Breakpoints Again
If there is no decreasing strip, there may be no reversal r that reduces the number of breakpoints (i.e. b(p • r) ≥ b(p) for any reversal r). By reversing an increasing strip ( # of breakpoints stay unchanged ), we will create a decreasing strip at the next step. Then the number of breakpoints will be reduced in the next step (theorem 1). 11/8/2018 CSE 5290, Fall 2011

51 Things To Consider (cont’d)
There are no decreasing strips in p, for: p = b(p) = 3 p • r(6,7) = b(p) = 3 r(6,7) does not change the # of breakpoints r(6,7) creates a decreasing strip thus guaranteeing that the next step will decrease the # of breakpoints. 11/8/2018 CSE 5290, Fall 2011

52 ImprovedBreakpointReversalSort
ImprovedBreakpointReversalSort(p)
1 while b(p) > 0
2   if p has a decreasing strip
3     Among all possible reversals, choose reversal r that minimizes b(p • r)
4   else
5     Choose a reversal r that flips an increasing strip in p
6   p ← p • r
7   output p
8 return
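The pseudocode above can be turned into a brute-force sketch. Assumptions made here for illustration: the permutation is extended with 0 and n+1, single-element strips count as decreasing except those holding 0 or n+1 (the convention stated on the Strips slide), candidate reversals are tried exhaustively, and the first interior increasing strip is the one flipped. The input 1 2 5 6 7 3 4 8 is an assumed example with no decreasing strip.

```python
def breakpoints(ext):
    """Number of neighboring pairs in the extended permutation that are not consecutive."""
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) != 1)

def strips(ext):
    """Return (start, end) index pairs of maximal runs between breakpoints."""
    out, start = [], 0
    for k in range(1, len(ext)):
        if abs(ext[k] - ext[k - 1]) != 1:
            out.append((start, k - 1))
            start = k
    out.append((start, len(ext) - 1))
    return out

def is_decreasing(ext, s, e):
    if s == e:                              # single element: decreasing unless it is 0 or n+1
        return 0 < s < len(ext) - 1
    return ext[s] > ext[s + 1]

def flip(ext, i, j):
    return ext[:i] + ext[i:j + 1][::-1] + ext[j + 1:]

def improved_breakpoint_reversal_sort(p):
    """Return (reversals, sorted_permutation); reversals are 1-based (i, j) pairs."""
    n = len(p)
    ext = [0] + list(p) + [n + 1]
    steps = []
    while breakpoints(ext) > 0:
        st = strips(ext)
        if any(is_decreasing(ext, s, e) for s, e in st):
            candidates = ((i, j) for i in range(1, n + 1) for j in range(i + 1, n + 1))
            i, j = min(candidates, key=lambda ij: breakpoints(flip(ext, *ij)))
        else:
            # all strips increasing: flip the first one that avoids 0 and n+1
            i, j = next((s, e) for s, e in st if s > 0 and e < n + 1)
        ext = flip(ext, i, j)
        steps.append((i, j))
    return steps, ext[1:-1]

print(improved_breakpoint_reversal_sort([1, 2, 5, 6, 7, 3, 4, 8]))
```

Trying every reversal costs O(n^2) candidates per step with an O(n) breakpoint count each, so this sketch is meant for small examples only.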

53 ImprovedBreakpointReversalSort: Performance Guarantee
ImprovedBreakPointReversalSort is an approximation algorithm with a performance guarantee of at most 4. It eliminates at least one breakpoint in every two steps, so it takes at most 2b(p) steps. Approximation ratio: 2b(p) / d(p). An optimal algorithm eliminates at most 2 breakpoints in every step: d(p) ≥ b(p) / 2. Performance guarantee: 2b(p) / d(p) ≤ 2b(p) / (b(p) / 2) = 4

54 Signed Permutations Up to this point, all permutations to sort were unsigned But genes have directions… so we should consider signed permutations 5’ 3’ p = 11/8/2018 CSE 5290, Fall 2011

55 Signed Permutations Algorithms are a little more involved.
Possible project topic 11/8/2018 CSE 5290, Fall 2011

56 GRIMM Web Server Real genome architectures are represented by signed permutations Efficient algorithms to sort signed permutations have been developed GRIMM web server computes the reversal distances between signed permutations: 11/8/2018 CSE 5290, Fall 2011

57 GRIMM Web Server http://www-cse.ucsd.edu/groups/bioinformatics/GRIMM
11/8/2018 CSE 5290, Fall 2011

58 Next Dynamic programming, sequence alignment
Some of the following slides are based on slides by the authors of our text. 11/8/2018 CSE 5290, Fall 2011

59 Dynamic programming (DP)
Typically used for optimization problems Often results in efficient algorithms Not applicable to all problems Caveats: Need not yield poly-time algorithms No unique formulations for most problems May not rule out greedy algorithms 11/8/2018 CSE 5290, Fall 2011

60 Example Counting the number of shortest paths in a grid
Counting the number of shortest paths in a grid with blocked intersections Finding paths in a weighted grid Sequence alignment 11/8/2018 CSE 5290, Fall 2011

61 Setting up DP in practice
The optimal solution should be computable as a (recursive) function of the solution to sub-problems Solve sub-problems systematically and store solutions (to avoid duplication of work). 11/8/2018 CSE 5290, Fall 2011

62 Number of paths in a grid
Problem: Travel from the top-left to the bottom right of a rectangular grid using only right and down moves Combinatorial approach DP approach: how can we decompose the problem into sub-problems ? 11/8/2018 CSE 5290, Fall 2011

63 Number of paths in a grid with blocked intersections
Problem: Same as before but some grid points are blocked and cannot be used Combinatorial approach? DP approach: how can we decompose the problem into sub-problems ? 11/8/2018 CSE 5290, Fall 2011
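One way to decompose the path-counting problem: the number of paths reaching a grid point is the sum of the counts for the point above and the point to the left, and a blocked point contributes zero. The sketch below assumes grid points are indexed (row, column) from (0, 0) at the top-left; the function name and the `blocked` parameter are illustrative.

```python
def count_paths(rows, cols, blocked=frozenset()):
    """Count right/down paths from point (0, 0) to point (rows, cols), avoiding blocked points."""
    ways = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows + 1):
        for j in range(cols + 1):
            if (i, j) in blocked:
                continue                      # no path may pass through a blocked point
            if i == 0 and j == 0:
                ways[i][j] = 1
            else:
                from_above = ways[i - 1][j] if i > 0 else 0
                from_left = ways[i][j - 1] if j > 0 else 0
                ways[i][j] = from_above + from_left
    return ways[rows][cols]

print(count_paths(2, 2))                # unblocked grid
print(count_paths(2, 2, {(1, 1)}))      # center point blocked
```

Without blocked points the result agrees with the combinatorial answer, the binomial coefficient "rows + cols choose rows"; with blocked points the closed form no longer applies, but the same table-filling still works.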

64 Manhattan Tourist Problem (MTP)
Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid Source * * * * * * * * * * * * Sink 11/8/2018 CSE 5290, Fall 2011

65 Manhattan Tourist Problem (MTP)
Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid Source * * * * * * * * * * * * Sink 11/8/2018 CSE 5290, Fall 2011

66 Manhattan Tourist Problem: Formulation
Goal: Find the longest path in a weighted grid. Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink” Output: A longest path in G from “source” to “sink” 11/8/2018 CSE 5290, Fall 2011

67 MTP: An Example source sink 13 19 9 15 23 20 j coordinate i coordinate
4 7 1 5 6 8 i coordinate 13 source 19 9 15 23 20 j coordinate sink 11/8/2018 CSE 5290, Fall 2011

68 MTP: Greedy Algorithm Is Not Optimal
1 2 5 source 5 3 10 5 2 1 5 3 5 3 1 2 3 4 promising start, but leads to bad choices! 5 2 22 sink 18 11/8/2018 CSE 5290, Fall 2011

69 MTP: Simple Recursive Program
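MT(n,m)
  if n = 0 and m = 0 return 0
  if n = 0 return MT(0,m-1) + length of the edge from (0,m-1) to (0,m)
  if m = 0 return MT(n-1,0) + length of the edge from (n-1,0) to (n,0)
  x ← MT(n-1,m) + length of the edge from (n-1,m) to (n,m)
  y ← MT(n,m-1) + length of the edge from (n,m-1) to (n,m)
  return max{x,y}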

70 MTP: Simple Recursive Program
MT(n,m)
  x ← MT(n-1,m) + length of the edge from (n-1,m) to (n,m)
  y ← MT(n,m-1) + length of the edge from (n,m-1) to (n,m)
  return max{x,y}
What’s wrong with this approach?

71 MTP: Dynamic Programming
j 1 source 1 1 i S0,1 = 1 5 1 5 S1,0 = 5 Calculate optimal path score for each vertex in the graph Each vertex’s score is the maximum of the prior vertices score plus the weight of the respective edge in between 11/8/2018 CSE 5290, Fall 2011

72 MTP: Dynamic Programming (cont’d)
j 1 2 source 1 2 1 3 i S0,2 = 3 5 3 -5 1 5 4 S1,1 = 4 3 2 8 S2,0 = 8 11/8/2018 CSE 5290, Fall 2011

73 MTP: Dynamic Programming (cont’d)
j 1 2 3 source 1 2 5 1 3 8 i S3,0 = 8 5 3 10 -5 1 1 5 4 13 S1,2 = 13 3 5 -5 2 8 9 S2,1 = 9 3 8 11/8/2018 CSE 5290, Fall 2011 S3,0 = 8

74 MTP: Dynamic Programming (cont’d)
j 1 2 3 source 1 2 5 1 3 8 i 5 3 10 -5 -5 1 -5 1 5 4 13 8 S1,3 = 8 3 5 -3 -5 3 2 8 9 12 S2,2 = 12 3 8 9 11/8/2018 CSE 5290, Fall 2011 S3,1 = 9 greedy alg. fails!

75 MTP: Dynamic Programming (cont’d)
j 1 2 3 source 1 2 5 1 3 8 i 5 3 10 -5 -5 1 -5 1 5 4 13 8 3 5 -3 2 -5 3 3 2 8 9 12 15 S2,3 = 15 -5 3 8 9 9 11/8/2018 CSE 5290, Fall 2011 S3,2 = 9

76 MTP: Dynamic Programming (cont’d)
j 1 2 3 source 1 2 5 1 3 8 Done! i 5 3 10 -5 -5 1 -5 1 5 4 13 8 (showing all back-traces) 3 5 -3 2 -5 3 3 2 8 9 12 15 -5 1 3 8 9 9 16 11/8/2018 CSE 5290, Fall 2011 S3,3 = 16

77 MTP: Recurrence
Computing the score for a point (i,j) by the recurrence relation: $s_{i,j} = \max \{\, s_{i-1,j} + \text{weight of the edge between } (i-1,j) \text{ and } (i,j),\;\; s_{i,j-1} + \text{weight of the edge between } (i,j-1) \text{ and } (i,j) \,\}$. The running time is O(nm) for an n by m grid (n = # of rows, m = # of columns)
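The recurrence translates directly into a table-filling routine. The sketch below is illustrative: the names `down` and `right` and the small weight tables are assumptions, not the grid from the slides. `down[i][j]` is the weight of the edge from (i, j) to (i+1, j), and `right[i][j]` the weight of the edge from (i, j) to (i, j+1).

```python
def manhattan_tourist(down, right):
    """Length of a longest source-to-sink path in a weighted grid."""
    n, m = len(down), len(right[0])
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                  # first column: only downward moves
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):                  # first row: only rightward moves
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s[n][m]

down = [[1, 0, 2],
        [4, 6, 5]]          # hypothetical weights for a grid with 3 x 3 points
right = [[3, 2],
         [0, 7],
         [3, 3]]
print(manhattan_tourist(down, right))
```

Each table entry is computed once from two neighbors, which is where the O(nm) running time comes from; the plain recursive MT, by contrast, recomputes the same entries many times.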

78 Manhattan Is Not A Perfect Grid
B A3 A1 A2 What about diagonals? The score at point B is given by: sB = max of sA1 + weight of the edge (A1, B) sA2 + weight of the edge (A2, B) sA3 + weight of the edge (A3, B) 11/8/2018 CSE 5290, Fall 2011

79 Manhattan Is Not A Perfect Grid (contd)
Computing the score for point x is given by the recurrence relation: $s_x = \max_{y \in Predecessors(x)} \left( s_y + \text{weight of edge } (y, x) \right)$, where Predecessors(x) is the set of vertices that have edges leading to x. The running time for a graph G(V, E) (V is the set of all vertices and E is the set of all edges) is O(E) since each edge is evaluated once

80 Traveling in the Grid The only hitch is that one must decide on the order in which to visit the vertices. By the time the vertex x is analyzed, the values sy for all its predecessors y should be computed; otherwise we are in trouble. We need to traverse the vertices in some order. Try to find such an order for a directed cycle ???

81 DAG: Directed Acyclic Graph
Since Manhattan is not a perfect regular grid, we represent it as a DAG DAG for Dressing in the morning problem 11/8/2018 CSE 5290, Fall 2011

82 Topological Ordering A numbering of vertices of the graph is called topological ordering of the DAG if every edge of the DAG connects a vertex with a smaller label to a vertex with a larger label In other words, if vertices are positioned on a line in an increasing order of labels then all edges go from left to right. 11/8/2018 CSE 5290, Fall 2011

83 Topological ordering 2 different topological orderings of the DAG
11/8/2018 CSE 5290, Fall 2011

84 Longest Path in DAG Problem
Goal: Find a longest path between two vertices in a weighted DAG Input: A weighted DAG G with source and sink vertices Output: A longest path in G from source to sink 11/8/2018 CSE 5290, Fall 2011

85 Longest Path in DAG: Dynamic Programming
Suppose vertex v has indegree 3 and predecessors {u1, u2, u3}. The longest path to v from source is: $s_v = \max \{\, s_{u_1} + \text{weight of edge from } u_1 \text{ to } v,\;\; s_{u_2} + \text{weight of edge from } u_2 \text{ to } v,\;\; s_{u_3} + \text{weight of edge from } u_3 \text{ to } v \,\}$. In general: $s_v = \max_u \left( s_u + \text{weight of edge from } u \text{ to } v \right)$
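Processing vertices in a topological order guarantees that every predecessor score is ready when a vertex is reached. The sketch below uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+); the edge dictionary format, the function name, and the toy graph are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

def longest_path(edges, source, sink):
    """Length of a longest source-to-sink path in a weighted DAG.

    edges maps (u, v) pairs to weights. Returns None if sink is unreachable.
    """
    preds = {}
    for (u, v), w in edges.items():
        preds.setdefault(v, {})[u] = w
        preds.setdefault(u, {})
    # TopologicalSorter takes node -> predecessors and yields predecessors first
    order = TopologicalSorter({v: set(ps) for v, ps in preds.items()}).static_order()
    score = {source: 0}
    for v in order:
        candidates = [score[u] + w for u, w in preds[v].items() if u in score]
        if candidates:
            score[v] = max(candidates)      # s_v = max over predecessors u
    return score.get(sink)

edges = {("s", "a"): 2, ("s", "b"): 5, ("a", "b"): 4, ("a", "t"): 7, ("b", "t"): 3}
print(longest_path(edges, "s", "t"))
```

Every edge is examined exactly once, matching the O(E) bound; on a graph with a directed cycle `TopologicalSorter` raises `CycleError`, which is the situation the "directed cycle" question above points at.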

86 Traversing the Manhattan Grid
b) 3 different strategies: a) Column by column b) Row by row c) Along diagonals c) 11/8/2018 CSE 5290, Fall 2011

87 Sequence alignment Fundamental problem Many different versions
11/8/2018 CSE 5290, Fall 2011

88 Alignment: 2 row representation
Given 2 DNA sequences v and w:
v : A T G T T A T (m = 7)
w : A T C G T A C (n = 7)
Alignment : 2 * k matrix ( k ≥ max(m, n) )
letters of v: A T -- G T T A T --
letters of w: A T C G T -- A -- C
5 matches, 2 insertions, 2 deletions

89 Aligning DNA Sequences
V = ATCTGATG n = 8 4 matches mismatches insertions deletions m = 7 1 W = TGCATAC 2 match mismatch 2 V A T C G W deletion indels insertion 11/8/2018 CSE 5290, Fall 2011

90 Aligning DNA Sequences - 2
Brute force is infeasible. The number of alignments of X[1..n], Y[1..m], n < m, is $\binom{m+n}{n}$. For m = n, this is about $2^{2n} / \sqrt{\pi n}$

91 Longest Common Subsequence (LCS) – Alignment without Mismatches
Given two sequences v = v1 v2…vm and w = w1 w2…wn, the LCS of v and w is a sequence of positions in v: 1 ≤ i1 < i2 < … < it ≤ m and a sequence of positions in w: 1 ≤ j1 < j2 < … < jt ≤ n such that the ik-th letter of v equals the jk-th letter of w for every k, and t is maximal

92 LCS: Example Every common subsequence is a path in 2-D grid 1 1 2 2 3
1 1 2 2 3 4 3 5 4 5 6 6 7 7 8 i coords: elements of v A T -- C -- T G A T C elements of w -- T G C A T -- A -- C j coords: (0,0) (1,0) (2,1) (2,2) (3,3) (3,4) (4,5) (5,5) (6,6) (7,6) (8,7) positions in v: 2 < 3 < 4 < 6 < 8 Matches shown in red positions in w: 1 < 3 < 5 < 6 < 7 Every common subsequence is a path in 2-D grid 11/8/2018 CSE 5290, Fall 2011

93 LCS Problem as Manhattan Tourist Problem
G A T C j 1 2 3 4 5 6 7 8 i T 1 G 2 C 3 A 4 T 5 A 6 C 7 11/8/2018 CSE 5290, Fall 2011

94 Edit Graph for LCS Problem
j 1 2 3 4 5 6 7 8 i T 1 G 2 C 3 A 4 T 5 A 6 C 7 11/8/2018 CSE 5290, Fall 2011

95 Edit Graph for LCS Problem
j 1 2 3 4 5 6 7 8 Every path is a common subsequence. Every diagonal edge adds an extra element to common subsequence LCS Problem: Find a path with maximum number of diagonal edges i T 1 G 2 C 3 A 4 T 5 A 6 C 7 11/8/2018 CSE 5290, Fall 2011

96 Computing LCS Let vi = prefix of v of length i: v1 … vi
and wj = prefix of w of length j: w1 … wj. The length of LCS(vi,wj) is computed by: $s_{i,j} = \max \{\, s_{i-1,j},\;\; s_{i,j-1},\;\; s_{i-1,j-1} + 1 \text{ if } v_i = w_j \,\}$

97 Computing LCS (cont’d)
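$s_{i,j} = \max \{\, s_{i-1,j},\;\; s_{i,j-1},\;\; s_{i-1,j-1} + 1 \text{ if } v_i = w_j \,\}$, where the three terms come from the vertices (i-1,j), (i,j-1) and (i-1,j-1) respectively; the diagonal edge from (i-1,j-1) to (i,j) has weight 1.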

98 Every Path in the Grid Corresponds to an Alignment
W A T C G V = A T - G T | | | W= A T C G – V 1 2 3 4 A T G T 11/8/2018 CSE 5290, Fall 2011

99 Aligning Sequences without Insertions and Deletions: Hamming Distance
Given two DNA sequences v and w : v : A T w : A T The Hamming distance: dH(v, w) = 8 is large but the sequences are very similar 11/8/2018 CSE 5290, Fall 2011

100 Aligning Sequences with Insertions and Deletions
By shifting one sequence over one position: v : A T -- w : -- A T The edit distance: dH(v, w) = 2. Hamming distance neglects insertions and deletions in DNA 11/8/2018 CSE 5290, Fall 2011

101 Edit Distance Levenshtein (1966) introduced edit distance between two strings as the minimum number of elementary operations (insertions, deletions, and substitutions) to transform one string into the other d(v,w) = MIN number of elementary operations to transform v  w 11/8/2018 CSE 5290, Fall 2011

102 Edit Distance vs Hamming Distance
always compares i-th letter of v with i-th letter of w V = ATATATAT W = TATATATA Hamming distance: d(v, w)=8 Computing Hamming distance is a trivial task. 11/8/2018 CSE 5290, Fall 2011

103 Edit Distance vs Hamming Distance
may compare i-th letter of v with j-th letter of w Hamming distance always compares i-th letter of v with i-th letter of w V = - ATATATAT V = ATATATAT Just one shift Make it all line up W = TATATATA W = TATATATA Hamming distance: Edit distance: d(v, w)= d(v, w)=2 Computing Hamming distance Computing edit distance is a trivial task is a non-trivial task 11/8/2018 CSE 5290, Fall 2011

104 Edit Distance vs Hamming Distance
may compare i-th letter of v with j-th letter of w Hamming distance always compares i-th letter of v with i-th letter of w V = - ATATATAT V = ATATATAT W = TATATATA W = TATATATA Hamming distance: Edit distance: d(v, w)= d(v, w)=2 (one insertion and one deletion) How to find what j goes with what i ??? 11/8/2018 CSE 5290, Fall 2011
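The contrast between the two distances can be checked directly. The sketch below is illustrative: `hamming_distance` compares position by position, while `edit_distance` uses the standard dynamic-programming formulation with unit costs for insertion, deletion, and substitution (the function names are assumptions).

```python
def hamming_distance(v, w):
    """Number of positions at which two equal-length strings differ."""
    if len(v) != len(w):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(v, w))

def edit_distance(v, w):
    """Minimum number of insertions, deletions, and substitutions turning v into w."""
    prev = list(range(len(w) + 1))            # distances from the empty prefix of v
    for i, a in enumerate(v, 1):
        cur = [i]
        for j, b in enumerate(w, 1):
            cur.append(min(prev[j] + 1,               # delete a
                           cur[j - 1] + 1,            # insert b
                           prev[j - 1] + (a != b)))   # match or substitute
        prev = cur
    return prev[-1]

v, w = "ATATATAT", "TATATATA"
print(hamming_distance(v, w), edit_distance(v, w))
```

For the slide's pair the Hamming distance is 8 while the edit distance is 2, because shifting one string by a single position lines everything up.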

105 Edit Distance: Example
TGCATAT  ATCCGAT in 5 steps TGCATAT  (delete last T) TGCATA  (delete last A) TGCAT  (insert A at front) ATGCAT  (substitute C for 3rd G) ATCCAT  (insert G before last A) ATCCGAT (Done) 11/8/2018 CSE 5290, Fall 2011

106 Edit Distance: Example
TGCATAT  ATCCGAT in 5 steps TGCATAT  (delete last T) TGCATA  (delete last A) TGCAT  (insert A at front) ATGCAT  (substitute C for 3rd G) ATCCAT  (insert G before last A) ATCCGAT (Done) What is the edit distance? 5? 11/8/2018 CSE 5290, Fall 2011

107 Edit Distance: Example (cont’d)
TGCATAT  ATCCGAT in 4 steps TGCATAT  (insert A at front) ATGCATAT  (delete 6th T) ATGCATA  (substitute G for 5th A) ATGCGTA  (substitute C for 3rd G) ATCCGAT (Done) 11/8/2018 CSE 5290, Fall 2011

108 Edit Distance: Example (cont’d)
TGCATAT  ATCCGAT in 4 steps TGCATAT  (insert A at front) ATGCATAT  (delete 6th T) ATGCATA  (substitute G for 5th A) ATGCGTA  (substitute C for 3rd G) ATCCGAT (Done) Can it be done in 3 steps??? 11/8/2018 CSE 5290, Fall 2011

109 The Alignment Grid Every alignment path is from source to sink
11/8/2018 CSE 5290, Fall 2011

110 Alignment as a Path in the Edit Graph
1 2 3 4 5 6 7 G A T C w v A T _ G T T A T _ A T C G T _ A _ C (0,0) , (1,1) , (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7) - Corresponding path - 11/8/2018 CSE 5290, Fall 2011

111 Alignments in Edit Graph (cont’d)
and represent indels in v and w with score 0. represent matches with score 1. The score of the alignment path is 5. 1 2 3 4 5 6 7 G A T C w v 11/8/2018 CSE 5290, Fall 2011

112 Alignment as a Path in the Edit Graph
1 2 3 4 5 6 7 G A T C w v Every path in the edit graph corresponds to an alignment: 11/8/2018 CSE 5290, Fall 2011

113 Alignment as a Path in the Edit Graph
1 2 3 4 5 6 7 G A T C w v Old Alignment v= AT_GTTAT_ w= ATCGT_A_C New Alignment v= AT_GTTAT_ w= ATCG_TA_C 11/8/2018 CSE 5290, Fall 2011

114 Alignment as a Path in the Edit Graph
1 2 3 4 5 6 7 G A T C w v v= AT_GTTAT_ w= ATCGT_A_C (0,0) , (1,1) , (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7) 11/8/2018 CSE 5290, Fall 2011

115 Alignment: Dynamic Programming
$s_{i,j} = \max \{\, s_{i-1,j-1} + 1 \text{ if } v_i = w_j,\;\; s_{i-1,j},\;\; s_{i,j-1} \,\}$

116 Dynamic Programming Example
1 2 3 4 5 6 7 G A T C w v Initialize 1st row and 1st column to be all zeroes. Or, to be more precise, initialize 0th row and 0th column to be all zeroes. 11/8/2018 CSE 5290, Fall 2011

117 Dynamic Programming Example
$S_{i,j} = \max \{\, S_{i-1,j-1} + 1 \text{ if } v_i = w_j \text{ (value from NW, +1)},\;\; S_{i-1,j} \text{ (value from North, top)},\;\; S_{i,j-1} \text{ (value from West, left)} \,\}$

118 Alignment: Backtracking
Arrows show where the score originated from. if from the top if from the left if vi = wj 11/8/2018 CSE 5290, Fall 2011

119 Backtracking Example w v Find a match in row and column 2.
i=2, j=2,5 is a match (T). j=2, i=4,5,7 is a match (T). Since vi = wj, si,j = si-1,j-1 +1 s2,2 = [s1,1 = 1] + 1 s2,5 = [s1,4 = 1] + 1 s4,2 = [s3,1 = 1] + 1 s5,2 = [s4,1 = 1] + 1 s7,2 = [s6,1 = 1] + 1 1 2 3 4 5 6 7 G A T C w v 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 1 2 1 2 1 2 1 2 11/8/2018 CSE 5290, Fall 2011

120 Backtracking Example w v
1 2 3 4 5 6 7 G A T C w v Continuing with the dynamic programming algorithm gives this result. 1 1 1 1 1 1 1 1 2 2 2 2 2 2 1 2 2 3 3 3 3 1 2 2 3 4 4 4 1 2 2 3 4 4 4 1 2 2 3 4 5 5 1 2 2 3 4 5 5 11/8/2018 CSE 5290, Fall 2011

122 Alignment: Dynamic Programming
$s_{i,j} = \max \{\, s_{i-1,j-1} + 1 \text{ if } v_i = w_j,\;\; s_{i-1,j} + 0,\;\; s_{i,j-1} + 0 \,\}$ This recurrence corresponds to the Manhattan Tourist problem (three incoming edges into a vertex) with all horizontal and vertical edges weighted by zero.

123 LCS Algorithm
LCS(v,w)
  for i ← 1 to n
    si,0 ← 0
  for j ← 1 to m
    s0,j ← 0
  for i ← 1 to n
    for j ← 1 to m
      si,j ← max { si-1,j ; si,j-1 ; si-1,j-1 + 1, if vi = wj }
      bi,j ← “↑” if si,j = si-1,j ; “←” if si,j = si,j-1 ; “↖” if si,j = si-1,j-1 + 1
  return (sn,m, b)

124 Now What? w v LCS(v,w) created the alignment grid
1 2 3 4 5 6 7 G A T C w v LCS(v,w) created the alignment grid Now we need a way to read the best alignment of v and w Follow the arrows backwards from sink 11/8/2018 CSE 5290, Fall 2011

125 Printing LCS: Backtracking
PrintLCS(b,v,i,j)
  if i = 0 or j = 0
    return
  if bi,j = “↖”
    PrintLCS(b,v,i-1,j-1)
    print vi
  else if bi,j = “↑”
    PrintLCS(b,v,i-1,j)
  else
    PrintLCS(b,v,i,j-1)
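The table-filling and backtracking steps can be combined in one short function. This sketch is an assumption-laden rendering rather than a transcription of the pseudocode: it skips the explicit backtrack matrix b and instead re-derives each arrow from the score table while walking back from the sink, and it backtracks iteratively rather than recursively.

```python
def lcs(v, w):
    """Return one longest common subsequence of strings v and w."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]      # row 0 and column 0 stay zero
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j], s[i][j - 1])
            if v[i - 1] == w[j - 1]:
                s[i][j] = max(s[i][j], s[i - 1][j - 1] + 1)
    out = []
    i, j = n, m
    while i > 0 and j > 0:                          # follow the arrows back from the sink
        if v[i - 1] == w[j - 1] and s[i][j] == s[i - 1][j - 1] + 1:
            out.append(v[i - 1])                    # diagonal edge: a match
            i, j = i - 1, j - 1
        elif s[i][j] == s[i - 1][j]:
            i -= 1                                  # came from the top
        else:
            j -= 1                                  # came from the left
    return "".join(reversed(out))

print(lcs("ATGTTAT", "ATCGTAC"))
```

For the sequences used throughout these slides the result has length 5, matching the score of the alignment path shown in the edit graph.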

126 LCS Runtime It takes O(nm) time to fill in the nxm dynamic programming matrix. Why O(nm)? The pseudocode consists of a nested “for” loop inside of another “for” loop to set up a nxm matrix. 11/8/2018 CSE 5290, Fall 2011

127 Why does DP work? Avoids re-computing the same sub-problems
Limits the amount of work done in each step 11/8/2018 CSE 5290, Fall 2011

128 When is DP applicable? – Optimal substructure: Optimal solution to problem (instance) contains optimal solutions to sub-problems – Overlapping sub-problems: Limited number of distinct sub-problems, repeated many many times 11/8/2018 CSE 5290, Fall 2011

129 Alignment with Affine Gap Penalties
Next: More realistic sequence alignment algorithms Types: Global Alignment Scoring Matrices Local Alignment Alignment with Affine Gap Penalties 11/8/2018 CSE 5290, Fall 2011

130 From LCS to Alignment: Change up the Scoring
The Longest Common Subsequence (LCS) problem—the simplest form of sequence alignment – allows only insertions and deletions (no mismatches). In the LCS Problem, we scored 1 for matches and 0 for indels Consider penalizing indels and mismatches with negative scores Simplest scoring schema: +1 : match premium -μ : mismatch penalty -σ : indel penalty 11/8/2018 CSE 5290, Fall 2011

131 Simple Scoring When mismatches are penalized by –μ, indels are penalized by –σ, and matches are rewarded with +1, the resulting score is: #matches – μ(#mismatches) – σ (#indels) 11/8/2018 CSE 5290, Fall 2011

132 The Global Alignment Problem
Find the best alignment between two strings under a given scoring schema. Input: Strings v and w and a scoring schema. Output: Alignment of maximum score. Edge weights: ↑→ = -σ; diagonal = +1 if match, -µ if mismatch. $s_{i,j} = \max \{\, s_{i-1,j-1} + 1 \text{ if } v_i = w_j,\;\; s_{i-1,j-1} - \mu \text{ if } v_i \neq w_j,\;\; s_{i-1,j} - \sigma,\;\; s_{i,j-1} - \sigma \,\}$ µ : mismatch penalty, σ : indel penalty
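A score-only version of the global alignment recurrence is sketched below. One detail is an assumption not spelled out on the slide: the first row and column are initialized to -σ times the prefix length, since aligning a prefix against nothing requires that many indels. The function name and parameter names are illustrative.

```python
def global_alignment_score(v, w, mu, sigma):
    """Best global alignment score with +1 per match, -mu per mismatch, -sigma per indel."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = -sigma * i                   # assumed boundary: i deletions
    for j in range(1, m + 1):
        s[0][j] = -sigma * j                   # assumed boundary: j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = 1 if v[i - 1] == w[j - 1] else -mu
            s[i][j] = max(s[i - 1][j - 1] + diagonal,
                          s[i - 1][j] - sigma,
                          s[i][j - 1] - sigma)
    return s[n][m]

print(global_alignment_score("ATGTTAT", "ATCGTAC", mu=1, sigma=1))
```

Setting both penalties to zero reduces the score to the number of matches, so the function then returns the LCS length, which ties this scoring back to the previous slides.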

133 Scoring Matrices To generalize scoring, consider a (4+1) x (4+1) scoring matrix δ. In the case of an amino acid sequence alignment, the scoring matrix would be of size (20+1) x (20+1). The addition of 1 is to include the score for comparison of a gap character “-”. This will simplify the algorithm as follows: $s_{i,j} = \max \{\, s_{i-1,j-1} + \delta(v_i, w_j),\;\; s_{i-1,j} + \delta(v_i, -),\;\; s_{i,j-1} + \delta(-, w_j) \,\}$

134 Measuring Similarity Measuring the extent of similarity between two sequences Based on percent sequence identity Based on conservation 11/8/2018 CSE 5290, Fall 2011

135 Percent Sequence Identity
The extent to which two nucleotide or amino acid sequences are invariant A C C T G A G – A G A C G T G – G C A G mismatch indel 70% identical 11/8/2018 CSE 5290, Fall 2011
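Percent identity can be computed column by column from the two rows of an alignment. The sketch below assumes the denominator is the full alignment length, gaps included, which is the convention that yields the 70% on the slide; other conventions exist. The function name and the use of "-" as the gap character are assumptions.

```python
def percent_identity(row_v, row_w, gap="-"):
    """Percentage of alignment columns in which both rows carry the same letter."""
    if len(row_v) != len(row_w):
        raise ValueError("alignment rows must have equal length")
    matches = sum(a == b and a != gap for a, b in zip(row_v, row_w))
    return 100.0 * matches / len(row_v)

# The two rows from the slide, with "-" marking indels
print(percent_identity("ACCTGAG-AG", "ACGTG-GCAG"))
```

The slide's alignment has ten columns: seven matches, one mismatch, and two indels.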

136 Making a Scoring Matrix
Scoring matrices are created based on biological evidence. Alignments can be thought of as two sequences that differ due to mutations. Some of these mutations have little effect on the protein’s function, therefore some penalties, δ(vi , wj), will be less harsh than others. 11/8/2018 CSE 5290, Fall 2011

137 Scoring Matrix: Example
K 5 -2 -1 - 7 3 6 Notice that although R and K are different amino acids, they have a positive score. Why? They are both positively charged amino acids, so substituting one for the other will not greatly change the function of the protein. AKRANR KAAANK -1 + (-1) + (-2) + 5 + 7 + 3 = 11

138 Conservation Amino acid changes that tend to preserve the physico-chemical properties of the original residue Polar to polar aspartate  glutamate Nonpolar to nonpolar alanine  valine Similarly behaving residues leucine to isoleucine 11/8/2018 CSE 5290, Fall 2011

139 Scoring matrices Amino acid substitution matrices PAM BLOSUM
DNA substitution matrices DNA is less conserved than protein sequences Less effective to compare coding regions at nucleotide level 11/8/2018 CSE 5290, Fall 2011

140 PAM
Point Accepted Mutation (Dayhoff et al.) 1 PAM = PAM1 = 1% average change of all amino acid positions. After 100 PAMs of evolution, not every residue will have changed: some residues may have mutated several times, some residues may have returned to their original state, and some residues may not have changed at all

141 PAMX
$\mathrm{PAM}_x = \mathrm{PAM}_1^{\,x}$, for example $\mathrm{PAM}_{250} = \mathrm{PAM}_1^{\,250}$. PAM250 is a widely used scoring matrix. (Matrix figure: rows and columns indexed by amino acids Ala (A), Arg (R), Asn (N), Asp (D), Cys (C), Gln (Q), ..., Trp (W), Tyr (Y), Val (V).)

142 BLOSUM Blocks Substitution Matrix
Scores derived from observations of the frequencies of substitutions in blocks of local alignments in related proteins Matrix name indicates evolutionary distance BLOSUM62 was created using sequences sharing no more than 62% identity 11/8/2018 CSE 5290, Fall 2011

143 The Blosum50 Scoring Matrix
11/8/2018 CSE 5290, Fall 2011

144 Local vs. Global Alignment
The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph. The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i’, j’) in the edit graph. 11/8/2018 CSE 5290, Fall 2011

145 Local vs. Global Alignment
The Global Alignment Problem tries to find the longest path between vertices (0,0) and (n,m) in the edit graph. The Local Alignment Problem tries to find the longest path among paths between arbitrary vertices (i,j) and (i’, j’) in the edit graph. In the edit graph with negatively-scored edges, Local Alignment may score higher than Global Alignment

146 Local vs. Global Alignment (cont’d)
Local Alignment—better alignment to find conserved segment --T—-CC-C-AGT—-TATGT-CAGGGGACACG—A-GCATGCAGA-GAC | || | || | | | ||| || | | | | |||| | AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATG—T-CAGAT--C tccCAGTTATGTCAGgggacacgagcatgcagagac |||||||||||| aattgccgccgtcgttttcagCAGTTATGTCAGatc 11/8/2018 CSE 5290, Fall 2011

147 Local Alignment: Example
Compute a “mini” Global Alignment to get Local Local alignment Global alignment 11/8/2018 CSE 5290, Fall 2011

