CS 5263 Bioinformatics Lecture 3: Dynamic Programming and Global Sequence Alignment.

Evolution at the DNA level …ACGGTGCAGTCACCA… …ACGTTGC-GTCCACCA… DNA evolutionary events (sequence edits): mutation, deletion, insertion

Sequence conservation implies function [Figure: a sequence passed on to the next generation; panel labels: "OK", "X", "X", "Still OK?"]

Why sequence alignment? Conserved regions are more likely to be functional –Can be used for finding genes, regulatory elements, etc. Similar sequences often have similar origin and function –Can be used to predict functions for new genes / proteins Sequence alignment is one of the most widely used computational tools in biology

Global Sequence Alignment
Definition: An alignment of two strings S, T is a pair of strings S', T' (with spaces) s.t. (1) |S'| = |T'| (where |S| is the length of S), and (2) removing all spaces in S', T' leaves S, T
S:  AGGCTATCACCTGACCTCCAGGCCGATGCCC
T:  TAGCTATCACGACCGCGGTCGATTTGCCCGAC
S': -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
T': TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

What is a good alignment? Alignment: the "best" way to match the letters of one sequence with those of the other. How do we define "best"?

The score of aligning (characters or spaces) x & y is σ(x, y). Score of an alignment: $\sum_{i=1}^{|S'|} \sigma(S'[i], T'[i])$. An optimal alignment: one with max score
S': -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
T': TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Scoring Function
Sequence edits, starting from AGGCCTC:
–Mutations: AGGACTC
–Insertions: AGGGCCTC
–Deletions: AGG-CTC
Scoring Function: Match: +m; Mismatch: -s; Gap (indel): -d
~~~AAC~~~
~~~A-A~~~

Match = 2, mismatch = -1, gap = -1 Score = 3 x 2 – 2 x 1 – 1 x 1 = 3
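
A minimal sketch of scoring a given alignment column by column under this scheme. The function name and the two aligned strings are made up for illustration (3 matches, 2 mismatches, 1 gap); they are not the alignment shown on the slide.

```python
def score_alignment(s_aligned, t_aligned, match=2, mismatch=-1, gap=-1):
    """Sum the per-column scores of two aligned strings ('-' is a space)."""
    if len(s_aligned) != len(t_aligned):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for a, b in zip(s_aligned, t_aligned):
        if a == "-" or b == "-":
            total += gap        # a character aligned to a space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


# Hypothetical alignment: 3 matches, 2 mismatches, 1 gap.
print(score_alignment("ACGTTA", "ACCTG-"))  # 3 * 2 - 2 * 1 - 1 * 1 = 3
```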

More complex scoring function Substitution matrix –Similarity score of matching two letters a, b should reflect the probability of a, b derived from same ancestor –It is usually defined by log likelihood ratio (Durbin book) –Active research area. Especially for proteins. –Commonly used: PAM, BLOSUM

An example substitution matrix ACGT A3-2-2 C3 G3-2 T3

How to find an optimal alignment? A naïve algorithm:
for all subseqs A of S, B of T s.t. |A| = |B| do
    align A[i] with B[i], 1 ≤ i ≤ |A|
    align all other chars to spaces
    compute its value
    retain the max
end
output the retained alignment
Example: S = abcd, A = cd; T = wxyz, B = xz. Two of the resulting alignments:
-abc-d        a-bc-d
w--xyz        -w-xyz
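
A rough sketch of exhaustive search, usable only for very short strings. Note that it enumerates alignments column by column rather than by picking subsequences as on the slide, so it also distinguishes different orderings of adjacent gaps. All names and the default scores (match 1, mismatch -1, gap -1) are assumptions for illustration.

```python
def all_alignments(s, t):
    """Yield every alignment (s', t') of s and t; '-' is a space.

    Columns pairing a space with a space are never produced.
    """
    if not s and not t:
        yield "", ""
        return
    if s and t:  # align s[0] with t[0]
        for a, b in all_alignments(s[1:], t[1:]):
            yield s[0] + a, t[0] + b
    if s:        # align s[0] with a space
        for a, b in all_alignments(s[1:], t):
            yield s[0] + a, "-" + b
    if t:        # align t[0] with a space
        for a, b in all_alignments(s, t[1:]):
            yield "-" + a, t[0] + b


def column_score(a, b, match=1, mismatch=-1, gap=-1):
    """Score one alignment column."""
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def naive_best_alignment(s, t):
    """Return (score, s', t') for a max-scoring alignment by exhaustive search."""
    best = None
    for a, b in all_alignments(s, t):
        score = sum(column_score(p, q) for p, q in zip(a, b))
        if best is None or score > best[0]:
            best = (score, a, b)
    return best


print(naive_best_alignment("AGTA", "ATA"))
```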

Analysis: Assume |S| = |T| = n. Cost of evaluating one alignment: ≥ n. How many alignments are there: –pick n chars of S, T together –say k of them are in S –match these k to the k unpicked chars of T. This gives $\binom{2n}{n}$ alignments. Total time: $\ge n \binom{2n}{n} > 2^{2n}$ for $n > 3$. E.g., for n = 20, time is > $2^{40}$ > $10^{12}$ operations
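
A quick numeric check of this count, as a sketch. It assumes the counting argument above (choose n of the 2n characters, k of them from S), which gives C(2n, n) alignments; the function name is made up.

```python
from math import comb


def naive_cost_lower_bound(n):
    """n operations per alignment times C(2n, n) alignments."""
    return n * comb(2 * n, n)


n = 20
# Summing over k, the number of picked characters that come from S.
by_cases = sum(comb(n, k) * comb(n, n - k) for k in range(n + 1))
print(by_cases == comb(2 * n, n))                    # the two counts agree
print(naive_cost_lower_bound(n) > 2**40 > 10**12)    # exceeds 2^40 and 10^12
```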

Intro to Dynamic Programming

Dynamic programming: What is dynamic programming? –A method for solving problems exhibiting the properties of overlapping subproblems and optimal substructure –Key idea: tabulating sub-problem solutions rather than re-computing them repeatedly. Two simple examples: –Computing Fibonacci numbers –Finding the shortest path in a grid

Example 1: Fibonacci numbers 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, … F(0) = 1; F(1) = 1; F(n) = F(n-1) + F(n-2) How to compute F(n)?

A recursive algorithm
function fib(n)
    if (n == 0 or n == 1) return 1;
    else return fib(n-1) + fib(n-2);
Recursion tree: F(9) calls F(8) and F(7); F(8) calls F(7) and F(6); F(7) calls F(6) and F(5); and so on

Time complexity: –Between $2^{n/2}$ and $2^n$ –$O(1.62^n)$, i.e. exponential. Why is the recursive Fib algorithm inefficient? –Overlapping subproblems

An iterative algorithm
function fib(n)
    F[0] = 1; F[1] = 1;
    for i = 2 to n
        F[i] = F[i-1] + F[i-2];
    return F[n];
Time complexity: O(n), space: O(n)
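
A sketch of both versions side by side, using the slides' convention F(0) = F(1) = 1. The call counter is an addition for illustration; it shows how many calls the recursive version makes because of overlapping subproblems.

```python
def fib_recursive(n, counter):
    """Recursive version; counter[0] records the number of calls made."""
    counter[0] += 1
    if n == 0 or n == 1:
        return 1
    return fib_recursive(n - 1, counter) + fib_recursive(n - 2, counter)


def fib_iterative(n):
    """Tabulated version: each F[i] is computed exactly once."""
    F = [1] * (n + 1)
    for i in range(2, n + 1):
        F[i] = F[i - 1] + F[i - 2]
    return F[n]


calls = [0]
print(fib_recursive(9, calls), calls[0])  # 55, reached with 109 calls
print(fib_iterative(9))                   # 55, with a single pass over the table
```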

Example 2: shortest path in a grid [Figure: an m × n grid with S at the upper-left corner and G at the lower-right corner.] Each edge has a length (cost). We need to get to G from S. Can only move right or down. Aim: find a path with the minimum total length

Optimal substructures: Naïve algorithm: enumerate all possible paths and compare costs –Exponential number of paths. Key observation: –If a path P(S, G) is the shortest from S to G, any of its sub-paths P(S, x), where x is on P(S, G), is the shortest from S to x

Proof: Proof by contradiction –If the path P(S, x) is not the shortest, there is a path P'(S, x) with P'(S, x) < P(S, x) –Construct a new path P'(S, G) = P'(S, x) + P(x, G) –Then P'(S, G) < P(S, G), so P(S, G) is not the shortest –Contradiction –Therefore, P(S, x) is the shortest

Recursive solution: Index each intersection by two indices, (i, j). Let F(i, j) be the total length of the shortest path from (0, 0) to (i, j). Therefore, F(m, n) is the length of the shortest path we wanted. To compute F(m, n), we need to compute both F(m-1, n) and F(m, n-1):
$F(m, n) = \min\{\, F(m-1, n) + \mathrm{length}((m-1, n), (m, n)),\ F(m, n-1) + \mathrm{length}((m, n-1), (m, n)) \,\}$

Recursive solution: But if we use recursive calls, many subpaths will be recomputed many times. Strategy: pre-compute F values starting from the upper-left corner. Fill in row by row (what other order will also do?):
$F(i, j) = \min\{\, F(i-1, j) + \mathrm{length}((i-1, j), (i, j)),\ F(i, j-1) + \mathrm{length}((i, j-1), (i, j)) \,\}$

Dynamic programming illustration [Figure: the grid from S to G, filled in cell by cell.] $F(i, j) = \min\{\, F(i-1, j) + \mathrm{length}((i-1, j), (i, j)),\ F(i, j-1) + \mathrm{length}((i, j-1), (i, j)) \,\}$
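
A sketch of the grid recurrence filled in row by row, with a trace-back of the path. The edge-length layout is an assumption made for this example: `down[i][j]` is the length of the edge from (i, j) to (i+1, j), and `right[i][j]` is the length of the edge from (i, j) to (i, j+1). The numbers are made up.

```python
def shortest_grid_path(down, right):
    """Return (length, path) of the shortest right/down path from (0,0) to (m,n)."""
    m = len(down)        # number of downward steps needed
    n = len(right[0])    # number of rightward steps needed
    INF = float("inf")
    F = [[0] * (n + 1) for _ in range(m + 1)]
    came_from = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            from_up = F[i - 1][j] + down[i - 1][j] if i > 0 else INF
            from_left = F[i][j - 1] + right[i][j - 1] if j > 0 else INF
            if from_up <= from_left:
                F[i][j], came_from[i][j] = from_up, (i - 1, j)
            else:
                F[i][j], came_from[i][j] = from_left, (i, j - 1)
    # Trace back from (m, n) to (0, 0) using the stored predecessors.
    path = [(m, n)]
    while path[-1] != (0, 0):
        i, j = path[-1]
        path.append(came_from[i][j])
    path.reverse()
    return F[m][n], path


# Hypothetical 2 x 2 grid (3 x 3 intersections).
down = [[1, 4, 2],
        [3, 1, 5]]
right = [[2, 3],
         [1, 2],
         [4, 1]]
print(shortest_grid_path(down, right))
```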

Trace-back

Elements of dynamic programming: Optimal sub-structures –Optimal solutions to the original problem contain optimal solutions to sub-problems. Overlapping sub-problems –Some sub-problems appear in many solutions. Memoization and reuse –Carefully choose the order in which sub-problems are solved

Dynamic Programming for sequence alignment: Suppose we wish to align $x_1 \dots x_M$ and $y_1 \dots y_N$. Let F(i, j) = optimal score of aligning $x_1 \dots x_i$ and $y_1 \dots y_j$. Scoring Function: Match: +m; Mismatch: -s; Gap (indel): -d

Optimal substructure: If x[i] is aligned to y[j] in the optimal alignment between x[1..M] and y[1..N], then the alignment between x[1..i] and y[1..j] is also optimal. Easy to prove by contradiction

Recursive formula: Notice three possible cases:
1. $x_M$ aligns to $y_N$: $F(M, N) = F(M-1, N-1) + m$ if $x_M = y_N$, and $F(M, N) = F(M-1, N-1) - s$ if not
2. $x_M$ aligns to a gap: $F(M, N) = F(M-1, N) - d$
3. $y_N$ aligns to a gap: $F(M, N) = F(M, N-1) - d$

Recursive formula: Generalize:
$F(i, j) = \max\{\, F(i-1, j-1) + \sigma(x_i, y_j),\ F(i-1, j) - d,\ F(i, j-1) - d \,\}$, where $\sigma(x_i, y_j) = m$ if $x_i = y_j$, and $-s$ otherwise
Boundary conditions: –F(0, 0) = 0 –F(0, j) = -jd: y[1..j] aligned to gaps –F(i, 0) = -id: x[1..i] aligned to gaps

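What order to fill? F(i, j) depends on F(i-1, j-1), F(i-1, j), and F(i, j-1), so fill the table from F(0,0) to F(M,N) in any order that computes those three cells first, e.g., row by row or column by column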

Example: x = AGTA, y = ATA; m = 1, s = 1, d = 1. Initialize row j = 0 and column i = 0 with the gap penalties, then fill in the remaining cells:

F(i,j)     i = 0    1    2    3    4
                    A    G    T    A
j = 0          0   -1   -2   -3   -4
j = 1   A     -1    1    0   -1   -2
j = 2   T     -2    0    0    1    0
j = 3   A     -3   -1   -1    0    2

Optimal alignment score: F(4,3) = 2. This only tells us the best score

Trace-back: x = AGTA, y = ATA; m = 1, s = 1, d = 1. Starting from F(4,3), find at each cell which term of $F(i, j) = \max\{\, F(i-1, j-1) + \sigma(x_i, y_j),\ F(i-1, j) - d,\ F(i, j-1) - d \,\}$ produced its value, and move to that cell:
–F(4,3) = F(3,2) + σ(A, A): align A with A; partial alignment A over A
–F(3,2) = F(2,1) + σ(T, T): align T with T; partial alignment TA over TA
–F(2,1) = F(1,1) – d: align G with a gap; partial alignment GTA over -TA
–F(1,1) = F(0,0) + σ(A, A): align A with A; alignment AGTA over A-TA
Optimal alignment, F(4,3) = 2:
AGTA
A-TA

Using trace-back pointers: x = AGTA, y = ATA; m = 1, s = 1, d = 1. [Figure: the same F(i,j) table filled in cell by cell, with a pointer stored in each cell to the neighboring cell that produced its value.] Following the pointers back from F(4,3) gives the optimal alignment, F(4,3) = 2:
AGTA
A-TA

The Needleman-Wunsch Algorithm
1. Initialization.
   a. F(0, 0) = 0
   b. F(0, j) = -j × d
   c. F(i, 0) = -i × d
2. Main Iteration. Filling in scores
   a. For each i = 1……M
        For each j = 1……N
          F(i, j) = max { F(i-1, j) – d [case 1], F(i, j-1) – d [case 2], F(i-1, j-1) + σ(x_i, y_j) [case 3] }
          Ptr(i, j) = UP if [case 1], LEFT if [case 2], DIAG if [case 3]
3. Termination. F(M, N) is the optimal score, and from Ptr(M, N) we can trace back the optimal alignment
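
A compact sketch of the three steps above. The function name and the tie-breaking order (DIAG, then UP, then LEFT) are assumptions made for this example; with ties, other optimal alignments may exist. Here i indexes x along the rows of the table.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Return (optimal score, aligned x, aligned y) for global alignment."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    # 1. Initialization: a prefix aligned entirely to gaps.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"
    # 2. Main iteration: fill in scores and pointers row by row.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            candidates = [
                (F[i - 1][j - 1] + sigma, "DIAG"),  # x_i aligned to y_j
                (F[i - 1][j] - d, "UP"),            # x_i aligned to a gap
                (F[i][j - 1] - d, "LEFT"),          # y_j aligned to a gap
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # 3. Termination: trace back from (M, N) to (0, 0).
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        if ptr[i][j] == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif ptr[i][j] == "UP":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("AGTA", "ATA"))  # (2, 'AGTA', 'A-TA')
```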

Performance Time: O(NM) Space: O(NM) Later we will cover more efficient methods

Equivalent graph problem [Figure: a grid graph from (0,0) to (3,4), with the letters of S1 and S2 labeling the two axes.] Number of steps: length of the alignment. Path length: alignment score. Optimal alignment: find the longest path from (0, 0) to (3, 4). The general longest-path problem cannot be solved with DP; the longest path on this graph can be found by DP since no cycle is possible. Edge types: a gap in the 2nd sequence, a gap in the 1st sequence, or a match / mismatch (diagonal). Value on a vertical/horizontal edge: -d. Value on a diagonal edge: m or -s

Question If we change the scoring scheme, will the optimal alignment be changed? –Old: Match = 1, mismatch = gap = -1 –New: match = 2, mismatch = gap = 0 –New: Match = 2, mismatch = gap = -2?

Question: What kind of alignment is represented by these paths? [Figure: five paths through the grid for A vs. BC, corresponding to the alignments below.]
A-     A--    --A    -A-    -A
BC     -BC    BC-    B-C    BC
Alternating gaps are impossible if –s > -2d

A variant of the basic algorithm. Scoring scheme: m = s = d = 1
Alignment 1 (score = -7):
Seq1: CAGCA-CTTGGATTCTCGG
         || |:|||
Seq2: ---CAGCGTGG
Alignment 2 (score = -2):
Seq1: CAGCACTTGGATTCTCGG
      ||||     | |    ||
Seq2: CAGC-----G-T----GG
The first alignment may be biologically more realistic

A variant of the basic algorithm: Maybe it is OK to have an unlimited # of gaps in the beginning and end:
          CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG
Then, we don't want to penalize gaps in the ends

The Overlap Detection variant. Changes:
1. Initialization: for all i, j, F(i, 0) = 0 and F(0, j) = 0
2. Termination: $F_{OPT} = \max\{\, \max_i F(i, N),\ \max_j F(M, j) \,\}$
[Figure: sequences $x_1 \dots x_M$ and $y_1 \dots y_N$ overlapping at their ends]
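
A sketch of the two changes (zero initialization, and termination over the last row and last column), returning only the score. The function name and test strings are made up for illustration.

```python
def overlap_score(x, y, m=1, s=1, d=1):
    """Best alignment score when gaps at the ends of x and y are free."""
    M, N = len(x), len(y)
    # Change 1: F(i, 0) = F(0, j) = 0, so leading gaps cost nothing.
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # Change 2: take the best cell in the last column or the last row,
    # so trailing gaps cost nothing either.
    best_last_column = max(F[i][N] for i in range(M + 1))
    best_last_row = max(F[M][j] for j in range(N + 1))
    return max(best_last_column, best_last_row)


# The suffix CGT of the first string overlaps the prefix CGT of the second.
print(overlap_score("AAACGT", "CGTTTT"))  # 3
```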

Different types of overlaps [Figure: the ways two sequences x and y can overlap]

A non-bio variant Shell command diff: Compare two text files –Given file1 and file2 –Find the difference between file1 and file2 –Similar to sequence alignment –How to score? Longest common subsequence (LCS) Match has score 1 No mismatch penalty No gap penalty

File1 A B C D E F File2 G B C E F

File1: A B C D E F
File2: G B C - E F
$ diff file1 file2
1c1
< A
---
> G
4c4
< D
---
> -
LCS = 4

The LCS variant. Changes:
1. Initialization: for all i, j, F(i, 0) = F(0, j) = 0
2. Filling in table: $F(i, j) = \max\{\, F(i-1, j),\ F(i, j-1),\ F(i-1, j-1) + \sigma(x_i, y_j) \,\}$, where $\sigma(x_i, y_j) = 1$ if $x_i = y_j$ and 0 otherwise
3. Termination: $F_{OPT} = \max\{\, \max_i F(i, N),\ \max_j F(M, j) \,\}$
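
A sketch of the LCS variant applied to the two files from the diff example, treating each line as one symbol. The function name is an assumption. Because nothing is penalized, F never decreases along a row or column, so the termination maximum equals F(M, N).

```python
def lcs_length(x, y):
    """Length of a longest common subsequence of sequences x and y."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]   # F(i, 0) = F(0, j) = 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = 1 if x[i - 1] == y[j - 1] else 0
            F[i][j] = max(F[i - 1][j], F[i][j - 1], F[i - 1][j - 1] + sigma)
    # Termination as on the slide: best cell in the last column or last row.
    return max(max(F[i][N] for i in range(M + 1)),
               max(F[M][j] for j in range(N + 1)))


file1 = ["A", "B", "C", "D", "E", "F"]
file2 = ["G", "B", "C", "E", "F"]
print(lcs_length(file1, file2))  # 4 (B, C, E, F)
```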

More efficient algorithms: What happens if you have 1 million lines of text in each file? The O(mn) algorithm is too inefficient, especially in memory: –the matrix has $10^6 \times 10^6 = 10^{12}$ cells, i.e., 1 TB of memory at one byte per cell. Bounded DP –maybe the majority of the two files are the same? (e.g., two versions of a piece of software). Linear-space algorithm –same time complexity
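
A sketch of the simplest memory saving: keeping only two rows of the table. This recovers the optimal score only, not the alignment itself, so it is not the full linear-space algorithm referred to above; it just illustrates why the score needs O(min(M, N)) space. The function name is made up.

```python
def nw_score_two_rows(x, y, m=1, s=1, d=1):
    """Global alignment score using two rows of the DP table."""
    if len(y) > len(x):
        x, y = y, x                  # scoring is symmetric; keep rows short
    prev = [-j * d for j in range(len(y) + 1)]       # row i = 0
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)               # F(i, 0) = -i * d
        for j in range(1, len(y) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            curr[j] = max(prev[j - 1] + sigma,       # diagonal
                          prev[j] - d,               # from the row above
                          curr[j - 1] - d)           # from the left
        prev = curr
    return prev[len(y)]


print(nw_score_two_rows("AGTA", "ATA"))  # 2
```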