Dynamic Programming and Biological Sequence Comparison Part I.



\course\eleg f\Topic-2a.ppt

Topic II – Biological Sequence Alignment and Database Search
- Part I (Topic-2a): Dynamic programming and sequence comparison
- Part II (Topic-2b): Heuristic and database search (e.g. FAST, BLAST) sequence alignment
- Part III (Topic-2c): Multiple sequence alignment

Outline
- Concept of alignment
- Two algorithm design techniques
- Dynamic programming: examples
- Applying DP to sequence comparison
- The database search problem
- Heuristic algorithms for database search

Alignment
- The two sequences will have the same length (after possible insertion of spaces in either or both of them)
- No space in one sequence can be aligned with a space in the other
- Spaces can be inserted at the beginning or end of the sequences
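The alignment properties listed above can be checked mechanically. The following is a minimal Python sketch, assuming spaces are written as "-"; the function name is_valid_alignment is illustrative and does not come from the slides. It also checks that removing the spaces gives back the original sequences, which the definition implies.

```python
def is_valid_alignment(row_x, row_y, x, y, gap="-"):
    """Check that (row_x, row_y) is an alignment of the sequences x and y."""
    # Both rows must have the same length after spaces are inserted.
    if len(row_x) != len(row_y):
        return False
    # No column may pair a space with a space.
    if any(a == gap and b == gap for a, b in zip(row_x, row_y)):
        return False
    # Removing the spaces must give back the original sequences.
    return row_x.replace(gap, "") == x and row_y.replace(gap, "") == y
```

For example, is_valid_alignment("ATA-AGT", "ATGCAGT", "ATAAGT", "ATGCAGT") holds, while two rows of different lengths are rejected.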

Biological Sequence Alignment and Database Search
1. We have two sequences over the same alphabet, both about the same length (tens of thousands of characters), and the sequences are almost equal. The average frequency of differences is low, say, one every hundred characters. We want to find the places where the differences occur.
2. We have two sequences over the same alphabet with a few hundred characters each. We want to know whether there is a prefix of one which is similar to a suffix of the other.

Biological Sequence Alignment and Database Search (cont'd)
3. We have the same problem as in (2), but now we have several hundred sequences that must be compared (each one against all). In addition, we know that the great majority of sequence pairs are unrelated, that is, they will not have the required degree of similarity.
4. We have two sequences over the same alphabet with a few hundred characters each. We want to know whether there are two substrings, one from each sequence, that are similar.
5. We have the same problem as in (4), but instead of two sequences we have one sequence that must be compared to thousands of others.

Two Related Algorithm Design Techniques

Breaking problems down:
- Divide and Conquer: Starting with the complete instance of a problem, divide it into smaller subinstances, solve each of them recursively, and combine the partial solutions into a solution to the original problem.
- Dynamic Programming: Starting with the smallest subinstances of a problem, solve and combine them until the complete instance of the original problem is solved.

Divide and Conquer – Example: Quick Sort
[Figure: an array shown at successive stages of Quick Sort.]

Divide and Conquer – Example 2: The Fibonacci numbers

$F_1 = 1$, $F_2 = 1$, $F_n = F_{n-1} + F_{n-2}$
1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …

    int Fib(int n) {
        if (n <= 2) return 1;    /* base cases: F_1 = F_2 = 1 */
        return Fib(n-1) + Fib(n-2);
    }

Divide and Conquer – Example 2 (cont.)

$F_1 = 1$, $F_2 = 1$, $F_n = F_{n-1} + F_{n-2}$

[Figure: recursion tree for F(7), in which F(5), F(4), F(3), F(2) and F(1) are recomputed many times. The accompanying table of n against $F_n$ is not recoverable.]

- $F_n / F_{n-1} \approx 1.6$, so $F_n \approx 1.6^n$ for $n \gg 1$
- $T(n) \approx$ #internal_nodes = #leaves - 1, but #leaves = $F_n$
- $T(n) = O(1.6^n)$: exponential time!
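The node-counting argument can be checked empirically. This is a small Python sketch, assuming the base cases $F_1 = F_2 = 1$; the helper name fib_calls is hypothetical. It returns both the Fibonacci number and the number of calls the naive recursion makes (leaves plus internal nodes of the recursion tree).

```python
def fib_calls(n):
    """Return (F_n, number of calls made by the naive recursion), with F_1 = F_2 = 1."""
    if n <= 2:
        return 1, 1                        # a leaf of the recursion tree: one call
    a, calls_a = fib_calls(n - 1)
    b, calls_b = fib_calls(n - 2)
    return a + b, calls_a + calls_b + 1    # +1 for this internal node

for n in (5, 10, 20):
    value, calls = fib_calls(n)
    print(n, value, calls)
```

Since the tree has $F_n$ leaves and $F_n - 1$ internal nodes, the call count grows in step with $F_n$ itself, which is the exponential behaviour noted on the slide.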

How to Compute the Fib Function Using the Dynamic Programming Method?

Dynamic Programming – Example 1

    int Fib(int n) {
        int tab[n + 2];    /* entries 1..n are used; +2 keeps tab[2] valid when n = 1 */
        int j;
        tab[1] = 1;
        tab[2] = 1;
        for (j = 3; j <= n; j++)
            tab[j] = tab[j-1] + tab[j-2];
        return tab[n];
    }

- Start by solving the smallest problems
- Use the partial solutions to solve bigger and bigger problems
- Extra memory (the array tab) stores the intermediate values
- Linear time: $T(n) = O(n)$
- Space-time tradeoff
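As a hedged illustration of the space-time tradeoff, here is a Python sketch of the same bottom-up computation in two forms, assuming $F_1 = F_2 = 1$ and n >= 1. The function names are illustrative. The second variant is not on the slide: it relies on the observation that each entry depends only on the previous two, so the table can be reduced to two variables.

```python
def fib_table(n):
    """Bottom-up Fibonacci with an explicit table, F_1 = F_2 = 1 (n >= 1)."""
    tab = [0] * (max(n, 2) + 1)    # index 0 is unused
    tab[1] = tab[2] = 1
    for j in range(3, n + 1):
        tab[j] = tab[j - 1] + tab[j - 2]
    return tab[n]

def fib_two_vars(n):
    """Same recurrence, keeping only the last two values (constant extra space)."""
    prev, curr = 1, 1              # F_1, F_2
    for _ in range(3, n + 1):
        prev, curr = curr, prev + curr
    return curr
```

Both run in linear time; the first mirrors the tab array on the slide, the second trades the table away once only the final value is needed.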

Sequence Comparison

Molecular sequence data are at the heart of Computational Biology:
- DNA sequences
- RNA sequences
- Protein sequences

We can think of these sequences as strings of letters:
- DNA & RNA: alphabet of 4 letters (A, T, C, G; RNA has U in place of T)
- Protein: alphabet of 20 letters

code  full name
A     alanine
C     cysteine
D     aspartate
E     glutamate
F     phenylalanine
G     glycine
H     histidine
I     isoleucine
K     lysine
L     leucine
M     methionine
N     asparagine
P     proline
Q     glutamine
R     arginine
S     serine
T     threonine
V     valine
W     tryptophan
Y     tyrosine

Sequence Comparison – (Cont.)

Why compare sequences?
- Find similar genes/proteins: allows prediction of function and structure
- Locate common subsequences in genes/proteins: identifies common recurrent patterns
- Locate sequences that might overlap: helps in sequence assembly

Sequence Comparison – (Cont.)

Sequence X = A T A A G T
Sequence Y = A T G C A G T

To compare the sequences we need to quantify the similarity.

matches = 1, mismatches = 0
Score: Total = 2

Sequence Comparison – (Cont.)

Taking positions of the letters into account:

Sequence Y = A T G C A G T
Sequence X =   A T A A G T

matches = 1, mismatches = 0
Score: Total = 3

Sequence Comparison – (Cont.)

How to take possible mutations into account?

Sequence Y = A T G C A G T
Sequence X = A T A - A G T

matches = 1, mismatches = 0, gap = -1
Score: Total = 4
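The scoring scheme used in these examples can be written as a short function. This is a Python sketch, assuming the two rows are already aligned to equal length and that "-" marks a space; the name score_alignment and its keyword parameters are illustrative.

```python
def score_alignment(row_x, row_y, match=1, mismatch=0, gap=-1):
    """Sum the column scores of two equal-length aligned rows; '-' marks a space."""
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total += gap           # a letter aligned with a space
        elif a == b:
            total += match
        else:
            total += mismatch
    return total
```

With the default values, score_alignment("ATA-AGT", "ATGCAGT") reproduces the total of 4. Setting gap=0 and padding the shorter sequence at one end reproduces the two earlier totals, which did not penalize spaces.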

Applying DP to Sequence Comparison

Sequence X = GA
Sequence Y = AG

[Figure: tree enumerating every possible alignment of GA with AG, with the score of each alignment at the leaves.]

- At each branching point, choose the best score, i.e. max(-2, 0, -2), max(-3, 0, -1), max(-1, 0, -3), max(-1, 0, -1)
- Total score = 0
- $T(n,n) = O(k^n)$: exponential time!
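The exhaustive enumeration can be sketched in Python as follows, assuming the scoring used so far (match = 1, mismatch = 0, gap = -1). The names all_alignments and column_score are hypothetical. Each alignment is built column by column from one of three choices: letter against letter, letter of X against a space, or space against letter of Y.

```python
def all_alignments(x, y):
    """Yield every alignment of x and y as a pair of gapped rows."""
    if not x and not y:
        yield "", ""
        return
    if x and y:                        # first letters aligned with each other
        for rx, ry in all_alignments(x[1:], y[1:]):
            yield x[0] + rx, y[0] + ry
    if x:                              # first letter of x against a space
        for rx, ry in all_alignments(x[1:], y):
            yield x[0] + rx, "-" + ry
    if y:                              # space against first letter of y
        for rx, ry in all_alignments(x, y[1:]):
            yield "-" + rx, y[0] + ry

def column_score(a, b, match=1, mismatch=0, gap=-1):
    """Score of a single alignment column."""
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

alignments = list(all_alignments("GA", "AG"))
best = max(sum(column_score(a, b) for a, b in zip(rx, ry)) for rx, ry in alignments)
print(len(alignments), best)
```

The number of alignments grows exponentially with sequence length, which is why this brute-force approach is only usable on toy inputs like GA against AG.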

Applying DP to Sequence Comparison

Sequence X = GA
Sequence Y = AG

[Figure: the same alignments of GA with AG arranged on a DP grid with G A along one axis and A G along the other.]

- $T(n,n) = O(n^2)$: polynomial time!

Questions
- Question: when the DP comparison ends, how many distinct paths have been explored in total for this example?
- Answer: let us count. Total = 13

[Figure: DP grid for G A against A G.]

- Question: from 1 to 9, how many paths?
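The count can be reproduced with a small dynamic program of its own. This Python sketch rests on two assumptions that are not stated in the transcript: that the nodes numbered 1 to 9 are the cells of the 3 x 3 grid for GA against AG, and that a path may step right, down, or diagonally. The name count_paths is illustrative.

```python
def count_paths(m, n):
    """Count paths from cell (0, 0) to cell (m, n) using right, down and diagonal steps."""
    # Along the first row and first column there is exactly one path to each cell.
    paths = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paths[i][j] = paths[i - 1][j] + paths[i - 1][j - 1] + paths[i][j - 1]
    return paths[m][n]
```

Under those assumptions, count_paths(2, 2) gives the total for two sequences of length 2, and each path corresponds to one alignment.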

DP algorithm for Sequence Comparison

    m = length(X)
    n = length(Y)
    int S[0..m, 0..n]
    for i = 0 to m do
        S[i,0] = i · g
    for j = 0 to n do
        S[0,j] = j · g
    for i = 1 to m do
        for j = 1 to n do
            S[i,j] = max( S[i-1,j] + g,
                          S[i-1,j-1] + sb[i,j],
                          S[i,j-1] + g )
    return S[m,n]

sb[i,j] is the substitution matrix score for aligning X[i] with Y[j]; the matrix is indexed by the letters A, T, C, G.

- Start by solving the smallest problems
- Extra memory to store intermediate values
- Use the partial solutions to solve bigger and bigger problems
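A runnable Python sketch of this algorithm follows. It assumes a simple match/mismatch score in place of a full substitution matrix, with the values used in the earlier examples (match = 1, mismatch = 0, g = -1); the function name global_similarity is illustrative. Python strings are 0-indexed, so letter i of X is x[i - 1].

```python
def global_similarity(x, y, match=1, mismatch=0, g=-1):
    """Fill the (m+1) x (n+1) similarity matrix S for global alignment and return it."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        S[i][0] = i * g                    # prefix of x against spaces only
    for j in range(n + 1):
        S[0][j] = j * g                    # prefix of y against spaces only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sb = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j] + g,         # letter of x against a space
                          S[i - 1][j - 1] + sb,    # letter of x against letter of y
                          S[i][j - 1] + g)         # space against letter of y
    return S

S = global_similarity("ATAAGT", "ATGCAGT")
print(S[-1][-1])
```

The bottom-right entry is the similarity of the two full sequences; every other entry is the similarity of a pair of prefixes.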

The Substitution Matrix
- For DNA we usually use identity matrices:

        A  T  C  G
    A   1  0  0  0
    T   0  1  0  0
    C   0  0  1  0
    G   0  0  0  1

- For proteins, more sensitive matrices, derived empirically, are used.

[Figure: protein substitution matrix with rows and columns labelled A B C D E F G H I K L M N P Q R S T V W Y Z; the entries are not recoverable.]
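One simple way to represent a substitution matrix in code is a dictionary keyed by letter pairs. The following Python sketch builds the DNA identity matrix, assuming 1 for a match and 0 for a mismatch as in the earlier examples; the names DNA, IDENTITY and sb are illustrative, and an empirical protein matrix would be loaded into the same structure rather than computed.

```python
DNA = "ATCG"

# Identity matrix: 1 on the diagonal (match), 0 elsewhere (mismatch).
IDENTITY = {(a, b): 1 if a == b else 0 for a in DNA for b in DNA}

def sb(a, b, matrix=IDENTITY):
    """Look up the substitution score for aligning letter a with letter b."""
    return matrix[(a, b)]
```

Passing a different dictionary as matrix changes the scoring without touching the alignment code.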

Sequence Comparison revisited

    m = length(X)
    n = length(Y)
    int S[0..m, 0..n]
    for i = 0 to m do
        S[i,0] = i · g
    for j = 0 to n do
        S[0,j] = j · g
    for i = 1 to m do
        for j = 1 to n do
            S[i,j] = max( S[i-1,j] + g,
                          S[i-1,j-1] + sb[i,j],
                          S[i,j-1] + g )
    return S[m,n]

[Figure: similarity matrix for X = A T A A G T (rows) against Y = A T G C A G T (columns), annotated with the three candidate values computed for selected cells, e.g. max(-1 + (-1), 0 + (+1), -1 + (-1)) for the first cell. The matrix entries are not recoverable.]

What To Do Next?
- Answer: finding alignments
- But how?

Finding the Alignment(s)

[Figure: similarity matrix for X = A T A A G T against Y = A T G C A G T, traced back from the bottom-right cell. At each step the cell value is compared with its three candidates, and the alignment is extended from right to left: T, then G T, then A G T, and so on.]

Global alignments:

    A T G C A G T        A T G C A G T
    A T A - A G T        A T - A A G T
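The traceback can be sketched in Python as below, assuming the same scoring as before (match = 1, mismatch = 0, g = -1). The function name optimal_alignments is illustrative. This version follows every tied predecessor, so it reports all optimal alignments rather than only one; that is a design choice of the sketch, not something prescribed by the slides.

```python
def optimal_alignments(x, y, match=1, mismatch=0, g=-1):
    """Return all optimal global alignments of x and y as (row_x, row_y) pairs."""
    m, n = len(x), len(y)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        S[i][0] = i * g
    for j in range(n + 1):
        S[0][j] = j * g

    def sb(i, j):
        return match if x[i - 1] == y[j - 1] else mismatch

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j] + g, S[i - 1][j - 1] + sb(i, j), S[i][j - 1] + g)

    def walk(i, j):
        """Yield alignments of x[:i] and y[:j] that achieve the score S[i][j]."""
        if i == 0 and j == 0:
            yield "", ""
            return
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sb(i, j):
            for rx, ry in walk(i - 1, j - 1):      # letter against letter
                yield rx + x[i - 1], ry + y[j - 1]
        if i > 0 and S[i][j] == S[i - 1][j] + g:
            for rx, ry in walk(i - 1, j):          # letter of x against a space
                yield rx + x[i - 1], ry + "-"
        if j > 0 and S[i][j] == S[i][j - 1] + g:
            for rx, ry in walk(i, j - 1):          # space against letter of y
                yield rx + "-", ry + y[j - 1]

    return sorted(walk(m, n))

for row_x, row_y in optimal_alignments("ATAAGT", "ATGCAGT"):
    print(row_y)
    print(row_x)
    print()
```

Reporting a single alignment instead only requires stopping at the first branch that matches in walk.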

How to Break a Tie?
- Should one report all?
- Or, report only one?

Advantage of DP Alignment Algorithms
- Build up the solution by determining all similarities between arbitrary prefixes of the two sequences
- Start with the shorter prefixes and use previously computed results to solve for longer prefixes

The Complexity of the DP Alignment Algorithm?
- Finding an optimal alignment (traceback through the completed matrix): $O(m + n)$
- Construction of the similarity matrix: $O(mn)$

Global versus Local Alignments
- A global alignment attempts to match all of one sequence against all of another:

    LGPSTKQFGKGSSSRIWDN
    |      ||||   |  |
    LNQIERSFGKGAIMRLGDA

- A local alignment attempts to match subsequences of the two sequences:

    FGKG
    ||||
    FGKG

How to Compute Local Alignment?

Applying DP to Local Alignment

Similarity matrix computation:

$$a[i,j] = \max \begin{cases} a[i,j-1] + g \\ a[i-1,j-1] + sb(i,j) \\ a[i-1,j] + g \\ 0 \end{cases}$$

$a[i,0] = 0$ for $i = 0 \ldots m$; $a[0,j] = 0$ for $j = 0 \ldots n$

- If the best alignment up to some point has a negative score, it's better to start a new one, rather than extend the old one.
- Don't penalize gaps on left and right ends!
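A Python sketch of this computation is given below. It assumes match/mismatch scores of +1 and -1 and g = -2 as default values; these, and the name local_similarity, are illustrative choices rather than values fixed by this slide. The function returns the best score together with every cell where it occurs, since each such cell is the end of a candidate local alignment.

```python
def local_similarity(x, y, match=1, mismatch=-1, g=-2):
    """Return (best score, list of (i, j) cells where it occurs) for local alignment."""
    m, n = len(x), len(y)
    a = [[0] * (n + 1) for _ in range(m + 1)]      # row 0 and column 0 stay 0
    best, cells = 0, []
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sb = match if x[i - 1] == y[j - 1] else mismatch
            # The extra 0 lets a new alignment start here instead of extending
            # one whose score has dropped below zero.
            a[i][j] = max(a[i][j - 1] + g, a[i - 1][j - 1] + sb, a[i - 1][j] + g, 0)
            if a[i][j] > best:
                best, cells = a[i][j], [(i, j)]
            elif a[i][j] == best and best > 0:
                cells.append((i, j))
    return best, cells
```

Compared with the global version, only two things change: the first row and column are zero, and 0 is added to the max.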

Criteria for Finding a Local Alignment
- Find the entries with maximum values in the similarity matrix
- For each such entry, construct a local alignment
- See next example
- We may also be interested in near-optimal alignments

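Applying DP to Local Alignment

Similarity matrix computation:

$$a[i,j] = \max \begin{cases} a[i,j-1] + g \\ a[i-1,j-1] + sb(i,j) \\ a[i-1,j] + g \\ 0 \end{cases}$$

[Figure: similarity matrix for X = A T A A G T against Y = A T G C A G T under the local recurrence, with the alignments recovered from it, including A T G C A G T over A T - A A G T and A T G C A G T over A T A - A G T. The matrix entries are not recoverable.]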

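Local Alignment using DP

Sequence Y = T G A T G G A G G T
Sequence X = G A T A G G
g = -2

$$a[i,j] = \max \begin{cases} a[i,j-1] + g \\ a[i-1,j-1] + sb(i,j) \\ a[i-1,j] + g \\ 0 \end{cases}$$

[Figure: substitution matrix over A T C G and the similarity matrix for X against Y; the entries are not recoverable. Sample cell computations shown: max(0 + (-2), 0 + (-1), 0 + (-2), 0) and max(0 + (-2), 0 + (+1), 0 + (-2), 0), i.e. sb = +1 for a match and -1 for a mismatch.]

Local alignments:

    T G A T - G G A G G T
      G A T A G G

    T G A T G G A G G T
      G A T A G

    T G A T G G A G G T
      G A T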
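The following Python sketch recovers one local alignment for this example. It assumes sb = +1 for a match and -1 for a mismatch, which is what the sample cell computations suggest, together with g = -2. The name best_local_alignment is hypothetical, and the tie-breaking (first maximum found, diagonal step preferred) is an arbitrary choice of this sketch.

```python
def best_local_alignment(x, y, match=1, mismatch=-1, g=-2):
    """Return (score, row_x, row_y) for one highest-scoring local alignment."""
    m, n = len(x), len(y)
    a = [[0] * (n + 1) for _ in range(m + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sb = match if x[i - 1] == y[j - 1] else mismatch
            a[i][j] = max(a[i][j - 1] + g, a[i - 1][j - 1] + sb, a[i - 1][j] + g, 0)
            if a[i][j] > best:             # strict '>' keeps the first maximum found
                best, bi, bj = a[i][j], i, j
    # Trace back from the maximum until a cell with score 0 is reached.
    row_x, row_y, i, j = "", "", bi, bj
    while i > 0 and j > 0 and a[i][j] > 0:
        sb = match if x[i - 1] == y[j - 1] else mismatch
        if a[i][j] == a[i - 1][j - 1] + sb:        # letter against letter
            row_x, row_y, i, j = x[i - 1] + row_x, y[j - 1] + row_y, i - 1, j - 1
        elif a[i][j] == a[i - 1][j] + g:           # letter of x against a space
            row_x, row_y, i = x[i - 1] + row_x, "-" + row_y, i - 1
        else:                                      # space against letter of y
            row_x, row_y, j = "-" + row_x, y[j - 1] + row_y, j - 1
    return best, row_x, row_y

print(best_local_alignment("GATAGG", "TGATGGAGGT"))
```

Starting the traceback from each of the other maximal cells, instead of only the first, would yield the remaining alignments with the same score.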

How to Break a Tie?
- Should one report all?
- Or, report only one?

Extension to the Basic DP Method
- Improving space complexity
- Introducing general gap functions: a run of k consecutive spaces is more likely than k isolated spaces, so it should be penalized less
- Affine gap functions: $w(k) = h + gk$
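To illustrate the effect of an affine gap function, here is a Python sketch that scores an already aligned pair of rows, charging each run of k spaces $w(k) = h + gk$. It assumes h and g are negative scores, in line with g = -1 and g = -2 in the earlier examples; the value h = -2 and the name score_with_affine_gaps are hypothetical. It only scores a given alignment and does not search for the optimal one.

```python
def score_with_affine_gaps(row_x, row_y, match=1, mismatch=0, h=-2, g=-1):
    """Score two aligned rows, charging each run of k spaces w(k) = h + g*k."""
    total = 0
    prev_gap_row = None                # which row held a space in the previous column
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            gap_row = "x" if a == "-" else "y"
            if gap_row != prev_gap_row:
                total += h             # opening a new run of spaces
            total += g                 # every space in the run costs g
            prev_gap_row = gap_row
        else:
            total += match if a == b else mismatch
            prev_gap_row = None
    return total
```

With these values, two adjacent spaces cost h + 2g = -4, while two isolated spaces cost 2(h + g) = -6, so alignments that group their spaces are favoured.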