Dynamic Programming Presenters: Michal Karpinski Eric Hoffstetter.


Background
“Dynamic programming” originates with Richard Bellman (1940s), in his work on multistage decision process problems.
– While at RAND Corp., he wanted his work to appear practical (“real work”) rather than theoretical. To shield himself from scrutiny, Bellman chose the word “programming,” which implies fruitful, deliberate effort, and embellished it with “dynamic.” As he puts it, “it’s impossible to use dynamic in a pejorative sense.”
Applications:
– String alignment and related string problems
– Pattern recognition: image matching / image recognition (2D and 3D), speech recognition (Viterbi algorithm)
– Manufacturing: find the fastest way through a factory
– Order of matrices in a matrix multiplication chain, to minimize cost
– Build an optimal binary search tree, minimizing the number of nodes visited during a search (e.g. a language translator, with the most common words near the root of the tree)

Used to solve problems exhibiting:
– Overlapping subproblems: “they occur as a subproblem of different problems”
– Optimal substructure: “An optimal solution to the problem contains within it optimal solutions to subproblems.”
– Subproblem independence: “the solution to one subproblem does not affect the solution to another subproblem, i.e., they do not share resources”

Top-Down and Bottom-Up
– Top-down: the problem is broken down into subproblems, which are then solved using memoization to remember the solutions to subproblems already solved.
Top-down without memoization (plain recursion):
    function fib(n)
        if n = 0 return 0
        if n = 1 return 1
        else return fib(n − 1) + fib(n − 2)
Top-down with memoization (not memorization):
    var m := map(0 → 0, 1 → 1)
    function fib(n)
        if map m does not contain key n
            m[n] := fib(n − 1) + fib(n − 2)
        return m[n]
– Bottom-up: all subproblems are solved in advance, smallest first, to build solutions to larger problems.
    function fib(n)
        if n = 0 return 0
        var previousFib := 0, currentFib := 1
        repeat n − 1 times
            var newFib := previousFib + currentFib
            previousFib := currentFib
            currentFib := newFib
        return currentFib
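The two strategies above can be sketched in runnable Python. This is a minimal illustration, assuming fib(0) = 0 and fib(1) = 1; the function names are hypothetical, and functools.lru_cache stands in for the hand-written map.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib_top_down(n: int) -> int:
    """Top-down: recurse, but cache each fib(k) the first time it is computed."""
    if n < 2:
        return n  # base cases: fib(0) = 0, fib(1) = 1
    return fib_top_down(n - 1) + fib_top_down(n - 2)


def fib_bottom_up(n: int) -> int:
    """Bottom-up: start from fib(0), fib(1) and build upward, keeping two values."""
    previous, current = 0, 1
    for _ in range(n):
        previous, current = current, previous + current
    return previous  # after n steps, previous holds fib(n)
```

Both versions take a linear number of additions, whereas the plain recursion recomputes the same subproblems exponentially often.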

Biological Sequence Matching Problems 1
DNA
– Two strands
– Four-letter alphabet (four bases)
– Base-pairing rules
– Strands are directional and, within a gene, only one strand is transcribed
RNA
– Functional in its own right, or an intermediate step in protein manufacturing
– Four-letter alphabet
Proteins
– 20-letter alphabet

Biological Sequence Matching Problems 2
Applications
– Identify strains of viruses and bacteria
– Identify genes (hair, skin, and eye color, height) and the genetic basis for diseases (lethal, or conferring susceptibility to cancer, etc.)
– Identify evolutionary relationships
Dynamic programming is the basis of BLAST (Basic Local Alignment Search Tool); the BLAST paper is in the top 3 of most cited papers in recent bioscience history (it was #1 in the 1990s).

Sequence Alignment Algorithm 1
Given two strings x = x_1 x_2 ... x_M and y = y_1 y_2 ... y_N:
    AGGCGGATC
    TAGCATCTAC
find the alignment with maximum score
    F = (# matches) × m − (# mismatches) × s − (# gaps) × d
For example:
    -AGGCGGATC---
    TAG-C--ATCTAC
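As a small illustration of the score F, the sketch below counts matches, mismatches, and gap columns in an already-aligned pair of strings. The function name and the default weights (m = 1, s = 0.5, d = 1) are assumptions for the example; s and d are treated as positive penalties that are subtracted.

```python
def alignment_score(top: str, bottom: str,
                    m: float = 1.0, s: float = 0.5, d: float = 1.0) -> float:
    """Score a gapped alignment: matches*m - mismatches*s - gaps*d."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gaps += 1          # one sequence has a gap in this column
        elif a == b:
            matches += 1
        else:
            mismatches += 1
    return matches * m - mismatches * s - gaps * d


# The example alignment has 6 matches, 0 mismatches, and 7 gap columns.
print(alignment_score("-AGGCGGATC---", "TAG-C--ATCTAC"))  # -1.0
```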

Sequence Alignment Algorithm 2
    AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
    AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
There are more than 2^N possible alignments.

Sequence Alignment Algorithm 3
Note: the score of aligning x_1 ... x_M with y_1 ... y_N is additive.
Say that x_1 ... x_i x_{i+1} ... x_M aligns to y_1 ... y_j y_{j+1} ... y_N. Add the two scores:
    F(x_1 ... x_M, y_1 ... y_N) = F(x_1 ... x_i, y_1 ... y_j) + F(x_{i+1} ... x_M, y_{j+1} ... y_N)

Sequence Alignment Algorithm 4
Original problem
– Align x_1 ... x_M to y_1 ... y_N
Divide into a finite number of subproblems (non-overlapping for efficiency)
– Align x_1 ... x_i to y_1 ... y_j
Subdivide the subproblem and construct the solution from smaller subproblems
Classic problem type for dynamic programming
Let F(i, j) = optimal score of aligning x_1 ... x_i with y_1 ... y_j
F is the “matrix” or “table” or “program.” Hence the term “dynamic programming.”

Sequence Alignment Algorithm 5
Score: F = (# matches) × m − (# mismatches) × s − (# gaps) × d
F(i, j) is calculated with the scoring function s(x_i, y_j) or the gap function g.
Three cases:
1. x_i aligns to y_j (diagonal move; scoring function)
       x_1 ... x_{i-1} x_i
       y_1 ... y_{j-1} y_j
   F(i, j) = F(i − 1, j − 1) + m, if x_i = y_j
   F(i, j) = F(i − 1, j − 1) − s, if not
2. x_i aligns to a gap (horizontal move; gap function)
       x_1 ... x_{i-1} x_i
       y_1 ... y_j     -
   F(i, j) = F(i − 1, j) − d
3. y_j aligns to a gap (vertical move; gap function)
       x_1 ... x_i     -
       y_1 ... y_{j-1} y_j
   F(i, j) = F(i, j − 1) − d

Sequence Alignment Algorithm 6
How do we choose the case for each matrix position?
Assume that the subproblems are solved: F(i, j − 1), F(i − 1, j), and F(i − 1, j − 1) are optimal. Therefore,
    F(i, j) = max { F(i − 1, j − 1) + s(x_i, y_j),
                    F(i − 1, j) − d,
                    F(i, j − 1) − d }
where s(x_i, y_j) = m, if x_i = y_j; −s, if not.

Sequence Alignment Algorithm 7
Set d = 1, m = 1, s = 0.5:
    F(i, j) = max { F(i − 1, j − 1) + s(x_i, y_j),
                    F(i − 1, j) − 1,
                    F(i, j − 1) − 1 }
where s(x_i, y_j) = 1, if x_i = y_j; −0.5, if not.

Needleman-Wunsch Algorithm 1: Finds Global Optimal Alignment
1. Initialization
   a. F(0, 0) = 0
   b. F(0, j) = −j × d
   c. F(i, 0) = −i × d
2. Main iteration: filling in partial alignments
   a. For each i = 1 ... M
        For each j = 1 ... N
          F(i, j) = max { F(i − 1, j − 1) + s(x_i, y_j)   [case 1]
                          F(i − 1, j) − d                 [case 2]
                          F(i, j − 1) − d                 [case 3] }
          Ptr(i, j) = diagonal, if [case 1]
                      horizontal, if [case 2]
                      vertical, if [case 3]
3. Termination
   F(M, N) is the optimal score, and the optimal alignment can be traced back from Ptr(M, N).

Needleman-Wunsch Algorithm 2
Initialization
    F(0, 0) = 0
    F(0, j) = −j × d
    F(i, 0) = −i × d
Iteration
    F(i, j) = max { (1) F(i − 1, j − 1) + s(x_i, y_j),
                    (2) F(i − 1, j) − d,
                    (3) F(i, j − 1) − d }
    Ptr(i, j) = diagonal for (1), horizontal for (2), vertical for (3)
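The initialization, iteration, and traceback steps can be sketched in Python as follows. This is an illustrative sketch rather than a reference implementation: the function name, the move labels, and the default weights (m = 1, s = 0.5, d = 1, with s and d as positive penalties) are assumptions, and ties are broken in favour of the diagonal move.

```python
def needleman_wunsch(x: str, y: str,
                     m: float = 1.0, s: float = 0.5, d: float = 1.0):
    """Global alignment; returns (score, aligned_x, aligned_y)."""
    M, N = len(x), len(y)
    F = [[0.0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # Initialization: a prefix aligned against nothing costs one gap per letter.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "horizontal"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "vertical"

    # Main iteration: each cell takes the best of the three cases.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            candidates = [
                (F[i - 1][j - 1] + sub, "diagonal"),   # x_i aligned to y_j
                (F[i - 1][j] - d, "horizontal"),       # x_i aligned to a gap
                (F[i][j - 1] - d, "vertical"),         # y_j aligned to a gap
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # Termination: trace the pointers back from (M, N) to (0, 0).
    top, bottom = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diagonal":
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif move == "horizontal":
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("AC", "AGC"))  # (1.0, 'A-C', 'AGC')
```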

Smith-Waterman Algorithm 1: Finds Local Optimal Alignment(s)
Ignore poorly aligned regions
1. Initialization
   a. F(0, 0) = 0
   b. F(0, j) = 0
   c. F(i, 0) = 0
2. Main iteration: filling in partial alignments
   a. For each i = 1 ... M
        For each j = 1 ... N
          F(i, j) = max { 0
                          F(i − 1, j − 1) + s(x_i, y_j)   [case 1]
                          F(i − 1, j) − d                 [case 2]
                          F(i, j − 1) − d                 [case 3] }
          Ptr(i, j) = diagonal, if [case 1]
                      horizontal, if [case 2]
                      vertical, if [case 3]
3. Termination
   The optimal score is the maximum of F(i, j) over all cells, not necessarily F(M, N). The optimal local alignment is traced back from that cell via Ptr until a cell with F(i, j) = 0 is reached.

Smith-Waterman Algorithm 2
Initialization
    F(0, 0) = 0
    F(0, j) = 0
    F(i, 0) = 0
Iteration
    F(i, j) = max { 0,
                    (1) F(i − 1, j − 1) + s(x_i, y_j),
                    (2) F(i − 1, j) − d,
                    (3) F(i, j − 1) − d }
    Ptr(i, j) = diagonal for (1), horizontal for (2), vertical for (3)

Smith-Waterman Algorithm 3

Smith-Waterman Algorithm 4
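A minimal Python sketch of Smith-Waterman local alignment follows. The function name, move labels, and default weights (m = 1, s = 0.5, d = 1) are assumptions for illustration. The sketch uses the standard local-alignment termination: it takes the best cell anywhere in the table and traces back until a cell that was clamped to 0.

```python
def smith_waterman(x: str, y: str,
                   m: float = 1.0, s: float = 0.5, d: float = 1.0):
    """Local alignment; returns (score, aligned_x_part, aligned_y_part)."""
    M, N = len(x), len(y)
    F = [[0.0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]  # None marks "stop here"
    best, best_cell = 0.0, (0, 0)

    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            candidates = [
                (0.0, None),                           # start a fresh alignment
                (F[i - 1][j - 1] + sub, "diagonal"),
                (F[i - 1][j] - d, "horizontal"),
                (F[i][j - 1] - d, "vertical"),
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
            if F[i][j] > best:
                best, best_cell = F[i][j], (i, j)

    # Trace back from the best-scoring cell until a stop marker is reached.
    top, bottom = [], []
    i, j = best_cell
    while ptr[i][j] is not None:
        move = ptr[i][j]
        if move == "diagonal":
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif move == "horizontal":
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TTTACGTTT", "GGACGGG"))  # (3.0, 'ACG', 'ACG')
```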

Overlap Detection 1
When searching for matches of a short string in a database of long strings, we don’t want to penalize overhangs.
[Figure: overlap configurations of x_1 ... x_M and y_1 ... y_N]
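One common way to leave overhangs unpenalized is the "free end gaps" variant of global alignment: initialize the first row and column to 0 and take the best score on the last row or last column. The sketch below shows that variant only. It is an assumption about how the idea can be realized, not a transcription of the slides' own recurrence, and the function name and default weights are hypothetical.

```python
def overlap_score(x: str, y: str,
                  m: float = 1.0, s: float = 0.5, d: float = 1.0) -> float:
    """Best alignment score when leading and trailing overhangs are free."""
    M, N = len(x), len(y)
    # Row 0 and column 0 stay at 0: a leading overhang costs nothing.
    F = [[0.0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # A trailing overhang is free too: end anywhere on the last row or column.
    last_column = max(F[i][N] for i in range(M + 1))
    last_row = max(F[M][j] for j in range(N + 1))
    return max(last_column, last_row)


# The suffix "TTT" of x overlaps the prefix "TTT" of y.
print(overlap_score("ACGTTT", "TTTGGA"))  # 3.0
```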

Overlap Detection 2
    F(i, 0) = max { F(i − 1, 0),
                    F(i − 1, m) − T }
    F(i, j) = max { F(i − 1, j − 1) + s(x_i, y_j),
                    F(i − 1, j) − d,
                    F(i, j − 1) − d }

Overlap Detection 3
[Figures: Needleman-Wunsch with Overlap Detection; Smith-Waterman with Overlap Detection]
    F(i, 0) = max { F(i − 1, 0),
                    F(i − 1, m) − T }
    F(i, j) = max { 0,
                    F(i − 1, j − 1) + s(x_i, y_j),
                    F(i − 1, j) − d,
                    F(i, j − 1) − d }

Bounded Dynamic Programming
Initialization: F(i, 0), F(0, j) undefined for i, j > k
Iteration:
    For i = 1 ... M
      For j = max(1, i − k) ... min(N, i + k)
        F(i, j) = max { F(i − 1, j − 1) + s(x_i, y_j),
                        F(i, j − 1) − d, if j > i − k(N),
                        F(i − 1, j) − d, if j < i + k(N) }
Termination: same
[Figure: x_1 ... x_M against y_1 ... y_N, with a band of width k(N) around the diagonal]
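The banded iteration can be sketched in Python as below. The function name and default weights are assumptions; cells outside the band |i − j| ≤ k are held at negative infinity so they can never be chosen, which plays the role of the "if j > i − k" and "if j < i + k" guards. For clarity the sketch still allocates the full table, so only the running time, not the storage, is reduced to O(kM).

```python
import math


def banded_global_score(x: str, y: str, k: int,
                        m: float = 1.0, s: float = 0.5, d: float = 1.0) -> float:
    """Global alignment score restricted to the band |i - j| <= k."""
    M, N = len(x), len(y)
    if abs(M - N) > k:
        raise ValueError("band too narrow: cell (M, N) lies outside it")
    F = [[-math.inf] * (N + 1) for _ in range(M + 1)]
    F[0][0] = 0.0
    # Border cells are defined only while they are still inside the band.
    for i in range(1, min(M, k) + 1):
        F[i][0] = -i * d
    for j in range(1, min(N, k) + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i - 1][j] - d,   # -inf when (i-1, j) is off-band
                          F[i][j - 1] - d)   # -inf when (i, j-1) is off-band
    return F[M][N]


print(banded_global_score("AC", "AGC", k=1))  # 1.0
```

The result equals the unrestricted global score only when an optimal alignment stays inside the band; a narrower band trades accuracy for speed.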

Longest Common Subsequence 1
1. Initialization
   a. F(0, 0) = 0
   b. F(0, j) = 0
   c. F(i, 0) = 0
2. Main iteration
   a. For each i = 1 ... M
        For each j = 1 ... N
          F(i, j) = max { F(i − 1, j − 1) + 1, if x_i = y_j        [case 1]
                          F(i − 1, j), if not (x_i = y_j)         [case 2]
                          F(i, j − 1), if not (x_i = y_j)         [case 3] }
          Ptr(i, j) = diagonal, if [case 1]
                      horizontal, if [case 2]
                      vertical, if [case 3]
3. Termination
   F(M, N) is the optimal score (the length of the longest common subsequence), and the subsequence itself can be traced back from Ptr(M, N).

Longest Common Subsequence 2
Initialization
    F(0, 0) = 0
    F(0, j) = 0
    F(i, 0) = 0
Iteration
    F(i, j) = max { (1) F(i − 1, j − 1) + 1, if x_i = y_j,
                    (2) F(i − 1, j), if not (x_i = y_j),
                    (3) F(i, j − 1), if not (x_i = y_j) }
    Ptr(i, j) = diagonal for (1), horizontal for (2), vertical for (3)

Longest Common Subsequence 3
Cormen: error on page 353. Corrected (to obtain figure 15.6):
    m = length[X]
    n = length[Y]
    for i = 1 to m do c[i,0] = 0
    for j = 0 to n do c[0,j] = 0
    for i = 1 to m
        for j = 1 to n
            if x_i = y_j
                then c[i,j] = c[i-1, j-1] + 1
                     b[i,j] = “↖”
            else if c[i-1, j] > c[i, j-1]
                then c[i,j] = c[i-1, j]
                     b[i,j] = “↑”
            else c[i,j] = c[i, j-1]
                 b[i,j] = “←”
    return c and b
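The table-filling pseudocode above, together with a traceback over b, can be sketched in Python. The function name and the string labels used for the arrows are assumptions; the strict ">" comparison follows the slide, so ties move left rather than up. Different tie-breaking can return a different subsequence of the same length.

```python
def lcs(x: str, y: str) -> str:
    """Return one longest common subsequence of x and y."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]    # c[i][j]: LCS length of prefixes
    b = [[""] * (n + 1) for _ in range(m + 1)]   # b[i][j]: move that produced c[i][j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
                b[i][j] = "diag"
            elif c[i - 1][j] > c[i][j - 1]:
                c[i][j] = c[i - 1][j]
                b[i][j] = "up"
            else:
                c[i][j] = c[i][j - 1]
                b[i][j] = "left"

    # Follow the arrows back from (m, n); each "diag" contributes one letter.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if b[i][j] == "diag":
            out.append(x[i - 1]); i -= 1; j -= 1
        elif b[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


print(len(lcs("ABCBDAB", "BDCABA")))  # 4
```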

Performance
Running time: O(mn), plus O(m + n) for output
Storage: O(mn)
– Possible to eliminate the backpointer matrix for some problems
Improvements
– Overlap detection
– Partitioning: find local alignments to seed a global alignment
– Bounded DP
– Gap opening vs. gap extension
– Biochemically significant scoring function
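To illustrate the storage point: when only the optimal score is needed, and not the alignment itself, each row of F depends only on the row before it, so two rows suffice. The sketch below is one way to do this for the global score; the function name and default weights are assumptions.

```python
def global_score_two_rows(x: str, y: str,
                          m: float = 1.0, s: float = 0.5, d: float = 1.0) -> float:
    """Global alignment score using O(N) storage and no backpointers."""
    N = len(y)
    prev = [-j * d for j in range(N + 1)]        # row i - 1, starting with row 0
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0.0] * N              # F(i, 0) = -i * d
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            curr[j] = max(prev[j - 1] + sub,     # diagonal
                          prev[j] - d,           # x_i aligned to a gap
                          curr[j - 1] - d)       # y_j aligned to a gap
        prev = curr
    return prev[N]


print(global_score_two_rows("AC", "AGC"))  # 1.0
```

The running time is still O(mn); only the storage drops from O(mn) to O(n).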

Sources
Altschul, S.F., et al. Basic Local Alignment Search Tool. J. Molec. Biol. 215(3).
Bellman, Richard. Dynamic Programming. Princeton University Press, Princeton.
Cormen et al. Introduction to Algorithms. MIT Press, Cambridge.
Dreyfus, Stuart. Richard Bellman on the birth of dynamic programming. Operations Research 50.
Durbin et al. Biological Sequence Analysis: Probabilistic models of proteins and nucleic acids. Cambridge University Press, New York.
Gotoh, O. An improved algorithm for matching biological sequences. Journal of Molecular Biology 162.
Gusfield, Dan. Algorithms on Strings, Trees, and Sequences. Cambridge University Press, New York.
Needleman, S.B. and Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48.
Preiss, B.R. Data Structures and Algorithms with Object-Oriented Design Patterns in C#.
Smith, T.F. and Waterman, M.S. Identification of common molecular subsequences. Journal of Molecular Biology 147.
Wikipedia

Sequence Alignment Algorithm X
Given two strings x = x_1 x_2 ... x_M and y = y_1 y_2 ... y_N:
    AGGCTATCACCTGACCTCCAGGCCGATGCCC
    TAGCTATCACGACCGCGGTCGATTTGCCCGAC
find the alignment with maximum score
    F = (# matches) × m − (# mismatches) × s − (# gaps) × d
For example:
    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC