Page 1 15-853: Algorithms in the Real World
Computational Biology I – Introduction – LCS, Edit Distance

Page 2 DNA
DNA: a sequence of base pairs (bp) drawn from {A, C, T, G}.
The human genome has about 3 x 10^9 bp, divided into 46 chromosomes with between 5 x 10^7 and 25 x 10^7 bp each. Each chromosome is a sequence of base pairs.
DNA is used to generate proteins: DNA → mRNA (transcription) → protein (translation).

Page 3 Proteins
Proteins: sequences of amino acids {gly, trp, cys, …} (20 of them).
Each DNA bp triple (a "codon") encodes one amino acid. Since there are 4^3 = 64 possible codons, this is a many-to-one mapping. Some codons have special meanings, e.g. acting as an "EOF" (stop) marker.
Chromosomes are partitioned into genes, each of which codes for a protein. Some regions of the chromosome do not code for anything (intergene DNA).
[Figure: a chromosome segment containing gene 1, gene 2, and gene 3 separated by intergene DNA.]

Page 4 Form and Function
The amino acid sequence determines the protein's 3D structure. The structure is also affected by the environment.
–The primary structure refers to the amino acid sequence.
–The secondary structure refers to the general configuration into alpha helices and beta sheets.
–The tertiary structure refers to the full 3D structure.
A protein's 3D structure determines its function.

Page 5 Some Goals in Molecular Biology
1. Extract and compare genome sequences for various organisms.
2. Determine what proteins they code for.
3. Determine the structure and purpose of the coded proteins.
Goals 2 and 3 can often be aided by matching genome or protein sequences against previously studied sequences.
Uses:
–study and cure genetic diseases
–design drugs
–study evolution
–understand cellular processes

Page 6 Example of MS
Multiple sclerosis is a disease in which the immune system attacks the myelin sheaths of nerve cells.
Conjecture: the immune system's T-cells incorrectly identify the myelin sheaths as a virus or bacterium from an earlier infection.
This was tested by comparing the protein sequences of myelin sheaths with a database of viral and bacterial proteins. The ones that matched were tested in the laboratory to see if they were attacked by the T-cells. Some were.

Page 7 Approximate Matching and Sequence Alignment
How similar are actagtctac and cgacgtcgata?
Allow for:
–mutations: x to y
–insertions: _ to y
–deletions: x to _
e.g.
  acta_gtc__tac
   | | |||  ||
  _cgacgtcgata_
1 mutation, 3 insertions, 2 deletions (insertions and deletions are called "indels")

Page 8 Applications of matching and alignment
Used in many ways in computational biology:
–Sequencing (finding the sequence of DNA or proteins for a particular individual or species)
–Physical mapping (locating unique tags in DNA)
–Database searches (does this DNA match any other DNA?)
–Evolutionary trees (how similar are two species?)
Before talking about the general matching problem in computational biology, we will talk about a closely related but simpler problem: longest common subsequence (LCS).

Page 9 Longest Common Subsequence
Subsequence: any subset of the elements of a sequence that maintains their relative order.
e.g. A = a1 a2 a3 a4 a5 a6
  A' = a2 a4 a5 (a subsequence of A)
  A' = a2 a1 a5 (not a subsequence of A)
Longest Common Subsequence (LCS): LCS(A,B) = C, where C is a subsequence of A, C is a subsequence of B, and |C| is maximized.
e.g. A = a b a c d a c
  B = c a d c d d c
  C = a c d c, |C| = 4
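To make the definition concrete, here is a small C helper (a sketch, not from the slides; the name is_subsequence is mine) that tests whether c is a subsequence of a by scanning a left to right:

  #include <stdbool.h>
  #include <string.h>

  /* Returns true if c occurs in a as a (not necessarily contiguous)
     subsequence, i.e. the characters of c appear in a in the same order. */
  bool is_subsequence(const char *c, const char *a) {
      size_t i = 0, n = strlen(c);
      for (; *a != '\0' && i < n; a++)
          if (*a == c[i]) i++;      /* matched the next character of c */
      return i == n;                /* matched all of c */
  }

With the example above, is_subsequence("acdc", "abacdac") and is_subsequence("acdc", "cadcddc") both return true, confirming that C is a common subsequence of A and B.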

Page 10 Minimum Edit Distance
Minimum edit distance: D(A,B) = minimum number of insertions and deletions needed to change A into B.
e.g. A = a b a c d a c
  B = c a d c d d c
Claim: D(A,B) = |A| + |B| - 2|LCS(A,B)|
(For the example above: D(A,B) = 7 + 7 - 2·4 = 6.)
Proof outline: let C be some common subsequence of A and B. Deleting the characters of A that are not in C and inserting the characters of B that are not in C transforms A into B, so
  #deletions = |A| - |C|
  #insertions = |B| - |C|
Conversely, in any sequence of insertions and deletions that transforms A into B, the characters of A that are never deleted form a common subsequence, so the reduction works both ways, hence equality.

Page 11 Applications
Unix diff: find the minimum edit distance and print the edits.
–GNU diff is based on an algorithm by Eugene Myers, who was VP of Informatics at Celera.
–Used heavily in version control systems to store only the edits that have been made to the previous version.
Computational biology: this uses generalizations of LCS, but the algorithms work with slight modifications.

Page 12 Outline
We will work our way up to the GNU diff algorithm:
–Recursive solution
–Memoized solution
–Dynamic programming
–Memory-efficient solution
–Myers/Ukkonen algorithm

Page 13 Recursive Solution
Note that this just returns the edit distance, but it is easy to extend it to return the edits. We can work from the start or the end of the strings (here, from the end).
Why does it work?
  D(A, empty) = |A|   (delete all of A)
  D(empty, B) = |B|   (insert all of B)
  D(A:x, B:x) = D(A, B)   (last characters match)
  D(A:a, B:b) = min(1 + D(A:a, B), 1 + D(A, B:b))   for a ≠ b   (insert b, or delete a)

Page 14 Recursive Solution
Convert to use indices (A[1..n], B[1..m]):
  int ED(int i, int j) {
    if (i == 0) r = j;
    else if (j == 0) r = i;
    else if (A[i] == B[j]) r = ED(i-1, j-1);
    else r = min(1 + ED(i-1, j), 1 + ED(i, j-1));
    return r;
  }
  ED(n, m);
What is the running time?

Page 15 Memoized Solution
First, store each result in a table M (the table is not consulted yet; that comes on the next slide):
  int ED(int i, int j) {
    if (i == 0) r = j;
    else if (j == 0) r = i;
    else if (A[i] == B[j]) r = ED(i-1, j-1);
    else r = min(1 + ED(i-1, j), 1 + ED(i, j-1));
    return M[i,j] = r;
  }
  M[0..n, 0..m] = -1;
  ED(n, m);

Page 16 Memoized Solution
Now return a previously computed result instead of recomputing it:
  int ED(int i, int j) {
    if (M[i,j] != -1) return M[i,j];
    if (i == 0) r = j;
    else if (j == 0) r = i;
    else if (A[i] == B[j]) r = ED(i-1, j-1);
    else r = min(1 + ED(i-1, j), 1 + ED(i, j-1));
    return M[i,j] = r;
  }
  M[0..n, 0..m] = -1;
  ED(n, m);

Page 17 Dynamic programming
Fill the table bottom-up instead of recursing:
  for i = 0 to n M[i,0] = i;
  for j = 0 to m M[0,j] = j;
  for i = 1 to n
    for j = 1 to m
      if (A[i] == B[j]) M[i,j] = M[i-1,j-1];
      else M[i,j] = 1 + min(M[i-1,j], M[i,j-1]);

Page 18 Example
Note: the table can be filled in any order, as long as the cells to the left and above are already filled. We can follow a path back through the matrix to construct the edits.
[Figure: the filled-in matrix for the strings atcacac and tcat, with the three kinds of moves labeled insert, delete, and common (match).]
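A minimal runnable C version of the dynamic program on page 17 (a sketch; the function name edit_distance and the use of 0-based C strings are my choices, not from the slides):

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Insert/delete edit distance via the table M from page 17.
     a has length n, b has length m; M is (n+1) x (m+1). */
  int edit_distance(const char *a, const char *b) {
      int n = strlen(a), m = strlen(b);
      int *M = malloc((n + 1) * (m + 1) * sizeof(int));
      #define AT(i, j) M[(i) * (m + 1) + (j)]
      for (int i = 0; i <= n; i++) AT(i, 0) = i;   /* delete all of a[1..i] */
      for (int j = 0; j <= m; j++) AT(0, j) = j;   /* insert all of b[1..j] */
      for (int i = 1; i <= n; i++)
          for (int j = 1; j <= m; j++) {
              if (a[i-1] == b[j-1]) AT(i, j) = AT(i-1, j-1);
              else {
                  int del = AT(i-1, j), ins = AT(i, j-1);
                  AT(i, j) = 1 + (del < ins ? del : ins);
              }
          }
      int d = AT(n, m);
      #undef AT
      free(M);
      return d;
  }

  int main(void) {
      /* The strings from the page 18 figure; this prints 5. */
      printf("%d\n", edit_distance("atcacac", "tcat"));
      /* The example from pages 9-10; this prints 6 (= 7 + 7 - 2*4). */
      printf("%d\n", edit_distance("abacdac", "cadcddc"));
      return 0;
  }

This uses O(nm) time and O(nm) space; the next slides reduce the space.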

Page 19 Space Efficient Solution
Keep only the current row R and the previous row Rprev:
  Rprev[0..m] = 0..m;
  for i = 1 to n
    R[0] = i;
    for j = 1 to m
      if (A[i] == B[j]) R[j] = Rprev[j-1];
      else R[j] = 1 + min(Rprev[j], R[j-1]);
    Rprev[0..m] = R[0..m];
Requires only O(m) space. What is the problem?
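The same idea as runnable C (a sketch; the matrix is rolled into two rows, and the names are mine):

  #include <stdlib.h>
  #include <string.h>

  /* O(m)-space edit distance: keep only the previous row and the current row. */
  int edit_distance_rows(const char *a, const char *b) {
      int n = strlen(a), m = strlen(b);
      int *Rprev = malloc((m + 1) * sizeof(int));
      int *R     = malloc((m + 1) * sizeof(int));
      for (int j = 0; j <= m; j++) Rprev[j] = j;          /* row 0 */
      for (int i = 1; i <= n; i++) {
          R[0] = i;
          for (int j = 1; j <= m; j++) {
              if (a[i-1] == b[j-1]) R[j] = Rprev[j-1];
              else R[j] = 1 + (Rprev[j] < R[j-1] ? Rprev[j] : R[j-1]);
          }
          int *t = Rprev; Rprev = R; R = t;               /* "Rprev = R" by swapping buffers */
      }
      int d = Rprev[m];
      free(Rprev); free(R);
      return d;
  }

The problem the slide asks about: with only the last row we can still report the distance, but we can no longer walk back through the matrix to recover the edits; pages 20-22 address this.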

Page 20 Space Efficient Solution
For each entry in a row past n/2, keep track of which column its path comes from in row n/2.
[Figure: the last rows of the matrix from the page 18 example, with each entry shown as a pair (distance, column in row n/2).]
The function ED'(A,B) returns this column.
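A possible C sketch of ED' (my own reconstruction, not the course code): alongside each row entry R[j] it carries C[j], the column at which the best path to that entry crossed row n/2, and it returns that column for the final cell (n,m).

  #include <stdlib.h>
  #include <string.h>

  /* Sketch of ED'(A,B) with 0-based C strings; all names are mine. */
  int ed_prime(const char *a, const char *b) {
      int n = strlen(a), m = strlen(b), half = n / 2;
      int *Rprev = malloc((m+1)*sizeof(int)), *R = malloc((m+1)*sizeof(int));
      int *Cprev = malloc((m+1)*sizeof(int)), *C = malloc((m+1)*sizeof(int));
      for (int j = 0; j <= m; j++) { Rprev[j] = j; Cprev[j] = (half == 0) ? j : -1; }
      for (int i = 1; i <= n; i++) {
          R[0] = i;
          C[0] = (i <= half) ? 0 : Cprev[0];
          for (int j = 1; j <= m; j++) {
              if (a[i-1] == b[j-1])       { R[j] = Rprev[j-1];   C[j] = Cprev[j-1]; }
              else if (Rprev[j] < R[j-1]) { R[j] = 1 + Rprev[j]; C[j] = Cprev[j];   }
              else                        { R[j] = 1 + R[j-1];   C[j] = C[j-1];     }
              if (i == half) C[j] = j;    /* a path through row n/2 crosses it at column j */
          }
          int *t;                         /* reuse the buffers: "Rprev = R", "Cprev = C" */
          t = Rprev; Rprev = R; R = t;
          t = Cprev; Cprev = C; C = t;
      }
      int k = Cprev[m];
      free(Rprev); free(R); free(Cprev); free(C);
      return k;
  }

The recursion on the next slide splits A at row n/2 and B at this column k.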

Page 21 Space efficient solution
Now solve recursively:
  function RecED(A, B) =
    k = ED'(A, B)
    RecED(A[1..n/2], B[1..k]) ++ RecED(A[n/2..n], B[k..m])
[Figure: the split column k in row n/2 of the matrix from the previous page.]

Page 22 Space Efficient Solution
Time: T(n,m) = T(n/2,k) + T(n/2,m-k) + O(nm) = O(nm)
(each level of the recursion touches at most half as many cells as the level above it, so the total is at most about 2nm)
Space: S(n,m) = O(m) for ED' and O(m+n) for the result, i.e. O(n+m) overall

Page 23 Space Efficient Solution: Notes
We used divide-and-conquer to save space, not time.
We allowed for redundant work, but that's OK!

Page 24 Improving Time Bounds
For many applications (e.g. diff), the difference between the strings tends to be small. Can we do something better for this case?
This idea was exploited by Myers and Ukkonen (independently), and it is now the basis for GNU diff.
It is an example of an algorithm whose runtime is output-sensitive.

Page 25 Viewing as a Graph
Edit distance can be expressed as the problem of finding a shortest path in the following graph:
[Figure: a grid (edit graph) for the strings acta and tata, from the start vertex at the top-left to the end vertex at the bottom-right; horizontal and vertical edges have weight 1, diagonal edges have weight 0 (only when the characters match).]
How many vertices will Dijkstra's algorithm visit, as a function of n, m and d?

Page 26 Dijkstra's algorithm (review)
Single-source (vs), single-destination (vd) shortest path:
  d(v) = ∞ for all v ≠ vs; d(vs) = 0
  insert(vs, EmptyQ)
  while Min(Q) ≠ vd
    v = deleteMin(Q)
    for each neighbor v' of v
      d(v') = min(d(v'), d(v) + weight(v,v'))
      insert(v', Q), or replace if already in there
Takes |E| inner iterations, each taking O(log |V|) time. Can be improved to O(|E| + |V| log |V|) time.
[Figure: the finished vertices, the frontier (in Q), vs, vd, and the current vertex v.]

Page 27 Bounding Visited Vertices
Theorem: D_ij ≥ |j - i|
Proof: every step away from the diagonal is along a horizontal or vertical edge. Each such edge contributes 1, so the total distance D_ij must be at least the distance from the diagonal.
Corollary: Dijkstra's algorithm visits at most min(n,m) * 2d vertices.
[Figure: the band of searched vertices of width about 2d around the diagonal.]

Page 28 Bounding Time
The priority queue can be maintained in constant time per operation (basically a modified breadth-first search), so the total time is O(min(n,m) · d). In practice this can be much less than O(nm).
What about space?

Page 29 Optimizing Space
Use the recursive trick from before.
Problem: we do not know d ahead of time, and scanning row by row will not work without d. We need to bound the size of the frontier.

Page 30 Increasing Band Widths
Start with a band of width |m-n| and double it on each step until it is large enough.
Visits at most twice as many vertices as it should.
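One way to realize the doubling idea in code (a sketch only; it uses a band-limited version of the table-filling dynamic program rather than the Dijkstra-style search the slides describe, and all names are mine):

  #include <stdlib.h>
  #include <string.h>

  /* Edit distance restricted to the band |i - j| <= w; cells outside the band
     are treated as unreachable.  By the page 27 bound, if the returned value
     is <= w then it equals the true edit distance. */
  static int banded_ed(const char *a, int n, const char *b, int m, int w) {
      const int INF = n + m + 1;
      int *prev = malloc((m + 1) * sizeof(int));
      int *cur  = malloc((m + 1) * sizeof(int));
      for (int j = 0; j <= m; j++) prev[j] = (j <= w) ? j : INF;
      for (int i = 1; i <= n; i++) {
          for (int j = 0; j <= m; j++) cur[j] = INF;
          if (i <= w) cur[0] = i;
          int lo = (i - w > 1) ? i - w : 1;
          int hi = (i + w < m) ? i + w : m;
          for (int j = lo; j <= hi; j++) {
              int best = (a[i-1] == b[j-1]) ? prev[j-1] : INF;   /* diagonal (match) */
              if (1 + prev[j]  < best) best = 1 + prev[j];       /* delete a[i] */
              if (1 + cur[j-1] < best) best = 1 + cur[j-1];      /* insert b[j] */
              cur[j] = best;
          }
          int *t = prev; prev = cur; cur = t;
      }
      int d = prev[m];
      free(prev); free(cur);
      return d;
  }

  /* Page 30: start with a band of |m - n| and double it until it is wide enough. */
  int edit_distance_banded(const char *a, const char *b) {
      int n = strlen(a), m = strlen(b);
      int w = (m > n) ? m - n : n - m;
      if (w == 0) w = 1;
      for (;;) {
          int d = banded_ed(a, n, b, m, w);
          if (d <= w) return d;   /* the optimal path fits inside the band */
          w *= 2;
      }
  }

Each attempt costs O(n · w) time and O(m) space, so the doubling increases the total work by at most a constant factor over a single run at the final width.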