An Introduction to Bioinformatics Algorithms: Divide & Conquer Algorithms

Outline
1. MergeSort
2. Finding the middle vertex
3. Linear-space sequence alignment
4. Block alignment
5. Four-Russians speedup
6. LCS in sub-quadratic time

Section 1: MergeSort

Divide and Conquer Algorithms. Divide the problem into sub-problems. Conquer by solving the sub-problems recursively; if a sub-problem is small enough, solve it by brute force. Combine the solutions of the sub-problems into a solution of the original problem (often the tricky part).

Sorting Problem Revisited. Given: an unsorted array. Goal: sort it.

Mergesort: Divide Step (Step 1: DIVIDE). Splitting an array of size n down to single elements takes about log(n) levels of halving.

Mergesort: Conquer Step (Step 2: CONQUER). There are log(n) merging levels, and each level takes O(n) time, so the total time is O(n log n).

Mergesort: Combine Step (Step 3: COMBINE). Two arrays of size 1 can easily be merged into a sorted array of size 2. In general, two sorted arrays of sizes n and m can be merged in O(n+m) time into a sorted array of size n+m.

Mergesort: Combine Step (figure: combining two sorted arrays of size 4, and so on for larger and larger arrays).

Merge Algorithm

Merge(a, b)
  n1 ← size of array a
  n2 ← size of array b
  a[n1 + 1] ← ∞
  b[n2 + 1] ← ∞
  i ← 1
  j ← 1
  for k ← 1 to n1 + n2
    if a[i] < b[j]
      c[k] ← a[i]
      i ← i + 1
    else
      c[k] ← b[j]
      j ← j + 1
  return c

Mergesort: Example (figure: the divide phase splits the array down to single elements; the conquer phase merges them back together in sorted order).

MergeSort Algorithm

MergeSort(c)
  n ← size of array c
  if n = 1
    return c
  left ← list of first n/2 elements of c
  right ← list of last n - n/2 elements of c
  sortedLeft ← MergeSort(left)
  sortedRight ← MergeSort(right)
  sortedList ← Merge(sortedLeft, sortedRight)
  return sortedList
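
For readers who want to run the two routines above, here is a minimal Python sketch; it follows the pseudocode, except that the sentinel (∞) trick in Merge is replaced by explicit bounds checks, and the names merge/merge_sort are chosen for this sketch only.

    def merge(a, b):
        """Merge two sorted lists into one sorted list in O(len(a) + len(b)) time."""
        c, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] < b[j]:
                c.append(a[i])
                i += 1
            else:
                c.append(b[j])
                j += 1
        c.extend(a[i:])      # append whatever remains of a (may be empty)
        c.extend(b[j:])      # append whatever remains of b (may be empty)
        return c

    def merge_sort(c):
        """Divide c in half, sort each half recursively, then merge the results."""
        if len(c) <= 1:
            return c
        mid = len(c) // 2
        return merge(merge_sort(c[:mid]), merge_sort(c[mid:]))

    print(merge_sort([5, 2, 4, 7, 1, 3, 2, 6]))   # [1, 2, 2, 3, 4, 5, 6, 7]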

MergeSort: Running Time. The problem is broken into small steps: each merging level does O(n) total work, and there are O(log n) levels, so the running time is O(n log n).

Divide and Conquer Approach to LCS

Path(source, sink)
  if source and sink are in consecutive columns
    output the longest path from source to sink
  else
    middle ← middle vertex between source and sink
    Path(source, middle)
    Path(middle, sink)

Divide and Conquer Approach to LCS. The only problem left is how to find this “middle vertex”!

Section 2: Finding the Middle Vertex

Alignment Score Requires Only Linear Memory. The space complexity of computing the alignment score is just O(n): we only need the previous column to calculate the current column, and the previous column can be thrown away once we are done with it.

Computing the Alignment Score: Recycling Columns. Only two columns of scores are saved at any given time: the memory for column 1 is reused to calculate column 3, the memory for column 2 is reused to calculate column 4, and so on.
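
To make the two-column idea concrete, here is a minimal Python sketch that computes the LCS score of two strings in O(n) space; the scoring (match = 1, everything else = 0) and the names are assumptions of this sketch, not part of the slides.

    def lcs_score(v, w):
        """LCS length of v and w using only two columns of the DP matrix."""
        n = len(v)
        prev = [0] * (n + 1)          # column j-1 of the DP table
        for j in range(1, len(w) + 1):
            curr = [0] * (n + 1)      # column j
            for i in range(1, n + 1):
                if v[i - 1] == w[j - 1]:
                    curr[i] = prev[i - 1] + 1
                else:
                    curr[i] = max(prev[i], curr[i - 1])
            prev = curr               # recycle: column j becomes the "previous" column
        return prev[n]

    print(lcs_score("ATCTGAT", "TGCATA"))   # 4, e.g. the common subsequence "TCTA"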

Alignment Path Requires Quadratic Memory. The space complexity of computing an alignment path for sequences of lengths n and m is O(nm), because all backtracking pointers must be kept in memory in order to reconstruct the path.

Crossing the Middle Line. We want to calculate the longest path from (0,0) to (n,m) that passes through (i, m/2), where i ranges from 0 to n and indexes a row. Define length(i) as the length of the longest path from (0,0) to (n,m) that passes through vertex (i, m/2).

Crossing the Middle Line. Define (mid, m/2) as the vertex where the longest path crosses the middle column. Then length(mid) = optimal length = max over 0 ≤ i ≤ n of length(i).

Crossing the Middle Line. The longest path through (i, m/2) decomposes into a prefix from (0,0) to (i, m/2) and a suffix from (i, m/2) to (n,m).

Computing prefix(i). prefix(i) is the length of the longest path from (0,0) to (i, m/2). Compute prefix(i) by dynamic programming in the left half of the matrix, storing only the prefix(i) column.

Computing suffix(i). suffix(i) is the length of the longest path from (i, m/2) to (n,m); equivalently, it is the length of the longest path from (n,m) to (i, m/2) with all edges reversed. Compute suffix(i) by dynamic programming in the right half of the “reversed” matrix, storing only the suffix(i) column.

length(i) = prefix(i) + suffix(i). Add prefix(i) and suffix(i) to compute length(i) = prefix(i) + suffix(i). The middle vertex of an optimal path is then (i, m/2) for the i that maximizes length(i): the middle point is found.
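
Here is a minimal Python sketch of the middle-vertex computation for the LCS version of the problem (match = 1, no mismatch or indel penalty). The helper names are chosen for this sketch only; both passes reuse the two-column, linear-space trick from above, with the suffix pass run on the reversed strings.

    def score_column(v, w):
        """Last column of the LCS DP table for v vs w, computed in O(len(v)) space."""
        prev = [0] * (len(v) + 1)
        for j in range(1, len(w) + 1):
            curr = [0] * (len(v) + 1)
            for i in range(1, len(v) + 1):
                if v[i - 1] == w[j - 1]:
                    curr[i] = prev[i - 1] + 1
                else:
                    curr[i] = max(prev[i], curr[i - 1])
            prev = curr
        return prev

    def middle_vertex(v, w):
        """Row i maximizing prefix(i) + suffix(i) in the middle column m/2."""
        mid = len(w) // 2
        prefix = score_column(v, w[:mid])                       # prefix(i): (0,0) -> (i, m/2)
        suffix = score_column(v[::-1], w[mid:][::-1])[::-1]     # suffix(i): (i, m/2) -> (n, m)
        lengths = [p + s for p, s in zip(prefix, suffix)]
        i = lengths.index(max(lengths))
        return i, mid, lengths[i]

    print(middle_vertex("ATCTGAT", "TGCATA"))   # (3, 3, 4): middle vertex (3, 3), LCS length 4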

Finding the Middle Point (figure: the grid with columns 0, m/4, m/2, 3m/4, m marked and the middle vertex found in column m/2).

Finding the Middle Point Again (figure: the recursion continues in the two halves, finding middle vertices in columns m/4 and 3m/4).

And Again… (figure: at the next level, middle vertices are found in columns m/8, 3m/8, 5m/8, and 7m/8).

Time = Area: First Pass. On the first pass (computing prefix(i) and suffix(i)), the algorithm covers the entire area: Area = n × m.

Time = Area: Second Pass. On the second pass, the algorithm covers only half of the area: Area/2.

Time = Area: Third Pass. On the third pass, only a quarter of the area is covered: Area/4.

Geometric Reduction at Each Iteration. The work shrinks geometrically: first pass 1, second pass 1/2, third pass 1/4, fourth pass 1/8, fifth pass 1/16, and so on, so 1 + 1/2 + 1/4 + … + (1/2)^k ≤ 2. Runtime: O(Area) = O(nm).
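
The runtime bound, written out as a single inequality (a restatement of the line above):

    \[
      \text{Total time} \;=\; nm \left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \cdots + \left(\tfrac{1}{2}\right)^{k}\right)
      \;\le\; 2\,nm \;=\; O(nm).
    \]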

Can We Align Sequences in Subquadratic Time? Dynamic programming takes O(n^2) time for global alignment. Can we do better? Yes, with the Four-Russians speedup.

Section 4: Block Alignment

Partitioning Sequences into Blocks. Partition the n × n grid into blocks of size t × t. We are comparing two sequences, each of length n, and each sequence is sectioned off into chunks of length t. Sequence u = u_1…u_n becomes |u_1…u_t| |u_{t+1}…u_{2t}| … |u_{n-t+1}…u_n|, and sequence v = v_1…v_n becomes |v_1…v_t| |v_{t+1}…v_{2t}| … |v_{n-t+1}…v_n|.

Partitioning the Alignment Grid into Blocks (figure: the n × n grid partitioned into an (n/t) × (n/t) array of t × t blocks).

Block Alignment. In a block alignment of sequences u and v, at each step:
1. An entire block in u is aligned with an entire block in v,
2. an entire block is inserted, or
3. an entire block is deleted.
Block path: a path that traverses every t × t square through its corners.

Block Alignment: Examples (figure: a valid block path, which passes through block corners only, and an invalid path, which enters blocks away from their corners).

Block Alignment Problem. Goal: find the longest block path through an edit graph. Input: two sequences, u and v, partitioned into blocks of size t; this is equivalent to an n × n edit graph partitioned into t × t subgrids. Output: the block alignment of u and v with the maximum score (the longest block path through the edit graph).

Constructing Alignments within Blocks. To solve the problem, compute an alignment score BlockScore(i, j) for each pair of blocks |u_{(i-1)t+1}…u_{it}| and |v_{(j-1)t+1}…v_{jt}|. How many blocks are there per sequence? n/t blocks of size t. How many pairs of blocks must be aligned? (n/t) × (n/t). For each block pair, solve a mini-alignment problem of size t × t.

Constructing Alignments within Blocks (figure: an (n/t) × (n/t) grid in which each square represents one block pair, i.e. one mini-alignment problem to solve).

Block Alignment: Dynamic Programming. Let s_{i,j} denote the optimal block alignment score between the first i blocks of u and the first j blocks of v, let σ_block be the penalty for inserting or deleting an entire block, and let BlockScore(i, j) be the score of the pair of blocks in row i and column j. Then
s_{i,j} = max( s_{i-1,j} - σ_block, s_{i,j-1} - σ_block, s_{i-1,j-1} + BlockScore(i, j) ).
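
A minimal Python sketch of this recurrence under one possible scoring scheme (LCS-style: +1 per matched character inside a block, 0 otherwise), with the block indel penalty sigma_block exposed as a parameter; BlockScore is computed here by a plain quadratic mini-alignment, which is exactly the step the Four-Russians technique replaces with a table lookup. The names and the scoring choice are assumptions of this sketch.

    def lcs_len(a, b):
        """Quadratic LCS length of two short strings; plays the role of BlockScore(i, j)."""
        s = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                s[i][j] = max(s[i - 1][j], s[i][j - 1],
                              s[i - 1][j - 1] + (a[i - 1] == b[j - 1]))
        return s[len(a)][len(b)]

    def block_alignment(u, v, t, sigma_block=0):
        """Best block-path score: each block is aligned, inserted, or deleted as a whole.

        Assumes len(u) and len(v) are multiples of t.
        """
        nb_u, nb_v = len(u) // t, len(v) // t
        s = [[0] * (nb_v + 1) for _ in range(nb_u + 1)]
        for i in range(nb_u + 1):
            for j in range(nb_v + 1):
                if i == 0 and j == 0:
                    continue
                candidates = []
                if i > 0:
                    candidates.append(s[i - 1][j] - sigma_block)   # delete block i of u
                if j > 0:
                    candidates.append(s[i][j - 1] - sigma_block)   # insert block j of v
                if i > 0 and j > 0:                                # align block i with block j
                    candidates.append(s[i - 1][j - 1] +
                                      lcs_len(u[(i - 1) * t:i * t], v[(j - 1) * t:j * t]))
                s[i][j] = max(candidates)
        return s[nb_u][nb_v]

    print(block_alignment("ATCTGATG", "TGCATACC", t=2))   # best block-path score for a toy pair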

Block Alignment Runtime. Indices i and j range from 0 to n/t, so the running time of this recurrence is O((n/t) · (n/t)) = O(n^2 / t^2), if we do not count the time needed to compute each BlockScore(i, j).

Block Alignment Runtime. Computing all BlockScore(i, j) values requires solving (n/t) · (n/t) mini block alignments, each of size t × t, so computing all of them takes O((n/t) · (n/t) · t · t) = O(n^2) time. This is no better than standard dynamic programming. How do we speed this up?

Section 5: Four-Russians Speedup

Four-Russians Technique. Let t = log(n), where t is the block size and n is the sequence length. Instead of computing (n/t) · (n/t) mini-alignments on the fly, construct mini-alignments for all 4^t × 4^t pairs of t-nucleotide strings and put the results in a lookup table. This sounds huge, but the lookup table is not really that big if t is small: let t = (log n)/4; then 4^t · 4^t = n.
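
Spelling out that last claim (with logarithms taken base 2):

    \[
      t = \frac{\log_2 n}{4}
      \;\Longrightarrow\;
      4^{t} = 2^{2t} = 2^{(\log_2 n)/2} = \sqrt{n},
      \qquad
      4^{t} \cdot 4^{t} = n .
    \]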

Look-up Table for the Four-Russians Technique (figure: the lookup table “Score”, whose rows and columns are indexed by all t-nucleotide strings AAAAAA, AAAAAC, AAAAAG, AAAAAT, AAAACA, …). Each index sequence has t nucleotides, and the table size is only n, instead of (n/t) · (n/t) mini-alignments computed on the fly.
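
A small Python sketch of building such a table, again assuming LCS-style scoring for each pair of t-mers; itertools.product enumerates the 4^t strings, and the function names are illustrative only.

    from itertools import product

    def lcs_len(a, b):
        """Quadratic LCS length of two short strings (the per-pair mini-alignment)."""
        s = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                s[i][j] = max(s[i - 1][j], s[i][j - 1],
                              s[i - 1][j - 1] + (a[i - 1] == b[j - 1]))
        return s[len(a)][len(b)]

    def build_score_table(t):
        """Pre-compute Score[(x, y)] for every pair of t-nucleotide strings x and y."""
        tmers = ["".join(p) for p in product("ACGT", repeat=t)]   # all 4^t strings of length t
        return {(x, y): lcs_len(x, y) for x in tmers for y in tmers}

    score = build_score_table(2)        # 4^2 * 4^2 = 256 entries
    print(score[("AT", "TA")])          # 1: the LCS of "AT" and "TA" has length 1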

New Recurrence. The new lookup table Score is indexed by a pair of t-nucleotide strings, so the recurrence becomes
s_{i,j} = max( s_{i-1,j} - σ_block, s_{i,j-1} - σ_block, s_{i-1,j-1} + Score(i-th block of u, j-th block of v) ).
Key difference: the Score value is now taken from the lookup (hash) table rather than computed by dynamic programming as before.

Four-Russians Speedup Runtime. Computing the lookup table Score of size n takes O(n) time, so the running time is dominated by the (n/t) · (n/t) accesses to the lookup table, each of which takes O(log n) time. Overall running time: O((n^2 / t^2) · log n). Substituting t = log n gives O((n^2 / (log n)^2) · log n) = O(n^2 / log n).

Section 6: LCS in Sub-Quadratic Time

So Far… We can divide the grid into blocks and run dynamic programming only on the corners of these blocks. To push the mini-alignment calculations below n^2, we create a lookup table of size n that holds the scores of all pairs of t-nucleotide strings. The running time goes from quadratic, O(n^2), to subquadratic, O(n^2 / log n).

Four-Russians Speedup for LCS. Unlike a block path in the block-partitioned graph, the LCS path does not have to pass through the corner vertices of the blocks (figure: side-by-side panels “Block Alignment” and “Longest Common Subsequence”).

Block Alignment vs. LCS. In block alignment we only care about the corners of the blocks. In LCS we care about all points on the edges of the blocks, because those are points the path can traverse. Recall that each sequence has length n and each block has size t, so each sequence has n/t blocks.

Block Alignment vs. LCS: Points of Interest. Block alignment has (n/t) · (n/t) = n^2 / t^2 points of interest (the block corners), while LCS alignment has O(n^2 / t) points of interest (the points on the block edges).

Traversing Blocks for LCS. Given the alignment scores s_{i,*} in the first row and the scores s_{*,j} in the first column of a t × t mini-square, compute the alignment scores in the last row and last column of the mini-square. Computing the last row and last column requires four pieces of information:
1. The alignment scores s_{i,*} in the first row.
2. The alignment scores s_{*,j} in the first column.
3. The substring of sequence u in this block (4^t possibilities).
4. The substring of sequence v in this block (4^t possibilities).

Traversing Blocks for LCS. If we computed every block this way, it would still take quadratic O(n^2) time, but we want to do better (figure: a t × t block whose first-row and first-column scores are known and whose last-row and last-column scores can then be calculated).
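
For concreteness, here is a small Python sketch of that per-block computation under LCS-style scoring: given the scores along the top and left edges of one t × t block and the two substrings inside it, it returns the scores along the bottom and right edges. The Four-Russians idea is to precompute the output of such a function for every possible input rather than recomputing it block by block; all names here are illustrative.

    def traverse_block(top, left, u_sub, v_sub):
        """Fill one t x t block of the LCS grid.

        top  : scores on the block's top edge, length t+1 (top[0] == left[0])
        left : scores on the block's left edge, length t+1
        u_sub, v_sub : the t-character substrings of u (rows) and v (columns) in this block
        Returns (bottom, right): scores on the bottom and right edges.
        """
        t = len(u_sub)
        s = [[0] * (t + 1) for _ in range(t + 1)]
        s[0] = list(top)                         # first row is known
        for i in range(t + 1):
            s[i][0] = left[i]                    # first column is known
        for i in range(1, t + 1):
            for j in range(1, t + 1):
                match = 1 if u_sub[i - 1] == v_sub[j - 1] else 0
                s[i][j] = max(s[i - 1][j], s[i][j - 1], s[i - 1][j - 1] + match)
        bottom = s[t]                            # last row
        right = [s[i][t] for i in range(t + 1)]  # last column
        return bottom, right

    # Toy usage: a 2 x 2 block in the top-left corner of the grid (all boundary scores 0).
    print(traverse_block([0, 0, 0], [0, 0, 0], "AT", "TA"))
    # ([0, 1, 1], [0, 1, 1]): the LCS of "AT" and "TA" within this block has length 1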

Four-Russians Speedup. Build a lookup table indexed by all possible values of the four variables:
1. All possible scores s_{i,*} for the first row.
2. All possible scores s_{*,j} for the first column.
3. The substring of sequence u in this block (4^t possibilities).
4. The substring of sequence v in this block (4^t possibilities).
For each quadruple we store the scores of the last row and last column. This would be a huge table, but we can eliminate score sequences that cannot occur.

Reducing Table Size. Alignment scores in LCS are monotonically increasing, and adjacent elements cannot differ by more than 1. Example: 0, 1, 2, 2, 3, 4 is valid; 0, 1, 2, 4, 5, 8 is not, because 2 and 4 differ by more than 1 (and so do 5 and 8). Therefore we only need to store quadruples whose score sequences are monotonically increasing with steps of at most 1.

Efficient Encoding of Alignment Scores. Instead of recording the scores themselves, which can be as large as the sequence length, we can encode each score sequence in binary as the differences between adjacent alignment scores. For example, the original encoding 0, 1, 2, 2, 3, 4 becomes the binary encoding 1, 1, 0, 1, 1.

Reducing Lookup Table Size. With the binary encoding there are 2^t possible score vectors (t = block size) and 4^t possible strings, so the lookup table has size (2^t · 2^t) · (4^t · 4^t) = 2^(6t). Let t = (log n)/4; then the table size is 2^(6(log n)/4) = n^(6/4) = n^(3/2). Time = O((n^2 / t^2) · log n) = O((n^2 / (log n)^2) · log n) = O(n^2 / log n).

Summary. We take advantage of the fact that, for blocks of size t = log(n), we can pre-compute all possible scores and store them in a lookup table of size n^(3/2). Using the Four-Russians speedup, the running time for LCS goes from quadratic to subquadratic: O(n^2 / log n).