Dynamic-Programming Strategies for Analyzing Biomolecular Sequences Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National Taiwan University, Taiwan WWW:

2 Dynamic Programming Dynamic programming is a class of solution methods for solving sequential decision problems with a compositional cost structure. Richard Bellman was one of the principal founders of this approach.

3 Two key ingredients. Two key ingredients for an optimization problem to be suitable for a dynamic-programming solution:
1. Optimal substructures: each substructure is optimal (the principle of optimality).
2. Overlapping subproblems: the subproblems are dependent (otherwise, a divide-and-conquer approach is the better choice).

4 Three basic components. The development of a dynamic-programming algorithm has three basic components:
– the recurrence relation (for defining the value of an optimal solution);
– the tabular computation (for computing the value of an optimal solution);
– the traceback (for delivering an optimal solution).

5 Fibonacci numbers. The Fibonacci numbers are defined by the following recurrence: F_0 = 0, F_1 = 1, and F_i = F_{i-1} + F_{i-2} for i > 1.

6 How to compute F_10? Expanding the recurrence naively, F_10 calls F_9 and F_8, F_9 calls F_8 and F_7, F_8 calls F_7 and F_6, and so on: the same subproblems are recomputed over and over.

7 Tabular computation. The tabular computation can avoid recomputation: F_0 = 0, F_1 = 1, F_2 = 1, F_3 = 2, F_4 = 3, F_5 = 5, F_6 = 8, F_7 = 13, F_8 = 21, F_9 = 34, F_10 = 55.
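For concreteness, here is a minimal Python sketch of this tabular computation (the function name and the test call are mine, not part of the slides):

    def fib(n):
        # Bottom-up tabular computation: each F_i is computed exactly once,
        # so the running time is O(n), unlike the naive recursion of the previous slide.
        if n < 2:
            return n
        table = [0] * (n + 1)          # table[i] holds F_i
        table[1] = 1
        for i in range(2, n + 1):
            table[i] = table[i - 1] + table[i - 2]
        return table[n]

    print(fib(10))                     # 55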

8 Maximum-sum interval. Given a sequence of real numbers a_1 a_2 … a_n, find a consecutive subsequence with the maximum sum. (The slide shows an example number sequence with its maximum-sum interval highlighted.) For each position, we can compute the maximum-sum interval starting at that position in O(n) time. Therefore, a naive algorithm runs in O(n²) time.

9 O-notation: an asymptotic upper bound. f(n) = O(g(n)) iff there exist two positive constants c and n_0 such that 0 ≤ f(n) ≤ c·g(n) for all n ≥ n_0. (The slide's figure shows f(n) lying below c·g(n) beyond n_0.)

10 How do functions grow? (Assume one million operations per second.) The slide tabulates running times for functions such as 30n, 92n log n, 26n², a cubic term, and 2ⁿ: at n = 100 the polynomial entries all finish in about a second or less, while the 2ⁿ entry already takes about 4 × 10^16 years; at n = 100,000 the tabulated times climb to a few seconds, 2.6 min, 3.0 days, and 22 yr. For large data sets, algorithms with a complexity greater than O(n log n) are often impractical!

11 Maximum-sum interval (the recurrence relation). Define S(i) to be the maximum sum over all intervals ending at position i. Then S(i) = max{ S(i-1) + a_i, a_i }: if S(i-1) < 0, concatenating a_i with its previous interval gives a smaller sum than a_i itself.

12 Maximum-sum interval (tabular computation). Computing S(1), S(2), …, S(n) left to right by the recurrence takes O(n) time; the maximum sum is the largest S(i). (The slide shows the table of S(i) values for the example sequence.)

13 Maximum-sum interval (traceback). Starting from the position with the largest S(i) and tracing back to where that interval began (the most recent position at which the recurrence restarted) delivers the maximum-sum interval itself. (The slide highlights the resulting interval in the example.)
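A minimal Python sketch of the recurrence, tabular computation, and traceback above (the function name and the sample input are hypothetical, not the slides' example):

    def maximum_sum_interval(a):
        # S(i) = max(S(i-1) + a[i], a[i]), computed left to right with a traceback.
        # Returns (best_sum, start, end) with 0-based inclusive indices, in O(n) time.
        best_sum, best_start, best_end = a[0], 0, 0
        s, start = a[0], 0                      # s is S(i) for the current position i
        for i in range(1, len(a)):
            if s < 0:                           # extending a negative-sum interval never helps,
                s, start = a[i], i              # so restart the interval at position i
            else:
                s += a[i]                       # otherwise extend the interval ending at i-1
            if s > best_sum:
                best_sum, best_start, best_end = s, start, i
        return best_sum, best_start, best_end

    print(maximum_sum_interval([2, -4, 3, -1, 5, -9, 4]))   # (7, 2, 4): the interval 3, -1, 5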

14 Two fundamental problems we recently solved (I) (joint work with Lin and Jiang). Given a sequence of real numbers of length n and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum: an O(n)-time algorithm. (The slide illustrates this on the example sequence with U = 3.)

15 Joint work with Huang, Jiang and Lin. Xiaoqiu Huang (黃曉秋), Iowa State University, USA; Tao Jiang (姜濤), University of California, Riverside, USA; Yaw-Ling Lin (林耀鈴), Providence University, Taiwan.

16 Two fundamental problems we recently solved (II) (joint work with Lin and Jiang). Given a sequence of real numbers of length n and a lower bound L, find a consecutive subsequence of length at least L with the maximum average: an O(n log L)-time algorithm. (The slide illustrates this on an example sequence with a given L.)

17 Another example. Given the sequence 2, 6.6, 6.6, 3, 7, 6, 7, 2 and L = 2, the highest-average interval is 7, 6, 7 (the boxed area on the slide), whose average is 20/3.

18 C+G rich regions. Our method can be used to locate a region of length at least L with the highest C+G ratio in O(n log L) time: score each C or G as 1 and each A or T as 0 (e.g. for ATGACTCGAGCTCGTCA), and then search for an interval of length at least L with the highest average.

19 After I opened this problem in Academia Sinica in Aug.:
JCSS (2002) – Lin, Jiang, Chao
– A linear-time algorithm for a maximum-sum segment with a lower bound L and an upper bound U.
– An O(n log L)-time algorithm for a maximum-average segment with a lower bound L.
– Two open problems raised: O(n log L) → O(n)? maximum average with an upper bound U?

20 After I opened this problem in Academia Sinica in Aug.:
Comb. Math. (2002) – Yeh et al.
– An expected O(nL)-time algorithm.
Comb. Math. (2002) – Lu
– A linear-time algorithm for a maximum-average segment with a lower bound L in a binary sequence.

21 After I opened this problem in Academia Sinica in Aug.:
WABI (2002) & JCSS – Goldwasser, Kao, and Lu
– O(n) for a lower bound L only.
– O(n log (U-L+1)) for both L and U.
IPL (2003) – Kim
– O(n) for a lower bound L only (reformulated as a computational geometry problem).

22 After I opened this problem in Academia Sinica in Aug.:
Bioinformatics (2003) – Lin, Huang, Jiang, Chao
– A tool for finding k-best maximum-average segments.
Bioinformatics (2003) – Wang, Xu
– A tool for locating a maximum-sum segment with a lower bound L and an upper bound U.

23 After I opened this problem in Academia Sinica in Aug.:
CIAA (2003) – Lu et al.
– An alternative linear-time algorithm for a maximum-sum segment (online; for finding repeats).
ESA (2003) – Chung, Lu
– O(n log (U-L+1)) → O(n) for both L and U.
ISAAC (2003) – Lin (R.R.), Kuo, Chao
– O(nL) for a maximum-average path in a tree; O(n) for a complete tree.
Some results were omitted here... some are coming soon...

24 Length-unconstrained version. For the maximum-average interval with no length constraint, the maximum single element is the answer, so it can be found in O(n) time.

25 Q: computing density in O(1) time? prefix-sum(i) = S[1] + S[2] + … + S[i]; all n prefix sums are computable in O(n) time. Then sum(i, j) = prefix-sum(j) - prefix-sum(i-1), and density(i, j) = sum(i, j) / (j-i+1).
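A small Python sketch of this prefix-sum trick (the names and the sample input are mine):

    from itertools import accumulate

    def make_density(a):
        # Precompute all prefix sums in O(n); afterwards sum(i, j) and density(i, j)
        # are answered in O(1) time, using 1-based inclusive indices as on the slide.
        prefix = [0] + list(accumulate(a))       # prefix[i] = a_1 + ... + a_i

        def density(i, j):
            return (prefix[j] - prefix[i - 1]) / (j - i + 1)

        return density

    density = make_density([0, 1, 1, 0, 1])      # e.g. a 0/1-scored C+G sequence
    print(density(2, 4))                         # (1 + 1 + 0) / 3 = 0.666...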

26 A naive algorithm. A simple shift algorithm can compute the highest-average interval of a fixed length in O(n) time. Trying every length L, L+1, L+2, ..., n takes O(n²) time in total.

27 An O(nL)-time algorithm (Huang, CABIOS '94). There exists an optimal interval of length less than 2L: any region of length ≥ 2L can be bisected into two segments, each of length ≥ L, and the denser of the two has average no less than the whole. Checking the O(L) candidate lengths at each position therefore gives an O(nL)-time algorithm immediately.

28 Good partners. For each i = 1, 2, …, n-L+1, find the good partner g(i): the right endpoint, with the interval [i, g(i)] of length at least L, that maximizes avg[i, g(i)].

29 Right-Skew Decomposition. Partition S into substrings S_1, S_2, …, S_k such that:
– each S_i is a right-skew substring of S (the average of any prefix is always less than or equal to the average of the remaining suffix);
– density(S_1) > density(S_2) > … > density(S_k). [Lin, Jiang, Chao]
The decomposition is unique and computable in linear time.

30 Right-Skew Decomposition (Lu's illustration: an example sequence decomposed into right-skew segments whose densities strictly decrease from left to right).

31 An O(n log L)-time algorithm (Lin, Jiang, Chao, JCSS’02) Decreasingly right-skew decomposition (O(n) time)
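As an illustration of the decomposition on the previous slides, here is a Python sketch of one linear-time way to compute it; this is my sketch built on the stated properties (a singleton is right-skew, and concatenating two right-skew segments whose left density is not larger than the right one stays right-skew), not the authors' code:

    def right_skew_decomposition(a):
        # Returns (start, end) pairs (0-based, inclusive) that partition a into
        # right-skew segments with strictly decreasing densities.
        # Each element is pushed once and merged at most once, so the time is O(n).
        stack = []                               # top of the stack = leftmost segment built so far
        for i in range(len(a) - 1, -1, -1):      # scan from right to left
            start, end, total = i, i, a[i]       # new singleton segment
            while stack:
                s2, e2, t2 = stack[-1]
                # Merge while the new (left) segment is not denser than its right neighbour:
                # total/(end-start+1) <= t2/(e2-s2+1), compared without division.
                if total * (e2 - s2 + 1) <= t2 * (end - start + 1):
                    stack.pop()
                    end, total = e2, total + t2
                else:
                    break
            stack.append((start, end, total))
        return [(s, e) for s, e, _ in reversed(stack)]   # report segments left to right

    print(right_skew_decomposition([5, 1, 4, 2]))
    # [(0, 0), (1, 2), (3, 3)]: segments 5 | 1 4 | 2 with densities 5 > 2.5 > 2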

32 Right-skew pointers p[ ]. (The slide shows the table of right-skew pointer values for the example.)

33 Illustration. (Lu's illustration of the good-partner search, with L, i + L, and g(i) marked.)

34 An O(n log L)-time algorithm. Jumping tables allow a binary search: O(log L) time for each position to find its good partner, so the total running time is O(n log L). We also implemented a program that linearly scans the right-skew segments for the good partner; our empirical tests showed that it runs in linear time in practice.

35 Pointer-jumping tables p^(0)[ ], p^(1)[ ], p^(2)[ ]. (The table values are shown on the slide.)
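To illustrate the table structure only (not the full good-partner search), here is a generic pointer-jumping sketch in Python with hypothetical names: p^(k)[i] is the position reached from i by following 2^k successor pointers, so p^(k)[i] = p^(k-1)[p^(k-1)[i]], and a jump of any distance d needs only O(log d) lookups.

    def build_jump_tables(p0, levels):
        # p[k][i] is the index reached from i by following 2**k successor pointers.
        n = len(p0)
        p = [list(p0)]
        for k in range(1, levels):
            prev = p[k - 1]
            p.append([prev[prev[i]] for i in range(n)])
        return p

    def jump(p, i, d):
        # Follow d successor pointers from i in O(log d) lookups (requires d < 2**len(p)).
        for k in range(len(p)):
            if (d >> k) & 1:
                i = p[k][i]
        return i

    succ = [1, 2, 3, 4, 5, 6, 7, 7]     # hypothetical successor pointers (last points to itself)
    tables = build_jump_tables(succ, 3)
    print(jump(tables, 0, 5))           # 5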

36 Goldwasser, Kao, and Lu's progress. An O(n)-time algorithm for the problem, which appeared in the Proceedings of the Second Workshop on Algorithms in Bioinformatics (WABI), Rome, Italy, Sep. 2002; the JCSS version has been accepted.

37 A new important observation. i < j < g(j) < g(i) implies that density(i, g(i)) is no more than density(j, g(j)). (Lu's illustration.)

38 (Lu's illustration of the observation, continued.)

39 Searching for all g(i) in linear time. (Lu's illustration.)

40 k non-overlapping maximum-average segments (Lin, Huang, Jiang, Chao, Bioinformatics). Maintain all candidates in a priority queue: a new k-best algorithm plus algorithms on trees (Lin and Chao, 2003).

41 Longest increasing subsequence (LIS). The longest-increasing-subsequence problem is to find a longest increasing subsequence of a given sequence of distinct integers a_1 a_2 … a_n. (The slide gives examples of subsequences that are increasing and ones that are not.) We want to find a longest one.

42 A naive approach for LIS. Let L[i] be the length of a longest increasing subsequence ending at position i. Then L[i] = 1 + max_{j = 0..i-1} { L[j] : a_j < a_i } (use a dummy a_0 smaller than every element, with L[0] = 0).

43 A naive approach for LIS. Filling in L[i] = 1 + max_{j = 0..i-1} { L[j] : a_j < a_i } for every position gives the table shown on the slide; the maximum length is 6, and the subsequence 2, 3, 7, 8, 10, 13 is a longest increasing subsequence. This method runs in O(n²) time.
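A Python sketch of this O(n²) method, including the traceback (the input below is a hypothetical example, chosen so that 2, 3, 7, 8, 10, 13 is a longest increasing subsequence):

    def lis_quadratic(a):
        # L[i] = 1 + max{ L[j] : j < i and a[j] < a[i] }, then trace back from the best i.
        n = len(a)
        length = [1] * n
        prev = [-1] * n                   # predecessor of position i in an optimal subsequence
        for i in range(n):
            for j in range(i):
                if a[j] < a[i] and length[j] + 1 > length[i]:
                    length[i] = length[j] + 1
                    prev[i] = j
        i = max(range(n), key=length.__getitem__)
        out = []
        while i != -1:
            out.append(a[i])
            i = prev[i]
        return out[::-1]

    print(lis_quadratic([2, 1, 3, 7, 5, 8, 4, 10, 13, 6]))   # [2, 3, 7, 8, 10, 13]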

44 Binary search. Given an ordered sequence x_1 < x_2 < … < x_n and a number y, a binary search finds the largest x_i such that x_i < y in O(log n) time (the search range shrinks as n, n/2, n/4, …).

45 Binary search. How many steps does a binary search need to reduce the problem size to 1? The sizes go n, n/2, n/4, n/8, …, 1, so it takes O(log n) steps.

46 An O(n log n) method for LIS. Define BestEnd[k] to be the smallest possible ending number of an increasing subsequence of length k. (The slide shows BestEnd[1], …, BestEnd[6] being filled in as the example sequence is scanned.)

47 An O(n log n) method for LIS. Define BestEnd[k] to be the smallest possible ending number of an increasing subsequence of length k. For each position, we perform a binary search to update BestEnd; therefore, the running time is O(n log n).
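A Python sketch of the BestEnd idea, using the standard bisect module for the binary search (the input is the same hypothetical example as above):

    from bisect import bisect_left

    def lis_length(a):
        # best_end[k-1] is the smallest possible ending number of an increasing
        # subsequence of length k; one binary search per element gives O(n log n).
        best_end = []
        for x in a:
            pos = bisect_left(best_end, x)     # first ending number >= x (distinct integers)
            if pos == len(best_end):
                best_end.append(x)             # x extends the longest subsequence found so far
            else:
                best_end[pos] = x              # x is a smaller ending number for length pos+1
        return len(best_end)

    print(lis_length([2, 1, 3, 7, 5, 8, 4, 10, 13, 6]))      # 6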

48 Longest Common Subsequence (LCS) A subsequence of a sequence S is obtained by deleting zero or more symbols from S. For example, the following are all subsequences of “president”: pred, sdn, predent. The longest common subsequence problem is to find a maximum-length common subsequence between two sequences.

49 LCS. For instance, Sequence 1: president; Sequence 2: providence. Its LCS is priden (shown aligned on the slide).

50 LCS. Another example, Sequence 1: algorithm; Sequence 2: alignment. One of its LCSs is algm (shown letter by letter on the slide).

51 How to compute an LCS? Let A = a_1 a_2 … a_m and B = b_1 b_2 … b_n, and let len(i, j) be the length of an LCS between a_1 a_2 … a_i and b_1 b_2 … b_j. With the initialization len(i, 0) = len(0, j) = 0, len(i, j) can be computed as follows: len(i, j) = len(i-1, j-1) + 1 if a_i = b_j, and len(i, j) = max{ len(i-1, j), len(i, j-1) } otherwise.

(Slides 52–55 step through the len(i, j) table for the president/providence example; figures only.)

56 Now try this example on your own. Sequence 1: algorithm; Sequence 2: alignment. Their LCS? Recall the recurrence for len(i, j).
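A Python sketch of the len(i, j) recurrence with a traceback, applied to the two example pairs from the slides (the function name is mine):

    def lcs(a, b):
        # Fill the len(i, j) table by the recurrence of slide 51, then trace back
        # from the bottom-right corner to deliver one longest common subsequence.
        m, n = len(a), len(b)
        length = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if a[i - 1] == b[j - 1]:
                    length[i][j] = length[i - 1][j - 1] + 1
                else:
                    length[i][j] = max(length[i - 1][j], length[i][j - 1])
        out, i, j = [], m, n
        while i > 0 and j > 0:
            if a[i - 1] == b[j - 1]:
                out.append(a[i - 1])
                i, j = i - 1, j - 1
            elif length[i - 1][j] >= length[i][j - 1]:
                i -= 1
            else:
                j -= 1
        return "".join(reversed(out))

    print(lcs("president", "providence"))   # an LCS of length 6, e.g. priden
    print(lcs("algorithm", "alignment"))    # an LCS of length 4, e.g. algm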