1 Longest Common Subsequence Problem and Its Approximation Algorithms Kuo-Si Huang ( 黃國璽 )


1 Longest Common Subsequence Problem and Its Approximation Algorithms Kuo-Si Huang ( 黃國璽 )

2 Substring and Subsequence
String vs. Substring
–A string v is a substring of a string s if s = s1 v s2 for some prefix s1 and suffix s2.
s = TAGTCACG
v1 = TAGT
v2 = AGTCAC
v3 = TAGTCACG
…
Sequence vs. Subsequence
–A subsequence of a string s is a string obtained by deleting 0 or more characters from s.
s = TAGTCACG
s1 = TTCCG
s2 = AGCACG (no T)
s3 = TAGTCACG
…
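
The two definitions on this slide can be checked mechanically. The following is a minimal Python sketch; the function names `is_substring` and `is_subsequence` are illustrative and do not come from the slides.

```python
def is_substring(v: str, s: str) -> bool:
    """True if v occurs in s as one contiguous block, i.e. s = s1 + v + s2."""
    return v in s


def is_subsequence(t: str, s: str) -> bool:
    """True if t can be obtained from s by deleting zero or more characters."""
    remaining = iter(s)
    # `ch in remaining` advances the iterator until ch is found, so the
    # characters of t are matched left to right without reusing a position.
    return all(ch in remaining for ch in t)


s = "TAGTCACG"
print(is_substring("AGTCAC", s))    # v2 from the slide
print(is_subsequence("TTCCG", s))   # s1 from the slide
print(is_substring("TTCCG", s))     # a subsequence need not be contiguous
```

Every substring is also a subsequence, but not the other way round, which is why the LCS problem is harder than the longest common substring problem.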

3 Longest Common Subsequence (1)
2-sequence version:
–To find a longest common subsequence between two sequences.
string1 : TAGTCACG
string2 : AGACTGTC
⇒ LCS : AGACG
–Dynamic programming:

4 Longest Common Subsequence (2)
TAGTCACG
AGACTGTC
LCS: AGACG
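
The dynamic programming approach for two sequences can be sketched as follows. This is the textbook O(mn) table fill with a traceback, not a transcription of the slide's own table; the function name `lcs` is an assumption. Several different LCSs of the same length may exist, so the string returned for the slide's example need not be AGACG itself.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (O(mn) time and space)."""
    m, n = len(a), len(b)
    # dp[i][j] = length of an LCS of the prefixes a[:i] and b[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])

    # Trace back from the bottom-right corner to recover one LCS.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


result = lcs("TAGTCACG", "AGACTGTC")
print(result, len(result))
```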

5 Edit Distance
To find a shortest sequence of edit operations that transforms one string into the other.
TAGTCAC-G--
-AG--ACTGTC
Operation: DMMDDMMIMII (D = delete, M = match, I = insert)

6 2-LCS and Sequence Alignment
AGACTGTC
TAGTCACG
⇒
-AG--ACTGTC
TAGTCAC-G--
Wagner-Fischer, edit distance, O(mn) using dynamic programming
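
The link between the alignment and the LCS can be made concrete with a small sketch. It assumes that only insertions and deletions are allowed (no substitutions), which matches the D/M/I operations shown on the edit distance slide; under that assumption the distance equals m + n − 2L, where L is the LCS length. The name `indel_distance` is illustrative.

```python
def indel_distance(a: str, b: str) -> int:
    """Minimum number of insertions and deletions that turn a into b."""
    m, n = len(a), len(b)
    # d[i][j] = indel distance between the prefixes a[:i] and b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i              # delete all i characters of a[:i]
    for j in range(n + 1):
        d[0][j] = j              # insert all j characters of b[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1]                    # match, no cost
            else:
                d[i][j] = 1 + min(d[i - 1][j], d[i][j - 1])  # delete or insert
    return d[m][n]


ops = "DMMDDMMIMII"                      # operation string from the slides
print(sum(op in "DI" for op in ops))     # number of non-match operations
print(indel_distance("TAGTCACG", "AGACTGTC"))
```

Both printed values agree with m + n − 2L = 8 + 8 − 2·5 for the example strings.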

7 Algorithms
Year  Authors            Time                              Space
1974  Wagner-Fischer     O(mn)                             O(mn)
1975  Hirschberg         O(mn)                             O(n)
1977  Hunt-Szymanski     O((n+R) log n)                    O(R+n)
1977  Hirschberg         O(Ln + n log n)                   O(Ln)
1977  Hirschberg         O(L(m−L) log n)                   O((m−L)^2 + n)
1980  Masek-Paterson     O(n max{1, m/log n})              O(n^2/log n)
1982  Nakatsu et al.     O(n(m−L))                         O(m^2)
1984  Hsu-Du             O(Lm log(n/L) + Lm)               O(Lm)
1985  Ukkonen            O(Em)                             O(E min{m, E})
1986  Apostolico         O(n + m log n + D log(mn/D))      O(R+m)
1987  Kumar-Rangan       O(n(m−L))                         O(n)
1987  Apostolico-Guerra  O(Lm + n)                         O(D+n)
1990  Chin-Poon          O(n + min{D, Lm})                 O(D+n)
1992  Apostolico et al.  O(Lm)                             O(n)
1992  Eppstein et al.    O(n + D log log min{D, mn/D})     O(D+m)
Time and space complexity of algorithms computing L(u, v). Here m = |u|, n = |v|, m ≤ n, R = number of matches, L = length of a longest common subsequence, E = m + n − 2L = edit distance, D = number of dominant matches. (M. S. Paterson and V. Dancik (1994))
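
To illustrate the gap between the O(mn) and O(n) space entries, here is a sketch that computes only the LCS length while keeping a single previous row of the table. This is not Hirschberg's full algorithm, which additionally recovers the subsequence itself by divide and conquer; it shows only why linear space suffices for the length. The name `lcs_length` is an assumption.

```python
def lcs_length(a: str, b: str) -> int:
    """LCS length in O(mn) time and O(min(m, n)) extra space."""
    # Put the shorter string along the row so a row has min(m, n) + 1 cells.
    if len(b) > len(a):
        a, b = b, a
    prev = [0] * (len(b) + 1)
    for ch in a:
        curr = [0]
        for j, bj in enumerate(b, start=1):
            if ch == bj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr                      # only the latest row is kept
    return prev[-1]


print(lcs_length("TAGTCACG", "AGACTGTC"))
```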

8 Global Alignment vs. Local Alignment
[Figure: pairwise alignment, global alignment vs. local alignment]

9 Multiple Sequence Alignment
The multiple sequence alignment problem is to simultaneously align more than two sequences.
For k sequences of length n: O(n^k)
NP-Complete
–L. Wang and T. Jiang. On the complexity of multiple sequence alignment. Journal of Computational Biology, 1.
Exact multiple alignment algorithms are not feasible for many sequences.
Some approximation algorithms are given (e.g., 2 – l/k for any fixed l by Bafna et al.).

10 Counterexample for Progressive MSA
S1 = taacc
S2 = aatgg
S3 = ccggt
LCS(S1, S2) = LCS(taacc, aatgg) = aa
LCS((S1, S2), S3) = LCS(aa, ccggt) = empty (length 0)
LCS(S2, S3) = LCS(aatgg, ccggt) = gg
LCS((S2, S3), S1) = LCS(gg, taacc) = empty (length 0)
LCS(S1, S3) = LCS(taacc, ccggt) = cc
LCS((S1, S3), S2) = LCS(cc, aatgg) = empty (length 0)
LCS(S1, S2, S3) = LCS(taacc, aatgg, ccggt) = t
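
The counterexample is small enough to verify by exhaustive search. The sketch below uses a brute-force helper (exponential time, suitable only for strings this short) rather than any algorithm from the slides; the names `is_subsequence` and `lcs_brute_force` are illustrative.

```python
from itertools import combinations


def is_subsequence(t: str, s: str) -> bool:
    remaining = iter(s)
    return all(ch in remaining for ch in t)


def lcs_brute_force(strings):
    """Longest string that is a subsequence of every input (tiny inputs only)."""
    first, rest = strings[0], strings[1:]
    # Try candidate subsequences of the first string, longest first.
    for length in range(len(first), 0, -1):
        for idx in combinations(range(len(first)), length):
            cand = "".join(first[i] for i in idx)
            if all(is_subsequence(cand, s) for s in rest):
                return cand
    return ""


s1, s2, s3 = "taacc", "aatgg", "ccggt"
for x, y, z in [(s1, s2, s3), (s2, s3, s1), (s1, s3, s2)]:
    pair = lcs_brute_force([x, y])            # pairwise LCS first
    merged = lcs_brute_force([pair, z])       # then against the third string
    print(f"LCS({x}, {y}) = {pair!r}; then with {z}: {merged!r}")
print("LCS of all three:", repr(lcs_brute_force([s1, s2, s3])))
```

Whichever pair is merged first, the pairwise LCS discards the only symbol that all three strings share, so the progressive result is empty although a common subsequence of length 1 exists.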

11 Progressive Alignment
s1 = AAAAAGGG    AAAAAGGG-----
s2 = GGGAAAAA    -----GGGAAAAA
s3 = CCCCCGGG    CCCCCGGG-----
s4 = GGGCCCCC    -----GGGCCCCC
---AAAAAGGG
GGGAAAAA
CCCCCGGG
GGGCCCCC---
What to optimize?

12 k-LCS
Given k (k ≥ 2) strings S = {s1, s2, …, sk} over a finite alphabet Σ, the problem is to find a longest sequence t = a1 a2 … ap that is a subsequence of each si for all i ∈ {1, 2, …, k}.
s1 = GCCGAGTTGGCT
s2 = AGCTACAGTGCT
s3 = AGACATGTACGA
s4 = ACGCAAGTGAGC
t = GCAGTC
Easy? NP-Complete problem
–D. Maier. The complexity of some problems on subsequences and supersequences. Journal of the ACM, 25:322–336, 1978.

13 Optimal k-LCS Method
Dynamic programming: O(n^k)
Koji Hakata and Hiroshi Imai (1992)
O(nσk + Dσk(log^(k−3) n + log^(k−2) σ))
–for k sequences of length n over an alphabet of size σ, where D is the number of dominant matches.
R.W. Irving and C.B. Fraser (1992)
Algorithm 1: O(kn(n – l)^(k−1))
Algorithm 2: O(kl(n – l)^(k−1) + kσn)
–for k sequences of length n, where l is the length of an LCS and σ is the alphabet size.
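
The plain dynamic programming entry can be sketched as a memoized recursion over tuples of prefix lengths, which visits up to (n+1)^k states. This is a generic illustration of the O(n^k) idea, not the Hakata-Imai or Irving-Fraser algorithms; the name `k_lcs` is an assumption, and storing whole strings in the cache is a simplification that costs extra memory.

```python
from functools import lru_cache


def k_lcs(strings):
    """Return one longest common subsequence of all strings (exponential in k)."""
    k = len(strings)

    @lru_cache(maxsize=None)
    def best(pos):
        # pos[i] is the length of the prefix of strings[i] under consideration.
        if any(p == 0 for p in pos):
            return ""
        last = {strings[i][pos[i] - 1] for i in range(k)}
        if len(last) == 1:
            # All prefixes end in the same symbol: match it everywhere.
            return best(tuple(p - 1 for p in pos)) + strings[0][pos[0] - 1]
        # Otherwise drop the last symbol of one string at a time.
        candidates = (
            best(pos[:i] + (pos[i] - 1,) + pos[i + 1:]) for i in range(k)
        )
        return max(candidates, key=len)

    return best(tuple(len(s) for s in strings))


seqs = ["GCCGAGTTGGCT", "AGCTACAGTGCT", "AGACATGTACGA", "ACGCAAGTGAGC"]
t = k_lcs(seqs)
print(t, len(t))
```

For four sequences of length 12 the table has 13^4 = 28561 states, which is still tiny; the state count grows by a factor of n + 1 with every additional sequence.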

14 Time Complexity
1 GHz = 10^9 Hz, 1 year ≈ 3 × 10^7 seconds, so one year is about 3 × 10^16 units of time.
10^17 units of time ≈ 3 years, 10^20 units of time ≈ 3000 years
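
The conversion from operation counts to wall-clock time can be reproduced with a few lines. The sketch assumes one unit of time per clock cycle at 1 GHz and 3 × 10^7 seconds per year, as stated on the slide; the (n, k) pairs are hypothetical values chosen only to show how quickly n^k grows.

```python
CLOCK_HZ = 10**9               # 1 GHz, assuming one unit of time per cycle
SECONDS_PER_YEAR = 3 * 10**7   # rounded value used on the slide


def years_needed(units_of_time: float) -> float:
    """Years required to execute the given number of unit-time steps."""
    return units_of_time / (CLOCK_HZ * SECONDS_PER_YEAR)


# Hypothetical problem sizes for the O(n^k) dynamic programming approach.
for n, k in [(100, 5), (100, 8), (100, 10)]:
    units = n**k
    print(f"n={n}, k={k}: {units:.0e} units, about {years_needed(units):.3g} years")
```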

15 Approximate k-LCS Algorithm
Input: k sequences of length n over a finite alphabet Σ.
Output: A near-longest common subsequence of the k sequences.
Long Run: O(kn)
Expansion Algorithm: O(kn^4 log n)
Paola Bonizzoni, Gianluca Della Vedova, Giancarlo Mauri, “Experimenting an Approximation Algorithm for the LCS.” Discrete Applied Mathematics, 110(1):13-24, 2001.

16 Long Run Algorithm
s1 = GCCGAGTTGGCT (1A 5G 3C 3T)
s2 = AGCTACAGTGCT (3A 3G 3C 3T)
s3 = AGACATGTACGA (5A 3G 2C 2T)
s4 = ACGCAAGTGAGC (4A 4G 3C 1T)
Minimum count per symbol: (1A 3G 2C 1T)
t = GGG
Recall: t = GCAGTC
¼-approximation algorithm over Σ = {A, G, C, T}
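
The counting step on this slide translates directly into code. The following is a minimal sketch of the Long Run idea as it appears in the example (count each symbol per sequence, take the minimum across sequences, return the symbol with the largest minimum); the name `long_run` and the alphabetical tie-breaking are assumptions.

```python
from collections import Counter


def long_run(strings):
    """Return the longest single-symbol run a^r that is common to all strings."""
    counts = [Counter(s) for s in strings]
    best_sym, best_len = "", 0
    # A symbol missing from the first string cannot be common to all strings.
    for sym in sorted(set(strings[0])):
        r = min(c[sym] for c in counts)   # Counter yields 0 for absent symbols
        if r > best_len:
            best_sym, best_len = sym, r
    return best_sym * best_len


seqs = ["GCCGAGTTGGCT", "AGCTACAGTGCT", "AGACATGTACGA", "ACGCAAGTGAGC"]
print(long_run(seqs))
```

Counting takes one pass over each sequence, which gives the O(kn) bound quoted for Long Run.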

17 Expansion Algorithm
S = {a^4 b^3 a^4 b^2 a, a^3 b^4 a^4 b^3}
Stream: abab
Sequence of expansions: abab, a^2bab, a^2b^2ab, a^2b^2a^2b, a^2b^2a^2b^2, a^2b^2a^4b^2, a^3b^2a^4b^2, a^3b^3a^4b^2
Return: a^3b^3a^4b^2
¼-approximation algorithm over Σ = {A, G, C, T}
Time complexity: O(kn^4 log n)
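
The expansion steps listed on this slide can be checked mechanically. The sketch below does not implement the Expansion Algorithm of Bonizzoni et al.; it only expands the run-length notation, confirms that every listed step is a common subsequence of both input strings, and tests whether the returned string could be extended by raising a single exponent. All names are illustrative.

```python
def is_subsequence(t: str, s: str) -> bool:
    remaining = iter(s)
    return all(ch in remaining for ch in t)


def expand(stream: str, exponents) -> str:
    """Expand run-length notation, e.g. ('abab', (3, 3, 4, 2)) -> 'aaabbbaaaabb'."""
    return "".join(ch * e for ch, e in zip(stream, exponents))


S = ["a" * 4 + "b" * 3 + "a" * 4 + "b" * 2 + "a",   # a^4 b^3 a^4 b^2 a
     "a" * 3 + "b" * 4 + "a" * 4 + "b" * 3]         # a^3 b^4 a^4 b^3
STREAM = "abab"
# Exponents of the eight expansions listed on the slide, in order.
STEPS = [(1, 1, 1, 1), (2, 1, 1, 1), (2, 2, 1, 1), (2, 2, 2, 1),
         (2, 2, 2, 2), (2, 2, 4, 2), (3, 2, 4, 2), (3, 3, 4, 2)]


def is_common(exponents) -> bool:
    cand = expand(STREAM, exponents)
    return all(is_subsequence(cand, s) for s in S)


def can_grow(exponents) -> bool:
    """True if raising any single exponent by one keeps a common subsequence."""
    return any(
        is_common(exponents[:i] + (e + 1,) + exponents[i + 1:])
        for i, e in enumerate(exponents)
    )


for step in STEPS:
    print(expand(STREAM, step), is_common(step))
print("final step can still grow:", can_grow(STEPS[-1]))
```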

18 Semimanufacture
Old version
n = 20
s1 = AGAGCGAAGGTACGTATACT
s2 = CTTAAGACGCATCGTACTAG
t = AAGAGACGAT (10)
lcs = AGAGCATCGTATA (13)

19 Semimanufacture
Recent version
s1 = AGAGCGAAGGTACGTATACT
s2 = CTTAAGACGCATCGTACTAG
t = AGACGACGTACT (12)
lcs = AGAGCATCGTATA (13)

20 Semimanufacture
1.
s1=AGAGCGAAGGTACGTATACT
s2=CTTAAGACGCATCGTACTAG
Canonical sequence: c1=ATAGACGGACGTATACT

21 Semimanufacture
2.
s1=AGAGCGAAGGTACGTATACT
s2=CTTAAGACGCATCGTACTAG
c1=ATAGACGGACGTATACT
Canonical sequence: c2=A(T)AGACGGACGTATACT

22 Semimanufacture
3.
s1=AGAGCGAAGGTACGTATACT
s2=CTTAAGACGCATCGTACTAG
c2’=AAGACGGACGTATACT
Canonical sequence: c2’=AAGACGGACGTATACT

23 Semimanufacture
4.
s1=AGAGCGAAGGTACGTATACT
c2’=AAGACGGACGTATACT
LCS: cs1=AGACGAGCGTATACT
s2=CTTAAGACGCATCGTACTAG
c2’=AAGACGAGCGTATACT
LCS: cs2=AAGACGACGTACT

24 Semimanufacture
5.
cs1=AGACGAGCGTATACT
cs2=AAGACGACGTACT
LCS: cs=AGACGACGTACT

25 Our Time Complexity
O(kσ^2 n^2)
–where k: # of sequences, σ: # of symbols, n: length of sequence
1 GHz = 10^9 Hz, 1 year ≈ 3 × 10^7 seconds, so one year is about 3 × 10^16 units of time.
10^17 units of time ≈ 3 years, 10^20 units of time ≈ 3000 years

26 Possible Contribution
–A faster method to evaluate (estimate) the similarity of a set of sequences.
–A faster method to find a common subsequence (consensus) of several sequences.
–A faster method to generate a common subsequence that other local improvement methods can take as a starting point.

27 Conclusion
If this work is completed with good results,
–we can obtain an MSA based on the k-LCS.
–compared with other MSA methods, it is a faster tool for viewing an MSA result.
–we shall study the relation between the k-LCS and MSA in order to obtain better MSAs.
–we can apply the k-LCS to construct evolutionary trees (cf. pairwise and progressive methods).