Comparison of large sequences

Comparison of large sequences 18/09/2018 First part: Alignment of large sequences

Dynamic programming
Short sequences (up to 10.000 bps) can be aligned using dynamic programming, as you have seen this morning, at a quadratic cost in space and time.
[Figure: the sequences accaccacaccacaacgagcata … and acctgagcgatat … on the two axes of the dynamic programming table, and the resulting alignment of acc…agt with acc…a--.]
What about genomes?
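To make the quadratic cost concrete, here is a minimal sketch of a standard global-alignment score computed by dynamic programming. The function name and the scoring values (match 1, mismatch -1, gap -1) are assumptions chosen for illustration, not values taken from the slides. The table has (len(a)+1) x (len(b)+1) cells, which is where the quadratic space and time come from.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of a and b (scoring values are assumed)."""
    rows, cols = len(a) + 1, len(b) + 1
    # One cell per pair of prefixes: this table is the quadratic cost.
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap          # prefix of a aligned against gaps only
    for j in range(1, cols):
        table[0][j] = j * gap          # prefix of b aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(table[i - 1][j - 1] + pair,   # align two characters
                              table[i - 1][j] + gap,        # gap in b
                              table[i][j - 1] + gap)        # gap in a
    return table[-1][-1]


print(global_alignment_score("acgt", "act"))
```

For two sequences of a million bases each, the table would need about 10^12 cells, which is why this approach stops at short sequences.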

Genomic sequences
Genomic sequences have millions of base pairs. The sequences are 1000 times longer, so the running time is 1.000.000 times higher! (1 second becomes 11 days; 1 minute becomes 2 years.)
Dealing with genomes increases the length of the sequences 1000 times, so the quadratic cost increases at least one million times, which makes this algorithm impractical.
In which cases can dynamic programming be applied?
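The figures on this slide can be checked with a few lines of arithmetic. This is only a sketch of the scaling argument: the helper name `scaled` is made up for the example, and the only assumption is that the running time grows with the square of the sequence length.

```python
from datetime import timedelta

length_factor = 1_000                  # sequences become 1000 times longer
time_factor = length_factor ** 2       # quadratic cost: 1 000 000 times slower


def scaled(seconds: float) -> timedelta:
    """Running time after the sequences grow by length_factor."""
    return timedelta(seconds=seconds * time_factor)


print(time_factor)                           # 1000000
print(scaled(1).days)                        # 11 (days, for a 1-second run)
print(round(scaled(60).days / 365, 1))       # 1.9 (years, for a 1-minute run)
```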

First assumption
[Figure: Genome A and Genome B drawn as two parallel dotted lines, one above the other.]

Realistic assumptions?
[Figure: three sketches of Genome A against Genome B. The first two are labelled "Unrealistic assumption!"; the third, drawn with shorter segments, is labelled "More realistic assumption".]
But now, is it a real case?

Preview in a real case
Chlamydia muridarum: 1.084.689 bps
Chlamydia trachomatis: 1.057.413 bps

Preview in a real case
Pyrococcus abyssi: 1.790.334 bps
Pyrococcus horikoshii: 1.763.341 bps

Methodology of an alignment
1st: Make a preview.
2nd: Identify the portions that can be aligned.
3rd: Make the alignment.
[Figure: two sequences with the alignable portions marked; two of the steps carry the annotation "(Linear cost)".]

Methodology of an alignment
1st: Make a preview.
2nd: Identify the portions that can be aligned.
3rd: Make the alignment.
[Figure: the same diagram, now with a question mark next to the preview and a single "(Linear cost)" annotation.]

Preview revisited
MUM: Maximal Unique Matching
… a a t g….c t g...
… c g t g….c c c ...
Connect to MALGEN

Methodology of an alignment
1st: Make a preview. How can MUMs be found? With suffix trees, at linear cost.
2nd: Identify the portions that can be aligned. How can these portions be determined?
3rd: Make the alignment. With CLUSTALW, TCOFFEE, …

Bioinformatics PhD. Course. Second part: Introducing suffix trees

Suffix trees Given string ababaas: Suffixes: What kind of queries?
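As a small illustration, the suffixes of ababaas can be enumerated directly. The function name `suffixes` is only a label for this sketch; positions are 1-based, as in the leaf labels used on the slides.

```python
def suffixes(text: str) -> list:
    """Return (start position, suffix) pairs, with 1-based start positions."""
    return [(i + 1, text[i:]) for i in range(len(text))]


for position, suffix in suffixes("ababaas"):
    print(position, suffix)
```

The seven suffixes run from ababaas (position 1) down to s (position 7).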

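Applications of Suffix trees
1. Exact string matching
Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
Suffix tree of ababaas (each leaf is labelled with the rest of the suffix and its starting position):
root
  a
    ba
      baas,1
      as,3
    as,5
    s,6
  ba
    baas,2
    as,4
  s,7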
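The question on this slide can be answered by walking down from the root, one pattern character at a time. The sketch below uses an uncompressed suffix trie (nested dictionaries) instead of the compressed suffix tree of the slide, purely to keep the code short; the function names are assumptions made for the example.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of text into a nested-dict trie (quadratic size)."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})   # follow or create the child for ch
    return root


def contains(trie: dict, pattern: str) -> bool:
    """A pattern occurs in the text iff it can be spelt from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("ababaas")
for pattern in ("abab", "aab", "ab"):
    print(pattern, contains(trie, pattern))
```

Each query costs time proportional to the pattern length, independently of the length of the text.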

Quadratic insertion algorithm
Invariant properties: given the string and the suffix tree built so far,
P1: the leaves of the suffixes that start in the already processed part of the string have been inserted.

Quadratic insertion algorithm
Given the string ababaabbs, the suffixes are inserted one at a time, each starting from the root:
1: ababaabbs, 2: babaabbs, 3: abaabbs, 4: baabbs, 5: aabbs, 6: abbs, 7: bbs, 8: bs, 9: s.
[Figure sequence: the tree after each insertion; an edge is split whenever a new suffix leaves it in the middle of its label.]
Final suffix tree (each leaf is labelled with the rest of the suffix and its starting position):
root
  a
    abbs,5
    b
      bs,6
      a
        abbs,3
        baabbs,1
  b
    a
      baabbs,2
      abbs,4
    bs,7
    s,8
  s,9
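The insertion procedure can be sketched as follows: every suffix is spelt from the root, and a leaf is hung where the spelling fails, splitting an edge if the failure happens in the middle of its label. The class and function names are assumptions for this sketch, and the code assumes that the last character of the string (here s) occurs nowhere else, so that no suffix is a prefix of another.

```python
class Node:
    def __init__(self, leaf_position=None):
        self.children = {}                  # first character -> (edge label, child)
        self.leaf_position = leaf_position  # 1-based suffix start; None if internal


def build_suffix_tree(text: str) -> Node:
    """Naive construction: each suffix walks down from the root (quadratic time)."""
    if not text or text.count(text[-1]) != 1:
        raise ValueError("the last character must be a unique end marker")
    root = Node()
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            entry = node.children.get(rest[0])
            if entry is None:               # no edge starts with this character
                node.children[rest[0]] = (rest, Node(start + 1))
                break
            label, child = entry
            k = 0                           # length of the common prefix
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):             # whole edge matched: keep descending
                node, rest = child, rest[k:]
                continue
            middle = Node()                 # mismatch inside the edge: split it
            middle.children[label[k]] = (label[k:], child)
            middle.children[rest[k]] = (rest[k:], Node(start + 1))
            node.children[label[0]] = (label[:k], middle)
            break
    return root


def leaves(node: Node, prefix: str = "") -> list:
    """Collect (spelt suffix, start position) for every leaf below node."""
    if node.leaf_position is not None:
        return [(prefix, node.leaf_position)]
    found = []
    for first in sorted(node.children):
        label, child = node.children[first]
        found.extend(leaves(child, prefix + label))
    return found


tree = build_suffix_tree("ababaabbs")
print(leaves(tree))
```

Each of the n suffixes may walk over up to n characters before its leaf is placed, which gives the quadratic cost in the title.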

Generalized suffix tree
The suffix tree of many strings is called the generalized suffix tree, and it is the suffix tree of the concatenation of the strings, each followed by its own end marker.
For instance, the generalized suffix tree of ababaabb and aabaa is the suffix tree of ababaabbαaabaaβ.

Generalized suffix tree
Construction of the suffix tree of ababaabbαaabaaβ: given the suffix tree of ababaabbα, the suffixes of aabaaβ are inserted one at a time:
1: aabaaβ, 2: abaaβ, 3: baaβ, 4: aaβ, 5: aβ, 6: β.
[Figure sequence: the tree after each insertion.]
Generalized suffix tree of ababaabbαaabaaβ (leaves ending in α belong to the first string, leaves ending in β to the second):
root
  a
    a
      b
        bα,5
        aaβ,1
      β,4
    b
      a
        a
          bbα,3
          β,2
        baabbα,1
      bα,6
    β,5
  b
    a
      a
        bbα,4
        β,3
      baabbα,2
    bα,7
    α,8
  α,9
  β,6

Applications of Suffix trees
1. Exact string matching
Does the sequence ababaas contain any occurrence of the patterns abab, aab, and ab?
[Figure: suffix tree of ababaas, with leaves baas,1 baas,2 as,3 as,4 as,5 s,6 s,7.]

Applications of Suffix trees
2. The substring problem for a database of strings DB
Does the DB contain any occurrence of the patterns abab, aab, and ab?
[Figure: generalized suffix tree of ababaabbαaabaaβ.]
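A minimal sketch of the substring problem, assuming the database holds the two strings ababaabb and aabaa used in the generalized suffix tree above. It uses an uncompressed generalized suffix trie in which each node records the identifiers of the strings that pass through it, so the end markers α and β are not needed here; all names are illustrative.

```python
def build_generalized_trie(strings: list) -> dict:
    """Insert all suffixes of all strings; each node records which strings reach it."""
    root = {"ids": set(), "next": {}}
    for string_id, text in enumerate(strings):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["next"].setdefault(ch, {"ids": set(), "next": {}})
                node["ids"].add(string_id)
    return root


def strings_containing(trie: dict, pattern: str) -> set:
    """Identifiers (0-based) of the database strings that contain pattern."""
    node = trie
    for ch in pattern:
        node = node["next"].get(ch)
        if node is None:
            return set()
    return set(node["ids"])


database = ["ababaabb", "aabaa"]
trie = build_generalized_trie(database)
for pattern in ("abab", "aab", "ab"):
    print(pattern, sorted(strings_containing(trie, pattern)))
```

With these two strings, abab is found only in the first one, while aab and ab are found in both.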

Applications of Suffix trees
3. The longest common substring of two strings
[Figure: generalized suffix tree of ababaabbαaabaaβ.]
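In a generalized suffix tree, the longest common substring is spelt by the deepest node that has leaves of both strings below it. The sketch below applies the same idea to an uncompressed trie, marking each node with a bit per string; it is a simplified stand-in for the suffix tree, and the function name is an assumption.

```python
def longest_common_substring(first: str, second: str) -> str:
    """Deepest trie node reached by suffixes of both strings (first one found on ties)."""
    root = {}                     # character -> [mask, children]
    best = ""
    for mask, text in ((1, first), (2, second)):
        for start in range(len(text)):
            node = root
            for depth, ch in enumerate(text[start:], start=1):
                entry = node.setdefault(ch, [0, {}])
                entry[0] |= mask                      # this string passes through here
                if entry[0] == 3 and depth > len(best):
                    best = text[start:start + depth]  # seen in both strings
                node = entry[1]
    return best


print(longest_common_substring("ababaabb", "aabaa"))   # abaa
```

The trie has quadratic size; the compressed suffix tree gives the same answer in linear space.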

Applications of Suffix trees
5. Finding MUMs.
[Figure: generalized suffix tree of ababaabbαaabaaβ.]

Third part: Suffix links

Suffix links
[Figure sequence: the suffix tree of ababaabbα, to which the suffix links are added one by one. A suffix link goes from the internal node that spells a·x, where a is a single character and x is a string, to the node that spells x.]

Traversal using suffix links
Given S2 = a a b a a b b a, the suffix tree of S1 = ababaabbα is traversed with the help of the suffix links.
[Figure sequence: the position reached in the suffix tree of ababaabbα as S2 is read.]
Unique matchings:
S1[5..6-7] in S2 [1]
S1[3..6-8] in S2 [2]
S1[4..6-8] in S2 [3]
S1[5..8] in S2 [4]
S1[6..8] in S2 [5]
S1[7..8] in S2 [6]
Here S1[i..u-e] in S2 [j] means that S2, read from position j, matches S1[i..e], and that the match occurs only once in S1 from S1[i..u] onwards.
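The matches listed on this slide can be checked by brute force: for each position of S2, take the longest prefix of the rest of S2 that occurs somewhere in S1, and record where it occurs. This sketch does not use the suffix tree or the suffix links (which are what make the traversal linear); it only reproduces the result. The function names are assumptions, and S1 is taken without its end marker α.

```python
def occurrences(text: str, piece: str) -> list:
    """1-based start positions of piece in text."""
    return [i + 1 for i in range(len(text) - len(piece) + 1)
            if text.startswith(piece, i)]


def matching_statistics(reference: str, query: str) -> list:
    """For each 1-based start j in query: (j, longest match length, starts in reference)."""
    stats = []
    for j in range(len(query)):
        length = 0
        # Extend the match while the longer prefix still occurs in the reference.
        while j + length < len(query) and query[j:j + length + 1] in reference:
            length += 1
        match = query[j:j + length]
        starts = occurrences(reference, match) if length else []
        stats.append((j + 1, length, starts))
    return stats


for row in matching_statistics("ababaabb", "aabaabba"):
    print(row)
```

A match is unique in S1 when its list of start positions has a single entry; for instance, position 2 of S2 matches 6 characters starting at S1 position 3, which is the entry S1[3..6-8] in S2 [2].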

From UMs to MUMs
Given S1 = a b a b a a b b α and S2 = a a b a a b b a, the unique matchings are:
S1[3..6-8] in S2 [2]
S1[4..6-8] in S2 [3]
S1[5..8] in S2 [4]
S1[6..8] in S2 [5]
S1[7..8] in S2 [6]
Array of UMs (indexed by the start position in S1):
1: -, 2: -, 3: 6-8, 4: 6-8, 5: 8, 6: 8, 7: 8, 8: -, 9: -
MUM: S1[3..6-8] in S2 [2]
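The result of this slide can be verified against the definition of a MUM: a common substring that cannot be extended to the left or to the right and that occurs exactly once in each string. The sketch below checks every pair of positions by brute force, which is far slower than the suffix-tree method of the slides; the function names and the `min_length` parameter are assumptions made for the example.

```python
def count_occurrences(text: str, piece: str) -> int:
    return sum(1 for i in range(len(text) - len(piece) + 1)
               if text.startswith(piece, i))


def find_mums(s1: str, s2: str, min_length: int = 1) -> list:
    """Return (start in s1, start in s2, substring), with 1-based positions."""
    mums = []
    for i in range(len(s1)):
        for j in range(len(s2)):
            if s1[i] != s2[j]:
                continue
            # Skip matches that could be extended one character to the left.
            if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
                continue
            length = 0                      # extend to the right as far as possible
            while (i + length < len(s1) and j + length < len(s2)
                   and s1[i + length] == s2[j + length]):
                length += 1
            piece = s1[i:i + length]
            if (length >= min_length
                    and count_occurrences(s1, piece) == 1
                    and count_occurrences(s2, piece) == 1):
                mums.append((i + 1, j + 1, piece))
    return mums


print(find_mums("ababaabb", "aabaabba"))   # [(3, 2, 'abaabb')]
```

The single MUM found, abaabb at S1 position 3 and S2 position 2, is the one reported on the slide as S1[3..6-8] in S2 [2].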

Fourth part: Linear insertion algorithm

Linear insertion algorithm
Invariant properties: given the string and the suffix tree built so far,
P1: the leaves of the suffixes that start in the already processed part of the string have been inserted;
P2: the part of the string that follows is the longest string that can be spelt through the tree.

Linear insertion algorithm: example
Given the string ababaababb..., the leaves of suffixes 1 to 5 have already been inserted, and positions 6 to 9 (abab) hold the longest string that can be spelt through the tree.
[Figure sequence: the string at positions 6 to 9 (abab) is spelt through the tree and the leaf of suffix 6 is inserted below it; then positions 7 to 9 (bab), 8 to 9 (ab) and 9 (b) are handled in the same way, adding the leaves of suffixes 7, 8 and 9.]
Leaf labels in the resulting tree: aababb...,1 aababb...,2 ababb...,3 ababb...,4 ababb...,5 b...,6 b...,7 b...,8 b...,9

Suffix arrays
Reference: U. Manber and G. Myers, Suffix arrays: a new method for on-line string searches.

Suffix arrays
Given the string ababaa#, its suffixes are:
1: ababaa#
2: babaa#
3: abaa#
4: baa#
5: aa#
6: a#
7: #
… but lexicographically sorted.
[Figure: the suffixes stored in sorted order in an array with positions 1 to 7.]
What is the cost? O(n log(n))
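A minimal sketch of a suffix array for ababaa#, built by sorting the suffix start positions. Two assumptions are made: the marker # sorts before the letters (as it does in ASCII), and the suffixes are compared with a plain sort. A plain sort can cost more than the O(n log(n)) quoted on the slide, because each comparison may itself read up to n characters; the construction of Manber and Myers is what achieves that bound.

```python
def build_suffix_array(text: str) -> list:
    """1-based start positions of the suffixes of text, in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda start: text[start - 1:])


text = "ababaa#"
suffix_array = build_suffix_array(text)
print(suffix_array)                       # [7, 6, 5, 3, 1, 4, 2]
for start in suffix_array:
    print(start, text[start - 1:])
```

Under these assumptions the sorted order is #, a#, aa#, abaa#, ababaa#, baa#, babaa#.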

Applications of suffix arrays
1. Exact string matching
Does the sequence ababaa# contain any occurrence of the patterns abab, aab, and ab?
[Figure: the suffixes 1: ababaa#, 2: babaa#, 3: abaa#, 4: baa#, 5: aa#, 6: a#, 7: # stored in the suffix array, positions 1 to 7.]
Binary search … what is the cost? O(log(n) |P|)
Can it be improved to O(log(n)+|P|)?
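The binary search can be sketched as two searches over the suffix array: one for the first suffix whose prefix is not smaller than the pattern, and one for the first suffix whose prefix is larger. Each step compares at most |P| characters, which gives the O(log(n) |P|) cost. The function names are assumptions, and the suffix array is built by plain sorting with # smaller than the letters.

```python
def build_suffix_array(text: str) -> list:
    """1-based suffix start positions in lexicographic order (plain sort)."""
    return sorted(range(1, len(text) + 1), key=lambda start: text[start - 1:])


def find_occurrences(text: str, suffix_array: list, pattern: str) -> list:
    """Sorted 1-based start positions of pattern in text."""
    def prefix_at(rank: int) -> str:
        start = suffix_array[rank] - 1
        return text[start:start + len(pattern)]   # at most |P| characters

    lo, hi = 0, len(suffix_array)
    while lo < hi:                                # first rank with prefix >= pattern
        mid = (lo + hi) // 2
        if prefix_at(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(suffix_array)
    while lo < hi:                                # first rank with prefix > pattern
        mid = (lo + hi) // 2
        if prefix_at(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "ababaa#"
suffix_array = build_suffix_array(text)
for pattern in ("abab", "aab", "ab"):
    print(pattern, find_occurrences(text, suffix_array, pattern))
```

All the suffixes that start with the pattern are contiguous in the suffix array, so the two boundaries delimit every occurrence at once.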

Fast search with cost O(log(n)+|P|)
[Figure: the suffix array, positions 1 to n, with the query and the two positions α and β marked.]
Invariant properties:
P1: α < query ≤ β
P2: matches pref(query)

Fast search with cost O(log(n)+|P|)
[Figure: the suffix array, positions 1 to n, with the query and the positions α, γ and β marked.]
Invariant properties:
P1: α < query ≤ β
P2: matches pref(query)
Algorithm:
If suff(γ) < suff(query) then α = γ else β = γ
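The loop below is a sketch of this search, assuming that γ is the midpoint between α and β. It keeps the invariant α < query ≤ β and remembers how many characters of the query already match the suffixes at α and at β (the variables l and r, names chosen for this sketch), so each comparison at γ can skip the first min(l, r) characters. This shortcut alone does not guarantee O(log(n)+|P|) in the worst case; the full bound of Manber and Myers also relies on precomputed longest-common-prefix values between suffix array entries, which are omitted here.

```python
def build_suffix_array(text: str) -> list:
    """1-based suffix start positions in lexicographic order (plain sort)."""
    return sorted(range(1, len(text) + 1), key=lambda start: text[start - 1:])


def search(text: str, suffix_array: list, query: str):
    """1-based start of the smallest suffix that begins with query, or None."""
    alpha, beta = -1, len(suffix_array)   # sentinels: alpha < query <= beta
    l = r = 0                             # matched characters at alpha and at beta
    while beta - alpha > 1:
        gamma = (alpha + beta) // 2
        start = suffix_array[gamma] - 1
        k = min(l, r)                     # these characters are known to match
        while (k < len(query) and start + k < len(text)
               and text[start + k] == query[k]):
            k += 1
        suffix_is_smaller = k < len(query) and (
            start + k == len(text) or text[start + k] < query[k])
        if suffix_is_smaller:
            alpha, l = gamma, k           # suff(gamma) < query
        else:
            beta, r = gamma, k            # query <= suff(gamma)
    if beta < len(suffix_array) and r == len(query):
        return suffix_array[beta]
    return None


text = "ababaa#"
suffix_array = build_suffix_array(text)
for query in ("abab", "aab", "ab"):
    print(query, search(text, suffix_array, query))
```

The positions returned are those of the lexicographically smallest matching suffix, so ab is reported at position 3 (suffix abaa#) rather than at position 1.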