Suffix Tree of Alignment: An Efficient Index for Similar Data
JOONG CHAE NA, HEEJIN PARK, MAXIME CROCHEMORE, JAN HOLUB, COSTAS S. ILIOPOULOS, LAURENT MOUCHARD, AND KUNSOO PARK
Presented by Ramin Fallahzadeh
Problem definition
Indexing multiple data sets that are very similar:
◦Modified versions of existing data (e.g., a new version of source code)
◦Today's backup vs. yesterday's backup
◦An individual's genome vs. the human reference genome (99% identical)
Storing vs. indexing data
Storing data:
◦Using an alignment to store only the differences
◦Data compression schemes
Indexing data:
◦Example: search engines
◦Suffix tree: linear time and space complexity
◦One solution: constructing a generalized suffix tree
Generalized suffix tree
GST(A,B):
◦|A|+|B| leaves
◦O(|A|+|B|) construction time
Drawbacks:
◦Some suffixes may be stored twice. For A = aaatcaaa and B = aaatgaaa, the common suffixes {aaa, aa, a} are stored twice in the GST.
◦The two suffixes aaatcaaa and aaatgaaa are stored in distinct leaves even though they differ in a single character.
◦Therefore, for similar data, most of the leaves are redundant!
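The redundancy in the example above can be checked directly. The following is a minimal sketch that enumerates suffixes by brute force (it does not build a suffix tree); the helper name `suffixes` is hypothetical and introduced only for illustration.

```python
def suffixes(s):
    """Return the set of all non-empty suffixes of s (brute force)."""
    return {s[i:] for i in range(len(s))}

A = "aaatcaaa"
B = "aaatgaaa"

# Suffixes that occur as a suffix of both strings: these are the ones
# a generalized suffix tree would represent once per string.
shared = suffixes(A) & suffixes(B)
shared_sorted = sorted(shared, key=len, reverse=True)

print(shared_sorted)  # ['aaa', 'aa', 'a']
```

The shared suffixes are exactly the suffixes of the longest common suffix of A and B, which is why highly similar strings produce many duplicated entries.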
Contribution
Neither the suffix tree nor any variant of the suffix tree uses this similarity or alignment to index similar data efficiently!
Alignment
◦The given alignment is not required to be optimal.
◦If the time to compute an alignment matters, a near-optimal alignment can be used instead of the optimal one.
◦Since the given strings are assumed to be highly similar, a near-optimal alignment can be computed quickly with exact string matching instead of time-consuming dynamic programming.
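As an illustration of computing an alignment by exact matching only, the sketch below splits two strings into a common prefix, a differing middle part for each string, and a common suffix. This is an assumption about one simple way to obtain such an alignment, not necessarily the procedure used by the authors; the function name `simple_alignment` and the names `alpha`, `beta`, `delta`, `gamma` are hypothetical labels chosen for this example.

```python
def simple_alignment(a, b):
    """Split a and b as a = alpha+beta+gamma and b = alpha+delta+gamma.

    alpha is the longest common prefix; gamma is the longest common
    suffix that does not overlap alpha in either string. Only character
    comparisons are used, so the running time is linear.
    """
    limit = min(len(a), len(b))

    # Longest common prefix.
    p = 0
    while p < limit and a[p] == b[p]:
        p += 1

    # Longest common suffix, restricted so it cannot overlap the prefix.
    s = 0
    while s < limit - p and a[len(a) - 1 - s] == b[len(b) - 1 - s]:
        s += 1

    alpha = a[:p]
    gamma = a[len(a) - s:]
    beta = a[p:len(a) - s]    # part of a that differs
    delta = b[p:len(b) - s]   # part of b that differs
    return alpha, beta, delta, gamma


parts = simple_alignment("aaatcaaa", "aaatgaaa")
print(parts)  # ('aaat', 'c', 'g', 'aaa')
```

When the strings differ in several separate regions, this split lumps everything between the first and last difference into one middle part, so it is near-optimal only for edits concentrated in one place.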
Naïve approach
◦Construct the generalized suffix tree and delete the unnecessary leaves: not time/space-efficient!
◦The proposed algorithm is incremental: it constructs the suffix tree of A and then transforms it into the suffix tree of the alignment.
◦Apart from the suffix tree itself, the algorithm uses only constant-size extra working space, so it is more space-efficient than the naïve method.
Simple alignment α
General Alignment
Definitions
Example
Generalized suffix tree:
A = aaabaaabbaaba#
B = aaabaabaabbaba#
Example
Suffix tree of alignment:
A = aaabaaabbaaba#
B = aaabaabaabbaba#
Alignment: aaabaa(abba/baabb)aba#
Figure legend: Type-1, Type-2, Type-3, Type-4
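The bracketed alignment notation in this example can be expanded back into the two strings as a sanity check. The sketch below assumes the notation `(x/y)` means "x in A, y in B" and that everything outside the brackets is common to both; the function name `expand_alignment` is hypothetical.

```python
import re

# Matches one "(x/y)" group; x and y may be empty but contain no brackets or slashes.
_PAIR = re.compile(r"\(([^()/]*)/([^()/]*)\)")

def expand_alignment(notation):
    """Expand an alignment such as 'ab(c/d)e' into the pair ('abce', 'abde')."""
    a = _PAIR.sub(lambda m: m.group(1), notation)  # keep the left alternative
    b = _PAIR.sub(lambda m: m.group(2), notation)  # keep the right alternative
    return a, b


A, B = expand_alignment("aaabaa(abba/baabb)aba#")
print(A)  # aaabaaabbaaba#
print(B)  # aaabaabaabbaba#
```

Because every bracketed group is substituted, the same helper also handles a notation with several differing regions.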
Construction
Example
ST(A):
A = aaabaaabbaaba#
Example
ST’(A) when step A is applied:
A = aaabaaabbaaba#
B = aaabaabaabbaba#
Example
Suffix tree of alignment:
A = aaabaaabbaaba#
B = aaabaabaabbaba#
Alignment: aaabaa(abba/baabb)aba#
Suffix Tree of General Alignments
Construction
Space Complexity
Time complexity
Thank you for your attention
Any questions?