String Data Structures and Algorithms: Suffix Trees and Suffix Arrays

Presentation transcript:

String Data Structures and Algorithms: Suffix Trees and Suffix Arrays. David Fernández-Baca, UNAM (Mexico), based on notes by Srinivas Aluru; slightly modified by Benny Chor.

BBSI Summer School - Iowa State University Why Strings? Biological sequences can be viewed as strings, or finite series of characters, over an alphabet Σ. There is a wealth of algorithmic theory developed for general strings that we can apply to specific biological problems. December 5, 2018 BBSI Summer School - Iowa State University

Look-up Tables. A string of length k over Σ can be represented by an integer index i, 0 ≤ i ≤ |Σ|^k − 1. DNA is composed of four characters: Σ = {A, G, C, T}, so |Σ| = 4. We can preprocess a database into a lookup table to locate all occurrences of a query index.

Example. Let A = 0 (00), C = 1 (01), G = 2 (10), T = 3 (11). A string is converted to the integer whose binary representation is the concatenation of its characters' codes:

String  Binary  Integer
AAA     000000  0
ATA     001100  12
AAC     000001  1
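
The conversion above can be sketched in a few lines of Python. The names `CODE` and `encode` are illustrative and do not come from the slides; the sketch assumes the 2-bit code given in the example.

```python
# Assumed 2-bit code from the example: A=0, C=1, G=2, T=3.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(kmer):
    """Return the integer index of a DNA string (hypothetical helper)."""
    index = 0
    for ch in kmer:
        # Shift the index by one base-4 digit (two bits) and append the code.
        index = index * 4 + CODE[ch]
    return index

print(encode("AAA"), encode("ATA"), encode("AAC"))  # 0 12 1
```

For strings of length k the result always lies between 0 and 4^k − 1, which is the index range of the lookup table.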

Search using an Index. The table has |Σ|^k entries. Each entry holds a linked list of the occurrences of the corresponding string; a query is answered by computing its index and reading that list.

Applications of Indexing. Seeds for searching sequence databases (BLAST). Pair generation for fragment assembly in sequencing projects (the CAP3 sequence assembly program).

Indexing. Using a sparse representation, a database can be preprocessed in linear time to allow locating all instances of a short string. Major limitation: search is restricted to fixed-length strings.
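
A minimal sketch of such an index, assuming a Python dictionary as the sparse representation (only the k-mers that actually occur get an entry). The function names and the toy database "ATAATA" are made up for illustration. For clarity the sketch recomputes each index from scratch, which costs O(k) per position; a rolling update of the index would be needed to make preprocessing strictly linear.

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit code

def encode(kmer):
    """Integer index of a DNA string (hypothetical helper)."""
    index = 0
    for ch in kmer:
        index = index * 4 + CODE[ch]
    return index

def build_index(database, k):
    """Map each k-mer index to the 1-based positions where it occurs."""
    table = defaultdict(list)
    for pos in range(len(database) - k + 1):
        table[encode(database[pos:pos + k])].append(pos + 1)
    return table

def query(table, kmer):
    """Return all occurrences of a k-mer; the length must equal k."""
    return table.get(encode(kmer), [])

table = build_index("ATAATA", 3)
print(query(table, "ATA"))  # [1, 4]
print(query(table, "GGG"))  # []
```

The fixed k is baked into the table, which is exactly the limitation noted above.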

Suffix Trees. Paths from the root to the leaves represent all suffixes of S. Example: S = MALAYALAM$, with positions numbered 1 to 10. Each leaf is labeled with the starting position of its suffix:

root
  A
    LA
      M$ (leaf 6)
      YALAM$ (leaf 2)
    M$ (leaf 8)
    YALAM$ (leaf 4)
  LA
    M$ (leaf 7)
    YALAM$ (leaf 3)
  M
    ALAYALAM$ (leaf 1)
    $ (leaf 9)
  YALAM$ (leaf 5)
  $ (leaf 10)

Suffix tree properties. For a string S of length n, there are n+1 leaves (one per suffix of S$) and at most n internal nodes. The tree therefore requires only linear space, provided each edge label takes O(1) space. Each leaf represents a unique suffix. Concatenating the edge labels from the root to a leaf spells out the suffix. Each internal node represents a distinct common prefix of at least two suffixes.

Edge Encoding. Each edge label is stored as a pair (i, j), meaning the substring of S from position i to position j. For S = MALAYALAM$, with positions 1 to 10:

root
  (2, 2)
    (3, 4)
      (9, 10) (leaf 6)
      (5, 10) (leaf 2)
    (9, 10) (leaf 8)
    (5, 10) (leaf 4)
  (3, 4)
    (9, 10) (leaf 7)
    (5, 10) (leaf 3)
  (1, 1)
    (2, 10) (leaf 1)
    (10, 10) (leaf 9)
  (5, 10) (leaf 5)
  (10, 10) (leaf 10)

Naïve Suffix Tree Construction. The suffixes of S = MALAYALAM$:

1 MALAYALAM$
2 ALAYALAM$
3 LAYALAM$
4 AYALAM$
5 YALAM$
6 ALAM$
7 LAM$
8 AM$
9 M$
10 $

Before starting: why exactly do we need this $, which is not part of the alphabet?

Naïve Suffix Tree Construction (continued). The suffixes are inserted into the tree one at a time, in order. After inserting suffixes 1 to 4:

root
  MALAYALAM$ (leaf 1)
  A
    LAYALAM$ (leaf 2)
    YALAM$ (leaf 4)
  LAYALAM$ (leaf 3)

and so on for the remaining suffixes.
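
The naïve construction can be sketched as follows. This is an illustrative quadratic-time version, not one of the linear-time algorithms; the class and function names are assumptions. Edge labels are stored as index pairs (0-based and half-open here, unlike the 1-based inclusive pairs of the edge-encoding slide), and the code relies on the text ending with a unique terminator such as $, so that no suffix is a prefix of another and every suffix gets its own leaf.

```python
class Node:
    def __init__(self, start, end, suffix=None):
        self.start, self.end = start, end  # edge label is text[start:end]
        self.children = {}                 # first character of edge -> Node
        self.suffix = suffix               # 1-based suffix number (leaves only)

def build_suffix_tree(text):
    """Insert suffixes one by one; assumes text ends with a unique '$'."""
    n = len(text)
    root = Node(0, 0)
    for i in range(n):
        node, pos = root, i
        while True:
            child = node.children.get(text[pos])
            if child is None:
                # No edge starts with this character: hang a new leaf here.
                node.children[text[pos]] = Node(pos, n, suffix=i + 1)
                break
            k = child.start
            while k < child.end and text[k] == text[pos]:
                k += 1
                pos += 1
            if k == child.end:
                node = child  # whole edge matched, continue below it
                continue
            # Mismatch inside the edge: split it and add the new leaf.
            mid = Node(child.start, k)
            node.children[text[mid.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            mid.children[text[pos]] = Node(pos, n, suffix=i + 1)
            break
    return root

def leaf_numbers(node):
    """Collect the suffix numbers of all leaves below a node."""
    if not node.children:
        return [node.suffix]
    out = []
    for child in node.children.values():
        out.extend(leaf_numbers(child))
    return out

tree = build_suffix_tree("MALAYALAM$")
print(sorted(tree.children))                    # ['$', 'A', 'L', 'M', 'Y']
print(sorted(leaf_numbers(tree.children["L"])))  # [3, 7]
```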

Application: Finding a Short Pattern in a Long String. Build a suffix tree of the string. Starting from the root, traverse a path matching the characters of the pattern. If the traversal gets stuck, the pattern is not present in the string. Otherwise, each leaf below the point reached gives a position of the pattern in the string.

Finding a Pattern in a String. Example: find "ALA" in MALAYALAM$. Following the edges A and then LA from the root reaches the internal node that spells ALA. The two leaves below it give two matches, at positions 6 and 2.
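
The search can be illustrated with a small sketch. To keep it short, it uses an uncompressed suffix trie (one character per edge, nested dictionaries) as a stand-in for the suffix tree; the traversal logic is the same, but the trie takes quadratic space, so this is for illustration only. All names are hypothetical.

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie; the key 'leaf' stores the suffix number."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i + 1  # multi-character key, cannot clash with edges
    return root

def find(root, pattern):
    """Return the sorted 1-based positions where pattern occurs."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # stuck: the pattern is not present
        node = node[ch]
    positions, stack = [], [node]
    while stack:  # every leaf below the reached node is an occurrence
        current = stack.pop()
        for key, value in current.items():
            if key == "leaf":
                positions.append(value)
            else:
                stack.append(value)
    return sorted(positions)

trie = build_suffix_trie("MALAYALAM$")
print(find(trie, "ALA"))  # [2, 6]
print(find(trie, "LAL"))  # []
```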

Finding Common Substrings. Construct a generalized suffix tree for two strings (each suffix of each string is represented). Label each leaf with the suffix number and string label. Each internal node with a leaf from both strings in its subtree gives a common substring.

Generalized Suffix Tree. Example: the generalized suffix tree of WINDOW$ (string 1) and INDIGO$ (string 2), each with positions 1 to 7. Each leaf is labeled (string, suffix); for example, (1, 2) is the suffix INDOW$ of WINDOW$ and (2, 1) is the suffix INDIGO$ of INDIGO$. [Figure: generalized suffix tree of WINDOW$ and INDIGO$, with 14 leaves.]
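
As a rough sketch of the common-substring idea, the code below builds a generalized suffix trie (uncompressed, so quadratic in size and suitable only for small inputs) and marks each node with the strings whose suffixes pass through it. A node marked by both strings spells a common substring. The function name and node layout are assumptions made for this example.

```python
def longest_common_substring(s1, s2):
    """Longest substring shared by s1 and s2, via a generalized suffix trie."""
    root = {"kids": {}, "owners": set()}
    for owner, s in enumerate((s1, s2)):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "owners": set()})
                node["owners"].add(owner)  # a suffix of this string passes here
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["kids"].items():
            if len(child["owners"]) == 2:  # reachable from both strings
                label = path + ch
                if len(label) > len(best):
                    best = label
                stack.append((child, label))
    return best

print(longest_common_substring("WINDOW", "INDIGO"))  # IND
```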

Lowest Common Ancestors. The lowest common ancestor (lca) of two nodes x and y in a rooted tree is the deepest node (farthest from the root) that is an ancestor of both x and y. Concatenating the edge labels from the root to the lca of two leaves spells out the longest common prefix (lcp) of the two corresponding suffixes. lca(x, y) can be found in constant time after linear preprocessing [Bender00].

A Useful Property. The string depth of lca(i, j) equals the length of lcp(suffix i, suffix j), where lcp is the longest common prefix. In the suffix tree of MALAYALAM$, for example, the lca of leaves 6 and 2 is the node spelling ALA, whose string depth is 3.

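Longest Common Extension. The longest common extension lce(i, j) is the length of the longest common prefix of suffix i of one string and suffix j of the other. Example: for GRAINY$ (positions 1 to 7, indexed by i) and RAILWAY$ (positions 1 to 8, indexed by j), lce(1, 1) = 0 and lce(2, 1) = 3 (the common extension is RAI). We'll soon find lce's useful in reconstructing phylogenetic trees based on whole genome/proteome sequences.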

lce's and lca's. To compute lce's for two strings S1 and S2: build the generalized suffix tree, T, of S1 and S2; compute the string depth of each node in T; preprocess T for lca queries. Then lce(i, j) = string depth of the lca of suffix i of S1 and suffix j of S2.
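
For checking an implementation of the scheme above, a brute-force reference for lce is handy. The sketch below is not the constant-time method (it compares characters directly, so a query costs time proportional to the answer); the function name and argument order are assumptions, with 1-based positions as in the slides.

```python
def lce(s1, s2, i, j):
    """Length of the longest common prefix of suffix i of s1 and suffix j of s2.

    Positions are 1-based. Brute-force character comparison, for reference only.
    """
    a, b = i - 1, j - 1
    length = 0
    while (a + length < len(s1) and b + length < len(s2)
           and s1[a + length] == s2[b + length]):
        length += 1
    return length

print(lce("GRAINY$", "RAILWAY$", 1, 1))  # 0
print(lce("GRAINY$", "RAILWAY$", 2, 1))  # 3
```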

Example. In the generalized suffix tree of WINDOW$ (string 1) and INDIGO$ (string 2), lce(2, 1) compares the suffix INDOW$ of WINDOW$ with the suffix INDIGO$ of INDIGO$. It is the string depth of the lca of leaves (1, 2) and (2, 1), which is 3 (IND).

lce's, revisited. Given two strings S1 and S2, we are now interested in finding, for each i, the index j such that lce(i, j) is maximal. What is the meaning of this task? How do we accomplish it efficiently? Notice that computing the values lce(i, j) for all j is very inefficient!

Palindromes. A palindrome is a string that reads the same in both directions, e.g., CATGTAC, or "red rum, sir, is murder". Palindrome problem: find all maximal palindromes in a string S.

Finding Palindromes in S. Construct the reverse S' of S. Build the generalized suffix tree T of S and S'. Preprocess T for lce queries. Now what? Left as homework. Requirement: linear time (constant time per query).

Palindromes in DNA sequences. We sometimes need to deal with Watson-Crick complemented palindromes, where A ↔ T and C ↔ G. E.g., ATCATGAT is a complemented palindrome. All complemented palindromes in S can be found using a GST of S and the complement of S'.
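
A short check makes the definition concrete: a string is a complemented palindrome when it equals the complement of its own reverse. The helper name below is illustrative, and the sketch only tests a single string; it does not find all complemented palindromes in S.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def is_complemented_palindrome(s):
    """True if s equals the complement of its reverse."""
    return s == "".join(COMPLEMENT[ch] for ch in reversed(s))

print(is_complemented_palindrome("ATCATGAT"))  # True
print(is_complemented_palindrome("CATGTAC"))   # False (an ordinary palindrome)
```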

Suffix Array – Reducing Space. The suffix array is the lexicographic ordering of the suffixes. For S = MALAYALAM$ (positions 1 to 10, with $ ordered after the letters in this example):

6 ALAM$
2 ALAYALAM$
8 AM$
4 AYALAM$
7 LAM$
3 LAYALAM$
1 MALAYALAM$
9 M$
5 YALAM$
10 $

Suffix array: 6 2 8 4 7 3 1 9 5 10

Derive the longest common prefix (lcp) array from successive pairs of suffixes in this order: suffixes 6 and 2 share "ALA" (lcp 3), suffixes 2 and 8 share just "A" (lcp 1), and so on.
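
The two arrays can be reproduced with a naïve sketch that simply sorts the suffixes (far from the linear-time constructions, but enough to check the example). The function names are assumptions. Because the listing orders $ after the letters, the sort key below ranks "$" last; with plain ASCII ordering, "$" would sort first and the arrays would differ.

```python
def sort_key(s):
    """Order characters as in the slide: letters first, then '$'."""
    return [(ch == "$", ch) for ch in s]

def suffix_array(text):
    """1-based starting positions of the suffixes in sorted order."""
    return sorted(range(1, len(text) + 1), key=lambda i: sort_key(text[i - 1:]))

def lcp_array(text, sa):
    """lcp of each pair of successive suffixes in suffix-array order."""
    result = []
    for a, b in zip(sa, sa[1:]):
        x, y = text[a - 1:], text[b - 1:]
        length = 0
        while length < min(len(x), len(y)) and x[length] == y[length]:
            length += 1
        result.append(length)
    return result

sa = suffix_array("MALAYALAM$")
print(sa)                           # [6, 2, 8, 4, 7, 3, 1, 9, 5, 10]
print(lcp_array("MALAYALAM$", sa))  # [3, 1, 1, 0, 2, 0, 1, 0, 0]
```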

Pattern Search in Suffix Array. All suffixes that share a common prefix appear in consecutive positions in the array. A pattern P can therefore be located in the string using a binary search on the suffix array. The naïve run time is O(|P| · log n). This has been improved to O(|P| + log n) [Manber&Myers93], and to O(|P|) [Abouelhoda et al. 02].
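
The naïve O(|P| · log n) search can be sketched with two binary searches that delimit the block of suffixes starting with P. The names are hypothetical, the suffix array is built by plain sorting for brevity, and "$" is ranked after the letters to match the example ordering; the pattern is assumed not to contain "$".

```python
def sort_key(s):
    """Letters first, then '$', as in the example ordering."""
    return [(ch == "$", ch) for ch in s]

def suffix_array(text):
    return sorted(range(1, len(text) + 1), key=lambda i: sort_key(text[i - 1:]))

def find_occurrences(text, sa, pattern):
    """Sorted 1-based positions of pattern, via binary search on sa."""
    m, target = len(pattern), sort_key(pattern)

    def prefix(rank):
        start = sa[rank] - 1
        return sort_key(text[start:start + m])  # first m characters of the suffix

    lo, hi = 0, len(sa)
    while lo < hi:  # first rank whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < target:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(sa)
    while lo < hi:  # first rank whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= target:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])

text = "MALAYALAM$"
sa = suffix_array(text)
print(find_occurrences(text, sa, "ALA"))  # [2, 6]
```

Each comparison inspects at most |P| characters, and there are O(log n) comparisons, which gives the naïve bound quoted above.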

Example. For the text MALAYALAM$, with the suffixes ordered as in the listing of sorted suffixes (6 ALAM$, 2 ALAYALAM$, 8 AM$, 4 AYALAM$, 7 LAM$, 3 LAYALAM$, 1 MALAYALAM$, 9 M$, 5 YALAM$, 10 $):

Position      1 2 3 4 5 6 7 8 9 10
Text          M A L A Y A L A M $
Suffix array  6 2 8 4 7 3 1 9 5 10
lcp array     3 1 1 0 2 0 1 0 0

Entry r of the lcp array is the lcp of the suffixes at ranks r and r + 1 of the suffix array.

Suffix Trees vs. Suffix Arrays. The suffix array is the lexicographic order of the leaves of the suffix tree. Conversely, the suffix tree can be recovered from the suffix array together with the lcp array (why? Wait for the next slide).

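Building a ST from a SA and a lcp array. Scan the suffix array from left to right. The lcp value of each pair of successive suffixes gives the string depth D of the internal node at which their paths diverge. For MALAYALAM$:

SA   6 2 8 4 7 3 1 9 5 10
lcp  3 1 1 0 2 0 1 0 0

Suffixes 6 and 2 diverge at the node ALA (D = 3); suffixes 2, 8, and 4 diverge at the node A (D = 1); suffixes 7 and 3 diverge at the node LA (D = 2). [Figure: the part of the suffix tree built so far, with internal nodes A (D = 1), LA (D = 2), and ALA (D = 3) and leaves 6, 2, 8, 4, 7, 3.]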
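
One way to see why the two arrays determine the tree is the recursive sketch below. It is a simple quadratic-time illustration, not the linear-time stack-based scan: within any block of consecutive suffixes, the minimum lcp value is the string depth of the node covering the block, and the positions where that minimum occurs separate its children. The function name and the nested-tuple output format are assumptions.

```python
def build_tree(sa, lcp, lo, hi):
    """Shape of the suffix tree over ranks lo..hi (inclusive).

    lcp[r] is the lcp of the suffixes at ranks r and r + 1.
    A leaf is returned as its suffix number; an internal node as
    (string_depth, [children]).
    """
    if lo == hi:
        return sa[lo]
    depth = min(lcp[lo:hi])  # string depth of the node covering this block
    children, start = [], lo
    for r in range(lo, hi):
        if lcp[r] == depth:  # paths diverge at this node between r and r + 1
            children.append(build_tree(sa, lcp, start, r))
            start = r + 1
    children.append(build_tree(sa, lcp, start, hi))
    return (depth, children)

sa = [6, 2, 8, 4, 7, 3, 1, 9, 5, 10]
lcp = [3, 1, 1, 0, 2, 0, 1, 0, 0]
print(build_tree(sa, lcp, 0, len(sa) - 1))
# (0, [(1, [(3, [6, 2]), 8, 4]), (2, [7, 3]), (1, [1, 9]), 5, 10])
```

The nodes at depths 1, 3, and 2 in the output are the nodes A, ALA, and LA of the suffix tree of MALAYALAM$.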

Known (amazing) Results. Suffix trees can be constructed in O(n) time and O(n · |Σ|) space [Weiner73, McCreight76, Ukkonen92]. Suffix arrays can be constructed without using suffix trees in O(n) time [Ko&Aluru03].

More Applications. Suffix-prefix overlaps in fragment assembly; maximal and tandem repeats; shortest unique substrings; maximal unique matches [MUMmer]; approximate matching; phylogenies based on complete genomes.

Dealing with errors. The basic string data structures can only extract information in the absence of errors. To deal with errors, decompose the problem into parts that do not involve errors.

The k-mismatch problem. Given a pattern P, a text T, and a number k, find all occurrences of P in T with at most k mismatches. Example: P = bend, T = abentbananaend, k = 2. Match 1: bent (position 2, one mismatch). Match 2: bana (position 6, two mismatches). Match 3: aend (position 11, one mismatch).

Solution. Build the GST of P and T and preprocess it for lce queries. For each starting index i in T, O(k) lce queries suffice to determine whether P occurs at i with at most k mismatches: each query jumps over a mismatch-free stretch to the next mismatch. Time = O(k |T|).
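
The jumping scheme can be sketched as follows. The `lce` helper here is a brute-force stand-in, so this version does not achieve O(k |T|); with the constant-time lce queries described above in its place, the outer structure stays the same and the bound holds. Function names are illustrative, and reported positions are 1-based.

```python
def lce(p, t, a, b):
    """Brute-force longest common extension of p[a:] and t[b:] (0-based)."""
    length = 0
    while (a + length < len(p) and b + length < len(t)
           and p[a + length] == t[b + length]):
        length += 1
    return length

def k_mismatch(p, t, k):
    """1-based start positions where p matches t with at most k mismatches."""
    matches = []
    for i in range(len(t) - len(p) + 1):
        j, mismatches = 0, 0
        while True:
            j += lce(p, t, j, i + j)  # jump over a mismatch-free stretch
            if j >= len(p):
                break                 # reached the end of the pattern
            mismatches += 1           # position j is a mismatch
            if mismatches > k:
                break
            j += 1                    # step past the mismatch
        if mismatches <= k:
            matches.append(i + 1)
    return matches

print(k_mismatch("bend", "abentbananaend", 2))  # [2, 6, 11]
```

Each alignment uses at most k + 1 jumps before it is either accepted or abandoned.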

References.
M. I. Abouelhoda, S. Kurtz and E. Ohlebusch, The enhanced suffix array and its applications to genome analysis, 2nd Workshop on Algorithms in Bioinformatics, pp. 449-463, 2002.
M. A. Bender and M. Farach-Colton, The LCA problem revisited, LATIN, pp. 88-94, 2000.
P. Ko and S. Aluru, Linear time suffix sorting, CPM, pp. 200-210, 2003.
U. Manber and G. Myers, Suffix arrays: a new method for on-line string searches, SIAM J. Comput., 22:935-948, 1993.
E. M. McCreight, A space-economical suffix tree construction algorithm, J. ACM, 23(2):262-272, 1976.
E. Ukkonen, Constructing suffix trees on-line in linear time, Intern. Federation of Information Processing, pp. 484-492, 1992. Also in Algorithmica, 14(3):249-260, 1995.
P. Weiner, Linear pattern matching algorithms, Proc. of the 14th IEEE Annual Symp. on Switching and Automata Theory, pp. 1-11, 1973.