Suffix trees.

Slides:



Advertisements
Similar presentations
Text Indexing The Suffix Array. Basic notation and facts Occurrences of P in T = All suffixes of T having P as a prefix SUF(T) = Sorted set of suffixes.
Advertisements

Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Paolo Ferragina, Università di Pisa Compressed Permuterm Index Paolo Ferragina Dipartimento di Informatica, Università di Pisa.
Text Indexing The Suffix Array. Basic notation and facts Occurrences of P in T = All suffixes of T having P as a prefix SUF(T) = Sorted set of suffixes.
Suffix Trees Construction and Applications João Carreira 2008.
Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.
Constant-Time LCA Retrieval
Suffix Trees Specialized form of keyword trees New ideas –preprocess text T, not pattern P O(m) preprocess time O(n+k) search time –k is number of occurrences.
What about the trees of the Mississippi? Suffix Trees explained in an algorithm for indexing large biological sequences Jacob Kleerekoper & Marjolijn Elsinga.
Suffix Trees and Suffix Arrays
1 Suffix Trees Charles Yan Suffix Trees: Motivations Substring problem: One is given a text T of length m. After O (m) preprocessing time, one.
Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.
1 Suffix Trees © Jeff Parker, Outline An introduction to the Suffix Tree Some sample applications How to build a Suffix Tree efficiently.
15-853Page : Algorithms in the Real World Suffix Trees.
296.3: Algorithms in the Real World
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search: suffix trees)
Suffix Trees Suffix trees Linearized suffix trees Virtual suffix trees Suffix arrays Enhanced suffix arrays Suffix cactus, suffix vectors, …
Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.
21/05/2015Applied Algorithmics - week51 Off-line text search (indexing)  Off-line text search refers to the situation in which a preprocessed digital.
Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Suffix Trees and Their Uses.
Lowest common ancestors. Write an Euler tour of the tree LCA(1,5) = 3 Shallowest node.
Suffix arrays. Suffix array We loose some of the functionality but we save space. Let s = abab Sort the suffixes lexicographically: ab, abab, b, bab The.
1 Lempel-Ziv algorithms Burrows-Wheeler Data Compression.
Suffix arrays. Suffix array We loose some of the functionality but we save space. Let s = abab Sort the suffixes lexicographically: ab, abab, b, bab The.
Suffix trees.
Suffix trees and suffix arrays presentation by Haim Kaplan.
Suffix trees. Trie A tree representing a set of strings. a c b c e e f d b f e g { aeef ad bbfe bbfg c }
Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University.
1 Exact Set Matching Charles Yan Exact Set Matching Goal: To find all occurrences in text T of any pattern in a set of patterns P={p 1,p 2,…,p.
Database Index to Large Biological Sequences Ela Hunt, Malcolm P. Atkinson, and Robert W. Irving Proceedings of the 27th VLDB Conference,2001 Presented.
Noiseless Coding. Introduction Noiseless Coding Compression without distortion Basic Concept Symbols with lower probabilities are represented by the binary.
String Matching with k Mismatches Moshe Lewenstein Bar Ilan University Modified by Ariel Rosenfeld.
Improved string matching with k mismatches (The Kangaroo Method) Galil, R. Giancarlo SIGACT News, Vol. 17, No. 4, 1986, pp. 52–54 Original: Moshe Lewenstein.
Tries1. 2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3)
Suffix trees. Trie A tree representing a set of strings. a b c e e f d b f e g { aeef ad bbfe bbfg c }
Comp. Genomics Recitation 3 (week 4) 26/3/2009 Multiple Hypothesis Testing+Suffix Trees Based in part on slides by William Stafford Noble.
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval" Basics
15-853Page :Algorithms in the Real World Data Compression III Lempel-Ziv algorithms Burrows-Wheeler Introduction to Lossy Compression.
Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.
Exact String Matching Algorithms. Copyright notice Many of the images in this power point presentation of other people. The Copyright belong to the original.
Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.
Generic Trees—Trie, Compressed Trie, Suffix Trie (with Analysi
Linear Time Suffix Array Construction Using D-Critical Substrings
Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
15-853:Algorithms in the Real World
Succinct Data Structures
COMP261 Lecture 22 Data Compression 2.
Tries 07/28/16 11:04 Text Compression
COMP9319 Web Data Compression and Search
Two equivalent problems
Andrzej Ehrenfeucht, University of Colorado, Boulder
Ariel Rosenfeld Bar-Ilan Uni.
Tries 9/14/ :13 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
Strings: Tries, Suffix Trees
String Data Structures and Algorithms: Suffix Trees and Suffix Arrays
Greedy Algorithms Many optimization problems can be solved more quickly using a greedy approach The basic principle is that local optimal decisions may.
String Data Structures and Algorithms
String Data Structures and Algorithms
Greedy: Huffman Codes Yin Tat Lee
Suffix trees and suffix arrays
Tries 2/23/2019 8:29 AM Tries 2/23/2019 8:29 AM Tries.
Suffix Trees String … any sequence of characters.
Tries 2/27/2019 5:37 PM Tries Tries.
String Matching with k Mismatches
Strings: Tries, Suffix Trees
INF 141: Information Retrieval
CPS 296.3:Algorithms in the Real World
Presentation transcript:

Suffix trees

Trie A tree representing a set of strings. c { a aeef b ad bbfe bbfg e

Trie (Cont) Assume no string is a prefix of another c a b e b d e f c Each edge is labeled by a letter, no two edges outgoing from the same node are labeled the same. Each string corresponds to a leaf. a b e b d e f c f e g

Compressed Trie Compress unary nodes, label edges by strings c c a a b  c a a b e b d bbf d eef e f c f c e g e g

Suffix tree Given a string s a suffix tree of s is a compressed trie of all suffixes of s To make these suffixes prefix-free we add a special character, say $, at the end of s

Suffix tree (Example) Let s=abab, a suffix tree of s is a compressed trie of all suffixes of s=abab$ $ { $ b$ ab$ bab$ abab$ } a b b $ a a $ b b $ $

Trivial algorithm to build a Suffix tree Put the largest suffix in a b $ a b b a Put the suffix bab$ in a b b $ $

a b b a a b b $ $ Put the suffix ab$ in a b b a b $ a $ b $

a b b a b $ a $ b $ Put the suffix b$ in a b b $ a a $ b b $ $

a b b $ a a $ b b $ $ $ Put the suffix $ in a b b $ a a $ b b $ $

$ a b b $ a a $ b b $ $ We will also label each leaf with the starting point of the corres. suffix. $ a b b 5 $ a a $ b 4 b $ $ 3 2 1

Analysis Takes O(n2) time to build. You can do it in O(n) time

What can we do with it ? Exact string matching: Given a Text T, |T| = n, preprocess it such that when a pattern P, |P|=m, arrives you can quickly decide if it occurs in T. We may also want to find all occurrences of P in T

Exact string matching In preprocessing we just build a suffix tree in O(n) time $ a b b 5 $ a a $ b 4 b $ $ 3 2 1 Given a pattern P = ab we traverse the tree according to the pattern.

By traversing this subtree we get all k occurrences in O(n+k) time $ a b b 5 $ a a $ b 4 b $ $ 3 2 1 If we did not get stuck traversing the pattern then the pattern occurs in the text. Each leaf in the subtree below the node we reach corresponds to an occurrence. By traversing this subtree we get all k occurrences in O(n+k) time

Generalized suffix tree Given a set of strings S a generalized suffix tree of S is a compressed trie of all suffixes of s  S To associate each suffix with a unique string in S add a different special char to each s

Generalized suffix tree (Example) Let s1=abab and s2=aab here is a generalized suffix tree for s1 and s2 # { $ # b$ b# ab$ ab# bab$ aab# abab$ } $ a b 5 4 # $ a b a b 3 b $ 4 # a # $ 2 b 1 $ 3 2 1

So what can we do with it ? Matching a pattern against a database of strings

Longest common substring (of two strings) Every node with a leaf descendant from string s1 and a leaf descendant from string s2 represents a maximal common substring and vice versa. # $ a b 5 4 # $ a b a b 3 b 4 # $ a # $ 2 b 1 Find such node with largest “string depth” $ 3 2 1

Lowest common ancetors A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it

Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes # $ a b 5 4 # $ a b a b 3 b 4 # $ a # $ 2 b 1 $ 3 2 1

Lowest common ancestors

Write an Euler tour of the tree 3 Shallowest node 12 9 3 8 1 3 2 3 1 4 1 7 1 3 5 6 5 3 2 5 1 11 4 10 7 5 6 2 4 7 6 LCA(1,5) = 3

3 12 9 3 8 1 2 5 1 11 4 10 7 5 6 minimum 2 4 7 6 3 2 3 1 4 1 7 1 3 5 6 5 3 1 1 2 1 2 1 1 2 1

Range minimum Preprocess an array, such that given i,j you can find the minimum in [i,j] fast 3 12 9 3 8 1 2 5 Reduction takes linear time 1 11 4 10 7 5 6 minimum 2 4 7 6 3 2 3 1 4 1 7 1 3 5 6 5 3 1 1 2 1 2 1 1 2 1

Trivial algorithms for RMQ

Less trivial algorithms to RMQ Try to use O(nlog(n)) space to do a query in O(1) time

Lowest common ancetors A lot more can be gained from the suffix tree if we preprocess it so that we can answer LCA queries on it

Why? The LCA of two leaves represents the longest common prefix (LCP) of these 2 suffixes # $ a b 5 4 # $ a b a b 3 b 4 # $ a # $ 2 b 1 $ 3 2 1

Finding maximal palindromes A palindrome: caabaac, cbaabc Want to find all maximal palindromes in a string s Let s = cbaaba The maximal palindrome with center between i-1 and i is the LCP of the suffix at position i of s and the suffix at position m-i+1 of sr

Maximal palindromes algorithm Prepare a generalized suffix tree for s = cbaaba$ and sr = abaabc# For every i find the LCA of suffix i of s and suffix m-i+1 of sr

Let s = cbaaba$ then sr = abaabc# 7 a b 7 $ b a c # baaba$ c # 6 c # a $ a b 6 a $ 4 abc # 5 5 3 3 $ a $ c # 4 1 2 2 1

Analysis O(n) time to identify all palindromes

Compression

LZ overview Cursor a a c a a c a b c a b a b a c previously coded Find the longest prefix of S[i,m] that appears in S[1,i-1] write its position, length

a c b (0,0,a)

a c b (0,0,a)

a c b (0,0,a)(1,1,c)

a c b (0,0,a)(1,1,c)

a c b (0,0,a)(1,1,c)(1,3,a)

a c b (0,0,a)(1,1,c)(1,3,a)

a c b (0,0,a)(1,1,c)(1,3,a)(0,0,b)

a c b (0,0,a)(1,1,c)(1,3,a)(0,0,b)

a c b (0,0,a)(1,1,c)(1,3,a)(0,0,b)(6,3,a)

a c b (0,0,a)(1,1,c)(1,3,a)(0,0,b)(6,3,a)

a c b Implement by maintaining a suffix tree for the coded part  linear time and space

Dictionary (previously coded) LZ77: Sliding Window Cursor a c b Dictionary (previously coded) Lookahead Buffer Dictionary and buffer “windows” are fixed length and slide with the cursor On each step: Output (p,l,c) p = relative position of the longest match in the dictionary l = length of longest match c = next char in buffer beyond longest match Advance window by l + 1

LZ77: Example (0,0,a) a a c a a c a b c a b a a a c a c b (1,1,c) a c Dictionary (size = 6) Longest match Next character

LZ77 Optimizations used by gzip LZSS: Output one of the following formats (0, position, length) or (1,char) Typically use the second format if length < 3. (1,a) a a c a a c a b c a b a a a c (1,a) a a c a a c a b c a b a a a c (1,c) a a c a a c a b c a b a a a c (0,3,4) a a c a a c a b c a b a a a c

Suffix array We loose some of the functionality but we save space. Let s = abab Sort the suffixes lexicographically: ab, abab, b, bab The suffix array gives the indices of the suffixes in sorted order 2 3 1

How do we build it ? Build a suffix tree Traverse the tree in in-order, lexicographically picking edges outgoing from each node and fill the suffix array. O(n) time Can also build it directly

Example Let S = mississippi i ippi issippi Let P = issa ississippi 7 4 1 9 8 6 3 10 5 2 ippi issippi Let P = issa ississippi mississippi pi M ppi sippi sisippi ssippi ssissippi R

How do we search for a pattern ? If P occurs in T then all its occurrences are consecutive in the suffix array. Do a binary search on the suffix array Takes O(mlogn) time Can also do it in O(m+log(n)) with an additional array