Nov 11 2005String algorithms, Q4 20051 Ukkonen’s suffix tree algorithm ● Recall McCreight’s approach: – For i = 1.. n+1, build compressed trie of {x[j..n]$

Slides:



Advertisements
Similar presentations
On-line Construction of Suffix Trees Chairman : Prof. R.C.T. Lee Speaker : C. S. Wu ( ) June 10, 2004 Dept. of CSIE National Chi Nan University.
Advertisements

Boosting Textual Compression in Optimal Linear Time.
AVL Trees COL 106 Amit Kumar Shweta Agrawal Slide Courtesy : Douglas Wilhelm Harder, MMath, UWaterloo
Greedy Algorithms Amihood Amir Bar-Ilan University.
Suffix Trees Construction and Applications João Carreira 2008.
Suffix Tree. Suffix Tree Representation S=xabxac Represent every edge using its start and end text location.
Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Linear Time Construction of Suffix Tree.
Two implementation issues Alphabet size Generalizing to multiple strings.
1 Suffix Trees Charles Yan Suffix Trees: Motivations Substring problem: One is given a text T of length m. After O (m) preprocessing time, one.
15-853Page : Algorithms in the Real World Suffix Trees.
296.3: Algorithms in the Real World
1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search: suffix trees)
1 Data structures for Pattern Matching Suffix trees and suffix arrays are a basic data structure in pattern matching Reported by: Olga Sergeeva, Saint.
Alon Efrat Computer Science Department University of Arizona Suffix Trees.
Advanced Algorithm Design and Analysis (Lecture 4) SW5 fall 2004 Simonas Šaltenis E1-215b
Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Linear Time Construction of Suffix Tree.
Suffix trees and suffix arrays presentation by Haim Kaplan.
B + -Trees (Part 1) Lecture 20 COMP171 Fall 2006.
B + -Trees (Part 1). Motivation AVL tree with N nodes is an excellent data structure for searching, indexing, etc. –The Big-Oh analysis shows most operations.
Tirgul 6 B-Trees – Another kind of balanced trees Problem set 1 - some solutions.
On-line Construction of Suffix Tree Esko Ukkonen Algorithmica Vol. 14, No. 3, pp , 1995.
Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.
1. 2 Overview  Suffix tries  On-line construction of suffix tries in quadratic time  Suffix trees  On-line construction of suffix trees in linear.
String Matching with k Mismatches Moshe Lewenstein Bar Ilan University Modified by Ariel Rosenfeld.
Improved string matching with k mismatches (The Kangaroo Method) Galil, R. Giancarlo SIGACT News, Vol. 17, No. 4, 1986, pp. 52–54 Original: Moshe Lewenstein.
CS 473Lecture X1 CS473-Algorithms I Lecture X1 Properties of Ranks.
5.2 Trees  A tree is a connected graph without any cycles.
Suffix trees. Trie A tree representing a set of strings. a b c e e f d b f e g { aeef ad bbfe bbfg c }
Union Find ADT Data type for disjoint sets: makeSet(x): Given an element x create a singleton set that contains only this element. Return a locator/handle.
TU/e Algorithms (2IL15) – Lecture 8 1 MAXIMUM FLOW (part II)
Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.
Nov String Algorithms, Q Algorithmic programming...and testing your suffix trees...
Generic Trees—Trie, Compressed Trie, Suffix Trie (with Analysi
15-853:Algorithms in the Real World
Computational Geometry
Lecture 15 Nov 3, 2013 Height-balanced BST Recall:
McCreight's suffix tree construction algorithm
Andrzej Ehrenfeucht, University of Colorado, Boulder
Mark Redekopp David Kempe
Lecture 15 Pumping Lemma.
CMPS 3130/6130 Computational Geometry Spring 2017
B+-Trees.
Ukkonen's suffix tree construction algorithm
Chapter 5. Optimal Matchings
CS200: Algorithms Analysis
CSE 421: Introduction to Algorithms
Advanced Algorithms Analysis and Design
Lecture 9 Algorithm Analysis
Algorithms (2IL15) – Lecture 2
עידן שני ביה"ס למדעי המחשב אוניברסיטת תל-אביב
Indexing and Hashing Basic Concepts Ordered Indices
Advanced Algorithms Analysis and Design
Lecture 9 Algorithm Analysis
Lectures on Graph Algorithms: searching, testing and sorting
3.4 Push-Relabel(Preflow-Push) Maximum Flow Alg.
Lecture 9 Algorithm Analysis
Suffix trees.
A Robust Data Structure
Suffix trees and suffix arrays
Suffix Arrays and Suffix Trees
Algorithms (2IL15) – Lecture 7
Simple Sorting Algorithms
String Matching with k Mismatches
Heapsort Sorting in place
Clustering.
Binomial heaps, Fibonacci heaps, and applications
Simple Sorting Algorithms
Huffman Coding Greedy Algorithm
Analysis of Algorithms CS 477/677
Max Flows 3 Preflow-Push Algorithms
Presentation transcript:

Nov String algorithms, Q Ukkonen’s suffix tree algorithm ● Recall McCreight’s approach: – For i = 1.. n+1, build compressed trie of {x[j..n]$ | j ≤ i} ● Ukkonen’s approach: – For i = 1.. n+1, build compressed trie of {x[j..i]$ | j ≤ i} – Compressed trie of all suffixes of prefix x[1..i]$ of x$ – A suffix tree except for “leaf” property

Nov String algorithms, Q McCreight’s algorithm, x=aba T 1 : {x$[j..n] | j ≤ 1 } aba$ T 2 : {x$[j..n] | j ≤ 2 } aba$ $ab T 3 : {x$[j..n] | j ≤ 3 } a ba$ $ab $ T 4 : {x$[j..n] | j ≤ 4 } a ba$ $ab $ $

Nov String algorithms, Q Ukkonen’s algorithm, x=aba T 1 : {x$[j..1]} a T 2 : {x$[j..2]} ab b T 3 : {x$[j..3]} a ba ab T 4 : {x$[j..n]} Note: no node for x[3..3] = “a” a ba$ $ab $ $

Nov String algorithms, Q Tasks in iteration i ● In iteration i we must – Update each x[j..i] to x[j..i+1] – Add string x[i+1](special case of above) j j x[j..i] x[i+1] x[j..i] x[i+1] j x[j..i] x[i+1] x[j..i] x[i+1] j Leaf:Inner node:Edge:

Nov String algorithms, Q Simple algorithm ● “Obvious” algorithm: – Running time O(n 3 ) – Need lots of tricks to get O(n)! For i=1,...,n+1: for j=1,...,i: find x[j..i] append x[i+1]

Nov String algorithms, Q “Free” operations – leaves ● If we label leaves with (k,∞) – denoting “k to the current i”, updating a leaf is automatic j j x[j..i] x[i+1] x[j..i] x[i+1] j x[j..i] x[i+1] x[j..i] x[i+1] j Leaf:Inner node:Edge:

Nov String algorithms, Q “Free” operations – existing strings ● If x[j..i+1] is already in the tree, the update is automatic j j x[j..i] x[i+1] x[j..i] x[i+1] j x[j..i] x[i+1] x[j..i] x[i+1] j Leaf:Inner node:Edge:

Nov String algorithms, Q “Real” operations ● If we can recognize the free operations, we need only deal with the remaining j j x[j..i] x[i+1] x[j..i] x[i+1] j x[j..i] x[i+1] x[j..i] x[i+1] j Leaf:Inner node:Edge:

Nov String algorithms, Q Lemma ● Let j denote suffix x[j..i] of x[1..i] a)If j>1 is a leaf node in T i, then so is j-1 b)If, from j<i, there is a path in T i that begins with a, then there is a path in T i from j+1 beginning with a

Nov String algorithms, Q Proof of lemma (a) ● If j>1 is a leaf node in T i, then so is j-1 Assume j-1 is not a leaf. Then there exists k<j-1 such that: x[j-1..i] x[k..i] x[j-1..i] x[k..i] j-1 k Then j-1 k j k+1 thus: x[j..i] x[k+1..i] x[j..i] x[k+1..i]

Nov String algorithms, Q Proof of lemma (b) ● If, from j<i, there is a path in T i that begins with a, then there is a path in T i from j+1 beginning with a Assume j is followed by “a” there exists k<j such that: j k “a” thus: j+1 k+1 “a” Hence j+1 is followed by “a”.

Nov String algorithms, Q Corollary of lemma ● In iteration i, there exist indices j L and j R such that: – All suffixes j≤j L are leaves – All suffixes j≥j R are already in the trie jLjL jRjR II II I

Nov String algorithms, Q Corollary of lemma ● I and III are free operations jLjL jRjR II II I j j x[j..i] x[i+1] x[j..i] x[i+1] j x[j..i] x[i+1] x[j..i] x[i+1] j

Nov String algorithms, Q Updated algorithm ● Implicitly handling “free” operations: – j L in iteration i is the last leaf inserted in iteration i-1 (“once a leaf, always a leaf”) – j R in iteration i is the first index where x[j..i+1] is already in the trie For i=1,...,n+1: for j=j L,...,j R : find x[j..i] append x[i+1]

Nov String algorithms, Q Handling index j in II ● Whenever j L < j < j R, j is made a leaf: x[j..i] x[i+1] j x[j..i] x[i+1] j ● Once j is a leaf, it will be in I and never in II again

Nov String algorithms, Q Handling index j in II ● We handle j in II or implicitly in III 2n times: Iterations Index j in II 2 2 n+1 “Path” length is 2n

Nov String algorithms, Q Runtime ● Running time is 2n*T(find x[j..i]) – We just have to deal with T(find x[j..i]) in O(1) – No worries! For i=1,...,n+1: for j=j L,...,j R : find x[j..i] append x[i+1] Constant time Only 2n of these:

Nov String algorithms, Q Using fastscan and s(-) ● When searching for x[j..i], it is already in the trie – We can use fastscan for the search – T(find x[j..i]) in O(d) where d is the (node-)depth of x[j..i] ● If we keep suffix links, s(-), in the tree we can use these as shortcuts

Nov String algorithms, Q Suffix links ● Invariant: All inner nodes have suffix links ● Ensuring the invariant: – We only insert inner nodes x[j..i] when adding leaves j – Whenever we insert a new node, x[j..i] for some j<i, we also find or insert x[j+1..i], and can update s(x[j..i]) := x[j+1..i] – If we insert x[i..i], then s(x[i..i]) :=

Nov String algorithms, Q Finding x[j+1..i] from x[j..i] parent(j) s(parent(j)) s x[j+1..i] j Starting from here (initial j is j L and we can keep a pointer to that node between iterations) Using fastscan here

Nov String algorithms, Q Bound on fastscan ● Time usage by fastscan is bounded by – n – for the maximal (node-)depth in the trie – + total decrease of (node-)depth ● Decrease in depth: – Moving to parent(j): 1 – Moving to s(parent(j)): max 1 – “Restarting” at j L : ?

Nov String algorithms, Q Hacking the suffix links ● When searching for x[j L +1..i], update s(x[j L..i]) to point to the nearest ancestor of x[j L +1..i] parent(j L ) s(parent(j L )) s x[j L +1..i] jLjL Nearest ancestor of x[j L +1..i] ● “Restarting” becomes essentially free

Nov String algorithms, Q Running time Iterations Index j in II 2 2 n+1 ● Vertical steps are paid for by the previous horizontal step (free restarting) ● Horizontal steps are total fastscan bounded by O(n) ● Runtime O(n)

Nov String algorithms, Q Why Ukkonen? ● Ukkonen’s algorithm is an “online” algorithm: – As long as no suffix is a prefix of another, the intermediate trees are suffix trees – Generalized suffix trees can be built one string at a time