18 April Suffix Trees & Information Retrieval Harry Hochheiser

Slides:



Advertisements
Similar presentations
On-line Construction of Suffix Trees Chairman : Prof. R.C.T. Lee Speaker : C. S. Wu ( ) June 10, 2004 Dept. of CSIE National Chi Nan University.
Advertisements

Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Introduction to Computer Science 2 Lecture 7: Extended binary trees
Michael Alves, Patrick Dugan, Robert Daniels, Carlos Vicuna
Space-for-Time Tradeoffs
Simplifying CFGs There are several ways in which context-free grammars can be simplified. One natural way is to eliminate useless symbols those that cannot.
On-line Linear-time Construction of Word Suffix Trees Shunsuke Inenaga (Japan Society for the Promotion of Science & Kyushu University) Masayuki Takeda.
Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.
Two implementation issues Alphabet size Generalizing to multiple strings.
15-853Page : Algorithms in the Real World Suffix Trees.
296.3: Algorithms in the Real World
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search: suffix trees)
Suffix Trees Suffix trees Linearized suffix trees Virtual suffix trees Suffix arrays Enhanced suffix arrays Suffix cactus, suffix vectors, …
Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
1 Data structures for Pattern Matching Suffix trees and suffix arrays are a basic data structure in pattern matching Reported by: Olga Sergeeva, Saint.
Tries Standard Tries Compressed Tries Suffix Tries.
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Incremental Discovery of Sequential Patterns (ACM-SIGMOD's 96 Data Mining Workshop)
Motivation  DNA sequencing processes large chains into subsequences of ~500 characters long  Assembling all pieces, produces a single sequence but… –At.
Suffix Trees String … any sequence of characters. Substring of string S … string composed of characters i through j, i ate is.
Goodrich, Tamassia String Processing1 Pattern Matching.
Design a Data Structure Suppose you wanted to build a web search engine, a la Alta Vista (so you can search for “banana slugs” or “zyzzyvas”) index say.
1 Trees. 2 Outline –Tree Structures –Tree Node Level and Path Length –Binary Tree Definition –Binary Tree Nodes –Binary Search Trees.
Indexed Search Tree (Trie) Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
Sequence Alignment Variations Computing alignments using only O(m) space rather than O(mn) space. Computing alignments with bounded difference Exclusion.
Lec 15 April 9 Topics: l binary Trees l expression trees Binary Search Trees (Chapter 5 of text)
6/26/2015 7:13 PMTries1. 6/26/2015 7:13 PMTries2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3) Huffman encoding.
Data Structures & Algorithms Radix Search Richard Newman based on slides by S. Sahni and book by R. Sedgewick.
Algorithms and Data Structures. /course/eleg67701-f/Topic-1b2 Outline  Data Structures  Space Complexity  Case Study: string matching Array implementation.
Aho-Corasick Algorithm Generalizes KMP to handle sets of strings New ideas –keyword trees –failure functions/links –output links.
Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.
C++ Programming: Program Design Including Data Structures, Third Edition Chapter 20: Binary Trees.
1 Exact Set Matching Charles Yan Exact Set Matching Goal: To find all occurrences in text T of any pattern in a set of patterns P={p 1,p 2,…,p.
Important Problem Types and Fundamental Data Structures
Binary Trees Chapter 6.
On the Use of Regular Expressions for Searching Text Charles L.A. Clarke and Gordon V. Cormack Fast Text Searching.
Vakhitov Alexander Approximate Text Indexing. Using simple mathematical arguments the matching probabilities in the suffix tree are bound and by a clever.
A Space-Economical Suffix Tree Construction Algorithm Edward M. McCreight (1976) { From Ukkonen to McCreight and Weiner: A Unifying } View of Linear-Time.
Chapter 19: Binary Trees. Objectives In this chapter, you will: – Learn about binary trees – Explore various binary tree traversal algorithms – Organize.
CSE AU B-Trees1 B-Trees CSE 373 Data Structures.
Spring 2010CS 2251 Trees Chapter 6. Spring 2010CS 2252 Chapter Objectives Learn to use a tree to represent a hierarchical organization of information.
Boyer Moore Algorithm Idan Szpektor. Boyer and Moore.
Faster Algorithm for String Matching with k Mismatches (II) Amihood Amir, Moshe Lewenstin, Ely Porat Journal of Algorithms, Vol. 50, 2004, pp
Prof. Amr Goneid, AUC1 Analysis & Design of Algorithms (CSCE 321) Prof. Amr Goneid Department of Computer Science, AUC Part 8. Greedy Algorithms.
Tries1. 2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3)
Sets of Digital Data CSCI 2720 Fall 2005 Kraemer.
Programming Abstractions Cynthia Lee CS106X. Topics:  Binary Search Tree (BST) › Starting with a dream: binary search in a linked list? › How our dream.
Chapter 8 Properties of Context-free Languages These class notes are based on material from our textbook, An Introduction to Formal Languages and Automata,
ICS220 – Data Structures and Algorithms Analysis Lecture 14 Dr. Ken Cosh.
Generalization of a Suffix Tree for RNA Structural Pattern Matching Tetsuo Shibuya Algorithmica (2004), vol. 39, pp Created by: Yung-Hsing Peng Date:
Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.
9/27/10 A. Smith; based on slides by E. Demaine, C. Leiserson, S. Raskhodnikova, K. Wayne Adam Smith Algorithm Design and Analysis L ECTURE 16 Dynamic.
Chapter 11. Chapter Summary  Introduction to trees (11.1)  Application of trees (11.2)  Tree traversal (11.3)  Spanning trees (11.4)
Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.
Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
15-853:Algorithms in the Real World
Tries 07/28/16 11:04 Text Compression
Andrzej Ehrenfeucht, University of Colorado, Boulder
Mark Redekopp David Kempe
Strings: Tries, Suffix Trees
String Data Structures and Algorithms
String Data Structures and Algorithms
Tries 2/23/2019 8:29 AM Tries 2/23/2019 8:29 AM Tries.
Suffix Trees String … any sequence of characters.
Tries 2/27/2019 5:37 PM Tries Tries.
Suffix Arrays and Suffix Trees
Strings: Tries, Suffix Trees
Sequences 5/17/ :43 AM Pattern Matching.
Presentation transcript:

18 April Suffix Trees & Information Retrieval Harry Hochheiser

18 April Motivating Problem Understanding user interaction patterns –Logs of User Sessions: WORD, EXPLORER, OUTLOOK, WORD –Use patterns to inform interface design Model as string search problem –ABCA… –Find common repeated subsequences of some length > k Brute-force O(n 2 ) algorithm based on diagonals of incidence matrix, but… We want approximate matches: –xxxxxABCxxxxCBAxxxxxBAC Still looking for an answer – I don’t think it’s an easy problem.

18 April Outline Basic Suffix Trees & Extensions –Definitions –Brute-Force –McCreight – suffix links –Searching –Application to Common Subsequences Parameterized Suffix Trees –Motivations –Definitions –Searching & Example –Parameterized Duplications

18 April Tries & Trees Trie: Digital Search Tree over strings in alphabet C Each edge is a symbol, and siblings represent distinct symbols Final character of string cannot occur elsewhere in string –Add marker symbol (“$”) to alphabet, if needed Inefficient Eliminate Unary Nodes Suffix Tree Arcs are non-empty substrings Each non-terminal, non-root has two children Sibling arcs begin with different characters $ c $ a b c $ b c $ a b c $ a b c $ BCABC$ Trie $ c $ abc$ bc $ abc$ Tree

18 April Formalities & Definitions Y: Suffix tree for string y, |y|=n –n leaves –Storage O(n) –Left-Child, Right Sibling Structure Edges: substrings y(k,l) Internal Nodes: longest common prefix of string’s suffixes. Locus: node representing a string Extension: string with u as a prefix Extended Locus: Locus of shortest extension of u that is found in tree Contracted Locus: Locus of longest prefix of u in tree Head i : longest prefix of y(i,n) which is a prefix of y(j,n) for some j<I Tail i : y(i,n)=head i tail i suf 4 = BC head 4 =BC tail 4 = $ $ c $ abc$ bc $ abc$ BCABC

18 April Construction: Brute Force Stage i: insert y(i,n) –Find extended locus of head i in Tree Y i-1 –If string is not head i, break edge and insert internal node –Add a new node for tail i, from locus of head i S=bcabc Y 4 – insert bc$ Head = bc Tail = $ abc$ bcabc$ cabc$ abc$ bc cabc$ abc$ $

18 April Analysis Key: finding the extended locus –Otherwise, O(1) time Brute-force search: start from root and trace edges –Worst-case: AAAAAAA$. –|head i |=n-i for 1< i < n, n-i comparisons for each search –Θ (n 2 ) Apostolico and Szpanowski (1992): average case: O(n log n) –Average maximum length of the longest common prefix of two suffixes is O(log n) –Probabilistic analysis McCreight – suffix links improve performance

18 April Searching Not mentioned in McCreight.. Follow arcs with matching prefixes? $ c $ abc$ bc $ abc$ BCAB Analysis/Hand-Waving: Height of Tree ~ O(log n) Size of alphabet: |Σ| Search Time ~ O(|Σ|log n)?

18 April Suffix Links Finding extended locus – node to be split - is the hard part Head i : longest prefix of y(i,n) which is a prefix of y(j,n) for some j<I Note: if head i-1 =az then z is a prefix of head i Use locus of head i-1 to find locus of head i in stage i –Use suffix links to go from head i-1 to head i –In ith tree, only locus of head i does not have a valid suffix link –In step i, contracted locus of head i in tree i-1 is visited –Find head i-1 =uvw, s.t. uv is the contracted locus of head i-1 in previous tree. –Rescan: Follow suffix link from this node and then go down path to extended locus of vw –Scan:continue downward to find extended locus of head i

18 April Suffix Links – Example BBBBBABABBBAABBBBB$ Insert suf 14 =BBBBB head 13 =ABBB u=A,v=B,w=BB head 14 =BBBBB=vwz,z=BB a b Node c b b Node d b a..$ b..$ a..$ b bb a..$ bb$ head 1 3 a..$ $ b Step A: Node c is suffix link of uv Step B - rescan: Node d is extended locus of vw Step C – Scan: Extend from d to add node for z=BB From head 14, we can move on to head 15 …. How do we get to contracted locus for uv quickly? from previous iteration’s Node d? head 1 4

18 April Suffix Links: analysis Rescan: –res i – shortest suffix of string to be rescanned in step length(res i+! )<=length(res i )-int i – int i = |res i | - |res i-1 | –length(res n )=0 & length(res 0 )=n –Σint i =n – total nodes visited in rescanning Scan: –Step i, to find head i, must find z –scan length(head i )-length(head i-1 )+1. –Totals to n All other steps are linear.. total time O(n)

18 April Searching with Suffix Links Chang & Lawler: to search for pattern P in text T, build suffix tree for P and compare T to it –follow path by symbols of T –use suffix links when a symbol can’t be matched rescan to find what we’ve already seen, then continue report a match when we hit the last symbol that matches P=ABAB T=ABACABAB Suffix Tree for P Match(T 1 ….T n )=aba$ Match(T 2 ….T n )=ba$ Match(T 3 ….T n )=a$ Match(T 4 ….T n )=$ Match(T 5 ….T n )=abab$ ab ab$ b $ $

18 April Suffix Trees & Repeated Subsequences Easy: track internal nodes – longest substring is given by maximum length internal-node substring Other internal nodes give repeated substrings ABCQRABCSABTU$ ab b c c tu$ q…$ s…$ c tu$ q…$ s…$ q…$ s…$ Possible Extensions Common substrings of two strings Non-overlapping repetitions – requires augmented tree

18 April Miscellany McCreight –Hash-coded tree for better search performance with large alphabets –Dynamic insertion into suffix trees in linear time Ukkonen (1992) –On-line, linear time construction – add successively longer prefixes (instead of shorter suffixes) Manber & Myers (1993) –Suffix Arrays: 3x-5x space savings in practice Apostolico & Szpanowski (1992) –Probabilistic analysis of brute-force construction

18 April Parameterized Suffix Trees Motivation: find duplication in software systems... *pmin++ = *pmax++; copy_number(&pmin,&pmax, pfi->min_bounds.lbearing, pfi->max_bounds.rbearing); *pmin++ = *pmax++;... *pmin++ = *pmax++; copy_number(&pmin,&pmax, pfh->min_bounds.left, pfh->max_bounds.left); *pmin++ = *pmax++;... Brenda Baker, Bell Labs Parameterized Duplications in Strings: Algorithms and an Application to Sotware Maintenance Siam J. Computing, October 1997.

18 April Definitions Σ ={a,b,c} - symbol alphabet, Π ={x,y,v} - parameter alphabet p-strings: (Σ + Π )* P1: p 1 & p 2 are p-matches iff p 1 can be transformed into p 2 by a one-to-one mapping S=axayb T=avaxb prev(S) - –leftmost occurrence of each parameter becomes 0 –subsequent occurrences replaced by difference in position compared to previous occurrence: parameter pointer prev(abuvabuvu) = ab00ab442 –S & S’ are p-matches iff prev(S)=prev(S’)

18 April More Definitions…. psuffix(s,i) = prev(S i S i+1 …S n ) S=abxvabuvuprev(S)=ab00ab442 psuffix(S,5)=ab002 P2: if P is a p-string pattern, and S is a p-string text, P p- matches at position i iff prev(P) is a prefix of psuffix(S,i) P=abu prev(P)=ab0 psuffix(S,1)=ab00ab442 psuffix(S,2)=b00ab442 psuffix(S,3)=00ab442 psuffix(S,4)=0ab042 psuffix(S,5)=ab002

18 April Brute-force Construction Like McCreight, but add succesive p-suffix entries Σ={a,b,c,$},Π ={v,x,y} S=xbyybx$ psuffix(S,1)=0b01b5$ (S,2)=b01b0$ (S,3)=01b0$ (S,4)= 1b0$ (S,5)=b0$ (S,6)=0$ b0 1b0$ $ 0 $ b0 1b5$ $ Searching: follow symbols from prev(p) p=vbx$ prev(p)=0b0$ Time: O(|P|log(|Σ |+|Π |)

18 April P-Trees & Suffix Links Baker: Why do suffix links work? –Common Prefix Property: if aS=bT then S=T –Distinct Right Context: if aS=bT and aSc  bTd then Sc  Td Distinct Right Context does not hold for p-strings S=xabxyabz prev(xabx)=0ab3prev(yabz)=0ab0 0ab=0ab (aS=bT) prev(xabx)  prev(yabz) (aSc  bTd) prev(abx)=ab0=prev(abz)(Sc=Td) xabx - repetition of parameter yabz - new parameter abx, abz - context lost

18 April P-Strings & Suffix Links…. Can’t always point from N to node representing appropriate suffix –might end up pointing to node with shortest extension of suffix instead. Use “best available” suffix link S=xbyyxbx 014b2$ b0 10b2$ $ 0 $ 0b2$ 2$ $ b 0b has two different right contexts.. 0b0 0b2

18 April Fixes Must get suffix links pointing to the right place Fix them up as we go along – lazy or eager eager: keep track of min - node of shortest length with a bad pointer pointing to N Six phases: –Set temporary suffix link for previous head –Scan head i – Add new internal node if needed call update to fix suffix links and min pointers –Fix min pointers of head i-1 –Create new node with appropriate suffix –update oldhd and oldchild pointers O(n(|Π |+log(| Π |+|Σ )))), improve to O(n logn): concatenable queues & dynamic trees.

18 April Searching for a pattern Chang & Lawler: to search for pattern P in text T, build suffix tree for P and compare T to it –follow path by symbols of T –use suffix links when a symbol can’t be matched rescan to find what we’ve already seen, then continue report a match when we hit the last symbol of P For suffix trees - slight difference –when we hit a mismatch, we may need to go back up to a parent node to find the right suffix link to follow with appropriate optimizations, O(|P|) space and O(|T| log(| Π |+|Σ |)) time

18 April Chang & Lawler Search in p-suffix Trees S=xbyyxbx T=vazbybxxyby 0a0b0b014b2 a0b0b014b2 0b0b014b2 014b2$ b0 10b2$ $ 0 $ 0b2$ 2$ $ b Search on prev(T) instead of T b0b014b2 0b014b2

18 April Finding parameterized duplications Maximal p-match - cannot be extended in either direction Node N - p-match that can’t be extended to the right –look at preceeding symbols to see if maximal p-strings - parameters in common prefix have different meanings 0 - next occurrence of a parameter proceeding this suffix,or earlier occurrence of a parameter first occurrence of a parameter S=xabcxT=yabcz prev(abcx)=prev(abcz)=abc0 prev(xabcx)=0abc3 prev(yabcz)=0abc0 - Check previous symbols - if they are parameters, check to see whether next occurrences of those parameters match

18 April Parameterized Duplications... Use A=(prev(S r )) r to find left matches S=xbxzbzzyby S r =ybyzzbzxbx prev(S r )=0b201b20b2 A=(prev(S r )) r =2b02b102b0 Only matches with appropriate values in A can be left-extended: if S i...S i+k matches S j ….S j+k, A i-1 =A j-1 means that we have a match

18 April Parameterized Duplications, example S=xbxzbzzyby A=(prev(S r )) r =2b02b102b0 0 0b 10b2$ $ b2 b0 10b2$ 0b210b2$ 10b2$ $ $ 0b210b2$ 2$ b210b2$ O(n+m(t,S)) - m(t,S) is number of matches of length at least t b2: S(5,6)=b2 (b210b2$) S(9,2)=b2 (b2$) S(2,2)=b2 (b20b210b2$) A 4 =A 8 =A 1 =2, so all matches can be extended

18 April Parameterized Duplication in Software Identifiers & constants become parameters Hash each line of code into a sybmol of Σ + 0 or more parameters Use hashing-based suffix tree Linear time: O(|T|+m(t,T)), but m(t,T) < |T| - O(|T|). With post-processing, 10 6 in 7 minutes –20% involved in parameterized duplication of  30 lines

18 April References Apostolico, A., Szpanowski, W. (1992).Self-alignments in words and their applications Journal of Algorithms, 13, pp Baker, B. (1993) A theory of parameterized pattern matching: Algorithms and applications (Extended Abstract) Proceedings 25 th ACM Symposium on Theory of Computing, 1993, pp Baker, B. (1997) Parameterized Duplications in Strings: algorithms and an Application to Software Maintenance SIAM J. Computing 25(6), October pp Chang, W.I. & Lawler, E.L.(1990) Approximate string matching in sublinear expected time Proc. Of 31st Symposium on Foundations of Computer Science, pp Manber U. & Myers, E.W. (1993) Suffix arrays: A new method for on-line string Searches SIAM Journal on Computing (October 1993), pp McCreight, E.M. (1976). A space-economical suffix tree construction algorithm Journal of the ACM, 23, 2, pp , April Stephen, G. (1994) String Searching Algorithms London: World Scientific Press.

18 April References, continued Ukkonen, E. (1992) Constructing suffix-trees online in linear time in Leeuwen, J. van (ed.) Algorithms, Software, Architecture: Information Processing 92, 1, p Amsterdam: Elsevier.