21/05/2015Applied Algorithmics - week51 Off-line text search (indexing)  Off-line text search refers to the situation in which a preprocessed digital.

Slides:



Advertisements
Similar presentations
Suffix Trees Construction and Applications João Carreira 2008.
Advertisements

Suffix Trees, Suffix Arrays and Suffix Trays Richard Cole Tsvi Kopelowitz Moshe Lewenstein.
Two implementation issues Alphabet size Generalizing to multiple strings.
Suffix Trees and Suffix Arrays
Suffix Sorting & Related Algoritmics Martin Farach-Colton Rutgers University USA.
15-853Page : Algorithms in the Real World Suffix Trees.
296.3: Algorithms in the Real World
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 8)
1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search: suffix trees)
Suffix Trees Suffix trees Linearized suffix trees Virtual suffix trees Suffix arrays Enhanced suffix arrays Suffix cactus, suffix vectors, …
Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
1 Data structures for Pattern Matching Suffix trees and suffix arrays are a basic data structure in pattern matching Reported by: Olga Sergeeva, Saint.
3 -1 Chapter 3 String Matching String Matching Problem Given a text string T of length n and a pattern string P of length m, the exact string matching.
Tries Standard Tries Compressed Tries Suffix Tries.
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Tries Search for ‘bell’ O(n) by KMP algorithm O(dm) in a trie Tries
Motivation  DNA sequencing processes large chains into subsequences of ~500 characters long  Assembling all pieces, produces a single sequence but… –At.
Advanced Algorithm Design and Analysis (Lecture 4) SW5 fall 2004 Simonas Šaltenis E1-215b
Suffix Trees String … any sequence of characters. Substring of string S … string composed of characters i through j, i ate is.
Modern Information Retrieval
Goodrich, Tamassia String Processing1 Pattern Matching.
1 Applications of Suffix Trees Charles Yan Exact String Matching |P|=n, |T|=m P and T are both known at the same time Boyer-Moore, or Suffix.
Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006.
Indexed Search Tree (Trie) Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
CSC 213 Lecture 18: Tries. Announcements Quiz results are getting better Still not very good, however Average score on last quiz was 5.5 Every student.
Dynamic Text and Static Pattern Matching Amihood Amir Gad M. Landau Moshe Lewenstein Dina Sokol Bar-Ilan University.
Suffix trees and suffix arrays presentation by Haim Kaplan.
6/26/2015 7:13 PMTries1. 6/26/2015 7:13 PMTries2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3) Huffman encoding.
Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.
Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University.
On the Use of Regular Expressions for Searching Text Charles L.A. Clarke and Gordon V. Cormack Fast Text Searching.
Efficient encoding methods  Coding theory refers to study of code properties and their suitability to specific applications.  Efficient codes are used,
§6 B+ Trees 【 Definition 】 A B+ tree of order M is a tree with the following structural properties: (1) The root is either a leaf or has between 2 and.
Introduction n – length of text, m – length of search pattern string Generally suffix tree construction takes O(n) time, O(n) space and searching takes.
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
CSC401 – Analysis of Algorithms Chapter 9 Text Processing
20/10/2015Applied Algorithmics - week31 String Processing  Typical applications: pattern matching/recognition molecular biology, comparative genomics,
String Matching with k Mismatches Moshe Lewenstein Bar Ilan University Modified by Ariel Rosenfeld.
Improved string matching with k mismatches (The Kangaroo Method) Galil, R. Giancarlo SIGACT News, Vol. 17, No. 4, 1986, pp. 52–54 Original: Moshe Lewenstein.
Constant-Time LCA Retrieval Presentation by Danny Hermelin, String Matching Algorithms Seminar, Haifa University.
Tries1. 2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3)
Suffix trees. Trie A tree representing a set of strings. a b c e e f d b f e g { aeef ad bbfe bbfg c }
06/12/2015Applied Algorithmics - week41 Non-periodicity and witnesses  Periodicity - continued If string w=w[0..n-1] has periodicity p if w[i]=w[i+p],
Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.
Chapter 11. Chapter Summary  Introduction to trees (11.1)  Application of trees (11.2)  Tree traversal (11.3)  Spanning trees (11.4)
1 Recap lecture 28 Examples of Myhill Nerode theorem, Quotient of a language, examples, Pseudo theorem: Quotient of a language is regular, prefixes of.
Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.
Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
15-853:Algorithms in the Real World
Tries 07/28/16 11:04 Text Compression
Tries 5/27/2018 3:08 AM Tries Tries.
Andrzej Ehrenfeucht, University of Colorado, Boulder
Ariel Rosenfeld Bar-Ilan Uni.
Tries 9/14/ :13 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
13 Text Processing Hongfei Yan June 1, 2016.
Strings: Tries, Suffix Trees
String Data Structures and Algorithms: Suffix Trees and Suffix Arrays
Suffix trees.
String Data Structures and Algorithms
String Data Structures and Algorithms
Tries 2/23/2019 8:29 AM Tries 2/23/2019 8:29 AM Tries.
Tries 2/27/2019 5:37 PM Tries Tries.
Suffix Arrays and Suffix Trees
String Matching with k Mismatches
Strings: Tries, Suffix Trees
Sequences 5/17/ :43 AM Pattern Matching.
Presentation transcript:

21/05/2015Applied Algorithmics - week51 Off-line text search (indexing)  Off-line text search refers to the situation in which a preprocessed digital collection of documents, e.g., a text database, is searched for specific patterns, similarities, irregularities, etc.  In the off-line text search precomputed text data structures support efficient simultaneous examination of multiple documents stored on the system.  Off-line text searching methods are used in a large variety of applications ranging from bibliographic databases, word processing environments, search engines (Google, Bing, Yahoo, etc), intrusion detection and analysis of DNA/RNA sequences.  Off-line text search methods are very often referred to as text indexing methods.

21/05/2015Applied Algorithmics - week52 Suffix Trees  A suffix tree is a data structure that exposes in detail the internal structure of a string  The real virtue of suffix trees comes from their use in linear time solutions to many string problems more complex than exact matching  Suffix trees provide a bridge between exact matching problems and matching with various types of errors

21/05/2015Applied Algorithmics - week53 Suffix Trees and pattern matching  In off-line pattern matching one is allowed to process the text T=T[0..n-1] in time O(n), s.t., any further matching queries with unknown pattern P=P[0..m-1] can be served in time O(m).  Compact suffix trees provide efficient solution to off-line pattern matching problem  Compact suffix trees provide also solution to a number of substring problems, periodicities and regularities

21/05/2015Applied Algorithmics - week54 Compact suffix trees - brief history  First linear algorithm for constructing compact suffix trees in ‘73 by Weiner  More space efficient also linear algorithm was introduced in ‘76 by McCreight  An alternative, conceptually different (and easier) algorithm for linear construction of compact suffix trees was proposed by Ukkonen in ‘95

21/05/2015Applied Algorithmics - week55 Tries - trees of strings  A trie T for a set of strings S over alphabet A is a rooted tree, such that: edges in T are labeled by single symbols from A, each string s  S is represented by a path from the root of T to some leaf of T, for some technical reasons (e.g., to handle the case when for some s,w  S, s is a prefix of w) every string s  S is represented in T as s#, where # is a special symbol that does not belong to A.

21/05/2015Applied Algorithmics - week56 Tries - example  Strings in S={a,aba,bba,abba,abab} are replaced by a#, aba#, bba#, abba#, abab# respectively a b a a a b b b b # # ## # root

21/05/2015Applied Algorithmics - week57 Suffix trees  A suffix tree ST(w) is a trie that contains all suffixes of a given word w, i.e.,  Similarly as it happens in tries ends of a suffixes are denoted by the special character # which form leaves in ST(w)  Moreover each internal node of the suffix tree ST(w) represent the end of some substring of w

21/05/2015Applied Algorithmics - week58 Suffix Trees - example  Take w=f 5 =babbabab (5th Fibonacci word)  The suffixes of w are b represented in ST(w) asb# ab represented in ST(w) asab# bab represented in ST(w) asbab# abab represented in ST(w) asabab# babab represented in ST(w) asbabab# bbabab represented in ST(w) asbbabab# abbabab represented in ST(w) asabbabab# babbabab represented in ST(w) asbabbabab#

21/05/2015Applied Algorithmics - week59 Suffix Trees - example  suffixes of w b# ab# bab# abab# babab# bbabab# abbabab# babbabab# b a # # # # # # # # b b b b b b b b b b b b b a a a a a a a a a

21/05/2015Applied Algorithmics - week510 Compact suffix trees  We know that suffix trees can be very large, i.e., quadratic in the size of an input string, e.g. when the input string has many different symbols.  This problem can be cured if we encode all chains (paths with nodes of degree 2) in the suffix tree by reference to some substring in the original string.  A suffix tree with encoded chains is called a compact suffix tree.

21/05/2015Applied Algorithmics - week511 Compact suffix trees - example # # # # # # # # b b b b b b b b b b b b b a a a a a a a a a ab Original string w = b a b b a b a b # # # # # # # # [0,0]Suffix treeCompact Suffix tree [1,2] [3,7]

21/05/2015Applied Algorithmics - week512 Compact suffix trees  Theorem: The size of a compact suffix tree constructed for any string w=w[0..n-1] is O(n) In the (compact) suffix tree there is only n leaves marked by #s Since each internal node in the compact suffix tree is of degree ≥ 2 there are ≤ n-1 edges in the tree Each edge is represented by two indexes in the original string w Thus the total space required is linear in n.

21/05/2015Applied Algorithmics - week513 Longest repeated sequence  Using a compact suffix tree for any string w=w[0..n-1] we can find the longest repeated sequence in w in time O(n). Find the deepest node in the tree which has degree at least 2 w[i+x..m-1] w[j+x..m-1] procedure longest(v:tree; depth: integer); if v is not a leaf then if (depth>max-depth) then max-depth  depth; for each u  v.children do longest(u,depth+length(v,u)); … max-depth  0; longest(T.root,0); return(max-depth); … depth x w[i..i+x-1]= w[j..j+x-1]

21/05/2015Applied Algorithmics - week514 Suffix trees for several strings  One can compute joint properties of two (or more) strings w 1 and w 2 constructing a single compact suffix tree T for string w 1 $w 2 #, where Symbol $ does not belong neither to w 1 nor to w 2 All branches in T are truncated below the special symbol $  For example, using similar procedure one can compute the longest substring shared by w 1 and w 2

21/05/2015Applied Algorithmics - week515 Longest shared substring  Initially, for each node v  T we compute attribute shares, which says whether v is an ancestor of leaves $ and # Find the deepest node in the tree which represents substrings from w 1 and w 2 w 1 [i+x..m-1] w 2 [j+x..m-1] depth x w 1 [i..i+x-1]= w 2 [j..j+x-1] function sharing(v:tree): set of {$,#} if v is a leaf then return(v.symbol) else set  {}; for each u  v.children do set  set  sharing(u); v.shares  set; return(v.shares); … sharing(T.root); …

21/05/2015Applied Algorithmics - week516 Longest shared substring  Using a truncated compact suffix tree for the string w 1 $w 2 we can find the longest shared substring by w 1 and w 2 in linear time. procedure longest(v:tree; depth: integer); if v.shares={$,#} then if (depth>max-depth) then max-depth  depth; for each u  v.child do longest(v,depth+length(v,u)); … max-depth  0; longest(T,0); return(max-depth); … Find the deepest node in the tree which represents substrings from w 1 and w 2 w 1 [i..m-1] w 2 [j..m-1] depth x w 1 [i+x-1]= w 2 [j+x-1]

21/05/2015Applied Algorithmics - week517 Lowest common ancestor - LCA  A node z is the lowest common ancestor of any two nodes u,v in the tree T rooted in the node r, z =lca T (u,v), iff: 1) node z belongs to both paths from u to r and from v to r 2) node z is the deepest node in T with property 1) u v z =lca(u,v) r

21/05/2015Applied Algorithmics - week518 Lowest common ancestor  Theorem: Any tree of size n can preprocessed in time O(n), such that, the lowest common ancestor query lca(u,v), for any two nodes u,v in the tree can be served in O(1) time.  For example, we can preprocess any suffix tree in linear time and then compute the longest prefix shared by any two suffixes in O(1) time.  LCA queries have also many other applications.

21/05/2015Applied Algorithmics - week519 Pattern matching with k mismatches  So far we discussed algorithmic solutions either for exact pattern matching or pattern matching with don’t care symbols, where the choice of text symbols was available at fixed pattern positions  In pattern matching with k mismatches we say that an occurrence of the pattern is acceptable if there is at most k mismatches between pattern symbols and respective substring of the text

21/05/2015Applied Algorithmics - week520 Pattern matching with k mismatches at most k mismatches acceptable pattern occurrence Text T Pattern P S i+1 SiSi SiSi matching substrings

21/05/2015Applied Algorithmics - week521 Pattern matching with k mismatches  As many other instances of pattern matching also in this case one can provide an easy solution with time complexity O(m·n). However we are after faster solution.  The search stage in pattern matching with k mismatches is preceded by the construction of a compact suffix tree ST for the string P$T#  The tree ST is later processed for LCA queries which will allow to fast recognition of matching substrings S i  Both steps are preformed in linear time

21/05/2015Applied Algorithmics - week522 Pattern matching with k mismatches  During the search stage each text position is tested for potential approximate occurrence of the pattern P  Consecutive blocks S i are recovered in O(1) time via LCA queries in preprocessed ST tree at most k times, which gives total complexity O(kn). text T pattern P S i+1 SiSi SiSi suffix u in P suffix v in T z = lca(u,v) in ST

21/05/2015Applied Algorithmics - week523 Suffix arrays  One of the very attractive alternatives to compact suffix trees is a suffix array  For any string w=w[0..n-1] the suffix array is an array of length n in which suffixes (namely their indexes) of w are sorted in lexicographical order  The space required to compute and store the suffix arrays is smaller, the construction is simpler, and the use/properties are comparable with suffix trees

21/05/2015Applied Algorithmics - week524 Suffix arrays - example Original string w = b a b b a b a b Suffix array w = b a b b a b a b [0] a b b a b a b [1] b b a b a b [2] b a b a b [3] a b a b [4] b a b [5] a b [6] b [7]  Suffix arrays provide tools for off-line pattern matching in time O(m·log n), where n is the length of the text and m is the length of the pattern  There exists linear transformation between suffix trees and suffix arrays  Suffix arrays provide simple and efficient mechanism for several text compression methods