Suffix Trees
- Suffix trees
- Linearized suffix trees
- Virtual suffix trees
- Suffix arrays
- Enhanced suffix arrays
- Suffix cactus, suffix vectors, …

Suffix Trees
String: any sequence of characters.
Substring of string S: the string composed of characters i through j of S, i <= j.
- S = cater => ate is a substring.
- car is not a substring.
- The empty string is a substring of S.

Subsequence
Subsequence of string S: a string composed of characters i_1 < i_2 < … < i_k of S.
- S = cater => ate is a subsequence.
- car is a subsequence.
- The empty string is a subsequence.
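The difference between the two definitions can be checked directly. The following is a minimal sketch; the function names `is_substring` and `is_subsequence` are illustrative and do not come from the slides.

```python
def is_substring(p: str, s: str) -> bool:
    """True if p occurs in s as a contiguous run of characters."""
    return any(s[i:i + len(p)] == p for i in range(len(s) - len(p) + 1))


def is_subsequence(p: str, s: str) -> bool:
    """True if the characters of p occur in s in order, gaps allowed."""
    remaining = iter(s)
    # Each membership test consumes the iterator up to the matched character,
    # so later characters of p must be found further to the right in s.
    return all(ch in remaining for ch in p)


print(is_substring("ate", "cater"), is_substring("car", "cater"))
print(is_subsequence("ate", "cater"), is_subsequence("car", "cater"))
```

Every substring is also a subsequence, but not the other way round: car is a subsequence of cater without being a substring.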

String/Pattern Matching
You are given a source string S. Answer queries of the form: is the string p_i a substring of S?
Knuth-Morris-Pratt (KMP) string matching.
- O(|S| + |p_i|) time per query.
- O(n|S| + Σ_i |p_i|) time for n queries.
Suffix tree solution.
- O(|S| + Σ_i |p_i|) time for n queries.

String/Pattern Matching
KMP preprocesses the query string p_i, whereas the suffix tree method preprocesses the source string S.
An application of string matching.
- Genome project.
- Databank of strings (gene sequences).
- Character set is ATGC.
- Determine if a "new" sequence is a substring of a databank sequence.
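To make the KMP side of the comparison concrete, here is a minimal sketch of a textbook KMP matcher. The names `failure_function` and `kmp_contains` are illustrative, not taken from the slides. The preprocessing touches only the pattern, so the full scan of S is repeated for every query.

```python
def failure_function(p: str) -> list[int]:
    """fail[i] = length of the longest proper prefix of p[:i+1] that is also its suffix."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_contains(s: str, p: str) -> bool:
    """Answer one query 'is p a substring of s?' in O(|s| + |p|) time."""
    if not p:
        return True
    fail = failure_function(p)  # preprocessing of the query string only
    k = 0                       # number of pattern characters currently matched
    for ch in s:
        while k > 0 and ch != p[k]:
            k = fail[k - 1]
        if ch == p[k]:
            k += 1
        if k == len(p):
            return True
    return False


print(kmp_contains("sleeper", "eepe"), kmp_contains("sleeper", "leap"))
```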

Definition Of Suffix Tree
Compressed trie with edge information.
Keys are the nonempty suffixes of a given string S.
Nonempty suffixes of S = sleeper are:
- sleeper
- leeper
- eeper
- eper
- per, er, and r.

String Matching & Suffixes
p_i is a substring of S iff p_i is a prefix of some suffix of S.
Nonempty suffixes of S = sleeper are:
- sleeper
- leeper
- eeper
- eper
- per, er, and r.
Which of these are substrings of S?
- leep, eepe, pe, leap, peel
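The prefix-of-a-suffix characterization can be tried out by brute force. This is a sketch for illustration only (the helper names are made up here, and a nonempty pattern is assumed); a suffix tree answers the same question without enumerating the suffixes one by one.

```python
def nonempty_suffixes(s: str) -> list[str]:
    """All nonempty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def is_substring_via_suffixes(p: str, s: str) -> bool:
    """p is a substring of s iff p is a prefix of some suffix of s (p nonempty)."""
    return any(suffix.startswith(p) for suffix in nonempty_suffixes(s))


for p in ["leep", "eepe", "pe", "leap", "peel"]:
    print(p, is_substring_via_suffixes(p, "sleeper"))
```

For S = sleeper this reports leep, eepe, and pe as substrings, and leap and peel as non-substrings.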

Last Character Of S Repeats
When the last character of S appears more than once in S, S has at least one suffix that is a proper prefix of another suffix.
S = creeper
- creeper, reeper, eeper, eper, per, er, r
When the last character of S appears more than once in S, use an end of string character # to overcome this problem.
S = creeper#
- creeper#, reeper#, eeper#, eper#, per#, er#, r#, #
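The effect of the end of string character can be verified with a small brute-force check. This is a sketch; the function name `suffixes_that_prefix_another` is hypothetical.

```python
def suffixes_that_prefix_another(s: str) -> list[str]:
    """Suffixes of s that are a proper prefix of some other suffix of s."""
    suffixes = [s[i:] for i in range(len(s))]
    return [a for a in suffixes
            if any(b != a and b.startswith(a) for b in suffixes)]


print(suffixes_that_prefix_another("creeper"))   # the suffix r is a prefix of reeper
print(suffixes_that_prefix_another("creeper#"))  # no such suffix once # is appended
```

With the end marker appended, no suffix is a prefix of another, so every suffix ends at its own element node.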

Suffix Tree For S = abbbabbbb#
[Figure: suffix tree for S = abbbabbbb#. The edge labels did not survive extraction.]

Suffix Tree Construction
See Web write up for algorithm.
Time complexity
- |S| = n, alphabet size = r.
- O(nr) using array nodes.
- This is O(n) for r a constant (or r <= c).
- O(n) expected time using a hash table.
- O(n) time algorithm for large r in reference cited in Web write up.
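The construction algorithm itself is in the Web write up and is not reproduced in this transcript. As a stand-in, the sketch below builds an uncompressed suffix trie by inserting every suffix one character at a time. That takes O(n^2) time and space, so it is not the linear-time construction referred to above; it only illustrates the hash-table style of node (a Python dict per node) and the O(|p|) search. The names `Node`, `build_suffix_trie`, and `contains` are assumptions made for this example.

```python
class Node:
    """One trie node; children are kept in a hash table keyed by character."""

    def __init__(self):
        self.children = {}
        self.start = None  # 1-based start position of the suffix that ends here


def build_suffix_trie(s: str, end: str = "#") -> Node:
    """Naive O(n^2) suffix trie of s + end (NOT the linear-time algorithm)."""
    if end in s:
        raise ValueError("end marker must not occur in s")
    text = s + end
    root = Node()
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.children.setdefault(ch, Node())
        node.start = i + 1  # element node for the suffix starting at position i+1
    return root


def contains(root: Node, p: str) -> bool:
    """Follow p from the root: O(|p|) expected time with hash-table nodes."""
    node = root
    for ch in p:
        node = node.children.get(ch)
        if node is None:
            return False
    return True


root = build_suffix_trie("abbbabbbb")
print(contains(root, "bab"), contains(root, "aa"))
```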

Suffix Array
Array that contains the start positions of the suffixes in lexicographic order.
S = abbbabbbb#
- Assume # < a < b.
- # < abbbabbbb# < abbbb# < b# < babbbb# < bb# < bbabbbb# < bbb# < bbbabbbb# < bbbb#
- SA = [10, 1, 5, 9, 4, 8, 3, 7, 2, 6]
- LCP = length of longest common prefix between adjacent entries of SA.
- LCP = [0, 4, 0, 1, 1, 2, 2, 3, 3, -]
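The SA and LCP values for abbbabbbb# can be reproduced with a short brute-force sketch. It sorts the suffixes directly, which is much slower than a linear-time construction, and it relies on the ASCII order of the characters, where # < a < b happens to hold. Positions are 1-based to match the slide; the function names are illustrative.

```python
def suffix_array(s: str) -> list[int]:
    """1-based start positions of the suffixes of s in lexicographic order (naive sort)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def lcp_array(s: str, sa: list[int]) -> list[int]:
    """LCP between each pair of adjacent suffix array entries (one entry fewer than sa)."""
    def lcp(i: int, j: int) -> int:
        n = 0
        while (i - 1 + n < len(s) and j - 1 + n < len(s)
               and s[i - 1 + n] == s[j - 1 + n]):
            n += 1
        return n

    return [lcp(sa[k], sa[k + 1]) for k in range(len(sa) - 1)]


s = "abbbabbbb#"
sa = suffix_array(s)
print(sa)
print(lcp_array(s, sa))
```

The LCP list returned here has nine entries; the slide pads it to ten with a trailing "-" because the last suffix has no successor.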

Suffix Array
Less space than suffix tree.
Linear time construction.
Can be used to solve several of the problems solved by a suffix tree with the same asymptotic complexity.
- Substring matching: binary search for p using SA.
- O(|p| log |S|).
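A minimal sketch of the binary search follows. Each of the O(log |S|) probes compares p with the first |p| characters of one suffix, which gives the O(|p| log |S|) bound. The suffix array is built here by naive sorting only to keep the example self-contained; `suffix_array` and `sa_contains` are illustrative names.

```python
def suffix_array(s: str) -> list[int]:
    """1-based start positions of the suffixes of s in lexicographic order (naive sort)."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def sa_contains(s: str, sa: list[int], p: str) -> bool:
    """Binary search for the first suffix whose |p|-character prefix is >= p."""
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        start = sa[mid] - 1
        if s[start:start + len(p)] < p:
            lo = mid + 1
        else:
            hi = mid
    if lo == len(sa):
        return False
    start = sa[lo] - 1
    return s[start:start + len(p)] == p


s = "abbbabbbb#"
sa = suffix_array(s)
print(sa_contains(s, sa, "bab"), sa_contains(s, sa, "aa"))
```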

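O(|p_i|) Time Substring Matching
[Figure: suffix tree for S = abbbabbbb# with example search patterns. The pattern text on the slide survives only as the run "babbabbbababa"; the edge labels did not survive extraction.]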

Find All Occurrences Of p_i
Search suffix tree for p_i.
Suppose the search for p_i is successful.
When the search terminates at an element node, p_i appears exactly once in the source string S.

Search Terminates At Element Node
[Figure: suffix tree for S = abbbabbbb#; the slide shows the pattern abbbb#. The edge labels did not survive extraction.]

Search Terminates At Branch Node
When the search for p_i terminates at a branch node, each element node in the subtree rooted at this branch node gives a different occurrence of p_i.

Search Terminates At Branch Node
[Figure: suffix tree for S = abbbabbbb#; the slide shows the pattern ab. The edge labels did not survive extraction.]

Find All Occurrences Of p_i
To find all occurrences of p_i in time linear in the length of p_i and linear in the number of occurrences of p_i, augment the suffix tree:
- Link all element nodes into a chain in inorder.
- Each branch node keeps a pointer to the leftmost and rightmost element node in its subtree.
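The augmentation can be sketched on an uncompressed suffix trie (an assumption made to keep the example short; the slides use a compressed suffix tree). Suffixes are inserted in sorted order, so the list `chain` plays the role of the inorder chain of element nodes, and each node stores the indices `first` and `last` of the leftmost and rightmost element node below it. All names are illustrative, and s is assumed to end with a unique end marker.

```python
class Node:
    def __init__(self):
        self.children = {}
        self.first = None  # index in chain of the leftmost element node below
        self.last = None   # index in chain of the rightmost element node below


def build_augmented_trie(s: str):
    """Naive O(n^2) build; s must end with a unique end marker such as #."""
    # Element nodes in inorder = suffix start positions (1-based) in sorted order.
    chain = sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])
    root = Node()
    for idx, start in enumerate(chain):
        node = root
        for ch in s[start - 1:]:
            node = node.children.setdefault(ch, Node())
            if node.first is None:
                node.first = idx
            node.last = idx
    return root, chain


def occurrences(root: Node, chain: list[int], p: str) -> list[int]:
    """Start positions of a nonempty p: O(|p|) walk plus one slice of the chain."""
    node = root
    for ch in p:
        node = node.children.get(ch)
        if node is None:
            return []
    return chain[node.first:node.last + 1]


root, chain = build_augmented_trie("abbbabbbb#")
print(sorted(occurrences(root, chain, "ab")))
print(sorted(occurrences(root, chain, "b")))
```

Because suffixes that share a prefix are adjacent in sorted order, the element nodes below any node form one contiguous stretch of the chain, which is why two pointers per node are enough.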

Augmented Suffix Tree
[Figure: augmented suffix tree for S = abbbabbbb#; the slide shows the pattern b. The edge labels and pointers did not survive extraction.]

Longest Repeating Substring
Find the longest substring of S that occurs at least m > 1 times in S.
Label branch nodes with the number of element nodes in their subtree.
Find the branch node with label >= m and max char# field.
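The labeling idea can be sketched on an uncompressed suffix trie, where the count stored at a node equals the number of element nodes in its subtree, that is, the number of occurrences of the string spelled on the path to it. This is an O(n^2) illustration under that assumption, not the compressed suffix tree of the slides; the function name and the dictionary layout are made up for the example.

```python
def longest_repeating_substring(s: str, m: int = 2, end: str = "#") -> str:
    """Longest substring of s occurring at least m (> 1) times; '' if none."""
    if m < 2:
        raise ValueError("m must be greater than 1")
    if end in s:
        raise ValueError("end marker must not occur in s")
    text = s + end
    root = {"count": 0, "children": {}}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node["children"].setdefault(ch, {"count": 0, "children": {}})
            node["count"] += 1  # one more suffix (element node) below this node
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        if len(label) > len(best):
            best = label
        for ch, child in node["children"].items():
            if child["count"] >= m:  # only descend while the label still repeats
                stack.append((child, label + ch))
    return best


print(longest_repeating_substring("abbbabbbb", 2))
print(longest_repeating_substring("abbbabbbb", 3))
```

A label containing the end marker occurs only once, so it is never reported for m > 1. If several substrings tie for the longest, this sketch returns one of them.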

Longest Repeating Substring
[Figure: suffix tree for S = abbbabbbb# with branch-node labels. The labels and the values of m shown on the slide did not survive extraction.]

Longest Common Substring
Given two strings S and T. Find the longest common substring.
S = carport, T = airports
- Longest common substring = rport
- Longest common subsequence = arport
Longest common subsequence may be found in O(|S|*|T|) time using dynamic programming.
Longest common substring may be found in O(|S|+|T|) time using a suffix tree.

Longest Common Substring
Let $ be a new symbol. Construct the suffix tree for the string U = S$T#.
- U = carport$airports#
- No repeating substring includes $.
- Find the longest repeating substring that occurs both to the left and to the right of $.
Find the branch node that has max char# and has at least one element node in its subtree that represents a suffix that begins in S as well as at least one that begins in T.
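The U = S$T# idea can be sketched with an uncompressed suffix trie in which every node records whether some suffix beginning in S and some suffix beginning in T pass through it. This naive version takes O(|U|^2) time and space, so it does not achieve the O(|S|+|T|) bound of a real suffix tree; the function name and node layout are assumptions for illustration.

```python
def longest_common_substring(s: str, t: str, sep: str = "$", end: str = "#") -> str:
    """Longest string that is a substring of both s and t ('' if none)."""
    if sep in s + t or end in s + t:
        raise ValueError("sep and end must not occur in s or t")
    u = s + sep + t + end

    def new_node():
        return {"children": {}, "in_s": False, "in_t": False}

    root = new_node()
    for i in range(len(u)):
        begins_in_s = i < len(s)
        node = root
        for ch in u[i:]:
            node = node["children"].setdefault(ch, new_node())
            if begins_in_s:
                node["in_s"] = True
            else:
                node["in_t"] = True
    # A label containing $ occurs once in U, so it can never carry both flags.
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        if len(label) > len(best):
            best = label
        for ch, child in node["children"].items():
            if child["in_s"] and child["in_t"]:
                stack.append((child, label + ch))
    return best


print(longest_common_substring("carport", "airports"))
```

For S = carport and T = airports the deepest node carrying both flags spells rport, matching the slide.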