1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search: suffix trees)

Slides:



Advertisements
Similar presentations
On-line Construction of Suffix Trees Chairman : Prof. R.C.T. Lee Speaker : C. S. Wu ( ) June 10, 2004 Dept. of CSIE National Chi Nan University.
Advertisements

Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Introduction to Computer Science 2 Lecture 7: Extended binary trees
Augmenting Data Structures Advanced Algorithms & Data Structures Lecture Theme 07 – Part I Prof. Dr. Th. Ottmann Summer Semester 2006.
22C:19 Discrete Structures Trees Spring 2014 Sukumar Ghosh.
Suffix Trees Construction and Applications João Carreira 2008.
Suffix Tree. Suffix Tree Representation S=xabxac Represent every edge using its start and end text location.
Trie/Suffix Trie/Suffix Tree. Trie A trie (from retrieval), is a multi-way tree structure useful for storing strings over an alphabet. It has been used.
Data Compressor---Huffman Encoding and Decoding. Huffman Encoding Compression Typically, in files and messages, Each character requires 1 byte or 8 bits.
Two implementation issues Alphabet size Generalizing to multiple strings.
Genome-scale disk-based suffix tree indexing Benjarath Phoophakdee Mohammed J. Zaki Compiled by: Amit Mahajan Chaitra Venus.
Suffix Trees Specialized form of keyword trees New ideas –preprocess text T, not pattern P O(m) preprocess time O(n+k) search time –k is number of occurrences.
Suffix Trees and Suffix Arrays
1 Suffix Trees Charles Yan Suffix Trees: Motivations Substring problem: One is given a text T of length m. After O (m) preprocessing time, one.
15-853Page : Algorithms in the Real World Suffix Trees.
OUTLINE Suffix trees Suffix arrays Suffix trees Indexing techniques are used to locate highest – scoring alignments. One method of indexing uses the.
296.3: Algorithms in the Real World
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 8)
Suffix Trees Suffix trees Linearized suffix trees Virtual suffix trees Suffix arrays Enhanced suffix arrays Suffix cactus, suffix vectors, …
Rapid Global Alignments How to align genomic sequences in (more or less) linear time.
1 Data structures for Pattern Matching Suffix trees and suffix arrays are a basic data structure in pattern matching Reported by: Olga Sergeeva, Saint.
Tries Standard Tries Compressed Tries Suffix Tries.
Combinatorial Pattern Matching CS 466 Saurabh Sinha.
Tries Search for ‘bell’ O(n) by KMP algorithm O(dm) in a trie Tries
Motivation  DNA sequencing processes large chains into subsequences of ~500 characters long  Assembling all pieces, produces a single sequence but… –At.
Advanced Algorithm Design and Analysis (Lecture 4) SW5 fall 2004 Simonas Šaltenis E1-215b
Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU Linear Time Construction of Suffix Tree.
Suffix Trees String … any sequence of characters. Substring of string S … string composed of characters i through j, i ate is.
1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search, part 1)
Goodrich, Tamassia String Processing1 Pattern Matching.
Full-Text Indexing via Burrows-Wheeler Transform Wing-Kai Hon Oct 18, 2006.
Indexed Search Tree (Trie) Fawzi Emad Chau-Wen Tseng Department of Computer Science University of Maryland, College Park.
Suffix trees and suffix arrays presentation by Haim Kaplan.
6/26/2015 7:13 PMTries1. 6/26/2015 7:13 PMTries2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3) Huffman encoding.
1 Efficient Discovery of Conserved Patterns Using a Pattern Graph Inge Jonassen Pattern Discovery Arwa Zabian 13/07/2015.
Building Suffix Trees in O(m) time Weiner had first linear time algorithm in 1973 McCreight developed a more space efficient algorithm in 1976 Ukkonen.
Survey: String Matching with k Mismatches Moshe Lewenstein Bar Ilan University.
1 Theory I Algorithm Design and Analysis (11 - Edit distance and approximate string matching) Prof. Dr. Th. Ottmann.
1 Exact Set Matching Charles Yan Exact Set Matching Goal: To find all occurrences in text T of any pattern in a set of patterns P={p 1,p 2,…,p.
On the Use of Regular Expressions for Searching Text Charles L.A. Clarke and Gordon V. Cormack Fast Text Searching.
A Space-Economical Suffix Tree Construction Algorithm Edward M. McCreight (1976) { From Ukkonen to McCreight and Weiner: A Unifying } View of Linear-Time.
Introduction n – length of text, m – length of search pattern string Generally suffix tree construction takes O(n) time, O(n) space and searching takes.
© 2004 Goodrich, Tamassia Tries1. © 2004 Goodrich, Tamassia Tries2 Preprocessing Strings Preprocessing the pattern speeds up pattern matching queries.
CSC401 – Analysis of Algorithms Chapter 9 Text Processing
1. 2 Overview  Suffix tries  On-line construction of suffix tries in quadratic time  Suffix trees  On-line construction of suffix trees in linear.
Tries1. 2 Outline and Reading Standard tries (§9.2.1) Compressed tries (§9.2.2) Suffix tries (§9.2.3)
Suffix trees. Trie A tree representing a set of strings. a b c e e f d b f e g { aeef ad bbfe bbfg c }
Everything is String. Closed Factorization Golnaz Badkobeh 1, Hideo Bannai 2, Keisuke Goto 2, Tomohiro I 2, Costas S. Iliopoulos 3, Shunsuke Inenaga 2,
Applications of Suffix Trees Dr. Amar Mukherjee CAP 5937 – ST: Bioinformatics University of central Florida.
Contents What is a trie? When to use tries
Suffix Tree 6 Mar MinKoo Seo. Contents  Basic Text Searching  Introduction to Suffix Tree  Suffix Trees and Exact Matching  Longest Common Substring.
Chapter 11. Chapter Summary  Introduction to trees (11.1)  Application of trees (11.2)  Tree traversal (11.3)  Spanning trees (11.4)
Advanced Data Structures Lecture 8 Mingmin Xie. Agenda Overview Trie Suffix Tree Suffix Array, LCP Construction Applications.
ETH Zurich A Database Index to Large Biological Sequences (Ela Hunt, Malcolm P. Atkinson, Robert W. Irving) A report on the paper from Nicola.
Generic Trees—Trie, Compressed Trie, Suffix Trie (with Analysi
Tries 4/16/2018 8:59 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
15-853:Algorithms in the Real World
MA/CSSE 473 Day 26 Student questions Boyer-Moore B Trees.
Tries 07/28/16 11:04 Text Compression
McCreight's suffix tree construction algorithm
Tries 5/27/2018 3:08 AM Tries Tries.
Andrzej Ehrenfeucht, University of Colorado, Boulder
Tries 9/14/ :13 AM Presentation for use with the textbook Data Structures and Algorithms in Java, 6th edition, by M. T. Goodrich, R. Tamassia, and.
Suffix trees.
Suffix trees and suffix arrays
Tries 2/23/2019 8:29 AM Tries 2/23/2019 8:29 AM Tries.
Suffix Trees String … any sequence of characters.
Tries 2/27/2019 5:37 PM Tries Tries.
Sequences 5/17/ :43 AM Pattern Matching.
Presentation transcript:

1 Prof. Dr. Th. Ottmann Theory I Algorithm Design and Analysis (12 - Text search: suffix trees)

2 Text search Different scenarios: Dynamic texts Text editors Symbol manipulators Static texts Literature databases Library systems Gene databases World Wide Web

3 Properties of suffix trees Search index for a text  in order to search for patterns  Properties: 1. Substring search in time O(|  |). 2. Queries to  itself, e.g.: Longest substring in  occurring at least twice. 3. Prefix search: all positions in  with prefix .

4 Properties of suffix trees 4. Range search: all positions in  in interval [ ,  ] with   lex , e.g. abracadabra, acacia  [abc, acc], abacus  [abc, acc]. 5. Linear complexity: Required space and time for construction in O(|  |)

5 Tries Trie: tree for representing keys. alphabet , set S of keys, S   * Key String   * Edge of a trie T: label with a single character from  Neighboring edges: different characters

6 Tries a a a c b b c b b c c c Example:

7 Tries Each leaf represents a key: corresponds to the labeling of the edges on the path from the root to the leaf ! Keys are not stored in nodes !

8 Suffix tries Trie for all suffixes of a text Example:  = ababc suffixes: ababc = suf 1 babc = suf 2 abc = suf 3 bc = suf 4 c = suf 5 a a a c b b c b b c c c

9 Suffix tries Internal nodes of a suffix trie = substrings of . Each proper substring of  is represented as an internal node. Let  = a n b n :  n 2 + 2n + 1 different substrings = internal nodes  Space requirement in O(n 2 ).

10 Suffix tries A suffix trie T fulfills some of the required properties: a a a c b b c b b c c c 1. String matching for  : follow the path with edge labels  in T in time O (|  |). #leaves of the subtree #occurrences of  2. Longest repeated substring: internal node with the greatest depth which has at least two children. 3. Prefix search: all occurrences of strings with prefix  can be found in the subtree below the internal node corresponding to  in T.

11 Suffix trees A suffix tree is created from a suffix trie by contraction of unary nodes: a a a c b b c b b c c c ab abc b c c c suffix tree = contracted suffix trie

12 Internal representation of suffix trees Child-sibling representation Substring: pair of numbers (i,j) ab abc b c c c T Example:  = ababc

13 Internal representation of suffix trees (  ) (1,2)(2,2)(5,$) (3,$)(5,$)(3,$)(5,$) ab abc b c c c Example:  = ababc node v = (v.w, v.o, v.sn, v.br) Further pointers (suffix pointers) are added later

14 Properties of suffix trees (S1)No suffix is prefix of another suffix; this holds if (last character of  ) = $  Search: (T1)edge non-empty substring of . (T2) neighboring edges: corresponding substrings start with different characters.

15 Properties of suffix trees Size (T3)each internal node (  root) has at least two children (T4)leaf (non-empty) suffix of . Let n = |  |  1

16 Construction of suffix trees Definition: Partial path: path from the root to a node in T Path: a partial path ending in a leaf Location of a string  : node at the end of the partial path labeled with  (if it exists). ab abc b c c c T

17 Construction of suffix trees Extension of a string  : string with prefix  Extended location of a string  : place of the shortest extension of , whose place is defined. Contracted location of a string  : place of the longest prefix of , whose place is defined. ab abc b c c c T

18 Construction of suffix trees Definitions: suf i : suffix of  starting at position i, e.g. suf 1 = , suf n = $. head i : longest prefix of suf i which is also a prefix of suf j for a j < i. Example:  = bbabaabc  = baa (has no location) suf 4 = baabc head 4 = ba

19 Construction of suffix trees a abc c b aabc b baabc ac babaabc c  = bbabaabc

20 Naive suffix-tree construction Begin with the empty tree T 0 Tree T i+1 is created from T i by inserting suffix suf i+1. Algorithm suffix tree Input: a text  Output: the suffix tree T for  1 n := |  |; T 0 :=  ; 2 for i := 0 to n – 1do 3 insert suf i+1 in T i, resulting in T i+1 ; 4 end for

21 Naive suffix-tree construction In T i all suffixes suf j (j < i) already have a location.  head i = longest prefix of suf i whose extended location in T i-1 exists. Definition: tail i := suf i – head i, i.e. suf i = head i tail i. tail i  .

22 Naive suffix-tree construction Example:  = ababc suf 3 = abc head 3 = ab tail 3 = c T 0 = T 1 = T 2 = ababc babc

23 Naive suffix-tree construction T i+1 ca be constructed from T i as follows: 1.Determine the extended location of head i+1 in T i and split the last edge leading to this location into two new edges by inserting a new node. 2. Create a new leaf as location for suf i+1 x = erweiterter Ort von head i+1 x v head i+1 tail i+1

24 Naive suffix-tree construction Example:  = ababc babc c ababc abc ab T3T3 T2T2 head 3 = ab tail 3 = c

25 Naive suffix-tree construction Algorithm suffix insertion Input: tree T i and suffix suf i+1 Output: tree T i+1 1v := root of T i 2j := i 3repeat 4find child w of v with  w.u =  j+1 5k := w.u – 1; 6while k < w.o and  k+1 =  j+1 do 7 k := k +1; j := j + 1 8end while

26 Naive suffix-tree construction 9if k = w.o then v := w 10 until k <w.o or w = nil 11 /* v is the contracted location of head i+1 */ 12 insert the location of head i+1 and tail i+1 in T i below v Running time for suffix insertion: O( ) Total time for naive suffix-tree construction: O( )

27 The algorithm M (Mc Creight, 1976) When the extended location of head i+1 in T i has been found: creation of a new node and edge splitting in O(1) time.+ Idea: Extended location of head i+1 is determined in constant amortized time in T i. (Additional information is required!)

28 Analysis of algorithm M Theorem 1 Algorithm M constructs a suffix tree for  with |  | leaves and at most |  | - 1 internal nodes in time O(|  |). Remark: Ukkonen (1992) found an O(n) on-line algorithm for the construction of suffix trees, i.e. after each step i, the resulting structure is a correct suffix tree for t 1 …t i (where  = t 1 …t n ).

29 Suffix tree: application Usage of suffix tree T: 1Search for string  : follow the path with edge labeling  in T in time O(|  |). leaves of the subtree occurrences of  2Search for longest repeated substring: Find the location of a substring with the greatest weighted depth that is an internal node 3Prefix search: All occurrences of strings with prefix  can be found in the subtree below the „location“ of  in T.

30 Suffix tree: application 4Range query for [ ,  ] : Range boundaries

31 Suffix tree: example T 0 = T 1 = bbabaabc suf 1 = bbabaabcsuf 2 = babaabc head 2 = b

32 Suffix tree: example T 2 = b abaabc babaabbc T 3 = abaabc b abaabc babaabbc suf 3 = abaabc suf 4 = baabc head 3 =  head 4 = ba

33 Suffix tree: example T4 =T4 = abaabc b babaabbc a abc baabclocation of head 4 suf 5 = aabc head 5 = a

34 Suffix tree: example babaabbc a abc baabc location of head 5 abc a b T 5 = suf 6 = abc head 6 = ab baabc

35 Suffix tree: example babaabbc a abc baabc location of head 6 abc a b T 6 = b c aabc suf 7 = bc head 7 = b

36 Suffix tree: example babaabbc a abc baabc abc a b T7 =T7 = b c aabc c suf 8 = c

37 Suffix tree: example babaabbc a abc baabc abc a b T 8 = b c aabc c c