Alon Efrat Computer Science Department University of Arizona Suffix Trees.



2 Purpose
Given a (very long) text R, preprocess it so that once a query text P is given, we can efficiently decide whether P appears in R. (Later: also where P appears in R.)
Example: R = “HelloWorldWhatANiceDay”. IsIn(“World”) = YES, IsIn(“Word”) = NO, IsIn(“l”) = YES (note: “l” appears more than once).

3 Definition: A suffix
For a word R, a suffix is what is left of R after deleting its first few characters (possibly none).
All the suffixes of R = “Hello”: Hello, ello, llo, lo, o.
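
As a small illustration of the definition, the suffixes of a word can be listed with one slice per starting index. This is only a sketch; the function name suffixes is an assumption made for this example and does not come from the slides.

```python
def suffixes(r):
    """Return every non-empty suffix of r, longest first."""
    # The suffix that starts at index k is r[k:].
    return [r[k:] for k in range(len(r))]


print(suffixes("Hello"))  # ['Hello', 'ello', 'llo', 'lo', 'o']
```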

4 Alg for answering IsIn
Preprocessing: Create an empty trie T. Given R = “HelloWorldWhatANiceDay”, insert all suffixes of R into T.
Answering IsIn(P): Just check whether P is in T, that is, return find(P). (Here, find is as studied in the lecture on tries.)
This works because P appears in R exactly when P is a prefix of some suffix of R.
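
The two steps can be sketched as follows. This is a minimal sketch, not the course's implementation: the trie is represented with nested dictionaries, the names build_suffix_trie and is_in are assumptions, and find is taken to succeed whenever the search for P can follow a path from the root (P need not end at a leaf).

```python
def build_suffix_trie(r):
    """Insert every suffix of r into a trie made of nested dicts."""
    root = {}
    for k in range(len(r)):
        node = root
        for ch in r[k:]:
            # Follow the child for ch, creating it when it is missing.
            node = node.setdefault(ch, {})
    return root


def is_in(trie, p):
    """Return True if p labels a path that starts at the root."""
    node = trie
    for ch in p:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("HelloWorldWhatANiceDay")
print(is_in(trie, "World"))  # True
print(is_in(trie, "Word"))   # False
print(is_in(trie, "l"))      # True
```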

5 Example
R = “hello”. Suffixes: “hello”, “ello”, “llo”, “lo”, “o”.
[Figure: the trie containing all suffixes of “hello”; example query P = “ll”.]

6 Let’s get greedy
Given a (very long) text R, preprocess it so that once a query text P is given, we can efficiently find the location of P in R (if it appears at all). More specifically, report the index at which P starts to appear in R. (If there is more than one answer, report the last one.)
Example: R = “HelloWorldWhatANiceDay”. Where(“World”) = 5, since “World” appears starting at index 5 in R (indices start at 0). Where(“Word”) = NoWhere. Where(“l”) = 8 (“l” also appears in other places).

7 Alg for answering Where
Modify the trie so that each node also contains a field b_inx. When inserting into the trie a word s whose first character is at index k of R, modify the nodes along the insertion path to contain the value k.
Preprocessing: Create an empty trie T. Given R = “HelloWorldWhatANiceDay”, insert all suffixes of R into T.
Answering Where(P): Just check whether P is in T, that is, return find(P) together with the value of b_inx at the node where the search terminates. (Here, find is as studied in the lecture on tries.)
The resulting data structure is called an Uncompressed Suffix Tree.
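
A minimal sketch of this modification, under stated assumptions: the class Node and the functions build and where are names invented for the example, None stands in for NoWhere, and suffixes are inserted in increasing order of their start index k, so later suffixes overwrite b_inx and the last occurrence is the one reported.

```python
class Node:
    def __init__(self):
        self.children = {}  # character -> Node
        self.b_inx = None   # start index in R of the last suffix inserted through this node


def build(r):
    """Build the uncompressed suffix tree of r, recording b_inx on every node."""
    root = Node()
    for k in range(len(r)):          # the suffix that starts at index k
        node = root
        for ch in r[k:]:
            node = node.children.setdefault(ch, Node())
            node.b_inx = k           # later suffixes overwrite earlier values
    return root


def where(root, p):
    """Return the start index of the last occurrence of p, or None if p is absent."""
    node = root
    for ch in p:
        node = node.children.get(ch)
        if node is None:
            return None
    return node.b_inx


root = build("HelloWorldWhatANiceDay")
print(where(root, "World"))  # 5
print(where(root, "Word"))   # None
print(where(root, "l"))      # 8
```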

8 Example
R = “hello”. Suffixes: “hello”, “ello”, “llo”, “lo”, “o”.
[Figure: the same trie with the b_inx value marked on each node; example query P = “ll”.]

9 So much memory?
The problem with this data structure comes from long paths: sequences of nodes in which every node but the last has a single child, and all nodes have the same value of b_inx.
[Figure: the trie for “hello”, with the path h, e, l, l, o (b_inx=0) highlighted.]
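
To get a feel for the memory cost, the nodes of the uncompressed trie can simply be counted. The sketch below uses a nested-dict trie and the hypothetical helper names build_suffix_trie and count_nodes. When all characters of R are distinct, no two suffixes share a node, so a text of length n produces n(n+1)/2 nodes.

```python
def build_suffix_trie(r):
    """Insert every suffix of r into a trie made of nested dicts."""
    root = {}
    for k in range(len(r)):
        node = root
        for ch in r[k:]:
            node = node.setdefault(ch, {})
    return root


def count_nodes(node):
    """Count the nodes below node (the root itself is not counted)."""
    return sum(1 + count_nodes(child) for child in node.values())


print(count_nodes(build_suffix_trie("hello")))     # 14
print(count_nodes(build_suffix_trie("abcdefgh")))  # 36, that is 8 * 9 / 2
```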

10 More examples of paths

11 Solution
Recall that all strings in the tree are suffixes of the same text R. Add two new fields to each node, called c_idx and lng, such that if lng > 0 then, when computing the string that the node represents, we need to concatenate lng characters from R starting at position c_idx.
[Figure: for R = “hello”, the path h, e, l, l, o with b_inx=0 is replaced by a single node reached through h, with c_idx=1, lng=4.]
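
The role of the two fields can be shown with a single slice of R. This is a sketch only; the helper name compressed_label is an assumption, and the values c_idx=1, lng=4 are the ones from the “hello” example.

```python
R = "hello"


def compressed_label(r, c_idx, lng):
    """Characters a compressed node stands for: lng characters of r starting at c_idx."""
    return r[c_idx:c_idx + lng]


# The chain h -> e -> l -> l -> o collapses to one node reached through 'h';
# the remaining four characters are recovered from R instead of being stored.
print("h" + compressed_label(R, 1, 4))  # hello
```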

12 Compressing the tree
Assume we are visiting a node v of the tree whose distance (number of edges) from the root in the uncompressed trie is k, and that v is the first node on a path. Then c_idx = b_inx + k. So the function compress_tree should ‘know’ the distance from the root (in the uncompressed tree) of the visited node.

13 Need a function compress_tree that accepts a node v of the tree and the depth of v in the uncompressed tree. Also need the function check_path(NODE *p), returning the length (in number of edges) of the path starting at *p. So, for example, if *p has two children, it returns 0.
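
One possible shape for check_path is sketched below. It is an assumption that the function returns both the length and the last node of the path (the slides only say it returns the length), and the Node class is invented for the example. Following the definition of a path, the walk stops as soon as a node has zero or several children, or the single child carries a different b_inx.

```python
class Node:
    def __init__(self, b_inx=None):
        self.children = {}  # character -> Node
        self.b_inx = b_inx


def check_path(p):
    """Return (length in edges, last node) of the single-child chain starting at p."""
    length = 0
    while len(p.children) == 1:
        (child,) = p.children.values()
        if child.b_inx != p.b_inx:
            break                    # the path only covers nodes with equal b_inx
        p = child
        length += 1
    return length, p


# Hypothetical chain h -> e -> l -> l -> o, every node with b_inx = 0.
head = Node(0)
cur = head
for ch in "ello":
    cur.children[ch] = Node(0)
    cur = cur.children[ch]
print(check_path(head)[0])  # 4

# A node with two children does not start a path.
fork = Node(3)
fork.children = {"l": Node(2), "o": Node(3)}
print(check_path(fork)[0])  # 0
```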

14 Compressing the tree (cont.)
compress_tree(NODE *p, int depth) {
    for each non-empty cell ar[i] of *p
        if ( (h = check_path(p->ar[i])) > 0 ) {
            Let q be a pointer to the node at the end of the path,
            let h be the length of the path, and let d be the depth of q
            (in the uncompressed tree), so d = depth + 1 + h.
            q, d, and h should all be obtained from check_path (think how).
            Set p->ar[i] = q
            Free unused nodes
            q->c_idx = q->b_inx + depth + 1
            q->lng = h
            compress_tree(q, d)
        } else
            compress_tree(p->ar[i], depth + 1)   (no path here; continue below this child)
}
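
The whole procedure can be sketched end to end as below. This is a sketch under assumptions, not the course's code: Node, build, and is_in are names chosen for the example, check_path is assumed to return the pair (length, last node), children that do not start a path are recursed into with depth + 1, and dropped nodes are left to Python's garbage collector instead of being freed by hand.

```python
class Node:
    def __init__(self):
        self.children = {}  # first character of the edge -> Node
        self.b_inx = None   # start index of the last suffix inserted through this node
        self.c_idx = 0      # the edge into this node also carries R[c_idx : c_idx + lng]
        self.lng = 0


def build(r):
    """Uncompressed suffix tree of r with b_inx set on every node."""
    root = Node()
    for k in range(len(r)):
        node = root
        for ch in r[k:]:
            node = node.children.setdefault(ch, Node())
            node.b_inx = k
    return root


def check_path(p):
    """Return (length in edges, last node) of the single-child chain starting at p."""
    length = 0
    while len(p.children) == 1:
        (child,) = p.children.values()
        if child.b_inx != p.b_inx:
            break
        p, length = child, length + 1
    return length, p


def compress_tree(p, depth):
    """Compress below p, where p is depth edges from the root of the uncompressed trie."""
    for ch, child in list(p.children.items()):
        h, q = check_path(child)
        if h > 0:
            p.children[ch] = q               # bypass the chain; its nodes become garbage
            q.c_idx = q.b_inx + depth + 1    # first skipped character in R
            q.lng = h
        compress_tree(q, depth + 1 + h)      # depth of q in the uncompressed trie


def is_in(root, r, pattern):
    """Search the compressed tree, reading skipped characters back from r."""
    node, i = root, 0
    while i < len(pattern):
        node = node.children.get(pattern[i])
        if node is None:
            return False
        i += 1
        skipped = r[node.c_idx:node.c_idx + node.lng]
        rest = pattern[i:i + node.lng]
        if not skipped.startswith(rest):
            return False
        i += len(rest)
    return True


R = "hello"
root = build(R)
compress_tree(root, 0)
print(is_in(root, R, "ll"), is_in(root, R, "helo"))  # True False
```

After compression, the node reached through 'h' holds c_idx=1 and lng=4, matching the “hello” example.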

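15 How large is the tree now?
Lemma: If T is a tree with no node of degree 1 (that is, no node with exactly one child), then the number of nodes is O(number-of-leaves).
In our scenario, number-of-leaves ≤ |R|, since every leaf is the end of one of the |R| suffixes. So the size of the trie is O(|R|).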
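
The lemma follows from a short counting argument, sketched here under the assumption that “degree” means the number of children. The symbols $n$, $I$, and $L$ (total nodes, internal nodes, leaves) are introduced only for this sketch.

```latex
\begin{align*}
  n &= I + L
    && \text{every node is internal or a leaf} \\
  n - 1 &= \sum_{v \text{ internal}} \#\mathrm{children}(v) \;\ge\; 2I
    && \text{every non-root node has one parent; no node has exactly one child} \\
  I + L - 1 &\ge 2I \;\Longrightarrow\; I \le L - 1 \\
  n &= I + L \;\le\; 2L - 1 = O(L)
\end{align*}
```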