Hierarchy-conscious Data Structures for String Analysis
Carlo Fantozzi, PhD Student (XVI cycle)
Bioinformatics Course, June 25, 2002


Outline of the Talk
- What is a memory hierarchy?
- Computational models
- Data structures
We will describe data structures that deal efficiently with memory hierarchies when analyzing strings.

Memory Hierarchy
- Imposed by technology
- Moving away from the ALU: increasing size, decreasing speed (by orders of magnitude!)
- Levels: ALU, L1 instruction and L1 data caches, L2 cache, L3 cache, main memory, disk(s), network
- Relevant data must be kept in the faster levels
- Who should do it? Maybe the programmer

Computational Models
- No model describes the whole hierarchy
- External memory: I/O model
  - RAM: limited capacity M
  - Disk: transfers in blocks of size B
  - Cost: number of I/O transfers only
- Internal memory: ideal cache model
  - Cache plays the role of RAM, RAM plays the role of disk
  - Cache updates are automatic and optimal
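
The cost measure of the I/O model can be illustrated with a small simulation. This is only a sketch: the class name `IOModel`, the helper `count_ios` and the LRU replacement policy are assumptions made for the example (the I/O model leaves block placement to the algorithm, and the ideal cache model assumes optimal replacement). Only block transfers are counted; accesses to resident blocks are free.

```python
from collections import OrderedDict

class IOModel:
    """Two-level memory: RAM holds M items, disk transfers blocks of B items."""

    def __init__(self, M, B):
        self.B = B
        self.frames = M // B       # number of blocks that fit in RAM
        self.ram = OrderedDict()   # resident block ids, kept in LRU order
        self.ios = 0               # number of block transfers so far

    def access(self, address):
        block = address // self.B
        if block in self.ram:
            self.ram.move_to_end(block)    # hit: costs nothing in this model
            return
        self.ios += 1                      # miss: one block transfer
        self.ram[block] = True
        if len(self.ram) > self.frames:
            self.ram.popitem(last=False)   # evict the least recently used block

def count_ios(addresses, M, B):
    """Replay a sequence of item addresses and return the number of I/Os."""
    mem = IOModel(M, B)
    for a in addresses:
        mem.access(a)
    return mem.ios

N, M, B = 1024, 64, 16
sequential = range(N)
# Same N items, visited with stride B: every access lands in a different block.
strided = [i * B + j for j in range(B) for i in range(N // B)]
print(count_ios(sequential, M, B), count_ios(strided, M, B))
```

Both access patterns read the same 1024 items, but the sequential scan pays N/B transfers while the strided one pays a transfer on every access, which is the kind of gap a hierarchy-conscious data structure tries to avoid.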

Basic Data Structures: Tries
- Trie: tree built on a set of strings
- Every string has a path in the tree
- PATRICIA trie: nonbranching nodes collapsed, only 1st character stored
  - n strings: O(n) space
  - Loss of information
[Figure: example trie over the alphabet {a, b} and its PATRICIA trie, with edge labels such as (a,2), (b,1), (b,2), (a,1)]
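
A minimal in-memory sketch of the two structures, assuming nested dictionaries as nodes and a `$` terminator appended to every string (both are choices made for this example, as are the function names). Collapsed edges keep only a (first character, length) pair, and the length here includes the terminator.

```python
def build_trie(strings):
    """Plain trie as nested dicts; '$' marks the end of each string."""
    root = {}
    for s in strings:
        node = root
        for ch in s + "$":
            node = node.setdefault(ch, {})
    return root

def compress(node):
    """Collapse non-branching paths into edges labeled (first char, length)."""
    out = {}
    for ch, child in node.items():
        length = 1
        while len(child) == 1:                 # non-branching node: absorb it
            child = next(iter(child.values()))
            length += 1
        out[(ch, length)] = compress(child)
    return out

p1 = compress(build_trie(["abaa", "abab", "bb"]))
p2 = compress(build_trie(["aaaa", "aaab", "ba"]))
print(p1)
print(p1 == p2)
```

The two string sets differ, yet they yield the same compressed trie: this is the loss of information mentioned above, and it is why a search must finish with a comparison against an actual string.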

B-trees
- Balanced search tree (balance is enforced)
- ~n sorted keys per node (k1, k2, k3, …, kn), which partition the keys in the subtrees
- One node fits one disk block
- Searching for keys matching prefix x takes O(log_B N + k/B) I/Os, where k is the number of matching keys
- Updates: O(log_B N) I/Os
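
The log_B N term can be seen in a small sketch. It assumes a static, bulk-loaded tree in which internal nodes store the smallest key of each child as a router (closer to a B+-tree than to a dynamic B-tree, with no insertions or rebalancing), and it handles point lookups only. `Node`, `build` and `search` are names invented for the example, and `build` expects a non-empty sorted list.

```python
import bisect

class Node:
    """One node = one disk block holding at most b sorted keys."""

    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children   # None for a leaf

def build(sorted_keys, b):
    """Bulk-load a static tree bottom-up from a non-empty sorted list."""
    level = [Node(sorted_keys[i:i + b]) for i in range(0, len(sorted_keys), b)]
    while len(level) > 1:
        groups = [level[i:i + b] for i in range(0, len(level), b)]
        # An internal node stores the smallest key of each child as a router.
        level = [Node([child.keys[0] for child in g], g) for g in groups]
    return level[0]

def search(root, x):
    """Return (found, ios), where ios counts the nodes (blocks) read."""
    node, ios = root, 1
    while node.children is not None:
        i = max(bisect.bisect_right(node.keys, x) - 1, 0)
        node = node.children[i]
        ios += 1
    i = bisect.bisect_left(node.keys, x)
    return (i < len(node.keys) and node.keys[i] == x), ios

keys = list(range(0, 8192, 2))   # N = 4096 even numbers
root = build(keys, 16)           # b = 16 keys per block
print(search(root, 100), search(root, 101))
```

With N = 4096 and 16 keys per block the tree has three levels, so a lookup reads 3 blocks, whereas a binary search over the same keys would probe about 12 positions.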

String B-trees (Ferragina & Grossi)
- Useful for “big” or “irregular” strings
- The keys are pointers to strings
  - Unsorted strings are in a separate array
  - Who knows lexicographic order?
- Each node contains a PATRICIA trie built on the corresponding strings
- search(x): O(log_B n + (|x|+k)/B) I/Os
- update(x): O(log_B n + |x|/B) I/Os

Cache-oblivious B-trees (1)
- Developed by Bender, Demaine, Farach-Colton
- The B-tree is stored in memory according to the van Emde Boas layout
- Balancing rule; long-distance nodes
- search(x): O(log_B N + k/B) I/Os
- update(x): O(log_B N) I/Os for suitable B
[Figure: van Emde Boas layout of a tree of height h, applied recursively]
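
The recursive layout can be sketched for a complete binary tree. The sketch assumes heap numbering of the nodes (root 1, children 2i and 2i+1) and gives the top part floor(h/2) levels; other presentations round the split differently. The function name `veb_order` is made up for the example.

```python
def veb_order(root, height):
    """Return heap indices of a complete subtree in van Emde Boas order.

    The subtree rooted at `root` has `height` levels. It is cut into a top
    tree and several bottom trees; the top tree is laid out first, then each
    bottom tree from left to right, every piece recursively.
    """
    if height == 1:
        return [root]
    top = height // 2
    bottom = height - top
    order = veb_order(root, top)
    first = root << top                       # leftmost descendant at depth `top`
    for r in range(first, first + (1 << top)):
        order += veb_order(r, bottom)         # one bottom tree per such descendant
    return order

print(veb_order(1, 4))
```

For a tree with 4 levels, the nodes are stored as 1 2 3, then 4 8 9, 5 10 11, 6 12 13, 7 14 15. Each small subtree is contiguous in memory, so a root-to-leaf path touches few blocks whatever the block size B turns out to be.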

Cache-oblivious B-trees (2)
- Developed by Brodal, Fagerberg and Jacob
  1. Build a dynamic, binary, balanced search tree
  2. Embed it into a static, binary tree
  3. Store it using the van Emde Boas layout
- Same I/Os with a simpler data structure
- Cache obliviousness: the ideal cache model can be simulated on a realistic, multilevel model with constant (expected) slowdown

Suffix Trees
- Flat memory model: a suffix tree can be built in
  - linear time if the alphabet Σ is finite
  - sorting time if Σ is infinite
- I/O model: studied by Farach et al.
  1. Build the odd tree (through an auxiliary suffix tree)
  2. Build the even tree (using the odd tree)
  3. Merge the two trees (through anchor nodes)
- I/O complexity equal to that of sorting (optimal)

Suffix Arrays (1)
- “Simplified suffix tree”
- A_S[i] = starting position of the i-th suffix of S in lexicographic order
- search(x): O((|x|/B) log N + k/B) I/Os
- Array construction, Manber & Myers:
  1. Put suffixes into buckets
  2. Stage i: sort according to the first 2^i symbols
  3. Read buckets in order
- Takes 8N space and O(N log N) I/Os
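
The doubling idea behind the staged sort can be sketched in memory. This is a simplification rather than the Manber & Myers algorithm itself: it uses Python's comparison sort in place of bucket-based radix passes and says nothing about I/Os. The function names are invented for the example.

```python
def suffix_array(s):
    """Prefix doubling: a pass with offset k orders suffixes by their first 2k symbols."""
    n = len(s)
    rank = [ord(c) for c in s]   # rank by first symbol
    sa = list(range(n))
    k = 1
    while True:
        # Pair of ranks = rank of the first k symbols, rank of the next k symbols.
        key = lambda i: (rank[i], rank[i + k] if i + k < n else -1)
        sa.sort(key=key)
        new_rank = [0] * n
        for j in range(1, n):
            differs = key(sa[j]) != key(sa[j - 1])
            new_rank[sa[j]] = new_rank[sa[j - 1]] + differs
        rank = new_rank
        if n == 0 or rank[sa[-1]] == n - 1:   # all ranks distinct: fully sorted
            return sa
        k *= 2

def find_occurrences(s, sa, x):
    """Binary search for the suffixes having x as a prefix; return their positions."""
    lo, hi = 0, len(sa)
    while lo < hi:                            # first suffix with prefix >= x
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(x)] < x:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:                            # first suffix with prefix > x
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + len(x)] <= x:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "banana"
sa = suffix_array(text)
print(sa, find_occurrences(text, sa, "ana"))
```

Each binary-search step compares up to |x| symbols of a suffix, which is where the (|x|/B) log N term in the search bound comes from when the text lives on disk.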

Suffix Arrays (2)
- Array construction, Gonnet et al.: cubic I/O complexity, but…
  - worst case analysis is too pessimistic
  - nearly all I/Os are bulk, i.e. sequential
- Gonnet’s algorithm beats the one of Manber & Myers in practice
- Ferragina and Grossi: new algorithm
  - Better I/O complexity than Gonnet’s
  - Faster than Gonnet’s in practice
