Sandeep Tata, Richard A. Hankins, and Jignesh M. Patel Presented by Niketan Pansare, Megha Kokane.


• Overview
• Existing Algorithms
• Motivation
• Two Approaches
• Experiments and Analysis

• Suffix tree: compressed trie for the nonempty suffixes of a string.
• Example
• Efficient for querying: exact string matching = O(length of query)
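To make the O(length of query) claim concrete, here is a minimal sketch that uses an uncompressed suffix trie built from nested dicts. This is an assumption made for brevity: a real suffix tree compresses single-child paths, and the function names are hypothetical. The lookup loop takes one step per query character regardless of the text length.

```python
def build_suffix_trie(text):
    """Insert every suffix of `text` into a nested-dict trie (uncompressed)."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie, query):
    """Exact substring match: one dictionary step per query character."""
    node = trie
    for ch in query:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("banana")
print(contains(trie, "nan"), contains(trie, "nab"))
```

The naive construction above is quadratic in space and time; it is only meant to illustrate the query path, not any of the construction algorithms discussed in the slides.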

• Weiner
• McCreight
• Ukkonen: O(n), uses suffix links. Starts from an empty tree and inserts suffixes into the partial tree from the longest to the shortest suffix.
• Pjama: mentioned in previous paper.
• Deep Shallow: space-efficient, internal memory. O(n^2 log n).

Construction of suffix tree for single chromosome hr
Even though the theoretical complexity of the Deep Shallow algorithm is O(n^2 log n), it is the fastest in-memory algorithm in practice.
Problem with previous methods:
◦ Poor locality of reference, which causes high cache miss rates and random I/O.
◦ Analogy: trying to use internal sorting on an array larger than memory, instead of external sorting.

• Buffer management strategy for the Top Down Disk based (TDD) algorithm, O(n^2).
• New disk-based suffix tree construction algorithm based on the sort-merge paradigm (more efficient than the first).

• Reduces the main memory requirements through strategic buffering of the largest data structures.
• This approach consists of:
◦ A suffix tree construction algorithm called ‘Partition and Write Only Top Down’ (PWOTD).
◦ The related buffer management strategy.

• Based on the wotdeager algorithm suggested by Giegerich et al.
• In PWOTD, the wotdeager algorithm is improved using a partitioning phase, which allows one to immediately build larger, independent subtrees in memory.
• Consists of 2 phases:
◦ Partitioning
◦ The wotdeager algorithm

• Suffixes are divided into |A|^prefixlen partitions.
• The input string is scanned from left to right. At each index position i, the prefixlen subsequent characters are used to determine one of the |A|^prefixlen partitions.
• At the end of the scan, each partition will contain the suffix pointers for suffixes that all have the same prefix of size prefixlen.
• Example
• Partitioning decreases the main-memory requirements, allowing independent subtrees to be built entirely in main memory.
• But the cost of the partitioning phase is O(n × prefixlen), which increases linearly with prefixlen.
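The partitioning scan can be sketched as follows. This is an illustrative in-memory version, not the disk-based implementation: the function name is hypothetical, partitions are held in a dict rather than written to disk, and suffixes shorter than prefixlen (near the end of the string) are simply keyed by whatever characters remain.

```python
from collections import defaultdict


def partition_suffixes(text, prefixlen):
    """Group suffix start positions by their first `prefixlen` characters."""
    partitions = defaultdict(list)
    for i in range(len(text)):            # single left-to-right scan
        key = text[i:i + prefixlen]       # O(prefixlen) work per position
        partitions[key].append(i)         # store a suffix pointer, not the suffix
    return dict(partitions)


print(partition_suffixes("banana$", 2))
```

Each position costs O(prefixlen) to form its key, which is where the O(n × prefixlen) bound for this phase comes from.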

• 4 data structures are used: an input string array String, a suffix array Suffixes, a temporary array Temp, and the suffix tree Tree.
• The Suffixes array is first populated with suffixes from a partition after discarding the first prefixlen characters.
• Illustration
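The top-down idea behind wotdeager (group suffixes by their next character, find the longest common prefix of each group, then recurse) can be sketched in memory as below. This is a simplification under stated assumptions: it uses Python dicts instead of the String, Suffixes, Temp and Tree arrays, expands nodes recursively rather than with an explicit stack, and assumes the text ends with a unique terminator such as `$`. The function name is hypothetical.

```python
def build_top_down(text, suffixes, depth=0):
    """Return a node as {edge_label: child}, where a leaf child is the suffix start index.

    Assumes `text` ends with a unique terminator, so no suffix is a prefix of another.
    """
    # Group the suffixes by the character found `depth` positions into each one.
    groups = {}
    for s in suffixes:
        groups.setdefault(text[s + depth], []).append(s)

    node = {}
    for ch in sorted(groups):
        group = groups[ch]
        if len(group) == 1:
            s = group[0]
            node[text[s + depth:]] = s    # leaf: the edge carries the rest of the suffix
            continue
        # Extend the shared edge label while every suffix in the group agrees.
        lcp = 1
        while (all(s + depth + lcp < len(text) for s in group)
               and len({text[s + depth + lcp] for s in group}) == 1):
            lcp += 1
        first = group[0]
        label = text[first + depth:first + depth + lcp]
        node[label] = build_top_down(text, group, depth + lcp)
    return node


tree = build_top_down("banana$", range(7))
print(tree)
```

In PWOTD the same routine would be run once per partition, starting at depth prefixlen, so that each subtree fits in main memory.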

• Tree buffer: the reference pattern to Tree consists mainly of sequential writes when the children of a node are being recorded. Occasionally, pages are revisited when an unexpanded node is popped off the stack.
• Very good spatial and temporal locality, hence the LRU replacement policy.

• Suffixes array:
• Scanned sequentially to copy it into the Temp array; the sorted array is then written back.
• There is some locality in the pattern of writes, since the writes start at each character-group boundary and proceed sequentially to the right.
• Hence LRU performs reasonably well.

• Temp array is referenced in 2 sequential scans: to copy all of the suffixes in the Suffixes array, and to copy all of them back into the Suffixes array in sorted order.
• Hence MRU works best for Temp.
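Why MRU suits repeated sequential scans can be seen in a small simulation. The sketch below is a toy buffer pool, not the TDD buffer manager: the function name, the page reference string and the pool capacity are all assumptions chosen for illustration.

```python
def count_misses(refs, capacity, policy):
    """Count buffer misses for a page reference string under "LRU" or "MRU"."""
    buffer = []                           # ordered from least to most recently used
    misses = 0
    for page in refs:
        if page in buffer:
            buffer.remove(page)           # hit: will be re-appended as most recent
        else:
            misses += 1
            if len(buffer) == capacity:
                # LRU evicts the oldest entry, MRU evicts the newest one.
                buffer.pop(0 if policy == "LRU" else -1)
        buffer.append(page)
    return misses


two_scans = list(range(6)) * 2            # pages 0..5 scanned twice
print(count_misses(two_scans, 4, "LRU"))  # LRU evicts each page just before it is needed again
print(count_misses(two_scans, 4, "MRU"))  # MRU keeps the early pages for the second scan
```

With 6 pages and a 4-page pool, LRU misses on every reference in both scans, while MRU retains part of the first scan and so hits on some pages the second time around.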

• String array:
• Smallest main-memory requirement of all the data structures.
• But the worst locality of access.
• Referenced when performing the count-sort and when finding the longest common prefix in each sorted group.
• The reference pattern is fairly complex but has some locality of reference, so both LRU and RANDOM would do well.

• The cache miss behavior for each buffer is approximately linear once the memory is allocated beyond a minimum point.
• Once we identify these points, we can allocate the minimum buffer size necessary for each structure. The remaining memory is then allocated in order of decreasing slopes of the buffer miss curves.

• Tree needs the least buffering because of its very good locality of reference.
• String needs the most buffering because of its very poor locality of reference.
• Temp has more locality than Suffixes.

• |A| pages each for Temp and Suffixes: to avoid the penalty of operating in the initial high miss-rate region.
• 2 pages for Tree: one for a parent that was written to a previous page and then pushed onto the stack for later processing.
• Remaining pages are allocated to the String array, up to its maximum required amount.
• Any pages still left over are allocated to Suffixes, Temp and Tree in order of preference.
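The allocation heuristic above can be sketched as a small function. Several details are assumptions rather than part of the slides: the function and parameter names are hypothetical, `max_pages` stands in for each structure's maximum useful size, and leftover pages are handed out greedily in the listed order (Suffixes, then Temp, then Tree).

```python
def allocate_buffers(total_pages, alphabet_size, max_pages):
    """Split `total_pages` among the four TDD buffers.

    `max_pages` maps each structure name to the most pages it can use (assumed input).
    """
    # Minimum allocations: |A| pages each for Suffixes and Temp, 2 pages for Tree.
    alloc = {"Suffixes": alphabet_size, "Temp": alphabet_size, "Tree": 2, "String": 0}
    remaining = total_pages - sum(alloc.values())
    if remaining < 0:
        raise ValueError("not enough pages for the minimum allocation")

    # String first (worst locality), then leftovers in the assumed preference order.
    for name in ("String", "Suffixes", "Temp", "Tree"):
        extra = min(remaining, max(0, max_pages[name] - alloc[name]))
        alloc[name] += extra
        remaining -= extra
    return alloc


limits = {"String": 50, "Suffixes": 30, "Temp": 30, "Tree": 1000}
print(allocate_buffers(100, 4, limits))
```

A fuller version would rank the leftover allocation by the measured slopes of the buffer miss curves rather than by a fixed order.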

• TDD is inefficient if the input string is significantly larger than the available memory.
• ST-Merge employs a divide-and-conquer strategy similar to the external merge sort algorithm.

• Partition the string into k disjoint subsets.
• Partition strategy:
◦ Randomly assign each suffix to one of the k buckets, or
◦ Let each subset contain only contiguous suffixes from the string. We will use this strategy.
• k = ⌈(n * f) / M⌉
◦ M is the amount of memory available, n is the size of the input string and f (> 1) is an adjustment factor.
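As a worked example of the partition count, the sketch below computes k and splits the suffix start positions into contiguous ranges. The function name and the default value of f are assumptions; rounding up to a whole number of partitions is also assumed, and the ranges are half-open (start, end) pairs.

```python
import math


def contiguous_partitions(n, memory, f=1.5):
    """Split suffix positions 0..n-1 into contiguous ranges, at most k of them."""
    k = math.ceil(n * f / memory)         # number of partitions, rounded up
    size = math.ceil(n / k)               # suffixes per partition
    return [(start, min(start + size, n)) for start in range(0, n, size)]


# n = 10 symbols, memory for 6 of them, f = 1.5  ->  k = ceil(2.5) = 3
print(contiguous_partitions(10, 6))
```

The adjustment factor f > 1 makes each partition smaller than the available memory, which leaves room for the tree built over that partition.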

• Apply the TDD (PWOTD) algorithm to the individual partitions, producing a set of suffix trees.

• Data structures used:
◦ Node (suffix tree: nonlinear linked list of nodes)
◦ Edge (ordered tuple of 2 nodes)
◦ NodeSet, EdgeSet

• Important subroutines:
◦ NodeMerge (NodeSet, ParentEdge): merges the root nodes of the trees that are generated by the first phase. It internally calls EdgeMerge.
◦ EdgeMerge (EdgeSet, ParentNode): merges multiple nodes that have common outgoing edges with a common prefix.
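The shape of the merge phase can be illustrated with a much simpler analogue: merging uncompressed tries stored as nested dicts. This is not NodeMerge or EdgeMerge themselves, since those work on compressed edges and must split edges at the end of a common prefix. Here every edge is a single character, so merging nodes reduces to a recursive union of children. All names are hypothetical.

```python
def suffix_trie(text, start, end):
    """Uncompressed trie over the suffixes of `text` that start in [start, end)."""
    root = {}
    for i in range(start, end):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root


def merge_tries(tries):
    """Merge a set of trie nodes into one by unioning children that share a label."""
    merged = {}
    labels = {ch for t in tries for ch in t}
    for ch in labels:
        # Children reached through the same character are merged recursively.
        merged[ch] = merge_tries([t[ch] for t in tries if ch in t])
    return merged


text = "banana$"
parts = [suffix_trie(text, 0, 3), suffix_trie(text, 3, 7)]
print(merge_tries(parts) == suffix_trie(text, 0, 7))
```

The result of merging the per-partition tries is the same trie that a single pass over all suffixes would build, which is the property the second phase of ST-Merge relies on.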

Example

• Proper partitioning ensures that most accesses to the string are in memory, and therefore require less I/O.
• Compared to TDD, the accesses to the string in the second phase have more spatial locality of reference (since the working set is smaller).
• However, if the amount of memory is greater than the size of the string, partitioning does not provide much benefit, and we simply use TDD.

• First phase: O(n^2)
• Second phase:
◦ Cost of merging the nodes: O(n * k)
◦ Cost of merging the edges: O(n^2)
• Therefore, the worst-case complexity of ST-Merge is O(n^2).

• Making ST-Merge and TDD execute in parallel.
• Using multiple disks and overlapping I/O with computation.

• For source code of TDD, go to ml ml

Thank You.