Inspector Joins. Shimin Chen, Phillip B. Gibbons, Todd C. Mowry, Anastassia Ailamaki. Carnegie Mellon University and Intel Research Pittsburgh.

Presentation transcript:

@ Carnegie Mellon Databases Inspector Joins 1 Inspector Joins
Shimin Chen, Phillip B. Gibbons, Todd C. Mowry, Anastassia Ailamaki
Carnegie Mellon University and Intel Research Pittsburgh

@ Carnegie Mellon Databases Inspector Joins 2 Exploiting Information about Data
 Ability to improve a query depends on the quality of available information
 General statistics on relations are inadequate
  May lead to incorrect decisions for specific queries
  Especially true for join queries
 Previous approaches exploit dynamic information
  Collecting information from previous queries
  Multi-query optimization [Sellis'88]
  Materialized views [Blakeley et al.'86]
  Join indices [Valduriez'87]
  Dynamic re-optimization of query plans [Kabra & DeWitt'98] [Markl et al.'04]
 This study exploits the inner structure of hash joins

@ Carnegie Mellon Databases Inspector Joins 3 Exploiting Multi-Pass Structure of Hash Joins
 Idea:
  Examine the actual data during the I/O partitioning phase (inspection)
  Extract useful information to improve the join phase
[Figure: two-phase hash join (I/O partitioning, then join); the extra information extracted by inspection greatly helps phase 2]

@ Carnegie Mellon Databases Inspector Joins 4 Using Extracted Information
 Enable a new join phase algorithm
  Reduces the primary performance bottleneck in hash joins, i.e. poor CPU cache performance
  Optimized for multi-processor systems
 Choose the most suitable join phase algorithm for special input cases
[Figure: I/O partitioning with inspection produces extracted information, which is used to decide among the join phase algorithms: the new algorithm, cache partitioning, cache prefetching, or simple hash join]

@ Carnegie Mellon Databases Inspector Joins 5 Outline  Motivation  Previous hash join algorithms  Hash join performance on SMP systems  Inspector join  Experimental results  Conclusions

@ Carnegie Mellon Databases Inspector Joins 6 GRACE Hash Join
 I/O Partitioning Phase:
  Divide the input relations into partitions with a hash function
 Join Phase (simple hash join):
  Build a hash table on a build partition, then probe it with the corresponding probe partition
 Random memory accesses cause poor CPU cache performance
  Over 70% of execution time stalled on cache misses!
[Figure: build and probe relations partitioned, each build/probe partition pair joined through an in-memory hash table]
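A minimal sketch of the two phases just described, not the authors' code: the tuple layout, NUM_PARTITIONS, and the hash function are illustrative assumptions.

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Tuple { uint32_t key; char payload[96]; };   // 4B join key + fixed-length payload

constexpr size_t NUM_PARTITIONS = 64;
inline size_t part_of(uint32_t key) { return key % NUM_PARTITIONS; }

// Phase 1: I/O partitioning -- divide a relation with the hash function.
std::vector<std::vector<Tuple>> partition(const std::vector<Tuple>& rel) {
    std::vector<std::vector<Tuple>> parts(NUM_PARTITIONS);
    for (const Tuple& t : rel) parts[part_of(t.key)].push_back(t);
    return parts;
}

// Phase 2: simple hash join on one partition pair -- build, then probe.
void join_partition(const std::vector<Tuple>& build, const std::vector<Tuple>& probe) {
    std::unordered_multimap<uint32_t, const Tuple*> ht;   // random accesses -> cache misses
    for (const Tuple& b : build) ht.emplace(b.key, &b);
    for (const Tuple& p : probe) {
        auto range = ht.equal_range(p.key);
        for (auto it = range.first; it != range.second; ++it) {
            /* emit join result (p, *it->second) */
        }
    }
}
```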

@ Carnegie Mellon Databases Inspector Joins 7 Cache Partitioning [Shatdal et al.'94] [Boncz et al.'99] [Manegold et al.'00]
 Recursively produce cache-sized partitions after I/O partitioning
 Avoid cache misses when joining cache-sized partitions
 Overhead of re-partitioning
[Figure: memory-sized build/probe partitions re-partitioned into cache-sized partitions]
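A minimal sketch of the re-partitioning step under the same assumptions as the GRACE sketch above (Tuple, NUM_PARTITIONS, join_partition); CACHE_FANOUT and the secondary hash are illustrative, a real system would size sub-partitions to the CPU cache.

```cpp
constexpr size_t CACHE_FANOUT = 16;
inline size_t cache_part_of(uint32_t key) { return (key / NUM_PARTITIONS) % CACHE_FANOUT; }

void cache_partition_and_join(const std::vector<Tuple>& build_mem,
                              const std::vector<Tuple>& probe_mem) {
    // The extra copying pass: split each memory-sized partition into cache-sized ones.
    std::vector<std::vector<Tuple>> b(CACHE_FANOUT), p(CACHE_FANOUT);
    for (const Tuple& t : build_mem) b[cache_part_of(t.key)].push_back(t);
    for (const Tuple& t : probe_mem) p[cache_part_of(t.key)].push_back(t);
    // Each pair now fits in the cache, so hash table probes hit in cache.
    for (size_t i = 0; i < CACHE_FANOUT; ++i) join_partition(b[i], p[i]);
}
```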

@ Carnegie Mellon Databases Inspector Joins 8 Cache Prefetching [Chen et al.'04]
 Reduce the impact of cache misses
  Exploit available memory bandwidth
  Overlap cache misses with computation
 Insert cache prefetch instructions into the code
 Still incurs the same number of cache misses
[Figure: build and probe tuples accessing a hash table, with prefetches issued ahead of the accesses]
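A minimal sketch of software prefetching on the probe side, using the GCC/Clang __builtin_prefetch intrinsic; the open bucket-array layout and PREFETCH_DISTANCE are illustrative assumptions, not the paper's exact prefetching scheme.

```cpp
struct Bucket { uint32_t key; const Tuple* tuple; Bucket* next; };

void probe_with_prefetch(Bucket* const* buckets, size_t nbuckets,
                         const std::vector<Tuple>& probe) {
    constexpr size_t PREFETCH_DISTANCE = 8;   // tuples probed while a miss is in flight
    for (size_t i = 0; i < probe.size(); ++i) {
        // Issue a prefetch for the bucket a future probe tuple will touch,
        // so its cache miss overlaps with the work done on tuple i.
        if (i + PREFETCH_DISTANCE < probe.size()) {
            size_t h = probe[i + PREFETCH_DISTANCE].key % nbuckets;
            __builtin_prefetch(buckets[h]);
        }
        for (Bucket* b = buckets[probe[i].key % nbuckets]; b; b = b->next)
            if (b->key == probe[i].key) { /* emit join result */ }
    }
}
```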

@ Carnegie Mellon Databases Inspector Joins 9 Outline  Motivation  Previous hash join algorithms  Hash join performance on SMP systems  Inspector join  Experimental results  Conclusions

@ Carnegie Mellon Databases Inspector Joins 10 Hash Joins on SMP Systems
 Previous studies mainly focus on uni-processors
 Memory bandwidth is precious: all CPUs share the bus to main memory
 In the join phase, each processor joins a different pair of partitions
[Figure: four CPUs with private caches on a shared bus to main memory, each joining its own build/probe partition pair]
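A minimal sketch of this join-phase parallelization, reusing join_partition from the GRACE sketch; the std::thread worker pool and the 1:1 thread-to-CPU mapping are simplifying assumptions.

```cpp
#include <thread>

void join_phase_smp(const std::vector<std::vector<Tuple>>& build_parts,
                    const std::vector<std::vector<Tuple>>& probe_parts,
                    size_t ncpus) {
    std::vector<std::thread> workers;
    for (size_t cpu = 0; cpu < ncpus; ++cpu) {
        workers.emplace_back([&, cpu] {
            // Each CPU walks its own subset of partition pairs; all of them
            // compete for the same shared memory bus.
            for (size_t i = cpu; i < build_parts.size(); i += ncpus)
                join_partition(build_parts[i], probe_parts[i]);
        });
    }
    for (auto& w : workers) w.join();
}
```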

@ Carnegie Mellon Databases Inspector Joins 11 Previous Algorithms on SMP Systems
 Join phase performance of joining a 500MB relation with a 2GB relation (details later in the talk)
 Aggregate performance degrades dramatically beyond 4 CPUs
 Lesson: reduce data movement (memory to memory, memory to cache)
[Charts: wall-clock time and aggregate time on all CPUs vs. number of CPUs, annotated with re-partition cost and bandwidth sharing]

@ Carnegie Mellon Databases Inspector Joins 12 Inspector Joins
 Extracted information: a summary of matching relationships
  Every K contiguous pages in a build partition form a sub-partition
  The summary tells which sub-partition(s) every probe tuple matches
[Figure: a build partition divided into sub-partitions 0-2 and a probe partition; the summary of the matching relationship is produced during I/O partitioning and consumed by the join phase]

@ Carnegie Mellon Databases Inspector Joins 13 Cache-Stationary Join Phase
 Recall cache partitioning: it pays a re-partitioning (copying) cost to create cache-sized partitions
 We want to achieve zero copying
[Figure: build and probe partitions re-partitioned and copied before being joined through a cache-resident hash table, highlighting the copying cost]

@ Carnegie Mellon Databases Inspector Joins 14 Cache-Stationary Join Phase
 Joins a sub-partition with its matching probe tuples
  A sub-partition is small enough to fit in the CPU cache
  Cache prefetching covers the remaining cache misses
 Zero copying: no recursive cache-sized partitions need to be generated
[Figure: one sub-partition's hash table resides in the CPU cache while its matching probe tuples are joined against it]
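A minimal sketch of this join phase, assuming the partitioning phase stored with every probe tuple a small list of the sub-partitions it may match (the extracted summary); the ProbeEntry layout and the pointer-based grouping are illustrative assumptions, reusing the types from the earlier sketches.

```cpp
struct ProbeEntry { Tuple tuple; std::vector<uint16_t> sub_parts; };  // per-tuple summary

void cache_stationary_join(const std::vector<std::vector<Tuple>>& sub_partitions,
                           const std::vector<ProbeEntry>& probe_partition) {
    // Preprocess: use the summary to group (pointers to) probe tuples by the
    // sub-partition(s) they may match; the tuples themselves are not copied.
    std::vector<std::vector<const Tuple*>> probes_for(sub_partitions.size());
    for (const ProbeEntry& e : probe_partition)
        for (uint16_t s : e.sub_parts) probes_for[s].push_back(&e.tuple);

    for (size_t s = 0; s < sub_partitions.size(); ++s) {
        // The sub-partition's hash table fits in the CPU cache and stays
        // resident ("stationary") while all of its probe tuples are joined.
        std::unordered_multimap<uint32_t, const Tuple*> ht;
        for (const Tuple& b : sub_partitions[s]) ht.emplace(b.key, &b);
        for (const Tuple* p : probes_for[s]) {
            auto range = ht.equal_range(p->key);
            for (auto it = range.first; it != range.second; ++it) { /* emit result */ }
        }
    }
}
```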

@ Carnegie Mellon Databases Inspector Joins 15 Filters in I/O Partitioning
 How to extract the summary efficiently?
  Extend the filter scheme used in commercial hash joins
 Conventional single-filter scheme:
  Represents all build join keys
  Filters out probe tuples that have no matches
[Figure: the filter is constructed while partitioning the build relation and tested while partitioning the probe relation]

@ Carnegie Mellon Databases Inspector Joins 16 Background: Bloom Filter
 A bit vector
 A key is hashed d times (e.g. d=3) and represented by d bits
 Construct: for every build join key, set its 3 bits in the vector
 Test: given a probe join key, check whether all its 3 bits are 1
  Discard the tuple if any bit is 0
  May have false positives
[Figure: a key hashed into the filter at bit positions Bit0 = H0(key), Bit1 = H1(key), Bit2 = H2(key)]
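A minimal Bloom filter sketch matching the slide (d = 3 hash functions); the multiplicative hash constants and the class interface are illustrative assumptions.

```cpp
#include <cstdint>
#include <vector>

class BloomFilter {
public:
    explicit BloomFilter(size_t nbits) : bits_(nbits, false) {}

    void insert(uint32_t key) {                    // Construct: set the key's 3 bits
        for (int i = 0; i < 3; ++i) bits_[pos(key, i)] = true;
    }
    bool may_contain(uint32_t key) const {         // Test: all 3 bits must be 1
        for (int i = 0; i < 3; ++i)
            if (!bits_[pos(key, i)]) return false; // a 0 bit -> definitely no match
        return true;                               // all 1s -> match or false positive
    }
private:
    size_t pos(uint32_t key, int i) const {
        static const uint32_t seed[3] = {0x9E3779B1u, 0x85EBCA77u, 0xC2B2AE3Du};
        return (static_cast<uint64_t>(key) * seed[i]) % bits_.size();
    }
    std::vector<bool> bits_;
};
```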

@ Carnegie Mellon Databases Inspector Joins 17 Multi-Filter Scheme
 Single filter: relates a probe tuple to the entire build relation
 Our goal: relate a probe tuple to individual sub-partitions
  Construct a filter for every sub-partition
  Replace a single large filter with multiple small filters
[Figure: a single filter over the whole build relation vs. one small filter per sub-partition Sub(i,j) of each partition]
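A minimal sketch of constructing one small Bloom filter per sub-partition while the build relation is partitioned; tuples_per_sub_partition stands in for the "K contiguous pages" of the paper, and all names here are illustrative assumptions building on the sketches above.

```cpp
struct PartitionFilters {
    std::vector<BloomFilter> sub_filters;          // one small filter per sub-partition
};

std::vector<PartitionFilters>
build_multi_filters(const std::vector<std::vector<Tuple>>& build_parts,
                    size_t tuples_per_sub_partition, size_t bits_per_filter) {
    std::vector<PartitionFilters> filters(build_parts.size());
    for (size_t p = 0; p < build_parts.size(); ++p) {
        size_t nsub = (build_parts[p].size() + tuples_per_sub_partition - 1)
                      / tuples_per_sub_partition;
        filters[p].sub_filters.assign(nsub, BloomFilter(bits_per_filter));
        for (size_t i = 0; i < build_parts[p].size(); ++i) {
            // A sub-partition is a contiguous run of build tuples; each build
            // join key is inserted into its own sub-partition's filter.
            size_t sub = i / tuples_per_sub_partition;
            filters[p].sub_filters[sub].insert(build_parts[p][i].key);
        }
    }
    return filters;
}
```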

@ Carnegie Mellon Databases Inspector Joins 18 Testing Multi-Filters
 When partitioning the probe relation:
  Test each probe tuple against all the filters of its partition
  This tells which sub-partition(s) the tuple may match
  Store the summary of matching relationships in the intermediate partitions
[Figure: each probe tuple is tested against the multi-filter of its partition during I/O partitioning]
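A minimal sketch of this straightforward (filter-by-filter) test, which produces the per-tuple summary; it reuses the types from the previous sketch and is the layout the next slide shows to be cache-unfriendly.

```cpp
std::vector<uint16_t> test_multi_filter(const PartitionFilters& pf, uint32_t key) {
    std::vector<uint16_t> matching_subs;           // summary stored with the probe tuple
    for (uint16_t s = 0; s < pf.sub_filters.size(); ++s)
        if (pf.sub_filters[s].may_contain(key))    // touches each filter separately
            matching_subs.push_back(s);
    return matching_subs;                          // empty -> tuple has no match, discard
}
```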

@ Carnegie Mellon Databases Inspector Joins 19 Minimizing Cache Misses for Testing Filters
 Single-filter scheme:
  Compute 3 bit positions
  Test 3 bits
 Multi-filter scheme, with S sub-partitions per partition:
  Compute 3 bit positions
  Test the same 3 bit positions in every filter, altogether 3*S bits
  May cause 3*S cache misses!
[Figure: a probe tuple tested against all S filters of its partition]

@ Carnegie Mellon Databases Inspector Joins 20 Vertical Filters for Testing
 Bits at the same position across all S filters are contiguous in memory
  3 cache misses instead of 3*S cache misses!
 Horizontal-to-vertical conversion after partitioning the build relation
  Very small overhead in practice
[Figure: the S filters stored vertically, so the bits for one position are contiguous in memory when testing]
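A minimal sketch of the vertical layout: for each bit position, the bits of all S sub-partition filters are packed together, so testing one probe key touches only 3 bit columns instead of up to 3*S scattered locations. Assuming S <= 64 so one 64-bit word holds a whole column, and reusing the illustrative hash constants from the Bloom filter sketch.

```cpp
#include <cstdint>
#include <vector>

struct VerticalFilters {
    size_t nbits;                        // bits per (logical) filter
    std::vector<uint64_t> columns;       // columns[pos]: bit s = filter s at position pos

    // Test a probe key against all S filters at once; returns a bitmap of the
    // sub-partitions the key may match (bit s set -> sub-partition s may match).
    uint64_t test(uint32_t key) const {
        uint64_t match = ~0ULL;                        // start with "all sub-partitions"
        static const uint32_t seed[3] = {0x9E3779B1u, 0x85EBCA77u, 0xC2B2AE3Du};
        for (int i = 0; i < 3; ++i) {
            size_t pos = (static_cast<uint64_t>(key) * seed[i]) % nbits;
            match &= columns[pos];                     // one contiguous read per position
        }
        return match;
    }
};
```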

@ Carnegie Mellon Databases Inspector Joins 21 More Details in Paper  Moderate memory space requirement for filters  Summary information representation in intermediate partitions  Preprocessing for cache-stationary join phase  Prefetching for improving efficiency and robustness

@ Carnegie Mellon Databases Inspector Joins 22 Outline  Motivation  Previous hash join algorithms  Hash join performance on SMP systems  Inspector join  Experimental results  Conclusions

@ Carnegie Mellon Databases Inspector Joins 23 Experimental Setup
 Relation schema: 4-byte join attribute + fixed-length payload
  No selection, no projection
 50MB memory per CPU available for the join phase
 The same join algorithm runs on every CPU, joining different partitions
 Detailed cycle-by-cycle simulations
  A shared-bus SMP system with 1.5GHz processors
  Memory hierarchy based on the Itanium 2 processor

@ Carnegie Mellon Databases Inspector Joins 24 Partition Phase Wall-Clock Time
 I/O partitioning can take advantage of multiple CPUs
  Cut input relations into equal-sized chunks
  Partition one chunk on every CPU
  Concatenate outputs from all CPUs
 Enhanced cache partitioning: cache partitioning + advanced prefetching
 Inspection incurs very small overhead
[Chart: partition phase wall-clock time vs. number of CPUs for GRACE, cache prefetching, cache partitioning, enhanced cache partitioning, and inspector join; workload: 500MB joins 2GB, 100B tuples, 4B keys, 50% of probe tuples have no matches, each build tuple matches 2 probe tuples]

@ Carnegie Mellon Databases Inspector Joins 25 Join Phase Aggregate Time
 Inspector join achieves significantly better performance when 8 or more CPUs are used
  X speedups over cache prefetching
  X speedups over enhanced cache partitioning
[Chart: join phase aggregate time on all CPUs vs. number of CPUs for GRACE, cache prefetching, cache partitioning, enhanced cache partitioning, and inspector join; workload: 500MB joins 2GB, 100B tuples, 4B keys, 50% of probe tuples have no matches, each build tuple matches 2 probe tuples]

@ Carnegie Mellon Databases Inspector Joins 26 Results on Choosing Suitable Join Phase
 Case #1: a large number of duplicate build join keys
  Choose enhanced cache partitioning when a probe tuple on average matches 4 or more sub-partitions
 Case #2: nearly sorted input relations
  Surprisingly, the cache-stationary join is still very good
[Figure: I/O partitioning with inspection uses the extracted information to decide among cache-stationary join, cache partitioning, cache prefetching, and simple hash join]

@ Carnegie Mellon Databases Inspector Joins 27 Conclusions
 Exploit the multi-pass structure of hash joins for higher-quality information about the data
 Achieve significantly better cache performance
  1.6X speedups over previous cache-friendly algorithms when 8 or more CPUs are used
 Choose the most suitable algorithm for special input cases
 The idea may be applicable to other multi-pass algorithms

@ Carnegie Mellon Databases Inspector Joins 28 Thank You !

@ Carnegie Mellon Databases Inspector Joins 29 Partition Phase Wall-Clock Time
 I/O partitioning can take advantage of multiple CPUs
  Cut input relations into equal-sized chunks
  Partition one chunk on every CPU
  Concatenate outputs from all CPUs
 Inspection incurs very small overhead
[Chart: partition phase wall-clock time vs. number of CPUs for GRACE, cache prefetching, cache partitioning, and inspector join; workload: 500MB joins 2GB, 100B tuples, 4B keys, 50% of probe tuples have no matches, each build tuple matches 2 probe tuples]

@ Carnegie Mellon Databases Inspector Joins 30 Join Phase Aggregate Time
 Inspector join achieves significantly better performance when 8 or more CPUs are used
  X speedups over cache prefetching
  X speedups over enhanced cache partitioning
[Chart: join phase aggregate time vs. number of CPUs for GRACE, cache prefetching, cache partitioning, and inspector join; workload: 500MB joins 2GB, 100B tuples, 4B keys, 50% of probe tuples have no matches, each build tuple matches 2 probe tuples]

@ Carnegie Mellon Databases Inspector Joins 31 CPU-Cache-Friendly Hash Joins
 Recent studies focus on CPU cache performance
  I/O partitioning gives good I/O performance
  Random memory accesses cause poor CPU cache performance
 Cache partitioning [Shatdal et al.'94] [Boncz et al.'99] [Manegold et al.'00]
  Recursively produce cache-sized partitions from memory-sized partitions
  Avoid cache misses during the join phase
  Pay a re-partitioning cost
 Cache prefetching [Chen et al.'04]
  Exploit memory system parallelism
  Use prefetches to overlap multiple cache misses with computation
[Figure: build and probe tuples accessing an in-memory hash table]

@ Carnegie Mellon Databases Inspector Joins 32 Example Special Input Cases
 Example case #1: a large number of duplicate build join keys
  Count the average number of sub-partitions a probe tuple matches
  The tuple must be checked against all possible sub-partitions
  If this number is too large, the cache-stationary join works poorly
 Example case #2: nearly sorted input relations
  A merge-based join phase might be better?
[Figure: a probe tuple matching multiple sub-partitions of a build partition]

@ Carnegie Mellon Databases Inspector Joins 33 Varying Number of Duplicates per Build Join Key
 Join phase aggregate performance
 Choose enhanced cache partitioning when a probe tuple on average matches 4 or more sub-partitions

@ Carnegie Mellon Databases Inspector Joins 34 Nearly Sorted Cases
 Sort both input relations, then randomly move 0%-5% of the tuples
 Join phase aggregate performance
  Surprisingly, the cache-stationary join is very good
  Even better than merge join when over 1% of the tuples are out of order

@ Carnegie Mellon Databases Inspector Joins 35 Analyzing Nearly Sorted Case
 Partitions are also nearly sorted
 Probe tuples matching a sub-partition are almost contiguous
  Memory behavior similar to merge join
  No cost for sorting out-of-order tuples
[Figure: a nearly sorted probe partition whose tuples match contiguous regions of the build partition's sub-partitions]