Set Containment Joins: The Good, The Bad and The Ugly Karthikeyan Ramasamy Jointly With Jignesh Patel, Jeffrey F. Naughton and Raghav Kaushik.


Introduction: Most data mining today takes place outside of any DBMS. This is unfortunate, because using a DBMS offers many potential advantages: scalability, flexibility, and consistency.

Why don’t people use an RDBMS? Too slow. Difficult to express data mining algorithms in SQL. Potential for improvement: set-valued attributes.

Relational DBMS 101: Data model: tables with rows and columns; each individual entry is an atomic element (an integer, a float, a character string). New extension: set-valued attributes, where individual entries of tables can be sets.

Sets and Data Mining: Canonical example: customers and their transactions. No sets: two tables, customers(cid, name, address, …) and transactions(cid, product, date, …). Sets: one table, customers(cid, name, address, {trans}, …).

Open questions: How do you store the sets? How do you implement operations on these set-valued attributes? Do they really help move data mining “into SQL”?

Set Containment Joins: Consider two relations R and S, each with a set-valued attribute. The containment join computes every pair of tuples, one from R and one from S, such that the set in the R tuple is a subset of (or equal to) the set in the S tuple.

Set Containment Joins (Cont.): Example: STUDENT (sid, {courses-taken}) and COURSES (cid, {prereqs}). Find the courses that each student is eligible to take, that is, the courses whose {prereqs} set is contained in the student's {courses-taken} set.
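The student/course example can be sketched as a naive nested-loops containment join. This is only an illustration of the join's semantics, not one of the algorithms from the talk; the function name `containment_join` and the sample rows are hypothetical.

```python
def containment_join(r_rows, s_rows):
    """Return (r_id, s_id) pairs where r's set is a subset of (or equal to) s's set.

    r_rows and s_rows are lists of (id, set) pairs. Every R tuple is compared
    with every S tuple, so the cost is |R| * |S| subset tests.
    """
    return [
        (r_id, s_id)
        for r_id, r_set in r_rows
        for s_id, s_set in s_rows
        if r_set <= s_set  # Python's subset-or-equal test on sets
    ]


# Hypothetical data: COURSES(cid, {prereqs}) and STUDENT(sid, {courses-taken}).
courses = [
    ("db2", {"db1"}),
    ("os", {"c", "arch"}),
    ("intro", set()),  # no prerequisites
]
students = [
    ("s1", {"db1", "c"}),
    ("s2", {"db1", "c", "arch"}),
]

# A student is eligible for a course when prereqs is contained in courses-taken.
eligible = containment_join(courses, students)
```

Here COURSES plays the role of R and STUDENT the role of S, since the prerequisite set must be contained in the courses-taken set.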

Storage Representations: Nested internal: the set elements are grouped and stored along with the rest of the attributes in the tuple. Unnested external: set instances are unnested and stored in a separate relation; a join is required to assemble the elements.
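The two representations can be illustrated with plain Python data structures. This is a minimal sketch; the helper names `unnest` and `nest` are hypothetical, and real storage layouts involve pages and tuple headers that are not modeled here.

```python
from collections import defaultdict


def unnest(nested_rows):
    """Nested internal -> unnested external: one (id, element) row per set element."""
    return [
        (row_id, elem)
        for row_id, elems in nested_rows
        for elem in sorted(elems)
    ]


def nest(flat_rows):
    """Unnested external -> nested: regroup the element rows by id."""
    groups = defaultdict(set)
    for row_id, elem in flat_rows:
        groups[row_id].add(elem)
    return dict(groups)


nested = [(1, {10, 20}), (2, {30})]  # each tuple carries its set inline
flat = unnest(nested)                # separate relation of (id, element) rows
regrouped = nest(flat)               # reassembling the sets needs a grouping pass
```

One consequence of the unnested form worth noting: a tuple whose set is empty produces no element rows, so it has to be tracked separately if empty sets matter.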

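Nested Internal Representation: [Figure: a tuple with attributes A1, A2, A3; the set-valued attribute is stored inline with the fields Length, Cardinality, Element 1, Element 2, ..., Element N.]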

Unnested External - Good Old SQL:
SELECT R_S.i, S_S.j
FROM R_S, S_S
WHERE R_S.b = S_S.d
GROUP BY R_S.i, S_S.j
HAVING COUNT(*) = (SELECT COUNT(*) FROM R_S NR_S WHERE NR_S.i = R_S.i)

SQL Approach - Pros and Cons: Pros: easy to add to an existing DBMS. Cons: requires extra joins for projecting other attributes; the nested query must be evaluated for each group; the number of groups can be as large as |R|*|S|.

SQL Approach - Mitigation: Magic Sets Rewriting
Count Query:
INSERT INTO T1 (i, count_i)
SELECT R_S.i, COUNT(*) FROM R_S GROUP BY R_S.i
Candidate Query:
INSERT INTO T2 (i, j, count_ij)
SELECT R_S.i, S_S.j, COUNT(*) FROM R_S, S_S WHERE R_S.b = S_S.d GROUP BY R_S.i, S_S.j
Verify Query:
SELECT T2.i, T2.j FROM T2, T1 WHERE T2.i = T1.i AND T2.count_ij = T1.count_i
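The count, candidate, and verify queries can be tried end to end with Python's built-in sqlite3 module. This is a sketch under assumptions: the column types, the sample rows, and the use of SQLite are illustrative choices, and the unnested relations are assumed to hold no duplicate (id, element) rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.executescript("""
-- Unnested external representation: one row per set element.
CREATE TABLE R_S (i INTEGER, b INTEGER);
CREATE TABLE S_S (j INTEGER, d INTEGER);
CREATE TABLE T1 (i INTEGER, count_i INTEGER);
CREATE TABLE T2 (i INTEGER, j INTEGER, count_ij INTEGER);

-- Hypothetical data: R sets 1={1,2,3}, 2={4,5}; S sets 10={1,2,3,6}, 20={4,7}.
INSERT INTO R_S VALUES (1,1),(1,2),(1,3),(2,4),(2,5);
INSERT INTO S_S VALUES (10,1),(10,2),(10,3),(10,6),(20,4),(20,7);

-- Count query: cardinality of every R set.
INSERT INTO T1 (i, count_i)
SELECT R_S.i, COUNT(*) FROM R_S GROUP BY R_S.i;

-- Candidate query: number of elements each (R set, S set) pair shares.
INSERT INTO T2 (i, j, count_ij)
SELECT R_S.i, S_S.j, COUNT(*)
FROM R_S, S_S
WHERE R_S.b = S_S.d
GROUP BY R_S.i, S_S.j;
""")

# Verify query: the R set is contained when all of its elements were matched.
result = cur.execute("""
SELECT T2.i, T2.j
FROM T2, T1
WHERE T2.i = T1.i AND T2.count_ij = T1.count_i
ORDER BY T2.i, T2.j
""").fetchall()
conn.close()
```

With this data only R set 1 is contained in S set 10; R set 2 shares one element with S set 20 but has two elements, so the verify query rejects it. Empty R sets have no rows in R_S and therefore never appear in the output of this formulation.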

Signature Nested Loops (Sig-NL): Applicable to the nested internal representation. Signatures: signatures are bit vectors for approximating sets; the approximation leads to “false drops”. Three phases of the algorithm: signature construction, comparing signatures for containment, and verification of actual subsets.

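Signature Nested Loops (Contd): Signature construction phase: take a bit vector, apply a hash function M to each element of the set, and set the corresponding bit. Comparison phase: a necessary condition for the R set to be a subset of the S set is that every bit set in sig(r) is also set in sig(s), that is, sig(r) AND sig(s) = sig(r), or equivalently sig(r) AND NOT sig(s) = 0.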
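The three Sig-NL phases can be sketched as follows. The signature length of 64 bits, the use of Python's built-in `hash` as the hash function, and all function names are assumptions for illustration; the talk does not specify them.

```python
SIG_BITS = 64  # assumed signature length, not a value from the talk


def signature(elements, bits=SIG_BITS):
    """Construction phase: hash each element to a bit position and set that bit."""
    sig = 0
    for elem in elements:
        sig |= 1 << (hash(elem) % bits)
    return sig


def may_contain(sig_r, sig_s):
    """Comparison phase: necessary (not sufficient) condition for r being a subset of s."""
    return (sig_r & ~sig_s) == 0


def sig_nl_join(r_rows, s_rows, bits=SIG_BITS):
    """Signature nested loops over lists of (id, set) pairs."""
    r_sigs = [(rid, rset, signature(rset, bits)) for rid, rset in r_rows]
    s_sigs = [(sid, sset, signature(sset, bits)) for sid, sset in s_rows]
    out = []
    for rid, rset, rsig in r_sigs:
        for sid, sset, ssig in s_sigs:
            # The cheap signature test filters most pairs; the real subset
            # test (verification phase) removes the remaining false drops.
            if may_contain(rsig, ssig) and rset <= sset:
                out.append((rid, sid))
    return out


r_rows = [("r1", {1, 2, 3}), ("r2", {4, 5})]
s_rows = [("s1", {1, 2, 3, 6}), ("s2", {4, 7})]
pairs = sig_nl_join(r_rows, s_rows)

# A false drop with a deliberately tiny 4-bit signature: 1 and 5 hash to the
# same bit, so the signatures match although {5} is not a subset of {1}.
false_drop = may_contain(signature({5}, 4), signature({1}, 4))
```

The false-drop case is why the verification phase cannot be skipped: a signature match only says containment is possible.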

Partition Algorithms: Reduce join execution time by partitioning the problem into smaller sub-problems. A partitioning function is used to partition the problem. An ideal partitioning function ensures that each tuple r of R falls in exactly one of the partitions R_i, each tuple s of S falls in exactly one of the partitions S_i, and the join is accomplished by joining only R_i with S_i.

Partitioned Set Join Algorithm: Three phases of the algorithm: partitioning phase, joining phase, verification phase.

Partition Set Join Algorithm (PSJ): [Figure: R({1,2,3}) is shown as (3, , OID_R) and S({1,2,3,6}) as (4, , OID_S); the two feed into a Join.]

PSJ – Joining Phase: Any efficient algorithm for joining signatures can be used. Signature-based partition algorithm: partition the R signatures based on a randomly chosen bit that is set; probe with each S signature multiple times, once for each bit that is set. Outputs the resulting object id pairs (OID_R, OID_S).
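The signature-based joining phase can be sketched in memory as below. This is a simplification under stated assumptions: it keeps one partition per bit position rather than mapping bits to a bounded number of disk partitions, it handles all-zero signatures (empty sets) in a separate list, and every name in it is hypothetical. The pairs it returns are candidates that would still go through the verification phase.

```python
import random


def set_bits(sig):
    """Yield the positions of the 1-bits in an integer signature."""
    pos = 0
    while sig:
        if sig & 1:
            yield pos
        sig >>= 1
        pos += 1


def psj_join_signatures(r_sigs, s_sigs, rng=None):
    """Join lists of (signature, oid); return candidate (oid_r, oid_s) pairs."""
    rng = rng or random.Random(0)
    partitions = {}  # bit position -> list of (signature, oid_r)
    empty = []       # assumption: all-zero signatures are kept aside
    for sig, oid in r_sigs:
        bits = list(set_bits(sig))
        if bits:
            # Each R signature goes to exactly one partition.
            partitions.setdefault(rng.choice(bits), []).append((sig, oid))
        else:
            empty.append(oid)

    out = []
    for s_sig, s_oid in s_sigs:
        # An all-zero R signature passes the containment test against anything.
        out.extend((r_oid, s_oid) for r_oid in empty)
        # Probe once per set bit: if sig(r) is covered by sig(s), the bit
        # chosen for r is necessarily one of the bits set in sig(s).
        for bit in set_bits(s_sig):
            for r_sig, r_oid in partitions.get(bit, []):
                if (r_sig & ~s_sig) == 0:
                    out.append((r_oid, s_oid))
    return out


r_sigs = [(0b0111, "r1"), (0b1000, "r2"), (0, "r0")]
s_sigs = [(0b100111, "s1"), (0b1001, "s2")]
candidates = psj_join_signatures(r_sigs, s_sigs)
```

Because every R signature lives in a single partition, a matching pair is found exactly once, whichever bit the random choice picks.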

PSJ – Pros and Cons: Pros: easy to implement (similar to hash joins); easily parallelizable. Issues: determining the number of partitions; determining the signature size.

PSJ – Number of Partitions: A large number of partitions leads to large overhead. A smaller number of partitions leads to higher join cost. The number is determined using a detailed analytical model.

PSJ – Signature Size: Inversely related to the number of partitions, which creates a cyclic dependency. Solve for both simultaneously, using the bisection method.

Set Distributions: Many degrees of freedom. Each degree can follow a distribution of its own. Huge distribution space!

Classifying Set Distributions: [Figure: a 2x2 grid over two axes, Relation Cardinality (Small, Large) and Set Cardinality (Small, Large), giving four classes: (Small, Small), (Large, Small), (Small, Large), (Large, Large).]

Performance – Settings: Implemented in a research version of Paradise using the extensible operator framework and Set ADT. Intel Pentium 333 MHz, Solaris 2.6. Main memory MB. Buffer pool size: 32 MB. Used raw disks of size 4 GB with I/O bandwidth of 6 MB/sec. Each experiment was run against a cold database. Synthetic data set.

Varying Relation Cardinality Set Cardinality of 20

Cost Breakdown of Sig-NL Set Cardinality of 20

Cost Breakdown of PSJ Set Cardinality of 20

Effect of Signature Size Relation Cardinality of and Set Cardinality of 20

Effect of Increasing Partitions Relation Cardinality of and Set Cardinality of 120

Performance Space: [Figure: a 2x2 grid over Relation Cardinality (Small, Large) and Set Cardinality (Small, Large); the cell labels recovered from the slide are "Sig-NL, PSJ-1", "PSJ", and "PSJ-1, PSJ".]

Conclusion: Developed a partition-based algorithm for set containment joins. The performance study shows that PSJ works well on most data sets. The advantages of PSJ: simple, effective, and easily parallelizable.

Future Work: The algorithm can easily be extended to set intersection joins. Investigate the applicability of the nested algorithms to unnested external representations.