Set Containment Joins: The Good, The Bad and The Ugly. Karthikeyan Ramasamy, jointly with Jignesh Patel, Jeffrey F. Naughton and Raghav Kaushik.

Introduction: Most data mining today takes place outside of any DBMS. This is unfortunate, because using a DBMS offers many potential advantages: scalability, flexibility, and consistency.

Why don’t people use an RDBMS? It is too slow, and data mining algorithms are difficult to express in SQL. Potential for improvement: set-valued attributes.

Relational DBMS 101: Data model: tables with rows and columns, where each individual entry is an atomic element (an integer, a float, a character string). New extension: set-valued attributes, where individual entries of tables can be sets.

Sets and Data Mining: Canonical example: customers and their transactions. No sets: two tables, customers(cid, name, address, …) and transactions(cid, product, date, …). Sets: one table, customers(cid, name, address, {trans}, …).

Open questions: How do you store the sets? How do you implement operations on these set-valued attributes? Do they really help move data mining “into SQL”?

Set Containment Joins: Consider two relations R and S, each with a set-valued attribute. The set containment join computes every pair of tuples (r, s), with r from R and s from S, such that the set in r is contained in or equal to the set in s.

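Set Containment Joins (Cont.): Example: STUDENT (sid, {courses-taken}) and COURSES (cid, {prereqs}). Find the set of courses that each student is eligible to take, that is, the courses whose {prereqs} set is contained in the student's {courses-taken} set.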
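The student/course example can be sketched as a naive nested-loops containment join. This is only an illustration of the join's definition, not one of the algorithms from the talk; the function name and the sample data are made up for the example.

```python
def containment_join(r, s):
    """Return (r_key, s_key) pairs where r's set is a subset of (or equal to) s's set."""
    return [
        (r_key, s_key)
        for r_key, r_set in r.items()
        for s_key, s_set in s.items()
        if r_set <= s_set  # Python's subset-or-equal test on sets
    ]

# Hypothetical sample data (not from the slides).
courses = {"DB2": {"DB1"}, "OS2": {"OS1", "ARCH"}, "INTRO": set()}
students = {"alice": {"DB1", "OS1"}, "bob": {"OS1", "ARCH", "DB1"}}

# A student is eligible for a course when prereqs is contained in courses-taken.
eligible = containment_join(courses, students)
```

Every pair of tuples is compared here, so the cost grows with |R|*|S| set comparisons; the algorithms on the following slides aim to avoid most of that work.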

Storage Representations: Nested internal: the set elements are grouped and stored along with the rest of the attributes in the tuple. Unnested external: set instances are unnested and stored in a separate relation, which requires a join to reassemble the elements.
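The difference between the two representations can be sketched as a pair of conversions. This is a minimal sketch, assuming a relation is modeled as a Python dict from a tuple key to its set; the names `unnest` and `nest` are illustrative and do not come from the slides.

```python
from collections import defaultdict

def unnest(nested):
    """Nested internal -> unnested external: one (key, element) row per set element."""
    return [(key, elem) for key, elems in nested.items() for elem in sorted(elems)]

def nest(rows):
    """Unnested external -> nested: group rows by key to reassemble each set."""
    grouped = defaultdict(set)
    for key, elem in rows:
        grouped[key].add(elem)
    return dict(grouped)

# Hypothetical sample relation.
nested_r = {1: {10, 20}, 2: {30}}
rows_r = unnest(nested_r)
```

Note that in this sketch a tuple with an empty set produces no rows when unnested, so it cannot be recovered by `nest`; an external representation needs some other way to record empty sets.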

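Nested Internal Representation: [Figure: tuple layout. The tuple holds the attributes A1, A2, A3, and the set is stored inline with a Length field, a Cardinality field, and the elements Element 1, Element 2, ..., Element N.]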

Unnested External - Good Old SQL:
    SELECT R_S.i, S_S.j
    FROM R_S, S_S
    WHERE R_S.b = S_S.d
    GROUP BY R_S.i, S_S.j
    HAVING COUNT(*) = (SELECT COUNT(*) FROM R_S AS NR_S WHERE NR_S.i = R_S.i)

SQL Approach - Pros and Cons: Pros: easy to add to an existing DBMS. Cons: requires extra joins to project other attributes; the nested query must be evaluated for each group; the number of groups can be as large as |R|*|S|.

SQL Approach - Mitigation: Magic Sets Rewriting.
Count Query:
    INSERT INTO T1 (i, count_i)
    SELECT R_S.i, COUNT(*) FROM R_S GROUP BY R_S.i
Candidate Query:
    INSERT INTO T2 (i, j, count_ij)
    SELECT R_S.i, S_S.j, COUNT(*) FROM R_S, S_S
    WHERE R_S.b = S_S.d GROUP BY R_S.i, S_S.j
Verify Query:
    SELECT T2.i, T2.j FROM T2, T1
    WHERE T2.i = T1.i AND T2.count_ij = T1.count_i
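The count, candidate, and verify queries can be tried end to end with Python's built-in `sqlite3` module. This is a sketch under assumptions: the table and column names are the slide's names written with underscores, the column types and the sample rows are invented, and each set is assumed to contain no duplicate elements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE R_S (i INTEGER, b INTEGER);   -- unnested R: one row per set element
    CREATE TABLE S_S (j INTEGER, d INTEGER);   -- unnested S: one row per set element
    CREATE TABLE T1 (i INTEGER, count_i INTEGER);
    CREATE TABLE T2 (i INTEGER, j INTEGER, count_ij INTEGER);
""")

# Hypothetical data: R has sets 1 -> {1,2,3} and 2 -> {1,7};
# S has sets 10 -> {1,2,3,6} and 20 -> {1,2}.
cur.executemany("INSERT INTO R_S VALUES (?, ?)",
                [(1, 1), (1, 2), (1, 3), (2, 1), (2, 7)])
cur.executemany("INSERT INTO S_S VALUES (?, ?)",
                [(10, 1), (10, 2), (10, 3), (10, 6), (20, 1), (20, 2)])

# Count query: cardinality of every R set, computed once.
cur.execute("INSERT INTO T1 (i, count_i) SELECT i, COUNT(*) FROM R_S GROUP BY i")

# Candidate query: number of shared elements for every overlapping (i, j) pair.
cur.execute("""
    INSERT INTO T2 (i, j, count_ij)
    SELECT R_S.i, S_S.j, COUNT(*) FROM R_S, S_S
    WHERE R_S.b = S_S.d
    GROUP BY R_S.i, S_S.j
""")

# Verify query: the R set is contained when all of its elements were matched.
result = cur.execute("""
    SELECT T2.i, T2.j FROM T2, T1
    WHERE T2.i = T1.i AND T2.count_ij = T1.count_i
    ORDER BY T2.i, T2.j
""").fetchall()
conn.close()
```

With this data only R set 1 is contained in S set 10. Because the candidate query only produces pairs that share at least one element, an R tuple with an empty set would never appear in the result of this formulation.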

Signature Nested Loops (Sig-NL): Applicable to the nested internal representation. Signatures: signatures are bit vectors for approximating sets; the approximation leads to “false drops”. Three phases of the algorithm: signature construction, comparing signatures for containment, and verification of actual subsets.

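Signature Nested Loops (Contd): Signature construction phase: take a bit vector, apply a hash function M to each element of the set, and set the corresponding bit. Comparison phase: a necessary condition for r to be a subset of s is that every bit set in sig(r) is also set in sig(s), that is, sig(r) AND sig(s) = sig(r), and that |r| <= |s|.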
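The three phases can be sketched as follows. This is an illustrative sketch, not the implementation from the talk: signatures are Python integers used as bit vectors, set elements are assumed to be non-negative integers, and the hash function (element modulo signature length) is an assumption chosen for readability.

```python
def make_signature(elements, num_bits):
    """Build a bit-vector signature (as an int) by hashing each element to one bit."""
    sig = 0
    for elem in elements:
        sig |= 1 << (elem % num_bits)  # assumed hash: element mod signature length
    return sig

def may_contain(sig_r, card_r, sig_s, card_s):
    """Necessary (not sufficient) test for r being a subset of s."""
    return card_r <= card_s and (sig_r & sig_s) == sig_r

def sig_nested_loops(r, s, num_bits=8):
    """Signature nested loops: filter pairs by signature, then verify on the real sets."""
    r_sigs = [(k, v, make_signature(v, num_bits)) for k, v in r.items()]
    s_sigs = [(k, v, make_signature(v, num_bits)) for k, v in s.items()]
    out = []
    for r_key, r_set, r_sig in r_sigs:
        for s_key, s_set, s_sig in s_sigs:
            # The signature test may pass for non-subsets ("false drops"),
            # so the actual subset check is still required.
            if may_contain(r_sig, len(r_set), s_sig, len(s_set)) and r_set <= s_set:
                out.append((r_key, s_key))
    return out

# Hypothetical sample data.
r = {"r1": {1, 2, 3}, "r2": {9}}
s = {"s1": {1, 2, 3, 6}, "s2": {1, 4}}
joined = sig_nested_loops(r, s)
```

In this data, element 9 hashes to the same bit as element 1, so the signature of {9} passes the comparison against {1, 2, 3, 6} even though it is not a subset; the verification step rejects that false drop.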

Partition Algorithms: Reduce join execution time by partitioning the problem into smaller sub-problems. A partitioning function is used to partition the problem. An ideal partitioning function ensures that each tuple r of R falls in one of the partitions R_i, each tuple s of S falls in one of the partitions S_i, and the join is accomplished by joining only R_i with S_i.

Partitioned Set Join Algorithm: Three phases of the algorithm: partitioning phase, joining phase, and verification phase.

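Partitioned Set Join Algorithm (PSJ): [Figure: example. The R tuple with set {1,2,3} is reduced to (3, 0100001, OID_R) and the S tuple with set {1,2,3,6} is reduced to (4, 0100101, OID_S), that is, (set cardinality, signature, object id); the reduced tuples are then joined.]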

PSJ – Joining Phase: Any efficient algorithm for joining signatures can be used. Signature-based partition algorithm: partition the R signatures on a randomly chosen bit that is set; probe with each S signature multiple times, once for each bit set in it; output the resulting object-id pairs (OID_R, OID_S).
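The signature-based partitioning described above can be sketched in a few lines. This is a simplified sketch, not the talk's implementation: it assumes one partition per signature bit position (the real algorithm chooses the number of partitions separately), keeps everything in memory, and handles all-zero R signatures in a separate list, which is a choice made for this example.

```python
import random

def psj_join(r_sigs, s_sigs, num_bits, rng=None):
    """r_sigs, s_sigs: lists of (oid, signature int). Returns candidate (oid_r, oid_s) pairs."""
    rng = rng or random.Random(0)
    partitions = {bit: [] for bit in range(num_bits)}
    empty_r = []  # R signatures with no bit set pass the test against every S signature
    for oid, sig in r_sigs:
        set_bits = [b for b in range(num_bits) if (sig >> b) & 1]
        if set_bits:
            # Each R signature is stored exactly once, under one of its set bits.
            partitions[rng.choice(set_bits)].append((oid, sig))
        else:
            empty_r.append(oid)

    pairs = []
    for oid_s, sig_s in s_sigs:
        pairs.extend((oid_r, oid_s) for oid_r in empty_r)
        # Each S signature probes every partition whose bit is set in it.
        for b in range(num_bits):
            if (sig_s >> b) & 1:
                for oid_r, sig_r in partitions[b]:
                    if (sig_r & sig_s) == sig_r:
                        pairs.append((oid_r, oid_s))
    return pairs

# Hypothetical signatures with 4 bits.
r_sigs = [("r1", 0b0110), ("r2", 0b1000), ("r0", 0b0000)]
s_sigs = [("s1", 0b0111), ("s2", 0b1000)]
candidates = psj_join(r_sigs, s_sigs, num_bits=4)
```

If sig(r) is contained in sig(s), the bit chosen for r is also set in sig(s), so the pair is found in exactly one partition and is reported once regardless of the random choice. The pairs are still only candidates and must go through the verification phase.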

PSJ – Pros and Cons: Pros: easy to implement (similar to hash joins) and easily parallelizable. Issues: determining the number of partitions and determining the signature size.

PSJ – Number of Partitions: A large number of partitions leads to large overhead; a small number of partitions leads to higher join cost. The number of partitions is chosen using a detailed analytical model.

PSJ – Signature Size: The signature size is inversely related to the number of partitions, which creates a cyclic dependency. The two are solved for simultaneously using the bisection method.

Set Distributions: There are many degrees of freedom, and each degree can follow a distribution of its own. Huge distribution space!

Classifying Set Distributions: [Figure: 2x2 grid over relation cardinality (Small, Large) and set cardinality (Small, Large), giving four classes: (Small, Small), (Small, Large), (Large, Small), (Large, Large).]

Performance – Settings: Implementation in a research version of Paradise using the extensible operator framework and Set ADT. Intel Pentium 333 MHz, Solaris 2.6. Main memory: 128 MB. Buffer pool size: 32 MB. Raw disks of size 4 GB with I/O bandwidth of 6 MB/sec. Each experiment was run against a cold database. Synthetic data set.

Varying Relation Cardinality [Figure: set cardinality of 20.]

Cost Breakdown of Sig-NL [Figure: set cardinality of 20.]

Cost Breakdown of PSJ [Figure: set cardinality of 20.]

Effect of Signature Size [Figure: relation cardinality of 20000 and set cardinality of 20.]

Effect of Increasing Partitions [Figure: relation cardinality of 20000 and set cardinality of 120.]

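Performance Space: [Figure: 2x2 grid over relation cardinality (Small, Large) and set cardinality (Small, Large) showing which algorithms perform best in each region; cell labels include "Sig-NL, PSJ-1", "PSJ", and "PSJ-1, PSJ".]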

Conclusion: Developed a partition-based algorithm for set containment joins. The performance study shows that PSJ works well on most data sets. The advantages of PSJ are that it is simple, effective, and easily parallelizable.

Future Work: The algorithm can be easily extended to set intersection joins. Investigate the applicability of the nested algorithms to unnested external representations.
