Published by Randolf Williamson. Modified over 9 years ago.
1
Set Containment Joins: The Good, The Bad and The Ugly. Karthikeyan Ramasamy, jointly with Jignesh Patel, Jeffrey F. Naughton and Raghav Kaushik.
2
Introduction. Most data mining today takes place outside of any DBMS. This is unfortunate, because many potential advantages arise from using a DBMS: scalability, flexibility, consistency.
3
Why don’t people use an RDBMS? It is too slow, and data mining algorithms are difficult to express in SQL. Potential for improvement: set-valued attributes.
4
Relational DBMS 101. Data model: tables with rows and columns; each individual entry is an atomic element (an integer, a float, a character string). New extension: set-valued attributes, where individual entries of tables can be sets.
5
Sets and Data Mining. Canonical example: customers and their transactions. No sets: two tables, customers(cid, name, address, …) and transactions(cid, product, date, …). Sets: one table, customers(cid, name, address, {trans}, …).
6
Open questions: How do you store the sets? How do you implement operations on these set-valued attributes? Do they really help move data mining “into SQL”?
7
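Set Containment Joins. Consider two relations R and S, each with a set-valued attribute (written b in R and d in S). The containment join is defined as $R \bowtie_{b \subseteq d} S = \{(r, s) \mid r \in R,\ s \in S,\ r.b \subseteq s.d\}$. It computes the pairs of tuples, one from R and the other from S, such that the set in the R tuple is contained in or equal to the set in the S tuple.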
8
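Set Containment Joins (Cont.) Example: STUDENT(sid, {courses-taken}) and COURSES(cid, {prereqs}). Find the set of courses that each student is eligible to take, that is, the courses whose {prereqs} set is contained in the student's {courses-taken} set.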
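As a point of reference, the STUDENT/COURSES example can be sketched as a naive nested-loop containment join. This is only an illustration of the definition, not one of the algorithms from the talk; the data values and the function name `containment_join` are made up for the example.

```python
# Hypothetical toy data: course -> prerequisite set, student -> courses taken.
courses = {
    "CS101": set(),
    "CS201": {"CS101"},
    "CS301": {"CS101", "CS201"},
}
students = {
    "s1": {"CS101", "CS201", "MATH101"},
    "s2": {"CS101"},
}

def containment_join(r, s):
    """Return sorted (r_id, s_id) pairs where r's set is a subset of s's set."""
    return sorted(
        (rid, sid)
        for rid, rset in r.items()
        for sid, sset in s.items()
        if rset <= sset  # Python's subset-or-equal test on sets
    )

# Courses play the role of R (prereqs), students the role of S (courses-taken).
eligible = containment_join(courses, students)
print(eligible)
```

The sketch compares every pair of tuples, so its cost grows with |R|*|S|. Note that it reports pure containment: a course the student has already taken still appears if its prerequisites are satisfied.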
9
Storage Representations. Nested internal: the set elements are grouped and stored along with the rest of the attributes in the tuple. Unnested external: set instances are unnested and stored in a separate relation; a join is required to reassemble the elements.
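The difference between the two representations can be illustrated with a small sketch. The tuple layout, the sample rows, and the helper names `unnest` and `renest` are assumptions made for this example; they do not describe how any particular DBMS stores sets.

```python
from collections import defaultdict

# Nested internal (illustrative): the set travels with the rest of the tuple.
nested = [(1, "a", {10, 20}), (2, "b", {30})]  # (id, other attribute, set)

def unnest(rows):
    """Split nested rows into a base relation and an (id, element) relation."""
    base = [(rid, attr) for rid, attr, _ in rows]
    elems = sorted((rid, e) for rid, _, es in rows for e in es)
    return base, elems

def renest(base, elems):
    """Reassemble the sets; this grouping step plays the role of the join."""
    groups = defaultdict(set)
    for rid, e in elems:
        groups[rid].add(e)
    return [(rid, attr, groups[rid]) for rid, attr in base]

base, elems = unnest(nested)
print(elems)                 # one row per set element
print(renest(base, elems))   # the original nested rows
```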
10
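Nested Internal Representation. [Figure: a tuple with attributes A1, A2, A3; the set-valued attribute is stored inside the tuple with the fields Length, Cardinality, and Element 1, Element 2, ..., Element N.]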
11
Unnested External - Good Old SQL
SELECT R_S.i, S_S.j
FROM R_S, S_S
WHERE R_S.b = S_S.d
GROUP BY R_S.i, S_S.j
HAVING COUNT(*) = (SELECT COUNT(*) FROM R_S NR_S WHERE NR_S.i = R_S.i)
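The query on this slide can be tried on a toy instance with Python's built-in sqlite3 module. This is a minimal sketch under some assumptions: the unnested relations are taken to be R_S(i, b) and S_S(j, d), with one row per set element and no duplicate elements within a set, and the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R_S (i INTEGER, b INTEGER);  -- set id, element
    CREATE TABLE S_S (j INTEGER, d INTEGER);  -- set id, element
    -- R sets: 1 -> {1,2,3}, 2 -> {4}
    INSERT INTO R_S VALUES (1, 1), (1, 2), (1, 3), (2, 4);
    -- S sets: 10 -> {1,2,3,6}, 20 -> {1,4}
    INSERT INTO S_S VALUES (10, 1), (10, 2), (10, 3), (10, 6), (20, 1), (20, 4);
""")

# A pair (i, j) qualifies when the number of matching elements
# equals the full size of R set i.
rows = conn.execute("""
    SELECT R_S.i, S_S.j
    FROM R_S, S_S
    WHERE R_S.b = S_S.d
    GROUP BY R_S.i, S_S.j
    HAVING COUNT(*) = (SELECT COUNT(*) FROM R_S AS NR_S WHERE NR_S.i = R_S.i)
    ORDER BY R_S.i, S_S.j
""").fetchall()
print(rows)
```

Because an empty set has no rows in the unnested form, this formulation never reports pairs whose R set is empty.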
12
SQL Approach - Pros and Cons. Pros: easy to add to an existing DBMS. Cons: requires extra joins for projecting other attributes; the nested query must be evaluated for each group; the number of groups can be as large as |R|*|S|.
13
SQL Approach - Mitigation: Magic Sets Rewriting
Count Query:
INSERT INTO T1 (i, count_i)
SELECT R_S.i, COUNT(*) FROM R_S GROUP BY R_S.i
Candidate Query:
INSERT INTO T2 (i, j, count_ij)
SELECT R_S.i, S_S.j, COUNT(*) FROM R_S, S_S WHERE R_S.b = S_S.d GROUP BY R_S.i, S_S.j
Verify Query:
SELECT T2.i, T2.j FROM T2, T1 WHERE T2.i = T1.i AND T2.count_ij = T1.count_i
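The three rewritten queries can be run end to end with sqlite3 as a rough check. As before, the schemas R_S(i, b) and S_S(j, d), the column types, and the sample rows are assumptions for this sketch; the temporary tables are written T1 and T2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R_S (i INTEGER, b INTEGER);
    CREATE TABLE S_S (j INTEGER, d INTEGER);
    INSERT INTO R_S VALUES (1, 1), (1, 2), (1, 3), (2, 4);
    INSERT INTO S_S VALUES (10, 1), (10, 2), (10, 3), (10, 6), (20, 1), (20, 4);

    CREATE TABLE T1 (i INTEGER, count_i INTEGER);
    CREATE TABLE T2 (i INTEGER, j INTEGER, count_ij INTEGER);

    -- Count query: size of each R set, computed once per set.
    INSERT INTO T1 (i, count_i)
    SELECT R_S.i, COUNT(*) FROM R_S GROUP BY R_S.i;

    -- Candidate query: number of shared elements per (i, j) pair.
    INSERT INTO T2 (i, j, count_ij)
    SELECT R_S.i, S_S.j, COUNT(*)
    FROM R_S, S_S
    WHERE R_S.b = S_S.d
    GROUP BY R_S.i, S_S.j;
""")

# Verify query: keep the pairs where every element of R set i was matched.
verified = conn.execute("""
    SELECT T2.i, T2.j
    FROM T2, T1
    WHERE T2.i = T1.i AND T2.count_ij = T1.count_i
    ORDER BY T2.i, T2.j
""").fetchall()
print(verified)
```

The rewrite replaces the per-group correlated subquery with one aggregation over R_S and an ordinary join against T1.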
14
Signature Nested Loops (Sig-NL). Applicable to the nested internal representation. Signatures are bit vectors for approximating sets; the approximation leads to “false drops” (pairs that pass the signature test but fail the actual containment test). Three phases of the algorithm: signature construction, comparing signatures for containment, and verification of actual subsets.
15
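Signature Nested Loops (Contd). Signature construction phase: take a bit vector, apply a hash function M to each element of the set, and set the corresponding bit. Comparison phase: a necessary condition for $r \subseteq s$ is $\mathrm{sig}(r) \wedge \mathrm{sig}(s) = \mathrm{sig}(r)$, or equivalently $\mathrm{sig}(r) \wedge \neg\,\mathrm{sig}(s) = 0$.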
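The construction and comparison phases can be sketched with Python integers as bit vectors. This is a minimal illustration, not the implementation from the talk: the signature length `SIG_BITS` and the use of `x % bits` in place of the hash function M are assumptions chosen to keep the example short.

```python
SIG_BITS = 16  # assumed signature length for this sketch

def signature(elements, bits=SIG_BITS):
    """Set one bit per element; x % bits stands in for the hash function M."""
    sig = 0
    for x in elements:
        sig |= 1 << (x % bits)
    return sig

def may_contain(sig_r, sig_s):
    """Necessary (not sufficient) test for r being a subset of s:
    every bit set in sig_r must also be set in sig_s."""
    return (sig_r & ~sig_s) == 0

r, s = {1, 2, 3}, {1, 2, 3, 6}
t = {17, 18, 19}  # collides with r under x % 16, so it produces a false drop

print(may_contain(signature(r), signature(s)))  # signature test for r against s
print(may_contain(signature(t), signature(s)))  # signature test for t against s
print(t <= s)                                   # verification on the actual sets
```

The set `t` passes the signature test against `s` even though it is not a subset, which is why the verification phase on the actual sets is still needed.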
16
Partition Algorithms. Reduce join execution time by partitioning the problem into smaller sub-problems, using a partitioning function. An ideal partitioning function requires that each tuple r of R falls in one of the partitions R_i, each tuple s of S falls in one of the partitions S_i, and the join is accomplished by joining only R_i with S_i.
17
Partitioned Set Join Algorithm. Three phases of the algorithm: partitioning phase, joining phase, verification phase.
18
Partition Set Join Algorithm (PSJ). [Figure: R({1,2,3}) is mapped to the triple (3, 0100001, OID_R) and S({1,2,3,6}) to the triple (4, 0100101, OID_S); the triples are then joined.]
19
PSJ – Joining Phase. Any efficient algorithm for joining signatures can be used. Signature-based partition algorithm: partition the R signatures based on a randomly chosen bit that is set; probe with each S signature multiple times, once for each bit that is set. The output is the resulting object id pairs (OID_R, OID_S).
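The three PSJ phases can be sketched in memory as follows. This is a simplified illustration under several assumptions, not the authors' implementation: `SIG_BITS`, `NUM_PARTITIONS`, the `x % SIG_BITS` stand-in hash, mapping a bit to a partition with `bit % NUM_PARTITIONS`, and the cardinality pre-check are all choices made for this example, and every R set is assumed to be non-empty.

```python
import random

SIG_BITS = 16        # assumed signature length
NUM_PARTITIONS = 4   # assumed number of partitions

def signature(elements):
    sig = 0
    for x in elements:
        sig |= 1 << (x % SIG_BITS)  # stand-in for a real hash function
    return sig

def set_bits(sig):
    return [b for b in range(SIG_BITS) if (sig >> b) & 1]

def psj(r_rel, s_rel, rng=random):
    r_parts = [[] for _ in range(NUM_PARTITIONS)]
    s_parts = [[] for _ in range(NUM_PARTITIONS)]
    # Partitioning phase: each R triple goes to the partition of ONE randomly
    # chosen set bit (R sets are assumed non-empty); each S triple is
    # replicated to the partition of EVERY set bit.
    for oid, elems in r_rel.items():
        sig = signature(elems)
        p = rng.choice(set_bits(sig)) % NUM_PARTITIONS
        r_parts[p].append((len(elems), sig, oid))
    for oid, elems in s_rel.items():
        sig = signature(elems)
        for p in {b % NUM_PARTITIONS for b in set_bits(sig)}:
            s_parts[p].append((len(elems), sig, oid))
    # Joining phase: compare (cardinality, signature) only within a partition.
    candidates = [
        (r_oid, s_oid)
        for rp, sp in zip(r_parts, s_parts)
        for r_card, r_sig, r_oid in rp
        for s_card, s_sig, s_oid in sp
        if r_card <= s_card and (r_sig & ~s_sig) == 0
    ]
    # Verification phase: check the actual sets to discard false drops.
    return sorted(pair for pair in candidates if r_rel[pair[0]] <= s_rel[pair[1]])

r_rel = {"r1": {1, 2, 3}, "r2": {4, 5}, "r3": {17, 18, 19}}
s_rel = {"s1": {1, 2, 3, 6}, "s2": {4, 5, 9}}
print(psj(r_rel, s_rel))
```

No qualifying pair is lost by the partitioning: if an R set is contained in an S set, every bit set in the R signature is also set in the S signature, so the S triple is present in whichever partition the R triple was sent to.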
20
PSJ – Pros and Cons. Pros: easy to implement (similar to hash joins); easily parallelizable. Issues: determining the number of partitions; determining the signature size.
21
PSJ – Number of Partitions. A large number of partitions leads to large overhead; a smaller number of partitions leads to a higher join cost. The number is determined using a detailed analytical model.
22
PSJ – Signature Size. The signature size is inversely related to the number of partitions, which creates a cyclic dependency. Solve for both simultaneously, using the bisection method.
23
Set Distributions. Many degrees of freedom; each degree can follow a distribution of its own. Huge distribution space!
24
Classifying Set Distributions. [Figure: 2×2 grid with axes Relation Cardinality (Small, Large) and Set Cardinality (Small, Large); the four classes are (Small, Small), (Large, Small), (Small, Large) and (Large, Large).]
25
Performance – Settings. Implementation in a research version of Paradise using the extensible operator framework and a Set ADT. Intel Pentium 333 MHz, Solaris 2.6. Main memory: 128 MB. Buffer pool size: 32 MB. Raw disks of size 4 GB with I/O bandwidth of 6 MB/sec. Each experiment was run against a cold database. Synthetic data set.
26
Varying Relation Cardinality (Set Cardinality of 20). [Figure]
27
Cost Breakdown of Sig-NL (Set Cardinality of 20). [Figure]
28
Cost Breakdown of PSJ (Set Cardinality of 20). [Figure]
29
Effect of Signature Size (Relation Cardinality of 20000 and Set Cardinality of 20). [Figure]
30
Effect of Increasing Partitions (Relation Cardinality of 20000 and Set Cardinality of 120). [Figure]
31
Performance Space. [Figure: grid with axes Relation Cardinality (Small, Large) and Set Cardinality (Small, Large); cell labels: "Sig-NL, PSJ-1", "PSJ", "PSJ-1, PSJ".]
32
Conclusion. Developed a partition-based algorithm for set containment joins. The performance study shows that PSJ works well on most data sets. The advantages of PSJ are simplicity, effectiveness, and easy parallelization.
33
Future Work. The algorithm can be easily extended to set intersection joins. Investigate the applicability of the nested algorithms to unnested external representations.