SUPPORTING TOP-K QUERIES IN RELATIONAL DATABASES. PROCEEDINGS OF THE 29TH INTERNATIONAL CONFERENCE ON VERY LARGE DATABASES, MARCH 2004 Sowmya Muniraju Presented By: Proposed By, Ihab F. Ilyas Walid G. Aref Ahmed K. Elmagarmid
Outline 2 Introduction Existing join strategies Contributions Related Work Introduction to New Rank join algorithm Overview of Ripple Joins New Rank join algorithm Physical Rank Join Operators HRJN HRJN* Performance Evaluation Conclusion
Introduction Need for support of ranking in Relational Databases. Attributes in Relational databases spread across multiple relations, hence need for ranking on join queries. User mostly interested in top few results. Resultset should be ordered based on certain conditions (scoring functions). 3
Existing Join strategies Sort-Merge join Relations sorted on join columns. Nested loop join Tuples of outer relation are joined with tuples of the inner relation. Hash join 2 phases: Build, Probe Build hash table for smaller of the two relations. Probe this hash table with hash value for each tuple in the other relation. 4
Top-k using existing join strategies Given a query, how do we get the top-k results? SELECT A.1, B.2 FROM A, B, C WHERE A.1 = B.1 AND B.2 = C.2 ORDER BY ( 0.3 * A * B.2 ) STOP AFTER 5; Problems ? 1.Sorting is a blocking operation. 2.Sorting is expensive and has been done thrice. 5
Order limitations on existing joins Sort-merge join Sorting is done on joining columns, NOT on columns that participate in scoring function. Nested-loop join Orders of only the outer loop is maintained. Hash join Orders on both inputs are lost after the join, when hash tables do not fit in memory. Common characteristic in these joins: Decouple join from sort. 6
Contributions Proposed a new rank join algorithm. Implemented this algorithm in practical pipelined rank-join operators based on ripple join. Proposed a scoring guide function that reduces the number of tuples to be evaluated to get the desired resutls. 7
Desired Result SELECT A.1, B.2 FROM A, B, C WHERE A.1 = B.1 AND B.2 = C.2 ORDER BY ( 0.3 * A * B.2 ) STOP AFTER 5; 8 Using existing join strategies DESIRED: Using rank join
Related Work This problem is closely related to top-k selection queries. Here, scoring function is applied on multiple attributes m of the same relation. Related algorithms: Threshold Algorithm(TA), No-Random Access Algorithm(NRA), J*, A* 9
Introduction: New Rank Join Algorithm Tuples are retrieved in order to preserve ranking. Produces first ranked join results as quickly as possible. Uses a monotonic ranking function. Based on the idea of ripple join. Integration with existing physical query engines. Variations: HRJN, HRJN* 10
Overview of Ripple Joins Previously unseen random tuple from one relation is joined with previously seen tuples from another relation. Variations of Ripple Joins Block Hash 11
Rank Join Algorithm 12
10 Example 13 IdAB AB L R L.A = R.A Threshold (T): L_top L_bottom R_top R_bottom LI, RI not a valid join Right_threshold =f( R_top, L_bottom ) Left_threshold = f( L_top, R_bottom ) T = Max(Left_threshold, Right_threshold ) L1, R2 is a valid join L3, R3 | L2, R3 are valid joins [ (1,1,5) (2,1,4) ] = 9 [ (2,2,4) (3,2,3) ] = [ (3,2,3) (3,2,3) ] = 6 L4,R1 | L2,R4 | L3,R4 are valid joins [ (4,3,2) (1,3,5) ] = 7 [ (2,2,4) (4,2,2) ] = 6 [ (3,2,3) (4,2,2) ] = 5 Scoring Function: L.B+ R.B K = 2 K = 0K = 1K = 2
Hash Rank Join Operator (HRJN) Variant of Symmetrical hash join algorithm. Data Structures Hash table for each input. Priority Queue - holds valid join combinations along with their scores. Methods implemented Open: initializes its operator and prepares its internal state. Get Next: returns next ranked join result upon each call. Close: terminates the operator and performs the necessary clean up. 14
Open(L, R, C, f) 15 L = Left Input R = Right Input C = Join condition f = Monotonic scoring function L = Left Input R = Right Input C = Join condition f = Monotonic scoring function
GetNext() 16 Output: Next ranked join result
Local Ranking Problem 17 Unbalance retrieval rate of left and right inputs. Use concept of Block Ripple Join. Solving
Example 2 18 IdAB AB L.A = R.A L_top L_bottom R_top R_bottom Threshold (T): Right_threshold =f( R_top, L_bottom ) Left_threshold = f( L_top, R_bottom ) T = Max(Left_threshold, Right_threshold ) 10 Scoring Function: L.B+ R.B K = 2 Scoring Function: L.B+ R.B K = No valid joins L4, R1 is a valid join [ (4,3,2) (1,3,5) ] = 7 L R
HRJN*: Score-Guided Join Strategy 19 Retrieve tuple from input T1 = f( L_top, R_bottom) T2 = f( R_top, L_bottom) T1 = f( L_top, R_bottom) T2 = f( R_top, L_bottom) If T1 > T2 If T1 > T2 Input = R Input = L YesNo
Exploiting available indexes 20 Generalize Rank-join to use random access if available. Two cases: An index on join attribute(s) of one input. An index on join attribute(s) for each input. Problem : Duplicates can be produced as indexes will contain all data seen and not yet seen.
Exploiting Indexes: On-the-fly duplicate elimination 21 IdAB IdAB L R Scoring Function: L.B+ R.B Index available on R [ (1,1,100) (2,1,9) ] = 109 [ (2,2,5) (3,2,8) ] = 58 [ (2,2,50) (4,2,5) ] = 55 Any join result, not yet produced, cannot have a combined score greater than f( L_bottom, R_bottom) f( L_bottom, R_bottom) = 59
Exploiting Indexes: Faster Termination 22 Previously, T = ( 109, 60 ) = 109 After reducing L_top, T = ( 59, 60 ) = 60 IdAB IdAB L R L.A = R.A Scoring Function: L.B+ R.B Index available on R L_top = L_bottom Reduce L_top to L_bottom, i.e
Performance Evaluation Top-k join operators 23 M = 4 Selectivity = 0.2%
Effect of selectivity 24 M = 4 K = 50
Effect of pipelining 25 Selectivity = 0.2% K = 50
Conclusion 26 Supported top-k join queries using the new rank join algorithm. Algorithm uses ranking on the input relations to produce ranked join results on a combined score. The ranking is performed progressively during the join operation. HRJN, HRJN* operators implement the new algorithm. Generalization of this algorithm utilized available indexes for faster termination.
27
References 28 “Supporting Top-k Join Queries in Relational Databases.”, Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid, March 2004 Jing Chen: DIBR Spring 2005, CSE - UT Arlington
THANKYOU