SUPPORTING TOP-K QUERIES IN RELATIONAL DATABASES. PROCEEDINGS OF THE 29TH INTERNATIONAL CONFERENCE ON VERY LARGE DATABASES, MARCH 2004 Sowmya Muniraju.


1 SUPPORTING TOP-K QUERIES IN RELATIONAL DATABASES. PROCEEDINGS OF THE 29TH INTERNATIONAL CONFERENCE ON VERY LARGE DATABASES, MARCH 2004
Presented by: Sowmya Muniraju
Proposed by: Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid

2 Outline
- Introduction
- Existing join strategies
- Contributions
- Related Work
- Introduction to New Rank join algorithm
- Overview of Ripple Joins
- New Rank join algorithm
- Physical Rank Join Operators
  - HRJN
  - HRJN*
- Performance Evaluation
- Conclusion

3 Introduction
- Relational databases need support for ranking.
- Attributes in relational databases are spread across multiple relations, hence the need for ranking on join queries.
- Users are mostly interested in the top few results.
- The result set should be ordered based on certain conditions (scoring functions).

4 Existing Join strategies
- Sort-merge join
  - Relations are sorted on the join columns.
- Nested-loop join
  - Tuples of the outer relation are joined with tuples of the inner relation.
- Hash join
  - Two phases: build and probe.
  - Build a hash table for the smaller of the two relations.
  - Probe this hash table with the hash value of each tuple in the other relation.
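The build and probe phases of the hash join can be sketched as follows. This is a minimal in-memory illustration, not code from the paper: the function name `hash_join`, the tuple layout (Id, A, B), and the `key` parameter are assumptions made for the example.

```python
from collections import defaultdict

def hash_join(left, right, key):
    """Equi-join two lists of tuples; `key` extracts the join column."""
    # Build phase: hash the smaller relation on its join column.
    swap = len(right) < len(left)
    build, probe = (right, left) if swap else (left, right)
    table = defaultdict(list)
    for row in build:
        table[key(row)].append(row)
    # Probe phase: look up every tuple of the other relation.
    result = []
    for row in probe:
        for match in table.get(key(row), []):
            # Always emit pairs as (left tuple, right tuple).
            result.append((row, match) if swap else (match, row))
    return result

# Tuples are (Id, A, B); the values are taken from the Example slide.
L = [(1, 1, 5), (2, 2, 4), (3, 2, 3), (4, 3, 2)]
R = [(1, 3, 5), (2, 1, 4), (3, 2, 3), (4, 2, 2)]
pairs = hash_join(L, R, key=lambda row: row[1])  # join on L.A = R.A
```

The output order depends on the probe order, not on any score column, which is why a separate sort is needed before ranked results can be returned.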

5 Top-k using existing join strategies
- Given a query, how do we get the top-k results?
    SELECT A.1, B.2
    FROM A, B, C
    WHERE A.1 = B.1 AND B.2 = C.2
    ORDER BY ( 0.3 * A.1 + 0.7 * B.2 )
    STOP AFTER 5;
Problems?
1. Sorting is a blocking operation.
2. Sorting is expensive and is done three times.
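The blocking behaviour can be seen in a small join-then-sort baseline. This is a hedged sketch over two inputs only: `naive_top_k`, the tuple layout (Id, A, B), and the scoring function L.B + R.B are illustrative assumptions, not the three-table query above.

```python
def naive_top_k(left, right, k, f):
    # Step 1: materialize the whole join (nested loops for brevity).
    joined = [(l, r) for l in left for r in right if l[1] == r[1]]
    # Step 2: blocking sort on the combined score; nothing is output before it ends.
    joined.sort(key=lambda pair: f(pair[0][2], pair[1][2]), reverse=True)
    # Step 3: only now keep the first k results.
    return joined[:k]

# Tuples are (Id, A, B); the values are taken from the Example slide.
L = [(1, 1, 5), (2, 2, 4), (3, 2, 3), (4, 3, 2)]
R = [(1, 3, 5), (2, 1, 4), (3, 2, 3), (4, 2, 2)]
top2 = naive_top_k(L, R, k=2, f=lambda lb, rb: lb + rb)
top2_scores = [l[2] + r[2] for l, r in top2]
```

All six join results are computed and sorted even though only two are requested.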

6 Order limitations on existing joins
- Sort-merge join
  - Sorting is done on the join columns, NOT on the columns that participate in the scoring function.
- Nested-loop join
  - Only the order of the outer relation is maintained.
- Hash join
  - The orders of both inputs are lost after the join when the hash tables do not fit in memory.
Common characteristic of these joins: the join is decoupled from the sort.

7 Contributions
- Proposed a new rank join algorithm.
- Implemented this algorithm in practical pipelined rank-join operators based on ripple join.
- Proposed a scoring guide function that reduces the number of tuples to be evaluated to get the desired results.

8 Desired Result
    SELECT A.1, B.2
    FROM A, B, C
    WHERE A.1 = B.1 AND B.2 = C.2
    ORDER BY ( 0.3 * A.1 + 0.7 * B.2 )
    STOP AFTER 5;
[Figure: query plan using existing join strategies, and the desired plan using rank join]

9 Related Work
- This problem is closely related to top-k selection queries.
- Here, the scoring function is applied on multiple attributes (m) of the same relation.
- Related algorithms: Threshold Algorithm (TA), No-Random-Access Algorithm (NRA), J*, A*

10 Introduction: New Rank Join Algorithm
- Tuples are retrieved in ranked order so that the ranking is preserved.
- Produces the first ranked join results as quickly as possible.
- Uses a monotonic ranking function.
- Based on the idea of ripple join.
- Integrates with existing physical query engines.
- Variations: HRJN, HRJN*

11 Overview of Ripple Joins
- A previously unseen random tuple from one relation is joined with the previously seen tuples from the other relation.
- Variations of ripple joins
  - Block
  - Hash
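The ripple-join idea can be sketched as a generator that alternates between the two inputs. This is an illustrative sketch under assumptions: the name `ripple_join` and the `match` predicate are made up for the example, and tuples are taken in input order rather than sampled at random.

```python
def ripple_join(left, right, match):
    """Yield each matching (left, right) pair exactly once, alternating inputs."""
    end = object()                       # sentinel marking an exhausted input
    it_l, it_r = iter(left), iter(right)
    seen_l, seen_r = [], []
    done_l = done_r = False
    while not (done_l and done_r):
        l = next(it_l, end)
        if l is end:
            done_l = True
        else:
            seen_l.append(l)
            # New left tuple meets every right tuple seen so far.
            for r in seen_r:
                if match(l, r):
                    yield (l, r)
        r = next(it_r, end)
        if r is end:
            done_r = True
        else:
            seen_r.append(r)
            # New right tuple meets every left tuple seen so far.
            for l2 in seen_l:
                if match(l2, r):
                    yield (l2, r)

# Tuples are (Id, A, B); the values are taken from the Example slide.
L = [(1, 1, 5), (2, 2, 4), (3, 2, 3), (4, 3, 2)]
R = [(1, 3, 5), (2, 1, 4), (3, 2, 3), (4, 2, 2)]
ripple_pairs = list(ripple_join(L, R, match=lambda l, r: l[1] == r[1]))
```

Results appear as soon as both halves of a pair have been seen, without waiting for either input to finish.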

12 Rank Join Algorithm

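13 Example
Join condition: L.A = R.A. Scoring function: L.B + R.B. K = 2.

L:             R:
Id  A  B       Id  A  B
1   1  5       1   3  5
2   2  4       2   1  4
3   2  3       3   2  3
4   3  2       4   2  2

Threshold (T):
  Left_threshold = f( L_top, R_bottom )
  Right_threshold = f( R_top, L_bottom )
  T = Max( Left_threshold, Right_threshold )

- After reading L1 and R1: T = 10. L1, R1 is not a valid join. Results reported: 0.
- After reading L2 and R2: T = 9. L1, R2 is a valid join: [ (1,1,5) (2,1,4) ] = 9. Its score reaches T, so it is reported. Results reported: 1.
- After reading L3 and R3: T = 8. L2, R3 | L3, R3 are valid joins: [ (2,2,4) (3,2,3) ] = 7, [ (3,2,3) (3,2,3) ] = 6. Neither reaches T.
- After reading L4 and R4: T = 7. L4, R1 | L2, R4 | L3, R4 are valid joins: [ (4,3,2) (1,3,5) ] = 7, [ (2,2,4) (4,2,2) ] = 6, [ (3,2,3) (4,2,2) ] = 5. A result with score 7 now reaches T and is reported. Results reported: 2.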
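The threshold values in this example can be recomputed with a few lines. This is a small sketch assuming one tuple is read from each input per round; the helper name `thresholds` is hypothetical.

```python
def thresholds(l_scores, r_scores, f):
    """Yield T after each round that reads one more tuple from each input.

    l_scores and r_scores are the score columns in descending order.
    """
    for depth in range(1, min(len(l_scores), len(r_scores)) + 1):
        l_top, l_bottom = l_scores[0], l_scores[depth - 1]
        r_top, r_bottom = r_scores[0], r_scores[depth - 1]
        left_threshold = f(l_top, r_bottom)
        right_threshold = f(r_top, l_bottom)
        yield max(left_threshold, right_threshold)

# B columns of L and R from the example; scoring function is L.B + R.B.
trace = list(thresholds([5, 4, 3, 2], [5, 4, 3, 2], lambda a, b: a + b))
```

Here `trace` is `[10, 9, 8, 7]`, matching the threshold values shown on the slide.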

14 Hash Rank Join Operator (HRJN)
- Variant of the symmetrical hash join algorithm.
- Data structures
  - Hash table for each input.
  - Priority queue: holds valid join combinations along with their scores.
- Methods implemented
  - Open: initializes the operator and prepares its internal state.
  - GetNext: returns the next ranked join result upon each call.
  - Close: terminates the operator and performs the necessary clean up.

15 Open(L, R, C, f)
L = Left input
R = Right input
C = Join condition
f = Monotonic scoring function

16 GetNext()
Output: Next ranked join result
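The pseudocode for Open and GetNext is not part of this transcript, so the following is only a sketch of how such an operator could look, built from the data structures listed for HRJN (one hash table per input, a priority queue, and the threshold T). The class layout, the strictly alternating fetch order, and the `key` and `score` accessors are assumptions for illustration, not the paper's code.

```python
import heapq
from collections import defaultdict
from itertools import count

class HRJN:
    """Sketch of a hash rank-join operator over two score-sorted inputs."""

    def open(self, left, right, key, score, f):
        self.inputs = [iter(left), iter(right)]
        self.key, self.score, self.f = key, score, f
        self.tables = [defaultdict(list), defaultdict(list)]  # one hash table per input
        self.top, self.bottom = [None, None], [None, None]
        self.done = [False, False]
        self.queue = []      # max-heap emulated with negated scores
        self.tie = count()   # tie-breaker so tuples are never compared
        self.turn = 0

    def _threshold(self):
        inf = float("inf")
        top, bottom, terms = self.top, self.bottom, []
        if not self.done[1]:  # unseen right tuples may still join any left tuple
            terms.append(inf if None in (top[0], bottom[1]) else self.f(top[0], bottom[1]))
        if not self.done[0]:  # unseen left tuples may still join any right tuple
            terms.append(inf if None in (bottom[0], top[1]) else self.f(bottom[0], top[1]))
        return max(terms, default=-inf)

    def _fetch(self):
        side = self.turn if not self.done[self.turn] else 1 - self.turn
        self.turn = 1 - side
        row = next(self.inputs[side], None)
        if row is None:
            self.done[side] = True
            return
        s = self.score(row)
        if self.top[side] is None:
            self.top[side] = s
        self.bottom[side] = s
        k = self.key(row)
        self.tables[side][k].append(row)
        # Probe the other input's hash table and queue every valid join.
        for other in self.tables[1 - side].get(k, []):
            l, r = (row, other) if side == 0 else (other, row)
            total = self.f(self.score(l), self.score(r))
            heapq.heappush(self.queue, (-total, next(self.tie), (l, r)))

    def get_next(self):
        """Return (score, left, right) for the next ranked result, or None."""
        while True:
            if self.queue and -self.queue[0][0] >= self._threshold():
                neg, _, (l, r) = heapq.heappop(self.queue)
                return (-neg, l, r)
            if all(self.done):
                return None
            self._fetch()

    def close(self):
        self.queue, self.tables, self.inputs = [], [], []

# Tuples are (Id, A, B), sorted on B; values are taken from the Example slide.
L = [(1, 1, 5), (2, 2, 4), (3, 2, 3), (4, 3, 2)]
R = [(1, 3, 5), (2, 1, 4), (3, 2, 3), (4, 2, 2)]
op = HRJN()
op.open(L, R, key=lambda t: t[1], score=lambda t: t[2], f=lambda a, b: a + b)
first, second = op.get_next(), op.get_next()
op.close()
```

A result is returned only when its score is at least T, so no unseen combination can outrank it. Once an input is exhausted, its threshold term is dropped, which lets the remaining queued results drain.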

17 Local Ranking Problem
- Problem: unbalanced retrieval rates of the left and right inputs.
- Solving: use the concept of block ripple join.

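18 Example 2
Join condition: L.A = R.A. Scoring function: L.B + R.B. K = 2.

L:             R:
Id  A  B       Id  A  B
1   1  5       1   3  5
2   2  4       2   1  4
3   2  3       3   2  3
4   3  2       4   2  2

Threshold (T):
  Left_threshold = f( L_top, R_bottom )
  Right_threshold = f( R_top, L_bottom )
  T = Max( Left_threshold, Right_threshold )

Here only R1 has been read from R while L keeps advancing, so Left_threshold stays at 10.
- After reading L1 and R1: T = 10.
- After reading L2: Right_threshold = 9, T = 10. No valid joins.
- After reading L3: Right_threshold = 8, T = 10.
- After reading L4: Right_threshold = 7, T = 10. L4, R1 is a valid join: [ (4,3,2) (1,3,5) ] = 7. Its score is below T, so it cannot be reported yet.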

19 HRJN*: Score-Guided Join Strategy
T1 = f( L_top, R_bottom )
T2 = f( R_top, L_bottom )
If T1 > T2:
  Yes: Input = R
  No: Input = L
Retrieve tuple from Input
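The decision in this flowchart fits in a single function. This is a minimal sketch; the name `choose_input` and the string return values are assumptions for illustration.

```python
def choose_input(l_top, l_bottom, r_top, r_bottom, f):
    """Pick the input to read next, following the T1 > T2 test."""
    t1 = f(l_top, r_bottom)   # bound on results that use an unseen R tuple
    t2 = f(r_top, l_bottom)   # bound on results that use an unseen L tuple
    # Reading from the side behind the larger bound is what can lower T.
    return "R" if t1 > t2 else "L"

# State from Example 2 after reading L1, L2 and only R1 (scoring: L.B + R.B).
next_input = choose_input(l_top=5, l_bottom=4, r_top=5, r_bottom=5,
                          f=lambda a, b: a + b)
```

In that state T1 = 10 and T2 = 9, so the function returns "R". Reading from R is the only move that can lower the stuck threshold of 10.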

20 Exploiting available indexes
- Generalize rank-join to use random access if available.
- Two cases:
  - An index on the join attribute(s) of one input.
  - An index on the join attribute(s) of each input.
- Problem: duplicates can be produced, as the indexes contain all data, both seen and not yet seen.

21 Exploiting Indexes: On-the-fly duplicate elimination
Scoring function: L.B + R.B. Index available on R.

L:               R:
Id  A  B         Id  A  B
1   1  100       1   3  10
2   2  50        2   1  9
3   2  25        3   2  8
4   3  10        4   2  5

[ (1,1,100) (2,1,9) ] = 109
[ (2,2,50) (3,2,8) ] = 58
[ (2,2,50) (4,2,5) ] = 55
Any join result not yet produced cannot have a combined score greater than f( L_bottom, R_bottom ).
f( L_bottom, R_bottom ) = 59

22 Exploiting Indexes: Faster Termination
Join condition: L.A = R.A. Scoring function: L.B + R.B. Index available on R.

L:               R:
Id  A  B         Id  A  B
1   1  100       1   3  10
2   2  50        2   1  9
3   2  25        3   2  8
4   3  10        4   2  5

Previously, T = Max( 109, 60 ) = 109
Reduce L_top to L_bottom, i.e. L_top = L_bottom
After reducing L_top, T = Max( 59, 60 ) = 60
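The two threshold values on this slide can be checked directly. The sketch assumes two tuples have been read from each input (L_top = 100, L_bottom = 50, R_top = 10, R_bottom = 9); the helper name `threshold` is hypothetical.

```python
def threshold(l_top, l_bottom, r_top, r_bottom, f):
    # T = Max( f(L_top, R_bottom), f(R_top, L_bottom) )
    return max(f(l_top, r_bottom), f(r_top, l_bottom))

add = lambda a, b: a + b
# Without the index: L_top is the best L score seen so far.
t_before = threshold(l_top=100, l_bottom=50, r_top=10, r_bottom=9, f=add)
# With the index on R: L_top is reduced to L_bottom.
t_after = threshold(l_top=50, l_bottom=50, r_top=10, r_bottom=9, f=add)
```

The threshold drops from 109 to 60, so queued results can be reported sooner.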

23 Performance Evaluation: Top-k join operators
[Figure] M = 4, Selectivity = 0.2%

24 Effect of selectivity
[Figure] M = 4, K = 50

25 Effect of pipelining
[Figure] Selectivity = 0.2%, K = 50

26 Conclusion
- Supported top-k join queries using the new rank join algorithm.
- The algorithm uses the ranking on the input relations to produce ranked join results on a combined score.
- The ranking is performed progressively during the join operation.
- The HRJN and HRJN* operators implement the new algorithm.
- A generalization of this algorithm utilizes available indexes for faster termination.

28 References
- Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid. "Supporting Top-k Join Queries in Relational Databases." March 2004.
- Jing Chen: DIBR Spring 2005, CSE - UT Arlington

29 Thank you

