Supporting Top-k join Queries in Relational Databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by: Richa Varshney
Introduction O Ordered set of join results according to some provided function. O Often searches are done on multiple features. O Each feature produces a different ranking for the query. O Joining the individual feature rankings to produce a global ranking. 2
Example 1: Ranking in Multimedia Retrieval Color Histogram Edge Histogram Texture Query Color Histogram Edge Histogram Texture Video Database 3
Example 2 SELECT h.id, s.name FROM houses h, schools s WHERE h.location = s.location ORDER BY h.price+10 x s.tuition STOP AFTER 4 4 4
Example 2 (Cont’d) IDLocationPrice LafayetteW.LafayetteIndianapolisKokomoLafayetteKokomo……90,000110,000111,000118,000125,000154,000 IDLocationTuition IndianapolisW.LafayetteLafayetteLafayetteIndianapolisIndianapolisKokomoKokomo Schools Houses
Motivation SELECT A.1,B.2 FROM A,B,C WHERE A.1 = B.1 and B.2 = C.2 ORDER BY (0.3*A.1+0.7*B.2) STOP AFTER 5; Problems:- Sorting is an expensive operation. Sorting is a blocking operator. 6 6
Contribution O Propose a new Rank-Join algorithm O Analyze the I/O cost of the algorithm O Implement the algorithm O Propose a score-guided and adaptive join strategy O Evaluate performance 7 7
Ripple Join Cartesian product L x R (L1(1,1,5) R1(1,3,5)) 8 8 (L2,R2) {(2,2,4),(2,1,4)} (L2,R1) {2,2,4), (1,3,5)} (L1,R2) {(1,1,5), (2,1,4)} L L R R
Variation Of Ripple Join Rectangle Block Hash Ripple Join: where all the sampled tuples are kept in hash tables in memory 9 9
Query Model: Top-k Join O m Relations R 1, ….., R m | R i has: O n attributes O score attribute, s i (can be an expression over other attributes) O A global score for a join result is computed as F(s 1,…., s m ) O A top-k join query is an ordered set of join results according to some provided function that combines the orders on each input. O An example template: SELECT some_attributes FROM R 1,…..,R m WHERE join_condition ORDER BY F(s 1,…..,s m ) STOP AFTER k 10
1) Generate new valid join combinations 2) Compute score for each combination 3) For each incoming input, calculate the threshold score: a) The last seen feature value and the top ranked feature value for all other features in the query. b) Store the maximum of these as T (threshold) 4) Store top k(maximum combined score) results in priority queue. 5) Halt when lowest value of queue ≥ T 11 Rank-Join Algorithm 11
Select * From L, R Where L.A = R.A Order By L.B + R.B Stop After 3 Compute a Threshold (T) by Max {(Last L).B + (First R.B), (First L).B + (Last R).B} (1). Get a valid combination using any certain algorithm Ripple Select (L1, R1) => No Result 12 Example
Example--Cont. (1) Get a valid combination using any certain algorithm Select (L2, R2) (L2, R2), (L2, R1), (L1, R2) => (L1, R2) (2) Compute the score (J) for the result J1(L1, R2) => L.B + R.B = = 9 13 Select * From L, R Where L.A = R.A Order By L.B + R.B Stop After 3
O (3) Compute a Threshold (T) score by Max {(Last L).B + (First R.B), (First L).B + (Last R).B} Selection (L1, R1), (L2, R2) => T = Max (L2.B + R1.B, L1.B + R2.B) =Max (4+5, 5+4) = 9 O (4) J1= 9,T = 9,J1 >= T,Report J1 Since we need top 3 (k=3), continue until k=3 and Min(J1, J2, …Jk) > T 14 Select * From L, R Where L.A = R.A Order By L.B + R.B Stop After 3 Example--Cont. 14
(1) Select (L3, R3) (L3, R3), (L3, R1), (L3, R2), (L1, R3), (L2, R3) => (L3, R3), (L2, R3) (2) J2(L2, R3) = = 7 J3(L3, R3) = 3 + 3= 6 Example--Cont. 15 Select * From L, R Where L.A = R.A Order By L.B + R.B Stop After 3
O (3) Calculate T= Max { (Last L).B + (First R).B,(First L).B+ (Last R).B} = Max {L3.B + R1.B, L1.B + R3.B}= Max(3 + 5, 5 + 3) = 8 O (4) J1(L1,R2) = 9(reported),J2( L2, R3) = 7,J3(L3, R3) = 6 (Note, J’s are in descending order) Min (J) = 6 < T Continue 16 Select * From L, R Where L.A = R.A Order By L.B + R.B Stop After 3 Example--Cont. 16
(1)Select (L4, R4) => (L4, R1), (L2, R4), (L3, R4) (2) J(L4, R1) = 7, J(L2, R4) = 6, J(L3, R4) = 5 (3) T= Max(L4.B+R1.B, L1.B + R4.B) = Max(7, 7) = 7 (4) J1(L1,R2) = 9, J2(L2, R3) = 7, J3(L4, R1) = 7,J3(L3,R3) = 6, J4(L2, R4) = 6, J5(L3, R4) = 5 Min(J1, J2) = 7 >= T (k = 3) Example--Cont. 17 Select * From L, R Where L.A = R.A Order By L.B + R.B Stop After 3
Hash Rank Join (HRJN) Operator O Built on idea of hash ripple join O Initialized by specifying four parameters: O Two inputs(Can be HRJN operator) O Join condition(general equality condition/computes valid join) O Combining function(monotone/computes global scores) O Maintains highest (first) and lowest (last selected) objects from each relation. O Results are added to a priority queue 18
Hash Rank Join (HRJN) Operator: Problems O Buffer Problem O Cannot predict how many partial joins will result O Local Ranking Problem 19
HRJN Solutions O Use Block Ripple Join to solve Local Ranking Problem. (e.g. block size = 2) 20
HRJN Solutions—Cont. O HRJN* score-guided join strategy O How to select next (block) tuple T1 = f(L top,R bottom ) and T2 = f(L bottom,R top ), where f is the ranking function Case 1: T1 >T2, more inputs should be retrieved from R Case 2: T1 <T2, more inputs should be retrieved from L 21
An adaptive join strategy O Use input availability as a guide instead of the aforementioned score-guided strategy O If both inputs are available, choose the next input to process. O Otherwise, the available input is processed. O e.g., a mediator over Web-accessible sources and distributed multimedia repositories 22
Join Order O When more than two tables join, the join order matters. (A and C have high similarity) 23
O Rank-Join order heuristic - Get a ranked sample, top S ranked list from L and R - Calculate the similarity using Footrule 24 Join Order Algorithm where (i, j ) is a valid join result that joins object i from L with object j from R
25 Rank Join Order Heuristic
Performance Evaluation Changing the number of required answers: Selectivity = 0.2 % and m= 4 26
Performance Evaluation--Cont. Changing the number of required answers: Selectivity = 0.2 % and m= 4 27
Performance Evaluation--Cont. Changing the number of required answers: Selectivity = 0.2 % and m= 4 28
Performance Evaluation--Cont. Changing the join selectivity: m =4 and K =50 29
Performance Evaluation--Cont. Changing the join selectivity: m =4 and K =50 30
Performance Evaluation--Cont. Changing the join selectivity: m =4 and K =50 31
Performance Evaluation--Cont. Effect of pipelining: selectivity = 0. 2% and K =50 32
Performance Evaluation--Cont. Effect of pipelining: selectivity = 0. 2% and K =50 33
Performance Evaluation--Cont. Effect of pipelining: selectivity = 0. 2% and K =50 34
35