Supporting Top-k join Queries in Relational Databases Ihab F. Ilyas, Walid G. Aref, Ahmed K. Elmagarmid Presented by: Z. Joseph, CSE-UT Arlington
Introduction Often searches are done on multiple features. Each feature produces a different ranking for the query. Rankings on different features must thus be joined and aggregated.
Example Find a location for a house such that the combination of the cost of the house and 5 years' tuition at a nearby school is minimal. The exact location is not predefined in the query; for each location, the house and school features would have to be analyzed.
Motivation Current techniques decouple the join from the sorting (ranking) of results. Sorting is expensive and is a blocking operation. The cost is more apparent when the ranking features and the joining features are different.
Rank-Join Algorithm
1) Generate new valid join combinations
2) Compute the score for each combination
3) For each incoming input:
a) Calculate the total score of its last seen feature value and the top ranked feature values of all other features in the query.
b) Store the maximum of these totals as T (threshold)
4) Store the top k results in a priority queue.
5) Halt when the queue holds k results and its lowest score is ≥ T
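The steps above can be sketched in Python for the two-input case. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `rank_join`, the `(join_key, score)` tuple layout, the use of a sum as the combining function, "higher score is better", and strict alternation between inputs are all choices made for this example.

```python
import heapq
from itertools import count


def rank_join(left, right, k):
    """Return the top-k join results as (total_score, left_tuple, right_tuple).

    Assumptions (not from the slides): each input is a list of
    (join_key, score) tuples sorted by score descending, and the
    combined score is the sum of the two scores.
    """
    inputs = (left, right)
    seen = ([], [])          # tuples read so far from each input
    pos = [0, 0]             # read position in each input
    queue = []               # min-heap holding at most k results
    tiebreak = count()       # keeps heap entries comparable on equal scores
    turn = 0
    while pos[0] < len(left) or pos[1] < len(right):
        if pos[turn] >= len(inputs[turn]):
            turn = 1 - turn  # this input is exhausted, read the other one
        tup = inputs[turn][pos[turn]]
        pos[turn] += 1
        seen[turn].append(tup)
        # Steps 1-2: join the new tuple with everything seen on the other side.
        for other in seen[1 - turn]:
            if other[0] == tup[0]:
                l, r = (tup, other) if turn == 0 else (other, tup)
                item = (l[1] + r[1], next(tiebreak), l, r)
                # Step 4: keep only the k best results.
                if len(queue) < k:
                    heapq.heappush(queue, item)
                else:
                    heapq.heappushpop(queue, item)
        # Step 3: threshold = best score any unseen combination could reach.
        if seen[0] and seen[1]:
            t = max(seen[0][0][1] + seen[1][-1][1],   # top(L) + last(R)
                    seen[0][-1][1] + seen[1][0][1])   # last(L) + top(R)
            # Step 5: stop once the k-th best result is at least T.
            if len(queue) == k and queue[0][0] >= t:
                break
        turn = 1 - turn
    return [(s, l, r) for s, _, l, r in sorted(queue, reverse=True)]
```

For example, with `left = [("a", 10), ("b", 8), ("c", 5), ("d", 1)]` and `right = [("b", 9), ("c", 7), ("a", 2), ("d", 1)]`, a call to `rank_join(left, right, 1)` stops after reading two tuples from each input, because the only result found so far (score 17) already matches the threshold.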
Optimality Is instance optimal over all correct top-k join algorithms. Guarantees that the cost of Rank-Join is O(cost of any other algorithm). Mathematically: Cost(Rank-Join) ≤ c*Cost(Any Other Algorithm) + c’, where c is the optimality ratio and c, c’ > 0
Rank-Join Continued … Join strategy is crucial. Recommended: Ripple Join. Alternates between tuples. Flexible in the way it sweeps out (rectangular, etc.). Retains ordering in considering samples. Variants of Rank-Join: Hash Rank Join (HRJN), Block Ripple Join.
Hash Rank Join (HRJN) Operator Built on the idea of the hash ripple join. Inputs are kept in two hash tables. Maintains the highest (first) and lowest (last selected) objects from each relation. Results are added to a priority queue. Advantages: smaller space requirement; can be pipelined.
Hash Rank Join (HRJN) Operator: Problems Local Ranking Problem: results from three or more input streams; leads to larger queue sizes and more database accesses. Buffer Problem: cannot predict how many partial joins will result.
HRJN Solutions? Block Ripple Joins: do comparisons as blocks. Score-Guided Strategy: if the thresholds are very different, this may be because one ranking's scores are larger and descend at a slower rate. More inputs can then be taken from the slower-descending ranking so that its threshold moves closer to the other thresholds.
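One possible reading of the score-guided strategy for two inputs can be sketched as follows. The function name `next_input` and its parameters are hypothetical, and the combining function is assumed to be a sum. Each input has its own candidate threshold, and the input whose last-seen score keeps the larger threshold high is the one polled next.

```python
def next_input(top_l, last_l, top_r, last_r):
    """Pick which input to read next under a score-guided policy.

    Assumption: scores are combined by addition, so the two candidate
    thresholds are top(L) + last(R) and last(L) + top(R).
    Returns "left", "right", or None when the thresholds are equal
    (the caller may then simply alternate).
    """
    t_right = top_l + last_r   # lowered only by reading more from the right
    t_left = last_l + top_r    # lowered only by reading more from the left
    if t_right > t_left:
        return "right"
    if t_left > t_right:
        return "left"
    return None
```

With `top_l=10, last_l=8, top_r=9, last_r=9` the right input has barely descended, so its threshold (19) exceeds the left one (17) and the sketch chooses `"right"`.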
Optimal Join-Order Try to read the fewest input records needed to get a correct ranking. No clear way of estimating the best order of joins. Heuristic: Footrule Distance, a simple measure of similarity between two rankings. First join the most similar rankings; this would quickly yield join results while accessing fewer records.
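As a rough illustration of the heuristic, the footrule distance can be computed as the sum of the absolute differences between each item's positions in the two rankings. The sketch below assumes both rankings list exactly the same items; the helper `most_similar_pair` and the dictionary layout are hypothetical and only show how the most similar pair might be chosen to join first.

```python
from itertools import combinations


def footrule_distance(ranking_a, ranking_b):
    """Sum of absolute position differences between two rankings.

    Assumption: both rankings are orderings of the same set of items.
    """
    pos_b = {item: i for i, item in enumerate(ranking_b)}
    return sum(abs(i - pos_b[item]) for i, item in enumerate(ranking_a))


def most_similar_pair(rankings):
    """Return the names of the two rankings with the smallest distance.

    `rankings` maps a (hypothetical) input name to its ranked item list.
    """
    return min(
        combinations(sorted(rankings), 2),
        key=lambda pair: footrule_distance(rankings[pair[0]], rankings[pair[1]]),
    )
```

Identical rankings have distance 0, and a larger distance means the two inputs agree less on which items rank highly.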
Rank-Join Algorithm: Benefits What can it do? Integrates well with query plans. Produces results as fast as possible. Provides performance guarantees. Minimizes space requirements. Offers a mechanism to determine the best order of joining to execute the query optimally. Can be improved further if random access is available. Can eliminate on-the-fly duplicate elimination.
References “Supporting top-k join queries in relational databases” - Ihab Ilyas, Walid Aref, Ahmed Elmagarmid (2004) Jing Chen : DBIR Spring 2005, CSE-UT Arlington Spring2005/DBIR/slides/top-k_join.ppt