1
A Unified Approach for Computing Top-k Pairs in Multidimensional Space
Presented by: Muhammad Aamir Cheema (1)
Joint work with Xuemin Lin (1), Haixun Wang (2), Jianmin Wang (3), Wenjie Zhang (1)
(1) University of New South Wales, Australia
(2) Microsoft Research Asia
(3) Tsinghua University, China
2
Introduction
Top-k Pairs Query: Given a scoring function f() that computes the score of a pair of objects, return the k pairs of objects with the smallest scores.
[Figure: five objects o_1 to o_5 plotted against an x-axis and a y-axis]
Examples:
- k-closest pairs: f(o_u, o_v) = dist(o_u, o_v); answer (k=1) = (o_1, o_2)
- k-furthest pairs: f(o_u, o_v) = -dist(o_u, o_v); answer (k=1) = (o_2, o_4)
- f(o_u, o_v) = (o_u.x + o_v.x) + (o_u.y + o_v.y); answer (k=1) = (o_4, o_5)
3
Related Work
K-Closest Pairs Queries
- Computational geometry: [M. Smid, Handbook on Comp. Geometry]
- Database community: [Hjaltason et al., SIGMOD 1998], [Corral et al., SIGMOD 2000], [Yang et al., IDEAS 2002], [Shan et al., SSTD 2003]
K-Furthest Pairs Queries
- [Supowit, SODA 1990], [Katoh et al., IJCGA 1995], [Corral et al., DKE 2004]
Top-k Queries
- Fagin's Algorithm: [Fagin, PODS 1996]
- Threshold Algorithm: [Fagin, JCSS 1999], [Nepal et al., ICDE 1999], [Güntzer et al., VLDB 2000]
- No Random Access Algorithm: [Fagin, JCSS 1999], [Mamoulis et al., TODS 2007]
4
Motivation
  SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
No existing work for more general queries
- Other L_p distances (e.g., Manhattan distance)?
- More general scoring functions
- Chromatic queries
  SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id AND a.manager ≠ b.manager ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
No existing unified algorithm
- One framework that answers a broad class of top-k pairs queries
5
Problem Definition (Preliminaries)
Monotonic function
f() is monotonic if f(x_1, …, x_N) ≤ f(y_1, …, y_N) whenever x_i ≤ y_i for every 1 ≤ i ≤ N.
Examples:
- f(x_1, …, x_N) = x_1 + x_2 + … + x_N (summation)
- f(x_1, …, x_N) = (x_1 + x_2 + … + x_N) / N (average)
6
Problem Definition (Preliminaries)
Loose monotonic function
s() takes two parameters and is loose monotonic if both of the following hold for every fixed value x:
1. for every y > x, s(x,y) either monotonically increases or monotonically decreases as y increases
2. for every y < x, s(x,y) either monotonically increases or monotonically decreases as y decreases
Examples:
- s_1(x,y) = |x - y|
- s_2(x,y) = x + y
[Figure: number line from -∞ to ∞ with worked values. For x = 1, as y increases from 2 to 5, s_1(x,y) increases from 1 to 4 and s_2(x,y) increases from 3 to 6.]
Loose monotonic functions are more general than monotonic functions.
7
Problem Definition
  SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
Return the k pairs of objects with the smallest scores.
SCORE(a,b) = f(s_1(a,b), …, s_d(a,b))
- s_i() is called a local scoring function and can be any loose monotonic function of the user's choice.
- f() is called the global scoring function and can be any monotonic function that involves an arbitrary set of attributes.
For the query above:
s_1(a,b) = |a.sold - b.sold|
s_2(a,b) = -|a.salary - b.salary|
f() = s_1(a,b) + s_2(a,b)
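To make the scoring model concrete, the following is a minimal brute-force sketch of the AGENT query, not the algorithm presented in these slides. The rows in AGENTS and the helper names (s1, s2, score, top_k_pairs) are hypothetical and exist only for illustration; this is the O(N^2) baseline that enumerates every pair.

```python
import heapq
from itertools import combinations

# Hypothetical AGENT rows as (id, sold, salary); not taken from the slides.
AGENTS = [(1, 10, 50), (2, 12, 90), (3, 30, 55), (4, 11, 52)]

def s1(a, b):
    """Local score on 'sold': |a.sold - b.sold|."""
    return abs(a[1] - b[1])

def s2(a, b):
    """Local score on 'salary': -|a.salary - b.salary|."""
    return -abs(a[2] - b[2])

def score(a, b):
    """Global score f = s1 + s2 (a monotonic combination of local scores)."""
    return s1(a, b) + s2(a, b)

def top_k_pairs(rows, k):
    """Return the k smallest (score, a.id, b.id) triples with a.id < b.id."""
    pairs = combinations(sorted(rows), 2)  # sorted by id, so a.id < b.id
    return heapq.nsmallest(k, ((score(a, b), a[0], b[0]) for a, b in pairs))

print(top_k_pairs(AGENTS, 2))
```

Because every pair is materialized, this sketch only serves as a correctness reference for small inputs.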
8
Problem Definition
  SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
Return the k pairs of objects with the smallest scores among the valid pairs.
Chromatic queries: let each object be assigned a color.
- Homochromatic queries: pairs containing objects of the same color
  SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id AND a.manager = b.manager ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
- Heterochromatic queries: pairs containing objects of different colors
  SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id AND a.manager ≠ b.manager ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
9
Contributions
Unified algorithm (internal and external)
- k-closest pairs, k-furthest pairs and variants (any L_p distance)
- queries involving any arbitrary subset of attributes
- chromatic and non-chromatic queries
- skyline pairs queries and rank-based top-k pairs queries
No pre-built indexes required
- efficiently builds a simple data structure on-the-fly
- can answer queries involving filtering conditions on objects
  SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id AND a.age > 40 AND b.age > 40 ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
Known memory requirement
- existing R-tree based approaches may require arbitrarily large heaps
- our algorithm requires O(k) space + 2d buffer pages
Efficient
- Theoretically: optimal for d ≤ 2
- Experimentally
10
Framework
[Diagram: the sources s_1(a,b), s_2(a,b), …, s_d(a,b) feed a top-k algorithm (e.g., FA, TA, NRA), which evaluates f(s_1(a,b), s_2(a,b), …, s_d(a,b))]
How to efficiently create and maintain these sources?
11
Creating/maintaining sources
Naïve approach
- Create all possible pairs: O(N^2)
- Sort them according to their local scores: O(N^2 log N)
- Space requirement: O(N^2)
Features of our approach
Optimal internal memory algorithm
- requires O(N) space
- returns the first pair in O(N log N)
- returns each next best pair in O(log N)
Optimal external memory algorithm
- B = number of elements that can be stored in one disk page
- M = internal memory used (minimum M = 2B)
- returns the first pair in O((N/B) log_{M/B} (N/B))
- returns each next best pair in O(log_{M/B} (N/B))
12
Creating/maintaining sources
Example with s(x,y) = |x - y|
[Figure: six objects o_1 to o_6 with attribute values 6, 12, 14, 15, 20, 30, together with the pair scores held in the heap]
Initialize
- sort the objects
- for each object o_u, create its best pair (o_u, o_v) and insert (o_u, o_v) into the heap
getNextPair()
- report the top pair (o_u, o_v) of the heap
- create the next best pair of o_u
- insert the new pair into the heap and delete (o_u, o_v)
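The Initialize and getNextPair() steps can be sketched in Python for the single local scoring function s(x,y) = |x - y|. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function name pairs_in_score_order is hypothetical, and each object is paired only with objects to its right in sorted order so that every unordered pair is produced exactly once.

```python
import heapq

def pairs_in_score_order(values):
    """Yield (score, i, j) over all index pairs, in ascending |values[i] - values[j]|.

    Assumption for this sketch: each object is paired only with objects to
    its right in sorted order, so no unordered pair is produced twice.
    """
    # Initialize: sort the objects (O(N log N)).
    order = sorted(range(len(values)), key=lambda i: values[i])
    xs = [values[i] for i in order]
    n = len(xs)

    # For each object, its best pair is with its right neighbour.
    heap = [(xs[u + 1] - xs[u], u, u + 1) for u in range(n - 1)]
    heapq.heapify(heap)

    # getNextPair(): report the top pair, then replace it with the
    # next best pair of the same left object (O(log N) per pair).
    while heap:
        score, u, v = heapq.heappop(heap)
        yield score, order[u], order[v]
        if v + 1 < n:
            heapq.heappush(heap, (xs[v + 1] - xs[u], u, v + 1))

# Values taken from the slide's example.
for triple in pairs_in_score_order([6, 12, 14, 15, 20, 30]):
    print(triple)
```

The heap never holds more than one pair per object, which is where the O(N) space bound for this single source comes from.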
13
Homochromatic Queries
[Figure: objects o_1 to o_6 with attribute values 6, 12, 14, 15, 20, 30]
14
Heterochromatic Queries
[Figure: objects o_1 to o_6 with attribute values 6, 12, 14, 15, 20, 30]
Let (o_u, o_v) be the current pair.
o_x = the object next to o_v
If o_u and o_x have different colors
    (o_u, o_x) is the next best pair
else
    o_y = the adjacent object of o_x
    (o_u, o_y) is the next best pair
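A heterochromatic variant can be sketched in Python for s(x,y) = |x - y|. This is a simplified sketch, not the procedure on the slide: instead of a single adjacent-object lookup it scans linearly past same-colored objects, so finding the next pair can cost O(N) in the worst case. The function name and the color labels in the usage example are hypothetical, and objects are again paired only with objects to their right.

```python
import heapq

def heterochromatic_pairs(values, colors):
    """Yield (score, i, j) for pairs of different colors, in ascending |x - y|."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    xs = [values[i] for i in order]
    cs = [colors[i] for i in order]
    n = len(xs)

    def next_valid(u, start):
        """First position >= start whose color differs from o_u, or None."""
        v = start
        while v < n and cs[v] == cs[u]:
            v += 1  # simplification: linear scan over same-colored objects
        return v if v < n else None

    # Initialize: the best valid pair of each object.
    heap = []
    for u in range(n):
        v = next_valid(u, u + 1)
        if v is not None:
            heap.append((xs[v] - xs[u], u, v))
    heapq.heapify(heap)

    # Report the top pair, then insert the next valid pair of o_u.
    while heap:
        score, u, v = heapq.heappop(heap)
        yield score, order[u], order[v]
        w = next_valid(u, v + 1)
        if w is not None:
            heapq.heappush(heap, (xs[w] - xs[u], u, w))

# Slide values with hypothetical colors.
for triple in heterochromatic_pairs([6, 12, 14, 15, 20, 30], "rbrrbb"):
    print(triple)
```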
15
Experiments
K-closest pairs queries [Corral et al., SIGMOD 2000]
Data size: two datasets, each containing 100K objects
k: 10
16
Experiments
Naive: join the dataset with itself using a nested loop (block nested loop for the external memory algorithm)
Scoring function:
- Local scoring function is either sum or absolute difference (chosen randomly)
- Global scoring function is a weighted aggregate (weights are chosen randomly and negative weights are allowed)
17
Number of Objects
18
Number of attributes (d)
19
Value of k
20
Number of colors
21
Thanks
22
Complexity
Internal memory algorithm = [formula not preserved in transcript]
External memory algorithm = [formula not preserved in transcript]
- d = number of local scoring functions involved
- N = total number of objects
- V = total number of valid pairs (N^2 at most)
- M = internal memory used by the algorithm
- B = the number of entries one disk page can store