A Unified Approach for Computing Top-k Pairs in Multidimensional Space Presented By: Muhammad Aamir Cheema 1 Joint work with Xuemin Lin 1, Haixun Wang.

Presentation transcript:

A Unified Approach for Computing Top-k Pairs in Multidimensional Space
Presented by: Muhammad Aamir Cheema (1)
Joint work with Xuemin Lin (1), Haixun Wang (2), Jianmin Wang (3), Wenjie Zhang (1)
(1) University of New South Wales, Australia; (2) Microsoft Research Asia; (3) Tsinghua University, China

Introduction
Top-k Pairs Query: Given a scoring function f() that computes the score of a pair of objects, return the k pairs of objects with the smallest scores.
[Figure: five objects o_1 to o_5 plotted against an x-axis and a y-axis]
Examples:
  k-closest pairs: f(o_u, o_v) = dist(o_u, o_v); answer (k=1) = (o_1, o_2)
  k-furthest pairs: f(o_u, o_v) = -dist(o_u, o_v); answer (k=1) = (o_2, o_4)
  f(o_u, o_v) = (o_u.x + o_v.x) + (o_u.y + o_v.y); answer (k=1) = (o_4, o_5)
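To make the query definition concrete, here is a minimal brute-force sketch in Python. It is not the algorithm of the talk; it simply enumerates every pair and keeps the k smallest scores. The function name top_k_pairs and the sample points are assumptions made for illustration only.

```python
import heapq
import math
from itertools import combinations

def top_k_pairs(objects, score, k):
    """Return the k (score, i, j) triples with the smallest scores.

    Brute force: every unordered pair (i < j) is scored once.
    """
    scored = (
        (score(a, b), i, j)
        for (i, a), (j, b) in combinations(enumerate(objects), 2)
    )
    return heapq.nsmallest(k, scored)

# Hypothetical 2-D points, not taken from the slides.
points = [(0, 0), (1, 0), (5, 5), (9, 1)]

closest = top_k_pairs(points, lambda a, b: math.dist(a, b), 1)
furthest = top_k_pairs(points, lambda a, b: -math.dist(a, b), 1)
smallest_sum = top_k_pairs(
    points, lambda a, b: (a[0] + b[0]) + (a[1] + b[1]), 1
)
```

Negating the distance turns the k-furthest pairs query into a smallest-score query, which is why a single definition covers all three examples.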

Related Work
K-Closest Pairs Queries
  Computational geometry: [M. Smid, Handbook on Comp. Geometry]
  Database community: [Hjaltason et al., SIGMOD 1998], [Corral et al., SIGMOD 2000], [Yang et al., IDEAS 2002], [Shan et al., SSTD 2003]
K-Furthest Pairs Queries
  [Supowit, SODA 1990], [Katoh et al., IJCGA 1995], [Corral et al., DKE 2004]
Top-k Queries
  Fagin’s Algorithm: [Fagin, PODS 1996]
  Threshold Algorithm: [Fagin, JCSS 1999], [Nepal et al., ICDE 1999], [Güntzer et al., VLDB 2000]
  No Random Access Algorithm: [Fagin, JCSS 1999], [Mamoulis et al., TODS 2007]

Motivation
  SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
No existing work for more general queries:
  Other L_p distances (e.g., Manhattan distance)?
  More general scoring functions
  Chromatic queries
  SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id AND a.manager ≠ b.manager ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
No existing unified algorithm:
  One framework that answers a broad class of top-k pairs queries

Problem Definition (Preliminaries)
Monotonic function
f() is monotonic if f(x_1, ..., x_N) ≤ f(y_1, ..., y_N) whenever x_i ≤ y_i for every 1 ≤ i ≤ N.
Examples:
  f(x_1, ..., x_N) = x_1 + x_2 + ... + x_N (summation)
  f(x_1, ..., x_N) = (x_1 + x_2 + ... + x_N) / N (average)

Problem Definition (Preliminaries)
Loose monotonic function
s() takes two parameters and is loose monotonic if both of the following hold for every fixed value x:
  1. for every y > x, s(x,y) either monotonically increases or monotonically decreases as y increases
  2. for every y < x, s(x,y) either monotonically increases or monotonically decreases as y decreases
Examples: s_1(x,y) = |x - y| and s_2(x,y) = x + y
[Figure: number line from -∞ through 0 to ∞ with example values of s_1 and s_2 for a fixed x and varying y]
Loose monotonic functions are more general than monotonic functions.
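The two conditions can be spot-checked numerically. The sketch below tests them on a finite grid of sample values, so it can refute loose monotonicity but cannot prove it. It also treats "monotonically" as non-strict. Both choices, along with the helper names, are assumptions made for this illustration.

```python
def is_monotone(seq):
    """True if seq is entirely non-decreasing or entirely non-increasing."""
    inc = all(a <= b for a, b in zip(seq, seq[1:]))
    dec = all(a >= b for a, b in zip(seq, seq[1:]))
    return inc or dec

def looks_loose_monotonic(s, xs, ys):
    """Check both conditions of the definition for each fixed x on a grid."""
    ys = sorted(ys)
    for x in xs:
        right = [s(x, y) for y in ys if y > x]            # y increasing
        left = [s(x, y) for y in reversed(ys) if y < x]   # y decreasing
        if not (is_monotone(right) and is_monotone(left)):
            return False
    return True

grid = range(-5, 6)
abs_diff_ok = looks_loose_monotonic(lambda x, y: abs(x - y), grid, grid)
sum_ok = looks_loose_monotonic(lambda x, y: x + y, grid, grid)
# A hypothetical counterexample: dips and then rises for y > x.
bump_ok = looks_loose_monotonic(lambda x, y: (y - x - 2) ** 2, grid, grid)
```

Note that |x - y| is not monotonic in y overall (it falls and then rises around y = x), yet it passes because each side of x is checked separately.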

Problem Definition
  SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
Return the k pairs of objects with the smallest scores.
SCORE(a,b) = f(s_1(a,b), ..., s_d(a,b))
  s_i() is called a local scoring function and can be any loose monotonic function of the user's choice.
  f() is called the global scoring function and can be any monotonic function that involves an arbitrary set of attributes.
For the query above:
  s_1(a,b) = |a.sold - b.sold|
  s_2(a,b) = -|a.salary - b.salary|
  f() = s_1(a,b) + s_2(a,b)
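The decomposition into local and global scoring functions can be written out directly. The following is a small sketch, assuming a made-up AGENT table held in a dictionary. The data and the helper top_k are hypothetical, and the evaluation is brute force rather than the unified algorithm.

```python
from itertools import combinations

# Hypothetical AGENT rows keyed by id (values invented for illustration).
agents = {
    1: {"sold": 10, "salary": 50},
    2: {"sold": 12, "salary": 90},
    3: {"sold": 30, "salary": 55},
}

def s1(a, b):
    """Local score on 'sold': absolute difference."""
    return abs(a["sold"] - b["sold"])

def s2(a, b):
    """Local score on 'salary': negated absolute difference."""
    return -abs(a["salary"] - b["salary"])

def f(*local_scores):
    """Global score: a monotonic aggregate (here, summation)."""
    return sum(local_scores)

def score(a, b):
    return f(s1(a, b), s2(a, b))

def top_k(rows, k):
    """Brute-force evaluation of the query with the a.id < b.id condition."""
    pairs = [
        (score(rows[i], rows[j]), i, j)
        for i, j in combinations(sorted(rows), 2)
    ]
    return sorted(pairs)[:k]

result = top_k(agents, 2)
```

With this data, the pair (1, 2) ranks first: its agents sold similar amounts but have very different salaries, which is what the ORDER BY expression rewards.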

Problem Definition
  SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
Return the k pairs of objects with the smallest scores among the valid pairs.
Chromatic Queries: let each object be assigned a color.
  Homochromatic Queries: pairs containing objects of the same color
    SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id AND a.manager = b.manager ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
  Heterochromatic Queries: pairs containing objects of different colors
    SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id AND a.manager ≠ b.manager ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;

Contributions
Unified algorithm (internal and external)
  k-closest pairs, k-furthest pairs and variants (any L_p distance)
  queries involving any arbitrary subset of attributes
  chromatic and non-chromatic queries
  skyline pairs queries and rank-based top-k pairs queries
No pre-built indexes required
  efficiently builds a simple data structure on the fly
  can answer queries involving filtering conditions on objects
  SELECT a.id, b.id FROM AGENT a, AGENT b WHERE a.id < b.id AND a.age > 40 AND b.age > 40 ORDER BY |a.sold - b.sold| - |a.salary - b.salary| LIMIT k;
Known memory requirement
  existing R-tree based approaches may require arbitrarily large heaps
  our algorithm requires O(k) space + 2d buffer pages
Efficient
  Theoretically: optimal for d ≤ 2
  Experimentally

Framework
Sources: s_1(a,b), s_2(a,b), ..., s_d(a,b), one per local scoring function
Top-k algorithms (e.g., FA, TA, NRA) combine the sources to compute f(s_1(a,b), s_2(a,b), ..., s_d(a,b))
How to efficiently create and maintain these sources?

Creating/maintaining sources
Naïve approach
  Create all possible pairs: O(N^2)
  Sort them according to their local scores: O(N^2 log N)
  Space requirement: O(N^2)
Features of our approach
  Optimal internal memory algorithm
    requires O(N) space
    returns the first pair in O(N log N)
    each next best pair is returned in O(log N)
  Optimal external memory algorithm
    B = number of elements that can be stored in one disk page
    M = used internal memory (minimum M = 2B)
    returns the first pair in O((N/B) log_{M/B} (N/B))
    each next best pair is returned in O(log_{M/B} (N/B))

Creating/maintaining sources
Example local scoring function: s(x,y) = |x - y|
[Figure: objects o_1 to o_6 arranged in sorted order]
Initialize
  sort the objects
  for each object o_u, create its best pair (o_u, o_v) and insert (o_u, o_v) into the heap
getNextPair()
  report the top pair (o_u, o_v) of the heap
  create the next best pair of o_u
  insert the new pair into the heap and delete (o_u, o_v)
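The Initialize and getNextPair() steps can be sketched for the single case s(x,y) = |x - y|. This is a simplified illustration, not necessarily the authors' exact procedure: to avoid reporting a pair twice, it assumes each object is only paired with objects to its right in sorted order, so an object's best partner is its right neighbour and its next best partner is one position further right. The function name pairs_by_abs_diff is hypothetical.

```python
import heapq
from typing import Iterator, List, Tuple

def pairs_by_abs_diff(values: List[float]) -> Iterator[Tuple[float, int, int]]:
    """Yield (score, i, j) in non-decreasing order of |values[i] - values[j]|."""
    # Initialize: sort the objects (positions in `order` are sorted ranks).
    order = sorted(range(len(values)), key=lambda i: values[i])
    heap = []
    for pos in range(len(order) - 1):
        u, v = order[pos], order[pos + 1]
        # Best pair of each object: its right neighbour in sorted order.
        heap.append((abs(values[u] - values[v]), pos, pos + 1))
    heapq.heapify(heap)

    # getNextPair(): report the top pair, then replace it in the heap
    # with the next best pair of the same object o_u.
    while heap:
        score, pos_u, pos_v = heapq.heappop(heap)
        yield score, order[pos_u], order[pos_v]
        nxt = pos_v + 1
        if nxt < len(order):
            new_score = abs(values[order[pos_u]] - values[order[nxt]])
            heapq.heappush(heap, (new_score, pos_u, nxt))

vals = [1, 9, 4, 6]
ranked = list(pairs_by_abs_diff(vals))
```

The heap never holds more than one entry per object, which matches the O(N) space bound stated for the internal memory algorithm. Sorting and heap construction dominate the cost of the first pair, and each later pair costs one pop and at most one push.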

Homochromatic Queries
[Figure: colored objects o_1 to o_6 in sorted order]

Heterochromatic Queries
[Figure: colored objects o_1 to o_6 in sorted order]
Let (o_u, o_v) be the current pair of o_u, and let o_x be the object next to o_v.
  If o_u and o_x have different colors, (o_u, o_x) is the next best pair.
  Otherwise, let o_y be the adjacent object of o_x; (o_u, o_y) is the next best pair.
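The color-skipping rule can be sketched as follows, again for s(x,y) = |x - y| and pairing each object only with objects to its right. The slide does not spell out what "the adjacent object" of o_x is. This sketch assumes it is a precomputed pointer to the nearest object to the right whose color differs from that of o_x. That reading, and every name in the code, is an assumption.

```python
import heapq

def heterochromatic_pairs(values, colors):
    """Yield (score, i, j) for pairs of different colors, smallest |diff| first."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(order)
    val = [values[i] for i in order]
    col = [colors[i] for i in order]

    # next_diff[p]: nearest position right of p with a different color (n if none).
    next_diff = [n] * n
    for p in range(n - 2, -1, -1):
        next_diff[p] = p + 1 if col[p + 1] != col[p] else next_diff[p + 1]

    def partner_from(u, x):
        """First valid partner of u at position x or beyond (n if none)."""
        if x >= n:
            return n
        # If x shares u's color, jump through the pointer stored at x.
        return x if col[x] != col[u] else next_diff[x]

    heap = []
    for u in range(n):
        v = next_diff[u]  # best valid partner to the right of u
        if v < n:
            heap.append((abs(val[u] - val[v]), u, v))
    heapq.heapify(heap)

    while heap:
        score, u, v = heapq.heappop(heap)
        yield score, order[u], order[v]
        w = partner_from(u, v + 1)
        if w < n:
            heapq.heappush(heap, (abs(val[u] - val[w]), u, w))

hetero = list(heterochromatic_pairs([1, 2, 3, 10], ["r", "r", "b", "b"]))
```

Because a skipped object has the same color as o_u, the pointer stored at that object always lands on an object whose color differs from o_u, so finding the next valid partner takes constant time per step.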

Experiments
K-closest pairs queries [Corral et al., SIGMOD 2000]
Data size: two datasets, each containing 100K objects
k: 10

Experiments
Naive: join the dataset with itself using a nested loop (block nested loop for the external memory algorithm)
Scoring function:
  Local scoring function is either sum or absolute difference (chosen randomly)
  Global scoring function is a weighted aggregate (weights are chosen randomly and negative weights are allowed)

Number of Objects

Number of attributes (d)

Value of k

Number of colors

Thanks

Complexity
Internal memory algorithm =
External memory algorithm =
  d = number of local scoring functions involved
  N = total number of objects
  V = total number of valid pairs (N^2 at most)
  M = internal memory used by the algorithm
  B = the number of entries one disk page can store