SUBSKY: Efficient Computation of Skylines in Subspaces Authors: Yufei Tao, Xiaokui Xiao, and Jian Pei Conference: ICDE 2006 Presenter: Kamiru Supervisor: Dr. Nikos Mamoulis

Presentation transcript:

SUBSKY: Efficient Computation of Skylines in Subspaces
Authors: Yufei Tao, Xiaokui Xiao, and Jian Pei
Conference: ICDE 2006
Presenter: Kamiru
Supervisor: Dr. Nikos Mamoulis

Skyline Queries
• Given a set of d-dimensional points, a point p dominates another point p' if p[i] <= p'[i] for every dimension i and p[j] < p'[j] for at least one dimension j.
• A skyline query returns the points that are not dominated by any other point.
• Example: in the NBA database, a low turnover rate and a low foul rate are two important factors for a defensive player.
[Figure: players plotted by foul rate and turnover rate, with the best point marked.]
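The dominance test and the skyline definition above can be sketched in a few lines of Python. This is a minimal quadratic-time illustration, not an algorithm from the paper; the function names and the (foul rate, turnover rate) values are hypothetical.

```python
from typing import List, Sequence

Point = Sequence[float]

def dominates(p: Point, q: Point) -> bool:
    """True if p dominates q: no worse on every dimension, strictly better on one.
    Smaller values are better, as in the slide's definition."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points: List[Point]) -> List[Point]:
    """Keep every point that no other point dominates (O(n^2) comparisons)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (foul rate, turnover rate) pairs, invented for illustration.
players = [(0.2, 0.6), (0.4, 0.3), (0.5, 0.5), (0.7, 0.1), (0.8, 0.8)]
print(skyline(players))
```

Here (0.5, 0.5) and (0.8, 0.8) are dropped because (0.4, 0.3) dominates both.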

Applications of Skyline Queries
• Find a good hotel for me according to distance and price.
• Hotel D cannot be a good hotel for this user, since its price is higher and its distance is farther than those of the other hotels.
[Figure: hotels A, B, C, and D plotted by distance and price; price labels 1000, 1500, 500, and 2000.]

Alternative applications of Skyline Queries - i
• Some top-k queries can be answered with the help of skyline queries.
• A top-k query retrieves the k tuples in P with the highest scores according to a scoring function g.
• g must be a monotonic function, e.g. g(p) = p.x + p.y.

Alternative applications of Skyline Queries - i
• Who are the top-2 NBA players according to the sum of their points and assists in the season?
• The results of this query (up to Jan) are Allen Iverson and LeBron James.
• The top-2 results must lie in the top-2 skyband; all other players are pruned.
[Figure: players plotted by assists and points; the values are shown at the top-right corner of each player photo.]
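The claim that the top-k results of a monotonic scoring function lie in the k-skyband can be checked with a small sketch. Here larger values are better (points and assists), the k-skyband is taken to be the set of points dominated by fewer than k others, and the statistics are randomly generated placeholders rather than real NBA data.

```python
import heapq
import random

def dominates_max(p, q):
    """p dominates q when larger values are better."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

def k_skyband(points, k):
    """Points dominated by fewer than k other points."""
    return [p for p in points if sum(dominates_max(q, p) for q in points) < k]

def top_k(points, k, g=sum):
    """The k points with the highest score under a monotonic function g."""
    return heapq.nlargest(k, points, key=g)

random.seed(0)
# Hypothetical (points, assists) pairs.
stats = [(random.randint(0, 30), random.randint(0, 12)) for _ in range(50)]
band = k_skyband(stats, 2)
best = top_k(stats, 2)  # g(p) = p.x + p.y
print(best, all(p in band for p in best))
```

A point dominated by k others has a strictly lower score than each of them under any strictly monotonic g, so it can never appear in the top-k answer; this is why only the skyband needs to be scored.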

Alternative applications of Skyline Queries - ii
• Another interesting measure is the dominating count (DC).
• The DC of a point is the number of points it dominates.
• Example: find the top-2 dominating players in the NBA database according to turnover rate and foul rate.
[Figure: players plotted by foul rate and turnover rate, with the best point marked.]
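A dominating-count ranking can be sketched as follows, assuming the DC of a point is the number of other points it dominates and that smaller values are better. The player values are hypothetical.

```python
def dominates(p, q):
    """p dominates q when smaller values are better."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def dominating_count(p, points):
    """Number of points in the dataset that p dominates."""
    return sum(dominates(p, q) for q in points)

def top_k_dominating(points, k):
    """The k points with the largest dominating counts (ties keep input order)."""
    return sorted(points, key=lambda p: -dominating_count(p, points))[:k]

# Hypothetical (foul rate, turnover rate) pairs.
players = [(0.2, 0.6), (0.4, 0.3), (0.5, 0.5), (0.7, 0.1), (0.8, 0.8)]
print(top_k_dominating(players, 2))
```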

Skyline Computations
• There are two categories of skyline computation: computing from scratch (no index) and relying on an index.
1. Computing from scratch (no index)
• Advantages:
– No pre-computation is needed.
– No index has to be updated when the data change.
• Drawback:
– Every query starts from scratch, so the entire dataset must be scanned at least once.

Skyline Computations
2. Relying on an index
• Once the index is built, it can be used many times.
• Searching an appropriate structure (B-tree, R-tree, ...) lowers the query cost.
• Since all of us are database people (I hope...), we prefer method 2.

Related works
1. Computing from scratch (no index)
• Block nested loop
• Sort filter skyline
• Divide and conquer
• Bitmap
• Linear elimination sort for skyline

Related works
2. Relying on an index
• B-tree approach: Index
• R-tree approaches:
– Nearest neighbor (NN)
– Branch and bound skyline (BBS): BBS has been proven to be I/O optimal, that is, it accesses no more disk pages than any other algorithm based on R-trees.

Related works: Index
• A point p is added to list i if its smallest coordinate is on dimension i.
• List x: p5: 0.1, p6: 0.25, p2: 0.3, p7: 0.6
• List y: p4: 0.1, p1: 0.2, p3: 0.3, p8: 0.6
1) $S_{sky} = \{p_5\}$
2) $S_{sky} = \{p_5, p_4\}$
3) $S_{sky} = \{p_5, p_4, p_1\}$
• All remaining elements in List x are pruned by p1, since both coordinates of p6 are larger than those of p1.
• For the same reason, all remaining elements in List y are pruned by p1 too.
[Figure: points p1 to p8 in the x-y plane, with the best point marked.]

Related works: BBS
• $H_{NN} = \{p_1, p_2, N_2, M_2\}$
1) p1 is the first NN object from the best point; the dominant region of p1 is shown in grey.
2) p2 is pruned because it falls in the dominant region.
3) Expand N2.
4) ...
[Figure: points p1 to p8 in an R-tree with leaf nodes N1 to N4 and intermediate nodes M1 and M2, with the best point and the dominant region marked.]

SUBSKY
• In the NBA database, each player has more than 10 different attributes.
• A skyline query may be interested in only some of these attributes.

SUBSKY
• Option 1: build one R-tree and run BBS.
– BBS is an I/O optimal algorithm based on an R-tree index, but the approach is optimized for a fixed set of dimensions.
• Option 2: build an R-tree for every element of the power set of dimensions.
– This requires huge storage space.

SUBSKY for uniform data
• Anchor point $A_c$: the maximal corner of the data space, which has the maximum coordinate on all dimensions.
• $f(p) = \max_{1 \le i \le d} (1 - p[i])$
• $f_{sky}(p_{sky}) = \min_{1 \le i \le d} (1 - p_{sky}[i])$
• No point p satisfying $f(p) < f_{sky}(p_{sky})$ can belong to the skyline.
• A similar result exists for the skyline of any subspace.
[Figure: unit square with axes x and y, anchor $A_c$, the best point, points $p_{sky}$, $p_1$, and $p_2$ with $f(p_1)$, $f(p_2)$, and $f_{sky}(p_{sky})$, and the pruning region of $p_{sky}$.]

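SUBSKY for uniform data
• A skyline query applies only to its relevant dimensions SUB: $f'_{sky}(p_{sky}) = \min_{i \in SUB} (1 - p_{sky}[i])$
• A minimum over SUB is at least the minimum over all dimensions, so $f_{sky}(p_{sky}) \le f'_{sky}(p_{sky})$.
• No point p satisfying $f(p) < f'_{sky}(p_{sky})$ can belong to the skyline of SUB; this covers every p with $f(p) < f_{sky}(p_{sky})$.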

SUBSKY for uniform data
• Assume that the skyline query is interested in dimensions x and y only.
• First, sort the data by $f(p_i)$ in descending order: $p_3, p_4, p_1, p_2, p_5$.
• $S_{sky} = \{p_3\}$, $f'_{sky}(p_3) = \min(1 - 0.5, 1 - 0.3) = 0.5$, so $U = 0.5$ (U is the largest $f'_{sky}$ value in $S_{sky}$).
• $S_{sky} = \{p_3, p_4\}$, $f'_{sky}(p_4) = 0.1$, so $U = 0.5$.
• $S_{sky} = \{p_1, p_4\}$, $f'_{sky}(p_1) = 0.8$; p3 is removed when p1 is added, since it is dominated by p1. Now $U = 0.8$.
[Table: x, y, z coordinates and $f(p_i)$ for p1 to p5.]
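The scan in this example can be sketched as below. This is a simplified in-memory reading of the slides, not the paper's implementation (which keeps the f values in an index): points are visited in descending order of f, U tracks the largest f'_sky value in the current skyline, and the scan stops at the first point with f(p) < U. The function names and the random test data are assumptions; coordinates are taken to lie in [0, 1].

```python
import random

def dominates(p, q, sub):
    """p dominates q on the dimensions in sub (smaller is better)."""
    return all(p[i] <= q[i] for i in sub) and any(p[i] < q[i] for i in sub)

def subsky_uniform(points, sub):
    """Subspace skyline with the single anchor A_c = (1, ..., 1)."""
    f = lambda p: max(1.0 - v for v in p)             # distance-like value to A_c
    f_sky = lambda p: min(1.0 - p[i] for i in sub)    # pruning threshold of p on SUB
    skyline, U, accessed = [], float("-inf"), 0
    for p in sorted(points, key=f, reverse=True):
        if f(p) < U:          # p and every later point are dominated on SUB
            break
        accessed += 1
        if any(dominates(s, p, sub) for s in skyline):
            continue
        skyline = [s for s in skyline if not dominates(p, s, sub)]
        skyline.append(p)
        U = max(U, f_sky(p))
    return skyline, accessed

random.seed(1)
data = [tuple(random.random() for _ in range(5)) for _ in range(500)]
result, accessed = subsky_uniform(data, sub=(0, 2))
print(len(result), accessed, len(data))
```

The early stop is safe because f(p) < U implies that p is strictly larger than some skyline point on every dimension of SUB, and all later points have even smaller f values.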

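Analysis
• Assume a 15-dimensional, uniformly distributed dataset with cardinality 100k, and a skyline query on a subspace SUB containing any two dimensions.
• The slide states a chance greater than 90% of finding an object in the region $[0, \lambda] \times [0, \lambda]$ of SUB, where $\lambda = 0.001$, based on the term $(1 - \lambda^2)^{100k}$, the probability that no object falls in that region.
• Therefore, $f'_{sky}(p) = 1 - \lambda = 0.999$.
• The pruned volume evaluates to $(1 - \lambda)^{15} \approx 98.5\%$, that is, only about 1.5% of the dataset needs to be accessed.
[Figure: unit square with a region of side $\lambda$ marked.]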
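The 98.5% figure can be reproduced with a short calculation. A point is pruned when f(p) < 1 - λ, which holds exactly when every one of its coordinates exceeds λ; for uniform data this has probability (1 - λ)^d. The variable names below are illustrative only.

```python
d = 15          # full-space dimensionality from the slide
lam = 0.001     # lambda from the slide

threshold = 1 - lam                  # f'_sky value of a skyline point inside [0, lam]^2
pruned_fraction = (1 - lam) ** d     # P(all d coordinates > lam) for uniform data
accessed_fraction = 1 - pruned_fraction

print(f"threshold = {threshold}")
print(f"pruned    = {pruned_fraction:.1%}")
print(f"accessed  = {accessed_fraction:.1%}")
```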

General SUBSKY
• In practice, data are usually clustered.
• If the data are clustered, a single anchor point cannot be expected to give enough pruning power.
[Figure: unit space with anchors $A_c$ and $A_1$, skyline point $p_{sky}$, and the best point.]

General SUBSKY
• Use different anchors for different clusters.
• Two questions:
1) How to find the anchors?
2) How to assign points to anchors?
[Figure: clusters s1 to s4 with anchors $A_c$, $A_1$, and $A_2$, skyline point $p_{sky}$, and the best point.]

Finding the Anchors
• First, let us see what a perfect anchor of a point p is.
• If p is assigned to a perfect anchor A, then p can be pruned by any skyline point dominating p.
• Any point on the line shown in the figure is a perfect anchor of p.
[Figure: point p with its anti-dominant region, the major perpendicular plane, and anchors $A_c$, $A_1$, $A_2$, and $A_3$.]

Finding the Anchors
• For each point, find its projection onto the major perpendicular plane, e.g. p'1, p'2, ...
• Partition the projected points into m clusters using the k-means algorithm, and formulate an anchor for each cluster.
[Figure: points p1 and p2, their projections p'1 and p'2 on the major perpendicular plane, anchor $A_c$, a good anchor, and the best point.]

Finding the Anchors
• How to decide the anchor for a cluster? The blue points in the figure are assigned to cluster S.
1) Obtain point B, whose coordinate on each dimension equals the lowest coordinate on that axis among the points of S in their original space.
2) Then compute the smallest square opposite to B that covers all points in S.
[Figure: cluster S with points A and B.]
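One possible reading of the two steps above is sketched here. It assumes that the "smallest square opposite to B" is the smallest axis-aligned hypercube with B as its lowest corner, and that the anchor A is the corner diagonally opposite to B; the slide does not state this explicitly, so treat both as assumptions. The function name and the sample cluster are hypothetical.

```python
def cluster_anchor(cluster):
    """Return (A, B) for a cluster of points in the original space.

    B takes the minimum coordinate of the cluster on every axis.
    A is assumed to be the opposite corner of the smallest hypercube
    that starts at B and covers all points of the cluster.
    """
    d = len(cluster[0])
    B = tuple(min(p[i] for p in cluster) for i in range(d))
    # Side length: the largest extent of the cluster over all axes.
    side = max(max(p[i] for p in cluster) - B[i] for i in range(d))
    A = tuple(b + side for b in B)
    return A, B

S = [(0.25, 0.5), (0.5, 0.625), (0.375, 0.875)]
A, B = cluster_anchor(S)
print(A, B)
```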

Assigning Points to Anchors
• A naive way is to assign each point to its closest anchor point in the major perpendicular plane (projected space).
• This does not directly quantify the benefit of an assignment.

Assigning Points to Anchors
• To assign a point to a good anchor, the paper introduces a new measure called the effective region (ER).
• Every point in the yellow region (the ER of p) has a pruning region with respect to $A_c$ that covers p.
• The larger the ER volume of p, the higher the chance that p is pruned.
[Figure: point p with its ER, points p1 and p2 with their pruning regions, anchor $A_c$, and the best point.]

Assigning Points to Anchors
[Figure: two panels showing the ER of p with points p1 and p2 and the best point, one with anchor $A_c$ and one with anchor A'.]

Assigning Points to Anchors
• The pruning volume of a point p with respect to an anchor point $A_j$ is $\prod_{i=1}^{d} \max(0, A_j[i] - L_\infty(p, A_j))$.
• Therefore, assign each point p to the anchor $A_j$ that produces the largest pruning volume.
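The assignment rule can be sketched directly from the formula. The anchors A_c = (1, 1, 1) and A' = (1, 1, 0.8) are the ones used in the query example of this talk; the two data points and the function names are hypothetical.

```python
from math import prod

def linf(p, a):
    """L-infinity distance between point p and anchor a."""
    return max(abs(x - y) for x, y in zip(p, a))

def pruning_volume(p, a):
    """Product over dimensions of max(0, a[i] - L_inf(p, a))."""
    r = linf(p, a)
    return prod(max(0.0, ai - r) for ai in a)

def assign(p, anchors):
    """Index of the anchor giving p the largest pruning volume."""
    return max(range(len(anchors)), key=lambda j: pruning_volume(p, anchors[j]))

anchors = [(1.0, 1.0, 1.0), (1.0, 1.0, 0.8)]   # A_c and A'
for p in [(0.5, 0.5, 0.7), (0.9, 0.9, 0.1)]:   # hypothetical data points
    print(p, "-> anchor", assign(p, anchors))
```

The first point gets a larger volume under A_c (0.5^3 = 0.125 against 0.5 * 0.5 * 0.3 = 0.075), while the second is better served by A'.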

Query example
• We use the same example as on the previous slide.
• Assume that we have two anchors: one is $A_c$, and the other, A', is found by k-means (m=1).
• $A_c = (1, 1, 1)$ and $A' = (1, 1, 0.8)$
• First, calculate the ER volume of each data point with respect to $A_c$ and A'.
[Table: x, y, z coordinates and $f(p_i)$ for p1 to p5.]
[Table: ER volumes of p1 to p5 with respect to $A_c$ and A', in units of $10^{-3}$.]

Query example
• Sorted lists by f:
– $A_c$: $p_4, p_1, p_2, p_5$
– A': $p_3$
1) $S_{sky} = \{p_4\}$, $f'_{sky}(p_4) = 0.5$, $U = 0.5$
2) $S_{sky} = \{p_4, p_1\}$, $f'_{sky}(p_1) = 0.8$, $U = 0.8$
[Table: x, y, z coordinates and $f(p_i)$ for p1 to p5, with the anchor assignment to $A_c$ or A'.]

Experiments
• 3 real datasets: NBA, Household, and Color

              NBA    Household   Color
Dimension     13     6           9
Cardinality   17k    127k        68k

• 2 synthetic datasets (10D):
– Uniform
– Clustered: 10 cluster centroids. Each centroid receives N/10 points, whose coordinate on each axis follows a Gaussian distribution with variance 0.05 and a mean equal to the corresponding coordinate of the centroid.
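A generator for the clustered synthetic data could look like the sketch below. The slide gives the number of centroids, the N/10 points per centroid, and the Gaussian variance; how the centroids are chosen and what happens to values outside [0, 1] are not stated, so the uniform centroid placement, the clamping, and all names here are assumptions.

```python
import random

def clustered_dataset(n, d=10, clusters=10, variance=0.05, seed=0):
    """Generate n points in d dimensions around `clusters` centroids."""
    rng = random.Random(seed)
    sigma = variance ** 0.5   # the slide specifies the variance, not the standard deviation
    # Assumption: centroids are drawn uniformly from the unit space.
    centroids = [[rng.random() for _ in range(d)] for _ in range(clusters)]
    data = []
    for c in centroids:
        for _ in range(n // clusters):
            # Assumption: coordinates are clamped to the unit interval.
            data.append(tuple(min(1.0, max(0.0, rng.gauss(mu, sigma))) for mu in c))
    return data

points = clustered_dataset(1000)
print(len(points), len(points[0]))
```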

Experiments

[Figures: results for 3D subspaces with a cardinality of 1 million, and for 3D subspaces with a full-space dimensionality of 10.]

Conclusion
• The core of SUBSKY is a transformation that converts multi-dimensional data into 1D values.
• It shows better performance than an I/O optimal algorithm on the subspace skyline problem.
• Some continuous monitoring cases are worth investigating:
– How to adapt the set of anchor points if the data are updated rapidly.
– The f values could be stored in another index structure to support fast updates.

Assigning Points to Anchors
• Therefore, we have two ways to assign points to the anchors:
1. Assign each point to its closest anchor point in the major perpendicular plane (projected space).
2. Assign each point to the anchor that gives it the largest ER volume in the original space.
• The second approach is better than the first, because the ER volume directly quantifies the benefit of an assignment.