SUBSKY: Efficient Computation of Skylines in Subspaces. Authors: Yufei Tao, Xiaokui Xiao, and Jian Pei. Conference: ICDE 2006. Presenter: Kamiru. Supervisor: Dr. Nikos Mamoulis.


1 SUBSKY: Efficient Computation of Skylines in Subspaces
Authors: Yufei Tao, Xiaokui Xiao, and Jian Pei. Conference: ICDE 2006. Presenter: Kamiru. Supervisor: Dr. Nikos Mamoulis.

2 Skyline Queries
Given a set of d-dimensional points, a point p dominates another point p' if p[i] <= p'[i] for every dimension i and p[j] < p'[j] for at least one dimension j. A skyline query returns the points that are not dominated by any other point. Example (NBA database): a low turnover rate and a low foul rate are two important factors for a defensive player.
[Figure: players plotted by foul rate and turnover rate, both on a 0 to 1 scale; the best point is the origin.]
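The dominance test and the skyline definition on this slide can be sketched in a few lines of Python. This is a minimal, quadratic-time illustration and not one of the algorithms covered later in the deck; the function names and the sample points are made up for the example, and smaller values are assumed to be better on every dimension.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """True if p dominates q: p <= q on every dimension and p < q on at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points: List[Point]) -> List[Point]:
    """Naive skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (foul rate, turnover rate) pairs, not taken from the NBA data.
players = [(0.2, 0.8), (0.5, 0.5), (0.8, 0.2), (0.9, 0.9), (0.5, 0.6)]
result = skyline(players)
print(result)
```

A point never dominates itself under this definition, because the strict inequality on at least one dimension fails.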

3 Applications of Skyline Queries
Find a good hotel for me according to distance and price. Hotel D cannot be a good hotel for this user, since its price is higher and its distance is farther than those of the other hotels.
[Figure: four hotels A, B, C, and D at different distances; the price labels in the figure are 1000, 1500, 500, and 2000.]

4 Alternative applications of Skyline Queries - i
Some top-k queries can be answered with skyline computations. A top-k query retrieves the k tuples in P with the highest scores according to a scoring function g, where g must be monotonic, e.g. g(p) = p.x + p.y.

5 Alternative applications of Skyline Queries - i
Example: find the top-2 NBA players according to the sum of their points and assists in the 2007-2008 season. The results of this query (up to Jan 23, 2008) are Allen Iverson (27 + 6.9) and LeBron James (29.7 + 7.4). The top-2 results must be in the 2-skyband, i.e. among the points dominated by fewer than 2 other points.
[Figure: players plotted by points and assists; each player's values are given by the top-right corner of the player's photo, and the region outside the 2-skyband is marked PRUNED.]
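The claim that the top-2 results lie in the 2-skyband can be checked on a small example. The sketch below assumes that larger values are better (as for points and assists) and takes the k-skyband to be the set of points dominated by fewer than k other points. The first two rows reuse the numbers quoted on the slide; the remaining rows and all function names are made up for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]

def dominates_max(p: Point, q: Point) -> bool:
    """Dominance when larger is better on every dimension."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

def skyband(points: List[Point], k: int) -> List[Point]:
    """Points dominated by fewer than k other points."""
    return [p for p in points
            if sum(dominates_max(q, p) for q in points) < k]

def top_k(points: List[Point], k: int) -> List[Point]:
    """Top-k under the monotonic score g(p) = p.x + p.y."""
    return sorted(points, key=lambda p: p[0] + p[1], reverse=True)[:k]

# (points, assists) per game; rows 3 to 5 are hypothetical.
stats = [(29.7, 7.4), (27.0, 6.9), (22.0, 11.0), (18.0, 4.0), (25.0, 5.0)]
band = skyband(stats, 2)
best = top_k(stats, 2)
print("2-skyband:", band)
print("top-2:", best)
```

Because g is monotonic, a point dominated by two or more others always has at least two points scoring at least as high, which is why only the skyband needs to be scored.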

6 Alternative applications of Skyline Queries - ii
Another interesting measure is the dominating count (DC). The DC of a point is the number of points that it dominates. Example: find the top-2 dominating players in the NBA database according to turnover rate and foul rate.
[Figure: players plotted by foul rate and turnover rate, each labeled with its dominating count; the best point is the origin.]
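A dominating-count query can be sketched directly from the definition. The code below assumes that the DC of a point is the number of other points it dominates (smaller is better) and ranks points by that count; the player labels and coordinates are hypothetical.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]

def dominates(p: Point, q: Point) -> bool:
    """p dominates q when smaller is better on every dimension."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def dominating_counts(points: Dict[str, Point]) -> Dict[str, int]:
    """For each point, count how many points it dominates."""
    return {name: sum(dominates(p, q) for q in points.values())
            for name, p in points.items()}

# Hypothetical (foul rate, turnover rate) values.
players = {"a": (0.1, 0.6), "b": (0.3, 0.3), "c": (0.6, 0.4),
           "d": (0.5, 0.7), "e": (0.8, 0.8)}
counts = dominating_counts(players)
top2 = sorted(counts, key=counts.get, reverse=True)[:2]
print(counts, top2)
```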

7 Skyline Computations
Skyline algorithms fall into two categories: (1) computing from scratch (no index) and (2) relying on an index.
1. Computing from scratch (no index). Advantages: no pre-computation is needed, and no index has to be updated when the data change. Drawback: every query is computed from scratch and must scan the entire dataset at least once.

8 Skyline Computations
2. Relying on an index. Once the index is built, it can be used many times. Performing the search on an appropriate structure (B-tree, R-tree, ...) lowers the query cost. Since all of us are database people (I hope...), we prefer method 2.

9 Related works
1. Computing from scratch (no index): block nested loop, sort filter skyline, divide and conquer, bitmap, and linear elimination sort for skyline.

10 Related works
2. Relying on an index. B-tree approach: Index. R-tree approaches: nearest neighbor (NN) and branch and bound skyline (BBS). BBS has been proven to be I/O optimal: it accesses no more disk pages than any other algorithm based on R-trees.

11 Related works: Index
A point p is added to list i if p has its smallest value in dimension i.
List y: p4: 0.1, p1: 0.2, p3: 0.3, p8: 0.6
List x: p5: 0.1, p6: 0.25, p2: 0.3, p7: 0.6
1) S_sky = {p5}
2) S_sky = {p5, p4}
3) S_sky = {p5, p4, p1}
All remaining elements in List x are pruned by p1, since both coordinates of p6 are larger than those of p1. For the same reason, all remaining elements in List y are pruned by p1 too.
[Figure: points p1 to p8 in the x-y plane; the best point is the origin.]

12 Related works: BBS
H_NN = {p1, p2, N2, M2}
1) p1 is the first NN object from the best point. The dominant region of p1 is shown in grey.
2) p2 is pruned by the dominant region.
3) Expand N2.
4) ...
[Figure: points p1 to p8 with R-tree nodes N1 to N4 and M1, M2, shown both in the plane and as a tree; the best point is the origin.]

13 SUBSKY
The NBA database has more than 10 different attributes for each player. A skyline query may be interested in only some of these attributes.

14 SUBSKY
One option is to build a single R-tree and run BBS. BBS is an I/O optimal algorithm based on an R-tree index, but it is optimized for a fixed set of dimensions. Another option is to build an R-tree for every element in the power set of the dimensions, which requires a huge amount of storage space.

15 SUBSKY for uniform data
Anchor point A_c: the maximal corner of the data space, i.e. the corner having the maximum coordinate on all dimensions.
f(p) = max_{i=1..d} (1 - p[i])
f_sky(p_sky) = min_{i=1..d} (1 - p_sky[i])
No point p satisfying f(p) < f_sky(p_sky) can belong to the skyline, since every coordinate of such a p is strictly larger than the corresponding coordinate of p_sky. A similar result exists for the skyline of any subspace.
[Figure: 2D unit space with the best point at the origin and A_c at (1, 1); a skyline point p_sky with its pruning region, and two points p1 and p2, with f(p1), f(p2), and f_sky(p_sky) marked.]

16 SUBSKY for uniform data
A skyline query applies only to the relevant dimensions SUB.
f'_sky(p_sky) = min_{i in SUB} (1 - p_sky[i])
Since SUB contains only some of the dimensions, f_sky(p_sky) <= f'_sky(p_sky). No point p satisfying f(p) < f'_sky(p_sky) can belong to the skyline of SUB.

17 SUBSKY for uniform data
Assume that the skyline query is interested in dimensions x and y only.
         p1    p2    p3    p4    p5
x        0.2   0.4   0.5   0.9   0.6
y        0.2   0.4   0.3   0.1   0.8
z        0.5   0.9   0.1   0.6   0.7
f(pi)    0.8   0.6   0.9   0.9   0.4
First, sort the data by f(pi) in descending order: p3, p4, p1, p2, p5.
1) S_sky = {p3}, f'_sky(p3) = min(1 - 0.5, 1 - 0.3) = 0.5, so U = 0.5 (U is the largest f' value in S_sky).
2) S_sky = {p3, p4}, f'_sky(p4) = 0.1, so U = 0.5.
3) S_sky = {p1, p4}, f'_sky(p1) = 0.8. p3 is removed when p1 is added, since it is dominated by p1. U = 0.8.
The remaining points p2 and p5 have f values (0.6 and 0.4) below U = 0.8, so they are pruned.
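The walkthrough on this slide can be replayed in code. The sketch below is a minimal in-memory version of the single-anchor procedure as described here: an in-memory sort stands in for whatever index the actual system uses, and the function names and the `accessed` counter are introduced only for illustration.

```python
from typing import Dict, List, Sequence, Tuple

def dominates_in(p: Sequence[float], q: Sequence[float], sub: Sequence[int]) -> bool:
    """p dominates q on the dimensions in sub (smaller is better)."""
    return all(p[i] <= q[i] for i in sub) and any(p[i] < q[i] for i in sub)

def f(p: Sequence[float]) -> float:
    """1D key of a point: max over all dimensions of (1 - p[i])."""
    return max(1 - v for v in p)

def f_sub(p: Sequence[float], sub: Sequence[int]) -> float:
    """f'_sky of a skyline point: min over the queried dimensions of (1 - p[i])."""
    return min(1 - p[i] for i in sub)

def subsky_uniform(data: Dict[str, Tuple[float, ...]],
                   sub: Sequence[int]) -> Tuple[List[str], int]:
    order = sorted(data, key=lambda n: f(data[n]), reverse=True)  # descending f
    sky: List[str] = []
    u = float("-inf")          # largest f' value among current skyline points
    accessed = 0
    for name in order:
        p = data[name]
        if f(p) < u:           # this point and all later ones are pruned
            break
        accessed += 1
        if any(dominates_in(data[s], p, sub) for s in sky):
            continue
        sky = [s for s in sky if not dominates_in(p, data[s], sub)]
        sky.append(name)
        u = max(f_sub(data[s], sub) for s in sky)
    return sorted(sky), accessed

data = {"p1": (0.2, 0.2, 0.5), "p2": (0.4, 0.4, 0.9), "p3": (0.5, 0.3, 0.1),
        "p4": (0.9, 0.1, 0.6), "p5": (0.6, 0.8, 0.7)}
result = subsky_uniform(data, sub=(0, 1))   # query on x and y
print(result)
```

On the five example points the scan stops after three accesses, matching the three steps listed on the slide.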

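18 Analysis
Assume a 15D uniformly distributed dataset with cardinality 100k, and a skyline query on a subspace SUB containing any two dimensions. Consider the λ × λ square at the best corner of SUB. The probability that none of the 100k objects falls in this square is (1 - λ^2)^100k, which is just over 90% for λ = 0.001; the chance of finding an object there is the complement of that value. Whenever such an object p exists, f'_sky(p) >= 1 - λ = 0.999, and the pruned volume evaluates to 0.999^15 = 98.5%, that is, we only need to access 1.5% of the dataset.
[Figure: the λ × λ square at the best corner of the 1 × 1 subspace.]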
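The numbers in this analysis can be reproduced with a few lines of arithmetic. The sketch below, with variable names chosen here for illustration, evaluates (1 - λ^2)^n, which is the probability that none of n uniformly distributed points falls in the λ × λ square, together with the pruned volume (1 - λ)^15.

```python
n = 100_000      # cardinality
d = 15           # full-space dimensionality
lam = 0.001      # side length of the square at the best corner of SUB

p_empty = (1 - lam ** 2) ** n   # no point at all inside the square
p_hit = 1 - p_empty             # at least one point inside the square
pruned = (1 - lam) ** d         # fraction of points with f(p) < 1 - lam
accessed = 1 - pruned           # fraction that still has to be read

print(f"P(square empty)  = {p_empty:.4f}")
print(f"P(square hit)    = {p_hit:.4f}")
print(f"pruned fraction  = {pruned:.4f}")
print(f"accessed fraction= {accessed:.4f}")
```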

19 General SUBSKY
In practice, data are usually clustered. If the data are clustered, a single anchor point cannot be expected to provide enough pruning power.
[Figure: clustered data in the unit space with the best point at the origin, anchor A_c, a skyline point p_sky, and an additional anchor A1.]

20 General SUBSKY
Use different anchors for different clusters. Two questions: 1) How to find the anchors? 2) How to assign points to anchors?
[Figure: clusters s1 to s4 with anchors A_c, A1, and A2, and a skyline point p_sky; the best point is the origin.]

21 Finding the Anchors
First, let us see what a perfect anchor of a point p is. If p is assigned to such an anchor A, then p can be pruned by any skyline point dominating p. Any point on the line marked in the figure is a perfect anchor of p.
[Figure: point p with its anti-dominant region, the marked line with candidate anchors A1, A2, and A3, the anchor A_c, and the major perpendicular plane.]

22 Finding the Anchors
For each point, find its projection onto the major perpendicular plane (e.g. p'1, p'2, ...). Partition the projected points into m clusters using the k-means algorithm, and formulate an anchor for each cluster.
[Figure: points p1 and p2, their projections p'1 and p'2 on the major perpendicular plane, the anchor A_c, and a good anchor; the best point is the origin.]
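One way to picture the projection step is to slide each point along the direction of the major diagonal until it reaches the plane. The sketch below assumes that the major perpendicular plane is the plane through A_c = (1, ..., 1) perpendicular to the diagonal, that is, the set of points whose coordinates sum to d. That choice of offset and the function name are assumptions made for this illustration, and the k-means step itself is not shown.

```python
from typing import List, Sequence

def project_to_plane(p: Sequence[float]) -> List[float]:
    """Move p along the diagonal direction (1, ..., 1) onto the plane sum(x) = d."""
    d = len(p)
    shift = 1.0 - sum(p) / d      # same shift on every coordinate
    return [v + shift for v in p]

p1 = (0.2, 0.2, 0.5)
q1 = (0.3, 0.3, 0.6)              # lies on the same diagonal line as p1
proj_p1 = project_to_plane(p1)
proj_q1 = project_to_plane(q1)
print(proj_p1, proj_q1)
```

Points that differ only by a shift along the diagonal project to the same location, so clustering the projections groups points that could share a common anchor line.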

23 Finding the Anchors
How is the anchor for a cluster decided? In the figure, the blue points are assigned to cluster S. The anchor for S is obtained as follows: 1) Obtain point B, whose coordinate on each dimension equals the lowest coordinate, on that axis, of the points in S in their original space. 2) Then compute the smallest square opposite to B that covers all points in S.
[Figure: cluster S with the points A and B marked.]
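The two steps above can be sketched as follows. The code assumes that the anchor A is the corner of the covering square (a hypercube in d dimensions) diagonally opposite B; the slide only labels the points A and B, so this reading, the function name, and the sample cluster are all assumptions.

```python
from typing import List, Sequence, Tuple

def cluster_anchor(cluster: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Return (A, B): B is the per-dimension minimum of the cluster, and A is the
    opposite corner of the smallest axis-aligned hypercube rooted at B that covers it."""
    d = len(cluster[0])
    lows = [min(p[i] for p in cluster) for i in range(d)]
    highs = [max(p[i] for p in cluster) for i in range(d)]
    side = max(h - l for l, h in zip(lows, highs))   # largest extent on any axis
    anchor = [l + side for l in lows]
    return anchor, lows

# Hypothetical 2D cluster.
S = [(0.2, 0.5), (0.4, 0.6), (0.3, 0.9)]
A, B = cluster_anchor(S)
print("B =", B, "A =", A)
```

The sketch does not handle the case where the resulting corner falls outside the data space.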

24 Assigning Points to Anchors
A naïve way is to assign each point to its closest anchor point in the major perpendicular plane (projected space). However, this does not directly quantify the benefit of an assignment.

25 Assigning Points to Anchors
In order to assign a point to a good anchor, the paper introduces a new measure called the effective region (ER). Every point in the yellow region (the ER of p) has a pruning region with respect to A_c that covers p. The larger the ER volume of p, the higher the chance that p is pruned.
[Figure: point p with its ER, points p1 and p2 with their pruning regions, and the anchor A_c; the best point is the origin.]

26 Assigning Points to Anchors
[Figure: two panels showing the ER of p together with points p1 and p2, the anchor A_c, and the best point at the origin; the second panel also shows an anchor A'.]

27 Assigning Points to Anchors
The pruning volume of a point p with respect to an anchor point A_j is ∏_{i=1..d} max(0, A_j[i] - L∞(p, A_j)). Therefore, assign each point p to the anchor A_j that produces the largest pruning volume.

28 Query example
We use the same example as in the previous slides. Assume that we have two anchors: A_c, and A', which is found by k-means (m = 1). A_c = (1, 1, 1) and A' = (1, 1, 0.8). First, we calculate the ER volume of each data point with respect to A_c and A'.
         p1    p2    p3    p4    p5
x        0.2   0.4   0.5   0.9   0.6
y        0.2   0.4   0.3   0.1   0.8
z        0.5   0.9   0.1   0.6   0.7
f(pi)    0.8   0.6   0.9   0.9   0.4
ER volumes (unit: 10^-3):
         p1    p2    p3    p4    p5
A_c      8     64    1     1     216
A'       0     -     9     -     144
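The ER volumes in this example follow from the pruning-volume formula ∏ max(0, A_j[i] - L∞(p, A_j)). The sketch below applies that formula as written to every point and anchor, then assigns each point to the anchor with the larger volume. The function names are illustrative, and the sketch does not reproduce the "-" entries of the table, whose meaning the slide does not state.

```python
from math import prod  # Python 3.8+
from typing import Dict, Sequence

def linf(p: Sequence[float], a: Sequence[float]) -> float:
    """L-infinity distance between point p and anchor a."""
    return max(abs(ai - pi) for pi, ai in zip(p, a))

def er_volume(p: Sequence[float], a: Sequence[float]) -> float:
    """Pruning (ER) volume of p with respect to anchor a."""
    r = linf(p, a)
    return prod(max(0.0, ai - r) for ai in a)

DATA = {"p1": (0.2, 0.2, 0.5), "p2": (0.4, 0.4, 0.9), "p3": (0.5, 0.3, 0.1),
        "p4": (0.9, 0.1, 0.6), "p5": (0.6, 0.8, 0.7)}
ANCHORS = {"Ac": (1.0, 1.0, 1.0), "A'": (1.0, 1.0, 0.8)}

volumes: Dict[str, Dict[str, float]] = {
    name: {a: er_volume(p, anchor) for a, anchor in ANCHORS.items()}
    for name, p in DATA.items()
}
# Assign each point to the anchor that gives it the largest ER volume.
assignment = {name: max(v, key=v.get) for name, v in volumes.items()}

for name, v in volumes.items():
    print(name, {a: round(x * 1000) for a, x in v.items()}, "->", assignment[name])
```

Only p3 ends up with A'; the other four points stay with A_c.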

29 Query example
Lists sorted by f:
A_c: p4, p1, p2, p5
A': p3
(The data table and the ER-volume table are the same as on slide 28.)
1) S_sky = {p4}, f'_sky(p4) = 0.1, U = 0.1
2) S_sky = {p4, p1}, f'_sky(p1) = 0.8, U = 0.8
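The two steps of this query can be replayed with a small multi-anchor sketch. It makes several assumptions that go beyond the slide: each point's key is its L∞ distance to its own anchor, the next point is always taken from the list whose head has the largest key, and a list is abandoned as soon as its head falls below that anchor's threshold. The point-to-anchor assignment is copied from the sorted lists above; all names are illustrative.

```python
from typing import Dict, List, Sequence, Tuple

def linf(p: Sequence[float], a: Sequence[float]) -> float:
    return max(abs(ai - pi) for pi, ai in zip(p, a))

def dominates_in(p, q, sub) -> bool:
    return all(p[i] <= q[i] for i in sub) and any(p[i] < q[i] for i in sub)

def subsky_query(data, anchors, assign, sub) -> Tuple[List[str], List[str]]:
    # One list per anchor, sorted by descending key f(p) = Linf(p, anchor).
    lists = {a: sorted((n for n in data if assign[n] == a),
                       key=lambda n, a=a: linf(data[n], anchors[a]), reverse=True)
             for a in anchors}
    pos = {a: 0 for a in anchors}
    sky: List[str] = []
    trace: List[str] = []                      # points actually examined
    while True:
        best = None
        for a, anchor in anchors.items():
            if pos[a] >= len(lists[a]):
                continue
            name = lists[a][pos[a]]
            key = linf(data[name], anchor)
            # Threshold U for this anchor: largest f' among skyline points.
            u = max((min(anchor[i] - data[s][i] for i in sub) for s in sky),
                    default=float("-inf"))
            if key < u:
                pos[a] = len(lists[a])         # rest of this list is pruned
            elif best is None or key > best[0]:
                best = (key, a, name)
        if best is None:
            break
        _, a, name = best
        pos[a] += 1
        trace.append(name)
        p = data[name]
        if any(dominates_in(data[s], p, sub) for s in sky):
            continue
        sky = [s for s in sky if not dominates_in(p, data[s], sub)]
        sky.append(name)
    return sky, trace

DATA = {"p1": (0.2, 0.2, 0.5), "p2": (0.4, 0.4, 0.9), "p3": (0.5, 0.3, 0.1),
        "p4": (0.9, 0.1, 0.6), "p5": (0.6, 0.8, 0.7)}
ANCHORS = {"Ac": (1.0, 1.0, 1.0), "A'": (1.0, 1.0, 0.8)}
ASSIGN = {"p1": "Ac", "p2": "Ac", "p3": "A'", "p4": "Ac", "p5": "Ac"}

sky, trace = subsky_query(DATA, ANCHORS, ASSIGN, sub=(0, 1))
print("skyline:", sky, "examined:", trace)
```

With this assignment only p4 and p1 are examined; p2, p5, and p3 all have keys below the threshold 0.8 of their own list.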

30 Experiments
3 real datasets: NBA, Household, and Color.
               NBA    Household   Color
Dimension      13     6           9
Cardinality    17k    127k        68k
2 synthetic datasets (10D): Uniform and Clustered. The clustered dataset has 10 cluster centroids. For each centroid, it takes N/10 points whose coordinate on each axis follows a Gaussian distribution with variance 0.05 and a mean equal to the corresponding coordinate of the centroid.
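A clustered dataset of the kind described here could be generated along the following lines. The slide does not say how the centroids are placed or how values outside the unit range are handled, so the uniform centroid placement, the clamping to [0, 1], the seed, and the function name are all assumptions of this sketch.

```python
import random
from math import sqrt
from typing import List

def clustered_dataset(n: int, d: int = 10, clusters: int = 10,
                      variance: float = 0.05, seed: int = 0) -> List[List[float]]:
    """n points in d dimensions, split evenly over Gaussian clusters."""
    rng = random.Random(seed)
    sigma = sqrt(variance)                     # gauss() takes a standard deviation
    # Assumption: centroids are drawn uniformly from the unit cube.
    centroids = [[rng.random() for _ in range(d)] for _ in range(clusters)]
    points = []
    for c in centroids:
        for _ in range(n // clusters):
            # Assumption: coordinates are clamped to the data space [0, 1].
            points.append([min(1.0, max(0.0, rng.gauss(mu, sigma))) for mu in c])
    return points

sample = clustered_dataset(1000)
print(len(sample), len(sample[0]))
```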

31 Experiments

32

33 [Figure panels: 3D subspaces with cardinality 1 million; 3D subspaces with full-space dimensionality 10.]

34 Conclusion
The core of SUBSKY is a transformation that converts multi-dimensional data into 1D values. It shows better performance than an I/O optimal algorithm on the subspace skyline problem. Continuous monitoring is a good direction to investigate: how should the set of anchor points be adapted if the data are updated rapidly? The f values could be stored in another index structure to support fast updates.

35 Assigning Points to Anchors
Therefore, we have two ways to assign points to the anchors: 1. Assign each point to its closest anchor point in the major perpendicular plane (projected space). 2. Assign each point to the anchor that gives it the largest ER volume in the original space. The second approach is better than the first, because the ER volume directly quantifies the benefit of an assignment.

