Published by Annabella Pitts. Modified over 9 years ago.
1
SUBSKY: Efficient Computation of Skylines in Subspaces
Authors: Yufei Tao, Xiaokui Xiao, and Jian Pei
Conference: ICDE 2006
Presenter: Kamiru
Supervisor: Dr. Nikos Mamoulis
2
Skyline Queries
Given a set of d-dimensional points, a point p dominates another point p' if p[i] <= p'[i] for every dimension i, and p[j] < p'[j] for at least one dimension j.
A skyline query finds the points that are not dominated by any other point.
For the NBA database, a low turnover rate and a low foul rate are two important factors for a defensive player.
[Figure: players plotted by foul rate and turnover rate, both axes from 0 to 1, with the best point marked.]
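The dominance test and the skyline definition above can be sketched in a few lines. This is a minimal quadratic-time illustration, not one of the algorithms discussed later; the function names and the sample (foul rate, turnover rate) pairs are hypothetical and smaller values are assumed to be better.

```python
from typing import List, Sequence

Point = Sequence[float]

def dominates(p: Point, q: Point) -> bool:
    """p dominates q: no worse on every dimension, strictly better on at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points: List[Point]) -> List[Point]:
    """Naive skyline: keep each point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (foul rate, turnover rate) pairs, not taken from the NBA data.
players = [(0.2, 0.7), (0.4, 0.3), (0.6, 0.6), (0.8, 0.1)]
print(skyline(players))
```

Here (0.6, 0.6) is dropped because (0.4, 0.3) is better on both attributes; the other three points are incomparable with each other.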
3
Applications of Skyline Queries
Find a good hotel for me according to distance and price.
[Figure: hotels A, B, C, and D plotted by distance and price, with price labels 1000, 1500, 500, and 2000.]
Hotel D cannot be a good choice for this user, since it is both more expensive and farther away than the other hotels.
4
Alternative applications of Skyline Queries - i
Some top-k queries can be answered with skyline queries.
A top-k query retrieves the k tuples in P with the highest scores according to a scoring function g, where g must be monotonic, e.g. g(p) = p.x + p.y.
5
Alternative applications of Skyline Queries - i
Who are the top-2 NBA players according to the sum of their points and assists in the 2007-2008 season?
[Figure: players plotted by points (axis marks 0, 10, 20) and assists; each player's values sit at the top-right corner of the player photo; part of the plot is marked PRUNED.]
The results of this query (up to Jan 23, 2008) are Allen Iverson, 27 + 6.9, and LeBron James, 29.7 + 7.4.
The top-2 results must lie in the 2-skyband, that is, among the points dominated by fewer than 2 other points.
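The link between top-k and the k-skyband can be illustrated with a small sketch. Larger values are better here, unlike the foul-rate example. The two (points, assists) pairs for Iverson and James come from the slide; the three other players are hypothetical, and the function names are assumptions made for this illustration.

```python
from typing import Dict, List, Tuple

def dominates_max(p: Tuple[float, ...], q: Tuple[float, ...]) -> bool:
    """p dominates q when larger is better on every dimension."""
    return all(a >= b for a, b in zip(p, q)) and any(a > b for a, b in zip(p, q))

def k_skyband(data: Dict[str, Tuple[float, ...]], k: int) -> List[str]:
    """Names of the points dominated by fewer than k other points."""
    return [n for n, p in data.items()
            if sum(dominates_max(q, p) for q in data.values()) < k]

def top_k(data: Dict[str, Tuple[float, ...]], k: int) -> List[str]:
    """Top-k names under the monotonic score g(p) = sum of coordinates."""
    return sorted(data, key=lambda n: sum(data[n]), reverse=True)[:k]

stats = {
    "LeBron James": (29.7, 7.4),    # from the slide
    "Allen Iverson": (27.0, 6.9),   # from the slide
    "hypothetical A": (20.0, 11.0),
    "hypothetical B": (18.0, 5.0),
    "hypothetical C": (25.0, 4.0),
}
band = k_skyband(stats, 2)
best = top_k(stats, 2)
print(band, best)
```

Only the players inside the 2-skyband need to be scored; the others are each dominated by at least two players that outscore them under any monotonic g.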
6
Alternative applications of Skyline Queries - ii
Another interesting measure is the dominating count (DC).
The DC of a point is the number of points it dominates.
Example: find the top-2 dominating players in the NBA database according to turnover rate and foul rate.
[Figure: players plotted by foul rate and turnover rate, both axes from 0 to 1, each labeled with its dominating count, with the best point marked.]
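A dominating count can be computed directly from the dominance test. The sketch below is a brute-force illustration on hypothetical (foul rate, turnover rate) values, with smaller values assumed better; the names `dominating_count` and `top_k_dominating` are made up for this example.

```python
from typing import Dict, List, Tuple

def dominates(p: Tuple[float, ...], q: Tuple[float, ...]) -> bool:
    """p dominates q when smaller is better on every dimension."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def dominating_count(name: str, data: Dict[str, Tuple[float, ...]]) -> int:
    """Number of points in data that the named point dominates."""
    return sum(dominates(data[name], q) for q in data.values())

def top_k_dominating(data: Dict[str, Tuple[float, ...]], k: int) -> List[str]:
    """The k names with the highest dominating counts."""
    return sorted(data, key=lambda n: dominating_count(n, data), reverse=True)[:k]

# Hypothetical players, not taken from the NBA data.
players = {
    "a": (0.1, 0.5),
    "b": (0.3, 0.2),
    "c": (0.5, 0.6),
    "d": (0.7, 0.4),
    "e": (0.9, 0.9),
}
counts = {n: dominating_count(n, players) for n in players}
print(counts, top_k_dominating(players, 2))
```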
7
Skyline Computations
Two categories of skyline computation:
1. Computing from scratch (no index)
2. Relying on an index
1. Computing from scratch (no index)
Advantages:
- No pre-computation
- No index to update when the data change
Drawback:
- Every query is calculated from scratch and must scan the entire dataset at least once
8
Skyline Computations
2. Relying on an index
Build it once, use it many times.
Searching an appropriate structure (B-tree, R-tree, ...) lowers the query cost.
Since all of us are database people (I hope...), we prefer method 2.
9
Related works
1. Computing from scratch (no index)
- Block nested loop
- Sort filter skyline
- Divide and conquer
- Bitmap
- Linear elimination sort for skyline
10
Related works
2. Relying on an index
- B-tree approach: Index
- R-tree approaches: nearest neighbor (NN) and branch and bound skyline (BBS)
BBS has been proved to be I/O optimal. It accesses fewer disk pages than any other algorithm based on R-trees.
11
Related works: Index
[Figure: points p1 to p8 in the x-y plane, with the best point marked.]
A point p is added to list i if its smallest coordinate is on dimension i.
List x: p5: 0.1, p6: 0.25, p2: 0.3, p7: 0.6
List y: p4: 0.1, p1: 0.2, p3: 0.3, p8: 0.6
1) S_sky = {p5}
2) S_sky = {p5, p4}
3) S_sky = {p5, p4, p1}
All remaining elements in List x are pruned by p1, since both coordinates of p6 are larger than those of p1.
For the same reason, all remaining elements in List y are pruned by p1 too.
12
Related works: BBS
[Figure: points p1 to p8 grouped into R-tree nodes N1 to N4 and M1, M2, drawn both in the plane and as a tree, with the best point marked and the dominant region of p1 shaded.]
H_NN = {p1, p2, N2, M2}
1) p1 is the first NN object from the best point. The dominant region of p1 is shown in grey.
2) p2 is pruned by the dominant region.
3) Expand N2.
4) ...
13
SUBSKY
The NBA database has more than 10 different attributes for each player.
A skyline query may be interested in only some of these attributes.
14
SUBSKY
1. Build one R-tree and run BBS. BBS is an I/O-optimal algorithm based on the R-tree index, but it is optimized for a fixed set of dimensions.
2. Build an R-tree for every element in the power set of dimensions. This takes huge storage space.
15
SUBSKY for uniform data
Anchor point A_c: the maximal corner of the data space, which has the maximum coordinate on all dimensions.
f(p) = max(1 - p[i]), where i is from 1 to d. This is the L∞ distance from p to A_c.
f_sky(p_sky) = min(1 - p_sky[i]), where i is from 1 to d
No point p satisfying f(p) < f_sky(p_sky) can belong to the skyline.
A similar result exists for the skyline of any subspace.
[Figure: the unit square with axes x and y, the anchor A_c at the maximum corner, the best point, a skyline point p_sky with its pruning region, points p1 and p2, and the values f(p1), f(p2), and f_sky(p_sky).]
16
SUBSKY for uniform data
A skyline query applies only to the relevant dimensions SUB.
f'_sky(p_sky) = min(1 - p_sky[i]), where i is in SUB
Because SUB contains only some of the d dimensions, f_sky(p_sky) <= f'_sky(p_sky), so f(p) < f_sky(p_sky) implies f(p) < f'_sky(p_sky).
No point p satisfying f(p) < f'_sky(p_sky) can belong to the skyline of SUB.
17
SUBSKY for uniform data
Assume that the skyline query is interested in dimensions x and y only.
         p1    p2    p3    p4    p5
x        0.2   0.4   0.5   0.9   0.6
y        0.2   0.4   0.3   0.1   0.8
z        0.5   0.9   0.1   0.6   0.7
f(p_i)   0.8   0.6   0.9   0.9   0.4
First, sort the data by f(p_i) in descending order: p3, p4, p1, p2, p5.
1) S_sky = {p3}; f'_sky(p3) = min(1 - 0.5, 1 - 0.3) = 0.5; U = 0.5 (U is the largest f' value in S_sky).
2) S_sky = {p3, p4}; f'_sky(p4) = 0.1; U = 0.5.
3) S_sky = {p1, p4}; f'_sky(p1) = 0.8; p3 is removed when p1 is added, since it is dominated by p1; U = 0.8.
The next point, p2, has f(p2) = 0.6 < U, so p2 and every point after it are pruned.
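The walk-through on this slide can be reproduced with a short sketch. It follows the steps as the slide describes them (sort by f in descending order, maintain S_sky and U, stop once f(p) < U) and is an in-memory illustration only, not the paper's disk-based implementation; the function names are chosen for this example, and `f_sub` stands for f'_sky.

```python
from typing import Dict, List, Sequence, Tuple

def dominates(p: Sequence[float], q: Sequence[float], sub: Sequence[int]) -> bool:
    """p dominates q on the dimensions in sub (smaller is better)."""
    return all(p[i] <= q[i] for i in sub) and any(p[i] < q[i] for i in sub)

def f(p: Sequence[float]) -> float:
    """L-infinity distance from p to the maximal corner (1, ..., 1)."""
    return max(1.0 - v for v in p)

def f_sub(p: Sequence[float], sub: Sequence[int]) -> float:
    """f'_sky: the minimum of 1 - p[i] over the queried dimensions."""
    return min(1.0 - p[i] for i in sub)

def subsky_uniform(data: Dict[str, Tuple[float, ...]], sub: Sequence[int]) -> List[str]:
    order = sorted(data, key=lambda n: f(data[n]), reverse=True)
    sky: List[str] = []
    u = float("-inf")  # largest f' value among the current skyline points
    for name in order:
        p = data[name]
        if f(p) < u:
            break  # this point and all later ones are pruned
        if any(dominates(data[s], p, sub) for s in sky):
            continue
        sky = [s for s in sky if not dominates(p, data[s], sub)] + [name]
        u = max(f_sub(data[s], sub) for s in sky)
    return sky

points = {
    "p1": (0.2, 0.2, 0.5),
    "p2": (0.4, 0.4, 0.9),
    "p3": (0.5, 0.3, 0.1),
    "p4": (0.9, 0.1, 0.6),
    "p5": (0.6, 0.8, 0.7),
}
result = subsky_uniform(points, sub=(0, 1))  # dimensions x and y
print(result)
```

On the slide's data the loop visits p3, p4, and p1, then stops at p2, so p2 and p5 are never compared against the skyline.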
18
Analysis
Assume 15-dimensional, uniformly distributed objects with cardinality 100k, and a skyline query on a subspace SUB containing any two dimensions.
[Figure: the unit square with a λ × λ square marked.]
Chance of finding an object in area(λ, λ), where λ = 0.001: greater than 90%, from (1 - λ^2)^100k.
Therefore, f'_sky(p) = 1 - λ = 0.999.
The volume evaluates to 0.999^15 = 98.5%, that is, we only need to access 1.5% of the dataset.
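The quantities on this slide can be evaluated numerically. The sketch below assumes independent, uniformly distributed points, under which (1 - λ^2)^N is the probability that no point falls in the λ × λ square and its complement is the probability that at least one does. The variable names are illustrative; each quantity is printed separately so the figures quoted on the slide can be compared against them.

```python
n = 100_000   # cardinality from the slide
lam = 0.001   # lambda from the slide
d = 15        # full-space dimensionality from the slide

# Probability that none of the n points lands in the lam x lam square.
prob_square_empty = (1 - lam ** 2) ** n
# Probability that at least one point lands in it.
prob_square_nonempty = 1 - prob_square_empty
# Fraction of a uniform dataset with f(p) < 1 - lam, i.e. all d coordinates above lam.
pruned_fraction = (1 - lam) ** d

print(f"(1 - lam^2)^n     = {prob_square_empty:.4f}")
print(f"1 - (1 - lam^2)^n = {prob_square_nonempty:.4f}")
print(f"(1 - lam)^d       = {pruned_fraction:.4f}")
```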
19
General SUBSKY
In practice, data are usually clustered.
If the data are clustered, a single anchor point cannot be expected to give enough pruning power.
[Figure: the unit square with anchors A_c and A_1, a skyline point p_sky, and the best point.]
20
General SUBSKY
Anchors for different clusters
Two questions:
1) How to find the anchors?
2) How to assign points to anchors?
[Figure: the unit square with clusters s1 to s4, anchors A_c, A_1, and A_2, a skyline point p_sky, and the best point.]
21
Finding the Anchors
First, let us see what a perfect anchor of a point p is.
If p is assigned to a perfect anchor A, then p can be pruned by any skyline point dominating p.
[Figure: the unit square with A_c, the major perpendicular plane, a point p with its anti-dominant region, and candidate anchors A_1, A_2, and A_3 on a line; any point on this line is a perfect anchor of p.]
22
Finding the Anchors
For each point, find its projection onto the major perpendicular plane, e.g. p'1, p'2, ...
Partition the projected points into m clusters using the k-means algorithm, and formulate an anchor for each cluster.
[Figure: the unit square with A_c, the best point, the major perpendicular plane, points p1 and p2 with their projections p'1 and p'2, and a good anchor.]
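The projection step might look as follows. The slide does not define the major perpendicular plane, so this sketch assumes it is the plane through A_c = (1, ..., 1) perpendicular to the main diagonal, that is, the set of points whose coordinates sum to d; under that assumption, projecting means shifting every coordinate by the same amount. The function name is hypothetical, and the k-means step is left out.

```python
from typing import Sequence, Tuple

def project_to_major_plane(p: Sequence[float]) -> Tuple[float, ...]:
    """Shift p along the diagonal direction (1, ..., 1) until its coordinates sum to d.

    Assumption: the plane passes through (1, ..., 1) and is perpendicular
    to the main diagonal of the unit data space.
    """
    d = len(p)
    shift = (d - sum(p)) / d
    return tuple(v + shift for v in p)

p1 = (0.2, 0.2, 0.5)
p1_projected = project_to_major_plane(p1)
print(p1_projected)
```

Because every coordinate moves by the same amount, the differences between coordinates are preserved, so points that lie along the same diagonal line collapse to the same projection and end up in the same cluster.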
23
Finding the Anchors
How to decide an anchor for a cluster? The blue points are assigned to cluster S. How can we decide the anchor for S?
1) Obtain point B, whose coordinate on each dimension equals the lowest coordinate, on that axis, of the points in S in their original space.
2) Then, the algorithm computes the smallest square opposite to B which covers all points in S.
[Figure: cluster S enclosed in a square, with points A and B labeled.]
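The two steps above can be sketched as follows. The slide does not spell out which corner of the square becomes the anchor; this illustration assumes the anchor A is the corner diagonally opposite B, which is a reading of the figure rather than something the text states. The function name and the sample cluster are hypothetical.

```python
from typing import List, Sequence, Tuple

def anchor_for_cluster(cluster: List[Sequence[float]]) -> Tuple[float, ...]:
    """Return the assumed anchor: the corner opposite B of the smallest covering square."""
    d = len(cluster[0])
    # Step 1: B takes the lowest coordinate of the cluster on every axis.
    b = [min(p[i] for p in cluster) for i in range(d)]
    # Step 2: the smallest square (hypercube) with corner B that covers the cluster
    # has a side equal to the largest extent over all axes.
    side = max(max(p[i] for p in cluster) - b[i] for i in range(d))
    return tuple(b[i] + side for i in range(d))

# Hypothetical 2D cluster.
cluster_s = [(0.2, 0.5), (0.4, 0.6), (0.3, 0.9)]
anchor = anchor_for_cluster(cluster_s)
print(anchor)
```

With this choice every point of the cluster has coordinates no larger than the anchor's, which is what lets the anchor play the role that A_c plays in the uniform case.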
24
Assigning Points to Anchors
A naïve way is to assign each point to its closest anchor in the major perpendicular plane (the projected space).
This does not directly quantify the benefit of an assignment.
25
Assigning Points to Anchors
To assign a point to a good anchor, the paper introduces a new measure called the effective region (ER).
Every point in the yellow region (the ER of p) has a pruning region, with respect to A_c, that covers p.
The larger the ER volume of p, the more chance p has of being pruned.
[Figure: the unit square with A_c, the best point, a point p with its ER, and points p1 and p2 with their pruning regions.]
26
Assigning Points to Anchors
[Figure: two panels showing the ER of p in the unit square, one with anchor A_c only and one that also shows an anchor A', each with the best point and points p1 and p2.]
27
Assigning Points to Anchors
The pruning volume of a point p with respect to an anchor A_j is ∏ max(0, A_j[i] - L∞(p, A_j)), where i is from 1 to d.
Therefore, assign p to the anchor A_j that produces the largest pruning volume.
28
Query example
We use the same example as in the previous slides.
Assume that we have two anchors: A_c, and A', which is found by k-means (m = 1).
A_c = (1, 1, 1) and A' = (1, 1, 0.8)
First, we calculate the ER volume of each data point with respect to A_c and A'.
         p1    p2    p3    p4    p5
x        0.2   0.4   0.5   0.9   0.6
y        0.2   0.4   0.3   0.1   0.8
z        0.5   0.9   0.1   0.6   0.7
f(p_i)   0.8   0.6   0.9   0.9   0.4
ER volume (unit: 10^-3):
         p1    p2    p3    p4    p5
A_c      8     64    1     1     216
A'       0     -     9     -     144
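The A_c row of the ER table on this slide follows from the pruning-volume formula, as the sketch below shows. It applies the formula literally, so it does not reproduce the dashes in the A' row (the slide does not say what they denote); the resulting assignment still sends p3 to A' and the other points to A_c. The function names are chosen for this example.

```python
import math
from typing import Dict, Sequence, Tuple

def linf(p: Sequence[float], a: Sequence[float]) -> float:
    """L-infinity distance between point p and anchor a."""
    return max(abs(ai - pi) for pi, ai in zip(p, a))

def pruning_volume(p: Sequence[float], a: Sequence[float]) -> float:
    """Product over all dimensions of max(0, a[i] - L-infinity(p, a))."""
    r = linf(p, a)
    return math.prod(max(0.0, ai - r) for ai in a)

points: Dict[str, Tuple[float, ...]] = {
    "p1": (0.2, 0.2, 0.5),
    "p2": (0.4, 0.4, 0.9),
    "p3": (0.5, 0.3, 0.1),
    "p4": (0.9, 0.1, 0.6),
    "p5": (0.6, 0.8, 0.7),
}
anchors = {"Ac": (1.0, 1.0, 1.0), "A'": (1.0, 1.0, 0.8)}

# Assign each point to the anchor with the largest pruning volume.
assignment = {
    name: max(anchors, key=lambda k: pruning_volume(p, anchors[k]))
    for name, p in points.items()
}
for name, p in points.items():
    print(name, round(pruning_volume(p, anchors["Ac"]) * 1000, 3), assignment[name])
```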
29
Query example
Sorted lists by f:
A_c: p4, p1, p2, p5
A': p3
1) S_sky = {p4}; f'_sky(p4) = 0.1; U = 0.1
2) S_sky = {p4, p1}; f'_sky(p1) = 0.8; U = 0.8
30
Experiments
3 real datasets: NBA, Household, and Color
              NBA    Household   Color
Dimension     13     6           9
Cardinality   17k    127k        68k
2 synthetic datasets (10D): Uniform and Clustered
Clustered: 10 cluster centroids. For each centroid, N/10 points are generated whose coordinate on each axis follows a Gaussian distribution with variance 0.05 and a mean equal to the corresponding coordinate of the centroid.
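A clustered dataset of this kind could be generated as follows. The slide gives the number of centroids, the per-axis variance, and the N/10 split; everything else in this sketch is an assumption: the centroids are drawn uniformly from the unit space, coordinates are not clamped to [0, 1], and the function name and seed are arbitrary.

```python
import math
import random
from typing import List, Tuple

def clustered_dataset(n: int, d: int = 10, clusters: int = 10,
                      variance: float = 0.05, seed: int = 0) -> List[Tuple[float, ...]]:
    """Generate n points around `clusters` centroids with per-axis Gaussian noise."""
    rng = random.Random(seed)
    sigma = math.sqrt(variance)  # random.gauss takes a standard deviation
    # Assumption: centroid positions are uniform in the unit space.
    centroids = [[rng.random() for _ in range(d)] for _ in range(clusters)]
    points = []
    for centroid in centroids:
        for _ in range(n // clusters):
            points.append(tuple(rng.gauss(mu, sigma) for mu in centroid))
    return points

sample = clustered_dataset(1000)
print(len(sample), len(sample[0]))
```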
31
Experiments
33
[Figure panels: "3D subspaces, 1 million cardinality" and "3D subspaces, full-space dimensionality is 10".]
34
Conclusion
The core of SUBSKY is a transformation that converts multi-dimensional data into 1D values.
It shows better performance than an I/O-optimal algorithm on the subspace skyline problem.
Some continuous-monitoring cases are worth investigating:
- How to adapt the set of anchor points if the data are updated rapidly
- The f values could be stored in another index structure to support fast updates
35
Assigning Points to Anchors
Therefore, we have two ways to assign points to the anchors:
1. Assign each point to its closest anchor in the major perpendicular plane (the projected space).
2. Assign each point to the anchor that gives it the largest ER volume in the original space.
The second approach is better than the first, because the ER volume directly quantifies the benefit of an assignment.