Finding skyline on the fly. HKU CS DB Seminar, 21 July 2004. Speaker: Eric Lo
Skyline: a new operator (like “ORDER BY”) in database systems. The skyline is the set of data points that are not dominated by any other data point.
Example: find some good places for us to hold the next DB Seminar. “Good” means closer to HKU (Min) and larger area (Max). Return the homes that are not dominated, i.e. homes for which no other home is better in ALL DIMENSIONS.
  Dataset (Table Homes):
  Home      Distance from HKU   Area (m²)
  K.K Loo   1 km                10
  Ben       9 km                100
  Ivy       5 km                2
  Nikos     8 km                250
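The dominance test behind this example can be sketched in a few lines of Python. This is a minimal, centralized illustration over the Homes table, not the distributed algorithm discussed later; the names `Home`, `dominates` and `skyline` are assumptions introduced here.

```python
from typing import NamedTuple

class Home(NamedTuple):
    name: str
    distance_km: float  # smaller is better (Min)
    area_m2: float      # larger is better (Max)

# The Homes table from the slide
HOMES = [
    Home("K.K Loo", 1, 10),
    Home("Ben", 9, 100),
    Home("Ivy", 5, 2),
    Home("Nikos", 8, 250),
]

def dominates(a: Home, b: Home) -> bool:
    """a dominates b: no worse in every dimension and strictly better in at least one."""
    no_worse = a.distance_km <= b.distance_km and a.area_m2 >= b.area_m2
    better = a.distance_km < b.distance_km or a.area_m2 > b.area_m2
    return no_worse and better

def skyline(homes):
    """Keep every home that no other home dominates (quadratic scan)."""
    return [h for h in homes if not any(dominates(o, h) for o in homes)]

print([h.name for h in skyline(HOMES)])  # ['K.K Loo', 'Nikos']
```

Ben is dominated by Nikos (closer and larger), and Ivy is dominated by K.K Loo, so only K.K Loo and Nikos remain.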
Outline
  Introduction to skyline queries
  Non-progressive skylining on the Web: Basic Distributed Skyline Algorithm (BDS)
  Progressive skylining on the Web
  Experimental results
  Conclusion and future directions
Skylining on the Web: one distributed site holds one attribute. Attribute “Distance from HKU” is stored at HKU; attribute “Area (m²)” is stored at Purdue. Both sites are reached over the Internet.
  HKU:
  Home      Distance from HKU
  K.K Loo   1 km
  Ben       9 km
  Ivy       5 km
  Nikos     8 km
  Purdue:
  Home      Area (m²)
  K.K Loo   10
  Ben       100
  Ivy       2
  Nikos     250
Accessing interfaces: Web-accessible sites offer two interfaces.
  1. Sorted Access (SA): HKU.getNext() returns the rank-1 data tuple, “K.K Loo”; the next HKU.getNext() returns the 2nd, “Ivy”; the next returns the 3rd, “Nikos”; ...
  2. Random Access (RA): Purdue.getScore(“K.K Loo”) returns 10 m²; HKU.getScore(“Nikos”) returns 8 km.
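The two interfaces can be mimicked locally as follows. This is only a sketch: the class `Source` and its methods `get_next` and `get_score` are hypothetical Python stand-ins for the getNext() and getScore() calls named on the slide, and a real site would be reached over the network rather than held in a dictionary.

```python
class Source:
    """Hypothetical local stand-in for one web site that holds a single attribute."""

    def __init__(self, scores, minimise=True):
        self._scores = dict(scores)
        # Best value first: ascending for Min attributes, descending for Max ones
        self._ranked = sorted(self._scores, key=self._scores.get, reverse=not minimise)
        self._cursor = 0

    def get_next(self):
        """Sorted access: the next (object, score) in rank order, or None when exhausted."""
        if self._cursor >= len(self._ranked):
            return None
        name = self._ranked[self._cursor]
        self._cursor += 1
        return name, self._scores[name]

    def get_score(self, name):
        """Random access: the score of a named object."""
        return self._scores[name]

hku = Source({"K.K Loo": 1, "Ben": 9, "Ivy": 5, "Nikos": 8}, minimise=True)
purdue = Source({"K.K Loo": 10, "Ben": 100, "Ivy": 2, "Nikos": 250}, minimise=False)

print(hku.get_next(), hku.get_next(), hku.get_next())
# ('K.K Loo', 1) ('Ivy', 5) ('Nikos', 8)
print(purdue.get_score("K.K Loo"), hku.get_score("Nikos"))  # 10 8
```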
Basic distributed skyline algorithm (BDS, EDBT 04), Phase 1: find all possible skyline objects. Perform sorted access on each source, one by one: S1.getNext(), S2.getNext(), S3.getNext(), S1.getNext(), S2.getNext(), ... Stop as soon as there is an object whose attribute values are all known.
Phase 1: f is the terminating object.
Phase 1 (15 sorted accesses).
Implication: f is the terminating object, so objects that have not appeared in any source so far must be dominated by f.
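Phase 1 can be sketched as a round-robin loop over the sources. The sketch below assumes that "all attribute values known" means the object has been returned by sorted access from every source, and that smaller scores are better. The function name `bds_phase1` and the three small ranked lists are illustrative assumptions, not the data behind the slide's figure.

```python
def bds_phase1(sources):
    """Round-robin sorted access until one object has been seen in every source.

    sources: list of ranked lists [(obj, score), ...], best score first.
    Returns (terminating object or None, seen, number of sorted accesses),
    where seen maps obj -> {source index: score}.
    """
    seen = {}
    accesses = 0
    depth = 0
    while any(depth < len(ranked) for ranked in sources):
        for i, ranked in enumerate(sources):
            if depth >= len(ranked):
                continue  # this source is exhausted
            obj, score = ranked[depth]
            accesses += 1
            seen.setdefault(obj, {})[i] = score
            if len(seen[obj]) == len(sources):
                return obj, seen, accesses
        depth += 1
    return None, seen, accesses

# Illustrative data only (smaller score = better rank)
S1 = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("e", 5)]
S2 = [("c", 1), ("d", 2), ("a", 3), ("b", 4), ("e", 5)]
S3 = [("d", 1), ("b", 2), ("c", 3), ("a", 4), ("e", 5)]

terminating, seen, accesses = bds_phase1([S1, S2, S3])
print(terminating, accesses, sorted(seen))  # c 9 ['a', 'b', 'c', 'd']
```

In this toy run, object e is never reached: it ranks below c in every source, so c dominates it, which is the implication stated above.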
Phase 2: find the skyline from the candidates of Phase 1. During the sequential scan of the sources, data structures K1, K2, K3, ..., Kn are created, where n is the number of dimensions. If source i's getNext() returns a data object d:
  1. create an entry for d in Ki
  2. update the lower_bound of source i
Phase 2: find the skyline from the candidates in Ki. A lemma shows that “objects can only be dominated by objects in the same set Ki”.
Motivations: BDS returns skyline results in a batch. In practice, it would be useful to return skyline results progressively, so that users can adjust their decisions right away. Consider the “next DB seminar” skyline example: the result for minimize “Distance from HKU”, maximize “Area” is returned first. But getting from HKU to Nikos’s home takes a $50 bus ride! The user can then add a “travel-expense” attribute to the skyline query.
Progressive Distributed Skylining (PDS). Goal: evaluate skyline queries progressively with minimal overhead. Overhead: network/data source accesses and computational time.
Enable progressiveness: to decide whether a data point belongs to the final skyline, we rely on the following lemma (assuming the data values are distinct). If a data source Di returns data objects in a strictly monotonic order, an object O retrieved from Di can only be dominated by objects that were retrieved from Di before O.
If an object O is retrieved from a data source by sorted access, we only need to test whether O is dominated by any object that appears before O in the same source. Two uses:
  1. We do not need to consider objects that appear in other data sources.
  2. After the test, we can output O as a skyline object immediately: if O passes, it must be in the skyline, and we do not need to worry that objects appearing later would dominate it.
An R-tree approach: build an R-tree Ri for each attribute/data source i involved in the skyline query. For each object O retrieved from source i, check whether any object in Ri dominates O. If no such object exists, O is a skyline object (output it immediately). If some object in Ri dominates O, O is not a skyline object (discard it immediately).
D3.getNext(), 1st time: SA on D3 returns e. e is a skyline object (no object is better than e on D3). Its projection e(7,4) onto (D1, D2) is inserted into R-tree R3. [Figure, axes D1 and D2: e(7,4)]
D3.getNext(), 2nd time: SA on D3 returns c. Construct a query Q(origin, c) on R3. Q returns no answer, so c is a skyline object; insert c into R3. [Figure: e(7,4), c(2,5)]
D3.getNext(), 3rd time: SA on D3 returns j. Construct a query Q(origin, j) on R3. Q returns c as an answer, so j is dominated by c; discard j. [Figure: e(7,4), c(2,5), j(6,10)]
D3.getNext(), 4th time: SA on D3 returns f. Construct a query Q(origin, f) on R3. Q returns no answer, so f is a skyline object. Delete e after inserting f, to keep the R-tree compact and efficient. [Figure: e(7,4), c(2,5)]
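The first three retrievals of this walk-through can be replayed with a small sketch. It swaps the R-tree for a plain dictionary scanned linearly, assumes smaller values are better on D1 and D2, and assumes each object's (D1, D2) values are already known when D3 returns it. The names `window_query` and `process` are hypothetical. The fourth retrieval (f) is left out because the slides do not give f's coordinates, and the deletion of e is not reproduced.

```python
def window_query(stored, corner):
    """Return the names of stored points inside the box [origin, corner].

    A stored point inside the box is no worse than `corner` on every axis,
    so it dominates the newly retrieved object.
    """
    return sorted(
        name for name, point in stored.items()
        if all(x <= c for x, c in zip(point, corner))
    )

def process(stream):
    """Yield (name, verdict, dominators) for objects in the order D3 returns them."""
    r3 = {}  # stand-in for R-tree R3: name -> (D1, D2) projection
    for name, point in stream:
        hits = window_query(r3, point)
        if hits:
            yield name, "discarded", hits
        else:
            r3[name] = point  # keep only skyline objects
            yield name, "skyline", []

# Coordinates taken from the slides; order is the sorted-access order on D3
stream = [("e", (7, 4)), ("c", (2, 5)), ("j", (6, 10))]
for step in process(stream):
    print(step)
# ('e', 'skyline', [])
# ('c', 'skyline', [])
# ('j', 'discarded', ['c'])
```

Only skyline objects are stored, because anything dominated by a discarded object is also dominated by whatever dominated that object.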
The R-tree approach: the R-tree is very small, since it stores only the skyline objects with the highest pruning power. The containment query operation is very efficient.
A linear regression based heuristic: the R-tree approach enables progressiveness with better efficiency. We use a linear regression based heuristic to minimize the number of source accesses during the evaluation process.
A rank based approach:
  1. We use linear regression to estimate the ranks of objects as the evaluation proceeds.
  2. We assume the object with the lowest rank is the real terminating object and probe the sources accordingly (rather than round-robin).
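The slides do not spell out how the regression is set up, so the following is only one plausible reading, offered as a sketch. It assumes that, within one source, the (rank, score) pairs already seen by sorted access are fitted with a least-squares line, and that the line is then inverted to guess the rank of an object whose score in that source is known but which sorted access has not yet reached. The function names and the sample numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def estimate_rank(observed, score):
    """Guess the rank at which `score` would appear in a source.

    observed: [(rank, score), ...] already returned by sorted access on that source
    (assumed to contain at least two pairs with different scores).
    """
    ranks = [r for r, _ in observed]
    scores = [s for _, s in observed]
    slope, intercept = fit_line(ranks, scores)
    return (score - intercept) / slope

# Hypothetical source: the first four sorted accesses returned these scores
observed = [(1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0)]
print(estimate_rank(observed, 75.0))  # 7.5
```

Estimates of this kind, taken per source, could then be used to pick the object believed to be closest to terminating and to decide which source to probe next, instead of visiting the sources in a fixed round-robin order.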
Extensions: evaluation of top-K skyline queries; a progress indicator (based on the estimated ranks).
Experimental results: number of source accesses.
Experimental results: number of source accesses (random distribution; denormalized domain).
Experimental results: progressive behavior.
Experimental results: progress indicator.
Conclusion and future directions: skyline queries on the Web; return skyline points on the fly. Future work:
  Improve the usability of PDS by allowing users to trade off progressiveness against efficiency.
  Compute the skyline from real-time stream data.
  Handle the case where only one data source supports sorted access and the rest support random access only.
References
  S. Börzsönyi, D. Kossmann, K. Stocker. The Skyline Operator. In ICDE.
  D. Kossmann, F. Ramsak, S. Rost. Shooting Stars in the Sky: An Online Algorithm for Skyline Queries. In VLDB.
  W.-T. Balke, U. Güntzer, J. X. Zheng. Efficient Distributed Skylining for Web Information Systems. In EDBT 2004.