1 Shooting Stars in the Sky:An Online Algorithm for Skyline Queries 作者: Donald Kossmann Frank Ramsak Steffen Rost 報告:黃士維.

Slides:



Advertisements
Similar presentations
Skyline Charuka Silva. Outline Charuka Silva, Skyline2  Motivation  Skyline Definition  Applications  Skyline Query  Similar Interesting Problem.
Advertisements

Ken C. K. Lee, Baihua Zheng, Huajing Li, Wang-Chien Lee VLDB 07 Approaching the Skyline in Z Order 1.
Indexing DNA Sequences Using q-Grams
Aggregating local image descriptors into compact codes
Representing and Querying Correlated Tuples in Probabilistic Databases
The Skyline Operator (Stephan Borzsonyi, Donald Kossmann, Konrad Stocker) Presenter: Shehnaaz Yusuf March 2005.
Maintaining Sliding Widow Skylines on Data Streams.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part C Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
Quick Review of Apr 10 material B+-Tree File Organization –similar to B+-tree index –leaf nodes store records, not pointers to records stored in an original.
Chapter 11 Indexing and Hashing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
Searching on Multi-Dimensional Data
When is “Nearest Neighbor Meaningful? Authors: Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan Uri Shaft Presentation by: Vuk Malbasa For CIS664 Prof.
Multidimensional Data
January 11, Csci 2111: Data and File Structures Week1, Lecture 1 Introduction to the Design and Specification of File Structures.
ISAC 教育學術資安資訊分享與分析中心研發專案 The Skyline Operator Stephan B¨orzs¨onyi, Donald Kossmann, Konrad Stocker EDBT
Sampling Large Databases for Association Rules ( Toivenon’s Approach, 1996) Farzaneh Mirzazadeh Fall 2007.
July 29HDMS'08 Caching Dynamic Skyline Queries D. Sacharidis 1, P. Bouros 1, T. Sellis 1,2 1 National Technical University of Athens 2 Institute for Management.
A Generic Framework for Handling Uncertain Data with Local Correlations Xiang Lian and Lei Chen Department of Computer Science and Engineering The Hong.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
Query Evaluation. SQL to ERA SQL queries are translated into extended relational algebra. Query evaluation plans are represented as trees of relational.
Nearest Neighbor. Predicting Bankruptcy Nearest Neighbor Remember all your data When someone asks a question –Find the nearest old data point –Return.
Stabbing the Sky: Efficient Skyline Computation over Sliding Windows COMP9314 Lecture Notes.
Bitmap Index Buddhika Madduma 22/03/2010 Web and Document Databases - ACS-7102.
Subscription Subsumption Evaluation for Content-Based Publish/Subscribe Systems Hojjat Jafarpour, Bijit Hore, Sharad Mehrotra, and Nalini Venkatasubramanian.
Chapter 5: Query Operations Baeza-Yates, 1999 Modern Information Retrieval.
Probabilistic Skyline Operator over sliding Windows Wan Qian HKUST DB Group.
Nearest Neighbor Retrieval Using Distance-Based Hashing Michalis Potamias and Panagiotis Papapetrou supervised by Prof George Kollios A method is proposed.
Spatial and Temporal Databases Efficiently Time Series Matching by Wavelets (ICDE 98) Kin-pong Chan and Ada Wai-chee Fu.
Multidimensional Data Many applications of databases are ``geographic'' = 2­dimensional data. Others involve large numbers of dimensions. Example: data.
FLANN Fast Library for Approximate Nearest Neighbors
Fast Subsequence Matching in Time-Series Databases Christos Faloutsos M. Ranganathan Yannis Manolopoulos Department of Computer Science and ISR University.
Catching the Best Views of Skyline: A Semantic Approach Based on Decisive Subspaces Jian Pei # Wen Jin # Martin Ester # Yufei Tao + # Simon Fraser University,
Chapter. 8: Indexing and Searching Sections: 8.1 Introduction, 8.2 Inverted Files 9/13/ Dr. Almetwally Mostafa.
Review of The Literature
CS 345: Topics in Data Warehousing Tuesday, October 19, 2004.
Lecture 7 – Data Reorganization Pattern Data Reorganization Pattern Parallel Computing CIS 410/510 Department of Computer and Information Science.
Maximal Vector Computation in Large Data Sets The 31st International Conference on Very Large Data Bases VLDB 2005 / VLDB Journal 2006, August Parke Godfrey,
The X-Tree An Index Structure for High Dimensional Data Stefan Berchtold, Daniel A Keim, Hans Peter Kriegel Institute of Computer Science Munich, Germany.
Chapter 11 Indexing & Hashing. 2 n Sophisticated database access methods n Basic concerns: access/insertion/deletion time, space overhead n Indexing 
Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.
OLAP : Blitzkreig Introduction 3 characteristics of OLAP cubes: Large data sets ~ Gb, Tb Expected Query : Aggregation Infrequent updates Star Schema :
Introduction n How to retrieval information? n A simple alternative is to search the whole text sequentially n Another option is to build data structures.
Data Mining Practical Machine Learning Tools and Techniques Chapter 4: Algorithms: The Basic Methods Section 4.7: Instance-Based Learning Rodney Nielsen.
Efficient Processing of Top-k Spatial Preference Queries
Zhuo Peng, Chaokun Wang, Lu Han, Jingchao Hao and Yiyuan Ba Proceedings of the Third International Conference on Emerging Databases, Incheon, Korea (August.
CS654: Digital Image Analysis
Similarity Searching in High Dimensions via Hashing Paper by: Aristides Gionis, Poitr Indyk, Rajeev Motwani.
March 23 & 28, Csci 2111: Data and File Structures Week 10, Lectures 1 & 2 Hashing.
Optimal Resource Allocation for Protecting System Availability against Random Cyber Attack International Conference Computer Research and Development(ICCRD),
COSC 2007 Data Structures II Chapter 13 Advanced Implementation of Tables IV.
UNIT 5.  The related activities of sorting, searching and merging are central to many computer applications.  Sorting and merging provide us with a.
CS848 Similarity Search in Multimedia Databases Dr. Gisli Hjaltason Content-based Retrieval Using Local Descriptors: Problems and Issues from Databases.
The σ-neighborhood skyline queries Chen, Yi-Chung; LEE, Chiang. The σ-neighborhood skyline queries. Information Sciences, 2015, 322: 張天彥 2015/12/05.
Cross-Product Packet Classification in GNIFS based on Non-overlapping Areas and Equivalence Class Author: Mohua Zhang, Ge Li Publisher: AISS 2012 Presenter:
Finding skyline on the fly HKU CS DB Seminar 21 July 2004 Speaker: Eric Lo.
A Framework to Predict the Quality of Answers with Non-Textual Features Jiwoon Jeon, W. Bruce Croft(University of Massachusetts-Amherst) Joon Ho Lee (Soongsil.
Automatic Categorization of Query Results Kaushik Chakrabarti, Surajit Chaudhuri, Seung-won Hwang Sushruth Puttaswamy.
Graph Indexing From managing and mining graph data.
HKU CSIS DB Seminar Skyline Queries HKU CSIS DB Seminar 9 April 2003 Speaker: Eric Lo.
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
Packet Classification Using Multi- Iteration RFC Author: Chun-Hui Tsai, Hung-Mao Chu, Pi-Chung Wang Publisher: 2013 IEEE 37th Annual Computer Software.
Memory management The main purpose of a computer system is to execute programs. These programs, together with the data they access, must be in main memory.
Data Structures and Algorithm Analysis Dr. Ken Cosh Linked Lists.
Online Skyline Queries. Agenda Motivation: Top N vs. Skyline „Classic“ Algorithms –Block-Nested Loop Algorithm –Divide & Conquer Algorithm Online Algorithm.
Selected Topics: External Sorting, Join Algorithms, …
Skyline query with R*-Tree: Branch and Bound Skyline (BBS) Algorithm
Implementation of Relational Operations
The Skyline Query in Databases Which Objects are the Most Important?
Efficient Processing of Top-k Spatial Preference Queries
Presentation transcript:

1 Shooting Stars in the Sky:An Online Algorithm for Skyline Queries 作者: Donald Kossmann Frank Ramsak Steffen Rost 報告:黃士維

2 Outline Introduction An Online Algorithm for two-dimention Skylines The NN Algorithm for d-dimention Skylines Performance Experiments Conclusion

3 Introduction Skyline Bold points Skyline Queries Online Skyline Computation Related work

4 Introduction-Skyline Skyline queries ask for a set of interesting points from a potentially large set of data points. -ex:a cheaper,nearer,better food hotel. Skyline point-also called “bold point” -a point dominate another point in at least one dimension.

5 Introduction- Bold points The bold points respresent this hotels which part of Skyline.

6 Skyline Queries Skyline queries can involve more than 2 dimensions and they could depend on current position of a user. -to find a near,good food,cheap restaurant,but the user moves in the same time. If the user moves on,the Skyline should be re-computed for this user.

7 Online Skyline Computation-1 An online algorithm would probably take much longer than a batch-oriented algorithm to produce the full Skyline,but an online algorithm would produce a subset of the Skyline very quickly.

8 Online Skyline Computation-2 Six properties of an online algorithm: 1. The first result should be return instantaneously. 2. The algorithm should produce the full Skyline. 3. The algorithm should only return pionts which are part of the Skyline. 4. The algorithm should be fair. 5. The algorithm should be well-controled. 6. The algorithm should be universal in Skyline queries and DB.

9 Related work All these algorithms work in a batch-oriented way. Tan et al. proposed two progressive Skyline algorithms. - Bitmap algorithm, the order in which Skyline points are returned depends on the clustering of the data. -B-tree algorithm, the order in which points are returned depends on the value distribution of the data. Both algorithms require that all interesting properties are materialized in the database.

10 An Online Algorithm for two- dimention Skylines Three asumptions Basic Observations An example Discussion

11 Three asumptions All values are positive real numbers. There are no duplicates in the data set. We try to fine minimal points.

12 Basic Observations D is 2-dimentional data set. n ∈ D,is a nearest neighbor of O = (0, 0) according to f. - n is in the Skyline Observation 1: n must be part of the Skyline. Observation 2: n is in the Skyline of D.

13 Basic Observations If we partition the data set, then it is sufficient to look for Skyline points using nearest neighbor search in each region separately. This observation gives rise to a divide & conquer algorithm using nearest neighbor search.

14 An example-first step Region 1:corner sign,smaller y. Region 2:circle sign,smaller x, Region 3:triangle sign,greater x,y.

15 An example-second step Divide & conquer algorithm. In this way, the algorithm continues until all Skyline points are retrieved.

16 Algorithm Description

17 Discussion-1 Six requirements: We believe that Skyline queries that involve two dimensions are the most common case. The algorithm will find all points of the Skyline. All nearest neighbors found by the NN algorithm are part of the Skyline.

18 Discussion-2 The NN algorithm produces results from the whole range of results very quickly. The NN algorithm can continue to produce Skyline points using the new distance function without any adaptions. The NN algorithm is universal.

19 The NN Algorithm for d-dimention Skylines Introduction Laisser-faire Propagate Merge Fine-grained Partitioning Hybrid Approaches

20 Introduction-1 Find NN and compute

21 Introduction-2 Two observations from Section 2.1 can be generalized to three and more dimensions. Region 1 (xn, ∞, ∞) Region 2 (∞, yn, ∞) Region 3 (∞, ∞, zn) Etc… p will be produced by the NN algorithm twice.

22 Laisser-faire If the nearest neighbor p is found in the hash table, it is a duplicate and it is not output. The big disadvantage of this algorithm is that it results in a great deal of wasted work.

23 Propagate We can do this by “propagation”. Whenever the NN algorithm finds a Skyline point, it scans the whole to-do list in order to find those regions in the to-do list that contain that point. The big advantage of this technique is that it completely avoids wasted work to find duplicates

24 Merge Assume that two regions: a = (a1, a2,..., ad ) b = (b1, b2,..., bd) are in the to-do list. We can merge these two regions into a single region:a ⊕ b = (max(a1, b1), max(a2, b2),..., max(ad, bd)) A particular situation arises, if a supersedes b, a = a ⊕ b and thus b can be simply discarded from the to-do list.

25 Fine-grained Partitioning In Figure 5, for instance, we could partition into 8 nonoverlapping regions of which 6 regions would be relevant for further processing. Implementing this approach, however, results in a sharp growth of the number of regions in the to-do list.

26 Hybrid Approaches As mentioned above, merge can be combined with both the laisser-faire and the propagate approach. Another option would be to start and propagate duplicates until the to-do list has reached a certain size.

27 Performance Experiments Experimental Environment Points in database Variant algorithms Two-dimensional Skyline Queries Quality of Results Comparing Algorithm Variants

28 Experimental Environment 167MHz processor 128MB of main memory OS:Solaris G HD Program language: C++

29 Points in database 100,000 points VS 1million points Points are generated using one of the following three value distributions: Corr:points which are good in one dimension tend to be good in otherdimensions, too. Anti:points which are good in one dimension are bad in at least one other dimension. Indep:points are generated using a uniform distribution.

30 Variant algorithms NN : R-tree [BKSS90] in order to carry out nearest neighbor search. D&C : Kung’s divide & conquer algorithm extended by m-way partitioning and “Early Sky-line”. BNL : The block-nested-loops algorithm with a self- organizing list. Bitmap : scans the database and uses bitmaps in order to detect whether a point is part of the Skyline. B-tree : used a light-weight implementation of this algorithm that does not require the use of extended B-trees.

31 Two-dimensional Skyline Queries

32 Two-dimensional Skyline Queries It can be seen that the NN algorithm scales best. The general trends were the same: (1) The NN algorithm was the clear overall winner. (2) The Bitmap algorithm was in the same ball park as the BNL and D&C algorithms. (3) The B-tree algorithm showed good performance for the correlated and independent databases, but terrible performance for the anti-correlated database.

33 Quality of Results Figure 6a shows the full Skyline for the anti- correlated database for d = 2. Figure 6b shows the first 10 points returned by the NN algorithm (without user interaction). Figure 6c shows the first 10 points returned by the B-tree algorithm. It can be seen that the NN algorithm gives a good big picture of the Skyline.

34 Quality of Results Computing the full Skyline takes a long time,regardless what algorithm is used. an online algorithm is particularly important for interactive applications in this scenario: (1) toselect relevant points from the Skyline (rather than flooding the user with results) (2) to give the firstanswers quickly.

35 Quality of Results B-tree algorithm produces the results very quickly. If the user gives preferences, the B-tree algorithm cannot adapt well. The Bitmap algorithm is not competitive in either aspect: (1) It produces Skyline points at approximately the same rate as the NN algorithm. (2) It is no match to the NN algorithm in terms of interactivity because it produces Skyline points in a somewhat random order (based on the clustering of the data).

36 Comparing Algorithm Variants Figure 7 shows the performance of four variants for a 5-dimensional Skyline with an anti- correlated database and 1 million points. measured the following variants: Laisser-faire Propagate Merge:laisser-faire and merge regions whenever a duplicate is found. Hybrid: propagate to the first 20 % entries of the to-do list.

37 Comparing Algorithm Variants

38 Conclusion Skyline queries are important for several database application. All algorithms have their particular virtues. The NN algorithm is the only algorithm that gives the user control over the process and allows the user to give preferences. The B-tree algorithm gives “extreme” points preference (i.e., points good in one dimension) and returns points which are good in many dimensions very late. The Bitmap algorithm scans the database and uses Bitmaps in order to detect whether a point is part of the Skyline.

39 Conclusion(cont.) future work: First, improve the performance of the NN algorithm by exploiting new techniques for nearest neighbor search. Second, we would like to investigate special- purpose main-memory index structures Third, we would like to investigate how the NN algorithm can be combined with other algorithms.

40 Thanks for your listening!!