Probabilistic Skyline Operator over Sliding Windows. Wan Qian, HKUST DB Group.

Introduction

Skyline. Find a good hotel that is cheap and near the beach. Hotels (distance to beach, price): 900 m, 600 kr; 20 m, 1100 kr; 700 m, 600 kr; 60 m, 1200 kr; 80 m, 500 kr; 20 m, 400 kr.

Skyline [Figure: hotels plotted by price (€) against distance to beach (km).]
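The skyline idea can be illustrated with a small brute-force sketch. The hotel values below are hypothetical (distance in km, price in €), smaller is assumed to be better in both dimensions, and the function names are illustrative only.

```python
from typing import List, Tuple

Point = Tuple[float, ...]

def dominates(a: Point, b: Point) -> bool:
    """a dominates b if a is no worse in every dimension and strictly better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def skyline(points: List[Point]) -> List[Point]:
    """Brute-force skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(o, p) for o in points)]

# Hypothetical hotels: (distance to beach in km, price in EUR).
hotels = [(1.0, 50.0), (0.2, 120.0), (0.5, 80.0), (0.8, 90.0), (1.5, 40.0)]
result = skyline(hotels)
```

Here (0.8, 90.0) is dominated by (0.5, 80.0), which is both closer and cheaper, so it is the only hotel left out of the skyline.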

On-line Shopping System. Each product is evaluated in various aspects. In addition, each seller is associated with a "trustability" value. Customers may want to continuously monitor on-line advertisements by selecting the candidates for the best deal, i.e. the skyline points. Note that the data is uncertain.

Problem Statement. In this paper, we study the problem of efficiently retrieving skyline elements from the most recent N elements of a sequence of uncertain elements in a d-dimensional numeric space, with skyline probabilities not smaller than a given threshold q (0 < q ≤ 1).

Dominating Probabilities. $P_{sky}(a) = P(a) \times P_{old}(a) \times P_{new}(a)$. Example: $P_{new}(a_4) = 1 - P(a_5) = 0.9$; $P_{old}(a_4) = (1 - P(a_2))(1 - P(a_3))(1 - P(a_1))$; $P_{sky}(a_4) = P(a_4) \times P_{new}(a_4) \times P_{old}(a_4) = 0.034$.
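A minimal sketch of how these three factors could be computed, assuming that P_new(a) is the probability that no newer element dominating a occurs and P_old(a) is the same for older elements, with elements occurring independently. The `Element` class and the window contents are hypothetical and are not the values of the slide's example.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Element:
    point: Tuple[float, ...]
    prob: float  # occurrence probability P(a)

def dominates(a, b) -> bool:
    """Smaller is better: a is no worse everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def survival(target: Element, others: List[Element]) -> float:
    """Probability that none of the `others` dominating `target` actually occurs."""
    prod = 1.0
    for o in others:
        if dominates(o.point, target.point):
            prod *= 1.0 - o.prob
    return prod

def p_new(window: List[Element], i: int) -> float:
    return survival(window[i], window[i + 1:])   # elements that arrived later

def p_old(window: List[Element], i: int) -> float:
    return survival(window[i], window[:i])       # elements that arrived earlier

def p_sky(window: List[Element], i: int) -> float:
    return window[i].prob * p_old(window, i) * p_new(window, i)

# Hypothetical window, ordered oldest to newest.
window = [Element((1, 1), 0.5), Element((3, 3), 0.8), Element((2, 2), 0.1)]
```

For the middle element, the only newer dominator has probability 0.1, so P_new is 0.9; the only older dominator has probability 0.5, so P_old is 0.5 and P_sky is 0.8 × 0.5 × 0.9 = 0.36.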

Algorithm

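Framework. Given a probability threshold q and a sliding window of length N, let $a_{old}$ be the oldest element in the current window. When a new element $a_{new}$ arrives, the algorithm handles the expiration of $a_{old}$ and the insertion of $a_{new}$, computing the q-skyline incrementally.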

Pruning. Let $DS_N$ be the most recent N elements. Use $S_{N,q}$ instead of the whole window $DS_N$, where $S_{N,q} = \{a \mid a \in DS_N \wedge P_{new}(a) \ge q\}$. $S_{N,q}$ contains all skyline points with $P_{sky} \ge q$. Continuously maintaining $S_{N,q}$ leads to neither false positives nor false negatives. Minimality. The size of $S_{N,q}$ is poly-logarithmic with respect to N. $SKY_{N,q}$ is the solution set; that is, for each element a in $SKY_{N,q}$, $P_{sky}(a) \ge q$.
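The containment of the solution set in the candidate set follows from P_sky(a) ≤ P_new(a). The brute-force sketch below checks this on a hypothetical window; here P_old is taken over the whole window, the tuple layout and function names are assumptions made for illustration, and nothing in it is the paper's incremental algorithm.

```python
from typing import List, Sequence, Tuple

Elem = Tuple[Tuple[float, ...], float]  # (point, occurrence probability P(a))

def dominates(a, b) -> bool:
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def survival(window: List[Elem], i: int, indices: Sequence[int]) -> float:
    """Probability that no element at `indices` dominating window[i] occurs."""
    prod = 1.0
    for j in indices:
        if dominates(window[j][0], window[i][0]):
            prod *= 1.0 - window[j][1]
    return prod

def candidates_and_skyline(window: List[Elem], q: float):
    """Return (indices with P_new >= q, indices with P_sky >= q); window is oldest first."""
    n = len(window)
    cand, sky = [], []
    for i in range(n):
        pn = survival(window, i, range(i + 1, n))
        if pn < q:
            continue  # P_sky <= P_new < q, so this element can be pruned
        cand.append(i)
        po = survival(window, i, range(i))
        if window[i][1] * po * pn >= q:
            sky.append(i)
    return cand, sky

# Hypothetical window, oldest first.
window = [((1, 1), 0.5), ((3, 3), 0.8), ((2, 2), 0.1), ((4, 4), 0.9), ((3.5, 3.5), 0.6)]
cand, sky = candidates_and_skyline(window, q=0.45)
```

With q = 0.45, the element (4, 4) is pruned because a newer dominator with probability 0.6 pushes its P_new down to 0.4, and only the first element reaches the threshold on P_sky.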

Inserting. 0) Maintain in-memory R-trees R1 and R2 on $SKY_{N,q}$ and $(S_{N,q} - SKY_{N,q})$, respectively. 1) Update the $P_{new}$ values of the elements dominated by $a_{new}$ by multiplying them by $(1 - P(a_{new}))$. 2) Remove the elements a with updated $P_{new}(a) < q$ from R1 and R2.

Inserting. 3) Update the $P_{sky}$ values (via $P_{old}$ and $P_{new}$) of the elements dominated by some of the removed elements. 4) Move the elements a in R1 with $P_{sky}(a) < q$ to R2. 5) Calculate $P_{sky}(a_{new})$ and insert $a_{new}$ into R1 or R2 accordingly; since $P_{new}(a_{new}) = 1$, $a_{new}$ always belongs to $S_{N,q}$.

Expiration. Once an element $a_{old}$ expires: 1) Check whether it is in $S_{N,q}$. If it is, increase the $P_{old}$ values of the elements dominated by $a_{old}$. 2) Then determine the elements that need to be moved from R2 to R1.
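The insertion and expiration steps above can be sketched with a plain list standing in for the two R-trees. This is a simplification under stated assumptions, not the presented algorithm: the class `SlidingQSkyline` and its methods are hypothetical, P_new is maintained incrementally as in steps 1 and 2, and P_old is recomputed on demand over the candidate set instead of being updated in place, so the moves between R1 and R2 become a filter in `skyline()`.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, ...]

def dominates(a: Point, b: Point) -> bool:
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

@dataclass
class Entry:
    seq: int             # arrival position in the stream
    point: Point
    prob: float          # occurrence probability P(a)
    p_new: float = 1.0   # maintained incrementally on each arrival

class SlidingQSkyline:
    """List-based stand-in for R1 and R2 (hypothetical, for illustration only)."""

    def __init__(self, n: int, q: float) -> None:
        self.n, self.q = n, q
        self.next_seq = 0
        self.candidates: List[Entry] = []  # S_{N,q}, oldest first

    def p_old(self, e: Entry) -> float:
        # Recomputed over the candidate set; the slides update it incrementally.
        prod = 1.0
        for o in self.candidates:
            if o.seq < e.seq and dominates(o.point, e.point):
                prod *= 1.0 - o.prob
        return prod

    def push(self, point: Point, prob: float) -> None:
        seq = self.next_seq
        self.next_seq += 1
        # Expiration: drop the element leaving the window, if it is still a candidate.
        self.candidates = [e for e in self.candidates if e.seq > seq - self.n]
        # Insertion step 1: scale P_new of every candidate dominated by the arrival.
        for e in self.candidates:
            if dominates(point, e.point):
                e.p_new *= 1.0 - prob
        # Insertion step 2: discard candidates whose P_new fell below q.
        self.candidates = [e for e in self.candidates if e.p_new >= self.q]
        # Insertion step 5: the arrival has P_new = 1 >= q, so it is always a candidate.
        self.candidates.append(Entry(seq, point, prob))

    def skyline(self) -> List[int]:
        # Plays the role of R1: candidates whose P_sky reaches q.
        return [e.seq for e in self.candidates
                if e.prob * self.p_old(e) * e.p_new >= self.q]

w = SlidingQSkyline(n=5, q=0.45)
for pt, pr in [((1, 1), 0.5), ((3, 3), 0.8), ((2, 2), 0.1), ((4, 4), 0.9), ((3.5, 3.5), 0.6)]:
    w.push(pt, pr)
```

After these five arrivals only element 0 is in the q-skyline. Pushing a sixth element expires element 0, which raises P_old for the elements it dominated, and element 1 then enters the q-skyline, mirroring the move from R2 to R1.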

Aggregate R-Tree

In-memory R-trees R1 and R2 are kept on $SKY_{N,q}$ and $(S_{N,q} - SKY_{N,q})$. Example: a new element $a_{14}$ arrives and $a_1$ expires. The task is to find the elements dominated by $a_{14}$ and then update R1 and R2.

Aggregate R-Tree. If the maximum value of $P_{new}$ in an entry, multiplied by $(1 - P(a_{14}))$, is smaller than q, the entry (i.e. all elements it contains) is removed from $S_{N,q}$. Conversely, if the minimum value of $P_{new}$ multiplied by $(1 - P(a_{14}))$ is not smaller than q, the entry (i.e. all elements it contains) remains in $S_{N,q}$.
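The two tests on an entry's aggregate values can be sketched as a three-way decision. This assumes the entry is entirely dominated by the arriving element, so that every element in it has its P_new scaled by the same factor; the `Decision` enum and `classify_entry` function are hypothetical names.

```python
from enum import Enum

class Decision(Enum):
    REMOVE_ALL = "remove every element in the entry from the candidate set"
    KEEP_ALL = "every element in the entry stays in the candidate set"
    DESCEND = "inspect the entry's children individually"

def classify_entry(min_p_new: float, max_p_new: float,
                   p_arrival: float, q: float) -> Decision:
    """Decide the fate of an entry fully dominated by an arrival with probability p_arrival."""
    factor = 1.0 - p_arrival
    if max_p_new * factor < q:
        return Decision.REMOVE_ALL   # even the best element drops below q
    if min_p_new * factor >= q:
        return Decision.KEEP_ALL     # even the worst element stays at or above q
    return Decision.DESCEND          # mixed outcome: the aggregates cannot decide
```

For example, with q = 0.4 and an arrival of probability 0.5, an entry with P_new in [0.5, 0.6] is removed as a whole, one with P_new in [0.9, 1.0] is kept as a whole, and one with P_new in [0.5, 1.0] has to be opened.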

Aggregate R-Tree. Similarly, each entry keeps the minimum and maximum values of $P_{sky}$ over the elements it contains, so that deciding whether those elements are in $SKY_{N,q}$ can terminate early.

Analysis. Space complexity: the algorithm uses aggregate R-trees to keep each element of $S_{N,q}$, and each element is kept only once; thus the space complexity is $O(|S_{N,q}|)$. Time complexity: no sensible time complexity analysis is provided.

Extension. Multiple thresholds: run multiple queries and intersect the results. Ad-hoc queries: "find the skyline with skyline probability at least q". Assume that we currently maintain k skylines as discussed above and that $q \ge q_k$. First find an $R_i$ such that $q_i \le q < q_{i-1}$; clearly the elements in $\{R_j : j < i-1\}$ are contained in the solution. Then search $R_i$ to get all of its elements with skyline probability $\ge q$.
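One possible reading of the ad-hoc query can be sketched as follows. It assumes that the thresholds are sorted in descending order (q_1 > ... > q_k) and that each R_j is a disjoint band holding the elements whose skyline probability lies in [q_j, q_{j-1}); the slides do not spell this layout out, so both the layout and the `adhoc_query` function are assumptions. Under this reading, every band whose lower bound exceeds q qualifies in full and only the band containing q needs a search.

```python
from typing import List, Tuple

Band = List[Tuple[str, float]]  # (element id, skyline probability)

def adhoc_query(thresholds: List[float], bands: List[Band], q: float) -> List[str]:
    """Return ids of elements with skyline probability >= q, for q >= the smallest threshold."""
    if q < thresholds[-1]:
        raise ValueError("q is below the smallest maintained threshold q_k")
    result: List[str] = []
    for q_j, band in zip(thresholds, bands):
        if q_j > q:
            # Whole band qualifies: every element has probability >= q_j > q.
            result.extend(name for name, _ in band)
        else:
            # This is R_i with q_i <= q: search it, then stop, since later bands are below q_i.
            result.extend(name for name, p in band if p >= q)
            break
    return result

# Hypothetical thresholds and bands.
thresholds = [0.8, 0.5, 0.3]
bands = [[("a", 0.9)], [("b", 0.7), ("c", 0.55)], [("d", 0.4), ("e", 0.3)]]
answer = adhoc_query(thresholds, bands, 0.6)
```

With q = 0.6, the first band is returned whole and the second is filtered, giving elements a and b.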

Experiment

System Parameters. Intel Xeon 2.4 GHz dual CPU with 4 GB memory, running Debian Linux. The real dataset is extracted from the stock statistics of the NYSE (New York Stock Exchange). The synthetic datasets are anti-correlated.

Algorithms. SSKY: the techniques presented in Section IV to continuously compute the q-skyline (i.e., the skyline with probability not less than a given q) against a sliding window. The naïve approach to the basic problem is about 20 times slower than SSKY, so it has been ruled out.

Time Efficiency. SSKY is very efficient, especially when the dimensionality is low. For 2-dimensional datasets, SSKY can support a workload where elements arrive at a rate of more than 38K per second, even for the stock and anti-correlated datasets. For 5-dimensional anti-correlated data, the algorithm can still support up to 728 elements per second, which is a medium speed for data streams.

Q&A Thanks