
Probabilistic Skylines on Uncertain Data (VLDB 2007), Jian Pei et al. Supervisor: Dr. Benjamin Kao. Presenter: For. Date: 22 Feb 2008. (Open question: where to introduce the possible-world concept.)

Outline
- Motivation
- Traditional and Probabilistic Skyline
- Problem Definition
- Computation Problem and Algorithms (Top-down and Bottom-up)
- Experimental Results

Motivation: skyline analysis on NBA players' performance (#rebounds, #assists). Uncertainty: each player has multiple records. (Speaking notes: first read the topic and then the subtopic so the audience knows what you are doing; instance e dominates b, d, c; define "skyline" by explaining the graph, where the larger the value, the better.)

Motivation Skyline Analysis on NBA players with multiple records

Easy Approach – Averaging. Arbor (x) is better in assists than Eddy on average, but Eddy (point b) dominates every game of Arbor (x). A single game of Bob (point a) biases his aggregate value. (Speaking notes: it is not so fair to say Eddy is worse in assists than Arbor; it is not so fair for Bob to be severely affected by only one game. To do: need a new graph.)

Motivation: motivating result using the probabilistic skyline. Olajuwon and Kobe Bryant are missing from the aggregate skyline but present in the probabilistic skyline; their performance varies a lot over games. Details in the experiment analysis. (To do: pictures of them.)

Traditional and Probabilistic Skyline: the semantics of dominance between objects.
- Certain model: an object dominates another object with probability 1.
- Uncertain model: an object dominates another object with probability P.
(Speaking notes: assume that the smaller the value, the better; consider object d on the uncertain-data side. Figure: certain data vs. uncertain data. To do: a Flash animation showing the calculation would be better.)
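To make the certain-model case concrete, here is a minimal sketch (my own illustration, not from the paper) of the usual dominance test, assuming smaller values are better in every dimension:

```python
def dominates(a, b):
    """True if point a dominates point b: a is <= b in every dimension
    and strictly smaller in at least one (smaller values are better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

# Example: (1, 2) dominates (3, 4), but not the other way around.
assert dominates((1, 2), (3, 4)) and not dominates((3, 4), (1, 2))
```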

Probabilistic Skyline: calculating the probability that object A dominates object C. For easier illustration, the discrete case is used. (Speaking note: explain the symbols. To do: a Flash animation demonstrating the calculation.) Pr[A ≺ C] = (1/4)·(1/3)·(4 + 4 + 0) = 2/3.
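A small sketch of this dominance-probability computation (my own illustration, reusing the dominates helper above and assuming each object's instances are equally likely; the instance coordinates are hypothetical, chosen only to give 8 dominating pairs out of 4 × 3 = 12, i.e. Pr[A ≺ C] = 2/3):

```python
from itertools import product

def dominance_probability(U, V):
    """Probability that uncertain object U dominates uncertain object V:
    the fraction of instance pairs (u, v) with u dominating v, assuming
    each object's instances are equally likely."""
    hits = sum(1 for u, v in product(U, V) if dominates(u, v))
    return hits / (len(U) * len(V))

A = [(1, 1), (1, 2), (2, 1), (2, 2)]   # 4 hypothetical instances of A
C = [(3, 3), (4, 4), (1, 0)]           # 3 hypothetical instances of C
print(dominance_probability(A, C))     # 0.666..., i.e. 2/3
```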

Probabilistic Skyline: from dominance to skyline. Intuition: to find the skyline we need a new measure, the probability that an object is not dominated by any other object. (To do: use Flash to group objects A, B, C; fix the equation "0 ≠ (1/3)(1/3)".)

Probabilistic Skyline: the idea. Intuition: (1) we know the dominance definition; (2) skyline = not dominated by other objects. Consider object A, instance by instance: no instance of A is dominated by an instance of any other object, so the probability of A being dominated is 0 and the skyline probability of A is therefore 1. (To do: demonstrate the not-dominated case for objects A and B.)

Probabilistic Skyline: calculating the skyline probability of object D (7/12).
Pr(d1) = (1 − 1/4) = 3/4
Pr(d2) = (1 − 1/4)(1 − 2/3) = 1/4
Pr(d3) = (1 − 1/4) = 3/4
Pr(D) = (1/3)(3/4 + 1/4 + 3/4) = 7/12
(Open question: where to explain the consequence of an instance being dominated by an object. To do: another Flash animation showing the calculation of the skyline probability.)
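A minimal sketch of this skyline-probability computation (my own illustration, reusing the dominates helper from the earlier sketch; the instance coordinates below are hypothetical, chosen only to reproduce the 7/12 value from the slide):

```python
def instance_skyline_prob(u, others):
    """Skyline probability of instance u: the product over every other
    object V of (1 - fraction of V's instances that dominate u)."""
    p = 1.0
    for V in others:
        dominated_by = sum(1 for v in V if dominates(v, u))
        p *= 1.0 - dominated_by / len(V)
    return p

def object_skyline_prob(U, others):
    """Skyline probability of object U: the average of its instances'
    skyline probabilities (instances equally likely)."""
    return sum(instance_skyline_prob(u, others) for u in U) / len(U)

A = [(1, 1), (5, 5), (6, 6), (7, 7)]       # 4 hypothetical instances
B = [(3.5, 3.5), (3.6, 3.6), (20, 20)]     # 3 hypothetical instances
D = [(3, 10), (4, 4), (10, 3)]             # d1, d2, d3
print(object_skyline_prob(D, [A, B]))      # 0.5833..., i.e. 7/12
```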

Probabilistic Skyline: the p-skyline. 1-skyline → {A, B}; 7/12-skyline → {A, B, D}. (Speaking note: if you have time, use the formula to find object C's skyline probability as well.)

Problem Definition: given a set of uncertain objects S and a probability threshold p (0 ≤ p ≤ 1), the problem of probabilistic skyline computation is to compute the p-skyline on S, i.e., the set of objects whose skyline probability is at least p. Example: 1-skyline → {A, B}; 7/12-skyline → {A, B, D}.
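For reference, a naive sketch of this definition (my own illustration, reusing object_skyline_prob from the sketch above); the paper's algorithms exist precisely to avoid this exhaustive computation:

```python
def p_skyline(objects, p):
    """Return the names of the objects whose skyline probability is >= p.
    objects: dict mapping an object name to its list of instances."""
    result = []
    for name, U in objects.items():
        others = [V for other, V in objects.items() if other != name]
        if object_skyline_prob(U, others) >= p:
            result.append(name)
    return result

# Usage with the hypothetical data from the previous sketch:
# p_skyline({"A": A, "B": B, "D": D}, 0.5)
```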

Computation Problem of p-skyline First, each uncertain object may have many instances. We have to process a large number of instances. Second, we have to consider many probabilities in deriving the probabilistic skylines.

Algorithms (Top-down and Bottom-up). Data: multiple records per object, in the hope of approximating the probability density function. Techniques: bounding, pruning, refining. (Speaking notes: the full algorithms are very detailed; the techniques the authors use for efficient pruning will be discussed. Assumption: the smaller the value, the better. Tell the audience clearly what data is being processed.)

Bottom-up Algorithm. Technique: the minimum bounding box (MBB) of an object's instances, with corners Umin and Umax. (To do: Flash drawing the bounding box of object D and demonstrating the two properties.)

Bottom-up Algorithm - Pruning Technique (1/3): using Umin and Umax to decide p-skyline membership. For an uncertain object U and probability threshold p, if Pr(Umin) < p then U is not in the p-skyline; if Pr(Umax) ≥ p then U is in the p-skyline. (To do: Flash using Figure 3 to illustrate.)
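A small sketch of the MBB corners and this corner-based membership test (my own illustration, reusing instance_skyline_prob from the earlier sketch):

```python
def mbb_corners(U):
    """Min and max corners (Umin, Umax) of the minimum bounding box of U's instances."""
    dims = range(len(U[0]))
    umin = tuple(min(u[d] for u in U) for d in dims)
    umax = tuple(max(u[d] for u in U) for d in dims)
    return umin, umax

def membership_by_corners(U, others, p):
    """Pruning technique 1: decide p-skyline membership of U from its MBB
    corners when possible; returns True/False, or None if inconclusive."""
    umin, umax = mbb_corners(U)
    if instance_skyline_prob(umin, others) < p:
        return False   # even the best possible point of U fails the threshold
    if instance_skyline_prob(umax, others) >= p:
        return True    # even the worst possible point of U passes the threshold
    return None        # the corners do not decide; Pr(U) must be refined
```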

Bottom-up Algorithm - Pruning Technique (2/3): using Vmax to prune instances of other objects. Let U and V be uncertain objects with U ≠ V. If u is an instance of U and Vmax ≺ u, then Pr(u) = 0, because u is dominated by every instance of V. Example: c2 is dominated by Dmax and hence by all instances of object D, so Pr(c2) = (1 − 3/3)(..)(..) = 0. (To do: Flash using the product equation to illustrate.)

Bottom-up Algorithm - Pruning Technique (3/3): using a subset of an object's instances to prune other objects. The upper bound of Pr(Vmin) can be estimated from a partially seen subset U′ ⊆ U: Pr(Vmin) ≤ (1 − |U′|/|U|)(..)(..). (Speaking notes: to estimate the upper bound of Pr(Vmin) using U′max, assume all points of U appear only in U′ and the green region, so that Vmin is dominated by as few points as possible; if |U′| is large, more instances dominate Vmin and Pr(Vmin) is low. You can take the minimum over {Pr(u)} for easier understanding. To do: better with a Flash illustration.)

Bottom-up Algorithm - Pruning Technique (3/3), special case: if there exists an instance u ∈ U such that Pr(u) < p and u ≺ Vmin, then Pr(V) < p and V can be pruned. This is very useful: an uncertain object that has only been partially computed can already be used to prune other objects.

Bottom-up Algorithm: a simplified version.
Input: the instances of the objects together with their Umin points ("all instances of the uncertain objects are put into one list, together with each Umin").
for each entry u in the list:
    if u is dominated by another object:      // e.g. c2 is dominated by D
        prune u
    if u is a Umin:
        compute Pr(Umin)
        if Pr(Umin) < p:
            prune U                           // Pr(Umin) < p, so U cannot be in the p-skyline
        use Pr(u) to update the upper and lower bounds of Pr(U)
        decide the p-skyline membership of U
        prune other objects                   // check against the other objects' Umin
(To do: pictures for illustration.)
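A very reduced, runnable sketch of this flow (my own illustration, reusing membership_by_corners and object_skyline_prob from the earlier sketches; it omits the shared sorted instance list, the incremental bounds, and the cross-object pruning of the real algorithm):

```python
def bottom_up_p_skyline(objects, p):
    """Simplified bottom-up flavour: try to decide each object cheaply from
    its MBB corners, and refine with the exact skyline probability only
    when the corner test is inconclusive."""
    names = list(objects)
    result = []
    for name in names:
        U = objects[name]
        others = [objects[o] for o in names if o != name]
        decision = membership_by_corners(U, others, p)
        if decision is None:                       # corners inconclusive: refine
            decision = object_skyline_prob(U, others) >= p
        if decision:
            result.append(name)
    return result
```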

Top-down Algorithm: difference between the top-down and bottom-up algorithms. Bottom-up: start from the individual instances of an uncertain object. Top-down: start from the whole set of instances of an uncertain object.

Top-down Algorithm: the idea of bounding. The skyline probability of each subset of an uncertain object can be bounded using the subset's MBB, and the skyline probability of the whole object can then be bounded by the weighted mean of the subsets' bounds. (To do: if possible, draw a figure with 4 squares inside the MBB to replace the upper one.)
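A minimal sketch of this weighted-mean bounding (my own illustration, reusing mbb_corners and instance_skyline_prob from the earlier sketches; partitions is any split of U's instances into disjoint subsets):

```python
def object_bounds_from_partitions(partitions, others):
    """Bound Pr(U) from a partition of U's instances into subsets: for each
    subset, Pr(subset min corner) is an upper bound and Pr(subset max corner)
    a lower bound for its instances; combine them as a weighted mean, with
    weights equal to each subset's share of the instances."""
    total = sum(len(part) for part in partitions)
    lower = upper = 0.0
    for part in partitions:
        pmin, pmax = mbb_corners(part)
        weight = len(part) / total
        upper += weight * instance_skyline_prob(pmin, others)
        lower += weight * instance_skyline_prob(pmax, others)
    return lower, upper
```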

Top-down Algorithm: supporting data structure, the partition tree. (Speaking note: for simplicity, a 2-d tree is used to illustrate the concept. To do: show the look of the partition tree in two dimensions and mark the tree levels 0, 1, 2, etc. Figure: a partition tree whose nodes are the four quadrants A, B, C, D.)
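A rough sketch of how such a 2-d partition tree might be built over an object's instances (my own illustration, not the paper's exact structure; it reuses mbb_corners and splits each node's box into quadrants until a small leaf size is reached):

```python
class PartitionNode:
    """Node of a simple 2-d partition tree over an object's instances:
    stores the instances, their bounding-box corners, and child quadrants."""
    def __init__(self, points, leaf_size=4):
        self.points = points
        self.min_corner, self.max_corner = mbb_corners(points)
        self.children = []
        if len(points) > leaf_size:
            cx = (self.min_corner[0] + self.max_corner[0]) / 2
            cy = (self.min_corner[1] + self.max_corner[1]) / 2
            quads = {}
            for pt in points:
                key = (pt[0] <= cx, pt[1] <= cy)   # which quadrant pt falls into
                quads.setdefault(key, []).append(pt)
            if len(quads) > 1:   # only split if the points spread over more than one quadrant
                self.children = [PartitionNode(q, leaf_size) for q in quads.values()]
```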

Top-down Algorithm: using partition trees for bounding. Compare the partition of U with the partition trees of the other objects as follows: traverse the partition tree of each other uncertain object V in a depth-first manner. (Open questions: add the notion of a possibly dominating object before discussing the algorithm; the wording needs to change if the possibly dominating object is mentioned. Figure: the partition trees of U, with nodes A, B, C, D, and of V, with nodes A', B', C', D'.)

Top-down Algorithm: the possible situations during the traversal of the two partition trees, used for the bounding calculation (situations 1/3, 2/3, and 3/3; the last situation estimates the upper and lower bounds). (To do: place the two trees, with nodes A, B, C, D and A', B', C', D', on these slides; it is better to use the subtrees starting at level 1.)

Top-down Algorithm: pruning with the partition tree (steps 1/3 to 3/3), e.g. comparing the node set {A, B, C, D} of U with node B' of V. (To do: put the trees on these slides for illustration.)

Experiment: data and setup. Experiment: aggregate skyline vs. probabilistic skyline (0.1-skyline). Data set: NBA players' performance records (339,721 records). Attributes: #points, #assists, #rebounds.

Experiment results: 1) the top 12 players in the probabilistic skyline also appear in the aggregate skyline; 2) players such as Olajuwon and Kobe Bryant appear only in the probabilistic skyline, not in the aggregate skyline; 3) there is disagreement between the probabilistic skyline and the aggregate skyline: a player may dominate another in the aggregate skyline while the order is reversed in the probabilistic skyline.

Experiment

Experiment results analysis, 2): players such as Olajuwon and Kobe Bryant appear only in the probabilistic skyline, not in the aggregate skyline. Finding: compared to the aggregate skyline, the probabilistic skyline finds not only players who perform consistently well, but also outstanding players with large variance in performance.

Experiment results analysis, 3): disagreement between the probabilistic skyline and the aggregate skyline. Ewing has a higher skyline probability than Brand, even though Ewing is dominated by Brand in the aggregate data set. Findings: Ewing played very well in a few games; probabilistic skylines disclose interesting knowledge about uncertain data that cannot be captured by traditional skyline analysis; and ranking can be performed on the probabilistic skyline, which cannot be done on the aggregate skyline.

Experiment Results Analysis

Other Experiments: synthetic data sets. Data: synthetic data sets where the instances of objects are generated with anti-correlated, independent, and correlated distributions.

Other experiment results: effect of the probability threshold on the size of the skyline.

Other experiment results: effect of dimensionality on the size of the skyline.

Other experiment results: effect of cardinality (#instances) on the size of the skyline.

Other Experiment results Scalability with respect to probability threshold

Other experiment results: comparison of Top-Down and Bottom-Up with respect to dimensionality and cardinality.

The End