Computer Science and Engineering Ranking Complex Objects in a Multi-dimensional Space Wenjie Zhang, Ying Zhang, Xuemin Lin The University of New South Wales, Australia

Ranking Queries To retrieve a limited number of best-qualified results from a large set of data – A broad range of queries: ranking by value, similarity, relevance, k nearest neighbors, etc. What is "best"? – A ranking function is specified over certain dimensions: top-k query – No ranking function is available: skyline, dominating, minimal regret ratio, etc.

Complex Objects Objects that cannot be modeled by a single d-dimensional value. Focus of this talk: – Uncertain objects: multiple d-dimensional instances per object, exclusive semantics – Multi-valued objects: multiple d-dimensional instances per object, inclusive semantics – Spatial + text

Applications How to get the top-2 highest temperatures? [Table: ranking dimension, probability]

Applications Who is the better player according to #rebounds and #points?

Applications How to get the mobile holder nearest to a location P at time t?

Applications 11 restaurants with spatial locations; (t1, t2, t3) = (sushi, seafood, coffee) are the textual keywords of each restaurant: p1 (t1, t2), p2 (t1, t2), p3 (t1, t3), p4 (t1), p5 (t2, t3), p6 (t2, t3), p8 (t3), p9 (t2), p10 (t1), p11 (t2). Find the restaurant nearest to the query with sushi & seafood.

Why is ranking complex objects hard? Interpreting the semantics for ranking – E.g., top-k ranking over uncertain objects has been studied since 2007, with more than 10 ranking models proposed. Computationally expensive – Must handle multiple instances and extra information (e.g., text)

Uncertain Data How to get the top-2 highest temperatures? [Table: ranking dimension, probability]

Top-k Ranking Queries
– U-topk: [Soliman, Ilyas, Chang, ICDE07], [Yi, Li, Srivastava, Kollios, 08]
– U-kRanks: [Soliman, Ilyas, Chang, ICDE07], [Lian, Chen, 08]
– PT-k: [Hua, Pei, Zhang, Lin, SIGMOD08]
– Global-topk: [Zhang, Chomicki, 08]
– Expected ranks: [Cormode, Li, Yi, ICDE09]
– Unified Ranking Approach: [Li, Saha, Deshpande, VLDB09]
– Representative top-k: [Ge, Zdonik, Madden, SIGMOD09]
– Top-k with Data Cleaning: [Mo, Cheng, Cheung, Li, Yang, ICDE13]
– Top-k Oracle: [Song, Li, Ge, ICDE13]

U-topk

U-kRanks

Global Top-k Based on PT-k: return the k tuples with the highest probabilities of being in the top-k. Top-2 answer: (R3, R5)

Expected Rank The expected rank of a value across all possible worlds
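To make the possible-world semantics concrete, the following is a minimal sketch that enumerates every possible world of a small table and derives both the top-k probability (the quantity Global Top-k ranks by) and the expected rank. The table `TUPLES` is hypothetical and is not the slide's example; tuples are assumed independent, and an absent tuple is assumed to take rank |W| in a world W (one common convention for tuple-level uncertainty).

```python
from itertools import product

# Hypothetical tuple-level uncertain table: (id, score, existence probability).
# Tuples are assumed independent; the data is illustrative only.
TUPLES = [("R1", 30, 0.5), ("R2", 25, 0.8), ("R3", 20, 1.0)]


def possible_worlds(tuples):
    """Yield (world sorted by descending score, probability) for each subset."""
    for mask in product([True, False], repeat=len(tuples)):
        prob = 1.0
        world = []
        for present, t in zip(mask, tuples):
            prob *= t[2] if present else 1.0 - t[2]
            if present:
                world.append(t)
        if prob > 0:
            yield sorted(world, key=lambda t: -t[1]), prob


def topk_probabilities(tuples, k):
    """Probability that each tuple is among the k highest scores of a world."""
    result = {t[0]: 0.0 for t in tuples}
    for world, prob in possible_worlds(tuples):
        for t in world[:k]:
            result[t[0]] += prob
    return result


def expected_ranks(tuples):
    """Expected rank (0 = best); an absent tuple is assumed to get rank |W|."""
    result = {t[0]: 0.0 for t in tuples}
    for world, prob in possible_worlds(tuples):
        rank_of = {t[0]: rank for rank, t in enumerate(world)}
        for t in tuples:
            result[t[0]] += prob * rank_of.get(t[0], len(world))
    return result


def global_topk(tuples, k):
    """Ids of the k tuples with the highest top-k probability."""
    probs = topk_probabilities(tuples, k)
    return sorted(probs, key=lambda name: -probs[name])[:k]
```

On this hypothetical table, `global_topk(TUPLES, 2)` selects R2 and R3, while the two smallest expected ranks belong to R2 and R1, which illustrates how different ranking models can disagree on the same data.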

How to evaluate if a ranking model is good? Check whether the properties of the original operator are retained. Top-k operator – Value-invariance: ranking is determined by relative order, not by scores – Exact k: return exactly k results – Unique rank: each item has one and only one rank position – Containment: the top-(k+1) list contains the top-k list – Stability: if an element is in the top-k list, it should stay in the list after its score or probability increases

How to evaluate if a ranking model is good?

Unified Ranking Approach

Representative Top-k Higher scores with large total prob. Much higher score with similar prob. Based on Top-k Return samples of the distribution

Top-k Oracle Select an arbitrary number of top-k results (sorted by score) to form a top-k oracle. Query evaluation is then executed based on this oracle. Any previous top-k semantics can be plugged in.

Top-k with Data Cleaning Cleaning: acquire exact/newest information for uncertain records at extra cost, e.g., collect the sensor reading again. Select the entities to be cleaned within a limited budget to achieve the highest quality. Any top-k semantics can be plugged in.

What if the ranking function is not given? Top-k: a pre-given function specifies how to rank. What if ranking functions are not available? Skyline

Skylines Skyline: candidates of best options in multi-criteria decision applications. n-dimensional numeric space D = (D_1, …, D_n); on each dimension, a user preference ≺ is defined. For two points, u dominates v (u ≺ v) if – ∀ D_i (1 ≤ i ≤ n), u.D_i ⪯ v.D_i – ∃ D_j (1 ≤ j ≤ n), u.D_j ≺ v.D_j Skyline: points not dominated by any other point.
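The dominance definition translates directly into code. The sketch below assumes that smaller values are preferred on every dimension (the slide leaves the preference ≺ abstract) and uses a quadratic scan; the function names and sample points are illustrative.

```python
def dominates(u, v):
    """u dominates v: no worse on every dimension, strictly better on at least one.

    Assumes smaller values are preferred on every dimension.
    """
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def skyline(points):
    """Return the points that are not dominated by any other point (O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical 2-dimensional points.
POINTS = [(1, 5), (2, 4), (3, 6), (5, 1), (6, 6)]
SKY = skyline(POINTS)
```

Here `(3, 6)` is dominated by `(1, 5)` and `(6, 6)` is dominated by every other point, so neither appears in `SKY`.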

Skylines A skyline building is either close to the viewing point, or higher than those in front of it.

Probabilistic Skyline Consider game-by-game statistics. Conventional methods compute the skyline on – Aggregate: mean. Limitations – Affected by outliers – Lose data distributions. Probabilistic skylines [Pei, Jiang, Lin, Yuan, VLDB07] – An instance has a probability to represent the object – An object has a probability to be in the skyline

Uncertain Objects An uncertain object is represented as – Continuous case: a probability density function (PDF) – Discrete case: a set of instances, each with a probability to appear: U = {u_1, …, u_n}, 0 < p(u_i) ≤ 1 and Σ_{1≤i≤n} p(u_i) = 1. Without loss of generality, assume equal probability, p(u_i) = 1 / |U|

Probabilistic Skyline – Assume each instance takes equal probability (0.5) to appear. – Possible world: W = {a_i, b_j, c_k} (i, j, k = 1 or 2) with probability Pr(W) = 0.5 × 0.5 × 0.5 = 0.125 – Σ_{W ∈ Ω} Pr(W) = 1, where Ω is the set of all possible worlds. – Skyline of a possible world: SKY({a_1, b_1, c_1}) = {a_1, b_1} – Skyline probability: Pr(B) = 4 × 0.125 = 0.5, Pr(A) = 1, Pr(C) = 0
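The skyline probabilities in this three-object example can be reproduced by brute-force enumeration of the eight possible worlds. The slide's figure is not available, so the coordinates in `OBJECTS` below are hypothetical, chosen only so that the result matches the stated values; smaller values are assumed to be better.

```python
from itertools import product

# Hypothetical instance coordinates (two equally likely instances per object).
OBJECTS = {
    "A": [(1, 5), (2, 4)],
    "B": [(5, 1), (3, 6)],
    "C": [(6, 6), (7, 7)],
}


def dominates(u, v):
    """Point dominance, assuming smaller values are preferred."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


def skyline_probabilities(objects):
    """Sum the probabilities of the worlds in which each object's instance is in the skyline."""
    names = list(objects)
    prob = {name: 0.0 for name in names}
    # A possible world picks exactly one instance per object.
    for world in product(*(objects[n] for n in names)):
        world_prob = 1.0
        for n in names:
            world_prob *= 1.0 / len(objects[n])
        for n, inst in zip(names, world):
            if not any(dominates(other, inst) for other in world):
                prob[n] += world_prob
    return prob


SKY_PROB = skyline_probabilities(OBJECTS)
```

With these coordinates every instance of A is undominated in every world, B is in the skyline only when its first instance appears, and both instances of C are always dominated, giving Pr(A) = 1, Pr(B) = 0.5 and Pr(C) = 0. Enumeration is exponential in the number of objects, so it only serves as a reference for small inputs.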

Probabilistic Skyline
Player Name        Skyline Probability | Player Name      Skyline Probability | Player Name        Skyline Probability
LeBron James       0.350699            | Dwyane Wade      0.199065            | Steve Francis      0.131061
Dennis Rodman      0.327592            | Tracy McGrady    0.198185            | Dirk Nowitzki      0.130301
Shaquille O’Neal   0.323401            | Grant Hill       0.191164            | Paul Pierce        0.127079
Charles Barkley    0.309311            | John Stockton    0.183591            | Gary Payton        0.126328
Kevin Garnett      0.302531            | David Robinson   0.177437            | Baron Davis        0.125298
Jason Kidd         0.293569            | Stephon Marbury  0.16683             | Vince Carter       0.122946
Allen Iverson      0.269871            | Tim Hardaway     0.166206            | Antoine Walker     0.121745
Michael Jordan     0.250633            | Magic Johnson    0.151813            | Steve Nash         0.115874
Tim Duncan         0.241252            | Chris Paul       0.149264            | Andre Miller       0.11275
Karl Malone        0.239737            | Gilbert Arenas   0.142883            | Isiah Thomas       0.11076
Chris Webber       0.22153             | Clyde Drexler    0.138993            | Elton Brand        0.10966
Kevin Johnson      0.208991            | Patrick Ewing    0.13577             | Scottie Pippen     0.108941
Hakeem Olajuwon    0.203641            | Rod Strickland   0.135735            | Dominique Wilkins  0.104323
Kobe Bryant        0.200272            | Brad Daugherty   0.133572            | Lamar Odom         0.101803
Brand-Agg (20.39, 2.67, 10.37) Ewing-Agg (19.48, 1.71, 9.91)

Anything missing? A desired property of skylines – Provides a minimum candidate set for all monotonic scoring functions – If an object is preferred by some scoring function, it is in the skyline; if an object is not preferred by any scoring function, it is not. Probabilistic skylines lack this property. Idea borrowed from an important statistical tool: stochastic orders

Expected Utility & Stochastic Order Expected Utility Principle: – Given a set 𝒰 of uncertain objects and a decreasing utility function f, select U in 𝒰 to maximize E[f(U)]. Stochastic Order: – Given a family ℱ of utility functions, U ≺_ℱ V if for each f in ℱ, E[f(U)] ≥ E[f(V)]. Decreasing Multiplicative Functions: – ℱ = { f(x) = ∏_{1≤i≤d} f_i(x_i) }, where each f_i is nonnegative decreasing. Lower orthant order: the stochastic order defined over the family of decreasing multiplicative functions.

Example Utility function: nonnegative decreasing [formulas omitted]
Athlete | Instance 1 / probability | Instance 2 / probability
A       | (1,4) / 0.5              | (3,2) / 0.5
B       | (2,5) / 0.5              | (4,3) / 0.5
C       | (5,1) / 0.01             | (3,4) / 0.99

Stochastic Order I: lower orthant order Given U & V, U stochastically dominates V (U ≺_sd V) if for any x, U.cdf(x) ≥ V.cdf(x), and there exists y such that U.cdf(y) > V.cdf(y). U.cdf(x): probability mass of U in the rectangular region R((0, 0, …, 0), x) (the shaded region in the figure). Stochastic skyline: the objects in 𝒰 not stochastically dominated by any other object. Problem statement: efficiently compute the stochastic skyline for the discrete case.

Minimality of stochastic skyline The stochastic skyline removes all objects not preferred by any nonnegative decreasing function!

Testing if U ≺_sd V Violation point: a point x in R^d_+ is a violation point regarding U ≺_sd V if U.cdf(x) < V.cdf(x). Testing algorithm: if there are no violation points, then U ≺_sd V. Testing the instances alone is not enough.

Reduce to Grid Points – Test if U.cdf ≥ V.cdf against grid points only (see (a)). – Test the switching grid points only (see solid lines in (b)).
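A brute-force version of the grid-based test can be sketched as follows, using the athlete instances from the example table. This is an illustration under stated simplifications, not the partition-based algorithm of the talk: it checks every grid point built from the instance coordinates of both objects rather than only the switching points, and it assumes smaller values are better. Probabilities are held as `Fraction` values so that the comparisons are exact.

```python
from fractions import Fraction
from itertools import product

# Athlete instances from the example table: (point, probability).
A = [((1, 4), Fraction(1, 2)), ((3, 2), Fraction(1, 2))]
B = [((2, 5), Fraction(1, 2)), ((4, 3), Fraction(1, 2))]
C = [((5, 1), Fraction(1, 100)), ((3, 4), Fraction(99, 100))]


def cdf(obj, x):
    """Probability mass of obj inside the rectangle from the origin to x."""
    return sum(p for point, p in obj if all(c <= xi for c, xi in zip(point, x)))


def grid_points(u, v):
    """Points whose i-th coordinate is the i-th coordinate of some instance of u or v."""
    dims = len(u[0][0])
    axes = [sorted({pt[i] for pt, _ in u + v}) for i in range(dims)]
    return product(*axes)


def sd_dominates(u, v):
    """True if u.cdf >= v.cdf at every grid point and u.cdf > v.cdf at some grid point."""
    strict = False
    for x in grid_points(u, v):
        cu, cv = cdf(u, x), cdf(v, x)
        if cu < cv:  # x is a violation point
            return False
        strict = strict or cu > cv
    return strict


def stochastic_skyline(objects):
    """Names of the objects not stochastically dominated by any other object."""
    return [n for n, obj in objects.items()
            if not any(sd_dominates(other, obj)
                       for m, other in objects.items() if m != n)]
```

On this data A dominates B, but A does not dominate C: the point (5,1) is a violation point because C has mass 0.01 there while A has none. The stochastic skyline of the three athletes is therefore {A, C}. The grid has up to (|U| + |V|)^d points, which is why the talk restricts the test to switching points and prunes sub-grids.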

Algorithm – Given a rectangular region R(x, y), if U.cdf(x) ≥ V.cdf(y), then there is no violation point in R(x, y). Partition-based testing algorithm: – Get switching points – Initial check – Iteratively partition the grid to throw away non-promising sub-grids

Complexity The algorithm runs in O(dm log m + m^d (T(U_artree) + T(V_artree))), where m is the number of instances in V. NP-complete regarding d: convert (the decision version of) the minimal set cover problem to a special case of the testing problem.

Usual Order The lower orthant order helps retrieve minimum candidate sets for monotonic multiplicative functions. How about more general monotonic functions, like linear functions?

Usual Order [Figure regions: r ≤ 3, l ≤ 3; 2 ≤ r ≤ 3, l ≤ 1; r ≤ 2, l ≤ 3] E[f(A)], E[f(B)], E[f(C)]?

Usual Order Lower Set:

Usual Order

General Stochastic Skyline

Verification Algorithm Verification: to determine if U ≺_uo V. Naively: test U.cdf(S) ≥ V.cdf(S) against every lower set S (infinite number of lower sets). From infinite to finite: all subsets of V (still exponential)

Max-flow Given a road network, the weight along an edge shows the capacity. Question: what is the maximum flow from source to destination? [Figure: example network with edge capacities]

Max-flow Max-flow / min-cut theorem: for any network having a single source and a single destination node, the maximum flow from source to destination equals the minimum cut value over all cuts in the network. Ford-Fulkerson algorithm

Verification Time complexity: O(t_G + mn log m), where t_G is the time to construct G_{U,V}, m is the number of arcs, and n is the number of nodes.
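The max-flow reduction can be sketched as below. The construction of G_{U,V} is not spelled out in the transcript, so this sketch assumes a bipartite network: the source feeds each instance u of U with capacity p(u), each instance v of V drains to the sink with capacity p(v), and an arc u → v exists when u dominates or equals v (smaller is better). Under that assumption, U ≺_uo V holds when the maximum flow equals 1 and the two distributions differ. Edmonds-Karp (breadth-first Ford-Fulkerson) is used for simplicity; it is not the algorithm behind the complexity bound above.

```python
from collections import deque
from fractions import Fraction


def max_flow(capacity, source, sink):
    """Edmonds-Karp: repeatedly augment along a shortest residual path."""
    residual = {u: dict(edges) for u, edges in capacity.items()}
    for u, edges in capacity.items():
        for v in edges:
            residual.setdefault(v, {}).setdefault(u, 0)  # reverse arcs
    flow = 0
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow
        # Collect the augmenting path, then push its bottleneck capacity.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        push = min(residual[a][b] for a, b in path)
        for a, b in path:
            residual[a][b] -= push
            residual[b][a] += push
        flow += push


def uo_dominates(u, v):
    """Usual-order test via max-flow on the assumed bipartite network."""
    cap = {"s": {}, "t": {}}
    for i, (pu, prob) in enumerate(u):
        cap["s"][("u", i)] = prob
        cap[("u", i)] = {}
        for j, (pv, _) in enumerate(v):
            if all(a <= b for a, b in zip(pu, pv)):
                cap[("u", i)][("v", j)] = 1  # total mass is 1, so never binding
    for j, (_, prob) in enumerate(v):
        cap[("v", j)] = {"t": prob}
    return max_flow(cap, "s", "t") == 1 and sorted(u) != sorted(v)


# Athlete instances from the earlier example table: (point, probability).
A = [((1, 4), Fraction(1, 2)), ((3, 2), Fraction(1, 2))]
B = [((2, 5), Fraction(1, 2)), ((4, 3), Fraction(1, 2))]
C = [((5, 1), Fraction(1, 100)), ((3, 4), Fraction(99, 100))]
```

For the athlete data, each instance of A can ship its full mass to an instance of B, so the flow reaches 1 and A dominates B. Against C, the instance (5,1) is dominated by no instance of A, so at most 0.99 can flow and A does not dominate C.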

Verification Compression: R-tree based level-by-level dominance checking

Verification Step 1: get the full dominance list FD. FD: {(U_1, V_1), (U_2, V_2), (u_1, v_6), (u_2, v_6)}

Verification

Framework U ≺_uo V (U ≺_lo V) preserves transitivity: if U ≺_uo V, V can be removed, since for any W s.t. V ≺_uo W, U ≺_uo W. Apply the standard filtering paradigm.

Framework BBS Algorithm: access the entries based on the minimum distance to the origin [SIGMOD 03]

Framework Index: a global R-tree, indexing the MBB of all objects. Progressive: iteratively traverse the global R-tree to find the data entry with the smallest distance from its lower corner to the origin. Only need to check U ≺_uo V or V ≺_uo U, but not both.

Filtering Pruning Rule 1: throw away fully dominated entries
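The progressive framework with this first pruning rule can be sketched in a few lines. Several simplifications are assumptions of this sketch rather than details from the talk: objects are kept in a flat list sorted by the L1 distance of their MBB lower corner to the origin instead of a global R-tree, "fully dominated" is taken to mean that one MBB's upper corner dominates the other's lower corner, coordinates are nonnegative, and the exact test is a brute-force lower orthant check.

```python
from fractions import Fraction
from itertools import product


def corners(obj):
    """Lower and upper corner of the minimum bounding box (MBB) of the instances."""
    dims = range(len(obj[0][0]))
    lower = tuple(min(pt[i] for pt, _ in obj) for i in dims)
    upper = tuple(max(pt[i] for pt, _ in obj) for i in dims)
    return lower, upper


def fully_dominates(u, v):
    """Assumed Pruning Rule 1: u's upper corner dominates v's lower corner."""
    upper, lower = corners(u)[1], corners(v)[0]
    return all(a <= b for a, b in zip(upper, lower)) and upper != lower


def cdf(obj, x):
    return sum(p for pt, p in obj if all(c <= xi for c, xi in zip(pt, x)))


def lo_dominates(u, v):
    """Brute-force lower orthant test on the grid of instance coordinates."""
    axes = [sorted({pt[i] for pt, _ in u + v}) for i in range(len(u[0][0]))]
    diffs = [cdf(u, x) - cdf(v, x) for x in product(*axes)]
    return min(diffs) >= 0 and max(diffs) > 0


def progressive_skyline(objects, exact_test=lo_dominates):
    """Visit objects by increasing distance of the MBB lower corner to the origin."""
    def key(name):
        return sum(corners(objects[name])[0])

    result = []
    for name in sorted(objects, key=key):
        obj = objects[name]
        # Cheap MBB check first, exact test only when it is inconclusive.
        if any(fully_dominates(objects[s], obj) or exact_test(objects[s], obj)
               for s in result):
            continue
        # A later entry can dominate an earlier one only when their keys tie.
        result = [s for s in result
                  if key(s) != key(name) or not exact_test(obj, objects[s])]
        result.append(name)
    return result


half = Fraction(1, 2)
OBJECTS = {
    "A": [((1, 4), half), ((3, 2), half)],
    "B": [((2, 5), half), ((4, 3), half)],
    "C": [((5, 1), Fraction(1, 100)), ((3, 4), Fraction(99, 100))],
    "D": [((6, 6), half), ((7, 8), half)],  # hypothetical extra object
}
```

Objects are visited in the order A, C, B, D. D is discarded by the MBB check alone, B needs the exact test against A, and C survives, so `progressive_skyline(OBJECTS)` returns A and C. Because of the access order, the reverse test is only needed for entries whose keys tie.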

Filtering Pruning for lskyline: let R(x, y) denote a rectangular region in d-dimensional space where the lower and upper corners are x and y, respectively.

Filtering

Filtering Pruning for gskyline

Filtering Statistic-based pruning: – mean of intermediate entry E: the minimum among all its children – variance of intermediate entry E: the maximum among all its children

Size Estimation Expected size: the size of the stochastic skyline in R^d is bounded by that of the conventional skyline in R^(d+1), i.e., ln^d(n)/(d+1)!
