Published by Willa Hawkins. Modified over 8 years ago.
1
Ranking Complex Objects in a Multi-dimensional Space
Wenjie Zhang, Ying Zhang, Xuemin Lin
Computer Science and Engineering
The University of New South Wales, Australia
2
Ranking Queries
To retrieve a limited number of best-qualified results from a large set of data
– A broad range of queries: ranking by value, similarity, relevance, k nearest neighbors, etc.
Best?
– A ranking function is specified over certain dimensions: top-k query
– No ranking function is available: skyline, dominating, minimal regret ratio, etc.
3
Complex Objects
Objects that cannot be modeled by a single d-dimensional value
Focus of this talk:
– Uncertain objects: multiple d-dimensional instances per object, with exclusive semantics
– Multi-valued objects: multiple d-dimensional instances per object, with inclusive semantics
– Spatial + text
4
Applications
[Table with columns: Ranking Dimension, Probability]
How to get the top-2 highest temperatures?
5
Applications
Who is a better player according to #rebounds and #points?
6
Applications
How to get the mobile holder nearest to a location P at time t?
7
Applications
[Figure: 11 restaurants with spatial locations, labeled with keywords: p1 (t1, t2), p2 (t1, t2), p3 (t1, t3), p4 (t1), p5 (t2, t3), p6 (t2, t3), p8 (t3), p9 (t2), p10 (t1), p11 (t2)]
(t1, t2, t3) = (sushi, seafood, coffee) are the textual keywords of each restaurant
Find the restaurant nearest to the query with sushi & seafood
8
Why is ranking complex objects hard?
Interpreting the semantics of ranking
– E.g., top-k ranking over uncertain objects has been studied since 2007, with more than 10 ranking models proposed
Computationally expensive
– Must handle multiple instances and extra information (e.g., text)
9
Uncertain Data
[Table with columns: Ranking Dimension, Probability]
How to get the top-2 highest temperatures?
10
Top-k Ranking Queries
U-topk: [Soliman, Ilyas, Chang, ICDE07], [Yi, Li, Srivastava, Kollios, 08]
U-kRanks: [Soliman, Ilyas, Chang, ICDE07], [Lian, Chen, 08]
PT-k: [Hua, Pei, Zhang, Lin, SIGMOD08]
Global-topk: [Zhang, Chomicki, 08]
Expected ranks: [Cormode, Li, Yi, ICDE09]
Unified Ranking Approach: [Li, Saha, Deshpande, VLDB09]
Representative top-k: [Ge, Zdonik, Madden, SIGMOD09]
Top-k with Data Cleaning: [Mo, Cheng, Cheung, Li, Yang, ICDE13]
Top-k Oracle: [Song, Li, Ge, ICDE13]
11
U-topk
12
U-kRanks
13
PT-k
14
Global Top-k
Based on PT-k: return the k tuples with the highest probabilities of being in the top-k
Top-2 answer: (R3, R5)
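The top-k probability that PT-k and Global-topk build on can be sketched by enumerating possible worlds. This is a minimal illustration, assuming tuple-level uncertainty with independent tuples; the readings, the names `topk_probabilities`, `global_topk` and `pt_k`, and the threshold 0.3 are hypothetical and not taken from the slides.

```python
from fractions import Fraction
from itertools import product

# Hypothetical uncertain tuples: (name, score, existence probability).
readings = [
    ("R1", 32, Fraction(1, 2)),
    ("R2", 30, Fraction(9, 10)),
    ("R3", 28, Fraction(3, 5)),
    ("R4", 25, Fraction(1)),
]

def topk_probabilities(tuples, k):
    """Probability that each tuple is among the k highest scores of a possible world."""
    prob = {name: Fraction(0) for name, _, _ in tuples}
    # A possible world is one choice of present/absent per tuple (independence assumed).
    for present in product([False, True], repeat=len(tuples)):
        world_prob = Fraction(1)
        for (_, _, p), here in zip(tuples, present):
            world_prob *= p if here else 1 - p
        world = [t for t, here in zip(tuples, present) if here]
        world.sort(key=lambda t: t[1], reverse=True)
        for name, _, _ in world[:k]:
            prob[name] += world_prob
    return prob

def global_topk(tuples, k):
    """Global-topk: the k tuples with the highest top-k probability."""
    prob = topk_probabilities(tuples, k)
    return sorted(prob, key=prob.get, reverse=True)[:k]

def pt_k(tuples, k, threshold):
    """PT-k: every tuple whose top-k probability reaches the threshold."""
    prob = topk_probabilities(tuples, k)
    return [name for name, _, _ in tuples if prob[name] >= threshold]

print(topk_probabilities(readings, 2))
print(global_topk(readings, 2))          # ['R2', 'R1']
print(pt_k(readings, 2, Fraction(3, 10)))  # ['R1', 'R2', 'R3']
```

Enumeration is exponential in the number of tuples, so it only serves to make the semantics concrete. Note that PT-k may return more or fewer than k tuples, which is the "exact k" property discussed later in the talk.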
15
Expected Rank
The expected rank of a value across all possible worlds
16
How to evaluate if a ranking model is good?
Check whether the properties of the original operator are retained
Top-k operator
– Value-invariance: ranking is determined by relative order, not by scores
– Exact k: return exactly k results
– Unique rank: each item has one and only one rank position
– Containment: the top-(k+1) list contains the top-k list
– Stability: if an element is in the top-k list, it should stay in the list after its score or probability increases
17
How to evaluate if a ranking model is good ?
18
Unified Ranking Approach
19
Representative Top-k
Higher scores with large total prob.
Much higher score with similar prob.
Based on Top-k
Return samples of the distribution
20
Top-k Oracle
Select an arbitrary number of top-k results (sorted by score) to form a top-k oracle
Query evaluation is then executed against this oracle
Any previous top-k semantics can be plugged in
21
Top-k with Data Cleaning
Cleaning: acquire exact or up-to-date information for uncertain records at extra cost, e.g., by collecting the sensor reading again
Select the entities to clean under a limited budget so as to achieve the highest quality
Any top-k semantics can be plugged in
22
What if a ranking function is not given?
Top-k: a pre-given function specifies how to rank
What if ranking functions are not available? --- Skyline
23
Skylines
Skyline: candidates for the best options in multi-criteria decision applications.
n-dimensional numeric space $D = (D_1, \ldots, D_n)$
On each dimension, a user preference $\prec$ is defined
For two points u and v, u dominates v ($u \prec v$) if
– for every $D_i$ ($1 \le i \le n$), $u.D_i \preceq v.D_i$
– there exists $D_j$ ($1 \le j \le n$) such that $u.D_j \prec v.D_j$
Skyline: the points not dominated by any other point.
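The dominance test and the skyline definition can be sketched directly. This is a quadratic-time illustration, assuming that smaller values are preferred on every dimension; the `hotels` data (price, distance) is hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]

def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """u dominates v: no worse on every dimension, strictly better on at least one.

    Assumption: smaller values are preferred on every dimension.
    """
    no_worse = all(a <= b for a, b in zip(u, v))
    strictly_better = any(a < b for a, b in zip(u, v))
    return no_worse and strictly_better

def skyline(points: List[Point]) -> List[Point]:
    """Points that are not dominated by any other point (naive pairwise check)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (price, distance) pairs.
hotels = [(50, 8), (60, 3), (80, 1), (70, 5), (90, 2)]
print(skyline(hotels))  # [(50, 8), (60, 3), (80, 1)]
```

Here (70, 5) is dominated by (60, 3) and (90, 2) by (80, 1), so neither is a skyline point.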
24
Skylines
A skyline building is either close to the viewing point, or higher than those in front of it.
25
Probabilistic Skyline
Consider game-by-game statistics
Conventional methods compute the skyline on
– Aggregate: mean
Limitations
– Affected by outliers
– Lose data distributions
Probabilistic skylines [Pei, Jiang, Lin, Yuan, VLDB07]
– An instance has a probability to represent the object
– An object has a probability to be in the skyline
26
Uncertain Objects
An uncertain object is represented as
– Continuous case: a probability density function (PDF)
– Discrete case: a set of instances, each appearing with some probability
$U = \{u_1, \ldots, u_n\}$, $0 < p(u_i) \le 1$ and $\sum_{1 \le i \le n} p(u_i) = 1$
Without loss of generality, assume equal probability, $p(u_i) = 1/|U|$
27
Probabilistic Skyline
o Assume each instance takes equal probability (0.5) to appear.
o Possible world: $W = \{a_i, b_j, c_k\}$ (i, j, k = 1 or 2), with probability Pr(W) = 0.5 × 0.5 × 0.5 = 0.125
o $\sum_{W \in \Omega} \Pr(W) = 1$, where $\Omega$ is the set of all possible worlds.
o Skyline of a possible world: $SKY(\{a_1, b_1, c_1\}) = \{a_1, b_1\}$
o Skyline probability: Pr(A) = 1, Pr(B) = 4 × 0.125 = 0.5, Pr(C) = 0
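The possible-world computation can be reproduced by brute force. The instance coordinates in the slide's figure are not available in the text, so the coordinates below are hypothetical, chosen only so that the resulting probabilities match the slide (Pr(A) = 1, Pr(B) = 0.5, Pr(C) = 0); smaller values are assumed to be preferred.

```python
from fractions import Fraction
from itertools import product

def dominates(u, v):
    """u dominates v, assuming smaller values are preferred on every dimension."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))

# Hypothetical instances; each instance of an object is equally likely.
objects = {
    "A": [(1, 4), (2, 2)],
    "B": [(3, 1), (4, 5)],
    "C": [(5, 5), (4, 6)],
}

def skyline_probabilities(objs):
    """Sum, over all possible worlds, the probability of worlds where the object is in the skyline."""
    names = list(objs)
    world_prob = Fraction(1)
    for name in names:
        world_prob /= len(objs[name])  # equal probability per instance
    prob = {name: Fraction(0) for name in names}
    # A possible world picks exactly one instance per object.
    for world in product(*(objs[name] for name in names)):
        for i, name in enumerate(names):
            dominated = any(
                dominates(world[j], world[i]) for j in range(len(names)) if j != i
            )
            if not dominated:
                prob[name] += world_prob
    return prob

print(skyline_probabilities(objects))
```

With these coordinates, B is in the skyline exactly in the 4 worlds that contain its first instance, giving 4 × 0.125 = 0.5. The enumeration is exponential in the number of objects and is meant only to make the definition concrete.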
28
Probabilistic Skyline
Player Name | Skyline Probability | Player Name | Skyline Probability | Player Name | Skyline Probability
LeBron James | 0.350699 | Dwyane Wade | 0.199065 | Steve Francis | 0.131061
Dennis Rodman | 0.327592 | Tracy McGrady | 0.198185 | Dirk Nowitzki | 0.130301
Shaquille O’Neal | 0.323401 | Grant Hill | 0.191164 | Paul Pierce | 0.127079
Charles Barkley | 0.309311 | John Stockton | 0.183591 | Gary Payton | 0.126328
Kevin Garnett | 0.302531 | David Robinson | 0.177437 | Baron Davis | 0.125298
Jason Kidd | 0.293569 | Stephon Marbury | 0.16683 | Vince Carter | 0.122946
Allen Iverson | 0.269871 | Tim Hardaway | 0.166206 | Antoine Walker | 0.121745
Michael Jordan | 0.250633 | Magic Johnson | 0.151813 | Steve Nash | 0.115874
Tim Duncan | 0.241252 | Chris Paul | 0.149264 | Andre Miller | 0.11275
Karl Malone | 0.239737 | Gilbert Arenas | 0.142883 | Isiah Thomas | 0.11076
Chris Webber | 0.22153 | Clyde Drexler | 0.138993 | Elton Brand | 0.10966
Kevin Johnson | 0.208991 | Patrick Ewing | 0.13577 | Scottie Pippen | 0.108941
Hakeem Olajuwon | 0.203641 | Rod Strickland | 0.135735 | Dominique Wilkins | 0.104323
Kobe Bryant | 0.200272 | Brad Daugherty | 0.133572 | Lamar Odom | 0.101803
Brand-Agg (20.39, 2.67, 10.37)
Ewing-Agg (19.48, 1.71, 9.91)
29
Anything missed?
A desired property of skylines
– Provides a minimum candidate set for all monotonic scoring functions
– If an object is preferred by some scoring function, it is in the skyline; if an object is not preferred by any scoring function, it is not in the skyline.
Probabilistic skylines lack this property
Borrow an idea from an important statistical tool: stochastic orders
30
Expected Utility & Stochastic Order
Expected Utility Principle:
– Given a set $\mathcal{U}$ of uncertain objects and a decreasing utility function f, select $U \in \mathcal{U}$ to maximize $E[f(U)]$.
Stochastic Order:
– Given a family $\mathcal{F}$ of utility functions, $U \prec_{\mathcal{F}} V$ if $E[f(U)] \ge E[f(V)]$ for each f in $\mathcal{F}$
Decreasing Multiplicative Functions:
– $\mathcal{F} = \{ f(x) = \prod_{i=1}^{d} f_i(x_i) \}$, where each $f_i$ is nonnegative decreasing.
Lower orthant order: the stochastic order defined over the family of decreasing multiplicative functions.
31
Example
Utility function [formula not preserved]: nonnegative decreasing
Athlete | Instance 1 / probability | Instance 2 / probability
A | (1,4) / 0.5 | (3,2) / 0.5
B | (2,5) / 0.5 | (4,3) / 0.5
C | (5,1) / 0.01 | (3,4) / 0.99
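The expected utility principle can be tried out on the three athletes of this example. The slide's own utility function is not preserved in the text, so the two functions below (`inverse_product` and `corner_indicator`) are hypothetical stand-ins; both are products of nonnegative, non-increasing per-dimension factors.

```python
# Instances and probabilities as listed in the example table.
athletes = {
    "A": [((1, 4), 0.5), ((3, 2), 0.5)],
    "B": [((2, 5), 0.5), ((4, 3), 0.5)],
    "C": [((5, 1), 0.01), ((3, 4), 0.99)],
}

def expected_utility(instances, f):
    """E[f(U)] for a discrete uncertain object given as (instance, probability) pairs."""
    return sum(p * f(x) for x, p in instances)

def inverse_product(x):
    """Hypothetical utility f(x) = (1/x1) * (1/x2); smaller coordinates are better."""
    return (1.0 / x[0]) * (1.0 / x[1])

def corner_indicator(x):
    """Hypothetical step utility: 1 if x1 <= 5 and x2 <= 1, else 0."""
    return float(x[0] <= 5) * float(x[1] <= 1)

def best(f):
    """Athlete with the highest expected utility under f."""
    return max(athletes, key=lambda name: expected_utility(athletes[name], f))

print(best(inverse_product))   # A
print(best(corner_indicator))  # C
```

Under the first function A is preferred, while the step function prefers C even though its instance (5,1) has probability only 0.01. An object such as C therefore cannot be discarded if every decreasing multiplicative function must be supported.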
32
Stochastic Order I: lower orthant order
Given U and V, U stochastically dominates V ($U \prec_{sd} V$) if U.cdf(x) ≥ V.cdf(x) for every x, and there exists y such that U.cdf(y) > V.cdf(y).
U.cdf(x): probability mass of U in the rectangular region R((0, 0, …, 0), x); see the shaded region.
Stochastic skyline: the objects in $\mathcal{U}$ that are not stochastically dominated by any other object.
Problem statement: efficiently compute the stochastic skyline for the discrete case.
33
Minimality of stochastic skyline
The stochastic skyline removes all objects not preferred by any nonnegative decreasing function!
34
Testing if $U \prec_{sd} V$
Violation point: a point x in $\mathbb{R}^d_+$ is a violation point regarding $U \prec_{sd} V$ if U.cdf(x) < V.cdf(x).
Testing algorithm: if there are no violation points, then $U \prec_{sd} V$.
Testing only the instances is not enough.
35
Reduce to Grid Points Test if U.cdf ≥ V.cdf against grid points only (see (a)). Testing the switching grid points only (see solid lines (b)).
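A brute-force version of the grid-point test can be sketched as follows. It checks every grid point spanned by the instance coordinates of both objects, which is exponential in the number of dimensions; the switching-point and partition refinements mentioned on the slides are not implemented. The function names are hypothetical, and the data is the athlete example from earlier in the talk.

```python
from fractions import Fraction
from itertools import product

def cdf(obj, x):
    """Probability mass of the object inside the rectangle from the origin to x."""
    return sum(p for inst, p in obj if all(a <= b for a, b in zip(inst, x)))

def grid_points(U, V):
    """All combinations of per-dimension coordinates occurring in U or V."""
    dims = len(U[0][0])
    axes = [sorted({inst[i] for inst, _ in U + V}) for i in range(dims)]
    return product(*axes)

def lo_dominates(U, V):
    """U.cdf >= V.cdf at every grid point, with strict inequality at some point."""
    strict = False
    for x in grid_points(U, V):
        cu, cv = cdf(U, x), cdf(V, x)
        if cu < cv:
            return False  # x is a violation point
        if cu > cv:
            strict = True
    return strict

half = Fraction(1, 2)
athletes = {
    "A": [((1, 4), half), ((3, 2), half)],
    "B": [((2, 5), half), ((4, 3), half)],
    "C": [((5, 1), Fraction(1, 100)), ((3, 4), Fraction(99, 100))],
}

stochastic_skyline = [
    name for name, obj in athletes.items()
    if not any(lo_dominates(other, obj)
               for o, other in athletes.items() if o != name)
]
print(stochastic_skyline)  # ['A', 'C']
```

Grid points suffice because both cdfs are step functions that change only when a coordinate crosses an instance coordinate, so every point in the space has the same pair of cdf values as some grid point (or both values are 0). In this data A dominates B, while (5,1) is a violation point for both A ≺ C and B ≺ C.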
36
Algorithm
Given a rectangular region R(x, y), if U.cdf(x) ≥ V.cdf(y), then there is no violation point in R(x, y).
Partition-based testing algorithm:
– Get the switching points
– Initial check
– Iteratively partition the grid to throw away non-promising sub-grids
37
Complexity
The algorithm runs in $O(dm \log m + m^d (T(U_{artree}) + T(V_{artree})))$, where m is the number of instances in V.
NP-complete regarding d.
Convert (the decision version of) the minimal set cover problem to a special case of the testing problem.
38
Usual Order
The lower orthant order helps retrieve minimum candidate sets for monotonic multiplicative functions.
What about more general monotonic functions, such as linear functions?
39
Usual Order
r ≤ 3, l ≤ 3
2 ≤ r ≤ 3, l ≤ 1
r ≤ 2, l ≤ 3
E[f(A)], E[f(B)], E[f(C)]?
40
Usual Order
Lower Set:
41
Usual Order
42
General Stochastic Skyline
43
Verification Algorithm
Verification: to determine if $U \prec_{uo} V$
Naively: test U.cdf(S) ≥ V.cdf(S) against every lower set S (there are infinitely many lower sets)
From infinite to finite: all subsets of V (still exponential)
44
Max-flow
Given a road network, the weight along an edge shows its capacity.
Question: what is the maximum flow from source to destination?
[Figure: example network with edge capacities]
45
Max-flow
Max-flow / min-cut theorem: for any network with a single source and a single destination node, the maximum flow from source to destination equals the minimum cut value over all cuts in the network.
Ford and Fulkerson algorithm
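A compact way to compute a maximum flow is the Edmonds-Karp variant of the Ford and Fulkerson method, which picks each augmenting path by breadth-first search. The sketch below uses a small hypothetical network, since the edge capacities in the slide's figure are not recoverable from the text.

```python
from collections import deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp: repeatedly augment along a shortest path in the residual graph."""
    # Residual capacities, with a zero-capacity reverse edge for every edge.
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)

    flow = 0
    while True:
        # Breadth-first search for an augmenting path.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return flow  # no augmenting path is left

        # Walk back from the sink to collect the path, then push the bottleneck.
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[a][b] for a, b in path)
        for a, b in path:
            residual[a][b] -= bottleneck
            residual[b][a] += bottleneck
        flow += bottleneck

# Hypothetical network: edge -> capacity.
network = {
    "s": {"a": 4, "b": 2},
    "a": {"b": 1, "t": 3},
    "b": {"t": 2},
}
print(max_flow(network, "s", "t"))  # 5
```

The result agrees with the max-flow / min-cut theorem: the two edges entering "t" form a cut of capacity 3 + 2 = 5, and no cut in this network is smaller.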
46
Verification
Time complexity: $O(t_G + mn \log m)$
$t_G$: time to construct $G_{U,V}$
m: number of arcs
n: number of nodes
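One standard way to decide the usual order between two discrete distributions is a flow formulation: connect a source to each instance of U with capacity p(u), each instance of V to a sink with capacity p(v), and u to v whenever u is no worse than v on every dimension. The relation holds in its weak form exactly when the maximum flow is 1. The slides do not spell out how the graph is built, so the construction, the function names and the plain breadth-first augmenting-path solver below are assumptions; the sketch also does not check that the two distributions differ.

```python
from collections import deque
from fractions import Fraction

def max_flow(residual, source, sink):
    """Augmenting-path max flow over a residual graph (dict of dicts), modified in place."""
    flow = Fraction(0)
    while True:
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            node = queue.popleft()
            for nxt, cap in residual[node].items():
                if cap > 0 and nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        if sink not in parent:
            return flow
        path, node = [], sink
        while parent[node] is not None:
            path.append((parent[node], node))
            node = parent[node]
        push = min(residual[a][b] for a, b in path)
        for a, b in path:
            residual[a][b] -= push
            residual[b][a] += push
        flow += push

def usual_order_holds(U, V):
    """True if all probability mass of U can be matched to instances of V that are no better."""
    residual = {"s": {}, "t": {}}

    def add_edge(a, b, cap):
        residual.setdefault(a, {})[b] = cap
        residual.setdefault(b, {}).setdefault(a, Fraction(0))

    for i, (_, p) in enumerate(U):
        add_edge("s", ("u", i), p)
    for j, (_, q) in enumerate(V):
        add_edge(("v", j), "t", q)
    for i, (u, _) in enumerate(U):
        for j, (v, _) in enumerate(V):
            if all(a <= b for a, b in zip(u, v)):  # smaller is assumed better
                add_edge(("u", i), ("v", j), Fraction(1))  # 1 acts as "unbounded"
    return max_flow(residual, "s", "t") == 1

half = Fraction(1, 2)
A = [((1, 4), half), ((3, 2), half)]
B = [((2, 5), half), ((4, 3), half)]
C = [((5, 1), Fraction(1, 100)), ((3, 4), Fraction(99, 100))]
print(usual_order_holds(A, B))  # True
print(usual_order_holds(A, C))  # False
```

For A and C the flow stops at 0.99 because no instance of A is at least as good as (5,1) on both dimensions. Exact fractions are used so that the comparison with 1 is not affected by floating-point rounding.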
47
Verification
Compression: R-tree based level-by-level dominance checking
48
Verification
Step 1: get the full dominance list FD
FD: $\{(U_1, V_1), (U_2, V_2), (u_1, v_6), (u_2, v_6)\}$
49
Verification
50
Framework
$U \prec_{uo} V$ ($U \prec_{lo} V$) preserves transitivity: if $U \prec_{uo} V$, V can be removed, since for any W such that $V \prec_{uo} W$, we also have $U \prec_{uo} W$
Apply the standard filtering paradigm
51
Framework
BBS algorithm: access the entries in order of their minimum distance to the origin [SIGMOD 03]
52
Framework
Index: a global R-tree indexing the MBBs of all objects
Progressive: iteratively traverse the global R-tree to find the data entry whose lower corner has the smallest distance to the origin
Only need to check $U \prec_{uo} V$ or $V \prec_{uo} U$, but not both
53
Filtering
Pruning Rule 1: throw away fully dominated entries
54
Filtering
Pruning for lskyline: let R(x, y) denote a rectangular region in d-dimensional space whose lower and upper corners are x and y, respectively.
55
Filtering
56
Filtering
Pruning for gskyline
57
Filtering
Statistic-based pruning:
mean of intermediate entry E: the minimum among all its children
variance of intermediate entry E: the maximum among all its children
58
Size Estimation
Expected size: the size of the stochastic skyline in $\mathbb{R}^d$ is bounded by that of the conventional skyline in $\mathbb{R}^{d+1}$, i.e., $\ln^d(n)/(d+1)!$
59