VLDB’2007 review Denis Mindolin
VLDB’07 program
Outline
- Probabilistic Skylines on Uncertain Data, Jian Pei et al.
- Lazy Maintenance of Materialized Views, Jingren Zhou et al.
Probabilistic Skylines on Uncertain Data Based on the VLDB’07 paper of Jian Pei et al
Skyline. General picture
For a dataset $D = \{p_1, \ldots, p_n\}$, the skyline $S$ is the set of all $p_i$ such that no other $p_j$ dominates $p_i$.
$p_i$ dominates $p_j$ if $p_i$ is better than $p_j$ in at least one dimension and not worse than $p_j$ in all other dimensions.
Single game results: S = {Eddie, Carl}
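The dominance test and the skyline definition above can be sketched in a few lines. This is a naive quadratic illustration, not the method of any particular paper; the function names and the sample points are hypothetical, and smaller values are assumed to be better in every dimension.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """True if p dominates q: not worse everywhere, strictly better somewhere.

    Assumption of this sketch: smaller values are better.
    """
    not_worse = all(a <= b for a, b in zip(p, q))
    better_somewhere = any(a < b for a, b in zip(p, q))
    return not_worse and better_somewhere


def skyline(points: List[Point]) -> List[Point]:
    """Keep every point that no other point dominates (O(n^2) scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical sample data: (3, 3) and (5, 5) are dominated by (2, 2).
data = [(1, 5), (2, 2), (4, 1), (3, 3), (5, 5)]
result = skyline(data)  # [(1, 5), (2, 2), (4, 1)]
```

A point never dominates itself here, because the strict-improvement check fails for equal points.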
Uncertain data
Multiple game results: S = ?
Use some aggregate function? It can’t capture the distribution, and it can be biased by outliers!
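Probabilistic dominance relation
Uncertain data
- An uncertain object is a set of instances $U = \{u_1, \ldots, u_l\}$
- Uncertain objects are independent
- All instances of an object are equally likely: $\Pr(u_i) = \Pr(u_j) = 1/l$
Probabilistic dominance relation
Given two uncertain objects $U = \{u_1, \ldots, u_{l_1}\}$ and $V = \{v_1, \ldots, v_{l_2}\}$, the probability that $V$ dominates $U$ is given by
$$\Pr[V \prec U] = \frac{1}{l_1 l_2} \sum_{i=1}^{l_1} \left|\{ v_j \in V : v_j \prec u_i \}\right|$$
where $v \prec u$ means that instance $v$ dominates instance $u$.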
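Under the slide's assumptions (independent objects, equally likely instances), the probability that V dominates U reduces to the fraction of instance pairs (v, u) in which v dominates u. A minimal sketch, with hypothetical function names and sample objects, assuming smaller values are better:

```python
from typing import List, Sequence, Tuple

Instance = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """True if p is not worse than q everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def prob_dominates(V: List[Instance], U: List[Instance]) -> float:
    """Pr[V dominates U] as the share of (v, u) pairs where v dominates u."""
    dominating_pairs = sum(1 for v in V for u in U if dominates(v, u))
    return dominating_pairs / (len(V) * len(U))


# Hypothetical uncertain objects with two instances each.
V = [(1, 1), (3, 3)]
U = [(2, 2), (4, 4)]
pr_v_over_u = prob_dominates(V, U)  # 3 of 4 pairs -> 0.75
pr_u_over_v = prob_dominates(U, V)  # 1 of 4 pairs -> 0.25
```

The two probabilities need not sum to 1 in general, since some pairs may be incomparable.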
Probabilistic dominance relation. Example Smaller values of X and Y are better
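p-Skyline
Let $U = \{u_1, \ldots, u_l\}$. For each $u \in U$, the probability that $u$ is in the skyline is the probability that $u$ is not dominated by any other object:
$$\Pr(u) = \prod_{V \neq U} \left(1 - \frac{\left|\{ v \in V : v \text{ dominates } u \}\right|}{|V|}\right)$$
Skyline probability of $U$:
$$\Pr(U) = \frac{1}{l} \sum_{u \in U} \Pr(u)$$
p-Skyline: the set of uncertain objects whose skyline probability is at least $p$, that is, $\{U : \Pr(U) \ge p\}$.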
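Assuming equally likely instances and independent objects, the probability that an instance is not dominated by any other object is a product, over the other objects, of the share of their instances that do not dominate it; the object's skyline probability is the average over its instances. The following is a brute-force sketch of that computation, not the bottom-up algorithm itself; all names and the sample objects are hypothetical, and smaller values are assumed to be better.

```python
from typing import Dict, List, Sequence, Tuple

Instance = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def instance_skyline_prob(u: Instance, others: List[List[Instance]]) -> float:
    """Probability that u is not dominated by any of the other objects."""
    prob = 1.0
    for V in others:
        dominating = sum(1 for v in V if dominates(v, u))
        prob *= 1.0 - dominating / len(V)
    return prob


def skyline_prob(name: str, objects: Dict[str, List[Instance]]) -> float:
    """Average of the instance skyline probabilities of one object."""
    U = objects[name]
    others = [V for key, V in objects.items() if key != name]
    return sum(instance_skyline_prob(u, others) for u in U) / len(U)


def p_skyline(objects: Dict[str, List[Instance]], p: float) -> List[str]:
    """Objects whose skyline probability is at least p."""
    return [name for name in objects if skyline_prob(name, objects) >= p]


objects = {"A": [(1, 1), (3, 3)], "B": [(2, 2), (4, 4)]}
pr_a = skyline_prob("A", objects)   # (1.0 + 0.5) / 2 = 0.75
pr_b = skyline_prob("B", objects)   # (0.5 + 0.0) / 2 = 0.25
members = p_skyline(objects, 0.5)   # ["A"]
```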
The bottom-up skyline algorithm
Bounding: compute upper and lower bounds of the skyline probability for each object.
Pruning: if the lower bound of $\Pr(U)$ is at least $p$, then $U$ is in the p-skyline. If the upper bound of $\Pr(U)$ is smaller than $p$, then $U$ is not in the p-skyline.
Refining: if $p$ lies between the lower and the upper bounds, tighter bounds of the skyline probability are obtained in the next iteration of the algorithm.
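Bounding
$U_{\min} = (\min_{i=1}^{l} u_i.D_1, \ldots, \min_{i=1}^{l} u_i.D_d)$
$U_{\max} = (\max_{i=1}^{l} u_i.D_1, \ldots, \max_{i=1}^{l} u_i.D_d)$
where $l$ is the number of instances of $U$ and $d$ is the number of dimensions.
Lemma
If $u_{i_1}$ dominates $u_{i_2}$, then $\Pr(u_{i_1}) \ge \Pr(u_{i_2})$.
Hence $\Pr(U_{\min}) \ge \Pr(U) \ge \Pr(U_{\max})$.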
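The bounding step can be illustrated by evaluating the two virtual corner points of an object against all other objects. This is a brute-force sketch under the slide's assumptions (equally likely instances, independent objects, smaller values better); the helper names and sample objects are hypothetical, and the paper's algorithm avoids such exhaustive scans.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Instance = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def corner(U: List[Instance], pick: Callable) -> Instance:
    """Component-wise min or max over the instances of U."""
    return tuple(pick(column) for column in zip(*U))


def nondominated_prob(point: Instance, others: List[List[Instance]]) -> float:
    """Probability that `point` is not dominated by any other object."""
    prob = 1.0
    for V in others:
        prob *= 1.0 - sum(1 for v in V if dominates(v, point)) / len(V)
    return prob


def bounds(name: str, objects: Dict[str, List[Instance]]) -> Tuple[float, float]:
    """(lower, upper) bounds of Pr(U) from the corners U_max and U_min."""
    U = objects[name]
    others = [V for key, V in objects.items() if key != name]
    lower = nondominated_prob(corner(U, max), others)  # Pr(U_max)
    upper = nondominated_prob(corner(U, min), others)  # Pr(U_min)
    return lower, upper


objects = {"A": [(1, 1), (3, 3)], "B": [(2, 2), (4, 4)]}
bounds_a = bounds("A", objects)  # (0.5, 1.0), and Pr(A) = 0.75 lies inside
bounds_b = bounds("B", objects)  # (0.0, 0.5), and Pr(B) = 0.25 lies inside
```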
Pruning
Rule 1. For an uncertain object $U$ and probability threshold $p$: if $\Pr(U_{\min}) < p$, then $U$ is not in the p-skyline; if $\Pr(U_{\max}) \ge p$, then $U$ is in the p-skyline.
Rule 2. For each instance $u \in U$, let $\Pr^{+}(u)$ and $\Pr^{-}(u)$ be the upper and lower bounds of $\Pr(u)$. If $\frac{1}{|U|} \sum_{u \in U} \Pr^{+}(u) < p$, then $U$ is not in the p-skyline. If $\frac{1}{|U|} \sum_{u \in U} \Pr^{-}(u) \ge p$, then $U$ is in the p-skyline.
Rule 3. Let $U$ and $V$ be two different uncertain objects. If $u \in U$ and $V_{\max}$ dominates $u$, then $\Pr(u) = 0$.
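Because the skyline probability of an object is the average of its instances' skyline probabilities, averaging the per-instance bounds gives bounds on the object, which is what Rule 2 tests against the threshold. A minimal sketch of that three-way decision; the function name, the return labels, and the sample bounds are hypothetical.

```python
from typing import List, Tuple


def rule2_decision(instance_bounds: List[Tuple[float, float]], p: float) -> str:
    """Classify an object from per-instance (lower, upper) probability bounds.

    Returns "out" if the object cannot reach the threshold, "in" if it
    certainly does, and "refine" if the bounds are still too loose.
    """
    count = len(instance_bounds)
    lower = sum(lo for lo, _ in instance_bounds) / count
    upper = sum(hi for _, hi in instance_bounds) / count
    if upper < p:
        return "out"
    if lower >= p:
        return "in"
    return "refine"


pruned = rule2_decision([(0.1, 0.3), (0.0, 0.2)], p=0.5)      # "out"
accepted = rule2_decision([(0.6, 0.9), (0.5, 0.8)], p=0.5)    # "in"
undecided = rule2_decision([(0.2, 0.9), (0.3, 0.8)], p=0.5)   # "refine"
```

Objects classified as "refine" are the ones whose bounds need tightening in a later iteration.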
Pruning
Rule 4. Let $U$ and $V$ be two uncertain objects and $U' \subseteq U$ be a subset of instances of $U$ such that $U'_{\max}$ dominates $V_{\min}$. If $|U'| / |U| > 1 - p$, then $\Pr(V) < p$ and thus $V$ is not in the p-skyline.
Refinement Partition instances into layers
Algorithm summary
Complexity: $O(W_{total} \cdot R)$
- $W_{total}$: number of instances whose skyline probabilities are computed by the algorithm
- $R$: average cost of querying the local R-tree of possible dominating objects
$W_{total}$ is much smaller than the total number of instances.
Top-down algorithm: see the paper.
Lazy Maintenance of Materialized Views Based on the VLDB’07 paper of Jingren Zhou et al
Eager and Deferred Materialized View Maintenance
View V is defined over base tables T1 and T2.
Eager:
  User tran: {upd(T1), upd(T2)}
  Executed: {upd(T1), upd(T2), recomp(V)}
Deferred:
  User tran: {upd(T1), upd(T2)}
  Executed: {upd(T1), upd(T2)}
  …
  User tran: {recomp(V)}
  …
  User tran: {Q(V)}
  Executed: {Q(V)}
Lazy Materialized View Maintenance
View V is defined over base tables T1 and T2.
Lazy:
  User tran: {upd(T1), upd(T2)}
  Executed: {upd(T1), upd(T2)}
  …
  Executed: {recomp(V)}
  …
  User tran: {Q(V)}
  Executed: {Q(V)}
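The lazy schedule can be mimicked with a toy view that records deltas at update time and applies them later. This is only a sketch: the class, its methods, and the row-count view are hypothetical, and having a query apply any still-pending deltas first is an assumption of the sketch rather than something stated on the slide.

```python
class LazyCountView:
    """Toy materialized view V = number of rows in one base table."""

    def __init__(self) -> None:
        self.rows = set()    # base table contents (row ids)
        self.count = 0       # materialized view contents
        self.pending = []    # deltas not yet applied to the view
        self.log = []        # what was executed, in order

    def update(self, action: str, row_id: int) -> None:
        """User transaction: change the base table, defer view maintenance."""
        if action == "ins" and row_id not in self.rows:
            self.rows.add(row_id)
        elif action == "del" and row_id in self.rows:
            self.rows.remove(row_id)
        else:
            return  # no change to the base table, nothing to record
        self.pending.append(action)
        self.log.append(f"upd({action},{row_id})")

    def maintain(self) -> None:
        """System task: apply pending deltas incrementally (e.g. when idle)."""
        if not self.pending:
            return
        for action in self.pending:
            self.count += 1 if action == "ins" else -1
        self.pending.clear()
        self.log.append("maintain(V)")

    def query(self) -> int:
        """User query: assumed to bring the view up to date before reading."""
        self.maintain()
        self.log.append("Q(V)")
        return self.count


view = LazyCountView()
view.update("ins", 1)
view.update("ins", 2)
stale_count = view.count   # still 0: maintenance has not run yet
view.maintain()            # background maintenance
view.update("del", 1)
answer = view.query()      # applies the pending delete, then reads 1
```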
System architecture Based on MS SQL Server 2005
How it works
Delta tables
Table 1: {(transID_i, stmtID_i, rowID_i, action_i)}
…
Table n: {(transID_i, stmtID_i, rowID_i, action_i)}
- transID: transaction id
- stmtID: statement id
- rowID: updated row id
- action = (ins|del)
All “update” actions are converted into pairs of del/ins actions.
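A delta table row and the update-to-del/ins conversion might be modeled as follows. This is a hypothetical in-memory sketch: the field names mirror the slide, while the `DeltaRow` type and the `log_update` helper are illustrative and not part of SQL Server.

```python
from typing import List, NamedTuple


class DeltaRow(NamedTuple):
    """One delta table entry, with the four fields listed on the slide."""
    trans_id: int
    stmt_id: int
    row_id: int
    action: str  # "ins" or "del"


def log_update(trans_id: int, stmt_id: int, row_id: int) -> List[DeltaRow]:
    """Record an update of one row as a delete followed by an insert."""
    return [
        DeltaRow(trans_id, stmt_id, row_id, "del"),
        DeltaRow(trans_id, stmt_id, row_id, "ins"),
    ]


# Hypothetical ids: statement 1 of transaction 7 updates row 42.
delta_table = log_update(trans_id=7, stmt_id=1, row_id=42)
```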
Maintenance and its optimization
- A maintenance task is created for each view affected by a transaction
- Views are updated incrementally using Delta tables
- “Smart” maintenance task scheduler
  - Maintenance tasks are scheduled as low-priority jobs
  - Maintenance tasks are combined using the Condense operator
  - A proper time slot is allocated for each task
Delta stream Condense operator
Intuition: Tran: {A:=1, …, A:=2, …, A:=3} => {…, A:=3}
Operator definition
INS/INS condense: {ins_1(row_a), …, ins_k(row_a)} => {…, ins_k(row_a)}
INS/DEL condense: {ins_1(row_a), …, del_k(row_a)} => {…}
DEL/DEL condense: {del_1(row_a), …, del_k(row_a)} => {…, del_k(row_a)}
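The three condense rules can be sketched as a single pass over a delta stream that compares each action with the last surviving action on the same row. The function and the `(action, row)` tuple encoding are hypothetical; the slide does not define a DEL/INS case, so this sketch assumes such a pair is kept unchanged.

```python
from typing import Dict, List, Tuple

Delta = Tuple[str, str]  # (action, row id), action is "ins" or "del"


def condense(stream: List[Delta]) -> List[Delta]:
    """Drop delta entries made redundant by later entries on the same row."""
    kept: Dict[str, List[Tuple[int, str]]] = {}  # row -> [(position, action)]
    for pos, (action, row) in enumerate(stream):
        entries = kept.setdefault(row, [])
        last = entries[-1][1] if entries else None
        if last == "ins" and action == "del":
            entries.pop()                    # INS/DEL: both disappear
        elif last == action:
            entries[-1] = (pos, action)      # INS/INS or DEL/DEL: keep the later one
        else:
            entries.append((pos, action))    # first entry, or DEL/INS (assumed kept)
    survivors = [(pos, action, row) for row, es in kept.items() for pos, action in es]
    return [(action, row) for _, action, row in sorted(survivors)]


ins_ins = condense([("ins", "a"), ("ins", "a")])                  # [("ins", "a")]
ins_del = condense([("ins", "a"), ("ins", "b"), ("del", "a")])    # [("ins", "b")]
del_del = condense([("del", "a"), ("del", "a")])                  # [("del", "a")]
# Two successive updates of row a, each logged as del/ins, collapse to one pair.
updates = condense([("del", "a"), ("ins", "a"), ("del", "a"), ("ins", "a")])
```

Survivors are emitted in the order of their positions in the original stream, so entries for different rows keep their relative order.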
Performance results
- Response time is low
- Query response time is low
- Maintenance cost compared with eager view update cost
- Overhead is low