Tuning the top-k view update process
Eftychia Baikousi, Panos Vassiliadis
University of Ioannina, Dept. of Computer Science

Forecast
- Problem: maintaining materialized top-k views when updates occur in the base relation.
- Extra difficulty: addressing the problem in the presence of high deletion rates.
- The crux of the approach is to materialize an appropriate number of extra tuples, kcomp, to sustain deletion rates that are drastically higher than average.
- The correct estimation and fine tuning of kcomp is not obvious; we use appropriate probabilistic methods.
M-Pref 2007, Vienna 23/9/2007

Contents
- Motivation & Problem Definition
- Overview of our Method
- Computation of rates affecting the view
- Computation of kcomp
- Fine tuning kcomp
- Experiments
- Conclusions

Top-k query
Given a relation R (id, x1, x2, x3) and a query Q = sum(x1, x2, x3), find the k tuples with the highest grades according to Q.

R
id   x1, x2, x3      sum
a    0.3, 0.6, 0.7   1.6
b    0.2, 0.4        0.9
c    0.5, 0.9        1.8
d    0.1             1.4

Top-2 tuples: c (1.8) and a (1.6).

Motivating Example: Shopping Center
- Customers sign in with a palmtop (PDA).
- Need for advertisements: special offers to customers.
Given
- a relation Customers (id, name, age, salary, …),
- a materialized view V of the top-2 customers (younger and highly paid) according to the query Q = -age + 2*salary,
maintain the view V as customers sign in and out (e.g., train departures, working hours).

Customers
id  name   age  salary  Q
1   John   18   20      22
2   Mary   42   25      8
3   Bill   26   35      44
4   Peter  57   37      17

V
name  Q
Bill  44
John  22

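The view V of this example can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper names `q` and `top_k_view` are hypothetical and not from the slides; the data and the scoring function Q = -age + 2*salary are taken from the slide.

```python
import heapq

def q(customer):
    """Score from the slide: Q = -age + 2*salary."""
    return -customer["age"] + 2 * customer["salary"]

def top_k_view(relation, k, score):
    """Return the k highest-scoring (name, score) pairs, best first."""
    scored = ((row["name"], score(row)) for row in relation)
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])

# The Customers relation as listed on the slide.
customers = [
    {"id": 1, "name": "John", "age": 18, "salary": 20},
    {"id": 2, "name": "Mary", "age": 42, "salary": 25},
    {"id": 3, "name": "Bill", "age": 26, "salary": 35},
    {"id": 4, "name": "Peter", "age": 57, "salary": 37},
]

view = top_k_view(customers, 2, q)
print(view)  # [('Bill', 44), ('John', 22)]
```

Recomputing the view from scratch like this is the expensive refill operation that a materialized view tries to avoid when the base relation changes.
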
Problem definition
Given
- a base relation R (ID, X, Y) that originally contains N tuples,
- a materialized view V that contains the top-k tuples in the form (id, val), where val is the score according to a function Q(x, y) = ax + by and a, b are constant parameters,
- the update ratios ins, del and upd for insertions, deletions and updates, respectively, over the base relation R,
Compute
- kcomp, of the form kcomp = k + Δk,
Such that
- the view will contain at least k tuples, k ≤ kcomp, with probability p, after a period T.
Figure: the view V (id, Q) holds k tuples plus Δk extra tuples, kcomp in total.

Related Work
Ke Yi, Hai Yu, Jun Yang, Gangqiang Xia, Yuguo Chen: “Efficient Maintenance of Materialized Top-k Views”, ICDE ’03
- Maintains a materialized top-k view when updates occur in the base table.
- Computes a kmax (instead of the necessary k), adjusted at runtime, so that a refill query is rarely needed.
- Formulates the problem through a random walk model.
- The method is theoretically guaranteed to work well only when the probabilities of insertions and deletions are equal (pins = pdel) or insertions are more frequent than deletions (pins > pdel).
- There is no quality-of-service guarantee when deletions are more probable than insertions (pins < pdel).

Motivating Example
- Customers sign in and out due to train departures and working hours.
- At certain time periods, deletions are more probable than insertions (pins < pdel).
- The view will not contain at least k tuples.

Customers
id  name   age  salary  Q
1   John   18   20      22
2   Mary   42   25      8
3   Bill   26   35      44
4   Peter  57   37      17

V
name  Q
Bill  44
John  22

Overview of the method
1. Compute the ratios of the incoming source updates that affect the view
2. Compute kcomp
3. Fine tune kcomp

Empirical Cumulative Distribution Function (ECDF)
- The ECDF is a non-parametric cumulative distribution function that adapts itself to the data.
- Definition: Fn(x) represents the proportion of observations in a sample that are less than or equal to x.
- It assigns the probability 1/n to each of the n observations in the sample.
- It estimates the true population proportion F(x).

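The definition of Fn(x) translates directly into code. The following is a minimal sketch, assuming the sample fits in memory; the helper name `make_ecdf` is hypothetical, and the sample is the list of Q scores from the Customers example.

```python
from bisect import bisect_right

def make_ecdf(sample):
    """Return F_n, where F_n(x) is the fraction of observations <= x."""
    ordered = sorted(sample)
    n = len(ordered)

    def ecdf(x):
        # bisect_right counts how many observations are <= x.
        return bisect_right(ordered, x) / n

    return ecdf

# Q scores of the four Customers tuples.
F = make_ecdf([22, 8, 44, 17])
print(F(17))  # 0.5: two of the four scores are <= 17
print(F(44))  # 1.0: every score is <= 44
```

Each observation contributes a step of height 1/n, so 1 - Fn(x) gives the observed fraction of scores strictly above x.
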
Computation of update rates that affect V
Given
- a relation Customers (id, name, age, salary, …) having N=4 tuples,
- a materialized view V containing the top-2 tuples (k=2) in the form (id, Q), where Q = -age + 2*salary is the score,
- update ratios ins=1, del=2, upd=0,
find ins_aff and del_aff (the insertions and deletions affecting the view).

Customers
id  name   age  salary  Q
1   John   18   20      22
2   Mary   42   25      8
3   Bill   26   35      44
4   Peter  57   37      17

V
name  Q
Bill  44
John  22

Computation of update rates that affect V
Given N=4, ins=1, del=2, upd=0, we compute the following:
- the update rates, where updates are treated as a combination of deletions and insertions,
- from the ECDF, the probability of a new tuple affecting the view,
- the ratios ins_aff and del_aff affecting the view.

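One possible way to carry out these three steps is sketched below. The slide formulas are not reproduced here, so two parts are assumptions made purely for illustration: an update is counted as one deletion plus one insertion, and deletions are assumed to hit tuples uniformly at random, so a deleted tuple belongs to the view with probability k/N. The insertion probability follows the ECDF idea, p(z > valk) = 1 - Fn(valk). The function name `affecting_rates` is hypothetical.

```python
from bisect import bisect_right

def affecting_rates(scores, k, ins, dele, upd):
    """Estimate (ins_aff, del_aff) for a top-k view over `scores`.

    Assumptions (not taken from the slides): updates count as one
    deletion plus one insertion, and deletions are uniform over R.
    """
    ordered = sorted(scores)
    n = len(ordered)
    val_k = ordered[-k]                      # score of the k-th best tuple
    # ECDF-based probability that a new tuple scores above val_k.
    p_ins = 1 - bisect_right(ordered, val_k) / n
    # Assumed probability that a deleted tuple is one of the k view tuples.
    p_del = k / n
    ins_total = ins + upd
    del_total = dele + upd
    return ins_total * p_ins, del_total * p_del

# Customers example: N=4, k=2, ins=1, del=2, upd=0.
ins_aff, del_aff = affecting_rates([22, 8, 44, 17], k=2, ins=1, dele=2, upd=0)
print(ins_aff, del_aff)  # 0.25 1.0
```

With only four tuples the ECDF is very coarse; in practice it would be built over a much larger sample of scores.
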
Computation of kcomp
Compute kcomp, of the form kcomp = k + Δk, such that it guarantees that the view will contain at least k tuples, k ≤ kcomp, with probability p, after a period of operation T.

Customers
id  name   age  salary  Q
1   John   18   20      22
2   Mary   42   25      8
3   Bill   26   35      44
4   Peter  57   37      17

V (kcomp = 3 tuples: k = 2 plus Δk = 1)
name   Q
Bill   44
John   22
Peter  17

Computation of kcomp
There is 1 insertion and 2 deletions affecting the view: tuple (5, Kate, 25, 30) is inserted, and tuples (3, Bill, 26, 35) and (4, Peter, 57, 37) are deleted from the view.

Customers (scores after the insertion)
id  name   Q
1   John   22
2   Mary   8
3   Bill   44
4   Peter  17
5   Kate   25

V (after the insertion, before the deletions)
name   Q
Bill   44
Kate   25
John   22
Peter  17

After the two deletions, the view still contains 2 tuples (Kate and John), as initially needed.

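The example can be replayed in a few lines. This is a sketch of the bookkeeping only, not of the authors' maintenance algorithm; the function name `apply_updates` is hypothetical, and the view is modelled as a plain dictionary from name to score.

```python
def apply_updates(view, inserts, deletes):
    """Apply deletions and insertions to a view; return it sorted by score."""
    updated = {name: q for name, q in view.items() if name not in deletes}
    updated.update(inserts)
    return sorted(updated.items(), key=lambda item: item[1], reverse=True)

k = 2
# The view materialized with kcomp = 3 tuples (k = 2 plus one extra tuple).
view = {"Bill": 44, "John": 22, "Peter": 17}

# One insertion and two deletions affect the view, as on the slide.
after = apply_updates(view, inserts={"Kate": 25}, deletes={"Bill", "Peter"})
print(after)            # [('Kate', 25), ('John', 22)]
print(len(after) >= k)  # True: no refill query is needed
```

Had only k = 2 tuples (Bill and John) been materialized, the same updates would leave Kate and John as well, but a second deletion inside the view without a compensating insertion would drop it below k; the extra tuple is what absorbs such cases.
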
Fine tune kcomp
- kcomp is expressed as a formula depending on ins_aff and del_aff, the ratios of insertions and deletions affecting the view.
- The probability of a tuple affecting the view may vary according to probabilistic properties.
- Fine tune kcomp by adding the appropriate variance.

Fine tune kcomp
- The probability of a new tuple z affecting the view is p(z > valk).
- Each insertion is a Bernoulli experiment with 2 possible events:
  - the new tuple z affects the view, with probability p(z);
  - the new tuple z does not affect the view, with probability 1 - p(z).
- ins insertions in the base relation correspond to ins Bernoulli experiments.
- The number of successes of ins Bernoulli experiments follows a Binomial distribution with variance VARins = ins * p(z) * (1 - p(z)).

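The variance of a Binomial distribution with n trials and success probability p is n * p * (1 - p). A small sketch follows; the function name `binomial_variance` and the sample numbers (1000 insertions, each affecting the view with probability 0.01) are illustrative assumptions, not values from the slides.

```python
def binomial_variance(n, p):
    """Variance of the number of successes in n Bernoulli(p) experiments."""
    if n < 0 or not 0.0 <= p <= 1.0:
        raise ValueError("need n >= 0 and 0 <= p <= 1")
    return n * p * (1.0 - p)

# Hypothetical numbers: 1000 insertions, each affecting the view w.p. 0.01.
var_ins = binomial_variance(1000, 0.01)
std_ins = var_ins ** 0.5
print(var_ins, std_ins)
```

The expected number of affecting insertions here is 1000 * 0.01 = 10, with a standard deviation of roughly 3, which is the kind of fluctuation the fine tuning step has to cover.
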
Fine tune kcomp
In the worst case, in order to guarantee with 95% confidence that the view will contain at least k tuples, kcomp is computed from the ratios affecting the view and their variances, where
- VARins denotes the variance of the insertions,
- VARdel denotes the variance of the deletions.

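The exact formula is not reproduced in this text, so the following is only a hedged sketch of how such a worst-case adjustment could look. It assumes a normal approximation with z = 1.96 for 95% confidence, takes the upper bound of the affecting deletions and the lower bound of the affecting insertions, and rounds the difference up. The function name `kcomp_sketch`, the choice of z, and the way the terms are combined are assumptions, not the authors' formula.

```python
import math

def kcomp_sketch(k, ins_aff, del_aff, var_ins, var_del, z=1.96):
    """Illustrative worst-case kcomp = k + delta_k (assumed form)."""
    # Worst case: as many affecting deletions as plausible ...
    worst_del = del_aff + z * math.sqrt(var_del)
    # ... and as few affecting insertions as plausible (never negative).
    worst_ins = max(0.0, ins_aff - z * math.sqrt(var_ins))
    delta_k = max(0, math.ceil(worst_del - worst_ins))
    return k + delta_k

# With zero variance this reduces to the running example:
# k = 2, one affecting insertion, two affecting deletions.
print(kcomp_sketch(2, ins_aff=1, del_aff=2, var_ins=0, var_del=0))  # 3
```

With zero variance the sketch yields kcomp = 3, which matches the three-tuple view of the running example; non-zero variances only increase the number of extra tuples.
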
Experimental methodology
Test the following methods:
- kcomp without fine tuning
- kcomp with fine tuning
- Yi et al. @ ICDE03
For the following measures:
- number of tuples (# tuples) deleted from the view that fall below the threshold value of k,
- memory overhead for kcomp with and without fine tuning, as the number of extra tuples that need to be kept in the view,
- number of extra tuples for kcomp with and without fine tuning, compared to the number of extra tuples of the related work.

Experimental methodology
Experimental parameters:

Parameter                                   Symbol  Values
Size of source table R (tuples)             |R|     1x10^5, 5x10^5, 1x10^6, 2x10^6
Size of mat. view (tuples)                  k       5, 10, 100, 1000
Size of update stream (pct over |R|)                1/1000, 1/100
Deletion rate over insertion rate (ratio)   D/I     1.0, 1.5, 2.0

Synthetic data sets:
- Gaussian distribution with mean μ=50 and variance σ=10
- Negative exponential distribution with parameters a=1.0 for X and a=2.0 for Y
- Zipf distribution with parameter a=2.1

Max & average misses
[Figure: kcomp without fine tuning, Gaussian distribution. Panels: as a function of R; as a function of k and D/I.]

Memory overhead
[Figure: number of extra tuples as a function of R and D/I.]

Comparison with related work
[Figure: number of extra tuples of kcomp with fine tuning, compared with kmax of the related work, as a function of R.]

Comparison with related work
[Figure: number of extra tuples of kcomp with fine tuning, compared with kmax of the related work, as a function of k.]

Conclusions
We handled the problem of maintaining materialized top-k views in the presence of high deletion rates.
The method comprises the following steps:
- a computation of the rate that actually affects the materialized view,
- a computation of the necessary extension to k in order to handle the augmented number of deletions that occur,
- a fine tuning part that adjusts this value to take the fluctuation of its statistical properties into consideration.

Thank you for your attention!
… many thanks to our hosts!
This research was co-funded by the European Union in the framework of the program “Pythagoras II” of the “Operational Program for Education and Initial Vocational Training” of the 3rd Community Support Framework of the Hellenic Ministry of Education, funded by 25% from national sources and by 75% from the European Social Fund (ESF).

Auxiliary slides
Formulas for kcomp

Time to build top-k view in microseconds

|R|    k      Gauss     Negative exponential   Zipf
100K   5      328000    348500                 242000
100K   10     333000    345667                 239667
100K   100    335500    343000
100K   1000   395333    406000                 299500
500K   5      1650667   1715500                1216333
500K   10               1713000                1208333
500K   100    1653167   1710500                1205667
500K   1000   1736667   1796167                1291833
1M     5      3298667   3429000                2427167
1M     10     3301333   3426667                2429667
1M     100    3304000   3439500                2422167
1M     1000   3403167   3520500                2606667
2M     5      6650667   6900500                5406333
2M     10     6653167   6900833                4909000
2M     100    6747167   6906000                4906500
2M     1000   6895500   7082833                4992167