Tuning the top-k view update process

Eftychia Baikousi, Panos Vassiliadis
University of Ioannina, Dept. of Computer Science

Forecast
- Problem: maintaining materialized top-k views when updates occur in the base relation.
- Extra difficulty: addressing the problem in the presence of high deletion rates.
- The crux of the approach is to materialize kcomp tuples, i.e., k plus an appropriate number of extra tuples, to sustain deletion rates that are drastically higher than average.
- The correct estimation and fine tuning of kcomp is not obvious; we use appropriate probabilistic methods.
M-Pref 2007, Vienna 23/9/2007

Contents
- Motivation & Problem Definition
- Overview of our Method
- Computation of rates affecting the view
- Computation of kcomp
- Fine tuning kcomp
- Experiments
- Conclusions

Top-k query
Given a relation R (id, x1, x2, x3) and a query Q = sum(x1, x2, x3), find the k tuples with the highest grades according to Q.

R (… marks a value missing from the transcript)
id   x1, x2, x3       sum
a    0.3, 0.6, 0.7    1.6
b    0.2, 0.4, …      0.9
c    0.5, 0.9, …      1.8
d    0.1, …, …        1.4

Top-2 tuples: c (sum 1.8) and a (sum 1.6).

Motivating Example: Shopping Center
- Customers sign in with a palmtop (PDA).
- Need for advertisements: special offers to customers.
Given
- a relation Customers (id, name, age, salary, …),
- a materialized view V of the top-2 customers (younger and highly paid) according to the query Q = -age + 2*salary,
Maintain the view V as customers sign in and out (e.g., train departures, working hours).

Customers
id  name   age  salary  Q
1   John   18   20      22
2   Mary   42   25      8
3   Bill   26   35      44
4   Peter  57   37      17

V
name   Q
Bill   44
John   22
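As a minimal sketch of how the view V in this example could be derived, the snippet below scores each customer with Q = -age + 2*salary and keeps the two best. The function names `score` and `top_k_view` are hypothetical helpers introduced only for illustration; they are not part of the presented method.

```python
import heapq

# Customers relation from the example: (id, name, age, salary)
customers = [
    (1, "John", 18, 20),
    (2, "Mary", 42, 25),
    (3, "Bill", 26, 35),
    (4, "Peter", 57, 37),
]

def score(age, salary):
    """Scoring query of the example: Q = -age + 2*salary."""
    return -age + 2 * salary

def top_k_view(rows, k):
    """Return the k highest-scoring (name, Q) pairs, best first."""
    scored = [(name, score(age, salary)) for _, name, age, salary in rows]
    return heapq.nlargest(k, scored, key=lambda t: t[1])

view = top_k_view(customers, 2)
print(view)  # [('Bill', 44), ('John', 22)]
```

Recomputing the view from scratch like this is exactly what a maintenance scheme tries to avoid on every update; the sketch only fixes the semantics of the view.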

Problem definition
Given
- a base relation R (ID, X, Y) that originally contains N tuples,
- a materialized view V that contains the top-k tuples of the form (id, val), where val is the score according to a function Q(x, y) = ax + by and a, b are constant parameters,
- the update ratios ins, del and upd for insertions, deletions and updates, respectively, over the base relation R,
Compute
- kcomp, of the form kcomp = k + Δk,
Such that
- the view will contain at least k tuples, k ≤ kcomp, with probability p, after a period T.
[Figure: view V (id, Q) holding k tuples plus Δk extra tuples, kcomp in total]

Related Work
Ke Yi, Hai Yu, Jun Yang, Gangqiang Xia, Yuguo Chen: “Efficient Maintenance of Materialized Top-k Views”, ICDE ’03
- Maintains a materialized top-k view when updates occur in the base table.
- Computes a kmax (instead of the necessary k), adjusted at runtime, so that a refill query is rarely needed.
- Formulates the problem through a random walk model.
- The method is theoretically guaranteed to work well only when insertions and deletions are equally probable (pins = pdel) or insertions are more frequent than deletions (pins > pdel).
- There is no quality-of-service guarantee when deletions are more probable than insertions (pins < pdel).

Motivating Example
- Customers sign in and out, due to train departures, working hours.
- At certain time periods, deletions are more probable than insertions (pins < pdel).
- The view will not contain at least k tuples.

Customers
id  name   age  salary  Q
1   John   18   20      22
2   Mary   42   25      8
3   Bill   26   35      44
4   Peter  57   37      17

V
name   Q
Bill   44
John   22

Overview of the method
1. Compute the ratios of the incoming source updates that affect the view
2. Compute kcomp
3. Fine tune kcomp

Empirical Cumulative Distribution Function (ECDF)
The ECDF is a non-parametric cumulative distribution function that adapts itself to the data.
Definition: Fn(x)
- represents the proportion of observations in a sample that are less than or equal to x,
- assigns the probability 1/n to each of the n observations in the sample,
- estimates the true population proportion F(x).
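The definition of Fn(x) can be sketched in a few lines. The helper name `make_ecdf` is hypothetical, and using the four Q scores of the Customers example as the sample is only an illustrative choice, not something the slide prescribes.

```python
from bisect import bisect_right

def make_ecdf(sample):
    """Build Fn for a sample: Fn(x) = (# observations <= x) / n."""
    ordered = sorted(sample)
    n = len(ordered)

    def ecdf(x):
        # bisect_right counts how many sorted observations are <= x
        return bisect_right(ordered, x) / n

    return ecdf

# Illustrative sample: the Q scores of the four customers
scores = [22, 8, 44, 17]
F = make_ecdf(scores)
print(F(17))  # 0.5
print(F(22))  # 0.75
```

Each observation contributes a step of height 1/n, so Fn is 0 below the smallest observation and 1 at or above the largest.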

Computation of update rates that affect V
Given
- a relation Customers (id, name, age, salary, …) having N = 4 tuples,
- a materialized view V containing the top-2 tuples (k = 2) of the form (id, Q), where Q = -age + 2*salary is the score,
- update ratios ins = 1, del = 2, upd = 0,
Find ins_aff and del_aff (the insertions and deletions affecting the view).

Customers
id  name   age  salary  Q
1   John   18   20      22
2   Mary   42   25      8
3   Bill   26   35      44
4   Peter  57   37      17

V
name   Q
Bill   44
John   22

Computation of update rates that affect V
Given N = 4, ins = 1, del = 2, upd = 0, we compute the following:
- the insertions and deletions, where updates are treated as a combination of deletions and insertions,
- from the ECDF, the probability of a new tuple affecting the view,
- the ratios affecting the view.
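The insertion side of this step might be sketched as follows. Several things here are assumptions made for illustration rather than formulas taken from the slides: the ECDF is built from the current Q scores, the probability of a new tuple affecting the view is estimated as 1 - Fn(val_k), and ins_aff is read as an expected count (insertions times that probability). The names `split_updates` and `prob_affecting_view` are hypothetical, and the deletion side is deliberately left out.

```python
from bisect import bisect_right

def split_updates(ins, dele, upd):
    """Treat each update as one deletion plus one insertion."""
    return ins + upd, dele + upd

def prob_affecting_view(scores, k):
    """Estimate P(z > val_k) as 1 - Fn(val_k), with Fn the ECDF of scores."""
    ordered = sorted(scores)
    val_k = ordered[-k]  # score of the k-th best tuple
    return 1.0 - bisect_right(ordered, val_k) / len(ordered)

# Running example: N = 4 scores, k = 2, ins = 1, del = 2, upd = 0
scores = [22, 8, 44, 17]
ins_total, del_total = split_updates(ins=1, dele=2, upd=0)
p = prob_affecting_view(scores, k=2)

# Assumption: ins_aff is the expected number of insertions entering the view
ins_aff = ins_total * p
print(p, ins_aff)  # 0.25 0.25
```

With such a tiny sample the ECDF is coarse; on a realistic relation the estimate of P(z > val_k) would be much finer.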

Computation of kcomp
Compute kcomp, of the form kcomp = k + Δk, such that it guarantees that the view will contain at least k tuples, k ≤ kcomp, with probability p, after a period of operation T.

Customers
id  name   age  salary  Q
1   John   18   20      22
2   Mary   42   25      8
3   Bill   26   35      44
4   Peter  57   37      17

V (k = 2 tuples plus Δk = 1 extra tuple)
name   Q
Bill   44
John   22
Peter  17

Computation of kcomp

Customers (… marks a value missing from the transcript)
id  name   age  salary  Q
1   John   18   20      22
2   Mary   42   25      8
3   Bill   26   35      44
4   Peter  57   37      17
5   Kate   …    30      25

V
name   Q
Bill   44
Kate   25
John   22
Peter  17

- There is 1 insertion and 2 deletions affecting the view.
- Tuple (5, Kate, 25, 30) is inserted, and tuples (3, Bill, 26, 35) and (4, Peter, 57, 37) are deleted from the view.
- The view will contain 2 tuples, as initially needed.
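The example can be replayed with a small simulation. This is only a sketch of the bookkeeping, assuming the view is kept as a list of (name, Q) pairs and using the Q values shown in the view; `apply_updates` is a hypothetical helper, not the authors' maintenance algorithm.

```python
def apply_updates(view, inserts, deleted_names):
    """Apply insertions, then deletions, and return the view sorted by score."""
    merged = view + inserts
    remaining = [(name, q) for name, q in merged if name not in deleted_names]
    return sorted(remaining, key=lambda t: t[1], reverse=True)

k = 2
# View materialized with kcomp = 3 tuples (k = 2 plus one extra tuple)
view = [("Bill", 44), ("John", 22), ("Peter", 17)]

# One insertion (Kate) and two deletions (Bill, Peter) affect the view
after = apply_updates(view, [("Kate", 25)], {"Bill", "Peter"})
print(after)  # [('Kate', 25), ('John', 22)]

# The view still holds at least k tuples, so no refill query is needed
assert len(after) >= k
```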

Fine tune kcomp
- kcomp is expressed as a formula depending on ins_aff and del_aff, the ratios of insertions and deletions affecting the view.
- The probability of a tuple affecting the view may vary according to probabilistic properties.
- Fine tune kcomp by adding the appropriate variance.

Fine tune kcomp
- The probability of a new tuple z affecting the view is p(z > valk).
- Each insertion is a Bernoulli experiment with 2 possible events:
  - the new tuple z affects the view, with probability p(z),
  - the new tuple z does not affect the view, with probability 1 - p(z).
- ins insertions in the base relation correspond to ins Bernoulli experiments.
- The number of successes of ins Bernoulli experiments follows a Binomial distribution, whose variance is ins * p(z) * (1 - p(z)).
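The variance of a Binomial count is n * p * (1 - p), a standard result. The sketch below computes it in closed form and cross-checks it against the probability mass function; the values n = 100 and p = 0.25 are hypothetical and chosen only for illustration.

```python
from math import comb

def binomial_variance(n, p):
    """Closed-form variance of the number of successes in n Bernoulli trials."""
    return n * p * (1 - p)

def binomial_variance_from_pmf(n, p):
    """Same variance, computed directly from the Binomial probabilities."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    mean = sum(i * q for i, q in enumerate(pmf))
    return sum((i - mean) ** 2 * q for i, q in enumerate(pmf))

# Hypothetical values: 100 insertions, each affecting the view with p = 0.25
n, p = 100, 0.25
print(binomial_variance(n, p))  # 18.75
```

The same function applies to deletions by substituting the number of deletions and their probability of hitting the view.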

Fine tune kcomp
In the worst case, in order to guarantee that the view will contain at least k tuples with 95% confidence, kcomp is computed as:
[Formula not included in the transcript]
- VARins denotes the variance of the insertions
- VARdel denotes the variance of the deletions

Experimental methodology
Test the following methods:
- kcomp without fine tuning
- kcomp with fine tuning
- Yi et al. @ ICDE03
For the following measures:
- Number of tuples (# tuples) deleted from the view that fall below the threshold value of k
- Memory overhead for kcomp with & without fine tuning, as the number of extra tuples that need to be kept in the view
- Number of extra tuples for kcomp with & without fine tuning, compared to the number of extra tuples of the related work

Experimental methodology
Experimental parameters:
- Size of source table R (tuples), |R|: 1x10^5, 5x10^5, 1x10^6, 2x10^6
- Size of materialized view (tuples), k: 5, 10, 100, 1000
- Size of update stream (pct over |R|): 1/1000, 1/100
- Deletion rate over insertion rate (ratio), D/I: 1.0, 1.5, 2.0
Synthetic data sets:
- Gaussian distribution with mean μ=50 and variance σ=10
- Negative exponential distribution with parameters a=1.0 for X and a=2.0 for Y
- Zipf distribution with parameter a=2.1

Max & average misses: kcomp without fine tuning, Gaussian distribution
[Figures: misses as a function of R and the size of the update stream; misses as a function of k and D/I]

Memory overhead
[Figure: number of extra tuples as a function of R and D/I]

Comparison with related work
[Figure: number of extra tuples of kcomp with fine tuning, compared with kmax of the related work, as a function of R]

Comparison with related work
[Figure: number of extra tuples of kcomp with fine tuning, compared with kmax of the related work, as a function of k]

Conclusions
We handled the problem of maintaining materialized top-k views in the presence of high deletion rates.
The method comprises the following steps:
- a computation of the rate that actually affects the materialized view,
- a computation of the necessary extension to k in order to handle the augmented number of deletions that occur, and
- a fine tuning part that adjusts this value to take the fluctuation of its statistical properties into consideration.

Thank you for your attention! … many thanks to our hosts!
This research was co-funded by the European Union in the framework of the program “Pythagoras II” of the “Operational Program for Education and Initial Vocational Training” of the 3rd Community Support Framework of the Hellenic Ministry of Education, funded by 25% from national sources and by 75% from the European Social Fund (ESF).

Auxiliary slides
Formulas for kcomp

Time to build top-k view in microseconds (… marks a value missing from the transcript)

|R|    k      Gauss     Negative exponential   Zipf
100K   5      328000    348500                 242000
100K   10     333000    345667                 239667
100K   100    335500    343000                 …
100K   1000   395333    406000                 299500
500K   5      1650667   1715500                1216333
500K   10     …         1713000                1208333
500K   100    1653167   1710500                1205667
500K   1000   1736667   1796167                1291833
1M     5      3298667   3429000                2427167
1M     10     3301333   3426667                2429667
1M     100    3304000   3439500                2422167
1M     1000   3403167   3520500                2606667
2M     5      6650667   6900500                5406333
2M     10     6653167   6900833                4909000
2M     100    6747167   6906000                4906500
2M     1000   6895500   7082833                4992167