
1 Answering Top-k Queries Using Views
By Gautam Das, Dimitrios Gunopulos, Nick Koudas, Dimitris Tsirogiannis
Presented by: Yesha Gupta
Reference: aitrc.kaist.ac.kr/~vldb06/slides/R13-1.ppt

2 Top-k query
A top-k query returns the k highest-ranked tuples from a relation.
Query: return the top-2 tuples for the scoring function f = 3x1 + 10x2 + 5x3 over relation R (n tuples, m attributes).
Sample rows of R (tid: X1, X2, X3): 2: (53, 19, 83), 4: (80, 22, 90), 5: (28, 8, 87)
Result, ordered by decreasing score (tid, score): (4, 910), (2, 764), (5, 599), (1, 551), (3, 107)
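As a quick sanity check on the numbers above, here is a minimal sketch of top-k scoring in Python, using only the rows of R whose three attribute values survive in the transcript:

```python
# Score each tuple with f and keep the k best (tid, score) pairs.
def topk(relation, f, k):
    scored = [(f(*attrs), tid) for tid, attrs in relation.items()]
    return [(tid, s) for s, tid in sorted(scored, reverse=True)[:k]]

# Rows of R used on this slide (tid: (X1, X2, X3))
R = {2: (53, 19, 83), 4: (80, 22, 90), 5: (28, 8, 87)}
print(topk(R, lambda x1, x2, x3: 3*x1 + 10*x2 + 5*x3, 2))
# [(4, 910), (2, 764)] -- the top-2 shown above
```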

3 Views (1/2)
Base table R with three attributes X1, X2, X3.
Top-5 query using the function f1 = 2x1 + 5x2.
Base view VX1, ordered by X1 (tid, X1): (1, 82), (4, 80), (2, 53), (9, 42), (3, 29), (5, 28), (10, 23), (8, 18), (7, 16), (6, 12)
Top-5 result for f1 (tid, score): (7, 527), (6, 299), (4, 270), (8, 246), (2, 201)
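Both kinds of views can be sketched directly. This is an illustrative implementation over a hypothetical three-row slice of R (only rows with all three attribute values legible in the transcript):

```python
# Base view: project one attribute and order tuple ids by it (descending).
def base_view(relation, attr_index):
    pairs = [(tid, attrs[attr_index]) for tid, attrs in relation.items()]
    return sorted(pairs, key=lambda p: p[1], reverse=True)

# Materialized ranking view: the top-k (tid, score) pairs for a scoring function.
def ranking_view(relation, f, k):
    scored = sorted(((f(*attrs), tid) for tid, attrs in relation.items()),
                    reverse=True)
    return [(tid, s) for s, tid in scored[:k]]

R = {2: (53, 19, 83), 4: (80, 22, 90), 5: (28, 8, 87)}
print(base_view(R, 0))       # [(4, 80), (2, 53), (5, 28)]
print(ranking_view(R, lambda x1, x2, x3: 2*x1 + 5*x2, 3))
# [(4, 270), (2, 201), (5, 96)] -- (4, 270) and (2, 201) match the slide's top-5
```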

4 Views (2/2)
Advantages of views:
- faster query response time
- a space/performance tradeoff to improve efficiency
Can we use views to answer a top-k query? Yes.
Challenges in using views:
- guaranteeing the correct top-k answer
- deciding which views to select to answer a given query

5 Ranking Queries
Top-k ranking query: SELECT TOP [k] FROM relation R WHERE <selection condition> ORDER BY <scoring function>
Expressed as a triple (f, k, s):
- f assigns a numeric score to any tuple t
- s is a selection condition for the tuples (written * when there is none)

6 Ranking Views
Materialized ranking view: the result of a previously executed ranking query; a set of k (tid, score) pairs, ordered by decreasing score.
For a relation R with m attributes, the base views VX1, ..., VXm order the tuples by each individual attribute.

7 Related work in Top-k query (1/2)
TA [Fagin et al. '01]
- deterministic stopping condition
- always returns the correct top-k set
Sorted lists over R: List1 (X1): (1, 82), (4, 80), (2, 53), (3, 29), (5, 28); List2 (X2): (4, 22), (2, 19), (5, 8), ...
Number of sorted accesses = number of random accesses
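TA itself can be sketched compactly. This is an illustrative implementation, not the paper's code; the X2 values for tids 1 and 3 are assumed for the example, and the scoring function is taken as the plain sum of the attributes:

```python
import heapq

def ta(lists, table, f, k):
    """Threshold Algorithm sketch.
    lists: one [(tid, value)] list per attribute, sorted by value descending.
    table: tid -> full attribute tuple (random access).
    f: monotone scoring function over an attribute tuple.
    Returns the top-k (score, tid) pairs, best first."""
    seen, topk = set(), []                # topk: min-heap of (score, tid)
    for depth in range(len(table)):
        last = []
        for lst in lists:                 # one sorted access per list, in lockstep
            tid, val = lst[depth]
            last.append(val)
            if tid not in seen:
                seen.add(tid)
                score = f(table[tid])     # random access on the base table
                heapq.heappush(topk, (score, tid))
                if len(topk) > k:
                    heapq.heappop(topk)
        threshold = f(tuple(last))        # best score any unseen tuple could have
        if len(topk) == k and topk[0][0] >= threshold:
            break                         # deterministic stopping condition
    return sorted(topk, reverse=True)

# Hypothetical 2-attribute instance (X2 values for tids 1 and 3 are assumed)
table = {1: (82, 5), 2: (53, 19), 3: (29, 2), 4: (80, 22), 5: (28, 8)}
lists = [sorted(((t, v[i]) for t, v in table.items()),
                key=lambda p: p[1], reverse=True) for i in range(2)]
print(ta(lists, table, lambda t: t[0] + t[1], 2))  # [(102, 4), (87, 1)]
```

Note that TA stops after three lockstep iterations here: the threshold f(53, 8) = 61 falls below the k-th buffered score 87.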

8 Related work in Top-k query (2/2)
PREFER [Hristidis et al. '01]
- a prototype system
- stores multiple copies of the base relation R as materialized views, but utilizes only one of them to answer a query
This paper complements both approaches.

9 Problems for views
Problem 1 (top-k query answering using views)
Input: a set U of views {v1, v2, ...} and a query Q
Output: the top-k set
Algorithm: LPTA
Problem 2 (view selection)
Input: a set V of views (materialized ranking views and base views) and a query Q
Output: the most efficient subset of views to execute Q on
Algorithm: SelectViews

10 Outline
LPTA
View selection problem
- conceptual discussion
- cost estimation problem
- SelectViews methods
General queries and views
Experiments
Conclusion

11 LPTA Setting
Linear additive scoring functions, e.g. fQ = 3x1 + 10x2 + 5x3 (the function of the running example)
Output: the top-2 set; Q = (fQ, 2, *)
Set of views: materialized views V1, V2
Sorted access on (tid, score) pairs; random access on the base table R
LPTA requires the scoring functions of the query and the views to be linear and additive. The algorithm uses a set of views to answer a top-k query, and the scoring function of each view can be defined on an arbitrary subset of the attributes. Each view is a set of (tuple identifier, score) pairs under the view's scoring function. LPTA requires sorted access on each view in non-increasing order of that score, and random access on the base table R.

12 LPTA execution
View V1 (tid, score): (7, 527), (6, 299), (4, 270), (8, 246), (2, 201)
View V2 (tid, score): (6, 219), (4, 202), (10, 197)
Top-k buffer for Q = (fQ, 2, *): initially empty
Unseenmax = ?

13 Calculate Unseenmax
The unseen tuples in the views must satisfy the following inequalities:
(1) 1 <= x1, x2, x3 <= 100 (the domain of each attribute of R is [1, 100])
(2) 2x1 + 5x2 <= 527 (the last score read from V1)
(3) x2 + 2x3 <= 219 (the last score read from V2)
Unseenmax is the solution of the linear program that maximizes fQ subject to these inequalities.
Unseenmax = 1338
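Numerically, this linear program can be sketched as below. The right-hand sides 527 and 219 are the first-iteration scores read from V1 and V2 in the running example (consistent with Unseenmax = 1338), and fQ = 3x1 + 10x2 + 5x3. Rather than pull in an LP solver, the sketch exploits this particular program's structure: for fixed x1 the objective is maximized by pushing x2, then x3, to their constraint bounds.

```python
# Unseen_max: maximize fQ = 3*x1 + 10*x2 + 5*x3 subject to
#   2*x1 + 5*x2 <= s1   (last score read from V1)
#   1*x2 + 2*x3 <= s2   (last score read from V2)
#   1 <= x1, x2, x3 <= 100
def unseen_max(s1, s2):
    best = 0.0
    steps = 1000
    for i in range(steps + 1):
        x1 = 1 + 99 * i / steps
        x2 = min(100.0, max(1.0, (s1 - 2 * x1) / 5))   # bound from V1's constraint
        x3 = min(100.0, max(1.0, (s2 - x2) / 2))       # bound from V2's constraint
        if 2*x1 + 5*x2 <= s1 + 1e-9 and x2 + 2*x3 <= s2 + 1e-9:
            best = max(best, 3*x1 + 10*x2 + 5*x3)
    return best

print(round(unseen_max(527, 219), 1))  # 1338.0 -- after the first iteration
print(round(unseen_max(299, 202), 1))  # 953.5 -- after the second iteration
```

A general implementation would call a real LP solver (e.g. a simplex routine) instead of relying on this closed-form scan.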

14 LPTA – General Example (1/3)
Consider a relation R(X1, X2) with two numerical attributes and domains normalized to [0, 1], and a top-1 query Q with a scoring function defined on these attributes. We also have sorted access on two views, V1 and V2, with scoring functions defined on the attributes of R. In the figure, the unit square contains all the tuples; the views and the query are represented by vectors denoting directions of increasing score. Sorted access on a view can be visualized as sweeping a line perpendicular to its vector from infinity toward the origin: the order in which the sweepline encounters tuples is exactly their order in the view. The score of a tuple with respect to the query or a view is found by projecting the tuple onto the corresponding vector.
In each iteration, LPTA sequentially accesses one tuple from each view and obtains its score under the query's scoring function by a random access on relation R. The score, with respect to the query, of the intersection point of the two sweeplines perpendicular to V1 and V2 is the maximum possible score of any tuple not yet seen in the views; finding that score yields a stopping condition for the algorithm. Because LPTA uses linear functions, the hyperplanes perpendicular to the view vectors define a convex region, and the score of the intersection point can be found by solving a linear program. The time spent solving the linear programs is a small fraction of LPTA's total running time.

15 LPTA – General Example (2/3)
LPTA is a linear programming adaptation of the TA algorithm. In each iteration d, one tuple is accessed sequentially from each view, and LPTA formulates and solves a linear program: maximize the query's scoring function subject to (i) the attributes involved in the query's scoring function lying in their active domains, (ii) the scoring function of V1 being at most the score of the last tuple seen in V1, and (iii) the scoring function of V2 being at most the score of the last tuple seen in V2. The solution of the linear program gives the maximum score of any unseen tuple; this is the stopping condition. The algorithm stops once k seen tuples have scores at or above that value.

16 LPTA – General Example (3/3)
As seen before, the scores of the two seen tuples are below the score of the intersection point, so the algorithm must continue. In the next iteration it accesses two more tuples from the views and finds their scores with respect to the query. The lines perpendicular to V1 and V2 define a new intersection point, which is found by solving a new linear program. Now one tuple has a score above the score of the intersection point, which means the stopping condition has been reached and the algorithm terminates. One thing to note: solving a linear program is an expensive task, but as the experimental section shows, for up to the number of dimensions tried, the total time spent by LPTA on solving the linear programs was at most 10% of its total running time.

17 LPTA execution
View V1 (tid, score): (7, 527), (6, 299), (4, 270), (8, 246), (2, 201)
View V2 (tid, score): (6, 219), (4, 202), (10, 197)
Top-k buffer for Q = (fQ, 2, *): (7, 1248), (6, 996), so topkmin = 996
Unseenmax drops from 1338 to 953.5
Is Unseenmax <= topkmin? Yes (953.5 <= 996), so LPTA stops.

18 LPTA
LPTA becomes TA when the set of views U is the set of base views.
Execution cost: both algorithms perform sequential as well as random accesses. Every sequential access incurs a random access, so the number of sequential accesses can be taken as the running cost.
Running cost = O(dr), where d = number of lockstep iterations and r = number of views.

19 View Selection Problem (1/5)
Given a collection of views and a query Q, determine the most efficient subset to execute Q on.
Conceptual discussion: two dimensions, then higher dimensions.
Formally, the view selection problem is the following: given a collection of views and a query, determine the subset of views that results in the minimum running time for the LPTA algorithm. We first discuss the problem conceptually in two dimensions to show that it is meaningful, and then generalize it to higher dimensions.

20 View Selection – 2d (2/5)
Consider an example where we want to answer a query Q and two ranking views, V1 and V2, are available. Let M be the tuple with the k-th highest score with respect to the query's scoring function, and let AB be the line segment perpendicular to the query vector that passes through M and intersects the unit square; the top-k tuples are thus contained in the triangle cut off by AB. If LPTA uses only V1 to answer the query, the stopping condition is reached once the sweepline perpendicular to V1 crosses position A1B, and the corresponding region contains all the tuples encountered from V1. If LPTA uses only V2, an analogous region AB1 contains the tuples encountered from V2 when the stopping condition is reached. If both views are used, the stopping condition is reached when the intersection point of the sweeplines perpendicular to V1 and V2 lies on the line AB. Depending on how the tuples are distributed in the unit square, it may be more efficient to use V1, V2, or both views to answer the query.

21 View Selection - 2d (3/5)

22 View Selection - 2d (4/5)
Theorem: Let V be a set of views for a two-dimensional dataset and Q a query, and let va and vc be the view vectors closest to Q in anticlockwise and clockwise order, respectively. Then the optimal execution of LPTA requires only a subset of {va, vc}.

23 View Selection - Higher d (5/5)
Theorem: If U is a set of r views for an m-dimensional dataset and Q a query, the optimal execution of LPTA requires a subset of views of size at most m.
Question: how do we select the optimal subset of views?
For higher dimensions we proved a theorem (see the paper) stating that, given a collection of r views for an m-dimensional dataset and a query Q, the optimal execution of LPTA requires a subset of views of size at most m. The next question is how to select the optimal subset.

24 Cost Estimation Framework
What is the cost of running LPTA when a specific set of views is used to answer a query?
Cost = number of sequential accesses. In the example, LPTA uses two views and reaches the stopping condition after 3 iterations, so the cost is 6 sequential accesses.
Besides sequential and random accesses there is the cost of solving the linear programs, but for up to the number of dimensions examined that cost is at most 10% of the total running time. The dominating cost factors are therefore the sequential accesses on the views and the random accesses on the base relation R. We thus define the cost of running LPTA with a specific set of views as the total number of sequential accesses until the stopping condition is reached; because every sequential access on a view prompts a random access on the base relation, the number of sequential accesses is a precise indicator of the total cost.
Can we find that cost without actually running LPTA? We need an estimate of the score of the k-th highest tuple in order to estimate the number of sequential accesses until the stopping condition is reached.

25 Simulation of LPTA on Histograms
Equi-depth histogram HVX1 of the base view VX1, ordered by score.
View values (tid, X1): (1, 82), (4, 80), (2, 53), (9, 42), (3, 29), (5, 28), (10, 23), (8, 18), (7, 16), (6, 12)
Say each histogram has b = 5 buckets. There are n = 10 tuples in the view, so each bucket represents n/b = 10/5 = 2 data points (attribute values).
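The equi-depth construction above can be sketched in a few lines:

```python
# Split scores (sorted non-increasing) into b buckets of n/b values each.
def equi_depth(scores, b):
    scores = sorted(scores, reverse=True)
    depth = len(scores) // b            # n/b values per bucket
    return [scores[i*depth:(i+1)*depth] for i in range(b)]

HVX1 = equi_depth([82, 80, 53, 42, 29, 28, 23, 18, 16, 12], 5)
print(HVX1)  # [[82, 80], [53, 42], [29, 28], [23, 18], [16, 12]]
```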

26 Simulation of LPTA on Histograms
Estimate the query score distribution by convolving the histograms of the attributes' marginal distributions; domain [0, 1].
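A minimal sketch of the convolution step, using tiny hypothetical per-attribute score distributions (mapping score contribution to probability):

```python
from collections import Counter

# Discrete convolution: the distribution of the sum of two independent
# per-attribute score contributions.
def convolve(h1, h2):
    out = Counter()
    for s1, p1 in h1.items():
        for s2, p2 in h2.items():
            out[s1 + s2] += p1 * p2
    return dict(out)

hx = {0: 0.5, 1: 0.5}   # hypothetical marginal for one attribute's term
hy = {0: 0.5, 1: 0.5}
print(convolve(hx, hy))  # {0: 0.25, 1: 0.5, 2: 0.25}
```

In the actual framework the inputs are the equi-depth bucket histograms of each attribute's marginal distribution, and the result approximates HQ.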

27 Simulation of LPTA on Histograms
HQ approximates the score distribution of the query Q.
Use HQ to estimate the score of the k-th highest tuple (topkmin), then simulate LPTA on the view histograms HV1 and HV2 bucket by bucket, in lockstep, to estimate the cost.
We assume a b-bucket equi-depth histogram of the score distribution of each view; these can be constructed on demand or when the view is materialized. The query's score distribution is approximated by convolving the histograms that represent the marginal distributions along each attribute, and from it we estimate topkmin. Given that score, we simulate LPTA by walking down the view histograms bucket by bucket in lockstep, using the values at the bucket boundaries, until the stopping condition is reached. This procedure is cheap because one simulated iteration covers n/b tuples; the estimated cost is the number of buckets accessed multiplied by the number of tuples per bucket. Because each bucket holds many tuples, we further examine only the last bucket of each view, using linear interpolation on the scores to determine how many of its tuples are needed for the stopping condition; this refines the cost estimate. The estimation error is due entirely to the uniformity assumption on the last bucket accessed from each view; all earlier buckets contribute zero error.
Refine the cost estimation in the last bucket using linear interpolation. b buckets, n/b tuples per bucket.

28 Estimate cost
Number of buckets visited: d = 3
Number of views: r' = 2
Number of tuples per bucket: n/b = 56/7 = 8
Number of tuples in the last scanned bucket: n' = 3
Number of sorted accesses = ((d-1) * n/b + n') * r' = ((3-1) * 8 + 3) * 2 = 38
Running time = O((d-1) + log n')
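The arithmetic above as a one-liner:

```python
# (d-1) full buckets of n/b tuples each, plus n' tuples from the last
# scanned bucket, summed over r' views.
def estimated_sorted_accesses(d, tuples_per_bucket, n_last, r):
    return ((d - 1) * tuples_per_bucket + n_last) * r

print(estimated_sorted_accesses(3, 8, 3, 2))  # 38
```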

29 View Selection Algorithms
Exhaustive (E): check all possible subsets of size at most m and select the subset of views with the smallest estimated cost.
Greedy (SV): keep expanding the set of views to use until the estimated cost stops decreasing.
An obvious algorithm for the view selection problem is to estimate the cost of every subset of size at most m, where m is the number of attributes in the relation, and select the subset with the smallest cost; we call this algorithm Exhaustive. For large values of m, a simple greedy heuristic can be used instead.

30 Select Views Spherical (SVS)
Requires the solution of a single linear program.
SVS is a heuristic inspired by the well-behaved nature of uniform data distributions, and it is very cheap since it requires solving a single linear program. Fix any valid score value s and solve the linear program that maximizes the query's scoring function subject to the constraint that the scoring function of each view in the collection is at most s. Let T be the vertex of the resulting convex region that maximizes the query's scoring function. In the special case where the data are uniformly distributed in the m-dimensional hypersphere, the views closest to the query are exactly those whose hyperplanes intersect at T; since T is an m-dimensional point, it is defined by the intersection of exactly m hyperplanes. Inspired by this behavior, SVS selects the m views that intersect at T. Clearly, this procedure works well only for restricted data distributions.

31 Select Views By Angle (SVA)
Select Views by Angle (SVA): sort the view vectors by increasing angle with respect to the query vector Q and return the top-m views. The intuition is that the views closest in angle to the query will result in the minimum running time for the algorithm.
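SVA can be sketched directly; the query coefficients and view vectors below are hypothetical:

```python
import math

# Rank view vectors by the angle they form with the query vector
# (larger cosine similarity = smaller angle) and keep the top m.
def select_views_by_angle(query, views, m):
    def cos_sim(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = lambda w: math.sqrt(sum(a * a for a in w))
        return dot / (norm(u) * norm(v))
    return sorted(views, key=lambda v: cos_sim(query, v), reverse=True)[:m]

q = (3, 10)                               # hypothetical query coefficients
views = [(2, 5), (1, 0), (0, 1), (4, 1)]  # hypothetical view vectors
print(select_views_by_angle(q, views, 2))  # [(2, 5), (0, 1)]
```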

32 General Queries and Views
Views that materialize only their top-k tuples:
- convolve the view histograms
- truncate the view histograms
- run EstimateCost()
Accommodating range conditions:
- select the views that cover the range conditions
- truncate each attribute's histogram
- convolve the histograms
We mentioned at the start that a view maintains an ordering of all the tuples. If views materialize only their top-k tuples, the cost estimation procedure and the view selection algorithms must be modified. Specifically, if the view histograms are constructed on the fly, we truncate them so they represent the distribution of only the top-k tuples; if, during the simulation of LPTA, a view's histogram is exhausted before the stopping condition is reached, a cost of infinity is returned. We also propose a simple yet effective solution for accommodating range conditions: first select the set of views such that each view covers the range conditions of the query, and then, to estimate the query's score distribution, truncate the attribute histograms so they represent the distribution of the values within the range conditions.
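The histogram-truncation step can be sketched as follows; this is a minimal sketch assuming equi-depth buckets with the 2-values-per-bucket layout of the earlier histogram example:

```python
import math

# Truncate an equi-depth view histogram so it covers only the view's
# top-k tuples (depth = n/b values per bucket, buckets sorted by score).
def truncate_view_histogram(buckets, depth, k):
    keep = math.ceil(k / depth)            # buckets touched by the top-k
    out = [list(b) for b in buckets[:keep]]
    extra = keep * depth - k               # values beyond the top-k
    if extra:
        out[-1] = out[-1][:depth - extra]
    return out

h = [[82, 80], [53, 42], [29, 28]]         # hypothetical 2-deep buckets
print(truncate_view_histogram(h, 2, 3))    # [[82, 80], [53]]
```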

33 Experiments
Datasets: uniform, Zipf, real
Experiments:
- performance comparison of LPTA, PREFER and TA
- accuracy of the cost estimation framework
- performance of LPTA using each of the view selection algorithms
- scalability of the LPTA algorithm
We used real and synthetic datasets. The synthetic collection includes uniform and Zipf data with varying skew parameters; the real dataset contains 30K tuples from a website specializing in automobiles. We evaluated the utility of the LPTA algorithm in terms of performance compared to PREFER and TA, measured the accuracy of the cost estimation framework, compared the performance of the view selection algorithms, and evaluated the scalability of the LPTA algorithm. A subset of the experiments is presented here; the rest can be found in the paper.

34 Performance comparison of LPTA, PREFER and TA
The left figure presents the average running time over 50 runs of the LPTA, PREFER and TA algorithms as a function of k on a 2-attribute real dataset of 30K tuples. The right figure presents the average running time over the same number of runs as a function of k on a 3-attribute uniform dataset of 500K tuples. In both cases no view selection procedure was involved: LPTA was forced to answer a randomly generated query using randomly generated views. In each 2d run, two random views and a random query were generated; PREFER used one randomly chosen view, while LPTA was forced to use both. The 3d setting used three randomly generated views. The purpose of the experiment is to show that, in expectation, LPTA performs better than both TA and PREFER, and that this result is consistent across real and uniform datasets. (Left: real dataset, 2d. Right: uniform dataset, 3d.)

35 Cost Estimation Accuracy
In this set of experiments we measured the accuracy of the cost estimation procedure. Relative error is |estimated number of sequential accesses - actual number of sequential accesses| divided by the actual number; both figures refer to 2-dimensional data with 500K tuples (Zipf skew parameter 1.0). The left figure presents the relative error as a function of k for uniform, Zipf and real datasets when the number of buckets is 0.5% of the number of tuples n; in the right figure the number of buckets is 1% of n. Three things to note. First, the relative error is below 11% in all examined cases. Second, as k increases the error decreases significantly: for a fixed number of buckets, the error comes from the uniformity assumption on the last bucket encountered before the procedure halts, and its influence is amortized as k grows. Third, as expected, increasing the number of buckets reduces the relative error.

36 Performance of LPTA using View Selection Algorithms
Here we compare the view selection algorithms. The cost in both figures is the number of sequential accesses LPTA had to perform to answer a randomly generated top-100 query using the views selected by each algorithm; this measure discriminates well between the algorithms' ability to choose the optimal set of views (the graphs look the same when time is measured, with LP time below 10%). The left figure presents the average cost over 50 runs on uniform and Zipf 2-attribute datasets of 500K tuples, with the algorithms selecting from a collection of 20 randomly generated views; the right figure presents the same results on 3-attribute datasets with a collection of 50 randomly generated views. As expected, the exhaustive algorithm is the best. The greedy algorithm performs very well, close to exhaustive; select views spherical follows next and is very cheap; select views by angle has the worst performance.

37 Scalability Experiments on LPTA
In the last set of experiments we evaluate the scalability of LPTA. The left figure presents the running time of LPTA for different values of k as a function of the number of tuples on a 2-attribute uniform dataset; the algorithm scales sub-linearly as the number of tuples grows from half a million to 5 million. The right figure presents the running time of LPTA as a function of the number of attributes on a uniform dataset of 500K tuples answering a top-100 query, along with the fraction of total time spent solving the linear programs. Two things to note: the running time increases linearly with the dimensionality, and the time spent solving the linear programs is at most 10% of the total running time for up to 10 dimensions.

38 Conclusions
Using views for top-k query answering
LPTA: a linear programming adaptation of TA
View selection problem, cost estimation framework, view selection algorithms
Experimental evaluation
We addressed the problem of answering top-k queries using multiple ranking views. We proposed LPTA, a linear programming adaptation of TA that answers top-k queries using multiple ranking views; we formulated the view selection problem and proposed a cost estimation framework as well as algorithms for solving it; and we conducted a detailed experimental evaluation on real and synthetic datasets.

39 References
Answering Top-k Queries Using Views: Gautam Das, Dimitrios Gunopulos, Nick Koudas, Dimitris Tsirogiannis
Slides: aitrc.kaist.ac.kr/~vldb06/slides/R13-1.ppt

40 Thank you 

