Harikrishnan Karunakaran Sulabha Balan CSE 6339
Introduction
Icicles
Icicle Maintenance
Icicle-Based Estimators
Quality & Performance
Conclusion
Analysis of data in data warehouses is useful in decision support
Users of decision support systems want interactive response times
OLAP: Online Analytical Processing
Approximate Query Answering (AQUA) systems were developed to reduce response time to desirable levels
Users are tolerant of approximate results
Various Approaches
Sampling-based
Histogram-based
Clustering
Probabilistic
Wavelet-based
Sales (full relation)
Branch  State  Sales
1       CA     80K
2       TX     42K
3       CA     40K
4       CA     42K
5       TX     75K
6       CA     48K
7       TX     55K
8       TX     38K
9       CA     40K
10      CA     41K

S_sales (50% sample)
Branch  State  Sales
2       TX     42K
4       CA     42K
6       CA     48K
8       TX     38K
10      CA     41K

SELECT SUM(sales) * 2 AS cnt FROM s_sales WHERE state = 'TX'
(the factor 2 is the scale factor for the 50% sample)
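The scaled query above can be checked by hand. The following is a minimal Python sketch, using only the rows shown on the slide; the function name `scaled_sum` and the tuple layout are illustrative choices, not part of the original deck.

```python
# Sales relation from the slide: (branch, state, sales in thousands).
sales = [
    (1, "CA", 80), (2, "TX", 42), (3, "CA", 40), (4, "CA", 42), (5, "TX", 75),
    (6, "CA", 48), (7, "TX", 55), (8, "TX", 38), (9, "CA", 40), (10, "CA", 41),
]

# The 50% sample shown on the slide (branches 2, 4, 6, 8, 10).
s_sales = [row for row in sales if row[0] in {2, 4, 6, 8, 10}]


def scaled_sum(rows, state, scale_factor):
    """Estimate SUM(sales) for one state by scaling the sum over `rows`."""
    return scale_factor * sum(amount for _, st, amount in rows if st == state)


scale_factor = len(sales) / len(s_sales)        # 10 / 5 = 2
estimate = scaled_sum(s_sales, "TX", scale_factor)
exact = scaled_sum(sales, "TX", 1)
print(estimate, exact)
```

The sample holds only two of the four Texas branches, so the estimate is 160K against an exact total of 210K. This gap is what motivates tuning the sample toward the tuples a workload actually touches.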
Sample relation for an aggregation query workload regarding Texas branches

Sales (full relation)
Branch  State  Sales
1       CA     80K
2       TX     42K
3       CA     40K
4       CA     42K
5       TX     75K
6       CA     48K
7       TX     55K
8       TX     38K
9       CA     40K
10      CA     41K

S_sales (sample tuned to the Texas workload)
Branch  State  Sales
2       TX     42K
4       CA     42K
5       TX     75K
7       TX     55K
8       TX     38K
All tuples in a uniform random sample are treated as equally important for answering queries
The sample needs to be tuned to contain the tuples that are most relevant to the queries in a workload
Need for a dynamic algorithm that adapts the sample to the queries being executed in the workload
Join of a uniform random sample of a fact table with a set of accompanying dimension tables

SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice)
FROM LI, C, O, S, N, R
WHERE C_Custkey = O_Custkey AND O_Orderkey = LI_Orderkey AND LI_Suppkey = S_Suppkey
  AND C_Nationkey = N_Nationkey AND N_Regionkey = R_Regionkey
  AND R_Name = 'North America'
  AND O_Orderdate ... AND O_Orderdate ...;
Any aggregate query on the fact table can be answered approximately using exactly one of a small number of synopses
A uniform random sample of the relation wastes memory
OLAP queries exhibit locality in their data access
Icicles: a class of samples that captures the data locality of aggregate queries over foreign-key joins
Identify the focus of a query workload and sample accordingly
An icicle is a uniform random sample of a multiset of tuples L, which is the union of R and all sets of tuples that were required to answer queries in the workload (an extension of R)
An icicle is therefore a non-uniform sample of the original relation R
The maintenance algorithm is efficient because:
A uniform random sample of L ensures that a tuple's probability of selection into the icicle is proportional to its frequency in L
Incremental maintenance of the icicle requires only the segment of R that satisfies the new query from the workload
It builds on the reservoir sampling algorithm
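The maintenance idea above can be sketched in Python. This is a minimal illustration using textbook reservoir sampling (Algorithm R) that keeps running across batches; it is an assumption about how the pieces fit together, not the exact procedure from the Icicles paper. The class name `Icicle`, its methods, and the three repeated Texas queries are hypothetical.

```python
import random


class Icicle:
    """Fixed-size uniform sample of the multiset L, kept by reservoir sampling."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        self.sample = []                  # current icicle contents
        self.seen = 0                     # tuples of L streamed so far, i.e. |L|
        self.rng = random.Random(seed)

    def offer(self, tup):
        """One reservoir step (Algorithm R) for a single tuple of L."""
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(tup)
        else:
            j = self.rng.randrange(self.seen)    # uniform in [0, seen)
            if j < self.capacity:
                self.sample[j] = tup

    def update(self, tuples):
        """Stream a batch: first all of R, later only each query's segment."""
        for tup in tuples:
            self.offer(tup)


# Branch ids stand in for tuples of R; the Texas branches come from the sales slide.
relation = list(range(1, 11))
texas = {2, 5, 7, 8}

icicle = Icicle(capacity=5, seed=0)
icicle.update(relation)                   # L starts as a copy of R
for _ in range(3):                        # three hypothetical Texas queries
    icicle.update([t for t in relation if t in texas])
print(icicle.seen, sorted(icicle.sample))
```

Each query contributes only its own result segment to the stream, so the cost of an update depends on the size of that segment rather than on the size of R. Tuples that are touched often appear more often in L and are therefore more likely to be held in the reservoir.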
SELECT average(*) FROM widget-tuners WHERE date.month = 'April'
Although uniform sampling is used (over L), the resulting icicle is a biased sample of R
A frequency relation is maintained over all tuples in the relation
Different estimation mechanisms are needed for Average, Count and Sum
Average: the average over the set of distinct sample tuples that satisfy the query predicate is a good estimate of the true average
Count: the sum of the expected contributions of all tuples in the sample that satisfy the query predicate
Sum: the product of the Average and Count estimates
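A minimal Python sketch of the three estimators follows. The Average and Sum rules are taken directly from the slide. The Count rule is one plausible reading of "expected contribution": each sample tuple t is weighted by |L| / (n * f(t)), the inverse of its expected number of copies in a sample of size n, and the paper's exact formula may differ. The function name, the sample, and the frequencies (Texas tuples assumed to have been used by three earlier queries) are hypothetical.

```python
def icicle_estimates(sample, freq, value, predicate):
    """Return (average, count, sum) estimates, or None if no sample tuple matches."""
    n = len(sample)
    size_l = sum(freq.values())                 # |L| = sum of all frequencies
    hits = [t for t in sample if predicate(t)]
    if not hits:
        return None
    distinct = set(hits)
    # Average: plain mean over the distinct matching sample tuples.
    avg = sum(value[t] for t in distinct) / len(distinct)
    # Count (assumed weighting): each copy of t stands for |L| / (n * f(t)) tuples of R.
    count = sum(size_l / (n * freq[t]) for t in hits)
    # Sum: product of the Average and Count estimates.
    return avg, count, avg * count


# Sales relation from the earlier slide: branch -> (state, sales in thousands).
rows = {1: ("CA", 80), 2: ("TX", 42), 3: ("CA", 40), 4: ("CA", 42), 5: ("TX", 75),
        6: ("CA", 48), 7: ("TX", 55), 8: ("TX", 38), 9: ("CA", 40), 10: ("CA", 41)}
value = {b: sales for b, (_, sales) in rows.items()}
# Hypothetical frequencies: start at 1, Texas tuples used by three queries.
freq = {b: 4 if state == "TX" else 1 for b, (state, _) in rows.items()}
sample = [2, 4, 5, 7, 8]                        # the Texas-tuned sample from the slide

avg, count, total = icicle_estimates(sample, freq, value, lambda b: rows[b][0] == "TX")
print(avg, count, total)
```

Under these assumptions the Texas average is 52.5K, the count estimate is about 4.4 (exact: 4), and the sum estimate is about 231K (exact: 210K). Scaling the same sample by the uniform factor 2 would instead give 420K, which is why a biased sample needs frequency-aware estimators.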
A frequency attribute is added to the relation
The starting frequency is set to 1 for all tuples
The frequency is incremented each time a tuple is used to answer a query
Frequencies of the relevant tuples are updated only when the icicle is updated with a new query
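The bookkeeping described above is small enough to sketch in Python. This is an illustrative sketch under assumptions: the dictionary layout, the helper name `record_query`, and the three repeated Texas queries are hypothetical, and the relation is the ten-branch sales table from the earlier slide.

```python
# Branch -> state, from the sales slide.
relation = {1: "CA", 2: "TX", 3: "CA", 4: "CA", 5: "TX",
            6: "CA", 7: "TX", 8: "TX", 9: "CA", 10: "CA"}

# Frequency attribute: every tuple starts at 1.
freq = {branch: 1 for branch in relation}


def record_query(freq, relation, predicate):
    """Increment the frequency of every tuple the query uses; return that segment."""
    segment = [branch for branch in relation if predicate(branch)]
    for branch in segment:
        freq[branch] += 1
    return segment


# Three hypothetical workload queries that all select the Texas branches.
for _ in range(3):
    segment = record_query(freq, relation, lambda b: relation[b] == "TX")

print(freq, sum(freq.values()))
```

After the three queries each Texas tuple has frequency 4, every other tuple still has frequency 1, and the frequencies add up to 22, which is the size of the multiset L that the icicle samples from. The returned segment is the only part of the relation that an icicle update would need to read.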
When queries exhibit data locality, the icicle contains more tuples from the frequently accessed subsets of the relation
The accuracy of an estimate improves as the number of sample tuples used to compute it increases
Queries that are ‘focused’ with respect to the workload therefore obtain more accurate approximate answers from the icicle
Q workload: template for generating workloads

SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice)
FROM LI, C, O, S, N, R
WHERE C_Custkey = O_Custkey AND O_Orderkey = LI_Orderkey AND LI_Suppkey = S_Suppkey
  AND C_Nationkey = N_Nationkey AND N_Regionkey = R_Regionkey
  AND R_Name = [region]
  AND O_Orderdate >= Date[startdate] AND O_Orderdate <= ...

Template for obtaining approximate answers

SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice)
FROM LICOS-icicle, N, R
WHERE C_Nationkey = N_Nationkey AND N_Regionkey = R_Regionkey
  AND R_Name = [region]
  AND O_Orderdate >= Date[startdate] AND O_Orderdate <= ...
The Error Plots for Comparison
Static: uniform random sample on the join synopsis
Icicle: the icicle as it evolves with the workload
Icicle-Complete: the icicle formed after the entire workload has been executed once
Mixed Workload
The relative error of query answers from icicles decreases rapidly when queries are focused on a set of core tuples
The Icicle plot converges to the Icicle-Complete plot
Quick convergence of the Icicle plot towards Icicle-Complete means that the icicle adapts fast
The improvement due to the use of icicles is not significant
It can be concluded that icicles are at worst as good as the static samples
Icicles provide a class of samples that adapt to the characteristics of the workload
An icicle can never be worse than a static sample
An icicle focuses on relatively small subsets of the relation
There are no significant gains in the case of a uniform workload
There is a trade-off between accuracy and cost
The benefits are restricted to scenarios in which the queries in the workload are increasingly focused
V. Ganti, M. Lee, and R. Ramakrishnan. ICICLES: Self-tuning Samples for Approximate Query Answering. VLDB Conference.
S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join Synopses for Approximate Query Answering. ACM SIGMOD Record, 1999.
Thank You Questions?