CUBE MATERIALIZATION E0 261 Jayant Haritsa

Slides:



Advertisements
Similar presentations
MCS 312: NP Completeness and Approximation algorithms Instructor Neelima Gupta
Advertisements

Approximation Algorithms Chapter 14: Rounding Applied to Set Cover.
Fast Algorithms For Hierarchical Range Histogram Constructions
Chapter 5 Fundamental Algorithm Design Techniques.
Combinatorial Algorithms
Computational Methods for Management and Economics Carla Gomes Module 8b The transportation simplex method.
Complexity ©D Moshkovitz 1 Approximation Algorithms Is Close Enough Good Enough?
Greedy vs Dynamic Programming Approach
Computational problems, algorithms, runtime, hardness
Reducing the collection of itemsets: alternative representations and combinatorial problems.
Instructor Neelima Gupta Table of Contents Lp –rounding Dual Fitting LP-Duality.
Approximation Algorithms
Zoë Abrams, Ashish Goel, Serge Plotkin Stanford University Set K-Cover Algorithms for Energy Efficient Monitoring in Wireless Sensor Networks.
Linear Programming (LP)
1 Introduction to Approximation Algorithms Lecture 15: Mar 5.
Data Cube Computation Model dependencies among the aggregates: most detailed “view” can be computed from view (product,store,quarter) by summing-up all.
1 Dr. Panagiotis Symeonidis Data Engineering Laboratory Data Warehouse implementation: Part B.
LINEAR PROGRAMMING SIMPLEX METHOD.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25, Part B.
1 Introduction to Approximation Algorithms. 2 NP-completeness Do your best then.
1 Vicky :: Cao Hui Ping Sherman :: Chow Sze Ming CTH :: Chong Tsz Ho Ronald :: Woo Lok Yan Ken :: Yiu Man Lung Implementing Data Cubes Efficiently.
Multiple Aggregations Over Data Streams Rui ZhangNational Univ. of Singapore Nick KoudasUniv. of Toronto Beng Chin OoiNational Univ. of Singapore Divesh.
Part Two: - The use of views. 1. Topics What is a View? Why Views are useful in Data Warehousing? Understand Materialised Views Understand View Maintenance.
Stochastic Protection of Confidential Information in SDB: A hybrid of Query Restriction and Data Perturbation ( to appear in Operations Research) Manuel.
OLAP : Blitzkreig Introduction 3 characteristics of OLAP cubes: Large data sets ~ Gb, Tb Expected Query : Aggregation Infrequent updates Star Schema :
Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented By Vinay Hoskere.
1 The Theory of NP-Completeness 2 Cook ’ s Theorem (1971) Prof. Cook Toronto U. Receiving Turing Award (1982) Discussing difficult problems: worst case.
COMP53311 Data Warehouse Prepared by Raymond Wong Presented by Raymond Wong
CSE 589 Part VI. Reading Skiena, Sections 5.5 and 6.8 CLR, chapter 37.
Similarity Searching in High Dimensions via Hashing Paper by: Aristides Gionis, Poitr Indyk, Rajeev Motwani.
Instructor Neelima Gupta Table of Contents Class NP Class NPC Approximation Algorithms.
Operational Research & ManagementOperations Scheduling Economic Lot Scheduling 1.Summary Machine Scheduling 2.ELSP (one item, multiple items) 3.Arbitrary.
Lecture.6. Table of Contents Lp –rounding Dual Fitting LP-Duality.
1 Approximation algorithms Algorithms and Networks 2015/2016 Hans L. Bodlaender Johan M. M. van Rooij TexPoint fonts used in EMF. Read the TexPoint manual.
Non-LP-Based Approximation Algorithms Fabrizio Grandoni IDSIA
Clustering Data Streams A presentation by George Toderici.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25.
Approximation algorithms for combinatorial allocation problems
The Theory of NP-Completeness
Approximating Set Cover
8.3.2 Constant Distance Approximations
Advanced Algorithms Analysis and Design
Algorithm Design Methods
Merge Sort 7/29/ :21 PM The Greedy Method The Greedy Method.
Haim Kaplan and Uri Zwick
Multi - Way Number Partitioning
Merge Sort 11/28/2018 2:18 AM The Greedy Method The Greedy Method.
The Greedy Method Spring 2007 The Greedy Method Merge Sort
Merge Sort 11/28/2018 8:16 AM The Greedy Method The Greedy Method.
Coverage Approximation Algorithms
Data Warehousing and Decision Support
Dynamic Programming Dynamic Programming 1/15/ :41 PM
Merge Sort 1/17/2019 3:11 AM The Greedy Method The Greedy Method.
Dynamic Programming Dynamic Programming 1/18/ :45 AM
Dynamic Programming Merge Sort 1/18/ :45 AM Spring 2007
One-Pass Algorithms for Database Operations (15.2)
BITMAP INDEXES E0 261 Jayant Haritsa Computer Science and Automation
D1 Discrete Mathematics
DATABASE HISTOGRAMS E0 261 Jayant Haritsa
Algorithm Design Methods
Algorithm Design Methods
Merge Sort 4/28/ :13 AM Dynamic Programming Dynamic Programming.
Merge Sort 5/2/2019 7:53 PM The Greedy Method The Greedy Method.
The Selection Problem.
Dynamic Programming Merge Sort 5/23/2019 6:18 PM Spring 2008
Implementation of Learning Systems
Approximation Algorithm
Algorithm Design Methods
Presentation transcript:

CUBE MATERIALIZATION E0 261 Jayant Haritsa Computer Science and Automation Indian Institute of Science

Views and Decision Support OLAP queries are typically aggregate queries. Precomputation is essential for interactive response times. The CUBE is in fact a collection of aggregate queries, and precomputation is especially important: lots of work on what is best to precompute given a limited amount of space to store precomputed results. Warehouses can be thought of as a collection of asynchronously replicated tables and periodically maintained views.

Issues in View Materialization What views should we materialize, and what indexes should we build on the precomputed results? Given a query and a set of materialized views, can we use the materialized views to answer the query? How frequently should we refresh materialized views to make them consistent with the underlying tables? (And how can we do this incrementally?)

TPC-D Example Supplier Part Customer

View Lattice PSC pivoting PC PS SC rollup P S C drilldown  (group by p,s,c) pivoting PC PS SC rollup P S C drilldown  (no group by) Given N dimensions, 2N views in lattice

Materialization Options Materialize everything minimum response time space explosion Materialize nothing maximum response time zero space Materialize a carefully chosen subset and derive others from this subset e.g. Any view can be derived from PSC today’s paper (received Best Paper award in Sigmod 96 !)

Problem Formulation Given a view lattice and a constraint on the number of views that can be materialized, which choice will result in minimizing the average cost across all views? Assumptions: All queries equi-probable Query Cost  number of rows examined No indexes

Solutions: Optimization problem is NP-hard Reduction from Set-Cover problem Given a set X of n elements, a family F of subsets of X that cover X, what is the smallest number of subsets whose union is X ? Therefore, heuristic-based approximate solutions are the only hope. greedy algorithm discussed in this paper Decision version of SetCover is NP-complete, optimization is NP-hard

Greedy Algorithm Given view lattice V, number of (interior) views k, and result to be stored in S S = {Top view} for i = 1 to k do do a full traversal of V – S select view v  S such that Benefit (v,S) is maximized S = S  {v}

Benefit Computation B(v,S) For each w ≼ v (w derivable from v) let u be the view of least cost in S such that w ≼ u . since top view is in S, there must be at least one such view in S if C(v) < C(u), then Bw = C(u) – C(v) Benefit to w of including v in set S i.e. Bw = C(current_parent) – C(new_candidate_parent) otherwise, Bw = 0 Then, B (v,S) = w ≼ v Bw overall benefit to all descendants,including itself, of v

Example 4.1 100 a 50 75 b c 30 40 20 e f d h g 10 1

View Choices (k=3) Figure 8 in paper The greedy selection is b, f and d Cost reduces from 800 (100 * 8) to 420 which coincides with the optimal

TPC-D database Figure 11 gives a visual example of tradeoff After picking first five views (cp,ns,nt,c,p), almost the minimum possible total time, while total space is hardly more than the mandatory space used for just the top view.

Performance Profile Bgreedy / Bopt  1 – ((k – 1)/k)k k = 2, ratio is 0.75 k  , ratio is 1 – 1/e = 0.63 Tight bound ! (Figure 9) No better algorithm possible! problem closed ? no, randomized algorithms possible Special cases Close to optimal if first view delivers most of the benefit Equal to optimal if the benefit of each successive view is the same

Proof Let v1, v2, …, vk be the views chosen in sequence by the greedy algorithm. Let ai be the benefit achieved by choosing vi (w.r.t. v1, …, vi-1) Similarly, let w1, w2, …, wk be the views chosen by optimal, and bi be the benefit achieved by choosing wi (w.r.t. w1, …, wi-1) Need to put an upper bound on the b’s in terms of the a’s

Proof Partition the improvement to an arbitrary view u effected by the v’s and by the w’s. e.g. for view g, cost improved from 100 to 20. 50 came from b and 30 from d. Greedy Optimal Assign contribution of wi’s to vj’s : e.g., contribution of w2 is wholly assigned to v1; w3 is divided among v1, v5, v7; w6 is not assigned; ... v1 v5 v7 w2 w3 w5 w6

Proof (contd) Define xij to be the sum over all views u in the lattice of the amount of the benefit bi (from wi) that is assigned to vj . Then, i xij  aj (total attribution cannot exceed complete value) Also i bi  a1 (o.w. wi would have been chosen instead of v1 by greedy algorithm) i bi – xi1  a2 (benefit of wi minus that already assigned to v1) … i bi – xi1 – xi2 – … – xi,j – 1 aj

Proof (contd) Summing each equation over i, and with the constraints that i bi = B , i ai = A, i xij  aj , we get B  ka1 B  ka2 + a1 B  ka3 + a1 + a2 … B  kak + a1 + a2 + … + ak–1 The bounds give maximum value of B when all right sides are equal. That is kai+1 – (k – 1) ai = 0

Proof (contd) Therefore, ai = (k/k-1) ai+1 For these values of a’s, A = i=0 to k-1 (k/k-1)i ak and from first (or any) inequality B  k (k/k-1)k-1 ak Therefore, A/B  1 – ((k-1)/k)k  1 – 1/e as k →∞

Dimension hierarchies Each dimension has a hierarchy Equivalent to “multiplying” lattices Example Figure 4: beautiful picture ! customer part nation size type none none

Alternative Problem Formulation Total Space is fixed, not number of views Means that Benefit per unit space needs to be computed. Performance guarantees still remain the same (ignoring boundary condition effects)

END CUBE MATERIALIZATION