View and Index Selection Problem in Data Warehousing Environments
Soujanya Vadapalli, Satyanarayana R Valluri
Project Guide: Prof. Kamalakar Karlapalem
B.Tech Final Year Project Presentation, June 2002
12/28/2018
Traditional Databases vs. Data Warehouses
- Operational (traditional) databases
  - Required for day-to-day transactions
  - Updated very frequently
  - Require concurrency control and recovery
- Data warehouses
  - Repositories of data
  - Historical, consolidated
  - Summarized
  - Very few updates
  - No concurrency control or recovery
The Other Side of the Coin: Query Processing Techniques
- Lazy or on-demand approach
  - Information is extracted from the sources only when the query is posed
  - Query processing is delayed until the query is posed
  - Best suited to frequently updated databases
  - Query domain is ad hoc
- Eager or in-advance approach
  - Information is extracted well before the query is posed and stored in a centralized repository
  - Queries are processed against this repository alone, without accessing the original sources
  - Best suited to databases with very few updates
The Connection
- Operational databases: lazy or on-demand approach
- Data warehouses: eager or in-advance approach
The Technique of Data Warehousing
- Data warehousing is an eager or in-advance technique
- A data warehouse is built by extracting, transforming and loading data from the data sources
- Properties of a data warehouse
  - Subject-oriented
  - Integrated
  - Time-varying
  - Non-volatile
- Data is usually modeled multi-dimensionally
- Supports On-Line Analytical Processing (OLAP)
Views and Indexes
- Data from the sources is pre-processed and stored in the data warehouse
- The most common structures used are
  - Materialized views
  - Indexes
- BASIC ASSUMPTION: the most frequently asked queries are known
- Figure labels: External Sources, Operational Databases, Extract, Transform, Load, Refresh, Data Warehouse
Selection of Views and Indexes
- Views and indexes are selected based on the most frequently asked queries
- The views and indexes chosen are those that help minimize the query processing cost
- Factors that affect the selection
  - Query processing technique
  - Cost model
  - Statistics available
- Figure labels: QD, QA, GMQEP, BQEP, VS, IS, Views, Indices, Min Cost Views and Indices
Query Processing
- Global Multiple Query Execution Plan (GMQEP): all possible ways of executing the queries
- Best Query Execution Plan (BQEP): the best QEP among those in the GMQEP
- Selection: views and indexes are selected based on the BQEP
- Query rewriting: queries are rewritten in terms of the selected views and indexes
- Maintenance: views and indexes must be maintained when the source tables are updated
Representation of Query Execution Plan: AND-OR DAGs
- An AND-OR DAG consists of a set of equivalence nodes (eq nodes) and a set of operation nodes (op nodes)
- Eq node: an equivalence class of logical expressions that generate the same result set; it can have one or more child op nodes
- Op node: a relational algebra operation; it can have either one (unary) or two (binary) child eq nodes
- The GMQEP is represented by an AND-OR DAG
- The BQEP is represented by an AND DAG
AND-OR DAG
- More than one child at an eq node indicates an OR node
- Every op node represents an AND node
  - For unary operations, the op node has one child
  - For binary operations, the op node has two children
- Figure labels: Eq Node (OR Node), Op Node (AND Node)
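The eq node / op node structure can be sketched in a few lines of Python. This is only an illustration, not the project's implementation: the class names, the per-operation `cost` field, the example costs, and the rule of summing child costs (which counts a shared sub-expression once per use) are all assumptions. Picking the cheapest op node at every eq node turns the AND-OR DAG into an AND DAG, which is how a BQEP could be read off a GMQEP.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class OpNode:
    """AND node: one relational operation over one or two child eq nodes."""
    name: str
    children: List["EqNode"]
    cost: float  # assumed local cost of applying this operation


@dataclass
class EqNode:
    """OR node: every alternative op node produces the same result set."""
    name: str
    alternatives: List[OpNode] = field(default_factory=list)  # empty for base relations


def best_plan(eq: EqNode) -> Tuple[float, Dict[str, str]]:
    """Return (cost, choices), where choices maps eq-node name to chosen op name."""
    if not eq.alternatives:  # base relation: nothing to choose
        return 0.0, {}
    best_cost, best_choice = float("inf"), {}
    for op in eq.alternatives:
        cost, choice = op.cost, {}
        for child in op.children:  # AND: every child must be computed
            child_cost, child_choice = best_plan(child)
            cost += child_cost
            choice.update(child_choice)
        if cost < best_cost:  # OR: keep only the cheapest alternative
            best_cost = cost
            best_choice = {**choice, eq.name: op.name}
    return best_cost, best_choice


# Hypothetical three-relation example with made-up operation costs.
a, b, c = EqNode("A"), EqNode("B"), EqNode("C")
ab = EqNode("AB", [OpNode("join(A,B)", [a, b], 100)])
bc = EqNode("BC", [OpNode("join(B,C)", [b, c], 400)])
abc = EqNode("ABC", [OpNode("join(AB,C)", [ab, c], 50),
                     OpNode("join(A,BC)", [a, bc], 30)])

cost, plan = best_plan(abc)
print(cost, plan)
```

With these made-up costs the plan through AB (100 + 50) beats the plan through BC (400 + 30), so only the AB branch survives in the resulting AND DAG.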
Join DAGs
- For a given number of joins, the size of the AND/OR DAG increases exponentially as the number of simple select conditions in the query increases
- Time complexity of generating the AND/OR DAG is O(2^n), where n is the number of operations in the query
- Join DAG: a special case containing only join operations
- Figure: join DAG over relations A, B and C (nodes A, B, C, AB, BC, ABC)
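As a rough illustration of a join DAG, the sketch below builds one for a chain query such as A - B - C, where only adjacent relations are joined. The chain shape, the function name `chain_join_dag`, and the dictionary representation (eq node mapped to its list of alternative splits) are assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple

Node = Tuple[str, ...]


def chain_join_dag(relations: Sequence[str]) -> Dict[Node, List[Tuple[Node, Node]]]:
    """Map each eq node (a run of adjacent relations) to its join alternatives."""
    n = len(relations)
    dag: Dict[Node, List[Tuple[Node, Node]]] = {}
    for length in range(1, n + 1):
        for start in range(n - length + 1):
            node = tuple(relations[start:start + length])
            # Each split point is one op node joining a left and a right eq node.
            dag[node] = [(node[:k], node[k:]) for k in range(1, length)]
    return dag


dag = chain_join_dag(["A", "B", "C"])
for node, splits in dag.items():
    print("".join(node), [("".join(l), "".join(r)) for l, r in splits])
```

For A, B, C this yields the six eq nodes A, B, C, AB, BC and ABC, with ABC reachable either as A joined with BC or as AB joined with C.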
Sprinkling Selections over Join DAGs
- A simple heuristic for efficient query optimization
- History Join DAG (HJDAG)
- Idea
  - Pre-compute all possible join orderings (Complete HJDAG)
  - Extract joins in the queries to incrementally build the DAG (Incremental HJDAG)
  - Sprinkle the select conditions on the HJDAG to find the best execution strategy
- Can be easily extended to other relational operations such as project and group by
- This work is still in progress
Cost Model
- Determines the cost units of database operations and the assumptions made about the database statistics
- The selection algorithms are independent of the cost model
- All statistics (join selectivity factor, selectivity factor, number of distinct values, etc.) are assumed to be available
- Simple cost model: cost is measured in the number of tuples or blocks read
  - Join: always nested-loop; cost is the product of the sizes of the relations
  - Select, Project, Group By: single scan; cost is the size of the relation
- The cost model is implemented as a separate function, so any new cost model can be incorporated
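The simple cost model is small enough to write down directly. The sketch below is an assumption-laden illustration: the class and function names are hypothetical, sizes are in tuples, and the result-size estimate (product of the input sizes times the join selectivity factor) is the usual textbook formula rather than something stated on the slide. Keeping the model behind `join()` and `scan()` mirrors the idea that a different cost model can be swapped in.

```python
class SimpleCostModel:
    """Costs in tuples read, following the simple model described above."""

    def join(self, left_size: int, right_size: int) -> int:
        # Nested-loop join: product of the sizes of the two relations.
        return left_size * right_size

    def scan(self, size: int) -> int:
        # Select, project, group by: a single scan of the input relation.
        return size


def join_result_size(left_size: int, right_size: int, join_selectivity: float) -> int:
    # Assumed estimate: |R join S| = selectivity * |R| * |S|.
    return round(left_size * right_size * join_selectivity)


def plan_cost(model, a_size: int, b_size: int, join_selectivity: float) -> int:
    """Cost of select(A join B) under any model exposing join() and scan()."""
    joined = join_result_size(a_size, b_size, join_selectivity)
    return model.join(a_size, b_size) + model.scan(joined)


# Hypothetical statistics: |A| = 1000, |B| = 200, join selectivity 0.005.
cost = plan_cost(SimpleCostModel(), a_size=1000, b_size=200, join_selectivity=0.005)
print(cost)
```

Here the join reads 1000 x 200 = 200,000 tuples and the selection scans the estimated 1,000-tuple join result, so the join dominates the plan cost.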
View Selection Problem
- Materialized view: a view stored as a relation in its own right, like a cached copy that can be accessed very quickly
- View maintenance: when a base table is updated, all its dependent materialized views have to be updated, which incurs view maintenance cost
- View selection problem: given a set of queries Q, select a set of views M that minimizes the query processing cost and view maintenance cost under a specified constraint
  - Storage space constraint
  - Maintenance cost constraint
- Existing view selection algorithms
  - Greedy algorithm
  - MVPP (Multiple View Processing Plan) algorithm
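To make the problem statement concrete, here is a minimal sketch of a generic greedy heuristic under a storage space constraint. It is not the specific greedy algorithm the slides compare against; the data layout (`freq`, `base`, `via`), the benefit-per-unit-space rule, and all numbers are assumptions chosen for illustration.

```python
from typing import Dict, List


def total_cost(selected: List[str], queries: List[dict],
               maintenance: Dict[str, float]) -> float:
    """Frequency-weighted query cost plus maintenance cost of the selected views."""
    query_cost = sum(
        q["freq"] * min([q["base"]] + [q["via"][v] for v in selected if v in q["via"]])
        for q in queries
    )
    return query_cost + sum(maintenance[v] for v in selected)


def greedy_select(sizes: Dict[str, float], queries: List[dict],
                  maintenance: Dict[str, float], space: float) -> List[str]:
    """Repeatedly add the view with the best cost reduction per unit of space."""
    selected: List[str] = []
    used = 0.0
    while True:
        current = total_cost(selected, queries, maintenance)
        best, best_gain = None, 0.0
        for view, size in sizes.items():
            if view in selected or used + size > space:
                continue  # already chosen, or does not fit in the remaining space
            gain = (current - total_cost(selected + [view], queries, maintenance)) / size
            if gain > best_gain:
                best, best_gain = view, gain
        if best is None:  # nothing fits or nothing lowers the total cost
            return selected
        selected.append(best)
        used += sizes[best]


# Hypothetical candidate views and queries (all numbers are made up).
sizes = {"AB": 40, "BC": 60, "ABC": 50}
maintenance = {"AB": 10, "BC": 10, "ABC": 20}
queries = [
    {"freq": 10, "base": 100, "via": {"AB": 40, "ABC": 5}},
    {"freq": 5, "base": 200, "via": {"BC": 50, "ABC": 20}},
]
chosen = greedy_select(sizes, queries, maintenance, space=100)
print(chosen, total_cost(chosen, queries, maintenance))
```

In this toy instance the view ABC is picked first because it helps both queries; adding AB afterwards would only add maintenance cost, and BC no longer fits in the remaining space.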
View Relevance Driven Selection (VRDS) Algorithm
- View relevance: indicates how the materialization of a view affects the benefit of other views
  - The view relevance of a pair of views measures how relevant the two views are to each other
  - It is calculated from the parent-child relation between them and whether either of them is materialized
- Algorithm
  - Input is an AND DAG
  - Calculate the view relevance matrix, which contains view relevance values for every possible pair of views
  - Select the pair with the maximum view relevance value
- Complexity is kn^2, where n is the number of nodes in the AND DAG and k > 0
- This work was published as a research paper at the 13th Australasian Database Conference, held in January 2002 in Melbourne
Results – Comparison with greedy algorithm
Results – Comparison with greedy and MVPP Algorithm
Index Selection Problem
- Index: access structure for faster access to the data
  - Commonly used: B+-trees, bit-mapped indexes, bit-sliced indexes, and projection indexes
- Index selection problem: select the set of indexes that minimizes the query processing cost and maintenance cost under a specified constraint such as storage space or maintenance cost
- Implementation
  - Greedy algorithm
  - Primary and secondary indexes are built
  - Simple cost model given in Navathe & Elmasri
Space Distribution between Views and Indexes
- The selection of views depends on the indexes present, and vice versa
- Two methods: materialized views first or indexes first
- Optimal distribution of storage space between views and indexes is a very complex problem
- Method
  - Assume some distribution of space between views and indexes, say (50%, 50%), and find the views and indexes
  - Iteratively improve the solution by removing some views and adding new indexes, or vice versa
  - View stealer and index stealer
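The iterative method can be pictured as a local search over the split of storage space. The sketch below is a loose illustration only: the slides do not specify how space is moved, so the fixed `step`, the function names, and the toy `evaluate` function (a stand-in for actually selecting views and indexes within each budget and measuring the resulting cost) are all assumptions.

```python
from typing import Callable, Tuple


def distribute_space(total: float, evaluate: Callable[[float, float], float],
                     step: float, view_share: float = 0.5) -> Tuple[float, float, float]:
    """Hill-climb from an initial split; return (view_space, index_space, cost)."""
    view_space = total * view_share
    best = evaluate(view_space, total - view_space)
    improved = True
    while improved:
        improved = False
        # Try shifting one step of space toward views, then toward indexes.
        for delta in (step, -step):
            candidate = view_space + delta
            if 0 <= candidate <= total:
                cost = evaluate(candidate, total - candidate)
                if cost < best:
                    view_space, best, improved = candidate, cost, True
                    break
    return view_space, total - view_space, best


def toy_evaluate(view_space: float, index_space: float) -> float:
    # Placeholder cost with its minimum at 70 units of view space (made up).
    return (view_space - 70) ** 2 + 100


result = distribute_space(total=100, evaluate=toy_evaluate, step=10)
print(result)
```

Starting from the (50%, 50%) split, the search moves space toward views while the toy cost keeps falling and stops at the first split where neither direction improves it.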
VISDOM: View and Index Selector for Data warehOusing Management
- Integrated data warehousing toolkit to select an optimal set of materialized views and indexes
- Features
  - Query processing plan generation
  - Selection of views and indexes by the algorithms discussed earlier
  - Comparison of three view selection algorithms
  - Distribution of space between views and indexes
- User interface
  - Graphical display of AND/OR DAGs
  - Interactive processing
- Available for Windows and Linux platforms
Conclusions and Future Work
- Lessons learnt
  - View and index selection is a very complex problem
  - It is very tightly coupled with query processing and optimization
  - The cost model is the most important aspect
  - A comprehensive and unambiguous cost model is needed
  - Assumptions about the statistics available
- Future work
  - Dynamic view materialization and index selection
  - Incremental maintenance of views
  - Common framework for views and indexes
Thank You