View and Index Selection Problem in Data Warehousing Environments

1 View and Index Selection Problem in Data Warehousing Environments
Soujanya Vadapalli, Satyanarayana R Valluri
Project Guide: Prof. Kamalakar Karlapalem
B.Tech Final Year Project Presentation, June 2002
12/28/2018

2 Traditional Databases Vs. Data Warehouses
Operational Databases (Traditional Databases):
  Required for day-to-day transactions
  Updated very frequently
  Require concurrency and recovery
Data Warehouses:
  Repositories of data
  Historical, consolidated
  Summarized
  Very low updates
  NO concurrency or recovery

3 The other side of the coin: Query Processing Techniques
Lazy or on-demand approach:
  Information is extracted from the sources only when the query is posed
  Query processing is delayed until the query is posed
  Most suited for frequently updated databases
  Query domain is ad hoc
Eager or in-advance approach:
  Information is extracted well before the query is posed and stored in a centralized repository
  Query is processed without accessing the original sources, using only this repository
  Most suited for databases with very low updates

4 The connection
Operational Databases: Lazy or on-demand approach
Data Warehouses: Eager or in-advance approach

5 The Technique of Data Warehousing
Data Warehousing: Eager or in-advance technique
A Data Warehouse is built by extracting, transforming and loading data from the data sources
Properties of a Data Warehouse:
  Subject-oriented
  Integrated
  Time-varying
  Non-volatile
Data is usually modeled multi-dimensionally.
Supports On-Line Analytical Processing (OLAP)

6 Views and Indexes
Data from sources is pre-processed and stored in the data warehouse
Most common structures used are:
  Materialized Views
  Indexes
BASIC ASSUMPTION: Most frequently asked queries are known
[Figure: External Sources and Operational Databases feed the Data Warehouse through Extract, Transform, Load, Refresh]

7 Selection of Views and Indexes
Selection of views and indexes is done based on the most frequently asked queries.
The views and indexes selected are those that help minimize the query processing cost.
Factors that affect the selection:
  Query processing technique
  Cost Model
  Statistics available
[Figure labels: QD, QA, GMQEP, BQEP, VS, IS, Views, Indices, Min Cost Views and Indices]

8 Query Processing
Global Multiple Query Execution Plan (GMQEP): All possible ways of executing the queries
Best Query Execution Plan (BQEP): Best QEP among all GMQEPs
Selection: Views and indexes are selected based on the BQEP
Query Rewriting: Queries are rewritten in terms of the views and indexes selected
Maintenance: Views and indexes have to be maintained when the source tables are updated

9 Representation of Query Execution Plan: AND-OR DAGs
An AND-OR DAG consists of a set of Equivalent Nodes (Eq Nodes) and a set of Operational Nodes (Op Nodes).
Eq node: an equivalence class of logical expressions that generate the same result set. Can have one or more child op nodes.
Op node: an algebraic relational operation. Can have either one (unary) or two (binary) child eq nodes.
GMQEP is represented by an AND-OR DAG.
BQEP is represented by an AND DAG.

10 AND-OR DAG
The existence of more than one child at an eq node indicates an OR node.
Every op node represents an AND node.
For unary operations, the op node has one child.
For binary operations, the op node has two children.
[Figure labels: Eq Node, OR Node, Op Node, AND Node]
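
The node rules above can be sketched in a few lines of Python. This is only an illustration: the class names `EqNode` and `OpNode`, the `count_plans` helper, and the three-relation example are assumptions made here, not taken from the presentation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OpNode:
    """AND node: one relational operation over one or two child eq nodes."""
    operator: str
    children: List["EqNode"]

    def __post_init__(self):
        if len(self.children) not in (1, 2):
            raise ValueError("an op node is unary or binary")


@dataclass
class EqNode:
    """Equivalence class of expressions that produce the same result set."""
    label: str
    # One entry per alternative way of computing this result;
    # left empty for base relations.
    alternatives: List[OpNode] = field(default_factory=list)

    def is_or_node(self) -> bool:
        # More than one child op node means a choice, i.e. an OR node.
        return len(self.alternatives) > 1


def count_plans(node: EqNode) -> int:
    """Count plan trees obtained by picking one op node at each eq node."""
    if not node.alternatives:
        return 1
    total = 0
    for op in node.alternatives:
        ways = 1
        for child in op.children:
            ways *= count_plans(child)
        total += ways
    return total


# Hypothetical example: two ways to compute the join of A, B and C.
a, b, c = EqNode("A"), EqNode("B"), EqNode("C")
ab = EqNode("AB", [OpNode("join", [a, b])])
bc = EqNode("BC", [OpNode("join", [b, c])])
abc = EqNode("ABC", [OpNode("join", [ab, c]), OpNode("join", [a, bc])])
print(abc.is_or_node(), count_plans(abc))
```

Picking exactly one alternative at every eq node turns the AND-OR DAG into an AND DAG, which is how a single execution plan such as the BQEP is represented.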

11 Join DAGs
For a given number of joins, as the number of simple select conditions in the query increases, the size of the AND/OR DAG increases exponentially.
Time complexity for the generation of the AND/OR DAG is O(2^n), where n is the number of operations in the query.
Join DAG: special case containing only join operations.
[Figure: Join DAG over relations A, B, C with nodes AB, BC and ABC]
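
To make the exponential growth concrete, here is a minimal sketch that enumerates a join-only DAG. It assumes every subset of relations may be joined (cross products allowed), which is a simplification, and it counts over relations rather than operations. The function name `join_dag` is hypothetical.

```python
from itertools import combinations


def join_dag(relations):
    """Map each subset of relations (an eq node) to its binary splits (op nodes)."""
    dag = {}
    rels = sorted(relations)
    for size in range(1, len(rels) + 1):
        for subset in combinations(rels, size):
            node = frozenset(subset)
            splits = []
            # Keep the first relation on the left side so that each
            # unordered split (left, right) is generated exactly once.
            first, rest = subset[0], subset[1:]
            for k in range(len(rest)):
                for extra in combinations(rest, k):
                    left = frozenset((first,) + extra)
                    splits.append((left, node - left))
            dag[node] = splits
    return dag


for n in range(2, 6):
    names = [chr(ord("A") + i) for i in range(n)]
    print(n, "relations ->", len(join_dag(names)), "eq nodes")
```

With n relations the sketch produces 2^n - 1 eq nodes, so pre-computing all join orderings is only practical for a small number of relations.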

12 Sprinkling Selections over Join DAGs
A simple heuristic for efficient query optimization.
History Join DAG (HJDAG).
Idea:
  Pre-compute all possible join orderings (Complete HJDAG).
  Extract joins in the queries to incrementally build the DAG (Incremental HJDAG).
  Sprinkle the select conditions on the HJDAG to find the best execution strategy.
Can be easily extended to other relational operations like project, group by, etc.
This work is still in progress!

13 Cost Model
The cost model determines the cost units of database operations and the assumptions made about the database statistics.
The approach is independent of the cost model.
It is assumed that all statistics, such as join selectivity factor, selectivity factor and number of distinct values, are available.
Simple Cost Model: in terms of the number of tuples or blocks read.
  Join: always nested-loop. Cost is the product of the sizes of the relations.
  Select, Project, Group By: single scan. Cost is the size of the relation.
The cost model is implemented as a separate function, so any new cost model can be incorporated.
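
A minimal sketch of such a pluggable cost function follows. The nested-tuple plan format, the names `simple_cost` and `evaluate`, and all sizes and selectivities are assumptions chosen for illustration; only the two cost rules (product of sizes for a join, size of the input for a scan) come from the slide.

```python
def simple_cost(op, input_sizes):
    """Simple model: nested-loop join costs the product of the input sizes;
    select, project and group by cost one scan of the input."""
    if op == "join":
        return input_sizes[0] * input_sizes[1]
    return input_sizes[0]


def evaluate(plan, base_sizes, cost_fn=simple_cost):
    """Return (total_cost, output_size) for a plan given as nested tuples.

    A plan is either a base relation name or (op, selectivity, child, ...).
    The output size is estimated as selectivity times the product of the
    input sizes, which assumes the selectivity statistics are available.
    """
    if isinstance(plan, str):
        return 0, base_sizes[plan]  # reading a stored relation adds no operation cost here
    op, selectivity, *children = plan
    total, sizes = 0, []
    for child in children:
        child_cost, child_size = evaluate(child, base_sizes, cost_fn)
        total += child_cost
        sizes.append(child_size)
    total += cost_fn(op, sizes)
    out = selectivity
    for s in sizes:
        out *= s
    return total, out


# Hypothetical statistics: |A| = 100, |B| = 50.
base_sizes = {"A": 100, "B": 50}
plan = ("join", 0.01, ("select", 0.5, "A"), "B")
print(evaluate(plan, base_sizes))
```

Passing a different `cost_fn` swaps the cost model without touching the plan traversal, which mirrors the slide's point that the cost model is a separate function.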

14 View Selection Problem
Materialized View: A view stored as a relation by itself. Like a cached copy that can be accessed very quickly.
View Maintenance: When a base table is updated, all its dependent materialized views have to be updated. This incurs view maintenance cost.
View Selection Problem: Given a set of queries Q, select a set of views M that minimizes the query processing cost and view maintenance cost under a specified constraint:
  Storage space constraint.
  Maintenance cost constraint.
Existing view selection algorithms:
  Greedy Algorithm
  MVPP (Multiple View Processing Plan) Algorithm
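
The problem statement can be illustrated with a generic greedy heuristic under a storage space constraint. This is a sketch under stated assumptions, not the specific Greedy or MVPP algorithm the presentation refers to: the benefit-per-unit-space rule, the function names, and every number in the example are hypothetical.

```python
def query_cost(query, materialized, base_cost, view_cost):
    """Cheapest way to answer a query: base tables or any usable materialized view."""
    options = [base_cost[query]]
    options += [view_cost[v][query] for v in materialized if query in view_cost[v]]
    return min(options)


def total_cost(materialized, freq, base_cost, view_cost, maintenance):
    """Frequency-weighted query processing cost plus view maintenance cost."""
    processing = sum(f * query_cost(q, materialized, base_cost, view_cost)
                     for q, f in freq.items())
    return processing + sum(maintenance[v] for v in materialized)


def greedy_select(freq, base_cost, view_cost, maintenance, size, budget):
    """Repeatedly add the view with the largest cost reduction per unit of space."""
    chosen, used = [], 0
    while True:
        current = total_cost(chosen, freq, base_cost, view_cost, maintenance)
        best, best_gain = None, 0.0
        for v in view_cost:
            if v in chosen or used + size[v] > budget:
                continue
            with_v = total_cost(chosen + [v], freq, base_cost, view_cost, maintenance)
            gain = (current - with_v) / size[v]
            if gain > best_gain:
                best, best_gain = v, gain
        if best is None:  # nothing fits or nothing helps any more
            return chosen
        chosen.append(best)
        used += size[best]


# Hypothetical workload: two queries, three candidate views.
freq = {"Q1": 10, "Q2": 5}
base_cost = {"Q1": 100, "Q2": 200}
view_cost = {"V1": {"Q1": 10}, "V2": {"Q2": 20}, "V3": {"Q1": 40, "Q2": 50}}
maintenance = {"V1": 50, "V2": 100, "V3": 120}
size = {"V1": 30, "V2": 40, "V3": 60}

selected = greedy_select(freq, base_cost, view_cost, maintenance, size, budget=70)
print(selected, total_cost(selected, freq, base_cost, view_cost, maintenance))
```

The gain of each candidate is recomputed after every pick, because materializing one view changes how much the remaining views can still save.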

15 View Relevance Driven Selection (VRDS) Algorithm
View Relevance: Indicates how the materialization of a view affects the benefit of other views.
The view relevance of a pair of views is a measure of how relevant the two views are with respect to each other.
It is calculated based on the parent-child relation between them and whether either of them is materialized.
Algorithm:
  Input is an AND DAG.
  Calculate the View Relevance Matrix, which contains view relevance values for every possible pair of views.
  Select the pair with the maximum view relevance value.
Complexity is kn^2, where n is the number of nodes in the AND DAG and k > 0.
This work was published as a research paper at the 13th Australasian Database Conference, held in January 2002 in Melbourne.

16 Results – Comparison with greedy algorithm

17 Results – Comparison with greedy and MVPP Algorithm

18 Index Selection Problem
Index: Access structure for faster access to the data.
Commonly used: B+-trees, bit-mapped indexes, bit-sliced indexes, and projection indexes.
Index Selection Problem: Select the set of indexes that minimizes the query processing cost and maintenance cost under a specified constraint, such as storage space or maintenance cost.
Implementation:
  Greedy algorithm.
  Primary and secondary indexes are built.
  Simple cost model given in Navathe & Elmasri.

19 Space distribution between views and indexes
The selection of views depends on the indexes present, and vice versa.
Two methods: Materialized Views First or Indexes First.
Optimal distribution of storage space between views and indexes is a very complex problem.
Method:
  Assume some distribution of space between views and indexes, say (50%, 50%), and find the views and indexes.
  Iteratively improve the solution by removing some views and adding new indexes, or vice versa.
  View stealer and index stealer.
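
The iterative method can be approximated by a simple hill climb on the space split. This is a loose sketch and not the view stealer or index stealer themselves, which move individual views and indexes. The names `tune_split` and `toy_cost`, the 10% step, and the toy cost curve are all assumptions; in practice `cost_of` would run view and index selection under the two budgets and return the resulting total cost.

```python
def tune_split(total_space, cost_of, start_pct=50, step_pct=10):
    """Hill-climb on the percentage of space given to views.

    cost_of(view_space, index_space) is assumed to return the total cost
    obtained when views and indexes are selected under those two budgets.
    """
    pct = start_pct
    best = cost_of(total_space * pct / 100, total_space * (100 - pct) / 100)
    improved = True
    while improved:
        improved = False
        # Try shifting one step of space from views to indexes, and back.
        for cand in (pct - step_pct, pct + step_pct):
            if cand < 0 or cand > 100:
                continue
            cost = cost_of(total_space * cand / 100,
                           total_space * (100 - cand) / 100)
            if cost < best:
                pct, best, improved = cand, cost, True
    return pct, best


def toy_cost(view_space, index_space):
    """Made-up cost curve with its minimum at 70 units of view space."""
    return (view_space - 70) ** 2 + 1000


print(tune_split(100, toy_cost))
```

Like any local search, this stops at the first split that no single step improves, so the result depends on the starting distribution.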

20 VISDOM: View and Index Selector for Data warehOusing Management
Integrated data warehousing toolkit to select an optimal set of materialized views and indexes.
Features:
  Query processing plan generation.
  Selection of views and indexes by the algorithms discussed earlier.
  Comparison of three view selection algorithms.
  Distribution of space between views and indexes.
User Interface:
  Graphical display of AND/OR DAGs.
  Interactive processing.
Available for Windows and Linux platforms.

21 Conclusions and Future Work
Lessons Learnt:
  The view and index selection problem is a very complex problem.
  It is very tightly coupled with query processing and optimization.
  The cost model is the most important aspect.
  Need to have a comprehensive and unambiguous cost model.
  Assumptions about the statistics available.
Future Work:
  Dynamic view materialization and index selection.
  Incremental maintenance of views.
  Common framework for views and indexes.

22 Thank You

