View and Index Selection Problem in Data Warehousing Environments

Presentation transcript:

View and Index Selection Problem in Data Warehousing Environments
Soujanya Vadapalli, Satyanarayana R Valluri
Project Guide: Prof. Kamalakar Karlapalem
B.Tech Final Year Project Presentation, June 2002
12/28/2018

Traditional Databases vs. Data Warehouses

Operational (traditional) databases:
  Required for day-to-day transactions
  Updated very frequently
  Require concurrency control and recovery

Data warehouses:
  Repositories of data
  Historical, consolidated, summarized
  Very few updates
  No concurrency control or recovery

The other side of the coin: Query Processing Techniques

Lazy or on-demand approach:
  Information is extracted from the sources only when the query is posed
  Query processing is delayed until the query is posed
  Best suited to frequently updated databases
  Query domain is ad hoc

Eager or in-advance approach:
  Information is extracted well before the query is posed and stored in a centralized repository
  Queries are processed against this repository alone, without accessing the original sources
  Best suited to databases with very few updates

The connection

Operational databases: lazy or on-demand approach
Data warehouses: eager or in-advance approach

The Technique of Data Warehousing

Data warehousing is the eager or in-advance technique.
A data warehouse is built by extracting, transforming and loading data from the data sources.
Properties of a data warehouse:
  Subject-oriented
  Integrated
  Time-varying
  Non-volatile
Data is usually modeled multi-dimensionally.
Supports On-Line Analytical Processing (OLAP).

Views and Indexes

Data from the sources is pre-processed and stored in the data warehouse.
The most common structures used are:
  Materialized views
  Indexes
BASIC ASSUMPTION: The most frequently asked queries are known.
[Figure: external sources and operational databases feed the data warehouse through extract, transform, load and refresh steps]

Selection of Views and Indexes

Views and indexes are selected based on the most frequently asked queries.
The views and indexes chosen are those that minimize the query processing cost.
Factors that affect the selection:
  Query processing technique
  Cost model
  Statistics available
[Figure labels: QD, QA, GMQEP, BQEP, VS, IS, Views, Indices, Min Cost Views and Indices]

Query Processing

Global Multiple Query Execution Plan (GMQEP): all possible ways of executing the queries.
Best Query Execution Plan (BQEP): the best query execution plan among those in the GMQEP.
Selection: views and indexes are selected based on the BQEP.
Query rewriting: queries are rewritten in terms of the selected views and indexes.
Maintenance: views and indexes must be maintained when the source tables are updated.

Representation of Query Execution Plan: AND-OR DAGs

An AND-OR DAG consists of a set of equivalence nodes (eq nodes) and a set of operation nodes (op nodes).
Eq node: an equivalence class of logical expressions that generate the same result set. It can have one or more child op nodes.
Op node: a relational algebra operation. It can have either one (unary) or two (binary) child eq nodes.
The GMQEP is represented by an AND-OR DAG.
The BQEP is represented by an AND DAG.
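The node definitions above can be sketched as a small data structure. This is an illustrative sketch only, not the presenters' implementation: the class names `EqNode` and `OpNode`, the `alternatives` field and the `count_plans` helper are assumptions introduced here. `count_plans` counts how many single plans can be read off the DAG by picking one op node at every eq node, treating shared sub-expressions as if they were independent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OpNode:
    """AND node: one relational operation over one or two child eq nodes."""
    operator: str
    children: List["EqNode"]

    def __post_init__(self):
        if len(self.children) not in (1, 2):
            raise ValueError("an op node has one (unary) or two (binary) children")


@dataclass
class EqNode:
    """Equivalence class of expressions producing the same result set."""
    label: str
    alternatives: List[OpNode] = field(default_factory=list)

    def is_or_node(self) -> bool:
        # More than one child op node means there is a choice to make.
        return len(self.alternatives) > 1


def count_plans(eq: EqNode) -> int:
    """Count the plans rooted at eq (hypothetical helper, tree-style count)."""
    if not eq.alternatives:            # base relation: nothing to choose
        return 1
    total = 0
    for op in eq.alternatives:         # OR: pick exactly one alternative
        ways = 1
        for child in op.children:      # AND: every child must be computed
            ways *= count_plans(child)
        total += ways
    return total


# Join of A, B and C with two alternative join orders.
a, b, c = EqNode("A"), EqNode("B"), EqNode("C")
ab = EqNode("AB", [OpNode("join", [a, b])])
bc = EqNode("BC", [OpNode("join", [b, c])])
abc = EqNode("ABC", [OpNode("join", [ab, c]), OpNode("join", [a, bc])])
```

Here `abc` is an OR node because it has two child op nodes, while `ab` and `bc` each have a single way of being computed. Fixing one alternative at every OR node yields an AND DAG, which is the form used for the BQEP.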

AND-OR DAG

An eq node with more than one child is an OR node.
Every op node is an AND node.
For unary operations, the op node has one child.
For binary operations, the op node has two children.
[Figure: example AND-OR DAG with eq node, op node, OR node and AND node labeled]

Join DAGs

For a given number of joins, as the number of simple select conditions in the query increases, the size of the AND/OR DAG increases exponentially.
The time complexity of generating the AND/OR DAG is O(2^n), where n is the number of operations in the query.
Join DAG: a special case containing only join operations.
[Figure: join DAG over relations A, B and C with intermediate nodes AB and BC]
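To make the exponential growth concrete, the following sketch counts the nodes of a join-only DAG. It rests on an assumption that is not stated in the slides: every pair of relations may be joined, so each non-empty subset of relations is one eq node and each unordered split of a subset into two non-empty parts is one join op node. The function name `join_dag_size` is introduced here for illustration.

```python
from itertools import combinations


def join_dag_size(relations):
    """Return (eq_nodes, op_nodes) for a join-only DAG over the relations.

    Assumes any two relations can be joined (no join-graph restriction).
    """
    n = len(relations)
    eq_nodes = 0
    op_nodes = 0
    for size in range(1, n + 1):
        for _subset in combinations(relations, size):
            eq_nodes += 1
            # 2**size - 2 ordered splits into two non-empty parts;
            # halve because join(X, Y) and join(Y, X) are one op node here.
            op_nodes += (2 ** size - 2) // 2
    return eq_nodes, op_nodes


three = join_dag_size(["A", "B", "C"])
four = join_dag_size(["A", "B", "C", "D"])
```

Under this assumption the number of eq nodes is 2^n - 1 for n relations, which is why pre-computing all join orderings becomes expensive quickly.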

Sprinkling Selections over Join DAGs

A simple heuristic for efficient query optimization, based on the History Join DAG (HJDAG).
Idea:
  Pre-compute all possible join orderings (Complete HJDAG), or
  Extract the joins in the queries to build the DAG incrementally (Incremental HJDAG).
  Sprinkle the select conditions on the HJDAG to find the best execution strategy.
Can easily be extended to other relational operations such as project and group by.
This work is still in progress.

Cost Model

The cost model determines the cost units of database operations and the assumptions made about the database statistics.
The selection process is independent of the cost model: the cost model is implemented as a separate function, so any new cost model can be incorporated.
All statistics, such as join selectivity factor, selectivity factor and number of distinct values, are assumed to be available.
Simple cost model, in terms of the number of tuples or blocks read:
  Join: always nested-loop. The cost is the product of the sizes of the relations.
  Select, Project, Group By: single scan. The cost is the size of the relation.
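The simple cost model can be written down as a handful of functions. This is a minimal sketch under stated assumptions: sizes are in tuples, result sizes are estimated as input size times selectivity, and all function names and the example numbers are made up for illustration rather than taken from the project.

```python
def scan_cost(size):
    """Select, project or group by: a single scan of the input."""
    return size


def join_cost(left_size, right_size):
    """Nested-loop join: product of the sizes of the two inputs."""
    return left_size * right_size


def join_result_size(left_size, right_size, join_selectivity):
    """Estimated result size of a join (assumed estimate, not from the slides)."""
    return left_size * right_size * join_selectivity


def select_result_size(size, selectivity):
    """Estimated result size of a selection (assumed estimate)."""
    return size * selectivity


# Hypothetical statistics: |A| = 1000, |B| = 200, join selectivity 0.01,
# and a selection on A with selectivity 0.1.
A, B, JSF, SF = 1000, 200, 0.01, 0.1

# Plan 1: join first, then apply the selection to the join result.
select_late = join_cost(A, B) + scan_cost(join_result_size(A, B, JSF))

# Plan 2: apply the selection to A first, then join the smaller result with B.
select_early = scan_cost(A) + join_cost(select_result_size(A, SF), B)
```

With these numbers the early-selection plan is roughly ten times cheaper, which illustrates why the placement of select conditions over the join DAG matters. Because the cost functions are separate from the plan comparison, a different cost model can be swapped in without touching the rest.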

View Selection Problem

Materialized view: a view stored as a relation in its own right, like a cached copy that can be accessed very quickly.
View maintenance: when a base table is updated, all materialized views that depend on it have to be updated. This incurs a view maintenance cost.
View selection problem: given a set of queries Q, select a set of views M that minimizes the query processing cost and the view maintenance cost under a specified constraint:
  Storage space constraint
  Maintenance cost constraint
Existing view selection algorithms:
  Greedy algorithm
  MVPP (Multiple View Processing Plan) algorithm
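As a rough illustration of the problem statement, here is a greedy selection under a storage space constraint. It is a simplified sketch, not the greedy algorithm compared in the results and not VRDS: it ignores maintenance cost, and the input layout (`views`, `queries`, `base_cost`) and all numbers are assumptions made for this example. In each round it picks the view with the highest remaining benefit per unit of space.

```python
def greedy_select(views, queries, base_cost, space_limit):
    """Pick views by benefit per unit of space until nothing fits or helps.

    views:     {view: (size, {query: cost of answering the query from the view})}
    queries:   {query: frequency}
    base_cost: {query: cost of answering the query without any view}
    """
    current = dict(base_cost)          # best known cost per query so far
    chosen, used = [], 0
    while True:
        best, best_ratio = None, 0.0
        for name, (size, answers) in views.items():
            if name in chosen or used + size > space_limit:
                continue
            # Benefit: frequency-weighted cost reduction over the current state.
            benefit = sum(
                queries[q] * max(0, current[q] - cost)
                for q, cost in answers.items()
            )
            ratio = benefit / size
            if ratio > best_ratio:
                best, best_ratio = name, ratio
        if best is None:
            return chosen
        size, answers = views[best]
        chosen.append(best)
        used += size
        for q, cost in answers.items():
            current[q] = min(current[q], cost)


# Hypothetical workload and candidate views.
queries = {"q1": 10, "q2": 5}
base_cost = {"q1": 1000, "q2": 800}
views = {
    "v_ab": (100, {"q1": 100, "q2": 400}),
    "v_abc": (300, {"q1": 10}),
    "v_bc": (50, {"q2": 200}),
}
selected = greedy_select(views, queries, base_cost, space_limit=150)
```

In this example `v_abc` never fits in the space budget, `v_ab` is picked first, and `v_bc` is still worth adding afterwards because it further reduces the cost of `q2`. Recomputing the benefit in every round is what captures the fact that materializing one view changes the benefit of the others.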

View Relevance Driven Selection (VRDS) Algorithm

View relevance: indicates how the materialization of a view affects the benefit of other views.
The view relevance of a pair of views is a measure of how relevant the two views are to each other.
It is calculated from the parent-child relation between them and from whether either or both of them are materialized.
Algorithm:
  The input is an AND DAG.
  Calculate the view relevance matrix, which contains the view relevance value for every possible pair of views.
  Select the pair that has the maximum view relevance value.
The complexity is kn^2, where n is the number of nodes in the AND DAG and k > 0.
This work was published as a research paper at the 13th Australasian Database Conference, held in Melbourne in January 2002.

Results – Comparison with greedy algorithm

Results – Comparison with greedy and MVPP Algorithm

Index Selection Problem

Index: an access structure for faster access to the data.
Commonly used: B+-trees, bit-mapped indexes, bit-sliced indexes and projection indexes.
Index selection problem: select the set of indexes that minimizes the query processing cost and the maintenance cost under a specified constraint such as storage space or maintenance cost.
Implementation:
  Greedy algorithm.
  Primary and secondary indexes are built.
  Simple cost model given in Navathe & Elmasri.

Space distribution between views and indexes

The selection of views depends on the indexes present, and vice versa.
The selection of views in turn affects which indexes are useful, and vice versa.
Two methods: materialized views first, or indexes first.
Finding the optimal distribution of storage space between views and indexes is a very complex problem.
Method:
  Assume some distribution of space between views and indexes, say (50%, 50%), and find the views and indexes.
  Iteratively improve the solution by removing some views and adding new indexes, or vice versa.
  View stealer and index stealer.
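The iterative method can be approximated by a simple hill climb over the share of space given to views. This is a loose sketch and not the view stealer / index stealer procedure itself, whose details are not given in the slides: instead of swapping individual views and indexes, it shifts a percentage of the budget in fixed steps. The `evaluate` callback, which is assumed to run view and index selection for the given budgets and return the resulting total cost, is hypothetical, as is the toy cost function used in the example.

```python
def tune_split(total_space, evaluate, start_pct=50, step_pct=10):
    """Hill-climb the percentage of space given to views.

    evaluate(view_space, index_space) -> total cost (hypothetical callback).
    Returns (best view percentage, cost at that percentage).
    """
    def cost_at(pct):
        view_space = total_space * pct // 100
        return evaluate(view_space, total_space - view_space)

    pct = start_pct
    best = cost_at(pct)
    improved = True
    while improved:
        improved = False
        # Try moving space from views to indexes, then the other way round.
        for candidate in (pct - step_pct, pct + step_pct):
            if not 0 <= candidate <= 100:
                continue
            cost = cost_at(candidate)
            if cost < best:
                pct, best, improved = candidate, cost, True
                break
    return pct, best


def toy_cost(view_space, index_space):
    """Made-up cost curve whose minimum is at 700 units of view space."""
    return (view_space - 700) ** 2 + 5


result = tune_split(1000, toy_cost)
```

Starting from the (50%, 50%) split, the loop keeps moving space in whichever direction lowers the total cost and stops at a local minimum, so it may miss a better split further away.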

VISDOM: View and Index Selector for Data warehOusing Management

An integrated data warehousing toolkit for selecting an optimal set of materialized views and indexes.
Features:
  Query processing plan generation.
  Selection of views and indexes by the algorithms discussed earlier.
  Comparison of three view selection algorithms.
  Distribution of space between views and indexes.
User interface:
  Graphical display of AND/OR DAGs.
  Interactive processing.
Available for Windows and Linux platforms.

Conclusions and Future Work

Lessons learnt:
  View and index selection is a very complex problem.
  It is very tightly coupled with query processing and optimization.
  The cost model is the most important aspect: a comprehensive and unambiguous cost model is needed, along with clear assumptions about the statistics available.
Future work:
  Dynamic view materialization and index selection.
  Incremental maintenance of views.
  Common framework for views and indexes.

Thank You