Semantic Query Optimization

Presentation transcript:

Semantic Query Optimization

Outline
- Semantic Query Optimization
- Soft Constraints
- Query Optimization via Soft Constraints
- Selectivity Estimation via Soft Constraints

Semantic Query Optimization
Use the integrity constraints associated with a database to rewrite a query into a form that may be evaluated more efficiently.
Some techniques:
- Join Elimination
- Predicate Elimination
- Join Introduction
- Predicate Introduction
- Detecting an Empty Answer Set

Commercial implementations of SQO
Few (if any!)
Early experiences:
- Could not spend too much time on optimization
- Few integrity constraints are ever defined
- Association with deductive databases

Join elimination: example
Original query:
    select p_name, p_retailprice, s_name, s_address
    from tpcd.lineitem, tpcd.partsupp, tpcd.part, tpcd.supplier
    where p_partkey = ps_partkey and s_suppkey = ps_suppkey
      and ps_partkey = l_partkey and ps_suppkey = l_suppkey;
RI constraints:
- part-partsupp (on partkey)
- supplier-partsupp (on suppkey)
- partsupp-lineitem (on partkey and suppkey)
Rewritten query, with the join to partsupp eliminated:
    select p_name, p_retailprice, s_name, s_address
    from tpcd.lineitem, tpcd.part, tpcd.supplier
    where p_partkey = l_partkey and s_suppkey = l_suppkey;
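The equivalence behind this example can be checked on a toy database. The following is a minimal sketch using Python's built-in sqlite3 module rather than DB2; the table contents and the surrogate column l_id are made up for illustration, and the schema keeps only the columns the query touches. It runs the four-table query and the version with partsupp removed, then compares the results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # make SQLite enforce the RI constraints
conn.executescript("""
CREATE TABLE part     (p_partkey INTEGER PRIMARY KEY, p_name TEXT, p_retailprice REAL);
CREATE TABLE supplier (s_suppkey INTEGER PRIMARY KEY, s_name TEXT, s_address TEXT);
CREATE TABLE partsupp (
    ps_partkey INTEGER NOT NULL REFERENCES part(p_partkey),
    ps_suppkey INTEGER NOT NULL REFERENCES supplier(s_suppkey),
    PRIMARY KEY (ps_partkey, ps_suppkey));
CREATE TABLE lineitem (
    l_id      INTEGER PRIMARY KEY,   -- hypothetical surrogate key
    l_partkey INTEGER NOT NULL,
    l_suppkey INTEGER NOT NULL,
    FOREIGN KEY (l_partkey, l_suppkey) REFERENCES partsupp(ps_partkey, ps_suppkey));
INSERT INTO part     VALUES (1, 'bolt', 0.5), (2, 'nut', 0.2);
INSERT INTO supplier VALUES (10, 'Acme', '1 Main St'), (20, 'Bolt Co', '2 Side St');
INSERT INTO partsupp VALUES (1, 10), (1, 20), (2, 10);
INSERT INTO lineitem VALUES (100, 1, 10), (101, 1, 20), (102, 2, 10), (103, 1, 10);
""")

ORIGINAL = """
SELECT p_name, p_retailprice, s_name, s_address
FROM lineitem, partsupp, part, supplier
WHERE p_partkey = ps_partkey AND s_suppkey = ps_suppkey
  AND ps_partkey = l_partkey AND ps_suppkey = l_suppkey
"""

# Same query with partsupp eliminated: every lineitem row is guaranteed
# exactly one matching partsupp row, so the join adds and removes nothing.
REWRITTEN = """
SELECT p_name, p_retailprice, s_name, s_address
FROM lineitem, part, supplier
WHERE p_partkey = l_partkey AND s_suppkey = l_suppkey
"""

rows_original = sorted(conn.execute(ORIGINAL).fetchall())
rows_rewritten = sorted(conn.execute(REWRITTEN).fetchall())
print(rows_original == rows_rewritten)
```

The two result sets agree only because the foreign keys are declared NOT NULL and the referenced key is unique; without those guarantees the rewrite would not be safe.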

Algorithm for join elimination
1. Derive column transitivity classes from the join predicates in the query.
2. Divide the relations in the query that are related through RI constraints into removable and non-removable.
3. Eliminate all removable relations from the query.
4. Add an "is not null" predicate to the foreign key columns of all tables whose RI parents were removed.
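Steps 1 and 2 can be sketched as follows. This is an illustrative simplification, not the DB2 implementation: the function names and input encodings are hypothetical. It assumes a parent relation is removable when it is joined to its child on the full foreign key and the query references no parent columns other than the key.

```python
def transitivity_classes(join_predicates):
    """Step 1: group columns connected by equality join predicates (union-find)."""
    parent = {}

    def find(col):
        parent.setdefault(col, col)
        while parent[col] != col:
            parent[col] = parent[parent[col]]  # path halving
            col = parent[col]
        return col

    for left, right in join_predicates:
        parent[find(left)] = find(right)

    classes = {}
    for col in list(parent):
        classes.setdefault(find(col), set()).add(col)
    return sorted(sorted(cls) for cls in classes.values())


def removable_relations(ri_constraints, referenced_columns, classes):
    """Step 2 (simplified): ri_constraints holds (child, parent, [(fk, pk), ...])."""
    class_of = {col: i for i, cls in enumerate(classes) for col in cls}
    removable = set()
    for _child, parent_rel, pairs in ri_constraints:
        key_cols = {pk for _fk, pk in pairs}
        # The child must be equated with the parent on every key column.
        joined_on_key = all(
            fk in class_of and pk in class_of and class_of[fk] == class_of[pk]
            for fk, pk in pairs
        )
        # The query must not need anything from the parent except its key.
        only_keys_used = referenced_columns.get(parent_rel, set()) <= key_cols
        if joined_on_key and only_keys_used:
            removable.add(parent_rel)
    return removable


predicates = [
    ("p_partkey", "ps_partkey"), ("s_suppkey", "ps_suppkey"),
    ("ps_partkey", "l_partkey"), ("ps_suppkey", "l_suppkey"),
]
classes = transitivity_classes(predicates)

ri = [
    ("partsupp", "part", [("ps_partkey", "p_partkey")]),
    ("partsupp", "supplier", [("ps_suppkey", "s_suppkey")]),
    ("lineitem", "partsupp", [("l_partkey", "ps_partkey"), ("l_suppkey", "ps_suppkey")]),
]
used = {
    "part": {"p_partkey", "p_name", "p_retailprice"},
    "supplier": {"s_suppkey", "s_name", "s_address"},
    "partsupp": {"ps_partkey", "ps_suppkey"},
    "lineitem": {"l_partkey", "l_suppkey"},
}
removable = removable_relations(ri, used, classes)
print(classes)
print(removable)
```

For the TPC-D query from the join elimination example, only partsupp qualifies: part and supplier contribute columns to the select list, so they must stay.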

Algorithm for join elimination: example
[Figure not recovered; surviving column labels: S.S, PS.S, O.C, C.C]

Performance results for join elimination

Predicate Introduction: Example
Original query:
    select sum(l_extendedprice * l_discount) as revenue
    from tpcd.lineitem
    where l_shipdate > date('1994-01-01');
Check constraint: l_receiptdate >= l_shipdate
Clustered index on l_receiptdate
Rewritten query:
    select sum(l_extendedprice * l_discount) as revenue
    from tpcd.lineitem
    where l_shipdate > date('1994-01-01')
      and l_receiptdate >= date('1994-01-01');

Algorithm for Predicate Introduction
N = the set of predicates derivable from the query and the check constraints.
If N is inconsistent, stop. Otherwise, for each predicate A op B in N, add it to the query if:
- A or B is a join column
- B is a major column of an index
- no other index on B's table can be used in the plan for the original query
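The derivation step can be illustrated with a small sketch. It is a simplification under stated assumptions: the function name and the tuple encodings are hypothetical, only lower-bound predicates and check constraints of the form "bigger >= smaller" are handled, the consistency test on N is omitted, and the index condition is reduced to a set of columns that lead some index.

```python
from datetime import date


def introduce_predicates(query_preds, check_constraints, indexed_major_columns):
    """Derive new lower-bound predicates from the query and check constraints.

    query_preds:        list of (column, op, constant), op in {'>', '>='}
    check_constraints:  list of (bigger, smaller), meaning bigger >= smaller
    indexed_major_columns: columns that are the major column of some index
    """
    already_restricted = {col for col, _op, _const in query_preds}
    derived = []
    for col, op, const in query_preds:
        for bigger, smaller in check_constraints:
            # bigger >= smaller and smaller op const  implies  bigger op const
            if (smaller == col
                    and bigger in indexed_major_columns
                    and bigger not in already_restricted):
                derived.append((bigger, op, const))
    return derived


new_preds = introduce_predicates(
    query_preds=[("l_shipdate", ">", date(1994, 1, 1))],
    check_constraints=[("l_receiptdate", "l_shipdate")],
    indexed_major_columns={"l_receiptdate"},
)
print(new_preds)
```

The sketch keeps the strict operator of the source predicate. The weaker form with >= shown on the slides is implied as well and is equally valid as an added predicate.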

Queries
    select 100.00 * sum(case when p_type like 'PROMO%'
                             then l_extendedprice * (1 - l_discount)
                             else 0 end)
           / sum(l_extendedprice * (1 - l_discount)) as promo_revenue
    from tpcd.lineitem, tpcd.part
    where l_partkey = p_partkey
      and l_shipdate >= date('1998-09-01')
      and l_shipdate < date('1998-09-01') + 1 month;
Given the check constraint l_receiptdate >= l_shipdate, we may add a new predicate to the query:
    l_receiptdate >= date('1998-09-01')

Performance Results for Index Introduction

The Culprit
The new query plan uses an index, but the original table scan is still better!
Why did this happen?
- incorrect estimate of the filter factor
- underestimation of the CPU cost of locking index pages

Soft Constraints

Soft Constraints
Traditional ("hard") integrity constraints are defined to prevent incorrect updates.
A soft constraint is a statement that is true of the current state of the database but is not checked when the database is updated. In fact, a soft constraint can be invalidated by an update.

Soft Constraints (cont.)
- Absolute soft constraints: no violations in the current state of the database. They can be used for optimization in exactly the same way as traditional constraints.
- Statistical soft constraints: can have some (small) degree of violation. They can be used for improved selectivity estimation.
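The distinction can be made concrete by measuring how often a candidate constraint is violated in the current data. The sketch below is illustrative only: the function names, the row encoding, and the tolerance threshold are assumptions, not something the slides define.

```python
def violation_rate(rows, predicate):
    """Fraction of rows in the current database state that violate the predicate."""
    rows = list(rows)
    if not rows:
        return 0.0
    violations = sum(1 for row in rows if not predicate(row))
    return violations / len(rows)


def classify_soft_constraint(rows, predicate, tolerance=0.1):
    """Label a candidate constraint; `tolerance` is a hypothetical cut-off."""
    rate = violation_rate(rows, predicate)
    if rate == 0.0:
        return "absolute"
    if rate <= tolerance:
        return "statistical"
    return "not a soft constraint"


def receipt_after_ship(row):
    return row["receipt"] >= row["ship"]


# Toy data: 19 rows satisfy receipt >= ship, one row violates it.
clean_rows = [{"ship": d, "receipt": d + 2} for d in range(19)]
dirty_rows = clean_rows + [{"ship": 30, "receipt": 29}]

label_clean = classify_soft_constraint(clean_rows, receipt_after_ship)
label_dirty = classify_soft_constraint(dirty_rows, receipt_after_ship)
print(label_clean, label_dirty)
```

Either label describes only the current state: a later update can turn an absolute soft constraint into a statistical one, which is why such constraints must be revalidated or maintained before the optimizer relies on them.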

Implementation of Soft Constraints
- In Oracle, standard integrity constraints can be marked with the RELY option, so that they are not verified on updates.
- In DB2, soft constraints are called informational constraints.

Informational Check Constraint
Example 1: Create an employee table where a minimum salary of $25,000 is guaranteed by the application.
    CREATE TABLE emp (
        empno INTEGER NOT NULL PRIMARY KEY,
        name VARCHAR(20),
        firstname VARCHAR(20),
        salary INTEGER CONSTRAINT minsalary
            CHECK (salary >= 25000)
            NOT ENFORCED
            ENABLE QUERY OPTIMIZATION);

Enforcing Validation
Example 2: Alter the employee table so that DB2 starts enforcing the minimum salary of $25,000. DB2 will also verify the existing data right away.
    ALTER TABLE emp ALTER CONSTRAINT minsalary ENFORCED

Informational RI Constraint
Example 3: Create a department table where the application ensures the existence of the departments to which the employees belong.
    CREATE TABLE dept (
        deptno INTEGER NOT NULL PRIMARY KEY,
        deptName VARCHAR(20),
        budget INTEGER);
    ALTER TABLE emp ADD COLUMN dept INTEGER NOT NULL
        CONSTRAINT dept_exist REFERENCES dept
        NOT ENFORCED
        ENABLE QUERY OPTIMIZATION;

Query Optimization via Empty Joins

Example
Original query:
    select Model
    from Tickets T, Registration R
    where T.RegNum = R.RegNum
      and T.date > '1990-01-01'
      and R.Model LIKE 'BMW Z3%'
The first BMW Z3 series cars were made in 1997, so the date predicate can be tightened:
    select Model
    from Tickets T, Registration R
    where T.RegNum = R.RegNum
      and T.date > '1997-01-01'
      and R.Model LIKE 'BMW Z3%'

Matrix representation of empty joins: $\pi_{A,B}(R \bowtie S)$

Staircase data structure

Properties of the algorithm
- Time complexity: O(nm); requires a single scan of the sorted data
- Space complexity: O(min(n,m)); only two rows of the matrix need be kept in memory
- Scalable with respect to:
  - number of tuples in the join result
  - number of discovered empty rectangles
  - size of the domain of one of the attributes

How many empty rectangles are there?
Tests were done on 4 pairs of attributes with numerical domains that appear in typical joins in a real-world workload of a health insurance company.

How big are the rectangles?

Query rewrite: simple case
Original query:
    select … from R, S,... where R.C=S.C and 60<R.A<80 and 20<S.B<80 and...
Rewritten query:
    select … from R, S,... where R.C=S.C and 60<R.A<80 and 20<S.B<60 and...
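The simple case amounts to cutting one end off a range predicate. The sketch below assumes a known empty rectangle given as closed intervals on A and B, and a query given as open intervals; the function name and the rectangle coordinates are hypothetical, chosen so that the result matches the rewrite in this example.

```python
def trim_b_range(query_a, query_b, rect_a, rect_b):
    """Shrink the query's B range using one empty rectangle of the join R x S.

    query_a, query_b: open intervals (lo, hi) from the query predicates
    rect_a, rect_b:   closed intervals [lo, hi] where the join has no tuples
    Returns the (possibly shorter) open interval for B.
    """
    qa_lo, qa_hi = query_a
    qb_lo, qb_hi = query_b
    ra_lo, ra_hi = rect_a
    rb_lo, rb_hi = rect_b

    # The rectangle must span the query's whole A range; otherwise some A
    # values could still join with B values inside the rectangle's B range.
    if not (ra_lo <= qa_lo and qa_hi <= ra_hi):
        return query_b

    # Rectangle covers the upper end of the B range: lower the upper bound.
    if qb_lo < rb_lo < qb_hi and rb_hi >= qb_hi:
        return (qb_lo, rb_lo)

    # Rectangle covers the lower end of the B range: raise the lower bound.
    if qb_lo < rb_hi < qb_hi and rb_lo <= qb_lo:
        return (rb_hi, qb_hi)

    # Anything else (interior overlap, full cover) is left to the complex case.
    return query_b


# Hypothetical empty rectangle: no join tuples with 55 <= A <= 85 and 60 <= B <= 90.
trimmed = trim_b_range((60, 80), (20, 80), (55, 85), (60, 90))
untouched = trim_b_range((60, 80), (20, 80), (65, 85), (60, 90))
print(trimmed, untouched)
```

When the rectangle does not span the full A range of the query, a single tightened range no longer suffices and the rewrite needs a disjunction of ranges.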

Query rewrite: complex case
Original query:
    select … from R, S,... where R.C=S.C and 60<R.A<80 and 20<S.B<80 and...
Rewritten query:
    select … from R, S,... where R.C=S.C and (… and …) or ...

Experiment 1: Size of the Overlap

Experiment 2: Type of Overlap

Experiment 3: Number of Empty Joins Used in Rewrite

How much do the rectangles overlap with queries?

Query optimization experiments
- real-world workload of 26 queries
- 5 of the queries "qualified" for the rewrite
- only simple rewrites were considered
- all rewrites led to improved performance

Query Cardinality Estimate via Empty Joins

Query Cardinality Estimate via Empty Joins (SIEQE)
- Cardinality estimates are crucial for designing good query evaluation plans
- Uniform data distribution assumption (UDA): the standard assumption in database systems
- Histograms are effective in a single dimension but too expensive to build and maintain otherwise

The Strategy
- With UDA, the "density" is 1 tuple/sq unit
- Empty joins cover 20% of the area
- Adjusted density: 1.25 tuples/sq unit
Cardinality (columns: UDA, SIEQE): Q1: 100, 62; Q2: 125
[Figure: query regions Q1 and Q2]
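The arithmetic behind the adjusted density can be sketched as follows. The total area of 1000 square units, the query area of 100, and the 50 square units of overlap between one query and the empty region are assumptions made for illustration; the slide states only the densities and the 20% coverage.

```python
def adjusted_density(total_tuples, total_area, empty_area):
    """Spread the tuples over the non-empty part of the attribute space only."""
    return total_tuples / (total_area - empty_area)


def estimate_cardinality(query_area, empty_overlap, density):
    """Count only the part of the query region not covered by empty joins."""
    return (query_area - empty_overlap) * density


# UDA: 1000 tuples over 1000 sq units gives 1 tuple/sq unit.
# Empty joins cover 20% of the area (200 sq units).
density = adjusted_density(total_tuples=1000, total_area=1000, empty_area=200)

# Hypothetical query regions of 100 sq units each.
no_overlap = estimate_cardinality(query_area=100, empty_overlap=0, density=density)
half_empty = estimate_cardinality(query_area=100, empty_overlap=50, density=density)
print(density, no_overlap, half_empty)
```

A query that avoids the empty region gets a higher estimate than under UDA, because the same tuples occupy less space. A query that overlaps the empty region gets a lower one.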

Experiments Number of queries for which the error is less than a given limit

Discovery of Check Constraints and Their Application in DB2
We discover two types of check constraints (rules):
- correlations between attributes over ordered domains
- partitioning of attributes

Correlations between attributes over ordered domains
Rules have the form: Y = bX + a + [emin, emax]
Algorithm:
    for all tables in the database
        for all comparable variable pairs (X and Y) in the table
            apply OLS estimation to get a function of the form Y = a + bX
            calculate the max and min error (or residual), emax and emin
        endfor
    endfor
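For a single pair of columns, the inner step can be sketched in plain Python. This is a minimal illustration with made-up data and a hypothetical function name. It fits Y = a + bX by ordinary least squares and records the smallest and largest residual, so that every row satisfies a + bX + emin <= Y <= a + bX + emax.

```python
def discover_linear_rule(xs, ys):
    """Fit Y = a + bX by OLS and return (a, b, emin, emax)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("X is constant; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return a, b, min(residuals), max(residuals)


# Toy column values, not taken from any benchmark.
xs = [0, 1, 2, 3]
ys = [1, 4, 4, 7]
a, b, emin, emax = discover_linear_rule(xs, ys)
print(a, b, emin, emax)
```

The narrower the band [emin, emax] relative to the range of Y, the more selective the discovered rule is when it is used to derive a new predicate.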

Partitioning Algorithm
Rules have the form: If X = a, then Y ∈ [emin, emax]
Algorithm:
    for all tables in the database
        for any qualifying variable pair (X and Y) in the table
            calculate partitions by using GROUP BY X statements
            find the max and min value of Y for each partition
        endfor
    endfor
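For one pair of columns, the partitioning step is a grouped min/max. The sketch below does in Python what a GROUP BY X with MIN(Y) and MAX(Y) would do in SQL; the function name is hypothetical and the rows are toy values, not TPC-H data.

```python
from datetime import date


def discover_partition_rules(rows):
    """Return {x: (min_y, max_y)} for an iterable of (x, y) pairs."""
    bounds = {}
    for x, y in rows:
        if x not in bounds:
            bounds[x] = (y, y)
        else:
            lo, hi = bounds[x]
            bounds[x] = (min(lo, y), max(hi, y))
    return bounds


# Toy (linestatus, shipdate) rows.
rows = [
    ("F", date(1992, 1, 4)),
    ("O", date(1998, 12, 25)),
    ("F", date(1995, 6, 17)),
    ("O", date(1995, 6, 19)),
    ("F", date(1993, 7, 1)),
]
rules = discover_partition_rules(rows)
for status, (lo, hi) in sorted(rules.items()):
    print(f"If X = {status}, then Y in [{lo}, {hi}]")
```

A rule is useful only when the partitions are few and their ranges are noticeably narrower than the full domain of Y, which is presumably what makes a variable pair "qualifying".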

Experiments in TPC-H
TPC-H contains the following check constraint: L_RECEIPTDATE > L_SHIPDATE
Our algorithm discovered the following rule: L_RECEIPTDATE = L_SHIPDATE + (1, 30), m = 0.0114
Rules discovered through partitioning:
- If L_LINESTATUS = F, then L_SHIPDATE = (01/04/1992, 06/17/1995), m = 0.50
- If L_LINESTATUS = O, then L_SHIPDATE = (06/19/1995, 12/25/1998), m = 0.50

Applications
- DBA Wizard
- Semantic Query Optimization
- Improved Filter Factor Estimates

Example
Consider a query issued against a hotel database that requests the number of guests staying in the hotel on a given date:
    ARRIVAL_DATE <= '1999-06-15' AND DEPARTURE_DATE >= '1999-06-15'
The filter factor estimate for the query would be: ff = ff1 * ff2
If '1999-06-15' were approximately midway in the date ranges, we would estimate that a quarter of all the guests who arrived over all the years covered by the data are in the answer to the query!

Example (cont.)
Assume that the following check constraint was discovered:
    DEPARTURE_DATE = ARRIVAL_DATE + (1 DAY, 5 DAYS)
that is, every guest departs between 1 and 5 days after arriving. The original condition in the query predicate can then be changed to:
    ARRIVAL_DATE <= '1999-06-15' AND ARRIVAL_DATE >= '1999-06-10'
or
    ARRIVAL_DATE BETWEEN '1999-06-10' AND '1999-06-15'
Both predicates now restrict the same column, so the filter factor is estimated as: ff = (ff1 + ff2 - 1)
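The difference between the two estimates is easy to see with numbers. The sketch below assumes arrival dates spread uniformly over a hypothetical 1000-day history, with the query date at day 500 and a maximum stay of 5 days, so that the rewritten predicate restricts arrivals to the window from day 495 to day 500. None of these figures come from the slides.

```python
def ff_independent(ff1, ff2):
    """Two predicates on different columns, assumed independent."""
    return ff1 * ff2


def ff_range(ff_upper_bound, ff_lower_bound):
    """Two one-sided predicates on the same column forming a range."""
    return max(0.0, ff_upper_bound + ff_lower_bound - 1.0)


DAYS = 1000          # hypothetical length of the recorded history
QUERY_DAY = 500      # query date, roughly midway
MAX_STAY = 5         # from the discovered rule: stays last at most 5 days

# Original predicates: ARRIVAL_DATE <= d AND DEPARTURE_DATE >= d
ff1 = QUERY_DAY / DAYS                      # arrivals on or before d
ff2 = (DAYS - QUERY_DAY) / DAYS             # departures on or after d
naive = ff_independent(ff1, ff2)

# Rewritten predicates: ARRIVAL_DATE <= d AND ARRIVAL_DATE >= d - MAX_STAY
ff_lower = (DAYS - (QUERY_DAY - MAX_STAY)) / DAYS
improved = ff_range(ff1, ff_lower)

print(naive, improved)
```

Under these assumptions the estimate falls from a quarter of all guests to about half a percent, which is far closer to what a hotel would actually see on a single day.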

Other Research on the Use of Soft Constraints in Query Optimization

Query-driven Approach
- Build multidimensional histograms based on query results (Microsoft)
- Improve cardinality estimates by looking at the intermediate query results (IBM)
- Both techniques generate statistical soft constraints

Data-driven Approach
- Lots of methods use Bayesian networks to infer statistical soft constraints
- Lots of methods discover functional dependencies in data (absolute soft constraints)
- Most recently, BHUNT and CORDS use sampling to discover soft constraints (IBM)

References
- Q. Cheng, J. Gryz, F. Koo, T. Y. Cliff Leung, L. Liu, X. Qian, B. Schiefer: Implementation of Two Semantic Query Optimization Techniques in DB2 Universal Database. VLDB 1999.
- J. Edmonds, J. Gryz, D. Liang, R. Miller: Mining for Empty Rectangles in Large Data Sets. ICDT 2001.
- J. Gryz, B. Schiefer, J. Zheng, C. Zuzarte: Discovery and Application of Check Constraints in DB2. ICDE 2001.
- P. Godfrey, J. Gryz, C. Zuzarte: Exploiting Constraint-Like Data Characterizations in Query Optimization. SIGMOD 2001.
- J. Gryz, D. Liang: Query Optimization via Empty Joins. DEXA 2002.
- J. Gryz, D. Liang: Query Cardinality Estimation via Data Mining. IIS 2004.