Reverse Query Processing Carsten Binnig, Donald Kossmann and Eric Lo ICDE 2007 Presented by Bhupesh Chawda.


Motivation ● Testing database applications requires generating test databases ● Commercial tools are available, but they do not capture the semantics of the application logic ● Queries executed on the generated databases often return empty or meaningless results ● Example: foreach x in SELECT a FROM R do switch (x) case 1: do some work; case 2: do some work; do some work; end foreach

Motivation... ● Example of database generated by commercial tools: ● Query: SELECT orderdate, SUM(price*(1-discount)) FROM Lineitem, Orders WHERE l_oid=oid GROUP BY orderdate HAVING AVG(price*(1-discount))<=100 AND SUM(price*(1-discount))>=150;

RQP – Reverse Query Processing ● Given a database schema S_D, a query Q and the result of the query R, the goal of RQP is to find a database D (a set of tables) such that R = Q(D) ● Database D must be compliant with the database schema S_D and its integrity constraints ● Based on Reverse Relational Algebra (RRA) ● Logically, each operator in relational algebra has a corresponding operator in RRA.

Architecture of RQP

Architecture of RQP contd. 1. Parser – Reverse Query Tree 2. Bottom-up Query Annotation – Input and Output Schema 3. Query Optimization – More aggressive than traditional query optimization 4. Top-down Data Instantiation – Data generation – Model Checker

What does RQP actually do? ● Example: SELECT a FROM R WHERE p = 3; ● Schema of R: columns a and p ● Result RTable: a = 1, 2 ● Relational algebra tree: R → Select p = 3 → Project a ● RRA tree: RTable → Reverse Project a → Reverse Select p = 3 → R ● Output of Reverse Project operation: (a, p) = (1, 3), (2, 3) ● Output of Reverse Select operation: (a, p) = (1, 3), (2, 3) OR (a, p) = (1, 3), (2, 3), (3, 5)

Example Schema Table LINEITEMTable ORDERS

Reverse Query Tree QUERY SELECT SUM(price) FROM Lineitem, Orders WHERE l_oid=oid GROUP BY orderdate HAVING AVG(price)<=100;

Reverse Relational Algebra ● Reverse variant of relational algebra. ● op(op⁻¹(R)) = R, but the converse does not hold: op⁻¹(op(D)) need not equal D. ● Each operator has 1 input and 0 or more outputs. ● Selection, projection, rename, Cartesian product, union, aggregation and minus are the basic operators of RRA. Other operators can be composed using laws such as associativity and commutativity.
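
The property op(op⁻¹(R)) = R can be checked on the small query SELECT a FROM R WHERE p = 3 used earlier. The following is only a sketch: the helper names, the dict-based rows and the extra tuple (3, 5) are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical helpers: rows are dicts, tables are lists of rows.

def reverse_project(rtable, new_column, value):
    """Reverse projection: add back the column the forward projection drops."""
    return [{**row, new_column: value} for row in rtable]


def reverse_select(rows, extra_rows=()):
    """Reverse selection: return a superset of the input.

    Any extra row must violate the selection predicate (here p = 3).
    """
    return list(rows) + list(extra_rows)


def forward_query(table):
    """Forward query: SELECT a FROM R WHERE p = 3."""
    return [{"a": row["a"]} for row in table if row["p"] == 3]


rtable = [{"a": 1}, {"a": 2}]
projected = reverse_project(rtable, "p", 3)
database = reverse_select(projected, extra_rows=[{"a": 3, "p": 5}])

# Running the forward query on the generated database returns the RTable.
print(forward_query(database) == rtable)
```

Both candidate databases (with and without the extra tuple) map back to the same RTable, which is why the reverse direction is not unique.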

Bottom-up Query Annotation (Content taken from [2] also) ● Annotates each operator op⁻¹ of a reverse query tree with an output schema S_OUT and an input schema S_IN. ● Both schemas (input and output) are defined by – Attributes A (names and data types) – Integrity Constraints C – Functional Dependencies F and – Join Dependencies J ● A schema S in RQP is formally defined as the four-tuple S = (A, C, F, J) ● Hence each operator can ● Check the correctness of its input data so that its own operation does not fail (S_IN) and ● Ensure that it generates valid output data so that the operators below it do not fail (S_OUT)

Bottom-up Query Annotation ● S_OUT for the leaves is defined first. In this case the reverse join is the leaf, so its S_OUT is the database schema. ● S_IN is computed from S_OUT. ● S_IN of the previous operator is used to initialize S_OUT of the next operator.

Bottom-up Annotation Example ● SELECT a FROM R WHERE p = 3; ● RRA tree: R, Reverse Select p = 3, Reverse Project a ● Schema for R is: ● "I need column "a" to be the PRIMARY KEY." ● "The values in column p must be nothing other than 3." ● "OK. I'll give all unique "a" values for tuple with Type" ● "OK. The only values in column p will be 3." ● Figure labels: S_OUT for Select, S_OUT for Project, S_IN for Select

Reverse Projection ● Generates new columns according to its output schema.

Reverse Selection ● Returns a superset of its input. ● Input must satisfy the selection predicate. ● If additional tuples are generated, these must satisfy the negation of selection predicate.

Reverse Aggregation ● Generates new columns. May also generate new tuples to satisfy all the constraints of output schema.

Reverse join ● Takes one relation as input and generates two relations as output

Other RRA operators ● Reverse Union ● Takes one relation as input, generates two relations as output. ● Distributes tuples to the two output relations. ● Reverse Minus ● Routes the tuples from the input relation to the left branch ● Reverse Rename ● Only the output schema is affected, no data is manipulated

Top-down Data Instantiation ● Interprets the optimized reverse query execution plan using an RTable R and possibly query parameters as input, and generates a database instance D as output. ● Consists of a set of physical RRA operators ● Each logical RRA operator can have several counterparts in physical RRA. The choice is application dependent. ● E.g. functional vs. performance testing ● Limitation on implementing some physical RRA operators (Reverse Select, Reverse Project and Reverse Minus) – Example: – SELECT S1.A,S1.B,S2.A,S2.B FROM S as S1, S as S2 – WHERE S1.B=S2.B AND S1.A>5 AND S2.A<=5; – R is – S1 pushes: and – S2 pushes: and

Top-down Data Instantiation contd. ● Each operator in RQP is implemented as an iterator. Iterators are push based. ● Whenever an operator produces a tuple, it calls the pushNext method of the relevant child (output) operator and continues processing once the child operator is ready ● The RTable is scanned and its tuples are pushed one at a time to the child operators of the reverse query tree. ● An iterator has the following methods: ● open(), pushNext(Tuple t) and close()
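
The push-based iterator interface can be sketched as follows. This is a minimal illustration under stated assumptions: the class names, the constant value used by the reverse projection and the TableSink leaf are hypothetical, and only the three method names come from the slide (written here in Python style).

```python
class ReverseOperator:
    """Push-based iterator exposing open(), push_next(tuple) and close()."""

    def __init__(self, children=()):
        self.children = list(children)

    def open(self):
        for child in self.children:
            child.open()

    def push_next(self, tup):
        raise NotImplementedError

    def close(self):
        for child in self.children:
            child.close()


class ReverseProject(ReverseOperator):
    """Adds one column with a fixed value (stand-in for data instantiation)."""

    def __init__(self, child, column, value):
        super().__init__([child])
        self.column = column
        self.value = value

    def push_next(self, tup):
        self.children[0].push_next({**tup, self.column: self.value})


class ReverseSelect(ReverseOperator):
    """Identity: passes every input tuple on unchanged."""

    def __init__(self, child):
        super().__init__([child])

    def push_next(self, tup):
        self.children[0].push_next(tup)


class TableSink(ReverseOperator):
    """Leaf of the reverse query tree: collects tuples of a base table."""

    def open(self):
        self.rows = []

    def push_next(self, tup):
        self.rows.append(tup)


def run(root, rtable):
    """Scan the RTable and push its tuples one at a time into the plan."""
    root.open()
    for tup in rtable:
        root.push_next(tup)
    root.close()


sink = TableSink()
plan = ReverseProject(ReverseSelect(sink), column="p", value=3)
run(plan, [{"a": 1}, {"a": 2}])
print(sink.rows)
```

Unlike the pull-based getNext() model of ordinary query engines, the producer drives execution here, which suits operators that have one input and several outputs.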

Example: Bottom-up Annotation, Data Instantiation and Model Checking ● Query: SELECT A FROM R WHERE A + B < 30; ● RRA tree: R, Reverse Select A+B < 30, Reverse Project A ● Consider the Reverse Project operation – Input schema: A = 3 – Output schema: A + B < 30 ● Call the model checker with the formula: A=3 & A+B<30 ● Instantiated data: (A, B) = (3, 20)
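
To make the model-checker call concrete, the formula A=3 & A+B<30 can be solved by exhaustive search over a small integer domain. This is only a sketch: find_model and the domain 0..30 are assumptions for illustration, and a real decision procedure works very differently.

```python
from itertools import product


def find_model(variables, domain, formula):
    """Return the first assignment over `domain` that satisfies `formula`,
    or None if no assignment does."""
    for values in product(domain, repeat=len(variables)):
        assignment = dict(zip(variables, values))
        if formula(assignment):
            return assignment
    return None


# Formula from the slide: A = 3 & A + B < 30
model = find_model(
    ["A", "B"],
    range(0, 31),
    lambda m: m["A"] == 3 and m["A"] + m["B"] < 30,
)
print(model)
```

Any B below 27 is a valid instantiation, so the value found here may differ from the one shown on the slide.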

Introduction to Model Checker* ● Given a model of a system, tests automatically whether this model meets a given specification. ● Mathematical formulation of the constraints and the system – Predicate logic ● Often also generates a model that satisfies or does not satisfy a given formula (specification) ● Examples: ● SATCHMO (SAtisfiability CHecking by MOdel generation) – Uses backtracking ● CVC3 * Not a part of the RQP paper

Example: SATCHMO-based Approach* ● A relation that this example instantiates could be OWNS(a PERSON, b CAR), with individual relations IS_PERSON(a PERSON) and IS_CAR(b CAR) * Not a part of the RQP paper

Example of CVC3 model checker* ● Input file for CVC3:
% Possible values for PERSON data type
DATATYPE PERSON = P1 | P2 | P3 | P4 | P5 END;
% Possible values for CAR data type
DATATYPE CAR = C1 | C2 | C3 | C4 | C5 END;
% Iterator
INDEX_INT : TYPE = SUBTYPE (LAMBDA (x: INT) : x > 0 AND x < 6);
% Record type OWNS_TYPE
OWNS_TYPE : TYPE = [PERSON, CAR];
% Table having records of type OWNS_TYPE
R : ARRAY INT OF OWNS_TYPE;
% Unique constraint
ASSERT DISTINCT (R[1].0, R[2].0, R[3].0, R[4].0, R[5].0);
ASSERT DISTINCT (R[1].1, R[2].1, R[3].1, R[4].1, R[5].1);
% Application constraint
ASSERT FORALL (i : INDEX_INT) : R[i].0 = P1 => R[i].1 /= C1;

Querying CVC3* ● Query: – CHECKSAT R[1].0 = P1 AND R[1].1 = C1; ● Response: – Unsatisfiable. ● Query: – QUERY R[1].0 = P1 AND R[1].1 = C1; – COUNTERMODEL; ● Response: – Invalid – ASSERT (R[1] = (P2, C1)); – ASSERT (R[2] = (P1, C2)); – ASSERT (R[3] = (P3, C3)); – ASSERT (R[4] = (P4, C4)); – ASSERT (R[5] = (P5, C5)); * Not a part of the RQP paper
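
The same constraints can be explored without CVC3 by a brute-force search, which may help in seeing why the first query is unsatisfiable while a valid table still exists. This is a sketch only: find_owns_table and its `required` parameter are hypothetical, and the table returned need not match the counter-model CVC3 prints.

```python
from itertools import permutations

PERSONS = ["P1", "P2", "P3", "P4", "P5"]
CARS = ["C1", "C2", "C3", "C4", "C5"]


def find_owns_table(required=None):
    """Search for a 5-row OWNS table that satisfies the constraints.

    Using permutations enforces both DISTINCT (unique) constraints.
    `required` optionally fixes what the first row must be.
    """
    for person_order in permutations(PERSONS):
        for car_order in permutations(CARS):
            rows = list(zip(person_order, car_order))
            # Application constraint: P1 must not own C1.
            if ("P1", "C1") in rows:
                continue
            if required is not None and rows[0] != required:
                continue
            return rows
    return None


table = find_owns_table()
print(table)
# Requiring R[1] = (P1, C1) contradicts the application constraint.
print(find_owns_table(required=("P1", "C1")))
```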

SPQR – System for Processing Queries Reversely ● A prototype for functional testing ● Transforms the information from the input and output schema into a constraint formula. ● Feeds the constraint formula to the model checker. ● The model checker returns one of the possible instantiations of all the variables as output.

Reverse Projection in SPQR ● pushNext() method for Reverse Projection in SPQR

Reverse Projection in SPQR contd. ● instantiateData() function

Example ● Constraint formula for the second tuple of reverse projection: ● For n = 1: sum_price = 120 & orderdate != 19900102 & avg_price <= 100 & sum_price = price1 & avg_price = sum_price/1 ● For n = 2: sum_price = 120 & orderdate != 19900102 & avg_price <= 100 & sum_price = price1 + price2 & avg_price = sum_price/2
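
The trial-and-error over n in this example can be sketched as a loop that tries n = 1, 2, ... until the formula becomes satisfiable. This is an illustration under assumptions: instantiate_prices is a hypothetical name, the orderdate conjunct is left out, and splitting the sum into equal prices stands in for the model checker.

```python
def instantiate_prices(sum_price, max_avg, max_n=10):
    """Guess the number of tuples n, then pick prices with
    price1 + ... + price_n = sum_price and sum_price / n <= max_avg."""
    for n in range(1, max_n + 1):
        avg_price = sum_price / n
        if avg_price <= max_avg:
            # One simple satisfying assignment: all prices equal.
            return n, [avg_price] * n
    return None


# n = 1 fails because 120 / 1 > 100; n = 2 succeeds with avg_price = 60.
n, prices = instantiate_prices(sum_price=120, max_avg=100)
print(n, prices)  # 2 [60.0, 60.0]
```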

Constraint solvers and Aggregation ● Constraint solvers available today do not handle aggregation ● The number of tuples involved must therefore be guessed ● The input to a constraint solver is a formula with variables. The result is an instantiation of those variables. ● For more than one tuple, the variables of all the tuples must be considered.

Reverse Aggregation in SPQR ● pushNext() method of reverse aggregation

Other operators in SPQR ● Reverse Join in SPQR ● Can be implemented as a simple projection with duplicate elimination. Does not call model checker, hence cheaper. ● Reverse Selection ● Implements the identity function. Returns the input. ● Reverse Union ● If input tuple is compatible with left output schema, pushes tuple to left operator ● Else if input tuple is compatible with right output schema, pushes tuple to right operator ● Else return error
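
The routing rule for Reverse Union above can be sketched as below. The function name and the use of plain predicates as schema-compatibility checks are assumptions for illustration only.

```python
def reverse_union_push(tup, left_ok, right_ok, left_out, right_out):
    """Route one input tuple to the first output whose schema accepts it."""
    if left_ok(tup):
        left_out.append(tup)
    elif right_ok(tup):
        right_out.append(tup)
    else:
        raise ValueError(f"tuple {tup!r} fits neither output schema")


left, right = [], []
# Hypothetical output schemas: the left branch requires a < 10, the right a >= 10.
for row in [{"a": 3}, {"a": 42}, {"a": 7}]:
    reverse_union_push(
        row,
        lambda t: t["a"] < 10,
        lambda t: t["a"] >= 10,
        left,
        right,
    )
print(left, right)
```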

Reverse Union (Content from [2])

Other operators contd. ● Reverse Minus – If input tuple is compatible with the left output schema and not with the right output schema, push the tuple to left output. – Else returns error ● Reverse Rename – Does not manipulate data. Hence implements identity.

Nested Queries ● Uses the concept of 'Apply' i.e. nested iterations ● The inner sub query can be thought of as a reverse query tree whose input is parameterized on values generated for correlation variables of the outer query. ● Expensive – Quadratic complexity with size of RTable

Optimization of Data Instantiation ● Calls to the model checker are expensive (exponential in the size of the formula), hence the data instantiation component needs optimization ● Definitions: ● Independent attribute – Attribute a is independent w.r.t. an output schema S_OUT iff S_OUT has no integrity constraints limiting the domain of a and a is not correlated with another dependent attribute. ● Constrictive independent attribute – Attribute a is constrictive independent if it is independent w.r.t. an output schema S_OUT when certain optimization-dependent consistency constraints are disregarded.

Optimization of Data Instantiation ● Default value optimization – Assign a fixed default value to an independent attribute a. ● Unique value optimization – Assign a unique, incrementing counter value to a constrictive independent attribute a that is bound by PRIMARY KEY or UNIQUE constraints – If another independent or constrictive independent attribute a' is correlated with a on equality, set a' to the same unique value as a. ● Single value optimization – Used for constrictive independent attributes bound by CHECK – Included in the constraint formula only once. Afterwards, the same value is reused.

Optimization of Data Instantiation ● Aggregation-value optimization – Applied to constrictive independent attributes involved in aggregation – If SUM(a) (a float) is in S_IN and MIN(a) and MAX(a) are not in S_IN, then instantiate a by solving a = SUM(a) / n – If a is an integer, then compute a by solving SUM(a) = n1 × a1 + n2 × a2, where a1 = ⌊SUM(a)/n⌋, a2 = ⌈SUM(a)/n⌉, n1 = n − n2 and n2 = SUM(a) mod n – If MIN(a) and MAX(a) are also in S_IN and n >= 3, then use MIN(a) and MAX(a) to instantiate a once each, and use a = (SUM(a) − MIN(a) − MAX(a)) / (n − 2) for the remaining tuples – If only COUNT(a) is in S_IN, then default value optimization can be used, as a is independent in this case.
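
The three aggregation-value rules can be written out as small functions. This is a sketch, not SPQR's code: the function names are hypothetical, and the integer case assumes that SUM(a)/n is rounded down for a1 and up for a2, which is the reading under which n1 × a1 + n2 × a2 equals SUM(a).

```python
def instantiate_float(total, n):
    """Float attribute, only SUM known: every tuple gets SUM / n."""
    return [total / n] * n


def instantiate_int(total, n):
    """Integer attribute: n1 tuples get floor(SUM/n), n2 get ceil(SUM/n)."""
    low = total // n                      # a1 = floor(SUM / n)
    n_high = total % n                    # n2 = SUM mod n
    high = low + (1 if n_high else 0)     # a2 = ceil(SUM / n)
    n_low = n - n_high                    # n1 = n - n2
    return [low] * n_low + [high] * n_high


def instantiate_with_min_max(total, n, lowest, highest):
    """SUM, MIN and MAX known, n >= 3: use MIN and MAX once each and
    spread the remainder evenly over the other n - 2 tuples."""
    if n < 3:
        raise ValueError("this rule needs at least three tuples")
    rest = (total - lowest - highest) / (n - 2)
    return [lowest, highest] + [rest] * (n - 2)


print(instantiate_float(120.0, 2))              # [60.0, 60.0]
print(instantiate_int(10, 3))                   # [3, 3, 4]
print(instantiate_with_min_max(100, 4, 10, 40))  # [10, 40, 25.0, 25.0]
```

The last rule only yields a consistent result when the evenly spread remainder lies between MIN(a) and MAX(a); the sketch does not check this.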

Optimization of Data Instantiation ● Count heuristics – Do not find instantiations. Instead, they reduce the number of attempts needed to guess the number of tuples in reverse aggregation – If SUM(a) and AVG(a) are attributes of S_IN, then n = SUM(a) / AVG(a) – If SUM(a) and MAX(a) are attributes of S_IN, then n ≥ SUM(a)/MAX(a) (if SUM(a) and MAX(a) ≥ 0; if SUM(a) and MAX(a) ≤ 0 use n ≤ SUM(a)/MAX(a)) – If SUM(a) and MIN(a) are attributes of S_IN, then n ≤ SUM(a)/MIN(a) (if SUM(a) and MIN(a) ≥ 0; if SUM(a) and MIN(a) ≤ 0 use n ≥ SUM(a)/MIN(a)) ● Tolerance on precision – Flexible constraints are used to speed up model checking. E.g. specifying 90 < a < 110 instead of a = 100
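
The count heuristics amount to narrowing the range of candidate values for n before any guessing starts. A possible sketch follows; count_bounds, its keyword arguments and the n_limit cap are assumptions made for illustration, and zero-valued aggregates are simply ignored to avoid division by zero.

```python
import math


def count_bounds(sum_a, avg_a=None, max_a=None, min_a=None, n_limit=1000):
    """Return (lowest, highest) candidate tuple counts n for one group."""
    lowest, highest = 1, n_limit
    if avg_a:
        # SUM and AVG known: n is determined exactly.
        n = round(sum_a / avg_a)
        return n, n
    if max_a:
        bound = sum_a / max_a
        if sum_a >= 0 and max_a > 0:
            lowest = max(lowest, math.ceil(bound))    # n >= SUM / MAX
        elif sum_a <= 0 and max_a < 0:
            highest = min(highest, math.floor(bound))  # n <= SUM / MAX
    if min_a:
        bound = sum_a / min_a
        if sum_a >= 0 and min_a > 0:
            highest = min(highest, math.floor(bound))  # n <= SUM / MIN
        elif sum_a <= 0 and min_a < 0:
            lowest = max(lowest, math.ceil(bound))    # n >= SUM / MIN
    return lowest, highest


print(count_bounds(120, avg_a=60))             # (2, 2)
print(count_bounds(100, max_a=30, min_a=20))   # (4, 5)
```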

Optimization of Data Instantiation ● Memoization – Cache calls to the model checker – π⁻¹ and χ⁻¹ often solve similar constraints and carry out the same kind of guessing. Results of guessing can be reused. – For example, the guessing of the aggregation count for π⁻¹ can be reused for χ⁻¹
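
Memoization of this kind can be sketched with a cache in front of the expensive call. In the sketch below, guess_count is a hypothetical stand-in for the guessing loop that would call the model checker, and solver_calls only exists to show that the second request is served from the cache.

```python
from functools import lru_cache

solver_calls = 0  # counts how often the expensive search really runs


@lru_cache(maxsize=None)
def guess_count(sum_price, max_avg, max_n=10):
    """Find the smallest n with sum_price / n <= max_avg (cached)."""
    global solver_calls
    solver_calls += 1
    for n in range(1, max_n + 1):
        if sum_price / n <= max_avg:
            return n
    return None


first = guess_count(120, 100)   # e.g. requested by reverse projection
second = guess_count(120, 100)  # e.g. requested by reverse aggregation
print(first, second, solver_calls)  # 2 2 1
```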

Performance Experiments and Results ● The SPQR system was implemented in Java 1.4 and installed on a Linux AMD Opteron 2.2 GHz Server with 4 GB of main memory. ● In all experiments, SPQR was configured to allow 0 percent tolerance ● PostgreSQL 7.4.8 was used as the back end database system ● Cogent was used as a decision procedure ● SPQR was evaluated using the TPC-H benchmark

Size of Rtable and Database Generated

Running time for different scaling factors

Applications of RQP ● Generating test databases for functional testing ● foreach x in SELECT a FROM R do ● switch (x) ● case 1: do some work; ● case 2: do some work; ● do some work; ● end foreach ● Generating test database for decision support applications ● Query for Data Cube and Reports on the Cube ● Generate large test databases for performance and scalability testing

More Applications of RQP ● SQL Debugger ● If a query produces the wrong results, then RQP can be used to find the operators that are responsible for the wrong query results ● Program Verification ● To prove correctness of a program, RQP can find all necessary conditions of database in order to reach certain program states ● Updating views ● Database Sampling, Compression

Conclusion ● RQP is a new technique and has several applications. ● SPQR is a full-fledged RQP system for generating databases for functional testing of database applications. SPQR scales linearly with the size of the database generated.

New Directions in Test Data Generation ● Reverse Query Processing (RQP) ● QAGen – Query Aware Generation of Test Databases ● Massive Stochastic generation of SQL

An Introduction to QAGen- Query Aware Test Database Generator Carsten Binnig, Donald Kossmann, Eric Lo and M. Tamer Ozsu SIGMOD 2007

Motivation ● RQP focuses on generating databases to test database applications ● QAGen focuses on generating databases to test a DBMS and its components (correctness and performance testing) ● A common way to test a DBMS component is to execute queries on the generated databases before and after the new component is incorporated and to compare system behavior. This is not adequate. ● To test a DBMS component adequately, it is necessary to control the input and output of the intermediate operators of the query ● Testing of a memory manager in a DBMS ● Memory allocated to a join can be determined by defining the output cardinality of its inputs ● Testing of the cardinality estimation component of a DBMS

DBMS Testing Problem ● The DBMS testing problem is to guarantee that executing a test query on a database can obtain desired intermediate query results ● Example: Output cardinalities, join distributions ● Instead of generating test databases and then checking if it is possible for the test query to obtain desired results, generate a database tailored for each test case - QAGen

QAGen Architecture ● Given a database schema M, parametric query Qp, and set of user defined constraints on each operator (called knobs), QAGen generates a database D and parameter values P such that executing Qp on D with parameter values P, guarantees the user requirements on each operator to be satisfied.

QAGen Architecture ● Query Analyzer ● Correct knob selections - Identifies available knobs ● Assign physical implementations to operators - Assigns correct (knob supported) implementation to an operator ● Symbolic Query Processing ● Iterator model – open(), getNext() and close() ● Reads tuples from child operators, processes each tuple and returns the resulting tuple to parent ● Data Instantiation ● Constraint Solver

Symbolic Data Model ● Symbolic relations and symbolic databases ● Pre-grouping ● Data Storage ● Relational databases. Columns with var char data type ● Ptable – Relational table for storing predicates ● Predicates in Ptable are used to calculate the predicate closure which is fed to the constraint solver

Example for Symbolic execution of Selection Operator Select * from Customer where c_acctbal >= p1; Predicates from Ptable and the symbolic database together are used to calculate a predicate closure and this is fed to the constraint solver to instantiate concrete values in the symbolic database
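
A rough picture of symbolic execution for this selection: tuples hold symbols rather than concrete values, and the operator records predicates over those symbols instead of evaluating them. The sketch below is an assumption-laden illustration, not QAGen's code: symbolic_select, the `$`-prefixed symbol names and the way an output-cardinality knob is enforced (the first `cardinality` rows pass, the rest fail) are all hypothetical.

```python
def symbolic_select(symbolic_rows, column, param, cardinality, ptable):
    """Let the first `cardinality` symbolic rows satisfy column >= param.

    Predicates are appended to `ptable` (a list standing in for the PTable);
    no concrete values are computed here.
    """
    output = []
    for index, row in enumerate(symbolic_rows):
        symbol = row[column]
        if index < cardinality:
            ptable.append(f"{symbol} >= {param}")
            output.append(row)
        else:
            ptable.append(f"{symbol} < {param}")
    return output


# Symbolic Customer relation with four tuples.
customer = [
    {"c_id": f"$c_id{i}", "c_acctbal": f"$c_acctbal{i}"} for i in range(1, 5)
]
ptable = []
selected = symbolic_select(customer, "c_acctbal", "p1", 2, ptable)
print(selected)
print(ptable)
```

A constraint solver would later pick concrete values for the symbols and for p1 that satisfy every recorded predicate.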

Conclusion ● QAGen is a system that generates tailored databases for different DBMS test cases. ● It is based on traditional query processing and symbolic execution. ● QAGen can be used to generate databases for complex queries and it scales linearly with the size of the query.

An Introduction to Massive Stochastic Testing of SQL Don Slutz Microsoft Research VLDB 1998

Motivation ● Given a query and the result, how to verify that the result is correct? ● Verifying if the DBMS works correctly. ● Execute the query on multiple DBMS and then check if the query results match ● If other DBMS also do not match, take the majority

Introduction ● This paper describes a method to rapidly create a large number of SQL statements without human intervention. ● SQL statements are generated stochastically ● Stochastic testing has the advantage that the quality of the test improves as the test size increases. ● The generated queries are tested by running them on several vendors' DBMSs ● A system called RAGS (Random Generation of SQL) was built and is used by the Microsoft SQL Server testing group

RAGS system

SQL Generation by RAGS ● RAGS generates SQL statements by walking a stochastic parse tree and printing it out ● Various choices for selecting the next path are selected stochastically by RAGS while generating SQL ● Controlling the length of the query: No. of Terminals and Non-terminals in the query ● Query: ● select name, salary ● + commission from ● employee where ● (salary > 10000) and ● (department = ”sales”)
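
The idea of walking a parse tree and choosing each branch at random can be sketched with a toy grammar. Everything in this sketch is an assumption for illustration: the table and column names follow the slide's sample query, but the grammar, probabilities and function names are not those of RAGS.

```python
import random

NUMERIC_COLUMNS = ["salary", "commission"]


def gen_expression(rng, depth):
    """expression := column | expression + expression"""
    if depth <= 0 or rng.random() < 0.5:
        return rng.choice(NUMERIC_COLUMNS)
    left = gen_expression(rng, depth - 1)
    right = gen_expression(rng, depth - 1)
    return f"{left} + {right}"


def gen_condition(rng, depth):
    """condition := (column > constant) | condition AND/OR condition"""
    if depth <= 0 or rng.random() < 0.5:
        return f"({rng.choice(NUMERIC_COLUMNS)} > {rng.randint(0, 20000)})"
    operator = rng.choice(["AND", "OR"])
    left = gen_condition(rng, depth - 1)
    right = gen_condition(rng, depth - 1)
    return f"{left} {operator} {right}"


def gen_select(rng, max_depth=2):
    """Walk the toy grammar once and print the resulting statement.

    max_depth bounds the nesting and therefore the query length."""
    item_count = rng.randint(1, 3)
    items = ", ".join(gen_expression(rng, max_depth) for _ in range(item_count))
    sql = f"SELECT {items} FROM employee"
    if rng.random() < 0.8:
        sql += f" WHERE {gen_condition(rng, max_depth)}"
    return sql


# A seeded generator makes a failing statement reproducible.
query = gen_select(random.Random(42))
print(query)
```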

Grammar for SQL (Select, From, Where)
SELECT_EXPR: SELECT [TOP term] [DISTINCT | ALL] selectExpression [,...]
selectExpression: * | expression [[AS] columnAlias] | tableAlias.*
FROM_EXPR: FROM tableExpression [,...]
tableExpression: {[schemaName.] tableName} [{{LEFT | RIGHT} [OUTER] | [INNER] | CROSS | NATURAL} JOIN tableExpression]
WHERE_EXPR: WHERE andcondition [OR andcondition]
condition: operand [conditionRightHandSide] | NOT condition | EXISTS (select)

Testing Results and Conclusion ● Validating query results is hard. Results can be compared across vendors only for the small subset of SQL that they have in common. ● Erroneous queries are simplified automatically by removing WHERE and HAVING clauses while retaining the error. ● Conclusion: RAGS is an experiment in massive stochastic testing of SQL systems. Its main contribution is to generate entire SQL statements stochastically, since this enables greater coverage of the SQL input domain as well as rapid test generation.

References 1. C. Binnig, D. Kossmann, and E. Lo. Reverse Query Processing. ICDE 2007 2. C. Binnig, D. Kossmann, and E. Lo. Reverse Query Processing. Technical report, ETH Zurich, http://www.dbis.ethz.ch/research/publications/rqp.pdf, 2006 3. C. Binnig, D. Kossmann, E. Lo, and M. Tamer Ozsu. QAGen: Generating Query-Aware Test Databases. SIGMOD 2007 4. D. R. Slutz. Massive Stochastic Testing of SQL. VLDB 1998: 618-622

Thank You! Questions?
