AQuery: A Database System for Order. Dennis Shasha, joint work with Alberto Lerner



Motivation: The need for ordered data. Queries in finance, signal processing, etc. depend on order. SQL 99 has extensions, but they are clumsy.

Moving Averages is algorithmically linear but…
Sales(month, total)
SELECT t1.month+1 AS forecastMonth, (t1.total + t2.total + t3.total)/3 AS 3MonthMovingAverage
FROM Sales AS t1, Sales AS t2, Sales AS t3
WHERE t1.month = t2.month - 1 AND t1.month = t3.month - 2
Can the optimizer make a 3-way (in general, n-way) join linear time?
Ref: Data Mining and Statistical Analysis Using SQL, Trueblood and Lovett, Apress, 2001

Moving Averages in SQL99 is awkward Sales(month, total) Alberto, please put in SQL 99

Choose query from Jennifer Alberto, I suggest query 2 because that may be hard in sql 99.

Motivation: Problems extending SQL with order. Queries are hard to read. Cost of execution is often non-linear (would not pass a basic algorithms course). Few operators preserve order, so optimization is hard.

Idea Whatever can be done on a table can be done on an ordered table (arrable). Not vice versa. Aquery – query language on arrables; operations on columns. Upward compatible from SQL 92. Optimization ideas are new, but natural.

SQL + Order: Moving Averages
Sales(month, total)
SELECT month, avgs(8, total)
FROM Sales ASSUMING ORDER month
The ASSUMING ORDER clause gives the order to be used by vector-to-vector functions.
Execution (Sales is an arrable): 1. FROM clause: enforces the order in the ASSUMING clause. 2. SELECT clause: for each month, yields the moving average (window size 8) ending at that month.
avgs: vector-to-vector function, order-dependent and size-preserving.
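The execution steps above can be sketched in Python. This is only an illustration, not AQuery's implementation: the function name `avgs` mirrors the slide, but the treatment of the first positions (averaging over the shorter prefix that is available) and the sample data are assumptions. The running sum keeps the computation linear in the number of rows.

```python
from collections import deque

def avgs(window, values):
    """Moving average ending at each position (size-preserving).

    Assumption: positions with fewer than `window` predecessors
    average over the values seen so far.
    """
    out = []
    running = 0.0
    buf = deque()
    for v in values:
        buf.append(v)
        running += v
        if len(buf) > window:
            running -= buf.popleft()   # slide the window in O(1)
        out.append(running / len(buf))
    return out

# Hypothetical Sales(month, total) rows, deliberately out of order.
sales = [(3, 30.0), (1, 10.0), (2, 20.0), (4, 40.0)]

ordered = sorted(sales)                      # ASSUMING ORDER month
months = [m for m, _ in ordered]
totals = [t for _, t in ordered]
result = list(zip(months, avgs(3, totals)))  # SELECT month, avgs(3, total)
```

Each output row pairs a month with the average of the window ending at that month, so the result has as many rows as the input.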

SQL + Order: Top N
Employee(ID, salary)
SELECT first(N, salary)
FROM Employee ASSUMING ORDER Salary
first: vector-to-vector function, order-dependent and non size-preserving.
Execution: 1. FROM clause: orders the arrable by Salary. 2. SELECT clause: applies first() to the ‘salary’ vector, yielding the first N values of that vector given the order.
Could get the top-earning IDs by saying first(N, ID).
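A minimal Python sketch of the same two steps follows. The data is made up, and sorting in descending order is an assumption made here so that `first` returns the highest salaries; the slide does not state the direction of the order.

```python
def first(n, values):
    """First n values of a vector under its current order (non size-preserving)."""
    return values[:n]

# Hypothetical Employee(ID, salary) rows.
employees = [("e1", 50), ("e2", 90), ("e3", 70)]

# ASSUMING ORDER Salary (descending here, an assumption of this sketch).
ordered = sorted(employees, key=lambda e: e[1], reverse=True)

top_salaries = first(2, [salary for _, salary in ordered])
top_ids = first(2, [emp_id for emp_id, _ in ordered])
```

Because both columns are read from the same ordered arrable, `first(N, ID)` and `first(N, salary)` line up row by row.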

SQL + Order: Vector-to-Vector Functions
Order-dependent, size-preserving: prev, next, $, [], avgs(*), prds(*), sums(*), deltas(*), ratios(*), reverse, …
Order-dependent, non size-preserving: drop, first, last
Non order-dependent, size-preserving: rank, tile
Non order-dependent, non size-preserving: min, max, avg, count

Challenge: Given a table with securities’ price ticks, ticks(ID, date, price, timestamp), find the best possible profit if one bought and sold security ‘S’ in a given day. Note that max - min doesn’t work, because you must buy before you sell (no selling short).

Complex queries: Best spread
In a given day, what would be the maximum difference between a buying and selling point of each security?
Ticks(ID, price, tradeDate, timestamp, …)
SELECT ID, max(price - mins(price))
FROM Ticks ASSUMING ORDER timestamp
WHERE tradeDate = '99/99/99'
GROUP BY ID
Execution: 1. For each security, compute the running minimum vector for price and then subtract it from the price vector itself; the result is a vector of spreads. 2. Note that max - min would overstate the spread.
[Figure: price series marking max, min, running min, and best spread]

Best-profit query
SELECT max(price - mins(price))
FROM ticks ASSUMING ORDER timestamp
WHERE day = '12/09/2002' AND ID = 'S'
[Figure: the price vector and its running minimum, mins(price)]
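The query can be traced with a small Python sketch. The price series is invented for illustration and is assumed to be already filtered to one security and one day and ordered by timestamp; `mins` follows the running-minimum meaning used in the slides.

```python
def mins(values):
    """Running minimum: mins[i] is the smallest value in values[0..i]."""
    out = []
    current = None
    for v in values:
        current = v if current is None else min(current, v)
        out.append(current)
    return out

def best_profit(prices):
    """max(price - mins(price)): buy at the running minimum, sell later."""
    return max(p - m for p, m in zip(prices, mins(prices)))

# Hypothetical prices for one security on one day, in timestamp order.
prices = [10, 12, 7, 9, 5, 8]

profit = best_profit(prices)            # best buy-then-sell spread
naive = max(prices) - min(prices)       # ignores that the buy must come first
```

Here the naive figure pairs the high of 12 with a low of 5 that occurs later in the day, which is exactly the overstatement the slides warn about.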

Best-profit query comparison
[AQuery]
SELECT max(price - mins(price))
FROM ticks ASSUMING ORDER timestamp
WHERE ID = 'x' AND tradeDate = '99/99/99'
[SQL:1999]
SELECT max(rdif)
FROM (SELECT ID, tradeDate, price - min(price) OVER (PARTITION BY ID ORDER BY timestamp ROWS UNBOUNDED PRECEDING) AS rdif FROM Ticks) AS t1
WHERE ID = 'x' AND tradeDate = '99/99/99'

Complex queries: Crossing averages part I
When does the 21-day average cross the 5-month average?
Market(ID, closePrice, tradeDate, …)
TradedStocks(ID, Exchange, …)
INSERT INTO temp FROM
SELECT ID, tradeDate, avgs(21 days, closePrice) AS a21, avgs(5 months, closePrice) AS a5, prev(avgs(21 days, closePrice)) AS pa21, prev(avgs(5 months, closePrice)) AS pa5
FROM TradedStocks NATURAL JOIN Market ASSUMING ORDER tradeDate
GROUP BY ID

Complex queries: Crossing averages part I uses non-1NF
Execution: 1. FROM clause: order-preserving join. 2. GROUP BY clause: groups are defined based on the value of the ID column. 3. SELECT clause: functions are applied; non-grouped columns become vector fields so that target cardinality is met.
This violates first normal form.
[Figure: grouping on ID turns each non-grouped column into a vector field; the vector fields of a row have the same cardinality]

Complex queries: Crossing averages part II
Get the result from the resulting non first normal form relation temp.
SELECT ID, tradeDate
FROM flatten(temp)
WHERE a21 > a5 AND pa21 <= pa5
Execution: 1. FROM clause: flatten transforms temp into a first normal form relation (for row r, every vector field in r MUST have the same cardinality). Could have been placed at the end of the previous query. 2. Standard query processing after that.
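The crossing test for a single security can be sketched in Python. Several simplifications are assumptions of this sketch: windows are counted in rows rather than in days or months, early positions average over the shorter available prefix, and the prices are invented. Rows whose previous averages are missing are skipped, matching the way a comparison with null fails in SQL.

```python
def avgs(window, values):
    """Moving average over the last `window` rows ending at each position."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1):i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

def prev(values):
    """Shift a vector down by one position; the first slot is None."""
    return ([None] + values[:-1]) if values else []

def crossings(dates, prices, short, long):
    """Dates where the short average moves above the long average."""
    a_short, a_long = avgs(short, prices), avgs(long, prices)
    pa_short, pa_long = prev(a_short), prev(a_long)
    return [
        d
        for d, s, l, ps, pl in zip(dates, a_short, a_long, pa_short, pa_long)
        if ps is not None and s > l and ps <= pl   # a21 > a5 AND pa21 <= pa5
    ]

# Hypothetical closing prices for one ID, ordered by tradeDate.
dates = [1, 2, 3, 4, 5, 6, 7]
prices = [10, 9, 8, 7, 8, 10, 12]
found = crossings(dates, prices, short=2, long=4)
```

In the grouped query the same computation runs once per ID on that group's vectors, and flatten turns the per-group vectors back into rows before the WHERE clause is applied.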

SQL + Order Related Work: Research
SEQUIN – Seshadri et al.
– Sequences are first-class objects
– Difficult to mix tables and sequences.
SRQL – Ramakrishnan et al.
– Elegant algebra and language
– No work on transformations.
SQL-TS – Sadri et al.
– Language for finding patterns in sequence
– But: Not everything is a pattern!

SQL + Order Related Work: Products
RISQL – Red Brick
– Some vector-to-vector, order-dependent, size-preserving functions
– Low-hanging fruit approach to language design.
Analytic Functions – Oracle 9i
– Quite complete set of vector-to-vector functions
– But: Can only be used in the select clause; poor optimization (our preliminary study)
KSQL – Kx Systems
– Arrable extension to SQL but syntactically incompatible.
– No cost-based optimization.

Order-related optimization techniques
Starburst’s “glue” (Lohman, 88) and Exodus/Volcano “Enforcers” (Graefe and McKenna, 93)
DB2 Order optimization (Simmen et al., 96)
Top-k query optimization (Carey and Kossmann, 97)
Hash-based order-preserving join (Claussen et al., 01)
Temporal query optimization addressing order and duplicates (Slivinskas et al., 01)

Interchange sorting + order preserving operators
SELECT ts.ID, ts.Exchange, avgs(10, hq.ClosePrice)
FROM TradedStocks AS ts NATURAL JOIN HistoricQuotes AS hq ASSUMING ORDER hq.TradeDate
GROUP BY Id
(1) Sort then join preserving order
(2) Preserve existing order
(3) Join then sort before grouping
(4) Join then sort after grouping
[Plan diagrams: four plans combining sort, order-preserving (op) join, g-by, and avgs in the orders listed above]

Transformations Early sorting + order preserving operators

UDFs evaluation order
Gene(geneId, seq)
SELECT t1.geneId, t2.geneId, dist(t1.seq, t2.seq)
FROM Gene AS t1, Gene AS t2
WHERE dist(t1.seq, t2.seq) < 5 AND posA(t1.seq, t2.seq)
posA asks whether sequences have Nucleo A in the same position. dist gives the edit distance between two sequences.
Switch dynamically between (1) and (2) depending on the execution history.
[Plan diagrams (1), (2), (3): alternative evaluation orders of posA and dist]

Transformations UDFs Evaluation Order

Transformations: Building Blocks
Order optimization
– Simmen et al. '96: push-down sorts over joins, and combining and avoiding sorts
Order-preserving operators
– KSQL: joins on vector
– Claussen et al. '00: OP hash-based join
Push-down aggregating functions
– Chaudhuri and Shim '94, Yan and Larson '94: evaluate aggregation before joins
UDF evaluation
– Hellerstein and Stonebraker '93: evaluate UDF according to its ((output/input) - 1)/cost per tuple
– Porto et al. '00: take correlation into account
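The rank formula in the UDF evaluation item can be made concrete with a short Python sketch. The selectivities and per-tuple costs below are invented for the dist and posA predicates of the gene example, and applying predicates in ascending rank order is the usual reading of this metric; none of these numbers come from the slides.

```python
def rank(selectivity, cost_per_tuple):
    """((output/input) - 1) / cost per tuple; lower rank is applied first."""
    return (selectivity - 1.0) / cost_per_tuple

def expected_cost(predicates):
    """Expected cost per input tuple when predicates run in the given order."""
    total, surviving = 0.0, 1.0
    for _, selectivity, cost in predicates:
        total += surviving * cost      # only surviving tuples pay this cost
        surviving *= selectivity
    return total

# (name, selectivity, cost per tuple): hypothetical figures.
predicates = [("dist", 0.10, 50.0), ("posA", 0.50, 1.0)]

by_rank = sorted(predicates, key=lambda p: rank(p[1], p[2]))
order = [name for name, _, _ in by_rank]
```

With these figures the cheap posA test runs first even though dist is more selective, because dist costs fifty times more per tuple.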

Original Query
SELECT last(price)
FROM ticks t, base b ASSUMING ORDER name, timestamp
WHERE t.ID = b.ID AND name = 'x'
[Plan diagram: projection last(price), selection Name='x', sort on name, timestamp, and join on ID over base and ticks]

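Last price for a name query
The sort on name can be eliminated because there will be only one name.
Then, push sort:
– $\mathrm{sort}_A(r_1 \bowtie r_2) \equiv_A \mathrm{sort}_A(r_1) \bowtie_{lop} r_2$
– $\mathrm{sort}_A(r) \equiv_A r$
[Plan diagram: projection last(price), selection Name='x', join on ID over base and ticks, with the sort on name, timestamp pushed below the join]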

Last price for a name query
The projection is carrying an implicit selection: last(price) = price[n], where n is the last index of the price array.
– $\pi_{f(r.col[i])}(r) \equiv_{order(r)} \pi_{f(r.col)}(\sigma_{pos()=i}(r))$
[Plan diagram: projection price, selection pos()=last, selection Name='x', left order-preserving (lop) join on ID over base and ticks]

Last price for a name query
But why join the entire relation if we are only using the last tuple? Can we somehow push the last selection down the join?
[Plan diagram: projection price, selection Name='x', lop join on ID over base and ticks, with the selection pos()=last pushed toward ticks]

Last price for a name query
We can take the last position of each ID on ticks to reduce cardinality, but we need to group by ticks.ID first. But trading a join for a group by is usually a good deal?! One more step: make this an “edge by”.
[Plan diagram: projection price, selection Name='x', lop join on ID over base and ticks, with Gby ID and selection pos()=last applied to ticks below the join]

The “edgeby” operator
The pattern $\sigma_{\text{edge condition}}(\mathrm{Gby}(r))$ can be efficiently implemented without grouping the whole arrable.
An edgeby is a kind of scan that takes as parameters the grouping column(s), the direction to scan, a position, and whether that position is the last one (scan up to) or the first one of the group (scan from this on).
In the previous example, to find the last rows for each ID in ticks one would use, respectively, ‘ID’, ‘backwards’, ‘1’, and ‘up to’.
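A rough Python sketch of the backwards scan described above follows. The function name `edgeby_last`, its parameters, and the row layout are hypothetical; the sketch only covers the "last k rows of each group" case and is not the operator's actual implementation.

```python
def edgeby_last(rows, key, k=1):
    """Keep the last k rows of each group with a single backwards scan.

    No group is materialized: only a per-group counter is kept.
    Surviving rows are returned in their original relative order.
    """
    taken = {}
    kept = []
    for row in reversed(rows):           # direction: backwards
        group = row[key]
        count = taken.get(group, 0)
        if count < k:                    # position: up to k rows from the edge
            taken[group] = count + 1
            kept.append(row)
    kept.reverse()
    return kept

# Hypothetical ticks rows, already in timestamp order.
ticks = [
    {"ID": "a", "price": 1},
    {"ID": "b", "price": 2},
    {"ID": "a", "price": 3},
    {"ID": "b", "price": 4},
    {"ID": "a", "price": 5},
]
last_rows = edgeby_last(ticks, "ID", k=1)
```

The memory used grows with the number of distinct IDs rather than with the number of rows, which is the saving over grouping the whole arrable first.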

Conclusion Arrable-based approach to ordered databases may be scary – dependency on order, vector-to-vector functions – but it’s expressive and fast. SQL extension that includes order is possible and reasonably simple. Optimization possibilities are vast.

Streaming Aquery has no special facilities for streaming data, but it is expressive enough. Idea for streaming data is to split the tables into tables that are indexed with old data and a buffer table with recent data. Optimizer works over both transparently.

Arrables’ Shape Properties
The cardinality of an arrable’s array-columns must be the same.
An arrable’s tuple can contain single or array-valued column values.
– A 1NF arrable contains only single values
– A flattenable ¬1NF arrable may contain array-valued columns, but they have the same cardinality
– A non-flattenable ¬1NF arrable is one that is not in the previous categories

Arrables Primitives
Single-value indexing: r[k] retrieves the tuple formed by the k-th position value of each of the arrable’s array-columns.
Multi-value indexing: r[k1 k2 … kn] is similar to forming an arrable by insertion of r[k1], r[k2], …, r[kn].
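The two indexing primitives can be mimicked in Python by modelling an arrable as a dictionary of equal-length column lists. That representation, the helper names `index_one` and `index_many`, and the sample data are assumptions of this sketch.

```python
def index_one(arrable, k):
    """r[k]: the tuple formed by the k-th value of every array-column."""
    return {col: values[k] for col, values in arrable.items()}

def index_many(arrable, ks):
    """r[k1 k2 ... kn]: a new arrable built from r[k1], r[k2], ..., r[kn]."""
    return {col: [values[k] for k in ks] for col, values in arrable.items()}

# Hypothetical arrable with two array-columns of the same cardinality.
ticks = {"ID": ["a", "b", "a"], "price": [10, 20, 11]}

row = index_one(ticks, 1)
sub = index_many(ticks, [2, 0])
```

Multi-value indexing keeps the positions in the order they were requested, so it can reorder rows as well as select them.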

Array-columns Expressions
Expressions in AQuery are column-oriented, as opposed to row-oriented.
price > scalar: [x, y, z] > scalar = [T, F, T]
price - prev(price): [x, y, z] - [-, x, y] = [-, y-x, z-y]
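The two column-oriented expressions can be imitated in Python on plain lists. The helper names `greater_than` and `minus`, the use of `None` for the missing first value, and the sample prices are assumptions of this sketch.

```python
def prev(values):
    """Shift a column down by one position; the first slot has no predecessor."""
    return ([None] + values[:-1]) if values else []

def greater_than(values, scalar):
    """Column > scalar: one boolean per position."""
    return [v > scalar for v in values]

def minus(left, right):
    """Column - column, position by position; missing values stay missing."""
    return [None if a is None or b is None else a - b
            for a, b in zip(left, right)]

price = [10, 7, 12]

above = greater_than(price, 8)        # price > scalar
delta = minus(price, prev(price))     # price - prev(price)
```

Each expression consumes whole columns and produces a whole column of the same length, rather than being evaluated row by row.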

Built-in functions
$\mathrm{prev}_v[i] = \text{null}$ if $i = 0$, or $v[i-1]$ for $0 < i \le |v|-1$
$\mathrm{first}_{v,n}[i] = v[i]$ for $0 \le i < n$
$\mathrm{mins}_v[i] = v[i]$ if $i = 0$, or $\min(v[i], \mathrm{mins}_v[i-1])$ for $0 < i \le |v|-1$
$\mathrm{index}_{bv}[i] = i$ such that $bv[i]$ is true

Each modifier
Built-in functions are applied over array-columns. The each keyword makes the function application occur at the level of each tuple.

Group and flatten operations
An arrable can be grouped over a grouping expression that assigns a group value to each row. The net effect is to nest a 1NF arrable into a flattenable ¬1NF one.
The flatten operation undoes the grouping.
But there are data-ordering issues!
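The ordering issue can be seen in a tiny Python sketch. Representing rows as (key, value) pairs and the nested form as a dictionary of lists is an assumption made for brevity, as is the sample data.

```python
def group(rows):
    """Nest a 1NF list of (key, value) rows: one entry per key, values as a vector."""
    nested = {}
    for key, value in rows:
        nested.setdefault(key, []).append(value)   # order within a group is kept
    return nested

def flatten(nested):
    """Unnest back to 1NF rows, group after group."""
    return [(key, value) for key, values in nested.items() for value in values]

rows = [("a", 1), ("b", 2), ("a", 3)]

nested = group(rows)
round_trip = flatten(nested)
```

The round trip returns the same multiset of rows, but rows of the same group are now adjacent, so the original interleaving is lost unless the order is re-established afterwards.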

AQuery syntax and execution It’s just SQL, really. But column-oriented. SELECT FROM ASSUMING ORDER WHERE GROUP BY HAVING

FROM: Arrables mentioned in the FROM clause are combined using Cartesian product. This operation is order-oblivious.

ASSUMING ORDER: The ASSUMING ORDER is enforced. From this point on, any order-dependent expression can be used, regardless of the clause.

WHERE: The WHERE expression is executed. It must generate a boolean vector that maps each row of the current arrable to true or false. The false rows are eliminated and the resulting arrable is passed along.

GROUP BY: The GROUP BY expression is computed. It must map each tuple to a group. The resulting arrable has as many rows as there are groups. The columns that did not participate in the grouping expression now have array-valued values.

HAVING: The HAVING clause expression must result in a boolean vector that maps each group, i.e., row, to a boolean value. The rows that map to false are excluded.

SELECT: Finally, each expression in the SELECT clause is executed, generating a new array-column. The resulting arrable is achieved by “zipping” these arrays side by side.

Translating queries into Operations
SELECT ID, index(flags='FIN') - index(flags='ACK')
FROM netmgmt ASSUMING ORDER timestamp
GROUP BY ID
[Plan diagram: netmgmt, sort, Gby, and a projection applied with each; part of the plan is marked as the order-preserving portion]
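The plan for this query can be followed step by step in Python. The packet layout (timestamp, ID, flag), the sample data, and the function name `fin_minus_ack` are assumptions of this sketch, which also assumes that each session has matching numbers of FIN and ACK flags.

```python
def index(bools):
    """Positions at which a boolean vector is true."""
    return [i for i, b in enumerate(bools) if b]

def fin_minus_ack(packets):
    """index(flags='FIN') - index(flags='ACK'), evaluated once per ID."""
    ordered = sorted(packets, key=lambda p: p[0])     # sort: ASSUMING ORDER timestamp
    flags_by_id = {}
    for _, session, flag in ordered:                  # order-preserving Gby ID
        flags_by_id.setdefault(session, []).append(flag)
    result = {}
    for session, flags in flags_by_id.items():        # each: per-group evaluation
        fin = index([f == "FIN" for f in flags])
        ack = index([f == "ACK" for f in flags])
        result[session] = [f - a for f, a in zip(fin, ack)]
    return result

# Hypothetical netmgmt rows: (timestamp, ID, flags), arriving out of order.
packets = [
    (6, "s1", "FIN"), (1, "s1", "SYN"), (2, "s2", "SYN"), (3, "s1", "ACK"),
    (7, "s2", "FIN"), (4, "s2", "ACK"), (5, "s1", "DATA"),
]
gaps = fin_minus_ack(packets)
```

The positions returned by `index` are only meaningful because the grouping step keeps each session's packets in timestamp order, which is why the operators above the sort must preserve order.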

Non- and Order-preserving Variations
Operators may or may not be required to preserve order, depending on their location. For instance:
– A group-by based on hashing is an implementation of a non-order-preserving group-by
– A nested-loops join is an implementation of a left-order-preserving join
There are performance implications in preserving order or not.

Pick the best
Alternative 1: Early sort and order-preserving group-by.
Alternative 2: Early, non-order-preserving group-by, then sort.
[Plan diagrams: both plans apply Gby sessionID and sort timestamp over netmgmt, in opposite orders; the second uses each]

AQuery Optimization
Optimization is cost based.
The main strategies are:
– Define the extent of the order-preserving region of the plan, considering (correctness, obviously, and) the performance of the operator variations
– Apply efficient implementations of patterns of operators
There are generic techniques, yet each query category presents unique opportunities and challenges.
Equivalence rules between operators can be used in the process.

Order Equivalence
Let Order(r) return the attributes that the arrable r is “ordered by.”
Two arrables $r_1$ and $r_2$ are order-equivalent with respect to the list of attributes A, denoted $r_1 \equiv_A r_2$, if A is a prefix of Order($r_1$) and of Order($r_2$), and some permutation of $r_1$ is identical to $r_2$.
We denote by $r_1 \equiv_{\{\}} r_2$ the case where $r_1$ and $r_2$ are multiset-equivalent (i.e., they contain the same set of tuples and, for each tuple in the set, they contain the same number of instances of that tuple).

Sort Elimination
Often a sort can be eliminated or can be done over fewer fields:
– $\mathrm{sort}_A(r) \equiv_A r$ if A is a prefix of Order(r)
– $\mathrm{sort}_{AB}(r) \equiv_{AB} \mathrm{sort}'_A(r)$ if B is a prefix of Order(r) and sort' is stable
– $\mathrm{sort}_A(\mathrm{sort}_B(r)) \equiv_{AB} \mathrm{sort}_A(r)$ if $A \rightarrow B$ is a valid functional dependency
– $\mathrm{sort}_A(\mathrm{sort}_B(r)) \equiv_{AB} \mathrm{sort}_{A(B-A)}(r)$
– oper-path(sort(r)) $\equiv_{\{\}}$ oper-path(r) if oper-path has no order-dependent operation
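The first two rules can be checked on a small example in Python, whose built-in `sorted` is stable. The helper names and the sample rows are assumptions of this sketch; it illustrates the rules on one input and does not prove them.

```python
def sort_by(rows, cols):
    """Stable sort of dict rows on the listed columns."""
    return sorted(rows, key=lambda r: tuple(r[c] for c in cols))

def is_ordered_by(rows, cols):
    """True if the rows are already in non-decreasing order on cols."""
    keys = [tuple(r[c] for c in cols) for r in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))

def sort_if_needed(rows, cols):
    """Rule 1: skip the sort when the input is already ordered on cols."""
    return rows if is_ordered_by(rows, cols) else sort_by(rows, cols)

# Hypothetical arrable that is already ordered by B.
r = [
    {"A": 2, "B": 1},
    {"A": 1, "B": 1},
    {"A": 2, "B": 2},
    {"A": 1, "B": 3},
]

# Rule 2: a stable sort on A alone gives the same result as sorting on (A, B).
stable_on_a = sort_by(r, ["A"])
full_sort = sort_by(r, ["A", "B"])
```

Sorting on the shorter key is enough here because stability carries the existing order on B into every group of equal A values.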

Conclusion
Arrables offer a simple way to express order-based predicates and expressions.
Queries tend to be lucid and concise and, consequently, order idioms stand out.
Optimization naturally extends the known corpus of rules.
New possibilities for operator-implementation algorithms that take order into consideration.

Order preserving joins
select lineitem.orderid, avgs(10, lineitem.qty), lineitem.lineid
from order, lineitem assuming order lineid
where order.date > 45 and order.date < 55 and lineitem.orderid = order.orderid
Basic strategy 1: Restrict based on date. Create a hash on order. Run through lineitem, performing the join and pulling out the qty.
Basic strategy 2: Arrange for lineitem.orderid to be an index into order. Then restrict order based on date, giving a bit vector. The bit vector, indexed by lineitem.orderid, gives the relevant lineitem rows. The relevant order rows are then fetched using the surviving lineitem.orderid.
Strategy 2 is often 3-10 times faster.
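Both strategies can be sketched in Python on toy data. The table contents are invented, and storing order so that an orderid is simply a position in the date list is an assumption that stands in for "lineitem.orderid is an index into order". The sketch shows that the two strategies return the same rows in lineid order; it says nothing about their relative speed.

```python
# Hypothetical order table: position in the list is the orderid.
order_date = [40, 50, 60, 47]

# Hypothetical lineitem rows (lineid, orderid, qty), already ordered by lineid.
lineitem = [(0, 0, 5), (1, 1, 7), (2, 3, 2), (3, 1, 4), (4, 2, 9)]

def strategy_hash(order_date, lineitem):
    """Strategy 1: restrict order on date, hash it, probe while scanning lineitem."""
    qualifying = {oid for oid, d in enumerate(order_date) if 45 < d < 55}
    return [row for row in lineitem if row[1] in qualifying]

def strategy_bitvector(order_date, lineitem):
    """Strategy 2: one bit per order row, indexed directly by lineitem.orderid."""
    keep = [45 < d < 55 for d in order_date]          # bit vector over order
    return [row for row in lineitem if keep[row[1]]]  # positional lookup, no hashing

via_hash = strategy_hash(order_date, lineitem)
via_bits = strategy_bitvector(order_date, lineitem)
qty_in_order = [qty for _, _, qty in via_bits]        # input vector for avgs(10, qty)
```

Both scans walk lineitem once from start to end, so the surviving rows stay in lineid order and avgs can be applied without a further sort.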