Jingren Zhou, Per-Ake Larson, Ronnie Chaiken. ICDE 2010. Talk by S. Sudarshan, IIT Bombay. Some slides from the original talk by Zhou et al.



Incorporating partitioning & parallel plans into optimizer
The optimizer needs to reason about partitioning and its interaction with sorting and grouping.

SELECT R.a, S.c, COUNT(*) AS count
FROM R JOIN S ON R.a = S.a AND R.b = S.b
GROUP BY R.a, S.c

Plan 1: repartition R on (R.a, R.b) and S on (S.a, S.b); HashJoin on R.a = S.a & R.b = S.b; repartition the join result on (R.a, S.c); HashAgg on (R.a, S.c).
Plan 2: repartition R on R.a and S on S.a; HashJoin on R.a = S.a & R.b = S.b; HashAgg on (R.a, S.c), with no further repartitioning.

Plan 2 is valid because data partitioned on R.a is also partitioned on any superset of columns:
Partition (R.a) => Partition (R.a, R.b)
Partition (R.a) => Partition (R.a, S.c)

Incorporating partitioning & parallel plans into optimizer
Partitioning is a physical property, so the logical operator DAG in the Volcano optimizer remains unchanged.
In the physical DAG of the Volcano optimizer, single-machine plans considered only two physical properties: sorting and indexing.
To incorporate parallel plans, partitioning and grouping must also be added to the list of physical properties of each node in the physical operator DAG.

Partitioning scheme
A partitioning scheme takes one input stream and generates multiple output streams.
  Hash partitioning
  Range partitioning
  Non-deterministic (round-robin) partitioning
  Broadcasting
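The four schemes listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not the SCOPE implementation: the function names, the use of in-memory lists as "streams", and the `boundaries` parameter for range partitioning are all assumptions made for the example.

```python
from bisect import bisect_left

def hash_partition(rows, key, n):
    """Rows with equal key values always land in the same output stream."""
    out = [[] for _ in range(n)]
    for row in rows:
        out[hash(key(row)) % n].append(row)
    return out

def range_partition(rows, key, boundaries):
    """boundaries is a sorted list of upper bounds (hypothetical parameter).
    Stream i receives keys <= boundaries[i]; the last stream receives the rest."""
    out = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        out[bisect_left(boundaries, key(row))].append(row)
    return out

def round_robin_partition(rows, n):
    """Non-deterministic with respect to content: placement ignores row values."""
    out = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out

def broadcast(rows, n):
    """Every output stream receives a full copy of the input."""
    rows = list(rows)
    return [list(rows) for _ in range(n)]
```

Only hash and range partitioning give the optimizer a usable partitioning property on the key columns; round-robin spreads load but guarantees nothing about where equal keys end up.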

Merging schemes
A merging scheme combines data from the same bucket of multiple input streams into a single output stream.
  Random merge: randomly pulls data from the different input streams.
  Sort merge: if the input is sorted on some columns (not necessarily the partitioning columns), combine using a sort merge to preserve the sort order.
  Concat merge: concatenates multiple input streams into one.
  Sort-concat merge: concatenates the inputs in the order of their first rows.
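A minimal Python sketch of the four merging schemes follows, assuming each input stream is an in-memory list. The function names and the `seed` parameter are hypothetical and exist only to make the example reproducible.

```python
import heapq
import random
from itertools import chain

def random_merge(streams, seed=0):
    """Pull rows from the inputs in arbitrary order; no ordering is preserved."""
    rng = random.Random(seed)
    iters = [iter(s) for s in streams]
    out = []
    while iters:
        it = rng.choice(iters)
        try:
            out.append(next(it))
        except StopIteration:
            iters.remove(it)  # this input is exhausted
    return out

def sort_merge(streams, key=None):
    """Inputs must each be sorted on `key`; the output is sorted on `key`."""
    return list(heapq.merge(*streams, key=key))

def concat_merge(streams):
    """Append the inputs one after another, in the order given."""
    return list(chain.from_iterable(streams))

def sort_concat_merge(streams, key=lambda row: row):
    """Concatenate the inputs in the order of their first rows."""
    non_empty = [s for s in streams if s]
    ordered = sorted(non_empty, key=lambda s: key(s[0]))
    return list(chain.from_iterable(ordered))
```

Sort-concat merge yields a fully sorted output only when the inputs are sorted and cover non-overlapping ranges, which is the situation after range partitioning.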

Examples: to get Sort (A) & Partition (B)
  Option 1: sort each input on (A), then hash partition on (B), then sort merge each partition on (A).
  Option 2: hash partition on (B), random merge, then sort each partition on (A).
Range partitioning is handled similarly.
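The two options can be compared with a small Python sketch. It assumes rows are tuples, that `a` and `b` are column indexes, and that each element of `inputs` is the data held by one machine; these names are illustrative only. A plain concatenation stands in for the random merge in the second option, since the order of arrival does not matter before the final sort.

```python
import heapq

def hash_partition(rows, col, n):
    """Hash partition on one column, preserving input order within each output."""
    out = [[] for _ in range(n)]
    for row in rows:
        out[hash(row[col]) % n].append(row)
    return out

def sort_then_partition(inputs, a, b, n):
    """Sort each input on a, hash partition on b, sort merge each partition on a."""
    pieces = [hash_partition(sorted(inp, key=lambda r: r[a]), b, n) for inp in inputs]
    return [list(heapq.merge(*(p[i] for p in pieces), key=lambda r: r[a]))
            for i in range(n)]

def partition_then_sort(inputs, a, b, n):
    """Hash partition on b, merge in arbitrary order, sort each partition on a."""
    pieces = [hash_partition(inp, b, n) for inp in inputs]
    return [sorted((r for p in pieces for r in p[i]), key=lambda r: r[a])
            for i in range(n)]

inputs = [[(5, 1), (2, 2), (9, 1)], [(1, 2), (7, 1), (3, 3)]]
by_merge = sort_then_partition(inputs, a=0, b=1, n=2)
by_sort = partition_then_sort(inputs, a=0, b=1, n=2)
```

Both functions deliver partitions that are sorted on A and partitioned on B; which one is cheaper depends on data sizes, which is why the optimizer needs to cost both.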

Merge Schemes: Exchange topology
Initial partitioning, re-partitioning, full merge, partial repartitioning, partial merge


Inferring Functional Dependencies
Column equality constraints: a selection or join with a predicate Ri = Sk implies that the functional dependencies {Ri} → {Sk} and {Sk} → {Ri} hold in the result.
Constant constraints: after a selection with a predicate Ri = constant, all rows in the result have the same value for column Ri. This can be viewed as a functional dependency, denoted ∅ → Ri.
Grouping columns: after a group-by with grouping columns R, R is a key of the result and thus functionally determines all other columns in the result.
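As a rough illustration, the three sources of functional dependencies above can be encoded as (left side, right side) pairs of column sets and combined with the standard attribute-closure computation. The helper names below are hypothetical and are not taken from the talk.

```python
def fds_from_equality(x, y):
    """Predicate x = y gives {x} -> {y} and {y} -> {x}."""
    return [({x}, {y}), ({y}, {x})]

def fd_from_constant(x):
    """Predicate x = constant gives the dependency (empty set) -> {x}."""
    return [(set(), {x})]

def fds_from_group_by(grouping_cols, all_cols):
    """Grouping columns form a key, so they determine every other column."""
    return [(set(grouping_cols), set(all_cols) - set(grouping_cols))]

def closure(columns, fds):
    """All columns functionally determined by `columns` under `fds`."""
    result = set(columns)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

fds = fds_from_equality("R.a", "S.a") + fd_from_constant("R.b")
determined = closure({"R.a"}, fds)  # contains R.a, S.a and R.b
```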

Structural properties
Grouping: a sequence of rows is said to be grouped on a set of columns C = {C1, C2, …, Cn} if rows with the same values for these columns are adjacent. It is denoted by C^g.
Sorting: a sequence of rows sorted on a list of columns C is denoted by C^o.
Partitioning: a relation R is said to be partitioned on a set of columns C = {C1, C2, …, Cn} if rows with the same value of C belong to the same partition (they need not be grouped together on C within that partition).
  Non-ordered partitioning: hash
  Ordered partitioning: range
Note: enforcer operators must be added for all physical properties.

Structural properties
The structural property of each node in the DAG can be represented as a list of global and local structural properties.
Global structural properties apply to the whole relation, e.g. partitioning.
Local structural properties, such as grouping and sorting, apply within each partition.
General form: {P^g; {A1, A2, …, An}}

Example: {{C1}^g, {{C1, C2}^g, C3^o}}, i.e. the rows are partitioned on C1 (non-ordered); within each partition they are grouped on (C1, C2), and within each group they are sorted on C3.

Partition 1   Partition 2   Partition 3
{1,4,2}       {4,1,5}       {6,2,1}
{1,4,5}       {3,7,8}       {6,2,9}
{7,1,2}       {3,7,9}
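The example partitions on this slide can be checked mechanically. The sketch below assumes rows are tuples (C1, C2, C3), represents a local property list as pairs of (column indexes, kind) with kind "g" for grouped and "o" for sorted, and uses hypothetical function names; it is one possible reading of the notation, not code from the talk.

```python
from itertools import groupby

def satisfies_global(partitions, cols):
    """Partitioned on cols: each key value appears in exactly one partition."""
    owner = {}
    for i, part in enumerate(partitions):
        for row in part:
            k = tuple(row[c] for c in cols)
            if owner.setdefault(k, i) != i:
                return False
    return True

def satisfies_local(rows, props):
    """props is an ordered list; each entry refines the runs of the previous one."""
    if not props:
        return True
    (cols, kind), rest = props[0], props[1:]
    key = lambda r: tuple(r[c] for c in cols)
    keys = [key(r) for r in rows]
    if kind == "o" and any(x > y for x, y in zip(keys, keys[1:])):
        return False  # not sorted on cols
    runs = [(k, list(g)) for k, g in groupby(rows, key=key)]
    if len({k for k, _ in runs}) != len(runs):
        return False  # the same key shows up in two separate runs
    return all(satisfies_local(run, rest) for _, run in runs)

partitions = [
    [(1, 4, 2), (1, 4, 5), (7, 1, 2)],
    [(4, 1, 5), (3, 7, 8), (3, 7, 9)],
    [(6, 2, 1), (6, 2, 9)],
]
local = [((0, 1), "g"), ((2,), "o")]  # {C1, C2}^g followed by C3^o
ok = satisfies_global(partitions, (0,)) and all(
    satisfies_local(p, local) for p in partitions)
```

Keeping the global and local parts separate mirrors how the properties are matched: partitioning is checked across the whole relation, grouping and sorting one partition at a time.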

Inference rules
Partition (A) => Partition (A, B)
Sort (A, B) => Sort (A)
Sort (A) => Grouped (A)
Because of these inference rules, generating all possible rewritings requires considering all possible required physical properties.
Example: a parallel join on (A, B, C) can be satisfied by Partition (A, B, C), Partition (A, B), Partition (A, C), Partition (B, C), Partition (A), Partition (B), or Partition (C).
So the number of possible rewritings grows as 2^|C| for a column set C (one per non-empty subset of C).
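The enumeration in the example, and the first two inference rules, can be sketched in Python as follows. The function names are hypothetical, and requiring a non-empty delivered column set is a simplifying assumption of this sketch.

```python
from itertools import combinations

def candidate_partitionings(columns):
    """Every non-empty subset of the join columns is an acceptable partitioning."""
    cols = sorted(columns)
    return [frozenset(c)
            for r in range(len(cols), 0, -1)
            for c in combinations(cols, r)]

def partition_satisfies(delivered, required):
    """Partition (X) => Partition (Y) whenever X is a non-empty subset of Y."""
    return bool(delivered) and set(delivered) <= set(required)

def sort_satisfies(delivered, required):
    """Sort (A, B) => Sort (A): the requirement must be a prefix of what is delivered."""
    return list(delivered[:len(required)]) == list(required)

candidates = candidate_partitionings("ABC")  # 7 subsets for 3 columns
```

With three join columns there are seven candidates; each extra column doubles the count, which is why a practical optimizer cannot afford to expand every one of them naively.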

Example
SELECT R.a, S.c, COUNT(*) AS count
FROM R JOIN S ON R.a = S.a AND R.b = S.b
GROUP BY R.a, S.c

Logical DAG: Agg (R.a, S.c) over Join (R.a = S.a & R.b = S.b) over R and S.
Physical DAG: the join is considered with its inputs partitioned on (A) or on (A, B); the aggregate is considered with its input partitioned on (A), (C), or (A, C). Each repartitioning step is assumed to have a fixed cost, and the diagram annotates the alternatives with costs of 10 and 20.
Resulting plan: HashAgg (R.a, S.c) over HashJoin (R.a = S.a & R.b = S.b), with R repartitioned on R.a and S repartitioned on S.a.

Structural Properties: Notation

Inference Rules

Deriving Structural Properties

Structural Properties after Merge

Properties after repartitioning

Required Properties: Example

Required Properties

Required Properties for Operators

Property Matching
Matching of structural properties can be done by matching global and local properties separately.
Normalization: in each partitioning, sorting, and grouping property, and in each functional dependency, replace each column with the representative column of its equivalence class. Then, in each partitioning, sorting, and grouping property, remove columns that are functionally determined by other columns.
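The normalization step can be sketched for set-valued properties (partitioning and grouping columns) as below. This is an assumption-laden illustration: `representative` is a hypothetical mapping from each column to the representative of its equivalence class, functional dependencies are (left side, right side) pairs of column sets, and sorting properties, where column order matters, are not handled.

```python
def closure(columns, fds):
    """All columns functionally determined by `columns` under `fds`."""
    result = set(columns)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def normalize(columns, representative, fds):
    """Rewrite columns to representatives, then drop determined columns."""
    rep = lambda c: representative.get(c, c)
    cols = {rep(c) for c in columns}
    norm_fds = [({rep(c) for c in lhs}, {rep(c) for c in rhs}) for lhs, rhs in fds]
    for c in sorted(cols):  # fixed order keeps the result deterministic
        others = cols - {c}
        if c in closure(others, norm_fds):
            cols = others  # c adds nothing beyond the remaining columns
    return cols

# R.a = S.a puts both columns in one equivalence class; R.b = constant gives {} -> R.b.
normalized = normalize({"R.a", "S.a", "R.b"},
                       {"S.a": "R.a"},
                       [(set(), {"R.b"})])
```

After normalization, two properties that look different on paper, such as Partition (R.a, R.b) and Partition (S.a), reduce to the same column set and can be compared directly.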

Enforcer Rules
For each logical operator, consider both non-partitioned and partitioned implementations, as long as they can satisfy their requirements.
Rely on a series of enforcer rules to modify the requirements for structural properties, e.g. from non-partitioned to partitioned, or from sorted to non-sorted.
Data exchange operators are enforcers of structural properties.

Enforce Data Exchange Algorithm

Example plans

Conclusions
SCOPE: a new scripting language for large-scale analysis
  Strong resemblance to SQL: easy to learn and to port existing applications
  High-level declarative language
  Implementation details (including parallelism and system complexity) are transparent to users
  Allows sophisticated optimization
Future work
  Multi-query optimization (parallel properties increase the optimization opportunities)
  Columnar storage and more efficient data placement


TPC-H Query 2

// Extract region, nation, supplier, partsupp, part …
RNS_JOIN =
    SELECT s_suppkey, n_name
    FROM region, nation, supplier
    WHERE r_regionkey == n_regionkey
      AND n_nationkey == s_nationkey;

RNSPS_JOIN =
    SELECT p_partkey, ps_supplycost, ps_suppkey, p_mfgr, n_name
    FROM part, partsupp, rns_join
    WHERE p_partkey == ps_partkey
      AND s_suppkey == ps_suppkey;

SUBQ =
    SELECT p_partkey AS subq_partkey, MIN(ps_supplycost) AS min_cost
    FROM rnsps_join
    GROUP BY p_partkey;

RESULT =
    SELECT s_acctbal, s_name, p_partkey, p_mfgr, s_address, s_phone, s_comment
    FROM rnsps_join AS lo, subq AS sq, supplier AS s
    WHERE lo.p_partkey == sq.subq_partkey
      AND lo.ps_supplycost == min_cost
      AND lo.ps_suppkey == s.s_suppkey
    ORDER BY acctbal DESC, n_name, s_name, partkey;

OUTPUT RESULT TO "tpchQ2.tbl";

Sub Execution Plan to TPCH Q2
1. Join on suppkey
2. Partially aggregate at the rack level
3. Partition on group-by column
4. Fully aggregate
5. Partition on partkey
6. Merge corresponding partitions
7. Partition on partkey
8. Merge corresponding partitions
9. Perform join

A Real Example