Lecture 17: Distributed Transactions

Slides:



Advertisements
Similar presentations
Unit 1:Parallel Databases
Advertisements

Parallel Databases By Dr.S.Sridhar, Ph.D.(JNUD), RACI(Paris, NICE), RMR(USA), RZFM(Germany) DIRECTOR ARUNAI ENGINEERING COLLEGE TIRUVANNAMALAI.
Query Execution Optimizing Performance. Resolving an SQL query Since our SQL queries are very high level, the query processor must do a lot of additional.
Oct 28, 2003Murali Mani Relational Algebra B term 2004: lecture 10, 11.
Lecture 24: Query Execution Monday, November 20, 2000.
Relational Algebra on Bags A bag is like a set, but an element may appear more than once. –Multiset is another name for “bag.” Example: {1,2,1,3} is a.
1 Distributed Databases CS347 Lecture 14 May 30, 2001.
VLDB Revisiting Pipelined Parallelism in Multi-Join Query Processing Bin Liu and Elke A. Rundensteiner Worcester Polytechnic Institute
Chapter 4 Parallel Sort and GroupBy 4.1Sorting, Duplicate Removal and Aggregate 4.2Serial External Sorting Method 4.3Algorithms for Parallel External Sort.
Homework 2 In the docs folder of your Berkeley DB, have a careful look at documentation on how to configure BDB in main memory. In the docs folder of your.
1 External Sorting for Query Processing Yanlei Diao UMass Amherst Feb 27, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
CS 347Notes 031 CS 347: Distributed Databases and Transaction Processing Notes03: Query Processing Hector Garcia-Molina.
1 Relational Operators. 2 Outline Logical/physical operators Cost parameters and sorting One-pass algorithms Nested-loop joins Two-pass algorithms.
Computer Science 101 Web Access to Databases SQL – Extended Form.
Physical Database Design & Performance. Optimizing for Query Performance For DBs with high retrieval traffic as compared to maintenance traffic, optimizing.
Database Management 9. course. Execution of queries.
Chapter 21: Parallel Databases Chapter 21: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation.
Querying Large Databases Rukmini Kaushik. Purpose Research for efficient algorithms and software architectures of query engines.
CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations: Other Operations Chapter 14 Ramakrishnan & Gehrke (Sections ; )
Relational Operator Evaluation. Overview Index Nested Loops Join If there is an index on the join column of one relation (say S), can make it the inner.
Ahsan Abdullah 1 Data Warehousing Lecture-9 Issues of De-normalization Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
Parallel Database Systems Instructor: Dr. Yingshu Li Student: Chunyu Ai.
Lecture 15- Parallel Databases (continued) Advanced Databases Masood Niazi Torshiz Islamic Azad University- Mashhad Branch
CS4432: Database Systems II Query Processing- Part 2.
1 Chapter 13 Parallel SQL. 2 Understanding Parallel SQL Enables a SQL statement to be: – Split into multiple threads – Each thread processed simultaneously.
©Silberschatz, Korth and Sudarshan18.1Database System Concepts - 6 th Edition Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism.
CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations – Join Chapter 14 Ramakrishnan and Gehrke (Section 14.4)
Computing & Information Sciences Kansas State University Monday, 03 Nov 2008CIS 560: Database System Concepts Lecture 27 of 42 Monday, 03 November 2008.
©Silberschatz, Korth and Sudarshan20.1Database System Concepts 3 rd Edition Chapter 20: Parallel Databases Introduction I/O Parallelism Interquery Parallelism.
Lecture 17: Query Execution Tuesday, February 28, 2001.
1 Ch.19 Divide and Conquer. 2 BIRD’S-EYE VIEW Divide and conquer algorithms Decompose a problem instance into several smaller independent instances May.
Query Execution Query compiler Execution engine Index/record mgr. Buffer manager Storage manager storage User/ Application Query update Query execution.
1 Lecture 23: Query Execution Monday, November 26, 2001.
1 CS4402 – Parallel Computing Lecture 7 - Simple Parallel Sorting. - Parallel Merge Sort.
CPT-S Advanced Databases 11 Yinghui Wu EME 49.
Plan for Final Lecture What you may expect to be asked in the Exam?
Chapter # 6 The Relational Algebra and Calculus
Parallel Databases.
Chapter 18: Parallel Databases
Scaling SQL with different approaches
Interquery Parallelism
CS222P: Principles of Data Management Lecture #15 Query Optimization (System-R) Instructor: Chen Li.
Evaluation of Relational Operations
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Database Management Systems (CS 564)
Evaluation of Relational Operations: Other Operations
CS222: Principles of Data Management Notes #13 Set operations, Aggregation Instructor: Chen Li.
Implementation of Relational Operations (Part 2)
On Spatial Joins in MapReduce
Relational Operations
(Two-Pass Algorithms)
Chapters 15 and 16b: Query Optimization
Lecture 2- Query Processing (continued)
Chapter 20: Parallel Databases
Distributed Database Management Systems
Chapter 21: Parallel Databases
Parallel DBMS Chapter 22, Sections 22.1–22.6
Chapter 18: Parallel Databases
Query Functions.
CS222P: Principles of Data Management Notes #13 Set operations, Aggregation, Query Plans Instructor: Chen Li.
Introduction to Spark.
Chapter 11: Indexing and Hashing
Evaluation of Relational Operations: Other Techniques
Wednesday, 5/8/2002 Hash table indexes, physical operators
Lecture 11: B+ Trees and Query Execution
CS222: Principles of Data Management Lecture #15 Query Optimization (System-R) Instructor: Chen Li.
The Gamma Database Machine Project
Evaluation of Relational Operations: Other Techniques
Chapter 21: Parallel Databases
Parallel DBMS DBMS Textbook Chapter 22
Presentation transcript:

Lecture 17: Distributed Transactions 11/8/2017

Recap: Parallel Query Processing Three main ways to parallelize 1. Run multiple queries, each on a different thread 2. Run operators in different threads (“pipeline”) A filter sort Processor 1 Processor 2 3. Partition data, process each partition in a different processor A1 filter sort Processor 2 merge filter sort Processor 1 A2 Runs on 1 of the processors

Partitioning Strategies Random / Round Robin Evenly distributes data (no skew) Requires us to repartition for joins Range partitioning Allows us to perform joins without repartitioning, when tables are partitioned on join attributes Subject to skew Hash partitioning Only subject to skew when there are many duplicate values

Join Strategies If tables are partitioned properly, just run local joins Also, if one table is replicated, no need to join Otherwise, several options: Collect all tables at one node Inferior except in extreme cases, i.e., very small tables Re-partition one or both tables Depending on initial partitioning Replicate (smaller) table on all nodes

Proper Partitioning Example SELECT * FROM A,B WHERE A.a = B.b Suppose we have hashed A on a, using hash function F to get F(A.a)  1..n (n = # machines) Also hash B on b using same F merge A1 B1 ⨝ A2 B2 ⨝ An Bn ⨝ … Processor 1 Processor 2 Processor n

Repartitioning Example Suppose A pre-partitioned on a, but B needs to be repartitioned split split split A1 B1 A2 B2 An Bn Processor 1 Processor 2 Processor n

Repartitioning Example Suppose A pre-partitioned on a, but B needs to be repartitioned merge Generalizes to the case of repartitioning both tables ⨝ A1 B1 B1 B2 Bn A2 B2 An Bn Processor 1 Processor 2 Processor n split split split A1 B1 A2 B2 An Bn Processor 1 Processor 2 Processor n

Repartitioning Example Suppose A pre-partitioned on a, but B needs to be repartitioned merge Each node sends and receives |B|/n * (n-1)/n bytes ⨝ A1 B1 B2 Bn B1 A2 B2 An Bn Processor 1 Processor 2 Processor n |B|/n/n split split split |B|/n A1 B1 A2 B2 An Bn Processor 1 Processor 2 Processor n

Replication vs Repartioning Replication requires each node to send smaller table to all other nodes |T| / n * n-1 bytes sent by each node When would replication be preferred over repartitioning for joins? If size of smaller table < data sent to repartition one or both tables Should also account for cost of join, since will be higher with replicated table E.g., |B| = 1 MB, |A|=100 MB, n=3, need to repartition A Data to repartition A is |A|/3 * 2/3 = 22.22 MB per node Join .33 MB to 33 MB Data to broadcast B is |B| = 1/3 * 2 = .66 MB Join 1 MB to 33 MB

Aggregation filter agg Processor 2 A1 merge filter agg Processor 1 A2 Runs on 1 of the processors In general, each node will have data for the same groups So merge will need to combine groups, e.g.: MAX (MAX1, MAX2) SUM (SUM1, SUM2) What about average? Maintain SUMs and COUNTs, combine in merge step

Generalized Parallel Aggregates Express aggregates as 3 functions: INIT – create partial aggregate value MERGE –  combine 2 partial aggregates FINAL – compute final aggregate E.g., AVG: INIT(tuple)  (SUM=tuple.value, COUNT=1) MERGE (a1, a2)  (SUM=a1.SUM + a2.SUM, COUNT=a1.count+a2.count) FINAL(a)  a.SUM/a.COUNT

What does MERGE do? For aggregate queries, receives partial aggregates from each processor, MERGEs and FINALizes them For non-aggregates, just UNIONs results