Parallel Databases COMP3017 Advanced Databases Dr Nicholas Gibbins - 2012-2013.

Slides:



Advertisements
Similar presentations
Distributed Databases COMP3017 Advanced Databases Dr Nicholas Gibbins –
Advertisements

Distributed databases
CMPT 401 Summer 2007 Dr. Alexandra Fedorova Lecture X: Transactions.
Transaction.
MIS 385/MBA 664 Systems Implementation with DBMS/ Database Management Dave Salisbury ( )
Chapter 13 (Web): Distributed Databases
ICOM 6005 – Database Management Systems Design Dr. Manuel Rodríguez-Martínez Electrical and Computer Engineering Department Lecture 16 – Intro. to Transactions.
CMPT Dr. Alexandra Fedorova Lecture X: Transactions.
Transaction Processing Lecture ACID 2 phase commit.
Distributed Databases Logical next step in geographically dispersed organisations goal is to provide location transparency starting point = a set of decentralised.
Transaction Management and Concurrency Control
Overview Distributed vs. decentralized Why distributed databases
Chapter 8 : Transaction Management. u Function and importance of transactions. u Properties of transactions. u Concurrency Control – Meaning of serializability.
Distributed DBMSPage © 1998 M. Tamer Özsu & Patrick Valduriez Outline Introduction Background Distributed DBMS Architecture Distributed Database.
Chapter 18: Distributed Coordination (Chapter 18.1 – 18.5)
©Silberschatz, Korth and Sudarshan19.1Database System Concepts Distributed Transactions Transaction may access data at several sites. Each site has a local.
Definition of terms Definition of terms Explain business conditions driving distributed databases Explain business conditions driving distributed databases.
1 ICS 214B: Transaction Processing and Distributed Data Management Distributed Database Systems.
Database Systems: Design, Implementation, and Management Eighth Edition Chapter 10 Transaction Management and Concurrency Control.
Transaction Management WXES 2103 Database. Content What is transaction Transaction properties Transaction management with SQL Transaction log DBMS Transaction.
Transaction. A transaction is an event which occurs on the database. Generally a transaction reads a value from the database or writes a value to the.
Distributed Databases
1 Distributed and Parallel Databases. 2 Distributed Databases Distributed Systems goal: –to offer local DB autonomy at geographically distributed locations.
III. Current Trends: 2 - Distributed DBMSsSlide 1/47 III. Current Trends Distributed DBMSs: Advanced Concepts 3C13/D63C13/D6.
04/18/2005Yan Huang - CSCI5330 Database Implementation – Distributed Database Systems Distributed Database Systems.
Recovery & Concurrency Control. What is a Transaction?  A transaction is a logical unit of work that must be either entirely completed or aborted. 
DISTRIBUTED DATABASE SYSTEM.  A distributed database system consists of loosely coupled sites that share no physical component  Database systems that.
Database Management System Module 5 DeSiaMorewww.desiamore.com/ifm1.
Data Warehousing 1 Lecture-24 Need for Speed: Parallelism Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
1 Moshe Shadmon ScaleDB Scaling MySQL in the Cloud.
BIS Database Systems School of Management, Business Information Systems, Assumption University A.Thanop Somprasong Chapter # 10 Transaction Management.
Chapterb19 Transaction Management Transaction: An action, or series of actions, carried out by a single user or application program, which reads or updates.
Database Systems: Design, Implementation, and Management Tenth Edition Chapter 12 Distributed Database Management Systems.
Querying Large Databases Rukmini Kaushik. Purpose Research for efficient algorithms and software architectures of query engines.
Parallel Databases COMP3211 Advanced Databases Dr Nicholas Gibbins
PAVANI REDDY KATHURI TRANSACTION COMMUNICATION. OUTLINE 0 P ART I : I NTRODUCTION 0 P ART II : C URRENT R ESEARCH 0 P ART III : F UTURE P OTENTIAL 0 R.
1 CS 430 Database Theory Winter 2005 Lecture 16: Inside a DBMS.
Operating Systems Distributed Coordination. Topics –Event Ordering –Mutual Exclusion –Atomicity –Concurrency Control Topics –Event Ordering –Mutual Exclusion.
Concurrency Control in Database Operating Systems.
1 Concurrency Control II: Locking and Isolation Levels.
Distributed DBMSs- Concept and Design Jing Luo CS 157B Dr. Lee Fall, 2003.
Databases Illuminated
INTRODUCTION TO DBS Database: a collection of data describing the activities of one or more related organizations DBMS: software designed to assist in.
The Relational Model1 Transaction Processing Units of Work.
XA Transactions.
MBA 664 Database Management Systems Dave Salisbury ( )
Introduction.  Administration  Simple DBMS  CMPT 454 Topics John Edgar2.
Coupling Facility. The S/390 Coupling Facility (CF), the key component of the Parallel Sysplex cluster, enables multisystem coordination and datasharing.
Transaction Management Overview. Transactions Concurrent execution of user programs is essential for good DBMS performance. – Because disk accesses are.
Infrastructure for Data Warehouses. Basics Of Data Access Data Store Machine Memory Buffer Memory Cache Data Store Buffer Bus Structure.
Transaction Processing Concepts
Introduction to Distributed Databases Yiwei Wu. Introduction A distributed database is a database in which portions of the database are stored on multiple.
 Distributed Database Concepts  Parallel Vs Distributed Technology  Advantages  Additional Functions  Distribution Database Design  Data Fragmentation.
CS 540 Database Management Systems
Multidatabase Transaction Management COP5711. Multidatabase Transaction Management Outline Review - Transaction Processing Multidatabase Transaction Management.
Topics in Distributed Databases Database System Implementation CSE 507 Some slides adapted from Navathe et. Al and Silberchatz et. Al.
CMS Advanced Database and Client-Server Applications Distributed Databases slides by Martin Beer and Paul Crowther Connolly and Begg Chapter 22.
Distributed Databases
Distributed Databases
SYSTEMS IMPLEMENTATION TECHNIQUES TRANSACTION PROCESSING DATABASE RECOVERY DATABASE SECURITY CONCURRENCY CONTROL.
Distributed Databases – Advanced Concepts Chapter 25 in Textbook.
Database System Implementation CSE 507
Commit Protocols CS60002: Distributed Systems
Chapter 10 Transaction Management and Concurrency Control
Distributed Databases
Introduction of Week 13 Return assignment 11-1 and 3-1-5
DISTRIBUTED DATABASES
Distributed Databases Recovery
The Gamma Database Machine Project
Transactions, Properties of Transactions
Presentation transcript:

Parallel Databases COMP3017 Advanced Databases Dr Nicholas Gibbins

Definitions 2 Parallelism –An arrangement or state that permits several operations or tasks to be performed simultaneously rather than consecutively Parallel Databases –have the ability to split –processing of data –access to data –across multiple processors

Why Parallel Databases 3 Hardware trends Reduced elapsed time for queries Increased transaction throughput Increased scalability Better price/performance Improved application availability Access to more data In short, for better performance

Tightly coupled Symmetric Multiprocessor (SMP) P = processor M = memory Shared Memory Architecture 4 P Global Memory PP

Less complex database software Limited scalability Single buffer Single database storage Software – Shared Memory 5 P Global Memory PP

Loosely coupled Distributed Memory Shared Disc Architecture 6 PPP MMM S

Avoids memory bottleneck Same page may be in more than one buffer at once – can lead to incoherence Needs global locking mechanism Single logical database storage Each processor has its own database buffer Software – Shared Disc 7 PPP MMM S

Massively Parallel Loosely Coupled High Speed Interconnect (between processors) Shared Nothing Architecture 8 PPP MMM

One page is only in one local buffer – no buffer incoherence Needs distributed deadlock detection Needs multiphase commit protocol Needs to break SQL requests into multiple sub-requests Each processor has its own database buffer Each processor owns part of the data Software - Shared Nothing 9 PPP MMM

Hardware vs. Software Architecture 10 It is possible to use one software strategy on a different hardware arrangement Also possible to simulate one hardware configuration on another –Virtual Shared Disk (VSD) makes an IBM SP shared nothing system look like a shared disc setup (for Oracle) From this point on, we deal only with shared nothing software mode

Shared Nothing Challenges 11 Partitioning the data Keeping the partitioned data balanced Splitting up queries to get the work done Avoiding distributed deadlock Concurrency control Dealing with node failure

Hash Partitioning 12 Table

Range Partitioning 13 A-H I-P Q-Z

Schema Partitioning 14 Table 1 Table 2

Rebalancing Data 15 Data in proper balance Data grows, performance drops Add new nodes and disc Redistribute data to new nodes

Dividing up the Work 16 Application Coordinator Process Worker Process

DB Software on each node 17 App1 DBMS W1W2 C1 DBMS W1W2 App2 DBMS W1W2 C2

Transaction Parallelism 18 Improves throughput Inter-Transaction –Run many transactions in parallel, sharing the DB –Often referred to as Concurrency

Query Parallelism 19 Improves response times (lower latency) Intra-operator (horizontal) parallelism –Operators decomposed into independent operator instances, which perform the same operation on different subsets of data Inter-operator (vertical) parallelism –Operations are overlapped –Pipeline data from one stage to the next without materialisation

Intra-Operator Parallelism 20 SQL Query Subset Queries Processor

Intra-Operator Parallelism 21 Example query: –SELECT c1,c2 FROM t1 WHERE c1>5.5 Assumptions: –100,000 rows –Predicates eliminate 90% of the rows

I/O Shipping 22 Coordinator and Worker Network Worker 25,000 rows 10,000 (c1,c2)

Function Shipping 23 Coordinator Network Worker 2,500 rows 10,000 (c1,c2)

Function Shipping Benefits 24 Database operations are performed where the data are, as far as possible Network traffic in minimised For basic database operators, code developed for serial implementations can be reused In practice, mixture of Function Shipping and I/O Shipping has to be employed

Inter-Operator Parallelism 25 time ScanJoinSort Scan Join Sort

Resulting Parallelism 26 time ScanJoinSort Scan Join Sort Scan Join Sort

Basic Operators 27 The DBMS has a set of basic operations which it can perform The result of each operation is another table –Scan a table – sequentially or selectively via an index – to produce another smaller table –Join two tables together –Sort a table –Perform aggregation functions (sum, average, count) These are combined to execute a query

The Volcano Architecture 28 The Volcano Query Processing System was devised by Goetze Graefe (Oregon) Introduced the Exchange operator –Inserted between the steps of a query to: –Pipeline results –Direct streams of data to the next step(s), redistributing as necessary Provides mechanism to support both vertical and horizontal parallelism Informix ‘Dynamic Scalable Architecture’ based on Volcano

Exchange Operators 29 Example query: –SELECT county, SUM(order_item) FROM customer, order WHERE order.customer_id=customer_id GROUP BY county ORDER BY SUM(order_item)

Exchange Operators 30 SORT GROUP HASH JOIN SCAN CustomerOrder

Exchange Operators 31 EXCHANGE SCAN Customer HASH JOIN

Exchange Operators 32 EXCHANGE SCAN Customer HASH JOIN EXCHANGE SCAN Order

EXCHANGE SCAN Customer HASH JOIN EXCHANGE SCAN Order EXCHANGE GROUP SORT 33

Query Processing

Some Parallel Queries 35 Enquiry Collocated Join Directed Join Broadcast Join Repartitioned Join

Orders Database 36 CUSTKEYC_NAME…C_NATION… ORDERKEYDATE…CUSTKEY… SUPPKEYS_NAME…S_NATION… SUPPKEY… CUSTOMER ORDER SUPPLIER

Enquiry/Query 37 “How many customers live in the UK?” SCAN COUNT COORDINATOR SLAVE TASKS SUM Multiple partitions of customer table Return subcounts to coordinator Return to applicationData from application

Collocated Join 38 “Which customers placed orders in July?” SCAN UNION ORDER Tables both partitioned on CUSTKEY (the same key) and therefore corresponding entries are on the same node Requires a JOIN of CUSTOMER and ORDER SCAN JOIN CUSTOMER

Directed Join 39 “Which customers placed orders in July?” (tables have different keys) CUSTOMER SCAN ORDER SCAN JOIN Slave Task 1Slave Task 2 Coordinator Return to application ORDER partitioned on ORDERKEY, CUSTOMER partitioned on CUSTKEY Retrieve rows from ORDER, then use ORDER.CUSTKEY to direct appropriate rows to nodes with CUSTOMER

Broadcast Join 40 “Which customers and suppliers are in the same country?” CUSTOMER SCAN SUPPLIER SCAN JOIN Slave Task 1Slave Task 2 Coordinator Return to application SUPPLIER partitioned on SUPPKEY, CUSTOMER on CUSTKEY. Join required on *_NATION Send all SUPPLIER to each CUSTOMER node BROADCAST

Repartitioned Join 41 “Which customers and suppliers are in the same country?” CUSTOMER SCAN SUPPLIER Slave Task 1 Coordinator SUPPLIER partitioned on SUPPKEY, CUSTOMER on CUSTKEY. Join required on *_NATION. Repartition both tables on *_NATION to localise and minimise the join effort SCAN Slave Task 2 JOIN Slave Task 3

Query Handling 42 Critical aspect of Parallel DBMS User wants to issue simple SQL, and not be concerned with parallel aspects DBMS needs structures and facilities to parallelise queries Good optimiser technology is essential As database grows, effects (good or bad) of optimiser become increasingly significant

Further Aspects 43 A single transaction may update data in several different places Multiple transactions may be using the same (distributed) tables simultaneously One or several nodes could fail Requires concurrency control and recovery across multiple nodes for: –Locking and deadlock detection –Two-phase commit to ensure ‘all or nothing’

Deadlock Detection

Locking and Deadlocks 45 With Shared Nothing architecture, each node is responsible for locking its own data No global locking mechanism However: –T1 locks item A on Node 1 and wants item B on Node 2 –T2 locks item B on Node 2 and wants item A on Node 1 –Distributed Deadlock

Resolving Deadlocks 46 One approach – Timeouts Timeout T2, after wait exceeds a certain interval –Interval may need random element to avoid ‘chatter’ i.e. both transactions give up at the same time and then try again Rollback T2 to let T1 to proceed Restart T2, which can now complete

Resolving Deadlocks 47 More sophisticated approach (DB2) Each node maintains a local ‘wait-for’ graph Distributed deadlock detector (DDD) runs at the catalogue node for each database Periodically, all nodes send their graphs to the DDD DDD records all locks found in wait state Transaction becomes a candidate for termination if found in same lock wait state on two successive iterations

Reliability

49 We wish to preserve the ACID properties for parallelised transactions –Isolation is taken care of by 2PL protocol –Isolation implies Consistency –Durability can be taken care of node-by-node, with proper logging and recovery routines –Atomicity is the hard part. We need to commit all parts of a transaction, or abort all parts Two-phase commit protocol (2PC) is used to ensure that Atomicity is preserved

Two-Phase Commit (2PC) Distinguish between: –The global transaction –The local transactions into which the global transaction is decomposed Global transaction is managed by a single site, known as the coordinator Local transactions may be executed on separate sites, known as the participants 50

Phase 1: Voting Coordinator sends “prepare T” message to all participants Participants respond with either “vote-commit T” or “vote-abort T” Coordinator waits for participants to respond within a timeout period

Phase 2: Decision If all participants return “vote-commit T” (to commit), send “commit T” to all participants. Wait for acknowledgements within timeout period. If any participant returns “vote-abort T”, send “abort T” to all participants. Wait for acknowledgements within timeout period. When all acknowledgements received, transaction is completed. If a site does not acknowledge, resend global decision until it is acknowledged.

Normal Operation 53 C P prepare T vote-commit T commit T ack vote-commit T received from all participants

Parallel Utilities

55 Ancillary operations can also exploit the parallel hardware –Parallel Data Loading/Import/Export –Parallel Index Creation –Parallel Rebalancing –Parallel Backup –Parallel Recovery