WPI, MOHAMED ELTABAKH PARALLEL & DISTRIBUTED DATABASES

INTRODUCTION
In a centralized database:
- Data is located in one place (one server)
- All DBMS functionalities are done by that server
  - Enforcing ACID properties of transactions
  - Concurrency control, recovery mechanisms
  - Answering queries
In distributed databases:
- Data is stored in multiple places (each running a DBMS)
- New notion of distributed transactions
- DBMS functionalities are now distributed over many machines
- Revisit how these functionalities work in a distributed environment

WHY DISTRIBUTED DATABASES
- Data is too large
- Applications are by nature distributed
  - Bank with many branches
  - Chain of retail stores with many locations
  - Library with many branches
- Get the benefit of distributed and parallel processing
  - Faster response time for queries

PARALLEL VS. DISTRIBUTED DATABASES
- Similar but different concepts
- Distributed processing usually implies parallel processing (not vice versa)
- Can have parallel processing on a single machine

PARALLEL VS. DISTRIBUTED DATABASES
Parallel databases: assumptions about architecture
- Machines are physically close to each other, e.g., in the same server room
- Machines connect through dedicated high-speed LANs and switches
- Communication cost is assumed to be small
- Can use a shared-memory, shared-disk, or shared-nothing architecture

PARALLEL VS. DISTRIBUTED DATABASES
Distributed databases: assumptions about architecture
- Machines can be far from each other, e.g., on different continents
- Can be connected using a public network, e.g., the Internet
- Communication cost and problems cannot be ignored
- Usually a shared-nothing architecture

PARALLEL DATABASE PROCESSING

DIFFERENT ARCHITECTURES
Three possible architectures for passing information:
- Shared-memory
- Shared-disk
- Shared-nothing

1- SHARED-MEMORY ARCHITECTURE
- Every processor has its own disk
- Single memory address space for all processors
- Reading or writing far memory can be slightly more expensive
- Every processor can also have its own local memory and cache

2- SHARED-DISK ARCHITECTURE
- Every processor has its own memory (not accessible by others)
- All machines can access all disks in the system
- The number of disks does not necessarily match the number of processors

3- SHARED-NOTHING ARCHITECTURE
- Most common architecture nowadays
- Every machine has its own memory and disk
  - Many cheap machines (commodity hardware)
- Communication is done through a high-speed network and switches
- Usually machines are organized in a hierarchy
  - Machines on the same rack
  - Then racks are connected through high-speed switches
- Scales better
- Easier to build
- Cheaper cost

PARTITIONING OF DATA
To partition a relation R over m machines:
- Range partitioning
- Hash-based partitioning
- Round-robin partitioning
Shared-nothing architecture is sensitive to partitioning
Good partitioning depends on what operations are common
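The three partitioning schemes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each "machine" is modeled as a list in memory, and the function names, the sample rows, and the split points are hypothetical, not taken from the slides.

```python
from bisect import bisect_right

def range_partition(rows, key, boundaries):
    """Range partitioning: boundaries is a sorted list of m-1 split points.
    A row goes to the partition whose index equals the number of
    split points that are <= its key."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect_right(boundaries, key(row))].append(row)
    return parts

def hash_partition(rows, key, m):
    """Hash-based partitioning: rows with equal keys land on the same machine."""
    parts = [[] for _ in range(m)]
    for row in rows:
        parts[hash(key(row)) % m].append(row)
    return parts

def round_robin_partition(rows, m):
    """Round-robin partitioning: row i goes to machine i mod m."""
    parts = [[] for _ in range(m)]
    for i, row in enumerate(rows):
        parts[i % m].append(row)
    return parts

# Hypothetical relation R(ENO, ENAME) with ten tuples, spread over 3 machines.
rows = [(eno, f"emp{eno}") for eno in range(1, 11)]
by_range = range_partition(rows, key=lambda r: r[0], boundaries=[4, 8])
by_hash = hash_partition(rows, key=lambda r: r[0], m=3)
by_rr = round_robin_partition(rows, m=3)
```

Round robin balances sizes regardless of the data, while range and hash partitioning keep related keys together, which is what makes some operations cheaper and others sensitive to skew.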

PARALLEL SCAN σ_c(R)
- Relation R is partitioned over m machines
  - Each partition of R is around |R|/m tuples
- Each machine scans its own partition and applies the selection condition c
- If data are partitioned using round robin or a hash function (over the entire tuple)
  - The resulting relation is expected to be well distributed over all nodes
  - All partitions will be scanned
- If data are range partitioned or hash partitioned (on the selection column)
  - The resulting relation can be clustered on a few nodes
  - Few partitions need to be touched
Parallel projection is also straightforward
- All partitions will be touched
- Not sensitive to how data is partitioned
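The difference between scanning every partition and touching only a few can be sketched as follows. This is a simplified model, not an actual parallel executor: the loop over partitions stands in for machines working concurrently, and the names `partitions_for_range` and `parallel_scan`, along with the sample data, are assumptions made for illustration.

```python
from bisect import bisect_right

def partitions_for_range(boundaries, low, high):
    """Indices of the range partitions that can hold keys in [low, high],
    assuming a key k is stored in partition bisect_right(boundaries, k)."""
    return range(bisect_right(boundaries, low), bisect_right(boundaries, high) + 1)

def parallel_scan(partitions, predicate, touch=None):
    """Apply the selection on each touched partition and collect the results.
    Each loop iteration represents the local work of one machine."""
    touched = range(len(partitions)) if touch is None else touch
    result = []
    for i in touched:
        result.extend(row for row in partitions[i] if predicate(row))
    return result, list(touched)

# R is range partitioned on the selection column with split points 4 and 8.
boundaries = [4, 8]
partitions = [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]]
condition = lambda value: 5 <= value <= 6

# Without knowledge of the partitioning, every machine scans its partition.
full_rows, full_touched = parallel_scan(partitions, condition)
# With range partitioning on the selection column, only one partition is touched.
pruned_rows, pruned_touched = parallel_scan(
    partitions, condition, partitions_for_range(boundaries, 5, 6)
)
```

Both calls return the same tuples; only the set of machines that had to do work differs.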

PARALLEL DUPLICATE ELIMINATION
- If the relation is range or hash partitioned
  - Identical tuples are in the same partition
  - So, eliminate duplicates in each partition independently
- If the relation is round-robin partitioned
  - Re-partition the relation using a hash function
  - So every machine creates m partitions and sends the i-th partition to machine i
  - Machine i can now perform the duplicate elimination
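A minimal sketch of the round-robin case, assuming each machine is a Python list and that "sending" a tuple simply means appending it to the receiving machine's list. The function names and the sample tuples are hypothetical.

```python
def repartition_by_hash(partitions, m):
    """Every machine hashes each local tuple into one of m buckets and
    ships bucket i to machine i. Identical tuples get the same hash,
    so they all end up on the same machine."""
    received = [[] for _ in range(m)]
    for local_rows in partitions:
        for row in local_rows:
            received[hash(row) % m].append(row)
    return received

def parallel_distinct(partitions):
    """Re-partition on the whole tuple, then remove duplicates locally."""
    shuffled = repartition_by_hash(partitions, len(partitions))
    return [sorted(set(bucket)) for bucket in shuffled]

# Round-robin placement scatters identical tuples across machines.
round_robin = [
    [("a", 1), ("b", 2)],
    [("a", 1), ("c", 3)],
    [("b", 2), ("a", 1)],
]
distinct = parallel_distinct(round_robin)
all_rows = sorted(row for part in distinct for row in part)
```

After the shuffle no cross-machine comparison is needed, because two copies of a tuple can never sit on different machines.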

PARALLEL SORTING
Range-based
- Re-partition R based on ranges into m partitions
- Machine i receives the i-th partition from every machine and sorts it
- The entire R is now sorted
- Skewed data is an issue
  - Apply a sampling phase first
  - Ranges can be of different widths
Merge-based
- Each node sorts its own data
- All nodes start sending their sorted data (one block at a time) to a single machine
- This machine applies the merge-sort technique as data comes in
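Both sorting strategies can be sketched briefly. The sketch makes simplifying assumptions: the sample is drawn from all the data in one place (a real system would sample on each machine), whole partitions are merged rather than individual blocks, and all function names are hypothetical.

```python
import heapq
import random
from bisect import bisect_right

def choose_boundaries(partitions, m, sample_size=100, seed=0):
    """Sampling phase: pick m-1 split points from a sorted sample so that
    ranges follow the data distribution and may have different widths."""
    rng = random.Random(seed)
    values = [v for part in partitions for v in part]
    sample = sorted(rng.sample(values, min(sample_size, len(values))))
    return [sample[(i * len(sample)) // m] for i in range(1, m)]

def range_sort(partitions):
    """Range-based sort: re-partition by range, then sort locally.
    Reading the machines in order yields the fully sorted relation."""
    m = len(partitions)
    boundaries = choose_boundaries(partitions, m)
    received = [[] for _ in range(m)]
    for part in partitions:
        for v in part:
            received[bisect_right(boundaries, v)].append(v)
    return [sorted(part) for part in received]

def merge_sort_at_one_node(partitions):
    """Merge-based sort: every node sorts locally, one node merges the streams."""
    return list(heapq.merge(*(sorted(part) for part in partitions)))

data = [[5, 1, 9, 3], [8, 2, 7], [6, 4, 0]]
range_sorted = range_sort(data)
flattened = [v for part in range_sorted for v in part]
merged = merge_sort_at_one_node(data)
```

The range-based variant leaves the sorted result spread over all machines, while the merge-based variant funnels everything through a single machine, which can become the bottleneck.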

DISTRIBUTED DATABASE

DEFINITIONS
- A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network.
- Distributed database system (DDBS) = DB + Communication

DISTRIBUTED DATABASES MAIN CONCEPTS
- Data are stored at several locations
  - Each managed by a DBMS that can run autonomously
- Ideally, the location of data is unknown to the client
  - Distributed data independence
- Distributed transactions
  - Clients can write transactions regardless of where the affected data are located
- Big question: how to ensure the ACID properties of distributed transactions?

DISTRIBUTED DBMS PROMISES
- Transparent management of distributed, fragmented, and replicated data
- Improved reliability/availability through distributed transactions
- Improved performance
- Easier and more economical system expansion

TRANSPARENCY & DATA INDEPENDENCE
- Data distributed (with some replication)
- Transparently ask query:

    SELECT ENAME, SAL
    FROM   EMP, ASG, PAY
    WHERE  DUR > 12
    AND    EMP.ENO = ASG.ENO
    AND    PAY.TITLE = EMP.TITLE

TYPES OF DISTRIBUTED DATABASES
- Homogeneous: every site runs the same type of DBMS
  - Homogeneous DBs can communicate directly with each other
- Heterogeneous: different sites run different DBMSs (maybe even RDBMS and ODBMS)
  - Heterogeneous DBs communicate through gateway interfaces

MAIN ISSUES
Data layout issues
- Data partitioning and fragmentation
- Data replication
Query processing and distributed transactions
- Distributed join
- Transaction atomicity using two-phase commit
- Transaction serializability using distributed locking

FRAGMENTATION
- How to divide the data?
- Can't we just distribute relations?
- What is a reasonable unit of distribution?

FRAGMENTATION ALTERNATIVES – HORIZONTAL
[Figure: horizontal fragments of a relation, one stored in London and one stored in Boston]

FRAGMENTATION ALTERNATIVES – VERTICAL
[Figure: vertical fragments of a relation, one stored in London and one stored in Boston]
Horizontal partitioning is more common

CORRECTNESS OF FRAGMENTATION
Completeness
- Decomposition of relation $R$ into fragments $R_1, R_2, \ldots, R_n$ is complete if and only if each data item in $R$ can also be found in some $R_i$
Reconstruction (lossless)
- If relation $R$ is decomposed into fragments $R_1, R_2, \ldots, R_n$, then there should exist some relational operator $\nabla$ such that $R = \nabla_{1 \le i \le n} R_i$
Disjointness (non-overlapping)
- If relation $R$ is decomposed into fragments $R_1, R_2, \ldots, R_n$, and data item $d_i$ is in $R_j$, then $d_i$ should not be in any other fragment $R_k$ ($k \neq j$)
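For horizontal fragmentation the three rules can be checked directly, with set union playing the role of the reconstruction operator. The sketch below assumes fragments are Python sets of tuples; the relation, the fragment names, and the helper functions are hypothetical. For vertical fragmentation the reconstruction operator would be a join on the key instead, which this sketch does not cover.

```python
def is_complete(relation, fragments):
    """Completeness: every tuple of R appears in at least one fragment."""
    return all(any(row in frag for frag in fragments) for row in relation)

def reconstruct(fragments):
    """Reconstruction for horizontal fragments: the union of all fragments."""
    return set().union(*fragments)

def is_disjoint(fragments):
    """Disjointness: no tuple appears in more than one fragment."""
    seen = set()
    for frag in fragments:
        for row in frag:
            if row in seen:
                return False
            seen.add(row)
    return True

# Hypothetical relation EMP(ENO, CITY) fragmented by city.
emp = {(1, "London"), (2, "Boston"), (3, "London"), (4, "Boston")}
london = {row for row in emp if row[1] == "London"}
boston = {row for row in emp if row[1] == "Boston"}

checks = (
    is_complete(emp, [london, boston]),
    reconstruct([london, boston]) == emp,
    is_disjoint([london, boston]),
)
```

Dropping the Boston fragment breaks completeness, and storing a tuple in both fragments breaks disjointness, which is exactly the situation replication introduces on purpose.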

REPLICATION ALTERNATIVES

DATA REPLICATION
Pros:
- Improves availability
- Disconnected (mobile) operation
- Distributes load
- Reads are cheaper
Cons:
- N times more updates
- N times more storage
Synchronous vs. asynchronous
- Synchronous: all replicas are up to date
- Asynchronous: cheaper, but with a delay in synchronization
Catalog management
- A catalog is needed to keep track of the location of each fragment and replica
- The catalog itself can be centralized or distributed

DISTRIBUTED JOIN
R(X,Y) ⋈ S(Y,Z)
- R(X1, X2, …, Xn, Y) is stored in London; S(Y, Z1, Z2, …, Zm) is stored in Boston
- Join based on R.Y = S.Y
- Option 1: send R to S’s location and join there
- Option 2: send S to R’s location and join there
- Communication is expensive: too much data to send
- Is there a better option?
  - Semi-join
  - Bloom join

SEMI-JOIN
R(X1, X2, …, Xn, Y) is stored in London; S(Y, Z1, Z2, …, Zm) is stored in Boston
- Send only the S.Y column to R’s location
- Do the join based on the Y columns at R’s location (semi-join)
- Send the records of R that will join (without duplicates) to S’s location
- Perform the final join at S’s location
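The four semi-join steps can be traced on a small example. This is a sketch under simplifying assumptions: R has a single X attribute, both relations are in-memory lists, and "shipping" is modeled only by counting what would cross the network. The function name and the sample data are hypothetical.

```python
def semi_join_plan(r_rows, s_rows):
    """Join R(X, Y) with S(Y, Z) when R and S are stored at different sites."""
    # Step 1 (S's site -> R's site): ship only the distinct Y values of S.
    s_y_values = {y for (y, _z) in s_rows}
    # Step 2 (R's site): the semi-join keeps R records whose Y appears in S.
    r_reduced = {(x, y) for (x, y) in r_rows if y in s_y_values}
    # Step 3 (R's site -> S's site): ship the reduced, duplicate-free R.
    # Step 4 (S's site): perform the final join.
    joined = sorted(
        (x, y, z) for (x, y) in r_reduced for (sy, z) in s_rows if sy == y
    )
    shipped = {"y_values_to_R": len(s_y_values), "r_records_to_S": len(r_reduced)}
    return joined, shipped

r_rows = [("x1", 1), ("x2", 2), ("x3", 3), ("x4", 9)]   # stored in London
s_rows = [(1, "z1"), (1, "z2"), (3, "z3"), (7, "z4")]   # stored in Boston
joined, shipped = semi_join_plan(r_rows, s_rows)
```

In this example three Y values and two R records cross the network instead of four full R records or four full S records; the saving grows with the width of the non-join attributes.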

IS SEMI-JOIN EFFECTIVE?
R(X1, X2, …, Xn, Y) is stored in London; S(Y, Z1, Z2, …, Zm) is stored in Boston
Depends on many factors:
- If the size of the Y attribute is small compared to the remaining attributes in R and S
- If the join is highly selective, so that the set of R records that join is small
- If there are many duplicates that can be eliminated

BLOOM JOIN
- Build a bit vector of size K at R’s location (all 0’s)
- For every record in R, apply one or more hash functions to its Y value (each returns a position from 1 to K)
  - Each function hashes Y to a bit in the bit vector; set this bit to 1
- Send the bit vector to S’s location
- S uses the same hash function(s) to hash its Y values
  - If all hash positions of a Y value hold 1’s, then this Y is a candidate for the join
  - Otherwise, it is not a candidate for the join
- Send S’s records having candidate Y’s to R’s location for the join
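A small sketch of the Bloom join steps. Several details are assumptions made for illustration and not part of the slides: the hash functions are derived from SHA-256 with different seeds, two hash functions are used, bit positions run from 0 to K-1 rather than 1 to K, and R has a single X attribute.

```python
import hashlib

def bit_positions(y, k, num_hashes=2):
    """Map a Y value to num_hashes positions in the range [0, k)."""
    positions = []
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{y}".encode()).digest()
        positions.append(int.from_bytes(digest[:8], "big") % k)
    return positions

def build_filter(r_rows, k):
    """At R's site: start from all 0's and set the bits of every R.Y value."""
    bits = [0] * k
    for _x, y in r_rows:
        for pos in bit_positions(y, k):
            bits[pos] = 1
    return bits

def candidates(s_rows, bits):
    """At S's site: keep records whose Y hits a 1 in all of its positions."""
    k = len(bits)
    return [row for row in s_rows if all(bits[p] for p in bit_positions(row[0], k))]

def bloom_join(r_rows, s_rows, k=64):
    bits = build_filter(r_rows, k)        # shipped from R's site to S's site
    shipped = candidates(s_rows, bits)    # shipped from S's site to R's site
    joined = sorted(
        (x, y, z) for (x, y) in r_rows for (sy, z) in shipped if sy == y
    )
    return joined, shipped

r_rows = [("x1", 1), ("x2", 2), ("x3", 3)]
s_rows = [(1, "z1"), (3, "z3"), (7, "z4"), (8, "z5")]
joined, shipped = bloom_join(r_rows, s_rows)
```

The filter can produce false positives, so some shipped S records may not join, but it never drops a record that does join; the final join at R's site removes the false positives.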

TRANSACTIONS
- A transaction is an atomic sequence of actions in the database (reads and writes)
- Each transaction has to be executed completely, and must leave the database in a consistent state
- If the transaction fails or aborts midway, then the database is “rolled back” to its initial consistent state (before the transaction began)
ACID properties of transactions

ATOMICITY IN DISTRIBUTED DBS
- One transaction T may touch many sites
  - T has several components T1, T2, …, Tm
  - Each Tk is running (reading and writing) at site k
- How to make T atomic?
  - Either all of T1, T2, …, Tm complete, or none of them is executed
- The two-phase commit technique is used

TWO-PHASE COMMIT
Phase 1
- The site that initiates T is the coordinator
- When the coordinator wants to commit (complete T), it sends a “prepare T” msg to all participant sites
- Every other site receiving “prepare T” sends either “ready T” or “don’t commit T”
  - A site can wait for a while until it reaches a decision (the coordinator will wait a reasonable time to hear from the others)
- These msgs are written to local logs

TWO-PHASE COMMIT (CONT’D)
Phase 2
- IF the coordinator received “ready T” from all sites
  - Remember, no one has committed yet
  - Coordinator sends “commit T” to all participant sites
  - Every site receiving “commit T” commits its transaction
- IF the coordinator received any “don’t commit T”
  - Coordinator sends “abort T” to all participant sites
  - Every site receiving “abort T” aborts its transaction
- These msgs are written to local logs
Straightforward if no failures happen
- In case of failure, logs are used to ensure that ALL are done or NONE
- Example 1: What if one site in Phase 1 replied “don’t commit T”, and then crashed?
- Example 2: What if all sites in Phase 1 replied “ready T”, and then one site crashed?
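The failure-free message flow of the two phases can be sketched as a small simulation. It deliberately leaves out crashes, timeouts, and log-based recovery, so it does not answer the two examples above; the class and function names, and the `can_commit` flag that decides a site's vote, are hypothetical.

```python
class Participant:
    """A site running one component of transaction T."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # assumed local outcome, set by the caller
        self.log = []                 # stands in for the site's local log
        self.state = "active"

    def prepare(self, txn):
        """Phase 1: vote and record the vote in the local log."""
        vote = "ready" if self.can_commit else "don't commit"
        self.log.append(f"{vote} {txn}")
        return vote

    def finish(self, txn, decision):
        """Phase 2: apply the coordinator's decision and log it."""
        self.log.append(f"{decision} {txn}")
        self.state = "committed" if decision == "commit" else "aborted"


def two_phase_commit(txn, participants):
    coordinator_log = [f"prepare {txn}"]
    # Phase 1: collect votes; nobody has committed yet.
    votes = [p.prepare(txn) for p in participants]
    # Phase 2: commit only if every vote is "ready", otherwise abort everywhere.
    decision = "commit" if all(v == "ready" for v in votes) else "abort"
    coordinator_log.append(f"{decision} {txn}")
    for p in participants:
        p.finish(txn, decision)
    return decision, coordinator_log


all_ready = [Participant("London"), Participant("Boston")]
decision_ok, log_ok = two_phase_commit("T", all_ready)

one_refuses = [Participant("London"), Participant("Boston", can_commit=False)]
decision_bad, log_bad = two_phase_commit("T", one_refuses)
```

A single "don't commit" vote is enough to abort T at every site, including the sites that voted "ready".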

DATABASE LOCKING
- Locking mechanisms are used to prevent concurrent transactions from updating the same data at the same time
  - Reading(x) → shared lock on x
  - Writing(x) → exclusive lock on x
  - More types of locks exist for efficiency

Lock compatibility (rows: what you have; columns: what you request)

                    Shared lock    Exclusive lock
  Shared lock       Yes            No
  Exclusive lock    No             No

In distributed DBs:
- x may be replicated at multiple sites (not one place)
- The transactions reading or writing x may be running at different sites
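The compatibility rules can be expressed as a lookup table used by a single-site lock manager. This is a minimal sketch: requests that cannot be granted return False instead of waiting, lock upgrades are not handled, and the names `COMPATIBLE` and `LockTable` are hypothetical.

```python
# (mode held by another transaction, mode requested) -> can both coexist?
COMPATIBLE = {
    ("S", "S"): True,
    ("S", "X"): False,
    ("X", "S"): False,
    ("X", "X"): False,
}

class LockTable:
    """Lock manager for one site: item -> list of (transaction, mode)."""

    def __init__(self):
        self.held = {}

    def request(self, txn, item, mode):
        """Grant the lock only if it is compatible with every lock that
        other transactions already hold on the item."""
        holders = self.held.setdefault(item, [])
        for other, held_mode in holders:
            if other != txn and not COMPATIBLE[(held_mode, mode)]:
                return False
        holders.append((txn, mode))
        return True

    def release_all(self, txn):
        """Drop every lock held by txn (for example at commit or abort)."""
        for item in self.held:
            self.held[item] = [(t, m) for (t, m) in self.held[item] if t != txn]

locks = LockTable()
outcomes = [
    locks.request("T1", "x", "S"),   # first reader
    locks.request("T2", "x", "S"),   # second reader shares the lock
    locks.request("T3", "x", "X"),   # writer is blocked by the readers
]
locks.release_all("T1")
locks.release_all("T2")
outcomes.append(locks.request("T3", "x", "X"))  # writer succeeds once readers leave
outcomes.append(locks.request("T1", "x", "S"))  # reader is blocked by the writer
```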

DISTRIBUTED LOCKING
Centralized approach
- One dedicated site manages all locks
- Cons: bottleneck, not scalable, single point of failure
Primary-copy approach
- Every item in the database, say x, has a primary site, say Px
- Any transaction running anywhere will ask Px for the lock on x
Fully distributed approach
- To read, lock any copy of x
- To write, lock all copies of x
- Variations exist to balance the costs of read and write operations
Deadlocks are very possible. How to resolve them?
- Using a timeout: after waiting for a while for a lock, abort and start again
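The fully distributed rule (read locks one copy, write locks all copies) can be sketched with one small lock table per site. The sketch assumes every site holds a copy of x, and it models a timed-out wait as an immediate failure: a writer that cannot get every copy gives back what it acquired and reports failure, after which the caller would abort and retry. The `Site` class and both helper functions are hypothetical.

```python
class Site:
    """One site holding a copy of each item, with its own local lock table."""

    def __init__(self, name):
        self.name = name
        self.locks = {}  # item -> (mode, set of transaction ids)

    def try_lock(self, txn, item, mode):
        held_mode, owners = self.locks.get(item, ("S", set()))
        # Any existing lock conflicts unless both sides are shared.
        if owners and (mode == "X" or held_mode == "X"):
            return False
        self.locks[item] = (mode, owners | {txn})
        return True

    def unlock(self, txn, item):
        if item in self.locks:
            mode, owners = self.locks[item]
            owners = owners - {txn}
            if owners:
                self.locks[item] = (mode, owners)
            else:
                del self.locks[item]

def read_lock(sites, txn, item):
    """To read, a shared lock on any single copy is enough."""
    return any(site.try_lock(txn, item, "S") for site in sites)

def write_lock(sites, txn, item):
    """To write, exclusive locks on all copies are needed (all or nothing)."""
    acquired = []
    for site in sites:
        if site.try_lock(txn, item, "X"):
            acquired.append(site)
        else:
            for granted in acquired:   # give up, as after a timeout
                granted.unlock(txn, item)
            return False
    return True

sites = [Site("London"), Site("Boston"), Site("Tokyo")]
t1_reads = read_lock(sites, "T1", "x")        # locks the London copy only
t2_writes = write_lock(sites, "T2", "x")      # blocked by T1's read lock
sites[0].unlock("T1", "x")                    # T1 finishes
t2_retries = write_lock(sites, "T2", "x")     # now every copy can be locked
t3_reads = read_lock(sites, "T3", "x")        # no copy is available to read
```

Because a writer must hold every copy, locking any one copy is enough for a reader to be protected from concurrent writes; the price is that writes become more expensive as the number of replicas grows.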

SUMMARY OF DISTRIBUTED DBMS
Promises of DDBMSs
- Transparent management of distributed, fragmented, and replicated data
- Improved reliability/availability through distributed transactions
- Improved performance
- Easier and more economical system expansion
Classification of DDBMSs
- Homogeneous vs. heterogeneous
- Client-server vs. collaborative servers vs. peer-to-peer

SUMMARY OF DISTRIBUTED DBMS (CONT’D)
Data layout issues
- Data partitioning and fragmentation
- Data replication
Query processing and distributed transactions
- Distributed join
- Transaction atomicity using two-phase commit
- Transaction serializability using distributed locking