
 Parallel databases  Architecture  Query evaluation  Query optimization  Distributed databases  Architectures  Data storage  Catalog management  Query processing  Transactions

 A parallel database system is designed to improve performance through parallelism  Loading data, building indexes, evaluating queries  Data may be stored in a distributed way, but solely for performance reasons  A distributed database system is physically stored across several sites  Each site is managed by an independent DBMS  Distribution is motivated by local ownership and availability, as well as performance

HHow long does it take to scan a 1 terabyte table at 10MB/s? 11,099,511,627,776 bytes = 1,024 4 or 2 40 bytes 110MB = 10,485,760 bytes 11,099,511,627,776 / 10,485,760 = 104,858 1104,858 / (60 * 20 * 24) = 1.2 days! UUsing 1,000 processors in parallel the time can be reduced to 1.5 minutes

 A coarse-grain parallel machine consists of a small number of processors  Most current high-end computers  A fine-grain parallel machine uses thousands of smaller processors  Also referred to as a massively parallel machine

 Both throughput and response time can be improved by parallelism  Throughput – the number of tasks completed in a given time  Processing many small tasks in parallel increases throughput  Response time – the time it takes to complete a single task  Subtasks of large transactions can be performed in parallel, reducing response time

 Speed-up  More resources means less time for a given amount of data  Scale-up  If resources increase in proportion to increase in data size, time is constant [Graphs: throughput vs. degree of parallelism and response time vs. degree of parallelism, each shown against the ideal linear curve]

 Where possible a parallel database should carry out evaluation steps in parallel  There are many opportunities for parallelism in a relational database  There are three main parallel DBMS architectures  Shared nothing  Shared memory  Shared disk

 Multiple CPUs attached to an interconnection network  Accessing a common region of main memory  Similar to a conventional system  Good for moderate parallelism  Communication overhead is low  OS services control the CPUs  Interference increases with size  As CPUs are added memory contention becomes a bottleneck  Adding more CPUs eventually slows the system down [Diagram: CPUs and disks attached through an interconnection network to global shared memory]

 Each CPU has private memory and direct access to data  Through the interconnection network  Good for moderate parallelism  Suffers from interference in the interconnection network  Which acts as a bottleneck  Not a good solution for a large scale parallel system [Diagram: CPUs with private memories accessing shared disks through an interconnection network]

 Each CPU has local memory and disk space  No two CPUs access the same storage area  All CPU communication is through the network  Increases complexity  Linear speed-up ▪ Operation time decreases proportional to increase in CPUs  Linear scale-up ▪ Performance maintained if CPU increase is proportional to data [Diagram: CPUs, each with its own memory and disk, connected only by the interconnection network]

 A relational query execution plan is a tree, or graph, of relational algebra operators  Operators in a query tree can be executed in parallel  If one operator consumes the output of another, there is pipelined parallelism  Otherwise the operators can be evaluated independently  An operator blocks if it does not produce any output until it has consumed all its inputs  Pipelined parallelism is limited by blocking operators

 Individual operators can be evaluated in a parallel way by partitioning input data  In data-partitioned parallel evaluation the input data is partitioned, and worked on in parallel  The results are then combined  Tables are horizontally partitioned  Different rows are assigned to different processors

 Partition using a round-robin algorithm  Partition using hashing  Partition using ranges of field values

 Partition using a round-robin algorithm  Assign record i to processor i mod n ▪ Similar to RAID systems  Suitable for evaluating queries that access the entire table  Less efficient for queries that access ranges of values and queries on equality

 Partition using hashing  A hash function based on selected attributes is applied to each record to determine its processor  The data remains evenly distributed as the table grows, or shrinks over time  Good for equality selections  Only the one disk holding the matching partition is accessed, leaving the others free  Also useful for sequential scans where the partitioning attributes are a candidate key

 Partition using ranges of field values  Ranges are chosen from the sort key values, and each range should contain the same number of records ▪ Each disk contains one range  If a range is too large it can lead to data skew  Skew can lead to the processors with large partitions becoming bottlenecks  Good for equality selections, and range selections
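A sketch of the three schemes in Python (illustrative only: records are dicts, "key" stands for the partitioning attribute, and n is the number of processors):

import bisect

def round_robin(records, n):
    # Record i goes to processor i mod n
    parts = [[] for _ in range(n)]
    for i, r in enumerate(records):
        parts[i % n].append(r)
    return parts

def hash_partition(records, n):
    # A hash of the partitioning attribute picks the processor
    parts = [[] for _ in range(n)]
    for r in records:
        parts[hash(r["key"]) % n].append(r)
    return parts

def range_partition(records, split_points):
    # split_points is sorted, e.g. [10, 20] gives ranges <10, 10..19, >=20
    parts = [[] for _ in range(len(split_points) + 1)]
    for r in records:
        parts[bisect.bisect_right(split_points, r["key"])].append(r)
    return parts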

 Both hash and range partitioning may result in data skew  Where some partitions are larger or smaller  Skew can dramatically reduce the speed-up obtained from parallelism  In range partitioning skew can be reduced by using histograms  A histogram records the number of tuples in each range of attribute values, and is used to derive even partitions

 Parallel data streams are used to provide data for relational operators  The streams can come from different disks, or  Output of other operators  Streams are merged or split  Merged to provide the inputs for a relational operator  Split as needed to parallelize processing  These operations can buffer data, and should be able to halt operators that provide their input data  A parallel evaluation consists of a network of relational, merge and split operators

 Inter-query parallelism  Different queries or transactions execute in parallel  Throughput is increased but response time is not  Easy to support in a shared-memory system  Intra-query parallelism  Executing a single query in parallel to speed up large queries  Which in turn can entail either intra-operation or inter-operation parallelism, or both

 Scanning and loading  Pages can be read in parallel while scanning a relation  The results can be merged  If hash or range partitioning is used selections can be directed to the relevant processors  Sorting  Joins

 The simplest sort method is for each processor to sort its portion of the table  Then merge the sorted records  The merging phase may limit the amount of parallelism  A better method is to first redistribute the records over the processors using range partitioning  Using the sort attributes  Each processor sorts its set of records  The sets of sorted records are then retrieved in order  To make the partitions even, the data in the processors can be sampled
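A minimal sketch of the better method, with the phases marked (split_points stands for the sampled range boundaries):

import bisect

def parallel_sort(records, key, split_points):
    # Phase 1: redistribute by range on the sort attribute
    parts = [[] for _ in range(len(split_points) + 1)]
    for r in records:
        parts[bisect.bisect_right(split_points, r[key])].append(r)
    # Phase 2: each processor sorts its own partition (simulated serially here)
    for p in parts:
        p.sort(key=lambda r: r[key])
    # Phase 3: reading the partitions back in range order gives the sorted table
    return [r for p in parts for r in p]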

 Join algorithms can be parallelized  Parallelization is most effective for hash or sort-merge joins ▪ Parallel hash join is widely used  The process for parallel hash join is  First partition the two tables across the processors using the same hash function  Join the records locally, using any join algorithm  Merge the results of the local joins, the union of these results is the join of the two tables
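A sketch of these steps, where each loop iteration stands in for one processor working on its local pair of partitions:

def parallel_hash_join(r_records, s_records, r_key, s_key, n):
    # Step 1: partition both tables with the same hash function
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for r in r_records:
        r_parts[hash(r[r_key]) % n].append(r)
    for s in s_records:
        s_parts[hash(s[s_key]) % n].append(s)
    # Step 2: join locally (here, a simple hash join per processor)
    result = []
    for i in range(n):
        index = {}
        for r in r_parts[i]:
            index.setdefault(r[r_key], []).append(r)
        for s in s_parts[i]:
            for r in index.get(s[s_key], []):
                result.append({**r, **s})
    # Step 3: the result is the union of the local joins
    return result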

 If tables are very large, parallel hash join may have a high cost at each processor  If each partition is large, multiple passes will be required for the local joins  An alternative approach is to use all processors for each partition  Partition the tables using h1 ▪ Each partition of the smaller relation should fit into the combined memory of the processors  Process each partition using all processors ▪ Use h2 to determine which processor to send records to

 Partitioning is not suitable for joins on inequalities  Such as R ⋈ S on the condition R.a < S.b  Since all records in R could join with a record in S  Fragment and replicate joins can be used  In asymmetric fragment and replicate join ▪ One of the relations is partitioned ▪ The other relation is replicated across all partitions

 Each relation can be both fragmented and replicated  Into m fragments of R and n of S  However m * n processors are required  This works with any join condition  When partitioning is not possible [Diagram: an m × n grid of processors P0,0 … Pm-1,n-1, with fragment Ri of R replicated across row i and fragment Sj of S replicated across column j]

 Selection – the table may already be partitioned on the selection attribute  If not, it can be scanned in parallel  Duplicate elimination – use parallel sorting  Projection – can be performed by scanning  Aggregation – partition by the grouping attribute  If records do have to be transferred between processors it may be possible to just send partial results  The final result can then be calculated from the partial results ▪ e.g. sum
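For sum, the partial-result idea looks like this (a sketch; partitions stands for the per-processor row sets, and only the small dictionaries of partial sums would cross the network):

def parallel_sum(partitions, group_key, value_key):
    partials = []
    for part in partitions:                    # one pass per processor
        local = {}
        for r in part:
            local[r[group_key]] = local.get(r[group_key], 0) + r[value_key]
        partials.append(local)                 # partial sums, not raw records
    final = {}
    for local in partials:                     # combine at the query site
        for g, s in local.items():
            final[g] = final.get(g, 0) + s
    return final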

 Using parallel processors reduces the time to perform an operation  Possibly to as little as 1/n * original cost ▪ Where n is the number of processors  However there are also additional costs  Start-up costs for initiating the operation  Skew which may reduce the speed-up  Contention for resources resulting in delays  Cost of assembling the final result

 As well as parallelizing individual operators, different operators can be processed in parallel  Different processors perform different operations  Result of one operator can be pipelined into another  Note that sorting and the partitioning phase of hash join block pipelining  Multiple independent operations can be executed concurrently  Using bushy, rather than left-deep, join trees

 The best serial plan may not be the best parallel plan  Also note that parallelization introduces further complexity into query optimization  Consider a table partitioned into two nodes, with a local secondary index  Node 1 contains names between A and M  Node 2 contains names between N and Z  Consider the selection: name < “Noober”  Node 1 should scan its partition, but  Node 2 should use the name index

 In a large-scale parallel system the chances of failure increase  Such systems should be designed to operate even if a processor or disk fails  Data can be replicated across multiple processors  Failed processors or disks are tracked  And requests are re-routed to the backup

 Architecture  Shared-memory is easy, but costly and does not scale well  Shared-nothing is cheap and scales well, but is harder to implement  Both intra-operation, and inter-operation parallelism are possible  Most relational algebra operations can be performed in parallel  How the data is partitioned across processors is very important

 A distributed database is motivated by a number of factors  Increased availability ▪ If a site containing a table goes down, the table may still be available if a copy is maintained at another site  Distributed access to data ▪ An organization may have branches in several cities ▪ Access patterns are typically affected by locality  Analysis of distributed data  Distributed systems must support integrated access

 Data is stored at several sites  Each site is managed by an independent DBMS  The system should make the fact that data is distributed transparent to the user  Distributed Data Independence  Users should not need to know where the data is located  Queries that access several sites should be optimized  Distributed Transaction Atomicity  Users should be able to write transactions that access several sites, in the same way as local transactions

 Users may have to be aware of where data is located  Distributed data independence and distributed transaction atomicity may not be supported  These properties may be hard to support efficiently ▪ Sites may be connected by a slow long-distance network  Consider a global system  Administrative overheads for viewing data as a single unified collection may be prohibitively expensive

 Distributed and shared-nothing parallel systems appear similar  In practice these are often very different since distributed DBs are typically  Geographically separated  Separately administered  Have slower interconnections  May have both local and global transactions

 Homogeneous  Data is distributed but every site runs the same DBMS software  Heterogeneous, or multidatabase  Different sites run different DBMSs, and the sites are connected to enable access to data  Require standards for gateway protocols  A gateway protocol is an API that allows external applications access to the database ▪ e.g. ODBC and JDBC  Gateways add a layer of processing, and may not be able to entirely mask differences between servers

 Client-Server  Collaborating Server  Middleware

 One or more client processes and one or more server processes  A client process sends a query to any one server process  Clients are responsible for UI  Servers manage data and execute transactions  A popular architecture  Relatively simple to implement  Servers do not have to deal with user-interactions  Users can run a GUI on clients  Communication between client and server should be as set-oriented as possible  e.g. stored procedures vs. cursors

 Client-server systems do not allow a single query to access multiple servers as this would require  Breaking the query into sub-queries to be executed at different sites and merging the answers to the sub-queries  To do this the client would have to be overly complex  In a collaborating server system the distinction between clients and servers is eliminated  A collection of DB servers, each able to run transactions against local data  When a query is received that requires data from other servers the server generates appropriate sub-queries

 Designed to allow a single query to access multiple servers, but  Without requiring all servers to be capable of managing multi-site query execution  Often used to integrate legacy systems  Requires one database server (the middleware) capable of managing multi-server queries  Other servers only handle local queries and transactions  The special server coordinates queries and transactions  The middleware server typically doesn’t maintain any data

 In a distributed system tables are stored across several sites  Accessing a table stored elsewhere incurs message- passing costs  A single table may be replicated or fragmented across several sites  Fragments are stored at the sites where they are most often accessed  Several replicas of a table may be stored at different sites  Fragmentation and replication can be combined

 Fragmentation consists of breaking a table into smaller tables, or fragments  The fragments are stored instead of the original table  Possibly at different sites  Fragmentation can either be vertical or horizontal

TID  empID  fName     lName    age  city
1    111    Sam       Spade    43   Chicago
2    222    Peter     Whimsey  51   Surrey
3    333    Sherlock  Holmes   35   Surrey
4    444    Anita     Blake    29   Boston

(a horizontal fragment is a subset of the rows; a vertical fragment is a subset of the columns)

 Records that belong to a horizontal fragment are usually identified by a selection query  e.g. all the records that relate to a particular city, achieving locality and reducing communication costs  A horizontally fragmented table can be recreated by computing the union of the fragments ▪ Fragments are usually required to be disjoint  Records belonging to a vertical fragment are identified by a projection query  The collection of vertical fragments must be a lossless-join decomposition  A unique tuple ID is often assigned to records
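A sketch of both kinds of fragment over the employee table above (two rows shown; each vertical fragment keeps the tuple ID so the decomposition is lossless-join):

employees = [
    {"tid": 1, "empID": 111, "fName": "Sam", "lName": "Spade", "age": 43, "city": "Chicago"},
    {"tid": 2, "empID": 222, "fName": "Peter", "lName": "Whimsey", "age": 51, "city": "Surrey"},
]

# Horizontal fragment: a selection, e.g. the records stored at the Surrey site
surrey_fragment = [r for r in employees if r["city"] == "Surrey"]

# Vertical fragments: projections that share the tuple ID
name_fragment = [{k: r[k] for k in ("tid", "fName", "lName")} for r in employees]
rest_fragment = [{k: r[k] for k in ("tid", "empID", "age", "city")} for r in employees]

# The table is the union of its horizontal fragments, or the join of its
# vertical fragments on tid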

 Replication entails storing several copies of a table or of table fragments for  Increased availability of data, which protects against ▪ Failure of individual sites, and ▪ Failure of communication links  Faster query evaluation ▪ Queries can execute faster by using a local copy of a table  There are two kinds of replication, synchronous, and asynchronous  These differ in how replicas are kept current when the table is modified

 Distributing data across sites adds complexity  It is important to track where replicated or fragmented tables are stored  Each replica or fragment must be uniquely named  Naming should be performed locally  A global relation name consists of {birth site, local name} ▪ The birth site is the site where the table was created  A site catalog records fragments and replicas at a site, and tracks replicas of tables created at the site  To locate a table, look up the catalog at its birth site  The birth site never changes, even if the table is moved
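A toy illustration of the lookup path (the site and table names are invented for the example):

site_catalogs = {
    "london": {"employees": ["london", "vancouver"]},   # a London-born table with two replicas
    "vancouver": {"orders": ["vancouver"]},
}

def locate(global_name):
    birth_site, local_name = global_name
    # The birth site never changes, so its catalog can always be consulted,
    # even if the table itself has since been moved
    return site_catalogs[birth_site][local_name]

print(locate(("london", "employees")))   # ['london', 'vancouver']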

 Estimating the cost of an evaluation plan must include communication costs  Evaluate the number of page reads or writes, and  The number of pages that must be sent from one site to another  Pages may need to be shipped between a number of sites  Sites where the data is located, and where the result is computed, and  The site that initiated the query

 Simple, one-table queries are affected by fragmentation and replication  If a table is horizontally fragmented a query has to be evaluated at multiple sites  And the union of the results computed  Selections that only require data at one site can be executed just at that site  If a table is vertically fragmented the fragments have to be joined on the common attribute  If a table is replicated, the shipping costs have to be considered to determine which site to use

 Joins of tables at different sites can be very expensive  There are a number of strategies for computing joins  Fetch as needed  Ship to one site  Semijoins and Bloomjoins

 Designate one table as the outer relation, and compute the join at that site  Fetch records of the inner relation as needed; the cost depends on  The size of the relations  Whether the inner relation is cached at the outer relation's site ▪ If not, communication costs are incurred once for each time the inner relation is read  The size of the result relation  If the size of the result (R ⋈ S) is greater than R + S it is cheaper to ship both relations to the query site

 In this strategy, relations are shipped to a site and the join carried out at that site  The site can be one of the sites involved in the join  The result has to be shipped from where it was computed to the site where the query was posed  Alternatively both input relations can be shipped to the site where the query was originally posed  The join is then computed at that site

 Consider a join between two relations, R and S at different sites, London and Vancouver  Assume that S (the inner relation) is to be shipped to London where the join will be computed  Note that some S records may not join to R records  Shipping costs can be reduced by only shipping those S records that will actually join to R records  There are two techniques that can reduce the number of S records to be shipped  Semi-joins, and  Bloom-joins

 At the first site (London) compute the projection of R onto the join column a, πa(R)  Ship this projection to site 2 (Vancouver)  At Vancouver compute the join of πa(R) and S  The result of this join is the reduction of S with respect to R  Ship the reduction of S to London  At London compute the join of the reduction of S, and R  The effectiveness of this technique depends on how much smaller the reduction of S is compared to S
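The same steps as a sketch, with the two sites represented by plain lists and "a" as the join attribute:

def semi_join(r_london, s_vancouver):
    # Step 1 (London): project R onto the join column and ship the projection
    r_keys = {r["a"] for r in r_london}
    # Step 2 (Vancouver): the reduction of S is the set of S records that join
    s_reduction = [s for s in s_vancouver if s["a"] in r_keys]
    # Step 3 (London): join R with the shipped reduction
    return [{**r, **s} for r in r_london for s in s_reduction if r["a"] == s["a"]]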

 Bloom-joins are similar to semi-joins, except that a bit vector is sent to the second site  The vector has size k and each record in R is hashed to it ▪ A bit is set to 1 if a record hashes to it ▪ The hash function is on the join attribute  The reduction of S is then computed in step 2  By hashing records of S to the bit vector  Only those records that hash to a bit with the value of 1 are included in the reduction  The cost to send the bit vector is less than the cost to send the projection of R on the join attribute  But some unwanted records of S may be in the reduction
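A sketch of the bit-vector variant; k is the vector size and Python's built-in hash stands in for the real hash function on the join attribute:

def bloom_join(r_london, s_vancouver, k=1024):
    # London: hash every R join value into a k-bit vector and ship the vector
    bits = [0] * k
    for r in r_london:
        bits[hash(r["a"]) % k] = 1
    # Vancouver: keep only S records that hash to a set bit; some unwanted
    # records may survive (false positives), but no joining record is missed
    s_reduction = [s for s in s_vancouver if bits[hash(s["a"]) % k]]
    # London: compute the real join against the shipped reduction
    return [{**r, **s} for r in r_london for s in s_reduction if r["a"] == s["a"]]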

 The basic cost-based approach is to consider a set of plans and pick the cheapest  Communication costs must be considered  Local autonomy must be respected  Some operations can be carried out in parallel  The query site generates a global plan with suggested local plans  Local sites are allowed to change their suggested plans if they can improve them

 If data is distributed it should be transparent to users  Users should be able to ask queries without having to worry where tables are stored  Transactions should be atomic actions, regardless of data fragmentation or replication  If so, all copies of a replicated relation must be modified before the transaction commits  Referred to as synchronous replication  Another approach, asynchronous replication, allows copies of a relation to differ  More efficient, but compromises data independence

 There are two techniques for ensuring that a transaction sees the same values  Regardless of which copy of an object it accesses  In voting, a transaction must write a majority of copies to modify an object, and  Must read enough copies to ensure that it sees at least one most recent copy  e.g. 10 copies of an object, at least 6 copies must be written, and at least 5 read  Note that the copies include a version number so that it is possible to tell which copy is the latest
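The rule behind this example is that a write quorum w and read quorum r over n copies must satisfy w > n/2 and r + w > n, so every read overlaps the latest write; a one-line check:

def valid_quorums(n, r, w):
    return w > n / 2 and r + w > n

print(valid_quorums(10, 5, 6))   # True: the example above
print(valid_quorums(10, 4, 6))   # False: a read could miss the latest write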

 Voting is generally not an efficient technique  Reading an object requires that multiple copies of the object must be read  Typically, objects are read more than they are written  The read-any write-all policy allows any single copy to be read, but  All copies must be written when an object is written  Writes are slower, relative to voting, but  Reads are fast, particularly if a local copy is available  Read-any write-all is usually used for synchronous replication

 Synchronous replication is expensive  Before an update transaction is committed it must obtain X locks on all copies of the data  This may entail sending lock requests to remote sites and waiting for the locks to be confirmed  While holding its other locks  If sites, or the communication links fail, the transaction cannot commit until they are back up  Committing the transaction requires sending multiple messages as part of a commit protocol  An alternative is to use asynchronous replication

 A transaction is allowed to commit before all the copies have been changed  Readers still only look at a single copy  Users must be aware of which copy they are reading, and that copies may be out of sync  There are two approaches to asynchronous replication  Peer-to-peer, and  Primary site

 More than one copy can be designated as updatable  Changes to the master(s) must be propagated to other copies  If two masters are changed a conflict resolution strategy must be used  Peer-to-peer replication is best used when conflicts do not arise  Where each master site owns a disjoint fragment ▪ Usually a horizontal fragment  Update rights are only held by one master at a time ▪ A backup site may gain update rights if the main site fails

 One copy of a table is designated as the primary or master copy  Users register or publish the primary copies  Other sites subscribe to the table (or fragments of it), by creating secondary copies  Secondary copies cannot be directly updated  Changes to the primary copy must be propagated to the secondary copies  First, capture change made by committed transactions  Apply the changes to secondary copies

 Log-based capture creates an update record from the recovery log when it is written to stable storage  Log changes that affect replicated tables are written to a change data table (CDT)  Note that aborted transactions must, at some point, be removed from the CDT  Another approach is to use procedural capture  A trigger invokes a procedure which takes a snapshot of the primary copy  Log-based capture is cheaper and has less delay, but relies on proprietary log details

 The apply step takes the capture step changes and propagates them to secondary copies  This can be continuously pushed from the master whenever a CDT is generated, or  Periodically requested (or pulled) by the copies ▪ A timer or application controls the frequency of the requests  Log-based capture with continuous apply minimizes delay  A cheaper substitute for synchronous replication  Procedural capture and application driven apply gives the most flexibility

 Complex decision support queries that require data from multiple sites are popular  To improve query efficiency, all the data can be copied to one site, which is then queried  These data collections are called data warehouses  Warehouses use asynchronous replication  The source data is typically controlled by different DBMSs  Source data often has to be cleaned when creating the replicas  Procedural capture and application-driven apply are best suited to this environment

 Transactions may be submitted at one site but can access data at other sites  The transaction manager breaks the transaction into sub-transactions that execute at different sites  The sub-transactions are submitted to the other sites  The transaction manager at the initial site must coordinate the activity of the sub-transactions  Distributed concurrency control  Locks and deadlocks must be managed across sites  Distributed recovery  Transaction atomicity must be ensured across sites

 In centralized locking, a single site is in charge of handling lock and unlock requests  This is vulnerable to single site failure and bottlenecks  In primary copy locking, all locking is done at the primary copy site for an object  Reading a copy of an object usually requires communication with two sites  In fully distributed locking, lock requests are handled by the lock manager at the local site  X locks must be set at all sites when copies are modified  S locks are only set at the local site  There are other protocols for locking replicated data

 If deadlock detection is being used (rather than prevention) the scheme must be modified  Centralized - send all local waits-for graphs to a central site  Hierarchical - organize sites into a hierarchy and send local graphs to parent  Timeout - abort the transaction if it waits too long  Communication delays can cause phantom deadlocks [Diagram: local waits-for graphs between T1 and T2 at site A and site B, whose union in the global graph forms a cycle]

 Recovery in a distributed system is more complex  New kinds of failure can occur  Communication failures, and  Failures at remote sites where sub-transactions are executing  To ensure atomicity, either all or no sub-transactions must commit  This property must be guaranteed regardless of site or communication failure  This is achieved using a commit protocol

 During normal execution each site maintains a log  Transactions are logged where they execute  The transaction manager at the originating site is called the coordinator  Transaction managers at sub-transaction sites are referred to as subordinates  The most widely used commit protocol is two-phase commit  The 2PC protocol for normal execution starts when the user commits a transaction

 Coordinator sends prepare messages  Subordinates decide whether to abort or commit  Force-write an abort or prepare log record  Send no or yes messages to coordinator  If the coordinator receives unanimous yes, it force-writes a commit record and sends commit messages  Otherwise, force-writes abort and sends abort messages  Subordinates force-write abort or commit log records and send acknowledge messages to the coordinator  When all acknowledge messages have been received the coordinator writes an end log record
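A compressed sketch of the coordinator's side of this flow (the log list stands in for force-writes to stable storage; message passing is elided):

def coordinator_decide(votes, log):
    # Phase 1: votes holds the yes/no replies to the prepare messages
    decision = "commit" if all(v == "yes" for v in votes) else "abort"
    log.append(decision)       # force-written before any phase 2 message is sent
    # Phase 2: send commit/abort messages; subordinates force-write their own
    # records and reply with acknowledge messages
    log.append("end")          # written only after every acknowledge arrives
    return decision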

 2PC requires two rounds of messages  Voting phase  Termination phase  Any site’s transaction manager can unilaterally abort a transaction  Log records describing decisions are always forced to stable storage before the message is sent  Log records include the record type, transaction ID, and coordinator ID  The coordinator’s commit or abort log record includes the IDs of all subordinates

 If there is a commit or abort log record for transaction, T, but no end record, T must be redone  If the site is a coordinator keep sending commit, or abort messages until all acknowledge messages are received  If there is a prepare log record for T, but no commit or abort record, the site is a subordinate  The coordinator is repeatedly contacted to determine T’s status, until a commit or abort message is received  If there is no prepare log record for T, the transaction is unilaterally aborted  And send an abort message if contacted by a subordinate

 If a coordinator fails, the subordinates are unable to determine whether to commit or abort  The transaction is blocked until the coordinator recovers  What happens if a remote site does not respond during the commit protocol?  If the site is the coordinator the transaction should be aborted  If the site is a subordinate that has not voted yes, it should abort the transaction  If the site is a subordinate that has voted yes, it is blocked until the coordinator responds

 The acknowledge messages are used to tell the coordinator that it can forget a transaction  Until all acknowledge messages are received it must keep T in the transaction table  The coordinator may fail after sending prepare messages, but before writing a commit or abort record  In that case it has no information about the transaction’s status after restarting ▪ So it subsequently aborts the transaction  If another site enquires about T, the recovery process responds with an abort message  If a sub-transaction doesn’t perform updates its commit or abort status is irrelevant

 When a coordinator aborts T, it can undo T and remove it from the transaction table  If there is no information about T, it is presumed to be aborted  Similarly, subordinates do not need to send ack messages on abort  As the coordinator does not have to wait for acks to abort a transaction  Abort log records do not have to be force-written  As the default decision is to abort a transaction

 If a sub-transaction does not perform updates it responds to prepare with a reader message  And writes no log records  If the coordinator receives a reader message it is treated as yes  But no further messages are sent to that subordinate  If all sub-transactions are readers the second phase of the protocol is not required  The transaction can be removed from the transaction table

 In cloud computing a vendor supplies computing resources as a service  A large number of computers are connected through a communication network  Such as the internet …  The client runs applications and stores data using these resources  And can access the resources with little effort

 Web applications have to be highly scalable  Applications may have hundreds of millions of users  Requiring data to be partitioned across thousands of processors  There are a number of systems for data storage on the cloud  Such as Bigtable (from Google)  They do not necessarily guarantee the ACID properties ▪ They drop ACID …

 Many web data storage systems are not built around an SQL data model  Such as NoSQL DBs or Bigtable  Some support semi-structured data  Many web applications manage without extensive query language support  Data storage systems often allow multiple versions of data items to be stored  Versions can be identified by timestamp

 Data is often partitioned using hash or range partitioning  Such partitions are referred to as tablets  This is performed dynamically as required  It is necessary to know which site contains a particular tablet  A tablet controller site tracks the partitioning function ▪ And can map a request to the appropriate site  The mapping information can be replicated to a set of router sites ▪ So that the controller does not act as a bottleneck
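A toy version of the routing step, assuming range partitioning with invented key boundaries:

import bisect

TABLET_BOUNDS = ["g", "p"]                       # key ranges: <"g", "g".."o", >="p"
TABLET_SITES = ["site-1", "site-2", "site-3"]    # one site per tablet

def route(key):
    # The tablet controller (or a router holding a replica of its map)
    # turns a key into the site holding that tablet
    return TABLET_SITES[bisect.bisect_right(TABLET_BOUNDS, key)]

print(route("banana"))   # site-1
print(route("kiwi"))     # site-2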

 A cloud DB introduces a number of challenges to making a DB ACID compliant  Locking  Ensuring transactions are atomic  Frequent communication between sites  In addition there are a number of issues that relate to both DBs and data storage  Replication is controlled by the cloud vendor  Security and legal issues