Published by Kerry Grant. Modified over 9 years ago.
2
Parallel databases: Architecture, Query evaluation, Query optimization. Distributed databases: Architectures, Data storage, Catalog management, Query processing, Transactions.
3
A parallel database system is designed to improve performance through parallelism Loading data, building indexes, evaluating queries Data may be stored in a distributed way, but solely for performance reasons A distributed database system is physically stored across several sites Each site is managed by an independent DBMS Distribution is affected by local ownership, and availability as well as performance
5
How long does it take to scan a 1 terabyte table at 10MB/s?
1,099,511,627,776 bytes = 1,024^4 or 2^40 bytes
10MB = 10,485,760 bytes
1,099,511,627,776 / 10,485,760 = 104,858 seconds
104,858 / (60 * 60 * 24) = 1.2 days!
Using 1,000 processors in parallel the time can be reduced to about 1.75 minutes
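The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: it assumes binary units (1 TB = 2^40 bytes, 10 MB = 10 * 2^20 bytes) and an ideal division of the scan across processors, with no start-up cost or skew. The function name `scan_seconds` is made up for the illustration.

```python
# Figures taken from the slide: 1 TB table, 10 MB/s scan rate, 1,000 processors.
TABLE_BYTES = 1024 ** 4           # 1 TB = 2**40 bytes
SCAN_RATE = 10 * 1024 ** 2        # 10 MB/s, in bytes per second
SECONDS_PER_DAY = 60 * 60 * 24


def scan_seconds(table_bytes: int, rate: int, processors: int = 1) -> float:
    """Ideal scan time, assuming the work divides perfectly across processors."""
    return table_bytes / (rate * processors)


serial = scan_seconds(TABLE_BYTES, SCAN_RATE)
parallel = scan_seconds(TABLE_BYTES, SCAN_RATE, processors=1000)

print(f"serial:   {serial:,.0f} s = {serial / SECONDS_PER_DAY:.2f} days")
print(f"parallel: {parallel:.1f} s = {parallel / 60:.2f} minutes")
```

In practice the parallel figure is a lower bound, since the costs listed later in the deck (start-up, skew, contention, assembling the result) all add to it.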
6
A coarse-grain parallel machine consists of a small number of processors Most current high-end computers A fine-grain parallel machine uses thousands of smaller processors Also referred to as a massively parallel machine
7
Both throughput and response time can be improved by parallelism Throughput – the number of tasks completed in a given time Processing many small tasks in parallel increases throughput Response time – the time it takes to complete a single task Subtasks of large transactions can be performed in parallel reducing response time
8
Speed-up More resources means less time for a given amount of data Scale-up If resources increase in proportion to increase in data size, time is constant [Figures: throughput against degree of parallelism, and response time against degree of parallelism, each with the ideal curve marked]
9
Where possible a parallel database should carry out evaluation steps in parallel There are many opportunities for parallelism in a relational database There are three main parallel DBMS architectures Shared nothing Shared memory Shared disk
10
Shared memory: Multiple CPUs attached to an interconnection network Accessing a common region of main memory Similar to a conventional system Good for moderate parallelism Communication overhead is low OS services control the CPUs Interference increases with size As CPUs are added memory contention becomes a bottleneck Adding more CPUs eventually slows the system down [Diagram: processors (P) connected through an interconnection network to a global shared memory and to disks (D)]
11
Shared disk: Each CPU has private memory and direct access to data Through the interconnection network Good for moderate parallelism Suffers from interference in the interconnection network Which acts as a bottleneck Not a good solution for a large scale parallel system [Diagram: processors (P), each with private memory (M), connected through an interconnection network to shared disks (D)]
12
Shared nothing: Each CPU has local memory and disk space No two CPUs access the same storage area All CPU communication is through the network Increases complexity Linear speed-up ▪ Operation time decreases in proportion to the increase in CPUs Linear scale-up ▪ Performance maintained if CPU increase is proportional to data [Diagram: processors (P), each with its own memory (M) and disk (D), connected by an interconnection network]
13
A relational query execution plan is a tree, or graph, of relational algebra operators Operators in a query tree can be executed in parallel If one operator consumes the output of another, there is pipelined parallelism Otherwise the operators can be evaluated independently An operator blocks if it does not produce any output until it has consumed all its inputs Pipelined parallelism is limited by blocking operators
14
Individual operators can be evaluated in a parallel way by partitioning input data In data-partitioned parallel evaluation the input data is partitioned, and worked on in parallel The results are then combined Tables are horizontally partitioned Different rows are assigned to different processors
15
Partition using a round-robin algorithm Partition using hashing Partition using ranges of field values
16
Partition using a round-robin algorithm Assign record i to processor i mod n ▪ Similar to RAID systems Suitable for evaluating queries that access the entire table Less efficient for queries that access ranges of values and queries on equality
17
Partition using hashing A hash function based on selected attributes is applied to each record to determine its processor The data remains evenly distributed as the table grows, or shrinks over time Good for equality selections Only one disk is used, leaving the others free Also useful for sequential scans where the partitioning attributes are a candidate key
18
Partition using ranges of field values Ranges are chosen from the sort key values, each range should contain the same number of records ▪ Each disk contains one range If a range is too large it can lead to data skew Skew can lead to the processors with large partitions becoming bottlenecks Good for equality selections, and range selections
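The three partitioning schemes can each be written as a function from a record to a processor number. The sketch below is illustrative only: the function names are made up, `zlib.crc32` stands in for whatever hash function a real system would use, and the range boundaries are assumed to be supplied by the caller.

```python
import zlib
from bisect import bisect_right


def round_robin(i: int, n: int) -> int:
    """Record number i goes to processor i mod n."""
    return i % n


def hash_partition(key, n: int) -> int:
    """Processor chosen by hashing the partitioning attribute.

    crc32 is an assumption used here because it is stable between runs.
    """
    return zlib.crc32(repr(key).encode()) % n


def range_partition(key, boundaries: list) -> int:
    """Partition k holds keys with boundaries[k-1] <= key < boundaries[k].

    boundaries must be sorted; n processors need n - 1 boundaries.
    """
    return bisect_right(boundaries, key)


names = ["Adams", "Holmes", "Spade", "Whimsey"]
print([round_robin(i, 3) for i in range(len(names))])
print([hash_partition(name, 3) for name in names])
print([range_partition(name, ["H", "P"]) for name in names])
```

With range partitioning a selection such as name < "H" only needs to visit partition 0, which is why the slide lists it as good for range selections; round-robin would have to visit every processor.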
19
Both hash and range partitioning may result in data skew Where some partitions are larger or smaller Skew can dramatically reduce the speed-up obtained from parallelism In range partitioning skew can be reduced by using histograms The histograms record how many records fall within each range of attribute values and are used to derive even partitions
20
Parallel data streams are used to provide data for relational operators The streams can come from different disks, or Output of other operators Streams are merged or split Merged to provide the inputs for a relational operator Split as needed to parallelize processing These operations can buffer data, and should be able to halt operators that provide their input data A parallel evaluation consists of a network of relational, merge and split operators
21
Inter-query parallelism Different queries or transactions execute in parallel Throughput is increased but response time is not improved Easy to support in a shared-memory system Intra-query parallelism Executing a single query in parallel to speed up large queries Which in turn can entail either intra-operation or inter-operation parallelism, or both
22
Scanning and loading Pages can be read in parallel while scanning a relation The results can be merged If hash or range partitioning is used selections can be directed to the relevant processors Sorting Joins
23
The simplest sort method is for each processor to sort its portion of the table Then merge the sorted records The merging phase may limit the amount of parallelism A better method is to first redistribute the records over the processors using range partitioning Using the sort attributes Each processor sorts its set of records The sets of sorted records are then retrieved in order To make the partitions even, the data in the processors can be sampled
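The second method (sample, redistribute by range, sort locally, read the partitions in order) can be simulated in a single process. In this sketch each inner list stands in for one processor's data; `parallel_sort`, `sample_size` and `seed` are hypothetical names, and a real system would run the local sorts concurrently.

```python
import random
from bisect import bisect_right


def parallel_sort(partitions, sample_size=8, seed=0):
    """Simulate a range-partitioned parallel sort over per-processor lists."""
    n = len(partitions)
    rng = random.Random(seed)

    # Sample each processor's data to pick n - 1 splitting values.
    sample = sorted(
        value
        for part in partitions
        for value in rng.sample(part, min(sample_size, len(part)))
    )
    splitters = [sample[(k * len(sample)) // n] for k in range(1, n)] if sample else []

    # Redistribute every record to the processor that owns its range.
    redistributed = [[] for _ in range(n)]
    for part in partitions:
        for value in part:
            redistributed[bisect_right(splitters, value)].append(value)

    # Each processor sorts its own records (sequential here, parallel in practice).
    sorted_parts = [sorted(part) for part in redistributed]

    # Reading the processors in order yields the fully sorted result.
    return [value for part in sorted_parts for value in part]


print(parallel_sort([[5, 3, 9], [1, 8, 2], [7, 4, 6]]))
```

No merge phase is needed because every value on processor k is no larger than any value on processor k + 1; the sampling step only affects how evenly the work is spread.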
24
Join algorithms can be parallelized Parallelization is most effective for hash or sort-merge joins ▪ Parallel hash join is widely used The process for parallel hash join is First partition the two tables across the processors using the same hash function Join the records locally, using any join algorithm Merge the results of the local joins, the union of these results is the join of the two tables
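The three steps of parallel hash join can be sketched as follows. Rows are modelled as dictionaries, the loop over partitions stands in for the processors, and the local join happens to be a hash join (the slide allows any algorithm). All function names are invented for the example.

```python
from collections import defaultdict


def partition(rows, key, n):
    """Step 1: send each row to the processor chosen by hashing its join value."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


def local_hash_join(r_part, s_part, r_key, s_key):
    """Step 2: join one processor's share of R and S."""
    table = defaultdict(list)
    for r in r_part:
        table[r[r_key]].append(r)
    # If R and S share a column name, the S value wins in the merged row.
    return [{**r, **s} for s in s_part for r in table.get(s[s_key], [])]


def parallel_hash_join(r, s, r_key, s_key, n=4):
    r_parts = partition(r, r_key, n)
    s_parts = partition(s, s_key, n)
    result = []
    for p in range(n):  # each iteration stands in for one processor
        result.extend(local_hash_join(r_parts[p], s_parts[p], r_key, s_key))
    return result       # Step 3: the union of the local results


emp = [{"emp_id": 1, "dept": 10}, {"emp_id": 2, "dept": 20}, {"emp_id": 3, "dept": 10}]
dept = [{"dept_id": 10, "name": "A"}, {"dept_id": 30, "name": "C"}]
print(parallel_hash_join(emp, dept, "dept", "dept_id"))
```

Because both tables use the same hash function, rows with equal join values always land on the same processor, so no matching pair is missed.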
25
If tables are very large, parallel hash join may have a high cost at each processor If each partition is large, multiple passes will be required for the local joins An alternative approach is to use all processors for each partition Partition the tables using h1 ▪ Each partition of the smaller relation should fit into the combined memory of the processors Process each partition using all processors ▪ Use h2 to determine which processor to send records to
26
Partitioning is not suitable for joins on inequalities Such as R ⋈_{R.a < S.b} S Since all records in R could join with a record in S Fragment and replicate joins can be used In asymmetric fragment and replicate join ▪ One of the relations is partitioned ▪ The other relation is replicated across all partitions
27
Each relation can be both fragmented and replicated Into m fragments of R and n of S However m * n processors are required This works with any join condition When partitioning is not possible [Diagram: an m by n grid of processors P0,0 to Pm-1,n-1; processor Pi,j receives fragment Ri of R and fragment Sj of S]
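A small simulation shows why fragment and replicate works for any join condition: every pair of fragments meets at exactly one processor, so every pair of records is compared once. The sketch assumes a simple striped fragmentation and uses hypothetical function names; the nested loops stand in for the m * n processors.

```python
def fragments(rows, count):
    """Split rows into `count` fragments (striped, purely for illustration)."""
    return [rows[i::count] for i in range(count)]


def fragment_and_replicate_join(r, s, predicate, m=2, n=2):
    r_frags = fragments(r, m)
    s_frags = fragments(s, n)
    result = []
    for i in range(m):
        for j in range(n):
            # Processor P(i, j) holds fragment R_i and fragment S_j.
            result.extend(
                (a, b) for a in r_frags[i] for b in s_frags[j] if predicate(a, b)
            )
    return result


# An inequality join, R.a < S.b, on plain integers.
pairs = fragment_and_replicate_join([1, 5, 9], [4, 6], lambda a, b: a < b)
print(sorted(pairs))
```

The asymmetric variant on the previous slide corresponds to n = 1: R is fragmented and the whole of S is replicated to every processor.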
28
Selection – the table may already be partitioned on the selection attribute If not, it can be scanned in parallel Duplicate elimination – use parallel sorting Projection – can be performed by scanning Aggregation – partition by the grouping attribute If records do have to be transferred between processors it may be possible to just send partial results The final result can then be calculated from the partial results ▪ e.g. sum
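Sending partial results rather than records can be sketched for a grouped sum. Each processor reduces its own rows to one (sum, count) pair per group and only those pairs are transferred. Carrying the count as well, so that an average can be derived, is an addition made for this example; the slide itself only mentions sum.

```python
def local_partial(rows):
    """Per-processor partial result: group -> (sum, count)."""
    partial = {}
    for group, value in rows:
        total, count = partial.get(group, (0, 0))
        partial[group] = (total + value, count + 1)
    return partial


def combine(partials):
    """Final result computed from the partial results alone."""
    final = {}
    for partial in partials:
        for group, (total, count) in partial.items():
            t, c = final.get(group, (0, 0))
            final[group] = (t + total, c + count)
    return final


# (city, age) pairs held on two hypothetical nodes.
node1 = [("Surrey", 51), ("Boston", 29)]
node2 = [("Surrey", 35), ("Chicago", 43)]

totals = combine([local_partial(node1), local_partial(node2)])
averages = {city: total / count for city, (total, count) in totals.items()}
print(totals)
print(averages)
```

The amount of data moved depends on the number of groups per processor rather than the number of records, which is where the saving comes from.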
29
Using parallel processors reduces the time to perform an operation Possibly to as little as 1/n * original cost ▪ Where n is the number of processors However there are also additional costs Start-up costs for initiating the operation Skew which may reduce the speed-up Contention for resources resulting in delays Cost of assembling the final result
30
As well as parallelizing individual operators, different operators can be processed in parallel Different processors perform different operations Result of one operator can be pipelined into another Note that sorting and the hash-join partitioning block pipelines Multiple independent operations can be executed concurrently Using bushy, rather than left-deep, join trees
31
The best serial plan may not be the best parallel plan Also note that parallelization introduces further complexity into query optimization Consider a table partitioned into two nodes, with a local secondary index Node 1 contains names between A and M Node 2 contains names between N and Z Consider the selection: name < "Noober" Node 1 should scan its partition, but Node 2 should use the name index
32
In a large-scale parallel system the chances of failure increase Such systems should be designed to operate even if a processor or disk fails Data can be replicated across multiple processors Failed processors or disks are tracked And requests are re-routed to the backup
33
Architecture Shared-memory is easy, but costly and does not scale well Shared-nothing is cheap and scales well, but is harder to implement Both intra-operation, and inter-operation parallelism are possible Most relational algebra operations can be performed in parallel How the data is partitioned across processors is very important
35
A distributed database is motivated by a number of factors Increased availability ▪ If a site containing a table goes down, the table may still be available if a copy is maintained at another site Distributed access to data ▪ An organization may have branches in several cities ▪ Access patterns are typically affected by locality Analysis of distributed data Distributed systems must support integrated access
36
Data is stored at several sites Each site is managed by an independent DBMS The system should make the fact that data is distributed transparent to the user Distributed Data Independence Users should not need to know where the data is located Queries that access several sites should be optimized Distributed Transaction Atomicity Users should be able to write transactions that access several sites, in the same way as local transactions
37
Users may have to be aware of where data is located Distributed data independence and distributed transaction atomicity may not be supported These properties may be hard to support efficiently ▪ Sites may be connected by a slow long-distance network Consider a global system Administrative overheads for viewing data as a single unified collection may be prohibitively expensive
38
Distributed and shared-nothing parallel systems appear similar In practice these are often very different since distributed DBs are typically Geographically separated Separately administered Have slower interconnections May have both local and global transactions
39
Homogeneous Data is distributed but every site runs the same DBMS software Heterogeneous, or multidatabase Different sites run different DBMSs, and the sites are connected to enable access to data Require standards for gateway protocols A gateway protocol is an API that allows external applications access to the database ▪ e.g. ODBC and JDBC Gateways add a layer of processing, and may not be able to entirely mask differences between servers
40
Client-Server Collaborating Server Middleware
41
One or more client processes and one or more server processes A client process sends a query to any one server process Clients are responsible for UI Servers manage data and execute transactions A popular architecture Relatively simple to implement Servers do not have to deal with user-interactions Users can run a GUI on clients Communication between client and server should be as set-oriented as possible e.g. stored procedures vs. cursors
42
Client-server systems do not allow a single query to access multiple servers as this would require Breaking the query into sub-queries to be executed at different sites and merging the answers to the sub-queries To do this the client would have to be overly complex In a collaborating server system the distinction between clients and servers is eliminated A collection of DB servers, each able to run transactions against local data When a query is received that requires data from other servers the server generates appropriate sub-queries
43
Designed to allow a single query to access multiple servers, but Without requiring all servers to be capable of managing multi-site query execution Often used to integrate legacy systems Requires one database server (the middleware) capable of managing multi-server queries Other servers only handle local queries and transactions The special server coordinates queries and transactions The middleware server typically doesn’t maintain any data
45
In a distributed system tables are stored across several sites Accessing a table stored elsewhere incurs message-passing costs A single table may be replicated or fragmented across several sites Fragments are stored at the sites where they are most often accessed Several replicas of a table may be stored at different sites Fragmentation and replication can be combined
46
Fragmentation consists of breaking a table into smaller tables, or fragments The fragments are stored instead of the original table Possibly at different sites Fragmentation can either be vertical or horizontal

TID  empID  fName     lName    age  city
1    111    Sam       Spade    43   Chicago
2    222    Peter     Whimsey  51   Surrey
3    333    Sherlock  Holmes   35   Surrey
4    444    Anita     Blake    29   Boston

[Diagram: the table split by rows (horizontal fragments) and by columns (vertical fragments)]
47
Records that belong to a horizontal fragment are usually identified by a selection query e.g. all the records that relate to a particular city, achieving locality, reducing communication costs A horizontally fragmented table can be recreated by computing the union of the fragments ▪ Fragments are usually required to be disjoint Records belonging to a vertical fragment are identified by a projection query The collection of vertical fragments must be a lossless-join decomposition A unique tuple ID is often assigned to records
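Both kinds of fragmentation, and the way the original table is recovered from each, can be sketched with the four example employee records. The values are as read from the flattened example table, so the split between TID and empID is an assumption, and the helper names are invented for the illustration.

```python
EMPLOYEES = [
    {"tid": 1, "empID": 111, "fName": "Sam", "lName": "Spade", "age": 43, "city": "Chicago"},
    {"tid": 2, "empID": 222, "fName": "Peter", "lName": "Whimsey", "age": 51, "city": "Surrey"},
    {"tid": 3, "empID": 333, "fName": "Sherlock", "lName": "Holmes", "age": 35, "city": "Surrey"},
    {"tid": 4, "empID": 444, "fName": "Anita", "lName": "Blake", "age": 29, "city": "Boston"},
]


def horizontal_fragment(rows, predicate):
    """A horizontal fragment is defined by a selection."""
    return [row for row in rows if predicate(row)]


def vertical_fragment(rows, columns):
    """A vertical fragment is a projection; the tuple ID is kept in each one."""
    return [{"tid": row["tid"], **{c: row[c] for c in columns}} for row in rows]


def rejoin(left, right):
    """Rebuild rows from two vertical fragments by joining on the tuple ID."""
    right_by_tid = {row["tid"]: row for row in right}
    return [{**row, **right_by_tid[row["tid"]]} for row in left]


# Horizontal: disjoint fragments whose union is the original table.
surrey = horizontal_fragment(EMPLOYEES, lambda r: r["city"] == "Surrey")
elsewhere = horizontal_fragment(EMPLOYEES, lambda r: r["city"] != "Surrey")
rebuilt_h = sorted(surrey + elsewhere, key=lambda r: r["tid"])

# Vertical: a lossless-join decomposition thanks to the shared tuple ID.
names = vertical_fragment(EMPLOYEES, ["empID", "fName", "lName"])
details = vertical_fragment(EMPLOYEES, ["age", "city"])
rebuilt_v = rejoin(names, details)

print(rebuilt_h == EMPLOYEES, rebuilt_v == EMPLOYEES)
```

Without the tuple ID in every vertical fragment there would be no column to join on, which is why the slide notes that a unique tuple ID is often assigned.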
48
Replication entails storing several copies of a table or of table fragments for Increased availability of data, which protects against ▪ Failure of individual sites, and ▪ Failure of communication links Faster query evaluation ▪ Queries can execute faster by using a local copy of a table There are two kinds of replication, synchronous, and asynchronous These differ in how replicas are kept current when the table is modified
49
Distributing data across sites adds complexity It is important to track where replicated or fragmented tables are stored Each replica or fragment must be uniquely named Naming should be performed locally A global relation name consists of {birth site, local name } ▪ The birth site is the site where the table was created A site catalog records fragments and replicas at a site, and tracks replicas of tables created at the site To locate a table, look up its birth site catalog The birth site never changes, even if the table is moved
51
Estimating the cost of an evaluation plan must include communication costs Evaluate the number of page reads or writes, and The number of pages that must be sent from one site to another Pages may need to be shipped between a number of sites Sites where the data is located, and where the result is computed, and The site that initiated the query
52
Simple, one table, queries are affected by fragmentation and replication If a table is horizontally fragmented a query has to be evaluated at multiple sites And the union of the result computed Selections that only require data at one site can be executed just at that site If a table is vertically fragmented the fragments have to be joined on the common attribute If a table is replicated, the shipping costs have to be considered to determine which site to use
53
Joins of tables at different sites can be very expensive There are a number of strategies for computing joins Fetch as needed Ship to one site Semijoins and Bloomjoins
54
Designate one table as the outer relation, and compute the join at that site Fetch records of the inner relation as needed; the cost depends on The size of the relations Whether the inner relation is cached at the outer relation's site ▪ If not, communication costs are incurred once for each time the inner relation is read The size of the result relation If the size of the result (R ⋈ S) is greater than R + S it is cheaper to ship both relations to the query site
55
In this strategy, relations are shipped to a site and the join carried out at that site The site can be one of the sites involved in the join The result has to be shipped from where it was computed to the site where the query was posed Alternatively both input relations can be shipped to the site where the query was originally posed The join is then computed at that site
56
Consider a join between two relations, R and S at different sites, London and Vancouver Assume that S (the inner relation) is to be shipped to London where the join will be computed Note that some S records may not join to R records Shipping costs can be reduced by only shipping those S records that will actually join to R records There are two techniques that can reduce the number of S records to be shipped Semi-joins, and Bloom-joins
57
At the first site (London) compute the projection of R on the join columns a, π_a(R) Ship this relation to site 2 (Vancouver) At Vancouver compute the join of π_a(R) and S The result of this join is the reduction of S with respect to R Ship the reduction of S to London At London compute the join of the reduction of S, and R The effectiveness of this technique depends on how much smaller the reduction of S is compared to S
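The semi-join steps can be traced with in-memory tables. This is a sketch under simplifying assumptions: a single join column named `a` in both relations, rows as dictionaries, and ordinary function calls standing in for shipping data between London and Vancouver. The function names are hypothetical.

```python
def project(r, column):
    """London: the distinct join-column values of R, i.e. the projection to ship."""
    return {row[column] for row in r}


def reduce_s(s, column, shipped_values):
    """Vancouver: the reduction of S, keeping only rows that will join with R."""
    return [row for row in s if row[column] in shipped_values]


def join(r, s, column):
    """London: an ordinary equi-join on the shared column."""
    s_by_key = {}
    for row in s:
        s_by_key.setdefault(row[column], []).append(row)
    return [{**a, **b} for a in r for b in s_by_key.get(a[column], [])]


R = [{"a": 1, "x": "r1"}, {"a": 2, "x": "r2"}]
S = [{"a": 2, "y": "s2"}, {"a": 3, "y": "s3"}, {"a": 2, "y": "s2b"}]

reduction = reduce_s(S, "a", project(R, "a"))   # only 2 of the 3 S rows are shipped
result = join(R, reduction, "a")
print(len(reduction), result)
```

The result is identical to joining R with the whole of S; the saving is the difference between shipping S and shipping the projection plus the reduction.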
58
Bloom-joins are similar to semi-joins, except that a bit vector is sent to the second site The vector is size k and each record in R is hashed to it ▪ A bit is set to 1 if a record hashes to it ▪ The hash function is on the join attribute The reduction of S is then computed in step 2 By hashing records of S to the bit vector Only those records that hash to a bit with the value of 1 are included in the reduction The cost to send the bit vector is less than the cost to send the projection (of the join attribute on R) But some unwanted records of S may be in the reduction
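A Bloom-join reduction can be sketched in the same style. The bit vector has k positions and a single hash function, as the slide describes; `zlib.crc32` is an assumption chosen only because it is deterministic, and the function names are invented.

```python
import zlib


def bit_position(value, k):
    """Hash a join-attribute value to one of k bit positions (crc32 is an assumption)."""
    return zlib.crc32(repr(value).encode()) % k


def build_bit_vector(r, column, k):
    """First site: set the bit that each R record's join value hashes to."""
    bits = [0] * k
    for row in r:
        bits[bit_position(row[column], k)] = 1
    return bits


def bloom_reduce(s, column, bits):
    """Second site: keep S records whose join value hashes to a set bit."""
    k = len(bits)
    return [row for row in s if bits[bit_position(row[column], k)]]


R = [{"a": 1}, {"a": 2}, {"a": 3}]
S = [{"a": 2}, {"a": 3}, {"a": 4}, {"a": 5}]

bits = build_bit_vector(R, "a", k=16)
reduction = bloom_reduce(S, "a", bits)
print(bits, reduction)
```

Every S record that really joins is guaranteed to be in the reduction, but a non-matching record can slip through when its value collides with a set bit; the final join at the first site discards those. A larger k lowers the collision rate at the cost of a bigger vector to ship.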
59
The basic cost based approach is to consider a set of plans and pick the cheapest Communication costs must be considered Local autonomy must be respected Some operations can be carried out in parallel The query site generates a global plan with suggested local plans Local sites are allowed to change their suggested plans if they can improve them
61
If data is distributed it should be transparent to users Users should be able to ask queries without having to worry where tables are stored Transactions should be atomic actions, regardless of data fragmentation or replication If so, all copies of a replicated relation must be modified before the transaction commits Referred to as synchronous replication Another approach, asynchronous replication, allows copies of a relation to differ More efficient, but compromises data independence
62
There are two techniques for ensuring that a transaction sees the same values Regardless of which copy of an object it accesses In voting, a transaction must write a majority of copies to modify an object, and Must read enough copies to ensure that it sees at least one most recent copy e.g. 10 copies of an object, at least 6 copies must be written, and at least 5 read Note that the copies include a version number so that it is possible to tell which copy is the latest
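The voting rule can be checked with a small simulation of the slide's example (10 copies, 6 written, 5 read). Each copy is modelled as a (version, value) pair; the function names are hypothetical and failures, locking and messaging are ignored.

```python
from itertools import combinations


def valid_quorums(copies, write_q, read_q):
    """Writes must reach a majority, and every read set must overlap every write set."""
    return 2 * write_q > copies and write_q + read_q > copies


def quorum_write(store, indexes, value):
    """Write to the chosen copies with a version number above any seen there."""
    version = max(store[i][0] for i in indexes) + 1
    for i in indexes:
        store[i] = (version, value)


def quorum_read(store, indexes):
    """Read the chosen copies and return the value with the highest version."""
    return max((store[i] for i in indexes), key=lambda copy: copy[0])[1]


store = [(0, "old")] * 10
quorum_write(store, range(6), "new")          # 6 of the 10 copies are written

# Any 5 copies must include at least one of the 6 that were written.
always_current = all(
    quorum_read(store, read_set) == "new" for read_set in combinations(range(10), 5)
)
print(valid_quorums(10, 6, 5), always_current)
```

Reading only 4 copies would not be enough: a reader could pick exactly the 4 copies that were not written and see the old value.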
63
Voting is not a generally efficient technique Reading an object requires that multiple copies of the object must be read Typically, objects are read more than they are written The read-any write-all policy allows any single copy to be read, but All copies must be written when an object is written Writes are slower, relative to voting, but Reads are fast, particularly if a local copy is available Read-any write-all is usually used for synchronous replication
64
Synchronous replication is expensive Before an update transaction is committed it must obtain X locks on all copies of the data This may entail sending lock requests to remote sites and waiting for the locks to be confirmed While holding its other locks If sites, or the communication links fail, the transaction cannot commit until they are back up Committing the transaction requires sending multiple messages as part of a commit protocol An alternative is to use asynchronous replication
65
A transaction is allowed to commit before all the copies have been changed Readers still only look at a single copy Users must be aware of which copy they are reading, and that copies may be out of sync There are two approaches to asynchronous replication Peer-to-peer, and Primary site
66
More than one copy can be designated as updatable Changes to the master(s) must be propagated to other copies If two masters are changed a conflict resolution strategy must be used Peer-to-peer replication is best used when conflicts do not arise Where each master site owns a disjoint fragment ▪ Usually a horizontal fragment Update rights are only held by one master at a time ▪ A backup site may gain update rights if the main site fails
67
One copy of a table is designated as the primary or master copy Users register or publish the primary copies Other sites subscribe to the table (or fragments of it), by creating secondary copies Secondary copies cannot be directly updated Changes to the primary copy must be propagated to the secondary copies First, capture changes made by committed transactions Apply the changes to secondary copies
68
Log-based capture creates an update record from the recovery log when it is written to stable storage Log changes that affect replicated tables are written to a change data table (CDT) Note that aborted transactions must, at some point, be removed from the CDT Another approach is to use procedural capture A trigger invokes a procedure which takes a snapshot of the primary copy Log-based capture is cheaper and has less delay, but relies on proprietary log details
69
The apply step takes the capture step changes and propagates them to secondary copies This can be continuously pushed from the master whenever a CDT is generated, or Periodically requested (or pulled) by the copies ▪ A timer or application controls the frequency of the requests Log-based capture with continuous apply minimizes delay A cheaper substitute for synchronous replication Procedural capture and application driven apply gives the most flexibility
70
Complex decision support queries that require data from multiple sites are popular To improve query efficiency, all the data can be copied to one site, which is then queried These data collections are called data warehouses Warehouses use asynchronous replication The source data is typically controlled by different DBMSs Source data often has to be cleaned when creating the replicas Procedural capture and application apply is best used for this environment
71
Transactions may be submitted at one site but can access data at other sites The transaction manager breaks the transaction into sub-transactions that execute at different sites The sub-transactions are submitted to the other sites The transaction manager at the initial site must coordinate the activity of the sub-transactions Distributed concurrency control Locks and deadlocks must be managed across sites Distributed recovery Transaction atomicity must be ensured across sites
72
In centralized locking, a single site is in charge of handling lock and unlock requests This is vulnerable to single site failure and bottlenecks In primary copy locking, all locking is done at the primary copy site for an object Reading a copy of an object usually requires communication with two sites In fully distributed locking, lock requests are handled by the lock manager at the local site X locks must be set at all sites when copies are modified S locks are only set at the local site There are other protocols for locking replicated data
73
If deadlock detection is being used (rather than prevention) the scheme must be modified Centralized - send all local waits-for graphs to a central site Hierarchical - organize sites into a hierarchy and send local graphs to parent Timeout - abort the transaction if it waits too long Communication delays can cause phantom deadlocks [Diagram: waits-for graphs for T1 and T2 at site A, at site B, and the combined global graph]
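The centralized scheme amounts to taking the union of the local waits-for graphs and looking for a cycle. The sketch below assumes a scenario in which T1 waits for T2 at site A and T2 waits for T1 at site B; this is an illustrative guess at the kind of situation the slide's diagram depicts, and the function names are made up.

```python
def merge_graphs(local_graphs):
    """Central site: union of the waits-for edges reported by each site."""
    merged = {}
    for graph in local_graphs:
        for waiter, holders in graph.items():
            merged.setdefault(waiter, set()).update(holders)
    return merged


def has_cycle(graph):
    """Depth-first search; reaching a node that is still on the stack means deadlock."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def visit(node):
        colour[node] = GREY
        for nxt in graph.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY or (state == WHITE and visit(nxt)):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(node, WHITE) == WHITE and visit(node) for node in list(graph))


site_a = {"T1": {"T2"}}   # at site A, T1 waits for a lock held by T2
site_b = {"T2": {"T1"}}   # at site B, T2 waits for a lock held by T1

print(has_cycle(site_a), has_cycle(site_b), has_cycle(merge_graphs([site_a, site_b])))
```

Neither site sees a cycle on its own; only the combined graph does. A phantom deadlock arises when the central site merges graphs that are out of date, for example when one of these waits has already ended by the time the other edge arrives.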
74
Recovery in a distributed system is more complex New kinds of failure can occur Communication failures, and Failures at remote sites where sub-transactions are executing To ensure atomicity, either all or no sub-transactions must commit This property must be guaranteed regardless of site or communication failure This is achieved using a commit protocol
75
During normal execution each site maintains a log Transactions are logged where they execute The transaction manager at the originating site is called the coordinator Transaction managers at sub-transaction sites are referred to as subordinates The most widely used commit protocol is two-phase commit The 2PC protocol for normal execution starts when the user commits a transaction
76
Coordinator sends prepare messages Subordinates decide whether to abort or commit Force-write an abort or prepare log record Send no or yes messages to coordinator If the coordinator receives unanimous yes, it force-writes commit record and sends commit messages Otherwise, force-writes abort and sends abort messages Subordinates force-write abort or commit log records and send acknowledge messages to the coordinator When all acknowledge messages have been received the coordinator writes an end log record
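The message flow for normal execution can be traced with a toy simulation. It is a sketch only: Python lists stand in for force-written logs, method calls stand in for messages, no failures or timeouts are modelled, and the class and method names are invented for the example.

```python
class Subordinate:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.log = []                     # stands in for the force-written log

    def prepare(self):
        """Voting phase: log prepare and vote yes, or log abort and vote no."""
        if self.can_commit:
            self.log.append("prepare")
            return "yes"
        self.log.append("abort")
        return "no"

    def decide(self, decision):
        """Termination phase: log the coordinator's decision and acknowledge."""
        if self.log[-1] != decision:      # a no-voter has already logged abort
            self.log.append(decision)
        return "ack"


class Coordinator:
    def __init__(self, subordinates):
        self.subordinates = subordinates
        self.log = []

    def run(self):
        votes = [sub.prepare() for sub in self.subordinates]
        decision = "commit" if all(vote == "yes" for vote in votes) else "abort"
        self.log.append(decision)         # logged before any message is sent
        acks = [sub.decide(decision) for sub in self.subordinates]
        if len(acks) == len(self.subordinates):
            self.log.append("end")        # all acknowledgements received
        return decision


subs = [Subordinate("site B"), Subordinate("site C", can_commit=False)]
coordinator = Coordinator(subs)
print(coordinator.run(), coordinator.log, [sub.log for sub in subs])
```

A single no vote is enough to abort the whole transaction, which is how the protocol guarantees that either all sub-transactions commit or none do.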
77
2PC requires two rounds of messages Voting phase Termination phase Any site’s transaction manager can unilaterally abort a transaction Log records describing decisions are always forced to stable storage before the message is sent Log records include the record type, transaction ID, and coordinator ID The coordinator’s commit or abort log record includes the IDs of all subordinates
78
If there is a commit or abort log record for transaction, T, but no end record, T must be redone If the site is a coordinator keep sending commit, or abort messages until all acknowledge messages are received If there is a prepare log record for T, but not commit or abort the site is a subordinate The coordinator is repeatedly contacted to determine T’s status, until a commit or abort message is received If there is no prepare log record for T, the transaction is unilaterally aborted And send an abort message if contacted by a subordinate
79
If a coordinator fails, the subordinates are unable to determine whether to commit or abort The transaction is blocked until the coordinator recovers What happens if a remote site does not respond during the commit protocol? If the site is the coordinator the transaction should be aborted If the site is a subordinate that has not voted yes, it should abort the transaction If the site is a subordinate that has voted yes, it is blocked until the coordinator responds
80
The acknowledge messages are used to tell the coordinator that it can forget a transaction Until all acknowledge messages are received it must keep T in the transaction table The coordinator may fail after prepare messages, but before commit or abort It therefore has no information about the transaction’s status before the crash ▪ So it subsequently aborts the transaction If another site enquires about T, the recovery process responds with an abort message If a sub-transaction doesn’t perform updates its commit or abort status is irrelevant
81
When a coordinator aborts T, it can undo T and remove it from the transaction table If there is no information about T, it is presumed to be aborted Similarly, subordinates do not need to send ack messages on abort As the coordinator does not have to wait for acks to abort a transaction Abort log records do not have to be force-written As the default decision is to abort a transaction
82
If a sub-transaction does not perform updates it responds to prepare with a reader message And writes no log records If the coordinator receives a reader message it is treated as yes But no further messages are sent to that subordinate If all sub-transactions are readers the second phase of the protocol is not required The transaction can be removed from the transaction table
84
In cloud computing a vendor supplies computing resources as a service A large number of computers are connected through a communication network Such as the internet … The client runs applications and stores data using these resources And can access the resources with little effort
85
Web applications have to be highly scalable Applications may have hundreds of millions of users Requiring data to be partitioned across thousands of processors There are a number of systems for data storage on the cloud Such as Bigtable (from Google) They do not necessarily guarantee the ACID properties ▪ They drop ACID …
86
Many web data storage systems are not built around an SQL data model Such as NoSQL DBs or Bigtable Some support semi-structured data Many web applications manage without extensive query language support Data storage systems often allow multiple versions of data items to be stored Versions can be identified by timestamp
87
Data is often partitioned using hash or range partitioning Such partitions are referred to as tablets This is performed dynamically as required It is necessary to know which site contains a particular tablet A tablet controller site tracks the partitioning function ▪ And can map a request to the appropriate site The mapping information can be replicated to a set of router sites ▪ So that the controller does not act as a bottleneck
88
A cloud DB introduces a number of challenges to making a DB ACID compliant Locking Ensuring transactions are atomic Frequent communication between sites In addition there are a number of issues that relate to both DBs and data storage Replication is controlled by the cloud vendor Security and legal issues