
1 Distributed Database: Part 2

2 Distributed DBMS
A distributed database requires a distributed DBMS.
Functions of a distributed DBMS:
–Locate data with a distributed data dictionary
–Determine the location from which to retrieve data and process query components
–Translate between nodes with different local DBMSs
–Data management functions: security, concurrency, deadlock control, query optimization, failure recovery
–Provide consistency among copies of data across the remote sites
–Global primary key control
–Scalability
–Data and stored procedure replication
–Allow for different DBMSs and application code at different nodes

3 Distributed DBMS architecture

4 Local & Global Transaction  LOCAL TRANSACTION  a transaction that requires reference only to data that are stores at the site where the transaction originates  GLOBAL TRANSACTON:  A transaction that requires reference to data at one or more non-local sites to satisfy the request.

5 Local Transaction Steps
1. Application makes a request to the distributed DBMS
2. Distributed DBMS checks the distributed data repository for the location of the data; finds that it is local
3. Distributed DBMS sends the request to the local DBMS
4. Local DBMS processes the request
5. Local DBMS sends the results to the application

6 Distributed DBMS Architecture: Local Transaction
(figure: local transaction – all data stored locally; steps 1–5 shown on the diagram)

7 Global Transaction Steps
1. Application makes a request to the distributed DBMS
2. Distributed DBMS checks the distributed data repository for the location of the data; finds that it is remote
3. Distributed DBMS routes the request to the remote site
4. Distributed DBMS at the remote site translates the request for its local DBMS if necessary, and sends the request to the local DBMS
5. Local DBMS at the remote site processes the request
6. Local DBMS at the remote site sends the results to the distributed DBMS at the remote site
7. Remote distributed DBMS sends the results back to the originating site
8. Distributed DBMS at the originating site sends the results to the application
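
To make the local/global branch concrete, here is a minimal Python sketch of the routing decision in steps 1–3 above. The table-to-site mapping and the site names are invented for illustration; a real distributed DBMS consults a synchronized data dictionary, not an in-memory dict.

# Minimal sketch of the routing decision (steps 1-3 above).
# The table-to-site mapping and site names are hypothetical.
DATA_DICTIONARY = {
    "CUSTOMER": "san_mateo",
    "ORDER": "tulsa",
    "PART": "new_york",
}

def route_request(table, originating_site):
    """Decide whether a request is a local or a global transaction."""
    owning_site = DATA_DICTIONARY[table]   # step 2: data dictionary lookup
    if owning_site == originating_site:
        return f"local: send request to the DBMS at {originating_site}"
    return f"global: route request to the remote site {owning_site}"

print(route_request("CUSTOMER", "san_mateo"))  # local transaction
print(route_request("PART", "san_mateo"))      # global transaction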

8 Distributed DBMS Architecture: Global Transaction
(figure: global transaction – some data is at remote site(s); steps 1–8 shown on the diagram)

9 Distributed DBMS Transparency Objectives
–Location Transparency
–Replication Transparency
–Failure Transparency
–Concurrency Transparency

10 Location Transparency
User/application does not need to know where data resides.
To achieve location transparency, the distributed DBMS must have access to an accurate and current data dictionary/directory that indicates the location(s) of all data in the network.
Directories must be synchronized: each copy of the directory reflects the same information concerning the location of data.

11 Location Transparency
San Mateo: list all company customers whose total purchases exceed 100,000.
SELECT * FROM CUSTOMER WHERE TOTAL_SALES > 100000;
(figure: sites at San Mateo, California and Tulsa, Oklahoma)

12 Location Transparency
Tulsa: list all orange-colored parts (regardless of location).
SELECT DISTINCT PART_NUMBER, PART_NAME FROM PART WHERE COLOR = 'Orange' ORDER BY PART_NUMBER;
(figure: sites at San Mateo, California and Tulsa, Oklahoma)

13 Replication Transparency
Sometimes called fragmentation transparency.
User/application does not need to know about duplication.

14 Replication Transparency
An identical copy of the Standard Price List is maintained at all three nodes.
Reading the part list: the distributed DBMS consults the data dictionary and determines that this is a local transaction. The user need not be aware that the same data are stored at other sites.
(figure: sites at San Mateo, California and Tulsa, Oklahoma)

15 Failure Transparency
Either all or none of the actions of a transaction are committed.
Each site has a Transaction Manager (TM) that:
–Logs transactions and before- and after-images
–Runs a concurrency control scheme to ensure data integrity
For a global transaction, the TMs at each participating site cooperate to ensure that all update operations are synchronized; if they are not, data integrity can be lost when a failure happens.

16 Failure Transparency
New York: change the price of a part in the Standard Price List file.
This is a global transaction: every copy of the record for that part must be updated.
The price list records in New York and Tulsa are successfully updated, but a transmission failure occurs and the price list record in San Mateo is not updated.
Failure transparency requires that either all the actions of a transaction are committed or none of them are.
(figure: sites at San Mateo, California and Tulsa, Oklahoma)

17 Failure Transparency
To ensure data integrity for real-time, distributed update operations, the cooperating TMs execute a commit protocol:
–An algorithm to ensure that a transaction is either successfully completed or aborted
The most widely used commit protocol is two-phase commit:
–An algorithm for coordinating updates in a distributed database
–Ensures that concurrent transactions at multiple sites are processed as though they were executed in the same, serial order at all sites
–Something like arranging a meeting between many people

18 Failure Transparency
The site originating the global transaction, or an overall coordinating site, sends a request to each of the sites that will process some portion of the transaction.
Each site processes its subtransaction but does not immediately commit/store the result to the local database; the result is stored in a temporary file.
Each site locks its portion of the database being updated.
Each site notifies the originating site when it has completed its subtransaction.
When all sites have responded, the originating site initiates the two-phase commit protocol (see the sketch after slide 20).

19 Two-Phase Commit
Prepare phase:
–The coordinator receives a commit request
–The coordinator instructs all resource managers to get ready to "go either way" on the transaction; each resource manager writes all updates from that transaction to its own physical log
–The coordinator receives replies from all resource managers; if all are OK, it writes commit to its own log, otherwise it writes rollback to its log

20 Two-Phase Commit
Commit phase:
–The coordinator informs each resource manager of its decision and broadcasts a message to either commit or rollback (abort); if the message is commit, each resource manager transfers the updates from its log to its database
–A failure during the commit phase puts a transaction "in limbo"
–A limbo transaction can be identified by a timeout or by polling
–Timeout: no confirmation of the commit for a specified time period; not possible to distinguish between a busy and a failed site
–Polling: expensive in terms of network load and processing time
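
A hedged Python sketch of the decision logic in the two phases above. The Participant class is only a stand-in for a resource manager (a real one would write log records and hold locks), so this shows just the vote-collection and broadcast structure of the protocol.

# Sketch of two-phase commit: the prepare phase collects votes and the
# commit phase broadcasts the coordinator's decision. Participants here
# are stand-ins for real resource managers.
class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit

    def prepare(self):
        # Phase 1: write the transaction's updates to the local log,
        # then vote yes or no.
        return self.will_commit

    def commit(self):
        print(f"{self.name}: transferring updates from log to database")

    def rollback(self):
        print(f"{self.name}: discarding logged updates")

def two_phase_commit(participants):
    votes = [p.prepare() for p in participants]        # prepare phase
    decision = "commit" if all(votes) else "rollback"
    for p in participants:                             # commit phase
        p.commit() if decision == "commit" else p.rollback()
    return decision

sites = [Participant("new_york"), Participant("tulsa"),
         Participant("san_mateo", will_commit=False)]
print(two_phase_commit(sites))  # one "no" vote forces rollback at every site

A single negative vote aborts the transaction everywhere, which is exactly the all-or-none property that failure transparency requires.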

21 Concurrency Control
Design goal for a distributed database: although a distributed system runs many transactions, it appears to a given transaction that it is the only activity in the system.
The TMs at each site must cooperate to provide concurrency control in a distributed database.
Three basic approaches may be used:
–Locking
–Versioning
–Time stamping

22 Concurrency Control
Time stamping:
–A concurrency control mechanism that assigns a globally unique time stamp to each transaction
–An alternative to locks in distributed databases: it ensures that transactions are processed in serial order while avoiding the use of locks
–Every record in the database carries the time stamp of the transaction that last updated it
–If a new transaction attempts to update a record and its time stamp is earlier than the one carried in the record, the transaction is assigned a new time stamp and restarted
–A transaction cannot process a record until its time stamp is later than the one carried in the record, and therefore it cannot interfere with another transaction
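
The restart rule lends itself to a short sketch. The Python fragment below is a simplification under the slide's assumptions: a single counter stands in for a globally unique time stamp source, and a "restart" here retries only the one update, whereas a real system would re-execute the whole transaction.

# Sketch of the timestamp rule: an update succeeds only if the
# transaction's time stamp is later than the one carried by the record;
# otherwise the transaction gets a fresh stamp and is restarted.
import itertools

_clock = itertools.count(1)  # stand-in for a globally unique time stamp source

def new_timestamp():
    return next(_clock)

def try_update(record, txn_ts, value):
    while txn_ts < record["ts"]:
        txn_ts = new_timestamp()  # conflict: restart with a fresh stamp
        print(f"transaction restarted with time stamp {txn_ts}")
    record["ts"], record["value"] = txn_ts, value
    return txn_ts

part = {"ts": 0, "value": 100.00}
t1, t2 = new_timestamp(), new_timestamp()  # t1 = 1, t2 = 2
try_update(part, t2, 127.49)  # succeeds; the record now carries stamp 2
try_update(part, t1, 125.00)  # t1 < 2, so this transaction is restarted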

23 Concurrency Control
Advantage:
–Locking and deadlock detection are avoided.
Disadvantage:
–It is a conservative approach: transactions are sometimes restarted even when there is no conflict with other transactions.

24 Query Optimization
In a query involving a multi-site join and, possibly, a distributed database with replicated files, the distributed DBMS must decide where to access the data and how to proceed with the join.
Three steps to develop a query-processing plan (a sketch of the localization step follows):
–Query decomposition – the query is simplified and rewritten into a structured, relational algebra form
–Data localization – the query is fragmented so that each fragment references data at only one site
–Global optimization – decides the order in which to execute query fragments, the data movement between sites, and where parts of the query will be executed
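
As a rough illustration of the data-localization step, the sketch below groups the tables referenced by a query by the site that stores them, giving one fragment per site. The table-to-site assignments are invented, and real localization must also handle fragmented and replicated tables.

# Sketch of data localization: group a query's table references by the
# site that stores each table, producing one fragment per site.
# The table-to-site assignments are hypothetical.
TABLE_SITES = {"CUSTOMER": "site1", "ORDER": "site2", "PART": "site2"}

def localize(tables):
    fragments = {}
    for t in tables:
        fragments.setdefault(TABLE_SITES[t], []).append(t)
    return fragments

# Global optimization then decides the order in which the fragments run
# and how their intermediate results move between the sites.
print(localize(["CUSTOMER", "ORDER"]))  # {'site1': ['CUSTOMER'], 'site2': ['ORDER']}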

25 Query Optimization
One technique used to make processing a distributed query more efficient is the semijoin:
–In a semijoin operation, only the joining attribute of the query is sent from one site to another, and only the required rows are returned (rather than all selected attributes).

SITE 1 – Customer table (10,000 rows)
Cust_No       10 bytes
Cust_Name     50 bytes
Zip_Code      10 bytes
SIC            5 bytes

SITE 2 – Order table (400,000 rows)
Order_No      10 bytes
Cust_No       10 bytes
Order_Date     4 bytes
Order_Amount   6 bytes

26 Query Optimization
Query at Site 1: display the Cust_Name, SIC, and Order_Date for all customers in a particular Zip_Code range with an Order_Amount above a specified limit.
Assume that 10% of the customers fall in the Zip_Code range and 2% of the orders are above the amount limit.

27 Query Optimization
A semijoin would work as follows:
–A query is executed at Site 1 to create a list of the Cust_No values in the desired Zip_Code range. 10% of the customers qualify, so 1,000 rows of 10 bytes each for the Cust_No attribute, or 10,000 (1,000 * 10) bytes, are sent to Site 2.
–A query is executed at Site 2 to create a list of the Cust_No and Order_Date values to be sent back to Site 1 to compose the final result. Assuming the same number of orders for each customer, 40,000 rows of the Order table match the customer numbers sent from Site 1. Of these, 2% are above the amount limit: 800 rows. Cust_No and Order_Date take 14 bytes per row: 14 * 800 = 11,200 bytes.
Total data transferred: 10,000 + 11,200 = 21,200 bytes.

28 Query Optimization
Without the semijoin:
–To send data from Site 1 to Site 2: Cust_No, Cust_Name, and SIC (10+50+5 = 65 bytes) must be sent for the (10,000 * 10% = 1,000) qualifying rows of the Customer table, i.e. 65 * 1,000 = 65,000 bytes.
–To send data from Site 2 to Site 1: Cust_No and Order_Date (10+4 = 14 bytes) must be sent for the (400,000 * 2% = 8,000) qualifying rows of the Order table, i.e. 14 * 8,000 = 112,000 bytes.
Total data transferred: 65,000 + 112,000 = 177,000 bytes.
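
The byte counts on slides 27–28 can be checked with a few lines of Python; every figure below comes from the slides, and only the variable names are new.

# Recomputing the transfer costs of the semijoin example.
customers, orders = 10_000, 400_000
zip_pct, amount_pct = 0.10, 0.02

# Semijoin: send Cust_No (10 bytes) for the qualifying customers, get
# back Cust_No + Order_Date (14 bytes) for the qualifying orders.
out_rows = int(customers * zip_pct)             # 1,000 customer rows
back_rows = int(orders * zip_pct * amount_pct)  # 40,000 matches, 2% qualify: 800
semijoin_bytes = out_rows * 10 + back_rows * 14 # 10,000 + 11,200

# Without the semijoin: send Cust_No + Cust_Name + SIC (65 bytes) out,
# get back 14 bytes for 2% of all 400,000 orders.
full_bytes = out_rows * 65 + int(orders * amount_pct) * 14  # 65,000 + 112,000

print(semijoin_bytes, full_bytes)  # 21200 177000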

29 Evolution of Distributed DBMS
Distributed DBMS is still an emerging rather than an established technology.
Three stages in the evolution:
–Remote Unit of Work
–Distributed Unit of Work
–Distributed Request
A "unit of work" is the sequence of instructions required to process a transaction.

30 Remote Unit of Work – Remote Transaction
Allows multiple SQL statements to be originated at one location and executed as a single unit of work on a single remote DBMS.
The originating computer does not consult the data directory to locate the site containing the tables selected in the remote unit of work; the originating application must know where the data reside and connect to the remote DBMS prior to each remote unit of work.
The Remote Unit of Work concept therefore does not support location transparency.
Allows updates at the single remote computer.
All updates within a unit of work are tentative until a commit operation makes them permanent or a rollback undoes them.

31 Remote Unit of Work – Remote Transaction
Transaction integrity is maintained for a single remote site; an application cannot assure transaction integrity when more than one remote location is involved.
Example:
–An application in San Mateo could update the Part file in Tulsa, and transaction integrity would be maintained.
–The application could not simultaneously update the Part file in two or more locations.
–Remote Unit of Work therefore does not provide failure transparency.

32 Distributed Unit of Work
Allows the various statements within a unit of work to refer to multiple remote DBMS locations.
Supports some location transparency.
All tables in a single SQL statement must be at a single site.

33 Evolution of Distributed DBMS
(figure: sites at San Mateo, California and Tulsa, Oklahoma)

34 Distributed Unit of Work
A Distributed Unit of Work would not allow:
–Assembling parts information from all three sites:
SELECT DISTINCT PART_NUMBER, PART_NAME FROM PART WHERE COLOR = 'Orange' ORDER BY PART_NUMBER;
–A single SQL statement that attempts to update data at more than one location:
UPDATE PART SET UNIT_PRICE = 127.49 WHERE PART_NUMBER = 12345;

35 Distributed Request
Allows a single SQL statement to refer to tables at more than one remote site, overcoming a major limitation of the distributed unit of work.
Supports true location transparency.
May not support replication transparency or failure transparency.

36 The information in these slides was taken from Modern Database Management by Jeffrey A. Hoffer, Mary B. Prescott, and Heikki Topi.

