Parallel and Distributed Databases

Presentation by: Mr. Krutibash Nayak, Assistant Professor, Department of Electrical Engineering, PIET

A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network. A distributed database management system (distributed DBMS) is then defined as the software system that permits the management of the distributed database and makes the distribution transparent to the users [1].

Centralized database[2]
• Data is located in one place (one server)
• All DBMS functionalities are done by that server:
  • Enforcing ACID properties of transactions
  • Concurrency control, recovery mechanisms
  • Answering queries

Distributed databases
• Data is stored in multiple places (each running a DBMS)
• New notion of distributed transactions
• DBMS functionalities are now distributed over many machines
• Revisit how these functionalities work in a distributed environment

What is not a DDBS
• A database system which resides on a timesharing computer system
• A loosely or tightly coupled multiprocessor system
• A database system which resides on one of the nodes of a network of computers; this is a centralised database on a network node [3]

Parallel Database Management System
A database management system that is implemented on a tightly coupled multiprocessor (a tightly coupled system has shared memory). A parallel database system improves the performance of data processing by using multiple resources, such as CPUs and disks, in parallel.

Goals of Parallel Databases
• Improve performance: The performance of the system can be improved by connecting multiple CPUs and disks in parallel. Many small processors can also be connected in parallel.
• Improve availability of data: Data can be copied to multiple locations to improve its availability. For example, if a module holding a relation (a table in the database) becomes unavailable, the relation can still be served from another module that holds a copy.
• Improve reliability: Reliability of the system is improved through completeness, accuracy, and availability of data.
• Provide distributed access to data: Companies with branches in multiple cities can access data with the help of a parallel database system.

PARALLEL VS. DISTRIBUTED DATABASES
Distributed processing usually implies parallel processing (not vice versa)
• Parallel processing is possible on a single machine

Assumptions about Architecture
Parallel Databases:
• Machines are physically close to each other, e.g., in the same server room
• Machines are connected by dedicated high-speed LANs and switches
• Communication cost is assumed to be small
• Can use a shared-memory, shared-disk, or shared-nothing architecture
Distributed Databases:
• Machines can be far from each other, e.g., on different continents
• Can be connected using a public network, e.g., the Internet
• Communication cost and problems cannot be ignored
• Usually a shared-nothing architecture

PARALLEL PROCESSING
Divide a big problem into many smaller ones to be solved in parallel
• Increase bandwidth (in our case, decrease queries' response time)
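The divide-and-combine idea can be sketched in a few lines of Python. This is only an illustration of the structure, not how any particular DBMS does it: the function names `count_matching` and `parallel_count` are hypothetical, the "table" is an in-memory list, and threads in CPython do not give true CPU parallelism for this kind of work, so the sketch shows the partition, scan, and merge steps rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def count_matching(partition, predicate):
    """Scan one partition and count the rows that satisfy the predicate."""
    return sum(1 for row in partition if predicate(row))


def parallel_count(rows, predicate, workers=4):
    """Split rows into roughly equal partitions, scan each one concurrently,
    and combine the partial counts into one answer."""
    size = max(1, -(-len(rows) // workers))  # ceiling division, at least 1
    partitions = [rows[i:i + size] for i in range(0, len(rows), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(lambda p: count_matching(p, predicate), partitions)
        return sum(partial_counts)


if __name__ == "__main__":
    table = list(range(1000))  # stand-in for a large relation
    print(parallel_count(table, lambda value: value % 2 == 0))
```

Each worker sees only its own partition, which is the same assumption a shared-nothing system makes about its nodes.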

Parallel Query Optimization
Parallel query optimization is the process of analysing a query and choosing the best combination of parallel and serial access methods to yield the fastest response time for the query. In addition to the costing performed for serial query optimization, parallel optimization analyses the cost of parallel access methods for each combination of join orders, join types, and indexes. The optimizer can choose any combination of serial and parallel access methods to create the fastest query plan.
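As a rough illustration of choosing among combinations of serial and parallel access methods, the sketch below enumerates every combination and keeps the one with the lowest estimated response time. The cost formula, the repartitioning penalty, and all names (`AccessMethod`, `plan_cost`, `choose_plan`) are assumptions made for this example; real optimizers use far richer cost models and prune the search space instead of enumerating it.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AccessMethod:
    name: str
    work: float     # total work units for the step (hypothetical figure)
    degree: int     # 1 = serial, more than 1 = parallel
    startup: float  # assumed coordination overhead per worker

    def response_time(self) -> float:
        # Assumed model: work is divided evenly, overhead grows with the degree.
        return self.work / self.degree + self.startup * self.degree


def plan_cost(plan, repartition_cost=5.0):
    """Estimated response time of a plan (one access method per step)."""
    cost = sum(method.response_time() for method in plan)
    # Assumed penalty when consecutive steps change the degree of parallelism,
    # standing in for redistributing intermediate results.
    cost += sum(repartition_cost for a, b in zip(plan, plan[1:]) if a.degree != b.degree)
    return cost


def choose_plan(options_per_step):
    """Enumerate every combination and return the cheapest plan."""
    return min(product(*options_per_step), key=plan_cost)


if __name__ == "__main__":
    scan = [AccessMethod("serial scan", 100, 1, 0), AccessMethod("parallel scan", 100, 4, 2)]
    join = [AccessMethod("serial join", 2, 1, 0), AccessMethod("parallel join", 2, 4, 2)]
    best = choose_plan([scan, join])
    print([method.name for method in best], plan_cost(best))
```

With these made-up numbers the cheapest plan mixes a parallel scan with a serial join, which shows why the optimizer has to cost combinations rather than each step in isolation.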

Distributed Database Architecture
A distributed database system allows applications to access data from local and remote databases. In Oracle's terminology, a homogeneous distributed database system is one in which each database is an Oracle Database [4]; a heterogeneous distributed database system is one in which at least one of the databases is not an Oracle Database. Distributed databases use a client/server architecture to process information requests.

Distributed Database Architecture[5]

Homogeneous Distributed Databases
In a homogeneous distributed database, all the sites use identical DBMS and operating systems. Its properties are:
• The sites use very similar software.
• The sites use identical DBMS or DBMS from the same vendor.
• Each site is aware of all other sites and cooperates with other sites to process user requests.
• The database is accessed through a single interface as if it is a single database.
Types of Homogeneous Distributed Database
There are two types of homogeneous distributed database:
• Autonomous: Each database is independent and functions on its own. The databases are integrated by a controlling application and use message passing to share data updates.
• Non-autonomous: Data is distributed across the homogeneous nodes, and a central or master DBMS co-ordinates data updates across the sites.

Heterogeneous Distributed Databases
In a heterogeneous distributed database, different sites have different operating systems, DBMS products and data models. Its properties are:
• Different sites use dissimilar schemas and software.
• The system may be composed of a variety of DBMSs like relational, network, hierarchical or object oriented.
• Query processing is complex due to dissimilar schemas.
• Transaction processing is complex due to dissimilar software.
• A site may not be aware of other sites and so there is limited co-operation in processing user requests.
Types of Heterogeneous Distributed Databases
• Federated: The heterogeneous database systems are independent in nature and integrated together so that they function as a single database system.
• Un-federated: The database systems employ a central coordinating module through which the databases are accessed.

Distributed DBMS Architectures
DDBMS architectures are generally developed depending on three parameters:
• Distribution: It states the physical distribution of data across the different sites.
• Autonomy: It indicates the distribution of control of the database system and the degree to which each constituent DBMS can operate independently.
• Heterogeneity: It refers to the uniformity or dissimilarity of the data models, system components and databases.

Distributed Catalog Management
Catalogs are databases themselves containing metadata about the distributed database system. Efficient catalog management in distributed databases is critical to ensure satisfactory performance related to site autonomy, view management, and data distribution and replication. Three popular management schemes for distributed catalogs are centralized catalogs, fully replicated catalogs, and partially replicated catalogs.

Centralized Catalogs
In this scheme, the entire catalog is stored in one single site. Owing to its central nature, it is easy to implement. On the other hand, the advantages of reliability, availability, autonomy, and distribution of processing load are adversely impacted. For read operations from noncentral sites, the requested catalog data is locked at the central site and is then sent to the requesting site. On completion of the read operation, an acknowledgement is sent to the central site, which in turn unlocks this data. All update operations must be processed through the central site. This can quickly become a performance bottleneck for write-intensive applications.

Fully Replicated Catalogs
In this scheme, identical copies of the complete catalog are present at each site. This scheme facilitates faster reads by allowing them to be answered locally. However, all updates must be broadcast to all sites. Updates are treated as transactions and a centralized two-phase commit scheme is employed to ensure catalog consistency. As with the centralized scheme, write-intensive applications may cause increased network traffic due to the broadcast associated with the writes.
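A minimal sketch of a catalog update under a centralized two-phase commit is shown below. It assumes in-process objects instead of networked sites and leaves out logging, timeouts, and recovery, all of which a real protocol needs. The names `CatalogSite` and `update_catalog` are hypothetical.

```python
class CatalogSite:
    """One site holding a full copy of the catalog (illustrative only)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a site voting yes or no
        self.catalog = {}
        self.pending = None

    def prepare(self, key, value):
        """Phase 1: stage the update and vote."""
        if self.can_commit:
            self.pending = (key, value)
        return self.can_commit

    def commit(self):
        """Phase 2 (commit): apply the staged update."""
        key, value = self.pending
        self.catalog[key] = value
        self.pending = None

    def abort(self):
        """Phase 2 (abort): discard the staged update."""
        self.pending = None


def update_catalog(sites, key, value):
    """Coordinator: broadcast the update, commit only if every site votes yes."""
    votes = [site.prepare(key, value) for site in sites]
    if all(votes):
        for site in sites:
            site.commit()
        return True
    for site in sites:
        site.abort()
    return False


if __name__ == "__main__":
    sites = [CatalogSite("A"), CatalogSite("B"), CatalogSite("C")]
    print(update_catalog(sites, "EMP", "stored at site A"))
```

Every update touches every site twice, which is the broadcast cost that makes this scheme expensive for write-intensive workloads.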

Partially Replicated Catalogs
The centralized and fully replicated schemes restrict site autonomy since they must ensure a consistent global view of the catalog. Under the partially replicated scheme, each site maintains complete catalog information on data stored locally at that site. Each site is also permitted to cache entries retrieved from remote sites. However, there are no guarantees that these cached copies will be the most recent and updated. The system tracks catalog entries for sites where the object was created and for sites that contain copies of this object. Any changes to copies are propagated immediately to the original (birth) site. Retrieving updated copies to replace stale data may be delayed until an access to this data occurs. In general, fragments of relations across sites should be uniquely accessible. Also, to ensure data distribution transparency, users should be allowed to create synonyms for remote objects and use these synonyms for subsequent referrals.
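The caching behaviour described here can be illustrated with a small sketch. It assumes catalog entries are plain dictionaries and that a shared `birth_sites` mapping records where each object was created; both are simplifications invented for this example, as are the names `Site`, `create`, and `lookup`.

```python
class Site:
    """A site with authoritative local entries and a cache of remote ones."""

    def __init__(self, name):
        self.name = name
        self.local = {}  # complete catalog information for locally stored objects
        self.cache = {}  # copies of remote entries, possibly stale

    def create(self, obj, entry, birth_sites):
        """Register a new object; this site becomes its birth site."""
        self.local[obj] = dict(entry)
        birth_sites[obj] = self

    def lookup(self, obj, birth_sites, refresh=False):
        """Answer from local data, else from the cache, fetching on a miss
        or when the caller asks for a fresh copy."""
        if obj in self.local:
            return self.local[obj]
        if refresh or obj not in self.cache:
            self.cache[obj] = dict(birth_sites[obj].local[obj])
        return self.cache[obj]


if __name__ == "__main__":
    birth_sites = {}
    site_a, site_b = Site("A"), Site("B")
    site_a.create("EMP", {"columns": 3}, birth_sites)
    print(site_b.lookup("EMP", birth_sites))                # fetched and cached
    site_a.local["EMP"] = {"columns": 4}                    # entry changes at the birth site
    print(site_b.lookup("EMP", birth_sites))                # cached copy, now stale
    print(site_b.lookup("EMP", birth_sites, refresh=True))  # refreshed on demand
```

The stale read in the middle is the trade-off of this scheme: a site keeps its autonomy and fast local reads, but cached remote entries carry no freshness guarantee until they are refreshed.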

Distributed Query Processing
Architecture

Example: Retrieve details of all projects whose status is "Ongoing".
The global query will be: σ_{status="ongoing"}(PROJECT)
Query in New Delhi's server will be: σ_{status="ongoing"}(NewD-PROJECT)
Query in Kolkata's server will be: σ_{status="ongoing"}(Kol-PROJECT)
Query in Hyderabad's server will be: σ_{status="ongoing"}(Hyd-PROJECT)
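The decomposition can be mimicked in Python, assuming PROJECT is horizontally fragmented by city so that the global result is the union of the per-site results. The sample rows and the function names `select_ongoing` and `global_query` are invented for illustration; only the fragment names come from the example.

```python
def select_ongoing(fragment):
    """Local query: the selection status = "ongoing" applied to one fragment."""
    return [row for row in fragment if row["status"] == "ongoing"]


def global_query(fragments):
    """Run the same selection on every fragment and union the results."""
    results = []
    for rows in fragments.values():
        results.extend(select_ongoing(rows))
    return results


if __name__ == "__main__":
    # Hypothetical data; each key stands for the fragment held by one server.
    fragments = {
        "NewD-PROJECT": [{"id": 1, "status": "ongoing"}, {"id": 2, "status": "completed"}],
        "Kol-PROJECT": [{"id": 3, "status": "ongoing"}],
        "Hyd-PROJECT": [{"id": 4, "status": "planned"}],
    }
    print(global_query(fragments))
```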

Distributed Query Optimization
The main issues for distributed query optimization are:
• Optimal utilization of resources in the distributed system.
• Query trading.
• Reduction of solution space of the query.

Optimal Utilization of Resources in the Distributed System
Following are the approaches for optimal resource utilization:
• Operation Shipping: In operation shipping, the operation is run at the site where the data is stored and not at the client site. The results are then transferred to the client site. This is appropriate for operations where the operands are available at the same site. Example: Select and Project operations.
• Data Shipping: In data shipping, the data fragments are transferred to the database server, where the operations are executed. This is used in operations where the operands are distributed at different sites. This is also appropriate in systems where the communication costs are low, and local processors are much slower than the client server.
• Hybrid Shipping: This is a combination of data and operation shipping. Here, data fragments are transferred to the high-speed processors, where the operation runs. The results are then sent to the client site.
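The difference between operation shipping and data shipping comes down to what crosses the network. The sketch below simulates a single data site and counts the rows it sends under each approach. The class `DataSite`, the two helper functions, and the row counter are hypothetical, and counting rows is only a crude stand-in for communication cost.

```python
class DataSite:
    """A site that stores a fragment and counts the rows it sends out."""

    def __init__(self, rows):
        self.rows = rows
        self.rows_sent = 0

    def run_select(self, predicate):
        """Evaluate the selection locally and send only the matching rows."""
        result = [row for row in self.rows if predicate(row)]
        self.rows_sent += len(result)
        return result

    def fetch_all(self):
        """Send the whole fragment to the requester."""
        self.rows_sent += len(self.rows)
        return list(self.rows)


def operation_shipping(site, predicate):
    """The operation runs where the data is stored."""
    return site.run_select(predicate)


def data_shipping(site, predicate):
    """The data is moved first, then the operation runs at the requester."""
    return [row for row in site.fetch_all() if predicate(row)]


if __name__ == "__main__":
    rows = [{"id": i, "status": "ongoing" if i < 2 else "completed"} for i in range(10)]
    is_ongoing = lambda row: row["status"] == "ongoing"
    site_a, site_b = DataSite(rows), DataSite(rows)
    operation_shipping(site_a, is_ongoing)
    data_shipping(site_b, is_ongoing)
    print(site_a.rows_sent, site_b.rows_sent)
```

Both approaches return the same rows; for a selective operation such as Select, operation shipping simply moves fewer of them.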

References:
1. Distributed and Parallel Database Systems: M. Tamer Ozsu
2.
3. miskolc.hu/tempus/discom/doc/db/tema01a.pdf
4. pts001.htm#ADMIN12074 _database_environments.htm Management_11597/
