Unit - 4 Introduction to Other Databases
Introduction :- Today, a single-CPU architecture is no longer capable of handling the more demanding and complex requirements of modern database users, for example high performance, increased availability, distributed access to data, analysis of distributed data and so on. To meet these complex requirements, modern database systems operate with architectures in which multiple CPUs work in parallel to provide complex database services. In some of these architectures, the CPUs working in parallel are physically located in a closed environment, in the same building, and communicate at very high speed. Databases operating in such an environment are called Parallel Databases.
In a parallel database system, multiple CPUs work in parallel to improve performance through parallel implementation of various operations such as loading data, building indexes and evaluating queries. Parallel processing divides a large task into many smaller tasks and executes the smaller tasks concurrently on several CPUs. As a result, the larger task completes more quickly. Parallel database systems improve processing and I/O speed by using multiple CPUs and disks working in parallel. Parallel databases are especially useful for applications that have to query large databases and process a large number of transactions per second. In parallel processing, many operations are performed simultaneously, as opposed to centralized processing, in which computation is performed serially.
The goal of a Parallel Database System :- To ensure that the database system can continue to perform at an acceptable speed, even as the size of the database and the number of transactions increase. This can be done by increasing the capacity of the system through increased parallelism, which provides a smoother growth path for an enterprise than replacing a centralized system with a faster machine.
Parallel database systems are usually designed to provide the best cost/performance and are quite uniform in site-machine architecture. Cooperation between site machines is usually achieved at the level of the transaction module of the database system. Parallel database systems represent an attempt to construct a faster centralized computer out of several smaller CPUs.
WHY DO WE NEED THEM? More and more data! We have databases that hold a high amount of data, on the order of terabytes: 10,000,000,000,000 bytes! Faster and faster access! We have data applications that need to process data at very high speeds: tens of thousands of transactions per second! A SINGLE-PROCESSOR DBMS CANNOT DO THIS JOB!
Advantages of Parallel Database System :-
1.) Increased throughput (Scale-Up).
2.) Improved response time (Speed-Up).
3.) Useful for applications that query extremely large databases and process transactions at an extremely high rate (on the order of thousands of transactions per second).
4.) Increased availability of the system.
5.) Greater flexibility.
6.) Possible to serve a large number of users.
Disadvantages of Parallel Database System :-
1.) Higher start-up cost.
2.) Interference problem.
BENEFITS OF A PARALLEL DBMS
INTERQUERY PARALLELISM :- It is possible to process a number of transactions in parallel with each other. Improves throughput.
INTRAQUERY PARALLELISM :- It is possible to process 'sub-tasks' of a transaction in parallel with each other. Improves response time.
HOW TO MEASURE THE BENEFITS
Speed-Up :- As you multiply resources by a certain factor, the time taken to execute a transaction should be reduced by the same factor: 10 seconds to scan a DB of 10,000 records using 1 CPU; 1 second to scan a DB of 10,000 records using 10 CPUs.
Scale-Up :- As you multiply resources, the size of a task that can be executed in a given time should increase by the same factor: 1 second to scan a DB of 1,000 records using 1 CPU; 1 second to scan a DB of 10,000 records using 10 CPUs.
[Figure: Speed-Up — number of transactions/second vs. number of CPUs; linear speed-up (ideal): 5 CPUs → 1000/sec, 10 CPUs → 2000/sec]
[Figure: Scale-Up — number of transactions/second vs. number of CPUs and database size; linear scale-up (ideal): 5 CPUs / 1 GB database → 1000/sec, 10 CPUs / 2 GB database → 1000/sec]
Parallel Database Architectures :-
1.) Shared-Memory Multiple CPU
2.) Shared-Disk Multiple CPU
3.) Shared-Nothing Multiple CPU
Shared-Memory Multiple CPU :- In this system a computer has multiple simultaneously active CPUs that are attached to an interconnection network and share a single MAIN MEMORY. Thus in this architecture a single copy of a multithreaded operating system and a multithreaded DBMS can support multiple CPUs. This architecture of Parallel Database System is closest to the traditional single-CPU centralized database system, but much faster in performance compared to a single CPU of the same power.
Shared Memory Multiple CPU Architecture
Benefits of Shared-Memory :- Communication between CPUs is extremely efficient. Data can be accessed by any CPU without being moved by software. A CPU can send a message to another CPU much faster by using memory writes, which usually take less than a microsecond, than by sending a message through a communication mechanism. Communication overhead is low, because main memory can be used for this purpose and operating system services can be used to utilize the additional CPUs.
Limitations of Shared-Memory :- Memory access uses a very high-speed mechanism that is difficult to partition without losing efficiency. Thus the design must ensure that the different CPUs have equal access to the common memory. Since the communication bus or interconnection network is shared by all CPUs, this architecture does not scale beyond 80 or 100 CPUs in parallel. The bus and interconnection network become a bottleneck as the number of CPUs increases. The addition of more CPUs causes CPUs to spend time waiting for their turn on the bus to access memory.
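The shared-memory idea above can be sketched in miniature with threads: every worker writes directly into one common in-memory structure, and a memory write (guarded by a lock) is the whole communication mechanism. This is only an illustrative sketch; the table and worker names are invented for the example, not taken from any real DBMS.

```python
import threading

# A toy in-memory "table" shared by every CPU/thread: in a shared-memory
# DBMS, all processors see the same buffer pool.
shared_table = {}
table_lock = threading.Lock()  # coordinates concurrent writers

def worker(worker_id, rows):
    """Each worker writes its rows straight into the shared structure;
    no data is copied or sent over a network."""
    for key, value in rows:
        with table_lock:  # memory writes are the communication channel
            shared_table[key] = (worker_id, value)

threads = [
    threading.Thread(target=worker, args=(i, [(i * 10 + j, j) for j in range(3)]))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(shared_table))  # 4 workers x 3 rows each = 12 entries
```

The lock also hints at the limitation described above: every writer contends for the same shared medium, which is exactly why the architecture stops scaling past some number of CPUs.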
Shared-Disk Multiple CPU :- In this system multiple CPUs are attached to an interconnection network and each CPU has its own memory, but all of them have access to the same disk storage, or more commonly to a shared array of disks. The scalability of the system is largely determined by the capacity and throughput of the interconnection network. Since main memory is not shared among the CPUs, each machine has its own OS and its own DBMS. It is possible that, with the same data accessible to all, two or more nodes want to read or write the same data at the same time. Therefore a global locking scheme is required to preserve data integrity.
Shared Disk Multiple CPU Architecture
Benefits of Shared Disk Architecture :- Easy to load-balance, because data does not have to be permanently divided among the available CPUs. Since each CPU has its own memory, the memory bus is not a bottleneck. It offers a low-cost solution providing a degree of fault tolerance: in the case of a CPU or memory failure, the other CPUs take over its task, since the database resides on disks that are accessible from all CPUs. It has found acceptance in a wide range of applications.
Limitations of Shared Disk :- It also faces interference and memory contention bottlenecks as the number of CPUs increases. As more CPUs are added, the existing CPUs slow down because of the increased contention for memory accesses and network bandwidth. It also has a scalability problem: the interconnection to the disk subsystem becomes a bottleneck, particularly when the database makes a large number of accesses to the disks.
Shared-Nothing Multiple CPU :- In this system multiple CPUs are attached to an interconnection network and each CPU has a local memory and local disk storage, but no two CPUs can access the same storage area. All communication between CPUs is through the high-speed interconnection network. Thus the shared-nothing environment involves no sharing of memory or disk.
Each CPU has its own copy of the OS, its own copy of the DBMS and its own portion of the data managed by that DBMS. In this type of architecture, CPUs sharing responsibility for database services usually split the data up among themselves. The CPUs then perform transactions and queries by dividing up the work and communicating by messages over the high-speed network.
Shared Nothing Multiple CPU Architecture
Benefits of Shared Nothing Architecture :- This architecture minimizes contention between CPUs by not sharing resources and therefore offers a high degree of scalability. Since local disk references are serviced by the local disk at each CPU, this architecture overcomes the limitation of requiring all I/O to go through a single interconnection network. Only queries accessing non-local disks and result relations pass through the network.
The interconnection networks for this architecture are usually designed to be scalable. Thus adding more CPUs and more disks enables the system to grow in proportion to the power and capacity of the newly added components. In other words, the shared-nothing architecture provides linear Speed-Up and linear Scale-Up. The linear Speed-Up and Scale-Up properties increase the transmission capacity of the shared-nothing architecture as more nodes are added, and therefore it can easily support a large number of CPUs.
Limitations of Shared Nothing Architecture :- Shared-nothing architectures are difficult to load-balance. In many multi-CPU environments, it is necessary to split the system workload in some way so that all system resources are used efficiently. Properly splitting or balancing the workload across a shared-nothing system requires an administrator to properly partition or divide the data across the various disks; in practice this is difficult to achieve. Adding a new CPU and disk to a Shared-Nothing Architecture means the data may need to be redistributed in order to take advantage of the new resources, and this requires extensive reorganization of the DBMS.
The costs of communication and of non-local disk access are higher than in the Shared-Disk or Shared-Memory architectures, because sending data involves software interaction at both ends. High-speed networks are limited in size because of speed-of-light considerations, which leads to the requirement that a parallel architecture have CPUs that are physically close together. It also requires an OS that is capable of handling the heavy amount of messaging required to support inter-processor communication.
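The shared-nothing division of work described above can be sketched as a toy in which each node object owns a private partition and is reachable only through messages. The `Node` class and message format are invented for this illustration, not drawn from any real system.

```python
# Toy shared-nothing layout: each "node" owns a private partition of the
# data; nodes never touch each other's storage, only exchange messages.
class Node:
    def __init__(self, node_id):
        self.node_id = node_id
        self.local_rows = []  # this node's private disk/memory

    def receive(self, message):
        """All coordination happens through messages like this one."""
        if message["op"] == "insert":
            self.local_rows.append(message["row"])
            return {"status": "ok"}
        if message["op"] == "count":
            return {"status": "ok", "count": len(self.local_rows)}

nodes = [Node(i) for i in range(3)]

# The coordinator hash-partitions rows across nodes (splitting the data
# among CPUs, as the text describes), then fans a query out to all nodes.
for row in range(10):
    nodes[row % len(nodes)].receive({"op": "insert", "row": row})

total = sum(n.receive({"op": "count"})["count"] for n in nodes)
print(total)  # 10: every row landed on exactly one node
```

Note how the load-balancing difficulty shows up even here: the `row % len(nodes)` partitioning leaves node 0 with one more row than the others, and adding a fourth node would force the data to be redistributed.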
Performance Factors of Parallel Databases :-
1.) Speed-Up
2.) Scale-Up
3.) Synchronization
4.) Locking
1.) Speed-Up :- Speed-Up is the property in which the time taken to perform a task decreases as the number of CPUs increases. In other words, Speed-Up is the property of running a given task in less time by increasing the degree of parallelism (more hardware). With additional hardware, Speed-Up holds the task constant and measures the time saved. Thus, Speed-Up enables users to improve the system response time for their queries, assuming the size of their database remains the same.
Speed-Up = To / Tp
where To = execution time of the task on the original or smaller machine (original processing time) and Tp = execution time of the same task on the parallel or larger machine (parallel processing time). Here the original processing time To is the time spent by a centralized or small system on the given task, and the parallel processing time Tp is the time spent by the large or Parallel System on the same task.
Consider a database application running on a parallel system with a certain number of CPUs and disks. Now suppose the size of the system is increased by increasing the number of CPUs, disks and other hardware components. The goal is to process the task in time inversely proportional to the number of CPUs and disks allocated. For example, if the original system takes 60 seconds to perform the task and the parallel system (with double the capacity) takes 30 seconds to complete the same task, then Speed-Up = 60/30 = 2. A Speed-Up value of 2 here indicates Linear Speed-Up: the Speed-Up is linear if it is N when the larger system has N times the resources of the smaller system. If the Speed-Up value is less than N, the system is said to demonstrate Sub-Linear Speed-Up.
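The 60-second/30-second example above can be checked with a few lines that compute Speed-Up = To/Tp and compare it against the resource factor N. The function names are invented for the sketch.

```python
def speed_up(original_time, parallel_time):
    """Speed-Up = To / Tp, using the symbols defined in the text."""
    return original_time / parallel_time

def classify_speed_up(original_time, parallel_time, resource_factor):
    """Linear if the N-times-larger system is N times faster,
    sub-linear if it falls short of that."""
    if speed_up(original_time, parallel_time) >= resource_factor:
        return "linear"
    return "sub-linear"

# The text's example: 60 s on the original system, 30 s on a system
# with double the resources.
print(speed_up(60, 30))              # 2.0
print(classify_speed_up(60, 30, 2))  # linear
print(classify_speed_up(60, 40, 2))  # sub-linear (1.5 < 2)
```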
[Figure: Speed-Up — number of transactions/second vs. number of CPUs; linear speed-up (ideal): 5 CPUs → 1000/sec, 10 CPUs → 2000/sec; sub-linear speed-up falls below this line]
2.) Scale-Up :- Scale-Up is the property in which the performance of the parallel database is sustained when the number of CPUs and disks is increased in proportion to the amount of data. In other words, Scale-Up is the ability to handle a larger task, by increasing the degree of parallelism, in the same time period as the original system. With added hardware, the formula for Scale-Up holds the time constant and measures the increased size of the task. Thus Scale-Up enables users to increase the size of their database while maintaining the same response time.
Scale-Up = Vp / Vo
where Vp = parallel or large processing volume and Vo = original or small processing volume. Here the original processing volume is the transaction volume processed in a given amount of time on the smaller system, and the parallel processing volume is the transaction volume processed in the same amount of time on the larger system. For example, if the original system can process 3000 transactions in a given amount of time and the parallel system can process 6000 transactions in the same amount of time, then Scale-Up = 6000/3000 = 2.
A Scale-Up value of 2 indicates Linear Scale-Up, which means that twice as much hardware can process twice the data volume in the same amount of time. If the Scale-Up value is less than 2, it is called Sub-Linear Scale-Up. Under Linear Scale-Up, however many times the resources of the parallel system are increased, the volume it can process in the same time increases by the same factor.
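The 3000/6000-transaction example can likewise be verified with a one-line Scale-Up = Vp/Vo calculation; the function name is ours, chosen for the sketch.

```python
def scale_up(parallel_volume, original_volume):
    """Scale-Up = Vp / Vo, using the symbols defined in the text."""
    return parallel_volume / original_volume

# The text's example: the small system processes 3000 transactions and
# the larger system processes 6000 in the same amount of time.
ratio = scale_up(6000, 3000)
print(ratio)  # 2.0 -> Linear Scale-Up for twice the hardware

# A sub-linear case: twice the hardware, but only 1.5x the volume.
print(scale_up(4500, 3000))  # 1.5 -> Sub-Linear Scale-Up
```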
[Figure: Scale-Up — number of transactions/second vs. number of CPUs and database size; linear scale-up (ideal) holds 1000/sec from 5 CPUs / 1 GB database to 10 CPUs / 2 GB database; sub-linear scale-up falls below this line]
3.) Synchronization :- Synchronization is the coordination of concurrent tasks. For successful operation of a parallel database system, the tasks should be divided such that the synchronization requirement is low; synchronization is necessary for correctness. With a lower synchronization requirement, better Speed-Up and Scale-Up can be achieved. The amount of synchronization depends on the amount of resources and on the number of users and tasks working on those resources. More synchronization is required to coordinate a large number of concurrent tasks.
4.) Locking :- Locking is a method of synchronizing concurrent tasks. Both internal and external locking mechanisms are used for the synchronization of tasks required by a parallel database system. For external locking, a distributed lock manager (DLM) is used, which is a part of the OS. The DLM coordinates resource sharing between the communicating nodes running a parallel server. The instances of the parallel server use the DLM to communicate with each other and coordinate modification of database resources. The DLM allows applications to synchronize access to resources such as data, software and devices, so that concurrent requests for the same resource are coordinated between applications running on different nodes.
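A single-machine stand-in for the locking idea: two concurrent "transactions" update one shared resource, and a lock keeps each read-modify-write sequence intact, much as a lock manager entry would for nodes of a parallel server. This is a minimal sketch with invented names, not a real DLM.

```python
import threading

balance = 1000  # a shared database resource
resource_lock = threading.Lock()  # a stand-in for one lock-manager entry

def transaction(amount, repeats):
    """Each transaction must acquire the lock before modifying the
    resource, so concurrent updates never interleave mid-change."""
    global balance
    for _ in range(repeats):
        with resource_lock:
            current = balance           # read
            balance = current + amount  # write back, atomically w.r.t. others

t1 = threading.Thread(target=transaction, args=(+1, 10000))
t2 = threading.Thread(target=transaction, args=(-1, 10000))
t1.start(); t2.start()
t1.join(); t2.join()
print(balance)  # 1000: the lock kept every read-modify-write sequence intact
```

Without the lock, the two threads' read-modify-write sequences could interleave and lose updates, which is exactly the data-integrity problem the locking scheme exists to prevent.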
Types of Parallelism :-
1.) Intra-Query Parallelism
2.) Inter-Query Parallelism
3.) Intra-Operation Parallelism
4.) Inter-Operation Parallelism
5.) Input / Output Parallelism
1.) Intra-Query Parallelism :- Intra-Query Parallelism refers to the execution of a single query in parallel on multiple CPUs, using the Shared-Nothing Architecture technique. It is sometimes called Parallel Query Processing. For example, suppose a table has been partitioned across multiple disks by range partitioning on some attribute, and a user now wants to perform a SORT on the partitioning attribute. The SORT operation can be implemented by sorting each partition in parallel, then concatenating the sorted partitions to get the final sorted relation. Thus a query can be parallelized by parallelizing its individual operations.
Advantages :- Intra-Query Parallelism speeds up long-running queries. It is beneficial for decision-support applications that issue complex, read-only queries, including queries involving multiple JOINs.
[Figure: Intra-Query Parallelism — Query 1 divided across CPU 1 … CPU N over an interconnection network]
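The range-partitioned SORT example from the text can be sketched with a thread pool: each worker sorts one partition, and because the partitions cover disjoint value ranges, simple concatenation of the sorted partitions yields the fully sorted relation. The partition contents here are made up for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Range-partitioned "table": partition i holds values in [i*100, (i+1)*100).
partitions = [
    [57, 12, 99, 3],       # range 0-99
    [150, 101, 188],       # range 100-199
    [250, 201, 299, 222],  # range 200-299
]

# Sort every partition in parallel (one worker per partition). Because
# the partitioning is by range, concatenating the sorted partitions in
# partition order IS the fully sorted relation.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    sorted_parts = list(pool.map(sorted, partitions))

result = [row for part in sorted_parts for row in part]
print(result)  # [3, 12, 57, 99, 101, 150, 188, 201, 222, 250, 299]
```

The same pattern generalizes: any operation that can run independently per partition (scan, filter, local aggregation) parallelizes this way, with only a cheap combining step at the end.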
2.) Inter-Query Parallelism :- In Inter-Query Parallelism, multiple transactions are executed in parallel, one per CPU. It is sometimes also called Parallel Transaction Processing. The primary use of Inter-Query Parallelism is to Scale-Up a transaction processing system to support a larger number of transactions per second. To support Inter-Query Parallelism, a DBMS generally uses task or transaction dispatching. Efficient lock management is another method used by the DBMS to support Inter-Query Parallelism, particularly in the Shared-Disk Architecture. Since in Inter-Query Parallelism each query runs sequentially, it does not help speed up long-running queries.
In such a case the DBMS must understand the locks held by the different transactions executing on different CPUs in order to preserve data integrity. Inter-Query Parallelism on the Shared-Disk architecture performs best when the transactions that execute in parallel do not access the same disk.
[Figure: Inter-Query Parallelism — Transaction 1 … Transaction N, one per CPU 1 … CPU N, over an interconnection network]
Advantages :- It is the easiest form of parallelism to support in a database system, particularly in a Shared-Disk Parallel System. It Scales Up a transaction processing system to support a larger number of transactions per second. Disadvantages :- The response times of individual transactions are no faster than they would be if the transactions were run in isolation. It is more complicated to support in the Shared-Memory and Shared-Nothing Architectures.
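Inter-query parallelism, as described above, can be sketched with a pool of workers where each worker runs one complete transaction; many transactions run at the same time, but no single transaction is split up. The account table, per-row locks and `deposit` transaction are invented for this toy.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# Toy account table; per-row locks stand in for the lock management the
# text says a shared-disk DBMS needs when transactions touch shared data.
accounts = {"a": 100, "b": 100, "c": 100}
locks = {name: threading.Lock() for name in accounts}

def deposit(name, amount):
    """One complete transaction. Inter-query parallelism runs many of
    these concurrently without parallelizing any single one."""
    with locks[name]:
        accounts[name] += amount
    return name

# Dispatch whole transactions to the pool, one per worker.
with ThreadPoolExecutor(max_workers=3) as pool:
    list(pool.map(lambda args: deposit(*args),
                  [("a", 10), ("b", 20), ("c", 30), ("a", 5)]))

print(accounts)  # {'a': 115, 'b': 120, 'c': 130}
```

Throughput improves because transactions overlap, but each individual `deposit` takes just as long as it would alone, matching the disadvantage noted above.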
3.) Intra-Operation Parallelism :- Intra-Operation Parallelism is the parallel execution of each individual operation of a task, such as sorting, projection, join and so on. Since the number of operations in a typical query is small compared to the number of tuples processed by each operation, Intra-Operation Parallelism scales better with increasing parallelism. Advantages :- Intra-Operation Parallelism is natural in a database. The degree of parallelism is potentially enormous.
4.) Inter-Operation Parallelism :- In Inter-Operation Parallelism, the different operations in a query expression are executed in parallel. The following two types of Inter-Operation Parallelism are used: Pipelined Parallelism and Independent Parallelism.
1.) Pipelined Parallelism :- In this form of parallelism, the output tuples of one operation A are consumed by a second operation B even before the first operation has produced its entire set of output tuples. Thus it is possible to run operations A and B simultaneously on different processors, so that operation B consumes tuples in parallel with operation A producing them.
Advantages :- Pipelined parallelism is useful with a smaller number of CPUs. Pipelined execution also avoids writing intermediate results to disk. Disadvantages :- It does not Scale-Up well. Pipeline chains do not attain sufficient length to provide a high degree of parallelism. It is not possible to pipeline relational operators that do not produce output until all inputs have been accessed. Only marginal Speed-Up is obtained in the frequent case in which one operation's cost is much higher than the others'.
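The producer/consumer flow of pipelined parallelism can be modeled with Python generators: operation B pulls tuples from operation A one at a time, before A has produced its entire output, and no intermediate result is ever materialized. (Generators run within one process, so this models the tuple-at-a-time data flow rather than true multi-CPU execution; the operator names are invented.)

```python
def scan(rows):
    """Operation A: produces tuples one at a time."""
    for row in rows:
        yield row

def filter_even(tuples):
    """Operation B: consumes A's output tuples as they arrive, before A
    has produced its entire result."""
    for row in tuples:
        if row % 2 == 0:
            yield row

# The pipeline: each tuple flows straight from scan into filter_even;
# no intermediate result is written out anywhere.
pipeline = filter_even(scan(range(10)))
print(list(pipeline))  # [0, 2, 4, 6, 8]
```

A blocking operator like a full SORT cannot join such a chain mid-stream, since it must see all its input before emitting anything, which is the pipelining limitation noted above.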
2.) Independent Parallelism :- In Independent Parallelism, the operations in a query expression that do not depend on one another can be executed in parallel. Advantages :- It is useful with a lower degree of parallelism. Disadvantages :- Like pipelined parallelism, independent parallelism does not provide a high degree of parallelism, so it is less useful in a highly parallel system.
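Independent parallelism can be sketched as two operations of one query that do not depend on each other, for example two table scans feeding a final join, submitted to a pool at the same time. The tables, scan functions and the join step are all invented for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Two independent inputs of one query: neither scan needs the other.
employees = [(1, "ann"), (2, "bob"), (3, "cho")]
salaries = [(1, 900), (2, 1100), (3, 1500)]

def scan_employees():
    return {eid: name for eid, name in employees}

def scan_high_salaries():
    return {eid for eid, pay in salaries if pay > 1000}

# Independent parallelism: run both scans concurrently; the join step
# waits for both results, then combines them.
with ThreadPoolExecutor(max_workers=2) as pool:
    emp_future = pool.submit(scan_employees)
    sal_future = pool.submit(scan_high_salaries)
    names = emp_future.result()
    high = sal_future.result()

result = sorted(names[eid] for eid in high)
print(result)  # ['bob', 'cho']
```

Only two operations run in parallel here no matter how many CPUs exist, illustrating why this form of parallelism offers a low degree of parallelism.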