Fall 2008 Parallel Databases
Ideal Parallel Systems
Two key properties:
- Linear speedup: twice as much hardware performs the same task in half the elapsed time (i.e., speedup = number of processors).
- Linear scaleup: twice as much hardware performs twice as large a task in the same elapsed time (i.e., scaleup = 1).
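Both properties can be expressed as ratios of elapsed times. The following is a minimal sketch; the function names and the sample timings are made up for illustration and do not come from the slides.

```python
def speedup(small_system_time, big_system_time):
    """Same task on both systems: old elapsed time / new elapsed time."""
    return small_system_time / big_system_time


def scaleup(small_system_small_task_time, big_system_big_task_time):
    """Task grows with the hardware: old elapsed time / new elapsed time."""
    return small_system_small_task_time / big_system_big_task_time


# Hypothetical measurements (seconds), assuming the hardware is doubled.
print(speedup(100.0, 50.0))   # 2.0 -> linear speedup on 2x hardware
print(scaleup(100.0, 100.0))  # 1.0 -> linear scaleup
```

A speedup below the hardware ratio, or a scaleup below 1, indicates that the system falls short of the ideal.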
Barriers to Parallelism
- Startup: the time needed to start a parallel operation (thread creation, connection overhead) may dominate the actual computation time.
- Interference: when accessing shared resources, each new process slows down the others (the hot spot problem).
- Skew: the response time of a set of parallel processes is the time of the slowest one.
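The effect of skew can be illustrated with a small calculation. This sketch assumes, for simplicity, that the startup cost is paid once and simply added to the slowest partition's time; the function name and the numbers are hypothetical.

```python
from typing import Sequence


def parallel_response_time(startup: float, partition_times: Sequence[float]) -> float:
    """Response time = startup overhead + time of the slowest process.

    Assumption for illustration: startup is a single additive cost.
    """
    return startup + max(partition_times)


# Both workloads do 100 units of work in total on four processes.
balanced = [25.0, 25.0, 25.0, 25.0]
skewed = [10.0, 10.0, 10.0, 70.0]

print(parallel_response_time(1.0, balanced))  # 26.0
print(parallel_response_time(1.0, skewed))    # 71.0
```

The total work is identical in both cases, but the skewed workload takes almost three times as long because one process holds most of the data.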
The Challenges
The ideal database machine has:
- A single infinitely fast processor.
- An infinitely large memory with infinite bandwidth.
Unfortunately, technology is not delivering such machines. The challenges are:
- To build an infinitely fast processor out of infinitely many processors of finite speed.
- To build an infinitely large memory with infinitely many storage units of finite speed.
Why Parallel Databases?
- High-performance, low-cost commodity components have recently become available. Microprocessor-based systems are much cheaper than traditional mainframes.
- The relational data model has been widely adopted, and it is ideally suited to parallel execution.
- Terabyte online databases are becoming common as the price of online storage decreases. It is difficult to build mainframes powerful enough to meet the I/O demands of large relational databases.
Hardware Architecture
[Figure: Shared-Everything (SE): processors P0, P1, ..., Pn share memory and disks through a communication network. Shared-Disk (SD): processors P0, P1, ..., Pn each have their own memory and share the disks through a communication network.]
Example systems: IBM 3090 series, Digital VAX, Sequent Symmetry, nCUBE/2, original Digital VAX cluster, Sun Fire (72), IBM pSeries (32).
Shared-Nothing Architecture
Consensus: the shared-nothing architecture is the most scalable for supporting very large databases.
[Figure: Shared-Nothing (SN): each processing node (PN) pairs a processor P0, P1, ..., Pn with its own memory and disk; the nodes are connected by a communication network.]
Example systems: IBM SP/2, Teradata DBC/1012, Tandem.
IBM RS/6000 SP
- It allows up to 8,192 individual processors to be combined and managed as a single system.
- Processors are packaged in shared-memory nodes of up to 16 processors each.
- IBM's roadmap for the SP allows customers to start small and scale up to larger, more powerful systems. This may entail adding nodes without replacing existing hardware, which protects the investment as operating needs grow.
Parallel Database Servers
- Tandem NonStop SQL.
- Informix: Online 7.0 supported the SE environment with Informix Parallel Data Query (PDQ). Version 8.0 supports SN machines.
- Oracle: two products, Parallel Server and Parallel Query Option (PQO).
- Sybase: Navigation Server.
- AT&T Global Information Solutions (GIS).
- IBM DB2 Parallel Edition: supports the IBM SP2 SN multiprocessor.
Process Structure for PDB
[Figure: SQL enters and results leave a stack of layers: Query Optimizer, Executor, Storage Manager, Hardware. Annotated issues: query optimization, query scheduling, data placement, and SE, SD, SN architectures.]
Parallelism in Relational Data Model
Pipeline parallelism: if one operator sends its output to another, the two operators can execute in parallel.
[Figure: SCAN operators on Table A and Table B feed a JOIN, whose output feeds an INSERT into C.]
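The dataflow behind pipeline parallelism can be sketched with Python generators. This is only an illustration of the operator-to-operator streaming: generators interleave on a single thread, whereas a parallel database would run each operator on its own processor. The operator names follow the slide; the tables and the build-side handling of the join are assumptions.

```python
def scan(table):
    """SCAN: emit rows one at a time instead of materializing the result."""
    for row in table:
        yield row


def join(left_rows, right_table, key):
    """JOIN: consume left rows as they arrive and emit matches immediately.

    Assumption: the right input is small enough to index up front.
    """
    index = {}
    for r in right_table:
        index.setdefault(r[key], []).append(r)
    for l in left_rows:
        for r in index.get(l[key], []):
            yield {**l, **r}


def insert(rows, target):
    """INSERT: append each joined row to the target as soon as it is produced."""
    for row in rows:
        target.append(row)
    return target


# Hypothetical tables A and B; C receives the join result.
table_a = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
table_b = [{"id": 2, "b": "p"}, {"id": 3, "b": "q"}]
table_c = insert(join(scan(table_a), table_b, "id"), [])
print(table_c)  # [{'id': 2, 'a': 'y', 'b': 'p'}]
```

No operator waits for its producer to finish: each row flows from SCAN through JOIN to INSERT before the next row is scanned.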
Partitioned parallelism: by taking the large relational operators and partitioning their inputs and outputs, it is possible to turn one big job into many concurrent, independent little ones.
[Figure: the SCAN, JOIN, and INSERT operators are replicated over the partitions A0, A1, A2 and B0, B1 of the inputs and C0, C1, C2 of the output.]
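A partitioned join can be sketched as follows: both inputs are split on the join key so that matching rows land in the same partition, and each pair of partitions is joined independently. The helper names, the thread pool standing in for processing nodes, and the sample data are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(rows, key, n):
    """Split rows into n partitions on the join key.

    Python's hash() is stable for small integers; a real system would
    need a hash that is stable across processes.
    """
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts


def local_join(left, right, key):
    """Join one pair of partitions; needs no data from other partitions."""
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


def partitioned_join(a, b, key, n=3):
    """Run n independent little joins concurrently and merge the outputs."""
    a_parts = partition(a, key, n)
    b_parts = partition(b, key, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = pool.map(local_join, a_parts, b_parts, [key] * n)
        return [row for part in results for row in part]


# Hypothetical relations A and B joined on "id".
rel_a = [{"id": i, "a": i * 10} for i in range(10)]
rel_b = [{"id": i, "b": i * 100} for i in range(5, 15)]
joined = partitioned_join(rel_a, rel_b, "id", n=3)
print(sorted(row["id"] for row in joined))  # [5, 6, 7, 8, 9]
```

Because both inputs are partitioned with the same function on the same key, no local join ever needs rows from another partition.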
Data Partitioning Strategies
There are two problems for the SN architecture:
- The degree of parallelism is determined by the physical layout of the data across the PNs.
- Its performance is very sensitive to skew in the data distribution.
Partitioned data is the key to partitioned execution:
- Round-robin partitioning
- Hash partitioning
- Range partitioning
Round-Robin Partitioning
It maps the ith tuple to disk i mod n, where n is the number of disks.
- Advantage: it is simple.
- Disadvantage: it does not support associative search.
[Figure: records dealt in turn to disks D0, D1, D2, D3.]
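The mapping rule is short enough to write out directly. A minimal sketch, with the function name and the sample records chosen for illustration:

```python
def round_robin_partition(tuples, n):
    """Place the i-th tuple on disk i mod n."""
    disks = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        disks[i % n].append(t)
    return disks


# Ten hypothetical records spread over four disks.
print(round_robin_partition(list(range(10)), 4))
# [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

The load is balanced to within one tuple, but the disk of a tuple depends only on its arrival position, so a search on an attribute value has to visit every disk.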
Hash Partitioning
It maps each tuple to a disk location based on a hash function.
- Advantage: associative access to the tuples with a specific attribute value can be directed to a single disk.
- Disadvantage: it tends to randomize data rather than cluster it.
[Figure: a hash function distributes records over disks D0, D1, D2, D3.]
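A minimal sketch of hash partitioning and the single-disk lookup it enables. The choice of CRC32 as the hash function, the function names, and the student records are assumptions for illustration; the slides do not name a particular hash function.

```python
import zlib


def disk_for(value, n_disks):
    """Stable hash of the partitioning attribute value, reduced to a disk number."""
    return zlib.crc32(str(value).encode("utf-8")) % n_disks


def hash_partition(tuples, attr, n_disks):
    """Place each tuple on the disk selected by hashing its attr value."""
    disks = [[] for _ in range(n_disks)]
    for t in tuples:
        disks[disk_for(t[attr], n_disks)].append(t)
    return disks


def lookup(disks, attr, value):
    """Associative (exact-match) search: only one disk has to be read."""
    target = disk_for(value, len(disks))
    return [t for t in disks[target] if t[attr] == value]


# Hypothetical relation partitioned on "major".
students = [
    {"name": "Ann", "major": "CS"},
    {"name": "Bob", "major": "Math"},
    {"name": "Cho", "major": "CS"},
    {"name": "Dee", "major": "Physics"},
]
disks = hash_partition(students, "major", 4)
print(lookup(disks, "major", "CS"))
```

Tuples with nearby attribute values generally end up on unrelated disks, which is the loss of clustering noted above: a range search on the partitioning attribute still has to visit all disks.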
Range Partitioning
It maps contiguous attribute ranges of a relation to various disks.
- Advantage: it is good for associative search and for clustering data.
- Disadvantage: it risks execution skew, in which all the execution occurs in one partition.
[Figure: disks D0, D1, D2, D3 hold the ranges A~F, G~L, M~R, S~Z.]
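The ranges A~F, G~L, M~R, S~Z from the slide can be turned into a lookup with a binary search over the upper bounds. A minimal sketch; the function names and the sample names are assumptions, and keys are assumed to start with a letter A to Z.

```python
import bisect

# Upper bound of each disk's range, taken from the slide: A~F, G~L, M~R, S~Z.
UPPER_BOUNDS = ["F", "L", "R", "Z"]


def disk_for(name):
    """Disk whose range contains the first letter of the key."""
    return bisect.bisect_left(UPPER_BOUNDS, name[0].upper())


def disks_for_range(low, high):
    """A range query only touches the disks overlapping [low, high]."""
    return list(range(disk_for(low), disk_for(high) + 1))


def load_per_disk(names):
    """Number of tuples stored on each disk."""
    counts = [0] * len(UPPER_BOUNDS)
    for name in names:
        counts[disk_for(name)] += 1
    return counts


print(disks_for_range("H", "N"))  # [1, 2]

# Hypothetical skewed data: most names start with A to F.
names = ["Adams", "Baker", "Chen", "Davis", "Evans", "Smith"]
print(load_per_disk(names))  # [5, 0, 0, 1]
```

The second output shows the risk named above: when the attribute values cluster in one range, one disk does nearly all the work while the others sit idle.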
Horizontal Data Partitioning
Problems for Horizontal Partitioning
- Query 1: Retrieve the names of students who have a GPA better than 2.0. Only P2 and P3 can participate. In a multi-user environment, the system can effectively use all the remaining PNs for other queries (generally not achievable).
- Query 2: Retrieve the names of students who major in Computer Science. The whole file must be searched. This problem cannot be easily addressed.
To Address the Problem
- The relation is horizontally partitioned and distributed across the PNs. Locally, each partition is organized as a grid file.
- The relation is partitioned using multiple attributes. Locally, each partition can be organized as a grid file (investigated in most research).
Multidimensional Data Partitioning
[Figure: grid with axes Age and Salary (K); labels: "attribute query", "1-attribute query".]
Advantage of MDP
- Degree of parallelism is maximized (using as many processing nodes as possible).
- Search space is minimized (searching only relevant data blocks).
Query Types
- Query shape: the shape of the data sub-space accessed by a range query.
- Square query: the query shape is a square.
- Row query: the query shape is a rectangle containing a number of rows.
- Column query: the query shape is a rectangle containing a number of columns.
Disk Modulo (DM) Allocation
[Figure: a 16×16 DM example.]
Disk Modulo
- Advantage: optimal for row and column queries.
- Disadvantage: poor for square queries.
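The behavior on both query shapes can be checked with a short sketch. It assumes the usual two-dimensional definition of Disk Modulo, in which grid cell (i, j) is stored on disk (i + j) mod N; the slides show this only as a figure. The 16×16 grid matches the slide's example, while the choice of N = 8 disks and the query positions are hypothetical.

```python
from collections import Counter
from math import ceil


def dm_disk(i, j, n_disks):
    """Disk Modulo (assumed 2-D form): cell (i, j) goes to disk (i + j) mod N."""
    return (i + j) % n_disks


def max_load(cells, n_disks):
    """Cells the busiest disk must read; this determines the response time."""
    counts = Counter(dm_disk(i, j, n_disks) for i, j in cells)
    return max(counts.values())


N = 8  # hypothetical number of disks for a 16x16 grid

row_query = [(3, j) for j in range(16)]                      # one full row
square_query = [(i, j) for i in range(4) for j in range(4)]  # a 4x4 square

# Each line prints the actual busiest-disk load and the optimum ceil(|Q| / N).
print(max_load(row_query, N), ceil(len(row_query) / N))        # 2 2
print(max_load(square_query, N), ceil(len(square_query) / N))  # 4 2
```

The row query is spread evenly, while the square query piles up on one disk because all cells on an anti-diagonal share the same value of i + j.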
Hilbert Curve Allocation Method (HCAM)
[Figures: a Hilbert curve; a 16×16 HCAM example.]
HCAM
- HCAM is based on the idea of space-filling curves.
- A space-filling curve visits all points in a k-dimensional grid exactly once and never crosses itself.
- Advantage: good for square range queries.
- Disadvantage: poor for row and column queries.
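A minimal two-dimensional sketch of the idea, assuming that HCAM numbers the grid cells along the Hilbert curve and deals them to the disks round-robin in that order (the slides show the allocation only as a figure). The coordinate-to-index conversion is the standard iterative algorithm for a grid whose side is a power of two; N = 8 disks and the queries are hypothetical.

```python
from collections import Counter


def hilbert_index(n, x, y):
    """Position of cell (x, y) along the Hilbert curve on an n x n grid.

    n must be a power of two.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hcam_disk(n, x, y, n_disks):
    """Assumed HCAM rule: round-robin over the cells in Hilbert-curve order."""
    return hilbert_index(n, x, y) % n_disks


def max_load(cells, n, n_disks):
    """Cells the busiest disk must read for the given query."""
    counts = Counter(hcam_disk(n, x, y, n_disks) for x, y in cells)
    return max(counts.values())


GRID, N = 16, 8  # 16x16 grid as on the slide; 8 disks is a hypothetical choice

square_query = [(x, y) for x in range(4) for y in range(4)]
row_query = [(x, 3) for x in range(GRID)]

print(max_load(square_query, GRID, N))  # 2, the optimum for 16 cells on 8 disks
print(max_load(row_query, GRID, N))
```

Cells that are close along the curve are close in the grid, so a compact square covers a run of consecutive curve positions and is dealt evenly across the disks. A long row cuts across many separate runs, which is why its load is typically less balanced.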
General Multidimensional Data Allocation
[Figures: 2-D GeMDA; a 16×16 GeMDA example.]
2-D GeMDA
- Regular rows: circular left shift positions.
- Check rows: circular left shift +1 positions.
- Number of check rows: GCD( , N) - 1
- Advantage: optimal for row, column, and small square range queries (|Q| < 2 ).
N is the number of PNs.
3-D GeMDA
Mapping Function for GeMDA
Optimality Comparison
Optimal with respect to:
Allocation scheme   Row queries   Column queries   Small square queries
HCAM                No            No               No
DM                  Yes           Yes              No
GeMDA               Yes           Yes              Yes