Gamma DBMS (Part 2): Failure Management, Query Processing
Shahram Ghandeharizadeh
Computer Science Department
University of Southern California

Failure Management Techniques
Teradata’s Interleaved Declustering

Teradata’s Interleaved Declustering
A partitioned table has a primary and a backup copy. The primary copy is constructed using one of the partitioning techniques. The backup copy is constructed by:
1. Dividing the nodes into clusters (here the cluster size is 4).
2. Partitioning each primary fragment across the remaining nodes of its cluster. For example, R0 is partitioned across nodes 1, 2, and 3, yielding backup fragments r0.0, r0.1, and r0.2.

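The placement of backup fragments can be sketched in a few lines of Python. The rule used here, that backup fragment ri.j sits on node (i + 1 + j) mod C of the cluster, is an assumption inferred from the fragment names that appear on these slides (r0.0, r2.2, and r3.1 on node 1); the function names are hypothetical.

```python
def backup_node(i: int, j: int, cluster_size: int = 4) -> int:
    """Node (within the cluster) that stores backup fragment r{i}.{j}.

    Assumed rule, inferred from the slide examples: the j-th backup
    fragment of primary fragment R_i sits j + 1 hops after node i.
    """
    if not 0 <= j < cluster_size - 1:
        raise ValueError("R_i has exactly cluster_size - 1 backup fragments")
    return (i + 1 + j) % cluster_size


def backup_layout(cluster_size: int = 4) -> dict[int, list[str]]:
    """Map each node to the names of the backup fragments it stores."""
    layout = {node: [] for node in range(cluster_size)}
    for i in range(cluster_size):
        for j in range(cluster_size - 1):
            layout[backup_node(i, j, cluster_size)].append(f"r{i}.{j}")
    return layout


for node, fragments in backup_layout().items():
    print(f"node {node}: primary R{node}, backups {fragments}")
```

No node stores a backup fragment of its own primary fragment, so a single node failure never removes both copies of a record.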
Teradata’s Interleaved Declustering
When a node (say node 1) fails, the backup copy of its fragment processes the requests directed to the primary copy of R1. The backup copy consists of three fragments: r1.2, r1.0, and r1.1. Note that the load of R1 is distributed across the remaining nodes of the cluster.

Teradata’s Interleaved Declustering
MTTR (mean time to repair) involves:
1. Replacing the failed node with a new one.
2. Reconstructing the primary copy of the fragment assigned to the failed node, R1, by reading r1.2, r1.0, and r1.1 from nodes 0, 2, and 3.
3. Reconstructing the backup fragments assigned to the failed node: r0.0, r2.2, and r3.1.

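The repair steps can be turned into a small planning routine. This is a sketch under two assumptions that the slides do not state outright: backup fragment ri.j lives on node (i + 1 + j) mod C (inferred from the fragment names above), and a lost backup fragment is rebuilt from the primary fragment it mirrors. The name `recovery_plan` is hypothetical.

```python
def recovery_plan(failed: int, cluster_size: int = 4) -> dict[str, dict[str, int]]:
    """List what must be rebuilt on the replacement for node `failed`.

    Assumes backup fragment r{i}.{j} lives on node (i + 1 + j) % cluster_size.
    """
    # Step 2: the primary fragment R_failed is rebuilt from its backup
    # fragments, each read from the surviving node that stores it.
    primary_sources = {
        f"r{failed}.{j}": (failed + 1 + j) % cluster_size
        for j in range(cluster_size - 1)
    }
    # Step 3: the backup fragments that lived on the failed node are rebuilt.
    # Assumption: each one is re-derived from primary R_i on surviving node i.
    backups_to_rebuild = {}
    for i in range(cluster_size):
        if i == failed:
            continue
        j = (failed - i - 1) % cluster_size
        backups_to_rebuild[f"r{i}.{j}"] = i
    return {
        "primary_sources": primary_sources,
        "backups_to_rebuild": backups_to_rebuild,
    }


plan = recovery_plan(failed=1)
print("read to rebuild R1:", plan["primary_sources"])
print("rebuild on new node:", plan["backups_to_rebuild"])
```

For node 1 in a cluster of four, the plan names the same fragments as the list above.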
Teradata’s Interleaved Declustering
When does data become unavailable?
When a second node in a cluster fails before the first failed node in that cluster has been repaired. Note that the full analysis is a bit more complex than the discussion here.

Teradata’s Interleaved Declustering
What is the advantage of making the cluster size equal to 8?
Better distribution of the workload across the nodes in the presence of a failure.
What is the disadvantage of making the cluster size equal to 8?
Higher likelihood of data becoming unavailable.
There is a tradeoff between load balancing (in the presence of a failure) and data availability.

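A back-of-the-envelope calculation makes the tradeoff concrete. The sketch below assumes a uniform workload, an even split of the failed node's load over the survivors, and the simplified rule that any second failure in the same cluster makes data unavailable; `failure_tradeoff` is a hypothetical helper, not part of any system.

```python
from fractions import Fraction


def failure_tradeoff(cluster_size: int) -> tuple[Fraction, int]:
    """Return (extra load per surviving node, number of fatal second failures)."""
    survivors = cluster_size - 1
    # The failed node's load is split evenly over the survivors.
    extra_load = Fraction(1, survivors)
    # Simplification: any other node of the same cluster failing before
    # repair completes makes some data unavailable.
    fatal_second_failures = survivors
    return extra_load, fatal_second_failures


for size in (4, 8):
    extra, fatal = failure_tradeoff(size)
    print(f"cluster size {size}: +{extra} load per survivor, "
          f"{fatal} nodes whose failure would be fatal")
```

Going from a cluster size of 4 to 8 lowers the extra load per survivor from 1/3 to 1/7, while the number of nodes whose failure causes unavailability grows from 3 to 7.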
Gamma’s Chained Declustering
Nodes are divided into disjoint groups called relation clusters.
A relation is assigned to one relation cluster, and its records are declustered across the nodes of that relation cluster using a partitioning strategy (range or hash).
Given a primary fragment Ri, its backup copy is assigned to node (i+1) mod M, where M is the number of nodes in the relation cluster.

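The (i+1) mod M rule is small enough to state directly in code. The sketch below assumes primary fragment Ri resides on node i and names its backup ri; the function names are hypothetical.

```python
def backup_node(i: int, m: int) -> int:
    """Node that stores the backup copy of primary fragment R_i."""
    return (i + 1) % m


def chained_layout(m: int) -> dict[int, dict[str, str]]:
    """Map each node to the primary and backup fragment it stores.

    Assumes primary fragment R_i is stored on node i.
    """
    return {
        node: {"primary": f"R{node}", "backup": f"r{(node - 1) % m}"}
        for node in range(m)
    }


for node, fragments in chained_layout(8).items():
    print(node, fragments)
```

The modulo wraps the chain around, so the backup of the last fragment lands on node 0.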
Gamma’s Chained Declustering
During normal mode of operation:
Read requests are directed to the fragments of the primary copy.
Write requests update both the primary and backup copies.

Gamma’s Chained Declustering
In the presence of a failure:
Both primary and backup fragments are used for read operations. Objective: balance the load and avoid bottlenecks!
Write requests update both the primary and backup copies.
Note:
The load of R1 (on node 1) is pushed to node 2 in its entirety.
A fraction of the read requests of each node is pushed to the others, for a 1/8 load increase attributed to node 1’s failure.

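One way to split the reads so that every surviving node carries the same load is sketched below. It assumes a uniform read workload of 1 unit per fragment and M nodes with Ri on node i and its backup ri on node (i+1) mod M. With these assumptions each survivor ends up with M/(M-1) units; M = 9 in the usage lines is chosen only because it yields a 1/8 increase, and is not taken from the slides.

```python
from fractions import Fraction


def read_shares(m: int, failed: int) -> dict[int, dict[str, Fraction]]:
    """Share of reads each surviving node serves from its two fragments.

    "primary" is the share of reads on R_node served by node itself;
    "backup" is the share of reads on the preceding fragment served from
    the backup copy stored on this node.
    """
    shares = {}
    for d in range(1, m):                  # d = hops past the failed node
        node = (failed + d) % m
        shares[node] = {
            "primary": Fraction(d, m - 1),
            "backup": Fraction(m - d, m - 1),
        }
    return shares


shares = read_shares(m=9, failed=1)
for node in sorted(shares):
    s = shares[node]
    print(f"node {node}: primary {s['primary']}, backup {s['backup']}, "
          f"total {s['primary'] + s['backup']}")
```

The node right after the failed one serves its backup fragment in full and offloads most reads of its own primary fragment to the next node in the chain, which in turn offloads a slightly smaller share, and so on around the cluster.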
Gamma’s Chained Declustering
MTTR involves:
1. Replacing node 1 with a new node.
2. Reconstructing R1 on node 1 (from r1 on node 2).
3. Reconstructing the backup copy of R0 (i.e., r0) on node 1.
Note: Once node 1 becomes operational, the primary copies are again used to process read requests.

Gamma’s Chained Declustering
The failure of two nodes in a relation cluster does not necessarily result in data unavailability.
Two adjacent nodes must fail in order for data to become unavailable.

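The adjacency condition can be checked mechanically. The sketch below assumes fragment Ri has its primary on node i and its backup on node (i+1) mod M, so a fragment is lost only when both of those nodes are down; `data_available` is a hypothetical helper.

```python
from itertools import combinations


def data_available(failed: set[int], m: int) -> bool:
    """True if every fragment still has its primary or its backup online."""
    return all(
        not (i in failed and (i + 1) % m in failed)
        for i in range(m)
    )


m = 8
pairs = list(combinations(range(m), 2))
fatal = [p for p in pairs if not data_available(set(p), m)]
print(f"{len(fatal)} of {len(pairs)} two-node failures make data unavailable")
```

With 8 nodes only the 8 adjacent pairs (including the wrap-around pair 7 and 0) are fatal, out of 28 possible pairs.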
Gamma’s Chained Declustering
Re-assignment of active fragments incurs neither disk I/O nor data movement.

Join
Hash-join
A data-flow execution paradigm

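Example Join of Emp and Dept
Emp join Dept (using dno)

EMP
SS#  Name    Age  Salary  dno
1    Joe     ...  ...     2
2    Mary    ...  ...     1
3    Bob     ...  ...     1
4    Kathy   ...  ...     2
5    Shideh  ...  ...     1

Dept
dno  dname  floor  mgrss#
1    Toy    1      5
2    Shoe   2      1

Emp join Dept
SS#  Name    Age  Salary  dno  dname  floor  mgrss#
1    Joe     ...  ...     2    Shoe   2      1
2    Mary    ...  ...     1    Toy    1      5
3    Bob     ...  ...     1    Toy    1      5
4    Kathy   ...  ...     2    Shoe   2      1
5    Shideh  ...  ...     1    Toy    1      5
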
Hash-Join: 1 Node
The join of Tables A and B using attribute j (A.j = B.j) consists of two phases:
1. Build phase: Build a main-memory hash table on Table A using the join attribute j, e.g., build a hash table on the Dept table using dno as the key of the hash table.
2. Probe phase: Scan Table B one record at a time and use its attribute j to probe the hash table constructed on Table A, e.g., probe the hash table using the rows of the Emp table.

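The two phases can be sketched as follows. This is a minimal in-memory version, not Gamma's implementation: a Python dictionary stands in for the hash table, rows are plain dictionaries, and the Age and Salary columns are left out because their values are not given. The dno assignments of the Emp rows follow the example join of Emp and Dept.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, key):
    """Yield joined rows for build_rows.key = probe_rows.key."""
    # Build phase: hash the (smaller) build table on the join attribute.
    table = defaultdict(list)
    for row in build_rows:
        table[row[key]].append(row)
    # Probe phase: scan the other table one record at a time.
    for row in probe_rows:
        for match in table.get(row[key], []):
            yield {**row, **match}


dept = [
    {"dno": 1, "dname": "Toy", "floor": 1, "mgrss": 5},
    {"dno": 2, "dname": "Shoe", "floor": 2, "mgrss": 1},
]
emp = [
    {"ss": 1, "name": "Joe", "dno": 2},
    {"ss": 2, "name": "Mary", "dno": 1},
    {"ss": 3, "name": "Bob", "dno": 1},
    {"ss": 4, "name": "Kathy", "dno": 2},
    {"ss": 5, "name": "Shideh", "dno": 1},
]

for joined in hash_join(dept, emp, "dno"):
    print(joined)
```

Building on Dept keeps the hash table small, since only two rows have to stay in memory while the five Emp rows stream past.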
Hash-Join: Build
Read the rows of the Dept table one at a time and place them in a main-memory hash table.
Hash function: dno % 7
Hash table contents (dno, dname, floor, mgrss#): (1, Toy, 1, 5), (2, Shoe, 2, 1)

Hash-Join: Probe
Read the rows of the Emp table and probe the hash table.
Hash table (dno % 7): (1, Toy, 1, 5), (2, Shoe, 2, 1)
First Emp row read: SS# 1, Joe

Hash-Join: Probe
Read the rows of the Emp table, probe the hash table, and produce a result when a match is found.
Hash table (dno % 7): (1, Toy, 1, 5), (2, Shoe, 2, 1)
The Emp row with SS# 1 (Joe) matches the Dept row (2, Shoe, 2, 1) and produces the first joined row.

Hash-Join: Probe
The join terminates when all rows of the Emp table have been processed!

Hash-Join
Key challenge: The table used to build the hash table does not fit in main memory!
Solution: A divide-and-conquer approach:
1. Use the inner table (Dept) to construct n memory buckets, where each bucket is a hash table.
2. Every time memory is exhausted, spill a fixed number of buckets to disk.
3. The build phase terminates with a set of in-memory buckets and a set of disk-resident buckets.
4. Read the outer relation (Emp) and probe the in-memory buckets for joining records. Records that map onto the disk-resident buckets are streamed to disk and stored there.
5. Discard the in-memory buckets to free memory space.
6. While disk-resident buckets of the inner relation exist:
   a. Read as many (say i) of the disk-resident buckets of the inner relation into memory as possible.
   b. Read the corresponding buckets of the outer relation (Emp) to probe the in-memory buckets for joining records.
   c. Discard the in-memory buckets to free memory space.
   d. Delete the i buckets of the inner and outer relations.

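The divide-and-conquer procedure can be sketched as below. Several things are assumptions made for illustration: "disk" is simulated with Python lists, memory is counted in rows (`mem_rows`), join keys are integers, the buckets chosen for spilling are the fullest ones, and a single bucket larger than memory is not handled. For brevity, in-memory buckets are kept as row lists during the build and turned into hash tables just before probing.

```python
def spilling_hash_join(inner, outer, key, n_buckets=4, mem_rows=2, spill_count=1):
    """Join inner and outer on key, keeping about mem_rows inner rows in memory."""
    if spill_count < 1:
        raise ValueError("spill_count must be at least 1")

    def bucket_of(row):
        return row[key] % n_buckets            # assumes integer join keys

    def build(rows):
        table = {}
        for row in rows:
            table.setdefault(row[key], []).append(row)
        return table

    def probe(table, row):
        return [{**row, **match} for match in table.get(row[key], [])]

    mem = {b: [] for b in range(n_buckets)}    # in-memory buckets
    inner_disk, outer_disk = {}, {}            # simulated disk-resident buckets
    in_memory_rows = 0

    # Build phase: partition the inner table, spilling when memory is exhausted.
    for row in inner:
        b = bucket_of(row)
        if b in inner_disk:
            inner_disk[b].append(row)
            continue
        mem[b].append(row)
        in_memory_rows += 1
        while in_memory_rows > mem_rows:
            fullest = sorted(mem, key=lambda v: len(mem[v]), reverse=True)
            for victim in fullest[:spill_count]:
                rows = mem.pop(victim)
                in_memory_rows -= len(rows)
                inner_disk[victim], outer_disk[victim] = rows, []

    # Probe phase: join against in-memory buckets, stream the rest to disk.
    tables = {b: build(rows) for b, rows in mem.items()}
    results = []
    for row in outer:
        b = bucket_of(row)
        if b in outer_disk:
            outer_disk[b].append(row)
        else:
            results += probe(tables[b], row)
    mem.clear()                                # discard in-memory buckets
    tables.clear()

    # While loop: process disk-resident buckets, as many per pass as fit.
    while inner_disk:
        batch, used = [], 0
        for b in sorted(inner_disk):
            if batch and used + len(inner_disk[b]) > mem_rows:
                break
            batch.append(b)
            used += len(inner_disk[b])
        tables = {b: build(inner_disk[b]) for b in batch}
        for b in batch:
            for row in outer_disk[b]:
                results += probe(tables[b], row)
            del inner_disk[b], outer_disk[b]   # delete the processed buckets
    return results


dept = [{"dno": 1, "dname": "Toy"}, {"dno": 2, "dname": "Shoe"}]
emp = [{"name": "Joe", "dno": 2}, {"name": "Mary", "dno": 1},
       {"name": "Bob", "dno": 1}, {"name": "Kathy", "dno": 2},
       {"name": "Shideh", "dno": 1}]
for joined in spilling_hash_join(dept, emp, "dno", n_buckets=2, mem_rows=1):
    print(joined)
```

Because inner and outer rows with equal keys always land in the same bucket, each disk-resident pair of buckets can be joined in isolation, which is what makes the approach correct.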
Hash-Join: Build
Two buckets of the Dept table: one is in memory and the second is disk-resident.
Hash function: dno % 7
In-memory bucket: (1, Toy, 1, 5)
Disk-resident bucket: (2, Shoe, 2, 1)

Hash-Join: Probe
Read the Emp table and probe the hash table for joining records when dno = 1. With dno = 2, stream the data to disk.

Hash-Join: Probe
The rows of the Emp table with dno = 1 probed the hash table and produced 3 joining records.
The rows with dno = 2 (Joe and Kathy) are now disk-resident.

Hash-Join: While loop
Read the disk-resident bucket of Dept, (2, Shoe, 2, 1), into memory.
Probe it with the disk-resident bucket of the Emp table (Joe and Kathy) to produce the remaining two joining records.

Parallelism and Hash-Join
Each node may perform hash-join independently when:
1. The join attribute is the declustering attribute of the tables participating in the join operation.
2. The participating tables are declustered across the same number of nodes using the same declustering strategy.
Even then, the system may choose to re-partition the tables (see the next bullet) when the aggregate memory of the system exceeds the memory of the nodes the tables are declustered across.
Otherwise, the data must be re-partitioned to perform the join operation correctly.
Show an example!

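As one possible example, the sketch below declusters Emp on ss and Dept on dno across two nodes, so the join attribute dno is not Emp's declustering attribute. Joining fragment by fragment then misses matches, while re-partitioning both tables on dno recovers all of them. The nodes are simulated with lists, keys are assumed to be integers, and all function names are hypothetical.

```python
def partition(rows, attr, n_nodes):
    """Decluster rows across n_nodes by hashing attr (integer values assumed)."""
    fragments = [[] for _ in range(n_nodes)]
    for row in rows:
        fragments[row[attr] % n_nodes].append(row)
    return fragments


def local_hash_join(build_rows, probe_rows, key):
    """Single-node hash join: build on build_rows, probe with probe_rows."""
    table = {}
    for row in build_rows:
        table.setdefault(row[key], []).append(row)
    return [{**p, **b} for p in probe_rows for b in table.get(p[key], [])]


def join_per_node(inner_frags, outer_frags, key):
    """Each node joins only the fragments it holds, independently of the others."""
    results = []
    for inner, outer in zip(inner_frags, outer_frags):
        results += local_hash_join(inner, outer, key)
    return results


def repartition(fragments, key, n_nodes):
    """Route every row to the node chosen by hashing the join attribute."""
    return partition([row for frag in fragments for row in frag], key, n_nodes)


dept = [{"dno": 1, "dname": "Toy"}, {"dno": 2, "dname": "Shoe"}]
emp = [{"ss": 1, "name": "Joe", "dno": 2}, {"ss": 2, "name": "Mary", "dno": 1},
       {"ss": 3, "name": "Bob", "dno": 1}, {"ss": 4, "name": "Kathy", "dno": 2},
       {"ss": 5, "name": "Shideh", "dno": 1}]

dept_frags = partition(dept, "dno", 2)     # declustered on the join attribute
emp_frags = partition(emp, "ss", 2)        # declustered on a different attribute

without = join_per_node(dept_frags, emp_frags, "dno")
with_repartition = join_per_node(
    repartition(dept_frags, "dno", 2), repartition(emp_frags, "dno", 2), "dno")
print(len(without), "rows without re-partitioning")
print(len(with_repartition), "rows after re-partitioning on dno")
```

Only after re-partitioning are matching Emp and Dept rows guaranteed to meet on the same node.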
Parallelism and Hash-Join (Cont…)
R join S where R is the inner table.

Data Flow Execution Paradigm
Retrieve all those employees working for the Toy department:
SELECT *
FROM Dept d, Emp e
WHERE d.dno = e.dno AND d.dname = 'Toy'

Data Flow Execution Paradigm
Producer/consumer relationship where consumers are activated in advance of the producers.

Data Flow Execution Paradigm
The “Split Table” contains routing information for the records.
The consumers must be set up before the producers are activated.
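A toy version of the producer/consumer arrangement is sketched below. It is an illustration, not Gamma's code: the classes `Consumer` and `SplitTable` and the function `run_producer` are hypothetical, consumers are in-process objects rather than operators on remote nodes, and routing hashes an integer join attribute over the number of consumers.

```python
from collections import deque


class Consumer:
    """Stands in for a join operator instance with an input queue."""

    def __init__(self, name):
        self.name = name
        self.inbox = deque()

    def receive(self, row):
        self.inbox.append(row)


class SplitTable:
    """Routing information: maps each record to one registered consumer."""

    def __init__(self, consumers, key):
        if not consumers:
            raise ValueError("consumers must be set up before producers start")
        self.consumers = list(consumers)
        self.key = key

    def route(self, row):
        target = self.consumers[row[self.key] % len(self.consumers)]
        target.receive(row)
        return target


def run_producer(rows, split_table, predicate=lambda row: True):
    """A scan/select operator: filter rows and push them through the split table."""
    for row in rows:
        if predicate(row):
            split_table.route(row)


dept = [{"dno": 1, "dname": "Toy"}, {"dno": 2, "dname": "Shoe"}]
emp = [{"name": "Joe", "dno": 2}, {"name": "Mary", "dno": 1},
       {"name": "Bob", "dno": 1}, {"name": "Kathy", "dno": 2},
       {"name": "Shideh", "dno": 1}]

# Consumers exist first; producers are activated afterwards.
consumers = [Consumer("join-0"), Consumer("join-1")]
split = SplitTable(consumers, key="dno")
run_producer(dept, split, predicate=lambda r: r["dname"] == "Toy")
run_producer(emp, split)
for c in consumers:
    print(c.name, list(c.inbox))
```

Constructing the split table requires the consumer list, which mirrors the ordering constraint: a producer has nowhere to send records until its consumers exist.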