Slide 1: Chapter 05: Clustered Systems for Massive Parallelism. N. Xiong, Georgia State University
Slide 2: Review and Introduction
Slide 3: Chapter 04 Main Contents: Design Objectives of Clusters and MPPs; Cluster and MPP System Architectures; Design Principles of Clustered Systems; Multiple Job Scheduling and Management; Virtual Clustering and Resource Provisioning; Homework Problems
Slide 4: Design Objectives of Clustered Systems: Scalability; Packaging; Control; Homogeneity; Security
Slide 5: Design Objectives of Clustered Systems
Slide 6: Fundamental Cluster Design Issues: Scalable Performance; Single System Image; Availability Support; Cluster Job Management; Internode Communication; Fault Tolerance and Recovery. Growth of Servers in HPC and HTC Systems
Slide 7: Resource-Sharing in Cluster Systems
Slide 8: An Idealized Cluster Architecture: Conventional databases and OLTP monitors offer users a desktop environment. Supports parallel programming based on standard languages and communication libraries. A user-interface subsystem combines the advantages of the Web interface and the Windows GUI.
Slide 9: Node Architectures and System Packaging: Two types of cluster nodes (compute nodes and service nodes)
Slide 10: Compute Node Examples
Slide 11: Modular Packaging of IBM BlueGene/L System
Slide 12: Cluster System Interconnects
Slide 13: High-Bandwidth Interconnects
Slide 14: An InfiniBand Cluster Interconnection Network
Slide 15: High-Bandwidth Interconnects in Top-500 Systems
Slide 16: Hardware, Software, and Middleware Support
Slide 17: Design Principles of Clusters, Single-System-Image (SSI) Features: Single System; Single Control; Symmetry; Location Transparent
Slide 18: Design Principles of Clusters, Single-System-Image Layers: Application Software Layer; Hardware or Kernel Layer; Middleware Layer
Slide 19: Design Principles of Clusters, Single-System-Image Composition: Single Entry Point; Single File Hierarchy; Single I/O, Networking, and Memory Space; Other Desired SSI Features
Slide 20: Single Entry Point
Slide 21: Single File Hierarchy: It is persistent. It is fault tolerant to some degree. Network File System (NFS) and Andrew File System (AFS).
Slide 22: Single File Hierarchy
Slide 23: Single I/O, Networking, and Memory Space: Single Input/Output; Single Networking; Single Point of Control; Single Memory Space
Slide 24: Single I/O, Networking, and Memory Space
Slide 25: An Example
Slide 26: Other Desired SSI Features: Single Job Management System; Single User Interface; Single Process Space
Slide 27: Middleware Support for SSI Clustering
Slide 28: High Availability Through Redundancy: Reliability; Availability; Serviceability
Slide 29: Availability and Failure Rate
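The slide text gives only the title, so the following is a minimal sketch of the standard relation usually meant here: availability = MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR is the mean time to repair. The function names and the 8,760-hour year are illustrative assumptions, not taken from the slides.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the system is up.

    Uses the standard relation MTTF / (MTTF + MTTR).
    """
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


def annual_downtime_hours(avail: float, hours_per_year: float = 8760.0) -> float:
    """Expected downtime per year for a given availability (assumes a 365-day year)."""
    return (1.0 - avail) * hours_per_year
```

For example, `availability(1000, 10)` is roughly 0.990, which corresponds to about 87 hours of downtime per year. Shortening the repair time raises availability just as lengthening the time to failure does.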
Slide 30: Availability Values of Several Representative Systems
Slide 31: Redundancy Techniques
Slide 32: Fault-Tolerant Cluster Configurations: Hot Standby; Mutual Takeover; Fault-Tolerance
Slide 33: Recovery Schemes: Backward recovery; Forward recovery (in real-time systems)
Slide 34: Checkpointing and Recovery Techniques: Kernel, Library, and Application Levels; Checkpoint Overheads; Choosing an Optimal Checkpoint Interval
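The slide names the problem of choosing an optimal checkpoint interval but does not give a formula. As an illustration only, the sketch below uses Young's first-order approximation, interval ≈ sqrt(2 × checkpoint cost × MTBF), which may differ from the expression used in the lecture. The function and parameter names are hypothetical.

```python
import math


def young_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate optimal time between checkpoints, in seconds.

    Young's first-order approximation: sqrt(2 * C * M), where C is the time
    to write one checkpoint and M is the mean time between failures.
    It assumes C is much smaller than M.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)
```

With a 60-second checkpoint and a 24-hour MTBF (86,400 s), the suggested interval is about 3,220 s, or a little under 54 minutes. The trade-off it captures: checkpointing more often wastes time on overhead, while checkpointing less often loses more work per failure.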
Slide 35: Checkpointing Parallel Programs
Slide 36: Cluster Job Scheduling and Management, Cluster Job Management Issues: A user server; A job scheduler; A resource manager
Slide 37: Cluster Job Types: Serial jobs; Parallel jobs; Interactive jobs; Batch jobs; Foreign jobs
Slide 38: Multi-Job Scheduling Schemes
Slide 39: Share Cluster Nodes: Dedicated Mode; Space Sharing; Time Sharing
Slide 40: Migration Schemes, Issues: Node Availability; Migration Overhead; Recruitment Threshold (the amount of time a workstation stays unused before the cluster considers it an idle node)
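The recruitment threshold defined above can be sketched as a simple filter: a workstation becomes a candidate idle node once it has been unused for at least the threshold. The `Workstation` type, its `idle_seconds` field, and the function name are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Workstation:
    name: str
    idle_seconds: float  # hypothetical: time since the owner last used the machine


def recruitable_nodes(nodes: List[Workstation],
                      recruitment_threshold_s: float) -> List[str]:
    """Return the names of workstations the cluster may treat as idle.

    A node qualifies once it has stayed unused for at least the
    recruitment threshold.
    """
    return [n.name for n in nodes if n.idle_seconds >= recruitment_threshold_s]
```

With a 300-second threshold, a workstation idle for 600 seconds is recruited and one idle for 30 seconds is not. A low threshold recruits nodes sooner but increases the chance that the owner returns and the job has to migrate, which is where migration overhead enters.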
Slide 41: Virtual Clustering and Resource Provisioning
Slide 42: Five Virtual Cluster Research Projects
Slide 43: Live VM Migration and Cluster Management
Slide 44: Effect by Live Migration
Slide 45: Dynamic Virtual Resource Provisioning
Slide 46: Autonomic Adaptation of Virtual Environments
Slide 47: Some References and Further Reading
Slide 48: Homework Problems
Slide 49: Homework Problems