Clusters of Multiprocessor Systems

Presentation transcript:

1 Clusters of Multiprocessor Systems

2 Multiprocessing
■ Multiprocessor: a computer system containing more than one processor
■ Reasons: to increase the processing power of a system and to enable parallel processing

3 Clusters
■ A collection of workstations or PCs that are interconnected by a high-speed network
■ Works as an integrated collection of resources
■ Has a single system image spanning all its nodes

4 Cluster Computer Architecture

5 Cluster Models

6 Tightly Coupled Systems

7 Memory-Coupling, Message-Coupling and DSM
■ UMA – Uniform Memory Access
■ NUMA – Non-Uniform Memory Access
■ NORMA – No Remote Memory Access

8 Memory
a) All main memory at the global bus (B2)
b) Main memory distributed among the clusters

9 Components of Cluster Computers
■ Multiple high-performance computers
- PCs
- Workstations
- SMPs (CLUMPS)
- Distributed HPC systems leading to metacomputing
■ State-of-the-art operating systems
- Linux (Beowulf)
- Microsoft NT (Illinois HPVM)
- Sun Solaris (Berkeley NOW)
- IBM AIX (IBM SP2)
- HP-UX (Illinois PANDA)
- Mach, a microkernel-based OS (CMU)
- OS gluing layers (Berkeley GLUnix)
- Cluster operating systems (Solaris MC, Compaq TruClusters, MOSIX (academic project))

10 Components of Cluster Computers
■ High-performance networks/switches
- Ethernet (10 Mbps)
- Fast Ethernet (100 Mbps)
- Gigabit Ethernet (1 Gbps)
- SCI (Dolphin; ~12-microsecond MPI latency)
- ATM
- Myrinet (1.2 Gbps)
- Compaq Memory Channel (800 Mbps)
- Quadrics QsNet (340 MBps)
■ Fast communication protocols and services
- Active Messages (Berkeley)
- Fast Messages (Illinois)
- U-Net (Cornell)
- XTP (Virginia)
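
The slides quote per-network latencies but ship no code; as a rough illustration, a minimal MPI ping-pong in C is the usual way such one-way latencies are measured. Run with two ranks (e.g., mpirun -np 2 ./pingpong); the message size, iteration count, and program name here are illustrative choices, not from the slides.

    /* Minimal MPI ping-pong latency sketch (illustrative).
     * Rank 0 sends one byte to rank 1 and waits for the echo;
     * half the average round-trip time approximates one-way latency. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        char byte = 0;
        const int iters = 1000;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("one-way latency ~ %.2f us\n",
                   (t1 - t0) / (2.0 * iters) * 1e6);

        MPI_Finalize();
        return 0;
    }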

11 Components for Clusters
■ Processors
- Intel x86 processors (Pentium Pro and Pentium Xeon, AMD x86, Cyrix x86, etc.)
- Compaq Alpha (the Alpha processor integrates processing, memory controller, and network interface into a single chip)
- IBM PowerPC
- Sun SPARC
- SGI MIPS
- HP PA
- Berkeley Intelligent RAM (IRAM), which integrates processor and DRAM onto a single chip

12 Components for Clusters
■ Memory and cache
- Single In-line Memory Module (SIMM)
- Dual In-line Memory Module (DIMM)
- RDRAM (Rambus), SDRAM, SLDRAM (SyncLink)
- Access to DRAM is extremely slow compared to the speed of the processor (SRAM used for cache is fast but expensive, and cache-control circuitry grows more complex as cache size grows)
- 64-bit-wide memory paths
- ECC and RAID protection (high availability)
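
The DRAM-versus-cache gap above is easy to demonstrate; a small C sketch (not from the slides, and assuming POSIX clock_gettime) times the same number of array accesses sequentially and then with a large stride that defeats the cache:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1 << 24)      /* 16M ints = 64 MB, far larger than any cache */
    #define STRIDE 4096      /* 16 KB jumps defeat cache lines and prefetch */

    static double seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void)
    {
        int *a = malloc((size_t)N * sizeof *a);
        long sum = 0;
        if (!a) return 1;
        for (int i = 0; i < N; i++) a[i] = 1;   /* touch every page up front */

        double t0 = seconds();
        for (int i = 0; i < N; i++)             /* sequential: lines reused */
            sum += a[i];
        double t1 = seconds();
        for (int s = 0; s < STRIDE; s++)        /* strided: mostly DRAM misses */
            for (int i = s; i < N; i += STRIDE)
                sum += a[i];
        double t2 = seconds();

        printf("sum=%ld sequential %.3f s, strided %.3f s\n",
               sum, t1 - t0, t2 - t1);          /* print sum so loops survive */
        free(a);
        return 0;
    }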

13 Components for Clusters
■ System bus
- PCI bus (up to 64 bits wide at 66 MHz, up to 512 Mbytes/s transfer rate; adopted both in Pentium-based PCs and on non-Intel platforms)
- PCI-X (up to 1 GB/s transfer rate)
■ Disk and I/O
- Disk access time improves by less than 10% per year
- Amdahl's law (make the common case faster)
- Performance: carry out I/O operations in parallel, supported by a parallel file system based on hardware or software striping
- High availability: software RAID
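
The Amdahl's-law point is worth spelling out: if only a fraction p of total time is affected by an improvement of factor s, the overall speedup is 1 / ((1 - p) + p / s). A tiny C sketch (illustrative; the 20%/8-way numbers are made up for the example) shows why speeding up a rare case pays off so little:

    #include <stdio.h>

    /* Amdahl's law: overall speedup when fraction p is sped up by s. */
    static double amdahl(double p, double s)
    {
        return 1.0 / ((1.0 - p) + p / s);
    }

    int main(void)
    {
        /* e.g. striping disks 8 ways only helps the 20% of time in I/O */
        printf("speedup = %.2f\n", amdahl(0.20, 8.0));   /* ~1.21x */
        return 0;
    }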

14 DASH and CEDAR
DASH
■ Cluster-based machine developed at Stanford
■ Each cluster is a 4-CPU bus-based Silicon Graphics Power System/340 with 32 Mbytes of memory visible to all processors in the machine
■ The clusters are interconnected by a 2-D mesh
■ Each processor has three direct-mapped caches
■ The CPUs are 33-MHz MIPS R3000s
■ Cache coherence is supported in hardware: a snoopy protocol within each cluster and a directory-based one across clusters
CEDAR
■ A 4-cluster vector multiprocessor developed at the University of Illinois' Center for Supercomputing Research and Development
■ Each cluster is an 8-CPU bus-based Alliant FX/8
■ All processors in a cluster share a 512-Kbyte direct-mapped cache and 64 Mbytes of memory visible only to the cluster
■ Fast synchronization is possible via a per-cluster synchronization bus
■ Each processor has a 4-Kbyte prefetch buffer
■ All processors in the machine are connected to 64 Mbytes of shared memory via forward and return omega networks, with no caches

15 HP/Convex Exemplar and HP 9000 V2200 Enterprise Server
HP/Convex Exemplar
■ The Exemplar X-Class is the second-generation SPP from HP/Convex
■ A ccNUMA architecture comprised of multiple nodes
■ Each node may contain up to 16 PA-8000 processors, 16 Gbytes of memory, and 8 PCI buses
■ Memory access is UMA within each node and is accomplished via a nonblocking crossbar
■ Each node can correctly be considered a symmetric multiprocessor
■ The interconnect between nodes is a derivative of the IEEE SCI standard, which permits up to 32 nodes to be connected in a two-dimensional topology
■ The system includes features to aid high-performance engineering/scientific computations
HP 9000 V2200 Enterprise Server
■ Includes one CPU, 256 Mbytes of memory, and an unlimited HP-UX license
■ The machine features up to 16-way SMP, with up to 32-way SMP to follow
■ Supports nonuniform memory access (NUMA) at the hardware level, while HP-UX will support NUMA at the operating-system level, so the machine can be combined with other HP 9000 machines when that support ships

16 Motorola/IBM PowerPC 604/604e
■ Four-issue superscalar RISC processors from IBM Microelectronics and Motorola, implementing the PowerPC architecture specification
■ Targeted at general-purpose desktop computing; found design wins in the Apple Macintosh line of personal computers and in Macintosh clones
■ The fastest version of the PowerPC 604 operates at a clock speed of 180 MHz with a 3.3-volt supply
■ The PowerPC 604e is an enhanced version of the 604 and can operate at 225 MHz with a 2.5-volt supply for the processor core and a 3.3-volt supply for I/O
■ The 604 uses a superscalar RISC architecture and can dispatch and complete up to four instructions in a single clock cycle
■ It operates on 32-bit instructions and integer data and on 64-bit double-precision or 32-bit single-precision floating-point data
■ It has independent floating-point and integer data paths
■ It supports virtual memory via separate instruction and data TLBs for fast address translation
■ Both the instruction and data caches can be locked

17 Motorola/IBM PowerPC 604/604e

18 AlphaServer 8400
■ Supports up to twelve Alpha 21164 microprocessors and 14 gigabytes of memory, creating breakthroughs in very-large-database performance
■ Provides a viable alternative to supercomputers and mainframes, with a peak throughput of 6.6 GF (gigaflops; i.e., 550 peak MFLOPS per processor across twelve CPUs)

19 Architecture of NOW System

20 Beowulf Clusters
■ Simple and highly configurable
■ Low cost
■ Networked
- Computers connected to one another by a private Ethernet network
- Connection to an external network is through a single gateway computer
■ Configuration
- COTS: commodity off-the-shelf components such as inexpensive computers
- Blade components: computers mounted on a motherboard that plug into connectors on a rack
- Either a shared-disk or a shared-nothing model
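
A Beowulf cluster typically runs message-passing jobs over its private network. A minimal C/MPI program of the kind launched on such a cluster is sketched below (illustrative, not from the slides; the hostfile name "nodes" and the launch line are assumptions):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's rank */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total processes in job */
        MPI_Get_processor_name(host, &len);     /* which cluster node */

        printf("rank %d of %d on node %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }

A typical launch spreads ranks across the private-network nodes listed in a hostfile, e.g.: mpirun -np 8 -hostfile nodes ./hello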

21 Blade and Rack of Beowulf Cluster

22 Cluster Model
■ Each node is an SMP containing 4 Pentium Pro processors
■ 8 nodes connected
■ Communication
– Ethernet using MPICH
– Myrinet using NICAM (Network Interface Communication using Active Messages)
■ Solaris OS
■ Single user, single job

23 Programming SMP Clusters
Shared-memory architecture
■ Thread synchronization and mutual exclusion are needed (see the sketch after this list)
■ Communication overhead is low
■ Performance is limited by system-bus bandwidth
■ Possibility of full cache utilization
Distributed-memory architecture
■ Implicit synchronization when exchanging messages
■ Communication overhead is high
■ Performance is limited by network bandwidth
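
The mutual-exclusion point is the classic shared-memory pitfall. A minimal pthreads sketch in C (illustrative; the counter, thread count, and iteration count are made up) shows the lock that makes concurrent updates on an SMP node correct:

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* without this, updates race */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];                   /* e.g. one thread per CPU of a quad SMP */
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);   /* 400000 with the lock held */
        return 0;
    }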

24 Quad Pentium shared memory multiprocessor system

25 General model of a shared memory multiprocessor system with caches

26 Scalable Cache Coherent Systems
■ Scalable, distributed memory plus coherent replication
■ Scalable distributed-memory machines
- Processor-cache-memory (P-C-M) nodes connected by a network
- A communication assist interprets network transactions and forms the interface
■ The final point is a shared physical address space
- A cache miss is satisfied transparently from local or remote memory
■ The natural tendency of a cache is to replicate data
- But there is no broadcast medium to snoop on, so coherence must be tracked another way (e.g., a directory)
■ Not only the hardware latency but also the protocol must scale
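
Directory-based protocols of the kind DASH implements in hardware keep per-memory-block state at the block's home node. A C sketch of one common organization, a full-bit-vector directory entry, is below (illustrative; the state names, field widths, and 64-node limit are assumptions, not DASH's actual encoding):

    #include <stdint.h>

    enum dir_state { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED };

    struct dir_entry {
        enum dir_state state;   /* coherence state of this memory block */
        uint64_t sharers;       /* one presence bit per node (<= 64 nodes) */
        int owner;              /* owning node when state == DIR_MODIFIED */
    };

    /* On a remote read miss, the home node consults the entry: if the
     * block is DIR_MODIFIED it forwards the request to `owner`;
     * otherwise it replies from memory and sets the requester's bit. */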

27 Two-level Hierarchies

28 Bus Hierarchies with Distributed Memory
■ Main memory distributed among clusters
- Each cluster is a full-fledged bus-based machine, memory and all
- Memory scales automatically (each cluster brings some with it)
- Good placement can reduce global-bus traffic and latency, but latency to far-away memory may be larger than to the root

29 Benchmarking HPC (High Performance Computing) Clusters Using AMD Opteron Processors
■ The AMD64 architecture, and specifically the AMD Opteron processor, is a strong platform for HPC needs
■ Well suited to workloads with large datasets, extensive memory requirements, and heavy integer or floating-point arithmetic
■ The Opteron offers low-latency, scalable memory bandwidth through its on-chip memory controller
■ Minimizing memory latency matters in multiprocessor systems (see the sketch below)
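
On a multi-socket Opteron box, each socket's on-chip memory controller makes local memory faster than memory behind another socket, so one way to minimize latency is to allocate on the node that will do the work. A C sketch using Linux's libnuma follows (an assumption: the slides name no API; requires numa.h and linking with -lnuma, and the buffer size is arbitrary):

    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this kernel\n");
            return 1;
        }
        int node = numa_node_of_cpu(0);                /* node owning CPU 0 */
        size_t bytes = 64 * 1024 * 1024;
        double *buf = numa_alloc_onnode(bytes, node);  /* local to that node */
        if (!buf) return 1;
        /* ... compute on buf from a thread pinned to CPU 0, so every
         * access stays on the local memory controller ... */
        numa_free(buf, bytes);
        return 0;
    }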

30 High Performance Computing
■ Minimize the turnaround time to complete a specific application problem
■ Maximize the problem size that can be solved in a given amount of time

31 The End

