Multiprocessors & Multicomputers


1 Multiprocessors & Multicomputers

2 MIMD Independent program control
[Diagram: control units CU1..CUn issue instruction streams IS1..ISn to processing elements PE1..PEn, which exchange data streams DS1..DSn with memory modules MM1..MMm]
Independent program control: each processing element executes its own program, even if all processing elements execute the same program on different data sets.
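Since the slide notes that all PEs may run the same program on different data sets (the SPMD style), here is a minimal sketch of that style using MPI; it is a generic illustration, not code for any machine discussed in these slides.

```c
/* SPMD sketch: every MPI rank runs this same program, but each works on
 * its own disjoint portion of the data (hypothetical example). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which PE am I?         */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many PEs in total? */

    /* Same code, different data: each rank sums a disjoint index range. */
    long local = 0;
    for (long i = rank; i < 1000000; i += size)
        local += i;

    long total = 0;
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %ld\n", total);

    MPI_Finalize();
    return 0;
}
```

Launched with, say, mpirun -np 4, the four processes form four independent instruction streams, each looping over its own subset of the indices.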

3 MIMD Classification based on memory access
MIMD
- Multiprocessors (single address space, shared memory)
  - Uniform Memory Access (UMA), central memory: Parallel Vector Processor, Symmetric Multiprocessor
  - Non-Uniform Memory Access (NUMA), distributed memory: Cache Only Memory Access, Cache Coherent NUMA, Non-Cache Coherent NUMA, Software-Coherent NUMA
- Multicomputers (multiple address spaces, non-shared memory)
  - No-Remote Memory Access (NORMA): Cluster, Massively Parallel Processor

4 MIMD Shared-Memory Interconnection Networks
- Common bus
- Multiple bus
- Crossbar (Xbar)
- Multiport memory
- Multistage switch

5 MIMD Shared-Memory UMA – Uniform Memory Access
Five dimensions:
- Control: distributed
- Data flow: parallel
- Address space: global
- Physical memory: central
- Interconnect: static or dynamic
[Diagram: processors attached through an interconnect to a central memory]

6 MIMD Shared-Memory Symmetric vs Non-symmetric Multiprocessor
[Diagram: two bus-based organizations. In the symmetric multiprocessor every processor has equal access to the shared memory and to the I/O and LAN devices; in the non-symmetric multiprocessor the I/O and LAN devices are attached to particular processors]

7 MIMD Shared-Memory Symmetric Multiprocessor Systems
Characteristics of five systems: DEC AlphaServer /440, HP 9000/T600, IBM RS6000/R40, Sun Ultra Enterprise 6000, SGI Power Challenge XL
- No. of processors: 12; 8; 30; 36
- Processor type: 437 MHz Alpha 21164; 180 MHz PA 8000; 112 MHz PowerPC 604; 167 MHz UltraSPARC I; 195 MHz MIPS R10000
- Max memory: 28 GB; 16 GB; 2 GB; 30 GB
- Interconnect bandwidth: bus 2.1 GB/s; 960 MB/s; bus + Xbar 1.8 GB/s; 2.6 GB/s; 1.2 GB/s
- Internal disk: 192 GB; 168 GB; 38 GB; 63 GB; 144 GB

8 MIMD Shared-Memory Non-Uniform Memory Access
Five dimensions:
- Control: distributed
- Data flow: parallel
- Address space: global (one global logical address space)
- Physical memory: distributed
- Interconnect: static or dynamic
[Diagram: processor-memory pairs (nodes) connected by an interconnect, together forming a single global logical address space]

9 MIMD Shared-Memory Cache-Only Memory Access
All local memories are structured as caches (COMA caches)
The only architecture providing hardware support for replicating the same cache block in multiple local caches
Examples: KSR-1, DDM
[Diagram: a "load R1, A" on node P causes the block holding A, resident in node Q's COMA cache, to be replicated into node P's COMA cache]
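As a rough software analogy for this attraction-memory behaviour (not a description of how KSR-1 or DDM hardware actually works), the toy model below replicates a block into the reading node's local "COMA cache" on a miss:

```c
/* Toy software model of COMA attraction memory (illustration only):
 * data blocks have no fixed home; a read miss copies the block into the
 * reader's local attraction memory, so the same block may end up
 * replicated in several nodes. */
#include <stdio.h>
#include <string.h>

#define NODES  2
#define BLOCKS 4                 /* capacity of each node's attraction memory */

typedef struct { int tag; int valid; int data; } Block;
static Block attraction[NODES][BLOCKS];

/* Read address 'addr' from node 'me'; replicate the block on a miss. */
static int coma_read(int me, int addr)
{
    int slot = addr % BLOCKS;
    if (attraction[me][slot].valid && attraction[me][slot].tag == addr)
        return attraction[me][slot].data;              /* local hit          */

    for (int n = 0; n < NODES; n++)                    /* search other nodes */
        if (n != me && attraction[n][slot].valid &&
            attraction[n][slot].tag == addr) {
            attraction[me][slot] = attraction[n][slot];/* replicate the copy */
            return attraction[me][slot].data;
        }
    return -1;                                         /* resident nowhere   */
}

int main(void)
{
    memset(attraction, 0, sizeof attraction);
    attraction[1][2] = (Block){ .tag = 2, .valid = 1, .data = 42 }; /* A in node Q */
    printf("node P reads A: %d\n", coma_read(0, 2));   /* miss: replicated from Q */
    printf("node P reads A again: %d\n", coma_read(0, 2)); /* now a local hit */
    return 0;
}
```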

10 MIMD Shared-Memory Non-Cache Coherent NUMA
Besides local memory, each node has a set of node-level registers called E-registers
A value from address A may be loaded into an E-register and then transferred to a processor register or to local memory
The cache block containing address A is not automatically copied into the local processor cache or the local memory
Example: Cray T3E
It is possible to implement cache coherency in software (Software-Coherent NUMA, Distributed Shared Memory)
Examples: TreadMarks, Wind Tunnel, IVY, Shrimp
[Diagram: node P issues "load E1, A" to fetch the value at address A from node Q's local memory into one of P's E-registers]

11 MIMD Shared-Memory Cache Coherent NUMA
Five dimensions:
- Control: distributed
- Data flow: parallel
- Address space: global
- Physical memory: distributed, with cache coherency
- Interconnect: static or dynamic
[Diagram: each node holds a processor, memory, a directory and a memory management unit on a local bus; nodes are joined by the interconnect. Directory entry: node, block, offset]
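The "directory entry: node, block, offset" annotation is essentially a decomposition of a global address; the sketch below shows one illustrative way to split an address into those fields (the bit widths are assumptions, not taken from any real machine).

```c
/* Sketch: decomposing a global physical address into the <node, block,
 * offset> triple that a CC-NUMA directory/MMU works with.
 * Field widths are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 6    /* 64-byte cache blocks        (assumption) */
#define BLOCK_BITS  20   /* blocks per node's memory    (assumption) */

typedef struct { unsigned node; unsigned block; unsigned offset; } GlobalAddr;

static GlobalAddr decode(uint64_t addr)
{
    GlobalAddr g;
    g.offset = addr & ((1u << OFFSET_BITS) - 1);
    g.block  = (addr >> OFFSET_BITS) & ((1u << BLOCK_BITS) - 1);
    g.node   = (unsigned)(addr >> (OFFSET_BITS + BLOCK_BITS)); /* home node */
    return g;
}

int main(void)
{
    GlobalAddr g = decode(0x00000003004A2F40ull);
    printf("node=%u block=%u offset=%u\n", g.node, g.block, g.offset);
    return 0;
}
```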

12 MIMD Shared-Memory Cache Coherent NUMA
An instruction loads the value of A into a local processor register R1
The cache block of A is automatically copied into a node-level cache called the remote cache (RC), but not into the local node memory
Two approaches: only one cache copy, or multiple cache copies
[Diagram: node P executes "load R1, A"; the block holding A, homed in node Q's local memory, is copied into node P's remote cache]

13 MIMD Shared-Memory Snooping caches and cache coherence protocols
Snoopy caches: the cache controller is specially designed to monitor all bus requests ("snoop") and take appropriate actions in certain cases
Write-through cache coherence protocol:
- Read miss: local request fetches data from memory; remote request needs no action
- Read hit: local request uses data from the local cache; remote request needs no action
- Write miss: local request updates data in memory; remote request needs no action
- Write hit: local request updates cache and memory; remote request invalidates the cache entry
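Read as a state machine, the table above boils down to very little code. The C sketch below paraphrases it (the "no action" remote cases simply return the current state); it is a sketch, not a cache-controller implementation.

```c
/* Write-through snooping protocol from the table above.
 * A line is either VALID or INVALID in the local cache. */
typedef enum { INVALID, VALID } WtState;
typedef enum { READ, WRITE } Op;

/* State of the local cache line after a LOCAL request. */
WtState wt_local(WtState s, Op op)
{
    if (op == READ)
        return VALID;   /* read miss: fetch from memory; read hit: use local copy */
    /* write: memory is always updated (write-through);
       on a hit the cached copy is updated too, on a miss nothing is allocated */
    return s;
}

/* State of the local cache line after snooping a REMOTE request on the bus. */
WtState wt_remote(WtState s, Op op)
{
    if (op == READ)
        return s;       /* memory is up to date, a remote read needs nothing from us */
    return INVALID;     /* remote write hit: invalidate our now-stale copy */
}
```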

14 MIMD Shared-Memory Snooping caches and cache coherence protocols
Write-back MESI cache coherence protocol:
- Modified – the entry is valid; memory is invalid; no copies exist
- Exclusive – no other cache holds the line; memory is up to date
- Shared – multiple caches may hold the line; memory is up to date
- Invalid – the cache entry does not contain valid data
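One compact way to read the four states is as a transition function for a single cache line. The sketch below covers the common transitions and deliberately omits details such as write-backs and which cache supplies the data, so it is an illustration rather than a full MESI implementation.

```c
/* Simplified MESI transitions for one cache line, seen from one cache. */
typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } Mesi;

typedef enum {
    LOCAL_READ, LOCAL_WRITE,   /* requests from this cache's own CPU  */
    BUS_READ,   BUS_WRITE      /* requests snooped from other caches  */
} Event;

Mesi mesi_next(Mesi s, Event e, int other_caches_have_copy)
{
    switch (e) {
    case LOCAL_READ:
        if (s == INVALID)                    /* read miss */
            return other_caches_have_copy ? SHARED : EXCLUSIVE;
        return s;                            /* M, E, S read hits stay put */
    case LOCAL_WRITE:
        return MODIFIED;                     /* I/S/E: gain ownership (invalidate others), then M */
    case BUS_READ:
        if (s == MODIFIED || s == EXCLUSIVE) /* another cache reads our line:   */
            return SHARED;                   /* supply/flush data, drop to S    */
        return s;
    case BUS_WRITE:
        return INVALID;                      /* another cache writes: our copy is stale */
    }
    return s;
}
```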

15 MIMD Shared-Memory Invalidate vs Update strategy
Instead of invalidating its entry when it snoops a remote write hit, a cache can accept the new value and update its own copy
Conceptually, updating the cache is equivalent to invalidating it and then re-reading the word from (up-to-date) memory
The two protocols perform differently under different loads
Update messages carry data payloads and are therefore larger than invalidate messages, but they may prevent future cache misses
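The size difference can be made concrete by comparing the two message formats; the structs below are purely illustrative (the 64-byte block size and field layout are assumptions).

```c
/* Why update messages are larger: an invalidation only names the affected
 * block, while an update must also carry the freshly written data. */
#include <stdint.h>

#define BLOCK_SIZE 64                  /* bytes per cache block (assumption) */

struct InvalidateMsg {
    uint64_t block_addr;               /* just the identity of the block     */
};

struct UpdateMsg {
    uint64_t block_addr;
    uint8_t  new_data[BLOCK_SIZE];     /* payload: the new block contents    */
};
```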

16 MIMD Shared-Memory Cache coherence for large multiprocessors
Snooping caches are technically feasible only for relatively small multiprocessor systems (up to 64 processors, typically 2 or 4)
For large multiprocessors a different approach is required, known as directory-based multiprocessor cache coherence
A database is maintained that records where each cache line is stored and what its status is
When a cache line is referenced, the database is queried to find out where the line is and whether it is clean or dirty (modified)
Because the database is queried on every instruction that references memory, extremely fast special-purpose hardware is required to achieve acceptable response time
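A common way to organize such a database is one entry per memory block, holding a presence-bit vector and a dirty/owner field. The sketch below is a generic illustration; the field sizes and the exact invalidation policy are assumptions, not those of any machine in the table that follows.

```c
/* Sketch of a per-block directory entry and the two basic lookups. */
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t presence;   /* bit i set => cache of node i holds this block    */
    bool     dirty;      /* true => exactly one cache holds a modified copy  */
    uint8_t  owner;      /* valid only when dirty: which node owns the block */
} DirEntry;

/* Read request from 'node': fetch the clean copy, record the new sharer. */
static void dir_read(DirEntry *e, unsigned node)
{
    if (e->dirty) {
        /* the up-to-date copy sits in e->owner's cache: fetch it and
         * write it back to memory before satisfying the read            */
        e->dirty = false;
    }
    e->presence |= 1ull << node;        /* requester now holds a clean copy */
}

/* Write request from 'node': invalidate all other copies first. */
static void dir_write(DirEntry *e, unsigned node)
{
    uint64_t others = e->presence & ~(1ull << node);
    (void)others;                       /* send an invalidation per set bit */
    e->presence = 1ull << node;
    e->dirty    = true;
    e->owner    = (uint8_t)node;
}

int main(void)
{
    DirEntry e = { 0 };
    dir_read(&e, 3);    /* node 3 reads: presence = {3}, clean                 */
    dir_write(&e, 5);   /* node 5 writes: invalidate node 3, owner = 5, dirty  */
    return 0;
}
```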

17 MIMD Shared-Memory Cache Coherent NUMA Architectures
Features compared across Stanford DASH, Sequent NUMA-Q, HP/Convex Exemplar and SGI/Cray Origin 2000:
- Node architecture: 4-CPU SMP node with snoopy bus; 8-CPU SMP node with crossbar; 2-CPU non-SMP node with HUB
- Internode connection: 2D mesh; SCI ring (Scalable Coherent Interface); multiple SCI rings; fat hypercube
- Cache coherency protocol: snoopy within each node and directory globally; SCI linked-list coherence protocol; modified from the SCI protocol; modified from the DASH protocol
- Other performance features: intranode cache-to-cache sharing; node cache, gang scheduling, processor affinity; node cache; gang scheduling, page migration, placement and replication

18 MIMD Distributed-Memory No-Remote Memory Access
Five dimensions:
- Control: distributed
- Data flow: parallel
- Address space: local (multiple local address spaces)
- Physical memory: distributed
- Interconnect: static or dynamic
[Diagram: processor-memory nodes, each with its own local address space, connected by an interconnect]

19 MIMD Distributed-Memory NORMA - Message Passing Architecture
Evolution of message-passing systems involved the addition of:
- a dedicated message processor
- network interface circuitry (NIC)
[Diagram: node P executes "recv B from Q" while node Q executes "send A to P"; in each node the processor and local memory reach the interconnect through the NIC]
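The "send A to P" / "recv B from Q" pair in the figure maps directly onto point-to-point message passing as exposed by, for example, MPI. The sketch below shows that correspondence; the rank numbering and the value sent are arbitrary.

```c
/* Minimal message-passing sketch: rank 1 ("node Q") sends a value,
 * rank 0 ("node P") receives it. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {                    /* "node Q": send A to P */
        double A = 3.14;
        MPI_Send(&A, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
    } else if (rank == 0) {             /* "node P": recv B from Q */
        double B;
        MPI_Recv(&B, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("node P received %g\n", B);
    }

    MPI_Finalize();
    return 0;
}
```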

20 MIMD Distributed-Memory Massively Parallel Processors
Massively Parallel Processor (MPP): a large-scale computer system consisting of hundreds or thousands of processors
Scalable to thousands of processors with a proportional increase in memory and I/O capacity and bandwidth
[Diagram: each node holds processors/caches (P/C), memory, disks and other I/O on a local interconnect, and attaches through a NIC to a high-speed network (HSN) that also carries shared I/O]

21 MIMD Distributed-Memory Massively Parallel Processors
MPP models: Intel/Sandia ASCI Option Red, IBM SP2, SGI/Cray Origin 2000
- Large sample configuration: 9072 processors, 1.8 Tflops at SNL; 400 processors, 100 Gflops at MHPCC; 128 processors, 51 Gflops at NCSA
- Available date: December 1996; September 1994; October 1996
- Processor type: 200 MHz, 200 Mflops Pentium Pro; 67 MHz, 267 Mflops POWER2; 200 MHz, 400 Mflops MIPS R10000
- Node architecture and data storage: 2 processors, 32 to 256 MB memory, shared disk; 1 processor, 64 MB to 2 GB local memory, 1 to 4.5 GB local disk; 2 processors, 64 MB to 256 GB of DSM, shared disk
- Interconnect and memory model: split 2D mesh, NORMA; multistage network; fat hypercube, CC-NUMA
- Node operating system: light-weight kernel (LWK); complete AIX (IBM Unix); microkernel Cellular IRIX
- Native programming mechanism: MPI based on PUMA Portals; MPI and PVM; Power C, Power Fortran
- Other programming mechanisms: Nx, PVM, HPF; HPF, Linda; MPI, PVM

22 Processors for MIMD architectures
Early MIMD machines (e.g. Caltech's Cosmic Cube) used cheap off-the-shelf processors with purpose-built inter-processor communications hardware
The motivation for building early parallel computers was that many cheap microprocessors could give similar performance to an expensive Cray vector supercomputer
Later machines (e.g. nCUBE, transputers) used proprietary processors with on-chip communications hardware
These could not compete with the rapid increase in performance of mass-produced processors for workstations and PCs: a year-old 16-processor nCUBE or transputer machine typically had the same performance as a new single-processor workstation!

23 Processors for MIMD architectures
Current MIMD machines use off-the-shelf processors, usually the RISC processors found in state-of-the-art high-performance workstations (IBM RS-6000, SGI MIPS, DEC Alpha, Sun UltraSPARC, HP PA-RISC)
Processors for PCs are now of comparable performance, and the first machine to reach 1 Teraflop, the 9200-processor ASCI Red from Intel, uses Pentium Pro processors
These machines still need special communications hardware: expensive high-speed networks and switches

24 MIMD Clusters Basic concepts of clustering
A cluster is a collection of complete computers (nodes) physically interconnected by a high-performance network or a local-area network that work collectively as a single system to provide uninterrupted (availability) and efficient (performance) services

25 MIMD Clusters Basic concepts of clustering
[Diagram: layered cluster architecture. Programming environment and applications on top; availability and single-system-image infrastructure below that; a complete operating system on each node; the nodes themselves, joined by a commodity or proprietary interconnect]

26 MIMD Clusters Basic concepts of clustering

27 MIMD Clusters Basic concepts of clustering
Cluster nodes
- Each node is a complete computer, with its own processor(s), cache, memory, disk and I/O facilities and a complete, standard operating system
- The cluster appears as a single system to users and applications
- The cluster is homogeneous, or nearly so
A node may be:
- a typical PC: cluster of PCs (CoPs), piles of PCs (PoPs)
- a workstation: cluster of workstations (COWs)
- an SMP: cluster of symmetric multiprocessors/multiprocessors (CLUMPs), recently very popular in high-performance supercomputing

28 MIMD Clusters Basic concepts of clustering
Single-System Image (SSI)
- A cluster is a single computing resource
- SSI is the illusion, created by software or hardware, that presents a collection of resources as one, more powerful resource
- SSI makes the cluster appear like a single machine to the user, to applications, and/or to the network
- SSI support can exist at different levels within the system, with one level able to be built on another

29 MIMD Clusters Basic concepts of clustering
Internode connection
The nodes of a cluster are usually connected through a commodity network:
- Ethernet (10 Mbps)
- Fast Ethernet (100 Mbps)
- Gigabit Ethernet (1 Gbps)
- SCI (Dolphin; 12 microsecond MPI latency)
- ATM (622 Mbps)
- Myrinet (1.2 Gbps)
- Compaq Memory Channel (800 Mbps)
- Quadrics QsNET (340 MBps)
Standard protocols are used to smooth internode communication

30 MIMD Clusters Basic concepts of clustering
Enhanced availability
- Clustering offers a cost-effective way to enhance the availability of a system (the percentage of time the system remains available to its users)
Better performance (in certain areas)
- Superserver: if each of n nodes can serve m clients, the cluster can serve m*n clients
- Large-grain distributed parallel processing

31 MIMD Clusters Scalable performance vs system availability
[Chart: uniprocessor (UP), SMP, fault-tolerant systems, MPP and cluster positioned on axes of scalable performance versus system availability]

32 MIMD Clusters Dedicated vs Enterprise clusters
Attributes of the two cluster types:
- Packaging: compact (dedicated cluster) vs slack (enterprise cluster)
- Control: centralized (dedicated) vs decentralized (enterprise)
- Homogeneity: homogeneous (dedicated) vs heterogeneous (enterprise)
- Security: enclosed (dedicated) vs exposed (enterprise)

33 MIMD Clusters Dedicated clusters
Dedicated clusters are used as substitutes for traditional mainframes and supercomputers
- Typically installed in a deskside rack in a central computer room
- Typically homogeneously configured, with the same type of nodes
- Managed by a single administrator group, like a mainframe
- Typically accessed via a front-end system

34 MIMD Clusters Enterprise clusters
Enterprise clusters are mainly used to utilize idle resources in the nodes
- Each node is an SMP, workstation or PC
- Nodes are typically geographically distributed
- Nodes are individually owned by multiple owners; the cluster administrator has only limited control over them (an owner's local jobs have higher priority than enterprise jobs)
- The cluster is often configured with heterogeneous computer nodes
- Nodes are often connected through a low-cost Ethernet

35 MIMD Clusters Cluster architectures
Three cluster architectures:
- Shared nothing: nodes connected only through the LAN
- Shared disk: nodes also connected through the I/O bus to a shared disk; used in small-scale availability clusters for business applications
- Shared memory / clusters of multiprocessors (SMPs), "CLUMPs": an emerging direction in HPC
[Diagram: the three organizations, with NICs attaching each node's processors (P/C), memory (M), I/O and disks (D) to a LAN, a shared disk, or an SCI interconnect respectively]

36 Sample Commercial Clusters
Company, system name and brief description:
- DEC VMS-Clusters – high-availability cluster for VMS
- DEC TruClusters – Unix cluster of SMP servers
- DEC NT Clusters – Alpha-based clusters for Windows NT
- HP Apollo 9000 Cluster – computational cluster
- HP MC/ServiceGuard – HP NetServer cluster solution for NT
- IBM Sysplex – shared-disk mainframe cluster for commercial batch and OLTP
- IBM HACMP – high-availability cluster multiprocessing
- IBM Scalable POWERparallel (SP) – workstation cluster built with POWER2 nodes and an Omega switch as a scalable MPP
- Microsoft Wolfpack – an open standard for clustering of Windows NT servers
- SGI POWER CHALLENGE array – a scalable cluster of SMP server nodes built with a HiPPI switch for distributed parallelism
- Sun Solaris MC – extension of Solaris for Sun workstation clusters
- Sun SPARCluster 1000/2000 PDB – high-availability cluster server for OLTP and database processing
- Tandem Himalaya – highly scalable and fault-tolerant cluster with duplexed nodes for OLTP and databases
- Marathon MIAL2 – high-availability cluster with complete redundancy and failback

