Clusters of Multiprocessor Systems

Multiprocessing
■ Multiprocessor - Computer system containing more than one processor
■ Reasons
  - Increase the processing power of a system
  - Parallel processing

Clusters
■ A collection of workstations or PCs that are interconnected by a high-speed network
■ Work as an integrated collection of resources
■ Have a single system image spanning all their nodes

Cluster Computer Architecture

Cluster Models

Tightly Coupled Systems

Memory-Coupling, Message-Coupling and DSM
■ UMA – Uniform Memory Access
■ NUMA – Non-Uniform Memory Access
■ NORMA – No Remote Memory Access

Memory
a) All main memory at the global bus (B2)
b) Main memory distributed among the clusters

Components of Cluster Computers
■ Multiple High Performance Computers
  - PCs
  - Workstations
  - SMPs (CLUMPS)
  - Distributed HPC systems leading to metacomputing
■ State-of-the-art Operating Systems
  - Linux (Beowulf)
  - Microsoft Windows NT (Illinois HPVM)
  - Sun Solaris (Berkeley NOW)
  - IBM AIX (IBM SP2)
  - HP-UX (Illinois PANDA)
  - Mach, a microkernel-based OS (CMU)
  - OS gluing layers (Berkeley GLUnix)
  - Cluster operating systems (Solaris MC, Compaq TruClusters, MOSIX (academic project))

Components of Cluster Computers
■ High Performance Networks/Switches
  - Ethernet (10 Mbps)
  - Fast Ethernet (100 Mbps)
  - Gigabit Ethernet (1 Gbps)
  - SCI (Dolphin – about 12-microsecond MPI latency)
  - ATM
  - Myrinet (1.2 Gbps)
  - Compaq Memory Channel (800 Mbps)
  - Quadrics QsNet (340 MBps)
■ Fast Communication Protocols and Services
  - Active Messages (Berkeley)
  - Fast Messages (Illinois)
  - U-Net (Cornell)
  - XTP (Virginia)

Components for Clusters
■ Processors
  - Intel x86 processors (Pentium Pro and Pentium Xeon, AMD x86, Cyrix x86, etc.)
  - Compaq Alpha (the Alpha 21364 processor integrates processing, memory controller and network interface into a single chip)
  - IBM PowerPC
  - Sun SPARC
  - SGI MIPS
  - HP PA
  - Berkeley Intelligent RAM (IRAM), which integrates processor and DRAM onto a single chip

Components for Clusters
■ Memory and Cache
  - Single Inline Memory Module (SIMM)
  - Dual Inline Memory Module (DIMM)
  - RDRAM (Rambus), SDRAM, SLDRAM (SyncLink)
  - Access to DRAM is extremely slow compared to the speed of the processor (SRAM used for cache is fast but expensive, and cache control circuitry becomes more complex as the size of the cache grows)
  - 64-bit wide memory paths
  - ECC and RAID protection (high availability)

Components for Clusters
■ System Bus
  - PCI bus (up to 64 bits wide at 66 MHz, up to 512 Mbytes/s transfer rate; adopted both in Pentium-based PCs and non-Intel platforms)
  - PCI-X (up to 1 GB/s transfer rate)
■ Disk and I/O
  - Disk access time improves less than 10% per year
  - Amdahl's law: make the common case faster (see the worked example below)
  - Performance – carry out I/O operations in parallel, supported by a parallel file system based on hardware or software striping
  - High availability – software RAID
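
A quick worked example of Amdahl's law as it applies to I/O (the numbers, file name and helper function are illustrative only, not from the slides):

    /* Amdahl's law: if a fraction f of the runtime is sped up by a factor s,
     * overall speedup = 1 / ((1 - f) + f / s).
     * Example: I/O is 20% of runtime and 4-way striping makes it 4x faster,
     * giving only 1 / (0.8 + 0.2/4) ~= 1.18x overall - hence "make the
     * common case faster". */
    #include <stdio.h>

    static double amdahl_speedup(double f, double s)
    {
        return 1.0 / ((1.0 - f) + f / s);
    }

    int main(void)
    {
        printf("I/O 20%%, 4-way striping : %.2fx overall\n", amdahl_speedup(0.20, 4.0));
        printf("I/O 20%%, infinitely fast: %.2fx overall\n", amdahl_speedup(0.20, 1e9));
        return 0;
    }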

DASH
■ Cluster-based machine developed at Stanford
■ Each cluster is a 4-CPU bus-based Silicon Graphics Power System/340, with 32 Mbytes of memory visible to all processors in the machine
■ The clusters are interconnected with a 2-D mesh
■ Each processor has three direct-mapped caches
■ The CPUs are 33-MHz MIPS R3000s
■ Cache coherence is supported in hardware: a snoopy-based protocol within each cluster and a directory-based one across clusters

CEDAR
■ A 4-cluster vector multiprocessor developed at the University of Illinois' Center for Supercomputing Research and Development
■ Each cluster is an 8-CPU bus-based Alliant FX/8
■ All processors in a cluster share a 512 Kbyte direct-mapped cache and 64 Mbytes of memory visible only to the cluster
■ Fast synchronization is possible via a per-cluster synchronization bus
■ Each processor has a 4 Kbyte prefetch buffer
■ All processors in the machine are connected to 64 Mbytes of shared memory via forward and return omega networks, with no caches

HP/Convex Exemplar
■ The Exemplar X-Class is the second-generation SPP from HP/Convex
■ A ccNUMA architecture comprised of multiple nodes
■ Each node may contain up to 16 PA-8000 processors, 16 Gbytes of memory and 8 PCI busses
■ Memory access is UMA within each node and is accomplished via a non-blocking crossbar
■ Each node can correctly be considered a symmetric multiprocessor
■ The interconnect between nodes is a derivative of the IEEE SCI standard, which permits up to 32 nodes to be connected in a two-dimensional topology
■ The system includes features to aid high-performance engineering/scientific computations

HP 9000 V2200 Enterprise Server
■ Includes one CPU, 256 megabytes of memory and an unlimited HP-UX license
■ The machine features up to 16-way SMP, expandable to 32-way SMP
■ Supports non-uniform memory access (NUMA) at the hardware level, while HP-UX 11.00 will support NUMA in the operating system, so it can be added to other HP 9000 machines when it is shipped

Motorola/IBM PowerPC 604/604e
■ Four-issue superscalar RISC processors from IBM Microelectronics and Motorola, implementations of the PowerPC architecture specification
■ The processors are targeted at general-purpose desktop computing and have found design wins in the Apple Macintosh line of personal computers and in Macintosh clones
■ The fastest version of the PowerPC 604 operates at a clock speed of 180 MHz with a 3.3-volt supply
■ The PowerPC 604e is an enhanced version of the PowerPC 604 and can operate at a clock speed of 225 MHz with a 2.5-volt supply for the processor core and a 3.3-volt supply for I/O
■ The PowerPC 604 uses a superscalar RISC architecture and can dispatch and complete up to four instructions in a single clock cycle
■ The processor operates on 32-bit instructions and integer data and on 64-bit double-precision or 32-bit single-precision floating-point data
■ The PowerPC 604 has independent floating-point and integer data paths
■ The PowerPC 604 supports virtual memory via separate instruction and data TLBs for fast address translations
■ The PowerPC 604 allows both the instruction and data caches to be locked

Motorola/IBM PowerPC 604/604e

AlphaServer 8400
■ Supports up to twelve 21164 microprocessors and 14 gigabytes of memory, creating breakthroughs in very large database performance
■ Provides a viable alternative to supercomputers and mainframes, with a peak throughput of 6.6 GF (gigaflops)

Architecture of NOW System

Beowulf Clusters
■ Simple and highly configurable
■ Low cost
■ Networked
  - Computers connected to one another by a private Ethernet network
  - Connection to an external network is through a single gateway computer
■ Configuration
  - COTS – commodity-off-the-shelf components such as inexpensive computers
  - Blade components – computers mounted on a motherboard that are plugged into connectors on a rack
  - Either shared-disk or shared-nothing model

Blade and Rack of Beowulf Cluster

Cluster Model
■ Each node is an SMP containing 4 Pentium Pro processors
■ 8 nodes connected
■ Communication (see the message-passing sketch below)
  - Ethernet using MPICH
  - Myrinet using NICAM (Network Interface Communication using Active Messages)
■ Solaris OS
■ Single user, single job
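
A minimal sketch of the message-passing style such a cluster supports (assumes MPICH or another MPI implementation is installed on the nodes; the file and program names are made up for the example):

    /* Rank 0 sends one integer to rank 1, which echoes it back.
     * Compile: mpicc ping.c -o ping    Run: mpirun -np 2 ./ping */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, value = 42;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 0 got %d back from rank 1\n", value);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }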

Programming SMP Clusters
■ Shared memory architecture
  - Thread synchronization and mutual exclusion are needed (see the sketch after this list)
  - Communication overhead is low
  - Performance is limited by system bus bandwidth
  - Possibility for full cache utilization
■ Distributed memory architecture
  - Implicit synchronization when exchanging messages
  - Communication overhead is high
  - Performance is limited by network bandwidth
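
As a sketch of the thread synchronization and mutual exclusion the shared-memory side requires (a generic POSIX threads example, not tied to any machine in these slides; the file name is illustrative):

    /* Two threads increment a shared counter; the mutex provides the mutual
     * exclusion needed on a shared-memory SMP.
     * Compile: gcc counter.c -o counter -lpthread */
    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);    /* enter critical section */
            counter++;
            pthread_mutex_unlock(&lock);  /* leave critical section */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %ld (expected 2000000)\n", counter);
        return 0;
    }

The distributed-memory counterpart is the MPI sketch after the Cluster Model slide above: synchronization there is implicit in the send/receive pairing, at the cost of higher per-message overhead.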

Quad Pentium shared memory multiprocessor system

General model of a shared memory multiprocessor system with caches

Scalable Cache Coherent Systems
■ Scalable, distributed memory plus coherent replication
■ Scalable distributed memory machines
  - P-C-M (processor-cache-memory) nodes connected by a network
  - A communication assist interprets network transactions and forms the interface
■ Final point was a shared physical address space
  - A cache miss is satisfied transparently from local or remote memory
■ Natural tendency of the cache is to replicate
  - But there is no broadcast medium to snoop on
■ Not only the hardware latency, but also the protocol must scale (a rough directory sketch follows below)
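
A rough sketch of what a per-block directory entry in such a protocol tracks (a generic full-bit-vector directory, not the design of any particular machine in these slides):

    /* One directory entry per memory block at its home node: a bit per node
     * records which caches hold a copy, plus the block's coherence state. */
    #include <stdint.h>

    #define MAX_NODES 64

    enum dir_state { UNCACHED, SHARED, MODIFIED };

    struct dir_entry {
        enum dir_state state;  /* coherence state of the block */
        uint64_t sharers;      /* bit i set => node i caches a copy */
        uint8_t  owner;        /* owning node, valid when state == MODIFIED */
    };

    /* On a read miss from 'node', the home node adds it to the sharer set;
     * a write miss would instead invalidate every node in 'sharers' first. */
    static void record_read_miss(struct dir_entry *e, unsigned node)
    {
        e->sharers |= (uint64_t)1 << node;
        if (e->state == UNCACHED)
            e->state = SHARED;
    }

Because misses are satisfied point-to-point through the home node rather than over a snooped bus, both the lookup latency and the size of the sharer set must scale with the node count.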

Two-level Hierarchies

Bus Hierarchies with Distributed Memory
■ Main memory distributed among clusters
  - Cluster is a full-fledged bus-based machine, memory and all
  - Automatic scaling of memory (each cluster brings some with it)
  - Good placement can reduce global bus traffic and latency, but latency to far-away memory may be larger than to the root

Benchmarking HPC (High Performance Computing) Clusters using AMD Opteron Processors
■ The AMD64 architecture - and specifically the AMD Opteron processor - is a great platform for HPC needs
■ Well suited where there are large datasets, extensive memory requirements, and a lot of integer or floating-point arithmetic
■ The Opteron processor offers low-latency, scalable memory bandwidth with an on-chip memory controller
■ Minimizing memory latency in multi-processor systems (see the NUMA allocation sketch below)
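
One concrete way to keep data close to the processor that uses it on such a NUMA system is Linux's libnuma (a minimal sketch under the assumption that libnuma is installed; it is not AMD-specific, and the file name is illustrative):

    /* Pin the calling thread to NUMA node 0 and allocate a buffer there,
     * so accesses stay local to that node's on-chip memory controller.
     * Compile: gcc numa_local.c -o numa_local -lnuma */
    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        size_t size = 64UL * 1024 * 1024;

        if (numa_available() < 0) {
            fprintf(stderr, "NUMA is not available on this system\n");
            return 1;
        }

        numa_run_on_node(0);                       /* run this thread on node 0 */
        double *buf = numa_alloc_onnode(size, 0);  /* pages placed on node 0 */
        if (!buf) {
            perror("numa_alloc_onnode");
            return 1;
        }

        for (size_t i = 0; i < size / sizeof(double); i++)
            buf[i] = 0.0;                          /* touches stay node-local */

        numa_free(buf, size);
        return 0;
    }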

High Performance Computing
■ Minimize the turnaround time to complete a specific application problem
■ Maximize the problem size that can be solved in a given amount of time

The End