Parallel platforms, etc.


Taxonomy of platforms?
It would be nice to have a grand taxonomy of parallel platforms in which we could pigeon-hole all past and present systems, but it is not going to happen.
Not long ago, Gordon Bell and Jim Gray published an article in Comm. of the ACM discussing what the taxonomy should be.
Dongarra, Sterling, etc. answered, telling them they were wrong, saying what the taxonomy should be instead, and proposing a new multi-dimensional scheme!
Both papers agree that terms (such as "MPP") are conflated, misused, etc.
We'll look at one traditional taxonomy, at the current categorization used by the Top500, at examples of platforms, and at interesting/noteworthy architectural features that one should know as part of one's parallel computing culture.
What about conceptual models of parallel machines?

The Flynn taxonomy
Proposed in 1966!
A functional taxonomy based on the notion of streams of information: data and instructions.
Platforms are classified according to whether they have a single (S) or multiple (M) stream of each.
Four possibilities: SISD (the sequential machine), SIMD, MIMD, and MISD (rare, no commercial system; systolic arrays come closest).

SIMD
A single control unit fetches and decodes a single stream of instructions and broadcasts them to an array of processing elements (PEs).
PEs can be deactivated and activated on the fly (a sketch of this execution model follows below).
Vector processing (e.g., a vector add) is easy to implement on SIMD.
Debate: is a vector processor a SIMD machine? The two are often confused, but strictly speaking it is not, according to the taxonomy (it is really SISD with pipelined operations); more later on vector processors.
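To make the broadcast-and-mask execution model concrete, here is a minimal sequential sketch in C (an illustration, not code from the slides); the number of PEs and the activity mask are invented for the example.

```c
#include <stdio.h>

#define NUM_PES 8   /* hypothetical number of processing elements */

/* One broadcast "vector add" step: every active PE executes the same
   operation on its own data element; masked-off PEs do nothing. */
static void simd_add(const int *a, const int *b, int *c, const int *active) {
    for (int pe = 0; pe < NUM_PES; pe++) {      /* conceptually in parallel */
        if (active[pe])
            c[pe] = a[pe] + b[pe];
    }
}

int main(void) {
    int a[NUM_PES] = {1, 2, 3, 4, 5, 6, 7, 8};
    int b[NUM_PES] = {10, 20, 30, 40, 50, 60, 70, 80};
    int c[NUM_PES] = {0};
    int active[NUM_PES] = {1, 1, 0, 1, 1, 0, 1, 1};  /* PEs 2 and 5 deactivated */

    simd_add(a, b, c, active);
    for (int pe = 0; pe < NUM_PES; pe++)
        printf("PE %d: %d\n", pe, c[pe]);
    return 0;
}
```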

MIMD
The most general category.
Pretty much everything in existence today is a MIMD machine at some level, which limits the usefulness of the taxonomy.
But you have to have heard of it at least once, because people keep referring to it, somehow...
Other taxonomies have been proposed, none very satisfying.
Shared- vs. distributed-memory is a common distinction among machines, but these days many are hybrids anyway.

A host of parallel machines
There are (and have been) many kinds of parallel machines.
For the last 11 years their performance has been measured and recorded with the LINPACK benchmark, as part of the Top500 list.
It is a good source of information about what machines are (and were) and how they have evolved: http://www.top500.org

What is the LINPACK benchmark?
LINPACK ("LINear algebra PACKage") is a FORTRAN library: matrix multiply, LU/QR/Cholesky factorizations, eigensolvers, SVD, etc.
The LINPACK benchmark: solve a dense linear system with LU factorization, which takes about 2/3 n^3 + O(n^2) floating-point operations.
Measure: MFlop/s (see the sketch below).
The problem size n can be chosen; you have to report the best performance, the n that achieves it, and the n that achieves half of the best performance.
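As a worked example of the flop-count formula (not part of the original slides): HPL-style reporting commonly counts 2/3 n^3 + 2 n^2 operations, so a rate can be derived from the problem size and the measured time. The n and timing below are made up.

```c
#include <stdio.h>

/* Flops attributed to an LU-based dense solve of an n x n system. */
static double linpack_flops(double n) {
    return (2.0 / 3.0) * n * n * n + 2.0 * n * n;   /* 2/3 n^3 + O(n^2) */
}

int main(void) {
    double n = 10000.0;        /* chosen problem size (hypothetical)      */
    double seconds = 55.0;     /* measured wall-clock time (hypothetical) */
    double gflops = linpack_flops(n) / (seconds * 1e9);
    printf("n = %.0f: %.2f Gflop/s\n", n, gflops);
    return 0;
}
```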

What can we find on the Top500?

Pie charts: breakdowns of the Top500 list.

Platform architectures: SIMD, cluster, vector, constellation, SMP, MPP.

SIMD
Examples: ILLIAC-IV, TMC CM-1, MasPar MP-1.
Expensive logic for the control unit, but there is only one; cheap logic for the PEs, and there can be a lot of them: 32 PEs on one chip of the MasPar, so a 1,024-processor system with 32 chips fits on a single board; 65,536 processors for the CM-1.
Thinking Machines' gimmick was that the human brain consists of many simple neurons that are turned on and off, and so was their machine; the CM-5 was a hybrid of SIMD and MIMD.
Death: the machines are not popular anymore, but the programming model is.
Vector processors are often labeled SIMD because that is in effect what they do, but they are not SIMD machines.
SIMD led to the MPP terminology (Massively Parallel Processor), which is ironic because none of today's "MPPs" are SIMD.

SMPs
"Symmetric MultiProcessors" (often mislabeled as "Shared-Memory Processors", which has now become tolerated).
The processors are all connected to a (large) shared memory.
UMA: Uniform Memory Access, which makes it easy to program (see the OpenMP sketch below).
Symmetric: all memory is equally close to all processors.
Difficult to scale to many processors (typically fewer than 32).
Cache coherence is maintained via "snoopy caches".
(Figure: processors P1..Pn, each with a cache ($), connected by a network/bus to memory.)
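Since a UMA machine is naturally programmed with a shared-memory model, here is a minimal OpenMP sketch (an illustration, not from the slides): all threads read and write the same arrays directly, with no explicit communication.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    /* Threads share a[] and b[]; the loop iterations are split among them. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i] * b[i];

    printf("dot product = %f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```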

Distributed Shared Memory
Memory is logically shared, but physically distributed in banks.
Any processor can access any address in memory; cache lines (or pages) are passed around the machine.
Cache coherence is maintained via distributed directories.
NUMA: Non-Uniform Memory Access (some processors are closer to some memory banks than to others).
The SGI Origin2000 is a canonical example; it scales to 100s of processors and uses a hypercube topology for the memory (more on that later).
(Figure: processors P1..Pn, each with a cache ($) and a local memory bank, connected by a network.)

Clusters, Constellations, MPPs
These are the only 3 categories in today's Top500.
They all belong to the distributed-memory model (MIMD), with many twists.
Each processor/node has its own memory and cache but cannot directly access another processor's memory; nodes may be SMPs.
Each "node" has a network interface (NI) used for all communication and synchronization, so data moves by explicit messages (see the MPI sketch below).
So what are these 3 categories?
(Figure: nodes P0..Pn, each with its own memory and NI, connected by an interconnect.)
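Because no node can address another node's memory, data moves by explicit messages through the NI. A minimal MPI sketch of this model (the buffer contents are made up; this is an illustration, not code from the slides):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                   /* lives in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d over the interconnect\n", value);
    }

    MPI_Finalize();
    return 0;
}
```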

Clusters
58.2% of the Top500 machines are labeled as "clusters".
Definition: a parallel computer system comprising an integrated collection of independent "nodes", each of which is a system in its own right, capable of independent operation, and derived from products developed and marketed for other standalone purposes.
A commodity cluster is one in which both the network and the compute nodes are available on the market.
In the Top500, "cluster" means "commodity cluster".
A well-known type of commodity cluster is the "Beowulf-class PC cluster", or "Beowulf".

What is Beowulf?
An experiment in parallel computing systems, initiated by T. Sterling and D. Becker at NASA in 1994.
It established a vision of low-cost, high-end computing with public-domain software (and led to software development).
Tutorials and a book captured best practices on how to build such platforms.
Today, a "Beowulf cluster" means a commodity cluster that runs Linux and GNU-type software.

Constellations???
Commodity clusters that differ from the previous ones by the dominant level of parallelism.
Clusters consist of nodes, and nodes are typically SMPs; if there are more processors in a node than nodes in the cluster, then we have a constellation.
Typically, constellations are space-shared among users, with each user running OpenMP on a node, although an application could run on the whole machine using MPI + OpenMP (see the hybrid sketch below).
To be honest, this term is not very useful and not much used.
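A hedged sketch of the hybrid style mentioned above, with MPI ranks across nodes and OpenMP threads within a node; the work distribution is invented for the example.

```c
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = 0.0, global = 0.0;

    /* Threads within this rank's node share memory ... */
    #pragma omp parallel for reduction(+:local)
    for (int i = rank; i < 1000000; i += size)
        local += 1.0 / (1.0 + i);

    /* ... while ranks on different nodes combine results by message passing. */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    if (rank == 0) printf("sum = %f\n", global);
    MPI_Finalize();
    return 0;
}
```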

MPP????????
Probably the most imprecise term for describing a machine (isn't a 256-node cluster of 4-way SMPs massively parallel?).
MPPs may use proprietary networks and vector processors, as opposed to commodity components.
The IBM SP2, Cray T3E, IBM SP-4 (DataStar), Cray X1, and the Earth Simulator are distributed-memory machines, but their nodes are SMPs.
Basically, everything that is fast and not commodity is an "MPP" in today's Top500.
Let's look at these "non-commodity" things.

Vector Processors
Vector architectures were based on a single processor with multiple functional units, all performing the same operation.
Instructions may specify large amounts of parallelism (e.g., 64-way), but the hardware executes only a subset in parallel.
Historically important; overtaken by MPPs in the 90s, as seen in the Top500.
Re-emerging in recent years: at a large scale in the Earth Simulator (NEC SX-6) and the Cray X1, and at a small scale in the SIMD media extensions of microprocessors: SSE, SSE2 (Intel: Pentium/IA64), AltiVec (IBM/Motorola/Apple: PowerPC), VIS (Sun: SPARC); a small SSE sketch follows below.
Key idea: the compiler does some of the difficult work of finding parallelism, so the hardware doesn't have to.
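A small sketch of the SIMD media extensions mentioned above, using Intel SSE intrinsics: each _mm_add_ps adds four single-precision floats at once. The array contents are made up; this is an illustration, not code from the slides.

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* c[i] = a[i] + b[i], four floats at a time, with a scalar cleanup loop. */
static void vadd(const float *a, const float *b, float *c, int n) {
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}

int main(void) {
    float a[6] = {1, 2, 3, 4, 5, 6}, b[6] = {10, 20, 30, 40, 50, 60}, c[6];
    vadd(a, b, c, 6);
    for (int i = 0; i < 6; i++) printf("%g ", c[i]);
    printf("\n");
    return 0;
}
```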

Vector Processors
Definition: a processor that can do element-wise operations on entire vectors with a single instruction, called a vector instruction.
These are specified as operations on vector registers; a processor comes with some number of such registers, and a vector register holds ~32-64 elements.
The number of elements is larger than the amount of parallel hardware, called vector pipes or lanes (say 2-4).
The hardware completes a full vector operation in (#elements per vector register) / (#pipes) cycles; see the strip-mining sketch below.
(Figure: logically, a vector add vr3 = vr1 + vr2 performs one addition per element in parallel; in practice, the hardware performs #pipes additions per cycle.)
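A sketch of how a long loop is strip-mined into vector-register-sized chunks; the register length and lane count below are assumed values, and on real hardware each inner loop would be a single vector instruction completing in roughly VLEN / LANES cycles.

```c
#define VLEN  64   /* elements per vector register (assumed) */
#define LANES 2    /* vector pipes/lanes (assumed)           */

/* Strip-mined vector add: each outer iteration models one vector instruction,
   which the hardware retires in about VLEN / LANES cycles. */
void strip_mined_add(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i += VLEN) {
        int len = (n - i < VLEN) ? n - i : VLEN;   /* set the vector length */
        for (int j = 0; j < len; j++)              /* "one" vector add      */
            c[i + j] = a[i + j] + b[i + j];
    }
}
```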

Vector Processors
Advantages: quick fetch and decode of a single instruction that specifies multiple operations; the instruction provides the processor with a regular source of data, which can arrive every cycle and be processed in a pipelined fashion; and the compiler does the work for you, of course.
Memory-to-memory vector machines: no vector registers; they can process very long vectors, but startup time is large; they appeared in the 70s and died in the 80s.
Vendors: Cray, Fujitsu, Hitachi, NEC.

Global Address Space
Examples: Cray T3D, T3E, X1, and HP AlphaServer clusters.
The network interface supports "Remote Direct Memory Access": the NI can directly access memory without interrupting the CPU.
One processor can read/write another's memory with one-sided operations (put/get); this is not just a load/store as on a shared-memory machine, and remote data is typically not cached locally.
(Remember the MPI-2 one-sided extension; see the sketch below.)
(Figure: nodes P0..Pn, each with memory and an NI, connected by an interconnect.)
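The MPI-2 one-sided extension mentioned above exposes exactly this put/get style. A minimal sketch (the window size and values are made up; an illustration, not code from the slides):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank;
    double local = 0.0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Each rank exposes one double as a window that other ranks may access. */
    MPI_Win_create(&local, sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        double value = 3.14;
        /* One-sided: rank 0 writes into rank 1's memory; rank 1 posts no receive. */
        MPI_Put(&value, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    }
    MPI_Win_fence(0, win);

    if (rank == 1) printf("rank 1 now holds %f\n", local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```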

Cray X1: Parallel Vector Architecture
Cray combines several technologies in the X1:
12.8 Gflop/s vector processors (MSPs),
shared caches (unusual on earlier vector machines),
4-processor nodes sharing up to 64 GB of memory,
a single system image up to 4,096 processors,
and remote put/get between nodes (faster than MPI).

Cray X1: the MSP
The Cray X1 building block is the MSP (Multi-Streaming vector Processor): four SSPs, each a 2-pipe vector processor, sharing a 2 MB Ecache built from four 0.5 MB caches.
The compiler will (try to) vectorize/parallelize across the MSP, achieving "streaming".
(Figure, source J. Levesque, Cray: one MSP delivers 12.8 Gflop/s at 64-bit and 25.6 Gflop/s at 32-bit precision, runs at 400/800 MHz, and has 25.6 GB/s of bandwidth to local memory and the network.)

Cray X1: A node
Processors within a node share memory.
There are 32 network links and four I/O links per node.
(Figure: processors (P) with caches ($), memory modules (M), and I/O attached to the node's shared memory.)

Cray X1: 32 nodes
(Figure: 32 nodes connected through routers (R) and a fast switch.)

Cray X1: 128 nodes

Cray X1: Parallelism
Many levels of parallelism: vectorization within a processor, streaming within an MSP, shared memory within a node, and message passing across nodes.
Some are automated by the compiler; some require work by the programmer.
It is hard to fit this machine into a simple taxonomy; a similar story holds for the Earth Simulator.

The Earth Simulator (NEC)
Each node: shared memory (16 GB), 8 vector processors, and an I/O processor.
640 nodes, fully connected by a 640x640 crossbar switch.
Total: 5,120 processors at 8 Gflop/s each -> about 40 Tflop/s peak.

DataStar
8-way or 32-way Power4 SMP nodes, connected via IBM's Federation (formerly Colony) interconnect in an 8-ary fat-tree topology.
1,632 processors, 10.4 Tflop/s.
Each node is directly connected via fiber to IBM's GPFS (parallel file system).
Similar to the SP-x series, but with higher bandwidth and a higher-arity fat-tree.

Blue Gene/L
65,536 processors (still being assembled).
Relatively modest clock rates, so that power consumption is low, cooling is easy, and the footprint is small (1,024 nodes in a single rack); besides, processor speed is on par with memory speed, so faster clocks would not help.
2-way SMP nodes!
Several networks: a 64x32x32 3-D torus for point-to-point communication, a tree for collective operations and for I/O, plus others (Ethernet, etc.).

If you like dead supercomputers
Lots of old supercomputers, with pictures: http://www.geocities.com/Athens/6270/superp.html
Dead Supercomputers: http://www.paralogos.com/DeadSuper/Projects.html
On e-Bay, a 1993 Cray Y-MP/C90 sold for $45,100.70, offered by the Pittsburgh Supercomputing Center, which wanted to get rid of it to make space in its machine room.
Original cost: $35,000,000; weight: 30 tons; it cost another $400,000 to make it work at the buyer's ranch in Northern California.

Network Topologies
People have experimented with different topologies for distributed-memory machines, or to arrange memory banks in NUMA shared-memory machines.
Examples include:
a ring: KSR (1991);
a 2-D grid: Intel Paragon (1992);
a torus;
a hypercube: nCube, Intel iPSC/860, and used in the SGI Origin 2000 for the memory;
a fat tree: IBM's Colony and Federation interconnects (SP-x);
arrangements of switches, pioneered with "butterfly networks" such as the BBN TC2000 in the early 1990s: 200 MHz processors in a multi-stage network of switches, with virtually shared distributed memory (NUMA). I actually worked with that one!

Hypercube
A hypercube is defined by its dimension, d.
(Figure: hypercubes of dimension 1, 2, 3, and 4.)

Hypercube Properties
A d-dimensional hypercube has 2^d nodes.
The number of hops between two nodes is at most d: the diameter of the network grows logarithmically with the number of nodes, which was the key reason for the interest in hypercubes.
But each node needs d neighbors, which is a problem for scaling.
Routing and addressing: each node has a d-bit address; to route from xxxx to yyyy, just keep going to a neighbor whose address has a smaller Hamming distance to the destination (reminiscent of some peer-to-peer schemes); see the sketch below.
TONS of hypercube research (even today!).
(Figure: a 4-D hypercube with nodes labeled 0000 through 1111.)
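A small sketch of the routing rule just described: at each hop, flip one bit in which the current address still differs from the destination, so the Hamming distance drops by one and any route takes at most d hops. The node labels are plain unsigned integers here.

```c
#include <stdio.h>

/* Return the neighbor of `cur` that is one dimension closer to `dst`. */
static unsigned next_hop(unsigned cur, unsigned dst) {
    unsigned diff = cur ^ dst;          /* bits where the addresses differ   */
    if (diff == 0) return cur;          /* already at the destination        */
    unsigned bit = 1;
    while (!(diff & bit)) bit <<= 1;    /* lowest differing dimension        */
    return cur ^ bit;                   /* cross the link in that dimension  */
}

int main(void) {
    unsigned cur = 0x0, dst = 0xB;      /* route 0000 -> 1011 in a 4-D cube */
    printf("%X", cur);
    while (cur != dst) {
        cur = next_hop(cur, dst);
        printf(" -> %X", cur);
    }
    printf("\n");                       /* at most d = 4 hops */
    return 0;
}
```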

Systolic Array?
An array of processors in some topology, with each processor having a few neighbors (typically a 1-D linear array or a 2-D grid).
Processors perform regular sequences of operations on data that flow between them, e.g., receive from my left and top neighbors, compute, pass to my right and bottom neighbors.
Like SIMD machines, everything happens in lock step.
Example: CMU's iWarp, built by Intel (circa 1988).
Systolic arrays allow for convenient algorithms for some problems; a matrix-multiply sketch follows below.
Today: used in FPGA systems that implement systolic arrays to run a few algorithms, such as regular computations (matrix multiply) and genetic algorithms.
Impact: they allow us to reason about algorithms.
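To make the "receive from left and top, compute, pass right and down" rule concrete, here is a sequential simulation of a 2-D systolic matrix multiply (an illustration, not code from the slides): at global step t, the PE at position (i, j) consumes a[i][k] and b[k][j] with k = t - i - j, exactly the operands that would have arrived from its left and top neighbors.

```c
#include <stdio.h>

#define N 3

int main(void) {
    double a[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double b[N][N] = {{1,0,0},{0,1,0},{0,0,1}};   /* identity, for easy checking */
    double c[N][N] = {{0}};

    /* Wavefront schedule: PE(i,j) is busy only when 0 <= t-i-j < N. */
    for (int t = 0; t <= 3 * (N - 1); t++)
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                int k = t - i - j;
                if (k >= 0 && k < N)
                    c[i][j] += a[i][k] * b[k][j];
            }

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) printf("%5.1f ", c[i][j]);
        printf("\n");
    }
    return 0;
}
```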

Models for Parallel Computation
We have seen broad taxonomies of machines, examples of machines, and techniques to program them (OpenMP, MPI, etc.).
At this point, how does one reason about parallel algorithms, about their complexity, about their design?
What one needs is abstract models of parallel platforms.
Some are really abstract; some are directly inspired by actual machines.
Although those machines may no longer exist or be viable, the algorithms can be implemented on more relevant architectures, or at least give us clues; e.g., matrix multiply on a systolic array helps with matrix multiply on a logical 2-D grid topology that sits on top of a cluster of workstations.
Examples: PRAM, sorting networks, systolic arrays, etc.