Slide 1: Parallel Computer Architectures
Duncan A. Buell

Slide 2: Rules for Parallel Computing

Slide 3: There are no rules

Slide 4: Parallel Computing History
- Late 1960s: ILLIAC, CDC STAR
- 1980s: Denelcor HEP, Tera Computer Corp. MTA, Alliant, Sequent, Stardent, Kendall Square Research (KSR), Intel Hypercube, NCube, BBN Butterfly, NASA MPP, Thinking Machines CM-2, MasPar
- 1990s and forward: Cray T3D and T3E, Thinking Machines CM-5, Tera Computer Corp. MTA, SGI Challenge, Sun Enterprise, SGI Origin, HP-Convex, DEC 84xx, Pittsburgh Terascale, ASCI machines, Beowulf clusters, IBM SP-1 and SP-2, new DoE-inspired machines

Slide 5: Memory Latency is the Problem
- Instructions execute in nanoseconds
- Memory provides data in hundreds of nanoseconds
- The problem is keeping processors fed with data
- Standard machines use levels of cache
- How do we keep lots of processors fed?

Slide 6: Solutions(?) to the Latency Problem
- Connect all the processors to all the memory
  – SMP: Sun Enterprise, SGI Challenge, Cray multiprocessors
- Provide fast, constant-time memory fetch from anywhere to anywhere
- Requires a fast, expensive, full crossbar switch

Slide 7: Solutions(?) to the Latency Problem (2)
- Build a machine that is physically structured like the computations to be performed
  – Vectors: Cray, CDC
  – SIMD: MPP, CM-2, MasPar
  – 2D/3D grid: Cray T3D, T3E
  – Butterfly: BBN
  – Meiko "computing surface"
- Works well on problems on which it works well
- Works badly on problems that don't fit

Slide 8: Solutions(?) to the Latency Problem (3)
- Build a machine with a "generic" structure and software support for computations that may not fit it well
  – Butterfly: BBN
  – Log network: CM-2, CM-5
- Relies on magic
- Magic has always been hard to do

Slide 9: Solutions(?) to the Latency Problem (4)
- Build an SMP, then connect SMPs together in clusters
  – SGI: Origin (NUMA, ccNUMA)
  – DoE: ASCI Red, Blue Pacific, White, etc.
- Performance requires distributable computations, because memory access is slow off the local node

Slide 10: Solutions(?) to the Latency Problem (5)
- Ignore performance and concentrate on cost
  – Beowulf clusters
  – Networks of workstations
- If the machine is cheap and works very well on some (distributable) computations, then maybe no one will notice that it's not so great on other computations.

Slide 11: The Vector Dinosaurs

Slide 12: Vector Computers
- Much of high-end computing is for scientific and engineering applications
- Many of these involve linear algebra
- We happen to know how to do linear algebra
- Many solutions can be expressed with linear algebra (linear algebra is both the hammer and the nail)
- The basic operation is a dot product, i.e., a vector multiplication; see the sketch below
- Vector computers do blocks of arithmetic operations as one operation
- Register-based (Cray) or memory-to-memory (CDC)
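
A minimal sketch of the basic operation in C; the function name and types are illustrative, not from the slides. Every multiply is independent, which is exactly what lets vector hardware execute the loop as block arithmetic.

    #include <stddef.h>

    /* Dot product: the basic vector operation. The multiplies are
     * all independent; vector hardware chains them through the
     * multiply and add pipelines in blocks of elements. */
    double dot(const double *a, const double *b, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += a[i] * b[i];
        return sum;
    }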

Slide 13: Programming Vector Computers
Everything reduces to a compiler's recognizing (or being told to recognize) a loop whose operations can be done in parallel.

    for (i = 0; i < n; i++)       /* works just fine */
        a[i] = b[i] * c[i];

    for (i = 1; i < n; i++)       /* fails: the a[.] values are not independent */
        a[i] = a[i-1] * b[i];

Programming involves contortions of code to turn the work inside the loops into independent operations.
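
One classic contortion (a standard technique, not something the slide spells out) is loop fission: when a loop mixes a true recurrence with independent work, splitting it lets the compiler vectorize the independent half.

    /* Loop fission sketch: the recurrence on a[] blocks vectorization,
     * but the c[] statement does not depend on it. */
    void fission(double *a, const double *b,
                 double *c, const double *d, const double *e, int n)
    {
        int i;
        /* Recurrence: each a[i] needs a[i-1], so this stays scalar. */
        for (i = 1; i < n; i++)
            a[i] = a[i-1] + b[i];
        /* No loop-carried dependence: this loop vectorizes cleanly. */
        for (i = 1; i < n; i++)
            c[i] = d[i] * e[i];
    }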

Slide 14: Vector Computing History
- 1960s: Seymour R. Cray does the CDC 6400
- Cray leaves CDC, forms Cray Research, Inc. (CRI), produces the CRAY-1 (1976)
- CDC Cyber 205 (late 1970s)
- CDC spins off ETA; the liquid-nitrogen-cooled ETA-10 fails, ETA fails
- CRAY X-MP (1983?); CRAY-2 runs Unix (1985)
- Convex C-1 and a host of "Cray-ettes", now HP-Convex
- CRAY Y-MP (1988?), C90, T90, J series (1990s)
- Steve Chen leaves CRI, forms SSC, fails spectacularly
- Cray leaves CRI, forms Cray Computer Corp. (CCC)
- CCC CRAY-3 fails, CRAY-4 fails, CCC SSS fails
- CRI sold to SGI, then sold to Tera Computer Corp.
- S. R. Cray killed in an auto wreck caused by a teenage driver

Slide 15: True Parallel Computing

Slide 16: Parallel Computers
- The theoretical model of a PRAM
- Symmetric multiprocessors (SMPs)
- Distributed-memory machines
- Machines with an inherent structure
- Non-uniform memory access (NUMA) machines
- Massively parallel machines
- Grid computing

Slide 17: Theory: The PRAM Model
PRAM (Parallel Random Access Machine):
- Control unit
- Global memory
- Unbounded set of processors
- Private memory for each processor

Slide 18: PRAM
Types of PRAM:
- EREW (Exclusive Read, Exclusive Write)
- CREW (Concurrent Read, Exclusive Write)
- CRCW (Concurrent Read, Concurrent Write)
Flaws with PRAM:
- Logical flaw: must deal with the concurrent-write problem (sketched below)
- Practicality flaws: can't really assume an unbounded number of processors, and can't really afford to build the interconnect switch
Nonetheless, it's a good starting place.
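
A minimal sketch of the concurrent-write problem, using C threads to stand in for PRAM processors (the names are mine): two processors write different values to one cell, and without a CRCW resolution rule (common value, arbitrary winner, priority) the outcome is undefined.

    #include <pthread.h>
    #include <stdio.h>

    static int cell;                 /* one shared memory location */

    static void *writer(void *arg)
    {
        cell = *(int *)arg;          /* concurrent write to the cell */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        int v1 = 1, v2 = 2;
        pthread_create(&t1, NULL, writer, &v1);
        pthread_create(&t2, NULL, writer, &v2);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("cell = %d\n", cell); /* 1 or 2: whichever write "wins" */
        return 0;
    }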

Slide 19: Standard Single-Processor Machine
- One processor
- One memory block
- Bus to memory
- All addresses visible
[Figure: a single processor connected to memory over a bus]

Slide 20: (Michael) Flynn's Taxonomy
- SISD (Single Instruction, Single Data)
  – The ordinary computer
- MIMD (Multiple Instruction, Multiple Data)
  – True, symmetric, parallel computing (Sun Enterprise)
- SIMD (Single Instruction, Multiple Data)
  – Massively parallel army-of-ants approach
  – Processors execute the same sequence of instructions (or else NO-OP) in lockstep (TMC CM-2)
- SCMD/SPMD (Single Code/Program, Multiple Data)
  – Processors run the same program, but on their own local data (Beowulf clusters); see the sketch below
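
A minimal SPMD sketch in C with MPI, the style the slide associates with Beowulf clusters (the array size and slicing are illustrative): every node runs the same program, and the rank decides which data each one touches.

    #include <mpi.h>
    #include <stdio.h>

    /* SPMD: one program, many processes; behavior differs by rank. */
    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const int N = 1000000;       /* conceptual global array size */
        int lo = rank * (N / nprocs);
        int hi = (rank == nprocs - 1) ? N : lo + N / nprocs;
        printf("rank %d of %d handles [%d, %d)\n", rank, nprocs, lo, hi);

        MPI_Finalize();
        return 0;
    }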

Slide 21: Symmetric Multiprocessor (SMP) (MIMD)
- Lots of processors (32? 64? 128? 1024?)
- Multiple "ordinary" processors
- Lots of global memory
- All addresses visible to all processors
- Closest thing to a PRAM
- This is the holy grail
[Figure: many processors sharing one global memory]

Slide 22: SMP Characteristics
- Middle-level parallel execution
- Processors spawn "threads" at or below the size of a function
- Compiler magic to extract parallelism (if there are no pointers in the code, then at the function level one can determine independent use of variables)
- Compiler directives to force parallelism; see the sketch below
- Sun Enterprise, SGI Challenge, ...
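
As one concrete (and anachronistically modern) form of "compiler directives to force parallelism", an OpenMP pragma asserts that the loop iterations are independent so the compiler may split them across the SMP's threads; the slide does not name a specific directive system.

    /* The pragma forces thread-level parallelism across the loop;
     * all threads see the same globally addressed arrays. */
    void scale(double *a, const double *b, double s, int n)
    {
        int i;
        #pragma omp parallel for
        for (i = 0; i < n; i++)
            a[i] = s * b[i];
    }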

Slide 23: But SMPs Are Hard to Build
- N processors and M memory blocks need N*M connections
- This is hard and expensive
[Figure: every processor wired point to point to every memory block]

Slide 24: But SMPs Are Hard to Build
- For large N and M, we do this as a switch, not point to point
- But it's still hard and expensive
- Half the cost of a CRAY was the switch between processors and memory
- Beyond 128 processors, almost impossible
[Figure: processors and memory blocks connected through a switch]

Slide 25: Memory Banking Issues
- Many processors requesting data
- Processors generate addresses faster than memory can respond
- Memory banking: use the low bits of the address to select the physical bank, so consecutive addresses go to physically different banks
- But a power-of-2 stride (as in an FFT) hits the same bank repeatedly
- CDC deliberately used 17 memory banks to randomize accesses; see the sketch below
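
A small sketch of the banking arithmetic (bank counts chosen for illustration): with a power-of-2 bank count the bank number is just the low address bits, so a power-of-2 stride lands on one bank forever, while a prime count such as CDC's 17 cycles through every bank.

    #include <stdio.h>

    int main(void)
    {
        const int stride = 16;   /* a power-of-2 stride, as in an FFT */
        for (int i = 0; i < 8; i++) {
            int addr = i * stride;
            printf("addr %4d -> bank %2d of 16, bank %2d of 17\n",
                   addr, addr % 16, addr % 17);
        }
        return 0;   /* every access hits bank 0 of 16; banks of 17 vary */
    }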

Slide 26: FFT Butterfly Communication
[Figure: the butterfly communication pattern of the FFT]
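
The butterfly pattern reduces to one line of index arithmetic (a standard formulation, not spelled out on the slide): at stage s, element i exchanges with the element whose index differs in bit s, a stride of 2^s, exactly the power-of-2 stride that stresses banked memory.

    /* Butterfly partner at stage s: flip bit s of the index. */
    unsigned butterfly_partner(unsigned i, unsigned s)
    {
        return i ^ (1u << s);
    }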

Slide 27: Distributed Parallelism
- Beowulf cluster of Linux nodes (requires an identifiable "computer" to be a Beowulf?)
- SNOW (Scalable Network of Workstations)
- GIMPS
- Beowulfs programmed with MPI or PVM
- MPI uses explicit processor-to-processor message passing; see the sketch below
- Sun (and others) have tools for networks
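
A minimal sketch of MPI's explicit message passing (standard MPI calls; the payload is illustrative): nothing moves between processors unless the program says so.

    #include <mpi.h>
    #include <stdio.h>

    /* Rank 0 sends one integer to rank 1, explicitly. */
    int main(int argc, char **argv)
    {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }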

Slide 28: Distributed Parallel Computers
Usually we can't get to the memory except through the processor, but we would like to have memory-to-memory connections.
[Figure: processor-memory pairs attached to a network]

Slide 29: Parallel Computers With Structure
- If it's hard and expensive to build an SMP, is it useful to build the structure into the machine instead?
- Build in a communication pattern that you expect to see in the computations, but keep things simple enough to be buildable
- Make sure that you have efficient algorithms for the common computational tasks

Slide 30: Parallel Computers With Structure
- Ring-connected machines (Alliant)
- 2- and 3-dimensional meshes (Cray T3D, T3E)
- 3-D mesh with missing links (Tera MTA)
- Logarithmic tree interconnections
  – Thinking Machines CM-2, CM-5
  – MasPar MP-1, MP-2
- Butterfly (BBN: Bolt, Beranek, and Newman)

Slide 31: 2-Dimensional Mesh with Wraparound
A vector multiply can be done very efficiently (shift column data up past row data), but what about a matrix transpose? A sketch of the wraparound addressing follows.
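
A small sketch of the wraparound addressing such a mesh relies on (the function name and coordinates are mine): neighbors are computed modulo the mesh dimension, so shifting data is the same single-hop operation at the edge as in the middle.

    /* Row index of the node one step "up" on an R-row torus:
     * row 0 wraps around to row R-1. */
    int up_neighbor_row(int r, int R)
    {
        return (r - 1 + R) % R;
    }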

Slide 32: Logarithmic Tree Communications
[Figure: a log-depth tree network connecting the processors]

Slide 33: Parallel Computers With Structure
- Machines with structure that were intended to be SMPs were generally not successful: Alliant, Sequent, BBN Butterfly, etc.
- The CM-5 claimed magical compilers, but efficiency came only from using the structure explicitly
- The T3D and T3E were the ONLY machines that allowed shared memory across clusters of nodes and had it work

Slide 34: NUMA Clusters of SMPs
- 2-4 processors and 2-4 Gbytes of memory on a node
- 4 (plus or minus) nodes per cabinet, with a switch
- Cabinets interconnected with another switch
- Non-Uniform Memory Access:
  – Fast access to node memory
  – Slower access elsewhere in the cabinet
  – Yet slower access off-cabinet
- Nearly all large machines are NUMA (DoE ASCI, SGI Origin, Pittsburgh Terascale, ...)

Slide 35: Massively Parallel SIMD Computers
- NASA Massively Parallel Processor (MPP)
  – Built by Goodyear, 1984, for image processing
  – 16,384 1-bit procs, 1024 bits/proc of memory
  – Mesh connections
- Thinking Machines CM-2 (1986)
  – 65,536 1-bit procs, 8192 bits/proc
  – Log network
  – Compute cost = communication cost?
- MasPar MP-1, MP-2 (late 1980s)
  – 16,384 4-bit processors
  – Log network

Slide 36: Massively Parallel SIMD Computers
- A plane of processors, each sitting above an array of memory bits
- Usually a log network connecting the processors
- Usually also some local connections (e.g., 16 procs per node on the CM-2)
[Figure: a control processor driving the plane of compute processors over their memory]

Slide 37: Massively Parallel SIMD Computers
- The control processor sends instructions clock by clock to the compute processors
- All compute processors execute the instruction (or a NO-OP) on the same relative data location; see the sketch below
- An obvious model for image processing
- Allows variable data types (although TMC didn't do this until told to)
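
A hedged sketch of the lockstep execute-or-NO-OP model in plain C (the mask idea is standard SIMD practice; the names are mine): the control processor broadcasts one instruction, and a per-processor mask bit turns it into a NO-OP.

    #define NPROC 8

    /* One broadcast instruction ("add b into a") applied by NPROC
     * lockstep processors; a cleared mask bit means NO-OP. */
    void simd_add(int a[NPROC], const int b[NPROC],
                  const unsigned char mask[NPROC])
    {
        for (int p = 0; p < NPROC; p++)   /* conceptually simultaneous */
            if (mask[p])
                a[p] += b[p];
    }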

Slide 38: Massively Parallel SIMD Computers: Processor in Memory (PIM)
- Take half the memory off a chip
- Use the freed silicon to implement SIMD processors
- An extra address bit toggles the mode (sketched below):
  – If 0, use the address as an address
  – If 1, use the "address" as a SIMD instruction
- 2048 processors per memory chip
- Cray Computer Corp.'s SSS would have provided millions of processors
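
A rough sketch of the mode-bit decode the slide describes (the bit position and field layout are invented for illustration):

    #include <stdint.h>

    /* Top bit selects PIM mode: 0 = ordinary memory address,
     * 1 = treat the remaining bits as a SIMD instruction. */
    #define PIM_MODE_BIT (1u << 31)

    int is_simd_instruction(uint32_t addr)
    {
        return (addr & PIM_MODE_BIT) != 0;
    }

    uint32_t payload(uint32_t addr)   /* address bits or opcode bits */
    {
        return addr & ~PIM_MODE_BIT;
    }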

Slide 39: Grid Computing