Parallel Computer Architectures Duncan A. Buell


1 Parallel Computer Architectures Duncan A. Buell
Computer Science and Engineering 9/20/2018

2 Rules for Parallel Computing

3 There are no rules

4 Memory Latency is the Problem
Instructions execute in nanoseconds
Memory provides data in 100s of nanoseconds
The problem is keeping processors fed with data
Standard machines use levels of cache
How do we keep lots of processors fed?

5 Solutions(?) to the Latency Problem
Connect all the processors to all the memory
SMP: Sun Enterprise, SGI Challenge, Cray multiprocessors
Provide fast, constant-time memory fetch from anywhere to anywhere
Requires a fast, expensive, full crossbar switch

6 Solutions(?) to the Latency Problem (2)
Build a machine that is physically structured like the computations to be performed
  Vectors: Cray, CDC
  SIMD: MPP, CM-2, MasPar
  2D/3D Grid: CRAY T3D, T3E
  Butterfly: BBN
  Meiko “computing surface”
Works well on problems on which it works well
Works badly on problems that don’t fit

7 Solutions(?) to the Latency Problem (3)
Build a machine with “generic” structure and software support for computations that may not fit well
  Butterfly: BBN
  Log network: CM-2, CM-5
Relies on magic
Magic has always been hard to do

8 Solutions(?) to the Latency Problem (4)
Build an SMP and then connect SMPs together in clusters
  SGI: Origin (NUMA, ccNUMA)
  DoE: ASCI Red, Blue Pacific, White, etc.
Performance requires distributable computations, because memory access is slow off the local node

9 Solutions(?) to the Latency Problem (5)
Ignore performance and concentrate on cost
  Beowulf clusters
  Networks of workstations
If the machine is cheap, and works very well on some (distributable) computations, then maybe no one will notice that it’s not so great on other computations.

10 The Vector Dinosaurs

11 Vector Computers
Much of high-end computing is for scientific and engineering applications
Many of these involve linear algebra
We happen to know how to do linear algebra
Many solutions can be expressed with lin alg
(Lin alg is both the hammer and the nail)
The basic operation is a dot product, i.e. a vector multiplication
Vector computers do blocks of arithmetic ops as one operation
Register-based (CRAY) or memory-memory (CDC)

12 Programming Vector Computers
Everything reduces to a compiler’s recognizing (or being told to recognize) a loop whose ops can be done in parallel.
    for (i = 0; i < n; i++)    /* works just fine */
        a[i] = b[i] * c[i];
    for (i = 1; i < n; i++)    /* fails, a[.] values not independent */
        a[i] = a[i-1] * b[i];
Programming involves contortions of code to make it into independent operations inside the loops.
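
One common "contortion" is loop fission (also called loop distribution). The slides do not name this technique; it is offered here only as an illustration. The sketch below assumes a loop that mixes an independent multiply with a running sum. The function names `fused_loop` and `split_loop` are hypothetical. Splitting the loop leaves the multiply in a loop with no carried dependence, which a vectorizing compiler could plausibly handle, while the recurrence stays sequential.

```c
#include <assert.h>
#include <stddef.h>

/* Fused form: the independent multiply shares a loop with a recurrence
   on s[i-1], so the loop body as a whole carries a dependence. */
void fused_loop(const double *a, const double *b,
                double *c, double *s, size_t n)
{
    assert(n > 0);
    c[0] = a[0] * b[0];
    s[0] = c[0];
    for (size_t i = 1; i < n; i++) {
        c[i] = a[i] * b[i];      /* independent across iterations */
        s[i] = s[i - 1] + c[i];  /* depends on the previous iteration */
    }
}

/* Split form (loop fission): same results, but the first loop now
   contains only independent operations. */
void split_loop(const double *a, const double *b,
                double *c, double *s, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i < n; i++)   /* candidate for vectorization */
        c[i] = a[i] * b[i];

    s[0] = c[0];
    for (size_t i = 1; i < n; i++)   /* recurrence remains sequential */
        s[i] = s[i - 1] + c[i];
}
```

Both functions perform the same arithmetic in the same order, so they produce identical arrays; only the loop structure differs.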

13 Vector Computing History
1960s: Seymour R. Cray does CDC 6400
Cray leaves CDC, forms Cray Research, Inc., produces CRAY-1 (1976)
CDC Cyber 205 (late 1970s)
CDC spins off ETA, liquid nitrogen ETA-10 fails, ETA fails
CRAY X-MP (1983?), CRAY 2 runs Unix (1985)
Convex C-1 and a host of “Cray-ettes”, now HP-Convex
CRAY Y-MP (1988?), C90, T90, J series (1990s)
Steve Chen leaves CRI, forms SSC, fails spectacularly
Cray leaves CRI, forms Cray Computer Corp.
CCC CRAY 3 fails, CRAY 4 fails, CCC SSS fails
CRI sold to SGI, then sold to Tera Computer Corp.
1996: S.R. Cray killed in auto wreck by teenager

14 True Parallel Computing

15 Parallel Computers
The theoretic model of a PRAM
Symmetric Multi Processors
Distributed memory machines
Machines with an inherent structure
Non Uniform Memory Access machines
Massively parallel machines
Grid computing

16 Theory – The PRAM Model
PRAM (Parallel Random Access Machine):
  Control unit
  Global memory
  Unbounded set of procs
  Private mem for each processor

17 PRAM
Types of PRAM:
  EREW (Exclusive Read Exclusive Write)
  CREW (Concurrent Read Exclusive Write)
  CRCW (Concurrent Read Concurrent Write)
Flaws with PRAM:
  Logical flaw: Must deal with the concurrent write problem
  Practicality flaw: Can’t really assume unbounded number of processors
  Can’t really afford to build the interconnect switch
Nonetheless, it’s a good starting place

18 Standard Single Processor Machine
One processor
One memory block
Bus to memory
All addresses visible
[Diagram labels: Processor, Memory]

19 (Michael) Flynn’s Taxonomy
SISD (Single Instruction, Single Data) – The ordinary computer
MIMD (Multiple Instruction, Multiple Data) – True, symmetric, parallel computing (Sun Enterprise)
SIMD (Single Instruction, Multiple Data) – Massively parallel army-of-ants approach – Processors execute the same sequence of instructions (or else NO-OP) in lockstep (TMC CM-2)
SCMD/SPMD (Single Code/Program Multiple Data) – Processors run the same program, but on their own local data (Beowulf clusters)

20 Symmetric Multi-Processor (SMP) (MIMD)
Lots of processors (32? 64? 128? 1024?)
Multiple “ordinary” processors
Lots of global memory
All addresses visible to all processors
Closest thing to a PRAM
This is the holy grail
[Diagram label: Memory]

21 SMP Characteristics
Middle level parallel execution
Processors spawn “threads” at or below the size of a function
Compiler magic to extract parallelism (if no pointers in the code, then at the function level one can determine independence of use of variables)
Compiler directives to force parallelism
Sun Enterprise, SGI Challenge, …
[Diagram labels: Processors, Memory]

22 But SMPs Are Hard to Build
N processors
M memory blocks
N*M connections
This is hard and expensive
[Diagram labels: P, M]

23 But SMPs Are Hard to Build
For large N and M, we do this as a switch, not point to point
But it’s still hard and expensive
Half the cost of a CRAY was the switch between processors and memory
Beyond 128 processors, almost impossible
[Diagram labels: P, SWITCH, M]

24 Memory Banking Issues
Many processors requesting data
Processors generate addresses faster than memory can respond
Memory banking: use low bits of address to specify the physical bank so consecutive addresses go to physically different banks
But power-of-2 stride (as in an FFT) hits the same bank repeatedly
CDC deliberately used 17 memory banks to randomize accesses
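
The banking effect can be checked with a few lines of arithmetic. The sketch below is only a model: it assumes the bank is chosen as `address % nbanks` (for a power-of-2 bank count this equals taking the low address bits; for 17 banks the modulo mapping is an assumption, not a description of the CDC hardware). The name `distinct_banks` and the limit `MAX_BANKS` are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BANKS 64  /* hypothetical upper bound for this sketch */

/* Count how many different banks are touched by `naccesses` references
   to addresses 0, stride, 2*stride, ... under bank = address % nbanks. */
size_t distinct_banks(size_t stride, size_t nbanks, size_t naccesses)
{
    assert(nbanks > 0 && nbanks <= MAX_BANKS);

    bool hit[MAX_BANKS] = { false };
    size_t count = 0;

    for (size_t k = 0; k < naccesses; k++) {
        size_t bank = (k * stride) % nbanks;
        if (!hit[bank]) {
            hit[bank] = true;
            count++;
        }
    }
    return count;
}
```

With 16 banks, a stride of 1 spreads 64 accesses over all 16 banks, while a stride of 16 lands every access on a single bank. With 17 banks the same stride of 16 reaches all 17, because 16 and 17 share no common factor.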

25 FFT Butterfly Communication
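
The figure for this slide did not survive in the transcript. As a hedged stand-in, the sketch below shows the standard radix-2 butterfly pattern, which is assumed (not confirmed by the source) to be what the figure depicted: at a given stage, element `index` exchanges data with the element whose index differs in exactly one bit. The function name `butterfly_partner` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* For an FFT on 2^log2_n points, return the index that `index`
   communicates with at `stage` (0 <= stage < log2_n).
   The partner differs from `index` only in bit number `stage`. */
size_t butterfly_partner(size_t index, unsigned stage, unsigned log2_n)
{
    assert(log2_n < sizeof(size_t) * 8);
    assert(stage < log2_n);
    assert(index < ((size_t)1 << log2_n));

    return index ^ ((size_t)1 << stage);
}
```

The partner distance is always a power of 2, which is why the power-of-2 strides mentioned on the memory banking slide arise in an FFT.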

26 Distributed Parallelism
Beowulf cluster of Linux nodes (requires an identifiable “computer” to be a Beowulf?)
SNOW (Scalable Network of Workstations)
GIMP, Beowulfs programmed with MPI or PVM
MPI uses explicit processor-to-processor message passing
Sun (and others) have tools for networks

27 Distributed Parallel Computers
[Diagram label: Network]
Usually we can’t get to the memory except through the processor, but we would like to have memory-to-memory connections.

28 Parallel Computers With Structure
If it’s hard/expensive to build an SMP, is it useful to build the structure into the machine?
Build in a communication pattern that you expect to see in the computations, but keep things simple enough to make them buildable
Make sure that you have efficient algorithms for the common computational tasks

29 Parallel Computers With Structure
Ring-connected machines (Alliant)
2-dimensional meshes (CRAY T3D, T3E)
3-D mesh with missing links (Tera MPA)
Logarithmic tree interconnections (Thinking Machines Connection Machine CM-2, CM-5; MasPar MP-1, MP-2)
Bolt, Beranek, and Newman BBN Butterfly

30 2-dimensional Mesh with Wraparound
A vector multiply can be done very efficiently (shift column data up past row data), but what about a matrix transpose?
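
Wraparound means each processor's neighbors are found with modular arithmetic. The sketch below assumes processors are numbered row by row (row-major ranks), which is a convention chosen for illustration and not stated in the slides. The function name `torus_neighbor` is hypothetical.

```c
#include <assert.h>

/* Return the rank of the neighbor of `rank` that is `dr` rows and `dc`
   columns away in a rows x cols mesh with wraparound.
   Ranks are assumed to be row-major: rank = row * cols + col. */
int torus_neighbor(int rank, int rows, int cols, int dr, int dc)
{
    assert(rows > 0 && cols > 0);
    assert(rank >= 0 && rank < rows * cols);
    assert(dr >= -rows && dr <= rows);
    assert(dc >= -cols && dc <= cols);

    int row = rank / cols;
    int col = rank % cols;

    /* Adding rows/cols before the modulo keeps the operand non-negative. */
    int new_row = (row + dr + rows) % rows;
    int new_col = (col + dc + cols) % cols;

    return new_row * cols + new_col;
}
```

Shifting column data "up" by one step corresponds to every processor sending to `torus_neighbor(rank, rows, cols, -1, 0)`; the top row wraps around to the bottom row. A transpose has no such single-step pattern, since element (r, c) must reach (c, r), which can be many hops away.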

31 Logarithmic Tree Communications

32 Parallel Computers With Structure
Machines with structure that were intended to be SMPs were generally not successful
  Alliant, Sequent, BBN Butterfly, etc.
CM-5 claimed magical compilers, but efficiency only came by using the structure explicitly
T3D, T3E were the ONLY machines that allowed shared memory with clusters of nodes and had it work

33 NUMA Clusters of SMPs
2-4 processors, 2-4 Gbytes memory on a node
4 (plus or minus) nodes per cabinet with a switch
Cabinets interconnected with another switch
Non Uniform Memory Access
  Fast access to node memory
  Slower access elsewhere in the cabinet
  Yet slower access off-cabinet
Nearly all large machines are NUMA (DoE ASCI, SGI Origin, Pittsburgh Terascale, …)

34 Massively Parallel SIMD Computers
NASA Massively Parallel Processor
  Built by Goodyear 1984 for image processing
  bit procs, 1024 bits/proc of mem
  Mesh connections
Thinking Machines CM-2 (1986)
  bit procs, 8192 bits/proc
  Log network
  Compute cost = communication cost?
MasPar MP-1, MP-2 (late 1980s)
  bit processors

35 Massively Parallel SIMD Computers
Control processor
Plane of processors each sitting above an array of memory bits
Usually a log network connecting the processors
Usually also some local connections (e.g., 16 procs/node on CM-2)
[Diagram labels: Procs, Memory]

36 Massively Parallel SIMD Computers
Control processor sends instructions clock by clock to the compute processors
All compute processors execute the instruction (or NO-OP) on the same relative data location
Obvious image processing model
Allows variable data types (although TMC didn’t do this until told to)
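
The execute-or-NO-OP behavior can be sketched as a software model. This is only an illustration under stated assumptions: the `Pe` struct, the `active` flag, the constant `LOCAL_WORDS`, and the function `broadcast_add` are all hypothetical names, and the sequential `for` loop stands in for hardware that would perform every processor's step in the same clock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { LOCAL_WORDS = 4 };  /* hypothetical size of each local memory */

/* One compute processor: private memory plus an activity flag. */
typedef struct {
    int  mem[LOCAL_WORDS];
    bool active;
} Pe;

/* Model of one broadcast instruction: every active processor adds
   `operand` to its own word at the same relative `offset`;
   inactive processors do nothing (NO-OP). */
void broadcast_add(Pe pes[], size_t npes, size_t offset, int operand)
{
    assert(offset < LOCAL_WORDS);

    for (size_t p = 0; p < npes; p++) {  /* lockstep in real hardware */
        if (pes[p].active)
            pes[p].mem[offset] += operand;
    }
}
```

For image processing, each processor would hold one pixel (or a small tile), and the activity flag would select which pixels an operation applies to.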

37 Massively Parallel SIMD Computers
Processor in Memory (PIM)
Take half the memory off a chip
Use the silicon for implementing SIMD processors
Extra address bit toggles mode
  If 0, use address as address
  If 1, use “address” as SIMD instruction
2048 processors per memory chip
Cray Computer Corp. SSS would have provided millions of processors
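
The mode toggle can be pictured as a one-bit decode. The sketch below is purely illustrative: the address width `PIM_ADDR_BITS`, the position of the mode bit, and the names `pim_mode` and `pim_payload` are assumptions, since the slides give no details of the actual encoding.

```c
#include <assert.h>
#include <stdint.h>

#define PIM_ADDR_BITS 20u                    /* hypothetical address width */
#define PIM_MODE_BIT  (1u << PIM_ADDR_BITS)  /* the assumed "extra" bit */

typedef enum { PIM_MEMORY_ACCESS, PIM_SIMD_INSTRUCTION } PimMode;

/* Decide how the chip interprets a value presented on its address lines. */
PimMode pim_mode(uint32_t bus_value)
{
    assert(bus_value < (PIM_MODE_BIT << 1));
    return (bus_value & PIM_MODE_BIT) ? PIM_SIMD_INSTRUCTION
                                      : PIM_MEMORY_ACCESS;
}

/* The remaining bits: an ordinary address in memory mode,
   or an instruction encoding in SIMD mode. */
uint32_t pim_payload(uint32_t bus_value)
{
    return bus_value & (PIM_MODE_BIT - 1u);
}
```

The appeal of this scheme is that the host needs no new interface: it drives the chip through the same address lines it would use for plain memory.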

38 Grid Computing

39 Parallel Computing History
Late 1960s
  ILLIAC-4
1970
  CDC STAR-100
1980s
  Denelcor HEP
  Tera Computer Corp. MPA
  Alliant
  Sequent
  Stardent
  Kendall Square Research (KSR)
  Intel Hypercube
  NCube
  BBN Butterfly
  NASA MPP
  Thinking Machines CM-2
  MasPar
1990s
  Cray T3D, T3E
  Thinking Machines CM-5
  Tera Computer Corp. MPA
  SGI Challenge
  Sun Enterprise
  SGI Origin
  HP-Convex
  DEC 84xx
  Pittsburgh Terascale
  ASCI machines
  Beowulf clusters
  IBM SP-1, SP-2

