Slide 1: ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture)
Machine Organizations
Copyright 2004 Daniel J. Sorin, Duke University
Slides are derived from work by Sarita Adve (Illinois), Babak Falsafi (CMU), Mark Hill (Wisconsin), Alvy Lebeck (Duke), Steve Reinhardt (Michigan), and J. P. Singh (Princeton). Thanks!
Slide 2: Outline
System Organization Taxonomy
Distributed Memory Systems (w/o cache coherence)
Clusters and Networks of Workstations
Presentations of Research Papers
Slide 3: A Taxonomy of System Organizations
Flynn's Taxonomy of Organizations
– SISD = Single Instruction, Single Data = uniprocessor
– SIMD = Single Instruction, Multiple Data
– MISD = Multiple Instruction, Single Data = doesn't exist
– MIMD = Multiple Instruction, Multiple Data
This course addresses SIMD and MIMD machines
Slide 4: SIMD
Single Instruction, Multiple Data
– E.g., every processor executes an Add on different data
Matches certain programming models
– E.g., High Performance Fortran's "Forall"
For data-parallel algorithms, SIMD is very effective
– Otherwise, it doesn't help much (Amdahl's Law!)
Note: SPMD is a programming model, not an organization
SIMD hardware implementations (see the sketch below)
– Vector machines
– Instruction set extensions, like Intel's MMX
– Digital Signal Processors (DSPs)
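The same idea in code, as a minimal sketch: the loop below applies one Add across many data elements using x86 SSE intrinsics (a descendant of the MMX-style extensions mentioned above). The function and array names are illustrative only.

    #include <xmmintrin.h>   /* SSE intrinsics: __m128 holds four floats */

    /* Elementwise c[i] = a[i] + b[i]: one instruction operates on 4 data items. */
    void add_simd(const float *a, const float *b, float *c, int n)
    {
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats from a      */
            __m128 vb = _mm_loadu_ps(&b[i]);            /* load 4 floats from b      */
            _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* one SIMD add, 4 results   */
        }
        for (; i < n; i++)                              /* scalar cleanup for the tail */
            c[i] = a[i] + b[i];
    }

If the algorithm is data parallel, nearly every iteration maps onto the SIMD add; if not, the scalar remainder dominates and the SIMD hardware sits idle (Amdahl's Law).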
Slide 5: MIMD
Multiple Instruction, Multiple Data
Completely general multiprocessor system model
– Lots of ways to design MIMD machines
Every processor executes its own program with its own data
MIMD machines can handle many programming models (see the SPMD sketch below)
– Shared memory
– Message passing
– SPMD (using shared memory or message passing)
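As a hedged illustration of the SPMD style on a MIMD machine, the fragment below uses MPI (one common message-passing library, not something prescribed by these slides): every process runs the same program text, but each operates on its own rank-dependent share of the data.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* who am I?       */
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* how many of us? */

        /* Same program everywhere, but each process works on its own data. */
        int my_work = 1000 / nprocs;             /* illustrative partition */
        printf("process %d of %d handles %d items\n", rank, nprocs, my_work);

        MPI_Finalize();
        return 0;
    }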
Slide 6: Outline
System Organization Taxonomy
Distributed Memory Systems (w/o cache coherence)
Clusters and Networks of Workstations
Presentations of Research Papers
Slide 7: Distributed Memory Machines
Multiple nodes connected by network:
– Node = processor, caches, memory, communication assist
Design issue #1: Communication assist
– Where is it? (I/O bus, memory bus, processor registers)
– What does it know?
  » Does it just move bytes, or does it perform some functions?
– Is it programmable?
– Does it run user code?
Design issue #2: Network transactions
– Input & output buffering
– Action on remote node
These two issues are not independent!
Slide 8: Ex 1: Massively Parallel Processor (MPP)
Network interface typically close to processor
– Connected to memory bus:
  » Locked to specific processor architecture/bus protocol
– Connected to registers/cache:
  » Only in research machines
Time-to-market is long
– Processor already available, or must work closely with processor designers
Maximize performance (cost is no object!)
[Diagram: node with processor, cache, and main memory on the memory bus; network interface on the memory bus connects to the network; I/O bridge to an I/O bus with disk controller and disk]
Slide 9: Ex 2: Network of Workstations
Network interface on I/O bus
Standards (e.g., PCI) => longer life, faster to market
Slow (microseconds) to access network interface
"System Area Network" (SAN): between LAN & MPP
[Diagram: node with processor, cache, and main memory on the core chip set; I/O bus with disk controller/disk, graphics controller, and a network interface (raising interrupts) that connects to the network]
Slide 10: Spectrum of Communication Assist Designs
1) None: physical bit stream
– Physical DMA using OS: nCUBE, iPSC, ...
2) User-level communication
– User-level port: CM-5, *T
– User-level handler: J-Machine, Monsoon, ...
3) Communication co-processor
– Processing, translation: Paragon, Meiko CS-2, Myrinet
– Reflective memory: Memory Channel, SHRIMP
4) Global physical address
– Processor + memory controller: RP3, BBN, T3D, T3E
5) Cache-to-cache (later)
– Cache controller: most current MPs (Sun, HP)
Increasing HW support, specialization, intrusiveness, performance (???)
Slide 11: Network Transaction Interpretation
1) Simplest MP: communication assist doesn't interpret much, if anything
– DMA from/to buffer, then interrupt or set flag on completion
2) User-level messaging: get the OS out of the way
– Assist does protection checks to allow direct user access to the network (e.g., via memory-mapped I/O)
3) Dedicated communication co-processor: get the CPU out of the way (if possible)
– Basic protection plus address translation: user-level bulk DMA
4) Global physical address space (NUMA): everything in hardware
– Complexity increases, but performance does too (if done right)
5) Cache coherence: even more so
– Stay tuned
Slide 12: (1) No Comm Assist: Physical DMA (C&S 7.3)
Physical addresses: OS must initiate transfers (sketch below)
– System call per message on both ends: ouch!
Sending OS copies data to OS buffer with header/trailer
Receiver copies packet into OS buffer, then interprets
– User message then copied (or mapped) into user space
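To make the cost concrete, here is a hedged sketch of what this path looks like to the programmer. It uses ordinary POSIX read/write on a socket purely as a stand-in for the OS-mediated style (nothing here is the nCUBE or iPSC interface): every message traps into the kernel, which copies it into a system buffer before the DMA/network transfer, and the receiver pays the same trap-and-copy cost.

    #include <unistd.h>

    /* Sender: one system call per message; the kernel copies buf into an OS
     * buffer, adds header/trailer, and starts the transfer. */
    ssize_t send_msg(int sock, const char *buf, size_t len)
    {
        return write(sock, buf, len);       /* trap into OS, copy, DMA to network */
    }

    /* Receiver: another system call; data lands in an OS buffer first and is
     * copied (or mapped) into user space when the process asks for it. */
    ssize_t recv_msg(int sock, char *buf, size_t len)
    {
        return read(sock, buf, len);        /* trap into OS, copy out */
    }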
Slide 13: (2) User-Level Communication (C&S 7.4)
Problem with raw DMA: too much OS activity
Solution: let user-to-user communication skip the OS
– Can't skip the OS if communication is user-to-system or system-to-user
Two flavors:
– User-level ports: map the network into user space
– User-level handlers: message reception triggers a user handler
Slide 14: (2a) User-Level Network Ports (C&S 7.4.1)
Map network hardware into the user's address space
– Talk directly to the network via loads & stores
User-to-user communication without OS intervention
– Low latency
Communication assist must protect users & system
DMA hard ... CPU involvement (copying) becomes the bottleneck
Slide 15: (2a) User-Level Network Ports
Appears to the user as logical message queues plus a status register (sketch below)
Sender and receiver must coordinate
– What happens if the receiver doesn't pop messages?
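A minimal sketch of what "network mapped into user space" might look like from the application's point of view. The device path, register offsets, and FIFO layout are all invented for illustration and do not describe any particular machine; error checks are omitted.

    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Hypothetical layout of a user-level network port. */
    struct ni_port {
        volatile uint32_t status;    /* bit 0 = rx message ready, bit 1 = tx full */
        volatile uint32_t tx_fifo;   /* store words here to send                  */
        volatile uint32_t rx_fifo;   /* load words from here to receive           */
    };

    int main(void)
    {
        /* Map the NI's queues and status register into this process's address
         * space (device name is hypothetical). After this, sending and
         * receiving need no system calls. */
        int fd = open("/dev/ni0", O_RDWR);
        struct ni_port *port = mmap(NULL, sizeof(*port), PROT_READ | PROT_WRITE,
                                    MAP_SHARED, fd, 0);

        while (port->status & 0x2)     /* spin while the tx FIFO is full      */
            ;
        port->tx_fifo = 0xdeadbeef;    /* send one word with an ordinary store */

        while (!(port->status & 0x1))  /* poll for an incoming message         */
            ;
        uint32_t word = port->rx_fifo; /* receive with an ordinary load        */

        (void)word;
        munmap((void *)port, sizeof(*port));
        close(fd);
        return 0;
    }

Note the coordination issue from the slide: if the receiver never drains rx_fifo, the sender eventually spins on a full tx FIFO, so the assist (or the protocol) must decide whether to block, drop, or back-pressure.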
Slide 16: Example: Thinking Machines CM-5 (C&S 7.4.2)
Input and output FIFOs for each network
Two networks
– Request & response
Save/restore network buffers on context switch
More later on the CM-5!
Slide 17: (2b) User-Level Handlers (C&S 7.4.3)
Like user-level ports, but tighter coupling between port and processor
Ports are mapped to processor registers, instead of to special memory regions
Hardware support to vector to an address specified in the message: the incoming message directs execution
Examples: J-Machine, Monsoon, *T (MIT), iWARP (CMU)
Slide 18: Example: J-Machine (MIT)
Each node is a small message-driven processor
HW support to queue messages and dispatch to a message-handler task
Slide 19: (Brief) Example: Active Messages
Messages are "active": they contain instructions for what the receiving processor should do (sketch below)
Optional reading assignment on Active Messages
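The core idea can be sketched in a few lines of C; the message format and handler table below are invented for illustration (real active-message layers such as the one in the optional reading define their own formats). Each message names a handler, and the receiver simply jumps to it with the message payload as arguments.

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical active message: a handler index plus a small payload. */
    struct am_msg {
        uint32_t handler;      /* which handler to run on the receiving node */
        uint32_t arg0, arg1;   /* payload interpreted by that handler        */
    };

    typedef void (*am_handler_t)(uint32_t arg0, uint32_t arg1);

    static void handle_put(uint32_t addr, uint32_t value)
    {
        printf("put %u at %u\n", value, addr);   /* stand-in for a real store */
    }

    static void handle_ack(uint32_t seq, uint32_t unused)
    {
        (void)unused;
        printf("ack for message %u\n", seq);
    }

    /* Table of user handlers; the message selects the code to run. */
    static am_handler_t handlers[] = { handle_put, handle_ack };

    /* On message arrival, the receiver vectors directly to user code. */
    void am_dispatch(const struct am_msg *m)
    {
        handlers[m->handler](m->arg0, m->arg1);
    }

    int main(void)
    {
        struct am_msg m = { 0, 100, 42 };   /* "run handle_put(100, 42)" */
        am_dispatch(&m);
        return 0;
    }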
Slide 20: (3) Dedicated Communication Co-Processor
Use a programmable message processor (MP)
– MP runs software that does the communication assist work
MP and CPU communicate via shared memory
[Diagram: each node has a processor (running user and system code), memory, a message processor, and a network interface; nodes are connected through the network]
Slide 21: (3) Dedicated Communication Co-Processor
User processor stores cmd / msg / data into a shared output queue (sketch below)
– Must still check for output queue full (or have it grow dynamically)
Message processors make the transaction happen
– Checking, translation, scheduling, transport, interpretation
Avoid system call overhead
Multiple bus crossings are the likely bottleneck
[Diagram: per node, the processor and message processor share memory; the message processor drives the network interface for user and system traffic]
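A hedged sketch of the shared output queue between the compute processor and the message processor. The single-producer/single-consumer ring below is illustrative only and glosses over the memory-ordering and cache-interaction details a real implementation must handle.

    #include <stdint.h>

    #define QSIZE 64   /* number of slots; power of two keeps the math simple */

    /* One command/message descriptor deposited by the user processor. */
    struct msg_desc {
        uint32_t dest;        /* destination node             */
        uint32_t len;         /* payload length in bytes      */
        uint64_t payload_pa;  /* physical address of the data */
    };

    /* Queue in memory shared by the CPU (producer) and MP (consumer). */
    struct out_queue {
        volatile uint32_t head;          /* next slot the MP will consume */
        volatile uint32_t tail;          /* next slot the CPU will fill   */
        struct msg_desc   slots[QSIZE];
    };

    /* CPU side: returns 0 if the queue is full (caller must retry or back off). */
    int enqueue(struct out_queue *q, struct msg_desc d)
    {
        uint32_t t = q->tail;
        if (((t + 1) % QSIZE) == q->head)   /* full: must still check for this! */
            return 0;
        q->slots[t] = d;
        q->tail = (t + 1) % QSIZE;          /* publish to the message processor */
        return 1;
    }

    /* MP side: pulls the next descriptor and performs the network transaction
     * (checking, translation, scheduling, transport, interpretation). */
    int dequeue(struct out_queue *q, struct msg_desc *d)
    {
        uint32_t h = q->head;
        if (h == q->tail)                   /* empty */
            return 0;
        *d = q->slots[h];
        q->head = (h + 1) % QSIZE;
        return 1;
    }

No system call appears anywhere on this path, which is the point; the cost moves to the extra bus crossings as the descriptor and data bounce between CPU, shared memory, MP, and NI.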
Slide 22: Example: Intel Paragon
[Diagram: Paragon node with i860 XP processors (50 MHz, 16 KB 4-way caches, 32 B blocks, MESI) and a message processor sharing a 64-bit, 400 MB/s memory bus; network interface with send and receive DMA engines; duplex network links connect compute nodes, I/O nodes, and service devices; network packets carry a route, MP handler, variable data (up to 2048 B), and an end-of-packet marker]
Slide 23: Dedicated MP with Specialized NI: Meiko CS-2
Integrate the message processor into the network interface
– Active Messages-like capability
– Dedicated threads for DMA, reply handling, simple remote memory access
– Supports user-level virtual DMA
  » Own page table
  » Can take a page fault, signal the OS, restart
  » Meanwhile, nack the other node
Problem: processor is slow, time-slices threads
– Fundamental issue with building your own CPU
Slide 24: (4) Shared Physical Address Space
Implement the shared address space model in hardware, without caching
– Actual caching must be done by copying from remote memory to local
– Programming paradigm looks more like message passing than Pthreads
  » Nevertheless, low-latency & low-overhead transfers thanks to HW interpretation; high bandwidth too if done right
  » Result: great platform for MPI & compiled data-parallel codes
Implementation (sketch below):
– "Pseudo-memory" acts as the memory controller for remote memory, converts accesses to network transactions (requests)
– "Pseudo-CPU" on the remote node receives requests, performs them on local memory, sends a reply
– Split-transaction or retry-capable bus required (or dual-ported memory)
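The request/reply protocol can be sketched as below. The message formats and the transport helpers (send_to_node, recv_from_node) are hypothetical, standing in for whatever the real hardware does; the point is only the division of labor between pseudo-memory on the requesting node and pseudo-CPU on the home node, all invisible to user code, which just issues loads and stores.

    #include <stdint.h>

    /* Hypothetical network transaction for one remote word access. */
    struct remote_req  { uint8_t op; uint64_t addr; uint64_t data; }; /* op: 0=load, 1=store */
    struct remote_resp { uint64_t data; };

    /* Assumed transport helpers provided by the (hypothetical) fabric. */
    void send_to_node(int node, const void *msg, int len);
    void recv_from_node(int node, void *msg, int len);

    /* Pseudo-memory (requesting node): turns a bus access to a remote physical
     * address into a network request, then waits for the reply. */
    uint64_t remote_load(int home_node, uint64_t addr)
    {
        struct remote_req  rq = { 0, addr, 0 };
        struct remote_resp rp;
        send_to_node(home_node, &rq, sizeof rq);
        recv_from_node(home_node, &rp, sizeof rp);  /* bus must tolerate the wait */
        return rp.data;
    }

    /* Pseudo-CPU (home node): services requests against local memory forever. */
    void pseudo_cpu_loop(uint64_t *local_mem, int requester)
    {
        for (;;) {
            struct remote_req rq;
            recv_from_node(requester, &rq, sizeof rq);
            if (rq.op == 0) {                        /* remote load  */
                struct remote_resp rp = { local_mem[rq.addr / 8] };
                send_to_node(requester, &rp, sizeof rp);
            } else {                                 /* remote store */
                local_mem[rq.addr / 8] = rq.data;
            }
        }
    }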
Slide 25: Example: Cray T3D
Up to 2,048 Alpha 21064 microprocessors
– No off-chip L2 cache, to avoid its inherent latency
In addition to remote memory ops, includes:
– Prefetch buffer (hide remote latency)
– DMA engine (requires OS trap)
– Synchronization operations (swap, fetch&inc, global AND/OR); see the sketch below
– Global synch operations (barrier, eureka)
– Message queue (requires OS trap on receiver)
Big problem: physical address space
– 21064 supports only 32 bits
– 2K-node machine limited to 2 MB per node
– External "DTB annex" provides segment-like registers for extended addressing, but management is expensive & ugly
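To show why a hardware fetch&inc is useful, here is a hedged sketch of its two classic uses, written with C11 atomics rather than T3D operations; on the T3D the same fetch-and-increment would be performed by the communication hardware on a remote memory location without involving the remote processor.

    #include <stdatomic.h>

    /* Shared counters; on a T3D-like machine these would live in some node's
     * memory and be updated by the remote atomic hardware. */
    static atomic_uint next_ticket = 0;   /* work distribution          */
    static atomic_uint arrived     = 0;   /* simple barrier-style count */

    /* Grab the next unit of work: each caller gets a unique index. */
    unsigned get_work(void)
    {
        return atomic_fetch_add(&next_ticket, 1);
    }

    /* Count arrivals; the last of nprocs processors sees the full count. */
    int all_arrived(unsigned nprocs)
    {
        return atomic_fetch_add(&arrived, 1) + 1 == nprocs;
    }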
Slide 26: Example: Cray T3E
Similar to the T3D, but uses the Alpha 21164 instead of the 21064 (on-chip L2)
– Still has physical address space problems
E-registers for remote communication and synchronization
– 512 user, 128 system; 64 bits each
– Replace/unify the DTB annex, prefetch queue, block transfer engine, remote load/store, and message queue
– Address specifies the source or destination E-register and a command
– Data contains a pointer to a block of 4 E-regs and an index for the centrifuge
Centrifuge
– Supports data distributions used in data-parallel languages (HPF)
– 4 E-regs for a global memory operation: mask, base, two arguments
Slide 27: Cray T3E (continued)
Atomic memory operations
– E-registers & centrifuge used
– Fetch&Increment, Fetch&Add, Compare&Swap, Masked_Swap
Messaging
– Arbitrary number of queues (user or system)
– 64-byte messages
– Create a message queue by storing a message control word to a memory location
Message send
– Construct data in an aligned block of 8 E-regs
– Send like a put, but the destination must be a message control word
– Processor is responsible for queue space (buffer management)
Barrier and eureka synchronization
DISCUSSION OF PAPER
Slide 28: (5) Cache-Coherent Shared Memory
Hardware management of caching shared memory
– Less burden on the programmer to manage caches
– But tough to do on very big systems (e.g., Cray T3E scale)
Cache-coherent systems are the focus of this course
Slide 29: Summary of Distributed Memory Machines
Convergence of architectures
– Everything "looks basically the same"
– Processor, cache, memory, communication assist
Communication assist
– Where is it? (I/O bus, memory bus, processor registers)
– What does it know?
  » Does it just move bytes, or does it perform some functions?
– Is it programmable?
– Does it run user code?
Network transaction
– Input & output buffering
– Action on remote node
Slide 30: Outline
System Organization Taxonomy
Distributed Memory Systems (w/o cache coherence)
Clusters and Networks of Workstations
Presentations of Research Papers
Slide 31: Clusters & Networks of Workstations (C&S 7.7)
Connect a bunch of commodity machines together
Options:
– Low-performance communication, high throughput: Condor-style job management
– High-performance communication: fast SAN and LAN technology
Nothing fundamentally different from what we've studied already
In general, though, the network interface is further from the processor (i.e., on the I/O bus)
– Less integration, more software support
Slide 32: Network of Workstations (NOW)
Network interface on I/O bus
Standards (e.g., PCI) => longer life, faster to market
Slow (microseconds) to access network interface
"System Area Network" (SAN): between LAN & MPP
[Diagram: node with processor, cache, and main memory on the core chip set; I/O bus with disk controller/disk, graphics controller, and a network interface (raising interrupts) that connects to the network]
Slide 33: Myricom Myrinet (Berkeley NOW)
Programmable network interface on the I/O bus (Sun SBus or PCI)
– Embedded custom CPU ("LANai", ~40 MHz RISC CPU)
– 256 KB SRAM
– 3 DMA engines: to network, from network, to/from host memory
Downloadable firmware executes in kernel mode
– Includes source-based routing protocol
SRAM pages can be mapped into user space
– Separate pages for separate processes
– Firmware can define status words, queues, etc.
  » Data for short messages or pointers for long ones
– Firmware can do address translation too, with OS help
– Poll to check for sends from user
Bottom line: I/O bus still the bottleneck, CPU could be faster
Slide 34: Example: DEC Memory Channel (Princeton SHRIMP)
Reflective memory (sketch below)
Writes on the sender appear in the receiver's memory
– Must specify send & receive regions (not arbitrary locations)
– Receive region is pinned in memory
Requires duplicate writes; really just message buffers
[Diagram: sender's virtual page maps to a physical transmit region that is reflected into a physical receive region mapped into the receiver's virtual address space]
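A hedged sketch of how reflective memory looks to software: the mapping calls below (mc_attach_tx, mc_attach_rx) are invented names, not the actual Memory Channel API. The point is that the sender's ordinary stores to a mapped transmit region reappear in a pinned receive region on the other node, and that data the sender also wants to read locally must be written twice.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical mapping calls: attach this process to a named reflective
     * region, returning a local pointer. Real systems expose their own API. */
    volatile uint32_t *mc_attach_tx(const char *region_name, size_t len);
    volatile uint32_t *mc_attach_rx(const char *region_name, size_t len);

    #define REGION_WORDS 1024

    void sender(uint32_t *local_copy)
    {
        volatile uint32_t *tx = mc_attach_tx("queue0", REGION_WORDS * 4);

        for (int i = 1; i < REGION_WORDS; i++) {
            local_copy[i] = i;   /* duplicate write #1: our own readable copy   */
            tx[i] = i;           /* duplicate write #2: reflected to receiver   */
        }
        tx[0] = 1;               /* flag word in the region: "message ready"    */
    }

    void receiver(void)
    {
        volatile uint32_t *rx = mc_attach_rx("queue0", REGION_WORDS * 4);

        while (rx[0] != 1)       /* poll the pinned receive region (assumed 0 initially) */
            ;
        /* message body is now visible in rx[1 .. REGION_WORDS-1] */
    }

In effect the mapped regions behave as message buffers rather than as general shared memory, which is why the slide calls this "really just message buffers."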
Slide 35: Outline
System Organization Taxonomy
Distributed Memory Systems (w/o cache coherence)
Clusters and Networks of Workstations
Presentations of Research Papers
Slide 36: Thinking Machines CM-5
PRESENTATION